* [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set
2012-08-08 21:45 [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Rik van Riel
@ 2012-08-08 21:47 ` Rik van Riel
2012-08-10 1:22 ` Ying Han
2012-08-08 21:48 ` [RFC][PATCH -mm 2/3] mm,vmscan: reclaim from highest score cgroups Rik van Riel
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2012-08-08 21:47 UTC (permalink / raw)
To: linux-mm; +Cc: yinghan, hannes, mhocko, Mel Gorman
Keep track of the recent amount of pressure applied to each LRU list.
This statistic is incremented simultaneously with ->recent_scanned,
but it is aged in a different way. Recent_scanned and recent_rotated
are aged locally for each list, to estimate the fraction of objects
on each list that are in active use.
The recent_pressure statistic is aged globally for all lists. We
can use this to figure out which LRUs we should reclaim from.
Because this figure is only used at reclaim time, we can lazily
age it whenever we consider an lruvec for reclaiming.
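As a rough illustration with made-up numbers: suppose an lruvec last aged
its recent_pressure at sequence 10, its file list holds 10000 pages with
recent_pressure[1] = 4000, and the global sequence has meanwhile moved on
to 13 because other lruvecs were being scanned. The next time reclaim looks
at this lruvec it catches up with shift = 13 - 10 = 3, so recent_pressure[1]
becomes 4000 >> 3 = 500. An lruvec that has escaped pressure for a while
thus ends up with a small pressure figure, which makes it a more attractive
reclaim target in the following patches.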
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/mmzone.h | 10 ++++++++-
mm/memcontrol.c | 5 ++++
mm/swap.c | 1 +
mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 66 insertions(+), 1 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f222e06..b03be69 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -189,12 +189,20 @@ struct zone_reclaim_stat {
* The pageout code in vmscan.c keeps track of how many of the
* mem/swap backed and file backed pages are referenced.
* The higher the rotated/scanned ratio, the more valuable
- * that cache is.
+ * that cache is. These numbers are aged separately for each LRU.
*
* The anon LRU stats live in [0], file LRU stats in [1]
*/
unsigned long recent_rotated[2];
unsigned long recent_scanned[2];
+ /*
+ * This number is incremented together with recent_rotated,
+ * but is aged simultaneously for all LRUs. This allows the
+ * system to determine which LRUs have already been scanned
+ * enough, and which should be scanned next.
+ */
+ unsigned long recent_pressure[2];
+ unsigned long recent_pressure_seq;
};
struct lruvec {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d906b43..a18a0d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3852,6 +3852,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
struct zone_reclaim_stat *rstat;
unsigned long recent_rotated[2] = {0, 0};
unsigned long recent_scanned[2] = {0, 0};
+ unsigned long recent_pressure[2] = {0, 0};
for_each_online_node(nid)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -3862,11 +3863,15 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
recent_rotated[1] += rstat->recent_rotated[1];
recent_scanned[0] += rstat->recent_scanned[0];
recent_scanned[1] += rstat->recent_scanned[1];
+ recent_pressure[0] += rstat->recent_pressure[0];
+ recent_pressure[1] += rstat->recent_pressure[1];
}
seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+ seq_printf(m, "recent_pressure_anon %lu\n", recent_pressure[0]);
+ seq_printf(m, "recent_pressure_file %lu\n", recent_pressure[1]);
}
#endif
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..0cca972 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -316,6 +316,7 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
reclaim_stat->recent_scanned[file]++;
+ reclaim_stat->recent_pressure[file]++;
if (rotated)
reclaim_stat->recent_rotated[file]++;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a779b03..b0e5495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1282,6 +1282,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
spin_lock_irq(&zone->lru_lock);
reclaim_stat->recent_scanned[file] += nr_taken;
+ reclaim_stat->recent_pressure[file] += nr_taken;
if (global_reclaim(sc)) {
if (current_is_kswapd())
@@ -1426,6 +1427,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
zone->pages_scanned += nr_scanned;
reclaim_stat->recent_scanned[file] += nr_taken;
+ reclaim_stat->recent_pressure[file] += nr_taken;
__count_zone_vm_events(PGREFILL, zone, nr_scanned);
__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
@@ -1852,6 +1854,53 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
throttle_vm_writeout(sc->gfp_mask);
}
+/*
+ * Ensure that the ->recent_pressure statistics for this lruvec are
+ * aged to the same degree as those elsewhere in the system, before
+ * we do reclaim on this lruvec or evaluate its reclaim priority.
+ */
+static DEFINE_SPINLOCK(recent_pressure_lock);
+static int recent_pressure_seq;
+static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
+{
+ struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ unsigned long anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
+ get_lru_size(lruvec, LRU_INACTIVE_ANON);
+ unsigned long file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
+ get_lru_size(lruvec, LRU_INACTIVE_FILE);
+ int shift;
+
+ /*
+ * Do not bother recalculating unless we are behind with the
+ * system wide statistics, or our local recent_pressure numbers
+ * have grown too large. We have to keep the number somewhat
+ * small, to ensure that reclaim_score returns non-zero.
+ */
+ if (reclaim_stat->recent_pressure_seq != recent_pressure_seq &&
+ reclaim_stat->recent_pressure[0] < anon / 4 &&
+ reclaim_stat->recent_pressure[1] < file / 4)
+ return;
+
+ spin_lock(&recent_pressure_lock);
+ /*
+ * If we are aging due to local activity, increment the global
+ * sequence counter. Leave the global counter alone if we are
+ * merely playing catchup.
+ */
+ if (reclaim_stat->recent_pressure_seq == recent_pressure_seq)
+ recent_pressure_seq++;
+ shift = recent_pressure_seq - reclaim_stat->recent_pressure_seq;
+ shift = min(shift, (BITS_PER_LONG-1));
+ reclaim_stat->recent_pressure_seq = recent_pressure_seq;
+ spin_unlock(&recent_pressure_lock);
+
+ /* For every aging interval, do one division by two. */
+ spin_lock_irq(&zone->lru_lock);
+ reclaim_stat->recent_pressure[0] >>= shift;
+ reclaim_stat->recent_pressure[1] >>= shift;
+ spin_unlock_irq(&zone->lru_lock);
+}
+
static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1869,6 +1918,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
do {
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+ age_recent_pressure(lruvec, zone);
+
/*
* Reclaim from mem_cgroup if any of these conditions are met:
* - this is a targetted reclaim ( not global reclaim)
--
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set
2012-08-08 21:47 ` [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set Rik van Riel
@ 2012-08-10 1:22 ` Ying Han
2012-08-10 1:23 ` Ying Han
2012-08-10 15:48 ` Rik van Riel
0 siblings, 2 replies; 13+ messages in thread
From: Ying Han @ 2012-08-10 1:22 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, hannes, mhocko, Mel Gorman
On Wed, Aug 8, 2012 at 2:47 PM, Rik van Riel <riel@redhat.com> wrote:
> Keep track of the recent amount of pressure applied to each LRU list.
>
> This statistic is incremented simultaneously with ->recent_scanned,
> but it is aged in a different way. Recent_scanned and recent_rotated
> are aged locally for each list, to estimate the fraction of objects
> on each list that are in active use.
>
> The recent_pressure statistic is aged globally for all lists. We
> can use this to figure out which LRUs we should reclaim from.
> Because this figure is only used at reclaim time, we can lazily
> age it whenever we consider an lruvec for reclaiming.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/mmzone.h | 10 ++++++++-
> mm/memcontrol.c | 5 ++++
> mm/swap.c | 1 +
> mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 66 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f222e06..b03be69 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -189,12 +189,20 @@ struct zone_reclaim_stat {
> * The pageout code in vmscan.c keeps track of how many of the
> * mem/swap backed and file backed pages are referenced.
> * The higher the rotated/scanned ratio, the more valuable
> - * that cache is.
> + * that cache is. These numbers are aged separately for each LRU.
> *
> * The anon LRU stats live in [0], file LRU stats in [1]
> */
> unsigned long recent_rotated[2];
> unsigned long recent_scanned[2];
> + /*
> + * This number is incremented together with recent_rotated,
s/recent_rotated/recent_scanned/
I assume the idea here is to associate the scan count with the amount of
pressure applied to the list.
> + * but is aged simultaneously for all LRUs. This allows the
> + * system to determine which LRUs have already been scanned
> + * enough, and which should be scanned next.
> + */
> + unsigned long recent_pressure[2];
> + unsigned long recent_pressure_seq;
> };
>
> struct lruvec {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d906b43..a18a0d5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3852,6 +3852,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
> struct zone_reclaim_stat *rstat;
> unsigned long recent_rotated[2] = {0, 0};
> unsigned long recent_scanned[2] = {0, 0};
> + unsigned long recent_pressure[2] = {0, 0};
>
> for_each_online_node(nid)
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> @@ -3862,11 +3863,15 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
> recent_rotated[1] += rstat->recent_rotated[1];
> recent_scanned[0] += rstat->recent_scanned[0];
> recent_scanned[1] += rstat->recent_scanned[1];
> + recent_pressure[0] += rstat->recent_pressure[0];
> + recent_pressure[1] += rstat->recent_pressure[1];
> }
> seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
> seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
> seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
> seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
> + seq_printf(m, "recent_pressure_anon %lu\n", recent_pressure[0]);
> + seq_printf(m, "recent_pressure_file %lu\n", recent_pressure[1]);
> }
> #endif
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 4e7e2ec..0cca972 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -316,6 +316,7 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> reclaim_stat->recent_scanned[file]++;
> + reclaim_stat->recent_pressure[file]++;
> if (rotated)
> reclaim_stat->recent_rotated[file]++;
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a779b03..b0e5495 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1282,6 +1282,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> spin_lock_irq(&zone->lru_lock);
>
> reclaim_stat->recent_scanned[file] += nr_taken;
> + reclaim_stat->recent_pressure[file] += nr_taken;
>
> if (global_reclaim(sc)) {
> if (current_is_kswapd())
> @@ -1426,6 +1427,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> zone->pages_scanned += nr_scanned;
>
> reclaim_stat->recent_scanned[file] += nr_taken;
> + reclaim_stat->recent_pressure[file] += nr_taken;
>
> __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
> @@ -1852,6 +1854,53 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> throttle_vm_writeout(sc->gfp_mask);
> }
>
> +/*
> + * Ensure that the ->recent_pressure statistics for this lruvec are
> + * aged to the same degree as those elsewhere in the system, before
> + * we do reclaim on this lruvec or evaluate its reclaim priority.
> + */
> +static DEFINE_SPINLOCK(recent_pressure_lock);
> +static int recent_pressure_seq;
> +static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
> +{
> + struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + unsigned long anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> + get_lru_size(lruvec, LRU_INACTIVE_ANON);
> + unsigned long file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
> + get_lru_size(lruvec, LRU_INACTIVE_FILE);
> + int shift;
> +
> + /*
> + * Do not bother recalculating unless we are behind with the
> + * system wide statistics, or our local recent_pressure numbers
> + * have grown too large. We have to keep the number somewhat
> + * small, to ensure that reclaim_score returns non-zero.
> + */
> + if (reclaim_stat->recent_pressure_seq != recent_pressure_seq &&
> + reclaim_stat->recent_pressure[0] < anon / 4 &&
> + reclaim_stat->recent_pressure[1] < file / 4)
> + return;
Let's see if I understand the logic here:
If updating the reclaim_stat->recent_pressure for this lruvec is
falling behind, don't bother to update it unless the scan count grows
fast enough. When that happens, recent_pressure is adjusted based on
the gap between the global pressure level (recent_pressure_seq) and
local pressure level (reclaim_stat->recent_pressure_seq). The larger
the gap, the more pressure is applied to the lruvec.
1. if the usage activity(scan_count) is always low on a lruvec, the
pressure will be low.
2. if the usage activity is low for a while, and then when the
scan_count jumps suddenly, it will cause the pressure to jump as well
3. if the usage activity is always high, the pressure will be high .
So, the mechanism here is a way to balance the system pressure across
lruvec over time?
> +
> + spin_lock(&recent_pressure_lock);
> + /*
> + * If we are aging due to local activity, increment the global
> + * sequence counter. Leave the global counter alone if we are
> + * merely playing catchup.
> + */
> + if (reclaim_stat->recent_pressure_seq == recent_pressure_seq)
> + recent_pressure_seq++;
> + shift = recent_pressure_seq - reclaim_stat->recent_pressure_seq;
> + shift = min(shift, (BITS_PER_LONG-1));
> + reclaim_stat->recent_pressure_seq = recent_pressure_seq;
> + spin_unlock(&recent_pressure_lock);
> +
> + /* For every aging interval, do one division by two. */
> + spin_lock_irq(&zone->lru_lock);
> + reclaim_stat->recent_pressure[0] >>= shift;
> + reclaim_stat->recent_pressure[1] >>= shift;
This is a bit confusing. I would assume the bigger the shift, the less
pressure it causes. However, the end result is the other way around.
--Ying
> + spin_unlock_irq(&zone->lru_lock);
> +}
> +
> static void shrink_zone(struct zone *zone, struct scan_control *sc)
> {
> struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -1869,6 +1918,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> do {
> struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>
> + age_recent_pressure(lruvec, zone);
> +
> /*
> * Reclaim from mem_cgroup if any of these conditions are met:
> * - this is a targetted reclaim ( not global reclaim)
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set
2012-08-10 1:22 ` Ying Han
@ 2012-08-10 1:23 ` Ying Han
2012-08-10 15:48 ` Rik van Riel
1 sibling, 0 replies; 13+ messages in thread
From: Ying Han @ 2012-08-10 1:23 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, hannes, mhocko, Mel Gorman
fix-up Mel's email
--Ying
On Thu, Aug 9, 2012 at 6:22 PM, Ying Han <yinghan@google.com> wrote:
> On Wed, Aug 8, 2012 at 2:47 PM, Rik van Riel <riel@redhat.com> wrote:
>> Keep track of the recent amount of pressure applied to each LRU list.
>>
>> This statistic is incremented simultaneously with ->recent_scanned,
>> but it is aged in a different way. Recent_scanned and recent_rotated
>> are aged locally for each list, to estimate the fraction of objects
>> on each list that are in active use.
>>
>> The recent_pressure statistic is aged globally for all lists. We
>> can use this to figure out which LRUs we should reclaim from.
>> Because this figure is only used at reclaim time, we can lazily
>> age it whenever we consider an lruvec for reclaiming.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>> include/linux/mmzone.h | 10 ++++++++-
>> mm/memcontrol.c | 5 ++++
>> mm/swap.c | 1 +
>> mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 66 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f222e06..b03be69 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -189,12 +189,20 @@ struct zone_reclaim_stat {
>> * The pageout code in vmscan.c keeps track of how many of the
>> * mem/swap backed and file backed pages are referenced.
>> * The higher the rotated/scanned ratio, the more valuable
>> - * that cache is.
>> + * that cache is. These numbers are aged separately for each LRU.
>> *
>> * The anon LRU stats live in [0], file LRU stats in [1]
>> */
>> unsigned long recent_rotated[2];
>> unsigned long recent_scanned[2];
>> + /*
>> + * This number is incremented together with recent_rotated,
>
> s/recent_rotated/recent_scanned/
>
> I assume the idea here is to associate the scan count with the amount of
> pressure applied to the list.
>
>> + * but is aged simultaneously for all LRUs. This allows the
>> + * system to determine which LRUs have already been scanned
>> + * enough, and which should be scanned next.
>> + */
>> + unsigned long recent_pressure[2];
>> + unsigned long recent_pressure_seq;
>> };
>>
>> struct lruvec {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d906b43..a18a0d5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3852,6 +3852,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>> struct zone_reclaim_stat *rstat;
>> unsigned long recent_rotated[2] = {0, 0};
>> unsigned long recent_scanned[2] = {0, 0};
>> + unsigned long recent_pressure[2] = {0, 0};
>>
>> for_each_online_node(nid)
>> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>> @@ -3862,11 +3863,15 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>> recent_rotated[1] += rstat->recent_rotated[1];
>> recent_scanned[0] += rstat->recent_scanned[0];
>> recent_scanned[1] += rstat->recent_scanned[1];
>> + recent_pressure[0] += rstat->recent_pressure[0];
>> + recent_pressure[1] += rstat->recent_pressure[1];
>> }
>> seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
>> seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
>> seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
>> seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
>> + seq_printf(m, "recent_pressure_anon %lu\n", recent_pressure[0]);
>> + seq_printf(m, "recent_pressure_file %lu\n", recent_pressure[1]);
>> }
>> #endif
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4e7e2ec..0cca972 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -316,6 +316,7 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
>> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>>
>> reclaim_stat->recent_scanned[file]++;
>> + reclaim_stat->recent_pressure[file]++;
>> if (rotated)
>> reclaim_stat->recent_rotated[file]++;
>> }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a779b03..b0e5495 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1282,6 +1282,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>> spin_lock_irq(&zone->lru_lock);
>>
>> reclaim_stat->recent_scanned[file] += nr_taken;
>> + reclaim_stat->recent_pressure[file] += nr_taken;
>>
>> if (global_reclaim(sc)) {
>> if (current_is_kswapd())
>> @@ -1426,6 +1427,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>> zone->pages_scanned += nr_scanned;
>>
>> reclaim_stat->recent_scanned[file] += nr_taken;
>> + reclaim_stat->recent_pressure[file] += nr_taken;
>>
>> __count_zone_vm_events(PGREFILL, zone, nr_scanned);
>> __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
>> @@ -1852,6 +1854,53 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> throttle_vm_writeout(sc->gfp_mask);
>> }
>>
>> +/*
>> + * Ensure that the ->recent_pressure statistics for this lruvec are
>> + * aged to the same degree as those elsewhere in the system, before
>> + * we do reclaim on this lruvec or evaluate its reclaim priority.
>> + */
>> +static DEFINE_SPINLOCK(recent_pressure_lock);
>> +static int recent_pressure_seq;
>> +static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
>> +{
>> + struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>> + unsigned long anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>> + get_lru_size(lruvec, LRU_INACTIVE_ANON);
>> + unsigned long file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
>> + get_lru_size(lruvec, LRU_INACTIVE_FILE);
>> + int shift;
>> +
>> + /*
>> + * Do not bother recalculating unless we are behind with the
>> + * system wide statistics, or our local recent_pressure numbers
>> + * have grown too large. We have to keep the number somewhat
>> + * small, to ensure that reclaim_score returns non-zero.
>> + */
>> + if (reclaim_stat->recent_pressure_seq != recent_pressure_seq &&
>> + reclaim_stat->recent_pressure[0] < anon / 4 &&
>> + reclaim_stat->recent_pressure[1] < file / 4)
>> + return;
>
> Let's see if I understand the logic here:
>
> If updating the reclaim_stat->recent_pressure for this lruvec is
> falling behind, don't bother to update it unless the scan count grows
> fast enough. When that happens, recent_pressure is adjusted based on
> the gap between the global pressure level (recent_pressure_seq) and
> local pressure level (reclaim_stat->recent_pressure_seq). The larger
> the gap, the more pressure is applied to the lruvec.
>
> 1. if the usage activity(scan_count) is always low on a lruvec, the
> pressure will be low.
> 2. if the usage activity is low for a while, and then when the
> scan_count jumps suddenly, it will cause the pressure to jump as well
> 3. if the usage activity is always high, the pressure will be high .
>
> So, the mechanism here is a way to balance the system pressure across
> lruvec over time?
>
>> +
>> + spin_lock(&recent_pressure_lock);
>> + /*
>> + * If we are aging due to local activity, increment the global
>> + * sequence counter. Leave the global counter alone if we are
>> + * merely playing catchup.
>> + */
>> + if (reclaim_stat->recent_pressure_seq == recent_pressure_seq)
>> + recent_pressure_seq++;
>> + shift = recent_pressure_seq - reclaim_stat->recent_pressure_seq;
>> + shift = min(shift, (BITS_PER_LONG-1));
>> + reclaim_stat->recent_pressure_seq = recent_pressure_seq;
>> + spin_unlock(&recent_pressure_lock);
>> +
>> + /* For every aging interval, do one division by two. */
>> + spin_lock_irq(&zone->lru_lock);
>> + reclaim_stat->recent_pressure[0] >>= shift;
>> + reclaim_stat->recent_pressure[1] >>= shift;
>
> This is a bit confusing. I would assume the bigger the shift, the less
> pressure it causes. However, the end result is the other way around.
>
> --Ying
>
>> + spin_unlock_irq(&zone->lru_lock);
>> +}
>> +
>> static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> {
>> struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -1869,6 +1918,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> do {
>> struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>>
>> + age_recent_pressure(lruvec, zone);
>> +
>> /*
>> * Reclaim from mem_cgroup if any of these conditions are met:
>> * - this is a targetted reclaim ( not global reclaim)
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set
2012-08-10 1:22 ` Ying Han
2012-08-10 1:23 ` Ying Han
@ 2012-08-10 15:48 ` Rik van Riel
1 sibling, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2012-08-10 15:48 UTC (permalink / raw)
To: Ying Han; +Cc: linux-mm, hannes, mhocko, Mel Gorman
On 08/09/2012 09:22 PM, Ying Han wrote:
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f222e06..b03be69 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -189,12 +189,20 @@ struct zone_reclaim_stat {
>> * The pageout code in vmscan.c keeps track of how many of the
>> * mem/swap backed and file backed pages are referenced.
>> * The higher the rotated/scanned ratio, the more valuable
>> - * that cache is.
>> + * that cache is. These numbers are aged separately for each LRU.
>> *
>> * The anon LRU stats live in [0], file LRU stats in [1]
>> */
>> unsigned long recent_rotated[2];
>> unsigned long recent_scanned[2];
>> + /*
>> + * This number is incremented together with recent_rotated,
>
> s/recent_rotated/recent_scanned/
>
> I assume the idea here is to associate the scan count with the amount of
> pressure applied to the list.
Indeed. Pageout scanning equals pressure :)
>> +/*
>> + * Ensure that the ->recent_pressure statistics for this lruvec are
>> + * aged to the same degree as those elsewhere in the system, before
>> + * we do reclaim on this lruvec or evaluate its reclaim priority.
>> + */
>> +static DEFINE_SPINLOCK(recent_pressure_lock);
>> +static int recent_pressure_seq;
>> +static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
>> +{
>> + struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>> + unsigned long anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>> + get_lru_size(lruvec, LRU_INACTIVE_ANON);
>> + unsigned long file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
>> + get_lru_size(lruvec, LRU_INACTIVE_FILE);
>> + int shift;
>> +
>> + /*
>> + * Do not bother recalculating unless we are behind with the
>> + * system wide statistics, or our local recent_pressure numbers
>> + * have grown too large. We have to keep the number somewhat
>> + * small, to ensure that reclaim_score returns non-zero.
>> + */
>> + if (reclaim_stat->recent_pressure_seq != recent_pressure_seq &&
>> + reclaim_stat->recent_pressure[0] < anon / 4 &&
>> + reclaim_stat->recent_pressure[1] < file / 4)
>> + return;
>
> Let's see if I understand the logic here:
>
> If updating the reclaim_stat->recent_pressure for this lruvec is
> falling behind, don't bother to update it unless the scan count grows
> fast enough. When that happens, recent_pressure is adjusted based on
> the gap between the global pressure level (recent_pressure_seq) and
> local pressure level (reclaim_stat->recent_pressure_seq). The larger
> the gap, the more pressure is applied to the lruvec.
>
> 1. if the usage activity(scan_count) is always low on a lruvec, the
> pressure will be low.
The scan count being low indicates that the lruvec has seen little
pageout pressure recently.
This is essentially unrelated to how actively programs are using
(touching) the pages that are sitting on the lists in this lruvec.
> 2. if the usage activity is low for a while, and then when the
> scan_count jumps suddenly, it will cause the pressure to jump as well
> 3. if the usage activity is always high, the pressure will be high .
>
> So, the mechanism here is a way to balance the system pressure across
> lruvec over time?
Yes.
>> +
>> + spin_lock(&recent_pressure_lock);
>> + /*
>> + * If we are aging due to local activity, increment the global
>> + * sequence counter. Leave the global counter alone if we are
>> + * merely playing catchup.
>> + */
>> + if (reclaim_stat->recent_pressure_seq == recent_pressure_seq)
>> + recent_pressure_seq++;
>> + shift = recent_pressure_seq - reclaim_stat->recent_pressure_seq;
>> + shift = min(shift, (BITS_PER_LONG-1));
>> + reclaim_stat->recent_pressure_seq = recent_pressure_seq;
>> + spin_unlock(&recent_pressure_lock);
>> +
>> + /* For every aging interval, do one division by two. */
>> + spin_lock_irq(&zone->lru_lock);
>> + reclaim_stat->recent_pressure[0] >>= shift;
>> + reclaim_stat->recent_pressure[1] >>= shift;
>
> This is a bit confusing. I would assume the bigger the shift, the less
> pressure it causes. However, the end result is the other way around.
The longer it has been since this lruvec was last scanned by
the page reclaim code, the more its pressure is aged.
The less pressure an lruvec has seen recently, the more
likely it is that it will be scanned in the future (see patch 2/3).
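To make that concrete with invented numbers, using the reclaim_score()
helper from patch 2/3: an lruvec with 100000 pages and a scanned/rotated
ratio of about 2 scores roughly 100000 * 2 / 50 = 4000 when its
recent_pressure is around 50, but only about 250 when its recent_pressure
is still up at 800. The lruvec that has been left alone the longest wins.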
Is there anything that I could explain better?
--
All rights reversed
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC][PATCH -mm 2/3] mm,vmscan: reclaim from highest score cgroups
2012-08-08 21:45 [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Rik van Riel
2012-08-08 21:47 ` [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set Rik van Riel
@ 2012-08-08 21:48 ` Rik van Riel
2012-08-14 19:19 ` Rafael Aquini
2012-08-08 21:49 ` [RFC][PATCH -mm 3/3] mm,vmscan: evict inactive file pages first Rik van Riel
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2012-08-08 21:48 UTC (permalink / raw)
To: linux-mm; +Cc: yinghan, hannes, mhocko, Mel Gorman
Instead of doing a round robin reclaim over all the cgroups in a
zone, we pick the lruvec with the top score and reclaim from that.
We keep reclaiming from that lruvec until we have reclaimed enough
pages (common for direct reclaim), or that lruvec's score drops in
half. We keep reclaiming from the zone until we have reclaimed enough
pages, or have scanned more than the number of reclaimable pages shifted
by the reclaim priority.
As an additional change, targeted cgroup reclaim now reclaims from
the highest priority lruvec. This is because when a cgroup hierarchy
hits its limit, the best lruvec to reclaim from may be different than
whatever lruvec is the first we run into iterating from the hierarchy's
"root".
Signed-off-by: Rik van Riel <riel@redhat.com>
---
mm/vmscan.c | 118 ++++++++++++++++++++++++++++++++++++++++------------------
1 files changed, 81 insertions(+), 37 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0e5495..1a9688b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,6 +1901,57 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
spin_unlock_irq(&zone->lru_lock);
}
+/*
+ * The higher the LRU score, the more desirable it is to reclaim
+ * from this LRU set first. The score is a function of the fraction
+ * of recently scanned pages on the LRU that are in active use,
+ * as well as the size of the list and the amount of memory pressure
+ * that has been put on this LRU recently.
+ *
+ *          recent_scanned        size
+ * score = -------------- x --------------- x adjustment
+ *          recent_rotated   recent_pressure
+ *
+ * The maximum score of the anon and file list in this lruvec
+ * is returned. Adjustments are made for the file LRU having
+ * lots of inactive pages (mostly streaming IO), or the memcg
+ * being over its soft limit.
+ *
+ * This function should return a positive number for any lruvec
+ * with more than a handful of resident pages, because recent_scanned
+ * should always be larger than recent_rotated, and the size should
+ * always be larger than recent_pressure.
+ */
+static u64 reclaim_score(struct mem_cgroup *memcg,
+ struct lruvec *lruvec)
+{
+ struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+ u64 anon, file;
+
+ anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
+ get_lru_size(lruvec, LRU_INACTIVE_ANON);
+ anon *= reclaim_stat->recent_scanned[0];
+ anon /= (reclaim_stat->recent_rotated[0] + 1);
+ anon /= (reclaim_stat->recent_pressure[0] + 1);
+
+ file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
+ get_lru_size(lruvec, LRU_INACTIVE_FILE);
+ file *= reclaim_stat->recent_scanned[1];
+ file /= (reclaim_stat->recent_rotated[1] + 1);
+ file /= (reclaim_stat->recent_pressure[1] + 1);
+
+ /*
+ * Give a STRONG preference to reclaiming memory from lruvecs
+ * that belong to a cgroup that is over its soft limit.
+ */
+ if (mem_cgroup_over_soft_limit(memcg)) {
+ file *= 10000;
+ anon *= 10000;
+ }
+
+ return max(anon, file);
+}
+
static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1908,11 +1959,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
.zone = zone,
.priority = sc->priority,
};
+ unsigned long nr_scanned = sc->nr_scanned;
struct mem_cgroup *memcg;
- bool over_softlimit, ignore_softlimit = false;
+ struct lruvec *victim;
+ u64 score, max_score;
restart:
- over_softlimit = false;
+ max_score = 0;
+ victim = NULL;
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
@@ -1920,48 +1974,38 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
age_recent_pressure(lruvec, zone);
- /*
- * Reclaim from mem_cgroup if any of these conditions are met:
- * - this is a targetted reclaim ( not global reclaim)
- * - reclaim priority is less than DEF_PRIORITY - 2
- * - mem_cgroup or its ancestor ( not including root cgroup)
- * exceeds its soft limit
- *
- * Note: The priority check is a balance of how hard to
- * preserve the pages under softlimit. If the memcgs of the
- * zone having trouble to reclaim pages above their softlimit,
- * we have to reclaim under softlimit instead of burning more
- * cpu cycles.
- */
- if (ignore_softlimit || !global_reclaim(sc) ||
- sc->priority < DEF_PRIORITY - 2 ||
- mem_cgroup_over_soft_limit(memcg)) {
- shrink_lruvec(lruvec, sc);
+ score = reclaim_score(memcg, lruvec);
- over_softlimit = true;
+ /* Pick the lruvec with the highest score. */
+ if (score > max_score) {
+ max_score = score;
+ victim = lruvec;
}
- /*
- * Limit reclaim has historically picked one memcg and
- * scanned it with decreasing priority levels until
- * nr_to_reclaim had been reclaimed. This priority
- * cycle is thus over after a single memcg.
- *
- * Direct reclaim and kswapd, on the other hand, have
- * to scan all memory cgroups to fulfill the overall
- * scan target for the zone.
- */
- if (!global_reclaim(sc)) {
- mem_cgroup_iter_break(root, memcg);
- break;
- }
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
- if (!over_softlimit) {
- ignore_softlimit = true;
+ /* No lruvec in our set is suitable for reclaiming. */
+ if (!victim)
+ return;
+
+ /*
+ * Reclaim from the top scoring lruvec until we freed enough
+ * pages, or its reclaim priority has halved.
+ */
+ do {
+ shrink_lruvec(victim, sc);
+ score = reclaim_score(memcg, victim);
+ } while (sc->nr_to_reclaim > 0 && score > max_score / 2);
+
+ /*
+ * Do we need to reclaim more pages?
+ * Did we scan fewer pages than the current priority allows?
+ */
+ if (sc->nr_to_reclaim > 0 &&
+ sc->nr_scanned + nr_scanned <
+ zone_reclaimable_pages(zone) >> sc->priority)
goto restart;
- }
}
/* Returns true if compaction should go ahead for a high-order request */
--
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 2/3] mm,vmscan: reclaim from highest score cgroups
2012-08-08 21:48 ` [RFC][PATCH -mm 2/3] mm,vmscan: reclaim from highest score cgroups Rik van Riel
@ 2012-08-14 19:19 ` Rafael Aquini
0 siblings, 0 replies; 13+ messages in thread
From: Rafael Aquini @ 2012-08-14 19:19 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, yinghan, hannes, mhocko, Mel Gorman
On Wed, Aug 08, 2012 at 05:48:28PM -0400, Rik van Riel wrote:
> Instead of doing a round robin reclaim over all the cgroups in a
> zone, we pick the lruvec with the top score and reclaim from that.
>
> We keep reclaiming from that lruvec until we have reclaimed enough
> pages (common for direct reclaim), or that lruvec's score drops in
> half. We keep reclaiming from the zone until we have reclaimed enough
> pages, or have scanned more than the number of reclaimable pages shifted
> by the reclaim priority.
>
> As an additional change, targeted cgroup reclaim now reclaims from
> the highest priority lruvec. This is because when a cgroup hierarchy
> hits its limit, the best lruvec to reclaim from may be different than
> whatever lruvec is the first we run into iterating from the hierarchy's
> "root".
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
After fixing the spinlock lockup in patch 03, I started to see kswapd going into
an infinite loop around shrink_zone() that dumps the following softlockup report:
---8<---
BUG: soft lockup - CPU#0 stuck for 22s! [kswapd0:29]
Modules linked in: lockd ip6t_REJECT nf_conntrack_ipv6 nf_conntrack_ipv4
nf_defrag_ipv6 nf_defrag_ipv4 xt_stac
irq event stamp: 16668492
hardirqs last enabled at (16668491): [<ffffffff8169d230>]
_raw_spin_unlock_irq+0x30/0x50
hardirqs last disabled at (16668492): [<ffffffff816a6e2a>]
apic_timer_interrupt+0x6a/0x80
softirqs last enabled at (16668258): [<ffffffff8106db4c>]
__do_softirq+0x18c/0x3f0
softirqs last disabled at (16668253): [<ffffffff816a75bc>]
call_softirq+0x1c/0x30
CPU 0
Pid: 29, comm: kswapd0 Not tainted 3.6.0-rc1+ #198 Bochs Bochs
RIP: 0010:[<ffffffff8111168d>] [<ffffffff8111168d>] rcu_is_cpu_idle+0x2d/0x40
RSP: 0018:ffff880002b99af0 EFLAGS: 00000286
RAX: ffff88003fc0efa0 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff880002b98000 RSI: ffffffff81c32420 RDI: 0000000000000246
RBP: ffff880002b99af0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000013 R11: 0000000000000001 R12: 0000000000000013
R13: 0000000000000001 R14: ffffffff8169d630 R15: ffffffff811130ef
FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fde6b5c2d10 CR3: 0000000001c0b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 29, threadinfo ffff880002b98000, task ffff880002a6a040)
Stack:
ffff880002b99b80 ffffffff81323fa4 ffff880002a6a6f8 ffff880000000000
ffff880002b99bac 0000000000000000 ffff88003d9d0938 ffff880002a6a040
ffff880002b99b60 0000000000000246 0000000000000000 ffff880002a6a040
Call Trace:
[<ffffffff81323fa4>] idr_get_next+0xd4/0x1a0
[<ffffffff810eb727>] css_get_next+0x87/0x1b0
[<ffffffff811b1c56>] mem_cgroup_iter+0x146/0x330
[<ffffffff811b1bfc>] ? mem_cgroup_iter+0xec/0x330
[<ffffffff8116865f>] shrink_zone+0x11f/0x2a0
[<ffffffff8116991b>] kswapd+0x85b/0xf60
[<ffffffff8108e4f0>] ? wake_up_bit+0x40/0x40
[<ffffffff811690c0>] ? zone_reclaim+0x420/0x420
[<ffffffff8108dd8e>] kthread+0xbe/0xd0
[<ffffffff816a74c4>] kernel_thread_helper+0x4/0x10
[<ffffffff8169d630>] ? retint_restore_args+0x13/0x13
[<ffffffff8108dcd0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816a74c0>] ? gs_change+0x13/0x13
---8<---
I've applied your suggested fix (below) on top of this patch
and, so far, things are going fine.
Will keep you posted on new developments, though :)
---8<---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f6e124..8cb1bbf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2029,12 +2029,14 @@ restart:
score = reclaim_score(memcg, victim, sc);
} while (sc->nr_to_reclaim > 0 && score > max_score / 2);
+ if (!(sc->nr_scanned - nr_scanned))
+ return;
/*
* Do we need to reclaim more pages?
* Did we scan fewer pages than the current priority allows?
*/
if (sc->nr_to_reclaim > 0 &&
- sc->nr_scanned + nr_scanned <
+ sc->nr_scanned - nr_scanned <
zone_reclaimable_pages(zone) >> sc->priority)
goto restart;
}
--
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC][PATCH -mm 3/3] mm,vmscan: evict inactive file pages first
2012-08-08 21:45 [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Rik van Riel
2012-08-08 21:47 ` [RFC][PATCH -mm 1/3] mm,vmscan: track recent pressure on each LRU set Rik van Riel
2012-08-08 21:48 ` [RFC][PATCH -mm 2/3] mm,vmscan: reclaim from highest score cgroups Rik van Riel
@ 2012-08-08 21:49 ` Rik van Riel
2012-08-12 23:56 ` Rafael Aquini
2012-08-09 1:02 ` [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Ying Han
2012-08-09 22:23 ` Ying Han
4 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2012-08-08 21:49 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, yinghan, hannes, mhocko, Mel Gorman
When a lot of streaming file IO is happening, it makes sense to
evict just the inactive file pages and leave the other LRU lists
alone.
Likewise, when driving a cgroup hierarchy into its hard limit,
or over its soft limit, it makes sense to pick a child cgroup
that has lots of inactive file pages, and evict those first.
Being over its soft limit is considered a stronger preference
than just having a lot of inactive file pages, so a well behaved
cgroup is allowed to keep its file cache when there is a "badly
behaving" one in the same hierarchy.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
mm/vmscan.c | 37 +++++++++++++++++++++++++++++++++----
1 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a9688b..b4d73d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1576,6 +1576,19 @@ static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
return inactive_anon_is_low(lruvec);
}
+/* If this lruvec has lots of inactive file pages, reclaim those only. */
+static bool reclaim_file_only(struct lruvec *lruvec, struct scan_control *sc,
+ unsigned long anon, unsigned long file)
+{
+ if (inactive_file_is_low(lruvec))
+ return false;
+
+ if (file > (anon + file) >> sc->priority)
+ return true;
+
+ return false;
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
@@ -1687,6 +1700,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
reclaim_stat->recent_rotated[1] /= 2;
}
+ /* Lots of inactive file pages? Reclaim those only. */
+ if (reclaim_file_only(lruvec, sc, anon, file)) {
+ fraction[0] = 0;
+ fraction[1] = 1;
+ denominator = 1;
+ goto out;
+ }
+
/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
@@ -1922,8 +1943,8 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
* should always be larger than recent_rotated, and the size should
* always be larger than recent_pressure.
*/
-static u64 reclaim_score(struct mem_cgroup *memcg,
- struct lruvec *lruvec)
+static u64 reclaim_score(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ struct scan_control *sc)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 anon, file;
@@ -1949,6 +1970,14 @@ static u64 reclaim_score(struct mem_cgroup *memcg,
anon *= 10000;
}
+ /*
+ * Prefer reclaiming from an lruvec with lots of inactive file
+ * pages. Once those have been reclaimed, the score will drop so
+ * far we will pick another lruvec to reclaim from.
+ */
+ if (reclaim_file_only(lruvec, sc, anon, file))
+ file *= 100;
+
return max(anon, file);
}
@@ -1974,7 +2003,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
age_recent_pressure(lruvec, zone);
- score = reclaim_score(memcg, lruvec);
+ score = reclaim_score(memcg, lruvec, sc);
/* Pick the lruvec with the highest score. */
if (score > max_score) {
@@ -1995,7 +2024,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
*/
do {
shrink_lruvec(victim, sc);
- score = reclaim_score(memcg, victim);
+ score = reclaim_score(memcg, victim, sc);
} while (sc->nr_to_reclaim > 0 && score > max_score / 2);
/*
--
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 3/3] mm,vmscan: evict inactive file pages first
2012-08-08 21:49 ` [RFC][PATCH -mm 3/3] mm,vmscan: evict inactive file pages first Rik van Riel
@ 2012-08-12 23:56 ` Rafael Aquini
2012-08-13 2:13 ` [RFC][PATCH -mm -v2 " Rik van Riel
0 siblings, 1 reply; 13+ messages in thread
From: Rafael Aquini @ 2012-08-12 23:56 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, yinghan, hannes, mhocko, Mel Gorman
Howdy Rik,
On Wed, Aug 08, 2012 at 05:49:04PM -0400, Rik van Riel wrote:
> @@ -1687,6 +1700,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> reclaim_stat->recent_rotated[1] /= 2;
> }
>
> + /* Lots of inactive file pages? Reclaim those only. */
> + if (reclaim_file_only(lruvec, sc, anon, file)) {
> + fraction[0] = 0;
> + fraction[1] = 1;
> + denominator = 1;
> + goto out;
> + }
> +
This hunk causes a &zone->lru_lock spinlock lockup down this path:
shrink_zone()->shrink_lruvec()->shrink_list()->shrink_inactive_list()
I could trigger it every time by doing a kernel RPM install on a 2GB guest.
---8<---
...
=============================================
[ INFO: possible recursive locking detected ]
3.6.0-rc1+ #197 Not tainted
---------------------------------------------
kswapd0/29 is trying to acquire lock:
(&(&zone->lru_lock)->rlock){....-.}, at: [<ffffffff81167b34>]
shrink_inactive_list+0xd4/0x4b0
but task is already holding lock:
(&(&zone->lru_lock)->rlock){....-.}, at: [<ffffffff81168037>]
shrink_lruvec+0x127/0x630
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&zone->lru_lock)->rlock);
lock(&(&zone->lru_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by kswapd0/29:
#0: (&(&zone->lru_lock)->rlock){....-.}, at: [<ffffffff81168037>]
shrink_lruvec+0x127/0x630
stack backtrace:
Pid: 29, comm: kswapd0 Not tainted 3.6.0-rc1+ #197
Call Trace:
[<ffffffff810cedda>] __lock_acquire+0x125a/0x1660
[<ffffffff8108a618>] ? __kernel_text_address+0x58/0x80
[<ffffffff810cf27f>] lock_acquire+0x9f/0x190
[<ffffffff81167b34>] ? shrink_inactive_list+0xd4/0x4b0
[<ffffffff8169c9dd>] _raw_spin_lock_irq+0x4d/0x60
[<ffffffff81167b34>] ? shrink_inactive_list+0xd4/0x4b0
[<ffffffff811612cf>] ? lru_add_drain+0x2f/0x40
[<ffffffff81167b34>] shrink_inactive_list+0xd4/0x4b0
[<ffffffff81168037>] ? shrink_lruvec+0x127/0x630
[<ffffffff81168395>] shrink_lruvec+0x485/0x630
[<ffffffff81168743>] shrink_zone+0x203/0x2a0
[<ffffffff8116991b>] kswapd+0x85b/0xf60
[<ffffffff8108e4f0>] ? wake_up_bit+0x40/0x40
[<ffffffff811690c0>] ? zone_reclaim+0x420/0x420
[<ffffffff8108dd8e>] kthread+0xbe/0xd0
[<ffffffff816a74c4>] kernel_thread_helper+0x4/0x10
[<ffffffff8169d630>] ? retint_restore_args+0x13/0x13
[<ffffffff8108dcd0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816a74c0>] ? gs_change+0x13/0x13
...
---8<---
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC][PATCH -mm -v2 3/3] mm,vmscan: evict inactive file pages first
2012-08-12 23:56 ` Rafael Aquini
@ 2012-08-13 2:13 ` Rik van Riel
2012-08-14 19:11 ` Rafael Aquini
0 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2012-08-13 2:13 UTC (permalink / raw)
To: Rafael Aquini; +Cc: linux-mm, yinghan, hannes, mhocko, Mel Gorman
On Sun, 12 Aug 2012 20:56:16 -0300
Rafael Aquini <aquini@redhat.com> wrote:
> Howdy Rik,
>
> On Wed, Aug 08, 2012 at 05:49:04PM -0400, Rik van Riel wrote:
> > @@ -1687,6 +1700,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > reclaim_stat->recent_rotated[1] /= 2;
> > }
> >
> > + /* Lots of inactive file pages? Reclaim those only. */
> > + if (reclaim_file_only(lruvec, sc, anon, file)) {
> > + fraction[0] = 0;
> > + fraction[1] = 1;
> > + denominator = 1;
> > + goto out;
> > + }
> > +
>
> This hunk causes a &zone->lru_lock spinlock lockup down this path:
> shrink_zone()->shrink_lruvec()->shrink_list()->shrink_inactive_list()
Oops. Looks like I put it in the wrong spot in get_scan_count,
the spot that is under the lru lock, which we really do not
need for this code.
Can you try this one?
---8<---
Subject: mm,vmscan: evict inactive file pages first
When a lot of streaming file IO is happening, it makes sense to
evict just the inactive file pages and leave the other LRU lists
alone.
Likewise, when driving a cgroup hierarchy into its hard limit,
or over its soft limit, it makes sense to pick a child cgroup
that has lots of inactive file pages, and evict those first.
Being over its soft limit is considered a stronger preference
than just having a lot of inactive file pages, so a well behaved
cgroup is allowed to keep its file cache when there is a "badly
behaving" one in the same hierarchy.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
mm/vmscan.c | 37 +++++++++++++++++++++++++++++++++----
1 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a9688b..0844b09 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1576,6 +1576,19 @@ static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
return inactive_anon_is_low(lruvec);
}
+/* If this lruvec has lots of inactive file pages, reclaim those only. */
+static bool reclaim_file_only(struct lruvec *lruvec, struct scan_control *sc,
+ unsigned long anon, unsigned long file)
+{
+ if (inactive_file_is_low(lruvec))
+ return false;
+
+ if (file > (anon + file) >> sc->priority)
+ return true;
+
+ return false;
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
@@ -1658,6 +1671,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
}
}
+ /* Lots of inactive file pages? Reclaim those only. */
+ if (reclaim_file_only(lruvec, sc, anon, file)) {
+ fraction[0] = 0;
+ fraction[1] = 1;
+ denominator = 1;
+ goto out;
+ }
+
/*
* With swappiness at 100, anonymous and file have the same priority.
* This scanning priority is essentially the inverse of IO cost.
@@ -1922,8 +1943,8 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
* should always be larger than recent_rotated, and the size should
* always be larger than recent_pressure.
*/
-static u64 reclaim_score(struct mem_cgroup *memcg,
- struct lruvec *lruvec)
+static u64 reclaim_score(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ struct scan_control *sc)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 anon, file;
@@ -1949,6 +1970,14 @@ static u64 reclaim_score(struct mem_cgroup *memcg,
anon *= 10000;
}
+ /*
+ * Prefer reclaiming from an lruvec with lots of inactive file
+ * pages. Once those have been reclaimed, the score will drop so
+ * far we will pick another lruvec to reclaim from.
+ */
+ if (reclaim_file_only(lruvec, sc, anon, file))
+ file *= 100;
+
return max(anon, file);
}
@@ -1974,7 +2003,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
age_recent_pressure(lruvec, zone);
- score = reclaim_score(memcg, lruvec);
+ score = reclaim_score(memcg, lruvec, sc);
/* Pick the lruvec with the highest score. */
if (score > max_score) {
@@ -1995,7 +2024,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
*/
do {
shrink_lruvec(victim, sc);
- score = reclaim_score(memcg, victim);
+ score = reclaim_score(memcg, victim, sc);
} while (sc->nr_to_reclaim > 0 && score > max_score / 2);
/*
--
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm -v2 3/3] mm,vmscan: evict inactive file pages first
2012-08-13 2:13 ` [RFC][PATCH -mm -v2 " Rik van Riel
@ 2012-08-14 19:11 ` Rafael Aquini
0 siblings, 0 replies; 13+ messages in thread
From: Rafael Aquini @ 2012-08-14 19:11 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, yinghan, hannes, mhocko, Mel Gorman
On Sun, Aug 12, 2012 at 10:13:13PM -0400, Rik van Riel wrote:
> Oops. Looks like I put it in the wrong spot in get_scan_count,
> the spot that is under the lru lock, which we really do not
> need for this code.
>
> Can you try this one?
Thanks Rik, the lockup is now gone. :)
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup
2012-08-08 21:45 [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Rik van Riel
` (2 preceding siblings ...)
2012-08-08 21:49 ` [RFC][PATCH -mm 3/3] mm,vmscan: evict inactive file pages first Rik van Riel
@ 2012-08-09 1:02 ` Ying Han
2012-08-09 22:23 ` Ying Han
4 siblings, 0 replies; 13+ messages in thread
From: Ying Han @ 2012-08-09 1:02 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, hannes, mhocko, Mel Gorman
On Wed, Aug 8, 2012 at 2:45 PM, Rik van Riel <riel@redhat.com> wrote:
> Instead of doing round robin reclaim over all the cgroups in a zone, we
> reclaim from the highest score cgroup first.
>
> Factors in the scoring are the use ratio of pages in the lruvec
> (recent_rotated / recent_scanned), the size of the lru, the recent amount
> of pressure applied to each lru, whether the cgroup is over its soft limit
> and whether the cgroup has lots of inactive file pages.
>
> This patch series is on top of a recent mmotm with Ying's memcg softreclaim
> patches [2/2] applied. Unfortunately it turns out that that mmotm tree
> with Ying's patches does not compile with CONFIG_MEMCG=y, so I am tossing
> these patches over the wall untested, as inspiration for others (hi Ying).
>
> This still suffers from the same scalability issue the current code has,
> namely a round robin iteration over all the lruvecs in a zone. We may want
> to fix that in the future by sorting the memcgs/lruvecs in some sort of
> tree, allowing us to find the high priority ones more easily and doing the
> recalculation asynchronously and less often.
Thank you Rik for the work!
I haven't got a chance to look through it yet, but I will tomorrow :)
--Ying
--
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup
2012-08-08 21:45 [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Rik van Riel
` (3 preceding siblings ...)
2012-08-09 1:02 ` [RFC][PATCH -mm 0/3] mm,vmscan: reclaim from highest score cgroup Ying Han
@ 2012-08-09 22:23 ` Ying Han
4 siblings, 0 replies; 13+ messages in thread
From: Ying Han @ 2012-08-09 22:23 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-mm, hannes, mhocko, Mel Gorman
On Wed, Aug 8, 2012 at 2:45 PM, Rik van Riel <riel@redhat.com> wrote:
> Instead of doing round robin reclaim over all the cgroups in a zone, we
> reclaim from the highest score cgroup first.
>
> Factors in the scoring are the use ratio of pages in the lruvec
> (recent_rotated / recent_scanned), the size of the lru, the recent amount
> of pressure applied to each lru, whether the cgroup is over its soft limit
> and whether the cgroup has lots of inactive file pages.
>
> This patch series is on top of a recent mmotm with Ying's memcg softreclaim
> patches [2/2] applied. Unfortunately it turns out that that mmotm tree
> with Ying's patches does not compile with CONFIG_MEMCG=y, so I am tossing
> these patches over the wall untested, as inspiration for others (hi Ying).
Hmm, I don't see a build problem either before or after your patchset.
Wondering if we are talking about two different trees. I based my
work on git://github.com/mstsxfx/memcg-devel.git
$ git log --oneline --decorate
f1979e3 (tag: mmotm-2012-07-25-16-41, github-memcg/since-3.5)
remove-__gfp_no_kswapd-fixes-fix
--Ying
>
> This still suffers from the same scalability issue the current code has,
> namely a round robin iteration over all the lruvecs in a zone. We may want
> to fix that in the future by sorting the memcgs/lruvecs in some sort of
> tree, allowing us to find the high priority ones more easily and doing the
> recalculation asynchronously and less often.
--
^ permalink raw reply [flat|nested] 13+ messages in thread