* [PATCH v5 1/4] memcg: add page_cgroup_ino helper
2015-05-12 13:34 [PATCH v5 0/4] idle memory tracking Vladimir Davydov
@ 2015-05-12 13:34 ` Vladimir Davydov
2015-05-12 13:34 ` [PATCH v5 2/4] proc: add kpagecgroup file Vladimir Davydov
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2015-05-12 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen,
Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, linux-kernel
Hwpoison allows filtering pages by memory cgroup inode number. To achieve
that, it calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and
finally cgroup_ino() on the cgroup returned. This looks bulky. Since the
next patch also needs the ino of the memory cgroup a page is charged to,
this patch introduces the page_cgroup_ino() helper.
Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and returns 0 for
others, while try_get_mem_cgroup_from_page(), used by hwpoison before, may
extract the cgroup from a swapcache readahead page too. Ignoring swapcache
readahead pages makes it safe to call page_cgroup_ino() on unlocked pages,
which is nice. Hwpoison users will hardly see any difference.
Another difference between try_get_mem_cgroup_from_page() and
page_cgroup_ino() is that the latter works on pages charged to offline
memory cgroups, returning the inode number of the closest online ancestor
in this case, while the former does not. This behavior is crucial for the
next patch.
Since try_get_mem_cgroup_from_page() is not used by anyone else, this
patch removes it. It also makes the hwpoison memcg filter depend on
CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I have no idea why it was made
dependent on CONFIG_MEMCG_SWAP initially).
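For illustration, with this change the hwpoison memcg filter's lookup
shrinks roughly as follows (variables as in the mm/memory-failure.c hunk
below):

	/* Before: three calls, page lock required. */
	mem = try_get_mem_cgroup_from_page(p);
	if (!mem)
		return -EINVAL;
	css = mem_cgroup_css(mem);
	ino = cgroup_ino(css->cgroup);
	css_put(css);

	/* After: one call, safe on unlocked pages. */
	ino = page_cgroup_ino(p);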
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
include/linux/memcontrol.h | 8 ++---
mm/hwpoison-inject.c | 5 +--
mm/memcontrol.c | 73 ++++++++++++++++++++++----------------------
mm/memory-failure.c | 16 ++--------
4 files changed, 42 insertions(+), 60 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
void mem_cgroup_split_huge_fixup(struct page *head);
#endif
+unsigned long page_cgroup_ino(struct page *page);
+
#else /* CONFIG_MEMCG */
struct mem_cgroup;
@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 4ca5fe0042e1..d2facac0b01f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
/*
* do a racy check with elevated page count, to make sure PG_hwpoison
* will only be set for the targeted owner (or on a free page).
- * We temporarily take page lock for try_get_mem_cgroup_from_page().
* memory_failure() will redo the check reliably inside page lock.
*/
- lock_page(hpage);
err = hwpoison_filter(hpage);
- unlock_page(hpage);
if (err)
goto put_out;
@@ -126,7 +123,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..87c7f852d45b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges. If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- struct mem_cgroup *memcg;
- unsigned short id;
- swp_entry_t ent;
-
- VM_BUG_ON_PAGE(!PageLocked(page), page);
-
- memcg = page->mem_cgroup;
- if (memcg) {
- if (!css_tryget_online(&memcg->css))
- memcg = NULL;
- } else if (PageSwapCache(page)) {
- ent.val = page_private(page);
- id = lookup_swap_cgroup_id(ent);
- rcu_read_lock();
- memcg = mem_cgroup_from_id(id);
- if (memcg && !css_tryget_online(&memcg->css))
- memcg = NULL;
- rcu_read_unlock();
- }
- return memcg;
-}
-
static void lock_page_lru(struct page *page, int *isolated)
{
struct zone *zone = page_zone(page);
@@ -2774,6 +2740,31 @@ void mem_cgroup_split_huge_fixup(struct page *head)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+/**
+ * page_cgroup_ino - return inode number of page's memcg
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number. It is safe to call this function without taking
+ * a reference to the page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !css_tryget_online(&memcg->css))
+ memcg = parent_mem_cgroup(memcg);
+ rcu_read_unlock();
+ if (memcg) {
+ ino = cgroup_ino(memcg->css.cgroup);
+ css_put(&memcg->css);
+ }
+ return ino;
+}
+
#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
@@ -5482,8 +5473,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
}
- if (do_swap_account && PageSwapCache(page))
- memcg = try_get_mem_cgroup_from_page(page);
+ if (do_swap_account && PageSwapCache(page)) {
+ swp_entry_t ent = { .val = page_private(page), };
+ unsigned short id = lookup_swap_cgroup_id(ent);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ }
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 501820c815b3..7166ad81b222 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
* can only guarantee that the page either belongs to the memcg tasks, or is
* a freed page.
*/
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
static int hwpoison_filter_task(struct page *p)
{
- struct mem_cgroup *mem;
- struct cgroup_subsys_state *css;
- unsigned long ino;
-
if (!hwpoison_filter_memcg)
return 0;
- mem = try_get_mem_cgroup_from_page(p);
- if (!mem)
- return -EINVAL;
-
- css = mem_cgroup_css(mem);
- ino = cgroup_ino(css->cgroup);
- css_put(css);
-
- if (ino != hwpoison_filter_memcg)
+ if (page_cgroup_ino(p) != hwpoison_filter_memcg)
return -EINVAL;
return 0;
--
1.7.10.4
* [PATCH v5 2/4] proc: add kpagecgroup file
2015-05-12 13:34 [PATCH v5 0/4] idle memory tracking Vladimir Davydov
2015-05-12 13:34 ` [PATCH v5 1/4] memcg: add page_cgroup_ino helper Vladimir Davydov
@ 2015-05-12 13:34 ` Vladimir Davydov
2015-05-12 13:34 ` [PATCH v5 3/4] proc: add kpageidle file Vladimir Davydov
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2015-05-12 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen,
Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, linux-kernel
/proc/kpagecgroup contains the 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup's working set size.
The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
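For illustration only (not part of the patch), a userspace consumer could
read the memcg inode number for a given PFN along these lines; the file
holds one u64 per PFN, so the byte offset is pfn * 8. This is a minimal
sketch with error handling abbreviated:

	#include <stdio.h>
	#include <stdint.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		unsigned long pfn;
		uint64_t ino;
		FILE *f;

		if (argc < 2)
			return 1;
		pfn = strtoul(argv[1], NULL, 0);
		f = fopen("/proc/kpagecgroup", "rb");
		if (!f)
			return 1;
		/* One 64-bit record per PFN, so seek to pfn * 8. */
		if (fseek(f, pfn * sizeof(ino), SEEK_SET) ||
		    fread(&ino, sizeof(ino), 1, f) != 1) {
			fclose(f);
			return 1;
		}
		printf("pfn %lu -> memcg ino %llu\n", pfn,
		       (unsigned long long)ino);
		fclose(f);
		return 0;
	}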
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/Kconfig | 5 ++--
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 3 deletions(-)
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.
-There are three components to pagemap:
+There are four components to pagemap:
* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE
+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:
0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
help
Various /proc files exist to monitor process memory utilization:
/proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
- /proc/kpagecount, and /proc/kpageflags. Disabling these
- interfaces will reduce the size of the kernel by approximately 4kb.
+ /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+ Disabling these interfaces will reduce the size of the kernel
+ by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};
+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
--
1.7.10.4
* [PATCH v5 3/4] proc: add kpageidle file
2015-05-12 13:34 [PATCH v5 0/4] idle memory tracking Vladimir Davydov
2015-05-12 13:34 ` [PATCH v5 1/4] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-05-12 13:34 ` [PATCH v5 2/4] proc: add kpagecgroup file Vladimir Davydov
@ 2015-05-12 13:34 ` Vladimir Davydov
[not found] ` <cover.1431437088.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-06-07 6:11 ` [PATCH v5 0/4] idle memory tracking Raghavendra KT
4 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2015-05-12 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen,
Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, linux-kernel
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at
the offset corresponding to the page, and it is cleared whenever the page
is accessed either through page tables (it is cleared in page_referenced()
in this case) or via the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for the pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload (see the sketch below).
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is later called on a Young page, it will add 1 to its
return value, thereby concealing the fact that the Access bit was cleared.
Note that, since there is no room for extra page flags on 32-bit, this
feature uses extended page flags (page_ext) when compiled on 32-bit.
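To make the intended workflow concrete, here is a rough userspace sketch
(not part of the patch; the start_pfn/nr range is hypothetical and would
normally come from /proc/PID/pagemap). Each u64 in kpageidle covers 64
PFNs, so the byte offset for a PFN is pfn / 64 * 8; the sketch assumes
the range is 64-page aligned and abbreviates error handling:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	static int count_idle(int fd, uint64_t start_pfn, uint64_t nr)
	{
		uint64_t buf, pfn;
		int idle = 0;

		for (pfn = start_pfn; pfn < start_pfn + nr; pfn += 64) {
			/* Read the 64-bit chunk covering this PFN. */
			pread(fd, &buf, 8, pfn / 64 * 8);
			idle += __builtin_popcountll(buf);
		}
		return idle;
	}

	int main(void)
	{
		uint64_t start_pfn = 0x10000, nr = 4096; /* hypothetical */
		uint64_t all_idle = ~0ULL, pfn;
		int fd = open("/proc/kpageidle", O_RDWR);

		if (fd < 0)
			return 1;
		/* Mark the whole range idle (writes are OR-ed in). */
		for (pfn = start_pfn; pfn < start_pfn + nr; pfn += 64)
			pwrite(fd, &all_idle, 8, pfn / 64 * 8);
		sleep(60); /* let the workload touch its working set */
		printf("idle pages: %d\n", count_idle(fd, start_pfn, nr));
		close(fd);
		return 0;
	}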
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 178 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 88 +++++++++++++++++++++
include/linux/page-flags.h | 9 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
11 files changed, 322 insertions(+), 2 deletions(-)
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.
-There are four components to pagemap:
+There are five components to pagemap:
* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:
0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..f42ead08d346 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -16,6 +16,7 @@
#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)
/* /proc/kpagecount - an array exposing page counts
*
@@ -275,6 +276,179 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to page_referenced(), which is essential for
+ * idle page tracking. With such an indicator of user pages we can skip
+ * isolated pages, but since there are not usually many of them, it will hardly
+ * affect the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+ struct page *page;
+ struct zone *zone;
+
+ if (!pfn_valid(pfn))
+ return NULL;
+
+ page = pfn_to_page(pfn);
+ if (!page || !PageLRU(page))
+ return NULL;
+ if (!get_page_unless_zero(page))
+ return NULL;
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (unlikely(!PageLRU(page))) {
+ put_page(page);
+ page = NULL;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ return page;
+}
+
+/*
+ * This function calls page_referenced() to clear the referenced bit for all
+ * mappings to a page. Since the latter also clears the page idle flag if the
+ * page was referenced, it can be used to update the idle flag of a page.
+ */
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+ unsigned long dummy;
+
+ if (page_referenced(page, 0, NULL, &dummy))
+ /*
+ * We cleared the referenced bit in a mapping to this page. To
+ * avoid interference with the reclaimer, mark it young so that
+ * the next call to page_referenced() will also return > 0 (see
+ * page_referenced_one())
+ */
+ set_page_young(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return 0;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ if (put_user(idle_bitmap, out)) {
+ ret = -EFAULT;
+ break;
+ }
+ idle_bitmap = 0;
+ out++;
+ }
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (const u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ if (bit == 0) {
+ if (get_user(idle_bitmap, in)) {
+ ret = -EFAULT;
+ break;
+ }
+ in++;
+ }
+ if (idle_bitmap >> bit & 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_pte_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ }
+ }
+
+ *ppos += (const char __user *)in - buf;
+ if (!ret)
+ ret = (const char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +456,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+ proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+ &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..ab04846f7dd5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
- if (young || PageReferenced(page))
+ if (young || page_is_young(page) || PageReferenced(page))
mss->referenced += size;
mapcount = page_mapcount(page);
if (mapcount >= 2) {
@@ -808,6 +808,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
+ clear_page_young(page);
ClearPageReferenced(page);
out:
spin_unlock(ptl);
@@ -835,6 +836,7 @@ out:
/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
+ clear_page_young(page);
ClearPageReferenced(page);
}
pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..794d29aa2317 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2200,5 +2200,93 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+ return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ SetPageYoung(page);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+ return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+ clear_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+ return false;
+}
+
+static inline void clear_page_young(struct page *page)
+{
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return false;
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f34e040b34e9..5e7c4f50a644 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ PG_young,
+ PG_idle,
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -289,6 +293,11 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#endif
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ PAGE_EXT_YOUNG,
+ PAGE_EXT_IDLE,
+#endif
};
/*
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..3600eace4774 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.
A sane initial value is 80 MB.
+
+config IDLE_PAGE_TRACKING
+ bool "Enable idle page tracking"
+ select PROC_PAGE_MONITOR
+ select PAGE_EXTENSION if !64BIT
+ help
+ This feature allows to estimate the amount of user pages that have
+ not been touched during a given period of time. This information can
+ be useful to tune memory cgroup limits and/or for job placement
+ within a compute cluster.
+
+ See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..bb66f9ccec03 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ {1UL << PG_young, "young" },
+ {1UL << PG_idle, "idle" },
+#endif
};
static void dump_flags(unsigned long flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ &page_idle_ops,
+#endif
};
static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 8b18fd4227d1..3650793eaeab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -781,6 +781,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}
+ if (referenced && page_is_idle(page))
+ clear_page_idle(page);
+
+ if (page_is_young(page)) {
+ clear_page_young(page);
+ referenced++;
+ }
+
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index a7251a8ed532..6bf6f293a9ea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
+ if (page_is_idle(page))
+ clear_page_idle(page);
}
EXPORT_SYMBOL(mark_page_accessed);
--
1.7.10.4
* [PATCH v5 4/4] proc: export idle flag via kpageflags
[not found] ` <cover.1431437088.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2015-05-12 13:34 ` Vladimir Davydov
0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2015-05-12 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen,
Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.
Note that the idle flag read from /proc/kpageflags may be stale if the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date, one has to read
/proc/kpageidle first.
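For illustration only (not part of the patch), checking the bit from
userspace could look like the sketch below; KPF_IDLE is bit 25 per the
uapi header change that follows, and error handling is abbreviated:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define KPF_IDLE 25

	int main(int argc, char **argv)
	{
		uint64_t pfn, flags;
		int fd;

		if (argc < 2)
			return 1;
		pfn = strtoull(argv[1], NULL, 0);
		/* One u64 of flags per PFN, hence offset pfn * 8. */
		fd = open("/proc/kpageflags", O_RDONLY);
		if (fd < 0 || pread(fd, &flags, 8, pfn * 8) != 8)
			return 1;
		printf("pfn %llu idle=%d\n", (unsigned long long)pfn,
		       (int)(flags >> KPF_IDLE & 1));
		close(fd);
		return 0;
	}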
Signed-off-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
Documentation/vm/pagemap.txt | 6 ++++++
fs/proc/page.c | 3 +++
include/uapi/linux/kernel-page-flags.h | 1 +
3 files changed, 10 insertions(+)
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page
+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index f42ead08d346..24748be3dd65 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -148,6 +148,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;
+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);
u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25
#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
--
1.7.10.4
* Re: [PATCH v5 0/4] idle memory tracking
2015-05-12 13:34 [PATCH v5 0/4] idle memory tracking Vladimir Davydov
` (3 preceding siblings ...)
[not found] ` <cover.1431437088.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2015-06-07 6:11 ` Raghavendra KT
2015-06-07 9:11 ` Vladimir Davydov
2015-06-08 19:35 ` Andrew Morton
4 siblings, 2 replies; 9+ messages in thread
From: Raghavendra KT @ 2015-06-07 6:11 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko,
Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, Linux Kernel Mailing List, Raghavendra KT
On Tue, May 12, 2015 at 7:04 PM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change over time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
Hi Vladimir,
Thanks for the patches. I was able to test how the series helps determine
a docker container's working set / idle memory with these patches (tested
on ppc64le after porting to a distro kernel).
* Re: [PATCH v5 0/4] idle memory tracking
2015-06-07 6:11 ` [PATCH v5 0/4] idle memory tracking Raghavendra KT
@ 2015-06-07 9:11 ` Vladimir Davydov
2015-06-08 19:35 ` Andrew Morton
1 sibling, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2015-06-07 9:11 UTC (permalink / raw)
To: Raghavendra KT
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko,
Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, Linux Kernel Mailing List
On Sun, Jun 07, 2015 at 11:41:15AM +0530, Raghavendra KT wrote:
> Thanks for the patches. I was able to test how the series helps determine
> a docker container's working set / idle memory with these patches (tested
> on ppc64le after porting to a distro kernel).
Hi,
Thank you for using and testing it! I've been busy for a while with my
internal tasks, but I am almost done with them and will get back to this
patch set and resubmit it soon (during the next week hopefully).
Thanks,
Vladimir
* Re: [PATCH v5 0/4] idle memory tracking
2015-06-07 6:11 ` [PATCH v5 0/4] idle memory tracking Raghavendra KT
2015-06-07 9:11 ` Vladimir Davydov
@ 2015-06-08 19:35 ` Andrew Morton
[not found] ` <20150608123535.d82543cedbb9060612a10113-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2015-06-08 19:35 UTC (permalink / raw)
To: Raghavendra KT
Cc: Vladimir Davydov, Minchan Kim, Johannes Weiner, Michal Hocko,
Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov,
Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
cgroups, Linux Kernel Mailing List
On Sun, 7 Jun 2015 11:41:15 +0530 Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com> wrote:
> On Tue, May 12, 2015 at 7:04 PM, Vladimir Davydov
> <vdavydov@parallels.com> wrote:
> > Hi,
> >
> > This patch set introduces a new user API for tracking user memory pages
> > that have not been used for a given period of time. The purpose of this
> > is to provide the userspace with the means of tracking a workload's
> > working set, i.e. the set of pages that are actively used by the
> > workload. Knowing the working set size can be useful for partitioning
> > the system more efficiently, e.g. by tuning memory cgroup limits
> > appropriately, or for job placement within a compute cluster.
> >
> > ---- USE CASES ----
> >
> > The unified cgroup hierarchy has memory.low and memory.high knobs, which
> > are defined as the low and high boundaries for the workload working set
> > size. However, the working set size of a workload may be unknown or
> > change over time. With this patch set, one can periodically estimate the
> > amount of memory unused by each cgroup and tune their memory.low and
> > memory.high parameters accordingly, therefore optimizing the overall
> > memory utilization.
> >
>
> Hi Vladimir,
>
> Thanks for the patches. I was able to test how the series helps determine
> a docker container's working set / idle memory with these patches (tested
> on ppc64le after porting to a distro kernel).
And what were the results of your testing? The more details the
better, please.