linux-mm.kvack.org archive mirror
* [PATCH V3] Add the pagefault count into memcg stats
@ 2011-03-29  6:16 Ying Han
  2011-03-29 21:36 ` Minchan Kim
  2011-03-30  2:47 ` Balbir Singh
  0 siblings, 2 replies; 12+ messages in thread
From: Ying Han @ 2011-03-29  6:16 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton
  Cc: linux-mm

Add two new stats to per-memcg memory.stat which track the number of
page faults and the number of major page faults:

"pgfault"
"pgmajfault"

They are different from the "pgpgin"/"pgpgout" stats, which count the
number of pages charged/uncharged to the cgroup and say nothing about
reading or writing pages to/from disk.

It is valuable to track these two stats both for measuring an
application's performance and for gauging the efficiency of the kernel
page reclaim path. Counting page faults per process is useful, but we
also need the aggregated value, since processes are monitored and
controlled on a per-cgroup basis in memcg.

Functional test: check the total number of pgfault/pgmajfault across all
memcgs and compare with the global vmstat values:

$ cat /proc/vmstat | grep fault
pgfault 1070751
pgmajfault 553

$ cat /dev/cgroup/memory.stat | grep fault
pgfault 1071138
pgmajfault 553
total_pgfault 1071142
total_pgmajfault 553

$ cat /dev/cgroup/A/memory.stat | grep fault
pgfault 199
pgmajfault 0
total_pgfault 199
total_pgmajfault 0
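
One way the child memcg "A" above could have been set up (a sketch, since
the exact setup is not shown here; it assumes the cgroup v1 memory
controller is the hierarchy mounted at /dev/cgroup):

$ mkdir /dev/cgroup/A
$ echo $$ > /dev/cgroup/A/tasks    # move the current shell into A
$ grep fault /dev/cgroup/A/memory.stat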

Performance test: run the page fault test (pft) with 16 threads, faulting
in 15G of anon pages in a 16G container. No regression was noticed in the
"flt/cpu/s" metric.

Sample output from pft:
TAG pft:anon-sys-default:
  Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
  15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260

+-------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     16682.962     17344.027     16913.524     16928.812      166.5362
+  10     16695.568     16923.896     16820.604     16824.652     84.816568
No difference proven at 95.0% confidence
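
The summary above is ministat(1)-style output comparing ten runs before
("x") and after ("+") the patch; assuming the per-run flt/cpu/s values
were saved one per line, it could be reproduced with something like:

$ ministat -w 74 before.txt after.txt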

Changes from v2 to v3:
1. removed the unnecessary function definition in memcontrol.h

Signed-off-by: Ying Han <yinghan@google.com>
---
 Documentation/cgroups/memory.txt |    4 +++
 fs/ncpfs/mmap.c                  |    2 +
 include/linux/memcontrol.h       |    6 +++++
 mm/filemap.c                     |    1 +
 mm/memcontrol.c                  |   47 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c                      |    2 +
 mm/shmem.c                       |    2 +
 7 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..2db6103 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,8 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+pgfault		- # of page faults.
+pgmajfault	- # of major page faults.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +408,8 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_pgfault		- sum of all children's "pgfault"
+total_pgmajfault	- sum of all children's "pgmajfault"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index a7c07b4..e5d71b2 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -16,6 +16,7 @@
 #include <linux/mman.h>
 #include <linux/string.h>
 #include <linux/fcntl.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -92,6 +93,7 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	 * -- wli
 	 */
 	count_vm_event(PGMAJFAULT);
+	mem_cgroup_count_vm_event(area->vm_mm, PGMAJFAULT);
 	return VM_FAULT_MAJOR;
 }
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..8a48f5b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+enum vm_event_item;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -147,6 +148,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -354,6 +356,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+static inline
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..e022229 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1683,6 +1683,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
 		page = find_get_page(mapping, offset);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..8f9cf7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
+	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -585,6 +587,16 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_pgfault(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGFAULT], val);
+}
+
+void mem_cgroup_pgmajfault(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -813,6 +825,33 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
+{
+	struct mem_cgroup *mem;
+
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!mem))
+		goto out;
+
+	switch (idx) {
+	case PGMAJFAULT:
+		mem_cgroup_pgmajfault(mem, 1);
+		break;
+	case PGFAULT:
+		mem_cgroup_pgfault(mem, 1);
+		break;
+	default:
+		BUG();
+	}
+out:
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(mem_cgroup_count_vm_event);
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3772,6 +3811,8 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_PGFAULT,
+	MCS_PGMAJFAULT,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3794,6 +3835,8 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"pgfault", "total_pgfault"},
+	{"pgmajfault", "total_pgmajfault"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3822,6 +3865,10 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGFAULT);
+	s->stat[MCS_PGFAULT] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
+	s->stat[MCS_PGMAJFAULT] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
diff --git a/mm/memory.c b/mm/memory.c
index 8617d39..28d19b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2836,6 +2836,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(mm, PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
@@ -3375,6 +3376,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
+	mem_cgroup_count_vm_event(mm, PGFAULT);
 
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
diff --git a/mm/shmem.c b/mm/shmem.c
index ad8346b..fa0b2b8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1289,6 +1289,8 @@ repeat:
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
+				mem_cgroup_count_vm_event(current->mm,
+							  PGMAJFAULT);
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-- 
1.7.3.1


* [PATCH V3] Add the pagefault count into memcg stats
@ 2011-03-29 17:32 Ying Han
  2011-03-30  1:16 ` KOSAKI Motohiro
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ying Han @ 2011-03-29 17:32 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton
  Cc: linux-mm

Add two new stats to per-memcg memory.stat which track the number of
page faults and the number of major page faults:

"pgfault"
"pgmajfault"

They are different from the "pgpgin"/"pgpgout" stats, which count the
number of pages charged/uncharged to the cgroup and say nothing about
reading or writing pages to/from disk.

It is valuable to track these two stats both for measuring an
application's performance and for gauging the efficiency of the kernel
page reclaim path. Counting page faults per process is useful, but we
also need the aggregated value, since processes are monitored and
controlled on a per-cgroup basis in memcg.

Functional test: check the total number of pgfault/pgmajfault across all
memcgs and compare with the global vmstat values:

$ cat /proc/vmstat | grep fault
pgfault 1070751
pgmajfault 553

$ cat /dev/cgroup/memory.stat | grep fault
pgfault 1071138
pgmajfault 553
total_pgfault 1071142
total_pgmajfault 553

$ cat /dev/cgroup/A/memory.stat | grep fault
pgfault 199
pgmajfault 0
total_pgfault 199
total_pgmajfault 0

Performance test: run the page fault test (pft) with 16 threads, faulting
in 15G of anon pages in a 16G container. No regression was noticed in the
"flt/cpu/s" metric.

Sample output from pft:
TAG pft:anon-sys-default:
  Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
  15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260

+-------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     16682.962     17344.027     16913.524     16928.812      166.5362
+  10     16695.568     16923.896     16820.604     16824.652     84.816568
No difference proven at 95.0% confidence

Changes from v2 to v3:
1. removed the unnecessary function definition in memcontrol.h

Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |    4 +++
 fs/ncpfs/mmap.c                  |    2 +
 include/linux/memcontrol.h       |    6 +++++
 mm/filemap.c                     |    1 +
 mm/memcontrol.c                  |   47 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c                      |    2 +
 mm/shmem.c                       |    2 +
 7 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..2db6103 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,8 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+pgfault		- # of page faults.
+pgmajfault	- # of major page faults.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +408,8 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_pgfault		- sum of all children's "pgfault"
+total_pgmajfault	- sum of all children's "pgmajfault"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index a7c07b4..e5d71b2 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -16,6 +16,7 @@
 #include <linux/mman.h>
 #include <linux/string.h>
 #include <linux/fcntl.h>
+#include <linux/memcontrol.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -92,6 +93,7 @@ static int ncp_file_mmap_fault(struct vm_area_struct *area,
 	 * -- wli
 	 */
 	count_vm_event(PGMAJFAULT);
+	mem_cgroup_count_vm_event(area->vm_mm, PGMAJFAULT);
 	return VM_FAULT_MAJOR;
 }
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..8a48f5b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+enum vm_event_item;
 
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
@@ -147,6 +148,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
 #endif
@@ -354,6 +356,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
 {
 }
 
+static inline
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..e022229 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1683,6 +1683,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
 		page = find_get_page(mapping, offset);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..8f9cf7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
+	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -585,6 +587,16 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_pgfault(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGFAULT], val);
+}
+
+void mem_cgroup_pgmajfault(struct mem_cgroup *mem, int val)
+{
+	this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT], val);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -813,6 +825,33 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
+{
+	struct mem_cgroup *mem;
+
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!mem))
+		goto out;
+
+	switch (idx) {
+	case PGMAJFAULT:
+		mem_cgroup_pgmajfault(mem, 1);
+		break;
+	case PGFAULT:
+		mem_cgroup_pgfault(mem, 1);
+		break;
+	default:
+		BUG();
+	}
+out:
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(mem_cgroup_count_vm_event);
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -3772,6 +3811,8 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_PGFAULT,
+	MCS_PGMAJFAULT,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3794,6 +3835,8 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"pgfault", "total_pgfault"},
+	{"pgmajfault", "total_pgmajfault"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3822,6 +3865,10 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGFAULT);
+	s->stat[MCS_PGFAULT] += val;
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
+	s->stat[MCS_PGMAJFAULT] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
diff --git a/mm/memory.c b/mm/memory.c
index 8617d39..28d19b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2836,6 +2836,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(mm, PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
@@ -3375,6 +3376,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
+	mem_cgroup_count_vm_event(mm, PGFAULT);
 
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
diff --git a/mm/shmem.c b/mm/shmem.c
index ad8346b..fa0b2b8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1289,6 +1289,8 @@ repeat:
 			/* here we actually do the io */
 			if (type && !(*type & VM_FAULT_MAJOR)) {
 				__count_vm_event(PGMAJFAULT);
+				mem_cgroup_count_vm_event(current->mm,
+							  PGMAJFAULT);
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-- 
1.7.3.1


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-29  6:16 Ying Han
@ 2011-03-29 21:36 ` Minchan Kim
  2011-03-30  2:47 ` Balbir Singh
  1 sibling, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2011-03-29 21:36 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

On Tue, Mar 29, 2011 at 3:16 PM, Ying Han <yinghan@google.com> wrote:
> Two new stats in per-memcg memory.stat which tracks the number of
> page faults and number of major page faults.
>
> "pgfault"
> "pgmajfault"
>
> They are different from "pgpgin"/"pgpgout" stat which count number of
> pages charged/discharged to the cgroup and have no meaning of reading/
> writing page to disk.
>
> It is valuable to track the two stats for both measuring application's
> performance as well as the efficiency of the kernel page reclaim path.
> Counting pagefaults per process is useful, but we also need the aggregated
> value since processes are monitored and controlled in cgroup basis in memcg.
>
> Functional test: check the total number of pgfault/pgmajfault of all
> memcgs and compare with global vmstat value:
>
> $ cat /proc/vmstat | grep fault
> pgfault 1070751
> pgmajfault 553
>
> $ cat /dev/cgroup/memory.stat | grep fault
> pgfault 1071138
> pgmajfault 553
> total_pgfault 1071142
> total_pgmajfault 553
>
> $ cat /dev/cgroup/A/memory.stat | grep fault
> pgfault 199
> pgmajfault 0
> total_pgfault 199
> total_pgmajfault 0
>
> Performance test: run the page fault test (pft) with 16 threads, faulting
> in 15G of anon pages in a 16G container. No regression was noticed in the
> "flt/cpu/s" metric.
>
> Sample output from pft:
> TAG pft:anon-sys-default:
>  Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
>  15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
>
> +-------------------------------------------------------------------------+
>    N           Min           Max        Median           Avg        Stddev
> x  10     16682.962     17344.027     16913.524     16928.812      166.5362
> +  10     16695.568     16923.896     16820.604     16824.652     84.816568
> No difference proven at 95.0% confidence
>
> Changes from v2 to v3:
> 1. removed the unnecessary function definition in memcontrol.h
>
> Signed-off-by: Ying Han <yinghan@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-29 17:32 [PATCH V3] Add the pagefault count into memcg stats Ying Han
@ 2011-03-30  1:16 ` KOSAKI Motohiro
  2011-03-30  1:37   ` Ying Han
  2011-03-31 20:01 ` Andrew Morton
  2011-04-13 20:12 ` David Rientjes
  2 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-03-30  1:16 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

Hi

Sorry, I didn't see the past discussion of this thread, so I may be
missing something.

> Two new stats in per-memcg memory.stat which tracks the number of
> page faults and number of major page faults.
> 
> "pgfault"
> "pgmajfault"
> 
> They are different from "pgpgin"/"pgpgout" stat which count number of
> pages charged/discharged to the cgroup and have no meaning of reading/
> writing page to disk.
> 
> It is valuable to track the two stats for both measuring application's
> performance as well as the efficiency of the kernel page reclaim path.
> Counting pagefaults per process is useful, but we also need the aggregated
> value since processes are monitored and controlled in cgroup basis in memcg.

Currently, the memory cgroup doesn't restrict the number of page faults,
and we already have this feature via CONFIG_CGROUP_PERF, if my
understanding is correct. Why not use the perf cgroup?

In other words, after your patch we have four page fault counters. Do we
really need *four*? Can't we consolidate them?

1. tsk->maj_flt
2. perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ)
3. count_vm_event(PGMAJFAULT);
4. mem_cgroup_count_vm_event(PGMAJFAULT);
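
For reference, this is roughly what the perf-cgroup approach looks like
from userspace (a sketch assuming CONFIG_CGROUP_PERF, a mounted
perf_event cgroup hierarchy, and an existing cgroup "A"; the command is
illustrative, not taken from this thread):

$ # count faults only for tasks in cgroup A, system-wide, for 10 seconds
$ perf stat -a -e page-faults,major-faults -G A,A -- sleep 10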






* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-30  1:16 ` KOSAKI Motohiro
@ 2011-03-30  1:37   ` Ying Han
  2011-03-30  1:54     ` KOSAKI Motohiro
  0 siblings, 1 reply; 12+ messages in thread
From: Ying Han @ 2011-03-30  1:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, Daisuke Nishimura, Balbir Singh, Tejun Heo,
	Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

On Tue, Mar 29, 2011 at 6:16 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
> sorry, I didn't see past discussion of this thread. then, I may be missing
> something.
>
>> Two new stats in per-memcg memory.stat which tracks the number of
>> page faults and number of major page faults.
>>
>> "pgfault"
>> "pgmajfault"
>>
>> They are different from "pgpgin"/"pgpgout" stat which count number of
>> pages charged/discharged to the cgroup and have no meaning of reading/
>> writing page to disk.
>>
>> It is valuable to track the two stats for both measuring application's
>> performance as well as the efficiency of the kernel page reclaim path.
>> Counting pagefaults per process is useful, but we also need the aggregated
>> value since processes are monitored and controlled in cgroup basis in memcg.
>
> Currently, memory cgroup don't restrict number of page fault. And we already have
> this feature by CONFIG_CGROUP_PERF if my understanding is correct. Why don't you
> use perf cgroup?
>
> In the other words, after your patch, we have four pagefault counter. Do we
> really need *four*? Can't we consolidate them?
>
> 1. tsk->maj_flt
> 2. perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ)
> 3. count_vm_event(PGMAJFAULT);
> 4. mem_cgroup_count_vm_event(PGMAJFAULT);

The first three are per-process and system-wide counters. What I did in
this patch is add per-memcg counters for pgfault and pgmajfault. The
purpose is not to do any limiting, but monitoring. I am not sure about
CONFIG_CGROUP_PERF; does it require CONFIG_PERF_EVENTS?

Thanks

--Ying


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-30  1:37   ` Ying Han
@ 2011-03-30  1:54     ` KOSAKI Motohiro
  0 siblings, 0 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-03-30  1:54 UTC (permalink / raw)
  To: Ying Han
  Cc: kosaki.motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

> > Currently, memory cgroup don't restrict number of page fault. And we already have
> > this feature by CONFIG_CGROUP_PERF if my understanding is correct. Why don't you
> > use perf cgroup?
> >
> > In the other words, after your patch, we have four pagefault counter. Do we
> > really need *four*? Can't we consolidate them?
> >
> > 1. tsk->maj_flt
> > 2. perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ)
> > 3. count_vm_event(PGMAJFAULT);
> > 4. mem_cgroup_count_vm_event(PGMAJFAULT);
> 
> The first three are per-process and per-system level counters. What I
> did in this patch is to add per-memcg counters for pgfault and
> pgmajfault. This purpose is not to do any limiting but monitoring. I
> am not sure about the CONFIG_CGROUP_PERF, does it require
> CONFIG_PERF_EVENTS?

Yes, a per-process counter can be naturally and easily enhanced into a
per-cgroup counter. I guess that's the background idea there.




* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-29  6:16 Ying Han
  2011-03-29 21:36 ` Minchan Kim
@ 2011-03-30  2:47 ` Balbir Singh
  1 sibling, 0 replies; 12+ messages in thread
From: Balbir Singh @ 2011-03-30  2:47 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
	Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

* Ying Han <yinghan@google.com> [2011-03-28 23:16:24]:

> Two new stats in per-memcg memory.stat which tracks the number of
> page faults and number of major page faults.
> 
> "pgfault"
> "pgmajfault"
> 
> They are different from "pgpgin"/"pgpgout" stat which count number of
> pages charged/discharged to the cgroup and have no meaning of reading/
> writing page to disk.
> 
> It is valuable to track the two stats for both measuring application's
> performance as well as the efficiency of the kernel page reclaim path.
> Counting pagefaults per process is useful, but we also need the aggregated
> value since processes are monitored and controlled in cgroup basis in memcg.
> 
> Functional test: check the total number of pgfault/pgmajfault of all
> memcgs and compare with global vmstat value:
>

Looks much better


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
 
-- 
	Three Cheers,
	Balbir


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-29 17:32 [PATCH V3] Add the pagefault count into memcg stats Ying Han
  2011-03-30  1:16 ` KOSAKI Motohiro
@ 2011-03-31 20:01 ` Andrew Morton
  2011-04-13 20:12 ` David Rientjes
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2011-03-31 20:01 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, linux-mm

On Tue, 29 Mar 2011 10:32:33 -0700
Ying Han <yinghan@google.com> wrote:

> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -24,6 +24,7 @@ struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> +enum vm_event_item;
>  
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_page_stat_item {
>
> ...
>
> @@ -354,6 +356,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head,
>  {
>  }
>  
> +static inline
> +void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
> +{
> +}

This doesn't compile for me - parameter idx has an incomplete type.

I've seen this before, and I suspect that some (later) versions of gcc
are happy with it, but older gccs are not.  It's a recurring problem
with enums - forward-declaring them doesn't work.

What I did was move the enum vm_event_item definition into a standalone
header file and then apply the appropriate fixups.  This was done as a
preparatory patch, preceding yours.




From: Andrew Morton <akpm@linux-foundation.org>

enums are problematic because they cannot be forward-declared:

akpm2:/home/akpm> cat t.c

enum foo;

static inline void bar(enum foo f)
{
}
akpm2:/home/akpm> gcc -c t.c
t.c:4: error: parameter 1 ('f') has incomplete type

So move the enum's definition into a standalone header file which can be used
wherever its definition is needed.  

Cc: Ying Han <yinghan@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vm_event_item.h |   64 ++++++++++++++++++++++++++++++++
 include/linux/vmstat.h        |   62 -------------------------------
 2 files changed, 65 insertions(+), 61 deletions(-)

diff -puN include/linux/vmstat.h~mm-move-enum-vm_event_item-into-a-standalone-header-file include/linux/vmstat.h
--- a/include/linux/vmstat.h~mm-move-enum-vm_event_item-into-a-standalone-header-file
+++ a/include/linux/vmstat.h
@@ -5,69 +5,9 @@
 #include <linux/percpu.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
+#include <linux/vm_event_item.h>
 #include <asm/atomic.h>
 
-#ifdef CONFIG_ZONE_DMA
-#define DMA_ZONE(xx) xx##_DMA,
-#else
-#define DMA_ZONE(xx)
-#endif
-
-#ifdef CONFIG_ZONE_DMA32
-#define DMA32_ZONE(xx) xx##_DMA32,
-#else
-#define DMA32_ZONE(xx)
-#endif
-
-#ifdef CONFIG_HIGHMEM
-#define HIGHMEM_ZONE(xx) , xx##_HIGH
-#else
-#define HIGHMEM_ZONE(xx)
-#endif
-
-
-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx) , xx##_MOVABLE
-
-enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
-		FOR_ALL_ZONES(PGALLOC),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
-		PGFAULT, PGMAJFAULT,
-		FOR_ALL_ZONES(PGREFILL),
-		FOR_ALL_ZONES(PGSTEAL),
-		FOR_ALL_ZONES(PGSCAN_KSWAPD),
-		FOR_ALL_ZONES(PGSCAN_DIRECT),
-#ifdef CONFIG_NUMA
-		PGSCAN_ZONE_RECLAIM_FAILED,
-#endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
-		KSWAPD_SKIP_CONGESTION_WAIT,
-		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
-#ifdef CONFIG_COMPACTION
-		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
-		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
-#endif
-#ifdef CONFIG_HUGETLB_PAGE
-		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
-#endif
-		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
-		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
-		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
-		UNEVICTABLE_PGMLOCKED,
-		UNEVICTABLE_PGMUNLOCKED,
-		UNEVICTABLE_PGCLEARED,	/* on COW, page truncate */
-		UNEVICTABLE_PGSTRANDED,	/* unable to isolate on unlock */
-		UNEVICTABLE_MLOCKFREED,
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		THP_FAULT_ALLOC,
-		THP_FAULT_FALLBACK,
-		THP_COLLAPSE_ALLOC,
-		THP_COLLAPSE_ALLOC_FAILED,
-		THP_SPLIT,
-#endif
-		NR_VM_EVENT_ITEMS
-};
-
 extern int sysctl_stat_interval;
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
diff -puN /dev/null include/linux/vm_event_item.h
--- /dev/null
+++ a/include/linux/vm_event_item.h
@@ -0,0 +1,64 @@
+#ifndef VM_EVENT_ITEM_H_INCLUDED
+#define VM_EVENT_ITEM_H_INCLUDED
+
+#ifdef CONFIG_ZONE_DMA
+#define DMA_ZONE(xx) xx##_DMA,
+#else
+#define DMA_ZONE(xx)
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+#define DMA32_ZONE(xx) xx##_DMA32,
+#else
+#define DMA32_ZONE(xx)
+#endif
+
+#ifdef CONFIG_HIGHMEM
+#define HIGHMEM_ZONE(xx) , xx##_HIGH
+#else
+#define HIGHMEM_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx) , xx##_MOVABLE
+
+enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
+		FOR_ALL_ZONES(PGALLOC),
+		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFAULT, PGMAJFAULT,
+		FOR_ALL_ZONES(PGREFILL),
+		FOR_ALL_ZONES(PGSTEAL),
+		FOR_ALL_ZONES(PGSCAN_KSWAPD),
+		FOR_ALL_ZONES(PGSCAN_DIRECT),
+#ifdef CONFIG_NUMA
+		PGSCAN_ZONE_RECLAIM_FAILED,
+#endif
+		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
+		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
+		KSWAPD_SKIP_CONGESTION_WAIT,
+		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_COMPACTION
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
+#endif
+		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
+		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
+		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
+		UNEVICTABLE_PGMLOCKED,
+		UNEVICTABLE_PGMUNLOCKED,
+		UNEVICTABLE_PGCLEARED,	/* on COW, page truncate */
+		UNEVICTABLE_PGSTRANDED,	/* unable to isolate on unlock */
+		UNEVICTABLE_MLOCKFREED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		THP_FAULT_ALLOC,
+		THP_FAULT_FALLBACK,
+		THP_COLLAPSE_ALLOC,
+		THP_COLLAPSE_ALLOC_FAILED,
+		THP_SPLIT,
+#endif
+		NR_VM_EVENT_ITEMS
+};
+
+#endif		/* VM_EVENT_ITEM_H_INCLUDED */
_


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-03-29 17:32 [PATCH V3] Add the pagefault count into memcg stats Ying Han
  2011-03-30  1:16 ` KOSAKI Motohiro
  2011-03-31 20:01 ` Andrew Morton
@ 2011-04-13 20:12 ` David Rientjes
  2011-04-13 23:52   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 12+ messages in thread
From: David Rientjes @ 2011-04-13 20:12 UTC (permalink / raw)
  To: Ying Han
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Mark Brown, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm

On Tue, 29 Mar 2011, Ying Han wrote:

> Two new stats in per-memcg memory.stat which tracks the number of
> page faults and number of major page faults.
> 
> "pgfault"
> "pgmajfault"
> 
> They are different from "pgpgin"/"pgpgout" stat which count number of
> pages charged/discharged to the cgroup and have no meaning of reading/
> writing page to disk.
> 
> It is valuable to track the two stats for both measuring application's
> performance as well as the efficiency of the kernel page reclaim path.
> Counting pagefaults per process is useful, but we also need the aggregated
> value since processes are monitored and controlled in cgroup basis in memcg.
> 
> Functional test: check the total number of pgfault/pgmajfault of all
> memcgs and compare with global vmstat value:
> 
> $ cat /proc/vmstat | grep fault
> pgfault 1070751
> pgmajfault 553
> 
> $ cat /dev/cgroup/memory.stat | grep fault
> pgfault 1071138
> pgmajfault 553
> total_pgfault 1071142
> total_pgmajfault 553
> 
> $ cat /dev/cgroup/A/memory.stat | grep fault
> pgfault 199
> pgmajfault 0
> total_pgfault 199
> total_pgmajfault 0
> 
> Performance test: run the page fault test (pft) with 16 threads, faulting
> in 15G of anon pages in a 16G container. No regression was noticed in the
> "flt/cpu/s" metric.
> 
> Sample output from pft:
> TAG pft:anon-sys-default:
>   Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
>   15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
> 
> +-------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10     16682.962     17344.027     16913.524     16928.812      166.5362
> +  10     16695.568     16923.896     16820.604     16824.652     84.816568
> No difference proven at 95.0% confidence
> 
> Changes from v2 to v3:
> 1. removed the unnecessary function definition in memcontrol.h
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I'm wondering if we can just modify count_vm_event() directly for 
CONFIG_CGROUP_MEM_RES_CTLR so that we automatically track all vmstat items 
(those in enum vm_event_item) for each memcg.  We could add an array of 
NR_VM_EVENT_ITEMS into each struct mem_cgroup to be incremented on 
count_vm_event() for current's memcg.

If that's done, we wouldn't have to add additional calls for every vmstat 
item we want to duplicate from the global counters.
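
Concretely, the hook could look something like this (a sketch only; the
config guard and call placement are illustrative, not an actual patch):

	/* include/linux/vmstat.h */
	static inline void count_vm_event(enum vm_event_item item)
	{
		this_cpu_inc(vm_event_states.event[item]);
	#ifdef CONFIG_CGROUP_MEM_RES_CTLR
		/* also account the event to current's memcg */
		mem_cgroup_count_vm_event(current->mm, item);
	#endif
	}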


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-04-13 20:12 ` David Rientjes
@ 2011-04-13 23:52   ` KAMEZAWA Hiroyuki
  2011-04-14  0:47     ` David Rientjes
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-13 23:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Mark Brown, Andrew Morton, linux-mm

On Wed, 13 Apr 2011 13:12:33 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 29 Mar 2011, Ying Han wrote:
> 
> > Two new stats in per-memcg memory.stat which tracks the number of
> > page faults and number of major page faults.
> > 
> > "pgfault"
> > "pgmajfault"
> > 
> > They are different from "pgpgin"/"pgpgout" stat which count number of
> > pages charged/discharged to the cgroup and have no meaning of reading/
> > writing page to disk.
> > 
> > It is valuable to track the two stats for both measuring application's
> > performance as well as the efficiency of the kernel page reclaim path.
> > Counting pagefaults per process is useful, but we also need the aggregated
> > value since processes are monitored and controlled in cgroup basis in memcg.
> > 
> > Functional test: check the total number of pgfault/pgmajfault of all
> > memcgs and compare with global vmstat value:
> > 
> > $ cat /proc/vmstat | grep fault
> > pgfault 1070751
> > pgmajfault 553
> > 
> > $ cat /dev/cgroup/memory.stat | grep fault
> > pgfault 1071138
> > pgmajfault 553
> > total_pgfault 1071142
> > total_pgmajfault 553
> > 
> > $ cat /dev/cgroup/A/memory.stat | grep fault
> > pgfault 199
> > pgmajfault 0
> > total_pgfault 199
> > total_pgmajfault 0
> > 
> > Performance test: run the page fault test (pft) with 16 threads, faulting
> > in 15G of anon pages in a 16G container. No regression was noticed in the
> > "flt/cpu/s" metric.
> > 
> > Sample output from pft:
> > TAG pft:anon-sys-default:
> >   Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
> >   15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
> > 
> > +-------------------------------------------------------------------------+
> >     N           Min           Max        Median           Avg        Stddev
> > x  10     16682.962     17344.027     16913.524     16928.812      166.5362
> > +  10     16695.568     16923.896     16820.604     16824.652     84.816568
> > No difference proven at 95.0% confidence
> > 
> > Changes from v2 to v3:
> > 1. removed the unnecessary function definition in memcontrol.h
> > 
> > Signed-off-by: Ying Han <yinghan@google.com>
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> I'm wondering if we can just modify count_vm_event() directly for 
> CONFIG_CGROUP_MEM_RES_CTLR so that we automatically track all vmstat items 
> (those in enum vm_event_item) for each memcg.  We could add an array of 
> NR_VM_EVENT_ITEMS into each struct mem_cgroup to be incremented on 
> count_vm_event() for current's memcg.
> 
> If that's done, we wouldn't have to add additional calls for every vmstat 
> item we want to duplicate from the global counters.
> 

Maybe we will do that eventually.

For now, IIUC, over 50% of VM_EVENTS are needless for memcg (e.g. the
per-zone stats), and such an array consumes a large amount of percpu
area. I think we need to select events carefully even if we do that.
Also, memcg's current percpu stat is a mixture of vm_events and vm_stat;
we may need to sort them out and re-design it. My concern is that I'm not
sure we have enough percpu area for vmstat+vmevents for 1000+ memcgs, or
whether that is acceptable even if we can do it.

But yes, it seems worth considering.
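
As a rough back-of-envelope (assuming ~60 event items of 8 bytes each): a
full percpu event array costs about 480 bytes per CPU per memcg, i.e.
roughly 30KB per memcg on a 64-CPU machine, or ~30MB across 1000 memcgs.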

Thanks,
-Kame



* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-04-13 23:52   ` KAMEZAWA Hiroyuki
@ 2011-04-14  0:47     ` David Rientjes
  2011-04-14  1:18       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: David Rientjes @ 2011-04-14  0:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Mark Brown, Andrew Morton, linux-mm

On Thu, 14 Apr 2011, KAMEZAWA Hiroyuki wrote:

> > I'm wondering if we can just modify count_vm_event() directly for 
> > CONFIG_CGROUP_MEM_RES_CTLR so that we automatically track all vmstat items 
> > (those in enum vm_event_item) for each memcg.  We could add an array of 
> > NR_VM_EVENT_ITEMS into each struct mem_cgroup to be incremented on 
> > count_vm_event() for current's memcg.
> > 
> > If that's done, we wouldn't have to add additional calls for every vmstat 
> > item we want to duplicate from the global counters.
> > 
> 
> Maybe we do that finally.
> 
> For now, IIUC, over 50% of VM_EVENTS are needless for memcg (ex. per zone stats)
> and this array consumes large size of percpu area. I think we need to select
> events carefully even if we do that. And current memcg's percpu stat is mixture
> of vm_events and vm_stat. We may need to sort out them and re-design it.
> My concern is that I'm not sure we have enough percpu area for vmstat+vmevents
> for 1000+ memcg, and it's allowed even if we can do.
> 

What I proposed above was adding an array directly into struct mem_cgroup 
so that we don't collect the stats percpu, they are incremented directly 
in the mem_cgroup.  Perhaps if we separated enum vm_event_item out into 
two separate arrays (those useful only globally and those useful for both 
global and memcg), then this would be simple.

Something like

	enum vm_event_item {
		PGPGIN,
		PGPGOUT,
		PSWPIN,
		PSWPOUT,
		...
		NR_VM_EVENT_ITEMS,
	};

	enum vm_global_event_item {
		KSWAPD_STEAL = NR_VM_EVENT_ITEMS,
		KSWAPD_INODESTEAL,
		...
	};

and then in count_vm_event(), check

	if (item < NR_VM_EVENT_ITEMS) {
		memcg_add_vm_event(mem, item, count);
	}

I don't think we need to be concerned about reordering the global 
/proc/vmstat to fit this purpose.
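
As a concrete sketch of the memcg side under that scheme (the names are
hypothetical, for illustration only, not an actual patch):

	struct mem_cgroup {
		...
		/* one counter per consolidated event, bumped directly */
		atomic_long_t vm_events[NR_VM_EVENT_ITEMS];
	};

	static void memcg_add_vm_event(struct mem_cgroup *mem,
				       enum vm_event_item item, long count)
	{
		/*
		 * Not percpu: trades some cache-line contention for a
		 * much smaller footprint across 1000+ memcgs.
		 */
		atomic_long_add(count, &mem->vm_events[item]);
	}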


* Re: [PATCH V3] Add the pagefault count into memcg stats
  2011-04-14  0:47     ` David Rientjes
@ 2011-04-14  1:18       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-04-14  1:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ying Han, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
	Balbir Singh, Tejun Heo, Mark Brown, Andrew Morton, linux-mm

On Wed, 13 Apr 2011 17:47:01 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 14 Apr 2011, KAMEZAWA Hiroyuki wrote:
> 
> > > I'm wondering if we can just modify count_vm_event() directly for 
> > > CONFIG_CGROUP_MEM_RES_CTLR so that we automatically track all vmstat items 
> > > (those in enum vm_event_item) for each memcg.  We could add an array of 
> > > NR_VM_EVENT_ITEMS into each struct mem_cgroup to be incremented on 
> > > count_vm_event() for current's memcg.
> > > 
> > > If that's done, we wouldn't have to add additional calls for every vmstat 
> > > item we want to duplicate from the global counters.
> > > 
> > 
> > Maybe we do that finally.
> > 
> > For now, IIUC, over 50% of VM_EVENTS are needless for memcg (ex. per zone stats)
> > and this array consumes large size of percpu area. I think we need to select
> > events carefully even if we do that. And current memcg's percpu stat is mixture
> > of vm_events and vm_stat. We may need to sort out them and re-design it.
> > My concern is that I'm not sure we have enough percpu area for vmstat+vmevents
> > for 1000+ memcg, and it's allowed even if we can do.
> > 
> 
> What I proposed above was adding an array directly into struct mem_cgroup 
> so that we don't collect the stats percpu, they are incremented directly 
> in the mem_cgroup.  Perhaps if we separated enum vm_event_item out into 
> two separate arrays (those useful only globally and those useful for both 
> global and memcg), then this would be simple.
> 
> Something like
> 
> 	enum vm_event_item {
> 		PGPGIN,
> 		PGPGOUT,
> 		PSWPIN,
> 		PSWPOUT,
> 		...
> 		NR_VM_EVENT_ITEMS,
> 	};
> 
> 	enum vm_global_event_item {
> 		KSWAPD_STEAL = NR_VM_EVENT_ITEMS,
> 		KSWAPD_INODESTEAL,
> 		...
> 	};
> 
> and then in count_vm_event(), check
> 
> 	if (item < NR_VM_EVENT_ITEMS) {
> 		memcg_add_vm_event(mem, item, count);
> 	}
> 
> I don't think we need to be concerned about reordering the global 
> /proc/vmstat to fit this purpose.
> 
Hmm, ok. will try that.

Thanks,
-Kame




