* [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
@ 2012-03-01 9:16 Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code Aneesh Kumar K.V
` (10 more replies)
0 siblings, 11 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups
Hi,
This patchset implements a memory controller extension to control
HugeTLB allocations. It is similar to the existing hugetlb quota
support in that the limit is enforced at mmap(2) time and not at
fault time. HugeTLB's quota mechanism limits the number of huge pages
that can be allocated per superblock.
For shared mappings we track the regions mapped by a task along with the
memcg. We keep the memory controller charged even after the task
that did mmap(2) exits; the uncharge happens during truncate. For private
mappings we charge and uncharge the current task's cgroup.
Sample strace output from an application doing malloc(3) under hugectl is given
below; libhugetlbfs falls back to the normal page size if the HugeTLB mmap fails.
open("/mnt/libhugetlbfs.tmp.uhLMgy", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlink("/mnt/libhugetlbfs.tmp.uhLMgy") = 0
.........
mmap(0x20000000000, 50331648, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "libhugetlbfs", 12libhugetlbfs) = 12
write(2, ": WARNING: New heap segment map" ....
mmap(NULL, 42008576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfff946c0000
....
Goals:
1) We want to keep the semantics close to hugetlb quota support, i.e. we want
to extend the quota semantics to a group of tasks. Currently the hugetlb quota
mechanism allows one to control the number of hugetlb pages allocated per
hugetlbfs superblock.
2) Applications using hugetlbfs always fall back to normal page size allocation when they
fail to allocate huge pages. libhugetlbfs handles this internally for malloc(3). We
want to retain this behaviour when we enforce the controller limit, i.e. when huge page
allocation fails due to the controller limit, applications should fall back to
allocation using the normal page size. This implies that we need to enforce the
limit at mmap(2) time.
3) HugeTLBfs supports neither page reclaim nor write(2); applications
use hugetlbfs via the mmap(2) interface. An important point to note here is that hugetlbfs
extends the file size in mmap.
With shared mappings, the file size is extended in mmap and the file will remain in hugetlbfs,
consuming huge pages, until it is truncated. We want to make sure we keep the controller
charged until the file is truncated. This implies that the controller will remain charged
even after the task that did the mmap exits.
Implementation details:
In order to achieve the above goals we need to track the cgroup information
along with the mmap range in a charge list: in the inode for shared mappings and in
the vm_area_struct for private mappings. We don't use struct page to track the cgroup
information because, with the above goals, we are not really tracking the pages used.
Since we track the cgroup in the charge list, removing a cgroup would require updating
the charge list to point to the parent cgroup. Currently we take the easy route
and prevent cgroup removal if its non-reclaim resource usage is non-zero.
Changes from V1:
* Changed the implementation as a memcg extension. We still use
the same logic to track the cgroup and range.
Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes
-aneesh
* [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 22:33 ` Andrew Morton
2012-03-01 9:16 ` [PATCH -V2 2/9] mm: Update region function to take new data arg Aneesh Kumar K.V
` (9 subsequent siblings)
10 siblings, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This patch moves the hugetlbfs region tracking functions to
common code. We will be using them in later patches in the
series.
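For reference, a minimal sketch of the calling pattern these helpers expect
(mirroring the existing hugetlb usage; 'from' and 'to' are page offsets and
error handling is trimmed):

	LIST_HEAD(regions);
	long chg;

	/* How many new pages would reserving [from, to) add to the map? */
	chg = region_chg(&regions, from, to);
	if (chg < 0)
		return chg;		/* -ENOMEM */
	/* ... account/charge 'chg' pages ... */
	/* Record the range once the accounting has succeeded. */
	region_add(&regions, from, to);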
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
include/linux/region.h | 28 +++++++++
mm/Makefile | 2 +-
mm/region.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 187 insertions(+), 1 deletions(-)
create mode 100644 include/linux/region.h
create mode 100644 mm/region.c
diff --git a/include/linux/region.h b/include/linux/region.h
new file mode 100644
index 0000000..a8a5b46
--- /dev/null
+++ b/include/linux/region.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_REGION_H
+#define _LINUX_REGION_H
+
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+extern long region_add(struct list_head *head, long from, long to);
+extern long region_chg(struct list_head *head, long from, long to);
+extern long region_truncate(struct list_head *head, long end);
+extern long region_count(struct list_head *head, long from, long to);
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..8828a1b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -14,7 +14,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
$(mmu-y)
-obj-y += init-mm.o
+obj-y += init-mm.o region.o
ifdef CONFIG_NO_BOOTMEM
obj-y += nobootmem.o
diff --git a/mm/region.c b/mm/region.c
new file mode 100644
index 0000000..ab59fe7
--- /dev/null
+++ b/mm/region.c
@@ -0,0 +1,158 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/list.h>
+#include <linux/region.h>
+
+long region_add(struct list_head *head, long from, long to)
+{
+ struct file_region *rg, *nrg, *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (from <= rg->to)
+ break;
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (from > rg->from)
+ from = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > to)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > to)
+ to = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = from;
+ nrg->to = to;
+ return 0;
+}
+
+long region_chg(struct list_head *head, long from, long to)
+{
+ struct file_region *rg, *nrg;
+ long chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (from <= rg->to)
+ break;
+
+ /* If we are below the current region then a new region is required.
+ * Subtle, allocate a new region at the position but make it zero
+ * size such that we can guarantee to record the reservation. */
+ if (&rg->link == head || to < rg->from) {
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = from;
+ nrg->to = from;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+
+ return to - from;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (from > rg->from)
+ from = rg->from;
+ chg = to - from;
+
+ /* Check for and consume any regions we now overlap with. */
+ list_for_each_entry(rg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > to)
+ return chg;
+
+ /* We overlap with this area, if it extends further than
+ * us then we must extend ourselves. Account for its
+ * existing reservation. */
+ if (rg->to > to) {
+ chg += rg->to - to;
+ to = rg->to;
+ }
+ chg -= rg->to - rg->from;
+ }
+ return chg;
+}
+
+long region_truncate(struct list_head *head, long end)
+{
+ struct file_region *rg, *trg;
+ long chg = 0;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (end <= rg->to)
+ break;
+ if (&rg->link == head)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (end > rg->from) {
+ chg = rg->to - end;
+ rg->to = end;
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+
+ /* Drop any remaining regions. */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ chg += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+
+long region_count(struct list_head *head, long from, long to)
+{
+ struct file_region *rg;
+ long chg = 0;
+
+ /* Locate each segment we overlap with, and count that overlap. */
+ list_for_each_entry(rg, head, link) {
+ int seg_from;
+ int seg_to;
+
+ if (rg->to <= from)
+ continue;
+ if (rg->from >= to)
+ break;
+
+ seg_from = max(rg->from, from);
+ seg_to = min(rg->to, to);
+
+ chg += seg_to - seg_from;
+ }
+
+ return chg;
+}
--
1.7.9
* [PATCH -V2 2/9] mm: Update region function to take new data arg
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 3/9] hugetlbfs: Use the generic region API and drop local one Aneesh Kumar K.V
` (8 subsequent siblings)
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This patch adds a new data argument to the region tracking functions.
region_chg() will merge regions only if the data arguments match;
otherwise it will create a new region to map the range.
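As an illustration (not part of the patch; memcg_a and memcg_b are hypothetical
pointers used as the data cookie), an overlapping range charged with a different
data value only accounts the part that is not already covered:

	LIST_HEAD(regions);
	long chg;

	chg = region_chg(&regions, 0, 10, (unsigned long)memcg_a);  /* chg == 10 */
	region_add(&regions, 0, 10, (unsigned long)memcg_a);        /* [0,10) -> memcg_a */

	chg = region_chg(&regions, 5, 15, (unsigned long)memcg_b);  /* chg == 5  */
	region_add(&regions, 5, 15, (unsigned long)memcg_b);        /* adds [10,15) -> memcg_b */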
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
include/linux/region.h | 20 ++++--
mm/region.c | 177 ++++++++++++++++++++++++++++++++----------------
2 files changed, 132 insertions(+), 65 deletions(-)
diff --git a/include/linux/region.h b/include/linux/region.h
index a8a5b46..609e24c 100644
--- a/include/linux/region.h
+++ b/include/linux/region.h
@@ -16,13 +16,21 @@
#define _LINUX_REGION_H
struct file_region {
+ unsigned long from, to;
+ unsigned long data;
struct list_head link;
- long from;
- long to;
};
-extern long region_add(struct list_head *head, long from, long to);
-extern long region_chg(struct list_head *head, long from, long to);
-extern long region_truncate(struct list_head *head, long end);
-extern long region_count(struct list_head *head, long from, long to);
+extern long region_chg(struct list_head *head, unsigned long from,
+ unsigned long to, unsigned long data);
+extern void region_add(struct list_head *head, unsigned long from,
+ unsigned long to, unsigned long data);
+extern long region_truncate_range(struct list_head *head, unsigned long from,
+ unsigned long end);
+static inline long region_truncate(struct list_head *head, unsigned long from)
+{
+ return region_truncate_range(head, from, ULONG_MAX);
+}
+extern long region_count(struct list_head *head, unsigned long from,
+ unsigned long to);
#endif
diff --git a/mm/region.c b/mm/region.c
index ab59fe7..e547631 100644
--- a/mm/region.c
+++ b/mm/region.c
@@ -18,66 +18,46 @@
#include <linux/list.h>
#include <linux/region.h>
-long region_add(struct list_head *head, long from, long to)
+long region_chg(struct list_head *head, unsigned long from,
+ unsigned long to, unsigned long data)
{
- struct file_region *rg, *nrg, *trg;
-
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (from <= rg->to)
- break;
-
- /* Round our left edge to the current segment if it encloses us. */
- if (from > rg->from)
- from = rg->from;
-
- /* Check for and consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- if (rg->from > to)
- break;
-
- /* If this area reaches higher then extend our area to
- * include it completely. If this is not the first area
- * which we intend to reuse, free it. */
- if (rg->to > to)
- to = rg->to;
- if (rg != nrg) {
- list_del(&rg->link);
- kfree(rg);
- }
- }
- nrg->from = from;
- nrg->to = to;
- return 0;
-}
-
-long region_chg(struct list_head *head, long from, long to)
-{
- struct file_region *rg, *nrg;
long chg = 0;
+ struct file_region *rg, *nrg, *trg;
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (from <= rg->to)
break;
-
- /* If we are below the current region then a new region is required.
+ /*
+ * If we are below the current region then a new region is required.
* Subtle, allocate a new region at the position but make it zero
- * size such that we can guarantee to record the reservation. */
+ * size such that we can guarantee to record the reservation.
+ */
if (&rg->link == head || to < rg->from) {
nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
if (!nrg)
return -ENOMEM;
nrg->from = from;
- nrg->to = from;
+ nrg->to = from;
+ nrg->data = data;
INIT_LIST_HEAD(&nrg->link);
list_add(&nrg->link, rg->link.prev);
-
return to - from;
}
+ /*
+ * from rg->from to rg->to
+ */
+ if (from < rg->from && data != rg->data) {
+ /* we need to allocate a new region */
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = from;
+ nrg->to = from;
+ nrg->data = data;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+ }
/* Round our left edge to the current segment if it encloses us. */
if (from > rg->from)
@@ -85,15 +65,28 @@ long region_chg(struct list_head *head, long from, long to)
chg = to - from;
/* Check for and consume any regions we now overlap with. */
- list_for_each_entry(rg, rg->link.prev, link) {
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
if (rg->from > to)
return chg;
-
- /* We overlap with this area, if it extends further than
- * us then we must extend ourselves. Account for its
- * existing reservation. */
+ /*
+ * rg->from from rg->to to
+ */
+ if (to > rg->to && data != rg->data) {
+ /* we need to allocate a new region */
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = rg->to;
+ nrg->to = rg->to;
+ nrg->data = data;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, &rg->link);
+ }
+ /*
+ * update charge
+ */
if (rg->to > to) {
chg += rg->to - to;
to = rg->to;
@@ -103,29 +96,96 @@ long region_chg(struct list_head *head, long from, long to)
return chg;
}
-long region_truncate(struct list_head *head, long end)
+void region_add(struct list_head *head, unsigned long from,
+ unsigned long to, unsigned long data)
+{
+ struct file_region *rg, *nrg, *trg;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (from <= rg->to)
+ break;
+
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+
+ if (rg->from > to)
+ return;
+ if (&rg->link == head)
+ return;
+
+ /*FIXME!! this can possibly delete few regions */
+ /* We need to worry only if we match data */
+ if (rg->data == data) {
+ if (from < rg->from)
+ rg->from = from;
+ if (to > rg->to) {
+ /* if we are the last entry */
+ if (rg->link.next == head) {
+ rg->to = to;
+ break;
+ } else {
+ nrg = list_entry(rg->link.next,
+ typeof(*nrg), link);
+ rg->to = nrg->from;
+ }
+ }
+ }
+ from = rg->to;
+ }
+}
+
+long region_truncate_range(struct list_head *head, unsigned long from,
+ unsigned long to)
{
- struct file_region *rg, *trg;
long chg = 0;
+ struct file_region *rg, *trg;
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
- if (end <= rg->to)
+ if (from <= rg->to)
break;
if (&rg->link == head)
return 0;
/* If we are in the middle of a region then adjust it. */
- if (end > rg->from) {
- chg = rg->to - end;
- rg->to = end;
+ if (from > rg->from) {
+ if (to < rg->to) {
+ struct file_region *nrg;
+ /* rf->from from to rg->to */
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ /*
+ * If we fail to allocate, we return
+ * with a 0 charge. A later complete
+ * truncate will reclaim the leftover space.
+ */
+ if (!nrg)
+ return 0;
+ nrg->from = to;
+ nrg->to = rg->to;
+ nrg->data = rg->data;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, &rg->link);
+
+ /* Adjust the rg entry */
+ rg->to = from;
+ chg = to - from;
+ return chg;
+ }
+ chg = rg->to - from;
+ rg->to = from;
rg = list_entry(rg->link.next, typeof(*rg), link);
}
-
- /* Drop any remaining regions. */
+ /* Drop any remaining regions till to */
list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (rg->from >= to)
+ break;
if (&rg->link == head)
break;
+ if (rg->to > to) {
+ chg += to - rg->from;
+ rg->from = to;
+ return chg;
+ }
chg += rg->to - rg->from;
list_del(&rg->link);
kfree(rg);
@@ -133,10 +193,10 @@ long region_truncate(struct list_head *head, long end)
return chg;
}
-long region_count(struct list_head *head, long from, long to)
+long region_count(struct list_head *head, unsigned long from, unsigned long to)
{
- struct file_region *rg;
long chg = 0;
+ struct file_region *rg;
/* Locate each segment we overlap with, and count that overlap. */
list_for_each_entry(rg, head, link) {
@@ -153,6 +213,5 @@ long region_count(struct list_head *head, long from, long to)
chg += seg_to - seg_from;
}
-
return chg;
}
--
1.7.9
* [PATCH -V2 3/9] hugetlbfs: Use the generic region API and drop local one
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 2/9] mm: Update region function to take new data arg Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg Aneesh Kumar K.V
` (7 subsequent siblings)
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Use the newly added generic region functions and drop the hugetlb-local copies.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
mm/hugetlb.c | 160 ++++------------------------------------------------------
1 files changed, 10 insertions(+), 150 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f34bd8..9fd6d38 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/region.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -66,151 +67,10 @@ static DEFINE_SPINLOCK(hugetlb_lock);
* or
* down_read(&mm->mmap_sem);
* mutex_lock(&hugetlb_instantiation_mutex);
+ * shared mapping regions are tracked in inode->i_mapping and
+ * private mapping regions in vm_area_struct
+ *
*/
-struct file_region {
- struct list_head link;
- long from;
- long to;
-};
-
-static long region_add(struct list_head *head, long f, long t)
-{
- struct file_region *rg, *nrg, *trg;
-
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
-
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
- f = rg->from;
-
- /* Check for and consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- if (rg->from > t)
- break;
-
- /* If this area reaches higher then extend our area to
- * include it completely. If this is not the first area
- * which we intend to reuse, free it. */
- if (rg->to > t)
- t = rg->to;
- if (rg != nrg) {
- list_del(&rg->link);
- kfree(rg);
- }
- }
- nrg->from = f;
- nrg->to = t;
- return 0;
-}
-
-static long region_chg(struct list_head *head, long f, long t)
-{
- struct file_region *rg, *nrg;
- long chg = 0;
-
- /* Locate the region we are before or in. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
-
- /* If we are below the current region then a new region is required.
- * Subtle, allocate a new region at the position but make it zero
- * size such that we can guarantee to record the reservation. */
- if (&rg->link == head || t < rg->from) {
- nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
- if (!nrg)
- return -ENOMEM;
- nrg->from = f;
- nrg->to = f;
- INIT_LIST_HEAD(&nrg->link);
- list_add(&nrg->link, rg->link.prev);
-
- return t - f;
- }
-
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
- f = rg->from;
- chg = t - f;
-
- /* Check for and consume any regions we now overlap with. */
- list_for_each_entry(rg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- if (rg->from > t)
- return chg;
-
- /* We overlap with this area, if it extends further than
- * us then we must extend ourselves. Account for its
- * existing reservation. */
- if (rg->to > t) {
- chg += rg->to - t;
- t = rg->to;
- }
- chg -= rg->to - rg->from;
- }
- return chg;
-}
-
-static long region_truncate(struct list_head *head, long end)
-{
- struct file_region *rg, *trg;
- long chg = 0;
-
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (end <= rg->to)
- break;
- if (&rg->link == head)
- return 0;
-
- /* If we are in the middle of a region then adjust it. */
- if (end > rg->from) {
- chg = rg->to - end;
- rg->to = end;
- rg = list_entry(rg->link.next, typeof(*rg), link);
- }
-
- /* Drop any remaining regions. */
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
- chg += rg->to - rg->from;
- list_del(&rg->link);
- kfree(rg);
- }
- return chg;
-}
-
-static long region_count(struct list_head *head, long f, long t)
-{
- struct file_region *rg;
- long chg = 0;
-
- /* Locate each segment we overlap with, and count that overlap. */
- list_for_each_entry(rg, head, link) {
- int seg_from;
- int seg_to;
-
- if (rg->to <= f)
- continue;
- if (rg->from >= t)
- break;
-
- seg_from = max(rg->from, f);
- seg_to = min(rg->to, t);
-
- chg += seg_to - seg_from;
- }
-
- return chg;
-}
/*
* Convert the address within this vma to the page offset within
@@ -981,7 +841,7 @@ static long vma_needs_reservation(struct hstate *h,
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
return region_chg(&inode->i_mapping->private_list,
- idx, idx + 1);
+ idx, idx + 1, 0);
} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -991,7 +851,7 @@ static long vma_needs_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *reservations = vma_resv_map(vma);
- err = region_chg(&reservations->regions, idx, idx + 1);
+ err = region_chg(&reservations->regions, idx, idx + 1, 0);
if (err < 0)
return err;
return 0;
@@ -1005,14 +865,14 @@ static void vma_commit_reservation(struct hstate *h,
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- region_add(&inode->i_mapping->private_list, idx, idx + 1);
+ region_add(&inode->i_mapping->private_list, idx, idx + 1, 0);
} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *reservations = vma_resv_map(vma);
/* Mark this page used in the map. */
- region_add(&reservations->regions, idx, idx + 1);
+ region_add(&reservations->regions, idx, idx + 1, 0);
}
}
@@ -2885,7 +2745,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- chg = region_chg(&inode->i_mapping->private_list, from, to);
+ chg = region_chg(&inode->i_mapping->private_list, from, to, 0);
else {
struct resv_map *resv_map = resv_map_alloc();
if (!resv_map)
@@ -2926,7 +2786,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&inode->i_mapping->private_list, from, to);
+ region_add(&inode->i_mapping->private_list, from, to, 0);
return 0;
}
--
1.7.9
* [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (2 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 3/9] hugetlbfs: Use the generic region API and drop local one Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-02 8:38 ` KAMEZAWA Hiroyuki
2012-03-01 9:16 ` [PATCH -V2 5/9] hugetlbfs: Add memory controller support for shared mapping Aneesh Kumar K.V
` (6 subsequent siblings)
10 siblings, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Non-reclaim resources include hugetlb pages and ramfs pages.
Both of these file systems are memory based and don't support
page reclaim, so enforcing the memory controller limit during actual
page allocation doesn't make sense for them. Instead we enforce
the limit during mmap and keep track of the mmap range,
along with the memcg information, in a charge list.
We can have multiple non-reclaim resources that we want to track
independently, such as huge pages of different huge page sizes.
The current code doesn't allow removal of a memcg if it has any
non-reclaim resource charged against it.
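A sketch of how a caller (hugetlbfs is wired up in the following patches) is
expected to use the new interface; 'chg_list' is the per-inode or per-vma charge
list and 'idx' selects one of the MEMCG_MAX_NORECLAIM counters:

	long chg;

	/* At mmap(2) time: charge the range against the task's memcg. */
	chg = mem_cgroup_try_noreclaim_charge(chg_list, from, to, idx);
	if (chg < 0)
		return chg;	/* over the limit or -ENOMEM; caller falls back */
	/* ... reserve the underlying pages ... */
	/* Record [from, to) along with the memcg in the charge list. */
	mem_cgroup_commit_noreclaim_charge(chg_list, from, to);

	/* Later, on truncate: uncharge everything from 'offset' onwards. */
	mem_cgroup_truncate_chglist_range(chg_list, offset, ULONG_MAX, idx);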
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
include/linux/memcontrol.h | 11 +++
init/Kconfig | 11 +++
mm/memcontrol.c | 198 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 219 insertions(+), 1 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d34356..59d93ee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -171,6 +171,17 @@ void mem_cgroup_split_huge_fixup(struct page *head);
bool mem_cgroup_bad_page_check(struct page *page);
void mem_cgroup_print_bad_page(struct page *page);
#endif
+extern long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
+ unsigned long from,
+ unsigned long to, int idx);
+extern void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
+ int idx, unsigned long nr_pages);
+extern void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
+ unsigned long from,
+ unsigned long to);
+extern long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
+ unsigned long from,
+ unsigned long to, int idx);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..c4306f7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -673,6 +673,17 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
+config MEM_RES_CTLR_NORECLAIM
+ bool "Memory Resource Controller non reclaim Extension"
+ depends on CGROUP_MEM_RES_CTLR
+ help
+ Add non-reclaim resource management to the memory resource controller.
+ Currently only HugeTLB pages will be managed using this extension.
+ The controller limit is enforced during mmap(2), so that
+ applications can fall back to allocations using a smaller page size
+ if the memory controller limit prevents them from allocating HugeTLB
+ pages.
+
config CGROUP_MEM_RES_CTLR_SWAP
bool "Memory Resource Controller Swap Extension"
depends on CGROUP_MEM_RES_CTLR && SWAP
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6728a7a..b00d028 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/region.h>
#include "internal.h"
#include <net/sock.h>
#include <net/tcp_memcontrol.h>
@@ -214,6 +215,11 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg);
static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
/*
+ * Currently only hugetlbfs pages are tracked using no reclaim
+ * resource count. So we need only MAX_HSTATE res counter
+ */
+#define MEMCG_MAX_NORECLAIM HUGE_MAX_HSTATE
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -235,6 +241,11 @@ struct mem_cgroup {
*/
struct res_counter memsw;
/*
+ * the counter to account for non reclaim resources
+ * like hugetlb pages
+ */
+ struct res_counter no_rcl_res[MEMCG_MAX_NORECLAIM];
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -4887,6 +4898,7 @@ err_cleanup:
static struct cgroup_subsys_state * __ref
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
+ int idx;
struct mem_cgroup *memcg, *parent;
long error = -ENOMEM;
int node;
@@ -4922,6 +4934,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
if (parent && parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
+ for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
+ res_counter_init(&memcg->no_rcl_res[idx],
+ &parent->no_rcl_res[idx]);
+ }
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -4932,6 +4948,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
+ for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++)
+ res_counter_init(&memcg->no_rcl_res[idx], NULL);
}
memcg->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&memcg->oom_notify);
@@ -4950,8 +4968,22 @@ free_out:
static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
+ int idx;
+ u64 val;
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-
+ /*
+ * We don't allow cgroup deletion if the cgroup has some
+ * non-reclaim resource charged against it. We could
+ * update the charge list to point to the parent cgroup
+ * and allow the deletion here, but that would
+ * involve tracking all the chg lists which hold a
+ * reference to this cgroup.
+ */
+ for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
+ val = res_counter_read_u64(&memcg->no_rcl_res[idx], RES_USAGE);
+ if (val)
+ return -EBUSY;
+ }
return mem_cgroup_force_empty(memcg, false);
}
@@ -5489,6 +5521,170 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
}
#endif
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+/*
+ * To support resource control for non-reclaim pages like hugetlbfs
+ * and ramfs, we enforce the limit at mmap time. We also maintain
+ * a chg list for these resources, which tracks the range along with
+ * the memcg information. We need separate chg lists for shared
+ * and private mappings. Shared mappings are tracked in the
+ * file inode and private mappings in the vm_area_struct.
+ */
+long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
+ unsigned long from, unsigned long to,
+ int idx)
+{
+ long chg;
+ int ret = 0;
+ unsigned long csize;
+ struct mem_cgroup *memcg;
+ struct res_counter *fail_res;
+
+ /*
+ * Get the task's cgroup within rcu_read_lock() and also
+ * take a cgroup reference to make sure a cgroup destroy won't
+ * race with the charge. We don't allow a cgroup destroy
+ * while the cgroup has some charge against it.
+ */
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ css_get(&memcg->css);
+ rcu_read_unlock();
+
+ chg = region_chg(chg_list, from, to, (unsigned long)memcg);
+ if (chg < 0)
+ goto err_out;
+
+ if (mem_cgroup_is_root(memcg))
+ goto err_out;
+
+ csize = chg * PAGE_SIZE;
+ ret = res_counter_charge(&memcg->no_rcl_res[idx], csize, &fail_res);
+
+err_out:
+ /* Now that we have charged we can drop cgroup reference */
+ css_put(&memcg->css);
+ if (!ret)
+ return chg;
+
+ /* We don't worry about region_uncharge */
+ return ret;
+}
+
+void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
+ int idx, unsigned long nr_pages)
+{
+ struct mem_cgroup *memcg;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+
+ if (!mem_cgroup_is_root(memcg))
+ res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
+ rcu_read_unlock();
+ /*
+ * We could ideally remove zero size regions from
+ * resv map hcg_regions here
+ */
+ return;
+}
+
+void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
+ unsigned long from, unsigned long to)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ region_add(chg_list, from, to, (unsigned long)memcg);
+ rcu_read_unlock();
+ return;
+}
+
+long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
+ unsigned long from, unsigned long to,
+ int idx)
+{
+ long chg = 0, csize;
+ struct mem_cgroup *memcg;
+ struct file_region *rg, *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, chg_list, link)
+ if (from <= rg->to)
+ break;
+ if (&rg->link == chg_list)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (from > rg->from) {
+ if (to < rg->to) {
+ struct file_region *nrg;
+ /* rg->from from to rg->to */
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ /*
+ * If we fail to allocate, we return
+ * with a 0 charge. A later complete
+ * truncate will reclaim the leftover space.
+ */
+ if (!nrg)
+ return 0;
+ nrg->from = to;
+ nrg->to = rg->to;
+ nrg->data = rg->data;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, &rg->link);
+
+ /* Adjust the rg entry */
+ rg->to = from;
+ chg = to - from;
+ memcg = (struct mem_cgroup *)rg->data;
+ if (!mem_cgroup_is_root(memcg)) {
+ csize = chg * PAGE_SIZE;
+ res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
+ }
+ return chg;
+ }
+ chg = rg->to - from;
+ rg->to = from;
+ memcg = (struct mem_cgroup *)rg->data;
+ if (!mem_cgroup_is_root(memcg)) {
+ csize = chg * PAGE_SIZE;
+ res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
+ }
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+ /* Drop any remaining regions till to */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (rg->from >= to)
+ break;
+ if (&rg->link == chg_list)
+ break;
+ if (rg->to > to) {
+ /* rg->from to rg->to */
+ chg += to - rg->from;
+ rg->from = to;
+ memcg = (struct mem_cgroup *)rg->data;
+ if (!mem_cgroup_is_root(memcg)) {
+ csize = (to - rg->from) * PAGE_SIZE;
+ res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
+ }
+ return chg;
+ }
+ chg += rg->to - rg->from;
+ memcg = (struct mem_cgroup *)rg->data;
+ if (!mem_cgroup_is_root(memcg)) {
+ csize = (rg->to - rg->from) * PAGE_SIZE;
+ res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
+ }
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+#endif
+
struct cgroup_subsys mem_cgroup_subsys = {
.name = "memory",
.subsys_id = mem_cgroup_subsys_id,
--
1.7.9
* [PATCH -V2 5/9] hugetlbfs: Add memory controller support for shared mapping
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (3 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping Aneesh Kumar K.V
` (5 subsequent siblings)
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
For shared mappings we need to track the memory controller along with the range.
If two tasks in two different cgroups map the same area, only the non-overlapping
part should be charged to the second task; hence we need to track the memcg
along with the range. We always charge during mmap(2) and uncharge during
truncate. The charge list is kept in inode->i_mapping->private_list.
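The memcg res_counters account in base-page units, so hugetlb_page_charge()
shifts the huge page indices by huge_page_order() before charging. A worked
example for the CONFIG_MEM_RES_CTLR_NORECLAIM path, assuming 2MB huge pages
(order 9) and a 4KB base page size:

	/* Reserve huge page indices [0, 4) of a shared mapping. */
	from = 0 << huge_page_order(h);		/* 0    */
	to   = 4 << huge_page_order(h);		/* 2048 base pages */
	chg  = mem_cgroup_try_noreclaim_charge(&inode->i_mapping->private_list,
					       from, to, h - hstates);
	/* the res_counter is charged chg * PAGE_SIZE = 2048 * 4KB = 8MB */
	if (chg > 0)
		chg >>= huge_page_order(h);	/* back to huge page units: 4 */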
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
mm/hugetlb.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 107 insertions(+), 22 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fd6d38..664c663 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -22,6 +22,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/region.h>
+#include <linux/memcontrol.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -72,6 +73,59 @@ static DEFINE_SPINLOCK(hugetlb_lock);
*
*/
+long hugetlb_page_charge(struct list_head *head,
+ struct hstate *h, long from, long to)
+{
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+ long chg;
+ from = from << huge_page_order(h);
+ to = to << huge_page_order(h);
+ chg = mem_cgroup_try_noreclaim_charge(head, from, to, h - hstates);
+ if (chg > 0)
+ return chg >> huge_page_order(h);
+ return chg;
+#else
+ return region_chg(head, from, to, 0);
+#endif
+}
+
+void hugetlb_page_uncharge(struct list_head *head, int idx, long nr_pages)
+{
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+ return mem_cgroup_noreclaim_uncharge(head, idx, nr_pages);
+#else
+ return;
+#endif
+}
+
+void hugetlb_commit_page_charge(struct hstate *h,
+ struct list_head *head, long from, long to)
+{
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+ from = from << huge_page_order(h);
+ to = to << huge_page_order(h);
+ return mem_cgroup_commit_noreclaim_charge(head, from, to);
+#else
+ return region_add(head, from, to, 0);
+#endif
+}
+
+long hugetlb_truncate_cgroup(struct hstate *h,
+ struct list_head *head, long from)
+{
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+ long chg;
+ from = from << huge_page_order(h);
+ chg = mem_cgroup_truncate_chglist_range(head, from,
+ ULONG_MAX, h - hstates);
+ if (chg > 0)
+ return chg >> huge_page_order(h);
+ return chg;
+#else
+ return region_truncate(head, from);
+#endif
+}
+
/*
* Convert the address within this vma to the page offset within
* the mapping, in pagecache page units; huge pages here.
@@ -840,9 +894,8 @@ static long vma_needs_reservation(struct hstate *h,
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- return region_chg(&inode->i_mapping->private_list,
- idx, idx + 1, 0);
-
+ return hugetlb_page_charge(&inode->i_mapping->private_list,
+ h, idx, idx + 1);
} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -857,16 +910,33 @@ static long vma_needs_reservation(struct hstate *h,
return 0;
}
}
+
+static void vma_uncharge_reservation(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long chg)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct inode *inode = mapping->host;
+
+
+ if (vma->vm_flags & VM_MAYSHARE) {
+ return hugetlb_page_uncharge(&inode->i_mapping->private_list,
+ h - hstates,
+ chg << huge_page_order(h));
+ }
+}
+
static void vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
+
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- region_add(&inode->i_mapping->private_list, idx, idx + 1, 0);
-
+ hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
+ idx, idx + 1);
} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *reservations = vma_resv_map(vma);
@@ -895,9 +965,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
return ERR_PTR(-VM_FAULT_OOM);
- if (chg)
- if (hugetlb_get_quota(inode->i_mapping, chg))
+ if (chg) {
+ if (hugetlb_get_quota(inode->i_mapping, chg)) {
+ vma_uncharge_reservation(h, vma, chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
+ }
+ }
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -906,7 +979,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ if (chg) {
+ vma_uncharge_reservation(h, vma, chg);
+ hugetlb_put_quota(inode->i_mapping, chg);
+ }
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
@@ -914,7 +990,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long) mapping);
vma_commit_reservation(h, vma, addr);
-
return page;
}
@@ -2744,9 +2819,10 @@ int hugetlb_reserve_pages(struct inode *inode,
* to reserve the full area even if read-only as mprotect() may be
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
- if (!vma || vma->vm_flags & VM_MAYSHARE)
- chg = region_chg(&inode->i_mapping->private_list, from, to, 0);
- else {
+ if (!vma || vma->vm_flags & VM_MAYSHARE) {
+ chg = hugetlb_page_charge(&inode->i_mapping->private_list,
+ h, from, to);
+ } else {
struct resv_map *resv_map = resv_map_alloc();
if (!resv_map)
return -ENOMEM;
@@ -2761,19 +2837,17 @@ int hugetlb_reserve_pages(struct inode *inode,
return chg;
/* There must be enough filesystem quota for the mapping */
- if (hugetlb_get_quota(inode->i_mapping, chg))
- return -ENOSPC;
-
+ if (hugetlb_get_quota(inode->i_mapping, chg)) {
+ ret = -ENOSPC;
+ goto err_quota;
+ }
/*
* Check enough hugepages are available for the reservation.
* Hand back the quota if there are not
*/
ret = hugetlb_acct_memory(h, chg);
- if (ret < 0) {
- hugetlb_put_quota(inode->i_mapping, chg);
- return ret;
- }
-
+ if (ret < 0)
+ goto err_acct_mem;
/*
* Account for the reservations made. Shared mappings record regions
* that have reservations as they are shared by multiple VMAs.
@@ -2786,15 +2860,26 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&inode->i_mapping->private_list, from, to, 0);
+ hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
+ from, to);
return 0;
+err_acct_mem:
+ hugetlb_put_quota(inode->i_mapping, chg);
+err_quota:
+ if (!vma || vma->vm_flags & VM_MAYSHARE)
+ hugetlb_page_uncharge(&inode->i_mapping->private_list,
+ h - hstates, chg << huge_page_order(h));
+ return ret;
+
}
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
+ long chg;
struct hstate *h = hstate_inode(inode);
- long chg = region_truncate(&inode->i_mapping->private_list, offset);
+ chg = hugetlb_truncate_cgroup(h, &inode->i_mapping->private_list,
+ offset);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
1.7.9
* [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (4 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 5/9] hugetlbfs: Add memory controller support for shared mapping Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-05-17 23:16 ` Darrick J. Wong
2012-03-01 9:16 ` [PATCH -V2 7/9] memcg: track resource index in cftype private Aneesh Kumar K.V
` (4 subsequent siblings)
10 siblings, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
For private mappings we always charge/uncharge the current task's memcg.
Charging happens during mmap(2) and uncharging happens in
vm_operations->close. For a child task after fork, charging happens
at fault time in alloc_huge_page. We also need to make sure that for a private
mapping each HugeTLB vma has a struct resv_map allocated, so that we
can store the charge list in the resv_map.
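A rough sketch of the private-mapping life cycle described above (simplified
from the hunks below; names are from this series):

	/*
	 * mmap(2): resv_map = resv_map_alloc();
	 *          set_vma_resv_map(vma, resv_map);
	 *          chg = hugetlb_page_charge(&resv_map->regions, h, from, to);
	 *          hugetlb_commit_page_charge(h, &resv_map->regions, from, to);
	 *
	 * fork():  the child does not own the reservation; its pages are
	 *          charged at fault time in alloc_huge_page().
	 *
	 * close(): hugetlb_truncate_cgroup_range(h, &resv_map->regions,
	 *                                        start, end);
	 *          uncharges the range and drops the resv_map reference.
	 */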
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
mm/hugetlb.c | 176 +++++++++++++++++++++++++++++++++------------------------
1 files changed, 102 insertions(+), 74 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 664c663..2d99d0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -126,6 +126,22 @@ long hugetlb_truncate_cgroup(struct hstate *h,
#endif
}
+long hugetlb_truncate_cgroup_range(struct hstate *h,
+ struct list_head *head, long from, long end)
+{
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+ long chg;
+ from = from << huge_page_order(h);
+ end = end << huge_page_order(h);
+ chg = mem_cgroup_truncate_chglist_range(head, from, end, h - hstates);
+ if (chg > 0)
+ return chg >> huge_page_order(h);
+ return chg;
+#else
+ return region_truncate_range(head, from, end);
+#endif
+}
+
/*
* Convert the address within this vma to the page offset within
* the mapping, in pagecache page units; huge pages here.
@@ -229,13 +245,19 @@ static struct resv_map *resv_map_alloc(void)
return resv_map;
}
-static void resv_map_release(struct kref *ref)
+static unsigned long resv_map_release(struct hstate *h,
+ struct resv_map *resv_map)
{
- struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
-
- /* Clear out any active regions before we release the map. */
- region_truncate(&resv_map->regions, 0);
+ unsigned long reserve;
+ /*
+ * We should not have any regions left here if we were able to
+ * allocate memory in hugetlb_truncate_cgroup_range().
+ *
+ * Clear out any active regions before we release the map.
+ */
+ reserve = hugetlb_truncate_cgroup(h, &resv_map->regions, 0);
kfree(resv_map);
+ return reserve;
}
static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
@@ -447,9 +469,7 @@ static void free_huge_page(struct page *page)
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
- mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -465,8 +485,6 @@ static void free_huge_page(struct page *page)
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -887,63 +905,74 @@ static void return_unused_surplus_pages(struct hstate *h,
* No action is required on failure.
*/
static long vma_needs_reservation(struct hstate *h,
- struct vm_area_struct *vma, unsigned long addr)
+ struct vm_area_struct *vma,
+ unsigned long addr)
{
+ pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
+
if (vma->vm_flags & VM_MAYSHARE) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
return hugetlb_page_charge(&inode->i_mapping->private_list,
h, idx, idx + 1);
} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- return 1;
-
- } else {
- long err;
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *reservations = vma_resv_map(vma);
-
- err = region_chg(&reservations->regions, idx, idx + 1, 0);
- if (err < 0)
- return err;
- return 0;
+ struct resv_map *resv_map = vma_resv_map(vma);
+ if (!resv_map) {
+ /*
+ * We didn't allocate resv_map for this vma.
+ * Allocate it here.
+ */
+ resv_map = resv_map_alloc();
+ if (!resv_map)
+ return -ENOMEM;
+ set_vma_resv_map(vma, resv_map);
+ }
+ return hugetlb_page_charge(&resv_map->regions,
+ h, idx, idx + 1);
}
+ /*
+ * We did the private page charging in mmap call
+ */
+ return 0;
}
static void vma_uncharge_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long chg)
{
+ int idx = h - hstates;
+ struct list_head *region_list;
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
- if (vma->vm_flags & VM_MAYSHARE) {
- return hugetlb_page_uncharge(&inode->i_mapping->private_list,
- h - hstates,
- chg << huge_page_order(h));
+ if (vma->vm_flags & VM_MAYSHARE)
+ region_list = &inode->i_mapping->private_list;
+ else {
+ struct resv_map *resv_map = vma_resv_map(vma);
+ region_list = &resv_map->regions;
}
+ return hugetlb_page_uncharge(region_list,
+ idx, chg << huge_page_order(h));
}
static void vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
-
+ struct list_head *region_list;
+ pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
if (vma->vm_flags & VM_MAYSHARE) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
- idx, idx + 1);
- } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
+ region_list = &inode->i_mapping->private_list;
+ } else {
struct resv_map *reservations = vma_resv_map(vma);
-
- /* Mark this page used in the map. */
- region_add(&reservations->regions, idx, idx + 1, 0);
+ region_list = &reservations->regions;
}
+ hugetlb_commit_page_charge(h, region_list, idx, idx + 1);
+ return;
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -986,10 +1015,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
-
set_page_private(page, (unsigned long) mapping);
-
- vma_commit_reservation(h, vma, addr);
+ if (chg)
+ vma_commit_reservation(h, vma, addr);
return page;
}
@@ -2001,25 +2029,40 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
*/
if (reservations)
kref_get(&reservations->refs);
+ else if (!(vma->vm_flags & VM_MAYSHARE)) {
+ /*
+ * for non shared vma we need resv map to track
+ * hugetlb cgroup usage. Allocate it here. Charging
+ * the cgroup will take place in fault path.
+ */
+ struct resv_map *resv_map = resv_map_alloc();
+ /*
+ * If we fail to allocate resv_map here. We will allocate
+ * one when we do alloc_huge_page. So we don't handle
+ * ENOMEM here. The function also return void. So there is
+ * nothing much we can do.
+ */
+ if (resv_map)
+ set_vma_resv_map(vma, resv_map);
+ }
}
static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
- struct resv_map *reservations = vma_resv_map(vma);
- unsigned long reserve;
- unsigned long start;
- unsigned long end;
+ struct resv_map *resv_map = vma_resv_map(vma);
+ unsigned long reserve, start, end;
- if (reservations) {
+ if (resv_map) {
start = vma_hugecache_offset(h, vma, vma->vm_start);
end = vma_hugecache_offset(h, vma, vma->vm_end);
- reserve = (end - start) -
- region_count(&reservations->regions, start, end);
-
- kref_put(&reservations->refs, resv_map_release);
-
+ reserve = hugetlb_truncate_cgroup_range(h, &resv_map->regions,
+ start, end);
+ /* open coded kref_put */
+ if (atomic_sub_and_test(1, &resv_map->refs.refcount)) {
+ reserve += resv_map_release(h, resv_map);
+ }
if (reserve) {
hugetlb_acct_memory(h, -reserve);
hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
@@ -2803,8 +2846,9 @@ int hugetlb_reserve_pages(struct inode *inode,
vm_flags_t vm_flags)
{
long ret, chg;
+ struct list_head *region_list;
struct hstate *h = hstate_inode(inode);
-
+ struct resv_map *resv_map = NULL;
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
@@ -2820,19 +2864,17 @@ int hugetlb_reserve_pages(struct inode *inode,
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
if (!vma || vma->vm_flags & VM_MAYSHARE) {
- chg = hugetlb_page_charge(&inode->i_mapping->private_list,
- h, from, to);
+ region_list = &inode->i_mapping->private_list;
} else {
- struct resv_map *resv_map = resv_map_alloc();
+ resv_map = resv_map_alloc();
if (!resv_map)
return -ENOMEM;
- chg = to - from;
-
set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
+ region_list = &resv_map->regions;
}
-
+ chg = hugetlb_page_charge(region_list, h, from, to);
if (chg < 0)
return chg;
@@ -2848,29 +2890,15 @@ int hugetlb_reserve_pages(struct inode *inode,
ret = hugetlb_acct_memory(h, chg);
if (ret < 0)
goto err_acct_mem;
- /*
- * Account for the reservations made. Shared mappings record regions
- * that have reservations as they are shared by multiple VMAs.
- * When the last VMA disappears, the region map says how much
- * the reservation was and the page cache tells how much of
- * the reservation was consumed. Private mappings are per-VMA and
- * only the consumed reservations are tracked. When the VMA
- * disappears, the original reservation is the VMA size and the
- * consumed reservations are stored in the map. Hence, nothing
- * else has to be done for private mappings here
- */
- if (!vma || vma->vm_flags & VM_MAYSHARE)
- hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
- from, to);
+
+ hugetlb_commit_page_charge(h, region_list, from, to);
return 0;
err_acct_mem:
hugetlb_put_quota(inode->i_mapping, chg);
err_quota:
- if (!vma || vma->vm_flags & VM_MAYSHARE)
- hugetlb_page_uncharge(&inode->i_mapping->private_list,
- h - hstates, chg << huge_page_order(h));
+ hugetlb_page_uncharge(region_list, h - hstates,
+ chg << huge_page_order(h));
return ret;
-
}
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
@@ -2884,7 +2912,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
- hugetlb_put_quota(inode->i_mapping, (chg - freed));
+ hugetlb_put_quota(inode->i_mapping, chg);
hugetlb_acct_memory(h, -(chg - freed));
}
--
1.7.9
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH -V2 7/9] memcg: track resource index in cftype private
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (5 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 8/9] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
` (3 subsequent siblings)
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This helps in using the same memcg callbacks for the non reclaim resource
control files.
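For example, the hugetlb control files added later in this series encode
the hstate index along with the resource type and attribute in
cftype->private:

	cft->private = __MEMFILE_PRIVATE(idx, _MEMNORCL, RES_LIMIT);

and the common read/write/reset callbacks recover the pieces with
MEMFILE_TYPE(), MEMFILE_IDX() and MEMFILE_ATTR().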
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
mm/memcontrol.c | 27 +++++++++++++++++++++------
1 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b00d028..25bc5f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -365,9 +365,14 @@ enum charge_type {
#define _MEM (0)
#define _MEMSWAP (1)
#define _OOM_TYPE (2)
-#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
-#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
-#define MEMFILE_ATTR(val) ((val) & 0xffff)
+#define _MEMNORCL (3)
+
+/* bit layout: 0..15 = val (attr), 16..23 = x (type), 24..31 = idx */
+#define __MEMFILE_PRIVATE(idx, x, val) (((idx) << 24) | ((x) << 16) | (val))
+#define MEMFILE_PRIVATE(x, val) __MEMFILE_PRIVATE(0, x, val)
+#define MEMFILE_TYPE(val) (((val) >> 16) & 0xff)
+#define MEMFILE_IDX(val) (((val) >> 24) & 0xff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
/* Used for OOM nofiier */
#define OOM_CONTROL (0)
@@ -3834,7 +3839,7 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
u64 val;
- int type, name;
+ int type, name, idx;
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
@@ -3851,6 +3856,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
else
val = res_counter_read_u64(&memcg->memsw, name);
break;
+ case _MEMNORCL:
+ idx = MEMFILE_IDX(cft->private);
+ val = res_counter_read_u64(&memcg->no_rcl_res[idx], name);
+ break;
default:
BUG();
break;
@@ -3883,7 +3892,10 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
break;
if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
- else
+ else if (type == _MEMNORCL) {
+ int idx = MEMFILE_IDX(cft->private);
+ ret = res_counter_set_limit(&memcg->no_rcl_res[idx], val);
+ } else
ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
case RES_SOFT_LIMIT:
@@ -3947,7 +3959,10 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
case RES_MAX_USAGE:
if (type == _MEM)
res_counter_reset_max(&memcg->res);
- else
+ else if (type == _MEMNORCL) {
+ int idx = MEMFILE_IDX(event);
+ res_counter_reset_max(&memcg->no_rcl_res[idx]);
+ } else
res_counter_reset_max(&memcg->memsw);
break;
case RES_FAILCNT:
--
1.7.9
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH -V2 8/9] hugetlbfs: Add memcg control files for hugetlbfs
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (6 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 7/9] memcg: track resource index in cftype private Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 9/9] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
` (2 subsequent siblings)
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
This adds control files for hugetlbfs to memcg.
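For each supported hugepage size the memcg directory then gets control
files of the following form (names shown for a 16MB hstate, as produced
by mem_fmt() in this patch; the actual sizes depend on the system's
hstates):

	memory.hugetlb.16MB.limit_in_bytes
	memory.hugetlb.16MB.usage_in_bytes
	memory.hugetlb.16MB.max_usage_in_bytes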
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
include/linux/hugetlb.h | 5 ++++
mm/hugetlb.c | 39 ++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 92 insertions(+), 1 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9d6c86..8498fa8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
struct ctl_table;
struct user_struct;
@@ -220,6 +221,10 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+ /* cgroup control files */
+ struct cftype cgroup_limit_file;
+ struct cftype cgroup_usage_file;
+ struct cftype cgroup_max_usage_file;
char name[HSTATE_NAME_LEN];
};
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d99d0a..9229715 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -23,6 +23,7 @@
#include <linux/swapops.h>
#include <linux/region.h>
#include <linux/memcontrol.h>
+#include <linux/res_counter.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -1761,6 +1762,42 @@ static int __init hugetlb_init(void)
}
module_init(hugetlb_init);
+#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
+int register_hugetlb_memcg_files(struct cgroup *cgroup,
+ struct cgroup_subsys *ss)
+{
+ int ret = 0;
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ ret = cgroup_add_file(cgroup, ss, &h->cgroup_limit_file);
+ if (ret)
+ return ret;
+ ret = cgroup_add_file(cgroup, ss, &h->cgroup_usage_file);
+ if (ret)
+ return ret;
+ ret = cgroup_add_file(cgroup, ss, &h->cgroup_max_usage_file);
+ if (ret)
+ return ret;
+
+ }
+ return ret;
+}
+/* defined in mm/memcontrol.c because struct mem_cgroup is not available outside it */
+int hugetlb_memcg_file_init(struct hstate *h, int idx);
+#else
+int register_hugetlb_memcg_files(struct cgroup *cgroup,
+ struct cgroup_subsys *ss)
+{
+ return 0;
+}
+
+static int hugetlb_memcg_file_init(struct hstate *h, int idx)
+{
+ return 0;
+}
+#endif
+
/* Should be called on processing a hugepagesz=... option */
void __init hugetlb_add_hstate(unsigned order)
{
@@ -1784,7 +1821,7 @@ void __init hugetlb_add_hstate(unsigned order)
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
-
+ hugetlb_memcg_file_init(h, max_hstate - 1);
parsed_hstate = h;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25bc5f7..410d53d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5012,6 +5012,52 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
mem_cgroup_put(memcg);
}
+#if defined(CONFIG_MEM_RES_CTLR_NORECLAIM) && defined(CONFIG_HUGETLBFS)
+static char *mem_fmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%luGB", n >> 30);
+ else if (n >= (1UL << 20))
+ sprintf(buf, "%luMB", n >> 20);
+ else
+ sprintf(buf, "%luKB", n >> 10);
+ return buf;
+}
+
+int hugetlb_memcg_file_init(struct hstate *h, int idx)
+{
+ char buf[32];
+ struct cftype *cft;
+
+ /* format the size */
+ mem_fmt(buf, huge_page_size(h));
+
+ /* Add the limit file */
+ cft = &h->cgroup_limit_file;
+ snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.limit_in_bytes", buf);
+ cft->private = __MEMFILE_PRIVATE(idx, _MEMNORCL, RES_LIMIT);
+ cft->read_u64 = mem_cgroup_read;
+ cft->write_string = mem_cgroup_write;
+
+ /* Add the usage file */
+ cft = &h->cgroup_usage_file;
+ snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.usage_in_bytes", buf);
+ cft->private = __MEMFILE_PRIVATE(idx, _MEMNORCL, RES_USAGE);
+ cft->read_u64 = mem_cgroup_read;
+
+ /* Add the MAX usage file */
+ cft = &h->cgroup_max_usage_file;
+ snprintf(cft->name, MAX_CFTYPE_NAME, "hugetlb.%s.max_usage_in_bytes", buf);
+ cft->private = __MEMFILE_PRIVATE(idx, _MEMNORCL, RES_MAX_USAGE);
+ cft->trigger = mem_cgroup_reset;
+ cft->read_u64 = mem_cgroup_read;
+
+ return 0;
+}
+#endif
+
+int register_hugetlb_memcg_files(struct cgroup *cgroup,
+ struct cgroup_subsys *ss);
static int mem_cgroup_populate(struct cgroup_subsys *ss,
struct cgroup *cont)
{
@@ -5026,6 +5072,9 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
if (!ret)
ret = register_kmem_files(cont, ss);
+ if (!ret)
+ ret = register_hugetlb_memcg_files(cont, ss);
+
return ret;
}
--
1.7.9
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH -V2 9/9] memcg: Add memory controller documentation for hugetlb management
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (7 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 8/9] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
@ 2012-03-01 9:16 ` Aneesh Kumar K.V
2012-03-01 22:40 ` [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Andrew Morton
2012-03-02 5:48 ` KAMEZAWA Hiroyuki
10 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-01 9:16 UTC (permalink / raw)
To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes
Cc: linux-kernel, cgroups, Aneesh Kumar K.V
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4c95c00..f98d9af 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -43,6 +43,7 @@ Features:
- usage threshold notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
+ - resource accounting for non reclaimable resources like HugeTLB pages
Kernel memory support is work in progress, and the current version provides
basically functionality. (See Section 2.7)
@@ -75,6 +76,12 @@ Brief summary of control files.
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
+
+ memory.hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
+ memory.hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ memory.hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
+ # see 5.7 for details
+
1. History
The memory controller has a long history. A request for comments for the memory
@@ -279,6 +286,14 @@ per cgroup, instead of globally.
* tcp memory pressure: sockets memory pressure for the tcp protocol.
+2.8 Non reclaim resource management
+
+This helps in enforcing limits on non reclaimable pages like HugeTLB pages.
+For non reclaim resources, enforcing the limit during actual page allocation
+doesn't make sense. Hence this extension enforces the limit during mmap(2).
+This enables applications to fall back to alternative allocation methods,
+such as allocations using normal page size, when HugeTLB allocation fails.
+
3. User Interface
0. Configuration
@@ -287,6 +302,7 @@ a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_CGROUP_MEM_RES_CTLR
d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
+f. Enable CONFIG_MEM_RES_CTLR_NORECLAIM (to use non reclaim extension)
1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
@@ -510,6 +526,18 @@ unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
And we have total = file + anon + unevictable.
+5.7 Non reclaimable resource control files
+For a system supporting two hugepage sizes (16MB and 16GB) the control
+files include:
+
+ memory.hugetlb.16GB.limit_in_bytes
+ memory.hugetlb.16GB.max_usage_in_bytes
+ memory.hugetlb.16GB.usage_in_bytes
+ memory.hugetlb.16MB.limit_in_bytes
+ memory.hugetlb.16MB.max_usage_in_bytes
+ memory.hugetlb.16MB.usage_in_bytes
+
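+For example, assuming the memory controller is mounted at
+/sys/fs/cgroup/memory and the system has a 16MB hstate, a limit can be
+set for an example cgroup (grp1 here) and the usage read back with:
+
+	# echo 1G > /sys/fs/cgroup/memory/grp1/memory.hugetlb.16MB.limit_in_bytes
+	# cat /sys/fs/cgroup/memory/grp1/memory.hugetlb.16MB.usage_in_bytes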
+
6. Hierarchy support
The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.9
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code
2012-03-01 9:16 ` [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code Aneesh Kumar K.V
@ 2012-03-01 22:33 ` Andrew Morton
2012-03-04 17:37 ` Aneesh Kumar K.V
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2012-03-01 22:33 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
hannes, linux-kernel, cgroups, Andrea Righi, John Stultz
On Thu, 1 Mar 2012 14:46:12 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> This patch moves the hugetlbfs region tracking function to
> common code. We will be using this in later patches in the
> series.
>
> ...
>
> +struct file_region {
> + struct list_head link;
> + long from;
> + long to;
> +};
Both Andrea Righi and John Stultz are working on (more sophisticated)
versions of file region tracking code. And we already have a (poor)
implementation in fs/locks.c.
That's four versions of the same thing floating around the place. This
is nutty.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (8 preceding siblings ...)
2012-03-01 9:16 ` [PATCH -V2 9/9] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
@ 2012-03-01 22:40 ` Andrew Morton
2012-03-02 3:28 ` David Gibson
2012-03-04 19:15 ` Aneesh Kumar K.V
2012-03-02 5:48 ` KAMEZAWA Hiroyuki
10 siblings, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2012-03-01 22:40 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
hannes, linux-kernel, cgroups, David Gibson
On Thu, 1 Mar 2012 14:46:11 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> This patchset implements a memory controller extension to control
> HugeTLB allocations. It is similar to the existing hugetlb quota
> support in that, the limit is enforced at mmap(2) time and not at
> fault time. HugeTLB's quota mechanism limits the number of huge pages
> that can allocated per superblock.
>
> For shared mappings we track the regions mapped by a task along with the
> memcg. We keep the memory controller charged even after the task
> that did mmap(2) exits. Uncharge happens during truncate. For Private
> mappings we charge and uncharge from the current task cgroup.
I haven't begin to get my head around this yet, but I'd like to draw
your attention to https://lkml.org/lkml/2012/2/15/548. That fix has
been hanging around for a while, but I haven't done anything with it
yet because I don't like its additional blurring of the separation
between hugetlb core code and hugetlbfs. I want to find time to sit
down and see if the fix can be better architected but haven't got
around to that yet.
I expect that your patches will conflict at least mechanically with
David's, which is not a big issue. But I wonder whether your patches
will copy the same bug into other places, and whether you can think of
a tidier way of addressing the bug which David is seeing?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-01 22:40 ` [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Andrew Morton
@ 2012-03-02 3:28 ` David Gibson
2012-03-04 18:09 ` Aneesh Kumar K.V
2012-03-04 19:15 ` Aneesh Kumar K.V
1 sibling, 1 reply; 26+ messages in thread
From: David Gibson @ 2012-03-02 3:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Aneesh Kumar K.V, linux-mm, mgorman, kamezawa.hiroyu, dhillf,
aarcange, mhocko, hannes, linux-kernel, cgroups
On Thu, Mar 01, 2012 at 02:40:29PM -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2012 14:46:11 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > This patchset implements a memory controller extension to control
> > HugeTLB allocations. It is similar to the existing hugetlb quota
> > support in that, the limit is enforced at mmap(2) time and not at
> > fault time. HugeTLB's quota mechanism limits the number of huge pages
> > that can allocated per superblock.
> >
> > For shared mappings we track the regions mapped by a task along with the
> > memcg. We keep the memory controller charged even after the task
> > that did mmap(2) exits. Uncharge happens during truncate. For Private
> > mappings we charge and uncharge from the current task cgroup.
>
> I haven't begin to get my head around this yet, but I'd like to draw
> your attention to https://lkml.org/lkml/2012/2/15/548. That fix has
> been hanging around for a while, but I haven't done anything with it
> yet because I don't like its additional blurring of the separation
> between hugetlb core code and hugetlbfs. I want to find time to sit
> down and see if the fix can be better architected but haven't got
> around to that yet.
So.. that version of the fix I specifically rebuilt to address your
concerns about that blurring - in fact I think it reduces the current
layer blurring. I haven't had any reply - what problems do see it as
still having?
> I expect that your patches will conflict at least mechanically with
> David's, which is not a big issue. But I wonder whether your patches
> will copy the same bug into other places, and whether you can think of
> a tidier way of addressing the bug which David is seeing?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
` (9 preceding siblings ...)
2012-03-01 22:40 ` [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Andrew Morton
@ 2012-03-02 5:48 ` KAMEZAWA Hiroyuki
2012-03-04 18:14 ` Aneesh Kumar K.V
10 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-02 5:48 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Thu, 1 Mar 2012 14:46:11 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> Hi,
>
> This patchset implements a memory controller extension to control
> HugeTLB allocations. It is similar to the existing hugetlb quota
> support in that, the limit is enforced at mmap(2) time and not at
> fault time. HugeTLB's quota mechanism limits the number of huge pages
> that can allocated per superblock.
>
Thank you, I think memcg-extension is better than hugetlbfs cgroup.
> For shared mappings we track the regions mapped by a task along with the
> memcg. We keep the memory controller charged even after the task
> that did mmap(2) exits. Uncharge happens during truncate. For Private
> mappings we charge and uncharge from the current task cgroup.
>
What "current" means here ? current task's cgroup ?
> A sample strace output for an application doing malloc with hugectl is given
> below. libhugetlbfs will fall back to normal pagesize if the HugeTLB mmap fails.
>
> open("/mnt/libhugetlbfs.tmp.uhLMgy", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> unlink("/mnt/libhugetlbfs.tmp.uhLMgy") = 0
>
> .........
>
> mmap(0x20000000000, 50331648, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENOMEM (Cannot allocate memory)
> write(2, "libhugetlbfs", 12libhugetlbfs) = 12
> write(2, ": WARNING: New heap segment map" ....
> mmap(NULL, 42008576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfff946c0000
> ....
>
>
> Goals:
>
> 1) We want to keep the semantic closer to hugelb quota support. ie, we want
> to extend quota semantics to a group of tasks. Currently hugetlb quota
> mechanism allows one to control number of hugetlb pages allocated per
> hugetlbfs superblock.
>
> 2) Applications using hugetlbfs always fallback to normal page size allocation when they
> fail to allocate huge pages. libhugetlbfs internally handles this for malloc(3). We
> want to retain this behaviour when we enforce the controller limit. ie, when huge page
> allocation fails due to controller limit, applications should fallback to
> allocation using normal page size. The above implies that we need to enforce
> limit at mmap(2).
>
Hm, ok.
> 3) HugeTLBfs doesn't support page reclaim. It also doesn't support write(2). Applications
> use hugetlbfs via mmap(2) interface. Important point to note here is hugetlbfs
> extends file size in mmap.
>
> With shared mappings, the file size gets extended in mmap and file will remain in hugetlbfs
> consuming huge pages until it is truncated. We want to make sure we keep the controller
> charged until the file is truncated. This implies, that the controller will be charged
> even after the task that did mmap exit.
>
O.K. hugetlbfs is charged until the file is removed.
Then, the next question will be 'can we destroy the cgroup....'
> Implementation details:
>
> In order to achieve the above goals we need to track the cgroup information
> along with mmap range in a charge list in inode for shared mapping and in
> vm_area_struct for private mapping. We won't be using page to track cgroup
> information because with the above goals we are not really tracking the pages used.
>
> Since we track cgroup in charge list, if we want to remove the cgroup, we need to update
> the charge list to point to the parent cgroup. Currently we take the easy route
> and prevent a cgroup removal if it's non reclaim resource usage is non zero.
>
As Andrew pointed out, there is some ongoing work on page-range tracking.
Please check.
Thanks,
-Kame
> Changes from V1:
> * Changed the implementation as a memcg extension. We still use
> the same logic to track the cgroup and range.
>
> Changes from RFC post:
> * Added support for HugeTLB cgroup hierarchy
> * Added support for task migration
> * Added documentation patch
> * Other bug fixes
>
> -aneesh
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg
2012-03-01 9:16 ` [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg Aneesh Kumar K.V
@ 2012-03-02 8:38 ` KAMEZAWA Hiroyuki
2012-03-04 18:07 ` Aneesh Kumar K.V
0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-02 8:38 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Thu, 1 Mar 2012 14:46:15 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> Non reclaim resources include hugetlb pages or ramfs pages.
> Both these file systems are memory based and they don't support
> page reclaim. So enforcing memory controller limit during actual
> page allocation doesn't make sense for them. Instead we would
> enforce the limit during mmap and keep track of the mmap range
> along with memcg information in charge list.
>
> We could have multiple non reclaim resources which we want to track
> indepedently, like huge pages with different huge page size.
>
> Current code don't allow removal of memcg if they have any non
> reclaim resource charge.
Hmmm why ? The reason is from coding trouble ? Or it's the spec. ?
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
> include/linux/memcontrol.h | 11 +++
> init/Kconfig | 11 +++
> mm/memcontrol.c | 198 +++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 219 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d34356..59d93ee 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -171,6 +171,17 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> bool mem_cgroup_bad_page_check(struct page *page);
> void mem_cgroup_print_bad_page(struct page *page);
> #endif
> +extern long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> + unsigned long from,
> + unsigned long to, int idx);
> +extern void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
> + int idx, unsigned long nr_pages);
> +extern void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
> + unsigned long from,
> + unsigned long to);
> +extern long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
> + unsigned long from,
> + unsigned long to, int idx);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 3f42cd6..c4306f7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -673,6 +673,17 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config MEM_RES_CTLR_NORECLAIM
> + bool "Memory Resource Controller non reclaim Extension"
> + depends on CGROUP_MEM_RES_CTLR
EXPERIMENTAL please.
> + help
> + Add non reclaim resource management to memory resource controller.
> + Currently only HugeTLB pages will be managed using this extension.
> + The controller limit is enforced during mmap(2), so that
> + application can fall back to allocations using smaller page size
> + if the memory controller limit prevented them from allocating HugeTLB
> + pages.
> +
Hm. In other thread, KMEM accounting is discussed. There is 2 proposals and
- 1st is accounting only reclaimable slabs (as dcache etc.)
- 2nd is accounting all slab allocations.
Here, 2nd one includes NORECLAIM kmem cache. (Discussion is not ended.)
So, for your developments, How about MEM_RES_CTLR_HUGEPAGE ?
With your design, memory.hugetlb.XX.xxxxxx files are added and
the interface is divided from other noreclaim mems. If necessary,
we can change the scope of the config without affecting user space by refactoring
after having these features.
BTW, please set default n because this will forbid rmdir of cgroup.
> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension"
> depends on CGROUP_MEM_RES_CTLR && SWAP
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6728a7a..b00d028 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -49,6 +49,7 @@
> #include <linux/page_cgroup.h>
> #include <linux/cpu.h>
> #include <linux/oom.h>
> +#include <linux/region.h>
> #include "internal.h"
> #include <net/sock.h>
> #include <net/tcp_memcontrol.h>
> @@ -214,6 +215,11 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>
> /*
> + * Currently only hugetlbfs pages are tracked using no reclaim
> + * resource count. So we need only MAX_HSTATE res counter
> + */
> +#define MEMCG_MAX_NORECLAIM HUGE_MAX_HSTATE
> +/*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -235,6 +241,11 @@ struct mem_cgroup {
> */
> struct res_counter memsw;
> /*
> + * the counter to account for non reclaim resources
> + * like hugetlb pages
> + */
> + struct res_counter no_rcl_res[MEMCG_MAX_NORECLAIM];
struct res_counter hugepages;
will be ok.
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -4887,6 +4898,7 @@ err_cleanup:
> static struct cgroup_subsys_state * __ref
> mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> {
> + int idx;
> struct mem_cgroup *memcg, *parent;
> long error = -ENOMEM;
> int node;
> @@ -4922,6 +4934,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> if (parent && parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> + res_counter_init(&memcg->no_rcl_res[idx],
> + &parent->no_rcl_res[idx]);
> + }
You can remove this kind of loop and keep your implementation simple.
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -4932,6 +4948,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++)
> + res_counter_init(&memcg->no_rcl_res[idx], NULL);
> }
> memcg->last_scanned_node = MAX_NUMNODES;
> INIT_LIST_HEAD(&memcg->oom_notify);
> @@ -4950,8 +4968,22 @@ free_out:
> static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> + int idx;
> + u64 val;
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> -
> + /*
> + * We don't allow a cgroup deletion if it have some
> + * non reclaim resource charged against it. We can
> + * update the charge list to point to parent cgroup
> + * and allow this cgroup deletion here. But that
> + * involve tracking all the chg list which have this
> + * cgroup reference.
> + */
I don't like this..... until this is fixed, default should be "N".
> + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> + val = res_counter_read_u64(&memcg->no_rcl_res[idx], RES_USAGE);
> + if (val)
> + return -EBUSY;
> + }
> return mem_cgroup_force_empty(memcg, false);
> }
>
> @@ -5489,6 +5521,170 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> }
> #endif
>
> +#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
> +/*
> + * For supporting resource control on non reclaim pages like hugetlbfs
> + * and ramfs, we enforce limit during mmap time. We also maintain
> + * a chg list for these resource, which track the range alog with
> + * memcg information. We need to have seperate chg_list for shared
> + * and private mapping. Shared mapping are mostly maintained in
> + * file inode and private mapping in vm_area_struct.
> + */
> +long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> + unsigned long from, unsigned long to,
> + int idx)
> +{
> + long chg;
> + int ret = 0;
> + unsigned long csize;
> + struct mem_cgroup *memcg;
> + struct res_counter *fail_res;
> +
> + /*
> + * Get the task cgroup within rcu_readlock and also
> + * get cgroup reference to make sure cgroup destroy won't
> + * race with page_charge. We don't allow a cgroup destroy
> + * when the cgroup have some charge against it
> + */
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(current);
> + css_get(&memcg->css);
css_tryget() ?
> + rcu_read_unlock();
> +
> + chg = region_chg(chg_list, from, to, (unsigned long)memcg);
> + if (chg < 0)
> + goto err_out;
I don't think all NORECLAIM objects have a 'region' ;)
Making this dedicated to hugetlbfs will be good.
> +
> + if (mem_cgroup_is_root(memcg))
> + goto err_out;
> +
> + csize = chg * PAGE_SIZE;
> + ret = res_counter_charge(&memcg->no_rcl_res[idx], csize, &fail_res);
> +
> +err_out:
> + /* Now that we have charged we can drop cgroup reference */
> + css_put(&memcg->css);
> + if (!ret)
> + return chg;
> +
> + /* We don't worry about region_uncharge */
> + return ret;
> +}
> +
> +void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
> + int idx, unsigned long nr_pages)
> +{
> + struct mem_cgroup *memcg;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(current);
> +
> + if (!mem_cgroup_is_root(memcg))
> + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
What happens when a task using hugetlb is moved to the root cgroup ?
> + rcu_read_unlock();
> + /*
> + * We could ideally remove zero size regions from
> + * resv map hcg_regions here
> + */
> + return;
> +}
> +
> +void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
> + unsigned long from, unsigned long to)
> +{
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(current);
> + region_add(chg_list, from, to, (unsigned long)memcg);
> + rcu_read_unlock();
> + return;
> +}
Why we have both of try_charge() and charge ?
> +
> +long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
> + unsigned long from, unsigned long to,
> + int idx)
> +{
> + long chg = 0, csize;
> + struct mem_cgroup *memcg;
> + struct file_region *rg, *trg;
> +
> + /* Locate the region we are either in or before. */
> + list_for_each_entry(rg, chg_list, link)
> + if (from <= rg->to)
> + break;
> + if (&rg->link == chg_list)
> + return 0;
> +
> + /* If we are in the middle of a region then adjust it. */
> + if (from > rg->from) {
> + if (to < rg->to) {
> + struct file_region *nrg;
> + /* rg->from from to rg->to */
> + nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> + /*
> + * If we fail to allocate we return the
> + * with the 0 charge . Later a complete
> + * truncate will reclaim the left over space
> + */
> + if (!nrg)
> + return 0;
> + nrg->from = to;
> + nrg->to = rg->to;
> + nrg->data = rg->data;
> + INIT_LIST_HEAD(&nrg->link);
> + list_add(&nrg->link, &rg->link);
> +
> + /* Adjust the rg entry */
> + rg->to = from;
> + chg = to - from;
> + memcg = (struct mem_cgroup *)rg->data;
> + if (!mem_cgroup_is_root(memcg)) {
> + csize = chg * PAGE_SIZE;
> + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> + }
> + return chg;
> + }
> + chg = rg->to - from;
> + rg->to = from;
> + memcg = (struct mem_cgroup *)rg->data;
> + if (!mem_cgroup_is_root(memcg)) {
> + csize = chg * PAGE_SIZE;
> + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> + }
> + rg = list_entry(rg->link.next, typeof(*rg), link);
> + }
> + /* Drop any remaining regions till to */
> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> + if (rg->from >= to)
> + break;
> + if (&rg->link == chg_list)
> + break;
> + if (rg->to > to) {
> + /* rg->from to rg->to */
> + chg += to - rg->from;
> + rg->from = to;
> + memcg = (struct mem_cgroup *)rg->data;
> + if (!mem_cgroup_is_root(memcg)) {
> + csize = (to - rg->from) * PAGE_SIZE;
> + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> + }
> + return chg;
> + }
> + chg += rg->to - rg->from;
> + memcg = (struct mem_cgroup *)rg->data;
> + if (!mem_cgroup_is_root(memcg)) {
> + csize = (rg->to - rg->from) * PAGE_SIZE;
> + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> + }
> + list_del(&rg->link);
> + kfree(rg);
> + }
> + return chg;
> +}
> +#endif
> +
Can't we move most of this to region.c ?
Thanks,
-Kame
> struct cgroup_subsys mem_cgroup_subsys = {
> .name = "memory",
> .subsys_id = mem_cgroup_subsys_id,
> --
> 1.7.9
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code
2012-03-01 22:33 ` Andrew Morton
@ 2012-03-04 17:37 ` Aneesh Kumar K.V
0 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-04 17:37 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
hannes, linux-kernel, cgroups, Andrea Righi, John Stultz
On Thu, 1 Mar 2012 14:33:45 -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 1 Mar 2012 14:46:12 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > This patch moves the hugetlbfs region tracking function to
> > common code. We will be using this in later patches in the
> > series.
> >
> > ...
> >
> > +struct file_region {
> > + struct list_head link;
> > + long from;
> > + long to;
> > +};
>
> Both Andrea Righi and John Stultz are working on (more sophisticated)
> versions of file region tracking code. And we already have a (poor)
> implementation in fs/locks.c.
>
> That's four versions of the same thing floating around the place. This
> is nutty.
We should be able to remove region.c once the other alternatives are
upstream. I will also look at those alternatives and see whether they would
need any changes to make them usable for this work.
Thanks,
-aneesh
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg
2012-03-02 8:38 ` KAMEZAWA Hiroyuki
@ 2012-03-04 18:07 ` Aneesh Kumar K.V
2012-03-08 5:56 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-04 18:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Fri, 2 Mar 2012 17:38:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 1 Mar 2012 14:46:15 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >
> > Non reclaim resources include hugetlb pages or ramfs pages.
> > Both these file systems are memory based and they don't support
> > page reclaim. So enforcing memory controller limit during actual
> > page allocation doesn't make sense for them. Instead we would
> > enforce the limit during mmap and keep track of the mmap range
> > along with memcg information in charge list.
> >
> > We could have multiple non reclaim resources which we want to track
> > indepedently, like huge pages with different huge page size.
> >
> > Current code don't allow removal of memcg if they have any non
> > reclaim resource charge.
>
> Hmmm why ? The reason is from coding trouble ? Or it's the spec. ?
No, it is not the spec. It is just that I haven't got around to
implementing the needed changes. In order to be able to remove a
cgroup with non reclaim resource charges, we would need to track all the
charge lists (which are currently attached to the inode or to the
vm_area_struct) in struct mem_cgroup, so that on memcg removal we can move
the regions in those charge lists to point to the parent memcg.
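A rough sketch of that reparenting walk, purely illustrative and not part
of this series (it assumes the regions store the memcg pointer in their
data field, as patch 4/9 does, and that every charge list referencing the
memcg can be found):

/*
 * Illustrative only: rewrite every region in one charge list that still
 * points at the memcg being removed so that it points at the parent.
 */
static void reparent_noreclaim_regions(struct list_head *chg_list,
				       struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	struct file_region *rg;

	list_for_each_entry(rg, chg_list, link)
		if ((struct mem_cgroup *)rg->data == memcg)
			rg->data = (unsigned long)parent;
}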
>
>
> >
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > ---
> > include/linux/memcontrol.h | 11 +++
> > init/Kconfig | 11 +++
> > mm/memcontrol.c | 198 +++++++++++++++++++++++++++++++++++++++++++-
> > 3 files changed, 219 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 4d34356..59d93ee 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -171,6 +171,17 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> > bool mem_cgroup_bad_page_check(struct page *page);
> > void mem_cgroup_print_bad_page(struct page *page);
> > #endif
> > +extern long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> > + unsigned long from,
> > + unsigned long to, int idx);
> > +extern void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
> > + int idx, unsigned long nr_pages);
> > +extern void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
> > + unsigned long from,
> > + unsigned long to);
> > +extern long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
> > + unsigned long from,
> > + unsigned long to, int idx);
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > struct mem_cgroup;
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 3f42cd6..c4306f7 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -673,6 +673,17 @@ config CGROUP_MEM_RES_CTLR
> > This config option also selects MM_OWNER config option, which
> > could in turn add some fork/exit overhead.
> >
> > +config MEM_RES_CTLR_NORECLAIM
> > + bool "Memory Resource Controller non reclaim Extension"
> > + depends on CGROUP_MEM_RES_CTLR
>
> EXPERIMENTAL please.
>
Will do
>
> > + help
> > + Add non reclaim resource management to memory resource controller.
> > + Currently only HugeTLB pages will be managed using this extension.
> > + The controller limit is enforced during mmap(2), so that
> > + application can fall back to allocations using smaller page size
> > + if the memory controller limit prevented them from allocating HugeTLB
> > + pages.
> > +
>
> Hm. In other thread, KMEM accounting is discussed. There is 2 proposals and
> - 1st is accounting only reclaimable slabs (as dcache etc.)
> - 2nd is accounting all slab allocations.
>
> Here, 2nd one includes NORECLAIM kmem cache. (Discussion is not ended.)
>
> So, for your developments, How about MEM_RES_CTLR_HUGEPAGE ?
Frankly, I didn't like the noreclaim name, and I also didn't want to indicate
HUGEPAGE, because the code doesn't make any huge page assumptions.
>
> With your design, memory.hugetlb.XX.xxxxxx files are added and
> the interface is divided from other noreclaim mems. If necessary,
> we can change the scope of the config without affecting user space by refactoring
> after having these features.
>
> BTW, please set default n because this will forbid rmdir of cgroup.
>
Will do
>
>
> > config CGROUP_MEM_RES_CTLR_SWAP
> > bool "Memory Resource Controller Swap Extension"
> > depends on CGROUP_MEM_RES_CTLR && SWAP
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6728a7a..b00d028 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -49,6 +49,7 @@
> > #include <linux/page_cgroup.h>
> > #include <linux/cpu.h>
> > #include <linux/oom.h>
> > +#include <linux/region.h>
> > #include "internal.h"
> > #include <net/sock.h>
> > #include <net/tcp_memcontrol.h>
> > @@ -214,6 +215,11 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> > static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> >
> > /*
> > + * Currently only hugetlbfs pages are tracked using no reclaim
> > + * resource count. So we need only MAX_HSTATE res counter
> > + */
> > +#define MEMCG_MAX_NORECLAIM HUGE_MAX_HSTATE
> > +/*
> > * The memory controller data structure. The memory controller controls both
> > * page cache and RSS per cgroup. We would eventually like to provide
> > * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > @@ -235,6 +241,11 @@ struct mem_cgroup {
> > */
> > struct res_counter memsw;
> > /*
> > + * the counter to account for non reclaim resources
> > + * like hugetlb pages
> > + */
> > + struct res_counter no_rcl_res[MEMCG_MAX_NORECLAIM];
>
> struct res_counter hugepages;
>
> will be ok.
>
My goal was to make this patch not mention hugepages, because
it doesn't really have any dependency on hugepages. That is one of the reasons
for adding MEMCG_MAX_NORECLAIM. Later, if we want another in-memory file system
(shmemfs) to limit resource usage in a similar fashion, we should be
able to reuse these memcg changes.
Maybe for this patchset I can make the changes you suggested, and later,
when we want to reuse the code, make it more generic?
>
> > + /*
> > * Per cgroup active and inactive list, similar to the
> > * per zone LRU lists.
> > */
> > @@ -4887,6 +4898,7 @@ err_cleanup:
> > static struct cgroup_subsys_state * __ref
> > mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > {
> > + int idx;
> > struct mem_cgroup *memcg, *parent;
> > long error = -ENOMEM;
> > int node;
> > @@ -4922,6 +4934,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > if (parent && parent->use_hierarchy) {
> > res_counter_init(&memcg->res, &parent->res);
> > res_counter_init(&memcg->memsw, &parent->memsw);
> > + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> > + res_counter_init(&memcg->no_rcl_res[idx],
> > + &parent->no_rcl_res[idx]);
> > + }
>
> You can remove this kind of loop and keep your implementation simple.
Can you explain this? How can we remove the loop? We want to track
each huge page size as a separate resource.
>
> > /*
> > * We increment refcnt of the parent to ensure that we can
> > * safely access it on res_counter_charge/uncharge.
> > @@ -4932,6 +4948,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > } else {
> > res_counter_init(&memcg->res, NULL);
> > res_counter_init(&memcg->memsw, NULL);
> > + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++)
> > + res_counter_init(&memcg->no_rcl_res[idx], NULL);
> > }
> > memcg->last_scanned_node = MAX_NUMNODES;
> > INIT_LIST_HEAD(&memcg->oom_notify);
> > @@ -4950,8 +4968,22 @@ free_out:
> > static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
> > struct cgroup *cont)
> > {
> > + int idx;
> > + u64 val;
> > struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > -
> > + /*
> > + * We don't allow a cgroup deletion if it have some
> > + * non reclaim resource charged against it. We can
> > + * update the charge list to point to parent cgroup
> > + * and allow this cgroup deletion here. But that
> > + * involve tracking all the chg list which have this
> > + * cgroup reference.
> > + */
>
> I don't like this..... until this is fixed, default should be "N".
Yes. Will update Kconfig
>
>
>
> > + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> > + val = res_counter_read_u64(&memcg->no_rcl_res[idx], RES_USAGE);
> > + if (val)
> > + return -EBUSY;
> > + }
> > return mem_cgroup_force_empty(memcg, false);
> > }
> >
> > @@ -5489,6 +5521,170 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> > }
> > #endif
> >
> > +#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
> > +/*
> > + * For supporting resource control on non reclaim pages like hugetlbfs
> > + * and ramfs, we enforce limit during mmap time. We also maintain
> > + * a chg list for these resource, which track the range alog with
> > + * memcg information. We need to have seperate chg_list for shared
> > + * and private mapping. Shared mapping are mostly maintained in
> > + * file inode and private mapping in vm_area_struct.
> > + */
> > +long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> > + unsigned long from, unsigned long to,
> > + int idx)
> > +{
> > + long chg;
> > + int ret = 0;
> > + unsigned long csize;
> > + struct mem_cgroup *memcg;
> > + struct res_counter *fail_res;
> > +
> > + /*
> > + * Get the task cgroup within rcu_readlock and also
> > + * get cgroup reference to make sure cgroup destroy won't
> > + * race with page_charge. We don't allow a cgroup destroy
> > + * when the cgroup have some charge against it
> > + */
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(current);
> > + css_get(&memcg->css);
>
> css_tryget() ?
>
Why ?
>
> > + rcu_read_unlock();
> > +
> > + chg = region_chg(chg_list, from, to, (unsigned long)memcg);
> > + if (chg < 0)
> > + goto err_out;
>
> I don't think all NORECLAIM objects have a 'region' ;)
> Making this dedicated to hugetlbfs will be good.
>
>
Will do. This should be no reclaim fs objects.
>
> > +
> > + if (mem_cgroup_is_root(memcg))
> > + goto err_out;
> > +
> > + csize = chg * PAGE_SIZE;
> > + ret = res_counter_charge(&memcg->no_rcl_res[idx], csize, &fail_res);
> > +
> > +err_out:
> > + /* Now that we have charged we can drop cgroup reference */
> > + css_put(&memcg->css);
> > + if (!ret)
> > + return chg;
> > +
> > + /* We don't worry about region_uncharge */
> > + return ret;
> > +}
> > +
> > +void mem_cgroup_noreclaim_uncharge(struct list_head *chg_list,
> > + int idx, unsigned long nr_pages)
> > +{
> > + struct mem_cgroup *memcg;
> > + unsigned long csize = nr_pages * PAGE_SIZE;
> > +
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(current);
> > +
> > + if (!mem_cgroup_is_root(memcg))
> > + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
>
> What happens when a task using hugetlb is moved to the root cgroup ?
>
When a task migrates we don't move the charge. So after the task gets moved
to the root cgroup, the resource counter will not get charged.
>
> > + rcu_read_unlock();
> > + /*
> > + * We could ideally remove zero size regions from
> > + * resv map hcg_regions here
> > + */
> > + return;
> > +}
> > +
> > +void mem_cgroup_commit_noreclaim_charge(struct list_head *chg_list,
> > + unsigned long from, unsigned long to)
> > +{
> > + struct mem_cgroup *memcg;
> > +
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(current);
> > + region_add(chg_list, from, to, (unsigned long)memcg);
> > + rcu_read_unlock();
> > + return;
> > +}
>
> Why we have both of try_charge() and charge ?
>
>
try_charge and commit_charge? We do the below sequence:
try_charge()
alloc_page()
commit_charge()
We actually do the memcg charging in try_charge, but the charge list is not
updated to record the region boundaries there. That is because, if
alloc_page() failed, we would not be able to undo the region boundary
update: we don't have enough information at that point to find which
region got added/updated by the previous try_charge(). Hence we split the
charging into two steps, try_charge and commit_charge, and update the region
boundaries only in commit_charge(). A minimal sketch of the resulting
caller-side flow is below.
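(Illustrative only: the helpers below are the ones this series adds, while
allocate_backing() is a made-up stand-in for whatever step may fail
between the two phases.)

static long charge_range(struct list_head *chg_list, int idx,
			 unsigned long from, unsigned long to)
{
	long chg;

	/* Phase 1: charge the res_counter; no region entry is recorded yet. */
	chg = mem_cgroup_try_noreclaim_charge(chg_list, from, to, idx);
	if (chg < 0)
		return chg;

	if (allocate_backing() < 0) {
		/* Only the counter needs undoing; nothing was added to the list. */
		mem_cgroup_noreclaim_uncharge(chg_list, idx, chg);
		return -ENOMEM;
	}

	/* Phase 2: nothing can fail any more, record the region boundaries. */
	mem_cgroup_commit_noreclaim_charge(chg_list, from, to);
	return chg;
}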
> > +
> > +long mem_cgroup_truncate_chglist_range(struct list_head *chg_list,
> > + unsigned long from, unsigned long to,
> > + int idx)
> > +{
> > + long chg = 0, csize;
> > + struct mem_cgroup *memcg;
> > + struct file_region *rg, *trg;
> > +
> > + /* Locate the region we are either in or before. */
> > + list_for_each_entry(rg, chg_list, link)
> > + if (from <= rg->to)
> > + break;
> > + if (&rg->link == chg_list)
> > + return 0;
> > +
> > + /* If we are in the middle of a region then adjust it. */
> > + if (from > rg->from) {
> > + if (to < rg->to) {
> > + struct file_region *nrg;
> > + /* rg->from from to rg->to */
> > + nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > + /*
> > + * If we fail to allocate we return the
> > + * with the 0 charge . Later a complete
> > + * truncate will reclaim the left over space
> > + */
> > + if (!nrg)
> > + return 0;
> > + nrg->from = to;
> > + nrg->to = rg->to;
> > + nrg->data = rg->data;
> > + INIT_LIST_HEAD(&nrg->link);
> > + list_add(&nrg->link, &rg->link);
> > +
> > + /* Adjust the rg entry */
> > + rg->to = from;
> > + chg = to - from;
> > + memcg = (struct mem_cgroup *)rg->data;
> > + if (!mem_cgroup_is_root(memcg)) {
> > + csize = chg * PAGE_SIZE;
> > + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> > + }
> > + return chg;
> > + }
> > + chg = rg->to - from;
> > + rg->to = from;
> > + memcg = (struct mem_cgroup *)rg->data;
> > + if (!mem_cgroup_is_root(memcg)) {
> > + csize = chg * PAGE_SIZE;
> > + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> > + }
> > + rg = list_entry(rg->link.next, typeof(*rg), link);
> > + }
> > + /* Drop any remaining regions till to */
> > + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> > + if (rg->from >= to)
> > + break;
> > + if (&rg->link == chg_list)
> > + break;
> > + if (rg->to > to) {
> > + /* rg->from to rg->to */
> > + chg += to - rg->from;
> > + rg->from = to;
> > + memcg = (struct mem_cgroup *)rg->data;
> > + if (!mem_cgroup_is_root(memcg)) {
> > + csize = (to - rg->from) * PAGE_SIZE;
> > + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> > + }
> > + return chg;
> > + }
> > + chg += rg->to - rg->from;
> > + memcg = (struct mem_cgroup *)rg->data;
> > + if (!mem_cgroup_is_root(memcg)) {
> > + csize = (rg->to - rg->from) * PAGE_SIZE;
> > + res_counter_uncharge(&memcg->no_rcl_res[idx], csize);
> > + }
> > + list_del(&rg->link);
> > + kfree(rg);
> > + }
> > + return chg;
> > +}
> > +#endif
> > +
>
> Can't we move most of this to region.c ?
The difference between this and region_truncate_range() is that we do the
memcg uncharge here. I am not sure we can move this code to region.c
because of the struct mem_cgroup dependency. What we can do is
update region_truncate() to take a callback that will be called before
a region is destroyed or shrunk, along the lines of the sketch below.
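Something like the following, purely as an illustration of the idea (the
extended signature and the names here are made up, not taken from the
posted patches):

/* region.c stays memcg-agnostic and reports every range it drops. */
typedef void (*region_drop_cb)(unsigned long from, unsigned long to,
			       unsigned long data, void *private);

long region_truncate_range(struct list_head *head, unsigned long from,
			   unsigned long to, region_drop_cb cb,
			   void *private);

/* The memcg side would then do the uncharge from the callback. */
static void noreclaim_drop_cb(unsigned long from, unsigned long to,
			      unsigned long data, void *private)
{
	struct mem_cgroup *memcg = (struct mem_cgroup *)data;
	int idx = *(int *)private;

	if (!mem_cgroup_is_root(memcg))
		res_counter_uncharge(&memcg->no_rcl_res[idx],
				     (to - from) * PAGE_SIZE);
}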
Thanks for the review,
-aneesh
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-02 3:28 ` David Gibson
@ 2012-03-04 18:09 ` Aneesh Kumar K.V
2012-03-06 2:38 ` David Gibson
0 siblings, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-04 18:09 UTC (permalink / raw)
To: David Gibson, Andrew Morton
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
hannes, linux-kernel, cgroups
On Fri, 2 Mar 2012 14:28:53 +1100, David Gibson <dwg@au1.ibm.com> wrote:
> On Thu, Mar 01, 2012 at 02:40:29PM -0800, Andrew Morton wrote:
> > On Thu, 1 Mar 2012 14:46:11 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >
> > > This patchset implements a memory controller extension to control
> > > HugeTLB allocations. It is similar to the existing hugetlb quota
> > > support in that, the limit is enforced at mmap(2) time and not at
> > > fault time. HugeTLB's quota mechanism limits the number of huge pages
> > > that can allocated per superblock.
> > >
> > > For shared mappings we track the regions mapped by a task along with the
> > > memcg. We keep the memory controller charged even after the task
> > > that did mmap(2) exits. Uncharge happens during truncate. For Private
> > > mappings we charge and uncharge from the current task cgroup.
> >
> > I haven't begin to get my head around this yet, but I'd like to draw
> > your attention to https://lkml.org/lkml/2012/2/15/548. That fix has
> > been hanging around for a while, but I haven't done anything with it
> > yet because I don't like its additional blurring of the separation
> > between hugetlb core code and hugetlbfs. I want to find time to sit
> > down and see if the fix can be better architected but haven't got
> > around to that yet.
>
> So.. that version of the fix I specifically rebuilt to address your
> concerns about that blurring - in fact I think it reduces the current
> layer blurring. I haven't had any reply - what problems do see it as
> still having?
>
https://lkml.org/lkml/2012/2/16/179 ?
That is a serious issue, isn't it?
-aneesh
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-02 5:48 ` KAMEZAWA Hiroyuki
@ 2012-03-04 18:14 ` Aneesh Kumar K.V
0 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-04 18:14 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Fri, 2 Mar 2012 14:48:28 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 1 Mar 2012 14:46:11 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > Hi,
> >
> > This patchset implements a memory controller extension to control
> > HugeTLB allocations. It is similar to the existing hugetlb quota
> > support in that, the limit is enforced at mmap(2) time and not at
> > fault time. HugeTLB's quota mechanism limits the number of huge pages
> > that can allocated per superblock.
> >
>
> Thank you, I think memcg-extension is better than hugetlbfs cgroup.
>
>
> > For shared mappings we track the regions mapped by a task along with the
> > memcg. We keep the memory controller charged even after the task
> > that did mmap(2) exits. Uncharge happens during truncate. For Private
> > mappings we charge and uncharge from the current task cgroup.
> >
>
> What "current" means here ? current task's cgroup ?
yes.
>
>
> > A sample strace output for an application doing malloc with hugectl is given
> > below. libhugetlbfs will fall back to normal pagesize if the HugeTLB mmap fails.
> >
> > open("/mnt/libhugetlbfs.tmp.uhLMgy", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> > unlink("/mnt/libhugetlbfs.tmp.uhLMgy") = 0
> >
> > .........
> >
> > mmap(0x20000000000, 50331648, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = -1 ENOMEM (Cannot allocate memory)
> > write(2, "libhugetlbfs", 12libhugetlbfs) = 12
> > write(2, ": WARNING: New heap segment map" ....
> > mmap(NULL, 42008576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfff946c0000
> > ....
> >
> >
> > Goals:
> >
> > 1) We want to keep the semantic closer to hugelb quota support. ie, we want
> > to extend quota semantics to a group of tasks. Currently hugetlb quota
> > mechanism allows one to control number of hugetlb pages allocated per
> > hugetlbfs superblock.
> >
> > 2) Applications using hugetlbfs always fallback to normal page size allocation when they
> > fail to allocate huge pages. libhugetlbfs internally handles this for malloc(3). We
> > want to retain this behaviour when we enforce the controller limit. ie, when huge page
> > allocation fails due to controller limit, applications should fallback to
> > allocation using normal page size. The above implies that we need to enforce
> > limit at mmap(2).
> >
>
> Hm, ok.
>
> > 3) HugeTLBfs doesn't support page reclaim. It also doesn't support write(2). Applications
> > use hugetlbfs via mmap(2) interface. Important point to note here is hugetlbfs
> > extends file size in mmap.
> >
> > With shared mappings, the file size gets extended in mmap and file will remain in hugetlbfs
> > consuming huge pages until it is truncated. We want to make sure we keep the controller
> > charged until the file is truncated. This implies, that the controller will be charged
> > even after the task that did mmap exit.
> >
>
> O.K. hugetlbfs is charged until the file is removed.
> Then, next question will be 'can we destory cgroup....'
That is explained later. We don't allow cgroup removal if its non-reclaim
resource usage is non-zero. But that restriction should be removed before
this can be merged.
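For reference, the check is conceptually just the following. A sketch only,
written against the no_rcl_res[] counters this series adds to struct
mem_cgroup; where exactly it hooks into cgroup removal, and the helper name,
are illustrative:

static int mem_cgroup_noreclaim_in_use(struct mem_cgroup *memcg)
{
	int idx;

	/*
	 * Any remaining charge on a non-reclaim counter means some
	 * hugetlbfs file or private mapping still holds pages against
	 * this cgroup, so its removal has to be refused for now.
	 */
	for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
		if (res_counter_read_u64(&memcg->no_rcl_res[idx], RES_USAGE))
			return 1;
	}
	return 0;
}

The cgroup removal path would refuse the rmdir while this returns non-zero;
lifting that restriction means reparenting the charge list instead.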
>
> > Implementation details:
> >
> > In order to achieve the above goals we need to track the cgroup information
> > along with mmap range in a charge list in inode for shared mapping and in
> > vm_area_struct for private mapping. We won't be using page to track cgroup
> > information because with the above goals we are not really tracking the pages used.
> >
> > Since we track cgroup in charge list, if we want to remove the cgroup, we need to update
> > the charge list to point to the parent cgroup. Currently we take the easy route
> > and prevent a cgroup removal if it's non reclaim resource usage is non zero.
> >
>
> As Andrew pointed out, there are some ongoing works about page-range tracking.
> Please check.
Will do
-aneesh
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-01 22:40 ` [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Andrew Morton
2012-03-02 3:28 ` David Gibson
@ 2012-03-04 19:15 ` Aneesh Kumar K.V
2012-03-05 13:56 ` Hillf Danton
1 sibling, 1 reply; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-04 19:15 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
hannes, linux-kernel, cgroups, David Gibson
On Thu, 1 Mar 2012 14:40:29 -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 1 Mar 2012 14:46:11 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > This patchset implements a memory controller extension to control
> > HugeTLB allocations. It is similar to the existing hugetlb quota
> > support in that, the limit is enforced at mmap(2) time and not at
> > fault time. HugeTLB's quota mechanism limits the number of huge pages
> > that can allocated per superblock.
> >
> > For shared mappings we track the regions mapped by a task along with the
> > memcg. We keep the memory controller charged even after the task
> > that did mmap(2) exits. Uncharge happens during truncate. For Private
> > mappings we charge and uncharge from the current task cgroup.
>
> I haven't begin to get my head around this yet, but I'd like to draw
> your attention to https://lkml.org/lkml/2012/2/15/548.
Hmm, that's a really serious bug.
> That fix has
> been hanging around for a while, but I haven't done anything with it
> yet because I don't like its additional blurring of the separation
> between hugetlb core code and hugetlbfs. I want to find time to sit
> down and see if the fix can be better architected but haven't got
> around to that yet.
>
> I expect that your patches will conflict at least mechanically with
> David's, which is not a big issue. But I wonder whether your patches
> will copy the same bug into other places, and whether you can think of
> a tidier way of addressing the bug which David is seeing?
>
I will go through the implementation again and make sure the problem
explained by David doesn't happen in the new code path added by the
patch series.
Thanks
-aneesh
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-04 19:15 ` Aneesh Kumar K.V
@ 2012-03-05 13:56 ` Hillf Danton
2012-03-06 14:05 ` Aneesh Kumar K.V
0 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2012-03-05 13:56 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, aarcange,
mhocko, hannes, linux-kernel, cgroups, David Gibson
On Mon, Mar 5, 2012 at 3:15 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Thu, 1 Mar 2012 14:40:29 -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
>> I haven't begin to get my head around this yet, but I'd like to draw
>> your attention to https://lkml.org/lkml/2012/2/15/548.
>
> Hmm that's really serious bug.
>
>> That fix has
>> been hanging around for a while, but I haven't done anything with it
>> yet because I don't like its additional blurring of the separation
>> between hugetlb core code and hugetlbfs. I want to find time to sit
>> down and see if the fix can be better architected but haven't got
>> around to that yet.
>>
>> I expect that your patches will conflict at least mechanically with
>> David's, which is not a big issue. But I wonder whether your patches
>> will copy the same bug into other places, and whether you can think of
>> a tidier way of addressing the bug which David is seeing?
>>
>
> I will go through the implementation again and make sure the problem
> explained by David doesn't happen in the new code path added by the
> patch series.
>
Hi Aneesh
When you tackle that problem, please also take the following approach into
account. It is only a draft: if the problem is caused by an extra reference
count, the quota handback is simply eliminated when a huge page is freed,
and get_quota is carefully paired with put_quota for each newly allocated
page. That is all, and feel free to correct me.
Best Regards
-hd
--- a/mm/hugetlb.c Mon Mar 5 20:20:34 2012
+++ b/mm/hugetlb.c Mon Mar 5 21:20:14 2012
@@ -533,9 +533,7 @@ static void free_huge_page(struct page *
*/
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
- struct address_space *mapping;
- mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
page->mapping = NULL;
BUG_ON(page_count(page));
@@ -551,8 +549,6 @@ static void free_huge_page(struct page *
enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
- if (mapping)
- hugetlb_put_quota(mapping, 1);
}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1021,7 +1017,8 @@ static void vma_commit_reservation(struc
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
+ unsigned long addr, int avoid_reserve,
+ long *quota)
{
struct hstate *h = hstate_vma(vma);
struct page *page;
@@ -1050,7 +1047,8 @@ static struct page *alloc_huge_page(stru
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
- hugetlb_put_quota(inode->i_mapping, chg);
+ if (chg)
+ hugetlb_put_quota(inode->i_mapping, chg);
return ERR_PTR(-VM_FAULT_SIGBUS);
}
}
@@ -1058,6 +1056,8 @@ static struct page *alloc_huge_page(stru
set_page_private(page, (unsigned long) mapping);
vma_commit_reservation(h, vma, addr);
+ if (quota)
+ *quota = chg;
return page;
}
@@ -2365,6 +2365,7 @@ static int hugetlb_cow(struct mm_struct
struct page *old_page, *new_page;
int avoidcopy;
int outside_reserve = 0;
+ long quota = 0;
old_page = pte_page(pte);
@@ -2397,7 +2398,8 @@ retry_avoidcopy:
/* Drop page_table_lock as buddy allocator may be called */
spin_unlock(&mm->page_table_lock);
- new_page = alloc_huge_page(vma, address, outside_reserve);
+ quota = 0;
+ new_page = alloc_huge_page(vma, address, outside_reserve, &quota);
if (IS_ERR(new_page)) {
page_cache_release(old_page);
@@ -2439,6 +2441,8 @@ retry_avoidcopy:
if (unlikely(anon_vma_prepare(vma))) {
page_cache_release(new_page);
page_cache_release(old_page);
+ if (quota)
+ hugetlb_put_quota(vma->vm_file->f_mapping, quota);
/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
return VM_FAULT_OOM;
@@ -2470,6 +2474,8 @@ retry_avoidcopy:
address & huge_page_mask(h),
(address & huge_page_mask(h)) + huge_page_size(h));
}
+ else if (quota)
+ hugetlb_put_quota(vma->vm_file->f_mapping, quota);
page_cache_release(new_page);
page_cache_release(old_page);
return 0;
@@ -2519,6 +2525,7 @@ static int hugetlb_no_page(struct mm_str
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ long quota = 0;
/*
* Currently, we are forced to kill the process in the event the
@@ -2540,12 +2547,13 @@ static int hugetlb_no_page(struct mm_str
* before we get page_table_lock.
*/
retry:
+ quota = 0;
page = find_lock_page(mapping, idx);
if (!page) {
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
- page = alloc_huge_page(vma, address, 0);
+ page = alloc_huge_page(vma, address, 0, &quota);
if (IS_ERR(page)) {
ret = -PTR_ERR(page);
goto out;
@@ -2560,6 +2568,8 @@ retry:
err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
if (err) {
put_page(page);
+ if (quota)
+ hugetlb_put_quota(mapping, quota);
if (err == -EEXIST)
goto retry;
goto out;
@@ -2633,6 +2643,8 @@ backout:
backout_unlocked:
unlock_page(page);
put_page(page);
+ if (quota)
+ hugetlb_put_quota(mapping, quota);
goto out;
}
--
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-04 18:09 ` Aneesh Kumar K.V
@ 2012-03-06 2:38 ` David Gibson
0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2012-03-06 2:38 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, dhillf,
aarcange, mhocko, hannes, linux-kernel, cgroups
On Sun, Mar 04, 2012 at 11:39:23PM +0530, Aneesh Kumar K.V wrote:
> On Fri, 2 Mar 2012 14:28:53 +1100, David Gibson <dwg@au1.ibm.com> wrote:
> > On Thu, Mar 01, 2012 at 02:40:29PM -0800, Andrew Morton wrote:
> > > On Thu, 1 Mar 2012 14:46:11 +0530
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > >
> > > > This patchset implements a memory controller extension to control
> > > > HugeTLB allocations. It is similar to the existing hugetlb quota
> > > > support in that, the limit is enforced at mmap(2) time and not at
> > > > fault time. HugeTLB's quota mechanism limits the number of huge pages
> > > > that can allocated per superblock.
> > > >
> > > > For shared mappings we track the regions mapped by a task along with the
> > > > memcg. We keep the memory controller charged even after the task
> > > > that did mmap(2) exits. Uncharge happens during truncate. For Private
> > > > mappings we charge and uncharge from the current task cgroup.
> > >
> > > I haven't begin to get my head around this yet, but I'd like to draw
> > > your attention to https://lkml.org/lkml/2012/2/15/548. That fix has
> > > been hanging around for a while, but I haven't done anything with it
> > > yet because I don't like its additional blurring of the separation
> > > between hugetlb core code and hugetlbfs. I want to find time to sit
> > > down and see if the fix can be better architected but haven't got
> > > around to that yet.
> >
> > So.. that version of the fix I specifically rebuilt to address your
> > concerns about that blurring - in fact I think it reduces the current
> > layer blurring. I haven't had any reply - what problems do see it as
> > still having?
>
> https://lkml.org/lkml/2012/2/16/179 ?
Ah. Missed that reply somehow. Odd. Replied now and I'll respin
accordingly.
> That is a serious issue, isn't it?
Yes, it is. And it's been around for a long, long time.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH -V2 0/9] memcg: add HugeTLB resource tracking
2012-03-05 13:56 ` Hillf Danton
@ 2012-03-06 14:05 ` Aneesh Kumar K.V
0 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-06 14:05 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, aarcange,
mhocko, hannes, linux-kernel, cgroups, David Gibson
On Mon, 5 Mar 2012 21:56:50 +0800, Hillf Danton <dhillf@gmail.com> wrote:
> On Mon, Mar 5, 2012 at 3:15 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > On Thu, 1 Mar 2012 14:40:29 -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> >> I haven't begin to get my head around this yet, but I'd like to draw
> >> your attention to https://lkml.org/lkml/2012/2/15/548.
> >
> > Hmm that's really serious bug.
> >
> >> That fix has
> >> been hanging around for a while, but I haven't done anything with it
> >> yet because I don't like its additional blurring of the separation
> >> between hugetlb core code and hugetlbfs. I want to find time to sit
> >> down and see if the fix can be better architected but haven't got
> >> around to that yet.
> >>
> >> I expect that your patches will conflict at least mechanically with
> >> David's, which is not a big issue. But I wonder whether your patches
> >> will copy the same bug into other places, and whether you can think of
> >> a tidier way of addressing the bug which David is seeing?
> >>
> >
> > I will go through the implementation again and make sure the problem
> > explained by David doesn't happen in the new code path added by the
> > patch series.
> >
> Hi Aneesh
>
> When you tackle that problem, please also take the following approach into
> account. It is only a draft: if the problem is caused by an extra reference
> count, the quota handback is simply eliminated when a huge page is freed,
> and get_quota is carefully paired with put_quota for each newly allocated
> page. That is all, and feel free to correct me.
But we should not do the put_quota until the page is added back to the free
pool, right? Otherwise the quota subsystem (the actual hugetlb pool) will
indicate availability whereas the file system won't have any free pages.
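The ordering I mean is roughly the following. A simplified sketch of
free_huge_page() (surplus-page handling elided), with the quota handed back
only once the page is visible in the pool again:

static void free_huge_page_sketch(struct page *page)
{
	struct hstate *h = page_hstate(page);
	struct address_space *mapping;

	mapping = (struct address_space *)page_private(page);
	set_page_private(page, 0);
	page->mapping = NULL;

	spin_lock(&hugetlb_lock);
	enqueue_huge_page(h, page);	/* the page is back in the free pool... */
	spin_unlock(&hugetlb_lock);

	if (mapping)			/* ...only now return the quota */
		hugetlb_put_quota(mapping, 1);
}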
-aneesh
>
> Best Regards
> -hd
>
> --- a/mm/hugetlb.c Mon Mar 5 20:20:34 2012
> +++ b/mm/hugetlb.c Mon Mar 5 21:20:14 2012
> @@ -533,9 +533,7 @@ static void free_huge_page(struct page *
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
>
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -551,8 +549,6 @@ static void free_huge_page(struct page *
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1021,7 +1017,8 @@ static void vma_commit_reservation(struc
> }
>
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve)
> + unsigned long addr, int avoid_reserve,
> + long *quota)
> {
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> @@ -1050,7 +1047,8 @@ static struct page *alloc_huge_page(stru
> if (!page) {
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> if (!page) {
> - hugetlb_put_quota(inode->i_mapping, chg);
> + if (chg)
> + hugetlb_put_quota(inode->i_mapping, chg);
> return ERR_PTR(-VM_FAULT_SIGBUS);
> }
> }
> @@ -1058,6 +1056,8 @@ static struct page *alloc_huge_page(stru
> set_page_private(page, (unsigned long) mapping);
>
> vma_commit_reservation(h, vma, addr);
> + if (quota)
> + *quota = chg;
>
> return page;
> }
> @@ -2365,6 +2365,7 @@ static int hugetlb_cow(struct mm_struct
> struct page *old_page, *new_page;
> int avoidcopy;
> int outside_reserve = 0;
> + long quota = 0;
>
> old_page = pte_page(pte);
>
> @@ -2397,7 +2398,8 @@ retry_avoidcopy:
>
> /* Drop page_table_lock as buddy allocator may be called */
> spin_unlock(&mm->page_table_lock);
> - new_page = alloc_huge_page(vma, address, outside_reserve);
> + quota = 0;
> + new_page = alloc_huge_page(vma, address, outside_reserve, &quota);
>
> if (IS_ERR(new_page)) {
> page_cache_release(old_page);
> @@ -2439,6 +2441,8 @@ retry_avoidcopy:
> if (unlikely(anon_vma_prepare(vma))) {
> page_cache_release(new_page);
> page_cache_release(old_page);
> + if (quota)
> + hugetlb_put_quota(vma->vm_file->f_mapping, quota);
> /* Caller expects lock to be held */
> spin_lock(&mm->page_table_lock);
> return VM_FAULT_OOM;
> @@ -2470,6 +2474,8 @@ retry_avoidcopy:
> address & huge_page_mask(h),
> (address & huge_page_mask(h)) + huge_page_size(h));
> }
> + else if (quota)
> + hugetlb_put_quota(vma->vm_file->f_mapping, quota);
> page_cache_release(new_page);
> page_cache_release(old_page);
> return 0;
> @@ -2519,6 +2525,7 @@ static int hugetlb_no_page(struct mm_str
> struct page *page;
> struct address_space *mapping;
> pte_t new_pte;
> + long quota = 0;
>
> /*
> * Currently, we are forced to kill the process in the event the
> @@ -2540,12 +2547,13 @@ static int hugetlb_no_page(struct mm_str
> * before we get page_table_lock.
> */
> retry:
> + quota = 0;
> page = find_lock_page(mapping, idx);
> if (!page) {
> size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto out;
> - page = alloc_huge_page(vma, address, 0);
> + page = alloc_huge_page(vma, address, 0, &quota);
> if (IS_ERR(page)) {
> ret = -PTR_ERR(page);
> goto out;
> @@ -2560,6 +2568,8 @@ retry:
> err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
> if (err) {
> put_page(page);
> + if (quota)
> + hugetlb_put_quota(mapping, quota);
> if (err == -EEXIST)
> goto retry;
> goto out;
> @@ -2633,6 +2643,8 @@ backout:
> backout_unlocked:
> unlock_page(page);
> put_page(page);
> + if (quota)
> + hugetlb_put_quota(mapping, quota);
> goto out;
> }
>
> --
>
* Re: [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg
2012-03-04 18:07 ` Aneesh Kumar K.V
@ 2012-03-08 5:56 ` KAMEZAWA Hiroyuki
2012-03-08 11:48 ` Aneesh Kumar K.V
0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-08 5:56 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Sun, 04 Mar 2012 23:37:22 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Fri, 2 Mar 2012 17:38:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 1 Mar 2012 14:46:15 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >
> > > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >
> > > + help
> > > + Add non reclaim resource management to memory resource controller.
> > > + Currently only HugeTLB pages will be managed using this extension.
> > > + The controller limit is enforced during mmap(2), so that
> > > + application can fall back to allocations using smaller page size
> > > + if the memory controller limit prevented them from allocating HugeTLB
> > > + pages.
> > > +
> >
> > Hm. In other thread, KMEM accounting is discussed. There is 2 proposals and
> > - 1st is accounting only reclaimable slabs (as dcache etc.)
> > - 2nd is accounting all slab allocations.
> >
> > Here, 2nd one includes NORECLAIM kmem cache. (Discussion is not ended.)
> >
> > So, for your developments, How about MEM_RES_CTLR_HUGEPAGE ?
>
> Frankly I didn't like the noreclaim name, I also didn't want to indicate
> HUGEPAGE, because the code doesn't make any huge page assumption.
You can add this config for HUGEPAGE interfaces.
Later we can sort out other configs.
> >
> >
> > > config CGROUP_MEM_RES_CTLR_SWAP
> > > bool "Memory Resource Controller Swap Extension"
> > > depends on CGROUP_MEM_RES_CTLR && SWAP
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 6728a7a..b00d028 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -49,6 +49,7 @@
> > > #include <linux/page_cgroup.h>
> > > #include <linux/cpu.h>
> > > #include <linux/oom.h>
> > > +#include <linux/region.h>
> > > #include "internal.h"
> > > #include <net/sock.h>
> > > #include <net/tcp_memcontrol.h>
> > > @@ -214,6 +215,11 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> > > static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> > >
> > > /*
> > > + * Currently only hugetlbfs pages are tracked using no reclaim
> > > + * resource count. So we need only MAX_HSTATE res counter
> > > + */
> > > +#define MEMCG_MAX_NORECLAIM HUGE_MAX_HSTATE
> > > +/*
> > > * The memory controller data structure. The memory controller controls both
> > > * page cache and RSS per cgroup. We would eventually like to provide
> > > * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > > @@ -235,6 +241,11 @@ struct mem_cgroup {
> > > */
> > > struct res_counter memsw;
> > > /*
> > > + * the counter to account for non reclaim resources
> > > + * like hugetlb pages
> > > + */
> > > + struct res_counter no_rcl_res[MEMCG_MAX_NORECLAIM];
> >
> > struct res_counter hugepages;
> >
> > will be ok.
> >
>
> My goal was to make this patch not mention hugepages, because it doesn't
> really have any dependency on hugepages. That is one of the reasons for
> adding MEMCG_MAX_NORECLAIM. Later, if we want other in-memory file systems
> (shmemfs) to limit resource usage in a similar fashion, we should be able
> to reuse these memcg changes.
>
> Maybe for this patchset I can make the changes you suggested, and later,
> when we want to reuse the code, make it more generic?
>
Yes. If there is no user interface change, an internal code change will be welcome.
>
> >
> > > + /*
> > > * Per cgroup active and inactive list, similar to the
> > > * per zone LRU lists.
> > > */
> > > @@ -4887,6 +4898,7 @@ err_cleanup:
> > > static struct cgroup_subsys_state * __ref
> > > mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > > {
> > > + int idx;
> > > struct mem_cgroup *memcg, *parent;
> > > long error = -ENOMEM;
> > > int node;
> > > @@ -4922,6 +4934,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > > if (parent && parent->use_hierarchy) {
> > > res_counter_init(&memcg->res, &parent->res);
> > > res_counter_init(&memcg->memsw, &parent->memsw);
> > > + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> > > + res_counter_init(&memcg->no_rcl_res[idx],
> > > + &parent->no_rcl_res[idx]);
> > > + }
> >
> > You can remove this kinds of loop and keep your implemenation simple.
>
>
> Can you explain this? How can we remove the loop? We want to track
> each huge page size as a separate resource.
>
Ah, sorry. I missed it. Please ignore.
> > > +long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> > > + unsigned long from, unsigned long to,
> > > + int idx)
> > > +{
> > > + long chg;
> > > + int ret = 0;
> > > + unsigned long csize;
> > > + struct mem_cgroup *memcg;
> > > + struct res_counter *fail_res;
> > > +
> > > + /*
> > > + * Get the task cgroup within rcu_readlock and also
> > > + * get cgroup reference to make sure cgroup destroy won't
> > > + * race with page_charge. We don't allow a cgroup destroy
> > > + * when the cgroup have some charge against it
> > > + */
> > > + rcu_read_lock();
> > > + memcg = mem_cgroup_from_task(current);
> > > + css_get(&memcg->css);
> >
> > css_tryget() ?
> >
>
>
> Why ?
>
current<->cgroup relationship isn't under any locks. So, we do speculative
access with rcu_read_lock() and css_tryget().
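The usual pattern is roughly the following; a sketch only, the helper name is
made up and the retry-versus-fallback policy depends on the call site:

static struct mem_cgroup *get_current_memcg(void)
{
	struct mem_cgroup *memcg;

	/*
	 * current's cgroup can change, and the cgroup can be destroyed,
	 * at any time. Look it up under rcu_read_lock() and keep it only
	 * if css_tryget() succeeds; otherwise retry.
	 */
again:
	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
	if (!css_tryget(&memcg->css)) {
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();
	return memcg;
}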
Thanks,
-Kame
* Re: [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg
2012-03-08 5:56 ` KAMEZAWA Hiroyuki
@ 2012-03-08 11:48 ` Aneesh Kumar K.V
0 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2012-03-08 11:48 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, mgorman, dhillf, aarcange, mhocko, akpm, hannes,
linux-kernel, cgroups
On Thu, 8 Mar 2012 14:56:28 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sun, 04 Mar 2012 23:37:22 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> > On Fri, 2 Mar 2012 17:38:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Thu, 1 Mar 2012 14:46:15 +0530
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > >
> > > > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> > >
> > > > + help
> > > > + Add non reclaim resource management to memory resource controller.
> > > > + Currently only HugeTLB pages will be managed using this extension.
> > > > + The controller limit is enforced during mmap(2), so that
> > > > + application can fall back to allocations using smaller page size
> > > > + if the memory controller limit prevented them from allocating HugeTLB
> > > > + pages.
> > > > +
> > >
> > > Hm. In other thread, KMEM accounting is discussed. There is 2 proposals and
> > > - 1st is accounting only reclaimable slabs (as dcache etc.)
> > > - 2nd is accounting all slab allocations.
> > >
> > > Here, 2nd one includes NORECLAIM kmem cache. (Discussion is not ended.)
> > >
> > > So, for your developments, How about MEM_RES_CTLR_HUGEPAGE ?
> >
> > Frankly I didn't like the noreclaim name, I also didn't want to indicate
> > HUGEPAGE, because the code doesn't make any huge page assumption.
>
> You can add this config for HUGEPAGE interfaces.
> Later we can sort out other configs.
>
Will do
>
> > >
> > >
> > > > config CGROUP_MEM_RES_CTLR_SWAP
> > > > bool "Memory Resource Controller Swap Extension"
> > > > depends on CGROUP_MEM_RES_CTLR && SWAP
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 6728a7a..b00d028 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -49,6 +49,7 @@
> > > > #include <linux/page_cgroup.h>
> > > > #include <linux/cpu.h>
> > > > #include <linux/oom.h>
> > > > +#include <linux/region.h>
> > > > #include "internal.h"
> > > > #include <net/sock.h>
> > > > #include <net/tcp_memcontrol.h>
> > > > @@ -214,6 +215,11 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> > > > static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> > > >
> > > > /*
> > > > + * Currently only hugetlbfs pages are tracked using no reclaim
> > > > + * resource count. So we need only MAX_HSTATE res counter
> > > > + */
> > > > +#define MEMCG_MAX_NORECLAIM HUGE_MAX_HSTATE
> > > > +/*
> > > > * The memory controller data structure. The memory controller controls both
> > > > * page cache and RSS per cgroup. We would eventually like to provide
> > > > * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> > > > @@ -235,6 +241,11 @@ struct mem_cgroup {
> > > > */
> > > > struct res_counter memsw;
> > > > /*
> > > > + * the counter to account for non reclaim resources
> > > > + * like hugetlb pages
> > > > + */
> > > > + struct res_counter no_rcl_res[MEMCG_MAX_NORECLAIM];
> > >
> > > struct res_counter hugepages;
> > >
> > > will be ok.
> > >
> >
> > My goal was to make this patch not mention hugepages, because it doesn't
> > really have any dependency on hugepages. That is one of the reasons for
> > adding MEMCG_MAX_NORECLAIM. Later, if we want other in-memory file systems
> > (shmemfs) to limit resource usage in a similar fashion, we should be able
> > to reuse these memcg changes.
> >
> > Maybe for this patchset I can make the changes you suggested, and later,
> > when we want to reuse the code, make it more generic?
> >
>
> Yes. If there is no user interface change, an internal code change will be welcome.
>
Ok
>
> >
> > >
> > > > + /*
> > > > * Per cgroup active and inactive list, similar to the
> > > > * per zone LRU lists.
> > > > */
> > > > @@ -4887,6 +4898,7 @@ err_cleanup:
> > > > static struct cgroup_subsys_state * __ref
> > > > mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > > > {
> > > > + int idx;
> > > > struct mem_cgroup *memcg, *parent;
> > > > long error = -ENOMEM;
> > > > int node;
> > > > @@ -4922,6 +4934,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> > > > if (parent && parent->use_hierarchy) {
> > > > res_counter_init(&memcg->res, &parent->res);
> > > > res_counter_init(&memcg->memsw, &parent->memsw);
> > > > + for (idx = 0; idx < MEMCG_MAX_NORECLAIM; idx++) {
> > > > + res_counter_init(&memcg->no_rcl_res[idx],
> > > > + &parent->no_rcl_res[idx]);
> > > > + }
> > >
> > > You can remove this kinds of loop and keep your implemenation simple.
> >
> >
> > Can you explain this? How can we remove the loop? We want to track
> > each huge page size as a separate resource.
> >
> Ah, sorry. I missed it. Please ignore.
>
>
>
> > > > +long mem_cgroup_try_noreclaim_charge(struct list_head *chg_list,
> > > > + unsigned long from, unsigned long to,
> > > > + int idx)
> > > > +{
> > > > + long chg;
> > > > + int ret = 0;
> > > > + unsigned long csize;
> > > > + struct mem_cgroup *memcg;
> > > > + struct res_counter *fail_res;
> > > > +
> > > > + /*
> > > > + * Get the task cgroup within rcu_readlock and also
> > > > + * get cgroup reference to make sure cgroup destroy won't
> > > > + * race with page_charge. We don't allow a cgroup destroy
> > > > + * when the cgroup have some charge against it
> > > > + */
> > > > + rcu_read_lock();
> > > > + memcg = mem_cgroup_from_task(current);
> > > > + css_get(&memcg->css);
> > >
> > > css_tryget() ?
> > >
> >
> >
> > Why ?
> >
>
> current<->cgroup relationship isn't under any locks. So, we do speculative
> access with rcu_read_lock() and css_tryget().
>
Will update.
Right now I am redoing the patch to see if enforcing the limit during page
fault (alloc_huge_page()) and page free (free_huge_page()) simplifies the
patchset. The only problem with that approach is that the application should
have a clear idea of its hugepage usage, or else enforcing the limit at fault
time will result in the application getting SIGBUS. But otherwise the approach
simplifies cgroup removal and brings the code much closer to the memcg design.
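Roughly, the fault-time flow would look like this. A sketch only;
hugetlb_cgroup_charge() and hugetlb_cgroup_uncharge() are placeholder names
for whatever helpers the reworked patch ends up adding:

static struct page *fault_time_charged_alloc(struct vm_area_struct *vma,
					     unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	struct page *page;

	/* Charge one huge page against the faulting task's cgroup. */
	if (hugetlb_cgroup_charge(h))		/* placeholder helper */
		return ERR_PTR(-VM_FAULT_SIGBUS);

	page = alloc_huge_page(vma, addr, 0);
	if (IS_ERR(page)) {
		hugetlb_cgroup_uncharge(h);	/* placeholder helper */
		return page;
	}
	/*
	 * The charged cgroup would be remembered on the page itself so
	 * that free_huge_page() can uncharge when the page goes back to
	 * the pool.
	 */
	return page;
}

If the charge fails, the fault path returns VM_FAULT_SIGBUS, which is the
trade-off mentioned above.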
-aneesh
* Re: [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping
2012-03-01 9:16 ` [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping Aneesh Kumar K.V
@ 2012-05-17 23:16 ` Darrick J. Wong
0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2012-05-17 23:16 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, aarcange, mhocko,
akpm, hannes, linux-kernel, cgroups
On Thu, Mar 01, 2012 at 02:46:17PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> For private mapping we always charge/uncharge from the current task memcg.
> Charging happens during mmap(2) and uncharge happens during the
> vm_operations->close. For child task after fork the charging happens
> during fault time in alloc_huge_page. We also need to make sure for private
> mapping each vma for hugeTLB mapping have struct resv_map allocated so that we
> can store the charge list in resv_map.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
> mm/hugetlb.c | 176 +++++++++++++++++++++++++++++++++------------------------
> 1 files changed, 102 insertions(+), 74 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 664c663..2d99d0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -126,6 +126,22 @@ long hugetlb_truncate_cgroup(struct hstate *h,
> #endif
> }
>
> +long hugetlb_truncate_cgroup_range(struct hstate *h,
> + struct list_head *head, long from, long end)
> +{
> +#ifdef CONFIG_MEM_RES_CTLR_NORECLAIM
> + long chg;
> + from = from << huge_page_order(h);
> + end = end << huge_page_order(h);
> + chg = mem_cgroup_truncate_chglist_range(head, from, end, h - hstates);
> + if (chg > 0)
> + return chg >> huge_page_order(h);
> + return chg;
> +#else
> + return region_truncate_range(head, from, end);
> +#endif
> +}
> +
> /*
> * Convert the address within this vma to the page offset within
> * the mapping, in pagecache page units; huge pages here.
> @@ -229,13 +245,19 @@ static struct resv_map *resv_map_alloc(void)
> return resv_map;
> }
>
> -static void resv_map_release(struct kref *ref)
> +static unsigned long resv_map_release(struct hstate *h,
> + struct resv_map *resv_map)
> {
> - struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
> -
> - /* Clear out any active regions before we release the map. */
> - region_truncate(&resv_map->regions, 0);
> + unsigned long reserve;
> + /*
> + * We should not have any regions left here, if we were able to
> + * do memory allocation when in trunage_cgroup_range.
> + *
> + * Clear out any active regions before we release the map
> + */
> + reserve = hugetlb_truncate_cgroup(h, &resv_map->regions, 0);
> kfree(resv_map);
> + return reserve;
> }
>
> static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
> @@ -447,9 +469,7 @@ static void free_huge_page(struct page *page)
> */
> struct hstate *h = page_hstate(page);
> int nid = page_to_nid(page);
> - struct address_space *mapping;
>
> - mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> page->mapping = NULL;
> BUG_ON(page_count(page));
> @@ -465,8 +485,6 @@ static void free_huge_page(struct page *page)
> enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> - if (mapping)
> - hugetlb_put_quota(mapping, 1);
> }
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -887,63 +905,74 @@ static void return_unused_surplus_pages(struct hstate *h,
> * No action is required on failure.
> */
> static long vma_needs_reservation(struct hstate *h,
> - struct vm_area_struct *vma, unsigned long addr)
> + struct vm_area_struct *vma,
> + unsigned long addr)
> {
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> struct address_space *mapping = vma->vm_file->f_mapping;
> struct inode *inode = mapping->host;
>
> +
> if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> return hugetlb_page_charge(&inode->i_mapping->private_list,
> h, idx, idx + 1);
> } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - return 1;
> -
> - } else {
> - long err;
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *reservations = vma_resv_map(vma);
> -
> - err = region_chg(&reservations->regions, idx, idx + 1, 0);
> - if (err < 0)
> - return err;
> - return 0;
> + struct resv_map *resv_map = vma_resv_map(vma);
> + if (!resv_map) {
> + /*
> + * We didn't allocate resv_map for this vma.
> + * Allocate it here.
> + */
> + resv_map = resv_map_alloc();
> + if (!resv_map)
> + return -ENOMEM;
> + set_vma_resv_map(vma, resv_map);
> + }
> + return hugetlb_page_charge(&resv_map->regions,
> + h, idx, idx + 1);
> }
> + /*
> + * We did the private page charging in mmap call
> + */
> + return 0;
> }
>
> static void vma_uncharge_reservation(struct hstate *h,
> struct vm_area_struct *vma,
> unsigned long chg)
> {
> + int idx = h - hstates;
> + struct list_head *region_list;
> struct address_space *mapping = vma->vm_file->f_mapping;
> struct inode *inode = mapping->host;
>
>
> - if (vma->vm_flags & VM_MAYSHARE) {
> - return hugetlb_page_uncharge(&inode->i_mapping->private_list,
> - h - hstates,
> - chg << huge_page_order(h));
> + if (vma->vm_flags & VM_MAYSHARE)
> + region_list = &inode->i_mapping->private_list;
> + else {
> + struct resv_map *resv_map = vma_resv_map(vma);
> + region_list = &resv_map->regions;
> }
> + return hugetlb_page_uncharge(region_list,
> + idx, chg << huge_page_order(h));
> }
>
> static void vma_commit_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> -
> + struct list_head *region_list;
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> struct address_space *mapping = vma->vm_file->f_mapping;
> struct inode *inode = mapping->host;
>
> if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
> - idx, idx + 1);
> - } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> + region_list = &inode->i_mapping->private_list;
> + } else {
> struct resv_map *reservations = vma_resv_map(vma);
> -
> - /* Mark this page used in the map. */
> - region_add(&reservations->regions, idx, idx + 1, 0);
> + region_list = &reservations->regions;
> }
> + hugetlb_commit_page_charge(h, region_list, idx, idx + 1);
> + return;
> }
>
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> @@ -986,10 +1015,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> return ERR_PTR(-VM_FAULT_SIGBUS);
> }
> }
> -
> set_page_private(page, (unsigned long) mapping);
> -
> - vma_commit_reservation(h, vma, addr);
> + if (chg)
> + vma_commit_reservation(h, vma, addr);
> return page;
> }
>
> @@ -2001,25 +2029,40 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> */
> if (reservations)
> kref_get(&reservations->refs);
> + else if (!(vma->vm_flags & VM_MAYSHARE)) {
> + /*
> + * for non shared vma we need resv map to track
> + * hugetlb cgroup usage. Allocate it here. Charging
> + * the cgroup will take place in fault path.
> + */
> + struct resv_map *resv_map = resv_map_alloc();
> + /*
> + * If we fail to allocate resv_map here. We will allocate
> + * one when we do alloc_huge_page. So we don't handle
> + * ENOMEM here. The function also return void. So there is
> + * nothing much we can do.
> + */
> + if (resv_map)
> + set_vma_resv_map(vma, resv_map);
> + }
> }
>
> static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> {
> struct hstate *h = hstate_vma(vma);
> - struct resv_map *reservations = vma_resv_map(vma);
> - unsigned long reserve;
> - unsigned long start;
> - unsigned long end;
> + struct resv_map *resv_map = vma_resv_map(vma);
> + unsigned long reserve, start, end;
>
> - if (reservations) {
> + if (resv_map) {
> start = vma_hugecache_offset(h, vma, vma->vm_start);
> end = vma_hugecache_offset(h, vma, vma->vm_end);
>
> - reserve = (end - start) -
> - region_count(&reservations->regions, start, end);
> -
> - kref_put(&reservations->refs, resv_map_release);
> -
> + reserve = hugetlb_truncate_cgroup_range(h, &resv_map->regions,
> + start, end);
I wonder, what's the reason for removing the "(end - start) -" part of the
"reserve =" statements? region_count() and hugetlb_truncate_cgroup_range()
both seem to return the number of bytes (I think?) between start and end that
are mapped in that region list.
This change alters the definition of "reserve" from "bytes not mapped" to
"bytes mapped" with no corresponding change to the subsequent uses of reserve,
which would likely cause hugetlb_acct_memory() to be invoked at the wrong
times, I think.
<shrug> Does that make any sense?
--D
> + /* open coded kref_put */
> + if (atomic_sub_and_test(1, &resv_map->refs.refcount)) {
> + reserve += resv_map_release(h, resv_map);
> + }
> if (reserve) {
> hugetlb_acct_memory(h, -reserve);
> hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
> @@ -2803,8 +2846,9 @@ int hugetlb_reserve_pages(struct inode *inode,
> vm_flags_t vm_flags)
> {
> long ret, chg;
> + struct list_head *region_list;
> struct hstate *h = hstate_inode(inode);
> -
> + struct resv_map *resv_map = NULL;
> /*
> * Only apply hugepage reservation if asked. At fault time, an
> * attempt will be made for VM_NORESERVE to allocate a page
> @@ -2820,19 +2864,17 @@ int hugetlb_reserve_pages(struct inode *inode,
> * called to make the mapping read-write. Assume !vma is a shm mapping
> */
> if (!vma || vma->vm_flags & VM_MAYSHARE) {
> - chg = hugetlb_page_charge(&inode->i_mapping->private_list,
> - h, from, to);
> + region_list = &inode->i_mapping->private_list;
> } else {
> - struct resv_map *resv_map = resv_map_alloc();
> + resv_map = resv_map_alloc();
> if (!resv_map)
> return -ENOMEM;
>
> - chg = to - from;
> -
> set_vma_resv_map(vma, resv_map);
> set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> + region_list = &resv_map->regions;
> }
> -
> + chg = hugetlb_page_charge(region_list, h, from, to);
> if (chg < 0)
> return chg;
>
> @@ -2848,29 +2890,15 @@ int hugetlb_reserve_pages(struct inode *inode,
> ret = hugetlb_acct_memory(h, chg);
> if (ret < 0)
> goto err_acct_mem;
> - /*
> - * Account for the reservations made. Shared mappings record regions
> - * that have reservations as they are shared by multiple VMAs.
> - * When the last VMA disappears, the region map says how much
> - * the reservation was and the page cache tells how much of
> - * the reservation was consumed. Private mappings are per-VMA and
> - * only the consumed reservations are tracked. When the VMA
> - * disappears, the original reservation is the VMA size and the
> - * consumed reservations are stored in the map. Hence, nothing
> - * else has to be done for private mappings here
> - */
> - if (!vma || vma->vm_flags & VM_MAYSHARE)
> - hugetlb_commit_page_charge(h, &inode->i_mapping->private_list,
> - from, to);
> +
> + hugetlb_commit_page_charge(h, region_list, from, to);
> return 0;
> err_acct_mem:
> hugetlb_put_quota(inode->i_mapping, chg);
> err_quota:
> - if (!vma || vma->vm_flags & VM_MAYSHARE)
> - hugetlb_page_uncharge(&inode->i_mapping->private_list,
> - h - hstates, chg << huge_page_order(h));
> + hugetlb_page_uncharge(region_list, h - hstates,
> + chg << huge_page_order(h));
> return ret;
> -
> }
>
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> @@ -2884,7 +2912,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> inode->i_blocks -= (blocks_per_huge_page(h) * freed);
> spin_unlock(&inode->i_lock);
>
> - hugetlb_put_quota(inode->i_mapping, (chg - freed));
> + hugetlb_put_quota(inode->i_mapping, chg);
> hugetlb_acct_memory(h, -(chg - freed));
> }
>
> --
> 1.7.9
>
Thread overview: 26+ messages
2012-03-01 9:16 [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 1/9] mm: move hugetlbfs region tracking function to common code Aneesh Kumar K.V
2012-03-01 22:33 ` Andrew Morton
2012-03-04 17:37 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 2/9] mm: Update region function to take new data arg Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 3/9] hugetlbfs: Use the generic region API and drop local one Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 4/9] memcg: Add non reclaim resource tracking to memcg Aneesh Kumar K.V
2012-03-02 8:38 ` KAMEZAWA Hiroyuki
2012-03-04 18:07 ` Aneesh Kumar K.V
2012-03-08 5:56 ` KAMEZAWA Hiroyuki
2012-03-08 11:48 ` Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 5/9] hugetlbfs: Add memory controller support for shared mapping Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 6/9] hugetlbfs: Add memory controller support for private mapping Aneesh Kumar K.V
2012-05-17 23:16 ` Darrick J. Wong
2012-03-01 9:16 ` [PATCH -V2 7/9] memcg: track resource index in cftype private Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 8/9] hugetlbfs: Add memcg control files for hugetlbfs Aneesh Kumar K.V
2012-03-01 9:16 ` [PATCH -V2 9/9] memcg: Add memory controller documentation for hugetlb management Aneesh Kumar K.V
2012-03-01 22:40 ` [PATCH -V2 0/9] memcg: add HugeTLB resource tracking Andrew Morton
2012-03-02 3:28 ` David Gibson
2012-03-04 18:09 ` Aneesh Kumar K.V
2012-03-06 2:38 ` David Gibson
2012-03-04 19:15 ` Aneesh Kumar K.V
2012-03-05 13:56 ` Hillf Danton
2012-03-06 14:05 ` Aneesh Kumar K.V
2012-03-02 5:48 ` KAMEZAWA Hiroyuki
2012-03-04 18:14 ` Aneesh Kumar K.V