* [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
@ 2017-06-14 20:11 Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
Cc: John Hubbard, David Nellans, Jérôme Glisse,
cgroups-u79uwXL29TY76Z2rM5mHXA, Dan Williams, Ross Zwisler,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh,
Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt
Cache coherent device memory apply to architecture with system bus
like CAPI or CCIX. Device connected to such system bus can expose
their memory to the system and allow cache coherent access to it
from the CPU.
Even if for all intent and purposes device memory behave like regular
memory, we still want to manage it in isolation from regular memory.
Several reasons for that, first and foremost this memory is less
reliable than regular memory if the device hangs because of invalid
commands we can loose access to device memory. Second CPU access to
this memory is expected to be slower than to regular memory. Third
having random memory into device means that some of the bus bandwith
wouldn't be available to the device but would be use by CPU access.
This is why we want to manage such memory in isolation from regular
memory. Kernel should not try to use this memory as last resort
when running out of memory, at least for now.
This patchset add a new type of ZONE_DEVICE memory (DEVICE_PUBLIC)
that is use to represent CDM memory. This patchset build on top of
the HMM patchset that already introduce a new type of ZONE_DEVICE
memory for private device memory (see HMM patchset).
The end result is that with this patch if a device is in use in
a process you might have private anonymous memory or file back
page memory using ZONE_DEVICE (MEMORY_PUBLIC). Thus care must be
taken to not overwritte lru fields of such pages.
Hence all core mm changes are done to address assumption that any
process memory is back by a regular struct page that is part of
the lru. ZONE_DEVICE page are not on the lru and the lru pointer
of struct page are use to store device specific informations.
Thus this patch update all code path that would make assumptions
about lruness of a process page.
patch 01 - deals with all the core mm functions
patch 02 - add an helper to HMM for hotplug of CDM memory
patch 03 - preparatory patch for memory controller changes
patch 04 - update memory controller to properly handle
ZONE_DEVICE pages when uncharging
patch 05 - kernel configuration updates and cleanup
git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v2
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Balbir Singh <balbirs-8fk3Idey6ehBDgjK7y7TUQ@public.gmane.org>
Cc: Aneesh Kumar <aneesh.kumar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
Jérôme Glisse (5):
mm/device-public-memory: device memory cache coherent with CPU
mm/hmm: add new helper to hotplug CDM memory region
mm/memcontrol: allow to uncharge page without using page->lru field
mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
mm/hmm: simplify kconfig and enable HMM and DEVICE_PUBLIC for ppc64
fs/proc/task_mmu.c | 2 +-
include/linux/hmm.h | 7 +-
include/linux/ioport.h | 1 +
include/linux/memremap.h | 21 +++++
include/linux/mm.h | 16 ++--
kernel/memremap.c | 13 ++-
mm/Kconfig | 30 +++----
mm/gup.c | 7 ++
mm/hmm.c | 89 +++++++++++++++++--
mm/madvise.c | 2 +-
mm/memcontrol.c | 226 ++++++++++++++++++++++++++++++-----------------
mm/memory.c | 46 ++++++++--
mm/migrate.c | 60 ++++++++-----
mm/swap.c | 11 +++
14 files changed, 389 insertions(+), 142 deletions(-)
--
2.9.3
^ permalink raw reply [flat|nested] 13+ messages in thread
* [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
[not found] ` <20170614201144.9306-4-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
2 siblings, 1 reply; 13+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: John Hubbard, David Nellans, Jérôme Glisse,
Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups
HMM pages (private or public device pages) are ZONE_DEVICE page and
thus you can not use page->lru fields of those pages. This patch
re-arrange the uncharge to allow single page to be uncharge without
modifying the lru field of the struct page.
There is no change to memcontrol logic, it is the same as it was
before this patch.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 92 insertions(+), 76 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3fe4d0..b93f5fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
cancel_charge(memcg, nr_pages);
}
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
- unsigned long nr_anon, unsigned long nr_file,
- unsigned long nr_kmem, unsigned long nr_huge,
- unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+ struct mem_cgroup *memcg;
+ unsigned long pgpgout;
+ unsigned long nr_anon;
+ unsigned long nr_file;
+ unsigned long nr_kmem;
+ unsigned long nr_huge;
+ unsigned long nr_shmem;
+ struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
{
- unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+ memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+ unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
unsigned long flags;
- if (!mem_cgroup_is_root(memcg)) {
- page_counter_uncharge(&memcg->memory, nr_pages);
+ if (!mem_cgroup_is_root(ug->memcg)) {
+ page_counter_uncharge(&ug->memcg->memory, nr_pages);
if (do_memsw_account())
- page_counter_uncharge(&memcg->memsw, nr_pages);
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
- page_counter_uncharge(&memcg->kmem, nr_kmem);
- memcg_oom_recover(memcg);
+ page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+ page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+ memcg_oom_recover(ug->memcg);
}
local_irq_save(flags);
- __this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
- __this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
- __this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
- __this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
- __this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
- __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
- memcg_check_events(memcg, dummy_page);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+ __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+ __this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+ __this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+ __this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+ memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);
- if (!mem_cgroup_is_root(memcg))
- css_put_many(&memcg->css, nr_pages);
+ if (!mem_cgroup_is_root(ug->memcg))
+ css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+ if (!page->mem_cgroup)
+ return;
+
+ /*
+ * Nobody should be changing or seriously looking at
+ * page->mem_cgroup at this point, we have fully
+ * exclusive access to the page.
+ */
+
+ if (ug->memcg != page->mem_cgroup) {
+ if (ug->memcg) {
+ uncharge_batch(ug);
+ uncharge_gather_clear(ug);
+ }
+ ug->memcg = page->mem_cgroup;
+ }
+
+ if (!PageKmemcg(page)) {
+ unsigned int nr_pages = 1;
+
+ if (PageTransHuge(page)) {
+ nr_pages <<= compound_order(page);
+ ug->nr_huge += nr_pages;
+ }
+ if (PageAnon(page))
+ ug->nr_anon += nr_pages;
+ else {
+ ug->nr_file += nr_pages;
+ if (PageSwapBacked(page))
+ ug->nr_shmem += nr_pages;
+ }
+ ug->pgpgout++;
+ } else {
+ ug->nr_kmem += 1 << compound_order(page);
+ __ClearPageKmemcg(page);
+ }
+
+ ug->dummy_page = page;
+ page->mem_cgroup = NULL;
}
static void uncharge_list(struct list_head *page_list)
{
- struct mem_cgroup *memcg = NULL;
- unsigned long nr_shmem = 0;
- unsigned long nr_anon = 0;
- unsigned long nr_file = 0;
- unsigned long nr_huge = 0;
- unsigned long nr_kmem = 0;
- unsigned long pgpgout = 0;
+ struct uncharge_gather ug;
struct list_head *next;
- struct page *page;
+
+ uncharge_gather_clear(&ug);
/*
* Note that the list can be a single page->lru; hence the
@@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
*/
next = page_list->next;
do {
+ struct page *page;
+
page = list_entry(next, struct page, lru);
next = page->lru.next;
- VM_BUG_ON_PAGE(PageLRU(page), page);
- VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
- if (!page->mem_cgroup)
- continue;
-
- /*
- * Nobody should be changing or seriously looking at
- * page->mem_cgroup at this point, we have fully
- * exclusive access to the page.
- */
-
- if (memcg != page->mem_cgroup) {
- if (memcg) {
- uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
- nr_kmem, nr_huge, nr_shmem, page);
- pgpgout = nr_anon = nr_file = nr_kmem = 0;
- nr_huge = nr_shmem = 0;
- }
- memcg = page->mem_cgroup;
- }
-
- if (!PageKmemcg(page)) {
- unsigned int nr_pages = 1;
-
- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- nr_huge += nr_pages;
- }
- if (PageAnon(page))
- nr_anon += nr_pages;
- else {
- nr_file += nr_pages;
- if (PageSwapBacked(page))
- nr_shmem += nr_pages;
- }
- pgpgout++;
- } else {
- nr_kmem += 1 << compound_order(page);
- __ClearPageKmemcg(page);
- }
-
- page->mem_cgroup = NULL;
+ uncharge_page(page, &ug);
} while (next != page_list);
- if (memcg)
- uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
- nr_kmem, nr_huge, nr_shmem, page);
+ if (ug.memcg)
+ uncharge_batch(&ug);
}
/**
@@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
*/
void mem_cgroup_uncharge(struct page *page)
{
+ struct uncharge_gather ug;
+
if (mem_cgroup_disabled())
return;
@@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
if (!page->mem_cgroup)
return;
- INIT_LIST_HEAD(&page->lru);
- uncharge_list(&page->lru);
+ uncharge_gather_clear(&ug);
+ uncharge_page(page, &ug);
+ uncharge_batch(&ug);
}
/**
--
2.9.3
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-06-14 20:11 ` Jérôme Glisse
2017-06-15 1:41 ` Balbir Singh
2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
2 siblings, 1 reply; 13+ messages in thread
From: Jérôme Glisse @ 2017-06-14 20:11 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: John Hubbard, David Nellans, Jérôme Glisse,
Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups
HMM pages (private or public device pages) are ZONE_DEVICE page and
thus need special handling when it comes to lru or refcount. This
patch make sure that memcontrol properly handle those when it face
them. Those pages are use like regular pages in a process address
space either as anonymous page or as file back page. So from memcg
point of view we want to handle them like regular page for now at
least.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
kernel/memremap.c | 2 ++
mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 55 insertions(+), 5 deletions(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index da74775..584984c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
__ClearPageActive(page);
__ClearPageWaiters(page);
+ mem_cgroup_uncharge(page);
+
page->pgmap->page_free(page, page->pgmap->data);
}
else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b93f5fe..171b638 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4391,12 +4391,13 @@ enum mc_target_type {
MC_TARGET_NONE = 0,
MC_TARGET_PAGE,
MC_TARGET_SWAP,
+ MC_TARGET_DEVICE,
};
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
{
- struct page *page = vm_normal_page(vma, addr, ptent);
+ struct page *page = _vm_normal_page(vma, addr, ptent, true);
if (!page || !page_mapped(page))
return NULL;
@@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
if (!(mc.flags & MOVE_FILE))
return NULL;
}
- if (!get_page_unless_zero(page))
+ if (is_device_public_page(page)) {
+ /*
+ * MEMORY_DEVICE_PUBLIC means ZONE_DEVICE page and which have a
+ * refcount of 1 when free (unlike normal page)
+ */
+ if (!page_ref_add_unless(page, 1, 1))
+ return NULL;
+ } else if (!get_page_unless_zero(page))
return NULL;
return page;
}
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
pte_t ptent, swp_entry_t *entry)
{
@@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
return NULL;
+
+ /*
+ * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE page belonging to
+ * a device and because they are not accessible by CPU they are store
+ * as special swap entry in the CPU page table.
+ */
+ if (is_device_private_entry(ent)) {
+ page = device_private_entry_to_page(ent);
+ /*
+ * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
+ * a refcount of 1 when free (unlike normal page)
+ */
+ if (!page_ref_add_unless(page, 1, 1))
+ return NULL;
+ return page;
+ }
+
/*
* Because lookup_swap_cache() updates some statistics counter,
* we call find_get_page() with swapper_space directly.
@@ -4582,6 +4607,8 @@ static int mem_cgroup_move_account(struct page *page,
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
+ * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
+ * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
*
* Called with pte lock held.
*/
@@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (page->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
+ if (is_device_private_page(page) ||
+ is_device_public_page(page))
+ ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
}
@@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
+ /*
+ * Note their can not be MC_TARGET_DEVICE for now as we do not
+ * support transparent huge page with MEMORY_DEVICE_PUBLIC or
+ * MEMORY_DEVICE_PRIVATE but this might change.
+ */
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
@@ -4884,6 +4919,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
putback_lru_page(page);
}
put_page(page);
+ } else if (target_type == MC_TARGET_DEVICE) {
+ page = target.page;
+ if (!mem_cgroup_move_account(page, true,
+ mc.from, mc.to)) {
+ mc.precharge -= HPAGE_PMD_NR;
+ mc.moved_charge += HPAGE_PMD_NR;
+ }
+ put_page(page);
}
spin_unlock(ptl);
return 0;
@@ -4895,12 +4938,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
+ bool device = false;
swp_entry_t ent;
if (!mc.precharge)
break;
switch (get_mctgt_type(vma, addr, ptent, &target)) {
+ case MC_TARGET_DEVICE:
+ device = true;
+ /* fall through */
case MC_TARGET_PAGE:
page = target.page;
/*
@@ -4911,7 +4958,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
*/
if (PageTransCompound(page))
goto put;
- if (isolate_lru_page(page))
+ if (!device && isolate_lru_page(page))
goto put;
if (!mem_cgroup_move_account(page, false,
mc.from, mc.to)) {
@@ -4919,7 +4966,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
/* we uncharge from mc.from later. */
mc.moved_charge++;
}
- putback_lru_page(page);
+ if (!device)
+ putback_lru_page(page);
put: /* get_mctgt_type() gets the page */
put_page(page);
break;
--
2.9.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
@ 2017-06-14 21:20 ` Dave Hansen
[not found] ` <8219f8fb-65bb-7c6b-6c4c-acc0601c1e0f-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2 siblings, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2017-06-14 21:20 UTC (permalink / raw)
To: Jérôme Glisse, linux-kernel, linux-mm
Cc: John Hubbard, David Nellans, cgroups, Dan Williams, Ross Zwisler,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh,
Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt
On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> Cache coherent device memory apply to architecture with system bus
> like CAPI or CCIX. Device connected to such system bus can expose
> their memory to the system and allow cache coherent access to it
> from the CPU.
How does this interact with device memory that's enumerated in the new
ACPI 6.2 HMAT? That stuff is also in the normal e820 and, by default,
treated as normal system RAM. Would this mechanism be used for those
devices as well?
http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
[not found] ` <8219f8fb-65bb-7c6b-6c4c-acc0601c1e0f-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-14 21:38 ` Jerome Glisse
[not found] ` <20170614213800.GD4160-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Jerome Glisse @ 2017-06-14 21:38 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, John Hubbard, David Nellans,
cgroups-u79uwXL29TY76Z2rM5mHXA, Dan Williams, Ross Zwisler,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh,
Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt
On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
> > Cache coherent device memory apply to architecture with system bus
> > like CAPI or CCIX. Device connected to such system bus can expose
> > their memory to the system and allow cache coherent access to it
> > from the CPU.
>
> How does this interact with device memory that's enumerated in the new
> ACPI 6.2 HMAT? That stuff is also in the normal e820 and, by default,
> treated as normal system RAM. Would this mechanism be used for those
> devices as well?
>
> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
It doesn't interact with that. HMM-CDM is a set of helper that don't
do anything unless instructed so. So for device memory to be presented
as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
can be done with the helper introduced in patch 2 of this patchset.
I don't think that the HMAT inside ACPI is restricted or even intended
for device memory. The kind of memory i am refering too in HMM-CDM is
for instance GPU on board memory. On PCIE system the CPU can not access
such memory in the same manner as regular memory but on CAPI or CCIX
system it can.
How such memory is listed is platform/architecture specific and HMM-CDM
does not deal with that. So PowerPC CAPI will have its own way of
discovering such device memory. So will CCIX platform (thought for CCIX
i expect that UEFI/ACPI will be involve through HMAT or something new).
Also if HMAT allow represent device memory, the choice to use HMM-CDM
would still be with the device driver. For persistent memory is does not
make sense and it might not make sense for other devices too. I expect
this to be usefull for GPU, FPGA or similar accelerator (like Xeon Phi
as add on card if there is ever something like CCIX supported by Intel).
Hope this answer your question
Jérôme
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
[not found] ` <20170614213800.GD4160-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-06-14 21:58 ` Dave Hansen
2017-06-14 22:07 ` Benjamin Herrenschmidt
[not found] ` <3a617630-2406-da49-707c-4959a2afd8e1-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
0 siblings, 2 replies; 13+ messages in thread
From: Dave Hansen @ 2017-06-14 21:58 UTC (permalink / raw)
To: Jerome Glisse
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, John Hubbard, David Nellans,
cgroups-u79uwXL29TY76Z2rM5mHXA, Dan Williams, Ross Zwisler,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh,
Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt
On 06/14/2017 02:38 PM, Jerome Glisse wrote:
> On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
>> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
>>> Cache coherent device memory apply to architecture with system bus
>>> like CAPI or CCIX. Device connected to such system bus can expose
>>> their memory to the system and allow cache coherent access to it
>>> from the CPU.
>> How does this interact with device memory that's enumerated in the new
>> ACPI 6.2 HMAT? That stuff is also in the normal e820 and, by default,
>> treated as normal system RAM. Would this mechanism be used for those
>> devices as well?
>>
>> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> It doesn't interact with that. HMM-CDM is a set of helper that don't
> do anything unless instructed so. So for device memory to be presented
> as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
> can be done with the helper introduced in patch 2 of this patchset.
I guess I'm asking whether we *should* instruct HMM-CDM to manage all
coherent device memory. If not, where do we draw the line for what we
use HMM-CDM, and for what we use the core MM?
> I don't think that the HMAT inside ACPI is restricted or even intended
> for device memory.
It can definitely describe memory attached to memory controllers which
are not directly attached to CPUs. That means either some kind of
memory expander, or device memory.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
2017-06-14 21:58 ` Dave Hansen
@ 2017-06-14 22:07 ` Benjamin Herrenschmidt
[not found] ` <3a617630-2406-da49-707c-4959a2afd8e1-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
1 sibling, 0 replies; 13+ messages in thread
From: Benjamin Herrenschmidt @ 2017-06-14 22:07 UTC (permalink / raw)
To: Dave Hansen, Jerome Glisse
Cc: linux-kernel, linux-mm, John Hubbard, David Nellans, cgroups,
Dan Williams, Ross Zwisler, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Balbir Singh, Aneesh Kumar, Paul E . McKenney
On Wed, 2017-06-14 at 14:58 -0700, Dave Hansen wrote:
> > > http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> >
> > It doesn't interact with that. HMM-CDM is a set of helper that don't
> > do anything unless instructed so. So for device memory to be presented
> > as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
> > can be done with the helper introduced in patch 2 of this patchset.
>
> I guess I'm asking whether we *should* instruct HMM-CDM to manage all
> coherent device memory. If not, where do we draw the line for what we
> use HMM-CDM, and for what we use the core MM?
Well, if you want the features of HMM ... It basically boils down to
whether you have some kind of coherent processing unit close to that
memory and want to manage transparent migration of pages between system
and device memory, that sort of thing.
Cheers,
Ben.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM
[not found] ` <3a617630-2406-da49-707c-4959a2afd8e1-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-14 23:40 ` Balbir Singh
0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2017-06-14 23:40 UTC (permalink / raw)
To: Dave Hansen
Cc: Jerome Glisse,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm,
John Hubbard, David Nellans,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Dan Williams,
Ross Zwisler, Johannes Weiner, Michal Hocko, Vladimir Davydov,
Aneesh Kumar, Paul E . McKenney, Benjamin Herrenschmidt
On Thu, Jun 15, 2017 at 7:58 AM, Dave Hansen <dave.hansen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On 06/14/2017 02:38 PM, Jerome Glisse wrote:
>> On Wed, Jun 14, 2017 at 02:20:23PM -0700, Dave Hansen wrote:
>>> On 06/14/2017 01:11 PM, Jérôme Glisse wrote:
>>>> Cache coherent device memory apply to architecture with system bus
>>>> like CAPI or CCIX. Device connected to such system bus can expose
>>>> their memory to the system and allow cache coherent access to it
>>>> from the CPU.
>>> How does this interact with device memory that's enumerated in the new
>>> ACPI 6.2 HMAT? That stuff is also in the normal e820 and, by default,
>>> treated as normal system RAM. Would this mechanism be used for those
>>> devices as well?
>>>
>>> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>> It doesn't interact with that. HMM-CDM is a set of helper that don't
>> do anything unless instructed so. So for device memory to be presented
>> as HMM-CDM you need to hotplug it as ZONE_DEVICE(DEVICE_PUBLIC) which
>> can be done with the helper introduced in patch 2 of this patchset.
>
[Removing my cc'd email id and responding from a different address]
> I guess I'm asking whether we *should* instruct HMM-CDM to manage all
> coherent device memory. If not, where do we draw the line for what we
> use HMM-CDM, and for what we use the core MM?
>
If you believe the memory is managed by the device (and owned by a device
driver) I'd suggest using HMM-CDM. The idea behind HMM-CDM was that
it enables transparent migration of pages and its preferred when
locality of computation
and locality of memory access is the preferred model.
The other model was N_COHERENT_MEMORY that used the core MM, but
there were objections to exposing device memory using that technology.
Balbir Singh.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
@ 2017-06-15 1:41 ` Balbir Singh
[not found] ` <20170615114159.11a1eece-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2017-06-15 1:41 UTC (permalink / raw)
To: Jérôme Glisse
Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups
On Wed, 14 Jun 2017 16:11:43 -0400
Jérôme Glisse <jglisse@redhat.com> wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE page and
> thus need special handling when it comes to lru or refcount. This
> patch make sure that memcontrol properly handle those when it face
> them. Those pages are use like regular pages in a process address
> space either as anonymous page or as file back page. So from memcg
> point of view we want to handle them like regular page for now at
> least.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---
> kernel/memremap.c | 2 ++
> mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 55 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index da74775..584984c 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
> __ClearPageActive(page);
> __ClearPageWaiters(page);
>
> + mem_cgroup_uncharge(page);
> +
A zone device page could have a mem_cgroup charge if
1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
gets the charge that we need to drop here
And should not be charged
2. If the driver allowed mmap based allocation (these pages are not on LRU
Since put_zone_device_private_or_public_page() is called from release_pages(),
I think the assumption is that 2 is not a problem? I've not tested the mmap
bits yet.
> page->pgmap->page_free(page, page->pgmap->data);
> }
> else if (!count)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b93f5fe..171b638 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4391,12 +4391,13 @@ enum mc_target_type {
> MC_TARGET_NONE = 0,
> MC_TARGET_PAGE,
> MC_TARGET_SWAP,
> + MC_TARGET_DEVICE,
> };
>
> static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> unsigned long addr, pte_t ptent)
> {
> - struct page *page = vm_normal_page(vma, addr, ptent);
> + struct page *page = _vm_normal_page(vma, addr, ptent, true);
>
> if (!page || !page_mapped(page))
> return NULL;
> @@ -4407,13 +4408,20 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> if (!(mc.flags & MOVE_FILE))
> return NULL;
> }
> - if (!get_page_unless_zero(page))
> + if (is_device_public_page(page)) {
> + /*
> + * MEMORY_DEVICE_PUBLIC means ZONE_DEVICE page and which have a
> + * refcount of 1 when free (unlike normal page)
> + */
> + if (!page_ref_add_unless(page, 1, 1))
> + return NULL;
> + } else if (!get_page_unless_zero(page))
> return NULL;
>
> return page;
> }
>
> -#ifdef CONFIG_SWAP
> +#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
> static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> pte_t ptent, swp_entry_t *entry)
> {
> @@ -4422,6 +4430,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>
> if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
> return NULL;
> +
> + /*
> + * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE page belonging to
> + * a device and because they are not accessible by CPU they are store
> + * as special swap entry in the CPU page table.
> + */
> + if (is_device_private_entry(ent)) {
> + page = device_private_entry_to_page(ent);
> + /*
> + * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
> + * a refcount of 1 when free (unlike normal page)
> + */
> + if (!page_ref_add_unless(page, 1, 1))
> + return NULL;
> + return page;
> + }
> +
> /*
> * Because lookup_swap_cache() updates some statistics counter,
> * we call find_get_page() with swapper_space directly.
> @@ -4582,6 +4607,8 @@ static int mem_cgroup_move_account(struct page *page,
> * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
> * target for charge migration. if @target is not NULL, the entry is stored
> * in target->ent.
> + * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
> + * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
> *
> * Called with pte lock held.
> */
> @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> */
> if (page->mem_cgroup == mc.from) {
> ret = MC_TARGET_PAGE;
> + if (is_device_private_page(page) ||
> + is_device_public_page(page))
> + ret = MC_TARGET_DEVICE;
> if (target)
> target->page = page;
> }
> @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> + /*
> + * Note their can not be MC_TARGET_DEVICE for now as we do not
there
> + * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> + * MEMORY_DEVICE_PRIVATE but this might change.
I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
together today, the driver could allocate a THP size set of pages and migrate it.
There are patches to do THP migration, not upstream yet. Could you remind me
of any other limitations?
Balbir Singh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
[not found] ` <20170615114159.11a1eece-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
@ 2017-06-15 2:04 ` Jerome Glisse
2017-06-15 3:10 ` Balbir Singh
0 siblings, 1 reply; 13+ messages in thread
From: Jerome Glisse @ 2017-06-15 2:04 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, John Hubbard, David Nellans,
Johannes Weiner, Michal Hocko, Vladimir Davydov,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jun 15, 2017 at 11:41:59AM +1000, Balbir Singh wrote:
> On Wed, 14 Jun 2017 16:11:43 -0400
> Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > thus need special handling when it comes to lru or refcount. This
> > patch make sure that memcontrol properly handle those when it face
> > them. Those pages are use like regular pages in a process address
> > space either as anonymous page or as file back page. So from memcg
> > point of view we want to handle them like regular page for now at
> > least.
> >
> > Signed-off-by: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > ---
> > kernel/memremap.c | 2 ++
> > mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 55 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index da74775..584984c 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
> > __ClearPageActive(page);
> > __ClearPageWaiters(page);
> >
> > + mem_cgroup_uncharge(page);
> > +
>
> A zone device page could have a mem_cgroup charge if
>
> 1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
> gets the charge that we need to drop here
>
> And should not be charged
>
> 2. If the driver allowed mmap based allocation (these pages are not on LRU
>
>
> Since put_zone_device_private_or_public_page() is called from release_pages(),
> I think the assumption is that 2 is not a problem? I've not tested the mmap
> bits yet.
Well that is one of the big question. Do we care about memory cgroup despite
page not being on lru and thus not being reclaimable through the usual path ?
I believe we do want to keep charging ZONE_DEVICE page against memory cgroup
so that userspace limit are enforced. This is important especialy for device
private when migrating back to system memory due to CPU page fault. We do not
want the migration back to fail because of memory cgroup limit.
Hence why i do want to charge ZONE_DEVICE page just like regular page. If we
have people that run into OOM because of this then we can start thinking about
how to account those pages slightly differently inside the memory cgroup.
For now i believe we do want this patch.
[...]
> > @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > */
> > if (page->mem_cgroup == mc.from) {
> > ret = MC_TARGET_PAGE;
> > + if (is_device_private_page(page) ||
> > + is_device_public_page(page))
> > + ret = MC_TARGET_DEVICE;
> > if (target)
> > target->page = page;
> > }
> > @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> >
> > ptl = pmd_trans_huge_lock(pmd, vma);
> > if (ptl) {
> > + /*
> > + * Note their can not be MC_TARGET_DEVICE for now as we do not
> there
> > + * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> > + * MEMORY_DEVICE_PRIVATE but this might change.
>
> I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
> together today, the driver could allocate a THP size set of pages and migrate it.
> There are patches to do THP migration, not upstream yet. Could you remind me
> of any other limitations?
No there is nothing that would be problematic AFAICT. Persistent memory already
use huge page so we should be in the clear. But i would rather enable that as
a separate patchset alltogether and have proper testing specificaly for such
scenario.
Jérôme
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC
2017-06-15 2:04 ` Jerome Glisse
@ 2017-06-15 3:10 ` Balbir Singh
0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2017-06-15 3:10 UTC (permalink / raw)
To: Jerome Glisse
Cc: linux-kernel, linux-mm, John Hubbard, David Nellans,
Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups
On Wed, 14 Jun 2017 22:04:55 -0400
Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Jun 15, 2017 at 11:41:59AM +1000, Balbir Singh wrote:
> > On Wed, 14 Jun 2017 16:11:43 -0400
> > Jérôme Glisse <jglisse@redhat.com> wrote:
> >
> > > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > > thus need special handling when it comes to lru or refcount. This
> > > patch make sure that memcontrol properly handle those when it face
> > > them. Those pages are use like regular pages in a process address
> > > space either as anonymous page or as file back page. So from memcg
> > > point of view we want to handle them like regular page for now at
> > > least.
> > >
> > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > Cc: cgroups@vger.kernel.org
> > > ---
> > > kernel/memremap.c | 2 ++
> > > mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > 2 files changed, 55 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > > index da74775..584984c 100644
> > > --- a/kernel/memremap.c
> > > +++ b/kernel/memremap.c
> > > @@ -479,6 +479,8 @@ void put_zone_device_private_or_public_page(struct page *page)
> > > __ClearPageActive(page);
> > > __ClearPageWaiters(page);
> > >
> > > + mem_cgroup_uncharge(page);
> > > +
> >
> > A zone device page could have a mem_cgroup charge if
> >
> > 1. The old page was charged to a cgroup and the new page from ZONE_DEVICE then
> > gets the charge that we need to drop here
> >
> > And should not be charged
> >
> > 2. If the driver allowed mmap based allocation (these pages are not on LRU
> >
> >
> > Since put_zone_device_private_or_public_page() is called from release_pages(),
> > I think the assumption is that 2 is not a problem? I've not tested the mmap
> > bits yet.
>
> Well that is one of the big question. Do we care about memory cgroup despite
> page not being on lru and thus not being reclaimable through the usual path ?
>
> I believe we do want to keep charging ZONE_DEVICE page against memory cgroup
> so that userspace limit are enforced. This is important especialy for device
> private when migrating back to system memory due to CPU page fault. We do not
> want the migration back to fail because of memory cgroup limit.
>
> Hence why i do want to charge ZONE_DEVICE page just like regular page. If we
> have people that run into OOM because of this then we can start thinking about
> how to account those pages slightly differently inside the memory cgroup.
>
> For now i believe we do want this patch.
>
Yes, we do need the patch, I was trying to check if we'll end up trying to uncharge
a page that is not charged, just double checking
>
> [...]
>
> > > @@ -4610,6 +4637,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > */
> > > if (page->mem_cgroup == mc.from) {
> > > ret = MC_TARGET_PAGE;
> > > + if (is_device_private_page(page) ||
> > > + is_device_public_page(page))
> > > + ret = MC_TARGET_DEVICE;
> > > if (target)
> > > target->page = page;
> > > }
> > > @@ -4669,6 +4699,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> > >
> > > ptl = pmd_trans_huge_lock(pmd, vma);
> > > if (ptl) {
> > > + /*
> > > + * Note their can not be MC_TARGET_DEVICE for now as we do not
> > there
> > > + * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> > > + * MEMORY_DEVICE_PRIVATE but this might change.
> >
> > I am trying to remind myself why THP and MEMORY_DEVICE_* pages don't work well
> > together today, the driver could allocate a THP size set of pages and migrate it.
> > There are patches to do THP migration, not upstream yet. Could you remind me
> > of any other limitations?
>
> No there is nothing that would be problematic AFAICT. Persistent memory already
> use huge page so we should be in the clear. But i would rather enable that as
> a separate patchset alltogether and have proper testing specificaly for such
> scenario.
Agreed
Balbir Singh.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
[not found] ` <20170614201144.9306-4-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-06-15 3:31 ` Balbir Singh
[not found] ` <20170615133128.2fe2c33f-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2017-06-15 3:31 UTC (permalink / raw)
To: Jérôme Glisse
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, John Hubbard, David Nellans,
Johannes Weiner, Michal Hocko, Vladimir Davydov,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Wed, 14 Jun 2017 16:11:42 -0400
Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE page and
> thus you can not use page->lru fields of those pages. This patch
> re-arrange the uncharge to allow single page to be uncharge without
> modifying the lru field of the struct page.
>
> There is no change to memcontrol logic, it is the same as it was
> before this patch.
>
> Signed-off-by: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
> mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
> 1 file changed, 92 insertions(+), 76 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e3fe4d0..b93f5fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> cancel_charge(memcg, nr_pages);
> }
>
> -static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> - unsigned long nr_anon, unsigned long nr_file,
> - unsigned long nr_kmem, unsigned long nr_huge,
> - unsigned long nr_shmem, struct page *dummy_page)
> +struct uncharge_gather {
> + struct mem_cgroup *memcg;
> + unsigned long pgpgout;
> + unsigned long nr_anon;
> + unsigned long nr_file;
> + unsigned long nr_kmem;
> + unsigned long nr_huge;
> + unsigned long nr_shmem;
> + struct page *dummy_page;
> +};
> +
> +static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> {
> - unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
> + memset(ug, 0, sizeof(*ug));
> +}
> +
> +static void uncharge_batch(const struct uncharge_gather *ug)
> +{
Can we pass page as an argument so that we can do check events on the page?
> + unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
> unsigned long flags;
>
> - if (!mem_cgroup_is_root(memcg)) {
> - page_counter_uncharge(&memcg->memory, nr_pages);
> + if (!mem_cgroup_is_root(ug->memcg)) {
> + page_counter_uncharge(&ug->memcg->memory, nr_pages);
> if (do_memsw_account())
> - page_counter_uncharge(&memcg->memsw, nr_pages);
> - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
> - page_counter_uncharge(&memcg->kmem, nr_kmem);
> - memcg_oom_recover(memcg);
> + page_counter_uncharge(&ug->memcg->memsw, nr_pages);
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> + page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
> + memcg_oom_recover(ug->memcg);
> }
>
> local_irq_save(flags);
> - __this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
> - __this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
> - __this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
> - __this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
> - __this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
> - __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
> - memcg_check_events(memcg, dummy_page);
> + __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
> + __this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
> + __this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
> + __this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
> + __this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
> + __this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
> + memcg_check_events(ug->memcg, ug->dummy_page);
> local_irq_restore(flags);
>
> - if (!mem_cgroup_is_root(memcg))
> - css_put_many(&memcg->css, nr_pages);
> + if (!mem_cgroup_is_root(ug->memcg))
> + css_put_many(&ug->memcg->css, nr_pages);
> +}
> +
> +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> +{
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> + VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> +
> + if (!page->mem_cgroup)
> + return;
> +
> + /*
> + * Nobody should be changing or seriously looking at
> + * page->mem_cgroup at this point, we have fully
> + * exclusive access to the page.
> + */
> +
> + if (ug->memcg != page->mem_cgroup) {
> + if (ug->memcg) {
> + uncharge_batch(ug);
What is ug->dummy_page set to at this point? ug->dummy_page is assigned below
> + uncharge_gather_clear(ug);
> + }
> + ug->memcg = page->mem_cgroup;
> + }
> +
> + if (!PageKmemcg(page)) {
> + unsigned int nr_pages = 1;
> +
> + if (PageTransHuge(page)) {
> + nr_pages <<= compound_order(page);
> + ug->nr_huge += nr_pages;
> + }
> + if (PageAnon(page))
> + ug->nr_anon += nr_pages;
> + else {
> + ug->nr_file += nr_pages;
> + if (PageSwapBacked(page))
> + ug->nr_shmem += nr_pages;
> + }
> + ug->pgpgout++;
> + } else {
> + ug->nr_kmem += 1 << compound_order(page);
> + __ClearPageKmemcg(page);
> + }
> +
> + ug->dummy_page = page;
> + page->mem_cgroup = NULL;
> }
>
> static void uncharge_list(struct list_head *page_list)
> {
> - struct mem_cgroup *memcg = NULL;
> - unsigned long nr_shmem = 0;
> - unsigned long nr_anon = 0;
> - unsigned long nr_file = 0;
> - unsigned long nr_huge = 0;
> - unsigned long nr_kmem = 0;
> - unsigned long pgpgout = 0;
> + struct uncharge_gather ug;
> struct list_head *next;
> - struct page *page;
> +
> + uncharge_gather_clear(&ug);
>
> /*
> * Note that the list can be a single page->lru; hence the
> @@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
> */
> next = page_list->next;
> do {
> + struct page *page;
> +
Nit pick
VM_WARN_ON(is_zone_device_page(page));
> page = list_entry(next, struct page, lru);
> next = page->lru.next;
>
> - VM_BUG_ON_PAGE(PageLRU(page), page);
> - VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> -
> - if (!page->mem_cgroup)
> - continue;
> -
> - /*
> - * Nobody should be changing or seriously looking at
> - * page->mem_cgroup at this point, we have fully
> - * exclusive access to the page.
> - */
> -
> - if (memcg != page->mem_cgroup) {
> - if (memcg) {
> - uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> - nr_kmem, nr_huge, nr_shmem, page);
> - pgpgout = nr_anon = nr_file = nr_kmem = 0;
> - nr_huge = nr_shmem = 0;
> - }
> - memcg = page->mem_cgroup;
> - }
> -
> - if (!PageKmemcg(page)) {
> - unsigned int nr_pages = 1;
> -
> - if (PageTransHuge(page)) {
> - nr_pages <<= compound_order(page);
> - nr_huge += nr_pages;
> - }
> - if (PageAnon(page))
> - nr_anon += nr_pages;
> - else {
> - nr_file += nr_pages;
> - if (PageSwapBacked(page))
> - nr_shmem += nr_pages;
> - }
> - pgpgout++;
> - } else {
> - nr_kmem += 1 << compound_order(page);
> - __ClearPageKmemcg(page);
> - }
> -
> - page->mem_cgroup = NULL;
> + uncharge_page(page, &ug);
> } while (next != page_list);
>
> - if (memcg)
> - uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
> - nr_kmem, nr_huge, nr_shmem, page);
> + if (ug.memcg)
> + uncharge_batch(&ug);
> }
>
> /**
> @@ -5620,6 +5633,8 @@ static void uncharge_list(struct list_head *page_list)
> */
> void mem_cgroup_uncharge(struct page *page)
> {
> + struct uncharge_gather ug;
> +
> if (mem_cgroup_disabled())
> return;
>
> @@ -5627,8 +5642,9 @@ void mem_cgroup_uncharge(struct page *page)
> if (!page->mem_cgroup)
> return;
>
> - INIT_LIST_HEAD(&page->lru);
> - uncharge_list(&page->lru);
> + uncharge_gather_clear(&ug);
> + uncharge_page(page, &ug);
> + uncharge_batch(&ug);
> }
Balbir Singh.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field
[not found] ` <20170615133128.2fe2c33f-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
@ 2017-06-15 15:35 ` Jerome Glisse
0 siblings, 0 replies; 13+ messages in thread
From: Jerome Glisse @ 2017-06-15 15:35 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, John Hubbard, David Nellans,
Johannes Weiner, Michal Hocko, Vladimir Davydov,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Thu, Jun 15, 2017 at 01:31:28PM +1000, Balbir Singh wrote:
> On Wed, 14 Jun 2017 16:11:42 -0400
> Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > thus you can not use page->lru fields of those pages. This patch
> > re-arrange the uncharge to allow single page to be uncharge without
> > modifying the lru field of the struct page.
> >
> > There is no change to memcontrol logic, it is the same as it was
> > before this patch.
> >
> > Signed-off-by: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > ---
> > mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
> > 1 file changed, 92 insertions(+), 76 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e3fe4d0..b93f5fe 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5509,48 +5509,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> > cancel_charge(memcg, nr_pages);
> > }
> >
> > -static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> > - unsigned long nr_anon, unsigned long nr_file,
> > - unsigned long nr_kmem, unsigned long nr_huge,
> > - unsigned long nr_shmem, struct page *dummy_page)
> > +struct uncharge_gather {
> > + struct mem_cgroup *memcg;
> > + unsigned long pgpgout;
> > + unsigned long nr_anon;
> > + unsigned long nr_file;
> > + unsigned long nr_kmem;
> > + unsigned long nr_huge;
> > + unsigned long nr_shmem;
> > + struct page *dummy_page;
> > +};
> > +
> > +static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> > {
> > - unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
> > + memset(ug, 0, sizeof(*ug));
> > +}
> > +
> > +static void uncharge_batch(const struct uncharge_gather *ug)
> > +{
>
> Can we pass page as an argument so that we can do check events on the page?
Well it is what dummy page is for, i wanted to keep code as close
to existing to make it easier for people to see that there was no
logic change with that patch.
[...]
> > +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> > +{
> > + VM_BUG_ON_PAGE(PageLRU(page), page);
> > + VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> > +
> > + if (!page->mem_cgroup)
> > + return;
> > +
> > + /*
> > + * Nobody should be changing or seriously looking at
> > + * page->mem_cgroup at this point, we have fully
> > + * exclusive access to the page.
> > + */
> > +
> > + if (ug->memcg != page->mem_cgroup) {
> > + if (ug->memcg) {
> > + uncharge_batch(ug);
>
> What is ug->dummy_page set to at this point? ug->dummy_page is assigned below
So at begining ug->memcg is NULL and so is ug->dummy_page, after first
call to uncharge_page() if ug->memcg isn't NULL then ug->dummy_page
points to a valid page. So uncharge_batch() can never be call with
dummy_page == NULL same as before this patch.
[...]
> > static void uncharge_list(struct list_head *page_list)
> > {
> > - struct mem_cgroup *memcg = NULL;
> > - unsigned long nr_shmem = 0;
> > - unsigned long nr_anon = 0;
> > - unsigned long nr_file = 0;
> > - unsigned long nr_huge = 0;
> > - unsigned long nr_kmem = 0;
> > - unsigned long pgpgout = 0;
> > + struct uncharge_gather ug;
> > struct list_head *next;
> > - struct page *page;
> > +
> > + uncharge_gather_clear(&ug);
> >
> > /*
> > * Note that the list can be a single page->lru; hence the
> > @@ -5558,57 +5612,16 @@ static void uncharge_list(struct list_head *page_list)
> > */
> > next = page_list->next;
> > do {
> > + struct page *page;
> > +
>
> Nit pick
>
> VM_WARN_ON(is_zone_device_page(page));
Yeah probably good thing to add. I will add it as part of the
other memcontrol patch as i want to keep this one about moving
stuff around with no logic change.
Thanks for reviewing
Jérôme
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-06-15 15:35 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-14 20:11 [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Jérôme Glisse
2017-06-14 20:11 ` [HMM-CDM 3/5] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
[not found] ` <20170614201144.9306-4-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-06-15 3:31 ` Balbir Singh
[not found] ` <20170615133128.2fe2c33f-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
2017-06-15 15:35 ` Jerome Glisse
2017-06-14 20:11 ` [HMM-CDM 4/5] mm/memcontrol: support MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_PUBLIC Jérôme Glisse
2017-06-15 1:41 ` Balbir Singh
[not found] ` <20170615114159.11a1eece-czjtt42qU+xukE8zL+fsbsxtgHpCUUYS@public.gmane.org>
2017-06-15 2:04 ` Jerome Glisse
2017-06-15 3:10 ` Balbir Singh
2017-06-14 21:20 ` [HMM-CDM 0/5] Cache coherent device memory (CDM) with HMM Dave Hansen
[not found] ` <8219f8fb-65bb-7c6b-6c4c-acc0601c1e0f-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-14 21:38 ` Jerome Glisse
[not found] ` <20170614213800.GD4160-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-06-14 21:58 ` Dave Hansen
2017-06-14 22:07 ` Benjamin Herrenschmidt
[not found] ` <3a617630-2406-da49-707c-4959a2afd8e1-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-14 23:40 ` Balbir Singh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox