From: Ralph Campbell <rcampbell@nvidia.com>
To: linux-mm@kvack.org, kvm-ppc@vger.kernel.org,
nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Matthew Wilcox <willy@infradead.org>,
Jerome Glisse <jglisse@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Alistair Popple <apopple@nvidia.com>,
Christoph Hellwig <hch@lst.de>, Jason Gunthorpe <jgg@nvidia.com>,
Bharata B Rao <bharata@linux.ibm.com>, Zi Yan <ziy@nvidia.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Paul Mackerras <paulus@ozlabs.org>,
Ben Skeggs <bskeggs@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Ralph Campbell <rcampbell@nvidia.com>
Subject: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 14 Sep 2020 22:45:09 +0000 [thread overview]
Message-ID: <20200914224509.17699-1-rcampbell@nvidia.com> (raw)
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly/broken. This is my attempt to
fix it and it works for the HMM migration self tests.
I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have a modified THP migration patch series that applies on top of
this one and is cleaner since I don't have to add code to handle the
+1 reference count. The link below is for the earlier v2:
("mm/hmm/nouveau: add THP migration to migrate_vma_*")
https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 -
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 -
include/linux/memremap.h | 6 +--
include/linux/mm.h | 39 ---------------
lib/test_hmm.c | 1 -
mm/gup.c | 44 -----------------
mm/memremap.c | 20 ++++----
mm/migrate.c | 5 --
mm/swap.c | 66 +++++++++++---------------
9 files changed, 41 insertions(+), 142 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..00d97050d7ff 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..2a4bbe01a455 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
- get_page(page);
lock_page(page);
return page;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..7dd9802d2612 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -67,9 +67,9 @@ enum memory_type {
struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count is
+ * reset to 1 before calling page_free(). This allows the
+ * device driver to implement its own memory management.
*/
void (*page_free)(struct page *page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..5a82037a4b26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page)
#ifdef CONFIG_DEV_PAGEMAP_OPS
void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- switch (page->pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_FS_DAX:
- return true;
- default:
- break;
- }
- return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
- return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline bool is_device_private_page(const struct page *page)
@@ -1169,17 +1141,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c8133f50160b..5410f7328577 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,6 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}
dpage->zone_device_data = rpage;
- get_page(dpage);
lock_page(dpage);
return dpage;
diff --git a/mm/gup.c b/mm/gup.c
index b9b1cd8bcd6a..3c54f0c190a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
return true;
}
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- int count, refs = 1;
-
- if (!page_is_devmap_managed(page))
- return false;
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, 1);
- else
- refs = GUP_PIN_COUNTING_BIAS;
-
- count = page_ref_sub_return(page, refs);
-
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-
- return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
/**
* unpin_user_page() - release a dma-pinned page
* @page: pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
- * page is free and we need to inform the device driver through
- * callback. See include/linux/memremap.h and HMM for details.
- */
- if (__unpin_devmap_managed_user_page(page))
- return;
-
if (hpage_pincount_available(page))
hpage_pincount_sub(page, 1);
else
diff --git a/mm/memremap.c b/mm/memremap.c
index 9c951bbfee91..f09562789079 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -91,13 +91,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}
-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
/*
* This returns true if the page is reserved by ZONE_DEVICE driver.
*/
@@ -176,13 +169,12 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;
dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
@@ -516,6 +508,14 @@ void free_devmap_managed_page(struct page *page)
mem_cgroup_uncharge(page);
+ /*
+ * ZONE_DEVICE drivers keep a reference to the page while it is on
+ * the driver's free list so we reset the reference count here.
+ * This matches the initial reference when the struct pages are
+ * created by memremap_pages().
+ */
+ set_page_refcounted(page);
+
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
diff --git a/mm/migrate.c b/mm/migrate.c
index 4f89360d9e77..be1586582b52 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;
- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);
diff --git a/mm/swap.c b/mm/swap.c
index feff680d3de9..8468e72e397f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -113,10 +113,31 @@ static void __put_compound_page(struct page *page)
destroy_compound_page(page);
}
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static void __put_devmap_managed_page(struct page *page)
+{
+ if (!static_branch_unlikely(&devmap_managed_key))
+ return;
+
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_FS_DAX:
+ free_devmap_managed_page(page);
+ break;
+ default:
+ break;
+ }
+}
+#else
+static inline void __put_devmap_managed_page(struct page *page)
+{
+}
+#endif
+
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
+ __put_devmap_managed_page(page);
/*
* The page belongs to the device that created pgmap. Do
@@ -851,27 +872,19 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;
+ page = compound_head(page);
+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (locked_pgdat) {
spin_unlock_irqrestore(&locked_pgdat->lru_lock,
flags);
locked_pgdat = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * put_devmap_managed_page() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- }
-
- page = compound_head(page);
- if (!put_page_testzero(page))
+ __put_devmap_managed_page(page);
continue;
+ }
if (PageCompound(page)) {
if (locked_pgdat) {
@@ -1149,26 +1162,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.20.1
WARNING: multiple messages have this Message-ID (diff)
From: Ralph Campbell <rcampbell@nvidia.com>
To: linux-mm@kvack.org, kvm-ppc@vger.kernel.org,
nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Matthew Wilcox <willy@infradead.org>,
Jerome Glisse <jglisse@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Alistair Popple <apopple@nvidia.com>,
Christoph Hellwig <hch@lst.de>, Jason Gunthorpe <jgg@nvidia.com>,
Bharata B Rao <bharata@linux.ibm.com>, Zi Yan <ziy@nvidia.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Paul Mackerras <paulus@ozlabs.org>,
Ben Skeggs <bskeggs@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Ralph Campbell <rcampbell@nvidia.com>
Subject: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 14 Sep 2020 15:45:09 -0700 [thread overview]
Message-ID: <20200914224509.17699-1-rcampbell@nvidia.com> (raw)
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly/broken. This is my attempt to
fix it and it works for the HMM migration self tests.
I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have a modified THP migration patch series that applies on top of
this one and is cleaner since I don't have to add code to handle the
+1 reference count. The link below is for the earlier v2:
("mm/hmm/nouveau: add THP migration to migrate_vma_*")
https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 -
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 -
include/linux/memremap.h | 6 +--
include/linux/mm.h | 39 ---------------
lib/test_hmm.c | 1 -
mm/gup.c | 44 -----------------
mm/memremap.c | 20 ++++----
mm/migrate.c | 5 --
mm/swap.c | 66 +++++++++++---------------
9 files changed, 41 insertions(+), 142 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..00d97050d7ff 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..2a4bbe01a455 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
- get_page(page);
lock_page(page);
return page;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..7dd9802d2612 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -67,9 +67,9 @@ enum memory_type {
struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count is
+ * reset to 1 before calling page_free(). This allows the
+ * device driver to implement its own memory management.
*/
void (*page_free)(struct page *page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..5a82037a4b26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page)
#ifdef CONFIG_DEV_PAGEMAP_OPS
void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- switch (page->pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_FS_DAX:
- return true;
- default:
- break;
- }
- return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
- return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline bool is_device_private_page(const struct page *page)
@@ -1169,17 +1141,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c8133f50160b..5410f7328577 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,6 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}
dpage->zone_device_data = rpage;
- get_page(dpage);
lock_page(dpage);
return dpage;
diff --git a/mm/gup.c b/mm/gup.c
index b9b1cd8bcd6a..3c54f0c190a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
return true;
}
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- int count, refs = 1;
-
- if (!page_is_devmap_managed(page))
- return false;
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, 1);
- else
- refs = GUP_PIN_COUNTING_BIAS;
-
- count = page_ref_sub_return(page, refs);
-
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-
- return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
/**
* unpin_user_page() - release a dma-pinned page
* @page: pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
- * page is free and we need to inform the device driver through
- * callback. See include/linux/memremap.h and HMM for details.
- */
- if (__unpin_devmap_managed_user_page(page))
- return;
-
if (hpage_pincount_available(page))
hpage_pincount_sub(page, 1);
else
diff --git a/mm/memremap.c b/mm/memremap.c
index 9c951bbfee91..f09562789079 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -91,13 +91,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}
-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
/*
* This returns true if the page is reserved by ZONE_DEVICE driver.
*/
@@ -176,13 +169,12 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;
dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
@@ -516,6 +508,14 @@ void free_devmap_managed_page(struct page *page)
mem_cgroup_uncharge(page);
+ /*
+ * ZONE_DEVICE drivers keep a reference to the page while it is on
+ * the driver's free list so we reset the reference count here.
+ * This matches the initial reference when the struct pages are
+ * created by memremap_pages().
+ */
+ set_page_refcounted(page);
+
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
diff --git a/mm/migrate.c b/mm/migrate.c
index 4f89360d9e77..be1586582b52 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;
- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);
diff --git a/mm/swap.c b/mm/swap.c
index feff680d3de9..8468e72e397f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -113,10 +113,31 @@ static void __put_compound_page(struct page *page)
destroy_compound_page(page);
}
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static void __put_devmap_managed_page(struct page *page)
+{
+ if (!static_branch_unlikely(&devmap_managed_key))
+ return;
+
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_FS_DAX:
+ free_devmap_managed_page(page);
+ break;
+ default:
+ break;
+ }
+}
+#else
+static inline void __put_devmap_managed_page(struct page *page)
+{
+}
+#endif
+
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
+ __put_devmap_managed_page(page);
/*
* The page belongs to the device that created pgmap. Do
@@ -851,27 +872,19 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;
+ page = compound_head(page);
+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (locked_pgdat) {
spin_unlock_irqrestore(&locked_pgdat->lru_lock,
flags);
locked_pgdat = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * put_devmap_managed_page() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- }
-
- page = compound_head(page);
- if (!put_page_testzero(page))
+ __put_devmap_managed_page(page);
continue;
+ }
if (PageCompound(page)) {
if (locked_pgdat) {
@@ -1149,26 +1162,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.20.1
WARNING: multiple messages have this Message-ID (diff)
From: Ralph Campbell <rcampbell@nvidia.com>
To: <linux-mm@kvack.org>, <kvm-ppc@vger.kernel.org>,
<nouveau@lists.freedesktop.org>, <linux-kernel@vger.kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Matthew Wilcox <willy@infradead.org>,
Jerome Glisse <jglisse@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Alistair Popple <apopple@nvidia.com>,
Christoph Hellwig <hch@lst.de>, Jason Gunthorpe <jgg@nvidia.com>,
"Bharata B Rao" <bharata@linux.ibm.com>, Zi Yan <ziy@nvidia.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Paul Mackerras <paulus@ozlabs.org>,
Ben Skeggs <bskeggs@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
Ralph Campbell <rcampbell@nvidia.com>
Subject: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 14 Sep 2020 15:45:09 -0700 [thread overview]
Message-ID: <20200914224509.17699-1-rcampbell@nvidia.com> (raw)
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly/broken. This is my attempt to
fix it and it works for the HMM migration self tests.
I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have a modified THP migration patch series that applies on top of
this one and is cleaner since I don't have to add code to handle the
+1 reference count. The link below is for the earlier v2:
("mm/hmm/nouveau: add THP migration to migrate_vma_*")
https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 -
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 -
include/linux/memremap.h | 6 +--
include/linux/mm.h | 39 ---------------
lib/test_hmm.c | 1 -
mm/gup.c | 44 -----------------
mm/memremap.c | 20 ++++----
mm/migrate.c | 5 --
mm/swap.c | 66 +++++++++++---------------
9 files changed, 41 insertions(+), 142 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..00d97050d7ff 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..2a4bbe01a455 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
- get_page(page);
lock_page(page);
return page;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..7dd9802d2612 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -67,9 +67,9 @@ enum memory_type {
struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count is
+ * reset to 1 before calling page_free(). This allows the
+ * device driver to implement its own memory management.
*/
void (*page_free)(struct page *page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..5a82037a4b26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page)
#ifdef CONFIG_DEV_PAGEMAP_OPS
void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- switch (page->pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_FS_DAX:
- return true;
- default:
- break;
- }
- return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
- return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline bool is_device_private_page(const struct page *page)
@@ -1169,17 +1141,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c8133f50160b..5410f7328577 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,6 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}
dpage->zone_device_data = rpage;
- get_page(dpage);
lock_page(dpage);
return dpage;
diff --git a/mm/gup.c b/mm/gup.c
index b9b1cd8bcd6a..3c54f0c190a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
return true;
}
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- int count, refs = 1;
-
- if (!page_is_devmap_managed(page))
- return false;
-
- if (hpage_pincount_available(page))
- hpage_pincount_sub(page, 1);
- else
- refs = GUP_PIN_COUNTING_BIAS;
-
- count = page_ref_sub_return(page, refs);
-
- mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-
- return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
- return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
/**
* unpin_user_page() - release a dma-pinned page
* @page: pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
- * page is free and we need to inform the device driver through
- * callback. See include/linux/memremap.h and HMM for details.
- */
- if (__unpin_devmap_managed_user_page(page))
- return;
-
if (hpage_pincount_available(page))
hpage_pincount_sub(page, 1);
else
diff --git a/mm/memremap.c b/mm/memremap.c
index 9c951bbfee91..f09562789079 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -91,13 +91,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}
-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
/*
* This returns true if the page is reserved by ZONE_DEVICE driver.
*/
@@ -176,13 +169,12 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;
dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
@@ -516,6 +508,14 @@ void free_devmap_managed_page(struct page *page)
mem_cgroup_uncharge(page);
+ /*
+ * ZONE_DEVICE drivers keep a reference to the page while it is on
+ * the driver's free list so we reset the reference count here.
+ * This matches the initial reference when the struct pages are
+ * created by memremap_pages().
+ */
+ set_page_refcounted(page);
+
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
diff --git a/mm/migrate.c b/mm/migrate.c
index 4f89360d9e77..be1586582b52 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;
- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);
diff --git a/mm/swap.c b/mm/swap.c
index feff680d3de9..8468e72e397f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -113,10 +113,31 @@ static void __put_compound_page(struct page *page)
destroy_compound_page(page);
}
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static void __put_devmap_managed_page(struct page *page)
+{
+ if (!static_branch_unlikely(&devmap_managed_key))
+ return;
+
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_FS_DAX:
+ free_devmap_managed_page(page);
+ break;
+ default:
+ break;
+ }
+}
+#else
+static inline void __put_devmap_managed_page(struct page *page)
+{
+}
+#endif
+
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
+ __put_devmap_managed_page(page);
/*
* The page belongs to the device that created pgmap. Do
@@ -851,27 +872,19 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;
+ page = compound_head(page);
+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (locked_pgdat) {
spin_unlock_irqrestore(&locked_pgdat->lru_lock,
flags);
locked_pgdat = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * put_devmap_managed_page() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- }
-
- page = compound_head(page);
- if (!put_page_testzero(page))
+ __put_devmap_managed_page(page);
continue;
+ }
if (PageCompound(page)) {
if (locked_pgdat) {
@@ -1149,26 +1162,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.20.1
next reply other threads:[~2020-09-14 22:45 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-14 22:45 Ralph Campbell [this message]
2020-09-14 22:45 ` [PATCH] mm: remove extra ZONE_DEVICE struct page refcount Ralph Campbell
2020-09-14 22:45 ` Ralph Campbell
2020-09-14 23:10 ` Dan Williams
2020-09-14 23:10 ` Dan Williams
2020-09-14 23:10 ` Dan Williams
2020-09-14 23:53 ` Ralph Campbell
2020-09-14 23:53 ` Ralph Campbell
2020-09-14 23:53 ` Ralph Campbell
2020-09-15 16:29 ` Christoph Hellwig
2020-09-15 16:29 ` Christoph Hellwig
2020-09-15 16:39 ` Ralph Campbell
2020-09-15 16:39 ` Ralph Campbell
2020-09-15 16:39 ` Ralph Campbell
2020-09-16 5:36 ` Christoph Hellwig
2020-09-16 5:36 ` Christoph Hellwig
2020-09-17 0:29 ` Ralph Campbell
2020-09-17 0:29 ` Ralph Campbell
2020-09-17 0:29 ` Ralph Campbell
2020-09-17 0:32 ` Dan Williams
2020-09-17 0:32 ` Dan Williams
2020-09-16 6:10 ` Christoph Hellwig
2020-09-16 6:10 ` Christoph Hellwig
2020-09-17 0:35 ` Ralph Campbell
2020-09-17 0:35 ` Ralph Campbell
2020-09-17 0:35 ` Ralph Campbell
2020-09-16 6:09 ` Christoph Hellwig
2020-09-16 6:09 ` Christoph Hellwig
2020-09-17 0:34 ` Ralph Campbell
2020-09-17 0:34 ` Ralph Campbell
2020-09-17 0:34 ` Ralph Campbell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200914224509.17699-1-rcampbell@nvidia.com \
--to=rcampbell@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=bharata@linux.ibm.com \
--cc=bskeggs@redhat.com \
--cc=dan.j.williams@intel.com \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jgg@nvidia.com \
--cc=jglisse@redhat.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kvm-ppc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nouveau@lists.freedesktop.org \
--cc=paulus@ozlabs.org \
--cc=willy@infradead.org \
--cc=yang.shi@linux.alibaba.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.