From: Boaz Harrosh <boaz@plexistor.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Matthew Wilcox <willy@linux.intel.com>,
Sagi Manole <sagi@plexistor.com>,
Yigal Korman <yigal@plexistor.com>
Subject: [RFC 9/9] prd: Add support for page struct mapping
Date: Wed, 13 Aug 2014 15:26:08 +0300
Message-ID: <53EB5960.50200@plexistor.com>
In-Reply-To: <53EB5536.8020702@gmail.com>
From: Yigal Korman <yigal@plexistor.com>
One of the current shortcomings of the NVDIMM/PMEM
support is that this memory has no struct page(s)
associated with it and therefore cannot be passed
to a block device or the network stack, or be DMAed in any
way through another device in the system.
This simple patch fixes all this. After this patch an FS
can do:
bdev_direct_access(,&pfn,);
page = pfn_to_page(pfn);
and use that page for lock_page(), set_page_dirty(), and/or
anything else one might do with a page *.
(Note that with brd one can already do this.)
[pmem-pages-ref-count]
pmem will serve its pages with ref==0. Once an FS does
a blkdev_get_XXX(,FMODE_EXCL,), that memory is owned by the FS.
The FS needs to manage its allocation, just as it already does
for its disk blocks. The FS should set page->count = 2 before
submission to any kernel subsystem, so that when the page returns
it is never released to the kernel's page allocators. (page_freeze)
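The reference-count rule above can be illustrated with a small user-space
toy model. It is only a sketch and not kernel code: struct toy_page and
the toy_* helpers are hypothetical names, and the model assumes that a
subsystem takes and drops its own reference and then drops one more
reference that was handed over at submission time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the reference count is modelled. */
struct toy_page {
	int count;
	bool released_to_allocator;
};

/* Models get_page(): the subsystem takes its own reference. */
static void toy_get_page(struct toy_page *p)
{
	p->count++;
}

/* Models put_page(): dropping the last reference frees the page. */
static void toy_put_page(struct toy_page *p)
{
	assert(p->count > 0);
	if (--p->count == 0)
		p->released_to_allocator = true;
}

/*
 * Hypothetical submission path (an assumption of this sketch): the
 * subsystem pins the page, does its I/O, unpins it, and finally drops
 * the reference it was handed by the submitter.
 * Returns true if the page ended up released to the "allocator".
 */
static bool toy_submit(int initial_count)
{
	struct toy_page p = {
		.count = initial_count,
		.released_to_allocator = false,
	};

	toy_get_page(&p);	/* subsystem pins the page */
	/* ... I/O would happen here ... */
	toy_put_page(&p);	/* subsystem unpins the page */
	toy_put_page(&p);	/* reference handed over at submission */

	return p.released_to_allocator;
}
```

In this model toy_submit(1) ends with the count at zero, so the page is
"released", while toy_submit(2) leaves one reference behind and the page
stays with its owner, which is the effect the FS wants.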
All that is actually needed for this is to allocate page sections
and map them into kernel virtual memory. Note that these sections
are not associated with any zone, because that would add them to
the page allocators.
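As a rough illustration of which sections get allocated, the section
arithmetic used by the patch (pfn_to_section_nr() and PFN_SECTION_SHIFT)
can be sketched in user space. The constants below are assumptions
(4 KiB pages and 128 MiB sections, as on x86_64); the kernel takes the
real values from its headers, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch; the kernel defines the real ones. */
#define TOY_PAGE_SHIFT		12	/* 4 KiB pages */
#define TOY_SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define TOY_PFN_SECTION_SHIFT	(TOY_SECTION_SIZE_BITS - TOY_PAGE_SHIFT)

struct toy_section_range {
	uint64_t start_pfn;	/* first page frame of the region */
	uint64_t nr_pages;	/* number of page frames in the region */
	uint64_t start_sec;	/* first section touched */
	uint64_t end_sec;	/* last section touched (inclusive) */
};

/* Mirrors the start_sec/end_sec computation in prd_add_page_mapping(). */
static struct toy_section_range toy_sections_for(uint64_t phys_addr,
						 uint64_t total_size)
{
	struct toy_section_range r;

	/* The region must cover at least one page. */
	assert((total_size >> TOY_PAGE_SHIFT) > 0);

	r.start_pfn = phys_addr >> TOY_PAGE_SHIFT;
	r.nr_pages = total_size >> TOY_PAGE_SHIFT;
	r.start_sec = r.start_pfn >> TOY_PFN_SECTION_SHIFT;
	r.end_sec = (r.start_pfn + r.nr_pages - 1) >> TOY_PFN_SECTION_SHIFT;
	return r;
}

/* First pfn of a section, as passed to sparse_add_one_section(). */
static uint64_t toy_section_first_pfn(uint64_t sec)
{
	return sec << TOY_PFN_SECTION_SHIFT;
}
```

Under these assumptions a 256 MiB region at physical address 4 GiB
covers sections 32 and 33. A 128 MiB region starting 64 MiB later is not
section aligned, so it also touches sections 32 and 33 even though it is
only one section in size; sections are always added whole.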
In order to reuse existing code, prd now depends on memory hotplug
and sparse memory configuration options.
If the system has MEMORY_HOTPLUG_SPARSE enabled, then a new config
option BLK_DEV_PMEM_USE_PAGES becomes available (Yes by default).
We will also need MEMORY_HOTREMOVE, so if BLK_DEV_PMEM_USE_PAGES
is on we will "select" MEMORY_HOTREMOVE. Most distros have
MEMORY_HOTPLUG_SPARSE on but not MEMORY_HOTREMOVE. For us here
we must have both.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
drivers/block/Kconfig | 13 +++++
drivers/block/prd.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 145 insertions(+), 5 deletions(-)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 8f0c225..8aca1b7 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,6 +416,19 @@ config BLK_DEV_PMEM
Most normal users won't need this functionality, and can thus say N
here.
+config BLK_DEV_PMEM_USE_PAGES
+ bool "Enable use of page struct pages with pmem"
+ depends on BLK_DEV_PMEM
+ depends on MEMORY_HOTPLUG_SPARSE
+ select MEMORY_HOTREMOVE
+ default y
+ help
+ If a user of PMEM device needs "struct page" associated
+ with its memory, so this memory can be sent to other
+ block devices, or sent on the network, or be DMA transferred
+ to other devices in the system, then you must say "Yes" here.
+ If unsure leave as Yes.
+
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 36b8fe4..6115553 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -241,6 +241,134 @@ MODULE_PARM_DESC(map,
static LIST_HEAD(prd_devices);
static DEFINE_MUTEX(prd_devices_mutex);
+#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
+static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
+ void **o_virt_addr)
+{
+ int nid = memory_add_physaddr_to_nid(phys_addr);
+ unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
+ unsigned long nr_pages = total_size >> PAGE_SHIFT;
+ unsigned int start_sec = pfn_to_section_nr(start_pfn);
+ unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);
+ unsigned long phys_start_pfn;
+ struct page **page_array, **mapped_page_array;
+ unsigned long i;
+ struct vm_struct *vm_area;
+ void *virt_addr;
+ int ret = 0;
+
+ for (i = start_sec; i <= end_sec; i++) {
+ phys_start_pfn = i << PFN_SECTION_SHIFT;
+
+ if (pfn_valid(phys_start_pfn)) {
+ pr_warn("prd: memory section %lu already exists.\n", i);
+ continue;
+ }
+
+ ret = sparse_add_one_section(nid, phys_start_pfn);
+ if (unlikely(ret < 0)) {
+ if (ret == -EEXIST) {
+ ret = 0;
+ continue;
+ } else {
+ pr_warn("prd: sparse_add_one_section => %d\n",
+ ret);
+ return ret;
+ }
+ }
+ }
+
+ virt_addr = page_address(pfn_to_page(phys_addr >> PAGE_SHIFT));
+
+ page_array = vmalloc(sizeof(struct page *) * nr_pages);
+ if (unlikely(!page_array)) {
+ pr_warn("prd: failed to allocate nr_pages=0x%lx\n", nr_pages);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < nr_pages; i++)
+ page_array[i] = pfn_to_page(start_pfn + i);
+
+ /* __get_vm_area requires a range of addresses from which to allocate
+ * the vm_area. This range will include more pages than we need because
+ * it allocates one guard page at the end. Usually you give it a wide
+ * range to choose from, but we want exact addresses, so add
+ * the size of the guard page to the end of the range (otherwise, this
+ * will always fail)
+ */
+ /* TODO this guard page may confuse users when asking for several pmem
+ * devices in adjacent areas (the start of the next pmem will be
+ * occupied by the guard page of the previous pmem)
+ */
+ vm_area = __get_vm_area(total_size, VM_USERMAP, (ulong)virt_addr,
+ (ulong)virt_addr + total_size + PAGE_SIZE);
+ if (unlikely(!vm_area)) {
+ pr_err("prd: failed to __get_vm_area.\n");
+ ret = -ENOMEM;
+ goto free_array;
+ }
+
+ mapped_page_array = page_array;
+ ret = map_vm_area(vm_area, PAGE_KERNEL, &mapped_page_array);
+ if (unlikely(ret || mapped_page_array < (page_array + nr_pages))) {
+ pr_err("prd: failed to map_vm_area => %d\n", ret);
+ if (!ret) {
+ free_vm_area(vm_area);
+ ret = -ENOMEM;
+ }
+ }
+ *o_virt_addr = virt_addr;
+
+free_array:
+ vfree(page_array);
+ return ret;
+}
+
+static void prd_remove_page_mapping(phys_addr_t phys_addr, size_t total_size,
+ void *virt_addr)
+{
+ unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
+ unsigned long nr_pages = total_size >> PAGE_SHIFT;
+ unsigned int start_sec = pfn_to_section_nr(start_pfn);
+ unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);
+ unsigned int i;
+
+ for (i = start_sec; i <= end_sec; i++) {
+ struct mem_section *ms = __nr_to_section(i);
+ int nid = pfn_to_nid(i << PFN_SECTION_SHIFT);
+
+ if (!valid_section(ms)) {
+ pr_warn("prd: memory section %d is missing.\n", i);
+ continue;
+ }
+
+ sparse_remove_one_section(nid, ms);
+ }
+ vunmap(virt_addr);
+}
+
+#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
+static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
+ void **o_virt_addr)
+{
+ void *virt_addr = ioremap_cache(phys_addr, total_size);
+
+ if (unlikely(!virt_addr))
+ return -ENXIO;
+
+ *o_virt_addr = virt_addr;
+ return 0;
+}
+
+static void prd_remove_page_mapping(phys_addr_t phys_addr, size_t total_size,
+ void *virt_addr)
+{
+ iounmap(virt_addr);
+}
+#endif /* CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
+
+
/* prd->phys_addr and prd->size need to be set.
* Will then set virt_addr if successful.
*/
@@ -257,11 +385,10 @@ int prd_mem_map(struct prd_device *prd)
return -EINVAL;
}
- prd->virt_addr = ioremap_cache(prd->phys_addr, prd->size);
- if (unlikely(!prd->virt_addr)) {
- err = -ENOMEM;
+ err = prd_add_page_mapping(prd->phys_addr, prd->size, &prd->virt_addr);
+ if (unlikely(err))
goto out_release;
- }
+
return 0;
out_release:
@@ -274,7 +401,7 @@ void prd_mem_unmap(struct prd_device *prd)
if (unlikely(!prd->virt_addr))
return;
- iounmap(prd->virt_addr);
+ prd_remove_page_mapping(prd->phys_addr, prd->size, prd->virt_addr);
release_mem_region(prd->phys_addr, prd->size);
prd->virt_addr = NULL;
}
--
1.9.3