* [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory)
[not found] <1409173922-7484-1-git-send-email-ross.zwisler@linux.intel.com>
@ 2014-09-09 15:37 ` Boaz Harrosh
2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:37 UTC (permalink / raw)
To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
Cc: Andrew Morton, linux-kernel
On 08/28/2014 12:11 AM, Ross Zwisler wrote:
> PMEM is a modified version of the Block RAM Driver, BRD. The major difference
> is that BRD allocates its backing store pages from the page cache, whereas
> PMEM uses reserved memory that has been ioremapped.
>
> One benefit of this approach is that there is a direct mapping between
> filesystem block numbers and virtual addresses. In PMEM, filesystem blocks N,
> N+1, N+2, etc. will all be adjacent in the virtual memory space. This property
> allows us to set up PMD mappings (2 MiB) for DAX.
>
> This patch set builds upon the work that Matthew Wilcox has been doing for
> DAX:
>
Let us not submit a driver with the wrong user-visible API. Let's submit the
better API (and structure) I have sent.
> https://lkml.org/lkml/2014/8/27/31
>
> Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API
> enhancements introduced in Matthew's DAX patch v10 02/21:
>
> https://lkml.org/lkml/2014/8/27/48
>
> Ross Zwisler (4):
> pmem: Initial version of persistent memory driver
> pmem: Add support for getgeo()
> pmem: Add support for rw_page()
> pmem: Add support for direct_access()
>
On top of the 4 patches above, here is the list of changes:
[PATCH 1/9] SQUASHME: pmem: Remove unused #include headers
[PATCH 2/9] SQUASHME: pmem: Request from fdisk 4k alignment
[PATCH 3/9] SQUASHME: pmem: Let each device manage private memory region
[PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions
These 4 need to be squashed into Ross's
[patch 1/4] pmem: Initial version of persistent memory driver
See below for a suggested new patch
[PATCH 5/9 v2] mm: Let sparse_{add,remove}_one_section receive a node_id
[PATCH 6/9 v2] mm: New add_persistent_memory/remove_persistent_memory
[PATCH 7/9 v2] pmem: Add support for page structs
These need review from Toshi and the mm people, please.
[PATCH 8/9] SQUASHME: pmem: Fixes to getgeo
[PATCH 9/9] pmem: KISS, remove register_blkdev
And some more development atop the initial version
All these patches can be viewed in this tree/branch:
git://git.open-osd.org/pmem.git branch pmem-jens-3.17-rc1
[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem-jens-3.17-rc1]
I have also prepared a new branch, *pmem*, which is already squashed
and has my suggested commit logs for the combined patches.
Here is the commit log:
aa85c80 Boaz Harrosh | pmem: KISS, remove register_blkdev
738203c Boaz Harrosh | pmem: Add support for page structs
9f50a54 Boaz Harrosh | mm: New add_persistent_memory/remove_persistent_memory
fdfab12 Yigal Korman | mm: Let sparse_{add,remove}_one_section receive a node_id
a477a87 Ross Zwisler | pmem: Add support for direct_access()
316a93a Ross Zwisler | pmem: Add support for rw_page()
6850353 Boaz Harrosh | SQUASHME: pmem: Fixes to getgeo
d78a84a Ross Zwisler | pmem: Add support for getgeo()
bb0eb45 Ross Zwisler | pmem: Initial version of persistent memory driver
All these patches can be viewed in this tree/branch:
git://git.open-osd.org/pmem.git branch pmem
[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem]
Specifically, the first [bb0eb45] is needed so the first version can be released
with the proper user-visible API.
Ross, please consider taking these patches (the pmem branch) into your tree for submission.
Thanks
Boaz
* [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
@ 2014-09-09 15:45 ` Boaz Harrosh
2014-09-09 18:36 ` Dave Hansen
2014-09-09 15:47 ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
2014-09-09 15:48 ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh
2 siblings, 1 reply; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:45 UTC (permalink / raw)
To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
Cc: Andrew Morton, linux-kernel
From: Yigal Korman <yigal@plexistor.com>
Refactor the arguments of sparse_add_one_section / sparse_remove_one_section
to take a node id instead of a struct zone *. A memory section has no direct
connection to zones; all that was needed from the zone was its node id.
This is for add_persistent_memory that will want a section of pages
allocated but without any zone associated. This is because belonging
to a zone will give the memory to the page allocators, but
persistent_memory belongs to a block device, and is not available for
regular volatile usage.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
include/linux/memory_hotplug.h | 4 ++--
mm/memory_hotplug.c | 4 ++--
mm/sparse.c | 9 +++++----
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d9524c4..35ca1bb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -264,8 +264,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern int sparse_add_one_section(int nid, unsigned long start_pfn);
+extern void sparse_remove_one_section(int nid, struct mem_section *ms);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2ff8c23..e556a90 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -471,7 +471,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
if (pfn_valid(phys_start_pfn))
return -EEXIST;
- ret = sparse_add_one_section(zone, phys_start_pfn);
+ ret = sparse_add_one_section(zone->zone_pgdat->node_id, phys_start_pfn);
if (ret < 0)
return ret;
@@ -737,7 +737,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
start_pfn = section_nr_to_pfn(scn_nr);
__remove_zone(zone, start_pfn);
- sparse_remove_one_section(zone, ms);
+ sparse_remove_one_section(zone->zone_pgdat->node_id, ms);
return 0;
}
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b6..12a10ab 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -690,10 +690,10 @@ static void free_map_bootmem(struct page *memmap)
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(int nid, unsigned long start_pfn)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
- struct pglist_data *pgdat = zone->zone_pgdat;
+ struct pglist_data *pgdat = NODE_DATA(nid);
struct mem_section *ms;
struct page *memmap;
unsigned long *usemap;
@@ -788,11 +788,11 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
free_map_bootmem(memmap);
}
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(int nid, struct mem_section *ms)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL, flags;
- struct pglist_data *pgdat = zone->zone_pgdat;
+ struct pglist_data *pgdat = NODE_DATA(nid);
pgdat_resize_lock(pgdat, &flags);
if (ms->section_mem_map) {
@@ -807,5 +807,6 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
free_section_usemap(memmap, usemap);
}
+
#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif /* CONFIG_MEMORY_HOTPLUG */
--
1.9.3
* [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory
2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2014-09-09 15:47 ` Boaz Harrosh
2014-09-09 15:48 ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh
2 siblings, 0 replies; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:47 UTC (permalink / raw)
To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
Cc: Andrew Morton, linux-kernel
From: Boaz Harrosh <boaz@plexistor.com>
Persistent Memory is not Memory. It is not presented as
a memory zone and is not available through the page allocators
for application/kernel volatile usage.
It belongs to a block device just like any other persistent storage;
the novelty here is that it is directly mapped on the CPU memory
bus, and is usually as fast, or almost as fast, as system RAM.
The main motivation of add_persistent_memory is to allocate a
page-struct "section" for a given physical memory region, because
the user of this memory-mapped device might need to pass page structs
of this memory to a kernel subsystem like the block layer or networking
and have its content directly DMAed to other devices.
(For example, these pages can be put on a bio and sent to disk
in a copy-less manner.)
It will also request_mem_region_exclusive(.., "persistent_memory")
to own that physical memory region,
and will map that physical region into the kernel's VM at the
address expected by page_address() for the pages allocated
above.
remove_persistent_memory() must be called to undo what
add_persistent_memory did.
A user of this API will then use pfn_to_page(PHYSICAL_ADDR >> PAGE_SHIFT)
to receive a page struct for use on its pmem.
Any operation like page_address(), page_to_pfn(), lock_page(), ... can
be performed on these pages just as usual.
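For illustration only, a minimal caller might look like the sketch below
(my_dev and its fields are made-up placeholders, not part of this patch):

	void *virt_addr;
	struct page *page;
	int err;

	err = add_persistent_memory(my_dev->phys_addr, my_dev->size,
				    &virt_addr);
	if (err)
		return err;

	/* every pfn in the added range now has a page struct */
	page = pfn_to_page(my_dev->phys_addr >> PAGE_SHIFT);
	BUG_ON(page_address(page) != virt_addr);

	/* ... use the pages, e.g. attach them to a bio ... */

	remove_persistent_memory(my_dev->phys_addr, my_dev->size);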
An example user is presented in the next patch, to the pmem.c block device
driver. (There are 3 more such drivers in the kernel that could use this
API.)
This patch is based on research and a patch made by
Yigal Korman <yigal@plexistor.com> for the pmem driver. I took his code
and adapted it to mm, where it belongs.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
include/linux/memory_hotplug.h | 4 +
mm/Kconfig | 23 ++++++
mm/memory_hotplug.c | 177 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 204 insertions(+)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 35ca1bb..9a16cec 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -191,6 +191,10 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
void get_online_mems(void);
void put_online_mems(void);
+int add_persistent_memory(phys_addr_t phys_addr, size_t size,
+ void **o_virt_addr);
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size);
+
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..2b78d19 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -197,6 +197,29 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
+
+# A user of PERSISTENT_MEMORY_SECTION should:
+# depends on PERSISTENT_MEMORY_DEPENDENCY and
+# select DRIVER_NEEDS_PERSISTENT_MEMORY
+# Note that it will not be enabled if MEMORY_HOTPLUG is not enabled.
+#
+# If you change the dependency/select of MEMORY_HOTREMOVE please also
+# update it here.
+#
+config PERSISTENT_MEMORY_DEPENDENCY
+ def_bool y
+ depends on MEMORY_HOTPLUG
+ depends on ARCH_ENABLE_MEMORY_HOTREMOVE && MIGRATION
+
+config DRIVER_NEEDS_PERSISTENT_MEMORY
+ bool
+
+config PERSISTENT_MEMORY_SECTION
+ def_bool y
+ depends on PERSISTENT_MEMORY_DEPENDENCY
+ depends on DRIVER_NEEDS_PERSISTENT_MEMORY
+ select MEMORY_HOTREMOVE
+
#
# If we have space for more page flags then we can enable additional
# optimizations and functionality.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e556a90..1682b0e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2004,3 +2004,180 @@ void __ref remove_memory(int nid, u64 start, u64 size)
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+#ifdef CONFIG_PERSISTENT_MEMORY_SECTION
+
+/* This helper is so we do not need to allocate a page_array bigger
+ * than PAGE_SIZE
+ */
+static int _map_sec_range(ulong sec_start_pfn, struct page **page_array)
+{
+ const uint NUM_PAGES = PAGE_SIZE / sizeof(*page_array);
+ const uint ARRAYS_IN_SECTION = PAGES_PER_SECTION / NUM_PAGES;
+ ulong pfn = sec_start_pfn;
+ uint a;
+
+ for (a = 0; a < ARRAYS_IN_SECTION; ++a) {
+ ulong virt_addr = (ulong)page_address(pfn_to_page(pfn));
+ uint p;
+ int ret;
+
+ for (p = 0; p < NUM_PAGES; ++p)
+ page_array[p] = pfn_to_page(pfn++);
+
+ ret = map_kernel_range_noflush(virt_addr, NUM_PAGES * PAGE_SIZE,
+ PAGE_KERNEL, page_array);
+ if (unlikely(ret < 0)) {
+ pr_warn("%s: map_kernel_range(0x%lx, 0x%lx) => %d\n",
+ __func__, sec_start_pfn, virt_addr, ret);
+ return ret;
+ }
+ if (unlikely(ret < NUM_PAGES)) {
+ pr_warn("%s: map_kernel_range(0x%lx) => %d != %d last_pfn=0x%lx\n",
+ __func__, virt_addr, NUM_PAGES, ret, pfn);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * add_persistent_memory - Add memory sections and map them into kernel space
+ * @phys_addr: start of the physical address range to add & map
+ * @size: size of the memory range in bytes
+ * @o_virt_addr: the returned virtual address of the mapped memory range
+ *
+ * A persistent_memory block device will use this function to add memory
+ * sections and map its physical memory range. After the call to this function
+ * there will be a page struct associated with each pfn added here, and it will
+ * be accessible from kernel space through the returned @o_virt_addr.
+ * @phys_addr will be rounded down to the nearest SECTION_SIZE; the range
+ * mapped will be in full SECTION_SIZE sections.
+ * @o_virt_addr is the address of @phys_addr, not the start of the mapped
+ * section, so mapping a range unaligned on SECTION_SIZE will usually work;
+ * an unaligned start and/or end just ignores the error and continues
+ * (but "memory section XX already exists" will be printed).
+ *
+ * NOTE:
+ * persistent_memory is not system RAM and is not available through any
+ * allocator for regular consumption. Therefore it does not belong to any
+ * memory zone.
+ * But it does need a memory section allocated, so page structs are available
+ * for this memory and it can be DMA'd directly with zero copy.
+ * After a call to this function the pages in the range belong exclusively
+ * to the caller.
+ *
+ * RETURNS:
+ * zero on success, or -errno on failure. If successful, @o_virt_addr is set.
+ */
+int add_persistent_memory(phys_addr_t phys_addr, size_t size,
+ void **o_virt_addr)
+{
+ ulong start_pfn = phys_addr >> PAGE_SHIFT;
+ ulong nr_pages = size >> PAGE_SHIFT;
+ ulong start_sec = pfn_to_section_nr(start_pfn);
+ ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+ PAGES_PER_SECTION - 1);
+ int nid = memory_add_physaddr_to_nid(phys_addr);
+ struct resource *res_mem;
+ struct page **page_array;
+ ulong i;
+ int ret = 0;
+
+ page_array = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (unlikely(!page_array))
+ return -ENOMEM;
+
+ res_mem = request_mem_region_exclusive(phys_addr, size,
+ "persistent_memory");
+ if (unlikely(!res_mem)) {
+ pr_warn("%s: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+ __func__, phys_addr, size);
+ ret = -EINVAL;
+ goto free_array;
+ }
+
+ for (i = start_sec; i < end_sec; ++i) {
+ ulong sec_start_pfn = i << PFN_SECTION_SHIFT;
+
+ if (pfn_valid(sec_start_pfn)) {
+ pr_warn("%s: memory section %lu already exists.\n",
+ __func__, i);
+ continue;
+ }
+
+ ret = sparse_add_one_section(nid, sec_start_pfn);
+ if (unlikely(ret < 0)) {
+ if (ret == -EEXIST) {
+ ret = 0;
+ continue;
+ } else {
+ pr_warn("%s: sparse_add_one_section => %d\n",
+ __func__, ret);
+ goto release_region;
+ }
+ }
+
+ ret = _map_sec_range(sec_start_pfn, page_array);
+ if (unlikely(ret))
+ goto release_region;
+ }
+
+ *o_virt_addr = page_address(pfn_to_page(start_pfn));
+
+free_array:
+ kfree(page_array);
+ return ret;
+
+release_region:
+ release_mem_region(phys_addr, size);
+ goto free_array;
+}
+EXPORT_SYMBOL_GPL(add_persistent_memory);
+
+/**
+ * remove_persistent_memory - undo everything add_persistent_memory did
+ * @phys_addr: start of the physical address range to remove
+ * @size: size of the memory range in bytes
+ *
+ * A successful call to add_persistent_memory must be paired with
+ * remove_persistent_memory when done. It will unmap the passed pfns from
+ * the kernel virtual address space and will remove the memory sections.
+ * @phys_addr and @size must be exactly those passed to add_persistent_memory;
+ * otherwise results are undefined. No checks are done on this.
+ */
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size)
+{
+ ulong start_pfn = phys_addr >> PAGE_SHIFT;
+ ulong nr_pages = size >> PAGE_SHIFT;
+ ulong start_sec = pfn_to_section_nr(start_pfn);
+ ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+ PAGES_PER_SECTION - 1);
+ int nid = pfn_to_nid(start_pfn);
+ ulong virt_addr;
+ unsigned int i;
+
+ virt_addr = (ulong)page_address(
+ pfn_to_page(start_sec << PFN_SECTION_SHIFT));
+
+ for (i = start_sec; i < end_sec; ++i) {
+ struct mem_section *ms;
+
+ unmap_kernel_range(virt_addr, 1UL << SECTION_SIZE_BITS);
+ virt_addr += 1UL << SECTION_SIZE_BITS;
+
+ ms = __nr_to_section(i);
+ if (!valid_section(ms)) {
+ pr_warn("%s: memory section %d is missing.\n",
+ __func__, i);
+ continue;
+ }
+ sparse_remove_one_section(nid, ms);
+ }
+
+ release_mem_region(phys_addr, size);
+}
+EXPORT_SYMBOL_GPL(remove_persistent_memory);
+
+#endif /* def CONFIG_PERSISTENT_MEMORY_SECTION */
+
--
1.9.3
* [PATCH 7/9] pmem: Add support for page structs
2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
2014-09-09 15:47 ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
@ 2014-09-09 15:48 ` Boaz Harrosh
2 siblings, 0 replies; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:48 UTC (permalink / raw)
To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
Cc: Andrew Morton, linux-kernel
From: Boaz Harrosh <boaz@plexistor.com>
One of the current shortcomings of the NVDIMM/PMEM
support is that this memory does not have page structs
associated with it and therefore cannot be passed
to a block device or the network, or DMAed in any way through
another device in the system.
The use of add_persistent_memory() fixes all this. After this patch
an FS can do:
bdev_direct_access(,&pfn,);
page = pfn_to_page(pfn);
and use that page for lock_page(), set_page_dirty(), and/or
anything else one might do with a struct page *.
(Note that with brd one can already do this.)
[pmem-pages-ref-count]
pmem will serve its pages with ref==0. Once an FS does
a blkdev_get_XXX(,FMODE_EXCL,), that memory is owned by the FS.
The FS needs to manage its allocation, just as it already does
for its disk blocks. The FS should set page->count = 2 before
submission to any kernel subsystem, so that when it returns it will
never be released to the kernel's page allocators. (page_freeze)
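A rough sketch of that flow from the FS side (hedged: the
bdev_direct_access() signature here is assumed from Matthew's DAX series,
and the raw atomic_set() stands in for whatever FS-private helper actually
pins the count):

	void *addr;
	unsigned long pfn;
	struct page *page;

	/* resolve an FS block of the exclusively-held bdev to a pfn */
	if (bdev_direct_access(bdev, sector, &addr, &pfn, PAGE_SIZE) < 0)
		return -EIO;

	page = pfn_to_page(pfn);

	/* pin the count so the page can never fall back into the
	 * kernel page allocators (the "page->count = 2" rule above) */
	atomic_set(&page->_count, 2);

	lock_page(page);
	/* ... attach the page to a bio or skb and do the I/O ... */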
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
drivers/block/Kconfig | 13 +++++++++++++
drivers/block/pmem.c | 19 +++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5da8cbf..8a5929c 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,6 +416,19 @@ config BLK_DEV_PMEM
Most normal users won't need this functionality, and can thus say N
here.
+config BLK_DEV_PMEM_USE_PAGES
+ bool "Enable use of page struct pages with pmem"
+ depends on BLK_DEV_PMEM
+ depends on PERSISTENT_MEMORY_DEPENDENCY
+ select DRIVER_NEEDS_PERSISTENT_MEMORY
+ default y
+ help
+ If a user of the PMEM device needs "struct page" associated
+ with its memory, so this memory can be sent to other
+ block devices, sent on the network, or DMA-transferred
+ to other devices in the system, then say "Yes" here.
+ If unsure, leave this as Yes.
+
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index e07a373..b415b61 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -221,6 +221,23 @@ MODULE_PARM_DESC(map,
static LIST_HEAD(pmem_devices);
static int pmem_major;
+#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set pmem->virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+ return add_persistent_memory(pmem->phys_addr, pmem->size,
+ &pmem->virt_addr);
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+ remove_persistent_memory(pmem->phys_addr, pmem->size);
+}
+
+#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
/* pmem->phys_addr and pmem->size need to be set.
* Will then set virt_addr if successful.
*/
@@ -258,6 +275,8 @@ void pmem_unmapmem(struct pmem_device *pmem)
release_mem_region(pmem->phys_addr, pmem->size);
pmem->virt_addr = NULL;
}
+#endif /* ! CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
int i)
--
1.9.3
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2014-09-09 18:36 ` Dave Hansen
2014-09-10 10:07 ` Boaz Harrosh
0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2014-09-09 18:36 UTC (permalink / raw)
To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
> This is for add_persistent_memory that will want a section of pages
> allocated but without any zone associated. This is because belonging
> to a zone will give the memory to the page allocators, but
> persistent_memory belongs to a block device, and is not available for
> regular volatile usage.
I don't think we should be taking patches like this into the kernel
until we've seen the other side of it. Where is the page allocator code
which will see a page belonging to no zone? Am I missing it in this set?
I see about 80 or so calls to page_zone() in the kernel. How will a
zone-less page look to all of these sites?
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-09 18:36 ` Dave Hansen
@ 2014-09-10 10:07 ` Boaz Harrosh
2014-09-10 16:10 ` Dave Hansen
0 siblings, 1 reply; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-10 10:07 UTC (permalink / raw)
To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox,
linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/09/2014 09:36 PM, Dave Hansen wrote:
> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>> This is for add_persistent_memory that will want a section of pages
>> allocated but without any zone associated. This is because belonging
>> to a zone will give the memory to the page allocators, but
>> persistent_memory belongs to a block device, and is not available for
>> regular volatile usage.
>
> I don't think we should be taking patches like this into the kernel
> until we've seen the other side of it. Where is the page allocator code
> which will see a page belonging to no zone? Am I missing it in this set?
>
It is not missing. It will never be.
These pages do not belong to any allocator. They are not allocatable
pages. In fact they are not "memory"; they are "storage".
These pages belong wholly to a block device. In turn, the block
device grants ownership of a partition of these pages to an FS.
The loaded FS has its own block allocation scheme, which internally
circulates each page's usage. But a page never goes beyond its
FS.
> I see about 80 or so calls to page_zone() in the kernel. How will a
> zone-less page look to all of these sites?
>
None of these 80 call sites will be reached! The pages are always used
below the FS, e.g. sent on the network, or sent to a slower
block device via a bio. I have a full-fledged FS on top of this code
and it all works very smoothly and stably. (And fast ;))
It is up to the pmem-based FS to manage its pages' ref counts so they are
never released outside of its own block allocator.
At the end of the day, struct page has nothing to do with zones
and allocators and "memory"; as the Documentation says, a struct
page is a facility to track the state of a physical page in the
system. All the other structures are higher in the stack, above
the physical layer. Struct pages, for me, are the upper API of the
physical memory layer, which is shared with pmem. Higher
on the stack, where memory has a zone, pmem has a block device.
Higher, where we have page allocators, pmem has an FS block allocator;
higher, where we have a slab, pmem has files for user consumption.
pmem is storage, which shares the physical layer with memory, and
this is what this patch describes. There will be no more mm interaction
at all for pmem. The rest of the picture is all there in plain sight as
part of this patchset: the pmem.c driver, then an FS on top of that. What
else do you need to see?
Thanks
Boaz
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-10 10:07 ` Boaz Harrosh
@ 2014-09-10 16:10 ` Dave Hansen
2014-09-10 17:25 ` Boaz Harrosh
0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2014-09-10 16:10 UTC (permalink / raw)
To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/10/2014 03:07 AM, Boaz Harrosh wrote:
> On 09/09/2014 09:36 PM, Dave Hansen wrote:
>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>>> This is for add_persistent_memory that will want a section of pages
>>> allocated but without any zone associated. This is because belonging
>>> to a zone will give the memory to the page allocators, but
>>> persistent_memory belongs to a block device, and is not available for
>>> regular volatile usage.
>>
>> I don't think we should be taking patches like this into the kernel
>> until we've seen the other side of it. Where is the page allocator code
>> which will see a page belonging to no zone? Am I missing it in this set?
>
> It is not missing. It will never be.
>
> These pages do not belong to any allocator. They are not allocatable
> pages. In fact they are not "memory"; they are "storage".
>
> These pages belong wholly to a block device. In turn, the block
> device grants ownership of a partition of these pages to an FS.
> The loaded FS has its own block allocation scheme, which internally
> circulates each page's usage. But a page never goes beyond its
> FS.
I'm mostly worried about things that start with an mmap().
Imagine you mmap() a persistent memory file, fault some pages in, then
'cat /proc/$pid/numa_maps'. That code will look at the page to see
which zone and node it is in.
Or, consider if you mmap() then put a futex in the page. The page will
have get_user_pages() called on it by the futex code, and a reference
taken. The reference can outlast the mmap(). We either have to put the
file somewhere special and scan the page's reference occasionally, or we
need to hook something under put_page() to make sure that we keep the
page out of the normal allocator.
>> I see about 80 or so calls to page_zone() in the kernel. How will a
>> zone-less page look to all of these sites?
>
> None of these 80 call sites will be reached! The pages are always used
> below the FS, e.g. sent on the network, or sent to a slower
> block device via a bio. I have a full-fledged FS on top of this code
> and it all works very smoothly and stably. (And fast ;))
Does the fs support mmap()?
The idea of layering is a nice one, but mmap() is a big fat layering
violation. :)
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-10 16:10 ` Dave Hansen
@ 2014-09-10 17:25 ` Boaz Harrosh
2014-09-10 18:28 ` Dave Hansen
0 siblings, 1 reply; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-10 17:25 UTC (permalink / raw)
To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox,
linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/10/2014 07:10 PM, Dave Hansen wrote:
> On 09/10/2014 03:07 AM, Boaz Harrosh wrote:
>> On 09/09/2014 09:36 PM, Dave Hansen wrote:
>>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>>>> This is for add_persistent_memory that will want a section of pages
>>>> allocated but without any zone associated. This is because belonging
>>>> to a zone will give the memory to the page allocators, but
>>>> persistent_memory belongs to a block device, and is not available for
>>>> regular volatile usage.
>>>
>>> I don't think we should be taking patches like this into the kernel
>>> until we've seen the other side of it. Where is the page allocator code
>>> which will see a page belonging to no zone? Am I missing it in this set?
>>
>> It is not missing. It will never be.
>>
>> These pages do not belong to any allocator. They are not allocatable
>> pages. In fact they are not "memory"; they are "storage".
>>
>> These pages belong wholly to a block device. In turn, the block
>> device grants ownership of a partition of these pages to an FS.
>> The loaded FS has its own block allocation scheme, which internally
>> circulates each page's usage. But a page never goes beyond its
>> FS.
>
> I'm mostly worried about things that start with an mmap().
>
> Imagine you mmap() a persistent memory file, fault some pages in, then
> 'cat /proc/$pid/numa_maps'. That code will look at the page to see
> which zone and node it is in.
>
> Or, consider if you mmap() then put a futex in the page. The page will
> have get_user_pages() called on it by the futex code, and a reference
> taken. The reference can outlast the mmap(). We either have to put the
> file somewhere special and scan the page's reference occasionally, or we
> need to hook something under put_page() to make sure that we keep the
> page out of the normal allocator.
>
Yes, the block allocator of the pmem-FS always holds the final ref on such a
page, as long as there is valid data on this block. Even across boots, the
mount code re-initializes references. The only internal operation that frees
these blocks is truncate, and only then are these pages returned to the block
allocator; all this is common practice in filesystems, so the page ref on
these blocks only ever drops to zero after they lose all visibility. And
yes, the block allocator uses special code to drop the count to zero,
not put_page().
So there is no chance these pages will ever be presented to the page
allocators through a put_page().
BTW: There is a hook in place that can be used today, by calling
SetPagePrivate(page) and setting a .releasepage function on page->mapping->a_ops.
If .releasepage() returns false the page is not released (and can be added to an
internal queue for garbage collection).
But with the above scheme this is not needed at all. I have yet to find a test
that keeps my free_block reference above 1, at which time I will exercise
a garbage collection queue.
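For reference, a minimal sketch of such a hook (the pmemfs_* names are
made up; per the a_ops contract, returning 0 refuses the release):

	static int pmemfs_releasepage(struct page *page, gfp_t gfp)
	{
		if (page_count(page) > 1) {
			/* still referenced elsewhere (e.g. get_user_pages);
			 * park it on a GC list instead of freeing it */
			pmemfs_gc_add(page);	/* hypothetical helper */
			return 0;		/* refuse the release */
		}
		return 1;
	}

	static const struct address_space_operations pmemfs_aops = {
		.releasepage	= pmemfs_releasepage,
		/* ... */
	};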
>>> I see about 80 or so calls to page_zone() in the kernel. How will a
>>> zone-less page look to all of these sites?
>>
>> None of these 80 call sites will be reached! The pages are always used
>> below the FS, e.g. sent on the network, or sent to a slower
>> block device via a bio. I have a full-fledged FS on top of this code
>> and it all works very smoothly and stably. (And fast ;))
>
> Does the fs support mmap()?
>
> The idea of layering is a nice one, but mmap() is a big fat layering
> violation. :)
>
No!
Yes, the FS supports mmap, but through the DAX patchset. Please see in
Matthew's DAX patchset how he implements mmap without using pages
at all, mapping PFNs directly to virtual addresses. So these pages do
not get exposed to the top of the FS.
My FS uses his techniques exactly; only when it wants to spill over to a
slower device will it use these pages copy-lessly.
Cheers
Boaz
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-10 17:25 ` Boaz Harrosh
@ 2014-09-10 18:28 ` Dave Hansen
2014-09-11 8:39 ` Boaz Harrosh
0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2014-09-10 18:28 UTC (permalink / raw)
To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/10/2014 10:25 AM, Boaz Harrosh wrote:
> Yes, the block allocator of the pmem-FS always holds the final ref on such a
> page, as long as there is valid data on this block. Even across boots, the
> mount code re-initializes references. The only internal operation that frees
> these blocks is truncate, and only then are these pages returned to the block
> allocator; all this is common practice in filesystems, so the page ref on
> these blocks only ever drops to zero after they lose all visibility. And
> yes, the block allocator uses special code to drop the count to zero,
> not put_page().
OK, so what happens when a page is truncated out of a file and this
"last" block reference is dropped while a get_user_pages() still has a
reference?
> On 09/10/2014 07:10 PM, Dave Hansen wrote:
>> Does the fs support mmap()?
>>
> No!
>
> Yes, the FS supports mmap, but through the DAX patchset. Please see in
> Matthew's DAX patchset how he implements mmap without using pages
> at all, mapping PFNs directly to virtual addresses. So these pages do
> not get exposed to the top of the FS.
>
> My FS uses his techniques exactly; only when it wants to spill over to a
> slower device will it use these pages copy-lessly.
From my perspective, DAX is complicated, but it is necessary because we
don't have a 'struct page'. You're saying that even if we pay the cost
of a 'struct page' for the memory, we still don't get the benefit of
having it like getting rid of this DAX stuff?
Also, about not having a zone for these pages. Do you intend to support
32-bit systems? If so, I believe you will require the kmap() family of
functions to map the pages in order to copy data in and out. kmap()
currently requires knowing the zone of the page.
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-10 18:28 ` Dave Hansen
@ 2014-09-11 8:39 ` Boaz Harrosh
2014-09-11 17:07 ` Dave Hansen
0 siblings, 1 reply; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-11 8:39 UTC (permalink / raw)
To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe,
Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/10/2014 09:28 PM, Dave Hansen wrote:
<>
>
> OK, so what happens when a page is truncated out of a file and this
> "last" block reference is dropped while a get_user_pages() still has a
> reference?
>
I have a very simple plan for this scenario: as I said, hang these pages
with ref != 1 on a garbage list, and one of the cleaner threads can scan them
periodically and release them.
I have this test in place; currently what I do is just drop the block
and let it leak (that is, not be used any more) until the next mount, when
it will be returned to the free store. Yes, stupid, I know. But I print a big
fat message when this happens and I have not been able to reproduce it.
So I'm still waiting for this test case; I guess DAX protects me.
<>
> From my perspective, DAX is complicated, but it is necessary because we
> don't have a 'struct page'. You're saying that even if we pay the cost
> of a 'struct page' for the memory, we still don't get the benefit of
> having it like getting rid of this DAX stuff?
>
No, DAX is still necessary, because we map storage directly to app space,
and we still need it persistent. That is, we cannot/need not use an
in-RAM radix tree, but directly use on-storage btrees.
The regular VFS has a 2-tier model: volatile RAM over a persistent store.
DAX is an alternative VFS model where you have a single tier; the name
implies "Direct Access".
So this has nothing to do with page cost or "benefit". DAX is about a new
VFS model for new storage technologies.
And please note, the complexity you are talking about is just a learning
curve, on the developers' side, not a technological one. Actually, if you
compare the two models, let's call them VFS-2t and VFS-1t, you see that
DAX is an order of magnitude simpler than the old model.
Life is hard and we do need the two models, all at the same time, to support
all these different devices. So yes, complexity is added with the added
choice. But please do not confuse things: DAX is not the complicated part.
Having a choice is.
> Also, about not having a zone for these pages. Do you intend to support
> 32-bit systems? If so, I believe you will require the kmap() family of
> functions to map the pages in order to copy data in and out. kmap()
> currently requires knowing the zone of the page.
No!!! This is strictly 64-bit. A 32-bit system can have at maximum
3GB of low RAM + storage.
DAX implies always mapped, that is, no re-mapping. So this rules out
more than a GB of storage. Since that is a joke: no! 32-bit is out.
You need to understand the current HW standard talks about DDR4, and there are
DDR3 samples floating around. So this is strictly 64-bit, even on
phones.
Thanks
Boaz
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-11 8:39 ` Boaz Harrosh
@ 2014-09-11 17:07 ` Dave Hansen
2014-09-14 9:36 ` Boaz Harrosh
0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2014-09-11 17:07 UTC (permalink / raw)
To: Boaz Harrosh, Boaz Harrosh, Ross Zwisler, Jens Axboe,
Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/11/2014 01:39 AM, Boaz Harrosh wrote:
> On 09/10/2014 09:28 PM, Dave Hansen wrote:
>> OK, so what happens when a page is truncated out of a file and this
>> "last" block reference is dropped while a get_user_pages() still has a
>> reference?
>
> I have a very simple plan for this scenario: as I said, hang these pages
> with ref != 1 on a garbage list, and one of the cleaner threads can scan them
> periodically and release them.
>
> I have this test in place; currently what I do is just drop the block
> and let it leak (that is, not be used any more) until the next mount, when
> it will be returned to the free store. Yes, stupid, I know. But I print a big
> fat message when this happens and I have not been able to reproduce it.
> So I'm still waiting for this test case; I guess DAX protects me.
OK, that sounds like it will work. The "leaked until the next mount"
sounds disastrous, but I'm sure you'll fix that. I can see how it might
lead to some fragmentation if only small amounts are ever pinned, but
not a deal-breaker.
>> From my perspective, DAX is complicated, but it is necessary because we
>> don't have a 'struct page'. You're saying that even if we pay the cost
>> of a 'struct page' for the memory, we still don't get the benefit of
>> having it like getting rid of this DAX stuff?
>
> No, DAX is still necessary, because we map storage directly to app space,
> and we still need it persistent. That is, we cannot/need not use an
> in-RAM radix tree, but directly use on-storage btrees.
Huh? We obviously don't need/want persistent memory pages in the page
*cache*. But, that's completely orthogonal to _having_ a 'struct page'
for them.
DAX does two major things:
1. avoids needing the page cache
2. creates "raw" page table entries that the VM does not manage
for mmap()s
I'm not saying to put persistent memory in the page cache.
I'm saying that, if we have a 'struct page' for the memory, we should
try to make the mmap()s more normal. This enables all kinds of things
that DAX does not support today, like direct I/O.
> Life is hard and we do need the two models, all at the same time, to support
> all these different devices. So yes, complexity is added with the added
> choice. But please do not confuse things: DAX is not the complicated part.
> Having a choice is.
Great, so we at least agree that this adds complexity.
>> Also, about not having a zone for these pages. Do you intend to support
>> 32-bit systems? If so, I believe you will require the kmap() family of
>> functions to map the pages in order to copy data in and out. kmap()
>> currently requires knowing the zone of the page.
>
> No!!! This is strictly 64-bit. A 32-bit system can have at maximum
> 3GB of low RAM + storage.
> DAX implies always mapped, that is, no re-mapping. So this rules out
> more than a GB of storage. Since that is a joke: no! 32-bit is out.
>
> You need to understand the current HW standard talks about DDR4, and there are
> DDR3 samples floating around. So this is strictly 64-bit, even on
> phones.
OK, so I think I at least understand the scope of the patch set and the
limitations. I think I've summarized the limitations:
1. Approach requires all of RAM+Pmem to be direct-mapped (rules out
almost all 32-bit systems, or any 64-bit systems with more than 64TB
of RAM+pmem-storage)
2. Approach is currently incompatible with some kernel code that
requires a 'struct page' (such as direct I/O), and all kernel code
that requires knowledge of zones or NUMA nodes.
3. Approach requires 1/64 of the amount of storage to be consumed by
RAM for a pseudo 'struct page'. If you had 64GB of storage and 1GB
of RAM, you would simply run out of RAM.
Did I miss any?
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
2014-09-11 17:07 ` Dave Hansen
@ 2014-09-14 9:36 ` Boaz Harrosh
0 siblings, 0 replies; 12+ messages in thread
From: Boaz Harrosh @ 2014-09-14 9:36 UTC (permalink / raw)
To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe,
Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
Cc: Andrew Morton, linux-kernel
On 09/11/2014 08:07 PM, Dave Hansen wrote:
<>
>
> OK, that sounds like it will work. The "leaked until the next mount"
> sounds disastrous, but I'm sure you'll fix that. I can see how it might
> lead to some fragmentation if only small amounts are ever pinned, but
> not a deal-breaker.
>
There is no such thing as fragmentation with memory mapped storage ;-)
<>
> I'm saying that, if we have a 'struct page' for the memory, we should
> try to make the mmap()s more normal. This enables all kinds of things
> that DAX does not support today, like direct I/O.
>
What? No! Direct I/O is fully supported, including all its APIs. Do
you mean open(O_DIRECT) and io_submit(..)? Yes, it is fully supported.
In fact all IO is direct IO; there is never a page cache in the way, hence "direct".
BTW: These patches enable something else. Say FSA is a DAX FS and FSB is a regular
disk FS; then:
fda = open(/mnt/FSA);
pa = mmap(fda, ...);
fdb = open(/mnt/FSB, O_DIRECT);
io_submit(fdb,..,pa ,..);
/* I mean pa is put for IO into the passed iocb for fdb */
Before this patch the above would not work and would revert to buffered IO, but
with these patches it works.
Please note this is true for the submitted pmem driver. With brd, which
also supports DAX, this already works, because brd always uses pages.
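Spelled out as a userspace sketch (paths, size, and error handling are
placeholders; link with -laio):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <libaio.h>

	int fda = open("/mnt/FSA/src", O_RDONLY);
	void *pa = mmap(NULL, 1 << 20, PROT_READ, MAP_SHARED, fda, 0);

	int fdb = open("/mnt/FSB/dst", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };

	io_setup(1, &ctx);
	/* the iocb's buffer is the DAX mapping itself, so the write goes
	 * straight from pmem with no intermediate page-cache copy */
	io_prep_pwrite(&cb, fdb, pa, 1 << 20, 0);
	io_submit(ctx, 1, cbs);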
<>
> Great, so we at least agree that this adds complexity.
>
But the complexity is already there; Matthew's DAX is to go in soon, I hope.
Surely these added pages do not add that much to the complexity.
<>
>
> OK, so I think I at least understand the scope of the patch set and the
> limitations. I think I've summarized the limitations:
>
> 1. Approach requires all of RAM+Pmem to be direct-mapped (rules out
> almost all 32-bit systems, or any 64-bit systems with more than 64TB
> of RAM+pmem-storage)
Yes, for NOW
> 2. Approach is currently incompatible with some kernel code that
> requires a 'struct page' (such as direct I/O), and all kernel code
> that requires knowledge of zones or NUMA nodes.
NO!
Direct IO - supported
NUMA - supported
"all kernel code that requires knowledge of zones" - Not needed
> 3. Approach requires 1/64 of the amount of storage to be consumed by
> RAM for a pseudo 'struct page'. If you had 64GB of storage and 1GB
> of RAM, you would simply run out of RAM.
>
Yes, so in a system as above with 64GB of pmem, 1GB of pmem will need to be
set aside and hotplugged as volatile memory. This already works today, BTW:
you can set aside a portion of an NvDIMM and hotplug it as system memory.
We are already used to paying that ratio for RAM.
As a kernel-config choice, that ratio can also be paid for pmem. This is
why I left it as a configuration option.
> Did I miss any?
>
Thanks
Boaz