* [PATCH 0/1] nvdimm: allow exposing RAM as libnvdimm DIMMs
@ 2025-08-26  8:04 Mike Rapoport
  2025-08-26  8:04 ` [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2025-08-26  8:04 UTC (permalink / raw)
  To: Dan Williams, Dave Jiang, Ira Weiny, Vishal Verma
  Cc: jane.chu, Mike Rapoport, Pasha Tatashin, Tyler Hicks,
	linux-kernel, nvdimm

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

It's not uncommon that libnvdimm/dax/ndctl are used with normal volatile
memory for a whole bunch of $reasons.

Probably the most common use case is to back VM memory with fsdax/devdax,
but there are others as well when there's a requirement to manage memory
separately from the kernel.

The existing mechanisms to expose normal RAM as "persistent", such as
memmap=x!y on x86 or dummy pmem-region device tree nodes on DT systems,
lack the flexibility to dynamically partition a single region without
rebooting the system and sometimes even updating the system firmware. Also,
creating several DAX devices with different properties requires repeating
the memmap= command line option or adding several pmem-region nodes to the
DT.

I propose a new driver that creates a DIMM device on top of an
E820_TYPE_PRAM/pmem-region range and allows partitioning that device
dynamically. The label area is kept at the end of that region and is
managed by the driver.
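
As a rough illustration of that layout (a userspace model, not the driver
code), the split of one carveout into a mappable part and a trailing label
area could look like the sketch below. The 128 KiB size matches
LABEL_AREA_SIZE in the patch; the struct and function names are made up for
this example, and the minimum-size check is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define LABEL_AREA_SIZE (128u * 1024u)	/* same value as SZ_128K in the patch */

/* Hypothetical description of one RAM carveout (a physical range). */
struct carveout {
	uint64_t start;	/* first byte of the region */
	uint64_t size;	/* length of the region in bytes */
};

/* Hypothetical result: where the data and the labels live. */
struct carveout_layout {
	uint64_t data_start;	/* start of the part exposed for namespaces */
	uint64_t data_size;	/* region size minus the label area */
	uint64_t label_start;	/* labels occupy the last 128 KiB */
};

/* Returns false when the carveout cannot hold a label area plus some data. */
static bool carveout_split(const struct carveout *c,
			   struct carveout_layout *out)
{
	if (c->size <= LABEL_AREA_SIZE)
		return false;

	out->data_start = c->start;
	out->data_size = c->size - LABEL_AREA_SIZE;
	out->label_start = c->start + c->size - LABEL_AREA_SIZE;
	return true;
}
```

For a 1 GiB carveout this leaves 1 GiB minus 128 KiB for namespaces, with
the labels in the final 128 KiB.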

Changes since RFC:
* fix offset calculations in ramdax_{get,set}_config_data
* use a magic constant instead of a random number as nd_set->cookie*
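
The exact RFC fix is not shown in this thread. As a hedged illustration of
the kind of check a label read or write needs, here is a userspace sketch
that validates a 32-bit offset/length pair against the label area without
letting the sum wrap. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LABEL_AREA_SIZE (128u * 1024u)	/* same value as SZ_128K in the patch */

/* True when [in_offset, in_offset + in_length) lies inside the label area. */
static bool label_request_ok(uint32_t in_offset, uint32_t in_length)
{
	/* Widen first so that offset + length cannot wrap around in 32 bits. */
	return (uint64_t)in_offset + in_length <= LABEL_AREA_SIZE;
}
```

With a plain 32-bit addition, an offset close to UINT32_MAX plus a small
length would wrap to a small value and slip past the comparison.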

RFC: https://lore.kernel.org/all/20250612083153.48624-1-rppt@kernel.org


Mike Rapoport (Microsoft) (1):
  nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices

 drivers/nvdimm/ramdax.c | 281 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 281 insertions(+)
 create mode 100644 drivers/nvdimm/ramdax.c


base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
-- 
2.50.1



* [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-08-26  8:04 [PATCH 0/1] nvdimm: allow exposing RAM as libnvdimm DIMMs Mike Rapoport
@ 2025-08-26  8:04 ` Mike Rapoport
  2025-08-29  0:47   ` Ira Weiny
  2025-09-24  1:08   ` dan.j.williams
  0 siblings, 2 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-08-26  8:04 UTC (permalink / raw)
  To: Dan Williams, Dave Jiang, Ira Weiny, Vishal Verma
  Cc: jane.chu, Mike Rapoport, Pasha Tatashin, Tyler Hicks,
	linux-kernel, nvdimm

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

There are use cases, for example virtual machine hosts, that create
"persistent" memory regions using the memmap= option on x86 or dummy
pmem-region device tree nodes on DT-based systems.

Both these options are inflexible because they create static regions: the
layout of the "persistent" memory cannot be adjusted without a reboot, and
sometimes a change even requires a firmware update.

Add a ramdax driver that allows creation of DIMM devices on top of
E820_TYPE_PRAM regions and devicetree pmem-region nodes.

The DIMMs support label space management on the "device" and provide a
flexible way to access RAM using fsdax and devdax.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 drivers/nvdimm/ramdax.c | 281 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 281 insertions(+)
 create mode 100644 drivers/nvdimm/ramdax.c

diff --git a/drivers/nvdimm/ramdax.c b/drivers/nvdimm/ramdax.c
new file mode 100644
index 000000000000..27c5102f600c
--- /dev/null
+++ b/drivers/nvdimm/ramdax.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2015, Mike Rapoport, Microsoft
+ *
+ * Based on e820 pmem driver:
+ * Copyright (c) 2015, Christoph Hellwig.
+ * Copyright (c) 2015, Intel Corporation.
+ */
+#include <linux/platform_device.h>
+#include <linux/memory_hotplug.h>
+#include <linux/libnvdimm.h>
+#include <linux/module.h>
+#include <linux/numa.h>
+#include <linux/slab.h>
+#include <linux/io.h>
+#include <linux/of.h>
+
+#include <uapi/linux/ndctl.h>
+
+#define LABEL_AREA_SIZE	SZ_128K
+
+struct ramdax_dimm {
+	struct nvdimm *nvdimm;
+	void *label_area;
+};
+
+static void ramdax_remove(struct platform_device *pdev)
+{
+	struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
+
+	/* FIXME: cleanup dimm and region devices */
+
+	nvdimm_bus_unregister(nvdimm_bus);
+}
+
+static int ramdax_register_region(struct resource *res,
+				    struct nvdimm *nvdimm,
+				    struct nvdimm_bus *nvdimm_bus)
+{
+	struct nd_mapping_desc mapping;
+	struct nd_region_desc ndr_desc;
+	struct nd_interleave_set *nd_set;
+	int nid = phys_to_target_node(res->start);
+
+	nd_set = kzalloc(sizeof(*nd_set), GFP_KERNEL);
+	if (!nd_set)
+		return -ENOMEM;
+
+	nd_set->cookie1 = 0xcafebeefcafebeef;
+	nd_set->cookie2 = nd_set->cookie1;
+	nd_set->altcookie = nd_set->cookie1;
+
+	memset(&mapping, 0, sizeof(mapping));
+	mapping.nvdimm = nvdimm;
+	mapping.start = 0;
+	mapping.size = resource_size(res) - LABEL_AREA_SIZE;
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	ndr_desc.res = res;
+	ndr_desc.numa_node = numa_map_to_online_node(nid);
+	ndr_desc.target_node = nid;
+	ndr_desc.num_mappings = 1;
+	ndr_desc.mapping = &mapping;
+	ndr_desc.nd_set = nd_set;
+
+	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+		goto err_free_nd_set;
+
+	return 0;
+
+err_free_nd_set:
+	kfree(nd_set);
+	return -ENXIO;
+}
+
+static int ramdax_register_dimm(struct resource *res, void *data)
+{
+	resource_size_t start = res->start;
+	resource_size_t size = resource_size(res);
+	unsigned long flags = 0, cmd_mask = 0;
+	struct nvdimm_bus *nvdimm_bus = data;
+	struct ramdax_dimm *dimm;
+	int err = -ENOMEM;
+
+	dimm = kzalloc(sizeof(*dimm), GFP_KERNEL);
+	if (!dimm)
+		return -ENOMEM;
+
+	dimm->label_area = memremap(start + size - LABEL_AREA_SIZE,
+				    LABEL_AREA_SIZE, MEMREMAP_WB);
+	if (!dimm->label_area)
+		goto err_free_dimm;
+
+	set_bit(NDD_LABELING, &flags);
+	set_bit(NDD_REGISTER_SYNC, &flags);
+	set_bit(ND_CMD_GET_CONFIG_SIZE, &cmd_mask);
+	set_bit(ND_CMD_GET_CONFIG_DATA, &cmd_mask);
+	set_bit(ND_CMD_SET_CONFIG_DATA, &cmd_mask);
+	dimm->nvdimm = nvdimm_create(nvdimm_bus, dimm,
+				     /* dimm_attribute_groups */ NULL,
+				     flags, cmd_mask, 0, NULL);
+	if (!dimm->nvdimm) {
+		err = -ENOMEM;
+		goto err_unmap_label;
+	}
+
+	err = ramdax_register_region(res, dimm->nvdimm, nvdimm_bus);
+	if (err)
+		goto err_remove_nvdimm;
+
+	return 0;
+
+err_remove_nvdimm:
+	nvdimm_delete(dimm->nvdimm);
+err_unmap_label:
+	memunmap(dimm->label_area);
+err_free_dimm:
+	kfree(dimm);
+	return err;
+}
+
+static int ramdax_get_config_size(struct nvdimm *nvdimm, int buf_len,
+				    struct nd_cmd_get_config_size *cmd)
+{
+	if (sizeof(*cmd) > buf_len)
+		return -EINVAL;
+
+	*cmd = (struct nd_cmd_get_config_size){
+		.status = 0,
+		.config_size = LABEL_AREA_SIZE,
+		.max_xfer = 8,
+	};
+
+	return 0;
+}
+
+static int ramdax_get_config_data(struct nvdimm *nvdimm, int buf_len,
+				    struct nd_cmd_get_config_data_hdr *cmd)
+{
+	struct ramdax_dimm *dimm = nvdimm_provider_data(nvdimm);
+
+	if (sizeof(*cmd) > buf_len)
+		return -EINVAL;
+	if (struct_size(cmd, out_buf, cmd->in_length) > buf_len)
+		return -EINVAL;
+	if ((u64)cmd->in_offset + cmd->in_length > LABEL_AREA_SIZE)
+		return -EINVAL;
+
+	memcpy(cmd->out_buf, dimm->label_area + cmd->in_offset, cmd->in_length);
+
+	return 0;
+}
+
+static int ramdax_set_config_data(struct nvdimm *nvdimm, int buf_len,
+				    struct nd_cmd_set_config_hdr *cmd)
+{
+	struct ramdax_dimm *dimm = nvdimm_provider_data(nvdimm);
+
+	if (sizeof(*cmd) > buf_len)
+		return -EINVAL;
+	if (struct_size(cmd, in_buf, cmd->in_length) > buf_len)
+		return -EINVAL;
+	if ((u64)cmd->in_offset + cmd->in_length > LABEL_AREA_SIZE)
+		return -EINVAL;
+
+	memcpy(dimm->label_area + cmd->in_offset, cmd->in_buf, cmd->in_length);
+
+	return 0;
+}
+
+static int ramdax_nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd,
+			       void *buf, unsigned int buf_len)
+{
+	unsigned long cmd_mask = nvdimm_cmd_mask(nvdimm);
+
+	if (!test_bit(cmd, &cmd_mask))
+		return -ENOTTY;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE:
+		return ramdax_get_config_size(nvdimm, buf_len, buf);
+	case ND_CMD_GET_CONFIG_DATA:
+		return ramdax_get_config_data(nvdimm, buf_len, buf);
+	case ND_CMD_SET_CONFIG_DATA:
+		return ramdax_set_config_data(nvdimm, buf_len, buf);
+	default:
+		return -ENOTTY;
+	}
+}
+
+static int ramdax_ctl(struct nvdimm_bus_descriptor *nd_desc,
+			 struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+			 unsigned int buf_len, int *cmd_rc)
+{
+	/*
+	 * No firmware response to translate, let the transport error
+	 * code take precedence.
+	 */
+	*cmd_rc = 0;
+
+	if (!nvdimm)
+		return -ENOTTY;
+	return ramdax_nvdimm_ctl(nvdimm, cmd, buf, buf_len);
+}
+
+static int ramdax_probe_of(struct platform_device *pdev,
+			     struct nvdimm_bus *bus, struct device_node *np)
+{
+	int err;
+
+	for (int i = 0; i < pdev->num_resources; i++) {
+		err = ramdax_register_dimm(&pdev->resource[i], bus);
+		if (err)
+			goto err_unregister;
+	}
+
+	return 0;
+
+err_unregister:
+	/*
+	 * FIXME: should we unregister the dimms that were registered
+	 * successfully
+	 */
+	return err;
+}
+
+static int ramdax_probe(struct platform_device *pdev)
+{
+	static struct nvdimm_bus_descriptor nd_desc;
+	struct device *dev = &pdev->dev;
+	struct nvdimm_bus *nvdimm_bus;
+	struct device_node *np;
+	int rc = -ENXIO;
+
+	nd_desc.provider_name = "ramdax";
+	nd_desc.module = THIS_MODULE;
+	nd_desc.ndctl = ramdax_ctl;
+	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
+	if (!nvdimm_bus)
+		goto err;
+
+	np = dev_of_node(&pdev->dev);
+	if (np)
+		rc = ramdax_probe_of(pdev, nvdimm_bus, np);
+	else
+		rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
+					 IORESOURCE_MEM, 0, -1, nvdimm_bus,
+					 ramdax_register_dimm);
+	if (rc)
+		goto err;
+
+	platform_set_drvdata(pdev, nvdimm_bus);
+
+	return 0;
+err:
+	nvdimm_bus_unregister(nvdimm_bus);
+	return rc;
+}
+
+#ifdef CONFIG_OF
+static const struct of_device_id ramdax_of_matches[] = {
+	{ .compatible = "pmem-region", },
+	{ },
+};
+MODULE_DEVICE_TABLE(of, ramdax_of_matches);
+#endif
+
+static struct platform_driver ramdax_driver = {
+	.probe = ramdax_probe,
+	.remove = ramdax_remove,
+	.driver = {
+		.name = "e820_pmem",
+		.of_match_table = of_match_ptr(ramdax_of_matches),
+	},
+};
+
+module_platform_driver(ramdax_driver);
+
+MODULE_DESCRIPTION("NVDIMM support for e820 type-12 memory and OF pmem-region");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsoft Corporation");
-- 
2.50.1



* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-08-26  8:04 ` [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices Mike Rapoport
@ 2025-08-29  0:47   ` Ira Weiny
  2025-08-29  7:57     ` Mike Rapoport
  2025-09-24  1:08   ` dan.j.williams
  1 sibling, 1 reply; 20+ messages in thread
From: Ira Weiny @ 2025-08-29  0:47 UTC (permalink / raw)
  To: Mike Rapoport, Dan Williams, Dave Jiang, Ira Weiny, Vishal Verma,
	Michal Clapinski
  Cc: jane.chu, Mike Rapoport, Pasha Tatashin, Tyler Hicks,
	linux-kernel, nvdimm

+ Michal

Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> There are use cases, for example virtual machine hosts, that create
> "persistent" memory regions using memmap= option on x86 or dummy
> pmem-region device tree nodes on DT based systems.
> 
> Both these options are inflexible because they create static regions and
> the layout of the "persistent" memory cannot be adjusted without reboot
> and sometimes they even require firmware update.
> 
> Add a ramdax driver that allows creation of DIMM devices on top of
> E820_TYPE_PRAM regions and devicetree pmem-region nodes.

While I recognize this driver and the e820 driver are mutually
exclusive [1][2], I do wonder if the use cases are the same.

From a high level I don't like the idea of adding kernel parameters.  So
if this could solve Michal's problem I'm inclined to go this direction.

Ira

[1] https://lore.kernel.org/all/aExQ7nSejklEeVn0@kernel.org/
[2] https://lore.kernel.org/all/20250612114210.2786075-1-mclapinski@google.com/



* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-08-29  0:47   ` Ira Weiny
@ 2025-08-29  7:57     ` Mike Rapoport
  2025-09-01 16:01       ` Michał Cłapiński
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2025-08-29  7:57 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Dave Jiang, Vishal Verma, Michal Clapinski,
	jane.chu, Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Hi Ira,

On Thu, Aug 28, 2025 at 07:47:31PM -0500, Ira Weiny wrote:
> + Michal
> 
> Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > There are use cases, for example virtual machine hosts, that create
> > "persistent" memory regions using memmap= option on x86 or dummy
> > pmem-region device tree nodes on DT based systems.
> > 
> > Both these options are inflexible because they create static regions and
> > the layout of the "persistent" memory cannot be adjusted without reboot
> > and sometimes they even require firmware update.
> > 
> > Add a ramdax driver that allows creation of DIMM devices on top of
> > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> 
> While I recognize this driver and the e820 driver are mutually
> exclusive[1][2].  I do wonder if the use cases are the same?

They are mutually exclusive in the sense that they cannot be loaded
together, so I had this in Kconfig in the RFC posting:

config RAMDAX
	tristate "Support persistent memory interfaces on RAM carveouts"
	depends on OF || (X86 && X86_PMEM_LEGACY=n)

(somehow my rebase lost Makefile and Kconfig changes :( )

As Pasha said in the other thread [1], the use cases are different. My goal
is to achieve flexibility in managing carved out "PMEM" regions, while
Michal's patches aim to optimize boot time by autoconfiguring multiple PMEM
regions in the kernel without upcalls to ndctl.
 
> From a high level I don't like the idea of adding kernel parameters.  So
> if this could solve Michal's problem I'm inclined to go this direction.

I think it could help with optimizing the reboot times. On the first boot
the PMEM is partitioned using ndctl, and the partitioning then remains in
place, so that on subsequent reboots the kernel recreates the dax devices
without upcalls to userspace.

[1] https://lore.kernel.org/all/CA+CK2bAPJR00j3eFZtF7WgvgXuqmmOtqjc8xO70bGyQUSKTKGg@mail.gmail.com/

-- 
Sincerely yours,
Mike.


* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-08-29  7:57     ` Mike Rapoport
@ 2025-09-01 16:01       ` Michał Cłapiński
  2025-09-02 15:35         ` Mike Rapoport
  2025-09-24  1:16         ` dan.j.williams
  0 siblings, 2 replies; 20+ messages in thread
From: Michał Cłapiński @ 2025-09-01 16:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Ira Weiny, Dan Williams, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Fri, Aug 29, 2025 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Ira,
>
> On Thu, Aug 28, 2025 at 07:47:31PM -0500, Ira Weiny wrote:
> > + Michal
> >
> > Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > >
> > > There are use cases, for example virtual machine hosts, that create
> > > "persistent" memory regions using memmap= option on x86 or dummy
> > > pmem-region device tree nodes on DT based systems.
> > >
> > > Both these options are inflexible because they create static regions and
> > > the layout of the "persistent" memory cannot be adjusted without reboot
> > > and sometimes they even require firmware update.
> > >
> > > Add a ramdax driver that allows creation of DIMM devices on top of
> > > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> >
> > While I recognize this driver and the e820 driver are mutually
> > exclusive[1][2].  I do wonder if the use cases are the same?
>
> They are mutually exclusive in the sense that they cannot be loaded
> together so I had this in Kconfig in RFC posting
>
> config RAMDAX
>         tristate "Support persistent memory interfaces on RAM carveouts"
>         depends on OF || (X86 && X86_PMEM_LEGACY=n)
>
> (somehow my rebase lost Makefile and Kconfig changes :( )
>
> As Pasha said in the other thread [1] the use-cases are different. My goal
> is to achieve flexibility in managing carved out "PMEM" regions and
> Michal's patches aim to optimize boot time by autoconfiguring multiple PMEM
> regions in the kernel without upcalls to ndctl.
>
> > From a high level I don't like the idea of adding kernel parameters.  So
> > if this could solve Michal's problem I'm inclined to go this direction.
>
> I think it could help with optimizing the reboot times. On the first boot
> the PMEM is partitioned using ndctl and then the partitioning remains there
> so that on subsequent reboots kernel recreates dax devices without upcalls
> to userspace.

Using this patch, if I want to divide 500GB of memory into 1GB chunks,
the last 128kB of every chunk would be taken by the label, right?

My patch disables labels, so we can divide the memory into 1GB chunks
without any losses and they all remain aligned to the 1GB boundary. I
think this is necessary for vmemmap dax optimization.

> [1] https://lore.kernel.org/all/CA+CK2bAPJR00j3eFZtF7WgvgXuqmmOtqjc8xO70bGyQUSKTKGg@mail.gmail.com/


* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-01 16:01       ` Michał Cłapiński
@ 2025-09-02 15:35         ` Mike Rapoport
  2025-09-24  1:16         ` dan.j.williams
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-09-02 15:35 UTC (permalink / raw)
  To: Michał Cłapiński
  Cc: Ira Weiny, Dan Williams, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Hi Michał,

On Mon, Sep 01, 2025 at 06:01:25PM +0200, Michał Cłapiński wrote:
> On Fri, Aug 29, 2025 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > Hi Ira,
> >
> > On Thu, Aug 28, 2025 at 07:47:31PM -0500, Ira Weiny wrote:
> > > + Michal
> > >
> > > Mike Rapoport wrote:
> > > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > >
> > > > There are use cases, for example virtual machine hosts, that create
> > > > "persistent" memory regions using memmap= option on x86 or dummy
> > > > pmem-region device tree nodes on DT based systems.
> > > >
> > > > Both these options are inflexible because they create static regions and
> > > > the layout of the "persistent" memory cannot be adjusted without reboot
> > > > and sometimes they even require firmware update.
> > > >
> > > > Add a ramdax driver that allows creation of DIMM devices on top of
> > > > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> > >
> > > While I recognize this driver and the e820 driver are mutually
> > > exclusive[1][2].  I do wonder if the use cases are the same?
> >
> > They are mutually exclusive in the sense that they cannot be loaded
> > together so I had this in Kconfig in RFC posting
> >
> > config RAMDAX
> >         tristate "Support persistent memory interfaces on RAM carveouts"
> >         depends on OF || (X86 && X86_PMEM_LEGACY=n)
> >
> > (somehow my rebase lost Makefile and Kconfig changes :( )
> >
> > As Pasha said in the other thread [1] the use-cases are different. My goal
> > is to achieve flexibility in managing carved out "PMEM" regions and
> > Michal's patches aim to optimize boot time by autoconfiguring multiple PMEM
> > regions in the kernel without upcalls to ndctl.
> >
> > > From a high level I don't like the idea of adding kernel parameters.  So
> > > if this could solve Michal's problem I'm inclined to go this direction.
> >
> > I think it could help with optimizing the reboot times. On the first boot
> > the PMEM is partitioned using ndctl and then the partitioning remains there
> > so that on subsequent reboots kernel recreates dax devices without upcalls
> > to userspace.
> 
> Using this patch, if I want to divide 500GB of memory into 1GB chunks,
> the last 128kB of every chunk would be taken by the label, right?

No, there will be a single 128kB namespace label area at the end of the
500GB region. It's easy to add an option to put this area at the beginning.
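
To make the arithmetic concrete, here is a small userspace sketch (helper
names are made up) that counts how many whole chunks fit once a single
trailing label area is set aside. It treats "500GB" and "1GB" as binary
units, assumes the region starts on a chunk boundary, and ignores any
per-namespace metadata.

```c
#include <stdint.h>

#define KIB 1024ULL
#define GIB (1024ULL * 1024ULL * 1024ULL)
#define LABEL_AREA_SIZE (128ULL * KIB)	/* one label area per region */

/* Bytes left for namespaces once the trailing label area is removed. */
static uint64_t usable_bytes(uint64_t region_size)
{
	return region_size > LABEL_AREA_SIZE ? region_size - LABEL_AREA_SIZE : 0;
}

/* Number of whole chunks of chunk_size that fit in the usable part. */
static uint64_t full_chunks(uint64_t region_size, uint64_t chunk_size)
{
	return chunk_size ? usable_bytes(region_size) / chunk_size : 0;
}

/* Bytes left over after the last whole chunk. */
static uint64_t tail_bytes(uint64_t region_size, uint64_t chunk_size)
{
	return chunk_size ? usable_bytes(region_size) % chunk_size : 0;
}
```

For a 500 GiB region and 1 GiB chunks this gives 499 whole chunks plus a
tail of 1 GiB minus 128 KiB: the label area shortens only the last chunk
rather than costing 128 KiB in every chunk.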

Using a DIMM device with namespace labels instead of a region device for
e820 memory allows partitioning a single memmap= region, and it is similar
to patch 1 in your series.

> My patch disables labels, so we can divide the memory into 1GB chunks
> without any losses and they all remain aligned to the 1GB boundary. I
> think this is necessary for vmemmap dax optimization.
 
My understanding is that you mean the info-block reserved in each devdax
device, and AFAIU it's different from namespace labels.

My patch does not deal with it, but I believe it can also be addressed
with a small "on device" structure outside the actual "partitions".

> > [1] https://lore.kernel.org/all/CA+CK2bAPJR00j3eFZtF7WgvgXuqmmOtqjc8xO70bGyQUSKTKGg@mail.gmail.com/

-- 
Sincerely yours,
Mike.


* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-08-26  8:04 ` [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices Mike Rapoport
  2025-08-29  0:47   ` Ira Weiny
@ 2025-09-24  1:08   ` dan.j.williams
  2025-09-30 10:11     ` Mike Rapoport
  1 sibling, 1 reply; 20+ messages in thread
From: dan.j.williams @ 2025-09-24  1:08 UTC (permalink / raw)
  To: Mike Rapoport, Dan Williams, Dave Jiang, Ira Weiny, Vishal Verma
  Cc: jane.chu, Mike Rapoport, Pasha Tatashin, Tyler Hicks,
	linux-kernel, nvdimm

Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> There are use cases, for example virtual machine hosts, that create
> "persistent" memory regions using memmap= option on x86 or dummy
> pmem-region device tree nodes on DT based systems.
> 
> Both these options are inflexible because they create static regions and
> the layout of the "persistent" memory cannot be adjusted without reboot
> and sometimes they even require firmware update.
> 
> Add a ramdax driver that allows creation of DIMM devices on top of
> E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> 
> The DIMMs support label space management on the "device" and provide a
> flexible way to access RAM using fsdax and devdax.

Hi Mike, I like this. Some questions below:

> +static struct platform_driver ramdax_driver = {
> +	.probe = ramdax_probe,
> +	.remove = ramdax_remove,
> +	.driver = {
> +		.name = "e820_pmem",
> +		.of_match_table = of_match_ptr(ramdax_of_matches),

So this driver collides with both e820_pmem and of_pmem, but I think it
would be useful to have both options (with/without labels) available and
not require disabling both those other drivers at compile time.

'struct pci_device_id' has this useful "override_only" flag to require
that the only driver that attaches is one that is explicitly requested
(see pci_match_device()).

Now, admittedly platform_match() is a bit more complicated in that it
matches 3 different platform device id types, but I think the ability to
opt-in to this turns this from a "cloud-host-provider-only" config
option to something distro kernels can enable by default.
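
For readers unfamiliar with the PCI behaviour referred to above, the sketch
below is a simplified userspace model of "override_only" matching, loosely
following what pci_match_device() does. All types and names here are
hypothetical (including "ramdax" as a driver name); this is not an existing
platform bus API and not a proposal for the exact implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified id-table entry; not a real kernel structure. */
struct sketch_id {
	const char *name;	/* device name matched; NULL ends the table */
	bool override_only;	/* bind only when explicitly requested */
};

struct sketch_driver {
	const char *name;
	const struct sketch_id *id_table;
};

struct sketch_device {
	const char *name;
	const char *driver_override;	/* NULL unless a driver was requested */
};

static bool sketch_match(const struct sketch_device *dev,
			 const struct sketch_driver *drv)
{
	const struct sketch_id *id;

	/* An explicit request for another driver excludes this one. */
	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
		return false;

	for (id = drv->id_table; id->name; id++) {
		if (strcmp(id->name, dev->name))
			continue;
		/* override_only entries need an explicit request to bind. */
		if (!id->override_only || dev->driver_override)
			return true;
	}

	/* An explicit request for this driver always matches. */
	return dev->driver_override != NULL;
}
```

In this model a legacy driver with a plain "e820_pmem" entry binds by
default, while a driver whose entry sets override_only binds only after
driver_override names it, so both could be built into the same kernel.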


* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-01 16:01       ` Michał Cłapiński
  2025-09-02 15:35         ` Mike Rapoport
@ 2025-09-24  1:16         ` dan.j.williams
  2025-09-26 12:47           ` Michał Cłapiński
  1 sibling, 1 reply; 20+ messages in thread
From: dan.j.williams @ 2025-09-24  1:16 UTC (permalink / raw)
  To: Michał Cłapiński, Mike Rapoport
  Cc: Ira Weiny, Dan Williams, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
> On Fri, Aug 29, 2025 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > Hi Ira,
> >
> > On Thu, Aug 28, 2025 at 07:47:31PM -0500, Ira Weiny wrote:
> > > + Michal
> > >
> > > Mike Rapoport wrote:
> > > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > >
> > > > There are use cases, for example virtual machine hosts, that create
> > > > "persistent" memory regions using memmap= option on x86 or dummy
> > > > pmem-region device tree nodes on DT based systems.
> > > >
> > > > Both these options are inflexible because they create static regions and
> > > > the layout of the "persistent" memory cannot be adjusted without reboot
> > > > and sometimes they even require firmware update.
> > > >
> > > > Add a ramdax driver that allows creation of DIMM devices on top of
> > > > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> > >
> > > While I recognize this driver and the e820 driver are mutually
> > > exclusive[1][2].  I do wonder if the use cases are the same?
> >
> > They are mutually exclusive in the sense that they cannot be loaded
> > together so I had this in Kconfig in RFC posting
> >
> > config RAMDAX
> >         tristate "Support persistent memory interfaces on RAM carveouts"
> >         depends on OF || (X86 && X86_PMEM_LEGACY=n)
> >
> > (somehow my rebase lost Makefile and Kconfig changes :( )
> >
> > As Pasha said in the other thread [1] the use-cases are different. My goal
> > is to achieve flexibility in managing carved out "PMEM" regions and
> > Michal's patches aim to optimize boot time by autoconfiguring multiple PMEM
> > regions in the kernel without upcalls to ndctl.
> >
> > > From a high level I don't like the idea of adding kernel parameters.  So
> > > if this could solve Michal's problem I'm inclined to go this direction.
> >
> > I think it could help with optimizing the reboot times. On the first boot
> > the PMEM is partitioned using ndctl and then the partitioning remains there
> > so that on subsequent reboots kernel recreates dax devices without upcalls
> > to userspace.
> 
> Using this patch, if I want to divide 500GB of memory into 1GB chunks,
> the last 128kB of every chunk would be taken by the label, right?
> 
> My patch disables labels, so we can divide the memory into 1GB chunks
> without any losses and they all remain aligned to the 1GB boundary. I
> think this is necessary for vmemmap dax optimization.

As Mike says you would lose 128K at the end, but that indeed becomes
losing that 1GB given alignment constraints.
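
To make the alignment arithmetic concrete, here is a minimal sketch in C. It assumes the label area occupies the last 128K of the carveout and that chunks must be naturally aligned; the helper name and macros are illustrative and not taken from the driver.

```c
#include <stdint.h>

#define SZ_128K (128ULL << 10)
#define SZ_1G   (1ULL << 30)

/*
 * Count how many whole, naturally aligned chunks of @align bytes fit in a
 * carveout of @region_size bytes starting at @start, when the last
 * @label_size bytes are reserved for the label area.
 * Hypothetical helper for illustration; @align must be a power of two.
 */
uint64_t aligned_chunks(uint64_t start, uint64_t region_size,
			uint64_t label_size, uint64_t align)
{
	uint64_t data_end, first;

	if (label_size >= region_size)
		return 0;

	data_end = start + region_size - label_size;
	first = (start + align - 1) & ~(align - 1);	/* round start up */
	if (first >= data_end)
		return 0;

	return (data_end - first) / align;
}
```

For a carveout starting at 4G, a 500G region with the label at the end yields 499 aligned 1G chunks, while a region of 500G + 128K yields 500.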

However, I think that could be solved by just separately vmalloc'ing the
label space for this. Then instead of kernel parameters to sub-divide a
region, you just have an initramfs script to do the same.

Does that meet your needs?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-24  1:16         ` dan.j.williams
@ 2025-09-26 12:47           ` Michał Cłapiński
  2025-09-26 18:45             ` dan.j.williams
  0 siblings, 1 reply; 20+ messages in thread
From: Michał Cłapiński @ 2025-09-26 12:47 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Wed, Sep 24, 2025 at 3:16 AM <dan.j.williams@intel.com> wrote:
>
> Michał Cłapiński wrote:
> > On Fri, Aug 29, 2025 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > Hi Ira,
> > >
> > > On Thu, Aug 28, 2025 at 07:47:31PM -0500, Ira Weiny wrote:
> > > > + Michal
> > > >
> > > > Mike Rapoport wrote:
> > > > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > > >
> > > > > There are use cases, for example virtual machine hosts, that create
> > > > > "persistent" memory regions using memmap= option on x86 or dummy
> > > > > pmem-region device tree nodes on DT based systems.
> > > > >
> > > > > Both these options are inflexible because they create static regions and
> > > > > the layout of the "persistent" memory cannot be adjusted without reboot
> > > > > and sometimes they even require firmware update.
> > > > >
> > > > > Add a ramdax driver that allows creation of DIMM devices on top of
> > > > > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> > > >
> > > > While I recognize this driver and the e820 driver are mutually
> > > > exclusive[1][2].  I do wonder if the use cases are the same?
> > >
> > > They are mutually exclusive in the sense that they cannot be loaded
> > > together so I had this in Kconfig in RFC posting
> > >
> > > config RAMDAX
> > >         tristate "Support persistent memory interfaces on RAM carveouts"
> > >         depends on OF || (X86 && X86_PMEM_LEGACY=n)
> > >
> > > (somehow my rebase lost Makefile and Kconfig changes :( )
> > >
> > > As Pasha said in the other thread [1] the use-cases are different. My goal
> > > is to achieve flexibility in managing carved out "PMEM" regions and
> > > Michal's patches aim to optimize boot time by autoconfiguring multiple PMEM
> > > regions in the kernel without upcalls to ndctl.
> > >
> > > > From a high level I don't like the idea of adding kernel parameters.  So
> > > > if this could solve Michal's problem I'm inclined to go this direction.
> > >
> > > I think it could help with optimizing the reboot times. On the first boot
> > > the PMEM is partitioned using ndctl and then the partitioning remains there
> > > so that on subsequent reboots kernel recreates dax devices without upcalls
> > > to userspace.
> >
> > Using this patch, if I want to divide 500GB of memory into 1GB chunks,
> > the last 128kB of every chunk would be taken by the label, right?
> >
> > My patch disables labels, so we can divide the memory into 1GB chunks
> > without any losses and they all remain aligned to the 1GB boundary. I
> > think this is necessary for vmemmap dax optimization.
>
> As Mike says you would lose 128K at the end, but that indeed becomes
> losing that 1GB given alignment constraints.
>
> However, I think that could be solved by just separately vmalloc'ing the
> label space for this. Then instead of kernel parameters to sub-divide a
> region, you just have an initramfs script to do the same.
>
> Does that meet your needs?

Sorry, I'm having trouble imagining this.
If I wanted 500 1GB chunks, I would request a region of 500GB+space
for the label? Or is that a label and info-blocks?
Then on each boot the kernel would check if there is an actual
label/info-blocks in that space and if yes, it would recreate my
devices (including the fsdax/devdax type)?

One of the requirements for live update is that the kexec reboot has
to be fast. My solution introduced a delay of tens of milliseconds
since the actual device creation is asynchronous. Manually dividing a
region into thousands of devices from userspace would be very slow but
I would have to do that only on the first boot, right?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-26 12:47           ` Michał Cłapiński
@ 2025-09-26 18:45             ` dan.j.williams
  2025-09-30 10:15               ` Mike Rapoport
  2025-10-01 14:14               ` Michał Cłapiński
  0 siblings, 2 replies; 20+ messages in thread
From: dan.j.williams @ 2025-09-26 18:45 UTC (permalink / raw)
  To: Michał Cłapiński, dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
[..]
> > As Mike says you would lose 128K at the end, but that indeed becomes
> > losing that 1GB given alignment constraints.
> >
> > However, I think that could be solved by just separately vmalloc'ing the
> > label space for this. Then instead of kernel parameters to sub-divide a
> > region, you just have an initramfs script to do the same.
> >
> > Does that meet your needs?
> 
> Sorry, I'm having trouble imagining this.
> If I wanted 500 1GB chunks, I would request a region of 500GB+space
> for the label? Or is that a label and info-blocks?

You would specify an memmap= range of 500GB+128K*.

Force attach that range to Mike's RAMDAX driver.

[ modprobe -r nd_e820, don't build nd_e820, or modprobe policy blocks nd_e820 ]
echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
echo e820_pmem > /sys/bus/platform/drivers/ramdax/bind

* forget what I said about vmalloc() previously, not needed
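
For an initramfs helper that avoids a shell, the same force-attach sequence could look roughly like the sketch below. It assumes the driver ends up registered as "ramdax", that the platform device is named "e820_pmem", and that the standard driver_override and bind sysfs attributes are present; the function names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write @val to the file at @path with one write(); returns 0 or -errno. */
int write_attr(const char *path, const char *val)
{
	size_t len = strlen(val);
	ssize_t n;
	int fd, ret = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;	/* short write: treat as failure */

	close(fd);
	return ret;
}

/*
 * Steer the e820_pmem platform device to the ramdax driver, then ask the
 * driver to bind it. Both names are assumptions taken from this thread.
 */
int ramdax_force_bind(void)
{
	int ret;

	ret = write_attr("/sys/bus/platform/devices/e820_pmem/driver_override",
			 "ramdax");
	if (ret)
		return ret;

	return write_attr("/sys/bus/platform/drivers/ramdax/bind", "e820_pmem");
}
```

ramdax_force_bind() returns 0 on success or a negative errno, for example -ENOENT when the device or driver is not present.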

> Then on each boot the kernel would check if there is an actual
> label/info-blocks in that space and if yes, it would recreate my
> devices (including the fsdax/devdax type)?

Right, if that range is persistent the kernel would automatically parse
the label space each boot and divide up the 500GB region space into
namespaces.

128K of label spaces gives you 509 potential namespaces.

> One of the requirements for live update is that the kexec reboot has
> to be fast. My solution introduced a delay of tens of milliseconds
> since the actual device creation is asynchronous. Manually dividing a
> region into thousands of devices from userspace would be very slow but

Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?

> I would have to do that only on the first boot, right?

Yes, the expectation is to incur that overhead only once. It also allows
VMs to look up their capacity by name, so you do not need a separate
mapping of 1GB Namespace blocks to VMs. Just give some VMs bigger
Namespaces than others by name.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-24  1:08   ` dan.j.williams
@ 2025-09-30 10:11     ` Mike Rapoport
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-09-30 10:11 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Dave Jiang, Ira Weiny, Vishal Verma, jane.chu, Pasha Tatashin,
	Tyler Hicks, linux-kernel, nvdimm

On Tue, Sep 23, 2025 at 06:08:24PM -0700, dan.j.williams@intel.com wrote:
> Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > There are use cases, for example virtual machine hosts, that create
> > "persistent" memory regions using memmap= option on x86 or dummy
> > pmem-region device tree nodes on DT based systems.
> > 
> > Both these options are inflexible because they create static regions and
> > the layout of the "persistent" memory cannot be adjusted without reboot
> > and sometimes they even require firmware update.
> > 
> > Add a ramdax driver that allows creation of DIMM devices on top of
> > E820_TYPE_PRAM regions and devicetree pmem-region nodes.
> > 
> > The DIMMs support label space management on the "device" and provide a
> > flexible way to access RAM using fsdax and devdax.
> 
> Hi Mike, I like this. Some questions below:
> 
> > +static struct platform_driver ramdax_driver = {
> > +	.probe = ramdax_probe,
> > +	.remove = ramdax_remove,
> > +	.driver = {
> > +		.name = "e820_pmem",
> > +		.of_match_table = of_match_ptr(ramdax_of_matches),
> 
> So this driver collides with both e820_pmem and of_pmem, but I think it
> would be useful to have both options (with/without labels) available and
> not require disabling both those other drivers at compile time.
> 
> 'struct pci_device_id' has this useful "override_only" flag to require
> that the only driver that attaches is one that is explicitly requested
> (see pci_match_device()).
> 
> Now, admittedly platform_match() is a bit more complicated in that it
> matches 3 different platform device id types, but I think the ability to
> opt-in to this turns this from a "cloud-host-provider-only" config
> option to something distro kernels can enable by default.

It looks like /sys/bus/platform/devices/e820_pmem/driver_override does the
trick.

I'll make the driver use "ramdax" as the name and rely on
driver_override for binding it to a device.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-26 18:45             ` dan.j.williams
@ 2025-09-30 10:15               ` Mike Rapoport
  2025-10-01 14:14               ` Michał Cłapiński
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-09-30 10:15 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Michał Cłapiński, Ira Weiny, Dave Jiang,
	Vishal Verma, jane.chu, Pasha Tatashin, Tyler Hicks, linux-kernel,
	nvdimm

On Fri, Sep 26, 2025 at 11:45:19AM -0700, dan.j.williams@intel.com wrote:
> Michał Cłapiński wrote:
> [..]
> > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > losing that 1GB given alignment constraints.
> > >
> > > However, I think that could be solved by just separately vmalloc'ing the
> > > label space for this. Then instead of kernel parameters to sub-divide a
> > > region, you just have an initramfs script to do the same.
> > >
> > > Does that meet your needs?
> > 
> > Sorry, I'm having trouble imagining this.
> > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > for the label? Or is that a label and info-blocks?
> 
> You would specify an memmap= range of 500GB+128K*.
> 
> Force attach that range to Mike's RAMDAX driver.
> 
> [ modprobe -r nd_e820, don't build nd_820, or modprobe policy blocks nd_e820 ]
> echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> echo e820_pmem > /sys/bus/platform/drivers/ramdax
> 
> * forget what I said about vmalloc() previously, not needed
> 
> > Then on each boot the kernel would check if there is an actual
> > label/info-blocks in that space and if yes, it would recreate my
> > devices (including the fsdax/devdax type)?
> 
> Right, if that range is persistent the kernel would automatically parse
> the label space each boot and divide up the 500GB region space into
> namespaces.
> 
> 128K of label spaces gives you 509 potential namespaces.

I was also thinking that the label area can be put either in the end or in
the beginning of the memmap= range, so that if you specify memmap=<1G
aligned address - 128K> the actual space will be 1G aligned.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-09-26 18:45             ` dan.j.williams
  2025-09-30 10:15               ` Mike Rapoport
@ 2025-10-01 14:14               ` Michał Cłapiński
  2025-10-01 22:28                 ` dan.j.williams
  1 sibling, 1 reply; 20+ messages in thread
From: Michał Cłapiński @ 2025-10-01 14:14 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@intel.com> wrote:
>
> Michał Cłapiński wrote:
> [..]
> > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > losing that 1GB given alignment constraints.
> > >
> > > However, I think that could be solved by just separately vmalloc'ing the
> > > label space for this. Then instead of kernel parameters to sub-divide a
> > > region, you just have an initramfs script to do the same.
> > >
> > > Does that meet your needs?
> >
> > Sorry, I'm having trouble imagining this.
> > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > for the label? Or is that a label and info-blocks?
>
> You would specify an memmap= range of 500GB+128K*.
>
> Force attach that range to Mike's RAMDAX driver.
>
> [ modprobe -r nd_e820, don't build nd_820, or modprobe policy blocks nd_e820 ]
> echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> echo e820_pmem > /sys/bus/platform/drivers/ramdax
>
> * forget what I said about vmalloc() previously, not needed
>
> > Then on each boot the kernel would check if there is an actual
> > label/info-blocks in that space and if yes, it would recreate my
> > devices (including the fsdax/devdax type)?
>
> Right, if that range is persistent the kernel would automatically parse
> the label space each boot and divide up the 500GB region space into
> namespaces.
>
> 128K of label spaces gives you 509 potential namespaces.

That's not enough for us. We would need ~1 order of magnitude more.
Sorry I'm being vague about this but I can't discuss the actual
machine sizes.

> > One of the requirements for live update is that the kexec reboot has
> > to be fast. My solution introduced a delay of tens of milliseconds
> > since the actual device creation is asynchronous. Manually dividing a
> > region into thousands of devices from userspace would be very slow but
>
> Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?

I was talking about devices and AFAIK 1 namespace equals 5 devices for
us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though the
device creation is asynchronous so I guess the actual device count is
not important.

> > I would have to do that only on the first boot, right?
>
> Yes, the expectation is only incur that overhead once. It also allows
> for VMs to be able to lookup their capacity by name. So you do not need
> a separate mapping of 1GB Namepsace blocks to VMs. Just give some VMs
> bigger Namespaces than others by name.

Sure, I can do that at first. But after some time fragmentation will
happen, right? At some point I will have to give VMs a bunch of
smaller namespaces here and there.

Btw. one more thing I don't understand. Why are maintainers so much
against adding new kernel parameters?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-10-01 14:14               ` Michał Cłapiński
@ 2025-10-01 22:28                 ` dan.j.williams
  2025-12-09 20:10                   ` Michał Cłapiński
  0 siblings, 1 reply; 20+ messages in thread
From: dan.j.williams @ 2025-10-01 22:28 UTC (permalink / raw)
  To: Michał Cłapiński, dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
> On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@intel.com> wrote:
> >
> > Michał Cłapiński wrote:
> > [..]
> > > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > > losing that 1GB given alignment constraints.
> > > >
> > > > However, I think that could be solved by just separately vmalloc'ing the
> > > > label space for this. Then instead of kernel parameters to sub-divide a
> > > > region, you just have an initramfs script to do the same.
> > > >
> > > > Does that meet your needs?
> > >
> > > Sorry, I'm having trouble imagining this.
> > > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > > for the label? Or is that a label and info-blocks?
> >
> > You would specify an memmap= range of 500GB+128K*.
> >
> > Force attach that range to Mike's RAMDAX driver.
> >
> > [ modprobe -r nd_e820, don't build nd_820, or modprobe policy blocks nd_e820 ]
> > echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> > echo e820_pmem > /sys/bus/platform/drivers/ramdax
> >
> > * forget what I said about vmalloc() previously, not needed
> >
> > > Then on each boot the kernel would check if there is an actual
> > > label/info-blocks in that space and if yes, it would recreate my
> > > devices (including the fsdax/devdax type)?
> >
> > Right, if that range is persistent the kernel would automatically parse
> > the label space each boot and divide up the 500GB region space into
> > namespaces.
> >
> > 128K of label spaces gives you 509 potential namespaces.
> 
> That's not enough for us. We would need ~1 order of magnitude more.
> Sorry I'm being vague about this but I can't discuss the actual
> machine sizes.

Sure, then make it 1280K of label space. There's no practical limit in
the implementation.

> > > One of the requirements for live update is that the kexec reboot has
> > > to be fast. My solution introduced a delay of tens of milliseconds
> > > since the actual device creation is asynchronous. Manually dividing a
> > > region into thousands of devices from userspace would be very slow but
> >
> > Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?
> 
> I was talking about devices and AFAIK 1 namespace equals 5 devices for
> us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though the
> device creation is asynchronous so I guess the actual device count is
> not important.

I do not see how it is relevant. You also get 1000s of devices with
plain memory block devices.

> > > I would have to do that only on the first boot, right?
> >
> > Yes, the expectation is only incur that overhead once. It also allows
> > for VMs to be able to lookup their capacity by name. So you do not need
> > a separate mapping of 1GB Namepsace blocks to VMs. Just give some VMs
> > bigger Namespaces than others by name.
> 
> Sure, I can do that at first. But after some time fragmentation will
> happen, right?

Why would fragmentation be more of a problem with labels vs command
line if the expectation is maintaining a persistent namespace layout
over time?

> At some point I will have to give VMs a bunch of smaller namespaces
> here and there.
> 
> Btw. one more thing I don't understand. Why are maintainers so much
> against adding new kernel parameters?

This label code is already written, and it is less of a burden to
maintain a new use of existing code than a new mechanism for a niche use
case. Also, memmap= has long been a footgun; making that problem worse
for questionable benefit to the wider Linux project does not feel like
the right tradeoff.

The other alternative to labels is ACPI NFIT table injection. Again the
tradeoff is that it is just another reuse of an existing, well-worn
mechanism for delineating PMEM.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-10-01 22:28                 ` dan.j.williams
@ 2025-12-09 20:10                   ` Michał Cłapiński
  2025-12-17  3:14                     ` dan.j.williams
  0 siblings, 1 reply; 20+ messages in thread
From: Michał Cłapiński @ 2025-12-09 20:10 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Thu, Oct 2, 2025 at 12:28 AM <dan.j.williams@intel.com> wrote:
>
> Michał Cłapiński wrote:
> > On Fri, Sep 26, 2025 at 8:45 PM <dan.j.williams@intel.com> wrote:
> > >
> > > Michał Cłapiński wrote:
> > > [..]
> > > > > As Mike says you would lose 128K at the end, but that indeed becomes
> > > > > losing that 1GB given alignment constraints.
> > > > >
> > > > > However, I think that could be solved by just separately vmalloc'ing the
> > > > > label space for this. Then instead of kernel parameters to sub-divide a
> > > > > region, you just have an initramfs script to do the same.
> > > > >
> > > > > Does that meet your needs?
> > > >
> > > > Sorry, I'm having trouble imagining this.
> > > > If I wanted 500 1GB chunks, I would request a region of 500GB+space
> > > > for the label? Or is that a label and info-blocks?
> > >
> > > You would specify an memmap= range of 500GB+128K*.
> > >
> > > Force attach that range to Mike's RAMDAX driver.
> > >
> > > [ modprobe -r nd_e820, don't build nd_820, or modprobe policy blocks nd_e820 ]
> > > echo ramdax > /sys/bus/platform/devices/e820_pmem/driver_override
> > > echo e820_pmem > /sys/bus/platform/drivers/ramdax
> > >
> > > * forget what I said about vmalloc() previously, not needed
> > >
> > > > Then on each boot the kernel would check if there is an actual
> > > > label/info-blocks in that space and if yes, it would recreate my
> > > > devices (including the fsdax/devdax type)?
> > >
> > > Right, if that range is persistent the kernel would automatically parse
> > > the label space each boot and divide up the 500GB region space into
> > > namespaces.
> > >
> > > 128K of label spaces gives you 509 potential namespaces.
> >
> > That's not enough for us. We would need ~1 order of magnitude more.
> > Sorry I'm being vague about this but I can't discuss the actual
> > machine sizes.
>
> Sure, then make it 1280K of label space. There's no practical limit in
> the implementation.

Hi Dan,
I just had the time to try this out. So I modified the code to
increase the label space to 2M and I was able to create the
namespaces. It put the metadata in volatile memory.

But the infoblocks are still within the namespaces, right? If I try to
create a 3G namespace with alignment set to 1G, its actual usable size
is 2G. So I can't divide the whole pmem into 1G devices with 1G
alignment.
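
The 3G to 2G shrinkage can be reproduced with a small sketch in C. It assumes a few KiB (8K here, an assumption about the pfn code) are reserved at the start of the namespace for the infoblock and that the data start is then rounded up to the requested alignment; the helper names are hypothetical.

```c
#include <stdint.h>

#define SZ_8K (8ULL << 10)
#define SZ_1G (1ULL << 30)

/* Round @x up to a multiple of @a; @a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Usable bytes of a namespace [start, start + size) whose data must begin
 * on an @align boundary after @reserve bytes of in-band metadata.
 * reserve == 0 models a namespace without an on-media infoblock.
 */
uint64_t usable_size(uint64_t start, uint64_t size, uint64_t reserve,
		     uint64_t align)
{
	uint64_t data_start = align_up(start + reserve, align);
	uint64_t end = (start + size) & ~(align - 1);	/* round end down */

	return end > data_start ? end - data_start : 0;
}
```

For a 3G namespace starting at 4G with 1G alignment, an 8K reservation leaves 2G usable, while no reservation leaves the full 3G. A 1G namespace with the reservation has no aligned space left at all.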

If I modify the code to remove the infoblocks, the namespace mode
won't be persistent, right? In my solution I get that information from
the kernel command line, so I don't need the infoblocks.

> > > > One of the requirements for live update is that the kexec reboot has
> > > > to be fast. My solution introduced a delay of tens of milliseconds
> > > > since the actual device creation is asynchronous. Manually dividing a
> > > > region into thousands of devices from userspace would be very slow but
> > >
> > > Wait, 500GB Region / 1GB Namespace = thousands of Namespaces?
> >
> > I was talking about devices and AFAIK 1 namespace equals 5 devices for
> > us currently (nd/{namespace, pfn, btt, dax}, dax/dax). Though the
> > device creation is asynchronous so I guess the actual device count is
> > not important.
>
> I do not see how it is relevant. You also get 1000s of devices with
> plain memory block devices.
>
> > > > I would have to do that only on the first boot, right?
> > >
> > > Yes, the expectation is only incur that overhead once. It also allows
> > > for VMs to be able to lookup their capacity by name. So you do not need
> > > a separate mapping of 1GB Namepsace blocks to VMs. Just give some VMs
> > > bigger Namespaces than others by name.
> >
> > Sure, I can do that at first. But after some time fragmentation will
> > happen, right?
>
> Why would fragementation be more of a problem with labels vs command
> line if the expectation is maintaining a persistent namespace layout
> over time?
>
> > At some point I will have to give VMs a bunch of smaller namespaces
> > here and there.
> >
> > Btw. one more thing I don't understand. Why are maintainers so much
> > against adding new kernel parameters?
>
> This label code is already written and it is less burden to maintain a
> new use of existing code vs new mechanism for a niche use case. Also,
> memmap= has long been a footgun, making that problem worse for
> questionable benefit to wider Linux project does not feel like the right
> tradeoff.
>
> The other alternative to labels is ACPI NFIT table injection. Again the
> tradeoff is that is just another reuse of an existing well worn
> mechanism for delineating PMEM.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-12-09 20:10                   ` Michał Cłapiński
@ 2025-12-17  3:14                     ` dan.j.williams
  2025-12-29 16:39                       ` Michał Cłapiński
  0 siblings, 1 reply; 20+ messages in thread
From: dan.j.williams @ 2025-12-17  3:14 UTC (permalink / raw)
  To: Michał Cłapiński, dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
[..]
> > Sure, then make it 1280K of label space. There's no practical limit in
> > the implementation.
> 
> Hi Dan,
> I just had the time to try this out. So I modified the code to
> increase the label space to 2M and I was able to create the
> namespaces. It put the metadata in volatile memory.
> 
> But the infoblocks are still within the namespaces, right? If I try to
> create a 3G namespace with alignment set to 1G, its actual usable size
> is 2G. So I can't divide the whole pmem into 1G devices with 1G
> alignment.

Ugh, yes, I failed to predict that outcome.

> If I modify the code to remove the infoblocks, the namespace mode
> won't be persistent, right? In my solution I get that information from
> the kernel command line, so I don't need the infoblocks.

So, I dislike the command line option ABI expansion proposal enough to
invest some time to find an alternative. One observation is that the
label is able to indicate the namespace mode independent of an
info-block. The info-block is only really needed when deciding whether
and how much space to reserve to allocate 'struct page' metadata.

-- 8< --
From 4f44cbb6e3bd4cac9481bdd4caf28975a4f1e471 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Mon, 15 Dec 2025 17:10:04 -0800
Subject: [PATCH] nvdimm: Allow fsdax and devdax namespace modes without
 info-blocks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Michał reports that the new ramdax facility does not meet his need, which
is to carve large reservations of memory into multiple 1GB aligned
namespaces/volumes. While ramdax solves the problem of in-memory
description of the volume layout, the nvdimm "infoblocks" eat capacity and
destroy alignment properties.

The infoblock serves 2 purposes, it indicates whether the namespace should
operate in fsdax or devdax mode, Michał needs this, and it optionally
reserves namespace capacity for storing 'struct page' metadata, Michał does
not need this. It turns out the mode information is already recorded in the
namespace label, and if no reservation is needed for 'struct page' metadata
then infoblock settings can just be hard coded.

Introduce a new ND_REGION_VIRT_INFOBLOCK flag for ramdax to indicate that
all infoblocks be synthesized and not consume any capacity from the
namespace.

With that ramdax can create a full sized namespace:

$ ndctl create-namespace -r region0 -s 1G -a 1G -M mem
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"mem",
  "size":"1024.00 MiB (1073.74 MB)",
  "uuid":"c48c4991-86af-4de1-8c7c-8919358df1f9",
  "sector_size":512,
  "align":1073741824,
  "blockdev":"pmem0"
}

Note that the uuid is not persisted so the "raw_uuid" in the label will be
the method to identify the namespace:

<after disable/enable region>
$ ndctl list -vu
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"mem",
  "size":"1024.00 MiB (1073.74 MB)",
  "uuid":"00000000-0000-0000-0000-000000000000",
  "raw_uuid":"1526a1df-d1ec-40e3-91e8-547f1ad9949d",
  "sector_size":512,
  "align":1073741824,
  "blockdev":"pmem0",
  "numa_node":0,
  "target_node":0
}

Also note that the align is hard coded to (PUD) 1G. That is probably fine
for now unless and until someone comes up with a case for making that
setting configurable.

Lastly, the kernel will complain if "-a 1G -M mem" are not specified to
"ndctl create-namespace" as the kernel still enforces that the live
settings specified at configuration time match the "virtual" infoblock.

Cc: Michał Cłapiński <mclapinski@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/libnvdimm.h |  3 ++
 drivers/nvdimm/pfn_devs.c | 58 +++++++++++++++++++++++++++++++++++++--
 drivers/nvdimm/ramdax.c   |  1 +
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 28f086c4a187..c79efc49dd24 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -70,6 +70,9 @@ enum {
 	/* Region was created by CXL subsystem */
 	ND_REGION_CXL = 4,
 
+	/* Virtual info-block mode (no writeback / storage reservation) */
+	ND_REGION_VIRT_INFOBLOCK = 5,
+
 	/* mark newly adjusted resources as requiring a label update */
 	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 42b172fc5576..68a998fe20a7 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -428,6 +428,50 @@ static bool nd_supported_alignment(unsigned long align)
 	return false;
 }
 
+static int nd_pfn_virt_init(struct nd_pfn *nd_pfn, const char *sig)
+{
+	struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
+	struct nd_namespace_common *ndns = nd_pfn->ndns;
+
+	switch (ndns->claim_class) {
+	case NVDIMM_CCLASS_PFN:
+		if (memcmp(sig, PFN_SIG, PFN_SIG_LEN) != 0)
+			return -ENODEV;
+		break;
+	case NVDIMM_CCLASS_DAX:
+		if (memcmp(sig, DAX_SIG, PFN_SIG_LEN) != 0)
+			return -ENODEV;
+		break;
+	default:
+		return -ENODEV;
+	}
+
+	*pfn_sb = (struct nd_pfn_sb) {
+		.version_major = cpu_to_le16(1),
+		.version_minor = cpu_to_le16(4),
+		.mode = cpu_to_le32(PFN_MODE_RAM),
+		.align = cpu_to_le32(HPAGE_PUD_SIZE),
+		.page_size = cpu_to_le32(PAGE_SIZE),
+		.page_struct_size = cpu_to_le16(sizeof(struct page)),
+	};
+	memcpy(pfn_sb->signature, sig, PFN_SIG_LEN);
+
+	/*
+	 * Virtual infoblock uuids do not persist, but match the live setting in
+	 * the validation case. The @align and @mode settings are fixed for the
+	 * virtual case, validation will enforce that they match.
+	 */
+	if (nd_pfn->uuid)
+		memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
+	memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
+	pfn_sb->checksum = cpu_to_le64(nd_sb_checksum((struct nd_gen_sb *) pfn_sb));
+
+	dev_dbg(&nd_pfn->dev, "virtual %s infoblock for %s\n", sig,
+		dev_name(&ndns->dev));
+
+	return 0;
+}
+
 /**
  * nd_pfn_validate - read and validate info-block
  * @nd_pfn: fsdax namespace runtime state / properties
@@ -448,6 +492,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 	struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
 	struct nd_namespace_common *ndns = nd_pfn->ndns;
 	const uuid_t *parent_uuid = nd_dev_to_uuid(&ndns->dev);
+	struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
 
 	if (!pfn_sb || !ndns)
 		return -ENODEV;
@@ -455,8 +500,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 	if (!is_memory(nd_pfn->dev.parent))
 		return -ENODEV;
 
-	if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0))
+	if (test_bit(ND_REGION_VIRT_INFOBLOCK, &nd_region->flags)) {
+		int rc = nd_pfn_virt_init(nd_pfn, sig);
+
+		if (rc)
+			return rc;
+	} else if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0)) {
 		return -ENXIO;
+	}
 
 	if (memcmp(pfn_sb->signature, sig, PFN_SIG_LEN) != 0)
 		return -ENODEV;
@@ -694,7 +745,10 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	};
 	pgmap->nr_range = 1;
 	if (nd_pfn->mode == PFN_MODE_RAM) {
-		if (offset < reserve)
+		struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+
+		if (!test_bit(ND_REGION_VIRT_INFOBLOCK, &nd_region->flags) &&
+		    offset < reserve)
 			return -EINVAL;
 		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
 	} else if (nd_pfn->mode == PFN_MODE_PMEM) {
diff --git a/drivers/nvdimm/ramdax.c b/drivers/nvdimm/ramdax.c
index 954cb7919807..992346390086 100644
--- a/drivers/nvdimm/ramdax.c
+++ b/drivers/nvdimm/ramdax.c
@@ -60,6 +60,7 @@ static int ramdax_register_region(struct resource *res,
 	ndr_desc.num_mappings = 1;
 	ndr_desc.mapping = &mapping;
 	ndr_desc.nd_set = nd_set;
+	set_bit(ND_REGION_VIRT_INFOBLOCK, &ndr_desc.flags);
 
 	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
 		goto err_free_nd_set;
-- 
2.51.1

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-12-17  3:14                     ` dan.j.williams
@ 2025-12-29 16:39                       ` Michał Cłapiński
  2026-01-12 23:36                         ` Ira Weiny
  0 siblings, 1 reply; 20+ messages in thread
From: Michał Cłapiński @ 2025-12-29 16:39 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Wed, Dec 17, 2025 at 4:14 AM <dan.j.williams@intel.com> wrote:
>
> Michał Cłapiński wrote:
> [..]
> > > Sure, then make it 1280K of label space. There's no practical limit in
> > > the implementation.
> >
> > Hi Dan,
> > I just had the time to try this out. So I modified the code to
> > increase the label space to 2M and I was able to create the
> > namespaces. It put the metadata in volatile memory.
> >
> > But the infoblocks are still within the namespaces, right? If I try to
> > create a 3G namespace with alignment set to 1G, its actual usable size
> > is 2G. So I can't divide the whole pmem into 1G devices with 1G
> > alignment.
>
> Ugh, yes, I failed to predict that outcome.
>
> > If I modify the code to remove the infoblocks, the namespace mode
> > won't be persistent, right? In my solution I get that information from
> > the kernel command line, so I don't need the infoblocks.
>
> So, I dislike the command line option ABI expansion proposal enough to
> invest some time to find an alternative. One observation is that the
> label is able to indicate the namespace mode independent of an
> info-block. The info-block is only really needed when deciding whether
> and how much space to reserve to allocate 'struct page' metadata.
>
> -- 8< --
> From 4f44cbb6e3bd4cac9481bdd4caf28975a4f1e471 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Mon, 15 Dec 2025 17:10:04 -0800
> Subject: [PATCH] nvdimm: Allow fsdax and devdax namespace modes without
>  info-blocks
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Michał reports that the new ramdax facility does not meet his needs which
> is to carve large reservations of memory into multiple 1GB aligned
> namespaces/volumes. While ramdax solves the problem of in-memory
> description of the volume layout, the nvdimm "infoblocks" eat capacity and
> destroy alignment properties.
>
> The infoblock serves 2 purposes, it indicates whether the namespace should
> operate in fsdax or devdax mode, Michał needs this, and it optionally
> reserves namespace capacity for storing 'struct page' metadata, Michał does
> not need this. It turns out the mode information is already recorded in the
> namespace label, and if no reservation is needed for 'struct page' metadata
> then infoblock settings can just be hard coded.
>
> Introduce a new ND_REGION_VIRT_INFOBLOCK flag for ramdax to indicate that
> all infoblocks be synthesized and not consume any capacity from the
> namespace.
>
> With that ramdax can create a full sized namespace:
>
> $ ndctl create-namespace -r region0 -s 1G -a 1G -M mem
> {
>   "dev":"namespace0.0",
>   "mode":"fsdax",
>   "map":"mem",
>   "size":"1024.00 MiB (1073.74 MB)",
>   "uuid":"c48c4991-86af-4de1-8c7c-8919358df1f9",
>   "sector_size":512,
>   "align":1073741824,
>   "blockdev":"pmem0"
> }

Thank you for working on this.

I tried it and indeed it works with fsdax. It doesn't seem to work with
devdax though.

$ ndctl create-namespace -v -r region1 -m devdax -a 1G -M mem -s 1G
libndctl: ndctl_dax_enable: dax1.0: failed to enable
  Error: namespace1.1: failed to enable

failed to create namespace: No such device or address

$ dmesg | grep dax
[...]
[   29.504763] dax_pmem dax1.0: could not reserve metadata
[   29.504766] dax_pmem dax1.0: probe with driver dax_pmem failed with error -16
[   29.506553] probe of dax1.0 returned 16 after 1815 usecs
[...]

I think the dax_pmem driver needs to be modified too.
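
For reference, a minimal sketch that decodes the two error codes seen above,
assuming Linux errno numbering. The helper names explain() and decode_log()
are made up for illustration. The probe failure of -16 is -EBUSY, while the
"No such device or address" text printed by ndctl is the standard message
for ENXIO:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print a kernel-style negative return code as text. */
static void explain(const char *where, int rc)
{
	printf("%s: %d (%s)\n", where, rc, strerror(-rc));
}

/* Hypothetical helper: decode the two failures from the log above. */
static void decode_log(void)
{
	/* Assumption: Linux errno values, where EBUSY is 16 and ENXIO is 6. */
	assert(EBUSY == 16 && ENXIO == 6);

	explain("dax_pmem probe", -EBUSY);
	explain("ndctl create-namespace", -ENXIO);
}
```

So the driver is failing a resource reservation rather than rejecting the
namespace configuration itself.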

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2025-12-29 16:39                       ` Michał Cłapiński
@ 2026-01-12 23:36                         ` Ira Weiny
  2026-01-13 12:39                           ` Michał Cłapiński
  0 siblings, 1 reply; 20+ messages in thread
From: Ira Weiny @ 2026-01-12 23:36 UTC (permalink / raw)
  To: Michał Cłapiński, dan.j.williams
  Cc: Mike Rapoport, Ira Weiny, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
> On Wed, Dec 17, 2025 at 4:14 AM <dan.j.williams@intel.com> wrote:
> >
> > Michał Cłapiński wrote:
> > [..]
> > > > Sure, then make it 1280K of label space. There's no practical limit in
> > > > the implementation.
> > >
> > > Hi Dan,
> > > I just had the time to try this out. So I modified the code to
> > > increase the label space to 2M and I was able to create the
> > > namespaces. It put the metadata in volatile memory.
> > >
> > > But the infoblocks are still within the namespaces, right? If I try to
> > > create a 3G namespace with alignment set to 1G, its actual usable size
> > > is 2G. So I can't divide the whole pmem into 1G devices with 1G
> > > alignment.
> >
> > Ugh, yes, I failed to predict that outcome.
> >
> > > If I modify the code to remove the infoblocks, the namespace mode
> > > won't be persistent, right? In my solution I get that information from
> > > the kernel command line, so I don't need the infoblocks.
> >
> > So, I dislike the command line option ABI expansion proposal enough to
> > invest some time to find an alternative. One observation is that the
> > label is able to indicate the namespace mode independent of an
> > info-block. The info-block is only really needed when deciding whether
> > and how much space to reserve to allocate 'struct page' metadata.
> >
> > -- 8< --
> > From 4f44cbb6e3bd4cac9481bdd4caf28975a4f1e471 Mon Sep 17 00:00:00 2001
> > From: Dan Williams <dan.j.williams@intel.com>
> > Date: Mon, 15 Dec 2025 17:10:04 -0800
> > Subject: [PATCH] nvdimm: Allow fsdax and devdax namespace modes without
> >  info-blocks
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> >
> > Michał reports that the new ramdax facility does not meet his needs which
> > is to carve large reservations of memory into multiple 1GB aligned
> > namespaces/volumes. While ramdax solves the problem of in-memory
> > description of the volume layout, the nvdimm "infoblocks" eat capacity and
> > destroy alignment properties.
> >
> > The infoblock serves 2 purposes, it indicates whether the namespace should
> > operate in fsdax or devdax mode, Michał needs this, and it optionally
> > reserves namespace capacity for storing 'struct page' metadata, Michał does
> > not need this. It turns out the mode information is already recorded in the
> > namespace label, and if no reservation is needed for 'struct page' metadata
> > then infoblock settings can just be hard coded.
> >
> > Introduce a new ND_REGION_VIRT_INFOBLOCK flag for ramdax to indicate that
> > all infoblocks be synthesized and not consume any capacity from the
> > namespace.
> >
> > With that ramdax can create a full sized namespace:
> >
> > $ ndctl create-namespace -r region0 -s 1G -a 1G -M mem
> > {
> >   "dev":"namespace0.0",
> >   "mode":"fsdax",
> >   "map":"mem",
> >   "size":"1024.00 MiB (1073.74 MB)",
> >   "uuid":"c48c4991-86af-4de1-8c7c-8919358df1f9",
> >   "sector_size":512,
> >   "align":1073741824,
> >   "blockdev":"pmem0"
> > }
> 
> Thank you for working on this.
> 
> I tried it and indeed it works with fsdax. It doesn't seem to work with
> devdax though.
> 
> $ ndctl create-namespace -v -r region1 -m devdax -a 1G -M mem -s 1G
> libndctl: ndctl_dax_enable: dax1.0: failed to enable
>   Error: namespace1.1: failed to enable
> 
> failed to create namespace: No such device or address
> 
> $ dmesg | grep dax
> [...]
> [   29.504763] dax_pmem dax1.0: could not reserve metadata
> [   29.504766] dax_pmem dax1.0: probe with driver dax_pmem failed with error -16
> [   29.506553] probe of dax1.0 returned 16 after 1815 usecs
> [...]
> 
> I think the dax_pmem driver needs to be modified too.

Michał

Did yall have a suggestion on how?  I lost track of this thread over the
holidays and so I'm not sure where this stands.

Ira

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2026-01-12 23:36                         ` Ira Weiny
@ 2026-01-13 12:39                           ` Michał Cłapiński
  2026-01-14  2:34                             ` dan.j.williams
  0 siblings, 1 reply; 20+ messages in thread
From: Michał Cłapiński @ 2026-01-13 12:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: dan.j.williams, Mike Rapoport, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

On Tue, Jan 13, 2026 at 12:33 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Michał Cłapiński wrote:
> > On Wed, Dec 17, 2025 at 4:14 AM <dan.j.williams@intel.com> wrote:
> > >
> > > Michał Cłapiński wrote:
> > > [..]
> > > > > Sure, then make it 1280K of label space. There's no practical limit in
> > > > > the implementation.
> > > >
> > > > Hi Dan,
> > > > I just had the time to try this out. So I modified the code to
> > > > increase the label space to 2M and I was able to create the
> > > > namespaces. It put the metadata in volatile memory.
> > > >
> > > > But the infoblocks are still within the namespaces, right? If I try to
> > > > create a 3G namespace with alignment set to 1G, its actual usable size
> > > > is 2G. So I can't divide the whole pmem into 1G devices with 1G
> > > > alignment.
> > >
> > > Ugh, yes, I failed to predict that outcome.
> > >
> > > > If I modify the code to remove the infoblocks, the namespace mode
> > > > won't be persistent, right? In my solution I get that information from
> > > > the kernel command line, so I don't need the infoblocks.
> > >
> > > So, I dislike the command line option ABI expansion proposal enough to
> > > invest some time to find an alternative. One observation is that the
> > > label is able to indicate the namespace mode independent of an
> > > info-block. The info-block is only really needed when deciding whether
> > > and how much space to reserve to allocate 'struct page' metadata.
> > >
> > > -- 8< --
> > > From 4f44cbb6e3bd4cac9481bdd4caf28975a4f1e471 Mon Sep 17 00:00:00 2001
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Date: Mon, 15 Dec 2025 17:10:04 -0800
> > > Subject: [PATCH] nvdimm: Allow fsdax and devdax namespace modes without
> > >  info-blocks
> > > MIME-Version: 1.0
> > > Content-Type: text/plain; charset=UTF-8
> > > Content-Transfer-Encoding: 8bit
> > >
> > > Michał reports that the new ramdax facility does not meet his needs which
> > > is to carve large reservations of memory into multiple 1GB aligned
> > > namespaces/volumes. While ramdax solves the problem of in-memory
> > > description of the volume layout, the nvdimm "infoblocks" eat capacity and
> > > destroy alignment properties.
> > >
> > > The infoblock serves 2 purposes, it indicates whether the namespace should
> > > operate in fsdax or devdax mode, Michał needs this, and it optionally
> > > reserves namespace capacity for storing 'struct page' metadata, Michał does
> > > not need this. It turns out the mode information is already recorded in the
> > > namespace label, and if no reservation is needed for 'struct page' metadata
> > > then infoblock settings can just be hard coded.
> > >
> > > Introduce a new ND_REGION_VIRT_INFOBLOCK flag for ramdax to indicate that
> > > all infoblocks be synthesized and not consume any capacity from the
> > > namespace.
> > >
> > > With that ramdax can create a full sized namespace:
> > >
> > > $ ndctl create-namespace -r region0 -s 1G -a 1G -M mem
> > > {
> > >   "dev":"namespace0.0",
> > >   "mode":"fsdax",
> > >   "map":"mem",
> > >   "size":"1024.00 MiB (1073.74 MB)",
> > >   "uuid":"c48c4991-86af-4de1-8c7c-8919358df1f9",
> > >   "sector_size":512,
> > >   "align":1073741824,
> > >   "blockdev":"pmem0"
> > > }
> >
> > Thank you for working on this.
> >
> > I tried it and indeed it works with fsdax. It doesn't seem to work with
> > devdax though.
> >
> > $ ndctl create-namespace -v -r region1 -m devdax -a 1G -M mem -s 1G
> > libndctl: ndctl_dax_enable: dax1.0: failed to enable
> >   Error: namespace1.1: failed to enable
> >
> > failed to create namespace: No such device or address
> >
> > $ dmesg | grep dax
> > [...]
> > [   29.504763] dax_pmem dax1.0: could not reserve metadata
> > [   29.504766] dax_pmem dax1.0: probe with driver dax_pmem failed with error -16
> > [   29.506553] probe of dax1.0 returned 16 after 1815 usecs
> > [...]
> >
> > I think the dax_pmem driver needs to be modified too.
>
> Michał
>
> Did yall have a suggestion on how?  I lost track of this thread over the
> holidays and so I'm not sure where this stands.
>
> Ira

I think Dan's change needs to skip devm_request_mem_region if offset
== 0 in __dax_pmem_probe in drivers/dax/pmem.c.
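
A rough userspace model of that idea, not the actual driver code. All
model_* names are hypothetical stand-ins, and the model assumes that a
zero-length region request is refused, which is consistent with the -16
(EBUSY) "could not reserve metadata" failure in the log. With a virtual
infoblock the data offset is 0, so there is no metadata range to reserve:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a memory region request.
 * Assumption: a zero-length request is refused.
 */
static bool model_request_region(uint64_t start, uint64_t len)
{
	(void)start;
	return len != 0;
}

/* Models the current probe step: always reserve [start, start + dataoff). */
static int model_probe_current(uint64_t start, uint64_t dataoff)
{
	if (!model_request_region(start, dataoff))
		return -EBUSY;	/* "could not reserve metadata" */
	return 0;
}

/* Models the suggested change: skip the reservation when dataoff is 0. */
static int model_probe_fixed(uint64_t start, uint64_t dataoff)
{
	if (dataoff && !model_request_region(start, dataoff))
		return -EBUSY;
	return 0;
}

/* Compare both variants for a virtual (dataoff == 0) and a real infoblock. */
static void model_selfcheck(void)
{
	const uint64_t start = 0x100000000ULL;	/* arbitrary example address */

	assert(model_probe_current(start, 0) == -EBUSY);
	assert(model_probe_fixed(start, 0) == 0);
	assert(model_probe_current(start, 0x200000) == 0);
	assert(model_probe_fixed(start, 0x200000) == 0);
}
```

The non-zero offset case is unchanged, so namespaces with a real on-media
infoblock would still get their metadata range reserved.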

* Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices
  2026-01-13 12:39                           ` Michał Cłapiński
@ 2026-01-14  2:34                             ` dan.j.williams
  0 siblings, 0 replies; 20+ messages in thread
From: dan.j.williams @ 2026-01-14  2:34 UTC (permalink / raw)
  To: Michał Cłapiński, Ira Weiny
  Cc: dan.j.williams, Mike Rapoport, Dave Jiang, Vishal Verma, jane.chu,
	Pasha Tatashin, Tyler Hicks, linux-kernel, nvdimm

Michał Cłapiński wrote:
[..]
> > Did yall have a suggestion on how?  I lost track of this thread over the
> > holidays and so I'm not sure where this stands.
> >
> > Ira
> 
> I think Dan's change needs to skip devm_request_mem_region if offset
> == 0 in __dax_pmem_probe in drivers/dax/pmem.c.

I would be happy to review a fixup patch that works for you, and give
you Co-developed-by credit. That would be faster than waiting for this
to bubble back up to the top of my queue again.

end of thread, other threads:[~2026-01-14  2:34 UTC | newest]

Thread overview: 20+ messages
2025-08-26  8:04 [PATCH 0/1] nvdimm: allow exposing RAM as libnvdimm DIMMs Mike Rapoport
2025-08-26  8:04 ` [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM devices Mike Rapoport
2025-08-29  0:47   ` Ira Weiny
2025-08-29  7:57     ` Mike Rapoport
2025-09-01 16:01       ` Michał Cłapiński
2025-09-02 15:35         ` Mike Rapoport
2025-09-24  1:16         ` dan.j.williams
2025-09-26 12:47           ` Michał Cłapiński
2025-09-26 18:45             ` dan.j.williams
2025-09-30 10:15               ` Mike Rapoport
2025-10-01 14:14               ` Michał Cłapiński
2025-10-01 22:28                 ` dan.j.williams
2025-12-09 20:10                   ` Michał Cłapiński
2025-12-17  3:14                     ` dan.j.williams
2025-12-29 16:39                       ` Michał Cłapiński
2026-01-12 23:36                         ` Ira Weiny
2026-01-13 12:39                           ` Michał Cłapiński
2026-01-14  2:34                             ` dan.j.williams
2025-09-24  1:08   ` dan.j.williams
2025-09-30 10:11     ` Mike Rapoport
