* [PATCH v2 0/3] "Device DAX" for persistent memory
From: Dan Williams @ 2016-05-15 6:26 UTC
To: linux-nvdimm
Cc: Dave Hansen, Dave Chinner, linux-kernel, hch, linux-block,
Jeff Moyer, Jan Kara, Andrew Morton, Ross Zwisler
Changes since v1: [1]
1/ Dropped the lead-in cleanup patches from this posting as they appear
on the libnvdimm-for-next branch.
2/ Fixed the needlessly overweight fault locking by replacing it with
rcu_read_lock() and no longer taking the i_mmap_lock, since it has no
effect or purpose. Unlike block devices, the vfs does not arrange for
character device inodes to share the same address_space.
3/ Fixed the device release path since the class release method is not
called when the device is created by device_create_with_groups().
4/ Cleanups resulting from the switch to carry (struct dax_dev *) in
filp->private_data.
---
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
without need of an intervening file system or being bound to block
device semantics.
Why "Device DAX"?
1/ As I mentioned at LSF [2] we are starting to see platforms with
performance and feature differentiated memory ranges. Environments like
high-performance-computing and usages like in-memory databases want
exclusive allocation of a memory range with zero conflicting
kernel/metadata allocations. For dedicated applications of high
bandwidth or low latency memory, device-DAX provides a predictable
direct map mechanism.
Note that this is only for the small number of "crazy" applications that
are willing to re-write to get every bit of performance. For everyone
else we, Dave Hansen and I, are looking to add a mechanism to hot-plug
device-DAX ranges into the mm to get general memory management services
(oversubscribe / migration, etc) with the understanding that it may
sacrifice some predictability.
2/ For persistent memory there are similar applications that are willing
to re-write to take full advantage of byte-addressable persistence.
This mechanism satisfies those usages that only need a pre-allocated
file to mmap.
3/ It answers Dave Chinner's call to start thinking about pmem-native
solutions. Device DAX specifically avoids block-device and file system
conflicts.
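As a rough illustration of the "pre-allocated file to mmap" usage in 2/,
a userspace consumer might look like the sketch below. This is only a
sketch under stated assumptions: the helper name dax_map_shared() is
hypothetical, as is any concrete device path such as /dev/dax0.0, and
the mmap file operation itself only arrives in patch 2/3. MAP_SHARED is
used because device-dax omits MAP_PRIVATE support; the length is
presumably expected to respect the alignment configured for the region.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this series): map `len` bytes of a
 * device-dax style character device read/write and shared.  Any
 * mmap-capable file works for experimentation.  Returns MAP_FAILED on
 * error, exactly like mmap(2).
 */
void *dax_map_shared(const char *path, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return MAP_FAILED;

	/* device-dax does not support MAP_PRIVATE, so always map shared */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* an established mapping stays valid after the fd is closed */
	close(fd);
	return addr;
}
```

A caller would pass the device node and a length, check the result
against MAP_FAILED, and release the range with munmap() when done.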
[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-May/005660.html
[2]: https://lwn.net/Articles/685107/
---
Dan Williams (3):
/dev/dax, pmem: direct access to persistent memory
/dev/dax, core: file operations and dax-mmap
Revert "block: enable dax for raw block devices"
block/ioctl.c | 32 --
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/dax/Kconfig | 26 ++
drivers/dax/Makefile | 4
drivers/dax/dax.c | 568 +++++++++++++++++++++++++++++++++++
drivers/dax/dax.h | 24 +
drivers/dax/pmem.c | 168 ++++++++++
fs/block_dev.c | 96 ++----
include/linux/fs.h | 8
include/uapi/linux/fs.h | 1
mm/huge_memory.c | 1
mm/hugetlb.c | 1
tools/testing/nvdimm/Kbuild | 9 +
tools/testing/nvdimm/config_check.c | 2
15 files changed, 835 insertions(+), 108 deletions(-)
create mode 100644 drivers/dax/Kconfig
create mode 100644 drivers/dax/Makefile
create mode 100644 drivers/dax/dax.c
create mode 100644 drivers/dax/dax.h
create mode 100644 drivers/dax/pmem.c
* [PATCH v2 1/3] /dev/dax, pmem: direct access to persistent memory
From: Dan Williams @ 2016-05-15 6:26 UTC
To: linux-nvdimm
Cc: Dave Hansen, linux-kernel, linux-block, Jeff Moyer, Ross Zwisler,
Andrew Morton, hch

Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
without need of an intervening file system. Device DAX is strict,
precise and predictable. Specifically this interface:

1/ Guarantees fault granularity with respect to a given page size (pte,
   pmd, or pud) set at configuration time.

2/ Enforces deterministic behavior by being strict about what fault
   scenarios are supported.

For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE
support device-dax guarantees that a mapping always behaves/performs the
same once established. It is the "what you see is what you get" access
mechanism to differentiated memory vs filesystem DAX which has
filesystem specific implementation semantics.

Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance differentiated memory
ranges.

This commit is limited to the base device driver infrastructure to
associate a dax device with pmem range.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/Kconfig                     |    2 
 drivers/Makefile                    |    1 
 drivers/dax/Kconfig                 |   25 +++
 drivers/dax/Makefile                |    4 +
 drivers/dax/dax.c                   |  252 +++++++++++++++++++++++++++++++++++
 drivers/dax/dax.h                   |   24 +++
 drivers/dax/pmem.c                  |  168 +++++++++++++++++++++++
 tools/testing/nvdimm/Kbuild         |    9 +
 tools/testing/nvdimm/config_check.c |    2 
 9 files changed, 487 insertions(+)
 create mode 100644 drivers/dax/Kconfig
 create mode 100644 drivers/dax/Makefile
 create mode 100644 drivers/dax/dax.c
 create mode 100644 drivers/dax/dax.h
 create mode 100644 drivers/dax/pmem.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..8298eab84a6f 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -190,6 +190,8 @@ source "drivers/android/Kconfig"
 
 source "drivers/nvdimm/Kconfig"
 
+source "drivers/dax/Kconfig"
+
 source "drivers/nvmem/Kconfig"
 
 source "drivers/hwtracing/stm/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 8f5d076baeb0..0b6f3d60193d 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM) += lightnvm/
 obj-y += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM) += nvdimm/
+obj-$(CONFIG_DEV_DAX) += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS) += nubus/
 obj-y += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
new file mode 100644
index 000000000000..86ffbaa891ad
--- /dev/null
+++ b/drivers/dax/Kconfig
@@ -0,0 +1,25 @@
+menuconfig DEV_DAX
+	tristate "DAX: direct access to differentiated memory"
+	default m if NVDIMM_DAX
+	help
+	  Support raw access to differentiated (persistence, bandwidth,
+	  latency...) memory via an mmap(2) capable character
+	  device. Platform firmware or a device driver may identify a
+	  platform memory resource that is differentiated from the
+	  baseline memory pool. Mappings of a /dev/daxX.Y device impose
+	  restrictions that make the mapping behavior deterministic.
+
+if DEV_DAX
+
+config DEV_DAX_PMEM
+	tristate "PMEM DAX: direct access to persistent memory"
+	depends on NVDIMM_DAX
+	default DEV_DAX
+	help
+	  Support raw access to persistent memory. Note that this
+	  driver consumes memory ranges allocated and exported by the
+	  libnvdimm sub-system.
+
+	  Say Y if unsure
+
+endif
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
new file mode 100644
index 000000000000..27c54e38478a
--- /dev/null
+++ b/drivers/dax/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+
+dax_pmem-y := pmem.o
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
new file mode 100644
index 000000000000..8207fb33a992
--- /dev/null
+++ b/drivers/dax/dax.c
@@ -0,0 +1,252 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+static int dax_major;
+static struct class *dax_class;
+static DEFINE_IDA(dax_minor_ida);
+
+/**
+ * struct dax_region - mapping infrastructure for dax devices
+ * @id: kernel-wide unique region for a memory range
+ * @base: linear address corresponding to @res
+ * @kref: to pin while other agents have a need to do lookups
+ * @dev: parent device backing this region
+ * @align: allocation and mapping alignment for child dax devices
+ * @res: physical address range of the region
+ * @pfn_flags: identify whether the pfns are paged back or not
+ */
+struct dax_region {
+	int id;
+	struct ida ida;
+	void *base;
+	struct kref kref;
+	struct device *dev;
+	unsigned int align;
+	struct resource res;
+	unsigned long pfn_flags;
+};
+
+/**
+ * struct dax_dev - subdivision of a dax region
+ * @region - parent region
+ * @dev - device backing the character device
+ * @kref - enable this data to be tracked in filp->private_data
+ * @id - child id in the region
+ * @num_resources - number of physical address extents in this device
+ * @res - array of physical address ranges
+ */
+struct dax_dev {
+	struct dax_region *region;
+	struct device *dev;
+	struct kref kref;
+	int id;
+	int num_resources;
+	struct resource res[0];
+};
+
+static void dax_region_free(struct kref *kref)
+{
+	struct dax_region *dax_region;
+
+	dax_region = container_of(kref, struct dax_region, kref);
+	kfree(dax_region);
+}
+
+void dax_region_put(struct dax_region *dax_region)
+{
+	kref_put(&dax_region->kref, dax_region_free);
+}
+EXPORT_SYMBOL_GPL(dax_region_put);
+
+static void dax_dev_free(struct kref *kref)
+{
+	struct dax_dev *dax_dev;
+
+	dax_dev = container_of(kref, struct dax_dev, kref);
+	dax_region_put(dax_dev->region);
+	kfree(dax_dev);
+}
+
+static void dax_dev_put(struct dax_dev *dax_dev)
+{
+	kref_put(&dax_dev->kref, dax_dev_free);
+}
+
+struct dax_region *alloc_dax_region(struct device *parent, int region_id,
+		struct resource *res, unsigned int align, void *addr,
+		unsigned long pfn_flags)
+{
+	struct dax_region *dax_region;
+
+	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
+
+	if (!dax_region)
+		return NULL;
+
+	memcpy(&dax_region->res, res, sizeof(*res));
+	dax_region->pfn_flags = pfn_flags;
+	kref_init(&dax_region->kref);
+	dax_region->id = region_id;
+	ida_init(&dax_region->ida);
+	dax_region->align = align;
+	dax_region->dev = parent;
+	dax_region->base = addr;
+
+	return dax_region;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_region);
+
+static ssize_t size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_dev *dax_dev = dev_get_drvdata(dev);
+	unsigned long long size = 0;
+	int i;
+
+	for (i = 0; i < dax_dev->num_resources; i++)
+		size += resource_size(&dax_dev->res[i]);
+
+	return sprintf(buf, "%llu\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static struct attribute *dax_device_attributes[] = {
+	&dev_attr_size.attr,
+	NULL,
+};
+
+static const struct attribute_group dax_device_attribute_group = {
+	.attrs = dax_device_attributes,
+};
+
+static const struct attribute_group *dax_attribute_groups[] = {
+	&dax_device_attribute_group,
+	NULL,
+};
+
+static void destroy_dax_dev(void *_dev)
+{
+	struct device *dev = _dev;
+	struct dax_dev *dax_dev = dev_get_drvdata(dev);
+	struct dax_region *dax_region = dax_dev->region;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	get_device(dev);
+	device_unregister(dev);
+	ida_simple_remove(&dax_region->ida, dax_dev->id);
+	ida_simple_remove(&dax_minor_ida, MINOR(dev->devt));
+	put_device(dev);
+	dax_dev_put(dax_dev);
+}
+
+int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
+		int count)
+{
+	struct device *parent = dax_region->dev;
+	struct dax_dev *dax_dev;
+	struct device *dev;
+	int rc, minor;
+	dev_t dev_t;
+
+	dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL);
+	if (!dax_dev)
+		return -ENOMEM;
+	memcpy(dax_dev->res, res, sizeof(*res) * count);
+	dax_dev->num_resources = count;
+	kref_init(&dax_dev->kref);
+	dax_dev->region = dax_region;
+	kref_get(&dax_region->kref);
+
+	dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL);
+	if (dax_dev->id < 0) {
+		rc = dax_dev->id;
+		goto err_id;
+	}
+
+	minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL);
+	if (minor < 0) {
+		rc = minor;
+		goto err_minor;
+	}
+
+	dev_t = MKDEV(dax_major, minor);
+	dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev,
+			dax_attribute_groups, "dax%d.%d", dax_region->id,
+			dax_dev->id);
+	if (IS_ERR(dev)) {
+		rc = PTR_ERR(dev);
+		goto err_create;
+	}
+	dax_dev->dev = dev;
+
+	rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev);
+	if (rc) {
+		destroy_dax_dev(dev);
+		return rc;
+	}
+
+	return 0;
+
+ err_create:
+	ida_simple_remove(&dax_minor_ida, minor);
+ err_minor:
+	ida_simple_remove(&dax_region->ida, dax_dev->id);
+ err_id:
+	dax_dev_put(dax_dev);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(devm_create_dax_dev);
+
+static const struct file_operations dax_fops = {
+	.llseek = noop_llseek,
+	.owner = THIS_MODULE,
+};
+
+static int __init dax_init(void)
+{
+	int rc;
+
+	rc = register_chrdev(0, "dax", &dax_fops);
+	if (rc < 0)
+		return rc;
+	dax_major = rc;
+
+	dax_class = class_create(THIS_MODULE, "dax");
+	if (IS_ERR(dax_class)) {
+		unregister_chrdev(dax_major, "dax");
+		return PTR_ERR(dax_class);
+	}
+
+	return 0;
+}
+
+static void __exit dax_exit(void)
+{
+	class_destroy(dax_class);
+	unregister_chrdev(dax_major, "dax");
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+subsys_initcall(dax_init);
+module_exit(dax_exit);
diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
new file mode 100644
index 000000000000..d8b8f1f25054
--- /dev/null
+++ b/drivers/dax/dax.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#ifndef __DAX_H__
+#define __DAX_H__
+struct device;
+struct resource;
+struct dax_region;
+void dax_region_put(struct dax_region *dax_region);
+struct dax_region *alloc_dax_region(struct device *parent,
+		int region_id, struct resource *res, unsigned int align,
+		void *addr, unsigned long flags);
+int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
+		int count);
+#endif /* __DAX_H__ */
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
new file mode 100644
index 000000000000..4e97555e1cab
--- /dev/null
+++ b/drivers/dax/pmem.c
@@ -0,0 +1,168 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/percpu-refcount.h>
+#include <linux/memremap.h>
+#include <linux/module.h>
+#include <linux/pfn_t.h>
+#include "../nvdimm/pfn.h"
+#include "../nvdimm/nd.h"
+#include "dax.h"
+
+struct dax_pmem {
+	struct device *dev;
+	struct percpu_ref ref;
+	struct completion cmp;
+};
+
+struct dax_pmem *to_dax_pmem(struct percpu_ref *ref)
+{
+	return container_of(ref, struct dax_pmem, ref);
+}
+
+static void dax_pmem_percpu_release(struct percpu_ref *ref)
+{
+	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
+
+	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	complete(&dax_pmem->cmp);
+}
+
+static void dax_pmem_percpu_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
+
+	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	percpu_ref_exit(ref);
+	wait_for_completion(&dax_pmem->cmp);
+}
+
+static void dax_pmem_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
+
+	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	percpu_ref_kill(ref);
+}
+
+static int dax_pmem_probe(struct device *dev)
+{
+	int rc;
+	void *addr;
+	struct resource res;
+	struct nd_pfn_sb *pfn_sb;
+	struct dax_pmem *dax_pmem;
+	struct nd_region *nd_region;
+	struct nd_namespace_io *nsio;
+	struct dax_region *dax_region;
+	struct nd_namespace_common *ndns;
+	struct nd_dax *nd_dax = to_nd_dax(dev);
+	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
+	struct vmem_altmap __altmap, *altmap = NULL;
+
+	ndns = nvdimm_namespace_common_probe(dev);
+	if (IS_ERR(ndns))
+		return PTR_ERR(ndns);
+	nsio = to_nd_namespace_io(&ndns->dev);
+
+	/* parse the 'pfn' info block via ->rw_bytes */
+	devm_nsio_enable(dev, nsio);
+	altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap);
+	if (IS_ERR(altmap))
+		return PTR_ERR(altmap);
+	devm_nsio_disable(dev, nsio);
+
+	pfn_sb = nd_pfn->pfn_sb;
+
+	if (!devm_request_mem_region(dev, nsio->res.start,
+				resource_size(&nsio->res), dev_name(dev))) {
+		dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
+		return -EBUSY;
+	}
+
+	dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL);
+	if (!dax_pmem)
+		return -ENOMEM;
+
+	dax_pmem->dev = dev;
+	init_completion(&dax_pmem->cmp);
+	rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0,
+			GFP_KERNEL);
+	if (rc)
+		return rc;
+
+	rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
+	if (rc) {
+		dax_pmem_percpu_exit(&dax_pmem->ref);
+		return rc;
+	}
+
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref);
+	if (rc) {
+		dax_pmem_percpu_kill(&dax_pmem->ref);
+		return rc;
+	}
+
+	nd_region = to_nd_region(dev->parent);
+	dax_region = alloc_dax_region(dev, nd_region->id, &res,
+			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
+	if (!dax_region)
+		return -ENOMEM;
+
+	/* TODO: support for subdividing a dax region... */
+	rc = devm_create_dax_dev(dax_region, &res, 1);
+
+	/* child dax_dev instances now own the lifetime of the dax_region */
+	dax_region_put(dax_region);
+
+	return rc;
+}
+
+static int dax_pmem_remove(struct device *dev)
+{
+	/*
+	 * The init path is fully devm-enabled, or child devices
+	 * otherwise hold references on parent resources.
+	 */
+	return 0;
+}
+
+static struct nd_device_driver dax_pmem_driver = {
+	.probe = dax_pmem_probe,
+	.remove = dax_pmem_remove,
+	.drv = {
+		.name = "dax_pmem",
+	},
+	.type = ND_DRIVER_DAX_PMEM,
+};
+
+static int __init dax_pmem_init(void)
+{
+	return nd_driver_register(&dax_pmem_driver);
+}
+module_init(dax_pmem_init);
+
+static void __exit dax_pmem_exit(void)
+{
+	driver_unregister(&dax_pmem_driver.drv);
+}
+module_exit(dax_pmem_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM);
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 5ff6d3c126a9..785985677159 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -16,6 +16,7 @@ ldflags-y += --wrap=phys_to_pfn_t
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi
+DAX_SRC := $(DRIVERS)/dax
 
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
@@ -23,6 +24,8 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
 obj-$(CONFIG_ACPI_NFIT) += nfit.o
+obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
 nfit-y := $(ACPI_SRC)/nfit.o
 nfit-y += config_check.o
@@ -39,6 +42,12 @@ nd_blk-y += config_check.o
 nd_e820-y := $(NVDIMM_SRC)/e820.o
 nd_e820-y += config_check.o
 
+dax-y := $(DAX_SRC)/dax.o
+dax-y += config_check.o
+
+dax_pmem-y := $(DAX_SRC)/pmem.o
+dax_pmem-y += config_check.o
+
 libnvdimm-y := $(NVDIMM_SRC)/core.o
 libnvdimm-y += $(NVDIMM_SRC)/bus.o
 libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index f2c7615554eb..adf18bfeca00 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -12,4 +12,6 @@ void check(void)
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
 }
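
The reference hand-off at the end of dax_pmem_probe() ("child dax_dev
instances now own the lifetime of the dax_region") can be modelled in
plain userspace C. The following is only an illustrative sketch, not
code from this patch: struct ref is a non-atomic stand-in for struct
kref, and the region/child names and the `freed` flags are hypothetical
and exist purely so the ordering can be observed.

```c
#include <stdlib.h>

/* Non-atomic stand-in for struct kref; the kernel's kref is atomic. */
struct ref { int count; };

struct region { struct ref ref; int *freed; };
struct child  { struct ref ref; struct region *region; int *freed; };

/* Drop one region reference; free on the last put (cf. dax_region_put). */
void region_put(struct region *r)
{
	if (--r->ref.count == 0) {
		*r->freed = 1;
		free(r);
	}
}

/* Drop one child reference; the last put also unpins the parent region. */
void child_put(struct child *c)
{
	if (--c->ref.count == 0) {
		region_put(c->region);
		*c->freed = 1;
		free(c);
	}
}

/* Allocate a region holding one reference, owned by the caller. */
struct region *region_alloc(int *freed)
{
	struct region *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->ref.count = 1;
	r->freed = freed;
	return r;
}

/* Create a child that takes its own reference on the parent region. */
struct child *child_create(struct region *r, int *freed)
{
	struct child *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->ref.count = 1;
	c->region = r;
	c->freed = freed;
	r->ref.count++;	/* child pins the region */
	return c;
}
```

In this model the creator can drop its own region reference right after
creating the child, and the region is only freed once the last child
reference goes away.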
+{ + kref_put(&dax_dev->kref, dax_dev_free); +} + +struct dax_region *alloc_dax_region(struct device *parent, int region_id, + struct resource *res, unsigned int align, void *addr, + unsigned long pfn_flags) +{ + struct dax_region *dax_region; + + dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); + + if (!dax_region) + return NULL; + + memcpy(&dax_region->res, res, sizeof(*res)); + dax_region->pfn_flags = pfn_flags; + kref_init(&dax_region->kref); + dax_region->id = region_id; + ida_init(&dax_region->ida); + dax_region->align = align; + dax_region->dev = parent; + dax_region->base = addr; + + return dax_region; +} +EXPORT_SYMBOL_GPL(alloc_dax_region); + +static ssize_t size_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_dev *dax_dev = dev_get_drvdata(dev); + unsigned long long size = 0; + int i; + + for (i = 0; i < dax_dev->num_resources; i++) + size += resource_size(&dax_dev->res[i]); + + return sprintf(buf, "%llu\n", size); +} +static DEVICE_ATTR_RO(size); + +static struct attribute *dax_device_attributes[] = { + &dev_attr_size.attr, + NULL, +}; + +static const struct attribute_group dax_device_attribute_group = { + .attrs = dax_device_attributes, +}; + +static const struct attribute_group *dax_attribute_groups[] = { + &dax_device_attribute_group, + NULL, +}; + +static void destroy_dax_dev(void *_dev) +{ + struct device *dev = _dev; + struct dax_dev *dax_dev = dev_get_drvdata(dev); + struct dax_region *dax_region = dax_dev->region; + + dev_dbg(dev, "%s\n", __func__); + + get_device(dev); + device_unregister(dev); + ida_simple_remove(&dax_region->ida, dax_dev->id); + ida_simple_remove(&dax_minor_ida, MINOR(dev->devt)); + put_device(dev); + dax_dev_put(dax_dev); +} + +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, + int count) +{ + struct device *parent = dax_region->dev; + struct dax_dev *dax_dev; + struct device *dev; + int rc, minor; + dev_t dev_t; + + dax_dev = kzalloc(sizeof(*dax_dev) + 
sizeof(*res) * count, GFP_KERNEL); + if (!dax_dev) + return -ENOMEM; + memcpy(dax_dev->res, res, sizeof(*res) * count); + dax_dev->num_resources = count; + kref_init(&dax_dev->kref); + dax_dev->region = dax_region; + kref_get(&dax_region->kref); + + dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL); + if (dax_dev->id < 0) { + rc = dax_dev->id; + goto err_id; + } + + minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL); + if (minor < 0) { + rc = minor; + goto err_minor; + } + + dev_t = MKDEV(dax_major, minor); + dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev, + dax_attribute_groups, "dax%d.%d", dax_region->id, + dax_dev->id); + if (IS_ERR(dev)) { + rc = PTR_ERR(dev); + goto err_create; + } + dax_dev->dev = dev; + + rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev); + if (rc) { + destroy_dax_dev(dev); + return rc; + } + + return 0; + + err_create: + ida_simple_remove(&dax_minor_ida, minor); + err_minor: + ida_simple_remove(&dax_region->ida, dax_dev->id); + err_id: + dax_dev_put(dax_dev); + + return rc; +} +EXPORT_SYMBOL_GPL(devm_create_dax_dev); + +static const struct file_operations dax_fops = { + .llseek = noop_llseek, + .owner = THIS_MODULE, +}; + +static int __init dax_init(void) +{ + int rc; + + rc = register_chrdev(0, "dax", &dax_fops); + if (rc < 0) + return rc; + dax_major = rc; + + dax_class = class_create(THIS_MODULE, "dax"); + if (IS_ERR(dax_class)) { + unregister_chrdev(dax_major, "dax"); + return PTR_ERR(dax_class); + } + + return 0; +} + +static void __exit dax_exit(void) +{ + class_destroy(dax_class); + unregister_chrdev(dax_major, "dax"); +} + +MODULE_AUTHOR("Intel Corporation"); +MODULE_LICENSE("GPL v2"); +subsys_initcall(dax_init); +module_exit(dax_exit); diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h new file mode 100644 index 000000000000..d8b8f1f25054 --- /dev/null +++ b/drivers/dax/dax.h @@ -0,0 +1,24 @@ +/* + * Copyright(c) 2016 Intel Corporation. All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#ifndef __DAX_H__ +#define __DAX_H__ +struct device; +struct resource; +struct dax_region; +void dax_region_put(struct dax_region *dax_region); +struct dax_region *alloc_dax_region(struct device *parent, + int region_id, struct resource *res, unsigned int align, + void *addr, unsigned long flags); +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, + int count); +#endif /* __DAX_H__ */ diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c new file mode 100644 index 000000000000..4e97555e1cab --- /dev/null +++ b/drivers/dax/pmem.c @@ -0,0 +1,168 @@ +/* + * Copyright(c) 2016 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ */ +#include <linux/percpu-refcount.h> +#include <linux/memremap.h> +#include <linux/module.h> +#include <linux/pfn_t.h> +#include "../nvdimm/pfn.h" +#include "../nvdimm/nd.h" +#include "dax.h" + +struct dax_pmem { + struct device *dev; + struct percpu_ref ref; + struct completion cmp; +}; + +struct dax_pmem *to_dax_pmem(struct percpu_ref *ref) +{ + return container_of(ref, struct dax_pmem, ref); +} + +static void dax_pmem_percpu_release(struct percpu_ref *ref) +{ + struct dax_pmem *dax_pmem = to_dax_pmem(ref); + + dev_dbg(dax_pmem->dev, "%s\n", __func__); + complete(&dax_pmem->cmp); +} + +static void dax_pmem_percpu_exit(void *data) +{ + struct percpu_ref *ref = data; + struct dax_pmem *dax_pmem = to_dax_pmem(ref); + + dev_dbg(dax_pmem->dev, "%s\n", __func__); + percpu_ref_exit(ref); + wait_for_completion(&dax_pmem->cmp); +} + +static void dax_pmem_percpu_kill(void *data) +{ + struct percpu_ref *ref = data; + struct dax_pmem *dax_pmem = to_dax_pmem(ref); + + dev_dbg(dax_pmem->dev, "%s\n", __func__); + percpu_ref_kill(ref); +} + +static int dax_pmem_probe(struct device *dev) +{ + int rc; + void *addr; + struct resource res; + struct nd_pfn_sb *pfn_sb; + struct dax_pmem *dax_pmem; + struct nd_region *nd_region; + struct nd_namespace_io *nsio; + struct dax_region *dax_region; + struct nd_namespace_common *ndns; + struct nd_dax *nd_dax = to_nd_dax(dev); + struct nd_pfn *nd_pfn = &nd_dax->nd_pfn; + struct vmem_altmap __altmap, *altmap = NULL; + + ndns = nvdimm_namespace_common_probe(dev); + if (IS_ERR(ndns)) + return PTR_ERR(ndns); + nsio = to_nd_namespace_io(&ndns->dev); + + /* parse the 'pfn' info block via ->rw_bytes */ + devm_nsio_enable(dev, nsio); + altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap); + if (IS_ERR(altmap)) + return PTR_ERR(altmap); + devm_nsio_disable(dev, nsio); + + pfn_sb = nd_pfn->pfn_sb; + + if (!devm_request_mem_region(dev, nsio->res.start, + resource_size(&nsio->res), dev_name(dev))) { + dev_warn(dev, "could not reserve region %pR\n", 
&nsio->res); + return -EBUSY; + } + + dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL); + if (!dax_pmem) + return -ENOMEM; + + dax_pmem->dev = dev; + init_completion(&dax_pmem->cmp); + rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0, + GFP_KERNEL); + if (rc) + return rc; + + rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); + if (rc) { + dax_pmem_percpu_exit(&dax_pmem->ref); + return rc; + } + + addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap); + if (IS_ERR(addr)) + return PTR_ERR(addr); + + rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref); + if (rc) { + dax_pmem_percpu_kill(&dax_pmem->ref); + return rc; + } + + nd_region = to_nd_region(dev->parent); + dax_region = alloc_dax_region(dev, nd_region->id, &res, + le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP); + if (!dax_region) + return -ENOMEM; + + /* TODO: support for subdividing a dax region... */ + rc = devm_create_dax_dev(dax_region, &res, 1); + + /* child dax_dev instances now own the lifetime of the dax_region */ + dax_region_put(dax_region); + + return rc; +} + +static int dax_pmem_remove(struct device *dev) +{ + /* + * The init path is fully devm-enabled, or child devices + * otherwise hold references on parent resources. 
+ */ + return 0; +} + +static struct nd_device_driver dax_pmem_driver = { + .probe = dax_pmem_probe, + .remove = dax_pmem_remove, + .drv = { + .name = "dax_pmem", + }, + .type = ND_DRIVER_DAX_PMEM, +}; + +static int __init dax_pmem_init(void) +{ + return nd_driver_register(&dax_pmem_driver); +} +module_init(dax_pmem_init); + +static void __exit dax_pmem_exit(void) +{ + driver_unregister(&dax_pmem_driver.drv); +} +module_exit(dax_pmem_exit); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Intel Corporation"); +MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM); diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild index 5ff6d3c126a9..785985677159 100644 --- a/tools/testing/nvdimm/Kbuild +++ b/tools/testing/nvdimm/Kbuild @@ -16,6 +16,7 @@ ldflags-y += --wrap=phys_to_pfn_t DRIVERS := ../../../drivers NVDIMM_SRC := $(DRIVERS)/nvdimm ACPI_SRC := $(DRIVERS)/acpi +DAX_SRC := $(DRIVERS)/dax obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o @@ -23,6 +24,8 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o obj-$(CONFIG_ND_BLK) += nd_blk.o obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o obj-$(CONFIG_ACPI_NFIT) += nfit.o +obj-$(CONFIG_DEV_DAX) += dax.o +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o nfit-y := $(ACPI_SRC)/nfit.o nfit-y += config_check.o @@ -39,6 +42,12 @@ nd_blk-y += config_check.o nd_e820-y := $(NVDIMM_SRC)/e820.o nd_e820-y += config_check.o +dax-y := $(DAX_SRC)/dax.o +dax-y += config_check.o + +dax_pmem-y := $(DAX_SRC)/pmem.o +dax_pmem-y += config_check.o + libnvdimm-y := $(NVDIMM_SRC)/core.o libnvdimm-y += $(NVDIMM_SRC)/bus.o libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c index f2c7615554eb..adf18bfeca00 100644 --- a/tools/testing/nvdimm/config_check.c +++ b/tools/testing/nvdimm/config_check.c @@ -12,4 +12,6 @@ void check(void) BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT)); BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK)); BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT)); + 
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX)); + BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM)); } ^ permalink raw reply related [flat|nested] 37+ messages in thread
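[Archive note] The lifetime rule in the patch above (the comment "child dax_dev instances now own the lifetime of the dax_region" in dax_pmem_probe()) may be easier to follow with a small userspace sketch. This is not the kernel code: `struct ref`, `struct region`, `struct child`, `region_put()`, `child_put()` and `demo_region_lifetime()` are hypothetical stand-ins for `struct kref`, `struct dax_region`, `struct dax_dev`, `dax_region_put()`, `dax_dev_put()` and the probe path, and the plain integer counter is an assumption that ignores the atomicity a real kref provides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, non-atomic stand-in for the kernel's struct kref. */
struct ref {
	int count;
};

/* Stand-in for struct dax_region: the parent object. */
struct region {
	struct ref ref;
	int id;
	int *freed;	/* set to 1 when the region is released */
};

/* Stand-in for struct dax_dev: a child that pins its parent region. */
struct child {
	struct ref ref;
	struct region *region;
	int id;
};

/* Drop one region reference; free on the last put (cf. dax_region_put()). */
static void region_put(struct region *r)
{
	if (--r->ref.count == 0) {
		*r->freed = 1;
		free(r);
	}
}

/* Drop one child reference; the last put also releases the parent
 * reference the child took at creation (cf. dax_dev_free()). */
static void child_put(struct child *c)
{
	if (--c->ref.count == 0) {
		region_put(c->region);
		free(c);
	}
}

/* Allocate a region holding one reference for its creator. */
static struct region *region_alloc(int id, int *freed)
{
	struct region *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->ref.count = 1;
	r->id = id;
	r->freed = freed;
	return r;
}

/* Create a child and take an extra reference on the parent,
 * as devm_create_dax_dev() does with kref_get(). */
static struct child *child_create(struct region *r, int id)
{
	struct child *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->ref.count = 1;
	c->region = r;
	c->id = id;
	r->ref.count++;
	return c;
}

/* Returns 1 if the region survived the creator's put and was freed
 * only when the child went away; -1 on allocation failure. */
int demo_region_lifetime(void)
{
	int freed = 0;
	int alive_after_creator_put;
	struct region *r = region_alloc(0, &freed);
	struct child *c;

	if (!r)
		return -1;
	c = child_create(r, 0);
	if (!c) {
		region_put(r);
		return -1;
	}

	region_put(r);			/* creator drops its reference */
	alive_after_creator_put = !freed;
	child_put(c);			/* last child releases the region */

	return alive_after_creator_put && freed;
}
```

Calling demo_region_lifetime() from a test program exercises the same ordering as the probe path: the creator's put does not free the region while a child still holds a reference.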
* Re: [PATCH v2 1/3] /dev/dax, pmem: direct access to persistent memory 2016-05-15 6:26 ` Dan Williams (?) @ 2016-05-17 8:52 ` Johannes Thumshirn -1 siblings, 0 replies; 37+ messages in thread From: Johannes Thumshirn @ 2016-05-17 8:52 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Dave Hansen, linux-kernel, hch, linux-block, Andrew Morton On Sat, May 14, 2016 at 11:26:24PM -0700, Dan Williams wrote: > Device DAX is the device-centric analogue of Filesystem DAX > (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped > without need of an intervening file system. Device DAX is strict, > precise and predictable. Specifically this interface: > > 1/ Guarantees fault granularity with respect to a given page size (pte, > pmd, or pud) set at configuration time. > > 2/ Enforces deterministic behavior by being strict about what fault > scenarios are supported. > > For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE > support device-dax guarantees that a mapping always behaves/performs the > same once established. It is the "what you see is what you get" access > mechanism to differentiated memory vs filesystem DAX which has > filesystem specific implementation semantics. > > Persistent memory is the first target, but the mechanism is also > targeted for exclusive allocations of performance differentiated memory > ranges. > > This commit is limited to the base device driver infrastructure to > associate a dax device with pmem range. 
> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/Kconfig | 2 > drivers/Makefile | 1 > drivers/dax/Kconfig | 25 +++ > drivers/dax/Makefile | 4 + > drivers/dax/dax.c | 252 +++++++++++++++++++++++++++++++++++ > drivers/dax/dax.h | 24 +++ > drivers/dax/pmem.c | 168 +++++++++++++++++++++++ Is a DAX device always an NVDIMM device, or can it be something else (like the S390 dcssblk)? If it's NVDIMM only, I'd suggest putting it under the drivers/nvdimm directory. > tools/testing/nvdimm/Kbuild | 9 + > tools/testing/nvdimm/config_check.c | 2 > 9 files changed, 487 insertions(+) > create mode 100644 drivers/dax/Kconfig > create mode 100644 drivers/dax/Makefile > create mode 100644 drivers/dax/dax.c > create mode 100644 drivers/dax/dax.h > create mode 100644 drivers/dax/pmem.c > > diff --git a/drivers/Kconfig b/drivers/Kconfig > index d2ac339de85f..8298eab84a6f 100644 > --- a/drivers/Kconfig > +++ b/drivers/Kconfig > @@ -190,6 +190,8 @@ source "drivers/android/Kconfig" > > source "drivers/nvdimm/Kconfig" > > +source "drivers/dax/Kconfig" > + > source "drivers/nvmem/Kconfig" > > source "drivers/hwtracing/stm/Kconfig" > diff --git a/drivers/Makefile b/drivers/Makefile > index 8f5d076baeb0..0b6f3d60193d 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/ > obj-$(CONFIG_NVM) += lightnvm/ > obj-y += base/ block/ misc/ mfd/ nfc/ > obj-$(CONFIG_LIBNVDIMM) += nvdimm/ > +obj-$(CONFIG_DEV_DAX) += dax/ > obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ > obj-$(CONFIG_NUBUS) += nubus/ > obj-y += macintosh/ > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig > new file mode 100644 > index 000000000000..86ffbaa891ad > --- /dev/null > +++ b/drivers/dax/Kconfig > @@ -0,0 +1,25 @@ > +menuconfig
DEV_DAX > + tristate "DAX: direct access to differentiated memory" > + default m if NVDIMM_DAX > + help > + Support raw access to differentiated (persistence, bandwidth, > + latency...) memory via an mmap(2) capable character > + device. Platform firmware or a device driver may identify a > + platform memory resource that is differentiated from the > + baseline memory pool. Mappings of a /dev/daxX.Y device impose > + restrictions that make the mapping behavior deterministic. > + > +if DEV_DAX > + > +config DEV_DAX_PMEM > + tristate "PMEM DAX: direct access to persistent memory" > + depends on NVDIMM_DAX > + default DEV_DAX > + help > + Support raw access to persistent memory. Note that this > + driver consumes memory ranges allocated and exported by the > + libnvdimm sub-system. > + > + Say Y if unsure > + > +endif > diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile > new file mode 100644 > index 000000000000..27c54e38478a > --- /dev/null > +++ b/drivers/dax/Makefile > @@ -0,0 +1,4 @@ > +obj-$(CONFIG_DEV_DAX) += dax.o > +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o > + > +dax_pmem-y := pmem.o > diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c > new file mode 100644 > index 000000000000..8207fb33a992 > --- /dev/null > +++ b/drivers/dax/dax.c > @@ -0,0 +1,252 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. 
> + */ > +#include <linux/pagemap.h> > +#include <linux/module.h> > +#include <linux/device.h> > +#include <linux/pfn_t.h> > +#include <linux/slab.h> > +#include <linux/dax.h> > +#include <linux/fs.h> > +#include <linux/mm.h> > + > +static int dax_major; > +static struct class *dax_class; > +static DEFINE_IDA(dax_minor_ida); > + > +/** > + * struct dax_region - mapping infrastructure for dax devices > + * @id: kernel-wide unique region for a memory range > + * @base: linear address corresponding to @res > + * @kref: to pin while other agents have a need to do lookups > + * @dev: parent device backing this region > + * @align: allocation and mapping alignment for child dax devices > + * @res: physical address range of the region > + * @pfn_flags: identify whether the pfns are paged back or not > + */ > +struct dax_region { > + int id; > + struct ida ida; > + void *base; > + struct kref kref; > + struct device *dev; > + unsigned int align; > + struct resource res; > + unsigned long pfn_flags; > +}; > + > +/** > + * struct dax_dev - subdivision of a dax region > + * @region - parent region > + * @dev - device backing the character device > + * @kref - enable this data to be tracked in filp->private_data > + * @id - child id in the region > + * @num_resources - number of physical address extents in this device > + * @res - array of physical address ranges > + */ > +struct dax_dev { > + struct dax_region *region; > + struct device *dev; > + struct kref kref; > + int id; > + int num_resources; > + struct resource res[0]; > +}; > + > +static void dax_region_free(struct kref *kref) > +{ > + struct dax_region *dax_region; > + > + dax_region = container_of(kref, struct dax_region, kref); > + kfree(dax_region); > +} > + > +void dax_region_put(struct dax_region *dax_region) > +{ > + kref_put(&dax_region->kref, dax_region_free); > +} > +EXPORT_SYMBOL_GPL(dax_region_put); dax_region_get() ?? 
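To make the question above concrete, here is a minimal userspace sketch of a symmetric get/put pair. It is an illustration under stated assumptions, not code from the patch: `struct dax_region_sketch`, `region_alloc()`, `region_get()`, `region_put()` and the `regions_freed` counter are hypothetical names, and a plain `int` stands in for the kernel's `struct kref`.

```c
/* Userspace sketch only; a plain int stands in for struct kref. */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dax_region (illustration only). */
struct dax_region_sketch {
	int refcount;	/* plays the role of the kref */
	int id;
};

/* Illustration only: counts how many regions have been released. */
static int regions_freed;

/* Allocate a region holding one initial reference, like kref_init(). */
static struct dax_region_sketch *region_alloc(int id)
{
	struct dax_region_sketch *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->refcount = 1;
	r->id = id;
	return r;
}

/* Hypothetical counterpart to the put helper: take one more reference. */
static void region_get(struct dax_region_sketch *r)
{
	assert(r->refcount > 0);
	r->refcount++;
}

/* Drop one reference and free the region when the last one goes away. */
static void region_put(struct dax_region_sketch *r)
{
	assert(r->refcount > 0);
	if (--r->refcount == 0) {
		regions_freed++;
		free(r);
	}
}
```

In the driver itself, a dax_region_get() would presumably just wrap kref_get(&dax_region->kref), so that callers take and drop references through one matching pair of helpers instead of touching the kref directly.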
> + > +static void dax_dev_free(struct kref *kref) > +{ > + struct dax_dev *dax_dev; > + > + dax_dev = container_of(kref, struct dax_dev, kref); > + dax_region_put(dax_dev->region); > + kfree(dax_dev); > +} > + > +static void dax_dev_put(struct dax_dev *dax_dev) > +{ > + kref_put(&dax_dev->kref, dax_dev_free); > +} > + > +struct dax_region *alloc_dax_region(struct device *parent, int region_id, > + struct resource *res, unsigned int align, void *addr, > + unsigned long pfn_flags) > +{ > + struct dax_region *dax_region; > + > + dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); > + > + if (!dax_region) > + return NULL; > + > + memcpy(&dax_region->res, res, sizeof(*res)); > + dax_region->pfn_flags = pfn_flags; > + kref_init(&dax_region->kref); > + dax_region->id = region_id; > + ida_init(&dax_region->ida); > + dax_region->align = align; > + dax_region->dev = parent; > + dax_region->base = addr; > + > + return dax_region; > +} > +EXPORT_SYMBOL_GPL(alloc_dax_region); > + > +static ssize_t size_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct dax_dev *dax_dev = dev_get_drvdata(dev); > + unsigned long long size = 0; > + int i; > + > + for (i = 0; i < dax_dev->num_resources; i++) > + size += resource_size(&dax_dev->res[i]); > + > + return sprintf(buf, "%llu\n", size); > +} > +static DEVICE_ATTR_RO(size); > + > +static struct attribute *dax_device_attributes[] = { > + &dev_attr_size.attr, > + NULL, > +}; > + > +static const struct attribute_group dax_device_attribute_group = { > + .attrs = dax_device_attributes, > +}; > + > +static const struct attribute_group *dax_attribute_groups[] = { > + &dax_device_attribute_group, > + NULL, > +}; > + > +static void destroy_dax_dev(void *_dev) > +{ > + struct device *dev = _dev; > + struct dax_dev *dax_dev = dev_get_drvdata(dev); > + struct dax_region *dax_region = dax_dev->region; > + > + dev_dbg(dev, "%s\n", __func__); This dev_dbg() could be replaced by function graph tracing. 
> + > + get_device(dev); > + device_unregister(dev); > + ida_simple_remove(&dax_region->ida, dax_dev->id); > + ida_simple_remove(&dax_minor_ida, MINOR(dev->devt)); > + put_device(dev); > + dax_dev_put(dax_dev); > +} > + > +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, > + int count) > +{ > + struct device *parent = dax_region->dev; > + struct dax_dev *dax_dev; > + struct device *dev; > + int rc, minor; > + dev_t dev_t; > + > + dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL); > + if (!dax_dev) > + return -ENOMEM; > + memcpy(dax_dev->res, res, sizeof(*res) * count); > + dax_dev->num_resources = count; > + kref_init(&dax_dev->kref); > + dax_dev->region = dax_region; > + kref_get(&dax_region->kref); dax_region_get() ? > + > + dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL); > + if (dax_dev->id < 0) { > + rc = dax_dev->id; > + goto err_id; > + } > + > + minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL); > + if (minor < 0) { > + rc = minor; > + goto err_minor; > + } > + > + dev_t = MKDEV(dax_major, minor); > + dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev, > + dax_attribute_groups, "dax%d.%d", dax_region->id, > + dax_dev->id); > + if (IS_ERR(dev)) { > + rc = PTR_ERR(dev); > + goto err_create; > + } > + dax_dev->dev = dev; > + > + rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev); > + if (rc) { > + destroy_dax_dev(dev); > + return rc; > + } > + > + return 0; > + > + err_create: > + ida_simple_remove(&dax_minor_ida, minor); > + err_minor: > + ida_simple_remove(&dax_region->ida, dax_dev->id); > + err_id: > + dax_dev_put(dax_dev); > + > + return rc; > +} > +EXPORT_SYMBOL_GPL(devm_create_dax_dev); > + > +static const struct file_operations dax_fops = { > + .llseek = noop_llseek, > + .owner = THIS_MODULE, > +}; > + > +static int __init dax_init(void) > +{ > + int rc; > + > + rc = register_chrdev(0, "dax", &dax_fops); > + if (rc < 0) > + return rc; > + dax_major = rc; > + 
> + dax_class = class_create(THIS_MODULE, "dax"); > + if (IS_ERR(dax_class)) { > + unregister_chrdev(dax_major, "dax"); > + return PTR_ERR(dax_class); > + } > + > + return 0; > +} > + > +static void __exit dax_exit(void) > +{ > + class_destroy(dax_class); > + unregister_chrdev(dax_major, "dax"); AFAICT you're missing a call to ida_destroy(&dax_minor_ida); here. > +} > + > +MODULE_AUTHOR("Intel Corporation"); > +MODULE_LICENSE("GPL v2"); > +subsys_initcall(dax_init); > +module_exit(dax_exit); > diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h > new file mode 100644 > index 000000000000..d8b8f1f25054 > --- /dev/null > +++ b/drivers/dax/dax.h > @@ -0,0 +1,24 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + */ > +#ifndef __DAX_H__ > +#define __DAX_H__ > +struct device; > +struct resource; > +struct dax_region; > +void dax_region_put(struct dax_region *dax_region); > +struct dax_region *alloc_dax_region(struct device *parent, > + int region_id, struct resource *res, unsigned int align, > + void *addr, unsigned long flags); > +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, > + int count); > +#endif /* __DAX_H__ */ > diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c > new file mode 100644 > index 000000000000..4e97555e1cab > --- /dev/null > +++ b/drivers/dax/pmem.c > @@ -0,0 +1,168 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. 
> + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + */ > +#include <linux/percpu-refcount.h> > +#include <linux/memremap.h> > +#include <linux/module.h> > +#include <linux/pfn_t.h> > +#include "../nvdimm/pfn.h" > +#include "../nvdimm/nd.h" > +#include "dax.h" > + > +struct dax_pmem { > + struct device *dev; > + struct percpu_ref ref; > + struct completion cmp; > +}; > + > +struct dax_pmem *to_dax_pmem(struct percpu_ref *ref) > +{ > + return container_of(ref, struct dax_pmem, ref); > +} > + > +static void dax_pmem_percpu_release(struct percpu_ref *ref) > +{ > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); This dev_dbg() could be replaced by function graph tracing. > + complete(&dax_pmem->cmp); > +} > + > +static void dax_pmem_percpu_exit(void *data) > +{ > + struct percpu_ref *ref = data; > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); Same as above. > + percpu_ref_exit(ref); > + wait_for_completion(&dax_pmem->cmp); > +} > + > +static void dax_pmem_percpu_kill(void *data) > +{ > + struct percpu_ref *ref = data; > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); Same as above. 
> + percpu_ref_kill(ref); > +} > + > +static int dax_pmem_probe(struct device *dev) > +{ > + int rc; > + void *addr; > + struct resource res; > + struct nd_pfn_sb *pfn_sb; > + struct dax_pmem *dax_pmem; > + struct nd_region *nd_region; > + struct nd_namespace_io *nsio; > + struct dax_region *dax_region; > + struct nd_namespace_common *ndns; > + struct nd_dax *nd_dax = to_nd_dax(dev); > + struct nd_pfn *nd_pfn = &nd_dax->nd_pfn; > + struct vmem_altmap __altmap, *altmap = NULL; > + > + ndns = nvdimm_namespace_common_probe(dev); > + if (IS_ERR(ndns)) > + return PTR_ERR(ndns); > + nsio = to_nd_namespace_io(&ndns->dev); > + > + /* parse the 'pfn' info block via ->rw_bytes */ > + devm_nsio_enable(dev, nsio); > + altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap); > + if (IS_ERR(altmap)) > + return PTR_ERR(altmap); > + devm_nsio_disable(dev, nsio); > + > + pfn_sb = nd_pfn->pfn_sb; > + > + if (!devm_request_mem_region(dev, nsio->res.start, > + resource_size(&nsio->res), dev_name(dev))) { > + dev_warn(dev, "could not reserve region %pR\n", &nsio->res); > + return -EBUSY; > + } > + > + dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL); > + if (!dax_pmem) > + return -ENOMEM; > + > + dax_pmem->dev = dev; > + init_completion(&dax_pmem->cmp); > + rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0, > + GFP_KERNEL); > + if (rc) > + return rc; > + > + rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); > + if (rc) { > + dax_pmem_percpu_exit(&dax_pmem->ref); > + return rc; > + } > + > + addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap); > + if (IS_ERR(addr)) > + return PTR_ERR(addr); > + > + rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref); > + if (rc) { > + dax_pmem_percpu_kill(&dax_pmem->ref); > + return rc; > + } > + > + nd_region = to_nd_region(dev->parent); > + dax_region = alloc_dax_region(dev, nd_region->id, &res, > + le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP); > + if (!dax_region) > + return -ENOMEM; > + 
> + /* TODO: support for subdividing a dax region... */ > + rc = devm_create_dax_dev(dax_region, &res, 1); > + > + /* child dax_dev instances now own the lifetime of the dax_region */ > + dax_region_put(dax_region); > + > + return rc; > +} > + > +static int dax_pmem_remove(struct device *dev) > +{ > + /* > + * The init path is fully devm-enabled, or child devices > + * otherwise hold references on parent resources. > + */ So remove is completely pointless here. Why are you depending on it in __nd_driver_register()? __device_release_driver() does not depend on a remove callback to be present. > + return 0; > +} > + > +static struct nd_device_driver dax_pmem_driver = { > + .probe = dax_pmem_probe, > + .remove = dax_pmem_remove, > + .drv = { > + .name = "dax_pmem", > + }, > + .type = ND_DRIVER_DAX_PMEM, > +}; > + > +static int __init dax_pmem_init(void) > +{ > + return nd_driver_register(&dax_pmem_driver); > +} > +module_init(dax_pmem_init); > + > +static void __exit dax_pmem_exit(void) > +{ > + driver_unregister(&dax_pmem_driver.drv); > +} > +module_exit(dax_pmem_exit); > + > +MODULE_LICENSE("GPL v2"); > +MODULE_AUTHOR("Intel Corporation"); > +MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM); > diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild > index 5ff6d3c126a9..785985677159 100644 > --- a/tools/testing/nvdimm/Kbuild > +++ b/tools/testing/nvdimm/Kbuild > @@ -16,6 +16,7 @@ ldflags-y += --wrap=phys_to_pfn_t > DRIVERS := ../../../drivers > NVDIMM_SRC := $(DRIVERS)/nvdimm > ACPI_SRC := $(DRIVERS)/acpi > +DAX_SRC := $(DRIVERS)/dax > > obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o > obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o > @@ -23,6 +24,8 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o > obj-$(CONFIG_ND_BLK) += nd_blk.o > obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o > obj-$(CONFIG_ACPI_NFIT) += nfit.o > +obj-$(CONFIG_DEV_DAX) += dax.o > +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o > > nfit-y := $(ACPI_SRC)/nfit.o > nfit-y += config_check.o > @@ -39,6 +42,12 @@ nd_blk-y += 
config_check.o > nd_e820-y := $(NVDIMM_SRC)/e820.o > nd_e820-y += config_check.o > > +dax-y := $(DAX_SRC)/dax.o > +dax-y += config_check.o > + > +dax_pmem-y := $(DAX_SRC)/pmem.o > +dax_pmem-y += config_check.o > + > libnvdimm-y := $(NVDIMM_SRC)/core.o > libnvdimm-y += $(NVDIMM_SRC)/bus.o > libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o > diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c > index f2c7615554eb..adf18bfeca00 100644 > --- a/tools/testing/nvdimm/config_check.c > +++ b/tools/testing/nvdimm/config_check.c > @@ -12,4 +12,6 @@ void check(void) > BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT)); > BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK)); > BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT)); > + BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX)); > + BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM)); > } > > _______________________________________________ > Linux-nvdimm mailing list > Linux-nvdimm@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 1/3] /dev/dax, pmem: direct access to persistent memory @ 2016-05-17 8:52 ` Johannes Thumshirn 0 siblings, 0 replies; 37+ messages in thread From: Johannes Thumshirn @ 2016-05-17 8:52 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Dave Hansen, linux-kernel, hch, linux-block, Andrew Morton On Sat, May 14, 2016 at 11:26:24PM -0700, Dan Williams wrote: > Device DAX is the device-centric analogue of Filesystem DAX > (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped > without need of an intervening file system. Device DAX is strict, > precise and predictable. Specifically this interface: > > 1/ Guarantees fault granularity with respect to a given page size (pte, > pmd, or pud) set at configuration time. > > 2/ Enforces deterministic behavior by being strict about what fault > scenarios are supported. > > For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE > support device-dax guarantees that a mapping always behaves/performs the > same once established. It is the "what you see is what you get" access > mechanism to differentiated memory vs filesystem DAX which has > filesystem specific implementation semantics. > > Persistent memory is the first target, but the mechanism is also > targeted for exclusive allocations of performance differentiated memory > ranges. > > This commit is limited to the base device driver infrastructure to > associate a dax device with pmem range. 
> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/Kconfig | 2 > drivers/Makefile | 1 > drivers/dax/Kconfig | 25 +++ > drivers/dax/Makefile | 4 + > drivers/dax/dax.c | 252 +++++++++++++++++++++++++++++++++++ > drivers/dax/dax.h | 24 +++ > drivers/dax/pmem.c | 168 +++++++++++++++++++++++ Is a DAX device always a NVDIMM device, or can it be something else (like the S390 dcssblk)? If it's NVDIMM only I'd suggest it to go under the drivers/nvdimm directory. > tools/testing/nvdimm/Kbuild | 9 + > tools/testing/nvdimm/config_check.c | 2 > 9 files changed, 487 insertions(+) > create mode 100644 drivers/dax/Kconfig > create mode 100644 drivers/dax/Makefile > create mode 100644 drivers/dax/dax.c > create mode 100644 drivers/dax/dax.h > create mode 100644 drivers/dax/pmem.c > > diff --git a/drivers/Kconfig b/drivers/Kconfig > index d2ac339de85f..8298eab84a6f 100644 > --- a/drivers/Kconfig > +++ b/drivers/Kconfig > @@ -190,6 +190,8 @@ source "drivers/android/Kconfig" > > source "drivers/nvdimm/Kconfig" > > +source "drivers/dax/Kconfig" > + > source "drivers/nvmem/Kconfig" > > source "drivers/hwtracing/stm/Kconfig" > diff --git a/drivers/Makefile b/drivers/Makefile > index 8f5d076baeb0..0b6f3d60193d 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/ > obj-$(CONFIG_NVM) += lightnvm/ > obj-y += base/ block/ misc/ mfd/ nfc/ > obj-$(CONFIG_LIBNVDIMM) += nvdimm/ > +obj-$(CONFIG_DEV_DAX) += dax/ > obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ > obj-$(CONFIG_NUBUS) += nubus/ > obj-y += macintosh/ > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig > new file mode 100644 > index 000000000000..86ffbaa891ad > --- /dev/null > +++ b/drivers/dax/Kconfig > @@ -0,0 +1,25 @@ > +menuconfig 
DEV_DAX > + tristate "DAX: direct access to differentiated memory" > + default m if NVDIMM_DAX > + help > + Support raw access to differentiated (persistence, bandwidth, > + latency...) memory via an mmap(2) capable character > + device. Platform firmware or a device driver may identify a > + platform memory resource that is differentiated from the > + baseline memory pool. Mappings of a /dev/daxX.Y device impose > + restrictions that make the mapping behavior deterministic. > + > +if DEV_DAX > + > +config DEV_DAX_PMEM > + tristate "PMEM DAX: direct access to persistent memory" > + depends on NVDIMM_DAX > + default DEV_DAX > + help > + Support raw access to persistent memory. Note that this > + driver consumes memory ranges allocated and exported by the > + libnvdimm sub-system. > + > + Say Y if unsure > + > +endif > diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile > new file mode 100644 > index 000000000000..27c54e38478a > --- /dev/null > +++ b/drivers/dax/Makefile > @@ -0,0 +1,4 @@ > +obj-$(CONFIG_DEV_DAX) += dax.o > +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o > + > +dax_pmem-y := pmem.o > diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c > new file mode 100644 > index 000000000000..8207fb33a992 > --- /dev/null > +++ b/drivers/dax/dax.c > @@ -0,0 +1,252 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. 
> + */ > +#include <linux/pagemap.h> > +#include <linux/module.h> > +#include <linux/device.h> > +#include <linux/pfn_t.h> > +#include <linux/slab.h> > +#include <linux/dax.h> > +#include <linux/fs.h> > +#include <linux/mm.h> > + > +static int dax_major; > +static struct class *dax_class; > +static DEFINE_IDA(dax_minor_ida); > + > +/** > + * struct dax_region - mapping infrastructure for dax devices > + * @id: kernel-wide unique region for a memory range > + * @base: linear address corresponding to @res > + * @kref: to pin while other agents have a need to do lookups > + * @dev: parent device backing this region > + * @align: allocation and mapping alignment for child dax devices > + * @res: physical address range of the region > + * @pfn_flags: identify whether the pfns are paged back or not > + */ > +struct dax_region { > + int id; > + struct ida ida; > + void *base; > + struct kref kref; > + struct device *dev; > + unsigned int align; > + struct resource res; > + unsigned long pfn_flags; > +}; > + > +/** > + * struct dax_dev - subdivision of a dax region > + * @region - parent region > + * @dev - device backing the character device > + * @kref - enable this data to be tracked in filp->private_data > + * @id - child id in the region > + * @num_resources - number of physical address extents in this device > + * @res - array of physical address ranges > + */ > +struct dax_dev { > + struct dax_region *region; > + struct device *dev; > + struct kref kref; > + int id; > + int num_resources; > + struct resource res[0]; > +}; > + > +static void dax_region_free(struct kref *kref) > +{ > + struct dax_region *dax_region; > + > + dax_region = container_of(kref, struct dax_region, kref); > + kfree(dax_region); > +} > + > +void dax_region_put(struct dax_region *dax_region) > +{ > + kref_put(&dax_region->kref, dax_region_free); > +} > +EXPORT_SYMBOL_GPL(dax_region_put); dax_region_get() ?? 
> + > +static void dax_dev_free(struct kref *kref) > +{ > + struct dax_dev *dax_dev; > + > + dax_dev = container_of(kref, struct dax_dev, kref); > + dax_region_put(dax_dev->region); > + kfree(dax_dev); > +} > + > +static void dax_dev_put(struct dax_dev *dax_dev) > +{ > + kref_put(&dax_dev->kref, dax_dev_free); > +} > + > +struct dax_region *alloc_dax_region(struct device *parent, int region_id, > + struct resource *res, unsigned int align, void *addr, > + unsigned long pfn_flags) > +{ > + struct dax_region *dax_region; > + > + dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); > + > + if (!dax_region) > + return NULL; > + > + memcpy(&dax_region->res, res, sizeof(*res)); > + dax_region->pfn_flags = pfn_flags; > + kref_init(&dax_region->kref); > + dax_region->id = region_id; > + ida_init(&dax_region->ida); > + dax_region->align = align; > + dax_region->dev = parent; > + dax_region->base = addr; > + > + return dax_region; > +} > +EXPORT_SYMBOL_GPL(alloc_dax_region); > + > +static ssize_t size_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct dax_dev *dax_dev = dev_get_drvdata(dev); > + unsigned long long size = 0; > + int i; > + > + for (i = 0; i < dax_dev->num_resources; i++) > + size += resource_size(&dax_dev->res[i]); > + > + return sprintf(buf, "%llu\n", size); > +} > +static DEVICE_ATTR_RO(size); > + > +static struct attribute *dax_device_attributes[] = { > + &dev_attr_size.attr, > + NULL, > +}; > + > +static const struct attribute_group dax_device_attribute_group = { > + .attrs = dax_device_attributes, > +}; > + > +static const struct attribute_group *dax_attribute_groups[] = { > + &dax_device_attribute_group, > + NULL, > +}; > + > +static void destroy_dax_dev(void *_dev) > +{ > + struct device *dev = _dev; > + struct dax_dev *dax_dev = dev_get_drvdata(dev); > + struct dax_region *dax_region = dax_dev->region; > + > + dev_dbg(dev, "%s\n", __func__); This dev_dbg() could be replaced by function graph tracing. 
> + > + get_device(dev); > + device_unregister(dev); > + ida_simple_remove(&dax_region->ida, dax_dev->id); > + ida_simple_remove(&dax_minor_ida, MINOR(dev->devt)); > + put_device(dev); > + dax_dev_put(dax_dev); > +} > + > +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, > + int count) > +{ > + struct device *parent = dax_region->dev; > + struct dax_dev *dax_dev; > + struct device *dev; > + int rc, minor; > + dev_t dev_t; > + > + dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL); > + if (!dax_dev) > + return -ENOMEM; > + memcpy(dax_dev->res, res, sizeof(*res) * count); > + dax_dev->num_resources = count; > + kref_init(&dax_dev->kref); > + dax_dev->region = dax_region; > + kref_get(&dax_region->kref); dax_region_get() ? > + > + dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL); > + if (dax_dev->id < 0) { > + rc = dax_dev->id; > + goto err_id; > + } > + > + minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL); > + if (minor < 0) { > + rc = minor; > + goto err_minor; > + } > + > + dev_t = MKDEV(dax_major, minor); > + dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev, > + dax_attribute_groups, "dax%d.%d", dax_region->id, > + dax_dev->id); > + if (IS_ERR(dev)) { > + rc = PTR_ERR(dev); > + goto err_create; > + } > + dax_dev->dev = dev; > + > + rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev); > + if (rc) { > + destroy_dax_dev(dev); > + return rc; > + } > + > + return 0; > + > + err_create: > + ida_simple_remove(&dax_minor_ida, minor); > + err_minor: > + ida_simple_remove(&dax_region->ida, dax_dev->id); > + err_id: > + dax_dev_put(dax_dev); > + > + return rc; > +} > +EXPORT_SYMBOL_GPL(devm_create_dax_dev); > + > +static const struct file_operations dax_fops = { > + .llseek = noop_llseek, > + .owner = THIS_MODULE, > +}; > + > +static int __init dax_init(void) > +{ > + int rc; > + > + rc = register_chrdev(0, "dax", &dax_fops); > + if (rc < 0) > + return rc; > + dax_major = rc; > + 
> + dax_class = class_create(THIS_MODULE, "dax"); > + if (IS_ERR(dax_class)) { > + unregister_chrdev(dax_major, "dax"); > + return PTR_ERR(dax_class); > + } > + > + return 0; > +} > + > +static void __exit dax_exit(void) > +{ > + class_destroy(dax_class); > + unregister_chrdev(dax_major, "dax"); AFAICT you're missing a call to ida_destroy(&dax_minor_ida); here. > +} > + > +MODULE_AUTHOR("Intel Corporation"); > +MODULE_LICENSE("GPL v2"); > +subsys_initcall(dax_init); > +module_exit(dax_exit); > diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h > new file mode 100644 > index 000000000000..d8b8f1f25054 > --- /dev/null > +++ b/drivers/dax/dax.h > @@ -0,0 +1,24 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + */ > +#ifndef __DAX_H__ > +#define __DAX_H__ > +struct device; > +struct resource; > +struct dax_region; > +void dax_region_put(struct dax_region *dax_region); > +struct dax_region *alloc_dax_region(struct device *parent, > + int region_id, struct resource *res, unsigned int align, > + void *addr, unsigned long flags); > +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, > + int count); > +#endif /* __DAX_H__ */ > diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c > new file mode 100644 > index 000000000000..4e97555e1cab > --- /dev/null > +++ b/drivers/dax/pmem.c > @@ -0,0 +1,168 @@ > +/* > + * Copyright(c) 2016 Intel Corporation. All rights reserved. 
> + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + */ > +#include <linux/percpu-refcount.h> > +#include <linux/memremap.h> > +#include <linux/module.h> > +#include <linux/pfn_t.h> > +#include "../nvdimm/pfn.h" > +#include "../nvdimm/nd.h" > +#include "dax.h" > + > +struct dax_pmem { > + struct device *dev; > + struct percpu_ref ref; > + struct completion cmp; > +}; > + > +struct dax_pmem *to_dax_pmem(struct percpu_ref *ref) > +{ > + return container_of(ref, struct dax_pmem, ref); > +} > + > +static void dax_pmem_percpu_release(struct percpu_ref *ref) > +{ > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); This dev_dbg() could be replaced by function graph tracing. > + complete(&dax_pmem->cmp); > +} > + > +static void dax_pmem_percpu_exit(void *data) > +{ > + struct percpu_ref *ref = data; > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); Same as above. > + percpu_ref_exit(ref); > + wait_for_completion(&dax_pmem->cmp); > +} > + > +static void dax_pmem_percpu_kill(void *data) > +{ > + struct percpu_ref *ref = data; > + struct dax_pmem *dax_pmem = to_dax_pmem(ref); > + > + dev_dbg(dax_pmem->dev, "%s\n", __func__); Same as above. 
> +	percpu_ref_kill(ref);
> +}
> +
> +static int dax_pmem_probe(struct device *dev)
> +{
> +	int rc;
> +	void *addr;
> +	struct resource res;
> +	struct nd_pfn_sb *pfn_sb;
> +	struct dax_pmem *dax_pmem;
> +	struct nd_region *nd_region;
> +	struct nd_namespace_io *nsio;
> +	struct dax_region *dax_region;
> +	struct nd_namespace_common *ndns;
> +	struct nd_dax *nd_dax = to_nd_dax(dev);
> +	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
> +	struct vmem_altmap __altmap, *altmap = NULL;
> +
> +	ndns = nvdimm_namespace_common_probe(dev);
> +	if (IS_ERR(ndns))
> +		return PTR_ERR(ndns);
> +	nsio = to_nd_namespace_io(&ndns->dev);
> +
> +	/* parse the 'pfn' info block via ->rw_bytes */
> +	devm_nsio_enable(dev, nsio);
> +	altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap);
> +	if (IS_ERR(altmap))
> +		return PTR_ERR(altmap);
> +	devm_nsio_disable(dev, nsio);
> +
> +	pfn_sb = nd_pfn->pfn_sb;
> +
> +	if (!devm_request_mem_region(dev, nsio->res.start,
> +				resource_size(&nsio->res), dev_name(dev))) {
> +		dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
> +		return -EBUSY;
> +	}
> +
> +	dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL);
> +	if (!dax_pmem)
> +		return -ENOMEM;
> +
> +	dax_pmem->dev = dev;
> +	init_completion(&dax_pmem->cmp);
> +	rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0,
> +			GFP_KERNEL);
> +	if (rc)
> +		return rc;
> +
> +	rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
> +	if (rc) {
> +		dax_pmem_percpu_exit(&dax_pmem->ref);
> +		return rc;
> +	}
> +
> +	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
> +	if (IS_ERR(addr))
> +		return PTR_ERR(addr);
> +
> +	rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref);
> +	if (rc) {
> +		dax_pmem_percpu_kill(&dax_pmem->ref);
> +		return rc;
> +	}
> +
> +	nd_region = to_nd_region(dev->parent);
> +	dax_region = alloc_dax_region(dev, nd_region->id, &res,
> +			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
> +	if (!dax_region)
> +		return -ENOMEM;
> +
> +	/* TODO: support for subdividing a dax region... */
> +	rc = devm_create_dax_dev(dax_region, &res, 1);
> +
> +	/* child dax_dev instances now own the lifetime of the dax_region */
> +	dax_region_put(dax_region);
> +
> +	return rc;
> +}
> +
> +static int dax_pmem_remove(struct device *dev)
> +{
> +	/*
> +	 * The init path is fully devm-enabled, or child devices
> +	 * otherwise hold references on parent resources.
> +	 */

So remove is completely pointless here. Why are you depending on it in
__nd_driver_register()? __device_release_driver() does not depend on a remove
callback to be present.

> +	return 0;
> +}
> +
> +static struct nd_device_driver dax_pmem_driver = {
> +	.probe = dax_pmem_probe,
> +	.remove = dax_pmem_remove,
> +	.drv = {
> +		.name = "dax_pmem",
> +	},
> +	.type = ND_DRIVER_DAX_PMEM,
> +};
> +
> +static int __init dax_pmem_init(void)
> +{
> +	return nd_driver_register(&dax_pmem_driver);
> +}
> +module_init(dax_pmem_init);
> +
> +static void __exit dax_pmem_exit(void)
> +{
> +	driver_unregister(&dax_pmem_driver.drv);
> +}
> +module_exit(dax_pmem_exit);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM);
> diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
> index 5ff6d3c126a9..785985677159 100644
> --- a/tools/testing/nvdimm/Kbuild
> +++ b/tools/testing/nvdimm/Kbuild
> @@ -16,6 +16,7 @@ ldflags-y += --wrap=phys_to_pfn_t
>  DRIVERS := ../../../drivers
>  NVDIMM_SRC := $(DRIVERS)/nvdimm
>  ACPI_SRC := $(DRIVERS)/acpi
> +DAX_SRC := $(DRIVERS)/dax
>
>  obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
>  obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
> @@ -23,6 +24,8 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
>  obj-$(CONFIG_ND_BLK) += nd_blk.o
>  obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
>  obj-$(CONFIG_ACPI_NFIT) += nfit.o
> +obj-$(CONFIG_DEV_DAX) += dax.o
> +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
>
>  nfit-y := $(ACPI_SRC)/nfit.o
>  nfit-y += config_check.o
> @@ -39,6 +42,12 @@ nd_blk-y += config_check.o
>  nd_e820-y := $(NVDIMM_SRC)/e820.o
>  nd_e820-y += config_check.o
>
> +dax-y := $(DAX_SRC)/dax.o
> +dax-y += config_check.o
> +
> +dax_pmem-y := $(DAX_SRC)/pmem.o
> +dax_pmem-y += config_check.o
> +
>  libnvdimm-y := $(NVDIMM_SRC)/core.o
>  libnvdimm-y += $(NVDIMM_SRC)/bus.o
>  libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o
> diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
> index f2c7615554eb..adf18bfeca00 100644
> --- a/tools/testing/nvdimm/config_check.c
> +++ b/tools/testing/nvdimm/config_check.c
> @@ -12,4 +12,6 @@ void check(void)
>  	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BTT));
>  	BUILD_BUG_ON(!IS_MODULE(CONFIG_ND_BLK));
>  	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
> +	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
> +	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
>  }
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

* Re: [PATCH v2 1/3] /dev/dax, pmem: direct access to persistent memory
2016-05-17 8:52 ` Johannes Thumshirn
(?)
@ 2016-05-17 16:40 ` Dan Williams
-1 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-17 16:40 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton

On Tue, May 17, 2016 at 1:52 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Sat, May 14, 2016 at 11:26:24PM -0700, Dan Williams wrote:
>> Device DAX is the device-centric analogue of Filesystem DAX
>> (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
>> without need of an intervening file system. Device DAX is strict,
>> precise and predictable. Specifically this interface:
>>
>> 1/ Guarantees fault granularity with respect to a given page size (pte,
>> pmd, or pud) set at configuration time.
>>
>> 2/ Enforces deterministic behavior by being strict about what fault
>> scenarios are supported.
>>
>> For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE
>> support device-dax guarantees that a mapping always behaves/performs the
>> same once established. It is the "what you see is what you get" access
>> mechanism to differentiated memory vs filesystem DAX which has
>> filesystem specific implementation semantics.
>>
>> Persistent memory is the first target, but the mechanism is also
>> targeted for exclusive allocations of performance differentiated memory
>> ranges.
>>
>> This commit is limited to the base device driver infrastructure to
>> associate a dax device with pmem range.
>>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/Kconfig                     |    2
>>  drivers/Makefile                    |    1
>>  drivers/dax/Kconfig                 |   25 +++
>>  drivers/dax/Makefile                |    4 +
>>  drivers/dax/dax.c                   |  252 +++++++++++++++++++++++++++++++++++
>>  drivers/dax/dax.h                   |   24 +++
>>  drivers/dax/pmem.c                  |  168 +++++++++++++++++++++++
>
> Is a DAX device always a NVDIMM device, or can it be something else (like the
> S390 dcssblk)? If it's NVDIMM only I'd suggest it to go under the
> drivers/nvdimm directory.

The plan is that it can be something else, like high bandwidth memory
for example.

>>  tools/testing/nvdimm/Kbuild         |    9 +
>>  tools/testing/nvdimm/config_check.c |    2
>>  9 files changed, 487 insertions(+)
>>  create mode 100644 drivers/dax/Kconfig
>>  create mode 100644 drivers/dax/Makefile
>>  create mode 100644 drivers/dax/dax.c
>>  create mode 100644 drivers/dax/dax.h
>>  create mode 100644 drivers/dax/pmem.c
>>
>> diff --git a/drivers/Kconfig b/drivers/Kconfig
>> index d2ac339de85f..8298eab84a6f 100644
>> --- a/drivers/Kconfig
>> +++ b/drivers/Kconfig
>> @@ -190,6 +190,8 @@ source "drivers/android/Kconfig"
>>
>>  source "drivers/nvdimm/Kconfig"
>>
>> +source "drivers/dax/Kconfig"
>> +
>>  source "drivers/nvmem/Kconfig"
>>
>>  source "drivers/hwtracing/stm/Kconfig"
>> diff --git a/drivers/Makefile b/drivers/Makefile
>> index 8f5d076baeb0..0b6f3d60193d 100644
>> --- a/drivers/Makefile
>> +++ b/drivers/Makefile
>> @@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/
>>  obj-$(CONFIG_NVM) += lightnvm/
>>  obj-y += base/ block/ misc/ mfd/ nfc/
>>  obj-$(CONFIG_LIBNVDIMM) += nvdimm/
>> +obj-$(CONFIG_DEV_DAX) += dax/
>>  obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
>>  obj-$(CONFIG_NUBUS) += nubus/
>>  obj-y += macintosh/
>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>> new file mode 100644
>> index 000000000000..86ffbaa891ad
>> --- /dev/null
>> +++ b/drivers/dax/Kconfig
>> @@ -0,0 +1,25 @@
>> +menuconfig DEV_DAX
>> +	tristate "DAX: direct access to differentiated memory"
>> +	default m if NVDIMM_DAX
>> +	help
>> +	  Support raw access to differentiated (persistence, bandwidth,
>> +	  latency...) memory via an mmap(2) capable character
>> +	  device. Platform firmware or a device driver may identify a
>> +	  platform memory resource that is differentiated from the
>> +	  baseline memory pool. Mappings of a /dev/daxX.Y device impose
>> +	  restrictions that make the mapping behavior deterministic.
>> +
>> +if DEV_DAX
>> +
>> +config DEV_DAX_PMEM
>> +	tristate "PMEM DAX: direct access to persistent memory"
>> +	depends on NVDIMM_DAX
>> +	default DEV_DAX
>> +	help
>> +	  Support raw access to persistent memory. Note that this
>> +	  driver consumes memory ranges allocated and exported by the
>> +	  libnvdimm sub-system.
>> +
>> +	  Say Y if unsure
>> +
>> +endif
>> diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
>> new file mode 100644
>> index 000000000000..27c54e38478a
>> --- /dev/null
>> +++ b/drivers/dax/Makefile
>> @@ -0,0 +1,4 @@
>> +obj-$(CONFIG_DEV_DAX) += dax.o
>> +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
>> +
>> +dax_pmem-y := pmem.o
>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> new file mode 100644
>> index 000000000000..8207fb33a992
>> --- /dev/null
>> +++ b/drivers/dax/dax.c
[..]
>> +void dax_region_put(struct dax_region *dax_region)
>> +{
>> +	kref_put(&dax_region->kref, dax_region_free);
>> +}
>> +EXPORT_SYMBOL_GPL(dax_region_put);
>
> dax_region_get() ??

There's currently no public (outside of dax.c) usage for taking a
reference against a region. This export is really only there to keep
dax_region_free() private.

>> +
>> +static void dax_dev_free(struct kref *kref)
>> +{
>> +	struct dax_dev *dax_dev;
>> +
>> +	dax_dev = container_of(kref, struct dax_dev, kref);
>> +	dax_region_put(dax_dev->region);
>> +	kfree(dax_dev);
>> +}
>> +
>> +static void dax_dev_put(struct dax_dev *dax_dev)
>> +{
>> +	kref_put(&dax_dev->kref, dax_dev_free);
>> +}
>> +
>> +struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>> +		struct resource *res, unsigned int align, void *addr,
>> +		unsigned long pfn_flags)
>> +{
>> +	struct dax_region *dax_region;
>> +
>> +	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
>> +
>> +	if (!dax_region)
>> +		return NULL;
>> +
>> +	memcpy(&dax_region->res, res, sizeof(*res));
>> +	dax_region->pfn_flags = pfn_flags;
>> +	kref_init(&dax_region->kref);
>> +	dax_region->id = region_id;
>> +	ida_init(&dax_region->ida);
>> +	dax_region->align = align;
>> +	dax_region->dev = parent;
>> +	dax_region->base = addr;
>> +
>> +	return dax_region;
>> +}
>> +EXPORT_SYMBOL_GPL(alloc_dax_region);
>> +
>> +static ssize_t size_show(struct device *dev,
>> +		struct device_attribute *attr, char *buf)
>> +{
>> +	struct dax_dev *dax_dev = dev_get_drvdata(dev);
>> +	unsigned long long size = 0;
>> +	int i;
>> +
>> +	for (i = 0; i < dax_dev->num_resources; i++)
>> +		size += resource_size(&dax_dev->res[i]);
>> +
>> +	return sprintf(buf, "%llu\n", size);
>> +}
>> +static DEVICE_ATTR_RO(size);
>> +
>> +static struct attribute *dax_device_attributes[] = {
>> +	&dev_attr_size.attr,
>> +	NULL,
>> +};
>> +
>> +static const struct attribute_group dax_device_attribute_group = {
>> +	.attrs = dax_device_attributes,
>> +};
>> +
>> +static const struct attribute_group *dax_attribute_groups[] = {
>> +	&dax_device_attribute_group,
>> +	NULL,
>> +};
>> +
>> +static void destroy_dax_dev(void *_dev)
>> +{
>> +	struct device *dev = _dev;
>> +	struct dax_dev *dax_dev = dev_get_drvdata(dev);
>> +	struct dax_region *dax_region = dax_dev->region;
>> +
>> +	dev_dbg(dev, "%s\n", __func__);
> > This dev_dbg() could be replaced by function graph tracing. Not without an explicit trace point to re-add the dev_printk details. What I really want, and has been on the back burner for a long time, is to enhance dynamic debug to turn all individual dev_dbg() statements optionally into trace points. > >> + >> + get_device(dev); >> + device_unregister(dev); >> + ida_simple_remove(&dax_region->ida, dax_dev->id); >> + ida_simple_remove(&dax_minor_ida, MINOR(dev->devt)); >> + put_device(dev); >> + dax_dev_put(dax_dev); >> +} >> + >> +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, >> + int count) >> +{ >> + struct device *parent = dax_region->dev; >> + struct dax_dev *dax_dev; >> + struct device *dev; >> + int rc, minor; >> + dev_t dev_t; >> + >> + dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL); >> + if (!dax_dev) >> + return -ENOMEM; >> + memcpy(dax_dev->res, res, sizeof(*res) * count); >> + dax_dev->num_resources = count; >> + kref_init(&dax_dev->kref); >> + dax_dev->region = dax_region; >> + kref_get(&dax_region->kref); > > dax_region_get() ? I'm not sure that trivial wrapper is worth it. 
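
For readers following the reference counting under discussion (a
dax_dev pins its dax_region with kref_get(), and dax_dev_free() drops
that pin again), here is a minimal userspace sketch of the same
ownership pattern. It is an illustration only: struct ref, struct
region, struct child and all helper names are hypothetical stand-ins,
not the kernel's kref API, and the counter is a plain int rather than
an atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kref (not atomic). */
struct ref {
	int count;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void ref_init(struct ref *r) { r->count = 1; }
static void ref_get(struct ref *r) { r->count++; }

/* Drop one reference; run release() when the last one goes away. */
static void ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (--r->count == 0)
		release(r);
}

struct region {			/* plays the role of struct dax_region */
	struct ref ref;
	int id;
};

struct child {			/* plays the role of struct dax_dev */
	struct ref ref;
	struct region *region;
};

static int regions_freed, children_freed;

/* Private release function; only region_put() is meant to be public. */
static void region_free(struct ref *r)
{
	free(container_of(r, struct region, ref));
	regions_freed++;
}

static void region_put(struct region *reg)
{
	ref_put(&reg->ref, region_free);
}

static void child_free(struct ref *r)
{
	struct child *c = container_of(r, struct child, ref);

	region_put(c->region);	/* drop the pin taken in child_create() */
	free(c);
	children_freed++;
}

static void child_put(struct child *c)
{
	ref_put(&c->ref, child_free);
}

static struct region *region_alloc(int id)
{
	struct region *reg = calloc(1, sizeof(*reg));

	if (!reg)
		return NULL;
	ref_init(&reg->ref);
	reg->id = id;
	return reg;
}

static struct child *child_create(struct region *reg)
{
	struct child *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	ref_init(&c->ref);
	c->region = reg;
	ref_get(&reg->ref);	/* the child now pins the region */
	return c;
}
```

With this shape the creator can drop its own region reference right
after creating the child, and the region is only freed once the last
child is put. That mirrors the "child dax_dev instances now own the
lifetime of the dax_region" comment in dax_pmem_probe().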
>> +
>> +	dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL);
>> +	if (dax_dev->id < 0) {
>> +		rc = dax_dev->id;
>> +		goto err_id;
>> +	}
>> +
>> +	minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL);
>> +	if (minor < 0) {
>> +		rc = minor;
>> +		goto err_minor;
>> +	}
>> +
>> +	dev_t = MKDEV(dax_major, minor);
>> +	dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev,
>> +			dax_attribute_groups, "dax%d.%d", dax_region->id,
>> +			dax_dev->id);
>> +	if (IS_ERR(dev)) {
>> +		rc = PTR_ERR(dev);
>> +		goto err_create;
>> +	}
>> +	dax_dev->dev = dev;
>> +
>> +	rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev);
>> +	if (rc) {
>> +		destroy_dax_dev(dev);
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +
>> + err_create:
>> +	ida_simple_remove(&dax_minor_ida, minor);
>> + err_minor:
>> +	ida_simple_remove(&dax_region->ida, dax_dev->id);
>> + err_id:
>> +	dax_dev_put(dax_dev);
>> +
>> +	return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(devm_create_dax_dev);
>> +
>> +static const struct file_operations dax_fops = {
>> +	.llseek = noop_llseek,
>> +	.owner = THIS_MODULE,
>> +};
>> +
>> +static int __init dax_init(void)
>> +{
>> +	int rc;
>> +
>> +	rc = register_chrdev(0, "dax", &dax_fops);
>> +	if (rc < 0)
>> +		return rc;
>> +	dax_major = rc;
>> +
>> +	dax_class = class_create(THIS_MODULE, "dax");
>> +	if (IS_ERR(dax_class)) {
>> +		unregister_chrdev(dax_major, "dax");
>> +		return PTR_ERR(dax_class);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void __exit dax_exit(void)
>> +{
>> +	class_destroy(dax_class);
>> +	unregister_chrdev(dax_major, "dax");
>
> AFAICT you're missing a call to ida_destroy(&dax_minor_ida); here.

Indeed, you're right. That same bug is also present in multiple
places in drivers/nvdimm/.

[..]
>> diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h
>> new file mode 100644
>> index 000000000000..d8b8f1f25054
>> --- /dev/null
>> +++ b/drivers/dax/dax.h
>> @@ -0,0 +1,24 @@
>> +/*
>> + * Copyright(c) 2016 Intel Corporation. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of version 2 of the GNU General Public License as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + */
>> +#ifndef __DAX_H__
>> +#define __DAX_H__
>> +struct device;
>> +struct resource;
>> +struct dax_region;
>> +void dax_region_put(struct dax_region *dax_region);
>> +struct dax_region *alloc_dax_region(struct device *parent,
>> +		int region_id, struct resource *res, unsigned int align,
>> +		void *addr, unsigned long flags);
>> +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>> +		int count);
>> +#endif /* __DAX_H__ */
>> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
>> new file mode 100644
>> index 000000000000..4e97555e1cab
>> --- /dev/null
>> +++ b/drivers/dax/pmem.c
>> @@ -0,0 +1,168 @@
>> +/*
>> + * Copyright(c) 2016 Intel Corporation. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of version 2 of the GNU General Public License as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + */
>> +#include <linux/percpu-refcount.h>
>> +#include <linux/memremap.h>
>> +#include <linux/module.h>
>> +#include <linux/pfn_t.h>
>> +#include "../nvdimm/pfn.h"
>> +#include "../nvdimm/nd.h"
>> +#include "dax.h"
>> +
>> +struct dax_pmem {
>> +	struct device *dev;
>> +	struct percpu_ref ref;
>> +	struct completion cmp;
>> +};
>> +
>> +struct dax_pmem *to_dax_pmem(struct percpu_ref *ref)
>> +{
>> +	return container_of(ref, struct dax_pmem, ref);
>> +}
>> +
>> +static void dax_pmem_percpu_release(struct percpu_ref *ref)
>> +{
>> +	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
>> +
>> +	dev_dbg(dax_pmem->dev, "%s\n", __func__);
>
> This dev_dbg() could be replaced by function graph tracing.

[..]
>> +	dev_dbg(dax_pmem->dev, "%s\n", __func__);
>
> Same as above.
>
[..]
>> +	dev_dbg(dax_pmem->dev, "%s\n", __func__);
>
> Same as above.

Same reply as before to these...

>
>> +	percpu_ref_kill(ref);
>> +}
>> +
>> +static int dax_pmem_probe(struct device *dev)
>> +{
>> +	int rc;
>> +	void *addr;
>> +	struct resource res;
>> +	struct nd_pfn_sb *pfn_sb;
>> +	struct dax_pmem *dax_pmem;
>> +	struct nd_region *nd_region;
>> +	struct nd_namespace_io *nsio;
>> +	struct dax_region *dax_region;
>> +	struct nd_namespace_common *ndns;
>> +	struct nd_dax *nd_dax = to_nd_dax(dev);
>> +	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
>> +	struct vmem_altmap __altmap, *altmap = NULL;
>> +
>> +	ndns = nvdimm_namespace_common_probe(dev);
>> +	if (IS_ERR(ndns))
>> +		return PTR_ERR(ndns);
>> +	nsio = to_nd_namespace_io(&ndns->dev);
>> +
>> +	/* parse the 'pfn' info block via ->rw_bytes */
>> +	devm_nsio_enable(dev, nsio);
>> +	altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap);
>> +	if (IS_ERR(altmap))
>> +		return PTR_ERR(altmap);
>> +	devm_nsio_disable(dev, nsio);
>> +
>> +	pfn_sb = nd_pfn->pfn_sb;
>> +
>> +	if (!devm_request_mem_region(dev, nsio->res.start,
>> +				resource_size(&nsio->res), dev_name(dev))) {
>> +		dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
>> +		return -EBUSY;
>> +	}
>> +
>> +	dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL);
>> +	if (!dax_pmem)
>> +		return -ENOMEM;
>> +
>> +	dax_pmem->dev = dev;
>> +	init_completion(&dax_pmem->cmp);
>> +	rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0,
>> +			GFP_KERNEL);
>> +	if (rc)
>> +		return rc;
>> +
>> +	rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
>> +	if (rc) {
>> +		dax_pmem_percpu_exit(&dax_pmem->ref);
>> +		return rc;
>> +	}
>> +
>> +	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
>> +	if (IS_ERR(addr))
>> +		return PTR_ERR(addr);
>> +
>> +	rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref);
>> +	if (rc) {
>> +		dax_pmem_percpu_kill(&dax_pmem->ref);
>> +		return rc;
>> +	}
>> +
>> +	nd_region = to_nd_region(dev->parent);
>> +	dax_region = alloc_dax_region(dev, nd_region->id, &res,
>> +			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
>> +	if (!dax_region)
>> +		return -ENOMEM;
>> +
>> +	/* TODO: support for subdividing a dax region... */
>> +	rc = devm_create_dax_dev(dax_region, &res, 1);
>> +
>> +	/* child dax_dev instances now own the lifetime of the dax_region */
>> +	dax_region_put(dax_region);
>> +
>> +	return rc;
>> +}
>> +
>> +static int dax_pmem_remove(struct device *dev)
>> +{
>> +	/*
>> +	 * The init path is fully devm-enabled, or child devices
>> +	 * otherwise hold references on parent resources.
>> +	 */
>
> So remove is completely pointless here. Why are you depending on it in
> __nd_driver_register()? __device_release_driver() does not depend on a remove
> callback to be present.

Good point, I'll just remove the need for it.

[..]

Thanks, Johannes!

^ permalink raw reply	[flat|nested] 37+ messages in thread
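
As background to the "fully devm-enabled" remark in the message above:
device-managed actions are run in the reverse order of their
registration when the driver is unbound, which is why an empty remove
callback can be enough. The following is a rough userspace sketch of
that last-in, first-out unwinding, not the kernel's devres
implementation. struct owner, owner_add_action() and
owner_release_all() are hypothetical names chosen for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One registered cleanup action (hypothetical, devres-like). */
struct action {
	void (*fn)(void *data);
	void *data;
	struct action *next;
};

/* Stand-in for the device that owns the managed actions. */
struct owner {
	struct action *actions;	/* most recently added action first */
};

/* Register a cleanup action; returns 0 on success, -1 on allocation failure. */
static int owner_add_action(struct owner *o, void (*fn)(void *), void *data)
{
	struct action *a = malloc(sizeof(*a));

	if (!a)
		return -1;
	a->fn = fn;
	a->data = data;
	a->next = o->actions;	/* push on top of the stack */
	o->actions = a;
	return 0;
}

/* Run and free every action, newest first (reverse of registration). */
static void owner_release_all(struct owner *o)
{
	while (o->actions) {
		struct action *a = o->actions;

		o->actions = a->next;
		a->fn(a->data);
		free(a);
	}
}

/* Demo callback: record the order in which actions were run. */
static char order[8];
static size_t order_len;

static void record(void *data)
{
	if (order_len < sizeof(order))
		order[order_len++] = *(const char *)data;
}
```

Registering three actions tagged "e" (percpu exit), "p" (pages) and
"k" (percpu kill) in that order, as dax_pmem_probe() does with its
devm calls, and then calling owner_release_all() runs them as k, p, e.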
* Re: [PATCH v2 1/3] /dev/dax, pmem: direct access to persistent memory
@ 2016-05-17 16:40 ` Dan Williams
0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-17 16:40 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
linux-block, Andrew Morton, Christoph Hellwig

On Tue, May 17, 2016 at 1:52 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Sat, May 14, 2016 at 11:26:24PM -0700, Dan Williams wrote:
>> Device DAX is the device-centric analogue of Filesystem DAX
>> (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
>> without need of an intervening file system. Device DAX is strict,
>> precise and predictable. Specifically this interface:
>>
>> 1/ Guarantees fault granularity with respect to a given page size (pte,
>> pmd, or pud) set at configuration time.
>>
>> 2/ Enforces deterministic behavior by being strict about what fault
>> scenarios are supported.
>>
>> For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE
>> support device-dax guarantees that a mapping always behaves/performs the
>> same once established. It is the "what you see is what you get" access
>> mechanism to differentiated memory vs filesystem DAX which has
>> filesystem specific implementation semantics.
>>
>> Persistent memory is the first target, but the mechanism is also
>> targeted for exclusive allocations of performance differentiated memory
>> ranges.
>>
>> This commit is limited to the base device driver infrastructure to
>> associate a dax device with pmem range.
>>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/Kconfig | 2
>>  drivers/Makefile | 1
>>  drivers/dax/Kconfig | 25 +++
>>  drivers/dax/Makefile | 4 +
>>  drivers/dax/dax.c | 252 +++++++++++++++++++++++++++++++++++
>>  drivers/dax/dax.h | 24 +++
>>  drivers/dax/pmem.c | 168 +++++++++++++++++++++++
>
> Is a DAX device always a NVDIMM device, or can it be something else (like the
> S390 dcssblk)? If it's NVDIMM only I'd suggest it to go under the
> drivers/nvdimm directory.

The plan is that it can be something else, like high bandwidth memory
for example.

>>  tools/testing/nvdimm/Kbuild | 9 +
>>  tools/testing/nvdimm/config_check.c | 2
>>  9 files changed, 487 insertions(+)
>>  create mode 100644 drivers/dax/Kconfig
>>  create mode 100644 drivers/dax/Makefile
>>  create mode 100644 drivers/dax/dax.c
>>  create mode 100644 drivers/dax/dax.h
>>  create mode 100644 drivers/dax/pmem.c
>>
>> diff --git a/drivers/Kconfig b/drivers/Kconfig
>> index d2ac339de85f..8298eab84a6f 100644
>> --- a/drivers/Kconfig
>> +++ b/drivers/Kconfig
>> @@ -190,6 +190,8 @@ source "drivers/android/Kconfig"
>>
>>  source "drivers/nvdimm/Kconfig"
>>
>> +source "drivers/dax/Kconfig"
>> +
>>  source "drivers/nvmem/Kconfig"
>>
>>  source "drivers/hwtracing/stm/Kconfig"
>> diff --git a/drivers/Makefile b/drivers/Makefile
>> index 8f5d076baeb0..0b6f3d60193d 100644
>> --- a/drivers/Makefile
>> +++ b/drivers/Makefile
>> @@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/
>>  obj-$(CONFIG_NVM) += lightnvm/
>>  obj-y += base/ block/ misc/ mfd/ nfc/
>>  obj-$(CONFIG_LIBNVDIMM) += nvdimm/
>> +obj-$(CONFIG_DEV_DAX) += dax/
>>  obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
>>  obj-$(CONFIG_NUBUS) += nubus/
>>  obj-y += macintosh/
>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>> new file mode 100644
>> index 000000000000..86ffbaa891ad
>> --- /dev/null
>> +++ b/drivers/dax/Kconfig
>> @@ -0,0 +1,25 @@
>> +menuconfig DEV_DAX
>> +	tristate "DAX: direct access to differentiated memory"
>> +	default m if NVDIMM_DAX
>> +	help
>> +	  Support raw access to differentiated (persistence, bandwidth,
>> +	  latency...) memory via an mmap(2) capable character
>> +	  device. Platform firmware or a device driver may identify a
>> +	  platform memory resource that is differentiated from the
>> +	  baseline memory pool. Mappings of a /dev/daxX.Y device impose
>> +	  restrictions that make the mapping behavior deterministic.
>> +
>> +if DEV_DAX
>> +
>> +config DEV_DAX_PMEM
>> +	tristate "PMEM DAX: direct access to persistent memory"
>> +	depends on NVDIMM_DAX
>> +	default DEV_DAX
>> +	help
>> +	  Support raw access to persistent memory. Note that this
>> +	  driver consumes memory ranges allocated and exported by the
>> +	  libnvdimm sub-system.
>> +
>> +	  Say Y if unsure
>> +
>> +endif
>> diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
>> new file mode 100644
>> index 000000000000..27c54e38478a
>> --- /dev/null
>> +++ b/drivers/dax/Makefile
>> @@ -0,0 +1,4 @@
>> +obj-$(CONFIG_DEV_DAX) += dax.o
>> +obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
>> +
>> +dax_pmem-y := pmem.o
>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> new file mode 100644
>> index 000000000000..8207fb33a992
>> --- /dev/null
>> +++ b/drivers/dax/dax.c
[..]
>> +void dax_region_put(struct dax_region *dax_region)
>> +{
>> +	kref_put(&dax_region->kref, dax_region_free);
>> +}
>> +EXPORT_SYMBOL_GPL(dax_region_put);
>
> dax_region_get() ??

There's currently no public (outside of dax.c) usage for taking a
reference against a region. This export is really only there to keep
dax_region_free() private.
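
As a side note on the userspace view described in the quoted changelog
and Kconfig help (an mmap(2) capable character device without
MAP_PRIVATE support), a consumer would presumably map the device
shared. The sketch below is an assumption-laden illustration:
map_shared() is a hypothetical helper, the device path would be some
instance of /dev/daxX.Y, and the mmap path itself is not part of this
patch, whose dax_fops only provides llseek.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of the file or character device at path, read/write
 * and shared. Returns NULL on any failure. MAP_SHARED is used because
 * the changelog says MAP_PRIVATE support is omitted for device-dax.
 */
static void *map_shared(const char *path, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* an established mapping survives the close */

	return addr == MAP_FAILED ? NULL : addr;
}
```

A caller would pass a length that respects the alignment configured
for the device (an assumption here, based on the fault granularity
guarantee in the changelog) and release the mapping with munmap().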
>> +
>> +static void dax_dev_free(struct kref *kref)
>> +{
>> +	struct dax_dev *dax_dev;
>> +
>> +	dax_dev = container_of(kref, struct dax_dev, kref);
>> +	dax_region_put(dax_dev->region);
>> +	kfree(dax_dev);
>> +}
>> +
>> +static void dax_dev_put(struct dax_dev *dax_dev)
>> +{
>> +	kref_put(&dax_dev->kref, dax_dev_free);
>> +}
>> +
>> +struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>> +		struct resource *res, unsigned int align, void *addr,
>> +		unsigned long pfn_flags)
>> +{
>> +	struct dax_region *dax_region;
>> +
>> +	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
>> +
>> +	if (!dax_region)
>> +		return NULL;
>> +
>> +	memcpy(&dax_region->res, res, sizeof(*res));
>> +	dax_region->pfn_flags = pfn_flags;
>> +	kref_init(&dax_region->kref);
>> +	dax_region->id = region_id;
>> +	ida_init(&dax_region->ida);
>> +	dax_region->align = align;
>> +	dax_region->dev = parent;
>> +	dax_region->base = addr;
>> +
>> +	return dax_region;
>> +}
>> +EXPORT_SYMBOL_GPL(alloc_dax_region);
>> +
>> +static ssize_t size_show(struct device *dev,
>> +		struct device_attribute *attr, char *buf)
>> +{
>> +	struct dax_dev *dax_dev = dev_get_drvdata(dev);
>> +	unsigned long long size = 0;
>> +	int i;
>> +
>> +	for (i = 0; i < dax_dev->num_resources; i++)
>> +		size += resource_size(&dax_dev->res[i]);
>> +
>> +	return sprintf(buf, "%llu\n", size);
>> +}
>> +static DEVICE_ATTR_RO(size);
>> +
>> +static struct attribute *dax_device_attributes[] = {
>> +	&dev_attr_size.attr,
>> +	NULL,
>> +};
>> +
>> +static const struct attribute_group dax_device_attribute_group = {
>> +	.attrs = dax_device_attributes,
>> +};
>> +
>> +static const struct attribute_group *dax_attribute_groups[] = {
>> +	&dax_device_attribute_group,
>> +	NULL,
>> +};
>> +
>> +static void destroy_dax_dev(void *_dev)
>> +{
>> +	struct device *dev = _dev;
>> +	struct dax_dev *dax_dev = dev_get_drvdata(dev);
>> +	struct dax_region *dax_region = dax_dev->region;
>> +
>> +	dev_dbg(dev, "%s\n", __func__);
>
> This dev_dbg() could be replaced by function graph tracing.

Not without an explicit trace point to re-add the dev_printk details.
What I really want, and has been on the back burner for a long time,
is to enhance dynamic debug to turn all individual dev_dbg()
statements optionally into trace points.

>
>> +
>> +	get_device(dev);
>> +	device_unregister(dev);
>> +	ida_simple_remove(&dax_region->ida, dax_dev->id);
>> +	ida_simple_remove(&dax_minor_ida, MINOR(dev->devt));
>> +	put_device(dev);
>> +	dax_dev_put(dax_dev);
>> +}
>> +
>> +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>> +		int count)
>> +{
>> +	struct device *parent = dax_region->dev;
>> +	struct dax_dev *dax_dev;
>> +	struct device *dev;
>> +	int rc, minor;
>> +	dev_t dev_t;
>> +
>> +	dax_dev = kzalloc(sizeof(*dax_dev) + sizeof(*res) * count, GFP_KERNEL);
>> +	if (!dax_dev)
>> +		return -ENOMEM;
>> +	memcpy(dax_dev->res, res, sizeof(*res) * count);
>> +	dax_dev->num_resources = count;
>> +	kref_init(&dax_dev->kref);
>> +	dax_dev->region = dax_region;
>> +	kref_get(&dax_region->kref);
>
> dax_region_get() ?

I'm not sure that trivial wrapper is worth it.
>> + >> + dax_dev->id = ida_simple_get(&dax_region->ida, 0, 0, GFP_KERNEL); >> + if (dax_dev->id < 0) { >> + rc = dax_dev->id; >> + goto err_id; >> + } >> + >> + minor = ida_simple_get(&dax_minor_ida, 0, 0, GFP_KERNEL); >> + if (minor < 0) { >> + rc = minor; >> + goto err_minor; >> + } >> + >> + dev_t = MKDEV(dax_major, minor); >> + dev = device_create_with_groups(dax_class, parent, dev_t, dax_dev, >> + dax_attribute_groups, "dax%d.%d", dax_region->id, >> + dax_dev->id); >> + if (IS_ERR(dev)) { >> + rc = PTR_ERR(dev); >> + goto err_create; >> + } >> + dax_dev->dev = dev; >> + >> + rc = devm_add_action(dax_region->dev, destroy_dax_dev, dev); >> + if (rc) { >> + destroy_dax_dev(dev); >> + return rc; >> + } >> + >> + return 0; >> + >> + err_create: >> + ida_simple_remove(&dax_minor_ida, minor); >> + err_minor: >> + ida_simple_remove(&dax_region->ida, dax_dev->id); >> + err_id: >> + dax_dev_put(dax_dev); >> + >> + return rc; >> +} >> +EXPORT_SYMBOL_GPL(devm_create_dax_dev); >> + >> +static const struct file_operations dax_fops = { >> + .llseek = noop_llseek, >> + .owner = THIS_MODULE, >> +}; >> + >> +static int __init dax_init(void) >> +{ >> + int rc; >> + >> + rc = register_chrdev(0, "dax", &dax_fops); >> + if (rc < 0) >> + return rc; >> + dax_major = rc; >> + >> + dax_class = class_create(THIS_MODULE, "dax"); >> + if (IS_ERR(dax_class)) { >> + unregister_chrdev(dax_major, "dax"); >> + return PTR_ERR(dax_class); >> + } >> + >> + return 0; >> +} >> + >> +static void __exit dax_exit(void) >> +{ >> + class_destroy(dax_class); >> + unregister_chrdev(dax_major, "dax"); > > AFAICT you're missing a call to ida_destroy(&dax_minor_ida); here. Indeed, you're right. That same bug is also present in multiple places in drivers/nvdimm/. [..] >> diff --git a/drivers/dax/dax.h b/drivers/dax/dax.h >> new file mode 100644 >> index 000000000000..d8b8f1f25054 >> --- /dev/null >> +++ b/drivers/dax/dax.h >> @@ -0,0 +1,24 @@ >> +/* >> + * Copyright(c) 2016 Intel Corporation. 
All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of version 2 of the GNU General Public License as >> + * published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope that it will be useful, but >> + * WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> + * General Public License for more details. >> + */ >> +#ifndef __DAX_H__ >> +#define __DAX_H__ >> +struct device; >> +struct resource; >> +struct dax_region; >> +void dax_region_put(struct dax_region *dax_region); >> +struct dax_region *alloc_dax_region(struct device *parent, >> + int region_id, struct resource *res, unsigned int align, >> + void *addr, unsigned long flags); >> +int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res, >> + int count); >> +#endif /* __DAX_H__ */ >> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c >> new file mode 100644 >> index 000000000000..4e97555e1cab >> --- /dev/null >> +++ b/drivers/dax/pmem.c >> @@ -0,0 +1,168 @@ >> +/* >> + * Copyright(c) 2016 Intel Corporation. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of version 2 of the GNU General Public License as >> + * published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope that it will be useful, but >> + * WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> + * General Public License for more details. 
>> + */ >> +#include <linux/percpu-refcount.h> >> +#include <linux/memremap.h> >> +#include <linux/module.h> >> +#include <linux/pfn_t.h> >> +#include "../nvdimm/pfn.h" >> +#include "../nvdimm/nd.h" >> +#include "dax.h" >> + >> +struct dax_pmem { >> + struct device *dev; >> + struct percpu_ref ref; >> + struct completion cmp; >> +}; >> + >> +struct dax_pmem *to_dax_pmem(struct percpu_ref *ref) >> +{ >> + return container_of(ref, struct dax_pmem, ref); >> +} >> + >> +static void dax_pmem_percpu_release(struct percpu_ref *ref) >> +{ >> + struct dax_pmem *dax_pmem = to_dax_pmem(ref); >> + >> + dev_dbg(dax_pmem->dev, "%s\n", __func__); > > This dev_dbg() could be replaced by function graph tracing. [..] >> + dev_dbg(dax_pmem->dev, "%s\n", __func__); > > Same as above. > [..] >> + dev_dbg(dax_pmem->dev, "%s\n", __func__); > > Same as above. Same reply as before to these... > >> + percpu_ref_kill(ref); >> +} >> + >> +static int dax_pmem_probe(struct device *dev) >> +{ >> + int rc; >> + void *addr; >> + struct resource res; >> + struct nd_pfn_sb *pfn_sb; >> + struct dax_pmem *dax_pmem; >> + struct nd_region *nd_region; >> + struct nd_namespace_io *nsio; >> + struct dax_region *dax_region; >> + struct nd_namespace_common *ndns; >> + struct nd_dax *nd_dax = to_nd_dax(dev); >> + struct nd_pfn *nd_pfn = &nd_dax->nd_pfn; >> + struct vmem_altmap __altmap, *altmap = NULL; >> + >> + ndns = nvdimm_namespace_common_probe(dev); >> + if (IS_ERR(ndns)) >> + return PTR_ERR(ndns); >> + nsio = to_nd_namespace_io(&ndns->dev); >> + >> + /* parse the 'pfn' info block via ->rw_bytes */ >> + devm_nsio_enable(dev, nsio); >> + altmap = nvdimm_setup_pfn(nd_pfn, &res, &__altmap); >> + if (IS_ERR(altmap)) >> + return PTR_ERR(altmap); >> + devm_nsio_disable(dev, nsio); >> + >> + pfn_sb = nd_pfn->pfn_sb; >> + >> + if (!devm_request_mem_region(dev, nsio->res.start, >> + resource_size(&nsio->res), dev_name(dev))) { >> + dev_warn(dev, "could not reserve region %pR\n", &nsio->res); >> + return -EBUSY; 
>> + } >> + >> + dax_pmem = devm_kzalloc(dev, sizeof(*dax_pmem), GFP_KERNEL); >> + if (!dax_pmem) >> + return -ENOMEM; >> + >> + dax_pmem->dev = dev; >> + init_completion(&dax_pmem->cmp); >> + rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0, >> + GFP_KERNEL); >> + if (rc) >> + return rc; >> + >> + rc = devm_add_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref); >> + if (rc) { >> + dax_pmem_percpu_exit(&dax_pmem->ref); >> + return rc; >> + } >> + >> + addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap); >> + if (IS_ERR(addr)) >> + return PTR_ERR(addr); >> + >> + rc = devm_add_action(dev, dax_pmem_percpu_kill, &dax_pmem->ref); >> + if (rc) { >> + dax_pmem_percpu_kill(&dax_pmem->ref); >> + return rc; >> + } >> + >> + nd_region = to_nd_region(dev->parent); >> + dax_region = alloc_dax_region(dev, nd_region->id, &res, >> + le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP); >> + if (!dax_region) >> + return -ENOMEM; >> + >> + /* TODO: support for subdividing a dax region... */ >> + rc = devm_create_dax_dev(dax_region, &res, 1); >> + >> + /* child dax_dev instances now own the lifetime of the dax_region */ >> + dax_region_put(dax_region); >> + >> + return rc; >> +} >> + >> +static int dax_pmem_remove(struct device *dev) >> +{ >> + /* >> + * The init path is fully devm-enabled, or child devices >> + * otherwise hold references on parent resources. >> + */ > > So remove is completely pointless here. Why are you depending on it in > __nd_driver_register()? __device_release_driver() does not depend on a remove > callback to be present. Good point, I'll just remove the need for it. [..] Thanks, Johannes! _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 37+ messages in thread
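The reference handoff discussed in the message above (probe takes a region reference for each child via kref_get(), drops its own with dax_region_put(), and dax_dev_free() releases the child's hold) can be modeled in userspace roughly as follows. This is only a sketch: the struct and function names are hypothetical stand-ins, the counters are plain ints rather than the atomic struct kref, and nothing here is the driver's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct dax_region */
struct region {
	int refs;
	int *freed;	/* set to 1 when the region is released */
};

/* Hypothetical userspace stand-in for struct dax_dev */
struct child {
	int refs;
	struct region *region;
};

/* Allocate a region holding one initial reference (like kref_init()) */
struct region *region_alloc(int *freed)
{
	struct region *r = calloc(1, sizeof(*r));

	assert(r);
	r->refs = 1;
	r->freed = freed;
	return r;
}

/* Drop one reference; release the region when the last one goes away */
void region_put(struct region *r)
{
	if (--r->refs == 0) {
		*r->freed = 1;
		free(r);
	}
}

/* Create a child that pins its parent region */
struct child *child_create(struct region *r)
{
	struct child *c = calloc(1, sizeof(*c));

	assert(c);
	c->refs = 1;
	c->region = r;
	r->refs++;	/* plays the role of kref_get(&dax_region->kref) */
	return c;
}

/* Drop one child reference; the last put also unpins the region */
void child_put(struct child *c)
{
	if (--c->refs == 0) {
		region_put(c->region);	/* plays the role of dax_dev_free() */
		free(c);
	}
}
```

With this shape the creator can call region_put() right after child_create() and the region stays allocated until the last child_put(), which is the ownership transfer that the "child dax_dev instances now own the lifetime of the dax_region" comment describes.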
* [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-15 6:26 ` Dan Williams
(?)
@ 2016-05-15 6:26 ` Dan Williams
-1 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-15 6:26 UTC (permalink / raw)
To: linux-nvdimm
Cc: Dave Hansen, linux-kernel, linux-block, Jeff Moyer, Ross Zwisler,
Andrew Morton, hch

The "Device DAX" core enables dax mappings of performance / feature
differentiated memory. An open mapping or file handle keeps the backing
struct device live, but new mappings are only possible while the device
is enabled. Faults are handled under rcu_read_lock to synchronize
with the enabled state of the device.

Similar to the filesystem-dax case the backing memory may optionally
have struct page entries. However, unlike fs-dax there is no support
for private mappings, or mappings that are not backed by media (see
use of zero-page in fs-dax).

Mappings are always guaranteed to match the alignment of the dax_region.
If the dax_region is configured to have a 2MB alignment, all mappings
are guaranteed to be backed by a pmd entry. Contrast this determinism
with the fs-dax case where pmd mappings are opportunistic. If userspace
attempts to force a misaligned mapping, the driver will fail the mmap
attempt. See dax_dev_check_vma() for other scenarios that are rejected,
like MAP_PRIVATE mappings.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig |    1
 drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c    |    1
 mm/hugetlb.c        |    1
 4 files changed, 319 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 86ffbaa891ad..cedab7572de3 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,6 +1,7 @@
 menuconfig DEV_DAX
 	tristate "DAX: direct access to differentiated memory"
 	default m if NVDIMM_DAX
+	depends on TRANSPARENT_HUGEPAGE
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 8207fb33a992..b2fe8a0ce866 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -49,6 +49,7 @@ struct dax_region {
  * @region - parent region
  * @dev - device backing the character device
  * @kref - enable this data to be tracked in filp->private_data
+ * @alive - !alive + rcu grace period == no new mappings can be established
  * @id - child id in the region
  * @num_resources - number of physical address extents in this device
  * @res - array of physical address ranges
@@ -57,6 +58,7 @@ struct dax_dev {
 	struct dax_region *region;
 	struct device *dev;
 	struct kref kref;
+	bool alive;
 	int id;
 	int num_resources;
 	struct resource res[0];
@@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
 
 	dev_dbg(dev, "%s\n", __func__);
 
+	/* disable and flush fault handlers, TODO unmap inodes */
+	dax_dev->alive = false;
+	synchronize_rcu();
+
 	get_device(dev);
 	device_unregister(dev);
 	ida_simple_remove(&dax_region->ida, dax_dev->id);
@@ -173,6 +179,7 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
 	memcpy(dax_dev->res, res, sizeof(*res) * count);
 	dax_dev->num_resources = count;
 	kref_init(&dax_dev->kref);
+	dax_dev->alive = true;
 	dax_dev->region = dax_region;
 	kref_get(&dax_region->kref);
 
@@ -217,9 +224,318 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
 }
 EXPORT_SYMBOL_GPL(devm_create_dax_dev);
 
+/* return an unmapped area aligned to the dax region specified alignment */
+static unsigned long dax_dev_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
+{
+	unsigned long off, off_end, off_align, len_align, addr_align, align;
+	struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
+	struct dax_region *dax_region;
+
+	if (!dax_dev || addr)
+		goto out;
+
+	dax_region = dax_dev->region;
+	align = dax_region->align;
+	off = pgoff << PAGE_SHIFT;
+	off_end = off + len;
+	off_align = round_up(off, align);
+
+	if ((off_end <= off_align) || ((off_end - off_align) < align))
+		goto out;
+
+	len_align = len + align;
+	if ((off + len_align) < off)
+		goto out;
+
+	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+			pgoff, flags);
+	if (!IS_ERR_VALUE(addr_align)) {
+		addr_align += (off - addr_align) & (align - 1);
+		return addr_align;
+	}
+ out:
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static int __match_devt(struct device *dev, const void *data)
+{
+	const dev_t *devt = data;
+
+	return dev->devt == *devt;
+}
+
+static struct device *dax_dev_find(dev_t dev_t)
+{
+	return class_find_device(dax_class, NULL, &dev_t, __match_devt);
+}
+
+static int dax_dev_open(struct inode *inode, struct file *filp)
+{
+	struct dax_dev *dax_dev = NULL;
+	struct device *dev;
+
+	dev = dax_dev_find(inode->i_rdev);
+	if (!dev)
+		return -ENXIO;
+
+	device_lock(dev);
+	dax_dev = dev_get_drvdata(dev);
+	if (dax_dev) {
+		dev_dbg(dev, "%s\n", __func__);
+		filp->private_data = dax_dev;
+		kref_get(&dax_dev->kref);
+		inode->i_flags = S_DAX;
+	}
+	device_unlock(dev);
+
+	if (!dax_dev) {
+		put_device(dev);
+		return -ENXIO;
+	}
+	return 0;
+}
+
+static int dax_dev_release(struct inode *inode, struct file *filp)
+{
+	struct dax_dev *dax_dev = filp->private_data;
+	struct device *dev = dax_dev->dev;
+
+	dev_dbg(dax_dev->dev, "%s\n", __func__);
+	dax_dev_put(dax_dev);
+	put_device(dev);
+
+	return 0;
+}
+
+static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
+		const char *func)
+{
+	struct dax_region *dax_region = dax_dev->region;
+	struct device *dev = dax_dev->dev;
+	unsigned long mask;
+
+	if (!dax_dev->alive)
+		return -ENXIO;
+
+	/* prevent private / writable mappings from being established */
+	if ((vma->vm_flags & (VM_NORESERVE|VM_SHARED|VM_WRITE)) == VM_WRITE) {
+		dev_dbg(dev, "%s: %s: fail, attempted private mapping\n",
+				current->comm, func);
+		return -EINVAL;
+	}
+
+	mask = dax_region->align - 1;
+	if (vma->vm_start & mask || vma->vm_end & mask) {
+		dev_dbg(dev, "%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
+				current->comm, func, vma->vm_start, vma->vm_end,
+				mask);
+		return -EINVAL;
+	}
+
+	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
+			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+		dev_dbg(dev, "%s: %s: fail, dax range requires MADV_DONTFORK\n",
+				current->comm, func);
+		return -EINVAL;
+	}
+
+	if (!vma_is_dax(vma)) {
+		dev_dbg(dev, "%s: %s: fail, vma is not DAX capable\n",
+				current->comm, func);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+		unsigned long size)
+{
+	struct resource *res;
+	phys_addr_t phys;
+	int i;
+
+	for (i = 0; i < dax_dev->num_resources; i++) {
+		res = &dax_dev->res[i];
+		phys = pgoff * PAGE_SIZE + res->start;
+		if (phys >= res->start && phys <= res->end)
+			break;
+		pgoff -= PHYS_PFN(resource_size(res));
+	}
+
+	if (i < dax_dev->num_resources) {
+		res = &dax_dev->res[i];
+		if (phys + size - 1 <= res->end)
+			return phys;
+	}
+
+	return -1;
+}
+
+static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
+		struct vm_fault *vmf)
+{
+	unsigned long vaddr = (unsigned long) vmf->virtual_address;
+	struct device *dev = dax_dev->dev;
+	struct dax_region *dax_region;
+	int rc = VM_FAULT_SIGBUS;
+	phys_addr_t phys;
+	pfn_t pfn;
+
+	if (check_vma(dax_dev, vma, __func__))
+		return VM_FAULT_SIGBUS;
+
+	dax_region = dax_dev->region;
+	if (dax_region->align > PAGE_SIZE) {
+		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	phys = pgoff_to_phys(dax_dev, vmf->pgoff, PAGE_SIZE);
+	if (phys == -1) {
+		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
+				vmf->pgoff);
+		return VM_FAULT_SIGBUS;
+	}
+
+	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+
+	rc = vm_insert_mixed(vma, vaddr, pfn);
+
+	if (rc == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (rc < 0 && rc != -EBUSY)
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int rc;
+	struct file *filp = vma->vm_file;
+	struct dax_dev *dax_dev = filp->private_data;
+
+	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
+			current->comm, (vmf->flags & FAULT_FLAG_WRITE)
+			? "write" : "read", vma->vm_start, vma->vm_end);
+	rcu_read_lock();
+	rc = __dax_dev_fault(dax_dev, vma, vmf);
+	rcu_read_unlock();
+
+	return rc;
+}
+
+static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
+		struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd,
+		unsigned int flags)
+{
+	unsigned long pmd_addr = addr & PMD_MASK;
+	struct device *dev = dax_dev->dev;
+	struct dax_region *dax_region;
+	phys_addr_t phys;
+	pgoff_t pgoff;
+	pfn_t pfn;
+
+	if (check_vma(dax_dev, vma, __func__))
+		return VM_FAULT_SIGBUS;
+
+	dax_region = dax_dev->region;
+	if (dax_region->align > PMD_SIZE) {
+		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* dax pmd mappings require pfn_t_devmap() */
+	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
+		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	pgoff = linear_page_index(vma, pmd_addr);
+	phys = pgoff_to_phys(dax_dev, pgoff, PAGE_SIZE);
+	if (phys == -1) {
+		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
+				pgoff);
+		return VM_FAULT_SIGBUS;
+	}
+
+	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+
+	return vmf_insert_pfn_pmd(vma, addr, pmd, pfn,
+			flags & FAULT_FLAG_WRITE);
+}
+
+static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t *pmd, unsigned int flags)
+{
+	int rc;
+	struct file *filp = vma->vm_file;
+	struct dax_dev *dax_dev = filp->private_data;
+
+	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
+			current->comm, (flags & FAULT_FLAG_WRITE)
+			? "write" : "read", vma->vm_start, vma->vm_end);
+
+	rcu_read_lock();
+	rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
+	rcu_read_unlock();
+
+	return rc;
+}
+
+static void dax_dev_vm_open(struct vm_area_struct *vma)
+{
+	struct file *filp = vma->vm_file;
+	struct dax_dev *dax_dev = filp->private_data;
+
+	dev_dbg(dax_dev->dev, "%s\n", __func__);
+	kref_get(&dax_dev->kref);
+}
+
+static void dax_dev_vm_close(struct vm_area_struct *vma)
+{
+	struct file *filp = vma->vm_file;
+	struct dax_dev *dax_dev = filp->private_data;
+
+	dev_dbg(dax_dev->dev, "%s\n", __func__);
+	dax_dev_put(dax_dev);
+}
+
+static const struct vm_operations_struct dax_dev_vm_ops = {
+	.fault = dax_dev_fault,
+	.pmd_fault = dax_dev_pmd_fault,
+	.open = dax_dev_vm_open,
+	.close = dax_dev_vm_close,
+};
+
+static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct dax_dev *dax_dev = filp->private_data;
+	int rc;
+
+	dev_dbg(dax_dev->dev, "%s\n", __func__);
+
+	rc = check_vma(dax_dev, vma, __func__);
+	if (rc)
+		return rc;
+
+	kref_get(&dax_dev->kref);
+	vma->vm_ops = &dax_dev_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	return 0;
+
+}
+
 static const struct file_operations dax_fops = {
 	.llseek = noop_llseek,
 	.owner = THIS_MODULE,
+	.open = dax_dev_open,
+	.release = dax_dev_release,
+	.get_unmapped_area = dax_dev_get_unmapped_area,
+	.mmap = dax_dev_mmap,
 };
 
 static int __init dax_init(void)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86f9f8b82f8e..52ea012d8a80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1013,6 +1013,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
 	return VM_FAULT_NOPAGE;
 }
+EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
 static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 19d0d08b396f..b14e98129b07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -624,6 +624,7 @@ pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 {
 	return vma_hugecache_offset(hstate_vma(vma), vma, address);
 }
+EXPORT_SYMBOL_GPL(linear_hugepage_index);
 
 /*
  * Return the size of the pages allocated when backing a VMA. In the majority
^ permalink raw reply related	[flat|nested] 37+ messages in thread
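The address fix-up in dax_dev_get_unmapped_area() from the patch above, addr_align += (off - addr_align) & (align - 1), can be tried out in userspace with a small model. This is a sketch, not kernel code: dax_align_addr() is a hypothetical name, and it assumes the alignment is a power of two (the patch's mask arithmetic implies the same). Given the start of a free range that was over-allocated by one alignment unit, it returns the first address at or above that start which is congruent to the file offset modulo the alignment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the fix-up step: bump range_start forward by less
 * than one 'align' so that the result and 'off' share the same offset
 * within an 'align'-sized block. dax_align_addr() is a hypothetical
 * name used only for this sketch.
 */
uint64_t dax_align_addr(uint64_t range_start, uint64_t off, uint64_t align)
{
	/* assumption: align is a non-zero power of two, e.g. 4K or 2M */
	assert(align && (align & (align - 1)) == 0);

	/* unsigned wraparound makes (off - range_start) & mask a modulo */
	return range_start + ((off - range_start) & (align - 1));
}
```

For example, with a 2MB alignment and offset 0, a free range starting at 0x7f0000123000 yields 0x7f0000200000, the next 2MB boundary. Because the search length was len + align, the adjusted start still leaves len bytes inside the free range.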
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-15 6:26 ` Dan Williams
(?)
@ 2016-05-17 10:57 ` Johannes Thumshirn
-1 siblings, 0 replies; 37+ messages in thread
From: Johannes Thumshirn @ 2016-05-17 10:57 UTC (permalink / raw)
To: Dan Williams
Cc: linux-nvdimm, Dave Hansen, linux-kernel, hch, linux-block,
Andrew Morton

On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> The "Device DAX" core enables dax mappings of performance / feature
> differentiated memory. An open mapping or file handle keeps the backing
> struct device live, but new mappings are only possible while the device
> is enabled. Faults are handled under rcu_read_lock to synchronize
> with the enabled state of the device.
>
> Similar to the filesystem-dax case the backing memory may optionally
> have struct page entries. However, unlike fs-dax there is no support
> for private mappings, or mappings that are not backed by media (see
> use of zero-page in fs-dax).
>
> Mappings are always guaranteed to match the alignment of the dax_region.
> If the dax_region is configured to have a 2MB alignment, all mappings
> are guaranteed to be backed by a pmd entry. Contrast this determinism
> with the fs-dax case where pmd mappings are opportunistic. If userspace
> attempts to force a misaligned mapping, the driver will fail the mmap
> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> like MAP_PRIVATE mappings.
>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/Kconfig |    1
>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c    |    1
>  mm/hugetlb.c        |    1
>  4 files changed, 319 insertions(+)
>
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index 86ffbaa891ad..cedab7572de3 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -1,6 +1,7 @@
>  menuconfig DEV_DAX
>  	tristate "DAX: direct access to differentiated memory"
>  	default m if NVDIMM_DAX
> +	depends on TRANSPARENT_HUGEPAGE
>  	help
>  	  Support raw access to differentiated (persistence, bandwidth,
>  	  latency...) memory via an mmap(2) capable character
> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> index 8207fb33a992..b2fe8a0ce866 100644
> --- a/drivers/dax/dax.c
> +++ b/drivers/dax/dax.c
> @@ -49,6 +49,7 @@ struct dax_region {
>   * @region - parent region
>   * @dev - device backing the character device
>   * @kref - enable this data to be tracked in filp->private_data
> + * @alive - !alive + rcu grace period == no new mappings can be established
>   * @id - child id in the region
>   * @num_resources - number of physical address extents in this device
>   * @res - array of physical address ranges
> @@ -57,6 +58,7 @@ struct dax_dev {
>  	struct dax_region *region;
>  	struct device *dev;
>  	struct kref kref;
> +	bool alive;
>  	int id;
>  	int num_resources;
>  	struct resource res[0];
> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>
>  	dev_dbg(dev, "%s\n", __func__);
>
> +	/* disable and flush fault handlers, TODO unmap inodes */
> +	dax_dev->alive = false;
> +	synchronize_rcu();
> +

IIRC RCU is only protecting a pointer, not the content of the pointer, so this
looks wrong to me.

>  	get_device(dev);
>  	device_unregister(dev);
>  	ida_simple_remove(&dax_region->ida, dax_dev->id);
> @@ -173,6 +179,7 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>  	memcpy(dax_dev->res, res, sizeof(*res) * count);
>  	dax_dev->num_resources = count;
>  	kref_init(&dax_dev->kref);
> +	dax_dev->alive = true;
>  	dax_dev->region = dax_region;
>  	kref_get(&dax_region->kref);
>
> @@ -217,9 +224,318 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>  }
>  EXPORT_SYMBOL_GPL(devm_create_dax_dev);
>
> +/* return an unmapped area aligned to the dax region specified alignment */
> +static unsigned long dax_dev_get_unmapped_area(struct file *filp,
> +		unsigned long addr, unsigned long len, unsigned long pgoff,
> +		unsigned long flags)
> +{
> +	unsigned long off, off_end, off_align, len_align, addr_align, align;
> +	struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
> +	struct dax_region *dax_region;
> +
> +	if (!dax_dev || addr)
> +		goto out;
> +
> +	dax_region = dax_dev->region;
> +	align = dax_region->align;
> +	off = pgoff << PAGE_SHIFT;
> +	off_end = off + len;
> +	off_align = round_up(off, align);
> +
> +	if ((off_end <= off_align) || ((off_end - off_align) < align))
> +		goto out;
> +
> +	len_align = len + align;
> +	if ((off + len_align) < off)
> +		goto out;
> +
> +	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
> +			pgoff, flags);
> +	if (!IS_ERR_VALUE(addr_align)) {
> +		addr_align += (off - addr_align) & (align - 1);
> +		return addr_align;
> +	}
> + out:
> +	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
> +}
> +
> +static int __match_devt(struct device *dev, const void *data)
> +{
> +	const dev_t *devt = data;
> +
> +	return dev->devt == *devt;
> +}
> +
> +static struct device *dax_dev_find(dev_t dev_t)
> +{
> +	return class_find_device(dax_class, NULL, &dev_t, __match_devt);
> +}
> +
> +static int dax_dev_open(struct inode *inode, struct file *filp)
> +{
> +	struct dax_dev *dax_dev = NULL;
> +	struct device *dev;
> +
> +	dev = dax_dev_find(inode->i_rdev);
> +	if (!dev)
> +		return -ENXIO;
> +
> +	device_lock(dev);
> +	dax_dev = dev_get_drvdata(dev);
> +	if (dax_dev) {
> +		dev_dbg(dev, "%s\n", __func__);
> +		filp->private_data = dax_dev;
> +		kref_get(&dax_dev->kref);
> +		inode->i_flags = S_DAX;
> +	}
> +	device_unlock(dev);
> +

Does this block really need to be protected by the dev mutex? If yes, have you
considered re-ordering it like this?

	device_lock(dev);
	dax_dev = dev_get_drvdata(dev);
	if (!dax_dev) {
		device_unlock(dev);
		goto out_put;
	}
	filp->private_data = dax_dev;
	kref_get(&dax_dev->kref); // or get_dax_device(dax_dev)
	inode->i_flags = S_DAX;
	device_unlock(dev);
	return 0;

out_put:
	put_device(dev);
	return -ENXIO;

The only thing I see that could be needed to be protected here, is the
inode->i_flags and shouldn't that be protected by the inode->i_mutex? But I'm
not sure, hence the question. Also S_DAX is the only valid flag for a DAX
device, isn't it?

> +	if (!dax_dev) {
> +		put_device(dev);
> +		return -ENXIO;
> +	}
> +	return 0;
> +}
> +
> +static int dax_dev_release(struct inode *inode, struct file *filp)
> +{
> +	struct dax_dev *dax_dev = filp->private_data;
> +	struct device *dev = dax_dev->dev;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	dax_dev_put(dax_dev);
> +	put_device(dev);
> +

For reasons of consistency one could reset the inode's i_flags here.

> +	return 0;
> +}
> +
> +static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
> +		const char *func)
> +{
> +	struct dax_region *dax_region = dax_dev->region;
> +	struct device *dev = dax_dev->dev;
> +	unsigned long mask;
> +
> +	if (!dax_dev->alive)
> +		return -ENXIO;
> +
> +	/* prevent private / writable mappings from being established */
> +	if ((vma->vm_flags & (VM_NORESERVE|VM_SHARED|VM_WRITE)) == VM_WRITE) {
> +		dev_dbg(dev, "%s: %s: fail, attempted private mapping\n",
> +				current->comm, func);

This deserves a higher log-level than debug, IMHO.

> +		return -EINVAL;
> +	}
> +
> +	mask = dax_region->align - 1;
> +	if (vma->vm_start & mask || vma->vm_end & mask) {
> +		dev_dbg(dev, "%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
> +				current->comm, func, vma->vm_start, vma->vm_end,
> +				mask);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
> +			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
> +		dev_dbg(dev, "%s: %s: fail, dax range requires MADV_DONTFORK\n",
> +				current->comm, func);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	if (!vma_is_dax(vma)) {
> +		dev_dbg(dev, "%s: %s: fail, vma is not DAX capable\n",
> +				current->comm, func);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
> +		unsigned long size)
> +{
> +	struct resource *res;
> +	phys_addr_t phys;
> +	int i;
> +
> +	for (i = 0; i < dax_dev->num_resources; i++) {
> +		res = &dax_dev->res[i];
> +		phys = pgoff * PAGE_SIZE + res->start;
> +		if (phys >= res->start && phys <= res->end)
> +			break;
> +		pgoff -= PHYS_PFN(resource_size(res));
> +	}
> +
> +	if (i < dax_dev->num_resources) {
> +		res = &dax_dev->res[i];
> +		if (phys + size - 1 <= res->end)
> +			return phys;
> +	}
> +
> +	return -1;
> +}
> +
> +static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
> +		struct vm_fault *vmf)
> +{
> +	unsigned long vaddr = (unsigned long) vmf->virtual_address;
> +	struct device *dev = dax_dev->dev;
> +	struct dax_region *dax_region;
> +	int rc = VM_FAULT_SIGBUS;
> +	phys_addr_t phys;
> +	pfn_t pfn;
> +
> +	if (check_vma(dax_dev, vma, __func__))
> +		return VM_FAULT_SIGBUS;
> +
> +	dax_region = dax_dev->region;
> +	if (dax_region->align > PAGE_SIZE) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	phys = pgoff_to_phys(dax_dev, vmf->pgoff, PAGE_SIZE);
> +	if (phys == -1) {
> +		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
> +				vmf->pgoff);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +
> +	rc = vm_insert_mixed(vma, vaddr, pfn);
> +
> +	if (rc == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	if (rc < 0 && rc != -EBUSY)
> +		return VM_FAULT_SIGBUS;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int rc;
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
> +			current->comm, (vmf->flags & FAULT_FLAG_WRITE)
> +			? "write" : "read", vma->vm_start, vma->vm_end);
> +	rcu_read_lock();
> +	rc = __dax_dev_fault(dax_dev, vma, vmf);
> +	rcu_read_unlock();

Similarly, what are you protecting? I just see you're locking something to be
read, but don't do a rcu_dereference() to actually access a rcu protected
pointer. Or am I missing something totally here?

> +
> +	return rc;
> +}
> +
> +static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
> +		struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd,
> +		unsigned int flags)
> +{
> +	unsigned long pmd_addr = addr & PMD_MASK;
> +	struct device *dev = dax_dev->dev;
> +	struct dax_region *dax_region;
> +	phys_addr_t phys;
> +	pgoff_t pgoff;
> +	pfn_t pfn;
> +
> +	if (check_vma(dax_dev, vma, __func__))
> +		return VM_FAULT_SIGBUS;
> +
> +	dax_region = dax_dev->region;
> +	if (dax_region->align > PMD_SIZE) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/* dax pmd mappings require pfn_t_devmap() */
> +	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	phys = pgoff_to_phys(dax_dev, pgoff, PAGE_SIZE);
> +	if (phys == -1) {
> +		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
> +				pgoff);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +
> +	return vmf_insert_pfn_pmd(vma, addr, pmd, pfn,
> +			flags & FAULT_FLAG_WRITE);
> +}
> +
> +static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> +		pmd_t *pmd, unsigned int flags)
> +{
> +	int rc;
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
> +			current->comm, (flags & FAULT_FLAG_WRITE)
> +			? "write" : "read", vma->vm_start, vma->vm_end);
> +
> +	rcu_read_lock();
> +	rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
> +	rcu_read_unlock();
> +
> +	return rc;
> +}
> +
> +static void dax_dev_vm_open(struct vm_area_struct *vma)
> +{
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	kref_get(&dax_dev->kref);
> +}
> +
> +static void dax_dev_vm_close(struct vm_area_struct *vma)
> +{
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	dax_dev_put(dax_dev);
> +}
> +
> +static const struct vm_operations_struct dax_dev_vm_ops = {
> +	.fault = dax_dev_fault,
> +	.pmd_fault = dax_dev_pmd_fault,
> +	.open = dax_dev_vm_open,
> +	.close = dax_dev_vm_close,
> +};
> +
> +static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	struct dax_dev *dax_dev = filp->private_data;
> +	int rc;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +
> +	rc = check_vma(dax_dev, vma, __func__);
> +	if (rc)
> +		return rc;
> +
> +	kref_get(&dax_dev->kref);
> +	vma->vm_ops = &dax_dev_vm_ops;
> +	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> +	return 0;
> +
> +}
> +
>  static const struct file_operations dax_fops = {
>  	.llseek = noop_llseek,
>  	.owner = THIS_MODULE,
> +	.open = dax_dev_open,
> +	.release = dax_dev_release,
> +	.get_unmapped_area = dax_dev_get_unmapped_area,
> +	.mmap = dax_dev_mmap,
>  };
>
>  static int __init dax_init(void)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86f9f8b82f8e..52ea012d8a80 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1013,6 +1013,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
>  	return VM_FAULT_NOPAGE;
>  }
> +EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>
>  static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 19d0d08b396f..b14e98129b07 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -624,6 +624,7 @@ pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
>  {
>  	return vma_hugecache_offset(hstate_vma(vma), vma, address);
>  }
> +EXPORT_SYMBOL_GPL(linear_hugepage_index);
>
>  /*
>   * Return the size of the pages allocated when backing a VMA. In the majority
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

--
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply [flat|nested] 37+ messages in thread
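Annotation on the check_vma() hunks discussed in the reply above: the two purely arithmetic tests (the vm_flags comparison and the alignment mask) are easy to misread, so here is a small user-space sketch of what the expressions evaluate to. It rests on assumptions: the SK_VM_* constants are illustrative stand-ins for the kernel's VM_* bits (only their distinctness matters here), sk_rejects_flags() and sk_misaligned() are hypothetical helper names, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; only their distinctness matters for the sketch. */
#define SK_VM_WRITE	0x00000002UL
#define SK_VM_SHARED	0x00000008UL
#define SK_VM_NORESERVE	0x00200000UL

/* True when the vm_flags comparison in check_vma() would reject the vma. */
static bool sk_rejects_flags(unsigned long vm_flags)
{
	return (vm_flags & (SK_VM_NORESERVE | SK_VM_SHARED | SK_VM_WRITE))
		== SK_VM_WRITE;
}

/* True when either end of the vma is off the region alignment. */
static bool sk_misaligned(unsigned long start, unsigned long end,
		unsigned long align)
{
	unsigned long mask = align - 1;

	/* assumption: alignment is a non-zero power of two */
	assert(align != 0 && (align & mask) == 0);

	return (start & mask) || (end & mask);
}
```

Read literally, the flags comparison fires only when the write bit is set while both the shared and the no-reserve bits are clear. A vma without the write bit passes this particular test whether or not it is shared. The alignment test rejects a vma if either its start or its end is not a multiple of the region alignment, so with a 2MB alignment a 1MB-long mapping fails even when its start is aligned.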
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap @ 2016-05-17 10:57 ` Johannes Thumshirn 0 siblings, 0 replies; 37+ messages in thread From: Johannes Thumshirn @ 2016-05-17 10:57 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Dave Hansen, linux-kernel, hch, linux-block, Andrew Morton On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote: > The "Device DAX" core enables dax mappings of performance / feature > differentiated memory. An open mapping or file handle keeps the backing > struct device live, but new mappings are only possible while the device > is enabled. Faults are handled under rcu_read_lock to synchronize > with the enabled state of the device. > > Similar to the filesystem-dax case the backing memory may optionally > have struct page entries. However, unlike fs-dax there is no support > for private mappings, or mappings that are not backed by media (see > use of zero-page in fs-dax). > > Mappings are always guaranteed to match the alignment of the dax_region. > If the dax_region is configured to have a 2MB alignment, all mappings > are guaranteed to be backed by a pmd entry. Contrast this determinism > with the fs-dax case where pmd mappings are opportunistic. If userspace > attempts to force a misaligned mapping, the driver will fail the mmap > attempt. See dax_dev_check_vma() for other scenarios that are rejected, > like MAP_PRIVATE mappings. 
> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/dax/Kconfig | 1 > drivers/dax/dax.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/huge_memory.c | 1 > mm/hugetlb.c | 1 > 4 files changed, 319 insertions(+) > > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig > index 86ffbaa891ad..cedab7572de3 100644 > --- a/drivers/dax/Kconfig > +++ b/drivers/dax/Kconfig > @@ -1,6 +1,7 @@ > menuconfig DEV_DAX > tristate "DAX: direct access to differentiated memory" > default m if NVDIMM_DAX > + depends on TRANSPARENT_HUGEPAGE > help > Support raw access to differentiated (persistence, bandwidth, > latency...) memory via an mmap(2) capable character > diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c > index 8207fb33a992..b2fe8a0ce866 100644 > --- a/drivers/dax/dax.c > +++ b/drivers/dax/dax.c > @@ -49,6 +49,7 @@ struct dax_region { > * @region - parent region > * @dev - device backing the character device > * @kref - enable this data to be tracked in filp->private_data > + * @alive - !alive + rcu grace period == no new mappings can be established > * @id - child id in the region > * @num_resources - number of physical address extents in this device > * @res - array of physical address ranges > @@ -57,6 +58,7 @@ struct dax_dev { > struct dax_region *region; > struct device *dev; > struct kref kref; > + bool alive; > int id; > int num_resources; > struct resource res[0]; > @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev) > > dev_dbg(dev, "%s\n", __func__); > > + /* disable and flush fault handlers, TODO unmap inodes */ > + dax_dev->alive = false; > + synchronize_rcu(); > + IIRC RCU is only protecting a pointer, not the content of the pointer, so this looks wrong to me. 
>  	get_device(dev);
>  	device_unregister(dev);
>  	ida_simple_remove(&dax_region->ida, dax_dev->id);
> @@ -173,6 +179,7 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>  	memcpy(dax_dev->res, res, sizeof(*res) * count);
>  	dax_dev->num_resources = count;
>  	kref_init(&dax_dev->kref);
> +	dax_dev->alive = true;
>  	dax_dev->region = dax_region;
>  	kref_get(&dax_region->kref);
>
> @@ -217,9 +224,318 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>  }
>  EXPORT_SYMBOL_GPL(devm_create_dax_dev);
>
> +/* return an unmapped area aligned to the dax region specified alignment */
> +static unsigned long dax_dev_get_unmapped_area(struct file *filp,
> +		unsigned long addr, unsigned long len, unsigned long pgoff,
> +		unsigned long flags)
> +{
> +	unsigned long off, off_end, off_align, len_align, addr_align, align;
> +	struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
> +	struct dax_region *dax_region;
> +
> +	if (!dax_dev || addr)
> +		goto out;
> +
> +	dax_region = dax_dev->region;
> +	align = dax_region->align;
> +	off = pgoff << PAGE_SHIFT;
> +	off_end = off + len;
> +	off_align = round_up(off, align);
> +
> +	if ((off_end <= off_align) || ((off_end - off_align) < align))
> +		goto out;
> +
> +	len_align = len + align;
> +	if ((off + len_align) < off)
> +		goto out;
> +
> +	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
> +			pgoff, flags);
> +	if (!IS_ERR_VALUE(addr_align)) {
> +		addr_align += (off - addr_align) & (align - 1);
> +		return addr_align;
> +	}
> + out:
> +	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
> +}
> +
> +static int __match_devt(struct device *dev, const void *data)
> +{
> +	const dev_t *devt = data;
> +
> +	return dev->devt == *devt;
> +}
> +
> +static struct device *dax_dev_find(dev_t dev_t)
> +{
> +	return class_find_device(dax_class, NULL, &dev_t, __match_devt);
> +}
> +
> +static int dax_dev_open(struct inode *inode, struct file *filp)
> +{
> +	struct dax_dev *dax_dev = NULL;
> +	struct device *dev;
> +
> +	dev = dax_dev_find(inode->i_rdev);
> +	if (!dev)
> +		return -ENXIO;
> +
> +	device_lock(dev);
> +	dax_dev = dev_get_drvdata(dev);
> +	if (dax_dev) {
> +		dev_dbg(dev, "%s\n", __func__);
> +		filp->private_data = dax_dev;
> +		kref_get(&dax_dev->kref);
> +		inode->i_flags = S_DAX;
> +	}
> +	device_unlock(dev);
> +

Does this block really need to be protected by the dev mutex? If yes, have you
considered re-ordering it like this?

	device_lock(dev);
	dax_dev = dev_get_drvdata(dev);
	if (!dax_dev) {
		device_unlock(dev);
		goto out_put;
	}
	filp->private_data = dax_dev;
	kref_get(&dax_dev->kref); // or get_dax_device(dax_dev)
	inode->i_flags = S_DAX;
	device_unlock(dev);

	return 0;

out_put:
	put_device(dev);
	return -ENXIO;

The only thing I see that could be needed to be protected here, is the
inode->i_flags and shouldn't that be protected by the inode->i_mutex? But I'm
not sure, hence the question. Also S_DAX is the only valid flag for a DAX
device, isn't it?

> +	if (!dax_dev) {
> +		put_device(dev);
> +		return -ENXIO;
> +	}
> +	return 0;
> +}
> +
> +static int dax_dev_release(struct inode *inode, struct file *filp)
> +{
> +	struct dax_dev *dax_dev = filp->private_data;
> +	struct device *dev = dax_dev->dev;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	dax_dev_put(dax_dev);
> +	put_device(dev);
> +

For reasons of consistency one could reset the inode's i_flags here.

> +	return 0;
> +}
> +
> +static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
> +		const char *func)
> +{
> +	struct dax_region *dax_region = dax_dev->region;
> +	struct device *dev = dax_dev->dev;
> +	unsigned long mask;
> +
> +	if (!dax_dev->alive)
> +		return -ENXIO;
> +
> +	/* prevent private / writable mappings from being established */
> +	if ((vma->vm_flags & (VM_NORESERVE|VM_SHARED|VM_WRITE)) == VM_WRITE) {
> +		dev_dbg(dev, "%s: %s: fail, attempted private mapping\n",
> +				current->comm, func);

This deserves a higher log-level than debug, IMHO.

> +		return -EINVAL;
> +	}
> +
> +	mask = dax_region->align - 1;
> +	if (vma->vm_start & mask || vma->vm_end & mask) {
> +		dev_dbg(dev, "%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
> +				current->comm, func, vma->vm_start, vma->vm_end,
> +				mask);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
> +			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
> +		dev_dbg(dev, "%s: %s: fail, dax range requires MADV_DONTFORK\n",
> +				current->comm, func);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	if (!vma_is_dax(vma)) {
> +		dev_dbg(dev, "%s: %s: fail, vma is not DAX capable\n",
> +				current->comm, func);

Ditto.

> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
> +		unsigned long size)
> +{
> +	struct resource *res;
> +	phys_addr_t phys;
> +	int i;
> +
> +	for (i = 0; i < dax_dev->num_resources; i++) {
> +		res = &dax_dev->res[i];
> +		phys = pgoff * PAGE_SIZE + res->start;
> +		if (phys >= res->start && phys <= res->end)
> +			break;
> +		pgoff -= PHYS_PFN(resource_size(res));
> +	}
> +
> +	if (i < dax_dev->num_resources) {
> +		res = &dax_dev->res[i];
> +		if (phys + size - 1 <= res->end)
> +			return phys;
> +	}
> +
> +	return -1;
> +}
> +
> +static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma,
> +		struct vm_fault *vmf)
> +{
> +	unsigned long vaddr = (unsigned long) vmf->virtual_address;
> +	struct device *dev = dax_dev->dev;
> +	struct dax_region *dax_region;
> +	int rc = VM_FAULT_SIGBUS;
> +	phys_addr_t phys;
> +	pfn_t pfn;
> +
> +	if (check_vma(dax_dev, vma, __func__))
> +		return VM_FAULT_SIGBUS;
> +
> +	dax_region = dax_dev->region;
> +	if (dax_region->align > PAGE_SIZE) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	phys = pgoff_to_phys(dax_dev, vmf->pgoff, PAGE_SIZE);
> +	if (phys == -1) {
> +		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
> +				vmf->pgoff);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +
> +	rc = vm_insert_mixed(vma, vaddr, pfn);
> +
> +	if (rc == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	if (rc < 0 && rc != -EBUSY)
> +		return VM_FAULT_SIGBUS;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int rc;
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
> +			current->comm, (vmf->flags & FAULT_FLAG_WRITE)
> +			? "write" : "read", vma->vm_start, vma->vm_end);
> +	rcu_read_lock();
> +	rc = __dax_dev_fault(dax_dev, vma, vmf);
> +	rcu_read_unlock();

Similarly, what are you protecting? I just see you're locking something to be
read, but don't do a rcu_dereference() to actually access a rcu protected
pointer. Or am I missing something totally here?

> +
> +	return rc;
> +}
> +
> +static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
> +		struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd,
> +		unsigned int flags)
> +{
> +	unsigned long pmd_addr = addr & PMD_MASK;
> +	struct device *dev = dax_dev->dev;
> +	struct dax_region *dax_region;
> +	phys_addr_t phys;
> +	pgoff_t pgoff;
> +	pfn_t pfn;
> +
> +	if (check_vma(dax_dev, vma, __func__))
> +		return VM_FAULT_SIGBUS;
> +
> +	dax_region = dax_dev->region;
> +	if (dax_region->align > PMD_SIZE) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/* dax pmd mappings require pfn_t_devmap() */
> +	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
> +		dev_dbg(dev, "%s: alignment > fault size\n", __func__);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	phys = pgoff_to_phys(dax_dev, pgoff, PAGE_SIZE);
> +	if (phys == -1) {
> +		dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__,
> +				pgoff);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
> +
> +	return vmf_insert_pfn_pmd(vma, addr, pmd, pfn,
> +			flags & FAULT_FLAG_WRITE);
> +}
> +
> +static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> +		pmd_t *pmd, unsigned int flags)
> +{
> +	int rc;
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
> +			current->comm, (flags & FAULT_FLAG_WRITE)
> +			? "write" : "read", vma->vm_start, vma->vm_end);
> +
> +	rcu_read_lock();
> +	rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
> +	rcu_read_unlock();
> +
> +	return rc;
> +}
> +
> +static void dax_dev_vm_open(struct vm_area_struct *vma)
> +{
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	kref_get(&dax_dev->kref);
> +}
> +
> +static void dax_dev_vm_close(struct vm_area_struct *vma)
> +{
> +	struct file *filp = vma->vm_file;
> +	struct dax_dev *dax_dev = filp->private_data;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +	dax_dev_put(dax_dev);
> +}
> +
> +static const struct vm_operations_struct dax_dev_vm_ops = {
> +	.fault = dax_dev_fault,
> +	.pmd_fault = dax_dev_pmd_fault,
> +	.open = dax_dev_vm_open,
> +	.close = dax_dev_vm_close,
> +};
> +
> +static int dax_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	struct dax_dev *dax_dev = filp->private_data;
> +	int rc;
> +
> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
> +
> +	rc = check_vma(dax_dev, vma, __func__);
> +	if (rc)
> +		return rc;
> +
> +	kref_get(&dax_dev->kref);
> +	vma->vm_ops = &dax_dev_vm_ops;
> +	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> +	return 0;
> +
> +}
> +
>  static const struct file_operations dax_fops = {
>  	.llseek = noop_llseek,
>  	.owner = THIS_MODULE,
> +	.open = dax_dev_open,
> +	.release = dax_dev_release,
> +	.get_unmapped_area = dax_dev_get_unmapped_area,
> +	.mmap = dax_dev_mmap,
>  };
>
>  static int __init dax_init(void)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86f9f8b82f8e..52ea012d8a80 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1013,6 +1013,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
>  	return VM_FAULT_NOPAGE;
>  }
> +EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>
>  static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 19d0d08b396f..b14e98129b07 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -624,6 +624,7 @@ pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
>  {
>  	return vma_hugecache_offset(hstate_vma(vma), vma, address);
>  }
> +EXPORT_SYMBOL_GPL(linear_hugepage_index);
>
>  /*
>   * Return the size of the pages allocated when backing a VMA. In the majority
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

--
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply [flat|nested] 37+ messages in thread
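
For readers following the pgoff_to_phys() helper quoted in the patch above, here is a minimal userspace sketch of the same idea: a device is a list of physical extents, and a page offset is resolved by walking them in order. It is only a sketch, not the driver code. The names struct extent, SKETCH_PAGE_SIZE and sketch_pgoff_to_phys() are hypothetical, a fixed 4096-byte page is assumed, and the walk compares page counts rather than physical addresses as the patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the kernel's struct resource (inclusive range). */
struct extent {
	uint64_t start; /* first byte of the range */
	uint64_t end;   /* last byte of the range, inclusive */
};

/*
 * Resolve a page offset to a physical address by walking the extents in
 * order. Returns UINT64_MAX when the offset lies past the last extent, or
 * when [phys, phys + size) would run past the end of the extent it lands in.
 */
static uint64_t sketch_pgoff_to_phys(const struct extent *res, size_t n,
				     uint64_t pgoff, uint64_t size)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t pages = (res[i].end - res[i].start + 1) / SKETCH_PAGE_SIZE;

		if (pgoff < pages) {
			uint64_t phys = res[i].start + pgoff * SKETCH_PAGE_SIZE;

			/* the whole mapping unit must fit inside this extent */
			if (phys + size - 1 <= res[i].end)
				return phys;
			return UINT64_MAX;
		}
		pgoff -= pages; /* skip this extent and try the next one */
	}
	return UINT64_MAX;
}
```

With two extents of two pages and one page, offsets 0 and 1 resolve into the first extent, offset 2 resolves to the start of the second, and anything that would straddle an extent boundary is rejected, which is the property the fault handlers rely on before inserting a pfn.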
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-17 10:57 ` Johannes Thumshirn
(?)
@ 2016-05-17 22:19 ` Dan Williams
-1 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-17 22:19 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton

On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
>> The "Device DAX" core enables dax mappings of performance / feature
>> differentiated memory. An open mapping or file handle keeps the backing
>> struct device live, but new mappings are only possible while the device
>> is enabled. Faults are handled under rcu_read_lock to synchronize
>> with the enabled state of the device.
>>
>> Similar to the filesystem-dax case the backing memory may optionally
>> have struct page entries. However, unlike fs-dax there is no support
>> for private mappings, or mappings that are not backed by media (see
>> use of zero-page in fs-dax).
>>
>> Mappings are always guaranteed to match the alignment of the dax_region.
>> If the dax_region is configured to have a 2MB alignment, all mappings
>> are guaranteed to be backed by a pmd entry. Contrast this determinism
>> with the fs-dax case where pmd mappings are opportunistic. If userspace
>> attempts to force a misaligned mapping, the driver will fail the mmap
>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
>> like MAP_PRIVATE mappings.
>>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/dax/Kconfig |    1
>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  mm/huge_memory.c    |    1
>>  mm/hugetlb.c        |    1
>>  4 files changed, 319 insertions(+)
>>
>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>> index 86ffbaa891ad..cedab7572de3 100644
>> --- a/drivers/dax/Kconfig
>> +++ b/drivers/dax/Kconfig
>> @@ -1,6 +1,7 @@
>>  menuconfig DEV_DAX
>>  	tristate "DAX: direct access to differentiated memory"
>>  	default m if NVDIMM_DAX
>> +	depends on TRANSPARENT_HUGEPAGE
>>  	help
>>  	  Support raw access to differentiated (persistence, bandwidth,
>>  	  latency...) memory via an mmap(2) capable character
>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> index 8207fb33a992..b2fe8a0ce866 100644
>> --- a/drivers/dax/dax.c
>> +++ b/drivers/dax/dax.c
>> @@ -49,6 +49,7 @@ struct dax_region {
>>   * @region - parent region
>>   * @dev - device backing the character device
>>   * @kref - enable this data to be tracked in filp->private_data
>> + * @alive - !alive + rcu grace period == no new mappings can be established
>>   * @id - child id in the region
>>   * @num_resources - number of physical address extents in this device
>>   * @res - array of physical address ranges
>> @@ -57,6 +58,7 @@ struct dax_dev {
>>  	struct dax_region *region;
>>  	struct device *dev;
>>  	struct kref kref;
>> +	bool alive;
>>  	int id;
>>  	int num_resources;
>>  	struct resource res[0];
>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>>
>>  	dev_dbg(dev, "%s\n", __func__);
>>
>> +	/* disable and flush fault handlers, TODO unmap inodes */
>> +	dax_dev->alive = false;
>> +	synchronize_rcu();
>> +
>
> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> looks wrong to me.

The driver is using RCU to guarantee that all currently running fault
handlers have either completed or will see the new state of ->alive
when they start. Reference counts are protecting the actual dax_dev
object.

>>  	get_device(dev);
>>  	device_unregister(dev);
>>  	ida_simple_remove(&dax_region->ida, dax_dev->id);
>> @@ -173,6 +179,7 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>>  	memcpy(dax_dev->res, res, sizeof(*res) * count);
>>  	dax_dev->num_resources = count;
>>  	kref_init(&dax_dev->kref);
>> +	dax_dev->alive = true;
>>  	dax_dev->region = dax_region;
>>  	kref_get(&dax_region->kref);
>>
>> @@ -217,9 +224,318 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *res,
>>  }
>>  EXPORT_SYMBOL_GPL(devm_create_dax_dev);
>>
>> +/* return an unmapped area aligned to the dax region specified alignment */
>> +static unsigned long dax_dev_get_unmapped_area(struct file *filp,
>> +		unsigned long addr, unsigned long len, unsigned long pgoff,
>> +		unsigned long flags)
>> +{
>> +	unsigned long off, off_end, off_align, len_align, addr_align, align;
>> +	struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
>> +	struct dax_region *dax_region;
>> +
>> +	if (!dax_dev || addr)
>> +		goto out;
>> +
>> +	dax_region = dax_dev->region;
>> +	align = dax_region->align;
>> +	off = pgoff << PAGE_SHIFT;
>> +	off_end = off + len;
>> +	off_align = round_up(off, align);
>> +
>> +	if ((off_end <= off_align) || ((off_end - off_align) < align))
>> +		goto out;
>> +
>> +	len_align = len + align;
>> +	if ((off + len_align) < off)
>> +		goto out;
>> +
>> +	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
>> +			pgoff, flags);
>> +	if (!IS_ERR_VALUE(addr_align)) {
>> +		addr_align += (off - addr_align) & (align - 1);
>> +		return addr_align;
>> +	}
>> + out:
>> +	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
>> +}
>> +
>> +static int __match_devt(struct device *dev, const void *data)
>> +{
>> +	const dev_t *devt = data;
>> +
>> +	return dev->devt == *devt;
>> +}
>> +
>> +static struct device *dax_dev_find(dev_t dev_t)
>> +{
>> +	return class_find_device(dax_class, NULL, &dev_t, __match_devt);
>> +}
>> +
>> +static int dax_dev_open(struct inode *inode, struct file *filp)
>> +{
>> +	struct dax_dev *dax_dev = NULL;
>> +	struct device *dev;
>> +
>> +	dev = dax_dev_find(inode->i_rdev);
>> +	if (!dev)
>> +		return -ENXIO;
>> +
>> +	device_lock(dev);
>> +	dax_dev = dev_get_drvdata(dev);
>> +	if (dax_dev) {
>> +		dev_dbg(dev, "%s\n", __func__);
>> +		filp->private_data = dax_dev;
>> +		kref_get(&dax_dev->kref);
>> +		inode->i_flags = S_DAX;
>> +	}
>> +	device_unlock(dev);
>> +
>
> Does this block really need to be protected by the dev mutex? If yes, have you
> considered re-ordering it like this?

We need to make sure the dax_dev is not freed while we're trying to
take a reference against it, hence the lock.

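
As an illustration of the point above (look up the object and take a reference under the same lock that teardown uses), here is a minimal userspace sketch. It is not the driver code: struct obj, struct slot, slot_get() and slot_clear() are hypothetical names, a pthread mutex stands in for device_lock(), a plain counter modified only under that mutex stands in for the kref, and it assumes teardown clears the published pointer under the same lock, which the quoted hunk does not show.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for dax_dev: refs is only touched under slot->lock. */
struct obj {
	int refs;
};

/* Hypothetical stand-in for struct device: a lock plus a drvdata pointer. */
struct slot {
	pthread_mutex_t lock;
	struct obj *data; /* set to NULL by teardown, under the lock */
};

/* Open path: returns a referenced object, or NULL if teardown already ran. */
static struct obj *slot_get(struct slot *s)
{
	struct obj *o;

	pthread_mutex_lock(&s->lock);
	o = s->data;
	if (o)
		o->refs++; /* safe: teardown cannot unpublish while we hold the lock */
	pthread_mutex_unlock(&s->lock);
	return o;
}

/* Teardown path: unpublish the object; the caller then drops its reference. */
static struct obj *slot_clear(struct slot *s)
{
	struct obj *o;

	pthread_mutex_lock(&s->lock);
	o = s->data;
	s->data = NULL;
	pthread_mutex_unlock(&s->lock);
	return o;
}
```

The design point is that the NULL check and the reference increment happen in one critical section, so an opener either sees NULL or ends up holding a reference that keeps the object alive after the lock is dropped.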
>
>         device_lock(dev);
>         dax_dev = dev_get_drvdata(dev);
>         if (!dax_dev) {
>                 device_unlock(dev);
>                 goto out_put;
>         }
>         filp->private_data = dax_dev;
>         kref_get(&dax_dev->kref); // or get_dax_device(dax_dev)

dax_dev may already be invalid at this point if open() is racing
device_unregister().

>         inode->i_flags = S_DAX;
>         device_unlock(dev);
>
>         return 0;
>
> out_put:
>         put_device(dev);
>         return -ENXIO;
>
> The only thing I see that could be needed to be protected here, is the
> inode->i_flags and shouldn't that be protected by the inode->i_mutex? But I'm
> not sure, hence the question.

Nothing else should be mutating the inode flags so I don't think we
need protection.

> Also S_DAX is the only valid flag for a DAX
> device, isn't it?

Correct, we need that capability so that vma_is_dax() will recognize
these mappings.

>
>> +	if (!dax_dev) {
>> +		put_device(dev);
>> +		return -ENXIO;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int dax_dev_release(struct inode *inode, struct file *filp)
>> +{
>> +	struct dax_dev *dax_dev = filp->private_data;
>> +	struct device *dev = dax_dev->dev;
>> +
>> +	dev_dbg(dax_dev->dev, "%s\n", __func__);
>> +	dax_dev_put(dax_dev);
>> +	put_device(dev);
>> +
>
> For reasons of consistency one could reset the inode's i_flags here.

That change seems like a bug to me, it doesn't stop being a dax inode
just because an open file is released, right?

>> +	return 0;
>> +}
>> +
>> +static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
>> +		const char *func)
>> +{
>> +	struct dax_region *dax_region = dax_dev->region;
>> +	struct device *dev = dax_dev->dev;
>> +	unsigned long mask;
>> +
>> +	if (!dax_dev->alive)
>> +		return -ENXIO;
>> +
>> +	/* prevent private / writable mappings from being established */
>> +	if ((vma->vm_flags & (VM_NORESERVE|VM_SHARED|VM_WRITE)) == VM_WRITE) {
>> +		dev_dbg(dev, "%s: %s: fail, attempted private mapping\n",
>> +				current->comm, func);
>
> This deserves a higher log-level than debug, IMHO.
> >> + return -EINVAL; >> + } >> + >> + mask = dax_region->align - 1; >> + if (vma->vm_start & mask || vma->vm_end & mask) { >> + dev_dbg(dev, "%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n", >> + current->comm, func, vma->vm_start, vma->vm_end, >> + mask); > > Ditto. > >> + return -EINVAL; >> + } >> + >> + if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV >> + && (vma->vm_flags & VM_DONTCOPY) == 0) { >> + dev_dbg(dev, "%s: %s: fail, dax range requires MADV_DONTFORK\n", >> + current->comm, func); > > Ditto. > >> + return -EINVAL; >> + } >> + >> + if (!vma_is_dax(vma)) { >> + dev_dbg(dev, "%s: %s: fail, vma is not DAX capable\n", >> + current->comm, func); > > Ditto. Ok, will bump these up to dev_info. >> + return -EINVAL; >> + } >> + >> + return 0; >> +} >> + >> +static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff, >> + unsigned long size) >> +{ >> + struct resource *res; >> + phys_addr_t phys; >> + int i; >> + >> + for (i = 0; i < dax_dev->num_resources; i++) { >> + res = &dax_dev->res[i]; >> + phys = pgoff * PAGE_SIZE + res->start; >> + if (phys >= res->start && phys <= res->end) >> + break; >> + pgoff -= PHYS_PFN(resource_size(res)); >> + } >> + >> + if (i < dax_dev->num_resources) { >> + res = &dax_dev->res[i]; >> + if (phys + size - 1 <= res->end) >> + return phys; >> + } >> + >> + return -1; >> +} >> + >> +static int __dax_dev_fault(struct dax_dev *dax_dev, struct vm_area_struct *vma, >> + struct vm_fault *vmf) >> +{ >> + unsigned long vaddr = (unsigned long) vmf->virtual_address; >> + struct device *dev = dax_dev->dev; >> + struct dax_region *dax_region; >> + int rc = VM_FAULT_SIGBUS; >> + phys_addr_t phys; >> + pfn_t pfn; >> + >> + if (check_vma(dax_dev, vma, __func__)) >> + return VM_FAULT_SIGBUS; >> + >> + dax_region = dax_dev->region; >> + if (dax_region->align > PAGE_SIZE) { >> + dev_dbg(dev, "%s: alignment > fault size\n", __func__); >> + return VM_FAULT_SIGBUS; >> + } >> + >> + phys = pgoff_to_phys(dax_dev, vmf->pgoff, 
PAGE_SIZE); >> + if (phys == -1) { >> + dev_dbg(dev, "%s: phys_to_pgoff(%#lx) failed\n", __func__, >> + vmf->pgoff); >> + return VM_FAULT_SIGBUS; >> + } >> + >> + pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); >> + >> + rc = vm_insert_mixed(vma, vaddr, pfn); >> + >> + if (rc == -ENOMEM) >> + return VM_FAULT_OOM; >> + if (rc < 0 && rc != -EBUSY) >> + return VM_FAULT_SIGBUS; >> + >> + return VM_FAULT_NOPAGE; >> +} >> + >> +static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf) >> +{ >> + int rc; >> + struct file *filp = vma->vm_file; >> + struct dax_dev *dax_dev = filp->private_data; >> + >> + dev_dbg(dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__, >> + current->comm, (vmf->flags & FAULT_FLAG_WRITE) >> + ? "write" : "read", vma->vm_start, vma->vm_end); >> + rcu_read_lock(); >> + rc = __dax_dev_fault(dax_dev, vma, vmf); >> + rcu_read_unlock(); > > Similarly, what are you protecting? I just see you're locking something to be > read, but don't do a rcu_dereference() to actually access a rcu protected > pointer. Or am I missing something totally here? Like I mentioned above, I'm protecting the fault handler vs shutdown. I need this building-block-guarantee later when I go to fix the vfs to unmap inodes on device-removal. Same bug currently in filesystem-DAX [1]. [1]: http://thread.gmane.org/gmane.linux.file-systems/103333/focus=72179 [..] Thanks for the review Johannes. ^ permalink raw reply [flat|nested] 37+ messages in thread
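The address fix-up in the quoted dax_dev_get_unmapped_area() can be checked in
isolation. What follows is only a userspace sketch of that one line of
arithmetic, not driver code: the helper name toy_align_to_offset() is
hypothetical, uint64_t stands in for the kernel's unsigned long, and align is
assumed to be a power of two. It shows that adding (off - addr) & (align - 1)
moves addr forward by less than align until addr and off are congruent modulo
align, which appears to be why the patch asks for len + align bytes first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the fix-up line in the quoted patch:
 * advance addr by less than align so that addr and off end up congruent
 * modulo align. align is assumed to be a power of two.
 */
static uint64_t toy_align_to_offset(uint64_t addr, uint64_t off, uint64_t align)
{
	return addr + ((off - addr) & (align - 1));
}
```

With a 2MB alignment and a file offset of 0, a candidate address of
0x7f0000001000 would be moved up to the next 2MB boundary, while an address
that is already aligned is returned unchanged.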
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-17 22:19 ` Dan Williams
@ 2016-05-18 8:07 ` Hannes Reinecke
-1 siblings, 0 replies; 37+ messages in thread
From: Hannes Reinecke @ 2016-05-18 8:07 UTC (permalink / raw)
To: Dan Williams, Johannes Thumshirn
Cc: linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton, Paul Mackerras
On 05/18/2016 12:19 AM, Dan Williams wrote:
> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
>>> The "Device DAX" core enables dax mappings of performance / feature
>>> differentiated memory. An open mapping or file handle keeps the backing
>>> struct device live, but new mappings are only possible while the device
>>> is enabled. Faults are handled under rcu_read_lock to synchronize
>>> with the enabled state of the device.
>>>
>>> Similar to the filesystem-dax case the backing memory may optionally
>>> have struct page entries. However, unlike fs-dax there is no support
>>> for private mappings, or mappings that are not backed by media (see
>>> use of zero-page in fs-dax).
>>>
>>> Mappings are always guaranteed to match the alignment of the dax_region.
>>> If the dax_region is configured to have a 2MB alignment, all mappings
>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
>>> attempts to force a misaligned mapping, the driver will fail the mmap
>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
>>> like MAP_PRIVATE mappings.
>>>
>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  drivers/dax/Kconfig |    1
>>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  mm/huge_memory.c    |    1
>>>  mm/hugetlb.c        |    1
>>>  4 files changed, 319 insertions(+)
>>>
>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>>> index 86ffbaa891ad..cedab7572de3 100644
>>> --- a/drivers/dax/Kconfig
>>> +++ b/drivers/dax/Kconfig
>>> @@ -1,6 +1,7 @@
>>>  menuconfig DEV_DAX
>>>  	tristate "DAX: direct access to differentiated memory"
>>>  	default m if NVDIMM_DAX
>>> +	depends on TRANSPARENT_HUGEPAGE
>>>  	help
>>>  	  Support raw access to differentiated (persistence, bandwidth,
>>>  	  latency...) memory via an mmap(2) capable character
>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>>> index 8207fb33a992..b2fe8a0ce866 100644
>>> --- a/drivers/dax/dax.c
>>> +++ b/drivers/dax/dax.c
>>> @@ -49,6 +49,7 @@ struct dax_region {
>>>   * @region - parent region
>>>   * @dev - device backing the character device
>>>   * @kref - enable this data to be tracked in filp->private_data
>>> + * @alive - !alive + rcu grace period == no new mappings can be established
>>>   * @id - child id in the region
>>>   * @num_resources - number of physical address extents in this device
>>>   * @res - array of physical address ranges
>>> @@ -57,6 +58,7 @@ struct dax_dev {
>>>  	struct dax_region *region;
>>>  	struct device *dev;
>>>  	struct kref kref;
>>> +	bool alive;
>>>  	int id;
>>>  	int num_resources;
>>>  	struct resource res[0];
>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>>>
>>>  	dev_dbg(dev, "%s\n", __func__);
>>>
>>> +	/* disable and flush fault handlers, TODO unmap inodes */
>>> +	dax_dev->alive = false;
>>> +	synchronize_rcu();
>>> +
>>
>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
>> looks wrong to me.
>
> The driver is using RCU to guarantee that all currently running fault
> handlers have either completed or will see the new state of ->alive
> when they start. Reference counts are protecting the actual dax_dev
> object.
>
Hmm.
This is the same 'creative' RCU usage Mike Snitzer has been attempting
while trying to improve device-mapper performance.

From my understanding RCU is protecting the _pointer_, not the
values of the structure pointed to.
IOW we are guaranteed to have a valid pointer at any time.
But at the same time _no_ guarantee is made about the _contents_ of
the structure.
It might well be that using 'synchronize_rcu' gives you similar
results (as synchronize_rcu() is essentially waiting for an SMP grace
period, after which all CPUs should be seeing the update).
However, I haven't been able to find that this is a guaranteed
behaviour.
So from my understanding you have to use locking primitives
protecting the contents of the structure or exchange the _entire_
structure if you want to rely on RCU here.

Can we get some clarification here?
Paul?

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply [flat|nested] 37+ messages in thread
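The pattern being debated (clear ->alive, then wait until every fault handler
that might have seen the old value has finished) can be modeled in userspace
as a rough sketch. This is not RCU and does not settle the question raised
above about what synchronize_rcu() guarantees. As an assumption made purely
for illustration, a reader counter built on sequentially consistent C11
atomics stands in for rcu_read_lock() / synchronize_rcu(), and every toy_*
name as well as struct toy_dax_dev is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dax_dev; only ->alive is modeled. */
struct toy_dax_dev {
	atomic_bool alive;
};

/* Toy substitute for RCU: number of readers inside a read-side section. */
static atomic_int toy_readers;

static void toy_read_lock(void)
{
	atomic_fetch_add(&toy_readers, 1);
}

static void toy_read_unlock(void)
{
	atomic_fetch_sub(&toy_readers, 1);
}

/*
 * Spin until no reader is observed inside a read-side section. Unlike a
 * real grace period this can be delayed by readers that start later.
 */
static void toy_synchronize(void)
{
	while (atomic_load(&toy_readers) > 0)
		;
}

/* Same shape as the quoted fault path: test ->alive inside the section. */
static int toy_fault(struct toy_dax_dev *d)
{
	int rc;

	toy_read_lock();
	rc = atomic_load(&d->alive) ? 0 : -ENXIO;
	toy_read_unlock();
	return rc;
}

/* Same shape as destroy_dax_dev(): flip the flag, then drain readers. */
static void toy_destroy(struct toy_dax_dev *d)
{
	atomic_store(&d->alive, false);
	toy_synchronize();
}
```

In this model, once toy_destroy() returns, any toy_fault() call that starts
afterwards returns -ENXIO, and any call that observed alive == true has
already left its read-side section. The object itself is not protected by
this scheme; in the patch that job is left to the reference counts.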
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-18 8:07 ` Hannes Reinecke
@ 2016-05-18 9:10 ` Paul Mackerras
-1 siblings, 0 replies; 37+ messages in thread
From: Paul Mackerras @ 2016-05-18 9:10 UTC (permalink / raw)
To: Hannes Reinecke, Paul E. McKenney
Cc: Dan Williams, Johannes Thumshirn, linux-nvdimm@lists.01.org,
Dave Hansen, linux-kernel@vger.kernel.org, Christoph Hellwig,
linux-block, Andrew Morton
On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
> On 05/18/2016 12:19 AM, Dan Williams wrote:
> > On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> >> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> >>> The "Device DAX" core enables dax mappings of performance / feature
> >>> differentiated memory. An open mapping or file handle keeps the backing
> >>> struct device live, but new mappings are only possible while the device
> >>> is enabled. Faults are handled under rcu_read_lock to synchronize
> >>> with the enabled state of the device.
> >>>
> >>> Similar to the filesystem-dax case the backing memory may optionally
> >>> have struct page entries. However, unlike fs-dax there is no support
> >>> for private mappings, or mappings that are not backed by media (see
> >>> use of zero-page in fs-dax).
> >>>
> >>> Mappings are always guaranteed to match the alignment of the dax_region.
> >>> If the dax_region is configured to have a 2MB alignment, all mappings
> >>> are guaranteed to be backed by a pmd entry. Contrast this determinism
> >>> with the fs-dax case where pmd mappings are opportunistic. If userspace
> >>> attempts to force a misaligned mapping, the driver will fail the mmap
> >>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> >>> like MAP_PRIVATE mappings.
> >>>
> >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >>> Cc: Christoph Hellwig <hch@lst.de>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  drivers/dax/Kconfig |    1
> >>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  mm/huge_memory.c    |    1
> >>>  mm/hugetlb.c        |    1
> >>>  4 files changed, 319 insertions(+)
> >>>
> >>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> >>> index 86ffbaa891ad..cedab7572de3 100644
> >>> --- a/drivers/dax/Kconfig
> >>> +++ b/drivers/dax/Kconfig
> >>> @@ -1,6 +1,7 @@
> >>>  menuconfig DEV_DAX
> >>>  	tristate "DAX: direct access to differentiated memory"
> >>>  	default m if NVDIMM_DAX
> >>> +	depends on TRANSPARENT_HUGEPAGE
> >>>  	help
> >>>  	  Support raw access to differentiated (persistence, bandwidth,
> >>>  	  latency...) memory via an mmap(2) capable character
> >>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> >>> index 8207fb33a992..b2fe8a0ce866 100644
> >>> --- a/drivers/dax/dax.c
> >>> +++ b/drivers/dax/dax.c
> >>> @@ -49,6 +49,7 @@ struct dax_region {
> >>>   * @region - parent region
> >>>   * @dev - device backing the character device
> >>>   * @kref - enable this data to be tracked in filp->private_data
> >>> + * @alive - !alive + rcu grace period == no new mappings can be established
> >>>   * @id - child id in the region
> >>>   * @num_resources - number of physical address extents in this device
> >>>   * @res - array of physical address ranges
> >>> @@ -57,6 +58,7 @@ struct dax_dev {
> >>>  	struct dax_region *region;
> >>>  	struct device *dev;
> >>>  	struct kref kref;
> >>> +	bool alive;
> >>>  	int id;
> >>>  	int num_resources;
> >>>  	struct resource res[0];
> >>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
> >>>
> >>>  	dev_dbg(dev, "%s\n", __func__);
> >>>
> >>> +	/* disable and flush fault handlers, TODO unmap inodes */
> >>> +	dax_dev->alive = false;
> >>> +	synchronize_rcu();
> >>> +
> >>
> >> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> >> looks wrong to me.
> >
> > The driver is using RCU to guarantee that all currently running fault
> > handlers have either completed or will see the new state of ->alive
> > when they start. Reference counts are protecting the actual dax_dev
> > object.
> >
> Hmm.
> This is the same 'creative' RCU usage Mike Snitzer has been trying
> when trying to improve device-mapper performance.
>
> From my understanding RCU is protecting the _pointer_, not the
> values of the structure pointed to.
> IOW we are guaranteed to have a valid pointer at any time.
> But at the same time _no_ guarantee is made about the _contents_ of
> the structure.
> It might well be that using 'synchronize_rcu' gives you similar
> results (as synchronize_rcu() is essentially waiting a SMP grace
> period, after which all CPUs should be seeing the update).
> However, I haven't been able to find that this is a guaranteed
> behaviour.
> So from my understanding you have to use locking primitives
> protecting the contents of the structure or exchange the _entire_
> structure if you want to rely on RCU here.
>
> Can we get some clarification here?
> Paul?

I think you want the other Paul, Paul McKenney.

Paul.

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-18 9:10 ` Paul Mackerras
@ 2016-05-18 9:15 ` Hannes Reinecke
-1 siblings, 0 replies; 37+ messages in thread
From: Hannes Reinecke @ 2016-05-18 9:15 UTC (permalink / raw)
To: Paul Mackerras, Paul E. McKenney
Cc: Dan Williams, Johannes Thumshirn, linux-nvdimm@lists.01.org,
Dave Hansen, linux-kernel@vger.kernel.org, Christoph Hellwig,
linux-block, Andrew Morton
On 05/18/2016 11:10 AM, Paul Mackerras wrote:
> On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
>> On 05/18/2016 12:19 AM, Dan Williams wrote:
>>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
>>>>> The "Device DAX" core enables dax mappings of performance / feature
>>>>> differentiated memory. An open mapping or file handle keeps the backing
>>>>> struct device live, but new mappings are only possible while the device
>>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
>>>>> with the enabled state of the device.
>>>>>
>>>>> Similar to the filesystem-dax case the backing memory may optionally
>>>>> have struct page entries. However, unlike fs-dax there is no support
>>>>> for private mappings, or mappings that are not backed by media (see
>>>>> use of zero-page in fs-dax).
>>>>>
>>>>> Mappings are always guaranteed to match the alignment of the dax_region.
>>>>> If the dax_region is configured to have a 2MB alignment, all mappings
>>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
>>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
>>>>> attempts to force a misaligned mapping, the driver will fail the mmap
>>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
>>>>> like MAP_PRIVATE mappings.
>>>>>
>>>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>>>>> Cc: Christoph Hellwig <hch@lst.de>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>>>> ---
>>>>>  drivers/dax/Kconfig |    1
>>>>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  mm/huge_memory.c    |    1
>>>>>  mm/hugetlb.c        |    1
>>>>>  4 files changed, 319 insertions(+)
>>>>>
>>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>>>>> index 86ffbaa891ad..cedab7572de3 100644
>>>>> --- a/drivers/dax/Kconfig
>>>>> +++ b/drivers/dax/Kconfig
>>>>> @@ -1,6 +1,7 @@
>>>>>  menuconfig DEV_DAX
>>>>>  	tristate "DAX: direct access to differentiated memory"
>>>>>  	default m if NVDIMM_DAX
>>>>> +	depends on TRANSPARENT_HUGEPAGE
>>>>>  	help
>>>>>  	  Support raw access to differentiated (persistence, bandwidth,
>>>>>  	  latency...) memory via an mmap(2) capable character
>>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>>>>> index 8207fb33a992..b2fe8a0ce866 100644
>>>>> --- a/drivers/dax/dax.c
>>>>> +++ b/drivers/dax/dax.c
>>>>> @@ -49,6 +49,7 @@ struct dax_region {
>>>>>   * @region - parent region
>>>>>   * @dev - device backing the character device
>>>>>   * @kref - enable this data to be tracked in filp->private_data
>>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
>>>>>   * @id - child id in the region
>>>>>   * @num_resources - number of physical address extents in this device
>>>>>   * @res - array of physical address ranges
>>>>> @@ -57,6 +58,7 @@ struct dax_dev {
>>>>>  	struct dax_region *region;
>>>>>  	struct device *dev;
>>>>>  	struct kref kref;
>>>>> +	bool alive;
>>>>>  	int id;
>>>>>  	int num_resources;
>>>>>  	struct resource res[0];
>>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>>>>>
>>>>>  	dev_dbg(dev, "%s\n", __func__);
>>>>>
>>>>> +	/* disable and flush fault handlers, TODO unmap inodes */
>>>>> +	dax_dev->alive = false;
>>>>> +	synchronize_rcu();
>>>>> +
>>>>
>>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
>>>> looks wrong to me.
>>>
>>> The driver is using RCU to guarantee that all currently running fault
>>> handlers have either completed or will see the new state of ->alive
>>> when they start. Reference counts are protecting the actual dax_dev
>>> object.
>>>
>> Hmm.
>> This is the same 'creative' RCU usage Mike Snitzer has been trying
>> when trying to improve device-mapper performance.
>>
>> From my understanding RCU is protecting the _pointer_, not the
>> values of the structure pointed to.
>> IOW we are guaranteed to have a valid pointer at any time.
>> But at the same time _no_ guarantee is made about the _contents_ of
>> the structure.
>> It might well be that using 'synchronize_rcu' gives you similar
>> results (as synchronize_rcu() is essentially waiting a SMP grace
>> period, after which all CPUs should be seeing the update).
>> However, I haven't been able to find that this is a guaranteed
>> behaviour.
>> So from my understanding you have to use locking primitives
>> protecting the contents of the structure or exchange the _entire_
>> structure if you want to rely on RCU here.
>>
>> Can we get some clarification here?
>> Paul?
>
> I think you want the other Paul, Paul McKenney.
>
I think you are in fact right.
Sorry for the Paul-confusion :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@suse.de                                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-18 9:15 ` Hannes Reinecke
@ 2016-05-18 17:12 ` Paul E. McKenney
-1 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2016-05-18 17:12 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Paul Mackerras, Dan Williams, Johannes Thumshirn,
linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton
On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> >>>>> The "Device DAX" core enables dax mappings of performance / feature
> >>>>> differentiated memory. An open mapping or file handle keeps the backing
> >>>>> struct device live, but new mappings are only possible while the device
> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
> >>>>> with the enabled state of the device.
> >>>>>
> >>>>> Similar to the filesystem-dax case the backing memory may optionally
> >>>>> have struct page entries. However, unlike fs-dax there is no support
> >>>>> for private mappings, or mappings that are not backed by media (see
> >>>>> use of zero-page in fs-dax).
> >>>>>
> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> >>>>> like MAP_PRIVATE mappings.
> >>>>>
> >>>>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >>>>> Cc: Christoph Hellwig <hch@lst.de>
> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >>>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >>>>> ---
> >>>>>  drivers/dax/Kconfig |    1
> >>>>>  drivers/dax/dax.c   |  316 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  mm/huge_memory.c    |    1
> >>>>>  mm/hugetlb.c        |    1
> >>>>>  4 files changed, 319 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> >>>>> index 86ffbaa891ad..cedab7572de3 100644
> >>>>> --- a/drivers/dax/Kconfig
> >>>>> +++ b/drivers/dax/Kconfig
> >>>>> @@ -1,6 +1,7 @@
> >>>>>  menuconfig DEV_DAX
> >>>>>  	tristate "DAX: direct access to differentiated memory"
> >>>>>  	default m if NVDIMM_DAX
> >>>>> +	depends on TRANSPARENT_HUGEPAGE
> >>>>>  	help
> >>>>>  	  Support raw access to differentiated (persistence, bandwidth,
> >>>>>  	  latency...) memory via an mmap(2) capable character
> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
> >>>>> --- a/drivers/dax/dax.c
> >>>>> +++ b/drivers/dax/dax.c
> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
> >>>>>   * @region - parent region
> >>>>>   * @dev - device backing the character device
> >>>>>   * @kref - enable this data to be tracked in filp->private_data
> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
> >>>>>   * @id - child id in the region
> >>>>>   * @num_resources - number of physical address extents in this device
> >>>>>   * @res - array of physical address ranges
> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
> >>>>>  	struct dax_region *region;
> >>>>>  	struct device *dev;
> >>>>>  	struct kref kref;
> >>>>> +	bool alive;
> >>>>>  	int id;
> >>>>>  	int num_resources;
> >>>>>  	struct resource res[0];
> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
> >>>>>
> >>>>>  	dev_dbg(dev, "%s\n", __func__);
> >>>>>
> >>>>> +	/* disable and flush fault handlers, TODO unmap inodes */
> >>>>> +	dax_dev->alive = false;
> >>>>> +	synchronize_rcu();

If you need to wait for fault handlers, you need synchronize_sched()
instead of synchronize_rcu(). Please note that synchronize_rcu() is
guaranteed to wait only for tasks that have done rcu_read_lock() to
reach the corresponding rcu_read_unlock(). In contrast,
synchronize_sched() is guaranteed to wait for any non-idle
preempt-disable region of code to complete, regardless of exactly what
is disabling preemption.

And the "non-idle" is not an idle qualifier. If you need to wait on
fault handlers that somehow occur from an idle hardware thread, you
will need those fault handlers to do rcu_irq_enter() on entry and
rcu_irq_exit() on exit. (My guess is that you cannot take faults in
the idle loop, but I have learned not to trust such guesses all that
far.)

And last, but definitely not least, synchronize_sched() waits only for
pre-existing preempt-disable regions of code. So if you do
synchronize_sched(), and immediately after a fault handler starts,
synchronize_sched() won't necessarily wait on it. However, you -are-
guaranteed that synchronize_sched() -will- wait for any fault handler
that might possibly see dax_dev->alive with a non-false value.

Are these the guarantees you are looking for? (Yes, I did recently
watch "A New Hope". Why do you ask?)

> >>>>> +
> >>>>
> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> >>>> looks wrong to me.

RCU can be, and usually is, used to protect pointers, but it can be and
sometimes is used for other things as well. At its core, RCU is about
waiting for pre-existing RCU readers to complete.

> >>> The driver is using RCU to guarantee that all currently running fault
> >>> handlers have either completed or will see the new state of ->alive
> >>> when they start. Reference counts are protecting the actual dax_dev
> >>> object.
> >>>
> >> Hmm.
> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
> >> when trying to improve device-mapper performance.

To repeat, unless all your fault handlers begin with rcu_read_lock()
and end with rcu_read_unlock(), and as long as you don't care about not
waiting for fault handlers that are currently executing just before the
rcu_read_lock() and just after the rcu_read_unlock(), you need
synchronize_sched() rather than synchronize_rcu() for this job.

> >> From my understanding RCU is protecting the _pointer_, not the
> >> values of the structure pointed to.
> >> IOW we are guaranteed to have a valid pointer at any time.
> >> But at the same time _no_ guarantee is made about the _contents_ of
> >> the structure.
> >> It might well be that using 'synchronize_rcu' gives you similar
> >> results (as synchronize_rcu() is essentially waiting a SMP grace
> >> period, after which all CPUs should be seeing the update).
> >> However, I haven't been able to find that this is a guaranteed
> >> behaviour.
> >> So from my understanding you have to use locking primitives
> >> protecting the contents of the structure or exchange the _entire_
> >> structure if you want to rely on RCU here.
> >>
> >> Can we get some clarification here?

Maybe... What exactly is your synchronization design needing here?

> >> Paul?
> >
> > I think you want the other Paul, Paul McKenney.
> >
> I think you are in fact right.
> Sorry for the Paul-confusion :-)

Did I keep my end of the confusion up? ;-)

							Thanx, Paul

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-18 17:12 ` Paul E. McKenney
(?)
@ 2016-05-18 17:26 ` Dan Williams
-1 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-18 17:26 UTC (permalink / raw)
To: Paul McKenney
Cc: Hannes Reinecke, Paul Mackerras, Johannes Thumshirn,
linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton

On Wed, May 18, 2016 at 10:12 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
>> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
>> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
>> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
>> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
>> >>>>> The "Device DAX" core enables dax mappings of performance / feature
>> >>>>> differentiated memory. An open mapping or file handle keeps the backing
>> >>>>> struct device live, but new mappings are only possible while the device
>> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
>> >>>>> with the enabled state of the device.
>> >>>>>
>> >>>>> Similar to the filesystem-dax case the backing memory may optionally
>> >>>>> have struct page entries. However, unlike fs-dax there is no support
>> >>>>> for private mappings, or mappings that are not backed by media (see
>> >>>>> use of zero-page in fs-dax).
>> >>>>>
>> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
>> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
>> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
>> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
>> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
>> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
>> >>>>> like MAP_PRIVATE mappings.
>> >>>>>
>> >>>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> >>>>> Cc: Christoph Hellwig <hch@lst.de>
>> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> >>>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> >>>>> ---
>> >>>>> drivers/dax/Kconfig | 1
>> >>>>> drivers/dax/dax.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>>> mm/huge_memory.c | 1
>> >>>>> mm/hugetlb.c | 1
>> >>>>> 4 files changed, 319 insertions(+)
>> >>>>>
>> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>> >>>>> index 86ffbaa891ad..cedab7572de3 100644
>> >>>>> --- a/drivers/dax/Kconfig
>> >>>>> +++ b/drivers/dax/Kconfig
>> >>>>> @@ -1,6 +1,7 @@
>> >>>>>  menuconfig DEV_DAX
>> >>>>>      tristate "DAX: direct access to differentiated memory"
>> >>>>>      default m if NVDIMM_DAX
>> >>>>> +    depends on TRANSPARENT_HUGEPAGE
>> >>>>>      help
>> >>>>>        Support raw access to differentiated (persistence, bandwidth,
>> >>>>>        latency...) memory via an mmap(2) capable character
>> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
>> >>>>> --- a/drivers/dax/dax.c
>> >>>>> +++ b/drivers/dax/dax.c
>> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
>> >>>>>   * @region - parent region
>> >>>>>   * @dev - device backing the character device
>> >>>>>   * @kref - enable this data to be tracked in filp->private_data
>> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
>> >>>>>   * @id - child id in the region
>> >>>>>   * @num_resources - number of physical address extents in this device
>> >>>>>   * @res - array of physical address ranges
>> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
>> >>>>>      struct dax_region *region;
>> >>>>>      struct device *dev;
>> >>>>>      struct kref kref;
>> >>>>> +    bool alive;
>> >>>>>      int id;
>> >>>>>      int num_resources;
>> >>>>>      struct resource res[0];
>> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>> >>>>>
>> >>>>>      dev_dbg(dev, "%s\n", __func__);
>> >>>>>
>> >>>>> +    /* disable and flush fault handlers, TODO unmap inodes */
>> >>>>> +    dax_dev->alive = false;
>> >>>>> +    synchronize_rcu();
>
> If you need to wait for fault handlers, you need synchronize_sched()
> instead of synchronize_rcu(). Please note that synchronize_rcu() is
> guaranteed to wait only for tasks that have done rcu_read_lock() to reach
> the corresponding rcu_read_unlock(). In contrast, synchronize_sched()
> is guaranteed to wait for any non-idle preempt-disable region of code
> to complete, regardless of exactly what is disabling preemption.
>
> And the "non-idle" is not an idle qualifier. If you need to wait on fault
> handlers that somehow occur from an idle hardware thread, you will need
> those fault handlers to do rcu_irq_enter() on entry and rcu_irq_exit()
> on exit. (My guess is that you cannot take faults in the idle loop,
> but I have learned not to trust such guesses all that far.)
>
> And last, but definitely not least, synchronize_sched() waits only
> for pre-existing preempt-disable regions of code. So if you do
> synchronize_sched(), and immediately after a fault handler starts,
> synchronize_sched() won't necessarily wait on it. However, you -are-
> guaranteed that synchronize_sched() -will- wait for any fault handler
> that might possibly see dax_dev->alive with a non-false value.
>
> Are these the guarantees you are looking for? (Yes, I did recently
> watch "A New Hope". Why do you ask?)

Spoken like a true rcu-Jedi.

So in this case the fault handlers are indeed running under
rcu_read_lock(), and I can't fathom how these faults would trigger
from an idle thread...

>
>> >>>>> +
>> >>>>
>> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
>> >>>> looks wrong to me.
>
> RCU can be, and usually is, used to protect pointers, but it can be and
> sometimes is used for other things as well. At its core, RCU is about
> waiting for pre-existing RCU readers to complete.
>
>> >>> The driver is using RCU to guarantee that all currently running fault
>> >>> handlers have either completed or will see the new state of ->alive
>> >>> when they start. Reference counts are protecting the actual dax_dev
>> >>> object.
>> >>>
>> >> Hmm.
>> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
>> >> when trying to improve device-mapper performance.
>
> To repeat, unless all your fault handlers begin with rcu_read_lock()
> and end with rcu_read_unlock(), and as long as you don't care about not
> waiting for fault handlers that are currently executing just before
> the rcu_read_lock() and just after the rcu_read_unlock(), you need
> synchronize_sched() rather than synchronize_rcu() for this job.
>
>> >> From my understanding RCU is protecting the _pointer_, not the
>> >> values of the structure pointed to.
>> >> IOW we are guaranteed to have a valid pointer at any time.
>> >> But at the same time _no_ guarantee is made about the _contents_ of
>> >> the structure.
>> >> It might well be that using 'synchronize_rcu' giving you similar
>> >> results (as synchronize_rcu() is essentially waiting a SMP grace
>> >> period, after which all CPUs should be seeing the update).
>> >> However, I haven't been able to find that this is a guaranteed
>> >> behaviour.
>> >> So from my understanding you have to use locking primitives
>> >> protecting the contents of the structure or exchange the _entire_
>> >> structure if you want to rely on RCU here.
>> >>
>> >> Can we get some clarification here?
>
> Maybe... What exactly is your synchronization design needing here?
>
>> >> Paul?
>> >
>> > I think you want the other Paul, Paul McKenney.
>> >
>> I think you are in fact right.
>> Sorry for the Paul-confusion :-)
>
> Did I keep my end of the confusion up? ;-)

Yes, I think we're good, but please double check I am not mistaken in
the following clarification comment:

@@ -150,6 +152,16 @@ static void destroy_dax_dev(void *_dev)

     dev_dbg(dev, "%s\n", __func__);

+    /*
+     * Note, rcu is not protecting the liveness of dax_dev, rcu is
+     * ensuring that any fault handlers that might have seen
+     * dax_dev->alive == true, have completed. Any fault handlers
+     * that start after synchronize_rcu() has started will abort
+     * upon seeing dax_dev->alive == false.
+     */
+    dax_dev->alive = false;
+    synchronize_rcu();
+
     get_device(dev);
     device_unregister(dev);
     ida_simple_remove(&dax_region->ida, dax_dev->id);
@@ -173,6 +185,7 @@ int devm_create_dax_dev(struct dax_region *dax_region, struct resource *re

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap
2016-05-18 17:26 ` Dan Williams
@ 2016-05-18 17:57 ` Paul E. McKenney
-1 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2016-05-18 17:57 UTC (permalink / raw)
To: Dan Williams
Cc: Hannes Reinecke, Paul Mackerras, Johannes Thumshirn,
linux-nvdimm@lists.01.org, Dave Hansen, linux-kernel@vger.kernel.org,
Christoph Hellwig, linux-block, Andrew Morton
On Wed, May 18, 2016 at 10:26:57AM -0700, Dan Williams wrote:
> On Wed, May 18, 2016 at 10:12 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
> >> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
> >> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
> >> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
> >> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> >> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> >> >>>>> The "Device DAX" core enables dax mappings of performance / feature
> >> >>>>> differentiated memory. An open mapping or file handle keeps the backing
> >> >>>>> struct device live, but new mappings are only possible while the device
> >> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
> >> >>>>> with the enabled state of the device.
> >> >>>>>
> >> >>>>> Similar to the filesystem-dax case the backing memory may optionally
> >> >>>>> have struct page entries. However, unlike fs-dax there is no support
> >> >>>>> for private mappings, or mappings that are not backed by media (see
> >> >>>>> use of zero-page in fs-dax).
> >> >>>>>
> >> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
> >> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
> >> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
> >> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
> >> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
> >> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> >> >>>>> like MAP_PRIVATE mappings.
> >> >>>>>
> >> >>>>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >> >>>>> Cc: Christoph Hellwig <hch@lst.de>
> >> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> >>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> >>>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> >>>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> >>>>> ---
> >> >>>>> drivers/dax/Kconfig | 1
> >> >>>>> drivers/dax/dax.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>>> mm/huge_memory.c | 1
> >> >>>>> mm/hugetlb.c | 1
> >> >>>>> 4 files changed, 319 insertions(+)
> >> >>>>>
> >> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> >> >>>>> index 86ffbaa891ad..cedab7572de3 100644
> >> >>>>> --- a/drivers/dax/Kconfig
> >> >>>>> +++ b/drivers/dax/Kconfig
> >> >>>>> @@ -1,6 +1,7 @@
> >> >>>>>  menuconfig DEV_DAX
> >> >>>>>  	tristate "DAX: direct access to differentiated memory"
> >> >>>>>  	default m if NVDIMM_DAX
> >> >>>>> +	depends on TRANSPARENT_HUGEPAGE
> >> >>>>>  	help
> >> >>>>>  	  Support raw access to differentiated (persistence, bandwidth,
> >> >>>>>  	  latency...) memory via an mmap(2) capable character
> >> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> >> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
> >> >>>>> --- a/drivers/dax/dax.c
> >> >>>>> +++ b/drivers/dax/dax.c
> >> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
> >> >>>>>   * @region - parent region
> >> >>>>>   * @dev - device backing the character device
> >> >>>>>   * @kref - enable this data to be tracked in filp->private_data
> >> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
> >> >>>>>   * @id - child id in the region
> >> >>>>>   * @num_resources - number of physical address extents in this device
> >> >>>>>   * @res - array of physical address ranges
> >> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
> >> >>>>>  	struct dax_region *region;
> >> >>>>>  	struct device *dev;
> >> >>>>>  	struct kref kref;
> >> >>>>> +	bool alive;
> >> >>>>>  	int id;
> >> >>>>>  	int num_resources;
> >> >>>>>  	struct resource res[0];
> >> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
> >> >>>>>
> >> >>>>>  	dev_dbg(dev, "%s\n", __func__);
> >> >>>>>
> >> >>>>> +	/* disable and flush fault handlers, TODO unmap inodes */
> >> >>>>> +	dax_dev->alive = false;
> >> >>>>> +	synchronize_rcu();
> >
> > If you need to wait for fault handlers, you need synchronize_sched()
> > instead of synchronize_rcu(). Please note that synchronize_rcu() is
> > guaranteed to wait only for tasks that have done rcu_read_lock() to reach
> > the corresponding rcu_read_unlock(). In contrast, synchronize_sched()
> > is guaranteed to wait for any non-idle preempt-disable region of code
> > to complete, regardless of exactly what is disabling preemption.
> >
> > And the "non-idle" is not an idle qualifier. If you need to wait on fault
> > handlers that somehow occur from an idle hardware thread, you will need
> > those fault handlers to do rcu_irq_enter() on entry and rcu_irq_exit()
> > on exit. (My guess is that you cannot take faults in the idle loop,
> > but I have learned not to trust such guesses all that far.)
> >
> > And last, but definitely not least, synchronize_sched() waits only
> > for pre-existing preempt-disable regions of code. So if you do
> > synchronize_sched(), and immediately after a fault handler starts,
> > synchronize_sched() won't necessarily wait on it. However, you -are-
> > guaranteed that synchronize_sched() -will- wait for any fault handler
> > that might possibly see dax_dev->alive with a non-false value.
> >
> > Are these the guarantees you are looking for? (Yes, I did recently
> > watch "A New Hope". Why do you ask?)
>
> Spoken like a true rcu-Jedi.

;-)

> So in this case the fault handlers are indeed running under
> rcu_read_lock(), and I can't fathom how these faults would trigger
> from an idle thread...

OK, given all fault handlers use rcu_read_lock() and either:

1. As you say, faults never trigger from idle, or

2. The fault handlers call rcu_irq_enter() on entry and
   rcu_irq_exit() on exit

Then, yes, you can keep on using synchronize_rcu().

> >> >>>>> +
> >> >>>>
> >> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> >> >>>> looks wrong to me.
> >
> > RCU can be, and usually is, used to protect pointers, but it can be and
> > sometimes is used for other things as well. At its core, RCU is about
> > waiting for pre-existing RCU readers to complete.
> >
> >> >>> The driver is using RCU to guarantee that all currently running fault
> >> >>> handlers have either completed or will see the new state of ->alive
> >> >>> when they start. Reference counts are protecting the actual dax_dev
> >> >>> object.
> >> >>>
> >> >> Hmm.
> >> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
> >> >> when trying to improve device-mapper performance.
> >
> > To repeat, unless all your fault handlers begin with rcu_read_lock()
> > and end with rcu_read_unlock(), and as long as you don't care about not
> > waiting for fault handlers that are currently executing just before
> > the rcu_read_lock() and just after the rcu_read_unlock(), you need
> > synchronize_sched() rather than synchronize_rcu() for this job.
> >
> >> >> From my understanding RCU is protecting the _pointer_, not the
> >> >> values of the structure pointed to.
> >> >> IOW we are guaranteed to have a valid pointer at any time.
> >> >> But at the same time _no_ guarantee is made about the _contents_ of
> >> >> the structure.
> >> >> It might well be that using 'synchronize_rcu' giving you similar
> >> >> results (as synchronize_rcu() is essentially waiting a SMP grace
> >> >> period, after which all CPUs should be seeing the update).
> >> >> However, I haven't been able to find that this is a guaranteed
> >> >> behaviour.
> >> >> So from my understanding you have to use locking primitives
> >> >> protecting the contents of the structure or exchange the _entire_
> >> >> structure if you want to rely on RCU here.
> >> >>
> >> >> Can we get some clarification here?
> >
> > Maybe... What exactly is your synchronization design needing here?
> >
> >> >> Paul?
> >> >
> >> > I think you want the other Paul, Paul McKenney.
> >> >
> >> I think you are in fact right.
> >> Sorry for the Paul-confusion :-)
> >
> > Did I keep my end of the confusion up? ;-)
>
> Yes, I think we're good, but please double check I am not mistaken in
> the following clarification comment:
>
> @@ -150,6 +152,16 @@ static void destroy_dax_dev(void *_dev)
>
>  	dev_dbg(dev, "%s\n", __func__);
>
> +	/*
> +	 * Note, rcu is not protecting the liveness of dax_dev, rcu is
> +	 * ensuring that any fault handlers that might have seen
> +	 * dax_dev->alive == true, have completed. Any fault handlers
> +	 * that start after synchronize_rcu() has started will abort
> +	 * upon seeing dax_dev->alive == false.
> +	 */
> +	dax_dev->alive = false;
> +	synchronize_rcu();
> +
>  	get_device(dev);
>  	device_unregister(dev);
>  	ida_simple_remove(&dax_region->ida, dax_dev->id);
> @@ -173,6 +185,7 @@ int devm_create_dax_dev(struct dax_region
> *dax_region, struct resource *re

Given your comment and your statement that fault handlers never happen
on idle CPUs and are always protected by rcu_read_lock(), from an RCU
point of view:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Thanx, Paul
^ permalink raw reply [flat|nested] 37+ messages in thread
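The "clear a flag, then wait for a grace period" pattern discussed in this message can be tried out in userspace. What follows is only a minimal sketch, assuming C11 atomics and POSIX threads. Every name in it (toy_read_lock(), toy_read_unlock(), toy_synchronize(), toy_fault(), toy_destroy(), run_demo()) is hypothetical, and the reader counter is a crude stand-in for RCU, not how the kernel implements synchronize_rcu(). Unlike RCU, the stand-in waits for all readers rather than only pre-existing ones. It is meant to show one property: a handler that saw alive == true always finishes before teardown proceeds.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are kernel symbols. */
static atomic_bool alive = true;       /* plays the role of dax_dev->alive */
static atomic_bool torn_down = false;  /* set once teardown has proceeded */
static atomic_int active_readers = 0;  /* crude replacement for RCU reader tracking */
static atomic_int late_faults = 0;     /* handlers that ran past teardown; must stay 0 */

static void toy_read_lock(void)   { atomic_fetch_add(&active_readers, 1); }
static void toy_read_unlock(void) { atomic_fetch_sub(&active_readers, 1); }

/* Wait until no reader is inside a read-side section. */
static void toy_synchronize(void)
{
	while (atomic_load(&active_readers) != 0)
		sched_yield();
}

/* Fault handler: do work only while the device is alive. */
int toy_fault(void)
{
	int rc = -1;

	toy_read_lock();
	if (atomic_load(&alive)) {
		/* The mapping work would go here; teardown must not be done yet. */
		if (atomic_load(&torn_down))
			atomic_fetch_add(&late_faults, 1);
		rc = 0;
	}
	toy_read_unlock();
	return rc;
}

/* Teardown: clear the flag first, then wait for readers to drain. */
void toy_destroy(void)
{
	atomic_store(&alive, false);
	toy_synchronize();
	atomic_store(&torn_down, true);
}

static void *fault_loop(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++)
		toy_fault();
	return NULL;
}

/* Race four fault threads against one teardown; returns the late-fault count. */
int run_demo(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++) {
		int rc = pthread_create(&t[i], NULL, fault_loop, NULL);
		assert(rc == 0);
		(void)rc;
	}
	toy_destroy();
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&late_faults);
}
```

Calling run_demo() from main() and building with -pthread is enough to try it. The order inside toy_destroy() is the point: storing the flag before waiting is what lets the wait cover every handler that could have seen the old value.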
* [PATCH v2 3/3] Revert "block: enable dax for raw block devices"
2016-05-15 6:26 ` Dan Williams
(?)
@ 2016-05-15 6:26 ` Dan Williams
-1 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2016-05-15 6:26 UTC (permalink / raw)
To: linux-nvdimm
Cc: Jan Kara, Dave Chinner, linux-kernel, linux-block, Jeff Moyer,
Ross Zwisler, Andrew Morton, hch
This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937.

The functionality is superseded by the new "Device DAX" facility.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
block/ioctl.c | 32 ----------------
fs/block_dev.c | 96 ++++++++++++++---------------------------------
include/linux/fs.h | 8 ----
include/uapi/linux/fs.h | 1
4 files changed, 29 insertions(+), 108 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 4ff1f92f89ca..698c7933d582 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -407,35 +407,6 @@ static inline int is_unrecognized_ioctl(int ret)
 		ret == -ENOIOCTLCMD;
 }
 
-#ifdef CONFIG_FS_DAX
-bool blkdev_dax_capable(struct block_device *bdev)
-{
-	struct gendisk *disk = bdev->bd_disk;
-
-	if (!disk->fops->direct_access)
-		return false;
-
-	/*
-	 * If the partition is not aligned on a page boundary, we can't
-	 * do dax I/O to it.
-	 */
-	if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
-			|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
-		return false;
-
-	/*
-	 * If the device has known bad blocks, force all I/O through the
-	 * driver / page cache.
-	 *
-	 * TODO: support finer grained dax error handling
-	 */
-	if (disk->bb && disk->bb->count)
-		return false;
-
-	return true;
-}
-#endif
-
 static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
 		unsigned cmd, unsigned long arg)
 {
@@ -598,9 +569,6 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACESETUP:
 	case BLKTRACETEARDOWN:
 		return blk_trace_ioctl(bdev, cmd, argp);
-	case BLKDAXGET:
-		return put_int(arg, !!(bdev->bd_inode->i_flags & S_DAX));
-		break;
 	case IOC_PR_REGISTER:
 		return blkdev_pr_register(bdev, argp);
 	case IOC_PR_RESERVE:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..36ee10ca503e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -29,6 +29,7 @@
 #include <linux/log2.h>
 #include <linux/cleancache.h>
 #include <linux/dax.h>
+#include <linux/badblocks.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -1159,6 +1160,33 @@ void bd_set_size(struct block_device *bdev, loff_t size)
 }
 EXPORT_SYMBOL(bd_set_size);
 
+static bool blkdev_dax_capable(struct block_device *bdev)
+{
+	struct gendisk *disk = bdev->bd_disk;
+
+	if (!disk->fops->direct_access || !IS_ENABLED(CONFIG_FS_DAX))
+		return false;
+
+	/*
+	 * If the partition is not aligned on a page boundary, we can't
+	 * do dax I/O to it.
+	 */
+	if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
+			|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
+		return false;
+
+	/*
+	 * If the device has known bad blocks, force all I/O through the
+	 * driver / page cache.
+	 *
+	 * TODO: support finer grained dax error handling
+	 */
+	if (disk->bb && disk->bb->count)
+		return false;
+
+	return true;
+}
+
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
 
 /*
@@ -1724,79 +1752,13 @@ static const struct address_space_operations def_blk_aops = {
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
-#ifdef CONFIG_FS_DAX
-/*
- * In the raw block case we do not need to contend with truncation nor
- * unwritten file extents. Without those concerns there is no need for
- * additional locking beyond the mmap_sem context that these routines
- * are already executing under.
- *
- * Note, there is no protection if the block device is dynamically
- * resized (partition grow/shrink) during a fault. A stable block device
- * size is already not enforced in the blkdev_direct_IO path.
- *
- * For DAX, it is the responsibility of the block device driver to
- * ensure the whole-disk device size is stable while requests are in
- * flight.
- *
- * Finally, unlike the filemap_page_mkwrite() case there is no
- * filesystem superblock to sync against freezing. We still include a
- * pfn_mkwrite callback for dax drivers to receive write fault
- * notifications.
- */
-static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
-}
-
-static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
-		struct vm_fault *vmf)
-{
-	return dax_pfn_mkwrite(vma, vmf);
-}
-
-static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, unsigned int flags)
-{
-	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
-}
-
-static const struct vm_operations_struct blkdev_dax_vm_ops = {
-	.fault = blkdev_dax_fault,
-	.pmd_fault = blkdev_dax_pmd_fault,
-	.pfn_mkwrite = blkdev_dax_pfn_mkwrite,
-};
-
-static const struct vm_operations_struct blkdev_default_vm_ops = {
-	.fault = filemap_fault,
-	.map_pages = filemap_map_pages,
-};
-
-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
-{
-	struct inode *bd_inode = bdev_file_inode(file);
-
-	file_accessed(file);
-	if (IS_DAX(bd_inode)) {
-		vma->vm_ops = &blkdev_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
-	} else {
-		vma->vm_ops = &blkdev_default_vm_ops;
-	}
-
-	return 0;
-}
-#else
-#define blkdev_mmap generic_file_mmap
-#endif
-
 const struct file_operations def_blk_fops = {
 	.open = blkdev_open,
 	.release = blkdev_close,
 	.llseek = block_llseek,
 	.read_iter = blkdev_read_iter,
 	.write_iter = blkdev_write_iter,
-	.mmap = blkdev_mmap,
+	.mmap = generic_file_mmap,
 	.fsync = blkdev_fsync,
 	.unlocked_ioctl = block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b58baaf..8363a10660f6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2320,14 +2320,6 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
-#ifdef CONFIG_FS_DAX
-extern bool blkdev_dax_capable(struct block_device *bdev);
-#else
-static inline bool blkdev_dax_capable(struct block_device *bdev)
-{
-	return false;
-}
-#endif
 
 extern struct super_block *blockdev_superblock;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a079d50376e1..fbff8b28aa35 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,7 +222,6 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
-#define BLKDAXGET _IO(0x12,129)
 
 #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
 #define FIBMAP _IO(0x00,1) /* bmap access */
^ permalink raw reply related [flat|nested] 37+ messages in thread
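The page-alignment test in blkdev_dax_capable() can be checked in isolation. This is a minimal userspace sketch, assuming 512-byte sectors and a 4 KiB page. TOY_SECTOR_SIZE, TOY_PAGE_SIZE and toy_dax_aligned() are hypothetical names for illustration, not kernel symbols. It only mirrors the modulo arithmetic of the patch and ignores the direct_access and bad-block checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512u   /* matches the literal 512 in the patch */
#define TOY_PAGE_SIZE   4096u  /* assumption: 4 KiB pages */

static_assert(TOY_PAGE_SIZE % TOY_SECTOR_SIZE == 0,
	      "a page must be a whole number of sectors");

/*
 * Return true when both the partition start and its length, expressed
 * in sectors, are multiples of the page size. This is the condition
 * under which the patch does not reject the partition for alignment.
 */
bool toy_dax_aligned(uint64_t start_sect, uint64_t nr_sects)
{
	const uint64_t sectors_per_page = TOY_PAGE_SIZE / TOY_SECTOR_SIZE;

	return (start_sect % sectors_per_page) == 0 &&
	       (nr_sects % sectors_per_page) == 0;
}
```

Under these assumptions a partition that starts at sector 2048 and spans 204800 sectors passes, while one that starts at sector 63 fails, because 63 is not a multiple of the 8 sectors that make up a page.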
* [PATCH v2 3/3] Revert "block: enable dax for raw block devices" @ 2016-05-15 6:26 ` Dan Williams 0 siblings, 0 replies; 37+ messages in thread From: Dan Williams @ 2016-05-15 6:26 UTC (permalink / raw) To: linux-nvdimm Cc: Jan Kara, Dave Chinner, linux-kernel, linux-block, Jeff Moyer, Ross Zwisler, Andrew Morton, hch This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937. The functionality is superseded by the new "Device DAX" facility. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- block/ioctl.c | 32 ---------------- fs/block_dev.c | 96 ++++++++++++++--------------------------------- include/linux/fs.h | 8 ---- include/uapi/linux/fs.h | 1 4 files changed, 29 insertions(+), 108 deletions(-) diff --git a/block/ioctl.c b/block/ioctl.c index 4ff1f92f89ca..698c7933d582 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -407,35 +407,6 @@ static inline int is_unrecognized_ioctl(int ret) ret == -ENOIOCTLCMD; } -#ifdef CONFIG_FS_DAX -bool blkdev_dax_capable(struct block_device *bdev) -{ - struct gendisk *disk = bdev->bd_disk; - - if (!disk->fops->direct_access) - return false; - - /* - * If the partition is not aligned on a page boundary, we can't - * do dax I/O to it. - */ - if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512)) - || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512))) - return false; - - /* - * If the device has known bad blocks, force all I/O through the - * driver / page cache. 
-	 *
-	 * TODO: support finer grained dax error handling
-	 */
-	if (disk->bb && disk->bb->count)
-		return false;
-
-	return true;
-}
-#endif
-
 static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
 		unsigned cmd, unsigned long arg)
 {
@@ -598,9 +569,6 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACESETUP:
 	case BLKTRACETEARDOWN:
 		return blk_trace_ioctl(bdev, cmd, argp);
-	case BLKDAXGET:
-		return put_int(arg, !!(bdev->bd_inode->i_flags & S_DAX));
-		break;
 	case IOC_PR_REGISTER:
 		return blkdev_pr_register(bdev, argp);
 	case IOC_PR_RESERVE:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..36ee10ca503e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -29,6 +29,7 @@
 #include <linux/log2.h>
 #include <linux/cleancache.h>
 #include <linux/dax.h>
+#include <linux/badblocks.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -1159,6 +1160,33 @@ void bd_set_size(struct block_device *bdev, loff_t size)
 }
 EXPORT_SYMBOL(bd_set_size);
 
+static bool blkdev_dax_capable(struct block_device *bdev)
+{
+	struct gendisk *disk = bdev->bd_disk;
+
+	if (!disk->fops->direct_access || !IS_ENABLED(CONFIG_FS_DAX))
+		return false;
+
+	/*
+	 * If the partition is not aligned on a page boundary, we can't
+	 * do dax I/O to it.
+	 */
+	if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
+			|| (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
+		return false;
+
+	/*
+	 * If the device has known bad blocks, force all I/O through the
+	 * driver / page cache.
+	 *
+	 * TODO: support finer grained dax error handling
+	 */
+	if (disk->bb && disk->bb->count)
+		return false;
+
+	return true;
+}
+
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
 
 /*
@@ -1724,79 +1752,13 @@ static const struct address_space_operations def_blk_aops = {
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
-#ifdef CONFIG_FS_DAX
-/*
- * In the raw block case we do not need to contend with truncation nor
- * unwritten file extents. Without those concerns there is no need for
- * additional locking beyond the mmap_sem context that these routines
- * are already executing under.
- *
- * Note, there is no protection if the block device is dynamically
- * resized (partition grow/shrink) during a fault. A stable block device
- * size is already not enforced in the blkdev_direct_IO path.
- *
- * For DAX, it is the responsibility of the block device driver to
- * ensure the whole-disk device size is stable while requests are in
- * flight.
- *
- * Finally, unlike the filemap_page_mkwrite() case there is no
- * filesystem superblock to sync against freezing. We still include a
- * pfn_mkwrite callback for dax drivers to receive write fault
- * notifications.
- */
-static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
-}
-
-static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
-		struct vm_fault *vmf)
-{
-	return dax_pfn_mkwrite(vma, vmf);
-}
-
-static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, unsigned int flags)
-{
-	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
-}
-
-static const struct vm_operations_struct blkdev_dax_vm_ops = {
-	.fault = blkdev_dax_fault,
-	.pmd_fault = blkdev_dax_pmd_fault,
-	.pfn_mkwrite = blkdev_dax_pfn_mkwrite,
-};
-
-static const struct vm_operations_struct blkdev_default_vm_ops = {
-	.fault = filemap_fault,
-	.map_pages = filemap_map_pages,
-};
-
-static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
-{
-	struct inode *bd_inode = bdev_file_inode(file);
-
-	file_accessed(file);
-	if (IS_DAX(bd_inode)) {
-		vma->vm_ops = &blkdev_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
-	} else {
-		vma->vm_ops = &blkdev_default_vm_ops;
-	}
-
-	return 0;
-}
-#else
-#define blkdev_mmap generic_file_mmap
-#endif
-
 const struct file_operations def_blk_fops = {
 	.open = blkdev_open,
 	.release = blkdev_close,
 	.llseek = block_llseek,
 	.read_iter = blkdev_read_iter,
 	.write_iter = blkdev_write_iter,
-	.mmap = blkdev_mmap,
+	.mmap = generic_file_mmap,
 	.fsync = blkdev_fsync,
 	.unlocked_ioctl = block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b58baaf..8363a10660f6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2320,14 +2320,6 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
-#ifdef CONFIG_FS_DAX
-extern bool blkdev_dax_capable(struct block_device *bdev);
-#else
-static inline bool blkdev_dax_capable(struct block_device *bdev)
-{
-	return false;
-}
-#endif
 
 extern struct super_block *blockdev_superblock;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a079d50376e1..fbff8b28aa35 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,7 +222,6 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
-#define BLKDAXGET _IO(0x12,129)
 #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
 #define FIBMAP _IO(0x00,1) /* bmap access */
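
For readers following the page-alignment test in blkdev_dax_capable() above, here is a minimal userspace sketch of the same arithmetic. It is an illustration only, not part of the patch: it assumes 512-byte sectors (as the patch does) and a 4096-byte page, and the names SKETCH_PAGE_SIZE, SKETCH_SECTOR_SIZE and partition_page_aligned() are hypothetical, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. The kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE   4096u
/* The patch hard-codes 512-byte sectors in its modulo test. */
#define SKETCH_SECTOR_SIZE 512u

/*
 * Hypothetical helper mirroring the start_sect / nr_sects test:
 * both the partition start and its length must be a whole number
 * of pages, expressed here in sectors.
 */
static bool partition_page_aligned(uint64_t start_sect, uint64_t nr_sects)
{
	const uint64_t sects_per_page = SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE;

	return (start_sect % sects_per_page) == 0 &&
	       (nr_sects % sects_per_page) == 0;
}
```

With these assumptions a page spans 8 sectors, so a partition starting at sector 2048 with 204800 sectors passes the test, while a legacy partition starting at sector 63 does not and would be forced through the page cache.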
end of thread, other threads:[~2016-05-18 17:57 UTC | newest]

Thread overview: 37+ messages
2016-05-15  6:26 [PATCH v2 0/3] "Device DAX" for persistent memory Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-15  6:26 ` [PATCH v2 1/3] /dev/dax, pmem: direct access to " Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-17  8:52 ` Johannes Thumshirn
2016-05-17  8:52 ` Johannes Thumshirn
2016-05-17  8:52 ` Johannes Thumshirn
2016-05-17 16:40 ` Dan Williams
2016-05-17 16:40 ` Dan Williams
2016-05-17 16:40 ` Dan Williams
2016-05-15  6:26 ` [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-17 10:57 ` Johannes Thumshirn
2016-05-17 10:57 ` Johannes Thumshirn
2016-05-17 10:57 ` Johannes Thumshirn
2016-05-17 22:19 ` Dan Williams
2016-05-17 22:19 ` Dan Williams
2016-05-17 22:19 ` Dan Williams
2016-05-18  8:07 ` Hannes Reinecke
2016-05-18  8:07 ` Hannes Reinecke
2016-05-18  9:10 ` Paul Mackerras
2016-05-18  9:10 ` Paul Mackerras
2016-05-18  9:15 ` Hannes Reinecke
2016-05-18  9:15 ` Hannes Reinecke
2016-05-18 17:12 ` Paul E. McKenney
2016-05-18 17:12 ` Paul E. McKenney
2016-05-18 17:26 ` Dan Williams
2016-05-18 17:26 ` Dan Williams
2016-05-18 17:26 ` Dan Williams
2016-05-18 17:57 ` Paul E. McKenney
2016-05-18 17:57 ` Paul E. McKenney
2016-05-15  6:26 ` [PATCH v2 3/3] Revert "block: enable dax for raw block devices" Dan Williams
2016-05-15  6:26 ` Dan Williams
2016-05-15  6:26 ` Dan Williams