From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: "Williams, Dan J" <dan.j.williams@intel.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Cc: "gregory.price@memverge.com" <gregory.price@memverge.com>,
"Hocko, Michal" <mhocko@suse.com>,
"fan.ni@samsung.com" <fan.ni@samsung.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"david@redhat.com" <david@redhat.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default
Date: Sat, 11 Feb 2023 05:57:33 +0000 [thread overview]
Message-ID: <25626477e9500b1e17c15d5fbab1fd067dd57ef7.camel@intel.com> (raw)
In-Reply-To: <167602003336.1924368.6809503401422267885.stgit@dwillia2-xfh.jf.intel.com>
On Fri, 2023-02-10 at 01:07 -0800, Dan Williams wrote:
> The default mode for device-dax instances is backwards for RAM-regions
> as evidenced by the fact that it tends to catch end users by surprise.
> "Where is my memory?". Recall that platforms are increasingly shipping
> with performance-differentiated memory pools beyond typical DRAM and
> NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
> interleave, varied media types, and future fabric attached
> possibilities).
>
> For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
> 'Soft Reserved') attribute is expected to be applied to all memory-pools
> that are not the general purpose pool. This designation gives an
> Operating System a chance to defer usage of a memory pool until later in
> the boot process where its performance properties can be interrogated
> and administrator policy can be applied.
>
> 'Soft Reserved' memory can be anything from too limited and precious to
> be part of the general purpose pool (HBM), too slow to host hot kernel
> data structures (some PMEM media), or anything in between. However, in
> the absence of an explicit policy, the memory should at least be made
> usable by default. The current device-dax default hides all
> non-general-purpose memory behind a device interface.
>
> The expectation is that the distribution of users that want the memory
> online by default vs device-dedicated-access by default follows the
> Pareto principle. A small number of enlightened users may want to do
> userspace memory management through a device, but general users just
> want the kernel to make the memory available with an option to get more
> advanced later.
>
> Arrange for all device-dax instances not backed by PMEM to default to
> attaching to the dax_kmem driver. From there the baseline memory hotplug
> policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
> gates whether the memory comes online or stays offline. Where, if it
> stays offline, it can be reliably converted back to device-mode where it
> can be partitioned, or fronted by a userspace allocator.
>
> So, if someone wants device-dax instances for their 'Soft Reserved'
> memory:
>
> 1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
> with memhp_default_state=offline, or roll the dice and hope that the
> kernel has not pinned a page in that memory before step 2.
>
> 2/ Write a udev rule to convert the target dax device(s) from
> 'system-ram' mode to 'devdax' mode:
>
> daxctl reconfigure-device $dax -m devdax -f
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Gregory Price <gregory.price@memverge.com>
> Tested-by: Fan Ni <fan.ni@samsung.com>
> Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> drivers/dax/Kconfig | 2 +-
> drivers/dax/bus.c | 53 ++++++++++++++++++++---------------------------
> drivers/dax/bus.h | 12 +++++++++--
> drivers/dax/device.c | 3 +--
> drivers/dax/hmem/hmem.c | 12 ++++++++++-
> drivers/dax/kmem.c | 1 +
> 6 files changed, 46 insertions(+), 37 deletions(-)
Looks good,
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
>
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d13c889c2a64..1163eb62e5f6 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES
> def_bool y
>
> config DEV_DAX_KMEM
> - tristate "KMEM DAX: volatile-use of persistent memory"
> + tristate "KMEM DAX: map dax-devices as System-RAM"
> default DEV_DAX
> depends on DEV_DAX
> depends on MEMORY_HOTPLUG # for add_memory() and friends
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1dad813ee4a6..012d576004e9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev)
> return match;
> }
>
> +static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev)
> +{
> + enum dax_driver_type type = DAXDRV_DEVICE_TYPE;
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM)
> + type = DAXDRV_KMEM_TYPE;
> +
> + if (dax_drv->type == type)
> + return 1;
> +
> + /* default to device mode if dax_kmem is disabled */
> + if (dax_drv->type == DAXDRV_DEVICE_TYPE &&
> + !IS_ENABLED(CONFIG_DEV_DAX_KMEM))
> + return 1;
> +
> + return 0;
> +}
> +
> enum id_action {
> ID_REMOVE,
> ID_ADD,
> @@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv)
> {
> struct dax_device_driver *dax_drv = to_dax_drv(drv);
>
> - /*
> - * All but the 'device-dax' driver, which has 'match_always'
> - * set, requires an exact id match.
> - */
> - if (dax_drv->match_always)
> + if (dax_match_id(dax_drv, dev))
> return 1;
> -
> - return dax_match_id(dax_drv, dev);
> + return dax_match_type(dax_drv, dev);
> }
>
> /*
> @@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> }
> EXPORT_SYMBOL_GPL(devm_create_dev_dax);
>
> -static int match_always_count;
> -
> int __dax_driver_register(struct dax_device_driver *dax_drv,
> struct module *module, const char *mod_name)
> {
> struct device_driver *drv = &dax_drv->drv;
> - int rc = 0;
>
> /*
> * dax_bus_probe() calls dax_drv->probe() unconditionally.
> @@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
> drv->mod_name = mod_name;
> drv->bus = &dax_bus_type;
>
> - /* there can only be one default driver */
> - mutex_lock(&dax_bus_lock);
> - match_always_count += dax_drv->match_always;
> - if (match_always_count > 1) {
> - match_always_count--;
> - WARN_ON(1);
> - rc = -EINVAL;
> - }
> - mutex_unlock(&dax_bus_lock);
> - if (rc)
> - return rc;
> -
> - rc = driver_register(drv);
> - if (rc && dax_drv->match_always) {
> - mutex_lock(&dax_bus_lock);
> - match_always_count -= dax_drv->match_always;
> - mutex_unlock(&dax_bus_lock);
> - }
> -
> - return rc;
> + return driver_register(drv);
> }
> EXPORT_SYMBOL_GPL(__dax_driver_register);
>
> @@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv)
> struct dax_id *dax_id, *_id;
>
> mutex_lock(&dax_bus_lock);
> - match_always_count -= dax_drv->match_always;
> list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) {
> list_del(&dax_id->list);
> kfree(dax_id);
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index fbb940293d6d..8cd79ab34292 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -11,7 +11,10 @@ struct dax_device;
> struct dax_region;
> void dax_region_put(struct dax_region *dax_region);
>
> -#define IORESOURCE_DAX_STATIC (1UL << 0)
> +/* dax bus specific ioresource flags */
> +#define IORESOURCE_DAX_STATIC BIT(0)
> +#define IORESOURCE_DAX_KMEM BIT(1)
> +
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> unsigned long flags);
> @@ -25,10 +28,15 @@ struct dev_dax_data {
>
> struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
>
> +enum dax_driver_type {
> + DAXDRV_KMEM_TYPE,
> + DAXDRV_DEVICE_TYPE,
> +};
> +
> struct dax_device_driver {
> struct device_driver drv;
> struct list_head ids;
> - int match_always;
> + enum dax_driver_type type;
> int (*probe)(struct dev_dax *dev);
> void (*remove)(struct dev_dax *dev);
> };
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 5494d745ced5..ecdff79e31f2 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe);
>
> static struct dax_device_driver device_dax_driver = {
> .probe = dev_dax_probe,
> - /* all probe actions are unwound by devm, so .remove isn't necessary */
> - .match_always = 1,
> + .type = DAXDRV_DEVICE_TYPE,
> };
>
> static int __init dax_init(void)
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index e7bdff3132fa..5ec08f9f8a57 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644);
>
> static int dax_hmem_probe(struct platform_device *pdev)
> {
> + unsigned long flags = IORESOURCE_DAX_KMEM;
> struct device *dev = &pdev->dev;
> struct dax_region *dax_region;
> struct memregion_info *mri;
> struct dev_dax_data data;
> struct dev_dax *dev_dax;
>
> + /*
> + * @region_idle == true indicates that an administrative agent
> + * wants to manipulate the range partitioning before the devices
> + * are created, so do not send them to the dax_kmem driver by
> + * default.
> + */
> + if (region_idle)
> + flags = 0;
> +
> mri = dev->platform_data;
> dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> - mri->target_node, PMD_SIZE, 0);
> + mri->target_node, PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 4852a2dbdb27..918d01d3fbaa 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
> static struct dax_device_driver device_dax_kmem_driver = {
> .probe = dev_dax_kmem_probe,
> .remove = dev_dax_kmem_remove,
> + .type = DAXDRV_KMEM_TYPE,
> };
>
> static int __init dax_kmem_init(void)
>
next prev parent reply other threads:[~2023-02-11 5:57 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-10 9:05 [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
2023-02-10 9:05 ` [PATCH v2 01/20] cxl/memdev: Fix endpoint port removal Dan Williams
2023-02-10 17:28 ` Jonathan Cameron
2023-02-10 21:14 ` Dan Williams
2023-02-10 23:17 ` Verma, Vishal L
2023-02-10 9:05 ` [PATCH v2 02/20] cxl/Documentation: Update references to attributes added in v6.0 Dan Williams
2023-02-10 9:05 ` [PATCH v2 03/20] cxl/region: Add a mode attribute for regions Dan Williams
2023-02-10 9:05 ` [PATCH v2 04/20] cxl/region: Support empty uuids for non-pmem regions Dan Williams
2023-02-10 17:30 ` Jonathan Cameron
2023-02-10 23:34 ` Ira Weiny
2023-02-10 9:05 ` [PATCH v2 05/20] cxl/region: Validate region mode vs decoder mode Dan Williams
2023-02-10 9:05 ` [PATCH v2 06/20] cxl/region: Add volatile region creation support Dan Williams
2023-02-10 9:06 ` [PATCH v2 07/20] cxl/region: Refactor attach_target() for autodiscovery Dan Williams
2023-02-10 9:06 ` [PATCH v2 08/20] cxl/region: Cleanup target list on attach error Dan Williams
2023-02-10 17:31 ` Jonathan Cameron
2023-02-10 23:17 ` Verma, Vishal L
2023-02-10 23:46 ` Ira Weiny
2023-02-10 9:06 ` [PATCH v2 09/20] cxl/region: Move region-position validation to a helper Dan Williams
2023-02-10 17:34 ` Jonathan Cameron
2023-02-10 9:06 ` [PATCH v2 10/20] kernel/range: Uplevel the cxl subsystem's range_contains() helper Dan Williams
2023-02-10 9:06 ` [PATCH v2 11/20] cxl/region: Enable CONFIG_CXL_REGION to be toggled Dan Williams
2023-02-10 9:06 ` [PATCH v2 12/20] cxl/port: Split endpoint and switch port probe Dan Williams
2023-02-10 17:41 ` Jonathan Cameron
2023-02-10 23:21 ` Verma, Vishal L
2023-02-10 9:06 ` [PATCH v2 13/20] cxl/region: Add region autodiscovery Dan Williams
2023-02-10 18:09 ` Jonathan Cameron
2023-02-10 21:35 ` Dan Williams
2023-02-14 13:23 ` Jonathan Cameron
2023-02-14 16:43 ` Dan Williams
2023-02-10 21:49 ` Dan Williams
2023-02-11 0:29 ` Verma, Vishal L
2023-02-11 1:03 ` Dan Williams
[not found] ` <CGME20230213192752uscas1p1c49508da4b100c9ba6a1a3aa92ca03e5@uscas1p1.samsung.com>
2023-02-13 19:27 ` Fan Ni
[not found] ` <CGME20230228185348uscas1p1a5314a077383ee81ac228c1b9f1da2f8@uscas1p1.samsung.com>
2023-02-28 18:53 ` Fan Ni
2023-02-10 9:06 ` [PATCH v2 14/20] tools/testing/cxl: Define a fixed volatile configuration to parse Dan Williams
2023-02-10 18:12 ` Jonathan Cameron
2023-02-10 18:36 ` Dave Jiang
2023-02-11 0:39 ` Verma, Vishal L
2023-02-10 9:06 ` [PATCH v2 15/20] dax/hmem: Move HMAT and Soft reservation probe initcall level Dan Williams
2023-02-10 21:53 ` Dave Jiang
2023-02-10 21:57 ` Dave Jiang
2023-02-11 0:40 ` Verma, Vishal L
2023-02-10 9:06 ` [PATCH v2 16/20] dax/hmem: Drop unnecessary dax_hmem_remove() Dan Williams
2023-02-10 21:59 ` Dave Jiang
2023-02-11 0:41 ` Verma, Vishal L
2023-02-10 9:07 ` [PATCH v2 17/20] dax/hmem: Convey the dax range via memregion_info() Dan Williams
2023-02-10 22:03 ` Dave Jiang
2023-02-11 4:25 ` Verma, Vishal L
2023-02-10 9:07 ` [PATCH v2 18/20] dax/hmem: Move hmem device registration to dax_hmem.ko Dan Williams
2023-02-10 18:25 ` Jonathan Cameron
2023-02-10 22:09 ` Dave Jiang
2023-02-11 4:41 ` Verma, Vishal L
2023-02-10 9:07 ` [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default Dan Williams
2023-02-10 22:19 ` Dave Jiang
2023-02-11 5:57 ` Verma, Vishal L [this message]
2023-02-10 9:07 ` [PATCH v2 20/20] cxl/dax: Create dax devices for CXL RAM regions Dan Williams
2023-02-10 18:38 ` Jonathan Cameron
2023-02-10 22:42 ` Dave Jiang
2023-02-10 17:53 ` [PATCH v2 00/20] CXL RAM and the 'Soft Reserved' => 'System RAM' default Dan Williams
2023-02-11 14:04 ` Gregory Price
2023-02-13 18:22 ` Gregory Price
2023-02-13 18:31 ` Gregory Price
[not found] ` <CGME20230222214151uscas1p26d53b2e198f63a1f382fe575c6c25070@uscas1p2.samsung.com>
2023-02-22 21:41 ` Fan Ni
2023-02-22 22:18 ` Dan Williams
2023-02-14 13:35 ` Jonathan Cameron
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=25626477e9500b1e17c15d5fbab1fd067dd57ef7.camel@intel.com \
--to=vishal.l.verma@intel.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=fan.ni@samsung.com \
--cc=gregory.price@memverge.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).