* [PATCH v6 00/12] mm: Sub-section memory hotplug support
@ 2019-04-17 18:38 Dan Williams
2019-04-17 18:39 ` [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Dan Williams @ 2019-04-17 18:38 UTC (permalink / raw)
To: akpm
Cc: David Hildenbrand, Jérôme Glisse, Logan Gunthorpe,
Toshi Kani, Jeff Moyer, Michal Hocko, Vlastimil Babka, stable,
linux-mm, linux-nvdimm, linux-kernel, mhocko, david
Changes since v5 [1]:
- Rebase on next-20190416 and the new 'struct mhp_restrictions'
infrastructure.
- Extend mhp_restrictions to the 'remove' case so the sub-section policy
can be clarified with respect to the memblock-api in a symmetric
manner with the 'add' case.
- Kill is_dev_zone() since cleanups have now made it moot
[1]: https://lwn.net/Articles/783808/
---
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem. However, it does not use the 'bottom
half' of memory hotplug, i.e. never marks pmem pages online and never
exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
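To make the cost of the padding approach concrete, here is a minimal sketch of the alignment arithmetic only. It assumes 128MB sections (the x86 value mentioned in this cover letter); `SECTION_SIZE`, `struct pad_result` and `section_pad()` are hypothetical names for illustration, and the real libnvdimm code is more selective about when it pads.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 128MB sections, the x86 value cited in the cover letter. */
#define SECTION_SIZE (UINT64_C(128) << 20)

/* Hypothetical result type: bytes lost at each end of the namespace. */
struct pad_result {
	uint64_t start_pad;  /* bytes skipped at the front */
	uint64_t end_trunc;  /* bytes dropped from the tail */
};

/* Hypothetical helper: trim [start, start + size) to whole sections. */
static struct pad_result section_pad(uint64_t start, uint64_t size)
{
	struct pad_result r;
	uint64_t end = start + size;
	/* round the start up and the end down to a section boundary */
	uint64_t aligned_start = (start + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
	uint64_t aligned_end = end & ~(SECTION_SIZE - 1);

	/* keep the sketch simple: require room for at least one full section */
	assert(size >= 2 * SECTION_SIZE);

	r.start_pad = aligned_start - start;
	r.end_trunc = end - aligned_end;
	return r;
}
```

For a 1GB namespace that starts 64MB past a section boundary, this arithmetic gives up 64MB at each end. If firmware moves the namespace on a later boot, the values recorded at creation time no longer match the new physical alignment.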
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
valid_section() check.
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on x86)
individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable memory
capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the
active-sub-section mask. This indication is in the same cacheline as
the valid_section() so the performance impact is expected to be
negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that deal
with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are
not touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
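The sub-section bitmask and the deeper pfn_valid() check described above can be sketched in userspace C. This is an illustration under stated assumptions, not the code from the series: it assumes 4KB pages, 128MB sections and 2MB sub-sections (so 64 bits cover one section), and `struct section_sketch`, `activate_range()` and `pfn_valid_sketch()` are hypothetical names. The pfn-to-section lookup is left out; a single section is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4KB pages, 128MB sections, 2MB sub-sections. */
#define PAGES_PER_SECTION_SKETCH    (UINT64_C(1) << (27 - 12)) /* 32768 */
#define PAGES_PER_SUBSECTION_SKETCH (UINT64_C(1) << (21 - 12)) /* 512 */

/* Hypothetical per-section state; not the kernel's actual layout. */
struct section_sketch {
	bool valid;              /* what a valid_section() check would see */
	uint64_t subsection_map; /* bit n set => sub-section n is active */
};

/* Index of the 2MB sub-section that contains pfn, within its section. */
static unsigned int subsection_index(uint64_t pfn)
{
	return (unsigned int)((pfn & (PAGES_PER_SECTION_SKETCH - 1)) /
			      PAGES_PER_SUBSECTION_SKETCH);
}

/* Mark every sub-section touched by [pfn, pfn + nr_pages) as active. */
static void activate_range(struct section_sketch *ms, uint64_t pfn,
			   uint64_t nr_pages)
{
	unsigned int i, first, last;

	assert(nr_pages > 0);
	/* the sketch handles a range that stays inside one section */
	assert((pfn & (PAGES_PER_SECTION_SKETCH - 1)) + nr_pages <=
	       PAGES_PER_SECTION_SKETCH);

	first = subsection_index(pfn);
	last = subsection_index(pfn + nr_pages - 1);
	for (i = first; i <= last; i++)
		ms->subsection_map |= UINT64_C(1) << i;
	ms->valid = true;
}

/* Section-level check first, then the sub-section bit. */
static bool pfn_valid_sketch(const struct section_sketch *ms, uint64_t pfn)
{
	return ms->valid && ((ms->subsection_map >> subsection_index(pfn)) & 1);
}
```

Activating 4MB starting at the fifth sub-section of a section sets two bits in the map; pfns in neighbouring sub-sections of the same section still report invalid, which a section-only check could not express.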
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relies on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.
These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
---
Dan Williams (12):
mm/sparsemem: Introduce struct mem_section_usage
mm/sparsemem: Introduce common definitions for the size and mask of a section
mm/sparsemem: Add helpers track active portions of a section at boot
mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: Add mem-hotplug restrictions for remove_memory()
mm: Kill is_dev_zone() helper
mm/sparsemem: Prepare for sub-section ranges
mm/sparsemem: Support sub-section hotplug
mm/devm_memremap_pages: Enable sub-section remap
libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
libnvdimm/pfn: Stop padding pmem namespaces to section alignment
arch/ia64/mm/init.c | 4
arch/powerpc/mm/mem.c | 5 -
arch/s390/mm/init.c | 2
arch/sh/mm/init.c | 4
arch/x86/mm/init_32.c | 4
arch/x86/mm/init_64.c | 9 +
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/pfn.h | 12 -
drivers/nvdimm/pfn_devs.c | 93 +++-------
include/linux/memory_hotplug.h | 12 +
include/linux/mm.h | 4
include/linux/mmzone.h | 72 ++++++--
kernel/memremap.c | 70 +++-----
mm/hmm.c | 2
mm/memory_hotplug.c | 148 +++++++++-------
mm/page_alloc.c | 8 +
mm/sparse-vmemmap.c | 21 ++
mm/sparse.c | 371 +++++++++++++++++++++++++++-------------
18 files changed, 503 insertions(+), 340 deletions(-)
* [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
2019-04-17 18:38 [PATCH v6 00/12] mm: Sub-section memory hotplug support Dan Williams
@ 2019-04-17 18:39 ` Dan Williams
2019-04-17 22:02 ` Andrew Morton
2019-04-17 22:03 ` [PATCH v6 00/12] mm: Sub-section memory hotplug support Andrew Morton
2019-05-02 22:46 ` Pavel Tatashin
2 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2019-04-17 18:39 UTC (permalink / raw)
To: akpm; +Cc: stable, linux-mm, linux-nvdimm, linux-kernel, mhocko, david
At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like 'flags' and
'padding', this potentially means that future implementations cannot rely
on those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate it
is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
corruption is expected to be benign since all other critical fields are
explicitly initialized.
Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/nvdimm/dax_devs.c | 2 +-
drivers/nvdimm/pfn.h | 1 +
drivers/nvdimm/pfn_devs.c | 18 +++++++++++++++---
3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 0453f49dc708..326f02ffca81 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -126,7 +126,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
nvdimm_bus_unlock(&ndns->dev);
if (!dax_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, DAX_SIG);
dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dde9853453d3..e901e3a3b04c 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -36,6 +36,7 @@ struct nd_pfn_sb {
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
__le32 align;
+ /* minor-version-3 guarantee the padding and flags are zero */
u8 padding[4000];
__le64 checksum;
};
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 01f40672507f..a2406253eb70 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -420,6 +420,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
return 0;
}
+/**
+ * nd_pfn_validate - read and validate info-block
+ * @nd_pfn: fsdax namespace runtime state / properties
+ * @sig: 'devdax' or 'fsdax' signature
+ *
+ * Upon return the info-block buffer contents (->pfn_sb) are
+ * indeterminate when validation fails, and a coherent info-block
+ * otherwise.
+ */
int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
{
u64 checksum, offset;
@@ -565,7 +574,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
nvdimm_bus_unlock(&ndns->dev);
if (!pfn_dev)
return -ENOMEM;
- pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
nd_pfn = to_nd_pfn(pfn_dev);
nd_pfn->pfn_sb = pfn_sb;
rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -702,7 +711,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
u64 checksum;
int rc;
- pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
+ pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
if (!pfn_sb)
return -ENOMEM;
@@ -711,11 +720,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
sig = DAX_SIG;
else
sig = PFN_SIG;
+
rc = nd_pfn_validate(nd_pfn, sig);
if (rc != -ENODEV)
return rc;
/* no info block, do init */;
+ memset(pfn_sb, 0, sizeof(*pfn_sb));
+
nd_region = to_nd_region(nd_pfn->dev.parent);
if (nd_region->ro) {
dev_info(&nd_pfn->dev,
@@ -768,7 +780,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
- pfn_sb->version_minor = cpu_to_le16(2);
+ pfn_sb->version_minor = cpu_to_le16(3);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
pfn_sb->align = cpu_to_le32(nd_pfn->align);
* Re: [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
2019-04-17 18:39 ` [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
@ 2019-04-17 22:02 ` Andrew Morton
2019-04-17 22:09 ` Dan Williams
0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2019-04-17 22:02 UTC (permalink / raw)
To: Dan Williams; +Cc: stable, linux-mm, linux-nvdimm, linux-kernel, mhocko, david
On Wed, 17 Apr 2019 11:39:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> At namespace creation time there is the potential for the "expected to
> be zero" fields of a 'pfn' info-block to be filled with indeterminate
> data. While the kernel buffer is zeroed on allocation it is immediately
> overwritten by nd_pfn_validate() filling it with the current contents of
> the on-media info-block location. For fields like, 'flags' and the
> 'padding' it potentially means that future implementations can not rely
> on those fields being zero.
>
> In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> section alignment, arrange for fields that are not explicitly
> initialized to be guaranteed zero. Bump the minor version to indicate it
> is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> corruption is expected to be benign since all other critical fields are
> explicitly initialized.
>
> Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Buried at the end of a 12 patch series. Should this be a standalone
patch, suitable for a prompt merge?
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-17 18:38 [PATCH v6 00/12] mm: Sub-section memory hotplug support Dan Williams
2019-04-17 18:39 ` [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
@ 2019-04-17 22:03 ` Andrew Morton
2019-04-17 22:59 ` Dan Williams
2019-05-02 22:46 ` Pavel Tatashin
2 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2019-04-17 22:03 UTC (permalink / raw)
To: Dan Williams
Cc: David Hildenbrand, Jérôme Glisse, Logan Gunthorpe,
Toshi Kani, Jeff Moyer, Michal Hocko, Vlastimil Babka, stable,
linux-mm, linux-nvdimm, linux-kernel
On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> The memory hotplug section is an arbitrary / convenient unit for memory
> hotplug. 'Section-size' units have bled into the user interface
> ('memblock' sysfs) and can not be changed without breaking existing
> userspace. The section-size constraint, while mostly benign for typical
> memory hotplug, has and continues to wreak havoc with 'device-memory'
> use cases, persistent memory (pmem) in particular. Recall that pmem uses
> devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> 'struct page' memmap for pmem. However, it does not use the 'bottom
> half' of memory hotplug, i.e. never marks pmem pages online and never
> exposes the userspace memblock interface for pmem. This leaves an
> opening to redress the section-size constraint.
v6 and we're not showing any review activity. Who would be suitable
people to help out here?
* Re: [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
2019-04-17 22:02 ` Andrew Morton
@ 2019-04-17 22:09 ` Dan Williams
0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2019-04-17 22:09 UTC (permalink / raw)
To: Andrew Morton
Cc: stable, Linux MM, linux-nvdimm, Linux Kernel Mailing List,
Michal Hocko, David Hildenbrand
On Wed, Apr 17, 2019 at 3:02 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 17 Apr 2019 11:39:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > At namespace creation time there is the potential for the "expected to
> > be zero" fields of a 'pfn' info-block to be filled with indeterminate
> > data. While the kernel buffer is zeroed on allocation it is immediately
> > overwritten by nd_pfn_validate() filling it with the current contents of
> > the on-media info-block location. For fields like, 'flags' and the
> > 'padding' it potentially means that future implementations can not rely
> > on those fields being zero.
> >
> > In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> > section alignment, arrange for fields that are not explicitly
> > initialized to be guaranteed zero. Bump the minor version to indicate it
> > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> > corruption is expected to be benign since all other critical fields are
> > explicitly initialized.
> >
> > Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Buried at the end of a 12 patch series. Should this be a standalone
> patch, suitable for a prompt merge?
It's not a problem unless a kernel implementation is explicitly
expecting those fields to be zero-initialized. I only marked it for
-stable in case some future kernel backports patch12. Otherwise it's
benign on older kernels that don't have patch12 since all fields are
indeed initialized.
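The ordering problem that patch 11 addresses can be sketched in userspace C. This is a simplified model, not the driver code: `struct info_block_sketch`, `validate_sketch()`, `init_sketch()` and the "SKETCH_SIG" signature are hypothetical stand-ins, and the struct is heavily reduced compared to the real 'struct nd_pfn_sb'.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced info-block. */
struct info_block_sketch {
	char signature[16];
	uint32_t flags;         /* "expected to be zero" */
	uint16_t version_minor;
	uint8_t padding[32];    /* "expected to be zero" */
};

/* Stand-in for the validate step: read media into the buffer, then check it. */
static int validate_sketch(struct info_block_sketch *sb, const void *media,
			   const char *sig)
{
	/* whatever was on media now replaces the buffer contents */
	memcpy(sb, media, sizeof(*sb));
	if (memcmp(sb->signature, sig, strlen(sig)) != 0)
		return -ENODEV;
	return 0;
}

/* Stand-in for the init path: write a fresh info-block if none was found. */
static int init_sketch(struct info_block_sketch *sb, const void *media,
		       const char *sig)
{
	int rc = validate_sketch(sb, media, sig);

	if (rc != -ENODEV)
		return rc;

	/* no info-block: clear the stale media contents before filling fields */
	memset(sb, 0, sizeof(*sb));
	assert(strlen(sig) < sizeof(sb->signature));
	memcpy(sb->signature, sig, strlen(sig));
	sb->version_minor = 3; /* mirrors the minor-version bump in the patch */
	return 0;
}
```

Zeroing the buffer at allocation does not help in this model, because the validate step overwrites it before any field is initialized. Clearing it after the failed validation is what leaves the untouched fields at zero.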
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-17 22:03 ` [PATCH v6 00/12] mm: Sub-section memory hotplug support Andrew Morton
@ 2019-04-17 22:59 ` Dan Williams
2019-04-18 2:09 ` Dan Williams
2019-04-23 13:16 ` Oscar Salvador
0 siblings, 2 replies; 15+ messages in thread
From: Dan Williams @ 2019-04-17 22:59 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Jérôme Glisse, Logan Gunthorpe,
Toshi Kani, Jeff Moyer, Michal Hocko, Vlastimil Babka, stable,
Linux MM, linux-nvdimm, Linux Kernel Mailing List, osalvador
On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > The memory hotplug section is an arbitrary / convenient unit for memory
> > hotplug. 'Section-size' units have bled into the user interface
> > ('memblock' sysfs) and can not be changed without breaking existing
> > userspace. The section-size constraint, while mostly benign for typical
> > memory hotplug, has and continues to wreak havoc with 'device-memory'
> > use cases, persistent memory (pmem) in particular. Recall that pmem uses
> > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> > 'struct page' memmap for pmem. However, it does not use the 'bottom
> > half' of memory hotplug, i.e. never marks pmem pages online and never
> > exposes the userspace memblock interface for pmem. This leaves an
> > opening to redress the section-size constraint.
>
> v6 and we're not showing any review activity. Who would be suitable
> people to help out here?
There was quite a bit of review of the cover letter from Michal and
David, but you're right, the details have not seen much review yet. I'd like to
call out other people where I can reciprocate with some review of my
own. Oscar's altmap work looks like a good candidate for that.
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-17 22:59 ` Dan Williams
@ 2019-04-18 2:09 ` Dan Williams
2019-04-18 12:45 ` Jeff Moyer
2019-04-23 13:16 ` Oscar Salvador
1 sibling, 1 reply; 15+ messages in thread
From: Dan Williams @ 2019-04-18 2:09 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Jérôme Glisse, Logan Gunthorpe,
Toshi Kani, Jeff Moyer, Michal Hocko, Vlastimil Babka, stable,
Linux MM, linux-nvdimm, Linux Kernel Mailing List, osalvador
On Wed, Apr 17, 2019 at 3:59 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > The memory hotplug section is an arbitrary / convenient unit for memory
> > > hotplug. 'Section-size' units have bled into the user interface
> > > ('memblock' sysfs) and can not be changed without breaking existing
> > > userspace. The section-size constraint, while mostly benign for typical
> > > memory hotplug, has and continues to wreak havoc with 'device-memory'
> > > use cases, persistent memory (pmem) in particular. Recall that pmem uses
> > > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> > > 'struct page' memmap for pmem. However, it does not use the 'bottom
> > > half' of memory hotplug, i.e. never marks pmem pages online and never
> > > exposes the userspace memblock interface for pmem. This leaves an
> > > opening to redress the section-size constraint.
> >
> > v6 and we're not showing any review activity. Who would be suitable
> > people to help out here?
>
> There was quite a bit of review of the cover letter from Michal and
> David, but you're right the details not so much as of yet. I'd like to
> call out other people where I can reciprocate with some review of my
> own. Oscar's altmap work looks like a good candidate for that.
I'm also hoping Jeff can give a tested-by for the customer scenarios
that fall over with the current implementation.
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-18 2:09 ` Dan Williams
@ 2019-04-18 12:45 ` Jeff Moyer
2019-04-19 3:25 ` Dan Williams
0 siblings, 1 reply; 15+ messages in thread
From: Jeff Moyer @ 2019-04-18 12:45 UTC (permalink / raw)
To: Dan Williams
Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
Logan Gunthorpe, Toshi Kani, Michal Hocko, Vlastimil Babka,
stable, Linux MM, linux-nvdimm, Linux Kernel Mailing List,
osalvador
Dan Williams <dan.j.williams@intel.com> writes:
> On Wed, Apr 17, 2019 at 3:59 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> > On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> > > The memory hotplug section is an arbitrary / convenient unit for memory
>> > > hotplug. 'Section-size' units have bled into the user interface
>> > > ('memblock' sysfs) and can not be changed without breaking existing
>> > > userspace. The section-size constraint, while mostly benign for typical
>> > > memory hotplug, has and continues to wreak havoc with 'device-memory'
>> > > use cases, persistent memory (pmem) in particular. Recall that pmem uses
>> > > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
>> > > 'struct page' memmap for pmem. However, it does not use the 'bottom
>> > > half' of memory hotplug, i.e. never marks pmem pages online and never
>> > > exposes the userspace memblock interface for pmem. This leaves an
>> > > opening to redress the section-size constraint.
>> >
>> > v6 and we're not showing any review activity. Who would be suitable
>> > people to help out here?
>>
>> There was quite a bit of review of the cover letter from Michal and
>> David, but you're right the details not so much as of yet. I'd like to
>> call out other people where I can reciprocate with some review of my
>> own. Oscar's altmap work looks like a good candidate for that.
>
> I'm also hoping Jeff can give a tested-by for the customer scenarios
> that fall over with the current implementation.
Sure. I'll also have a look over the patches.
-Jeff
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-18 12:45 ` Jeff Moyer
@ 2019-04-19 3:25 ` Dan Williams
0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2019-04-19 3:25 UTC (permalink / raw)
To: Jeff Moyer
Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
Logan Gunthorpe, Toshi Kani, Michal Hocko, Vlastimil Babka,
stable, Linux MM, linux-nvdimm, Linux Kernel Mailing List,
osalvador
On Thu, Apr 18, 2019 at 5:45 AM Jeff Moyer <jmoyer@redhat.com> wrote:
[..]
> >> > v6 and we're not showing any review activity. Who would be suitable
> >> > people to help out here?
> >>
> >> There was quite a bit of review of the cover letter from Michal and
> >> David, but you're right the details not so much as of yet. I'd like to
> >> call out other people where I can reciprocate with some review of my
> >> own. Oscar's altmap work looks like a good candidate for that.
> >
> > I'm also hoping Jeff can give a tested-by for the customer scenarios
> > that fall over with the current implementation.
>
> Sure. I'll also have a look over the patches.
Andrew, heads up it looks like there is a memory corruption bug in
these patches as I've gotten a few reports of "bad page state" at
boot. Please drop until I can track down the failure.
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-17 22:59 ` Dan Williams
2019-04-18 2:09 ` Dan Williams
@ 2019-04-23 13:16 ` Oscar Salvador
2019-04-24 20:43 ` Pavel Tatashin
1 sibling, 1 reply; 15+ messages in thread
From: Oscar Salvador @ 2019-04-23 13:16 UTC (permalink / raw)
To: Dan Williams, Andrew Morton
Cc: David Hildenbrand, Jérôme Glisse, Logan Gunthorpe,
Toshi Kani, Jeff Moyer, Michal Hocko, Vlastimil Babka, stable,
Linux MM, linux-nvdimm, Linux Kernel Mailing List
On Wed, 2019-04-17 at 15:59 -0700, Dan Williams wrote:
> On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > The memory hotplug section is an arbitrary / convenient unit for memory
> > > hotplug. 'Section-size' units have bled into the user interface
> > > ('memblock' sysfs) and can not be changed without breaking existing
> > > userspace. The section-size constraint, while mostly benign for typical
> > > memory hotplug, has and continues to wreak havoc with 'device-memory'
> > > use cases, persistent memory (pmem) in particular. Recall that pmem uses
> > > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> > > 'struct page' memmap for pmem. However, it does not use the 'bottom
> > > half' of memory hotplug, i.e. never marks pmem pages online and never
> > > exposes the userspace memblock interface for pmem. This leaves an
> > > opening to redress the section-size constraint.
> >
> > v6 and we're not showing any review activity. Who would be suitable
> > people to help out here?
>
> There was quite a bit of review of the cover letter from Michal and
> David, but you're right the details not so much as of yet. I'd like to
> call out other people where I can reciprocate with some review of my
> own. Oscar's altmap work looks like a good candidate for that.
Thanks Dan for ccing me.
I will take a look at the patches soon.
--
Oscar Salvador
SUSE L3
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-23 13:16 ` Oscar Salvador
@ 2019-04-24 20:43 ` Pavel Tatashin
0 siblings, 0 replies; 15+ messages in thread
From: Pavel Tatashin @ 2019-04-24 20:43 UTC (permalink / raw)
To: Oscar Salvador
Cc: Dan Williams, Andrew Morton, David Hildenbrand,
Jérôme Glisse, Logan Gunthorpe, Toshi Kani, Jeff Moyer,
Michal Hocko, Vlastimil Babka, stable, Linux MM, linux-nvdimm,
Linux Kernel Mailing List
I am also taking a look at this work now. I will review and test it in
the next couple of days.
Pasha
On Tue, Apr 23, 2019 at 9:17 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, 2019-04-17 at 15:59 -0700, Dan Williams wrote:
> > On Wed, Apr 17, 2019 at 3:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed, 17 Apr 2019 11:38:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > The memory hotplug section is an arbitrary / convenient unit for memory
> > > > hotplug. 'Section-size' units have bled into the user interface
> > > > ('memblock' sysfs) and can not be changed without breaking existing
> > > > userspace. The section-size constraint, while mostly benign for typical
> > > > memory hotplug, has and continues to wreak havoc with 'device-memory'
> > > > use cases, persistent memory (pmem) in particular. Recall that pmem uses
> > > > devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> > > > 'struct page' memmap for pmem. However, it does not use the 'bottom
> > > > half' of memory hotplug, i.e. never marks pmem pages online and never
> > > > exposes the userspace memblock interface for pmem. This leaves an
> > > > opening to redress the section-size constraint.
> > >
> > > v6 and we're not showing any review activity. Who would be suitable
> > > people to help out here?
> >
> > There was quite a bit of review of the cover letter from Michal and
> > David, but you're right the details not so much as of yet. I'd like to
> > call out other people where I can reciprocate with some review of my
> > own. Oscar's altmap work looks like a good candidate for that.
>
> Thanks Dan for ccing me.
> I will take a look at the patches soon.
>
> --
> Oscar Salvador
> SUSE L3
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-04-17 18:38 [PATCH v6 00/12] mm: Sub-section memory hotplug support Dan Williams
2019-04-17 18:39 ` [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
2019-04-17 22:03 ` [PATCH v6 00/12] mm: Sub-section memory hotplug support Andrew Morton
@ 2019-05-02 22:46 ` Pavel Tatashin
2019-05-02 23:20 ` Dan Williams
2 siblings, 1 reply; 15+ messages in thread
From: Pavel Tatashin @ 2019-05-02 22:46 UTC (permalink / raw)
To: Dan Williams
Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
Logan Gunthorpe, Toshi Kani, Jeff Moyer, Michal Hocko,
Vlastimil Babka, stable, linux-mm, linux-nvdimm, LKML
Hi Dan,
How do you test these patches? Do you have any instructions?
I see for example that check_hotplug_memory_range() still enforces
memory_block_size_bytes() alignment.
Also, after removing check_hotplug_memory_range(), I tried to online
16M aligned DAX memory, and got the following panic:
# echo online > /sys/devices/system/memory/memory7/state
[ 202.193132] WARNING: CPU: 2 PID: 351 at drivers/base/memory.c:207 memory_block_action+0x110/0x178
[ 202.193391] Modules linked in:
[ 202.193698] CPU: 2 PID: 351 Comm: sh Not tainted 5.1.0-rc7_pt_devdax-00038-g865af4385544-dirty #9
[ 202.193909] Hardware name: linux,dummy-virt (DT)
[ 202.194122] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 202.194243] pc : memory_block_action+0x110/0x178
[ 202.194404] lr : memory_block_action+0x90/0x178
[ 202.194506] sp : ffff000016763ca0
[ 202.194592] x29: ffff000016763ca0 x28: ffff80016fd29b80
[ 202.194724] x27: 0000000000000000 x26: 0000000000000000
[ 202.194838] x25: ffff000015546000 x24: 00000000001c0000
[ 202.194949] x23: 0000000000000000 x22: 0000000000040000
[ 202.195058] x21: 00000000001c0000 x20: 0000000000000008
[ 202.195168] x19: 0000000000000007 x18: 0000000000000000
[ 202.195281] x17: 0000000000000000 x16: 0000000000000000
[ 202.195393] x15: 0000000000000000 x14: 0000000000000000
[ 202.195505] x13: 0000000000000000 x12: 0000000000000000
[ 202.195614] x11: 0000000000000000 x10: 0000000000000000
[ 202.195744] x9 : 0000000000000000 x8 : 0000000180000000
[ 202.195858] x7 : 0000000000000018 x6 : ffff000015541930
[ 202.195966] x5 : ffff000015541930 x4 : 0000000000000001
[ 202.196074] x3 : 0000000000000001 x2 : 0000000000000000
[ 202.196185] x1 : 0000000000000070 x0 : 0000000000000000
[ 202.196366] Call trace:
[ 202.196455] memory_block_action+0x110/0x178
[ 202.196589] memory_subsys_online+0x3c/0x80
[ 202.196681] device_online+0x6c/0x90
[ 202.196761] state_store+0x84/0x100
[ 202.196841] dev_attr_store+0x18/0x28
[ 202.196927] sysfs_kf_write+0x40/0x58
[ 202.197010] kernfs_fop_write+0xcc/0x1d8
[ 202.197099] __vfs_write+0x18/0x40
[ 202.197187] vfs_write+0xa4/0x1b0
[ 202.197295] ksys_write+0x64/0xd8
[ 202.197430] __arm64_sys_write+0x18/0x20
[ 202.197521] el0_svc_common.constprop.0+0x7c/0xe8
[ 202.197621] el0_svc_handler+0x28/0x78
[ 202.197706] el0_svc+0x8/0xc
[ 202.197828] ---[ end trace 57719823dda6d21e ]---
Thank you,
Pasha
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-05-02 22:46 ` Pavel Tatashin
@ 2019-05-02 23:20 ` Dan Williams
2019-05-02 23:21 ` Dan Williams
2019-05-03 10:48 ` Oscar Salvador
0 siblings, 2 replies; 15+ messages in thread
From: Dan Williams @ 2019-05-02 23:20 UTC (permalink / raw)
To: Pavel Tatashin
Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
Logan Gunthorpe, Toshi Kani, Jeff Moyer, Michal Hocko,
Vlastimil Babka, stable, linux-mm, linux-nvdimm, LKML
On Thu, May 2, 2019 at 3:46 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> Hi Dan,
>
> How do you test these patches? Do you have any instructions?
Yes, I briefly mentioned this in the cover letter, but here is the
test I am using:
>
> I see for example that check_hotplug_memory_range() still enforces
> memory_block_size_bytes() alignment.
>
> Also, after removing check_hotplug_memory_range(), I tried to online
> 16M aligned DAX memory, and got the following panic:
Right, this functionality is currently strictly limited to the
devm_memremap_pages() case, where there are guarantees that the memory
will never be onlined. This is because the section size is entangled
with the memblock API. That said, I would have expected you to trigger
the warning in subsection_check() before getting this far into the
hotplug process.
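For reference, the block-size alignment rule mentioned above can be modeled in user space roughly as follows. This is only a sketch: is_aligned() and hotplug_range_ok() are hypothetical names, not the kernel's check_hotplug_memory_range(), and the block size is passed in as a parameter instead of being read from memory_block_size_bytes().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for an alignment test.
 * 'align' must be a non-zero power of two.
 */
static bool is_aligned(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x & (align - 1)) == 0;
}

/*
 * Model of the rule under discussion: a range is accepted only if it
 * is non-empty and both its start and its size are multiples of the
 * memory block size. This is an illustration, not kernel code.
 */
static bool hotplug_range_ok(uint64_t start, uint64_t size,
			     uint64_t block_size)
{
	return size != 0 &&
	       is_aligned(start, block_size) &&
	       is_aligned(size, block_size);
}
```

With an assumed 1 GiB block size, hotplug_range_ok(0x41000000, 0x1000000, 0x40000000) is false (a 16 MiB range at a 16 MiB-aligned offset), while hotplug_range_ok(0x40000000, 0x40000000, 0x40000000) is true (one whole block).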
>
> # echo online > /sys/devices/system/memory/memory7/state
> [ 202.193132] WARNING: CPU: 2 PID: 351 at drivers/base/memory.c:207 memory_block_action+0x110/0x178
> [ 202.193391] Modules linked in:
> [ 202.193698] CPU: 2 PID: 351 Comm: sh Not tainted 5.1.0-rc7_pt_devdax-00038-g865af4385544-dirty #9
> [ 202.193909] Hardware name: linux,dummy-virt (DT)
> [ 202.194122] pstate: 60000005 (nZCv daif -PAN -UAO)
> [ 202.194243] pc : memory_block_action+0x110/0x178
> [ 202.194404] lr : memory_block_action+0x90/0x178
> [ 202.194506] sp : ffff000016763ca0
> [ 202.194592] x29: ffff000016763ca0 x28: ffff80016fd29b80
> [ 202.194724] x27: 0000000000000000 x26: 0000000000000000
> [ 202.194838] x25: ffff000015546000 x24: 00000000001c0000
> [ 202.194949] x23: 0000000000000000 x22: 0000000000040000
> [ 202.195058] x21: 00000000001c0000 x20: 0000000000000008
> [ 202.195168] x19: 0000000000000007 x18: 0000000000000000
> [ 202.195281] x17: 0000000000000000 x16: 0000000000000000
> [ 202.195393] x15: 0000000000000000 x14: 0000000000000000
> [ 202.195505] x13: 0000000000000000 x12: 0000000000000000
> [ 202.195614] x11: 0000000000000000 x10: 0000000000000000
> [ 202.195744] x9 : 0000000000000000 x8 : 0000000180000000
> [ 202.195858] x7 : 0000000000000018 x6 : ffff000015541930
> [ 202.195966] x5 : ffff000015541930 x4 : 0000000000000001
> [ 202.196074] x3 : 0000000000000001 x2 : 0000000000000000
> [ 202.196185] x1 : 0000000000000070 x0 : 0000000000000000
> [ 202.196366] Call trace:
> [ 202.196455] memory_block_action+0x110/0x178
> [ 202.196589] memory_subsys_online+0x3c/0x80
> [ 202.196681] device_online+0x6c/0x90
> [ 202.196761] state_store+0x84/0x100
> [ 202.196841] dev_attr_store+0x18/0x28
> [ 202.196927] sysfs_kf_write+0x40/0x58
> [ 202.197010] kernfs_fop_write+0xcc/0x1d8
> [ 202.197099] __vfs_write+0x18/0x40
> [ 202.197187] vfs_write+0xa4/0x1b0
> [ 202.197295] ksys_write+0x64/0xd8
> [ 202.197430] __arm64_sys_write+0x18/0x20
> [ 202.197521] el0_svc_common.constprop.0+0x7c/0xe8
> [ 202.197621] el0_svc_handler+0x28/0x78
> [ 202.197706] el0_svc+0x8/0xc
> [ 202.197828] ---[ end trace 57719823dda6d21e ]---
>
> Thank you,
> Pasha
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-05-02 23:20 ` Dan Williams
@ 2019-05-02 23:21 ` Dan Williams
2019-05-03 10:48 ` Oscar Salvador
1 sibling, 0 replies; 15+ messages in thread
From: Dan Williams @ 2019-05-02 23:21 UTC (permalink / raw)
To: Pavel Tatashin
Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
Logan Gunthorpe, Toshi Kani, Jeff Moyer, Michal Hocko,
Vlastimil Babka, stable, linux-mm, linux-nvdimm, LKML
On Thu, May 2, 2019 at 4:20 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, May 2, 2019 at 3:46 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > Hi Dan,
> >
> > How do you test these patches? Do you have any instructions?
>
> Yes, I briefly mentioned this in the cover letter, but here is the
> test I am using:
Sorry, fumble fingered the 'send' button, here is that link:
https://github.com/pmem/ndctl/blob/subsection-pending/test/sub-section.sh
* Re: [PATCH v6 00/12] mm: Sub-section memory hotplug support
2019-05-02 23:20 ` Dan Williams
2019-05-02 23:21 ` Dan Williams
@ 2019-05-03 10:48 ` Oscar Salvador
1 sibling, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2019-05-03 10:48 UTC (permalink / raw)
To: Dan Williams
Cc: Pavel Tatashin, Andrew Morton, David Hildenbrand,
Jérôme Glisse, Logan Gunthorpe, Toshi Kani, Jeff Moyer,
Michal Hocko, Vlastimil Babka, stable, linux-mm, linux-nvdimm,
LKML
On Thu, May 02, 2019 at 04:20:03PM -0700, Dan Williams wrote:
> On Thu, May 2, 2019 at 3:46 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > Hi Dan,
> >
> > How do you test these patches? Do you have any instructions?
>
> Yes, I briefly mentioned this in the cover letter, but here is the
> test I am using:
>
> >
> > I see for example that check_hotplug_memory_range() still enforces
> > memory_block_size_bytes() alignment.
> >
> > Also, after removing check_hotplug_memory_range(), I tried to online
> > 16M aligned DAX memory, and got the following panic:
>
> Right, this functionality is currently strictly limited to the
> devm_memremap_pages() case, where there are guarantees that the memory
> will never be onlined. This is because the section size is entangled
> with the memblock API. That said, I would have expected you to trigger
> the warning in subsection_check() before getting this far into the
> hotplug process.
> >
> > # echo online > /sys/devices/system/memory/memory7/state
> > [ 202.193132] WARNING: CPU: 2 PID: 351 at drivers/base/memory.c:207 memory_block_action+0x110/0x178
> > [ 202.193391] Modules linked in:
> > [ 202.193698] CPU: 2 PID: 351 Comm: sh Not tainted 5.1.0-rc7_pt_devdax-00038-g865af4385544-dirty #9
> > [ 202.193909] Hardware name: linux,dummy-virt (DT)
> > [ 202.194122] pstate: 60000005 (nZCv daif -PAN -UAO)
> > [ 202.194243] pc : memory_block_action+0x110/0x178
> > [ 202.194404] lr : memory_block_action+0x90/0x178
> > [ 202.194506] sp : ffff000016763ca0
> > [ 202.194592] x29: ffff000016763ca0 x28: ffff80016fd29b80
> > [ 202.194724] x27: 0000000000000000 x26: 0000000000000000
> > [ 202.194838] x25: ffff000015546000 x24: 00000000001c0000
> > [ 202.194949] x23: 0000000000000000 x22: 0000000000040000
> > [ 202.195058] x21: 00000000001c0000 x20: 0000000000000008
> > [ 202.195168] x19: 0000000000000007 x18: 0000000000000000
> > [ 202.195281] x17: 0000000000000000 x16: 0000000000000000
> > [ 202.195393] x15: 0000000000000000 x14: 0000000000000000
> > [ 202.195505] x13: 0000000000000000 x12: 0000000000000000
> > [ 202.195614] x11: 0000000000000000 x10: 0000000000000000
> > [ 202.195744] x9 : 0000000000000000 x8 : 0000000180000000
> > [ 202.195858] x7 : 0000000000000018 x6 : ffff000015541930
> > [ 202.195966] x5 : ffff000015541930 x4 : 0000000000000001
> > [ 202.196074] x3 : 0000000000000001 x2 : 0000000000000000
> > [ 202.196185] x1 : 0000000000000070 x0 : 0000000000000000
> > [ 202.196366] Call trace:
> > [ 202.196455] memory_block_action+0x110/0x178
> > [ 202.196589] memory_subsys_online+0x3c/0x80
> > [ 202.196681] device_online+0x6c/0x90
> > [ 202.196761] state_store+0x84/0x100
> > [ 202.196841] dev_attr_store+0x18/0x28
> > [ 202.196927] sysfs_kf_write+0x40/0x58
> > [ 202.197010] kernfs_fop_write+0xcc/0x1d8
> > [ 202.197099] __vfs_write+0x18/0x40
> > [ 202.197187] vfs_write+0xa4/0x1b0
> > [ 202.197295] ksys_write+0x64/0xd8
> > [ 202.197430] __arm64_sys_write+0x18/0x20
> > [ 202.197521] el0_svc_common.constprop.0+0x7c/0xe8
> > [ 202.197621] el0_svc_handler+0x28/0x78
> > [ 202.197706] el0_svc+0x8/0xc
> > [ 202.197828] ---[ end trace 57719823dda6d21e ]---

This warning relates to:

	for (; section_nr < section_nr_end; section_nr++) {
		if (WARN_ON_ONCE(!pfn_valid(pfn)))
			return false;

from pages_correctly_probed().

AFAICS, this is orthogonal to subsection_check().
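
As a rough user-space model of that loop (not the kernel code): the check looks at one pfn per section in the block and fails as soon as one of them is not valid. The callback type, the two stub predicates, and all numbers below are made up for illustration, and the sketch assumes pfn advances by one section per iteration, which the excerpt above does not show.

```c
#include <stdbool.h>

/* Hypothetical callback standing in for the kernel's pfn_valid(). */
typedef bool (*pfn_valid_fn)(unsigned long pfn);

/* Stub: every pfn is treated as valid. */
static bool all_valid(unsigned long pfn)
{
	(void)pfn;
	return true;
}

/*
 * Stub: only an illustrative sub-range [0x41000, 0x42000) is valid,
 * i.e. a range that does not begin at the start of the block.
 */
static bool partial_valid(unsigned long pfn)
{
	return pfn >= 0x41000UL && pfn < 0x42000UL;
}

/*
 * Walk the sections of one memory block and test the first pfn of
 * each section. Returns false at the first pfn that is not valid,
 * which is the point where the excerpt above would warn.
 */
static bool block_correctly_probed(unsigned long start_pfn,
				   unsigned long sections_per_block,
				   unsigned long pages_per_section,
				   pfn_valid_fn valid)
{
	unsigned long pfn = start_pfn;
	unsigned long i;

	for (i = 0; i < sections_per_block; i++) {
		if (!valid(pfn))
			return false;
		pfn += pages_per_section;
	}
	return true;
}
```

In this model, block_correctly_probed(0x40000, 1, 0x8000, partial_valid) returns false because the first pfn of the block lies outside the valid sub-range, even though part of the block is populated.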
--
Oscar Salvador
SUSE L3
Thread overview: 15+ messages
2019-04-17 18:38 [PATCH v6 00/12] mm: Sub-section memory hotplug support Dan Williams
2019-04-17 18:39 ` [PATCH v6 11/12] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
2019-04-17 22:02 ` Andrew Morton
2019-04-17 22:09 ` Dan Williams
2019-04-17 22:03 ` [PATCH v6 00/12] mm: Sub-section memory hotplug support Andrew Morton
2019-04-17 22:59 ` Dan Williams
2019-04-18 2:09 ` Dan Williams
2019-04-18 12:45 ` Jeff Moyer
2019-04-19 3:25 ` Dan Williams
2019-04-23 13:16 ` Oscar Salvador
2019-04-24 20:43 ` Pavel Tatashin
2019-05-02 22:46 ` Pavel Tatashin
2019-05-02 23:20 ` Dan Williams
2019-05-02 23:21 ` Dan Williams
2019-05-03 10:48 ` Oscar Salvador