* [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse
@ 2023-11-19 16:56 Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 01/27] arm64: mte: Rework naming for tag manipulation functions Alexandru Elisei
                   ` (26 more replies)
  0 siblings, 27 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

The series is based on v6.7-rc1 and can be cloned with:

$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
	-b arm-mte-dynamic-carveout-rfc-v2

Introduction
============

Memory Tagging Extension (MTE) is currently implemented with a static
carve-out of DRAM to store the allocation tags (a.k.a. memory colour).
This is what we call the tag storage. Each 16 bytes of data have 4 bits of
tags, so the tag storage takes up 1/32 of the DRAM, roughly 3%. This is
done transparently by the hardware/interconnect (with firmware setup) and
normally hidden from the OS. So a checked memory access to location X
generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X to Y
is linear (subject to a minimum block size to deal with some address
interleaving). The software doesn't need to know about this correspondence
as we have specific instructions like STG/LDG to location X that lead to a
tag store/load to Y.
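
As a rough illustration of the linear correspondence (the real mapping is
set up by firmware, and this sketch ignores address interleaving), the tag
storage byte for a data address can be computed like this:

/* Illustrative only: 4 bits of tag per 16-byte granule means one tag
 * byte covers 32 bytes of data, hence the 1/32 (~3%) overhead. */
static unsigned long tag_storage_offset(unsigned long data_addr,
					unsigned long dram_base)
{
	return (data_addr - dram_base) / 32;
}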

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).  For
example, some large allocations may not use PROT_MTE at all or only for the
first and last page, since initialising the tags takes time. The
side-effect is that, of the 3% of DRAM set aside for tags, only a fraction,
say 1%, is effectively used.

The series aims to take that unused tag storage and release it to the page
allocator for normal data usage.

The first complication is that a PROT_MTE page allocation at address X will
need to reserve the tag storage page at location Y (and migrate any data in
that page if it is in use).

To make things worse, pages in the tag storage/carve-out range cannot use
PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific, like cache coherency between
tags and data accesses. There is a draft architecture spec, which will be
released soon, detailing how the hardware behaves.

All of this will be entirely transparent to userspace. As with the current
kernel (without this dynamic tag storage), a user only needs to ask for
PROT_MTE mappings to get tagged pages.
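
For example, a program only has to do something like the snippet below to
get tagged memory (PROT_MTE comes from the arm64 uapi headers; the fallback
value shown is the current arm64 one):

#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE	0x20	/* arm64-specific */
#endif

char *alloc_tagged_page(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* accesses through a tagged pointer to this mapping are checked
	 * against the allocation tags (bits 59:56 of the pointer) */
	return p == MAP_FAILED ? NULL : p;
}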

Implementation
==============

MTE tag storage reuse is accomplished with the following changes to the
Linux kernel:

1. The tag storage memory is exposed to the memory allocator as
MIGRATE_CMA. The arm64 code manages this memory directly instead of using
cma_declare_contiguous/cma_alloc for performance reasons.

There is a limitation to this approach: MIGRATE_CMA cannot be used for
tagged allocations, even if not all MIGRATE_CMA memory is tag storage.

2. mprotect(PROT_MTE) is implemented by adding a fault-on-access mechanism
for existing pages. When a page is next accessed, a fault is taken and the
corresponding tag storage is reserved.

3. When the code tries to copy tags to a page which doesn't have its tag
storage reserved (when swapping in a newly allocated page, or during
migration/THP collapse), the tags are copied to an xarray and restored once
tag storage is reserved for the destination page.
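
A minimal sketch of that fallback path, reusing the helpers renamed in
patch 1 (the xarray name and the pfn indexing here are illustrative, not
the series' exact identifiers):

static DEFINE_XARRAY(pending_tags_by_pfn);

static int defer_page_tags(struct page *page)
{
	void *tags = mte_allocate_tag_buf();

	if (!tags)
		return -ENOMEM;
	mte_copy_page_tags_to_buf(page_address(page), tags);
	/* restored (and freed) once tag storage is reserved */
	return xa_err(xa_store(&pending_tags_by_pfn, page_to_pfn(page),
			       tags, GFP_KERNEL));
}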

KVM support has not been implemented yet, because a non-MTE-enabled VMA
can back the memory of an MTE-enabled VM. It will be added once there is
consensus on the right approach for the memory management support.

Overview of the patches
=======================

For people not interested in the arm64 details, you probably want to start
with patches 1-10, which mostly deal with adding the necessary hooks to the
memory management code, and patches 19 and 20, which add the page
fault-on-access mechanism for regular pages and huge pages, respectively.
Patch 21 is rather invasive: it moves the definition of struct
migration_target_control out of mm/internal.h to migrate.h, and the arm64
code also uses isolate_lru_page() and putback_movable_pages() when
migrating a tag storage page out of a PROT_MTE VMA. Finally, patch 26 is an
optimization for a performance regression reported with Chrome: it
introduces CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY to allow arm64 to use
take_page_off_buddy() to fast-track reserving tag storage when the page is
free.

The rest of the patches are mostly arm64 specific.

Patches 11-18 add support for detecting the tag storage region and for
reserving tag storage when a tagged page is allocated.

Patches 19-21 add the page fault-on-access mechanism and use it to reserve
tag storage when needed.

Patches 22 and 23 handle saving tags temporarily to an xarray if the page
doesn't have tag storage, and copying the tags over to the tagged page when
tag storage is reserved.

Changelog
=========

Changes since RFC v1 [1]:

* The entire series has been reworked to remove MIGRATE_METADATA and put tag
  storage pages on the MIGRATE_CMA freelists.

* Changed how tags are saved and restored when copying them from one page to
  another if the destination page doesn't have tag storage - now the tags are
  restored when tag storage is reserved for the destination page instead of
  restoring them in set_pte_at() -> mte_sync_tags().

[1] https://lore.kernel.org/lkml/20230823131350.114942-1-alexandru.elisei@arm.com/

Testing
=======

To enable MTE dynamic tag storage:

- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node (for the specification, see commit "arm64: mte: Reserve tag
  storage memory")

Check dmesg for the message "MTE tag storage region management enabled".
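
For example:

$ dmesg | grep 'MTE tag storage'
MTE tag storage region management enabled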

I've tested the series using FVP with MTE enabled, but without support for
dynamic tag storage reuse. To simulate it, I've added two fake tag storage
regions in the DTB by splitting the upper 2GB memory region into three: one
region for normal RAM, followed by the tag storage for the lower 2GB of
memory, then the tag storage for the normal RAM region, like below:

diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..8c719825a9b3 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,12 +165,30 @@ C1_L2: l2-cache1 {
                };
        };
 
-       memory@80000000 {
+       memory0: memory@80000000 {
                device_type = "memory";
-               reg = <0x00000000 0x80000000 0 0x80000000>,
-                     <0x00000008 0x80000000 0 0x80000000>;
+               reg = <0x00000000 0x80000000 0 0x80000000>;
        };
 
+       memory1: memory@880000000 {
+               device_type = "memory";
+               reg = <0x00000008 0x80000000 0 0x78000000>;
+       };
+
+       tags0: tag-storage@8f8000000 {
+                compatible = "arm,mte-tag-storage";
+                reg = <0x00000008 0xf8000000 0 0x4000000>;
+                block-size = <0x1000>;
+                memory = <&memory0>;
+        };
+
+        tags1: tag-storage@8fc000000 {
+                compatible = "arm,mte-tag-storage";
+                reg = <0x00000008 0xfc000000 0 0x3c00000>;
+                block-size = <0x1000>;
+                memory = <&memory1>;
+        };
+
        reserved-memory {
                #address-cells = <2>;
                #size-cells = <2>;



Alexandru Elisei (27):
  arm64: mte: Rework naming for tag manipulation functions
  arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
  mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
  mm: migrate/mempolicy: Add hook to modify migration target gfp
  mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
  mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA
    guard code"
  mm: Allow an arch to hook into folio allocation when VMA is known
  mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
  arm64: mte: Reserve tag storage memory
  arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  arm64: mte: Disable dynamic tag storage management if HW KASAN is
    enabled
  arm64: mte: Check that tag storage blocks are in the same zone
  arm64: mte: Manage tag storage on page allocation
  arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
  arm64: mte: Reserve tag block for the zero page
  mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  mm: hugepage: Handle huge page fault on access
  mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE)
  arm64: mte: swap: Handle tag restoring when missing tag storage
  arm64: mte: copypage: Handle tag restoring when missing tag storage
  arm64: mte: Handle fatal signal in reserve_tag_storage()
  KVM: arm64: Disable MTE if tag storage is enabled
  arm64: mte: Fast track reserving tag storage when the block is free
  arm64: mte: Enable dynamic tag storage reuse

 arch/arm64/Kconfig                       |  16 +
 arch/arm64/include/asm/assembler.h       |  10 +
 arch/arm64/include/asm/mte-def.h         |  16 +-
 arch/arm64/include/asm/mte.h             |  43 +-
 arch/arm64/include/asm/mte_tag_storage.h |  75 +++
 arch/arm64/include/asm/page.h            |   5 +-
 arch/arm64/include/asm/pgtable-prot.h    |   2 +
 arch/arm64/include/asm/pgtable.h         |  96 +++-
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/elfcore.c              |  14 +-
 arch/arm64/kernel/hibernate.c            |  46 +-
 arch/arm64/kernel/mte.c                  |  12 +-
 arch/arm64/kernel/mte_tag_storage.c      | 686 +++++++++++++++++++++++
 arch/arm64/kernel/setup.c                |   7 +
 arch/arm64/kvm/arm.c                     |   6 +-
 arch/arm64/lib/mte.S                     |  34 +-
 arch/arm64/mm/copypage.c                 |  59 ++
 arch/arm64/mm/fault.c                    | 261 ++++++++-
 arch/arm64/mm/mteswap.c                  | 162 +++++-
 fs/proc/page.c                           |   1 +
 include/linux/gfp_types.h                |  14 +-
 include/linux/huge_mm.h                  |   2 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/migrate.h                  |  12 +-
 include/linux/migrate_mode.h             |   1 +
 include/linux/mmzone.h                   |   5 +
 include/linux/page-flags.h               |  16 +-
 include/linux/pgtable.h                  |  54 ++
 include/trace/events/mmflags.h           |   5 +-
 mm/Kconfig                               |   7 +
 mm/cma.c                                 |   4 +-
 mm/huge_memory.c                         |   5 +-
 mm/internal.h                            |   9 -
 mm/memory-failure.c                      |   8 +-
 mm/memory.c                              |  10 +
 mm/mempolicy.c                           |   3 +
 mm/migrate.c                             |   3 +
 mm/page_alloc.c                          | 118 +++-
 mm/shmem.c                               |  14 +-
 mm/swapfile.c                            |   7 +
 40 files changed, 1668 insertions(+), 182 deletions(-)
 create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
 create mode 100644 arch/arm64/kernel/mte_tag_storage.c


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.42.1



* [PATCH RFC v2 01/27] arm64: mte: Rework naming for tag manipulation functions
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
@ 2023-11-19 16:56 ` Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 02/27] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED Alexandru Elisei
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

The tag save/restore/copy functions could be more explicit about where the
tags are coming from and where they are being copied to. Rename the
functions to make it easier to understand what they are doing:

- Rename the mte_clear_page_tags() 'addr' parameter to 'page_addr', to
  match the other functions that take a page address as parameter.

- Rename mte_save/restore_tags() to
  mte_save/restore_page_tags_by_swp_entry() to make it clear that they are
  saved in a collection indexed by swp_entry (this will become important
  when they will be also saved in a collection indexed by page pfn). Same
  applies to mte_invalidate_tags{,_area}_by_swp_entry().

- Rename mte_save/restore_page_tags() to make it clear where the tags are
  going to be saved, respectively from where they are restored - in a
  previously allocated memory buffer, not in an xarray, like when the tags
  are saved when swapping. Rename the action to 'copy' instead of
  'save'/'restore' to match the copy from user functions, which also copy
  tags to memory.

- Rename mte_allocate/free_tag_storage() to mte_allocate/free_tag_buf() to
  make it clear the functions have nothing to do with the memory where the
  corresponding tags for a page live. Change the parameter type for
  mte_free_tag_buf() to be void *, to match the return value of
  mte_allocate_tag_buf(). Also do that because that memory is opaque and it
  is not meant to be directly dereferenced.

In the name of consistency rename local variables from tag_storage to tags.
Give a similar treatment to the hibernation code that saves and restores
the tags for all tagged pages.

In the same spirit, rename MTE_PAGE_TAG_STORAGE to
MTE_PAGE_TAG_STORAGE_SIZE to make it clear that it relates to the size of
the memory needed to save the tags for a page. Opportunistically rename
MTE_TAG_SIZE to MTE_TAG_SIZE_BITS to make it clear it is measured in bits,
not bytes, like the rest of the size definitions in the same header file.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte-def.h | 16 +++++-----
 arch/arm64/include/asm/mte.h     | 23 +++++++++------
 arch/arm64/include/asm/pgtable.h |  8 ++---
 arch/arm64/kernel/elfcore.c      | 14 ++++-----
 arch/arm64/kernel/hibernate.c    | 46 ++++++++++++++---------------
 arch/arm64/lib/mte.S             | 18 ++++++------
 arch/arm64/mm/mteswap.c          | 50 ++++++++++++++++----------------
 7 files changed, 90 insertions(+), 85 deletions(-)

diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index 14ee86b019c2..eb0d76a6bdcf 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -5,14 +5,14 @@
 #ifndef __ASM_MTE_DEF_H
 #define __ASM_MTE_DEF_H
 
-#define MTE_GRANULE_SIZE	UL(16)
-#define MTE_GRANULE_MASK	(~(MTE_GRANULE_SIZE - 1))
-#define MTE_GRANULES_PER_PAGE	(PAGE_SIZE / MTE_GRANULE_SIZE)
-#define MTE_TAG_SHIFT		56
-#define MTE_TAG_SIZE		4
-#define MTE_TAG_MASK		GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
-#define MTE_PAGE_TAG_STORAGE	(MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE / 8)
+#define MTE_GRANULE_SIZE		UL(16)
+#define MTE_GRANULE_MASK		(~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE		(PAGE_SIZE / MTE_GRANULE_SIZE)
+#define MTE_TAG_SHIFT			56
+#define MTE_TAG_SIZE_BITS		4
+#define MTE_TAG_MASK		GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE_BITS - 1)), MTE_TAG_SHIFT)
+#define MTE_PAGE_TAG_STORAGE_SIZE	(MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE_BITS / 8)
 
-#define __MTE_PREAMBLE		ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
+#define __MTE_PREAMBLE			ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
 
 #endif /* __ASM_MTE_DEF_H  */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 91fbd5c8a391..8034695b3dd7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -18,19 +18,24 @@
 
 #include <asm/pgtable-types.h>
 
-void mte_clear_page_tags(void *addr);
+void mte_clear_page_tags(void *page_addr);
+
 unsigned long mte_copy_tags_from_user(void *to, const void __user *from,
 				      unsigned long n);
 unsigned long mte_copy_tags_to_user(void __user *to, void *from,
 				    unsigned long n);
-int mte_save_tags(struct page *page);
-void mte_save_page_tags(const void *page_addr, void *tag_storage);
-void mte_restore_tags(swp_entry_t entry, struct page *page);
-void mte_restore_page_tags(void *page_addr, const void *tag_storage);
-void mte_invalidate_tags(int type, pgoff_t offset);
-void mte_invalidate_tags_area(int type);
-void *mte_allocate_tag_storage(void);
-void mte_free_tag_storage(char *storage);
+
+int mte_save_page_tags_by_swp_entry(struct page *page);
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page);
+
+void mte_copy_page_tags_to_buf(const void *page_addr, void *to);
+void mte_copy_page_tags_from_buf(void *page_addr, const void *from);
+
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset);
+void mte_invalidate_tags_area_by_swp_entry(int type);
+
+void *mte_allocate_tag_buf(void);
+void mte_free_tag_buf(void *buf);
 
 #ifdef CONFIG_ARM64_MTE
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..9b32c74b4a1b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1039,7 +1039,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 static inline int arch_prepare_to_swap(struct page *page)
 {
 	if (system_supports_mte())
-		return mte_save_tags(page);
+		return mte_save_page_tags_by_swp_entry(page);
 	return 0;
 }
 
@@ -1047,20 +1047,20 @@ static inline int arch_prepare_to_swap(struct page *page)
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
 {
 	if (system_supports_mte())
-		mte_invalidate_tags(type, offset);
+		mte_invalidate_tags_by_swp_entry(type, offset);
 }
 
 static inline void arch_swap_invalidate_area(int type)
 {
 	if (system_supports_mte())
-		mte_invalidate_tags_area(type);
+		mte_invalidate_tags_area_by_swp_entry(type);
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
 	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
+		mte_restore_page_tags_by_swp_entry(entry, &folio->page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..e9ae00dacad8 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -17,7 +17,7 @@
 
 static unsigned long mte_vma_tag_dump_size(struct core_vma_metadata *m)
 {
-	return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE;
+	return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE_SIZE;
 }
 
 /* Derived from dump_user_range(); start/end must be page-aligned */
@@ -38,7 +38,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 		 * have been all zeros.
 		 */
 		if (!page) {
-			dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+			dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
 			continue;
 		}
 
@@ -48,12 +48,12 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 		 */
 		if (!page_mte_tagged(page)) {
 			put_page(page);
-			dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+			dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
 			continue;
 		}
 
 		if (!tags) {
-			tags = mte_allocate_tag_storage();
+			tags = mte_allocate_tag_buf();
 			if (!tags) {
 				put_page(page);
 				ret = 0;
@@ -61,16 +61,16 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 			}
 		}
 
-		mte_save_page_tags(page_address(page), tags);
+		mte_copy_page_tags_to_buf(page_address(page), tags);
 		put_page(page);
-		if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE)) {
+		if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE_SIZE)) {
 			ret = 0;
 			break;
 		}
 	}
 
 	if (tags)
-		mte_free_tag_storage(tags);
+		mte_free_tag_buf(tags);
 
 	return ret;
 }
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 02870beb271e..a3b0e7b32457 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -215,41 +215,41 @@ static int create_safe_exec_page(void *src_start, size_t length,
 
 #ifdef CONFIG_ARM64_MTE
 
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_pfn);
 
-static int save_tags(struct page *page, unsigned long pfn)
+static int save_page_tags_by_pfn(struct page *page, unsigned long pfn)
 {
-	void *tag_storage, *ret;
+	void *tags, *ret;
 
-	tag_storage = mte_allocate_tag_storage();
-	if (!tag_storage)
+	tags = mte_allocate_tag_buf();
+	if (!tags)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
+	mte_copy_page_tags_to_buf(page_address(page), tags);
 
-	ret = xa_store(&mte_pages, pfn, tag_storage, GFP_KERNEL);
+	ret = xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		mte_free_tag_buf(tags);
 		return xa_err(ret);
 	} else if (WARN(ret, "swsusp: %s: Duplicate entry", __func__)) {
-		mte_free_tag_storage(ret);
+		mte_free_tag_buf(ret);
 	}
 
 	return 0;
 }
 
-static void swsusp_mte_free_storage(void)
+static void swsusp_mte_free_tags(void)
 {
-	XA_STATE(xa_state, &mte_pages, 0);
+	XA_STATE(xa_state, &tags_by_pfn, 0);
 	void *tags;
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_pfn);
 	xas_for_each(&xa_state, tags, ULONG_MAX) {
-		mte_free_tag_storage(tags);
+		mte_free_tag_buf(tags);
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_pfn);
 
-	xa_destroy(&mte_pages);
+	xa_destroy(&tags_by_pfn);
 }
 
 static int swsusp_mte_save_tags(void)
@@ -273,9 +273,9 @@ static int swsusp_mte_save_tags(void)
 			if (!page_mte_tagged(page))
 				continue;
 
-			ret = save_tags(page, pfn);
+			ret = save_page_tags_by_pfn(page, pfn);
 			if (ret) {
-				swsusp_mte_free_storage();
+				swsusp_mte_free_tags();
 				goto out;
 			}
 
@@ -290,25 +290,25 @@ static int swsusp_mte_save_tags(void)
 
 static void swsusp_mte_restore_tags(void)
 {
-	XA_STATE(xa_state, &mte_pages, 0);
+	XA_STATE(xa_state, &tags_by_pfn, 0);
 	int n = 0;
 	void *tags;
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_pfn);
 	xas_for_each(&xa_state, tags, ULONG_MAX) {
 		unsigned long pfn = xa_state.xa_index;
 		struct page *page = pfn_to_online_page(pfn);
 
-		mte_restore_page_tags(page_address(page), tags);
+		mte_copy_page_tags_from_buf(page_address(page), tags);
 
-		mte_free_tag_storage(tags);
+		mte_free_tag_buf(tags);
 		n++;
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_pfn);
 
 	pr_info("Restored %d MTE pages\n", n);
 
-	xa_destroy(&mte_pages);
+	xa_destroy(&tags_by_pfn);
 }
 
 #else	/* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..9f623e9da09f 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -119,7 +119,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
 	cbz	x2, 2f
 1:
 	ldg	x4, [x1]
-	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
+	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE_BITS
 USER(2f, sttrb	w4, [x0])
 	add	x0, x0, #1
 	add	x1, x1, #MTE_GRANULE_SIZE
@@ -132,11 +132,11 @@ USER(2f, sttrb	w4, [x0])
 SYM_FUNC_END(mte_copy_tags_to_user)
 
 /*
- * Save the tags in a page
+ * Copy the tags in a page to a buffer
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_save_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_to_buf)
 	multitag_transfer_size x7, x5
 1:
 	mov	x2, #0
@@ -153,14 +153,14 @@ SYM_FUNC_START(mte_save_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_save_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_to_buf)
 
 /*
- * Restore the tags in a page
+ * Restore the tags in a page from a buffer
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_restore_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_from_buf)
 	multitag_transfer_size x7, x5
 1:
 	ldr	x2, [x1], #8
@@ -174,4 +174,4 @@ SYM_FUNC_START(mte_restore_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_restore_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_from_buf)
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..2a43746b803f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -7,79 +7,79 @@
 #include <linux/swapops.h>
 #include <asm/mte.h>
 
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_swp_entry);
 
-void *mte_allocate_tag_storage(void)
+void *mte_allocate_tag_buf(void)
 {
 	/* tags granule is 16 bytes, 2 tags stored per byte */
-	return kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+	return kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL);
 }
 
-void mte_free_tag_storage(char *storage)
+void mte_free_tag_buf(void *buf)
 {
-	kfree(storage);
+	kfree(buf);
 }
 
-int mte_save_tags(struct page *page)
+int mte_save_page_tags_by_swp_entry(struct page *page)
 {
-	void *tag_storage, *ret;
+	void *tags, *ret;
 
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
-	if (!tag_storage)
+	tags = mte_allocate_tag_buf();
+	if (!tags)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
+	mte_copy_page_tags_to_buf(page_address(page), tags);
 
 	/* lookup the swap entry.val from the page */
-	ret = xa_store(&mte_pages, page_swap_entry(page).val, tag_storage,
+	ret = xa_store(&tags_by_swp_entry, page_swap_entry(page).val, tags,
 		       GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		mte_free_tag_buf(tags);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		mte_free_tag_buf(ret);
 	}
 
 	return 0;
 }
 
-void mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
 {
-	void *tags = xa_load(&mte_pages, entry.val);
+	void *tags = xa_load(&tags_by_swp_entry, entry.val);
 
 	if (!tags)
 		return;
 
 	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
+		mte_copy_page_tags_from_buf(page_address(page), tags);
 		set_page_mte_tagged(page);
 	}
 }
 
-void mte_invalidate_tags(int type, pgoff_t offset)
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset)
 {
 	swp_entry_t entry = swp_entry(type, offset);
-	void *tags = xa_erase(&mte_pages, entry.val);
+	void *tags = xa_erase(&tags_by_swp_entry, entry.val);
 
-	mte_free_tag_storage(tags);
+	mte_free_tag_buf(tags);
 }
 
-void mte_invalidate_tags_area(int type)
+void mte_invalidate_tags_area_by_swp_entry(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
 	swp_entry_t last_entry = swp_entry(type + 1, 0);
 	void *tags;
 
-	XA_STATE(xa_state, &mte_pages, entry.val);
+	XA_STATE(xa_state, &tags_by_swp_entry, entry.val);
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_swp_entry);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
-		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		__xa_erase(&tags_by_swp_entry, xa_state.xa_index);
+		mte_free_tag_buf(tags);
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_swp_entry);
 }
-- 
2.42.1



* [PATCH RFC v2 02/27] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 01/27] arm64: mte: Rework naming for tag manipulation functions Alexandru Elisei
@ 2023-11-19 16:56 ` Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 03/27] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages Alexandru Elisei
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at
the same time as the physical frame is zeroed. The name can be slightly
misleading, because it doesn't mean that the code will zero the tags
unconditionally, but that the tags will be zeroed if and only if the
physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is
1).

Rename it to __GFP_TAGGED, in preparation for it to be used by the page
allocator to recognize when an allocation is tagged (has metadata).
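
For instance, with the rename, a caller that wants the tags cleared along
with the data would request something like:

	struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO |
					__GFP_TAGGED, 0);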

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/mm/fault.c          |  2 +-
 include/linux/gfp_types.h      | 14 +++++++-------
 include/trace/events/mmflags.h |  2 +-
 mm/page_alloc.c                |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 460d799e1296..daa91608d917 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -948,7 +948,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	 * separate DC ZVA and STGM.
 	 */
 	if (vma->vm_flags & VM_MTE)
-		flags |= __GFP_ZEROTAGS;
+		flags |= __GFP_TAGGED;
 
 	return vma_alloc_folio(flags, 0, vma, vaddr, false);
 }
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..37b9e265d77e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
-#define ___GFP_ZEROTAGS		0x800000u
+#define ___GFP_TAGGED		0x800000u
 #ifdef CONFIG_KASAN_HW_TAGS
 #define ___GFP_SKIP_ZERO	0x1000000u
 #define ___GFP_SKIP_KASAN	0x2000000u
@@ -226,11 +226,11 @@ typedef unsigned int __bitwise gfp_t;
  *
  * %__GFP_ZERO returns a zeroed page on success.
  *
- * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself
- * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that
- * __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting
- * memory tags at the same time as zeroing memory has minimal additional
- * performace impact.
+ * %__GFP_TAGGED marks the allocation as having tags, which will be zeroed at
+ * allocation time if the memory itself is being zeroed (either via __GFP_ZERO
+ * or via init_on_alloc, provided that __GFP_SKIP_ZERO is not set). This flag is
+ * intended for optimization: setting memory tags at the same time as zeroing
+ * memory has minimal additional performance impact.
  *
  * %__GFP_SKIP_KASAN makes KASAN skip unpoisoning on page allocation.
  * Used for userspace and vmalloc pages; the latter are unpoisoned by
@@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t;
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
-#define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TAGGED	((__force gfp_t)___GFP_TAGGED)
 #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
 #define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index d801409b33cf..6ca0d5ed46c0 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,7 @@
 	gfpflag_string(__GFP_RECLAIM),		\
 	gfpflag_string(__GFP_DIRECT_RECLAIM),	\
 	gfpflag_string(__GFP_KSWAPD_RECLAIM),	\
-	gfpflag_string(__GFP_ZEROTAGS)
+	gfpflag_string(__GFP_TAGGED)
 
 #ifdef CONFIG_KASAN_HW_TAGS
 #define __def_gfpflag_names_kasan ,			\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..770e585b77c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1483,7 +1483,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 {
 	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
 			!should_skip_init(gfp_flags);
-	bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+	bool zero_tags = init && (gfp_flags & __GFP_TAGGED);
 	int i;
 
 	set_page_private(page, 0);
-- 
2.42.1



* [PATCH RFC v2 03/27] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 01/27] arm64: mte: Rework naming for tag manipulation functions Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 02/27] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED Alexandru Elisei
@ 2023-11-19 16:56 ` Alexandru Elisei
  2023-11-19 16:56 ` [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp Alexandru Elisei
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

The CMA_ALLOC_SUCCESS and CMA_ALLOC_FAIL counters are increased by one
after each cma_alloc() function call, even though cma_alloc() can allocate
an arbitrary number of CMA pages. When looking at /proc/vmstat, the number
of successful (or failed) cma_alloc() calls doesn't say much about how many
CMA pages were allocated via cma_alloc() versus via the page allocator
(regular allocation request or PCP list refill).

This can also be rather confusing to a user who isn't familiar with the
code, since the unit of measurement for nr_free_cma is the number of pages,
but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
function calls.

Let's make this consistent, and arguably more useful, by having
CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
CMA_ALLOC_FAIL count the number of pages cma_alloc() failed to allocate.

For users that wish to track the number of cma_alloc() calls, there are
tracepoints for that already implemented.
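
With this change, both counters are expressed in pages, consistent with
nr_free_cma:

$ grep cma_alloc /proc/vmstat
cma_alloc_success <pages allocated>
cma_alloc_fail <pages that failed to allocate>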

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/cma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 2b2494fd6b59..2b74db5116d5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 	pr_debug("%s(): returned %p\n", __func__, page);
 out:
 	if (page) {
-		count_vm_event(CMA_ALLOC_SUCCESS);
+		count_vm_events(CMA_ALLOC_SUCCESS, count);
 		cma_sysfs_account_success_pages(cma, count);
 	} else {
-		count_vm_event(CMA_ALLOC_FAIL);
+		count_vm_events(CMA_ALLOC_FAIL, count);
 		if (cma)
 			cma_sysfs_account_fail_pages(cma, count);
 	}
-- 
2.42.1



* [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (2 preceding siblings ...)
  2023-11-19 16:56 ` [PATCH RFC v2 03/27] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages Alexandru Elisei
@ 2023-11-19 16:56 ` Alexandru Elisei
  2023-11-25 10:03   ` Mike Rapoport
  2023-11-19 16:56 ` [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail Alexandru Elisei
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

It might be desirable for an architecture to modify the gfp flags used to
allocate the destination page for migration based on the page that is
being replaced. For example, if an architecture has metadata associated
with a page (like arm64, when the memory tagging extension is implemented),
it can request that the destination page similarly have storage for tags
already allocated.

No functional change.
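
For illustration, an architecture could wire up the hook roughly like this
(folio_is_tagged() is a hypothetical predicate standing in for whatever
test the architecture needs):

#define arch_migration_target_gfp arch_migration_target_gfp
static inline gfp_t arch_migration_target_gfp(struct folio *src, gfp_t gfp)
{
	/* hypothetical predicate: does the source carry MTE tags? */
	if (folio_is_tagged(src))
		return __GFP_TAGGED;
	return 0;
}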

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/migrate.h | 4 ++++
 mm/mempolicy.c          | 2 ++
 mm/migrate.c            | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2ce13e8a309b..0acef592043c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -60,6 +60,10 @@ struct movable_operations {
 /* Defined in mm/debug.c: */
 extern const char *migrate_reason_names[MR_TYPES];
 
+#ifndef arch_migration_target_gfp
+#define arch_migration_target_gfp(src, gfp) 0
+#endif
+
 #ifdef CONFIG_MIGRATION
 
 void putback_movable_pages(struct list_head *l);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..50bc43ab50d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 
 		h = folio_hstate(src);
 		gfp = htlb_alloc_mask(h);
+		gfp |= arch_migration_target_gfp(src, gfp);
 		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
 		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
 	}
@@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 		gfp = GFP_TRANSHUGE;
 	else
 		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
+	gfp |= arch_migration_target_gfp(src, gfp);
 
 	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
 	return page_rmappable_folio(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 35a88334bb3c..dd25ab69e3de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2016,6 +2016,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
 		struct hstate *h = folio_hstate(src);
 
 		gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
+		gfp_mask |= arch_migration_target_gfp(src, gfp_mask);
 		return alloc_hugetlb_folio_nodemask(h, nid,
 						mtc->nmask, gfp_mask);
 	}
@@ -2032,6 +2033,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
 	zidx = zone_idx(folio_zone(src));
 	if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
 		gfp_mask |= __GFP_HIGHMEM;
+	gfp_mask |= arch_migration_target_gfp(src, gfp_mask);
 
 	return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
 }
@@ -2500,6 +2502,7 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 			__GFP_NOWARN;
 		gfp &= ~__GFP_RECLAIM;
 	}
+	gfp |= arch_migration_target_gfp(src, gfp);
 	return __folio_alloc_node(gfp, order, nid);
 }
 
-- 
2.42.1



* [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (3 preceding siblings ...)
  2023-11-19 16:56 ` [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp Alexandru Elisei
@ 2023-11-19 16:56 ` Alexandru Elisei
  2023-11-24 19:35   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare() Alexandru Elisei
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:56 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
storage for an allocated page. Reserving tag storage can fail, for example,
if the tag storage page has a short pin on it, so allow prep_new_page() ->
arch_prep_new_page() to similarly fail.

arch_alloc_page(), called from post_alloc_hook(), has been considered as an
alternative to adding yet another arch hook, but post_alloc_hook() cannot
fail, as it's also called when free pages are isolated.
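
As a sketch of the intended arm64 user (reserve_tag_storage() is added
later in the series; the exact signature used here is assumed):

#define __HAVE_ARCH_PREP_NEW_PAGE
static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
{
	if (gfp & __GFP_TAGGED)
		return reserve_tag_storage(page, order, gfp);	/* may fail */
	return 0;
}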

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/pgtable.h |  7 ++++
 mm/page_alloc.c         | 75 ++++++++++++++++++++++++++++++++---------
 2 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..b31f53e9ab1d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -873,6 +873,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PREP_NEW_PAGE
+static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_UNMAP_ONE
 /*
  * Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 770e585b77c8..b2782b778e78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1538,9 +1538,15 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	page_table_check_alloc(page, order);
 }
 
-static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
+	int ret;
+
+	ret = arch_prep_new_page(page, order, gfp_flags);
+	if (unlikely(ret))
+		return ret;
+
 	post_alloc_hook(page, order, gfp_flags);
 
 	if (order && (gfp_flags & __GFP_COMP))
@@ -1556,6 +1562,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 		set_page_pfmemalloc(page);
 	else
 		clear_page_pfmemalloc(page);
+
+	return 0;
 }
 
 /*
@@ -3163,6 +3171,24 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
 	return alloc_flags;
 }
 
+#ifdef HAVE_ARCH_ALLOC_PAGE
+static void return_page_to_buddy(struct page *page, int order)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int migratetype = get_pfnblock_migratetype(page, pfn);
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	__free_one_page(page, pfn, zone, order, migratetype, FPI_TO_TAIL);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+#else
+static void return_page_to_buddy(struct page *page, int order)
+{
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3309,7 +3335,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
-			prep_new_page(page, order, gfp_mask, alloc_flags);
+			if (prep_new_page(page, order, gfp_mask, alloc_flags)) {
+				return_page_to_buddy(page, order);
+				goto no_page;
+			}
 
 			/*
 			 * If this is a high-order atomic allocation then check
@@ -3319,20 +3348,20 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				reserve_highatomic_pageblock(page, zone);
 
 			return page;
-		} else {
-			if (has_unaccepted_memory()) {
-				if (try_to_accept_memory(zone, order))
-					goto try_this_zone;
-			}
+		}
+no_page:
+		if (has_unaccepted_memory()) {
+			if (try_to_accept_memory(zone, order))
+				goto try_this_zone;
+		}
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-			/* Try again if zone has deferred pages */
-			if (deferred_pages_enabled()) {
-				if (_deferred_grow_zone(zone, order))
-					goto try_this_zone;
-			}
-#endif
+		/* Try again if zone has deferred pages */
+		if (deferred_pages_enabled()) {
+			if (_deferred_grow_zone(zone, order))
+				goto try_this_zone;
 		}
+#endif
 	}
 
 	/*
@@ -3538,8 +3567,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	count_vm_event(COMPACTSTALL);
 
 	/* Prep a captured page if available */
-	if (page)
-		prep_new_page(page, order, gfp_mask, alloc_flags);
+	if (page) {
+		if (prep_new_page(page, order, gfp_mask, alloc_flags)) {
+			return_page_to_buddy(page, order);
+			page = NULL;
+		}
+	}
 
 	/* Try get a page from the freelist if available */
 	if (!page)
@@ -4490,9 +4523,18 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			}
 			break;
 		}
+
+		if (prep_new_page(page, 0, gfp, 0)) {
+			pcp_spin_unlock(pcp);
+			pcp_trylock_finish(UP_flags);
+			return_page_to_buddy(page, 0);
+			if (!nr_account)
+				goto failed;
+			else
+				goto out_statistics;
+		}
 		nr_account++;
 
-		prep_new_page(page, 0, gfp, 0);
 		if (page_list)
 			list_add(&page->lru, page_list);
 		else
@@ -4503,6 +4545,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 
+out_statistics:
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
 	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
 
-- 
2.42.1



* [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (4 preceding siblings ...)
  2023-11-19 16:56 ` [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-24 19:36   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 07/27] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations Alexandru Elisei
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Add an arch_free_pages_prepare() hook that is called before the page flags
are cleared. This will be used by arm64 when explicit management of tag
storage pages is enabled.
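
As a sketch of how arm64 might use it (free_tag_storage() is a
hypothetical helper; the real one arrives later in the series):

#define __HAVE_ARCH_FREE_PAGES_PREPARE
static inline void arch_free_pages_prepare(struct page *page, int order)
{
	/* release the tag storage backing a tagged page being freed */
	if (page_mte_tagged(page))
		free_tag_storage(page, order);	/* hypothetical helper */
}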

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/pgtable.h | 4 ++++
 mm/page_alloc.c         | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b31f53e9ab1d..3f34f00ced62 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -880,6 +880,10 @@ static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
 }
 #endif
 
+#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order) { }
+#endif
+
 #ifndef __HAVE_ARCH_UNMAP_ONE
 /*
  * Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2782b778e78..86e4b1dac538 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1086,6 +1086,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	trace_mm_page_free(page, order);
 	kmsan_free_page(page, order);
 
+	arch_free_pages_prepare(page, order);
+
 	if (unlikely(PageHWPoison(page)) && !order) {
 		/*
 		 * Do not let hwpoison pages hit pcplists/buddy
@@ -3171,7 +3173,7 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
 	return alloc_flags;
 }
 
-#ifdef HAVE_ARCH_ALLOC_PAGE
+#ifdef HAVE_ARCH_PREP_NEW_PAGE
 static void return_page_to_buddy(struct page *page, int order)
 {
 	int migratetype = get_pfnblock_migratetype(page, pfn);
-- 
2.42.1



* [PATCH RFC v2 07/27] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (5 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare() Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 08/27] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code" Alexandru Elisei
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

As an architecture might have specific requirements around the allocation
of CMA pages, add an arch hook that can disable allocations from
MIGRATE_CMA, if the allocation was otherwise allowed.

This will be used by arm64, which will put tag storage pages on the
MIGRATE_CMA list; these pages come with specific limitations.
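
Given the limitation mentioned in the cover letter (tag storage pages
cannot themselves hold tagged data), the arm64 side could be as simple as
this sketch:

#define __HAVE_ARCH_ALLOC_CMA
static inline bool arch_alloc_cma(gfp_t gfp)
{
	/* tag storage pages cannot be used for tagged allocations */
	return !(gfp & __GFP_TAGGED);
}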

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/pgtable.h | 7 +++++++
 mm/page_alloc.c         | 3 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3f34f00ced62..b7a9ab818f6d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -884,6 +884,13 @@ static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
 static inline void arch_free_pages_prepare(struct page *page, int order) { }
 #endif
 
+#ifndef __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp)
+{
+	return true;
+}
+#endif
+
 #ifndef __HAVE_ARCH_UNMAP_ONE
 /*
  * Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86e4b1dac538..0f508070c404 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3167,7 +3167,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
 						  unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
-	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
+	    arch_alloc_cma(gfp_mask))
 		alloc_flags |= ALLOC_CMA;
 #endif
 	return alloc_flags;
-- 
2.42.1



* [PATCH RFC v2 08/27] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (6 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 07/27] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 09/27] mm: Allow an arch to hook into folio allocation when VMA is known Alexandru Elisei
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
because CMA is always allowed when __GFP_MOVABLE is set.

With the introduction of the arch_alloc_cma() function, the above is not
true anymore, so bring back the filter.

This is a partial revert, because the stale comment remains removed.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/page_alloc.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f508070c404..135f9283a863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2907,10 +2907,17 @@ struct page *rmqueue(struct zone *preferred_zone,
 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 
 	if (likely(pcp_allowed_order(order))) {
-		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       migratetype, alloc_flags);
-		if (likely(page))
-			goto out;
+		/*
+		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
+		 * we need to skip it when CMA area isn't allowed.
+		 */
+		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+				migratetype != MIGRATE_MOVABLE) {
+			page = rmqueue_pcplist(preferred_zone, zone, order,
+					migratetype, alloc_flags);
+			if (likely(page))
+				goto out;
+		}
 	}
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
-- 
2.42.1



* [PATCH RFC v2 09/27] mm: Allow an arch to hook into folio allocation when VMA is known
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (7 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 08/27] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code" Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 10/27] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore() Alexandru Elisei
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

An architecture might want to fix up the gfp flags based on the type of
VMA where the page will be mapped.

On arm64, this is currently used if the VMA is MTE enabled. When
__GFP_TAGGED is set, for performance reasons, tag zeroing is performed at
the same time as the data is zeroed, instead of being performed separately,
in set_pte_at() -> mte_sync_tags().

Its usage will be expanded when the tag storage has to be explicitly
managed by the kernel.
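
As a sketch of the calling convention, where alloc_anon_folio_for() is a
hypothetical caller used only for illustration (the real call sites are in
the mempolicy.c and shmem.c hunks below):

        static struct folio *alloc_anon_folio_for(struct vm_area_struct *vma,
                                                  unsigned long addr, gfp_t gfp)
        {
                /* Fold in the arch flags, e.g. __GFP_TAGGED for a VM_MTE VMA. */
                gfp |= arch_calc_vma_gfp(vma, gfp);
                return vma_alloc_folio(gfp, 0, vma, addr, false);
        }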

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/page.h    |  5 ++---
 arch/arm64/include/asm/pgtable.h |  3 +++
 arch/arm64/mm/fault.c            | 19 ++++++-------------
 include/linux/pgtable.h          |  7 +++++++
 mm/mempolicy.c                   |  1 +
 mm/shmem.c                       |  5 ++++-
 6 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..c8125a28eaa2 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
-#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
+#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
 
 void tag_clear_highpage(struct page *to);
 #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9b32c74b4a1b..cd5dacd1be3a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1065,6 +1065,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 
 #endif /* CONFIG_ARM64_MTE */
 
+#define __HAVE_ARCH_CALC_VMA_GFP
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp);
+
 /*
  * On AArch64, the cache coherency is handled via the set_pte_at() function.
  */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index daa91608d917..acbc7530d2b2 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -935,22 +935,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
 NOKPROBE_SYMBOL(do_debug_exception);
 
 /*
- * Used during anonymous page fault handling.
+ * If this is called during anonymous page fault handling, and the page is
+ * mapped with PROT_MTE, initialise the tags at the point of page zeroing as this
+ * is usually faster than separate DC ZVA and STGM.
  */
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
 {
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
-
-	/*
-	 * If the page is mapped with PROT_MTE, initialise the tags at the
-	 * point of allocation and page zeroing as this is usually faster than
-	 * separate DC ZVA and STGM.
-	 */
 	if (vma->vm_flags & VM_MTE)
-		flags |= __GFP_TAGGED;
-
-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
+		return __GFP_TAGGED;
+	return 0;
 }
 
 void tag_clear_highpage(struct page *page)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b7a9ab818f6d..b1001ce361ac 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -873,6 +873,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
 }
 #endif
 
+#ifndef __HAVE_ARCH_CALC_VMA_GFP
+static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PREP_NEW_PAGE
 static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
 {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 50bc43ab50d6..cb170abae1fd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2170,6 +2170,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	pgoff_t ilx;
 	struct page *page;
 
+	gfp |= arch_calc_vma_gfp(vma, gfp);
 	pol = get_vma_policy(vma, addr, order, &ilx);
 	page = alloc_pages_mpol(gfp | __GFP_COMP, order,
 				pol, ilx, numa_node_id());
diff --git a/mm/shmem.c b/mm/shmem.c
index 91e2620148b2..71ce5fe5c779 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1570,7 +1570,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
  */
 static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 {
-	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED;
 	gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
 	gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
 	gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
@@ -2023,6 +2023,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		gfp_t huge_gfp;
 
 		huge_gfp = vma_thp_gfp_mask(vma);
+		huge_gfp |= arch_calc_vma_gfp(vma, huge_gfp);
 		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
 		folio = shmem_alloc_and_add_folio(huge_gfp,
 				inode, index, fault_mm, true);
@@ -2199,6 +2200,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	int err;
 
+	gfp |= arch_calc_vma_gfp(vmf->vma, gfp);
+
 	/*
 	 * Trinity finds that probing a hole which tmpfs is punching can
 	 * prevent the hole-punch from ever completing: noted in i_private.
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 10/27] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (8 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 09/27] mm: Allow an arch to hook into folio allocation when VMA is known Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory Alexandru Elisei
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

arm64 uses arch_swap_restore() to restore saved tags when a page is
swapped in, and it's called in atomic context (with the ptl held).

Introduce arch_swap_prepare_to_restore(), which allows an architecture to
perform extra work during swap-in, outside of the critical section.
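
A minimal sketch of an implementation, assuming the architecture has to
reserve metadata storage before the tags can be restored; the two arch_*
metadata helpers below are hypothetical, and the arm64 implementation is
added later in the series:

        #define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
        static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
                                                              struct folio *folio)
        {
                /* May sleep: called before the ptl is taken. */
                if (!arch_metadata_storage_reserved(folio) &&
                    arch_reserve_metadata_storage(folio))
                        return VM_FAULT_RETRY;
                return 0;
        }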

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/pgtable.h | 7 +++++++
 mm/memory.c             | 4 ++++
 mm/shmem.c              | 9 +++++++++
 mm/swapfile.c           | 7 +++++++
 4 files changed, 27 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b1001ce361ac..ffdb9b6bed6c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -938,6 +938,13 @@ static inline void arch_swap_invalidate_area(int type)
 }
 #endif
 
+#ifndef __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..e137f7673749 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3957,6 +3957,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
+	ret = arch_swap_prepare_to_restore(entry, folio);
+	if (ret)
+		goto out_page;
+
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
diff --git a/mm/shmem.c b/mm/shmem.c
index 71ce5fe5c779..0449c03dbdfd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1840,6 +1840,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	swp_entry_t swap;
+	vm_fault_t ret;
 	int error;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1888,6 +1889,14 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 	folio_wait_writeback(folio);
 
+	ret = arch_swap_prepare_to_restore(swap, folio);
+	if (ret) {
+		if (fault_type)
+			*fault_type = ret;
+		error = -EINVAL;
+		goto unlock;
+	}
+
 	/*
 	 * Some architectures may have to restore extra metadata to the
 	 * folio after reading from swap.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4bc70f459164..9983dffce47b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1746,6 +1746,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	pte_t *pte, new_pte, old_pte;
 	bool hwpoisoned = PageHWPoison(page);
+	vm_fault_t err;
 	int ret = 1;
 
 	swapcache = page;
@@ -1779,6 +1780,12 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		goto setpte;
 	}
 
+	err = arch_swap_prepare_to_restore(entry, page_folio(page));
+	if (err) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (9 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 10/27] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore() Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-29  8:44   ` Hyesoo Yu
  2023-12-11 17:29   ` Rob Herring
  2023-11-19 16:57 ` [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype Alexandru Elisei
                   ` (15 subsequent siblings)
  26 siblings, 2 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Allow the kernel to get the size and location of the MTE tag storage
regions from the DTB. This memory is marked as reserved for now.

The DTB node for the tag storage region is defined as:

        tags0: tag-storage@8f8000000 {
                compatible = "arm,mte-tag-storage";
                reg = <0x08 0xf8000000 0x00 0x4000000>;
                block-size = <0x1000>;
                memory = <&memory0>;	// Associated tagged memory node
        };

The tag storage region is the largest contiguous memory region that holds
all the tags for the associated contiguous memory region that can be
tagged. For example, for 32GB of contiguous tagged memory, the
corresponding tag storage region is 1GB of contiguous memory, not two
adjacent 512MB regions of tag storage.

"block-size" represents the minimum multiple of 4K of tag storage where all
the tags stored in the block correspond to a contiguous memory region. This
is needed for platforms where the memory controller interleaves tag writes
to memory. For example, if the memory controller interleaves tag writes for
256KB of contiguous memory across 8K of tag storage (2-way interleave),
then the correct value for "block-size" is 0x2000. This value is a hardware
property, independent of the selected kernel page size.
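
As a worked example of the conversion done by get_block_size_pages() in
the new file below, the block size in pages is lcm(block-size, PAGE_SIZE)
/ PAGE_SIZE:

        block-size = 0x2000 (8KB),  4KB pages: lcm(8KB, 4KB)  = 8KB  -> 2 pages
        block-size = 0x2000 (8KB), 16KB pages: lcm(8KB, 16KB) = 16KB -> 1 page
        block-size = 0x2000 (8KB), 64KB pages: lcm(8KB, 64KB) = 64KB -> 1 page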

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                       |  12 ++
 arch/arm64/include/asm/mte_tag_storage.h |  15 ++
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
 arch/arm64/kernel/setup.c                |   7 +
 5 files changed, 291 insertions(+)
 create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
 create mode 100644 arch/arm64/kernel/mte_tag_storage.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..fe8276fdc7a8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2062,6 +2062,18 @@ config ARM64_MTE
 
 	  Documentation/arch/arm64/memory-tagging-extension.rst.
 
+if ARM64_MTE
+config ARM64_MTE_TAG_STORAGE
+	bool "Dynamic MTE tag storage management"
+	help
+	  Adds support for dynamic management of the memory used by the hardware
+	  for storing MTE tags. This memory, unlike normal memory, cannot be
+	  tagged. When it is used to store tags for another memory location it
+	  cannot be used for any type of allocation.
+
	  If unsure, say N.
+endif # ARM64_MTE
+
 endmenu # "ARMv8.5 architectural features"
 
 menu "ARMv8.7 architectural features"
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
new file mode 100644
index 000000000000..8f86c4f9a7c3
--- /dev/null
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+#ifndef __ASM_MTE_TAG_STORAGE_H
+#define __ASM_MTE_TAG_STORAGE_H
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+void mte_tag_storage_init(void);
+#else
+static inline void mte_tag_storage_init(void)
+{
+}
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+#endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index d95b3d6b471a..5f031bf9f8f1 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
 obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
 obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
 obj-$(CONFIG_ARM64_MTE)			+= mte.o
+obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)	+= mte_tag_storage.o
 obj-y					+= vdso-wrap.o
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
 obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
new file mode 100644
index 000000000000..fa6267ef8392
--- /dev/null
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support for dynamic tag storage.
+ *
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/of_device.h>
+#include <linux/of_fdt.h>
+#include <linux/range.h>
+#include <linux/string.h>
+#include <linux/xarray.h>
+
+#include <asm/mte_tag_storage.h>
+
+struct tag_region {
+	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
+	struct range tag_range;	/* Tag storage memory, in PFNs. */
+	u32 block_size;		/* Tag block size, in pages. */
+};
+
+#define MAX_TAG_REGIONS	32
+
+static struct tag_region tag_regions[MAX_TAG_REGIONS];
+static int num_tag_regions;
+
+static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
+						int reg_len, struct range *range)
+{
+	int addr_cells = dt_root_addr_cells;
+	int size_cells = dt_root_size_cells;
+	u64 size;
+
+	if (reg_len / 4 > addr_cells + size_cells)
+		return -EINVAL;
+
+	range->start = PHYS_PFN(of_read_number(reg, addr_cells));
+	size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
+	if (size == 0) {
+		pr_err("Invalid node");
+		return -EINVAL;
+	}
+	range->end = range->start + size - 1;
+
+	return 0;
+}
+
+static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
+						    struct range *tag_range)
+{
+	const __be32 *reg;
+	int reg_len;
+
+	reg = of_get_flat_dt_prop(node, "reg", &reg_len);
+	if (reg == NULL) {
+		pr_err("Invalid metadata node");
+		return -EINVAL;
+	}
+
+	return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
+}
+
+static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
+{
+	const __be32 *reg;
+	int reg_len;
+
+	reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
+	if (reg == NULL)
+		reg = of_get_flat_dt_prop(node, "reg", &reg_len);
+
+	if (reg == NULL) {
+		pr_err("Invalid memory node");
+		return -EINVAL;
+	}
+
+	return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
+}
+
+struct find_memory_node_arg {
+	unsigned long node;
+	u32 phandle;
+};
+
+static int __init fdt_find_memory_node(unsigned long node, const char *uname,
+				       int depth, void *data)
+{
+	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+	struct find_memory_node_arg *arg = data;
+
+	if (depth != 1 || !type || strcmp(type, "memory") != 0)
+		return 0;
+
+	if (of_get_flat_dt_phandle(node) == arg->phandle) {
+		arg->node = node;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
+{
+	struct find_memory_node_arg arg = { 0 };
+	const __be32 *memory_prop;
+	u32 mem_phandle;
+	int ret, reg_len;
+
+	memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
+	if (!memory_prop) {
+		pr_err("Missing 'memory' property in the tag storage node");
+		return -EINVAL;
+	}
+
+	mem_phandle = be32_to_cpup(memory_prop);
+	arg.phandle = mem_phandle;
+
+	ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
+	if (ret != 1) {
+		pr_err("Associated memory node not found");
+		return -EINVAL;
+	}
+
+	*mem_node = arg.node;
+
+	return 0;
+}
+
+static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
+					       u32 *retval)
+{
+	const __be32 *reg;
+
+	reg = of_get_flat_dt_prop(node, propname, NULL);
+	if (!reg)
+		return -EINVAL;
+
+	*retval = be32_to_cpup(reg);
+	return 0;
+}
+
+static u32 __init get_block_size_pages(u32 block_size_bytes)
+{
+	u32 a = PAGE_SIZE;
+	u32 b = block_size_bytes;
+	u32 r;
+
+	/* Find greatest common divisor using the Euclidean algorithm. */
+	do {
+		r = a % b;
+		a = b;
+		b = r;
+	} while (b != 0);
+
+	return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
+}
+
+static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
+				       int depth, void *data)
+{
+	struct tag_region *region;
+	unsigned long mem_node;
+	struct range *mem_range;
+	struct range *tag_range;
+	u32 block_size_bytes;
+	u32 nid = 0;
+	int ret;
+
+	if (depth != 1 || !strstr(uname, "tag-storage"))
+		return 0;
+
+	if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
+		return 0;
+
+	if (num_tag_regions == MAX_TAG_REGIONS) {
+		pr_err("Maximum number of tag storage regions exceeded");
+		return -EINVAL;
+	}
+
+	region = &tag_regions[num_tag_regions];
+	mem_range = &region->mem_range;
+	tag_range = &region->tag_range;
+
+	ret = tag_storage_of_flat_get_tag_range(node, tag_range);
+	if (ret) {
+		pr_err("Invalid tag storage node");
+		return ret;
+	}
+
+	ret = tag_storage_get_memory_node(node, &mem_node);
+	if (ret)
+		return ret;
+
+	ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
+	if (ret) {
+		pr_err("Invalid address for associated data memory node");
+		return ret;
+	}
+
+	/* The tag region must exactly match the corresponding memory. */
+	if (range_len(tag_range) * 32 != range_len(mem_range)) {
+		pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
+		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
+		       PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
+		return -EINVAL;
+	}
+
+	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
+	if (ret || block_size_bytes == 0) {
+		pr_err("Invalid or missing 'block-size' property");
+		return -EINVAL;
+	}
+	region->block_size = get_block_size_pages(block_size_bytes);
+	if (range_len(tag_range) % region->block_size != 0) {
+		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
+		       PFN_PHYS(range_len(tag_range)), region->block_size);
+		return -EINVAL;
+	}
+
+	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
+	if (ret)
+		nid = numa_node_id();
+
+	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
+				nid, MEMBLOCK_NONE);
+	if (ret) {
+		pr_err("Error adding tag memblock (%d)", ret);
+		return ret;
+	}
+	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+
+	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
+		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
+
+	num_tag_regions++;
+
+	return 0;
+}
+
+void __init mte_tag_storage_init(void)
+{
+	struct range *tag_range;
+	int i, ret;
+
+	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
+	if (ret) {
+		for (i = 0; i < num_tag_regions; i++) {
+			tag_range = &tag_regions[i].tag_range;
+			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+		}
+		num_tag_regions = 0;
+		pr_info("MTE tag storage region management disabled");
+	}
+}
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..1b77138c1aa5 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -42,6 +42,7 @@
 #include <asm/cpufeature.h>
 #include <asm/cpu_ops.h>
 #include <asm/kasan.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/numa.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
@@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 			   FW_BUG "Booted with MMU enabled!");
 	}
 
+	/*
+	 * Must be called before memory limits are enforced by
+	 * arm64_memblock_init().
+	 */
+	mte_tag_storage_init();
+
 	arm64_memblock_init();
 
 	paging_init();
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (10 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-24 19:40   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK Alexandru Elisei
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
the page allocator to manage them like regular pages.

This migratetype lends the pages some very desirable properties:

* They cannot be longterm pinned, meaning they will always be migratable.

* The pages can be allocated explicitly by using their PFN (with
  alloc_contig_range()) when they are needed to store tags, as sketched
  below.
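
A minimal sketch of such an explicit allocation; illustrative only, the
actual reservation logic is added in a later patch:

        /* Take one tag storage block out of the page allocator. */
        static int reserve_block(unsigned long pfn, unsigned long nr_pages)
        {
                return alloc_contig_range(pfn, pfn + nr_pages,
                                          MIGRATE_CMA, GFP_KERNEL);
        }

        /* Give the block back when no tagged page uses it anymore. */
        static void unreserve_block(unsigned long pfn, unsigned long nr_pages)
        {
                free_contig_range(pfn, nr_pages);
        }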

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                  |  1 +
 arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
 include/linux/mmzone.h              |  5 +++
 mm/internal.h                       |  3 --
 4 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe8276fdc7a8..047487046e8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2065,6 +2065,7 @@ config ARM64_MTE
 if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
+	select CMA
 	help
 	  Adds support for dynamic management of the memory used by the hardware
 	  for storing MTE tags. This memory, unlike normal memory, cannot be
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index fa6267ef8392..427f4f1909f3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,10 +5,12 @@
  * Copyright (C) 2023 ARM Ltd.
  */
 
+#include <linux/cma.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/of_device.h>
 #include <linux/of_fdt.h>
+#include <linux/pageblock-flags.h>
 #include <linux/range.h>
 #include <linux/string.h>
 #include <linux/xarray.h>
@@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
 		return ret;
 	}
 
+	/* Pages are managed in pageblock_nr_pages chunks */
+	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
+		pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
+		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
+		       PFN_PHYS(pageblock_nr_pages));
+		return -EINVAL;
+	}
+
 	ret = tag_storage_get_memory_node(node, &mem_node);
 	if (ret)
 		return ret;
@@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
 		pr_info("MTE tag storage region management disabled");
 	}
 }
+
+static int __init mte_tag_storage_activate_regions(void)
+{
+	phys_addr_t dram_start, dram_end;
+	struct range *tag_range;
+	unsigned long pfn;
+	int i, ret;
+
+	if (num_tag_regions == 0)
+		return 0;
+
+	dram_start = memblock_start_of_DRAM();
+	dram_end = memblock_end_of_DRAM();
+
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		/*
+		 * Tag storage region was clipped by arm64_bootmem_init()
+		 * enforcing addressing limits.
+		 */
+		if (PFN_PHYS(tag_range->start) < dram_start ||
+				PFN_PHYS(tag_range->end) >= dram_end) {
+			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
+			       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+			ret = -EINVAL;
+			goto out_disabled;
+		}
+	}
+
+	/*
+	 * MTE disabled, tag storage pages can be used like any other pages. The
+	 * only restriction is that the pages cannot be used by kexec because
+	 * the memory remains marked as reserved in the memblock allocator.
+	 */
+	if (!system_supports_mte()) {
+		for (i = 0; i < num_tag_regions; i++) {
+			tag_range = &tag_regions[i].tag_range;
+			for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
+				free_reserved_page(pfn_to_page(pfn));
+		}
+		ret = 0;
+		goto out_disabled;
+	}
+
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
+			init_cma_reserved_pageblock(pfn_to_page(pfn));
+		totalcma_pages += range_len(tag_range);
+	}
+
+	return 0;
+
+out_disabled:
+	pr_info("MTE tag storage region management disabled");
+	return ret;
+}
+arch_initcall(mte_tag_storage_activate_regions);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3c25226beeed..15f81429e145 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -83,6 +83,11 @@ static inline bool is_migrate_movable(int mt)
 	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
 }
 
+#ifdef CONFIG_CMA
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void init_cma_reserved_pageblock(struct page *page);
+#endif
+
 /*
  * Check whether a migratetype can be merged with another migratetype.
  *
diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..ddf6bb6c6308 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -539,9 +539,6 @@ isolate_migratepages_range(struct compact_control *cc,
 int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end);
 
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void init_cma_reserved_pageblock(struct page *page);
-
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 
 int find_suitable_fallback(struct free_area *area, unsigned int order,
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (11 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-24 19:51   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled Alexandru Elisei
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Tag storage management requires that the tag storage pages used for data
are always migratable, so that they can be repurposed to store tags.

If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
memblocks to find a suitable location for copying the kernel image. The
kernel image, once loaded, cannot be moved to another location in physical
memory. The initialization code for the tag storage reserves the memblocks
for the tag storage pages, which means kexec will not use them, and the tag
storage pages can be migrated at any time, which is the desired behaviour.

However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
flag, which isn't currently set by the tag storage initialization code.

Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it
explicit that the Kconfig option is required for it to work correctly.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 047487046e8f..efa5b7958169 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2065,6 +2065,7 @@ config ARM64_MTE
 if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
+	depends on ARCH_KEEP_MEMBLOCK
 	select CMA
 	help
 	  Adds support for dynamic management of the memory used by the hardware
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (12 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-24 19:54   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Reserving the tag storage associated with a page requires that the tag
storage pages can be migrated.

When HW KASAN is enabled, the kernel allocates pages, which are now
tagged, in non-preemptible contexts, which can make reserving the
associated tag storage impossible.

Keep the tag storage pages reserved if HW KASAN is enabled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 427f4f1909f3..8b9bedf7575d 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -308,6 +308,19 @@ static int __init mte_tag_storage_activate_regions(void)
 		goto out_disabled;
 	}
 
+	/*
+	 * The kernel allocates memory in non-preemptible contexts, which makes
+	 * migration impossible when reserving the associated tag storage.
+	 *
+	 * The check is safe to make because KASAN HW tags are enabled before
+	 * the rest of the init functions are called, in smp_prepare_boot_cpu().
+	 */
+	if (kasan_hw_tags_enabled()) {
+		pr_info("KASAN HW tags incompatible with MTE tag storage management");
+		ret = 0;
+		goto out_disabled;
+	}
+
 	for (i = 0; i < num_tag_regions; i++) {
 		tag_range = &tag_regions[i].tag_range;
 		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (13 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-24 19:56   ` David Hildenbrand
  2023-11-29  8:57   ` Hyesoo Yu
  2023-11-19 16:57 ` [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation Alexandru Elisei
                   ` (11 subsequent siblings)
  26 siblings, 2 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

alloc_contig_range() requires that the requested pages are in the same
zone. Check that this is indeed the case before initializing the tag
storage blocks.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 8b9bedf7575d..fd63430d4dc0 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
 	}
 }
 
+/* alloc_contig_range() requires all pages to be in the same zone. */
+static int __init mte_tag_storage_check_zone(void)
+{
+	struct range *tag_range;
+	struct zone *zone;
+	unsigned long pfn;
+	u32 block_size;
+	int i, j;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		block_size = tag_regions[i].block_size;
+		if (block_size == 1)
+			continue;
+
+		tag_range = &tag_regions[i].tag_range;
+		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
+			zone = page_zone(pfn_to_page(pfn));
+			for (j = 1; j < block_size; j++) {
+				if (page_zone(pfn_to_page(pfn + j)) != zone) {
+					pr_err("Tag storage block pages in different zones");
+					return -EINVAL;
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
 static int __init mte_tag_storage_activate_regions(void)
 {
 	phys_addr_t dram_start, dram_end;
@@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
 		goto out_disabled;
 	}
 
+	ret = mte_tag_storage_check_zone();
+	if (ret)
+		goto out_disabled;
+
 	for (i = 0; i < num_tag_regions; i++) {
 		tag_range = &tag_regions[i].tag_range;
 		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (14 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-29  9:10   ` Hyesoo Yu
  2023-11-19 16:57 ` [PATCH RFC v2 17/27] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free Alexandru Elisei
                   ` (10 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Reserve tag storage for a tagged page by migrating the contents of the tag
storage (if in use for data) and removing the tag storage pages from the
page allocator by calling alloc_contig_range().

When all the associated tagged pages have been freed, return the tag
storage pages back to the page allocator, where they can be used again for
data allocations.

Tag storage pages cannot be tagged, so disallow allocations from
MIGRATE_CMA when the allocation is tagged.
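
Condensed, the life cycle looks as follows; this is a sketch that assumes
the hook points added earlier in the series are called from
post_alloc_hook() and free_pages_prepare():

        post_alloc_hook()
          -> arch_prep_new_page(page, order, gfp)
               -> reserve_tag_storage()      /* if gfp & __GFP_TAGGED */
                    -> alloc_contig_range()  /* one call per tag storage block */

        free_pages_prepare()
          -> arch_free_pages_prepare(page, order)
               -> free_tag_storage()         /* if page_mte_tagged(page) */
                    -> free_contig_range()   /* when the block refcount drops */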

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte.h             |  16 +-
 arch/arm64/include/asm/mte_tag_storage.h |  45 +++++
 arch/arm64/include/asm/pgtable.h         |  27 +++
 arch/arm64/kernel/mte_tag_storage.c      | 241 +++++++++++++++++++++++
 fs/proc/page.c                           |   1 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/page-flags.h               |   1 +
 include/trace/events/mmflags.h           |   3 +-
 mm/huge_memory.c                         |   1 +
 9 files changed, 333 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 8034695b3dd7..6457b7899207 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
 #ifdef CONFIG_ARM64_MTE
 
 /* track which pages have valid allocation tags */
-#define PG_mte_tagged	PG_arch_2
+#define PG_mte_tagged		PG_arch_2
 /* simple lock to avoid multiple threads tagging the same page */
-#define PG_mte_lock	PG_arch_3
+#define PG_mte_lock		PG_arch_3
+/* Track if a tagged page has tag storage reserved */
+#define PG_tag_storage_reserved	PG_arch_4
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+extern bool page_tag_storage_reserved(struct page *page);
+#endif
 
 static inline void set_page_mte_tagged(struct page *page)
 {
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+	/* Open code mte_tag_storage_enabled() */
+	WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
+		     !page_tag_storage_reserved(page));
+#endif
 	/*
 	 * Ensure that the tags written prior to this function are visible
 	 * before the page flags update.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 8f86c4f9a7c3..cab033b184ab 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -5,11 +5,56 @@
 #ifndef __ASM_MTE_TAG_STORAGE_H
 #define __ASM_MTE_TAG_STORAGE_H
 
+#ifndef __ASSEMBLY__
+
+#include <linux/mm_types.h>
+
+#include <asm/mte.h>
+
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
+static inline bool tag_storage_enabled(void)
+{
+	return static_branch_likely(&tag_storage_enabled_key);
+}
+
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+	return gfp & __GFP_TAGGED;
+}
+
 void mte_tag_storage_init(void);
+
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
+void free_tag_storage(struct page *page, int order);
+
+bool page_tag_storage_reserved(struct page *page);
 #else
+static inline bool tag_storage_enabled(void)
+{
+	return false;
+}
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+	return false;
+}
 static inline void mte_tag_storage_init(void)
 {
 }
+static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+	return 0;
+}
+static inline void free_tag_storage(struct page *page, int order)
+{
+}
+static inline bool page_tag_storage_reserved(struct page *page)
+{
+	return true;
+}
 #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
+#endif /* !__ASSEMBLY__ */
 #endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cd5dacd1be3a..20e8de853f5d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -10,6 +10,7 @@
 
 #include <asm/memory.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/pgtable-hwdef.h>
 #include <asm/pgtable-prot.h>
 #include <asm/tlbflush.h>
@@ -1063,6 +1064,32 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 		mte_restore_page_tags_by_swp_entry(entry, &folio->page);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define __HAVE_ARCH_PREP_NEW_PAGE
+static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
+{
+	if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
+		return reserve_tag_storage(page, order, gfp);
+	return 0;
+}
+
+#define __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order)
+{
+	if (tag_storage_enabled() && page_mte_tagged(page))
+		free_tag_storage(page, order);
+}
+
+#define __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp_mask)
+{
+	if (tag_storage_enabled() && alloc_requires_tag_storage(gfp_mask))
+		return false;
+	return true;
+}
+
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
 #endif /* CONFIG_ARM64_MTE */
 
 #define __HAVE_ARCH_CALC_VMA_GFP
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index fd63430d4dc0..9f8ef3116fc3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -11,12 +11,18 @@
 #include <linux/of_device.h>
 #include <linux/of_fdt.h>
 #include <linux/pageblock-flags.h>
+#include <linux/page-flags.h>
+#include <linux/page_owner.h>
 #include <linux/range.h>
+#include <linux/sched/mm.h>
 #include <linux/string.h>
+#include <linux/vm_event_item.h>
 #include <linux/xarray.h>
 
 #include <asm/mte_tag_storage.h>
 
+__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
 struct tag_region {
 	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
 	struct range tag_range;	/* Tag storage memory, in PFNs. */
@@ -28,6 +34,31 @@ struct tag_region {
 static struct tag_region tag_regions[MAX_TAG_REGIONS];
 static int num_tag_regions;
 
+/*
+ * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
+ * because alloc_contig_range() might sleep.
+ *
+ * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
+ * because pages can be freed from non-preemptible contexts, including from an
+ * interrupt handler.
+ *
+ * Because tag storage can be freed from interrupt contexts, the xarray is
+ * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
+ * xa_store(). This is done to prevent a deadlock with free_tag_storage() being
+ * called from an interrupt raised before xa_store() releases the xa_lock.
+ *
+ * All of the above means that reserve_tag_storage() cannot run concurrently
+ * with itself (no concurrent insertions), but it can run at the same time as
+ * free_tag_storage(). The first thing that reserve_tag_storage() does after
+ * taking the mutex is increase the refcount on all present tag storage blocks
+ * with the xa_lock held, to serialize against freeing the blocks. This is an
+ * optimization to avoid taking and releasing the xa_lock after each iteration
+ * if the refcount operation was moved inside the loop, where it would have had
+ * to be executed for each block.
+ */
+static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
+static DEFINE_MUTEX(tag_blocks_lock);
+
 static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
 						int reg_len, struct range *range)
 {
@@ -368,3 +399,213 @@ static int __init mte_tag_storage_activate_regions(void)
 	return ret;
 }
 arch_initcall(mte_tag_storage_activate_regions);
+
+static void page_set_tag_storage_reserved(struct page *page, int order)
+{
+	int i;
+
+	for (i = 0; i < (1 << order); i++)
+		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
+}
+
+static void block_ref_add(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
+	count = min(1u << order, 32 * region->block_size);
+	page_ref_add(pfn_to_page(block), count);
+}
+
+static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
+	count = min(1u << order, 32 * region->block_size);
+	return page_ref_sub_return(pfn_to_page(block), count);
+}
+
+static bool tag_storage_block_is_reserved(unsigned long block)
+{
+	return xa_load(&tag_blocks_reserved, block) != NULL;
+}
+
+static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
+{
+	int ret;
+
+	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
+	if (!ret)
+		block_ref_add(block, region, order);
+
+	return ret;
+}
+
+static int order_to_num_blocks(int order)
+{
+	return max((1 << order) / 32, 1);
+}
+
+static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
+					    struct tag_region *region)
+{
+	struct range *tag_range = &region->tag_range;
+	struct range *mem_range = &region->mem_range;
+	u64 page_pfn = page_to_pfn(page);
+	u64 block, block_offset;
+
+	if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
+		return -ERANGE;
+
+	block_offset = (page_pfn - mem_range->start) / 32;
+	block = tag_range->start + rounddown(block_offset, region->block_size);
+
+	if (block + region->block_size - 1 > tag_range->end) {
+		pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
+			PFN_PHYS(block), PFN_PHYS(block + region->block_size),
+			PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+		return -ERANGE;
+	}
+	*blockp = block;
+
+	return 0;
+
+}
+
+static int tag_storage_find_block(struct page *page, unsigned long *block,
+				  struct tag_region **region)
+{
+	int i, ret;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
+		if (ret == 0) {
+			*region = &tag_regions[i];
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+bool page_tag_storage_reserved(struct page *page)
+{
+	return test_bit(PG_tag_storage_reserved, &page->flags);
+}
+
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+	unsigned long start_block, end_block;
+	struct tag_region *region;
+	unsigned long block;
+	unsigned long flags;
+	unsigned int tries;
+	int ret = 0;
+
+	VM_WARN_ON_ONCE(!preemptible());
+
+	if (page_tag_storage_reserved(page))
+		return 0;
+
+	/*
+	 * __alloc_contig_migrate_range() ignores gfp when allocating the
+	 * destination page for migration. Regardless, massage gfp flags and
+	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
+	 * ignored.
+	 */
+	gfp &= ~__GFP_TAGGED;
+	if (!(gfp & __GFP_NORETRY))
+		gfp |= __GFP_RETRY_MAYFAIL;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return 0;
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	mutex_lock(&tag_blocks_lock);
+
+	/* Check again, this time with the lock held. */
+	if (page_tag_storage_reserved(page))
+		goto out_unlock;
+
+	/* Make sure existing entries are not freed from under our feet. */
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block))
+			block_ref_add(block, region, order);
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	for (block = start_block; block < end_block; block += region->block_size) {
+		/* Refcount incremented above. */
+		if (tag_storage_block_is_reserved(block))
+			continue;
+
+		tries = 3;
+		while (tries--) {
+			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
+			if (ret == 0 || ret != -EBUSY)
+				break;
+		}
+
+		if (ret)
+			goto out_error;
+
+		ret = tag_storage_reserve_block(block, region, order);
+		if (ret) {
+			free_contig_range(block, region->block_size);
+			goto out_error;
+		}
+
+		count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
+	}
+
+	page_set_tag_storage_reserved(page, order);
+out_unlock:
+	mutex_unlock(&tag_blocks_lock);
+
+	return 0;
+
+out_error:
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block) &&
+		    block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	mutex_unlock(&tag_blocks_lock);
+
+	count_vm_events(CMA_ALLOC_FAIL, region->block_size);
+
+	return ret;
+}
+
+void free_tag_storage(struct page *page, int order)
+{
+	unsigned long block, start_block, end_block;
+	struct tag_region *region;
+	unsigned long flags;
+	int ret;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return;
+
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
+		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
+			continue;
+
+		if (block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+}
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..e7eb584a9234 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
 	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
+	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
 #endif
 
 	return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index 859f4b0c1b2b..4a0d719ffdd4 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -19,5 +19,6 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
 #define KPF_ARCH_3		42
+#define KPF_ARCH_4		43
 
 #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a88e64acebfe..7915165a51bd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,7 @@ enum pageflags {
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	PG_arch_2,
 	PG_arch_3,
+	PG_arch_4,
 #endif
 	__NR_PAGEFLAGS,
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 6ca0d5ed46c0..ba962fd10a2c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
 IF_HAVE_PG_IDLE(young)							\
 IF_HAVE_PG_ARCH_X(arch_2)						\
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3)						\
+IF_HAVE_PG_ARCH_X(arch_4)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..9beead961a65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2474,6 +2474,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 			 (1L << PG_arch_2) |
 			 (1L << PG_arch_3) |
+			 (1L << PG_arch_4) |
 #endif
 			 (1L << PG_dirty) |
 			 LRU_GEN_MASK | LRU_REFS_MASK));
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 17/27] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (15 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page Alexandru Elisei
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Make sure the contents of the tag storage block are not corrupted by
performing the following (the combined sequence is sketched after the
list):

1. A tag dcache inval when the associated tagged pages are freed, to avoid
   dirty tag cache lines being evicted and corrupting the tag storage
   block when it's being used to store data.

2. A data cache inval when the tag storage block is being reserved, to
   ensure that no dirty data cache lines are present, which would
   trigger a writeback that could corrupt the tags stored in the block.
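
Put together, the intended sequence is the following sketch, with
block_bytes and page_bytes as shorthands for the block and page sizes in
bytes (the real call sites are in the mte_tag_storage.c hunks below):

        reserve_tag_storage():
                /* (2) drop dirty data cache lines covering the block */
                dcache_inval_poc(block_va, block_va + block_bytes);
                /* block can now safely hold tags */

        free_tag_storage():
                /* (1) drop dirty tag cache lines for the freed pages */
                dcache_inval_tags_poc(page_va, page_va + page_bytes);
                /* block can now safely be reused for data */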

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/assembler.h       | 10 ++++++++++
 arch/arm64/include/asm/mte_tag_storage.h |  2 ++
 arch/arm64/kernel/mte_tag_storage.c      | 11 +++++++++++
 arch/arm64/lib/mte.S                     | 16 ++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 376a980f2bad..8d41c8cfdc69 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -310,6 +310,16 @@ alternative_cb_end
 	lsl		\reg, \reg, \tmp	// actual cache line size
 	.endm
 
+/*
+ * tcache_line_size - get the safe tag cache line size across all CPUs
+ */
+	.macro	tcache_line_size, reg, tmp
+	read_ctr	\tmp
+	ubfm		\tmp, \tmp, #32, #37	// tag cache line size encoding
+	mov		\reg, #4		// bytes per word
+	lsl		\reg, \reg, \tmp	// actual tag cache line size
+	.endm
+
 /*
  * raw_icache_line_size - get the minimum I-cache line size on this CPU
  * from the CTR register.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index cab033b184ab..6e5d28e607bb 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -11,6 +11,8 @@
 
 #include <asm/mte.h>
 
+extern void dcache_inval_tags_poc(unsigned long start, unsigned long end);
+
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
 
 DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 9f8ef3116fc3..833480048170 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -19,6 +19,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/xarray.h>
 
+#include <asm/cacheflush.h>
 #include <asm/mte_tag_storage.h>
 
 __ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
@@ -431,8 +432,13 @@ static bool tag_storage_block_is_reserved(unsigned long block)
 
 static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
 {
+	unsigned long block_va;
 	int ret;
 
+	block_va = (unsigned long)page_to_virt(pfn_to_page(block));
+	/* Avoid writeback of dirty data cache lines corrupting tags. */
+	dcache_inval_poc(block_va, block_va + region->block_size * PAGE_SIZE);
+
 	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
 	if (!ret)
 		block_ref_add(block, region, order);
@@ -587,6 +593,7 @@ void free_tag_storage(struct page *page, int order)
 {
 	unsigned long block, start_block, end_block;
 	struct tag_region *region;
+	unsigned long page_va;
 	unsigned long flags;
 	int ret;
 
@@ -594,6 +601,10 @@ void free_tag_storage(struct page *page, int order)
 	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
 		return;
 
+	page_va = (unsigned long)page_to_virt(page);
+	/* Avoid writeback of dirty tag cache lines corrupting data. */
+	dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+
 	end_block = start_block + order_to_num_blocks(order) * region->block_size;
 
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 9f623e9da09f..bc02b4e95062 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -175,3 +175,19 @@ SYM_FUNC_START(mte_copy_page_tags_from_buf)
 
 	ret
 SYM_FUNC_END(mte_copy_page_tags_from_buf)
+
+/*
+ *	dcache_inval_tags_poc(start, end)
+ *
+ *	Ensure that any tags in the D-cache for the interval [start, end)
+ *	are invalidated to PoC.
+ *
+ *	- start   - virtual start address of region
+ *	- end     - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_tags_poc)
+	tcache_line_size x2, x3
+	dcache_by_myline_op igvac, sy, x0, x1, x2, x3
+	ret
+SYM_FUNC_END(__pi_dcache_inval_tags_poc)
+SYM_FUNC_ALIAS(dcache_inval_tags_poc, __pi_dcache_inval_tags_poc)
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (16 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 17/27] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-28 17:06   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE) Alexandru Elisei
                   ` (8 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On arm64, the zero page receives special treatment: the tagged flag is set
when MTE is initialized, not when the page is mapped into a process
address space. Reserve the corresponding tag block when tag storage
management is activated.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 833480048170..a1cc239f7211 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -393,6 +393,8 @@ static int __init mte_tag_storage_activate_regions(void)
 		totalcma_pages += range_len(tag_range);
 	}
 
+	reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+
 	return 0;
 
 out_disabled:
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (17 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-28 17:55   ` David Hildenbrand
  2023-11-29  9:27   ` Hyesoo Yu
  2023-11-19 16:57 ` [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access Alexandru Elisei
                   ` (7 subsequent siblings)
  26 siblings, 2 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA don't have the
associated tag storage block reserved, so mark the PTEs as
PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
reserve the tag storage on the fault path.

This has several benefits over reserving the tag storage as part of the
mprotect() call handling:

- Tag storage is reserved only for those pages in the VMA that are
  actually accessed, instead of for all the pages already mapped in the VMA.
- The latency of the mprotect() call is reduced.
- Races with page migration are eliminated.

All of this comes at the expense of an extra page fault per page, until all
the accessed pages have their corresponding tag storage reserved.

For arm64, the PAGE_FAULT_ON_ACCESS protection is implemented by defining a
new page table entry software bit, PTE_TAG_STORAGE_NONE, in bit 60. This bit
is free to use because Linux doesn't set any of the PBHA bits in entries
from the last level of the translation table and doesn't use the
TCR_ELx.HWUxx bits; the first PBHA bit, bit 59, is already taken as a
software bit, for PMD_PRESENT_INVALID.

This is only implemented for PTE mappings; PMD mappings will follow.
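
For context, the userspace flow is unchanged; enabling tagging on an
existing mapping is still a single mprotect() call. A minimal sketch (the
helper name, buf and len are illustrative; PROT_MTE comes from the arm64
UAPI headers):

#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE	0x20	/* arm64 UAPI value, shown for illustration */
#endif

static int enable_tagging(void *buf, size_t len)
{
	/* PTEs already mapped in [buf, buf + len) become
	 * PAGE_FAULT_ON_ACCESS; tag storage is reserved lazily, on the
	 * first access to each page. */
	return mprotect(buf, len, PROT_READ | PROT_WRITE | PROT_MTE);
}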

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                       |   1 +
 arch/arm64/include/asm/mte.h             |   4 +-
 arch/arm64/include/asm/mte_tag_storage.h |   2 +
 arch/arm64/include/asm/pgtable-prot.h    |   2 +
 arch/arm64/include/asm/pgtable.h         |  40 ++++++---
 arch/arm64/kernel/mte.c                  |  12 ++-
 arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
 include/linux/pgtable.h                  |  17 ++++
 mm/Kconfig                               |   3 +
 mm/memory.c                              |   3 +
 10 files changed, 170 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index efa5b7958169..3b9c435eaafb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2066,6 +2066,7 @@ if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
 	depends on ARCH_KEEP_MEMBLOCK
+	select ARCH_HAS_FAULT_ON_ACCESS
+	select CMA
 	help
 	  Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 6457b7899207..70dc2e409070 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 }
 
 void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t pte, unsigned int nr_pages);
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
-static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6e5d28e607bb..c70ced60a0cd 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
 void free_tag_storage(struct page *page, int order);
 
 bool page_tag_storage_reserved(struct page *page);
+
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index e9624f6326dd..85ebb3e352ad 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
 
 /*
  * This bit indicates that the entry is present i.e. pmd_page()
@@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
 	 })
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
 #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
 #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 20e8de853f5d..8cc135f1c112 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 		     __func__, pte_val(old_pte), pte_val(pte));
 }
 
-static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
+static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
 {
-	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
-		__sync_icache_dcache(pte);
+	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
+		__sync_icache_dcache(*pteval);
 
 	/*
 	 * If the PTE would provide user space access to the tags associated
@@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 	 * pte_access_permitted() returns false for exec only mappings, they
 	 * don't expose tags (instruction fetches don't check tags).
 	 */
-	if (system_supports_mte() && pte_access_permitted(pte, false) &&
-	    !pte_special(pte) && pte_tagged(pte))
-		mte_sync_tags(pte, nr_pages);
+	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
+	    !pte_special(*pteval) && pte_tagged(*pteval))
+		mte_sync_tags(pteval, nr_pages);
 }
 
 static inline void set_ptes(struct mm_struct *mm,
@@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
 			    pte_t *ptep, pte_t pte, unsigned int nr)
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
-	__sync_cache_and_tags(pte, nr);
+	__sync_cache_and_tags(&pte, nr);
 
 	for (;;) {
 		__check_safe_pte_update(mm, ptep, pte);
@@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
+		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+	return fault_on_access_pte(pmd_pte(pmd));
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+	if (tag_storage_enabled())
+		return handle_page_missing_tag_storage(vmf);
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
+
 #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
 
 static inline int pmd_present(pmd_t pmd)
@@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
 				unsigned long __always_unused addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	__sync_cache_and_tags(pte, nr);
+	__sync_cache_and_tags(&pte, nr);
 	__check_safe_pte_update(mm, ptep, pte);
 	set_pte(ptep, pte);
 }
@@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
 	 */
 	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
-			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
-			      PTE_ATTRINDX_MASK;
+			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
+			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
 	/* preserve the hardware dirty information */
 	if (pte_hw_dirty(pte))
 		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..5962bab1d549 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -21,6 +21,7 @@
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/ptrace.h>
 #include <asm/sysreg.h>
 
@@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 #endif
 
-void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
 {
-	struct page *page = pte_page(pte);
+	struct page *page = pte_page(*pteval);
 	unsigned int i;
 
-	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
+		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
+			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
+			continue;
+		}
+
+		/* if PG_mte_tagged is set, tags have already been initialised */
 		if (try_page_mte_tagging(page)) {
 			mte_clear_page_tags(page_address(page));
 			set_page_mte_tagged(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index acbc7530d2b2..f5fa583acf18 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -19,6 +19,7 @@
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
 #include <linux/page-flags.h>
+#include <linux/page-isolation.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
@@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
 	mte_zero_clear_page_tags(page_address(page));
 	set_page_mte_tagged(page);
 }
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	pte_t new_pte, old_pte;
+	bool writable = false;
+	vm_fault_t err;
+	int ret;
+
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+	/* Get the normal PTE  */
+	old_pte = ptep_get(vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PTE could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pte_write(new_pte);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pte_writable(vma, vmf->address, new_pte))
+		writable = true;
+
+	page = vm_normal_page(vma, vmf->address, new_pte);
+	if (!page || is_zone_device_page(page))
+		goto out_map;
+
+	/*
+	 * This should never happen: once a VMA has been marked as tagged,
+	 * that cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	/*
+	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
+	 * and spin_lock() above which takes the ptl lock. Both locks should be
+	 * balanced after this point.
+	 */
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/*
+	 * The page is probably being isolated for migration; replay the fault
+	 * to give the entry time to be replaced by a migration pte.
+	 */
+	if (unlikely(is_migrate_isolate_page(page)))
+		goto out_retry;
+
+	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+	if (ret)
+		goto out_retry;
+
+	put_page(page);
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	/*
+	 * Make it present again, depending on how arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+	new_pte = pte_mkyoung(new_pte);
+	if (writable)
+		new_pte = pte_mkwrite(new_pte, vma);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	return 0;
+
+out_retry:
+	put_page(page);
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		vma_end_read(vma);
+	if (fault_flag_allow_retry_first(vmf->flags)) {
+		err = VM_FAULT_RETRY;
+	} else {
+		/* Replay the fault. */
+		err = 0;
+	}
+	return err;
+}
+#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ffdb9b6bed6c..e2c761dd6c41 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+	return false;
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+	return false;
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..a90eefc3ee80 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
 config ARCH_HAS_CACHE_LINE_SIZE
 	bool
 
+config ARCH_HAS_FAULT_ON_ACCESS
+	bool
+
 config ARCH_HAS_CURRENT_STACK_POINTER
 	bool
 	help
diff --git a/mm/memory.c b/mm/memory.c
index e137f7673749..a04a971200b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
+	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+		return arch_do_page_fault_on_access(vmf);
+
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (18 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE) Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-22  1:28   ` Peter Collingbourne
  2023-11-28 17:56   ` David Hildenbrand
  2023-11-19 16:57 ` [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE) Alexandru Elisei
                   ` (6 subsequent siblings)
  26 siblings, 2 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
regular pages.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte_tag_storage.h |  1 +
 arch/arm64/include/asm/pgtable.h         |  7 ++
 arch/arm64/mm/fault.c                    | 81 ++++++++++++++++++++++++
 include/linux/huge_mm.h                  |  2 +
 include/linux/pgtable.h                  |  5 ++
 mm/huge_memory.c                         |  4 +-
 mm/memory.c                              |  3 +
 7 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index c70ced60a0cd..b97406d369ce 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
 bool page_tag_storage_reserved(struct page *page);
 
 vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
+vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8cc135f1c112..1704411c096d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
 		return handle_page_missing_tag_storage(vmf);
 	return VM_FAULT_SIGBUS;
 }
+
+static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
+{
+	if (tag_storage_enabled())
+		return handle_huge_page_missing_tag_storage(vmf);
+	return VM_FAULT_SIGBUS;
+}
 #endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
 
 #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f5fa583acf18..6730a0812a24 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
 
 	return 0;
 
+out_retry:
+	put_page(page);
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		vma_end_read(vma);
+	if (fault_flag_allow_retry_first(vmf->flags)) {
+		err = VM_FAULT_RETRY;
+	} else {
+		/* Replay the fault. */
+		err = 0;
+	}
+	return err;
+}
+
+vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
+{
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
+	pmd_t old_pmd, new_pmd;
+	bool writable = false;
+	struct page *page;
+	vm_fault_t err;
+	int ret;
+
+	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+	old_pmd = vmf->orig_pmd;
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PMD could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pmd_write(new_pmd);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pmd_writable(vma, vmf->address, new_pmd))
+		writable = true;
+
+	page = vm_normal_page_pmd(vma, haddr, new_pmd);
+	if (!page)
+		goto out_map;
+
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	spin_unlock(vmf->ptl);
+	writable = false;
+
+	if (unlikely(is_migrate_isolate_page(page)))
+		goto out_retry;
+
+	ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
+	if (ret)
+		goto out_retry;
+
+	put_page(page);
+
+	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	/* Restore the PMD */
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+	new_pmd = pmd_mkyoung(new_pmd);
+	if (writable)
+		new_pmd = pmd_mkwrite(new_pmd, vma);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+
+	return 0;
+
 out_retry:
 	put_page(page);
 	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..bb84291f9231 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
 		    unsigned long cp_flags);
+bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pmd_t pmd);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2c761dd6c41..de45f475bf8d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
 {
 	return VM_FAULT_SIGBUS;
 }
+
+static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
 #endif
 
 #endif /* CONFIG_MMU */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9beead961a65..d1402b43ea39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
-					   unsigned long addr, pmd_t pmd)
+inline bool can_change_pmd_writable(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t pmd)
 {
 	struct page *page;
 
diff --git a/mm/memory.c b/mm/memory.c
index a04a971200b9..46b926625503 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			return 0;
 		}
 		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+			if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
+				return arch_do_huge_page_fault_on_access(&vmf);
+
 			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf);
 
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE)
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (19 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-28  5:39   ` Peter Collingbourne
  2023-11-19 16:57 ` [PATCH RFC v2 22/27] arm64: mte: swap: Handle tag restoring when missing tag storage Alexandru Elisei
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte_tag_storage.h |  1 +
 arch/arm64/kernel/mte_tag_storage.c      | 15 +++++++
 arch/arm64/mm/fault.c                    | 55 ++++++++++++++++++++++++
 include/linux/migrate.h                  |  8 +++-
 include/linux/migrate_mode.h             |  1 +
 mm/internal.h                            |  6 ---
 6 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index b97406d369ce..6a8b19a6a758 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
 void free_tag_storage(struct page *page, int order);
 
 bool page_tag_storage_reserved(struct page *page);
+bool page_is_tag_storage(struct page *page);
 
 vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
 vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index a1cc239f7211..5096ce859136 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
 	return test_bit(PG_tag_storage_reserved, &page->flags);
 }
 
+bool page_is_tag_storage(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct range *tag_range;
+	int i;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		if (tag_range->start <= pfn && pfn <= tag_range->end)
+			return true;
+	}
+
+	return false;
+}
+
 int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 {
 	unsigned long start_block, end_block;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6730a0812a24..964c5ae161a3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -12,6 +12,7 @@
 #include <linux/extable.h>
 #include <linux/kfence.h>
 #include <linux/signal.h>
+#include <linux/migrate.h>
 #include <linux/mm.h>
 #include <linux/hardirq.h>
 #include <linux/init.h>
@@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
 }
 
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define MR_TAGGED_TAG_STORAGE	MR_ARCH_1
+
+extern bool isolate_lru_page(struct page *page);
+extern void putback_movable_pages(struct list_head *l);
+
+/* Returns with the page reference dropped. */
+static void migrate_tag_storage_page(struct page *page)
+{
+	struct migration_target_control mtc = {
+		.nid = NUMA_NO_NODE,
+		.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+	};
+	unsigned long i, nr_pages = compound_nr(page);
+	LIST_HEAD(pagelist);
+	int ret, tries;
+
+	lru_cache_disable();
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!isolate_lru_page(page + i)) {
+			ret = -EAGAIN;
+			goto out;
+		}
+		/* Isolate just grabbed another reference, drop ours. */
+		put_page(page + i);
+		list_add_tail(&(page + i)->lru, &pagelist);
+	}
+
+	tries = 5;
+	while (tries--) {
+		ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
+				    MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
+		if (ret == 0 || ret != -EBUSY)
+			break;
+	}
+
+out:
+	if (ret != 0)
+		putback_movable_pages(&pagelist);
+
+	lru_cache_enable();
+}
+
 vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -1013,6 +1058,11 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
 	if (unlikely(is_migrate_isolate_page(page)))
 		goto out_retry;
 
+	if (unlikely(page_is_tag_storage(page))) {
+		migrate_tag_storage_page(page);
+		return 0;
+	}
+
 	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
 	if (ret)
 		goto out_retry;
@@ -1098,6 +1148,11 @@ vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
 	if (unlikely(is_migrate_isolate_page(page)))
 		goto out_retry;
 
+	if (unlikely(page_is_tag_storage(page))) {
+		migrate_tag_storage_page(page);
+		return 0;
+	}
+
 	ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
 	if (ret)
 		goto out_retry;
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0acef592043c..afca42ace735 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,8 +10,6 @@
 typedef struct folio *new_folio_t(struct folio *folio, unsigned long private);
 typedef void free_folio_t(struct folio *folio, unsigned long private);
 
-struct migration_target_control;
-
 /*
  * Return values from addresss_space_operations.migratepage():
  * - negative errno on page migration failure;
@@ -57,6 +55,12 @@ struct movable_operations {
 	void (*putback_page)(struct page *);
 };
 
+struct migration_target_control {
+	int nid;		/* preferred node id */
+	nodemask_t *nmask;
+	gfp_t gfp_mask;
+};
+
 /* Defined in mm/debug.c: */
 extern const char *migrate_reason_names[MR_TYPES];
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..c6c5c7726d26 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
 	MR_CONTIG_RANGE,
 	MR_LONGTERM_PIN,
 	MR_DEMOTION,
+	MR_ARCH_1,
 	MR_TYPES
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index ddf6bb6c6308..96fff5dfc041 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -949,12 +949,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 
-struct migration_target_control {
-	int nid;		/* preferred node id */
-	nodemask_t *nmask;
-	gfp_t gfp_mask;
-};
-
 /*
  * mm/filemap.c
  */
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 22/27] arm64: mte: swap: Handle tag restoring when missing tag storage
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (20 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE) Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 23/27] arm64: mte: copypage: " Alexandru Elisei
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Linux restores tags when a page is swapped in and there are tags associated
with the swap entry which the new page will replace. The saved tags are
restored even if the page will not be mapped as tagged, to protect against
cases where the page is shared between different VMAs, and is tagged in
some, but untagged in others. By using this approach, the process can still
access the correct tags following an mprotect(PROT_MTE) on the non-MTE
enabled VMA.

But this poses a challenge for managing tag storage: in the scenario above,
when a new page is allocated to be swapped in for the process where it will
be mapped as untagged, the corresponding tag storage block is not reserved.
mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
overwrite data in the tag storage block associated with the new page,
leading to data corruption if the block is in use by a process.

Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them when tag storage is reserved for
the page.
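
At its core, the mechanism is a second tags xarray, keyed by pfn instead of
by swap entry. A simplified sketch of the save/restore pattern (the helper
names are illustrative; the actual code in the diff below also deals with
locking and with racing stores):

#include <linux/mm.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(tags_by_pfn);

/* At swap-in, when the new page has no tag storage yet: stash the tags. */
static int stash_tags_for_pfn(unsigned long pfn, void *tags)
{
	return xa_err(xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL));
}

/* After reserve_tag_storage() succeeds: move the tags into the page. */
static void restore_tags_for_page(struct page *page)
{
	void *tags = xa_erase(&tags_by_pfn, page_to_pfn(page));

	if (tags) {
		mte_copy_page_tags_from_buf(page_address(page), tags);
		mte_free_tag_buf(tags);
	}
}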

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte_tag_storage.h |   9 ++
 arch/arm64/include/asm/pgtable.h         |  11 +++
 arch/arm64/kernel/mte_tag_storage.c      |  20 +++-
 arch/arm64/mm/mteswap.c                  | 112 +++++++++++++++++++++++
 4 files changed, 148 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6a8b19a6a758..a3c38099fe1a 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -37,6 +37,15 @@ bool page_is_tag_storage(struct page *page);
 
 vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
 vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
+
+void tags_by_pfn_lock(void);
+void tags_by_pfn_unlock(void);
+
+void *mte_erase_tags_for_pfn(unsigned long pfn);
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
+
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1704411c096d..1a25b7d601c2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1084,6 +1084,17 @@ static inline void arch_swap_invalidate_area(int type)
 		mte_invalidate_tags_area_by_swp_entry(type);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
+						      struct folio *folio)
+{
+	if (tag_storage_enabled())
+		return mte_try_transfer_swap_tags(entry, &folio->page);
+	return 0;
+}
+#endif
+
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 5096ce859136..6b11bb408b51 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -547,8 +547,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 	mutex_lock(&tag_blocks_lock);
 
 	/* Check again, this time with the lock held. */
-	if (page_tag_storage_reserved(page))
-		goto out_unlock;
+	if (page_tag_storage_reserved(page)) {
+		mutex_unlock(&tag_blocks_lock);
+		return 0;
+	}
 
 	/* Make sure existing entries are not freed out from under our feet. */
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
@@ -583,9 +585,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 	}
 
 	page_set_tag_storage_reserved(page, order);
-out_unlock:
 	mutex_unlock(&tag_blocks_lock);
 
+	mte_restore_tags_for_pfn(page_to_pfn(page), order);
+
 	return 0;
 
 out_error:
@@ -612,7 +615,8 @@ void free_tag_storage(struct page *page, int order)
 	struct tag_region *region;
 	unsigned long page_va;
 	unsigned long flags;
-	int ret;
+	void *tags;
+	int i, ret;
 
 	ret = tag_storage_find_block(page, &start_block, &region);
 	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
@@ -622,6 +626,14 @@ void free_tag_storage(struct page *page, int order)
 	/* Avoid writeback of dirty tag cache lines corrupting data. */
 	dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
 
+	tags_by_pfn_lock();
+	for (i = 0; i < (1 << order); i++) {
+		tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
+		if (unlikely(tags))
+			mte_free_tag_buf(tags);
+	}
+	tags_by_pfn_unlock();
+
 	end_block = start_block + order_to_num_blocks(order) * region->block_size;
 
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 2a43746b803f..20d718a514af 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -20,6 +20,114 @@ void mte_free_tag_buf(void *buf)
 	kfree(buf);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static DEFINE_XARRAY(tags_by_pfn);
+
+void tags_by_pfn_lock(void)
+{
+	xa_lock(&tags_by_pfn);
+}
+
+void tags_by_pfn_unlock(void)
+{
+	xa_unlock(&tags_by_pfn);
+}
+
+void *mte_erase_tags_for_pfn(unsigned long pfn)
+{
+	return __xa_erase(&tags_by_pfn, pfn);
+}
+
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
+{
+	void *entry;
+	int ret;
+
+	ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
+	if (ret)
+		return true;
+
+	tags_by_pfn_lock();
+
+	if (page_tag_storage_reserved(pfn_to_page(pfn))) {
+		tags_by_pfn_unlock();
+		return false;
+	}
+
+	entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
+	if (xa_is_err(entry)) {
+		xa_release(&tags_by_pfn, pfn);
+		goto out_unlock;
+	} else if (entry) {
+		mte_free_tag_buf(entry);
+	}
+
+out_unlock:
+	tags_by_pfn_unlock();
+	return true;
+}
+
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
+{
+	struct page *page = pfn_to_page(start_pfn);
+	unsigned long pfn;
+	void *tags;
+
+	tags_by_pfn_lock();
+
+	for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
+		if (WARN_ON_ONCE(!page_tag_storage_reserved(page)))
+			continue;
+
+		tags = mte_erase_tags_for_pfn(pfn);
+		if (unlikely(tags)) {
+			/*
+			 * Mark the page as tagged so mte_sync_tags() doesn't
+			 * clear the tags.
+			 */
+			WARN_ON_ONCE(!try_page_mte_tagging(page));
+			mte_copy_page_tags_from_buf(page_address(page), tags);
+			set_page_mte_tagged(page);
+			mte_free_tag_buf(tags);
+		}
+	}
+
+	tags_by_pfn_unlock();
+}
+
+/*
+ * Note on locking: swap in/out is done with the folio locked, which eliminates
+ * races with mte_save/restore_page_tags_by_swp_entry.
+ */
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
+{
+	void *swap_tags, *pfn_tags;
+	bool saved;
+
+	/*
+	 * mte_restore_page_tags_by_swp_entry() will take care of copying the
+	 * tags over.
+	 */
+	if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
+		return 0;
+
+	swap_tags = xa_load(&tags_by_swp_entry, entry.val);
+	if (!swap_tags)
+		return 0;
+
+	pfn_tags = mte_allocate_tag_buf();
+	if (!pfn_tags)
+		return VM_FAULT_OOM;
+
+	memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
+	saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
+	if (!saved)
+		mte_free_tag_buf(pfn_tags);
+
+	return 0;
+}
+#endif
+
 int mte_save_page_tags_by_swp_entry(struct page *page)
 {
 	void *tags, *ret;
@@ -54,6 +162,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return;
 
+	/* Tags will be restored when tag storage is reserved. */
+	if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
+		return;
+
 	if (try_page_mte_tagging(page)) {
 		mte_copy_page_tags_from_buf(page_address(page), tags);
 		set_page_mte_tagged(page);
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 23/27] arm64: mte: copypage: Handle tag restoring when missing tag storage
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (21 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 22/27] arm64: mte: swap: Handle tag restoring when missing tag storage Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 24/27] arm64: mte: Handle fatal signal in reserve_tag_storage() Alexandru Elisei
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

There are several situations where copy_highpage() can end up copying
tags to a page which doesn't have its tag storage reserved.

One situation involves migration racing with mprotect(PROT_MTE): VMA is
initially untagged, migration starts and destination page is allocated
as untagged, mprotect(PROT_MTE) changes the VMA to tagged and userspace
accesses the source page, thus making it tagged.  The migration code
then calls copy_highpage(), which will copy the tags from the source
page (now tagged) to the destination page (allocated as untagged).
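
Spelled out, that race looks like this (schematic timeline, earlier events
higher up):

  migration thread                    userspace thread
  ----------------                    ----------------
  allocate destination page
  (VMA untagged -> page untagged)
                                      mprotect(PROT_MTE) makes the VMA tagged
                                      access tags the source page
  copy_highpage(dst, src)
    -> src is now tagged, but dst
       has no tag storage reserved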

Yet another situation can happen during THP collapse. The huge page that
will replace the HPAGE_PMD_NR contiguously mapped pages is allocated
without __GFP_TAGGED set. copy_highpage() will copy the tags from the pages
being replaced to the huge page, which doesn't have tag storage reserved.

The situation gets even more complicated when the replacement huge page
is a tag storage page. The tag storage huge page will be migrated after
a fault on access, but the tags from the original pages must be copied
over to the huge page that will be replacing the tag storage huge page.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/mm/copypage.c | 59 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index a7bb20055ce0..7899f38773b9 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -13,6 +13,62 @@
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+	void *tags;
+	bool saved;
+
+	VM_WARN_ON_ONCE(!preemptible());
+
+	if (page_mte_tagged(from)) {
+		if (likely(page_tag_storage_reserved(to)))
+			return false;
+
+		tags = mte_allocate_tag_buf();
+		if (WARN_ON(!tags))
+			return true;
+
+		mte_copy_page_tags_to_buf(page_address(from), tags);
+		saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+		if (!saved)
+			mte_free_tag_buf(tags);
+
+		return saved;
+	}
+
+	if (likely(!page_is_tag_storage(from)))
+		return false;
+
+	tags_by_pfn_lock();
+	tags = mte_erase_tags_for_pfn(page_to_pfn(from));
+	tags_by_pfn_unlock();
+
+	if (likely(!tags))
+		return false;
+
+	if (page_tag_storage_reserved(to)) {
+		WARN_ON_ONCE(!try_page_mte_tagging(to));
+		mte_copy_page_tags_from_buf(page_address(to), tags);
+		set_page_mte_tagged(to);
+		mte_free_tag_buf(tags);
+		return true;
+	}
+
+	saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+	if (!saved)
+		mte_free_tag_buf(tags);
+
+	return saved;
+}
+#else
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+	return false;
+}
+#endif
 
 void copy_highpage(struct page *to, struct page *from)
 {
@@ -24,6 +80,9 @@ void copy_highpage(struct page *to, struct page *from)
 	if (kasan_hw_tags_enabled())
 		page_kasan_tag_reset(to);
 
+	if (tag_storage_enabled() && try_transfer_saved_tags(from, to))
+		return;
+
 	if (system_supports_mte() && page_mte_tagged(from)) {
 		/* It's a new page, shouldn't have been tagged yet */
 		WARN_ON_ONCE(!try_page_mte_tagging(to));
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 24/27] arm64: mte: Handle fatal signal in reserve_tag_storage()
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (22 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 23/27] arm64: mte: copypage: " Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 25/27] KVM: arm64: Disable MTE if tag storage is enabled Alexandru Elisei
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

As long as a fatal signal is pending, alloc_contig_range() will fail with
-EINTR. This makes it impossible for tag storage allocation to succeed, and
the page allocator will print an OOM splat.

The process is going to be killed, so return 0 (success) from
reserve_tag_storage() to allow the page allocator to make progress.
set_pte_at() will map the page with PAGE_FAULT_ON_ACCESS, and subsequent
accesses from other threads will fault until the signal is delivered.
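
For reference, the page allocator behaviour being accommodated is the fatal
signal check in __alloc_contig_migrate_range() (paraphrased from
mm/page_alloc.c):

	/* mm/page_alloc.c, __alloc_contig_migrate_range(), simplified */
	while (pfn < end || !list_empty(&cc->migratepages)) {
		if (fatal_signal_pending(current)) {
			ret = -EINTR;	/* propagated to alloc_contig_range() */
			break;
		}
		/* isolate, then migrate the pages in the range */
	}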

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 17 +++++++++++++++++
 arch/arm64/mm/fault.c               |  5 +++++
 2 files changed, 22 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 6b11bb408b51..602fdc70db1c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -572,6 +572,23 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 				break;
 		}
 
+		/*
+		 * alloc_contig_range() returns -EINTR from
+		 * __alloc_contig_migrate_range() if a fatal signal is pending.
+		 * As long as the signal hasn't been handled, it is impossible
+		 * to reserve tag storage for any page. Stop trying to reserve
+		 * tag storage, but return 0 so the page allocator can make
+		 * forward progress, instead of printing an OOM splat.
+		 *
+		 * The tagged page with missing tag storage will be mapped with
+		 * PAGE_FAULT_ON_ACCESS in set_pte_at(), which means accesses
+		 * until the signal is delivered will cause a fault.
+		 */
+		if (ret == -EINTR) {
+			ret = 0;
+			goto out_error;
+		}
+
 		if (ret)
 			goto out_error;
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 964c5ae161a3..fdc98c5828bf 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -950,6 +950,11 @@ gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
 
 void tag_clear_highpage(struct page *page)
 {
+	if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
+		clear_page(page_address(page));
+		return;
+	}
+
 	/* Newly allocated page, shouldn't have been tagged yet */
 	WARN_ON_ONCE(!try_page_mte_tagging(page));
 	mte_zero_clear_page_tags(page_address(page));
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 25/27] KVM: arm64: Disable MTE if tag storage is enabled
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (23 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 24/27] arm64: mte: Handle fatal signal in reserve_tag_storage() Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 26/27] arm64: mte: Fast track reserving tag storage when the block is free Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 27/27] arm64: mte: Enable dynamic tag storage reuse Alexandru Elisei
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

KVM allows MTE-enabled VMs to be created even when the backing VMA does not
have MTE enabled. Without changes to how KVM allocates memory for a VM, it
is currently impossible to discern when the corresponding tag storage needs
to be reserved.

For now, disable MTE in KVM if tag storage is enabled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/arm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e5f75f1f1085..5b33c532c62a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -29,6 +29,7 @@
 #include <linux/uaccess.h>
 #include <asm/ptrace.h>
 #include <asm/mman.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
@@ -86,7 +87,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_ARM_MTE:
 		mutex_lock(&kvm->lock);
-		if (!system_supports_mte() || kvm->created_vcpus) {
+		if (!system_supports_mte() || tag_storage_enabled() ||
+		    kvm->created_vcpus) {
 			r = -EINVAL;
 		} else {
 			r = 0;
@@ -279,7 +281,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = 1;
 		break;
 	case KVM_CAP_ARM_MTE:
-		r = system_supports_mte();
+		r = system_supports_mte() && !tag_storage_enabled();
 		break;
 	case KVM_CAP_STEAL_TIME:
 		r = kvm_arm_pvtime_supported();
-- 
2.42.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH RFC v2 26/27] arm64: mte: Fast track reserving tag storage when the block is free
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (24 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 25/27] KVM: arm64: Disable MTE if tag storage is enabled Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  2023-11-19 16:57 ` [PATCH RFC v2 27/27] arm64: mte: Enable dynamic tag storage reuse Alexandru Elisei
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

A double-digit performance regression in Chrome startup time has been
reported with dynamic tag storage management enabled. A large part of the
regression is due to lru_cache_disable(), called from
__alloc_contig_migrate_range(), which sends an IPI to every CPU in the
system.

Improve performance by taking the tag storage block directly off the
freelist when it is free, thus sidestepping the costly alloc_contig_range()
call.

Note that at the moment this is implemented only when the block size is
1 (the block is one page); larger block sizes could be added later if
necessary.
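
Condensed, the fast path added below amounts to (error handling elided):

	page = pfn_to_page(block);
	if (region->block_size == 1 && is_free_buddy_page(page) &&
	    take_page_off_buddy(page, false)) {
		/* Block taken straight off the freelist: no migration,
		 * no lru_cache_disable() IPIs. */
		ret = tag_storage_reserve_block(block, region, order);
		if (!ret)
			page_ref_inc(page);
	}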

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                  |  1 +
 arch/arm64/kernel/mte_tag_storage.c | 15 +++++++++++++++
 include/linux/page-flags.h          | 15 +++++++++++++--
 mm/Kconfig                          |  4 ++++
 mm/memory-failure.c                 |  8 ++++----
 mm/page_alloc.c                     | 21 ++++++++++++---------
 6 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3b9c435eaafb..93a4bbca3800 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2067,6 +2067,7 @@ config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
 	depends on ARCH_KEEP_MEMBLOCK
 	select ARCH_HAS_FAULT_ON_ACCESS
+	select WANTS_TAKE_PAGE_OFF_BUDDY
 	select CMA
 	help
 	  Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 602fdc70db1c..11961587382d 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -522,6 +522,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 	unsigned long block;
 	unsigned long flags;
 	unsigned int tries;
+	bool success;
 	int ret = 0;
 
 	VM_WARN_ON_ONCE(!preemptible());
@@ -565,6 +566,19 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 		if (tag_storage_block_is_reserved(block))
 			continue;
 
+		if (region->block_size == 1 && is_free_buddy_page(pfn_to_page(block))) {
+			success = take_page_off_buddy(pfn_to_page(block), false);
+			if (success) {
+				ret = tag_storage_reserve_block(block, region, order);
+				if (ret) {
+					put_page_back_buddy(pfn_to_page(block), false);
+					goto out_error;
+				}
+				page_ref_inc(pfn_to_page(block));
+				goto success_next;
+			}
+		}
+
 		tries = 3;
 		while (tries--) {
 			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
@@ -598,6 +612,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
 			goto out_error;
 		}
 
+success_next:
 		count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
 	}
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7915165a51bd..0d0380141f5d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -576,11 +576,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define MAGIC_HWPOISON	0x48575053U	/* HWPS */
 extern void SetPageHWPoisonTakenOff(struct page *page);
 extern void ClearPageHWPoisonTakenOff(struct page *page);
-extern bool take_page_off_buddy(struct page *page);
-extern bool put_page_back_buddy(struct page *page);
+extern bool PageHWPoisonTakenOff(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison, hwpoison)
+TESTSCFLAG_FALSE(HWPoison, hwpoison)
 #define __PG_HWPOISON 0
+static inline void SetPageHWPoisonTakenOff(struct page *page) { }
+static inline void ClearPageHWPoisonTakenOff(struct page *page) { }
+static inline bool PageHWPoisonTakenOff(struct page *page)
+{
+	return false;
+}
+#endif
+
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
+extern bool take_page_off_buddy(struct page *page, bool poison);
+extern bool put_page_back_buddy(struct page *page, bool unpoison);
 #endif
 
 #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
diff --git a/mm/Kconfig b/mm/Kconfig
index a90eefc3ee80..0766cdc3de4d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -773,6 +773,7 @@ config MEMORY_FAILURE
 	depends on MMU
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
+	select WANTS_TAKE_PAGE_OFF_BUDDY
 	select MEMORY_ISOLATION
 	select RAS
 	help
@@ -1022,6 +1023,9 @@ config ARCH_HAS_CACHE_LINE_SIZE
 config ARCH_HAS_FAULT_ON_ACCESS
 	bool
 
+config WANTS_TAKE_PAGE_OFF_BUDDY
+	bool
+
 config ARCH_HAS_CURRENT_STACK_POINTER
 	bool
 	help
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..8b44afd6a558 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -157,7 +157,7 @@ static int __page_handle_poison(struct page *page)
 	zone_pcp_disable(page_zone(page));
 	ret = dissolve_free_huge_page(page);
 	if (!ret)
-		ret = take_page_off_buddy(page);
+		ret = take_page_off_buddy(page, true);
 	zone_pcp_enable(page_zone(page));
 
 	return ret;
@@ -1348,7 +1348,7 @@ static int page_action(struct page_state *ps, struct page *p,
 	return action_result(pfn, ps->type, result);
 }
 
-static inline bool PageHWPoisonTakenOff(struct page *page)
+bool PageHWPoisonTakenOff(struct page *page)
 {
 	return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
 }
@@ -2236,7 +2236,7 @@ int memory_failure(unsigned long pfn, int flags)
 		res = get_hwpoison_page(p, flags);
 		if (!res) {
 			if (is_free_buddy_page(p)) {
-				if (take_page_off_buddy(p)) {
+				if (take_page_off_buddy(p, true)) {
 					page_ref_inc(p);
 					res = MF_RECOVERED;
 				} else {
@@ -2567,7 +2567,7 @@ int unpoison_memory(unsigned long pfn)
 		ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
 	} else if (ghp < 0) {
 		if (ghp == -EHWPOISON) {
-			ret = put_page_back_buddy(p) ? 0 : -EBUSY;
+			ret = put_page_back_buddy(p, true) ? 0 : -EBUSY;
 		} else {
 			ret = ghp;
 			unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 135f9283a863..4b74acfc41a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6700,7 +6700,7 @@ bool is_free_buddy_page(struct page *page)
 }
 EXPORT_SYMBOL(is_free_buddy_page);
 
-#ifdef CONFIG_MEMORY_FAILURE
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
 /*
  * Break down a higher-order page in sub-pages, and keep our target out of
  * buddy allocator.
@@ -6730,11 +6730,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
 		set_buddy_order(current_buddy, high);
 	}
 }
-
 /*
- * Take a page that will be marked as poisoned off the buddy allocator.
+ * Take a page off the buddy allocator, and optionally mark it as poisoned.
  */
-bool take_page_off_buddy(struct page *page)
+bool take_page_off_buddy(struct page *page, bool poison)
 {
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
@@ -6755,7 +6754,8 @@ bool take_page_off_buddy(struct page *page)
 			del_page_from_free_list(page_head, zone, page_order);
 			break_down_buddy_pages(zone, page_head, page, 0,
 						page_order, migratetype);
-			SetPageHWPoisonTakenOff(page);
+			if (poison)
+				SetPageHWPoisonTakenOff(page);
 			if (!is_migrate_isolate(migratetype))
 				__mod_zone_freepage_state(zone, -1, migratetype);
 			ret = true;
@@ -6769,9 +6769,10 @@ bool take_page_off_buddy(struct page *page)
 }
 
 /*
- * Cancel takeoff done by take_page_off_buddy().
+ * Cancel takeoff done by take_page_off_buddy(), and optionally unpoison the
+ * page.
  */
-bool put_page_back_buddy(struct page *page)
+bool put_page_back_buddy(struct page *page, bool unpoison)
 {
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
@@ -6781,9 +6782,11 @@ bool put_page_back_buddy(struct page *page)
 
 	spin_lock_irqsave(&zone->lock, flags);
 	if (put_page_testzero(page)) {
-		ClearPageHWPoisonTakenOff(page);
+		VM_WARN_ON_ONCE(PageHWPoisonTakenOff(page) && !unpoison);
+		if (unpoison)
+			ClearPageHWPoisonTakenOff(page);
 		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
-		if (TestClearPageHWPoison(page)) {
+		if (!unpoison || (unpoison && TestClearPageHWPoison(page))) {
 			ret = true;
 		}
 	}
-- 
2.42.1



* [PATCH RFC v2 27/27] arm64: mte: Enable dynamic tag storage reuse
  2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
                   ` (25 preceding siblings ...)
  2023-11-19 16:57 ` [PATCH RFC v2 26/27] arm64: mte: Fast track reserving tag storage when the block is free Alexandru Elisei
@ 2023-11-19 16:57 ` Alexandru Elisei
  26 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-19 16:57 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Everything is now in place; enable tag storage management.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 11961587382d..9f60e952a814 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -395,6 +395,9 @@ static int __init mte_tag_storage_activate_regions(void)
 
 	reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
 
+	static_branch_enable(&tag_storage_enabled_key);
+	pr_info("MTE tag storage region management enabled");
+
 	return 0;
 
 out_disabled:
-- 
2.42.1



* Re: [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access
  2023-11-19 16:57 ` [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access Alexandru Elisei
@ 2023-11-22  1:28   ` Peter Collingbourne
  2023-11-22  9:22     ` Alexandru Elisei
  2023-11-28 17:56   ` David Hildenbrand
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Collingbourne @ 2023-11-22  1:28 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> regular pages.
>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/include/asm/mte_tag_storage.h |  1 +
>  arch/arm64/include/asm/pgtable.h         |  7 ++
>  arch/arm64/mm/fault.c                    | 81 ++++++++++++++++++++++++
>  include/linux/huge_mm.h                  |  2 +
>  include/linux/pgtable.h                  |  5 ++
>  mm/huge_memory.c                         |  4 +-
>  mm/memory.c                              |  3 +
>  7 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index c70ced60a0cd..b97406d369ce 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
>  bool page_tag_storage_reserved(struct page *page);
>
>  vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
>  #else
>  static inline bool tag_storage_enabled(void)
>  {
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 8cc135f1c112..1704411c096d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
>                 return handle_page_missing_tag_storage(vmf);
>         return VM_FAULT_SIGBUS;
>  }
> +
> +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> +{
> +       if (tag_storage_enabled())
> +               return handle_huge_page_missing_tag_storage(vmf);
> +       return VM_FAULT_SIGBUS;
> +}
>  #endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
>
>  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index f5fa583acf18..6730a0812a24 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
>
>         return 0;
>
> +out_retry:
> +       put_page(page);
> +       if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> +               vma_end_read(vma);
> +       if (fault_flag_allow_retry_first(vmf->flags)) {
> +               err = VM_FAULT_RETRY;
> +       } else {
> +               /* Replay the fault. */
> +               err = 0;
> +       }
> +       return err;
> +}
> +
> +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
> +{
> +       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +       struct vm_area_struct *vma = vmf->vma;
> +       pmd_t old_pmd, new_pmd;
> +       bool writable = false;
> +       struct page *page;
> +       vm_fault_t err;
> +       int ret;
> +
> +       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +       if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
> +               spin_unlock(vmf->ptl);
> +               return 0;
> +       }
> +
> +       old_pmd = vmf->orig_pmd;
> +       new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> +
> +       /*
> +        * Detect now whether the PMD could be writable; this information
> +        * is only valid while holding the PT lock.
> +        */
> +       writable = pmd_write(new_pmd);
> +       if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> +           can_change_pmd_writable(vma, vmf->address, new_pmd))
> +               writable = true;
> +
> +       page = vm_normal_page_pmd(vma, haddr, new_pmd);
> +       if (!page)
> +               goto out_map;
> +
> +       if (!(vma->vm_flags & VM_MTE))
> +               goto out_map;
> +
> +       get_page(page);
> +       vma_set_access_pid_bit(vma);
> +
> +       spin_unlock(vmf->ptl);
> +       writable = false;
> +
> +       if (unlikely(is_migrate_isolate_page(page)))
> +               goto out_retry;
> +
> +       ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
> +       if (ret)
> +               goto out_retry;
> +
> +       put_page(page);
> +
> +       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +       if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
> +               spin_unlock(vmf->ptl);
> +               return 0;
> +       }
> +
> +out_map:
> +       /* Restore the PMD */
> +       new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> +       new_pmd = pmd_mkyoung(new_pmd);
> +       if (writable)
> +               new_pmd = pmd_mkwrite(new_pmd, vma);
> +       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
> +       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> +       spin_unlock(vmf->ptl);
> +
> +       return 0;
> +
>  out_retry:
>         put_page(page);
>         if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..bb84291f9231 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                     pmd_t *pmd, unsigned long addr, pgprot_t newprot,
>                     unsigned long cp_flags);
> +bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
> +                            pmd_t pmd);
>
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e2c761dd6c41..de45f475bf8d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
>  {
>         return VM_FAULT_SIGBUS;
>  }
> +
> +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> +{
> +       return VM_FAULT_SIGBUS;
> +}
>  #endif
>
>  #endif /* CONFIG_MMU */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9beead961a65..d1402b43ea39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         return VM_FAULT_FALLBACK;
>  }
>
> -static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
> -                                          unsigned long addr, pmd_t pmd)
> +inline bool can_change_pmd_writable(struct vm_area_struct *vma,

Remove inline keyword here.

Peter

> +                                   unsigned long addr, pmd_t pmd)
>  {
>         struct page *page;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a04a971200b9..46b926625503 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                         return 0;
>                 }
>                 if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> +                       if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
> +                               return arch_do_huge_page_fault_on_access(&vmf);
> +
>                         if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
>                                 return do_huge_pmd_numa_page(&vmf);
>
> --
> 2.42.1
>


* Re: [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access
  2023-11-22  1:28   ` Peter Collingbourne
@ 2023-11-22  9:22     ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-22  9:22 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Peter,

On Tue, Nov 21, 2023 at 05:28:49PM -0800, Peter Collingbourne wrote:
> On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> > regular pages.
> >
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/include/asm/mte_tag_storage.h |  1 +
> >  arch/arm64/include/asm/pgtable.h         |  7 ++
> >  arch/arm64/mm/fault.c                    | 81 ++++++++++++++++++++++++
> >  include/linux/huge_mm.h                  |  2 +
> >  include/linux/pgtable.h                  |  5 ++
> >  mm/huge_memory.c                         |  4 +-
> >  mm/memory.c                              |  3 +
> >  7 files changed, 101 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index c70ced60a0cd..b97406d369ce 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -35,6 +35,7 @@ void free_tag_storage(struct page *page, int order);
> >  bool page_tag_storage_reserved(struct page *page);
> >
> >  vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> > +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> >  #else
> >  static inline bool tag_storage_enabled(void)
> >  {
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 8cc135f1c112..1704411c096d 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -477,6 +477,13 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> >                 return handle_page_missing_tag_storage(vmf);
> >         return VM_FAULT_SIGBUS;
> >  }
> > +
> > +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +       if (tag_storage_enabled())
> > +               return handle_huge_page_missing_tag_storage(vmf);
> > +       return VM_FAULT_SIGBUS;
> > +}
> >  #endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> >
> >  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index f5fa583acf18..6730a0812a24 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -1041,6 +1041,87 @@ vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> >
> >         return 0;
> >
> > +out_retry:
> > +       put_page(page);
> > +       if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > +               vma_end_read(vma);
> > +       if (fault_flag_allow_retry_first(vmf->flags)) {
> > +               err = VM_FAULT_RETRY;
> > +       } else {
> > +               /* Replay the fault. */
> > +               err = 0;
> > +       }
> > +       return err;
> > +}
> > +
> > +vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf)
> > +{
> > +       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > +       struct vm_area_struct *vma = vmf->vma;
> > +       pmd_t old_pmd, new_pmd;
> > +       bool writable = false;
> > +       struct page *page;
> > +       vm_fault_t err;
> > +       int ret;
> > +
> > +       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > +       if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
> > +               spin_unlock(vmf->ptl);
> > +               return 0;
> > +       }
> > +
> > +       old_pmd = vmf->orig_pmd;
> > +       new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> > +
> > +       /*
> > +        * Detect now whether the PMD could be writable; this information
> > +        * is only valid while holding the PT lock.
> > +        */
> > +       writable = pmd_write(new_pmd);
> > +       if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> > +           can_change_pmd_writable(vma, vmf->address, new_pmd))
> > +               writable = true;
> > +
> > +       page = vm_normal_page_pmd(vma, haddr, new_pmd);
> > +       if (!page)
> > +               goto out_map;
> > +
> > +       if (!(vma->vm_flags & VM_MTE))
> > +               goto out_map;
> > +
> > +       get_page(page);
> > +       vma_set_access_pid_bit(vma);
> > +
> > +       spin_unlock(vmf->ptl);
> > +       writable = false;
> > +
> > +       if (unlikely(is_migrate_isolate_page(page)))
> > +               goto out_retry;
> > +
> > +       ret = reserve_tag_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
> > +       if (ret)
> > +               goto out_retry;
> > +
> > +       put_page(page);
> > +
> > +       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > +       if (unlikely(!pmd_same(old_pmd, *vmf->pmd))) {
> > +               spin_unlock(vmf->ptl);
> > +               return 0;
> > +       }
> > +
> > +out_map:
> > +       /* Restore the PMD */
> > +       new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
> > +       new_pmd = pmd_mkyoung(new_pmd);
> > +       if (writable)
> > +               new_pmd = pmd_mkwrite(new_pmd, vma);
> > +       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
> > +       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> > +       spin_unlock(vmf->ptl);
> > +
> > +       return 0;
> > +
> >  out_retry:
> >         put_page(page);
> >         if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..bb84291f9231 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                     pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> >                     unsigned long cp_flags);
> > +bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
> > +                            pmd_t pmd);
> >
> >  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> >  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index e2c761dd6c41..de45f475bf8d 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1473,6 +1473,11 @@ static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> >  {
> >         return VM_FAULT_SIGBUS;
> >  }
> > +
> > +static inline vm_fault_t arch_do_huge_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +       return VM_FAULT_SIGBUS;
> > +}
> >  #endif
> >
> >  #endif /* CONFIG_MMU */
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 9beead961a65..d1402b43ea39 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1406,8 +1406,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
> >         return VM_FAULT_FALLBACK;
> >  }
> >
> > -static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
> > -                                          unsigned long addr, pmd_t pmd)
> > +inline bool can_change_pmd_writable(struct vm_area_struct *vma,
> 
> Remove inline keyword here.

Indeed, as it does nothing now that the function is not static.

Thanks,
Alex

> 
> Peter
> 
> > +                                   unsigned long addr, pmd_t pmd)
> >  {
> >         struct page *page;
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a04a971200b9..46b926625503 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5168,6 +5168,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >                         return 0;
> >                 }
> >                 if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> > +                       if (fault_on_access_pmd(vmf.orig_pmd) && vma_is_accessible(vma))
> > +                               return arch_do_huge_page_fault_on_access(&vmf);
> > +
> >                         if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> >                                 return do_huge_pmd_numa_page(&vmf);
> >
> > --
> > 2.42.1
> >


* Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  2023-11-19 16:56 ` [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail Alexandru Elisei
@ 2023-11-24 19:35   ` David Hildenbrand
  2023-11-27 12:09     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:35 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:56, Alexandru Elisei wrote:
> Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> storage for an allocated page. Reserving tag storage can fail, for example,
> if the tag storage page has a short pin on it, so allow prep_new_page() ->
> arch_prep_new_page() to similarly fail.

But what are the side-effects of this? How does the calling code recover?

E.g., what if we need to populate a page into user space, but that 
particular page we allocated fails to be prepared? So we inject a signal 
into that poor process?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  2023-11-19 16:57 ` [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare() Alexandru Elisei
@ 2023-11-24 19:36   ` David Hildenbrand
  2023-11-27 13:03     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:36 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> Add an arch_free_pages_prepare() hook that is called before the page flags
> are cleared. This will be used by arm64 when explicit management of tag
> storage pages is enabled.

Can you elaborate a bit on what exactly will be done by that code with that
information?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  2023-11-19 16:57 ` [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype Alexandru Elisei
@ 2023-11-24 19:40   ` David Hildenbrand
  2023-11-27 15:01     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:40 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> the page allocator to manage them like regular pages.
> 
> This migratetype lends the pages some very desirable properties:
> 
> * They cannot be longterm pinned, meaning they will always be migratable.
> 
> * The pages can be allocated explicitly by using their PFN (with
>    alloc_contig_range()) when they are needed to store tags.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/Kconfig                  |  1 +
>   arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
>   include/linux/mmzone.h              |  5 +++
>   mm/internal.h                       |  3 --
>   4 files changed, 74 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fe8276fdc7a8..047487046e8f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>   if ARM64_MTE
>   config ARM64_MTE_TAG_STORAGE
>   	bool "Dynamic MTE tag storage management"
> +	select CONFIG_CMA
>   	help
>   	  Adds support for dynamic management of the memory used by the hardware
>   	  for storing MTE tags. This memory, unlike normal memory, cannot be
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index fa6267ef8392..427f4f1909f3 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -5,10 +5,12 @@
>    * Copyright (C) 2023 ARM Ltd.
>    */
>   
> +#include <linux/cma.h>
>   #include <linux/memblock.h>
>   #include <linux/mm.h>
>   #include <linux/of_device.h>
>   #include <linux/of_fdt.h>
> +#include <linux/pageblock-flags.h>
>   #include <linux/range.h>
>   #include <linux/string.h>
>   #include <linux/xarray.h>
> @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
>   		return ret;
>   	}
>   
> +	/* Pages are managed in pageblock_nr_pages chunks */
> +	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> +		pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> +		       PFN_PHYS(pageblock_nr_pages));
> +		return -EINVAL;
> +	}
> +
>   	ret = tag_storage_get_memory_node(node, &mem_node);
>   	if (ret)
>   		return ret;
> @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
>   		pr_info("MTE tag storage region management disabled");
>   	}
>   }
> +
> +static int __init mte_tag_storage_activate_regions(void)
> +{
> +	phys_addr_t dram_start, dram_end;
> +	struct range *tag_range;
> +	unsigned long pfn;
> +	int i, ret;
> +
> +	if (num_tag_regions == 0)
> +		return 0;
> +
> +	dram_start = memblock_start_of_DRAM();
> +	dram_end = memblock_end_of_DRAM();
> +
> +	for (i = 0; i < num_tag_regions; i++) {
> +		tag_range = &tag_regions[i].tag_range;
> +		/*
> +		 * Tag storage region was clipped by arm64_bootmem_init()
> +		 * enforcing addressing limits.
> +		 */
> +		if (PFN_PHYS(tag_range->start) < dram_start ||
> +				PFN_PHYS(tag_range->end) >= dram_end) {
> +			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> +			       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> +			ret = -EINVAL;
> +			goto out_disabled;
> +		}
> +	}
> +
> +	/*
> +	 * MTE disabled, tag storage pages can be used like any other pages. The
> +	 * only restriction is that the pages cannot be used by kexec because
> +	 * the memory remains marked as reserved in the memblock allocator.
> +	 */
> +	if (!system_supports_mte()) {
> +		for (i = 0; i< num_tag_regions; i++) {
> +			tag_range = &tag_regions[i].tag_range;
> +			for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> +				free_reserved_page(pfn_to_page(pfn));
> +		}
> +		ret = 0;
> +		goto out_disabled;
> +	}
> +
> +	for (i = 0; i < num_tag_regions; i++) {
> +		tag_range = &tag_regions[i].tag_range;
> +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> +			init_cma_reserved_pageblock(pfn_to_page(pfn));
> +		totalcma_pages += range_len(tag_range);
> +	}

You shouldn't be doing that manually in arm code. Likely you want some 
cma.c helper for something like that.

But, can you elaborate on why you took this hacky (sorry) approach as 
documented in the cover letter:

"The arm64 code manages this memory directly instead of using
cma_declare_contiguous/cma_alloc for performance reasons."

What is the exact problem?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-11-19 16:57 ` [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK Alexandru Elisei
@ 2023-11-24 19:51   ` David Hildenbrand
  2023-11-27 15:04     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:51 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> Tag storage memory requires that the tag storage pages used for data are
> always migratable when they need to be repurposed to store tags.
> 
> If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> memblocks to find a suitable location for copying the kernel image. The
> kernel image, once loaded, cannot be moved to another location in physical
> memory. The initialization code for the tag storage reserves the memblocks
> for the tag storage pages, which means kexec will not use them, and the tag
> storage pages can be migrated at any time, which is the desired behaviour.
> 
> However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> flag, which isn't currently set by the tag storage initialization code.
> 
> Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make explicit
> that the Kconfig option is required for it to work correctly.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 047487046e8f..efa5b7958169 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>   if ARM64_MTE
>   config ARM64_MTE_TAG_STORAGE
>   	bool "Dynamic MTE tag storage management"
> +	depends on ARCH_KEEP_MEMBLOCK
>   	select CONFIG_CMA
>   	help
>   	  Adds support for dynamic management of the memory used by the hardware

Doesn't arm64 select that unconditionally? Why is this required then?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-11-19 16:57 ` [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled Alexandru Elisei
@ 2023-11-24 19:54   ` David Hildenbrand
  2023-11-27 15:07     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:54 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> Reserving the tag storage associated with a page requires that the tag
> storage page can be migrated.
> 
> When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
> in non-preemptible contexts, which can make reserving the associated tag
> storage impossible.

I assume that it's the only in-kernel user that actually requires tagged 
memory (besides user space), correct?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-19 16:57 ` [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
@ 2023-11-24 19:56   ` David Hildenbrand
  2023-11-27 15:10     ` Alexandru Elisei
  2023-11-29  8:57   ` Hyesoo Yu
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-24 19:56 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> alloc_contig_range() requires that the requested pages are in the same
> zone. Check that this is indeed the case before initializing the tag
> storage blocks.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index 8b9bedf7575d..fd63430d4dc0 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
>   	}
>   }
>   
> +/* alloc_contig_range() requires all pages to be in the same zone. */
> +static int __init mte_tag_storage_check_zone(void)
> +{
> +	struct range *tag_range;
> +	struct zone *zone;
> +	unsigned long pfn;
> +	u32 block_size;
> +	int i, j;
> +
> +	for (i = 0; i < num_tag_regions; i++) {
> +		block_size = tag_regions[i].block_size;
> +		if (block_size == 1)
> +			continue;
> +
> +		tag_range = &tag_regions[i].tag_range;
> +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> +			zone = page_zone(pfn_to_page(pfn));
> +			for (j = 1; j < block_size; j++) {
> +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> +					pr_err("Tag storage block pages in different zones");
> +					return -EINVAL;
> +				}
> +			}
> +		}
> +	}
> +
> +	 return 0;
> +}
> +

Looks like something that ordinary CMA provides. See cma_activate_area().
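
For reference, cma_activate_area() already does roughly this when it
activates an area (paraphrasing mm/cma.c):

	zone = page_zone(pfn_to_page(base_pfn));
	for (pfn = base_pfn + 1; pfn < base_pfn + cma->count; pfn++)
		if (page_zone(pfn_to_page(pfn)) != zone)
			goto not_in_zone;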

Can't we find a way to let CMA do CMA thingies and only be a user of
that? What would be required to make the performance issue you spelled
out in the cover letter go away, without having to open-code that in arch
code?

-- 
Cheers,

David / dhildenb



* Re: [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp
  2023-11-19 16:56 ` [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp Alexandru Elisei
@ 2023-11-25 10:03   ` Mike Rapoport
  2023-11-27 11:52     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Rapoport @ 2023-11-25 10:03 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> It might be desirable for an architecture to modify the gfp flags used to
> allocate the destination page for migration based on the page that is
> being replaced. For example, if an architecture has metadata associated
> with a page (like arm64, when the memory tagging extension is implemented),
> it can request that the destination page similarly has storage for tags
> already allocated.
> 
> No functional change.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  include/linux/migrate.h | 4 ++++
>  mm/mempolicy.c          | 2 ++
>  mm/migrate.c            | 3 +++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 2ce13e8a309b..0acef592043c 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -60,6 +60,10 @@ struct movable_operations {
>  /* Defined in mm/debug.c: */
>  extern const char *migrate_reason_names[MR_TYPES];
>  
> +#ifndef arch_migration_target_gfp
> +#define arch_migration_target_gfp(src, gfp) 0
> +#endif
> +
>  #ifdef CONFIG_MIGRATION
>  
>  void putback_movable_pages(struct list_head *l);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..50bc43ab50d6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
>  
>  		h = folio_hstate(src);
>  		gfp = htlb_alloc_mask(h);
> +		gfp |= arch_migration_target_gfp(src, gfp);

I think it'll be more robust to have arch_migration_target_gfp() modify
the flags and return the new mask, with added (or potentially removed)
flags.
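
Something like this, just as a sketch of the idea:

	/* the default implementation leaves the mask unchanged */
	#ifndef arch_migration_target_gfp
	#define arch_migration_target_gfp(src, gfp)	(gfp)
	#endif

and the callers become:

	gfp = arch_migration_target_gfp(src, gfp);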

>  		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
>  		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
>  	}
> @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
>  		gfp = GFP_TRANSHUGE;
>  	else
>  		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> +	gfp |= arch_migration_target_gfp(src, gfp);
>  
>  	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
>  	return page_rmappable_folio(page);

-- 
Sincerely yours,
Mike.


* Re: [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp
  2023-11-25 10:03   ` Mike Rapoport
@ 2023-11-27 11:52     ` Alexandru Elisei
  2023-11-28  6:49       ` Mike Rapoport
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 11:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Mike,

I really appreciate you having a look!

On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > It might be desirable for an architecture to modify the gfp flags used to
> > allocate the destination page for migration based on the page that it is
> > being replaced. For example, if an architectures has metadata associated
> > with a page (like arm64, when the memory tagging extension is implemented),
> > it can request that the destination page similarly has storage for tags
> > already allocated.
> > 
> > No functional change.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  include/linux/migrate.h | 4 ++++
> >  mm/mempolicy.c          | 2 ++
> >  mm/migrate.c            | 3 +++
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 2ce13e8a309b..0acef592043c 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -60,6 +60,10 @@ struct movable_operations {
> >  /* Defined in mm/debug.c: */
> >  extern const char *migrate_reason_names[MR_TYPES];
> >  
> > +#ifndef arch_migration_target_gfp
> > +#define arch_migration_target_gfp(src, gfp) 0
> > +#endif
> > +
> >  #ifdef CONFIG_MIGRATION
> >  
> >  void putback_movable_pages(struct list_head *l);
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..50bc43ab50d6 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> >  
> >  		h = folio_hstate(src);
> >  		gfp = htlb_alloc_mask(h);
> > +		gfp |= arch_migration_target_gfp(src, gfp);
> 
> I think it'll be more robust to have arch_migration_target_gfp() modify
> the flags and return the new mask, with added (or potentially removed)
> flags.

I did it this way so an arch won't be able to remove flags set by the MM code.
There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
arch_calc_vm_flag_bits().

I'll change it to return the new mask if you think that's better.

Thanks,
Alex

> 
> >  		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> >  		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> >  	}
> > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> >  		gfp = GFP_TRANSHUGE;
> >  	else
> >  		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > +	gfp |= arch_migration_target_gfp(src, gfp);
> >  
> >  	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> >  	return page_rmappable_folio(page);
> 
> -- 
> Sincerely yours,
> Mike.
> 


* Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  2023-11-24 19:35   ` David Hildenbrand
@ 2023-11-27 12:09     ` Alexandru Elisei
  2023-11-28 16:57       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 12:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

Thank you so much for your comments, they are genuinely useful.

On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:56, Alexandru Elisei wrote:
> > Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> > storage for an allocated page. Reserving tag storage can fail, for example,
> > if the tag storage page has a short pin on it, so allow prep_new_page() ->
> > arch_prep_new_page() to similarly fail.
> 
> But what are the side-effects of this? How does the calling code recover?
> 
> E.g., what if we need to populate a page into user space, but that
> particular page we allocated fails to be prepared? So we inject a signal
> into that poor process?

When the page fails to be prepared, it is put back to the tail of the
freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
are exhausted and no page has been found for which tag storage has been
reserved, then that's treated like an OOM situation.

I have been thinking about this, and I think I can simplify the code by
making tag reservation a best-effort approach. The page can be allocated
even if reserving tag storage fails, but the page is marked as invalid in
set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
storage) and next time it is accessed, arm64 will reserve tag storage in
the fault handling code (the mechanism for that is implemented in patch #19
of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
mprotect(PROT_MTE)").

With this new approach, prep_new_page() stays the way it is, and no further
changes are required for the page allocator, as there are already arch
callbacks that can be used for that, for example tag_clear_highpage() and
arch_alloc_page(). The downside is extra page faults, which might impact
performance.
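
As a rough sketch, reusing the helpers from this series (the exact
plumbing is in patch #19), set_pte_at() would do something like:

	/* defer tag storage reservation to the first access */
	if (tag_storage_enabled() && pte_tagged(pte) &&
	    !page_tag_storage_reserved(page))
		pte = pte_modify(pte, PAGE_FAULT_ON_ACCESS);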

What do you think?

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


* Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  2023-11-24 19:36   ` David Hildenbrand
@ 2023-11-27 13:03     ` Alexandru Elisei
  2023-11-28 16:58       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 13:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Add an arch_free_pages_prepare() hook that is called before the page flags
> > are cleared. This will be used by arm64 when explicit management of tag
> > storage pages is enabled.
> 
> Can you elaborate a bit what exactly will be done by that code with that
> information?

Of course.

The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
renames to PG_mte_tagged, to track if a page has been mapped with tagging
enabled. That flag is cleared by free_pages_prepare() when it does:

	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

When tag storage management is enabled, tag storage is reserved for a page if
and only if the page is mapped as tagged. When a page is freed, the code looks
at the PG_mte_tagged flag to determine if the page was mapped as tagged and
therefore has tag storage reserved, in which case the corresponding tag
storage must also be freed.

I have considered using arch_free_page(), but free_pages_prepare() calls the
function after the flags are cleared.
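
So the hook ends up looking roughly like this (a sketch; free_tag_storage()
is introduced earlier in the series):

	void arch_free_pages_prepare(struct page *page, int order)
	{
		if (tag_storage_enabled() && page_mte_tagged(page))
			free_tag_storage(page, order);
	}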

Does that answer your question?

Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


* Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  2023-11-24 19:40   ` David Hildenbrand
@ 2023-11-27 15:01     ` Alexandru Elisei
  2023-11-28 17:03       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 15:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi David,

On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> > the page allocator to manage them like regular pages.
> > 
> > This migratetype lends the pages some very desirable properties:
> > 
> > * They cannot be longterm pinned, meaning they will always be migratable.
> > 
> > * The pages can be allocated explicitly by using their PFN (with
> >    alloc_contig_range()) when they are needed to store tags.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >   arch/arm64/Kconfig                  |  1 +
> >   arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
> >   include/linux/mmzone.h              |  5 +++
> >   mm/internal.h                       |  3 --
> >   4 files changed, 74 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index fe8276fdc7a8..047487046e8f 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> >   if ARM64_MTE
> >   config ARM64_MTE_TAG_STORAGE
> >   	bool "Dynamic MTE tag storage management"
> > +	select CONFIG_CMA
> >   	help
> >   	  Adds support for dynamic management of the memory used by the hardware
> >   	  for storing MTE tags. This memory, unlike normal memory, cannot be
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index fa6267ef8392..427f4f1909f3 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -5,10 +5,12 @@
> >    * Copyright (C) 2023 ARM Ltd.
> >    */
> > +#include <linux/cma.h>
> >   #include <linux/memblock.h>
> >   #include <linux/mm.h>
> >   #include <linux/of_device.h>
> >   #include <linux/of_fdt.h>
> > +#include <linux/pageblock-flags.h>
> >   #include <linux/range.h>
> >   #include <linux/string.h>
> >   #include <linux/xarray.h>
> > @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> >   		return ret;
> >   	}
> > +	/* Pages are managed in pageblock_nr_pages chunks */
> > +	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> > +		pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> > +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > +		       PFN_PHYS(pageblock_nr_pages));
> > +		return -EINVAL;
> > +	}
> > +
> >   	ret = tag_storage_get_memory_node(node, &mem_node);
> >   	if (ret)
> >   		return ret;
> > @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
> >   		pr_info("MTE tag storage region management disabled");
> >   	}
> >   }
> > +
> > +static int __init mte_tag_storage_activate_regions(void)
> > +{
> > +	phys_addr_t dram_start, dram_end;
> > +	struct range *tag_range;
> > +	unsigned long pfn;
> > +	int i, ret;
> > +
> > +	if (num_tag_regions == 0)
> > +		return 0;
> > +
> > +	dram_start = memblock_start_of_DRAM();
> > +	dram_end = memblock_end_of_DRAM();
> > +
> > +	for (i = 0; i < num_tag_regions; i++) {
> > +		tag_range = &tag_regions[i].tag_range;
> > +		/*
> > +		 * Tag storage region was clipped by arm64_bootmem_init()
> > +		 * enforcing addressing limits.
> > +		 */
> > +		if (PFN_PHYS(tag_range->start) < dram_start ||
> > +				PFN_PHYS(tag_range->end) >= dram_end) {
> > +			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> > +			       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> > +			ret = -EINVAL;
> > +			goto out_disabled;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * MTE disabled, tag storage pages can be used like any other pages. The
> > +	 * only restriction is that the pages cannot be used by kexec because
> > +	 * the memory remains marked as reserved in the memblock allocator.
> > +	 */
> > +	if (!system_supports_mte()) {
> > +		for (i = 0; i< num_tag_regions; i++) {
> > +			tag_range = &tag_regions[i].tag_range;
> > +			for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> > +				free_reserved_page(pfn_to_page(pfn));
> > +		}
> > +		ret = 0;
> > +		goto out_disabled;
> > +	}
> > +
> > +	for (i = 0; i < num_tag_regions; i++) {
> > +		tag_range = &tag_regions[i].tag_range;
> > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > +			init_cma_reserved_pageblock(pfn_to_page(pfn));
> > +		totalcma_pages += range_len(tag_range);
> > +	}
> 
> You shouldn't be doing that manually in arm code. Likely you want some cma.c
> helper for something like that.

If you are referring to the last loop (the one that does
init_cma_reserved_pageblock()), indeed, there's already a function which
does that: cma_init_reserved_areas() -> cma_activate_area().

> 
> But, can you elaborate on why you took this hacky (sorry) approach as
> documented in the cover letter:

No worries, it is indeed a bit hacky :)

> 
> "The arm64 code manages this memory directly instead of using
> cma_declare_contiguous/cma_alloc for performance reasons."
> 
> What is the exact problem?

I am referring to the performance degradation that is fixed in patch #26,
"arm64: mte: Fast track reserving tag storage when the block is free" [1].
The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
leads to a 10-20% performance degradation on Chrome. It has been observed
that most of the time the tag storage pages are free, and the
lru_cache_disable() calls are unnecessary.

The performance degradation is almost entirely eliminated by having the code
take the tag storage page directly from the free list if it's free, instead
of calling alloc_contig_range().
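
That is, the fast path from patch #26 (quoting the diff from earlier in
the thread):

	if (region->block_size == 1 && is_free_buddy_page(pfn_to_page(block))) {
		success = take_page_off_buddy(pfn_to_page(block), false);
		...
	}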

Do you believe it would be better to use the cma code, and modify it to use
this fast path to take the page directly from the buddy allocator?

I can definitely try to integrate the code with cma_alloc(), but I think
keeping the fast path for reserving tag storage is extremely desirable,
since it makes such a huge difference to performance.

[1] https://lore.kernel.org/linux-trace-kernel/20231119165721.9849-27-alexandru.elisei@arm.com/
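
For reference, a minimal sketch of that fast path in kernel-style C. The
block is assumed to span a power-of-two number of pages, and
take_block_off_free_list() is a hypothetical helper standing in for the
free-list surgery the real patch does:

	static int reserve_tag_block(unsigned long block_pfn, unsigned int order,
				     gfp_t gfp)
	{
		struct page *page = pfn_to_page(block_pfn);
		struct zone *zone = page_zone(page);
		unsigned long flags;
		bool taken = false;

		spin_lock_irqsave(&zone->lock, flags);
		/* Fast path: the whole block sits free in the buddy allocator. */
		if (PageBuddy(page) && buddy_order(page) >= order)
			taken = take_block_off_free_list(zone, page, order); /* hypothetical */
		spin_unlock_irqrestore(&zone->lock, flags);

		if (taken)
			return 0;

		/* Slow path: migrate the data out, paying the lru_cache_disable() cost. */
		return alloc_contig_range(block_pfn, block_pfn + (1 << order),
					  MIGRATE_CMA, gfp);
	}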

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-11-24 19:51   ` David Hildenbrand
@ 2023-11-27 15:04     ` Alexandru Elisei
  2023-11-28 17:05       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 15:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Tag storage memory requires that the tag storage pages used for data are
> > always migratable when they need to be repurposed to store tags.
> > 
> > If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> > memblocks to find a suitable location for copying the kernel image. The
> > kernel image, once loaded, cannot be moved to another location in physical
> > memory. The initialization code for the tag storage reserves the memblocks
> > for the tag storage pages, which means kexec will not use them, and the tag
> > storage pages can be migrated at any time, which is the desired behaviour.
> > 
> > However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> > region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> > flag, which isn't currently set by the tag storage initialization code.
> > 
> > Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
> > that the Kconfig option is required for it to work correctly.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >   arch/arm64/Kconfig | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 047487046e8f..efa5b7958169 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> >   if ARM64_MTE
> >   config ARM64_MTE_TAG_STORAGE
> >   	bool "Dynamic MTE tag storage management"
> > +	depends on ARCH_KEEP_MEMBLOCK
> >   	select CONFIG_CMA
> >   	help
> >   	  Adds support for dynamic management of the memory used by the hardware
> 
> Doesn't arm64 select that unconditionally? Why is this required then?

I've added this patch to make the dependency explicit. If, in the future, arm64
stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
fact that tag storage depends on it. So this patch is not required per se, it's
there to document the dependency.

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-11-24 19:54   ` David Hildenbrand
@ 2023-11-27 15:07     ` Alexandru Elisei
  2023-11-28 17:05       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 15:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > To be able to reserve the tag storage associated with a page requires that
> > the tag storage page can be migrated.
> > 
> > When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
> > in non-preemptible contexts, which can make reserving the associated tag
> > storage impossible.
> 
> I assume that it's the only in-kernel user that actually requires tagged
> memory (besides for user space), correct?

Indeed, this is the case. I'll expand the commit message to be more clear about
it.

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-24 19:56   ` David Hildenbrand
@ 2023-11-27 15:10     ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-27 15:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Nov 24, 2023 at 08:56:59PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > alloc_contig_range() requires that the requested pages are in the same
> > zone. Check that this is indeed the case before initializing the tag
> > storage blocks.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >   arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> >   1 file changed, 33 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 8b9bedf7575d..fd63430d4dc0 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> >   	}
> >   }
> > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > +static int __init mte_tag_storage_check_zone(void)
> > +{
> > +	struct range *tag_range;
> > +	struct zone *zone;
> > +	unsigned long pfn;
> > +	u32 block_size;
> > +	int i, j;
> > +
> > +	for (i = 0; i < num_tag_regions; i++) {
> > +		block_size = tag_regions[i].block_size;
> > +		if (block_size == 1)
> > +			continue;
> > +
> > +		tag_range = &tag_regions[i].tag_range;
> > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > +			zone = page_zone(pfn_to_page(pfn));
> > +			for (j = 1; j < block_size; j++) {
> > +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > +					pr_err("Tag storage block pages in different zones");
> > +					return -EINVAL;
> > +				}
> > +			}
> > +		}
> > +	}
> > +
> > +	 return 0;
> > +}
> > +
> 
> Looks like something that ordinary CMA provides. See cma_activate_area().

Indeed.

> 
> Can't we find a way to let CMA do CMA thingies and only be a user of that?
> What would be required to make the performance issue you spelled out in the
> cover letter be gone and not have to open-code that in arch code?

I've replied with a possible solution here [1].

[1] https://lore.kernel.org/all/ZWSvMYMjFLFZ-abv@raptor/

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE)
  2023-11-19 16:57 ` [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE) Alexandru Elisei
@ 2023-11-28  5:39   ` Peter Collingbourne
  2023-11-30 17:43     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Peter Collingbourne @ 2023-11-28  5:39 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Alexandru,

On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/include/asm/mte_tag_storage.h |  1 +
>  arch/arm64/kernel/mte_tag_storage.c      | 15 +++++++
>  arch/arm64/mm/fault.c                    | 55 ++++++++++++++++++++++++
>  include/linux/migrate.h                  |  8 +++-
>  include/linux/migrate_mode.h             |  1 +
>  mm/internal.h                            |  6 ---
>  6 files changed, 78 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index b97406d369ce..6a8b19a6a758 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
>  void free_tag_storage(struct page *page, int order);
>
>  bool page_tag_storage_reserved(struct page *page);
> +bool page_is_tag_storage(struct page *page);
>
>  vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
>  vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index a1cc239f7211..5096ce859136 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
>         return test_bit(PG_tag_storage_reserved, &page->flags);
>  }
>
> +bool page_is_tag_storage(struct page *page)
> +{
> +       unsigned long pfn = page_to_pfn(page);
> +       struct range *tag_range;
> +       int i;
> +
> +       for (i = 0; i < num_tag_regions; i++) {
> +               tag_range = &tag_regions[i].tag_range;
> +               if (tag_range->start <= pfn && pfn <= tag_range->end)
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
>  int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
>  {
>         unsigned long start_block, end_block;
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 6730a0812a24..964c5ae161a3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -12,6 +12,7 @@
>  #include <linux/extable.h>
>  #include <linux/kfence.h>
>  #include <linux/signal.h>
> +#include <linux/migrate.h>
>  #include <linux/mm.h>
>  #include <linux/hardirq.h>
>  #include <linux/init.h>
> @@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
>  }
>
>  #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +#define MR_TAGGED_TAG_STORAGE  MR_ARCH_1
> +
> +extern bool isolate_lru_page(struct page *page);
> +extern void putback_movable_pages(struct list_head *l);

Could we move these declarations to a non-mm-internal header and
#include it instead of manually declaring them here?

> +
> +/* Returns with the page reference dropped. */
> +static void migrate_tag_storage_page(struct page *page)
> +{
> +       struct migration_target_control mtc = {
> +               .nid = NUMA_NO_NODE,
> +               .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
> +       };
> +       unsigned long i, nr_pages = compound_nr(page);
> +       LIST_HEAD(pagelist);
> +       int ret, tries;
> +
> +       lru_cache_disable();
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               if (!isolate_lru_page(page + i)) {
> +                       ret = -EAGAIN;
> +                       goto out;
> +               }
> +               /* Isolate just grabbed another reference, drop ours. */
> +               put_page(page + i);
> +               list_add_tail(&(page + i)->lru, &pagelist);
> +       }
> +
> +       tries = 5;
> +       while (tries--) {
> +               ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
> +                                   MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
> +               if (ret == 0 || ret != -EBUSY)

This could be simplified to:

if (ret != -EBUSY)

Peter

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp
  2023-11-27 11:52     ` Alexandru Elisei
@ 2023-11-28  6:49       ` Mike Rapoport
  2023-11-28 17:21         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Rapoport @ 2023-11-28  6:49 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Mon, Nov 27, 2023 at 11:52:56AM +0000, Alexandru Elisei wrote:
> Hi Mike,
> 
> I really appreciate you having a look!
> 
> On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> > On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > > It might be desirable for an architecture to modify the gfp flags used to
> > > allocate the destination page for migration based on the page that is
> > > being replaced. For example, if an architecture has metadata associated
> > > with a page (like arm64, when the memory tagging extension is implemented),
> > > it can request that the destination page similarly has storage for tags
> > > already allocated.
> > > 
> > > No functional change.
> > > 
> > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > ---
> > >  include/linux/migrate.h | 4 ++++
> > >  mm/mempolicy.c          | 2 ++
> > >  mm/migrate.c            | 3 +++
> > >  3 files changed, 9 insertions(+)
> > > 
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 2ce13e8a309b..0acef592043c 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -60,6 +60,10 @@ struct movable_operations {
> > >  /* Defined in mm/debug.c: */
> > >  extern const char *migrate_reason_names[MR_TYPES];
> > >  
> > > +#ifndef arch_migration_target_gfp
> > > +#define arch_migration_target_gfp(src, gfp) 0
> > > +#endif
> > > +
> > >  #ifdef CONFIG_MIGRATION
> > >  
> > >  void putback_movable_pages(struct list_head *l);
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 10a590ee1c89..50bc43ab50d6 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > >  
> > >  		h = folio_hstate(src);
> > >  		gfp = htlb_alloc_mask(h);
> > > +		gfp |= arch_migration_target_gfp(src, gfp);
> > 
> > I think it'll be more robust to have arch_migration_target_gfp() modify
> > the flags and return the new mask with added (or potentially removed)
> > flags.
> 
> I did it this way so an arch won't be able to remove flags set by the MM code.
> There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
> arch_calc_vm_flag_bits().

Ok, just add a sentence about it to the commit message.
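
Something like this on the arm64 side would illustrate the or-only contract
(a sketch, not the actual patch; __GFP_TAGGED is introduced by this series
and page_mte_tagged() is the existing arm64 helper):

	/* The hook may only ever add flags; the caller does gfp |= ... */
	#define arch_migration_target_gfp arch_migration_target_gfp
	static inline gfp_t arch_migration_target_gfp(struct folio *src, gfp_t gfp)
	{
		if (page_mte_tagged(folio_page(src, 0)))
			return __GFP_TAGGED;
		return 0;
	}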
 
> Thanks,
> Alex
> 
> > 
> > >  		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > >  		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> > >  	}
> > > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > >  		gfp = GFP_TRANSHUGE;
> > >  	else
> > >  		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > > +	gfp |= arch_migration_target_gfp(src, gfp);
> > >  
> > >  	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > >  	return page_rmappable_folio(page);
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  2023-11-27 12:09     ` Alexandru Elisei
@ 2023-11-28 16:57       ` David Hildenbrand
  2023-11-28 17:17         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 16:57 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 27.11.23 13:09, Alexandru Elisei wrote:
> Hi,
> 
> Thank you so much for your comments, they are genuinely useful.
> 
> On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:56, Alexandru Elisei wrote:
>>> Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
>>> storage for an allocated page. Reserving tag storage can fail, for example,
>>> if the tag storage page has a short pin on it, so allow prep_new_page() ->
>>> arch_prep_new_page() to similarly fail.
>>
>> But what are the side-effects of this? How does the calling code recover?
>>
>> E.g., what if we need to populate a page into user space, but that
>> particular page we allocated fails to be prepared? So we inject a signal
>> into that poor process?
> 
> When the page fails to be prepared, it is put back to the tail of the
> freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
> are exhausted and no page has been found for which tag storage has been
> reserved, then that's treated like an OOM situation.
> 
> I have been thinking about this, and I think I can simplify the code by
> making tag reservation a best effort approach. The page can be allocated
> even if reserving tag storage fails, but the page is marked as invalid in
> set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
> storage) and next time it is accessed, arm64 will reserve tag storage in
> the fault handling code (the mechanism for that is implemented in patch #19
> of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
> mprotect(PROT_MTE)").
> 
> With this new approach, prep_new_page() stays the way it is, and no further
> changes are required for the page allocator, as there are already arch
> callbacks that can be used for that, for example tag_clear_highpage() and
> arch_alloc_page(). The downside is extra page faults, which might impact
> performance.
> 
> What do you think?

That sounds a lot more robust, compared to intermittent failures to 
allocate pages.
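
A sketch of that best-effort shape, with hypothetical names except for
reserve_tag_storage(), which is the series' helper:

	/* Allocation side: never fail the allocation, only defer (sketch). */
	static void tag_storage_prep_page(struct page *page, int order, gfp_t gfp)
	{
		if (reserve_tag_storage(page, order, gfp) == 0)
			return;
		/*
		 * Reservation failed: the PTE will be installed PAGE_NONE-like
		 * with a marker bit, and the fault handler retries the
		 * reservation on first access.
		 */
		set_page_tag_storage_pending(page);	/* hypothetical marker */
	}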

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  2023-11-27 13:03     ` Alexandru Elisei
@ 2023-11-28 16:58       ` David Hildenbrand
  2023-11-28 17:17         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 16:58 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 27.11.23 14:03, Alexandru Elisei wrote:
> Hi,
> 
> On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Add arch_free_pages_prepare() hook that is called before the page flags
>>> are cleared. This will be used by arm64 when explicit management of tag
>>> storage pages is enabled.
>>
>> Can you elaborate a bit what exactly will be done by that code with that
>> information?
> 
> Of course.
> 
> The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
> renames to PG_mte_tagged, to track if a page has been mapped with tagging
> enabled. That flag is cleared by free_pages_prepare() when it does:
> 
> 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> When tag storage management is enabled, tag storage is reserved for a page if
> and only if the page is mapped as tagged. When a page is freed, the code looks
> at the PG_mte_tagged flag to determine if the page was mapped as tagged, and
> therefore has tag storage reserved, to determine if the corresponding tag
> storage should also be freed.
> 
> I have considered using arch_free_page(), but free_pages_prepare() calls the
> function after the flags are cleared.
> 
> Does that answer your question?

Yes, please add some of that to the patch description!
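
For illustration, the hook could boil down to something like this (a sketch
using free_tag_storage() from this series and the existing page_mte_tagged()
helper; the real arm64 implementation may differ):

	/* Runs before free_pages_prepare() clears PAGE_FLAGS_CHECK_AT_PREP. */
	static inline void arch_free_pages_prepare(struct page *page, int order)
	{
		/* PG_mte_tagged is still set here, so tag storage was reserved. */
		if (page_mte_tagged(page))
			free_tag_storage(page, order);
	}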

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  2023-11-27 15:01     ` Alexandru Elisei
@ 2023-11-28 17:03       ` David Hildenbrand
  2023-11-29 10:44         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:03 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 27.11.23 16:01, Alexandru Elisei wrote:
> Hi David,
> 
> On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
>>> the page allocator to manage them like regular pages.
>>>
>>> This migratetype lends the pages some very desirable properties:
>>>
>>> * They cannot be longterm pinned, meaning they will always be migratable.
>>>
>>> * The pages can be allocated explicitly by using their PFN (with
>>>     alloc_contig_range()) when they are needed to store tags.
>>>
>>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>> ---
>>>    arch/arm64/Kconfig                  |  1 +
>>>    arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
>>>    include/linux/mmzone.h              |  5 +++
>>>    mm/internal.h                       |  3 --
>>>    4 files changed, 74 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index fe8276fdc7a8..047487046e8f 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>>>    if ARM64_MTE
>>>    config ARM64_MTE_TAG_STORAGE
>>>    	bool "Dynamic MTE tag storage management"
>>> +	select CONFIG_CMA
>>>    	help
>>>    	  Adds support for dynamic management of the memory used by the hardware
>>>    	  for storing MTE tags. This memory, unlike normal memory, cannot be
>>> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
>>> index fa6267ef8392..427f4f1909f3 100644
>>> --- a/arch/arm64/kernel/mte_tag_storage.c
>>> +++ b/arch/arm64/kernel/mte_tag_storage.c
>>> @@ -5,10 +5,12 @@
>>>     * Copyright (C) 2023 ARM Ltd.
>>>     */
>>> +#include <linux/cma.h>
>>>    #include <linux/memblock.h>
>>>    #include <linux/mm.h>
>>>    #include <linux/of_device.h>
>>>    #include <linux/of_fdt.h>
>>> +#include <linux/pageblock-flags.h>
>>>    #include <linux/range.h>
>>>    #include <linux/string.h>
>>>    #include <linux/xarray.h>
>>> @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
>>>    		return ret;
>>>    	}
>>> +	/* Pages are managed in pageblock_nr_pages chunks */
>>> +	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
>>> +		pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
>>> +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
>>> +		       PFN_PHYS(pageblock_nr_pages));
>>> +		return -EINVAL;
>>> +	}
>>> +
>>>    	ret = tag_storage_get_memory_node(node, &mem_node);
>>>    	if (ret)
>>>    		return ret;
>>> @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
>>>    		pr_info("MTE tag storage region management disabled");
>>>    	}
>>>    }
>>> +
>>> +static int __init mte_tag_storage_activate_regions(void)
>>> +{
>>> +	phys_addr_t dram_start, dram_end;
>>> +	struct range *tag_range;
>>> +	unsigned long pfn;
>>> +	int i, ret;
>>> +
>>> +	if (num_tag_regions == 0)
>>> +		return 0;
>>> +
>>> +	dram_start = memblock_start_of_DRAM();
>>> +	dram_end = memblock_end_of_DRAM();
>>> +
>>> +	for (i = 0; i < num_tag_regions; i++) {
>>> +		tag_range = &tag_regions[i].tag_range;
>>> +		/*
>>> +		 * Tag storage region was clipped by arm64_bootmem_init()
>>> +		 * enforcing addressing limits.
>>> +		 */
>>> +		if (PFN_PHYS(tag_range->start) < dram_start ||
>>> +				PFN_PHYS(tag_range->end) >= dram_end) {
>>> +			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
>>> +			       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
>>> +			ret = -EINVAL;
>>> +			goto out_disabled;
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * MTE disabled, tag storage pages can be used like any other pages. The
>>> +	 * only restriction is that the pages cannot be used by kexec because
>>> +	 * the memory remains marked as reserved in the memblock allocator.
>>> +	 */
>>> +	if (!system_supports_mte()) {
>>> +		for (i = 0; i < num_tag_regions; i++) {
>>> +			tag_range = &tag_regions[i].tag_range;
>>> +			for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
>>> +				free_reserved_page(pfn_to_page(pfn));
>>> +		}
>>> +		ret = 0;
>>> +		goto out_disabled;
>>> +	}
>>> +
>>> +	for (i = 0; i < num_tag_regions; i++) {
>>> +		tag_range = &tag_regions[i].tag_range;
>>> +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
>>> +			init_cma_reserved_pageblock(pfn_to_page(pfn));
>>> +		totalcma_pages += range_len(tag_range);
>>> +	}
>>
>> You shouldn't be doing that manually in arm code. Likely you want some cma.c
>> helper for something like that.
> 
> If you are referring to the last loop (the one that does
> init_cma_reserved_pageblock()), indeed, there's already a function which
> does that, cma_init_reserved_areas() -> cma_activate_area().
> 
>>
>> But, can you elaborate on why you took this hacky (sorry) approach as
>> documented in the cover letter:
> 
> No worries, it is indeed a bit hacky :)
> 
>>
>> "The arm64 code manages this memory directly instead of using
>> cma_declare_contiguous/cma_alloc for performance reasons."
>>
>> What is the exact problem?
> 
> I am referring to the performance degradation that is fixed in patch #26,
> "arm64: mte: Fast track reserving tag storage when the block is free" [1].
> The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
> calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
> leads to a 10-20% performance degradation on Chrome. It has been observed
> that most of the time the tag storage pages are free, and the
> lru_cache_disable() calls are unnecessary.

This sounds like something eventually worth integrating into 
CMA/alloc_contig_range(). Like, a fast path to check if we are only 
allocating something small (e.g., falls within a single pageblock), and 
if the page is free.

> 
> The performance degradation is almost entirely eliminated by having the code
> take the tag storage page directly from the free list if it's free, instead
> of calling alloc_contig_range().
> 
> Do you believe it would be better to use the cma code, and modify it to use
> this fast path to take the page directly from the buddy allocator?

That sounds reasonable, yes. Do you see any blockers for that?

> 
> I can definitely try to integrate the code with cma_alloc(), but I think
> keeping the fast path for reserving tag storage is extremely desirable,
> since it makes such a huge difference to performance.

Yes, but let's try finding a way to optimize common code, to eventually 
improve some CMA cases as well? :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-11-27 15:04     ` Alexandru Elisei
@ 2023-11-28 17:05       ` David Hildenbrand
  2023-11-29 10:46         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:05 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 27.11.23 16:04, Alexandru Elisei wrote:
> Hi,
> 
> On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> Tag storage memory requires that the tag storage pages used for data are
>>> always migratable when they need to be repurposed to store tags.
>>>
>>> If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
>>> memblocks to find a suitable location for copying the kernel image. The
>>> kernel image, once loaded, cannot be moved to another location in physical
>>> memory. The initialization code for the tag storage reserves the memblocks
>>> for the tag storage pages, which means kexec will not use them, and the tag
>>> storage pages can be migrated at any time, which is the desired behaviour.
>>>
>>> However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
>>> region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
>>> flag, which isn't currently set by the tag storage initialization code.
>>>
>>> Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
>>> that the Kconfig option is required for it to work correctly.
>>>
>>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>> ---
>>>    arch/arm64/Kconfig | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 047487046e8f..efa5b7958169 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2065,6 +2065,7 @@ config ARM64_MTE
>>>    if ARM64_MTE
>>>    config ARM64_MTE_TAG_STORAGE
>>>    	bool "Dynamic MTE tag storage management"
>>> +	depends on ARCH_KEEP_MEMBLOCK
>>>    	select CONFIG_CMA
>>>    	help
>>>    	  Adds support for dynamic management of the memory used by the hardware
>>
>> Doesn't arm64 select that unconditionally? Why is this required then?
> 
> I've added this patch to make the dependency explicit. If, in the future, arm64
> stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
> fact that tag storage depends on it. So this patch is not required per se, it's
> there to document the dependency.

I see. Could you add some static_assert / BUILD_BUG_ON instead?

I suspect there are plenty of other (undocumented) reasons why
ARCH_KEEP_MEMBLOCK has to be enabled for now, and none of them spells out
the ARCH_KEEP_MEMBLOCK dependency either.
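
Something like this in mte_tag_storage.c would do it (sketch):

	#include <linux/build_bug.h>

	static_assert(IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK),
		      "MTE tag storage relies on memblock sticking around");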

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-11-27 15:07     ` Alexandru Elisei
@ 2023-11-28 17:05       ` David Hildenbrand
  0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:05 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 27.11.23 16:07, Alexandru Elisei wrote:
> Hi,
> 
> On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> To be able to reserve the tag storage associated with a page requires that
>>> the tag storage page can be migrated.
>>>
>>> When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
>>> in non-preemptible contexts, which can make reserving the associated tag
>>> storage impossible.
>>
>> I assume that it's the only in-kernel user that actually requires tagged
>> memory (besides for user space), correct?
> 
> Indeed, this is the case. I'll expand the commit message to be more clear about
> it.
> 

Great, thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
  2023-11-19 16:57 ` [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page Alexandru Elisei
@ 2023-11-28 17:06   ` David Hildenbrand
  2023-11-29 11:30     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:06 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> On arm64, the zero page receives special treatment by having the tagged
> flag set on MTE initialization, not when the page is mapped in a process
> address space. Reserve the corresponding tag block when tag storage
> management is being activated.

Out of curiosity: why does the shared zeropage require tagged storage? 
What about the huge zeropage?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
  2023-11-28 16:57       ` David Hildenbrand
@ 2023-11-28 17:17         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-28 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 05:57:31PM +0100, David Hildenbrand wrote:
> On 27.11.23 13:09, Alexandru Elisei wrote:
> > Hi,
> > 
> > Thank you so much for your comments, they are genuinely useful.
> > 
> > On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:56, Alexandru Elisei wrote:
> > > > Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag
> > > > storage for an allocated page. Reserving tag storage can fail, for example,
> > > > if the tag storage page has a short pin on it, so allow prep_new_page() ->
> > > > arch_prep_new_page() to similarly fail.
> > > 
> > > But what are the side-effects of this? How does the calling code recover?
> > > 
> > > E.g., what if we need to populate a page into user space, but that
> > > particular page we allocated fails to be prepared? So we inject a signal
> > > into that poor process?
> > 
> > When the page fails to be prepared, it is put back to the tail of the
> > freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
> > are exhausted and no page has been found for which tag storage has been
> > reserved, then that's treated like an OOM situation.
> > 
> > I have been thinking about this, and I think I can simplify the code by
> > making tag reservation a best effort approach. The page can be allocated
> > even if reserving tag storage fails, but the page is marked as invalid in
> > set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
> > storage) and next time it is accessed, arm64 will reserve tag storage in
> > the fault handling code (the mechanism for that is implemented in patch #19
> > of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
> > mprotect(PROT_MTE)").
> > 
> > With this new approach, prep_new_page() stays the way it is, and no further
> > changes are required for the page allocator, as there are already arch
> > callbacks that can be used for that, for example tag_clear_highpage() and
> > arch_alloc_page(). The downside is extra page faults, which might impact
> > performance.
> > 
> > What do you think?
> 
> That sounds a lot more robust, compared to intermittent failures to allocate
> pages.

Great, thank you for the feedback, I will use this approach for the next
iteration of the series.

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
  2023-11-28 16:58       ` David Hildenbrand
@ 2023-11-28 17:17         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-28 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 05:58:55PM +0100, David Hildenbrand wrote:
> On 27.11.23 14:03, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Fri, Nov 24, 2023 at 08:36:52PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Add arch_free_pages_prepare() hook that is called before the page flags
> > > > are cleared. This will be used by arm64 when explicit management of tag
> > > > storage pages is enabled.
> > > 
> > > Can you elaborate a bit what exactly will be done by that code with that
> > > information?
> > 
> > Of course.
> > 
> > The MTE code that is in the kernel today uses the PG_arch_2 page flag, which it
> > renames to PG_mte_tagged, to track if a page has been mapped with tagging
> > enabled. That flag is cleared by free_pages_prepare() when it does:
> > 
> > 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > 
> > When tag storage management is enabled, tag storage is reserved for a page if
> > and only if the page is mapped as tagged. When a page is freed, the code looks
> > at the PG_mte_tagged flag to determine if the page was mapped as tagged, and
> > therefore has tag storage reserved, to determine if the corresponding tag
> > storage should also be freed.
> > 
> > I have considered using arch_free_page(), but free_pages_prepare() calls the
> > function after the flags are cleared.
> > 
> > Does that answer your question?
> 
> Yes, please add some of that to the patch description!

Will do!

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp
  2023-11-28  6:49       ` Mike Rapoport
@ 2023-11-28 17:21         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-28 17:21 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 08:49:57AM +0200, Mike Rapoport wrote:
> On Mon, Nov 27, 2023 at 11:52:56AM +0000, Alexandru Elisei wrote:
> > Hi Mike,
> > 
> > I really appreciate you having a look!
> > 
> > On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> > > On Sun, Nov 19, 2023 at 04:56:58PM +0000, Alexandru Elisei wrote:
> > > > It might be desirable for an architecture to modify the gfp flags used to
> > > > allocate the destination page for migration based on the page that is
> > > > being replaced. For example, if an architecture has metadata associated
> > > > with a page (like arm64, when the memory tagging extension is implemented),
> > > > it can request that the destination page similarly has storage for tags
> > > > already allocated.
> > > > 
> > > > No functional change.
> > > > 
> > > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > > ---
> > > >  include/linux/migrate.h | 4 ++++
> > > >  mm/mempolicy.c          | 2 ++
> > > >  mm/migrate.c            | 3 +++
> > > >  3 files changed, 9 insertions(+)
> > > > 
> > > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > > index 2ce13e8a309b..0acef592043c 100644
> > > > --- a/include/linux/migrate.h
> > > > +++ b/include/linux/migrate.h
> > > > @@ -60,6 +60,10 @@ struct movable_operations {
> > > >  /* Defined in mm/debug.c: */
> > > >  extern const char *migrate_reason_names[MR_TYPES];
> > > >  
> > > > +#ifndef arch_migration_target_gfp
> > > > +#define arch_migration_target_gfp(src, gfp) 0
> > > > +#endif
> > > > +
> > > >  #ifdef CONFIG_MIGRATION
> > > >  
> > > >  void putback_movable_pages(struct list_head *l);
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 10a590ee1c89..50bc43ab50d6 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -1182,6 +1182,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > > >  
> > > >  		h = folio_hstate(src);
> > > >  		gfp = htlb_alloc_mask(h);
> > > > +		gfp |= arch_migration_target_gfp(src, gfp);
> > > 
> > > I think it'll be more robust to have arch_migration_target_gfp() modify
> > > the flags and return the new mask with added (or potentially removed)
> > > flags.
> > 
> > I did it this way so an arch won't be able to remove flags set by the MM code.
> > There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
> > arch_calc_vm_flag_bits().
> 
> Ok, just add a sentence about it to the commit message.

Great, will do that!

Thanks,
Alex

>  
> > Thanks,
> > Alex
> > 
> > > 
> > > >  		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > > >  		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
> > > >  	}
> > > > @@ -1190,6 +1191,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > > >  		gfp = GFP_TRANSHUGE;
> > > >  	else
> > > >  		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
> > > > +	gfp |= arch_migration_target_gfp(src, gfp);
> > > >  
> > > >  	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > > >  	return page_rmappable_folio(page);
> > > 
> > > -- 
> > > Sincerely yours,
> > > Mike.
> > > 
> 
> -- 
> Sincerely yours,
> Mike.
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-19 16:57 ` [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE) Alexandru Elisei
@ 2023-11-28 17:55   ` David Hildenbrand
  2023-11-28 18:00     ` David Hildenbrand
  2023-11-29 11:55     ` Alexandru Elisei
  2023-11-29  9:27   ` Hyesoo Yu
  1 sibling, 2 replies; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:55 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> To enable tagging on a memory range, userspace can use mprotect() with the
> PROT_MTE access flag. Pages already mapped in the VMA don't have the
> associated tag storage block reserved, so mark the PTEs as
> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> reserve the tag storage on the fault path.

That sounds a lot like fake PROT_NONE. Would there be a way to unify that 
handling and simply reuse pte_protnone()? For example, could we special 
case on VMA flags?

Like, don't do NUMA hinting in these special VMAs. Then, have something 
like:

if (pte_protnone(vmf->orig_pte))
	return handle_pte_protnone(vmf);

In there, special case on the VMA flags.

I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some 
code from the prot_none handling path. At least the recovery path and 
writability handling look like they would be better off shared in 
handle_pte_protnone() as well.

That might take some magic out of this patch.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access
  2023-11-19 16:57 ` [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access Alexandru Elisei
  2023-11-22  1:28   ` Peter Collingbourne
@ 2023-11-28 17:56   ` David Hildenbrand
  2023-11-29 11:56     ` Alexandru Elisei
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 17:56 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 19.11.23 17:57, Alexandru Elisei wrote:
> Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> regular pages.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---

Same comments :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-28 17:55   ` David Hildenbrand
@ 2023-11-28 18:00     ` David Hildenbrand
  2023-11-29 11:55     ` Alexandru Elisei
  1 sibling, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2023-11-28 18:00 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 28.11.23 18:55, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
>> To enable tagging on a memory range, userspace can use mprotect() with the
>> PROT_MTE access flag. Pages already mapped in the VMA don't have the
>> associated tag storage block reserved, so mark the PTEs as
>> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
>> reserve the tag storage on the fault path.
> 
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that
> handling and simply reuse pte_protnone()? For example, could we special
> case on VMA flags?
> 
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
> 
> if (pte_protnone(vmf->orig_pte))
> 	return handle_pte_protnone(vmf);
> 

Thinking out loud: maybe there isn't even the need to special-case on the 
VMA. Arch code should know if there is something to do. If not, it 
surely was triggered by NUMA hinting. So maybe that could be handled in 
handle_pte_protnone() quite nicely.
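
Roughly like this, where handle_pte_protnone() and
arch_handle_protnone_fault() are hypothetical names for the sake of the
sketch:

	static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
	{
		vm_fault_t ret;

		/* Let the architecture claim the fault first. */
		ret = arch_handle_protnone_fault(vmf);
		if (ret != VM_FAULT_FALLBACK)
			return ret;

		/* Nothing arch-specific pending, so it was a NUMA hinting fault. */
		return do_numa_page(vmf);
	}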

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-11-19 16:57 ` [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory Alexandru Elisei
@ 2023-11-29  8:44   ` Hyesoo Yu
  2023-11-30 11:56     ` Alexandru Elisei
  2023-12-03 12:14     ` Alexandru Elisei
  2023-12-11 17:29   ` Rob Herring
  1 sibling, 2 replies; 98+ messages in thread
From: Hyesoo Yu @ 2023-11-29  8:44 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel


Hello.

On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> Allow the kernel to get the size and location of the MTE tag storage
> regions from the DTB. This memory is marked as reserved for now.
> 
> The DTB node for the tag storage region is defined as:
> 
>         tags0: tag-storage@8f8000000 {
>                 compatible = "arm,mte-tag-storage";
>                 reg = <0x08 0xf8000000 0x00 0x4000000>;
>                 block-size = <0x1000>;
>                 memory = <&memory0>;	// Associated tagged memory node
>         };
>

How about using compatible = "shared-dma-pool" like below?

&reserved_memory {
	tags0: tag0@8f8000000 {
		compatible = "arm,mte-tag-storage";
        	reg = <0x08 0xf8000000 0x00 0x4000000>;
	};
}

tag-storage {
        compatible = "arm,mte-tag-storage";
	memory-region = <&tag>;
        memory = <&memory0>;
	block-size = <0x1000>;
}

And then, the activation of CMA would be performed in the CMA code.
We can just get the region information from the memory-region and allocate it
directly with alloc_contig_range() or take_page_off_buddy(). It seems like we
could remove a lot of code.
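
A sketch of how the kernel could pick such a node up through the existing
reserved-memory machinery (the init function body here is hypothetical):

	#include <linux/of_reserved_mem.h>

	static int __init tag_storage_rmem_init(struct reserved_mem *rmem)
	{
		/* Record rmem->base/rmem->size for later CMA activation. */
		return 0;
	}
	RESERVEDMEM_OF_DECLARE(mte_tag_storage, "arm,mte-tag-storage",
			       tag_storage_rmem_init);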

> The tag storage region represents the largest contiguous memory region that
> holds all the tags for the associated contiguous memory region which can be
> tagged. For example, for a 32GB contiguous tagged memory the corresponding
> tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> tag storage memory.
> 
> "block-size" represents the minimum multiple of 4K of tag storage where all
> the tags stored in the block correspond to a contiguous memory region. This
> is needed for platforms where the memory controller interleaves tag writes
> to memory. For example, if the memory controller interleaves tag writes for
> 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> then the correct value for "block-size" is 0x2000. This value is a hardware
> property, independent of the selected kernel page size.
>

Is this considered for other kernel page sizes, like 16K or 64K pages? The
comment says it should be a multiple of 4K, but more accurately it should be a
multiple of the page size. Please let me know if there's anything I
misunderstood. :-)
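
For reference, working through get_block_size_pages() below: with 4K pages
and block-size = 0x2000 the GCD is 0x1000, so it returns
0x1000 * 0x2000 / 0x1000 = 0x2000 bytes = 2 pages; with 64K pages and the
same block-size the GCD is 0x2000, giving 0x10000 * 0x2000 / 0x2000 =
0x10000 bytes = 1 page. So the helper does seem to normalise the byte value
against the runtime page size, even though the commit message only mentions
4K.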


> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/Kconfig                       |  12 ++
>  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                |   7 +
>  5 files changed, 291 insertions(+)
>  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b071a00425d..fe8276fdc7a8 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2062,6 +2062,18 @@ config ARM64_MTE
>  
>  	  Documentation/arch/arm64/memory-tagging-extension.rst.
>  
> +if ARM64_MTE
> +config ARM64_MTE_TAG_STORAGE
> +	bool "Dynamic MTE tag storage management"
> +	help
> +	  Adds support for dynamic management of the memory used by the hardware
> +	  for storing MTE tags. This memory, unlike normal memory, cannot be
> +	  tagged. When it is used to store tags for another memory location it
> +	  cannot be used for any type of allocation.
> +
> +	  If unsure, say N
> +endif # ARM64_MTE
> +
>  endmenu # "ARMv8.5 architectural features"
>  
>  menu "ARMv8.7 architectural features"
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> new file mode 100644
> index 000000000000..8f86c4f9a7c3
> --- /dev/null
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +#ifndef __ASM_MTE_TAG_STORAGE_H
> +#define __ASM_MTE_TAG_STORAGE_H
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +void mte_tag_storage_init(void);
> +#else
> +static inline void mte_tag_storage_init(void)
> +{
> +}
> +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> +#endif /* __ASM_MTE_TAG_STORAGE_H  */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index d95b3d6b471a..5f031bf9f8f1 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
>  obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
>  obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
>  obj-$(CONFIG_ARM64_MTE)			+= mte.o
> +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)	+= mte_tag_storage.o
>  obj-y					+= vdso-wrap.o
>  obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
>  obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> new file mode 100644
> index 000000000000..fa6267ef8392
> --- /dev/null
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -0,0 +1,256 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Support for dynamic tag storage.
> + *
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/memblock.h>
> +#include <linux/mm.h>
> +#include <linux/of_device.h>
> +#include <linux/of_fdt.h>
> +#include <linux/range.h>
> +#include <linux/string.h>
> +#include <linux/xarray.h>
> +
> +#include <asm/mte_tag_storage.h>
> +
> +struct tag_region {
> +	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
> +	struct range tag_range;	/* Tag storage memory, in PFNs. */
> +	u32 block_size;		/* Tag block size, in pages. */
> +};
> +
> +#define MAX_TAG_REGIONS	32
> +
> +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> +static int num_tag_regions;
> +
> +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> +						int reg_len, struct range *range)
> +{
> +	int addr_cells = dt_root_addr_cells;
> +	int size_cells = dt_root_size_cells;
> +	u64 size;
> +
> +	if (reg_len / 4 > addr_cells + size_cells)
> +		return -EINVAL;
> +
> +	range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> +	size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> +	if (size == 0) {
> +		pr_err("Invalid node");
> +		return -EINVAL;
> +	}
> +	range->end = range->start + size - 1;
> +
> +	return 0;
> +}
> +
> +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> +						    struct range *tag_range)
> +{
> +	const __be32 *reg;
> +	int reg_len;
> +
> +	reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> +	if (reg == NULL) {
> +		pr_err("Invalid metadata node");
> +		return -EINVAL;
> +	}
> +
> +	return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> +}
> +
> +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> +{
> +	const __be32 *reg;
> +	int reg_len;
> +
> +	reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> +	if (reg == NULL)
> +		reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> +
> +	if (reg == NULL) {
> +		pr_err("Invalid memory node");
> +		return -EINVAL;
> +	}
> +
> +	return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> +}
> +
> +struct find_memory_node_arg {
> +	unsigned long node;
> +	u32 phandle;
> +};
> +
> +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> +				       int depth, void *data)
> +{
> +	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> +	struct find_memory_node_arg *arg = data;
> +
> +	if (depth != 1 || !type || strcmp(type, "memory") != 0)
> +		return 0;
> +
> +	if (of_get_flat_dt_phandle(node) == arg->phandle) {
> +		arg->node = node;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> +{
> +	struct find_memory_node_arg arg = { 0 };
> +	const __be32 *memory_prop;
> +	u32 mem_phandle;
> +	int ret, reg_len;
> +
> +	memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> +	if (!memory_prop) {
> +		pr_err("Missing 'memory' property in the tag storage node");
> +		return -EINVAL;
> +	}
> +
> +	mem_phandle = be32_to_cpup(memory_prop);
> +	arg.phandle = mem_phandle;
> +
> +	ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> +	if (ret != 1) {
> +		pr_err("Associated memory node not found");
> +		return -EINVAL;
> +	}
> +
> +	*mem_node = arg.node;
> +
> +	return 0;
> +}
> +
> +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> +					       u32 *retval)
> +{
> +	const __be32 *reg;
> +
> +	reg = of_get_flat_dt_prop(node, propname, NULL);
> +	if (!reg)
> +		return -EINVAL;
> +
> +	*retval = be32_to_cpup(reg);
> +	return 0;
> +}
> +
> +static u32 __init get_block_size_pages(u32 block_size_bytes)
> +{
> +	u32 a = PAGE_SIZE;
> +	u32 b = block_size_bytes;
> +	u32 r;
> +
> +	/* Find greatest common divisor using the Euclidean algorithm. */
> +	do {
> +		r = a % b;
> +		a = b;
> +		b = r;
> +	} while (b != 0);
> +
> +	return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> +}
> +
> +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> +				       int depth, void *data)
> +{
> +	struct tag_region *region;
> +	unsigned long mem_node;
> +	struct range *mem_range;
> +	struct range *tag_range;
> +	u32 block_size_bytes;
> +	u32 nid = 0;
> +	int ret;
> +
> +	if (depth != 1 || !strstr(uname, "tag-storage"))
> +		return 0;
> +
> +	if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> +		return 0;
> +
> +	if (num_tag_regions == MAX_TAG_REGIONS) {
> +		pr_err("Maximum number of tag storage regions exceeded");
> +		return -EINVAL;
> +	}
> +
> +	region = &tag_regions[num_tag_regions];
> +	mem_range = &region->mem_range;
> +	tag_range = &region->tag_range;
> +
> +	ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> +	if (ret) {
> +		pr_err("Invalid tag storage node");
> +		return ret;
> +	}
> +
> +	ret = tag_storage_get_memory_node(node, &mem_node);
> +	if (ret)
> +		return ret;
> +
> +	ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> +	if (ret) {
> +		pr_err("Invalid address for associated data memory node");
> +		return ret;
> +	}
> +
> +	/* The tag region must exactly match the corresponding memory. */
> +	if (range_len(tag_range) * 32 != range_len(mem_range)) {
> +		pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> +		       PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> +		return -EINVAL;
> +	}
> +
> +	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> +	if (ret || block_size_bytes == 0) {
> +		pr_err("Invalid or missing 'block-size' property");
> +		return -EINVAL;
> +	}
> +	region->block_size = get_block_size_pages(block_size_bytes);
> +	if (range_len(tag_range) % region->block_size != 0) {
> +		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> +		       PFN_PHYS(range_len(tag_range)), region->block_size);
> +		return -EINVAL;
> +	}
> +

I was confused by the variable "block_size". The block size declared in the device tree is
in bytes, but the actual block size used is in pages. I think the name "block_size" can cause
confusion, as it might be interpreted as bytes. If possible, I suggest changing it to something
more descriptive, such as "block_nr_pages" (this is just an example!).
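
For example (just a sketch, and the field name is only a suggestion):

	struct tag_region {
		struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
		struct range tag_range;	/* Tag storage memory, in PFNs. */
		u32 block_nr_pages;	/* Tag block size, in pages. */
	};

That way the unit is visible at every use site.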

Thanks,
Regards.

> +	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> +	if (ret)
> +		nid = numa_node_id();
> +
> +	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> +				nid, MEMBLOCK_NONE);
> +	if (ret) {
> +		pr_err("Error adding tag memblock (%d)", ret);
> +		return ret;
> +	}
> +	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> +
> +	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> +		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> +
> +	num_tag_regions++;
> +
> +	return 0;
> +}
> +
> +void __init mte_tag_storage_init(void)
> +{
> +	struct range *tag_range;
> +	int i, ret;
> +
> +	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> +	if (ret) {
> +		for (i = 0; i < num_tag_regions; i++) {
> +			tag_range = &tag_regions[i].tag_range;
> +			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> +		}
> +		num_tag_regions = 0;
> +		pr_info("MTE tag storage region management disabled");
> +	}
> +}
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 417a8a86b2db..1b77138c1aa5 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -42,6 +42,7 @@
>  #include <asm/cpufeature.h>
>  #include <asm/cpu_ops.h>
>  #include <asm/kasan.h>
> +#include <asm/mte_tag_storage.h>
>  #include <asm/numa.h>
>  #include <asm/scs.h>
>  #include <asm/sections.h>
> @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
>  			   FW_BUG "Booted with MMU enabled!");
>  	}
>  
> +	/*
> +	 * Must be called before memory limits are enforced by
> +	 * arm64_memblock_init().
> +	 */
> +	mte_tag_storage_init();
> +
>  	arm64_memblock_init();
>  
>  	paging_init();
> -- 
> 2.42.1
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-19 16:57 ` [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
  2023-11-24 19:56   ` David Hildenbrand
@ 2023-11-29  8:57   ` Hyesoo Yu
  2023-11-30 12:00     ` Alexandru Elisei
  1 sibling, 1 reply; 98+ messages in thread
From: Hyesoo Yu @ 2023-11-29  8:57 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> alloc_contig_range() requires that the requested pages are in the same
> zone. Check that this is indeed the case before initializing the tag
> storage blocks.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index 8b9bedf7575d..fd63430d4dc0 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
>  	}
>  }
>  
> +/* alloc_contig_range() requires all pages to be in the same zone. */
> +static int __init mte_tag_storage_check_zone(void)
> +{
> +	struct range *tag_range;
> +	struct zone *zone;
> +	unsigned long pfn;
> +	u32 block_size;
> +	int i, j;
> +
> +	for (i = 0; i < num_tag_regions; i++) {
> +		block_size = tag_regions[i].block_size;
> +		if (block_size == 1)
> +			continue;
> +
> +		tag_range = &tag_regions[i].tag_range;
> +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> +			zone = page_zone(pfn_to_page(pfn));

Hello.

Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
lookup out of the loop?
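
Something like this is what I had in mind (untested sketch, assuming the whole region is
expected to sit in a single zone):

	tag_range = &tag_regions[i].tag_range;
	zone = page_zone(pfn_to_page(tag_range->start));
	for (pfn = tag_range->start + 1; pfn <= tag_range->end; pfn++) {
		if (page_zone(pfn_to_page(pfn)) != zone) {
			pr_err("Tag storage block pages in different zones");
			return -EINVAL;
		}
	}

That is one page_zone() lookup per region instead of one per block.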

Thanks,
Regards.

> +			for (j = 1; j < block_size; j++) {
> +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> +					pr_err("Tag storage block pages in different zones");
> +					return -EINVAL;
> +				}
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int __init mte_tag_storage_activate_regions(void)
>  {
>  	phys_addr_t dram_start, dram_end;
> @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
>  		goto out_disabled;
>  	}
>  
> +	ret = mte_tag_storage_check_zone();
> +	if (ret)
> +		goto out_disabled;
> +
>  	for (i = 0; i < num_tag_regions; i++) {
>  		tag_range = &tag_regions[i].tag_range;
>  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> -- 
> 2.42.1
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation
  2023-11-19 16:57 ` [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation Alexandru Elisei
@ 2023-11-29  9:10   ` Hyesoo Yu
  2023-11-29 13:33     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Hyesoo Yu @ 2023-11-29  9:10 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 16896 bytes --]

On Sun, Nov 19, 2023 at 04:57:10PM +0000, Alexandru Elisei wrote:
> Reserve tag storage for a tagged page by migrating the contents of the tag
> storage (if in use for data) and removing the tag storage pages from the
> page allocator by calling alloc_contig_range().
> 
> When all the associated tagged pages have been freed, return the tag
> storage pages back to the page allocator, where they can be used again for
> data allocations.
> 
> Tag storage pages cannot be tagged, so disallow allocations from
> MIGRATE_CMA when the allocation is tagged.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/include/asm/mte.h             |  16 +-
>  arch/arm64/include/asm/mte_tag_storage.h |  45 +++++
>  arch/arm64/include/asm/pgtable.h         |  27 +++
>  arch/arm64/kernel/mte_tag_storage.c      | 241 +++++++++++++++++++++++
>  fs/proc/page.c                           |   1 +
>  include/linux/kernel-page-flags.h        |   1 +
>  include/linux/page-flags.h               |   1 +
>  include/trace/events/mmflags.h           |   3 +-
>  mm/huge_memory.c                         |   1 +
>  9 files changed, 333 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 8034695b3dd7..6457b7899207 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
>  #ifdef CONFIG_ARM64_MTE
>  
>  /* track which pages have valid allocation tags */
> -#define PG_mte_tagged	PG_arch_2
> +#define PG_mte_tagged		PG_arch_2
>  /* simple lock to avoid multiple threads tagging the same page */
> -#define PG_mte_lock	PG_arch_3
> +#define PG_mte_lock		PG_arch_3
> +/* Track if a tagged page has tag storage reserved */
> +#define PG_tag_storage_reserved	PG_arch_4
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +extern bool page_tag_storage_reserved(struct page *page);
> +#endif
>  
>  static inline void set_page_mte_tagged(struct page *page)
>  {
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +	/* Open code mte_tag_storage_enabled() */
> +	WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
> +		     !page_tag_storage_reserved(page));
> +#endif
>  	/*
>  	 * Ensure that the tags written prior to this function are visible
>  	 * before the page flags update.
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 8f86c4f9a7c3..cab033b184ab 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -5,11 +5,56 @@
>  #ifndef __ASM_MTE_TAG_STORAGE_H
>  #define __ASM_MTE_TAG_STORAGE_H
>  
> +#ifndef __ASSEMBLY__
> +
> +#include <linux/mm_types.h>
> +
> +#include <asm/mte.h>
> +
>  #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +
> +static inline bool tag_storage_enabled(void)
> +{
> +	return static_branch_likely(&tag_storage_enabled_key);
> +}
> +
> +static inline bool alloc_requires_tag_storage(gfp_t gfp)
> +{
> +	return gfp & __GFP_TAGGED;
> +}
> +
>  void mte_tag_storage_init(void);
> +
> +int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> +void free_tag_storage(struct page *page, int order);
> +
> +bool page_tag_storage_reserved(struct page *page);
>  #else
> +static inline bool tag_storage_enabled(void)
> +{
> +	return false;
> +}
> +static inline bool alloc_requires_tag_storage(struct page *page)
> +{
> +	return false;
> +}
>  static inline void mte_tag_storage_init(void)
>  {
>  }
> +static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> +{
> +	return 0;
> +}
> +static inline void free_tag_storage(struct page *page, int order)
> +{
> +}
> +static inline bool page_tag_storage_reserved(struct page *page)
> +{
> +	return true;
> +}
>  #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> +
> +#endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_MTE_TAG_STORAGE_H  */
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index cd5dacd1be3a..20e8de853f5d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -10,6 +10,7 @@
>  
>  #include <asm/memory.h>
>  #include <asm/mte.h>
> +#include <asm/mte_tag_storage.h>
>  #include <asm/pgtable-hwdef.h>
>  #include <asm/pgtable-prot.h>
>  #include <asm/tlbflush.h>
> @@ -1063,6 +1064,32 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>  		mte_restore_page_tags_by_swp_entry(entry, &folio->page);
>  }
>  
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +#define __HAVE_ARCH_PREP_NEW_PAGE
> +static inline int arch_prep_new_page(struct page *page, int order, gfp_t gfp)
> +{
> +	if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
> +		return reserve_tag_storage(page, order, gfp);
> +	return 0;
> +}
> +
> +#define __HAVE_ARCH_FREE_PAGES_PREPARE
> +static inline void arch_free_pages_prepare(struct page *page, int order)
> +{
> +	if (tag_storage_enabled() && page_mte_tagged(page))
> +		free_tag_storage(page, order);
> +}
> +
> +#define __HAVE_ARCH_ALLOC_CMA
> +static inline bool arch_alloc_cma(gfp_t gfp_mask)
> +{
> +	if (tag_storage_enabled() && alloc_requires_tag_storage(gfp_mask))
> +		return false;
> +	return true;
> +}
> +
> +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
>  #endif /* CONFIG_ARM64_MTE */
>  
>  #define __HAVE_ARCH_CALC_VMA_GFP
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index fd63430d4dc0..9f8ef3116fc3 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -11,12 +11,18 @@
>  #include <linux/of_device.h>
>  #include <linux/of_fdt.h>
>  #include <linux/pageblock-flags.h>
> +#include <linux/page-flags.h>
> +#include <linux/page_owner.h>
>  #include <linux/range.h>
> +#include <linux/sched/mm.h>
>  #include <linux/string.h>
> +#include <linux/vm_event_item.h>
>  #include <linux/xarray.h>
>  
>  #include <asm/mte_tag_storage.h>
>  
> +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +
>  struct tag_region {
>  	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
>  	struct range tag_range;	/* Tag storage memory, in PFNs. */
> @@ -28,6 +34,31 @@ struct tag_region {
>  static struct tag_region tag_regions[MAX_TAG_REGIONS];
>  static int num_tag_regions;
>  
> +/*
> + * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
> + * because alloc_contig_range() might sleep.
> + *
> + * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
> + * because pages can be freed from non-preemptible contexts, including from an
> + * interrupt handler.
> + *
> + * Because tag storage can be freed from interrupt contexts, the xarray is
> + * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
> + * xa_store(). This is done to prevent a deadlock with free_tag_storage() being
> + * called from an interrupt raised before xa_store() releases the xa_lock.
> + *
> + * All of the above means that reserve_tag_storage() cannot run concurrently
> + * with itself (no concurrent insertions), but it can run at the same time as
> + * free_tag_storage(). The first thing that reserve_tag_storage() does after
> + * taking the mutex is increase the refcount on all present tag storage blocks
> + * with the xa_lock held, to serialize against freeing the blocks. This is an
> + * optimization to avoid taking and releasing the xa_lock after each iteration
> + * if the refcount operation was moved inside the loop, where it would have had
> + * to be executed for each block.
> + */
> +static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
> +static DEFINE_MUTEX(tag_blocks_lock);
> +
>  static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
>  						int reg_len, struct range *range)
>  {
> @@ -368,3 +399,213 @@ static int __init mte_tag_storage_activate_regions(void)
>  	return ret;
>  }
>  arch_initcall(mte_tag_storage_activate_regions);
> +
> +static void page_set_tag_storage_reserved(struct page *page, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++)
> +		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
> +}
> +
> +static void block_ref_add(unsigned long block, struct tag_region *region, int order)
> +{
> +	int count;
> +
> +	count = min(1u << order, 32 * region->block_size);
> +	page_ref_add(pfn_to_page(block), count);
> +}
> +
> +static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
> +{
> +	int count;
> +
> +	count = min(1u << order, 32 * region->block_size);
> +	return page_ref_sub_return(pfn_to_page(block), count);
> +}
> +
> +static bool tag_storage_block_is_reserved(unsigned long block)
> +{
> +	return xa_load(&tag_blocks_reserved, block) != NULL;
> +}
> +
> +static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
> +{
> +	int ret;
> +
> +	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
> +	if (!ret)
> +		block_ref_add(block, region, order);
> +
> +	return ret;
> +}
> +
> +static int order_to_num_blocks(int order)
> +{
> +	return max((1 << order) / 32, 1);
> +}
> +
> +static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
> +					    struct tag_region *region)
> +{
> +	struct range *tag_range = &region->tag_range;
> +	struct range *mem_range = &region->mem_range;
> +	u64 page_pfn = page_to_pfn(page);
> +	u64 block, block_offset;
> +
> +	if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
> +		return -ERANGE;
> +
> +	block_offset = (page_pfn - mem_range->start) / 32;
> +	block = tag_range->start + rounddown(block_offset, region->block_size);
> +
> +	if (block + region->block_size - 1 > tag_range->end) {
> +		pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
> +			PFN_PHYS(block), PFN_PHYS(block + region->block_size),
> +			PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> +		return -ERANGE;
> +	}
> +	*blockp = block;
> +
> +	return 0;
> +
> +}
> +
> +static int tag_storage_find_block(struct page *page, unsigned long *block,
> +				  struct tag_region **region)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < num_tag_regions; i++) {
> +		ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
> +		if (ret == 0) {
> +			*region = &tag_regions[i];
> +			return 0;
> +		}
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +bool page_tag_storage_reserved(struct page *page)
> +{
> +	return test_bit(PG_tag_storage_reserved, &page->flags);
> +}
> +
> +int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> +{
> +	unsigned long start_block, end_block;
> +	struct tag_region *region;
> +	unsigned long block;
> +	unsigned long flags;
> +	unsigned int tries;
> +	int ret = 0;
> +
> +	VM_WARN_ON_ONCE(!preemptible());
> +
> +	if (page_tag_storage_reserved(page))
> +		return 0;
> +
> +	/*
> +	 * __alloc_contig_migrate_range() ignores gfp when allocating the
> +	 * destination page for migration. Regardless, massage gfp flags and
> +	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
> +	 * ignored.
> +	 */
> +	gfp &= ~__GFP_TAGGED;
> +	if (!(gfp & __GFP_NORETRY))
> +		gfp |= __GFP_RETRY_MAYFAIL;
> +
> +	ret = tag_storage_find_block(page, &start_block, &region);
> +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> +		return 0;
> +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> +

Hello.

If the page size is 4K, the block size is 2 pages (a "block-size" of 8K bytes), and the order
is 6, then we need 2 pages for the tags. However, according to the equation, order_to_num_blocks
is 2 and block_size is also 2, so end_block will be incremented by 4 pages.

But we actually only need 8K of tag storage for 256K of data, right?
Could you explain order_to_num_blocks * region->block_size in more detail?
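
To spell out the arithmetic for that example (4K pages, block_size = 2 pages, order = 6):

	data to tag:            (1 << 6) * 4K         = 256K
	tag storage needed:     256K / 32 = 8K        = 2 pages = 1 block
	order_to_num_blocks(6): max((1 << 6) / 32, 1) = 2
	range reserved:         2 * block_size        = 4 pages = 2 blocks

So the loop appears to reserve twice the tag storage that the data actually needs.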

Thanks,
Regards.

> +	mutex_lock(&tag_blocks_lock);
> +
> +	/* Check again, this time with the lock held. */
> +	if (page_tag_storage_reserved(page))
> +		goto out_unlock;
> +
> +	/* Make sure existing entries are not freed from under our feet. */
> +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> +	for (block = start_block; block < end_block; block += region->block_size) {
> +		if (tag_storage_block_is_reserved(block))
> +			block_ref_add(block, region, order);
> +	}
> +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +
> +	for (block = start_block; block < end_block; block += region->block_size) {
> +		/* Refcount incremented above. */
> +		if (tag_storage_block_is_reserved(block))
> +			continue;
> +
> +		tries = 3;
> +		while (tries--) {
> +			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
> +			if (ret == 0 || ret != -EBUSY)
> +				break;
> +		}
> +
> +		if (ret)
> +			goto out_error;
> +
> +		ret = tag_storage_reserve_block(block, region, order);
> +		if (ret) {
> +			free_contig_range(block, region->block_size);
> +			goto out_error;
> +		}
> +
> +		count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
> +	}
> +
> +	page_set_tag_storage_reserved(page, order);
> +out_unlock:
> +	mutex_unlock(&tag_blocks_lock);
> +
> +	return 0;
> +
> +out_error:
> +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> +	for (block = start_block; block < end_block; block += region->block_size) {
> +		if (tag_storage_block_is_reserved(block) &&
> +		    block_ref_sub_return(block, region, order) == 1) {
> +			__xa_erase(&tag_blocks_reserved, block);
> +			free_contig_range(block, region->block_size);
> +		}
> +	}
> +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +
> +	mutex_unlock(&tag_blocks_lock);
> +
> +	count_vm_events(CMA_ALLOC_FAIL, region->block_size);
> +
> +	return ret;
> +}
> +
> +void free_tag_storage(struct page *page, int order)
> +{
> +	unsigned long block, start_block, end_block;
> +	struct tag_region *region;
> +	unsigned long flags;
> +	int ret;
> +
> +	ret = tag_storage_find_block(page, &start_block, &region);
> +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> +		return;
> +
> +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> +
> +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> +	for (block = start_block; block < end_block; block += region->block_size) {
> +		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
> +		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
> +			continue;
> +
> +		if (block_ref_sub_return(block, region, order) == 1) {
> +			__xa_erase(&tag_blocks_reserved, block);
> +			free_contig_range(block, region->block_size);
> +		}
> +	}
> +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +}
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 195b077c0fac..e7eb584a9234 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
>  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>  	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
>  	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
> +	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
>  #endif
>  
>  	return u;
> diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
> index 859f4b0c1b2b..4a0d719ffdd4 100644
> --- a/include/linux/kernel-page-flags.h
> +++ b/include/linux/kernel-page-flags.h
> @@ -19,5 +19,6 @@
>  #define KPF_SOFTDIRTY		40
>  #define KPF_ARCH_2		41
>  #define KPF_ARCH_3		42
> +#define KPF_ARCH_4		43
>  
>  #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a88e64acebfe..7915165a51bd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,7 @@ enum pageflags {
>  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>  	PG_arch_2,
>  	PG_arch_3,
> +	PG_arch_4,
>  #endif
>  	__NR_PAGEFLAGS,
>  
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 6ca0d5ed46c0..ba962fd10a2c 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
>  IF_HAVE_PG_IDLE(idle)							\
>  IF_HAVE_PG_IDLE(young)							\
>  IF_HAVE_PG_ARCH_X(arch_2)						\
> -IF_HAVE_PG_ARCH_X(arch_3)
> +IF_HAVE_PG_ARCH_X(arch_3)						\
> +IF_HAVE_PG_ARCH_X(arch_4)
>  
>  #define show_page_flags(flags)						\
>  	(flags) ? __print_flags(flags, "|",				\
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f31f02472396..9beead961a65 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2474,6 +2474,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
>  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>  			 (1L << PG_arch_2) |
>  			 (1L << PG_arch_3) |
> +			 (1L << PG_arch_4) |
>  #endif
>  			 (1L << PG_dirty) |
>  			 LRU_GEN_MASK | LRU_REFS_MASK));
> -- 
> 2.42.1
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-19 16:57 ` [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE) Alexandru Elisei
  2023-11-28 17:55   ` David Hildenbrand
@ 2023-11-29  9:27   ` Hyesoo Yu
  2023-11-30 12:06     ` Alexandru Elisei
  1 sibling, 1 reply; 98+ messages in thread
From: Hyesoo Yu @ 2023-11-29  9:27 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 15131 bytes --]

On Sun, Nov 19, 2023 at 04:57:13PM +0000, Alexandru Elisei wrote:
> To enable tagging on a memory range, userspace can use mprotect() with the
> PROT_MTE access flag. Pages already mapped in the VMA don't have the
> associated tag storage block reserved, so mark the PTEs as
> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> reserve the tag storage on the fault path.
> 
> This has several benefits over reserving the tag storage as part of the
> mprotect() call handling:
> 
> - Tag storage is reserved only for those pages in the VMA that are
>   accessed, instead of for all the pages already mapped in the VMA.
> - Reduces the latency of the mprotect() call.
> - Eliminates races with page migration.
> 
> But all of this is at the expense of an extra page fault per page until the
> pages being accessed all have their corresponding tag storage reserved.
> 
> For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
> page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
> of the PBHA bits in entries from the last level of the translation table
> and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
> 59, is already being used as a software bit for PMD_PRESENT_INVALID.
> 
> This is only implemented for PTE mappings; PMD mappings will follow.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/Kconfig                       |   1 +
>  arch/arm64/include/asm/mte.h             |   4 +-
>  arch/arm64/include/asm/mte_tag_storage.h |   2 +
>  arch/arm64/include/asm/pgtable-prot.h    |   2 +
>  arch/arm64/include/asm/pgtable.h         |  40 ++++++---
>  arch/arm64/kernel/mte.c                  |  12 ++-
>  arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
>  include/linux/pgtable.h                  |  17 ++++
>  mm/Kconfig                               |   3 +
>  mm/memory.c                              |   3 +
>  10 files changed, 170 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index efa5b7958169..3b9c435eaafb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2066,6 +2066,7 @@ if ARM64_MTE
>  config ARM64_MTE_TAG_STORAGE
>  	bool "Dynamic MTE tag storage management"
>  	depends on ARCH_KEEP_MEMBLOCK
> +	select ARCH_HAS_FAULT_ON_ACCESS
>  	select CONFIG_CMA
>  	help
>  	  Adds support for dynamic management of the memory used by the hardware
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 6457b7899207..70dc2e409070 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  }
>  
>  void mte_zero_clear_page_tags(void *addr);
> -void mte_sync_tags(pte_t pte, unsigned int nr_pages);
> +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
>  void mte_copy_page_tags(void *kto, const void *kfrom);
>  void mte_thread_init_user(void);
>  void mte_thread_switch(struct task_struct *next);
> @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  static inline void mte_zero_clear_page_tags(void *addr)
>  {
>  }
> -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
>  {
>  }
>  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 6e5d28e607bb..c70ced60a0cd 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
>  void free_tag_storage(struct page *page, int order);
>  
>  bool page_tag_storage_reserved(struct page *page);
> +
> +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
>  #else
>  static inline bool tag_storage_enabled(void)
>  {
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index e9624f6326dd..85ebb3e352ad 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -19,6 +19,7 @@
>  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
>  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
>  
>  /*
>   * This bit indicates that the entry is present i.e. pmd_page()
> @@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
>  	 })
>  
>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> +#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>  /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
>  #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
>  #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 20e8de853f5d..8cc135f1c112 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
>  		     __func__, pte_val(old_pte), pte_val(pte));
>  }
>  
> -static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> +static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
>  {
> -	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> -		__sync_icache_dcache(pte);
> +	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
> +		__sync_icache_dcache(*pteval);
>  
>  	/*
>  	 * If the PTE would provide user space access to the tags associated
> @@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
>  	 * pte_access_permitted() returns false for exec only mappings, they
>  	 * don't expose tags (instruction fetches don't check tags).
>  	 */
> -	if (system_supports_mte() && pte_access_permitted(pte, false) &&
> -	    !pte_special(pte) && pte_tagged(pte))
> -		mte_sync_tags(pte, nr_pages);
> +	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
> +	    !pte_special(*pteval) && pte_tagged(*pteval))
> +		mte_sync_tags(pteval, nr_pages);
>  }
>  
>  static inline void set_ptes(struct mm_struct *mm,
> @@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
>  			    pte_t *ptep, pte_t pte, unsigned int nr)
>  {
>  	page_table_check_ptes_set(mm, ptep, pte, nr);
> -	__sync_cache_and_tags(pte, nr);
> +	__sync_cache_and_tags(&pte, nr);
>  
>  	for (;;) {
>  		__check_safe_pte_update(mm, ptep, pte);
> @@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> +static inline bool fault_on_access_pte(pte_t pte)
> +{
> +	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
> +		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
> +}
> +
> +static inline bool fault_on_access_pmd(pmd_t pmd)
> +{
> +	return fault_on_access_pte(pmd_pte(pmd));
> +}
> +
> +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> +{
> +	if (tag_storage_enabled())
> +		return handle_page_missing_tag_storage(vmf);
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> +
>  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
>  
>  static inline int pmd_present(pmd_t pmd)
> @@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
>  				unsigned long __always_unused addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> -	__sync_cache_and_tags(pte, nr);
> +	__sync_cache_and_tags(&pte, nr);
>  	__check_safe_pte_update(mm, ptep, pte);
>  	set_pte(ptep, pte);
>  }
> @@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
>  	 */
>  	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> -			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
> -			      PTE_ATTRINDX_MASK;
> +			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
> +			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
>  	/* preserve the hardware dirty information */
>  	if (pte_hw_dirty(pte))
>  		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index a41ef3213e1e..5962bab1d549 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -21,6 +21,7 @@
>  #include <asm/barrier.h>
>  #include <asm/cpufeature.h>
>  #include <asm/mte.h>
> +#include <asm/mte_tag_storage.h>
>  #include <asm/ptrace.h>
>  #include <asm/sysreg.h>
>  
> @@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
>  EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
>  #endif
>  
> -void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
>  {
> -	struct page *page = pte_page(pte);
> +	struct page *page = pte_page(*pteval);
>  	unsigned int i;
>  
> -	/* if PG_mte_tagged is set, tags have already been initialised */
>  	for (i = 0; i < nr_pages; i++, page++) {
> +		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
> +			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
> +			continue;
> +		}
> +
> +		/* if PG_mte_tagged is set, tags have already been initialised */
>  		if (try_page_mte_tagging(page)) {
>  			mte_clear_page_tags(page_address(page));
>  			set_page_mte_tagged(page);
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index acbc7530d2b2..f5fa583acf18 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -19,6 +19,7 @@
>  #include <linux/kprobes.h>
>  #include <linux/uaccess.h>
>  #include <linux/page-flags.h>
> +#include <linux/page-isolation.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/debug.h>
>  #include <linux/highmem.h>
> @@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
>  	mte_zero_clear_page_tags(page_address(page));
>  	set_page_mte_tagged(page);
>  }
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct page *page = NULL;
> +	pte_t new_pte, old_pte;
> +	bool writable = false;
> +	vm_fault_t err;
> +	int ret;
> +
> +	spin_lock(vmf->ptl);
> +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return 0;
> +	}
> +
> +	/* Get the normal PTE  */
> +	old_pte = ptep_get(vmf->pte);
> +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> +
> +	/*
> +	 * Detect now whether the PTE could be writable; this information
> +	 * is only valid while holding the PT lock.
> +	 */
> +	writable = pte_write(new_pte);
> +	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> +	    can_change_pte_writable(vma, vmf->address, new_pte))
> +		writable = true;
> +
> +	page = vm_normal_page(vma, vmf->address, new_pte);
> +	if (!page || is_zone_device_page(page))
> +		goto out_map;
> +
> +	/*
> +	 * This should never happen: once a VMA has been marked as tagged,
> +	 * that cannot be changed.
> +	 */
> +	if (!(vma->vm_flags & VM_MTE))
> +		goto out_map;
> +
> +	/* Prevent the page from being unmapped from under us. */
> +	get_page(page);
> +	vma_set_access_pid_bit(vma);
> +
> +	/*
> +	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
> +	 * and spin_lock() above which takes the ptl lock. Both locks should be
> +	 * balanced after this point.
> +	 */
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	/*
> +	 * Probably the page is being isolated for migration, replay the fault
> +	 * to give time for the entry to be replaced by a migration pte.
> +	 */
> +	if (unlikely(is_migrate_isolate_page(page)))
> +		goto out_retry;
> +
> +	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
> +	if (ret)
> +		goto out_retry;
> +
> +	put_page(page);
> +
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return 0;
> +	}
> +
> +out_map:
> +	/*
> +	 * Make it present again, depending on how arch implements
> +	 * non-accessible ptes, some can allow access by kernel mode.
> +	 */
> +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> +	new_pte = pte_mkyoung(new_pte);
> +	if (writable)
> +		new_pte = pte_mkwrite(new_pte, vma);
> +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
> +	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	return 0;
> +
> +out_retry:
> +	put_page(page);
> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> +		vma_end_read(vma);
> +	if (fault_flag_allow_retry_first(vmf->flags)) {
> +		err = VM_FAULT_RETRY;
> +	} else {
> +		/* Replay the fault. */
> +		err = 0;

Hello!

Unfortunately, if the page stays pinned, it looks like the fault will keep being replayed
indefinitely. I suspect that could become a system stability issue (but I'm not familiar
with this area, so please let me know if I'm mistaken!).

How about migrating the data page itself when the migration problem repeats?
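
As a rough illustration of what I mean (hypothetical sketch; TAG_RESERVE_MAX_RETRIES and the
fallback are made up):

	/* Bound the replays instead of retrying forever. */
	for (tries = 0; tries < TAG_RESERVE_MAX_RETRIES; tries++) {
		if (!is_migrate_isolate_page(page) &&
		    reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE) == 0)
			break;
		cond_resched();
	}
	if (tries == TAG_RESERVE_MAX_RETRIES) {
		/* Give up and migrate the data page to a non-tagged region. */
	}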

Thanks,
Regards.

> +	}
> +	return err;
> +}
> +#endif
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index ffdb9b6bed6c..e2c761dd6c41 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> +static inline bool fault_on_access_pte(pte_t pte)
> +{
> +	return false;
> +}
> +
> +static inline bool fault_on_access_pmd(pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif
> +
>  #endif /* CONFIG_MMU */
>  
>  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 89971a894b60..a90eefc3ee80 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
>  config ARCH_HAS_CACHE_LINE_SIZE
>  	bool
>  
> +config ARCH_HAS_FAULT_ON_ACCESS
> +	bool
> +
>  config ARCH_HAS_CURRENT_STACK_POINTER
>  	bool
>  	help
> diff --git a/mm/memory.c b/mm/memory.c
> index e137f7673749..a04a971200b9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (!pte_present(vmf->orig_pte))
>  		return do_swap_page(vmf);
>  
> +	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> +		return arch_do_page_fault_on_access(vmf);
> +
>  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>  		return do_numa_page(vmf);
>  
> -- 
> 2.42.1
> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
  2023-11-28 17:03       ` David Hildenbrand
@ 2023-11-29 10:44         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 10:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 06:03:52PM +0100, David Hildenbrand wrote:
> On 27.11.23 16:01, Alexandru Elisei wrote:
> > Hi David,
> > 
> > On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
> > > > the page allocator to manage them like regular pages.
> > > > 
> > > > This migratetype lends the pages some very desirable properties:
> > > > 
> > > > * They cannot be longterm pinned, meaning they will always be migratable.
> > > > 
> > > > * The pages can be allocated explicitly by using their PFN (with
> > > >     alloc_contig_range()) when they are needed to store tags.
> > > > 
> > > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > > ---
> > > >    arch/arm64/Kconfig                  |  1 +
> > > >    arch/arm64/kernel/mte_tag_storage.c | 68 +++++++++++++++++++++++++++++
> > > >    include/linux/mmzone.h              |  5 +++
> > > >    mm/internal.h                       |  3 --
> > > >    4 files changed, 74 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index fe8276fdc7a8..047487046e8f 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > > >    if ARM64_MTE
> > > >    config ARM64_MTE_TAG_STORAGE
> > > >    	bool "Dynamic MTE tag storage management"
> > > > +	select CONFIG_CMA
> > > >    	help
> > > >    	  Adds support for dynamic management of the memory used by the hardware
> > > >    	  for storing MTE tags. This memory, unlike normal memory, cannot be
> > > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > > index fa6267ef8392..427f4f1909f3 100644
> > > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > > @@ -5,10 +5,12 @@
> > > >     * Copyright (C) 2023 ARM Ltd.
> > > >     */
> > > > +#include <linux/cma.h>
> > > >    #include <linux/memblock.h>
> > > >    #include <linux/mm.h>
> > > >    #include <linux/of_device.h>
> > > >    #include <linux/of_fdt.h>
> > > > +#include <linux/pageblock-flags.h>
> > > >    #include <linux/range.h>
> > > >    #include <linux/string.h>
> > > >    #include <linux/xarray.h>
> > > > @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > >    		return ret;
> > > >    	}
> > > > +	/* Pages are managed in pageblock_nr_pages chunks */
> > > > +	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
> > > > +		pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx",
> > > > +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > > +		       PFN_PHYS(pageblock_nr_pages));
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > >    	ret = tag_storage_get_memory_node(node, &mem_node);
> > > >    	if (ret)
> > > >    		return ret;
> > > > @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
> > > >    		pr_info("MTE tag storage region management disabled");
> > > >    	}
> > > >    }
> > > > +
> > > > +static int __init mte_tag_storage_activate_regions(void)
> > > > +{
> > > > +	phys_addr_t dram_start, dram_end;
> > > > +	struct range *tag_range;
> > > > +	unsigned long pfn;
> > > > +	int i, ret;
> > > > +
> > > > +	if (num_tag_regions == 0)
> > > > +		return 0;
> > > > +
> > > > +	dram_start = memblock_start_of_DRAM();
> > > > +	dram_end = memblock_end_of_DRAM();
> > > > +
> > > > +	for (i = 0; i < num_tag_regions; i++) {
> > > > +		tag_range = &tag_regions[i].tag_range;
> > > > +		/*
> > > > +		 * Tag storage region was clipped by arm64_bootmem_init()
> > > > +		 * enforcing addressing limits.
> > > > +		 */
> > > > +		if (PFN_PHYS(tag_range->start) < dram_start ||
> > > > +				PFN_PHYS(tag_range->end) >= dram_end) {
> > > > +			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
> > > > +			       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
> > > > +			ret = -EINVAL;
> > > > +			goto out_disabled;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * MTE disabled, tag storage pages can be used like any other pages. The
> > > > +	 * only restriction is that the pages cannot be used by kexec because
> > > > +	 * the memory remains marked as reserved in the memblock allocator.
> > > > +	 */
> > > > +	if (!system_supports_mte()) {
> > > > +		for (i = 0; i < num_tag_regions; i++) {
> > > > +			tag_range = &tag_regions[i].tag_range;
> > > > +			for (pfn = tag_range->start; pfn <= tag_range->end; pfn++)
> > > > +				free_reserved_page(pfn_to_page(pfn));
> > > > +		}
> > > > +		ret = 0;
> > > > +		goto out_disabled;
> > > > +	}
> > > > +
> > > > +	for (i = 0; i < num_tag_regions; i++) {
> > > > +		tag_range = &tag_regions[i].tag_range;
> > > > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > > > +			init_cma_reserved_pageblock(pfn_to_page(pfn));
> > > > +		totalcma_pages += range_len(tag_range);
> > > > +	}
> > > 
> > > You shouldn't be doing that manually in arm code. Likely you want some cma.c
> > > helper for something like that.
> > 
> > If you are referring to the last loop (the one that calls
> > init_cma_reserved_pageblock()), indeed, there's already a function which
> > does that, cma_init_reserved_areas() -> cma_activate_area().
> > 
> > > 
> > > But, can you elaborate on why you took this hacky (sorry) approach as
> > > documented in the cover letter:
> > 
> > No worries, it is indeed a bit hacky :)
> > 
> > > 
> > > "The arm64 code manages this memory directly instead of using
> > > cma_declare_contiguous/cma_alloc for performance reasons."
> > > 
> > > What is the exact problem?
> > 
> > I am referring to the performance degradation that is fixed in patch #26,
> > "arm64: mte: Fast track reserving tag storage when the block is free" [1].
> > The issue is that alloc_contig_range() -> __alloc_contig_migrate_range()
> > calls lru_cache_disable(), which IPIs all the CPUs in the system, and that
> > leads to a 10-20% performance degradation on Chrome. It has been observed
> > that most of the time the tag storage pages are free, and the
> > lru_cache_disable() calls are unnecessary.
> 
> This sounds like something eventually worth integrating into
> CMA/alloc_contig_range(). Like, a fast path to check if we are only
> allocating something small (e.g., falls within a single pageblock), and if
> the page is free.
> 
> > 
> > The performance degradation is almost entirely eliminated by having the code
> > take the tag storage page directly from the free list if it's free, instead
> > of calling alloc_contig_range().
> > 
> > Do you believe it would be better to use the cma code, and modify it to use
> > this fast path to take the page directly from the buddy allocator?
> 
> That sounds reasonable yes. Do you see any blockers for that?

I have been looking at the CMA code, and nothing stands out. I'll try changing
the code to use cma_alloc/cma_release for the next iteration.
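
For reference, the fast path I have in mind looks roughly like this (untested sketch, the
helper name is made up):

	/*
	 * If the requested range is a single free buddy block of sufficient
	 * order, take it straight off the free list (under zone->lock) and
	 * skip __alloc_contig_migrate_range() -- and its lru_cache_disable()
	 * IPIs -- entirely.
	 */
	static bool alloc_contig_fast_path(unsigned long start, unsigned long end)
	{
		struct page *page = pfn_to_page(start);

		if (end - start > pageblock_nr_pages)
			return false;
		if (!PageBuddy(page) || buddy_order(page) < order_base_2(end - start))
			return false;
		/* del_page_from_free_list() and friends go here. */
		return true;
	}

with alloc_contig_range() kept as the slow path when the check fails.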

> 
> > 
> > I can definitely try to integrate the code with cma_alloc(), but I think
> > keeping the fast path for reserving tag storage is extremely desirable,
> > since it makes such a huge difference to performance.
> 
> Yes, but let's try finding a way to optimize common code, to eventually
> improve some CMA cases as well? :)

Sounds good, I'll try to integrate the fast path code into cma_alloc(); that way
existing callers can benefit from it immediately.

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-11-28 17:05       ` David Hildenbrand
@ 2023-11-29 10:46         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 10:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 06:05:20PM +0100, David Hildenbrand wrote:
> On 27.11.23 16:04, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Fri, Nov 24, 2023 at 08:51:38PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > Tag storage memory requires that the tag storage pages used for data are
> > > > always migratable when they need to be repurposed to store tags.
> > > > 
> > > > If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
> > > > memblocks to find a suitable location for copying the kernel image. The
> > > > kernel image, once loaded, cannot be moved to another location in physical
> > > > memory. The initialization code for the tag storage reserves the memblocks
> > > > for the tag storage pages, which means kexec will not use them, and the tag
> > > > storage pages can be migrated at any time, which is the desired behaviour.
> > > > 
> > > > However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
> > > > region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
> > > > flag, which isn't currently set by the tag storage initialization code.
> > > > 
> > > > Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
> > > > that that the Kconfig option required for it to work correctly.
> > > > 
> > > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > > ---
> > > >    arch/arm64/Kconfig | 1 +
> > > >    1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index 047487046e8f..efa5b7958169 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -2065,6 +2065,7 @@ config ARM64_MTE
> > > >    if ARM64_MTE
> > > >    config ARM64_MTE_TAG_STORAGE
> > > >    	bool "Dynamic MTE tag storage management"
> > > > +	depends on ARCH_KEEP_MEMBLOCK
> > > >    	select CONFIG_CMA
> > > >    	help
> > > >    	  Adds support for dynamic management of the memory used by the hardware
> > > 
> > > Doesn't arm64 select that unconditionally? Why is this required then?
> > 
> > I've added this patch to make the dependency explicit. If, in the future, arm64
> > stops selecting ARCH_KEEP_MEMBLOCK, I think it would be very easy to miss the
> > fact that tag storage depends on it. So this patch is not required per se, it's
> > there to document the dependency.
> 
> I see. Could you add some static_assert / BUILD_BUG_ON instead?

I can do that, sure.
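
Something as simple as this should do it (untested):

	/* kexec must see tag storage as reserved; that needs memblock kept around. */
	static_assert(IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK));

in mte_tag_storage.c.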

Thanks,
Alex

> 
> I suspect there are plenty of other (undocumented) reasons why
> ARCH_KEEP_MEMBLOCK has to be enabled for now, and nothing asserts
> that it stays selected, I suspect.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
  2023-11-28 17:06   ` David Hildenbrand
@ 2023-11-29 11:30     ` Alexandru Elisei
  2023-11-29 13:13       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 11:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On Tue, Nov 28, 2023 at 06:06:54PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > On arm64, the zero page receives special treatment by having the tagged
> > flag set on MTE initialization, not when the page is mapped in a process
> > address space. Reserve the corresponding tag block when tag storage
> > management is being activated.
> 
> Out of curiosity: why does the shared zeropage require tagged storage? What
> about the huge zeropage?

There are two different tags that are used for tag checking: the logical
tag, the tag embedded in bits 59:56 of an address, and the physical tag
corresponding to the address. This tag is stored in a separate memory
location, called tag storage. When an access is performed, hardware
compares the logical tag (from the address) with the physical tag (from the
tag storage). If they match, the access is permitted.

The physical tag is set with special instructions.

Userspace pointers have bits 59:56 zero. If the pointer is in a VMA with
MTE enabled, then for userspace to be able to access this address, the
physical tag must also be 0b0000.
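
As an illustration, the check the hardware performs on a checked access
amounts to something like this (pseudocode only; tag_storage_lookup() and
report_tag_check_fault() are made-up stand-ins for the hardware's tag fetch
and the resulting fault):

	u8 logical_tag  = (addr >> 56) & 0xf;
	u8 physical_tag = tag_storage_lookup(addr);

	if (logical_tag != physical_tag)
		report_tag_check_fault(addr);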

To make it easier on userspace, when a page is first mapped as tagged, its
tags are cleared by the kernel; this way, userspace can access the address
immediately, without clearing the physical tags beforehand. Another reason
for clearing the physical tags when a page is mapped as tagged would be to
avoid leaking uninitialized tags to userspace.

The zero page is special, because the physical tags are not zeroed every
time the page is mapped in a process; instead, the zero page is marked as
tagged (by setting a page flag) and the physical tags are zeroed only once,
when MTE is enabled at boot.

All of this means that when tag storage is enabled, which happens after MTE
is enabled, the tag storage corresponding to the zero page is already in
use and must be reserved, and it can never be used for data allocations.

I hope all of the above makes sense. I can also put it in the commit
message :)

As for the zero huge page, the MTE code in the kernel treats it like a
regular page, and it zeroes the tags when it is mapped as tagged in a
process. I agree that this might not be the best solution from a
performance perspective, but it has worked so far.

With tag storage management enabled, set_pte_at()->mte_sync_tags() will
discover that the huge zero page doesn't have tag storage reserved, the
table entry will be mapped as invalid to use the page fault-on-access
mechanism that I introduce later in the series [1] to reserve tag storage,
and after that set_pte_at() will zero the physical tags.

[1] https://lore.kernel.org/all/20231119165721.9849-20-alexandru.elisei@arm.com/

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-28 17:55   ` David Hildenbrand
  2023-11-28 18:00     ` David Hildenbrand
@ 2023-11-29 11:55     ` Alexandru Elisei
  2023-11-29 12:48       ` David Hildenbrand
  1 sibling, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 11:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 06:55:18PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
> 
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that

Yes, arm64 basically defines PAGE_FAULT_ON_ACCESS as PAGE_NONE |
PTE_TAG_STORAGE_NONE.

> handling and simply reuse pte_protnone()? For example, could we special case
> on VMA flags?
> 
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
> 
> if (pte_protnone(vmf->orig_pte))
> 	return handle_pte_protnone(vmf);
> 
> In there, special case on the VMA flags.

Your suggestion from the follow-up reply that an arch should know if it needs to
do something was spot on; arm64 can use the software bit in the translation
table entry for that.

So what you are proposing is this:

* Rename do_numa_page->handle_pte_protnone
* At some point in the do_numa_page (now renamed to handle_pte_protnone) flow,
  decide if pte_protnone() has been set for an arch specific reason or because
  of automatic NUMA balancing.
* if pte_protnone() has been set by an architecture, then let the architecture
  handle the fault.

If I understood you correctly, that's a good idea, and should be easy to
implement.
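
As a rough sketch (fault_on_access_pte() and arch_do_page_fault_on_access()
are the helpers this series already adds):

	/* do_numa_page(), renamed */
	static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
	{
		/*
		 * The arch can tell from a software bit in the pte whether
		 * it set pte_protnone() itself.
		 */
		if (fault_on_access_pte(vmf->orig_pte))
			return arch_do_page_fault_on_access(vmf);

		/* ... the existing automatic NUMA balancing handling ... */
	}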

> 
> I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some

Indeed, most of the code is taken as-is from do_numa_page().

> code from the prot_none handling path. At least the recovery path and
> writability handling look like they had better be shared in
> handle_pte_protnone() as well.

Yes, I agree.

Thanks,
Alex

> 
> That might take some magic out of this patch.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access
  2023-11-28 17:56   ` David Hildenbrand
@ 2023-11-29 11:56     ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 11:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Tue, Nov 28, 2023 at 06:56:34PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > Handle PAGE_FAULT_ON_ACCESS faults for huge pages in a similar way to
> > regular pages.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> 
> Same comments :)

Yes, will have a look at this fault handling path too :)

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-29 11:55     ` Alexandru Elisei
@ 2023-11-29 12:48       ` David Hildenbrand
  0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2023-11-29 12:48 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 29.11.23 12:55, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Nov 28, 2023 at 06:55:18PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> To enable tagging on a memory range, userspace can use mprotect() with the
>>> PROT_MTE access flag. Pages already mapped in the VMA don't have the
>>> associated tag storage block reserved, so mark the PTEs as
>>> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
>>> reserve the tag storage on the fault path.
>>
>> That sounds a lot like fake PROT_NONE. Would there be a way to unify that
> 
> Yes, arm64 basically defines PAGE_FAULT_ON_ACCESS as PAGE_NONE |
> PTE_TAG_STORAGE_NONE.
> 
>> handling and simply reuse pte_protnone()? For example, could we special case
>> on VMA flags?
>>
>> Like, don't do NUMA hinting in these special VMAs. Then, have something
>> like:
>>
>> if (pte_protnone(vmf->orig_pte))
>> 	return handle_pte_protnone(vmf);
>>
>> In there, special case on the VMA flags.
> 
> Your suggestion from the follow-up reply that an arch should know if it needs to
> do something was spot on; arm64 can use the software bit in the translation
> table entry for that.
> 
> So what you are proposing is this:
> 
> * Rename do_numa_page->handle_pte_protnone
> * At some point in the do_numa_page (now renamed to handle_pte_protnone) flow,
>    decide if pte_protnone() has been set for an arch specific reason or because
>    of automatic NUMA balancing.
> * if pte_protnone() has been set by an architecture, then let the architecture
>    handle the fault.
> 
> If I understood you correctly, that's a good idea, and should be easy to
> implement.

yes! :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
  2023-11-29 11:30     ` Alexandru Elisei
@ 2023-11-29 13:13       ` David Hildenbrand
  2023-11-29 13:41         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-29 13:13 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 29.11.23 12:30, Alexandru Elisei wrote:
> On Tue, Nov 28, 2023 at 06:06:54PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> On arm64, the zero page receives special treatment by having the tagged
>>> flag set on MTE initialization, not when the page is mapped in a process
>>> address space. Reserve the corresponding tag block when tag storage
>>> management is being activated.
>>
>> Out of curiosity: why does the shared zeropage require tagged storage? What
>> about the huge zeropage?
> 
> There are two different tags that are used for tag checking: the logical
> tag, the tag embedded in bits 59:56 of an address, and the physical tag
> corresponding to the address. This tag is stored in a separate memory
> location, called tag storage. When an access is performed, hardware
> compares the logical tag (from the address) with the physical tag (from the
> tag storage). If they match, the access is permitted.

Ack, matches my understanding.

> 
> The physical tag is set with special instructions.
> 
> Userspace pointers have bits 59:56 zero. If the pointer is in a VMA with
> MTE enabled, then for userspace to be able to access this address, the
> physical tag must also be 0b0000.
> 
> To make it easier on userspace, when a page is first mapped as tagged, its
> tags are cleared by the kernel; this way, userspace can access the address
> immediately, without clearing the physical tags beforehand. Another reason
> for clearing the physical tags when a page is mapped as tagged would be to
> avoid leaking uninitialized tags to userspace.

Makes sense. Zero it just like we zero page content.

> 
> The zero page is special, because the physical tags are not zeroed every
> time the page is mapped in a process; instead, the zero page is marked as
> tagged (by setting a page flag) and the physical tags are zeroed only once,
> when MTE is enabled at boot.

Makes sense.

> 
> All of this means that when tag storage is enabled, which happens after MTE
> is enabled, the tag storage corresponding to the zero page is already in
> > use and must be reserved, and it can never be used for data allocations.
> 
> I hope all of the above makes sense. I can also put it in the commit
> message :)

Yes, makes sense!

> 
> As for the zero huge page, the MTE code in the kernel treats it like a
> regular page, and it zeroes the tags when it is mapped as tagged in a
> process. I agree that this might not be the best solution from a
> performance perspective, but it has worked so far.

What if user space were to change the tag of that shared resource?

Having a tag != 0 doesn't make sense for such a shared resource, so I 
suspect modifying the tag is like a write event: trigger write-fault -> COW.

> 
> With tag storage management enabled, set_pte_at()->mte_sync_tags() will
> discover that the huge zero page doesn't have tag storage reserved, the
> table entry will be mapped as invalid to use the page fault-on-access
> mechanism that I introduce later in the series [1] to reserve tag storage,

I assume (without looking at the code) that you took proper care of 
possible races.

Thanks for going into detail!


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation
  2023-11-29  9:10   ` Hyesoo Yu
@ 2023-11-29 13:33     ` Alexandru Elisei
  2023-12-08  5:29       ` Hyesoo Yu
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 13:33 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 06:10:40PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:10PM +0000, Alexandru Elisei wrote:
> > [..]
> > +static int order_to_num_blocks(int order)
> > +{
> > +	return max((1 << order) / 32, 1);
> > +}
> > [..]
> > +int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > +{
> > +	unsigned long start_block, end_block;
> > +	struct tag_region *region;
> > +	unsigned long block;
> > +	unsigned long flags;
> > +	unsigned int tries;
> > +	int ret = 0;
> > +
> > +	VM_WARN_ON_ONCE(!preemptible());
> > +
> > +	if (page_tag_storage_reserved(page))
> > +		return 0;
> > +
> > +	/*
> > +	 * __alloc_contig_migrate_range() ignores gfp when allocating the
> > +	 * destination page for migration. Regardless, massage gfp flags and
> > +	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
> > +	 * ignored.
> > +	 */
> > +	gfp &= ~__GFP_TAGGED;
> > +	if (!(gfp & __GFP_NORETRY))
> > +		gfp |= __GFP_RETRY_MAYFAIL;
> > +
> > +	ret = tag_storage_find_block(page, &start_block, &region);
> > +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > +		return 0;
> > +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> > +
> 
> Hello.
> 
> If the page size is 4K, the block size is 2 pages (8K in bytes), and the order is 6,
> then we need 2 pages for the tags. However, according to the equation, order_to_num_blocks
> is 2 and block_size is also 2, so end_block will be incremented by 4.
> 
> However, we actually only need 8K of tags for 256K, right?
> Could you explain order_to_num_blocks * region->block_size in more detail?

I think you are correct, thank you for pointing it out. The formula should
probably be something like:

static int order_to_num_blocks(int order, u32 block_size)
{
	int num_tag_pages = max((1 << order) / 32, 1);

	return DIV_ROUND_UP(num_tag_pages, block_size);
}

and that will make end_block = start_block + 2 in your scenario.
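
(Working through your numbers: order = 6 is 64 data pages, so
num_tag_pages = max(64 / 32, 1) = 2, and DIV_ROUND_UP(2, 2) = 1 block of 2
pages, which is exactly the 8K of tag storage needed for 256K.)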

Does that look correct to you?

Thanks,
Alex

> 
> Thanks,
> Regards.
> 
> > +	mutex_lock(&tag_blocks_lock);
> > +
> > +	/* Check again, this time with the lock held. */
> > +	if (page_tag_storage_reserved(page))
> > +		goto out_unlock;
> > +
> > +	/* Make sure existing entries are not freed out from under our feet. */
> > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > +	for (block = start_block; block < end_block; block += region->block_size) {
> > +		if (tag_storage_block_is_reserved(block))
> > +			block_ref_add(block, region, order);
> > +	}
> > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > +
> > +	for (block = start_block; block < end_block; block += region->block_size) {
> > +		/* Refcount incremented above. */
> > +		if (tag_storage_block_is_reserved(block))
> > +			continue;
> > +
> > +		tries = 3;
> > +		while (tries--) {
> > +			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
> > +			if (ret == 0 || ret != -EBUSY)
> > +				break;
> > +		}
> > +
> > +		if (ret)
> > +			goto out_error;
> > +
> > +		ret = tag_storage_reserve_block(block, region, order);
> > +		if (ret) {
> > +			free_contig_range(block, region->block_size);
> > +			goto out_error;
> > +		}
> > +
> > +		count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
> > +	}
> > +
> > +	page_set_tag_storage_reserved(page, order);
> > +out_unlock:
> > +	mutex_unlock(&tag_blocks_lock);
> > +
> > +	return 0;
> > +
> > +out_error:
> > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > +	for (block = start_block; block < end_block; block += region->block_size) {
> > +		if (tag_storage_block_is_reserved(block) &&
> > +		    block_ref_sub_return(block, region, order) == 1) {
> > +			__xa_erase(&tag_blocks_reserved, block);
> > +			free_contig_range(block, region->block_size);
> > +		}
> > +	}
> > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > +
> > +	mutex_unlock(&tag_blocks_lock);
> > +
> > +	count_vm_events(CMA_ALLOC_FAIL, region->block_size);
> > +
> > +	return ret;
> > +}
> > +
> > +void free_tag_storage(struct page *page, int order)
> > +{
> > +	unsigned long block, start_block, end_block;
> > +	struct tag_region *region;
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	ret = tag_storage_find_block(page, &start_block, &region);
> > +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > +		return;
> > +
> > +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> > +
> > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > +	for (block = start_block; block < end_block; block += region->block_size) {
> > +		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
> > +		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
> > +			continue;
> > +
> > +		if (block_ref_sub_return(block, region, order) == 1) {
> > +			__xa_erase(&tag_blocks_reserved, block);
> > +			free_contig_range(block, region->block_size);
> > +		}
> > +	}
> > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > +}
> > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > index 195b077c0fac..e7eb584a9234 100644
> > --- a/fs/proc/page.c
> > +++ b/fs/proc/page.c
> > @@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
> >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> >  	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
> >  	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
> > +	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
> >  #endif
> >  
> >  	return u;
> > diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
> > index 859f4b0c1b2b..4a0d719ffdd4 100644
> > --- a/include/linux/kernel-page-flags.h
> > +++ b/include/linux/kernel-page-flags.h
> > @@ -19,5 +19,6 @@
> >  #define KPF_SOFTDIRTY		40
> >  #define KPF_ARCH_2		41
> >  #define KPF_ARCH_3		42
> > +#define KPF_ARCH_4		43
> >  
> >  #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index a88e64acebfe..7915165a51bd 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,7 @@ enum pageflags {
> >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> >  	PG_arch_2,
> >  	PG_arch_3,
> > +	PG_arch_4,
> >  #endif
> >  	__NR_PAGEFLAGS,
> >  
> > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > index 6ca0d5ed46c0..ba962fd10a2c 100644
> > --- a/include/trace/events/mmflags.h
> > +++ b/include/trace/events/mmflags.h
> > @@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
> >  IF_HAVE_PG_IDLE(idle)							\
> >  IF_HAVE_PG_IDLE(young)							\
> >  IF_HAVE_PG_ARCH_X(arch_2)						\
> > -IF_HAVE_PG_ARCH_X(arch_3)
> > +IF_HAVE_PG_ARCH_X(arch_3)						\
> > +IF_HAVE_PG_ARCH_X(arch_4)
> >  
> >  #define show_page_flags(flags)						\
> >  	(flags) ? __print_flags(flags, "|",				\
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index f31f02472396..9beead961a65 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2474,6 +2474,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
> >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> >  			 (1L << PG_arch_2) |
> >  			 (1L << PG_arch_3) |
> > +			 (1L << PG_arch_4) |
> >  #endif
> >  			 (1L << PG_dirty) |
> >  			 LRU_GEN_MASK | LRU_REFS_MASK));
> > -- 
> > 2.42.1
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
  2023-11-29 13:13       ` David Hildenbrand
@ 2023-11-29 13:41         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-29 13:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 02:13:50PM +0100, David Hildenbrand wrote:
> On 29.11.23 12:30, Alexandru Elisei wrote:
> > On Tue, Nov 28, 2023 at 06:06:54PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:57, Alexandru Elisei wrote:
> > > > On arm64, the zero page receives special treatment by having the tagged
> > > > flag set on MTE initialization, not when the page is mapped in a process
> > > > address space. Reserve the corresponding tag block when tag storage
> > > > management is being activated.
> > > 
> > > Out of curiosity: why does the shared zeropage require tagged storage? What
> > > about the huge zeropage?
> > 
> > There are two different tags that are used for tag checking: the logical
> > tag, the tag embedded in bits 59:56 of an address, and the physical tag
> > corresponding to the address. This tag is stored in a separate memory
> > location, called tag storage. When an access is performed, hardware
> > compares the logical tag (from the address) with the physical tag (from the
> > tag storage). If they match, the access is permitted.
> 
> Ack, matches my understanding.
> 
> > 
> > The physical tag is set with special instructions.
> > 
> > Userspace pointers have bits 59:56 zero. If the pointer is in a VMA with
> > MTE enabled, then for userspace to be able to access this address, the
> > physical tag must also be 0b0000.
> > 
> > To make it easier on userspace, when a page is first mapped as tagged, its
> > tags are cleared by the kernel; this way, userspace can access the address
> > immediately, without clearing the physical tags beforehand. Another reason
> > for clearing the physical tags when a page is mapped as tagged would be to
> > avoid leaking uninitialized tags to userspace.
> 
> Make sense. Zero it just like we zero page content.
> 
> > 
> > The zero page is special, because the physical tags are not zeroed every
> > time the page is mapped in a process; instead, the zero page is marked as
> > tagged (by setting a page flag) and the physical tags are zeroed only once,
> > when MTE is enabled at boot.
> 
> Makes sense.
> 
> > 
> > All of this means that when tag storage is enabled, which happens after MTE
> > is enabled, the tag storage corresponding to the zero page is already in
> > use and must be reserved, and it can never be used for data allocations.
> > 
> > I hope all of the above makes sense. I can also put it in the commit
> > message :)
> 
> Yes, makes sense!
> 
> > 
> > As for the zero huge page, the MTE code in the kernel treats it like a
> > regular page, and it zeroes the tags when it is mapped as tagged in a
> > process. I agree that this might not be the best solution from a
> > performance perspective, but it has worked so far.
> 
> What if user space were to change the tag of that shared resource?
> 
> Having a tag != 0 doesn't make sense for such a shared resource, so I
> suspect modifying the tag is like a write event: trigger write-fault -> COW.

Yes, modifying the tag is a write event.

> 
> > 
> > With tag storage management enabled, set_pte_at()->mte_sync_tags() will
> > discover that the huge zero page doesn't have tag storage reserved, the
> > table entry will be mapped as invalid to use the page fault-on-access
> > mechanism that I introduce later in the series [1] to reserve tag storage,
> 
> I assume (without looking at the code) that you took proper care of possible
> races.
> 
> Thanks for going into detail!

No problem.

Alex

> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-11-29  8:44   ` Hyesoo Yu
@ 2023-11-30 11:56     ` Alexandru Elisei
  2023-12-03 12:14     ` Alexandru Elisei
  1 sibling, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 11:56 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> Hello.
> 
> On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> > 
> > The DTB node for the tag storage region is defined as:
> > 
> >         tags0: tag-storage@8f8000000 {
> >                 compatible = "arm,mte-tag-storage";
> >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> >                 block-size = <0x1000>;
> >                 memory = <&memory0>;	// Associated tagged memory node
> >         };
> >
> 
> How about using compatible = "shared-dma-pool" like below?
> 
> &reserved_memory {
> 	tags0: tag0@8f8000000 {
> 		compatible = "arm,mte-tag-storage";
>         	reg = <0x08 0xf8000000 0x00 0x4000000>;
> 	};
> }
> 
> tag-storage {
>         compatible = "arm,mte-tag-storage";
> 	memory-region = <&tags0>;
>         memory = <&memory0>;
> 	block-size = <0x1000>;
> }

I'm sorry, but I don't follow where compatible = "shared-dma-pool" fits
with the examples.

> 
> And then, the activation of CMA would be performed in the CMA code.
> We can just get the region information from memory-region and allocate it directly,
> using something like alloc_contig_range() or take_page_off_buddy(). It seems like we can remove a lot of code.

For the next iteration I am planning to integrate the code more tightly
with CMA, so any suggestions to that effect are very welcome :)
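
For example, I imagine something along these lines on top of the existing
CMA API (sketch only; tag_cma and the "mte-tag-storage" name are made up):

	struct cma *tag_cma;
	int ret;

	ret = cma_init_reserved_mem(PFN_PHYS(tag_range->start),
				    PFN_PHYS(range_len(tag_range)),
				    0, "mte-tag-storage", &tag_cma);

with the per-block reservations then going through cma_alloc() and
cma_release() instead of the open-coded alloc_contig_range() calls.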

> 
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> > 
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
> 
> Is this considered for kernel page sizes like 16K or 64K? The comment says
> it should be a multiple of 4K, but more accurately it should be a multiple of the "page size".
> Please let me know if there's anything I misunderstood. :-)

The block size in the DTB is a hardware property; it's independent of the
kernel page size, which is a compile-time option.

The function get_block_size_pages(), which computes the tag storage block
size as the kernel will use it, takes into account the fact that the
hardware block size is not necessarily a multiple of the kernel page size,
and computes the least common multiple by doing:

(kernel page size in bytes x DTB block size in bytes) / greatest common divisor
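
For example, with a 16K kernel page size and block-size = 0x2000 (8K) in
the DTB, the greatest common divisor is 8K, so the result is 16K * 8K / 8K
= 16K, i.e. one 16K kernel page spanning two hardware blocks.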

As for why the hardware block size is a multiple of 4K, that was chosen
because it will be part of the architecture update. Since the minimum
hardware page size is 4K, it doesn't make much sense to have the DTB
block-size smaller than that.

Hope that makes sense!

Thanks,
Alex

> 
> 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/Kconfig                       |  12 ++
> >  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
> >  arch/arm64/kernel/Makefile               |   1 +
> >  arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
> >  arch/arm64/kernel/setup.c                |   7 +
> >  5 files changed, 291 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> >  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >  
> >  	  Documentation/arch/arm64/memory-tagging-extension.rst.
> >  
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > +	bool "Dynamic MTE tag storage management"
> > +	help
> > +	  Adds support for dynamic management of the memory used by the hardware
> > +	  for storing MTE tags. This memory, unlike normal memory, cannot be
> > +	  tagged. When it is used to store tags for another memory location it
> > +	  cannot be used for any type of allocation.
> > +
> > +	  If unsure, say N
> > +endif # ARM64_MTE
> > +
> >  endmenu # "ARMv8.5 architectural features"
> >  
> >  menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H  */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
> >  obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
> >  obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
> >  obj-$(CONFIG_ARM64_MTE)			+= mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)	+= mte_tag_storage.o
> >  obj-y					+= vdso-wrap.o
> >  obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
> >  obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > +	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
> > +	struct range tag_range;	/* Tag storage memory, in PFNs. */
> > +	u32 block_size;		/* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS	32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > +						int reg_len, struct range *range)
> > +{
> > +	int addr_cells = dt_root_addr_cells;
> > +	int size_cells = dt_root_size_cells;
> > +	u64 size;
> > +
> > +	if (reg_len / 4 > addr_cells + size_cells)
> > +		return -EINVAL;
> > +
> > +	range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > +	size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > +	if (size == 0) {
> > +		pr_err("Invalid node");
> > +		return -EINVAL;
> > +	}
> > +	range->end = range->start + size - 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > +						    struct range *tag_range)
> > +{
> > +	const __be32 *reg;
> > +	int reg_len;
> > +
> > +	reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +	if (reg == NULL) {
> > +		pr_err("Invalid metadata node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > +	const __be32 *reg;
> > +	int reg_len;
> > +
> > +	reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > +	if (reg == NULL)
> > +		reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > +	if (reg == NULL) {
> > +		pr_err("Invalid memory node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > +	unsigned long node;
> > +	u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > +				       int depth, void *data)
> > +{
> > +	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > +	struct find_memory_node_arg *arg = data;
> > +
> > +	if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > +		return 0;
> > +
> > +	if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > +		arg->node = node;
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > +	struct find_memory_node_arg arg = { 0 };
> > +	const __be32 *memory_prop;
> > +	u32 mem_phandle;
> > +	int ret, reg_len;
> > +
> > +	memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > +	if (!memory_prop) {
> > +		pr_err("Missing 'memory' property in the tag storage node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	mem_phandle = be32_to_cpup(memory_prop);
> > +	arg.phandle = mem_phandle;
> > +
> > +	ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > +	if (ret != 1) {
> > +		pr_err("Associated memory node not found");
> > +		return -EINVAL;
> > +	}
> > +
> > +	*mem_node = arg.node;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > +					       u32 *retval)
> > +{
> > +	const __be32 *reg;
> > +
> > +	reg = of_get_flat_dt_prop(node, propname, NULL);
> > +	if (!reg)
> > +		return -EINVAL;
> > +
> > +	*retval = be32_to_cpup(reg);
> > +	return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > +	u32 a = PAGE_SIZE;
> > +	u32 b = block_size_bytes;
> > +	u32 r;
> > +
> > +	/* Find greatest common divisor using the Euclidean algorithm. */
> > +	do {
> > +		r = a % b;
> > +		a = b;
> > +		b = r;
> > +	} while (b != 0);
> > +
> > +	return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > +				       int depth, void *data)
> > +{
> > +	struct tag_region *region;
> > +	unsigned long mem_node;
> > +	struct range *mem_range;
> > +	struct range *tag_range;
> > +	u32 block_size_bytes;
> > +	u32 nid = 0;
> > +	int ret;
> > +
> > +	if (depth != 1 || !strstr(uname, "tag-storage"))
> > +		return 0;
> > +
> > +	if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > +		return 0;
> > +
> > +	if (num_tag_regions == MAX_TAG_REGIONS) {
> > +		pr_err("Maximum number of tag storage regions exceeded");
> > +		return -EINVAL;
> > +	}
> > +
> > +	region = &tag_regions[num_tag_regions];
> > +	mem_range = &region->mem_range;
> > +	tag_range = &region->tag_range;
> > +
> > +	ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > +	if (ret) {
> > +		pr_err("Invalid tag storage node");
> > +		return ret;
> > +	}
> > +
> > +	ret = tag_storage_get_memory_node(node, &mem_node);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > +	if (ret) {
> > +		pr_err("Invalid address for associated data memory node");
> > +		return ret;
> > +	}
> > +
> > +	/* The tag region must exactly match the corresponding memory. */
> > +	if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > +		pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > +		       PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > +	if (ret || block_size_bytes == 0) {
> > +		pr_err("Invalid or missing 'block-size' property");
> > +		return -EINVAL;
> > +	}
> > +	region->block_size = get_block_size_pages(block_size_bytes);
> > +	if (range_len(tag_range) % region->block_size != 0) {
> > +		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > +		       PFN_PHYS(range_len(tag_range)), region->block_size);
> > +		return -EINVAL;
> > +	}
> > +
> 
> I was confused by the variable "block_size". The block size declared in the device tree is
> in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> confusion, as it might be interpreted as bytes. If possible, I suggest changing the term "block_size"
> to something more readable, such as "block_nr_pages" (this is just an example!)
> 
> Thanks,
> Regards.
> 
> > +	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > +	if (ret)
> > +		nid = numa_node_id();
> > +
> > +	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > +				nid, MEMBLOCK_NONE);
> > +	if (ret) {
> > +		pr_err("Error adding tag memblock (%d)", ret);
> > +		return ret;
> > +	}
> > +	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +
> > +	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > +		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > +
> > +	num_tag_regions++;
> > +
> > +	return 0;
> > +}
> > +
> > +void __init mte_tag_storage_init(void)
> > +{
> > +	struct range *tag_range;
> > +	int i, ret;
> > +
> > +	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > +	if (ret) {
> > +		for (i = 0; i < num_tag_regions; i++) {
> > +			tag_range = &tag_regions[i].tag_range;
> > +			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +		}
> > +		num_tag_regions = 0;
> > +		pr_info("MTE tag storage region management disabled");
> > +	}
> > +}
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 417a8a86b2db..1b77138c1aa5 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -42,6 +42,7 @@
> >  #include <asm/cpufeature.h>
> >  #include <asm/cpu_ops.h>
> >  #include <asm/kasan.h>
> > +#include <asm/mte_tag_storage.h>
> >  #include <asm/numa.h>
> >  #include <asm/scs.h>
> >  #include <asm/sections.h>
> > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> >  			   FW_BUG "Booted with MMU enabled!");
> >  	}
> >  
> > +	/*
> > +	 * Must be called before memory limits are enforced by
> > +	 * arm64_memblock_init().
> > +	 */
> > +	mte_tag_storage_init();
> > +
> >  	arm64_memblock_init();
> >  
> >  	paging_init();
> > -- 
> > 2.42.1
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-29  8:57   ` Hyesoo Yu
@ 2023-11-30 12:00     ` Alexandru Elisei
  2023-12-08  5:27       ` Hyesoo Yu
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 12:00 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 05:57:44PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> > alloc_contig_range() requires that the requested pages are in the same
> > zone. Check that this is indeed the case before initializing the tag
> > storage blocks.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 8b9bedf7575d..fd63430d4dc0 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> >  	}
> >  }
> >  
> > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > +static int __init mte_tag_storage_check_zone(void)
> > +{
> > +	struct range *tag_range;
> > +	struct zone *zone;
> > +	unsigned long pfn;
> > +	u32 block_size;
> > +	int i, j;
> > +
> > +	for (i = 0; i < num_tag_regions; i++) {
> > +		block_size = tag_regions[i].block_size;
> > +		if (block_size == 1)
> > +			continue;
> > +
> > +		tag_range = &tag_regions[i].tag_range;
> > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > +			zone = page_zone(pfn_to_page(pfn));
> 
> Hello.
> 
> Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
> out of the loop?

Hmm.. why do you say that the pages in a tag_range must be in the same
zone? I am not very familiar with how the memory management code puts pages
into zones, but I would imagine that pages in a tag range straddling the
4GB limit (so, let's say, from 3GB to 5GB) will end up in both ZONE_DMA and
ZONE_NORMAL.

Thanks,
Alex

> 
> Thanks,
> Regards.
> 
> > +			for (j = 1; j < block_size; j++) {
> > +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > +					pr_err("Tag storage block pages in different zones");
> > +					return -EINVAL;
> > +				}
> > +			}
> > +		}
> > +	}
> > +
> > +	 return 0;
> > +}
> > +
> >  static int __init mte_tag_storage_activate_regions(void)
> >  {
> >  	phys_addr_t dram_start, dram_end;
> > @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
> >  		goto out_disabled;
> >  	}
> >  
> > +	ret = mte_tag_storage_check_zone();
> > +	if (ret)
> > +		goto out_disabled;
> > +
> >  	for (i = 0; i < num_tag_regions; i++) {
> >  		tag_range = &tag_regions[i].tag_range;
> >  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > -- 
> > 2.42.1
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-29  9:27   ` Hyesoo Yu
@ 2023-11-30 12:06     ` Alexandru Elisei
  2023-11-30 12:49       ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 12:06 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 06:27:25PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:13PM +0000, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
> > 
> > This has several benefits over reserving the tag storage as part of the
> > mprotect() call handling:
> > 
> > - Tag storage is reserved only for those pages in the VMA that are
> >   accessed, instead of for all the pages already mapped in the VMA.
> > - Reduces the latency of the mprotect() call.
> > - Eliminates races with page migration.
> > 
> > But all of this is at the expense of an extra page fault per page until the
> > pages being accessed all have their corresponding tag storage reserved.
> > 
> > For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
> > page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
> > of the PBHA bits in entries from the last level of the translation table
> > and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
> > 59, is already being used as a software bit for PMD_PRESENT_INVALID.
> > 
> > This is only implemented for PTE mappings; PMD mappings will follow.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/Kconfig                       |   1 +
> >  arch/arm64/include/asm/mte.h             |   4 +-
> >  arch/arm64/include/asm/mte_tag_storage.h |   2 +
> >  arch/arm64/include/asm/pgtable-prot.h    |   2 +
> >  arch/arm64/include/asm/pgtable.h         |  40 ++++++---
> >  arch/arm64/kernel/mte.c                  |  12 ++-
> >  arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
> >  include/linux/pgtable.h                  |  17 ++++
> >  mm/Kconfig                               |   3 +
> >  mm/memory.c                              |   3 +
> >  10 files changed, 170 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index efa5b7958169..3b9c435eaafb 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2066,6 +2066,7 @@ if ARM64_MTE
> >  config ARM64_MTE_TAG_STORAGE
> >  	bool "Dynamic MTE tag storage management"
> >  	depends on ARCH_KEEP_MEMBLOCK
> > +	select ARCH_HAS_FAULT_ON_ACCESS
> >  	select CONFIG_CMA
> >  	help
> >  	  Adds support for dynamic management of the memory used by the hardware
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 6457b7899207..70dc2e409070 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> >  }
> >  
> >  void mte_zero_clear_page_tags(void *addr);
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages);
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
> >  void mte_copy_page_tags(void *kto, const void *kfrom);
> >  void mte_thread_init_user(void);
> >  void mte_thread_switch(struct task_struct *next);
> > @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> >  static inline void mte_zero_clear_page_tags(void *addr)
> >  {
> >  }
> > -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> >  }
> >  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 6e5d28e607bb..c70ced60a0cd 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> >  void free_tag_storage(struct page *page, int order);
> >  
> >  bool page_tag_storage_reserved(struct page *page);
> > +
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> >  #else
> >  static inline bool tag_storage_enabled(void)
> >  {
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> > index e9624f6326dd..85ebb3e352ad 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -19,6 +19,7 @@
> >  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
> >  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> >  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> > +#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
> >  
> >  /*
> >   * This bit indicates that the entry is present i.e. pmd_page()
> > @@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
> >  	 })
> >  
> >  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> > +#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> >  /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
> >  #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
> >  #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 20e8de853f5d..8cc135f1c112 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
> >  		     __func__, pte_val(old_pte), pte_val(pte));
> >  }
> >  
> > -static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> > -	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> > -		__sync_icache_dcache(pte);
> > +	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
> > +		__sync_icache_dcache(*pteval);
> >  
> >  	/*
> >  	 * If the PTE would provide user space access to the tags associated
> > @@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> >  	 * pte_access_permitted() returns false for exec only mappings, they
> >  	 * don't expose tags (instruction fetches don't check tags).
> >  	 */
> > -	if (system_supports_mte() && pte_access_permitted(pte, false) &&
> > -	    !pte_special(pte) && pte_tagged(pte))
> > -		mte_sync_tags(pte, nr_pages);
> > +	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
> > +	    !pte_special(*pteval) && pte_tagged(*pteval))
> > +		mte_sync_tags(pteval, nr_pages);
> >  }
> >  
> >  static inline void set_ptes(struct mm_struct *mm,
> > @@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
> >  			    pte_t *ptep, pte_t pte, unsigned int nr)
> >  {
> >  	page_table_check_ptes_set(mm, ptep, pte, nr);
> > -	__sync_cache_and_tags(pte, nr);
> > +	__sync_cache_and_tags(&pte, nr);
> >  
> >  	for (;;) {
> >  		__check_safe_pte_update(mm, ptep, pte);
> > @@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > +	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
> > +		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > +	return fault_on_access_pte(pmd_pte(pmd));
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +	if (tag_storage_enabled())
> > +		return handle_page_missing_tag_storage(vmf);
> > +	return VM_FAULT_SIGBUS;
> > +}
> > +#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> > +
> >  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> >  
> >  static inline int pmd_present(pmd_t pmd)
> > @@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
> >  				unsigned long __always_unused addr,
> >  				pte_t *ptep, pte_t pte, unsigned int nr)
> >  {
> > -	__sync_cache_and_tags(pte, nr);
> > +	__sync_cache_and_tags(&pte, nr);
> >  	__check_safe_pte_update(mm, ptep, pte);
> >  	set_pte(ptep, pte);
> >  }
> > @@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> >  	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
> >  	 */
> >  	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> > -			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
> > -			      PTE_ATTRINDX_MASK;
> > +			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
> > +			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
> >  	/* preserve the hardware dirty information */
> >  	if (pte_hw_dirty(pte))
> >  		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index a41ef3213e1e..5962bab1d549 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -21,6 +21,7 @@
> >  #include <asm/barrier.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/mte.h>
> > +#include <asm/mte_tag_storage.h>
> >  #include <asm/ptrace.h>
> >  #include <asm/sysreg.h>
> >  
> > @@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
> >  EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
> >  #endif
> >  
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> > -	struct page *page = pte_page(pte);
> > +	struct page *page = pte_page(*pteval);
> >  	unsigned int i;
> >  
> > -	/* if PG_mte_tagged is set, tags have already been initialised */
> >  	for (i = 0; i < nr_pages; i++, page++) {
> > +		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
> > +			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
> > +			continue;
> > +		}
> > +
> > +		/* if PG_mte_tagged is set, tags have already been initialised */
> >  		if (try_page_mte_tagging(page)) {
> >  			mte_clear_page_tags(page_address(page));
> >  			set_page_mte_tagged(page);
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index acbc7530d2b2..f5fa583acf18 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/kprobes.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/page-flags.h>
> > +#include <linux/page-isolation.h>
> >  #include <linux/sched/signal.h>
> >  #include <linux/sched/debug.h>
> >  #include <linux/highmem.h>
> > @@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
> >  	mte_zero_clear_page_tags(page_address(page));
> >  	set_page_mte_tagged(page);
> >  }
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct page *page = NULL;
> > +	pte_t new_pte, old_pte;
> > +	bool writable = false;
> > +	vm_fault_t err;
> > +	int ret;
> > +
> > +	spin_lock(vmf->ptl);
> > +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		return 0;
> > +	}
> > +
> > +	/* Get the normal PTE  */
> > +	old_pte = ptep_get(vmf->pte);
> > +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > +
> > +	/*
> > +	 * Detect now whether the PTE could be writable; this information
> > +	 * is only valid while holding the PT lock.
> > +	 */
> > +	writable = pte_write(new_pte);
> > +	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> > +	    can_change_pte_writable(vma, vmf->address, new_pte))
> > +		writable = true;
> > +
> > +	page = vm_normal_page(vma, vmf->address, new_pte);
> > +	if (!page || is_zone_device_page(page))
> > +		goto out_map;
> > +
> > +	/*
> > +	 * This should never happen, once a VMA has been marked as tagged, that
> > +	 * cannot be changed.
> > +	 */
> > +	if (!(vma->vm_flags & VM_MTE))
> > +		goto out_map;
> > +
> > +	/* Prevent the page from being unmapped from under us. */
> > +	get_page(page);
> > +	vma_set_access_pid_bit(vma);
> > +
> > +	/*
> > +	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
> > +	 * and spin_lock() above which takes the ptl lock. Both locks should be
> > +	 * balanced after this point.
> > +	 */
> > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > +	/*
> > +	 * Probably the page is being isolated for migration, replay the fault
> > +	 * to give time for the entry to be replaced by a migration pte.
> > +	 */
> > +	if (unlikely(is_migrate_isolate_page(page)))
> > +		goto out_retry;
> > +
> > +	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
> > +	if (ret)
> > +		goto out_retry;
> > +
> > +	put_page(page);
> > +
> > +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> > +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		return 0;
> > +	}
> > +
> > +out_map:
> > +	/*
> > +	 * Make it present again, depending on how arch implements
> > +	 * non-accessible ptes, some can allow access by kernel mode.
> > +	 */
> > +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> > +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > +	new_pte = pte_mkyoung(new_pte);
> > +	if (writable)
> > +		new_pte = pte_mkwrite(new_pte, vma);
> > +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
> > +	update_mmu_cache(vma, vmf->address, vmf->pte);
> > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > +	return 0;
> > +
> > +out_retry:
> > +	put_page(page);
> > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > +		vma_end_read(vma);
> > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > +		err = VM_FAULT_RETRY;
> > +	} else {
> > +		/* Replay the fault. */
> > +		err = 0;
> 
> Hello!
> 
> Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
> I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
> 
> How about migrating the page when the migration problem repeats?

Yes, I had the same thought in the previous iteration of the series: the
page was migrated out of the VMA if tag storage couldn't be reserved.

Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
pin will be released before the fault is replayed. Because of this, and
because it makes the code simpler, I chose not to migrate the page if tag
storage couldn't be reserved.

I'd be happy to revisit this if it turns out that in the real world
replaying the fault happens often enough that migrating the page is faster.

In fact, statistics about how often the fault is replayed and how long that
takes would be very helpful.
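
One way to collect that would be a vmstat event bumped on the replay path.
A minimal sketch; MTE_TAG_FAULT_REPLAY is a hypothetical counter, not part
of the posted series, and would need entries in enum vm_event_item and in
vmstat_text[] to show up in /proc/vmstat:

	} else {
		/* Replay the fault and count how often that happens. */
		count_vm_event(MTE_TAG_FAULT_REPLAY);
		err = 0;
	}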

Thanks,
Alex

> 
> Thanks,
> Regards.
> 
> > +	}
> > +	return err;
> > +}
> > +#endif
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index ffdb9b6bed6c..e2c761dd6c41 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> > +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +	return VM_FAULT_SIGBUS;
> > +}
> > +#endif
> > +
> >  #endif /* CONFIG_MMU */
> >  
> >  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 89971a894b60..a90eefc3ee80 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
> >  config ARCH_HAS_CACHE_LINE_SIZE
> >  	bool
> >  
> > +config ARCH_HAS_FAULT_ON_ACCESS
> > +	bool
> > +
> >  config ARCH_HAS_CURRENT_STACK_POINTER
> >  	bool
> >  	help
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e137f7673749..a04a971200b9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >  	if (!pte_present(vmf->orig_pte))
> >  		return do_swap_page(vmf);
> >  
> > +	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > +		return arch_do_page_fault_on_access(vmf);
> > +
> >  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> >  		return do_numa_page(vmf);
> >  
> > -- 
> > 2.42.1
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-30 12:06     ` Alexandru Elisei
@ 2023-11-30 12:49       ` David Hildenbrand
  2023-11-30 13:32         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-30 12:49 UTC (permalink / raw)
  To: Alexandru Elisei, Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

>>> +
>>> +out_retry:
>>> +	put_page(page);
>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>> +		vma_end_read(vma);
>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>> +		err = VM_FAULT_RETRY;
>>> +	} else {
>>> +		/* Replay the fault. */
>>> +		err = 0;
>>
>> Hello!
>>
>> Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
>> I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
>>
>> How about migrating the page when the migration problem repeats?
> 
> Yes, I had the same thought in the previous iteration of the series: the
> page was migrated out of the VMA if tag storage couldn't be reserved.
> 
> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> pin will be released before the fault is replayed. Because of this, and
> because it makes the code simpler, I chose not to migrate the page if tag
> storage couldn't be reserved.

There are still some cases that are theoretically problematic: 
vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet.

All these things also affect other users that rely on movability (e.g., 
CMA, memory hotunplug).
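
For illustration, a minimal userspace sketch of the problematic pattern
(assuming the glibc vmsplice() wrapper): the page backing the buffer stays
pinned for as long as the pipe holds a reference to it, with no
FOLL_LONGTERM involved:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	struct iovec iov;
	char *buf;

	if (pipe(pipefd))
		return 1;

	buf = aligned_alloc(4096, 4096);
	if (!buf)
		return 1;
	memset(buf, 0x5a, 4096);	/* fault the page in */

	iov.iov_base = buf;
	iov.iov_len = 4096;
	/* The pipe now references the user page directly. */
	if (vmsplice(pipefd[1], &iov, 1, 0) < 0)
		return 1;

	pause();	/* never drain the pipe: the page stays pinned */
	return 0;
}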

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-30 12:49       ` David Hildenbrand
@ 2023-11-30 13:32         ` Alexandru Elisei
  2023-11-30 13:43           ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 13:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hyesoo Yu, catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > +
> > > > +out_retry:
> > > > +	put_page(page);
> > > > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > +		vma_end_read(vma);
> > > > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > +		err = VM_FAULT_RETRY;
> > > > +	} else {
> > > > +		/* Replay the fault. */
> > > > +		err = 0;
> > > 
> > > Hello!
> > > 
> > > Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
> > > I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
> > > 
> > > How about migrating the page when the migration problem repeats?
> > 
> > Yes, I had the same thought in the previous iteration of the series: the
> > page was migrated out of the VMA if tag storage couldn't be reserved.
> > 
> > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > pin will be released before the fault is replayed. Because of this, and
> > because it makes the code simpler, I chose not to migrate the page if tag
> > storage couldn't be reserved.
> 
> There are still some cases that are theoretically problematic: vmsplice()
> can pin pages forever and doesn't use FOLL_LONGTERM yet.
> 
> All these things also affect other users that rely on movability (e.g., CMA,
> memory hotunplug).

I wasn't aware of that, thank you for the information. Then to ensure that the
process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
tag storage cannot be reserved. Looking over the code again, I think I can reuse
the same function that migrates tag storage pages out of the MTE VMA (added in
patch #21), so no major changes needed.
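
Roughly, reusing the patch #21 helper on the failure path could look like
this (a sketch only, not tested; migrate_tag_storage_page() returns with
the page reference dropped):

out_retry:
	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
		vma_end_read(vma);
	/* Replaces put_page(): migrates the page and drops our reference. */
	migrate_tag_storage_page(page);
	if (fault_flag_allow_retry_first(vmf->flags))
		err = VM_FAULT_RETRY;
	else
		err = 0;	/* replay the fault */
	return err;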

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-30 13:32         ` Alexandru Elisei
@ 2023-11-30 13:43           ` David Hildenbrand
  2023-11-30 14:33             ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2023-11-30 13:43 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Hyesoo Yu, catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 30.11.23 14:32, Alexandru Elisei wrote:
> Hi,
> 
> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>> +
>>>>> +out_retry:
>>>>> +	put_page(page);
>>>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>> +		vma_end_read(vma);
>>>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>> +		err = VM_FAULT_RETRY;
>>>>> +	} else {
>>>>> +		/* Replay the fault. */
>>>>> +		err = 0;
>>>>
>>>> Hello!
>>>>
>>>> Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
>>>> I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
>>>>
>>>> How about migrating the page when the migration problem repeats?
>>>
>>> Yes, I had the same thought in the previous iteration of the series: the
>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>
>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>> pin will be released before the fault is replayed. Because of this, and
>>> because it makes the code simpler, I chose not to migrate the page if tag
>>> storage couldn't be reserved.
>>
>> There are still some cases that are theoretically problematic: vmsplice()
>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>
>> All these things also affect other users that rely on movability (e.g., CMA,
>> memory hotunplug).
> 
> I wasn't aware of that, thank you for the information. Then to ensure that the
> process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
> tag storage cannot be reserved. Looking over the code again, I think I can reuse
> the same function that migrates tag storage pages out of the MTE VMA (added in
> patch #21), so no major changes needed.

It's going to be interesting if migrating that page fails because it is 
pinned :/

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-30 13:43           ` David Hildenbrand
@ 2023-11-30 14:33             ` Alexandru Elisei
  2023-11-30 14:39               ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 14:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hyesoo Yu, catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
> On 30.11.23 14:32, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > > > +
> > > > > > +out_retry:
> > > > > > +	put_page(page);
> > > > > > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > > > +		vma_end_read(vma);
> > > > > > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > > > +		err = VM_FAULT_RETRY;
> > > > > > +	} else {
> > > > > > +		/* Replay the fault. */
> > > > > > +		err = 0;
> > > > > 
> > > > > Hello!
> > > > > 
> > > > > Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
> > > > > I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
> > > > >
> > > > > How about migrating the page when the migration problem repeats?
> > > > 
> > > > Yes, I had the same thought in the previous iteration of the series: the
> > > > page was migrated out of the VMA if tag storage couldn't be reserved.
> > > > 
> > > > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > > > pin will be released before the fault is replayed. Because of this, and
> > > > because it makes the code simpler, I chose not to migrate the page if tag
> > > > storage couldn't be reserved.
> > > 
> > > There are still some cases that are theoretically problematic: vmsplice()
> > > can pin pages forever and doesn't use FOLL_LONGTERM yet.
> > > 
> > > All these things also affect other users that rely on movability (e.g., CMA,
> > > memory hotunplug).
> > 
> > I wasn't aware of that, thank you for the information. Then to ensure that the
> > process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
> > tag storage cannot be reserved. Looking over the code again, I think I can reuse
> > the same function that migrates tag storage pages out of the MTE VMA (added in
> > patch #21), so no major changes needed.
> 
> It's going to be interesting if migrating that page fails because it is
> pinned :/

I imagine that having both the page **and** its tag storage pinned longterm
without FOLL_LONGTERM is going to be exceedingly rare.

Am I mistaken in believing that the problematic vmsplice() behaviour is
recognized as something that needs to be fixed?

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
  2023-11-30 14:33             ` Alexandru Elisei
@ 2023-11-30 14:39               ` David Hildenbrand
  0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2023-11-30 14:39 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Hyesoo Yu, catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

On 30.11.23 15:33, Alexandru Elisei wrote:
> On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
>> On 30.11.23 14:32, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>>>> +
>>>>>>> +out_retry:
>>>>>>> +	put_page(page);
>>>>>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>>>> +		vma_end_read(vma);
>>>>>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>>>> +		err = VM_FAULT_RETRY;
>>>>>>> +	} else {
>>>>>>> +		/* Replay the fault. */
>>>>>>> +		err = 0;
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
>>>>>> I guess that could cause a system stability issue (but I'm not familiar with that, so please let me know if I'm mistaken!).
>>>>>>
>>>>>> How about migrating the page when the migration problem repeats?
>>>>>
>>>>> Yes, I had the same thought in the previous iteration of the series: the
>>>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>>>
>>>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>>>> pin will be released before the fault is replayed. Because of this, and
>>>>> because it makes the code simpler, I chose not to migrate the page if tag
>>>>> storage couldn't be reserved.
>>>>
>>>> There are still some cases that are theoretically problematic: vmsplice()
>>>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>>>
>>>> All these things also affect other users that rely on movability (e.g., CMA,
>>>> memory hotunplug).
>>>
>>> I wasn't aware of that, thank you for the information. Then to ensure that the
>>> process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
>>> tag storage cannot be reserved. Looking over the code again, I think I can reuse
>>> the same function that migrates tag storage pages out of the MTE VMA (added in
>>> patch #21), so no major changes needed.
>>
>> It's going to be interesting if migrating that page fails because it is
>> pinned :/
> 
> I imagine that having both the page **and** its tag storage pinned longterm
> without FOLL_LONGTERM is going to be exceedingly rare.

Yes. I recall that the rule of thumb is that some O_DIRECT I/O can take 
up to 10 seconds, although extremely rare (and maybe not applicable on 
arm64).

> 
> Am I mistaken in believing that the problematic vmsplice() behaviour is
> recognized as something that needs to be fixed?

Yes, it has been for a couple of years. I'm hoping this will actually get fixed now 
that O_DIRECT mostly uses FOLL_PIN instead of FOLL_GET.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE)
  2023-11-28  5:39   ` Peter Collingbourne
@ 2023-11-30 17:43     ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-11-30 17:43 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Peter,

On Mon, Nov 27, 2023 at 09:39:17PM -0800, Peter Collingbourne wrote:
> Hi Alexandru,
> 
> On Sun, Nov 19, 2023 at 8:59 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/include/asm/mte_tag_storage.h |  1 +
> >  arch/arm64/kernel/mte_tag_storage.c      | 15 +++++++
> >  arch/arm64/mm/fault.c                    | 55 ++++++++++++++++++++++++
> >  include/linux/migrate.h                  |  8 +++-
> >  include/linux/migrate_mode.h             |  1 +
> >  mm/internal.h                            |  6 ---
> >  6 files changed, 78 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index b97406d369ce..6a8b19a6a758 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -33,6 +33,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> >  void free_tag_storage(struct page *page, int order);
> >
> >  bool page_tag_storage_reserved(struct page *page);
> > +bool page_is_tag_storage(struct page *page);
> >
> >  vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> >  vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index a1cc239f7211..5096ce859136 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -500,6 +500,21 @@ bool page_tag_storage_reserved(struct page *page)
> >         return test_bit(PG_tag_storage_reserved, &page->flags);
> >  }
> >
> > +bool page_is_tag_storage(struct page *page)
> > +{
> > +       unsigned long pfn = page_to_pfn(page);
> > +       struct range *tag_range;
> > +       int i;
> > +
> > +       for (i = 0; i < num_tag_regions; i++) {
> > +               tag_range = &tag_regions[i].tag_range;
> > +               if (tag_range->start <= pfn && pfn <= tag_range->end)
> > +                       return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> >  int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> >  {
> >         unsigned long start_block, end_block;
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 6730a0812a24..964c5ae161a3 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/extable.h>
> >  #include <linux/kfence.h>
> >  #include <linux/signal.h>
> > +#include <linux/migrate.h>
> >  #include <linux/mm.h>
> >  #include <linux/hardirq.h>
> >  #include <linux/init.h>
> > @@ -956,6 +957,50 @@ void tag_clear_highpage(struct page *page)
> >  }
> >
> >  #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +
> > +#define MR_TAGGED_TAG_STORAGE  MR_ARCH_1
> > +
> > +extern bool isolate_lru_page(struct page *page);
> > +extern void putback_movable_pages(struct list_head *l);
> 
> Could we move these declarations to a non-mm-internal header and
> #include it instead of manually declaring them here?

Yes, that's better than this hackish way of doing it.

> 
> > +
> > +/* Returns with the page reference dropped. */
> > +static void migrate_tag_storage_page(struct page *page)
> > +{
> > +       struct migration_target_control mtc = {
> > +               .nid = NUMA_NO_NODE,
> > +               .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
> > +       };
> > +       unsigned long i, nr_pages = compound_nr(page);
> > +       LIST_HEAD(pagelist);
> > +       int ret, tries;
> > +
> > +       lru_cache_disable();
> > +
> > +       for (i = 0; i < nr_pages; i++) {
> > +               if (!isolate_lru_page(page + i)) {
> > +                       ret = -EAGAIN;
> > +                       goto out;
> > +               }
> > +               /* Isolate just grabbed another reference, drop ours. */
> > +               put_page(page + i);
> > +               list_add_tail(&(page + i)->lru, &pagelist);
> > +       }
> > +
> > +       tries = 5;
> > +       while (tries--) {
> > +               ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc,
> > +                                   MIGRATE_SYNC, MR_TAGGED_TAG_STORAGE, NULL);
> > +               if (ret == 0 || ret != -EBUSY)
> 
> This could be simplified to:
> 
> if (ret != -EBUSY)

Indeed! I can do the same thing in reserve_tag_storage(), in the loop where I
call alloc_contig_range().
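
As a sketch, the reserve_tag_storage() loop with the same simplification
(paraphrased from the series, not verbatim):

	tries = 5;
	while (tries--) {
		ret = alloc_contig_range(block, block + region->block_size,
					 MIGRATE_CMA, gfp);
		/* ret == 0 also takes this exit, as Peter points out. */
		if (ret != -EBUSY)
			break;
	}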

Thanks,
Alex

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-11-29  8:44   ` Hyesoo Yu
  2023-11-30 11:56     ` Alexandru Elisei
@ 2023-12-03 12:14     ` Alexandru Elisei
  2023-12-08  5:03       ` Hyesoo Yu
  1 sibling, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-03 12:14 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> Hello.
> 
> On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> > 
> > The DTB node for the tag storage region is defined as:
> > 
> >         tags0: tag-storage@8f8000000 {
> >                 compatible = "arm,mte-tag-storage";
> >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> >                 block-size = <0x1000>;
> >                 memory = <&memory0>;	// Associated tagged memory node
> >         };
> >
> 
> How about using compatible = "shared-dma-pool" like below?
> 
> &reserved_memory {
> 	tags0: tag0@8f8000000 {
> 		compatible = "arm,mte-tag-storage";
>         	reg = <0x08 0xf8000000 0x00 0x4000000>;
> 	};
> }
> 
> tag-storage {
>         compatible = "arm,mte-tag-storage";
> 	memory-region = <&tag>;
>         memory = <&memory0>;
> 	block-size = <0x1000>;
> }
> 
> And then, the activation of CMA would be performed in the CMA code.
> We can just get the region information from memory-region and allocate it directly
> with alloc_contig_range() or take_page_off_buddy(). It seems like we could remove a lot of code.

Played with reserved_mem a bit. I don't think that's the correct path
forward.

The location of the tag storage is a hardware property, independent of how
Linux is configured.

early_init_fdt_scan_reserved_mem() is called from arm64_memblock_init(),
**after** the kernel enforces an upper address for various reasons. One of
the reasons can be that it's been compiled with a 39-bit VA.

After early_init_fdt_scan_reserved_mem() returns, the kernel sets the
maximum address, stored in the variable "high_memory".

What can happen is that tag storage is present at an address above the
maximum addressable by the kernel, and the CMA code will trigger an
unrecoverable page fault.

I was able to trigger this with the dts change:

diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..201359d014e4 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -183,6 +183,13 @@ vram: vram@18000000 {
                        reg = <0x00000000 0x18000000 0 0x00800000>;
                        no-map;
                };
+
+
+               linux,cma {
+                       compatible = "shared-dma-pool";
+                       reg = <0x100 0x0 0x00 0x4000000>;
+                       reusable;
+               };
        };

        gic: interrupt-controller@2f000000 {

And the error I got:

[    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
[    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
[..]
[    0.793193] WARNING: CPU: 0 PID: 1 at mm/cma.c:111 cma_init_reserved_areas+0xa8/0x378
[..]
[    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
[    0.807277] Mem abort info:
[    0.807277]   ESR = 0x0000000096000005
[    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.808110]   SET = 0, FnV = 0
[    0.808443]   EA = 0, S1PTW = 0
[    0.808526]   FSC = 0x05: level 1 translation fault
[    0.808943] Data abort info:
[    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.810221] [00000001fe000000] user address but active_mm is swapper
[..]
[    0.820887] Call trace:
[    0.821027]  cma_init_reserved_areas+0xc4/0x378
[    0.821443]  do_one_initcall+0x7c/0x1c0
[    0.821860]  kernel_init_freeable+0x1bc/0x284
[    0.822277]  kernel_init+0x24/0x1dc
[    0.822693]  ret_from_fork+0x10/0x20
[    0.823554] Code: 9127a29a cb813321 d37ae421 8b030020 (f8636822)
[    0.823554] ---[ end trace 0000000000000000 ]---
[    0.824360] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.824443] SMP: stopping secondary CPUs
[    0.825193] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Should reserved mem check if the reserved memory is actually addressable by
the kernel if it's not "no-map"? Should cma fail gracefully if
!pfn_valid(base_pfn)? Should early_init_fdt_scan_reserved_mem() be moved
before arm64_bootmem_init()? I don't have the answer to any of those. And
I got a kernel panic because the kernel cannot address that memory (39-bit
VA). I don't know what would happen if the upper limit is reduced for
another reason.

What I think should happen:

1. Add the tag storage memory before any limits are enforced by
arm64_bootmem_init().

2. Call cma_declare_contiguous_nid() after arm64_bootmem_init(), because
the function will check the memory limit.

3. Have an arch initcall that checks that the CMA regions corresponding to
the tag storage have been activated successfully (cma_init_reserved_areas()
is a core initcall). If not, then don't enable tag storage.

How does that sound to you?
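
A rough sketch of step 3 (the cma handle stored in struct tag_region is
illustrative, not from the posted series; the check relies on the CMA core
zeroing the size of a region whose activation failed, which is what
cma_activate_area() does on the error path):

static int __init mte_tag_storage_check_cma(void)
{
	int i;

	for (i = 0; i < num_tag_regions; i++) {
		/* cma_init_reserved_areas() is a core_initcall, so it has
		 * already run by the time an arch_initcall executes. */
		if (cma_get_size(tag_regions[i].cma) == 0) {
			pr_err("Tag storage CMA region %d failed to activate", i);
			num_tag_regions = 0;
			return -EINVAL;
		}
	}

	return 0;
}
arch_initcall(mte_tag_storage_check_cma);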

Thanks,
Alex

> 
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> > 
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
> 
> Is the kernel page size taken into account, like 16K or 64K pages? The comment says
> it should be a multiple of 4K, but more accurately it should be a multiple of the page size.
> Please let me know if there's anything I misunderstood. :-)
> 
> 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/Kconfig                       |  12 ++
> >  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
> >  arch/arm64/kernel/Makefile               |   1 +
> >  arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
> >  arch/arm64/kernel/setup.c                |   7 +
> >  5 files changed, 291 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> >  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >  
> >  	  Documentation/arch/arm64/memory-tagging-extension.rst.
> >  
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > +	bool "Dynamic MTE tag storage management"
> > +	help
> > +	  Adds support for dynamic management of the memory used by the hardware
> > +	  for storing MTE tags. This memory, unlike normal memory, cannot be
> > +	  tagged. When it is used to store tags for another memory location it
> > +	  cannot be used for any type of allocation.
> > +
> > +	  If unsure, say N
> > +endif # ARM64_MTE
> > +
> >  endmenu # "ARMv8.5 architectural features"
> >  
> >  menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H  */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
> >  obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
> >  obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
> >  obj-$(CONFIG_ARM64_MTE)			+= mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)	+= mte_tag_storage.o
> >  obj-y					+= vdso-wrap.o
> >  obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
> >  obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > +	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
> > +	struct range tag_range;	/* Tag storage memory, in PFNs. */
> > +	u32 block_size;		/* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS	32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > +						int reg_len, struct range *range)
> > +{
> > +	int addr_cells = dt_root_addr_cells;
> > +	int size_cells = dt_root_size_cells;
> > +	u64 size;
> > +
> > +	if (reg_len / 4 > addr_cells + size_cells)
> > +		return -EINVAL;
> > +
> > +	range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > +	size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > +	if (size == 0) {
> > +		pr_err("Invalid node");
> > +		return -EINVAL;
> > +	}
> > +	range->end = range->start + size - 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > +						    struct range *tag_range)
> > +{
> > +	const __be32 *reg;
> > +	int reg_len;
> > +
> > +	reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +	if (reg == NULL) {
> > +		pr_err("Invalid metadata node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > +	const __be32 *reg;
> > +	int reg_len;
> > +
> > +	reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > +	if (reg == NULL)
> > +		reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > +	if (reg == NULL) {
> > +		pr_err("Invalid memory node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > +	unsigned long node;
> > +	u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > +				       int depth, void *data)
> > +{
> > +	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > +	struct find_memory_node_arg *arg = data;
> > +
> > +	if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > +		return 0;
> > +
> > +	if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > +		arg->node = node;
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > +	struct find_memory_node_arg arg = { 0 };
> > +	const __be32 *memory_prop;
> > +	u32 mem_phandle;
> > +	int ret, reg_len;
> > +
> > +	memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > +	if (!memory_prop) {
> > +		pr_err("Missing 'memory' property in the tag storage node");
> > +		return -EINVAL;
> > +	}
> > +
> > +	mem_phandle = be32_to_cpup(memory_prop);
> > +	arg.phandle = mem_phandle;
> > +
> > +	ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > +	if (ret != 1) {
> > +		pr_err("Associated memory node not found");
> > +		return -EINVAL;
> > +	}
> > +
> > +	*mem_node = arg.node;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > +					       u32 *retval)
> > +{
> > +	const __be32 *reg;
> > +
> > +	reg = of_get_flat_dt_prop(node, propname, NULL);
> > +	if (!reg)
> > +		return -EINVAL;
> > +
> > +	*retval = be32_to_cpup(reg);
> > +	return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > +	u32 a = PAGE_SIZE;
> > +	u32 b = block_size_bytes;
> > +	u32 r;
> > +
> > +	/* Find greatest common divisor using the Euclidian algorithm. */
> > +	do {
> > +		r = a % b;
> > +		a = b;
> > +		b = r;
> > +	} while (b != 0);
> > +
> > +	return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > +				       int depth, void *data)
> > +{
> > +	struct tag_region *region;
> > +	unsigned long mem_node;
> > +	struct range *mem_range;
> > +	struct range *tag_range;
> > +	u32 block_size_bytes;
> > +	u32 nid = 0;
> > +	int ret;
> > +
> > +	if (depth != 1 || !strstr(uname, "tag-storage"))
> > +		return 0;
> > +
> > +	if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > +		return 0;
> > +
> > +	if (num_tag_regions == MAX_TAG_REGIONS) {
> > +		pr_err("Maximum number of tag storage regions exceeded");
> > +		return -EINVAL;
> > +	}
> > +
> > +	region = &tag_regions[num_tag_regions];
> > +	mem_range = &region->mem_range;
> > +	tag_range = &region->tag_range;
> > +
> > +	ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > +	if (ret) {
> > +		pr_err("Invalid tag storage node");
> > +		return ret;
> > +	}
> > +
> > +	ret = tag_storage_get_memory_node(node, &mem_node);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > +	if (ret) {
> > +		pr_err("Invalid address for associated data memory node");
> > +		return ret;
> > +	}
> > +
> > +	/* The tag region must exactly match the corresponding memory. */
> > +	if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > +		pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > +		       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > +		       PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > +	if (ret || block_size_bytes == 0) {
> > +		pr_err("Invalid or missing 'block-size' property");
> > +		return -EINVAL;
> > +	}
> > +	region->block_size = get_block_size_pages(block_size_bytes);
> > +	if (range_len(tag_range) % region->block_size != 0) {
> > +		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > +		       PFN_PHYS(range_len(tag_range)), region->block_size);
> > +		return -EINVAL;
> > +	}
> > +
> 
> I was confused about the variable "block_size". The block size declared in the device tree is
> in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> confusion as it might be interpreted as bytes. If possible, I suggest changing the term "block_size"
> to something more readable, such as "block_nr_pages" (this is just an example!).
> 
> Thanks,
> Regards.
> 
> > +	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > +	if (ret)
> > +		nid = numa_node_id();
> > +
> > +	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > +				nid, MEMBLOCK_NONE);
> > +	if (ret) {
> > +		pr_err("Error adding tag memblock (%d)", ret);
> > +		return ret;
> > +	}
> > +	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +
> > +	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > +		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > +
> > +	num_tag_regions++;
> > +
> > +	return 0;
> > +}
> > +
> > +void __init mte_tag_storage_init(void)
> > +{
> > +	struct range *tag_range;
> > +	int i, ret;
> > +
> > +	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > +	if (ret) {
> > +		for (i = 0; i < num_tag_regions; i++) {
> > +			tag_range = &tag_regions[i].tag_range;
> > +			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > +		}
> > +		num_tag_regions = 0;
> > +		pr_info("MTE tag storage region management disabled");
> > +	}
> > +}
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 417a8a86b2db..1b77138c1aa5 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -42,6 +42,7 @@
> >  #include <asm/cpufeature.h>
> >  #include <asm/cpu_ops.h>
> >  #include <asm/kasan.h>
> > +#include <asm/mte_tag_storage.h>
> >  #include <asm/numa.h>
> >  #include <asm/scs.h>
> >  #include <asm/sections.h>
> > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> >  			   FW_BUG "Booted with MMU enabled!");
> >  	}
> >  
> > +	/*
> > +	 * Must be called before memory limits are enforced by
> > +	 * arm64_memblock_init().
> > +	 */
> > +	mte_tag_storage_init();
> > +
> >  	arm64_memblock_init();
> >  
> >  	paging_init();
> > -- 
> > 2.42.1
> > 
> > 



^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-03 12:14     ` Alexandru Elisei
@ 2023-12-08  5:03       ` Hyesoo Yu
  2023-12-11 14:45         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Hyesoo Yu @ 2023-12-08  5:03 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel


Hi, 

I'm sorry for the late response; I was on vacation.

On Sun, Dec 03, 2023 at 12:14:30PM +0000, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> > Hello.
> > 
> > On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > > Allow the kernel to get the size and location of the MTE tag storage
> > > regions from the DTB. This memory is marked as reserved for now.
> > > 
> > > The DTB node for the tag storage region is defined as:
> > > 
> > >         tags0: tag-storage@8f8000000 {
> > >                 compatible = "arm,mte-tag-storage";
> > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > >                 block-size = <0x1000>;
> > >                 memory = <&memory0>;	// Associated tagged memory node
> > >         };
> > >
> > 
> > How about using compatible = "shared-dma-pool" like below?
> > 
> > &reserved_memory {
> > 	tags0: tag0@8f8000000 {
> > 		compatible = "arm,mte-tag-storage";
> >         	reg = <0x08 0xf8000000 0x00 0x4000000>;
> > 	};
> > }
> > 
> > tag-storage {
> >         compatible = "arm,mte-tag-storage";
> > 	memory-region = <&tag>;
> >         memory = <&memory0>;
> > 	block-size = <0x1000>;
> > }
> > 
> > And then, the activation of CMA would be performed in the CMA code.
> > We can just get the region information from memory-region and allocate it directly
> > with alloc_contig_range() or take_page_off_buddy(). It seems like we could remove a lot of code.
>

Sorry, that example was my mistake. Actually, I wanted to write it like this:

&reserved_memory {
	tags0: tag0@8f8000000 {
		compatible = "shared-dma-pool";
        	reg = <0x08 0xf8000000 0x00 0x4000000>;
		reusable;
	};
}

tag-storage {
        compatible = "arm,mte-tag-storage";
 	memory-region = <&tag>;
        memory = <&memory0>;
	block-size = <0x1000>;
}


> Played with reserved_mem a bit. I don't think that's the correct path
> forward.
> 
> The location of the tag storage is a hardware property, independent of how
> Linux is configured.
> 
> early_init_fdt_scan_reserved_mem() is called from arm64_memblock_init(),
> **after** the kernel enforces an upper address for various reasons. One of
> the reasons can be that it's been compiled with a 39-bit VA.
> 

I'm not sure about this part. What is the upper address enforced by the kernel?
Where can I check the code? Do you mean memblock_end_of_DRAM()?

> After early_init_fdt_scan_reserved_mem() returns, the kernel sets the
> maximum address, stored in the variable "high_memory".
>
> What can happen is that tag storage is present at an address above the
> maximum addressable by the kernel, and the CMA code will trigger an
> unrecoverable page fault.
> 
> I was able to trigger this with the dts change:
> 
> diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> index 60472d65a355..201359d014e4 100644
> --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> @@ -183,6 +183,13 @@ vram: vram@18000000 {
>                         reg = <0x00000000 0x18000000 0 0x00800000>;
>                         no-map;
>                 };
> +
> +
> +               linux,cma {
> +                       compatible = "shared-dma-pool";
> +                       reg = <0x100 0x0 0x00 0x4000000>;
> +                       reusable;
> +               };
>         };
> 
>         gic: interrupt-controller@2f000000 {
> 
> And the error I got:
> 
> [    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> [    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> [..]
> [    0.793193] WARNING: CPU: 0 PID: 1 at mm/cma.c:111 cma_init_reserved_areas+0xa8/0x378
> [..]
> [    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> [    0.807277] Mem abort info:
> [    0.807277]   ESR = 0x0000000096000005
> [    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    0.808110]   SET = 0, FnV = 0
> [    0.808443]   EA = 0, S1PTW = 0
> [    0.808526]   FSC = 0x05: level 1 translation fault
> [    0.808943] Data abort info:
> [    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> [    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [    0.810221] [00000001fe000000] user address but active_mm is swapper
> [..]
> [    0.820887] Call trace:
> [    0.821027]  cma_init_reserved_areas+0xc4/0x378
> [    0.821443]  do_one_initcall+0x7c/0x1c0
> [    0.821860]  kernel_init_freeable+0x1bc/0x284
> [    0.822277]  kernel_init+0x24/0x1dc
> [    0.822693]  ret_from_fork+0x10/0x20
> [    0.823554] Code: 9127a29a cb813321 d37ae421 8b030020 (f8636822)
> [    0.823554] ---[ end trace 0000000000000000 ]---
> [    0.824360] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    0.824443] SMP: stopping secondary CPUs
> [    0.825193] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
> 
> Should reserved mem check if the reserved memory is actually addressable by
> the kernel if it's not "no-map"? Should cma fail gracefully if
> !pfn_valid(base_pfn)? Should early_init_fdt_scan_reserved_mem() be moved
> before arm64_bootmem_init()? I don't have the answer to any of those. And
> I got a kernel panic because the kernel cannot address that memory (39-bit
> VA). I don't know what would happen if the upper limit is reduced for
> another reason.
> 

My answer may not be accurate because I don't understand what this upper limit is.
Is this a problem caused by the tag storage area not being included in the memory node?

The reason for not including it in the memory node is to enable static MTE when dynamic MTE
initialization fails, right? I think I missed that. I thought the tag storage was included
in the memory node and registered as CMA.

> What I think should happen:
> 
> 1. Add the tag storage memory before any limits are enforced by
> arm64_bootmem_init().
>
> 2. Call cma_declare_contiguous_nid() after arm64_bootmem_init(), because
> the function will check the memory limit.
> 
> 3. Have an arch initcall that checks that the CMA regions corresponding to
> the tag storage have been activated successfully (cma_init_reserved_areas()
> is a core initcall). If not, then don't enable tag storage.
> 
> How does that sound to you?
> 
> Thanks,
> Alex
> 

I think this is a good way to utilize the CMA code!

Thanks,
Regards.

> > > +	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > +	if (ret || block_size_bytes == 0) {
> > > +		pr_err("Invalid or missing 'block-size' property");
> > > +		return -EINVAL;
> > > +	}
> > > +	region->block_size = get_block_size_pages(block_size_bytes);
> > > +	if (range_len(tag_range) % region->block_size != 0) {
> > > +		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > +		       PFN_PHYS(range_len(tag_range)), region->block_size);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > 
> > I was confused about the variable "block_size". The block size declared in the device tree is
> > in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> > confusion as it might be interpreted as bytes. If possible, I suggest changing the term "block_size"
> > to something more readable, such as "block_nr_pages" (this is just an example!).
> > 
> > Thanks,
> > Regards.
>>

What do you think about this?

Thanks,
Regards.

> > > +	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > > +	if (ret)
> > > +		nid = numa_node_id();
> > > +
> > > +	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > > +				nid, MEMBLOCK_NONE);
> > > +	if (ret) {
> > > +		pr_err("Error adding tag memblock (%d)", ret);
> > > +		return ret;
> > > +	}
> > > +	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > +
> > > +	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > > +		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > > +
> > > +	num_tag_regions++;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +void __init mte_tag_storage_init(void)
> > > +{
> > > +	struct range *tag_range;
> > > +	int i, ret;
> > > +
> > > +	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > > +	if (ret) {
> > > +		for (i = 0; i < num_tag_regions; i++) {
> > > +			tag_range = &tag_regions[i].tag_range;
> > > +			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > +		}
> > > +		num_tag_regions = 0;
> > > +		pr_info("MTE tag storage region management disabled");
> > > +	}
> > > +}
> > > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > > index 417a8a86b2db..1b77138c1aa5 100644
> > > --- a/arch/arm64/kernel/setup.c
> > > +++ b/arch/arm64/kernel/setup.c
> > > @@ -42,6 +42,7 @@
> > >  #include <asm/cpufeature.h>
> > >  #include <asm/cpu_ops.h>
> > >  #include <asm/kasan.h>
> > > +#include <asm/mte_tag_storage.h>
> > >  #include <asm/numa.h>
> > >  #include <asm/scs.h>
> > >  #include <asm/sections.h>
> > > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> > >  			   FW_BUG "Booted with MMU enabled!");
> > >  	}
> > >  
> > > +	/*
> > > +	 * Must be called before memory limits are enforced by
> > > +	 * arm64_memblock_init().
> > > +	 */
> > > +	mte_tag_storage_init();
> > > +
> > >  	arm64_memblock_init();
> > >  
> > >  	paging_init();
> > > -- 
> > > 2.42.1
> > > 
> > > 
> 
> 
> 




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-11-30 12:00     ` Alexandru Elisei
@ 2023-12-08  5:27       ` Hyesoo Yu
  2023-12-11 14:21         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Hyesoo Yu @ 2023-12-08  5:27 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel


Hi~

On Thu, Nov 30, 2023 at 12:00:11PM +0000, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Nov 29, 2023 at 05:57:44PM +0900, Hyesoo Yu wrote:
> > On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> > > alloc_contig_range() requires that the requested pages are in the same
> > > zone. Check that this is indeed the case before initializing the tag
> > > storage blocks.
> > > 
> > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > ---
> > >  arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> > >  1 file changed, 33 insertions(+)
> > > 
> > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > index 8b9bedf7575d..fd63430d4dc0 100644
> > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> > >  	}
> > >  }
> > >  
> > > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > > +static int __init mte_tag_storage_check_zone(void)
> > > +{
> > > +	struct range *tag_range;
> > > +	struct zone *zone;
> > > +	unsigned long pfn;
> > > +	u32 block_size;
> > > +	int i, j;
> > > +
> > > +	for (i = 0; i < num_tag_regions; i++) {
> > > +		block_size = tag_regions[i].block_size;
> > > +		if (block_size == 1)
> > > +			continue;
> > > +
> > > +		tag_range = &tag_regions[i].tag_range;
> > > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > > +			zone = page_zone(pfn_to_page(pfn));
> > 
> > Hello.
> > 
> > Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
> > out of the loop?
>
> Hmm.. why do you say that the pages in a tag_range must be in the same
> zone? I am not very familiar with how the memory management code puts pages
> into zones, but I would imagine that pages in a tag range straddling the
> 4GB limit (so, let's say, from 3GB to 5GB) will end up in both ZONE_DMA and
> ZONE_NORMAL.
> 
> Thanks,
> Alex
> 

Oh, I see that reserve_tag_storage only calls alloc_contig_range in units of
block_size; I thought it could be called for the entire range that the page
needs at once. (Maybe that could be a bit faster? The drain and the other
operations wouldn't be repeated unnecessarily.)

If we use the CMA code when activating the tag storage, it will be an error if
the entire tag region is not in the same zone, so there should be a constraint
that it must be in the same zone when defining the tag region in the device
tree.

Thanks,
Regards.

> > 
> > Thanks,
> > Regards.
> > 
> > > +			for (j = 1; j < block_size; j++) {
> > > +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > > +					pr_err("Tag storage block pages in different zones");
> > > +					return -EINVAL;
> > > +				}
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static int __init mte_tag_storage_activate_regions(void)
> > >  {
> > >  	phys_addr_t dram_start, dram_end;
> > > @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
> > >  		goto out_disabled;
> > >  	}
> > >  
> > > +	ret = mte_tag_storage_check_zone();
> > > +	if (ret)
> > > +		goto out_disabled;
> > > +
> > >  	for (i = 0; i < num_tag_regions; i++) {
> > >  		tag_range = &tag_regions[i].tag_range;
> > >  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > > -- 
> > > 2.42.1
> > > 
> > > 
> 
> 
> 




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation
  2023-11-29 13:33     ` Alexandru Elisei
@ 2023-12-08  5:29       ` Hyesoo Yu
  0 siblings, 0 replies; 98+ messages in thread
From: Hyesoo Yu @ 2023-12-08  5:29 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 7820 bytes --]

Hi.

On Wed, Nov 29, 2023 at 01:33:37PM +0000, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Nov 29, 2023 at 06:10:40PM +0900, Hyesoo Yu wrote:
> > On Sun, Nov 19, 2023 at 04:57:10PM +0000, Alexandru Elisei wrote:
> > > [..]
> > > +static int order_to_num_blocks(int order)
> > > +{
> > > +	return max((1 << order) / 32, 1);
> > > +}
> > > [..]
> > > +int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > > +{
> > > +	unsigned long start_block, end_block;
> > > +	struct tag_region *region;
> > > +	unsigned long block;
> > > +	unsigned long flags;
> > > +	unsigned int tries;
> > > +	int ret = 0;
> > > +
> > > +	VM_WARN_ON_ONCE(!preemptible());
> > > +
> > > +	if (page_tag_storage_reserved(page))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * __alloc_contig_migrate_range() ignores gfp when allocating the
> > > +	 * destination page for migration. Regardless, massage gfp flags and
> > > +	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
> > > +	 * ignored.
> > > +	 */
> > > +	gfp &= ~__GFP_TAGGED;
> > > +	if (!(gfp & __GFP_NORETRY))
> > > +		gfp |= __GFP_RETRY_MAYFAIL;
> > > +
> > > +	ret = tag_storage_find_block(page, &start_block, &region);
> > > +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > > +		return 0;
> > > +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> > > +
> > 
> > Hello.
> > 
> > If the page size is 4K, the block size is 2 pages (8K in bytes), and the order is 6,
> > then we need 2 pages for the tags. However, according to the equation, order_to_num_blocks
> > is 2 and block_size is also 2, so end_block will be incremented by 4.
> > 
> > We actually only need 8K of tag storage for 256K, right?
> > Could you explain order_to_num_blocks * region->block_size in more detail?
> 
> I think you are correct, thank you for pointing it out. The formula should
> probably be something like:
> 
> static int order_to_num_blocks(int order, u32 block_size)
> {
> 	int num_tag_pages = max((1 << order) / 32, 1);
> 
> 	return DIV_ROUND_UP(num_tag_pages, block_size);
> }
> 
> and that will make end_block = start_block + 2 in your scenario.
> 
> Does that look correct to you?
> 
> Thanks,
> Alex
> 

That looks great!

Thanks,
Regards.
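
For reference, a quick numeric check of the corrected formula (an editor's
sketch in user-space C, not code from the series; DIV_ROUND_UP mirrors the
kernel macro of the same name):

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

static int order_to_num_blocks(int order, int block_size)
{
	/* One tag storage page covers 32 data pages. */
	int num_tag_pages = (1 << order) / 32;

	if (num_tag_pages < 1)
		num_tag_pages = 1;

	return DIV_ROUND_UP(num_tag_pages, block_size);
}

int main(void)
{
	int block_size = 2;	/* pages */
	int order = 6;		/* 64 pages, 256K with 4K pages */

	/* Prints 2: end_block advances by one block of 2 pages, instead
	 * of the 4 pages the original formula would have reserved. */
	printf("%d\n", order_to_num_blocks(order, block_size) * block_size);

	return 0;
}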

> > 
> > Thanks,
> > Regards.
> > 
> > > +	mutex_lock(&tag_blocks_lock);
> > > +
> > > +	/* Check again, this time with the lock held. */
> > > +	if (page_tag_storage_reserved(page))
> > > +		goto out_unlock;
> > > +
> > > +	/* Make sure existing entries are not freed from under our feet. */
> > > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > > +	for (block = start_block; block < end_block; block += region->block_size) {
> > > +		if (tag_storage_block_is_reserved(block))
> > > +			block_ref_add(block, region, order);
> > > +	}
> > > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > > +
> > > +	for (block = start_block; block < end_block; block += region->block_size) {
> > > +		/* Refcount incremented above. */
> > > +		if (tag_storage_block_is_reserved(block))
> > > +			continue;
> > > +
> > > +		tries = 3;
> > > +		while (tries--) {
> > > +			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_CMA, gfp);
> > > +			if (ret == 0 || ret != -EBUSY)
> > > +				break;
> > > +		}
> > > +
> > > +		if (ret)
> > > +			goto out_error;
> > > +
> > > +		ret = tag_storage_reserve_block(block, region, order);
> > > +		if (ret) {
> > > +			free_contig_range(block, region->block_size);
> > > +			goto out_error;
> > > +		}
> > > +
> > > +		count_vm_events(CMA_ALLOC_SUCCESS, region->block_size);
> > > +	}
> > > +
> > > +	page_set_tag_storage_reserved(page, order);
> > > +out_unlock:
> > > +	mutex_unlock(&tag_blocks_lock);
> > > +
> > > +	return 0;
> > > +
> > > +out_error:
> > > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > > +	for (block = start_block; block < end_block; block += region->block_size) {
> > > +		if (tag_storage_block_is_reserved(block) &&
> > > +		    block_ref_sub_return(block, region, order) == 1) {
> > > +			__xa_erase(&tag_blocks_reserved, block);
> > > +			free_contig_range(block, region->block_size);
> > > +		}
> > > +	}
> > > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > > +
> > > +	mutex_unlock(&tag_blocks_lock);
> > > +
> > > +	count_vm_events(CMA_ALLOC_FAIL, region->block_size);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +void free_tag_storage(struct page *page, int order)
> > > +{
> > > +	unsigned long block, start_block, end_block;
> > > +	struct tag_region *region;
> > > +	unsigned long flags;
> > > +	int ret;
> > > +
> > > +	ret = tag_storage_find_block(page, &start_block, &region);
> > > +	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > > +		return;
> > > +
> > > +	end_block = start_block + order_to_num_blocks(order) * region->block_size;
> > > +
> > > +	xa_lock_irqsave(&tag_blocks_reserved, flags);
> > > +	for (block = start_block; block < end_block; block += region->block_size) {
> > > +		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
> > > +		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
> > > +			continue;
> > > +
> > > +		if (block_ref_sub_return(block, region, order) == 1) {
> > > +			__xa_erase(&tag_blocks_reserved, block);
> > > +			free_contig_range(block, region->block_size);
> > > +		}
> > > +	}
> > > +	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> > > +}
> > > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > > index 195b077c0fac..e7eb584a9234 100644
> > > --- a/fs/proc/page.c
> > > +++ b/fs/proc/page.c
> > > @@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
> > >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > >  	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
> > >  	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
> > > +	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
> > >  #endif
> > >  
> > >  	return u;
> > > diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
> > > index 859f4b0c1b2b..4a0d719ffdd4 100644
> > > --- a/include/linux/kernel-page-flags.h
> > > +++ b/include/linux/kernel-page-flags.h
> > > @@ -19,5 +19,6 @@
> > >  #define KPF_SOFTDIRTY		40
> > >  #define KPF_ARCH_2		41
> > >  #define KPF_ARCH_3		42
> > > +#define KPF_ARCH_4		43
> > >  
> > >  #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
> > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > index a88e64acebfe..7915165a51bd 100644
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -135,6 +135,7 @@ enum pageflags {
> > >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > >  	PG_arch_2,
> > >  	PG_arch_3,
> > > +	PG_arch_4,
> > >  #endif
> > >  	__NR_PAGEFLAGS,
> > >  
> > > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > > index 6ca0d5ed46c0..ba962fd10a2c 100644
> > > --- a/include/trace/events/mmflags.h
> > > +++ b/include/trace/events/mmflags.h
> > > @@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
> > >  IF_HAVE_PG_IDLE(idle)							\
> > >  IF_HAVE_PG_IDLE(young)							\
> > >  IF_HAVE_PG_ARCH_X(arch_2)						\
> > > -IF_HAVE_PG_ARCH_X(arch_3)
> > > +IF_HAVE_PG_ARCH_X(arch_3)						\
> > > +IF_HAVE_PG_ARCH_X(arch_4)
> > >  
> > >  #define show_page_flags(flags)						\
> > >  	(flags) ? __print_flags(flags, "|",				\
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index f31f02472396..9beead961a65 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2474,6 +2474,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
> > >  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > >  			 (1L << PG_arch_2) |
> > >  			 (1L << PG_arch_3) |
> > > +			 (1L << PG_arch_4) |
> > >  #endif
> > >  			 (1L << PG_dirty) |
> > >  			 LRU_GEN_MASK | LRU_REFS_MASK));
> > > -- 
> > > 2.42.1
> > > 
> > > 
> 
> 
> 




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
  2023-12-08  5:27       ` Hyesoo Yu
@ 2023-12-11 14:21         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-11 14:21 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Dec 08, 2023 at 02:27:39PM +0900, Hyesoo Yu wrote:
> Hi~
> 
> On Thu, Nov 30, 2023 at 12:00:11PM +0000, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Wed, Nov 29, 2023 at 05:57:44PM +0900, Hyesoo Yu wrote:
> > > On Sun, Nov 19, 2023 at 04:57:09PM +0000, Alexandru Elisei wrote:
> > > > alloc_contig_range() requires that the requested pages are in the same
> > > > zone. Check that this is indeed the case before initializing the tag
> > > > storage blocks.
> > > > 
> > > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > > ---
> > > >  arch/arm64/kernel/mte_tag_storage.c | 33 +++++++++++++++++++++++++++++
> > > >  1 file changed, 33 insertions(+)
> > > > 
> > > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > > index 8b9bedf7575d..fd63430d4dc0 100644
> > > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > > @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void)
> > > >  	}
> > > >  }
> > > >  
> > > > +/* alloc_contig_range() requires all pages to be in the same zone. */
> > > > +static int __init mte_tag_storage_check_zone(void)
> > > > +{
> > > > +	struct range *tag_range;
> > > > +	struct zone *zone;
> > > > +	unsigned long pfn;
> > > > +	u32 block_size;
> > > > +	int i, j;
> > > > +
> > > > +	for (i = 0; i < num_tag_regions; i++) {
> > > > +		block_size = tag_regions[i].block_size;
> > > > +		if (block_size == 1)
> > > > +			continue;
> > > > +
> > > > +		tag_range = &tag_regions[i].tag_range;
> > > > +		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
> > > > +			zone = page_zone(pfn_to_page(pfn));
> > > 
> > > Hello.
> > > 
> > > Since the blocks within the tag_range must all be in the same zone, can we move the "page_zone"
> > > out of the loop?
> >
> > Hmm.. why do you say that the pages in a tag_range must be in the same
> > zone? I am not very familiar with how the memory management code puts pages
> > into zones, but I would imagine that pages in a tag range straddling the
> > 4GB limit (so, let's say, from 3GB to 5GB) will end up in both ZONE_DMA and
> > ZONE_NORMAL.
> > 
> > Thanks,
> > Alex
> > 
> 
> Oh, I see that reserve_tag_storage only calls alloc_contig_range in units of
> block_size; I thought it could be called for the entire range that the page
> needs at once. (Maybe that could be a bit faster? The drain and the other
> operations wouldn't be repeated unnecessarily.)

Yes, that might be useful to do. Two things are worth keeping in mind:

- a number of block size pages at the start and end of the range might
  already be reserved for other tagged pages, so the actual range that is
  being reserved might end up being smaller than what we are expecting.

- the most common allocation order is smaller than or equal to
  PAGE_ALLOC_COSTLY_ORDER, which is 3, which means that in the most common
  case reserve_tag_storage reserves only one tag storage block (an order-3
  allocation spans 8 pages, whose tags need less than a single page of tag
  storage).

I will definitely keep this optimization in mind, but I would prefer to get
the series into a more stable shape before looking at performance
optimizations.

> 
> If we use the CMA code when activating the tag storage, it will be an error if
> the entire tag region is not in the same zone, so there should be a constraint
> that it must be in the same zone when defining the tag region in the device
> tree.

I don't think that's the best approach, because the device tree describes
the hardware, which does not change, and this is a software limitation
(i.e., CMA doesn't work if a CMA region spans different zones), which might
get fixed in a future version of Linux.

In my opinion, the simplest solution would be to check that all tag storage
regions have been activated successfully by CMA before enabling tag
storage. Another alternative would be to split the tag storage region into
several CMA regions at a zone boundary, and add it as distinct CMA regions.

Thanks,
Alex

> 
> Thanks,
> Regards.
> 
> > > 
> > > Thanks,
> > > Regards.
> > > 
> > > > +			for (j = 1; j < block_size; j++) {
> > > > +				if (page_zone(pfn_to_page(pfn + j)) != zone) {
> > > > +					pr_err("Tag storage block pages in different zones");
> > > > +					return -EINVAL;
> > > > +				}
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static int __init mte_tag_storage_activate_regions(void)
> > > >  {
> > > >  	phys_addr_t dram_start, dram_end;
> > > > @@ -321,6 +350,10 @@ static int __init mte_tag_storage_activate_regions(void)
> > > >  		goto out_disabled;
> > > >  	}
> > > >  
> > > > +	ret = mte_tag_storage_check_zone();
> > > > +	if (ret)
> > > > +		goto out_disabled;
> > > > +
> > > >  	for (i = 0; i < num_tag_regions; i++) {
> > > >  		tag_range = &tag_regions[i].tag_range;
> > > >  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages)
> > > > -- 
> > > > 2.42.1
> > > > 
> > > > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-08  5:03       ` Hyesoo Yu
@ 2023-12-11 14:45         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-11 14:45 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel, linux-arch,
	linux-mm, linux-trace-kernel

Hi,

On Fri, Dec 08, 2023 at 02:03:44PM +0900, Hyesoo Yu wrote:
> Hi, 
> 
> I'm sorry for the late response, I was on vacation.
> 
> On Sun, Dec 03, 2023 at 12:14:30PM +0000, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Wed, Nov 29, 2023 at 05:44:24PM +0900, Hyesoo Yu wrote:
> > > Hello.
> > > 
> > > On Sun, Nov 19, 2023 at 04:57:05PM +0000, Alexandru Elisei wrote:
> > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > regions from the DTB. This memory is marked as reserved for now.
> > > > 
> > > > The DTB node for the tag storage region is defined as:
> > > > 
> > > >         tags0: tag-storage@8f8000000 {
> > > >                 compatible = "arm,mte-tag-storage";
> > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > >                 block-size = <0x1000>;
> > > >                 memory = <&memory0>;	// Associated tagged memory node
> > > >         };
> > > >
> > > 
> > > How about using compatible = "shared-dma-pool" like below?
> > > 
> > > &reserved_memory {
> > > 	tags0: tag0@8f8000000 {
> > > 		compatible = "arm,mte-tag-storage";
> > > 		reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > 	};
> > > };
> > > 
> > > tag-storage {
> > > 	compatible = "arm,mte-tag-storage";
> > > 	memory-region = <&tags0>;
> > > 	memory = <&memory0>;
> > > 	block-size = <0x1000>;
> > > };
> > > 
> > > And then, the activation of CMA would be performed in the CMA code.
> > > We can just get the region information from memory-region and allocate from it
> > > directly using alloc_contig_range() or take_page_off_buddy(). It seems like we could remove a lot of code.
> >
> 
> Sorry, that example was my mistake. Actually, I wanted to write it like this:
> 
> &reserved_memory {
> 	tags0: tag0@8f8000000 {
> 		compatible = "shared-dma-pool";
> 		reg = <0x08 0xf8000000 0x00 0x4000000>;
> 		reusable;
> 	};
> };
> 
> tag-storage {
> 	compatible = "arm,mte-tag-storage";
> 	memory-region = <&tags0>;
> 	memory = <&memory0>;
> 	block-size = <0x1000>;
> };

I prototyped your suggestion with this change to the device tree:

            reserved-memory {
                    #address-cells = <0x02>;
                    #size-cells = <0x02>;
                    ranges;

                    tags0: tag-storage@8f8000000 {
                            compatible = "arm,mte-tag-storage";
                            reg = <0x08 0xf8000000 0x00 0x4000000>;
                            block-size = <0x1000>;
                            memory = <&memory0>;
                            reusable;
                    };
            };

Would you mind explaining what we are gaining by using reserved mem?

Struct reserved_mem only has the base and size of the tag storage region,
and initialization for reserved mem happens before the DTB is unflattened.
When I prototyped using reserved mem, I still had to write the code to
parse the memory node address and size. This code was the same as the code
needed to parse the tag storage region address and size, so having that
information in struct reserved_mem does not reduce the size of the code by
a meaningful amount.

> 
> 
> > Played with reserved_mem a bit. I don't think that's the correct path
> > forward.
> > 
> > The location of the tag storage is a hardware property, independent of how
> > Linux is configured.
> > 
> > early_init_fdt_scan_reserved_mem() is called from arm64_memblock_init(),
> > **after** the kernel enforces an upper address limit for various reasons. One of
> > the reasons can be that the kernel has been compiled with a 39-bit VA.
> > 
> 
> I'm not sure about this part. What is the upper address enforced by the kernel?
> Where can I check the code? Do you mean memblock_end_of_DRAM()?

I am referring to arch/arm64/mm/init.c::arm64_memblock_init(). The
function initializes reserved mem (in early_init_fdt_scan_reserved_mem())
**after** removing memory from memblock that the kernel cannot address.

> 
> > After early_init_fdt_scan_reserved_mem() returns, the kernel sets the
> > maximum address, stored in the variable "high_memory".
> >
> > What can happen is that tag storage is present at an address above the
> > maximum addressable by the kernel, and the CMA code will trigger an
> > unrecoverable page fault.
> > 
> > I was able to trigger this with the dts change:
> > 
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..201359d014e4 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -183,6 +183,13 @@ vram: vram@18000000 {
> >                         reg = <0x00000000 0x18000000 0 0x00800000>;
> >                         no-map;
> >                 };
> > +
> > +
> > +               linux,cma {
> > +                       compatible = "shared-dma-pool";
> > +                       reg = <0x100 0x0 0x00 0x4000000>;
> > +                       reusable;
> > +               };
> >         };
> > 
> >         gic: interrupt-controller@2f000000 {
> > 
> > And the error I got:
> > 
> > [    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> > [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> > [    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> > [..]
> > [    0.793193] WARNING: CPU: 0 PID: 1 at mm/cma.c:111 cma_init_reserved_areas+0xa8/0x378
> > [..]
> > [    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> > [    0.807277] Mem abort info:
> > [    0.807277]   ESR = 0x0000000096000005
> > [    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [    0.808110]   SET = 0, FnV = 0
> > [    0.808443]   EA = 0, S1PTW = 0
> > [    0.808526]   FSC = 0x05: level 1 translation fault
> > [    0.808943] Data abort info:
> > [    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > [    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [    0.810221] [00000001fe000000] user address but active_mm is swapper
> > [..]
> > [    0.820887] Call trace:
> > [    0.821027]  cma_init_reserved_areas+0xc4/0x378
> > [    0.821443]  do_one_initcall+0x7c/0x1c0
> > [    0.821860]  kernel_init_freeable+0x1bc/0x284
> > [    0.822277]  kernel_init+0x24/0x1dc
> > [    0.822693]  ret_from_fork+0x10/0x20
> > [    0.823554] Code: 9127a29a cb813321 d37ae421 8b030020 (f8636822)
> > [    0.823554] ---[ end trace 0000000000000000 ]---
> > [    0.824360] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> > [    0.824443] SMP: stopping secondary CPUs
> > [    0.825193] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
> > 
> > Should reserved mem check if the reserved memory is actually addressable by
> > the kernel if it's not "no-map"? Should CMA fail gracefully if
> > !pfn_valid(base_pfn)? Should early_init_fdt_scan_reserved_mem() be moved
> > because of arm64_bootmem_init()? I don't have the answer to any of those. And
> > I got a kernel panic because the kernel cannot address that memory (39-bit
> > VA). I don't know what would happen if the upper limit is reduced for
> > another reason.
> > 
> 
> My answer may not be accurate because I don't understand what this upper limit is.
Is this a problem caused by the tag storage area not being included in the memory node?

This problem is caused by the kernel not being able to use virtual addresses
in the linear map (where VAs map linearly onto PAs) to access the tag storage
region.

> 
> The reason for not including it in the memory node is to enable static MTE when dynamic
> MTE initialization fails, right? I think I missed that. I thought the tag storage was
> included in the memory node and registered as CMA.
> 
> > What I think should happen:
> > 
> > 1. Add the tag storage memory before any limits are enforced by
> > arm64_bootmem_init().
> >
> > 2. Call cma_declare_contiguous_nid() after arm64_bootmem_init(), because
> > the function will check the memory limit.
> > 
> > 3. Have an arch initcall that checks that the CMA regions corresponding to
> > the tag storage have been activated successfully (cma_init_reserved_areas()
> > is a core initcall). If not, then don't enable tag storage.
> > 
> > How does that sound to you?
> > 
> > Thanks,
> > Alex
> > 
> 
> I think this is a good way to utilize the CMA code!

Cool, thanks!

Alex
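
As a sketch of step 3 from the list above (an editor's illustration, not code
from the series: the cma member of struct tag_region and the
mte_tag_storage_disable() helper are assumed here; cma_get_size() reports 0
for an area whose activation failed, because cma_activate_area() zeroes the
page count on error):

#include <linux/cma.h>
#include <linux/init.h>

static int __init mte_tag_storage_check_activated(void)
{
	int i;

	/* arch_initcall() runs after core_initcall(), so by this point
	 * cma_init_reserved_areas() has already tried to activate every
	 * CMA area. */
	for (i = 0; i < num_tag_regions; i++) {
		if (cma_get_size(tag_regions[i].cma) == 0) {
			mte_tag_storage_disable();	/* assumed helper */
			break;
		}
	}

	return 0;
}
arch_initcall(mte_tag_storage_check_activated);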

> 
> Thanks,
> Regards.
> 
> > > > +	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > > +	if (ret || block_size_bytes == 0) {
> > > > +		pr_err("Invalid or missing 'block-size' property");
> > > > +		return -EINVAL;
> > > > +	}
> > > > +	region->block_size = get_block_size_pages(block_size_bytes);
> > > > +	if (range_len(tag_range) % region->block_size != 0) {
> > > > +		pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > > +		       PFN_PHYS(range_len(tag_range)), region->block_size);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > 
> > > I was confused by the variable "block_size". The block size declared in the device tree is
> > > in bytes, but the actual block size used is in pages. I think the term "block_size" can cause
> > > confusion, as it might be interpreted as bytes. If possible, I suggest changing the term "block_size"
> > > to something more readable, such as "block_nr_pages" (this is just an example!)
> > > 
> > > Thanks,
> > > Regards.
> >
> 
> What do you think about this?
> 
> Thanks,
> Regards.
> 
> > > > +	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > > > +	if (ret)
> > > > +		nid = numa_node_id();
> > > > +
> > > > +	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
> > > > +				nid, MEMBLOCK_NONE);
> > > > +	if (ret) {
> > > > +		pr_err("Error adding tag memblock (%d)", ret);
> > > > +		return ret;
> > > > +	}
> > > > +	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > > +
> > > > +	pr_info("Found tag storage region 0x%llx-0x%llx, block size %u pages",
> > > > +		PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), region->block_size);
> > > > +
> > > > +	num_tag_regions++;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +void __init mte_tag_storage_init(void)
> > > > +{
> > > > +	struct range *tag_range;
> > > > +	int i, ret;
> > > > +
> > > > +	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
> > > > +	if (ret) {
> > > > +		for (i = 0; i < num_tag_regions; i++) {
> > > > +			tag_range = &tag_regions[i].tag_range;
> > > > +			memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
> > > > +		}
> > > > +		num_tag_regions = 0;
> > > > +		pr_info("MTE tag storage region management disabled");
> > > > +	}
> > > > +}
> > > > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > > > index 417a8a86b2db..1b77138c1aa5 100644
> > > > --- a/arch/arm64/kernel/setup.c
> > > > +++ b/arch/arm64/kernel/setup.c
> > > > @@ -42,6 +42,7 @@
> > > >  #include <asm/cpufeature.h>
> > > >  #include <asm/cpu_ops.h>
> > > >  #include <asm/kasan.h>
> > > > +#include <asm/mte_tag_storage.h>
> > > >  #include <asm/numa.h>
> > > >  #include <asm/scs.h>
> > > >  #include <asm/sections.h>
> > > > @@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
> > > >  			   FW_BUG "Booted with MMU enabled!");
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * Must be called before memory limits are enforced by
> > > > +	 * arm64_memblock_init().
> > > > +	 */
> > > > +	mte_tag_storage_init();
> > > > +
> > > >  	arm64_memblock_init();
> > > >  
> > > >  	paging_init();
> > > > -- 
> > > > 2.42.1
> > > > 
> > > > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-11-19 16:57 ` [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory Alexandru Elisei
  2023-11-29  8:44   ` Hyesoo Yu
@ 2023-12-11 17:29   ` Rob Herring
  2023-12-12 16:38     ` Alexandru Elisei
  1 sibling, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-11 17:29 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Allow the kernel to get the size and location of the MTE tag storage
> regions from the DTB. This memory is marked as reserved for now.
>
> The DTB node for the tag storage region is defined as:
>
>         tags0: tag-storage@8f8000000 {
>                 compatible = "arm,mte-tag-storage";
>                 reg = <0x08 0xf8000000 0x00 0x4000000>;
>                 block-size = <0x1000>;
>                 memory = <&memory0>;    // Associated tagged memory node
>         };

I skimmed through some of the discussion. If this memory range is within
main RAM, then it definitely belongs in /reserved-memory.

You need a binding for this too.

> The tag storage region represents the largest contiguous memory region that
> holds all the tags for the associated contiguous memory region which can be
> tagged. For example, for a 32GB contiguous tagged memory the corresponding
> tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> tag storage memory.
>
> "block-size" represents the minimum multiple of 4K of tag storage where all
> the tags stored in the block correspond to a contiguous memory region. This
> is needed for platforms where the memory controller interleaves tag writes
> to memory. For example, if the memory controller interleaves tag writes for
> 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> then the correct value for "block-size" is 0x2000. This value is a hardware
> property, independent of the selected kernel page size.
>
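
To make the "block-size" conversion concrete, here is a user-space sketch of
the computation (an editor's illustration mirroring the get_block_size_pages()
helper quoted further down; a 4K kernel page size is assumed):

#include <stdio.h>

#define PAGE_SIZE	4096u

/* The block size in pages is lcm(PAGE_SIZE, block_size_bytes) / PAGE_SIZE,
 * computed via the greatest common divisor. */
static unsigned int block_size_pages(unsigned int block_size_bytes)
{
	unsigned int a = PAGE_SIZE, b = block_size_bytes, r;

	do {	/* Euclidean algorithm; a ends up holding gcd(a, b) */
		r = a % b;
		a = b;
		b = r;
	} while (b != 0);

	return PAGE_SIZE * block_size_bytes / a / PAGE_SIZE;
}

int main(void)
{
	/* The 2-way interleave example above: 0x2000 bytes of tag storage
	 * maps to a block size of 2 pages. */
	printf("%u\n", block_size_pages(0x2000));

	return 0;
}
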
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/Kconfig                       |  12 ++
>  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                |   7 +
>  5 files changed, 291 insertions(+)
>  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b071a00425d..fe8276fdc7a8 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2062,6 +2062,18 @@ config ARM64_MTE
>
>           Documentation/arch/arm64/memory-tagging-extension.rst.
>
> +if ARM64_MTE
> +config ARM64_MTE_TAG_STORAGE
> +       bool "Dynamic MTE tag storage management"
> +       help
> +         Adds support for dynamic management of the memory used by the hardware
> +         for storing MTE tags. This memory, unlike normal memory, cannot be
> +         tagged. When it is used to store tags for another memory location it
> +         cannot be used for any type of allocation.
> +
> +         If unsure, say N
> +endif # ARM64_MTE
> +
>  endmenu # "ARMv8.5 architectural features"
>
>  menu "ARMv8.7 architectural features"
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> new file mode 100644
> index 000000000000..8f86c4f9a7c3
> --- /dev/null
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +#ifndef __ASM_MTE_TAG_STORAGE_H
> +#define __ASM_MTE_TAG_STORAGE_H
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +void mte_tag_storage_init(void);
> +#else
> +static inline void mte_tag_storage_init(void)
> +{
> +}
> +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> +#endif /* __ASM_MTE_TAG_STORAGE_H  */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index d95b3d6b471a..5f031bf9f8f1 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)              += crash_core.o
>  obj-$(CONFIG_ARM_SDE_INTERFACE)                += sdei.o
>  obj-$(CONFIG_ARM64_PTR_AUTH)           += pointer_auth.o
>  obj-$(CONFIG_ARM64_MTE)                        += mte.o
> +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)    += mte_tag_storage.o
>  obj-y                                  += vdso-wrap.o
>  obj-$(CONFIG_COMPAT_VDSO)              += vdso32-wrap.o
>  obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)        += patch-scs.o
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> new file mode 100644
> index 000000000000..fa6267ef8392
> --- /dev/null
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -0,0 +1,256 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Support for dynamic tag storage.
> + *
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/memblock.h>
> +#include <linux/mm.h>
> +#include <linux/of_device.h>

You probably don't need this header. If you depend on what it
implicitly includes, then that will now break in linux-next.

> +#include <linux/of_fdt.h>
> +#include <linux/range.h>
> +#include <linux/string.h>
> +#include <linux/xarray.h>
> +
> +#include <asm/mte_tag_storage.h>
> +
> +struct tag_region {
> +       struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> +       struct range tag_range; /* Tag storage memory, in PFNs. */
> +       u32 block_size;         /* Tag block size, in pages. */
> +};
> +
> +#define MAX_TAG_REGIONS        32
> +
> +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> +static int num_tag_regions;
> +
> +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> +                                               int reg_len, struct range *range)
> +{
> +       int addr_cells = dt_root_addr_cells;
> +       int size_cells = dt_root_size_cells;
> +       u64 size;
> +
> +       if (reg_len / 4 > addr_cells + size_cells)
> +               return -EINVAL;
> +
> +       range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> +       size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> +       if (size == 0) {
> +               pr_err("Invalid node");
> +               return -EINVAL;
> +       }
> +       range->end = range->start + size - 1;

We have a function to read (and translate, which you forgot) addresses.
Add what's missing rather than open code your own.

> +
> +       return 0;
> +}
> +
> +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> +                                                   struct range *tag_range)
> +{
> +       const __be32 *reg;
> +       int reg_len;
> +
> +       reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> +       if (reg == NULL) {
> +               pr_err("Invalid metadata node");
> +               return -EINVAL;
> +       }
> +
> +       return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> +}
> +
> +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> +{
> +       const __be32 *reg;
> +       int reg_len;
> +
> +       reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> +       if (reg == NULL)
> +               reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> +
> +       if (reg == NULL) {
> +               pr_err("Invalid memory node");
> +               return -EINVAL;
> +       }
> +
> +       return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> +}
> +
> +struct find_memory_node_arg {
> +       unsigned long node;
> +       u32 phandle;
> +};
> +
> +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> +                                      int depth, void *data)
> +{
> +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> +       struct find_memory_node_arg *arg = data;
> +
> +       if (depth != 1 || !type || strcmp(type, "memory") != 0)
> +               return 0;
> +
> +       if (of_get_flat_dt_phandle(node) == arg->phandle) {
> +               arg->node = node;
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> +{
> +       struct find_memory_node_arg arg = { 0 };
> +       const __be32 *memory_prop;
> +       u32 mem_phandle;
> +       int ret, reg_len;
> +
> +       memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> +       if (!memory_prop) {
> +               pr_err("Missing 'memory' property in the tag storage node");
> +               return -EINVAL;
> +       }
> +
> +       mem_phandle = be32_to_cpup(memory_prop);
> +       arg.phandle = mem_phandle;
> +
> +       ret = of_scan_flat_dt(fdt_find_memory_node, &arg);

Do not use of_scan_flat_dt. It is a relic predating libfdt which can
get a node by phandle directly.

> +       if (ret != 1) {
> +               pr_err("Associated memory node not found");
> +               return -EINVAL;
> +       }
> +
> +       *mem_node = arg.node;
> +
> +       return 0;
> +}
> +
> +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> +                                              u32 *retval)

If you are going to make a generic function, make it for everyone.

> +{
> +       const __be32 *reg;
> +
> +       reg = of_get_flat_dt_prop(node, propname, NULL);
> +       if (!reg)
> +               return -EINVAL;
> +
> +       *retval = be32_to_cpup(reg);
> +       return 0;
> +}
> +
> +static u32 __init get_block_size_pages(u32 block_size_bytes)
> +{
> +       u32 a = PAGE_SIZE;
> +       u32 b = block_size_bytes;
> +       u32 r;
> +
> +       /* Find greatest common divisor using the Euclidean algorithm. */
> +       do {
> +               r = a % b;
> +               a = b;
> +               b = r;
> +       } while (b != 0);
> +
> +       return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> +}
> +
> +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> +                                      int depth, void *data)
> +{
> +       struct tag_region *region;
> +       unsigned long mem_node;
> +       struct range *mem_range;
> +       struct range *tag_range;
> +       u32 block_size_bytes;
> +       u32 nid = 0;
> +       int ret;
> +
> +       if (depth != 1 || !strstr(uname, "tag-storage"))
> +               return 0;
> +
> +       if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> +               return 0;
> +
> +       if (num_tag_regions == MAX_TAG_REGIONS) {
> +               pr_err("Maximum number of tag storage regions exceeded");
> +               return -EINVAL;
> +       }
> +
> +       region = &tag_regions[num_tag_regions];
> +       mem_range = &region->mem_range;
> +       tag_range = &region->tag_range;
> +
> +       ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> +       if (ret) {
> +               pr_err("Invalid tag storage node");
> +               return ret;
> +       }
> +
> +       ret = tag_storage_get_memory_node(node, &mem_node);
> +       if (ret)
> +               return ret;
> +
> +       ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> +       if (ret) {
> +               pr_err("Invalid address for associated data memory node");
> +               return ret;
> +       }
> +
> +       /* The tag region must exactly match the corresponding memory. */
> +       if (range_len(tag_range) * 32 != range_len(mem_range)) {
> +               pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> +                      PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> +                      PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> +               return -EINVAL;
> +       }
> +
> +       ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> +       if (ret || block_size_bytes == 0) {
> +               pr_err("Invalid or missing 'block-size' property");
> +               return -EINVAL;
> +       }
> +       region->block_size = get_block_size_pages(block_size_bytes);
> +       if (range_len(tag_range) % region->block_size != 0) {
> +               pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> +                      PFN_PHYS(range_len(tag_range)), region->block_size);
> +               return -EINVAL;
> +       }
> +
> +       ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);

I was going to say we already have a way to associate memory nodes with
other nodes using "numa-node-id", so the "memory" phandle property is
somewhat redundant. Maybe the tag node should have a numa-node-id.
With that, it looks like you don't even need to access the /memory
node. Avoiding that would be good for 2 reasons. It avoids parsing
memory nodes twice and it's not the kernel's job to validate the DT.
Really, if you want memory info, you should use memblock to get it
because all the special cases of memory layout are handled. For
example you can have memory nodes with multiple 'reg' entries or
multiple memory nodes or both, and then some of those could be
contiguous.

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-11 17:29   ` Rob Herring
@ 2023-12-12 16:38     ` Alexandru Elisei
  2023-12-12 18:44       ` Rob Herring
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-12 16:38 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Rob,

Thank you so much for the feedback; I'm not very familiar with device tree,
and any comments are very useful.

On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> >
> > The DTB node for the tag storage region is defined as:
> >
> >         tags0: tag-storage@8f8000000 {
> >                 compatible = "arm,mte-tag-storage";
> >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> >                 block-size = <0x1000>;
> >                 memory = <&memory0>;    // Associated tagged memory node
> >         };
> 
> I skimmed through some of the discussion. If this memory range is within
> main RAM, then it definitely belongs in /reserved-memory.

Ok, will do that.

If you don't mind, why do you say that it definitely belongs in
reserved-memory? I'm not trying to argue otherwise; I'm curious about the
motivation.

Tag storage is not DMA memory and can live anywhere in memory. In
arm64_memblock_init(), the kernel first removes from memblock the memory
that it cannot address (for example, because it has been compiled with
CONFIG_ARM64_VA_BITS_39=y), and then calls
early_init_fdt_scan_reserved_mem().

What happens if reserved memory is above what the kernel can address?

From my testing, when the kernel is compiled with a 39-bit VA, if I use
reserved memory to discover tag storage that lives above the virtual address
limit and then try to use CMA to manage the tag storage memory, I get a
kernel panic:

[    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
[    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
[..]
[    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
[    0.807277] Mem abort info:
[    0.807277]   ESR = 0x0000000096000005
[    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.808110]   SET = 0, FnV = 0
[    0.808443]   EA = 0, S1PTW = 0
[    0.808526]   FSC = 0x05: level 1 translation fault
[    0.808943] Data abort info:
[    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.810221] [00000001fe000000] user address but active_mm is swapper
[..]
[    0.820887] Call trace:
[    0.821027]  cma_init_reserved_areas+0xc4/0x378

> 
> You need a binding for this too.

By binding you mean having a YAML file in dt-schema [1] describing the tag
storage node, right?

[1] https://github.com/devicetree-org/dt-schema

> 
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> >
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/Kconfig                       |  12 ++
> >  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
> >  arch/arm64/kernel/Makefile               |   1 +
> >  arch/arm64/kernel/mte_tag_storage.c      | 256 +++++++++++++++++++++++
> >  arch/arm64/kernel/setup.c                |   7 +
> >  5 files changed, 291 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> >  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >
> >           Documentation/arch/arm64/memory-tagging-extension.rst.
> >
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > +       bool "Dynamic MTE tag storage management"
> > +       help
> > +         Adds support for dynamic management of the memory used by the hardware
> > +         for storing MTE tags. This memory, unlike normal memory, cannot be
> > +         tagged. When it is used to store tags for another memory location it
> > +         cannot be used for any type of allocation.
> > +
> > +         If unsure, say N
> > +endif # ARM64_MTE
> > +
> >  endmenu # "ARMv8.5 architectural features"
> >
> >  menu "ARMv8.7 architectural features"
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > new file mode 100644
> > index 000000000000..8f86c4f9a7c3
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +#ifndef __ASM_MTE_TAG_STORAGE_H
> > +#define __ASM_MTE_TAG_STORAGE_H
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +void mte_tag_storage_init(void);
> > +#else
> > +static inline void mte_tag_storage_init(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > +#endif /* __ASM_MTE_TAG_STORAGE_H  */
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index d95b3d6b471a..5f031bf9f8f1 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)              += crash_core.o
> >  obj-$(CONFIG_ARM_SDE_INTERFACE)                += sdei.o
> >  obj-$(CONFIG_ARM64_PTR_AUTH)           += pointer_auth.o
> >  obj-$(CONFIG_ARM64_MTE)                        += mte.o
> > +obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)    += mte_tag_storage.o
> >  obj-y                                  += vdso-wrap.o
> >  obj-$(CONFIG_COMPAT_VDSO)              += vdso32-wrap.o
> >  obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)        += patch-scs.o
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > new file mode 100644
> > index 000000000000..fa6267ef8392
> > --- /dev/null
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Support for dynamic tag storage.
> > + *
> > + * Copyright (C) 2023 ARM Ltd.
> > + */
> > +
> > +#include <linux/memblock.h>
> > +#include <linux/mm.h>
> > +#include <linux/of_device.h>
> 
> You probably don't need this header. If you depend on what it
> implicitly includes, then that will now break in linux-next.

I'll have a look and see if I can remove it. It might be an artifact from an
earlier version of the patches.

> 
> > +#include <linux/of_fdt.h>
> > +#include <linux/range.h>
> > +#include <linux/string.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <asm/mte_tag_storage.h>
> > +
> > +struct tag_region {
> > +       struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> > +       struct range tag_range; /* Tag storage memory, in PFNs. */
> > +       u32 block_size;         /* Tag block size, in pages. */
> > +};
> > +
> > +#define MAX_TAG_REGIONS        32
> > +
> > +static struct tag_region tag_regions[MAX_TAG_REGIONS];
> > +static int num_tag_regions;
> > +
> > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > +                                               int reg_len, struct range *range)
> > +{
> > +       int addr_cells = dt_root_addr_cells;
> > +       int size_cells = dt_root_size_cells;
> > +       u64 size;
> > +
> > +       if (reg_len / 4 > addr_cells + size_cells)
> > +               return -EINVAL;
> > +
> > +       range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > +       size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > +       if (size == 0) {
> > +               pr_err("Invalid node");
> > +               return -EINVAL;
> > +       }
> > +       range->end = range->start + size - 1;
> 
> We have a function to read (and translate, which you forgot) addresses.
> Add what's missing rather than open code your own.

I must have missed that there's already a function to read addresses. Would
you mind pointing me in the right direction?
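
The helper Rob likely has in mind is of_flat_dt_translate_address() from
drivers/of/fdt_address.c, which both reads and translates a flattened-DT
address. A minimal sketch of using it (an editor's assumption, not confirmed
in the thread):

#include <linux/errno.h>
#include <linux/of_address.h>	/* OF_BAD_ADDR */
#include <linux/of_fdt.h>
#include <linux/types.h>

static int __init read_translated_base(unsigned long node, u64 *base)
{
	/* Reads the first "reg" entry of the flattened-DT node and
	 * translates it through the parent bus "ranges". */
	*base = of_flat_dt_translate_address(node);
	if (*base == OF_BAD_ADDR)
		return -EINVAL;

	return 0;
}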

> 
> > +
> > +       return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > +                                                   struct range *tag_range)
> > +{
> > +       const __be32 *reg;
> > +       int reg_len;
> > +
> > +       reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +       if (reg == NULL) {
> > +               pr_err("Invalid metadata node");
> > +               return -EINVAL;
> > +       }
> > +
> > +       return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > +}
> > +
> > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > +{
> > +       const __be32 *reg;
> > +       int reg_len;
> > +
> > +       reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > +       if (reg == NULL)
> > +               reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > +
> > +       if (reg == NULL) {
> > +               pr_err("Invalid memory node");
> > +               return -EINVAL;
> > +       }
> > +
> > +       return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > +}
> > +
> > +struct find_memory_node_arg {
> > +       unsigned long node;
> > +       u32 phandle;
> > +};
> > +
> > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > +                                      int depth, void *data)
> > +{
> > +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > +       struct find_memory_node_arg *arg = data;
> > +
> > +       if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > +               return 0;
> > +
> > +       if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > +               arg->node = node;
> > +               return 1;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > +{
> > +       struct find_memory_node_arg arg = { 0 };
> > +       const __be32 *memory_prop;
> > +       u32 mem_phandle;
> > +       int ret, reg_len;
> > +
> > +       memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > +       if (!memory_prop) {
> > +               pr_err("Missing 'memory' property in the tag storage node");
> > +               return -EINVAL;
> > +       }
> > +
> > +       mem_phandle = be32_to_cpup(memory_prop);
> > +       arg.phandle = mem_phandle;
> > +
> > +       ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> 
> Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> get a node by phandle directly.

I used that because that's what drivers/of/fdt.c uses. With reserved memory
I shouldn't need it, because struct reserved_mem already includes a
phandle.
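
For the direct lookup Rob mentions, libfdt can resolve the node from its
phandle without scanning the whole tree; a minimal sketch (an editor's
illustration; initial_boot_params is the kernel's pointer to the FDT blob):

#include <linux/errno.h>
#include <linux/libfdt.h>
#include <linux/of_fdt.h>	/* initial_boot_params */
#include <linux/types.h>

static int __init find_memory_node_by_phandle(u32 mem_phandle)
{
	int offset;

	/* Returns the node offset, or a negative libfdt error code. */
	offset = fdt_node_offset_by_phandle(initial_boot_params, mem_phandle);
	if (offset < 0)
		return -EINVAL;

	return offset;
}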

> 
> > +       if (ret != 1) {
> > +               pr_err("Associated memory node not found");
> > +               return -EINVAL;
> > +       }
> > +
> > +       *mem_node = arg.node;
> > +
> > +       return 0;
> > +}
> > +
> > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > +                                              u32 *retval)
> 
> If you are going to make a generic function, make it for everyone.

Sure. If I still need it, should I put the function in
include/linux/of_fdt.h?

> 
> > +{
> > +       const __be32 *reg;
> > +
> > +       reg = of_get_flat_dt_prop(node, propname, NULL);
> > +       if (!reg)
> > +               return -EINVAL;
> > +
> > +       *retval = be32_to_cpup(reg);
> > +       return 0;
> > +}
> > +
> > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > +{
> > +       u32 a = PAGE_SIZE;
> > +       u32 b = block_size_bytes;
> > +       u32 r;
> > +
> > +       /* Find greatest common divisor using the Euclidean algorithm. */
> > +       do {
> > +               r = a % b;
> > +               a = b;
> > +               b = r;
> > +       } while (b != 0);
> > +
> > +       return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > +}
> > +
> > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > +                                      int depth, void *data)
> > +{
> > +       struct tag_region *region;
> > +       unsigned long mem_node;
> > +       struct range *mem_range;
> > +       struct range *tag_range;
> > +       u32 block_size_bytes;
> > +       u32 nid = 0;
> > +       int ret;
> > +
> > +       if (depth != 1 || !strstr(uname, "tag-storage"))
> > +               return 0;
> > +
> > +       if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > +               return 0;
> > +
> > +       if (num_tag_regions == MAX_TAG_REGIONS) {
> > +               pr_err("Maximum number of tag storage regions exceeded");
> > +               return -EINVAL;
> > +       }
> > +
> > +       region = &tag_regions[num_tag_regions];
> > +       mem_range = &region->mem_range;
> > +       tag_range = &region->tag_range;
> > +
> > +       ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > +       if (ret) {
> > +               pr_err("Invalid tag storage node");
> > +               return ret;
> > +       }
> > +
> > +       ret = tag_storage_get_memory_node(node, &mem_node);
> > +       if (ret)
> > +               return ret;
> > +
> > +       ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > +       if (ret) {
> > +               pr_err("Invalid address for associated data memory node");
> > +               return ret;
> > +       }
> > +
> > +       /* The tag region must exactly match the corresponding memory. */
> > +       if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > +               pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > +                      PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > +                      PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > +               return -EINVAL;
> > +       }
> > +
> > +       ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > +       if (ret || block_size_bytes == 0) {
> > +               pr_err("Invalid or missing 'block-size' property");
> > +               return -EINVAL;
> > +       }
> > +       region->block_size = get_block_size_pages(block_size_bytes);
> > +       if (range_len(tag_range) % region->block_size != 0) {
> > +               pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > +                      PFN_PHYS(range_len(tag_range)), region->block_size);
> > +               return -EINVAL;
> > +       }
> > +
> > +       ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> 
> I was going to say we already have a way to associate memory nodes with
> other nodes using "numa-node-id", so the "memory" phandle property is
> somewhat redundant. Maybe the tag node should have a numa-node-id.
> With that, it looks like you don't even need to access the /memory
> node. Avoiding that would be good for 2 reasons. It avoids parsing
> memory nodes twice and it's not the kernel's job to validate the DT.
> Really, if you want memory info, you should use memblock to get it
> because all the special cases of memory layout are handled. For
> example you can have memory nodes with multiple 'reg' entries or
> multiple memory nodes or both, and then some of those could be
> contiguous.

I need to have a memory node associated with the tag storage node because
there is a static relationship between a page from "normal" memory and its
associated tag storage. If the code doesn't know that the memory region
A..B has the corresponding tag storage in the region X..Y, then it doesn't
know which tag storage to reserve when a page is allocated as tagged.

In the example above, assuming that page P is allocated as tagged, the
corresponding tag storage page that needs to be reserved is:

tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32* + PHYS_PFN(X)

numa-node-id is not enough for this, because as far as I know you can have
multiple memory regions within the same NUMA node.

*32 tagged pages use one tag storage page to store the tags.
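
To make the lookup concrete, here is a minimal sketch (illustration only,
not code from this series; tag_storage_pfn_for() is a hypothetical helper)
of how the tag_regions array from this patch could be walked to find the
tag storage page for a data page:

static unsigned long tag_storage_pfn_for(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct tag_region *region;
	int i;

	for (i = 0; i < num_tag_regions; i++) {
		region = &tag_regions[i];
		/* Is the page inside this region's tagged memory range? */
		if (pfn < region->mem_range.start ||
		    pfn > region->mem_range.end)
			continue;
		/* 32 tagged pages share one tag storage page. */
		return (pfn - region->mem_range.start) / 32 +
		       region->tag_range.start;
	}

	return 0;	/* Not in any tagged region. */
}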

Thanks,
Alex


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-12 16:38     ` Alexandru Elisei
@ 2023-12-12 18:44       ` Rob Herring
  2023-12-13 13:04         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-12 18:44 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi Rob,
>
> Thank you so much for the feedback, I'm not very familiar with device tree,
> and any comments are very useful.
>
> On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > Allow the kernel to get the size and location of the MTE tag storage
> > > regions from the DTB. This memory is marked as reserved for now.
> > >
> > > The DTB node for the tag storage region is defined as:
> > >
> > >         tags0: tag-storage@8f8000000 {
> > >                 compatible = "arm,mte-tag-storage";
> > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > >                 block-size = <0x1000>;
> > >                 memory = <&memory0>;    // Associated tagged memory node
> > >         };
> >
> > I skimmed thru the discussion some. If this memory range is within
> > main RAM, then it definitely belongs in /reserved-memory.
>
> Ok, will do that.
>
> If you don't mind, why do you say that it definitely belongs in
> reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> motivation.

Simply so that /memory nodes describe all possible memory and
/reserved-memory is just adding restrictions. It's also because
/reserved-memory is what gets handled early, and we don't need
multiple things to handle early.

> Tag storage is not DMA and can live anywhere in memory.

Then why put it in DT at all? The only reason CMA is there is to set
the size. It's not even clear to me we need CMA in DT either. The
reasoning long ago was the kernel didn't do a good job of moving and
reclaiming contiguous space, but that's supposed to be better now (and
most h/w figured out they need IOMMUs).

But for tag storage you know the size as it is a function of the
memory size, right? After all, you are validating the size is correct.
> I guess there is still the aspect of whether you want to enable MTE or
not which could be done in a variety of ways.

> In
> arm64_memblock_init(), the kernel first removes the memory that it cannot
> address from memblock. For example, because it has been compiled with
> CONFIG_ARM64_VA_BITS_39=y. And then calls
> early_init_fdt_scan_reserved_mem().
>
> What happens if reserved memory is above what the kernel can address?

I would hope the kernel handles it. That's the kernel's problem unless
there's some h/w limitation to access some region. The DT can't have
things dependent on the kernel config.

> From my testing, when the kernel is compiled with a 39-bit VA, if I use
> reserved memory to discover tag storage that lives above the virtual address
> limit and then I try to use CMA to manage the tag storage memory, I get a
> kernel panic:

Looks like we should handle that better...

> [    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> [    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> [..]
> [    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> [    0.807277] Mem abort info:
> [    0.807277]   ESR = 0x0000000096000005
> [    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    0.808110]   SET = 0, FnV = 0
> [    0.808443]   EA = 0, S1PTW = 0
> [    0.808526]   FSC = 0x05: level 1 translation fault
> [    0.808943] Data abort info:
> [    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> [    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [    0.810221] [00000001fe000000] user address but active_mm is swapper
> [..]
> [    0.820887] Call trace:
> [    0.821027]  cma_init_reserved_areas+0xc4/0x378
>
> >
> > You need a binding for this too.
>
> By binding you mean having a yaml file in dt-schema [1] describing the tag
> storage node, right?

Yes, but in the kernel tree is fine.

[...]

> > > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > > +                                               int reg_len, struct range *range)
> > > +{
> > > +       int addr_cells = dt_root_addr_cells;
> > > +       int size_cells = dt_root_size_cells;
> > > +       u64 size;
> > > +
> > > +       if (reg_len / 4 > addr_cells + size_cells)
> > > +               return -EINVAL;
> > > +
> > > +       range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > +       size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > +       if (size == 0) {
> > > +               pr_err("Invalid node");
> > > +               return -EINVAL;
> > > +       }
> > > +       range->end = range->start + size - 1;
> >
> > We have a function to read (and translate which you forgot) addresses.
> > Add what's missing rather than open code your own.
>
> I must have missed that there's already a function to read addresses. Would
> you mind pointing me in the right direction?

drivers/of/fdt_address.c

Though it doesn't provide getting the size, so that will have to be added.


> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > > +                                                   struct range *tag_range)
> > > +{
> > > +       const __be32 *reg;
> > > +       int reg_len;
> > > +
> > > +       reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > +       if (reg == NULL) {
> > > +               pr_err("Invalid metadata node");
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > > +{
> > > +       const __be32 *reg;
> > > +       int reg_len;
> > > +
> > > +       reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > > +       if (reg == NULL)
> > > +               reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > +
> > > +       if (reg == NULL) {
> > > +               pr_err("Invalid memory node");
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > > +}
> > > +
> > > +struct find_memory_node_arg {
> > > +       unsigned long node;
> > > +       u32 phandle;
> > > +};
> > > +
> > > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > > +                                      int depth, void *data)
> > > +{
> > > +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > > +       struct find_memory_node_arg *arg = data;
> > > +
> > > +       if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > > +               return 0;
> > > +
> > > +       if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > > +               arg->node = node;
> > > +               return 1;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > > +{
> > > +       struct find_memory_node_arg arg = { 0 };
> > > +       const __be32 *memory_prop;
> > > +       u32 mem_phandle;
> > > +       int ret, reg_len;
> > > +
> > > +       memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > > +       if (!memory_prop) {
> > > +               pr_err("Missing 'memory' property in the tag storage node");
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       mem_phandle = be32_to_cpup(memory_prop);
> > > +       arg.phandle = mem_phandle;
> > > +
> > > +       ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> >
> > Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> > get a node by phandle directly.
>
> I used that because that's what drivers/of/fdt.c uses. With reserved memory
> I shouldn't need it, because struct reserved_mem already includes a
> phandle.

Check again. Only some arch/ code (mostly powerpc) uses it. I've
killed off most of it.


> > > +       if (ret != 1) {
> > > +               pr_err("Associated memory node not found");
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       *mem_node = arg.node;
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > > +                                              u32 *retval)
> >
> > If you are going to make a generic function, make it for everyone.
>
> Sure. If I still need it, should I put the function in
> include/linux/of_fdt.h?

Yes.

> > > +{
> > > +       const __be32 *reg;
> > > +
> > > +       reg = of_get_flat_dt_prop(node, propname, NULL);
> > > +       if (!reg)
> > > +               return -EINVAL;
> > > +
> > > +       *retval = be32_to_cpup(reg);
> > > +       return 0;
> > > +}
> > > +
> > > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > > +{
> > > +       u32 a = PAGE_SIZE;
> > > +       u32 b = block_size_bytes;
> > > +       u32 r;
> > > +
> > > +       /* Find greatest common divisor using the Euclidean algorithm. */
> > > +       do {
> > > +               r = a % b;
> > > +               a = b;
> > > +               b = r;
> > > +       } while (b != 0);
> > > +
> > > +       return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > > +}
> > > +
> > > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > +                                      int depth, void *data)
> > > +{
> > > +       struct tag_region *region;
> > > +       unsigned long mem_node;
> > > +       struct range *mem_range;
> > > +       struct range *tag_range;
> > > +       u32 block_size_bytes;
> > > +       u32 nid = 0;
> > > +       int ret;
> > > +
> > > +       if (depth != 1 || !strstr(uname, "tag-storage"))
> > > +               return 0;
> > > +
> > > +       if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > > +               return 0;
> > > +
> > > +       if (num_tag_regions == MAX_TAG_REGIONS) {
> > > +               pr_err("Maximum number of tag storage regions exceeded");
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       region = &tag_regions[num_tag_regions];
> > > +       mem_range = &region->mem_range;
> > > +       tag_range = &region->tag_range;
> > > +
> > > +       ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > > +       if (ret) {
> > > +               pr_err("Invalid tag storage node");
> > > +               return ret;
> > > +       }
> > > +
> > > +       ret = tag_storage_get_memory_node(node, &mem_node);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > > +       if (ret) {
> > > +               pr_err("Invalid address for associated data memory node");
> > > +               return ret;
> > > +       }
> > > +
> > > +       /* The tag region must exactly match the corresponding memory. */
> > > +       if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > > +               pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > > +                      PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > +                      PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > +       if (ret || block_size_bytes == 0) {
> > > +               pr_err("Invalid or missing 'block-size' property");
> > > +               return -EINVAL;
> > > +       }
> > > +       region->block_size = get_block_size_pages(block_size_bytes);
> > > +       if (range_len(tag_range) % region->block_size != 0) {
> > > +               pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > +                      PFN_PHYS(range_len(tag_range)), region->block_size);
> > > +               return -EINVAL;
> > > +       }
> > > +
> > > +       ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> >
> > I was going to say we already have a way to associate memory nodes with
> > other nodes using "numa-node-id", so the "memory" phandle property is
> > somewhat redundant. Maybe the tag node should have a numa-node-id.
> > With that, it looks like you don't even need to access the /memory
> > node. Avoiding that would be good for 2 reasons. It avoids parsing
> > memory nodes twice and it's not the kernel's job to validate the DT.
> > Really, if you want memory info, you should use memblock to get it
> > because all the special cases of memory layout are handled. For
> > example you can have memory nodes with multiple 'reg' entries or
> > multiple memory nodes or both, and then some of those could be
> > contiguous.
>
> I need to have a memory node associated with the tag storage node because
> there is a static relationship between a page from "normal" memory and its
> associated tag storage. If the code doesn't know that the memory region
> A..B has the corresponding tag storage in the region X..Y, then it doesn't
> know which tag storage to reserve when a page is allocated as tagged.
>
> In the example above, assuming that page P is allocated as tagged, the
> corresponding tag storage page that needs to be reserved is:
>
> tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32* + PHYS_PFN(X)
>
> numa-node-id is not enough for this, because as far as I know you can have
> multiple memory regions within the same NUMA node.

Okay.

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-12 18:44       ` Rob Herring
@ 2023-12-13 13:04         ` Alexandru Elisei
  2023-12-13 14:06           ` Rob Herring
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-13 13:04 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Rob,

On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi Rob,
> >
> > Thank you so much for the feedback, I'm not very familiar with device tree,
> > and any comments are very useful.
> >
> > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > regions from the DTB. This memory is marked as reserved for now.
> > > >
> > > > The DTB node for the tag storage region is defined as:
> > > >
> > > >         tags0: tag-storage@8f8000000 {
> > > >                 compatible = "arm,mte-tag-storage";
> > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > >                 block-size = <0x1000>;
> > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > >         };
> > >
> > > I skimmed thru the discussion some. If this memory range is within
> > > main RAM, then it definitely belongs in /reserved-memory.
> >
> > Ok, will do that.
> >
> > If you don't mind, why do you say that it definitely belongs in
> > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > motivation.
> 
> Simply so that /memory nodes describe all possible memory and
> /reserved-memory is just adding restrictions. It's also because
> /reserved-memory is what gets handled early, and we don't need
> multiple things to handle early.
> 
> > Tag storage is not DMA and can live anywhere in memory.
> 
> Then why put it in DT at all? The only reason CMA is there is to set
> the size. It's not even clear to me we need CMA in DT either. The
> reasoning long ago was the kernel didn't do a good job of moving and
> reclaiming contiguous space, but that's supposed to be better now (and
> most h/w figured out they need IOMMUs).
> 
> But for tag storage you know the size as it is a function of the
> memory size, right? After all, you are validating the size is correct.
> I guess there is still the aspect of whether you want to enable MTE or
> not which could be done in a variety of ways.

Oh, sorry, my bad, I should have been clearer about this. I don't want to
put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
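
As a rough sketch of what I have in mind (my assumption, not the series'
final code), arch code would hand an already memblock-reserved tag storage
range straight to CMA with the existing cma_init_reserved_mem() helper,
with no "linux,cma" node involved:

#include <linux/cma.h>

static struct cma *tag_storage_cma;

static int __init tag_storage_register_cma(phys_addr_t base, phys_addr_t size)
{
	/*
	 * order_per_bit == 0 makes CMA track the region at page
	 * granularity; the name is only used for diagnostics.
	 */
	return cma_init_reserved_mem(base, size, 0, "tag-storage",
				     &tag_storage_cma);
}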

> 
> > In
> > arm64_memblock_init(), the kernel first removes the memory that it cannot
> > address from memblock. For example, because it has been compiled with
> > CONFIG_ARM64_VA_BITS_39=y. And then calls
> > early_init_fdt_scan_reserved_mem().
> >
> > What happens if reserved memory is above what the kernel can address?
> 
> I would hope the kernel handles it. That's the kernel's problem unless
> there's some h/w limitation to access some region. The DT can't have
> things dependent on the kernel config.

I would hope so too; that's why I was surprised when I put reserved memory
at 1TB in a 39-bit VA kernel and got a panic.

> 
> > From my testing, when the kernel is compiled with a 39-bit VA, if I use
> > reserved memory to discover tag storage that lives above the virtual address
> > limit and then I try to use CMA to manage the tag storage memory, I get a
> > kernel panic:
> 
> Looks like we should handle that better...

I guess we don't need to tackle that problem right now. I don't know of
many systems in the wild that have memory above 1TB.

> 
> > [    0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> > [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> > [    0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> > [..]
> > [    0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> > [    0.807277] Mem abort info:
> > [    0.807277]   ESR = 0x0000000096000005
> > [    0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [    0.808110]   SET = 0, FnV = 0
> > [    0.808443]   EA = 0, S1PTW = 0
> > [    0.808526]   FSC = 0x05: level 1 translation fault
> > [    0.808943] Data abort info:
> > [    0.808943]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > [    0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [    0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [    0.810221] [00000001fe000000] user address but active_mm is swapper
> > [..]
> > [    0.820887] Call trace:
> > [    0.821027]  cma_init_reserved_areas+0xc4/0x378
> >
> > >
> > > You need a binding for this too.
> >
> > By binding you mean having a yaml file in dt-schema [1] describing the tag
> > storage node, right?
> 
> Yes, but in the kernel tree is fine.

Cool, thanks.

> 
> [...]
> 
> > > > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > > > +                                               int reg_len, struct range *range)
> > > > +{
> > > > +       int addr_cells = dt_root_addr_cells;
> > > > +       int size_cells = dt_root_size_cells;
> > > > +       u64 size;
> > > > +
> > > > +       if (reg_len / 4 > addr_cells + size_cells)
> > > > +               return -EINVAL;
> > > > +
> > > > +       range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > > +       size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > > +       if (size == 0) {
> > > > +               pr_err("Invalid node");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +       range->end = range->start + size - 1;
> > >
> > > We have a function to read (and translate which you forgot) addresses.
> > > Add what's missing rather than open code your own.
> >
> > I must have missed that there's already a function to read addresses. Would
> > you mind pointing me in the right direction?
> 
> drivers/of/fdt_address.c
> 
> Though it doesn't provide getting the size, so that will have to be added.

Ok, will do!

> 
> 
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > > > +                                                   struct range *tag_range)
> > > > +{
> > > > +       const __be32 *reg;
> > > > +       int reg_len;
> > > > +
> > > > +       reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > > +       if (reg == NULL) {
> > > > +               pr_err("Invalid metadata node");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > > > +{
> > > > +       const __be32 *reg;
> > > > +       int reg_len;
> > > > +
> > > > +       reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > > > +       if (reg == NULL)
> > > > +               reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > > +
> > > > +       if (reg == NULL) {
> > > > +               pr_err("Invalid memory node");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > > > +}
> > > > +
> > > > +struct find_memory_node_arg {
> > > > +       unsigned long node;
> > > > +       u32 phandle;
> > > > +};
> > > > +
> > > > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > > > +                                      int depth, void *data)
> > > > +{
> > > > +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > > > +       struct find_memory_node_arg *arg = data;
> > > > +
> > > > +       if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > > > +               return 0;
> > > > +
> > > > +       if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > > > +               arg->node = node;
> > > > +               return 1;
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > > > +{
> > > > +       struct find_memory_node_arg arg = { 0 };
> > > > +       const __be32 *memory_prop;
> > > > +       u32 mem_phandle;
> > > > +       int ret, reg_len;
> > > > +
> > > > +       memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > > > +       if (!memory_prop) {
> > > > +               pr_err("Missing 'memory' property in the tag storage node");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       mem_phandle = be32_to_cpup(memory_prop);
> > > > +       arg.phandle = mem_phandle;
> > > > +
> > > > +       ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > >
> > > Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> > > get a node by phandle directly.
> >
> > I used that because that's what drivers/of/fdt.c uses. With reserved memory
> > I shouldn't need it, because struct reserved_mem already includes a
> > phandle.
> 
> Check again. Only some arch/ code (mostly powerpc) uses it. I've
> killed off most of it.

You're right, I think I grep'ed for a different function name in
drivers/of/fdt.c. Either way, the message is clear: no of_scan_flat_dt().
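
For the record, the direct libfdt lookup would look roughly like this (a
sketch only; initial_boot_params is the kernel's pointer to the flattened
DT, and tag_storage_memory_node() is a hypothetical name):

#include <linux/libfdt.h>
#include <linux/of_fdt.h>

static int __init tag_storage_memory_node(u32 phandle)
{
	/* Returns the node offset, or a negative libfdt error code. */
	int offset = fdt_node_offset_by_phandle(initial_boot_params, phandle);

	if (offset < 0)
		pr_err("Associated memory node not found");

	return offset;
}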

> 
> 
> > > > +       if (ret != 1) {
> > > > +               pr_err("Associated memory node not found");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       *mem_node = arg.node;
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > > > +                                              u32 *retval)
> > >
> > > If you are going to make a generic function, make it for everyone.
> >
> > Sure. If I still need it, should I put the function in
> > include/linux/of_fdt.h?
> 
> Yes.

Noted.

> 
> > > > +{
> > > > +       const __be32 *reg;
> > > > +
> > > > +       reg = of_get_flat_dt_prop(node, propname, NULL);
> > > > +       if (!reg)
> > > > +               return -EINVAL;
> > > > +
> > > > +       *retval = be32_to_cpup(reg);
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > > > +{
> > > > +       u32 a = PAGE_SIZE;
> > > > +       u32 b = block_size_bytes;
> > > > +       u32 r;
> > > > +
> > > > +       /* Find greatest common divisor using the Euclidean algorithm. */
> > > > +       do {
> > > > +               r = a % b;
> > > > +               a = b;
> > > > +               b = r;
> > > > +       } while (b != 0);
> > > > +
> > > > +       return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > > > +}
> > > > +
> > > > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > > +                                      int depth, void *data)
> > > > +{
> > > > +       struct tag_region *region;
> > > > +       unsigned long mem_node;
> > > > +       struct range *mem_range;
> > > > +       struct range *tag_range;
> > > > +       u32 block_size_bytes;
> > > > +       u32 nid = 0;
> > > > +       int ret;
> > > > +
> > > > +       if (depth != 1 || !strstr(uname, "tag-storage"))
> > > > +               return 0;
> > > > +
> > > > +       if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > > > +               return 0;
> > > > +
> > > > +       if (num_tag_regions == MAX_TAG_REGIONS) {
> > > > +               pr_err("Maximum number of tag storage regions exceeded");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       region = &tag_regions[num_tag_regions];
> > > > +       mem_range = &region->mem_range;
> > > > +       tag_range = &region->tag_range;
> > > > +
> > > > +       ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > > > +       if (ret) {
> > > > +               pr_err("Invalid tag storage node");
> > > > +               return ret;
> > > > +       }
> > > > +
> > > > +       ret = tag_storage_get_memory_node(node, &mem_node);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > > > +       if (ret) {
> > > > +               pr_err("Invalid address for associated data memory node");
> > > > +               return ret;
> > > > +       }
> > > > +
> > > > +       /* The tag region must exactly match the corresponding memory. */
> > > > +       if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > > > +               pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > > > +                      PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > > +                      PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > > +       if (ret || block_size_bytes == 0) {
> > > > +               pr_err("Invalid or missing 'block-size' property");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +       region->block_size = get_block_size_pages(block_size_bytes);
> > > > +       if (range_len(tag_range) % region->block_size != 0) {
> > > > +               pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > > +                      PFN_PHYS(range_len(tag_range)), region->block_size);
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > > +       ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > >
> > > I was going to say we already have a way to associate memory nodes with
> > > other nodes using "numa-node-id", so the "memory" phandle property is
> > > somewhat redundant. Maybe the tag node should have a numa-node-id.
> > > With that, it looks like you don't even need to access the /memory
> > > node. Avoiding that would be good for 2 reasons. It avoids parsing
> > > memory nodes twice and it's not the kernel's job to validate the DT.
> > > Really, if you want memory info, you should use memblock to get it
> > > because all the special cases of memory layout are handled. For
> > > example you can have memory nodes with multiple 'reg' entries or
> > > multiple memory nodes or both, and then some of those could be
> > > contiguous.
> >
> > I need to have a memory node associated with the tag storage node because
> > there is a static relationship between a page from "normal" memory and its
> > associated tag storage. If the code doesn't know that the memory region
> > A..B has the corresponding tag storage in the region X..Y, then it doesn't
> > know which tag storage to reserve when a page is allocated as tagged.
> >
> > In the example above, assuming that page P is allocated as tagged, the
> > corresponding tag storage page that needs to be reserved is:
> >
> > tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32* + PHYS_PFN(X)
> >
> > numa-node-id is not enough for this, because as far as I know you can have
> > multiple memory regions within the same NUMA node.
> 
> Okay.

Great, glad we are on the same page.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 13:04         ` Alexandru Elisei
@ 2023-12-13 14:06           ` Rob Herring
  2023-12-13 14:51             ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-13 14:06 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi Rob,
>
> On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > Hi Rob,
> > >
> > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > and any comments are very useful.
> > >
> > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > <alexandru.elisei@arm.com> wrote:
> > > > >
> > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > >
> > > > > The DTB node for the tag storage region is defined as:
> > > > >
> > > > >         tags0: tag-storage@8f8000000 {
> > > > >                 compatible = "arm,mte-tag-storage";
> > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > >                 block-size = <0x1000>;
> > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > >         };
> > > >
> > > > I skimmed thru the discussion some. If this memory range is within
> > > > main RAM, then it definitely belongs in /reserved-memory.
> > >
> > > Ok, will do that.
> > >
> > > If you don't mind, why do you say that it definitely belongs in
> > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > motivation.
> >
> > Simply so that /memory nodes describe all possible memory and
> > /reserved-memory is just adding restrictions. It's also because
> > /reserved-memory is what gets handled early, and we don't need
> > multiple things to handle early.
> >
> > > Tag storage is not DMA and can live anywhere in memory.
> >
> > Then why put it in DT at all? The only reason CMA is there is to set
> > the size. It's not even clear to me we need CMA in DT either. The
> > reasoning long ago was the kernel didn't do a good job of moving and
> > reclaiming contiguous space, but that's supposed to be better now (and
> > most h/w figured out they need IOMMUs).
> >
> > But for tag storage you know the size as it is a function of the
> > memory size, right? After all, you are validating the size is correct.
> > I guess there is still the aspect of whether you want to enable MTE or
> > not which could be done in a variety of ways.
>
> Oh, sorry, my bad, I should have been clearer about this. I don't want to
> put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.

Yes, I understand, but my point remains. Why do you need this in DT?
If the location doesn't matter and you can calculate the size from the
memory size, what else is there to add to the DT?

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 14:06           ` Rob Herring
@ 2023-12-13 14:51             ` Alexandru Elisei
  2023-12-13 17:22               ` Rob Herring
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-13 14:51 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi Rob,
> >
> > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > and any comments are very useful.
> > > >
> > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > <alexandru.elisei@arm.com> wrote:
> > > > > >
> > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > >
> > > > > > The DTB node for the tag storage region is defined as:
> > > > > >
> > > > > >         tags0: tag-storage@8f8000000 {
> > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > >                 block-size = <0x1000>;
> > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > >         };
> > > > >
> > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > >
> > > > Ok, will do that.
> > > >
> > > > If you don't mind, why do you say that it definitely belongs in
> > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > motivation.
> > >
> > > Simply so that /memory nodes describe all possible memory and
> > > /reserved-memory is just adding restrictions. It's also because
> > > /reserved-memory is what gets handled early, and we don't need
> > > multiple things to handle early.
> > >
> > > > Tag storage is not DMA and can live anywhere in memory.
> > >
> > > Then why put it in DT at all? The only reason CMA is there is to set
> > > the size. It's not even clear to me we need CMA in DT either. The
> > > reasoning long ago was the kernel didn't do a good job of moving and
> > > reclaiming contiguous space, but that's supposed to be better now (and
> > > most h/w figured out they need IOMMUs).
> > >
> > > But for tag storage you know the size as it is a function of the
> > > memory size, right? After all, you are validating the size is correct.
> > > I guess there is still the aspect of whether you want to enable MTE or
> > > not which could be done in a variety of ways.
> >
> > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> 
> Yes, I understand, but my point remains. Why do you need this in DT?
> If the location doesn't matter and you can calculate the size from the
> memory size, what else is there to add to the DT?

I am afraid there has been a misunderstanding. What do you mean by
"location doesn't matter"?

At the very least, Linux needs to know the address and size of a memory
region to use it. The series is about using the tag storage memory for
data. Tag storage cannot be described as a regular memory node because it
cannot be tagged (and normal memory can).

Then there's the matter of the tag storage block size (explained in this
commit message), and also knowing the memory range for which a tag storage
region stores the tags. This is explained in the cover letter.
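
Put differently, what the kernel must end up with per region is roughly
the following, and every field comes from the DT (this mirrors struct
tag_region from the patch; the comments are mine):

struct tag_region {
	struct range mem_range;	/* Tagged memory, in page frames. */
	struct range tag_range;	/* Tag storage for mem_range, in page frames. */
	u32 block_size;		/* Tag storage granule, in pages. */
};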

Is there something that you feel that is not clear enough? I am more than
happy to go into details.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 14:51             ` Alexandru Elisei
@ 2023-12-13 17:22               ` Rob Herring
  2023-12-13 17:44                 ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-13 17:22 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > Hi Rob,
> > >
> > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > <alexandru.elisei@arm.com> wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > and any comments are very useful.
> > > > >
> > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > >
> > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > >
> > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > >
> > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > >                 block-size = <0x1000>;
> > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > >         };
> > > > > >
> > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > >
> > > > > Ok, will do that.
> > > > >
> > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > motivation.
> > > >
> > > > Simply so that /memory nodes describe all possible memory and
> > > > /reserved-memory is just adding restrictions. It's also because
> > > > /reserved-memory is what gets handled early, and we don't need
> > > > multiple things to handle early.
> > > >
> > > > > Tag storage is not DMA and can live anywhere in memory.
> > > >
> > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > most h/w figured out they need IOMMUs).
> > > >
> > > > But for tag storage you know the size as it is a function of the
> > > > memory size, right? After all, you are validating the size is correct.
> > > > I guess there is still the aspect of whether you want to enable MTE or
> > > > not which could be done in a variety of ways.
> > >
> > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> >
> > Yes, I understand, but my point remains. Why do you need this in DT?
> > If the location doesn't matter and you can calculate the size from the
> > memory size, what else is there to add to the DT?
>
> I am afraid there has been a misunderstanding. What do you mean by
> "location doesn't matter"?

You said:
> Tag storage is not DMA and can live anywhere in memory.

Which I took as the kernel can figure out where to put it. But maybe
you meant the h/w platform can hard code it to be anywhere in memory?
If so, then yes, DT is needed.

> At the very least, Linux needs to know the address and size of a memory
> region to use it. The series is about using the tag storage memory for
> data. Tag storage cannot be described as a regular memory node because it
> cannot be tagged (and normal memory can).

If the tag storage lives in the middle of memory, then it would be
described in the memory node, but removed by being in reserved-memory
node.

> Then there's the matter of the tag storage block size (explained in this
> commit message), and also knowing the memory range for which a tag storage
> region stores the tags. This is explained in the cover letter.

Honestly, I just forgot about that part.

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 17:22               ` Rob Herring
@ 2023-12-13 17:44                 ` Alexandru Elisei
  2023-12-13 20:30                   ` Rob Herring
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-13 17:44 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi,
> >
> > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > <alexandru.elisei@arm.com> wrote:
> > > > > >
> > > > > > Hi Rob,
> > > > > >
> > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > and any comments are very useful.
> > > > > >
> > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > >
> > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > >
> > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > >
> > > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > >                 block-size = <0x1000>;
> > > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > > >         };
> > > > > > >
> > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > >
> > > > > > Ok, will do that.
> > > > > >
> > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > motivation.
> > > > >
> > > > > Simply so that /memory nodes describe all possible memory and
> > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > multiple things to handle early.
> > > > >
> > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > >
> > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > most h/w figured out they need IOMMUs).
> > > > >
> > > > > But for tag storage you know the size as it is a function of the
> > > > > memory size, right? After all, you are validating the size is correct.
> > > > > I guess there is still the aspect of whether you want to enable MTE or
> > > > > not which could be done in a variety of ways.
> > > >
> > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > >
> > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > If the location doesn't matter and you can calculate the size from the
> > > memory size, what else is there to add to the DT?
> >
> > I am afraid there has been a misunderstanding. What do you mean by
> > "location doesn't matter"?
> 
> You said:
> > Tag storage is not DMA and can live anywhere in memory.
> 
> Which I took as the kernel can figure out where to put it. But maybe
> you meant the h/w platform can hard code it to be anywhere in memory?
> If so, then yes, DT is needed.

Ah, I see, sorry for not being clear enough, you are correct: tag storage
is a hardware property, and software needs a mechanism (in this case, the
DT) to discover its properties.

> 
> > At the very least, Linux needs to know the address and size of a memory
> > region to use it. The series is about using the tag storage memory for
> > data. Tag storage cannot be described as a regular memory node because it
> > cannot be tagged (and normal memory can).
> 
> If the tag storage lives in the middle of memory, then it would be
> described in the memory node, but removed by being in reserved-memory
> node.

I don't follow. Would you mind going into more details?

> 
> > Then there's the matter of the tag storage block size (explained in this
> > commit message), and also knowing the memory range for which a tag storage
> > region stores the tags. This is explained in the cover letter.
> 
> Honestly, I just forgot about that part.

I totally understand, there are a lot of things to consider at the same
time.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 17:44                 ` Alexandru Elisei
@ 2023-12-13 20:30                   ` Rob Herring
  2023-12-14 15:45                     ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-13 20:30 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > <alexandru.elisei@arm.com> wrote:
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > >
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > and any comments are very useful.
> > > > > > >
> > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > >
> > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > >
> > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > >
> > > > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > >                 block-size = <0x1000>;
> > > > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > > > >         };
> > > > > > > >
> > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > >
> > > > > > > Ok, will do that.
> > > > > > >
> > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > motivation.
> > > > > >
> > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > multiple things to handle early.
> > > > > >
> > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > >
> > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > most h/w figured out they need IOMMUs).
> > > > > >
> > > > > > But for tag storage you know the size as it is a function of the
> > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > I guess there is still the aspect of whether you want to enable MTE or
> > > > > > not which could be done in a variety of ways.
> > > > >
> > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > >
> > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > If the location doesn't matter and you can calculate the size from the
> > > > memory size, what else is there to add to the DT?
> > >
> > > I am afraid there has been a misunderstanding. What do you mean by
> > > "location doesn't matter"?
> >
> > You said:
> > > Tag storage is not DMA and can live anywhere in memory.
> >
> > Which I took as the kernel can figure out where to put it. But maybe
> > you meant the h/w platform can hard code it to be anywhere in memory?
> > If so, then yes, DT is needed.
>
> Ah, I see, sorry for not being clear enough, you are correct: tag storage
> is a hardware property, and software needs a mechanism (in this case, the
> dt) to discover its properties.
>
> >
> > > At the very least, Linux needs to know the address and size of a memory
> > > region to use it. The series is about using the tag storage memory for
> > > data. Tag storage cannot be described as a regular memory node because it
> > > cannot be tagged (and normal memory can).
> >
> > If the tag storage lives in the middle of memory, then it would be
> > described in the memory node, but removed by being in reserved-memory
> > node.
>
> I don't follow. Would you mind going into more details?

It goes back to what I said earlier about /memory nodes describing all
the memory. There's no reason to reserve memory if you haven't
described that range as memory to begin with. One could presumably
just have a memory node for each contiguous chunk and not need
/reserved-memory (ignoring the need to say what things are reserved
for). That would become very difficult to adjust. Note that the kernel
has a hardcoded limit of 64 reserved regions currently and that is not
enough for some people. Seems like a lot, but I have no idea how they
are (ab)using /reserved-memory.

Let me give an example. Presumably using MTE at all is configurable.
If you boot a kernel with MTE disabled (or older and not supporting
it), then I'd assume you'd want to use the tag storage for regular
memory. Well, if tag storage is already part of /memory, then all you
have to do is ignore the tag reserved-memory region. Tweaking the
memory nodes would be more work.
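
For illustration, the split could be sketched like this (a rough sketch,
not the actual kernel code; the function is made up and the addresses are
taken from the example node above):

#include <linux/init.h>
#include <linux/memblock.h>

static void __init example_early_dt_scan(bool use_mte)
{
	/* /memory: describes *all* of RAM, tag storage included. */
	memblock_add(0x880000000ULL, 0x200000000ULL);

	/*
	 * /reserved-memory: carve the tag storage back out. If MTE is
	 * disabled (or the kernel is too old to know about it), skip
	 * this step and the range simply stays in the free pool - no
	 * tweaking of the /memory nodes needed.
	 */
	if (use_mte)
		memblock_reserve(0x8f8000000ULL, 0x4000000ULL);
}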


Also, I should point out that /memory and /reserved-memory nodes are
not used for UEFI boot.

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-13 20:30                   ` Rob Herring
@ 2023-12-14 15:45                     ` Alexandru Elisei
  2023-12-14 18:55                       ` Rob Herring
  0 siblings, 1 reply; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-14 15:45 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > <alexandru.elisei@arm.com> wrote:
> > > > > >
> > > > > > Hi Rob,
> > > > > >
> > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > >
> > > > > > > > Hi Rob,
> > > > > > > >
> > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > and any comments are very useful.
> > > > > > > >
> > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > >
> > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > >
> > > > > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > >                 block-size = <0x1000>;
> > > > > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > > > > >         };
> > > > > > > > >
> > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > >
> > > > > > > > Ok, will do that.
> > > > > > > >
> > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > motivation.
> > > > > > >
> > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > multiple things to handle early.
> > > > > > >
> > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > >
> > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > most h/w figured out they need IOMMUs).
> > > > > > >
> > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > I guess there is still the aspect of whether you want to enable
> > > > > > > MTE or not, which could be done in a variety of ways.
> > > > > >
> > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > >
> > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > If the location doesn't matter and you can calculate the size from the
> > > > > memory size, what else is there to add to the DT?
> > > >
> > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > "location doesn't matter"?
> > >
> > > You said:
> > > > Tag storage is not DMA and can live anywhere in memory.
> > >
> > > Which I took as the kernel can figure out where to put it. But maybe
> > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > If so, then yes, DT is needed.
> >
> > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > is a hardware property, and software needs a mechanism (in this case, the
> > dt) to discover its properties.
> >
> > >
> > > > At the very least, Linux needs to know the address and size of a memory
> > > > region to use it. The series is about using the tag storage memory for
> > > > data. Tag storage cannot be described as a regular memory node because it
> > > > cannot be tagged (and normal memory can).
> > >
> > > If the tag storage lives in the middle of memory, then it would be
> > > described in the memory node, but removed by being in reserved-memory
> > > node.
> >
> > I don't follow. Would you mind going into more details?
> 
> It goes back to what I said earlier about /memory nodes describing all
> the memory. There's no reason to reserve memory if you haven't
> described that range as memory to begin with. One could presumably
> just have a memory node for each contiguous chunk and not need
> /reserved-memory (ignoring the need to say what things are reserved
> for). That would become very difficult to adjust. Note that the kernel
> has a hardcoded limit of 64 reserved regions currently and that is not
> enough for some people. Seems like a lot, but I have no idea how they
> are (ab)using /reserved-memory.

Ah, I see what you mean: reserved memory is about marking existing memory
(from a /memory node) as special, not about adding new memory.

After the memblock allocator is initialized, the kernel can use it for its
own allocations. Kernel allocations are not movable.

When a page is allocated as tagged, the associated tag storage cannot be
used for data; otherwise the tags would corrupt that data. To avoid this,
the requirement is that tag storage pages are only used for movable
allocations. When a page is allocated as tagged, the data in the associated
tag storage is migrated and the tag storage is taken from the page
allocator (via alloc_contig_range()).
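
For illustration, the reservation step could be sketched like this (only a
sketch; the helper is made up, but alloc_contig_range() is the real entry
point):

#include <linux/gfp.h>

/*
 * Take one tag storage block out of the page allocator: any movable
 * data in [start_pfn, start_pfn + nr_pages) is migrated away and the
 * pages are detached from the MIGRATE_CMA free lists.
 */
static int reserve_tag_block(unsigned long start_pfn,
			     unsigned long nr_pages)
{
	return alloc_contig_range(start_pfn, start_pfn + nr_pages,
				  MIGRATE_CMA, GFP_KERNEL);
}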

My understanding is that the memblock allocator can use all the memory from
a /memory node. If the tag storage memory is declared in a /memory node,
there is the possibility that Linux will use tag storage memory for its own
allocations, which would make that tag storage memory unmovable, and thus
unusable for storing tags.

Looking at early_init_dt_scan_memory(), even if a /memory node is marked as
hotpluggable, memblock will still use it, unless "movable_node" is set on
the kernel command line.

That's the reason why I'm not describing tag storage in a /memory node.  Is
there a way to tell the memblock allocator not to use memory from a /memory
node?

> 
> Let me give an example. Presumably using MTE at all is configurable.
> If you boot a kernel with MTE disabled (or older and not supporting
> it), then I'd assume you'd want to use the tag storage for regular
> memory. Well, if tag storage is already part of /memory, then all you
> have to do is ignore the tag reserved-memory region. Tweaking the
> memory nodes would be more work.

Right now, the tag storage memory is reserved via memblock_reserve(), and
if MTE is disabled (for example, via the kernel command line), the code
calls free_reserved_page() for each tag storage page. I find that
straightforward to implement.
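
Roughly (a sketch; tag_regions[] and num_tag_regions stand in for the
bookkeeping in this series, while free_reserved_page(), pfn_to_page() and
PHYS_PFN() are the real kernel interfaces):

#include <linux/init.h>
#include <linux/mm.h>
#include <linux/pfn.h>

static void __init free_tag_storage(void)
{
	phys_addr_t pa;
	int i;

	for (i = 0; i < num_tag_regions; i++)
		for (pa = tag_regions[i].start; pa < tag_regions[i].end;
		     pa += PAGE_SIZE)
			/* Clears PG_reserved, frees the page to the buddy. */
			free_reserved_page(pfn_to_page(PHYS_PFN(pa)));
}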

Thanks,
Alex

> 
> 
> Also, I should point out that /memory and /reserved-memory nodes are
> not used for UEFI boot.
> 
> Rob
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-14 15:45                     ` Alexandru Elisei
@ 2023-12-14 18:55                       ` Rob Herring
  2023-12-18 10:59                         ` Alexandru Elisei
  0 siblings, 1 reply; 98+ messages in thread
From: Rob Herring @ 2023-12-14 18:55 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Thu, Dec 14, 2023 at 9:45 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> > On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > > <alexandru.elisei@arm.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > >
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Rob,
> > > > > > > > >
> > > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > > and any comments are very useful.
> > > > > > > > >
> > > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > > >
> > > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > > >
> > > > > > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > > >                 block-size = <0x1000>;
> > > > > > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > > > > > >         };
> > > > > > > > > >
> > > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > > >
> > > > > > > > > Ok, will do that.
> > > > > > > > >
> > > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > > motivation.
> > > > > > > >
> > > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > > multiple things to handle early.
> > > > > > > >
> > > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > > >
> > > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > > most h/w figured out they need IOMMUs).
> > > > > > > >
> > > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > > I guess there is still the aspect of whether you want to
> > > > > > > > enable MTE or not, which could be done in a variety of ways.
> > > > > > >
> > > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > > >
> > > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > > If the location doesn't matter and you can calculate the size from the
> > > > > > memory size, what else is there to add to the DT?
> > > > >
> > > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > > "location doesn't matter"?
> > > >
> > > > You said:
> > > > > Tag storage is not DMA and can live anywhere in memory.
> > > >
> > > > Which I took as the kernel can figure out where to put it. But maybe
> > > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > > If so, then yes, DT is needed.
> > >
> > > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > > is a hardware property, and software needs a mechanism (in this case, the
> > > dt) to discover its properties.
> > >
> > > >
> > > > > At the very least, Linux needs to know the address and size of a memory
> > > > > region to use it. The series is about using the tag storage memory for
> > > > > data. Tag storage cannot be described as a regular memory node because it
> > > > > cannot be tagged (and normal memory can).
> > > >
> > > > If the tag storage lives in the middle of memory, then it would be
> > > > described in the memory node, but removed by being in reserved-memory
> > > > node.
> > >
> > > I don't follow. Would you mind going into more details?
> >
> > It goes back to what I said earlier about /memory nodes describing all
> > the memory. There's no reason to reserve memory if you haven't
> > described that range as memory to begin with. One could presumably
> > just have a memory node for each contiguous chunk and not need
> > /reserved-memory (ignoring the need to say what things are reserved
> > for). That would become very difficult to adjust. Note that the kernel
> > has a hardcoded limit of 64 reserved regions currently and that is not
> > enough for some people. Seems like a lot, but I have no idea how they
> > are (ab)using /reserved-memory.
>
> Ah, I see what you mean: reserved memory is about marking existing memory
> (from a /memory node) as special, not about adding new memory.
>
> After the memblock allocator is initialized, the kernel can use it for its
> own allocations. Kernel allocations are not movable.
>
> When a page is allocated as tagged, the associated tag storage cannot be
> used for data; otherwise the tags would corrupt that data. To avoid this,
> the requirement is that tag storage pages are only used for movable
> allocations. When a page is allocated as tagged, the data in the associated
> tag storage is migrated and the tag storage is taken from the page
> allocator (via alloc_contig_range()).
>
> My understanding is that the memblock allocator can use all the memory from
> a /memory node. If the tag storage memory is declared in a /memory node,
> there is the possibility that Linux will use tag storage memory for its own
> allocations, which would make that tag storage memory unmovable, and thus
> unusable for storing tags.

No, because the tag storage would be reserved in /reserved-memory.

Of course, the arch code could do something between scanning /memory
nodes and /reserved-memory, but that would be broken arch code.
Ideally, there wouldn't be any arch code in between those 2 points,
but it's complicated. It used to mainly be powerpc, but we keep adding
to the complexity on arm64.

> Looking at early_init_dt_scan_memory(), even if a /memory node is marked as
> hotpluggable, memblock will still use it, unless "movable_node" is set on
> the kernel command line.
>
> That's the reason why I'm not describing tag storage in a /memory node.  Is
> there a way to tell the memblock allocator not to use memory from a /memory
> node?
>
> >
> > Let me give an example. Presumably using MTE at all is configurable.
> > If you boot a kernel with MTE disabled (or older and not supporting
> > it), then I'd assume you'd want to use the tag storage for regular
> > memory. Well, if tag storage is already part of /memory, then all you
> > have to do is ignore the tag reserved-memory region. Tweaking the
> > memory nodes would be more work.
>
> Right now, the tag storage memory is reserved via memblock_reserve(), and
> if MTE is disabled (for example, via the kernel command line), the code
> calls free_reserved_page() for each tag storage page. I find that
> straightforward to implement.

But better to just not reserve the region in the first place. Also, it
needs to be simple enough to backport.

Also, does free_reserved_page() work on ranges outside of memblock
range (e.g. beyond end_of_DRAM())? If the tag storage happened to live
at the end of DRAM and you shorten the /memory node size to remove tag
storage, is it still going to work?

Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
  2023-12-14 18:55                       ` Rob Herring
@ 2023-12-18 10:59                         ` Alexandru Elisei
  0 siblings, 0 replies; 98+ messages in thread
From: Alexandru Elisei @ 2023-12-18 10:59 UTC (permalink / raw)
  To: Rob Herring
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Thu, Dec 14, 2023 at 12:55:14PM -0600, Rob Herring wrote:
> On Thu, Dec 14, 2023 at 9:45 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi,
> >
> > On Wed, Dec 13, 2023 at 02:30:42PM -0600, Rob Herring wrote:
> > > On Wed, Dec 13, 2023 at 11:44 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > On Wed, Dec 13, 2023 at 11:22:17AM -0600, Rob Herring wrote:
> > > > > On Wed, Dec 13, 2023 at 8:51 AM Alexandru Elisei
> > > > > <alexandru.elisei@arm.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Wed, Dec 13, 2023 at 08:06:44AM -0600, Rob Herring wrote:
> > > > > > > On Wed, Dec 13, 2023 at 7:05 AM Alexandru Elisei
> > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > >
> > > > > > > > Hi Rob,
> > > > > > > >
> > > > > > > > On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> > > > > > > > > On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> > > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Rob,
> > > > > > > > > >
> > > > > > > > > > Thank you so much for the feedback, I'm not very familiar with device tree,
> > > > > > > > > > and any comments are very useful.
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > > > > > > > > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > > > > > > > > > <alexandru.elisei@arm.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > > > > > > > > > regions from the DTB. This memory is marked as reserved for now.
> > > > > > > > > > > >
> > > > > > > > > > > > The DTB node for the tag storage region is defined as:
> > > > > > > > > > > >
> > > > > > > > > > > >         tags0: tag-storage@8f8000000 {
> > > > > > > > > > > >                 compatible = "arm,mte-tag-storage";
> > > > > > > > > > > >                 reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > > > > > > > > > >                 block-size = <0x1000>;
> > > > > > > > > > > >                 memory = <&memory0>;    // Associated tagged memory node
> > > > > > > > > > > >         };
> > > > > > > > > > >
> > > > > > > > > > > I skimmed thru the discussion some. If this memory range is within
> > > > > > > > > > > main RAM, then it definitely belongs in /reserved-memory.
> > > > > > > > > >
> > > > > > > > > > Ok, will do that.
> > > > > > > > > >
> > > > > > > > > > If you don't mind, why do you say that it definitely belongs in
> > > > > > > > > > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > > > > > > > > > motivation.
> > > > > > > > >
> > > > > > > > > Simply so that /memory nodes describe all possible memory and
> > > > > > > > > /reserved-memory is just adding restrictions. It's also because
> > > > > > > > > /reserved-memory is what gets handled early, and we don't need
> > > > > > > > > multiple things to handle early.
> > > > > > > > >
> > > > > > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > > > > > >
> > > > > > > > > Then why put it in DT at all? The only reason CMA is there is to set
> > > > > > > > > the size. It's not even clear to me we need CMA in DT either. The
> > > > > > > > > reasoning long ago was the kernel didn't do a good job of moving and
> > > > > > > > > reclaiming contiguous space, but that's supposed to be better now (and
> > > > > > > > > most h/w figured out they need IOMMUs).
> > > > > > > > >
> > > > > > > > > But for tag storage you know the size as it is a function of the
> > > > > > > > > memory size, right? After all, you are validating the size is correct.
> > > > > > > > > I guess there is still the aspect of whether you want to
> > > > > > > > > enable MTE or not, which could be done in a variety of ways.
> > > > > > > >
> > > > > > > > Oh, sorry, my bad, I should have been clearer about this. I don't want to
> > > > > > > > put it in the DT as a "linux,cma" node. But I want it to be managed by CMA.
> > > > > > >
> > > > > > > Yes, I understand, but my point remains. Why do you need this in DT?
> > > > > > > If the location doesn't matter and you can calculate the size from the
> > > > > > > memory size, what else is there to add to the DT?
> > > > > >
> > > > > > I am afraid there has been a misunderstanding. What do you mean by
> > > > > > "location doesn't matter"?
> > > > >
> > > > > You said:
> > > > > > Tag storage is not DMA and can live anywhere in memory.
> > > > >
> > > > > Which I took as the kernel can figure out where to put it. But maybe
> > > > > you meant the h/w platform can hard code it to be anywhere in memory?
> > > > > If so, then yes, DT is needed.
> > > >
> > > > Ah, I see, sorry for not being clear enough, you are correct: tag storage
> > > > is a hardware property, and software needs a mechanism (in this case, the
> > > > dt) to discover its properties.
> > > >
> > > > >
> > > > > > At the very least, Linux needs to know the address and size of a memory
> > > > > > region to use it. The series is about using the tag storage memory for
> > > > > > data. Tag storage cannot be described as a regular memory node because it
> > > > > > cannot be tagged (and normal memory can).
> > > > >
> > > > > If the tag storage lives in the middle of memory, then it would be
> > > > > described in the memory node, but removed by being in reserved-memory
> > > > > node.
> > > >
> > > > I don't follow. Would you mind going into more details?
> > >
> > > It goes back to what I said earlier about /memory nodes describing all
> > > the memory. There's no reason to reserve memory if you haven't
> > > described that range as memory to begin with. One could presumably
> > > just have a memory node for each contiguous chunk and not need
> > > /reserved-memory (ignoring the need to say what things are reserved
> > > for). That would become very difficult to adjust. Note that the kernel
> > > has a hardcoded limit of 64 reserved regions currently and that is not
> > > enough for some people. Seems like a lot, but I have no idea how they
> > > are (ab)using /reserved-memory.
> >
> > Ah, I see what you mean: reserved memory is about marking existing memory
> > (from a /memory node) as special, not about adding new memory.
> >
> > After the memblock allocator is initialized, the kernel can use it for its
> > own allocations. Kernel allocations are not movable.
> >
> > When a page is allocated as tagged, the associated tag storage cannot be
> > used for data; otherwise the tags would corrupt that data. To avoid this,
> > the requirement is that tag storage pages are only used for movable
> > allocations. When a page is allocated as tagged, the data in the associated
> > tag storage is migrated and the tag storage is taken from the page
> > allocator (via alloc_contig_range()).
> >
> > My understanding is that the memblock allocator can use all the memory from
> > a /memory node. If the tag storage memory is declared in a /memory node,
> > there is the possibility that Linux will use tag storage memory for its own
> > allocations, which would make that tag storage memory unmovable, and thus
> > unusable for storing tags.
> 
> No, because the tag storage would be reserved in /reserved-memory.
> 
> Of course, the arch code could do something between scanning /memory
> nodes and /reserved-memory, but that would be broken arch code.
> Ideally, there wouldn't be any arch code in between those 2 points,
> but it's complicated. It used to mainly be powerpc, but we keep adding
> to the complexity on arm64.

Ah, yes, that's what I was referring to: the memory nodes are parsed in
setup_arch() -> setup_machine_fdt() -> early_init_dt_scan(), and the
reserved memory is parsed later in setup_arch() -> arm64_memblock_init().

If the rule is that no memblock allocations can take place between
setup_machine_fdt() and arm64_memblock_init(), then putting tag storage in
a /memory node will work. Thank you for the clarification.

> 
> > Looking at early_init_dt_scan_memory(), even if a /memory node is marked as
> > hotpluggable, memblock will still use it, unless "movable_node" is set on
> > the kernel command line.
> >
> > That's the reason why I'm not describing tag storage in a /memory node.  Is
> > there a way to tell the memblock allocator not to use memory from a /memory
> > node?
> >
> > >
> > > Let me give an example. Presumably using MTE at all is configurable.
> > > If you boot a kernel with MTE disabled (or older and not supporting
> > > it), then I'd assume you'd want to use the tag storage for regular
> > > memory. Well, if tag storage is already part of /memory, then all you
> > > have to do is ignore the tag reserved-memory region. Tweaking the
> > > memory nodes would be more work.
> >
> > Right now, the tag storage memory is reserved via memblock_reserve(), and
> > if MTE is disabled (for example, via the kernel command line), the code
> > calls free_reserved_page() for each tag storage page. I find that
> > straightforward to implement.
> 
> But better to just not reserve the region in the first place. Also, it
> needs to be simple enough to backport.

I don't think that works: reserved memory is parsed in setup_arch() ->
arm64_memblock_init(), but the CPU capabilities (which determine whether
MTE is enabled) are initialized later, in smp_prepare_boot_cpu().
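
For reference, my reading of the relevant ordering (simplified):

/*
 * start_kernel()
 *   setup_arch()
 *     setup_machine_fdt()    - /memory nodes -> memblock_add()
 *     [tag storage discovery sits here in this series]
 *     arm64_memblock_init()  - /reserved-memory -> memblock_reserve()
 *   ...
 *   smp_prepare_boot_cpu()   - CPU capabilities (and thus the MTE state)
 *                              are only known from this point on
 */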

> 
> Also, does free_reserved_page() work on ranges outside of memblock
> range (e.g. beyond end_of_DRAM())? If the tag storage happened to live
> at the end of DRAM and you shorten the /memory node size to remove tag
> storage, is it still going to work?

Tag storage memory is discovered in two stages: first it is added to
memblock with memblock_add(), then reserved with memblock_reserve(). This
is performed in setup_arch(), after setup_machine_fdt(), and before
arm64_memblock_init(). The tag storage code keeps an array of the
discovered tag storage regions. This is implemented in this patch.

The next patch [1] adds an arch_initcall that checks whether
memblock_end_of_DRAM() is below the upper address of a tag storage region.
If that is the case, the tag storage memory is kept reserved and remains
unused by the kernel.

The next check is whether MTE is enabled: if it is disabled, the pages are
unreserved with free_reserved_page().

And finally, if all the checks pass, the tag storage pages are put on the
MIGRATE_CMA lists with init_cma_reserved_pageblock().

[1] https://lore.kernel.org/all/20231119165721.9849-12-alexandru.elisei@arm.com/
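
Putting the three checks together, the initcall is roughly (a condensed
sketch; tag_regions[], num_tag_regions and the two iterators are made-up
stand-ins for the bookkeeping in this series):

#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <asm/cpufeature.h>

static int __init mte_tag_storage_activate(void)
{
	unsigned long pfn;
	int i;

	/* 1. Tag storage beyond the end of DRAM: keep it reserved. */
	for (i = 0; i < num_tag_regions; i++)
		if (memblock_end_of_DRAM() < tag_regions[i].end)
			return 0;

	/* 2. MTE disabled: hand the pages back as ordinary memory. */
	if (!system_supports_mte()) {
		for_each_tag_storage_pfn(pfn)	/* hypothetical iterator */
			free_reserved_page(pfn_to_page(pfn));
		return 0;
	}

	/* 3. All checks passed: expose each pageblock as MIGRATE_CMA. */
	for_each_tag_storage_pageblock(pfn)	/* hypothetical iterator */
		init_cma_reserved_pageblock(pfn_to_page(pfn));

	return 0;
}
arch_initcall(mte_tag_storage_activate);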

Thanks,
Alex

> 
> Rob

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2023-12-18 10:59 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-19 16:56 [PATCH RFC v2 00/27] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
2023-11-19 16:56 ` [PATCH RFC v2 01/27] arm64: mte: Rework naming for tag manipulation functions Alexandru Elisei
2023-11-19 16:56 ` [PATCH RFC v2 02/27] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED Alexandru Elisei
2023-11-19 16:56 ` [PATCH RFC v2 03/27] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages Alexandru Elisei
2023-11-19 16:56 ` [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp Alexandru Elisei
2023-11-25 10:03   ` Mike Rapoport
2023-11-27 11:52     ` Alexandru Elisei
2023-11-28  6:49       ` Mike Rapoport
2023-11-28 17:21         ` Alexandru Elisei
2023-11-19 16:56 ` [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail Alexandru Elisei
2023-11-24 19:35   ` David Hildenbrand
2023-11-27 12:09     ` Alexandru Elisei
2023-11-28 16:57       ` David Hildenbrand
2023-11-28 17:17         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare() Alexandru Elisei
2023-11-24 19:36   ` David Hildenbrand
2023-11-27 13:03     ` Alexandru Elisei
2023-11-28 16:58       ` David Hildenbrand
2023-11-28 17:17         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 07/27] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 08/27] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code" Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 09/27] mm: Allow an arch to hook into folio allocation when VMA is known Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 10/27] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore() Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory Alexandru Elisei
2023-11-29  8:44   ` Hyesoo Yu
2023-11-30 11:56     ` Alexandru Elisei
2023-12-03 12:14     ` Alexandru Elisei
2023-12-08  5:03       ` Hyesoo Yu
2023-12-11 14:45         ` Alexandru Elisei
2023-12-11 17:29   ` Rob Herring
2023-12-12 16:38     ` Alexandru Elisei
2023-12-12 18:44       ` Rob Herring
2023-12-13 13:04         ` Alexandru Elisei
2023-12-13 14:06           ` Rob Herring
2023-12-13 14:51             ` Alexandru Elisei
2023-12-13 17:22               ` Rob Herring
2023-12-13 17:44                 ` Alexandru Elisei
2023-12-13 20:30                   ` Rob Herring
2023-12-14 15:45                     ` Alexandru Elisei
2023-12-14 18:55                       ` Rob Herring
2023-12-18 10:59                         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype Alexandru Elisei
2023-11-24 19:40   ` David Hildenbrand
2023-11-27 15:01     ` Alexandru Elisei
2023-11-28 17:03       ` David Hildenbrand
2023-11-29 10:44         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK Alexandru Elisei
2023-11-24 19:51   ` David Hildenbrand
2023-11-27 15:04     ` Alexandru Elisei
2023-11-28 17:05       ` David Hildenbrand
2023-11-29 10:46         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled Alexandru Elisei
2023-11-24 19:54   ` David Hildenbrand
2023-11-27 15:07     ` Alexandru Elisei
2023-11-28 17:05       ` David Hildenbrand
2023-11-19 16:57 ` [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
2023-11-24 19:56   ` David Hildenbrand
2023-11-27 15:10     ` Alexandru Elisei
2023-11-29  8:57   ` Hyesoo Yu
2023-11-30 12:00     ` Alexandru Elisei
2023-12-08  5:27       ` Hyesoo Yu
2023-12-11 14:21         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 16/27] arm64: mte: Manage tag storage on page allocation Alexandru Elisei
2023-11-29  9:10   ` Hyesoo Yu
2023-11-29 13:33     ` Alexandru Elisei
2023-12-08  5:29       ` Hyesoo Yu
2023-11-19 16:57 ` [PATCH RFC v2 17/27] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page Alexandru Elisei
2023-11-28 17:06   ` David Hildenbrand
2023-11-29 11:30     ` Alexandru Elisei
2023-11-29 13:13       ` David Hildenbrand
2023-11-29 13:41         ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE) Alexandru Elisei
2023-11-28 17:55   ` David Hildenbrand
2023-11-28 18:00     ` David Hildenbrand
2023-11-29 11:55     ` Alexandru Elisei
2023-11-29 12:48       ` David Hildenbrand
2023-11-29  9:27   ` Hyesoo Yu
2023-11-30 12:06     ` Alexandru Elisei
2023-11-30 12:49       ` David Hildenbrand
2023-11-30 13:32         ` Alexandru Elisei
2023-11-30 13:43           ` David Hildenbrand
2023-11-30 14:33             ` Alexandru Elisei
2023-11-30 14:39               ` David Hildenbrand
2023-11-19 16:57 ` [PATCH RFC v2 20/27] mm: hugepage: Handle huge page fault on access Alexandru Elisei
2023-11-22  1:28   ` Peter Collingbourne
2023-11-22  9:22     ` Alexandru Elisei
2023-11-28 17:56   ` David Hildenbrand
2023-11-29 11:56     ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 21/27] mm: arm64: Handle tag storage pages mapped before mprotect(PROT_MTE) Alexandru Elisei
2023-11-28  5:39   ` Peter Collingbourne
2023-11-30 17:43     ` Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 22/27] arm64: mte: swap: Handle tag restoring when missing tag storage Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 23/27] arm64: mte: copypage: " Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 24/27] arm64: mte: Handle fatal signal in reserve_tag_storage() Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 25/27] KVM: arm64: Disable MTE if tag storage is enabled Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 26/27] arm64: mte: Fast track reserving tag storage when the block is free Alexandru Elisei
2023-11-19 16:57 ` [PATCH RFC v2 27/27] arm64: mte: Enable dynamic tag storage reuse Alexandru Elisei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).