* [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
From: Kalyazin, Nikita @ 2026-04-10 15:19 UTC
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita, Nikita Kalyazin
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
Add the GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for the KVM_CREATE_GUEST_MEMFD
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.
To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The use case for also removing the
direct map entries of the shared parts of guest_memfd is a special type
of non-CoCo VM where host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated. In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.
Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. At present, direct
map removal is not supported on platforms that support
kvm_gmem_populate(). In case such support is added in the future, the
following ordering is maintained: zap then prepare, invalidate then
restore, so that guest-owned pages are never temporarily mapped by the
host. This assumes that preparation or invalidation code does not
access the page content.
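
The per-folio bookkeeping described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
struct model_folio, MODEL_NO_DIRECT_MAP and the model_* helpers are
hypothetical stand-ins for struct folio, the tag bit kept in
folio->private, and the helpers added by this patch. It does not touch
any real page tables.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct folio; not the kernel type. */
struct model_folio {
	uintptr_t private_data;	/* models folio->private */
	bool in_direct_map;	/* models presence of direct map entries */
	int zap_calls;		/* counts real zap operations, for illustration */
};

#define MODEL_NO_DIRECT_MAP ((uintptr_t)1 << 0)

static bool model_folio_no_direct_map(const struct model_folio *f)
{
	return f->private_data & MODEL_NO_DIRECT_MAP;
}

/* Zap once; later calls see the tag bit and return early. */
static int model_zap_direct_map(struct model_folio *f, bool flag_requested)
{
	if (!flag_requested)
		return -EINVAL;
	if (model_folio_no_direct_map(f))
		return 0;

	f->in_direct_map = false;
	f->zap_calls++;
	f->private_data |= MODEL_NO_DIRECT_MAP;
	return 0;
}

/* Restore on free, and only if this folio was zapped earlier. */
static void model_free_folio(struct model_folio *f)
{
	/* Arch invalidation would run here, before the restore. */
	if (model_folio_no_direct_map(f)) {
		f->in_direct_map = true;
		f->private_data &= ~MODEL_NO_DIRECT_MAP;
	}
}
```

Keeping the state in a tag bit makes the zap idempotent, so both the
userspace fault path and the KVM MMU path can call it without tracking
which of them ran first.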
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Co-developed-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
Documentation/virt/kvm/api.rst | 21 +++++-----
include/linux/kvm_host.h | 3 ++
include/uapi/linux/kvm.h | 1 +
virt/kvm/guest_memfd.c | 71 ++++++++++++++++++++++++++++++++--
4 files changed, 83 insertions(+), 13 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 032516783e96..8feec77b03fe 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6439,15 +6439,18 @@ a single guest_memfd file, but the bound ranges must not overlap).
The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
-  ============================ ================================================
-  GUEST_MEMFD_FLAG_MMAP        Enable using mmap() on the guest_memfd file
-                               descriptor.
-  GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
-                               KVM_CREATE_GUEST_MEMFD (memory files created
-                               without INIT_SHARED will be marked private).
-                               Shared memory can be faulted into host userspace
-                               page tables. Private memory cannot.
-  ============================ ================================================
+  ============================== ================================================
+  GUEST_MEMFD_FLAG_MMAP          Enable using mmap() on the guest_memfd file
+                                 descriptor.
+  GUEST_MEMFD_FLAG_INIT_SHARED   Make all memory in the file shared during
+                                 KVM_CREATE_GUEST_MEMFD (memory files created
+                                 without INIT_SHARED will be marked private).
+                                 Shared memory can be faulted into host userspace
+                                 page tables. Private memory cannot.
+  GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will unmap the memory
+                                 backing it from the kernel's address space
+                                 before passing it off to userspace or the guest.
+  ============================== ================================================
When the KVM MMU performs a PFN lookup to service a guest fault and the backing
guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce8c5fdf2752..c95747e2278c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -738,6 +738,9 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+ if (!kvm || kvm_arch_gmem_supports_no_direct_map(kvm))
+ flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
return flags;
}
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80364d4dbebb..d864f67efdb7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1642,6 +1642,7 @@ struct kvm_memory_attributes {
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 651649623448..80d4a6aca128 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
#include <linux/mempolicy.h>
#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
+#include <linux/set_memory.h>
#include "kvm_mm.h"
@@ -76,6 +77,39 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
return 0;
}
+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+ return ((u64)folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+ int r = 0;
+
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
+ if (WARN_ON_ONCE(!(GMEM_I(folio_inode(folio))->flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)))
+ return -EINVAL;
+
+ if (kvm_gmem_folio_no_direct_map(folio))
+ goto out;
+
+ r = folio_zap_direct_map(folio);
+ if (!r)
+ folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+
+out:
+ return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+ folio_restore_direct_map(folio);
+ folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+}
+
/*
* Process @folio, which contains @gfn, so that the guest can use it.
* The folio must be locked and the gfn must be contained in @slot.
@@ -388,11 +422,17 @@ static bool kvm_gmem_supports_mmap(struct inode *inode)
return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_MMAP;
}
+static bool kvm_gmem_no_direct_map(struct inode *inode)
+{
+ return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+}
+
static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
struct folio *folio;
vm_fault_t ret = VM_FAULT_LOCKED;
+ int err;
if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
return VM_FAULT_SIGBUS;
@@ -418,6 +458,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
folio_mark_uptodate(folio);
}
+ if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+ err = kvm_gmem_folio_zap_direct_map(folio);
+ if (err) {
+ ret = vmf_error(err);
+ goto out_folio;
+ }
+ }
+
vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio:
@@ -529,6 +577,9 @@ static void kvm_gmem_free_folio(struct folio *folio)
int order = folio_order(folio);
kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+
+ if (kvm_gmem_folio_no_direct_map(folio))
+ kvm_gmem_folio_restore_direct_map(folio);
}
static const struct address_space_operations kvm_gmem_aops = {
@@ -591,6 +642,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+ if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+ mapping_set_no_direct_map(inode->i_mapping);
+
GMEM_I(inode)->flags = flags;
file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
@@ -802,14 +856,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
folio_mark_uptodate(folio);
}
+ if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+ r = kvm_gmem_folio_zap_direct_map(folio);
+ if (r)
+ goto out_unlock;
+ }
+
r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+ if (r)
+ goto out_unlock;
+ *page = folio_file_page(folio, index);
folio_unlock(folio);
+ return 0;
- if (!r)
- *page = folio_file_page(folio, index);
- else
- folio_put(folio);
+out_unlock:
+ folio_unlock(folio);
+ folio_put(folio);
return r;
}
--
2.50.1
* Re: [PATCH v2] Documentation: Refactored watchdog old doc
From: Guenter Roeck @ 2026-04-10 15:19 UTC
To: Sunny Patel
Cc: Jonathan Corbet, Wim Van Sebroeck, Shuah Khan, linux-watchdog,
linux-doc, linux-kernel
In-Reply-To: <20260410072825.19114-1-nueralspacetech@gmail.com>
On Fri, Apr 10, 2026 at 12:58:11PM +0530, Sunny Patel wrote:
> Good Point. So again revisited the watchdog core
> api and list out the deprecated one and marked
> as deprecated in doc and also mentioned it just
> for legacy driver and not for newer one.
>
> As someof the legacy driver still have reference
> to old api so just marked as deprecated in doc.
>
> Also checked with other watchdog related api
> which are deprecated in driver but still present
> in doc but didn't find any.
>
> ---
The above would show up as commit message, there is no change log, and
this e-mail was sent as response to v1. And I can see that without even
looking at the patch itself.
That makes me wonder what Documentation/process/submitting-patches.rst
is useful for. No one seems to bother reading it. We might as well
just remove it.
Guenter
* [PATCH v12 09/16] KVM: arm64: define kvm_arch_gmem_supports_no_direct_map()
From: Kalyazin, Nikita @ 2026-04-10 15:19 UTC
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
Support for GUEST_MEMFD_FLAG_NO_DIRECT_MAP on arm64 depends on 1) direct
map manipulations at 4k granularity being possible, and 2) FEAT_S2FWB.
1) is met whenever the direct map is set up at 4k granularity (e.g. not
with huge/gigantic pages) at boot time, as due to ARM's
break-before-make semantics, breaking huge mappings into 4k mappings in
the direct map is not possible (BBM would require temporary invalidation
of the entire huge mapping, even if only a 4k subrange should be zapped,
which will probably crash the kernel). However, the current default for
rodata_full is true, which forces a 4k direct map.
2) is required to allow KVM to elide cache coherency operations when
installing stage 2 page tables, which require the direct map entry for
the newly mapped memory to be present (which it will not be,
as guest_memfd would have removed direct map entries in
kvm_gmem_get_pfn()).
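
As a rough illustration of why both conditions are needed, the following
userspace sketch models the two cases. All names are hypothetical, and
the -EFAULT return is only a stand-in for what would be a kernel fault on
an unmapped direct map address; it is not how the kernel reports this.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of installing a stage 2 mapping. Without FWB, the
 * kernel cleans the data cache through the direct map alias of the page,
 * so a missing direct map entry is modelled as -EFAULT here.
 */
static int model_stage2_map(bool has_fwb, bool direct_map_present)
{
	if (!has_fwb && !direct_map_present)
		return -EFAULT;
	return 0;
}

/* Mirrors the shape of the predicate added by this patch. */
static bool model_arm64_supports_no_direct_map(bool can_set_direct_map,
						bool has_fwb)
{
	return can_set_direct_map && has_fwb;
}
```

The predicate only advertises the flag for the combination in which
model_stage2_map() cannot fail for a zapped folio.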
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
arch/arm64/include/asm/kvm_host.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 70cb9cfd760a..fbdd43e7e94e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
#include <linux/maple_tree.h>
#include <linux/percpu.h>
#include <linux/psci.h>
+#include <linux/set_memory.h>
#include <asm/arch_gicv3.h>
#include <asm/barrier.h>
#include <asm/cpufeature.h>
@@ -1682,6 +1683,18 @@ static __always_inline enum fgt_group_id __fgt_reg_to_group_id(enum vcpu_sysreg
\
p; \
})
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static inline bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm)
+{
+ /*
+ * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
+ * as it calls dcache_clean_inval_poc().
+ */
+ return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
+}
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
long kvm_get_cap_for_kvm_ioctl(unsigned int ioctl, long *ext);
--
2.50.1
* [PATCH v12 08/16] KVM: x86: define kvm_arch_gmem_supports_no_direct_map()
From: Kalyazin, Nikita @ 2026-04-10 15:19 UTC
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita, Nikita Kalyazin
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
x86 supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP whenever direct map
modifications are possible. Exclude TDX and SEV-SNP as they access
pages via the direct map in certain operations, such as population.
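
The resulting policy can be sketched as a pure function. The enum below
is hypothetical and its values are not the KVM_X86_* constants; the
boolean argument stands in for the result of can_set_direct_map().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VM types for illustration only. */
enum model_vm_type {
	MODEL_DEFAULT_VM,
	MODEL_SNP_VM,
	MODEL_TDX_VM,
};

/* The flag is offered only if the direct map can be modified and the
 * VM type does not rely on direct map access (e.g. for population). */
static bool model_x86_supports_no_direct_map(bool can_set_direct_map,
					      enum model_vm_type type)
{
	return can_set_direct_map &&
	       type != MODEL_TDX_VM &&
	       type != MODEL_SNP_VM;
}
```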
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Co-developed-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/x86.c | 7 +++++++
include/linux/kvm_host.h | 9 +++++++++
3 files changed, 22 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6e4e3ef9b8c7..171ce8b84137 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -28,6 +28,7 @@
#include <linux/sched/vhost_task.h>
#include <linux/call_once.h>
#include <linux/atomic.h>
+#include <linux/set_memory.h>
#include <asm/apic.h>
#include <asm/pvclock-abi.h>
@@ -2504,4 +2505,9 @@ static inline bool kvm_arch_has_irq_bypass(void)
return enable_device_posted_irqs;
}
+#ifdef CONFIG_KVM_GUEST_MEMFD
+bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm);
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd1c4a36b593..32da7820823c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14079,6 +14079,13 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
kvm_x86_call(gmem_invalidate)(start, end);
}
#endif
+
+bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm)
+{
+ return can_set_direct_map() &&
+ kvm->arch.vm_type != KVM_X86_TDX_VM &&
+ kvm->arch.vm_type != KVM_X86_SNP_VM;
+}
#endif
int kvm_spec_ctrl_test_value(u64 value)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e8aa3d676c31..ce8c5fdf2752 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -742,6 +742,15 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
}
#endif
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+static inline bool kvm_arch_gmem_supports_no_direct_map(struct kvm *kvm)
+{
+ return false;
+}
+#endif
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
#ifndef kvm_arch_has_readonly_mem
static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
{
--
2.50.1
* [PATCH v12 07/16] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
From: Kalyazin, Nikita @ 2026-04-10 15:19 UTC
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita, Vlastimil Babka
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
Add a no-op stub for kvm_arch_gmem_invalidate if
CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE=n. This allows defining
kvm_gmem_free_folio without ifdef-ery, which allows more cleanly using
guest_memfd's free_folio callback for non-arch-invalidation related
code.
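
The stub pattern can be illustrated outside the kernel as follows. This
is a sketch with hypothetical names: MODEL_HAVE_ARCH_INVALIDATE plays the
role of CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE, and struct model_aops is a
one-member stand-in for struct address_space_operations.

```c
#include <assert.h>

/* Counts calls into the "real" arch hook; stays 0 when the stub is used. */
static int model_invalidate_calls;

#ifdef MODEL_HAVE_ARCH_INVALIDATE
static void model_arch_invalidate(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	model_invalidate_calls++;
}
#else
/* No-op stub: callers need no #ifdef of their own. */
static inline void model_arch_invalidate(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
}
#endif

static int model_free_calls;

/* Defined unconditionally, so other free-time work can be added here. */
static void model_free_folio(unsigned long pfn, unsigned int order)
{
	model_arch_invalidate(pfn, pfn + (1ul << order));
	model_free_calls++;
}

struct model_aops {
	void (*free_folio)(unsigned long pfn, unsigned int order);
};

static const struct model_aops model_gmem_aops = {
	.free_folio = model_free_folio,
};
```

With the stub in place the callback is always registered, and the
compiler can drop the empty call when the arch hook is not configured.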
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/guest_memfd.c | 4 ----
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6b76e7a6f4c2..e8aa3d676c31 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2587,6 +2587,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#else
+static inline void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) { }
#endif
#ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 017d84a7adf3..651649623448 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -522,7 +522,6 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
return MF_DELAYED;
}
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
static void kvm_gmem_free_folio(struct folio *folio)
{
struct page *page = folio_page(folio, 0);
@@ -531,15 +530,12 @@ static void kvm_gmem_free_folio(struct folio *folio)
kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
}
-#endif
static const struct address_space_operations kvm_gmem_aops = {
.dirty_folio = noop_dirty_folio,
.migrate_folio = kvm_gmem_migrate_folio,
.error_remove_folio = kvm_gmem_error_folio,
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
.free_folio = kvm_gmem_free_folio,
-#endif
};
static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
--
2.50.1
* [PATCH v12 06/16] mm: introduce AS_NO_DIRECT_MAP
From: Kalyazin, Nikita @ 2026-04-10 15:18 UTC
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita, Vlastimil Babka
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
Add AS_NO_DIRECT_MAP for mappings where direct map entries of folios are
set to not present. Currently, mappings that match this description are
secretmem mappings (memfd_secret()). Later, some guest_memfd
configurations will also fall into this category.
Reject this new type of mapping in all locations that currently reject
secretmem mappings, on the assumption that if secretmem mappings are
rejected somewhere, it is precisely because of an inability to deal with
folios without direct map entries, and then make memfd_secret() use
AS_NO_DIRECT_MAP on its address_space to drop its special
vma_is_secretmem()/secretmem_mapping() checks.
Use a new flag instead of overloading AS_INACCESSIBLE (which is already
set by guest_memfd) because not all guest_memfd mappings will end up
removed from the direct map (e.g. in pKVM setups, parts of guest_memfd that
can be mapped to userspace should also be GUP-able, and generally not
have restrictions on who can access it).
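
A minimal userspace model of the flag and of one caller that rejects such
mappings is sketched below. The model_* types and helpers are
hypothetical stand-ins for struct address_space, struct file and
struct vm_area_struct, and the plain bitwise OR replaces the kernel's
atomic set_bit(), which this sketch does not reproduce.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum model_mapping_flags {
	MODEL_AS_NO_DIRECT_MAP = 12,	/* same bit index as in this patch */
};

struct model_address_space { unsigned long flags; };
struct model_file { struct model_address_space *f_mapping; };
struct model_vma { struct model_file *vm_file; };

static void model_mapping_set_no_direct_map(struct model_address_space *m)
{
	m->flags |= 1UL << MODEL_AS_NO_DIRECT_MAP;	/* non-atomic in this model */
}

static bool model_mapping_no_direct_map(const struct model_address_space *m)
{
	return m->flags & (1UL << MODEL_AS_NO_DIRECT_MAP);
}

/* Anonymous VMAs have no file and therefore never match. */
static bool model_vma_has_no_direct_map(const struct model_vma *vma)
{
	return vma->vm_file &&
	       model_mapping_no_direct_map(vma->vm_file->f_mapping);
}

/* Models the GUP-side check: refuse VMAs whose folios lack direct map
 * entries, regardless of which subsystem created the mapping. */
static int model_check_vma(const struct model_vma *vma)
{
	if (model_vma_has_no_direct_map(vma))
		return -EFAULT;
	return 0;
}
```

Because the check keys off a mapping flag rather than a specific a_ops
pointer, secretmem and guest_memfd can share it without the callers
knowing about either.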
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
include/linux/pagemap.h | 16 ++++++++++++++++
include/linux/secretmem.h | 18 ------------------
lib/buildid.c | 8 ++++++--
mm/gup.c | 9 ++++-----
mm/mlock.c | 2 +-
mm/secretmem.c | 8 ++------
6 files changed, 29 insertions(+), 32 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f886..68c075502d91 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -211,6 +211,7 @@ enum mapping_flags {
AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't
account usage to user cgroups */
AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
+ AS_NO_DIRECT_MAP = 12, /* Folios in the mapping are not in the direct map */
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
@@ -356,6 +357,21 @@ static inline bool mapping_no_data_integrity(const struct address_space *mapping
return test_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
}
+static inline void mapping_set_no_direct_map(struct address_space *mapping)
+{
+ set_bit(AS_NO_DIRECT_MAP, &mapping->flags);
+}
+
+static inline bool mapping_no_direct_map(const struct address_space *mapping)
+{
+ return test_bit(AS_NO_DIRECT_MAP, &mapping->flags);
+}
+
+static inline bool vma_has_no_direct_map(const struct vm_area_struct *vma)
+{
+ return vma->vm_file && mapping_no_direct_map(vma->vm_file->f_mapping);
+}
+
static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
{
return mapping->gfp_mask;
diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e918f96881f5..0ae1fb057b3d 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,28 +4,10 @@
#ifdef CONFIG_SECRETMEM
-extern const struct address_space_operations secretmem_aops;
-
-static inline bool secretmem_mapping(struct address_space *mapping)
-{
- return mapping->a_ops == &secretmem_aops;
-}
-
-bool vma_is_secretmem(struct vm_area_struct *vma);
bool secretmem_active(void);
#else
-static inline bool vma_is_secretmem(struct vm_area_struct *vma)
-{
- return false;
-}
-
-static inline bool secretmem_mapping(struct address_space *mapping)
-{
- return false;
-}
-
static inline bool secretmem_active(void)
{
return false;
diff --git a/lib/buildid.c b/lib/buildid.c
index c4b737640621..ba79bf28f7e6 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -47,6 +47,10 @@ static int freader_get_folio(struct freader *r, loff_t file_off)
freader_put_folio(r);
+ /* reject folios without direct map entries (e.g. from memfd_secret() or guest_memfd()) */
+ if (mapping_no_direct_map(r->file->f_mapping))
+ return -EFAULT;
+
/* only use page cache lookup - fail if not already cached */
r->folio = filemap_get_folio(r->file->f_mapping, file_off >> PAGE_SHIFT);
@@ -87,8 +91,8 @@ const void *freader_fetch(struct freader *r, loff_t file_off, size_t sz)
return r->data + file_off;
}
- /* reject secretmem folios created with memfd_secret() */
- if (secretmem_mapping(r->file->f_mapping)) {
+ /* reject folios without direct map entries (e.g. from memfd_secret() or guest_memfd()) */
+ if (mapping_no_direct_map(r->file->f_mapping)) {
r->err = -EFAULT;
return NULL;
}
diff --git a/mm/gup.c b/mm/gup.c
index 41eb64783e03..c1b4fb1eaee7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -11,7 +11,6 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
-#include <linux/secretmem.h>
#include <linux/sched/signal.h>
#include <linux/rwsem.h>
@@ -1216,7 +1215,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
return -EOPNOTSUPP;
- if (vma_is_secretmem(vma))
+ if (vma_has_no_direct_map(vma))
return -EFAULT;
if (write) {
@@ -2724,7 +2723,7 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
* This call assumes the caller has pinned the folio, that the lowest page table
* level still points to this folio, and that interrupts have been disabled.
*
- * GUP-fast must reject all secretmem folios.
+ * GUP-fast must reject all folios without direct map entries (such as secretmem).
*
* Writing to pinned file-backed dirty tracked folios is inherently problematic
* (see comment describing the writable_file_mapping_allowed() function). We
@@ -2744,7 +2743,7 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
if (WARN_ON_ONCE(folio_test_slab(folio)))
return false;
- /* hugetlb neither requires dirty-tracking nor can be secretmem. */
+ /* hugetlb neither requires dirty-tracking nor can be without direct map. */
if (folio_test_hugetlb(folio))
return true;
@@ -2786,7 +2785,7 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
* At this point, we know the mapping is non-null and points to an
* address_space object.
*/
- if (secretmem_mapping(mapping))
+ if (mapping_no_direct_map(mapping))
return false;
/*
diff --git a/mm/mlock.c b/mm/mlock.c
index 2f699c3497a5..a6f4b3df4f3f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -474,7 +474,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
- vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
+ vma_is_dax(vma) || vma_has_no_direct_map(vma) || (oldflags & VM_DROPPABLE))
/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
goto out;
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 27b176af8fc4..d32e1be1eb35 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -129,11 +129,6 @@ static int secretmem_mmap_prepare(struct vm_area_desc *desc)
return 0;
}
-bool vma_is_secretmem(struct vm_area_struct *vma)
-{
- return vma->vm_ops == &secretmem_vm_ops;
-}
-
static const struct file_operations secretmem_fops = {
.release = secretmem_release,
.mmap_prepare = secretmem_mmap_prepare,
@@ -151,7 +146,7 @@ static void secretmem_free_folio(struct folio *folio)
folio_zero_segment(folio, 0, folio_size(folio));
}
-const struct address_space_operations secretmem_aops = {
+static const struct address_space_operations secretmem_aops = {
.dirty_folio = noop_dirty_folio,
.free_folio = secretmem_free_folio,
.migrate_folio = secretmem_migrate_folio,
@@ -200,6 +195,7 @@ static struct file *secretmem_file_create(unsigned long flags)
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_unevictable(inode->i_mapping);
+ mapping_set_no_direct_map(inode->i_mapping);
inode->i_op = &secretmem_iops;
inode->i_mapping->a_ops = &secretmem_aops;
--
2.50.1
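For readers following along outside a kernel tree, the flag helpers this patch adds to include/linux/pagemap.h can be modelled roughly as below. This is a userspace sketch only: every model_* name is made up for illustration, and the plain bit operations stand in for the kernel's atomic set_bit()/test_bit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_AS_NO_DIRECT_MAP 12 /* bit number, matching the enum value in the patch */

/* Hypothetical, reduced stand-ins for address_space, file and vm_area_struct. */
struct model_mapping { unsigned long flags; };
struct model_file { struct model_mapping *f_mapping; };
struct model_vma { struct model_file *vm_file; }; /* NULL for anonymous VMAs */

/* Non-atomic in this sketch; the kernel helper uses set_bit(). */
static void model_mapping_set_no_direct_map(struct model_mapping *m)
{
	assert(m != NULL);
	m->flags |= 1UL << MODEL_AS_NO_DIRECT_MAP;
}

static bool model_mapping_no_direct_map(const struct model_mapping *m)
{
	assert(m != NULL);
	return (m->flags & (1UL << MODEL_AS_NO_DIRECT_MAP)) != 0;
}

/* Anonymous VMAs have no file, so they can never carry the flag. */
static bool model_vma_has_no_direct_map(const struct model_vma *vma)
{
	return vma->vm_file &&
	       model_mapping_no_direct_map(vma->vm_file->f_mapping);
}
```

The point of the VMA-level helper is the NULL check on vm_file: callers such as check_vma_flags() and mlock_fixup() can test any VMA without first knowing whether it is file-backed.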
* [PATCH v12 05/16] mm/gup: drop local variable in gup_fast_folio_allowed
From: Kalyazin, Nikita @ 2026-04-10 15:18 UTC
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Move the check for pinning closer to where the result is used.
No functional changes.
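The "no functional changes" claim can be checked mechanically. The sketch below compares the old and new tail of gup_fast_folio_allowed() in userspace. The flag values are placeholders chosen for this example, not the kernel's definitions, and the is_shmem parameter stands in for shmem_mapping(mapping).

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch only; not the kernel's definitions. */
#define FOLL_WRITE    0x1u
#define FOLL_PIN      0x2u
#define FOLL_LONGTERM 0x4u
#define FOLL_WLPIN    (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)

/* Old shape: compute the flag up front, use it at the very end. */
static bool tail_before(unsigned int flags, bool is_shmem)
{
	bool reject_file_backed = false;

	if ((flags & FOLL_WLPIN) == FOLL_WLPIN)
		reject_file_backed = true;

	return !reject_file_backed || is_shmem;
}

/* New shape: test the flags right where the decision is made. */
static bool tail_after(unsigned int flags, bool is_shmem)
{
	if ((flags & FOLL_WLPIN) != FOLL_WLPIN)
		return true;

	return is_shmem;
}

/* Exhaustively compare both shapes; aborts via assert() on any mismatch. */
static void check_tails_agree(void)
{
	for (unsigned int flags = 0; flags <= FOLL_WLPIN; flags++) {
		assert(tail_before(flags, false) == tail_after(flags, false));
		assert(tail_before(flags, true) == tail_after(flags, true));
	}
}
```

Only the combination of all three flags on a non-shmem mapping is rejected by either shape.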
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
mm/gup.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index e8367564d636..41eb64783e03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2737,18 +2737,9 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
*/
static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
{
- bool reject_file_backed = false;
struct address_space *mapping;
unsigned long mapping_flags;
- /*
- * If we aren't pinning then no problematic write can occur. A long term
- * pin is the most egregious case so this is the one we disallow.
- */
- if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) ==
- (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
- reject_file_backed = true;
-
/* We hold a folio reference, so we can safely access folio fields. */
if (WARN_ON_ONCE(folio_test_slab(folio)))
return false;
@@ -2797,8 +2788,18 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
*/
if (secretmem_mapping(mapping))
return false;
- /* The only remaining allowed file system is shmem. */
- return !reject_file_backed || shmem_mapping(mapping);
+
+ /*
+ * If we aren't pinning then no problematic write can occur. A writable
+ * long term pin is the most egregious case, so this is the one we
+ * allow only for ...
+ */
+ if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+ (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+ return true;
+
+ /* ... hugetlb (which we allowed above already) and shared memory. */
+ return shmem_mapping(mapping);
}
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
--
2.50.1
* [PATCH v12 04/16] mm/gup: drop secretmem optimization from gup_fast_folio_allowed
From: Kalyazin, Nikita @ 2026-04-10 15:18 UTC
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Patrick Roy <patrick.roy@linux.dev>
Drop an optimization in gup_fast_folio_allowed() where
secretmem_mapping() was only called if CONFIG_SECRETMEM=y. secretmem has
been enabled by default since commit b758fe6df50d ("mm/secretmem: make it
on by default"), so in most configurations the check was no longer being
elided anyway.
To make sure the fast path for ZONE_DEVICE pages (like Device DAX and
PCI P2PDMA) is still allowed, check for folio_is_zone_device() if
mapping is NULL.
This prepares for generalizing the handling of mappings whose folios
have their direct map entries set to not present. Currently, the only
mappings that match this description are secretmem mappings
(memfd_secret()). Later, some guest_memfd configurations will also fall
into this category.
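As a rough illustration of the two checks this patch touches, consider the reduced userspace model below. The model_* structures are hypothetical stand-ins, and the slab, hugetlb, anonymous-folio and long-term-pin checks of the real function are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct model_mapping {
	bool is_secretmem;
};

struct model_folio {
	const struct model_mapping *mapping; /* NULL if truncated or ZONE_DEVICE */
	bool is_zone_device;
};

/* Models only the NULL-mapping and secretmem checks from this patch. */
static bool model_gup_fast_allowed(const struct model_folio *folio)
{
	assert(folio != NULL);

	/* No mapping: only ZONE_DEVICE folios may stay on the fast path. */
	if (!folio->mapping)
		return folio->is_zone_device;

	/* The secretmem check is now unconditional. */
	if (folio->mapping->is_secretmem)
		return false;

	return true;
}
```

Without the ZONE_DEVICE exception, removing the early return would send every NULL-mapping folio to the slow path, including the Device DAX and PCI P2PDMA folios mentioned above.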
Signed-off-by: Patrick Roy <patrick.roy@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
mm/gup.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee73..e8367564d636 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2739,7 +2739,6 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
{
bool reject_file_backed = false;
struct address_space *mapping;
- bool check_secretmem = false;
unsigned long mapping_flags;
/*
@@ -2751,14 +2750,6 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
reject_file_backed = true;
/* We hold a folio reference, so we can safely access folio fields. */
-
- /* secretmem folios are always order-0 folios. */
- if (IS_ENABLED(CONFIG_SECRETMEM) && !folio_test_large(folio))
- check_secretmem = true;
-
- if (!reject_file_backed && !check_secretmem)
- return true;
-
if (WARN_ON_ONCE(folio_test_slab(folio)))
return false;
@@ -2787,9 +2778,13 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
* The mapping may have been truncated, in any case we cannot determine
* if this mapping is safe - fall back to slow path to determine how to
* proceed.
+ *
+ * ZONE_DEVICE folios (e.g. Device DAX, PCI P2PDMA) may legitimately
+ * have a NULL mapping. They are never secretmem/no-direct-map folios,
+ * so let them through.
*/
if (!mapping)
- return false;
+ return folio_is_zone_device(folio);
/* Anonymous folios pose no problem. */
mapping_flags = (unsigned long)mapping & FOLIO_MAPPING_FLAGS;
@@ -2800,7 +2795,7 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
* At this point, we know the mapping is non-null and points to an
* address_space object.
*/
- if (check_secretmem && secretmem_mapping(mapping))
+ if (secretmem_mapping(mapping))
return false;
/* The only remaining allowed file system is shmem. */
return !reject_file_backed || shmem_mapping(mapping);
--
2.50.1
* [PATCH v12 03/16] mm/secretmem: make use of folio_{zap,restore}_direct_map
From: Kalyazin, Nikita @ 2026-04-10 15:18 UTC
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Replace the set_direct_map_*_noflush() calls with the newly available
folio_{zap,restore}_direct_map() helpers, which look up the folio's
address internally. A side effect is that the TLB is now flushed even if
filemap_add_folio() fails; that failure path is not expected to be hot.
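The side effect can be seen in a small userspace model of the reordered fault path. Everything named model_* below is a hypothetical stand-in, and add_err simulates the return value of the page cache insertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct model_folio {
	bool in_direct_map;
};

static unsigned int tlb_flushes; /* counts simulated kernel TLB flushes */

/* Mirrors the helper's contract: invalidate the entry, then flush. */
static int model_zap_direct_map(struct model_folio *folio)
{
	folio->in_direct_map = false;
	tlb_flushes++;
	return 0;
}

/* Only valid after a successful zap, hence the precondition check. */
static void model_restore_direct_map(struct model_folio *folio)
{
	assert(!folio->in_direct_map);
	folio->in_direct_map = true;
}

/*
 * Models the reordered fault path: the flush now happens inside the zap,
 * i.e. before the page cache insertion is attempted.
 * @add_err: simulated result of the page cache insertion (0 on success).
 */
static int model_fault(struct model_folio *folio, int add_err)
{
	int err = model_zap_direct_map(folio);

	if (err)
		return err;

	if (add_err) {
		/* Insertion failed: undo the zap; the flush already happened. */
		model_restore_direct_map(folio);
		return add_err;
	}
	return 0;
}
```

In the old code the flush sat after the insertion, so a failed insertion skipped it. In this model the counter increases on both the success and the failure path.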
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
mm/secretmem.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index fd29b33c6764..27b176af8fc4 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -53,7 +53,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
pgoff_t offset = vmf->pgoff;
gfp_t gfp = vmf->gfp_mask;
- unsigned long addr;
struct folio *folio;
vm_fault_t ret;
int err;
@@ -72,7 +71,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
goto out;
}
- err = set_direct_map_invalid_noflush(folio_address(folio));
+ err = folio_zap_direct_map(folio);
if (err) {
folio_put(folio);
ret = vmf_error(err);
@@ -87,7 +86,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
* already happened when we marked the page invalid
* which guarantees that this call won't fail
*/
- set_direct_map_default_noflush(folio_address(folio));
+ folio_restore_direct_map(folio);
folio_put(folio);
if (err == -EEXIST)
goto retry;
@@ -95,9 +94,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
ret = vmf_error(err);
goto out;
}
-
- addr = (unsigned long)folio_address(folio);
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
vmf->page = folio_file_page(folio, vmf->pgoff);
--
2.50.1
* [PATCH v12 02/16] set_memory: add folio_{zap,restore}_direct_map helpers
From: Kalyazin, Nikita @ 2026-04-10 15:18 UTC
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Let's provide folio_{zap,restore}_direct_map helpers as preparation for
supporting removal of the direct map for guest_memfd folios.
In folio_zap_direct_map(), flush the TLB to make sure the data is no
longer accessible. On some architectures this may result in a double TLB
flush, because set_direct_map_valid_noflush() already performs a flush
internally.
The new helpers need to be accessible to KVM on architectures that
support guest_memfd (x86 and arm64).
Direct map removal gives guest_memfd the same protection that
memfd_secret does, such as hardening against Spectre-like attacks
through in-kernel gadgets.
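The contract of the two helpers can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel code: the model_* names are invented, the fail parameter simulates an allocation failure while splitting a huge direct map entry, and the assert() in the restore path is this sketch's own precondition check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct folio. */
struct model_folio {
	unsigned int order;  /* 0 means a single page */
	bool in_direct_map;
};

static unsigned int tlb_flushes; /* counts simulated kernel TLB flushes */

/* Stand-in for set_direct_map_valid_noflush(); @fail forces an error. */
static int model_set_valid(struct model_folio *folio, bool valid, bool fail)
{
	if (fail)
		return -ENOMEM;
	folio->in_direct_map = valid;
	return 0;
}

static int model_folio_zap_direct_map(struct model_folio *folio, bool fail)
{
	int ret;

	/* Large folios are rejected before anything is touched. */
	if (folio->order > 0)
		return -EINVAL;

	ret = model_set_valid(folio, false, fail);
	tlb_flushes++; /* flushed whether or not the update succeeded */
	return ret;
}

/* May only follow a successful zap, so it is not expected to fail. */
static void model_folio_restore_direct_map(struct model_folio *folio)
{
	assert(!folio->in_direct_map);
	model_set_valid(folio, true, false);
}
```

A caller is expected to pair the two: zap once when the folio is handed to the guest or to userspace, and restore before the folio is freed back to the page allocator.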
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
include/linux/set_memory.h | 13 +++++++++++
mm/memory.c | 45 ++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 1a2563f525fc..24caea2931f9 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -41,6 +41,15 @@ static inline int set_direct_map_valid_noflush(const void *addr,
return 0;
}
+static inline int folio_zap_direct_map(struct folio *folio)
+{
+ return 0;
+}
+
+static inline void folio_restore_direct_map(struct folio *folio)
+{
+}
+
static inline bool kernel_page_present(struct page *page)
{
return true;
@@ -57,6 +66,10 @@ static inline bool can_set_direct_map(void)
}
#define can_set_direct_map can_set_direct_map
#endif
+
+int folio_zap_direct_map(struct folio *folio);
+void folio_restore_direct_map(struct folio *folio);
+
#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
#ifdef CONFIG_X86_64
diff --git a/mm/memory.c b/mm/memory.c
index 2f815a34d924..3b9ada2cc19c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/sched/sysctl.h>
#include <linux/pgalloc.h>
#include <linux/uaccess.h>
+#include <linux/set_memory.h>
#include <trace/events/kmem.h>
@@ -7479,3 +7480,47 @@ void vma_pgtable_walk_end(struct vm_area_struct *vma)
if (is_vm_hugetlb_page(vma))
hugetlb_vma_unlock_read(vma);
}
+
+#ifdef CONFIG_ARCH_HAS_SET_DIRECT_MAP
+/**
+ * folio_zap_direct_map - remove a folio from the kernel direct map
+ * @folio: folio to remove from the direct map
+ *
+ * Removes the folio from the kernel direct map and flushes the TLB. This may
+ * require splitting huge pages in the direct map, which can fail due to memory
+ * allocation. So far, only order-0 folios are supported.
+ *
+ * Return: 0 on success, or a negative error code on failure.
+ */
+int folio_zap_direct_map(struct folio *folio)
+{
+ const void *addr = folio_address(folio);
+ int ret;
+
+ if (folio_test_large(folio))
+ return -EINVAL;
+
+ ret = set_direct_map_valid_noflush(addr, folio_nr_pages(folio), false);
+ flush_tlb_kernel_range((unsigned long)addr,
+ (unsigned long)addr + folio_size(folio));
+
+ return ret;
+}
+EXPORT_SYMBOL_FOR_MODULES(folio_zap_direct_map, "kvm");
+
+/**
+ * folio_restore_direct_map - restore the kernel direct map entry for a folio
+ * @folio: folio whose direct map entry is to be restored
+ *
+ * This may only be called after a prior successful folio_zap_direct_map() on
+ * the same folio. Because the zap will have already split any huge pages in
+ * the direct map, restoration here only updates protection bits and cannot
+ * fail.
+ */
+void folio_restore_direct_map(struct folio *folio)
+{
+ WARN_ON_ONCE(set_direct_map_valid_noflush(folio_address(folio),
+ folio_nr_pages(folio), true));
+}
+EXPORT_SYMBOL_FOR_MODULES(folio_restore_direct_map, "kvm");
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
--
2.50.1
* [PATCH v12 01/16] set_memory: set_direct_map_* to take address
From: Kalyazin, Nikita @ 2026-04-10 15:17 UTC
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
From: Nikita Kalyazin <nikita.kalyazin@linux.dev>
Let's convert set_direct_map_*() to take an address instead of a page to
prepare for adding helpers that operate on folios; it will be more
efficient to convert from a folio directly to an address without going
through a page first.
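What the interface change means for callers can be illustrated with a toy userspace model. The model_* names, the four-page "direct map" array and the page size constant are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES  4u

/* Hypothetical stand-ins: a tiny "direct map" backing a few fake pages. */
struct model_page { unsigned int pfn; };

static unsigned char model_direct_map[MODEL_NR_PAGES * MODEL_PAGE_SIZE];
static unsigned char model_invalid[MODEL_NR_PAGES]; /* 1 = entry invalidated */

/* Stand-in for page_address(): page to direct map address. */
static const void *model_page_address(const struct model_page *page)
{
	assert(page->pfn < MODEL_NR_PAGES);
	return &model_direct_map[page->pfn * MODEL_PAGE_SIZE];
}

/* New-style interface: takes the direct map address, not the page. */
static int model_set_direct_map_invalid_noflush(const void *addr)
{
	uintptr_t off = (uintptr_t)addr - (uintptr_t)model_direct_map;

	/* Addresses outside the modelled range are rejected. */
	if (off >= sizeof(model_direct_map))
		return -1;

	model_invalid[off / MODEL_PAGE_SIZE] = 1;
	return 0;
}

/* A page-based caller now does the conversion itself. */
static int model_invalidate_page(const struct model_page *page)
{
	return model_set_direct_map_invalid_noflush(model_page_address(page));
}
```

A folio-based caller can pass its own address in the same way, without first going through a struct page.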
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@linux.dev>
---
arch/arm64/include/asm/set_memory.h | 7 ++++---
arch/arm64/mm/pageattr.c | 19 +++++++++--------
arch/loongarch/include/asm/set_memory.h | 7 ++++---
arch/loongarch/mm/pageattr.c | 25 ++++++++++-------------
arch/riscv/include/asm/set_memory.h | 7 ++++---
arch/riscv/mm/pageattr.c | 17 ++++++++--------
arch/s390/include/asm/set_memory.h | 7 ++++---
arch/s390/mm/pageattr.c | 13 ++++++------
arch/x86/include/asm/set_memory.h | 7 ++++---
arch/x86/mm/pat/set_memory.c | 27 +++++++++++++------------
include/linux/set_memory.h | 9 +++++----
kernel/power/snapshot.c | 4 ++--
mm/execmem.c | 6 ++++--
mm/secretmem.c | 6 +++---
mm/vmalloc.c | 11 ++++++----
15 files changed, 91 insertions(+), 81 deletions(-)
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
index 90f61b17275e..c71a2a6812c4 100644
--- a/arch/arm64/include/asm/set_memory.h
+++ b/arch/arm64/include/asm/set_memory.h
@@ -11,9 +11,10 @@ bool can_set_direct_map(void);
int set_memory_valid(unsigned long addr, int numpages, int enable);
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
+int set_direct_map_invalid_noflush(const void *addr);
+int set_direct_map_default_noflush(const void *addr);
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid);
bool kernel_page_present(struct page *page);
int set_memory_encrypted(unsigned long addr, int numpages);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 358d1dc9a576..5aff94e1f8b2 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -245,7 +245,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
__pgprot(PTE_VALID));
}
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(const void *addr)
{
pgprot_t clear_mask = __pgprot(PTE_VALID);
pgprot_t set_mask = __pgprot(0);
@@ -253,11 +253,11 @@ int set_direct_map_invalid_noflush(struct page *page)
if (!can_set_direct_map())
return 0;
- return update_range_prot((unsigned long)page_address(page),
- PAGE_SIZE, set_mask, clear_mask);
+ return update_range_prot((unsigned long)addr, PAGE_SIZE, set_mask,
+ clear_mask);
}
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(const void *addr)
{
pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
pgprot_t clear_mask = __pgprot(PTE_RDONLY);
@@ -265,8 +265,8 @@ int set_direct_map_default_noflush(struct page *page)
if (!can_set_direct_map())
return 0;
- return update_range_prot((unsigned long)page_address(page),
- PAGE_SIZE, set_mask, clear_mask);
+ return update_range_prot((unsigned long)addr, PAGE_SIZE, set_mask,
+ clear_mask);
}
static int __set_memory_enc_dec(unsigned long addr,
@@ -349,14 +349,13 @@ int realm_register_memory_enc_ops(void)
return arm64_mem_crypt_ops_register(&realm_crypt_ops);
}
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid)
{
- unsigned long addr = (unsigned long)page_address(page);
-
if (!can_set_direct_map())
return 0;
- return set_memory_valid(addr, nr, valid);
+ return set_memory_valid((unsigned long)addr, numpages, valid);
}
#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/arch/loongarch/include/asm/set_memory.h b/arch/loongarch/include/asm/set_memory.h
index 55dfaefd02c8..5e9b67b2fea1 100644
--- a/arch/loongarch/include/asm/set_memory.h
+++ b/arch/loongarch/include/asm/set_memory.h
@@ -15,8 +15,9 @@ int set_memory_ro(unsigned long addr, int numpages);
int set_memory_rw(unsigned long addr, int numpages);
bool kernel_page_present(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
+int set_direct_map_invalid_noflush(const void *addr);
+int set_direct_map_default_noflush(const void *addr);
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid);
#endif /* _ASM_LOONGARCH_SET_MEMORY_H */
diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index f5e910b68229..9e08905d3624 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -198,32 +198,29 @@ bool kernel_page_present(struct page *page)
return pte_present(ptep_get(pte));
}
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(const void *addr)
{
- unsigned long addr = (unsigned long)page_address(page);
-
- if (addr < vm_map_base)
+ if ((unsigned long)addr < vm_map_base)
return 0;
- return __set_memory(addr, 1, PAGE_KERNEL, __pgprot(0));
+ return __set_memory((unsigned long)addr, 1, PAGE_KERNEL, __pgprot(0));
}
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(const void *addr)
{
- unsigned long addr = (unsigned long)page_address(page);
-
- if (addr < vm_map_base)
+ if ((unsigned long)addr < vm_map_base)
return 0;
- return __set_memory(addr, 1, __pgprot(0), __pgprot(_PAGE_PRESENT | _PAGE_VALID));
+ return __set_memory((unsigned long)addr, 1, __pgprot(0),
+ __pgprot(_PAGE_PRESENT | _PAGE_VALID));
}
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid)
{
- unsigned long addr = (unsigned long)page_address(page);
pgprot_t set, clear;
- if (addr < vm_map_base)
+ if ((unsigned long)addr < vm_map_base)
return 0;
if (valid) {
@@ -234,5 +231,5 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
clear = __pgprot(_PAGE_PRESENT | _PAGE_VALID);
}
- return __set_memory(addr, 1, set, clear);
+ return __set_memory((unsigned long)addr, 1, set, clear);
}
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 87389e93325a..a87eabd7fc78 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -40,9 +40,10 @@ static inline int set_kernel_memory(char *startp, char *endp,
}
#endif
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
+int set_direct_map_invalid_noflush(const void *addr);
+int set_direct_map_default_noflush(const void *addr);
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid);
bool kernel_page_present(struct page *page);
#endif /* __ASSEMBLER__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 3f76db3d2769..0a457177a88c 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -374,19 +374,20 @@ int set_memory_nx(unsigned long addr, int numpages)
return __set_memory(addr, numpages, __pgprot(0), __pgprot(_PAGE_EXEC));
}
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(const void *addr)
{
- return __set_memory((unsigned long)page_address(page), 1,
- __pgprot(0), __pgprot(_PAGE_PRESENT));
+ return __set_memory((unsigned long)addr, 1, __pgprot(0),
+ __pgprot(_PAGE_PRESENT));
}
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(const void *addr)
{
- return __set_memory((unsigned long)page_address(page), 1,
- PAGE_KERNEL, __pgprot(_PAGE_EXEC));
+ return __set_memory((unsigned long)addr, 1, PAGE_KERNEL,
+ __pgprot(_PAGE_EXEC));
}
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid)
{
pgprot_t set, clear;
@@ -398,7 +399,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
clear = __pgprot(_PAGE_PRESENT);
}
- return __set_memory((unsigned long)page_address(page), nr, set, clear);
+ return __set_memory((unsigned long)addr, numpages, set, clear);
}
#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/arch/s390/include/asm/set_memory.h b/arch/s390/include/asm/set_memory.h
index 94092f4ae764..3e43c3c96e67 100644
--- a/arch/s390/include/asm/set_memory.h
+++ b/arch/s390/include/asm/set_memory.h
@@ -60,9 +60,10 @@ __SET_MEMORY_FUNC(set_memory_rox, SET_MEMORY_RO | SET_MEMORY_X)
__SET_MEMORY_FUNC(set_memory_rwnx, SET_MEMORY_RW | SET_MEMORY_NX)
__SET_MEMORY_FUNC(set_memory_4k, SET_MEMORY_4K)
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
+int set_direct_map_invalid_noflush(const void *addr);
+int set_direct_map_default_noflush(const void *addr);
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid);
bool kernel_page_present(struct page *page);
#endif
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index bb29c38ae624..8e90ff5cf50d 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -383,17 +383,18 @@ int __set_memory(unsigned long addr, unsigned long numpages, unsigned long flags
return rc;
}
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(const void *addr)
{
- return __set_memory((unsigned long)page_to_virt(page), 1, SET_MEMORY_INV);
+ return __set_memory((unsigned long)addr, 1, SET_MEMORY_INV);
}
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(const void *addr)
{
- return __set_memory((unsigned long)page_to_virt(page), 1, SET_MEMORY_DEF);
+ return __set_memory((unsigned long)addr, 1, SET_MEMORY_DEF);
}
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid)
{
unsigned long flags;
@@ -402,7 +403,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
else
flags = SET_MEMORY_INV;
- return __set_memory((unsigned long)page_to_virt(page), nr, flags);
+ return __set_memory((unsigned long)addr, numpages, flags);
}
bool kernel_page_present(struct page *page)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4362c26aa992..b6a4173ff249 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,9 +86,10 @@ int set_pages_wb(struct page *page, int numpages);
int set_pages_ro(struct page *page, int numpages);
int set_pages_rw(struct page *page, int numpages);
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
+int set_direct_map_invalid_noflush(const void *addr);
+int set_direct_map_default_noflush(const void *addr);
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid);
bool kernel_page_present(struct page *page);
extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 40581a720fe8..7517195b75b9 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2587,9 +2587,9 @@ int set_pages_rw(struct page *page, int numpages)
return set_memory_rw(addr, numpages);
}
-static int __set_pages_p(struct page *page, int numpages)
+static int __set_pages_p(const void *addr, int numpages)
{
- unsigned long tempaddr = (unsigned long) page_address(page);
+ unsigned long tempaddr = (unsigned long)addr;
struct cpa_data cpa = { .vaddr = &tempaddr,
.pgd = NULL,
.numpages = numpages,
@@ -2606,9 +2606,9 @@ static int __set_pages_p(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 1);
}
-static int __set_pages_np(struct page *page, int numpages)
+static int __set_pages_np(const void *addr, int numpages)
{
- unsigned long tempaddr = (unsigned long) page_address(page);
+ unsigned long tempaddr = (unsigned long)addr;
struct cpa_data cpa = { .vaddr = &tempaddr,
.pgd = NULL,
.numpages = numpages,
@@ -2625,22 +2625,23 @@ static int __set_pages_np(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 1);
}
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(const void *addr)
{
- return __set_pages_np(page, 1);
+ return __set_pages_np(addr, 1);
}
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(const void *addr)
{
- return __set_pages_p(page, 1);
+ return __set_pages_p(addr, 1);
}
-int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+int set_direct_map_valid_noflush(const void *addr, unsigned long numpages,
+ bool valid)
{
if (valid)
- return __set_pages_p(page, nr);
+ return __set_pages_p(addr, numpages);
- return __set_pages_np(page, nr);
+ return __set_pages_np(addr, numpages);
}
#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -2659,9 +2660,9 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
* and hence no memory allocations during large page split.
*/
if (enable)
- __set_pages_p(page, numpages);
+ __set_pages_p(page_address(page), numpages);
else
- __set_pages_np(page, numpages);
+ __set_pages_np(page_address(page), numpages);
/*
* We should perform an IPI and flush all tlbs,
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 3030d9245f5a..1a2563f525fc 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -25,17 +25,18 @@ static inline int set_memory_rox(unsigned long addr, int numpages)
#endif
#ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
-static inline int set_direct_map_invalid_noflush(struct page *page)
+static inline int set_direct_map_invalid_noflush(const void *addr)
{
return 0;
}
-static inline int set_direct_map_default_noflush(struct page *page)
+static inline int set_direct_map_default_noflush(const void *addr)
{
return 0;
}
-static inline int set_direct_map_valid_noflush(struct page *page,
- unsigned nr, bool valid)
+static inline int set_direct_map_valid_noflush(const void *addr,
+ unsigned long numpages,
+ bool valid)
{
return 0;
}
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6e1321837c66..6eddfb22c0ff 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -88,7 +88,7 @@ static inline int hibernate_restore_unprotect_page(void *page_address) {return 0
static inline void hibernate_map_page(struct page *page)
{
if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
- int ret = set_direct_map_default_noflush(page);
+ int ret = set_direct_map_default_noflush(page_address(page));
if (ret)
pr_warn_once("Failed to remap page\n");
@@ -101,7 +101,7 @@ static inline void hibernate_unmap_page(struct page *page)
{
if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
unsigned long addr = (unsigned long)page_address(page);
- int ret = set_direct_map_invalid_noflush(page);
+ int ret = set_direct_map_invalid_noflush(page_address(page));
if (ret)
pr_warn_once("Failed to remap page\n");
diff --git a/mm/execmem.c b/mm/execmem.c
index 810a4ba9c924..220298ec87c8 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -119,7 +119,8 @@ static int execmem_set_direct_map_valid(struct vm_struct *vm, bool valid)
int err = 0;
for (int i = 0; i < vm->nr_pages; i += nr) {
- err = set_direct_map_valid_noflush(vm->pages[i], nr, valid);
+ err = set_direct_map_valid_noflush(page_address(vm->pages[i]),
+ nr, valid);
if (err)
goto err_restore;
updated += nr;
@@ -129,7 +130,8 @@ static int execmem_set_direct_map_valid(struct vm_struct *vm, bool valid)
err_restore:
for (int i = 0; i < updated; i += nr)
- set_direct_map_valid_noflush(vm->pages[i], nr, !valid);
+ set_direct_map_valid_noflush(page_address(vm->pages[i]), nr,
+ !valid);
return err;
}
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 11a779c812a7..fd29b33c6764 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -72,7 +72,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
goto out;
}
- err = set_direct_map_invalid_noflush(folio_page(folio, 0));
+ err = set_direct_map_invalid_noflush(folio_address(folio));
if (err) {
folio_put(folio);
ret = vmf_error(err);
@@ -87,7 +87,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
* already happened when we marked the page invalid
* which guarantees that this call won't fail
*/
- set_direct_map_default_noflush(folio_page(folio, 0));
+ set_direct_map_default_noflush(folio_address(folio));
folio_put(folio);
if (err == -EEXIST)
goto retry;
@@ -151,7 +151,7 @@ static int secretmem_migrate_folio(struct address_space *mapping,
static void secretmem_free_folio(struct folio *folio)
{
- set_direct_map_default_noflush(folio_page(folio, 0));
+ set_direct_map_default_noflush(folio_address(folio));
folio_zero_segment(folio, 0, folio_size(folio));
}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..8822f73957d9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3342,14 +3342,17 @@ struct vm_struct *remove_vm_area(const void *addr)
}
static inline void set_area_direct_map(const struct vm_struct *area,
- int (*set_direct_map)(struct page *page))
+ int (*set_direct_map)(const void *addr))
{
int i;
/* HUGE_VMALLOC passes small pages to set_direct_map */
- for (i = 0; i < area->nr_pages; i++)
- if (page_address(area->pages[i]))
- set_direct_map(area->pages[i]);
+ for (i = 0; i < area->nr_pages; i++) {
+ const void *addr = page_address(area->pages[i]);
+
+ if (addr)
+ set_direct_map(addr);
+ }
}
/*
--
2.50.1
^ permalink raw reply related
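
The calling convention this patch moves to (callers resolve the address themselves, then pass it in) can be sketched in user space. This is a mock under stated assumptions, not kernel code: every `mock_` name is hypothetical, the "direct map" is a byte array with one zapped bit per page, and only the parameter list of `mock_set_direct_map_valid()` and the rollback shape of `mock_set_pages_valid()` are taken from the diff above (the latter from the execmem hunk).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096UL
#define MOCK_NR_PAGES  8UL

/* Stand-in for the direct map: backing bytes plus one "zapped" bit per page. */
static unsigned char mock_direct_map[MOCK_NR_PAGES * MOCK_PAGE_SIZE];
static bool mock_zapped[MOCK_NR_PAGES];

/* Hypothetical stand-in for struct page: it only records a frame number. */
struct mock_page {
	unsigned long pfn;	/* must be <= MOCK_NR_PAGES in this mock */
};

/* Plays the role of page_address(): frame number to direct-map address. */
static const void *mock_page_address(const struct mock_page *page)
{
	return mock_direct_map + page->pfn * MOCK_PAGE_SIZE;
}

/* Same parameter list as the patched set_direct_map_valid_noflush(); mock body. */
static int mock_set_direct_map_valid(const void *addr, unsigned long numpages,
				     bool valid)
{
	unsigned long first;

	assert(addr != NULL);
	first = (unsigned long)((const unsigned char *)addr - mock_direct_map) /
		MOCK_PAGE_SIZE;
	if (first + numpages > MOCK_NR_PAGES)
		return -1;	/* mock failure: range runs past the map */

	for (unsigned long i = 0; i < numpages; i++)
		mock_zapped[first + i] = !valid;
	return 0;
}

/* Update page by page; on failure, undo the pages already updated. */
static int mock_set_pages_valid(struct mock_page **pages,
				unsigned long nr_pages, bool valid)
{
	unsigned long updated = 0;
	int err = 0;

	for (unsigned long i = 0; i < nr_pages; i++) {
		err = mock_set_direct_map_valid(mock_page_address(pages[i]),
						1, valid);
		if (err)
			goto err_restore;
		updated++;
	}
	return 0;

err_restore:
	for (unsigned long i = 0; i < updated; i++)
		mock_set_direct_map_valid(mock_page_address(pages[i]), 1,
					  !valid);
	return err;
}
```

Calling `mock_set_pages_valid()` with a list whose last page lies outside the mock map returns an error and leaves the earlier pages in their original state, which is the property the `err_restore` path is there to provide.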
* [PATCH v12 00/16] Direct Map Removal Support for guest_memfd
From: Kalyazin, Nikita @ 2026-04-10 15:17 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
kernel@xen0n.name, linux-riscv@lists.infradead.org,
linux-s390@vger.kernel.org, loongarch@lists.linux.dev,
linux-pm@vger.kernel.org
Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
oupton@kernel.org, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org,
seanjc@google.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
willy@infradead.org, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev,
eddyz87@gmail.com, song@kernel.org, yonghong.song@linux.dev,
john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
haoluo@google.com, jolsa@kernel.org, jgg@ziepe.ca,
jhubbard@nvidia.com, peterx@redhat.com, jannh@google.com,
pfalcato@suse.de, skhan@linuxfoundation.org, riel@surriel.com,
ryan.roberts@arm.com, jgross@suse.com, yu-cheng.yu@intel.com,
kas@kernel.org, coxu@redhat.com, ackerleytng@google.com,
yosry@kernel.org, ajones@ventanamicro.com, maobibo@loongson.cn,
tabba@google.com, prsampat@amd.com, wu.fei9@sanechips.com.cn,
mlevitsk@redhat.com, jmattson@google.com, jthoughton@google.com,
agordeev@linux.ibm.com, alex@ghiti.fr, aou@eecs.berkeley.edu,
borntraeger@linux.ibm.com, chenhuacai@kernel.org,
baolu.lu@linux.intel.com, dev.jain@arm.com, gor@linux.ibm.com,
hca@linux.ibm.com, palmer@dabbelt.com, pjw@kernel.org,
shijie@os.amperecomputing.com, svens@linux.ibm.com,
thuth@redhat.com, yang@os.amperecomputing.com,
Liam.Howlett@oracle.com, urezki@gmail.com,
zhengqi.arch@bytedance.com, gerald.schaefer@linux.ibm.com,
jiayuan.chen@shopee.com, lenb@kernel.org, pavel@kernel.org,
rafael@kernel.org, yangyicong@hisilicon.com,
vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek,
Kalyazin, Nikita, Nikita Kalyazin
From: Nikita Kalyazin <nikita.kalyazin@linux.dev>
[ based on kvm/next ]
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: if the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly
60% of speculative execution issues fall into this category [1, Table
1].
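
The structural part of that argument (no entry in the map means no path to the data through the map) can be illustrated with a toy model. It is a sketch only: all `toy_` names are hypothetical, the "direct map" is a four-entry table, and nothing here models speculation or real MMU behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NR_PAGES  4u

/* One toy entry per direct-map page: a present bit and the backing frame. */
struct toy_pte {
	bool present;
	uint8_t *frame;
};

static uint8_t toy_frames[TOY_NR_PAGES][TOY_PAGE_SIZE];
static struct toy_pte toy_direct_map[TOY_NR_PAGES];

/* Map every frame, as a freshly booted direct map would. */
static void toy_direct_map_init(void)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		toy_direct_map[i].present = true;
		toy_direct_map[i].frame = toy_frames[i];
	}
}

/* Remove one page from the toy direct map. */
static void toy_direct_map_zap(size_t page)
{
	assert(page < TOY_NR_PAGES);
	toy_direct_map[page].present = false;
	toy_direct_map[page].frame = NULL;
}

/* Resolve an offset through the map; NULL stands in for a fault. */
static const uint8_t *toy_direct_map_lookup(size_t offset)
{
	size_t page = offset / TOY_PAGE_SIZE;

	if (page >= TOY_NR_PAGES || !toy_direct_map[page].present)
		return NULL;
	return toy_direct_map[page].frame + offset % TOY_PAGE_SIZE;
}
```

After `toy_direct_map_zap(2)`, lookups into page 2 return NULL while the other pages still resolve; the data in `toy_frames[2]` is untouched, it is just unreachable through the map.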
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, so that KVM guests whose
memory is backed by guest_memfd can attain the above protection.
Additionally, a Firecracker branch with support for these VMs can be
found on GitHub [2].
For more details, please refer to the v5 cover letter. No substantial
changes in design have taken place since.
See also related write() syscall support in guest_memfd [3] where
the interoperation between the two features is described.
Changes since v11:
- Ackerley/Sashiko: fix previously missed __set_pages_* argument update
in __kernel_map_pages (patch 1)
- David: disallow large folios in folio_zap_direct_map (patch 2)
- David/Sashiko: check for folio_is_zone_device if mapping is NULL in
gup_fast_folio_allowed (patch 4)
- Ackerley/Sashiko: kvm_arch_gmem_supports_no_direct_map to return
false for SEV-SNP (patch 8)
- David: replace a redundant check for GUEST_MEMFD_FLAG_NO_DIRECT_MAP
with a WARN_ON_ONCE (patch 10)
- David: assert the folio is locked when zapping direct map (patch 10)
- Ackerley/Sashiko: reorder operations to "zap then prepare" and
"invalidate then restore" (patch 10)
v11: https://lore.kernel.org/kvm/20260317141031.514-1-kalyazin@amazon.com
v10: https://lore.kernel.org/kvm/20260126164445.11867-1-kalyazin@amazon.com
v9: https://lore.kernel.org/kvm/20260114134510.1835-1-kalyazin@amazon.com
v8: https://lore.kernel.org/kvm/20251205165743.9341-1-kalyazin@amazon.com
v7: https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu.de
v6: https://lore.kernel.org/kvm/20250912091708.17502-1-roypat@amazon.co.uk
v5: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk
v4: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk
RFCv3: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk
RFCv2: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk
RFCv1: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk
[1] https://download.vusec.net/papers/quarantine_raid23.pdf
[2] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[3] https://lore.kernel.org/kvm/20251114151828.98165-1-kalyazin@amazon.com
Nikita Kalyazin (4):
set_memory: set_direct_map_* to take address
set_memory: add folio_{zap,restore}_direct_map helpers
mm/secretmem: make use of folio_{zap,restore}_direct_map
mm/gup: drop local variable in gup_fast_folio_allowed
Patrick Roy (12):
mm/gup: drop secretmem optimization from gup_fast_folio_allowed
mm: introduce AS_NO_DIRECT_MAP
KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
KVM: x86: define kvm_arch_gmem_supports_no_direct_map()
KVM: arm64: define kvm_arch_gmem_supports_no_direct_map()
KVM: guest_memfd: Add flag to remove from direct map
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing
selftests
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/virt/kvm/api.rst | 21 +++---
arch/arm64/include/asm/kvm_host.h | 13 ++++
arch/arm64/include/asm/set_memory.h | 7 +-
arch/arm64/mm/pageattr.c | 19 +++--
arch/loongarch/include/asm/set_memory.h | 7 +-
arch/loongarch/mm/pageattr.c | 25 +++----
arch/riscv/include/asm/set_memory.h | 7 +-
arch/riscv/mm/pageattr.c | 17 +++--
arch/s390/include/asm/set_memory.h | 7 +-
arch/s390/mm/pageattr.c | 13 ++--
arch/x86/include/asm/kvm_host.h | 6 ++
arch/x86/include/asm/set_memory.h | 7 +-
arch/x86/kvm/x86.c | 7 ++
arch/x86/mm/pat/set_memory.c | 27 +++----
include/linux/kvm_host.h | 14 ++++
include/linux/pagemap.h | 16 ++++
include/linux/secretmem.h | 18 -----
include/linux/set_memory.h | 22 +++++-
include/uapi/linux/kvm.h | 1 +
kernel/power/snapshot.c | 4 +-
lib/buildid.c | 8 +-
mm/execmem.c | 6 +-
mm/gup.c | 47 ++++++------
mm/memory.c | 45 +++++++++++
mm/mlock.c | 2 +-
mm/secretmem.c | 18 ++---
mm/vmalloc.c | 11 ++-
.../testing/selftests/kvm/guest_memfd_test.c | 17 ++++-
.../testing/selftests/kvm/include/kvm_util.h | 37 ++++++---
.../testing/selftests/kvm/include/test_util.h | 8 ++
tools/testing/selftests/kvm/lib/elf.c | 8 +-
tools/testing/selftests/kvm/lib/io.c | 23 ++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 59 ++++++++-------
tools/testing/selftests/kvm/lib/test_util.c | 8 ++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 52 ++++++++++++-
.../kvm/x86/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 75 +++++++++++++++++--
39 files changed, 489 insertions(+), 202 deletions(-)
base-commit: 24f9515de8778410e4b84c85b196c9850d2c1e18
--
2.50.1
^ permalink raw reply
* [PATCH] docs: tainted-kernels: fix typos in documentation
From: Brian Knutsson @ 2026-04-10 14:56 UTC (permalink / raw)
To: linux-doc, linux-kernel; +Cc: Knutsson Development
From: Knutsson Development <development@knutsson.it>
Fix a minor typo in the tainted-kernels documentation:
- 'a more details explanation' -> 'a more detailed explanation'
Signed-off-by: Knutsson Development <development@knutsson.it>
---
Documentation/admin-guide/tainted-kernels.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index ed1f8f1e86c5..714186159536 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -63,7 +63,7 @@ this on the machine that had the statements in the logs that were quoted earlier
* Externally-built ('out-of-tree') module was loaded (#12)
See Documentation/admin-guide/tainted-kernels.rst in the Linux kernel or
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html for
- a more details explanation of the various taint flags.
+ a more detailed explanation of the various taint flags.
Raw taint value as int/string: 4609/'P W O '
You can try to decode the number yourself. That's easy if there was only one
--
2.48.1
^ permalink raw reply related
* Re: [PATCH v6] hwmon: add driver for ARCTIC Fan Controller
From: Thomas Weißschuh @ 2026-04-10 14:53 UTC (permalink / raw)
To: Aureo Serrano de Souza
Cc: linux-hwmon, linux, corbet, skhan, linux-doc, linux-kernel
In-Reply-To: <20260401153949.77488-1-aureo.serrano@arctic.de>
Hi Aureo,
On 2026-04-01 23:39:47+0800, Aureo Serrano de Souza wrote:
(...)
> +struct arctic_fan_data {
> + struct hid_device *hdev;
> + struct device *hwmon_dev; /* stored for explicit unregister in remove() */
> + spinlock_t in_report_lock; /* protects fan_rpm, ack_status, write_pending, pwm_duty */
> + struct completion in_report_received; /* ACK (ID 0x02) received in raw_event */
> + int ack_status; /* 0 = OK, negative errno on device error */
> + bool write_pending; /* true while an OUT report ACK is in flight */
> + u32 fan_rpm[ARCTIC_NUM_FANS];
> + u8 pwm_duty[ARCTIC_NUM_FANS]; /* 0-255 matching sysfs range; converted to 0-100 on send */
> + /*
> + * OUT report buffer. Cache-line aligned so it occupies its own cache
> + * line, preventing DMA cache-coherency issues with adjacent fields
> + * (fan_rpm[], pwm_duty[]) on non-coherent architectures.
> + * Embedded in the devm_kzalloc'd struct so it is heap-allocated and
> + * passes usb_hcd_map_urb_for_dma(). Serialized by the hwmon core.
> + */
> + u8 buf[ARCTIC_REPORT_LEN] ____cacheline_aligned;
I recently discovered __dma_from_device_group_begin() / _end().
These look like the correct solution to use here. It would also make
parts of the comment unnecessary, as the macro already expresses this
semantic.
> +};
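
The layout property under discussion, that the DMA buffer shares no cache line with neighbouring fields, can be sketched in user space with C11 `alignas`. This is not the kernel macro pair named above, only an illustration of the property; the 64-byte line size and every `SKETCH_`/`sketch_` name and constant are assumptions, and the struct is a reduced version of the quoted one.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE  64	/* assumed cache line size */
#define SKETCH_NUM_FANS   10	/* hypothetical value */
#define SKETCH_REPORT_LEN 32	/* hypothetical value */

struct sketch_fan_data {
	uint32_t fan_rpm[SKETCH_NUM_FANS];
	uint8_t pwm_duty[SKETCH_NUM_FANS];
	/* Start of the DMA region: begins on a fresh cache line. */
	alignas(SKETCH_CACHELINE) uint8_t buf[SKETCH_REPORT_LEN];
	/* First field after the region: pushed to the next cache line. */
	alignas(SKETCH_CACHELINE) uint8_t after_dma;
};

/* Compile-time check that the buffer starts on a cache-line boundary. */
static_assert(offsetof(struct sketch_fan_data, buf) % SKETCH_CACHELINE == 0,
	      "DMA buffer must start on a cache-line boundary");

/* True if no other field can share a cache line with buf[]. */
static bool sketch_dma_region_isolated(void)
{
	size_t begin = offsetof(struct sketch_fan_data, buf);
	size_t end = offsetof(struct sketch_fan_data, after_dma);

	return begin % SKETCH_CACHELINE == 0 &&
	       end % SKETCH_CACHELINE == 0 &&
	       end - begin >= SKETCH_REPORT_LEN;
}
```

Aligning both the buffer and the field that follows it is what keeps `fan_rpm[]` and `pwm_duty[]` writes from touching the buffer's cache line; aligning only `buf` would leave its tail shared with whatever comes next.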
(...)
Thomas
^ permalink raw reply
* Re: [PATCH V10 00/10] famfs: port into fuse
From: John Groves @ 2026-04-10 14:46 UTC (permalink / raw)
To: Joanne Koong
Cc: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield, John Groves, Jonathan Corbet, Shuah Khan,
Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, David Hildenbrand, Christian Brauner,
Darrick J . Wong, Randy Dunlap, Jeff Layton, Amir Goldstein,
Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org
In-Reply-To: <CAJnrk1ZRTGWjNzkMxS3UkeZMmrpadJDtWKontMx2=d-smXYq=w@mail.gmail.com>
On 26/04/06 10:43AM, Joanne Koong wrote:
> On Tue, Mar 31, 2026 at 5:37 AM John Groves <john@jagalactic.com> wrote:
> >
> > From: John Groves <john@groves.net>
> >
> > NOTE: this series depends on the famfs dax series in Ira's for-7.1/dax-famfs
> > branch [0]
> >
> > Changes v9 -> v10
> > - Rebased to Ira's for-7.1/dax-famfs branch [0], which contains the required
> > dax patches
> > - Add parentheses to FUSE_IS_VIRTIO_DAX() macro, in case something bad is
> > passed in as fuse_inode (thanks Jonathan's AI)
> >
> > Description:
> >
> > This patch series introduces famfs into the fuse file system framework.
> > Famfs depends on the bundled dax patch set.
> >
> > The famfs user space code can be found at [1].
> >
> > Fuse Overview:
> >
> > Famfs started as a standalone file system, but this series is intended to
> > permanently supersede that implementation. At a high level, famfs adds
> > two new fuse server messages:
> >
> > GET_FMAP - Retrieves a famfs fmap (the file-to-dax map for a famfs
> > file)
> > GET_DAXDEV - Retrieves the details of a particular daxdev that was
> > referenced by an fmap
> >
> > Famfs Overview
> >
> > Famfs exposes shared memory as a file system. Famfs consumes shared
> > memory from dax devices, and provides memory-mappable files that map
> > directly to the memory - no page cache involvement. Famfs differs from
> > conventional file systems in fs-dax mode, in that it handles in-memory
> > metadata in a sharable way (which begins with never caching dirty shared
> > metadata).
> >
> > Famfs started as a standalone file system [2,3], but the consensus at
> > LSFMM was that it should be ported into fuse [4,5].
> >
> > The key performance requirement is that famfs must resolve mapping faults
> > without upcalls. This is achieved by fully caching the file-to-devdax
> > metadata for all active files. This is done via two fuse client/server
> > message/response pairs: GET_FMAP and GET_DAXDEV.
> >
> > Famfs remains the first fs-dax file system that is backed by devdax
> > rather than pmem in fs-dax mode (hence the need for the new dax mode).
> >
> > Notes
> >
> > - When a file is opened in a famfs mount, the OPEN is followed by a
> > GET_FMAP message and response. The "fmap" is the full file-to-dax
> > mapping, allowing the fuse/famfs kernel code to handle
> > read/write/fault without any upcalls.
> >
> > - After each GET_FMAP, the fmap is checked for extents that reference
> > previously-unknown daxdevs. Each such occurrence is handled with a
> > GET_DAXDEV message and response.
> >
> > - Daxdevs are stored in a table (which might become an xarray at some
> > point). When entries are added to the table, we acquire exclusive
> > access to the daxdev via the fs_dax_get() call (modeled after how
> > fs-dax handles this with pmem devices). Famfs provides
> > holder_operations to devdax, providing a notification path in the
> > event of memory errors or forced reconfiguration.
> >
> > - If devdax notifies famfs of memory errors on a dax device, famfs
> > currently blocks all subsequent accesses to data on that device. The
> > recovery is to re-initialize the memory and file system. Famfs is
> > memory, not storage...
> >
> > - Because famfs uses backing (devdax) devices, only privileged mounts are
> > supported (i.e. the fuse server requires CAP_SYS_RAWIO).
> >
> > - The famfs kernel code never accesses the memory directly - it only
> > facilitates read, write and mmap on behalf of user processes, using
> > fmap metadata provided by its privileged fuse server. As such, the
> > RAS of the shared memory affects applications, but not the kernel.
> >
> > - Famfs has backing device(s), but they are devdax (char) rather than
> > block. Right now there is no way to tell the vfs layer that famfs has a
> > char backing device (unless we say it's block, but it's not). Currently
> > we use the standard anonymous fuse fs_type - but I'm not sure that's
> > ultimately optimal (thoughts?)
> >
> > Changes v8 -> v9
> > - Kconfig: fs/fuse/Kconfig:CONFIG_FUSE_FAMFS_DAX now depends on the
> > new CONFIG_DEV_DAX_FSDEV (from drivers/dax/Kconfig) rather than
> > just CONFIG_DEV_DAX and CONFIG_FS_DAX. (CONFIG_FUSE_FAMFS_DAX
> > depends on those...)
> >
> > Changes v7 -> v8
> > - Moved to inline __free declaration in fuse_get_fmap() and
> > famfs_fuse_meta_alloc(), famfs_teardown()
> > - Adopted FIELD_PREP() macro rather than manual bitfield manipulation
> > - Minor doc edits
> > - I dropped adding magic numbers to include/uapi/linux/magic.h. That
> > can be done later if appropriate
> >
> > Changes v6 -> v7
> > - Fixed a regression in famfs_interleave_fileofs_to_daxofs() that
> > was reported by Intel's kernel test robot
> > - Added a check in __fsdev_dax_direct_access() for negative return
> > from pgoff_to_phys(), which would indicate an out-of-range offset
> > - Fixed a bug in __famfs_meta_free(), where not all interleaved
> > extents were freed
> > - Added chunksize alignment checks in famfs_fuse_meta_alloc() and
> > famfs_interleave_fileofs_to_daxofs() as interleaved chunks must
> > be PTE or PMD aligned
> > - Simplified famfs_file_init_dax() a bit
> > - Re-ran CM's kernel code review prompts on the entire series and
> > fixed several minor issues
> >
> > Changes v4 -> v5 -> v6
> > - None. Re-sending due to technical difficulties
> >
> > Changes v3 [9] -> v4
> > - The patch "dax: prevent driver unbind while filesystem holds device"
> > has been dropped. Dan Williams indicated that the favored behavior is
> > for a file system to stop working if an underlying driver is unbound,
> > rather than preventing the unbind.
> > - The patch "famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>" has
> > been dropped. Found a way for the famfs user space to do without the
> > -o opt (via getxattr).
> > - Squashed the fs/fuse/Kconfig patch into the first subsequent patch
> > that needed the change
> > ("famfs_fuse: Basic fuse kernel ABI enablement for famfs")
> > - Many review comments addressed.
> > - Addressed minor kerneldoc infractions reported by test robot.
> >
> > Changes v2 [7] -> v3
> > - Dax: Completely new fsdev driver (drivers/dax/fsdev.c) replaces the
> > dev_dax_iomap modifications to bus.c/device.c. Devdax devices can now
> > be switched among 'devdax', 'famfs' and 'system-ram' modes via daxctl
> > or sysfs.
> > - Dax: fsdev uses MEMORY_DEVICE_FS_DAX type and leaves folios at order-0
> > (no vmemmap_shift), allowing fs-dax to manage folio lifecycles
> > dynamically like pmem does.
> > - Dax: The "poisoned page" problem is properly fixed via
> > fsdev_clear_folio_state(), which clears stale mapping/compound state
> > when fsdev binds. The temporary WARN_ON_ONCE workaround in fs/dax.c
> > has been removed.
> > - Dax: Added dax_set_ops() so fsdev can set dax_operations at bind time
> > (and clear them on unbind), since the dax_device is created before we
> > know which driver will bind.
> > - Dax: Added custom bind/unbind sysfs handlers; unbind returns -EBUSY if a
> > filesystem holds the device, preventing unbind while famfs is mounted.
> > - Fuse: Famfs mounts now require that the fuse server/daemon has
> > CAP_SYS_RAWIO because they expose raw memory devices.
> > - Fuse: Added DAX address_space_operations with noop_dirty_folio since
> > famfs is memory-backed with no writeback required.
> > - Rebased to latest kernels, fully compatible with Alistair Popple
> > et al.'s recent dax refactoring.
> > - Ran this series through Chris Mason's code review AI prompts to check
> > for issues - several subtle problems found and fixed.
> > - Dropped RFC status - this version is intended to be mergeable.
> >
> > Changes v1 [8] -> v2:
> >
> > - The GET_FMAP message/response has been moved from LOOKUP to OPEN, as
> > was the pretty much unanimous consensus.
> > - Made the response payload to GET_FMAP variable sized (patch 12)
> > - Dodgy kerneldoc comments cleaned up or removed.
> > - Fixed memory leak of fc->shadow in patch 11 (thanks Joanne)
> > - Dropped many pr_debug and pr_notice calls
> >
> >
> > References
> >
> > [0] - https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/
> > [1] - https://famfs.org (famfs user space)
> > [2] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> > [3] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> > [4] - https://lwn.net/Articles/983105/ (lsfmm 2024)
> > [5] - https://lwn.net/Articles/1020170/ (lsfmm 2025)
> > [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> > [7] - https://lore.kernel.org/linux-fsdevel/20250703185032.46568-1-john@groves.net/ (famfs fuse v2)
> > [8] - https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/ (famfs fuse v1)
> > [9] - https://lore.kernel.org/linux-fsdevel/20260107153244.64703-1-john@groves.net/T/#mb2c868801be16eca82dab239a1d201628534aea7 (famfs fuse v3)
> >
> >
> > John Groves (10):
> > famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
> > famfs_fuse: Basic fuse kernel ABI enablement for famfs
> > famfs_fuse: Plumb the GET_FMAP message/response
> > famfs_fuse: Create files with famfs fmaps
> > famfs_fuse: GET_DAXDEV message and daxdev_table
> > famfs_fuse: Plumb dax iomap and fuse read/write/mmap
> > famfs_fuse: Add holder_operations for dax notify_failure()
> > famfs_fuse: Add DAX address_space_operations with noop_dirty_folio
> > famfs_fuse: Add famfs fmap metadata documentation
> > famfs_fuse: Add documentation
> >
> > Documentation/filesystems/famfs.rst | 142 ++++
> > Documentation/filesystems/index.rst | 1 +
> > MAINTAINERS | 10 +
> > fs/fuse/Kconfig | 13 +
> > fs/fuse/Makefile | 1 +
> > fs/fuse/dir.c | 2 +-
> > fs/fuse/famfs.c | 1180 +++++++++++++++++++++++++++
> > fs/fuse/famfs_kfmap.h | 167 ++++
> > fs/fuse/file.c | 45 +-
> > fs/fuse/fuse_i.h | 116 ++-
> > fs/fuse/inode.c | 35 +-
> > fs/fuse/iomode.c | 2 +-
> > fs/namei.c | 1 +
> > include/uapi/linux/fuse.h | 88 ++
> > 14 files changed, 1790 insertions(+), 13 deletions(-)
> > create mode 100644 Documentation/filesystems/famfs.rst
> > create mode 100644 fs/fuse/famfs.c
> > create mode 100644 fs/fuse/famfs_kfmap.h
> >
> >
> > base-commit: 2ae624d5a555d47a735fb3f4d850402859a4db77
> > --
> > 2.53.0
> >
>
> Hi John,
>
> I’m curious to hear your thoughts on whether you think it makes sense
> for the famfs-specific logic in this series to be moved to a bpf
> program that goes through a generic fuse iomap dax layer.
>
> Based on [1], this gives feature-parity with the famfs logic in this
> series. In my opinion, having famfs go through a generic fuse iomap
> dax layer makes the fuse kernel code more extensible for future
> servers that will also want to use dax iomap, and keeps the fuse code
> cleaner by not having famfs-specific logic hardcoded in and having to
> introduce new fuse uapis for something famfs-specific. In my
> understanding of it, fuse is meant to be generic and it feels like
> adding server-specific logic goes against that design philosophy and
> sets a precedent for other servers wanting similar special-casing in
> the future. I'd like to explore whether the bpf and generic fuse iomap
> dax layer approach can preserve that philosophy while still giving
> famfs the flexibility it needs.
>
> I think moving the famfs logic to bpf benefits famfs as well:
> - Instead of needing to issue a FUSE_GET_FMAP request after a file is
> opened, the server can directly populate the metadata map from
> userspace with the mapping info when it processes the FUSE_OPEN
> request, which gets rid of the roundtrip cost
> - The server can dynamically update the metadata / bpf maps during
> runtime from userspace if any mapping info needs to change
> - Future code changes / updates for famfs are all server-side and can
> be deployed immediately instead of needing to go through the upstream
> kernel mailing list process
> - Famfs updates / new releases can ship independently of kernel releases
>
> I'd appreciate the chance to discuss tradeoffs or if you'd rather
> discuss this at the fuse BoF at lsf, that sounds great too.
>
> Thanks,
> Joanne
>
Hi Joanne,
I'm definitely up for discussing it, and talking before LSFMM would be
good because then I'd have some time to think about it before we discuss
at LSFMM.
I have not had a chance to really study this, in part since I've never even
written a "hello world" BPF program.
I'll ping off-list about times to talk.
However...
I would object vehemently to this sort of re-write prior to going upstream,
as would users and vendors who need famfs now that the memory products it
enables have started to ship.
This work started over 3 years ago, initial patches over 2 years ago,
community decision that it should go into fuse 2 years ago, first fuse
patches a year ago.
This implementation is pretty much exactly in line with expectation-setting
starting two years ago. Famfs is a complicated orchestration between dax,
fuse, ndctl (for daxctl), libfuse and the extensive famfs user space. Famfs
has a fairly small kernel footprint, but its user space is much larger.
This could set it back a year if we re-write now.
Two things are true at once: I think this is a serious idea worth
considering, and I think it's too late to make this sort of change before
going upstream. Products need this enablement, and quite a long process has
run in order to make it available in a timely fashion (which means soon
now). I hope we can avoid making the perfect the enemy of the good.
I believe the risk of merging famfs soon is quite low, because famfs will
not affect anybody who doesn't use it. I hope we can run this discussion and
analysis in parallel with merging the current implementation of famfs soon.
Thank you,
John
* Re: [PATCH] Documentation: seq_file: drop 2.6 reference
From: Jonathan Corbet @ 2026-04-10 14:39 UTC (permalink / raw)
To: Wolfram Sang, linux-doc; +Cc: Wolfram Sang, Shuah Khan
In-Reply-To: <20260410143234.43610-2-wsa+renesas@sang-engineering.com>
Wolfram Sang <wsa+renesas@sang-engineering.com> writes:
> Even kernels after 2.6 have seq-file support.
>
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> ---
>
> See, somebody still reads it :)
>
> Documentation/filesystems/seq_file.rst | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/seq_file.rst b/Documentation/filesystems/seq_file.rst
> index 1e1713d00010..d753d8177bcb 100644
> --- a/Documentation/filesystems/seq_file.rst
> +++ b/Documentation/filesystems/seq_file.rst
> @@ -27,7 +27,7 @@ position within the virtual file - that position is, likely as not, in the
> middle of a line of output. The kernel has traditionally had a number of
> implementations that got this wrong.
>
> -The 2.6 kernel contains a set of functions (implemented by Alexander Viro)
> +The kernel now contains a set of functions (implemented by Alexander Viro)
> which are designed to make it easy for virtual file creators to get it
But but but ... it *is* correct as written... :)
(Applied, thanks).
jon
* [PATCH] Documentation: seq_file: drop 2.6 reference
From: Wolfram Sang @ 2026-04-10 14:31 UTC (permalink / raw)
To: linux-doc; +Cc: Wolfram Sang, Jonathan Corbet, Shuah Khan
Even kernels after 2.6 have seq-file support.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
---
See, somebody still reads it :)
Documentation/filesystems/seq_file.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/filesystems/seq_file.rst b/Documentation/filesystems/seq_file.rst
index 1e1713d00010..d753d8177bcb 100644
--- a/Documentation/filesystems/seq_file.rst
+++ b/Documentation/filesystems/seq_file.rst
@@ -27,7 +27,7 @@ position within the virtual file - that position is, likely as not, in the
middle of a line of output. The kernel has traditionally had a number of
implementations that got this wrong.
-The 2.6 kernel contains a set of functions (implemented by Alexander Viro)
+The kernel now contains a set of functions (implemented by Alexander Viro)
which are designed to make it easy for virtual file creators to get it
right.
--
2.51.0
* Re: [PATCH v12 04/25] drm/bridge: Act on the DRM color format property
From: Nicolas Frattaroli @ 2026-04-10 14:21 UTC (permalink / raw)
To: Dmitry Baryshkov
Cc: Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
Christian König, David Airlie, Simona Vetter,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
Andrzej Hajda, Neil Armstrong, Robert Foss, Laurent Pinchart,
Jonas Karlman, Jernej Skrabec, Sandy Huang, Heiko Stübner,
Andy Yan, Jani Nikula, Rodrigo Vivi, Joonas Lahtinen,
Tvrtko Ursulin, Dmitry Baryshkov, Sascha Hauer, Rob Herring,
Jonathan Corbet, Shuah Khan, kernel, amd-gfx, dri-devel,
linux-kernel, linux-arm-kernel, linux-rockchip, intel-gfx,
intel-xe, linux-doc
In-Reply-To: <edeq6wxmwzvovfoo6pvih6dybwszf3tahg7nvqkjrw3qxllbfd@jwejh2sgb5eh>
On Friday, 10 April 2026 00:08:24 Central European Summer Time Dmitry Baryshkov wrote:
> On Thu, Apr 09, 2026 at 05:44:54PM +0200, Nicolas Frattaroli wrote:
> > The new DRM color format property allows userspace to request a specific
> > color format on a connector. In turn, this fills the connector state's
> > color_format member to switch color formats.
> >
> > Make drm_bridges consider the color_format set in the connector state
> > during the atomic bridge check. For bridges that represent HDMI bridges,
> > rely on whatever format the HDMI logic set. Reject any output bus
> > formats that do not correspond to the requested color format.
> >
> > Non-HDMI last bridges with DRM_CONNECTOR_COLOR_FORMAT_AUTO set will end
> > up choosing the first output format that functions to make a whole
> > recursive bridge chain format selection succeed.
> >
> > Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
> > ---
> > drivers/gpu/drm/drm_bridge.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 88 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/drm_bridge.c b/drivers/gpu/drm/drm_bridge.c
> > index ba80bebb5685..7c1516864d96 100644
> > --- a/drivers/gpu/drm/drm_bridge.c
> > +++ b/drivers/gpu/drm/drm_bridge.c
> > @@ -1150,6 +1150,47 @@ static int select_bus_fmt_recursive(struct drm_bridge *first_bridge,
> > return ret;
> > }
> >
> > +static bool __pure bus_format_is_color_fmt(u32 bus_fmt, enum drm_connector_color_format fmt)
> > +{
> > + if (fmt == DRM_CONNECTOR_COLOR_FORMAT_AUTO)
> > + return true;
> > +
> > + switch (bus_fmt) {
> > + case MEDIA_BUS_FMT_FIXED:
> > + return true;
> > + case MEDIA_BUS_FMT_RGB888_1X24:
> > + case MEDIA_BUS_FMT_RGB101010_1X30:
> > + case MEDIA_BUS_FMT_RGB121212_1X36:
> > + case MEDIA_BUS_FMT_RGB161616_1X48:
> > + return fmt == DRM_CONNECTOR_COLOR_FORMAT_RGB444;
> > + case MEDIA_BUS_FMT_YUV8_1X24:
> > + case MEDIA_BUS_FMT_YUV10_1X30:
> > + case MEDIA_BUS_FMT_YUV12_1X36:
> > + case MEDIA_BUS_FMT_YUV16_1X48:
> > + return fmt == DRM_CONNECTOR_COLOR_FORMAT_YCBCR444;
> > + case MEDIA_BUS_FMT_UYVY8_1X16:
> > + case MEDIA_BUS_FMT_VYUY8_1X16:
> > + case MEDIA_BUS_FMT_YUYV8_1X16:
> > + case MEDIA_BUS_FMT_YVYU8_1X16:
> > + case MEDIA_BUS_FMT_UYVY10_1X20:
> > + case MEDIA_BUS_FMT_YUYV10_1X20:
> > + case MEDIA_BUS_FMT_VYUY10_1X20:
> > + case MEDIA_BUS_FMT_YVYU10_1X20:
> > + case MEDIA_BUS_FMT_UYVY12_1X24:
> > + case MEDIA_BUS_FMT_VYUY12_1X24:
> > + case MEDIA_BUS_FMT_YUYV12_1X24:
> > + case MEDIA_BUS_FMT_YVYU12_1X24:
> > + return fmt == DRM_CONNECTOR_COLOR_FORMAT_YCBCR422;
> > + case MEDIA_BUS_FMT_UYYVYY8_0_5X24:
> > + case MEDIA_BUS_FMT_UYYVYY10_0_5X30:
> > + case MEDIA_BUS_FMT_UYYVYY12_0_5X36:
> > + case MEDIA_BUS_FMT_UYYVYY16_0_5X48:
> > + return fmt == DRM_CONNECTOR_COLOR_FORMAT_YCBCR420;
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > /*
> > * This function is called by &drm_atomic_bridge_chain_check() just before
> > * calling &drm_bridge_funcs.atomic_check() on all elements of the chain.
> > @@ -1193,6 +1234,7 @@ drm_atomic_bridge_chain_select_bus_fmts(struct drm_bridge *bridge,
> > struct drm_encoder *encoder = bridge->encoder;
> > struct drm_bridge_state *last_bridge_state;
> > unsigned int i, num_out_bus_fmts = 0;
> > + enum drm_connector_color_format fmt;
> > u32 *out_bus_fmts;
> > int ret = 0;
> >
> > @@ -1234,13 +1276,58 @@ drm_atomic_bridge_chain_select_bus_fmts(struct drm_bridge *bridge,
> > out_bus_fmts[0] = MEDIA_BUS_FMT_FIXED;
> > }
> >
> > + /*
> > + * On HDMI connectors, use the output format chosen by whatever does the
> > + * HDMI logic. For everyone else, just trust that the bridge out_bus_fmts
> > + * are sorted by preference for %DRM_CONNECTOR_COLOR_FORMAT_AUTO, as
> > + * bus_format_is_color_fmt() always returns true for AUTO.
> > + */
> > + if (last_bridge->type == DRM_MODE_CONNECTOR_HDMIA) {
>
> I still think this is misplaced (and misidentified). Consider HDMI
> bridge being routed to the DVI-D connector. The last bridge would have
> different type, but the HDMI-specific logic must still be applied. The
> bridge must use RGB444, but it must be handled in a generic way.
Thanks for the review. I was hoping that an HDMI bridge chain going to
a DVI connector would be a DRM_MODE_CONNECTOR_HDMIB thing, but apparently
not. However, I also don't know how doing this in the drm bridge connector
helps us here. I guess we'd call into a drm_bridge_connector specific
function with the connector format and connector, and get an output
format in return, and said function checks that if any bridge in the
bridge connector is HDMI it uses the HDMI logic? That would conflict
with the following case from what I understand:
> Or other way around, a DVI bridge being routed through the HDMI
> connector (thinking about PandaBoard here). The combo should not go
> through the HDMI-specific color format selection although the last
> bridge in the chain is the HDMI-A bridge.
In that case, wouldn't the recursive bridge bus format selection
take care of this? I assume the DVI bridge will only allow RGB444,
and one of the bridges that follow it is an HDMI bridge for the
HDMI connector that's physically on the board.
In such a case, if the HDMI state helpers came up with something
other than RGB444, the select_bus_fmt_recursive below would fail
as expected, since it won't be able to find an output that
satisfies the constraints given by the DVI bridge.
> I think all these cases should be handled by the connector, which knows
> if there is an OP_HDMI bridge in the chain or not.
Kind regards,
Nicolas Frattaroli
> > + drm_dbg_kms(last_bridge->dev,
> > + "HDMI bridge requests format %s\n",
> > + drm_hdmi_connector_get_output_format_name(
> > + conn_state->hdmi.output_format));
> > + switch (conn_state->hdmi.output_format) {
> > + case DRM_OUTPUT_COLOR_FORMAT_RGB444:
> > + fmt = DRM_CONNECTOR_COLOR_FORMAT_RGB444;
> > + break;
> > + case DRM_OUTPUT_COLOR_FORMAT_YCBCR444:
> > + fmt = DRM_CONNECTOR_COLOR_FORMAT_YCBCR444;
> > + break;
> > + case DRM_OUTPUT_COLOR_FORMAT_YCBCR422:
> > + fmt = DRM_CONNECTOR_COLOR_FORMAT_YCBCR422;
> > + break;
> > + case DRM_OUTPUT_COLOR_FORMAT_YCBCR420:
> > + fmt = DRM_CONNECTOR_COLOR_FORMAT_YCBCR420;
> > + break;
> > + default:
> > + ret = -EINVAL;
> > + goto out_free_bus_fmts;
> > + }
> > + } else {
> > + fmt = conn_state->color_format;
> > + drm_dbg_kms(last_bridge->dev, "Non-HDMI bridge requests format %d\n", fmt);
> > + }
> > +
> > for (i = 0; i < num_out_bus_fmts; i++) {
> > + if (!bus_format_is_color_fmt(out_bus_fmts[i], fmt)) {
> > + drm_dbg_kms(last_bridge->dev,
> > + "Skipping bus format 0x%04x as it doesn't match format %d\n",
> > + out_bus_fmts[i], fmt);
> > + ret = -ENOTSUPP;
> > + continue;
> > + }
> > ret = select_bus_fmt_recursive(bridge, last_bridge, crtc_state,
> > conn_state, out_bus_fmts[i]);
> > - if (ret != -ENOTSUPP)
> > + if (ret != -ENOTSUPP) {
> > + drm_dbg_kms(last_bridge->dev,
> > + "Found bridge chain ending with bus format 0x%04x\n",
> > + out_bus_fmts[i]);
> > break;
> > + }
> > }
> >
> > +out_free_bus_fmts:
> > kfree(out_bus_fmts);
> >
> > return ret;
> >
>
>
* [RFC PATCH v5 06/11] Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom} files
From: SeongJae Park @ 2026-04-10 14:20 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260410142034.83798-1-sj@kernel.org>
Update DAMON usage document for the DAMOS action failed regions quota
charge ratio control sysfs files.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/admin-guide/mm/damon/usage.rst | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index bfdb717441f05..d5548e460857c 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -84,7 +84,9 @@ comma (",").
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
- │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes,goal_tuner
+ │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,
+ │ │ │ │ │ │ │ effective_bytes,goal_tuner,
+ │ │ │ │ │ │ │ fail_charge_num,fail_charge_denom
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path
@@ -381,9 +383,10 @@ schemes/<N>/quotas/
The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
DAMON-based operation scheme.
-Under ``quotas`` directory, five files (``ms``, ``bytes``,
-``reset_interval_ms``, ``effective_bytes`` and ``goal_tuner``) and two
-directories (``weights`` and ``goals``) exist.
+Under ``quotas`` directory, seven files (``ms``, ``bytes``,
+``reset_interval_ms``, ``effective_bytes``, ``goal_tuner``, ``fail_charge_num``
+and ``fail_charge_denom``) and two directories (``weights`` and ``goals``)
+exist.
You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
``reset interval`` in milliseconds by writing the values to the three files,
@@ -402,6 +405,13 @@ the background design of the feature and the name of the selectable algorithms.
Refer to :ref:`goals directory <sysfs_schemes_quota_goals>` for the goals
setup.
+You can set the action-failed memory quota charging ratio by writing the
+numerator and the denominator for the ratio to ``fail_charge_num`` and
+``fail_charge_denom`` files, respectively. Reading those files will return the
+current set values. Refer to :ref:`design
+<damon_design_damos_quotas_failed_memory_charging_ratio>` for more details of
+the ratio feature.
+
The time quota is internally transformed to a size quota. Between the
transformed size quota and user-specified size quota, smaller one is applied.
Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the
--
2.47.3
* [RFC PATCH v5 05/11] Docs/mm/damon/design: document fail_charge_{num,denom}
From: SeongJae Park @ 2026-04-10 14:20 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260410142034.83798-1-sj@kernel.org>
Update DAMON design document for the DAMOS action failed region quota
charge ratio.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/mm/damon/design.rst | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 622d24e35961e..fa7392b5a331d 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -576,6 +576,28 @@ interface <sysfs_interface>`, refer to :ref:`weights <sysfs_quotas>` part of
the documentation.
+.. _damon_design_damos_quotas_failed_memory_charging_ratio:
+
+Action-failed Memory Charging Ratio
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+DAMOS action to a given region can fail for some subsets of the memory of the
+region. For example, if the action is ``pageout`` and the region has some
+unreclaimable pages, applying the action to the pages will fail. The amount of
+system resource that is taken for such failed action applications is usually
+different from that for successful action applications. For such cases, users
+can set different charging ratio for such failed memory. The ratio can be
+specified using ``fail_charge_num`` and ``fail_charge_denom`` parameters. The
+two parameters represent the numerator and denominator of the ratio. The
+feature is enabled only if ``fail_charge_denom`` is not zero.
+
+For example, let's suppose a DAMOS action is applied to a region of 1,000 MiB
+size. The action is successfully applied to only 700 MiB of the region.
+``fail_charge_num`` and ``fail_charge_denom`` are set to ``1`` and ``1024``,
+respectively. Then only 700 MiB and 300 KiB of size (``700 MiB + 300 MiB * 1 /
+1024``) will be charged.
+
+
.. _damon_design_damos_quotas_auto_tuning:
Aim-oriented Feedback-driven Auto-tuning
--
2.47.3
* [RFC PATCH v5 00/11] mm/damon: introduce DAMOS failed region quota charge ratio
From: SeongJae Park @ 2026-04-10 14:20 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, Brendan Higgins,
David Gow, David Hildenbrand, Jonathan Corbet, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, damon, kunit-dev, linux-doc,
linux-kernel, linux-kselftest, linux-mm
TL; DR: Let users set different DAMOS quota charge ratios for DAMOS
action failed regions, for deterministic and consistent DAMOS action
progress.
Common Reports: Unexpectedly Slow DAMOS
=======================================
One common issue report that we get from DAMON users is that DAMOS
action applying progress speed is sometimes much slower than expected.
And one common root cause is that the DAMOS quota is exceeded by the
action applying failed memory regions.
For example, a group of users tried to run DAMOS-based proactive memory
reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota. They
ran it on a system having no active workload which means all memory of
the system is cold. The expectation was that the system will show 100
MiB per second reclamation until (nearly) all memory is reclaimed. But
what they found is that the speed is quite inconsistent: sometimes it
becomes much slower than expected, and sometimes there is even no
reclamation at all for tens of seconds. The upper limit of the speed (100 MiB
per second) was being kept as expected, though.
By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
stat, we found DAMOS quota is always exceeded when the speed is slow. By
monitoring sz_tried and sz_applied (the total amount of DAMOS action
tried memory and succeeded memory) DAMOS stats together, we found the
reclamation attempts nearly always failed when the speed is slow.
DAMOS quota charges DAMOS action tried regions regardless of whether the
try succeeded. In the reported case, there was unreclaimable memory
spread around the system memory. Sometimes nearly 100 MiB of the memory
that DAMOS tried to reclaim in the given quota interval was reclaimable,
and therefore it showed nearly 100 MiB per second speed. Sometimes
nearly 99 MiB of the memory that DAMOS tried to reclaim in the given
quota interval was unreclaimable, and therefore it showed only about
1 MiB per second reclaim speed.
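The effect described above can be illustrated with a toy model. The sketch
below is only an illustration, not DAMON code: applied_per_interval() and
its reclaimable_percent parameter are hypothetical names, and the model
assumes that every tried byte is charged in full and that reclaimable
memory is spread evenly within the tried regions.

```python
MIB = 1 << 20


def applied_per_interval(quota_bytes, reclaimable_percent):
    """Toy model: return (sz_tried, sz_applied) for one quota interval.

    Assumes every tried byte is charged to the quota, whether or not the
    action succeeded on it.
    """
    # The quota is used up once quota_bytes have been tried.
    sz_tried = quota_bytes
    # Only the reclaimable share of the tried memory is actually reclaimed.
    sz_applied = sz_tried * reclaimable_percent // 100
    return sz_tried, sz_applied


if __name__ == "__main__":
    for percent in (100, 50, 1):
        tried, applied = applied_per_interval(100 * MIB, percent)
        print(f"{percent:3d}% reclaimable: tried {tried // MIB} MiB, "
              f"reclaimed {applied // MIB} MiB")
```

With a 100 MiB quota and only 1% of the tried memory being reclaimable,
the model gives 1 MiB of reclamation per interval, which matches the slow
case described above.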
We explained that it is expected behavior of the feature rather than a
bug, as DAMOS quota sets only the upper limit of the speed. The
users agreed and later reported a huge win from the adoption of
DAMON_RECLAIM on their products.
It is Not a Bug but a Feature; But...
=====================================
So nothing is broken. DAMOS quota is working as intended, as the upper
limit of the speed. It also provides its behavior observability via
DAMOS stat. In real-world production environments that run long-term
active workloads and care about stability, the speed sometimes being
slow is not a real problem.
But, the non-deterministic behavior is sometimes annoying, especially in
lab environments. Even in a realistic production environment, when
there is a huge amount of memory that the DAMOS action cannot be applied
to, the speed could be problematically slow. For example, suppose a
virtual machine provider sets up 99% of the host memory as hugetlb pages
that cannot be reclaimed, to give them to virtual machines. Also, when
aim-oriented DAMOS auto-tuning is applied, this could confuse the
internal feedback loop.
The intention of the current behavior was that trying a DAMOS action on
regions would impose some overhead anyway, and therefore should be
charged somehow. But in the real world, the overhead of a failed action
is much lighter than that of a successful action. Charging those at the
same ratio may be unfair, or at least suboptimal in some environments.
DAMOS Action Failed Region Quota Charge Ratio
=============================================
Let users set the charge ratio for the action-failed memory, for more
optimal and deterministic use of DAMOS. It allows users to specify the
numerator and the denominator of the ratio for flexible setup. For
example, let's suppose the numerator and the denominator are set to 1
and 4,096, respectively. The ratio is 1 / 4,096. A DAMOS scheme action
is applied to 5 GiB of memory. For 1 GiB of the memory, the action
succeeds. For the rest (4 GiB), the action fails. Then, only 1 GiB and
1 MiB of quota is charged.
The optimal charge ratio will depend on the use case and
system/workload. I'd recommend starting by setting the numerator to 1
and the denominator to PAGE_SIZE, and tuning based on the results, because
many DAMOS actions are applied at page level.
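The charge arithmetic of the examples can be sketched in user space as
below. This is an illustration rather than the kernel implementation:
charged_bytes() is a hypothetical helper, the rounding-down integer
division is an assumption, and charging all tried memory when the
denominator is zero is an assumption based on the design document's
statement that the feature is enabled only if the denominator is not
zero.

```python
MIB = 1 << 20
GIB = 1 << 30


def charged_bytes(sz_tried, sz_applied, fail_charge_num, fail_charge_denom):
    """Return the quota charge for one action application (illustrative)."""
    if sz_applied > sz_tried:
        raise ValueError("sz_applied cannot exceed sz_tried")
    if fail_charge_denom == 0:
        # Assumption: feature disabled, so all tried memory is charged.
        return sz_tried
    sz_failed = sz_tried - sz_applied
    # Successful memory is charged in full; failed memory is charged at
    # the num/denom ratio (rounding down is an assumption of this sketch).
    return sz_applied + sz_failed * fail_charge_num // fail_charge_denom


if __name__ == "__main__":
    # 5 GiB tried, 1 GiB succeeded, ratio 1/4096
    charge = charged_bytes(5 * GIB, 1 * GIB, 1, 4096)
    print(charge == 1 * GIB + 1 * MIB)
```

The same helper reproduces the design document example: 1,000 MiB tried
with 700 MiB applied and a ratio of 1/1024 is charged 700 MiB plus
300 KiB. The value 4096 corresponds to PAGE_SIZE only on systems with
4 KiB pages.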
Tests
=====
I tested this feature in the steps below.
1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS
hard-quota. Auto-tune the DAMOS soft-quota under the hard-quota for
achieving 40% free memory of the system with 'temporal' tuner.
For step 1, I run a simple C program that is written by Gemini. It is
quite straightforward, so I'm not sharing the code here.
For step 2, I use dd command like below:
dd if=/dev/zero of=foo bs=1M count=$50_percent_of_system_memory
For step 3, I use the latest version of DAMON user-space tool (damo)
like below.
sudo damo start --damos_action pageout \
` # Do the pageout only up to 100 MiB per second ` \
--damos_quota_space 100M --damos_quota_interval 1s \
` # Auto-tune the quota below the hard quota aiming` \
` # 40% free memory of the node 0 ` \
` # (entire node of the test system)` \
--damos_quota_goal node_mem_free_bp 40% 0 \
` # use temporal tuner, which is easy to understand ` \
--damos_quota_goal_tuner temporal
As expected, the progress of the reclamation is not consistent, because
the quota is exceeded for the failed reclamation of the unreclaimable
memory.
I do this again, but with the failed region charge ratio feature. For
this, the above 'damo' command is used, after appending command line
option for setup of the charge ratio like below. Note that the option
was added to 'damo' after v3.1.9.
sudo ./damo start --damos_action pageout \
[...]
` # quota-charge only 1/4096 for pageout-failed regions ` \
--damos_quota_fail_charge_ratio 1 4096
The progress of the reclamation was nearly 100 MiB per second until the
goal was achieved, meeting the expectation.
Patches Sequence
================
The first two patches make preparatory changes. Patch 1 updates the fully
charged quota check to handle <min_region_sz remaining quota, which can
exist after this series is applied. Patch 2 merges regions that were
split out for quota back as soon as possible, since the split can happen
much more frequently under a corner case that this series makes
possible.
Patch 3 implements the feature and exposes it via DAMON core API. Patch
4 implements DAMON sysfs ABI for the feature. Three following patches
(5-7) document the feature and ABI on design, usage, and ABI documents,
respectively. Four patches for testing of the new feature follow.
Patch 8 implements a kunit test for the feature. Patches 9 and 10
extend DAMON selftest helpers for DAMON sysfs control and internal state
dumping for adding a new selftest for the feature. Patch 11 extends
existing DAMON sysfs interface selftest to test the new feature using
the extended helper scripts.
Changelog
=========
Changes from RFC v4
(https://lore.kernel.org/20260409142148.60652-1-sj@kernel.org)
- Fix quota-sliced region merge-back issues.
- Use damon_for_each_region() instead of damon_for_each_region_safe().
- Avoid merging back of sliced but scheme unapplied regions, to keep
the monitoring information.
Changes from RFC v3
(https://lore.kernel.org/20260407010536.83603-1-sj@kernel.org)
- Make damos_quota_is_full() safe from overflow and easier to read.
- Avoid quota-based region split making too many new regions.
Changes from RFC v2
(https://lore.kernel.org/20260405151232.102690-1-sj@kernel.org)
- Handle <min_region_sz remaining quota.
- Document zero denom behavior.
- Fix typos: s/selftets/selftests/
Changes from RFC v1
(https://lore.kernel.org/20260404163943.89278-1-sj@kernel.org)
- Avoid overflows in charge amount calculation.
- Fix/wordsmith documentation for grammar, typo, and wrong examples.
- Improve unit test for more consistent comparison source use.
SeongJae Park (11):
mm/damon/core: handle <min_region_sz remaining quota as empty
mm/damon/core: merge quota-sliced regions back
mm/damon/core: introduce failed region quota charge ratio
mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files
Docs/mm/damon/design: document fail_charge_{num,denom}
Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom}
files
Docs/ABI/damon: document fail_charge_{num,denom}
mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing
selftests/damon/_damon_sysfs: support failed region quota charge ratio
selftests/damon/drgn_dump_damon_status: support failed region quota
charge ratio
selftests/damon/sysfs.py: test failed region quota charge ratio
.../ABI/testing/sysfs-kernel-mm-damon | 12 +++
Documentation/admin-guide/mm/damon/usage.rst | 18 +++-
Documentation/mm/damon/design.rst | 22 +++++
include/linux/damon.h | 9 ++
mm/damon/core.c | 83 ++++++++++++++++---
mm/damon/sysfs-schemes.c | 54 ++++++++++++
mm/damon/tests/core-kunit.h | 6 ++
tools/testing/selftests/damon/_damon_sysfs.py | 21 ++++-
.../selftests/damon/drgn_dump_damon_status.py | 2 +
tools/testing/selftests/damon/sysfs.py | 6 ++
10 files changed, 216 insertions(+), 17 deletions(-)
base-commit: fe17d40616ec462138186edb32f3105b0c064674
--
2.47.3
* Re: [PATCH v2 2/3] mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-04-10 14:17 UTC (permalink / raw)
To: Miaohe Lin
Cc: linux-mm, linux-kernel, linux-doc, kernel-team, Naoya Horiguchi,
Andrew Morton, Jonathan Corbet, Shuah Khan
In-Reply-To: <59c133a7-74a7-4678-d907-add764bbd107@huawei.com>
On Tue, Apr 07, 2026 at 10:57:36AM +0800, Miaohe Lin wrote:
> On 2026/3/31 19:00, Breno Leitao wrote:
> > + if (sysctl_panic_on_unrecoverable_mf && result == MF_IGNORED &&
> > + (type == MF_MSG_KERNEL || type == MF_MSG_KERNEL_HIGH_ORDER ||
> > + type == MF_MSG_UNKNOWN))
> > + panic("Memory failure: %#lx: unrecoverable page", pfn);
>
> Will it be better to add a helper here?
Yes, a helper would make things easier to read and digest. Thanks for
the feedback. This is what I have in mind:
commit 36d5b3cbbe6d6abfe3296b7b21135a5f01e743eb
Author: Breno Leitao <leitao@debian.org>
Date: Mon Mar 23 08:00:29 2026 -0700
mm/memory-failure: add panic_on_unrecoverable_memory_failure sysctl
Add a sysctl that allows the system to panic when an unrecoverable
memory failure is detected. This covers kernel pages, high-order
kernel pages, and unknown page types that cannot be recovered.
Signed-off-by: Breno Leitao <leitao@debian.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6ff80e01b91a4..a29b6688fe2d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "panic_on_unrecoverable_memory_failure",
+ .data = &sysctl_panic_on_unrecoverable_mf,
+ .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
}
};
@@ -1281,6 +1292,16 @@ static void update_per_node_mf_stats(unsigned long pfn,
++mf_stats->total;
}
+static bool is_unrecoverable_memory_failure(enum mf_action_page_type type,
+ enum mf_result result)
+{
+ return sysctl_panic_on_unrecoverable_mf &&
+ result == MF_IGNORED &&
+ (type == MF_MSG_KERNEL ||
+ type == MF_MSG_KERNEL_HIGH_ORDER ||
+ type == MF_MSG_UNKNOWN);
+}
+
/*
* "Dirty/Clean" indication is not 100% accurate due to the possibility of
* setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1298,6 +1319,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
+ if (is_unrecoverable_memory_failure(type, result))
+ panic("Memory failure: %#lx: unrecoverable page", pfn);
+
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
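For illustration only, the decision made by the helper above can be modelled in userspace. This is a minimal sketch, not kernel code: the enums are simplified stand-ins that list only the values discussed in this thread, the plain int stands in for the sysctl, and mf_example() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel enums (assumption: subset only). */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
};

/* Models the sysctl knob; 0 (the default) means "never panic". */
static int sysctl_panic_on_unrecoverable_mf;

/* Same predicate as in the patch: knob set, page ignored, kernel-like type. */
static bool is_unrecoverable_memory_failure(enum mf_action_page_type type,
					    enum mf_result result)
{
	return sysctl_panic_on_unrecoverable_mf &&
	       result == MF_IGNORED &&
	       (type == MF_MSG_KERNEL ||
		type == MF_MSG_KERNEL_HIGH_ORDER ||
		type == MF_MSG_UNKNOWN);
}

/* Hypothetical usage: which combinations would reach panic(). */
void mf_example(void)
{
	sysctl_panic_on_unrecoverable_mf = 1;
	assert(is_unrecoverable_memory_failure(MF_MSG_KERNEL, MF_IGNORED));
	assert(!is_unrecoverable_memory_failure(MF_MSG_GET_HWPOISON, MF_IGNORED));

	/* With the knob cleared, nothing qualifies. */
	sysctl_panic_on_unrecoverable_mf = 0;
	assert(!is_unrecoverable_memory_failure(MF_MSG_KERNEL, MF_IGNORED));
}
```

The model also shows why patch 1/3 matters: a reserved page reported as MF_MSG_GET_HWPOISON would not match the predicate, while the same page reported as MF_MSG_KERNEL would.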
^ permalink raw reply related
* Re: [PATCH v2 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
From: Breno Leitao @ 2026-04-10 14:03 UTC (permalink / raw)
To: Miaohe Lin
Cc: linux-mm, linux-kernel, linux-doc, kernel-team, Naoya Horiguchi,
Andrew Morton, Jonathan Corbet, Shuah Khan
In-Reply-To: <e074d5de-f494-4995-2521-32de4f2bd34f@huawei.com>
On Tue, Apr 07, 2026 at 10:56:39AM +0800, Miaohe Lin wrote:
> On 2026/3/31 19:00, Breno Leitao wrote:
> > When get_hwpoison_page() returns a negative value, distinguish
> > reserved pages from other failure cases by reporting MF_MSG_KERNEL
> > instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
> > and should be classified accordingly for proper handling by the
> > panic_on_unrecoverable_memory_failure mechanism.
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> > mm/memory-failure.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index ee42d4361309..6ff80e01b91a 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2432,7 +2432,11 @@ int memory_failure(unsigned long pfn, int flags)
> > }
> > goto unlock_mutex;
> > } else if (res < 0) {
> > - res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
> > + if (PageReserved(p))
> > + res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>
> Is it safe or common to check page flags without holding extra refcnt?
Yes, this is safe. At this point the page has HWPoison set, preventing
reallocation.
PageReserved is an atomic flag test on struct page memory that's always
valid for online PFNs.
Reserved pages are inherently stable (kernel text, firmware, etc.) and
don't change status dynamically.
This follows the same pattern as the existing is_free_buddy_page(p)
check a few lines above, which also reads page state without an extra
refcount.
The result is only used to refine the classification, so even a theoretical
race would not be a serious problem.
^ permalink raw reply
* Re: [GIT PULL] Chinese-docs changes for v7.1
From: Jonathan Corbet @ 2026-04-10 13:54 UTC (permalink / raw)
To: Alex Shi, linux-doc, open list, Yanteng Si, Dongliang Mu
In-Reply-To: <b71a3647-b1b2-4c66-b357-38f10058123d@gmail.com>
Alex Shi <seakeel@gmail.com> writes:
> The following changes since commit 1eab6493f525910aa7bc383a2a27b68916e3c616:
>
> tracing: Documentation: Update histogram-design.rst for fn() handling
> (2026-04-09 08:46:39 -0600)
>
> are available in the Git repository at:
>
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/alexs/linux.git
> tags/Chinese-docs-7.1
>
> for you to fetch changes up to 78405e7f42fa9127325c65aec9289187f67ac5ce:
>
> docs/zh_CN: update rust/index.rst translation (2026-04-10 20:09:45 +0800)
>
> ----------------------------------------------------------------
> Chinese translation docs for 7.1
>
> This is the Chinese translation subtree for 7.1. It includes
> the following changes:
> - Add the rust docs translation
> - Fix an inconsistent statement in dev-tools/testing-overview
> - sync process/2.Process.rst with English version
>
> Above patches are tested by 'make htmldocs'
>
> Signed-off-by: Alex Shi <alexs@kernel.org>
>
> ----------------------------------------------------------------
> Ben Guo (4):
> docs/zh_CN: update rust/arch-support.rst translation
> docs/zh_CN: update rust/coding-guidelines.rst translation
> docs/zh_CN: update rust/quick-start.rst translation
> docs/zh_CN: update rust/index.rst translation
>
> LIU Haoyang (1):
> docs/zh_CN: fix an inconsistent statement in
> dev-tools/testing-overview
>
> Song Hongyi (1):
> docs/zh_CN: sync process/2.Process.rst with English version
>
> Documentation/translations/zh_CN/dev-tools/testing-overview.rst | 2 +-
> Documentation/translations/zh_CN/process/2.Process.rst | 56
> ++++---
> Documentation/translations/zh_CN/rust/arch-support.rst | 9 +-
> Documentation/translations/zh_CN/rust/coding-guidelines.rst | 262
> +++++++++++++++++++++++++++++++--
> Documentation/translations/zh_CN/rust/index.rst | 17 ---
> Documentation/translations/zh_CN/rust/quick-start.rst | 190
> ++++++++++++++++++------
> 6 files changed, 427 insertions(+), 109 deletions(-)
Pulled, thanks.
For future reference, this is a bit late; I would really rather get
these before -rc7 if at all possible.
jon
^ permalink raw reply
* Re: [PATCH] cpufreq: CPPC: add autonomous mode boot parameter support
From: Pierre Gondois @ 2026-04-10 13:47 UTC (permalink / raw)
To: Sumit Gupta
Cc: linux-tegra, linux-kernel, linux-doc, zhenglifeng1, treding,
viresh.kumar, jonathanh, vsethi, ionela.voinescu, ksitaraman,
sanjayc, zhanjie9, corbet, mochs, skhan, bbasu, rdunlap, linux-pm,
mario.limonciello, rafael
In-Reply-To: <b8debb30-67a5-4d2b-8c08-8fd287f7258e@nvidia.com>
Hello Sumit,
On 4/6/26 20:08, Sumit Gupta wrote:
> Hi Pierre,
>
> Thank you for the comments.
> Sorry for late reply as I was on vacation.
>
No worries
>
> On 24/03/26 23:48, Pierre Gondois wrote:
>>
>> Hello Sumit,
>>
>> On 3/17/26 16:10, Sumit Gupta wrote:
>>> Add kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable CPPC
>>> autonomous performance selection on all CPUs at system startup without
>>> requiring runtime sysfs manipulation. When autonomous mode is enabled,
>>> the hardware automatically adjusts CPU performance based on workload
>>> demands using Energy Performance Preference (EPP) hints.
>>>
>>> When auto_sel_mode=1:
>>> - Configure all CPUs for autonomous operation on first init
>>> - Set EPP to performance preference (0x0)
>>> - Use HW min/max when set; otherwise program from policy limits (caps)
>>> - Clamp desired_perf to bounds before enabling autonomous mode
>>> - Hardware controls frequency instead of the OS governor
>>>
>>> The boot parameter is applied only during first policy initialization.
>>> On hotplug, skip applying it so that the user's runtime sysfs
>>> configuration is preserved.
>>>
>>> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> (Documentation)
>>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>>> ---
>>> Part 1 [1] of this series was applied for 7.1 and present in next.
>>> Sending this patch as reworked version of 'patch 11' from [2] based
>>> on next.
>>>
>>> [1]
>>> https://lore.kernel.org/lkml/20260206142658.72583-1-sumitg@nvidia.com/
>>> [2]
>>> https://lore.kernel.org/lkml/20251223121307.711773-1-sumitg@nvidia.com/
>>> ---
>>> .../admin-guide/kernel-parameters.txt | 13 +++
>>> drivers/cpufreq/cppc_cpufreq.c | 84
>>> +++++++++++++++++--
>>> 2 files changed, 92 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>> b/Documentation/admin-guide/kernel-parameters.txt
>>> index fa6171b5fdd5..de4b4c89edfe 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -1060,6 +1060,19 @@ Kernel parameters
>>> policy to use. This governor must be
>>> registered in the
>>> kernel before the cpufreq driver probes.
>>>
>>> + cppc_cpufreq.auto_sel_mode=
>>> + [CPU_FREQ] Enable ACPI CPPC autonomous
>>> performance
>>> + selection. When enabled, hardware
>>> automatically adjusts
>>> + CPU frequency on all CPUs based on workload
>>> demands.
>>> + In Autonomous mode, Energy Performance
>>> Preference (EPP)
>>> + hints guide hardware toward performance (0x0)
>>> or energy
>>> + efficiency (0xff).
>>> + Requires ACPI CPPC autonomous selection
>>> register support.
>>> + Format: <bool>
>>> + Default: 0 (disabled)
>>> + 0: use cpufreq governors
>>> + 1: enable if supported by hardware
>>> +
>>> cpu_init_udelay=N
>>> [X86,EARLY] Delay for N microsec between
>>> assert and de-assert
>>> of APIC INIT to start processors. This delay
>>> occurs
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>> b/drivers/cpufreq/cppc_cpufreq.c
>>> index 5dfb109cf1f4..49c148b2a0a4 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -28,6 +28,9 @@
>>>
>>> static struct cpufreq_driver cppc_cpufreq_driver;
>>>
>>> +/* Autonomous Selection boot parameter */
>>> +static bool auto_sel_mode;
>>> +
>>> #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
>>> static enum {
>>> FIE_UNSET = -1,
>>> @@ -708,11 +711,74 @@ static int cppc_cpufreq_cpu_init(struct
>>> cpufreq_policy *policy)
>>> policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
>>> cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
>>>
>>> - ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> - if (ret) {
>>> - pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
>>> - caps->highest_perf, cpu, ret);
>>> - goto out;
>>> + /*
>>> + * Enable autonomous mode on first init if boot param is set.
>>> + * Check last_governor to detect first init and skip if auto_sel
>>> + * is already enabled.
>>> + */
>> If the goal is to set autosel only once at the driver init,
>> shouldn't this be done in cppc_cpufreq_init() ?
>> I understand that cpu_data doesn't exist yet in
>> cppc_cpufreq_init(), but this seems more appropriate to do
>> it there IMO.
>>
>> This means the cpudata should be updated accordingly
>> in this cppc_cpufreq_cpu_init() function.
>
> In an earlier version [1], the setup was in cppc_cpufreq_init() but
> was moved to cppc_cpufreq_cpu_init() to improve per-CPU error handling.
> Keeping the setup in cppc_cpufreq_init() helps to avoid the last_governor
> check. We can warn for a CPU failing to enable and continue so other
> CPUs keep autonomous mode.
> cppc_cpufreq_cpu_init() would then just check the auto_sel state
> from register and sync policy limits from min/max_perf registers when
> autonomous mode is active.
> Please let me know your thoughts.
FWIU, the auto_sel_mode module parameter configures the default
auto_sel_mode when the driver is first loaded, so there should be
no need to check it again whenever cppc_cpufreq_cpu_init() is called.
Maybe Ionela saw something we didn't see?
Also, just to be sure: should it still be possible to change
auto_sel_mode through sysfs if the driver was
loaded with auto_sel_mode=1?
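To make the "set it once at driver load, warn and continue" idea concrete, here is a rough userspace sketch. It is only a model under stated assumptions: stub_set_auto_sel() is a hypothetical stand-in for cppc_set_auto_sel(), NR_CPUS and the failing CPUs are made up, and the array stands in for cpu_data->perf_ctrls.auto_sel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4	/* made-up CPU count for the model */

/* Stands in for the per-CPU auto_sel state (assumption). */
static bool auto_sel_enabled[NR_CPUS];

/*
 * Hypothetical stub for cppc_set_auto_sel(): CPU 2 fails with -EIO,
 * CPU 3 has no autonomous selection register (-EOPNOTSUPP).
 */
static int stub_set_auto_sel(unsigned int cpu, bool enable)
{
	assert(cpu < NR_CPUS);
	if (cpu == 2)
		return -EIO;
	if (cpu == 3)
		return -EOPNOTSUPP;
	auto_sel_enabled[cpu] = enable;
	return 0;
}

/*
 * One-time setup at driver load: a failing CPU only produces a warning,
 * so the remaining CPUs still get autonomous mode.
 * Returns the number of CPUs that were enabled.
 */
static unsigned int enable_auto_sel_all(void)
{
	unsigned int cpu, enabled = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		int ret = stub_set_auto_sel(cpu, true);

		if (ret == -EOPNOTSUPP)
			continue;	/* optional register: skip quietly */
		if (ret) {
			fprintf(stderr,
				"Failed autonomous config for CPU%u (%d)\n",
				cpu, ret);
			continue;
		}
		enabled++;
	}
	return enabled;
}

/* Hypothetical usage: two of the four CPUs end up in autonomous mode. */
void auto_sel_example(void)
{
	assert(enable_auto_sel_all() == 2);
	assert(auto_sel_enabled[0] && auto_sel_enabled[1]);
	assert(!auto_sel_enabled[2] && !auto_sel_enabled[3]);
}
```

With this shape, cppc_cpufreq_cpu_init() would only need to read back the per-CPU state, and the last_governor check would no longer be required to detect the first init.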
>
> [1]
> https://lore.kernel.org/lkml/5593d364-ca37-41c5-b33f-f7e245d6d626@nvidia.com/
>
>
>>
>>> + if (auto_sel_mode && policy->last_governor[0] == '\0' &&
>>> + !cpu_data->perf_ctrls.auto_sel) {
>>> + /* Enable CPPC - optional register, some platforms
>>> need it */
>> The documentation of the CPPC Enable Register is subject to
>> interpretation, but IIUC the field should be set to use the CPPC
>> controls, so I assume this should be set in cppc_cpufreq_init()
>> instead ?
>
> Agree that the CPPC Enable is about using the CPPC control path
> in general and not only for autonomous selection.
> Will move cppc_set_enable() into cppc_cpufreq_init() or outside the
> autonomous mode block in cppc_cpufreq_cpu_init() as per conclusion
> of previous comment.
>
>>> + ret = cppc_set_enable(cpu, true);
>>> + if (ret && ret != -EOPNOTSUPP)
>>> + pr_warn("Failed to enable CPPC for CPU%d
>>> (%d)\n", cpu, ret);
>>> +
>>> + /*
>>> + * Prefer HW min/max_perf when set; otherwise program
>>> from
>>> + * policy limits derived earlier from caps.
>>> + * Clamp desired_perf to bounds and sync policy->cur.
>>> + */
>>> + if (!cpu_data->perf_ctrls.min_perf ||
>>> !cpu_data->perf_ctrls.max_perf)
>>
>> The function doesn't seem to exist.
>
> It is newly added in [2].
> Don't need to call it if we move the setup to cppc_cpufreq_init().
Ah ok right thanks.
>
> [2]
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=ea3db45ae476889a1ba0ab3617e6afdeeefbda3d
>
>
>
>>
>>> + cppc_cpufreq_update_perf_limits(cpu_data, policy);
>>> +
>>> + cpu_data->perf_ctrls.desired_perf =
>>> + clamp_t(u32, cpu_data->perf_ctrls.desired_perf,
>>> + cpu_data->perf_ctrls.min_perf,
>>> + cpu_data->perf_ctrls.max_perf);
>>> +
>>> + policy->cur = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.desired_perf);
>>> +
>>
>> Maybe this should also be done in cppc_cpufreq_init()
>> if the auto_sel_mode parameter is set ?
>
> Yes.
>
>>
>>> + /* EPP is optional - some platforms may not support it */
>>> + ret = cppc_set_epp(cpu, CPPC_EPP_PERFORMANCE_PREF);
>>> + if (ret && ret != -EOPNOTSUPP)
>>> + pr_warn("Failed to set EPP for CPU%d (%d)\n",
>>> cpu, ret);
>>> + else if (!ret)
>>> + cpu_data->perf_ctrls.energy_perf =
>>> CPPC_EPP_PERFORMANCE_PREF;
>>> +
>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> + if (ret) {
>>> + pr_debug("Err setting perf for autonomous mode
>>> CPU:%d ret:%d\n",
>>> + cpu, ret);
>>> + goto out;
>>> + }
>>> +
>>> + ret = cppc_set_auto_sel(cpu, true);
>>> + if (ret && ret != -EOPNOTSUPP) {
>>> + pr_warn("Failed autonomous config for CPU%d
>>> (%d)\n",
>>> + cpu, ret);
>>> + goto out;
>>> + }
>>> + if (!ret)
>>> + cpu_data->perf_ctrls.auto_sel = true;
>>> + }
>>> +
>>> + if (cpu_data->perf_ctrls.auto_sel) {
>>
>> There is a patchset ongoing which tries to remove
>> setting policy->min/max from driver initialization.
>> Indeed, these values are only temporarily valid,
>> until the governor override them.
>> It is not sure yet the patch will be accepted though.
>>
>> https://lore.kernel.org/lkml/20260317101753.2284763-4-pierre.gondois@arm.com/
>>
>
>
> You are right that policy->min/max from .init() are temporary today
> as cpufreq_set_policy() overwrites them before the governor starts.
>
> On my test platform (highest == nominal, lowest_nonlinear == lowest),
> this had no visible effect because the BIOS bounds and cpuinfo range
> end up identical. But on platforms where they differ, the governor
> would widen the range to full cpuinfo limits.
>
> I think your patch [3] fixes this by giving these the right semantic as
> initial QoS requests. With it, cpufreq_set_policy() preserves the policy
> limits set from min/max_perf registers in .init(), which can either be
> BIOS values on first boot or last user configured values before hotplug.
>
> I will update the comment in v2 to reflect QoS seeding intent.
>
> I see that the first two patches of your series [3] is applied for 7.1.
> Do you plan to send the pending patch (3/4) from [3]?
>
I need to ping Viresh to check if this is still relevant.
> [3]
> https://lore.kernel.org/lkml/20260317101753.2284763-4-pierre.gondois@arm.com/
>
>
>>
>>
>>> + /* Sync policy limits from HW when autonomous mode is
>>> active */
>>> + policy->min = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.min_perf ?:
>>> + caps->lowest_nonlinear_perf);
>>> + policy->max = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.max_perf ?:
>>> + caps->nominal_perf);
>>> + } else {
>>> + /* Normal mode: governors control frequency */
>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> + if (ret) {
>>> + pr_debug("Err setting perf value:%d on CPU:%d.
>>> ret:%d\n",
>>> + caps->highest_perf, cpu, ret);
>>> + goto out;
>>> + }
>>> }
>>>
>>> cppc_cpufreq_cpu_fie_init(policy);
>>> @@ -1038,10 +1104,18 @@ static int __init cppc_cpufreq_init(void)
>>>
>>> static void __exit cppc_cpufreq_exit(void)
>>> {
>>> + unsigned int cpu;
>>> +
>>> + for_each_present_cpu(cpu)
>>> + cppc_set_auto_sel(cpu, false);
>>
>> If the firmware has a default EPP value, it means that loading
>> and the unloading the driver will reset this default EPP value.
>> Maybe the initial EPP value and/or the auto_sel value should be
>> cached somewhere and restored on exit ?
>> I don't know if this is actually an issue, this is just to signal it.
>
> The auto_sel_mode boot path programs EPP to performance preference(0),
> not the firmware’s previous value. On unload we only call
> cppc_set_auto_sel(false); we do not restore EPP, min/max perf,
> or other CPPC fields to firmware defaults.
Yes right, so loading/unloading the driver might change the
default EPP value.
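If restoring turns out to be desirable, one option is to cache the firmware values at load time and write them back on exit. The sketch below is a userspace model only: struct fake_cppc_regs, the initial register values, driver_load() and driver_unload() are all hypothetical, and EPP_PERFORMANCE_PREF mirrors the 0x0 value mentioned in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2			/* made-up CPU count for the model */
#define EPP_PERFORMANCE_PREF 0x00	/* performance preference, as in the patch */

/* Hypothetical model of the two CPPC fields discussed here. */
struct fake_cppc_regs {
	unsigned char epp;
	bool auto_sel;
};

/* Made-up firmware defaults: each CPU starts with its own EPP/auto_sel. */
static struct fake_cppc_regs regs[NR_CPUS] = {
	{ .epp = 0x80, .auto_sel = false },
	{ .epp = 0x40, .auto_sel = true },
};

/* Cache of the values found at load time. */
static struct fake_cppc_regs saved[NR_CPUS];

/* Load: remember the firmware values, then apply the boot parameter. */
static void driver_load(void)
{
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		saved[cpu] = regs[cpu];
		regs[cpu].epp = EPP_PERFORMANCE_PREF;
		regs[cpu].auto_sel = true;
	}
}

/* Unload: write back what was cached instead of only clearing auto_sel. */
static void driver_unload(void)
{
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		regs[cpu] = saved[cpu];
}

/* Hypothetical usage: a load/unload cycle leaves CPU 0 unchanged. */
void epp_example(void)
{
	driver_load();
	assert(regs[0].epp == EPP_PERFORMANCE_PREF && regs[0].auto_sel);

	driver_unload();
	assert(regs[0].epp == 0x80 && !regs[0].auto_sel);
}
```

Note that CPU 1 in the model had auto_sel set by firmware, so an exit path that unconditionally clears auto_sel would also change its state; restoring the cached value avoids that.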
>
> Thank you,
> Sumit Gupta
>
> ....
>
>
^ permalink raw reply