* Re: [PATCH v4 0/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-06-16 12:52 UTC
To: rafael, viresh.kumar, pierre.gondois, ionela.voinescu,
zhenglifeng1, zhanjie9, corbet, skhan, rdunlap, mario.limonciello,
linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu, sumitg
In-Reply-To: <20260527202550.206828-1-sumitg@nvidia.com>
On 28/05/26 01:55, Sumit Gupta wrote:
> This series adds a kernel boot parameter 'cppc_cpufreq.auto_sel_mode'
> to enable CPPC autonomous performance selection on all CPUs at system
> startup, avoiding per-CPU sysfs scripting at every boot.
>
> When autonomous mode is enabled, the hardware automatically adjusts
> CPU performance based on workload demands using Energy Performance
> Preference (EPP) hints.
>
> Patch 1: Sets CPPC Enable Register for both OS-driven and autonomous
> CPPC control modes. It can be applied independently of patch 2.
>
> Patch 2: Adds the auto_sel_mode boot parameter with three modes:
> - performance (or 1): override EPP to performance (0x0)
> - balance_performance (or 2): override EPP to balance_performance (0x80)
> - default_epp (or 3): preserve EPP value programmed by
> BIOS/firmware
>
> Patch 2 depends on Pierre's series [4] ("cpufreq: Set policy->min and
> max as real QoS constraints") so that policy->min/max set during
> cppc_cpufreq_cpu_init() are not overridden by cpufreq_set_policy().
>
> v3[3] -> v4:
> - Add 'balance_performance' mode which sets EPP to 0x80.
> - Add CPPC_EPP_BALANCE_PERFORMANCE_PREF (0x80) constant in cppc_acpi.h.
> - Clean up EPP mode selection with switch + boolean flag in cpu_init.
> - Use local variable for kp->arg in auto_sel_mode_set/get to avoid
> repeated casts.
>
> Sumit Gupta (2):
> cpufreq: CPPC: Set CPPC Enable register in cpu_init
> cpufreq: CPPC: add autonomous mode boot parameter support
>
> .../admin-guide/kernel-parameters.txt | 20 +++
> drivers/cpufreq/cppc_cpufreq.c | 154 +++++++++++++++++-
> include/acpi/cppc_acpi.h | 1 +
> 3 files changed, 170 insertions(+), 5 deletions(-)
>
> [1] v1: https://lore.kernel.org/lkml/20260317151053.2361475-1-sumitg@nvidia.com/
> [2] v2: https://lore.kernel.org/lkml/20260424201814.230071-1-sumitg@nvidia.com/
> [3] v3: https://lore.kernel.org/lkml/20260515122624.1920637-1-sumitg@nvidia.com/
> [4] https://lore.kernel.org/lkml/20260511135538.522653-1-pierre.gondois@arm.com/
>
Gentle ping on this series.
The dependency it was waiting on, the "cpufreq: Set policy->min and
max as real QoS constraints" series, is now in linux-pm (linux-next).
I rebased on top and verified autonomous mode works as expected, and
it applies cleanly on the current linux-next.
The [1] reference in patch 2/2 points to v2 of that series; the merged
version is v3 [2].
If there are no further comments, please consider acking and queuing
this for the next cycle.
Thanks for your time.
[1]
https://lore.kernel.org/lkml/20260511135538.522653-1-pierre.gondois@arm.com/
[2]
https://lore.kernel.org/lkml/20260528090913.2759118-1-pierre.gondois@arm.com/
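As a rough illustration of the three modes listed in the quoted cover
letter (this is not the code from patch 2/2), the name/number parsing and
the EPP override could be sketched as below. The enum and helper names are
hypothetical; only the mode names, their numeric aliases and the EPP values
0x0 and 0x80 come from the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* EPP values quoted in the cover letter. */
#define EPP_PERFORMANCE			0x00
#define EPP_BALANCE_PERFORMANCE		0x80

/* Hypothetical names; the driver itself may use different ones. */
enum auto_sel_mode {
	AUTO_SEL_PERFORMANCE = 1,
	AUTO_SEL_BALANCE_PERFORMANCE = 2,
	AUTO_SEL_DEFAULT_EPP = 3,
};

/* Accept a mode by name or by number; return -1 for anything else. */
static int auto_sel_mode_parse(const char *arg)
{
	assert(arg);
	if (!strcmp(arg, "performance") || !strcmp(arg, "1"))
		return AUTO_SEL_PERFORMANCE;
	if (!strcmp(arg, "balance_performance") || !strcmp(arg, "2"))
		return AUTO_SEL_BALANCE_PERFORMANCE;
	if (!strcmp(arg, "default_epp") || !strcmp(arg, "3"))
		return AUTO_SEL_DEFAULT_EPP;
	return -1;
}

/*
 * Return true and store the override in *epp when the mode overrides EPP.
 * default_epp returns false: the firmware-programmed value is preserved.
 */
static bool auto_sel_mode_epp(int mode, unsigned int *epp)
{
	assert(epp);
	switch (mode) {
	case AUTO_SEL_PERFORMANCE:
		*epp = EPP_PERFORMANCE;
		return true;
	case AUTO_SEL_BALANCE_PERFORMANCE:
		*epp = EPP_BALANCE_PERFORMANCE;
		return true;
	default:
		return false;
	}
}
```

Keeping "does this mode override EPP" separate from the parsed mode mirrors
the "switch + boolean flag" cleanup mentioned in the v4 changelog, though
the actual driver code may be structured differently.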
Regards,
Sumit
* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-16 12:36 UTC
To: Jianyue Wu
Cc: Andrew Morton, Christoph Hellwig, Chris Li, Baoquan He, Nhat Pham,
Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <20260614-zram-swap-ops-block-register-v1-0-6c1a6639c222@gmail.com>
I fear this is going entirely in the wrong direction.
Yes, we have to keep zram around as a legacy interface for now,
but the right place to deal with compressed swap is in the core.
So please don't add more hacks for 'magic' block devices.
* [PATCH v4 4/4] KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
From: Amit Machhiwal @ 2026-06-16 12:33 UTC
To: linuxppc-dev, Madhavan Srinivasan
Cc: Vaibhav Jain, Amit Machhiwal, Anushree Mathur, Paolo Bonzini,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Jonathan Corbet, Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-1-amachhiw@linux.ibm.com>
Add documentation for the KVM_PPC_GET_COMPAT_CAPS ioctl to the KVM API
documentation.
The ioctl exposes host processor compatibility modes supported for
nested KVM guests on PowerPC systems. The documentation includes
comprehensive error code descriptions, structure field definitions
including the size field for forward compatibility, and KVM-specific
capability bit constants.
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
Documentation/virt/kvm/api.rst | 47 ++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
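For readers of the documentation, a minimal userspace-side sketch of how
the structure might be prepared and the returned bitmap decoded. The
constants are copied from patch 2/4 and the structure layout from patch
1/4; the helper names (compat_caps_init, compat_caps_describe) and the
local struct name are hypothetical, and the ioctl call itself is left out
so the sketch does not depend on headers that only exist with this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copied from the uapi header changes in patch 2/4. */
#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)

/* Local mirror of struct kvm_ppc_compat_caps from patch 1/4. */
struct compat_caps {
	uint64_t flags;
	uint64_t size;
	uint64_t compat_capabilities;
};

/* Hypothetical helper: fill in the fields userspace owns before the ioctl. */
static void compat_caps_init(struct compat_caps *caps)
{
	assert(caps);
	memset(caps, 0, sizeof(*caps));
	caps->size = sizeof(*caps);	/* smaller sizes are rejected with EINVAL */
}

/*
 * Hypothetical helper: write a space-separated list of supported modes
 * into buf and return how many modes were written.
 */
static int compat_caps_describe(uint64_t bits, char *buf, size_t len)
{
	static const struct {
		uint64_t bit;
		const char *name;
	} modes[] = {
		{ KVM_PPC_COMPAT_CAP_POWER9,  "power9"  },
		{ KVM_PPC_COMPAT_CAP_POWER10, "power10" },
		{ KVM_PPC_COMPAT_CAP_POWER11, "power11" },
	};
	size_t used = 0;
	size_t i;
	int count = 0;

	assert(buf && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		int n;

		if (!(bits & modes[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     count ? " " : "", modes[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

In a real caller, compat_caps_init() would be followed by the
KVM_PPC_GET_COMPAT_CAPS ioctl on the VM file descriptor, and
compat_caps_describe() would be applied to the returned
compat_capabilities field.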
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 52bbbb553ce1..ba6feba74d7d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6553,6 +6553,53 @@ KVM_S390_KEYOP_SSKE
Sets the storage key for the guest address ``guest_addr`` to the key
specified in ``key``, returning the previous value in ``key``.
+4.145 KVM_PPC_GET_COMPAT_CAPS
+-----------------------------
+:Capability: KVM_CAP_PPC_COMPAT_CAPS
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_compat_caps (out)
+:Returns: 0 on success, negative value on failure
+
+Errors include:
+
+ ======== ============================================================
+ EFAULT if ``struct kvm_ppc_compat_caps`` cannot be read from or
+ written to userspace
+ EINVAL if the ``size`` field is smaller than the current structure
+ size, or if the backend implementation fails to retrieve or
+ map CPU compatibility capabilities
+ ENOTTY if the backend does not implement the ``get_compat_caps``
+ operation (e.g., on non-pseries platforms or when the
+ required KVM operations are not available)
+ ======== ============================================================
+
+IBM POWER system server-based processors provide a compatibility mode feature
+where an Nth generation processor can operate in modes consistent with earlier
+generations such as (N-1) and (N-2).
+
+This ioctl provides userspace with information about the CPU compatibility modes
+supported by the current host processor for booting nested KVM guests on
+PowerNV (KVM nested APIv1) and PowerVM (KVM nested APIv2) platforms.
+
+::
+
+ struct kvm_ppc_compat_caps {
+ __u64 flags; /* Reserved for future use */
+ __u64 size; /* Size of this structure */
+ __u64 compat_capabilities; /* Capabilities supported by the host */
+ };
+
+The ``compat_capabilities`` bit field describes the processor compatibility
+modes supported by the host. The following bits, numbered in IBM order (bit 0
+is the most significant bit), indicate support for specific processor modes.
+
+::
+
+ KVM_PPC_COMPAT_CAP_POWER9 (bit 1, 1ULL << 62): KVM guests can run in Power9 processor mode
+ KVM_PPC_COMPAT_CAP_POWER10 (bit 2, 1ULL << 61): KVM guests can run in Power10 processor mode
+ KVM_PPC_COMPAT_CAP_POWER11 (bit 3, 1ULL << 60): KVM guests can run in Power11 processor mode
+
.. _kvm_run:
5. The kvm_run structure
--
2.50.1 (Apple Git-155)
* [PATCH v4 3/4] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Amit Machhiwal @ 2026-06-16 12:33 UTC
To: linuxppc-dev, Madhavan Srinivasan
Cc: Vaibhav Jain, Amit Machhiwal, Anushree Mathur, Paolo Bonzini,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Jonathan Corbet, Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-1-amachhiw@linux.ibm.com>
Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
hypervisor (L0), the guest runs with the expected processor
compatibility level. However, when booting a nested KVM guest (L2)
inside the L1, QEMU derives the CPU model from the raw host PVR and
attempts to run the nested guest at that level, instead of honoring the
compatibility mode of the L1.
Extend host CPU compatibility capability reporting to support nested
virtualization on PowerNV systems (PAPR nested API v1).
For nested API v2 (PowerVM), compatibility capabilities are obtained
from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
information is not available on PowerNV systems.
For nested API v1, derive the compatibility capabilities from the L1
guest by reading the "cpu-version" property from the device tree, which
reflects the effective (logical) processor compatibility level. Map this
value to the corresponding compatibility capability bitmap using
KVM-specific constants.
Introduce a helper to translate CPU version values into KVM_PPC_COMPAT_CAP
bits and integrate it into kvmppc_get_compat_caps(). The implementation
applies masking to ensure only supported processor modes are exposed.
This allows userspace to query host CPU compatibility modes on both
PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
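The cpu-version handling described above can be sketched in plain C as
follows. This is a standalone illustration, not the kernel code: the
logical PVR values are assumed to match arch/powerpc/include/asm/reg.h,
the capability bits are copied from patch 2/4, and be32_decode() stands in
for the kernel's be32_to_cpup().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Logical PVR values; assumed to match arch/powerpc/include/asm/reg.h. */
#define PVR_ARCH_300		0x0f000005u
#define PVR_ARCH_31		0x0f000006u
#define PVR_ARCH_31_P11		0x0f000007u

/* Copied from the uapi header changes in patch 2/4. */
#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)

/* Decode one big-endian device-tree cell regardless of host endianness. */
static uint32_t be32_decode(const unsigned char cell[4])
{
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8) | (uint32_t)cell[3];
}

/*
 * Map a CPU-endian cpu-version value to a capability bit.
 * Unknown values are rejected instead of being silently ignored.
 */
static int map_compat_capabilities(uint32_t cpu_version, uint64_t *caps)
{
	assert(caps);
	switch (cpu_version) {
	case PVR_ARCH_31_P11:
		*caps |= KVM_PPC_COMPAT_CAP_POWER11;
		break;
	case PVR_ARCH_31:
		*caps |= KVM_PPC_COMPAT_CAP_POWER10;
		break;
	case PVR_ARCH_300:
		*caps |= KVM_PPC_COMPAT_CAP_POWER9;
		break;
	default:
		return -EINVAL;
	}
	return 0;
}
```

The property bytes are converted to CPU endianness first and only then
compared against the PVR_ARCH_* constants, which is why the mapping helper
takes a plain 32-bit value.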
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f674386df62c..375e7a7fa9f8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6523,15 +6523,50 @@ static bool kvmppc_hash_v3_possible(void)
return true;
}
+static int kvmppc_map_compat_capabilities(u32 cpu_version,
+ unsigned long *capabilities)
+{
+ switch (cpu_version) {
+ case PVR_ARCH_31_P11:
+ *capabilities |= KVM_PPC_COMPAT_CAP_POWER11;
+ break;
+ case PVR_ARCH_31:
+ *capabilities |= KVM_PPC_COMPAT_CAP_POWER10;
+ break;
+ case PVR_ARCH_300:
+ *capabilities |= KVM_PPC_COMPAT_CAP_POWER9;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
{
+ struct device_node *np;
unsigned long capabilities = 0;
+ const __be32 *prop = NULL;
long rc = -EINVAL;
+ u32 cpu_version;
if (kvmhv_on_pseries()) {
- if (kvmhv_is_nestedv2())
+ if (kvmhv_is_nestedv2()) {
rc = plpar_guest_get_capabilities(0, &capabilities);
+ } else {
+ for_each_node_by_type(np, "cpu") {
+ prop = of_get_property(np, "cpu-version", NULL);
+ if (prop) {
+ cpu_version = be32_to_cpup(prop);
+ break;
+ }
+ }
+ if (!prop)
+ return -EINVAL;
+ rc = kvmppc_map_compat_capabilities(cpu_version,
+ &capabilities);
+ }
host_caps->compat_capabilities = capabilities &
KVM_PPC_COMPAT_BITMASK;
}
--
2.50.1 (Apple Git-155)
* [PATCH v4 2/4] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Amit Machhiwal @ 2026-06-16 12:33 UTC
To: linuxppc-dev, Madhavan Srinivasan
Cc: Vaibhav Jain, Amit Machhiwal, Anushree Mathur, Paolo Bonzini,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Jonathan Corbet, Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-1-amachhiw@linux.ibm.com>
On POWER systems, the host CPU may run in a compatibility mode (e.g., a
Power11 processor operating in Power10 compatibility mode). In such
cases, the effective CPU level exposed to guests differs from the
physical processor generation.
When running nested KVM guests, QEMU derives the host CPU type using
mfpvr(), which reflects the physical processor version. This can result
in a mismatch between the CPU model selected by QEMU and the
compatibility mode enforced by the host, leading to guest boot failures.
For example, booting a nested guest on a Power11 LPAR configured in
Power10 compatibility mode fails with:
KVM-NESTEDv2: couldn't set guest wide elements
[..KVM reg dump..]
This occurs because QEMU selects a CPU model corresponding to the
physical processor (via mfpvr()), while the host operates in a lower
compatibility mode. As a result, KVM rejects the requested compatibility
level during guest initialization.
Add support for retrieving host CPU compatibility capabilities for
nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
hcall, which reflects the processor modes negotiated between the Power
hypervisor (L0) and the host partition (L1).
On pseries systems, obtain the capability bitmap using
plpar_guest_get_capabilities() and return it via struct
kvm_ppc_compat_caps. The implementation defines KVM-specific capability
constants (KVM_PPC_COMPAT_CAP_POWER9/10/11) and applies masking to ensure
only supported processor modes are exposed to userspace. This information
is then exposed through the KVM_PPC_GET_COMPAT_CAPS ioctl.
Hook the implementation into the Book3S HV kvmppc_ops so that it can be
invoked by the generic KVM ioctl handling code.
Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/include/uapi/asm/kvm.h | 11 ++++++++++-
arch/powerpc/kvm/book3s_hv.c | 17 +++++++++++++++++
2 files changed, 27 insertions(+), 1 deletion(-)
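To make the mismatch described above concrete, here is a hedged sketch of
how a userspace consumer could pick a guest processor generation from the
returned bitmap instead of relying on the physical PVR alone. The
selection policy and the name pick_guest_generation() are hypothetical and
are not taken from the QEMU series; the constants are the ones defined in
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the uapi header changes in this patch. */
#define KVM_PPC_COMPAT_CAP_POWER9	(1ULL << 62)
#define KVM_PPC_COMPAT_CAP_POWER10	(1ULL << 61)
#define KVM_PPC_COMPAT_CAP_POWER11	(1ULL << 60)
#define KVM_PPC_COMPAT_BITMASK		(KVM_PPC_COMPAT_CAP_POWER9 | \
					 KVM_PPC_COMPAT_CAP_POWER10 | \
					 KVM_PPC_COMPAT_CAP_POWER11)

/*
 * Hypothetical policy: return the newest generation that the host reports
 * as supported and that does not exceed the physical generation, or 0 if
 * no such mode exists.
 */
static int pick_guest_generation(uint64_t host_caps, int physical_gen)
{
	static const struct {
		int gen;
		uint64_t bit;
	} modes[] = {
		{ 11, KVM_PPC_COMPAT_CAP_POWER11 },
		{ 10, KVM_PPC_COMPAT_CAP_POWER10 },
		{  9, KVM_PPC_COMPAT_CAP_POWER9 },
	};
	size_t i;

	assert(physical_gen > 0);
	host_caps &= KVM_PPC_COMPAT_BITMASK;	/* ignore unknown bits */
	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		if (modes[i].gen <= physical_gen && (host_caps & modes[i].bit))
			return modes[i].gen;
	}
	return 0;
}
```

For the failing case in the commit message (a Power11 LPAR in Power10
compatibility mode), a bitmap without the Power11 bit would make such a
policy fall back to Power10 rather than requesting the physical level.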
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 8a38be6c3b03..730488681443 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -443,7 +443,16 @@ struct kvm_ppc_compat_caps {
__u64 size; /* Size of this structure */
__u64 compat_capabilities; /* Capabilities supported by the host */
};
-
+/*
+ * Capability bits for compat_capabilities field in kvm_ppc_compat_caps.
+ * These bits indicate which processor compatibility modes are supported.
+ */
+#define KVM_PPC_COMPAT_CAP_POWER9 (1ULL << 62)
+#define KVM_PPC_COMPAT_CAP_POWER10 (1ULL << 61)
+#define KVM_PPC_COMPAT_CAP_POWER11 (1ULL << 60)
+#define KVM_PPC_COMPAT_BITMASK (KVM_PPC_COMPAT_CAP_POWER9 | \
+ KVM_PPC_COMPAT_CAP_POWER10 | \
+ KVM_PPC_COMPAT_CAP_POWER11)
/*
* Values for character and character_mask.
* These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f9380ef65750..f674386df62c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6523,6 +6523,22 @@ static bool kvmppc_hash_v3_possible(void)
return true;
}
+
+static int kvmppc_get_compat_caps(struct kvm_ppc_compat_caps *host_caps)
+{
+ unsigned long capabilities = 0;
+ long rc = -EINVAL;
+
+ if (kvmhv_on_pseries()) {
+ if (kvmhv_is_nestedv2())
+ rc = plpar_guest_get_capabilities(0, &capabilities);
+ host_caps->compat_capabilities = capabilities &
+ KVM_PPC_COMPAT_BITMASK;
+ }
+
+ return rc;
+}
+
static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -6565,6 +6581,7 @@ static struct kvmppc_ops kvm_ops_hv = {
.hash_v3_possible = kvmppc_hash_v3_possible,
.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
+ .get_compat_caps = kvmppc_get_compat_caps,
};
static int kvm_init_subcore_bitmap(void)
--
2.50.1 (Apple Git-155)
* [PATCH v4 1/4] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Amit Machhiwal @ 2026-06-16 12:33 UTC
To: linuxppc-dev, Madhavan Srinivasan
Cc: Vaibhav Jain, Amit Machhiwal, Anushree Mathur, Paolo Bonzini,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Jonathan Corbet, Shuah Khan, kvm, linux-kernel, linux-doc, lkp
In-Reply-To: <20260616123314.82721-1-amachhiw@linux.ibm.com>
Introduce a new capability and ioctl to expose CPU compatibility modes
supported by the host processor for nested guests.
On IBM POWER systems, newer processor generations (N) can operate in
compatibility modes corresponding to earlier generations, like (N-1) and
(N-2). This is particularly relevant for nested virtualization, where
nested KVM guests may need to run with a specific processor compatibility
level.
Introduce the KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap with one bit
per compatibility mode supported by the host, allowing userspace (e.g.,
QEMU) to select an appropriate compatibility level when configuring
nested KVM guests.
The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
CPU compatibility capabilities via a PowerPC-specific backend
implementation when available. The implementation validates the structure
size from userspace to ensure forward compatibility and returns
appropriate error codes (EINVAL for invalid size, EFAULT for copy
failures, ENOTTY if backend is not implemented). The struct
kvm_ppc_compat_caps includes a size field to support future ABI
extensions.
Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/include/uapi/asm/kvm.h | 7 ++++++
arch/powerpc/kvm/powerpc.c | 35 +++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 4 ++++
4 files changed, 47 insertions(+)
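The order of checks described above (EFAULT, then EINVAL, then ENOTTY) can
be modelled outside the kernel as below. This is a simplified sketch under
stated assumptions: a NULL pointer stands in for a failed user copy, the
backend is passed as a function pointer instead of going through
kvmppc_ops, and the names handle_get_compat_caps() and fake_backend() are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_ppc_compat_caps from this patch. */
struct compat_caps {
	uint64_t flags;
	uint64_t size;
	uint64_t compat_capabilities;
};

static_assert(sizeof(struct compat_caps) == 24, "unexpected padding");

typedef int (*compat_backend_t)(struct compat_caps *host_caps);

/* Model of the ioctl path; 'user' plays the role of the userspace buffer. */
static int handle_get_compat_caps(struct compat_caps *user,
				  compat_backend_t backend)
{
	struct compat_caps host;
	int r;

	if (!user)			/* stands in for a failed copy_from_user() */
		return -EFAULT;
	if (user->size < sizeof(host))	/* caller's struct is too small */
		return -EINVAL;
	if (!backend)			/* no get_compat_caps implementation */
		return -ENOTTY;

	memset(&host, 0, sizeof(host));
	r = backend(&host);
	host.size = sizeof(host);	/* report the size actually returned */
	if (!r)
		*user = host;		/* stands in for copy_to_user() */
	return r;
}

/* Example backend for illustration: reports a single capability bit. */
static int fake_backend(struct compat_caps *host_caps)
{
	host_caps->compat_capabilities = 1ULL << 61;
	return 0;
}
```

Because only sizes smaller than the current structure are rejected, a
caller built against a larger, future structure would still pass the check
and receive the size the kernel actually filled in.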
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0953f2daa466..169ea6a7fbad 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -319,6 +319,7 @@ struct kvmppc_ops {
bool (*hash_v3_possible)(void);
int (*create_vm_debugfs)(struct kvm *kvm);
int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
+ int (*get_compat_caps)(struct kvm_ppc_compat_caps *host_caps);
};
extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 077c5437f521..8a38be6c3b03 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -437,6 +437,13 @@ struct kvm_ppc_cpu_char {
__u64 behaviour_mask; /* valid bits in behaviour */
};
+/* For KVM_PPC_GET_COMPAT_CAPS */
+struct kvm_ppc_compat_caps {
+ __u64 flags; /* Reserved for future use */
+ __u64 size; /* Size of this structure */
+ __u64 compat_capabilities; /* Capabilities supported by the host */
+};
+
/*
* Values for character and character_mask.
* These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 98de68379b18..9153b0034b45 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -701,6 +701,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
}
}
break;
+#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
+ case KVM_CAP_PPC_COMPAT_CAPS:
+ r = 0;
+ if (kvmhv_on_pseries())
+ r = 1;
+ break;
+#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
default:
r = 0;
break;
@@ -2467,6 +2474,34 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
r = kvm->arch.kvm_ops->svm_off(kvm);
break;
}
+ case KVM_PPC_GET_COMPAT_CAPS: {
+ struct kvm_ppc_compat_caps host_caps;
+ u64 user_size;
+
+ r = -EFAULT;
+ /* First, get the size field from userspace to validate */
+ if (copy_from_user(&user_size, &((struct kvm_ppc_compat_caps
+ __user *)argp)->size, sizeof(user_size))) {
+ goto out;
+ }
+
+ /* Validate size - must be at least the current structure size */
+ r = -EINVAL;
+ if (user_size < sizeof(host_caps))
+ goto out;
+
+ r = -ENOTTY;
+ memset(&host_caps, 0, sizeof(host_caps));
+ if (!kvm->arch.kvm_ops->get_compat_caps)
+ goto out;
+
+ r = kvm->arch.kvm_ops->get_compat_caps(&host_caps);
+ /* Set the actual size of the structure we're returning */
+ host_caps.size = sizeof(host_caps);
+ if (!r && copy_to_user(argp, &host_caps, sizeof(host_caps)))
+ r = -EFAULT;
+ break;
+ }
default: {
struct kvm *kvm = filp->private_data;
r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c8afa2047bf..1788a0068662 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -996,6 +996,7 @@ struct kvm_enable_cap {
#define KVM_CAP_S390_USER_OPEREXEC 246
#define KVM_CAP_S390_KEYOP 247
#define KVM_CAP_S390_VSIE_ESAMODE 248
+#define KVM_CAP_PPC_COMPAT_CAPS 249
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)
+/* Available with KVM_CAP_PPC_COMPAT_CAPS */
+#define KVM_PPC_GET_COMPAT_CAPS _IOR(KVMIO, 0xe4, struct kvm_ppc_compat_caps)
+
/*
* ioctls for vcpu fds
*/
--
2.50.1 (Apple Git-155)
* [PATCH v4 0/4] KVM: PPC: Expose CPU compatibility modes for nested guests
From: Amit Machhiwal @ 2026-06-16 12:33 UTC
To: linuxppc-dev, Madhavan Srinivasan
Cc: Vaibhav Jain, Amit Machhiwal, Anushree Mathur, Paolo Bonzini,
Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
Jonathan Corbet, Shuah Khan, kvm, linux-kernel, linux-doc, lkp
On POWER systems, newer processor generations can operate in compatibility
modes corresponding to earlier generations (e.g., a Power11 system running
in Power10 compatibility mode). In such cases, the effective CPU level
exposed to guests differs from the physical processor generation.
This creates a problem for nested virtualization. When booting a nested KVM
guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
and attempt to configure the nested guest accordingly. However, the L1
partition is constrained by the compatibility level negotiated with the
hypervisor (L0), and requests exceeding that level are rejected, leading to
guest boot failures such as:
KVM-NESTEDv2: couldn't set guest wide elements
This series provides a mechanism for userspace to query the effective CPU
compatibility modes supported by the host, so it can select an appropriate
CPU model for nested guests.
To achieve this, the series introduces a new KVM capability and ioctl
(KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
compatibility modes supported by the host.
Why a new UAPI?
---------------
While cpu-version is available in /proc/device-tree/cpus/<cpu#>/cpu-version
both on an L1 booted on PowerNV and on PowerVM LPARs, the UAPI approach is
preferable for several reasons:
1. pHYP (L0) capabilities: On PowerVM, KVM needs to rely on the capabilities
negotiated with pHYP, not just on device tree properties. The
cpu-version property describes the current compat mode but does not
indicate which compat modes are supported for the nested guest.
2. procfs dependency: Not all systems run with procfs enabled (CONFIG_PROC_FS
is optional). Minimal configurations like buildroot might disable it, but
the KVM ioctl works regardless since it accesses kernel data structures
directly.
3. Kernel validation: The kernel validates and normalizes the compatibility
information, ensuring userspace gets validated, consistent data.
4. Abstraction & stability: /proc/device-tree is an implementation detail.
The UAPI provides a stable interface that won't break if the underlying
mechanism changes.
5. Semantic clarity: KVM_PPC_GET_COMPAT_CAPS clearly expresses what
compatibility modes can be used for KVM guests, vs. parsing device tree
which requires understanding the semantic meaning of cpu-version.
The implementation supports both:
- PowerVM (nested API v2), where compatibility information is obtained
via the H_GUEST_GET_CAPABILITIES hypercall.
- PowerNV (nested API v1), where compatibility is derived from the device
tree ("cpu-version") representing the effective processor compatibility
level.
This allows userspace (e.g., QEMU) to select a CPU model consistent with
the host compatibility mode, avoiding mismatches and enabling successful
nested guest boot.
Changes in v4:
- Added 'size' field to struct kvm_ppc_compat_caps for forward
compatibility and ABI extensibility
- Implemented size validation in ioctl handler to ensure correct structure
size from userspace
- Introduced KVM-specific capability constants (KVM_PPC_COMPAT_CAP_POWER9/
10/11) instead of exposing hypervisor-internal H_GUEST_CAP_* constants
- Added capability masking using KVM_PPC_COMPAT_BITMASK to ensure only
supported processor modes are exposed
- Enhanced error handling with comprehensive error codes (EINVAL, EFAULT,
ENOTTY) and detailed documentation
- Removed Tested-by tags pending re-testing with v4 changes
- Separated validation patch (patch 1 from v3) and sent independently [1]
Note: This series is built on top of patches [1] and [2] which must be
applied first. Patch [1] ensures arch_compat is validated against the host
compatibility mode before this series adds the capability query mechanism.
Patch [2] sets CPU_FTR_P11_PVR for Power11 and later processors, which is
needed for proper CPU feature detection in dt-cpu-ftrs environments.
Changes in v3:
- Added "Why a new UAPI?" section to cover letter addressing questions
about the need for a new UAPI vs. using existing mechanisms like
/proc/device-tree
- Fixed initialization of 'r' in KVM_PPC_GET_COMPAT_CAPS ioctl handler
from 0 to -ENOTTY for proper error handling when the operation is not
supported
- Added Vaibhav's "Suggested-by" tags
- Retained Anushree's "Tested-by" tags as there were no major code changes
- Fixed documentation build warning reported by kernel test robot and
added "Reported-by" and "Closes" tags to patch 5
Changes in v2:
- Squashed patches 2 and 3 from v1 (capability introduction and ioctl
wiring) into a single patch for better logical grouping
- Changed kvm_ppc_compat_caps.flags from __u32 to __u64 for consistency
and future extensibility
- Addressed other review comments
- Improved commit messages with clearer explanations of the changes
Patch summary:
[1/4] Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
[2/4] Implement capability retrieval for PowerVM (API v2)
[3/4] Add PowerNV support (API v1)
[4/4] Document the new ioctl
Testing (with QEMU v3 patches and on top of patches [1] and [2]):
KVM APIv1 Testing
=================
On P10 PowerNV machine (L0)
---------------------------
- P10 L1 KVM guest -> works
- P10 nested L2 KVM guest -> works
- P9 compat nested L2 KVM guest -> works
- P9 compat L1 KVM guest -> works
- P9 nested L2 KVM guest -> works
On Powernv11 TCG Guest (L0)
---------------------------
- P11 L1 KVM guest -> works
- P11 L2 KVM guest -> works
- P10 compat L1 KVM guest -> works
- P10 L2 KVM guest -> works
- P9 compat L1 KVM guest -> works
- P9 L2 KVM guest -> works
KVM APIv2 Testing
=================
On P11 PowerVM LPAR (L1)
------------------------
- P11 L2 KVM guest -> works
- P10 compat L2 KVM guest -> works
On P11 LPAR in P10 compat (L1)
------------------------------
- P10 (host compat) L2 KVM guest -> works
On P10 PowerVM LPAR (L1)
------------------------
- P10 L2 KVM guest -> works
With this series, nested guests boot successfully in configurations where
they previously failed due to compatibility mismatches.
Related QEMU series:
--------------------
A corresponding QEMU v3 series adds support for querying and using these
compatibility capabilities when configuring nested KVM guests:
v3: https://lore.kernel.org/all/20260616113915.25589-1-amachhiw@linux.ibm.com/
v2: https://lore.kernel.org/all/20260502140021.69712-1-amachhiw@linux.ibm.com/
v1: https://lore.kernel.org/all/20260430061333.37905-1-amachhiw@linux.ibm.com/
Previous versions:
------------------
v3: https://lore.kernel.org/linuxppc-dev/20260522152744.55251-1-amachhiw@linux.ibm.com/
v2: https://lore.kernel.org/linuxppc-dev/20260513100755.83215-1-amachhiw@linux.ibm.com/
v1: https://lore.kernel.org/linuxppc-dev/20260430054906.94431-1-amachhiw@linux.ibm.com/
References:
-----------
[1] https://lore.kernel.org/all/20260609053327.61563-1-amachhiw@linux.ibm.com/
[2] https://lore.kernel.org/all/20260614173437.26352-1-amachhiw@linux.ibm.com/
Amit Machhiwal (4):
KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM
on PowerVM
KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM
on PowerNV
KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
Documentation/virt/kvm/api.rst | 47 ++++++++++++++++++++++++++
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/include/uapi/asm/kvm.h | 16 +++++++++
arch/powerpc/kvm/book3s_hv.c | 52 +++++++++++++++++++++++++++++
arch/powerpc/kvm/powerpc.c | 35 +++++++++++++++++++
include/uapi/linux/kvm.h | 4 +++
6 files changed, 155 insertions(+)
base-commit: 8b308f96484e37d92d2fc6b72b091f60496c000e
prerequisite-patch-id: ce9521668d549f2bd731d321a38f720b789e0b4e
prerequisite-patch-id: 4662f01d2101cfae8502f04290658deed60eec26
--
2.50.1 (Apple Git-155)
* Re: [PATCH] docs: pt_BR: update netdevsim section in maintainer-netdev.rst
From: Daniel Pereira @ 2026-06-16 12:24 UTC
To: Amanda Corrêa; +Cc: corbet, skhan, linux-doc, linux-kernel
In-Reply-To: <20260616005234.11036-1-amandacorreasilvax@gmail.com>
On Mon, Jun 15, 2026 at 21:52, Amanda Corrêa
<amandacorreasilvax@gmail.com> wrote:
>
> Update the Brazilian Portuguese translation of maintainer-netdev.rst
> to align with the latest English version.
>
> Key changes include:
> - Updated the netdevsim section to reflect upstream changes
> - Added guidance on netdevsim-based API testing
> - Fixed minor spacing and formatting issues
>
> Signed-off-by: Amanda Corrêa <amandacorreasilvax@gmail.com>
> ---
Hi Amanda,
Thanks for the adjustments, the patch makes sense.
Acked-by: Daniel Pereira <danielmaraboo@gmail.com>
* [PATCH v2] hwmon: coretemp: Fix documentation wording
From: Ximing Zhang @ 2026-06-16 12:15 UTC
To: Guenter Roeck, Jonathan Corbet
Cc: Shuah Khan, linux-hwmon, linux-doc, linux-kernel, Ximing Zhang
Fix two minor wording issues in the coretemp documentation.
Signed-off-by: Ximing Zhang <xzhangjr@gmail.com>
---
Changes in v2:
- Capitalized "The" in "The following table" as requested.
Documentation/hwmon/coretemp.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/hwmon/coretemp.rst b/Documentation/hwmon/coretemp.rst
index f63b21f24d42..349301683381 100644
--- a/Documentation/hwmon/coretemp.rst
+++ b/Documentation/hwmon/coretemp.rst
@@ -44,9 +44,9 @@ Temperature known as TjMax is the maximum junction temperature of processor,
which depends on the CPU model. See table below. At this temperature, protection
mechanism will perform actions to forcibly cool down the processor. Alarm
may be raised, if the temperature grows enough (more than TjMax) to trigger
-the Out-Of-Spec bit. Following table summarizes the exported sysfs files:
+the Out-Of-Spec bit. The following table summarizes the exported sysfs files:
-All Sysfs entries are named with their core_id (represented here by 'X').
+All sysfs entries are named with their core_id (represented here by 'X').
================= ========================================================
tempX_input Core temperature (in millidegrees Celsius).
--
2.54.0
* [PATCH] hwmon: coretemp: Fix documentation wording
From: Ximing Zhang @ 2026-06-16 12:06 UTC
To: Guenter Roeck, Jonathan Corbet
Cc: Shuah Khan, linux-hwmon, linux-doc, linux-kernel, Ximing Zhang
Fix two minor wording issues in the coretemp documentation.
Signed-off-by: Ximing Zhang <xzhangjr@gmail.com>
---
Documentation/hwmon/coretemp.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/hwmon/coretemp.rst b/Documentation/hwmon/coretemp.rst
index f63b21f24d42..5ce125b0be2e 100644
--- a/Documentation/hwmon/coretemp.rst
+++ b/Documentation/hwmon/coretemp.rst
@@ -44,9 +44,9 @@ Temperature known as TjMax is the maximum junction temperature of processor,
which depends on the CPU model. See table below. At this temperature, protection
mechanism will perform actions to forcibly cool down the processor. Alarm
may be raised, if the temperature grows enough (more than TjMax) to trigger
-the Out-Of-Spec bit. Following table summarizes the exported sysfs files:
+the Out-Of-Spec bit. The Following table summarizes the exported sysfs files:
-All Sysfs entries are named with their core_id (represented here by 'X').
+All sysfs entries are named with their core_id (represented here by 'X').
================= ========================================================
tempX_input Core temperature (in millidegrees Celsius).
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v13 0/4] kunit: Add support for suppressing warning backtraces
From: David Gow @ 2026-06-16 12:06 UTC (permalink / raw)
To: Albert Esteve, Arnd Bergmann, Brendan Higgins, Rae Moar,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Jonathan Corbet, Shuah Khan, Andrew Morton,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-kernel, linux-arch, linux-kselftest, kunit-dev, dri-devel,
workflows, linux-riscv, linux-doc, peterz, Alessandro Carminati,
Guenter Roeck, Kees Cook, Linux Kernel Functional Testing,
Maíra Canal, Dan Carpenter, Simona Vetter
In-Reply-To: <CADSE00+typ3Zi5Vf0Z276+e0G7PuS7mt7oA9h5awBu3YgYKw0g@mail.gmail.com>
On 16/06/2026 at 7:44 PM, Albert Esteve wrote:
> On Fri, May 15, 2026 at 2:29 PM Albert Esteve <aesteve@redhat.com> wrote:
>>
>> Some unit tests intentionally trigger warning backtraces by passing bad
>> parameters to kernel API functions. Such unit tests typically check the
>> return value from such calls, not the existence of the warning backtrace.
>>
>> Such intentionally generated warning backtraces are neither desirable
>> nor useful for a number of reasons:
>> - They can result in overlooked real problems.
>> - A warning that suddenly starts to show up in unit tests needs to be
>> investigated and has to be marked to be ignored, for example by
>> adjusting filter scripts. Such filters are ad hoc because there is
>> no real standard format for warnings. On top of that, such filter
>> scripts would require constant maintenance.
>>
>> One option to address the problem would be to add messages such as
>> "expected warning backtraces start/end here" to the kernel log.
>> However, that would again require filter scripts, might result in
>> missing real problematic warning backtraces triggered while the test
>> is running, and the irrelevant backtrace(s) would still clog the
>> kernel log.
>>
>> Solve the problem by providing a means to suppress warning backtraces
>> originating from the current kthread while executing test code.
>> Since each KUnit test runs in its own kthread, this effectively scopes
>> suppression to the test that enabled it, without requiring any
>> architecture-specific code.
>>
>> Overview:
>> Patch#1 Introduces the suppression infrastructure integrated into
>> KUnit's hook mechanism.
>> Patch#2 Adds selftests to validate the functionality.
>> Patch#3 Demonstrates real-world usage in the DRM subsystem.
>> Patch#4 Documents the new API and usage guidelines.
>>
>> Design Notes:
>> Suppression is integrated into the existing KUnit hooks infrastructure,
>> reusing the kunit_running static branch for zero overhead
>> when no tests are running. The implementation lives entirely in the
>> kunit module; only a static-inline wrapper and a function pointer
>> slot are added to built-in code.
>>
>> Suppression is checked at three points in the warning path:
>> - In `warn_slowpath_fmt()` (kernel/panic.c), for architectures without
>> __WARN_FLAGS. The check runs before any output, fully suppressing
>> both message and backtrace.
>> - In `__warn_printk()` (kernel/panic.c), for architectures that define
>> __WARN_FLAGS but not their own __WARN_printf (arm64, loongarch,
>> parisc, powerpc, riscv, sh). The check suppresses the warning message
>> text that is printed before the trap enters __report_bug().
>> - In `__report_bug()` (lib/bug.c), for architectures that define
>> __WARN_FLAGS. The check runs before `__warn()` is called, suppressing
>> the backtrace and stack dump.
>>
>> To avoid double-counting on architectures where both `__warn_printk()`
>> and `__report_bug()` run for the same warning, the hook takes a bool
>> parameter: true to increment the suppression counter, false to suppress
>> without counting.
>>
>> The suppression state is dynamically allocated via kunit_kzalloc() and
>> tied to the KUnit test lifecycle via `kunit_add_action()`, ensuring
>> automatic cleanup at test exit. Writer-side access to the global
>> suppression list is serialized with a spinlock; readers use RCU.
>>
>> Two API forms are provided:
>> - kunit_warning_suppress(test) { ... }: scoped blocks with automatic
>> cleanup. The suppression handle is not accessible outside the block,
>> so warning counts (if needed) must be checked inside. Multiple
>> sequential suppression blocks are allowed.
>> - kunit_start/end_suppress_warning(test): direct functions that return
>> an explicit handle. Use when the handle needs to be retained, or passed
>> across helpers. Multiple sequential suppression blocks are allowed.
>>
>> This series is based on the RFC patch and subsequent discussion at
>> https://patchwork.kernel.org/project/linux-kselftest/patch/02546e59-1afe-4b08-ba81-d94f3b691c9a@moroto.mountain/
>> and offers a more comprehensive solution of the problem discussed there.
>>
>> Changes since RFC:
>> - Introduced CONFIG_KUNIT_SUPPRESS_BACKTRACE
>> - Minor cleanups and bug fixes
>> - Added support for all affected architectures
>> - Added support for counting suppressed warnings
>> - Added unit tests using those counters
>> - Added patch to suppress warning backtraces in dev_addr_lists tests
>>
>> Changes since v1:
>> - Rebased to v6.9-rc1
>> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
>> [I retained those tags since there have been no functional changes]
>> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option, enabled by
>> default.
>>
>> Changes since v2:
>> - Rebased to v6.9-rc2
>> - Added comments to drm warning suppression explaining why it is needed.
>> - Added patch to move conditional code in arch/sh/include/asm/bug.h
>> to avoid kerneldoc warning
>> - Added architecture maintainers to Cc: for architecture specific patches
>> - No functional changes
>>
>> Changes since v3:
>> - Rebased to v6.14-rc6
>> - Dropped net: "kunit: Suppress lock warning noise at end of dev_addr_lists tests"
>> since 3db3b62955cd6d73afde05a17d7e8e106695c3b9
>> - Added __kunit_ and KUNIT_ prefixes.
>> - Tested on the architectures of interest.
>>
>> Changes since v4:
>> - Rebased to v6.15-rc7
>> - Dropped all code in __report_bug()
>> - Moved all checks in WARN*() macros.
>> - Dropped all architecture specific code.
>> - Made __kunit_is_suppressed_warning nice to noinstr functions.
>>
>> Changes since v5:
>> - Rebased to v7.0-rc3
>> - Added RCU protection for the suppressed warnings list.
>> - Added static key and branching optimization.
>> - Removed custom `strcmp` implementation and reworked
>> __kunit_is_suppressed_warning() entrypoint function.
>>
>> Changes since v6:
>> - Moved suppression checks from WARN*() macros to warn_slowpath_fmt()
>> and __report_bug().
>> - Replaced stack-allocated suppression struct with kunit_kzalloc() heap
>> allocation tied to the KUnit test lifecycle.
>> - Changed suppression strategy from function-name matching to task-scoped:
>> all warnings on the current task are suppressed between START and END,
>> rather than only warnings originating from a specific named function.
>> - Simplified macro API: removed KUNIT_DECLARE_SUPPRESSED_WARNING(),
>> the START macro now takes (test) and handles allocation internally.
>> - Removed static key and branching optimization, as by the time it
>> was executed, callers are already in warn slowpaths.
>> - Link to v6: https://lore.kernel.org/r/20260317-kunit_add_support-v6-0-dd22aeb3fe5d@redhat.com
>>
>> Changes since v7:
>> - Integrated suppression into existing KUnit hooks infrastructure
>> - Removed CONFIG_KUNIT_SUPPRESS_BACKTRACE
>> - Added suppression check in __warn_printk()
>> - Added spinlock for writer-side RCU protection
>> - Replaced explicit rcu_read_lock/unlock with guard(rcu)()
>> - Added scoped API (kunit_warning_suppress) using __cleanup attribute
>> - Updated DRM patch to use scoped API
>> - Expanded self-tests: incremental counting, cross-kthread isolation
>> - Rewrote documentation covering all three API forms with examples
>> - Link to v7: https://lore.kernel.org/r/20260420-kunit_add_support-v7-0-e8bc6e0f70de@redhat.com
>>
>> Changes since v8:
>> - Rebased to v7.1-rc2
>> - Remove KUNIT_START/END_SUPPRESSED_WARNING() macros
>> - Add KUNIT_EXPECT_SUPPRESSED_WARNING_COUNT checks to drm tests
>> - Link to v8: https://lore.kernel.org/r/20260504-kunit_add_support-v8-0-3e5957cdd235@redhat.com
>>
>> Changes since v9:
>> - Fix silent false-pass when kunit_start_suppress_warning() returns NULL
>> - Fix RCU lockdep splat for kunit_is_suppressed_warning() calls
>> - Move disable_trace_on_warning() in __report_bug()
>> - Make suppress counter atomic
>> - Mark helper warn functions in selftest as noinline
>> - Add kunit_skip() for CONFIG_BUG=n in selftests
>> - Fix potentially uninitialized data.was_active in kthread selftest
>> - Add kthread_stop() in kthread selftest early exit
>> - Initialize scaling_factor to INT_MIN in DRM scaling tests
>> - Add include for bool in test-bug.h to fix CONFIG_KUNIT=n case
>> - Link to v9: https://lore.kernel.org/r/20260508-kunit_add_support-v9-0-99df7aa880f6@redhat.com
>>
>> Changes since v10:
>> - Remove synchronize_rcu() to avoid sleeping in atomic context
>> - Pin task_struct refcount to prevent ABA false-positive matches
>> - Loop in suppression selftest to prevent use-after-free on kthread exit
>> - Skip DRM rect tests on CONFIG_BUG=n
>> - Link to v10: https://lore.kernel.org/r/20260513-kunit_add_support-v10-0-e379d206c8cd@redhat.com
>>
>> Changes since v11:
>> - Use call_rcu() to defer free without blocking
>> - Remove #ifdef CONFIG_KUNIT guard in lib/bug.c
>> - Remove stale config checks from selftest
>> - Replace skip on DRM rect tests with conditional expectation
>> - Link to v11: https://lore.kernel.org/r/20260514-kunit_add_support-v11-0-b36a530a6d8f@redhat.com
>>
>> Changes since v12:
>> - Reverted to the v9 synchronize_rcu() approach
>> - Add in_task() check at the top of __kunit_is_suppressed_warning_impl()
>> - Link to v12: https://lore.kernel.org/r/20260515-kunit_add_support-v12-0-a216dc228be8@redhat.com
>
> Hi all,
>
> I am not sure if there is a decision to merge this series or if any
> work remains to be done.
>
> I reckon I sent a few versions back-to-back last time as I was
> struggling with Sashiko. However, there are no significant changes,
> the core strategy remains unchanged, involving only the addition of
> safety checks and the removal of some redundancies to satisfy the AI.
> I am just clarifying in case the last versions/respins were unclear. I
> tried running AI reviews locally but Sashiko always found more issues
> than my local model could.
>
The latest version was fine. It's just landed for 7.2:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42eb3a5ef6bc56192bf450c79a3f274e081f8131
Thanks for all of your work,
-- David
^ permalink raw reply
* [PATCH v7 20/20] nfsd: add support to CB_NOTIFY for dir attribute changes
From: Jeff Layton @ 2026-06-16 11:59 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
If the client requested dir attribute change notifications, send those
alongside any set of add/remove/rename events. Note that the server will
still recall the delegation on a SETATTR, so these are only sent for
changes to child dirents.
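
The slot reservation that this patch adds to nfsd4_cb_notify_prepare() can be modelled outside the kernel. The following is a minimal userspace sketch, not the nfsd code: EVENT_QUEUE_SIZE, WANT_DIR_ATTRS and notify_fits() are hypothetical names chosen for illustration, and the queue size is an assumed value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real constants live in the nfsd sources. */
#define EVENT_QUEUE_SIZE 8u
#define WANT_DIR_ATTRS   (1u << 0)

/*
 * Return true if `count` queued entry events fit in one callback while
 * keeping one slot free for the trailing dir attribute event, when the
 * client asked for it. A false return models the "can't keep up" path,
 * where the delegation is recalled instead.
 */
bool notify_fits(uint32_t notify_mask, unsigned int count)
{
	unsigned int limit = EVENT_QUEUE_SIZE;

	if (notify_mask & WANT_DIR_ATTRS)
		--limit;	/* save a slot for the dir attr update */

	return count <= limit;
}
```

With these assumed values, a full queue of 8 events fits when dir attribute notifications were not requested, but triggers the recall path when they were, because only 7 slots remain for entry events.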
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Documentation/sunrpc/xdr/nfs4_1.x | 1 +
fs/nfsd/nfs4proc.c | 7 +++--
fs/nfsd/nfs4state.c | 25 +++++++++++++++--
fs/nfsd/nfs4xdr.c | 58 +++++++++++++++++++++++++++++++++------
fs/nfsd/nfs4xdr_gen.c | 4 +--
fs/nfsd/nfs4xdr_gen.h | 3 ++
fs/nfsd/xdr4.h | 2 ++
7 files changed, 84 insertions(+), 16 deletions(-)
diff --git a/Documentation/sunrpc/xdr/nfs4_1.x b/Documentation/sunrpc/xdr/nfs4_1.x
index a32df1e882e5..e66f396ae659 100644
--- a/Documentation/sunrpc/xdr/nfs4_1.x
+++ b/Documentation/sunrpc/xdr/nfs4_1.x
@@ -464,6 +464,7 @@ pragma public notify_add4;
struct notify_attr4 {
notify_entry4 na_changed_entry;
};
+pragma public notify_attr4;
struct notify_rename4 {
notify_remove4 nrn_old_entry;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 48fc7b0df4dc..c413ed0810b9 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2552,9 +2552,10 @@ nfsd4_verify(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
return status == nfserr_same ? nfs_ok : status;
}
-#define SUPPORTED_NOTIFY_MASK (BIT(NOTIFY4_REMOVE_ENTRY) | \
- BIT(NOTIFY4_ADD_ENTRY) | \
- BIT(NOTIFY4_RENAME_ENTRY) | \
+#define SUPPORTED_NOTIFY_MASK (BIT(NOTIFY4_CHANGE_DIR_ATTRS) | \
+ BIT(NOTIFY4_REMOVE_ENTRY) | \
+ BIT(NOTIFY4_ADD_ENTRY) | \
+ BIT(NOTIFY4_RENAME_ENTRY) | \
BIT(NOTIFY4_GFLAG_EXTEND))
static __be32
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index a948dc8a46cc..2f7210accdf1 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3522,10 +3522,15 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
struct nfsd_notify_event *events[NOTIFY4_EVENT_QUEUE_SIZE];
struct xdr_buf xdr = { .buflen = PAGE_SIZE * NOTIFY4_PAGE_ARRAY_SIZE,
.pages = ncn->ncn_pages };
+ int limit = NOTIFY4_EVENT_QUEUE_SIZE;
struct xdr_stream stream;
struct nfsd_file *nf;
- int count, i;
bool error = false;
+ int count, i;
+
+ /* Save a slot for dir attr update if requested */
+ if (dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS))
+ --limit;
/* Clear any failure recorded by a previous transmit. */
ncn->ncn_encode_err = false;
@@ -3542,7 +3547,7 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
}
/* we can't keep up! */
- if (count > NOTIFY4_EVENT_QUEUE_SIZE) {
+ if (count > limit) {
spin_unlock(&ncn->ncn_lock);
goto out_recall;
}
@@ -3589,6 +3594,22 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
nfsd_notify_event_put(nne);
}
if (!error) {
+ if (dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS)) {
+ u32 *maskp = (u32 *)xdr_reserve_space(&stream, sizeof(*maskp));
+
+ if (maskp) {
+ u8 *p = nfsd4_encode_dir_attr_change(&stream, dp, nf);
+
+ if (p) {
+ *maskp = BIT(NOTIFY4_CHANGE_DIR_ATTRS);
+ ncn->ncn_nf[count].notify_mask.count = 1;
+ ncn->ncn_nf[count].notify_mask.element = maskp;
+ ncn->ncn_nf[count].notify_vals.data = p;
+ ncn->ncn_nf[count].notify_vals.len = (u8 *)stream.p - p;
+ ++count;
+ }
+ }
+ }
ncn->ncn_nf_cnt = count;
nfsd_file_put(nf);
return true;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 0649bb4cf2e7..a18d0b766b50 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4253,11 +4253,11 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
struct dentry *dentry, struct nfs4_delegation *dp,
struct nfsd_file *nf, char *name, u32 namelen)
{
- struct path path = { .mnt = nf->nf_file->f_path.mnt,
- .dentry = dentry };
+ struct path path = nf->nf_file->f_path;
struct nfsd4_fattr_args args = { };
uint32_t *attrmask;
__be32 status;
+ bool parent;
int ret;
/* Reserve space for attrmask */
@@ -4269,6 +4269,9 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
ne->ne_file.len = namelen;
ne->ne_attrs.attrmask.element = attrmask;
+ parent = (dentry == path.dentry);
+ path.dentry = dentry;
+
/* FIXME: d_find_alias for inode ? */
if (!path.dentry || !d_inode(path.dentry))
goto noattrs;
@@ -4284,15 +4287,20 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
args.change_attr = nfsd4_change_attribute(&args.stat);
- attrmask[0] = dp->dl_child_attrs[0];
- attrmask[1] = dp->dl_child_attrs[1];
- attrmask[2] = 0;
+ if (parent) {
+ attrmask[0] = dp->dl_dir_attrs[0];
+ attrmask[1] = dp->dl_dir_attrs[1];
+ } else {
+ attrmask[0] = dp->dl_child_attrs[0];
+ attrmask[1] = dp->dl_child_attrs[1];
- if (!setup_notify_fhandle(dentry, dp, nf, &args))
- attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
+ if (!setup_notify_fhandle(dentry, dp, nf, &args))
+ attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
- if (!(args.stat.result_mask & STATX_BTIME))
- attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
+ if (!(args.stat.result_mask & STATX_BTIME))
+ attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
+ }
+ attrmask[2] = 0;
ne->ne_attrs.attrmask.count = 2;
ne->ne_attrs.attr_vals.data = (u8 *)xdr->p;
@@ -4405,6 +4413,38 @@ u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *
return NULL;
}
+/**
+ * nfsd4_encode_dir_attr_change
+ * @xdr: stream to which to encode the fattr4
+ * @dp: delegation where the event occurred
+ * @nf: nfsd_file opened on the directory
+ *
+ * Encode a dir attr change event.
+ */
+u8 *nfsd4_encode_dir_attr_change(struct xdr_stream *xdr, struct nfs4_delegation *dp,
+ struct nfsd_file *nf)
+{
+ struct dentry *dentry = nf->nf_file->f_path.dentry;
+ struct notify_attr4 na = { };
+ bool ret;
+ u8 *p = NULL;
+
+ if (!(dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS)))
+ return NULL;
+
+ /* RFC 8881 s10.4.3: ne_file must be a zero-length string for dir attrs */
+ ret = nfsd4_setup_notify_entry4(&na.na_changed_entry, xdr,
+ dentry, dp, nf, "", 0);
+
+ /* Don't bother with the event if we're not encoding attrs */
+ if (ret && na.na_changed_entry.ne_attrs.attr_vals.len) {
+ p = (u8 *)xdr->p;
+ if (!xdrgen_encode_notify_attr4(xdr, &na))
+ p = NULL;
+ }
+ return p;
+}
+
static void svcxdr_init_encode_from_buffer(struct xdr_stream *xdr,
struct xdr_buf *buf, __be32 *p, int bytes)
{
diff --git a/fs/nfsd/nfs4xdr_gen.c b/fs/nfsd/nfs4xdr_gen.c
index d1240ade120d..a6725c773768 100644
--- a/fs/nfsd/nfs4xdr_gen.c
+++ b/fs/nfsd/nfs4xdr_gen.c
@@ -669,7 +669,7 @@ xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_decode_notify_attr4(struct xdr_stream *xdr, struct notify_attr4 *ptr)
{
if (!xdrgen_decode_notify_entry4(xdr, &ptr->na_changed_entry))
@@ -1091,7 +1091,7 @@ xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *valu
return true;
}
-static bool __maybe_unused
+bool
xdrgen_encode_notify_attr4(struct xdr_stream *xdr, const struct notify_attr4 *value)
{
if (!xdrgen_encode_notify_entry4(xdr, &value->na_changed_entry))
diff --git a/fs/nfsd/nfs4xdr_gen.h b/fs/nfsd/nfs4xdr_gen.h
index c62299bac735..f6a458a07406 100644
--- a/fs/nfsd/nfs4xdr_gen.h
+++ b/fs/nfsd/nfs4xdr_gen.h
@@ -38,6 +38,9 @@ bool xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_re
bool xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr);
bool xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *value);
+bool xdrgen_decode_notify_attr4(struct xdr_stream *xdr, struct notify_attr4 *ptr);
+bool xdrgen_encode_notify_attr4(struct xdr_stream *xdr, const struct notify_attr4 *value);
+
bool xdrgen_decode_notify_rename4(struct xdr_stream *xdr, struct notify_rename4 *ptr);
bool xdrgen_encode_notify_rename4(struct xdr_stream *xdr, const struct notify_rename4 *value);
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 62ac790428be..805c7122eb93 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -973,6 +973,8 @@ __be32 nfsd4_encode_fattr_to_buf(__be32 **p, int words,
u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *nne,
struct nfs4_delegation *dd, struct nfsd_file *nf,
u32 *notify_mask);
+u8 *nfsd4_encode_dir_attr_change(struct xdr_stream *xdr, struct nfs4_delegation *dp,
+ struct nfsd_file *nf);
extern __be32 nfsd4_setclientid(struct svc_rqst *rqstp,
struct nfsd4_compound_state *, union nfsd4_op_u *u);
extern __be32 nfsd4_setclientid_confirm(struct svc_rqst *rqstp,
--
2.54.0
^ permalink raw reply related
* [PATCH v7 19/20] nfsd: track requested dir attributes
From: Jeff Layton @ 2026-06-16 11:59 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Track the intersection of requested and supported dir attributes in the
delegation. In a later patch this will be used to ensure that we
only encode the attributes in that intersection when sending
add/remove/rename updates.
Since the requested dir attributes can now include word1 attributes,
gddr_dir_attributes[1] may be non-zero and nfsd4_encode_bitmap4() can
emit a two-word bitmap. Bump the dir-attribute bitmap budget in
nfsd4_get_dir_delegation_rsize() from one word to two accordingly, so the
reply-size check before this non-idempotent op accounts for the larger
encoding.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 4 +++-
fs/nfsd/nfs4state.c | 17 ++++++++++++++---
fs/nfsd/state.h | 1 +
3 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 576346578ee0..48fc7b0df4dc 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2610,6 +2610,8 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
memcpy(&gdd->gddr_stateid, &dd->dl_stid.sc_stateid, sizeof(gdd->gddr_stateid));
gdd->gddr_child_attributes[0] = dd->dl_child_attrs[0];
gdd->gddr_child_attributes[1] = dd->dl_child_attrs[1];
+ gdd->gddr_dir_attributes[0] = dd->dl_dir_attrs[0];
+ gdd->gddr_dir_attributes[1] = dd->dl_dir_attrs[1];
nfs4_put_stid(&dd->dl_stid);
return nfs_ok;
}
@@ -3577,7 +3579,7 @@ static u32 nfsd4_get_dir_delegation_rsize(const struct svc_rqst *rqstp,
op_encode_stateid_maxsz +
2 /* gddr_notification */ +
3 /* gddr_child_attributes */ +
- 2 /* gddr_dir_attributes */) * sizeof(__be32);
+ 3 /* gddr_dir_attributes */) * sizeof(__be32);
}
#ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c4dc0428f0a6..a948dc8a46cc 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9974,6 +9974,15 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct dentry *dentry,
FATTR4_WORD1_TIME_MODIFY | \
FATTR4_WORD1_TIME_CREATE)
+#define GDD_WORD0_DIR_ATTRS (FATTR4_WORD0_CHANGE | \
+ FATTR4_WORD0_SIZE)
+
+#define GDD_WORD1_DIR_ATTRS (FATTR4_WORD1_NUMLINKS | \
+ FATTR4_WORD1_SPACE_USED | \
+ FATTR4_WORD1_TIME_ACCESS | \
+ FATTR4_WORD1_TIME_METADATA | \
+ FATTR4_WORD1_TIME_MODIFY)
+
/**
* nfsd_get_dir_deleg - attempt to get a directory delegation
* @cstate: compound state
@@ -10042,14 +10051,16 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
dp->dl_stid.sc_export =
exp_get(cstate->current_fh.fh_export);
- dp->dl_child_attrs[0] = gdd->gdda_child_attributes[0] & GDD_WORD0_CHILD_ATTRS;
- dp->dl_child_attrs[1] = gdd->gdda_child_attributes[1] & GDD_WORD1_CHILD_ATTRS;
-
/*
* NB: gddr_notification[0] represents the notifications that
* will be granted to the client
*/
dp->dl_notify_mask = gdd->gddr_notification[0];
+ dp->dl_child_attrs[0] = gdd->gdda_child_attributes[0] & GDD_WORD0_CHILD_ATTRS;
+ dp->dl_child_attrs[1] = gdd->gdda_child_attributes[1] & GDD_WORD1_CHILD_ATTRS;
+ dp->dl_dir_attrs[0] = gdd->gdda_dir_attributes[0] & GDD_WORD0_DIR_ATTRS;
+ dp->dl_dir_attrs[1] = gdd->gdda_dir_attributes[1] & GDD_WORD1_DIR_ATTRS;
+
fl = nfs4_alloc_init_lease(dp, dp->dl_notify_mask);
if (!fl)
goto out_put_stid;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 2bc54546deb3..bc0181ef1cb6 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -301,6 +301,7 @@ struct nfs4_delegation {
/* For dir delegations */
u32 dl_notify_mask;
u32 dl_child_attrs[2];
+ u32 dl_dir_attrs[2];
};
static inline bool deleg_is_read(u32 dl_type)
--
2.54.0
^ permalink raw reply related
* [PATCH v7 18/20] nfsd: properly track requested child attributes
From: Jeff Layton @ 2026-06-16 11:59 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Track the intersection of requested and supported child attributes in the
delegation, and only encode the attributes in that intersection when sending
add/remove/rename updates.
Since the requested child attributes can now include word1 attributes,
gddr_child_attributes[1] may be non-zero and nfsd4_encode_bitmap4() can
emit a two-word bitmap. Bump the child-attribute bitmap budget in
nfsd4_get_dir_delegation_rsize() from one word to two accordingly, so the
reply-size check before this non-idempotent op accounts for the larger
encoding.
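
As a rough illustration of the masking described above, the sketch below models the two steps in userspace: the granted mask is the bitwise AND of what the client requested and what the server supports, and at notification time bits are only ever cleared from that mask, never added. The bit assignments and the names grant_child_attrs() and notify_attrmask() are hypothetical and do not correspond to the real FATTR4_* values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, chosen only for this example. */
#define ATTR_TYPE        (1u << 0)
#define ATTR_SIZE        (1u << 1)
#define ATTR_FILEHANDLE  (1u << 2)
#define ATTR_OWNER       (1u << 3)	/* assumed unsupported in notifications */

#define SUPPORTED_CHILD_ATTRS (ATTR_TYPE | ATTR_SIZE | ATTR_FILEHANDLE)

/* At delegation time: keep only the requested bits the server supports. */
uint32_t grant_child_attrs(uint32_t requested)
{
	return requested & SUPPORTED_CHILD_ATTRS;
}

/*
 * At notification time: start from the granted mask and clear any bit
 * whose value could not be produced for this particular entry.
 */
uint32_t notify_attrmask(uint32_t granted, bool have_fhandle)
{
	uint32_t mask = granted;

	if (!have_fhandle)
		mask &= ~ATTR_FILEHANDLE;

	return mask;
}
```

Starting from the granted mask and clearing bits means a client that did not request the filehandle never receives it, even when one could be composed.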
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 4 +++-
fs/nfsd/nfs4state.c | 18 ++++++++++++++++++
fs/nfsd/nfs4xdr.c | 15 ++++++---------
fs/nfsd/state.h | 1 +
4 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index be79b6063afe..576346578ee0 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2608,6 +2608,8 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
gdd->gddrnf_status = GDD4_OK;
memcpy(&gdd->gddr_stateid, &dd->dl_stid.sc_stateid, sizeof(gdd->gddr_stateid));
+ gdd->gddr_child_attributes[0] = dd->dl_child_attrs[0];
+ gdd->gddr_child_attributes[1] = dd->dl_child_attrs[1];
nfs4_put_stid(&dd->dl_stid);
return nfs_ok;
}
@@ -3574,7 +3576,7 @@ static u32 nfsd4_get_dir_delegation_rsize(const struct svc_rqst *rqstp,
op_encode_verifier_maxsz +
op_encode_stateid_maxsz +
2 /* gddr_notification */ +
- 2 /* gddr_child_attributes */ +
+ 3 /* gddr_child_attributes */ +
2 /* gddr_dir_attributes */) * sizeof(__be32);
}
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 682c00fbd2fb..c4dc0428f0a6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9959,6 +9959,21 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct dentry *dentry,
return status;
}
+#define GDD_WORD0_CHILD_ATTRS (FATTR4_WORD0_TYPE | \
+ FATTR4_WORD0_CHANGE | \
+ FATTR4_WORD0_SIZE | \
+ FATTR4_WORD0_FILEID | \
+ FATTR4_WORD0_FILEHANDLE)
+
+#define GDD_WORD1_CHILD_ATTRS (FATTR4_WORD1_MODE | \
+ FATTR4_WORD1_NUMLINKS | \
+ FATTR4_WORD1_RAWDEV | \
+ FATTR4_WORD1_SPACE_USED | \
+ FATTR4_WORD1_TIME_ACCESS | \
+ FATTR4_WORD1_TIME_METADATA | \
+ FATTR4_WORD1_TIME_MODIFY | \
+ FATTR4_WORD1_TIME_CREATE)
+
/**
* nfsd_get_dir_deleg - attempt to get a directory delegation
* @cstate: compound state
@@ -10027,6 +10042,9 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
dp->dl_stid.sc_export =
exp_get(cstate->current_fh.fh_export);
+ dp->dl_child_attrs[0] = gdd->gdda_child_attributes[0] & GDD_WORD0_CHILD_ATTRS;
+ dp->dl_child_attrs[1] = gdd->gdda_child_attributes[1] & GDD_WORD1_CHILD_ATTRS;
+
/*
* NB: gddr_notification[0] represents the notifications that
* will be granted to the client
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index b3a4b134d309..0649bb4cf2e7 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4284,18 +4284,15 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
args.change_attr = nfsd4_change_attribute(&args.stat);
- attrmask[0] = FATTR4_WORD0_TYPE | FATTR4_WORD0_CHANGE |
- FATTR4_WORD0_SIZE | FATTR4_WORD0_FILEID;
- attrmask[1] = FATTR4_WORD1_MODE | FATTR4_WORD1_NUMLINKS | FATTR4_WORD1_RAWDEV |
- FATTR4_WORD1_SPACE_USED | FATTR4_WORD1_TIME_ACCESS |
- FATTR4_WORD1_TIME_METADATA | FATTR4_WORD1_TIME_MODIFY;
+ attrmask[0] = dp->dl_child_attrs[0];
+ attrmask[1] = dp->dl_child_attrs[1];
attrmask[2] = 0;
- if (setup_notify_fhandle(dentry, dp, nf, &args))
- attrmask[0] |= FATTR4_WORD0_FILEHANDLE;
+ if (!setup_notify_fhandle(dentry, dp, nf, &args))
+ attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
- if (args.stat.result_mask & STATX_BTIME)
- attrmask[1] |= FATTR4_WORD1_TIME_CREATE;
+ if (!(args.stat.result_mask & STATX_BTIME))
+ attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
ne->ne_attrs.attrmask.count = 2;
ne->ne_attrs.attr_vals.data = (u8 *)xdr->p;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 7a66048a130c..2bc54546deb3 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -300,6 +300,7 @@ struct nfs4_delegation {
/* For dir delegations */
u32 dl_notify_mask;
+ u32 dl_child_attrs[2];
};
static inline bool deleg_is_read(u32 dl_type)
--
2.54.0
^ permalink raw reply related
* [PATCH v7 17/20] nfsd: fix reply size estimate for GET_DIR_DELEGATION
From: Jeff Layton @ 2026-06-16 11:59 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
nfsd4_get_dir_delegation_rsize() returns its estimate in XDR words, but
the COMPOUND reply-size machinery works in bytes: every other op's
_rsize helper multiplies its word count by sizeof(__be32). Since
GET_DIR_DELEGATION is OP_MODIFIES_SOMETHING, this estimate is consulted
before the op executes to ensure the reply will fit. The ~4x too-small
estimate lets a compound near the session/reply limit pass the check,
grant a directory delegation, and then fail to encode the reply with
NFS4ERR_RESOURCE/REP_TOO_BIG, leaving the client without the returned
stateid.
Multiply the estimate by sizeof(__be32) like the other _rsize helpers.
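
To make the unit mismatch concrete, here is a small userspace sketch under assumed numbers. REPLY_WORDS, rsize_buggy(), rsize_fixed() and reply_fits() are hypothetical names, and the word count is illustrative rather than the real sum of the op_encode_* constants.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One XDR word is four bytes on the wire. */
#define XDR_UNIT sizeof(uint32_t)

/* Hypothetical reply estimate in XDR words (illustrative only). */
#define REPLY_WORDS 20u

/* Buggy helper: returns a word count where bytes are expected. */
size_t rsize_buggy(void)
{
	return REPLY_WORDS;
}

/* Fixed helper: converts the word count to bytes. */
size_t rsize_fixed(void)
{
	return REPLY_WORDS * XDR_UNIT;
}

/* Pre-execution check: does the estimated reply fit in the bytes left? */
bool reply_fits(size_t estimate_bytes, size_t bytes_left)
{
	return estimate_bytes <= bytes_left;
}
```

With 40 bytes left in the reply buffer, the buggy estimate of 20 passes the check even though the encoded reply needs 80 bytes, while the fixed estimate correctly fails it before the operation runs.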
Fixes: 33a1e6ea73e5 ("nfsd: trivial GET_DIR_DELEGATION support")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 565bf76c08ed..be79b6063afe 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -3575,7 +3575,7 @@ static u32 nfsd4_get_dir_delegation_rsize(const struct svc_rqst *rqstp,
op_encode_stateid_maxsz +
2 /* gddr_notification */ +
2 /* gddr_child_attributes */ +
- 2 /* gddr_dir_attributes */);
+ 2 /* gddr_dir_attributes */) * sizeof(__be32);
}
#ifdef CONFIG_NFSD_PNFS
--
2.54.0
^ permalink raw reply related
* [PATCH v7 16/20] nfsd: add the filehandle to returned attributes in CB_NOTIFY
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
nfsd's usual fh_compose routine requires a svc_export and fills out a
svc_fh, which is more machinery than a CB_NOTIFY callback needs.
Add a new routine that composes a filehandle from just the parent
filehandle in the nfs4_file and the child dentry, and use it to fill out
the fhandle field in the nfsd4_fattr_args.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 48de0922c6dd..b3a4b134d309 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4198,6 +4198,52 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
goto out;
}
+static bool
+setup_notify_fhandle(struct dentry *dentry, struct nfs4_delegation *dp,
+ struct nfsd_file *nf, struct nfsd4_fattr_args *args)
+{
+ struct nfs4_file *fi = dp->dl_stid.sc_file;
+ struct svc_export *exp = dp->dl_stid.sc_export;
+ int fileid_type, fsid_len, maxsize, flags = 0;
+ struct knfsd_fh *fhp = &args->fhandle;
+ struct inode *inode = d_inode(dentry);
+ struct inode *parent = NULL;
+ struct fid *fid;
+
+ fsid_len = key_len(fi->fi_fhandle.fh_fsid_type);
+ fhp->fh_size = 4 + fsid_len;
+
+ /* Copy first 4 bytes + fsid */
+ memcpy(&fhp->fh_raw, &fi->fi_fhandle.fh_raw, fhp->fh_size);
+
+ fid = (struct fid *)(fh_fsid(fhp) + fsid_len/4);
+ maxsize = (NFS4_FHSIZE - fhp->fh_size)/4;
+
+ /*
+ * Subtree-checking exports need a connectable filehandle so the
+ * parent can be resolved at decode time. Derive this from the
+ * delegation's export rather than the shared nfs4_file, which may
+ * have been initialized under a different export.
+ */
+ if (!(exp->ex_flags & NFSEXP_NOSUBTREECHECK) && !S_ISDIR(inode->i_mode)) {
+ parent = d_inode(nf->nf_file->f_path.dentry);
+ flags = EXPORT_FH_CONNECTABLE;
+ }
+
+ fileid_type = exportfs_encode_inode_fh(inode, fid, &maxsize, parent, flags);
+ if (fileid_type < 0 || fileid_type == FILEID_INVALID)
+ return false;
+
+ fhp->fh_fileid_type = fileid_type;
+ fhp->fh_size += maxsize * 4;
+
+ if (exp && (exp->ex_flags & NFSEXP_SIGN_FH))
+ if (!fh_append_mac(fhp, NFS4_FHSIZE, exp->cd->net))
+ return false;
+
+ return true;
+}
+
#define CB_NOTIFY_STATX_REQUEST_MASK (STATX_BASIC_STATS | \
STATX_BTIME | \
STATX_CHANGE_COOKIE)
@@ -4245,6 +4291,9 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
FATTR4_WORD1_TIME_METADATA | FATTR4_WORD1_TIME_MODIFY;
attrmask[2] = 0;
+ if (setup_notify_fhandle(dentry, dp, nf, &args))
+ attrmask[0] |= FATTR4_WORD0_FILEHANDLE;
+
if (args.stat.result_mask & STATX_BTIME)
attrmask[1] |= FATTR4_WORD1_TIME_CREATE;
--
2.54.0
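
A minimal userspace sketch of the size bookkeeping described above, not the nfsd code itself: the child handle starts as a copy of the parent's 4-byte header plus fsid, and the remaining room is handed to the file-ID encoder in 4-byte words. The names ex_fh, ex_fh_start_child(), ex_fh_finish_child() and EX_FHSIZE are hypothetical stand-ins, and the file-ID encoding step (exportfs_encode_inode_fh() in the patch) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_FHSIZE 128	/* hypothetical stand-in for NFS4_FHSIZE */

/* Hypothetical, simplified filehandle: a size plus raw bytes. */
struct ex_fh {
	size_t size;
	unsigned char raw[EX_FHSIZE];
};

/*
 * Copy the 4-byte header and the fsid from the parent handle into the
 * child, then return how many 4-byte words remain for the file ID.
 * Returns -1 if the prefix does not fit or is not word aligned.
 */
static int ex_fh_start_child(struct ex_fh *child, const struct ex_fh *parent,
			     size_t fsid_len)
{
	size_t prefix = 4 + fsid_len;

	assert(child && parent);
	if (fsid_len % 4 != 0 || prefix > parent->size || prefix > EX_FHSIZE)
		return -1;

	memcpy(child->raw, parent->raw, prefix);
	child->size = prefix;
	return (int)((EX_FHSIZE - prefix) / 4);
}

/* Account for the words the file-ID encoder reports it actually used. */
static int ex_fh_finish_child(struct ex_fh *child, int words_used)
{
	if (words_used < 0 ||
	    child->size + (size_t)words_used * 4 > EX_FHSIZE)
		return -1;

	child->size += (size_t)words_used * 4;
	return 0;
}
```

The sketch keeps the "words in, words out" convention so that the byte size is only ever adjusted in one place, after the encoder has reported how much space it consumed.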
* [PATCH v7 15/20] nfsd: allow encoding a filehandle into fattr4 without a svc_fh
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
The current fattr4 encoder requires a svc_fh in order to encode the
filehandle. This is not available in a CB_NOTIFY callback. Add a new
"fhandle" field to struct nfsd4_fattr_args and copy the filehandle into
there from the svc_fh. CB_NOTIFY will populate it via other means.
A filehandle composed this way may still need a MAC appended on signed
exports, so generalize fh_append_mac() to operate on a bare knfsd_fh
(plus its maximum size and net) rather than a svc_fh.
The FSID attribute shares the same attrmask gate as the filehandle, so
do the same for it: add fsid_source_fh() which takes a bare knfsd_fh and
its svc_export, and have the FSID encoder use args->fhandle and
args->exp. fsid_source() becomes a wrapper for the v2/v3 callers. The
now-unused svc_fh pointer is dropped from struct nfsd4_fattr_args.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 37 ++++++++++++++++++++-----------------
fs/nfsd/nfsfh.c | 30 ++++++++++++++++++------------
fs/nfsd/nfsfh.h | 3 +++
3 files changed, 41 insertions(+), 29 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 90c265ce3846..48de0922c6dd 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2719,7 +2719,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
}
static __be32 nfsd4_encode_nfs_fh4(struct xdr_stream *xdr,
- struct knfsd_fh *fh_handle)
+ const struct knfsd_fh *fh_handle)
{
return nfsd4_encode_opaque(xdr, fh_handle->fh_raw, fh_handle->fh_size);
}
@@ -3159,9 +3159,9 @@ nfsd4_encode_bitmap4(struct xdr_stream *xdr, u32 bmval0, u32 bmval1, u32 bmval2)
struct nfsd4_fattr_args {
struct svc_rqst *rqstp;
- struct svc_fh *fhp;
struct svc_export *exp;
struct dentry *dentry;
+ struct knfsd_fh fhandle;
struct kstat stat;
struct kstatfs statfs;
struct nfs4_acl *acl;
@@ -3309,7 +3309,7 @@ static __be32 nfsd4_encode_fattr4_fsid(struct xdr_stream *xdr,
xdr_encode_hyper(p, NFS4_REFERRAL_FSID_MINOR);
return nfs_ok;
}
- switch (fsid_source(args->fhp)) {
+ switch (fsid_source_fh(&args->fhandle, args->exp)) {
case FSIDSOURCE_FSID:
p = xdr_encode_hyper(p, (u64)args->exp->ex_fsid);
xdr_encode_hyper(p, (u64)0);
@@ -3406,7 +3406,7 @@ static __be32 nfsd4_encode_fattr4_homogeneous(struct xdr_stream *xdr,
static __be32 nfsd4_encode_fattr4_filehandle(struct xdr_stream *xdr,
const struct nfsd4_fattr_args *args)
{
- return nfsd4_encode_nfs_fh4(xdr, &args->fhp->fh_handle);
+ return nfsd4_encode_nfs_fh4(xdr, &args->fhandle);
}
static __be32 nfsd4_encode_fattr4_fileid(struct xdr_stream *xdr,
@@ -4019,19 +4019,22 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
if (err)
goto out_nfserr;
}
- if ((attrmask[0] & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) &&
- !fhp) {
- tempfh = kmalloc_obj(struct svc_fh);
- status = nfserr_jukebox;
- if (!tempfh)
- goto out;
- fh_init(tempfh, NFS4_FHSIZE);
- status = fh_compose(tempfh, exp, dentry, NULL);
- if (status)
- goto out;
- args.fhp = tempfh;
- } else
- args.fhp = fhp;
+
+ if ((attrmask[0] & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID))) {
+ if (!fhp) {
+ tempfh = kmalloc_obj(struct svc_fh);
+ status = nfserr_jukebox;
+ if (!tempfh)
+ goto out;
+ fh_init(tempfh, NFS4_FHSIZE);
+ status = fh_compose(tempfh, exp, dentry, NULL);
+ if (status)
+ goto out;
+ fhp = tempfh;
+ }
+ fh_copy_shallow(&args.fhandle, &fhp->fh_handle);
+ }
+
if (attrmask[0] & (FATTR4_WORD0_CASE_INSENSITIVE |
FATTR4_WORD0_CASE_PRESERVING)) {
/*
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index ab53de1c280d..8b1a95e1d058 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -142,16 +142,15 @@ static inline __be32 check_pseudo_root(struct dentry *dentry,
/* Size of a file handle MAC, in 4-octet words */
#define FH_MAC_WORDS (sizeof(__le64) / 4)
-static bool fh_append_mac(struct svc_fh *fhp, struct net *net)
+bool fh_append_mac(struct knfsd_fh *fh, int fh_maxsize, struct net *net)
{
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
- struct knfsd_fh *fh = &fhp->fh_handle;
siphash_key_t *fh_key = nn->fh_key;
__le64 hash;
if (!fh_key)
goto out_no_key;
- if (fh->fh_size + sizeof(hash) > fhp->fh_maxsize)
+ if (fh->fh_size + sizeof(hash) > fh_maxsize)
goto out_no_space;
hash = cpu_to_le64(siphash(&fh->fh_raw, fh->fh_size, fh_key));
@@ -165,7 +164,7 @@ static bool fh_append_mac(struct svc_fh *fhp, struct net *net)
out_no_space:
pr_warn_ratelimited("NFSD: unable to sign filehandles, fh_size %zu would be greater than fh_maxsize %d.\n",
- fh->fh_size + sizeof(hash), fhp->fh_maxsize);
+ fh->fh_size + sizeof(hash), fh_maxsize);
return false;
}
@@ -564,7 +563,8 @@ static void _fh_update(struct svc_fh *fhp, struct svc_export *exp,
fhp->fh_handle.fh_size += maxsize * 4;
if (exp->ex_flags & NFSEXP_SIGN_FH)
- if (!fh_append_mac(fhp, exp->cd->net))
+ if (!fh_append_mac(&fhp->fh_handle, fhp->fh_maxsize,
+ exp->cd->net))
fhp->fh_handle.fh_fileid_type = FILEID_INVALID;
} else {
fhp->fh_handle.fh_fileid_type = FILEID_ROOT;
@@ -894,19 +894,20 @@ char * SVCFH_fmt(struct svc_fh *fhp)
return buf;
}
-enum fsid_source fsid_source(const struct svc_fh *fhp)
+enum fsid_source fsid_source_fh(const struct knfsd_fh *fh,
+ struct svc_export *exp)
{
- if (fhp->fh_handle.fh_version != 1)
+ if (fh->fh_version != 1)
return FSIDSOURCE_DEV;
- switch(fhp->fh_handle.fh_fsid_type) {
+ switch (fh->fh_fsid_type) {
case FSID_DEV:
case FSID_ENCODE_DEV:
case FSID_MAJOR_MINOR:
- if (exp_sb(fhp->fh_export)->s_type->fs_flags & FS_REQUIRES_DEV)
+ if (exp_sb(exp)->s_type->fs_flags & FS_REQUIRES_DEV)
return FSIDSOURCE_DEV;
break;
case FSID_NUM:
- if (fhp->fh_export->ex_flags & NFSEXP_FSID)
+ if (exp->ex_flags & NFSEXP_FSID)
return FSIDSOURCE_FSID;
break;
default:
@@ -915,13 +916,18 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
/* either a UUID type filehandle, or the filehandle doesn't
* match the export.
*/
- if (fhp->fh_export->ex_flags & NFSEXP_FSID)
+ if (exp->ex_flags & NFSEXP_FSID)
return FSIDSOURCE_FSID;
- if (fhp->fh_export->ex_uuid)
+ if (exp->ex_uuid)
return FSIDSOURCE_UUID;
return FSIDSOURCE_DEV;
}
+enum fsid_source fsid_source(const struct svc_fh *fhp)
+{
+ return fsid_source_fh(&fhp->fh_handle, fhp->fh_export);
+}
+
/**
* nfsd4_change_attribute - Generate an NFSv4 change_attribute value
* @stat: inode attributes
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 5ef7191f8ad8..cdeb5eea65a8 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -131,6 +131,8 @@ enum fsid_source {
FSIDSOURCE_FSID,
FSIDSOURCE_UUID,
};
+extern enum fsid_source fsid_source_fh(const struct knfsd_fh *fh,
+ struct svc_export *exp);
extern enum fsid_source fsid_source(const struct svc_fh *fhp);
@@ -226,6 +228,7 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat);
__be32 fh_compose(struct svc_fh *, struct svc_export *, struct dentry *, struct svc_fh *);
__be32 fh_update(struct svc_fh *);
void fh_put(struct svc_fh *);
+bool fh_append_mac(struct knfsd_fh *fh, int fh_maxsize, struct net *net);
static __inline__ struct svc_fh *
fh_copy(struct svc_fh *dst, const struct svc_fh *src)
--
2.54.0
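
To illustrate why passing the maximum size explicitly is enough once the svc_fh is gone, here is a hedged userspace sketch of appending a MAC to a bare handle. Everything in it is an assumption made for illustration: ex_fh and ex_fh_append_mac() are hypothetical, and ex_hash() is a toy FNV-1a variant standing in for the keyed siphash used by nfsd. It is not a secure MAC, and it stores the value in host byte order where the kernel converts to little-endian.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified filehandle: a size plus raw bytes. */
struct ex_fh {
	size_t size;
	unsigned char raw[128];
};

/* Toy keyed hash (FNV-1a with the key mixed in). Illustration only. */
static uint64_t ex_hash(const unsigned char *p, size_t n, uint64_t key)
{
	uint64_t h = 0xcbf29ce484222325ULL ^ key;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Append a MAC over the current handle contents. The caller supplies
 * the size limit, so no wrapper structure is needed. Returns false and
 * leaves the handle untouched if the MAC would not fit.
 */
static bool ex_fh_append_mac(struct ex_fh *fh, size_t fh_maxsize, uint64_t key)
{
	uint64_t mac;

	assert(fh_maxsize <= sizeof(fh->raw));
	if (fh->size + sizeof(mac) > fh_maxsize)
		return false;

	mac = ex_hash(fh->raw, fh->size, key);
	memcpy(fh->raw + fh->size, &mac, sizeof(mac));
	fh->size += sizeof(mac);
	return true;
}
```

The point of the sketch is the signature: the handle, its limit and the key source are all the function needs, which is what lets both the svc_fh path and the CB_NOTIFY path share it.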
* [PATCH v7 14/20] nfsd: send basic file attributes in CB_NOTIFY
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
In addition to the filename, send attributes about the inode in a
CB_NOTIFY event. This patch adds just the basic inode information that
can be acquired via GETATTR.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 6f5b0c032d64..90c265ce3846 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4195,12 +4195,21 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
goto out;
}
+#define CB_NOTIFY_STATX_REQUEST_MASK (STATX_BASIC_STATS | \
+ STATX_BTIME | \
+ STATX_CHANGE_COOKIE)
+
static bool
nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
struct dentry *dentry, struct nfs4_delegation *dp,
struct nfsd_file *nf, char *name, u32 namelen)
{
+ struct path path = { .mnt = nf->nf_file->f_path.mnt,
+ .dentry = dentry };
+ struct nfsd4_fattr_args args = { };
uint32_t *attrmask;
+ __be32 status;
+ int ret;
/* Reserve space for attrmask */
attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
@@ -4211,6 +4220,41 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
ne->ne_file.len = namelen;
ne->ne_attrs.attrmask.element = attrmask;
+ /* FIXME: d_find_alias for inode ? */
+ if (!path.dentry || !d_inode(path.dentry))
+ goto noattrs;
+
+ /*
+ * It is possible that the client was granted a delegation when a file
+ * was created. Note that we don't issue a CB_GETATTR here since stale
+ * attributes are presumably ok.
+ */
+ ret = vfs_getattr(&path, &args.stat, CB_NOTIFY_STATX_REQUEST_MASK, AT_STATX_SYNC_AS_STAT);
+ if (ret)
+ goto noattrs;
+
+ args.change_attr = nfsd4_change_attribute(&args.stat);
+
+ attrmask[0] = FATTR4_WORD0_TYPE | FATTR4_WORD0_CHANGE |
+ FATTR4_WORD0_SIZE | FATTR4_WORD0_FILEID;
+ attrmask[1] = FATTR4_WORD1_MODE | FATTR4_WORD1_NUMLINKS | FATTR4_WORD1_RAWDEV |
+ FATTR4_WORD1_SPACE_USED | FATTR4_WORD1_TIME_ACCESS |
+ FATTR4_WORD1_TIME_METADATA | FATTR4_WORD1_TIME_MODIFY;
+ attrmask[2] = 0;
+
+ if (args.stat.result_mask & STATX_BTIME)
+ attrmask[1] |= FATTR4_WORD1_TIME_CREATE;
+
+ ne->ne_attrs.attrmask.count = 2;
+ ne->ne_attrs.attr_vals.data = (u8 *)xdr->p;
+
+ status = nfsd4_encode_attr_vals(xdr, attrmask, &args);
+ if (status != nfs_ok)
+ goto noattrs;
+
+ ne->ne_attrs.attr_vals.len = (u8 *)xdr->p - ne->ne_attrs.attr_vals.data;
+ return true;
+noattrs:
attrmask[0] = 0;
attrmask[1] = 0;
attrmask[2] = 0;
--
2.54.0
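
The mask handling in the hunk above can be sketched in isolation: a fixed set of attributes is always offered, the creation time is added only when the stat call actually returned it, and any failure falls back to an all-zero mask. This is a simplified illustration under stated assumptions. The bit positions and the names ex_build_attrmask(), EX_WORD0_BASE, EX_WORD1_BASE, EX_WORD1_TIME_CREATE and EX_STAT_BTIME are hypothetical placeholders, not the FATTR4 or STATX values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values; the real FATTR4 and STATX constants differ. */
#define EX_WORD0_BASE		UINT32_C(0x0000001a)
#define EX_WORD1_BASE		UINT32_C(0x0030a20a)
#define EX_WORD1_TIME_CREATE	UINT32_C(0x00040000)
#define EX_STAT_BTIME		UINT32_C(0x00000800)

/*
 * Fill a three-word attribute mask. When the stat call failed, the
 * mask stays empty, which corresponds to the "noattrs" fallback.
 */
static void ex_build_attrmask(uint32_t mask[3], bool stat_ok,
			      uint32_t result_mask)
{
	assert(mask);
	mask[0] = 0;
	mask[1] = 0;
	mask[2] = 0;

	if (!stat_ok)
		return;

	mask[0] = EX_WORD0_BASE;
	mask[1] = EX_WORD1_BASE;

	/* Only advertise the birth time if the filesystem reported one. */
	if (result_mask & EX_STAT_BTIME)
		mask[1] |= EX_WORD1_TIME_CREATE;
}
```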
* [PATCH v7 13/20] nfsd: allow nfsd4_encode_fattr4_change() to work with no export
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
In the context of a CB_NOTIFY callback, we may not have easy access to
a svc_export. However, nfsd will not currently grant a delegation on the
V4 root, so tolerating a NULL export here should be safe.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 54a7902935b5..6f5b0c032d64 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3277,7 +3277,7 @@ static __be32 nfsd4_encode_fattr4_change(struct xdr_stream *xdr,
{
const struct svc_export *exp = args->exp;
- if (unlikely(exp->ex_flags & NFSEXP_V4ROOT)) {
+ if (exp && unlikely(exp->ex_flags & NFSEXP_V4ROOT)) {
u32 flush_time = convert_to_wallclock(exp->cd->flush_time);
if (xdr_stream_encode_u32(xdr, flush_time) != XDR_UNIT)
--
2.54.0
* [PATCH v7 12/20] nfsd: add helper to marshal a fattr4 from completed args
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Break the loop that encodes the actual attr_vals field into a separate
function.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 44 +++++++++++++++++++++++++-------------------
1 file changed, 25 insertions(+), 19 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 976d60115cd6..54a7902935b5 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3899,6 +3899,22 @@ static const nfsd4_enc_attr nfsd4_enc_fattr4_encode_ops[] = {
#endif
};
+static __be32
+nfsd4_encode_attr_vals(struct xdr_stream *xdr, u32 *attrmask, struct nfsd4_fattr_args *args)
+{
+ DECLARE_BITMAP(attr_bitmap, ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops));
+ unsigned long bit;
+ __be32 status;
+
+ bitmap_from_arr32(attr_bitmap, attrmask, ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops));
+ for_each_set_bit(bit, attr_bitmap, ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops)) {
+ status = nfsd4_enc_fattr4_encode_ops[bit](xdr, args);
+ if (status != nfs_ok)
+ return status;
+ }
+ return nfs_ok;
+}
+
/*
* Note: @fhp can be NULL; in this case, we might have to compose the filehandle
* ourselves. @case_cache is NULL for callers that encode a single dentry
@@ -3912,7 +3928,6 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
int ignore_crossmnt,
struct nfsd_case_attrs_cache *case_cache)
{
- DECLARE_BITMAP(attr_bitmap, ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops));
struct nfs4_delegation *dp = NULL;
struct nfsd4_fattr_args args;
struct svc_fh *tempfh = NULL;
@@ -3927,7 +3942,6 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
.mnt = exp->ex_path.mnt,
.dentry = dentry,
};
- unsigned long bit;
WARN_ON_ONCE(bmval[1] & NFSD_WRITEONLY_ATTRS_WORD1);
WARN_ON_ONCE(!nfsd_attrs_supported(minorversion, bmval));
@@ -4141,27 +4155,22 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
#endif /* CONFIG_NFSD_V4_POSIX_ACLS */
/* attrmask */
- status = nfsd4_encode_bitmap4(xdr, attrmask[0], attrmask[1],
- attrmask[2]);
+ status = nfsd4_encode_bitmap4(xdr, attrmask[0], attrmask[1], attrmask[2]);
if (status)
goto out;
/* attr_vals */
attrlen_offset = xdr->buf->len;
- if (unlikely(!xdr_reserve_space(xdr, XDR_UNIT)))
- goto out_resource;
- bitmap_from_arr32(attr_bitmap, attrmask,
- ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops));
- for_each_set_bit(bit, attr_bitmap,
- ARRAY_SIZE(nfsd4_enc_fattr4_encode_ops)) {
- status = nfsd4_enc_fattr4_encode_ops[bit](xdr, &args);
- if (status != nfs_ok)
- goto out;
+ if (unlikely(!xdr_reserve_space(xdr, XDR_UNIT))) {
+ status = nfserr_resource;
+ goto out;
}
- attrlen = cpu_to_be32(xdr->buf->len - attrlen_offset - XDR_UNIT);
- write_bytes_to_xdr_buf(xdr->buf, attrlen_offset, &attrlen, XDR_UNIT);
- status = nfs_ok;
+ status = nfsd4_encode_attr_vals(xdr, attrmask, &args);
+ if (status == nfs_ok) {
+ attrlen = cpu_to_be32(xdr->buf->len - attrlen_offset - XDR_UNIT);
+ write_bytes_to_xdr_buf(xdr->buf, attrlen_offset, &attrlen, XDR_UNIT);
+ }
out:
#ifdef CONFIG_NFSD_V4_POSIX_ACLS
if (args.dpacl)
@@ -4184,9 +4193,6 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
out_nfserr:
status = nfserrno(err);
goto out;
-out_resource:
- status = nfserr_resource;
- goto out;
}
static bool
--
2.54.0
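
The shape of the extracted helper can be shown with a small userspace sketch: walk the set bits of a multi-word mask in ascending order, call the encoder registered for each bit, and stop at the first failure. The names here (ex_out, ex_ops, ex_encode_attr_vals()) are hypothetical, and the plain loop over bit indices is a simplification that stands in for the kernel's bitmap_from_arr32() and for_each_set_bit().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output buffer that records which encoders ran. */
struct ex_out {
	int vals[8];
	size_t n;
};

typedef int (*ex_enc_fn)(struct ex_out *out);

static int ex_push(struct ex_out *out, int v)
{
	if (out->n >= sizeof(out->vals) / sizeof(out->vals[0]))
		return -1;
	out->vals[out->n++] = v;
	return 0;
}

static int ex_enc_a(struct ex_out *out) { return ex_push(out, 10); }
static int ex_enc_b(struct ex_out *out) { return ex_push(out, 20); }
static int ex_enc_c(struct ex_out *out) { return ex_push(out, 30); }
static int ex_enc_fail(struct ex_out *out) { (void)out; return -1; }

/* One encoder per attribute bit, indexed by bit number. */
static const ex_enc_fn ex_ops[] = { ex_enc_a, ex_enc_b, ex_enc_c, ex_enc_fail };

/*
 * Encode every attribute whose bit is set in the mask, lowest bit
 * first. Returns 0 on success or the first encoder's error status.
 */
static int ex_encode_attr_vals(struct ex_out *out, const uint32_t *mask,
			       size_t nwords)
{
	size_t nops = sizeof(ex_ops) / sizeof(ex_ops[0]);

	assert(out && mask);
	for (size_t bit = 0; bit < nops && bit / 32 < nwords; bit++) {
		int status;

		if (!(mask[bit / 32] & (UINT32_C(1) << (bit % 32))))
			continue;
		status = ex_ops[bit](out);
		if (status != 0)
			return status;
	}
	return 0;
}
```

Keeping the loop free of any length-prefix handling is what makes it reusable: the caller decides whether and how the encoded bytes are framed.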
* [PATCH v7 11/20] nfsd: apply the notify mask to the delegation when requested
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
If the client requests a directory delegation with notifications
enabled, set the appropriate return mask in gddr_notification[0]. This
will ensure the lease acquisition sets the appropriate ignore mask.
Also store the granted mask in the delegation's dl_notify_mask field, so
that the CB_NOTIFY encoder can later tell which notifications the client
was granted.
If the client doesn't set NOTIFY4_GFLAG_EXTEND, then don't offer any
notifications, as nfsd won't provide directory offset information, and
"classic" notifications require it.
Similarly, if the client sets GFLAG_EXTEND | CFLAG_ORDER, then zero out
the notification mask. The Linux server can't provide the necessary
ordering info to those clients.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 21 +++++++++++++++++++++
fs/nfsd/nfs4state.c | 3 ++-
fs/nfsd/state.h | 3 +++
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 3e4de45aa360..565bf76c08ed 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2552,12 +2552,18 @@ nfsd4_verify(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
return status == nfserr_same ? nfs_ok : status;
}
+#define SUPPORTED_NOTIFY_MASK (BIT(NOTIFY4_REMOVE_ENTRY) | \
+ BIT(NOTIFY4_ADD_ENTRY) | \
+ BIT(NOTIFY4_RENAME_ENTRY) | \
+ BIT(NOTIFY4_GFLAG_EXTEND))
+
static __be32
nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
struct nfsd4_compound_state *cstate,
union nfsd4_op_u *u)
{
struct nfsd4_get_dir_delegation *gdd = &u->get_dir_delegation;
+ u32 requested = gdd->gdda_notification_types[0];
struct nfs4_delegation *dd;
struct nfsd_file *nf;
__be32 status;
@@ -2566,6 +2572,21 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
if (status != nfs_ok)
return status;
+ /*
+ * Offer no notifications to an order-aware client. RFC8881bis section
+ * 16.2.13 defines order-aware as NOTIFY4_CFLAG_ORDER being set or
+ * NOTIFY4_GFLAG_EXTEND being reset. Such a client expects cookie and
+ * previous-entry information with its notifications (e.g. 27.4.5), and
+ * nfsd does not track or emit directory offset information. Per
+ * 16.2.11.3 the alternative would be to recall the delegation, so it's
+ * simpler to just decline the notifications here.
+ */
+ if (!(requested & BIT(NOTIFY4_GFLAG_EXTEND)) ||
+ (requested & BIT(NOTIFY4_CFLAG_ORDER)))
+ requested = 0;
+
+ gdd->gddr_notification[0] = requested & SUPPORTED_NOTIFY_MASK;
+
/*
* RFC 8881, section 18.39.3 says:
*
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5a4f0843c2fe..682c00fbd2fb 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -10031,7 +10031,8 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
* NB: gddr_notification[0] represents the notifications that
* will be granted to the client
*/
- fl = nfs4_alloc_init_lease(dp, gdd->gddr_notification[0]);
+ dp->dl_notify_mask = gdd->gddr_notification[0];
+ fl = nfs4_alloc_init_lease(dp, dp->dl_notify_mask);
if (!fl)
goto out_put_stid;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index f8457e0f2b57..7a66048a130c 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -297,6 +297,9 @@ struct nfs4_delegation {
struct timespec64 dl_atime;
struct timespec64 dl_mtime;
struct timespec64 dl_ctime;
+
+ /* For dir delegations */
+ u32 dl_notify_mask;
};
static inline bool deleg_is_read(u32 dl_type)
--
2.54.0
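
The grant policy from the commit message can be written as a single pure function, which makes the three cases easy to check. This is a sketch under stated assumptions: the bit positions below are placeholders chosen for the example, not the protocol values, and ex_granted_notify_mask() and the EX_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BIT(n)	(UINT32_C(1) << (n))

/* Placeholder bit positions; the real NOTIFY4_* values may differ. */
enum {
	EX_CHANGE_CHILD_ATTRS	= 0,
	EX_REMOVE_ENTRY		= 2,
	EX_ADD_ENTRY		= 3,
	EX_RENAME_ENTRY		= 4,
	EX_GFLAG_EXTEND		= 30,
	EX_CFLAG_ORDER		= 31,
};

#define EX_SUPPORTED	(EX_BIT(EX_REMOVE_ENTRY) | EX_BIT(EX_ADD_ENTRY) | \
			 EX_BIT(EX_RENAME_ENTRY) | EX_BIT(EX_GFLAG_EXTEND))

/*
 * Return the notifications to grant. A client that does not set the
 * extend flag, or that asks for ordering, gets none. Anything else is
 * filtered against the supported set.
 */
static uint32_t ex_granted_notify_mask(uint32_t requested)
{
	/* The ordering flag must never be granted back to the client. */
	assert((EX_SUPPORTED & EX_BIT(EX_CFLAG_ORDER)) == 0);

	if (!(requested & EX_BIT(EX_GFLAG_EXTEND)) ||
	    (requested & EX_BIT(EX_CFLAG_ORDER)))
		requested = 0;

	return requested & EX_SUPPORTED;
}
```

Storing the result of such a function once, as the patch does with dl_notify_mask, lets both the lease setup and the later callback encoder work from the same granted set.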
* [PATCH v7 10/20] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Add the necessary parts to accept an fsnotify callback for directory
change events and create a CB_NOTIFY request for each one. When a dir
nfsd_file is created, set a handle_event callback to handle the notification.
Use that to allocate a nfsd_notify_event object and then hand off a
reference to each delegation's CB_NOTIFY. If anything fails along the
way, recall any affected delegations.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Documentation/sunrpc/xdr/nfs4_1.x | 3 +
fs/nfsd/filecache.c | 70 ++++++--
fs/nfsd/nfs4callback.c | 51 +++++-
fs/nfsd/nfs4state.c | 330 +++++++++++++++++++++++++++++++++++---
fs/nfsd/nfs4xdr.c | 117 ++++++++++++++
fs/nfsd/nfs4xdr_gen.c | 12 +-
fs/nfsd/nfs4xdr_gen.h | 9 ++
fs/nfsd/state.h | 20 ++-
fs/nfsd/trace.h | 23 +++
fs/nfsd/xdr4.h | 3 +
10 files changed, 587 insertions(+), 51 deletions(-)
diff --git a/Documentation/sunrpc/xdr/nfs4_1.x b/Documentation/sunrpc/xdr/nfs4_1.x
index 99a831d68da8..a32df1e882e5 100644
--- a/Documentation/sunrpc/xdr/nfs4_1.x
+++ b/Documentation/sunrpc/xdr/nfs4_1.x
@@ -445,6 +445,7 @@ struct notify_remove4 {
notify_entry4 nrm_old_entry;
nfs_cookie4 nrm_old_entry_cookie;
};
+pragma public notify_remove4;
struct notify_add4 {
/*
@@ -458,6 +459,7 @@ struct notify_add4 {
prev_entry4 nad_prev_entry<1>;
bool nad_last_entry;
};
+pragma public notify_add4;
struct notify_attr4 {
notify_entry4 na_changed_entry;
@@ -467,6 +469,7 @@ struct notify_rename4 {
notify_remove4 nrn_old_entry;
notify_add4 nrn_new_entry;
};
+pragma public notify_rename4;
struct notify_verifier4 {
verifier4 nv_old_cookieverf;
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index c5f2c5768324..b9548eb17c77 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -78,6 +78,7 @@ static struct kmem_cache *nfsd_file_mark_slab;
static struct list_lru nfsd_file_lru;
static unsigned long nfsd_file_flags;
static struct fsnotify_group *nfsd_file_fsnotify_group;
+static struct fsnotify_group *nfsd_dir_fsnotify_group;
static struct delayed_work nfsd_filecache_laundrette;
static struct rhltable nfsd_file_rhltable
____cacheline_aligned_in_smp;
@@ -153,7 +154,7 @@ static void
nfsd_file_mark_put(struct nfsd_file_mark *nfm)
{
if (refcount_dec_and_test(&nfm->nfm_ref)) {
- fsnotify_destroy_mark(&nfm->nfm_mark, nfsd_file_fsnotify_group);
+ fsnotify_destroy_mark(&nfm->nfm_mark, nfm->nfm_mark.group);
fsnotify_put_mark(&nfm->nfm_mark);
}
}
@@ -161,35 +162,37 @@ nfsd_file_mark_put(struct nfsd_file_mark *nfm)
static struct nfsd_file_mark *
nfsd_file_mark_find_or_create(struct inode *inode)
{
- int err;
- struct fsnotify_mark *mark;
struct nfsd_file_mark *nfm = NULL, *new;
+ struct fsnotify_group *group;
+ struct fsnotify_mark *mark;
+ int err;
+
+ group = S_ISDIR(inode->i_mode) ? nfsd_dir_fsnotify_group : nfsd_file_fsnotify_group;
do {
- fsnotify_group_lock(nfsd_file_fsnotify_group);
- mark = fsnotify_find_inode_mark(inode,
- nfsd_file_fsnotify_group);
+ fsnotify_group_lock(group);
+ mark = fsnotify_find_inode_mark(inode, group);
if (mark) {
nfm = nfsd_file_mark_get(container_of(mark,
struct nfsd_file_mark,
nfm_mark));
- fsnotify_group_unlock(nfsd_file_fsnotify_group);
+ fsnotify_group_unlock(group);
if (nfm) {
fsnotify_put_mark(mark);
break;
}
/* Avoid soft lockup race with nfsd_file_mark_put() */
- fsnotify_destroy_mark(mark, nfsd_file_fsnotify_group);
+ fsnotify_destroy_mark(mark, group);
fsnotify_put_mark(mark);
} else {
- fsnotify_group_unlock(nfsd_file_fsnotify_group);
+ fsnotify_group_unlock(group);
}
/* allocate a new nfm */
new = kmem_cache_alloc(nfsd_file_mark_slab, GFP_KERNEL);
if (!new)
return NULL;
- fsnotify_init_mark(&new->nfm_mark, nfsd_file_fsnotify_group);
+ fsnotify_init_mark(&new->nfm_mark, group);
new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
refcount_set(&new->nfm_ref, 1);
mutex_init(&new->nfm_recalc_mutex);
@@ -830,12 +833,36 @@ nfsd_file_fsnotify_handle_event(struct fsnotify_mark *mark, u32 mask,
return 0;
}
+#ifdef CONFIG_NFSD_V4
+static int
+nfsd_dir_fsnotify_handle_event(struct fsnotify_group *group, u32 mask,
+ const void *data, int data_type, struct inode *dir,
+ const struct qstr *name, u32 cookie,
+ struct fsnotify_iter_info *iter_info)
+{
+ return nfsd_handle_dir_event(mask, dir, data, data_type, name);
+}
+#else
+static int
+nfsd_dir_fsnotify_handle_event(struct fsnotify_group *group, u32 mask,
+ const void *data, int data_type, struct inode *dir,
+ const struct qstr *name, u32 cookie,
+ struct fsnotify_iter_info *iter_info)
+{
+ return 0;
+}
+#endif
static const struct fsnotify_ops nfsd_file_fsnotify_ops = {
.handle_inode_event = nfsd_file_fsnotify_handle_event,
.free_mark = nfsd_file_mark_free,
};
+static const struct fsnotify_ops nfsd_dir_fsnotify_ops = {
+ .handle_event = nfsd_dir_fsnotify_handle_event,
+ .free_mark = nfsd_file_mark_free,
+};
+
int
nfsd_file_cache_init(void)
{
@@ -887,8 +914,7 @@ nfsd_file_cache_init(void)
goto out_shrinker;
}
- nfsd_file_fsnotify_group = fsnotify_alloc_group(&nfsd_file_fsnotify_ops,
- 0);
+ nfsd_file_fsnotify_group = fsnotify_alloc_group(&nfsd_file_fsnotify_ops, 0);
if (IS_ERR(nfsd_file_fsnotify_group)) {
pr_err("nfsd: unable to create fsnotify group: %ld\n",
PTR_ERR(nfsd_file_fsnotify_group));
@@ -897,11 +923,23 @@ nfsd_file_cache_init(void)
goto out_notifier;
}
+ nfsd_dir_fsnotify_group = fsnotify_alloc_group(&nfsd_dir_fsnotify_ops, 0);
+ if (IS_ERR(nfsd_dir_fsnotify_group)) {
+ pr_err("nfsd: unable to create fsnotify group: %ld\n",
+ PTR_ERR(nfsd_dir_fsnotify_group));
+ ret = PTR_ERR(nfsd_dir_fsnotify_group);
+ nfsd_dir_fsnotify_group = NULL;
+ goto out_notify_group;
+ }
+
INIT_DELAYED_WORK(&nfsd_filecache_laundrette, nfsd_file_gc_worker);
out:
if (ret)
clear_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags);
return ret;
+out_notify_group:
+ fsnotify_put_group(nfsd_file_fsnotify_group);
+ nfsd_file_fsnotify_group = NULL;
out_notifier:
lease_unregister_notifier(&nfsd_file_lease_notifier);
out_shrinker:
@@ -1019,6 +1057,8 @@ nfsd_file_cache_shutdown(void)
rcu_barrier();
fsnotify_put_group(nfsd_file_fsnotify_group);
nfsd_file_fsnotify_group = NULL;
+ fsnotify_put_group(nfsd_dir_fsnotify_group);
+ nfsd_dir_fsnotify_group = NULL;
kmem_cache_destroy(nfsd_file_slab);
nfsd_file_slab = NULL;
fsnotify_wait_marks_destroyed();
@@ -1223,10 +1263,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
open_file:
trace_nfsd_file_alloc(nf);
- if (type == S_IFREG)
- nf->nf_mark = nfsd_file_mark_find_or_create(inode);
-
- if (type != S_IFREG || nf->nf_mark) {
+ nf->nf_mark = nfsd_file_mark_find_or_create(inode);
+ if (nf->nf_mark) {
if (file && (file->f_mode & FMODE_OPENED)) {
get_file(file);
nf->nf_file = file;
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 2df281554abf..71dcb448fa0a 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -896,11 +896,14 @@ static void nfs4_xdr_enc_cb_notify(struct rpc_rqst *req,
const void *data)
{
const struct nfsd4_callback *cb = data;
+ struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
+ struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
struct nfs4_cb_compound_hdr hdr = {
.ident = 0,
.minorversion = cb->cb_clp->cl_minorversion,
};
- struct CB_NOTIFY4args args = { };
+ struct CB_NOTIFY4args args;
+ unsigned int start;
WARN_ON_ONCE(hdr.minorversion == 0);
@@ -908,13 +911,43 @@ static void nfs4_xdr_enc_cb_notify(struct rpc_rqst *req,
encode_cb_sequence4args(xdr, cb, &hdr);
/*
- * FIXME: get stateid and fh from delegation. Inline the cna_changes
- * buffer, and zero it.
+ * nfsd4_cb_notify_prepare() sized the payload against a single page,
+ * but did not account for the compound, sequence, stateid, and
+ * filehandle encoded here. If the variable-length encode overflows the
+ * backchannel send buffer, roll back to before the operation so that a
+ * truncated CB_NOTIFY is never placed on the wire.
*/
- xdrgen_encode_CB_NOTIFY4args(xdr, &args);
+ start = xdr_stream_pos(xdr);
+
+ if (xdr_stream_encode_u32(xdr, OP_CB_NOTIFY) < 0)
+ goto out_err;
+
+ args.cna_stateid.seqid = dp->dl_stid.sc_stateid.si_generation;
+ memcpy(&args.cna_stateid.other, &dp->dl_stid.sc_stateid.si_opaque,
+ ARRAY_SIZE(args.cna_stateid.other));
+ args.cna_fh.len = dp->dl_stid.sc_file->fi_fhandle.fh_size;
+ args.cna_fh.data = dp->dl_stid.sc_file->fi_fhandle.fh_raw;
+ args.cna_changes.count = ncn->ncn_nf_cnt;
+ args.cna_changes.element = ncn->ncn_nf;
+ if (!xdrgen_encode_CB_NOTIFY4args(xdr, &args))
+ goto out_err;
hdr.nops++;
encode_cb_nops(&hdr);
+ return;
+
+out_err:
+ /*
+ * Drop the CB_NOTIFY op and emit a valid CB_SEQUENCE-only compound so
+ * the client still advances its slot. Flag the failure so the done
+ * handler recalls the delegation and the missed notification is not
+ * silently lost. The flag is written here in the transmit path and read
+ * in the done handler; the two are serialized phases of the same
+ * rpc_task, so no additional barrier is needed.
+ */
+ ncn->ncn_encode_err = true;
+ xdr_truncate_encode(xdr, start);
+ encode_cb_nops(&hdr);
}
static int nfs4_xdr_dec_cb_notify(struct rpc_rqst *rqstp,
@@ -1412,6 +1445,16 @@ static void nfsd41_destroy_cb(struct nfsd4_callback *cb)
else
clear_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags);
+ /*
+ * Order the clear of NFSD4_CALLBACK_RUNNING above before the ->release()
+ * callback below. A release op may re-check producer-side state to decide
+ * whether to requeue itself (see nfsd4_cb_notify_release()), and that
+ * check must not be reordered ahead of the clear. The plain clear_bit()
+ * path carries no ordering; clear_and_wake_up_bit() already issues this
+ * barrier internally, so the extra one is harmless there.
+ */
+ smp_mb__after_atomic();
+
if (cb->cb_ops && cb->cb_ops->release)
cb->cb_ops->release(cb);
nfsd41_cb_inflight_end(clp);
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2d82cdd96e12..5a4f0843c2fe 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -55,6 +55,7 @@
#include "netns.h"
#include "pnfs.h"
#include "filecache.h"
+#include "nfs4xdr_gen.h"
#include "trace.h"
#define NFSDDBG_FACILITY NFSDDBG_PROC
@@ -3485,19 +3486,154 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
nfs4_put_stid(&dp->dl_stid);
}
+static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
+{
+ bool queued;
+
+ if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags))
+ return;
+
+ /*
+ * When called from the lease break (nfsd_break_deleg_cb()) the state
+ * code is serialized by the flc_lock and the lease has not been
+ * removed yet, so sc_count is known to be nonzero. The CB_NOTIFY
+ * callback paths reach here from a workqueue without the flc_lock,
+ * where the delegation may already be unhashed with sc_count at zero.
+ * Use refcount_inc_not_zero() so both cases are safe, and bail if the
+ * delegation is already being torn down.
+ */
+ if (!refcount_inc_not_zero(&dp->dl_stid.sc_count)) {
+ clear_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags);
+ return;
+ }
+ queued = nfsd4_run_cb(&dp->dl_recall);
+ WARN_ON_ONCE(!queued);
+ if (!queued) {
+ refcount_dec(&dp->dl_stid.sc_count);
+ clear_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags);
+ }
+}
+
+static bool
+nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
+{
+ struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
+ struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
+ struct nfsd_notify_event *events[NOTIFY4_EVENT_QUEUE_SIZE];
+ struct xdr_buf xdr = { .buflen = PAGE_SIZE * NOTIFY4_PAGE_ARRAY_SIZE,
+ .pages = ncn->ncn_pages };
+ struct xdr_stream stream;
+ struct nfsd_file *nf;
+ int count, i;
+ bool error = false;
+
+ /* Clear any failure recorded by a previous transmit. */
+ ncn->ncn_encode_err = false;
+
+ xdr_init_encode_pages(&stream, &xdr);
+
+ spin_lock(&ncn->ncn_lock);
+ count = ncn->ncn_evt_cnt;
+
+ /* spurious queueing? */
+ if (count == 0) {
+ spin_unlock(&ncn->ncn_lock);
+ return false;
+ }
+
+ /* we can't keep up! */
+ if (count > NOTIFY4_EVENT_QUEUE_SIZE) {
+ spin_unlock(&ncn->ncn_lock);
+ goto out_recall;
+ }
+
+ memcpy(events, ncn->ncn_evt, sizeof(*events) * count);
+ ncn->ncn_evt_cnt = 0;
+ spin_unlock(&ncn->ncn_lock);
+
+ rcu_read_lock();
+ nf = nfsd_file_get(rcu_dereference(dp->dl_stid.sc_file->fi_deleg_file));
+ rcu_read_unlock();
+ if (!nf) {
+ for (i = 0; i < count; ++i)
+ nfsd_notify_event_put(events[i]);
+ goto out_recall;
+ }
+
+ for (i = 0; i < count; ++i) {
+ struct nfsd_notify_event *nne = events[i];
+
+ if (!error) {
+ u32 *maskp = (u32 *)xdr_reserve_space(&stream, sizeof(*maskp));
+ u8 *p;
+
+ if (!maskp) {
+ error = true;
+ goto put_event;
+ }
+
+ p = nfsd4_encode_notify_event(&stream, nne, dp, nf, maskp);
+ if (!p) {
+ pr_notice("Could not generate CB_NOTIFY from fsnotify mask 0x%x\n",
+ nne->ne_mask);
+ error = true;
+ goto put_event;
+ }
+
+ ncn->ncn_nf[i].notify_mask.count = 1;
+ ncn->ncn_nf[i].notify_mask.element = maskp;
+ ncn->ncn_nf[i].notify_vals.data = p;
+ ncn->ncn_nf[i].notify_vals.len = (u8 *)stream.p - p;
+ }
+put_event:
+ nfsd_notify_event_put(nne);
+ }
+ if (!error) {
+ ncn->ncn_nf_cnt = count;
+ nfsd_file_put(nf);
+ return true;
+ }
+ nfsd_file_put(nf);
+out_recall:
+ nfsd_break_one_deleg(dp);
+ return false;
+}
+
static int
nfsd4_cb_notify_done(struct nfsd4_callback *cb,
struct rpc_task *task)
{
+ struct nfsd4_cb_notify *ncn = container_of(cb, struct nfsd4_cb_notify, ncn_cb);
+ struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
+
+ if (dp->dl_stid.sc_status)
+ return 1;
+
+ /*
+ * The CB_NOTIFY op overflowed the send buffer and was dropped from the
+ * compound. The notification is lost, so recall the delegation rather
+ * than leaving the client unaware of the directory change.
+ */
+ if (ncn->ncn_encode_err) {
+ nfsd_break_one_deleg(dp);
+ return 1;
+ }
+
switch (task->tk_status) {
case -NFS4ERR_DELAY:
rpc_delay(task, 2 * HZ);
return 0;
default:
+ /* For any other hard error, recall the deleg */
+ nfsd_break_one_deleg(dp);
+ fallthrough;
+ case 0:
return 1;
}
}
+static void nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn);
+
static void
nfsd4_cb_notify_release(struct nfsd4_callback *cb)
{
@@ -3506,6 +3642,9 @@ nfsd4_cb_notify_release(struct nfsd4_callback *cb)
struct nfs4_delegation *dp =
container_of(ncn, struct nfs4_delegation, dl_cb_notify);
+ /* Drain events that arrived while this callback was in flight */
+ if (READ_ONCE(ncn->ncn_evt_cnt) > 0)
+ nfsd4_run_cb_notify(ncn);
nfs4_put_stid(&dp->dl_stid);
}
@@ -3522,6 +3661,7 @@ static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
};
static const struct nfsd4_callback_ops nfsd4_cb_notify_ops = {
+ .prepare = nfsd4_cb_notify_prepare,
.done = nfsd4_cb_notify_done,
.release = nfsd4_cb_notify_release,
.opcode = OP_CB_NOTIFY,
@@ -5781,29 +5921,6 @@ static const struct nfsd4_callback_ops nfsd4_cb_recall_ops = {
.opcode = OP_CB_RECALL,
};
-static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
-{
- bool queued;
-
- if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags))
- return;
-
- /*
- * We're assuming the state code never drops its reference
- * without first removing the lease. Since we're in this lease
- * callback (and since the lease code is serialized by the
- * flc_lock) we know the server hasn't removed the lease yet, and
- * we know it's safe to take a reference.
- */
- refcount_inc(&dp->dl_stid.sc_count);
- queued = nfsd4_run_cb(&dp->dl_recall);
- WARN_ON_ONCE(!queued);
- if (!queued) {
- refcount_dec(&dp->dl_stid.sc_count);
- clear_bit(NFSD4_CALLBACK_RUNNING, &dp->dl_recall.cb_flags);
- }
-}
-
/* Called from break_lease() with flc_lock held. */
static bool
nfsd_break_deleg_cb(struct file_lease *fl)
@@ -9991,3 +10108,170 @@ void nfsd_update_cmtime_attr(struct file *f, unsigned int flags)
MINOR(inode->i_sb->s_dev),
inode->i_ino, ret);
}
+
+static void
+nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
+{
+ struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
+
+ if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
+ return;
+
+ if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
+ clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
+ else
+ nfsd4_run_cb(&ncn->ncn_cb);
+}
+
+static struct nfsd_notify_event *
+alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry *dentry,
+ struct inode *target)
+{
+ struct nfsd_notify_event *ne;
+ struct name_snapshot newname;
+ u32 newnamelen = 0;
+
+ /*
+ * For a rename, @q is the old name and the live dentry carries the new
+ * name. Snapshot the new name now, while it is guaranteed to describe
+ * this event: the dentry can be renamed again before the CB_NOTIFY work
+ * runs, which would corrupt a late read in nfsd4_encode_notify_event().
+ */
+ if (mask & FS_RENAME) {
+ take_dentry_name_snapshot(&newname, dentry);
+ newnamelen = newname.name.len;
+ }
+
+ ne = kmalloc(struct_size(ne, ne_name, q->len + 1 +
+ (newnamelen ? newnamelen + 1 : 0)), GFP_NOFS);
+ if (!ne)
+ goto out;
+
+ memcpy(ne->ne_name, q->name, q->len);
+ ne->ne_name[q->len] = '\0';
+ ne->ne_namelen = q->len;
+
+ ne->ne_newnamelen = newnamelen;
+ if (newnamelen) {
+ char *p = nfsd_notify_event_newname(ne);
+
+ memcpy(p, newname.name.name, newnamelen);
+ p[newnamelen] = '\0';
+ }
+
+ refcount_set(&ne->ne_ref, 1);
+ ne->ne_mask = mask;
+ ne->ne_dentry = dget(dentry);
+ ne->ne_target = target;
+ if (ne->ne_target)
+ ihold(ne->ne_target);
+out:
+ if (mask & FS_RENAME)
+ release_dentry_name_snapshot(&newname);
+ return ne;
+}
+
+static bool
+should_notify_deleg(u32 mask, struct file_lease *fl)
+{
+ /* Don't notify the client generating the event */
+ if (nfsd_breaker_owns_lease(fl))
+ return false;
+
+ /* Skip if this event wasn't ignored by the lease */
+ if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
+ return false;
+ if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
+ return false;
+ if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
+ return false;
+
+ return true;
+}
+
+static void
+nfsd_recall_all_dir_delegs(const struct inode *dir)
+{
+ struct file_lock_context *ctx = locks_inode_context(dir);
+ struct file_lock_core *flc;
+
+ spin_lock(&ctx->flc_lock);
+ list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
+ struct file_lease *fl = container_of(flc, struct file_lease, c);
+
+ if (fl->fl_lmops == &nfsd_lease_mng_ops)
+ nfsd_break_deleg_cb(fl);
+ }
+ spin_unlock(&ctx->flc_lock);
+}
+
+int
+nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void *data,
+ int data_type, const struct qstr *name)
+{
+ struct dentry *dentry = fsnotify_data_dentry(data, data_type);
+ struct inode *target = fsnotify_data_rename_target(data, data_type);
+ struct file_lock_context *ctx;
+ struct file_lock_core *flc;
+ struct nfsd_notify_event *evt;
+
+ trace_nfsd_handle_dir_event(mask, dir, name);
+
+ /* Normalize cross-dir rename events to create/delete */
+ if (mask & FS_MOVED_FROM) {
+ mask &= ~FS_MOVED_FROM;
+ mask |= FS_DELETE;
+ }
+ if (mask & FS_MOVED_TO) {
+ mask &= ~FS_MOVED_TO;
+ mask |= FS_CREATE;
+ }
+
+ /*
+ * FS_RENAME fires on the source directory even for a cross-dir
+ * rename, where the moved entry now lives under a different parent.
+ * NOTIFY4_RENAME_ENTRY describes an in-place rename, so reporting it
+ * here would advertise a name absent from this directory.
+ */
+ if ((mask & FS_RENAME) && dentry && d_inode(dentry->d_parent) != dir)
+ mask &= ~FS_RENAME;
+
+ /* Don't do anything if this is not an expected event */
+ if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
+ return 0;
+
+ ctx = locks_inode_context(dir);
+ if (!ctx || list_empty(&ctx->flc_lease))
+ return 0;
+
+ evt = alloc_nfsd_notify_event(mask, name, dentry, target);
+ if (!evt) {
+ nfsd_recall_all_dir_delegs(dir);
+ return 0;
+ }
+
+ spin_lock(&ctx->flc_lock);
+ list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
+ struct file_lease *fl = container_of(flc, struct file_lease, c);
+ struct nfs4_delegation *dp = flc->flc_owner;
+ struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
+
+ if (!should_notify_deleg(mask, fl))
+ continue;
+
+ spin_lock(&ncn->ncn_lock);
+ if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
+ /* We're generating notifications too fast. Recall. */
+ spin_unlock(&ncn->ncn_lock);
+ nfsd_break_deleg_cb(fl);
+ continue;
+ }
+ ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
+ spin_unlock(&ncn->ncn_lock);
+
+ nfsd4_run_cb_notify(ncn);
+ }
+ spin_unlock(&ctx->flc_lock);
+ nfsd_notify_event_put(evt);
+ return 0;
+}
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index ad192d25724c..976d60115cd6 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4189,6 +4189,123 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
goto out;
}
+static bool
+nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
+ struct dentry *dentry, struct nfs4_delegation *dp,
+ struct nfsd_file *nf, char *name, u32 namelen)
+{
+ uint32_t *attrmask;
+
+ /* Reserve space for attrmask */
+ attrmask = xdr_reserve_space(xdr, 3 * sizeof(uint32_t));
+ if (!attrmask)
+ return false;
+
+ ne->ne_file.data = name;
+ ne->ne_file.len = namelen;
+ ne->ne_attrs.attrmask.element = attrmask;
+
+ attrmask[0] = 0;
+ attrmask[1] = 0;
+ attrmask[2] = 0;
+ ne->ne_attrs.attr_vals.data = NULL;
+ ne->ne_attrs.attr_vals.len = 0;
+ ne->ne_attrs.attrmask.count = 1;
+ return true;
+}
+
+/**
+ * nfsd4_encode_notify_event - encode a notify event
+ * @xdr: stream to which to encode the event
+ * @nne: nfsd_notify_event to encode
+ * @dp: delegation where the event occurred
+ * @nf: nfsd_file on which event occurred
+ * @notify_mask: pointer to word where notification mask should be set
+ *
+ * Encode @nne into @xdr. The matching bit in @notify_mask is set on
+ * success.
+ *
+ * Return: pointer to the start of the encoded event, or NULL if the
+ * event could not be encoded.
+ */
+u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *nne,
+ struct nfs4_delegation *dp, struct nfsd_file *nf,
+ u32 *notify_mask)
+{
+ u8 *p = NULL;
+
+ *notify_mask = 0;
+
+ if (nne->ne_mask & FS_DELETE) {
+ struct notify_remove4 nr = { };
+
+ if (!nfsd4_setup_notify_entry4(&nr.nrm_old_entry, xdr, nne->ne_dentry, dp,
+ nf, nne->ne_name, nne->ne_namelen))
+ goto out_err;
+ p = (u8 *)xdr->p;
+ if (!xdrgen_encode_notify_remove4(xdr, &nr))
+ goto out_err;
+ *notify_mask |= BIT(NOTIFY4_REMOVE_ENTRY);
+ } else if (nne->ne_mask & FS_CREATE) {
+ struct notify_add4 na = { };
+ struct notify_remove4 old = { };
+
+ if (!nfsd4_setup_notify_entry4(&na.nad_new_entry, xdr, nne->ne_dentry, dp,
+ nf, nne->ne_name, nne->ne_namelen))
+ goto out_err;
+
+ /* If a file was overwritten, report it in nad_old_entry */
+ if (nne->ne_target) {
+ if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
+ NULL, dp, nf,
+ nne->ne_name, nne->ne_namelen))
+ goto out_err;
+ na.nad_old_entry.count = 1;
+ na.nad_old_entry.element = &old;
+ }
+
+ p = (u8 *)xdr->p;
+ if (!xdrgen_encode_notify_add4(xdr, &na))
+ goto out_err;
+
+ *notify_mask |= BIT(NOTIFY4_ADD_ENTRY);
+ } else if (nne->ne_mask & FS_RENAME) {
+ struct notify_rename4 nr = { };
+ struct notify_remove4 old = { };
+ char *newname = nfsd_notify_event_newname(nne);
+
+ /* Don't send any attributes in the old_entry since they're the same in new */
+ if (!nfsd4_setup_notify_entry4(&nr.nrn_old_entry.nrm_old_entry, xdr,
+ NULL, dp, nf, nne->ne_name,
+ nne->ne_namelen))
+ goto out_err;
+
+ if (!nfsd4_setup_notify_entry4(&nr.nrn_new_entry.nad_new_entry, xdr,
+ nne->ne_dentry, dp, nf, newname,
+ nne->ne_newnamelen))
+ goto out_err;
+
+ /* If a file was overwritten, report it in nad_old_entry */
+ if (nne->ne_target) {
+ if (!nfsd4_setup_notify_entry4(&old.nrm_old_entry, xdr,
+ NULL, dp, nf, newname,
+ nne->ne_newnamelen))
+ goto out_err;
+ nr.nrn_new_entry.nad_old_entry.count = 1;
+ nr.nrn_new_entry.nad_old_entry.element = &old;
+ }
+
+ p = (u8 *)xdr->p;
+ if (!xdrgen_encode_notify_rename4(xdr, &nr))
+ goto out_err;
+ *notify_mask |= BIT(NOTIFY4_RENAME_ENTRY);
+ }
+ return p;
+out_err:
+ pr_warn("nfsd: unable to marshal notify event to xdr stream\n");
+ return NULL;
+}
+
static void svcxdr_init_encode_from_buffer(struct xdr_stream *xdr,
struct xdr_buf *buf, __be32 *p, int bytes)
{
diff --git a/fs/nfsd/nfs4xdr_gen.c b/fs/nfsd/nfs4xdr_gen.c
index b1a583a6dfa1..d1240ade120d 100644
--- a/fs/nfsd/nfs4xdr_gen.c
+++ b/fs/nfsd/nfs4xdr_gen.c
@@ -628,7 +628,7 @@ xdrgen_decode_prev_entry4(struct xdr_stream *xdr, struct prev_entry4 *ptr)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_decode_notify_remove4(struct xdr_stream *xdr, struct notify_remove4 *ptr)
{
if (!xdrgen_decode_notify_entry4(xdr, &ptr->nrm_old_entry))
@@ -638,7 +638,7 @@ xdrgen_decode_notify_remove4(struct xdr_stream *xdr, struct notify_remove4 *ptr)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr)
{
if (xdr_stream_decode_u32(xdr, &ptr->nad_old_entry.count) < 0)
@@ -677,7 +677,7 @@ xdrgen_decode_notify_attr4(struct xdr_stream *xdr, struct notify_attr4 *ptr)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_decode_notify_rename4(struct xdr_stream *xdr, struct notify_rename4 *ptr)
{
if (!xdrgen_decode_notify_remove4(xdr, &ptr->nrn_old_entry))
@@ -1050,7 +1050,7 @@ xdrgen_encode_prev_entry4(struct xdr_stream *xdr, const struct prev_entry4 *valu
return true;
}
-static bool __maybe_unused
+bool
xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_remove4 *value)
{
if (!xdrgen_encode_notify_entry4(xdr, &value->nrm_old_entry))
@@ -1060,7 +1060,7 @@ xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_remove4
return true;
}
-static bool __maybe_unused
+bool
xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *value)
{
if (value->nad_old_entry.count > 1)
@@ -1099,7 +1099,7 @@ xdrgen_encode_notify_attr4(struct xdr_stream *xdr, const struct notify_attr4 *va
return true;
}
-static bool __maybe_unused
+bool
xdrgen_encode_notify_rename4(struct xdr_stream *xdr, const struct notify_rename4 *value)
{
if (!xdrgen_encode_notify_remove4(xdr, &value->nrn_old_entry))
diff --git a/fs/nfsd/nfs4xdr_gen.h b/fs/nfsd/nfs4xdr_gen.h
index 0e11cd537a98..c62299bac735 100644
--- a/fs/nfsd/nfs4xdr_gen.h
+++ b/fs/nfsd/nfs4xdr_gen.h
@@ -32,6 +32,15 @@ bool xdrgen_decode_posixaceperm4(struct xdr_stream *xdr, posixaceperm4 *ptr);
bool xdrgen_encode_posixaceperm4(struct xdr_stream *xdr, const posixaceperm4 value);
+bool xdrgen_decode_notify_remove4(struct xdr_stream *xdr, struct notify_remove4 *ptr);
+bool xdrgen_encode_notify_remove4(struct xdr_stream *xdr, const struct notify_remove4 *value);
+
+bool xdrgen_decode_notify_add4(struct xdr_stream *xdr, struct notify_add4 *ptr);
+bool xdrgen_encode_notify_add4(struct xdr_stream *xdr, const struct notify_add4 *value);
+
+bool xdrgen_decode_notify_rename4(struct xdr_stream *xdr, struct notify_rename4 *ptr);
+bool xdrgen_encode_notify_rename4(struct xdr_stream *xdr, const struct notify_rename4 *value);
+
bool xdrgen_decode_CB_NOTIFY4args(struct xdr_stream *xdr, struct CB_NOTIFY4args *ptr);
bool xdrgen_encode_CB_NOTIFY4args(struct xdr_stream *xdr, const struct CB_NOTIFY4args *value);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index ac9dd798ea22..f8457e0f2b57 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -201,10 +201,23 @@ struct nfsd_notify_event {
refcount_t ne_ref; // refcount
u32 ne_mask; // FS_* mask from fsnotify callback
struct dentry *ne_dentry; // dentry reference to target
- u32 ne_namelen; // length of ne_name
- char ne_name[]; // name of dentry being changed
+ struct inode *ne_target; // inode overwritten by rename, or NULL
+ u32 ne_namelen; // length of ne_name (old name for a rename)
+ u32 ne_newnamelen; // length of new name (rename only), else 0
+ char ne_name[]; // entry name, then new name (rename only)
};
+/*
+ * For a rename, the new name is snapshotted at event-alloc time and stored
+ * immediately after the (NUL-terminated) old name in ne_name[]. ne_dentry can
+ * be renamed again before the CB_NOTIFY work runs, so the new name must not be
+ * read from the live dentry at encode time.
+ */
+static inline char *nfsd_notify_event_newname(struct nfsd_notify_event *ne)
+{
+ return ne->ne_name + ne->ne_namelen + 1;
+}
+
static inline struct nfsd_notify_event *nfsd_notify_event_get(struct nfsd_notify_event *ne)
{
refcount_inc(&ne->ne_ref);
@@ -214,6 +227,7 @@ static inline struct nfsd_notify_event *nfsd_notify_event_get(struct nfsd_notify
static inline void nfsd_notify_event_put(struct nfsd_notify_event *ne)
{
if (refcount_dec_and_test(&ne->ne_ref)) {
+ iput(ne->ne_target);
dput(ne->ne_dentry);
kfree(ne);
}
@@ -901,6 +915,8 @@ void nfsd_update_cmtime_attr(struct file *f, unsigned int flags);
extern struct nfs4_client_reclaim *nfs4_client_to_reclaim(struct xdr_netobj name,
struct xdr_netobj princhash, struct nfsd_net *nn);
extern bool nfs4_has_reclaimed_state(struct xdr_netobj name, struct nfsd_net *nn);
+int nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void *data,
+ int data_type, const struct qstr *name);
void put_nfs4_file(struct nfs4_file *fi);
extern void nfs4_put_cpntf_state(struct nfsd_net *nn,
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 171e8fdbafb6..db0a0dc70660 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -12,6 +12,7 @@
#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/xprt.h>
#include <trace/misc/fs.h>
+#include <trace/misc/fsnotify.h>
#include <trace/misc/nfs.h>
#include <trace/misc/sunrpc.h>
@@ -1377,6 +1378,28 @@ TRACE_EVENT(nfsd_file_fsnotify_handle_event,
__entry->nlink, __entry->mode, __entry->mask)
);
+TRACE_EVENT(nfsd_handle_dir_event,
+ TP_PROTO(u32 mask, const struct inode *dir, const struct qstr *name),
+ TP_ARGS(mask, dir, name),
+ TP_STRUCT__entry(
+ __field(u32, mask)
+ __field(dev_t, s_dev)
+ __field(u64, i_ino)
+ __string_len(name, name ? name->name : NULL,
+ name ? name->len : 0)
+ ),
+ TP_fast_assign(
+ __entry->mask = mask;
+ __entry->s_dev = dir ? dir->i_sb->s_dev : 0;
+ __entry->i_ino = dir ? dir->i_ino : 0;
+ __assign_str(name);
+ ),
+ TP_printk("inode=0x%x:0x%x:0x%llx mask=%s name=%s",
+ MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+ __entry->i_ino, show_fsnotify_mask(__entry->mask),
+ __get_str(name))
+);
+
DECLARE_EVENT_CLASS(nfsd_file_gc_class,
TP_PROTO(
const struct nfsd_file *nf
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 85574b2a139a..62ac790428be 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -970,6 +970,9 @@ __be32 nfsd4_encode_fattr_to_buf(__be32 **p, int words,
struct svc_fh *fhp, struct svc_export *exp,
struct dentry *dentry,
u32 *bmval, struct svc_rqst *, int ignore_crossmnt);
+u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *nne,
+ struct nfs4_delegation *dp, struct nfsd_file *nf,
+ u32 *notify_mask);
extern __be32 nfsd4_setclientid(struct svc_rqst *rqstp,
struct nfsd4_compound_state *, union nfsd4_op_u *u);
extern __be32 nfsd4_setclientid_confirm(struct svc_rqst *rqstp,
--
2.54.0
* [PATCH v7 09/20] nfsd: add data structures for handling CB_NOTIFY
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Add the data structures, allocation helpers, and callback operations
needed for directory delegation CB_NOTIFY support:

- struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
- struct nfsd4_cb_notify: per-delegation state for notification handling
- Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
  delegation is either a regular file delegation or a directory
  delegation, never both

Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
allocates the page array and notify4 buffer needed for CB_NOTIFY
encoding.

Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
be filled in when the notification path is wired up.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4state.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++------
fs/nfsd/state.h | 47 +++++++++++++++++++-
2 files changed, 152 insertions(+), 16 deletions(-)
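
For readers following along, here is a rough userspace sketch of the
queueing scheme these structures appear intended to support: a refcounted
event with its name stored in a flexible array, and a small fixed-size
queue that a producer fills and a consumer drains in one step. It is not
kernel code. The names (struct event, struct queue, queue_push,
queue_drain) are made up for illustration, a plain int stands in for
refcount_t, and locking is omitted entirely (the patch protects the queue
with ncn_lock).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define QUEUE_SIZE 3	/* mirrors NOTIFY4_EVENT_QUEUE_SIZE */

/* Hypothetical stand-in for struct nfsd_notify_event. */
struct event {
	int ref;		/* plain counter; the kernel uses refcount_t */
	unsigned int mask;	/* event mask */
	size_t namelen;		/* length of name, excluding the NUL */
	char name[];		/* NUL-terminated entry name */
};

/* Hypothetical stand-in for the queue part of struct nfsd4_cb_notify. */
struct queue {
	int cnt;
	struct event *evt[QUEUE_SIZE];
};

/* Allocate an event holding one reference and a private copy of the name. */
static struct event *event_alloc(unsigned int mask, const char *name)
{
	size_t len = strlen(name);
	struct event *e = malloc(sizeof(*e) + len + 1);

	if (!e)
		return NULL;
	e->ref = 1;
	e->mask = mask;
	e->namelen = len;
	memcpy(e->name, name, len + 1);
	return e;
}

static struct event *event_get(struct event *e)
{
	e->ref++;
	return e;
}

/* Drop one reference; returns true if this was the last one. */
static bool event_put(struct event *e)
{
	if (--e->ref == 0) {
		free(e);
		return true;
	}
	return false;
}

/* Producer side: take a reference and queue it, or report a full queue. */
static bool queue_push(struct queue *q, struct event *e)
{
	if (q->cnt >= QUEUE_SIZE)
		return false;
	q->evt[q->cnt++] = event_get(e);
	return true;
}

/* Consumer side: move every queued event to @out and empty the queue. */
static int queue_drain(struct queue *q, struct event **out)
{
	int n = q->cnt;

	memcpy(out, q->evt, sizeof(*out) * n);
	q->cnt = 0;
	return n;
}
```

In this sketch the caller owns the references returned by queue_drain()
and must drop each one with event_put(). A false return from queue_push()
corresponds to the "generating notifications too fast" case, where the
later patch in this series recalls the delegation instead of queueing.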
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 47af5729a86f..2d82cdd96e12 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -126,6 +126,7 @@ static void free_session(struct nfsd4_session *);
static const struct nfsd4_callback_ops nfsd4_cb_recall_ops;
static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops;
static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops;
+static const struct nfsd4_callback_ops nfsd4_cb_notify_ops;
static struct workqueue_struct *laundry_wq;
@@ -1123,29 +1124,31 @@ static void block_delegations(struct knfsd_fh *fh)
}
static struct nfs4_delegation *
-alloc_init_deleg(struct nfs4_client *clp, struct nfs4_file *fp,
- struct nfs4_clnt_odstate *odstate, u32 dl_type)
+__alloc_init_deleg(struct nfs4_client *clp, struct nfs4_file *fp,
+ struct nfs4_clnt_odstate *odstate, u32 dl_type,
+ void (*sc_free)(struct nfs4_stid *))
{
struct nfs4_delegation *dp;
struct nfs4_stid *stid;
long n;
- dprintk("NFSD alloc_init_deleg\n");
+ if (delegation_blocked(&fp->fi_fhandle))
+ return NULL;
+
n = atomic_long_inc_return(&num_delegations);
if (n < 0 || n > max_delegations)
goto out_dec;
- if (delegation_blocked(&fp->fi_fhandle))
- goto out_dec;
- stid = nfs4_alloc_stid(clp, deleg_slab, nfs4_free_deleg);
+
+ stid = nfs4_alloc_stid(clp, deleg_slab, sc_free);
if (stid == NULL)
goto out_dec;
- dp = delegstateid(stid);
/*
* delegation seqid's are never incremented. The 4.1 special
* meaning of seqid 0 isn't meaningful, really, but let's avoid
- * 0 anyway just for consistency and use 1:
+ * 0 anyway just for consistency and use 1.
*/
+ dp = delegstateid(stid);
dp->dl_stid.sc_stateid.si_generation = 1;
INIT_LIST_HEAD(&dp->dl_perfile);
INIT_LIST_HEAD(&dp->dl_perclnt);
@@ -1155,19 +1158,79 @@ alloc_init_deleg(struct nfs4_client *clp, struct nfs4_file *fp,
dp->dl_type = dl_type;
dp->dl_retries = 1;
dp->dl_recalled = false;
- nfsd4_init_cb(&dp->dl_recall, dp->dl_stid.sc_client,
- &nfsd4_cb_recall_ops, NFSPROC4_CLNT_CB_RECALL);
- nfsd4_init_cb(&dp->dl_cb_fattr.ncf_getattr, dp->dl_stid.sc_client,
- &nfsd4_cb_getattr_ops, NFSPROC4_CLNT_CB_GETATTR);
- dp->dl_cb_fattr.ncf_file_modified = false;
get_nfs4_file(fp);
dp->dl_stid.sc_file = fp;
+ nfsd4_init_cb(&dp->dl_recall, dp->dl_stid.sc_client,
+ &nfsd4_cb_recall_ops, NFSPROC4_CLNT_CB_RECALL);
return dp;
out_dec:
atomic_long_dec(&num_delegations);
return NULL;
}
+static struct nfs4_delegation *
+alloc_init_deleg(struct nfs4_client *clp, struct nfs4_file *fp,
+ struct nfs4_clnt_odstate *odstate, u32 dl_type)
+{
+ struct nfs4_delegation *dp;
+
+ dp = __alloc_init_deleg(clp, fp, odstate, dl_type, nfs4_free_deleg);
+ if (!dp)
+ return NULL;
+
+ nfsd4_init_cb(&dp->dl_cb_fattr.ncf_getattr, dp->dl_stid.sc_client,
+ &nfsd4_cb_getattr_ops, NFSPROC4_CLNT_CB_GETATTR);
+ dp->dl_cb_fattr.ncf_file_modified = false;
+ return dp;
+}
+
+static void nfs4_free_dir_deleg(struct nfs4_stid *stid)
+{
+ struct nfs4_delegation *dp = delegstateid(stid);
+ struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
+ int i;
+
+ for (i = 0; i < ncn->ncn_evt_cnt; ++i)
+ nfsd_notify_event_put(ncn->ncn_evt[i]);
+ kfree(ncn->ncn_nf);
+ for (i = 0; i < NOTIFY4_PAGE_ARRAY_SIZE; i++) {
+ if (!ncn->ncn_pages[i])
+ break;
+ put_page(ncn->ncn_pages[i]);
+ }
+ nfs4_free_deleg(stid);
+}
+
+static struct nfs4_delegation *
+alloc_init_dir_deleg(struct nfs4_client *clp, struct nfs4_file *fp)
+{
+ struct nfs4_delegation *dp;
+ struct nfsd4_cb_notify *ncn;
+ int npages;
+
+ dp = __alloc_init_deleg(clp, fp, NULL, NFS4_OPEN_DELEGATE_READ, nfs4_free_dir_deleg);
+ if (!dp)
+ return NULL;
+
+ ncn = &dp->dl_cb_notify;
+
+ npages = alloc_pages_bulk(GFP_KERNEL, NOTIFY4_PAGE_ARRAY_SIZE, ncn->ncn_pages);
+ if (npages != NOTIFY4_PAGE_ARRAY_SIZE) {
+ nfs4_put_stid(&dp->dl_stid);
+ return NULL;
+ }
+
+ ncn->ncn_nf = kcalloc(NOTIFY4_EVENT_QUEUE_SIZE, sizeof(*ncn->ncn_nf), GFP_KERNEL);
+ if (!ncn->ncn_nf) {
+ nfs4_put_stid(&dp->dl_stid);
+ return NULL;
+ }
+ spin_lock_init(&ncn->ncn_lock);
+ nfsd4_init_cb(&ncn->ncn_cb, dp->dl_stid.sc_client,
+ &nfsd4_cb_notify_ops, NFSPROC4_CLNT_CB_NOTIFY);
+ return dp;
+}
+
void
nfs4_put_stid(struct nfs4_stid *s)
{
@@ -3422,6 +3485,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
nfs4_put_stid(&dp->dl_stid);
}
+static int
+nfsd4_cb_notify_done(struct nfsd4_callback *cb,
+ struct rpc_task *task)
+{
+ switch (task->tk_status) {
+ case -NFS4ERR_DELAY:
+ rpc_delay(task, 2 * HZ);
+ return 0;
+ default:
+ return 1;
+ }
+}
+
+static void
+nfsd4_cb_notify_release(struct nfsd4_callback *cb)
+{
+ struct nfsd4_cb_notify *ncn =
+ container_of(cb, struct nfsd4_cb_notify, ncn_cb);
+ struct nfs4_delegation *dp =
+ container_of(ncn, struct nfs4_delegation, dl_cb_notify);
+
+ nfs4_put_stid(&dp->dl_stid);
+}
+
static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
.done = nfsd4_cb_recall_any_done,
.release = nfsd4_cb_recall_any_release,
@@ -3434,6 +3521,12 @@ static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
.opcode = OP_CB_GETATTR,
};
+static const struct nfsd4_callback_ops nfsd4_cb_notify_ops = {
+ .done = nfsd4_cb_notify_done,
+ .release = nfsd4_cb_notify_release,
+ .opcode = OP_CB_NOTIFY,
+};
+
static void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
{
struct nfs4_delegation *dp =
@@ -9810,7 +9903,7 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
/* Try to set up the lease */
status = -ENOMEM;
- dp = alloc_init_deleg(clp, fp, NULL, NFS4_OPEN_DELEGATE_READ);
+ dp = alloc_init_dir_deleg(clp, fp);
if (!dp)
goto out_delegees;
if (cstate->current_fh.fh_export)
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 4fca0537ca8b..ac9dd798ea22 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -197,6 +197,45 @@ struct nfs4_cb_fattr {
#define NOTIFY4_EVENT_QUEUE_SIZE 3
#define NOTIFY4_PAGE_ARRAY_SIZE 1
+struct nfsd_notify_event {
+ refcount_t ne_ref; // refcount
+ u32 ne_mask; // FS_* mask from fsnotify callback
+ struct dentry *ne_dentry; // dentry reference to target
+ u32 ne_namelen; // length of ne_name
+ char ne_name[]; // name of dentry being changed
+};
+
+static inline struct nfsd_notify_event *nfsd_notify_event_get(struct nfsd_notify_event *ne)
+{
+ refcount_inc(&ne->ne_ref);
+ return ne;
+}
+
+static inline void nfsd_notify_event_put(struct nfsd_notify_event *ne)
+{
+ if (refcount_dec_and_test(&ne->ne_ref)) {
+ dput(ne->ne_dentry);
+ kfree(ne);
+ }
+}
+
+/*
+ * Represents a directory delegation. The callback is for handling CB_NOTIFYs.
+ * As notifications from fsnotify come in, allocate a new event, take the ncn_lock,
+ * and add it to the ncn_evt queue. The CB_NOTIFY prepare handler will take the
+ * lock, clean out the list and process it.
+ */
+struct nfsd4_cb_notify {
+ spinlock_t ncn_lock; // protects the evt queue and count
+ int ncn_evt_cnt; // count of events in ncn_evt
+ int ncn_nf_cnt; // count of valid entries in ncn_nf
+ struct nfsd_notify_event *ncn_evt[NOTIFY4_EVENT_QUEUE_SIZE]; // list of events
+ struct page *ncn_pages[NOTIFY4_PAGE_ARRAY_SIZE]; // for encoding
+ struct notify4 *ncn_nf; // array of notify4's to be sent
+ bool ncn_encode_err; // did encoding fail?
+ struct nfsd4_callback ncn_cb; // notify4 callback
+};
+
/*
* Represents a delegation stateid. The nfs4_client holds references to these
* and they are put when it is being destroyed or when the delegation is
@@ -233,8 +272,12 @@ struct nfs4_delegation {
bool dl_written;
bool dl_setattr;
- /* for CB_GETATTR */
- struct nfs4_cb_fattr dl_cb_fattr;
+ union {
+ /* for CB_GETATTR */
+ struct nfs4_cb_fattr dl_cb_fattr;
+ /* for CB_NOTIFY */
+ struct nfsd4_cb_notify dl_cb_notify;
+ };
/* For delegated timestamps */
struct timespec64 dl_atime;
--
2.54.0
* [PATCH v7 08/20] nfsd: use RCU to protect fi_deleg_file
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
fi_deleg_file can be NULLed by put_deleg_file() when fi_delegees drops
to zero during delegation teardown (e.g. DELEGRETURN). Concurrent
accesses from workqueue callbacks -- such as CB_NOTIFY -- can
dereference a NULL pointer if they race with this teardown.

Annotate fi_deleg_file with __rcu and convert all accessors to use
proper RCU primitives:

- rcu_assign_pointer() / RCU_INIT_POINTER() for stores
- rcu_dereference_protected() for reads under fi_lock or where
  fi_delegees > 0 guarantees stability

This prepares for a subsequent patch that will use rcu_read_lock +
rcu_dereference + nfsd_file_get to safely acquire a reference from
the CB_NOTIFY callback path without holding fi_lock.

While converting the error-path lease teardown in nfsd_get_dir_deleg(),
also add a nfsd_fsnotify_recalc_mask() call after dropping the lease, to
match the success path and the equivalent teardown in
nfs4_unlock_deleg_lease(). Without it, a failure after the lease is set
leaves the inode's fsnotify mask reflecting a delegation that no longer
exists.

That teardown already unlocks against fi_deleg_file->nf_file rather than
this client's nf->nf_file; document why. The lease's flc_file is set to
fi_deleg_file in nfs4_alloc_init_lease(), which differs from nf when an
earlier client already holds a delegation on the same directory, and
generic_delete_lease() matches on flc_file -- unlocking the wrong file
would leak the lease on the inode.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4layouts.c | 7 ++++---
fs/nfsd/nfs4state.c | 51 ++++++++++++++++++++++++++++++++++-----------------
fs/nfsd/state.h | 2 +-
3 files changed, 39 insertions(+), 21 deletions(-)
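
As an aside, the "load the pointer, then try to take a reference" step
that the CB_NOTIFY path relies on can be sketched in userspace C11 as
below. This is only an illustration under stated assumptions: it models
the reference acquisition with atomics and does not model RCU grace
periods at all, so it says nothing about when the object may be freed.
The names (struct file_obj, ref_inc_not_zero, file_obj_tryget) are
hypothetical, and the sketch assumes the getter refuses both a NULL
pointer and an object whose count has already reached zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted file object. */
struct file_obj {
	atomic_int ref;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool ref_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Read the published pointer once and try to pin the object. Returns
 * NULL if the slot was cleared or the object is already being torn down.
 */
static struct file_obj *file_obj_tryget(_Atomic(struct file_obj *) *slot)
{
	struct file_obj *f = atomic_load(slot);

	if (f && ref_inc_not_zero(&f->ref))
		return f;
	return NULL;
}
```

The point of the pattern is that a reader never assumes the pointer is
non-NULL or that the object is still live; both outcomes collapse into a
NULL return that the caller has to handle, which is what the prepare
handler in the later patch does when it fails to get the file.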
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 4c3f253c7d07..22bcb6d09f70 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -248,12 +248,13 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
NFSPROC4_CLNT_CB_LAYOUT);
if (parent->sc_type == SC_TYPE_DELEG) {
- spin_lock(&fp->fi_lock);
- ls->ls_file = nfsd_file_get(fp->fi_deleg_file);
- spin_unlock(&fp->fi_lock);
+ rcu_read_lock();
+ ls->ls_file = nfsd_file_get(rcu_dereference(fp->fi_deleg_file));
+ rcu_read_unlock();
} else {
ls->ls_file = find_any_file(fp);
}
+
if (!ls->ls_file) {
nfs4_put_stid(stp);
return NULL;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2189d8d360af..47af5729a86f 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1212,7 +1212,9 @@ static void put_deleg_file(struct nfs4_file *fp)
spin_lock(&fp->fi_lock);
if (--fp->fi_delegees == 0) {
- swap(nf, fp->fi_deleg_file);
+ nf = rcu_dereference_protected(fp->fi_deleg_file,
+ lockdep_is_held(&fp->fi_lock));
+ RCU_INIT_POINTER(fp->fi_deleg_file, NULL);
swap(rnf, fp->fi_rdeleg_file);
}
spin_unlock(&fp->fi_lock);
@@ -1250,7 +1252,7 @@ static void nfsd4_finalize_deleg_timestamps(struct nfs4_delegation *dp, struct f
static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
{
struct nfs4_file *fp = dp->dl_stid.sc_file;
- struct nfsd_file *nf = fp->fi_deleg_file;
+ struct nfsd_file *nf = rcu_dereference_protected(fp->fi_deleg_file, 1);
WARN_ON_ONCE(!fp->fi_delegees);
@@ -3200,7 +3202,8 @@ static int nfs4_show_deleg(struct seq_file *s, struct nfs4_stid *st)
/* XXX: lease time, whether it's being recalled. */
spin_lock(&nf->fi_lock);
- file = nf->fi_deleg_file;
+ file = rcu_dereference_protected(nf->fi_deleg_file,
+ lockdep_is_held(&nf->fi_lock));
if (file) {
seq_puts(s, ", ");
nfs4_show_superblock(s, file);
@@ -5009,7 +5012,7 @@ static void nfsd4_file_init(const struct svc_fh *fh, struct nfs4_file *fp)
INIT_LIST_HEAD(&fp->fi_delegations);
INIT_LIST_HEAD(&fp->fi_clnt_odstate);
fh_copy_shallow(&fp->fi_fhandle, &fh->fh_handle);
- fp->fi_deleg_file = NULL;
+ RCU_INIT_POINTER(fp->fi_deleg_file, NULL);
fp->fi_rdeleg_file = NULL;
fp->fi_had_conflict = false;
fp->fi_share_deny = 0;
@@ -6163,7 +6166,7 @@ static struct file_lease *nfs4_alloc_init_lease(struct nfs4_delegation *dp, u32
fl->c.flc_type = deleg_is_read(dp->dl_type) ? F_RDLCK : F_WRLCK;
fl->c.flc_owner = (fl_owner_t)dp;
fl->c.flc_pid = current->tgid;
- fl->c.flc_file = dp->dl_stid.sc_file->fi_deleg_file->nf_file;
+ fl->c.flc_file = rcu_dereference_protected(dp->dl_stid.sc_file->fi_deleg_file, 1)->nf_file;
return fl;
}
@@ -6171,7 +6174,7 @@ static int nfsd4_check_conflicting_opens(struct nfs4_client *clp,
struct nfs4_file *fp)
{
struct nfs4_ol_stateid *st;
- struct file *f = fp->fi_deleg_file->nf_file;
+ struct file *f = rcu_dereference_protected(fp->fi_deleg_file, 1)->nf_file;
struct inode *ino = file_inode(f);
int writes;
@@ -6248,7 +6251,7 @@ nfsd4_verify_deleg_dentry(struct nfsd4_open *open, struct nfs4_file *fp,
exp_put(exp);
dput(child);
- if (child != file_dentry(fp->fi_deleg_file->nf_file))
+ if (child != file_dentry(rcu_dereference_protected(fp->fi_deleg_file, 1)->nf_file))
return -EAGAIN;
return 0;
@@ -6354,8 +6357,9 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
status = -EAGAIN;
else if (nfsd4_verify_setuid_write(open, nf))
status = -EAGAIN;
- else if (!fp->fi_deleg_file) {
- fp->fi_deleg_file = nf;
+ else if (!rcu_dereference_protected(fp->fi_deleg_file,
+ lockdep_is_held(&fp->fi_lock))) {
+ rcu_assign_pointer(fp->fi_deleg_file, nf);
/* increment early to prevent fi_deleg_file from being
* cleared */
fp->fi_delegees = 1;
@@ -6380,7 +6384,7 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
if (!fl)
goto out_clnt_odstate;
- status = kernel_setlease(fp->fi_deleg_file->nf_file,
+ status = kernel_setlease(rcu_dereference_protected(fp->fi_deleg_file, 1)->nf_file,
fl->c.flc_type, &fl, NULL);
if (fl)
locks_free_lease(fl);
@@ -6401,7 +6405,7 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
* Now that the deleg is set, check again to ensure that nothing
* raced in and changed the mode while we weren't looking.
*/
- status = nfsd4_verify_setuid_write(open, fp->fi_deleg_file);
+ status = nfsd4_verify_setuid_write(open, rcu_dereference_protected(fp->fi_deleg_file, 1));
if (status)
goto out_unlock;
@@ -6422,7 +6426,8 @@ nfs4_set_delegation(struct nfsd4_open *open, struct nfs4_ol_stateid *stp,
return dp;
out_unlock:
- kernel_setlease(fp->fi_deleg_file->nf_file, F_UNLCK, NULL, (void **)&dp);
+ kernel_setlease(rcu_dereference_protected(fp->fi_deleg_file, 1)->nf_file,
+ F_UNLCK, NULL, (void **)&dp);
out_clnt_odstate:
put_clnt_odstate(dp->dl_clnt_odstate);
nfs4_put_stid(&dp->dl_stid);
@@ -6579,8 +6584,9 @@ nfs4_open_delegation(struct svc_rqst *rqstp, struct nfsd4_open *open,
memcpy(&open->op_delegate_stateid, &dp->dl_stid.sc_stateid, sizeof(dp->dl_stid.sc_stateid));
if (open->op_share_access & NFS4_SHARE_ACCESS_WRITE) {
- struct file *f = dp->dl_stid.sc_file->fi_deleg_file->nf_file;
+ struct file *f;
+ f = rcu_dereference_protected(dp->dl_stid.sc_file->fi_deleg_file, 1)->nf_file;
if (!nfsd4_add_rdaccess_to_wrdeleg(rqstp, open, fh, stp) ||
!nfs4_delegation_stat(dp, currentfh, &stat)) {
nfs4_put_stid(&dp->dl_stid);
@@ -9787,8 +9793,9 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
/* existing delegation? */
if (nfs4_delegation_exists(clp, fp)) {
status = -EAGAIN;
- } else if (!fp->fi_deleg_file) {
- fp->fi_deleg_file = nfsd_file_get(nf);
+ } else if (!rcu_dereference_protected(fp->fi_deleg_file,
+ lockdep_is_held(&fp->fi_lock))) {
+ rcu_assign_pointer(fp->fi_deleg_file, nfsd_file_get(nf));
fp->fi_delegees = 1;
} else {
++fp->fi_delegees;
@@ -9844,8 +9851,18 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
return dp;
}
- /* Something failed. Drop the lease and clean up the stid */
- kernel_setlease(fp->fi_deleg_file->nf_file, F_UNLCK, NULL, (void **)&dp);
+ /*
+ * Something failed after the lease was set. Drop the lease and clean
+ * up the stid. The lease's flc_file is the fi_deleg_file (see
+ * nfs4_alloc_init_lease()), which is not necessarily this client's
+ * @nf when an earlier client already holds a delegation on @fp.
+ * generic_delete_lease() matches on flc_file, so unlock against
+ * fi_deleg_file or the lease will be leaked (and later freed with the
+ * stid, leading to a use-after-free when it's eventually broken).
+ */
+ kernel_setlease(rcu_dereference_protected(fp->fi_deleg_file, 1)->nf_file,
+ F_UNLCK, NULL, (void **)&dp);
+ nfsd_fsnotify_recalc_mask(nf);
out_put_stid:
nfs4_put_stid(&dp->dl_stid);
out_delegees:
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 9f321e9ed76d..4fca0537ca8b 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -699,7 +699,7 @@ struct nfs4_file {
*/
atomic_t fi_access[2];
u32 fi_share_deny;
- struct nfsd_file *fi_deleg_file;
+ struct nfsd_file __rcu *fi_deleg_file;
struct nfsd_file *fi_rdeleg_file;
int fi_delegees;
struct knfsd_fh fi_fhandle;
--
2.54.0
* [PATCH v7 07/20] nfsd: add callback encoding and decoding linkages for CB_NOTIFY
From: Jeff Layton @ 2026-06-16 11:58 UTC (permalink / raw)
To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
Chuck Lever
Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>
Add routines for encoding and decoding CB_NOTIFY messages. These call
into the code generated by xdrgen to do the actual encoding and
decoding.
For now, the encoder is a stub. Later patches will flesh out the payload
encoding.
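As a rough check on the single-page limit noted in this patch's FIXME,
the event-queue term of NFS4_enc_cb_notify_sz can be worked out by hand.
The sketch below is standalone userspace arithmetic, not kernel code.
NOTIFY4_EVENT_QUEUE_SIZE comes from the patch; the NFS4_OPAQUE_LIMIT
value of 1024, the 4-byte XDR word, and the 4096-byte page are
assumptions made here for illustration, and the helper names are
hypothetical. The compound header, sequence, stateid and filehandle
terms are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY4_EVENT_QUEUE_SIZE 3	/* value from the patch */
#define NFS4_OPAQUE_LIMIT 1024		/* assumed value, in bytes */
#define XDR_WORD_BYTES 4		/* assumed size of one XDR word */
#define ASSUMED_PAGE_BYTES 4096		/* assumed page size */

static_assert(NOTIFY4_EVENT_QUEUE_SIZE > 0, "queue must hold an event");

/* Event-queue term of the macro, counted in XDR words. */
static size_t notify_events_words(void)
{
	return NOTIFY4_EVENT_QUEUE_SIZE * (2 + (NFS4_OPAQUE_LIMIT >> 2));
}

/* Convert a word count to bytes. */
static size_t words_to_bytes(size_t words)
{
	return words * XDR_WORD_BYTES;
}

/* Nonzero when the event-queue term alone fits in the assumed page. */
static int events_fit_in_page(void)
{
	return words_to_bytes(notify_events_words()) <= ASSUMED_PAGE_BYTES;
}
```

Under these assumptions the event term is 3 * (2 + 256) = 774 words, or
3096 bytes, which leaves roughly 1000 bytes of a 4096-byte page for the
remaining terms of the macro.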
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Documentation/sunrpc/xdr/nfs4_1.x | 1 +
fs/nfsd/nfs4callback.c | 46 +++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfs4xdr_gen.c | 4 ++--
fs/nfsd/nfs4xdr_gen.h | 3 +++
fs/nfsd/state.h | 8 +++++++
fs/nfsd/xdr4cb.h | 12 ++++++++++
6 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/Documentation/sunrpc/xdr/nfs4_1.x b/Documentation/sunrpc/xdr/nfs4_1.x
index 4c3842e23859..99a831d68da8 100644
--- a/Documentation/sunrpc/xdr/nfs4_1.x
+++ b/Documentation/sunrpc/xdr/nfs4_1.x
@@ -490,6 +490,7 @@ struct CB_NOTIFY4args {
nfs_fh4 cna_fh;
notify4 cna_changes<>;
};
+pragma public CB_NOTIFY4args;
struct CB_NOTIFY4res {
nfsstat4 cnr_status;
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index c7f914eab3b0..2df281554abf 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -891,6 +891,51 @@ static void encode_stateowner(struct xdr_stream *xdr, struct nfs4_stateowner *so
xdr_encode_opaque(p, so->so_owner.data, so->so_owner.len);
}
+static void nfs4_xdr_enc_cb_notify(struct rpc_rqst *req,
+ struct xdr_stream *xdr,
+ const void *data)
+{
+ const struct nfsd4_callback *cb = data;
+ struct nfs4_cb_compound_hdr hdr = {
+ .ident = 0,
+ .minorversion = cb->cb_clp->cl_minorversion,
+ };
+ struct CB_NOTIFY4args args = { };
+
+ WARN_ON_ONCE(hdr.minorversion == 0);
+
+ encode_cb_compound4args(xdr, &hdr);
+ encode_cb_sequence4args(xdr, cb, &hdr);
+
+ /*
+ * FIXME: get stateid and fh from delegation. Inline the cna_changes
+ * buffer, and zero it.
+ */
+ xdrgen_encode_CB_NOTIFY4args(xdr, &args);
+
+ hdr.nops++;
+ encode_cb_nops(&hdr);
+}
+
+static int nfs4_xdr_dec_cb_notify(struct rpc_rqst *rqstp,
+ struct xdr_stream *xdr,
+ void *data)
+{
+ struct nfsd4_callback *cb = data;
+ struct nfs4_cb_compound_hdr hdr;
+ int status;
+
+ status = decode_cb_compound4res(xdr, &hdr);
+ if (unlikely(status))
+ return status;
+
+ status = decode_cb_sequence4res(xdr, cb);
+ if (unlikely(status || cb->cb_seq_status))
+ return status;
+
+ return decode_cb_op_status(xdr, OP_CB_NOTIFY, &cb->cb_status);
+}
+
static void nfs4_xdr_enc_cb_notify_lock(struct rpc_rqst *req,
struct xdr_stream *xdr,
const void *data)
@@ -1052,6 +1097,7 @@ static const struct rpc_procinfo nfs4_cb_procedures[] = {
#ifdef CONFIG_NFSD_PNFS
PROC(CB_LAYOUT, COMPOUND, cb_layout, cb_layout),
#endif
+ PROC(CB_NOTIFY, COMPOUND, cb_notify, cb_notify),
PROC(CB_NOTIFY_LOCK, COMPOUND, cb_notify_lock, cb_notify_lock),
PROC(CB_OFFLOAD, COMPOUND, cb_offload, cb_offload),
PROC(CB_RECALL_ANY, COMPOUND, cb_recall_any, cb_recall_any),
diff --git a/fs/nfsd/nfs4xdr_gen.c b/fs/nfsd/nfs4xdr_gen.c
index 61346d9e6833..b1a583a6dfa1 100644
--- a/fs/nfsd/nfs4xdr_gen.c
+++ b/fs/nfsd/nfs4xdr_gen.c
@@ -713,7 +713,7 @@ xdrgen_decode_notify4(struct xdr_stream *xdr, struct notify4 *ptr)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_decode_CB_NOTIFY4args(struct xdr_stream *xdr, struct CB_NOTIFY4args *ptr)
{
if (!xdrgen_decode_stateid4(xdr, &ptr->cna_stateid))
@@ -1135,7 +1135,7 @@ xdrgen_encode_notify4(struct xdr_stream *xdr, const struct notify4 *value)
return true;
}
-static bool __maybe_unused
+bool
xdrgen_encode_CB_NOTIFY4args(struct xdr_stream *xdr, const struct CB_NOTIFY4args *value)
{
if (!xdrgen_encode_stateid4(xdr, &value->cna_stateid))
diff --git a/fs/nfsd/nfs4xdr_gen.h b/fs/nfsd/nfs4xdr_gen.h
index b989f37cdee8..0e11cd537a98 100644
--- a/fs/nfsd/nfs4xdr_gen.h
+++ b/fs/nfsd/nfs4xdr_gen.h
@@ -32,4 +32,7 @@ bool xdrgen_decode_posixaceperm4(struct xdr_stream *xdr, posixaceperm4 *ptr);
bool xdrgen_encode_posixaceperm4(struct xdr_stream *xdr, const posixaceperm4 value);
+bool xdrgen_decode_CB_NOTIFY4args(struct xdr_stream *xdr, struct CB_NOTIFY4args *ptr);
+bool xdrgen_encode_CB_NOTIFY4args(struct xdr_stream *xdr, const struct CB_NOTIFY4args *value);
+
#endif /* _LINUX_XDRGEN_NFS4_1_DECL_H */
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 4c6765a4cf22..9f321e9ed76d 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -190,6 +190,13 @@ struct nfs4_cb_fattr {
u64 ncf_cur_fsize;
};
+/*
+ * FIXME: the current backchannel encoder can't handle a send buffer longer
+ * than a single page (see bc_malloc/bc_free).
+ */
+#define NOTIFY4_EVENT_QUEUE_SIZE 3
+#define NOTIFY4_PAGE_ARRAY_SIZE 1
+
/*
* Represents a delegation stateid. The nfs4_client holds references to these
* and they are put when it is being destroyed or when the delegation is
@@ -776,6 +783,7 @@ enum nfsd4_cb_op {
NFSPROC4_CLNT_CB_NOTIFY_LOCK,
NFSPROC4_CLNT_CB_RECALL_ANY,
NFSPROC4_CLNT_CB_GETATTR,
+ NFSPROC4_CLNT_CB_NOTIFY,
};
/* Returns true iff a is later than b: */
diff --git a/fs/nfsd/xdr4cb.h b/fs/nfsd/xdr4cb.h
index f4e29c0c701c..b06d0170d7c4 100644
--- a/fs/nfsd/xdr4cb.h
+++ b/fs/nfsd/xdr4cb.h
@@ -33,6 +33,18 @@
cb_sequence_dec_sz + \
op_dec_sz)
+#define NFS4_enc_cb_notify_sz (cb_compound_enc_hdr_sz + \
+ cb_sequence_enc_sz + \
+ 1 + enc_stateid_sz + \
+ enc_nfs4_fh_sz + \
+ 1 + \
+ NOTIFY4_EVENT_QUEUE_SIZE * \
+ (2 + (NFS4_OPAQUE_LIMIT >> 2)))
+
+#define NFS4_dec_cb_notify_sz (cb_compound_dec_hdr_sz + \
+ cb_sequence_dec_sz + \
+ op_dec_sz)
+
#define NFS4_enc_cb_notify_lock_sz (cb_compound_enc_hdr_sz + \
cb_sequence_enc_sz + \
2 + 1 + \
--
2.54.0