Generic Linux architectural discussions

Generic Linux architectural discussions
 help / color / mirror / Atom feed

* RE: [PATCH v2 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Michael Kelley @ 2026-06-25 18:58 UTC (permalink / raw)
  To: Kameron Carr, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com,
	lpieralisi@kernel.org, sudeep.holla@kernel.org, arnd@arndb.de,
	thuth@redhat.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Michael Kelley
In-Reply-To: <20260625173500.1995481-4-kameroncarr@linux.microsoft.com>

From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Thursday, June 25, 2026 10:35 AM
> To: kys@microsoft.com; haiyangz@microsoft.com; wei.liu@kernel.org;
> decui@microsoft.com; longli@microsoft.com
> Cc: catalin.marinas@arm.com; will@kernel.org; mark.rutland@arm.com;
> lpieralisi@kernel.org; sudeep.holla@kernel.org; arnd@arndb.de; thuth@redhat.com; linux-
> hyperv@vger.kernel.org; linux-arm-kernel@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-arch@vger.kernel.org; mhklinux@outlook.com
> Subject: [PATCH v2 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA
> Realms
> 
> Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
> route them through the RSI_HOST_CALL interface, which takes the IPA of a
> per-CPU rsi_host_call structure as its argument.
> 
> Add hv_hostcall_array as a per-CPU struct array and allocate it during
> hyperv_init(). The allocation is gated on is_realm_world() so non-Realm
> arm64 Hyper-V guests pay no memory cost.
> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/mshyperv.c      | 32 ++++++++++++++++++++++++++++++-
>  arch/arm64/include/asm/mshyperv.h |  4 ++++
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 4fdc26ade1d74..7d536d7fb557e 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -15,10 +15,15 @@
>  #include <linux/errno.h>
>  #include <linux/version.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/slab.h>
>  #include <asm/mshyperv.h>
> +#include <asm/rsi.h>
> 
>  static bool hyperv_initialized;
> 
> +struct rsi_host_call *hv_hostcall_array;
> +EXPORT_SYMBOL_GPL(hv_hostcall_array);
> +
>  int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>  {
>  	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
> @@ -60,6 +65,12 @@ static bool __init hyperv_detect_via_acpi(void)
> 
>  #endif
> 
> +static void hv_hostcall_free(void)
> +{
> +	kfree(hv_hostcall_array);
> +	hv_hostcall_array = NULL;
> +}
> +
>  static bool __init hyperv_detect_via_smccc(void)
>  {
>  	uuid_t hyperv_uuid = UUID_INIT(
> @@ -85,6 +96,20 @@ static int __init hyperv_init(void)
>  	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
>  		return 0;
> 
> +	/*
> +	 * The RSI host-call buffers are only ever used when
> +	 * is_realm_world() is true. Skip the allocation on non-Realm
> +	 * guests. A single contiguous array of nr_cpu_ids entries is
> +	 * allocated; each CPU indexes into it by its processor ID.
> +	 */
> +	if (is_realm_world()) {
> +		hv_hostcall_array = kcalloc(nr_cpu_ids,
> +					    sizeof(struct rsi_host_call),
> +					    GFP_KERNEL);
> +		if (!hv_hostcall_array)
> +			return -ENOMEM;
> +	}
> +
>  	/* Setup the guest ID */
>  	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
>  	hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
> @@ -106,12 +131,13 @@ static int __init hyperv_init(void)
> 
>  	ret = hv_common_init();
>  	if (ret)
> -		return ret;
> +		goto free_hostcall_mem;
> 
>  	ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "arm64/hyperv_init:online",
>  				hv_common_cpu_init, hv_common_cpu_die);
>  	if (ret < 0) {
>  		hv_common_free();
> +		hv_hostcall_free();
>  		return ret;

Let me suggest a small additional simplification. For this error
path, call hv_common_free() as you have now, but then do
"goto free_hostcall_mem". At the free_hostcall_mem label, do

	kfree(hv_hostcall_array);
	hv_hostcall_array = NULL;

directly inline, and eliminate the hv_hostcall_free() helper
function. Saves about 5 lines of code overall and I think is a 
bit simpler.

>  	}
> 
> @@ -125,6 +151,10 @@ static int __init hyperv_init(void)
> 
>  	hyperv_initialized = true;
>  	return 0;
> +
> +free_hostcall_mem:
> +	hv_hostcall_free();
> +	return ret;
>  }
> 
>  early_initcall(hyperv_init);
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab66..c207a3f79b99b 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -63,4 +63,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
> 
>  #include <asm-generic/mshyperv.h>
> 
> +/* Per-CPU-indexed RSI host call structures for CCA Realms */
> +struct rsi_host_call;
> +extern struct rsi_host_call *hv_hostcall_array;
> +

The intent is that the #include of asm-generic/mshyperv.h should be
last in the arch-specific version of mshyperv.h. If there's a need to go
after the #include, that's a red flag to check if some restructuring of
the definitions would be appropriate.

Unless I'm missing something, I think these new definitions can go
above the #include.

Michael

>  #endif
> --
> 2.45.4
> 


^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-25 18:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrVdtYeJ7rXmvymLpOvn6X4LsfHYoVmbL6XgTqnjcP5n7g@mail.gmail.com>

On Thu, Jun 25, 2026, at 11:51 AM, Andy Lutomirski wrote:
> On Wed, Jun 24, 2026 at 8:41 PM John Ericson <mail@johnericson.me> wrote:
>
> It's sort of a combination -- read the data structures :)  Other than
> the propagation part, they're really not that bad.

Are you saying path resolution *does* depend on the mount namespace that
the task belongs to? I certainly hope not! I did look over the data
structures along with my patches and I didn't see an example of this ---
just path resolution depending on the CWD and root directories (as one
would expect it to).

> In any event, I think this discussion is sort of immaterial to the
> proposed API change.  No one is about to remove the concept of a mount
> namespace.  But maybe it makes sense to have a way to have a task that
> doesn't actually belong to a mount namespace.  A mount namespace is
> certainly going to exist.

I am not sure if that is addressed more to Al or me? I certainly do
agree with all that, in any case. Mount namespaces are absolutely here
to stay, and I'm just trying to make a process that does not belong to
one; that's exactly correct. Sorry if my motivation by way of historical
analysis veered off topic.

> There will definitely be subtleties.  For example, what happens if a
> task with "no mount namespace" tries to do OPEN_TREE_CLONE?  In some
> logical sense it ought to work but it ought to be impossible to
> actually mount the resulting tree anywhere, but this risks running
> afoul of all kinds of checks.  Maybe you get a whole new mount
> namespace (that does not become your current mnt_ns) if you
> OPEN_TREE_CLONE?
>
> This stuff is complex and it probably makes more sense to keep changes simple.

Yes it is subtle; I definitely don't claim to fully understand the
permission model with mount namespace modifications yet, for one. Should
we switch gears to just discussing the null CWD and root directories,
then, and return to mount namespaces later?

I have started to rework my patch series accordingly, so I have a new
draft first patch for just that, before changing anything else. I could
(after some testing) submit that next; it's pretty small.

John

^ permalink raw reply

* RE: [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Kameron Carr @ 2026-06-25 17:44 UTC (permalink / raw)
  To: 'Michael Kelley'
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, kys, haiyangz, wei.liu, decui, longli
In-Reply-To: <SN6PR02MB4157C9AA6BA2DD14E7697F2BD4E32@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thursday, June 18, 2026 10:46 AM, Michael Kelley wrote:
> From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June
> 9, 2026 11:10 AM
> > diff --git a/arch/arm64/include/asm/rsi_smc.h
> b/arch/arm64/include/asm/rsi_smc.h
> > index e19253f96c940..ffea93340ed7f 100644
> > --- a/arch/arm64/include/asm/rsi_smc.h
> > +++ b/arch/arm64/include/asm/rsi_smc.h
> > @@ -142,6 +142,12 @@ struct realm_config {
> >  	 */
> >  } __aligned(0x1000);
> >
> > +struct rsi_host_call {
> > +	u16 immediate;
> 
> I don't see the "immediate" used anywhere in this patch set.
> Is it always zero for the Hyper-V use cases?  Just curious ...

Yes, the immediate value is always zero for Hyper-V host calls.

-- Kameron


^ permalink raw reply

* RE: [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-25 17:42 UTC (permalink / raw)
  To: 'Michael Kelley'
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, kys, haiyangz, wei.liu, decui, longli
In-Reply-To: <SN6PR02MB4157F6A66DEDE650298E120ED4E32@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thursday, June 18, 2026 10:46 AM, Michael Kelley wrote:
> From: Kameron Carr <kameroncarr@linux.microsoft.com> Sent: Tuesday, June
> 9, 2026 11:10 AM
> > diff --git a/drivers/firmware/smccc/smccc.c
> b/drivers/firmware/smccc/smccc.c
> > index bdee057db2fd3..6b465e65472b0 100644
> > --- a/drivers/firmware/smccc/smccc.c
> > +++ b/drivers/firmware/smccc/smccc.c
> > @@ -12,6 +12,12 @@
> >  #include <linux/platform_device.h>
> >  #include <asm/archrandom.h>
> >
> > +#ifdef CONFIG_ARM64
> > +#include <linux/cleanup.h>
> > +#include <linux/spinlock.h>
> > +#include <asm/rsi.h>
> > +#endif
> > +
> >  static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
> >  static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
> >
> > @@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
> >  }
> >  EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
> >
> > +#ifdef CONFIG_ARM64
> > +static struct rsi_host_call uuid_hc;
> > +static DEFINE_SPINLOCK(uuid_hc_lock);
> 
> So evidently Sashiko is wrong in saying that struct rsi_host_call must be
> in decrypted memory?

Yes, Sashiko is wrong. The RMM spec clearly states that the rsi_host_call
struct must be encrypted / "protected". The other two requirements are
256 aligned and not RIPAS_EMPTY.


^ permalink raw reply

* [PATCH v2 6/6] arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms
From: Kameron Carr @ 2026-06-25 17:35 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

Provide an arm64 implementation of hv_is_isolation_supported() that
overrides the __weak default in drivers/hv/hv_common.c.

The implementation deliberately does not depend on
hv_is_hyperv_initialized() because hv_common_init() consults
hv_is_isolation_supported() before hyperv_initialized is set.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 8e8148b723d9c..62995b6133f6f 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -169,3 +169,8 @@ bool hv_isolation_type_cca(void)
 {
 	return is_realm_world();
 }
+
+bool hv_is_isolation_supported(void)
+{
+	return is_realm_world();
+}
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

Modify the five hypercall wrapper functions to check is_realm_world()
and use the per-CPU rsi_host_call structure when inside a Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/hv_core.c | 155 ++++++++++++++++++++++++++++--------
 1 file changed, 121 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
index e33a9e3c366a1..77cba08fca132 100644
--- a/arch/arm64/hyperv/hv_core.c
+++ b/arch/arm64/hyperv/hv_core.c
@@ -13,9 +13,41 @@
 #include <linux/mm.h>
 #include <linux/arm-smccc.h>
 #include <linux/module.h>
+#include <linux/smp.h>
 #include <asm-generic/bug.h>
 #include <hyperv/hvhdk.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
+
+/*
+ * hv_do_rsi_hypercall - Helper function to invoke a hypercall from a
+ * Realm world using the RSI interface.
+ */
+static u64 hv_do_rsi_hypercall(u64 control, u64 input1, u64 input2)
+{
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
+
+	if (!hv_hostcall_array)
+		return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+	local_irq_save(flags);
+	hostcall = &hv_hostcall_array[smp_processor_id()];
+	memset(hostcall, 0, sizeof(*hostcall));
+	hostcall->gprs[0] = HV_FUNC_ID;
+	hostcall->gprs[1] = control;
+	hostcall->gprs[2] = input1;
+	hostcall->gprs[3] = input2;
+
+	if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+		ret = hostcall->gprs[0];
+	else
+		ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+	local_irq_restore(flags);
+	return ret;
+}
 
 /*
  * hv_do_hypercall- Invoke the specified hypercall
@@ -29,8 +61,11 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
 	input_address = input ? virt_to_phys(input) : 0;
 	output_address = output ? virt_to_phys(output) : 0;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID, control,
-			  input_address, output_address, &res);
+	if (is_realm_world())
+		return hv_do_rsi_hypercall(control, input_address, output_address);
+
+	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
+			  output_address, &res);
 	return res.a0;
 }
 EXPORT_SYMBOL_GPL(hv_do_hypercall);
@@ -48,6 +83,9 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world())
+		return hv_do_rsi_hypercall(control, input, 0);
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input, &res);
 	return res.a0;
 }
@@ -65,6 +103,9 @@ u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world())
+		return hv_do_rsi_hypercall(control, input1, input2);
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
 	return res.a0;
 }
@@ -76,24 +117,44 @@ EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
 void hv_set_vpreg(u32 msr, u64 value)
 {
 	struct arm_smccc_res res;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 status;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID,
-		HVCALL_SET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1,
-		HV_PARTITION_ID_SELF,
-		HV_VP_INDEX_SELF,
-		msr,
-		0,
-		value,
-		0,
-		&res);
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = &hv_hostcall_array[smp_processor_id()];
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_SET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+		hostcall->gprs[6] = value;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			status = hostcall->gprs[0];
+		else
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+	} else {
+		arm_smccc_1_1_hvc(HV_FUNC_ID,
+				  HVCALL_SET_VP_REGISTERS |
+					  HV_HYPERCALL_FAST_BIT |
+					  HV_HYPERCALL_REP_COMP_1,
+				  HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF, msr,
+				  0, value, 0, &res);
+		status = res.a0;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * setting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if setting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_set_vpreg);
 
@@ -108,29 +169,55 @@ void hv_get_vpreg_128(u32 msr, struct hv_get_vp_registers_output *result)
 {
 	struct arm_smccc_1_2_regs args;
 	struct arm_smccc_1_2_regs res;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 status;
 
-	args.a0 = HV_FUNC_ID;
-	args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1;
-	args.a2 = HV_PARTITION_ID_SELF;
-	args.a3 = HV_VP_INDEX_SELF;
-	args.a4 = msr;
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = &hv_hostcall_array[smp_processor_id()];
+		memset(hostcall, 0, sizeof(*hostcall));
 
-	/*
-	 * Use the SMCCC 1.2 interface because the results are in registers
-	 * beyond X0-X3.
-	 */
-	arm_smccc_1_2_hvc(&args, &res);
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_GET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS) {
+			status = hostcall->gprs[0];
+			result->as64.low = hostcall->gprs[6];
+			result->as64.high = hostcall->gprs[7];
+		} else {
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		}
+		local_irq_restore(flags);
+	} else {
+		args.a0 = HV_FUNC_ID;
+		args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
+			  HV_HYPERCALL_REP_COMP_1;
+		args.a2 = HV_PARTITION_ID_SELF;
+		args.a3 = HV_VP_INDEX_SELF;
+		args.a4 = msr;
+
+		/*
+		 * Use the SMCCC 1.2 interface because the results are in
+		 * registers beyond X0-X3.
+		 */
+		arm_smccc_1_2_hvc(&args, &res);
+		status = res.a0;
+		result->as64.low = res.a6;
+		result->as64.high = res.a7;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * getting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if getting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
-
-	result->as64.low = res.a6;
-	result->as64.high = res.a7;
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_get_vpreg_128);
 
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 4/6] Drivers: hv: Mark shared memory as decrypted for CCA Realms
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

In hv_common_cpu_init(), the per-CPU hypercall input/output pages need
to be marked as decrypted (shared) for confidential VM isolation types.
This is already done for SNP and TDX isolation; extend the same handling
to Arm CCA Realm guests so that the host hypervisor can access the
shared hypercall buffers.

We need to round up the memory allocated for the input/output pages to
the nearest PAGE_SIZE, since set_memory_decrypted() requires the size to
be a multiple of PAGE_SIZE. This only has an effect on ARM VMs that are
using PAGE_SIZE larger than 4K.

is_realm_world() is only declared in arch/arm64/include/asm/rsi.h, so
using it directly in the arch-neutral drivers/hv/hv_common.c would
break the x86 build. Introduce a Hyper-V-specific helper following the
established hv_isolation_type_snp() / hv_isolation_type_tdx() pattern.

On architectures other than arm64 the weak default keeps the existing
behaviour.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c   |  5 +++++
 drivers/hv/hv_common.c         | 17 +++++++++++++----
 include/asm-generic/mshyperv.h |  1 +
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 7d536d7fb557e..8e8148b723d9c 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -164,3 +164,8 @@ bool hv_is_hyperv_initialized(void)
 	return hyperv_initialized;
 }
 EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
+
+bool hv_isolation_type_cca(void)
+{
+	return is_realm_world();
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac6167891..17048a0a18729 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -476,6 +476,7 @@ int hv_common_cpu_init(unsigned int cpu)
 	u64 msr_vp_index;
 	gfp_t flags;
 	const int pgcount = hv_output_page_exists() ? 2 : 1;
+	const size_t alloc_size = ALIGN((size_t)pgcount * HV_HYP_PAGE_SIZE, PAGE_SIZE);
 	void *mem;
 	int ret = 0;
 
@@ -489,7 +490,7 @@ int hv_common_cpu_init(unsigned int cpu)
 	 * online and then taken offline
 	 */
 	if (!*inputarg) {
-		mem = kmalloc_array(pgcount, HV_HYP_PAGE_SIZE, flags);
+		mem = kmalloc(alloc_size, flags);
 		if (!mem)
 			return -ENOMEM;
 
@@ -499,14 +500,16 @@ int hv_common_cpu_init(unsigned int cpu)
 		}
 
 		if (!ms_hyperv.paravisor_present &&
-		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
-			ret = set_memory_decrypted((unsigned long)mem, pgcount);
+		    (hv_isolation_type_snp() || hv_isolation_type_tdx() ||
+		     hv_isolation_type_cca())) {
+			ret = set_memory_decrypted((unsigned long)kasan_reset_tag(mem),
+				alloc_size >> PAGE_SHIFT);
 			if (ret) {
 				/* It may be unsafe to free 'mem' */
 				return ret;
 			}
 
-			memset(mem, 0x00, pgcount * HV_HYP_PAGE_SIZE);
+			memset(mem, 0x00, alloc_size);
 		}
 
 		/*
@@ -666,6 +669,12 @@ bool __weak hv_isolation_type_tdx(void)
 }
 EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
 
+bool __weak hv_isolation_type_cca(void)
+{
+	return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_cca);
+
 void __weak hv_setup_vmbus_handler(void (*handler)(void))
 {
 }
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb9..1fa79abce743c 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -79,6 +79,7 @@ u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
 
 bool hv_isolation_type_snp(void);
 bool hv_isolation_type_tdx(void);
+bool hv_isolation_type_cca(void);
 
 /*
  * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
route them through the RSI_HOST_CALL interface, which takes the IPA of a
per-CPU rsi_host_call structure as its argument.

Add hv_hostcall_array as a per-CPU struct array and allocate it during
hyperv_init(). The allocation is gated on is_realm_world() so non-Realm
arm64 Hyper-V guests pay no memory cost.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c      | 32 ++++++++++++++++++++++++++++++-
 arch/arm64/include/asm/mshyperv.h |  4 ++++
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 4fdc26ade1d74..7d536d7fb557e 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -15,10 +15,15 @@
 #include <linux/errno.h>
 #include <linux/version.h>
 #include <linux/cpuhotplug.h>
+#include <linux/slab.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
 
 static bool hyperv_initialized;
 
+struct rsi_host_call *hv_hostcall_array;
+EXPORT_SYMBOL_GPL(hv_hostcall_array);
+
 int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 {
 	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
@@ -60,6 +65,12 @@ static bool __init hyperv_detect_via_acpi(void)
 
 #endif
 
+static void hv_hostcall_free(void)
+{
+	kfree(hv_hostcall_array);
+	hv_hostcall_array = NULL;
+}
+
 static bool __init hyperv_detect_via_smccc(void)
 {
 	uuid_t hyperv_uuid = UUID_INIT(
@@ -85,6 +96,20 @@ static int __init hyperv_init(void)
 	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
 		return 0;
 
+	/*
+	 * The RSI host-call buffers are only ever used when
+	 * is_realm_world() is true. Skip the allocation on non-Realm
+	 * guests. A single contiguous array of nr_cpu_ids entries is
+	 * allocated; each CPU indexes into it by its processor ID.
+	 */
+	if (is_realm_world()) {
+		hv_hostcall_array = kcalloc(nr_cpu_ids,
+					    sizeof(struct rsi_host_call),
+					    GFP_KERNEL);
+		if (!hv_hostcall_array)
+			return -ENOMEM;
+	}
+
 	/* Setup the guest ID */
 	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
 	hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
@@ -106,12 +131,13 @@ static int __init hyperv_init(void)
 
 	ret = hv_common_init();
 	if (ret)
-		return ret;
+		goto free_hostcall_mem;
 
 	ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "arm64/hyperv_init:online",
 				hv_common_cpu_init, hv_common_cpu_die);
 	if (ret < 0) {
 		hv_common_free();
+		hv_hostcall_free();
 		return ret;
 	}
 
@@ -125,6 +151,10 @@ static int __init hyperv_init(void)
 
 	hyperv_initialized = true;
 	return 0;
+
+free_hostcall_mem:
+	hv_hostcall_free();
+	return ret;
 }
 
 early_initcall(hyperv_init);
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab66..c207a3f79b99b 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -63,4 +63,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 
 #include <asm-generic/mshyperv.h>
 
+/* Per-CPU-indexed RSI host call structures for CCA Realms */
+struct rsi_host_call;
+extern struct rsi_host_call *hv_hostcall_array;
+
 #endif
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and
use rsi_host_call() to query the hypervisor vendor UUID when inside a
Realm. The realm path is factored into a helper,
arm_smccc_realm_get_hypervisor_uuid(), that owns a file-static
rsi_host_call buffer (uuid_hc) serialized by a spinlock.

The RSI-specific includes, file-static state and helper are guarded
with CONFIG_ARM64 because <asm/rsi.h> does not exist on 32-bit ARM.

For non-Realm environments, the existing arm_smccc_1_1_invoke() path
is unchanged.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 drivers/firmware/smccc/smccc.c | 41 +++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
index bdee057db2fd3..a876b7aa2dc99 100644
--- a/drivers/firmware/smccc/smccc.c
+++ b/drivers/firmware/smccc/smccc.c
@@ -12,6 +12,12 @@
 #include <linux/platform_device.h>
 #include <asm/archrandom.h>
 
+#ifdef CONFIG_ARM64
+#include <linux/cleanup.h>
+#include <linux/spinlock.h>
+#include <asm/rsi.h>
+#endif
+
 static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
 static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
 
@@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
 }
 EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
 
+#ifdef CONFIG_ARM64
+static struct rsi_host_call uuid_hc;
+static DEFINE_SPINLOCK(uuid_hc_lock);
+
+/*
+ * Helper function to get the hypervisor UUID via an RsiHostCall.
+ */
+static void arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
+{
+	guard(spinlock_irqsave)(&uuid_hc_lock);
+
+	memset(&uuid_hc, 0, sizeof(uuid_hc));
+	uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
+
+	if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS) {
+		res->a0 = SMCCC_RET_NOT_SUPPORTED;
+		return;
+	}
+
+	res->a0 = uuid_hc.gprs[0];
+	res->a1 = uuid_hc.gprs[1];
+	res->a2 = uuid_hc.gprs[2];
+	res->a3 = uuid_hc.gprs[3];
+}
+#endif
+
 bool arm_smccc_hypervisor_has_uuid(const uuid_t *hyp_uuid)
 {
 	struct arm_smccc_res res = {};
 	uuid_t uuid;
 
-	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
+#ifdef CONFIG_ARM64
+	if (is_realm_world())
+		arm_smccc_realm_get_hypervisor_uuid(&res);
+	else
+#endif
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID,
+				     &res);
+
 	if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
 		return false;
 
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260625173500.1995481-1-kameroncarr@linux.microsoft.com>

Add struct rsi_host_call to rsi_smc.h, which represents the host call
data structure used by the Realm Management Monitor (RMM) for the
RSI_HOST_CALL interface. The structure contains a 16-bit immediate field
and 31 general-purpose register values, aligned to 256 bytes as required
by the CCA RMM specification.

Add rsi_host_call() static inline wrapper in rsi_cmds.h that invokes
SMC_RSI_HOST_CALL with the physical address of the host call structure.
This will be used by Hyper-V guest code to route hypercalls through the
RSI interface when running inside an Arm CCA Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/include/asm/rsi_cmds.h | 22 ++++++++++++++++++++++
 arch/arm64/include/asm/rsi_smc.h  |  7 +++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
index 2c8763876dfb7..9daf8008e5da2 100644
--- a/arch/arm64/include/asm/rsi_cmds.h
+++ b/arch/arm64/include/asm/rsi_cmds.h
@@ -88,6 +88,28 @@ static inline long rsi_set_addr_range_state(phys_addr_t start,
 	return res.a0;
 }
 
+/**
+ * rsi_host_call - Make a Host call.
+ * @host_call_struct: IPA of host call structure
+ *
+ * This call will fail if the IPA of the host call structure:
+ * * is not aligned to 256 bytes,
+ * * is not protected / encrypted,
+ * * is RIPAS_EMPTY
+ *
+ * Returns:
+ *  On success, returns RSI_SUCCESS.
+ *  Otherwise, returns an error code.
+ */
+static inline unsigned long rsi_host_call(phys_addr_t host_call_struct)
+{
+	struct arm_smccc_res res;
+
+	arm_smccc_smc(SMC_RSI_HOST_CALL, host_call_struct, 0, 0, 0, 0, 0, 0,
+		      &res);
+	return res.a0;
+}
+
 /**
  * rsi_attestation_token_init - Initialise the operation to retrieve an
  * attestation token.
diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
index e19253f96c940..9cc57b5be0c02 100644
--- a/arch/arm64/include/asm/rsi_smc.h
+++ b/arch/arm64/include/asm/rsi_smc.h
@@ -142,6 +142,13 @@ struct realm_config {
 	 */
 } __aligned(0x1000);
 
+struct rsi_host_call {
+	u16 immediate;
+	u8 _padding[6];
+	u64 gprs[31];
+} __aligned(256);
+static_assert(sizeof(struct rsi_host_call) == 256);
+
 #endif /* __ASSEMBLER__ */
 
 /*
-- 
2.45.4


^ permalink raw reply related

* [PATCH v2 0/6] arm64: hyperv: Add Realm support for Hyper-V
From: Kameron Carr @ 2026-06-25 17:34 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux

Realms (CoCo VMs on ARM) require host calls to be routed through the RMM
(Realm Management Monitor) via the RSI (Realm Service Interface). This
series implements most of the necessary changes to support Realms on
Hyper-V.

One required change is not included in this series. The two buffers
allocated via vzalloc() in netvsc_init_buf() cannot be decrypted in
vmbus_establish_gpadl(). Currently only linearly mapped memory can be
decrypted. See my RFC patch [1]. I will implement the accompanying netvsc
changes based on the feedback I receive on that patch.

This patch series was tested by booting a Realm on Cobalt 200 running
Windows. I decreased the buffer size and used kzalloc() in
netvsc_init_buf() in my testing as a workaround for the issue mentioned
above.


Changes since v1 [2]:
  Patch 1: Add explicit padding to the RSI host call structure
  Patch 3: Change from a per-cpu pointer lazily allocated to an array
             of host call structs indexed by cpu id
  Patch 4: Align input_page + output_page allocation to PAGE_SIZE since
              that is the smallest unit of memory that can be decrypted
           Remove KSAN tags before passing address to set_memory_decrypted()
             since __is_lm_address() does pointer arithmetic.
  Patch 5: Add a helper function to reduce repetition
           Check for NULL before indexing into host call array

[1] https://lore.kernel.org/all/20260521205834.1012925-1-kameroncarr@linux.microsoft.com/
[2] https://lore.kernel.org/all/20260609181030.2378391-1-kameroncarr@linux.microsoft.com/

Kameron Carr (6):
  arm64: rsi: Add RSI host call structure and helper function
  firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
  arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
  Drivers: hv: Mark shared memory as decrypted for CCA Realms
  arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
  arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms

 arch/arm64/hyperv/hv_core.c       | 155 +++++++++++++++++++++++-------
 arch/arm64/hyperv/mshyperv.c      |  42 +++++++-
 arch/arm64/include/asm/mshyperv.h |   4 +
 arch/arm64/include/asm/rsi_cmds.h |  22 +++++
 arch/arm64/include/asm/rsi_smc.h  |   7 ++
 drivers/firmware/smccc/smccc.c    |  41 +++++++-
 drivers/hv/hv_common.c            |  17 +++-
 include/asm-generic/mshyperv.h    |   1 +
 8 files changed, 249 insertions(+), 40 deletions(-)


base-commit: a4ffc59238be84dd1c26bf1c001543e832674fc6
-- 
2.45.4


^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Al Viro @ 2026-06-25 16:55 UTC (permalink / raw)
  To: Xin Zhao
  Cc: brauner, alex.aring, allen.lkml, arnd, chuck.lever, david,
	ebiederm, j.granados, jack, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt
In-Reply-To: <20260625085018.989584-1-jackzxcui1989@163.com>

On Thu, Jun 25, 2026 at 04:50:18PM +0800, Xin Zhao wrote:
> > [Severity: Medium]
> > Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> > of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> > hold file->f_lock.
> > 
> > Also, if a file has duplicated file descriptors (e.g., via dup()), will
> > clearing O_TMPCLOS here prematurely skip the closure of the remaining
> > descriptors? When encountering the duplicated descriptor later, the flag
> > will already be cleared, leaving the shared file actively referenced.
> 
> Currently, this flag will only be used by the logic we added, so I believe
> there won't be any issues.

What makes you (or whatever LLM you happen to use) think that file is referenced
only by descriptor table of the coredumping process?  Or that only one coredumping
process exists at any time, for that matter - and each might hold references to
the same struct file.

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-25 15:51 UTC (permalink / raw)
  To: John Ericson
  Cc: Al Viro, Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <a75a9b82-a15b-4893-8f92-62b62664ea83@app.fastmail.com>

On Wed, Jun 24, 2026 at 8:41 PM John Ericson <mail@johnericson.me> wrote:
>
> Ah, I started replying to your first email, but this is better, this
> gets to the heart of the matter. Please don't mind me responding to your
> two questions in reverse.
>
> On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> > What's the fundamental difference between CWD and any open descriptor
> > for a directory?  Why does it make sense to ban the former, but allow
> > the equivalents done via the latter?
>
> Yes! These two notions are very close --- but that's the *problem*, not
> a reason to not care about the existence of the CWD and root FS. I want
> to get rid of CWD in my processes not because it is fundamentally
> different (it isn't), but because it is superfluous.
>
> If one is capability-minded like me, it's a bad mistake that we ever had
> this "working directory" notion to begin with, and yet another example
> of the folks at Bell Labs sticking something in the kernel that was
> really only needed by the shell, and that could have just been done in
> userland.
>
> The current working directory, roughly, is *just* some global state
> holding a directory file descriptor. But I don't want that global state.
> If I am writing my userland program (that is not a shell), I would not
> create the global variable. I do not appreciate the fact that the kernel
> foists that state upon me whether I like it or not.
>
> Now obviously we cannot have a giant breaking change removing the notion
> of a current working directory altogether. But we can allow individual
> processes which don't want it to opt out, and that is what nulling out
> these fields (and updating the path resolution code to cope with that)
> allows.
>
> There is no loss of expressive power doing this, because one can (and
> should!) just use the `*at` and file descriptors. But there is, however,
> the imposition of discipline. The programmer (or coding agent) is
> encouraged to do everything with file descriptors rather than path
> concatenations etc., because they need to use `*at` anyways, and then
> voilà, without browbeating anyone in security seminars or code review, a
> bunch of TOCTOU issues disappear simply because doing the right thing is
> now the path of least resistance.
>
> > Please, start with explaining what, in your opinion, a mount namespace
> > _is_, and where does "mount X is attached at path P relative to mount
> > Y" belong.
>
> Let's take a pathological example:
>
> - Process A has `/foo` bind-mounted at `/bar/foo`
>
> - Process B has `/bar` without that bind mount, and `/foo` mounted at
>   `/baz/foo`, as is possible because it is in a different mount
>   namespace.
>
> If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
> does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
> because `..` is resolved according to the mount referenced in the open
> file. (This is, by the way, very good! Directory file descriptors would
> be perilous to use if this were not the case!)
>
> The moral of the story is that "mount X is attached at path P relative
> to mount Y" is information accessed in the mounts themselves (maybe via
> their containing mount namespace, per the `mnt_ns` field, or maybe not,
> I am not sure, but it is immaterial). In contrast, the mount namespace
> of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
> doesn't matter at all for this purpose.

It's sort of a combination -- read the data structures :)  Other than
the propagation part, they're really not that bad.

In any event, I think this discussion is sort of immaterial to the
proposed API change.  No one is about to remove the concept of a mount
namespace.  But maybe it makes sense to have a way to have a task that
doesn't actually belong to a mount namespace.  A mount namespace is
certainly going to exist.

There will definitely be subtleties.  For example, what happens if a
task with "no mount namespace" tries to do OPEN_TREE_CLONE?  In some
logical sense it ought to work but it ought to be impossible to
actually mount the resulting tree anywhere, but this risks running
afoul of all kinds of checks.  Maybe you get a whole new mount
namespace (that does not become your current mnt_ns) if you
OPEN_TREE_CLONE?

This stuff is complex and it probably makes more sense to keep changes simple.

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25 15:45 UTC (permalink / raw)
  To: ljs
  Cc: akpm, alex.aring, allen.lkml, arnd, brauner, chuck.lever, corbet,
	david, ebiederm, j.granados, jack, jackzxcui1989, jlayton,
	juri.lelli, keescook, liam, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, mcgrof, mingo, mjguzik, peterz, pfalcato,
	vincent.guittot, viro
In-Reply-To: <aj0cUrwdXYKIicC-@lucifer>

On Thu, 25 Jun 2026 13:48:10 +0100 Lorenzo Stoakes <ljs@kernel.org> wrote:

> +cc missing maintainers, lists.
> 
> NAK.
> 
> This is un-upstreamable for numerous reasons.
> 
> The stuff you're doing in mm is broken, wrong and invasive and you've not
> even bothered to cc- mm people. I'm annoyed by this.
> 
> You're also doing incredibly silly mistakes at v4 of something that should have
> been an RFC.
> 
> You don't seem to understand the concept of patch _series_ (break it up into
> smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
> you're radically alterting.
> 
> I'm annoyed as you have a history where you were told not to add insane hacks
> before ([0], my reply at [1]).
> 
> [0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
> [1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/
> 
> Was I wasting my time there? Am I wasting my time responding now?
> 
> And how hard is it to run a simple perl script?
> 
> Let me run it for you for _just_ the maintainers:

I probably shouldn't reply to this email to waste more of your time, but I
can't help but respond because your comments have been very beneficial to
me, and I enjoy the process.

The v4 version has changed too much compared to the v3 version. I should
have re-executed the "get maintainer" script, but I mistakenly copied the
previous email list and sent it out. I sincerely apologize for that.

There are quite a few issues now, and I haven't come up with a good
overall solution. I actually want to resolve the problems we encountered
in our project with minimal kernel modifications, but I can't think of a
good way to do it. It seems that the v4 version has turned out to be a
complete disaster of a patch, and I sincerely hope that my example won't
be used as a counterexample in the future. Thank you for that.

Suddenly, I have some thoughts about this issue, but I even question
whether I should have these ideas. Let me sit down and sort things out
properly. I hope the v5 version won't be a disaster.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Lorenzo Stoakes @ 2026-06-25 12:48 UTC (permalink / raw)
  To: Xin Zhao
  Cc: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
	chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
	allen.lkml, linux-fsdevel, linux-kernel, linux-arch,
	Jonathan Corbet, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Liam R. Howlett,
	linux-doc, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

+cc missing maintainers, lists.

NAK.

This is un-upstreamable for numerous reasons.

The stuff you're doing in mm is broken, wrong and invasive and you've not
even bothered to cc- mm people. I'm annoyed by this.

You're also doing incredibly silly mistakes at v4 of something that should have
been an RFC.

You don't seem to understand the concept of patch _series_ (break it up into
smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
you're radically alterting.

I'm annoyed as you have a history where you were told not to add insane hacks
before ([0], my reply at [1]).

[0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
[1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/

Was I wasting my time there? Am I wasting my time responding now?

And how hard is it to run a simple perl script?

Let me run it for you for _just_ the maintainers:

$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor your_patch.patch
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Arnd Bergmann <arnd@arndb.de> (maintainer:GENERIC INCLUDE/ASM HEADER FILES)
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER)
Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER)
Kees Cook <kees@kernel.org> (maintainer:EXEC & BINFMT API, ELF)
"Liam R. Howlett" <liam@infradead.org> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-fsdevel@vger.kernel.org (open list:PROC FILESYSTEM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-arch@vger.kernel.org (open list:GENERIC INCLUDE/ASM HEADER FILES)
EXEC & BINFMT API, ELF status: Supported

You're missing the majority of these. That's _not OK_.

On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can

This is a horrible idea.

> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.

What, people set this ahead of time? For a dynamic thing like files?

>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.

This sentence doesn't even make sense?

And also !VM_SHARED means !vma->vm_file so your code would NULL deref if you
didn't check that. But !VM_SHARED VMAs can absolutely be file-backed...

>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>
> Change in v4:
> - Christian pointed out that the coredump process will traverse file
>   descriptors (fd), so certain fds should not be closed by default.
>   Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
>   pre-exit resources selection, default is NOT pre-exit anything.
> - Mateusz suggested that walking the fd table and release the file-lock is
>   reasonable. No longer release all the fd(s). Based on user config, only
>   the flock fd(s) and the fd(s) correspondent to file-backed shared memory
>   will be released at most.
>
> Change in v3:
> - Add comment and commit-log to explain why do the MMF_DUMP_MAPPED_SHARED
>   mm_flags_test() check, note that memory mapped files keep their own
>   separate references to the files. The case to work around is that early
>   unlocking a flock on a file allows other processes to lock and modify
>   the mapped data protected by the flock,
>   as suggested by Pedro Falcato.
> - Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/
>
> Change in v2:
> - Get rid of the implement of adding new fcntl API, the issue does not
>   worth inflicting the cost on everyone,
>   as suggested by Al Viro.
> - Call exit_files() in coredump_wait(),
>   as suggested by Eric W. Biederman.
>   Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
>   need to dump file-backed shared memory.
> - Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/
>
> v1:
> - Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++
>  Documentation/filesystems/proc.rst            | 58 +++++++++-----
>  fs/coredump.c                                 | 23 ++++++
>  fs/file.c                                     | 46 +++++++++++
>  fs/proc/base.c                                | 78 +++++++++++++++++++
>  include/linux/mm.h                            |  1 +

No.

>  include/linux/mm_types.h                      |  9 +++

No.

>  include/linux/sched/task.h                    |  1 +
>  include/uapi/asm-generic/fcntl.h              |  4 +
>  kernel/fork.c                                 | 12 +++
>  mm/mmap.c                                     | 21 +++++

No.

>  11 files changed, 238 insertions(+), 20 deletions(-)

This is a completely insane diffstat for a single patch. Ridiculous.

AND YOU HAVEN'T ADDED A SINGLE TEST.

>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d4508..bc6d3859f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.
> +
>  	coresight_cpu_debug.enable
>  			[ARM,ARM64]
>  			Format: <bool>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167bef..6a637d31d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -39,16 +39,17 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.2	/proc/<pid>/oom_score - Display current oom-killer score
>    3.3	/proc/<pid>/io - Display the IO accounting fields
>    3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> -  3.5	/proc/<pid>/mountinfo - Information about mounts
> -  3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> -  3.7   /proc/<pid>/task/<tid>/children - Information about task children
> -  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
> -  3.9   /proc/<pid>/map_files - Information about memory mapped files
> -  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> -  3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> -  3.12	/proc/<pid>/arch_status - Task architecture specific information
> -  3.13  /proc/<pid>/fd - List of symlinks to open files
> -  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status.
> +  3.5  /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +  3.6	/proc/<pid>/mountinfo - Information about mounts
> +  3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +  3.8   /proc/<pid>/task/<tid>/children - Information about task children
> +  3.9   /proc/<pid>/fdinfo/<fd> - Information about opened file
> +  3.10   /proc/<pid>/map_files - Information about memory mapped files
> +  3.11  /proc/<pid>/timerslack_ns - Task timerslack value
> +  3.12	/proc/<pid>/patch_state - Livepatch patch operation state
> +  3.13	/proc/<pid>/arch_status - Task architecture specific information
> +  3.14  /proc/<pid>/fd - List of symlinks to open files
> +  3.15  /proc/<pid>/ksm_stat - Information about the process's ksm status.
>
>    4	Configuring procfs
>    4.1	Mount options
> @@ -1961,7 +1962,24 @@ For example::
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>
> -3.5	/proc/<pid>/mountinfo - Information about mounts
> +3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +---------------------------------------------------------------
> +A coredump typically takes some time to complete. If we happen to hold a write
> +lock with flock just before triggering the coredump, that write lock will not
> +be released during the entire coredump process. As a result, other processes
> +attempting to acquire the same write lock may experience significant delays.
> +Another typical scenario is that shared memory, such as dma-buf, remains
> +occupied and is not released for a long time due to core dumps.
> +
> +/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
> +dumping core.
> +
> +The following two types are supported:
> +
> +  - (bit 0) flock files
> +  - (bit 1) file-backed shared memory
> +
> +3.6	/proc/<pid>/mountinfo - Information about mounts
>  --------------------------------------------------------
>
>  This file contains lines of the form::
> @@ -2001,7 +2019,7 @@ For more information on mount propagation see:
>    Documentation/filesystems/sharedsubtree.rst
>
>
> -3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
> +3.7	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
>  --------------------------------------------------------
>  These files provide a method to access a task's comm value. It also allows for
>  a task to set its own or one of its thread siblings comm value. The comm value
> @@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
>  terminator) will result in a truncated comm value.
>
>
> -3.7	/proc/<pid>/task/<tid>/children - Information about task children
> +3.8	/proc/<pid>/task/<tid>/children - Information about task children
>  -------------------------------------------------------------------------
>  This file provides a fast way to retrieve first level children pids
>  of a task pointed by <pid>/<tid> pair. The format is a space separated
> @@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
>  if precise results are needed.
>
>
> -3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
> +3.9	/proc/<pid>/fdinfo/<fd> - Information about opened file
>  ---------------------------------------------------------------
>  This file provides information associated with an opened file. The regular
>  files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
> @@ -2198,7 +2216,7 @@ VFIO Device files
>  where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
>  file.
>
> -3.9	/proc/<pid>/map_files - Information about memory mapped files
> +3.10	/proc/<pid>/map_files - Information about memory mapped files
>  ---------------------------------------------------------------------
>  This directory contains symbolic links which represent memory mapped files
>  the process is maintaining.  Example output::
> @@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
>  comparing their inode numbers to figure out which anonymous memory areas
>  are actually shared.
>
> -3.10	/proc/<pid>/timerslack_ns - Task timerslack value
> +3.11	/proc/<pid>/timerslack_ns - Task timerslack value
>  ---------------------------------------------------------
>  This file provides the value of the task's timerslack value in nanoseconds.
>  This value specifies an amount of time that normal timers may be deferred
> @@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
>  An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
>  permissions on the task specified to change its timerslack_ns value.
>
> -3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> +3.12	/proc/<pid>/patch_state - Livepatch patch operation state
>  -----------------------------------------------------------------
>  When CONFIG_LIVEPATCH is enabled, this file displays the value of the
>  patch state for the task.
> @@ -2253,7 +2271,7 @@ patched.  If the patch is being enabled, then the task has already been
>  patched.  If the patch is being disabled, then the task hasn't been
>  unpatched yet.
>
> -3.12 /proc/<pid>/arch_status - task architecture specific status
> +3.13 /proc/<pid>/arch_status - task architecture specific status
>  -------------------------------------------------------------------
>  When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
>  architecture specific status of the task.
> @@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>
> -3.13 /proc/<pid>/fd - List of symlinks to open files
> +3.14 /proc/<pid>/fd - List of symlinks to open files
>  -------------------------------------------------------
>  This directory contains symbolic links which represent open files
>  the process is maintaining.  Example output::
> @@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
>  of stat() output for /proc/<pid>/fd for fast access.
>  -------------------------------------------------------
>
> -3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
> +3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
>  ----------------------------------------------------------------------
>  When CONFIG_KSM is enabled, each process has this file which displays
>  the information of ksm merging status.
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bb6fdb1f4..e08a8a6c4 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
>  	return nr;
>  }
>
> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);

What the hell are you doing?

This is not where we unmap VMAs?

This is likely broken in subtle ways.

> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b16..a58ffffcc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
>  #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}

This code hurts my eyes.

> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c..99b5f219f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +

No comment, obviously.

> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +

Yeah who needs a comment...

> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {

What?

> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9..dfd4717c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);

You don't use extern.

>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6..0555aaf50 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12

Err do we have space for this?

You really want to add 2 more bits to mm_struct flags for this insanity?

> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)

So are these dumpable bits or not? Why are you not just incrementing
MMF_DUMPABLE_BITS?

> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cf..b4becbf6c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285..360604d65 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif
> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448..84f1ee7f3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
>  __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> +
>  #include <linux/init_task.h>
>
>  static void mm_init_aio(struct mm_struct *mm)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36..b955c47c0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  }
>
> +void exit_mmap_mapped_shared(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_write_lock(mm);
> +	lru_add_drain();

Why?

> +
> +	for_each_vma(vmi, vma) {

Literally every single VMA? Including the gate VMA too?

No VMA locks... so that's already broken.

> +		if (vma->vm_flags & VM_HUGETLB)
> +			continue;

That's not how you test for hugetlb.

> +		if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)

This isn't how we work with flags any more.

> +			continue;
> +		vma->vm_file->f_flags |= O_TMPCLOS;


Not sure directly manipulating file flags like this is valid in any way, shape,
or form.

> +		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);

This is utterly broken, the outer loop will be invalidated by you removing
these, do_munmap() has its own iterator...

And this is just madly inefficient. Why wouldn't you just loop over the VMAs to
alter flags then unmap the whole range?

But this is also introducing a completely separate, duplicative, version of
exit_mmap().

You're not doing any of what that function does. You're just very inefficiently
unmapping everything?

> +		cond_resched();

Of course!

> +	}
> +
> +	mmap_write_unlock(mm);

And VMAs can be mapped again now?

> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> --
> 2.34.1
>

I'm not sure if this idea can be made upstreamble in any way. But this patch or
anything that looks like it or fundamentally alters mm is just not acceptable,
sorry.

Lorenzo

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 11:43 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <aj0Mr3e9yt0kU-Qj@pedro-suse>

On 6/25/26 13:18, Pedro Falcato wrote:
> On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> This makes no sense. I think you really need to sit down and think about
>>> a design for this that doesn't introduce state machinery for boot, mm,
>>> and the VFS in one shot to solve a fringe problem...
>>
>> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
>> munmap and set some magical flags").
>>
>> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
>> in the coredump". And for some reason someone should configure that, that's a
>> rather weird toggle tbh.
>>
>> And the granularity ("file-backed shared memory") is completely odd.
>>
>>
>> Aren't there other ways we could optimize this internally?
>>
>> Like, if we know that a process is dead and cannot run anymore, downgrade writes
>> to reads (and make sure we block GUP write attempts accordingly), or would that
>> also not be sufficient?
>>
>>
>> Another thought:
>>
>> fs/coredump.c calls get_dump_page().
>>
>> get_dump_page() will not fault in any memory. So if a page is not in the page
>> tables at the time of the dump, it will not get included in the coredump. Which
>> means, that whether most non-anonymous memory will be included in a coredump is
>> already like playing the lottery.
>>
>> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
>> private modifications.
>>
>> Which makes me wonder: How much is tooling relying on file-backed pages to end
>> up in a coredump?
> 
> FWIW this mechanism already exists, see /proc/self/coredump_filter. The
> default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
> being dumped to a core dump, apart from ELF headers (these help the debugger
> trace back the mapped binary to the debug info using the buildid).
> 
> So the answer to this question is "approximately none" :)
> 

Ah, thanks! vma_dump_size() honors this, and I am sure through some magical
routing the information stored in m->dump_size will end up not dumping these pages.

Staring at elf_core_dump(), this "unmap some stuff" part is really, really
nasty, as it effectively removes the VMAs->segments from the dump. (unless I am
missing something important)

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Pedro Falcato @ 2026-06-25 11:18 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	ebiederm, viro, jack, jlayton, chuck.lever, alex.aring, arnd,
	keescook, mcgrof, j.granados, allen.lkml, linux-fsdevel,
	linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <9105c433-44a7-4e8f-bacb-def93d11a7f2@kernel.org>

On Thu, Jun 25, 2026 at 12:57:02PM +0200, David Hildenbrand (Arm) wrote:
> >> +
> >>  #define F_DUPFD		0	/* dup */
> >>  #define F_GETFD		1	/* get close_on_exec */
> >>  #define F_SETFD		2	/* set/clear close_on_exec */
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index a679b2448234..84f1ee7f32cf 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
> >>  
> >>  __setup("coredump_filter=", coredump_filter_setup);
> >>  
> >> +static unsigned long default_dump_pre_exit;
> >> +
> >> +static int __init coredump_pre_exit_setup(char *s)
> >> +{
> >> +	default_dump_pre_exit =
> >> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> >> +		MMF_DUMP_PRE_EXIT_MASK;
> >> +	return 1;
> >> +}
> >> +
> >> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> > 
> > This makes no sense. I think you really need to sit down and think about
> > a design for this that doesn't introduce state machinery for boot, mm,
> > and the VFS in one shot to solve a fringe problem...
> 
> Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
> munmap and set some magical flags").
> 
> We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
> in the coredump". And for some reason someone should configure that, that's a
> rather weird toggle tbh.
> 
> And the granularity ("file-backed shared memory") is completely odd.
> 
> 
> Aren't there other ways we could optimize this internally?
> 
> Like, if we know that a process is dead and cannot run anymore, downgrade writes
> to reads (and make sure we block GUP write attempts accordingly), or would that
> also not be sufficient?
> 
> 
> Another thought:
> 
> fs/coredump.c calls get_dump_page().
> 
> get_dump_page() will not fault in any memory. So if a page is not in the page
> tables at the time of the dump, it will not get included in the coredump. Which
> means, that whether most non-anonymous memory will be included in a coredump is
> already like playing the lottery.
> 
> This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
> private modifications.
> 
> Which makes me wonder: How much is tooling relying on file-backed pages to end
> up in a coredump?

FWIW this mechanism already exists, see /proc/self/coredump_filter. The
default is bits 0, 1, 4 and 5 (see core(5)), which maps back to no file pages
being dumped to a core dump, apart from ELF headers (these help the debugger
trace back the mapped binary to the debug info using the buildid).

So the answer to this question is "approximately none" :)

-- 
Pedro

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: David Hildenbrand (Arm) @ 2026-06-25 10:57 UTC (permalink / raw)
  To: Christian Brauner, Mike Rapoport, Lorenzo Stoakes, mjguzik,
	pfalcato, ebiederm, viro, jack, jlayton, chuck.lever, alex.aring,
	arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

>> +
>>  #define F_DUPFD		0	/* dup */
>>  #define F_GETFD		1	/* get close_on_exec */
>>  #define F_SETFD		2	/* set/clear close_on_exec */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a679b2448234..84f1ee7f32cf 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>>  
>>  __setup("coredump_filter=", coredump_filter_setup);
>>  
>> +static unsigned long default_dump_pre_exit;
>> +
>> +static int __init coredump_pre_exit_setup(char *s)
>> +{
>> +	default_dump_pre_exit =
>> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
>> +		MMF_DUMP_PRE_EXIT_MASK;
>> +	return 1;
>> +}
>> +
>> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

Staring at exit_mmap_mapped_shared(), ... this looks rather hacky ("let's fake
munmap and set some magical flags").

We're essentially saying "we don't want (pretty much) anything that's MAP_SHARED
in the coredump". And for some reason someone should configure that, that's a
rather weird toggle tbh.

And the granularity ("file-backed shared memory") is completely odd.

Aren't there other ways we could optimize this internally?

Like, if we know that a process is dead and cannot run anymore, downgrade writes
to reads (and make sure we block GUP write attempts accordingly), or would that
also not be sufficient?

Another thought:

fs/coredump.c calls get_dump_page().

get_dump_page() will not fault in any memory. So if a page is not in the page
tables at the time of the dump, it will not get included in the coredump. Which
means, that whether most non-anonymous memory will be included in a coredump is
already like playing the lottery.

This is true for MAP_SHARED file mappings and MAP_PRIVATE file mappings without
private modifications.

Which makes me wonder: How much is tooling relying on file-backed pages to end
up in a coredump?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  8:50 UTC (permalink / raw)
  To: brauner
  Cc: alex.aring, allen.lkml, arnd, chuck.lever, david, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, linux-mm, ljs, mcgrof, mjguzik,
	pfalcato, rppt, viro
In-Reply-To: <20260625-wappnen-drohbrief-wermutstropfen-c53538f01547@brauner>

On Thu, 25 Jun 2026 09:28:08 +0200 Christian Brauner <brauner@kernel.org> wrote:

> > +	coredump_pre_exit=
> > +			[KNL] Change the default value for
> > +			/proc/<pid>/coredump_pre_exit.
> > +			See also Documentation/filesystems/proc.rst.
> 
> Nah, we're not doing a separate file for this. That makes no sense
> whatsoever. I've already explained this in the first mail. There are
> effectively three modes:
> 
> (1) dump to a file
> (2) spawn super-privileged usermode helper process connect coredumping
>     process and said helper via pipe
> (3) coredumping process connects to AF_UNIX socket
> 
> Parameterize (1) and (2) via a command line arguments. I strongly
> suspect you're using some AI tooling so it should be able to figure out
> how this was done in the past.
> 
> (3) can be extended by just introducing a new flag value for struct
>     coredump_req. That is also illustrated by previous work.
> 
> We're not spreading procfs files. It's terrible api design especially
> for security sensitive changes.

The coredump socket approach is easier to implement because it allows for
interaction between the server and client, enabling the customization of
protocols. However, for the coredump file method, I can only think of
defining "r" and "R" through core_pattern to release flock and file-backed
shared data in advance. I'm unsure if this is feasible, as it changes the
original definition of core_pattern.

Regarding the coredump pipe, there is also a lack of a mechanism for the
pipe program to notify the coredump process, so it might still require
adding "r" and "R" at the end of core_pattern to indicate this, allowing
the coredump process to handle the early release on its own. I'm not sure
if my understanding is correct.

Even if the coredump pipe program obtains the file pointer from the process
that generated the coredump, it cannot reduce the reference count of the
file (which I understand is a very bad attempt). Since it cannot decrease
the reference count of the file, the early release must still be performed
by the task that generated the coredump. Given this situation, it seems
that we indeed need to use core_pattern for marking. I've thought for a
long time about more suitable solutions, but I haven't come up with any.

> > +#ifndef O_TMPCLOS
> > +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> > +#endif
> 
> Sorry, not going to happen. This doesn't not justify the addition of a
> new uapi value at all.

OK, if I use it at last, I will not put it in user header file.

> > +
> > +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> 
> This makes no sense. I think you really need to sit down and think about
> a design for this that doesn't introduce state machinery for boot, mm,
> and the VFS in one shot to solve a fringe problem...

I'll get rid of the attempt to add a new boot-up argument for this feature.

> [Severity: High]
> Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
> iteration invalidate the outer iterator? The loop traverses the maple tree
> using the iterator vmi. However, do_munmap() creates its own internal
> VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
> iterator is not updated to reflect these structural changes, its cached
> state becomes stale, which can lead to a use-after-free when vma_next()
> is subsequently called.
> 
> via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

When executing this traversal logic, we have already acquired a lock, and
the process has been frozen. The traversal logic goes from start to finish.
Are you sure that this approach could still have issues?

> [Severity: High]
> Is it safe to iterate the file descriptor table without holding
> rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
> kills other threads, concurrent threads can still trigger expand_files(),
> which replaces the fdt and frees the old one after an RCU grace period.

Since the process has already been frozen, shouldn't we not need to consider
such concurrency issues?

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
> 
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

> [Severity: Medium]
> Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
> of file->f_flags risks losing concurrent fcntl() updates since it doesn't
> hold file->f_lock.
> 
> Also, if a file has duplicated file descriptors (e.g., via dup()), will
> clearing O_TMPCLOS here prematurely skip the closure of the remaining
> descriptors? When encountering the duplicated descriptor later, the flag
> will already be cleared, leaving the shared file actively referenced.

Currently, this flag will only be used by the logic we added, so I believe
there won't be any issues.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Christian Brauner @ 2026-06-25  7:28 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Lorenzo Stoakes, brauner,
	mjguzik, pfalcato, ebiederm, viro, jack, jlayton, chuck.lever,
	alex.aring, arnd, keescook, mcgrof, j.granados, allen.lkml
  Cc: linux-fsdevel, linux-kernel, linux-arch, Xin Zhao, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>

> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
> 
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
> 
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d450861e..bc6d3859f874 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
>  			/proc/<pid>/coredump_filter.
>  			See also Documentation/filesystems/proc.rst.
>  
> +	coredump_pre_exit=
> +			[KNL] Change the default value for
> +			/proc/<pid>/coredump_pre_exit.
> +			See also Documentation/filesystems/proc.rst.

Nah, we're not doing a separate file for this. That makes no sense
whatsoever. I've already explained this in the first mail. There are
effectively three modes:

(1) dump to a file
(2) spawn super-privileged usermode helper process connect coredumping
    process and said helper via pipe
(3) coredumping process connects to AF_UNIX socket

Parameterize (1) and (2) via a command line arguments. I strongly
suspect you're using some AI tooling so it should be able to figure out
how this was done in the past.

(3) can be extended by just introducing a new flag value for struct
    coredump_req. That is also illustrated by previous work.

We're not spreading procfs files. It's terrible api design especially
for security sensitive changes.

> +static void coredump_pre_exit(void)
> +{
> +	struct task_struct *tsk = current;
> +	unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> +	if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> +		return;
> +
> +	/*
> +	 * Set O_TMPCLOS of file f_flags if file needs to be closed.
> +	 */
> +	if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> +	    !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> +		exit_mmap_mapped_shared(tsk->mm);
> +
> +	/*
> +	 * Check O_TMPCLOS of file f_flags to close file and clear it.
> +	 */
> +	exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
>  		return;
>  	}
>  
> +	coredump_pre_exit();
> +
>  	switch (cn->core_type) {
>  	case COREDUMP_FILE:
>  		if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b162d0..a58ffffcc31d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
>  #include <linux/file_ref.h>
>  #include <net/sock.h>
>  #include <linux/init_task.h>
> +#include <linux/filelock.h>
>  
>  #include "internal.h"
>  
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
>  	}
>  }
>  
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> +	struct files_struct *files = tsk->files;
> +	struct fdtable *fdt;
> +	struct file *file;
> +	unsigned int i, j = 0;
> +
> +	if (!files)
> +		return;
> +
> +	fdt = rcu_dereference_raw(files->fdt);
> +	for (;;) {
> +		unsigned long set;
> +
> +		i = j * BITS_PER_LONG;
> +		if (i >= fdt->max_fds)
> +			break;
> +		set = fdt->open_fds[j++];
> +		while (set) {
> +			if (!(set & 1))
> +				goto next_fd;
> +			file = fdt->fd[i];
> +			if (!file)
> +				goto next_fd;
> +			if (file->f_flags & O_TMPCLOS) {
> +				file->f_flags &= ~O_TMPCLOS;
> +				goto close_fd;
> +			}
> +			if (!checkflock)
> +				goto next_fd;
> +			if (!vfs_inode_has_locks(file_inode(file)))
> +				goto next_fd;
> +
> +close_fd:
> +			fdt->fd[i] = NULL;
> +			filp_close(file, files);
> +			cond_resched();
> +
> +next_fd:
> +			i++;
> +			set >>= 1;
> +		}
> +	}
> +}
> +
>  struct files_struct init_files = {
>  	.count		= ATOMIC_INIT(1),
>  	.fdt		= &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c894..99b5f219f7fa 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
>  	.write		= proc_coredump_filter_write,
>  	.llseek		= generic_file_llseek,
>  };
> +
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> +					   size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file_inode(file));
> +	struct mm_struct *mm;
> +	char buffer[PROC_NUMBUF];
> +	size_t len;
> +	int ret;
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = 0;
> +	mm = get_task_mm(task);
> +	if (mm) {
> +		unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> +		len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> +			       ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> +				MMF_DUMP_PRE_EXIT_SHIFT));
> +		mmput(mm);
> +		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> +	}
> +
> +	put_task_struct(task);
> +
> +	return ret;
> +}
> +
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> +					    const char __user *buf,
> +					    size_t count,
> +					    loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int val;
> +	int ret;
> +	int i;
> +	unsigned long mask;
> +
> +	ret = kstrtouint_from_user(buf, count, 0, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = -ESRCH;
> +	task = get_proc_task(file_inode(file));
> +	if (!task)
> +		goto out_no_task;
> +
> +	mm = get_task_mm(task);
> +	if (!mm)
> +		goto out_no_mm;
> +	ret = 0;
> +
> +	for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
> +		if (val & mask)
> +			mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +		else
> +			mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> +	}
> +
> +	mmput(mm);
> + out_no_mm:
> +	put_task_struct(task);
> + out_no_task:
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> +	.read		= proc_coredump_pre_exit_read,
> +	.write		= proc_coredump_pre_exit_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  #ifdef CONFIG_ELF_CORE
>  	REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> +	REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
>  #endif
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9dbd..dfd4717c7e3e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>  extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
>  extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
>  bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>  				 unsigned long addr, bool write);
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6a30..0555aaf50001 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
>  	(BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
>  	 BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>  
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK	11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT	(MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS	2
> +#define MMF_DUMP_PRE_EXIT_MASK	\
> +	(((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
> +
>  #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
>  # define MMF_DUMP_MASK_DEFAULT_ELF	BIT(MMF_DUMP_ELF_HEADERS)
>  #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cffc9..b4becbf6c0eb 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
>  extern __noreturn void do_group_exit(int);
>  
>  extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
>  extern void exit_itimers(struct task_struct *);
>  
>  extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..360604d653b4 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS	0x80000000	/* tag need close, temporarily used */
> +#endif

Sorry, not going to happen. This doesn't not justify the addition of a
new uapi value at all.

I'm also including various Sashkio comments:

sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Is it safe to expose an internal, temporary kernel flag in a UAPI header?
Userspace applications could intentionally or accidentally pass O_TMPCLOS to
open(), which might permanently pollute the userspace ABI and trigger
unexpected behavior during a coredump.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448234..84f1ee7f32cf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>  
>  __setup("coredump_filter=", coredump_filter_setup);
>  
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> +	default_dump_pre_exit =
> +		(simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> +		MMF_DUMP_PRE_EXIT_MASK;
> +	return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);

This makes no sense. I think you really need to sit down and think about
a design for this that doesn't introduce state machinery for boot, mm,
and the VFS in one shot to solve a fringe problem...





sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Does modifying the VMA maple tree via do_munmap() during the for_each_vma()
iteration invalidate the outer iterator? The loop traverses the maple tree
using the iterator vmi. However, do_munmap() creates its own internal
VMA_ITERATOR and removes the VMA from the tree. Because the outer vmi
iterator is not updated to reflect these structural changes, its cached
state becomes stale, which can lead to a use-after-free when vma_next()
is subsequently called.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: High]
Is it safe to iterate the file descriptor table without holding
rcu_read_lock()? Because coredump_pre_exit() is called before zap_threads()
kills other threads, concurrent threads can still trigger expand_files(),
which replaces the fdt and frees the old one after an RCU grace period.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com


sashiko.dev <sashiko@sashiko.dev>:

[Severity: Medium]
Similar to the issue in exit_mmap_mapped_shared(), this non-atomic update
of file->f_flags risks losing concurrent fcntl() updates since it doesn't
hold file->f_lock.

Also, if a file has duplicated file descriptors (e.g., via dup()), will
clearing O_TMPCLOS here prematurely skip the closure of the remaining
descriptors? When encountering the duplicated descriptor later, the flag
will already be cleared, leaving the shared file actively referenced.

via: https://sashiko.dev/#/message/20260624145552.70143-1-jackzxcui1989@163.com

-- 
Christian Brauner <brauner@kernel.org>

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-25  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <20260625011023.GM2636677@ZenIV>

Ah, I started replying to your first email, but this is better, this
gets to the heart of the matter. Please don't mind me responding to your
two questions in reverse.

On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> What's the fundamental difference between CWD and any open descriptor
> for a directory?  Why does it make sense to ban the former, but allow
> the equivalents done via the latter?

Yes! These two notions are very close --- but that's the *problem*, not
a reason to not care about the existence of the CWD and root FS. I want
to get rid of CWD in my processes not because it is fundamentally
different (it isn't), but because it is superfluous.

If one is capability-minded like me, it's a bad mistake that we ever had
this "working directory" notion to begin with, and yet another example
of the folks at Bell Labs sticking something in the kernel that was
really only needed by the shell, and that could have just been done in
userland.

The current working directory, roughly, is *just* some global state
holding a directory file descriptor. But I don't want that global state.
If I am writing my userland program (that is not a shell), I would not
create the global variable. I do not appreciate the fact that the kernel
foists that state upon me whether I like it or not.

Now obviously we cannot have a giant breaking change removing the notion
of a current working directory altogether. But we can allow individual
processes which don't want it to opt out, and that is what nulling out
these fields (and updating the path resolution code to cope with that)
allows.

There is no loss of expressive power doing this, because one can (and
should!) just use the `*at` and file descriptors. But there is, however,
the imposition of discipline. The programmer (or coding agent) is
encouraged to do everything with file descriptors rather than path
concatenations etc., because they need to use `*at` anyways, and then
voilà, without browbeating anyone in security seminars or code review, a
bunch of TOCTOU issues disappear simply because doing the right thing is
now the path of least resistance.

> Please, start with explaining what, in your opinion, a mount namespace
> _is_, and where does "mount X is attached at path P relative to mount
> Y" belong.

Let's take a pathological example:

- Process A has `/foo` bind-mounted at `/bar/foo`

- Process B has `/bar` without that bind mount, and `/foo` mounted at
  `/baz/foo`, as is possible because it is in a different mount
  namespace.

If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
because `..` is resolved according to the mount referenced in the open
file. (This is, by the way, very good! Directory file descriptors would
be perilous to use if this were not the case!)

The moral of the story is that "mount X is attached at path P relative
to mount Y" is information accessed in the mounts themselves (maybe via
their containing mount namespace, per the `mnt_ns` field, or maybe not,
I am not sure, but it is immaterial). In contrast, the mount namespace
of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
doesn't matter at all for this purpose.

I am not on a crusade against `struct mnt_namespace` in general; I am
just trying to null out `(struct nsproxy)::mnt_ns` in particular. (This
is just as I am not on a crusade against `struct path`, just `root` and
`pwd` of `struct fs_struct`.)

These days, `current->nsproxy->mnt_ns` is, to me, first and foremost,
there for the legacy mount API. Again, just like our CWD example above,
this is mostly just global state.

The new mount API drastically [^1] reduces the need for it, since it
allows referring to mounts explicitly via file descriptors. That's OK!
The argument is the same as the above --- I am *not* trying to limit
what can be done if one has all the right files open with the right
perms. I am just trying to limit what works out of the box --- to reduce
the default set of privileges, *especially* where the resources involved
are implicit and/or stateful.

[^1]: It doesn't *quite* eliminate the need for `nsproxy->mnt_ns`
    entirely, since (as I understand it, from reading the `move_mount`
    man page) it is still used for some authorization checks, since
    `O_PATH` file descriptors do not grant privileges other than mere
    discoverability. But that's a problem that could be solved later
    with an `O_MOUNT` option analogous to `O_RDONLY` or `O_WRONLY`. In
    the meantime, I am perfectly happy if my processes with null mount
    namespaces get `move_mount` permission errors.

^ permalink raw reply

* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Xin Zhao @ 2026-06-25  2:51 UTC (permalink / raw)
  To: viro
  Cc: alex.aring, allen.lkml, arnd, brauner, chuck.lever, ebiederm,
	j.granados, jack, jackzxcui1989, jlayton, keescook, linux-arch,
	linux-fsdevel, linux-kernel, mcgrof, mjguzik, pfalcato
In-Reply-To: <20260624162844.GK2636677@ZenIV>

On Wed, 24 Jun 2026 17:28:44 +0100 Al Viro <viro@zeniv.linux.org.uk> wrote:

> > +			if (file->f_flags & O_TMPCLOS) {
> > +				file->f_flags &= ~O_TMPCLOS;
> > +				goto close_fd;
> > +			}
> 
> *blink*
> 
> 	How could that possibly make sense?  Many descriptors
> may refer to the same file; what's more, many descriptor tables
> may contain such descriptors, so... just what is that code
> trying to do?

This is yet another serious mistake. Perhaps my test scenarios were not
complex enough, or I was overly confident in removing the logic that
cleared the O_TMPCLOS flag and performed debug printing only when the
reference count dropped to zero during that single close operation,
without conducting further tests.

In v5, I plan to avoid clearing the O_TMPCLOS flag to handle the situation
where multiple file descriptors map to a single file. Of course, there are
some cases where the lifecycle of this file may extend beyond the process
exit, but AFICT such situations either cannot last long or do not involve
memory in the case where i_nlink != 0. Therefore, keeping this flag seems
unlikely to cause any issues.

Since this flag is no longer used temporarily (it will never be cleared),
I would like to rename it to O_PRECLOS.

Thanks
Xin Zhao

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Al Viro @ 2026-06-25  1:10 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <103524f8-1658-41df-88e9-cf49c628a721@app.fastmail.com>

On Wed, Jun 24, 2026 at 07:53:53PM -0400, John Ericson wrote:
> I wanted to discuss a bit about each type of namespace to indicate that
> this is a concept I think works across the board --- it wouldn't be such
> a good solution for the process spawning API if it was only applicable
> to some but not all namespace types. But the truth is that I have
> thought about the FS cases the most, as I think you have picked up on.
> 
> If there is interest in landing
> 
>   1. null CWD
>   2. null root fs
>   3. null mount namespace
> 
> in isolation, and then returning to the other namespaces to iron out
> their details, that would be fantastic. It would be much nicer for me to
> get some momentum that way, without having to design everything all at
> once first before getting to implement anything.

Please, start with explaining what, in your opinion, a mount namespace _is_,
and where does "mount X is attached at path P relative to mount Y" belong.

What's the fundamental difference between CWD and any open descriptor for
a directory?  Why does it make sense to ban the former, but allow the
equivalents done via the latter?

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: John Ericson @ 2026-06-24 23:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Al Viro, Kees Cook,
	Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrU3bgUxp0k1y-U-uL0-fW2016Gmsyu9O_=830czEUGMcQ@mail.gmail.com>

On Wed, Jun 24, 2026, at 7:20 PM, Andy Lutomirski wrote:
> I think I like this, but some comments:

Thanks, that's really nice to hear!

While arguably this is just the culmination of a direction Linux has
been going in for a while, it could also be seen as a very "out there"
idea. That at least one person likes the rough sound of things makes me
feel a lot better!

> On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> > >   - null current working directory: relative paths with traditional,
> > >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> >
> > It's perfectly valid to cd to a directory that does not belong to
> > one's namespace.  We have fchdir.  What's wrong with letting it
> > continue working?
> >
> > Regardless of that, the current directory either needs to be a
> > directory or to be nothing at all, and if we support the latter, we
> > need to figure out what /proc will show.
>
> Thinking about this more: I think that handling CWD might actually be
> a prerequisite for the series and has little to do with namespaces.
> Maybe try adding, as a standalone feature, the ability to have a null
> CWD.  Define semantics and see what the implementation looks like.
>
> Then, if you add null namespaces, you could optionally make
> transitioning to a null namespace set a null CWD.  Or those features
> could be orthogonal.

Hehe, I had the same thought after working on the filesystem patches,
along with the analogous thought for the root filesystem. It had been so
long since I had done a `chroot` without also doing a mount namespace
`unshare` --- despite the former being much older --- that I had
forgotten this separation of concerns.

My apologies for forgetting to include this insight in the original
email.

> Maybe the way to go is to implement the ones that have clearer
> semantics and to defer the others.

I would much prefer this, actually.

I wanted to discuss a bit about each type of namespace to indicate that
this is a concept I think works across the board --- it wouldn't be such
a good solution for the process spawning API if it was only applicable
to some but not all namespace types. But the truth is that I have
thought about the FS cases the most, as I think you have picked up on.

If there is interest in landing

  1. null CWD
  2. null root fs
  3. null mount namespace

in isolation, and then returning to the other namespaces to iron out
their details, that would be fantastic. It would be much nicer for me to
get some momentum that way, without having to design everything all at
once first before getting to implement anything.

John

^ permalink raw reply

* Re: [RFC] Null Namespaces
From: Andy Lutomirski @ 2026-06-24 23:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Ericson, Li Chen, Cong Wang, Christian Brauner, linux-arch,
	linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria
In-Reply-To: <CALCETrWhXNetw-BsAaoyT31suMmjYLdMh9MAuLB2Lvk2ac-31g@mail.gmail.com>

On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:

> >   - null current working directory: relative paths with traditional,
> >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>
> It's perfectly valid to cd to a directory that does not belong to
> one's namespace.  We have fchdir.  What's wrong with letting it
> continue working?
>
> Regardless of that, the current directory either needs to be a
> directory or to be nothing at all, and if we support the latter, we
> need to figure out what /proc will show.

Thinking about this more: I think that handling CWD might actually be
a prerequisite for the series and has little to do with namespaces.
Maybe try adding, as a standalone feature, the ability to have a null
CWD.  Define semantics and see what the implementation looks like.

Then, if you add null namespaces, you could optionally make
transitioning to a null namespace set a null CWD.  Or those features
could be orthogonal.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox