Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH v2 2/2] power: supply: sbs-battery: Add PbAc, NiZn, RAM, and ZnAr support
From: Boris Shtrasman @ 2026-06-24 13:57 UTC (permalink / raw)
  To: Sebastian Reichel, Shuah Khan
  Cc: linux-pm, linux-kernel, linux-kselftest, linux-api,
	Boris Shtrasman
In-Reply-To: <20260624135718.286771-1-borissh1983@gmail.com>

Add support for PbAc, NiZn, RAM, and ZnAr chemistries as defined in the
Smart Battery Data Specification v1.1 (Section 5.1.30 DeviceChemistry).

Currently, the sbs-battery driver only handles LION, LiP, NiCd and NiMH.
The Smart Battery specification defines 8 possible values:
 - Lead Acid (PbAc)
 - Lithium Ion (LION)
 - Nickel Cadmium (NiCd)
 - Nickel Metal Hydride (NiMH)
 - Nickel Zinc (NiZn)
 - Rechargeable Alkaline-Manganese (RAM)
 - Zinc Air (ZnAr)
 - Lithium Polymer (LiP)

Link: https://sbs-forum.org/specs/sbdat110.pdf
Signed-off-by: Boris Shtrasman <borissh1983@gmail.com>
---
 drivers/power/supply/sbs-battery.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/power/supply/sbs-battery.c b/drivers/power/supply/sbs-battery.c
index 43c48196c167..42a941e99155 100644
--- a/drivers/power/supply/sbs-battery.c
+++ b/drivers/power/supply/sbs-battery.c
@@ -860,6 +860,14 @@ static int sbs_get_chemistry(struct sbs_info *chip,
 		chip->technology = POWER_SUPPLY_TECHNOLOGY_NiCd;
 	else if (!strncasecmp(chemistry, "NiMH", 4))
 		chip->technology = POWER_SUPPLY_TECHNOLOGY_NiMH;
+	else if (!strncasecmp(chemistry, "PbAc", 4))
+		chip->technology = POWER_SUPPLY_TECHNOLOGY_PbAc;
+	else if (!strncasecmp(chemistry, "NiZn", 4))
+		chip->technology = POWER_SUPPLY_TECHNOLOGY_NiZn;
+	else if (!strncasecmp(chemistry, "RAM", 3))
+		chip->technology = POWER_SUPPLY_TECHNOLOGY_RAM;
+	else if (!strncasecmp(chemistry, "ZnAr", 4))
+		chip->technology = POWER_SUPPLY_TECHNOLOGY_ZnAr;
 	else
 		chip->technology = POWER_SUPPLY_TECHNOLOGY_UNKNOWN;
 
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 1/2] power: supply: Add PbAc, NiZn, RAM, and ZnAr support
From: Boris Shtrasman @ 2026-06-24 13:57 UTC (permalink / raw)
  To: Sebastian Reichel, Shuah Khan
  Cc: linux-pm, linux-kernel, linux-kselftest, linux-api,
	Boris Shtrasman
In-Reply-To: <20260624135718.286771-1-borissh1983@gmail.com>

Add four new members to the POWER_SUPPLY_TECHNOLOGY
enum and sysfs interface to represent the Smart Battery
Data Specification v1.1 (Section 5.1.30 DeviceChemistry) battery types:

 - Lead Acid (PbAc)
 - Nickel Zinc (NiZn)
 - Rechargeable Alkaline-Manganese (RAM)
 - Zinc Air (ZnAr)

Update documentation to express these types.
Update ABI testing for these types.

Link: https://sbs-forum.org/specs/sbdat110.pdf
Signed-off-by: Boris Shtrasman <borissh1983@gmail.com>
---
 Documentation/ABI/testing/sysfs-class-power                   | 2 +-
 drivers/power/supply/power_supply_sysfs.c                     | 4 ++++
 include/linux/power_supply.h                                  | 4 ++++
 .../selftests/power_supply/test_power_supply_properties.sh    | 3 ++-
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power
index 32697b926cc8..5641f1fd5fd6 100644
--- a/Documentation/ABI/testing/sysfs-class-power
+++ b/Documentation/ABI/testing/sysfs-class-power
@@ -525,7 +525,7 @@ Description:
 
 		Valid values:
 			      "Unknown", "NiMH", "Li-ion", "Li-poly", "LiFe",
-			      "NiCd", "LiMn"
+			      "NiCd", "LiMn", "PbAc", "NiZn", "RAM", "ZnAr"
 
 
 What:		/sys/class/power_supply/<supply_name>/voltage_avg,
diff --git a/drivers/power/supply/power_supply_sysfs.c b/drivers/power/supply/power_supply_sysfs.c
index f30a7b9ccd5e..9d6b24856c8b 100644
--- a/drivers/power/supply/power_supply_sysfs.c
+++ b/drivers/power/supply/power_supply_sysfs.c
@@ -124,6 +124,10 @@ static const char * const POWER_SUPPLY_TECHNOLOGY_TEXT[] = {
 	[POWER_SUPPLY_TECHNOLOGY_LiFe]		= "LiFe",
 	[POWER_SUPPLY_TECHNOLOGY_NiCd]		= "NiCd",
 	[POWER_SUPPLY_TECHNOLOGY_LiMn]		= "LiMn",
+	[POWER_SUPPLY_TECHNOLOGY_PbAc]		= "PbAc",
+	[POWER_SUPPLY_TECHNOLOGY_NiZn]		= "NiZn",
+	[POWER_SUPPLY_TECHNOLOGY_RAM]		= "RAM",
+	[POWER_SUPPLY_TECHNOLOGY_ZnAr]		= "ZnAr",
 };
 
 static const char * const POWER_SUPPLY_CAPACITY_LEVEL_TEXT[] = {
diff --git a/include/linux/power_supply.h b/include/linux/power_supply.h
index 7a5e4c3242a0..034800cd21da 100644
--- a/include/linux/power_supply.h
+++ b/include/linux/power_supply.h
@@ -83,6 +83,10 @@ enum {
 	POWER_SUPPLY_TECHNOLOGY_LiFe,
 	POWER_SUPPLY_TECHNOLOGY_NiCd,
 	POWER_SUPPLY_TECHNOLOGY_LiMn,
+	POWER_SUPPLY_TECHNOLOGY_PbAc,
+	POWER_SUPPLY_TECHNOLOGY_NiZn,
+	POWER_SUPPLY_TECHNOLOGY_RAM,
+	POWER_SUPPLY_TECHNOLOGY_ZnAr,
 };
 
 enum {
diff --git a/tools/testing/selftests/power_supply/test_power_supply_properties.sh b/tools/testing/selftests/power_supply/test_power_supply_properties.sh
index a66b1313ed88..1ebac6fe5d23 100755
--- a/tools/testing/selftests/power_supply/test_power_supply_properties.sh
+++ b/tools/testing/selftests/power_supply/test_power_supply_properties.sh
@@ -74,7 +74,8 @@ for DEVNAME in $supplies; do
 	test_sysfs_prop_optional model_name
 	test_sysfs_prop_optional manufacturer
 	test_sysfs_prop_optional serial_number
-	test_sysfs_prop_optional_list technology "Unknown","NiMH","Li-ion","Li-poly","LiFe","NiCd","LiMn"
+	test_sysfs_prop_optional_list technology "Unknown","NiMH","Li-ion","Li-poly","LiFe","NiCd"\
+		,"LiMn","PbAc","NiZn","RAM","ZnAr"
 
 	test_sysfs_prop_optional cycle_count
 
-- 
2.47.3


^ permalink raw reply related

* PATCH v2 0/2] Power: supply: Add PbAc, NiZn, RAM, and ZnAr support
From: Boris Shtrasman @ 2026-06-24 13:57 UTC (permalink / raw)
  To: Sebastian Reichel, Shuah Khan
  Cc: linux-pm, linux-kernel, linux-kselftest, linux-api,
	Boris Shtrasman

These series adds support for PbAc, NiZn, RAM, and ZnAr chemistries as 
defined in the Smart Battery Data Specification v1.1 (Section 5.1.30 
DeviceChemistry).

Currently, the sbs-battery driver only handles LION, LiP, NiCd and NiMH.
The Smart Battery specification defines 8 possible values:
 - Lead Acid (PbAc)
 - Lithium Ion (LION)
 - Nickel Cadmium (NiCd)
 - Nickel Metal Hydride (NiMH)
 - Nickel Zinc (NiZn)
 - Rechargeable Alkaline-Manganese (RAM)
 - Zinc Air (ZnAr)
 - Lithium Polymer (LiP)

Map the missing specification values to their respective core kernel
POWER_SUPPLY_TECHNOLOGY definitions and documenation, declare these
values into selftest. 

In selftest LiMn is moved to the next line to comply with
checkpatch warning after adding said types.

It is an update for 
https://lore.kernel.org/linux-pm/ajmc_naB7zYv0SPY@venus.

Link: https://sbs-forum.org/specs/sbdat110.pdf
Signed-off-by: Boris Shtrasman <borissh1983@gmail.com>
--
Changes in V2:

1. Seperate into two patches.
2. Modify Documenation, self test and sysfs interface. self test is
updated as the documeation is now mentioning them.
--

Boris Shtrasman (2):
  power: supply: Add PbAc, NiZn, RAM, and ZnAr support
  power: supply: sbs-battery: Add PbAc, NiZn, RAM, and ZnAr support

 Documentation/ABI/testing/sysfs-class-power               | 2 +-
 drivers/power/supply/power_supply_sysfs.c                 | 4 ++++
 drivers/power/supply/sbs-battery.c                        | 8 ++++++++
 include/linux/power_supply.h                              | 4 ++++
 .../power_supply/test_power_supply_properties.sh          | 3 ++-
 5 files changed, 19 insertions(+), 2 deletions(-)

-- 
2.47.3


^ permalink raw reply

* Re: [f2fs-dev] [PATCH v8 00/17] Subject: Exposing case folding behavior
From: patchwork-bot+f2fs @ 2026-06-24  8:59 UTC (permalink / raw)
  To: Chuck Lever
  Cc: viro, brauner, jack, pc, yuezhang.mo, cem, almaz.alexandrovich,
	adilger.kernel, linux-cifs, sfrench, slava, linux-ext4,
	linkinjeon, sprasad, frank.li, ronniesahlberg, glaubitz, jaegeuk,
	hirofumi, linux-nfs, tytso, linux-api, linux-f2fs-devel,
	linux-xfs, senozhatsky, chuck.lever, hansg, anna, linux-fsdevel,
	sj1557.seo, trondmy
In-Reply-To: <20260217214741.1928576-1-cel@kernel.org>

Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Christian Brauner <brauner@kernel.org>:

On Tue, 17 Feb 2026 16:47:24 -0500 you wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Following on from
> 
> https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
> 
> I'm attempting to implement enough support in the Linux VFS to
> enable file services like NFSD and ksmbd (and user space
> equivalents) to provide the actual status of case folding support
> in local file systems. The default behavior for local file systems
> not explicitly supported in this series is to reflect the usual
> POSIX behaviors:
> 
> [...]

Here is the summary with links:
  - [f2fs-dev,v8,01/17] fs: Move file_kattr initialization to callers
    (no matching commit)
  - [f2fs-dev,v8,02/17] fs: Add case sensitivity flags to file_kattr
    https://git.kernel.org/jaegeuk/f2fs/c/3035e4454142
  - [f2fs-dev,v8,03/17] fat: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,04/17] exfat: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,05/17] ntfs3: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,06/17] hfs: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,07/17] hfsplus: Report case sensitivity in fileattr_get
    (no matching commit)
  - [f2fs-dev,v8,08/17] ext4: Report case sensitivity in fileattr_get
    (no matching commit)
  - [f2fs-dev,v8,09/17] xfs: Report case sensitivity in fileattr_get
    (no matching commit)
  - [f2fs-dev,v8,10/17] cifs: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,11/17] nfs: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,12/17] f2fs: Add case sensitivity reporting to fileattr_get
    (no matching commit)
  - [f2fs-dev,v8,13/17] vboxsf: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,14/17] isofs: Implement fileattr_get for case sensitivity
    (no matching commit)
  - [f2fs-dev,v8,15/17] nfsd: Report export case-folding via NFSv3 PATHCONF
    (no matching commit)
  - [f2fs-dev,v8,16/17] nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
    (no matching commit)
  - [f2fs-dev,v8,17/17] ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [f2fs-dev] [PATCH v14 00/15] Exposing case folding behavior
From: patchwork-bot+f2fs @ 2026-06-24  8:59 UTC (permalink / raw)
  To: Chuck Lever
  Cc: viro, brauner, jack, pc, yuezhang.mo, cem, roland.mainz,
	almaz.alexandrovich, adilger.kernel, linux-cifs, sfrench, slava,
	djwong, linux-ext4, linkinjeon, stfrench, sprasad, frank.li,
	ronniesahlberg, glaubitz, jaegeuk, hirofumi, linux-nfs, tytso,
	linux-api, linux-f2fs-devel, linux-xfs, senozhatsky, chuck.lever,
	hansg, anna, linux-fsdevel, sj1557.seo, trondmy
In-Reply-To: <20260507-case-sensitivity-v14-0-e62cc8200435@oracle.com>

Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Christian Brauner <brauner@kernel.org>:

On Thu, 07 May 2026 04:52:53 -0400 you wrote:
> Christian, let's lock this one in. I will post subsequent changes
> as delta patches.
> 
> Following on from:
> 
> https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
> 
> [...]

Here is the summary with links:
  - [f2fs-dev,v14,01/15] fs: Move file_kattr initialization to callers
    https://git.kernel.org/jaegeuk/f2fs/c/14c3197ecf07
  - [f2fs-dev,v14,02/15] fs: Add case sensitivity flags to file_kattr
    https://git.kernel.org/jaegeuk/f2fs/c/3035e4454142
  - [f2fs-dev,v14,03/15] fat: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/c92db2ca726f
  - [f2fs-dev,v14,04/15] exfat: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/27e0b573dd4a
  - [f2fs-dev,v14,05/15] ntfs3: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/eeb7b37b9700
  - [f2fs-dev,v14,06/15] hfs: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/b6fe046c3023
  - [f2fs-dev,v14,07/15] hfsplus: Report case sensitivity in fileattr_get
    https://git.kernel.org/jaegeuk/f2fs/c/a6469a15eefe
  - [f2fs-dev,v14,08/15] xfs: Report case sensitivity in fileattr_get
    https://git.kernel.org/jaegeuk/f2fs/c/c9da43e4e5c3
  - [f2fs-dev,v14,09/15] cifs: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/e50bc12f5a36
  - [f2fs-dev,v14,10/15] nfs: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/92d67628a1a9
  - [f2fs-dev,v14,11/15] vboxsf: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/ef14aa143f1d
  - [f2fs-dev,v14,12/15] isofs: Implement fileattr_get for case sensitivity
    https://git.kernel.org/jaegeuk/f2fs/c/7bbd51b1d748
  - [f2fs-dev,v14,13/15] nfsd: Report export case-folding via NFSv3 PATHCONF
    https://git.kernel.org/jaegeuk/f2fs/c/211cb2ba4877
  - [f2fs-dev,v14,14/15] nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
    https://git.kernel.org/jaegeuk/f2fs/c/01ee7c3d2e23
  - [f2fs-dev,v14,15/15] ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION
    https://git.kernel.org/jaegeuk/f2fs/c/0164df1d1de7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-24  7:12 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> The CRIU fifo test fails with this change. The problem is that vmsplice
> with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> 
> It seems we need a fix like this one:
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 429b0714ec57..6fc49e933727 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> file *filp)
> 
>         /* We can only do regular read/write on fifos */
>         stream_open(inode, filp);
> +       filp->f_mode |= FMODE_NOWAIT;
> 
>         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
>         case FMODE_READ:

Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
named fifos? Or this is merely a test?

If this is just a test, I think we need not to preserve this behavior.

I did debian code search with regex "vmsplice.*SPLICE_F_NONBLOCK" and I
found very few packages. And it seems all them use pipes, not named fifos.

(On speed: I still think that my vmsplice patches are good thing,
despite performance regressions in CRIU.)

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
	Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
	netdev, linux-bluetooth, ell
In-Reply-To: <CAHk-=wgNG=F3xO9PjL0RcKy3UWvq0Np9uZu+nFUQBAA8So9xdA@mail.gmail.com>

On Tue, Jun 23, 2026 at 11:56:10AM -0700, Linus Torvalds wrote:
> On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > We're aware of that and are taking it into account in the allowlist:
> 
> Note that if we can  just unconditionally make it depend on
> CAP_NET_ADMIN, that would be good - independently of any allowlist.
> 
> Because if iwd and abluetoothd are the main two users, and both of
> those already require CAP_NET_ADMIN anyway...

There's also cryptsetup, including unprivileged benchmarking and also
(in theory) formatting support, and pre-7.0 versions of iproute2 which
used it for computing SHA-1 hashes of BPF programs.

If we broke unprivileged 'cryptsetup benchmark', some people would
definitely notice.  However, since it's just a manually-run benchmark
anyway, users could just run it with sudo.

I don't know about the iproute2 case.

It depends how aggressive we want to be.  My current proposal
(https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/)
has the entries in the allowlist marked as either privileged or
unprivileged.  There are just a few unprivileged ones, for cryptsetup
and iproute2 as mentioned.  But we could try doing away with the
unprivileged ones entirely and see who complains.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Linus Torvalds @ 2026-06-23 18:56 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
	Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
	netdev, linux-bluetooth, ell
In-Reply-To: <20260623164932.GA1793@sol>

On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
>
> We're aware of that and are taking it into account in the allowlist:

Note that if we can  just unconditionally make it depend on
CAP_NET_ADMIN, that would be good - independently of any allowlist.

Because if iwd and abluetoothd are the main two users, and both of
those already require CAP_NET_ADMIN anyway...

                Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-23 17:29 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, torvalds, val, viro, willy
In-Reply-To: <20260623094211.1080873-1-safinaskar@gmail.com>

On Tue, Jun 23, 2026 at 2:42 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > Actually, this change introduces a performance and functional
> > regression for CRIU.
> >
> > Here is a brief overview of how CRIU currently dumps memory pages:
> >
> > CRIU injects a parasite code blob into the target process's address
> > space. The parasite invokes vmsplice() with the SPLICE_F_GIFT flag to
> > pin physical pages directly inside a pipe without copying them. The main
> > CRIU process then takes over from outside the target context, calling
> > splice() on the other end of the pipe to stream the data directly into
> > checkpoint image files or a remote network socket.
> >
> > I ran a simple test that creates an anonymous mapping and touches every
> > page within it:
> > Without this patch, CRIU takes 9 seconds to dump the test process.
> > With this patch, It takes 18 seconds...
> >
> > Plus, it obviously introduces some memory overhead.
> >
> > If these changes are merged, we will need to completely rework the
> > memory dumping mechanism in CRIU. Using vmsplice() in this proposed form
> > no longer makes any sense for our architecture...
>
> I just have read some docs for CRIU. I found this statement:
>
> > #### Why `splice` is Better:
> > *   **Consistency via COW**: The `SPLICE_F_GIFT` flag ensures that if the process modifies a "gifted" page after resuming, the kernel performs a **Copy-on-Write (COW)**. The pipe buffer > continues to hold the *original* version of the page as it existed at the moment of the `vmsplice()` call, ensuring a perfectly consistent snapshot of that page.
>
> This is wrong (with released kernels). I confirmed this by testing this on my current kernel (6.12.90).
>
> See the code in the end of this message.
>
> If you actually rely on mentioned consistency, then, it seems, CRIU is broken.
>
> So, in fact, my patch actually brings consistency to CRIU. :)

Askar, unfortunately, the statement about "Consistency via COW" is just
"AI imagination". The under-the-hood docs were recently transferred from
the criu.org wiki using some AI transformations, which introduced this
wrong statement. The original document can be found here:
https://criu.org/Memory_dumping_and_restoring.

To clarify, CRIU does not rely on page consistency for intermediate
pre-dumps. The pre-dump mechanism is designed to be iterative: During a
pre-dump, tasks are briefly frozen to vmsplice dirty pages into pipes.
Then the tasks are resumed, and CRIU drains the pipes. If the process
modifies a page after it has been spliced, the data in the pipe may
become inconsistent. However, any such modification is tracked by the
soft-dirty page tracker. In the next pre-dump iteration (or the final
dump), these modified pages will be identified as dirty again and
re-dumped. During restore, the images are applied sequentially, and the
final dump (taken while the process is fully frozen) ensures we
reconstruct a consistent final state.

But what really matters in this scheme is the vmsplice performance.
The proposed change significantly slows it down. In the case of CRIU,
vmsplice performance is critical because the target process is frozen
during these calls. Minimizing the freeze time is the primary goal of
pre-dumping to make migration almost invisible to the user workload.

Thanks,
Andrei

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 16:49 UTC (permalink / raw)
  To: Bastien Nocera
  Cc: linux-crypto, Herbert Xu, Marcel Holtmann, Luiz Augusto von Dentz,
	linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
	linux-bluetooth, ell
In-Reply-To: <7d08a6df54279e9915f5df6bd4e5e5dde52b4fe1.camel@hadess.net>

On Tue, Jun 23, 2026 at 02:44:28PM +0200, Bastien Nocera wrote:
> Hey,
> 
> Replying to this older patch.
> 
> On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
> <snip>
> > This isn't intended to change anything overnight.  After all, most Linux
> > distros won't be able to disable the kconfig options quite yet, mainly
> > because of iwd.  But this should create a bit more impetus for these
> > userspace programs to be fixed, and the documentation update should also
> > help prevent more users from appearing.
> 
> There are 2 other users that I know of: bluez, and the ell library
> (used by iwd and bluez).
>
> From what I could tell, bluetoothd uses AF_ALG for cryptography:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
> 
> It uses "ecb(aes)" and "cmac(aes)" as algorithms.
> 
> Finally, it also uses them both again:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
> through ell:
> https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
> 
> Because that's a question that also came up, bluetoothd also uses the
> CAP_NET_ADMIN capability.
> 
> I'll let Luiz and Marcel take it over from here.
> 

We're aware of that and are taking it into account in the allowlist:
https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/
If you have any feedback on the allowlist, please respond to that patch.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Bastien Nocera @ 2026-06-23 12:44 UTC (permalink / raw)
  To: Eric Biggers, linux-crypto, Herbert Xu, Marcel Holtmann,
	Luiz Augusto von Dentz
  Cc: linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
	linux-bluetooth, ell
In-Reply-To: <20260430011544.31823-1-ebiggers@kernel.org>

Hey,

Replying to this older patch.

On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
<snip>
> This isn't intended to change anything overnight.  After all, most Linux
> distros won't be able to disable the kconfig options quite yet, mainly
> because of iwd.  But this should create a bit more impetus for these
> userspace programs to be fixed, and the documentation update should also
> help prevent more users from appearing.

There are 2 other users that I know of: bluez, and the ell library
(used by iwd and bluez).

From what I could tell, bluetoothd uses AF_ALG for cryptography:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c

It uses "ecb(aes)" and "cmac(aes)" as algorithms.

Finally, it also uses them both again:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
through ell:
https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c

Because that's a question that also came up, bluetoothd also uses the
CAP_NET_ADMIN capability.

I'll let Luiz and Marcel take it over from here.

Cheers

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-23  9:42 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-xVCP5HSUNwphFrKPdW0Qh1pA33A6npac60WArkZMFt7w@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> Actually, this change introduces a performance and functional
> regression for CRIU.
> 
> Here is a brief overview of how CRIU currently dumps memory pages:
> 
> CRIU injects a parasite code blob into the target process's address
> space. The parasite invokes vmsplice() with the SPLICE_F_GIFT flag to
> pin physical pages directly inside a pipe without copying them. The main
> CRIU process then takes over from outside the target context, calling
> splice() on the other end of the pipe to stream the data directly into
> checkpoint image files or a remote network socket.
> 
> I ran a simple test that creates an anonymous mapping and touches every
> page within it:
> Without this patch, CRIU takes 9 seconds to dump the test process.
> With this patch, It takes 18 seconds...
> 
> Plus, it obviously introduces some memory overhead.
> 
> If these changes are merged, we will need to completely rework the
> memory dumping mechanism in CRIU. Using vmsplice() in this proposed form
> no longer makes any sense for our architecture...

I just have read some docs for CRIU. I found this statement:

> #### Why `splice` is Better:
> *   **Consistency via COW**: The `SPLICE_F_GIFT` flag ensures that if the process modifies a "gifted" page after resuming, the kernel performs a **Copy-on-Write (COW)**. The pipe buffer > continues to hold the *original* version of the page as it existed at the moment of the `vmsplice()` call, ensuring a perfectly consistent snapshot of that page.

This is wrong (with released kernels). I confirmed this by testing this on my current kernel (6.12.90).

See the code in the end of this message.

If you actually rely on mentioned consistency, then, it seems, CRIU is broken.

So, in fact, my patch actually brings consistency to CRIU. :)

-- 
Askar Safin




#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <errno.h>

int
main (void)
{
    int p[2];
    if (pipe (p) != 0)
        abort ();
    char buf[1] = {'a'};
    struct iovec iov[] = {
        {
            .iov_base = buf,
            .iov_len = 1,
        }
    };
    // I pass "SPLICE_F_NONBLOCK | SPLICE_F_GIFT" here, because this is what criu passes
    if (vmsplice (p[1], iov, 1, SPLICE_F_NONBLOCK | SPLICE_F_GIFT) != 1)
        abort ();
    if (close (p[1]) != 0)
        abort ();
    buf[0] = 'b';
    char buf2[1];
    if (read (p[0], buf2, 1) != 1)
        abort ();
    printf ("[%c]\n", buf2[0]); // Prints "b" as opposed to "a" on Linux 6.12.90
    return 0;
}

^ permalink raw reply

* [RFC] signal: per-thread control over alternate signal stack delivery for selected signals
From: Tim Parth @ 2026-06-23  6:30 UTC (permalink / raw)
  To: linux-api@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org

Hi,

I am looking for guidance on a Linux signal ABI limitation that shows up in multi-runtime processes, specifically a .NET host loading a Go c-shared library.

Disclaimer: I am reporting this from the application/runtime integration side, not as a kernel developer. I arrived here after tracing crashes in a .NET application hosting a Go shared library through several runtime-specific issues, reproductions, and analyses. My understanding of the Linux signal subsystem and ABI details is therefore limited, and I may be missing important details.

The technical summary below reflects my best understanding of the issue based on the referenced investigations. I used AI-assisted editing to help structure and clarify this report, but the observations, reproducer, and referenced analyses come from the linked investigations.

This is not a claim that the current kernel behavior violates the existing ABI. Rather, I believe the current ABI lacks a way for multiple language runtimes in the same process to compose their signal and sigaltstack requirements safely.

Observed failure
================

A .NET process loads a Go shared library built with -buildmode=c-shared and calls it via P/Invoke. Under stress, the process crashes with SIGSEGV while CoreCLR is handling SIGRTMIN for runtime activation / GC suspension.

The reproducer is here:

https://github.com/egonelbre/csharp-go-interop-issue/tree/main/dotnet-go-reproducer

Related runtime issues:

https://github.com/golang/go/issues/78883
https://github.com/dotnet/runtime/issues/127320

The .NET-side analysis shows that the crash happens inside CoreCLR's inject_activation_handler path. The kernel delivered SIGRTMIN on the thread's alternate signal stack, and CoreCLR then ran a call chain deep enough to overflow that stack. In the reported case the per-thread alternate stack installed by CoreCLR was 16 KiB. Increasing it to around 49 KiB avoids the crash in the provided stress test, but that is a runtime-specific mitigation and does not address the general ABI composition problem.

Current ABI interaction
=======================

The problematic interaction is:

1. Signal disposition, including SA_ONSTACK, is per-process.
2. sigaltstack is per-thread.
3. On signal delivery, Linux uses the alternate signal stack if the handler has SA_ONSTACK and the current thread has an alternate stack.
4. The Go runtime documents that non-Go signal handlers must use SA_ONSTACK, because Go may be running on limited stacks. For -buildmode=c-shared, when Go sees an existing signal handler it may turn on SA_ONSTACK and otherwise keep the existing handler.
5. CoreCLR has internal signals such as SIGRTMIN whose handlers may need a different stack policy or a larger stack budget than the alternate stack currently registered on that thread.

The result is that one runtime can make a process-wide SA_ONSTACK decision that affects handlers and threads owned by another runtime. The other runtime can install a larger per-thread sigaltstack, but that becomes an arms race and does not give a runtime any way to express which signals should use which stack policy on a particular thread.

Why existing mechanisms do not fully solve this
===============================================

- Raising SIGSTKSZ or MINSIGSTKSZ does not solve the general issue. The kernel can only know the signal frame requirements, not the maximum user-space stack consumption of an arbitrary signal handler and everything it calls.

- The kernel cannot automatically extend an alternate signal stack.

- Clearing SA_ONSTACK with sigaction is process-wide and can violate the requirements of another runtime, for example Go's requirement that signal handlers run on an alternate stack when Go code may be interrupted.

- SS_AUTODISARM helps with a different class of problems, such as avoiding corruption when switching away from a signal handler, but it does not let a thread express "use an alternate stack for SIGSEGV but not for this runtime-internal suspension signal", nor does it provide separate stack policies for different signals.

Possible ABI direction
======================

One possible direction would be an opt-in, per-thread signal-altstack policy, for example a prctl() or similar interface that lets a thread provide a signal mask for which SA_ONSTACK should be ignored on that thread:
PR_SET_SIGALTSTACK_EXCLUDE_MASK(sigset_t *mask, size_t sigsetsize)

The default mask would be empty, preserving current behavior. Signal delivery would then become, conceptually:

if (handler_has_SA_ONSTACK &&
thread_has_altstack &&
!signal_is_in_current_thread_altstack_exclude_mask)
deliver_on_altstack;
else
deliver_on_normal_stack;

This is only a sketch. I am not attached to this exact interface. Another shape might be preferable, such as a more general per-thread/per-signal alternate stack policy or a way to associate alternate stack requirements with particular signals.

Questions
=========

1. Is the signal maintainers' view that multi-runtime processes should solve this entirely in userspace by agreeing on one sufficiently large per-thread sigaltstack?

2. Would a per-thread/per-signal opt-in policy for alternate signal stack delivery be considered acceptable as a Linux UAPI extension?

3. If such a UAPI is plausible, is prctl() the right place, or would maintainers prefer a different interface?

4. Which subsystem/list should own this discussion? I am sending this first to linux-api and LKML because this appears to be a userspace ABI issue around signal delivery.

Environment from the reproducer report
======================================

- Architecture: x86_64
- OS: Linux
- Example distro: Ubuntu 24.04
- Go: go1.26.2 linux/amd64
- .NET: 10.0.6 and runtime main were tested in the linked report
- Signal involved in the reproducer: SIGRTMIN
- Failure mode: SIGSEGV while running CoreCLR activation handling on the
alternate signal stack

Thanks,

Tim Parth

^ permalink raw reply

* [PATCH v2 2/2] selftests/fuse: add test for FUSE_HAS_SYNCFS privilege gating
From: Jimmy Zuber @ 2026-06-19 17:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, linux-fsdevel, linux-api, linux-kernel, Shuah Khan,
	linux-kselftest
In-Reply-To: <20260619170251.1154562-1-jamz@amazon.com>

Add a selftest that talks the raw FUSE protocol over /dev/fuse (rather
than via libfuse, which negotiates INIT internally) so it can both choose
whether to advertise FUSE_HAS_SYNCFS and directly observe whether a
FUSE_SYNCFS opcode is forwarded by the kernel.

Four cases are covered:

  T1: host-root mount, server sets FUSE_HAS_SYNCFS
      -> FUSE_SYNCFS must reach the server.
  T2: host-root mount, server does not opt in
      -> FUSE_SYNCFS must not be sent (back-compat).
  T3: server opts in but opened /dev/fuse without CAP_SYS_ADMIN while still
      in the initial user namespace
      -> FUSE_SYNCFS must be withheld.  This is the case that distinguishes
         gating on the server's privilege from gating on the mount's user
         namespace.
  T4: unprivileged user-namespace mount, server sets FUSE_HAS_SYNCFS
      -> FUSE_SYNCFS must be withheld.

T4 requires CONFIG_USER_NS and the ability to create an unprivileged
user-namespace mount; it is skipped otherwise.

Signed-off-by: Jimmy Zuber <jamz@amazon.com>
Assisted-by: Claude:claude-opus-4-8 [Claude-Code]
---
 .../selftests/filesystems/fuse/.gitignore     |   1 +
 .../selftests/filesystems/fuse/Makefile       |   2 +-
 .../selftests/filesystems/fuse/test_syncfs.c  | 370 ++++++++++++++++++
 3 files changed, 372 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/filesystems/fuse/test_syncfs.c

diff --git a/tools/testing/selftests/filesystems/fuse/.gitignore b/tools/testing/selftests/filesystems/fuse/.gitignore
index 3e72e742d08e..4e92f363c74a 100644
--- a/tools/testing/selftests/filesystems/fuse/.gitignore
+++ b/tools/testing/selftests/filesystems/fuse/.gitignore
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 fuse_mnt
 fusectl_test
+test_syncfs
diff --git a/tools/testing/selftests/filesystems/fuse/Makefile b/tools/testing/selftests/filesystems/fuse/Makefile
index 612aad69a93a..cbba01635226 100644
--- a/tools/testing/selftests/filesystems/fuse/Makefile
+++ b/tools/testing/selftests/filesystems/fuse/Makefile
@@ -2,7 +2,7 @@
 
 CFLAGS += -Wall -O2 -g $(KHDR_INCLUDES)
 
-TEST_GEN_PROGS := fusectl_test
+TEST_GEN_PROGS := fusectl_test test_syncfs
 TEST_GEN_FILES := fuse_mnt
 
 include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/fuse/test_syncfs.c b/tools/testing/selftests/filesystems/fuse/test_syncfs.c
new file mode 100644
index 000000000000..90fb3478554a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/fuse/test_syncfs.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that FUSE_SYNCFS is propagated to a userspace server only when the
+ * server opts in with FUSE_HAS_SYNCFS *and* opened /dev/fuse with
+ * CAP_SYS_ADMIN in the initial user namespace.
+ *
+ * Unlike the libfuse-based selftests, this talks the raw FUSE wire protocol
+ * over /dev/fuse so it can (a) choose whether to advertise FUSE_HAS_SYNCFS in
+ * the INIT reply and (b) observe directly whether a FUSE_SYNCFS opcode arrives.
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <sys/wait.h>
+#include <linux/capability.h>
+#include <linux/fuse.h>
+
+#include "../../kselftest.h"
+
+#define FUSE_ROOT_ID 1
+
+/*
+ * Add or drop CAP_SYS_ADMIN in the effective set via raw capget/capset
+ * (avoids a libcap dependency).  Used to construct a server that opens
+ * /dev/fuse without the privilege FUSE_HAS_SYNCFS requires.
+ */
+static int set_sysadmin(int on)
+{
+	struct __user_cap_header_struct hdr = {
+		.version = _LINUX_CAPABILITY_VERSION_3,
+		.pid = 0,
+	};
+	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {};
+	unsigned int idx = CAP_SYS_ADMIN >> 5;
+	__u32 bit = 1U << (CAP_SYS_ADMIN & 31);
+
+	if (syscall(SYS_capget, &hdr, data))
+		return -1;
+	if (on)
+		data[idx].effective |= bit;
+	else
+		data[idx].effective &= ~bit;
+	return syscall(SYS_capset, &hdr, data);
+}
+
+/*
+ * eventfd the server child writes once when it receives FUSE_SYNCFS; the
+ * parent poll()s it to observe (or rule out) propagation.
+ */
+static int syncfs_evfd;
+
+static void reply(int fd, uint64_t unique, int error, void *data, size_t len)
+{
+	struct fuse_out_header oh = {
+		.len = sizeof(oh) + len,
+		.error = error,
+		.unique = unique,
+	};
+	struct iovec iov[2] = {
+		{ &oh, sizeof(oh) },
+		{ data, len },
+	};
+
+	if (writev(fd, iov, data ? 2 : 1) < 0)
+		ksft_print_msg("server writev failed: %s\n", strerror(errno));
+}
+
+static void fill_attr(struct fuse_attr *a, uint64_t ino, uint32_t mode,
+		      uint32_t nlink)
+{
+	memset(a, 0, sizeof(*a));
+	a->ino = ino;
+	a->mode = mode;
+	a->nlink = nlink;
+	a->blksize = 4096;
+}
+
+/*
+ * Minimal FUSE server.  Advertises FUSE_HAS_SYNCFS in its INIT reply iff
+ * @advertise is set.  Signals syncfs_evfd when a FUSE_SYNCFS opcode arrives.
+ */
+#define SERVER_MAX_WRITE 65536
+static void run_server(int fd, int advertise)
+{
+	/*
+	 * The kernel rejects reads (EINVAL) whose buffer is smaller than
+	 * max_write + header, so size generously for the max_write we
+	 * advertise in the INIT reply below.
+	 */
+	static char buf[SERVER_MAX_WRITE + 4096];
+
+	for (;;) {
+		ssize_t n = read(fd, buf, sizeof(buf));
+		struct fuse_in_header *ih = (void *)buf;
+
+		if (n < 0) {
+			if (errno == EINTR || errno == EAGAIN)
+				continue;
+			return;		/* device closed on unmount/abort */
+		}
+		if (n < (ssize_t)sizeof(*ih))
+			continue;
+
+		switch (ih->opcode) {
+		case FUSE_INIT: {
+			struct fuse_init_in *in = (void *)(ih + 1);
+			struct fuse_init_out out = {0};
+			uint64_t flags = FUSE_INIT_EXT;
+
+			out.major = FUSE_KERNEL_VERSION;
+			out.minor = FUSE_KERNEL_MINOR_VERSION;
+			out.max_readahead = in->max_readahead;
+			out.max_write = SERVER_MAX_WRITE;
+			out.max_background = 16;
+			out.congestion_threshold = 12;
+			if (advertise)
+				flags |= FUSE_HAS_SYNCFS;
+			out.flags = flags;
+			out.flags2 = flags >> 32;
+			reply(fd, ih->unique, 0, &out, sizeof(out));
+			break;
+		}
+		case FUSE_GETATTR: {
+			struct fuse_attr_out out = {0};
+
+			fill_attr(&out.attr, FUSE_ROOT_ID, S_IFDIR | 0755, 2);
+			reply(fd, ih->unique, 0, &out, sizeof(out));
+			break;
+		}
+		case FUSE_SYNCFS: {
+			uint64_t one = 1;
+
+			if (write(syncfs_evfd, &one, sizeof(one)) < 0)
+				ksft_print_msg("server eventfd write failed: %s\n",
+					       strerror(errno));
+			reply(fd, ih->unique, 0, NULL, 0);
+			break;
+		}
+		default:
+			/*
+			 * Anything else (e.g. OPENDIR from opening the mount
+			 * root) is not needed to drive this test; -ENOSYS lets
+			 * the kernel proceed.
+			 */
+			reply(fd, ih->unique, -ENOSYS, NULL, 0);
+			break;
+		}
+	}
+}
+
+/*
+ * Mount a fuse fs backed by a forked server, issue syncfs(), and report
+ * whether the server observed FUSE_SYNCFS.  Returns 0 on success, -1 if the
+ * environment could not support the test (caller should skip).
+ *
+ * If @unpriv_open is set, /dev/fuse is opened with CAP_SYS_ADMIN dropped
+ * (regained only for the mount() syscall), so the device opener -- the
+ * server principal FUSE_HAS_SYNCFS is gated on -- lacks the privilege even
+ * though it remains in the initial user namespace.
+ */
+static int do_mount_and_syncfs(const char *mnt, int advertise, int unpriv_open,
+			       int *seen)
+{
+	struct pollfd pfd = { .events = POLLIN };
+	char opts[256];
+	int fd, mfd = -1, i;
+	pid_t pid;
+
+	syncfs_evfd = eventfd(0, EFD_CLOEXEC);
+	if (syncfs_evfd < 0)
+		return -1;
+
+	if (unpriv_open && set_sysadmin(0))
+		goto out_evfd;
+	fd = open("/dev/fuse", O_RDWR);
+	if (unpriv_open && set_sysadmin(1)) {
+		if (fd >= 0)
+			close(fd);
+		goto out_evfd;
+	}
+	if (fd < 0)
+		goto out_evfd;
+
+	mkdir(mnt, 0755);
+	snprintf(opts, sizeof(opts),
+		 "fd=%d,rootmode=40000,user_id=%d,group_id=%d",
+		 fd, getuid(), getgid());
+
+	if (mount("fuse", mnt, "fuse", 0, opts) < 0)
+		goto out_fd;
+
+	pid = fork();
+	if (pid < 0)
+		goto out_umount;
+	if (pid == 0) {
+		run_server(fd, advertise);
+		_exit(0);
+	}
+
+	/*
+	 * The parent does not service the fuse fd; the child does.  Close our
+	 * copy so the kernel sees a single server, and so that if the child
+	 * dies the connection aborts instead of hanging us forever.
+	 */
+	close(fd);
+
+	/*
+	 * mount() returns before the server has answered FUSE_INIT, so the
+	 * first open() can race and fail with ENOTCONN; retry until the
+	 * handshake settles.
+	 */
+	for (i = 0; i < 1000; i++) {
+		mfd = open(mnt, O_RDONLY | O_DIRECTORY);
+		if (mfd >= 0)
+			break;
+		usleep(1000);
+	}
+	if (mfd >= 0) {
+		syncfs(mfd);
+		close(mfd);
+	}
+
+	/*
+	 * No waiting is needed: the server writes syncfs_evfd before it replies
+	 * to FUSE_SYNCFS, and that reply is what unblocks the synchronous
+	 * syncfs() above.  So once syncfs() has returned, the eventfd is already
+	 * signalled if the opcode was propagated, and will never be otherwise.
+	 * poll() with a zero timeout therefore decides both cases immediately.
+	 */
+	pfd.fd = syncfs_evfd;
+	*seen = poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN);
+
+	kill(pid, SIGKILL);
+	waitpid(pid, NULL, 0);
+	umount2(mnt, MNT_DETACH);
+	close(syncfs_evfd);
+	return 0;
+
+out_umount:
+	umount2(mnt, MNT_DETACH);
+out_fd:
+	close(fd);
+out_evfd:
+	close(syncfs_evfd);
+	return -1;
+}
+
+/* T4: same as above but the mount is created inside a new user namespace. */
+static int run_in_userns(const char *mnt, int advertise, int *seen)
+{
+	uid_t uid = getuid();
+	gid_t gid = getgid();
+	char map[64];
+	int f;
+
+	if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0)
+		return -1;	/* unprivileged userns mounts unavailable */
+
+	f = open("/proc/self/setgroups", O_WRONLY);
+	if (f >= 0) {
+		dprintf(f, "deny");
+		close(f);
+	}
+	snprintf(map, sizeof(map), "0 %d 1", uid);
+	f = open("/proc/self/uid_map", O_WRONLY);
+	if (f < 0 || dprintf(f, "%s", map) < 0)
+		return -1;
+	close(f);
+	snprintf(map, sizeof(map), "0 %d 1", gid);
+	f = open("/proc/self/gid_map", O_WRONLY);
+	if (f < 0 || dprintf(f, "%s", map) < 0)
+		return -1;
+	close(f);
+
+	/* Need a mount namespace where we can mount fuse unprivileged. */
+	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
+		return -1;
+
+	return do_mount_and_syncfs(mnt, advertise, 0, seen);
+}
+
+int main(void)
+{
+	char mnt[] = "/tmp/fuse_syncfs_XXXXXX";
+	int seen, ret;
+
+	ksft_print_header();
+	ksft_set_plan(4);
+
+	/* Hard watchdog: never let a stuck syncfs hang the test runner. */
+	signal(SIGALRM, SIG_DFL);
+	alarm(60);
+
+	if (geteuid() != 0)
+		ksft_exit_skip("test requires root to mount fuse\n");
+
+	if (!mkdtemp(mnt))
+		ksft_exit_fail_msg("mkdtemp failed\n");
+
+	/* T1: host-root mount, server opts in -> syncfs must reach server. */
+	ret = do_mount_and_syncfs(mnt, 1, 0, &seen);
+	if (ret < 0)
+		ksft_test_result_skip("T1: could not mount fuse\n");
+	else
+		ksft_test_result(seen,
+				 "T1 host-root + FUSE_HAS_SYNCFS: server receives FUSE_SYNCFS\n");
+
+	/* T2: host-root mount, server does NOT opt in -> no FUSE_SYNCFS. */
+	ret = do_mount_and_syncfs(mnt, 0, 0, &seen);
+	if (ret < 0)
+		ksft_test_result_skip("T2: could not mount fuse\n");
+	else
+		ksft_test_result(!seen,
+				 "T2 host-root, no opt-in: server does NOT receive FUSE_SYNCFS\n");
+
+	/*
+	 * T3: server opts in but opened /dev/fuse without CAP_SYS_ADMIN while
+	 * still in the initial user namespace -> kernel must withhold
+	 * FUSE_SYNCFS.  This is the case that distinguishes gating on the
+	 * server's privilege from gating on the mount's user namespace.
+	 */
+	ret = do_mount_and_syncfs(mnt, 1, 1, &seen);
+	if (ret < 0)
+		ksft_test_result_skip("T3: could not mount fuse unprivileged\n");
+	else
+		ksft_test_result(!seen,
+				 "T3 init_userns, opener lacks CAP_SYS_ADMIN: FUSE_SYNCFS withheld\n");
+
+	/*
+	 * T4: unprivileged userns mount, server opts in -> kernel must still
+	 * withhold FUSE_SYNCFS.  Run in a child since it unshares namespaces.
+	 */
+	{
+		pid_t p = fork();
+
+		if (p == 0) {
+			int s = 0;
+			int r = run_in_userns(mnt, 1, &s);
+
+			_exit(r < 0 ? 2 : (s ? 1 : 0));
+		} else {
+			int status;
+
+			waitpid(p, &status, 0);
+			if (!WIFEXITED(status))
+				ksft_test_result_error("T4: child crashed\n");
+			else if (WEXITSTATUS(status) == 2)
+				ksft_test_result_skip("T4: userns fuse mount unavailable\n");
+			else
+				ksft_test_result(WEXITSTATUS(status) == 0,
+						 "T4 unpriv userns + opt-in: FUSE_SYNCFS withheld\n");
+		}
+	}
+
+	rmdir(mnt);
+	ksft_finished();
+}
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 1/2] fuse: allow FUSE_SYNCFS for privileged userspace servers
From: Jimmy Zuber @ 2026-06-19 17:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, linux-fsdevel, linux-api, linux-kernel, Shuah Khan,
	linux-kselftest
In-Reply-To: <20260619170251.1154562-1-jamz@amazon.com>

Propagating syncfs()/sync() to a FUSE server via FUSE_SYNCFS lets the
server flush its own cached or intermediate state when userspace asks the
filesystem to sync.  This is currently enabled only for virtiofs and
fuseblk, because an untrusted server can use it to stall sync()
indefinitely (see commit 2d82ab251ef0 ("virtiofs: propagate sync() to file
server"), and commit d3906d8f3cee ("fuse: enable FUSE_SYNCFS for all
fuseblk servers")).  Both of those mount types require host privilege to
set up, so the server is trusted not to abuse it.

There is nothing virtiofs- or block-specific about wanting to handle
syncfs(), though.  A plain /dev/fuse server is just as entitled to
participate in the sync() path -- so that data it has buffered reaches
stable storage when the user asks for it -- provided it is equally
trusted.

The trust property that virtiofs and fuseblk satisfy is that neither can
be mounted without CAP_SYS_ADMIN in the initial user namespace (neither
sets FS_USERNS_MOUNT).  A plain fuse mount does set FS_USERNS_MOUNT, so its
existence guarantees no such privilege; the privilege has to be checked
rather than assumed.

Add an opt-in INIT flag, FUSE_HAS_SYNCFS, and honor it only when the
server opened /dev/fuse with CAP_SYS_ADMIN in the initial user namespace,
recorded at mount time via file_ns_capable().  This is the same privilege
that mounting virtiofs or fuseblk requires, applied to the process that
actually services (and could stall) the connection.  Checking the device
opener's capability -- rather than the mount's user namespace -- avoids
treating an unprivileged server that merely happens to run in the initial
user namespace (e.g. a normal sshfs mount) as trusted.

The flag is only advertised to servers that pass this check, so an
unprivileged server is never invited to opt in (and is ignored by
fuse_syncfs_enable() if it sets the flag anyway).

Signed-off-by: Jimmy Zuber <jamz@amazon.com>
Assisted-by: Claude:claude-opus-4-8 [Claude-Code]
---
 fs/fuse/fuse_i.h          |  9 +++++++++
 fs/fuse/inode.c           | 28 ++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h | 12 +++++++++++-
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 85f738c53122..3fbe4957e839 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -391,6 +391,7 @@ struct fuse_fs_context {
 	bool no_control:1;
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
+	bool syncfs_capable:1;
 	enum fuse_dax_mode dax_mode;
 	unsigned int max_read;
 	unsigned int blksize;
@@ -675,6 +676,14 @@ struct fuse_conn {
 	/** @sync_fs: Propagate syncfs() to server */
 	unsigned int sync_fs:1;
 
+	/**
+	 * @syncfs_capable: the privilege required to honor FUSE_HAS_SYNCFS was
+	 * present when /dev/fuse was opened (CAP_SYS_ADMIN in the initial user
+	 * namespace), i.e. the same privilege that mounting virtiofs/fuseblk
+	 * requires.
+	 */
+	unsigned int syncfs_capable:1;
+
 	/** @init_security: Initialize security xattrs when creating a new inode */
 	unsigned int init_security:1;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d975073c6029..ba7506b539ae 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -800,6 +800,15 @@ static int fuse_opt_fd(struct fs_context *fsc, struct file *file)
 	if (file->f_cred->user_ns != fsc->user_ns)
 		return invalfc(fsc, "wrong user namespace for fuse device");
 
+	/*
+	 * Record whether the server opened /dev/fuse with CAP_SYS_ADMIN in the
+	 * initial user namespace -- the same privilege that mounting virtiofs
+	 * or fuseblk requires.  Only such servers are trusted to receive
+	 * FUSE_SYNCFS (see fuse_syncfs_enable()).
+	 */
+	ctx->syncfs_capable = file_ns_capable(file, &init_user_ns,
+					      CAP_SYS_ADMIN);
+
 	ctx->fud = fuse_dev_grab(file);
 
 	return 0;
@@ -1266,6 +1275,16 @@ struct fuse_init_args {
 	struct fuse_mount *fm;
 };
 
+/*
+ * A server can stall syncfs()/sync(), so only honor FUSE_HAS_SYNCFS for
+ * servers that opened /dev/fuse with CAP_SYS_ADMIN in the initial user
+ * namespace -- the same privilege required to mount virtiofs or fuseblk.
+ */
+static bool fuse_syncfs_enable(struct fuse_conn *fc, u64 flags)
+{
+	return (flags & FUSE_HAS_SYNCFS) && fc->syncfs_capable;
+}
+
 static void process_init_reply(struct fuse_args *args, int error)
 {
 	struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
@@ -1406,6 +1425,9 @@ static void process_init_reply(struct fuse_args *args, int error)
 
 			if (flags & FUSE_REQUEST_TIMEOUT)
 				timeout = arg->request_timeout;
+
+			if (fuse_syncfs_enable(fc, flags))
+				fc->sync_fs = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1473,6 +1495,11 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
 		flags |= FUSE_SUBMOUNTS;
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		flags |= FUSE_PASSTHROUGH;
+	/* Only offered to sufficiently privileged servers; see
+	 * fuse_syncfs_enable().
+	 */
+	if (fm->fc->syncfs_capable)
+		flags |= FUSE_HAS_SYNCFS;
 
 	/*
 	 * This is just an information flag for fuse server. No need to check
@@ -1766,6 +1793,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 
 	fc->default_permissions = ctx->default_permissions;
 	fc->allow_other = ctx->allow_other;
+	fc->syncfs_capable = ctx->syncfs_capable;
 	fc->user_id = ctx->user_id;
 	fc->group_id = ctx->group_id;
 	fc->legacy_opts_show = ctx->legacy_opts_show;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12..c078c5f0f94e 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,9 @@
  *  - add FUSE_COPY_FILE_RANGE_64
  *  - add struct fuse_copy_file_range_out
  *  - add FUSE_NOTIFY_PRUNE
+ *
+ *  7.46
+ *  - add FUSE_HAS_SYNCFS opt-in flag for privileged userspace servers
  */
 
 #ifndef _LINUX_FUSE_H
@@ -275,7 +278,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 45
+#define FUSE_KERNEL_MINOR_VERSION 46
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -448,6 +451,12 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_HAS_SYNCFS: server requests that syncfs()/sync() be propagated as
+ *		FUSE_SYNCFS requests.  Since an untrusted server can use this
+ *		to stall sync(), it is only honored when /dev/fuse was opened
+ *		with CAP_SYS_ADMIN in the initial user namespace (the same
+ *		privilege that mounting virtiofs or fuseblk requires).
+ *		Insufficiently privileged servers ignore it.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -495,6 +504,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_HAS_SYNCFS		(1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 0/2] fuse: allow FUSE_SYNCFS for privileged userspace servers
From: Jimmy Zuber @ 2026-06-19 17:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, linux-fsdevel, linux-api, linux-kernel, Shuah Khan,
	linux-kselftest
In-Reply-To: <20260616151909.916667-1-jamz@amazon.com>

FUSE_SYNCFS (propagating syncfs()/sync() to the server) is currently
enabled only for virtiofs and fuseblk, since an untrusted server can stall
sync(). Any FUSE filesystem may buffer data in the server that ought to
reach storage on sync(); the only thing that should gate it is whether the
/dev/fuse opener is sufficiently privileged to be trusted to block sync.

This series lets a plain /dev/fuse server opt in via a new FUSE_HAS_SYNCFS
INIT flag, honored only when the server opened /dev/fuse with CAP_SYS_ADMIN
privilege in the initial user namespace -- checked with file_ns_capable() at
mount time.

  Patch 1: the kernel change (UAPI flag + privilege gating).
  Patch 2: a selftest that speaks the raw FUSE protocol over /dev/fuse, so
           it can withhold the flag and directly observe whether the
           FUSE_SYNCFS opcode is forwarded.

A matching libfuse change (FUSE_CAP_SYNCFS negotiation) will be sent to the
libfuse project once the UAPI flag here is settled.

Changes since v1 [1]:
 - Gate on the privilege of the opener (CAP_SYS_ADMIN in init_user_ns at
   /dev/fuse open time, via file_ns_capable()) rather than on the mount's
   user namespace.  Miklos pointed out that the v1 check
   (fc->user_ns == &init_user_ns) tested a property of the mount, not of
   the server that actually services -- and can stall -- the connection.
   Being in the initial user namespace is not itself a privilege
   (e.g. an ordinary sshfs mount qualifies).  Checking the device opener's
   capability closes that gap.
 - Selftest: add a case covering exactly that distinction -- a server
   in the initial user namespace that opened /dev/fuse without
   CAP_SYS_ADMIN -- which v1 would have wrongly allowed.

Testing: built and booted under QEMU.  The selftest passes all
four cases, including the new case above (verified that the kernel withholds
FUSE_SYNCFS while it would have been granted under the v1 check).

[1] https://lore.kernel.org/20260616151909.916667-1-jamz@amazon.com

Jimmy Zuber (2):
  fuse: allow FUSE_SYNCFS for privileged userspace servers
  selftests/fuse: add test for FUSE_HAS_SYNCFS privilege gating

 fs/fuse/fuse_i.h                              |   9 +
 fs/fuse/inode.c                               |  28 ++
 include/uapi/linux/fuse.h                     |  12 +-
 .../selftests/filesystems/fuse/.gitignore     |   1 +
 .../selftests/filesystems/fuse/Makefile       |   2 +-
 .../selftests/filesystems/fuse/test_syncfs.c  | 370 ++++++++++++++++++
 6 files changed, 420 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/fuse/test_syncfs.c

base-commit: 7d87a5a284bb34edb3f4e7e312ef403b3385a7b7
-- 
2.50.1

^ permalink raw reply

* Re: [PATCH 1/2] fuse: allow FUSE_SYNCFS for privileged userspace servers
From: Jimmy Zuber @ 2026-06-19 16:33 UTC (permalink / raw)
  To: miklos
  Cc: fuse-devel, jamz, linux-api, linux-fsdevel, linux-kernel,
	linux-kselftest, shuah
In-Reply-To: <CAJfpeguaa4U6Tsxh5opeaLeYQQxP6MQsRMu_JATJwwgj=OBt3Q@mail.gmail.com>

On Mon, 16 Jun 2026, Miklos Szeredi <miklos@szeredi.hu> wrote:
> Sounds really easy to trick: start the server in the initial user ns,
> then clone the mounter with a new user/mount namespace.   The
> init_user_ns test will pass happily, since the server is running in
> the initial namespace.

Ah, the intention was to limit sync to sufficiently privileged FUSE
setups.  I missed that the initial user namespace is not equivalent to
elevated permissions.  I am thinking instead it would make sense to
assert that the opener of /dev/fuse has CAP_SYS_ADMIN in the initial
user namespace.  They could then hand off the fd to a less privileged
server, but that is the prerogative of that privileged user, so I think
it satisfies the spirit of the DoS prevention requirement.

I will follow up shortly with a new version.

Thank you!

Jimmy

^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Pankaj Raghav (Samsung) @ 2026-06-18 13:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, linux-xfs, bfoster, lukas, Darrick J . Wong, dgc,
	gost.dev, andres, kundan.kumar, cem, Zhang Yi, linux-fsdevel,
	linux-api
In-Reply-To: <ajP0sno47hihsCCo@infradead.org>

On Thu, Jun 18, 2026 at 06:37:54AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 18, 2026 at 03:26:14PM +0200, Pankaj Raghav (Samsung) wrote:
> > existing code **does not** overwrite any user data. Here is why:
> > 
> > - xfs_free_file_space (punch hole) punches inward (offset_ru -> end_rd)
> >   and zeroes out from (offset -> offset_ru) and (end_rd -> end) with
> >   xfs_zero_range
> 
> Ah, right.
> 
> > - Luckily, even though we consider offset_rd to end_ru in
> >   alloc_file_space, XFS_BMAPI_ZERO will skip zeroing already written
> >   edge blocks and only offset_ru -> end_rd are zeroed using unmap zero range.
> >   ( (offset -> offset_ru) -> EXT_NORM, (offset_ru -> end_rd) -> HOLE,
> >   (end_rd -> end) -> EXT_NORM)
> 
> Oh.  So this was lucky because of the corner cases.  I still think doing
> the proper alignment in the higher level code and only calling down for
> the area that we really want to zero would be benefitial here, if only
> to make the code easier to follow.

I agree. I will do the changes and document some of the nuances in the
code.

Thanks!

--
Pankaj

^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Christoph Hellwig @ 2026-06-18 13:37 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, bfoster, lukas,
	Darrick J . Wong, dgc, gost.dev, andres, kundan.kumar, cem,
	Zhang Yi, linux-fsdevel, linux-api
In-Reply-To: <xklmkbxoh7yfm5mkxxpp72wfifngzr2v4nfrn3lgzc57ep7nkj@louxc6z3gfmk>

On Thu, Jun 18, 2026 at 03:26:14PM +0200, Pankaj Raghav (Samsung) wrote:
> existing code **does not** overwrite any user data. Here is why:
> 
> - xfs_free_file_space (punch hole) punches inward (offset_ru -> end_rd)
>   and zeroes out from (offset -> offset_ru) and (end_rd -> end) with
>   xfs_zero_range

Ah, right.

> - Luckily, even though we consider offset_rd to end_ru in
>   alloc_file_space, XFS_BMAPI_ZERO will skip zeroing already written
>   edge blocks and only offset_ru -> end_rd are zeroed using unmap zero range.
>   ( (offset -> offset_ru) -> EXT_NORM, (offset_ru -> end_rd) -> HOLE,
>   (end_rd -> end) -> EXT_NORM)

Oh.  So this was lucky because of the corner cases.  I still think doing
the proper alignment in the higher level code and only calling down for
the area that we really want to zero would be benefitial here, if only
to make the code easier to follow.


^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Pankaj Raghav (Samsung) @ 2026-06-18 13:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, linux-xfs, bfoster, lukas, Darrick J . Wong, dgc,
	gost.dev, andres, kundan.kumar, cem, Zhang Yi, linux-fsdevel,
	linux-api
In-Reply-To: <ajO8AHsMFolUQLra@infradead.org>

On Thu, Jun 18, 2026 at 02:36:00AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 18, 2026 at 11:28:15AM +0200, Pankaj Raghav (Samsung) wrote:
> > > But I guess not unaligned FALLOC_FL_WRITE_ZEROES?
> > 
> >         -r readbdy: 4096 would make reads page aligned (default 1)
> >         -t truncbdy: 4096 would make truncates page aligned (default 1)
> >         -w writebdy: 4096 would make writes page aligned (default 1)
> > 
> > FALLOC_FL_WRITE_ZEROES comes under truncate. So I would assume we also
> > do that. That is how I also found the issue with offset > EOF. I will
> > take a look or else, I will add a test case to test this condition!
> 
> A targeted test using xfs_io that does FALLOC_FL_WRITE_ZEROES on an
> unaligned range and then checks that the data around it is preserved
> while the unaligned data in the range is zeroed would also be useful.
> 


|----------|----------|----------|----------|----------|
           ^     ^    ^                     ^     ^    ^
           |     |    |                     |     |    |
           |   offset |                     |    end   |
           |          |                     |          |
	off_rd       off_ru              end_rd       end_ru


I wrote a simple test case for unaligned FALLOC_FL_WRITE_ZEROES. The
existing code **does not** overwrite any user data. Here is why:

- xfs_free_file_space (punch hole) punches inward (offset_ru -> end_rd)
  and zeroes out from (offset -> offset_ru) and (end_rd -> end) with
  xfs_zero_range

- Luckily, even though we consider offset_rd to end_ru in
  alloc_file_space, XFS_BMAPI_ZERO will skip zeroing already written
  edge blocks and only offset_ru -> end_rd are zeroed using unmap zero range.
  ( (offset -> offset_ru) -> EXT_NORM, (offset_ru -> end_rd) -> HOLE,
  (end_rd -> end) -> EXT_NORM)

I didn't think it through but the code still holds for these cases. This
might be the reason why fsx did not complain? I will also document this
in the patch.

But as you said, we need a xfstest for this boundary block checks in case
some behaviour changes in the future.

I guess apart from this, the only change is to address persisting data
when offset > old_size as sashiko reported.

Let me know what you think.
--
Pankaj

^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Zhang Yi @ 2026-06-18 10:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung), Pankaj Raghav, linux-xfs, bfoster, lukas,
	Darrick J . Wong, dgc, gost.dev, andres, kundan.kumar, cem,
	linux-fsdevel, linux-api
In-Reply-To: <ajOzZv7407w_hodk@infradead.org>

On 6/18/2026 4:59 PM, Christoph Hellwig wrote:
> On Thu, Jun 18, 2026 at 11:22:45AM +0800, Zhang Yi wrote:
>> 1) if the two blocks straddling the boundaries have not yet been allocated,
>>    or allocated as unwritten, we should round outward the allocation range
>>    and zero out all allocated blocks, including those two boundary blocks.
>> 2) if the blocks at the boundaries are already in the written state — which
>>    can occur when we call FALLOC_FL_WRITE_ZEROES within the file size. We
>>    should be careful here: we should only zero the ranges [offset, offset_ru)
>>    and [end_rd, end) for the boundary blocks, leaving the already-written
>>    portions of the boundary blocks intact.
>>
>> Thoughs?
> 
> Yes.
> 
>> Regarding the second point, the current ext4 implementation has an issue —
>> it zeroes out the entire boundary blocks. I overlooked this previously, and
>> I appreciate you pointing it out.
> 
> Which means we're missing test coverage for this as well..
> 

Ha, I just re-checked the ext4 implementation and found that scenario 2 is
actually fine, Sorry I misread the code earlier. Fortunately, no data
corruption occurred. :-)

The real issue is in scenario 1, where the boundary blocks are not correctly
converted to written-type extents.

As for testing the unaligned case, I also agree that we should explicitly add
a dedicated test for it. Relying on existing fsstress and fsx is not reliable
enough in this regard.

Thanks,
Yi.



^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Christoph Hellwig @ 2026-06-18  9:36 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, bfoster, lukas,
	Darrick J . Wong, dgc, gost.dev, andres, kundan.kumar, cem,
	Zhang Yi, linux-fsdevel, linux-api
In-Reply-To: <au3yznolpebvd5vphmzv73fzosdy7bxrb45kv56n734jnwwj3l@43d7sqlef27k>

On Thu, Jun 18, 2026 at 11:28:15AM +0200, Pankaj Raghav (Samsung) wrote:
> > But I guess not unaligned FALLOC_FL_WRITE_ZEROES?
> 
>         -r readbdy: 4096 would make reads page aligned (default 1)
>         -t truncbdy: 4096 would make truncates page aligned (default 1)
>         -w writebdy: 4096 would make writes page aligned (default 1)
> 
> FALLOC_FL_WRITE_ZEROES comes under truncate. So I would assume we also
> do that. That is how I also found the issue with offset > EOF. I will
> take a look or else, I will add a test case to test this condition!

A targeted test using xfs_io that does FALLOC_FL_WRITE_ZEROES on an
unaligned range and then checks that the data around it is preserved
while the unaligned data in the range is zeroed would also be useful.


^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Pankaj Raghav (Samsung) @ 2026-06-18  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, linux-xfs, bfoster, lukas, Darrick J . Wong, dgc,
	gost.dev, andres, kundan.kumar, cem, Zhang Yi, linux-fsdevel,
	linux-api
In-Reply-To: <ajOzoBt_peOCtNUm@infradead.org>

On Thu, Jun 18, 2026 at 02:00:16AM -0700, Christoph Hellwig wrote:
> On Wed, Jun 17, 2026 at 11:44:47AM +0200, Pankaj Raghav (Samsung) wrote:
> > > Maybe we also want xfstests that try unaligned FALLOC_FL_WRITE_ZEROES
> > > and make sure no existing data before the range is lost and the
> > > entire range is zeroed?
> > > 
> > 
> > I added FALLOC_FL_WRITE_ZEROES support to ltp (both fsx and fsstress).
> > For example, generic/363 tests for unaligned writes and checks for any
> > stale data. By default, I think we do unaligned reads, writes and
> > truncate in fsx.
> 
> But I guess not unaligned FALLOC_FL_WRITE_ZEROES?

        -r readbdy: 4096 would make reads page aligned (default 1)
        -t truncbdy: 4096 would make truncates page aligned (default 1)
        -w writebdy: 4096 would make writes page aligned (default 1)

FALLOC_FL_WRITE_ZEROES comes under truncate. So I would assume we also
do that. That is how I also found the issue with offset > EOF. I will
take a look or else, I will add a test case to test this condition!

Thanks.
--
Pankaj

^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Christoph Hellwig @ 2026-06-18  9:00 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Pankaj Raghav, linux-xfs, bfoster, lukas, Darrick J . Wong, dgc,
	gost.dev, andres, kundan.kumar, cem, Zhang Yi, linux-fsdevel,
	linux-api
In-Reply-To: <kn5dyqdxmsydibpfcurgpom5vwqtwinrz27oenh4pekks6ybdj@wyi4zkh7mogy>

On Wed, Jun 17, 2026 at 11:44:47AM +0200, Pankaj Raghav (Samsung) wrote:
> > Maybe we also want xfstests that try unaligned FALLOC_FL_WRITE_ZEROES
> > and make sure no existing data before the range is lost and the
> > entire range is zeroed?
> > 
> 
> I added FALLOC_FL_WRITE_ZEROES support to ltp (both fsx and fsstress).
> For example, generic/363 tests for unaligned writes and checks for any
> stale data. By default, I think we do unaligned reads, writes and
> truncate in fsx.

But I guess not unaligned FALLOC_FL_WRITE_ZEROES?


^ permalink raw reply

* Re: [PATCH v6 3/3] xfs: add support for FALLOC_FL_WRITE_ZEROES
From: Christoph Hellwig @ 2026-06-18  8:59 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Pankaj Raghav (Samsung), Pankaj Raghav, linux-xfs, bfoster, lukas,
	Darrick J . Wong, dgc, gost.dev, andres, kundan.kumar, cem,
	linux-fsdevel, linux-api
In-Reply-To: <557b2e5c-7c65-48de-87a9-6fba21eca99f@huaweicloud.com>

On Thu, Jun 18, 2026 at 11:22:45AM +0800, Zhang Yi wrote:
> 1) if the two blocks straddling the boundaries have not yet been allocated,
>    or allocated as unwritten, we should round outward the allocation range
>    and zero out all allocated blocks, including those two boundary blocks.
> 2) if the blocks at the boundaries are already in the written state — which
>    can occur when we call FALLOC_FL_WRITE_ZEROES within the file size. We
>    should be careful here: we should only zero the ranges [offset, offset_ru)
>    and [end_rd, end) for the boundary blocks, leaving the already-written
>    portions of the boundary blocks intact.
> 
> Thoughs?

Yes.

> Regarding the second point, the current ext4 implementation has an issue —
> it zeroes out the entire boundary blocks. I overlooked this previously, and
> I appreciate you pointing it out.

Which means we're missing test coverage for this as well..


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox