Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 2/3] ext4: add support for 32-bit default reserved uid and gid values
From: Darrick J. Wong @ 2025-09-11 22:31 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-2-e3a6929f3355@mit.edu>

On Mon, Sep 08, 2025 at 11:15:49PM -0400, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Support for specifying the default user id and group id that is
> allowed to use the reserved block space was added way back when Linux
> only supported 16-bit uid's and gid's.  (Yeah, that long ago.)  It's
> not a commonly used feature, but let's add support for 32-bit user and
> group id's.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/ext4.h  | 16 +++++++++++++++-
>  fs/ext4/super.c |  8 ++++----
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 01a6e2de7fc3ef0e20b039d3200b9c9bd656f59f..4bfcd5f0c74fda30db4009ee28fbee00a2f6b76f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1442,7 +1442,9 @@ struct ext4_super_block {
>  	__le16  s_encoding;		/* Filename charset encoding */
>  	__le16  s_encoding_flags;	/* Filename charset encoding flags */
>  	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
> -	__le32	s_reserved[94];		/* Padding to the end of the block */
> +	__le16	s_def_resuid_hi;
> +	__le16	s_def_resgid_hi;
> +	__le32	s_reserved[93];		/* Padding to the end of the block */

Does anything actually check that s_reserved is zero?  I couldn't find
any:

$ git grep -w s_reserved fs/ext4 fs/ext2
fs/ext2/ext2.h:480:     __u32   s_reserved[190];        /* Padding to the end of the block */
fs/ext4/ext4.h:1445:    __le32  s_reserved[94];         /* Padding to the end of the block */

$ git grep -w s_reserved lib/ext2fs/ e2fsck/
lib/ext2fs/ext2_fs.h:777:       __le32  s_reserved[94];         /* Padding to the end of the block */
lib/ext2fs/swapfs.c:135:        /* catch when new fields are used from s_reserved */
lib/ext2fs/swapfs.c:136:        EXT2FS_BUILD_BUG_ON(sizeof(sb->s_reserved) != 94 * sizeof(__le32));
lib/ext2fs/tst_super_size.c:156:        check_field(s_reserved, 94 * 4);

Is there a risk that some garbage written to s_reserved (and not caught
by either the kernel or e2fsck) will now appear as a "legitimate" resuid
value?

--D

>  	__le32	s_checksum;		/* crc32c(superblock) */
>  };
>  
> @@ -1812,6 +1814,18 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>  		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
>  }
>  
> +static inline int ext4_get_resuid(struct ext4_super_block *es)
> +{
> +	return(le16_to_cpu(es->s_def_resuid) |
> +	       (le16_to_cpu(es->s_def_resuid_hi) << 16));
> +}
> +
> +static inline int ext4_get_resgid(struct ext4_super_block *es)
> +{
> +	return(le16_to_cpu(es->s_def_resgid) |
> +	       (le16_to_cpu(es->s_def_resgid_hi) << 16));
> +}
> +
>  /*
>   * Returns: sbi->field[index]
>   * Used to access an array element from the following sbi fields which require
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 94c98446c84f9a4614971d246ca7f001de610a8a..0256c8f7c6cee2b8d9295f2fa9a7acd904382e83 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2951,11 +2951,11 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  	}
>  
>  	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
> -	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
> +	    ext4_get_resuid(es) != EXT4_DEF_RESUID)
>  		SEQ_OPTS_PRINT("resuid=%u",
>  				from_kuid_munged(&init_user_ns, sbi->s_resuid));
>  	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
> -	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
> +	    ext4_get_resgid(es) != EXT4_DEF_RESGID)
>  		SEQ_OPTS_PRINT("resgid=%u",
>  				from_kgid_munged(&init_user_ns, sbi->s_resgid));
>  	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
> @@ -5270,8 +5270,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  
>  	ext4_set_def_opts(sb, es);
>  
> -	sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
> -	sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
> +	sbi->s_resuid = make_kuid(&init_user_ns, ext4_get_resuid(es));
> +	sbi->s_resgid = make_kgid(&init_user_ns, ext4_get_resuid(es));
>  	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
>  	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
>  	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
> 
> -- 
> 2.51.0
> 
> 
> 

^ permalink raw reply

* Re: [PATCH 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Darrick J. Wong @ 2025-09-11 22:27 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250908-tune2fs-v1-1-e3a6929f3355@mit.edu>

On Mon, Sep 08, 2025 at 11:15:48PM -0400, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Unlike other strings in the ext4 superblock, we rely on tune2fs to
> make sure s_mount_opts is NUL terminated.  Harden
> parse_apply_sb_mount_options() by treating s_mount_opts as a potential
> __nonstring.

Uh.... does that mean that a filesystem with exactly 64 bytes worth of
mount option string (and no trailing null) could do something malicious?

My guess is that s_usr_quota_inum mostly saves us, but a nastycrafted
filesystem with more than 2^24 inodes could cause an out of bounds
memory access?  But that most likely will just fail the mount option
parser anyway?

--D

> 
> Cc: stable@vger.kernel.org
> Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/super.c | 17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  					struct ext4_fs_context *m_ctx)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	char *s_mount_opts = NULL;
> +	char s_mount_opts[65];
>  	struct ext4_fs_context *s_ctx = NULL;
>  	struct fs_context *fc = NULL;
>  	int ret = -ENOMEM;
> @@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  	if (!sbi->s_es->s_mount_opts[0])
>  		return 0;
>  
> -	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
> -				sizeof(sbi->s_es->s_mount_opts),
> -				GFP_KERNEL);
> -	if (!s_mount_opts)
> -		return ret;
> +	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
>  
>  	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
>  	if (!fc)
> -		goto out_free;
> +		return -ENOMEM;
>  
>  	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
>  	if (!s_ctx)
> @@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  	ret = 0;
>  
>  out_free:
> -	if (fc) {
> -		ext4_fc_free(fc);
> -		kfree(fc);
> -	}
> -	kfree(s_mount_opts);
> +	ext4_fc_free(fc);
> +	kfree(fc);
>  	return ret;
>  }
>  
> 
> -- 
> 2.51.0
> 
> 
> 

^ permalink raw reply

* Re: [PATCH 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: kernel test robot @ 2025-09-09 21:33 UTC (permalink / raw)
  To: Theodore Ts'o via B4 Relay, tytso
  Cc: oe-kbuild-all, linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-3-e3a6929f3355@mit.edu>

Hi Theodore,

kernel test robot noticed the following build warnings:

[auto build test WARNING on b320789d6883cc00ac78ce83bccbfe7ed58afcf0]

url:    https://github.com/intel-lab-lkp/linux/commits/Theodore-Ts-o-via-B4-Relay/ext4-avoid-potential-buffer-over-read-in-parse_apply_sb_mount_options/20250909-111746
base:   b320789d6883cc00ac78ce83bccbfe7ed58afcf0
patch link:    https://lore.kernel.org/r/20250908-tune2fs-v1-3-e3a6929f3355%40mit.edu
patch subject: [PATCH 3/3] ext4: implemet new ioctls to set and get superblock parameters
config: csky-randconfig-r123-20250910 (https://download.01.org/0day-ci/archive/20250910/202509100550.fj5qrPH5-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 10.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250910/202509100550.fj5qrPH5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509100550.fj5qrPH5-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> fs/ext4/ioctl.c:1255:29: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned short [addressable] [assigned] [usertype] errors_behavior @@     got restricted __le16 [usertype] s_errors @@
   fs/ext4/ioctl.c:1255:29: sparse:     expected unsigned short [addressable] [assigned] [usertype] errors_behavior
   fs/ext4/ioctl.c:1255:29: sparse:     got restricted __le16 [usertype] s_errors
>> fs/ext4/ioctl.c:1267:33: sparse: sparse: cast to restricted __le16
>> fs/ext4/ioctl.c:1267:33: sparse: sparse: cast from restricted __le32
>> fs/ext4/ioctl.c:1323:41: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __le32 [usertype] s_raid_stripe_width @@     got restricted __le16 [usertype] @@
   fs/ext4/ioctl.c:1323:41: sparse:     expected restricted __le32 [usertype] s_raid_stripe_width
   fs/ext4/ioctl.c:1323:41: sparse:     got restricted __le16 [usertype]
   fs/ext4/ioctl.c: note: in included file (through include/linux/uaccess.h, include/linux/sched/task.h, include/linux/sched/signal.h, ...):
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: cast removes address space '__user' of expression
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: asm output is not an lvalue
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: cast removes address space '__user' of expression
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: cast removes address space '__user' of expression
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: asm output is not an lvalue
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: cast removes address space '__user' of expression
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: generating address of non-lvalue (11)
   arch/csky/include/asm/uaccess.h:110:17: sparse: sparse: generating address of non-lvalue (11)

vim +1255 fs/ext4/ioctl.c

  1235	
  1236	
  1237	#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
  1238		EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
  1239		EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
  1240		EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
  1241		EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
  1242		EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
  1243		EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
  1244		EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
  1245		EXT4_TUNE_FL_FORCE_FSCK)
  1246	
  1247	static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
  1248					  struct ext4_tune_sb_params __user *params)
  1249	{
  1250		struct ext4_tune_sb_params ret;
  1251		struct ext4_super_block *es = sbi->s_es;
  1252	
  1253		memset(&ret, 0, sizeof(ret));
  1254		ret.set_flags = TUNE_OPS_SUPPORTED;
> 1255		ret.errors_behavior = es->s_errors;
  1256		ret.mnt_count = le16_to_cpu(es->s_mnt_count);
  1257		ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
  1258		ret.checkinterval = le32_to_cpu(es->s_checkinterval);
  1259		ret.last_check_time = le32_to_cpu(es->s_lastcheck);
  1260		ret.reserved_blocks = ext4_r_blocks_count(es);
  1261		ret.blocks_count = ext4_blocks_count(es);
  1262		ret.reserved_uid = ext4_get_resuid(es);
  1263		ret.reserved_gid = ext4_get_resgid(es);
  1264		ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
  1265		ret.def_hash_alg = es->s_def_hash_version;
  1266		ret.raid_stride = le16_to_cpu(es->s_raid_stride);
> 1267		ret.raid_stripe_width = le16_to_cpu(es->s_raid_stripe_width);
  1268		strscpy_pad(ret.mount_opts, es->s_mount_opts);
  1269		ret.feature_compat = le32_to_cpu(es->s_feature_compat);
  1270		ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
  1271		ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
  1272		ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
  1273		ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
  1274		ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
  1275		ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
  1276		ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
  1277		ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
  1278		if (copy_to_user(params, &ret, sizeof(ret)))
  1279			return -EFAULT;
  1280		return 0;
  1281	}
  1282	
  1283	static void ext4_sb_setparams(struct ext4_sb_info *sbi,
  1284				      struct ext4_super_block *es, const void *arg)
  1285	{
  1286		const struct ext4_tune_sb_params *params = arg;
  1287	
  1288		if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
  1289			es->s_errors = cpu_to_le16(params->errors_behavior);
  1290		if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
  1291			es->s_mnt_count = cpu_to_le16(params->mnt_count);
  1292		if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
  1293			es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
  1294		if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
  1295			es->s_checkinterval = cpu_to_le32(params->checkinterval);
  1296		if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
  1297			es->s_lastcheck = cpu_to_le32(params->last_check_time);
  1298		if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
  1299			ext4_fsblk_t blk = params->reserved_blocks;
  1300	
  1301			es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
  1302			es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
  1303		}
  1304		if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
  1305			int uid = params->reserved_uid;
  1306	
  1307			es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
  1308			es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
  1309		}
  1310		if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
  1311			int gid = params->reserved_gid;
  1312	
  1313			es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
  1314			es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
  1315		}
  1316		if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
  1317			es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
  1318		if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
  1319			es->s_def_hash_version = params->def_hash_alg;
  1320		if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
  1321			es->s_raid_stride = cpu_to_le16(params->raid_stride);
  1322		if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
> 1323			es->s_raid_stripe_width =
  1324				cpu_to_le16(params->raid_stripe_width);
  1325		strscpy_pad(es->s_mount_opts, params->mount_opts);
  1326		if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
  1327			es->s_feature_compat |=
  1328				cpu_to_le32(params->set_feature_compat_mask);
  1329			es->s_feature_incompat |=
  1330				cpu_to_le32(params->set_feature_incompat_mask);
  1331			es->s_feature_ro_compat |=
  1332				cpu_to_le32(params->set_feature_ro_compat_mask);
  1333			es->s_feature_compat &=
  1334				~cpu_to_le32(params->clear_feature_compat_mask);
  1335			es->s_feature_incompat &=
  1336				~cpu_to_le32(params->clear_feature_incompat_mask);
  1337			es->s_feature_ro_compat &=
  1338				~cpu_to_le32(params->clear_feature_ro_compat_mask);
  1339			if (params->set_feature_compat_mask &
  1340			    EXT4_FEATURE_COMPAT_DIR_INDEX)
  1341				es->s_def_hash_version = sbi->s_def_hash_version;
  1342			if (params->set_feature_incompat_mask &
  1343			    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
  1344				es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
  1345		}
  1346		if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
  1347			es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
  1348	}
  1349	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-09 17:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250909165718.GP789684@nvidia.com>

> Yes, but lets design things to have this kind of logical code model
> where there are specific serializations, with specific versions that
> are at least discoverably by greping for some struct luo_xxx_ops or
> whatever.
>
> Let's avoid open coding versioning stuff where it is hard to find and
> hard to later make a manifest out of

Fully agreed, the versioning has to be centralized (not open coded,
and verified in a single place) & discoverable.

>
> Jason
>

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-09 16:57 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAvxvXKKanKzMZYrknBnVBUGBwYmgXppdiPbotbXRkGeQ@mail.gmail.com>

On Tue, Sep 09, 2025 at 12:30:35PM -0400, Pasha Tatashin wrote:
> On Tue, Sep 9, 2025 at 11:54 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Sep 09, 2025 at 11:40:18AM -0400, Pasha Tatashin wrote:
> > > In reality, this is not something that is high priority for cloud
> > > providers, because these kinds of incompatibilities would be found
> > > during qualification; the kernel will fail to update by detecting a
> > > version mismatch during boot instead of during shutdown.
> >
> > Given I expect CSPs will have to add-in specific version support for
> > their own special version-pair needs, I think it would be helpful in
> > the long run to have a tool that reported what versions a kernel build
> > wrote and parsed. Test-to-learn the same information sounds a bit too
> > difficult.
> 
> Yes, I agree. My point was only about the near term: it's just not a
> priority at the moment. This won't block us in the future, as we can
> always add a tooling later to inject the required ELF segments for
> pre-live update checks.

Yes, but lets design things to have this kind of logical code model
where there are specific serializations, with specific versions that
are at least discoverably by greping for some struct luo_xxx_ops or
whatever.

Let's avoid open coding versioning stuff where it is hard to find and
hard to later make a manifest out of

Jason


^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-09 16:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250909155407.GO789684@nvidia.com>

On Tue, Sep 9, 2025 at 11:54 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 09, 2025 at 11:40:18AM -0400, Pasha Tatashin wrote:
> > In reality, this is not something that is high priority for cloud
> > providers, because these kinds of incompatibilities would be found
> > during qualification; the kernel will fail to update by detecting a
> > version mismatch during boot instead of during shutdown.
>
> Given I expect CSPs will have to add-in specific version support for
> their own special version-pair needs, I think it would be helpful in
> the long run to have a tool that reported what versions a kernel build
> wrote and parsed. Test-to-learn the same information sounds a bit too
> difficult.

Yes, I agree. My point was only about the near term: it's just not a
priority at the moment. This won't block us in the future, as we can
always add a tooling later to inject the required ELF segments for
pre-live update checks.

Pasha

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-09 16:25 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0h5xbk4ap.fsf@yadavpratyush.com>

> I think it would help with making a wider range of roll back and forward
> options available. For example, if your current kernel can speak version
> A and B, and you are rolling back to a kernel that only speaks A, this
> information can be used to choose the right serialization formats.

At least for upstream, we discussed not to support rolling back (this
can be revised in the future), but for now rollback is something that
would need to be taken care of downstream.

Pasha

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-09 15:56 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf,
	changyuanl, rppt, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, parav,
	leonro, witu
In-Reply-To: <CA+CK2bAKL-gyER2abOV-f4M6HOx9=xDE+=jtcDL6YFbQf1-6og@mail.gmail.com>

On Tue, Sep 09 2025, Pasha Tatashin wrote:

> On Tue, Sep 9, 2025 at 10:53 AM Pratyush Yadav <me@yadavpratyush.com> wrote:
>>
>> On Thu, Sep 04 2025, Jason Gunthorpe wrote:
>>
>> > On Thu, Sep 04, 2025 at 02:57:35PM +0200, Pratyush Yadav wrote:
[...]
>> >> But perhaps it might be a better idea to come up with a mechanism for
>> >> the kernel to discover which formats the "next" kernel speaks so it can
>> >> for one decide whether it can do the live update at all, and for another
>> >> which formats it should use. Maybe we give a way for luod to choose
>> >> formats, and give it the responsibility for doing these checks?
>> >
>> > I have felt that we should catalog the formats&versions the kernel can
>> > read/write in some way during kbuild.
>> >
>> > Maybe this turns into a sysfs directory of all the data with an
>> > 'enable_write' flag that luod could set to 0 to optimize.
>> >
>> > And maybe this could be a kbuild report that luod could parse to do
>> > this optimization.
>>
>> Or maybe we put that information in a ELF section in the kernel image?
>> Not sure how feasible it would be for tooling to read but I think that
>> would very closely associate the versions info with the kernel. The
>> other option might be to put it somewhere with modules I guess.
>
> To me, all this sounds like hardening, which, while important, can be
> added later. The pre-kexec check for compatibility can be defined and
> implemented once we have all live update components ready
> (KHO/LUO/PCI/IOMMU/VFIO/MEMFD), once we stabilize the versioning
> story, and once we start discussing update stability.

Right. I don't think this is something the current LUO patches have to
solve. This is for later down the line.

>
> Currently, we've agreed that there are no stability guarantees.
> Sometime in the future, we may guarantee minor-to-minor stability, and
> later, stable-to-stable. Once we start working on minor-to-minor
> stability, it would be a good idea to also add hardening where a
> pre-live update would check for compatibility.
>
> In reality, this is not something that is high priority for cloud
> providers, because these kinds of incompatibilities would be found
> during qualification; the kernel will fail to update by detecting a
> version mismatch during boot instead of during shutdown.

I think it would help with making a wider range of roll back and forward
options available. For example, if your current kernel can speak version
A and B, and you are rolling back to a kernel that only speaks A, this
information can be used to choose the right serialization formats.

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-09 15:54 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAKL-gyER2abOV-f4M6HOx9=xDE+=jtcDL6YFbQf1-6og@mail.gmail.com>

On Tue, Sep 09, 2025 at 11:40:18AM -0400, Pasha Tatashin wrote:
> In reality, this is not something that is high priority for cloud
> providers, because these kinds of incompatibilities would be found
> during qualification; the kernel will fail to update by detecting a
> version mismatch during boot instead of during shutdown.

Given I expect CSPs will have to add-in specific version support for
their own special version-pair needs, I think it would be helpful in
the long run to have a tool that reported what versions a kernel build
wrote and parsed. Test-to-learn the same information sounds a bit too
difficult.

Something with ELF is probably how to do that, but I don't imagine a
use in a runtime check consuming this information.

Jason

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-09 15:40 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0cy7zllsn.fsf@yadavpratyush.com>

On Tue, Sep 9, 2025 at 10:53 AM Pratyush Yadav <me@yadavpratyush.com> wrote:
>
> On Thu, Sep 04 2025, Jason Gunthorpe wrote:
>
> > On Thu, Sep 04, 2025 at 02:57:35PM +0200, Pratyush Yadav wrote:
> >
> >> I don't think it matters if they are preserved or not. The serialization
> >> and deserialization is independent of that. You can very well create a
> >> KHO array that you don't KHO-preserve. On next boot, you can still use
> >> it, you just have to be careful of doing it while scratch-only. Same as
> >> we do now.
> >
> > The KHO array machinery itself can't preserve its own memory
> > either.
>
> It can. Maybe it couldn't in the version I showed you, but now it can.
> See kho_array_preserve() in
> https://lore.kernel.org/linux-mm/20250909144426.33274-2-pratyush@kernel.org/
>
> >
> >> For the _hypervisor_ live update case, sure. Though even there, I have a
> >> feeling we will start seeing userspace components on the hypervisor use
> >> memfd for stashing some of their state.
> >
> > Sure, but don't make excessively sparse memfds for kexec use, why
> > should that be hard?
>
> Sure, I don't think they should be excessively sparse. But _some_ level
> of sparseness can be there.

This is right; loosely sparse memfd support is needed. However, an
excessively sparse preservation will be inefficient for LU, unless we
change the backing to be from a separate pool of physical pages that
is always preserved. If we do that, it would probably make sense only
for guestmemfd and only if we ever decide to support overcommitted
VMs. I suspect it is not something that we currently need to worry
about.

> >> applications. Think big storage nodes with memory in order of TiB. Those
> >> can use a memfd to back their caches so on a kernel upgrade the caches
> >> don't have to be re-fetched. Sparseness is to be expected for such use
> >> cases.
> >
> > Oh? I'm surpised you'd have sparseness there. sparseness seems like
> > such a weird feature to want to rely on :\
> >
> >> But perhaps it might be a better idea to come up with a mechanism for
> >> the kernel to discover which formats the "next" kernel speaks so it can
> >> for one decide whether it can do the live update at all, and for another
> >> which formats it should use. Maybe we give a way for luod to choose
> >> formats, and give it the responsibility for doing these checks?
> >
> > I have felt that we should catalog the formats&versions the kernel can
> > read/write in some way during kbuild.
> >
> > Maybe this turns into a sysfs directory of all the data with an
> > 'enable_write' flag that luod could set to 0 to optimize.
> >
> > And maybe this could be a kbuild report that luod could parse to do
> > this optimization.
>
> Or maybe we put that information in a ELF section in the kernel image?
> Not sure how feasible it would be for tooling to read but I think that
> would very closely associate the versions info with the kernel. The
> other option might be to put it somewhere with modules I guess.

To me, all this sounds like hardening, which, while important, can be
added later. The pre-kexec check for compatibility can be defined and
implemented once we have all live update components ready
(KHO/LUO/PCI/IOMMU/VFIO/MEMFD), once we stabilize the versioning
story, and once we start discussing update stability.

Currently, we've agreed that there are no stability guarantees.
Sometime in the future, we may guarantee minor-to-minor stability, and
later, stable-to-stable. Once we start working on minor-to-minor
stability, it would be a good idea to also add hardening where a
pre-live update would check for compatibility.

In reality, this is not something that is high priority for cloud
providers, because these kinds of incompatibilities would be found
during qualification; the kernel will fail to update by detecting a
version mismatch during boot instead of during shutdown.

> > And maybe distro/csps use this information mechanically to check if
> > version pairs are kexec compatible.
> >
> > Which re-enforces my feeling that the formats/version should be first
> > class concepts, every version should be registered and luo should
> > sequence calling the code for the right version at the right time.
> >
> > Jason
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-09 14:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250904144240.GO470103@nvidia.com>

On Thu, Sep 04 2025, Jason Gunthorpe wrote:

> On Thu, Sep 04, 2025 at 02:57:35PM +0200, Pratyush Yadav wrote:
>
>> I don't think it matters if they are preserved or not. The serialization
>> and deserialization is independent of that. You can very well create a
>> KHO array that you don't KHO-preserve. On next boot, you can still use
>> it, you just have to be careful of doing it while scratch-only. Same as
>> we do now.
>
> The KHO array machinery itself can't preserve its own memory
> either.

It can. Maybe it couldn't in the version I showed you, but now it can.
See kho_array_preserve() in
https://lore.kernel.org/linux-mm/20250909144426.33274-2-pratyush@kernel.org/

>
>> For the _hypervisor_ live update case, sure. Though even there, I have a
>> feeling we will start seeing userspace components on the hypervisor use
>> memfd for stashing some of their state. 
>
> Sure, but don't make excessively sparse memfds for kexec use, why
> should that be hard?

Sure, I don't think they should be excessively sparse. But _some_ level
of sparseness can be there.

>
>> applications. Think big storage nodes with memory in order of TiB. Those
>> can use a memfd to back their caches so on a kernel upgrade the caches
>> don't have to be re-fetched. Sparseness is to be expected for such use
>> cases.
>
> Oh? I'm surpised you'd have sparseness there. sparseness seems like
> such a weird feature to want to rely on :\
>
>> But perhaps it might be a better idea to come up with a mechanism for
>> the kernel to discover which formats the "next" kernel speaks so it can
>> for one decide whether it can do the live update at all, and for another
>> which formats it should use. Maybe we give a way for luod to choose
>> formats, and give it the responsibility for doing these checks?
>
> I have felt that we should catalog the formats&versions the kernel can
> read/write in some way during kbuild.
>
> Maybe this turns into a sysfs directory of all the data with an
> 'enable_write' flag that luod could set to 0 to optimize.
>
> And maybe this could be a kbuild report that luod could parse to do
> this optimization.

Or maybe we put that information in a ELF section in the kernel image?
Not sure how feasible it would be for tooling to read but I think that
would very closely associate the versions info with the kernel. The
other option might be to put it somewhere with modules I guess.

>
> And maybe distro/csps use this information mechanically to check if
> version pairs are kexec compatible.
>
> Which re-enforces my feeling that the formats/version should be first
> class concepts, every version should be registered and luo should
> sequence calling the code for the right version at the right time.
>
> Jason

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-09 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chris Li, Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl,
	rppt, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie,
	ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250904173433.GA616306@nvidia.com>

On Thu, Sep 04 2025, Jason Gunthorpe wrote:

> On Wed, Sep 03, 2025 at 05:01:15AM -0700, Chris Li wrote:
>
>> > And if you want to serialize that the optimal path would be to have a
>> > vmalloc of all the strings and a vmalloc of the [] data, sort of like
>> > the kho array idea.
>> 
>> The KHO array idea is already implemented in the existing KHO code or
>> that is something new you want to propose?
>
> Pratyush has proposed it

I just sent out the RFC:
https://lore.kernel.org/linux-mm/20250909144426.33274-1-pratyush@kernel.org/T/#u

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* [PATCH 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
parameters to be set while the file system is mounted, without needing
write access to the block device.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ioctl.c           | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/ext4.h |  75 ++++++++++++++++++++++
 2 files changed, 324 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 84e3c73952d72e436429489f5fc8b7ae1c01c7a1..569c98c962af63130c0119f60788a26a2807bd86 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -27,14 +27,16 @@
 #include "fsmap.h"
 #include <trace/events/ext4.h>
 
-typedef void ext4_update_sb_callback(struct ext4_super_block *es,
-				       const void *arg);
+typedef void ext4_update_sb_callback(struct ext4_sb_info *sbi,
+				     struct ext4_super_block *es,
+				     const void *arg);
 
 /*
  * Superblock modification callback function for changing file system
  * label
  */
-static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setlabel(struct ext4_sb_info *sbi,
+			     struct ext4_super_block *es, const void *arg)
 {
 	/* Sanity check, this should never happen */
 	BUILD_BUG_ON(sizeof(es->s_volume_name) < EXT4_LABEL_MAX);
@@ -46,7 +48,8 @@ static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
  * Superblock modification callback function for changing file system
  * UUID.
  */
-static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setuuid(struct ext4_sb_info *sbi,
+			    struct ext4_super_block *es, const void *arg)
 {
 	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
 }
@@ -71,7 +74,7 @@ int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
 		goto out_err;
 
 	lock_buffer(bh);
-	func(es, arg);
+	func(sbi, es, arg);
 	ext4_superblock_csum_set(sb);
 	unlock_buffer(bh);
 
@@ -149,7 +152,7 @@ static int ext4_update_backup_sb(struct super_block *sb,
 		unlock_buffer(bh);
 		goto out_bh;
 	}
-	func(es, arg);
+	func(EXT4_SB(sb), es, arg);
 	if (ext4_has_feature_metadata_csum(sb))
 		es->s_checksum = ext4_superblock_csum(es);
 	set_buffer_uptodate(bh);
@@ -1230,6 +1233,239 @@ static int ext4_ioctl_setuuid(struct file *filp,
 	return ret;
 }
 
+
+#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
+	EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
+	EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
+	EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
+	EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
+	EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
+	EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
+	EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
+	EXT4_TUNE_FL_FORCE_FSCK)
+
+static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
+				  struct ext4_tune_sb_params __user *params)
+{
+	struct ext4_tune_sb_params ret;
+	struct ext4_super_block *es = sbi->s_es;
+
+	memset(&ret, 0, sizeof(ret));
+	ret.set_flags = TUNE_OPS_SUPPORTED;
+	ret.errors_behavior = es->s_errors;
+	ret.mnt_count = le16_to_cpu(es->s_mnt_count);
+	ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
+	ret.checkinterval = le32_to_cpu(es->s_checkinterval);
+	ret.last_check_time = le32_to_cpu(es->s_lastcheck);
+	ret.reserved_blocks = ext4_r_blocks_count(es);
+	ret.blocks_count = ext4_blocks_count(es);
+	ret.reserved_uid = ext4_get_resuid(es);
+	ret.reserved_gid = ext4_get_resgid(es);
+	ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
+	ret.def_hash_alg = es->s_def_hash_version;
+	ret.raid_stride = le16_to_cpu(es->s_raid_stride);
+	ret.raid_stripe_width = le16_to_cpu(es->s_raid_stripe_width);
+	strscpy_pad(ret.mount_opts, es->s_mount_opts);
+	ret.feature_compat = le32_to_cpu(es->s_feature_compat);
+	ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
+	ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
+	ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
+	ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
+	ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
+	ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
+	ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
+	ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
+	if (copy_to_user(params, &ret, sizeof(ret)))
+		return -EFAULT;
+	return 0;
+}
+
+static void ext4_sb_setparams(struct ext4_sb_info *sbi,
+			      struct ext4_super_block *es, const void *arg)
+{
+	const struct ext4_tune_sb_params *params = arg;
+
+	if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
+		es->s_errors = cpu_to_le16(params->errors_behavior);
+	if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
+		es->s_mnt_count = cpu_to_le16(params->mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
+		es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
+		es->s_checkinterval = cpu_to_le32(params->checkinterval);
+	if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
+		es->s_lastcheck = cpu_to_le32(params->last_check_time);
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
+		ext4_fsblk_t blk = params->reserved_blocks;
+
+		es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
+		es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
+		int uid = params->reserved_uid;
+
+		es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
+		es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
+		int gid = params->reserved_gid;
+
+		es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
+		es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
+		es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
+	if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		es->s_def_hash_version = params->def_hash_alg;
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
+		es->s_raid_stride = cpu_to_le16(params->raid_stride);
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
+		es->s_raid_stripe_width =
+			cpu_to_le16(params->raid_stripe_width);
+	strscpy_pad(es->s_mount_opts, params->mount_opts);
+	if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		es->s_feature_compat |=
+			cpu_to_le32(params->set_feature_compat_mask);
+		es->s_feature_incompat |=
+			cpu_to_le32(params->set_feature_incompat_mask);
+		es->s_feature_ro_compat |=
+			cpu_to_le32(params->set_feature_ro_compat_mask);
+		es->s_feature_compat &=
+			~cpu_to_le32(params->clear_feature_compat_mask);
+		es->s_feature_incompat &=
+			~cpu_to_le32(params->clear_feature_incompat_mask);
+		es->s_feature_ro_compat &=
+			~cpu_to_le32(params->clear_feature_ro_compat_mask);
+		if (params->set_feature_compat_mask &
+		    EXT4_FEATURE_COMPAT_DIR_INDEX)
+			es->s_def_hash_version = sbi->s_def_hash_version;
+		if (params->set_feature_incompat_mask &
+		    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
+			es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
+		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
+}
+
+static int ext4_ioctl_set_tune_sb(struct file *filp,
+				  struct ext4_tune_sb_params __user *in)
+{
+	struct ext4_tune_sb_params params;
+	struct super_block *sb = file_inode(filp)->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_super_block *es = sbi->s_es;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&params, in, sizeof(params)))
+		return -EFAULT;
+
+	if ((params.set_flags & ~TUNE_OPS_SUPPORTED) != 0)
+		return -EOPNOTSUPP;
+
+	if ((params.set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR) &&
+	    (params.errors_behavior > EXT4_ERRORS_PANIC))
+		return -EINVAL;
+
+	if ((params.set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) &&
+	    (params.reserved_blocks > ext4_blocks_count(sbi->s_es) / 2))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG) &&
+	    ((params.def_hash_alg > DX_HASH_LAST) ||
+	     (params.def_hash_alg == DX_HASH_SIPHASH)))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_FEATURES) &&
+	    (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES))
+		return -EINVAL;
+
+	if (params.set_flags & EXT4_TUNE_FL_FEATURES) {
+		params.set_feature_compat_mask =
+			params.feature_compat &
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask =
+			params.feature_incompat &
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask =
+			params.feature_ro_compat &
+			~le32_to_cpu(es->s_feature_ro_compat);
+		params.clear_feature_compat_mask =
+			~params.feature_compat &
+			le32_to_cpu(es->s_feature_compat);
+		params.clear_feature_incompat_mask =
+			~params.feature_incompat &
+			le32_to_cpu(es->s_feature_incompat);
+		params.clear_feature_ro_compat_mask =
+			~params.feature_ro_compat &
+			le32_to_cpu(es->s_feature_ro_compat);
+		params.set_flags |= EXT4_TUNE_FL_EDIT_FEATURES;
+	}
+	if (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		if ((params.set_feature_compat_mask &
+		     ~EXT4_TUNE_SET_COMPAT_SUPP) ||
+		    (params.set_feature_incompat_mask &
+		     ~EXT4_TUNE_SET_INCOMPAT_SUPP) ||
+		    (params.set_feature_ro_compat_mask &
+		     ~EXT4_TUNE_SET_RO_COMPAT_SUPP) ||
+		    (params.clear_feature_compat_mask &
+		     ~EXT4_TUNE_CLEAR_COMPAT_SUPP) ||
+		    (params.clear_feature_incompat_mask &
+		     ~EXT4_TUNE_CLEAR_INCOMPAT_SUPP) ||
+		    (params.clear_feature_ro_compat_mask &
+		     ~EXT4_TUNE_CLEAR_RO_COMPAT_SUPP))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Filter out the features that are already set from
+		 * the set_mask.
+		 */
+		params.set_feature_compat_mask &=
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask &=
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask &=
+			~le32_to_cpu(es->s_feature_ro_compat);
+		if ((params.set_feature_compat_mask &
+		     EXT4_FEATURE_COMPAT_DIR_INDEX) &&
+		    !ext4_has_feature_dir_index(sb)) {
+			uuid_t	uu;
+
+			memcpy(&uu, sbi->s_hash_seed, UUID_SIZE);
+			if (uuid_is_null(&uu))
+				generate_random_uuid((char *)
+						     &sbi->s_hash_seed);
+			if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+				sbi->s_def_hash_version = params.def_hash_alg;
+			else if (sbi->s_def_hash_version == 0)
+				sbi->s_def_hash_version = DX_HASH_HALF_MD4;
+			if (!(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH)) &&
+			    !(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_SIGNED_HASH))) {
+#ifdef __CHAR_UNSIGNED__
+				sbi->s_hash_unsigned = 3;
+#else
+				sbi->s_hash_unsigned = 0;
+#endif
+			}
+		}
+	}
+
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	ret = ext4_update_superblocks_fn(sb, ext4_sb_setparams, &params);
+	mnt_drop_write_file(filp);
+
+	if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		sbi->s_def_hash_version = params.def_hash_alg;
+
+	return ret;
+}
+
 static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -1616,6 +1852,11 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
 	case EXT4_IOC_SETFSUUID:
 		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
+	case EXT4_IOC_GET_TUNE_SB_PARAM:
+		return ext4_ioctl_get_tune_sb(EXT4_SB(sb),
+					      (void __user *)arg);
+	case EXT4_IOC_SET_TUNE_SB_PARAM:
+		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
 	default:
 		return -ENOTTY;
 	}
@@ -1703,7 +1944,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 }
 #endif
 
-static void set_overhead(struct ext4_super_block *es, const void *arg)
+static void set_overhead(struct ext4_sb_info *sbi,
+			 struct ext4_super_block *es, const void *arg)
 {
 	es->s_overhead_clusters = cpu_to_le32(*((unsigned long *) arg));
 }
diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
index 1c4c2dd29112cda9f7dc91d917492cffc33ee524..145875fd633772e76ce7fd8bc0fef136ff620d2d 100644
--- a/include/uapi/linux/ext4.h
+++ b/include/uapi/linux/ext4.h
@@ -33,6 +33,8 @@
 #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
 #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
 #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
+#define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
+#define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
 
 #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
 
@@ -108,6 +110,79 @@ struct ext4_new_group_input {
 	__u16 unused;
 };
 
+struct ext4_tune_sb_params {
+	__u32 set_flags;
+	__u32 checkinterval;
+	__u16 errors_behavior;
+	__u16 mnt_count;
+	__u16 max_mnt_count;
+	__u16 raid_stride;
+	__u64 last_check_time;
+	__u64 reserved_blocks;
+	__u64 blocks_count;
+	__u32 default_mnt_opts;
+	__u32 reserved_uid;
+	__u32 reserved_gid;
+	__u32 raid_stripe_width;
+	__u8  def_hash_alg;
+	__u8  pad_1;
+	__u16 pad_2;
+	__u32 feature_compat;
+	__u32 feature_incompat;
+	__u32 feature_ro_compat;
+	__u32 set_feature_compat_mask;
+	__u32 set_feature_incompat_mask;
+	__u32 set_feature_ro_compat_mask;
+	__u32 clear_feature_compat_mask;
+	__u32 clear_feature_incompat_mask;
+	__u32 clear_feature_ro_compat_mask;
+	__u8  mount_opts[64];
+	__u8  pad[64];
+};
+
+#define EXT4_TUNE_FL_ERRORS_BEHAVIOR	0x00000001
+#define EXT4_TUNE_FL_MNT_COUNT		0x00000002
+#define EXT4_TUNE_FL_MAX_MNT_COUNT	0x00000004
+#define EXT4_TUNE_FL_CHECKINTRVAL	0x00000008
+#define EXT4_TUNE_FL_LAST_CHECK_TIME	0x00000010
+#define EXT4_TUNE_FL_RESERVED_BLOCKS	0x00000020
+#define EXT4_TUNE_FL_RESERVED_UID	0x00000040
+#define EXT4_TUNE_FL_RESERVED_GID	0x00000080
+#define EXT4_TUNE_FL_DEFAULT_MNT_OPTS	0x00000100
+#define EXT4_TUNE_FL_DEF_HASH_ALG	0x00000200
+#define EXT4_TUNE_FL_RAID_STRIDE	0x00000400
+#define EXT4_TUNE_FL_RAID_STRIPE_WIDTH	0x00000800
+#define EXT4_TUNE_FL_MOUNT_OPTS		0x00001000
+#define EXT4_TUNE_FL_FEATURES		0x00002000
+#define EXT4_TUNE_FL_EDIT_FEATURES	0x00004000
+#define EXT4_TUNE_FL_FORCE_FSCK		0x00008000
+
+#define EXT4_TUNE_SET_COMPAT_SUPP \
+		(EXT4_FEATURE_COMPAT_DIR_INDEX |	\
+		 EXT4_FEATURE_COMPAT_STABLE_INODES)
+#define EXT4_TUNE_SET_INCOMPAT_SUPP \
+		(EXT4_FEATURE_INCOMPAT_EXTENTS |	\
+		 EXT4_FEATURE_INCOMPAT_EA_INODE |	\
+		 EXT4_FEATURE_INCOMPAT_ENCRYPT |	\
+		 EXT4_FEATURE_INCOMPAT_CSUM_SEED |	\
+		 EXT4_FEATURE_INCOMPAT_LARGEDIR |	\
+		 EXT4_FEATURE_INCOMPAT_CASEFOLD)
+#define EXT4_TUNE_SET_RO_COMPAT_SUPP \
+		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
+		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
+		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
+		 EXT4_FEATURE_RO_COMPAT_READONLY |	\
+		 EXT4_FEATURE_RO_COMPAT_PROJECT |	\
+		 EXT4_FEATURE_RO_COMPAT_VERITY)
+
+#define EXT4_TUNE_CLEAR_COMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_INCOMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_RO_COMPAT_SUPP \
+		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
+		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
+		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
+		 EXT4_FEATURE_RO_COMPAT_PROJECT)
+
 /*
  * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
  * It indicates that the entry in extent status cache is for a hole.

-- 
2.51.0



^ permalink raw reply related

* [PATCH 0/3] ext4: Add support for mounted updates to the superblock via an ioctl
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable

This patch series enables a future version of tune2fs to be able to
modify certain parts of the ext4 superblock without to write to the
block device.

The first patch fixes a potential buffer overrun caused by a
maliciously moified superblock.  The second patch adds support for
32-bit uid and gid's which can have access to the reserved blocks pool.
The last patch adds the ioctl's which will be used by tune2fs.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
Theodore Ts'o (3):
      ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
      ext4: add support for 32-bit default reserved uid and gid values
      ext4: implemet new ioctls to set and get superblock parameters

 fs/ext4/ext4.h            |  16 ++++-
 fs/ext4/ioctl.c           | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c           |  25 +++-----
 include/uapi/linux/ext4.h |  75 ++++++++++++++++++++++
 4 files changed, 348 insertions(+), 24 deletions(-)
---
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
change-id: 20250830-tune2fs-3376beb72403

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>



^ permalink raw reply

* [PATCH 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Unlike other strings in the ext4 superblock, we rely on tune2fs to
make sure s_mount_opts is NUL terminated.  Harden
parse_apply_sb_mount_options() by treating s_mount_opts as a potential
__nonstring.

Cc: stable@vger.kernel.org
Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/super.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 					struct ext4_fs_context *m_ctx)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	char *s_mount_opts = NULL;
+	char s_mount_opts[65];
 	struct ext4_fs_context *s_ctx = NULL;
 	struct fs_context *fc = NULL;
 	int ret = -ENOMEM;
@@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	if (!sbi->s_es->s_mount_opts[0])
 		return 0;
 
-	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
-				sizeof(sbi->s_es->s_mount_opts),
-				GFP_KERNEL);
-	if (!s_mount_opts)
-		return ret;
+	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
 
 	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
 	if (!fc)
-		goto out_free;
+		return -ENOMEM;
 
 	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
 	if (!s_ctx)
@@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	ret = 0;
 
 out_free:
-	if (fc) {
-		ext4_fc_free(fc);
-		kfree(fc);
-	}
-	kfree(s_mount_opts);
+	ext4_fc_free(fc);
+	kfree(fc);
 	return ret;
 }
 

-- 
2.51.0



^ permalink raw reply related

* [PATCH 2/3] ext4: add support for 32-bit default reserved uid and gid values
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Support for specifying the default user id and group id that is
allowed to use the reserved block space was added way back when Linux
only supported 16-bit uid's and gid's.  (Yeah, that long ago.)  It's
not a commonly used feature, but let's add support for 32-bit user and
group id's.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ext4.h  | 16 +++++++++++++++-
 fs/ext4/super.c |  8 ++++----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 01a6e2de7fc3ef0e20b039d3200b9c9bd656f59f..4bfcd5f0c74fda30db4009ee28fbee00a2f6b76f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1442,7 +1442,9 @@ struct ext4_super_block {
 	__le16  s_encoding;		/* Filename charset encoding */
 	__le16  s_encoding_flags;	/* Filename charset encoding flags */
 	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
-	__le32	s_reserved[94];		/* Padding to the end of the block */
+	__le16	s_def_resuid_hi;
+	__le16	s_def_resgid_hi;
+	__le32	s_reserved[93];		/* Padding to the end of the block */
 	__le32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -1812,6 +1814,18 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
 
+static inline int ext4_get_resuid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resuid) |
+	       (le16_to_cpu(es->s_def_resuid_hi) << 16));
+}
+
+static inline int ext4_get_resgid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resgid) |
+	       (le16_to_cpu(es->s_def_resgid_hi) << 16));
+}
+
 /*
  * Returns: sbi->field[index]
  * Used to access an array element from the following sbi fields which require
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94c98446c84f9a4614971d246ca7f001de610a8a..0256c8f7c6cee2b8d9295f2fa9a7acd904382e83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2951,11 +2951,11 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 	}
 
 	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
-	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
+	    ext4_get_resuid(es) != EXT4_DEF_RESUID)
 		SEQ_OPTS_PRINT("resuid=%u",
 				from_kuid_munged(&init_user_ns, sbi->s_resuid));
 	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
-	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
+	    ext4_get_resgid(es) != EXT4_DEF_RESGID)
 		SEQ_OPTS_PRINT("resgid=%u",
 				from_kgid_munged(&init_user_ns, sbi->s_resgid));
 	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
@@ -5270,8 +5270,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 
 	ext4_set_def_opts(sb, es);
 
-	sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
-	sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
+	sbi->s_resuid = make_kuid(&init_user_ns, ext4_get_resuid(es));
+	sbi->s_resgid = make_kgid(&init_user_ns, ext4_get_resuid(es));
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

-- 
2.51.0



^ permalink raw reply related

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-05 16:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Shuah Khan,
	linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api
In-Reply-To: <202509050900.8A01B1E6@keescook>

[-- Attachment #1: Type: text/plain, Size: 375 bytes --]

On Fri, Sep 05, 2025 at 09:00:51AM -0700, Kees Cook wrote:
> On Fri, Sep 05, 2025 at 04:43:22PM +0100, Mark Brown wrote:

> > discover the problem fairly rapidly in testing.  ss_token would shorter
> > but the abbreviation is less clear, whatever name you prefer is fine by
> > me.

> Bike shed: shstk_token?

That also works and is fine by me, probably better than my idea.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Kees Cook @ 2025-09-05 16:00 UTC (permalink / raw)
  To: Mark Brown
  Cc: Christian Brauner, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Shuah Khan,
	linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api
In-Reply-To: <0ff8b70e-283f-4d56-8bab-bcae11cd5bdb@sirena.org.uk>

On Fri, Sep 05, 2025 at 04:43:22PM +0100, Mark Brown wrote:
> On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> > On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> 
> > > +		.shadow_stack_token	= args.shadow_stack_token,
> 
> > I'm not sure why this has to be named "shadow_stack_token" I think
> > that's just confusing and we should just call it "shadow_stack" and be
> > done with it. It's also a bit long of a field name imho.
> 
> I'm not hugely attached to the name, if you want to rename that's
> perfectly fine by me.  My thinking was that there's a potential
> confusion with it being a pointer to the base of the shadow stack by
> comparison with the existing "stack" but I do agree that the resulting
> name is quite long and if someone does actually get confused they should
> discover the problem fairly rapidly in testing.  ss_token would shorter
> but the abbreviation is less clear, whatever name you prefer is fine by
> me.

Bike shed: shstk_token?


-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Edgecombe, Rick P @ 2025-09-05 15:53 UTC (permalink / raw)
  To: broonie@kernel.org, brauner@kernel.org
  Cc: dietmar.eggemann@arm.com, linux-api@vger.kernel.org,
	Szabolcs.Nagy@arm.com, shuah@kernel.org,
	dave.hansen@linux.intel.com, debug@rivosinc.com, mgorman@suse.de,
	vincent.guittot@linaro.org, fweimer@redhat.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	rostedt@goodmis.org, hjl.tools@gmail.com, tglx@linutronix.de,
	akpm@linux-foundation.org, vschneid@redhat.com,
	catalin.marinas@arm.com, kees@kernel.org, will@kernel.org,
	hpa@zytor.com, peterz@infradead.org, jannh@google.com,
	yury.khrustalev@arm.com, bp@alien8.de, wilco.dijkstra@arm.com,
	bsegall@google.com, linux-kselftest@vger.kernel.org,
	x86@kernel.org, juri.lelli@redhat.com
In-Reply-To: <0ff8b70e-283f-4d56-8bab-bcae11cd5bdb@sirena.org.uk>

On Fri, 2025-09-05 at 16:43 +0100, Mark Brown wrote:
> On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> > On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> 
> > > +		.shadow_stack_token	= args.shadow_stack_token,
> 
> > I'm not sure why this has to be named "shadow_stack_token" I think
> > that's just confusing and we should just call it "shadow_stack" and be
> > done with it. It's also a bit long of a field name imho.
> 
> I'm not hugely attached to the name, if you want to rename that's
> perfectly fine by me.  My thinking was that there's a potential
> confusion with it being a pointer to the base of the shadow stack by
> comparison with the existing "stack" but I do agree that the resulting
> name is quite long and if someone does actually get confused they should
> discover the problem fairly rapidly in testing.  ss_token would shorter
> but the abbreviation is less clear, whatever name you prefer is fine by
> me.

Yea the token point here is kind of important. That said, we could probably make
up for it with documentation.

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-05 15:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook
In-Reply-To: <20250905-nutria-befund-2f3e92003734@brauner>

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:

> > +		.shadow_stack_token	= args.shadow_stack_token,

> I'm not sure why this has to be named "shadow_stack_token" I think
> that's just confusing and we should just call it "shadow_stack" and be
> done with it. It's also a bit long of a field name imho.

I'm not hugely attached to the name, if you want to rename that's
perfectly fine by me.  My thinking was that there's a potential
confusion with it being a pointer to the base of the shadow stack by
comparison with the existing "stack" but I do agree that the resulting
name is quite long and if someone does actually get confused they should
discover the problem fairly rapidly in testing.  ss_token would shorter
but the abbreviation is less clear, whatever name you prefer is fine by
me.

> I have a kernel-6.18.clone3 branch
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=kernel-6.18.clone3
> because there's another cross-arch cleanup that cleans up copy_thread(),
> copy_sighand(), and copy_process() and - surprisingly - also adds
> clone3() support for nios2...

> Anyway, if you just want me to slap it on top of that branch then I can
> simply rename while applying so no need to resend in that case.

That would be amazing, I'm totally happy with you doing that.  If I do
need to rebase and resend let me know.

It's probably worth mentioning that the RISC-V shadow stack support was
getting near to being merged, if that ends up being OK for this release
(it's not been posted yet though) there might be some cross tree merge
needed or something.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Christian Brauner @ 2025-09-05 15:21 UTC (permalink / raw)
  To: Mark Brown
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-4-4d9fff1c53e7@kernel.org>

On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the shadow
> stack for a new thread, instead the kernel will dynamically allocate a
> new shadow stack with the same size as the normal stack. This appears to
> be due to the shadow stack series having been in development since
> before the more extensible clone3() was added rather than anything more
> deliberate.
> 
> Add a parameter to clone3() specifying a shadow stack pointer to use
> for the new thread, this is inconsistent with the way we specify the
> normal stack but during review concerns were expressed about having to
> identify where the shadow stack pointer should be placed especially in
> cases where the shadow stack has been previously active.  If no shadow
> stack is specified then the existing implicit allocation behaviour is
> maintained.
> 
> If a shadow stack pointer is specified then it is required to have an
> architecture defined token placed on the stack, this will be consumed by
> the new task, the shadow stack is specified by pointing to this token.  If
> no valid token is present then this will be reported with -EINVAL.  This
> token prevents new threads being created pointing at the shadow stack of
> an existing running thread.  On architectures with support for userspace
> pivoting of shadow stacks it is expected that the same format and placement
> of tokens will be used, this is the case for arm64 and x86.
> 
> If the architecture does not support shadow stacks the shadow stack
> pointer must be not be specified, architectures that do support the
> feature are expected to enforce the same requirement on individual
> systems that lack shadow stack support.
> 
> Update the existing arm64 and x86 implementations to pay attention to
> the newly added arguments, in order to maintain compatibility we use the
> existing behaviour if no shadow stack is specified. Since we are now
> using more fields from the kernel_clone_args we pass that into the
> shadow stack code rather than individual fields.
> 
> Portions of the x86 architecture code were written by Rick Edgecombe.
> 
> Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
>  arch/x86/include/asm/shstk.h     | 11 +++--
>  arch/x86/kernel/process.c        |  2 +-
>  arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
>  include/asm-generic/cacheflush.h | 11 +++++
>  include/linux/sched/task.h       | 17 ++++++++
>  include/uapi/linux/sched.h       |  9 ++--
>  kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
>  8 files changed, 217 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
> index 3abcbf9adb5c..249ff05bca45 100644
> --- a/arch/arm64/mm/gcs.c
> +++ b/arch/arm64/mm/gcs.c
> @@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  {
>  	unsigned long addr, size;
>  
> -	if (!system_supports_gcs())
> +	if (!system_supports_gcs()) {
> +		if (args->shadow_stack_token)
> +			return -EINVAL;
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a GCS then use it, otherwise fall
> +	 * back to a default allocation strategy. Validation is done
> +	 * in arch_shstk_validate_clone().
> +	 */
> +	if (args->shadow_stack_token) {
> +		tsk->thread.gcs_base = 0;
> +		tsk->thread.gcs_size = 0;
> +		return 0;
> +	}
>  
>  	if (!task_gcs_el0_enabled(tsk))
>  		return 0;
> @@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  	return 0;
>  }
>  
> +static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
> +			      unsigned long user_addr)
> +{
> +	u64 expected = GCS_CAP(user_addr);
> +	u64 *token = page_address(page) + offset_in_page(user_addr);
> +
> +	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
> +		return false;
> +	set_page_dirty_lock(page);
> +
> +	return true;
> +}
> +
> +int arch_shstk_validate_clone(struct task_struct *tsk,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	unsigned long gcspr_el0;
> +	int ret = 0;
> +
> +	gcspr_el0 = args->shadow_stack_token;
> +	if (!gcs_consume_token(vma, page, gcspr_el0))
> +		return -EINVAL;
> +
> +	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
> +
> +	return ret;
> +}
> +
>  SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
>  {
>  	unsigned long alloc_size;
> diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> index ba6f2fe43848..827e983430aa 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
>  #include <linux/types.h>
>  
>  struct task_struct;
> +struct kernel_clone_args;
>  struct ksignal;
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>  
>  long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
>  void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
> -				       unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> +				       const struct kernel_clone_args *args);
>  void shstk_free(struct task_struct *p);
>  int setup_signal_shadow_stack(struct ksignal *ksig);
>  int restore_signal_shadow_stack(void);
> @@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
>  			       unsigned long arg2) { return -EINVAL; }
>  static inline void reset_thread_features(void) {}
>  static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> -						     unsigned long clone_flags,
> -						     unsigned long stack_size) { return 0; }
> +						     const struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
>  static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 1b7960cf6eb0..0a54af6c60df 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
>  	 * update it.
>  	 */
> -	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
> +	new_ssp = shstk_alloc_thread_stack(p, args);
>  	if (IS_ERR_VALUE(new_ssp))
>  		return PTR_ERR((void *)new_ssp);
>  
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 2ddf23387c7e..9926d58e5d41 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -191,18 +191,61 @@ void reset_thread_features(void)
>  	current->thread.features_locked = 0;
>  }
>  
> -unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> -				       unsigned long stack_size)
> +int arch_shstk_validate_clone(struct task_struct *t,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	void *maddr = page_address(page);
> +	unsigned long token;
> +	int offset;
> +	u64 expected;
> +
> +	/*
> +	 * kernel_clone_args() verification assures token address is 8
> +	 * byte aligned.
> +	 */
> +	token = args->shadow_stack_token;
> +	expected = (token + SS_FRAME_SIZE) | BIT(0);
> +	offset = offset_in_page(token);
> +
> +	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
> +				  expected, 0))
> +		return -EINVAL;
> +	set_page_dirty_lock(page);
> +
> +	return 0;
> +}
> +
> +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> +				       const struct kernel_clone_args *args)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	unsigned long clone_flags = args->flags;
>  	unsigned long addr, size;
>  
>  	/*
>  	 * If shadow stack is not enabled on the new thread, skip any
> -	 * switch to a new shadow stack.
> +	 * implicit switch to a new shadow stack and reject attempts to
> +	 * explicitly specify one.
>  	 */
> -	if (!features_enabled(ARCH_SHSTK_SHSTK))
> +	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
> +		if (args->shadow_stack_token)
> +			return (unsigned long)ERR_PTR(-EINVAL);
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a shadow stack then use it, otherwise
> +	 * fall back to a default allocation strategy. Validation is
> +	 * done in arch_shstk_validate_clone().
> +	 */
> +	if (args->shadow_stack_token) {
> +		shstk->base = 0;
> +		shstk->size = 0;
> +		return args->shadow_stack_token + 8;
> +	}
>  
>  	/*
>  	 * For CLONE_VFORK the child will share the parents shadow stack.
> @@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
>  	if (!(clone_flags & CLONE_VM))
>  		return 0;
>  
> -	size = adjust_shstk_size(stack_size);
> +	size = adjust_shstk_size(args->stack_size);
>  	addr = alloc_shstk(0, size, 0, false);
>  	if (IS_ERR_VALUE(addr))
>  		return addr;
> diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
> index 7ee8a179d103..96cc0c7a5c90 100644
> --- a/include/asm-generic/cacheflush.h
> +++ b/include/asm-generic/cacheflush.h
> @@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
>  	} while (0)
>  #endif
>  
> +#ifndef cmpxchg_to_user_page
> +#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
> +({							  \
> +	bool ret;						  \
> +								  \
> +	ret = try_cmpxchg(ptr, &old, new);			  \
> +	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
> +	ret;							  \
> +})
> +#endif
> +
>  #endif /* _ASM_GENERIC_CACHEFLUSH_H */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ea41795a352b..b501f752fc9a 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -16,6 +16,7 @@ struct task_struct;
>  struct rusage;
>  union thread_union;
>  struct css_set;
> +struct vm_area_struct;
>  
>  /* All the bits taken by the old clone syscall. */
>  #define CLONE_LEGACY_FLAGS 0xffffffffULL
> @@ -44,6 +45,7 @@ struct kernel_clone_args {
>  	struct cgroup *cgrp;
>  	struct css_set *cset;
>  	unsigned int kill_seq;
> +	unsigned long shadow_stack_token;
>  };
>  
>  /*
> @@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
>  
>  DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
>  
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> +int arch_shstk_validate_clone(struct task_struct *p,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args);
> +#else
> +static inline int arch_shstk_validate_clone(struct task_struct *p,
> +					    struct vm_area_struct *vma,
> +					    struct page *page,
> +					    struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _LINUX_SCHED_TASK_H */
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 359a14cc76a4..9cf5c419e109 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -84,6 +84,7 @@
>   *                kernel's limit of nested PID namespaces.
>   * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
>   *                a file descriptor for the cgroup.
> + * @shadow_stack_token: Pointer to shadow stack token at top of stack.
>   *
>   * The structure is versioned by size and thus extensible.
>   * New struct members must go at the end of the struct and
> @@ -101,12 +102,14 @@ struct clone_args {
>  	__aligned_u64 set_tid;
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
> +	__aligned_u64 shadow_stack_token;
>  };
>  #endif
>  
> -#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> -#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> -#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
> +#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
> +#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
>  
>  /*
>   * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index af673856499d..d484ebeded33 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
>  	return true;
>  }
>  
> +static int shstk_validate_clone(struct task_struct *p,
> +				struct kernel_clone_args *args)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	unsigned long addr;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +		return 0;
> +
> +	if (!args->shadow_stack_token)
> +		return 0;
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return -EFAULT;
> +
> +	mmap_read_lock(mm);
> +
> +	addr = untagged_addr_remote(mm, args->shadow_stack_token);
> +	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
> +					&vma);
> +	if (IS_ERR(page)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
> +	    !(vma->vm_flags & VM_WRITE)) {
> +		ret = -EFAULT;
> +		goto out_page;
> +	}
> +
> +	ret = arch_shstk_validate_clone(p, vma, page, args);
> +
> +out_page:
> +	put_page(page);
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	return ret;
> +}
> +
>  /*
>   * This creates a new process as a copy of the old one,
>   * but does not actually start it yet.
> @@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
>  	retval = copy_thread(p, args);
> +	if (retval)
> +		goto bad_fork_cleanup_io;
> +	retval = shstk_validate_clone(p, args);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>  
> @@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		     CLONE_ARGS_SIZE_VER1);
>  	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
>  		     CLONE_ARGS_SIZE_VER2);
> -	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
> +	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_token) !=
> +		     CLONE_ARGS_SIZE_VER3);
> +	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
>  
>  	if (unlikely(usize > PAGE_SIZE))
>  		return -E2BIG;
> @@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		return -EINVAL;
>  
>  	*kargs = (struct kernel_clone_args){
> -		.flags		= args.flags,
> -		.pidfd		= u64_to_user_ptr(args.pidfd),
> -		.child_tid	= u64_to_user_ptr(args.child_tid),
> -		.parent_tid	= u64_to_user_ptr(args.parent_tid),
> -		.exit_signal	= args.exit_signal,
> -		.stack		= args.stack,
> -		.stack_size	= args.stack_size,
> -		.tls		= args.tls,
> -		.set_tid_size	= args.set_tid_size,
> -		.cgroup		= args.cgroup,
> +		.flags			= args.flags,
> +		.pidfd			= u64_to_user_ptr(args.pidfd),
> +		.child_tid		= u64_to_user_ptr(args.child_tid),
> +		.parent_tid		= u64_to_user_ptr(args.parent_tid),
> +		.exit_signal		= args.exit_signal,
> +		.stack			= args.stack,
> +		.stack_size		= args.stack_size,
> +		.tls			= args.tls,
> +		.set_tid_size		= args.set_tid_size,
> +		.cgroup			= args.cgroup,
> +		.shadow_stack_token	= args.shadow_stack_token,

I'm not sure why this has to be named "shadow_stack_token" I think
that's just confusing and we should just call it "shadow_stack" and be
done with it. It's also a bit long of a field name imho.

I have a kernel-6.18.clone3 branch
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=kernel-6.18.clone3
because there's another cross-arch cleanup that cleans up copy_thread(),
copy_sighand(), and copy_process() and - surprisingly - also adds
clone3() support for nios2...

Anyway, if you just want me to slap it on top of that branch then I can
simply rename while applying so no need to resend in that case.

>  	};
>  
>  	if (args.set_tid &&
> @@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
>  	return true;
>  }
>  
> +/**
> + * clone3_shadow_stack_valid - check and prepare shadow stack
> + * @kargs: kernel clone args
> + *
> + * Verify that shadow stacks are only enabled if supported.
> + */
> +static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (!kargs->shadow_stack_token)
> +		return true;
> +
> +	if (!IS_ALIGNED(kargs->shadow_stack_token, sizeof(void *)))
> +		return false;
> +
> +	/* Fail if the kernel wasn't built with shadow stacks */
> +	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
> +}
> +
>  static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  {
>  	/* Verify that no unknown flags are passed along. */
> @@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  	    kargs->exit_signal)
>  		return false;
>  
> -	if (!clone3_stack_valid(kargs))
> +	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
>  		return false;
>  
>  	return true;
> 
> -- 
> 2.39.5
> 

^ permalink raw reply

* Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Aleksa Sarai @ 2025-09-05 14:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250902-gehofft-ruheraum-3c286b25b6d3@brauner>

[-- Attachment #1: Type: text/plain, Size: 8585 bytes --]

On 2025-09-02, Christian Brauner <brauner@kernel.org> wrote:
> On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> > Ever since the introduction of pid namespaces, procfs has had very
> > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > auto-selected based on the mounting process's active pidns, and the
> > pidns itself is basically hidden once the mount has been constructed).
> > 
> > /* pidns mount option for procfs */
> > 
> > This implicit behaviour has historically meant that userspace was
> > required to do some special dances in order to configure the pidns of a
> > procfs mount as desired. Examples include:
> > 
> >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> >    a procfs mount from an empty pidns so that user namespaced containers
> >    can be nested (without this, the nested containers would fail to
> >    mount procfs). But this requires forking off a helper process because
> >    you cannot just one-shot this using mount(2).
> > 
> >  * Container runtimes in general need to fork into a container before
> >    configuring its mounts, which can lead to security issues in the case
> >    of shared-pidns containers (a privileged process in the pidns can
> >    interact with your container runtime process). While
> >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> > 
> > Things would be much easier if there was a way for userspace to just
> > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > which can be set using fsconfig(2):
> > 
> >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > 
> > or classic mount(2) / mount(8):
> > 
> >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > 
> > The initial security model I have in this RFC is to be as conservative
> > as possible and just mirror the security model for setns(2) -- which
> > means that you can only set pidns=... to pid namespaces that your
> > current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
> > privileges over the pid namespace. This fulfils the requirements of
> > container runtimes, but I suspect that this may be too strict for some
> > usecases.
> > 
> > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > what value it would make sense to show (maybe we could just use ns_dname
> > to provide an identifier for the namespace, but this number would be
> > fairly useless to userspace). I'm open to suggestions. Note that
> > PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
> > information about this outside of mountinfo.
> > 
> > Note that you cannot change the pidns of an already-created procfs
> > instance. The primary reason is that allowing this to be changed would
> > require RCU-protecting proc_pid_ns(sb) and thus auditing all of
> > fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
> > the pid namespace. Since creating procfs instances is very cheap, it
> > seems unnecessary to overcomplicate this upfront. Trying to reconfigure
> > procfs this way errors out with -EBUSY.
> > 
> > /* ioctl(PROCFS_GET_PID_NAMESPACE) */
> > 
> > In addition, being able to figure out what pid namespace is being used
> > by a procfs mount is quite useful when you have an administrative
> > process (such as a container runtime) which wants to figure out the
> > correct way of mapping PIDs between its own namespace and the namespace
> > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > alternative ways to do this, but they all rely on ancillary information
> > that third-party libraries and tools do not necessarily have access to.
> > 
> > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > can be used to get a reference to the pidns that a procfs is using.
> > 
> > Rather than copying the (fairly strict) security model for setns(2),
> > apply a slightly looser model to better match what userspace can already
> > do:
> > 
> >  * Make the ioctl only valid on the root (meaning that a process without
> >    access to the procfs root -- such as only having an fd to a procfs
> >    file or some open_tree(2)-like subset -- cannot use this API). This
> >    means that the process already has some level of access to the
> >    /proc/$pid directories.
> > 
> >  * If the calling process is in an ancestor pidns, then they can already
> >    create pidfd for processes inside the pidns, which is morally
> >    equivalent to a pidns file descriptor according to setns(2). So it
> >    seems reasonable to just allow it in this case. (The justification
> >    for this model was suggested by Christian.)
> > 
> >  * If the process has access to /proc/1/ns/pid already (i.e. has
> >    ptrace-read access to the pidns pid1), then this ioctl is equivalent
> >    to just opening a handle to it that way.
> > 
> >    Ideally we would check for ptrace-read access against all processes
> >    in the pidns (which is very likely to be true for at least one
> >    process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
> >    by most programs), but this would obviously not scale.
> > 
> > I'm open to suggestions for whether we need to make this stricter (or
> > possibly allow more cases).
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Thanks for the patchset. Being able to specify what pid namespace the
> procfs instance is supposed to belong to is super useful and will make
> things easier for userspace for sure.

I was going to send a new version changing the whole thing to be struct
path based (and adding FSCONFIG_SET_PATH{,_EMPTY} support) so we don't
need to allocate a file explicitly for the non-FSCONFIG_SET_FD case, but
we can do that as a follow-up I guess.

> The code you added contains a minor wrinkle that I disliked which I've
> changed and you tell me if you can live with this restriction or not.
> 
> The way you've implemented it specifying a pid namespace that the caller
> holds privilege over would silently also override the user namespace the
> filesystem is supposed to belong to.
> 
> Specifically, you did something like:
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
>         put_user_ns(fc->user_ns);
>         fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> 
> This silently overrides the user namespace recorded at fsopen() time. I
> think that's too subtle and we should just not allow that at all for
> now.
> 
> Instead I've changed this to:
> 
>         if (fc->user_ns != target->user_ns)
>                 return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
> 
> so we just refuse different owernship.

That sounds fine, I wasn't quite sure what to do with fc->user_ns to be
honest. Being more conservative is probably the right call here.

> I've also dropped the procfs ioctl because I'm not sure how much value
> it will actually add given that you can do this via /proc/1/ns/pid.
> 
> If that is something that libpathrs despearately needs I would like to
> do it as a separate patch anyways.

The main issues are:

1. pid1 can often be non-dumpable, which can block you from doing that.
   In principle, because the dumpable flag is reset on execve, it is
   theoretically possible to get access to /proc/$pid/ns/pid if you win
   the race in a pid namespace with lots of process activity, but this
   kind of sucks.

2. This approach doesn't work for empty pid namesapces.
   pidns_for_children doesn't let you get a handle to an empty pid
   namespace either (I briefly looked at the history and it seems this
   was silently changed in v2 of the patchset based on some feedback
   that I'm not sure was entirely correct).

3. Now that you can configure the procfs mount, it seems like a
   half-baked interface to not provide diagnostic information about the
   namespace. (I suspect the criu folks would be happy to have this too
   ;).)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Florian Weimer @ 2025-09-05  9:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Aleksa Sarai, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api, Kees Cook, Randy Dunlap
In-Reply-To: <CAOQ4uxjcLDUcfdp72cpQcDQEtZaaR4G+P8oPXL_HbotFirGrKQ@mail.gmail.com>

* Amir Goldstein:

>> If it's too much effort to synchronise them between glibc then it's
>> better to just close the book on this whole chapter (even though my
>> impression is that glibc made a mistake or two when adding the
>> definitions).
>
> Considering that glibc has this fix lined up:
> https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/
>
> Do we need to do anything at all?
>
> Florian,
>
> I am not that familiar with packaging and distributions of glibc
> headers and kernel headers to downstream users.

I don't think kernel changes are necessary or desirable at this point.
The glibc change went into glibc 2.42 only, and at this point in time,
all distributions shipping 2.42 (few of them do) are pretty much
guaranteed to pick up fixes from the 2.42 stable release branch
regularly.  So if we get this into glibc 2.43 and backport it to 2.42,
the problem should disappear quite soon from a developer's perspective.

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Amir Goldstein @ 2025-09-05  9:17 UTC (permalink / raw)
  To: Aleksa Sarai, Florian Weimer
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Jan Kara, Christian Brauner, Matthew Wilcox,
	David Howells, linux-api, Kees Cook, Randy Dunlap
In-Reply-To: <2025-09-05-armless-uneaten-venture-denizen-HnoIhR@cyphar.com>

On Fri, Sep 5, 2025 at 7:11 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2025-09-04, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Thu, Sep 4, 2025 at 8:22 AM Randy Dunlap <rdunlap@infradead.org> wrote:
> > >
> > > Don't define the AT_RENAME_* macros at all since the kernel does not
> > > use them nor does the kernel need to provide them for userspace.
> > > Leave them as comments in <uapi/linux/fcntl.h> only as an example.
> > >
> > > The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
> > > For a kernel allmodconfig build, this made the macros be defined
> > > differently in 2 places (same values but different macro text),
> > > causing build errors/warnings (duplicate definitions) in both
> > > samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
> > > (<linux/fcntl.h> is included indirecty in both programs above.)
> > >
> > > Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> > > Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> > > ---
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Jeff Layton <jlayton@kernel.org>
> > > Cc: Chuck Lever <chuck.lever@oracle.com>
> > > Cc: Alexander Aring <alex.aring@gmail.com>
> > > Cc: Josef Bacik <josef@toxicpanda.com>
> > > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: David Howells <dhowells@redhat.com>
> > > CC: linux-api@vger.kernel.org
> > > To: linux-fsdevel@vger.kernel.org
> > > ---
> > >  include/uapi/linux/fcntl.h |    6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> > > +++ linux-next-20250819/include/uapi/linux/fcntl.h
> > > @@ -155,10 +155,16 @@
> > >   * as possible, so we can use them for generic bits in the future if necessary.
> > >   */
> > >
> > > +/*
> > > + * Note: This is an example of how the AT_RENAME_* flags could be defined,
> > > + * but the kernel has no need to define them, so leave them as comments.
> > > + */
> > >  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> > > +/*
> > >  #define AT_RENAME_NOREPLACE    0x0001
> > >  #define AT_RENAME_EXCHANGE     0x0002
> > >  #define AT_RENAME_WHITEOUT     0x0004
> > > +*/
> > >
> >
> > I find this end result a bit odd, but I don't want to suggest another variant
> > I already proposed one in v2 review [1] that maybe you did not like.
> > It's fine.
> > I'll let Aleksa and Christian chime in to decide on if and how they want this
> > comment to look or if we should just delete these definitions and be done with
> > this episode.
>
> For my part, I'm fine with these becoming comments or even removing them
> outright. I think that defining them as AT_* flags would've been useful
> examples of how these flags should be used, but it is what it is.
>
> Then again, AT_EXECVE_CHECK went in and used a higher-level bit despite
> the comments describing that this was unfavourable and what should be
> done instead, so maybe attempting to avoid conflicts is an exercise in
> futility...

That's a bummer :-/
but to be fair, AT_EXECVE_CHECK was merged after v23, so I guess
the patch set started way before this comment and got rebased
after the comment was added, so it was easier to miss it.

>
> If it's too much effort to synchronise them between glibc then it's
> better to just close the book on this whole chapter (even though my
> impression is that glibc made a mistake or two when adding the
> definitions).

Considering that glibc has this fix lined up:
https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/

Do we need to do anything at all?

Florian,

I am not that familiar with packaging and distributions of glibc
headers and kernel headers to downstream users.

What are the chances that us removing these definitions from the
current kernel header is going to help any downstream user in the future?

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Randy Dunlap @ 2025-09-05  7:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Amir Goldstein, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api
In-Reply-To: <lhu7bydv01m.fsf@oldenburg.str.redhat.com>

Hi,

On 9/5/25 12:19 AM, Florian Weimer wrote:
> * Randy Dunlap:
> 
>> On 9/4/25 11:49 AM, Florian Weimer wrote:
>>> * Amir Goldstein:
>>>
>>>> I find this end result a bit odd, but I don't want to suggest another variant
>>>> I already proposed one in v2 review [1] that maybe you did not like.
>>>> It's fine.
>>>> I'll let Aleksa and Christian chime in to decide on if and how they want this
>>>> comment to look or if we should just delete these definitions and be done with
>>>> this episode.
>>>
>>> We should fix the definition in glibc to be identical token-wise to the
>>> kernel's.
>>
>> That's probably a good suggestion...
>> while I tried the reverse of that and Amir opposed.
> 
> It's certainly odd that the kernel uses different token sequences for
> defining AT_RENAME_* and RENAME_*.  But it's probably too late to fix
> that.
> 
> Here's the glibc patch:
> 
>   [PATCH] libio: Define AT_RENAME_* with the same tokens as Linux
>   <https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/T/#u>


Thanks!

-- 
~Randy


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox