[PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy
@ 2023-07-02 19:38 Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 02/16] coredump: require O_WRONLY instead of O_RDWR Sasha Levin
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Azeem Shaikh, Kees Cook, Christian Brauner, Sasha Levin, viro,
	linux-fsdevel

From: Azeem Shaikh <azeemshaikh38@gmail.com>

[ Upstream commit c642256b91770e201519d037a91f255a617a4602 ]

strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-Id: <20230510221119.3508930-1-azeemshaikh38@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/char_dev.c | 2 +-
 fs/super.c    | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/char_dev.c b/fs/char_dev.c
index 13deb45f1ec65..950b6919fb872 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -150,7 +150,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
 	cd->major = major;
 	cd->baseminor = baseminor;
 	cd->minorct = minorct;
-	strlcpy(cd->name, name, sizeof(cd->name));
+	strscpy(cd->name, name, sizeof(cd->name));
 
 	if (!prev) {
 		cd->next = curr;
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7dfea..09668ddfbbd55 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -595,7 +595,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	fc->s_fs_info = NULL;
 	s->s_type = fc->fs_type;
 	s->s_iflags |= fc->s_iflags;
-	strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+	strscpy(s->s_id, s->s_type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
 	spin_unlock(&sb_lock);
@@ -674,7 +674,7 @@ struct super_block *sget(struct file_system_type *type,
 		return ERR_PTR(err);
 	}
 	s->s_type = type;
-	strlcpy(s->s_id, type->name, sizeof(s->s_id));
+	strscpy(s->s_id, type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 02/16] coredump: require O_WRONLY instead of O_RDWR
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 03/16] fs: use UB-safe check for signed addition overflow in remap_verify_area Sasha Levin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Vladimir Sementsov-Ogievskiy, Linus Torvalds, Christian Brauner,
	Sasha Levin, viro, linux-fsdevel

From: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

[ Upstream commit 88e4607034ee49e09e32d91d083dced5c2f4f127 ]

The motivation for this patch has been to enable using a stricter
apparmor profile to prevent programs from reading any coredump in the
system.

However, this became something else. The following details are based on
Christian's and Linus' archeology into the history of the number "2" in
the coredump handling code.

To make sure we're not accidently introducing some subtle behavioral
change into the coredump code we set out on a voyage into the depths of
history.git to figure out why this was O_RDWR in the first place.

Coredump handling was introduced over 30 years ago in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)").
The original code used O_WRONLY:

    open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)

However, this changed in 1993 and starting with commit
9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code
suddenly used the constant "2":

    open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)

This was curious as in the same commit the kernel switched from
constants to proper defines in other places such as KERNEL_DS and
USER_DS and O_RDWR did already exist.

So why was "2" used? It turns out that open_namei() - an early version
of what later turned into filp_open() - didn't accept O_RDWR.

A semantic quirk of the open() uapi is the definition of the O_RDONLY
flag. It would seem natural to define:

    #define O_RDWR (O_RDONLY | O_WRONLY)

but that isn't possible because:

    #define O_RDONLY 0

This makes O_RDONLY effectively meaningless when passed to the kernel.
In other words, there has never been a way - until O_PATH at least - to
open a file without any permission; O_RDONLY was always implied on the
uapi side while the kernel does in fact allow opening files without
permissions.

The trouble comes when trying to map the uapi flags onto the
corresponding file mode flags FMODE_{READ,WRITE}. This mapping still
happens today and is causing issues to this day (We ran into this
during additions for openat2() for example.).

So the special value "3" was used to indicate that the file was opened
for special access:

    f->f_flags = flag = flags;
    f->f_mode = (flag+1) & O_ACCMODE;
    if (f->f_mode)
            flag++;

This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping
the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The
special access then required read-write permissions and 0 was used to
access symlinks.

But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added
coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an
argument. So the coredump handling introduced in
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because
O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but
open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on
accident.

So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix
for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which
was correct.

The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2"
value was passed.

Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code
was changed to:

    -       dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
    ...
    +       file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);

At this point the raw "2" should have become O_WRONLY again as
filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.

Another 17 years later, the code was changed again cementing the mistake
and making it almost impossible to detect when commit
378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories")
replaced the raw "2" with O_RDWR.

And now, here we are with this patch that sent us on a quest to answer
the big questions in life such as "Why are coredump files opened with
O_RDWR?" and "Is it safe to just use O_WRONLY?".

So with this commit we're reintroducing O_WRONLY again and bringing this
code back to its original state when it was first introduced in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru>
[brauner@kernel.org: completely rewritten commit message]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 88740c51b9420..9d235fa14ab98 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -648,7 +648,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	} else {
 		struct mnt_idmap *idmap;
 		struct inode *inode;
-		int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
+		int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
 				 O_LARGEFILE | O_EXCL;

 		if (cprm.limit < binfmt->min_coredump)
-- 
2.39.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 03/16] fs: use UB-safe check for signed addition overflow in remap_verify_area
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 02/16] coredump: require O_WRONLY instead of O_RDWR Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 07/16] fs: Protect reconfiguration of sb read-write from racing writes Sasha Levin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: David Sterba, Christian Brauner, Sasha Levin, viro, linux-fsdevel

From: David Sterba <dsterba@suse.com>

[ Upstream commit b7a9a503c38d665c05a789132b632d81ec0b2703 ]

The following warning pops up with enabled UBSAN in tests fstests/generic/303:

  [23127.529395] UBSAN: Undefined behaviour in fs/read_write.c:1725:7
  [23127.529400] signed integer overflow:
  [23127.529403] 4611686018427322368 + 9223372036854775807 cannot be represented in type 'long long int'
  [23127.529412] CPU: 4 PID: 26180 Comm: xfs_io Not tainted 5.2.0-rc2-1.ge195904-vanilla+ #450
  [23127.556999] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
  [23127.557001] Call Trace:
  [23127.557060]  dump_stack+0x67/0x9b
  [23127.557070]  ubsan_epilogue+0x9/0x40
  [23127.573496]  handle_overflow+0xb3/0xc0
  [23127.573514]  do_clone_file_range+0x28f/0x2a0
  [23127.573547]  vfs_clone_file_range+0x35/0xb0
  [23127.573564]  ioctl_file_clone+0x8d/0xc0
  [23127.590144]  do_vfs_ioctl+0x300/0x700
  [23127.590160]  ksys_ioctl+0x70/0x80
  [23127.590203]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [23127.590210]  __x64_sys_ioctl+0x16/0x20
  [23127.590215]  do_syscall_64+0x5c/0x1d0
  [23127.590224]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [23127.590231] RIP: 0033:0x7ff6d7250327
  [23127.590241] RSP: 002b:00007ffe3a38f1d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [23127.590246] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ff6d7250327
  [23127.590249] RDX: 00007ffe3a38f220 RSI: 000000004020940d RDI: 0000000000000003
  [23127.590252] RBP: 0000000000000000 R08: 00007ffe3a3c80a0 R09: 00007ffe3a3c8080
  [23127.590255] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 0000000000000000
  [23127.590260] R13: 0000000000000000 R14: 3fffffffffff0000 R15: 00007ff6d750a20c

As loff_t is a signed type, we should use the safe overflow checks
instead of relying on compiler implementation.

The bogus values are intentional and the test is supposed to verify the
boundary conditions.

Signed-off-by: David Sterba <dsterba@suse.com>
Message-Id: <20230523162628.17071-1-dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/remap_range.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/remap_range.c b/fs/remap_range.c
index 1331a890f2f29..87ae4f0dc3aa0 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/fs.h>
 #include <linux/dax.h>
+#include <linux/overflow.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -101,10 +102,12 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 			     bool write)
 {
+	loff_t tmp;
+
 	if (unlikely(pos < 0 || len < 0))
 		return -EINVAL;
 
-	if (unlikely((loff_t) (pos + len) < 0))
+	if (unlikely(check_add_overflow(pos, len, &tmp)))
 		return -EINVAL;
 
 	return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 07/16] fs: Protect reconfiguration of sb read-write from racing writes
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 02/16] coredump: require O_WRONLY instead of O_RDWR Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 03/16] fs: use UB-safe check for signed addition overflow in remap_verify_area Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 11/16] fs: use correct __poll_t type Sasha Levin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Jan Kara, Christian Brauner, Sasha Levin, viro, linux-fsdevel

From: Jan Kara <jack@suse.cz>

[ Upstream commit c541dce86c537714b6761a79a969c1623dfa222b ]

The reconfigure / remount code takes a lot of effort to protect
filesystem's reconfiguration code from racing writes on remounting
read-only. However during remounting read-only filesystem to read-write
mode userspace writes can start immediately once we clear SB_RDONLY
flag. This is inconvenient for example for ext4 because we need to do
some writes to the filesystem (such as preparation of quota files)
before we can take userspace writes so we are clearing SB_RDONLY flag
before we are fully ready to accept userpace writes and syzbot has found
a way to exploit this [1]. Also as far as I'm reading the code
the filesystem remount code was protected from racing writes in the
legacy mount path by the mount's MNT_READONLY flag so this is relatively
new problem. It is actually fairly easy to protect remount read-write
from racing writes using sb->s_readonly_remount flag so let's just do
that instead of having to workaround these races in the filesystem code.

[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/

Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230615113848.8439-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/super.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 09668ddfbbd55..860d7a4b14c7c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc)
 	struct super_block *sb = fc->root->d_sb;
 	int retval;
 	bool remount_ro = false;
+	bool remount_rw = false;
 	bool force = fc->sb_flags & SB_FORCE;
 
 	if (fc->sb_flags_mask & ~MS_RMT_MASK)
@@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc)
 		    bdev_read_only(sb->s_bdev))
 			return -EACCES;
 #endif
-
+		remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb);
 		remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
 	}
 
@@ -950,6 +951,14 @@ int reconfigure_super(struct fs_context *fc)
 			if (retval)
 				return retval;
 		}
+	} else if (remount_rw) {
+		/*
+		 * We set s_readonly_remount here to protect filesystem's
+		 * reconfigure code from writes from userspace until
+		 * reconfigure finishes.
+		 */
+		sb->s_readonly_remount = 1;
+		smp_wmb();
 	}
 
 	if (fc->ops->reconfigure) {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 11/16] fs: use correct __poll_t type
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
                   ` (2 preceding siblings ...)
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 07/16] fs: Protect reconfiguration of sb read-write from racing writes Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 12/16] fs.h: Optimize file struct to prevent false sharing Sasha Levin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Min-Hua Chen, Christian Brauner, Sasha Levin, viro, axboe, willy,
	wenyang.linux, dylany, zhangqilong3, linux-fsdevel

From: Min-Hua Chen <minhuadotchen@gmail.com>

[ Upstream commit 38f1755a3e59a3f88e33030f8e4ee0421de2f05a ]

Fix the following sparse warnings by using __poll_t instead
of unsigned type.

fs/eventpoll.c:541:9: sparse: warning: restricted __poll_t degrades to integer
fs/eventfd.c:67:17: sparse: warning: restricted __poll_t degrades to integer

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Message-Id: <20230511164628.336586-1-minhuadotchen@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/eventfd.c            | 2 +-
 fs/eventpoll.c          | 2 +-
 include/linux/eventfd.h | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 95850a13ce8d0..6c06a527747f2 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -43,7 +43,7 @@ struct eventfd_ctx {
 	int id;
 };
 
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask)
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
 {
 	unsigned long flags;
 
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 266d45c7685b4..4b1b3362f697b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -536,7 +536,7 @@ static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi,
 #else
 
 static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi,
-			     unsigned pollflags)
+			     __poll_t pollflags)
 {
 	wake_up_poll(&ep->poll_wait, EPOLLIN | pollflags);
 }
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 36a486505b081..98d31cdaca400 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -40,7 +40,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask);
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 12/16] fs.h: Optimize file struct to prevent false sharing
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
                   ` (3 preceding siblings ...)
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 11/16] fs: use correct __poll_t type Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 13/16] fs: unexport buffer_check_dirty_writeback Sasha Levin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: chenzhiyin, Christian Brauner, Sasha Levin, viro, linux-fsdevel

From: chenzhiyin <zhiyin.chen@intel.com>

[ Upstream commit a7bc2e8ddf3c8e1f5bfeb401f7ee112956cea259 ]

In the syscall test of UnixBench, performance regression occurred due
to false sharing.

The lock and atomic members, including file::f_lock, file::f_count and
file::f_pos_lock are highly contended and frequently updated in the
high-concurrency test scenarios. perf c2c indentified one affected
read access, file::f_op.
To prevent false sharing, the layout of file struct is changed as
following
(A) f_lock, f_count and f_pos_lock are put together to share the same
cache line.
(B) The read mostly members, including f_path, f_inode, f_op are put
into a separate cache line.
(C) f_mode is put together with f_count, since they are used frequently
 at the same time.
Due to '__randomize_layout' attribute of file struct, the updated layout
only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.

The optimization has been validated in the syscall test of UnixBench.
performance gain is 30~50%. Furthermore, to confirm the optimization
effectiveness on the other codes path, the results of fsdisk, fsbuffer
and fstime are also shown.

Here are the detailed test results of unixbench.

Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk

Without Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks   875052.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks     235484.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks  2815153.5 KBps  (30.0 s, 2 samples)
System Call Overhead                   5772268.3 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index         BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks     3960.0     875052.1   2209.7
File Copy 256 bufsize 500 maxblocks       1655.0     235484.0   1422.9
File Copy 4096 bufsize 8000 maxblocks     5800.0    2815153.5   4853.7
System Call Overhead                     15000.0    5772268.3   3848.2
                                                              ========
System Benchmarks Index Score (Partial Only)                    2768.3

With Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks  1009977.2 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks     264765.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks  3052236.0 KBps  (30.0 s, 2 samples)
System Call Overhead                   8237404.4 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index         BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks     3960.0    1009977.2   2550.4
File Copy 256 bufsize 500 maxblocks       1655.0     264765.9   1599.8
File Copy 4096 bufsize 8000 maxblocks     5800.0    3052236.0   5262.5
System Call Overhead                     15000.0    8237404.4   5491.6
                                                              ========
System Benchmarks Index Score (Partial Only)                    3295.3

Signed-off-by: chenzhiyin <zhiyin.chen@intel.com>
Message-Id: <20230601092400.27162-1-zhiyin.chen@intel.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/fs.h | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 133f0640fb241..6f96f99ab9511 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -956,29 +956,35 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 		index <  ra->start + ra->size);
 }
 
+/*
+ * f_{lock,count,pos_lock} members can be highly contended and share
+ * the same cacheline. f_{lock,mode} are very frequently used together
+ * and so share the same cacheline as well. The read-mostly
+ * f_{path,inode,op} are kept on a separate cacheline.
+ */
 struct file {
 	union {
 		struct llist_node	f_llist;
 		struct rcu_head 	f_rcuhead;
 		unsigned int 		f_iocb_flags;
 	};
-	struct path		f_path;
-	struct inode		*f_inode;	/* cached value */
-	const struct file_operations	*f_op;
 
 	/*
 	 * Protects f_ep, f_flags.
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
-	atomic_long_t		f_count;
-	unsigned int 		f_flags;
 	fmode_t			f_mode;
+	atomic_long_t		f_count;
 	struct mutex		f_pos_lock;
 	loff_t			f_pos;
+	unsigned int		f_flags;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
+	struct path		f_path;
+	struct inode		*f_inode;	/* cached value */
+	const struct file_operations	*f_op;
 
 	u64			f_version;
 #ifdef CONFIG_SECURITY
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 13/16] fs: unexport buffer_check_dirty_writeback
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
                   ` (4 preceding siblings ...)
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 12/16] fs.h: Optimize file struct to prevent false sharing Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 14/16] eventfd: show the EFD_SEMAPHORE flag in fdinfo Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 16/16] fs: Provide helpers for manipulating sb->s_readonly_remount Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Christoph Hellwig, Christian Brauner, Sasha Levin, viro,
	linux-fsdevel

From: Christoph Hellwig <hch@lst.de>

[ Upstream commit 4bb218a65a43782b1e75f5510744cb44795a1105 ]

buffer_check_dirty_writeback is only used by the block device aops,
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230608122958.276954-1-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/buffer.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1a..fe64356e89b8a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -111,7 +111,6 @@ void buffer_check_dirty_writeback(struct folio *folio,
 		bh = bh->b_this_page;
 	} while (bh != head);
 }
-EXPORT_SYMBOL(buffer_check_dirty_writeback);
 
 /*
  * Block until a buffer comes unlocked.  This doesn't stop it
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 14/16] eventfd: show the EFD_SEMAPHORE flag in fdinfo
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
                   ` (5 preceding siblings ...)
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 13/16] fs: unexport buffer_check_dirty_writeback Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 16/16] fs: Provide helpers for manipulating sb->s_readonly_remount Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Wen Yang, Christian Brauner, Alexander Viro, Jens Axboe,
	Christoph Hellwig, Dylan Yudaken, David Woodhouse, Matthew Wilcox,
	Eric Biggers, linux-fsdevel, Sasha Levin

From: Wen Yang <wenyang.linux@foxmail.com>

[ Upstream commit 33d8b5d7824c7175ed968b8e89e6db3566e9c177 ]

The EFD_SEMAPHORE flag should be displayed in fdinfo,
as different value could affect the behavior of eventfd.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Eric Biggers <ebiggers@google.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <tencent_05B9CFEFE6B9BC2A9B3A27886A122A7D9205@qq.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/eventfd.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 6c06a527747f2..8aa36cd373516 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -33,10 +33,10 @@ struct eventfd_ctx {
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
 	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * wakeup is performed on "wqh". If EFD_SEMAPHORE flag was not
+	 * specified, a read(2) will return the "count" value to userspace,
+	 * and will reset "count" to zero. The kernel side eventfd_signal()
+	 * also, adds to the "count" counter and issue a wakeup.
 	 */
 	__u64 count;
 	unsigned int flags;
@@ -301,6 +301,8 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 		   (unsigned long long)ctx->count);
 	spin_unlock_irq(&ctx->wqh.lock);
 	seq_printf(m, "eventfd-id: %d\n", ctx->id);
+	seq_printf(m, "eventfd-semaphore: %d\n",
+		   !!(ctx->flags & EFD_SEMAPHORE));
 }
 #endif
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 6.4 16/16] fs: Provide helpers for manipulating sb->s_readonly_remount
  2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
                   ` (6 preceding siblings ...)
  2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 14/16] eventfd: show the EFD_SEMAPHORE flag in fdinfo Sasha Levin
@ 2023-07-02 19:38 ` Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2023-07-02 19:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Jan Kara, Dave Chinner, Dave Chinner, Christian Brauner,
	Sasha Levin, viro, linux-fsdevel

From: Jan Kara <jack@suse.cz>

[ Upstream commit d7439fb1f4338fffd0bc68bb62d78f7712725f26 ]

Provide helpers to set and clear sb->s_readonly_remount including
appropriate memory barriers. Also use this opportunity to document what
the barriers pair with and why they are needed.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Message-Id: <20230620112832.5158-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/internal.h      | 41 +++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c     | 25 ++++++++++++++++---------
 fs/super.c         | 17 ++++++-----------
 include/linux/fs.h |  2 +-
 4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index bd3b2810a36b6..b916b84809f36 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -120,6 +120,47 @@ void put_super(struct super_block *sb);
 extern bool mount_capable(struct fs_context *);
 int sb_init_dio_done_wq(struct super_block *sb);
 
+/*
+ * Prepare superblock for changing its read-only state (i.e., either remount
+ * read-write superblock read-only or vice versa). After this function returns
+ * mnt_is_readonly() will return true for any mount of the superblock if its
+ * caller is able to observe any changes done by the remount. This holds until
+ * sb_end_ro_state_change() is called.
+ */
+static inline void sb_start_ro_state_change(struct super_block *sb)
+{
+	WRITE_ONCE(sb->s_readonly_remount, 1);
+	/*
+	 * For RO->RW transition, the barrier pairs with the barrier in
+	 * mnt_is_readonly() making sure if mnt_is_readonly() sees SB_RDONLY
+	 * cleared, it will see s_readonly_remount set.
+	 * For RW->RO transition, the barrier pairs with the barrier in
+	 * __mnt_want_write() before the mnt_is_readonly() check. The barrier
+	 * makes sure if __mnt_want_write() sees MNT_WRITE_HOLD already
+	 * cleared, it will see s_readonly_remount set.
+	 */
+	smp_wmb();
+}
+
+/*
+ * Ends section changing read-only state of the superblock. After this function
+ * returns if mnt_is_readonly() returns false, the caller will be able to
+ * observe all the changes remount did to the superblock.
+ */
+static inline void sb_end_ro_state_change(struct super_block *sb)
+{
+	/*
+	 * This barrier provides release semantics that pairs with
+	 * the smp_rmb() acquire semantics in mnt_is_readonly().
+	 * This barrier pair ensure that when mnt_is_readonly() sees
+	 * 0 for sb->s_readonly_remount, it will also see all the
+	 * preceding flag changes that were made during the RO state
+	 * change.
+	 */
+	smp_wmb();
+	WRITE_ONCE(sb->s_readonly_remount, 0);
+}
+
 /*
  * open.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 54847db5b8195..5ba1eca6f7208 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -309,9 +309,16 @@ static unsigned int mnt_get_writers(struct mount *mnt)
 
 static int mnt_is_readonly(struct vfsmount *mnt)
 {
-	if (mnt->mnt_sb->s_readonly_remount)
+	if (READ_ONCE(mnt->mnt_sb->s_readonly_remount))
 		return 1;
-	/* Order wrt setting s_flags/s_readonly_remount in do_remount() */
+	/*
+	 * The barrier pairs with the barrier in sb_start_ro_state_change()
+	 * making sure if we don't see s_readonly_remount set yet, we also will
+	 * not see any superblock / mount flag changes done by remount.
+	 * It also pairs with the barrier in sb_end_ro_state_change()
+	 * assuring that if we see s_readonly_remount already cleared, we will
+	 * see the values of superblock / mount flags updated by remount.
+	 */
 	smp_rmb();
 	return __mnt_is_readonly(mnt);
 }
@@ -364,9 +371,11 @@ int __mnt_want_write(struct vfsmount *m)
 		}
 	}
 	/*
-	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
-	 * be set to match its requirements. So we must not load that until
-	 * MNT_WRITE_HOLD is cleared.
+	 * The barrier pairs with the barrier sb_start_ro_state_change() making
+	 * sure that if we see MNT_WRITE_HOLD cleared, we will also see
+	 * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in
+	 * mnt_is_readonly() and bail in case we are racing with remount
+	 * read-only.
 	 */
 	smp_rmb();
 	if (mnt_is_readonly(m)) {
@@ -588,10 +597,8 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	if (!err && atomic_long_read(&sb->s_remove_count))
 		err = -EBUSY;
 
-	if (!err) {
-		sb->s_readonly_remount = 1;
-		smp_wmb();
-	}
+	if (!err)
+		sb_start_ro_state_change(sb);
 	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
 		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
 			mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
diff --git a/fs/super.c b/fs/super.c
index 860d7a4b14c7c..48c29954d4875 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -944,8 +944,7 @@ int reconfigure_super(struct fs_context *fc)
 	 */
 	if (remount_ro) {
 		if (force) {
-			sb->s_readonly_remount = 1;
-			smp_wmb();
+			sb_start_ro_state_change(sb);
 		} else {
 			retval = sb_prepare_remount_readonly(sb);
 			if (retval)
@@ -953,12 +952,10 @@ int reconfigure_super(struct fs_context *fc)
 		}
 	} else if (remount_rw) {
 		/*
-		 * We set s_readonly_remount here to protect filesystem's
-		 * reconfigure code from writes from userspace until
-		 * reconfigure finishes.
+		 * Protect filesystem's reconfigure code from writes from
+		 * userspace until reconfigure finishes.
 		 */
-		sb->s_readonly_remount = 1;
-		smp_wmb();
+		sb_start_ro_state_change(sb);
 	}
 
 	if (fc->ops->reconfigure) {
@@ -974,9 +971,7 @@ int reconfigure_super(struct fs_context *fc)
 
 	WRITE_ONCE(sb->s_flags, ((sb->s_flags & ~fc->sb_flags_mask) |
 				 (fc->sb_flags & fc->sb_flags_mask)));
-	/* Needs to be ordered wrt mnt_is_readonly() */
-	smp_wmb();
-	sb->s_readonly_remount = 0;
+	sb_end_ro_state_change(sb);
 
 	/*
 	 * Some filesystems modify their metadata via some other path than the
@@ -991,7 +986,7 @@ int reconfigure_super(struct fs_context *fc)
 	return 0;
 
 cancel_readonly:
-	sb->s_readonly_remount = 0;
+	sb_end_ro_state_change(sb);
 	return retval;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f96f99ab9511..879c000eec397 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1248,7 +1248,7 @@ struct super_block {
 	 */
 	atomic_long_t s_fsnotify_connectors;
 
-	/* Being remounted read-only */
+	/* Read-only state of the superblock is being changed */
 	int s_readonly_remount;
 
 	/* per-sb errseq_t for reporting writeback errors via syncfs */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-07-02 19:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-02 19:38 [PATCH AUTOSEL 6.4 01/16] vfs: Replace all non-returning strlcpy with strscpy Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 02/16] coredump: require O_WRONLY instead of O_RDWR Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 03/16] fs: use UB-safe check for signed addition overflow in remap_verify_area Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 07/16] fs: Protect reconfiguration of sb read-write from racing writes Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 11/16] fs: use correct __poll_t type Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 12/16] fs.h: Optimize file struct to prevent false sharing Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 13/16] fs: unexport buffer_check_dirty_writeback Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 14/16] eventfd: show the EFD_SEMAPHORE flag in fdinfo Sasha Levin
2023-07-02 19:38 ` [PATCH AUTOSEL 6.4 16/16] fs: Provide helpers for manipulating sb->s_readonly_remount Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).