* [PATCH AUTOSEL 6.11 020/244] exec: don't WARN for racy path_noexec check
[not found] <20240925113641.1297102-1-sashal@kernel.org>
@ 2024-09-25 11:24 ` Sasha Levin
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 021/244] fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name Sasha Levin
` (6 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:24 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Mateusz Guzik, Christian Brauner, Sasha Levin, viro, linux-mm,
linux-fsdevel
From: Mateusz Guzik <mjguzik@gmail.com>
[ Upstream commit 0d196e7589cefe207d5d41f37a0a28a1fdeeb7c6 ]
Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact
of the previous implementation. They used to legitimately check for the
condition, but that got moved up in two commits:
633fb6ac3980 ("exec: move S_ISREG() check earlier")
0fd338b2d2cd ("exec: move path_noexec() check earlier")
Instead of being removed said checks are WARN_ON'ed instead, which
has some debug value.
However, the spurious path_noexec check is racy, resulting in
unwarranted warnings should someone race with setting the noexec flag.
One can note there is more to perm-checking whether execve is allowed
and none of the conditions are guaranteed to still hold after they were
tested for.
Additionally this does not validate whether the code path did any perm
checking to begin with -- it will pass if the inode happens to be
regular.
Keep the redundant path_noexec() check even though it's mindless
nonsense checking for guarantee that isn't given so drop the WARN.
Reword the commentary and do small tidy ups while here.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240805131721.765484-1-mjguzik@gmail.com
[brauner: keep redundant path_noexec() check]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/exec.c | 31 ++++++++++++-------------------
1 file changed, 12 insertions(+), 19 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 50e76cc633c4b..caae051c5a956 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -145,13 +145,11 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
goto out;
/*
- * may_open() has already checked for this, so it should be
- * impossible to trip now. But we need to be extra cautious
- * and check again at the very end too.
+ * Check do_open_execat() for an explanation.
*/
error = -EACCES;
- if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
- path_noexec(&file->f_path)))
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) ||
+ path_noexec(&file->f_path))
goto exit;
error = -ENOEXEC;
@@ -954,7 +952,6 @@ EXPORT_SYMBOL(transfer_args_to_stack);
static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
- int err;
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
@@ -971,24 +968,20 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
file = do_filp_open(fd, name, &open_exec_flags);
if (IS_ERR(file))
- goto out;
+ return file;
/*
- * may_open() has already checked for this, so it should be
- * impossible to trip now. But we need to be extra cautious
- * and check again at the very end too.
+ * In the past the regular type check was here. It moved to may_open() in
+ * 633fb6ac3980 ("exec: move S_ISREG() check earlier"). Since then it is
+ * an invariant that all non-regular files error out before we get here.
*/
- err = -EACCES;
- if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
- path_noexec(&file->f_path)))
- goto exit;
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) ||
+ path_noexec(&file->f_path)) {
+ fput(file);
+ return ERR_PTR(-EACCES);
+ }
-out:
return file;
-
-exit:
- fput(file);
- return ERR_PTR(err);
}
/**
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 021/244] fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
[not found] <20240925113641.1297102-1-sashal@kernel.org>
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 020/244] exec: don't WARN for racy path_noexec check Sasha Levin
@ 2024-09-25 11:24 ` Sasha Levin
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 064/244] proc: add config & param to block forcing mem writes Sasha Levin
` (5 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:24 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Li Zhijian, Jan Kara, Christian Brauner, Sasha Levin, viro,
linux-fsdevel
From: Li Zhijian <lizhijian@fujitsu.com>
[ Upstream commit 7f7b850689ac06a62befe26e1fd1806799e7f152 ]
It's observed that a crash occurs during hot-remove a memory device,
in which user is accessing the hugetlb. See calltrace as following:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 14045 at arch/x86/mm/fault.c:1278 do_user_addr_fault+0x2a0/0x790
Modules linked in: kmem device_dax cxl_mem cxl_pmem cxl_port cxl_pci dax_hmem dax_pmem nd_pmem cxl_acpi nd_btt cxl_core crc32c_intel nvme virtiofs fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc s
mirror dm_region_hash dm_log dm_mod
CPU: 1 PID: 14045 Comm: daxctl Not tainted 6.10.0-rc2-lizhijian+ #492
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:do_user_addr_fault+0x2a0/0x790
Code: 48 8b 00 a8 04 0f 84 b5 fe ff ff e9 1c ff ff ff 4c 89 e9 4c 89 e2 be 01 00 00 00 bf 02 00 00 00 e8 b5 ef 24 00 e9 42 fe ff ff <0f> 0b 48 83 c4 08 4c 89 ea 48 89 ee 4c 89 e7 5b 5d 41 5c 41 5d 41
RSP: 0000:ffffc90000a575f0 EFLAGS: 00010046
RAX: ffff88800c303600 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000001000 RSI: ffffffff82504162 RDI: ffffffff824b2c36
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a57658
R13: 0000000000001000 R14: ffff88800bc2e040 R15: 0000000000000000
FS: 00007f51cb57d880(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001000 CR3: 00000000072e2004 CR4: 00000000001706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? __warn+0x8d/0x190
? do_user_addr_fault+0x2a0/0x790
? report_bug+0x1c3/0x1d0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
? do_user_addr_fault+0x2a0/0x790
? exc_page_fault+0x31/0x200
exc_page_fault+0x68/0x200
<...snip...>
BUG: unable to handle page fault for address: 0000000000001000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
---[ end trace 0000000000000000 ]---
BUG: unable to handle page fault for address: 0000000000001000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 14045 Comm: daxctl Kdump: loaded Tainted: G W 6.10.0-rc2-lizhijian+ #492
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dentry_name+0x1f4/0x440
<...snip...>
? dentry_name+0x2fa/0x440
vsnprintf+0x1f3/0x4f0
vprintk_store+0x23a/0x540
vprintk_emit+0x6d/0x330
_printk+0x58/0x80
dump_mapping+0x10b/0x1a0
? __pfx_free_object_rcu+0x10/0x10
__dump_page+0x26b/0x3e0
? vprintk_emit+0xe0/0x330
? _printk+0x58/0x80
? dump_page+0x17/0x50
dump_page+0x17/0x50
do_migrate_range+0x2f7/0x7f0
? do_migrate_range+0x42/0x7f0
? offline_pages+0x2f4/0x8c0
offline_pages+0x60a/0x8c0
memory_subsys_offline+0x9f/0x1c0
? lockdep_hardirqs_on+0x77/0x100
? _raw_spin_unlock_irqrestore+0x38/0x60
device_offline+0xe3/0x110
state_store+0x6e/0xc0
kernfs_fop_write_iter+0x143/0x200
vfs_write+0x39f/0x560
ksys_write+0x65/0xf0
do_syscall_64+0x62/0x130
Previously, some sanity check have been done in dump_mapping() before
the print facility parsing '%pd' though, it's still possible to run into
an invalid dentry.d_name.name.
Since dump_mapping() only needs to dump the filename only, retrieve it
by itself in a safer way to prevent an unnecessary crash.
Note that either retrieving the filename with '%pd' or
strncpy_from_kernel_nofault(), the filename could be unreliable.
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20240826055503.1522320-1-lizhijian@fujitsu.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/inode.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 10c4619faeef8..d1a098e7d4087 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -595,6 +595,7 @@ void dump_mapping(const struct address_space *mapping)
struct hlist_node *dentry_first;
struct dentry *dentry_ptr;
struct dentry dentry;
+ char fname[64] = {};
unsigned long ino;
/*
@@ -631,11 +632,14 @@ void dump_mapping(const struct address_space *mapping)
return;
}
+ if (strncpy_from_kernel_nofault(fname, dentry.d_name.name, 63) < 0)
+ strscpy(fname, "<invalid>");
/*
- * if dentry is corrupted, the %pd handler may still crash,
- * but it's unlikely that we reach here with a corrupt mapping
+ * Even if strncpy_from_kernel_nofault() succeeded,
+ * the fname could be unreliable
*/
- pr_warn("aops:%ps ino:%lx dentry name:\"%pd\"\n", a_ops, ino, &dentry);
+ pr_warn("aops:%ps ino:%lx dentry name(?):\"%s\"\n",
+ a_ops, ino, fname);
}
void clear_inode(struct inode *inode)
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 064/244] proc: add config & param to block forcing mem writes
[not found] <20240925113641.1297102-1-sashal@kernel.org>
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 020/244] exec: don't WARN for racy path_noexec check Sasha Levin
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 021/244] fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name Sasha Levin
@ 2024-09-25 11:24 ` Sasha Levin
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 065/244] vfs: use RCU in ilookup Sasha Levin
` (4 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:24 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Adrian Ratiu, Doug Anderson, Jeff Xu, Jann Horn, Kees Cook,
Ard Biesheuvel, Christian Brauner, Linus Torvalds, Sasha Levin,
corbet, paul, jmorris, serge, thuth, bp, jpoimboe, tglx, paulmck,
tony, xiongwei.song, akpm, oleg, mic, jlayton, casey, viro,
adobriyan, linux-doc, linux-fsdevel, linux-security-module
From: Adrian Ratiu <adrian.ratiu@collabora.com>
[ Upstream commit 41e8149c8892ed1962bd15350b3c3e6e90cba7f4 ]
This adds a Kconfig option and boot param to allow removing
the FOLL_FORCE flag from /proc/pid/mem write calls because
it can be abused.
The traditional forcing behavior is kept as default because
it can break GDB and some other use cases.
Previously we tried a more sophisticated approach allowing
distributions to fine-tune /proc/pid/mem behavior, however
that got NAK-ed by Linus [1], who prefers this simpler
approach with semantics also easier to understand for users.
Link: https://lore.kernel.org/lkml/CAHk-=wiGWLChxYmUA5HrT5aopZrB7_2VTa0NLZcxORgkUe5tEQ@mail.gmail.com/ [1]
Cc: Doug Anderson <dianders@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Link: https://lore.kernel.org/r/20240802080225.89408-1-adrian.ratiu@collabora.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
.../admin-guide/kernel-parameters.txt | 10 +++
fs/proc/base.c | 61 ++++++++++++++++++-
security/Kconfig | 32 ++++++++++
3 files changed, 102 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 09126bb8cc9ff..be010fec76541 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4788,6 +4788,16 @@
printk.time= Show timing data prefixed to each printk message line
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
+ proc_mem.force_override= [KNL]
+ Format: {always | ptrace | never}
+ Traditionally /proc/pid/mem allows memory permissions to be
+ overridden without restrictions. This option may be set to
+ restrict that. Can be one of:
+ - 'always': traditional behavior always allows mem overrides.
+ - 'ptrace': only allow mem overrides for active ptracers.
+ - 'never': never allow mem overrides.
+ If not specified, default is the CONFIG_PROC_MEM_* choice.
+
processor.max_cstate= [HW,ACPI]
Limit processor to maximum C-state
max_cstate=9 overrides any DMI blacklist limit.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 72a1acd03675c..f389c69767fa5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -85,6 +85,7 @@
#include <linux/elf.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/fs_parser.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
@@ -117,6 +118,40 @@
static u8 nlink_tid __ro_after_init;
static u8 nlink_tgid __ro_after_init;
+enum proc_mem_force {
+ PROC_MEM_FORCE_ALWAYS,
+ PROC_MEM_FORCE_PTRACE,
+ PROC_MEM_FORCE_NEVER
+};
+
+static enum proc_mem_force proc_mem_force_override __ro_after_init =
+ IS_ENABLED(CONFIG_PROC_MEM_NO_FORCE) ? PROC_MEM_FORCE_NEVER :
+ IS_ENABLED(CONFIG_PROC_MEM_FORCE_PTRACE) ? PROC_MEM_FORCE_PTRACE :
+ PROC_MEM_FORCE_ALWAYS;
+
+static const struct constant_table proc_mem_force_table[] __initconst = {
+ { "always", PROC_MEM_FORCE_ALWAYS },
+ { "ptrace", PROC_MEM_FORCE_PTRACE },
+ { "never", PROC_MEM_FORCE_NEVER },
+ { }
+};
+
+static int __init early_proc_mem_force_override(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ /*
+ * lookup_constant() defaults to proc_mem_force_override to preseve
+ * the initial Kconfig choice in case an invalid param gets passed.
+ */
+ proc_mem_force_override = lookup_constant(proc_mem_force_table,
+ buf, proc_mem_force_override);
+
+ return 0;
+}
+early_param("proc_mem.force_override", early_proc_mem_force_override);
+
struct pid_entry {
const char *name;
unsigned int len;
@@ -835,6 +870,28 @@ static int mem_open(struct inode *inode, struct file *file)
return ret;
}
+static bool proc_mem_foll_force(struct file *file, struct mm_struct *mm)
+{
+ struct task_struct *task;
+ bool ptrace_active = false;
+
+ switch (proc_mem_force_override) {
+ case PROC_MEM_FORCE_NEVER:
+ return false;
+ case PROC_MEM_FORCE_PTRACE:
+ task = get_proc_task(file_inode(file));
+ if (task) {
+ ptrace_active = READ_ONCE(task->ptrace) &&
+ READ_ONCE(task->mm) == mm &&
+ READ_ONCE(task->parent) == current;
+ put_task_struct(task);
+ }
+ return ptrace_active;
+ default:
+ return true;
+ }
+}
+
static ssize_t mem_rw(struct file *file, char __user *buf,
size_t count, loff_t *ppos, int write)
{
@@ -855,7 +912,9 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
if (!mmget_not_zero(mm))
goto free;
- flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
+ flags = write ? FOLL_WRITE : 0;
+ if (proc_mem_foll_force(file, mm))
+ flags |= FOLL_FORCE;
while (count > 0) {
size_t this_len = min_t(size_t, count, PAGE_SIZE);
diff --git a/security/Kconfig b/security/Kconfig
index 412e76f1575d0..a93c1a9b7c283 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -19,6 +19,38 @@ config SECURITY_DMESG_RESTRICT
If you are unsure how to answer this question, answer N.
+choice
+ prompt "Allow /proc/pid/mem access override"
+ default PROC_MEM_ALWAYS_FORCE
+ help
+ Traditionally /proc/pid/mem allows users to override memory
+ permissions for users like ptrace, assuming they have ptrace
+ capability.
+
+ This allows people to limit that - either never override, or
+ require actual active ptrace attachment.
+
+ Defaults to the traditional behavior (for now)
+
+config PROC_MEM_ALWAYS_FORCE
+ bool "Traditional /proc/pid/mem behavior"
+ help
+ This allows /proc/pid/mem accesses to override memory mapping
+ permissions if you have ptrace access rights.
+
+config PROC_MEM_FORCE_PTRACE
+ bool "Require active ptrace() use for access override"
+ help
+ This allows /proc/pid/mem accesses to override memory mapping
+ permissions for active ptracers like gdb.
+
+config PROC_MEM_NO_FORCE
+ bool "Never"
+ help
+ Never override memory mapping permissions
+
+endchoice
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 065/244] vfs: use RCU in ilookup
[not found] <20240925113641.1297102-1-sashal@kernel.org>
` (2 preceding siblings ...)
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 064/244] proc: add config & param to block forcing mem writes Sasha Levin
@ 2024-09-25 11:24 ` Sasha Levin
2024-09-25 11:25 ` [PATCH AUTOSEL 6.11 086/244] netfs: Cancel dirty folios that have no storage destination Sasha Levin
` (3 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:24 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Mateusz Guzik, Jan Kara, Christian Brauner, Sasha Levin, viro,
linux-fsdevel
From: Mateusz Guzik <mjguzik@gmail.com>
[ Upstream commit 122381a46954ad592ee93d7da2bef5074b396247 ]
A soft lockup in ilookup was reported when stress-testing a 512-way
system [1] (see [2] for full context) and it was verified that not
taking the lock shifts issues back to mm.
[1] https://lore.kernel.org/linux-mm/56865e57-c250-44da-9713-cf1404595bcc@amd.com/
[2] https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240715071324.265879-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/inode.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d1a098e7d4087..a2af540102d7a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1574,9 +1574,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
struct hlist_head *head = inode_hashtable + hash(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino, true);
- spin_unlock(&inode_hash_lock);
+ inode = find_inode_fast(sb, head, ino, false);
if (inode) {
if (IS_ERR(inode))
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 086/244] netfs: Cancel dirty folios that have no storage destination
[not found] <20240925113641.1297102-1-sashal@kernel.org>
` (3 preceding siblings ...)
2024-09-25 11:24 ` [PATCH AUTOSEL 6.11 065/244] vfs: use RCU in ilookup Sasha Levin
@ 2024-09-25 11:25 ` Sasha Levin
2024-09-25 11:25 ` [PATCH AUTOSEL 6.11 118/244] coredump: Standartize and fix logging Sasha Levin
` (2 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:25 UTC (permalink / raw)
To: linux-kernel, stable
Cc: David Howells, Jeff Layton, netfs, linux-fsdevel,
Christian Brauner, Sasha Levin, rostedt, mhiramat,
linux-trace-kernel
From: David Howells <dhowells@redhat.com>
[ Upstream commit 8f246b7c0a1be0882374f2ff831a61f0dbe77678 ]
Kafs wants to be able to cache the contents of directories (and symlinks),
but whilst these are downloaded from the server with the FS.FetchData RPC
op and similar, the same as for regular files, they can't be updated by
FS.StoreData, but rather have special operations (FS.MakeDir, etc.).
Now, rather than redownloading a directory's content after each change made
to that directory, kafs modifies the local blob. This blob can be saved
out to the cache, and since it's using netfslib, kafs just marks the folios
dirty and lets ->writepages() on the directory take care of it, as for an
regular file.
This is fine as long as there's a cache as although the upload stream is
disabled, there's a cache stream to drive the procedure. But if the cache
goes away in the meantime, suddenly there's no way do any writes and the
code gets confused, complains "R=%x: No submit" to dmesg and leaves the
dirty folio hanging.
Fix this by just cancelling the store of the folio if neither stream is
active. (If there's no cache at the time of dirtying, we should just not
mark the folio dirty).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/netfs/write_issue.c | 6 +++++-
include/trace/events/netfs.h | 1 +
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 3f7e37e50c7d0..3ae287bb19a0a 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -410,13 +410,17 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
folio_unlock(folio);
if (fgroup == NETFS_FOLIO_COPY_TO_CACHE) {
- if (!fscache_resources_valid(&wreq->cache_resources)) {
+ if (!cache->avail) {
trace_netfs_folio(folio, netfs_folio_trace_cancel_copy);
netfs_issue_write(wreq, upload);
netfs_folio_written_back(folio);
return 0;
}
trace_netfs_folio(folio, netfs_folio_trace_store_copy);
+ } else if (!upload->avail && !cache->avail) {
+ trace_netfs_folio(folio, netfs_folio_trace_cancel_store);
+ netfs_folio_written_back(folio);
+ return 0;
} else if (!upload->construct) {
trace_netfs_folio(folio, netfs_folio_trace_store);
} else {
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 606b4a0f92dae..edcc3b3a3ecf8 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -141,6 +141,7 @@
EM(netfs_streaming_cont_filled_page, "mod-streamw-f+") \
/* The rest are for writeback */ \
EM(netfs_folio_trace_cancel_copy, "cancel-copy") \
+ EM(netfs_folio_trace_cancel_store, "cancel-store") \
EM(netfs_folio_trace_clear, "clear") \
EM(netfs_folio_trace_clear_cc, "clear-cc") \
EM(netfs_folio_trace_clear_g, "clear-g") \
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 118/244] coredump: Standartize and fix logging
[not found] <20240925113641.1297102-1-sashal@kernel.org>
` (4 preceding siblings ...)
2024-09-25 11:25 ` [PATCH AUTOSEL 6.11 086/244] netfs: Cancel dirty folios that have no storage destination Sasha Levin
@ 2024-09-25 11:25 ` Sasha Levin
2024-09-25 11:27 ` [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size Sasha Levin
2024-09-25 11:27 ` [PATCH AUTOSEL 6.11 242/244] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release Sasha Levin
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:25 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Roman Kisel, Allen Pais, Kees Cook, Sasha Levin, viro, brauner,
nagvijay, linux-fsdevel
From: Roman Kisel <romank@linux.microsoft.com>
[ Upstream commit c114e9948c2b6a0b400266e59cc656b59e795bca ]
The coredump code does not log the process ID and the comm
consistently, logs unescaped comm when it does log it, and
does not always use the ratelimited logging. That makes it
harder to analyze logs and puts the system at the risk of
spamming the system log incase something crashes many times
over and over again.
Fix that by logging TGID and comm (escaped) consistently and
using the ratelimited logging always.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Tested-by: Allen Pais <apais@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240718182743.1959160-2-romank@linux.microsoft.com
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/coredump.c | 43 +++++++++++++++-------------------------
include/linux/coredump.h | 22 ++++++++++++++++++++
2 files changed, 38 insertions(+), 27 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 7f12ff6ad1d3e..87ff71a59fbe7 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -586,8 +586,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
struct subprocess_info *sub_info;
if (ispipe < 0) {
- printk(KERN_WARNING "format_corename failed\n");
- printk(KERN_WARNING "Aborting core\n");
+ coredump_report_failure("format_corename failed, aborting core");
goto fail_unlock;
}
@@ -607,27 +606,21 @@ void do_coredump(const kernel_siginfo_t *siginfo)
* right pid if a thread in a multi-threaded
* core_pattern process dies.
*/
- printk(KERN_WARNING
- "Process %d(%s) has RLIMIT_CORE set to 1\n",
- task_tgid_vnr(current), current->comm);
- printk(KERN_WARNING "Aborting core\n");
+ coredump_report_failure("RLIMIT_CORE is set to 1, aborting core");
goto fail_unlock;
}
cprm.limit = RLIM_INFINITY;
dump_count = atomic_inc_return(&core_dump_count);
if (core_pipe_limit && (core_pipe_limit < dump_count)) {
- printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
- task_tgid_vnr(current), current->comm);
- printk(KERN_WARNING "Skipping core dump\n");
+ coredump_report_failure("over core_pipe_limit, skipping core dump");
goto fail_dropcount;
}
helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
GFP_KERNEL);
if (!helper_argv) {
- printk(KERN_WARNING "%s failed to allocate memory\n",
- __func__);
+ coredump_report_failure("%s failed to allocate memory", __func__);
goto fail_dropcount;
}
for (argi = 0; argi < argc; argi++)
@@ -644,8 +637,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
kfree(helper_argv);
if (retval) {
- printk(KERN_INFO "Core dump to |%s pipe failed\n",
- cn.corename);
+ coredump_report_failure("|%s pipe failed", cn.corename);
goto close_fail;
}
} else {
@@ -658,10 +650,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
goto fail_unlock;
if (need_suid_safe && cn.corename[0] != '/') {
- printk(KERN_WARNING "Pid %d(%s) can only dump core "\
- "to fully qualified path!\n",
- task_tgid_vnr(current), current->comm);
- printk(KERN_WARNING "Skipping core dump\n");
+ coredump_report_failure(
+ "this process can only dump core to a fully qualified path, skipping core dump");
goto fail_unlock;
}
@@ -730,13 +720,13 @@ void do_coredump(const kernel_siginfo_t *siginfo)
idmap = file_mnt_idmap(cprm.file);
if (!vfsuid_eq_kuid(i_uid_into_vfsuid(idmap, inode),
current_fsuid())) {
- pr_info_ratelimited("Core dump to %s aborted: cannot preserve file owner\n",
- cn.corename);
+ coredump_report_failure("Core dump to %s aborted: "
+ "cannot preserve file owner", cn.corename);
goto close_fail;
}
if ((inode->i_mode & 0677) != 0600) {
- pr_info_ratelimited("Core dump to %s aborted: cannot preserve file permissions\n",
- cn.corename);
+ coredump_report_failure("Core dump to %s aborted: "
+ "cannot preserve file permissions", cn.corename);
goto close_fail;
}
if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
@@ -757,7 +747,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
* have this set to NULL.
*/
if (!cprm.file) {
- pr_info("Core dump to |%s disabled\n", cn.corename);
+ coredump_report_failure("Core dump to |%s disabled", cn.corename);
goto close_fail;
}
if (!dump_vma_snapshot(&cprm))
@@ -983,11 +973,10 @@ void validate_coredump_safety(void)
{
if (suid_dumpable == SUID_DUMP_ROOT &&
core_pattern[0] != '/' && core_pattern[0] != '|') {
- pr_warn(
-"Unsafe core_pattern used with fs.suid_dumpable=2.\n"
-"Pipe handler or fully qualified core dump path required.\n"
-"Set kernel.core_pattern before fs.suid_dumpable.\n"
- );
+
+ coredump_report_failure("Unsafe core_pattern used with fs.suid_dumpable=2: "
+ "pipe handler or fully qualified core dump path required. "
+ "Set kernel.core_pattern before fs.suid_dumpable.");
}
}
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 0904ba010341a..45e598fe34766 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -43,8 +43,30 @@ extern int dump_align(struct coredump_params *cprm, int align);
int dump_user_range(struct coredump_params *cprm, unsigned long start,
unsigned long len);
extern void do_coredump(const kernel_siginfo_t *siginfo);
+
+/*
+ * Logging for the coredump code, ratelimited.
+ * The TGID and comm fields are added to the message.
+ */
+
+#define __COREDUMP_PRINTK(Level, Format, ...) \
+ do { \
+ char comm[TASK_COMM_LEN]; \
+ \
+ get_task_comm(comm, current); \
+ printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n", \
+ task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__); \
+ } while (0) \
+
+#define coredump_report(fmt, ...) __COREDUMP_PRINTK(KERN_INFO, fmt, ##__VA_ARGS__)
+#define coredump_report_failure(fmt, ...) __COREDUMP_PRINTK(KERN_WARNING, fmt, ##__VA_ARGS__)
+
#else
static inline void do_coredump(const kernel_siginfo_t *siginfo) {}
+
+#define coredump_report(...)
+#define coredump_report_failure(...)
+
#endif
#if defined(CONFIG_COREDUMP) && defined(CONFIG_SYSCTL)
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size
[not found] <20240925113641.1297102-1-sashal@kernel.org>
` (5 preceding siblings ...)
2024-09-25 11:25 ` [PATCH AUTOSEL 6.11 118/244] coredump: Standartize and fix logging Sasha Levin
@ 2024-09-25 11:27 ` Sasha Levin
2024-09-25 13:01 ` Dave Chinner
2024-09-25 11:27 ` [PATCH AUTOSEL 6.11 242/244] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release Sasha Levin
7 siblings, 1 reply; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:27 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Pankaj Raghav, Hannes Reinecke, Darrick J . Wong, Dave Chinner,
Daniel Gomez, Christian Brauner, Sasha Levin, linux-xfs,
linux-fsdevel
From: Pankaj Raghav <p.raghav@samsung.com>
[ Upstream commit 10553a91652d995274da63fc317470f703765081 ]
iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.
If the block size > page size, this will send the contents of the page
next to zero page(as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.
iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240822135018.1931258-7-kernel@pankajraghav.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/iomap/buffered-io.c | 4 ++--
fs/iomap/direct-io.c | 45 ++++++++++++++++++++++++++++++++++++------
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc..d745f718bcde8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -2007,10 +2007,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
}
EXPORT_SYMBOL_GPL(iomap_writepages);
-static int __init iomap_init(void)
+static int __init iomap_buffered_init(void)
{
return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
offsetof(struct iomap_ioend, io_bio),
BIOSET_NEED_BVECS);
}
-fs_initcall(iomap_init);
+fs_initcall(iomap_buffered_init);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46e..c02b266bba525 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -11,6 +11,7 @@
#include <linux/iomap.h>
#include <linux/backing-dev.h>
#include <linux/uio.h>
+#include <linux/set_memory.h>
#include <linux/task_io_accounting_ops.h>
#include "trace.h"
@@ -27,6 +28,13 @@
#define IOMAP_DIO_WRITE (1U << 30)
#define IOMAP_DIO_DIRTY (1U << 31)
+/*
+ * Used for sub block zeroing in iomap_dio_zero()
+ */
+#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
+#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
+static struct page *zero_page;
+
struct iomap_dio {
struct kiocb *iocb;
const struct iomap_dio_ops *dops;
@@ -232,13 +240,20 @@ void iomap_dio_bio_end_io(struct bio *bio)
}
EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
-static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
+static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
loff_t pos, unsigned len)
{
struct inode *inode = file_inode(dio->iocb->ki_filp);
- struct page *page = ZERO_PAGE(0);
struct bio *bio;
+ if (!len)
+ return 0;
+ /*
+ * Max block size supported is 64k
+ */
+ if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+ return -EINVAL;
+
bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
@@ -246,8 +261,9 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- __bio_add_page(bio, page, len, 0);
+ __bio_add_page(bio, zero_page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
+ return 0;
}
/*
@@ -356,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
if (need_zeroout) {
/* zero out from the start of the block to the write offset */
pad = pos & (fs_block_size - 1);
- if (pad)
- iomap_dio_zero(iter, dio, pos - pad, pad);
+
+ ret = iomap_dio_zero(iter, dio, pos - pad, pad);
+ if (ret)
+ goto out;
}
/*
@@ -431,7 +449,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
/* zero out from the end of the write to the end of the block */
pad = pos & (fs_block_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ ret = iomap_dio_zero(iter, dio, pos,
+ fs_block_size - pad);
}
out:
/* Undo iter limitation to current extent */
@@ -753,3 +772,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+ zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ IOMAP_ZERO_PAGE_ORDER);
+
+ if (!zero_page)
+ return -ENOMEM;
+
+ set_memory_ro((unsigned long)page_address(zero_page),
+ 1U << IOMAP_ZERO_PAGE_ORDER);
+ return 0;
+}
+fs_initcall(iomap_dio_init);
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-09-25 11:27 ` [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size Sasha Levin
@ 2024-09-25 13:01 ` Dave Chinner
2024-09-26 7:58 ` Pankaj Raghav (Samsung)
2024-10-06 0:30 ` Sasha Levin
0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2024-09-25 13:01 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-kernel, stable, Pankaj Raghav, Hannes Reinecke,
Darrick J . Wong, Dave Chinner, Daniel Gomez, Christian Brauner,
linux-xfs, linux-fsdevel
On Wed, Sep 25, 2024 at 07:27:38AM -0400, Sasha Levin wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> [ Upstream commit 10553a91652d995274da63fc317470f703765081 ]
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
Please drop this. It is for support of new functionality that was
just merged and has no relevance to older kernels. It is not a bug
fix.
And ....
> +
> + set_memory_ro((unsigned long)page_address(zero_page),
> + 1U << IOMAP_ZERO_PAGE_ORDER);
.... this will cause stable kernel regressions.
It was removed later in the merge because it is unnecessary and
causes boot failures on (at least) some Power architectures.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Re: [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-09-25 13:01 ` Dave Chinner
@ 2024-09-26 7:58 ` Pankaj Raghav (Samsung)
2024-10-06 0:30 ` Sasha Levin
1 sibling, 0 replies; 11+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-09-26 7:58 UTC (permalink / raw)
To: Dave Chinner
Cc: Sasha Levin, linux-kernel, stable, Pankaj Raghav, Hannes Reinecke,
Darrick J . Wong, Dave Chinner, Daniel Gomez, Christian Brauner,
linux-xfs, linux-fsdevel
On Wed, Sep 25, 2024 at 11:01:46PM +1000, Dave Chinner wrote:
> On Wed, Sep 25, 2024 at 07:27:38AM -0400, Sasha Levin wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > [ Upstream commit 10553a91652d995274da63fc317470f703765081 ]
> >
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> >
> > If the block size > page size, this will send the contents of the page
> > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> >
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
>
> Please drop this. It is for support of new functionality that was
> just merged and has no relevance to older kernels. It is not a bug
> fix.
>
I did not have any fixes by tag for this reason. So please drop this commit
from the queue.
> And ....
>
> > +
> > + set_memory_ro((unsigned long)page_address(zero_page),
> > + 1U << IOMAP_ZERO_PAGE_ORDER);
>
> .... this will cause stable kernel regressions.
>
> It was removed later in the merge because it is unnecessary and
> causes boot failures on (at least) some Power architectures.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
--
Pankaj Raghav
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size
2024-09-25 13:01 ` Dave Chinner
2024-09-26 7:58 ` Pankaj Raghav (Samsung)
@ 2024-10-06 0:30 ` Sasha Levin
1 sibling, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-10-06 0:30 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-kernel, stable, Pankaj Raghav, Hannes Reinecke,
Darrick J . Wong, Dave Chinner, Daniel Gomez, Christian Brauner,
linux-xfs, linux-fsdevel
On Wed, Sep 25, 2024 at 11:01:46PM +1000, Dave Chinner wrote:
>On Wed, Sep 25, 2024 at 07:27:38AM -0400, Sasha Levin wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> [ Upstream commit 10553a91652d995274da63fc317470f703765081 ]
>>
>> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
>> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
>> size < page_size. This is true for most filesystems at the moment.
>>
>> If the block size > page size, this will send the contents of the page
>> next to zero page(as len > PAGE_SIZE) to the underlying block device,
>> causing FS corruption.
>>
>> iomap is a generic infrastructure and it should not make any assumptions
>> about the fs block size and the page size of the system.
>
>Please drop this. It is for support of new functionality that was
>just merged and has no relevance to older kernels. It is not a bug
>fix.
>
>And ....
>
>> +
>> + set_memory_ro((unsigned long)page_address(zero_page),
>> + 1U << IOMAP_ZERO_PAGE_ORDER);
>
>.... this will cause stable kernel regressions.
>
>It was removed later in the merge because it is unnecessary and
>causes boot failures on (at least) some Power architectures.
Dropped, and sorry about sending this one out twice!
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH AUTOSEL 6.11 242/244] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
[not found] <20240925113641.1297102-1-sashal@kernel.org>
` (6 preceding siblings ...)
2024-09-25 11:27 ` [PATCH AUTOSEL 6.11 237/244] iomap: fix iomap_dio_zero() for fs bs > system page size Sasha Levin
@ 2024-09-25 11:27 ` Sasha Levin
7 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2024-09-25 11:27 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Christoph Hellwig, Darrick J . Wong, Christian Brauner,
Sasha Levin, linux-xfs, linux-fsdevel
From: Christoph Hellwig <hch@lst.de>
[ Upstream commit 7a9d43eace888a0ee6095035997bb138425844d3 ]
When direct I/O completions invalidates the page cache it holds neither the
i_rwsem nor the invalidate_lock so it can be racing with
iomap_write_delalloc_release. If the search for the end of the region that
contains data returns the start offset we hit such a race and just need to
look for the end of the newly created hole instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/iomap/buffered-io.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d745f718bcde8..ca928cc9be49a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1241,7 +1241,15 @@ static int iomap_write_delalloc_release(struct inode *inode,
error = data_end;
goto out_unlock;
}
- WARN_ON_ONCE(data_end <= start_byte);
+
+ /*
+ * If we race with post-direct I/O invalidation of the page cache,
+ * there might be no data left at start_byte.
+ */
+ if (data_end == start_byte)
+ continue;
+
+ WARN_ON_ONCE(data_end < start_byte);
WARN_ON_ONCE(data_end > scan_end_byte);
error = iomap_write_delalloc_scan(inode, &punch_start_byte,
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread