* [PATCH] [v2] filemap: Move prefaulting out of hot write path
@ 2025-02-28 20:37 Dave Hansen
2025-03-04 2:37 ` Andrew Morton
2025-03-10 8:45 ` kernel test robot
0 siblings, 2 replies; 3+ messages in thread
From: Dave Hansen @ 2025-02-28 20:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-fsdevel, Dave Hansen, tytso, willy, akpm, mjguzik, david
There is a generic anti-pattern that shows up in the VFS and several
filesystems where the hot write paths touch userspace twice when they
could get away with doing it once.
Dave Chinner suggested that they should all be fixed up[1]. I agree[2].
But, the series to do that fixup spans a bunch of filesystems and a lot
of people. This patch fixes common code that absolutely everyone uses.
It has measurable performance benefits[3].
I think this patch can go in and not be held up by the others.
I will post them separately to their separate maintainers for
consideration. But, honestly, I'm not going to lose any sleep if
the maintainers don't pick those up.
1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/
2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/
3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/
--
Changes from v1:
 * Update comment to talk more about how filesystem locking and
   atomic usercopies avoid deadlocks.
--
From: Dave Hansen <dave.hansen@linux.intel.com>
There is a bit of a sordid history here. I originally wrote
998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages")
to fix a performance issue that showed up on early SMAP hardware.
But that was reverted with 00a3d660cbac because it exposed an
underlying filesystem bug.
This is a reimplementation of the original commit along with some
simplification and comment improvements.
The basic problem is that the generic write path has two userspace
accesses: one to prefault the write source buffer and then another to
perform the actual write. On x86, this means an extra STAC/CLAC pair.
These are relatively expensive instructions because they function as
barriers.
Keep the prefaulting behavior but move it into the slow path that gets
run when the write did not make any progress. This avoids livelocks
that can happen when the write's source and destination target the
same folio. Contrary to the existing comments, the fault-in does not
prevent deadlocks. That's accomplished by using an "atomic" usercopy
that disables page faults.
The end result is that the generic write fast path now touches
userspace once instead of twice.
0day has shown some improvements on a couple of microbenchmarks:
https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
---
b/mm/filemap.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff -puN mm/filemap.c~generic_perform_write-1 mm/filemap.c
--- a/mm/filemap.c~generic_perform_write-1 2025-02-28 11:58:36.499615962 -0800
+++ b/mm/filemap.c 2025-02-28 12:13:27.845549365 -0800
@@ -4170,17 +4170,6 @@ retry:
 		bytes = min(chunk - offset, bytes);
 		balance_dirty_pages_ratelimited(mapping);
 
-		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 */
-		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
-			status = -EFAULT;
-			break;
-		}
-
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
@@ -4198,6 +4187,12 @@ retry:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_folio(folio);
 
+		/*
+		 * Faults here on mmap()s can recurse into arbitrary
+		 * filesystem code.  Lots of locks are held that can
+		 * deadlock.  Use an atomic copy to avoid deadlocking
+		 * in page fault handling.
+		 */
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		flush_dcache_folio(folio);
@@ -4223,6 +4218,16 @@ retry:
 				bytes = copied;
 				goto retry;
 			}
+
+			/*
+			 * 'folio' is now unlocked and faults on it can be
+			 * handled.  Ensure forward progress by trying to
+			 * fault it in now.
+			 */
+			if (fault_in_iov_iter_readable(i, bytes) == bytes) {
+				status = -EFAULT;
+				break;
+			}
 		} else {
 			pos += status;
 			written += status;
_
* Re: [PATCH] [v2] filemap: Move prefaulting out of hot write path
2025-02-28 20:37 [PATCH] [v2] filemap: Move prefaulting out of hot write path Dave Hansen
@ 2025-03-04 2:37 ` Andrew Morton
2025-03-10 8:45 ` kernel test robot
1 sibling, 0 replies; 3+ messages in thread
From: Andrew Morton @ 2025-03-04 2:37 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel, linux-fsdevel, tytso, willy, mjguzik, david
On Fri, 28 Feb 2025 12:37:22 -0800 Dave Hansen <dave.hansen@linux.intel.com> wrote:
> There is a generic anti-pattern that shows up in the VFS and several
> filesystems where the hot write paths touch userspace twice when they
> could get away with doing it once.
>
> Dave Chinner suggested that they should all be fixed up[1]. I agree[2].
> But, the series to do that fixup spans a bunch of filesystems and a lot
> of people. This patch fixes common code that absolutely everyone uses.
> It has measurable performance benefits[3].
>
> I think this patch can go in and not be held up by the others.
I'll queue this in mm-hotfixes with a view to sneaking it into 6.14-rcX
as a basis for others to work against.
* Re: [PATCH] [v2] filemap: Move prefaulting out of hot write path
2025-02-28 20:37 [PATCH] [v2] filemap: Move prefaulting out of hot write path Dave Hansen
2025-03-04 2:37 ` Andrew Morton
@ 2025-03-10 8:45 ` kernel test robot
1 sibling, 0 replies; 3+ messages in thread
From: kernel test robot @ 2025-03-10 8:45 UTC (permalink / raw)
To: Dave Hansen
Cc: oe-lkp, lkp, Ted Ts'o, Matthew Wilcox, Andrew Morton,
Mateusz Guzik, Dave Chinner, linux-fsdevel, linux-kernel,
Dave Hansen, oliver.sang
Hello,
kernel test robot noticed a 3.6% improvement of will-it-scale.per_thread_ops on:
commit: 391ab5826c820c58d180534a7a727ff5668d4d61 ("[PATCH] [v2] filemap: Move prefaulting out of hot write path")
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Hansen/filemap-Move-prefaulting-out-of-hot-write-path/20250301-043921
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com/
patch subject: [PATCH] [v2] filemap: Move prefaulting out of hot write path
testcase: will-it-scale
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:
nr_task: 100%
mode: thread
test: pwrite1
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250310/202503101621.e0858506-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-9.4/thread/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/pwrite1/will-it-scale
commit:
3dec9c0e67 ("foo")
391ab5826c ("filemap: Move prefaulting out of hot write path")
3dec9c0e67aaf496 391ab5826c820c58d180534a7a7
---------------- ---------------------------
%stddev %change %stddev
\ | \
182266 ± 3% +9.4% 199333 ± 4% meminfo.DirectMap4k
765.67 ± 8% -22.1% 596.83 ± 9% perf-sched.wait_and_delay.count.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
17510 ± 6% +19.3% 20889 ± 8% sched_debug.cpu.nr_switches.max
3219 ± 5% +11.2% 3578 ± 2% sched_debug.cpu.nr_switches.stddev
54561715 +3.6% 56543708 will-it-scale.104.threads
524631 +3.6% 543689 will-it-scale.per_thread_ops
54561715 +3.6% 56543708 will-it-scale.workload
1.752e+10 -1.2% 1.731e+10 perf-stat.i.branch-instructions
1.59 +0.0 1.63 perf-stat.i.branch-miss-rate%
3.25 +1.8% 3.31 perf-stat.i.cpi
8.828e+10 -1.5% 8.699e+10 perf-stat.i.instructions
0.31 -1.8% 0.30 perf-stat.i.ipc
1.58 +0.0 1.62 perf-stat.overall.branch-miss-rate%
3.25 +1.8% 3.31 perf-stat.overall.cpi
0.31 -1.7% 0.30 perf-stat.overall.ipc
487316 -4.8% 464012 perf-stat.overall.path-length
1.746e+10 -1.2% 1.725e+10 perf-stat.ps.branch-instructions
8.798e+10 -1.5% 8.67e+10 perf-stat.ps.instructions
2.659e+13 -1.3% 2.624e+13 perf-stat.total.instructions
34.45 -5.9 28.57 perf-profile.calltrace.cycles-pp.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64.do_syscall_64
48.17 -5.3 42.87 ± 2% perf-profile.calltrace.cycles-pp.vfs_write.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
42.56 -5.3 37.29 ± 2% perf-profile.calltrace.cycles-pp.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe
51.38 -4.9 46.46 perf-profile.calltrace.cycles-pp.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
54.26 -4.5 49.76 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
62.52 -3.9 58.65 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_pwrite
13.30 ± 2% -2.3 10.96 perf-profile.calltrace.cycles-pp.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
10.17 ± 2% -2.0 8.18 perf-profile.calltrace.cycles-pp.rep_movs_alternative.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.vfs_write
0.59 ± 4% -0.3 0.26 ±100% perf-profile.calltrace.cycles-pp.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
0.88 ± 3% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.folio_mark_accessed.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
0.67 ± 2% -0.1 0.56 perf-profile.calltrace.cycles-pp.balance_dirty_pages_ratelimited_flags.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
99.51 +0.1 99.56 perf-profile.calltrace.cycles-pp.__libc_pwrite
0.64 ± 5% +0.1 0.71 ± 2% perf-profile.calltrace.cycles-pp.fput.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
0.93 ± 2% +0.1 1.02 perf-profile.calltrace.cycles-pp.folio_unlock.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write
0.72 +0.1 0.82 ± 2% perf-profile.calltrace.cycles-pp.ktime_get_coarse_real_ts64_mg.current_time.inode_needs_update_time.file_update_time.shmem_file_write_iter
1.20 +0.2 1.38 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
0.84 ± 2% +0.2 1.04 ± 2% perf-profile.calltrace.cycles-pp.noop_dirty_folio.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write
1.48 +0.2 1.72 ± 2% perf-profile.calltrace.cycles-pp.current_time.inode_needs_update_time.file_update_time.shmem_file_write_iter.vfs_write
1.00 ± 2% +0.3 1.26 ± 3% perf-profile.calltrace.cycles-pp.folio_mark_dirty.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write
1.99 ± 3% +0.3 2.27 ± 4% perf-profile.calltrace.cycles-pp.fdget.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
6.69 +0.3 7.02 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__libc_pwrite
2.24 +0.4 2.60 ± 3% perf-profile.calltrace.cycles-pp.inode_needs_update_time.file_update_time.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
2.78 ± 2% +0.4 3.21 ± 2% perf-profile.calltrace.cycles-pp.file_update_time.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64.do_syscall_64
4.34 +0.6 4.92 perf-profile.calltrace.cycles-pp.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
12.01 +0.9 12.92 perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.__libc_pwrite
2.89 ± 6% +1.4 4.26 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.__libc_pwrite
15.09 +1.5 16.61 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.__libc_pwrite
34.58 -5.9 28.70 perf-profile.children.cycles-pp.generic_perform_write
48.24 -5.3 42.94 ± 2% perf-profile.children.cycles-pp.vfs_write
42.95 -5.3 37.66 ± 2% perf-profile.children.cycles-pp.shmem_file_write_iter
51.54 -4.9 46.63 perf-profile.children.cycles-pp.__x64_sys_pwrite64
54.37 -4.5 49.86 perf-profile.children.cycles-pp.do_syscall_64
62.76 -3.9 58.90 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
10.62 ± 2% -2.3 8.30 perf-profile.children.cycles-pp.rep_movs_alternative
13.47 ± 2% -2.3 11.16 perf-profile.children.cycles-pp.copy_page_from_iter_atomic
0.90 ± 3% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.folio_mark_accessed
0.69 ± 2% -0.1 0.60 perf-profile.children.cycles-pp.balance_dirty_pages_ratelimited_flags
0.29 -0.0 0.26 ± 2% perf-profile.children.cycles-pp.testcase
0.31 ± 3% -0.0 0.29 ± 2% perf-profile.children.cycles-pp.update_process_times
0.50 -0.0 0.48 perf-profile.children.cycles-pp.rcu_all_qs
99.67 +0.0 99.70 perf-profile.children.cycles-pp.__libc_pwrite
0.64 ± 5% +0.1 0.71 ± 2% perf-profile.children.cycles-pp.fput
0.43 ± 3% +0.1 0.50 ± 2% perf-profile.children.cycles-pp.folio_mapping
0.94 ± 2% +0.1 1.02 perf-profile.children.cycles-pp.folio_unlock
0.74 ± 2% +0.1 0.85 ± 2% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64_mg
1.23 +0.2 1.41 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.89 +0.2 1.10 ± 2% perf-profile.children.cycles-pp.noop_dirty_folio
1.54 +0.2 1.79 ± 2% perf-profile.children.cycles-pp.current_time
1.99 ± 3% +0.3 2.27 ± 4% perf-profile.children.cycles-pp.fdget
1.08 ± 3% +0.3 1.36 ± 3% perf-profile.children.cycles-pp.folio_mark_dirty
2.30 +0.4 2.66 ± 3% perf-profile.children.cycles-pp.inode_needs_update_time
2.86 ± 2% +0.4 3.30 ± 2% perf-profile.children.cycles-pp.file_update_time
4.58 +0.6 5.17 perf-profile.children.cycles-pp.shmem_write_end
1.73 ± 5% +0.7 2.43 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
12.88 +1.0 13.84 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
6.99 +1.0 7.95 perf-profile.children.cycles-pp.entry_SYSCALL_64
15.22 +1.5 16.75 perf-profile.children.cycles-pp.syscall_return_via_sysret
10.43 ± 2% -2.3 8.08 perf-profile.self.cycles-pp.rep_movs_alternative
3.26 -0.4 2.86 perf-profile.self.cycles-pp.generic_perform_write
2.02 -0.2 1.78 ± 2% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.87 ± 3% -0.1 0.74 ± 4% perf-profile.self.cycles-pp.folio_mark_accessed
0.53 ± 2% -0.1 0.43 ± 2% perf-profile.self.cycles-pp.balance_dirty_pages_ratelimited_flags
0.25 ± 2% -0.0 0.22 ± 2% perf-profile.self.cycles-pp.testcase
0.54 +0.0 0.59 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.79 ± 4% +0.0 0.84 ± 2% perf-profile.self.cycles-pp.__x64_sys_pwrite64
0.51 ± 2% +0.1 0.58 ± 2% perf-profile.self.cycles-pp.fput
0.38 ± 3% +0.1 0.45 ± 2% perf-profile.self.cycles-pp.folio_mapping
0.74 ± 2% +0.1 0.82 perf-profile.self.cycles-pp.folio_unlock
0.73 ± 4% +0.1 0.82 ± 4% perf-profile.self.cycles-pp.inode_needs_update_time
0.54 ± 6% +0.1 0.64 ± 3% perf-profile.self.cycles-pp.file_update_time
0.72 ± 2% +0.1 0.82 ± 2% perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64_mg
0.79 ± 2% +0.1 0.94 ± 3% perf-profile.self.cycles-pp.current_time
0.98 ± 2% +0.2 1.14 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.85 +0.2 1.04 ± 2% perf-profile.self.cycles-pp.noop_dirty_folio
0.65 ± 3% +0.2 0.85 ± 4% perf-profile.self.cycles-pp.folio_mark_dirty
1.11 ± 6% +0.2 1.32 ± 2% perf-profile.self.cycles-pp.do_syscall_64
1.98 ± 3% +0.3 2.26 ± 4% perf-profile.self.cycles-pp.fdget
2.14 ± 2% +0.4 2.55 ± 3% perf-profile.self.cycles-pp.__libc_pwrite
1.63 +0.6 2.23 ± 3% perf-profile.self.cycles-pp.shmem_write_begin
8.54 +0.7 9.20 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
6.09 +0.9 7.03 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64
12.75 +1.0 13.70 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
15.20 +1.5 16.72 perf-profile.self.cycles-pp.syscall_return_via_sysret
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki