* Hard and soft lockups with FIO and LTP runs on a large system
@ 2024-07-03 15:11 Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
2024-07-17 9:42 ` Vlastimil Babka
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-03 15:11 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, yuzhao, kinseyho, Mel Gorman
Many soft and hard lockups are seen with the upstream kernel when running a
bunch of tests that include FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours into the run.
Originally this was reported on a large customer VM instance with passthrough
NVME disks on older kernels (v5.4-based). However, similar problems were
reproduced when running the tests on bare metal with the latest upstream
kernel (v6.10-rc3). Other lockups with different signatures are seen as well,
but this report discusses only those related to the MM area.
Also note that the subsequent description refers to the lockups seen on
bare metal with the upstream kernel (and not in the VM).
The general observation is that the problem usually surfaces when the
system's free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the time the two contended locks are the lruvec and
inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation, typical of a large system configuration?
- Are there any MM/FS tunables that could help here?
Hardware configuration
======================
Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB
Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout.
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though the workload consists of many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine:
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
The watchdog threshold was reduced to 5s to reproduce the problem earlier,
and all-CPU backtrace was enabled.
Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:
kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <NMI>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: </NMI>
kernel: <TASK>
kernel: ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7fe21c314887
With all-CPU backtraces enabled, many CPUs are seen waiting to acquire the
lruvec lock. We measured the lruvec spinlock start, end and hold
time (htime) using sched_clock(), along with a BUG() if the hold time
exceeded 10s.
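A minimal sketch of such instrumentation is shown below (this is not the
exact debug patch used for this report; the lock_stime field stored in
struct lruvec at lock time is a hypothetical addition for the experiment):

/*
 * Debug-only sketch: remember sched_clock() when the lruvec lock is taken
 * (in a hypothetical lruvec->lock_stime field) and report the hold time
 * on release, BUG()ing if it exceeded 10 seconds.
 */
static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
{
	u64 etime = sched_clock();
	u64 htime = etime - lruvec->lock_stime;

	if (htime > 10ULL * NSEC_PER_SEC) {
		pr_err("unlock_page_lruvec_irq: stime %llu, etime %llu, htime %llu\n",
		       lruvec->lock_stime, etime, htime);
		BUG();
	}
	spin_unlock_irq(&lruvec->lru_lock);
}

The case below shows the lruvec spinlock being held for ~25s: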
kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520
And the corresponding trace point for the above:
kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
classzone=0 order=0 nr_requested=1 nr_scanned=156946361
nr_skipped=156946360 nr_taken=1 lru=active_file
This shows that isolate_lru_folios() is scanning through a huge number
(~150 million) of folios (order=0) with the lruvec spinlock held. This
happens because a large number of folios are skipped in order to isolate
a few ZONE_DMA folios. Though the requested scan count is bounded
(nr_to_scan = 32), there is a genuine case where the actual scan becomes
unbounded, namely when folios are skipped.
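For reference, below is a simplified paraphrase of the relevant loop in
isolate_lru_folios() (mm/vmscan.c around v6.10; isolation and batching
details elided, and the skip path condensed). The key point is that skipped
folios do not advance 'scan', so the walk is not bounded by nr_to_scan when
most folios on the list are ineligible:

	while (scan < nr_to_scan && !list_empty(src)) {
		struct folio *folio = lru_to_folio(src);
		unsigned long nr_pages = folio_nr_pages(folio);

		total_scan += nr_pages;

		if (folio_zonenum(folio) > sc->reclaim_idx ||
		    skip_cma(folio, sc)) {
			/* Ineligible for this reclaim: skip without counting */
			nr_skipped[folio_zonenum(folio)] += nr_pages;
			list_move(&folio->lru, &folios_skipped);
			continue;
		}

		scan += nr_pages;
		/* ... try to isolate the folio onto 'dst' ... */
	}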
Meminfo output shows that free memory is at around 2% and the page/buffer
cache has grown very large when the lockup happens.
MemTotal: 1584835956 kB
MemFree: 27805664 kB
MemAvailable: 1568099004 kB
Buffers: 1386120792 kB
Cached: 151894528 kB
SwapCached: 30620 kB
Active: 1043678892 kB
Inactive: 494456452 kB
Oftentimes, the perf output at the time of the problem shows heavy
contention on the lruvec spin lock. Similar contention is also observed on
the inode i_lock (in the clear_shadow_entry path):
98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
|
--98.96%--native_queued_spin_lock_slowpath
|
--98.96%--_raw_spin_lock_irqsave
folio_lruvec_lock_irqsave
|
--98.78%--folio_batch_move_lru
|
--98.63%--deactivate_file_folio
mapping_try_invalidate
invalidate_mapping_pages
invalidate_bdev
blkdev_common_ioctl
blkdev_ioctl
__x64_sys_ioctl
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed; no hard
lockups were seen for the 48-hour run. Below is one such soft lockup.
kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <IRQ>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_timer_fn+0x223/0x2b0
kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel: </IRQ>
kernel: <TASK>
kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: _raw_spin_lock+0x38/0x50
kernel: clear_shadow_entry+0x3d/0x100
kernel: ? __pfx_workingset_update_node+0x10/0x10
kernel: mapping_try_invalidate+0x117/0x1d0
kernel: invalidate_mapping_pages+0x10/0x20
kernel: invalidate_bdev+0x3c/0x50
kernel: blkdev_common_ioctl+0x5f7/0xa90
kernel: blkdev_ioctl+0x109/0x270
kernel: x64_sys_call+0x1215/0x20d0
kernel: do_syscall_64+0x7e/0x130
This one happens to be contending on the inode i_lock spinlock.
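For context, here is (roughly) the unpatched clear_shadow_entry() from
mm/truncate.c in v6.10-rc3 (see also the context lines in truncate.patch
later in this thread); the i_lock and the i_pages xarray lock are taken
once per shadow entry, which is where this contention comes from:

static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
			       void *entry)
{
	/* One i_lock/i_pages round trip per shadow entry */
	spin_lock(&mapping->host->i_lock);
	xa_lock_irq(&mapping->i_pages);
	__clear_shadow_entry(mapping, index, entry);
	xa_unlock_irq(&mapping->i_pages);
	if (mapping_shrinkable(mapping))
		inode_add_lru(mapping->host);
	spin_unlock(&mapping->host->i_lock);
}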
The preemptirqsoff trace below points to preemption being disabled for more
than 10s, and the lock in the picture is the lruvec spinlock.
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
# --------------------------------------------------------------------
# latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
HP:0 #P:512)
# -----------------
# | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: deactivate_file_folio
# => ended at: deactivate_file_folio
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
fio-2701523 128...1. 0us$: deactivate_file_folio
<-deactivate_file_folio
fio-2701523 128.N.1. 10382681us : deactivate_file_folio
<-deactivate_file_folio
fio-2701523 128.N.1. 10382683us : tracer_preempt_on
<-deactivate_file_folio
fio-2701523 128.N.1. 10382691us : <stack trace>
=> deactivate_file_folio
=> mapping_try_invalidate
=> invalidate_mapping_pages
=> invalidate_bdev
=> blkdev_common_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
2) Increased low_watermark_threshold to 10% to prevent the system from
entering an extremely low memory situation. Hard lockups weren't seen,
but soft lockups (in clear_shadow_entry()) were still seen.
3) AMD has a BIOS setting called NPS (Nodes Per Socket), using which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
the system. This was done to check whether having more kswapd threads,
each working on a smaller number of folios per node, would make a
difference. However, here too multiple soft lockups were seen (in
clear_shadow_entry(), as in the MGLRU case). No hard lockups were observed.
Any insights into these lockups and suggestions are welcome!
Regards,
Bharata.
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao
@ 2024-07-06 22:42 ` Yu Zhao
2024-07-08 14:34 ` Bharata B Rao
2024-07-10 12:03 ` Bharata B Rao
2024-07-17 9:42 ` Vlastimil Babka
1 sibling, 2 replies; 37+ messages in thread
From: Yu Zhao @ 2024-07-06 22:42 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
Hi Bharata,
On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Many soft and hard lockups are seen with upstream kernel when running a
> bunch of tests that include FIO and LTP filesystem test on 10 NVME
> disks. The lockups can appear anywhere between 2 to 48 hours. Originally
> this was reported on a large customer VM instance with passthrough NVME
> disks on older kernels(v5.4 based). However, similar problems were
> reproduced when running the tests on bare metal with latest upstream
> kernel (v6.10-rc3). Other lockups with different signatures are seen but
> in this report, only those related to MM area are being discussed.
> Also note that the subsequent description is related to the lockups in
> bare metal upstream (and not VM).
>
> The general observation is that the problem usually surfaces when the
> system free memory goes very low and page cache/buffer consumption hits
> the ceiling. Most of the times the two contended locks are lruvec and
> inode->i_lock spinlocks.
>
> - Could this be a scalability issue in LRU list handling and/or page
> cache invalidation typical to a large system configuration?
> - Are there any MM/FS tunables that could help here?
>
> Hardware configuration
> ======================
> Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
> Memory: 1.5 TB
> 10 NVME - 3.5TB each
> available: 2 nodes (0-1)
> node 0 cpus: 0-127,256-383
> node 0 size: 773727 MB
> node 1 cpus: 128-255,384-511
> node 1 size: 773966 MB
>
> Workload details
> ================
> Workload includes concurrent runs of FIO and a few FS tests from LTP.
>
> FIO is run with a size of 1TB on each NVME partition with different
> combinations of ioengine/blocksize/mode parameters and buffered-IO.
> Selected FS tests from LTP are run on 256GB partitions of all NVME
> disks. This is the typical NVME partition layout.
>
> nvme2n1 259:4 0 3.5T 0 disk
> ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
> └─nvme2n1p2 259:7 0 3.2T 0 part
>
> Though many different runs exist in the workload, the combination that
> results in the problem is buffered-IO run with sync engine.
>
> fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
> -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
> -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
>
> Watchdog threshold was reduced to 5s to reproduce the problem early and
> all CPU backtrace enabled.
>
> Problem details and analysis
> ============================
> One of the hard lockups which was observed and analyzed in detail is this:
>
> kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
> kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: Call Trace:
> kernel: <NMI>
> kernel: ? show_regs+0x69/0x80
> kernel: ? watchdog_hardlockup_check+0x19e/0x360
> <SNIP>
> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: </NMI>
> kernel: <TASK>
> kernel: ? __pfx_lru_add_fn+0x10/0x10
> kernel: _raw_spin_lock_irqsave+0x42/0x50
> kernel: folio_lruvec_lock_irqsave+0x62/0xb0
> kernel: folio_batch_move_lru+0x79/0x2a0
> kernel: folio_add_lru+0x6d/0xf0
> kernel: filemap_add_folio+0xba/0xe0
> kernel: __filemap_get_folio+0x137/0x2e0
> kernel: ext4_da_write_begin+0x12c/0x270
> kernel: generic_perform_write+0xbf/0x200
> kernel: ext4_buffered_write_iter+0x67/0xf0
> kernel: ext4_file_write_iter+0x70/0x780
> kernel: vfs_write+0x301/0x420
> kernel: ksys_write+0x67/0xf0
> kernel: __x64_sys_write+0x19/0x20
> kernel: x64_sys_call+0x1689/0x20d0
> kernel: do_syscall_64+0x6b/0x110
> kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP:
> 0033:0x7fe21c314887
>
> With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock
> acquisition. We measured the lruvec spinlock start, end and hold
> time(htime) using sched_clock(), along with a BUG() if the hold time was
> more than 10s. The below case shows that lruvec spin lock was held for ~25s.
>
> kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
> 27963324369895, htime 25889317166
> kernel: ------------[ cut here ]------------
> kernel: kernel BUG at include/linux/memcontrol.h:1677!
> kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
> 6.10.0-rc3-qspindbg #10
> kernel: RIP: 0010:shrink_active_list+0x40a/0x520
>
> And the corresponding trace point for the above:
> kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
> classzone=0 order=0 nr_requested=1 nr_scanned=156946361
> nr_skipped=156946360 nr_taken=1 lru=active_file
>
> This shows that isolate_lru_folios() is scanning through a huge number
> (~150million) of folios (order=0) with lruvec spinlock held. This is
> happening because a large number of folios are being skipped to isolate
> a few ZONE_DMA folios. Though the number of folios to be scanned is
> bounded (32), there exists a genuine case where this can become
> unbounded, i.e. in case where folios are skipped.
>
> Meminfo output shows that the free memory is around ~2% and page/buffer
> cache grows very high when the lockup happens.
>
> MemTotal: 1584835956 kB
> MemFree: 27805664 kB
> MemAvailable: 1568099004 kB
> Buffers: 1386120792 kB
> Cached: 151894528 kB
> SwapCached: 30620 kB
> Active: 1043678892 kB
> Inactive: 494456452 kB
>
> Often times, the perf output at the time of the problem shows heavy
> contention on lruvec spin lock. Similar contention is also observed with
> inode i_lock (in clear_shadow_entry path)
>
> 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> |
> --98.96%--native_queued_spin_lock_slowpath
> |
> --98.96%--_raw_spin_lock_irqsave
> folio_lruvec_lock_irqsave
> |
> --98.78%--folio_batch_move_lru
> |
> --98.63%--deactivate_file_folio
> mapping_try_invalidate
> invalidate_mapping_pages
> invalidate_bdev
> blkdev_common_ioctl
> blkdev_ioctl
> __x64_sys_ioctl
> x64_sys_call
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
>
> Some experiments tried
> ======================
> 1) When MGLRU was enabled many soft lockups were observed, no hard
> lockups were seen for 48 hours run. Below is once such soft lockup.
This is not really an MGLRU issue -- can you please try one of the
attached patches? It (truncate.patch) should help with or without
MGLRU.
> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
> 6.10.0-rc3-mglru-irqstrc #24
> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: Call Trace:
> kernel: <IRQ>
> kernel: ? show_regs+0x69/0x80
> kernel: ? watchdog_timer_fn+0x223/0x2b0
> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
> <SNIP>
> kernel: </IRQ>
> kernel: <TASK>
> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: _raw_spin_lock+0x38/0x50
> kernel: clear_shadow_entry+0x3d/0x100
> kernel: ? __pfx_workingset_update_node+0x10/0x10
> kernel: mapping_try_invalidate+0x117/0x1d0
> kernel: invalidate_mapping_pages+0x10/0x20
> kernel: invalidate_bdev+0x3c/0x50
> kernel: blkdev_common_ioctl+0x5f7/0xa90
> kernel: blkdev_ioctl+0x109/0x270
> kernel: x64_sys_call+0x1215/0x20d0
> kernel: do_syscall_64+0x7e/0x130
>
> This happens to be contending on inode i_lock spinlock.
>
> Below preemptirqsoff trace points to preemption being disabled for more
> than 10s and the lock in picture is lruvec spinlock.
Also if you could try the other patch (mglru.patch) please. It should
help reduce unnecessary rotations from deactivate_file_folio(), which
in turn should reduce the contention on the LRU lock for MGLRU.
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> # --------------------------------------------------------------------
> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
> HP:0 #P:512)
> # -----------------
> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: deactivate_file_folio
> # => ended at: deactivate_file_folio
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> fio-2701523 128...1. 0us$: deactivate_file_folio
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382691us : <stack trace>
> => deactivate_file_folio
> => mapping_try_invalidate
> => invalidate_mapping_pages
> => invalidate_bdev
> => blkdev_common_ioctl
> => blkdev_ioctl
> => __x64_sys_ioctl
> => x64_sys_call
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> 2) Increased low_watermark_threshold to 10% to prevent system from
> entering into extremely low memory situation. Although hard lockups
> weren't seen, but soft lockups (clear_shadow_entry()) were still seen.
>
> 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
> socket can be further partitioned into smaller NUMA nodes. With NPS=4,
> there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
> the system. This was done to check if having more number of kswapd
> threads working on lesser number of folios per node would make a
> difference. However here too, multiple soft lockups were seen (in
> clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
>
> Any insights/suggestion into these lockups and suggestions are welcome!
Thanks!
[-- Attachment #2: mglru.patch --]
[-- Type: application/octet-stream, Size: 1202 bytes --]
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index d9a8a4affaaf..7d24d065aed8 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -182,6 +182,16 @@ static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
}
+static inline bool lru_gen_should_rotate(struct folio *folio)
+{
+ int gen = folio_lru_gen(folio);
+ int type = folio_is_file_lru(folio);
+ struct lruvec *lruvec = folio_lruvec(folio);
+ struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+ return gen != lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+}
+
static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
int old_gen, int new_gen)
{
diff --git a/mm/swap.c b/mm/swap.c
index 802681b3c857..e3dd092224ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -692,7 +692,7 @@ void deactivate_file_folio(struct folio *folio)
struct folio_batch *fbatch;
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !lru_gen_should_rotate(folio))
return;
folio_get(folio);
[-- Attachment #3: truncate.patch --]
[-- Type: application/octet-stream, Size: 3942 bytes --]
diff --git a/mm/truncate.c b/mm/truncate.c
index e99085bf3d34..545211cf6061 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -39,12 +39,24 @@ static inline void __clear_shadow_entry(struct address_space *mapping,
xas_store(&xas, NULL);
}
-static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
- void *entry)
+static void clear_shadow_entry(struct address_space *mapping,
+ struct folio_batch *fbatch, pgoff_t *indices)
{
+ int i;
+
+ if (shmem_mapping(mapping) || dax_mapping(mapping))
+ return;
+
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
- __clear_shadow_entry(mapping, index, entry);
+
+ for (i = 0; i < folio_batch_count(fbatch); i++) {
+ struct folio *folio = fbatch->folios[i];
+
+ if (xa_is_value(folio))
+ __clear_shadow_entry(mapping, indices[i], folio);
+ }
+
xa_unlock_irq(&mapping->i_pages);
if (mapping_shrinkable(mapping))
inode_add_lru(mapping->host);
@@ -105,36 +117,6 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping,
fbatch->nr = j;
}
-/*
- * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages().
- */
-static int invalidate_exceptional_entry(struct address_space *mapping,
- pgoff_t index, void *entry)
-{
- /* Handled by shmem itself, or for DAX we do nothing. */
- if (shmem_mapping(mapping) || dax_mapping(mapping))
- return 1;
- clear_shadow_entry(mapping, index, entry);
- return 1;
-}
-
-/*
- * Invalidate exceptional entry if clean. This handles exceptional entries for
- * invalidate_inode_pages2() so for DAX it evicts only clean entries.
- */
-static int invalidate_exceptional_entry2(struct address_space *mapping,
- pgoff_t index, void *entry)
-{
- /* Handled by shmem itself */
- if (shmem_mapping(mapping))
- return 1;
- if (dax_mapping(mapping))
- return dax_invalidate_mapping_entry_sync(mapping, index);
- clear_shadow_entry(mapping, index, entry);
- return 1;
-}
-
/**
* folio_invalidate - Invalidate part or all of a folio.
* @folio: The folio which is affected.
@@ -494,6 +476,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
unsigned long ret;
unsigned long count = 0;
int i;
+ bool xa_has_values = false;
folio_batch_init(&fbatch);
while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
@@ -503,8 +486,8 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
/* We rely upon deletion not changing folio->index */
if (xa_is_value(folio)) {
- count += invalidate_exceptional_entry(mapping,
- indices[i], folio);
+ xa_has_values = true;
+ count++;
continue;
}
@@ -522,6 +505,10 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
}
count += ret;
}
+
+ if (xa_has_values)
+ clear_shadow_entry(mapping, &fbatch, indices);
+
folio_batch_remove_exceptionals(&fbatch);
folio_batch_release(&fbatch);
cond_resched();
@@ -616,6 +603,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
int ret = 0;
int ret2 = 0;
int did_range_unmap = 0;
+ bool xa_has_values = false;
if (mapping_empty(mapping))
return 0;
@@ -629,8 +617,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
/* We rely upon deletion not changing folio->index */
if (xa_is_value(folio)) {
- if (!invalidate_exceptional_entry2(mapping,
- indices[i], folio))
+ xa_has_values = true;
+ if (dax_mapping(mapping) &&
+ !dax_invalidate_mapping_entry_sync(mapping, indices[i]))
ret = -EBUSY;
continue;
}
@@ -666,6 +655,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
ret = ret2;
folio_unlock(folio);
}
+
+ if (xa_has_values)
+ clear_shadow_entry(mapping, &fbatch, indices);
+
folio_batch_remove_exceptionals(&fbatch);
folio_batch_release(&fbatch);
cond_resched();
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-06 22:42 ` Yu Zhao
@ 2024-07-08 14:34 ` Bharata B Rao
2024-07-08 16:17 ` Yu Zhao
2024-07-10 12:03 ` Bharata B Rao
1 sibling, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-08 14:34 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
Hi Yu Zhao,
Thanks for your patches. See below...
On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> Hi Bharata,
>
> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>
<snip>
>>
>> Some experiments tried
>> ======================
>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>> lockups were seen for 48 hours run. Below is once such soft lockup.
>
> This is not really an MGLRU issue -- can you please try one of the
> attached patches? It (truncate.patch) should help with or without
> MGLRU.
With truncate.patch and the default LRU scheme, a few hard lockups are seen.
The first one is this:
watchdog: Watchdog detected hard LOCKUP on cpu 487
CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x81/0x300
</NMI>
<TASK>
? __pfx_folio_activate_fn+0x10/0x10
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
folio_batch_move_lru+0x9d/0x160
folio_activate+0x95/0xe0
folio_mark_accessed+0x11f/0x160
filemap_read+0x343/0x3d0
<SNIP>
blkdev_read_iter+0x6f/0x140
vfs_read+0x25b/0x340
ksys_read+0x67/0xf0
__x64_sys_read+0x19/0x20
x64_sys_call+0x1771/0x20d0
This is the next one:
watchdog: Watchdog detected hard LOCKUP on cpu 219
CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x2b4/0x300
</NMI>
<TASK>
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
__page_cache_release+0x89/0x2f0
folios_put_refs+0x92/0x230
__folio_batch_release+0x74/0x90
truncate_inode_pages_range+0x16f/0x520
truncate_pagecache+0x49/0x70
ext4_setattr+0x326/0xaa0
notify_change+0x353/0x500
do_truncate+0x83/0xe0
path_openat+0xd9e/0x1090
do_filp_open+0xaa/0x150
do_sys_openat2+0x9b/0xd0
__x64_sys_openat+0x55/0x90
x64_sys_call+0xe55/0x20d0
do_syscall_64+0x7e/0x130
entry_SYSCALL_64_after_hwframe+0x76/0x7e
When this happens, all-CPU backtrace shows a CPU being in
isolate_lru_folios().
>
>> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
>> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
>> 6.10.0-rc3-mglru-irqstrc #24
>> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: Call Trace:
>> kernel: <IRQ>
>> kernel: ? show_regs+0x69/0x80
>> kernel: ? watchdog_timer_fn+0x223/0x2b0
>> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
>> <SNIP>
>> kernel: </IRQ>
>> kernel: <TASK>
>> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: _raw_spin_lock+0x38/0x50
>> kernel: clear_shadow_entry+0x3d/0x100
>> kernel: ? __pfx_workingset_update_node+0x10/0x10
>> kernel: mapping_try_invalidate+0x117/0x1d0
>> kernel: invalidate_mapping_pages+0x10/0x20
>> kernel: invalidate_bdev+0x3c/0x50
>> kernel: blkdev_common_ioctl+0x5f7/0xa90
>> kernel: blkdev_ioctl+0x109/0x270
>> kernel: x64_sys_call+0x1215/0x20d0
>> kernel: do_syscall_64+0x7e/0x130
>>
>> This happens to be contending on inode i_lock spinlock.
>>
>> Below preemptirqsoff trace points to preemption being disabled for more
>> than 10s and the lock in picture is lruvec spinlock.
>
> Also if you could try the other patch (mglru.patch) please. It should
> help reduce unnecessary rotations from deactivate_file_folio(), which
> in turn should reduce the contention on the LRU lock for MGLRU.
Currently testing is in progress with mglru.patch and MGLRU enabled.
Will get back on the results.
Regards,
Bharata.
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-08 14:34 ` Bharata B Rao
@ 2024-07-08 16:17 ` Yu Zhao
2024-07-09 4:30 ` Bharata B Rao
0 siblings, 1 reply; 37+ messages in thread
From: Yu Zhao @ 2024-07-08 16:17 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Hi Yu Zhao,
>
> Thanks for your patches. See below...
>
> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> > Hi Bharata,
> >
> > On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
> >>
> <snip>
> >>
> >> Some experiments tried
> >> ======================
> >> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >> lockups were seen for 48 hours run. Below is once such soft lockup.
> >
> > This is not really an MGLRU issue -- can you please try one of the
> > attached patches? It (truncate.patch) should help with or without
> > MGLRU.
>
> With truncate.patch and default LRU scheme, a few hard lockups are seen.
Thanks.
In your original report, you said:
Most of the times the two contended locks are lruvec and
inode->i_lock spinlocks.
...
Often times, the perf output at the time of the problem shows
heavy contention on lruvec spin lock. Similar contention is
also observed with inode i_lock (in clear_shadow_entry path)
Based on this new report, does it mean the i_lock is not as contended
for the same path (truncation) you tested? If so, I'll post
truncate.patch with your Reported-by and Tested-by, unless you have
objections.
The two paths below were contended on the LRU lock, but they already
batch their operations. So I don't know what else we can do surgically
to improve them.
> First one is this:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 487
> CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27
> RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
> Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? watchdog_hardlockup_check+0x1b4/0x3a0
> <SNIP>
> ? native_queued_spin_lock_slowpath+0x81/0x300
> </NMI>
> <TASK>
> ? __pfx_folio_activate_fn+0x10/0x10
> _raw_spin_lock_irqsave+0x5b/0x70
> folio_lruvec_lock_irqsave+0x62/0x90
> folio_batch_move_lru+0x9d/0x160
> folio_activate+0x95/0xe0
> folio_mark_accessed+0x11f/0x160
> filemap_read+0x343/0x3d0
> <SNIP>
> blkdev_read_iter+0x6f/0x140
> vfs_read+0x25b/0x340
> ksys_read+0x67/0xf0
> __x64_sys_read+0x19/0x20
> x64_sys_call+0x1771/0x20d0
>
> This is the next one:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 219
> CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? watchdog_hardlockup_check+0x1b4/0x3a0
> <SNIP>
> ? native_queued_spin_lock_slowpath+0x2b4/0x300
> </NMI>
> <TASK>
> _raw_spin_lock_irqsave+0x5b/0x70
> folio_lruvec_lock_irqsave+0x62/0x90
> __page_cache_release+0x89/0x2f0
> folios_put_refs+0x92/0x230
> __folio_batch_release+0x74/0x90
> truncate_inode_pages_range+0x16f/0x520
> truncate_pagecache+0x49/0x70
> ext4_setattr+0x326/0xaa0
> notify_change+0x353/0x500
> do_truncate+0x83/0xe0
> path_openat+0xd9e/0x1090
> do_filp_open+0xaa/0x150
> do_sys_openat2+0x9b/0xd0
> __x64_sys_openat+0x55/0x90
> x64_sys_call+0xe55/0x20d0
> do_syscall_64+0x7e/0x130
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> When this happens, all-CPU backtrace shows a CPU being in
> isolate_lru_folios().
>
> >
> >> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
> >> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
> >> 6.10.0-rc3-mglru-irqstrc #24
> >> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> >> kernel: Call Trace:
> >> kernel: <IRQ>
> >> kernel: ? show_regs+0x69/0x80
> >> kernel: ? watchdog_timer_fn+0x223/0x2b0
> >> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
> >> <SNIP>
> >> kernel: </IRQ>
> >> kernel: <TASK>
> >> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> >> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
> >> kernel: _raw_spin_lock+0x38/0x50
> >> kernel: clear_shadow_entry+0x3d/0x100
> >> kernel: ? __pfx_workingset_update_node+0x10/0x10
> >> kernel: mapping_try_invalidate+0x117/0x1d0
> >> kernel: invalidate_mapping_pages+0x10/0x20
> >> kernel: invalidate_bdev+0x3c/0x50
> >> kernel: blkdev_common_ioctl+0x5f7/0xa90
> >> kernel: blkdev_ioctl+0x109/0x270
> >> kernel: x64_sys_call+0x1215/0x20d0
> >> kernel: do_syscall_64+0x7e/0x130
> >>
> >> This happens to be contending on inode i_lock spinlock.
> >>
> >> Below preemptirqsoff trace points to preemption being disabled for more
> >> than 10s and the lock in picture is lruvec spinlock.
> >
> > Also if you could try the other patch (mglru.patch) please. It should
> > help reduce unnecessary rotations from deactivate_file_folio(), which
> > in turn should reduce the contention on the LRU lock for MGLRU.
>
> Currently testing is in progress with mglru.patch and MGLRU enabled.
> Will get back on the results.
Thank you.
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-08 16:17 ` Yu Zhao
@ 2024-07-09 4:30 ` Bharata B Rao
2024-07-09 5:58 ` Yu Zhao
2024-07-17 9:37 ` Vlastimil Babka
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-09 4:30 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>
>> Hi Yu Zhao,
>>
>> Thanks for your patches. See below...
>>
>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>> Hi Bharata,
>>>
>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>
>> <snip>
>>>>
>>>> Some experiments tried
>>>> ======================
>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>> lockups were seen for 48 hours run. Below is once such soft lockup.
>>>
>>> This is not really an MGLRU issue -- can you please try one of the
>>> attached patches? It (truncate.patch) should help with or without
>>> MGLRU.
>>
>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
>
> Thanks.
>
> In your original report, you said:
>
> Most of the times the two contended locks are lruvec and
> inode->i_lock spinlocks.
> ...
> Often times, the perf output at the time of the problem shows
> heavy contention on lruvec spin lock. Similar contention is
> also observed with inode i_lock (in clear_shadow_entry path)
>
> Based on this new report, does it mean the i_lock is not as contended,
> for the same path (truncation) you tested? If so, I'll post
> truncate.patch and add reported-by and tested-by you, unless you have
> objections.
truncate.patch has been tested on two systems with the default LRU scheme,
and the lockup due to inode->i_lock hasn't been seen yet after a 24-hour run.
>
> The two paths below were contended on the LRU lock, but they already
> batch their operations. So I don't know what else we can do surgically
> to improve them.
What has been seen with this workload is that the lruvec spinlock is
held for a long time from the shrink_[active/inactive]_list path. In this
path, there is a case in isolate_lru_folios() where scanning of the LRU
lists can become unbounded. To isolate a page from ZONE_DMA, scanning/skipping
of more than 150 million folios was sometimes seen. There is
already a comment in there which explains why nr_skipped shouldn't be
counted, but is there any possibility of re-looking at this condition?
Regards,
Bharata.
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-09 4:30 ` Bharata B Rao
@ 2024-07-09 5:58 ` Yu Zhao
2024-07-11 5:43 ` Bharata B Rao
2024-08-13 11:04 ` Usama Arif
2024-07-17 9:37 ` Vlastimil Babka
1 sibling, 2 replies; 37+ messages in thread
From: Yu Zhao @ 2024-07-09 5:58 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> > On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
> >>
> >> Hi Yu Zhao,
> >>
> >> Thanks for your patches. See below...
> >>
> >> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>> Hi Bharata,
> >>>
> >>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>
> >> <snip>
> >>>>
> >>>> Some experiments tried
> >>>> ======================
> >>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >>>> lockups were seen for 48 hours run. Below is once such soft lockup.
> >>>
> >>> This is not really an MGLRU issue -- can you please try one of the
> >>> attached patches? It (truncate.patch) should help with or without
> >>> MGLRU.
> >>
> >> With truncate.patch and default LRU scheme, a few hard lockups are seen.
> >
> > Thanks.
> >
> > In your original report, you said:
> >
> > Most of the times the two contended locks are lruvec and
> > inode->i_lock spinlocks.
> > ...
> > Often times, the perf output at the time of the problem shows
> > heavy contention on lruvec spin lock. Similar contention is
> > also observed with inode i_lock (in clear_shadow_entry path)
> >
> > Based on this new report, does it mean the i_lock is not as contended,
> > for the same path (truncation) you tested? If so, I'll post
> > truncate.patch and add reported-by and tested-by you, unless you have
> > objections.
>
> truncate.patch has been tested on two systems with default LRU scheme
> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run.
Thanks.
> >
> > The two paths below were contended on the LRU lock, but they already
> > batch their operations. So I don't know what else we can do surgically
> > to improve them.
>
> What has been seen with this workload is that the lruvec spinlock is
> held for a long time from shrink_[active/inactive]_list path. In this
> path, there is a case in isolate_lru_folios() where scanning of LRU
> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
> scanning/skipping of more than 150 million folios were seen. There is
> already a comment in there which explains why nr_skipped shouldn't be
> counted, but is there any possibility of re-looking at this condition?
For this specific case, probably this can help:
@@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 		if (folio_zonenum(folio) > sc->reclaim_idx ||
 				skip_cma(folio, sc)) {
 			nr_skipped[folio_zonenum(folio)] += nr_pages;
-			move_to = &folios_skipped;
-			goto move;
+			list_move(&folio->lru, &folios_skipped);
+			if (spin_is_contended(&lruvec->lru_lock)) {
+				if (!list_empty(dst))
+					break;
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+			continue;
 		}
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-06 22:42 ` Yu Zhao
2024-07-08 14:34 ` Bharata B Rao
@ 2024-07-10 12:03 ` Bharata B Rao
2024-07-10 12:24 ` Mateusz Guzik
2024-07-10 18:04 ` Yu Zhao
1 sibling, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-10 12:03 UTC (permalink / raw)
To: Yu Zhao, mjguzik, david, kent.overstreet
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman,
linux-fsdevel
On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>> Some experiments tried
>> ======================
>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>> lockups were seen for 48 hours run. Below is once such soft lockup.
<snip>
>> Below preemptirqsoff trace points to preemption being disabled for more
>> than 10s and the lock in picture is lruvec spinlock.
>
> Also if you could try the other patch (mglru.patch) please. It should
> help reduce unnecessary rotations from deactivate_file_folio(), which
> in turn should reduce the contention on the LRU lock for MGLRU.
Thanks. With mglru.patch on an MGLRU-enabled system, the below latency
trace record is no longer seen in a 30-hour workload run.
>
>> # tracer: preemptirqsoff
>> #
>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
>> # --------------------------------------------------------------------
>> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
>> HP:0 #P:512)
>> # -----------------
>> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
>> # -----------------
>> # => started at: deactivate_file_folio
>> # => ended at: deactivate_file_folio
>> #
>> #
>> # _------=> CPU#
>> # / _-----=> irqs-off/BH-disabled
>> # | / _----=> need-resched
>> # || / _---=> hardirq/softirq
>> # ||| / _--=> preempt-depth
>> # |||| / _-=> migrate-disable
>> # ||||| / delay
>> # cmd pid |||||| time | caller
>> # \ / |||||| \ | /
>> fio-2701523 128...1. 0us$: deactivate_file_folio
>> <-deactivate_file_folio
>> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
>> <-deactivate_file_folio
>> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
>> <-deactivate_file_folio
>> fio-2701523 128.N.1. 10382691us : <stack trace>
>> => deactivate_file_folio
>> => mapping_try_invalidate
>> => invalidate_mapping_pages
>> => invalidate_bdev
>> => blkdev_common_ioctl
>> => blkdev_ioctl
>> => __x64_sys_ioctl
>> => x64_sys_call
>> => do_syscall_64
>> => entry_SYSCALL_64_after_hwframe
However, the contention has now shifted to inode_hash_lock. Around 55
soft lockups in ilookup() were observed:
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
# --------------------------------------------------------------------
# latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
#P:512)
# -----------------
# | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: ilookup
# => ended at: ilookup
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
fio-3244715 260.N.1. 10620440us : <stack trace>
=> _raw_spin_unlock
=> ilookup
=> blkdev_get_no_open
=> blkdev_open
=> do_dentry_open
=> vfs_open
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> __x64_sys_openat
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
It appears that scalability issues with inode_hash_lock have been brought
up multiple times in the past, and there were patches to address them.
https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
CC'ing FS folks/list for awareness/comments.
Regards,
Bharata.
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-10 12:03 ` Bharata B Rao
@ 2024-07-10 12:24 ` Mateusz Guzik
2024-07-10 13:04 ` Mateusz Guzik
2024-07-10 18:04 ` Yu Zhao
1 sibling, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-10 12:24 UTC (permalink / raw)
To: Bharata B Rao
Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka,
kinseyho, Mel Gorman, linux-fsdevel
On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >> Some experiments tried
> >> ======================
> >> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >> lockups were seen for 48 hours run. Below is once such soft lockup.
> <snip>
> >> Below preemptirqsoff trace points to preemption being disabled for more
> >> than 10s and the lock in picture is lruvec spinlock.
> >
> > Also if you could try the other patch (mglru.patch) please. It should
> > help reduce unnecessary rotations from deactivate_file_folio(), which
> > in turn should reduce the contention on the LRU lock for MGLRU.
>
> Thanks. With mglru.patch on a MGLRU-enabled system, the below latency
> trace record is no longer seen for a 30hr workload run.
>
> >
> >> # tracer: preemptirqsoff
> >> #
> >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> >> # --------------------------------------------------------------------
> >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
> >> HP:0 #P:512)
> >> # -----------------
> >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> >> # -----------------
> >> # => started at: deactivate_file_folio
> >> # => ended at: deactivate_file_folio
> >> #
> >> #
> >> # _------=> CPU#
> >> # / _-----=> irqs-off/BH-disabled
> >> # | / _----=> need-resched
> >> # || / _---=> hardirq/softirq
> >> # ||| / _--=> preempt-depth
> >> # |||| / _-=> migrate-disable
> >> # ||||| / delay
> >> # cmd pid |||||| time | caller
> >> # \ / |||||| \ | /
> >> fio-2701523 128...1. 0us$: deactivate_file_folio
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382691us : <stack trace>
> >> => deactivate_file_folio
> >> => mapping_try_invalidate
> >> => invalidate_mapping_pages
> >> => invalidate_bdev
> >> => blkdev_common_ioctl
> >> => blkdev_ioctl
> >> => __x64_sys_ioctl
> >> => x64_sys_call
> >> => do_syscall_64
> >> => entry_SYSCALL_64_after_hwframe
>
> However the contention now has shifted to inode_hash_lock. Around 55
> softlockups in ilookup() were observed:
>
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
> # --------------------------------------------------------------------
> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
> #P:512)
> # -----------------
> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: ilookup
> # => ended at: ilookup
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
> fio-3244715 260.N.1. 10620440us : <stack trace>
> => _raw_spin_unlock
> => ilookup
> => blkdev_get_no_open
> => blkdev_open
> => do_dentry_open
> => vfs_open
> => path_openat
> => do_filp_open
> => do_sys_openat2
> => __x64_sys_openat
> => x64_sys_call
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> It appears that scalability issues with inode_hash_lock has been brought
> up multiple times in the past and there were patches to address the same.
>
> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
>
> CC'ing FS folks/list for awareness/comments.
Note my patch does not enable RCU usage in ilookup, but this can be
trivially added.
I can't even compile-test at the moment, but the diff below should do
it. Also note the patches are present here
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu
, not yet integrated anywhere.
That said, if fio is operating on the same target inode every
time, then this is merely going to shift contention to the inode
spinlock usage in find_inode_fast.
diff --git a/fs/inode.c b/fs/inode.c
index ad7844ca92f9..70b0e6383341 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
+
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino, true);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode_fast(sb, head, ino, false);
+	if (IS_ERR_OR_NULL_PTR(inode)) {
+		spin_lock(&inode_hash_lock);
+		inode = find_inode_fast(sb, head, ino, true);
+		spin_unlock(&inode_hash_lock);
+	}

 	if (inode) {
 		if (IS_ERR(inode))
--
Mateusz Guzik <mjguzik gmail.com>
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-10 12:24 ` Mateusz Guzik
@ 2024-07-10 13:04 ` Mateusz Guzik
2024-07-15 5:22 ` Bharata B Rao
0 siblings, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-10 13:04 UTC (permalink / raw)
To: Bharata B Rao
Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka,
kinseyho, Mel Gorman, linux-fsdevel
On Wed, Jul 10, 2024 at 2:24 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote:
> >
> > On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> > >> Some experiments tried
> > >> ======================
> > >> 1) When MGLRU was enabled many soft lockups were observed, no hard
> > >> lockups were seen for 48 hours run. Below is once such soft lockup.
> > <snip>
> > >> Below preemptirqsoff trace points to preemption being disabled for more
> > >> than 10s and the lock in picture is lruvec spinlock.
> > >
> > > Also if you could try the other patch (mglru.patch) please. It should
> > > help reduce unnecessary rotations from deactivate_file_folio(), which
> > > in turn should reduce the contention on the LRU lock for MGLRU.
> >
> > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency
> > trace record is no longer seen for a 30hr workload run.
> >
> > >
> > >> # tracer: preemptirqsoff
> > >> #
> > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> > >> # --------------------------------------------------------------------
> > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
> > >> HP:0 #P:512)
> > >> # -----------------
> > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> > >> # -----------------
> > >> # => started at: deactivate_file_folio
> > >> # => ended at: deactivate_file_folio
> > >> #
> > >> #
> > >> # _------=> CPU#
> > >> # / _-----=> irqs-off/BH-disabled
> > >> # | / _----=> need-resched
> > >> # || / _---=> hardirq/softirq
> > >> # ||| / _--=> preempt-depth
> > >> # |||| / _-=> migrate-disable
> > >> # ||||| / delay
> > >> # cmd pid |||||| time | caller
> > >> # \ / |||||| \ | /
> > >> fio-2701523 128...1. 0us$: deactivate_file_folio
> > >> <-deactivate_file_folio
> > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
> > >> <-deactivate_file_folio
> > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
> > >> <-deactivate_file_folio
> > >> fio-2701523 128.N.1. 10382691us : <stack trace>
> > >> => deactivate_file_folio
> > >> => mapping_try_invalidate
> > >> => invalidate_mapping_pages
> > >> => invalidate_bdev
> > >> => blkdev_common_ioctl
> > >> => blkdev_ioctl
> > >> => __x64_sys_ioctl
> > >> => x64_sys_call
> > >> => do_syscall_64
> > >> => entry_SYSCALL_64_after_hwframe
> >
> > However the contention now has shifted to inode_hash_lock. Around 55
> > softlockups in ilookup() were observed:
> >
> > # tracer: preemptirqsoff
> > #
> > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
> > # --------------------------------------------------------------------
> > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
> > #P:512)
> > # -----------------
> > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
> > # -----------------
> > # => started at: ilookup
> > # => ended at: ilookup
> > #
> > #
> > # _------=> CPU#
> > # / _-----=> irqs-off/BH-disabled
> > # | / _----=> need-resched
> > # || / _---=> hardirq/softirq
> > # ||| / _--=> preempt-depth
> > # |||| / _-=> migrate-disable
> > # ||||| / delay
> > # cmd pid |||||| time | caller
> > # \ / |||||| \ | /
> > fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
> > fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
> > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
> > fio-3244715 260.N.1. 10620440us : <stack trace>
> > => _raw_spin_unlock
> > => ilookup
> > => blkdev_get_no_open
> > => blkdev_open
> > => do_dentry_open
> > => vfs_open
> > => path_openat
> > => do_filp_open
> > => do_sys_openat2
> > => __x64_sys_openat
> > => x64_sys_call
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
> >
> > It appears that scalability issues with inode_hash_lock has been brought
> > up multiple times in the past and there were patches to address the same.
> >
> > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
> > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
> >
> > CC'ing FS folks/list for awareness/comments.
>
> Note my patch does not enable RCU usage in ilookup, but this can be
> trivially added.
>
> I can't even compile-test at the moment, but the diff below should do
> it. Also note the patches are present here
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu
> , not yet integrated anywhere.
>
> That said, if fio you are operating on the same target inode every
> time then this is merely going to shift contention to the inode
> spinlock usage in find_inode_fast.
>
> diff --git a/fs/inode.c b/fs/inode.c
> index ad7844ca92f9..70b0e6383341 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb,
> unsigned long ino)
> {
> struct hlist_head *head = inode_hashtable + hash(sb, ino);
> struct inode *inode;
> +
> again:
> - spin_lock(&inode_hash_lock);
> - inode = find_inode_fast(sb, head, ino, true);
> - spin_unlock(&inode_hash_lock);
> + inode = find_inode_fast(sb, head, ino, false);
> + if (IS_ERR_OR_NULL_PTR(inode)) {
> + spin_lock(&inode_hash_lock);
> + inode = find_inode_fast(sb, head, ino, true);
> + spin_unlock(&inode_hash_lock);
> + }
>
> if (inode) {
> if (IS_ERR(inode))
>
I think I expressed myself poorly, so here is take two:
1. The inode hash soft lockup should get resolved if you apply
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060
and the above pasted fix (not compile-tested though, but it should be
obvious what the intended fix looks like).
2. find_inode_fast() spinlocks the target inode. If your bench only
operates on one, then contention is going to shift there and you may
still be getting soft lockups. Not taking the spinlock in this
codepath is hackable, but I don't want to do it without a good
justification.
--
Mateusz Guzik <mjguzik gmail.com>
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-10 12:03 ` Bharata B Rao
2024-07-10 12:24 ` Mateusz Guzik
@ 2024-07-10 18:04 ` Yu Zhao
1 sibling, 0 replies; 37+ messages in thread
From: Yu Zhao @ 2024-07-10 18:04 UTC (permalink / raw)
To: Bharata B Rao
Cc: mjguzik, david, kent.overstreet, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka,
kinseyho, Mel Gorman, linux-fsdevel
On Wed, Jul 10, 2024 at 6:04 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >> Some experiments tried
> >> ======================
> >> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >> lockups were seen for 48 hours run. Below is once such soft lockup.
> <snip>
> >> Below preemptirqsoff trace points to preemption being disabled for more
> >> than 10s and the lock in picture is lruvec spinlock.
> >
> > Also if you could try the other patch (mglru.patch) please. It should
> > help reduce unnecessary rotations from deactivate_file_folio(), which
> > in turn should reduce the contention on the LRU lock for MGLRU.
>
Thanks. With mglru.patch on an MGLRU-enabled system, the below latency
> trace record is no longer seen for a 30hr workload run.
Glad to hear. Will post a patch and add you as reported/tested-by.
> >
> >> # tracer: preemptirqsoff
> >> #
> >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> >> # --------------------------------------------------------------------
> >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
> >> HP:0 #P:512)
> >> # -----------------
> >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> >> # -----------------
> >> # => started at: deactivate_file_folio
> >> # => ended at: deactivate_file_folio
> >> #
> >> #
> >> # _------=> CPU#
> >> # / _-----=> irqs-off/BH-disabled
> >> # | / _----=> need-resched
> >> # || / _---=> hardirq/softirq
> >> # ||| / _--=> preempt-depth
> >> # |||| / _-=> migrate-disable
> >> # ||||| / delay
> >> # cmd pid |||||| time | caller
> >> # \ / |||||| \ | /
> >> fio-2701523 128...1. 0us$: deactivate_file_folio
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
> >> <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382691us : <stack trace>
> >> => deactivate_file_folio
> >> => mapping_try_invalidate
> >> => invalidate_mapping_pages
> >> => invalidate_bdev
> >> => blkdev_common_ioctl
> >> => blkdev_ioctl
> >> => __x64_sys_ioctl
> >> => x64_sys_call
> >> => do_syscall_64
> >> => entry_SYSCALL_64_after_hwframe
>
> However the contention now has shifted to inode_hash_lock. Around 55
> softlockups in ilookup() were observed:
This one is from fs/blk, so I'll leave it to those experts.
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
> # --------------------------------------------------------------------
> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
> #P:512)
> # -----------------
> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: ilookup
> # => ended at: ilookup
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
> fio-3244715 260.N.1. 10620440us : <stack trace>
> => _raw_spin_unlock
> => ilookup
> => blkdev_get_no_open
> => blkdev_open
> => do_dentry_open
> => vfs_open
> => path_openat
> => do_filp_open
> => do_sys_openat2
> => __x64_sys_openat
> => x64_sys_call
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> It appears that scalability issues with inode_hash_lock have been brought
> up multiple times in the past and there were patches to address the same.
>
> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
>
> CC'ing FS folks/list for awareness/comments.
>
> Regards,
> Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-09 5:58 ` Yu Zhao
@ 2024-07-11 5:43 ` Bharata B Rao
2024-07-15 5:19 ` Bharata B Rao
2024-08-13 11:04 ` Usama Arif
1 sibling, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-11 5:43 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
On 09-Jul-24 11:28 AM, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>
>>>> Hi Yu Zhao,
>>>>
>>>> Thanks for your patches. See below...
>>>>
>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>> Hi Bharata,
>>>>>
>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> Some experiments tried
>>>>>> ======================
>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>>>> lockups were seen for a 48-hour run. Below is one such soft lockup.
>>>>>
>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>> attached patches? It (truncate.patch) should help with or without
>>>>> MGLRU.
>>>>
>>>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
>>>
>>> Thanks.
>>>
>>> In your original report, you said:
>>>
>>> Most of the times the two contended locks are lruvec and
>>> inode->i_lock spinlocks.
>>> ...
>>> Often times, the perf output at the time of the problem shows
>>> heavy contention on lruvec spin lock. Similar contention is
>>> also observed with inode i_lock (in clear_shadow_entry path)
>>>
>>> Based on this new report, does it mean the i_lock is not as contended,
>>> for the same path (truncation) you tested? If so, I'll post
>>> truncate.patch and add reported-by and tested-by you, unless you have
>>> objections.
>>
>> truncate.patch has been tested on two systems with default LRU scheme
>> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run.
>
> Thanks.
>
>>>
>>> The two paths below were contended on the LRU lock, but they already
>>> batch their operations. So I don't know what else we can do surgically
>>> to improve them.
>>
>> What has been seen with this workload is that the lruvec spinlock is
>> held for a long time from shrink_[active/inactive]_list path. In this
>> path, there is a case in isolate_lru_folios() where scanning of LRU
>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
>> scanning/skipping of more than 150 million folios were seen. There is
>> already a comment in there which explains why nr_skipped shouldn't be
>> counted, but is there any possibility of re-looking at this condition?
>
> For this specific case, probably this can help:
>
> @@ -1659,8 +1659,15 @@ static unsigned long
> isolate_lru_folios(unsigned long nr_to_scan,
> if (folio_zonenum(folio) > sc->reclaim_idx ||
> skip_cma(folio, sc)) {
> nr_skipped[folio_zonenum(folio)] += nr_pages;
> - move_to = &folios_skipped;
> - goto move;
> + list_move(&folio->lru, &folios_skipped);
> + if (spin_is_contended(&lruvec->lru_lock)) {
> + if (!list_empty(dst))
> + break;
> + spin_unlock_irq(&lruvec->lru_lock);
> + cond_resched();
> + spin_lock_irq(&lruvec->lru_lock);
> + }
> + continue;
> }
Thanks, this helped. With this fix, the test ran for 24hrs without any
lockups attributable to lruvec spinlock. As noted in this thread,
earlier isolate_lru_folios() used to scan millions of folios and spend a
lot of time with spinlock held but after this fix, such a scenario is no
longer seen.
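For context, here is a simplified view of where that change sits in the
isolate_lru_folios() scan loop (the loop structure is paraphrased from
v6.10-rc3, not the exact code):

        while (scan < nr_to_scan && !list_empty(src)) {
                struct folio *folio = lru_to_folio(src);
                unsigned int nr_pages = folio_nr_pages(folio);

                if (folio_zonenum(folio) > sc->reclaim_idx ||
                    skip_cma(folio, sc)) {
                        nr_skipped[folio_zonenum(folio)] += nr_pages;
                        /* park the skipped folio off the scan list... */
                        list_move(&folio->lru, &folios_skipped);
                        /*
                         * ...and if another CPU is spinning on lru_lock,
                         * either return what has been isolated so far or
                         * briefly drop the lock, so a skip-heavy scan can
                         * no longer hold it for tens of seconds.
                         */
                        if (spin_is_contended(&lruvec->lru_lock)) {
                                if (!list_empty(dst))
                                        break;
                                spin_unlock_irq(&lruvec->lru_lock);
                                cond_resched();
                                spin_lock_irq(&lruvec->lru_lock);
                        }
                        continue;
                }
                /* ...otherwise try to isolate the folio onto dst as before */
        }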
However, the contention seems to have shifted to other areas; these are
the two MM-related soft and hard lockups that were observed during
this run:
Soft lockup
===========
watchdog: BUG: soft lockup - CPU#425 stuck for 12s!
CPU: 425 PID: 145707 Comm: fio Kdump: loaded Tainted: G W
6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21
RIP: 0010:handle_softirqs+0x70/0x2f0
__rmqueue_pcplist+0x4ce/0x9a0
get_page_from_freelist+0x2e1/0x1650
__alloc_pages_noprof+0x1b4/0x12c0
alloc_pages_mpol_noprof+0xdd/0x200
folio_alloc_noprof+0x67/0xe0
Hard lockup
===========
watchdog: Watchdog detected hard LOCKUP on cpu 296
CPU: 296 PID: 150155 Comm: fio Kdump: loaded Tainted: G W L
6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21
RIP: 0010:native_queued_spin_lock_slowpath+0x347/0x430
Call Trace:
<NMI>
? watchdog_hardlockup_check+0x1a2/0x370
? watchdog_overflow_callback+0x6d/0x80
<SNIP>
native_queued_spin_lock_slowpath+0x347/0x430
</NMI>
<IRQ>
_raw_spin_lock_irqsave+0x46/0x60
free_unref_page+0x19f/0x540
? __slab_free+0x2ab/0x2b0
__free_pages+0x9d/0xb0
__free_slab+0xa7/0xf0
free_slab+0x31/0x100
discard_slab+0x32/0x40
__put_partials+0xb8/0xe0
put_cpu_partial+0x5a/0x90
__slab_free+0x1d9/0x2b0
kfree+0x244/0x280
mempool_kfree+0x12/0x20
mempool_free+0x30/0x90
nvme_unmap_data+0xd0/0x150 [nvme]
nvme_pci_complete_batch+0xaf/0xd0 [nvme]
nvme_irq+0x96/0xe0 [nvme]
__handle_irq_event_percpu+0x50/0x1b0
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-11 5:43 ` Bharata B Rao
@ 2024-07-15 5:19 ` Bharata B Rao
2024-07-19 20:21 ` Yu Zhao
2024-07-25 9:59 ` zhaoyang.huang
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-15 5:19 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik
On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> On 09-Jul-24 11:28 AM, Yu Zhao wrote:
>> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
>>>
>>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>
>>>>> Hi Yu Zhao,
>>>>>
>>>>> Thanks for your patches. See below...
>>>>>
>>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>>> Hi Bharata,
>>>>>>
>>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>>>
>>>>> <snip>
>>>>>>>
>>>>>>> Some experiments tried
>>>>>>> ======================
>>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>>>>> lockups were seen for a 48-hour run. Below is one such soft lockup.
>>>>>>
>>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>>> attached patches? It (truncate.patch) should help with or without
>>>>>> MGLRU.
>>>>>
>>>>> With truncate.patch and default LRU scheme, a few hard lockups are
>>>>> seen.
>>>>
>>>> Thanks.
>>>>
>>>> In your original report, you said:
>>>>
>>>> Most of the times the two contended locks are lruvec and
>>>> inode->i_lock spinlocks.
>>>> ...
>>>> Often times, the perf output at the time of the problem shows
>>>> heavy contention on lruvec spin lock. Similar contention is
>>>> also observed with inode i_lock (in clear_shadow_entry path)
>>>>
>>>> Based on this new report, does it mean the i_lock is not as contended,
>>>> for the same path (truncation) you tested? If so, I'll post
>>>> truncate.patch and add reported-by and tested-by you, unless you have
>>>> objections.
>>>
>>> truncate.patch has been tested on two systems with default LRU scheme
>>> and the lockup due to inode->i_lock hasn't been seen yet after 24
>>> hours run.
>>
>> Thanks.
>>
>>>>
>>>> The two paths below were contended on the LRU lock, but they already
>>>> batch their operations. So I don't know what else we can do surgically
>>>> to improve them.
>>>
>>> What has been seen with this workload is that the lruvec spinlock is
>>> held for a long time from shrink_[active/inactive]_list path. In this
>>> path, there is a case in isolate_lru_folios() where scanning of LRU
>>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
>>> scanning/skipping of more than 150 million folios were seen. There is
>>> already a comment in there which explains why nr_skipped shouldn't be
>>> counted, but is there any possibility of re-looking at this condition?
>>
>> For this specific case, probably this can help:
>>
>> @@ -1659,8 +1659,15 @@ static unsigned long
>> isolate_lru_folios(unsigned long nr_to_scan,
>> if (folio_zonenum(folio) > sc->reclaim_idx ||
>> skip_cma(folio, sc)) {
>> nr_skipped[folio_zonenum(folio)] += nr_pages;
>> - move_to = &folios_skipped;
>> - goto move;
>> + list_move(&folio->lru, &folios_skipped);
>> + if (spin_is_contended(&lruvec->lru_lock)) {
>> + if (!list_empty(dst))
>> + break;
>> + spin_unlock_irq(&lruvec->lru_lock);
>> + cond_resched();
>> + spin_lock_irq(&lruvec->lru_lock);
>> + }
>> + continue;
>> }
>
> Thanks, this helped. With this fix, the test ran for 24hrs without any
> lockups attributable to lruvec spinlock. As noted in this thread,
> earlier isolate_lru_folios() used to scan millions of folios and spend a
> lot of time with spinlock held but after this fix, such a scenario is no
> longer seen.
However, during the weekend MGLRU-enabled run (with the above fix to
isolate_lru_folios(), the previous two patches truncate.patch and
mglru.patch, and the inode fix provided by Mateusz), another hard
lockup related to the lruvec spinlock was observed.
Here is the hard lockup:
watchdog: Watchdog detected hard LOCKUP on cpu 466
CPU: 466 PID: 3103929 Comm: fio Not tainted
6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x2b4/0x300
</NMI>
<IRQ>
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
folio_batch_move_lru+0x9d/0x160
folio_rotate_reclaimable+0xab/0xf0
folio_end_writeback+0x60/0x90
end_buffer_async_write+0xaa/0xe0
end_bio_bh_io_sync+0x2c/0x50
bio_endio+0x108/0x180
blk_mq_end_request_batch+0x11f/0x5e0
nvme_pci_complete_batch+0xb5/0xd0 [nvme]
nvme_irq+0x92/0xe0 [nvme]
__handle_irq_event_percpu+0x6e/0x1e0
handle_irq_event+0x39/0x80
handle_edge_irq+0x8c/0x240
__common_interrupt+0x4e/0xf0
common_interrupt+0x49/0xc0
asm_common_interrupt+0x27/0x40
Here is the lock holder details captured by all-cpu-backtrace:
NMI backtrace for cpu 75
CPU: 75 PID: 3095650 Comm: fio Not tainted
6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:folio_inc_gen+0x142/0x430
Call Trace:
<NMI>
? show_regs+0x69/0x80
? nmi_cpu_backtrace+0xc5/0x130
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x64/0x180
? default_do_nmi+0x45/0x130
? exc_nmi+0x128/0x1a0
? end_repeat_nmi+0xf/0x53
? folio_inc_gen+0x142/0x430
? folio_inc_gen+0x142/0x430
? folio_inc_gen+0x142/0x430
</NMI>
<TASK>
isolate_folios+0x954/0x1630
evict_folios+0xa5/0x8c0
try_to_shrink_lruvec+0x1be/0x320
shrink_one+0x10f/0x1d0
shrink_node+0xa4c/0xc90
do_try_to_free_pages+0xc0/0x590
try_to_free_pages+0xde/0x210
__alloc_pages_noprof+0x6ae/0x12c0
alloc_pages_mpol_noprof+0xd9/0x220
folio_alloc_noprof+0x63/0xe0
filemap_alloc_folio_noprof+0xf4/0x100
page_cache_ra_unbounded+0xb9/0x1a0
page_cache_ra_order+0x26e/0x310
ondemand_readahead+0x1a3/0x360
page_cache_sync_ra+0x83/0x90
filemap_get_pages+0xf0/0x6a0
filemap_read+0xe7/0x3d0
blkdev_read_iter+0x6f/0x140
vfs_read+0x25b/0x340
ksys_read+0x67/0xf0
__x64_sys_read+0x19/0x20
x64_sys_call+0x1771/0x20d0
do_syscall_64+0x7e/0x130
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-10 13:04 ` Mateusz Guzik
@ 2024-07-15 5:22 ` Bharata B Rao
2024-07-15 6:48 ` Mateusz Guzik
0 siblings, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-15 5:22 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka,
kinseyho, Mel Gorman, linux-fsdevel
On 10-Jul-24 6:34 PM, Mateusz Guzik wrote:
>>> However the contention now has shifted to inode_hash_lock. Around 55
>>> softlockups in ilookup() were observed:
>>>
>>> # tracer: preemptirqsoff
>>> #
>>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
>>> # --------------------------------------------------------------------
>>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
>>> #P:512)
>>> # -----------------
>>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
>>> # -----------------
>>> # => started at: ilookup
>>> # => ended at: ilookup
>>> #
>>> #
>>> # _------=> CPU#
>>> # / _-----=> irqs-off/BH-disabled
>>> # | / _----=> need-resched
>>> # || / _---=> hardirq/softirq
>>> # ||| / _--=> preempt-depth
>>> # |||| / _-=> migrate-disable
>>> # ||||| / delay
>>> # cmd pid |||||| time | caller
>>> # \ / |||||| \ | /
>>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
>>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
>>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
>>> fio-3244715 260.N.1. 10620440us : <stack trace>
>>> => _raw_spin_unlock
>>> => ilookup
>>> => blkdev_get_no_open
>>> => blkdev_open
>>> => do_dentry_open
>>> => vfs_open
>>> => path_openat
>>> => do_filp_open
>>> => do_sys_openat2
>>> => __x64_sys_openat
>>> => x64_sys_call
>>> => do_syscall_64
>>> => entry_SYSCALL_64_after_hwframe
>>>
>>> It appears that scalability issues with inode_hash_lock have been brought
>>> up multiple times in the past and there were patches to address the same.
>>>
>>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
>>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
>>>
>>> CC'ing FS folks/list for awareness/comments.
>>
>> Note my patch does not enable RCU usage in ilookup, but this can be
>> trivially added.
>>
>> I can't even compile-test at the moment, but the diff below should do
>> it. Also note the patches are present here
>> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu
>> , not yet integrated anywhere.
>>
>> That said, if fio is operating on the same target inode every
>> time then this is merely going to shift contention to the inode
>> spinlock usage in find_inode_fast.
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ad7844ca92f9..70b0e6383341 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb,
>> unsigned long ino)
>> {
>> struct hlist_head *head = inode_hashtable + hash(sb, ino);
>> struct inode *inode;
>> +
>> again:
>> - spin_lock(&inode_hash_lock);
>> - inode = find_inode_fast(sb, head, ino, true);
>> - spin_unlock(&inode_hash_lock);
>> + inode = find_inode_fast(sb, head, ino, false);
>> + if (IS_ERR_OR_NULL(inode)) {
>> + spin_lock(&inode_hash_lock);
>> + inode = find_inode_fast(sb, head, ino, true);
>> + spin_unlock(&inode_hash_lock);
>> + }
>>
>> if (inode) {
>> if (IS_ERR(inode))
>>
>
> I think I expressed myself poorly, so here is take two:
> 1. inode hash soft lockup should get resolved if you apply
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060
> and the above pasted fix (not compile tested tho, but it should be
> obvious what the intended fix looks like)
> 2. find_inode_fast spinlocks the target inode. if your bench only
> operates on one, then contention is going to shift there and you may
> still be getting soft lockups. not taking the spinlock in this
> codepath is hackable, but I don't want to do it without a good
> justification.
Thanks Mateusz for the fix. With this patch applied, the above mentioned
contention in ilookup() has not been observed for a test run during the
weekend.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-15 5:22 ` Bharata B Rao
@ 2024-07-15 6:48 ` Mateusz Guzik
0 siblings, 0 replies; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-15 6:48 UTC (permalink / raw)
To: Bharata B Rao
Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka,
kinseyho, Mel Gorman, linux-fsdevel
On Mon, Jul 15, 2024 at 7:22 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 10-Jul-24 6:34 PM, Mateusz Guzik wrote:
> >>> However the contention now has shifted to inode_hash_lock. Around 55
> >>> softlockups in ilookup() were observed:
> >>>
> >>> # tracer: preemptirqsoff
> >>> #
> >>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
> >>> # --------------------------------------------------------------------
> >>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0
> >>> #P:512)
> >>> # -----------------
> >>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
> >>> # -----------------
> >>> # => started at: ilookup
> >>> # => ended at: ilookup
> >>> #
> >>> #
> >>> # _------=> CPU#
> >>> # / _-----=> irqs-off/BH-disabled
> >>> # | / _----=> need-resched
> >>> # || / _---=> hardirq/softirq
> >>> # ||| / _--=> preempt-depth
> >>> # |||| / _-=> migrate-disable
> >>> # ||||| / delay
> >>> # cmd pid |||||| time | caller
> >>> # \ / |||||| \ | /
> >>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
> >>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
> >>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
> >>> fio-3244715 260.N.1. 10620440us : <stack trace>
> >>> => _raw_spin_unlock
> >>> => ilookup
> >>> => blkdev_get_no_open
> >>> => blkdev_open
> >>> => do_dentry_open
> >>> => vfs_open
> >>> => path_openat
> >>> => do_filp_open
> >>> => do_sys_openat2
> >>> => __x64_sys_openat
> >>> => x64_sys_call
> >>> => do_syscall_64
> >>> => entry_SYSCALL_64_after_hwframe
> >>>
> >>> It appears that scalability issues with inode_hash_lock have been brought
> >>> up multiple times in the past and there were patches to address the same.
> >>>
> >>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
> >>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
> >>>
> >>> CC'ing FS folks/list for awareness/comments.
> >>
> >> Note my patch does not enable RCU usage in ilookup, but this can be
> >> trivially added.
> >>
> >> I can't even compile-test at the moment, but the diff below should do
> >> it. Also note the patches are present here
> >> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu
> >> , not yet integrated anywhere.
> >>
> >> That said, if fio is operating on the same target inode every
> >> time then this is merely going to shift contention to the inode
> >> spinlock usage in find_inode_fast.
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index ad7844ca92f9..70b0e6383341 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb,
> >> unsigned long ino)
> >> {
> >> struct hlist_head *head = inode_hashtable + hash(sb, ino);
> >> struct inode *inode;
> >> +
> >> again:
> >> - spin_lock(&inode_hash_lock);
> >> - inode = find_inode_fast(sb, head, ino, true);
> >> - spin_unlock(&inode_hash_lock);
> >> + inode = find_inode_fast(sb, head, ino, false);
> >> + if (IS_ERR_OR_NULL(inode)) {
> >> + spin_lock(&inode_hash_lock);
> >> + inode = find_inode_fast(sb, head, ino, true);
> >> + spin_unlock(&inode_hash_lock);
> >> + }
> >>
> >> if (inode) {
> >> if (IS_ERR(inode))
> >>
> >
> > I think I expressed myself poorly, so here is take two:
> > 1. inode hash soft lockup should get resolved if you apply
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060
> > and the above pasted fix (not compile tested tho, but it should be
> > obvious what the intended fix looks like)
> > 2. find_inode_fast spinlocks the target inode. if your bench only
> > operates on one, then contention is going to shift there and you may
> > still be getting soft lockups. not taking the spinlock in this
> > codepath is hackable, but I don't want to do it without a good
> > justification.
>
> Thanks Mateusz for the fix. With this patch applied, the above mentioned
> contention in ilookup() has not been observed for a test run during the
> weekend.
>
Ok, I'll do some clean ups and send a proper patch to the vfs folks later today.
Thanks for testing.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-09 4:30 ` Bharata B Rao
2024-07-09 5:58 ` Yu Zhao
@ 2024-07-17 9:37 ` Vlastimil Babka
2024-07-17 10:50 ` Bharata B Rao
1 sibling, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2024-07-17 9:37 UTC (permalink / raw)
To: Bharata B Rao, Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik
On 7/9/24 6:30 AM, Bharata B Rao wrote:
> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>>
>>> Hi Yu Zhao,
>>>
>>> Thanks for your patches. See below...
>>>
>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>> Hi Bharata,
>>>>
>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>
>>> <snip>
>>>>>
>>>>> Some experiments tried
>>>>> ======================
>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>>> lockups were seen for a 48-hour run. Below is one such soft lockup.
>>>>
>>>> This is not really an MGLRU issue -- can you please try one of the
>>>> attached patches? It (truncate.patch) should help with or without
>>>> MGLRU.
>>>
>>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
>>
>> Thanks.
>>
>> In your original report, you said:
>>
>> Most of the times the two contended locks are lruvec and
>> inode->i_lock spinlocks.
>> ...
>> Often times, the perf output at the time of the problem shows
>> heavy contention on lruvec spin lock. Similar contention is
>> also observed with inode i_lock (in clear_shadow_entry path)
>>
>> Based on this new report, does it mean the i_lock is not as contended,
>> for the same path (truncation) you tested? If so, I'll post
>> truncate.patch and add reported-by and tested-by you, unless you have
>> objections.
>
> truncate.patch has been tested on two systems with default LRU scheme
> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run.
>
>>
>> The two paths below were contended on the LRU lock, but they already
>> batch their operations. So I don't know what else we can do surgically
>> to improve them.
>
> What has been seen with this workload is that the lruvec spinlock is
> held for a long time from shrink_[active/inactive]_list path. In this
> path, there is a case in isolate_lru_folios() where scanning of LRU
> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
> scanning/skipping of more than 150 million folios were seen. There is
It seems weird to me to see anything that would require ZONE_DMA allocation
on a modern system. Do you know where it comes from?
> already a comment in there which explains why nr_skipped shouldn't be
> counted, but is there any possibility of re-looking at this condition?
>
> Regards,
> Bharata.
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
@ 2024-07-17 9:42 ` Vlastimil Babka
2024-07-17 10:31 ` Bharata B Rao
` (2 more replies)
1 sibling, 3 replies; 37+ messages in thread
From: Vlastimil Babka @ 2024-07-17 9:42 UTC (permalink / raw)
To: Bharata B Rao, linux-mm
Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman,
Mateusz Guzik
On 7/3/24 5:11 PM, Bharata B Rao wrote:
> Many soft and hard lockups are seen with upstream kernel when running a
> bunch of tests that include FIO and LTP filesystem test on 10 NVME
> disks. The lockups can appear anywhere between 2 to 48 hours. Originally
> this was reported on a large customer VM instance with passthrough NVME
> disks on older kernels(v5.4 based). However, similar problems were
> reproduced when running the tests on bare metal with latest upstream
> kernel (v6.10-rc3). Other lockups with different signatures are seen but
> in this report, only those related to MM area are being discussed.
> Also note that the subsequent description is related to the lockups in
> bare metal upstream (and not VM).
>
> The general observation is that the problem usually surfaces when the
> system free memory goes very low and page cache/buffer consumption hits
> the ceiling. Most of the times the two contended locks are lruvec and
> inode->i_lock spinlocks.
>
> - Could this be a scalability issue in LRU list handling and/or page
> cache invalidation typical to a large system configuration?
Seems to me it could be (except that ZONE_DMA corner case) a general
scalability issue in that you tweak some part of the kernel and the
contention moves elsewhere. At least in MM we have per-node locks so this
means 256 CPUs per lock? It used to be that there were not that many
(cores/threads) per a physical CPU and its NUMA node, so many cpus would
mean also more NUMA nodes where the locks contention would distribute among
them. I think you could try fakenuma to create these nodes artificially and
see if it helps for the MM part. But if the contention moves to e.g. an
inode lock, I'm not sure what to do about that then.
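For rough scale with the topology reported here:

  512 CPUs / 2 nodes = ~256 CPUs per per-node lruvec lock
  512 CPUs / 8 nodes = ~64 CPUs per lock (NPS=4 or an equivalent fakenuma split)

so more (real or fake) nodes spread the contention but do not remove it.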
> - Are there any MM/FS tunables that could help here?
>
> Hardware configuration
> ======================
> Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
> Memory: 1.5 TB
> 10 NVME - 3.5TB each
> available: 2 nodes (0-1)
> node 0 cpus: 0-127,256-383
> node 0 size: 773727 MB
> node 1 cpus: 128-255,384-511
> node 1 size: 773966 MB
>
> Workload details
> ================
> Workload includes concurrent runs of FIO and a few FS tests from LTP.
>
> FIO is run with a size of 1TB on each NVME partition with different
> combinations of ioengine/blocksize/mode parameters and buffered-IO.
> Selected FS tests from LTP are run on 256GB partitions of all NVME
> disks. This is the typical NVME partition layout.
>
> nvme2n1 259:4 0 3.5T 0 disk
> ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
> └─nvme2n1p2 259:7 0 3.2T 0 part
>
> Though many different runs exist in the workload, the combination that
> results in the problem is buffered-IO run with sync engine.
>
> fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
> -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
> -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
>
> Watchdog threshold was reduced to 5s to reproduce the problem early and
> all CPU backtrace enabled.
>
> Problem details and analysis
> ============================
> One of the hard lockups which was observed and analyzed in detail is this:
>
> kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
> kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: Call Trace:
> kernel: <NMI>
> kernel: ? show_regs+0x69/0x80
> kernel: ? watchdog_hardlockup_check+0x19e/0x360
> <SNIP>
> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: </NMI>
> kernel: <TASK>
> kernel: ? __pfx_lru_add_fn+0x10/0x10
> kernel: _raw_spin_lock_irqsave+0x42/0x50
> kernel: folio_lruvec_lock_irqsave+0x62/0xb0
> kernel: folio_batch_move_lru+0x79/0x2a0
> kernel: folio_add_lru+0x6d/0xf0
> kernel: filemap_add_folio+0xba/0xe0
> kernel: __filemap_get_folio+0x137/0x2e0
> kernel: ext4_da_write_begin+0x12c/0x270
> kernel: generic_perform_write+0xbf/0x200
> kernel: ext4_buffered_write_iter+0x67/0xf0
> kernel: ext4_file_write_iter+0x70/0x780
> kernel: vfs_write+0x301/0x420
> kernel: ksys_write+0x67/0xf0
> kernel: __x64_sys_write+0x19/0x20
> kernel: x64_sys_call+0x1689/0x20d0
> kernel: do_syscall_64+0x6b/0x110
> kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP:
> 0033:0x7fe21c314887
>
> With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock
> acquisition. We measured the lruvec spinlock start, end and hold
> time(htime) using sched_clock(), along with a BUG() if the hold time was
> more than 10s. The below case shows that lruvec spin lock was held for ~25s.
>
> kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
> 27963324369895, htime 25889317166
> kernel: ------------[ cut here ]------------
> kernel: kernel BUG at include/linux/memcontrol.h:1677!
> kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
> 6.10.0-rc3-qspindbg #10
> kernel: RIP: 0010:shrink_active_list+0x40a/0x520
>
> And the corresponding trace point for the above:
> kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
> classzone=0 order=0 nr_requested=1 nr_scanned=156946361
> nr_skipped=156946360 nr_taken=1 lru=active_file
>
> This shows that isolate_lru_folios() is scanning through a huge number
> (~150million) of folios (order=0) with lruvec spinlock held. This is
> happening because a large number of folios are being skipped to isolate
> a few ZONE_DMA folios. Though the number of folios to be scanned is
> bounded (32), there exists a genuine case where this can become
> unbounded, i.e. in case where folios are skipped.
>
> Meminfo output shows that the free memory is around ~2% and page/buffer
> cache grows very high when the lockup happens.
>
> MemTotal: 1584835956 kB
> MemFree: 27805664 kB
> MemAvailable: 1568099004 kB
> Buffers: 1386120792 kB
> Cached: 151894528 kB
> SwapCached: 30620 kB
> Active: 1043678892 kB
> Inactive: 494456452 kB
>
> Often times, the perf output at the time of the problem shows heavy
> contention on lruvec spin lock. Similar contention is also observed with
> inode i_lock (in clear_shadow_entry path)
>
> 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> |
> --98.96%--native_queued_spin_lock_slowpath
> |
> --98.96%--_raw_spin_lock_irqsave
> folio_lruvec_lock_irqsave
> |
> --98.78%--folio_batch_move_lru
> |
> --98.63%--deactivate_file_folio
> mapping_try_invalidate
> invalidate_mapping_pages
> invalidate_bdev
> blkdev_common_ioctl
> blkdev_ioctl
> __x64_sys_ioctl
> x64_sys_call
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
>
> Some experiments tried
> ======================
> 1) When MGLRU was enabled many soft lockups were observed, no hard
> lockups were seen for a 48-hour run. Below is one such soft lockup.
>
> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
> 6.10.0-rc3-mglru-irqstrc #24
> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: Call Trace:
> kernel: <IRQ>
> kernel: ? show_regs+0x69/0x80
> kernel: ? watchdog_timer_fn+0x223/0x2b0
> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
> <SNIP>
> kernel: </IRQ>
> kernel: <TASK>
> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
> kernel: _raw_spin_lock+0x38/0x50
> kernel: clear_shadow_entry+0x3d/0x100
> kernel: ? __pfx_workingset_update_node+0x10/0x10
> kernel: mapping_try_invalidate+0x117/0x1d0
> kernel: invalidate_mapping_pages+0x10/0x20
> kernel: invalidate_bdev+0x3c/0x50
> kernel: blkdev_common_ioctl+0x5f7/0xa90
> kernel: blkdev_ioctl+0x109/0x270
> kernel: x64_sys_call+0x1215/0x20d0
> kernel: do_syscall_64+0x7e/0x130
>
> This happens to be contending on inode i_lock spinlock.
>
> Below preemptirqsoff trace points to preemption being disabled for more
> than 10s and the lock in picture is lruvec spinlock.
>
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> # --------------------------------------------------------------------
> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
> HP:0 #P:512)
> # -----------------
> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: deactivate_file_folio
> # => ended at: deactivate_file_folio
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> fio-2701523 128...1. 0us$: deactivate_file_folio
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382681us : deactivate_file_folio
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382683us : tracer_preempt_on
> <-deactivate_file_folio
> fio-2701523 128.N.1. 10382691us : <stack trace>
> => deactivate_file_folio
> => mapping_try_invalidate
> => invalidate_mapping_pages
> => invalidate_bdev
> => blkdev_common_ioctl
> => blkdev_ioctl
> => __x64_sys_ioctl
> => x64_sys_call
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> 2) Increased low_watermark_threshold to 10% to prevent system from
> entering into extremely low memory situation. Although hard lockups
> weren't seen, but soft lockups (clear_shadow_entry()) were still seen.
>
> 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
> socket can be further partitioned into smaller NUMA nodes. With NPS=4,
> there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
> the system. This was done to check if having more number of kswapd
> threads working on lesser number of folios per node would make a
> difference. However here too, multiple soft lockups were seen (in
> clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
>
> Any insights/suggestion into these lockups and suggestions are welcome!
>
> Regards,
> Bharata.
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 9:42 ` Vlastimil Babka
@ 2024-07-17 10:31 ` Bharata B Rao
2024-07-17 16:44 ` Karim Manaouil
2024-07-17 11:29 ` Mateusz Guzik
2024-07-17 16:34 ` Karim Manaouil
2 siblings, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-17 10:31 UTC (permalink / raw)
To: Vlastimil Babka, linux-mm
Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman,
Mateusz Guzik
On 17-Jul-24 3:12 PM, Vlastimil Babka wrote:
> On 7/3/24 5:11 PM, Bharata B Rao wrote:
>> Many soft and hard lockups are seen with upstream kernel when running a
>> bunch of tests that include FIO and LTP filesystem test on 10 NVME
>> disks. The lockups can appear anywhere between 2 to 48 hours. Originally
>> this was reported on a large customer VM instance with passthrough NVME
>> disks on older kernels(v5.4 based). However, similar problems were
>> reproduced when running the tests on bare metal with latest upstream
>> kernel (v6.10-rc3). Other lockups with different signatures are seen but
>> in this report, only those related to MM area are being discussed.
>> Also note that the subsequent description is related to the lockups in
>> bare metal upstream (and not VM).
>>
>> The general observation is that the problem usually surfaces when the
>> system free memory goes very low and page cache/buffer consumption hits
>> the ceiling. Most of the times the two contended locks are lruvec and
>> inode->i_lock spinlocks.
>>
>> - Could this be a scalability issue in LRU list handling and/or page
>> cache invalidation typical to a large system configuration?
>
> Seems to me it could be (except that ZONE_DMA corner case) a general
> scalability issue in that you tweak some part of the kernel and the
> contention moves elsewhere. At least in MM we have per-node locks so this
> means 256 CPUs per lock? It used to be that there were not that many
> (cores/threads) per a physical CPU and its NUMA node, so many cpus would
> mean also more NUMA nodes where the locks contention would distribute among
> them. I think you could try fakenuma to create these nodes artificially and
> see if it helps for the MM part. But if the contention moves to e.g. an
> inode lock, I'm not sure what to do about that then.
See below...
>
<SNIP>
>>
>> 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
>> socket can be further partitioned into smaller NUMA nodes. With NPS=4,
>> there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
>> the system. This was done to check if having more number of kswapd
>> threads working on lesser number of folios per node would make a
>> difference. However here too, multiple soft lockups were seen (in
>> clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
These are some softlockups seen with NPS4 mode.
watchdog: BUG: soft lockup - CPU#315 stuck for 11s! [kworker/315:1H:5153]
CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted
6.10.0-rc3-enbprftw #12
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:handle_softirqs+0x70/0x2f0
Call Trace:
<IRQ>
__irq_exit_rcu+0x68/0x90
irq_exit_rcu+0x12/0x20
sysvec_apic_timer_interrupt+0x85/0xb0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1f/0x30
RIP: 0010:iommu_dma_map_page+0xca/0x2c0
dma_map_page_attrs+0x20d/0x2a0
nvme_prep_rq.part.0+0x63d/0x940 [nvme]
nvme_queue_rq+0x82/0x210 [nvme]
blk_mq_dispatch_rq_list+0x289/0x6d0
__blk_mq_sched_dispatch_requests+0x142/0x5f0
blk_mq_sched_dispatch_requests+0x36/0x70
blk_mq_run_work_fn+0x73/0x90
process_one_work+0x185/0x3d0
worker_thread+0x2ce/0x3e0
kthread+0xe5/0x120
ret_from_fork+0x3d/0x60
ret_from_fork_asm+0x1a/0x30
watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820]
CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L
6.10.0-rc3-enbprftw #12
RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300
Call Trace:
<IRQ>
</IRQ>
<TASK>
_raw_spin_lock+0x2d/0x40
clear_shadow_entry+0x3d/0x100
mapping_try_invalidate+0x11b/0x1e0
invalidate_mapping_pages+0x14/0x20
invalidate_bdev+0x40/0x50
blkdev_common_ioctl+0x5f7/0xa90
blkdev_ioctl+0x10d/0x270
__x64_sys_ioctl+0x99/0xd0
x64_sys_call+0x1219/0x20d0
do_syscall_64+0x51/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fc92fc3ec6b
</TASK>
The above one (clear_shadow_entry) has since been fixed by Yu Zhao and the
fix is in the mm tree.
We had seen a couple of scenarios with zone lock contention from page
free and slab free code paths, as reported here:
https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/
Would you have any insights on these?
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 9:37 ` Vlastimil Babka
@ 2024-07-17 10:50 ` Bharata B Rao
2024-07-17 11:15 ` Hillf Danton
0 siblings, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-17 10:50 UTC (permalink / raw)
To: Vlastimil Babka, Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik
On 17-Jul-24 3:07 PM, Vlastimil Babka wrote:
> On 7/9/24 6:30 AM, Bharata B Rao wrote:
>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>
>>>> Hi Yu Zhao,
>>>>
>>>> Thanks for your patches. See below...
>>>>
>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>> Hi Bharata,
>>>>>
>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> Some experiments tried
>>>>>> ======================
>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>>>> lockups were seen for a 48-hour run. Below is one such soft lockup.
>>>>>
>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>> attached patches? It (truncate.patch) should help with or without
>>>>> MGLRU.
>>>>
>>>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
>>>
>>> Thanks.
>>>
>>> In your original report, you said:
>>>
>>> Most of the times the two contended locks are lruvec and
>>> inode->i_lock spinlocks.
>>> ...
>>> Often times, the perf output at the time of the problem shows
>>> heavy contention on lruvec spin lock. Similar contention is
>>> also observed with inode i_lock (in clear_shadow_entry path)
>>>
>>> Based on this new report, does it mean the i_lock is not as contended,
>>> for the same path (truncation) you tested? If so, I'll post
>>> truncate.patch and add reported-by and tested-by you, unless you have
>>> objections.
>>
>> truncate.patch has been tested on two systems with default LRU scheme
>> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run.
>>
>>>
>>> The two paths below were contended on the LRU lock, but they already
>>> batch their operations. So I don't know what else we can do surgically
>>> to improve them.
>>
>> What has been seen with this workload is that the lruvec spinlock is
>> held for a long time from shrink_[active/inactive]_list path. In this
>> path, there is a case in isolate_lru_folios() where scanning of LRU
>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
>> scanning/skipping of more than 150 million folios were seen. There is
>
> It seems weird to me to see anything that would require ZONE_DMA allocation
> on a modern system. Do you know where it comes from?
We measured the lruvec spinlock start, end and hold
time(htime) using sched_clock(), along with a BUG() if the hold time was
more than 10s. The below case shows that lruvec spin lock was held for ~25s.
vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
27963324369895, htime 25889317166 (time in ns)
kernel BUG at include/linux/memcontrol.h:1677!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
6.10.0-rc3-qspindbg #10
RIP: 0010:shrink_active_list+0x40a/0x520
Call Trace:
<TASK>
shrink_lruvec+0x981/0x13b0
shrink_node+0x358/0xd30
balance_pgdat+0x3a3/0xa60
kswapd+0x207/0x3a0
kthread+0xe1/0x120
ret_from_fork+0x39/0x60
ret_from_fork_asm+0x1a/0x30
</TASK>
As you can see the call stack is from kswapd but not sure what is the
exact trigger.
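For reference, the instrumentation was along these lines (a sketch with
made-up names; the real debug change wraps the lruvec lock/unlock helpers
and assumes a debug-only lock_stime field added to struct lruvec):

static inline void lruvec_lock_irq_timed(struct lruvec *lruvec)
{
        spin_lock_irq(&lruvec->lru_lock);
        lruvec->lock_stime = sched_clock();
}

static inline void lruvec_unlock_irq_timed(struct lruvec *lruvec)
{
        u64 stime = lruvec->lock_stime;
        u64 etime = sched_clock();
        u64 htime = etime - stime;

        spin_unlock_irq(&lruvec->lru_lock);
        if (htime > 10ULL * NSEC_PER_SEC) {
                pr_err("unlock_page_lruvec_irq: stime %llu, etime %llu, htime %llu\n",
                       stime, etime, htime);
                BUG();
        }
}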
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 10:50 ` Bharata B Rao
@ 2024-07-17 11:15 ` Hillf Danton
2024-07-18 9:02 ` Bharata B Rao
0 siblings, 1 reply; 37+ messages in thread
From: Hillf Danton @ 2024-07-17 11:15 UTC (permalink / raw)
To: Bharata B Rao
Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy,
Mel Gorman
On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com>
> On 17-Jul-24 3:07 PM, Vlastimil Babka wrote:
> >
> > It seems weird to me to see anything that would require ZONE_DMA allocation
> > on a modern system. Do you know where it comes from?
>
> We measured the lruvec spinlock start, end and hold
> time(htime) using sched_clock(), along with a BUG() if the hold time was
> more than 10s. The below case shows that lruvec spin lock was held for ~25s.
>
What is more unusual could be observed perhaps with your hardware config but
with 386MiB RAM assigned to each node, the so called tight memory but not
extremely tight.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 9:42 ` Vlastimil Babka
2024-07-17 10:31 ` Bharata B Rao
@ 2024-07-17 11:29 ` Mateusz Guzik
2024-07-18 9:00 ` Bharata B Rao
2024-07-17 16:34 ` Karim Manaouil
2 siblings, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-17 11:29 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman
On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/3/24 5:11 PM, Bharata B Rao wrote:
> > The general observation is that the problem usually surfaces when the
> > system free memory goes very low and page cache/buffer consumption hits
> > the ceiling. Most of the times the two contended locks are lruvec and
> > inode->i_lock spinlocks.
> >
[snip mm stuff]
There are numerous avoidable i_lock acquires (including some only
showing up under load), but I don't know if they play any role in this
particular test.
Collecting all traces would definitely help, locked up or not, for example:
bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count();
}' -o traces
As for clear_shadow_entry mentioned in the opening mail, the content is:
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
__clear_shadow_entry(mapping, index, entry);
xa_unlock_irq(&mapping->i_pages);
if (mapping_shrinkable(mapping))
inode_add_lru(mapping->host);
spin_unlock(&mapping->host->i_lock);
so for all I know it's all about the xarray thing, not the i_lock per se.
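One way to tell those two locks apart in the lockup signatures (assuming
the usual spinlock wrappers):

        spin_lock(&mapping->host->i_lock);      /* shows up as _raw_spin_lock     */
        xa_lock_irq(&mapping->i_pages);         /* shows up as _raw_spin_lock_irq */
        __clear_shadow_entry(mapping, index, entry);
        xa_unlock_irq(&mapping->i_pages);
        if (mapping_shrinkable(mapping))
                inode_add_lru(mapping->host);
        spin_unlock(&mapping->host->i_lock);

i.e. a backtrace from clear_shadow_entry ending in _raw_spin_lock points
at the i_lock, while xarray lock contention would end in
_raw_spin_lock_irq.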
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 9:42 ` Vlastimil Babka
2024-07-17 10:31 ` Bharata B Rao
2024-07-17 11:29 ` Mateusz Guzik
@ 2024-07-17 16:34 ` Karim Manaouil
2 siblings, 0 replies; 37+ messages in thread
From: Karim Manaouil @ 2024-07-17 16:34 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman, Mateusz Guzik
On Wed, Jul 17, 2024 at 11:42:31AM +0200, Vlastimil Babka wrote:
> Seems to me it could be (except that ZONE_DMA corner case) a general
> scalability issue in that you tweak some part of the kernel and the
> contention moves elsewhere. At least in MM we have per-node locks so this
> means 256 CPUs per lock? It used to be that there were not that many
> (cores/threads) per a physical CPU and its NUMA node, so many cpus would
> mean also more NUMA nodes where the locks contention would distribute among
> them. I think you could try fakenuma to create these nodes artificially and
> see if it helps for the MM part. But if the contention moves to e.g. an
> inode lock, I'm not sure what to do about that then.
AMD EPYC BIOSes have an option called NPS (Nodes Per Socket) that can be
set to 1, 2, 4 or 8 and that divides the system up into the chosen number
of NUMA nodes.
Karim
PhD Student
Edinburgh University
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 10:31 ` Bharata B Rao
@ 2024-07-17 16:44 ` Karim Manaouil
0 siblings, 0 replies; 37+ messages in thread
From: Karim Manaouil @ 2024-07-17 16:44 UTC (permalink / raw)
To: Bharata B Rao
Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman, Mateusz Guzik
On Wed, Jul 17, 2024 at 04:01:05PM +0530, Bharata B Rao wrote:
> On 17-Jul-24 3:12 PM, Vlastimil Babka wrote:
> > On 7/3/24 5:11 PM, Bharata B Rao wrote:
> > > Many soft and hard lockups are seen with upstream kernel when running a
> > > bunch of tests that include FIO and LTP filesystem test on 10 NVME
> > > disks. The lockups can appear anywhere between 2 to 48 hours. Originally
> > > this was reported on a large customer VM instance with passthrough NVME
> > > disks on older kernels(v5.4 based). However, similar problems were
> > > reproduced when running the tests on bare metal with latest upstream
> > > kernel (v6.10-rc3). Other lockups with different signatures are seen but
> > > in this report, only those related to MM area are being discussed.
> > > Also note that the subsequent description is related to the lockups in
> > > bare metal upstream (and not VM).
> > >
> > > The general observation is that the problem usually surfaces when the
> > > system free memory goes very low and page cache/buffer consumption hits
> > > the ceiling. Most of the times the two contended locks are lruvec and
> > > inode->i_lock spinlocks.
> > >
> > > - Could this be a scalability issue in LRU list handling and/or page
> > > cache invalidation typical to a large system configuration?
> >
> > Seems to me it could be (except that ZONE_DMA corner case) a general
> > scalability issue in that you tweak some part of the kernel and the
> > contention moves elsewhere. At least in MM we have per-node locks so this
> > means 256 CPUs per lock? It used to be that there were not that many
> > (cores/threads) per a physical CPU and its NUMA node, so many cpus would
> > mean also more NUMA nodes where the locks contention would distribute among
> > them. I think you could try fakenuma to create these nodes artificially and
> > see if it helps for the MM part. But if the contention moves to e.g. an
> > inode lock, I'm not sure what to do about that then.
>
> See below...
>
> >
> <SNIP>
> > >
> > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
> > > socket can be further partitioned into smaller NUMA nodes. With NPS=4,
> > > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
> > > the system. This was done to check if having more number of kswapd
> > > threads working on lesser number of folios per node would make a
> > > difference. However here too, multiple soft lockups were seen (in
> > > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
>
> These are some softlockups seen with NPS4 mode.
>
> watchdog: BUG: soft lockup - CPU#315 stuck for 11s! [kworker/315:1H:5153]
> CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted
> 6.10.0-rc3-enbprftw #12
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:handle_softirqs+0x70/0x2f0
> Call Trace:
> <IRQ>
> __irq_exit_rcu+0x68/0x90
> irq_exit_rcu+0x12/0x20
> sysvec_apic_timer_interrupt+0x85/0xb0
> </IRQ>
> <TASK>
> asm_sysvec_apic_timer_interrupt+0x1f/0x30
> RIP: 0010:iommu_dma_map_page+0xca/0x2c0
> dma_map_page_attrs+0x20d/0x2a0
> nvme_prep_rq.part.0+0x63d/0x940 [nvme]
> nvme_queue_rq+0x82/0x210 [nvme]
> blk_mq_dispatch_rq_list+0x289/0x6d0
> __blk_mq_sched_dispatch_requests+0x142/0x5f0
> blk_mq_sched_dispatch_requests+0x36/0x70
> blk_mq_run_work_fn+0x73/0x90
> process_one_work+0x185/0x3d0
> worker_thread+0x2ce/0x3e0
> kthread+0xe5/0x120
> ret_from_fork+0x3d/0x60
> ret_from_fork_asm+0x1a/0x30
>
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820]
> CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L
> 6.10.0-rc3-enbprftw #12
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300
> Call Trace:
> <IRQ>
> </IRQ>
> <TASK>
> _raw_spin_lock+0x2d/0x40
> clear_shadow_entry+0x3d/0x100
> mapping_try_invalidate+0x11b/0x1e0
> invalidate_mapping_pages+0x14/0x20
> invalidate_bdev+0x40/0x50
> blkdev_common_ioctl+0x5f7/0xa90
> blkdev_ioctl+0x10d/0x270
> __x64_sys_ioctl+0x99/0xd0
> x64_sys_call+0x1219/0x20d0
> do_syscall_64+0x51/0x120
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fc92fc3ec6b
> </TASK>
>
> The above one (clear_shadow_entry) has since been fixed by Yu Zhao and the
> fix is in the mm tree.
>
> We had seen a couple of scenarios with zone lock contention from page free
> and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/
>
> Would you have any insights on these?
Have you tried enabling memory interleaving policy for your workload?
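(For reference: numactl --interleave=all around the fio run is the usual
way; programmatically it would look roughly like the sketch below, where
the helper name and the two-node mask are assumptions based on the
reported topology.)

#include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE; link with -lnuma */

static int enable_interleave(void)
{
        /* interleave this task's future allocations across nodes 0 and 1 */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        return set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask));
}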
Karim
PhD Student
Edinburgh University
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 11:29 ` Mateusz Guzik
@ 2024-07-18 9:00 ` Bharata B Rao
2024-07-18 12:11 ` Mateusz Guzik
0 siblings, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2024-07-18 9:00 UTC (permalink / raw)
To: Mateusz Guzik, Vlastimil Babka
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman
[-- Attachment #1: Type: text/plain, Size: 3538 bytes --]
On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 7/3/24 5:11 PM, Bharata B Rao wrote:
>>> The general observation is that the problem usually surfaces when the
>>> system free memory goes very low and page cache/buffer consumption hits
>>> the ceiling. Most of the times the two contended locks are lruvec and
>>> inode->i_lock spinlocks.
>>>
> [snip mm stuff]
>
> There are numerous avoidable i_lock acquires (including some only
> showing up under load), but I don't know if they play any role in this
> particular test.
>
> Collecting all traces would definitely help, locked up or not, for example:
> bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count();
> }' -o traces
Here are the top 3 traces; the full list, collected over a 30s duration
while the workload was running, is attached.
@[
native_queued_spin_lock_slowpath+1
__remove_mapping+98
remove_mapping+22
mapping_evict_folio+118
mapping_try_invalidate+214
invalidate_mapping_pages+16
invalidate_bdev+60
blkdev_common_ioctl+1527
blkdev_ioctl+265
__x64_sys_ioctl+149
x64_sys_call+4629
do_syscall_64+126
entry_SYSCALL_64_after_hwframe+118
]: 1787212
@[
native_queued_spin_lock_slowpath+1
folio_wait_bit_common+205
filemap_get_pages+1543
filemap_read+231
blkdev_read_iter+111
aio_read+242
io_submit_one+546
__x64_sys_io_submit+132
x64_sys_call+6617
do_syscall_64+126
entry_SYSCALL_64_after_hwframe+118
]: 7922497
@[
native_queued_spin_lock_slowpath+1
clear_shadow_entry+92
mapping_try_invalidate+337
invalidate_mapping_pages+16
invalidate_bdev+60
blkdev_common_ioctl+1527
blkdev_ioctl+265
__x64_sys_ioctl+149
x64_sys_call+4629
do_syscall_64+126
entry_SYSCALL_64_after_hwframe+118
]: 10357614
>
> As for clear_shadow_entry mentioned in the opening mail, the content is:
> spin_lock(&mapping->host->i_lock);
> xa_lock_irq(&mapping->i_pages);
> __clear_shadow_entry(mapping, index, entry);
> xa_unlock_irq(&mapping->i_pages);
> if (mapping_shrinkable(mapping))
> inode_add_lru(mapping->host);
> spin_unlock(&mapping->host->i_lock);
>
> so for all I know it's all about the xarray thing, not the i_lock per se.
The soft lockup signature shows _raw_spin_lock and not _raw_spin_lock_irq,
and hence I concluded it to be i_lock (see the note after the trace below).
Re-pasting the clear_shadow_entry soft lockup here again:
kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <IRQ>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_timer_fn+0x223/0x2b0
kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel: </IRQ>
kernel: <TASK>
kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: _raw_spin_lock+0x38/0x50
kernel: clear_shadow_entry+0x3d/0x100
kernel: ? __pfx_workingset_update_node+0x10/0x10
kernel: mapping_try_invalidate+0x117/0x1d0
kernel: invalidate_mapping_pages+0x10/0x20
kernel: invalidate_bdev+0x3c/0x50
kernel: blkdev_common_ioctl+0x5f7/0xa90
kernel: blkdev_ioctl+0x109/0x270
kernel: x64_sys_call+0x1215/0x20d0
kernel: do_syscall_64+0x7e/0x130
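For reference, the two acquires in clear_shadow_entry() land in different
slowpath entry points (a sketch, assuming the usual spinlock wrappers):

        spin_lock(&mapping->host->i_lock);   /* appears as _raw_spin_lock     */
        xa_lock_irq(&mapping->i_pages);      /* appears as _raw_spin_lock_irq */

which is why the _raw_spin_lock frame in the trace above was read as the
i_lock acquire.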
Regards,
Bharata.
[-- Attachment #2: traces.gz --]
[-- Type: application/x-gzip, Size: 83505 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-17 11:15 ` Hillf Danton
@ 2024-07-18 9:02 ` Bharata B Rao
0 siblings, 0 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-18 9:02 UTC (permalink / raw)
To: Hillf Danton
Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy,
Mel Gorman, Dadhania, Nikunj, Upadhyay, Neeraj
On 17-Jul-24 4:45 PM, Hillf Danton wrote:
> On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com>
>> On 17-Jul-24 3:07 PM, Vlastimil Babka wrote:
>>>
>>> It seems weird to me to see anything that would require ZONE_DMA allocation
>>> on a modern system. Do you know where it comes from?
>>
>> We measured the lruvec spinlock start, end and hold
>> time(htime) using sched_clock(), along with a BUG() if the hold time was
>> more than 10s. The below case shows that lruvec spin lock was held for ~25s.
>>
> What is more unusual could be observed perhaps with your hardware config but
> with 386MiB RAM assigned to each node, the so called tight memory but not
> extremely tight.
Hardware config is this:
Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB
But I don't quite follow what you are hinting at; can you please
rephrase or be more verbose?
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-18 9:00 ` Bharata B Rao
@ 2024-07-18 12:11 ` Mateusz Guzik
2024-07-19 6:16 ` Bharata B Rao
0 siblings, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-18 12:11 UTC (permalink / raw)
To: Bharata B Rao
Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman
On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
> > As for clear_shadow_entry mentioned in the opening mail, the content is:
> > spin_lock(&mapping->host->i_lock);
> > xa_lock_irq(&mapping->i_pages);
> > __clear_shadow_entry(mapping, index, entry);
> > xa_unlock_irq(&mapping->i_pages);
> > if (mapping_shrinkable(mapping))
> > inode_add_lru(mapping->host);
> > spin_unlock(&mapping->host->i_lock);
> >
> > so for all I know it's all about the xarray thing, not the i_lock per se.
>
> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
> and hence concluded it to be i_lock.
I'm not disputing it was i_lock. I am claiming that the i_pages is
taken immediately after and it may be that in your workload this is
the thing with the actual contention problem, making i_lock a red
herring.
I tried to match up offsets to my own kernel binary, but things went haywire.
Can you please resolve a bunch of symbols, like this:
./scripts/faddr2line vmlinux clear_shadow_entry+92
and then paste the source code from reported lines? (I presume you are
running with some local patches, so opening relevant files in my repo
may still give bogus results)
Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332
Most notably in __remove_mapping i_lock is conditional:
if (!folio_test_swapcache(folio))
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
and the disasm of the offset in my case does not match either acquire.
For all I know i_lock in this routine is *not* taken and all the
queued up __remove_mapping callers increase i_lock -> i_pages wait
times in clear_shadow_entry.
To my cursory reading i_lock in clear_shadow_entry can be hacked away
with some effort, but should this happen the contention is going to
shift to i_pages presumably with more soft lockups (except on that
lock). I am not convinced messing with it is justified. From looking
at other places the i_lock is not a problem in other spots fwiw.
All that said even if it is i_lock in both cases *and* someone whacks
it, the mm folk should look into what happens when (maybe i_lock ->)
i_pages lock is held. To that end perhaps you could provide a
flamegraph or output of perf record -a -g, I don't know what's
preferred.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-18 12:11 ` Mateusz Guzik
@ 2024-07-19 6:16 ` Bharata B Rao
2024-07-19 7:06 ` Yu Zhao
2024-07-19 14:26 ` Mateusz Guzik
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-19 6:16 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman
[-- Attachment #1: Type: text/plain, Size: 4821 bytes --]
On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
>>> As for clear_shadow_entry mentioned in the opening mail, the content is:
>>> spin_lock(&mapping->host->i_lock);
>>> xa_lock_irq(&mapping->i_pages);
>>> __clear_shadow_entry(mapping, index, entry);
>>> xa_unlock_irq(&mapping->i_pages);
>>> if (mapping_shrinkable(mapping))
>>> inode_add_lru(mapping->host);
>>> spin_unlock(&mapping->host->i_lock);
>>>
>>> so for all I know it's all about the xarray thing, not the i_lock per se.
>>
>> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
>> and hence concluded it to be i_lock.
>
> I'm not disputing it was i_lock. I am claiming that the i_pages is
> taken immediately after and it may be that in your workload this is
> the thing with the actual contention problem, making i_lock a red
> herring.
>
> I tried to match up offsets to my own kernel binary, but things went haywire.
>
> Can you please resolve a bunch of symbols, like this:
> ./scripts/faddr2line vmlinux clear_shadow_entry+92
>
> and then paste the source code from reported lines? (I presume you are
> running with some local patches, so opening relevant files in my repo
> may still give bogus resutls)
>
> Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332
clear_shadow_entry+92
$ ./scripts/faddr2line vmlinux clear_shadow_entry+92
clear_shadow_entry+92/0x180:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) clear_shadow_entry at mm/truncate.c:51
42 static void clear_shadow_entry(struct address_space *mapping,
43 struct folio_batch *fbatch, pgoff_t
*indices)
44 {
45 int i;
46
47 if (shmem_mapping(mapping) || dax_mapping(mapping))
48 return;
49
50 spin_lock(&mapping->host->i_lock);
51 xa_lock_irq(&mapping->i_pages);
__remove_mapping+98
$ ./scripts/faddr2line vmlinux __remove_mapping+98
__remove_mapping+98/0x230:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) __remove_mapping at mm/vmscan.c:695
684 static int __remove_mapping(struct address_space *mapping, struct
folio *folio,
685 bool reclaimed, struct mem_cgroup
*target_memcg)
686 {
687 int refcount;
688 void *shadow = NULL;
689
690 BUG_ON(!folio_test_locked(folio));
691 BUG_ON(mapping != folio_mapping(folio));
692
693 if (!folio_test_swapcache(folio))
694 spin_lock(&mapping->host->i_lock);
695 xa_lock_irq(&mapping->i_pages);
__filemap_add_folio+332
$ ./scripts/faddr2line vmlinux __filemap_add_folio+332
__filemap_add_folio+332/0x480:
spin_lock_irq at include/linux/spinlock.h:377
(inlined by) __filemap_add_folio at mm/filemap.c:878
851 noinline int __filemap_add_folio(struct address_space *mapping,
852 struct folio *folio, pgoff_t index, gfp_t gfp, void
**shadowp)
853 {
854 XA_STATE(xas, &mapping->i_pages, index);
...
874 for (;;) {
875 int order = -1, split_order = 0;
876 void *entry, *old = NULL;
877
878 xas_lock_irq(&xas);
879 xas_for_each_conflict(&xas, entry) {
>
> Most notably in __remove_mapping i_lock is conditional:
> if (!folio_test_swapcache(folio))
> spin_lock(&mapping->host->i_lock);
> xa_lock_irq(&mapping->i_pages);
>
> and the disasm of the offset in my case does not match either acquire.
> For all I know i_lock in this routine is *not* taken and all the
> queued up __remove_mapping callers increase i_lock -> i_pages wait
> times in clear_shadow_entry.
So the first two are on i_pages lock and the last one is xa_lock.
>
> To my cursory reading i_lock in clear_shadow_entry can be hacked away
> with some effort, but should this happen the contention is going to
> shift to i_pages presumably with more soft lockups (except on that
> lock). I am not convinced messing with it is justified. From looking
> at other places the i_lock is not a problem in other spots fwiw.
>
> All that said even if it is i_lock in both cases *and* someone whacks
> it, the mm folk should look into what happens when (maybe i_lock ->)
> i_pages lock is held. To that end perhaps you could provide a
> flamegraph or output of perf record -a -g, I don't know what's
> preferred.
I have attached the flamegraph, but this is for the kernel that has been
running with all the accumulated fixes so far. The original one (w/o
fixes) did show considerable time spent in
native_queued_spin_lock_slowpath, but unfortunately I am unable to locate
it now.
Regards,
Bharata.
[-- Attachment #2: perf 1.svg --]
[-- Type: image/svg+xml, Size: 1215900 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-19 6:16 ` Bharata B Rao
@ 2024-07-19 7:06 ` Yu Zhao
2024-07-19 14:26 ` Mateusz Guzik
1 sibling, 0 replies; 37+ messages in thread
From: Yu Zhao @ 2024-07-19 7:06 UTC (permalink / raw)
To: Bharata B Rao
Cc: Mateusz Guzik, Vlastimil Babka, linux-mm, linux-kernel, nikunj,
Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy,
kinseyho, Mel Gorman
On Fri, Jul 19, 2024 at 12:16 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote:
> >>
> >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
> >>> As for clear_shadow_entry mentioned in the opening mail, the content is:
> >>> spin_lock(&mapping->host->i_lock);
> >>> xa_lock_irq(&mapping->i_pages);
> >>> __clear_shadow_entry(mapping, index, entry);
> >>> xa_unlock_irq(&mapping->i_pages);
> >>> if (mapping_shrinkable(mapping))
> >>> inode_add_lru(mapping->host);
> >>> spin_unlock(&mapping->host->i_lock);
> >>>
> >>> so for all I know it's all about the xarray thing, not the i_lock per se.
> >>
> >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
> >> and hence concluded it to be i_lock.
> >
> > I'm not disputing it was i_lock. I am claiming that the i_pages is
> > taken immediately after and it may be that in your workload this is
> > the thing with the actual contention problem, making i_lock a red
> > herring.
> >
> > I tried to match up offsets to my own kernel binary, but things went haywire.
> >
> > Can you please resolve a bunch of symbols, like this:
> > ./scripts/faddr2line vmlinux clear_shadow_entry+92
> >
> > and then paste the source code from reported lines? (I presume you are
> > running with some local patches, so opening relevant files in my repo
> > may still give bogus resutls)
> >
> > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332
>
> clear_shadow_entry+92
>
> $ ./scripts/faddr2line vmlinux clear_shadow_entry+92
> clear_shadow_entry+92/0x180:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) clear_shadow_entry at mm/truncate.c:51
>
> 42 static void clear_shadow_entry(struct address_space *mapping,
> 43 struct folio_batch *fbatch, pgoff_t
> *indices)
> 44 {
> 45 int i;
> 46
> 47 if (shmem_mapping(mapping) || dax_mapping(mapping))
> 48 return;
> 49
> 50 spin_lock(&mapping->host->i_lock);
> 51 xa_lock_irq(&mapping->i_pages);
>
>
> __remove_mapping+98
>
> $ ./scripts/faddr2line vmlinux __remove_mapping+98
> __remove_mapping+98/0x230:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) __remove_mapping at mm/vmscan.c:695
>
> 684 static int __remove_mapping(struct address_space *mapping, struct
> folio *folio,
> 685 bool reclaimed, struct mem_cgroup
> *target_memcg)
> 686 {
> 687 int refcount;
> 688 void *shadow = NULL;
> 689
> 690 BUG_ON(!folio_test_locked(folio));
> 691 BUG_ON(mapping != folio_mapping(folio));
> 692
> 693 if (!folio_test_swapcache(folio))
> 694 spin_lock(&mapping->host->i_lock);
> 695 xa_lock_irq(&mapping->i_pages);
>
>
> __filemap_add_folio+332
>
> $ ./scripts/faddr2line vmlinux __filemap_add_folio+332
> __filemap_add_folio+332/0x480:
> spin_lock_irq at include/linux/spinlock.h:377
> (inlined by) __filemap_add_folio at mm/filemap.c:878
>
> 851 noinline int __filemap_add_folio(struct address_space *mapping,
> 852 struct folio *folio, pgoff_t index, gfp_t gfp, void
> **shadowp)
> 853 {
> 854 XA_STATE(xas, &mapping->i_pages, index);
> ...
> 874 for (;;) {
> 875 int order = -1, split_order = 0;
> 876 void *entry, *old = NULL;
> 877
> 878 xas_lock_irq(&xas);
> 879 xas_for_each_conflict(&xas, entry) {
>
> >
> > Most notably in __remove_mapping i_lock is conditional:
> > if (!folio_test_swapcache(folio))
> > spin_lock(&mapping->host->i_lock);
> > xa_lock_irq(&mapping->i_pages);
> >
> > and the disasm of the offset in my case does not match either acquire.
> > For all I know i_lock in this routine is *not* taken and all the
> > queued up __remove_mapping callers increase i_lock -> i_pages wait
> > times in clear_shadow_entry.
>
> So the first two are on i_pages lock and the last one is xa_lock.
Isn't xa_lock also i_pages->xa_lock, i.e., the same lock?
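A condensed sketch of the relevant definitions (paraphrased from mainline
include/linux/fs.h and include/linux/xarray.h, not verbatim):

struct address_space {
        struct inode    *host;
        struct xarray   i_pages;        /* cached folios */
        /* ... */
};

/* "the i_pages lock" and "the xa_lock" name the same spinlock */
#define xa_lock_irq(xa)         spin_lock_irq(&(xa)->xa_lock)
#define xas_lock_irq(xas)       xa_lock_irq((xas)->xa)

So clear_shadow_entry(), __remove_mapping() and __filemap_add_folio() all
end up serializing on mapping->i_pages.xa_lock.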
> > To my cursory reading i_lock in clear_shadow_entry can be hacked away
> > with some effort, but should this happen the contention is going to
> > shift to i_pages presumably with more soft lockups (except on that
> > lock). I am not convinced messing with it is justified. From looking
> > at other places the i_lock is not a problem in other spots fwiw.
> >
> > All that said even if it is i_lock in both cases *and* someone whacks
> > it, the mm folk should look into what happens when (maybe i_lock ->)
> > i_pages lock is held. To that end perhaps you could provide a
> > flamegraph or output of perf record -a -g, I don't know what's
> > preferred.
>
> I have attached the flamegraph but this is for the kernel that has been
> running with all the accumulated fixes so far. The original one (w/o
> fixes) did show considerable time spent on
> native_queued_spin_lock_slowpath but unfortunately unable to locate it now.
>
> Regards,
> Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-19 6:16 ` Bharata B Rao
2024-07-19 7:06 ` Yu Zhao
@ 2024-07-19 14:26 ` Mateusz Guzik
1 sibling, 0 replies; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-19 14:26 UTC (permalink / raw)
To: Bharata B Rao
Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho,
Mel Gorman
On Fri, Jul 19, 2024 at 8:16 AM Bharata B Rao <bharata@amd.com> wrote:
>
> On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote:
> >>
> >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
> >>> As for clear_shadow_entry mentioned in the opening mail, the content is:
> >>> spin_lock(&mapping->host->i_lock);
> >>> xa_lock_irq(&mapping->i_pages);
> >>> __clear_shadow_entry(mapping, index, entry);
> >>> xa_unlock_irq(&mapping->i_pages);
> >>> if (mapping_shrinkable(mapping))
> >>> inode_add_lru(mapping->host);
> >>> spin_unlock(&mapping->host->i_lock);
> >>>
> >>> so for all I know it's all about the xarray thing, not the i_lock per se.
> >>
> >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
> >> and hence concluded it to be i_lock.
> >
> > I'm not disputing it was i_lock. I am claiming that the i_pages is
> > taken immediately after and it may be that in your workload this is
> > the thing with the actual contention problem, making i_lock a red
> > herring.
> >
> > I tried to match up offsets to my own kernel binary, but things went haywire.
> >
> > Can you please resolve a bunch of symbols, like this:
> > ./scripts/faddr2line vmlinux clear_shadow_entry+92
> >
> > and then paste the source code from reported lines? (I presume you are
> > running with some local patches, so opening relevant files in my repo
> > may still give bogus resutls)
> >
> > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332
>
> clear_shadow_entry+92
>
> $ ./scripts/faddr2line vmlinux clear_shadow_entry+92
> clear_shadow_entry+92/0x180:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) clear_shadow_entry at mm/truncate.c:51
>
> 42 static void clear_shadow_entry(struct address_space *mapping,
> 43 struct folio_batch *fbatch, pgoff_t
> *indices)
> 44 {
> 45 int i;
> 46
> 47 if (shmem_mapping(mapping) || dax_mapping(mapping))
> 48 return;
> 49
> 50 spin_lock(&mapping->host->i_lock);
> 51 xa_lock_irq(&mapping->i_pages);
>
>
> __remove_mapping+98
>
> $ ./scripts/faddr2line vmlinux __remove_mapping+98
> __remove_mapping+98/0x230:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) __remove_mapping at mm/vmscan.c:695
>
> 684 static int __remove_mapping(struct address_space *mapping, struct
> folio *folio,
> 685 bool reclaimed, struct mem_cgroup
> *target_memcg)
> 686 {
> 687 int refcount;
> 688 void *shadow = NULL;
> 689
> 690 BUG_ON(!folio_test_locked(folio));
> 691 BUG_ON(mapping != folio_mapping(folio));
> 692
> 693 if (!folio_test_swapcache(folio))
> 694 spin_lock(&mapping->host->i_lock);
> 695 xa_lock_irq(&mapping->i_pages);
>
>
> __filemap_add_folio+332
>
> $ ./scripts/faddr2line vmlinux __filemap_add_folio+332
> __filemap_add_folio+332/0x480:
> spin_lock_irq at include/linux/spinlock.h:377
> (inlined by) __filemap_add_folio at mm/filemap.c:878
>
> 851 noinline int __filemap_add_folio(struct address_space *mapping,
> 852 struct folio *folio, pgoff_t index, gfp_t gfp, void
> **shadowp)
> 853 {
> 854 XA_STATE(xas, &mapping->i_pages, index);
> ...
> 874 for (;;) {
> 875 int order = -1, split_order = 0;
> 876 void *entry, *old = NULL;
> 877
> 878 xas_lock_irq(&xas);
> 879 xas_for_each_conflict(&xas, entry) {
>
> >
> > Most notably in __remove_mapping i_lock is conditional:
> > if (!folio_test_swapcache(folio))
> > spin_lock(&mapping->host->i_lock);
> > xa_lock_irq(&mapping->i_pages);
> >
> > and the disasm of the offset in my case does not match either acquire.
> > For all I know i_lock in this routine is *not* taken and all the
> > queued up __remove_mapping callers increase i_lock -> i_pages wait
> > times in clear_shadow_entry.
>
> So the first two are on i_pages lock and the last one is xa_lock.
>
bottom line though messing with i_lock removal is not justified afaics
> >
> > To my cursory reading i_lock in clear_shadow_entry can be hacked away
> > with some effort, but should this happen the contention is going to
> > shift to i_pages presumably with more soft lockups (except on that
> > lock). I am not convinced messing with it is justified. From looking
> > at other places the i_lock is not a problem in other spots fwiw.
> >
> > All that said even if it is i_lock in both cases *and* someone whacks
> > it, the mm folk should look into what happens when (maybe i_lock ->)
> > i_pages lock is held. To that end perhaps you could provide a
> > flamegraph or output of perf record -a -g, I don't know what's
> > preferred.
>
> I have attached the flamegraph but this is for the kernel that has been
> running with all the accumulated fixes so far. The original one (w/o
> fixes) did show considerable time spent on
> native_queued_spin_lock_slowpath but unfortunately unable to locate it now.
>
So I think the problems at this point are all mm, so I'm kicking the
ball to that side.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-15 5:19 ` Bharata B Rao
@ 2024-07-19 20:21 ` Yu Zhao
2024-07-20 7:57 ` Mateusz Guzik
2024-07-22 4:12 ` Bharata B Rao
2024-07-25 9:59 ` zhaoyang.huang
1 sibling, 2 replies; 37+ messages in thread
From: Yu Zhao @ 2024-07-19 20:21 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik
On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> > On 09-Jul-24 11:28 AM, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
> >>>
> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>>
> >>>>> Hi Yu Zhao,
> >>>>>
> >>>>> Thanks for your patches. See below...
> >>>>>
> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>>>>> Hi Bharata,
> >>>>>>
> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>>>>
> >>>>> <snip>
> >>>>>>>
> >>>>>>> Some experiments tried
> >>>>>>> ======================
> >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup.
> >>>>>>
> >>>>>> This is not really an MGLRU issue -- can you please try one of the
> >>>>>> attached patches? It (truncate.patch) should help with or without
> >>>>>> MGLRU.
> >>>>>
> >>>>> With truncate.patch and default LRU scheme, a few hard lockups are
> >>>>> seen.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> In your original report, you said:
> >>>>
> >>>> Most of the times the two contended locks are lruvec and
> >>>> inode->i_lock spinlocks.
> >>>> ...
> >>>> Often times, the perf output at the time of the problem shows
> >>>> heavy contention on lruvec spin lock. Similar contention is
> >>>> also observed with inode i_lock (in clear_shadow_entry path)
> >>>>
> >>>> Based on this new report, does it mean the i_lock is not as contended,
> >>>> for the same path (truncation) you tested? If so, I'll post
> >>>> truncate.patch and add reported-by and tested-by you, unless you have
> >>>> objections.
> >>>
> >>> truncate.patch has been tested on two systems with default LRU scheme
> >>> and the lockup due to inode->i_lock hasn't been seen yet after 24
> >>> hours run.
> >>
> >> Thanks.
> >>
> >>>>
> >>>> The two paths below were contended on the LRU lock, but they already
> >>>> batch their operations. So I don't know what else we can do surgically
> >>>> to improve them.
> >>>
> >>> What has been seen with this workload is that the lruvec spinlock is
> >>> held for a long time from shrink_[active/inactive]_list path. In this
> >>> path, there is a case in isolate_lru_folios() where scanning of LRU
> >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
> >>> scanning/skipping of more than 150 million folios were seen. There is
> >>> already a comment in there which explains why nr_skipped shouldn't be
> >>> counted, but is there any possibility of re-looking at this condition?
> >>
> >> For this specific case, probably this can help:
> >>
> >> @@ -1659,8 +1659,15 @@ static unsigned long
> >> isolate_lru_folios(unsigned long nr_to_scan,
> >> if (folio_zonenum(folio) > sc->reclaim_idx ||
> >> skip_cma(folio, sc)) {
> >> nr_skipped[folio_zonenum(folio)] += nr_pages;
> >> - move_to = &folios_skipped;
> >> - goto move;
> >> + list_move(&folio->lru, &folios_skipped);
> >> + if (spin_is_contended(&lruvec->lru_lock)) {
> >> + if (!list_empty(dst))
> >> + break;
> >> + spin_unlock_irq(&lruvec->lru_lock);
> >> + cond_resched();
> >> + spin_lock_irq(&lruvec->lru_lock);
> >> + }
> >> + continue;
> >> }
> >
> > Thanks, this helped. With this fix, the test ran for 24hrs without any
> > lockups attributable to lruvec spinlock. As noted in this thread,
> > earlier isolate_lru_folios() used to scan millions of folios and spend a
> > lot of time with spinlock held but after this fix, such a scenario is no
> > longer seen.
>
> However during the weekend mglru-enabled run (with above fix to
> isolate_lru_folios() and also the previous two patches: truncate.patch
> and mglru.patch and the inode fix provided by Mateusz), another hard
> lockup related to lruvec spinlock was observed.
Thanks again for the stress tests.
I can't come up with any reasonable band-aid at this moment, i.e.,
something not too ugly to work around a more fundamental scalability
problem.
Before I give up: what type of dirty data was written back to the nvme
device? Was it page cache or swap?
> Here is the hardlock up:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 466
> CPU: 466 PID: 3103929 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? watchdog_hardlockup_check+0x1b4/0x3a0
> <SNIP>
> ? native_queued_spin_lock_slowpath+0x2b4/0x300
> </NMI>
> <IRQ>
> _raw_spin_lock_irqsave+0x5b/0x70
> folio_lruvec_lock_irqsave+0x62/0x90
> folio_batch_move_lru+0x9d/0x160
> folio_rotate_reclaimable+0xab/0xf0
> folio_end_writeback+0x60/0x90
> end_buffer_async_write+0xaa/0xe0
> end_bio_bh_io_sync+0x2c/0x50
> bio_endio+0x108/0x180
> blk_mq_end_request_batch+0x11f/0x5e0
> nvme_pci_complete_batch+0xb5/0xd0 [nvme]
> nvme_irq+0x92/0xe0 [nvme]
> __handle_irq_event_percpu+0x6e/0x1e0
> handle_irq_event+0x39/0x80
> handle_edge_irq+0x8c/0x240
> __common_interrupt+0x4e/0xf0
> common_interrupt+0x49/0xc0
> asm_common_interrupt+0x27/0x40
>
> Here is the lock holder details captured by all-cpu-backtrace:
>
> NMI backtrace for cpu 75
> CPU: 75 PID: 3095650 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:folio_inc_gen+0x142/0x430
> Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? nmi_cpu_backtrace+0xc5/0x130
> ? nmi_cpu_backtrace_handler+0x11/0x20
> ? nmi_handle+0x64/0x180
> ? default_do_nmi+0x45/0x130
> ? exc_nmi+0x128/0x1a0
> ? end_repeat_nmi+0xf/0x53
> ? folio_inc_gen+0x142/0x430
> ? folio_inc_gen+0x142/0x430
> ? folio_inc_gen+0x142/0x430
> </NMI>
> <TASK>
> isolate_folios+0x954/0x1630
> evict_folios+0xa5/0x8c0
> try_to_shrink_lruvec+0x1be/0x320
> shrink_one+0x10f/0x1d0
> shrink_node+0xa4c/0xc90
> do_try_to_free_pages+0xc0/0x590
> try_to_free_pages+0xde/0x210
> __alloc_pages_noprof+0x6ae/0x12c0
> alloc_pages_mpol_noprof+0xd9/0x220
> folio_alloc_noprof+0x63/0xe0
> filemap_alloc_folio_noprof+0xf4/0x100
> page_cache_ra_unbounded+0xb9/0x1a0
> page_cache_ra_order+0x26e/0x310
> ondemand_readahead+0x1a3/0x360
> page_cache_sync_ra+0x83/0x90
> filemap_get_pages+0xf0/0x6a0
> filemap_read+0xe7/0x3d0
> blkdev_read_iter+0x6f/0x140
> vfs_read+0x25b/0x340
> ksys_read+0x67/0xf0
> __x64_sys_read+0x19/0x20
> x64_sys_call+0x1771/0x20d0
> do_syscall_64+0x7e/0x130
>
> Regards,
> Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-19 20:21 ` Yu Zhao
@ 2024-07-20 7:57 ` Mateusz Guzik
2024-07-22 4:17 ` Bharata B Rao
2024-07-22 4:12 ` Bharata B Rao
1 sibling, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-07-20 7:57 UTC (permalink / raw)
To: Yu Zhao
Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho,
Mel Gorman
On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote:
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?
>
With my corporate employee hat on, I would like to note a couple of
three things.
1. there are definitely bugs here and someone(tm) should sort them out(R)
however....
2. the real goal is presumably to beat the kernel into shape where
production kernels no longer suffer lockups running this workload on
this hardware
3. the flamegraph (to be found in [1]) shows expensive debug enabled,
notably for preemption count (search for preempt_count_sub to see)
4. I'm told the lruvec problem is being worked on (but no ETA) and I
don't think the above justifies considering any hacks or otherwise
putting more pressure on it
It is plausible eliminating the aforementioned debug will be good enough.
Apart from that I note percpu_counter_add_batch (+ irq debug) accounts
for 5.8% cpu time. This will of course go down if irq tracing is
disabled, but so happens I optimized this routine to be faster
single-threaded (in particular by dodging the interrupt trip). The
patch is hanging out in the mm tree [2] and is trivially applicable
for testing.
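The idea, roughly (a sketch of the approach, not the exact patch; see [2]
for that): stay on the per-CPU fast path with this_cpu_try_cmpxchg() and
only take the irq-disabling, lock-protected slow path when the batch
threshold is crossed:

void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        s64 count;
        unsigned long flags;

        count = this_cpu_read(*fbc->counters);
        do {
                if (unlikely(abs(count + amount) >= batch)) {
                        /* slow path: fold into fbc->count under the lock */
                        raw_spin_lock_irqsave(&fbc->lock, flags);
                        count = __this_cpu_read(*fbc->counters);
                        fbc->count += count + amount;
                        __this_cpu_sub(*fbc->counters, count);
                        raw_spin_unlock_irqrestore(&fbc->lock, flags);
                        return;
                }
                /* fast path: no local_irq_save/restore round trip */
        } while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
}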
Even if none of the debug opts can get modified, this should drop
percpu_counter_add_batch to 1.5% or so, which may or may not have a
side effect of avoiding the lockup problem.
[1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@amd.com/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-19 20:21 ` Yu Zhao
2024-07-20 7:57 ` Mateusz Guzik
@ 2024-07-22 4:12 ` Bharata B Rao
1 sibling, 0 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-22 4:12 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik
On 20-Jul-24 1:51 AM, Yu Zhao wrote:
>> However during the weekend mglru-enabled run (with above fix to
>> isolate_lru_folios() and also the previous two patches: truncate.patch
>> and mglru.patch and the inode fix provided by Mateusz), another hard
>> lockup related to lruvec spinlock was observed.
>
> Thanks again for the stress tests.
>
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?
This is how a typical dstat report looks when we start to see the
problem with the lruvec spinlock.
------memory-usage----- ----swap---
used free buff cach| used free|
14.3G 20.7G 1467G 185M| 938M 15G|
14.3G 20.0G 1468G 174M| 938M 15G|
14.3G 20.3G 1468G 184M| 938M 15G|
14.3G 19.8G 1468G 183M| 938M 15G|
14.3G 19.9G 1468G 183M| 938M 15G|
14.3G 19.5G 1468G 183M| 938M 15G|
As you can see, most of the usage is in buffer cache and swap is hardly
used. Just to recap from the original post...
====
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout.
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though many different runs exist in the workload, the combination that
results in the problem is buffered-IO run with sync engine.
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
====
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-20 7:57 ` Mateusz Guzik
@ 2024-07-22 4:17 ` Bharata B Rao
0 siblings, 0 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-22 4:17 UTC (permalink / raw)
To: Mateusz Guzik, Yu Zhao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman
On 20-Jul-24 1:27 PM, Mateusz Guzik wrote:
> On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote:
>> I can't come up with any reasonable band-aid at this moment, i.e.,
>> something not too ugly to work around a more fundamental scalability
>> problem.
>>
>> Before I give up: what type of dirty data was written back to the nvme
>> device? Was it page cache or swap?
>>
>
> With my corporate employee hat on, I would like to note a couple of
> three things.
>
> 1. there are definitely bugs here and someone(tm) should sort them out(R)
>
> however....
>
> 2. the real goal is presumably to beat the kernel into shape where
> production kernels no longer suffer lockups running this workload on
> this hardware
> 3. the flamegraph (to be found in [1]) shows expensive debug enabled,
> notably for preemption count (search for preempt_count_sub to see)
> 4. I'm told the lruvec problem is being worked on (but no ETA) and I
> don't think the above justifies considering any hacks or otherwise
> putting more pressure on it
>
> It is plausible eliminating the aforementioned debug will be good enough.
>
> Apart from that I note percpu_counter_add_batch (+ irq debug) accounts
> for 5.8% cpu time. This will of course go down if irq tracing is
> disabled, but so happens I optimized this routine to be faster
> single-threaded (in particular by dodging the interrupt trip). The
> patch is hanging out in the mm tree [2] and is trivially applicable
> for testing.
>
> Even if none of the debug opts can get modified, this should drop
> percpu_counter_add_batch to 1.5% or so, which may or may not have a
> side effect of avoiding the lockup problem.
Thanks. A few debug options were turned ON to gather debug data. Will do
a full run with them turned OFF and with the above
percpu_counter_add_batch patch.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-15 5:19 ` Bharata B Rao
2024-07-19 20:21 ` Yu Zhao
@ 2024-07-25 9:59 ` zhaoyang.huang
2024-07-26 3:26 ` Zhaoyang Huang
1 sibling, 1 reply; 37+ messages in thread
From: zhaoyang.huang @ 2024-07-25 9:59 UTC (permalink / raw)
To: bharata
Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm,
mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, huangzhaoyang,
steve.kang
>However during the weekend mglru-enabled run (with above fix to
>isolate_lru_folios() and also the previous two patches: truncate.patch
>and mglru.patch and the inode fix provided by Mateusz), another hard
>lockup related to lruvec spinlock was observed.
>
>Here is the hardlock up:
>
>watchdog: Watchdog detected hard LOCKUP on cpu 466
>CPU: 466 PID: 3103929 Comm: fio Not tainted
>6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
>RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
>Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? watchdog_hardlockup_check+0x1b4/0x3a0
><SNIP>
> ? native_queued_spin_lock_slowpath+0x2b4/0x300
> </NMI>
> <IRQ>
> _raw_spin_lock_irqsave+0x5b/0x70
> folio_lruvec_lock_irqsave+0x62/0x90
> folio_batch_move_lru+0x9d/0x160
> folio_rotate_reclaimable+0xab/0xf0
> folio_end_writeback+0x60/0x90
> end_buffer_async_write+0xaa/0xe0
> end_bio_bh_io_sync+0x2c/0x50
> bio_endio+0x108/0x180
> blk_mq_end_request_batch+0x11f/0x5e0
> nvme_pci_complete_batch+0xb5/0xd0 [nvme]
> nvme_irq+0x92/0xe0 [nvme]
> __handle_irq_event_percpu+0x6e/0x1e0
> handle_irq_event+0x39/0x80
> handle_edge_irq+0x8c/0x240
> __common_interrupt+0x4e/0xf0
> common_interrupt+0x49/0xc0
> asm_common_interrupt+0x27/0x40
>
>Here is the lock holder details captured by all-cpu-backtrace:
>
>NMI backtrace for cpu 75
>CPU: 75 PID: 3095650 Comm: fio Not tainted
>6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
>RIP: 0010:folio_inc_gen+0x142/0x430
>Call Trace:
> <NMI>
> ? show_regs+0x69/0x80
> ? nmi_cpu_backtrace+0xc5/0x130
> ? nmi_cpu_backtrace_handler+0x11/0x20
> ? nmi_handle+0x64/0x180
> ? default_do_nmi+0x45/0x130
> ? exc_nmi+0x128/0x1a0
> ? end_repeat_nmi+0xf/0x53
> ? folio_inc_gen+0x142/0x430
> ? folio_inc_gen+0x142/0x430
> ? folio_inc_gen+0x142/0x430
> </NMI>
> <TASK>
> isolate_folios+0x954/0x1630
> evict_folios+0xa5/0x8c0
> try_to_shrink_lruvec+0x1be/0x320
> shrink_one+0x10f/0x1d0
> shrink_node+0xa4c/0xc90
> do_try_to_free_pages+0xc0/0x590
> try_to_free_pages+0xde/0x210
> __alloc_pages_noprof+0x6ae/0x12c0
> alloc_pages_mpol_noprof+0xd9/0x220
> folio_alloc_noprof+0x63/0xe0
> filemap_alloc_folio_noprof+0xf4/0x100
> page_cache_ra_unbounded+0xb9/0x1a0
> page_cache_ra_order+0x26e/0x310
> ondemand_readahead+0x1a3/0x360
> page_cache_sync_ra+0x83/0x90
> filemap_get_pages+0xf0/0x6a0
> filemap_read+0xe7/0x3d0
> blkdev_read_iter+0x6f/0x140
> vfs_read+0x25b/0x340
> ksys_read+0x67/0xf0
> __x64_sys_read+0x19/0x20
> x64_sys_call+0x1771/0x20d0
> do_syscall_64+0x7e/0x130
From the callstack of the lock holder, it looks like a scalability issue
rather than a deadlock. Unlike legacy LRU management, there is no
throttling mechanism for global reclaim under MGLRU so far. Could we apply
a similar method to throttle the reclaim when it is too aggressive? I am
wondering if this patch, which is a rough version, could help with this.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..827036e21f24 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
return scanned;
}
+static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+ struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+ if (current_is_kswapd()) {
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+
+ /* Allow kswapd to start writing pages during reclaim.*/
+ if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+ set_bit(PGDAT_DIRTY, &pgdat->flags);
+
+ if (sc->nr.immediate)
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ /*
+ * Tag a node/memcg as congested if all the dirty pages were marked
+ * for writeback and immediate reclaim (counted in nr.congested).
+ *
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling in reclaim_throttle().
+ */
+ if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) {
+ if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+ set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+ if (current_is_kswapd())
+ set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+ }
+
+ /*
+ * Stall direct reclaim for IO completions if the lruvec is
+ * node is congested. Allow kswapd to continue until it
+ * starts encountering unqueued dirty pages or cycling through
+ * the LRU too quickly.
+ */
+ if (!current_is_kswapd() && current_may_throttle() &&
+ !sc->hibernation_mode &&
+ (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+ test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
+}
+
static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
{
int type;
@@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
sc->nr_reclaimed += reclaimed;
+ sc->nr.dirty += stat.nr_dirty;
+ sc->nr.congested += stat.nr_congested;
+ sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+ sc->nr.writeback += stat.nr_writeback;
+ sc->nr.immediate += stat.nr_immediate;
+ sc->nr.taken += scanned;
+
+ if (type)
+ sc->nr.file_taken += scanned;
+
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
@@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (lru_gen_enabled() && root_reclaim(sc)) {
lru_gen_shrink_node(pgdat, sc);
+ lru_gen_throttle(pgdat, sc);
return;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-25 9:59 ` zhaoyang.huang
@ 2024-07-26 3:26 ` Zhaoyang Huang
2024-07-29 4:49 ` Bharata B Rao
0 siblings, 1 reply; 37+ messages in thread
From: Zhaoyang Huang @ 2024-07-26 3:26 UTC (permalink / raw)
To: zhaoyang.huang
Cc: bharata, Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel,
linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao,
steve.kang
On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang
<zhaoyang.huang@unisoc.com> wrote:
>
> >However during the weekend mglru-enabled run (with above fix to
> >isolate_lru_folios() and also the previous two patches: truncate.patch
> >and mglru.patch and the inode fix provided by Mateusz), another hard
> >lockup related to lruvec spinlock was observed.
> >
> >Here is the hardlock up:
> >
> >watchdog: Watchdog detected hard LOCKUP on cpu 466
> >CPU: 466 PID: 3103929 Comm: fio Not tainted
> >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> >RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> >Call Trace:
> > <NMI>
> > ? show_regs+0x69/0x80
> > ? watchdog_hardlockup_check+0x1b4/0x3a0
> ><SNIP>
> > ? native_queued_spin_lock_slowpath+0x2b4/0x300
> > </NMI>
> > <IRQ>
> > _raw_spin_lock_irqsave+0x5b/0x70
> > folio_lruvec_lock_irqsave+0x62/0x90
> > folio_batch_move_lru+0x9d/0x160
> > folio_rotate_reclaimable+0xab/0xf0
> > folio_end_writeback+0x60/0x90
> > end_buffer_async_write+0xaa/0xe0
> > end_bio_bh_io_sync+0x2c/0x50
> > bio_endio+0x108/0x180
> > blk_mq_end_request_batch+0x11f/0x5e0
> > nvme_pci_complete_batch+0xb5/0xd0 [nvme]
> > nvme_irq+0x92/0xe0 [nvme]
> > __handle_irq_event_percpu+0x6e/0x1e0
> > handle_irq_event+0x39/0x80
> > handle_edge_irq+0x8c/0x240
> > __common_interrupt+0x4e/0xf0
> > common_interrupt+0x49/0xc0
> > asm_common_interrupt+0x27/0x40
> >
> >Here is the lock holder details captured by all-cpu-backtrace:
> >
> >NMI backtrace for cpu 75
> >CPU: 75 PID: 3095650 Comm: fio Not tainted
> >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> >RIP: 0010:folio_inc_gen+0x142/0x430
> >Call Trace:
> > <NMI>
> > ? show_regs+0x69/0x80
> > ? nmi_cpu_backtrace+0xc5/0x130
> > ? nmi_cpu_backtrace_handler+0x11/0x20
> > ? nmi_handle+0x64/0x180
> > ? default_do_nmi+0x45/0x130
> > ? exc_nmi+0x128/0x1a0
> > ? end_repeat_nmi+0xf/0x53
> > ? folio_inc_gen+0x142/0x430
> > ? folio_inc_gen+0x142/0x430
> > ? folio_inc_gen+0x142/0x430
> > </NMI>
> > <TASK>
> > isolate_folios+0x954/0x1630
> > evict_folios+0xa5/0x8c0
> > try_to_shrink_lruvec+0x1be/0x320
> > shrink_one+0x10f/0x1d0
> > shrink_node+0xa4c/0xc90
> > do_try_to_free_pages+0xc0/0x590
> > try_to_free_pages+0xde/0x210
> > __alloc_pages_noprof+0x6ae/0x12c0
> > alloc_pages_mpol_noprof+0xd9/0x220
> > folio_alloc_noprof+0x63/0xe0
> > filemap_alloc_folio_noprof+0xf4/0x100
> > page_cache_ra_unbounded+0xb9/0x1a0
> > page_cache_ra_order+0x26e/0x310
> > ondemand_readahead+0x1a3/0x360
> > page_cache_sync_ra+0x83/0x90
> > filemap_get_pages+0xf0/0x6a0
> > filemap_read+0xe7/0x3d0
> > blkdev_read_iter+0x6f/0x140
> > vfs_read+0x25b/0x340
> > ksys_read+0x67/0xf0
> > __x64_sys_read+0x19/0x20
> > x64_sys_call+0x1771/0x20d0
> > do_syscall_64+0x7e/0x130
>
> From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2e34de9cd0d4..827036e21f24 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> return scanned;
> }
>
> +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc)
> +{
> + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +
> + if (current_is_kswapd()) {
> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim.*/
> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> + if (sc->nr.immediate)
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + /*
> + * Tag a node/memcg as congested if all the dirty pages were marked
> + * for writeback and immediate reclaim (counted in nr.congested).
> + *
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling in reclaim_throttle().
> + */
> + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) {
> + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
> + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
> +
> + if (current_is_kswapd())
> + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
> + }
> +
> + /*
> + * Stall direct reclaim for IO completions if the lruvec is
> + * node is congested. Allow kswapd to continue until it
> + * starts encountering unqueued dirty pages or cycling through
> + * the LRU too quickly.
> + */
> + if (!current_is_kswapd() && current_may_throttle() &&
> + !sc->hibernation_mode &&
> + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
> + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
> +}
> +
> static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> {
> int type;
> @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> sc->nr_reclaimed += reclaimed;
> + sc->nr.dirty += stat.nr_dirty;
> + sc->nr.congested += stat.nr_congested;
> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> + sc->nr.writeback += stat.nr_writeback;
> + sc->nr.immediate += stat.nr_immediate;
> + sc->nr.taken += scanned;
> +
> + if (type)
> + sc->nr.file_taken += scanned;
> +
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
> if (lru_gen_enabled() && root_reclaim(sc)) {
> lru_gen_shrink_node(pgdat, sc);
> + lru_gen_throttle(pgdat, sc);
> return;
> }
Hi Bharata,
This patch arose from a regression in an Android test case which
allocated 1GB of virtual memory in each of 8 threads on a 5.5GB RAM
system. The test passed with legacy LRU management but failed
under MGLRU, as a watchdog monitor detected an abnormal system-wide
scheduling status (the watchdog could not be scheduled within 60
seconds). This patch, with the slight change shown below, made the test
pass, though it has not been investigated deeply how it did so.
Theoretically, this patch introduces a reclaim throttling mechanism
similar to what legacy reclaim does, which could reduce the contention
on lruvec->lru_lock. I think this patch is quite naive for now, but I
am hoping it could help you, as your case looks like a scalability
issue under memory pressure rather than a deadlock issue. Thank you!
The change in the applied version (throttle the reclaim before
instead of after):
if (lru_gen_enabled() && root_reclaim(sc)) {
+ lru_gen_throttle(pgdat, sc);
lru_gen_shrink_node(pgdat, sc);
- lru_gen_throttle(pgdat, sc);
return;
}
>
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-26 3:26 ` Zhaoyang Huang
@ 2024-07-29 4:49 ` Bharata B Rao
0 siblings, 0 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-29 4:49 UTC (permalink / raw)
To: Zhaoyang Huang, zhaoyang.huang
Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm,
mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, steve.kang
On 26-Jul-24 8:56 AM, Zhaoyang Huang wrote:
> On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang
> <zhaoyang.huang@unisoc.com> wrote:
<snip>
>> From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this?
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2e34de9cd0d4..827036e21f24 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>> return scanned;
>> }
>>
>> +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc)
>> +{
>> + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>> +
>> + if (current_is_kswapd()) {
>> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
>> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>> +
>> + /* Allow kswapd to start writing pages during reclaim.*/
>> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
>> + set_bit(PGDAT_DIRTY, &pgdat->flags);
>> +
>> + if (sc->nr.immediate)
>> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>> + }
>> +
>> + /*
>> + * Tag a node/memcg as congested if all the dirty pages were marked
>> + * for writeback and immediate reclaim (counted in nr.congested).
>> + *
>> + * Legacy memcg will stall in page writeback so avoid forcibly
>> + * stalling in reclaim_throttle().
>> + */
>> + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) {
>> + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
>> + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
>> +
>> + if (current_is_kswapd())
>> + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
>> + }
>> +
>> + /*
>> + * Stall direct reclaim for IO completions if the lruvec is
>> + * node is congested. Allow kswapd to continue until it
>> + * starts encountering unqueued dirty pages or cycling through
>> + * the LRU too quickly.
>> + */
>> + if (!current_is_kswapd() && current_may_throttle() &&
>> + !sc->hibernation_mode &&
>> + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
>> + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
>> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
>> +}
>> +
>> static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
>> {
>> int type;
>> @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>> retry:
>> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
>> sc->nr_reclaimed += reclaimed;
>> + sc->nr.dirty += stat.nr_dirty;
>> + sc->nr.congested += stat.nr_congested;
>> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>> + sc->nr.writeback += stat.nr_writeback;
>> + sc->nr.immediate += stat.nr_immediate;
>> + sc->nr.taken += scanned;
>> +
>> + if (type)
>> + sc->nr.file_taken += scanned;
>> +
>> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>> scanned, reclaimed, &stat, sc->priority,
>> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>> @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>
>> if (lru_gen_enabled() && root_reclaim(sc)) {
>> lru_gen_shrink_node(pgdat, sc);
>> + lru_gen_throttle(pgdat, sc);
>> return;
>> }
> Hi Bharata,
> This patch arised from a regression Android test case failure which
> allocated 1GB virtual memory by each over 8 threads on an 5.5GB RAM
> system. This test could pass on legacy LRU management while failing
> under MGLRU as a watchdog monitor detected abnormal system-wide
> schedule status(watchdog can't be scheduled within 60 seconds). This
> patch with a slight change as below got passed in the test whereas has
> not been investigated deeply for how it was done. Theoretically, this
> patch enrolled the similar reclaim throttle mechanism as legacy do
> which could reduce the contention of lruvec->lru_lock. I think this
> patch is quite naive for now, but I am hoping it could help you as
> your case seems like a scability issue of memory pressure rather than
> a deadlock issue. Thank you!
>
> the change of the applied version(try to throttle the reclaim before
> instead of after)
> if (lru_gen_enabled() && root_reclaim(sc)) {
> + lru_gen_throttle(pgdat, sc);
> lru_gen_shrink_node(pgdat, sc);
> - lru_gen_throttle(pgdat, sc);
> return;
> }
Thanks Zhaoyang Huang for the patch, will give this a test and report back.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-07-09 5:58 ` Yu Zhao
2024-07-11 5:43 ` Bharata B Rao
@ 2024-08-13 11:04 ` Usama Arif
2024-08-13 17:43 ` Yu Zhao
1 sibling, 1 reply; 37+ messages in thread
From: Usama Arif @ 2024-08-13 11:04 UTC (permalink / raw)
To: Yu Zhao, Bharata B Rao
Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, leitao
On 09/07/2024 06:58, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
>>
>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>
>>>> Hi Yu Zhao,
>>>>
>>>> Thanks for your patches. See below...
>>>>
>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>> Hi Bharata,
>>>>>
>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> Some experiments tried
>>>>>> ======================
>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup.
>>>>>
>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>> attached patches? It (truncate.patch) should help with or without
>>>>> MGLRU.
>>>>
>>>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
>>>
>>> Thanks.
>>>
>>> In your original report, you said:
>>>
>>> Most of the times the two contended locks are lruvec and
>>> inode->i_lock spinlocks.
>>> ...
>>> Often times, the perf output at the time of the problem shows
>>> heavy contention on lruvec spin lock. Similar contention is
>>> also observed with inode i_lock (in clear_shadow_entry path)
>>>
>>> Based on this new report, does it mean the i_lock is not as contended,
>>> for the same path (truncation) you tested? If so, I'll post
>>> truncate.patch and add reported-by and tested-by you, unless you have
>>> objections.
>>
>> truncate.patch has been tested on two systems with default LRU scheme
>> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run.
>
> Thanks.
>
>>>
>>> The two paths below were contended on the LRU lock, but they already
>>> batch their operations. So I don't know what else we can do surgically
>>> to improve them.
>>
>> What has been seen with this workload is that the lruvec spinlock is
>> held for a long time from the shrink_[active/inactive]_list path. In
>> this path, there is a case in isolate_lru_folios() where scanning of
>> the LRU lists can become unbounded. To isolate a page from ZONE_DMA,
>> more than 150 million folios were sometimes scanned and skipped. There
>> is already a comment in there which explains why nr_skipped shouldn't
>> be counted, but is there any possibility of re-looking at this
>> condition?
>
> For this specific case, probably this can help:
>
> @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> 		if (folio_zonenum(folio) > sc->reclaim_idx ||
> 		    skip_cma(folio, sc)) {
> 			nr_skipped[folio_zonenum(folio)] += nr_pages;
> -			move_to = &folios_skipped;
> -			goto move;
> +			list_move(&folio->lru, &folios_skipped);
> +			if (spin_is_contended(&lruvec->lru_lock)) {
> +				if (!list_empty(dst))
> +					break;
> +				spin_unlock_irq(&lruvec->lru_lock);
> +				cond_resched();
> +				spin_lock_irq(&lruvec->lru_lock);
> +			}
> +			continue;
> 		}
>
Hi Yu,
We are seeing lockups and high memory pressure in Meta production due to
this lock contention as well. My colleague highlighted it in
https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed
to this fix.

We removed the skip_cma check as a temporary measure, but this is the
proper fix. I might have missed it, but I didn't see this sent as a patch
on the mailing list. Just wanted to check whether you were planning to
send it as a patch? Happy to send it on your behalf as well.
Thanks
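
For reference, the temporary measure mentioned above would presumably
amount to something like the following in isolate_lru_folios() (a guess
based on the hunk quoted earlier, not necessarily the exact change carried
in Meta production):

	/*
	 * Stopgap: drop the skip_cma() test so CMA folios are no longer
	 * skipped (and rescanned) while holding lruvec->lru_lock; only
	 * folios from zones above the reclaim index are still skipped.
	 */
	if (folio_zonenum(folio) > sc->reclaim_idx) {
		nr_skipped[folio_zonenum(folio)] += nr_pages;
		move_to = &folios_skipped;
		goto move;
	}
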
* Re: Hard and soft lockups with FIO and LTP runs on a large system
2024-08-13 11:04 ` Usama Arif
@ 2024-08-13 17:43 ` Yu Zhao
0 siblings, 0 replies; 37+ messages in thread
From: Yu Zhao @ 2024-08-13 17:43 UTC (permalink / raw)
To: Usama Arif
Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj,
Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho,
Mel Gorman, leitao
On Tue, Aug 13, 2024 at 5:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 09/07/2024 06:58, Yu Zhao wrote:
> > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
> >>
> >> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>
> >>>> Hi Yu Zhao,
> >>>>
> >>>> Thanks for your patches. See below...
> >>>>
> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>>>> Hi Bharata,
> >>>>>
> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>>>
> >>>> <snip>
> >>>>>>
> >>>>>> Some experiments tried
> >>>>>> ======================
> >>>>>> 1) When MGLRU was enabled, many soft lockups were observed but no hard
> >>>>>> lockups were seen for a 48-hour run. Below is one such soft lockup.
> >>>>>
> >>>>> This is not really an MGLRU issue -- can you please try one of the
> >>>>> attached patches? It (truncate.patch) should help with or without
> >>>>> MGLRU.
> >>>>
> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen.
> >>>
> >>> Thanks.
> >>>
> >>> In your original report, you said:
> >>>
> >>> Most of the times the two contended locks are lruvec and
> >>> inode->i_lock spinlocks.
> >>> ...
> >>> Often times, the perf output at the time of the problem shows
> >>> heavy contention on lruvec spin lock. Similar contention is
> >>> also observed with inode i_lock (in clear_shadow_entry path)
> >>>
> >>> Based on this new report, does it mean the i_lock is not as contended,
> >>> for the same path (truncation) you tested? If so, I'll post
> >>> truncate.patch and add reported-by and tested-by you, unless you have
> >>> objections.
> >>
> >> truncate.patch has been tested on two systems with the default LRU scheme
> >> and the lockup due to inode->i_lock hasn't been seen yet after a 24-hour run.
> >
> > Thanks.
> >
> >>>
> >>> The two paths below were contended on the LRU lock, but they already
> >>> batch their operations. So I don't know what else we can do surgically
> >>> to improve them.
> >>
> >> What has been seen with this workload is that the lruvec spinlock is
> >> held for a long time from the shrink_[active/inactive]_list path. In
> >> this path, there is a case in isolate_lru_folios() where scanning of
> >> the LRU lists can become unbounded. To isolate a page from ZONE_DMA,
> >> more than 150 million folios were sometimes scanned and skipped. There
> >> is already a comment in there which explains why nr_skipped shouldn't
> >> be counted, but is there any possibility of re-looking at this
> >> condition?
> >
> > For this specific case, probably this can help:
> >
> > @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> > 		if (folio_zonenum(folio) > sc->reclaim_idx ||
> > 		    skip_cma(folio, sc)) {
> > 			nr_skipped[folio_zonenum(folio)] += nr_pages;
> > -			move_to = &folios_skipped;
> > -			goto move;
> > +			list_move(&folio->lru, &folios_skipped);
> > +			if (spin_is_contended(&lruvec->lru_lock)) {
> > +				if (!list_empty(dst))
> > +					break;
> > +				spin_unlock_irq(&lruvec->lru_lock);
> > +				cond_resched();
> > +				spin_lock_irq(&lruvec->lru_lock);
> > +			}
> > +			continue;
Nitpick:
	if () {
		...
		if (!spin_is_contended(&lruvec->lru_lock))
			continue;
		if (!list_empty(dst))
			break;
		spin_unlock_irq(&lruvec->lru_lock);
		cond_resched();
		spin_lock_irq(&lruvec->lru_lock);
	}
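
Putting the quoted hunk and the nitpick together, the skip branch in
isolate_lru_folios() would read roughly as below (a sketch assembled from
the two snippets above, not a tested patch; the trailing continue is kept
from the original hunk so a skipped folio is still never isolated):

	if (folio_zonenum(folio) > sc->reclaim_idx ||
	    skip_cma(folio, sc)) {
		nr_skipped[folio_zonenum(folio)] += nr_pages;
		/* park the folio on the skipped list instead of isolating it */
		list_move(&folio->lru, &folios_skipped);
		/* nobody else is waiting for lru_lock: keep scanning */
		if (!spin_is_contended(&lruvec->lru_lock))
			continue;
		/* contended and we already isolated something: stop the scan */
		if (!list_empty(dst))
			break;
		/* contended with nothing isolated yet: back off briefly */
		spin_unlock_irq(&lruvec->lru_lock);
		cond_resched();
		spin_lock_irq(&lruvec->lru_lock);
		continue;
	}
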
> Hi Yu,
>
> We are seeing lockups and high memory pressure in Meta production due to
> this lock contention as well. My colleague highlighted it in
> https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed
> to this fix.
>
> We removed the skip_cma check as a temporary measure, but this is the
> proper fix. I might have missed it, but I didn't see this sent as a patch
> on the mailing list. Just wanted to check whether you were planning to
> send it as a patch? Happy to send it on your behalf as well.
Please. Thank you.
End of thread (newest message: 2024-08-13 17:44 UTC)
Thread overview: 37+ messages
2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
2024-07-08 14:34 ` Bharata B Rao
2024-07-08 16:17 ` Yu Zhao
2024-07-09 4:30 ` Bharata B Rao
2024-07-09 5:58 ` Yu Zhao
2024-07-11 5:43 ` Bharata B Rao
2024-07-15 5:19 ` Bharata B Rao
2024-07-19 20:21 ` Yu Zhao
2024-07-20 7:57 ` Mateusz Guzik
2024-07-22 4:17 ` Bharata B Rao
2024-07-22 4:12 ` Bharata B Rao
2024-07-25 9:59 ` zhaoyang.huang
2024-07-26 3:26 ` Zhaoyang Huang
2024-07-29 4:49 ` Bharata B Rao
2024-08-13 11:04 ` Usama Arif
2024-08-13 17:43 ` Yu Zhao
2024-07-17 9:37 ` Vlastimil Babka
2024-07-17 10:50 ` Bharata B Rao
2024-07-17 11:15 ` Hillf Danton
2024-07-18 9:02 ` Bharata B Rao
2024-07-10 12:03 ` Bharata B Rao
2024-07-10 12:24 ` Mateusz Guzik
2024-07-10 13:04 ` Mateusz Guzik
2024-07-15 5:22 ` Bharata B Rao
2024-07-15 6:48 ` Mateusz Guzik
2024-07-10 18:04 ` Yu Zhao
2024-07-17 9:42 ` Vlastimil Babka
2024-07-17 10:31 ` Bharata B Rao
2024-07-17 16:44 ` Karim Manaouil
2024-07-17 11:29 ` Mateusz Guzik
2024-07-18 9:00 ` Bharata B Rao
2024-07-18 12:11 ` Mateusz Guzik
2024-07-19 6:16 ` Bharata B Rao
2024-07-19 7:06 ` Yu Zhao
2024-07-19 14:26 ` Mateusz Guzik
2024-07-17 16:34 ` Karim Manaouil