* Hard and soft lockups with FIO and LTP runs on a large system
@ 2024-07-03 15:11 Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
2024-07-17 9:42 ` Vlastimil Babka
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-03 15:11 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, yuzhao, kinseyho, Mel Gorman
Many soft and hard lockups are seen with the upstream kernel when running
a set of tests that includes FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours into the
run. Originally this was reported on a large customer VM instance with
passthrough NVME disks on older kernels (v5.4-based). However, similar
problems were reproduced when running the tests on bare metal with the
latest upstream kernel (v6.10-rc3). Other lockups with different
signatures are also seen, but only those related to the MM area are
discussed in this report. Also note that the subsequent description
relates to the lockups seen on bare metal with the upstream kernel (and
not in the VM).
The general observation is that the problem usually surfaces when the
system free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the time, the two contended locks are the lruvec
and inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation, typical of a large system configuration?
- Are there any MM/FS tunables that could help here?
Hardware configuration
======================
Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB
Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout.
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though the workload includes many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine.
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
The watchdog threshold was reduced to 5s to reproduce the problem
earlier, and all-CPU backtraces were enabled.
Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:
kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <NMI>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: </NMI>
kernel: <TASK>
kernel: ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7fe21c314887
With all-CPU backtraces enabled, many CPUs are seen waiting to acquire
the lruvec lock. We measured the lruvec spinlock acquisition start, end
and hold time (htime) using sched_clock(), along with a BUG() if the
hold time exceeded 10s. The capture shown after the sketch below is one
case where the lruvec spinlock was held for ~25s.
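A minimal sketch of that instrumentation is given below (this is not the
exact patch used); lock_stime is an assumed debug-only field added to
struct lruvec, and the helper names are illustrative wrappers around the
existing lock/unlock paths.

/* Debug-only sketch: measure how long the lruvec spinlock is held. */
static inline void dbg_lock_page_lruvec_irq(struct lruvec *lruvec)
{
        spin_lock_irq(&lruvec->lru_lock);
        /* lock_stime is an assumed debug field added to struct lruvec */
        lruvec->lock_stime = sched_clock();
}

static inline void dbg_unlock_page_lruvec_irq(struct lruvec *lruvec)
{
        u64 stime = lruvec->lock_stime;
        u64 etime = sched_clock();
        u64 htime = etime - stime;

        if (htime > 10ULL * NSEC_PER_SEC) {
                pr_err("unlock_page_lruvec_irq: stime %llu, etime %llu, htime %llu\n",
                       stime, etime, htime);
                BUG();
        }
        spin_unlock_irq(&lruvec->lru_lock);
}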
kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520
And the corresponding trace point for the above:
kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
classzone=0 order=0 nr_requested=1 nr_scanned=156946361
nr_skipped=156946360 nr_taken=1 lru=active_file
This shows that isolate_lru_folios() is scanning through a huge number
(~150 million) of folios (order=0) with the lruvec spinlock held. This
is happening because a large number of folios are being skipped in order
to isolate a few ZONE_DMA folios. Though the number of folios to isolate
is bounded (32), the walk becomes effectively unbounded when folios are
skipped, since skipped folios are not counted towards that bound.
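For context, below is a simplified paraphrase (not the verbatim kernel
code) of the relevant loop in isolate_lru_folios() as of v6.10-rc3,
showing why skipped folios keep the walk going well past nr_to_scan:

/*
 * Paraphrased from isolate_lru_folios(): 'scan' is bounded by
 * nr_to_scan (32 here), but skipped folios do not advance it.
 */
while (scan < nr_to_scan && !list_empty(src)) {
        struct folio *folio = lru_to_folio(src);
        unsigned long nr_pages = folio_nr_pages(folio);

        total_scan += nr_pages;

        if (folio_zonenum(folio) > sc->reclaim_idx ||
            skip_cma(folio, sc)) {
                /*
                 * Ineligible zone (e.g. the allocation wants ZONE_DMA
                 * but this folio is from a higher zone): park it on
                 * folios_skipped and keep walking. Skipped folios are
                 * deliberately not counted in 'scan' (to avoid a
                 * premature OOM when the LRU is mostly ineligible), so
                 * the walk is bounded only by the LRU length -- ~150
                 * million folios in the trace above, all with the
                 * lru_lock held.
                 */
                nr_skipped[folio_zonenum(folio)] += nr_pages;
                list_move(&folio->lru, &folios_skipped);
                continue;
        }

        scan += nr_pages;
        /* ... attempt to isolate the folio onto 'dst' ... */
}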
Meminfo output shows that free memory is down to around 2% and the
page/buffer cache has grown very high when the lockup happens.
MemTotal: 1584835956 kB
MemFree: 27805664 kB
MemAvailable: 1568099004 kB
Buffers: 1386120792 kB
Cached: 151894528 kB
SwapCached: 30620 kB
Active: 1043678892 kB
Inactive: 494456452 kB
Oftentimes, the perf output at the time of the problem shows heavy
contention on the lruvec spinlock. Similar contention is also observed
on the inode i_lock (in the clear_shadow_entry path):
  98.98%  fio  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
          |
          --98.96%--native_queued_spin_lock_slowpath
                    |
                    --98.96%--_raw_spin_lock_irqsave
                              folio_lruvec_lock_irqsave
                              |
                              --98.78%--folio_batch_move_lru
                                        |
                                        --98.63%--deactivate_file_folio
                                                  mapping_try_invalidate
                                                  invalidate_mapping_pages
                                                  invalidate_bdev
                                                  blkdev_common_ioctl
                                                  blkdev_ioctl
                                                  __x64_sys_ioctl
                                                  x64_sys_call
                                                  do_syscall_64
                                                  entry_SYSCALL_64_after_hwframe
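For reference on why the lruvec lock shows up under
folio_batch_move_lru() in the above profile, here is a paraphrase (not
verbatim) of deactivate_file_folio() from mm/swap.c in v6.10-rc3: folios
are queued on a per-CPU batch and the lruvec lock is taken only when the
batch is moved, but with hundreds of CPUs invalidating the same bdev
mapping, those batch moves all pile up on the same per-node lruvec
spinlock.

void deactivate_file_folio(struct folio *folio)
{
        struct folio_batch *fbatch;

        /* Deactivating an unevictable folio will not accelerate reclaim */
        if (folio_test_unevictable(folio))
                return;

        folio_get(folio);
        local_lock(&cpu_fbatches.lock);
        fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
        /* drains into folio_batch_move_lru(), which takes the lruvec lock */
        folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
        local_unlock(&cpu_fbatches.lock);
}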
Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed but no hard
lockups were seen during a 48-hour run. Below is one such soft lockup.
kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <IRQ>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_timer_fn+0x223/0x2b0
kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel: </IRQ>
kernel: <TASK>
kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: _raw_spin_lock+0x38/0x50
kernel: clear_shadow_entry+0x3d/0x100
kernel: ? __pfx_workingset_update_node+0x10/0x10
kernel: mapping_try_invalidate+0x117/0x1d0
kernel: invalidate_mapping_pages+0x10/0x20
kernel: invalidate_bdev+0x3c/0x50
kernel: blkdev_common_ioctl+0x5f7/0xa90
kernel: blkdev_ioctl+0x109/0x270
kernel: x64_sys_call+0x1215/0x20d0
kernel: do_syscall_64+0x7e/0x130
This happens to be contending on the inode i_lock spinlock.
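For context, a paraphrase (not verbatim) of this path as of v6.10-rc3:
mapping_try_invalidate() ends up calling clear_shadow_entry() once for
every shadow (exceptional) entry it finds, and each call takes the
inode's i_lock and the i_pages xarray lock individually, so a large
invalidate_bdev() hammers i_lock once per entry:

static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
                               void *entry)
{
        spin_lock(&mapping->host->i_lock);
        xa_lock_irq(&mapping->i_pages);
        __clear_shadow_entry(mapping, index, entry);
        xa_unlock_irq(&mapping->i_pages);
        if (mapping_shrinkable(mapping))
                inode_add_lru(mapping->host);
        spin_unlock(&mapping->host->i_lock);
}

Taking i_lock once per batch of entries, rather than once per entry,
would be one way to reduce this contention.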
The below preemptirqsoff trace points to preemption being disabled for
more than 10s; the lock in the picture is the lruvec spinlock.
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
# --------------------------------------------------------------------
# latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
HP:0 #P:512)
# -----------------
# | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: deactivate_file_folio
# => ended at: deactivate_file_folio
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
fio-2701523 128...1.    0us$: deactivate_file_folio <-deactivate_file_folio
fio-2701523 128.N.1. 10382681us : deactivate_file_folio <-deactivate_file_folio
fio-2701523 128.N.1. 10382683us : tracer_preempt_on <-deactivate_file_folio
fio-2701523 128.N.1. 10382691us : <stack trace>
=> deactivate_file_folio
=> mapping_try_invalidate
=> invalidate_mapping_pages
=> invalidate_bdev
=> blkdev_common_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
2) Increased the low watermark threshold to 10% to prevent the system
from entering an extremely low memory situation. Hard lockups weren't
seen, but soft lockups (in clear_shadow_entry()) were still seen.
3) AMD has a BIOS setting called NPS (Nodes Per Socket) with which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there are four NUMA nodes per socket, and hence 8 NUMA nodes in the
system. This was tried to check whether having more kswapd threads, each
working on fewer folios per node, would make a difference. However, here
too multiple soft lockups were seen (in clear_shadow_entry(), as in the
MGLRU case). No hard lockups were observed.
Any insights into or suggestions for addressing these lockups are welcome!
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao @ 2024-07-06 22:42 ` Yu Zhao 2024-07-08 14:34 ` Bharata B Rao 2024-07-10 12:03 ` Bharata B Rao 2024-07-17 9:42 ` Vlastimil Babka 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-06 22:42 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 10946 bytes --] Hi Bharata, On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > > Many soft and hard lockups are seen with upstream kernel when running a > bunch of tests that include FIO and LTP filesystem test on 10 NVME > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > this was reported on a large customer VM instance with passthrough NVME > disks on older kernels(v5.4 based). However, similar problems were > reproduced when running the tests on bare metal with latest upstream > kernel (v6.10-rc3). Other lockups with different signatures are seen but > in this report, only those related to MM area are being discussed. > Also note that the subsequent description is related to the lockups in > bare metal upstream (and not VM). > > The general observation is that the problem usually surfaces when the > system free memory goes very low and page cache/buffer consumption hits > the ceiling. Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > > - Could this be a scalability issue in LRU list handling and/or page > cache invalidation typical to a large system configuration? > - Are there any MM/FS tunables that could help here? > > Hardware configuration > ====================== > Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) > Memory: 1.5 TB > 10 NVME - 3.5TB each > available: 2 nodes (0-1) > node 0 cpus: 0-127,256-383 > node 0 size: 773727 MB > node 1 cpus: 128-255,384-511 > node 1 size: 773966 MB > > Workload details > ================ > Workload includes concurrent runs of FIO and a few FS tests from LTP. > > FIO is run with a size of 1TB on each NVME partition with different > combinations of ioengine/blocksize/mode parameters and buffered-IO. > Selected FS tests from LTP are run on 256GB partitions of all NVME > disks. This is the typical NVME partition layout. > > nvme2n1 259:4 0 3.5T 0 disk > ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 > └─nvme2n1p2 259:7 0 3.2T 0 part > > Though many different runs exist in the workload, the combination that > results in the problem is buffered-IO run with sync engine. > > fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ > -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ > -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest > > Watchdog threshold was reduced to 5s to reproduce the problem early and > all CPU backtrace enabled. > > Problem details and analysis > ============================ > One of the hard lockups which was observed and analyzed in detail is this: > > kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284 > kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <NMI> > kernel: ? show_regs+0x69/0x80 > kernel: ? 
watchdog_hardlockup_check+0x19e/0x360 > <SNIP> > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: </NMI> > kernel: <TASK> > kernel: ? __pfx_lru_add_fn+0x10/0x10 > kernel: _raw_spin_lock_irqsave+0x42/0x50 > kernel: folio_lruvec_lock_irqsave+0x62/0xb0 > kernel: folio_batch_move_lru+0x79/0x2a0 > kernel: folio_add_lru+0x6d/0xf0 > kernel: filemap_add_folio+0xba/0xe0 > kernel: __filemap_get_folio+0x137/0x2e0 > kernel: ext4_da_write_begin+0x12c/0x270 > kernel: generic_perform_write+0xbf/0x200 > kernel: ext4_buffered_write_iter+0x67/0xf0 > kernel: ext4_file_write_iter+0x70/0x780 > kernel: vfs_write+0x301/0x420 > kernel: ksys_write+0x67/0xf0 > kernel: __x64_sys_write+0x19/0x20 > kernel: x64_sys_call+0x1689/0x20d0 > kernel: do_syscall_64+0x6b/0x110 > kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: > 0033:0x7fe21c314887 > > With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock > acquisition. We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > > kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime > 27963324369895, htime 25889317166 > kernel: ------------[ cut here ]------------ > kernel: kernel BUG at include/linux/memcontrol.h:1677! > kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W > 6.10.0-rc3-qspindbg #10 > kernel: RIP: 0010:shrink_active_list+0x40a/0x520 > > And the corresponding trace point for the above: > kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: > classzone=0 order=0 nr_requested=1 nr_scanned=156946361 > nr_skipped=156946360 nr_taken=1 lru=active_file > > This shows that isolate_lru_folios() is scanning through a huge number > (~150million) of folios (order=0) with lruvec spinlock held. This is > happening because a large number of folios are being skipped to isolate > a few ZONE_DMA folios. Though the number of folios to be scanned is > bounded (32), there exists a genuine case where this can become > unbounded, i.e. in case where folios are skipped. > > Meminfo output shows that the free memory is around ~2% and page/buffer > cache grows very high when the lockup happens. > > MemTotal: 1584835956 kB > MemFree: 27805664 kB > MemAvailable: 1568099004 kB > Buffers: 1386120792 kB > Cached: 151894528 kB > SwapCached: 30620 kB > Active: 1043678892 kB > Inactive: 494456452 kB > > Often times, the perf output at the time of the problem shows heavy > contention on lruvec spin lock. Similar contention is also observed with > inode i_lock (in clear_shadow_entry path) > > 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > | > --98.96%--native_queued_spin_lock_slowpath > | > --98.96%--_raw_spin_lock_irqsave > folio_lruvec_lock_irqsave > | > --98.78%--folio_batch_move_lru > | > --98.63%--deactivate_file_folio > mapping_try_invalidate > invalidate_mapping_pages > invalidate_bdev > blkdev_common_ioctl > blkdev_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > > Some experiments tried > ====================== > 1) When MGLRU was enabled many soft lockups were observed, no hard > lockups were seen for 48 hours run. Below is once such soft lockup. This is not really an MGLRU issue -- can you please try one of the attached patches? It (truncate.patch) should help with or without MGLRU. > kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! 
[fio:2701649] > kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > 6.10.0-rc3-mglru-irqstrc #24 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_timer_fn+0x223/0x2b0 > kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > <SNIP> > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: _raw_spin_lock+0x38/0x50 > kernel: clear_shadow_entry+0x3d/0x100 > kernel: ? __pfx_workingset_update_node+0x10/0x10 > kernel: mapping_try_invalidate+0x117/0x1d0 > kernel: invalidate_mapping_pages+0x10/0x20 > kernel: invalidate_bdev+0x3c/0x50 > kernel: blkdev_common_ioctl+0x5f7/0xa90 > kernel: blkdev_ioctl+0x109/0x270 > kernel: x64_sys_call+0x1215/0x20d0 > kernel: do_syscall_64+0x7e/0x130 > > This happens to be contending on inode i_lock spinlock. > > Below preemptirqsoff trace points to preemption being disabled for more > than 10s and the lock in picture is lruvec spinlock. Also if you could try the other patch (mglru.patch) please. It should help reduce unnecessary rotations from deactivate_file_folio(), which in turn should reduce the contention on the LRU lock for MGLRU. > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > # -------------------------------------------------------------------- > # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > HP:0 #P:512) > # ----------------- > # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: deactivate_file_folio > # => ended at: deactivate_file_folio > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-2701523 128...1. 0us$: deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382681us : deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382683us : tracer_preempt_on > <-deactivate_file_folio > fio-2701523 128.N.1. 10382691us : <stack trace> > => deactivate_file_folio > => mapping_try_invalidate > => invalidate_mapping_pages > => invalidate_bdev > => blkdev_common_ioctl > => blkdev_ioctl > => __x64_sys_ioctl > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > 2) Increased low_watermark_threshold to 10% to prevent system from > entering into extremely low memory situation. Although hard lockups > weren't seen, but soft lockups (clear_shadow_entry()) were still seen. > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > the system. This was done to check if having more number of kswapd > threads working on lesser number of folios per node would make a > difference. However here too, multiple soft lockups were seen (in > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > Any insights/suggestion into these lockups and suggestions are welcome! Thanks! 
[-- Attachment #2: mglru.patch --] [-- Type: application/octet-stream, Size: 1202 bytes --] diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index d9a8a4affaaf..7d24d065aed8 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -182,6 +182,16 @@ static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen) return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1); } +static inline bool lru_gen_should_rotate(struct folio *folio) +{ + int gen = folio_lru_gen(folio); + int type = folio_is_file_lru(folio); + struct lruvec *lruvec = folio_lruvec(folio); + struct lru_gen_folio *lrugen = &lruvec->lrugen; + + return gen != lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type])); +} + static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio, int old_gen, int new_gen) { diff --git a/mm/swap.c b/mm/swap.c index 802681b3c857..e3dd092224ba 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -692,7 +692,7 @@ void deactivate_file_folio(struct folio *folio) struct folio_batch *fbatch; /* Deactivating an unevictable folio will not accelerate reclaim */ - if (folio_test_unevictable(folio)) + if (folio_test_unevictable(folio) || !lru_gen_should_rotate(folio)) return; folio_get(folio); [-- Attachment #3: truncate.patch --] [-- Type: application/octet-stream, Size: 3942 bytes --] diff --git a/mm/truncate.c b/mm/truncate.c index e99085bf3d34..545211cf6061 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -39,12 +39,24 @@ static inline void __clear_shadow_entry(struct address_space *mapping, xas_store(&xas, NULL); } -static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, - void *entry) +static void clear_shadow_entry(struct address_space *mapping, + struct folio_batch *fbatch, pgoff_t *indices) { + int i; + + if (shmem_mapping(mapping) || dax_mapping(mapping)) + return; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); + + for (i = 0; i < folio_batch_count(fbatch); i++) { + struct folio *folio = fbatch->folios[i]; + + if (xa_is_value(folio)) + __clear_shadow_entry(mapping, indices[i], folio); + } + xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_add_lru(mapping->host); @@ -105,36 +117,6 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping, fbatch->nr = j; } -/* - * Invalidate exceptional entry if easily possible. This handles exceptional - * entries for invalidate_inode_pages(). - */ -static int invalidate_exceptional_entry(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself, or for DAX we do nothing. */ - if (shmem_mapping(mapping) || dax_mapping(mapping)) - return 1; - clear_shadow_entry(mapping, index, entry); - return 1; -} - -/* - * Invalidate exceptional entry if clean. This handles exceptional entries for - * invalidate_inode_pages2() so for DAX it evicts only clean entries. - */ -static int invalidate_exceptional_entry2(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) - return 1; - if (dax_mapping(mapping)) - return dax_invalidate_mapping_entry_sync(mapping, index); - clear_shadow_entry(mapping, index, entry); - return 1; -} - /** * folio_invalidate - Invalidate part or all of a folio. * @folio: The folio which is affected. 
@@ -494,6 +476,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, unsigned long ret; unsigned long count = 0; int i; + bool xa_has_values = false; folio_batch_init(&fbatch); while (find_lock_entries(mapping, &index, end, &fbatch, indices)) { @@ -503,8 +486,8 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - count += invalidate_exceptional_entry(mapping, - indices[i], folio); + xa_has_values = true; + count++; continue; } @@ -522,6 +505,10 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, } count += ret; } + + if (xa_has_values) + clear_shadow_entry(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); @@ -616,6 +603,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, int ret = 0; int ret2 = 0; int did_range_unmap = 0; + bool xa_has_values = false; if (mapping_empty(mapping)) return 0; @@ -629,8 +617,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - if (!invalidate_exceptional_entry2(mapping, - indices[i], folio)) + xa_has_values = true; + if (dax_mapping(mapping) && + !dax_invalidate_mapping_entry_sync(mapping, indices[i])) ret = -EBUSY; continue; } @@ -666,6 +655,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping, ret = ret2; folio_unlock(folio); } + + if (xa_has_values) + clear_shadow_entry(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-06 22:42 ` Yu Zhao @ 2024-07-08 14:34 ` Bharata B Rao 2024-07-08 16:17 ` Yu Zhao 2024-07-10 12:03 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-08 14:34 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman Hi Yu Zhao, Thanks for your patches. See below... On 07-Jul-24 4:12 AM, Yu Zhao wrote: > Hi Bharata, > > On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >> <snip> >> >> Some experiments tried >> ====================== >> 1) When MGLRU was enabled many soft lockups were observed, no hard >> lockups were seen for 48 hours run. Below is once such soft lockup. > > This is not really an MGLRU issue -- can you please try one of the > attached patches? It (truncate.patch) should help with or without > MGLRU. With truncate.patch and default LRU scheme, a few hard lockups are seen. First one is this: watchdog: Watchdog detected hard LOCKUP on cpu 487 CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27 RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x81/0x300 </NMI> <TASK> ? __pfx_folio_activate_fn+0x10/0x10 _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 folio_batch_move_lru+0x9d/0x160 folio_activate+0x95/0xe0 folio_mark_accessed+0x11f/0x160 filemap_read+0x343/0x3d0 <SNIP> blkdev_read_iter+0x6f/0x140 vfs_read+0x25b/0x340 ksys_read+0x67/0xf0 __x64_sys_read+0x19/0x20 x64_sys_call+0x1771/0x20d0 This is the next one: watchdog: Watchdog detected hard LOCKUP on cpu 219 CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27 RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x2b4/0x300 </NMI> <TASK> _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 __page_cache_release+0x89/0x2f0 folios_put_refs+0x92/0x230 __folio_batch_release+0x74/0x90 truncate_inode_pages_range+0x16f/0x520 truncate_pagecache+0x49/0x70 ext4_setattr+0x326/0xaa0 notify_change+0x353/0x500 do_truncate+0x83/0xe0 path_openat+0xd9e/0x1090 do_filp_open+0xaa/0x150 do_sys_openat2+0x9b/0xd0 __x64_sys_openat+0x55/0x90 x64_sys_call+0xe55/0x20d0 do_syscall_64+0x7e/0x130 entry_SYSCALL_64_after_hwframe+0x76/0x7e When this happens, all-CPU backtrace shows a CPU being in isolate_lru_folios(). > >> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] >> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L >> 6.10.0-rc3-mglru-irqstrc #24 >> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 >> kernel: Call Trace: >> kernel: <IRQ> >> kernel: ? show_regs+0x69/0x80 >> kernel: ? watchdog_timer_fn+0x223/0x2b0 >> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 >> <SNIP> >> kernel: </IRQ> >> kernel: <TASK> >> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 >> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 >> kernel: _raw_spin_lock+0x38/0x50 >> kernel: clear_shadow_entry+0x3d/0x100 >> kernel: ? 
__pfx_workingset_update_node+0x10/0x10 >> kernel: mapping_try_invalidate+0x117/0x1d0 >> kernel: invalidate_mapping_pages+0x10/0x20 >> kernel: invalidate_bdev+0x3c/0x50 >> kernel: blkdev_common_ioctl+0x5f7/0xa90 >> kernel: blkdev_ioctl+0x109/0x270 >> kernel: x64_sys_call+0x1215/0x20d0 >> kernel: do_syscall_64+0x7e/0x130 >> >> This happens to be contending on inode i_lock spinlock. >> >> Below preemptirqsoff trace points to preemption being disabled for more >> than 10s and the lock in picture is lruvec spinlock. > > Also if you could try the other patch (mglru.patch) please. It should > help reduce unnecessary rotations from deactivate_file_folio(), which > in turn should reduce the contention on the LRU lock for MGLRU. Currently testing is in progress with mglru.patch and MGLRU enabled. Will get back on the results. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-08 14:34 ` Bharata B Rao @ 2024-07-08 16:17 ` Yu Zhao 2024-07-09 4:30 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Yu Zhao @ 2024-07-08 16:17 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > > Hi Yu Zhao, > > Thanks for your patches. See below... > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > > Hi Bharata, > > > > On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >> > <snip> > >> > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > > > > This is not really an MGLRU issue -- can you please try one of the > > attached patches? It (truncate.patch) should help with or without > > MGLRU. > > With truncate.patch and default LRU scheme, a few hard lockups are seen. Thanks. In your original report, you said: Most of the times the two contended locks are lruvec and inode->i_lock spinlocks. ... Often times, the perf output at the time of the problem shows heavy contention on lruvec spin lock. Similar contention is also observed with inode i_lock (in clear_shadow_entry path) Based on this new report, does it mean the i_lock is not as contended, for the same path (truncation) you tested? If so, I'll post truncate.patch and add reported-by and tested-by you, unless you have objections. The two paths below were contended on the LRU lock, but they already batch their operations. So I don't know what else we can do surgically to improve them. > First one is this: > > watchdog: Watchdog detected hard LOCKUP on cpu 487 > CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27 > RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x81/0x300 > </NMI> > <TASK> > ? __pfx_folio_activate_fn+0x10/0x10 > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_activate+0x95/0xe0 > folio_mark_accessed+0x11f/0x160 > filemap_read+0x343/0x3d0 > <SNIP> > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > > This is the next one: > > watchdog: Watchdog detected hard LOCKUP on cpu 219 > CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <TASK> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > __page_cache_release+0x89/0x2f0 > folios_put_refs+0x92/0x230 > __folio_batch_release+0x74/0x90 > truncate_inode_pages_range+0x16f/0x520 > truncate_pagecache+0x49/0x70 > ext4_setattr+0x326/0xaa0 > notify_change+0x353/0x500 > do_truncate+0x83/0xe0 > path_openat+0xd9e/0x1090 > do_filp_open+0xaa/0x150 > do_sys_openat2+0x9b/0xd0 > __x64_sys_openat+0x55/0x90 > x64_sys_call+0xe55/0x20d0 > do_syscall_64+0x7e/0x130 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > When this happens, all-CPU backtrace shows a CPU being in > isolate_lru_folios(). 
> > > > >> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] > >> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > >> 6.10.0-rc3-mglru-irqstrc #24 > >> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > >> kernel: Call Trace: > >> kernel: <IRQ> > >> kernel: ? show_regs+0x69/0x80 > >> kernel: ? watchdog_timer_fn+0x223/0x2b0 > >> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > >> <SNIP> > >> kernel: </IRQ> > >> kernel: <TASK> > >> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > >> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > >> kernel: _raw_spin_lock+0x38/0x50 > >> kernel: clear_shadow_entry+0x3d/0x100 > >> kernel: ? __pfx_workingset_update_node+0x10/0x10 > >> kernel: mapping_try_invalidate+0x117/0x1d0 > >> kernel: invalidate_mapping_pages+0x10/0x20 > >> kernel: invalidate_bdev+0x3c/0x50 > >> kernel: blkdev_common_ioctl+0x5f7/0xa90 > >> kernel: blkdev_ioctl+0x109/0x270 > >> kernel: x64_sys_call+0x1215/0x20d0 > >> kernel: do_syscall_64+0x7e/0x130 > >> > >> This happens to be contending on inode i_lock spinlock. > >> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Currently testing is in progress with mglru.patch and MGLRU enabled. > Will get back on the results. Thank you. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-08 16:17 ` Yu Zhao @ 2024-07-09 4:30 ` Bharata B Rao 2024-07-09 5:58 ` Yu Zhao 2024-07-17 9:37 ` Vlastimil Babka 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-09 4:30 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 08-Jul-24 9:47 PM, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >> >> Hi Yu Zhao, >> >> Thanks for your patches. See below... >> >> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>> Hi Bharata, >>> >>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >> <snip> >>>> >>>> Some experiments tried >>>> ====================== >>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>> >>> This is not really an MGLRU issue -- can you please try one of the >>> attached patches? It (truncate.patch) should help with or without >>> MGLRU. >> >> With truncate.patch and default LRU scheme, a few hard lockups are seen. > > Thanks. > > In your original report, you said: > > Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > ... > Often times, the perf output at the time of the problem shows > heavy contention on lruvec spin lock. Similar contention is > also observed with inode i_lock (in clear_shadow_entry path) > > Based on this new report, does it mean the i_lock is not as contended, > for the same path (truncation) you tested? If so, I'll post > truncate.patch and add reported-by and tested-by you, unless you have > objections. truncate.patch has been tested on two systems with default LRU scheme and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > The two paths below were contended on the LRU lock, but they already > batch their operations. So I don't know what else we can do surgically > to improve them. What has been seen with this workload is that the lruvec spinlock is held for a long time from shrink_[active/inactive]_list path. In this path, there is a case in isolate_lru_folios() where scanning of LRU lists can become unbounded. To isolate a page from ZONE_DMA, sometimes scanning/skipping of more than 150 million folios were seen. There is already a comment in there which explains why nr_skipped shouldn't be counted, but is there any possibility of re-looking at this condition? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 4:30 ` Bharata B Rao @ 2024-07-09 5:58 ` Yu Zhao 2024-07-11 5:43 ` Bharata B Rao 2024-08-13 11:04 ` Usama Arif 2024-07-17 9:37 ` Vlastimil Babka 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-09 5:58 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > > On 08-Jul-24 9:47 PM, Yu Zhao wrote: > > On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> Hi Yu Zhao, > >> > >> Thanks for your patches. See below... > >> > >> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>> Hi Bharata, > >>> > >>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>> > >> <snip> > >>>> > >>>> Some experiments tried > >>>> ====================== > >>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>> > >>> This is not really an MGLRU issue -- can you please try one of the > >>> attached patches? It (truncate.patch) should help with or without > >>> MGLRU. > >> > >> With truncate.patch and default LRU scheme, a few hard lockups are seen. > > > > Thanks. > > > > In your original report, you said: > > > > Most of the times the two contended locks are lruvec and > > inode->i_lock spinlocks. > > ... > > Often times, the perf output at the time of the problem shows > > heavy contention on lruvec spin lock. Similar contention is > > also observed with inode i_lock (in clear_shadow_entry path) > > > > Based on this new report, does it mean the i_lock is not as contended, > > for the same path (truncation) you tested? If so, I'll post > > truncate.patch and add reported-by and tested-by you, unless you have > > objections. > > truncate.patch has been tested on two systems with default LRU scheme > and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. Thanks. > > > > The two paths below were contended on the LRU lock, but they already > > batch their operations. So I don't know what else we can do surgically > > to improve them. > > What has been seen with this workload is that the lruvec spinlock is > held for a long time from shrink_[active/inactive]_list path. In this > path, there is a case in isolate_lru_folios() where scanning of LRU > lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > scanning/skipping of more than 150 million folios were seen. There is > already a comment in there which explains why nr_skipped shouldn't be > counted, but is there any possibility of re-looking at this condition? For this specific case, probably this can help: @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan, if (folio_zonenum(folio) > sc->reclaim_idx || skip_cma(folio, sc)) { nr_skipped[folio_zonenum(folio)] += nr_pages; - move_to = &folios_skipped; - goto move; + list_move(&folio->lru, &folios_skipped); + if (spin_is_contended(&lruvec->lru_lock)) { + if (!list_empty(dst)) + break; + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + continue; } ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 5:58 ` Yu Zhao @ 2024-07-11 5:43 ` Bharata B Rao 2024-07-15 5:19 ` Bharata B Rao 2024-08-13 11:04 ` Usama Arif 1 sibling, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-11 5:43 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 09-Jul-24 11:28 AM, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >> >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > Thanks. > >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is >> already a comment in there which explains why nr_skipped shouldn't be >> counted, but is there any possibility of re-looking at this condition? > > For this specific case, probably this can help: > > @@ -1659,8 +1659,15 @@ static unsigned long > isolate_lru_folios(unsigned long nr_to_scan, > if (folio_zonenum(folio) > sc->reclaim_idx || > skip_cma(folio, sc)) { > nr_skipped[folio_zonenum(folio)] += nr_pages; > - move_to = &folios_skipped; > - goto move; > + list_move(&folio->lru, &folios_skipped); > + if (spin_is_contended(&lruvec->lru_lock)) { > + if (!list_empty(dst)) > + break; > + spin_unlock_irq(&lruvec->lru_lock); > + cond_resched(); > + spin_lock_irq(&lruvec->lru_lock); > + } > + continue; > } Thanks, this helped. With this fix, the test ran for 24hrs without any lockups attributable to lruvec spinlock. 
As noted in this thread, earlier isolate_lru_folios() used to scan millions of folios and spend a lot of time with spinlock held but after this fix, such a scenario is no longer seen. However the contention seems to have shifted to other areas and these are the two MM related soft and hard lockups that were observed during this run: Soft lockup =========== watchdog: BUG: soft lockup - CPU#425 stuck for 12s! CPU: 425 PID: 145707 Comm: fio Kdump: loaded Tainted: G W 6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21 RIP: 0010:handle_softirqs+0x70/0x2f0 __rmqueue_pcplist+0x4ce/0x9a0 get_page_from_freelist+0x2e1/0x1650 __alloc_pages_noprof+0x1b4/0x12c0 alloc_pages_mpol_noprof+0xdd/0x200 folio_alloc_noprof+0x67/0xe0 Hard lockup =========== watchdog: Watchdog detected hard LOCKUP on cpu 296 CPU: 296 PID: 150155 Comm: fio Kdump: loaded Tainted: G W L 6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21 RIP: 0010:native_queued_spin_lock_slowpath+0x347/0x430 Call Trace: <NMI> ? watchdog_hardlockup_check+0x1a2/0x370 ? watchdog_overflow_callback+0x6d/0x80 <SNIP> native_queued_spin_lock_slowpath+0x347/0x430 </NMI> <IRQ> _raw_spin_lock_irqsave+0x46/0x60 free_unref_page+0x19f/0x540 ? __slab_free+0x2ab/0x2b0 __free_pages+0x9d/0xb0 __free_slab+0xa7/0xf0 free_slab+0x31/0x100 discard_slab+0x32/0x40 __put_partials+0xb8/0xe0 put_cpu_partial+0x5a/0x90 __slab_free+0x1d9/0x2b0 kfree+0x244/0x280 mempool_kfree+0x12/0x20 mempool_free+0x30/0x90 nvme_unmap_data+0xd0/0x150 [nvme] nvme_pci_complete_batch+0xaf/0xd0 [nvme] nvme_irq+0x96/0xe0 [nvme] __handle_irq_event_percpu+0x50/0x1b0 Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-11 5:43 ` Bharata B Rao @ 2024-07-15 5:19 ` Bharata B Rao 2024-07-19 20:21 ` Yu Zhao 2024-07-25 9:59 ` zhaoyang.huang 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-15 5:19 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On 11-Jul-24 11:13 AM, Bharata B Rao wrote: > On 09-Jul-24 11:28 AM, Yu Zhao wrote: >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >>> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>>> >>>>> Hi Yu Zhao, >>>>> >>>>> Thanks for your patches. See below... >>>>> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>>> Hi Bharata, >>>>>> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>>> >>>>> <snip> >>>>>>> >>>>>>> Some experiments tried >>>>>>> ====================== >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>>> >>>>>> This is not really an MGLRU issue -- can you please try one of the >>>>>> attached patches? It (truncate.patch) should help with or without >>>>>> MGLRU. >>>>> >>>>> With truncate.patch and default LRU scheme, a few hard lockups are >>>>> seen. >>>> >>>> Thanks. >>>> >>>> In your original report, you said: >>>> >>>> Most of the times the two contended locks are lruvec and >>>> inode->i_lock spinlocks. >>>> ... >>>> Often times, the perf output at the time of the problem shows >>>> heavy contention on lruvec spin lock. Similar contention is >>>> also observed with inode i_lock (in clear_shadow_entry path) >>>> >>>> Based on this new report, does it mean the i_lock is not as contended, >>>> for the same path (truncation) you tested? If so, I'll post >>>> truncate.patch and add reported-by and tested-by you, unless you have >>>> objections. >>> >>> truncate.patch has been tested on two systems with default LRU scheme >>> and the lockup due to inode->i_lock hasn't been seen yet after 24 >>> hours run. >> >> Thanks. >> >>>> >>>> The two paths below were contended on the LRU lock, but they already >>>> batch their operations. So I don't know what else we can do surgically >>>> to improve them. >>> >>> What has been seen with this workload is that the lruvec spinlock is >>> held for a long time from shrink_[active/inactive]_list path. In this >>> path, there is a case in isolate_lru_folios() where scanning of LRU >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >>> scanning/skipping of more than 150 million folios were seen. There is >>> already a comment in there which explains why nr_skipped shouldn't be >>> counted, but is there any possibility of re-looking at this condition? >> >> For this specific case, probably this can help: >> >> @@ -1659,8 +1659,15 @@ static unsigned long >> isolate_lru_folios(unsigned long nr_to_scan, >> if (folio_zonenum(folio) > sc->reclaim_idx || >> skip_cma(folio, sc)) { >> nr_skipped[folio_zonenum(folio)] += nr_pages; >> - move_to = &folios_skipped; >> - goto move; >> + list_move(&folio->lru, &folios_skipped); >> + if (spin_is_contended(&lruvec->lru_lock)) { >> + if (!list_empty(dst)) >> + break; >> + spin_unlock_irq(&lruvec->lru_lock); >> + cond_resched(); >> + spin_lock_irq(&lruvec->lru_lock); >> + } >> + continue; >> } > > Thanks, this helped. 
With this fix, the test ran for 24hrs without any > lockups attributable to lruvec spinlock. As noted in this thread, > earlier isolate_lru_folios() used to scan millions of folios and spend a > lot of time with spinlock held but after this fix, such a scenario is no > longer seen. However during the weekend mglru-enabled run (with above fix to isolate_lru_folios() and also the previous two patches: truncate.patch and mglru.patch and the inode fix provided by Mateusz), another hard lockup related to lruvec spinlock was observed. Here is the hardlock up: watchdog: Watchdog detected hard LOCKUP on cpu 466 CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x2b4/0x300 </NMI> <IRQ> _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 folio_batch_move_lru+0x9d/0x160 folio_rotate_reclaimable+0xab/0xf0 folio_end_writeback+0x60/0x90 end_buffer_async_write+0xaa/0xe0 end_bio_bh_io_sync+0x2c/0x50 bio_endio+0x108/0x180 blk_mq_end_request_batch+0x11f/0x5e0 nvme_pci_complete_batch+0xb5/0xd0 [nvme] nvme_irq+0x92/0xe0 [nvme] __handle_irq_event_percpu+0x6e/0x1e0 handle_irq_event+0x39/0x80 handle_edge_irq+0x8c/0x240 __common_interrupt+0x4e/0xf0 common_interrupt+0x49/0xc0 asm_common_interrupt+0x27/0x40 Here is the lock holder details captured by all-cpu-backtrace: NMI backtrace for cpu 75 CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 RIP: 0010:folio_inc_gen+0x142/0x430 Call Trace: <NMI> ? show_regs+0x69/0x80 ? nmi_cpu_backtrace+0xc5/0x130 ? nmi_cpu_backtrace_handler+0x11/0x20 ? nmi_handle+0x64/0x180 ? default_do_nmi+0x45/0x130 ? exc_nmi+0x128/0x1a0 ? end_repeat_nmi+0xf/0x53 ? folio_inc_gen+0x142/0x430 ? folio_inc_gen+0x142/0x430 ? folio_inc_gen+0x142/0x430 </NMI> <TASK> isolate_folios+0x954/0x1630 evict_folios+0xa5/0x8c0 try_to_shrink_lruvec+0x1be/0x320 shrink_one+0x10f/0x1d0 shrink_node+0xa4c/0xc90 do_try_to_free_pages+0xc0/0x590 try_to_free_pages+0xde/0x210 __alloc_pages_noprof+0x6ae/0x12c0 alloc_pages_mpol_noprof+0xd9/0x220 folio_alloc_noprof+0x63/0xe0 filemap_alloc_folio_noprof+0xf4/0x100 page_cache_ra_unbounded+0xb9/0x1a0 page_cache_ra_order+0x26e/0x310 ondemand_readahead+0x1a3/0x360 page_cache_sync_ra+0x83/0x90 filemap_get_pages+0xf0/0x6a0 filemap_read+0xe7/0x3d0 blkdev_read_iter+0x6f/0x140 vfs_read+0x25b/0x340 ksys_read+0x67/0xf0 __x64_sys_read+0x19/0x20 x64_sys_call+0x1771/0x20d0 do_syscall_64+0x7e/0x130 Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:19 ` Bharata B Rao @ 2024-07-19 20:21 ` Yu Zhao 2024-07-20 7:57 ` Mateusz Guzik 2024-07-22 4:12 ` Bharata B Rao 2024-07-25 9:59 ` zhaoyang.huang 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-19 20:21 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao <bharata@amd.com> wrote: > > On 11-Jul-24 11:13 AM, Bharata B Rao wrote: > > On 09-Jul-24 11:28 AM, Yu Zhao wrote: > >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > >>> > >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote: > >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>> > >>>>> Hi Yu Zhao, > >>>>> > >>>>> Thanks for your patches. See below... > >>>>> > >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>>>>> Hi Bharata, > >>>>>> > >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>>>> > >>>>> <snip> > >>>>>>> > >>>>>>> Some experiments tried > >>>>>>> ====================== > >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>>>>> > >>>>>> This is not really an MGLRU issue -- can you please try one of the > >>>>>> attached patches? It (truncate.patch) should help with or without > >>>>>> MGLRU. > >>>>> > >>>>> With truncate.patch and default LRU scheme, a few hard lockups are > >>>>> seen. > >>>> > >>>> Thanks. > >>>> > >>>> In your original report, you said: > >>>> > >>>> Most of the times the two contended locks are lruvec and > >>>> inode->i_lock spinlocks. > >>>> ... > >>>> Often times, the perf output at the time of the problem shows > >>>> heavy contention on lruvec spin lock. Similar contention is > >>>> also observed with inode i_lock (in clear_shadow_entry path) > >>>> > >>>> Based on this new report, does it mean the i_lock is not as contended, > >>>> for the same path (truncation) you tested? If so, I'll post > >>>> truncate.patch and add reported-by and tested-by you, unless you have > >>>> objections. > >>> > >>> truncate.patch has been tested on two systems with default LRU scheme > >>> and the lockup due to inode->i_lock hasn't been seen yet after 24 > >>> hours run. > >> > >> Thanks. > >> > >>>> > >>>> The two paths below were contended on the LRU lock, but they already > >>>> batch their operations. So I don't know what else we can do surgically > >>>> to improve them. > >>> > >>> What has been seen with this workload is that the lruvec spinlock is > >>> held for a long time from shrink_[active/inactive]_list path. In this > >>> path, there is a case in isolate_lru_folios() where scanning of LRU > >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > >>> scanning/skipping of more than 150 million folios were seen. There is > >>> already a comment in there which explains why nr_skipped shouldn't be > >>> counted, but is there any possibility of re-looking at this condition? 
> >> > >> For this specific case, probably this can help: > >> > >> @@ -1659,8 +1659,15 @@ static unsigned long > >> isolate_lru_folios(unsigned long nr_to_scan, > >> if (folio_zonenum(folio) > sc->reclaim_idx || > >> skip_cma(folio, sc)) { > >> nr_skipped[folio_zonenum(folio)] += nr_pages; > >> - move_to = &folios_skipped; > >> - goto move; > >> + list_move(&folio->lru, &folios_skipped); > >> + if (spin_is_contended(&lruvec->lru_lock)) { > >> + if (!list_empty(dst)) > >> + break; > >> + spin_unlock_irq(&lruvec->lru_lock); > >> + cond_resched(); > >> + spin_lock_irq(&lruvec->lru_lock); > >> + } > >> + continue; > >> } > > > > Thanks, this helped. With this fix, the test ran for 24hrs without any > > lockups attributable to lruvec spinlock. As noted in this thread, > > earlier isolate_lru_folios() used to scan millions of folios and spend a > > lot of time with spinlock held but after this fix, such a scenario is no > > longer seen. > > However during the weekend mglru-enabled run (with above fix to > isolate_lru_folios() and also the previous two patches: truncate.patch > and mglru.patch and the inode fix provided by Mateusz), another hard > lockup related to lruvec spinlock was observed. Thanks again for the stress tests. I can't come up with any reasonable band-aid at this moment, i.e., something not too ugly to work around a more fundamental scalability problem. Before I give up: what type of dirty data was written back to the nvme device? Was it page cache or swap? > Here is the hardlock up: > > watchdog: Watchdog detected hard LOCKUP on cpu 466 > CPU: 466 PID: 3103929 Comm: fio Not tainted > 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <IRQ> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_rotate_reclaimable+0xab/0xf0 > folio_end_writeback+0x60/0x90 > end_buffer_async_write+0xaa/0xe0 > end_bio_bh_io_sync+0x2c/0x50 > bio_endio+0x108/0x180 > blk_mq_end_request_batch+0x11f/0x5e0 > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > nvme_irq+0x92/0xe0 [nvme] > __handle_irq_event_percpu+0x6e/0x1e0 > handle_irq_event+0x39/0x80 > handle_edge_irq+0x8c/0x240 > __common_interrupt+0x4e/0xf0 > common_interrupt+0x49/0xc0 > asm_common_interrupt+0x27/0x40 > > Here is the lock holder details captured by all-cpu-backtrace: > > NMI backtrace for cpu 75 > CPU: 75 PID: 3095650 Comm: fio Not tainted > 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > RIP: 0010:folio_inc_gen+0x142/0x430 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? nmi_cpu_backtrace+0xc5/0x130 > ? nmi_cpu_backtrace_handler+0x11/0x20 > ? nmi_handle+0x64/0x180 > ? default_do_nmi+0x45/0x130 > ? exc_nmi+0x128/0x1a0 > ? end_repeat_nmi+0xf/0x53 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > ? 
folio_inc_gen+0x142/0x430 > </NMI> > <TASK> > isolate_folios+0x954/0x1630 > evict_folios+0xa5/0x8c0 > try_to_shrink_lruvec+0x1be/0x320 > shrink_one+0x10f/0x1d0 > shrink_node+0xa4c/0xc90 > do_try_to_free_pages+0xc0/0x590 > try_to_free_pages+0xde/0x210 > __alloc_pages_noprof+0x6ae/0x12c0 > alloc_pages_mpol_noprof+0xd9/0x220 > folio_alloc_noprof+0x63/0xe0 > filemap_alloc_folio_noprof+0xf4/0x100 > page_cache_ra_unbounded+0xb9/0x1a0 > page_cache_ra_order+0x26e/0x310 > ondemand_readahead+0x1a3/0x360 > page_cache_sync_ra+0x83/0x90 > filemap_get_pages+0xf0/0x6a0 > filemap_read+0xe7/0x3d0 > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > do_syscall_64+0x7e/0x130 > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
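The isolate_lru_folios() band-aid quoted above is an instance of the classic lock-break pattern: bound the time spent under a contended spinlock by dropping it, letting the waiters (and the scheduler) make progress, and reacquiring. Below is a minimal self-contained sketch of the idea; the names (process_list_bounded, handle_one, struct item) are made up for illustration and this is not the kernel code itself.

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct item {
	struct list_head node;
};

/* Stand-in for the per-item work done while the lock is held. */
static void handle_one(struct item *it)
{
}

static void process_list_bounded(struct list_head *list, spinlock_t *lock)
{
	spin_lock_irq(lock);
	while (!list_empty(list)) {
		struct item *it = list_first_entry(list, struct item, node);

		list_del(&it->node);
		handle_one(it);

		if (!spin_is_contended(lock))
			continue;

		/* Someone is spinning on us: hand the lock over briefly. */
		spin_unlock_irq(lock);
		cond_resched();
		spin_lock_irq(lock);
	}
	spin_unlock_irq(lock);
}

The part the real diff has to be careful about, and this sketch does not model, is that the LRU list can change while the lock is dropped, which is why the quoted change only breaks out of the loop once something has already been isolated (!list_empty(dst)).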
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 20:21 ` Yu Zhao @ 2024-07-20 7:57 ` Mateusz Guzik 2024-07-22 4:17 ` Bharata B Rao 2024-07-22 4:12 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-20 7:57 UTC (permalink / raw) To: Yu Zhao Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote: > I can't come up with any reasonable band-aid at this moment, i.e., > something not too ugly to work around a more fundamental scalability > problem. > > Before I give up: what type of dirty data was written back to the nvme > device? Was it page cache or swap? > With my corporate employee hat on, I would like to note a couple of three things. 1. there are definitely bugs here and someone(tm) should sort them out(R) however.... 2. the real goal is presumably to beat the kernel into shape where production kernels no longer suffer lockups running this workload on this hardware 3. the flamegraph (to be found in [1]) shows expensive debug enabled, notably for preemption count (search for preempt_count_sub to see) 4. I'm told the lruvec problem is being worked on (but no ETA) and I don't think the above justifies considering any hacks or otherwise putting more pressure on it It is plausible eliminating the aforementioned debug will be good enough. Apart from that I note percpu_counter_add_batch (+ irq debug) accounts for 5.8% cpu time. This will of course go down if irq tracing is disabled, but so happens I optimized this routine to be faster single-threaded (in particular by dodging the interrupt trip). The patch is hanging out in the mm tree [2] and is trivially applicable for testing. Even if none of the debug opts can get modified, this should drop percpu_counter_add_batch to 1.5% or so, which may or may not have a side effect of avoiding the lockup problem. [1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@amd.com/ [2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
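For readers who have not run into it, percpu_counter_add_batch() is the batched per-CPU counter primitive that shows up in the profile above. A minimal usage sketch follows; the counter and module here are made up purely for illustration.

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/percpu_counter.h>

static struct percpu_counter demo_counter;

static int __init demo_init(void)
{
	int err = percpu_counter_init(&demo_counter, 0, GFP_KERNEL);

	if (err)
		return err;

	/*
	 * Fast path: only this CPU's local delta is touched. The shared
	 * spinlock (and the cacheline bouncing that comes with it) is only
	 * taken once the local delta exceeds the batch, here 32.
	 */
	percpu_counter_add_batch(&demo_counter, 1, 32);

	pr_info("approx %lld, exact %lld\n",
		percpu_counter_read_positive(&demo_counter),
		percpu_counter_sum(&demo_counter));

	percpu_counter_destroy(&demo_counter);
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The point of the batch argument is that the shared lock is only touched when a CPU's accumulated delta exceeds the batch; per Mateusz's description above, his patch additionally avoids the interrupt disable/enable trip on that common fast path.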
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-20 7:57 ` Mateusz Guzik @ 2024-07-22 4:17 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-22 4:17 UTC (permalink / raw) To: Mateusz Guzik, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 20-Jul-24 1:27 PM, Mateusz Guzik wrote: > On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote: >> I can't come up with any reasonable band-aid at this moment, i.e., >> something not too ugly to work around a more fundamental scalability >> problem. >> >> Before I give up: what type of dirty data was written back to the nvme >> device? Was it page cache or swap? >> > > With my corporate employee hat on, I would like to note a couple of > three things. > > 1. there are definitely bugs here and someone(tm) should sort them out(R) > > however.... > > 2. the real goal is presumably to beat the kernel into shape where > production kernels no longer suffer lockups running this workload on > this hardware > 3. the flamegraph (to be found in [1]) shows expensive debug enabled, > notably for preemption count (search for preempt_count_sub to see) > 4. I'm told the lruvec problem is being worked on (but no ETA) and I > don't think the above justifies considering any hacks or otherwise > putting more pressure on it > > It is plausible eliminating the aforementioned debug will be good enough. > > Apart from that I note percpu_counter_add_batch (+ irq debug) accounts > for 5.8% cpu time. This will of course go down if irq tracing is > disabled, but so happens I optimized this routine to be faster > single-threaded (in particular by dodging the interrupt trip). The > patch is hanging out in the mm tree [2] and is trivially applicable > for testing. > > Even if none of the debug opts can get modified, this should drop > percpu_counter_add_batch to 1.5% or so, which may or may not have a > side effect of avoiding the lockup problem. Thanks, A few debug options were turned ON to gather debug data. Will do a full run once with them turned OFF and with the above percpu_counter_add_batch patch. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
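The thread does not spell out which debug options were enabled. Judging from preempt_count_sub() and IRQ-state tracing dominating the flamegraph, the likely candidates to flip off for a production-like run are the ones below; this is an assumption to be checked against the real .config, and the last two are what the earlier preemptirqsoff latency traces depend on.

# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set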
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 20:21 ` Yu Zhao 2024-07-20 7:57 ` Mateusz Guzik @ 2024-07-22 4:12 ` Bharata B Rao 1 sibling, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-22 4:12 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On 20-Jul-24 1:51 AM, Yu Zhao wrote: >> However during the weekend mglru-enabled run (with above fix to >> isolate_lru_folios() and also the previous two patches: truncate.patch >> and mglru.patch and the inode fix provided by Mateusz), another hard >> lockup related to lruvec spinlock was observed. > > Thanks again for the stress tests. > > I can't come up with any reasonable band-aid at this moment, i.e., > something not too ugly to work around a more fundamental scalability > problem. > > Before I give up: what type of dirty data was written back to the nvme > device? Was it page cache or swap? This is how a typical dstat report looks like when we start to see the problem with lruvec spinlock. ------memory-usage----- ----swap--- used free buff cach| used free| 14.3G 20.7G 1467G 185M| 938M 15G| 14.3G 20.0G 1468G 174M| 938M 15G| 14.3G 20.3G 1468G 184M| 938M 15G| 14.3G 19.8G 1468G 183M| 938M 15G| 14.3G 19.9G 1468G 183M| 938M 15G| 14.3G 19.5G 1468G 183M| 938M 15G| As you can see, most of the usage is in buffer cache and swap is hardly used. Just to recap from the original post... ==== FIO is run with a size of 1TB on each NVME partition with different combinations of ioengine/blocksize/mode parameters and buffered-IO. Selected FS tests from LTP are run on 256GB partitions of all NVME disks. This is the typical NVME partition layout. nvme2n1 259:4 0 3.5T 0 disk ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 └─nvme2n1p2 259:7 0 3.2T 0 part Though many different runs exist in the workload, the combination that results in the problem is buffered-IO run with sync engine. fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest ==== Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
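A direct way to answer the page-cache-vs-swap question while the workload runs is to sample the relevant /proc/vmstat counters. A small stand-alone sketch (plain userspace C, field names as they appear in /proc/vmstat):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		/* nr_dirty/nr_writeback: page cache; pswpout: swap-out. */
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "nr_dirty ", 9) ||
			    !strncmp(line, "nr_writeback ", 13) ||
			    !strncmp(line, "pswpout ", 8))
				fputs(line, stdout);
		}
		fclose(f);
		puts("--");
		sleep(5);
	}
}

Given that the fio job writes to the raw nvme*n1p2 partitions with buffered I/O, the dirty data being written back is block-device page cache (buffer heads), which matches the end_buffer_async_write() completion path in the lockup backtrace rather than the swap path.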
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:19 ` Bharata B Rao 2024-07-19 20:21 ` Yu Zhao @ 2024-07-25 9:59 ` zhaoyang.huang 2024-07-26 3:26 ` Zhaoyang Huang 1 sibling, 1 reply; 37+ messages in thread From: zhaoyang.huang @ 2024-07-25 9:59 UTC (permalink / raw) To: bharata Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, huangzhaoyang, steve.kang >However during the weekend mglru-enabled run (with above fix to >isolate_lru_folios() and also the previous two patches: truncate.patch >and mglru.patch and the inode fix provided by Mateusz), another hard >lockup related to lruvec spinlock was observed. > >Here is the hardlock up: > >watchdog: Watchdog detected hard LOCKUP on cpu 466 >CPU: 466 PID: 3103929 Comm: fio Not tainted >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 >RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 >Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 ><SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <IRQ> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_rotate_reclaimable+0xab/0xf0 > folio_end_writeback+0x60/0x90 > end_buffer_async_write+0xaa/0xe0 > end_bio_bh_io_sync+0x2c/0x50 > bio_endio+0x108/0x180 > blk_mq_end_request_batch+0x11f/0x5e0 > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > nvme_irq+0x92/0xe0 [nvme] > __handle_irq_event_percpu+0x6e/0x1e0 > handle_irq_event+0x39/0x80 > handle_edge_irq+0x8c/0x240 > __common_interrupt+0x4e/0xf0 > common_interrupt+0x49/0xc0 > asm_common_interrupt+0x27/0x40 > >Here is the lock holder details captured by all-cpu-backtrace: > >NMI backtrace for cpu 75 >CPU: 75 PID: 3095650 Comm: fio Not tainted >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 >RIP: 0010:folio_inc_gen+0x142/0x430 >Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? nmi_cpu_backtrace+0xc5/0x130 > ? nmi_cpu_backtrace_handler+0x11/0x20 > ? nmi_handle+0x64/0x180 > ? default_do_nmi+0x45/0x130 > ? exc_nmi+0x128/0x1a0 > ? end_repeat_nmi+0xf/0x53 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > </NMI> > <TASK> > isolate_folios+0x954/0x1630 > evict_folios+0xa5/0x8c0 > try_to_shrink_lruvec+0x1be/0x320 > shrink_one+0x10f/0x1d0 > shrink_node+0xa4c/0xc90 > do_try_to_free_pages+0xc0/0x590 > try_to_free_pages+0xde/0x210 > __alloc_pages_noprof+0x6ae/0x12c0 > alloc_pages_mpol_noprof+0xd9/0x220 > folio_alloc_noprof+0x63/0xe0 > filemap_alloc_folio_noprof+0xf4/0x100 > page_cache_ra_unbounded+0xb9/0x1a0 > page_cache_ra_order+0x26e/0x310 > ondemand_readahead+0x1a3/0x360 > page_cache_sync_ra+0x83/0x90 > filemap_get_pages+0xf0/0x6a0 > filemap_read+0xe7/0x3d0 > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > do_syscall_64+0x7e/0x130 From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? 
diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e34de9cd0d4..827036e21f24 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw return scanned; } +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) +{ + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + if (current_is_kswapd()) { + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) + set_bit(PGDAT_WRITEBACK, &pgdat->flags); + + /* Allow kswapd to start writing pages during reclaim.*/ + if (sc->nr.unqueued_dirty == sc->nr.file_taken) + set_bit(PGDAT_DIRTY, &pgdat->flags); + + if (sc->nr.immediate) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); + } + + /* + * Tag a node/memcg as congested if all the dirty pages were marked + * for writeback and immediate reclaim (counted in nr.congested). + * + * Legacy memcg will stall in page writeback so avoid forcibly + * stalling in reclaim_throttle(). + */ + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); + + if (current_is_kswapd()) + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); + } + + /* + * Stall direct reclaim for IO completions if the lruvec is + * node is congested. Allow kswapd to continue until it + * starts encountering unqueued dirty pages or cycling through + * the LRU too quickly. + */ + if (!current_is_kswapd() && current_may_throttle() && + !sc->hibernation_mode && + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); +} + static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) { int type; @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap retry: reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); sc->nr_reclaimed += reclaimed; + sc->nr.dirty += stat.nr_dirty; + sc->nr.congested += stat.nr_congested; + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; + sc->nr.writeback += stat.nr_writeback; + sc->nr.immediate += stat.nr_immediate; + sc->nr.taken += scanned; + + if (type) + sc->nr.file_taken += scanned; + trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, scanned, reclaimed, &stat, sc->priority, type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (lru_gen_enabled() && root_reclaim(sc)) { lru_gen_shrink_node(pgdat, sc); + lru_gen_throttle(pgdat, sc); return; } -- 2.25.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
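For context, reclaim_throttle(), which the proposed lru_gen_throttle() above feeds into, essentially parks the reclaiming task on a per-node waitqueue with a timeout until writeback progress (or the timeout) wakes it, instead of letting it immediately hammer the LRU lock again. Below is a stripped-down model of that idea, with made-up names, none of the real bookkeeping, and waitqueue initialization assumed to happen elsewhere; it is not the mm/vmscan.c implementation.

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/atomic.h>
#include <linux/jiffies.h>

struct toy_node_throttle {
	wait_queue_head_t wait;
	atomic_t nr_throttled;
};

/* Called by a direct reclaimer that found the node congested. */
static void toy_reclaim_throttle(struct toy_node_throttle *t)
{
	DEFINE_WAIT(wq_entry);

	atomic_inc(&t->nr_throttled);
	prepare_to_wait(&t->wait, &wq_entry, TASK_INTERRUPTIBLE);
	/* Sleep until woken by I/O completion, or for at most ~100ms. */
	schedule_timeout(HZ / 10);
	finish_wait(&t->wait, &wq_entry);
	atomic_dec(&t->nr_throttled);
}

/* Called from the writeback completion side when pages finish I/O. */
static void toy_reclaim_progress(struct toy_node_throttle *t)
{
	if (atomic_read(&t->nr_throttled))
		wake_up_all(&t->wait);
}

The open question in the thread is not the mechanism itself (the legacy path already does this via the PGDAT_*/LRUVEC_*_CONGESTED bits the patch sets) but whether those same signals are the right ones to wire into the MGLRU path.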
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-25 9:59 ` zhaoyang.huang @ 2024-07-26 3:26 ` Zhaoyang Huang 2024-07-29 4:49 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Zhaoyang Huang @ 2024-07-26 3:26 UTC (permalink / raw) To: zhaoyang.huang Cc: bharata, Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, steve.kang On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang <zhaoyang.huang@unisoc.com> wrote: > > >However during the weekend mglru-enabled run (with above fix to > >isolate_lru_folios() and also the previous two patches: truncate.patch > >and mglru.patch and the inode fix provided by Mateusz), another hard > >lockup related to lruvec spinlock was observed. > > > >Here is the hardlock up: > > > >watchdog: Watchdog detected hard LOCKUP on cpu 466 > >CPU: 466 PID: 3103929 Comm: fio Not tainted > >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > >RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > >Call Trace: > > <NMI> > > ? show_regs+0x69/0x80 > > ? watchdog_hardlockup_check+0x1b4/0x3a0 > ><SNIP> > > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > > </NMI> > > <IRQ> > > _raw_spin_lock_irqsave+0x5b/0x70 > > folio_lruvec_lock_irqsave+0x62/0x90 > > folio_batch_move_lru+0x9d/0x160 > > folio_rotate_reclaimable+0xab/0xf0 > > folio_end_writeback+0x60/0x90 > > end_buffer_async_write+0xaa/0xe0 > > end_bio_bh_io_sync+0x2c/0x50 > > bio_endio+0x108/0x180 > > blk_mq_end_request_batch+0x11f/0x5e0 > > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > > nvme_irq+0x92/0xe0 [nvme] > > __handle_irq_event_percpu+0x6e/0x1e0 > > handle_irq_event+0x39/0x80 > > handle_edge_irq+0x8c/0x240 > > __common_interrupt+0x4e/0xf0 > > common_interrupt+0x49/0xc0 > > asm_common_interrupt+0x27/0x40 > > > >Here is the lock holder details captured by all-cpu-backtrace: > > > >NMI backtrace for cpu 75 > >CPU: 75 PID: 3095650 Comm: fio Not tainted > >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > >RIP: 0010:folio_inc_gen+0x142/0x430 > >Call Trace: > > <NMI> > > ? show_regs+0x69/0x80 > > ? nmi_cpu_backtrace+0xc5/0x130 > > ? nmi_cpu_backtrace_handler+0x11/0x20 > > ? nmi_handle+0x64/0x180 > > ? default_do_nmi+0x45/0x130 > > ? exc_nmi+0x128/0x1a0 > > ? end_repeat_nmi+0xf/0x53 > > ? folio_inc_gen+0x142/0x430 > > ? folio_inc_gen+0x142/0x430 > > ? folio_inc_gen+0x142/0x430 > > </NMI> > > <TASK> > > isolate_folios+0x954/0x1630 > > evict_folios+0xa5/0x8c0 > > try_to_shrink_lruvec+0x1be/0x320 > > shrink_one+0x10f/0x1d0 > > shrink_node+0xa4c/0xc90 > > do_try_to_free_pages+0xc0/0x590 > > try_to_free_pages+0xde/0x210 > > __alloc_pages_noprof+0x6ae/0x12c0 > > alloc_pages_mpol_noprof+0xd9/0x220 > > folio_alloc_noprof+0x63/0xe0 > > filemap_alloc_folio_noprof+0xf4/0x100 > > page_cache_ra_unbounded+0xb9/0x1a0 > > page_cache_ra_order+0x26e/0x310 > > ondemand_readahead+0x1a3/0x360 > > page_cache_sync_ra+0x83/0x90 > > filemap_get_pages+0xf0/0x6a0 > > filemap_read+0xe7/0x3d0 > > blkdev_read_iter+0x6f/0x140 > > vfs_read+0x25b/0x340 > > ksys_read+0x67/0xf0 > > __x64_sys_read+0x19/0x20 > > x64_sys_call+0x1771/0x20d0 > > do_syscall_64+0x7e/0x130 > > From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2e34de9cd0d4..827036e21f24 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw > return scanned; > } > > +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) > +{ > + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); > + > + if (current_is_kswapd()) { > + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) > + set_bit(PGDAT_WRITEBACK, &pgdat->flags); > + > + /* Allow kswapd to start writing pages during reclaim.*/ > + if (sc->nr.unqueued_dirty == sc->nr.file_taken) > + set_bit(PGDAT_DIRTY, &pgdat->flags); > + > + if (sc->nr.immediate) > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); > + } > + > + /* > + * Tag a node/memcg as congested if all the dirty pages were marked > + * for writeback and immediate reclaim (counted in nr.congested). > + * > + * Legacy memcg will stall in page writeback so avoid forcibly > + * stalling in reclaim_throttle(). > + */ > + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { > + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) > + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); > + > + if (current_is_kswapd()) > + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); > + } > + > + /* > + * Stall direct reclaim for IO completions if the lruvec is > + * node is congested. Allow kswapd to continue until it > + * starts encountering unqueued dirty pages or cycling through > + * the LRU too quickly. > + */ > + if (!current_is_kswapd() && current_may_throttle() && > + !sc->hibernation_mode && > + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || > + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); > +} > + > static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) > { > int type; > @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap > retry: > reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); > sc->nr_reclaimed += reclaimed; > + sc->nr.dirty += stat.nr_dirty; > + sc->nr.congested += stat.nr_congested; > + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; > + sc->nr.writeback += stat.nr_writeback; > + sc->nr.immediate += stat.nr_immediate; > + sc->nr.taken += scanned; > + > + if (type) > + sc->nr.file_taken += scanned; > + > trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, > scanned, reclaimed, &stat, sc->priority, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > > if (lru_gen_enabled() && root_reclaim(sc)) { > lru_gen_shrink_node(pgdat, sc); > + lru_gen_throttle(pgdat, sc); > return; > } Hi Bharata, This patch arised from a regression Android test case failure which allocated 1GB virtual memory by each over 8 threads on an 5.5GB RAM system. This test could pass on legacy LRU management while failing under MGLRU as a watchdog monitor detected abnormal system-wide schedule status(watchdog can't be scheduled within 60 seconds). This patch with a slight change as below got passed in the test whereas has not been investigated deeply for how it was done. Theoretically, this patch enrolled the similar reclaim throttle mechanism as legacy do which could reduce the contention of lruvec->lru_lock. 
I think this patch is quite naive for now, but I am hoping it could help you, as your case looks like a scalability issue under memory pressure rather than a deadlock. Thank you!

The change in the applied version (throttle the reclaim before shrinking instead of after):

         if (lru_gen_enabled() && root_reclaim(sc)) {
+                lru_gen_throttle(pgdat, sc);
                 lru_gen_shrink_node(pgdat, sc);
-                lru_gen_throttle(pgdat, sc);
                 return;
         }

> > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-26 3:26 ` Zhaoyang Huang @ 2024-07-29 4:49 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-29 4:49 UTC (permalink / raw) To: Zhaoyang Huang, zhaoyang.huang Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, steve.kang On 26-Jul-24 8:56 AM, Zhaoyang Huang wrote: > On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang > <zhaoyang.huang@unisoc.com> wrote: <snip> >> From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 2e34de9cd0d4..827036e21f24 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw >> return scanned; >> } >> >> +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) >> +{ >> + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); >> + >> + if (current_is_kswapd()) { >> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) >> + set_bit(PGDAT_WRITEBACK, &pgdat->flags); >> + >> + /* Allow kswapd to start writing pages during reclaim.*/ >> + if (sc->nr.unqueued_dirty == sc->nr.file_taken) >> + set_bit(PGDAT_DIRTY, &pgdat->flags); >> + >> + if (sc->nr.immediate) >> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); >> + } >> + >> + /* >> + * Tag a node/memcg as congested if all the dirty pages were marked >> + * for writeback and immediate reclaim (counted in nr.congested). >> + * >> + * Legacy memcg will stall in page writeback so avoid forcibly >> + * stalling in reclaim_throttle(). >> + */ >> + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { >> + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) >> + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); >> + >> + if (current_is_kswapd()) >> + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); >> + } >> + >> + /* >> + * Stall direct reclaim for IO completions if the lruvec is >> + * node is congested. Allow kswapd to continue until it >> + * starts encountering unqueued dirty pages or cycling through >> + * the LRU too quickly. 
>> + */ >> + if (!current_is_kswapd() && current_may_throttle() && >> + !sc->hibernation_mode && >> + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || >> + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) >> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); >> +} >> + >> static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) >> { >> int type; >> @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap >> retry: >> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); >> sc->nr_reclaimed += reclaimed; >> + sc->nr.dirty += stat.nr_dirty; >> + sc->nr.congested += stat.nr_congested; >> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; >> + sc->nr.writeback += stat.nr_writeback; >> + sc->nr.immediate += stat.nr_immediate; >> + sc->nr.taken += scanned; >> + >> + if (type) >> + sc->nr.file_taken += scanned; >> + >> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, >> scanned, reclaimed, &stat, sc->priority, >> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); >> @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) >> >> if (lru_gen_enabled() && root_reclaim(sc)) { >> lru_gen_shrink_node(pgdat, sc); >> + lru_gen_throttle(pgdat, sc); >> return; >> } > Hi Bharata, > This patch arised from a regression Android test case failure which > allocated 1GB virtual memory by each over 8 threads on an 5.5GB RAM > system. This test could pass on legacy LRU management while failing > under MGLRU as a watchdog monitor detected abnormal system-wide > schedule status(watchdog can't be scheduled within 60 seconds). This > patch with a slight change as below got passed in the test whereas has > not been investigated deeply for how it was done. Theoretically, this > patch enrolled the similar reclaim throttle mechanism as legacy do > which could reduce the contention of lruvec->lru_lock. I think this > patch is quite naive for now, but I am hoping it could help you as > your case seems like a scability issue of memory pressure rather than > a deadlock issue. Thank you! > > the change of the applied version(try to throttle the reclaim before > instead of after) > if (lru_gen_enabled() && root_reclaim(sc)) { > + lru_gen_throttle(pgdat, sc); > lru_gen_shrink_node(pgdat, sc); > - lru_gen_throttle(pgdat, sc); > return; > } Thanks Zhaoyang Huang for the patch, will give this a test and report back. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 5:58 ` Yu Zhao 2024-07-11 5:43 ` Bharata B Rao @ 2024-08-13 11:04 ` Usama Arif 2024-08-13 17:43 ` Yu Zhao 1 sibling, 1 reply; 37+ messages in thread From: Usama Arif @ 2024-08-13 11:04 UTC (permalink / raw) To: Yu Zhao, Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, leitao On 09/07/2024 06:58, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >> >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > Thanks. > >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is >> already a comment in there which explains why nr_skipped shouldn't be >> counted, but is there any possibility of re-looking at this condition? > > For this specific case, probably this can help: > > @@ -1659,8 +1659,15 @@ static unsigned long > isolate_lru_folios(unsigned long nr_to_scan, > if (folio_zonenum(folio) > sc->reclaim_idx || > skip_cma(folio, sc)) { > nr_skipped[folio_zonenum(folio)] += nr_pages; > - move_to = &folios_skipped; > - goto move; > + list_move(&folio->lru, &folios_skipped); > + if (spin_is_contended(&lruvec->lru_lock)) { > + if (!list_empty(dst)) > + break; > + spin_unlock_irq(&lruvec->lru_lock); > + cond_resched(); > + spin_lock_irq(&lruvec->lru_lock); > + } > + continue; > } > Hi Yu, We are seeing lockups and high memory pressure in Meta production due to this lock contention as well. 
My colleague highlighted it in https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed to this fix. We removed the skip_cma check as a temporary measure, but this is a proper fix. I might have missed it, but I didn't see this sent as a patch to the mailing list. Just wanted to check whether you were planning to send it as a patch; happy to send it on your behalf as well. Thanks ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-08-13 11:04 ` Usama Arif @ 2024-08-13 17:43 ` Yu Zhao 0 siblings, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-08-13 17:43 UTC (permalink / raw) To: Usama Arif Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, leitao On Tue, Aug 13, 2024 at 5:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 09/07/2024 06:58, Yu Zhao wrote: > > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: > >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >>>> > >>>> Hi Yu Zhao, > >>>> > >>>> Thanks for your patches. See below... > >>>> > >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>>>> Hi Bharata, > >>>>> > >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>>> > >>>> <snip> > >>>>>> > >>>>>> Some experiments tried > >>>>>> ====================== > >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>>>> > >>>>> This is not really an MGLRU issue -- can you please try one of the > >>>>> attached patches? It (truncate.patch) should help with or without > >>>>> MGLRU. > >>>> > >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. > >>> > >>> Thanks. > >>> > >>> In your original report, you said: > >>> > >>> Most of the times the two contended locks are lruvec and > >>> inode->i_lock spinlocks. > >>> ... > >>> Often times, the perf output at the time of the problem shows > >>> heavy contention on lruvec spin lock. Similar contention is > >>> also observed with inode i_lock (in clear_shadow_entry path) > >>> > >>> Based on this new report, does it mean the i_lock is not as contended, > >>> for the same path (truncation) you tested? If so, I'll post > >>> truncate.patch and add reported-by and tested-by you, unless you have > >>> objections. > >> > >> truncate.patch has been tested on two systems with default LRU scheme > >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > > > Thanks. > > > >>> > >>> The two paths below were contended on the LRU lock, but they already > >>> batch their operations. So I don't know what else we can do surgically > >>> to improve them. > >> > >> What has been seen with this workload is that the lruvec spinlock is > >> held for a long time from shrink_[active/inactive]_list path. In this > >> path, there is a case in isolate_lru_folios() where scanning of LRU > >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > >> scanning/skipping of more than 150 million folios were seen. There is > >> already a comment in there which explains why nr_skipped shouldn't be > >> counted, but is there any possibility of re-looking at this condition? 
> > > > For this specific case, probably this can help: > > > > @@ -1659,8 +1659,15 @@ static unsigned long > > isolate_lru_folios(unsigned long nr_to_scan, > > if (folio_zonenum(folio) > sc->reclaim_idx || > > skip_cma(folio, sc)) { > > nr_skipped[folio_zonenum(folio)] += nr_pages; > > - move_to = &folios_skipped; > > - goto move; > > + list_move(&folio->lru, &folios_skipped); > > + if (spin_is_contended(&lruvec->lru_lock)) { > > + if (!list_empty(dst)) > > + break; > > + spin_unlock_irq(&lruvec->lru_lock); > > + cond_resched(); > > + spin_lock_irq(&lruvec->lru_lock); > > + } > > + continue; Nitpick: if () { ... if (!spin_is_contended(&lruvec->lru_lock)) continue; if (!list_empty(dst)) break; spin_unlock_irq(&lruvec->lru_lock); cond_resched(); spin_lock_irq(&lruvec->lru_lock); } > Hi Yu, > > We are seeing lockups and high memory pressure in Meta production due to this lock contention as well. My colleague highlighted it in https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed to this fix. > > We removed skip_cma check as a temporary measure, but this is a proper fix. I might have missed it but didn't see this as a patch on the mailing list. Just wanted to check if you were planning to send it as a patch? Happy to send it on your behalf as well. Please. Thank you. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 4:30 ` Bharata B Rao 2024-07-09 5:58 ` Yu Zhao @ 2024-07-17 9:37 ` Vlastimil Babka 2024-07-17 10:50 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Vlastimil Babka @ 2024-07-17 9:37 UTC (permalink / raw) To: Bharata B Rao, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik On 7/9/24 6:30 AM, Bharata B Rao wrote: > On 08-Jul-24 9:47 PM, Yu Zhao wrote: >> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>> >>> Hi Yu Zhao, >>> >>> Thanks for your patches. See below... >>> >>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>> Hi Bharata, >>>> >>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>> >>> <snip> >>>>> >>>>> Some experiments tried >>>>> ====================== >>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>> >>>> This is not really an MGLRU issue -- can you please try one of the >>>> attached patches? It (truncate.patch) should help with or without >>>> MGLRU. >>> >>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >> >> Thanks. >> >> In your original report, you said: >> >> Most of the times the two contended locks are lruvec and >> inode->i_lock spinlocks. >> ... >> Often times, the perf output at the time of the problem shows >> heavy contention on lruvec spin lock. Similar contention is >> also observed with inode i_lock (in clear_shadow_entry path) >> >> Based on this new report, does it mean the i_lock is not as contended, >> for the same path (truncation) you tested? If so, I'll post >> truncate.patch and add reported-by and tested-by you, unless you have >> objections. > > truncate.patch has been tested on two systems with default LRU scheme > and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > >> >> The two paths below were contended on the LRU lock, but they already >> batch their operations. So I don't know what else we can do surgically >> to improve them. > > What has been seen with this workload is that the lruvec spinlock is > held for a long time from shrink_[active/inactive]_list path. In this > path, there is a case in isolate_lru_folios() where scanning of LRU > lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > scanning/skipping of more than 150 million folios were seen. There is It seems weird to me to see anything that would require ZONE_DMA allocation on a modern system. Do you know where it comes from? > already a comment in there which explains why nr_skipped shouldn't be > counted, but is there any possibility of re-looking at this condition? > > Regards, > Bharata. > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:37 ` Vlastimil Babka @ 2024-07-17 10:50 ` Bharata B Rao 2024-07-17 11:15 ` Hillf Danton 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-17 10:50 UTC (permalink / raw) To: Vlastimil Babka, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: > On 7/9/24 6:30 AM, Bharata B Rao wrote: >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. >> >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is > > It seems weird to me to see anything that would require ZONE_DMA allocation > on a modern system. Do you know where it comes from? We measured the lruvec spinlock start, end and hold time(htime) using sched_clock(), along with a BUG() if the hold time was more than 10s. The below case shows that lruvec spin lock was held for ~25s. vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime 27963324369895, htime 25889317166 (time in ns) kernel BUG at include/linux/memcontrol.h:1677! Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W 6.10.0-rc3-qspindbg #10 RIP: 0010:shrink_active_list+0x40a/0x520 Call Trace: <TASK> shrink_lruvec+0x981/0x13b0 shrink_node+0x358/0xd30 balance_pgdat+0x3a3/0xa60 kswapd+0x207/0x3a0 kthread+0xe1/0x120 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1a/0x30 </TASK> As you can see the call stack is from kswapd but not sure what is the exact trigger. 
Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
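The hold-time measurement described above is local instrumentation rather than part of the posted patches. A hypothetical sketch of that kind of wrapper (made-up names, warning instead of BUG()) could look like:

#include <linux/sched/clock.h>
#include <linux/spinlock.h>
#include <linux/time64.h>
#include <linux/printk.h>

#define HOLD_WARN_NS	(10ULL * NSEC_PER_SEC)

struct timed_lock {
	spinlock_t lock;
	u64 locked_at;		/* sched_clock() at acquisition */
};

static inline void timed_lock_irq(struct timed_lock *tl)
{
	spin_lock_irq(&tl->lock);
	tl->locked_at = sched_clock();
}

static inline void timed_unlock_irq(struct timed_lock *tl)
{
	u64 held = sched_clock() - tl->locked_at;

	spin_unlock_irq(&tl->lock);
	if (unlikely(held > HOLD_WARN_NS))
		pr_err("lock held for %llu ns\n", held);
}

This answers how long the lock was held, not Vlastimil's question of which allocation ends up reclaiming with a ZONE_DMA-constrained reclaim_idx; that would need tracing on the allocation side as well.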
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 10:50 ` Bharata B Rao @ 2024-07-17 11:15 ` Hillf Danton 2024-07-18 9:02 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Hillf Danton @ 2024-07-17 11:15 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy, Mel Gorman On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com> > On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: > > > > It seems weird to me to see anything that would require ZONE_DMA allocation > > on a modern system. Do you know where it comes from? > > We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > What is more unusual could be observed perhaps with your hardware config but with 386MiB RAM assigned to each node, the so called tight memory but not extremely tight. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 11:15 ` Hillf Danton @ 2024-07-18 9:02 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-18 9:02 UTC (permalink / raw) To: Hillf Danton Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy, Mel Gorman, Dadhania, Nikunj, Upadhyay, Neeraj On 17-Jul-24 4:45 PM, Hillf Danton wrote: > On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com> >> On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: >>> >>> It seems weird to me to see anything that would require ZONE_DMA allocation >>> on a modern system. Do you know where it comes from? >> >> We measured the lruvec spinlock start, end and hold >> time(htime) using sched_clock(), along with a BUG() if the hold time was >> more than 10s. The below case shows that lruvec spin lock was held for ~25s. >> > What is more unusual could be observed perhaps with your hardware config but > with 386MiB RAM assigned to each node, the so called tight memory but not > extremely tight. Hardware config is this: Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) Memory: 1.5 TB 10 NVME - 3.5TB each available: 2 nodes (0-1) node 0 cpus: 0-127,256-383 node 0 size: 773727 MB node 1 cpus: 128-255,384-511 node 1 size: 773966 MB But I don't quite follow what you are hinting at, can you please rephrase or be more verbose? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-06 22:42 ` Yu Zhao 2024-07-08 14:34 ` Bharata B Rao @ 2024-07-10 12:03 ` Bharata B Rao 2024-07-10 12:24 ` Mateusz Guzik 2024-07-10 18:04 ` Yu Zhao 1 sibling, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-10 12:03 UTC (permalink / raw) To: Yu Zhao, mjguzik, david, kent.overstreet Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On 07-Jul-24 4:12 AM, Yu Zhao wrote: >> Some experiments tried >> ====================== >> 1) When MGLRU was enabled many soft lockups were observed, no hard >> lockups were seen for 48 hours run. Below is once such soft lockup. <snip> >> Below preemptirqsoff trace points to preemption being disabled for more >> than 10s and the lock in picture is lruvec spinlock. > > Also if you could try the other patch (mglru.patch) please. It should > help reduce unnecessary rotations from deactivate_file_folio(), which > in turn should reduce the contention on the LRU lock for MGLRU. Thanks. With mglru.patch on a MGLRU-enabled system, the below latency trace record is no longer seen for a 30hr workload run. > >> # tracer: preemptirqsoff >> # >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc >> # -------------------------------------------------------------------- >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 >> HP:0 #P:512) >> # ----------------- >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) >> # ----------------- >> # => started at: deactivate_file_folio >> # => ended at: deactivate_file_folio >> # >> # >> # _------=> CPU# >> # / _-----=> irqs-off/BH-disabled >> # | / _----=> need-resched >> # || / _---=> hardirq/softirq >> # ||| / _--=> preempt-depth >> # |||| / _-=> migrate-disable >> # ||||| / delay >> # cmd pid |||||| time | caller >> # \ / |||||| \ | / >> fio-2701523 128...1. 0us$: deactivate_file_folio >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382691us : <stack trace> >> => deactivate_file_folio >> => mapping_try_invalidate >> => invalidate_mapping_pages >> => invalidate_bdev >> => blkdev_common_ioctl >> => blkdev_ioctl >> => __x64_sys_ioctl >> => x64_sys_call >> => do_syscall_64 >> => entry_SYSCALL_64_after_hwframe However the contention now has shifted to inode_hash_lock. Around 55 softlockups in ilookup() were observed: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru # -------------------------------------------------------------------- # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:512) # ----------------- # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: ilookup # => ended at: ilookup # # # _------=> CPU# # / _-----=> irqs-off/BH-disabled # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / _-=> migrate-disable # ||||| / delay # cmd pid |||||| time | caller # \ / |||||| \ | / fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup fio-3244715 260.N.1. 
10620440us : <stack trace> => _raw_spin_unlock => ilookup => blkdev_get_no_open => blkdev_open => do_dentry_open => vfs_open => path_openat => do_filp_open => do_sys_openat2 => __x64_sys_openat => x64_sys_call => do_syscall_64 => entry_SYSCALL_64_after_hwframe It appears that scalability issues with inode_hash_lock has been brought up multiple times in the past and there were patches to address the same. https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ CC'ing FS folks/list for awareness/comments. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
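Of the two series linked above, the first (Dave Chinner's) replaces the single global inode_hash_lock with per-bucket bit spinlocks (hlist_bl), while the second (Mateusz's, continued below) makes the common lookups lock-free with RCU. A toy illustration of the per-bucket idea follows, with made-up types and hash size; it is not the VFS code.

#include <linux/list_bl.h>
#include <linux/hash.h>

#define TOY_HASH_BITS	10
static struct hlist_bl_head toy_hash[1 << TOY_HASH_BITS];

struct toy_inode {
	struct hlist_bl_node hash_node;
	unsigned long ino;
};

/* Insert and lookup serialize per bucket, not on one global spinlock. */
static void toy_insert(struct toy_inode *ti)
{
	struct hlist_bl_head *b = &toy_hash[hash_long(ti->ino, TOY_HASH_BITS)];

	hlist_bl_lock(b);
	hlist_bl_add_head(&ti->hash_node, b);
	hlist_bl_unlock(b);
}

static struct toy_inode *toy_lookup(unsigned long ino)
{
	struct hlist_bl_head *b = &toy_hash[hash_long(ino, TOY_HASH_BITS)];
	struct hlist_bl_node *pos;
	struct toy_inode *ti;

	hlist_bl_lock(b);
	hlist_bl_for_each_entry(ti, pos, b, hash_node) {
		if (ti->ino == ino) {
			hlist_bl_unlock(b);
			return ti;
		}
	}
	hlist_bl_unlock(b);
	return NULL;
}

With per-bucket locks, two CPUs opening different block devices no longer contend unless their inode numbers hash to the same bucket.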
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:03 ` Bharata B Rao @ 2024-07-10 12:24 ` Mateusz Guzik 2024-07-10 13:04 ` Mateusz Guzik 2024-07-10 18:04 ` Yu Zhao 1 sibling, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-10 12:24 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote: > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > <snip> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > trace record is no longer seen for a 30hr workload run. > > > > >> # tracer: preemptirqsoff > >> # > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > >> # -------------------------------------------------------------------- > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > >> HP:0 #P:512) > >> # ----------------- > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > >> # ----------------- > >> # => started at: deactivate_file_folio > >> # => ended at: deactivate_file_folio > >> # > >> # > >> # _------=> CPU# > >> # / _-----=> irqs-off/BH-disabled > >> # | / _----=> need-resched > >> # || / _---=> hardirq/softirq > >> # ||| / _--=> preempt-depth > >> # |||| / _-=> migrate-disable > >> # ||||| / delay > >> # cmd pid |||||| time | caller > >> # \ / |||||| \ | / > >> fio-2701523 128...1. 0us$: deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382691us : <stack trace> > >> => deactivate_file_folio > >> => mapping_try_invalidate > >> => invalidate_mapping_pages > >> => invalidate_bdev > >> => blkdev_common_ioctl > >> => blkdev_ioctl > >> => __x64_sys_ioctl > >> => x64_sys_call > >> => do_syscall_64 > >> => entry_SYSCALL_64_after_hwframe > > However the contention now has shifted to inode_hash_lock. Around 55 > softlockups in ilookup() were observed: > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > # -------------------------------------------------------------------- > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > #P:512) > # ----------------- > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: ilookup > # => ended at: ilookup > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > fio-3244715 260.N.1. 
10620429us : _raw_spin_unlock <-ilookup > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > fio-3244715 260.N.1. 10620440us : <stack trace> > => _raw_spin_unlock > => ilookup > => blkdev_get_no_open > => blkdev_open > => do_dentry_open > => vfs_open > => path_openat > => do_filp_open > => do_sys_openat2 > => __x64_sys_openat > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > It appears that scalability issues with inode_hash_lock has been brought > up multiple times in the past and there were patches to address the same. > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > CC'ing FS folks/list for awareness/comments. Note my patch does not enable RCU usage in ilookup, but this can be trivially added. I can't even compile-test at the moment, but the diff below should do it. Also note the patches are present here https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu , not yet integrated anywhere. That said, if fio you are operating on the same target inode every time then this is merely going to shift contention to the inode spinlock usage in find_inode_fast. diff --git a/fs/inode.c b/fs/inode.c index ad7844ca92f9..70b0e6383341 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) { struct hlist_head *head = inode_hashtable + hash(sb, ino); struct inode *inode; + again: - spin_lock(&inode_hash_lock); - inode = find_inode_fast(sb, head, ino, true); - spin_unlock(&inode_hash_lock); + inode = find_inode_fast(sb, head, ino, false); + if (IS_ERR_OR_NULL_PTR(inode)) { + spin_lock(&inode_hash_lock); + inode = find_inode_fast(sb, head, ino, true); + spin_unlock(&inode_hash_lock); + } if (inode) { if (IS_ERR(inode)) -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply related [flat|nested] 37+ messages in thread
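The ilookup() diff above follows the usual "optimistic RCU lookup, fall back to the lock when inconclusive" shape. Below is a generic self-contained sketch of that pattern with illustrative types; insertion and removal are assumed to happen under the same lock using hlist_add_head_rcu()/hlist_del_rcu(), and objects are assumed to be freed via RCU.

#include <linux/rculist.h>
#include <linux/refcount.h>
#include <linux/spinlock.h>

struct obj {
	struct hlist_node node;
	unsigned long key;
	refcount_t ref;
};

static struct obj *obj_lookup(struct hlist_head *head, spinlock_t *lock,
			      unsigned long key)
{
	struct obj *o;

	rcu_read_lock();
	hlist_for_each_entry_rcu(o, head, node) {
		if (o->key == key && refcount_inc_not_zero(&o->ref)) {
			rcu_read_unlock();
			return o;	/* fast path: global lock never taken */
		}
	}
	rcu_read_unlock();

	/* Slow path: a racing insert/teardown is possible; retry under the lock. */
	spin_lock(lock);
	hlist_for_each_entry(o, head, node) {
		if (o->key == key && refcount_inc_not_zero(&o->ref)) {
			spin_unlock(lock);
			return o;
		}
	}
	spin_unlock(lock);
	return NULL;
}

Even with that, as Mateusz notes above, the per-inode spinlock taken inside find_inode_fast() remains, so a single hot inode (here, one block device node) can still serialize the openers.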
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:24 ` Mateusz Guzik @ 2024-07-10 13:04 ` Mateusz Guzik 2024-07-15 5:22 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-10 13:04 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 2:24 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote: > > > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > > >> Some experiments tried > > >> ====================== > > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > > >> lockups were seen for 48 hours run. Below is once such soft lockup. > > <snip> > > >> Below preemptirqsoff trace points to preemption being disabled for more > > >> than 10s and the lock in picture is lruvec spinlock. > > > > > > Also if you could try the other patch (mglru.patch) please. It should > > > help reduce unnecessary rotations from deactivate_file_folio(), which > > > in turn should reduce the contention on the LRU lock for MGLRU. > > > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > > trace record is no longer seen for a 30hr workload run. > > > > > > > >> # tracer: preemptirqsoff > > >> # > > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > > >> # -------------------------------------------------------------------- > > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > > >> HP:0 #P:512) > > >> # ----------------- > > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > > >> # ----------------- > > >> # => started at: deactivate_file_folio > > >> # => ended at: deactivate_file_folio > > >> # > > >> # > > >> # _------=> CPU# > > >> # / _-----=> irqs-off/BH-disabled > > >> # | / _----=> need-resched > > >> # || / _---=> hardirq/softirq > > >> # ||| / _--=> preempt-depth > > >> # |||| / _-=> migrate-disable > > >> # ||||| / delay > > >> # cmd pid |||||| time | caller > > >> # \ / |||||| \ | / > > >> fio-2701523 128...1. 0us$: deactivate_file_folio > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382691us : <stack trace> > > >> => deactivate_file_folio > > >> => mapping_try_invalidate > > >> => invalidate_mapping_pages > > >> => invalidate_bdev > > >> => blkdev_common_ioctl > > >> => blkdev_ioctl > > >> => __x64_sys_ioctl > > >> => x64_sys_call > > >> => do_syscall_64 > > >> => entry_SYSCALL_64_after_hwframe > > > > However the contention now has shifted to inode_hash_lock. 
Around 55 > > softlockups in ilookup() were observed: > > > > # tracer: preemptirqsoff > > # > > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > > # -------------------------------------------------------------------- > > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > > #P:512) > > # ----------------- > > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > > # ----------------- > > # => started at: ilookup > > # => ended at: ilookup > > # > > # > > # _------=> CPU# > > # / _-----=> irqs-off/BH-disabled > > # | / _----=> need-resched > > # || / _---=> hardirq/softirq > > # ||| / _--=> preempt-depth > > # |||| / _-=> migrate-disable > > # ||||| / delay > > # cmd pid |||||| time | caller > > # \ / |||||| \ | / > > fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > > fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > > fio-3244715 260.N.1. 10620440us : <stack trace> > > => _raw_spin_unlock > > => ilookup > > => blkdev_get_no_open > > => blkdev_open > > => do_dentry_open > > => vfs_open > > => path_openat > > => do_filp_open > > => do_sys_openat2 > > => __x64_sys_openat > > => x64_sys_call > > => do_syscall_64 > > => entry_SYSCALL_64_after_hwframe > > > > It appears that scalability issues with inode_hash_lock has been brought > > up multiple times in the past and there were patches to address the same. > > > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > > > CC'ing FS folks/list for awareness/comments. > > Note my patch does not enable RCU usage in ilookup, but this can be > trivially added. > > I can't even compile-test at the moment, but the diff below should do > it. Also note the patches are present here > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu > , not yet integrated anywhere. > > That said, if fio you are operating on the same target inode every > time then this is merely going to shift contention to the inode > spinlock usage in find_inode_fast. > > diff --git a/fs/inode.c b/fs/inode.c > index ad7844ca92f9..70b0e6383341 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, > unsigned long ino) > { > struct hlist_head *head = inode_hashtable + hash(sb, ino); > struct inode *inode; > + > again: > - spin_lock(&inode_hash_lock); > - inode = find_inode_fast(sb, head, ino, true); > - spin_unlock(&inode_hash_lock); > + inode = find_inode_fast(sb, head, ino, false); > + if (IS_ERR_OR_NULL_PTR(inode)) { > + spin_lock(&inode_hash_lock); > + inode = find_inode_fast(sb, head, ino, true); > + spin_unlock(&inode_hash_lock); > + } > > if (inode) { > if (IS_ERR(inode)) > I think I expressed myself poorly, so here is take two: 1. inode hash soft lookup should get resolved if you apply https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 and the above pasted fix (not compile tested tho, but it should be obvious what the intended fix looks like) 2. find_inode_hash spinlocks the target inode. if your bench only operates on one, then contention is going to shift there and you may still be getting soft lockups. not taking the spinlock in this codepath is hackable, but I don't want to do it without a good justification. 
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
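The diff pasted above was not compile-tested, and IS_ERR_OR_NULL_PTR() does not exist in mainline; the standard helper in include/linux/err.h is IS_ERR_OR_NULL(). A minimal sketch of the intended lockless-first lookup, assuming the four-argument find_inode_fast() from the vfs.inode.rcu branch (its last parameter says whether inode_hash_lock is already held), could look like the following. This is an illustration of the diff's intent, not the patch that eventually got merged:

struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	struct inode *inode;

again:
	/*
	 * Lockless attempt first: with a false last argument the
	 * vfs.inode.rcu variant of find_inode_fast() is expected to walk
	 * the hash chain without taking inode_hash_lock.
	 */
	inode = find_inode_fast(sb, head, ino, false);
	if (IS_ERR_OR_NULL(inode)) {
		/* Fall back to the classic locked lookup. */
		spin_lock(&inode_hash_lock);
		inode = find_inode_fast(sb, head, ino, true);
		spin_unlock(&inode_hash_lock);
	}

	if (inode) {
		if (IS_ERR(inode))
			return NULL;
		wait_on_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
	}
	return inode;
}

As Mateusz notes in point 2, even with this in place find_inode_fast() still takes the found inode's i_lock, so a single hot inode can simply move the contention there.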
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 13:04 ` Mateusz Guzik @ 2024-07-15 5:22 ` Bharata B Rao 2024-07-15 6:48 ` Mateusz Guzik 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-15 5:22 UTC (permalink / raw) To: Mateusz Guzik Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On 10-Jul-24 6:34 PM, Mateusz Guzik wrote: >>> However the contention now has shifted to inode_hash_lock. Around 55 >>> softlockups in ilookup() were observed: >>> >>> # tracer: preemptirqsoff >>> # >>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru >>> # -------------------------------------------------------------------- >>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 >>> #P:512) >>> # ----------------- >>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) >>> # ----------------- >>> # => started at: ilookup >>> # => ended at: ilookup >>> # >>> # >>> # _------=> CPU# >>> # / _-----=> irqs-off/BH-disabled >>> # | / _----=> need-resched >>> # || / _---=> hardirq/softirq >>> # ||| / _--=> preempt-depth >>> # |||| / _-=> migrate-disable >>> # ||||| / delay >>> # cmd pid |||||| time | caller >>> # \ / |||||| \ | / >>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup >>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup >>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup >>> fio-3244715 260.N.1. 10620440us : <stack trace> >>> => _raw_spin_unlock >>> => ilookup >>> => blkdev_get_no_open >>> => blkdev_open >>> => do_dentry_open >>> => vfs_open >>> => path_openat >>> => do_filp_open >>> => do_sys_openat2 >>> => __x64_sys_openat >>> => x64_sys_call >>> => do_syscall_64 >>> => entry_SYSCALL_64_after_hwframe >>> >>> It appears that scalability issues with inode_hash_lock has been brought >>> up multiple times in the past and there were patches to address the same. >>> >>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ >>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ >>> >>> CC'ing FS folks/list for awareness/comments. >> >> Note my patch does not enable RCU usage in ilookup, but this can be >> trivially added. >> >> I can't even compile-test at the moment, but the diff below should do >> it. Also note the patches are present here >> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu >> , not yet integrated anywhere. >> >> That said, if fio you are operating on the same target inode every >> time then this is merely going to shift contention to the inode >> spinlock usage in find_inode_fast. >> >> diff --git a/fs/inode.c b/fs/inode.c >> index ad7844ca92f9..70b0e6383341 100644 >> --- a/fs/inode.c >> +++ b/fs/inode.c >> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, >> unsigned long ino) >> { >> struct hlist_head *head = inode_hashtable + hash(sb, ino); >> struct inode *inode; >> + >> again: >> - spin_lock(&inode_hash_lock); >> - inode = find_inode_fast(sb, head, ino, true); >> - spin_unlock(&inode_hash_lock); >> + inode = find_inode_fast(sb, head, ino, false); >> + if (IS_ERR_OR_NULL_PTR(inode)) { >> + spin_lock(&inode_hash_lock); >> + inode = find_inode_fast(sb, head, ino, true); >> + spin_unlock(&inode_hash_lock); >> + } >> >> if (inode) { >> if (IS_ERR(inode)) >> > > I think I expressed myself poorly, so here is take two: > 1. 
inode hash soft lookup should get resolved if you apply > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 > and the above pasted fix (not compile tested tho, but it should be > obvious what the intended fix looks like) > 2. find_inode_hash spinlocks the target inode. if your bench only > operates on one, then contention is going to shift there and you may > still be getting soft lockups. not taking the spinlock in this > codepath is hackable, but I don't want to do it without a good > justification. Thanks Mateusz for the fix. With this patch applied, the above mentioned contention in ilookup() has not been observed for a test run during the weekend. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:22 ` Bharata B Rao @ 2024-07-15 6:48 ` Mateusz Guzik 0 siblings, 0 replies; 37+ messages in thread From: Mateusz Guzik @ 2024-07-15 6:48 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Mon, Jul 15, 2024 at 7:22 AM Bharata B Rao <bharata@amd.com> wrote: > > On 10-Jul-24 6:34 PM, Mateusz Guzik wrote: > >>> However the contention now has shifted to inode_hash_lock. Around 55 > >>> softlockups in ilookup() were observed: > >>> > >>> # tracer: preemptirqsoff > >>> # > >>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > >>> # -------------------------------------------------------------------- > >>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > >>> #P:512) > >>> # ----------------- > >>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > >>> # ----------------- > >>> # => started at: ilookup > >>> # => ended at: ilookup > >>> # > >>> # > >>> # _------=> CPU# > >>> # / _-----=> irqs-off/BH-disabled > >>> # | / _----=> need-resched > >>> # || / _---=> hardirq/softirq > >>> # ||| / _--=> preempt-depth > >>> # |||| / _-=> migrate-disable > >>> # ||||| / delay > >>> # cmd pid |||||| time | caller > >>> # \ / |||||| \ | / > >>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > >>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > >>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > >>> fio-3244715 260.N.1. 10620440us : <stack trace> > >>> => _raw_spin_unlock > >>> => ilookup > >>> => blkdev_get_no_open > >>> => blkdev_open > >>> => do_dentry_open > >>> => vfs_open > >>> => path_openat > >>> => do_filp_open > >>> => do_sys_openat2 > >>> => __x64_sys_openat > >>> => x64_sys_call > >>> => do_syscall_64 > >>> => entry_SYSCALL_64_after_hwframe > >>> > >>> It appears that scalability issues with inode_hash_lock has been brought > >>> up multiple times in the past and there were patches to address the same. > >>> > >>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > >>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > >>> > >>> CC'ing FS folks/list for awareness/comments. > >> > >> Note my patch does not enable RCU usage in ilookup, but this can be > >> trivially added. > >> > >> I can't even compile-test at the moment, but the diff below should do > >> it. Also note the patches are present here > >> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu > >> , not yet integrated anywhere. > >> > >> That said, if fio you are operating on the same target inode every > >> time then this is merely going to shift contention to the inode > >> spinlock usage in find_inode_fast. 
> >> > >> diff --git a/fs/inode.c b/fs/inode.c > >> index ad7844ca92f9..70b0e6383341 100644 > >> --- a/fs/inode.c > >> +++ b/fs/inode.c > >> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, > >> unsigned long ino) > >> { > >> struct hlist_head *head = inode_hashtable + hash(sb, ino); > >> struct inode *inode; > >> + > >> again: > >> - spin_lock(&inode_hash_lock); > >> - inode = find_inode_fast(sb, head, ino, true); > >> - spin_unlock(&inode_hash_lock); > >> + inode = find_inode_fast(sb, head, ino, false); > >> + if (IS_ERR_OR_NULL_PTR(inode)) { > >> + spin_lock(&inode_hash_lock); > >> + inode = find_inode_fast(sb, head, ino, true); > >> + spin_unlock(&inode_hash_lock); > >> + } > >> > >> if (inode) { > >> if (IS_ERR(inode)) > >> > > > > I think I expressed myself poorly, so here is take two: > > 1. inode hash soft lookup should get resolved if you apply > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 > > and the above pasted fix (not compile tested tho, but it should be > > obvious what the intended fix looks like) > > 2. find_inode_hash spinlocks the target inode. if your bench only > > operates on one, then contention is going to shift there and you may > > still be getting soft lockups. not taking the spinlock in this > > codepath is hackable, but I don't want to do it without a good > > justification. > > Thanks Mateusz for the fix. With this patch applied, the above mentioned > contention in ilookup() has not been observed for a test run during the > weekend. > Ok, I'll do some clean ups and send a proper patch to the vfs folks later today. Thanks for testing. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:03 ` Bharata B Rao 2024-07-10 12:24 ` Mateusz Guzik @ 2024-07-10 18:04 ` Yu Zhao 1 sibling, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-10 18:04 UTC (permalink / raw) To: Bharata B Rao Cc: mjguzik, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 6:04 AM Bharata B Rao <bharata@amd.com> wrote: > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > <snip> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > trace record is no longer seen for a 30hr workload run. Glad to hear. Will post a patch and add you as reported/tested-by. > > > >> # tracer: preemptirqsoff > >> # > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > >> # -------------------------------------------------------------------- > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > >> HP:0 #P:512) > >> # ----------------- > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > >> # ----------------- > >> # => started at: deactivate_file_folio > >> # => ended at: deactivate_file_folio > >> # > >> # > >> # _------=> CPU# > >> # / _-----=> irqs-off/BH-disabled > >> # | / _----=> need-resched > >> # || / _---=> hardirq/softirq > >> # ||| / _--=> preempt-depth > >> # |||| / _-=> migrate-disable > >> # ||||| / delay > >> # cmd pid |||||| time | caller > >> # \ / |||||| \ | / > >> fio-2701523 128...1. 0us$: deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382691us : <stack trace> > >> => deactivate_file_folio > >> => mapping_try_invalidate > >> => invalidate_mapping_pages > >> => invalidate_bdev > >> => blkdev_common_ioctl > >> => blkdev_ioctl > >> => __x64_sys_ioctl > >> => x64_sys_call > >> => do_syscall_64 > >> => entry_SYSCALL_64_after_hwframe > > However the contention now has shifted to inode_hash_lock. Around 55 > softlockups in ilookup() were observed: This one is from fs/blk, so I'll leave it to those experts. > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > # -------------------------------------------------------------------- > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > #P:512) > # ----------------- > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: ilookup > # => ended at: ilookup > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-3244715 260...1. 
0us$: _raw_spin_lock <-ilookup > fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > fio-3244715 260.N.1. 10620440us : <stack trace> > => _raw_spin_unlock > => ilookup > => blkdev_get_no_open > => blkdev_open > => do_dentry_open > => vfs_open > => path_openat > => do_filp_open > => do_sys_openat2 > => __x64_sys_openat > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > It appears that scalability issues with inode_hash_lock has been brought > up multiple times in the past and there were patches to address the same. > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > CC'ing FS folks/list for awareness/comments. > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao 2024-07-06 22:42 ` Yu Zhao @ 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao ` (2 more replies) 1 sibling, 3 replies; 37+ messages in thread From: Vlastimil Babka @ 2024-07-17 9:42 UTC (permalink / raw) To: Bharata B Rao, linux-mm Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On 7/3/24 5:11 PM, Bharata B Rao wrote: > Many soft and hard lockups are seen with upstream kernel when running a > bunch of tests that include FIO and LTP filesystem test on 10 NVME > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > this was reported on a large customer VM instance with passthrough NVME > disks on older kernels(v5.4 based). However, similar problems were > reproduced when running the tests on bare metal with latest upstream > kernel (v6.10-rc3). Other lockups with different signatures are seen but > in this report, only those related to MM area are being discussed. > Also note that the subsequent description is related to the lockups in > bare metal upstream (and not VM). > > The general observation is that the problem usually surfaces when the > system free memory goes very low and page cache/buffer consumption hits > the ceiling. Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > > - Could this be a scalability issue in LRU list handling and/or page > cache invalidation typical to a large system configuration? Seems to me it could be (except that ZONE_DMA corner case) a general scalability issue in that you tweak some part of the kernel and the contention moves elsewhere. At least in MM we have per-node locks so this means 256 CPUs per lock? It used to be that there were not that many (cores/threads) per a physical CPU and its NUMA node, so many cpus would mean also more NUMA nodes where the locks contention would distribute among them. I think you could try fakenuma to create these nodes artificially and see if it helps for the MM part. But if the contention moves to e.g. an inode lock, I'm not sure what to do about that then. > - Are there any MM/FS tunables that could help here? > > Hardware configuration > ====================== > Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) > Memory: 1.5 TB > 10 NVME - 3.5TB each > available: 2 nodes (0-1) > node 0 cpus: 0-127,256-383 > node 0 size: 773727 MB > node 1 cpus: 128-255,384-511 > node 1 size: 773966 MB > > Workload details > ================ > Workload includes concurrent runs of FIO and a few FS tests from LTP. > > FIO is run with a size of 1TB on each NVME partition with different > combinations of ioengine/blocksize/mode parameters and buffered-IO. > Selected FS tests from LTP are run on 256GB partitions of all NVME > disks. This is the typical NVME partition layout. > > nvme2n1 259:4 0 3.5T 0 disk > ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 > └─nvme2n1p2 259:7 0 3.2T 0 part > > Though many different runs exist in the workload, the combination that > results in the problem is buffered-IO run with sync engine. 
> > fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ > -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ > -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest > > Watchdog threshold was reduced to 5s to reproduce the problem early and > all CPU backtrace enabled. > > Problem details and analysis > ============================ > One of the hard lockups which was observed and analyzed in detail is this: > > kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284 > kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <NMI> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_hardlockup_check+0x19e/0x360 > <SNIP> > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: </NMI> > kernel: <TASK> > kernel: ? __pfx_lru_add_fn+0x10/0x10 > kernel: _raw_spin_lock_irqsave+0x42/0x50 > kernel: folio_lruvec_lock_irqsave+0x62/0xb0 > kernel: folio_batch_move_lru+0x79/0x2a0 > kernel: folio_add_lru+0x6d/0xf0 > kernel: filemap_add_folio+0xba/0xe0 > kernel: __filemap_get_folio+0x137/0x2e0 > kernel: ext4_da_write_begin+0x12c/0x270 > kernel: generic_perform_write+0xbf/0x200 > kernel: ext4_buffered_write_iter+0x67/0xf0 > kernel: ext4_file_write_iter+0x70/0x780 > kernel: vfs_write+0x301/0x420 > kernel: ksys_write+0x67/0xf0 > kernel: __x64_sys_write+0x19/0x20 > kernel: x64_sys_call+0x1689/0x20d0 > kernel: do_syscall_64+0x6b/0x110 > kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: > 0033:0x7fe21c314887 > > With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock > acquisition. We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > > kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime > 27963324369895, htime 25889317166 > kernel: ------------[ cut here ]------------ > kernel: kernel BUG at include/linux/memcontrol.h:1677! > kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W > 6.10.0-rc3-qspindbg #10 > kernel: RIP: 0010:shrink_active_list+0x40a/0x520 > > And the corresponding trace point for the above: > kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: > classzone=0 order=0 nr_requested=1 nr_scanned=156946361 > nr_skipped=156946360 nr_taken=1 lru=active_file > > This shows that isolate_lru_folios() is scanning through a huge number > (~150million) of folios (order=0) with lruvec spinlock held. This is > happening because a large number of folios are being skipped to isolate > a few ZONE_DMA folios. Though the number of folios to be scanned is > bounded (32), there exists a genuine case where this can become > unbounded, i.e. in case where folios are skipped. > > Meminfo output shows that the free memory is around ~2% and page/buffer > cache grows very high when the lockup happens. > > MemTotal: 1584835956 kB > MemFree: 27805664 kB > MemAvailable: 1568099004 kB > Buffers: 1386120792 kB > Cached: 151894528 kB > SwapCached: 30620 kB > Active: 1043678892 kB > Inactive: 494456452 kB > > Often times, the perf output at the time of the problem shows heavy > contention on lruvec spin lock. 
Similar contention is also observed with > inode i_lock (in clear_shadow_entry path) > > 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > | > --98.96%--native_queued_spin_lock_slowpath > | > --98.96%--_raw_spin_lock_irqsave > folio_lruvec_lock_irqsave > | > --98.78%--folio_batch_move_lru > | > --98.63%--deactivate_file_folio > mapping_try_invalidate > invalidate_mapping_pages > invalidate_bdev > blkdev_common_ioctl > blkdev_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > > Some experiments tried > ====================== > 1) When MGLRU was enabled many soft lockups were observed, no hard > lockups were seen for 48 hours run. Below is once such soft lockup. > > kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] > kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > 6.10.0-rc3-mglru-irqstrc #24 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_timer_fn+0x223/0x2b0 > kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > <SNIP> > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: _raw_spin_lock+0x38/0x50 > kernel: clear_shadow_entry+0x3d/0x100 > kernel: ? __pfx_workingset_update_node+0x10/0x10 > kernel: mapping_try_invalidate+0x117/0x1d0 > kernel: invalidate_mapping_pages+0x10/0x20 > kernel: invalidate_bdev+0x3c/0x50 > kernel: blkdev_common_ioctl+0x5f7/0xa90 > kernel: blkdev_ioctl+0x109/0x270 > kernel: x64_sys_call+0x1215/0x20d0 > kernel: do_syscall_64+0x7e/0x130 > > This happens to be contending on inode i_lock spinlock. > > Below preemptirqsoff trace points to preemption being disabled for more > than 10s and the lock in picture is lruvec spinlock. > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > # -------------------------------------------------------------------- > # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > HP:0 #P:512) > # ----------------- > # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: deactivate_file_folio > # => ended at: deactivate_file_folio > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-2701523 128...1. 0us$: deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382681us : deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382683us : tracer_preempt_on > <-deactivate_file_folio > fio-2701523 128.N.1. 10382691us : <stack trace> > => deactivate_file_folio > => mapping_try_invalidate > => invalidate_mapping_pages > => invalidate_bdev > => blkdev_common_ioctl > => blkdev_ioctl > => __x64_sys_ioctl > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > 2) Increased low_watermark_threshold to 10% to prevent system from > entering into extremely low memory situation. Although hard lockups > weren't seen, but soft lockups (clear_shadow_entry()) were still seen. > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > the system. 
This was done to check if having more number of kswapd > threads working on lesser number of folios per node would make a > difference. However here too, multiple soft lockups were seen (in > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > Any insights/suggestion into these lockups and suggestions are welcome! > > Regards, > Bharata. > ^ permalink raw reply [flat|nested] 37+ messages in thread
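A note on the per-node locks Vlastimil mentions above: the lruvec lock seen in the backtraces is taken per (memcg, NUMA node) pair, so with only two nodes each lock is shared by roughly 256 cores / 512 hardware threads on this machine. A simplified sketch of the locking helper from the hard-lockup stack (approximate 6.10-era code, debug hooks omitted):

struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
					 unsigned long *flags)
{
	/*
	 * folio_lruvec() resolves to the LRU vector of the folio's memcg
	 * on the folio's node; all LRU list manipulation for folios of
	 * that memcg on that node serializes on this one spinlock.
	 */
	struct lruvec *lruvec = folio_lruvec(folio);

	spin_lock_irqsave(&lruvec->lru_lock, *flags);
	return lruvec;
}

Booting with something like numa=fake=8 (requires CONFIG_NUMA_EMU) creates additional fake nodes and hence additional lruvecs, which is one way to try the fakenuma suggestion without changing the BIOS NPS setting.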
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka @ 2024-07-17 10:31 ` Bharata B Rao 2024-07-17 16:44 ` Karim Manaouil 2024-07-17 11:29 ` Mateusz Guzik 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-17 10:31 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On 17-Jul-24 3:12 PM, Vlastimil Babka wrote: > On 7/3/24 5:11 PM, Bharata B Rao wrote: >> Many soft and hard lockups are seen with upstream kernel when running a >> bunch of tests that include FIO and LTP filesystem test on 10 NVME >> disks. The lockups can appear anywhere between 2 to 48 hours. Originally >> this was reported on a large customer VM instance with passthrough NVME >> disks on older kernels(v5.4 based). However, similar problems were >> reproduced when running the tests on bare metal with latest upstream >> kernel (v6.10-rc3). Other lockups with different signatures are seen but >> in this report, only those related to MM area are being discussed. >> Also note that the subsequent description is related to the lockups in >> bare metal upstream (and not VM). >> >> The general observation is that the problem usually surfaces when the >> system free memory goes very low and page cache/buffer consumption hits >> the ceiling. Most of the times the two contended locks are lruvec and >> inode->i_lock spinlocks. >> >> - Could this be a scalability issue in LRU list handling and/or page >> cache invalidation typical to a large system configuration? > > Seems to me it could be (except that ZONE_DMA corner case) a general > scalability issue in that you tweak some part of the kernel and the > contention moves elsewhere. At least in MM we have per-node locks so this > means 256 CPUs per lock? It used to be that there were not that many > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > mean also more NUMA nodes where the locks contention would distribute among > them. I think you could try fakenuma to create these nodes artificially and > see if it helps for the MM part. But if the contention moves to e.g. an > inode lock, I'm not sure what to do about that then. See below... > <SNIP> >> >> 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a >> socket can be further partitioned into smaller NUMA nodes. With NPS=4, >> there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in >> the system. This was done to check if having more number of kswapd >> threads working on lesser number of folios per node would make a >> difference. However here too, multiple soft lockups were seen (in >> clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. These are some softlockups seen with NPS4 mode. watchdog: BUG: soft lockup - CPU#315 stuck for 11s! 
[kworker/315:1H:5153] CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted 6.10.0-rc3-enbprftw #12 Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:handle_softirqs+0x70/0x2f0 Call Trace: <IRQ> __irq_exit_rcu+0x68/0x90 irq_exit_rcu+0x12/0x20 sysvec_apic_timer_interrupt+0x85/0xb0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1f/0x30 RIP: 0010:iommu_dma_map_page+0xca/0x2c0 dma_map_page_attrs+0x20d/0x2a0 nvme_prep_rq.part.0+0x63d/0x940 [nvme] nvme_queue_rq+0x82/0x210 [nvme] blk_mq_dispatch_rq_list+0x289/0x6d0 __blk_mq_sched_dispatch_requests+0x142/0x5f0 blk_mq_sched_dispatch_requests+0x36/0x70 blk_mq_run_work_fn+0x73/0x90 process_one_work+0x185/0x3d0 worker_thread+0x2ce/0x3e0 kthread+0xe5/0x120 ret_from_fork+0x3d/0x60 ret_from_fork_asm+0x1a/0x30 watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820] CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L 6.10.0-rc3-enbprftw #12 RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300 Call Trace: <IRQ> </IRQ> <TASK> _raw_spin_lock+0x2d/0x40 clear_shadow_entry+0x3d/0x100 mapping_try_invalidate+0x11b/0x1e0 invalidate_mapping_pages+0x14/0x20 invalidate_bdev+0x40/0x50 blkdev_common_ioctl+0x5f7/0xa90 blkdev_ioctl+0x10d/0x270 __x64_sys_ioctl+0x99/0xd0 x64_sys_call+0x1219/0x20d0 do_syscall_64+0x51/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fc92fc3ec6b </TASK> The above one (clear_shadow_entry) has since been fixed by Yu Zhao and fix is in mm tree. We had seen a couple of scenarios with zone lock contention from page free and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/ Would you have any insights on these? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 10:31 ` Bharata B Rao @ 2024-07-17 16:44 ` Karim Manaouil 0 siblings, 0 replies; 37+ messages in thread From: Karim Manaouil @ 2024-07-17 16:44 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On Wed, Jul 17, 2024 at 04:01:05PM +0530, Bharata B Rao wrote: > On 17-Jul-24 3:12 PM, Vlastimil Babka wrote: > > On 7/3/24 5:11 PM, Bharata B Rao wrote: > > > Many soft and hard lockups are seen with upstream kernel when running a > > > bunch of tests that include FIO and LTP filesystem test on 10 NVME > > > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > > > this was reported on a large customer VM instance with passthrough NVME > > > disks on older kernels(v5.4 based). However, similar problems were > > > reproduced when running the tests on bare metal with latest upstream > > > kernel (v6.10-rc3). Other lockups with different signatures are seen but > > > in this report, only those related to MM area are being discussed. > > > Also note that the subsequent description is related to the lockups in > > > bare metal upstream (and not VM). > > > > > > The general observation is that the problem usually surfaces when the > > > system free memory goes very low and page cache/buffer consumption hits > > > the ceiling. Most of the times the two contended locks are lruvec and > > > inode->i_lock spinlocks. > > > > > > - Could this be a scalability issue in LRU list handling and/or page > > > cache invalidation typical to a large system configuration? > > > > Seems to me it could be (except that ZONE_DMA corner case) a general > > scalability issue in that you tweak some part of the kernel and the > > contention moves elsewhere. At least in MM we have per-node locks so this > > means 256 CPUs per lock? It used to be that there were not that many > > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > > mean also more NUMA nodes where the locks contention would distribute among > > them. I think you could try fakenuma to create these nodes artificially and > > see if it helps for the MM part. But if the contention moves to e.g. an > > inode lock, I'm not sure what to do about that then. > > See below... > > > > <SNIP> > > > > > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > > > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > > > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > > > the system. This was done to check if having more number of kswapd > > > threads working on lesser number of folios per node would make a > > > difference. However here too, multiple soft lockups were seen (in > > > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > These are some softlockups seen with NPS4 mode. > > watchdog: BUG: soft lockup - CPU#315 stuck for 11s! 
[kworker/315:1H:5153] > CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted > 6.10.0-rc3-enbprftw #12 > Workqueue: kblockd blk_mq_run_work_fn > RIP: 0010:handle_softirqs+0x70/0x2f0 > Call Trace: > <IRQ> > __irq_exit_rcu+0x68/0x90 > irq_exit_rcu+0x12/0x20 > sysvec_apic_timer_interrupt+0x85/0xb0 > </IRQ> > <TASK> > asm_sysvec_apic_timer_interrupt+0x1f/0x30 > RIP: 0010:iommu_dma_map_page+0xca/0x2c0 > dma_map_page_attrs+0x20d/0x2a0 > nvme_prep_rq.part.0+0x63d/0x940 [nvme] > nvme_queue_rq+0x82/0x210 [nvme] > blk_mq_dispatch_rq_list+0x289/0x6d0 > __blk_mq_sched_dispatch_requests+0x142/0x5f0 > blk_mq_sched_dispatch_requests+0x36/0x70 > blk_mq_run_work_fn+0x73/0x90 > process_one_work+0x185/0x3d0 > worker_thread+0x2ce/0x3e0 > kthread+0xe5/0x120 > ret_from_fork+0x3d/0x60 > ret_from_fork_asm+0x1a/0x30 > > > watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820] > CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L > 6.10.0-rc3-enbprftw #12 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300 > Call Trace: > <IRQ> > </IRQ> > <TASK> > _raw_spin_lock+0x2d/0x40 > clear_shadow_entry+0x3d/0x100 > mapping_try_invalidate+0x11b/0x1e0 > invalidate_mapping_pages+0x14/0x20 > invalidate_bdev+0x40/0x50 > blkdev_common_ioctl+0x5f7/0xa90 > blkdev_ioctl+0x10d/0x270 > __x64_sys_ioctl+0x99/0xd0 > x64_sys_call+0x1219/0x20d0 > do_syscall_64+0x51/0x120 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fc92fc3ec6b > </TASK> > > The above one (clear_shadow_entry) has since been fixed by Yu Zhao and fix > is in mm tree. > > We had seen a couple of scenarios with zone lock contention from page free > and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/ > > Would you have any insights on these? Have you tried enabling memory interleaving policy for your workload? Karim PhD Student Edinburgh University ^ permalink raw reply [flat|nested] 37+ messages in thread
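For reference, "memory interleaving policy" here usually means running the workload under MPOL_INTERLEAVE, for example via numactl --interleave=all, so that allocations made in the benchmark's context (including its page cache fills) are spread round-robin across nodes instead of landing on one node first. A small hypothetical wrapper around set_mempolicy(2), not something posted in this thread, illustrates the idea (link with -lnuma):

#include <numaif.h>		/* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long nodemask = 0x3;	/* interleave over nodes 0 and 1 */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	/* The policy is inherited across exec, so launch the workload now. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}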
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao @ 2024-07-17 11:29 ` Mateusz Guzik 2024-07-18 9:00 ` Bharata B Rao 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-17 11:29 UTC (permalink / raw) To: Vlastimil Babka Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 7/3/24 5:11 PM, Bharata B Rao wrote: > > The general observation is that the problem usually surfaces when the > > system free memory goes very low and page cache/buffer consumption hits > > the ceiling. Most of the times the two contended locks are lruvec and > > inode->i_lock spinlocks. > > [snip mm stuff] There are numerous avoidable i_lock acquires (including some only showing up under load), but I don't know if they play any role in this particular test. Collecting all traces would definitely help, locked up or not, for example: bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count(); }' -o traces As for clear_shadow_entry mentioned in the opening mail, the content is: spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_add_lru(mapping->host); spin_unlock(&mapping->host->i_lock); so for all I know it's all about the xarray thing, not the i_lock per se. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 11:29 ` Mateusz Guzik @ 2024-07-18 9:00 ` Bharata B Rao 2024-07-18 12:11 ` Mateusz Guzik 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-18 9:00 UTC (permalink / raw) To: Mateusz Guzik, Vlastimil Babka Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 3538 bytes --] On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 7/3/24 5:11 PM, Bharata B Rao wrote: >>> The general observation is that the problem usually surfaces when the >>> system free memory goes very low and page cache/buffer consumption hits >>> the ceiling. Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> > [snip mm stuff] > > There are numerous avoidable i_lock acquires (including some only > showing up under load), but I don't know if they play any role in this > particular test. > > Collecting all traces would definitely help, locked up or not, for example: > bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count(); > }' -o traces Here are the top 3 traces collected while the full list from a 30s collection duration when the workload was running, is attached. @[ native_queued_spin_lock_slowpath+1 __remove_mapping+98 remove_mapping+22 mapping_evict_folio+118 mapping_try_invalidate+214 invalidate_mapping_pages+16 invalidate_bdev+60 blkdev_common_ioctl+1527 blkdev_ioctl+265 __x64_sys_ioctl+149 x64_sys_call+4629 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 1787212 @[ native_queued_spin_lock_slowpath+1 folio_wait_bit_common+205 filemap_get_pages+1543 filemap_read+231 blkdev_read_iter+111 aio_read+242 io_submit_one+546 __x64_sys_io_submit+132 x64_sys_call+6617 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 7922497 @[ native_queued_spin_lock_slowpath+1 clear_shadow_entry+92 mapping_try_invalidate+337 invalidate_mapping_pages+16 invalidate_bdev+60 blkdev_common_ioctl+1527 blkdev_ioctl+265 __x64_sys_ioctl+149 x64_sys_call+4629 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 10357614 > > As for clear_shadow_entry mentioned in the opening mail, the content is: > spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > __clear_shadow_entry(mapping, index, entry); > xa_unlock_irq(&mapping->i_pages); > if (mapping_shrinkable(mapping)) > inode_add_lru(mapping->host); > spin_unlock(&mapping->host->i_lock); > > so for all I know it's all about the xarray thing, not the i_lock per se. The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq and hence concluded it to be i_lock. Re-pasting the clear_shadow_entry softlockup here again: kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L 6.10.0-rc3-mglru-irqstrc #24 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 kernel: Call Trace: kernel: <IRQ> kernel: ? show_regs+0x69/0x80 kernel: ? watchdog_timer_fn+0x223/0x2b0 kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 <SNIP> kernel: </IRQ> kernel: <TASK> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 kernel: _raw_spin_lock+0x38/0x50 kernel: clear_shadow_entry+0x3d/0x100 kernel: ? 
__pfx_workingset_update_node+0x10/0x10 kernel: mapping_try_invalidate+0x117/0x1d0 kernel: invalidate_mapping_pages+0x10/0x20 kernel: invalidate_bdev+0x3c/0x50 kernel: blkdev_common_ioctl+0x5f7/0xa90 kernel: blkdev_ioctl+0x109/0x270 kernel: x64_sys_call+0x1215/0x20d0 kernel: do_syscall_64+0x7e/0x130 Regards, Bharata. [-- Attachment #2: traces.gz --] [-- Type: application/x-gzip, Size: 83505 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-18 9:00 ` Bharata B Rao @ 2024-07-18 12:11 ` Mateusz Guzik 2024-07-19 6:16 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-18 12:11 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > > On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > > As for clear_shadow_entry mentioned in the opening mail, the content is: > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > __clear_shadow_entry(mapping, index, entry); > > xa_unlock_irq(&mapping->i_pages); > > if (mapping_shrinkable(mapping)) > > inode_add_lru(mapping->host); > > spin_unlock(&mapping->host->i_lock); > > > > so for all I know it's all about the xarray thing, not the i_lock per se. > > The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > and hence concluded it to be i_lock. I'm not disputing it was i_lock. I am claiming that the i_pages is taken immediately after and it may be that in your workload this is the thing with the actual contention problem, making i_lock a red herring. I tried to match up offsets to my own kernel binary, but things went haywire. Can you please resolve a bunch of symbols, like this: ./scripts/faddr2line vmlinux clear_shadow_entry+92 and then paste the source code from reported lines? (I presume you are running with some local patches, so opening relevant files in my repo may still give bogus resutls) Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 Most notably in __remove_mapping i_lock is conditional: if (!folio_test_swapcache(folio)) spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); and the disasm of the offset in my case does not match either acquire. For all I know i_lock in this routine is *not* taken and all the queued up __remove_mapping callers increase i_lock -> i_pages wait times in clear_shadow_entry. To my cursory reading i_lock in clear_shadow_entry can be hacked away with some effort, but should this happen the contention is going to shift to i_pages presumably with more soft lockups (except on that lock). I am not convinced messing with it is justified. From looking at other places the i_lock is not a problem in other spots fwiw. All that said even if it is i_lock in both cases *and* someone whacks it, the mm folk should look into what happens when (maybe i_lock ->) i_pages lock is held. To that end perhaps you could provide a flamegraph or output of perf record -a -g, I don't know what's preferred. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-18 12:11 ` Mateusz Guzik @ 2024-07-19 6:16 ` Bharata B Rao 2024-07-19 7:06 ` Yu Zhao 2024-07-19 14:26 ` Mateusz Guzik 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-19 6:16 UTC (permalink / raw) To: Mateusz Guzik Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 4821 bytes --] On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: >> >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: >>> As for clear_shadow_entry mentioned in the opening mail, the content is: >>> spin_lock(&mapping->host->i_lock); >>> xa_lock_irq(&mapping->i_pages); >>> __clear_shadow_entry(mapping, index, entry); >>> xa_unlock_irq(&mapping->i_pages); >>> if (mapping_shrinkable(mapping)) >>> inode_add_lru(mapping->host); >>> spin_unlock(&mapping->host->i_lock); >>> >>> so for all I know it's all about the xarray thing, not the i_lock per se. >> >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq >> and hence concluded it to be i_lock. > > I'm not disputing it was i_lock. I am claiming that the i_pages is > taken immediately after and it may be that in your workload this is > the thing with the actual contention problem, making i_lock a red > herring. > > I tried to match up offsets to my own kernel binary, but things went haywire. > > Can you please resolve a bunch of symbols, like this: > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > and then paste the source code from reported lines? (I presume you are > running with some local patches, so opening relevant files in my repo > may still give bogus resutls) > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 clear_shadow_entry+92 $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 clear_shadow_entry+92/0x180: spin_lock_irq at include/linux/spinlock.h:376 (inlined by) clear_shadow_entry at mm/truncate.c:51 42 static void clear_shadow_entry(struct address_space *mapping, 43 struct folio_batch *fbatch, pgoff_t *indices) 44 { 45 int i; 46 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) 48 return; 49 50 spin_lock(&mapping->host->i_lock); 51 xa_lock_irq(&mapping->i_pages); __remove_mapping+98 $ ./scripts/faddr2line vmlinux __remove_mapping+98 __remove_mapping+98/0x230: spin_lock_irq at include/linux/spinlock.h:376 (inlined by) __remove_mapping at mm/vmscan.c:695 684 static int __remove_mapping(struct address_space *mapping, struct folio *folio, 685 bool reclaimed, struct mem_cgroup *target_memcg) 686 { 687 int refcount; 688 void *shadow = NULL; 689 690 BUG_ON(!folio_test_locked(folio)); 691 BUG_ON(mapping != folio_mapping(folio)); 692 693 if (!folio_test_swapcache(folio)) 694 spin_lock(&mapping->host->i_lock); 695 xa_lock_irq(&mapping->i_pages); __filemap_add_folio+332 $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 __filemap_add_folio+332/0x480: spin_lock_irq at include/linux/spinlock.h:377 (inlined by) __filemap_add_folio at mm/filemap.c:878 851 noinline int __filemap_add_folio(struct address_space *mapping, 852 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp) 853 { 854 XA_STATE(xas, &mapping->i_pages, index); ... 
874 for (;;) { 875 int order = -1, split_order = 0; 876 void *entry, *old = NULL; 877 878 xas_lock_irq(&xas); 879 xas_for_each_conflict(&xas, entry) { > > Most notably in __remove_mapping i_lock is conditional: > if (!folio_test_swapcache(folio)) > spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > > and the disasm of the offset in my case does not match either acquire. > For all I know i_lock in this routine is *not* taken and all the > queued up __remove_mapping callers increase i_lock -> i_pages wait > times in clear_shadow_entry. So the first two are on i_pages lock and the last one is xa_lock. > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > with some effort, but should this happen the contention is going to > shift to i_pages presumably with more soft lockups (except on that > lock). I am not convinced messing with it is justified. From looking > at other places the i_lock is not a problem in other spots fwiw. > > All that said even if it is i_lock in both cases *and* someone whacks > it, the mm folk should look into what happens when (maybe i_lock ->) > i_pages lock is held. To that end perhaps you could provide a > flamegraph or output of perf record -a -g, I don't know what's > preferred. I have attached the flamegraph but this is for the kernel that has been running with all the accumulated fixes so far. The original one (w/o fixes) did show considerable time spent on native_queued_spin_lock_slowpath but unfortunately unable to locate it now. Regards, Bharata. [-- Attachment #2: perf 1.svg --] [-- Type: image/svg+xml, Size: 1215900 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 6:16 ` Bharata B Rao @ 2024-07-19 7:06 ` Yu Zhao 2024-07-19 14:26 ` Mateusz Guzik 1 sibling, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-19 7:06 UTC (permalink / raw) To: Bharata B Rao Cc: Mateusz Guzik, Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 12:16 AM Bharata B Rao <bharata@amd.com> wrote: > > On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > >>> As for clear_shadow_entry mentioned in the opening mail, the content is: > >>> spin_lock(&mapping->host->i_lock); > >>> xa_lock_irq(&mapping->i_pages); > >>> __clear_shadow_entry(mapping, index, entry); > >>> xa_unlock_irq(&mapping->i_pages); > >>> if (mapping_shrinkable(mapping)) > >>> inode_add_lru(mapping->host); > >>> spin_unlock(&mapping->host->i_lock); > >>> > >>> so for all I know it's all about the xarray thing, not the i_lock per se. > >> > >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > >> and hence concluded it to be i_lock. > > > > I'm not disputing it was i_lock. I am claiming that the i_pages is > > taken immediately after and it may be that in your workload this is > > the thing with the actual contention problem, making i_lock a red > > herring. > > > > I tried to match up offsets to my own kernel binary, but things went haywire. > > > > Can you please resolve a bunch of symbols, like this: > > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > > > and then paste the source code from reported lines? (I presume you are > > running with some local patches, so opening relevant files in my repo > > may still give bogus resutls) > > > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 > > clear_shadow_entry+92 > > $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 > clear_shadow_entry+92/0x180: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) clear_shadow_entry at mm/truncate.c:51 > > 42 static void clear_shadow_entry(struct address_space *mapping, > 43 struct folio_batch *fbatch, pgoff_t > *indices) > 44 { > 45 int i; > 46 > 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) > 48 return; > 49 > 50 spin_lock(&mapping->host->i_lock); > 51 xa_lock_irq(&mapping->i_pages); > > > __remove_mapping+98 > > $ ./scripts/faddr2line vmlinux __remove_mapping+98 > __remove_mapping+98/0x230: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) __remove_mapping at mm/vmscan.c:695 > > 684 static int __remove_mapping(struct address_space *mapping, struct > folio *folio, > 685 bool reclaimed, struct mem_cgroup > *target_memcg) > 686 { > 687 int refcount; > 688 void *shadow = NULL; > 689 > 690 BUG_ON(!folio_test_locked(folio)); > 691 BUG_ON(mapping != folio_mapping(folio)); > 692 > 693 if (!folio_test_swapcache(folio)) > 694 spin_lock(&mapping->host->i_lock); > 695 xa_lock_irq(&mapping->i_pages); > > > __filemap_add_folio+332 > > $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 > __filemap_add_folio+332/0x480: > spin_lock_irq at include/linux/spinlock.h:377 > (inlined by) __filemap_add_folio at mm/filemap.c:878 > > 851 noinline int __filemap_add_folio(struct address_space *mapping, > 852 struct folio *folio, pgoff_t index, gfp_t gfp, void > **shadowp) > 853 { > 854 XA_STATE(xas, &mapping->i_pages, index); > ... 
> 874 for (;;) { > 875 int order = -1, split_order = 0; > 876 void *entry, *old = NULL; > 877 > 878 xas_lock_irq(&xas); > 879 xas_for_each_conflict(&xas, entry) { > > > > > Most notably in __remove_mapping i_lock is conditional: > > if (!folio_test_swapcache(folio)) > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > > > and the disasm of the offset in my case does not match either acquire. > > For all I know i_lock in this routine is *not* taken and all the > > queued up __remove_mapping callers increase i_lock -> i_pages wait > > times in clear_shadow_entry. > > So the first two are on i_pages lock and the last one is xa_lock. Isn't xa_lock also i_pages->xa_lock, i.e., the same lock? > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > > with some effort, but should this happen the contention is going to > > shift to i_pages presumably with more soft lockups (except on that > > lock). I am not convinced messing with it is justified. From looking > > at other places the i_lock is not a problem in other spots fwiw. > > > > All that said even if it is i_lock in both cases *and* someone whacks > > it, the mm folk should look into what happens when (maybe i_lock ->) > > i_pages lock is held. To that end perhaps you could provide a > > flamegraph or output of perf record -a -g, I don't know what's > > preferred. > > I have attached the flamegraph but this is for the kernel that has been > running with all the accumulated fixes so far. The original one (w/o > fixes) did show considerable time spent on > native_queued_spin_lock_slowpath but unfortunately unable to locate it now. > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
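On the question above: yes, they are the same lock. The xas_lock_irq() taken in __filemap_add_folio() and the xa_lock_irq(&mapping->i_pages) taken in clear_shadow_entry() and __remove_mapping() both resolve to the spinlock embedded in the i_pages xarray. Roughly, from include/linux/xarray.h (6.10-era definitions):

#define xa_lock_irq(xa)		spin_lock_irq(&(xa)->xa_lock)
#define xa_unlock_irq(xa)	spin_unlock_irq(&(xa)->xa_lock)

#define xas_lock_irq(xas)	xa_lock_irq((xas)->xa)
#define xas_unlock_irq(xas)	xa_unlock_irq((xas)->xa)

/*
 * XA_STATE(xas, &mapping->i_pages, index) initializes xas.xa to
 * &mapping->i_pages, so xas_lock_irq(&xas) in __filemap_add_folio() takes
 * the same lock as xa_lock_irq(&mapping->i_pages) elsewhere: all three
 * call sites resolved above contend on mapping->i_pages.xa_lock.
 */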
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 6:16 ` Bharata B Rao 2024-07-19 7:06 ` Yu Zhao @ 2024-07-19 14:26 ` Mateusz Guzik 1 sibling, 0 replies; 37+ messages in thread From: Mateusz Guzik @ 2024-07-19 14:26 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 8:16 AM Bharata B Rao <bharata@amd.com> wrote: > > On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > >>> As for clear_shadow_entry mentioned in the opening mail, the content is: > >>> spin_lock(&mapping->host->i_lock); > >>> xa_lock_irq(&mapping->i_pages); > >>> __clear_shadow_entry(mapping, index, entry); > >>> xa_unlock_irq(&mapping->i_pages); > >>> if (mapping_shrinkable(mapping)) > >>> inode_add_lru(mapping->host); > >>> spin_unlock(&mapping->host->i_lock); > >>> > >>> so for all I know it's all about the xarray thing, not the i_lock per se. > >> > >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > >> and hence concluded it to be i_lock. > > > > I'm not disputing it was i_lock. I am claiming that the i_pages is > > taken immediately after and it may be that in your workload this is > > the thing with the actual contention problem, making i_lock a red > > herring. > > > > I tried to match up offsets to my own kernel binary, but things went haywire. > > > > Can you please resolve a bunch of symbols, like this: > > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > > > and then paste the source code from reported lines? (I presume you are > > running with some local patches, so opening relevant files in my repo > > may still give bogus resutls) > > > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 > > clear_shadow_entry+92 > > $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 > clear_shadow_entry+92/0x180: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) clear_shadow_entry at mm/truncate.c:51 > > 42 static void clear_shadow_entry(struct address_space *mapping, > 43 struct folio_batch *fbatch, pgoff_t > *indices) > 44 { > 45 int i; > 46 > 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) > 48 return; > 49 > 50 spin_lock(&mapping->host->i_lock); > 51 xa_lock_irq(&mapping->i_pages); > > > __remove_mapping+98 > > $ ./scripts/faddr2line vmlinux __remove_mapping+98 > __remove_mapping+98/0x230: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) __remove_mapping at mm/vmscan.c:695 > > 684 static int __remove_mapping(struct address_space *mapping, struct > folio *folio, > 685 bool reclaimed, struct mem_cgroup > *target_memcg) > 686 { > 687 int refcount; > 688 void *shadow = NULL; > 689 > 690 BUG_ON(!folio_test_locked(folio)); > 691 BUG_ON(mapping != folio_mapping(folio)); > 692 > 693 if (!folio_test_swapcache(folio)) > 694 spin_lock(&mapping->host->i_lock); > 695 xa_lock_irq(&mapping->i_pages); > > > __filemap_add_folio+332 > > $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 > __filemap_add_folio+332/0x480: > spin_lock_irq at include/linux/spinlock.h:377 > (inlined by) __filemap_add_folio at mm/filemap.c:878 > > 851 noinline int __filemap_add_folio(struct address_space *mapping, > 852 struct folio *folio, pgoff_t index, gfp_t gfp, void > **shadowp) > 853 { > 854 XA_STATE(xas, &mapping->i_pages, index); > ... 
> 874 for (;;) { > 875 int order = -1, split_order = 0; > 876 void *entry, *old = NULL; > 877 > 878 xas_lock_irq(&xas); > 879 xas_for_each_conflict(&xas, entry) { > > > > > Most notably in __remove_mapping i_lock is conditional: > > if (!folio_test_swapcache(folio)) > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > > > and the disasm of the offset in my case does not match either acquire. > > For all I know i_lock in this routine is *not* taken and all the > > queued up __remove_mapping callers increase i_lock -> i_pages wait > > times in clear_shadow_entry. > > So the first two are on i_pages lock and the last one is xa_lock. > bottom line though messing with i_lock removal is not justified afaics > > > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > > with some effort, but should this happen the contention is going to > > shift to i_pages presumably with more soft lockups (except on that > > lock). I am not convinced messing with it is justified. From looking > > at other places the i_lock is not a problem in other spots fwiw. > > > > All that said even if it is i_lock in both cases *and* someone whacks > > it, the mm folk should look into what happens when (maybe i_lock ->) > > i_pages lock is held. To that end perhaps you could provide a > > flamegraph or output of perf record -a -g, I don't know what's > > preferred. > > I have attached the flamegraph but this is for the kernel that has been > running with all the accumulated fixes so far. The original one (w/o > fixes) did show considerable time spent on > native_queued_spin_lock_slowpath but unfortunately unable to locate it now. > So I think the problems at this point are all mm, so I'm kicking the ball to that side. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao 2024-07-17 11:29 ` Mateusz Guzik @ 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 0 replies; 37+ messages in thread From: Karim Manaouil @ 2024-07-17 16:34 UTC (permalink / raw) To: Vlastimil Babka Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On Wed, Jul 17, 2024 at 11:42:31AM +0200, Vlastimil Babka wrote: > Seems to me it could be (except that ZONE_DMA corner case) a general > scalability issue in that you tweak some part of the kernel and the > contention moves elsewhere. At least in MM we have per-node locks so this > means 256 CPUs per lock? It used to be that there were not that many > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > mean also more NUMA nodes where the locks contention would distribute among > them. I think you could try fakenuma to create these nodes artificially and > see if it helps for the MM part. But if the contention moves to e.g. an > inode lock, I'm not sure what to do about that then. AMD EPYC BIOSes have an option called NPS (Nodes Per Socket) that can be set to 1, 2, 4 or 8 and that divides the system up into the chosen number of NUMA nodes. Karim PhD Student Edinburgh University ^ permalink raw reply [flat|nested] 37+ messages in thread