* Hard and soft lockups with FIO and LTP runs on a large system
@ 2024-07-03 15:11 Bharata B Rao
2024-07-06 22:42 ` Yu Zhao
2024-07-17 9:42 ` Vlastimil Babka
0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2024-07-03 15:11 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton,
David Hildenbrand, willy, vbabka, yuzhao, kinseyho, Mel Gorman
Many soft and hard lockups are seen with the upstream kernel when running
a set of tests that includes FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours into the
run. Originally this was reported on a large customer VM instance with
passthrough NVME disks on older kernels (v5.4-based). However, similar
problems were reproduced when running the tests on bare metal with the
latest upstream kernel (v6.10-rc3). Other lockups with different
signatures are also seen, but only those related to the MM area are
discussed in this report. Also note that the subsequent description
relates to the lockups seen on bare metal with the upstream kernel (and
not in the VM).
The general observation is that the problem usually surfaces when the
system free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the time, the two contended locks are the lruvec
and inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation, typical of a large system configuration?
- Are there any MM/FS tunables that could help here?
Hardware configuration
======================
Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB
Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout.
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though the workload includes many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine.
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
The watchdog threshold was reduced to 5s to reproduce the problem
earlier, and all-CPU backtraces were enabled.
Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:
kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <NMI>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: </NMI>
kernel: <TASK>
kernel: ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7fe21c314887
With all-CPU backtraces enabled, many CPUs are seen waiting to acquire
the lruvec lock. We measured the lruvec spinlock acquisition start, end
and hold time (htime) using sched_clock(), along with a BUG() if the
hold time exceeded 10s. The capture shown after the sketch below is one
case where the lruvec spinlock was held for ~25s.
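A minimal sketch of that instrumentation is given below (this is not the
exact patch used); lock_stime is an assumed debug-only field added to
struct lruvec, and the helper names are illustrative wrappers around the
existing lock/unlock paths.

/* Debug-only sketch: measure how long the lruvec spinlock is held. */
static inline void dbg_lock_page_lruvec_irq(struct lruvec *lruvec)
{
        spin_lock_irq(&lruvec->lru_lock);
        /* lock_stime is an assumed debug field added to struct lruvec */
        lruvec->lock_stime = sched_clock();
}

static inline void dbg_unlock_page_lruvec_irq(struct lruvec *lruvec)
{
        u64 stime = lruvec->lock_stime;
        u64 etime = sched_clock();
        u64 htime = etime - stime;

        if (htime > 10ULL * NSEC_PER_SEC) {
                pr_err("unlock_page_lruvec_irq: stime %llu, etime %llu, htime %llu\n",
                       stime, etime, htime);
                BUG();
        }
        spin_unlock_irq(&lruvec->lru_lock);
}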
kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520
And the corresponding trace point for the above:
kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
classzone=0 order=0 nr_requested=1 nr_scanned=156946361
nr_skipped=156946360 nr_taken=1 lru=active_file
This shows that isolate_lru_folios() is scanning through a huge number
(~150 million) of folios (order=0) with the lruvec spinlock held. This
is happening because a large number of folios are being skipped in order
to isolate a few ZONE_DMA folios. Though the number of folios to isolate
is bounded (32), the walk becomes effectively unbounded when folios are
skipped, since skipped folios are not counted towards that bound.
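For context, below is a simplified paraphrase (not the verbatim kernel
code) of the relevant loop in isolate_lru_folios() as of v6.10-rc3,
showing why skipped folios keep the walk going well past nr_to_scan:

/*
 * Paraphrased from isolate_lru_folios(): 'scan' is bounded by
 * nr_to_scan (32 here), but skipped folios do not advance it.
 */
while (scan < nr_to_scan && !list_empty(src)) {
        struct folio *folio = lru_to_folio(src);
        unsigned long nr_pages = folio_nr_pages(folio);

        total_scan += nr_pages;

        if (folio_zonenum(folio) > sc->reclaim_idx ||
            skip_cma(folio, sc)) {
                /*
                 * Ineligible zone (e.g. the allocation wants ZONE_DMA
                 * but this folio is from a higher zone): park it on
                 * folios_skipped and keep walking. Skipped folios are
                 * deliberately not counted in 'scan' (to avoid a
                 * premature OOM when the LRU is mostly ineligible), so
                 * the walk is bounded only by the LRU length -- ~150
                 * million folios in the trace above, all with the
                 * lru_lock held.
                 */
                nr_skipped[folio_zonenum(folio)] += nr_pages;
                list_move(&folio->lru, &folios_skipped);
                continue;
        }

        scan += nr_pages;
        /* ... attempt to isolate the folio onto 'dst' ... */
}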
Meminfo output shows that free memory is down to around 2% and the
page/buffer cache has grown very high when the lockup happens.
MemTotal: 1584835956 kB
MemFree: 27805664 kB
MemAvailable: 1568099004 kB
Buffers: 1386120792 kB
Cached: 151894528 kB
SwapCached: 30620 kB
Active: 1043678892 kB
Inactive: 494456452 kB
Oftentimes, the perf output at the time of the problem shows heavy
contention on the lruvec spinlock. Similar contention is also observed
on the inode i_lock (in the clear_shadow_entry path):
  98.98%  fio  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
          |
          --98.96%--native_queued_spin_lock_slowpath
                    |
                    --98.96%--_raw_spin_lock_irqsave
                              folio_lruvec_lock_irqsave
                              |
                              --98.78%--folio_batch_move_lru
                                        |
                                        --98.63%--deactivate_file_folio
                                                  mapping_try_invalidate
                                                  invalidate_mapping_pages
                                                  invalidate_bdev
                                                  blkdev_common_ioctl
                                                  blkdev_ioctl
                                                  __x64_sys_ioctl
                                                  x64_sys_call
                                                  do_syscall_64
                                                  entry_SYSCALL_64_after_hwframe
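For reference on why the lruvec lock shows up under
folio_batch_move_lru() in the above profile, here is a paraphrase (not
verbatim) of deactivate_file_folio() from mm/swap.c in v6.10-rc3: folios
are queued on a per-CPU batch and the lruvec lock is taken only when the
batch is moved, but with hundreds of CPUs invalidating the same bdev
mapping, those batch moves all pile up on the same per-node lruvec
spinlock.

void deactivate_file_folio(struct folio *folio)
{
        struct folio_batch *fbatch;

        /* Deactivating an unevictable folio will not accelerate reclaim */
        if (folio_test_unevictable(folio))
                return;

        folio_get(folio);
        local_lock(&cpu_fbatches.lock);
        fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
        /* drains into folio_batch_move_lru(), which takes the lruvec lock */
        folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
        local_unlock(&cpu_fbatches.lock);
}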
Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed but no hard
lockups were seen during a 48-hour run. Below is one such soft lockup.
kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <IRQ>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_timer_fn+0x223/0x2b0
kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel: </IRQ>
kernel: <TASK>
kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: _raw_spin_lock+0x38/0x50
kernel: clear_shadow_entry+0x3d/0x100
kernel: ? __pfx_workingset_update_node+0x10/0x10
kernel: mapping_try_invalidate+0x117/0x1d0
kernel: invalidate_mapping_pages+0x10/0x20
kernel: invalidate_bdev+0x3c/0x50
kernel: blkdev_common_ioctl+0x5f7/0xa90
kernel: blkdev_ioctl+0x109/0x270
kernel: x64_sys_call+0x1215/0x20d0
kernel: do_syscall_64+0x7e/0x130
This happens to be contending on the inode i_lock spinlock.
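For context, a paraphrase (not verbatim) of this path as of v6.10-rc3:
mapping_try_invalidate() ends up calling clear_shadow_entry() once for
every shadow (exceptional) entry it finds, and each call takes the
inode's i_lock and the i_pages xarray lock individually, so a large
invalidate_bdev() hammers i_lock once per entry:

static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
                               void *entry)
{
        spin_lock(&mapping->host->i_lock);
        xa_lock_irq(&mapping->i_pages);
        __clear_shadow_entry(mapping, index, entry);
        xa_unlock_irq(&mapping->i_pages);
        if (mapping_shrinkable(mapping))
                inode_add_lru(mapping->host);
        spin_unlock(&mapping->host->i_lock);
}

Taking i_lock once per batch of entries, rather than once per entry,
would be one way to reduce this contention.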
The below preemptirqsoff trace points to preemption being disabled for
more than 10s; the lock in the picture is the lruvec spinlock.
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
# --------------------------------------------------------------------
# latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
HP:0 #P:512)
# -----------------
# | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: deactivate_file_folio
# => ended at: deactivate_file_folio
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
fio-2701523 128...1.    0us$: deactivate_file_folio <-deactivate_file_folio
fio-2701523 128.N.1. 10382681us : deactivate_file_folio <-deactivate_file_folio
fio-2701523 128.N.1. 10382683us : tracer_preempt_on <-deactivate_file_folio
fio-2701523 128.N.1. 10382691us : <stack trace>
=> deactivate_file_folio
=> mapping_try_invalidate
=> invalidate_mapping_pages
=> invalidate_bdev
=> blkdev_common_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
2) Increased the low watermark threshold to 10% to prevent the system
from entering an extremely low memory situation. Hard lockups weren't
seen, but soft lockups (in clear_shadow_entry()) were still seen.
3) AMD has a BIOS setting called NPS (Nodes Per Socket) with which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there are four NUMA nodes per socket, and hence 8 NUMA nodes in the
system. This was tried to check whether having more kswapd threads, each
working on fewer folios per node, would make a difference. However, here
too multiple soft lockups were seen (in clear_shadow_entry(), as in the
MGLRU case). No hard lockups were observed.
Any insights into or suggestions for addressing these lockups are welcome!
Regards,
Bharata.
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao @ 2024-07-06 22:42 ` Yu Zhao 2024-07-08 14:34 ` Bharata B Rao 2024-07-10 12:03 ` Bharata B Rao 2024-07-17 9:42 ` Vlastimil Babka 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-06 22:42 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 10946 bytes --] Hi Bharata, On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > > Many soft and hard lockups are seen with upstream kernel when running a > bunch of tests that include FIO and LTP filesystem test on 10 NVME > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > this was reported on a large customer VM instance with passthrough NVME > disks on older kernels(v5.4 based). However, similar problems were > reproduced when running the tests on bare metal with latest upstream > kernel (v6.10-rc3). Other lockups with different signatures are seen but > in this report, only those related to MM area are being discussed. > Also note that the subsequent description is related to the lockups in > bare metal upstream (and not VM). > > The general observation is that the problem usually surfaces when the > system free memory goes very low and page cache/buffer consumption hits > the ceiling. Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > > - Could this be a scalability issue in LRU list handling and/or page > cache invalidation typical to a large system configuration? > - Are there any MM/FS tunables that could help here? > > Hardware configuration > ====================== > Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) > Memory: 1.5 TB > 10 NVME - 3.5TB each > available: 2 nodes (0-1) > node 0 cpus: 0-127,256-383 > node 0 size: 773727 MB > node 1 cpus: 128-255,384-511 > node 1 size: 773966 MB > > Workload details > ================ > Workload includes concurrent runs of FIO and a few FS tests from LTP. > > FIO is run with a size of 1TB on each NVME partition with different > combinations of ioengine/blocksize/mode parameters and buffered-IO. > Selected FS tests from LTP are run on 256GB partitions of all NVME > disks. This is the typical NVME partition layout. > > nvme2n1 259:4 0 3.5T 0 disk > ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 > └─nvme2n1p2 259:7 0 3.2T 0 part > > Though many different runs exist in the workload, the combination that > results in the problem is buffered-IO run with sync engine. > > fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ > -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ > -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest > > Watchdog threshold was reduced to 5s to reproduce the problem early and > all CPU backtrace enabled. > > Problem details and analysis > ============================ > One of the hard lockups which was observed and analyzed in detail is this: > > kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284 > kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <NMI> > kernel: ? show_regs+0x69/0x80 > kernel: ? 
watchdog_hardlockup_check+0x19e/0x360 > <SNIP> > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: </NMI> > kernel: <TASK> > kernel: ? __pfx_lru_add_fn+0x10/0x10 > kernel: _raw_spin_lock_irqsave+0x42/0x50 > kernel: folio_lruvec_lock_irqsave+0x62/0xb0 > kernel: folio_batch_move_lru+0x79/0x2a0 > kernel: folio_add_lru+0x6d/0xf0 > kernel: filemap_add_folio+0xba/0xe0 > kernel: __filemap_get_folio+0x137/0x2e0 > kernel: ext4_da_write_begin+0x12c/0x270 > kernel: generic_perform_write+0xbf/0x200 > kernel: ext4_buffered_write_iter+0x67/0xf0 > kernel: ext4_file_write_iter+0x70/0x780 > kernel: vfs_write+0x301/0x420 > kernel: ksys_write+0x67/0xf0 > kernel: __x64_sys_write+0x19/0x20 > kernel: x64_sys_call+0x1689/0x20d0 > kernel: do_syscall_64+0x6b/0x110 > kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: > 0033:0x7fe21c314887 > > With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock > acquisition. We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > > kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime > 27963324369895, htime 25889317166 > kernel: ------------[ cut here ]------------ > kernel: kernel BUG at include/linux/memcontrol.h:1677! > kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W > 6.10.0-rc3-qspindbg #10 > kernel: RIP: 0010:shrink_active_list+0x40a/0x520 > > And the corresponding trace point for the above: > kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: > classzone=0 order=0 nr_requested=1 nr_scanned=156946361 > nr_skipped=156946360 nr_taken=1 lru=active_file > > This shows that isolate_lru_folios() is scanning through a huge number > (~150million) of folios (order=0) with lruvec spinlock held. This is > happening because a large number of folios are being skipped to isolate > a few ZONE_DMA folios. Though the number of folios to be scanned is > bounded (32), there exists a genuine case where this can become > unbounded, i.e. in case where folios are skipped. > > Meminfo output shows that the free memory is around ~2% and page/buffer > cache grows very high when the lockup happens. > > MemTotal: 1584835956 kB > MemFree: 27805664 kB > MemAvailable: 1568099004 kB > Buffers: 1386120792 kB > Cached: 151894528 kB > SwapCached: 30620 kB > Active: 1043678892 kB > Inactive: 494456452 kB > > Often times, the perf output at the time of the problem shows heavy > contention on lruvec spin lock. Similar contention is also observed with > inode i_lock (in clear_shadow_entry path) > > 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > | > --98.96%--native_queued_spin_lock_slowpath > | > --98.96%--_raw_spin_lock_irqsave > folio_lruvec_lock_irqsave > | > --98.78%--folio_batch_move_lru > | > --98.63%--deactivate_file_folio > mapping_try_invalidate > invalidate_mapping_pages > invalidate_bdev > blkdev_common_ioctl > blkdev_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > > Some experiments tried > ====================== > 1) When MGLRU was enabled many soft lockups were observed, no hard > lockups were seen for 48 hours run. Below is once such soft lockup. This is not really an MGLRU issue -- can you please try one of the attached patches? It (truncate.patch) should help with or without MGLRU. > kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! 
[fio:2701649] > kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > 6.10.0-rc3-mglru-irqstrc #24 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_timer_fn+0x223/0x2b0 > kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > <SNIP> > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: _raw_spin_lock+0x38/0x50 > kernel: clear_shadow_entry+0x3d/0x100 > kernel: ? __pfx_workingset_update_node+0x10/0x10 > kernel: mapping_try_invalidate+0x117/0x1d0 > kernel: invalidate_mapping_pages+0x10/0x20 > kernel: invalidate_bdev+0x3c/0x50 > kernel: blkdev_common_ioctl+0x5f7/0xa90 > kernel: blkdev_ioctl+0x109/0x270 > kernel: x64_sys_call+0x1215/0x20d0 > kernel: do_syscall_64+0x7e/0x130 > > This happens to be contending on inode i_lock spinlock. > > Below preemptirqsoff trace points to preemption being disabled for more > than 10s and the lock in picture is lruvec spinlock. Also if you could try the other patch (mglru.patch) please. It should help reduce unnecessary rotations from deactivate_file_folio(), which in turn should reduce the contention on the LRU lock for MGLRU. > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > # -------------------------------------------------------------------- > # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > HP:0 #P:512) > # ----------------- > # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: deactivate_file_folio > # => ended at: deactivate_file_folio > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-2701523 128...1. 0us$: deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382681us : deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382683us : tracer_preempt_on > <-deactivate_file_folio > fio-2701523 128.N.1. 10382691us : <stack trace> > => deactivate_file_folio > => mapping_try_invalidate > => invalidate_mapping_pages > => invalidate_bdev > => blkdev_common_ioctl > => blkdev_ioctl > => __x64_sys_ioctl > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > 2) Increased low_watermark_threshold to 10% to prevent system from > entering into extremely low memory situation. Although hard lockups > weren't seen, but soft lockups (clear_shadow_entry()) were still seen. > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > the system. This was done to check if having more number of kswapd > threads working on lesser number of folios per node would make a > difference. However here too, multiple soft lockups were seen (in > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > Any insights/suggestion into these lockups and suggestions are welcome! Thanks! 
[-- Attachment #2: mglru.patch --] [-- Type: application/octet-stream, Size: 1202 bytes --] diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index d9a8a4affaaf..7d24d065aed8 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -182,6 +182,16 @@ static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen) return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1); } +static inline bool lru_gen_should_rotate(struct folio *folio) +{ + int gen = folio_lru_gen(folio); + int type = folio_is_file_lru(folio); + struct lruvec *lruvec = folio_lruvec(folio); + struct lru_gen_folio *lrugen = &lruvec->lrugen; + + return gen != lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type])); +} + static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio, int old_gen, int new_gen) { diff --git a/mm/swap.c b/mm/swap.c index 802681b3c857..e3dd092224ba 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -692,7 +692,7 @@ void deactivate_file_folio(struct folio *folio) struct folio_batch *fbatch; /* Deactivating an unevictable folio will not accelerate reclaim */ - if (folio_test_unevictable(folio)) + if (folio_test_unevictable(folio) || !lru_gen_should_rotate(folio)) return; folio_get(folio); [-- Attachment #3: truncate.patch --] [-- Type: application/octet-stream, Size: 3942 bytes --] diff --git a/mm/truncate.c b/mm/truncate.c index e99085bf3d34..545211cf6061 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -39,12 +39,24 @@ static inline void __clear_shadow_entry(struct address_space *mapping, xas_store(&xas, NULL); } -static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, - void *entry) +static void clear_shadow_entry(struct address_space *mapping, + struct folio_batch *fbatch, pgoff_t *indices) { + int i; + + if (shmem_mapping(mapping) || dax_mapping(mapping)) + return; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); + + for (i = 0; i < folio_batch_count(fbatch); i++) { + struct folio *folio = fbatch->folios[i]; + + if (xa_is_value(folio)) + __clear_shadow_entry(mapping, indices[i], folio); + } + xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_add_lru(mapping->host); @@ -105,36 +117,6 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping, fbatch->nr = j; } -/* - * Invalidate exceptional entry if easily possible. This handles exceptional - * entries for invalidate_inode_pages(). - */ -static int invalidate_exceptional_entry(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself, or for DAX we do nothing. */ - if (shmem_mapping(mapping) || dax_mapping(mapping)) - return 1; - clear_shadow_entry(mapping, index, entry); - return 1; -} - -/* - * Invalidate exceptional entry if clean. This handles exceptional entries for - * invalidate_inode_pages2() so for DAX it evicts only clean entries. - */ -static int invalidate_exceptional_entry2(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) - return 1; - if (dax_mapping(mapping)) - return dax_invalidate_mapping_entry_sync(mapping, index); - clear_shadow_entry(mapping, index, entry); - return 1; -} - /** * folio_invalidate - Invalidate part or all of a folio. * @folio: The folio which is affected. 
@@ -494,6 +476,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, unsigned long ret; unsigned long count = 0; int i; + bool xa_has_values = false; folio_batch_init(&fbatch); while (find_lock_entries(mapping, &index, end, &fbatch, indices)) { @@ -503,8 +486,8 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - count += invalidate_exceptional_entry(mapping, - indices[i], folio); + xa_has_values = true; + count++; continue; } @@ -522,6 +505,10 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, } count += ret; } + + if (xa_has_values) + clear_shadow_entry(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); @@ -616,6 +603,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, int ret = 0; int ret2 = 0; int did_range_unmap = 0; + bool xa_has_values = false; if (mapping_empty(mapping)) return 0; @@ -629,8 +617,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - if (!invalidate_exceptional_entry2(mapping, - indices[i], folio)) + xa_has_values = true; + if (dax_mapping(mapping) && + !dax_invalidate_mapping_entry_sync(mapping, indices[i])) ret = -EBUSY; continue; } @@ -666,6 +655,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping, ret = ret2; folio_unlock(folio); } + + if (xa_has_values) + clear_shadow_entry(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-06 22:42 ` Yu Zhao @ 2024-07-08 14:34 ` Bharata B Rao 2024-07-08 16:17 ` Yu Zhao 2024-07-10 12:03 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-08 14:34 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman Hi Yu Zhao, Thanks for your patches. See below... On 07-Jul-24 4:12 AM, Yu Zhao wrote: > Hi Bharata, > > On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >> <snip> >> >> Some experiments tried >> ====================== >> 1) When MGLRU was enabled many soft lockups were observed, no hard >> lockups were seen for 48 hours run. Below is once such soft lockup. > > This is not really an MGLRU issue -- can you please try one of the > attached patches? It (truncate.patch) should help with or without > MGLRU. With truncate.patch and default LRU scheme, a few hard lockups are seen. First one is this: watchdog: Watchdog detected hard LOCKUP on cpu 487 CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27 RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x81/0x300 </NMI> <TASK> ? __pfx_folio_activate_fn+0x10/0x10 _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 folio_batch_move_lru+0x9d/0x160 folio_activate+0x95/0xe0 folio_mark_accessed+0x11f/0x160 filemap_read+0x343/0x3d0 <SNIP> blkdev_read_iter+0x6f/0x140 vfs_read+0x25b/0x340 ksys_read+0x67/0xf0 __x64_sys_read+0x19/0x20 x64_sys_call+0x1771/0x20d0 This is the next one: watchdog: Watchdog detected hard LOCKUP on cpu 219 CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27 RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x2b4/0x300 </NMI> <TASK> _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 __page_cache_release+0x89/0x2f0 folios_put_refs+0x92/0x230 __folio_batch_release+0x74/0x90 truncate_inode_pages_range+0x16f/0x520 truncate_pagecache+0x49/0x70 ext4_setattr+0x326/0xaa0 notify_change+0x353/0x500 do_truncate+0x83/0xe0 path_openat+0xd9e/0x1090 do_filp_open+0xaa/0x150 do_sys_openat2+0x9b/0xd0 __x64_sys_openat+0x55/0x90 x64_sys_call+0xe55/0x20d0 do_syscall_64+0x7e/0x130 entry_SYSCALL_64_after_hwframe+0x76/0x7e When this happens, all-CPU backtrace shows a CPU being in isolate_lru_folios(). > >> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] >> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L >> 6.10.0-rc3-mglru-irqstrc #24 >> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 >> kernel: Call Trace: >> kernel: <IRQ> >> kernel: ? show_regs+0x69/0x80 >> kernel: ? watchdog_timer_fn+0x223/0x2b0 >> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 >> <SNIP> >> kernel: </IRQ> >> kernel: <TASK> >> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 >> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 >> kernel: _raw_spin_lock+0x38/0x50 >> kernel: clear_shadow_entry+0x3d/0x100 >> kernel: ? 
__pfx_workingset_update_node+0x10/0x10 >> kernel: mapping_try_invalidate+0x117/0x1d0 >> kernel: invalidate_mapping_pages+0x10/0x20 >> kernel: invalidate_bdev+0x3c/0x50 >> kernel: blkdev_common_ioctl+0x5f7/0xa90 >> kernel: blkdev_ioctl+0x109/0x270 >> kernel: x64_sys_call+0x1215/0x20d0 >> kernel: do_syscall_64+0x7e/0x130 >> >> This happens to be contending on inode i_lock spinlock. >> >> Below preemptirqsoff trace points to preemption being disabled for more >> than 10s and the lock in picture is lruvec spinlock. > > Also if you could try the other patch (mglru.patch) please. It should > help reduce unnecessary rotations from deactivate_file_folio(), which > in turn should reduce the contention on the LRU lock for MGLRU. Currently testing is in progress with mglru.patch and MGLRU enabled. Will get back on the results. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-08 14:34 ` Bharata B Rao @ 2024-07-08 16:17 ` Yu Zhao 2024-07-09 4:30 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Yu Zhao @ 2024-07-08 16:17 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > > Hi Yu Zhao, > > Thanks for your patches. See below... > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > > Hi Bharata, > > > > On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >> > <snip> > >> > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > > > > This is not really an MGLRU issue -- can you please try one of the > > attached patches? It (truncate.patch) should help with or without > > MGLRU. > > With truncate.patch and default LRU scheme, a few hard lockups are seen. Thanks. In your original report, you said: Most of the times the two contended locks are lruvec and inode->i_lock spinlocks. ... Often times, the perf output at the time of the problem shows heavy contention on lruvec spin lock. Similar contention is also observed with inode i_lock (in clear_shadow_entry path) Based on this new report, does it mean the i_lock is not as contended, for the same path (truncation) you tested? If so, I'll post truncate.patch and add reported-by and tested-by you, unless you have objections. The two paths below were contended on the LRU lock, but they already batch their operations. So I don't know what else we can do surgically to improve them. > First one is this: > > watchdog: Watchdog detected hard LOCKUP on cpu 487 > CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27 > RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x81/0x300 > </NMI> > <TASK> > ? __pfx_folio_activate_fn+0x10/0x10 > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_activate+0x95/0xe0 > folio_mark_accessed+0x11f/0x160 > filemap_read+0x343/0x3d0 > <SNIP> > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > > This is the next one: > > watchdog: Watchdog detected hard LOCKUP on cpu 219 > CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <TASK> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > __page_cache_release+0x89/0x2f0 > folios_put_refs+0x92/0x230 > __folio_batch_release+0x74/0x90 > truncate_inode_pages_range+0x16f/0x520 > truncate_pagecache+0x49/0x70 > ext4_setattr+0x326/0xaa0 > notify_change+0x353/0x500 > do_truncate+0x83/0xe0 > path_openat+0xd9e/0x1090 > do_filp_open+0xaa/0x150 > do_sys_openat2+0x9b/0xd0 > __x64_sys_openat+0x55/0x90 > x64_sys_call+0xe55/0x20d0 > do_syscall_64+0x7e/0x130 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > When this happens, all-CPU backtrace shows a CPU being in > isolate_lru_folios(). 
> > > > >> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] > >> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > >> 6.10.0-rc3-mglru-irqstrc #24 > >> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > >> kernel: Call Trace: > >> kernel: <IRQ> > >> kernel: ? show_regs+0x69/0x80 > >> kernel: ? watchdog_timer_fn+0x223/0x2b0 > >> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > >> <SNIP> > >> kernel: </IRQ> > >> kernel: <TASK> > >> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > >> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > >> kernel: _raw_spin_lock+0x38/0x50 > >> kernel: clear_shadow_entry+0x3d/0x100 > >> kernel: ? __pfx_workingset_update_node+0x10/0x10 > >> kernel: mapping_try_invalidate+0x117/0x1d0 > >> kernel: invalidate_mapping_pages+0x10/0x20 > >> kernel: invalidate_bdev+0x3c/0x50 > >> kernel: blkdev_common_ioctl+0x5f7/0xa90 > >> kernel: blkdev_ioctl+0x109/0x270 > >> kernel: x64_sys_call+0x1215/0x20d0 > >> kernel: do_syscall_64+0x7e/0x130 > >> > >> This happens to be contending on inode i_lock spinlock. > >> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Currently testing is in progress with mglru.patch and MGLRU enabled. > Will get back on the results. Thank you. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-08 16:17 ` Yu Zhao @ 2024-07-09 4:30 ` Bharata B Rao 2024-07-09 5:58 ` Yu Zhao 2024-07-17 9:37 ` Vlastimil Babka 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-09 4:30 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 08-Jul-24 9:47 PM, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >> >> Hi Yu Zhao, >> >> Thanks for your patches. See below... >> >> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>> Hi Bharata, >>> >>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >> <snip> >>>> >>>> Some experiments tried >>>> ====================== >>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>> >>> This is not really an MGLRU issue -- can you please try one of the >>> attached patches? It (truncate.patch) should help with or without >>> MGLRU. >> >> With truncate.patch and default LRU scheme, a few hard lockups are seen. > > Thanks. > > In your original report, you said: > > Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > ... > Often times, the perf output at the time of the problem shows > heavy contention on lruvec spin lock. Similar contention is > also observed with inode i_lock (in clear_shadow_entry path) > > Based on this new report, does it mean the i_lock is not as contended, > for the same path (truncation) you tested? If so, I'll post > truncate.patch and add reported-by and tested-by you, unless you have > objections. truncate.patch has been tested on two systems with default LRU scheme and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > The two paths below were contended on the LRU lock, but they already > batch their operations. So I don't know what else we can do surgically > to improve them. What has been seen with this workload is that the lruvec spinlock is held for a long time from shrink_[active/inactive]_list path. In this path, there is a case in isolate_lru_folios() where scanning of LRU lists can become unbounded. To isolate a page from ZONE_DMA, sometimes scanning/skipping of more than 150 million folios were seen. There is already a comment in there which explains why nr_skipped shouldn't be counted, but is there any possibility of re-looking at this condition? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 4:30 ` Bharata B Rao @ 2024-07-09 5:58 ` Yu Zhao 2024-07-11 5:43 ` Bharata B Rao 2024-08-13 11:04 ` Usama Arif 2024-07-17 9:37 ` Vlastimil Babka 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-09 5:58 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > > On 08-Jul-24 9:47 PM, Yu Zhao wrote: > > On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> Hi Yu Zhao, > >> > >> Thanks for your patches. See below... > >> > >> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>> Hi Bharata, > >>> > >>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>> > >> <snip> > >>>> > >>>> Some experiments tried > >>>> ====================== > >>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>> > >>> This is not really an MGLRU issue -- can you please try one of the > >>> attached patches? It (truncate.patch) should help with or without > >>> MGLRU. > >> > >> With truncate.patch and default LRU scheme, a few hard lockups are seen. > > > > Thanks. > > > > In your original report, you said: > > > > Most of the times the two contended locks are lruvec and > > inode->i_lock spinlocks. > > ... > > Often times, the perf output at the time of the problem shows > > heavy contention on lruvec spin lock. Similar contention is > > also observed with inode i_lock (in clear_shadow_entry path) > > > > Based on this new report, does it mean the i_lock is not as contended, > > for the same path (truncation) you tested? If so, I'll post > > truncate.patch and add reported-by and tested-by you, unless you have > > objections. > > truncate.patch has been tested on two systems with default LRU scheme > and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. Thanks. > > > > The two paths below were contended on the LRU lock, but they already > > batch their operations. So I don't know what else we can do surgically > > to improve them. > > What has been seen with this workload is that the lruvec spinlock is > held for a long time from shrink_[active/inactive]_list path. In this > path, there is a case in isolate_lru_folios() where scanning of LRU > lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > scanning/skipping of more than 150 million folios were seen. There is > already a comment in there which explains why nr_skipped shouldn't be > counted, but is there any possibility of re-looking at this condition? For this specific case, probably this can help: @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan, if (folio_zonenum(folio) > sc->reclaim_idx || skip_cma(folio, sc)) { nr_skipped[folio_zonenum(folio)] += nr_pages; - move_to = &folios_skipped; - goto move; + list_move(&folio->lru, &folios_skipped); + if (spin_is_contended(&lruvec->lru_lock)) { + if (!list_empty(dst)) + break; + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + continue; } ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 5:58 ` Yu Zhao @ 2024-07-11 5:43 ` Bharata B Rao 2024-07-15 5:19 ` Bharata B Rao 2024-08-13 11:04 ` Usama Arif 1 sibling, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-11 5:43 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 09-Jul-24 11:28 AM, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >> >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > Thanks. > >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is >> already a comment in there which explains why nr_skipped shouldn't be >> counted, but is there any possibility of re-looking at this condition? > > For this specific case, probably this can help: > > @@ -1659,8 +1659,15 @@ static unsigned long > isolate_lru_folios(unsigned long nr_to_scan, > if (folio_zonenum(folio) > sc->reclaim_idx || > skip_cma(folio, sc)) { > nr_skipped[folio_zonenum(folio)] += nr_pages; > - move_to = &folios_skipped; > - goto move; > + list_move(&folio->lru, &folios_skipped); > + if (spin_is_contended(&lruvec->lru_lock)) { > + if (!list_empty(dst)) > + break; > + spin_unlock_irq(&lruvec->lru_lock); > + cond_resched(); > + spin_lock_irq(&lruvec->lru_lock); > + } > + continue; > } Thanks, this helped. With this fix, the test ran for 24hrs without any lockups attributable to lruvec spinlock. 
As noted in this thread, earlier isolate_lru_folios() used to scan millions of folios and spend a lot of time with spinlock held but after this fix, such a scenario is no longer seen. However the contention seems to have shifted to other areas and these are the two MM related soft and hard lockups that were observed during this run: Soft lockup =========== watchdog: BUG: soft lockup - CPU#425 stuck for 12s! CPU: 425 PID: 145707 Comm: fio Kdump: loaded Tainted: G W 6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21 RIP: 0010:handle_softirqs+0x70/0x2f0 __rmqueue_pcplist+0x4ce/0x9a0 get_page_from_freelist+0x2e1/0x1650 __alloc_pages_noprof+0x1b4/0x12c0 alloc_pages_mpol_noprof+0xdd/0x200 folio_alloc_noprof+0x67/0xe0 Hard lockup =========== watchdog: Watchdog detected hard LOCKUP on cpu 296 CPU: 296 PID: 150155 Comm: fio Kdump: loaded Tainted: G W L 6.10.0-rc3-trkwtrs_trnct_nvme_lruvecresched #21 RIP: 0010:native_queued_spin_lock_slowpath+0x347/0x430 Call Trace: <NMI> ? watchdog_hardlockup_check+0x1a2/0x370 ? watchdog_overflow_callback+0x6d/0x80 <SNIP> native_queued_spin_lock_slowpath+0x347/0x430 </NMI> <IRQ> _raw_spin_lock_irqsave+0x46/0x60 free_unref_page+0x19f/0x540 ? __slab_free+0x2ab/0x2b0 __free_pages+0x9d/0xb0 __free_slab+0xa7/0xf0 free_slab+0x31/0x100 discard_slab+0x32/0x40 __put_partials+0xb8/0xe0 put_cpu_partial+0x5a/0x90 __slab_free+0x1d9/0x2b0 kfree+0x244/0x280 mempool_kfree+0x12/0x20 mempool_free+0x30/0x90 nvme_unmap_data+0xd0/0x150 [nvme] nvme_pci_complete_batch+0xaf/0xd0 [nvme] nvme_irq+0x96/0xe0 [nvme] __handle_irq_event_percpu+0x50/0x1b0 Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-11 5:43 ` Bharata B Rao @ 2024-07-15 5:19 ` Bharata B Rao 2024-07-19 20:21 ` Yu Zhao 2024-07-25 9:59 ` zhaoyang.huang 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-15 5:19 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On 11-Jul-24 11:13 AM, Bharata B Rao wrote: > On 09-Jul-24 11:28 AM, Yu Zhao wrote: >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >>> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>>> >>>>> Hi Yu Zhao, >>>>> >>>>> Thanks for your patches. See below... >>>>> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>>> Hi Bharata, >>>>>> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>>> >>>>> <snip> >>>>>>> >>>>>>> Some experiments tried >>>>>>> ====================== >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>>> >>>>>> This is not really an MGLRU issue -- can you please try one of the >>>>>> attached patches? It (truncate.patch) should help with or without >>>>>> MGLRU. >>>>> >>>>> With truncate.patch and default LRU scheme, a few hard lockups are >>>>> seen. >>>> >>>> Thanks. >>>> >>>> In your original report, you said: >>>> >>>> Most of the times the two contended locks are lruvec and >>>> inode->i_lock spinlocks. >>>> ... >>>> Often times, the perf output at the time of the problem shows >>>> heavy contention on lruvec spin lock. Similar contention is >>>> also observed with inode i_lock (in clear_shadow_entry path) >>>> >>>> Based on this new report, does it mean the i_lock is not as contended, >>>> for the same path (truncation) you tested? If so, I'll post >>>> truncate.patch and add reported-by and tested-by you, unless you have >>>> objections. >>> >>> truncate.patch has been tested on two systems with default LRU scheme >>> and the lockup due to inode->i_lock hasn't been seen yet after 24 >>> hours run. >> >> Thanks. >> >>>> >>>> The two paths below were contended on the LRU lock, but they already >>>> batch their operations. So I don't know what else we can do surgically >>>> to improve them. >>> >>> What has been seen with this workload is that the lruvec spinlock is >>> held for a long time from shrink_[active/inactive]_list path. In this >>> path, there is a case in isolate_lru_folios() where scanning of LRU >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >>> scanning/skipping of more than 150 million folios were seen. There is >>> already a comment in there which explains why nr_skipped shouldn't be >>> counted, but is there any possibility of re-looking at this condition? >> >> For this specific case, probably this can help: >> >> @@ -1659,8 +1659,15 @@ static unsigned long >> isolate_lru_folios(unsigned long nr_to_scan, >> if (folio_zonenum(folio) > sc->reclaim_idx || >> skip_cma(folio, sc)) { >> nr_skipped[folio_zonenum(folio)] += nr_pages; >> - move_to = &folios_skipped; >> - goto move; >> + list_move(&folio->lru, &folios_skipped); >> + if (spin_is_contended(&lruvec->lru_lock)) { >> + if (!list_empty(dst)) >> + break; >> + spin_unlock_irq(&lruvec->lru_lock); >> + cond_resched(); >> + spin_lock_irq(&lruvec->lru_lock); >> + } >> + continue; >> } > > Thanks, this helped. 
With this fix, the test ran for 24hrs without any > lockups attributable to lruvec spinlock. As noted in this thread, > earlier isolate_lru_folios() used to scan millions of folios and spend a > lot of time with spinlock held but after this fix, such a scenario is no > longer seen. However during the weekend mglru-enabled run (with above fix to isolate_lru_folios() and also the previous two patches: truncate.patch and mglru.patch and the inode fix provided by Mateusz), another hard lockup related to lruvec spinlock was observed. Here is the hardlock up: watchdog: Watchdog detected hard LOCKUP on cpu 466 CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 Call Trace: <NMI> ? show_regs+0x69/0x80 ? watchdog_hardlockup_check+0x1b4/0x3a0 <SNIP> ? native_queued_spin_lock_slowpath+0x2b4/0x300 </NMI> <IRQ> _raw_spin_lock_irqsave+0x5b/0x70 folio_lruvec_lock_irqsave+0x62/0x90 folio_batch_move_lru+0x9d/0x160 folio_rotate_reclaimable+0xab/0xf0 folio_end_writeback+0x60/0x90 end_buffer_async_write+0xaa/0xe0 end_bio_bh_io_sync+0x2c/0x50 bio_endio+0x108/0x180 blk_mq_end_request_batch+0x11f/0x5e0 nvme_pci_complete_batch+0xb5/0xd0 [nvme] nvme_irq+0x92/0xe0 [nvme] __handle_irq_event_percpu+0x6e/0x1e0 handle_irq_event+0x39/0x80 handle_edge_irq+0x8c/0x240 __common_interrupt+0x4e/0xf0 common_interrupt+0x49/0xc0 asm_common_interrupt+0x27/0x40 Here is the lock holder details captured by all-cpu-backtrace: NMI backtrace for cpu 75 CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 RIP: 0010:folio_inc_gen+0x142/0x430 Call Trace: <NMI> ? show_regs+0x69/0x80 ? nmi_cpu_backtrace+0xc5/0x130 ? nmi_cpu_backtrace_handler+0x11/0x20 ? nmi_handle+0x64/0x180 ? default_do_nmi+0x45/0x130 ? exc_nmi+0x128/0x1a0 ? end_repeat_nmi+0xf/0x53 ? folio_inc_gen+0x142/0x430 ? folio_inc_gen+0x142/0x430 ? folio_inc_gen+0x142/0x430 </NMI> <TASK> isolate_folios+0x954/0x1630 evict_folios+0xa5/0x8c0 try_to_shrink_lruvec+0x1be/0x320 shrink_one+0x10f/0x1d0 shrink_node+0xa4c/0xc90 do_try_to_free_pages+0xc0/0x590 try_to_free_pages+0xde/0x210 __alloc_pages_noprof+0x6ae/0x12c0 alloc_pages_mpol_noprof+0xd9/0x220 folio_alloc_noprof+0x63/0xe0 filemap_alloc_folio_noprof+0xf4/0x100 page_cache_ra_unbounded+0xb9/0x1a0 page_cache_ra_order+0x26e/0x310 ondemand_readahead+0x1a3/0x360 page_cache_sync_ra+0x83/0x90 filemap_get_pages+0xf0/0x6a0 filemap_read+0xe7/0x3d0 blkdev_read_iter+0x6f/0x140 vfs_read+0x25b/0x340 ksys_read+0x67/0xf0 __x64_sys_read+0x19/0x20 x64_sys_call+0x1771/0x20d0 do_syscall_64+0x7e/0x130 Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:19 ` Bharata B Rao @ 2024-07-19 20:21 ` Yu Zhao 2024-07-20 7:57 ` Mateusz Guzik 2024-07-22 4:12 ` Bharata B Rao 2024-07-25 9:59 ` zhaoyang.huang 1 sibling, 2 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-19 20:21 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao <bharata@amd.com> wrote: > > On 11-Jul-24 11:13 AM, Bharata B Rao wrote: > > On 09-Jul-24 11:28 AM, Yu Zhao wrote: > >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > >>> > >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote: > >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>> > >>>>> Hi Yu Zhao, > >>>>> > >>>>> Thanks for your patches. See below... > >>>>> > >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>>>>> Hi Bharata, > >>>>>> > >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>>>> > >>>>> <snip> > >>>>>>> > >>>>>>> Some experiments tried > >>>>>>> ====================== > >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>>>>> > >>>>>> This is not really an MGLRU issue -- can you please try one of the > >>>>>> attached patches? It (truncate.patch) should help with or without > >>>>>> MGLRU. > >>>>> > >>>>> With truncate.patch and default LRU scheme, a few hard lockups are > >>>>> seen. > >>>> > >>>> Thanks. > >>>> > >>>> In your original report, you said: > >>>> > >>>> Most of the times the two contended locks are lruvec and > >>>> inode->i_lock spinlocks. > >>>> ... > >>>> Often times, the perf output at the time of the problem shows > >>>> heavy contention on lruvec spin lock. Similar contention is > >>>> also observed with inode i_lock (in clear_shadow_entry path) > >>>> > >>>> Based on this new report, does it mean the i_lock is not as contended, > >>>> for the same path (truncation) you tested? If so, I'll post > >>>> truncate.patch and add reported-by and tested-by you, unless you have > >>>> objections. > >>> > >>> truncate.patch has been tested on two systems with default LRU scheme > >>> and the lockup due to inode->i_lock hasn't been seen yet after 24 > >>> hours run. > >> > >> Thanks. > >> > >>>> > >>>> The two paths below were contended on the LRU lock, but they already > >>>> batch their operations. So I don't know what else we can do surgically > >>>> to improve them. > >>> > >>> What has been seen with this workload is that the lruvec spinlock is > >>> held for a long time from shrink_[active/inactive]_list path. In this > >>> path, there is a case in isolate_lru_folios() where scanning of LRU > >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > >>> scanning/skipping of more than 150 million folios were seen. There is > >>> already a comment in there which explains why nr_skipped shouldn't be > >>> counted, but is there any possibility of re-looking at this condition? 
> >> > >> For this specific case, probably this can help: > >> > >> @@ -1659,8 +1659,15 @@ static unsigned long > >> isolate_lru_folios(unsigned long nr_to_scan, > >> if (folio_zonenum(folio) > sc->reclaim_idx || > >> skip_cma(folio, sc)) { > >> nr_skipped[folio_zonenum(folio)] += nr_pages; > >> - move_to = &folios_skipped; > >> - goto move; > >> + list_move(&folio->lru, &folios_skipped); > >> + if (spin_is_contended(&lruvec->lru_lock)) { > >> + if (!list_empty(dst)) > >> + break; > >> + spin_unlock_irq(&lruvec->lru_lock); > >> + cond_resched(); > >> + spin_lock_irq(&lruvec->lru_lock); > >> + } > >> + continue; > >> } > > > > Thanks, this helped. With this fix, the test ran for 24hrs without any > > lockups attributable to lruvec spinlock. As noted in this thread, > > earlier isolate_lru_folios() used to scan millions of folios and spend a > > lot of time with spinlock held but after this fix, such a scenario is no > > longer seen. > > However during the weekend mglru-enabled run (with above fix to > isolate_lru_folios() and also the previous two patches: truncate.patch > and mglru.patch and the inode fix provided by Mateusz), another hard > lockup related to lruvec spinlock was observed. Thanks again for the stress tests. I can't come up with any reasonable band-aid at this moment, i.e., something not too ugly to work around a more fundamental scalability problem. Before I give up: what type of dirty data was written back to the nvme device? Was it page cache or swap? > Here is the hardlock up: > > watchdog: Watchdog detected hard LOCKUP on cpu 466 > CPU: 466 PID: 3103929 Comm: fio Not tainted > 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 > <SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <IRQ> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_rotate_reclaimable+0xab/0xf0 > folio_end_writeback+0x60/0x90 > end_buffer_async_write+0xaa/0xe0 > end_bio_bh_io_sync+0x2c/0x50 > bio_endio+0x108/0x180 > blk_mq_end_request_batch+0x11f/0x5e0 > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > nvme_irq+0x92/0xe0 [nvme] > __handle_irq_event_percpu+0x6e/0x1e0 > handle_irq_event+0x39/0x80 > handle_edge_irq+0x8c/0x240 > __common_interrupt+0x4e/0xf0 > common_interrupt+0x49/0xc0 > asm_common_interrupt+0x27/0x40 > > Here is the lock holder details captured by all-cpu-backtrace: > > NMI backtrace for cpu 75 > CPU: 75 PID: 3095650 Comm: fio Not tainted > 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > RIP: 0010:folio_inc_gen+0x142/0x430 > Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? nmi_cpu_backtrace+0xc5/0x130 > ? nmi_cpu_backtrace_handler+0x11/0x20 > ? nmi_handle+0x64/0x180 > ? default_do_nmi+0x45/0x130 > ? exc_nmi+0x128/0x1a0 > ? end_repeat_nmi+0xf/0x53 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > ? 
folio_inc_gen+0x142/0x430 > </NMI> > <TASK> > isolate_folios+0x954/0x1630 > evict_folios+0xa5/0x8c0 > try_to_shrink_lruvec+0x1be/0x320 > shrink_one+0x10f/0x1d0 > shrink_node+0xa4c/0xc90 > do_try_to_free_pages+0xc0/0x590 > try_to_free_pages+0xde/0x210 > __alloc_pages_noprof+0x6ae/0x12c0 > alloc_pages_mpol_noprof+0xd9/0x220 > folio_alloc_noprof+0x63/0xe0 > filemap_alloc_folio_noprof+0xf4/0x100 > page_cache_ra_unbounded+0xb9/0x1a0 > page_cache_ra_order+0x26e/0x310 > ondemand_readahead+0x1a3/0x360 > page_cache_sync_ra+0x83/0x90 > filemap_get_pages+0xf0/0x6a0 > filemap_read+0xe7/0x3d0 > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > do_syscall_64+0x7e/0x130 > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
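The isolate_lru_folios() band-aid quoted above is an instance of the classic lock-break pattern: bound the time spent under a contended spinlock by dropping it, letting the waiters (and the scheduler) make progress, and reacquiring. Below is a minimal self-contained sketch of the idea; the names (process_list_bounded, handle_one, struct item) are made up for illustration and this is not the kernel code itself.

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct item {
	struct list_head node;
};

/* Stand-in for the per-item work done while the lock is held. */
static void handle_one(struct item *it)
{
}

static void process_list_bounded(struct list_head *list, spinlock_t *lock)
{
	spin_lock_irq(lock);
	while (!list_empty(list)) {
		struct item *it = list_first_entry(list, struct item, node);

		list_del(&it->node);
		handle_one(it);

		if (!spin_is_contended(lock))
			continue;

		/* Someone is spinning on us: hand the lock over briefly. */
		spin_unlock_irq(lock);
		cond_resched();
		spin_lock_irq(lock);
	}
	spin_unlock_irq(lock);
}

The part the real diff has to be careful about, and this sketch does not model, is that the LRU list can change while the lock is dropped, which is why the quoted change only breaks out of the loop once something has already been isolated (!list_empty(dst)).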
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 20:21 ` Yu Zhao @ 2024-07-20 7:57 ` Mateusz Guzik 2024-07-22 4:17 ` Bharata B Rao 2024-07-22 4:12 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-20 7:57 UTC (permalink / raw) To: Yu Zhao Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote: > I can't come up with any reasonable band-aid at this moment, i.e., > something not too ugly to work around a more fundamental scalability > problem. > > Before I give up: what type of dirty data was written back to the nvme > device? Was it page cache or swap? > With my corporate employee hat on, I would like to note a couple of three things. 1. there are definitely bugs here and someone(tm) should sort them out(R) however.... 2. the real goal is presumably to beat the kernel into shape where production kernels no longer suffer lockups running this workload on this hardware 3. the flamegraph (to be found in [1]) shows expensive debug enabled, notably for preemption count (search for preempt_count_sub to see) 4. I'm told the lruvec problem is being worked on (but no ETA) and I don't think the above justifies considering any hacks or otherwise putting more pressure on it It is plausible eliminating the aforementioned debug will be good enough. Apart from that I note percpu_counter_add_batch (+ irq debug) accounts for 5.8% cpu time. This will of course go down if irq tracing is disabled, but so happens I optimized this routine to be faster single-threaded (in particular by dodging the interrupt trip). The patch is hanging out in the mm tree [2] and is trivially applicable for testing. Even if none of the debug opts can get modified, this should drop percpu_counter_add_batch to 1.5% or so, which may or may not have a side effect of avoiding the lockup problem. [1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@amd.com/ [2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
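For readers who have not run into it, percpu_counter_add_batch() is the batched per-CPU counter primitive that shows up in the profile above. A minimal usage sketch follows; the counter and module here are made up purely for illustration.

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/percpu_counter.h>

static struct percpu_counter demo_counter;

static int __init demo_init(void)
{
	int err = percpu_counter_init(&demo_counter, 0, GFP_KERNEL);

	if (err)
		return err;

	/*
	 * Fast path: only this CPU's local delta is touched. The shared
	 * spinlock (and the cacheline bouncing that comes with it) is only
	 * taken once the local delta exceeds the batch, here 32.
	 */
	percpu_counter_add_batch(&demo_counter, 1, 32);

	pr_info("approx %lld, exact %lld\n",
		percpu_counter_read_positive(&demo_counter),
		percpu_counter_sum(&demo_counter));

	percpu_counter_destroy(&demo_counter);
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The point of the batch argument is that the shared lock is only touched when a CPU's accumulated delta exceeds the batch; per Mateusz's description above, his patch additionally avoids the interrupt disable/enable trip on that common fast path.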
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-20 7:57 ` Mateusz Guzik @ 2024-07-22 4:17 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-22 4:17 UTC (permalink / raw) To: Mateusz Guzik, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman On 20-Jul-24 1:27 PM, Mateusz Guzik wrote: > On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@google.com> wrote: >> I can't come up with any reasonable band-aid at this moment, i.e., >> something not too ugly to work around a more fundamental scalability >> problem. >> >> Before I give up: what type of dirty data was written back to the nvme >> device? Was it page cache or swap? >> > > With my corporate employee hat on, I would like to note a couple of > three things. > > 1. there are definitely bugs here and someone(tm) should sort them out(R) > > however.... > > 2. the real goal is presumably to beat the kernel into shape where > production kernels no longer suffer lockups running this workload on > this hardware > 3. the flamegraph (to be found in [1]) shows expensive debug enabled, > notably for preemption count (search for preempt_count_sub to see) > 4. I'm told the lruvec problem is being worked on (but no ETA) and I > don't think the above justifies considering any hacks or otherwise > putting more pressure on it > > It is plausible eliminating the aforementioned debug will be good enough. > > Apart from that I note percpu_counter_add_batch (+ irq debug) accounts > for 5.8% cpu time. This will of course go down if irq tracing is > disabled, but so happens I optimized this routine to be faster > single-threaded (in particular by dodging the interrupt trip). The > patch is hanging out in the mm tree [2] and is trivially applicable > for testing. > > Even if none of the debug opts can get modified, this should drop > percpu_counter_add_batch to 1.5% or so, which may or may not have a > side effect of avoiding the lockup problem. Thanks, A few debug options were turned ON to gather debug data. Will do a full run once with them turned OFF and with the above percpu_counter_add_batch patch. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
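The thread does not spell out which debug options were enabled. Judging from preempt_count_sub() and IRQ-state tracing dominating the flamegraph, the likely candidates to flip off for a production-like run are the ones below; this is an assumption to be checked against the real .config, and the last two are what the earlier preemptirqsoff latency traces depend on.

# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set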
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 20:21 ` Yu Zhao 2024-07-20 7:57 ` Mateusz Guzik @ 2024-07-22 4:12 ` Bharata B Rao 1 sibling, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-22 4:12 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, mjguzik On 20-Jul-24 1:51 AM, Yu Zhao wrote: >> However during the weekend mglru-enabled run (with above fix to >> isolate_lru_folios() and also the previous two patches: truncate.patch >> and mglru.patch and the inode fix provided by Mateusz), another hard >> lockup related to lruvec spinlock was observed. > > Thanks again for the stress tests. > > I can't come up with any reasonable band-aid at this moment, i.e., > something not too ugly to work around a more fundamental scalability > problem. > > Before I give up: what type of dirty data was written back to the nvme > device? Was it page cache or swap? This is how a typical dstat report looks like when we start to see the problem with lruvec spinlock. ------memory-usage----- ----swap--- used free buff cach| used free| 14.3G 20.7G 1467G 185M| 938M 15G| 14.3G 20.0G 1468G 174M| 938M 15G| 14.3G 20.3G 1468G 184M| 938M 15G| 14.3G 19.8G 1468G 183M| 938M 15G| 14.3G 19.9G 1468G 183M| 938M 15G| 14.3G 19.5G 1468G 183M| 938M 15G| As you can see, most of the usage is in buffer cache and swap is hardly used. Just to recap from the original post... ==== FIO is run with a size of 1TB on each NVME partition with different combinations of ioengine/blocksize/mode parameters and buffered-IO. Selected FS tests from LTP are run on 256GB partitions of all NVME disks. This is the typical NVME partition layout. nvme2n1 259:4 0 3.5T 0 disk ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 └─nvme2n1p2 259:7 0 3.2T 0 part Though many different runs exist in the workload, the combination that results in the problem is buffered-IO run with sync engine. fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest ==== Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
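A direct way to answer the page-cache-vs-swap question while the workload runs is to sample the relevant /proc/vmstat counters. A small stand-alone sketch (plain userspace C, field names as they appear in /proc/vmstat):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		/* nr_dirty/nr_writeback: page cache; pswpout: swap-out. */
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "nr_dirty ", 9) ||
			    !strncmp(line, "nr_writeback ", 13) ||
			    !strncmp(line, "pswpout ", 8))
				fputs(line, stdout);
		}
		fclose(f);
		puts("--");
		sleep(5);
	}
}

Given that the fio job writes to the raw nvme*n1p2 partitions with buffered I/O, the dirty data being written back is block-device page cache (buffer heads), which matches the end_buffer_async_write() completion path in the lockup backtrace rather than the swap path.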
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:19 ` Bharata B Rao 2024-07-19 20:21 ` Yu Zhao @ 2024-07-25 9:59 ` zhaoyang.huang 2024-07-26 3:26 ` Zhaoyang Huang 1 sibling, 1 reply; 37+ messages in thread From: zhaoyang.huang @ 2024-07-25 9:59 UTC (permalink / raw) To: bharata Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, huangzhaoyang, steve.kang >However during the weekend mglru-enabled run (with above fix to >isolate_lru_folios() and also the previous two patches: truncate.patch >and mglru.patch and the inode fix provided by Mateusz), another hard >lockup related to lruvec spinlock was observed. > >Here is the hardlock up: > >watchdog: Watchdog detected hard LOCKUP on cpu 466 >CPU: 466 PID: 3103929 Comm: fio Not tainted >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 >RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 >Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? watchdog_hardlockup_check+0x1b4/0x3a0 ><SNIP> > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > </NMI> > <IRQ> > _raw_spin_lock_irqsave+0x5b/0x70 > folio_lruvec_lock_irqsave+0x62/0x90 > folio_batch_move_lru+0x9d/0x160 > folio_rotate_reclaimable+0xab/0xf0 > folio_end_writeback+0x60/0x90 > end_buffer_async_write+0xaa/0xe0 > end_bio_bh_io_sync+0x2c/0x50 > bio_endio+0x108/0x180 > blk_mq_end_request_batch+0x11f/0x5e0 > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > nvme_irq+0x92/0xe0 [nvme] > __handle_irq_event_percpu+0x6e/0x1e0 > handle_irq_event+0x39/0x80 > handle_edge_irq+0x8c/0x240 > __common_interrupt+0x4e/0xf0 > common_interrupt+0x49/0xc0 > asm_common_interrupt+0x27/0x40 > >Here is the lock holder details captured by all-cpu-backtrace: > >NMI backtrace for cpu 75 >CPU: 75 PID: 3095650 Comm: fio Not tainted >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 >RIP: 0010:folio_inc_gen+0x142/0x430 >Call Trace: > <NMI> > ? show_regs+0x69/0x80 > ? nmi_cpu_backtrace+0xc5/0x130 > ? nmi_cpu_backtrace_handler+0x11/0x20 > ? nmi_handle+0x64/0x180 > ? default_do_nmi+0x45/0x130 > ? exc_nmi+0x128/0x1a0 > ? end_repeat_nmi+0xf/0x53 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > ? folio_inc_gen+0x142/0x430 > </NMI> > <TASK> > isolate_folios+0x954/0x1630 > evict_folios+0xa5/0x8c0 > try_to_shrink_lruvec+0x1be/0x320 > shrink_one+0x10f/0x1d0 > shrink_node+0xa4c/0xc90 > do_try_to_free_pages+0xc0/0x590 > try_to_free_pages+0xde/0x210 > __alloc_pages_noprof+0x6ae/0x12c0 > alloc_pages_mpol_noprof+0xd9/0x220 > folio_alloc_noprof+0x63/0xe0 > filemap_alloc_folio_noprof+0xf4/0x100 > page_cache_ra_unbounded+0xb9/0x1a0 > page_cache_ra_order+0x26e/0x310 > ondemand_readahead+0x1a3/0x360 > page_cache_sync_ra+0x83/0x90 > filemap_get_pages+0xf0/0x6a0 > filemap_read+0xe7/0x3d0 > blkdev_read_iter+0x6f/0x140 > vfs_read+0x25b/0x340 > ksys_read+0x67/0xf0 > __x64_sys_read+0x19/0x20 > x64_sys_call+0x1771/0x20d0 > do_syscall_64+0x7e/0x130 From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? 
diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e34de9cd0d4..827036e21f24 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw return scanned; } +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) +{ + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); + + if (current_is_kswapd()) { + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) + set_bit(PGDAT_WRITEBACK, &pgdat->flags); + + /* Allow kswapd to start writing pages during reclaim.*/ + if (sc->nr.unqueued_dirty == sc->nr.file_taken) + set_bit(PGDAT_DIRTY, &pgdat->flags); + + if (sc->nr.immediate) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); + } + + /* + * Tag a node/memcg as congested if all the dirty pages were marked + * for writeback and immediate reclaim (counted in nr.congested). + * + * Legacy memcg will stall in page writeback so avoid forcibly + * stalling in reclaim_throttle(). + */ + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); + + if (current_is_kswapd()) + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); + } + + /* + * Stall direct reclaim for IO completions if the lruvec is + * node is congested. Allow kswapd to continue until it + * starts encountering unqueued dirty pages or cycling through + * the LRU too quickly. + */ + if (!current_is_kswapd() && current_may_throttle() && + !sc->hibernation_mode && + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); +} + static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) { int type; @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap retry: reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); sc->nr_reclaimed += reclaimed; + sc->nr.dirty += stat.nr_dirty; + sc->nr.congested += stat.nr_congested; + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; + sc->nr.writeback += stat.nr_writeback; + sc->nr.immediate += stat.nr_immediate; + sc->nr.taken += scanned; + + if (type) + sc->nr.file_taken += scanned; + trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, scanned, reclaimed, &stat, sc->priority, type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (lru_gen_enabled() && root_reclaim(sc)) { lru_gen_shrink_node(pgdat, sc); + lru_gen_throttle(pgdat, sc); return; } -- 2.25.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
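For context, reclaim_throttle(), which the proposed lru_gen_throttle() above feeds into, essentially parks the reclaiming task on a per-node waitqueue with a timeout until writeback progress (or the timeout) wakes it, instead of letting it immediately hammer the LRU lock again. Below is a stripped-down model of that idea, with made-up names, none of the real bookkeeping, and waitqueue initialization assumed to happen elsewhere; it is not the mm/vmscan.c implementation.

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/atomic.h>
#include <linux/jiffies.h>

struct toy_node_throttle {
	wait_queue_head_t wait;
	atomic_t nr_throttled;
};

/* Called by a direct reclaimer that found the node congested. */
static void toy_reclaim_throttle(struct toy_node_throttle *t)
{
	DEFINE_WAIT(wq_entry);

	atomic_inc(&t->nr_throttled);
	prepare_to_wait(&t->wait, &wq_entry, TASK_INTERRUPTIBLE);
	/* Sleep until woken by I/O completion, or for at most ~100ms. */
	schedule_timeout(HZ / 10);
	finish_wait(&t->wait, &wq_entry);
	atomic_dec(&t->nr_throttled);
}

/* Called from the writeback completion side when pages finish I/O. */
static void toy_reclaim_progress(struct toy_node_throttle *t)
{
	if (atomic_read(&t->nr_throttled))
		wake_up_all(&t->wait);
}

The open question in the thread is not the mechanism itself (the legacy path already does this via the PGDAT_*/LRUVEC_*_CONGESTED bits the patch sets) but whether those same signals are the right ones to wire into the MGLRU path.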
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-25 9:59 ` zhaoyang.huang @ 2024-07-26 3:26 ` Zhaoyang Huang 2024-07-29 4:49 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Zhaoyang Huang @ 2024-07-26 3:26 UTC (permalink / raw) To: zhaoyang.huang Cc: bharata, Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, steve.kang On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang <zhaoyang.huang@unisoc.com> wrote: > > >However during the weekend mglru-enabled run (with above fix to > >isolate_lru_folios() and also the previous two patches: truncate.patch > >and mglru.patch and the inode fix provided by Mateusz), another hard > >lockup related to lruvec spinlock was observed. > > > >Here is the hardlock up: > > > >watchdog: Watchdog detected hard LOCKUP on cpu 466 > >CPU: 466 PID: 3103929 Comm: fio Not tainted > >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > >RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > >Call Trace: > > <NMI> > > ? show_regs+0x69/0x80 > > ? watchdog_hardlockup_check+0x1b4/0x3a0 > ><SNIP> > > ? native_queued_spin_lock_slowpath+0x2b4/0x300 > > </NMI> > > <IRQ> > > _raw_spin_lock_irqsave+0x5b/0x70 > > folio_lruvec_lock_irqsave+0x62/0x90 > > folio_batch_move_lru+0x9d/0x160 > > folio_rotate_reclaimable+0xab/0xf0 > > folio_end_writeback+0x60/0x90 > > end_buffer_async_write+0xaa/0xe0 > > end_bio_bh_io_sync+0x2c/0x50 > > bio_endio+0x108/0x180 > > blk_mq_end_request_batch+0x11f/0x5e0 > > nvme_pci_complete_batch+0xb5/0xd0 [nvme] > > nvme_irq+0x92/0xe0 [nvme] > > __handle_irq_event_percpu+0x6e/0x1e0 > > handle_irq_event+0x39/0x80 > > handle_edge_irq+0x8c/0x240 > > __common_interrupt+0x4e/0xf0 > > common_interrupt+0x49/0xc0 > > asm_common_interrupt+0x27/0x40 > > > >Here is the lock holder details captured by all-cpu-backtrace: > > > >NMI backtrace for cpu 75 > >CPU: 75 PID: 3095650 Comm: fio Not tainted > >6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32 > >RIP: 0010:folio_inc_gen+0x142/0x430 > >Call Trace: > > <NMI> > > ? show_regs+0x69/0x80 > > ? nmi_cpu_backtrace+0xc5/0x130 > > ? nmi_cpu_backtrace_handler+0x11/0x20 > > ? nmi_handle+0x64/0x180 > > ? default_do_nmi+0x45/0x130 > > ? exc_nmi+0x128/0x1a0 > > ? end_repeat_nmi+0xf/0x53 > > ? folio_inc_gen+0x142/0x430 > > ? folio_inc_gen+0x142/0x430 > > ? folio_inc_gen+0x142/0x430 > > </NMI> > > <TASK> > > isolate_folios+0x954/0x1630 > > evict_folios+0xa5/0x8c0 > > try_to_shrink_lruvec+0x1be/0x320 > > shrink_one+0x10f/0x1d0 > > shrink_node+0xa4c/0xc90 > > do_try_to_free_pages+0xc0/0x590 > > try_to_free_pages+0xde/0x210 > > __alloc_pages_noprof+0x6ae/0x12c0 > > alloc_pages_mpol_noprof+0xd9/0x220 > > folio_alloc_noprof+0x63/0xe0 > > filemap_alloc_folio_noprof+0xf4/0x100 > > page_cache_ra_unbounded+0xb9/0x1a0 > > page_cache_ra_order+0x26e/0x310 > > ondemand_readahead+0x1a3/0x360 > > page_cache_sync_ra+0x83/0x90 > > filemap_get_pages+0xf0/0x6a0 > > filemap_read+0xe7/0x3d0 > > blkdev_read_iter+0x6f/0x140 > > vfs_read+0x25b/0x340 > > ksys_read+0x67/0xf0 > > __x64_sys_read+0x19/0x20 > > x64_sys_call+0x1771/0x20d0 > > do_syscall_64+0x7e/0x130 > > From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2e34de9cd0d4..827036e21f24 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw > return scanned; > } > > +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) > +{ > + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); > + > + if (current_is_kswapd()) { > + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) > + set_bit(PGDAT_WRITEBACK, &pgdat->flags); > + > + /* Allow kswapd to start writing pages during reclaim.*/ > + if (sc->nr.unqueued_dirty == sc->nr.file_taken) > + set_bit(PGDAT_DIRTY, &pgdat->flags); > + > + if (sc->nr.immediate) > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); > + } > + > + /* > + * Tag a node/memcg as congested if all the dirty pages were marked > + * for writeback and immediate reclaim (counted in nr.congested). > + * > + * Legacy memcg will stall in page writeback so avoid forcibly > + * stalling in reclaim_throttle(). > + */ > + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { > + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) > + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); > + > + if (current_is_kswapd()) > + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); > + } > + > + /* > + * Stall direct reclaim for IO completions if the lruvec is > + * node is congested. Allow kswapd to continue until it > + * starts encountering unqueued dirty pages or cycling through > + * the LRU too quickly. > + */ > + if (!current_is_kswapd() && current_may_throttle() && > + !sc->hibernation_mode && > + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || > + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); > +} > + > static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) > { > int type; > @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap > retry: > reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); > sc->nr_reclaimed += reclaimed; > + sc->nr.dirty += stat.nr_dirty; > + sc->nr.congested += stat.nr_congested; > + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; > + sc->nr.writeback += stat.nr_writeback; > + sc->nr.immediate += stat.nr_immediate; > + sc->nr.taken += scanned; > + > + if (type) > + sc->nr.file_taken += scanned; > + > trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, > scanned, reclaimed, &stat, sc->priority, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > > if (lru_gen_enabled() && root_reclaim(sc)) { > lru_gen_shrink_node(pgdat, sc); > + lru_gen_throttle(pgdat, sc); > return; > } Hi Bharata, This patch arised from a regression Android test case failure which allocated 1GB virtual memory by each over 8 threads on an 5.5GB RAM system. This test could pass on legacy LRU management while failing under MGLRU as a watchdog monitor detected abnormal system-wide schedule status(watchdog can't be scheduled within 60 seconds). This patch with a slight change as below got passed in the test whereas has not been investigated deeply for how it was done. Theoretically, this patch enrolled the similar reclaim throttle mechanism as legacy do which could reduce the contention of lruvec->lru_lock. 
I think this patch is quite naive for now, but I am hoping it could help you, as your case looks like a scalability issue under memory pressure rather than a deadlock. Thank you!

The change in the applied version (throttle the reclaim before shrinking instead of after):

         if (lru_gen_enabled() && root_reclaim(sc)) {
+                lru_gen_throttle(pgdat, sc);
                 lru_gen_shrink_node(pgdat, sc);
-                lru_gen_throttle(pgdat, sc);
                 return;
         }

> > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-26 3:26 ` Zhaoyang Huang @ 2024-07-29 4:49 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-29 4:49 UTC (permalink / raw) To: Zhaoyang Huang, zhaoyang.huang Cc: Neeraj.Upadhyay, akpm, david, kinseyho, linux-kernel, linux-mm, mgorman, mjguzik, nikunj, vbabka, willy, yuzhao, steve.kang On 26-Jul-24 8:56 AM, Zhaoyang Huang wrote: > On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang > <zhaoyang.huang@unisoc.com> wrote: <snip> >> From the callstack of lock holder, it is looks like a scability issue rather than a deadlock. Unlike legacy LRU management, there is no throttling mechanism for global reclaim under mglru so far.Could we apply the similar method to throttle the reclaim when it is too aggresive. I am wondering if this patch which is a rough version could help on this? >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 2e34de9cd0d4..827036e21f24 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw >> return scanned; >> } >> >> +static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc) >> +{ >> + struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); >> + >> + if (current_is_kswapd()) { >> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) >> + set_bit(PGDAT_WRITEBACK, &pgdat->flags); >> + >> + /* Allow kswapd to start writing pages during reclaim.*/ >> + if (sc->nr.unqueued_dirty == sc->nr.file_taken) >> + set_bit(PGDAT_DIRTY, &pgdat->flags); >> + >> + if (sc->nr.immediate) >> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); >> + } >> + >> + /* >> + * Tag a node/memcg as congested if all the dirty pages were marked >> + * for writeback and immediate reclaim (counted in nr.congested). >> + * >> + * Legacy memcg will stall in page writeback so avoid forcibly >> + * stalling in reclaim_throttle(). >> + */ >> + if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) { >> + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) >> + set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); >> + >> + if (current_is_kswapd()) >> + set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); >> + } >> + >> + /* >> + * Stall direct reclaim for IO completions if the lruvec is >> + * node is congested. Allow kswapd to continue until it >> + * starts encountering unqueued dirty pages or cycling through >> + * the LRU too quickly. 
>> + */ >> + if (!current_is_kswapd() && current_may_throttle() && >> + !sc->hibernation_mode && >> + (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || >> + test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) >> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); >> +} >> + >> static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) >> { >> int type; >> @@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap >> retry: >> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false); >> sc->nr_reclaimed += reclaimed; >> + sc->nr.dirty += stat.nr_dirty; >> + sc->nr.congested += stat.nr_congested; >> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty; >> + sc->nr.writeback += stat.nr_writeback; >> + sc->nr.immediate += stat.nr_immediate; >> + sc->nr.taken += scanned; >> + >> + if (type) >> + sc->nr.file_taken += scanned; >> + >> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id, >> scanned, reclaimed, &stat, sc->priority, >> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); >> @@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) >> >> if (lru_gen_enabled() && root_reclaim(sc)) { >> lru_gen_shrink_node(pgdat, sc); >> + lru_gen_throttle(pgdat, sc); >> return; >> } > Hi Bharata, > This patch arised from a regression Android test case failure which > allocated 1GB virtual memory by each over 8 threads on an 5.5GB RAM > system. This test could pass on legacy LRU management while failing > under MGLRU as a watchdog monitor detected abnormal system-wide > schedule status(watchdog can't be scheduled within 60 seconds). This > patch with a slight change as below got passed in the test whereas has > not been investigated deeply for how it was done. Theoretically, this > patch enrolled the similar reclaim throttle mechanism as legacy do > which could reduce the contention of lruvec->lru_lock. I think this > patch is quite naive for now, but I am hoping it could help you as > your case seems like a scability issue of memory pressure rather than > a deadlock issue. Thank you! > > the change of the applied version(try to throttle the reclaim before > instead of after) > if (lru_gen_enabled() && root_reclaim(sc)) { > + lru_gen_throttle(pgdat, sc); > lru_gen_shrink_node(pgdat, sc); > - lru_gen_throttle(pgdat, sc); > return; > } Thanks Zhaoyang Huang for the patch, will give this a test and report back. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 5:58 ` Yu Zhao 2024-07-11 5:43 ` Bharata B Rao @ 2024-08-13 11:04 ` Usama Arif 2024-08-13 17:43 ` Yu Zhao 1 sibling, 1 reply; 37+ messages in thread From: Usama Arif @ 2024-08-13 11:04 UTC (permalink / raw) To: Yu Zhao, Bharata B Rao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, leitao On 09/07/2024 06:58, Yu Zhao wrote: > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: >> >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > Thanks. > >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is >> already a comment in there which explains why nr_skipped shouldn't be >> counted, but is there any possibility of re-looking at this condition? > > For this specific case, probably this can help: > > @@ -1659,8 +1659,15 @@ static unsigned long > isolate_lru_folios(unsigned long nr_to_scan, > if (folio_zonenum(folio) > sc->reclaim_idx || > skip_cma(folio, sc)) { > nr_skipped[folio_zonenum(folio)] += nr_pages; > - move_to = &folios_skipped; > - goto move; > + list_move(&folio->lru, &folios_skipped); > + if (spin_is_contended(&lruvec->lru_lock)) { > + if (!list_empty(dst)) > + break; > + spin_unlock_irq(&lruvec->lru_lock); > + cond_resched(); > + spin_lock_irq(&lruvec->lru_lock); > + } > + continue; > } > Hi Yu, We are seeing lockups and high memory pressure in Meta production due to this lock contention as well. 
My colleague highlighted it in https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed to this fix. We removed the skip_cma check as a temporary measure, but this is a proper fix. I might have missed it, but I didn't see this sent as a patch to the mailing list. Just wanted to check whether you were planning to send it as a patch; happy to send it on your behalf as well. Thanks ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-08-13 11:04 ` Usama Arif @ 2024-08-13 17:43 ` Yu Zhao 0 siblings, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-08-13 17:43 UTC (permalink / raw) To: Usama Arif Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, leitao On Tue, Aug 13, 2024 at 5:04 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 09/07/2024 06:58, Yu Zhao wrote: > > On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: > >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: > >>>> > >>>> Hi Yu Zhao, > >>>> > >>>> Thanks for your patches. See below... > >>>> > >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >>>>> Hi Bharata, > >>>>> > >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: > >>>>>> > >>>> <snip> > >>>>>> > >>>>>> Some experiments tried > >>>>>> ====================== > >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard > >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. > >>>>> > >>>>> This is not really an MGLRU issue -- can you please try one of the > >>>>> attached patches? It (truncate.patch) should help with or without > >>>>> MGLRU. > >>>> > >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. > >>> > >>> Thanks. > >>> > >>> In your original report, you said: > >>> > >>> Most of the times the two contended locks are lruvec and > >>> inode->i_lock spinlocks. > >>> ... > >>> Often times, the perf output at the time of the problem shows > >>> heavy contention on lruvec spin lock. Similar contention is > >>> also observed with inode i_lock (in clear_shadow_entry path) > >>> > >>> Based on this new report, does it mean the i_lock is not as contended, > >>> for the same path (truncation) you tested? If so, I'll post > >>> truncate.patch and add reported-by and tested-by you, unless you have > >>> objections. > >> > >> truncate.patch has been tested on two systems with default LRU scheme > >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > > > > Thanks. > > > >>> > >>> The two paths below were contended on the LRU lock, but they already > >>> batch their operations. So I don't know what else we can do surgically > >>> to improve them. > >> > >> What has been seen with this workload is that the lruvec spinlock is > >> held for a long time from shrink_[active/inactive]_list path. In this > >> path, there is a case in isolate_lru_folios() where scanning of LRU > >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > >> scanning/skipping of more than 150 million folios were seen. There is > >> already a comment in there which explains why nr_skipped shouldn't be > >> counted, but is there any possibility of re-looking at this condition? 
> > > > For this specific case, probably this can help: > > > > @@ -1659,8 +1659,15 @@ static unsigned long > > isolate_lru_folios(unsigned long nr_to_scan, > > if (folio_zonenum(folio) > sc->reclaim_idx || > > skip_cma(folio, sc)) { > > nr_skipped[folio_zonenum(folio)] += nr_pages; > > - move_to = &folios_skipped; > > - goto move; > > + list_move(&folio->lru, &folios_skipped); > > + if (spin_is_contended(&lruvec->lru_lock)) { > > + if (!list_empty(dst)) > > + break; > > + spin_unlock_irq(&lruvec->lru_lock); > > + cond_resched(); > > + spin_lock_irq(&lruvec->lru_lock); > > + } > > + continue; Nitpick: if () { ... if (!spin_is_contended(&lruvec->lru_lock)) continue; if (!list_empty(dst)) break; spin_unlock_irq(&lruvec->lru_lock); cond_resched(); spin_lock_irq(&lruvec->lru_lock); } > Hi Yu, > > We are seeing lockups and high memory pressure in Meta production due to this lock contention as well. My colleague highlighted it in https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/ and was pointed to this fix. > > We removed skip_cma check as a temporary measure, but this is a proper fix. I might have missed it but didn't see this as a patch on the mailing list. Just wanted to check if you were planning to send it as a patch? Happy to send it on your behalf as well. Please. Thank you. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-09 4:30 ` Bharata B Rao 2024-07-09 5:58 ` Yu Zhao @ 2024-07-17 9:37 ` Vlastimil Babka 2024-07-17 10:50 ` Bharata B Rao 1 sibling, 1 reply; 37+ messages in thread From: Vlastimil Babka @ 2024-07-17 9:37 UTC (permalink / raw) To: Bharata B Rao, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik On 7/9/24 6:30 AM, Bharata B Rao wrote: > On 08-Jul-24 9:47 PM, Yu Zhao wrote: >> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>> >>> Hi Yu Zhao, >>> >>> Thanks for your patches. See below... >>> >>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>> Hi Bharata, >>>> >>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>> >>> <snip> >>>>> >>>>> Some experiments tried >>>>> ====================== >>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>> >>>> This is not really an MGLRU issue -- can you please try one of the >>>> attached patches? It (truncate.patch) should help with or without >>>> MGLRU. >>> >>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >> >> Thanks. >> >> In your original report, you said: >> >> Most of the times the two contended locks are lruvec and >> inode->i_lock spinlocks. >> ... >> Often times, the perf output at the time of the problem shows >> heavy contention on lruvec spin lock. Similar contention is >> also observed with inode i_lock (in clear_shadow_entry path) >> >> Based on this new report, does it mean the i_lock is not as contended, >> for the same path (truncation) you tested? If so, I'll post >> truncate.patch and add reported-by and tested-by you, unless you have >> objections. > > truncate.patch has been tested on two systems with default LRU scheme > and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. > >> >> The two paths below were contended on the LRU lock, but they already >> batch their operations. So I don't know what else we can do surgically >> to improve them. > > What has been seen with this workload is that the lruvec spinlock is > held for a long time from shrink_[active/inactive]_list path. In this > path, there is a case in isolate_lru_folios() where scanning of LRU > lists can become unbounded. To isolate a page from ZONE_DMA, sometimes > scanning/skipping of more than 150 million folios were seen. There is It seems weird to me to see anything that would require ZONE_DMA allocation on a modern system. Do you know where it comes from? > already a comment in there which explains why nr_skipped shouldn't be > counted, but is there any possibility of re-looking at this condition? > > Regards, > Bharata. > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:37 ` Vlastimil Babka @ 2024-07-17 10:50 ` Bharata B Rao 2024-07-17 11:15 ` Hillf Danton 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-17 10:50 UTC (permalink / raw) To: Vlastimil Babka, Yu Zhao Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman, Petr Tesarik On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: > On 7/9/24 6:30 AM, Bharata B Rao wrote: >> On 08-Jul-24 9:47 PM, Yu Zhao wrote: >>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote: >>>> >>>> Hi Yu Zhao, >>>> >>>> Thanks for your patches. See below... >>>> >>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote: >>>>> Hi Bharata, >>>>> >>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote: >>>>>> >>>> <snip> >>>>>> >>>>>> Some experiments tried >>>>>> ====================== >>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard >>>>>> lockups were seen for 48 hours run. Below is once such soft lockup. >>>>> >>>>> This is not really an MGLRU issue -- can you please try one of the >>>>> attached patches? It (truncate.patch) should help with or without >>>>> MGLRU. >>>> >>>> With truncate.patch and default LRU scheme, a few hard lockups are seen. >>> >>> Thanks. >>> >>> In your original report, you said: >>> >>> Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> ... >>> Often times, the perf output at the time of the problem shows >>> heavy contention on lruvec spin lock. Similar contention is >>> also observed with inode i_lock (in clear_shadow_entry path) >>> >>> Based on this new report, does it mean the i_lock is not as contended, >>> for the same path (truncation) you tested? If so, I'll post >>> truncate.patch and add reported-by and tested-by you, unless you have >>> objections. >> >> truncate.patch has been tested on two systems with default LRU scheme >> and the lockup due to inode->i_lock hasn't been seen yet after 24 hours run. >> >>> >>> The two paths below were contended on the LRU lock, but they already >>> batch their operations. So I don't know what else we can do surgically >>> to improve them. >> >> What has been seen with this workload is that the lruvec spinlock is >> held for a long time from shrink_[active/inactive]_list path. In this >> path, there is a case in isolate_lru_folios() where scanning of LRU >> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes >> scanning/skipping of more than 150 million folios were seen. There is > > It seems weird to me to see anything that would require ZONE_DMA allocation > on a modern system. Do you know where it comes from? We measured the lruvec spinlock start, end and hold time(htime) using sched_clock(), along with a BUG() if the hold time was more than 10s. The below case shows that lruvec spin lock was held for ~25s. vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime 27963324369895, htime 25889317166 (time in ns) kernel BUG at include/linux/memcontrol.h:1677! Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W 6.10.0-rc3-qspindbg #10 RIP: 0010:shrink_active_list+0x40a/0x520 Call Trace: <TASK> shrink_lruvec+0x981/0x13b0 shrink_node+0x358/0xd30 balance_pgdat+0x3a3/0xa60 kswapd+0x207/0x3a0 kthread+0xe1/0x120 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1a/0x30 </TASK> As you can see the call stack is from kswapd but not sure what is the exact trigger. 
Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
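The hold-time measurement described above is local instrumentation rather than part of the posted patches. A hypothetical sketch of that kind of wrapper (made-up names, warning instead of BUG()) could look like:

#include <linux/sched/clock.h>
#include <linux/spinlock.h>
#include <linux/time64.h>
#include <linux/printk.h>

#define HOLD_WARN_NS	(10ULL * NSEC_PER_SEC)

struct timed_lock {
	spinlock_t lock;
	u64 locked_at;		/* sched_clock() at acquisition */
};

static inline void timed_lock_irq(struct timed_lock *tl)
{
	spin_lock_irq(&tl->lock);
	tl->locked_at = sched_clock();
}

static inline void timed_unlock_irq(struct timed_lock *tl)
{
	u64 held = sched_clock() - tl->locked_at;

	spin_unlock_irq(&tl->lock);
	if (unlikely(held > HOLD_WARN_NS))
		pr_err("lock held for %llu ns\n", held);
}

This answers how long the lock was held, not Vlastimil's question of which allocation ends up reclaiming with a ZONE_DMA-constrained reclaim_idx; that would need tracing on the allocation side as well.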
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 10:50 ` Bharata B Rao @ 2024-07-17 11:15 ` Hillf Danton 2024-07-18 9:02 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Hillf Danton @ 2024-07-17 11:15 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy, Mel Gorman On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com> > On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: > > > > It seems weird to me to see anything that would require ZONE_DMA allocation > > on a modern system. Do you know where it comes from? > > We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > What is more unusual could be observed perhaps with your hardware config but with 386MiB RAM assigned to each node, the so called tight memory but not extremely tight. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 11:15 ` Hillf Danton @ 2024-07-18 9:02 ` Bharata B Rao 0 siblings, 0 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-18 9:02 UTC (permalink / raw) To: Hillf Danton Cc: Vlastimil Babka, Yu Zhao, linux-mm, linux-kernel, willy, Mel Gorman, Dadhania, Nikunj, Upadhyay, Neeraj On 17-Jul-24 4:45 PM, Hillf Danton wrote: > On Wed, 17 Jul 2024 16:20:04 +0530 Bharata B Rao <bharata@amd.com> >> On 17-Jul-24 3:07 PM, Vlastimil Babka wrote: >>> >>> It seems weird to me to see anything that would require ZONE_DMA allocation >>> on a modern system. Do you know where it comes from? >> >> We measured the lruvec spinlock start, end and hold >> time(htime) using sched_clock(), along with a BUG() if the hold time was >> more than 10s. The below case shows that lruvec spin lock was held for ~25s. >> > What is more unusual could be observed perhaps with your hardware config but > with 386MiB RAM assigned to each node, the so called tight memory but not > extremely tight. Hardware config is this: Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) Memory: 1.5 TB 10 NVME - 3.5TB each available: 2 nodes (0-1) node 0 cpus: 0-127,256-383 node 0 size: 773727 MB node 1 cpus: 128-255,384-511 node 1 size: 773966 MB But I don't quite follow what you are hinting at, can you please rephrase or be more verbose? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-06 22:42 ` Yu Zhao 2024-07-08 14:34 ` Bharata B Rao @ 2024-07-10 12:03 ` Bharata B Rao 2024-07-10 12:24 ` Mateusz Guzik 2024-07-10 18:04 ` Yu Zhao 1 sibling, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-10 12:03 UTC (permalink / raw) To: Yu Zhao, mjguzik, david, kent.overstreet Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On 07-Jul-24 4:12 AM, Yu Zhao wrote: >> Some experiments tried >> ====================== >> 1) When MGLRU was enabled many soft lockups were observed, no hard >> lockups were seen for 48 hours run. Below is once such soft lockup. <snip> >> Below preemptirqsoff trace points to preemption being disabled for more >> than 10s and the lock in picture is lruvec spinlock. > > Also if you could try the other patch (mglru.patch) please. It should > help reduce unnecessary rotations from deactivate_file_folio(), which > in turn should reduce the contention on the LRU lock for MGLRU. Thanks. With mglru.patch on a MGLRU-enabled system, the below latency trace record is no longer seen for a 30hr workload run. > >> # tracer: preemptirqsoff >> # >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc >> # -------------------------------------------------------------------- >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 >> HP:0 #P:512) >> # ----------------- >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) >> # ----------------- >> # => started at: deactivate_file_folio >> # => ended at: deactivate_file_folio >> # >> # >> # _------=> CPU# >> # / _-----=> irqs-off/BH-disabled >> # | / _----=> need-resched >> # || / _---=> hardirq/softirq >> # ||| / _--=> preempt-depth >> # |||| / _-=> migrate-disable >> # ||||| / delay >> # cmd pid |||||| time | caller >> # \ / |||||| \ | / >> fio-2701523 128...1. 0us$: deactivate_file_folio >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on >> <-deactivate_file_folio >> fio-2701523 128.N.1. 10382691us : <stack trace> >> => deactivate_file_folio >> => mapping_try_invalidate >> => invalidate_mapping_pages >> => invalidate_bdev >> => blkdev_common_ioctl >> => blkdev_ioctl >> => __x64_sys_ioctl >> => x64_sys_call >> => do_syscall_64 >> => entry_SYSCALL_64_after_hwframe However the contention now has shifted to inode_hash_lock. Around 55 softlockups in ilookup() were observed: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru # -------------------------------------------------------------------- # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:512) # ----------------- # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: ilookup # => ended at: ilookup # # # _------=> CPU# # / _-----=> irqs-off/BH-disabled # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / _-=> migrate-disable # ||||| / delay # cmd pid |||||| time | caller # \ / |||||| \ | / fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup fio-3244715 260.N.1. 
10620440us : <stack trace> => _raw_spin_unlock => ilookup => blkdev_get_no_open => blkdev_open => do_dentry_open => vfs_open => path_openat => do_filp_open => do_sys_openat2 => __x64_sys_openat => x64_sys_call => do_syscall_64 => entry_SYSCALL_64_after_hwframe It appears that scalability issues with inode_hash_lock has been brought up multiple times in the past and there were patches to address the same. https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ CC'ing FS folks/list for awareness/comments. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
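Of the two series linked above, the first (Dave Chinner's) replaces the single global inode_hash_lock with per-bucket bit spinlocks (hlist_bl), while the second (Mateusz's, continued below) makes the common lookups lock-free with RCU. A toy illustration of the per-bucket idea follows, with made-up types and hash size; it is not the VFS code.

#include <linux/list_bl.h>
#include <linux/hash.h>

#define TOY_HASH_BITS	10
static struct hlist_bl_head toy_hash[1 << TOY_HASH_BITS];

struct toy_inode {
	struct hlist_bl_node hash_node;
	unsigned long ino;
};

/* Insert and lookup serialize per bucket, not on one global spinlock. */
static void toy_insert(struct toy_inode *ti)
{
	struct hlist_bl_head *b = &toy_hash[hash_long(ti->ino, TOY_HASH_BITS)];

	hlist_bl_lock(b);
	hlist_bl_add_head(&ti->hash_node, b);
	hlist_bl_unlock(b);
}

static struct toy_inode *toy_lookup(unsigned long ino)
{
	struct hlist_bl_head *b = &toy_hash[hash_long(ino, TOY_HASH_BITS)];
	struct hlist_bl_node *pos;
	struct toy_inode *ti;

	hlist_bl_lock(b);
	hlist_bl_for_each_entry(ti, pos, b, hash_node) {
		if (ti->ino == ino) {
			hlist_bl_unlock(b);
			return ti;
		}
	}
	hlist_bl_unlock(b);
	return NULL;
}

With per-bucket locks, two CPUs opening different block devices no longer contend unless their inode numbers hash to the same bucket.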
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:03 ` Bharata B Rao @ 2024-07-10 12:24 ` Mateusz Guzik 2024-07-10 13:04 ` Mateusz Guzik 2024-07-10 18:04 ` Yu Zhao 1 sibling, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-10 12:24 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote: > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > <snip> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > trace record is no longer seen for a 30hr workload run. > > > > >> # tracer: preemptirqsoff > >> # > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > >> # -------------------------------------------------------------------- > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > >> HP:0 #P:512) > >> # ----------------- > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > >> # ----------------- > >> # => started at: deactivate_file_folio > >> # => ended at: deactivate_file_folio > >> # > >> # > >> # _------=> CPU# > >> # / _-----=> irqs-off/BH-disabled > >> # | / _----=> need-resched > >> # || / _---=> hardirq/softirq > >> # ||| / _--=> preempt-depth > >> # |||| / _-=> migrate-disable > >> # ||||| / delay > >> # cmd pid |||||| time | caller > >> # \ / |||||| \ | / > >> fio-2701523 128...1. 0us$: deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382691us : <stack trace> > >> => deactivate_file_folio > >> => mapping_try_invalidate > >> => invalidate_mapping_pages > >> => invalidate_bdev > >> => blkdev_common_ioctl > >> => blkdev_ioctl > >> => __x64_sys_ioctl > >> => x64_sys_call > >> => do_syscall_64 > >> => entry_SYSCALL_64_after_hwframe > > However the contention now has shifted to inode_hash_lock. Around 55 > softlockups in ilookup() were observed: > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > # -------------------------------------------------------------------- > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > #P:512) > # ----------------- > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: ilookup > # => ended at: ilookup > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > fio-3244715 260.N.1. 
10620429us : _raw_spin_unlock <-ilookup > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > fio-3244715 260.N.1. 10620440us : <stack trace> > => _raw_spin_unlock > => ilookup > => blkdev_get_no_open > => blkdev_open > => do_dentry_open > => vfs_open > => path_openat > => do_filp_open > => do_sys_openat2 > => __x64_sys_openat > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > It appears that scalability issues with inode_hash_lock has been brought > up multiple times in the past and there were patches to address the same. > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > CC'ing FS folks/list for awareness/comments. Note my patch does not enable RCU usage in ilookup, but this can be trivially added. I can't even compile-test at the moment, but the diff below should do it. Also note the patches are present here https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu , not yet integrated anywhere. That said, if fio you are operating on the same target inode every time then this is merely going to shift contention to the inode spinlock usage in find_inode_fast. diff --git a/fs/inode.c b/fs/inode.c index ad7844ca92f9..70b0e6383341 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) { struct hlist_head *head = inode_hashtable + hash(sb, ino); struct inode *inode; + again: - spin_lock(&inode_hash_lock); - inode = find_inode_fast(sb, head, ino, true); - spin_unlock(&inode_hash_lock); + inode = find_inode_fast(sb, head, ino, false); + if (IS_ERR_OR_NULL_PTR(inode)) { + spin_lock(&inode_hash_lock); + inode = find_inode_fast(sb, head, ino, true); + spin_unlock(&inode_hash_lock); + } if (inode) { if (IS_ERR(inode)) -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply related [flat|nested] 37+ messages in thread
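The ilookup() diff above follows the usual "optimistic RCU lookup, fall back to the lock when inconclusive" shape. Below is a generic self-contained sketch of that pattern with illustrative types; insertion and removal are assumed to happen under the same lock using hlist_add_head_rcu()/hlist_del_rcu(), and objects are assumed to be freed via RCU.

#include <linux/rculist.h>
#include <linux/refcount.h>
#include <linux/spinlock.h>

struct obj {
	struct hlist_node node;
	unsigned long key;
	refcount_t ref;
};

static struct obj *obj_lookup(struct hlist_head *head, spinlock_t *lock,
			      unsigned long key)
{
	struct obj *o;

	rcu_read_lock();
	hlist_for_each_entry_rcu(o, head, node) {
		if (o->key == key && refcount_inc_not_zero(&o->ref)) {
			rcu_read_unlock();
			return o;	/* fast path: global lock never taken */
		}
	}
	rcu_read_unlock();

	/* Slow path: a racing insert/teardown is possible; retry under the lock. */
	spin_lock(lock);
	hlist_for_each_entry(o, head, node) {
		if (o->key == key && refcount_inc_not_zero(&o->ref)) {
			spin_unlock(lock);
			return o;
		}
	}
	spin_unlock(lock);
	return NULL;
}

Even with that, as Mateusz notes above, the per-inode spinlock taken inside find_inode_fast() remains, so a single hot inode (here, one block device node) can still serialize the openers.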
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:24 ` Mateusz Guzik @ 2024-07-10 13:04 ` Mateusz Guzik 2024-07-15 5:22 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-10 13:04 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 2:24 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@amd.com> wrote: > > > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > > >> Some experiments tried > > >> ====================== > > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > > >> lockups were seen for 48 hours run. Below is once such soft lockup. > > <snip> > > >> Below preemptirqsoff trace points to preemption being disabled for more > > >> than 10s and the lock in picture is lruvec spinlock. > > > > > > Also if you could try the other patch (mglru.patch) please. It should > > > help reduce unnecessary rotations from deactivate_file_folio(), which > > > in turn should reduce the contention on the LRU lock for MGLRU. > > > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > > trace record is no longer seen for a 30hr workload run. > > > > > > > >> # tracer: preemptirqsoff > > >> # > > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > > >> # -------------------------------------------------------------------- > > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > > >> HP:0 #P:512) > > >> # ----------------- > > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > > >> # ----------------- > > >> # => started at: deactivate_file_folio > > >> # => ended at: deactivate_file_folio > > >> # > > >> # > > >> # _------=> CPU# > > >> # / _-----=> irqs-off/BH-disabled > > >> # | / _----=> need-resched > > >> # || / _---=> hardirq/softirq > > >> # ||| / _--=> preempt-depth > > >> # |||| / _-=> migrate-disable > > >> # ||||| / delay > > >> # cmd pid |||||| time | caller > > >> # \ / |||||| \ | / > > >> fio-2701523 128...1. 0us$: deactivate_file_folio > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > > >> <-deactivate_file_folio > > >> fio-2701523 128.N.1. 10382691us : <stack trace> > > >> => deactivate_file_folio > > >> => mapping_try_invalidate > > >> => invalidate_mapping_pages > > >> => invalidate_bdev > > >> => blkdev_common_ioctl > > >> => blkdev_ioctl > > >> => __x64_sys_ioctl > > >> => x64_sys_call > > >> => do_syscall_64 > > >> => entry_SYSCALL_64_after_hwframe > > > > However the contention now has shifted to inode_hash_lock. 
Around 55 > > softlockups in ilookup() were observed: > > > > # tracer: preemptirqsoff > > # > > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > > # -------------------------------------------------------------------- > > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > > #P:512) > > # ----------------- > > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > > # ----------------- > > # => started at: ilookup > > # => ended at: ilookup > > # > > # > > # _------=> CPU# > > # / _-----=> irqs-off/BH-disabled > > # | / _----=> need-resched > > # || / _---=> hardirq/softirq > > # ||| / _--=> preempt-depth > > # |||| / _-=> migrate-disable > > # ||||| / delay > > # cmd pid |||||| time | caller > > # \ / |||||| \ | / > > fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > > fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > > fio-3244715 260.N.1. 10620440us : <stack trace> > > => _raw_spin_unlock > > => ilookup > > => blkdev_get_no_open > > => blkdev_open > > => do_dentry_open > > => vfs_open > > => path_openat > > => do_filp_open > > => do_sys_openat2 > > => __x64_sys_openat > > => x64_sys_call > > => do_syscall_64 > > => entry_SYSCALL_64_after_hwframe > > > > It appears that scalability issues with inode_hash_lock has been brought > > up multiple times in the past and there were patches to address the same. > > > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > > > CC'ing FS folks/list for awareness/comments. > > Note my patch does not enable RCU usage in ilookup, but this can be > trivially added. > > I can't even compile-test at the moment, but the diff below should do > it. Also note the patches are present here > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu > , not yet integrated anywhere. > > That said, if fio you are operating on the same target inode every > time then this is merely going to shift contention to the inode > spinlock usage in find_inode_fast. > > diff --git a/fs/inode.c b/fs/inode.c > index ad7844ca92f9..70b0e6383341 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, > unsigned long ino) > { > struct hlist_head *head = inode_hashtable + hash(sb, ino); > struct inode *inode; > + > again: > - spin_lock(&inode_hash_lock); > - inode = find_inode_fast(sb, head, ino, true); > - spin_unlock(&inode_hash_lock); > + inode = find_inode_fast(sb, head, ino, false); > + if (IS_ERR_OR_NULL_PTR(inode)) { > + spin_lock(&inode_hash_lock); > + inode = find_inode_fast(sb, head, ino, true); > + spin_unlock(&inode_hash_lock); > + } > > if (inode) { > if (IS_ERR(inode)) > I think I expressed myself poorly, so here is take two: 1. inode hash soft lookup should get resolved if you apply https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 and the above pasted fix (not compile tested tho, but it should be obvious what the intended fix looks like) 2. find_inode_hash spinlocks the target inode. if your bench only operates on one, then contention is going to shift there and you may still be getting soft lockups. not taking the spinlock in this codepath is hackable, but I don't want to do it without a good justification. 
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
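The diff pasted above was not compile-tested, and IS_ERR_OR_NULL_PTR() does not exist in mainline; the standard helper in include/linux/err.h is IS_ERR_OR_NULL(). A minimal sketch of the intended lockless-first lookup, assuming the four-argument find_inode_fast() from the vfs.inode.rcu branch (its last parameter says whether inode_hash_lock is already held), could look like the following. This is an illustration of the diff's intent, not the patch that eventually got merged:

struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	struct inode *inode;

again:
	/*
	 * Lockless attempt first: with a false last argument the
	 * vfs.inode.rcu variant of find_inode_fast() is expected to walk
	 * the hash chain without taking inode_hash_lock.
	 */
	inode = find_inode_fast(sb, head, ino, false);
	if (IS_ERR_OR_NULL(inode)) {
		/* Fall back to the classic locked lookup. */
		spin_lock(&inode_hash_lock);
		inode = find_inode_fast(sb, head, ino, true);
		spin_unlock(&inode_hash_lock);
	}

	if (inode) {
		if (IS_ERR(inode))
			return NULL;
		wait_on_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
	}
	return inode;
}

As Mateusz notes in point 2, even with this in place find_inode_fast() still takes the found inode's i_lock, so a single hot inode can simply move the contention there.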
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 13:04 ` Mateusz Guzik @ 2024-07-15 5:22 ` Bharata B Rao 2024-07-15 6:48 ` Mateusz Guzik 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-15 5:22 UTC (permalink / raw) To: Mateusz Guzik Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On 10-Jul-24 6:34 PM, Mateusz Guzik wrote: >>> However the contention now has shifted to inode_hash_lock. Around 55 >>> softlockups in ilookup() were observed: >>> >>> # tracer: preemptirqsoff >>> # >>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru >>> # -------------------------------------------------------------------- >>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 >>> #P:512) >>> # ----------------- >>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) >>> # ----------------- >>> # => started at: ilookup >>> # => ended at: ilookup >>> # >>> # >>> # _------=> CPU# >>> # / _-----=> irqs-off/BH-disabled >>> # | / _----=> need-resched >>> # || / _---=> hardirq/softirq >>> # ||| / _--=> preempt-depth >>> # |||| / _-=> migrate-disable >>> # ||||| / delay >>> # cmd pid |||||| time | caller >>> # \ / |||||| \ | / >>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup >>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup >>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup >>> fio-3244715 260.N.1. 10620440us : <stack trace> >>> => _raw_spin_unlock >>> => ilookup >>> => blkdev_get_no_open >>> => blkdev_open >>> => do_dentry_open >>> => vfs_open >>> => path_openat >>> => do_filp_open >>> => do_sys_openat2 >>> => __x64_sys_openat >>> => x64_sys_call >>> => do_syscall_64 >>> => entry_SYSCALL_64_after_hwframe >>> >>> It appears that scalability issues with inode_hash_lock has been brought >>> up multiple times in the past and there were patches to address the same. >>> >>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ >>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ >>> >>> CC'ing FS folks/list for awareness/comments. >> >> Note my patch does not enable RCU usage in ilookup, but this can be >> trivially added. >> >> I can't even compile-test at the moment, but the diff below should do >> it. Also note the patches are present here >> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu >> , not yet integrated anywhere. >> >> That said, if fio you are operating on the same target inode every >> time then this is merely going to shift contention to the inode >> spinlock usage in find_inode_fast. >> >> diff --git a/fs/inode.c b/fs/inode.c >> index ad7844ca92f9..70b0e6383341 100644 >> --- a/fs/inode.c >> +++ b/fs/inode.c >> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, >> unsigned long ino) >> { >> struct hlist_head *head = inode_hashtable + hash(sb, ino); >> struct inode *inode; >> + >> again: >> - spin_lock(&inode_hash_lock); >> - inode = find_inode_fast(sb, head, ino, true); >> - spin_unlock(&inode_hash_lock); >> + inode = find_inode_fast(sb, head, ino, false); >> + if (IS_ERR_OR_NULL_PTR(inode)) { >> + spin_lock(&inode_hash_lock); >> + inode = find_inode_fast(sb, head, ino, true); >> + spin_unlock(&inode_hash_lock); >> + } >> >> if (inode) { >> if (IS_ERR(inode)) >> > > I think I expressed myself poorly, so here is take two: > 1. 
inode hash soft lookup should get resolved if you apply > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 > and the above pasted fix (not compile tested tho, but it should be > obvious what the intended fix looks like) > 2. find_inode_hash spinlocks the target inode. if your bench only > operates on one, then contention is going to shift there and you may > still be getting soft lockups. not taking the spinlock in this > codepath is hackable, but I don't want to do it without a good > justification. Thanks Mateusz for the fix. With this patch applied, the above mentioned contention in ilookup() has not been observed for a test run during the weekend. Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-15 5:22 ` Bharata B Rao @ 2024-07-15 6:48 ` Mateusz Guzik 0 siblings, 0 replies; 37+ messages in thread From: Mateusz Guzik @ 2024-07-15 6:48 UTC (permalink / raw) To: Bharata B Rao Cc: Yu Zhao, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Mon, Jul 15, 2024 at 7:22 AM Bharata B Rao <bharata@amd.com> wrote: > > On 10-Jul-24 6:34 PM, Mateusz Guzik wrote: > >>> However the contention now has shifted to inode_hash_lock. Around 55 > >>> softlockups in ilookup() were observed: > >>> > >>> # tracer: preemptirqsoff > >>> # > >>> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > >>> # -------------------------------------------------------------------- > >>> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > >>> #P:512) > >>> # ----------------- > >>> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > >>> # ----------------- > >>> # => started at: ilookup > >>> # => ended at: ilookup > >>> # > >>> # > >>> # _------=> CPU# > >>> # / _-----=> irqs-off/BH-disabled > >>> # | / _----=> need-resched > >>> # || / _---=> hardirq/softirq > >>> # ||| / _--=> preempt-depth > >>> # |||| / _-=> migrate-disable > >>> # ||||| / delay > >>> # cmd pid |||||| time | caller > >>> # \ / |||||| \ | / > >>> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup > >>> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > >>> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > >>> fio-3244715 260.N.1. 10620440us : <stack trace> > >>> => _raw_spin_unlock > >>> => ilookup > >>> => blkdev_get_no_open > >>> => blkdev_open > >>> => do_dentry_open > >>> => vfs_open > >>> => path_openat > >>> => do_filp_open > >>> => do_sys_openat2 > >>> => __x64_sys_openat > >>> => x64_sys_call > >>> => do_syscall_64 > >>> => entry_SYSCALL_64_after_hwframe > >>> > >>> It appears that scalability issues with inode_hash_lock has been brought > >>> up multiple times in the past and there were patches to address the same. > >>> > >>> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > >>> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > >>> > >>> CC'ing FS folks/list for awareness/comments. > >> > >> Note my patch does not enable RCU usage in ilookup, but this can be > >> trivially added. > >> > >> I can't even compile-test at the moment, but the diff below should do > >> it. Also note the patches are present here > >> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu > >> , not yet integrated anywhere. > >> > >> That said, if fio you are operating on the same target inode every > >> time then this is merely going to shift contention to the inode > >> spinlock usage in find_inode_fast. 
> >> > >> diff --git a/fs/inode.c b/fs/inode.c > >> index ad7844ca92f9..70b0e6383341 100644 > >> --- a/fs/inode.c > >> +++ b/fs/inode.c > >> @@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, > >> unsigned long ino) > >> { > >> struct hlist_head *head = inode_hashtable + hash(sb, ino); > >> struct inode *inode; > >> + > >> again: > >> - spin_lock(&inode_hash_lock); > >> - inode = find_inode_fast(sb, head, ino, true); > >> - spin_unlock(&inode_hash_lock); > >> + inode = find_inode_fast(sb, head, ino, false); > >> + if (IS_ERR_OR_NULL_PTR(inode)) { > >> + spin_lock(&inode_hash_lock); > >> + inode = find_inode_fast(sb, head, ino, true); > >> + spin_unlock(&inode_hash_lock); > >> + } > >> > >> if (inode) { > >> if (IS_ERR(inode)) > >> > > > > I think I expressed myself poorly, so here is take two: > > 1. inode hash soft lookup should get resolved if you apply > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.inode.rcu&id=7180f8d91fcbf252de572d9ffacc945effed0060 > > and the above pasted fix (not compile tested tho, but it should be > > obvious what the intended fix looks like) > > 2. find_inode_hash spinlocks the target inode. if your bench only > > operates on one, then contention is going to shift there and you may > > still be getting soft lockups. not taking the spinlock in this > > codepath is hackable, but I don't want to do it without a good > > justification. > > Thanks Mateusz for the fix. With this patch applied, the above mentioned > contention in ilookup() has not been observed for a test run during the > weekend. > Ok, I'll do some clean ups and send a proper patch to the vfs folks later today. Thanks for testing. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-10 12:03 ` Bharata B Rao 2024-07-10 12:24 ` Mateusz Guzik @ 2024-07-10 18:04 ` Yu Zhao 1 sibling, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-10 18:04 UTC (permalink / raw) To: Bharata B Rao Cc: mjguzik, david, kent.overstreet, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, vbabka, kinseyho, Mel Gorman, linux-fsdevel On Wed, Jul 10, 2024 at 6:04 AM Bharata B Rao <bharata@amd.com> wrote: > > On 07-Jul-24 4:12 AM, Yu Zhao wrote: > >> Some experiments tried > >> ====================== > >> 1) When MGLRU was enabled many soft lockups were observed, no hard > >> lockups were seen for 48 hours run. Below is once such soft lockup. > <snip> > >> Below preemptirqsoff trace points to preemption being disabled for more > >> than 10s and the lock in picture is lruvec spinlock. > > > > Also if you could try the other patch (mglru.patch) please. It should > > help reduce unnecessary rotations from deactivate_file_folio(), which > > in turn should reduce the contention on the LRU lock for MGLRU. > > Thanks. With mglru.patch on a MGLRU-enabled system, the below latency > trace record is no longer seen for a 30hr workload run. Glad to hear. Will post a patch and add you as reported/tested-by. > > > >> # tracer: preemptirqsoff > >> # > >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > >> # -------------------------------------------------------------------- > >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > >> HP:0 #P:512) > >> # ----------------- > >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > >> # ----------------- > >> # => started at: deactivate_file_folio > >> # => ended at: deactivate_file_folio > >> # > >> # > >> # _------=> CPU# > >> # / _-----=> irqs-off/BH-disabled > >> # | / _----=> need-resched > >> # || / _---=> hardirq/softirq > >> # ||| / _--=> preempt-depth > >> # |||| / _-=> migrate-disable > >> # ||||| / delay > >> # cmd pid |||||| time | caller > >> # \ / |||||| \ | / > >> fio-2701523 128...1. 0us$: deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on > >> <-deactivate_file_folio > >> fio-2701523 128.N.1. 10382691us : <stack trace> > >> => deactivate_file_folio > >> => mapping_try_invalidate > >> => invalidate_mapping_pages > >> => invalidate_bdev > >> => blkdev_common_ioctl > >> => blkdev_ioctl > >> => __x64_sys_ioctl > >> => x64_sys_call > >> => do_syscall_64 > >> => entry_SYSCALL_64_after_hwframe > > However the contention now has shifted to inode_hash_lock. Around 55 > softlockups in ilookup() were observed: This one is from fs/blk, so I'll leave it to those experts. > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru > # -------------------------------------------------------------------- > # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 > #P:512) > # ----------------- > # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: ilookup > # => ended at: ilookup > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-3244715 260...1. 
0us$: _raw_spin_lock <-ilookup > fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup > fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup > fio-3244715 260.N.1. 10620440us : <stack trace> > => _raw_spin_unlock > => ilookup > => blkdev_get_no_open > => blkdev_open > => do_dentry_open > => vfs_open > => path_openat > => do_filp_open > => do_sys_openat2 > => __x64_sys_openat > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > It appears that scalability issues with inode_hash_lock has been brought > up multiple times in the past and there were patches to address the same. > > https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/ > https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/ > > CC'ing FS folks/list for awareness/comments. > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-03 15:11 Hard and soft lockups with FIO and LTP runs on a large system Bharata B Rao 2024-07-06 22:42 ` Yu Zhao @ 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao ` (2 more replies) 1 sibling, 3 replies; 37+ messages in thread From: Vlastimil Babka @ 2024-07-17 9:42 UTC (permalink / raw) To: Bharata B Rao, linux-mm Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On 7/3/24 5:11 PM, Bharata B Rao wrote: > Many soft and hard lockups are seen with upstream kernel when running a > bunch of tests that include FIO and LTP filesystem test on 10 NVME > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > this was reported on a large customer VM instance with passthrough NVME > disks on older kernels(v5.4 based). However, similar problems were > reproduced when running the tests on bare metal with latest upstream > kernel (v6.10-rc3). Other lockups with different signatures are seen but > in this report, only those related to MM area are being discussed. > Also note that the subsequent description is related to the lockups in > bare metal upstream (and not VM). > > The general observation is that the problem usually surfaces when the > system free memory goes very low and page cache/buffer consumption hits > the ceiling. Most of the times the two contended locks are lruvec and > inode->i_lock spinlocks. > > - Could this be a scalability issue in LRU list handling and/or page > cache invalidation typical to a large system configuration? Seems to me it could be (except that ZONE_DMA corner case) a general scalability issue in that you tweak some part of the kernel and the contention moves elsewhere. At least in MM we have per-node locks so this means 256 CPUs per lock? It used to be that there were not that many (cores/threads) per a physical CPU and its NUMA node, so many cpus would mean also more NUMA nodes where the locks contention would distribute among them. I think you could try fakenuma to create these nodes artificially and see if it helps for the MM part. But if the contention moves to e.g. an inode lock, I'm not sure what to do about that then. > - Are there any MM/FS tunables that could help here? > > Hardware configuration > ====================== > Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads) > Memory: 1.5 TB > 10 NVME - 3.5TB each > available: 2 nodes (0-1) > node 0 cpus: 0-127,256-383 > node 0 size: 773727 MB > node 1 cpus: 128-255,384-511 > node 1 size: 773966 MB > > Workload details > ================ > Workload includes concurrent runs of FIO and a few FS tests from LTP. > > FIO is run with a size of 1TB on each NVME partition with different > combinations of ioengine/blocksize/mode parameters and buffered-IO. > Selected FS tests from LTP are run on 256GB partitions of all NVME > disks. This is the typical NVME partition layout. > > nvme2n1 259:4 0 3.5T 0 disk > ├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1 > └─nvme2n1p2 259:7 0 3.2T 0 part > > Though many different runs exist in the workload, the combination that > results in the problem is buffered-IO run with sync engine. 
> > fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ > -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \ > -numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest > > Watchdog threshold was reduced to 5s to reproduce the problem early and > all CPU backtrace enabled. > > Problem details and analysis > ============================ > One of the hard lockups which was observed and analyzed in detail is this: > > kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284 > kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <NMI> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_hardlockup_check+0x19e/0x360 > <SNIP> > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: </NMI> > kernel: <TASK> > kernel: ? __pfx_lru_add_fn+0x10/0x10 > kernel: _raw_spin_lock_irqsave+0x42/0x50 > kernel: folio_lruvec_lock_irqsave+0x62/0xb0 > kernel: folio_batch_move_lru+0x79/0x2a0 > kernel: folio_add_lru+0x6d/0xf0 > kernel: filemap_add_folio+0xba/0xe0 > kernel: __filemap_get_folio+0x137/0x2e0 > kernel: ext4_da_write_begin+0x12c/0x270 > kernel: generic_perform_write+0xbf/0x200 > kernel: ext4_buffered_write_iter+0x67/0xf0 > kernel: ext4_file_write_iter+0x70/0x780 > kernel: vfs_write+0x301/0x420 > kernel: ksys_write+0x67/0xf0 > kernel: __x64_sys_write+0x19/0x20 > kernel: x64_sys_call+0x1689/0x20d0 > kernel: do_syscall_64+0x6b/0x110 > kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP: > 0033:0x7fe21c314887 > > With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock > acquisition. We measured the lruvec spinlock start, end and hold > time(htime) using sched_clock(), along with a BUG() if the hold time was > more than 10s. The below case shows that lruvec spin lock was held for ~25s. > > kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime > 27963324369895, htime 25889317166 > kernel: ------------[ cut here ]------------ > kernel: kernel BUG at include/linux/memcontrol.h:1677! > kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W > 6.10.0-rc3-qspindbg #10 > kernel: RIP: 0010:shrink_active_list+0x40a/0x520 > > And the corresponding trace point for the above: > kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: > classzone=0 order=0 nr_requested=1 nr_scanned=156946361 > nr_skipped=156946360 nr_taken=1 lru=active_file > > This shows that isolate_lru_folios() is scanning through a huge number > (~150million) of folios (order=0) with lruvec spinlock held. This is > happening because a large number of folios are being skipped to isolate > a few ZONE_DMA folios. Though the number of folios to be scanned is > bounded (32), there exists a genuine case where this can become > unbounded, i.e. in case where folios are skipped. > > Meminfo output shows that the free memory is around ~2% and page/buffer > cache grows very high when the lockup happens. > > MemTotal: 1584835956 kB > MemFree: 27805664 kB > MemAvailable: 1568099004 kB > Buffers: 1386120792 kB > Cached: 151894528 kB > SwapCached: 30620 kB > Active: 1043678892 kB > Inactive: 494456452 kB > > Often times, the perf output at the time of the problem shows heavy > contention on lruvec spin lock. 
Similar contention is also observed with > inode i_lock (in clear_shadow_entry path) > > 98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > | > --98.96%--native_queued_spin_lock_slowpath > | > --98.96%--_raw_spin_lock_irqsave > folio_lruvec_lock_irqsave > | > --98.78%--folio_batch_move_lru > | > --98.63%--deactivate_file_folio > mapping_try_invalidate > invalidate_mapping_pages > invalidate_bdev > blkdev_common_ioctl > blkdev_ioctl > __x64_sys_ioctl > x64_sys_call > do_syscall_64 > entry_SYSCALL_64_after_hwframe > > Some experiments tried > ====================== > 1) When MGLRU was enabled many soft lockups were observed, no hard > lockups were seen for 48 hours run. Below is once such soft lockup. > > kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] > kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L > 6.10.0-rc3-mglru-irqstrc #24 > kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: Call Trace: > kernel: <IRQ> > kernel: ? show_regs+0x69/0x80 > kernel: ? watchdog_timer_fn+0x223/0x2b0 > kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 > <SNIP> > kernel: </IRQ> > kernel: <TASK> > kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 > kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 > kernel: _raw_spin_lock+0x38/0x50 > kernel: clear_shadow_entry+0x3d/0x100 > kernel: ? __pfx_workingset_update_node+0x10/0x10 > kernel: mapping_try_invalidate+0x117/0x1d0 > kernel: invalidate_mapping_pages+0x10/0x20 > kernel: invalidate_bdev+0x3c/0x50 > kernel: blkdev_common_ioctl+0x5f7/0xa90 > kernel: blkdev_ioctl+0x109/0x270 > kernel: x64_sys_call+0x1215/0x20d0 > kernel: do_syscall_64+0x7e/0x130 > > This happens to be contending on inode i_lock spinlock. > > Below preemptirqsoff trace points to preemption being disabled for more > than 10s and the lock in picture is lruvec spinlock. > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc > # -------------------------------------------------------------------- > # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 > HP:0 #P:512) > # ----------------- > # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0) > # ----------------- > # => started at: deactivate_file_folio > # => ended at: deactivate_file_folio > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > fio-2701523 128...1. 0us$: deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382681us : deactivate_file_folio > <-deactivate_file_folio > fio-2701523 128.N.1. 10382683us : tracer_preempt_on > <-deactivate_file_folio > fio-2701523 128.N.1. 10382691us : <stack trace> > => deactivate_file_folio > => mapping_try_invalidate > => invalidate_mapping_pages > => invalidate_bdev > => blkdev_common_ioctl > => blkdev_ioctl > => __x64_sys_ioctl > => x64_sys_call > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > 2) Increased low_watermark_threshold to 10% to prevent system from > entering into extremely low memory situation. Although hard lockups > weren't seen, but soft lockups (clear_shadow_entry()) were still seen. > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > the system. 
This was done to check if having more number of kswapd > threads working on lesser number of folios per node would make a > difference. However here too, multiple soft lockups were seen (in > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > Any insights/suggestion into these lockups and suggestions are welcome! > > Regards, > Bharata. > ^ permalink raw reply [flat|nested] 37+ messages in thread
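A note on the per-node locks Vlastimil mentions above: the lruvec lock seen in the backtraces is taken per (memcg, NUMA node) pair, so with only two nodes each lock is shared by roughly 256 cores / 512 hardware threads on this machine. A simplified sketch of the locking helper from the hard-lockup stack (approximate 6.10-era code, debug hooks omitted):

struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
					 unsigned long *flags)
{
	/*
	 * folio_lruvec() resolves to the LRU vector of the folio's memcg
	 * on the folio's node; all LRU list manipulation for folios of
	 * that memcg on that node serializes on this one spinlock.
	 */
	struct lruvec *lruvec = folio_lruvec(folio);

	spin_lock_irqsave(&lruvec->lru_lock, *flags);
	return lruvec;
}

Booting with something like numa=fake=8 (requires CONFIG_NUMA_EMU) creates additional fake nodes and hence additional lruvecs, which is one way to try the fakenuma suggestion without changing the BIOS NPS setting.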
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka @ 2024-07-17 10:31 ` Bharata B Rao 2024-07-17 16:44 ` Karim Manaouil 2024-07-17 11:29 ` Mateusz Guzik 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-17 10:31 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On 17-Jul-24 3:12 PM, Vlastimil Babka wrote: > On 7/3/24 5:11 PM, Bharata B Rao wrote: >> Many soft and hard lockups are seen with upstream kernel when running a >> bunch of tests that include FIO and LTP filesystem test on 10 NVME >> disks. The lockups can appear anywhere between 2 to 48 hours. Originally >> this was reported on a large customer VM instance with passthrough NVME >> disks on older kernels(v5.4 based). However, similar problems were >> reproduced when running the tests on bare metal with latest upstream >> kernel (v6.10-rc3). Other lockups with different signatures are seen but >> in this report, only those related to MM area are being discussed. >> Also note that the subsequent description is related to the lockups in >> bare metal upstream (and not VM). >> >> The general observation is that the problem usually surfaces when the >> system free memory goes very low and page cache/buffer consumption hits >> the ceiling. Most of the times the two contended locks are lruvec and >> inode->i_lock spinlocks. >> >> - Could this be a scalability issue in LRU list handling and/or page >> cache invalidation typical to a large system configuration? > > Seems to me it could be (except that ZONE_DMA corner case) a general > scalability issue in that you tweak some part of the kernel and the > contention moves elsewhere. At least in MM we have per-node locks so this > means 256 CPUs per lock? It used to be that there were not that many > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > mean also more NUMA nodes where the locks contention would distribute among > them. I think you could try fakenuma to create these nodes artificially and > see if it helps for the MM part. But if the contention moves to e.g. an > inode lock, I'm not sure what to do about that then. See below... > <SNIP> >> >> 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a >> socket can be further partitioned into smaller NUMA nodes. With NPS=4, >> there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in >> the system. This was done to check if having more number of kswapd >> threads working on lesser number of folios per node would make a >> difference. However here too, multiple soft lockups were seen (in >> clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. These are some softlockups seen with NPS4 mode. watchdog: BUG: soft lockup - CPU#315 stuck for 11s! 
[kworker/315:1H:5153] CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted 6.10.0-rc3-enbprftw #12 Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:handle_softirqs+0x70/0x2f0 Call Trace: <IRQ> __irq_exit_rcu+0x68/0x90 irq_exit_rcu+0x12/0x20 sysvec_apic_timer_interrupt+0x85/0xb0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1f/0x30 RIP: 0010:iommu_dma_map_page+0xca/0x2c0 dma_map_page_attrs+0x20d/0x2a0 nvme_prep_rq.part.0+0x63d/0x940 [nvme] nvme_queue_rq+0x82/0x210 [nvme] blk_mq_dispatch_rq_list+0x289/0x6d0 __blk_mq_sched_dispatch_requests+0x142/0x5f0 blk_mq_sched_dispatch_requests+0x36/0x70 blk_mq_run_work_fn+0x73/0x90 process_one_work+0x185/0x3d0 worker_thread+0x2ce/0x3e0 kthread+0xe5/0x120 ret_from_fork+0x3d/0x60 ret_from_fork_asm+0x1a/0x30 watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820] CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L 6.10.0-rc3-enbprftw #12 RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300 Call Trace: <IRQ> </IRQ> <TASK> _raw_spin_lock+0x2d/0x40 clear_shadow_entry+0x3d/0x100 mapping_try_invalidate+0x11b/0x1e0 invalidate_mapping_pages+0x14/0x20 invalidate_bdev+0x40/0x50 blkdev_common_ioctl+0x5f7/0xa90 blkdev_ioctl+0x10d/0x270 __x64_sys_ioctl+0x99/0xd0 x64_sys_call+0x1219/0x20d0 do_syscall_64+0x51/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fc92fc3ec6b </TASK> The above one (clear_shadow_entry) has since been fixed by Yu Zhao and fix is in mm tree. We had seen a couple of scenarios with zone lock contention from page free and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/ Would you have any insights on these? Regards, Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 10:31 ` Bharata B Rao @ 2024-07-17 16:44 ` Karim Manaouil 0 siblings, 0 replies; 37+ messages in thread From: Karim Manaouil @ 2024-07-17 16:44 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On Wed, Jul 17, 2024 at 04:01:05PM +0530, Bharata B Rao wrote: > On 17-Jul-24 3:12 PM, Vlastimil Babka wrote: > > On 7/3/24 5:11 PM, Bharata B Rao wrote: > > > Many soft and hard lockups are seen with upstream kernel when running a > > > bunch of tests that include FIO and LTP filesystem test on 10 NVME > > > disks. The lockups can appear anywhere between 2 to 48 hours. Originally > > > this was reported on a large customer VM instance with passthrough NVME > > > disks on older kernels(v5.4 based). However, similar problems were > > > reproduced when running the tests on bare metal with latest upstream > > > kernel (v6.10-rc3). Other lockups with different signatures are seen but > > > in this report, only those related to MM area are being discussed. > > > Also note that the subsequent description is related to the lockups in > > > bare metal upstream (and not VM). > > > > > > The general observation is that the problem usually surfaces when the > > > system free memory goes very low and page cache/buffer consumption hits > > > the ceiling. Most of the times the two contended locks are lruvec and > > > inode->i_lock spinlocks. > > > > > > - Could this be a scalability issue in LRU list handling and/or page > > > cache invalidation typical to a large system configuration? > > > > Seems to me it could be (except that ZONE_DMA corner case) a general > > scalability issue in that you tweak some part of the kernel and the > > contention moves elsewhere. At least in MM we have per-node locks so this > > means 256 CPUs per lock? It used to be that there were not that many > > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > > mean also more NUMA nodes where the locks contention would distribute among > > them. I think you could try fakenuma to create these nodes artificially and > > see if it helps for the MM part. But if the contention moves to e.g. an > > inode lock, I'm not sure what to do about that then. > > See below... > > > > <SNIP> > > > > > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a > > > socket can be further partitioned into smaller NUMA nodes. With NPS=4, > > > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in > > > the system. This was done to check if having more number of kswapd > > > threads working on lesser number of folios per node would make a > > > difference. However here too, multiple soft lockups were seen (in > > > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed. > > These are some softlockups seen with NPS4 mode. > > watchdog: BUG: soft lockup - CPU#315 stuck for 11s! 
[kworker/315:1H:5153] > CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted > 6.10.0-rc3-enbprftw #12 > Workqueue: kblockd blk_mq_run_work_fn > RIP: 0010:handle_softirqs+0x70/0x2f0 > Call Trace: > <IRQ> > __irq_exit_rcu+0x68/0x90 > irq_exit_rcu+0x12/0x20 > sysvec_apic_timer_interrupt+0x85/0xb0 > </IRQ> > <TASK> > asm_sysvec_apic_timer_interrupt+0x1f/0x30 > RIP: 0010:iommu_dma_map_page+0xca/0x2c0 > dma_map_page_attrs+0x20d/0x2a0 > nvme_prep_rq.part.0+0x63d/0x940 [nvme] > nvme_queue_rq+0x82/0x210 [nvme] > blk_mq_dispatch_rq_list+0x289/0x6d0 > __blk_mq_sched_dispatch_requests+0x142/0x5f0 > blk_mq_sched_dispatch_requests+0x36/0x70 > blk_mq_run_work_fn+0x73/0x90 > process_one_work+0x185/0x3d0 > worker_thread+0x2ce/0x3e0 > kthread+0xe5/0x120 > ret_from_fork+0x3d/0x60 > ret_from_fork_asm+0x1a/0x30 > > > watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820] > CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L > 6.10.0-rc3-enbprftw #12 > RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300 > Call Trace: > <IRQ> > </IRQ> > <TASK> > _raw_spin_lock+0x2d/0x40 > clear_shadow_entry+0x3d/0x100 > mapping_try_invalidate+0x11b/0x1e0 > invalidate_mapping_pages+0x14/0x20 > invalidate_bdev+0x40/0x50 > blkdev_common_ioctl+0x5f7/0xa90 > blkdev_ioctl+0x10d/0x270 > __x64_sys_ioctl+0x99/0xd0 > x64_sys_call+0x1219/0x20d0 > do_syscall_64+0x51/0x120 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fc92fc3ec6b > </TASK> > > The above one (clear_shadow_entry) has since been fixed by Yu Zhao and fix > is in mm tree. > > We had seen a couple of scenarios with zone lock contention from page free > and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/ > > Would you have any insights on these? Have you tried enabling memory interleaving policy for your workload? Karim PhD Student Edinburgh University ^ permalink raw reply [flat|nested] 37+ messages in thread
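For reference, "memory interleaving policy" here usually means running the workload under MPOL_INTERLEAVE, for example via numactl --interleave=all, so that allocations made in the benchmark's context (including its page cache fills) are spread round-robin across nodes instead of landing on one node first. A small hypothetical wrapper around set_mempolicy(2), not something posted in this thread, illustrates the idea (link with -lnuma):

#include <numaif.h>		/* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long nodemask = 0x3;	/* interleave over nodes 0 and 1 */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	/* The policy is inherited across exec, so launch the workload now. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}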
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao @ 2024-07-17 11:29 ` Mateusz Guzik 2024-07-18 9:00 ` Bharata B Rao 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-17 11:29 UTC (permalink / raw) To: Vlastimil Babka Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 7/3/24 5:11 PM, Bharata B Rao wrote: > > The general observation is that the problem usually surfaces when the > > system free memory goes very low and page cache/buffer consumption hits > > the ceiling. Most of the times the two contended locks are lruvec and > > inode->i_lock spinlocks. > > [snip mm stuff] There are numerous avoidable i_lock acquires (including some only showing up under load), but I don't know if they play any role in this particular test. Collecting all traces would definitely help, locked up or not, for example: bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count(); }' -o traces As for clear_shadow_entry mentioned in the opening mail, the content is: spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_add_lru(mapping->host); spin_unlock(&mapping->host->i_lock); so for all I know it's all about the xarray thing, not the i_lock per se. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 11:29 ` Mateusz Guzik @ 2024-07-18 9:00 ` Bharata B Rao 2024-07-18 12:11 ` Mateusz Guzik 0 siblings, 1 reply; 37+ messages in thread From: Bharata B Rao @ 2024-07-18 9:00 UTC (permalink / raw) To: Mateusz Guzik, Vlastimil Babka Cc: linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 3538 bytes --] On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > On Wed, Jul 17, 2024 at 11:42 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 7/3/24 5:11 PM, Bharata B Rao wrote: >>> The general observation is that the problem usually surfaces when the >>> system free memory goes very low and page cache/buffer consumption hits >>> the ceiling. Most of the times the two contended locks are lruvec and >>> inode->i_lock spinlocks. >>> > [snip mm stuff] > > There are numerous avoidable i_lock acquires (including some only > showing up under load), but I don't know if they play any role in this > particular test. > > Collecting all traces would definitely help, locked up or not, for example: > bpftrace -e 'kprobe:queued_spin_lock_slowpath { @[kstack()] = count(); > }' -o traces Here are the top 3 traces collected while the full list from a 30s collection duration when the workload was running, is attached. @[ native_queued_spin_lock_slowpath+1 __remove_mapping+98 remove_mapping+22 mapping_evict_folio+118 mapping_try_invalidate+214 invalidate_mapping_pages+16 invalidate_bdev+60 blkdev_common_ioctl+1527 blkdev_ioctl+265 __x64_sys_ioctl+149 x64_sys_call+4629 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 1787212 @[ native_queued_spin_lock_slowpath+1 folio_wait_bit_common+205 filemap_get_pages+1543 filemap_read+231 blkdev_read_iter+111 aio_read+242 io_submit_one+546 __x64_sys_io_submit+132 x64_sys_call+6617 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 7922497 @[ native_queued_spin_lock_slowpath+1 clear_shadow_entry+92 mapping_try_invalidate+337 invalidate_mapping_pages+16 invalidate_bdev+60 blkdev_common_ioctl+1527 blkdev_ioctl+265 __x64_sys_ioctl+149 x64_sys_call+4629 do_syscall_64+126 entry_SYSCALL_64_after_hwframe+118 ]: 10357614 > > As for clear_shadow_entry mentioned in the opening mail, the content is: > spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > __clear_shadow_entry(mapping, index, entry); > xa_unlock_irq(&mapping->i_pages); > if (mapping_shrinkable(mapping)) > inode_add_lru(mapping->host); > spin_unlock(&mapping->host->i_lock); > > so for all I know it's all about the xarray thing, not the i_lock per se. The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq and hence concluded it to be i_lock. Re-pasting the clear_shadow_entry softlockup here again: kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L 6.10.0-rc3-mglru-irqstrc #24 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300 kernel: Call Trace: kernel: <IRQ> kernel: ? show_regs+0x69/0x80 kernel: ? watchdog_timer_fn+0x223/0x2b0 kernel: ? __pfx_watchdog_timer_fn+0x10/0x10 <SNIP> kernel: </IRQ> kernel: <TASK> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300 kernel: _raw_spin_lock+0x38/0x50 kernel: clear_shadow_entry+0x3d/0x100 kernel: ? 
__pfx_workingset_update_node+0x10/0x10 kernel: mapping_try_invalidate+0x117/0x1d0 kernel: invalidate_mapping_pages+0x10/0x20 kernel: invalidate_bdev+0x3c/0x50 kernel: blkdev_common_ioctl+0x5f7/0xa90 kernel: blkdev_ioctl+0x109/0x270 kernel: x64_sys_call+0x1215/0x20d0 kernel: do_syscall_64+0x7e/0x130 Regards, Bharata. [-- Attachment #2: traces.gz --] [-- Type: application/x-gzip, Size: 83505 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-18 9:00 ` Bharata B Rao @ 2024-07-18 12:11 ` Mateusz Guzik 2024-07-19 6:16 ` Bharata B Rao 0 siblings, 1 reply; 37+ messages in thread From: Mateusz Guzik @ 2024-07-18 12:11 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > > On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > > As for clear_shadow_entry mentioned in the opening mail, the content is: > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > __clear_shadow_entry(mapping, index, entry); > > xa_unlock_irq(&mapping->i_pages); > > if (mapping_shrinkable(mapping)) > > inode_add_lru(mapping->host); > > spin_unlock(&mapping->host->i_lock); > > > > so for all I know it's all about the xarray thing, not the i_lock per se. > > The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > and hence concluded it to be i_lock. I'm not disputing it was i_lock. I am claiming that the i_pages is taken immediately after and it may be that in your workload this is the thing with the actual contention problem, making i_lock a red herring. I tried to match up offsets to my own kernel binary, but things went haywire. Can you please resolve a bunch of symbols, like this: ./scripts/faddr2line vmlinux clear_shadow_entry+92 and then paste the source code from reported lines? (I presume you are running with some local patches, so opening relevant files in my repo may still give bogus resutls) Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 Most notably in __remove_mapping i_lock is conditional: if (!folio_test_swapcache(folio)) spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); and the disasm of the offset in my case does not match either acquire. For all I know i_lock in this routine is *not* taken and all the queued up __remove_mapping callers increase i_lock -> i_pages wait times in clear_shadow_entry. To my cursory reading i_lock in clear_shadow_entry can be hacked away with some effort, but should this happen the contention is going to shift to i_pages presumably with more soft lockups (except on that lock). I am not convinced messing with it is justified. From looking at other places the i_lock is not a problem in other spots fwiw. All that said even if it is i_lock in both cases *and* someone whacks it, the mm folk should look into what happens when (maybe i_lock ->) i_pages lock is held. To that end perhaps you could provide a flamegraph or output of perf record -a -g, I don't know what's preferred. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-18 12:11 ` Mateusz Guzik @ 2024-07-19 6:16 ` Bharata B Rao 2024-07-19 7:06 ` Yu Zhao 2024-07-19 14:26 ` Mateusz Guzik 0 siblings, 2 replies; 37+ messages in thread From: Bharata B Rao @ 2024-07-19 6:16 UTC (permalink / raw) To: Mateusz Guzik Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 4821 bytes --] On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: >> >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: >>> As for clear_shadow_entry mentioned in the opening mail, the content is: >>> spin_lock(&mapping->host->i_lock); >>> xa_lock_irq(&mapping->i_pages); >>> __clear_shadow_entry(mapping, index, entry); >>> xa_unlock_irq(&mapping->i_pages); >>> if (mapping_shrinkable(mapping)) >>> inode_add_lru(mapping->host); >>> spin_unlock(&mapping->host->i_lock); >>> >>> so for all I know it's all about the xarray thing, not the i_lock per se. >> >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq >> and hence concluded it to be i_lock. > > I'm not disputing it was i_lock. I am claiming that the i_pages is > taken immediately after and it may be that in your workload this is > the thing with the actual contention problem, making i_lock a red > herring. > > I tried to match up offsets to my own kernel binary, but things went haywire. > > Can you please resolve a bunch of symbols, like this: > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > and then paste the source code from reported lines? (I presume you are > running with some local patches, so opening relevant files in my repo > may still give bogus resutls) > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 clear_shadow_entry+92 $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 clear_shadow_entry+92/0x180: spin_lock_irq at include/linux/spinlock.h:376 (inlined by) clear_shadow_entry at mm/truncate.c:51 42 static void clear_shadow_entry(struct address_space *mapping, 43 struct folio_batch *fbatch, pgoff_t *indices) 44 { 45 int i; 46 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) 48 return; 49 50 spin_lock(&mapping->host->i_lock); 51 xa_lock_irq(&mapping->i_pages); __remove_mapping+98 $ ./scripts/faddr2line vmlinux __remove_mapping+98 __remove_mapping+98/0x230: spin_lock_irq at include/linux/spinlock.h:376 (inlined by) __remove_mapping at mm/vmscan.c:695 684 static int __remove_mapping(struct address_space *mapping, struct folio *folio, 685 bool reclaimed, struct mem_cgroup *target_memcg) 686 { 687 int refcount; 688 void *shadow = NULL; 689 690 BUG_ON(!folio_test_locked(folio)); 691 BUG_ON(mapping != folio_mapping(folio)); 692 693 if (!folio_test_swapcache(folio)) 694 spin_lock(&mapping->host->i_lock); 695 xa_lock_irq(&mapping->i_pages); __filemap_add_folio+332 $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 __filemap_add_folio+332/0x480: spin_lock_irq at include/linux/spinlock.h:377 (inlined by) __filemap_add_folio at mm/filemap.c:878 851 noinline int __filemap_add_folio(struct address_space *mapping, 852 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp) 853 { 854 XA_STATE(xas, &mapping->i_pages, index); ... 
874 for (;;) { 875 int order = -1, split_order = 0; 876 void *entry, *old = NULL; 877 878 xas_lock_irq(&xas); 879 xas_for_each_conflict(&xas, entry) { > > Most notably in __remove_mapping i_lock is conditional: > if (!folio_test_swapcache(folio)) > spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > > and the disasm of the offset in my case does not match either acquire. > For all I know i_lock in this routine is *not* taken and all the > queued up __remove_mapping callers increase i_lock -> i_pages wait > times in clear_shadow_entry. So the first two are on i_pages lock and the last one is xa_lock. > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > with some effort, but should this happen the contention is going to > shift to i_pages presumably with more soft lockups (except on that > lock). I am not convinced messing with it is justified. From looking > at other places the i_lock is not a problem in other spots fwiw. > > All that said even if it is i_lock in both cases *and* someone whacks > it, the mm folk should look into what happens when (maybe i_lock ->) > i_pages lock is held. To that end perhaps you could provide a > flamegraph or output of perf record -a -g, I don't know what's > preferred. I have attached the flamegraph but this is for the kernel that has been running with all the accumulated fixes so far. The original one (w/o fixes) did show considerable time spent on native_queued_spin_lock_slowpath but unfortunately unable to locate it now. Regards, Bharata. [-- Attachment #2: perf 1.svg --] [-- Type: image/svg+xml, Size: 1215900 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 6:16 ` Bharata B Rao @ 2024-07-19 7:06 ` Yu Zhao 2024-07-19 14:26 ` Mateusz Guzik 1 sibling, 0 replies; 37+ messages in thread From: Yu Zhao @ 2024-07-19 7:06 UTC (permalink / raw) To: Bharata B Rao Cc: Mateusz Guzik, Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 12:16 AM Bharata B Rao <bharata@amd.com> wrote: > > On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > >>> As for clear_shadow_entry mentioned in the opening mail, the content is: > >>> spin_lock(&mapping->host->i_lock); > >>> xa_lock_irq(&mapping->i_pages); > >>> __clear_shadow_entry(mapping, index, entry); > >>> xa_unlock_irq(&mapping->i_pages); > >>> if (mapping_shrinkable(mapping)) > >>> inode_add_lru(mapping->host); > >>> spin_unlock(&mapping->host->i_lock); > >>> > >>> so for all I know it's all about the xarray thing, not the i_lock per se. > >> > >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > >> and hence concluded it to be i_lock. > > > > I'm not disputing it was i_lock. I am claiming that the i_pages is > > taken immediately after and it may be that in your workload this is > > the thing with the actual contention problem, making i_lock a red > > herring. > > > > I tried to match up offsets to my own kernel binary, but things went haywire. > > > > Can you please resolve a bunch of symbols, like this: > > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > > > and then paste the source code from reported lines? (I presume you are > > running with some local patches, so opening relevant files in my repo > > may still give bogus resutls) > > > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 > > clear_shadow_entry+92 > > $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 > clear_shadow_entry+92/0x180: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) clear_shadow_entry at mm/truncate.c:51 > > 42 static void clear_shadow_entry(struct address_space *mapping, > 43 struct folio_batch *fbatch, pgoff_t > *indices) > 44 { > 45 int i; > 46 > 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) > 48 return; > 49 > 50 spin_lock(&mapping->host->i_lock); > 51 xa_lock_irq(&mapping->i_pages); > > > __remove_mapping+98 > > $ ./scripts/faddr2line vmlinux __remove_mapping+98 > __remove_mapping+98/0x230: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) __remove_mapping at mm/vmscan.c:695 > > 684 static int __remove_mapping(struct address_space *mapping, struct > folio *folio, > 685 bool reclaimed, struct mem_cgroup > *target_memcg) > 686 { > 687 int refcount; > 688 void *shadow = NULL; > 689 > 690 BUG_ON(!folio_test_locked(folio)); > 691 BUG_ON(mapping != folio_mapping(folio)); > 692 > 693 if (!folio_test_swapcache(folio)) > 694 spin_lock(&mapping->host->i_lock); > 695 xa_lock_irq(&mapping->i_pages); > > > __filemap_add_folio+332 > > $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 > __filemap_add_folio+332/0x480: > spin_lock_irq at include/linux/spinlock.h:377 > (inlined by) __filemap_add_folio at mm/filemap.c:878 > > 851 noinline int __filemap_add_folio(struct address_space *mapping, > 852 struct folio *folio, pgoff_t index, gfp_t gfp, void > **shadowp) > 853 { > 854 XA_STATE(xas, &mapping->i_pages, index); > ... 
> 874 for (;;) { > 875 int order = -1, split_order = 0; > 876 void *entry, *old = NULL; > 877 > 878 xas_lock_irq(&xas); > 879 xas_for_each_conflict(&xas, entry) { > > > > > Most notably in __remove_mapping i_lock is conditional: > > if (!folio_test_swapcache(folio)) > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > > > and the disasm of the offset in my case does not match either acquire. > > For all I know i_lock in this routine is *not* taken and all the > > queued up __remove_mapping callers increase i_lock -> i_pages wait > > times in clear_shadow_entry. > > So the first two are on i_pages lock and the last one is xa_lock. Isn't xa_lock also i_pages->xa_lock, i.e., the same lock? > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > > with some effort, but should this happen the contention is going to > > shift to i_pages presumably with more soft lockups (except on that > > lock). I am not convinced messing with it is justified. From looking > > at other places the i_lock is not a problem in other spots fwiw. > > > > All that said even if it is i_lock in both cases *and* someone whacks > > it, the mm folk should look into what happens when (maybe i_lock ->) > > i_pages lock is held. To that end perhaps you could provide a > > flamegraph or output of perf record -a -g, I don't know what's > > preferred. > > I have attached the flamegraph but this is for the kernel that has been > running with all the accumulated fixes so far. The original one (w/o > fixes) did show considerable time spent on > native_queued_spin_lock_slowpath but unfortunately unable to locate it now. > > Regards, > Bharata. ^ permalink raw reply [flat|nested] 37+ messages in thread
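On the question above: yes, they are the same lock. The xas_lock_irq() taken in __filemap_add_folio() and the xa_lock_irq(&mapping->i_pages) taken in clear_shadow_entry() and __remove_mapping() both resolve to the spinlock embedded in the i_pages xarray. Roughly, from include/linux/xarray.h (6.10-era definitions):

#define xa_lock_irq(xa)		spin_lock_irq(&(xa)->xa_lock)
#define xa_unlock_irq(xa)	spin_unlock_irq(&(xa)->xa_lock)

#define xas_lock_irq(xas)	xa_lock_irq((xas)->xa)
#define xas_unlock_irq(xas)	xa_unlock_irq((xas)->xa)

/*
 * XA_STATE(xas, &mapping->i_pages, index) initializes xas.xa to
 * &mapping->i_pages, so xas_lock_irq(&xas) in __filemap_add_folio() takes
 * the same lock as xa_lock_irq(&mapping->i_pages) elsewhere: all three
 * call sites resolved above contend on mapping->i_pages.xa_lock.
 */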
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-19 6:16 ` Bharata B Rao 2024-07-19 7:06 ` Yu Zhao @ 2024-07-19 14:26 ` Mateusz Guzik 1 sibling, 0 replies; 37+ messages in thread From: Mateusz Guzik @ 2024-07-19 14:26 UTC (permalink / raw) To: Bharata B Rao Cc: Vlastimil Babka, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman On Fri, Jul 19, 2024 at 8:16 AM Bharata B Rao <bharata@amd.com> wrote: > > On 18-Jul-24 5:41 PM, Mateusz Guzik wrote: > > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@amd.com> wrote: > >> > >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote: > >>> As for clear_shadow_entry mentioned in the opening mail, the content is: > >>> spin_lock(&mapping->host->i_lock); > >>> xa_lock_irq(&mapping->i_pages); > >>> __clear_shadow_entry(mapping, index, entry); > >>> xa_unlock_irq(&mapping->i_pages); > >>> if (mapping_shrinkable(mapping)) > >>> inode_add_lru(mapping->host); > >>> spin_unlock(&mapping->host->i_lock); > >>> > >>> so for all I know it's all about the xarray thing, not the i_lock per se. > >> > >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq > >> and hence concluded it to be i_lock. > > > > I'm not disputing it was i_lock. I am claiming that the i_pages is > > taken immediately after and it may be that in your workload this is > > the thing with the actual contention problem, making i_lock a red > > herring. > > > > I tried to match up offsets to my own kernel binary, but things went haywire. > > > > Can you please resolve a bunch of symbols, like this: > > ./scripts/faddr2line vmlinux clear_shadow_entry+92 > > > > and then paste the source code from reported lines? (I presume you are > > running with some local patches, so opening relevant files in my repo > > may still give bogus resutls) > > > > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332 > > clear_shadow_entry+92 > > $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 > clear_shadow_entry+92/0x180: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) clear_shadow_entry at mm/truncate.c:51 > > 42 static void clear_shadow_entry(struct address_space *mapping, > 43 struct folio_batch *fbatch, pgoff_t > *indices) > 44 { > 45 int i; > 46 > 47 if (shmem_mapping(mapping) || dax_mapping(mapping)) > 48 return; > 49 > 50 spin_lock(&mapping->host->i_lock); > 51 xa_lock_irq(&mapping->i_pages); > > > __remove_mapping+98 > > $ ./scripts/faddr2line vmlinux __remove_mapping+98 > __remove_mapping+98/0x230: > spin_lock_irq at include/linux/spinlock.h:376 > (inlined by) __remove_mapping at mm/vmscan.c:695 > > 684 static int __remove_mapping(struct address_space *mapping, struct > folio *folio, > 685 bool reclaimed, struct mem_cgroup > *target_memcg) > 686 { > 687 int refcount; > 688 void *shadow = NULL; > 689 > 690 BUG_ON(!folio_test_locked(folio)); > 691 BUG_ON(mapping != folio_mapping(folio)); > 692 > 693 if (!folio_test_swapcache(folio)) > 694 spin_lock(&mapping->host->i_lock); > 695 xa_lock_irq(&mapping->i_pages); > > > __filemap_add_folio+332 > > $ ./scripts/faddr2line vmlinux __filemap_add_folio+332 > __filemap_add_folio+332/0x480: > spin_lock_irq at include/linux/spinlock.h:377 > (inlined by) __filemap_add_folio at mm/filemap.c:878 > > 851 noinline int __filemap_add_folio(struct address_space *mapping, > 852 struct folio *folio, pgoff_t index, gfp_t gfp, void > **shadowp) > 853 { > 854 XA_STATE(xas, &mapping->i_pages, index); > ... 
> 874 for (;;) { > 875 int order = -1, split_order = 0; > 876 void *entry, *old = NULL; > 877 > 878 xas_lock_irq(&xas); > 879 xas_for_each_conflict(&xas, entry) { > > > > > Most notably in __remove_mapping i_lock is conditional: > > if (!folio_test_swapcache(folio)) > > spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > > > and the disasm of the offset in my case does not match either acquire. > > For all I know i_lock in this routine is *not* taken and all the > > queued up __remove_mapping callers increase i_lock -> i_pages wait > > times in clear_shadow_entry. > > So the first two are on i_pages lock and the last one is xa_lock. > bottom line though messing with i_lock removal is not justified afaics > > > > To my cursory reading i_lock in clear_shadow_entry can be hacked away > > with some effort, but should this happen the contention is going to > > shift to i_pages presumably with more soft lockups (except on that > > lock). I am not convinced messing with it is justified. From looking > > at other places the i_lock is not a problem in other spots fwiw. > > > > All that said even if it is i_lock in both cases *and* someone whacks > > it, the mm folk should look into what happens when (maybe i_lock ->) > > i_pages lock is held. To that end perhaps you could provide a > > flamegraph or output of perf record -a -g, I don't know what's > > preferred. > > I have attached the flamegraph but this is for the kernel that has been > running with all the accumulated fixes so far. The original one (w/o > fixes) did show considerable time spent on > native_queued_spin_lock_slowpath but unfortunately unable to locate it now. > So I think the problems at this point are all mm, so I'm kicking the ball to that side. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Hard and soft lockups with FIO and LTP runs on a large system 2024-07-17 9:42 ` Vlastimil Babka 2024-07-17 10:31 ` Bharata B Rao 2024-07-17 11:29 ` Mateusz Guzik @ 2024-07-17 16:34 ` Karim Manaouil 2 siblings, 0 replies; 37+ messages in thread From: Karim Manaouil @ 2024-07-17 16:34 UTC (permalink / raw) To: Vlastimil Babka Cc: Bharata B Rao, linux-mm, linux-kernel, nikunj, Upadhyay, Neeraj, Andrew Morton, David Hildenbrand, willy, yuzhao, kinseyho, Mel Gorman, Mateusz Guzik On Wed, Jul 17, 2024 at 11:42:31AM +0200, Vlastimil Babka wrote: > Seems to me it could be (except that ZONE_DMA corner case) a general > scalability issue in that you tweak some part of the kernel and the > contention moves elsewhere. At least in MM we have per-node locks so this > means 256 CPUs per lock? It used to be that there were not that many > (cores/threads) per a physical CPU and its NUMA node, so many cpus would > mean also more NUMA nodes where the locks contention would distribute among > them. I think you could try fakenuma to create these nodes artificially and > see if it helps for the MM part. But if the contention moves to e.g. an > inode lock, I'm not sure what to do about that then. AMD EPYC BIOSes have an option called NPS (Nodes Per Socket) that can be set to 1, 2, 4 or 8 and that divides the system up into the chosen number of NUMA nodes. Karim PhD Student Edinburgh University ^ permalink raw reply [flat|nested] 37+ messages in thread