From: Karim Manaouil <kmanaouil.dev@gmail.com>
To: Bharata B Rao <bharata@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, nikunj@amd.com,
	"Upadhyay, Neeraj" <Neeraj.Upadhyay@amd.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	willy@infradead.org, yuzhao@google.com, kinseyho@google.com,
	Mel Gorman <mgorman@suse.de>, Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
Date: Wed, 17 Jul 2024 17:44:18 +0100
Message-ID: <Zpf04pRIaEyxl5fo@ed.ac.uk>
In-Reply-To: <ca9b925f-4f14-4749-8f28-83fd21f8ce6a@amd.com>

On Wed, Jul 17, 2024 at 04:01:05PM +0530, Bharata B Rao wrote:
> On 17-Jul-24 3:12 PM, Vlastimil Babka wrote:
> > On 7/3/24 5:11 PM, Bharata B Rao wrote:
> > > Many soft and hard lockups are seen with upstream kernel when running a
> > > bunch of tests that include FIO and LTP filesystem test on 10 NVME
> > > disks. The lockups can appear anywhere between 2 to 48 hours. Originally
> > > this was reported on a large customer VM instance with passthrough NVME
> > > disks on older kernels(v5.4 based). However, similar problems were
> > > reproduced when running the tests on bare metal with latest upstream
> > > kernel (v6.10-rc3). Other lockups with different signatures are seen but
> > > in this report, only those related to MM area are being discussed.
> > > Also note that the subsequent description is related to the lockups in
> > > bare metal upstream (and not VM).
> > >
> > > The general observation is that the problem usually surfaces when the
> > > system free memory goes very low and page cache/buffer consumption hits
> > > the ceiling. Most of the times the two contended locks are lruvec and
> > > inode->i_lock spinlocks.
> > >
> > > - Could this be a scalability issue in LRU list handling and/or page
> > > cache invalidation typical to a large system configuration?
> >
> > Seems to me it could be (except that ZONE_DMA corner case) a general
> > scalability issue in that you tweak some part of the kernel and the
> > contention moves elsewhere. At least in MM we have per-node locks so this
> > means 256 CPUs per lock? It used to be that there were not that many
> > (cores/threads) per a physical CPU and its NUMA node, so many cpus would
> > mean also more NUMA nodes where the locks contention would distribute among
> > them. I think you could try fakenuma to create these nodes artificially and
> > see if it helps for the MM part. But if the contention moves to e.g. an
> > inode lock, I'm not sure what to do about that then.
>
> See below...
>
> >
> <SNIP>
> > >
> > > 3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
> > > socket can be further partitioned into smaller NUMA nodes. With NPS=4,
> > > there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
> > > the system. This was done to check if having more number of kswapd
> > > threads working on lesser number of folios per node would make a
> > > difference. However here too, multiple soft lockups were seen (in
> > > clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
>
> These are some softlockups seen with NPS4 mode.
>
> watchdog: BUG: soft lockup - CPU#315 stuck for 11s! [kworker/315:1H:5153]
> CPU: 315 PID: 5153 Comm: kworker/315:1H Kdump: loaded Not tainted
> 6.10.0-rc3-enbprftw #12
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:handle_softirqs+0x70/0x2f0
> Call Trace:
> <IRQ>
> __irq_exit_rcu+0x68/0x90
> irq_exit_rcu+0x12/0x20
> sysvec_apic_timer_interrupt+0x85/0xb0
> </IRQ>
> <TASK>
> asm_sysvec_apic_timer_interrupt+0x1f/0x30
> RIP: 0010:iommu_dma_map_page+0xca/0x2c0
> dma_map_page_attrs+0x20d/0x2a0
> nvme_prep_rq.part.0+0x63d/0x940 [nvme]
> nvme_queue_rq+0x82/0x210 [nvme]
> blk_mq_dispatch_rq_list+0x289/0x6d0
> __blk_mq_sched_dispatch_requests+0x142/0x5f0
> blk_mq_sched_dispatch_requests+0x36/0x70
> blk_mq_run_work_fn+0x73/0x90
> process_one_work+0x185/0x3d0
> worker_thread+0x2ce/0x3e0
> kthread+0xe5/0x120
> ret_from_fork+0x3d/0x60
> ret_from_fork_asm+0x1a/0x30
>
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [fio:19820]
> CPU: 0 PID: 19820 Comm: fio Kdump: loaded Tainted: G L
> 6.10.0-rc3-enbprftw #12
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b8/0x300
> Call Trace:
> <IRQ>
> </IRQ>
> <TASK>
> _raw_spin_lock+0x2d/0x40
> clear_shadow_entry+0x3d/0x100
> mapping_try_invalidate+0x11b/0x1e0
> invalidate_mapping_pages+0x14/0x20
> invalidate_bdev+0x40/0x50
> blkdev_common_ioctl+0x5f7/0xa90
> blkdev_ioctl+0x10d/0x270
> __x64_sys_ioctl+0x99/0xd0
> x64_sys_call+0x1219/0x20d0
> do_syscall_64+0x51/0x120
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fc92fc3ec6b
> </TASK>
>
> The above one (clear_shadow_entry) has since been fixed by Yu Zhao and fix
> is in mm tree.
>
> We had seen a couple of scenarios with zone lock contention from page free
> and slab free code paths, as reported here: https://lore.kernel.org/linux-mm/b68e43d4-91f2-4481-80a9-d166c0a43584@amd.com/
>
> Would you have any insights on these?

Have you tried enabling a memory interleaving policy for your workload?
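For concreteness, here is a minimal sketch of setting an interleave policy from inside a process, assuming a Linux system with NUMA support. It calls set_mempolicy(2) through syscall(2) so that it builds without libnuma. The helper names build_nodemask() and interleave_across() are made up for illustration, and the caller is assumed to know the node count (8 in the NPS=4 setup quoted above). Whether this relieves the lock contention discussed in this thread is untested.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value from the kernel UAPI header <linux/mempolicy.h>; defined here so the
 * sketch builds without libnuma's <numaif.h>. */
#ifndef MPOL_INTERLEAVE
#define MPOL_INTERLEAVE 3
#endif

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: set bits 0..nr_nodes-1 in mask.
 * Returns 0 on success, -1 if nr_nodes is 0 or does not fit in the mask. */
int build_nodemask(unsigned long *mask, size_t words, unsigned int nr_nodes)
{
	assert(mask != NULL);
	if (nr_nodes == 0 || nr_nodes > words * BITS_PER_WORD)
		return -1;
	memset(mask, 0, words * sizeof(*mask));
	for (unsigned int n = 0; n < nr_nodes; n++)
		mask[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
	return 0;
}

/* Hypothetical helper: ask the kernel to interleave this task's future
 * allocations across nodes 0..nr_nodes-1.
 * Returns 0 on success, -1 on failure (errno set if the syscall failed). */
int interleave_across(unsigned int nr_nodes)
{
	unsigned long mask[16];
	size_t words = sizeof(mask) / sizeof(mask[0]);

	if (build_nodemask(mask, words, nr_nodes) != 0)
		return -1;
	/* The kernel decrements maxnode before using it, so pass one more
	 * than the number of bits in the mask (libnuma does the same). */
	return (int)syscall(SYS_set_mempolicy, MPOL_INTERLEAVE, mask,
			    (unsigned long)(words * BITS_PER_WORD + 1));
}
```

A process would call interleave_across(8) early, before it starts allocating. For an unmodified fio run, the same policy can be applied from the command line with `numactl --interleave=all fio <jobfile>`.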

Karim
PhD Student
Edinburgh University