From: Yu Zhao <yuzhao@google.com>
To: Bharata B Rao <bharata@amd.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, nikunj@amd.com,
	 "Upadhyay, Neeraj" <Neeraj.Upadhyay@amd.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@redhat.com>,
	willy@infradead.org, vbabka@suse.cz, kinseyho@google.com,
	 Mel Gorman <mgorman@suse.de>,
	mjguzik@gmail.com
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
Date: Fri, 19 Jul 2024 14:21:17 -0600
Message-ID: <CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com>
In-Reply-To: <893a263a-0038-4b4b-9031-72567b966f73@amd.com>

On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao <bharata@amd.com> wrote:
>
> On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> > On 09-Jul-24 11:28 AM, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@amd.com> wrote:
> >>>
> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>>
> >>>>> Hi Yu Zhao,
> >>>>>
> >>>>> Thanks for your patches. See below...
> >>>>>
> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>>>>> Hi Bharata,
> >>>>>>
> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@amd.com> wrote:
> >>>>>>>
> >>>>> <snip>
> >>>>>>>
> >>>>>>> Some experiments tried
> >>>>>>> ======================
> >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, but no hard
> >>>>>>> lockups were seen during a 48-hour run. Below is one such soft lockup.
> >>>>>>
> >>>>>> This is not really an MGLRU issue -- can you please try one of the
> >>>>>> attached patches? It (truncate.patch) should help with or without
> >>>>>> MGLRU.
> >>>>>
> >>>>> With truncate.patch and default LRU scheme, a few hard lockups are
> >>>>> seen.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> In your original report, you said:
> >>>>
> >>>>     Most of the times the two contended locks are lruvec and
> >>>>     inode->i_lock spinlocks.
> >>>>     ...
> >>>>     Often times, the perf output at the time of the problem shows
> >>>>     heavy contention on lruvec spin lock. Similar contention is
> >>>>     also observed with inode i_lock (in clear_shadow_entry path)
> >>>>
> >>>> Based on this new report, does it mean the i_lock is not as contended,
> >>>> for the same path (truncation) you tested? If so, I'll post
> >>>> truncate.patch and add reported-by and tested-by you, unless you have
> >>>> objections.
> >>>
> >>> truncate.patch has been tested on two systems with default LRU scheme
> >>> and the lockup due to inode->i_lock hasn't been seen yet after 24
> >>> hours run.
> >>
> >> Thanks.
> >>
> >>>>
> >>>> The two paths below were contended on the LRU lock, but they already
> >>>> batch their operations. So I don't know what else we can do surgically
> >>>> to improve them.
> >>>
> >>> What has been seen with this workload is that the lruvec spinlock is
> >>> held for a long time from shrink_[active/inactive]_list path. In this
> >>> path, there is a case in isolate_lru_folios() where scanning of LRU
> >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
> >>> scanning/skipping of more than 150 million folios was seen. There is
> >>> already a comment in there which explains why nr_skipped shouldn't be
> >>> counted, but is there any possibility of re-looking at this condition?
> >>
> >> For this specific case, probably this can help:
> >>
> >> @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> >>                  if (folio_zonenum(folio) > sc->reclaim_idx ||
> >>                                  skip_cma(folio, sc)) {
> >>                          nr_skipped[folio_zonenum(folio)] += nr_pages;
> >> -                       move_to = &folios_skipped;
> >> -                       goto move;
> >> +                       list_move(&folio->lru, &folios_skipped);
> >> +                       if (spin_is_contended(&lruvec->lru_lock)) {
> >> +                               if (!list_empty(dst))
> >> +                                       break;
> >> +                               spin_unlock_irq(&lruvec->lru_lock);
> >> +                               cond_resched();
> >> +                               spin_lock_irq(&lruvec->lru_lock);
> >> +                       }
> >> +                       continue;
> >>                  }
> >
> > Thanks, this helped. With this fix, the test ran for 24hrs without any
> > lockups attributable to lruvec spinlock. As noted in this thread,
> > earlier isolate_lru_folios() used to scan millions of folios and spend a
> > lot of time with spinlock held but after this fix, such a scenario is no
> > longer seen.
>
> However during the weekend mglru-enabled run (with above fix to
> isolate_lru_folios() and also the previous two patches: truncate.patch
> and mglru.patch and the inode fix provided by Mateusz), another hard
> lockup related to lruvec spinlock was observed.
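
For anyone following along, here is a rough userspace sketch of the
pattern used in the isolate_lru_folios() change quoted above: while
skipping ineligible entries, check whether anyone is waiting on the
lock; if so, either stop early with what has been isolated or drop the
lock briefly and resume. Everything named toy_* is hypothetical and not
kernel code. The waiters counter is only an approximation of
spin_is_contended(), and a pthread mutex stands in for the lruvec
spinlock. Unlike the kernel, where the caller holds lru_lock, this toy
takes the lock itself to stay self-contained.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for a lruvec: one lock plus a count of would-be lockers. */
struct toy_lru {
	pthread_mutex_t lock;
	atomic_int waiters;
};

struct toy_result {
	size_t taken;      /* entries isolated (zone <= reclaim_idx) */
	size_t skipped;    /* entries skipped (zone > reclaim_idx) */
	size_t lock_drops; /* times the lock was released mid-scan */
};

/* Take the lock, advertising ourselves as a waiter while we block. */
static void toy_lru_lock(struct toy_lru *lru)
{
	atomic_fetch_add(&lru->waiters, 1);
	pthread_mutex_lock(&lru->lock);
	atomic_fetch_sub(&lru->waiters, 1);
}

/* Assumed analogue of spin_is_contended(): does anyone want the lock? */
static int toy_lru_contended(struct toy_lru *lru)
{
	return atomic_load(&lru->waiters) > 0;
}

/*
 * Scan zones[0..n). The array is never modified here, so resuming at the
 * same index after a lock drop is safe; a real LRU list can change while
 * the lock is released.
 */
static struct toy_result toy_isolate(struct toy_lru *lru, const int *zones,
				     size_t n, int reclaim_idx)
{
	struct toy_result r = {0, 0, 0};

	toy_lru_lock(lru);
	for (size_t i = 0; i < n; i++) {
		if (zones[i] <= reclaim_idx) {
			r.taken++;
			continue;
		}
		r.skipped++;
		if (!toy_lru_contended(lru))
			continue;
		if (r.taken > 0)
			break; /* return what we have instead of scanning on */
		/* Nothing isolated yet: let the waiters in, then resume. */
		pthread_mutex_unlock(&lru->lock);
		r.lock_drops++;
		sched_yield();
		toy_lru_lock(lru);
	}
	pthread_mutex_unlock(&lru->lock);
	return r;
}
```

Calling toy_isolate() with no waiters scans the whole array in one lock
hold. With waiters present, each skipped entry becomes a point where
the lock can change hands, which bounds the hold time per acquisition
but not the total amount of scanning.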

Thanks again for the stress tests.

I can't come up with any reasonable band-aid at this moment, i.e.,
something not too ugly to work around a more fundamental scalability
problem.

Before I give up: what type of dirty data was written back to the nvme
device? Was it page cache or swap?

> Here is the hard lockup:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 466
> CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> Call Trace:
>    <NMI>
>    ? show_regs+0x69/0x80
>    ? watchdog_hardlockup_check+0x1b4/0x3a0
> <SNIP>
>    ? native_queued_spin_lock_slowpath+0x2b4/0x300
>    </NMI>
>    <IRQ>
>    _raw_spin_lock_irqsave+0x5b/0x70
>    folio_lruvec_lock_irqsave+0x62/0x90
>    folio_batch_move_lru+0x9d/0x160
>    folio_rotate_reclaimable+0xab/0xf0
>    folio_end_writeback+0x60/0x90
>    end_buffer_async_write+0xaa/0xe0
>    end_bio_bh_io_sync+0x2c/0x50
>    bio_endio+0x108/0x180
>    blk_mq_end_request_batch+0x11f/0x5e0
>    nvme_pci_complete_batch+0xb5/0xd0 [nvme]
>    nvme_irq+0x92/0xe0 [nvme]
>    __handle_irq_event_percpu+0x6e/0x1e0
>    handle_irq_event+0x39/0x80
>    handle_edge_irq+0x8c/0x240
>    __common_interrupt+0x4e/0xf0
>    common_interrupt+0x49/0xc0
>    asm_common_interrupt+0x27/0x40
>
> Here are the lock holder details captured by all-cpu-backtrace:
>
> NMI backtrace for cpu 75
> CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:folio_inc_gen+0x142/0x430
> Call Trace:
>    <NMI>
>    ? show_regs+0x69/0x80
>    ? nmi_cpu_backtrace+0xc5/0x130
>    ? nmi_cpu_backtrace_handler+0x11/0x20
>    ? nmi_handle+0x64/0x180
>    ? default_do_nmi+0x45/0x130
>    ? exc_nmi+0x128/0x1a0
>    ? end_repeat_nmi+0xf/0x53
>    ? folio_inc_gen+0x142/0x430
>    ? folio_inc_gen+0x142/0x430
>    ? folio_inc_gen+0x142/0x430
>    </NMI>
>    <TASK>
>    isolate_folios+0x954/0x1630
>    evict_folios+0xa5/0x8c0
>    try_to_shrink_lruvec+0x1be/0x320
>    shrink_one+0x10f/0x1d0
>    shrink_node+0xa4c/0xc90
>    do_try_to_free_pages+0xc0/0x590
>    try_to_free_pages+0xde/0x210
>    __alloc_pages_noprof+0x6ae/0x12c0
>    alloc_pages_mpol_noprof+0xd9/0x220
>    folio_alloc_noprof+0x63/0xe0
>    filemap_alloc_folio_noprof+0xf4/0x100
>    page_cache_ra_unbounded+0xb9/0x1a0
>    page_cache_ra_order+0x26e/0x310
>    ondemand_readahead+0x1a3/0x360
>    page_cache_sync_ra+0x83/0x90
>    filemap_get_pages+0xf0/0x6a0
>    filemap_read+0xe7/0x3d0
>    blkdev_read_iter+0x6f/0x140
>    vfs_read+0x25b/0x340
>    ksys_read+0x67/0xf0
>    __x64_sys_read+0x19/0x20
>    x64_sys_call+0x1771/0x20d0
>    do_syscall_64+0x7e/0x130
>
> Regards,
> Bharata.


