From: Yu Zhao
Date: Fri, 19 Jul 2024 14:21:17 -0600
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
To: Bharata B Rao
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, nikunj@amd.com,
    "Upadhyay, Neeraj", Andrew Morton, David Hildenbrand, willy@infradead.org,
    vbabka@suse.cz, kinseyho@google.com, Mel Gorman, mjguzik@gmail.com
In-Reply-To: <893a263a-0038-4b4b-9031-72567b966f73@amd.com>

On Sun, Jul 14, 2024 at 11:20 PM Bharata B Rao wrote:
>
> On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> > On 09-Jul-24 11:28 AM, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao wrote:
> >>>
> >>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
> >>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao wrote:
> >>>>>
> >>>>> Hi Yu Zhao,
> >>>>>
> >>>>> Thanks for your patches. See below...
> >>>>>
> >>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >>>>>> Hi Bharata,
> >>>>>>
> >>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao wrote:
> >>>>>>>
> >>>>>
> >>>>>>>
> >>>>>>> Some experiments tried
> >>>>>>> ======================
> >>>>>>> 1) When MGLRU was enabled many soft lockups were observed, no hard
> >>>>>>> lockups were seen for 48 hours run. Below is one such soft lockup.
> >>>>>>
> >>>>>> This is not really an MGLRU issue -- can you please try one of the
> >>>>>> attached patches? It (truncate.patch) should help with or without
> >>>>>> MGLRU.
> >>>>>
> >>>>> With truncate.patch and default LRU scheme, a few hard lockups are
> >>>>> seen.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> In your original report, you said:
> >>>>
> >>>>   Most of the times the two contended locks are lruvec and
> >>>>   inode->i_lock spinlocks.
> >>>>   ...
> >>>>   Often times, the perf output at the time of the problem shows
> >>>>   heavy contention on lruvec spin lock. Similar contention is
> >>>>   also observed with inode i_lock (in clear_shadow_entry path)
> >>>>
> >>>> Based on this new report, does it mean the i_lock is not as contended
> >>>> for the same path (truncation) you tested? If so, I'll post
> >>>> truncate.patch and add reported-by and tested-by you, unless you have
> >>>> objections.
> >>>
> >>> truncate.patch has been tested on two systems with default LRU scheme
> >>> and the lockup due to inode->i_lock hasn't been seen yet after 24
> >>> hours run.
> >>
> >> Thanks.
> >>
> >>>>
> >>>> The two paths below were contended on the LRU lock, but they already
> >>>> batch their operations. So I don't know what else we can do surgically
> >>>> to improve them.
> >>>
> >>> What has been seen with this workload is that the lruvec spinlock is
> >>> held for a long time from shrink_[active/inactive]_list path. In this
> >>> path, there is a case in isolate_lru_folios() where scanning of LRU
> >>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
> >>> scanning/skipping of more than 150 million folios were seen. There is
> >>> already a comment in there which explains why nr_skipped shouldn't be
> >>> counted, but is there any possibility of re-looking at this condition?
> >>
> >> For this specific case, probably this can help:
> >>
> >> @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> >>                 if (folio_zonenum(folio) > sc->reclaim_idx ||
> >>                                 skip_cma(folio, sc)) {
> >>                         nr_skipped[folio_zonenum(folio)] += nr_pages;
> >> -                       move_to = &folios_skipped;
> >> -                       goto move;
> >> +                       list_move(&folio->lru, &folios_skipped);
> >> +                       if (spin_is_contended(&lruvec->lru_lock)) {
> >> +                               if (!list_empty(dst))
> >> +                                       break;
> >> +                               spin_unlock_irq(&lruvec->lru_lock);
> >> +                               cond_resched();
> >> +                               spin_lock_irq(&lruvec->lru_lock);
> >> +                       }
> >> +                       continue;
> >>                 }
> >
> > Thanks, this helped. With this fix, the test ran for 24hrs without any
> > lockups attributable to lruvec spinlock. As noted in this thread,
> > earlier isolate_lru_folios() used to scan millions of folios and spend a
> > lot of time with spinlock held but after this fix, such a scenario is no
> > longer seen.
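
(Spelling that hunk out as a self-contained sketch, in case it helps
review -- this is illustrative only: the helper name and its bool return
convention are invented here, and the real change of course stays inline
in the isolate_lru_folios() scan loop.)

#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/mmzone.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/*
 * Called with lruvec->lru_lock held (irqs off), as in
 * isolate_lru_folios(). Returns false when the caller should stop
 * scanning and work with whatever it has already isolated on @dst.
 */
static bool skip_folio_and_maybe_yield(struct lruvec *lruvec,
				       struct folio *folio,
				       struct list_head *folios_skipped,
				       struct list_head *dst)
{
	/* Park the skipped folio so it is not rescanned in this pass. */
	list_move(&folio->lru, folios_skipped);

	if (!spin_is_contended(&lruvec->lru_lock))
		return true;		/* nobody is waiting; keep scanning */

	/* Someone is spinning on lru_lock: stop early if we made progress. */
	if (!list_empty(dst))
		return false;

	/* Nothing isolated yet: briefly drop the lock so the waiter runs. */
	spin_unlock_irq(&lruvec->lru_lock);
	cond_resched();
	spin_lock_irq(&lruvec->lru_lock);
	return true;
}

The intent is just to bound how long the scan holds lru_lock: stop early
once something has been isolated, otherwise drop the lock briefly so
whoever is spinning on it can make progress.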

> However during the weekend mglru-enabled run (with above fix to
> isolate_lru_folios() and also the previous two patches: truncate.patch
> and mglru.patch and the inode fix provided by Mateusz), another hard
> lockup related to lruvec spinlock was observed.

Thanks again for the stress tests. I can't come up with any reasonable
band-aid at this moment, i.e., something not too ugly to work around a
more fundamental scalability problem.

Before I give up: what type of dirty data was written back to the nvme
device? Was it page cache or swap?

> Here is the hard lockup:
>
> watchdog: Watchdog detected hard LOCKUP on cpu 466
> CPU: 466 PID: 3103929 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
> Call Trace:
>
>  ? show_regs+0x69/0x80
>  ? watchdog_hardlockup_check+0x1b4/0x3a0
>
>  ? native_queued_spin_lock_slowpath+0x2b4/0x300
>
>
>  _raw_spin_lock_irqsave+0x5b/0x70
>  folio_lruvec_lock_irqsave+0x62/0x90
>  folio_batch_move_lru+0x9d/0x160
>  folio_rotate_reclaimable+0xab/0xf0
>  folio_end_writeback+0x60/0x90
>  end_buffer_async_write+0xaa/0xe0
>  end_bio_bh_io_sync+0x2c/0x50
>  bio_endio+0x108/0x180
>  blk_mq_end_request_batch+0x11f/0x5e0
>  nvme_pci_complete_batch+0xb5/0xd0 [nvme]
>  nvme_irq+0x92/0xe0 [nvme]
>  __handle_irq_event_percpu+0x6e/0x1e0
>  handle_irq_event+0x39/0x80
>  handle_edge_irq+0x8c/0x240
>  __common_interrupt+0x4e/0xf0
>  common_interrupt+0x49/0xc0
>  asm_common_interrupt+0x27/0x40
>
> Here are the lock holder details captured by all-cpu-backtrace:
>
> NMI backtrace for cpu 75
> CPU: 75 PID: 3095650 Comm: fio Not tainted
> 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
> RIP: 0010:folio_inc_gen+0x142/0x430
> Call Trace:
>
>  ? show_regs+0x69/0x80
>  ? nmi_cpu_backtrace+0xc5/0x130
>  ? nmi_cpu_backtrace_handler+0x11/0x20
>  ? nmi_handle+0x64/0x180
>  ? default_do_nmi+0x45/0x130
>  ? exc_nmi+0x128/0x1a0
>  ? end_repeat_nmi+0xf/0x53
>  ? folio_inc_gen+0x142/0x430
>  ? folio_inc_gen+0x142/0x430
>  ? folio_inc_gen+0x142/0x430
>
>
>  isolate_folios+0x954/0x1630
>  evict_folios+0xa5/0x8c0
>  try_to_shrink_lruvec+0x1be/0x320
>  shrink_one+0x10f/0x1d0
>  shrink_node+0xa4c/0xc90
>  do_try_to_free_pages+0xc0/0x590
>  try_to_free_pages+0xde/0x210
>  __alloc_pages_noprof+0x6ae/0x12c0
>  alloc_pages_mpol_noprof+0xd9/0x220
>  folio_alloc_noprof+0x63/0xe0
>  filemap_alloc_folio_noprof+0xf4/0x100
>  page_cache_ra_unbounded+0xb9/0x1a0
>  page_cache_ra_order+0x26e/0x310
>  ondemand_readahead+0x1a3/0x360
>  page_cache_sync_ra+0x83/0x90
>  filemap_get_pages+0xf0/0x6a0
>  filemap_read+0xe7/0x3d0
>  blkdev_read_iter+0x6f/0x140
>  vfs_read+0x25b/0x340
>  ksys_read+0x67/0xf0
>  __x64_sys_read+0x19/0x20
>  x64_sys_call+0x1771/0x20d0
>  do_syscall_64+0x7e/0x130
>
> Regards,
> Bharata.
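
(An illustrative aside on the question above, not from the original
exchange: one quick way to tell page cache writeback from swap-out is to
sample /proc/vmstat during the run and compare the pgpgout and pswpout
deltas -- a pswpout delta near zero means the dirty data going to the
nvme device is page cache rather than swap. The sketch below assumes
those counters exist on the test kernel and uses an arbitrary 10-second
window.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read one counter from /proc/vmstat; returns -1 if not found. */
static long long vmstat_read(const char *key)
{
	char line[256];
	size_t len = strlen(key);
	long long val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, len) && line[len] == ' ') {
			val = atoll(line + len + 1);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long long pgpgout0 = vmstat_read("pgpgout");
	long long pswpout0 = vmstat_read("pswpout");

	sleep(10);	/* sample window while fio/LTP are running */

	/*
	 * The two counters use different units, so compare each against its
	 * own baseline; the interesting signal is whether pswpout moves at
	 * all while pgpgout is climbing.
	 */
	printf("pgpgout delta: %lld\n", vmstat_read("pgpgout") - pgpgout0);
	printf("pswpout delta: %lld (pages swapped out)\n",
	       vmstat_read("pswpout") - pswpout0);
	return 0;
}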