Linux cgroups development
* Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
       [not found] ` <aMPIwdleUCUMFPh2@infradead.org>
@ 2025-09-12  7:21   ` Venkat
  0 siblings, 0 replies; 9+ messages in thread
From: Venkat @ 2025-09-12  7:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, riteshh, ojaswin, linux-xfs, LKML,
	Madhavan Srinivasan, linuxppc-dev, Linux Next Mailing List,
	linux-mm, cgroups



> On 12 Sep 2025, at 12:46 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Fri, Sep 12, 2025 at 10:51:18AM +0530, Venkat Rao Bagalkote wrote:
>> Greetings!!!
>> 
>> 
>> IBM CI has reported a kernel crash while running the generic/256 test case
>> from the xfstests suite on a pmem device, on the linux-next20250911 kernel.
> 
> Given that this is in memcg code, you probably want to send this to linux-mm
> and the cgroups list.

Thanks for the advice.

Adding the mm and cgroups mailing lists.

Regards,
Venkat.





* Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
       [not found] <8957c526-d05c-4c0d-bfed-0eb6e6d2476c@linux.ibm.com>
       [not found] ` <aMPIwdleUCUMFPh2@infradead.org>
@ 2025-09-12 12:32 ` Venkat
  2025-09-12 13:16   ` [External] " Julian Sun
  2025-09-13  2:48   ` Julian Sun
  1 sibling, 2 replies; 9+ messages in thread
From: Venkat @ 2025-09-12 12:32 UTC (permalink / raw)
  To: sunjunchao, tj, akpm, stable, songmuchun, shakeelb, hannes,
	roman.gushchin, mhocko
  Cc: linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs, LKML,
	Madhavan Srinivasan, Linux Next Mailing List, cgroups, linux-mm



> On 12 Sep 2025, at 10:51 AM, Venkat Rao Bagalkote <venkat88@linux.ibm.com> wrote:
> 
> Greetings!!!
> 
> 
> IBM CI has reported a kernel crash while running the generic/256 test case from the xfstests suite on a pmem device, on the linux-next20250911 kernel.
> 
> 
> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> 
> local.config:
> 
> [xfs_dax]
> export RECREATE_TEST_DEV=true
> export TEST_DEV=/dev/pmem0
> export TEST_DIR=/mnt/test_pmem
> export SCRATCH_DEV=/dev/pmem0.1
> export SCRATCH_MNT=/mnt/scratch_pmem
> export MKFS_OPTIONS="-m reflink=0 -b size=65536 -s size=512"
> export FSTYP=xfs
> export MOUNT_OPTIONS="-o dax"
> 
> 
> Test case: generic/256
> 
> 
> Traces:
> 
> 
> [  163.371929] ------------[ cut here ]------------
> [  163.371936] kernel BUG at lib/list_debug.c:29!
> [  163.371946] Oops: Exception in kernel mode, sig: 5 [#1]
> [  163.371954] LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
> [  163.371965] Modules linked in: xfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack bonding tls nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto dax_pmem fuse ext4 crc16 mbcache jbd2 nd_pmem papr_scm sd_mod libnvdimm sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt
> [  163.372127] CPU: 22 UID: 0 PID: 130 Comm: kworker/22:0 Kdump: loaded Not tainted 6.17.0-rc5-next-20250911 #1 VOLUNTARY
> [  163.372142] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries
> [  163.372155] Workqueue: cgroup_free css_free_rwork_fn
> [  163.372169] NIP:  c000000000d051d4 LR: c000000000d051d0 CTR: 0000000000000000
> [  163.372176] REGS: c00000000ba079b0 TRAP: 0700   Not tainted (6.17.0-rc5-next-20250911)
> [  163.372183] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000000  XER: 00000006
> [  163.372214] CFAR: c0000000002bae9c IRQMASK: 0
> [  163.372214] GPR00: c000000000d051d0 c00000000ba07c50 c00000000230a600 0000000000000075
> [  163.372214] GPR04: 0000000000000004 0000000000000001 c000000000507e2c 0000000000000001
> [  163.372214] GPR08: c000000d0cb87d13 0000000000000000 0000000000000000 a80e000000000000
> [  163.372214] GPR12: c00e0001a1970fa2 c000000d0ddec700 c000000000208e58 c000000107b5e190
> [  163.372214] GPR16: c00000000d3e5d08 c00000000b71cf78 c00000000d3e5d05 c00000000b71cf30
> [  163.372214] GPR20: c00000000b71cf08 c00000000b71cf10 c000000019f58588 c000000004704bc8
> [  163.372214] GPR24: c000000107b5e100 c000000004704bd0 0000000000000003 c000000004704bd0
> [  163.372214] GPR28: c000000004704bc8 c000000019f585a8 c000000019f53da8 c000000004704bc8
> [  163.372315] NIP [c000000000d051d4] __list_add_valid_or_report+0x124/0x188
> [  163.372326] LR [c000000000d051d0] __list_add_valid_or_report+0x120/0x188
> [  163.372335] Call Trace:
> [  163.372339] [c00000000ba07c50] [c000000000d051d0] __list_add_valid_or_report+0x120/0x188 (unreliable)
> [  163.372352] [c00000000ba07ce0] [c000000000834280] mem_cgroup_css_free+0xa0/0x27c
> [  163.372363] [c00000000ba07d50] [c0000000003ba198] css_free_rwork_fn+0xd0/0x59c
> [  163.372374] [c00000000ba07da0] [c0000000001f5d60] process_one_work+0x41c/0x89c
> [  163.372385] [c00000000ba07eb0] [c0000000001f76c0] worker_thread+0x558/0x848
> [  163.372394] [c00000000ba07f80] [c000000000209038] kthread+0x1e8/0x230
> [  163.372406] [c00000000ba07fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> [  163.372416] Code: 4b9b1099 60000000 7f63db78 4bae8245 60000000 e8bf0008 3c62ff88 7fe6fb78 7fc4f378 38637d40 4b5b5c89 60000000 <0fe00000> 60000000 60000000 7f83e378
> [  163.372453] ---[ end trace 0000000000000000 ]---
> [  163.380581] pstore: backend (nvram) writing error (-1)
> [  163.380593]
> 
> 
> If you happen to fix this issue, please add the tag below.
> 
> 
> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> 
> 
> 
> Regards,
> 
> Venkat.
> 
> 

After reverting the commit below, the issue is not seen.

commit 61bbf51e75df1a94cf6736e311cb96aeb79826a8
Author: Julian Sun <sunjunchao@bytedance.com>
Date:   Thu Aug 28 04:45:57 2025 +0800

    memcg: don't wait writeback completion when release memcg

    Recently, we encountered the following hung task:

    INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
    [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
    [Wed Jul 30 17:47:45 2025] Call Trace:
    [Wed Jul 30 17:47:45 2025]  __schedule+0x934/0xe10
    [Wed Jul 30 17:47:45 2025]  ? complete+0x3b/0x50
    [Wed Jul 30 17:47:45 2025]  ? _cond_resched+0x15/0x30
    [Wed Jul 30 17:47:45 2025]  schedule+0x40/0xb0
    [Wed Jul 30 17:47:45 2025]  wb_wait_for_completion+0x52/0x80
    [Wed Jul 30 17:47:45 2025]  ? finish_wait+0x80/0x80
    [Wed Jul 30 17:47:45 2025]  mem_cgroup_css_free+0x22/0x1b0
    [Wed Jul 30 17:47:45 2025]  css_free_rwork_fn+0x42/0x380
    [Wed Jul 30 17:47:45 2025]  process_one_work+0x1a2/0x360
    [Wed Jul 30 17:47:45 2025]  worker_thread+0x30/0x390
    [Wed Jul 30 17:47:45 2025]  ? create_worker+0x1a0/0x1a0
    [Wed Jul 30 17:47:45 2025]  kthread+0x110/0x130
    [Wed Jul 30 17:47:45 2025]  ? __kthread_cancel_work+0x40/0x40
    [Wed Jul 30 17:47:45 2025]  ret_from_fork+0x1f/0x30

    The direct cause is that memcg spends a long time waiting for dirty page
    writeback of foreign memcgs during release.

    The root causes are:

    a. The wb may have multiple writeback tasks, containing millions
       of dirty pages, as shown below:

    >>> for work in list_for_each_entry("struct wb_writeback_work", \
    ...                                 wb.work_list.address_of_(), "list"):
    ...     print(work.nr_pages, work.reason, hex(work))
    ...
    900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
    1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
    1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
    1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
    1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
    2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
    2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
    3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
    3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
    3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
    3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
    3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
    3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
    3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
    3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
    3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
    3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
    3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
    3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
    3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
    3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
    3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
    3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
    3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00

    b. The writeback might be severely throttled by wbt, with a speed
       possibly less than 100kb/s, leading to a very long writeback time.

    >>> wb.write_bandwidth
    (unsigned long)24
    >>> wb.write_bandwidth
    (unsigned long)13

    The wb_wait_for_completion() here is probably only used to prevent
    use-after-free.  Therefore, we manage 'done' separately and automatically
    free it.

    This allows us to remove wb_wait_for_completion() while preventing the
    use-after-free issue.
    Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
    Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Regards,
Venkat.
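
To make the commit's "manage 'done' separately and automatically free
it" idea concrete, below is a minimal sketch of the self-freeing
completion pattern it describes. This is a hedged reconstruction, not
the actual linux-next code: the struct and callback names (struct
cgwb_frn_wait, memcg_cgwb_waitq_callback_fn) are taken from this
thread, while the bodies are assumptions.

#include <linux/backing-dev-defs.h>     /* struct wb_completion */
#include <linux/slab.h>
#include <linux/wait.h>

struct cgwb_frn_wait {
        struct wb_completion done;      /* done.cnt counts in-flight works,
                                         * done.waitq points at the shared
                                         * frn wait queue */
        struct wait_queue_entry wq_entry;       /* parked on done.waitq */
};

/*
 * Assumed wakeup callback: __wake_up() invokes it under the waitq lock
 * once the last writeback work drops done.cnt. Since
 * wb_wait_for_completion() was removed, no task sleeps on this entry,
 * so the callback only unlinks it and frees the tracking struct.
 */
static int memcg_cgwb_waitq_callback_fn(struct wait_queue_entry *wq_entry,
                                        unsigned int mode, int flags, void *key)
{
        struct cgwb_frn_wait *wait =
                container_of(wq_entry, struct cgwb_frn_wait, wq_entry);

        list_del_init(&wq_entry->entry);
        kfree(wait);
        return 0;
}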




* Re: [External] Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-12 12:32 ` Venkat
@ 2025-09-12 13:16   ` Julian Sun
  2025-09-13  2:48   ` Julian Sun
  1 sibling, 0 replies; 9+ messages in thread
From: Julian Sun @ 2025-09-12 13:16 UTC (permalink / raw)
  To: Venkat
  Cc: tj, akpm, stable, songmuchun, shakeelb, hannes, roman.gushchin,
	mhocko, linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs,
	LKML, Madhavan Srinivasan, Linux Next Mailing List, cgroups,
	linux-mm

Thanks for the report. I will take a look at this issue.

On Fri, Sep 12, 2025 at 8:33 PM Venkat <venkat88@linux.ibm.com> wrote:
>
> [... full quote of the original report and the reverted commit trimmed; see above ...]

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


* Re: [External] Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-12 12:32 ` Venkat
  2025-09-12 13:16   ` [External] " Julian Sun
@ 2025-09-13  2:48   ` Julian Sun
  2025-09-15 14:19     ` Venkat
  1 sibling, 1 reply; 9+ messages in thread
From: Julian Sun @ 2025-09-13  2:48 UTC (permalink / raw)
  To: Venkat
  Cc: tj, akpm, stable, songmuchun, shakeelb, hannes, roman.gushchin,
	mhocko, linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs,
	LKML, Madhavan Srinivasan, Linux Next Mailing List, cgroups,
	linux-mm

Hi,

Does this fix make sense to you?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d0dfaa0ccaba..ed24dcece56a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3945,9 +3945,10 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
                 * Not necessary to wait for wb completion which might cause task hung,
                 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
                 */
-               __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
                if (atomic_dec_and_test(&wait->done.cnt))
-                       wake_up_all(wait->done.waitq);
+                       kfree(wait);
+               else
+                       __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
        }
 #endif
        if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
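
A note on why the reordering matters, as a hedged reading of the BUG
above (inferred from this thread, not verified against the linux-next
source): if the entry is queued unconditionally before the count is
dropped, the tracking struct can be freed while its wq_entry is still
linked into the wait queue, and a later __add_wait_queue_entry_tail()
then walks stale next/prev pointers -- the corruption that the
lib/list_debug.c:29 check in __list_add_valid_or_report() flags in the
trace. Side by side:

        /* Pre-fix ordering (hazardous, per the report): */
        __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
        if (atomic_dec_and_test(&wait->done.cnt))
                wake_up_all(wait->done.waitq);  /* callback may free 'wait'
                                                 * while it is still queued
                                                 * (assumed mechanics) */

        /* Patched ordering ("last one out frees"): only park the entry
         * when outstanding writeback works remain to unlink and free it
         * later. */
        if (atomic_dec_and_test(&wait->done.cnt))
                kfree(wait);            /* last reference: free directly */
        else
                __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);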

On Fri, Sep 12, 2025 at 8:33 PM Venkat <venkat88@linux.ibm.com> wrote:
>
> [... full quote of the original report and the reverted commit trimmed; see above ...]


-- 
Julian Sun <sunjunchao@bytedance.com>


* Re:  [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-13  2:48   ` Julian Sun
@ 2025-09-15 14:19     ` Venkat
  2025-09-15 14:26       ` Alexander Gordeev
  2025-09-15 18:17       ` Julian Sun
  0 siblings, 2 replies; 9+ messages in thread
From: Venkat @ 2025-09-15 14:19 UTC (permalink / raw)
  To: Julian Sun
  Cc: tj, akpm, stable, songmuchun, shakeelb, hannes, roman.gushchin,
	mhocko, linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs,
	LKML, Madhavan Srinivasan, Linux Next Mailing List, cgroups,
	linux-mm



> On 13 Sep 2025, at 8:18 AM, Julian Sun <sunjunchao@bytedance.com> wrote:
>
> [... quoted v1 fix trimmed; see the patch above ...]

Hello,

Thanks for the fix; it resolves the reported issue.

When sending out the patch, please add the tag below as well.

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>

Regards,
Venkat.
> [... remainder of the quoted thread trimmed ...]



* Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-15 14:19     ` Venkat
@ 2025-09-15 14:26       ` Alexander Gordeev
  2025-09-15 18:20         ` [External] " Julian Sun
  2025-09-15 18:17       ` Julian Sun
  1 sibling, 1 reply; 9+ messages in thread
From: Alexander Gordeev @ 2025-09-15 14:26 UTC (permalink / raw)
  To: Venkat
  Cc: Julian Sun, tj, akpm, stable, songmuchun, shakeelb, hannes,
	roman.gushchin, mhocko, linuxppc-dev, riteshh, ojaswin,
	linux-fsdevel, linux-xfs, LKML, Madhavan Srinivasan,
	Linux Next Mailing List, cgroups, linux-mm

On Mon, Sep 15, 2025 at 07:49:26PM +0530, Venkat wrote:
> Hello,
> 
> Thanks for the fix. This is fixing the reported issue.
> 
> While sending out the patch please add below tag as well.
> 
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>

And Reported-by as well, if I may add ;)

> Regards,
> Venkat.

Thanks!


* Re: [External] Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-15 14:19     ` Venkat
  2025-09-15 14:26       ` Alexander Gordeev
@ 2025-09-15 18:17       ` Julian Sun
  2025-09-17  5:58         ` Venkat
  1 sibling, 1 reply; 9+ messages in thread
From: Julian Sun @ 2025-09-15 18:17 UTC (permalink / raw)
  To: Venkat
  Cc: tj, akpm, stable, songmuchun, shakeelb, hannes, roman.gushchin,
	mhocko, linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs,
	LKML, Madhavan Srinivasan, Linux Next Mailing List, cgroups,
	linux-mm

Hi,

On Mon, Sep 15, 2025 at 10:20 PM Venkat <venkat88@linux.ibm.com> wrote:
>
> [... quoted v1 fix and earlier exchange trimmed; see above ...]
>
> Hello,
>
> Thanks for the fix; it resolves the reported issue.

Thanks for your testing and feedback.
>
> When sending out the patch, please add the tag below as well.
>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>

Sure. That's how it should be.

Could you please try again with the following patch? The previous one
might have caused a memory leak and had race conditions. I can’t
reproduce it locally...

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 80257dba30f8..35da16928599 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3940,6 +3940,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
        int __maybe_unused i;

 #ifdef CONFIG_CGROUP_WRITEBACK
+       spin_lock(&memcg_cgwb_frn_waitq.lock);
        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
                struct cgwb_frn_wait *wait = memcg->cgwb_frn[i].wait;

@@ -3948,9 +3949,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
                 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
                 */
                __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
-               if (atomic_dec_and_test(&wait->done.cnt))
-                       wake_up_all(wait->done.waitq);
+               if (atomic_dec_and_test(&wait->done.cnt)) {
+                       list_del(&wait->wq_entry.entry);
+                       kfree(wait);
+               }
        }
+       spin_unlock(&memcg_cgwb_frn_waitq.lock);
 #endif
        if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
                static_branch_dec(&memcg_sockets_enabled_key);
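
For context on the locking in this version: __add_wait_queue_entry_tail()
and a raw list_del() on wq_entry.entry are the caller-locked wait-queue
primitives, and __wake_up() runs queued callbacks while holding the same
wq_head->lock, so taking memcg_cgwb_frn_waitq.lock here serializes the
add and the free against the wakeup path. A minimal sketch of that
invariant, using the stock wait.h API (the helper name queue_frn_entry
is illustrative, not from the patch; the generic add_wait_queue() in
kernel/sched/wait.c does the equivalent with the _irqsave form):

#include <linux/wait.h>

/*
 * The double-underscore helpers do no locking themselves; the caller
 * must hold wq_head->lock, which is also held around wait-queue
 * callbacks invoked from __wake_up().
 */
static void queue_frn_entry(struct wait_queue_head *wq_head,
                            struct wait_queue_entry *wq_entry)
{
        unsigned long flags;

        spin_lock_irqsave(&wq_head->lock, flags);
        __add_wait_queue_entry_tail(wq_head, wq_entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
}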

> [... remainder of the quoted thread trimmed ...]

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


* Re: [External] Re: [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-15 14:26       ` Alexander Gordeev
@ 2025-09-15 18:20         ` Julian Sun
  0 siblings, 0 replies; 9+ messages in thread
From: Julian Sun @ 2025-09-15 18:20 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Venkat, tj, akpm, stable, songmuchun, shakeelb, hannes,
	roman.gushchin, mhocko, linuxppc-dev, riteshh, ojaswin,
	linux-fsdevel, linux-xfs, LKML, Madhavan Srinivasan,
	Linux Next Mailing List, cgroups, linux-mm

Hi,

On Mon, Sep 15, 2025 at 10:26 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> On Mon, Sep 15, 2025 at 07:49:26PM +0530, Venkat wrote:
> > Hello,
> >
> > Thanks for the fix. This is fixing the reported issue.
> >
> > While sending out the patch please add below tag as well.
> >
> > Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
>
> And Reported-by as well, if I may add ;)
>

I'd like to, but I will resend the whole patch, which also addresses
another issue. Thanks a lot for reporting anyway; it's very helpful!
> > Regards,
> > Venkat.
>
> Thanks!


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


* Re:  [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device
  2025-09-15 18:17       ` Julian Sun
@ 2025-09-17  5:58         ` Venkat
  0 siblings, 0 replies; 9+ messages in thread
From: Venkat @ 2025-09-17  5:58 UTC (permalink / raw)
  To: Julian Sun
  Cc: tj, akpm, stable, songmuchun, shakeelb, hannes, roman.gushchin,
	mhocko, linuxppc-dev, riteshh, ojaswin, linux-fsdevel, linux-xfs,
	LKML, Madhavan Srinivasan, Linux Next Mailing List, cgroups,
	linux-mm



> On 15 Sep 2025, at 11:47 PM, Julian Sun <sunjunchao@bytedance.com> wrote:
> 
> Hi,
> 
> [... earlier exchange trimmed; see above ...]
>
> Could you please try again with the following patch? The previous one
> might have caused a memory leak and had race conditions. I can’t
> reproduce it locally...
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 80257dba30f8..35da16928599 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3940,6 +3940,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>        int __maybe_unused i;
> 
> #ifdef CONFIG_CGROUP_WRITEBACK
> +       spin_lock(&memcg_cgwb_frn_waitq.lock);
>        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
>                struct cgwb_frn_wait *wait = memcg->cgwb_frn[i].wait;
> 
> @@ -3948,9 +3949,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>                 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
>                 */
>                __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
> -               if (atomic_dec_and_test(&wait->done.cnt))
> -                       wake_up_all(wait->done.waitq);
> +               if (atomic_dec_and_test(&wait->done.cnt)) {
> +                       list_del(&wait->wq_entry.entry);
> +                       kfree(wait);
> +               }
>        }
> +       spin_unlock(&memcg_cgwb_frn_waitq.lock);
> #endif
>        if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>                static_branch_dec(&memcg_sockets_enabled_key);
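> 
> The waker presumably walks and unlinks entries on this same list, so
> every add/del has to be serialized against it by the waitqueue lock;
> otherwise two CPUs can corrupt the list, which is likely what the
> list_debug splat in the report is catching. A rough userspace sketch
> of the invariant the patch enforces (hypothetical names, a mutex
> standing in for the waitqueue spinlock; not the kernel code):
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> struct entry { struct entry *next, *prev; };
> 
> static pthread_mutex_t waitq_lock = PTHREAD_MUTEX_INITIALIZER;
> static struct entry head = { &head, &head };    /* empty circular list */
> 
> static void add_tail(struct entry *e)
> {
>         e->prev = head.prev;
>         e->next = &head;
>         head.prev->next = e;
>         head.prev = e;
> }
> 
> static void del(struct entry *e)
> {
>         e->prev->next = e->next;
>         e->next->prev = e->prev;
> }
> 
> int main(void)
> {
>         struct entry *e = malloc(sizeof(*e));
> 
>         /*
>          * Release side: link the entry and, on the final reference,
>          * unlink and free it -- all under the same lock the waker
>          * takes before traversing the list, so no one can observe a
>          * freed entry that is still linked.
>          */
>         pthread_mutex_lock(&waitq_lock);
>         add_tail(e);
>         /* ...atomic_dec_and_test() says we hold the final reference... */
>         del(e);
>         free(e);
>         pthread_mutex_unlock(&waitq_lock);
> 
>         printf("list empty: %d\n", head.next == &head);
>         return 0;
> }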
> 

Hello,

I tried this patch on two of my CI nodes, and the tests passed. The reported issue is fixed with this.

Regards,
Venkat.
>> 
>> Regards,
>> Venkat.
>>> 
>>> On Fri, Sep 12, 2025 at 8:33 PM Venkat <venkat88@linux.ibm.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 12 Sep 2025, at 10:51 AM, Venkat Rao Bagalkote <venkat88@linux.ibm.com> wrote:
>>>>> 
>>>>> Greetings!!!
>>>>> 
>>>>> 
>>>>> IBM CI has reported a kernel crash, while running generic/256 test case on pmem device from xfstests suite on linux-next20250911 kernel.
>>>>> 
>>>>> 
>>>>> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
>>>>> 
>>>>> local.config:
>>>>> 
>>>>> [xfs_dax]
>>>>> export RECREATE_TEST_DEV=true
>>>>> export TEST_DEV=/dev/pmem0
>>>>> export TEST_DIR=/mnt/test_pmem
>>>>> export SCRATCH_DEV=/dev/pmem0.1
>>>>> export SCRATCH_MNT=/mnt/scratch_pmem
>>>>> export MKFS_OPTIONS="-m reflink=0 -b size=65536 -s size=512"
>>>>> export FSTYP=xfs
>>>>> export MOUNT_OPTIONS="-o dax"
>>>>> 
>>>>> 
>>>>> Test case: generic/256
>>>>> 
>>>>> 
>>>>> Traces:
>>>>> 
>>>>> 
>>>>> [  163.371929] ------------[ cut here ]------------
>>>>> [  163.371936] kernel BUG at lib/list_debug.c:29!
>>>>> [  163.371946] Oops: Exception in kernel mode, sig: 5 [#1]
>>>>> [  163.371954] LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
>>>>> [  163.371965] Modules linked in: xfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack bonding tls nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto dax_pmem fuse ext4 crc16 mbcache jbd2 nd_pmem papr_scm sd_mod libnvdimm sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt
>>>>> [  163.372127] CPU: 22 UID: 0 PID: 130 Comm: kworker/22:0 Kdump: loaded Not tainted 6.17.0-rc5-next-20250911 #1 VOLUNTARY
>>>>> [  163.372142] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries
>>>>> [  163.372155] Workqueue: cgroup_free css_free_rwork_fn
>>>>> [  163.372169] NIP:  c000000000d051d4 LR: c000000000d051d0 CTR: 0000000000000000
>>>>> [  163.372176] REGS: c00000000ba079b0 TRAP: 0700   Not tainted (6.17.0-rc5-next-20250911)
>>>>> [  163.372183] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000000  XER: 00000006
>>>>> [  163.372214] CFAR: c0000000002bae9c IRQMASK: 0
>>>>> [  163.372214] GPR00: c000000000d051d0 c00000000ba07c50 c00000000230a600 0000000000000075
>>>>> [  163.372214] GPR04: 0000000000000004 0000000000000001 c000000000507e2c 0000000000000001
>>>>> [  163.372214] GPR08: c000000d0cb87d13 0000000000000000 0000000000000000 a80e000000000000
>>>>> [  163.372214] GPR12: c00e0001a1970fa2 c000000d0ddec700 c000000000208e58 c000000107b5e190
>>>>> [  163.372214] GPR16: c00000000d3e5d08 c00000000b71cf78 c00000000d3e5d05 c00000000b71cf30
>>>>> [  163.372214] GPR20: c00000000b71cf08 c00000000b71cf10 c000000019f58588 c000000004704bc8
>>>>> [  163.372214] GPR24: c000000107b5e100 c000000004704bd0 0000000000000003 c000000004704bd0
>>>>> [  163.372214] GPR28: c000000004704bc8 c000000019f585a8 c000000019f53da8 c000000004704bc8
>>>>> [  163.372315] NIP [c000000000d051d4] __list_add_valid_or_report+0x124/0x188
>>>>> [  163.372326] LR [c000000000d051d0] __list_add_valid_or_report+0x120/0x188
>>>>> [  163.372335] Call Trace:
>>>>> [  163.372339] [c00000000ba07c50] [c000000000d051d0] __list_add_valid_or_report+0x120/0x188 (unreliable)
>>>>> [  163.372352] [c00000000ba07ce0] [c000000000834280] mem_cgroup_css_free+0xa0/0x27c
>>>>> [  163.372363] [c00000000ba07d50] [c0000000003ba198] css_free_rwork_fn+0xd0/0x59c
>>>>> [  163.372374] [c00000000ba07da0] [c0000000001f5d60] process_one_work+0x41c/0x89c
>>>>> [  163.372385] [c00000000ba07eb0] [c0000000001f76c0] worker_thread+0x558/0x848
>>>>> [  163.372394] [c00000000ba07f80] [c000000000209038] kthread+0x1e8/0x230
>>>>> [  163.372406] [c00000000ba07fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
>>>>> [  163.372416] Code: 4b9b1099 60000000 7f63db78 4bae8245 60000000 e8bf0008 3c62ff88 7fe6fb78 7fc4f378 38637d40 4b5b5c89 60000000 <0fe00000> 60000000 60000000 7f83e378
>>>>> [  163.372453] ---[ end trace 0000000000000000 ]---
>>>>> [  163.380581] pstore: backend (nvram) writing error (-1)
>>>>> [  163.380593]
>>>>> 
>>>>> 
>>>>> If you happen to fix this issue, please add below tag.
>>>>> 
>>>>> 
>>>>> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Venkat.
>>>>> 
>>>>> 
>>>> 
>>>> After reverting the commit below, the issue is not seen.
>>>> 
>>>> commit 61bbf51e75df1a94cf6736e311cb96aeb79826a8
>>>> Author: Julian Sun <sunjunchao@bytedance.com>
>>>> Date:   Thu Aug 28 04:45:57 2025 +0800
>>>> 
>>>>   memcg: don't wait writeback completion when release memcg
>>>>
>>>>   Recently, we encountered the following hung task:
>>>>        INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
>>>>   [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
>>>>   [Wed Jul 30 17:47:45 2025] Call Trace:
>>>>   [Wed Jul 30 17:47:45 2025]  __schedule+0x934/0xe10
>>>>   [Wed Jul 30 17:47:45 2025]  ? complete+0x3b/0x50
>>>>   [Wed Jul 30 17:47:45 2025]  ? _cond_resched+0x15/0x30
>>>>   [Wed Jul 30 17:47:45 2025]  schedule+0x40/0xb0
>>>>   [Wed Jul 30 17:47:45 2025]  wb_wait_for_completion+0x52/0x80
>>>>   [Wed Jul 30 17:47:45 2025]  ? finish_wait+0x80/0x80
>>>>   [Wed Jul 30 17:47:45 2025]  mem_cgroup_css_free+0x22/0x1b0
>>>>   [Wed Jul 30 17:47:45 2025]  css_free_rwork_fn+0x42/0x380
>>>>   [Wed Jul 30 17:47:45 2025]  process_one_work+0x1a2/0x360
>>>>   [Wed Jul 30 17:47:45 2025]  worker_thread+0x30/0x390
>>>>   [Wed Jul 30 17:47:45 2025]  ? create_worker+0x1a0/0x1a0
>>>>   [Wed Jul 30 17:47:45 2025]  kthread+0x110/0x130
>>>>   [Wed Jul 30 17:47:45 2025]  ? __kthread_cancel_work+0x40/0x40
>>>>   [Wed Jul 30 17:47:45 2025]  ret_from_fork+0x1f/0x30
>>>>   The direct cause is that memcg spends a long time waiting for dirty page
>>>>   writeback of foreign memcgs during release.
>>>>
>>>>   The root causes are:
>>>>
>>>>   a. The wb may have multiple writeback tasks, containing millions
>>>>      of dirty pages, as shown below:
>>>>   >>> for work in list_for_each_entry("struct wb_writeback_work", \
>>>>   ...                                 wb.work_list.address_of_(), "list"):
>>>>   ...     print(work.nr_pages, work.reason, hex(work))
>>>>   ...
>>>>   900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
>>>>   1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
>>>>   1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
>>>>   1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
>>>>   1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
>>>>   2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
>>>>   2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
>>>>   3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
>>>>   3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
>>>>   3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
>>>>   3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
>>>>   3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
>>>>   3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
>>>>   3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
>>>>   3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
>>>>   3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
>>>>   3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
>>>>   3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
>>>>   3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
>>>>   3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
>>>>   3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
>>>>   3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
>>>>   3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
>>>>   3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
>>>>   b. The writeback might be severely throttled by wbt, with a speed
>>>>      possibly less than 100kb/s, leading to a very long writeback time.
>>>>   >>> wb.write_bandwidth
>>>>   (unsigned long)24
>>>>   >>> wb.write_bandwidth
>>>>   (unsigned long)13
>>>>   The wb_wait_for_completion() here is probably only used to prevent
>>>>   use-after-free.  Therefore, we manage 'done' separately and automatically
>>>>   free it.
>>>>
>>>>   This allows us to remove wb_wait_for_completion() while preventing the
>>>>   use-after-free issue.
>>>>   Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
>>>>   Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
>>>>   Acked-by: Tejun Heo <tj@kernel.org>
>>>>   Cc: Michal Hocko <mhocko@suse.com>
>>>>   Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>   Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>   Cc: Shakeel Butt <shakeelb@google.com>
>>>>   Cc: Muchun Song <songmuchun@bytedance.com>
>>>>   Cc: <stable@vger.kernel.org>
>>>>   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>>> 
>>>> Regards,
>>>> Venkat.
>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Julian Sun <sunjunchao@bytedance.com>
>> 
> 
> Thanks,
> -- 
> Julian Sun <sunjunchao@bytedance.com>



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-09-17  5:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <8957c526-d05c-4c0d-bfed-0eb6e6d2476c@linux.ibm.com>
     [not found] ` <aMPIwdleUCUMFPh2@infradead.org>
2025-09-12  7:21   ` [linux-next20250911]Kernel OOPs while running generic/256 on Pmem device Venkat
2025-09-12 12:32 ` Venkat
2025-09-12 13:16   ` [External] " Julian Sun
2025-09-13  2:48   ` Julian Sun
2025-09-15 14:19     ` Venkat
2025-09-15 14:26       ` Alexander Gordeev
2025-09-15 18:20         ` [External] " Julian Sun
2025-09-15 18:17       ` Julian Sun
2025-09-17  5:58         ` Venkat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox