public inbox for linux-rdma@vger.kernel.org
From: Leon Romanovsky <leon@kernel.org>
To: "Li,Rongqing" <lirongqing@baidu.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: [External Mail] Re: [PATCH] RDMA/core: Prevent soft lockup during large user memory region cleanup
Date: Wed, 12 Nov 2025 16:19:27 +0200
Message-ID: <20251112141927.GE17382@unreal>
In-Reply-To: <5a9e07930f134ff283d4a65373a62b85@baidu.com>

On Tue, Nov 11, 2025 at 12:09:42PM +0000, Li,Rongqing wrote:
> 
> 
> > On Tue, Nov 11, 2025 at 03:01:07PM +0800, lirongqing wrote:
> > > From: Li RongQing <lirongqing@baidu.com>
> > >
> > > When a process exits with numerous large, pinned memory regions
> > > consisting of 4KB pages, the cleanup of the memory region through
> > > __ib_umem_release() may cause soft lockups. This is because
> > > unpin_user_page_range_dirty_lock()
> > 
> > > Do you have a soft lockup splat?
> > 
> 
> 
> A user hit this lockup on an Ubuntu 22.04 kernel; after watchdog_thresh was changed to 60, the soft lockup no longer appeared.
> 
> I think his program registered too many memory regions (the program uses 400G of memory), but the lockup should still be fixed.

So please add this information to the next version, together with the
change proposed by Junxian.

Thanks

> 
> [9769474.755472] mlx5_core 0000:b0:00.0: mlx5_query_module_eeprom_by_page:475:(pid 3380349): Module ID not recognized: 0x19
> [9793445.031306] watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
> [9793445.032792] Kernel panic - not syncing: softlockup: hung tasks
> [9793445.033695] CPU: 44 PID: 73464 Comm: python3 Tainted: G           OEL    5.15.0-124-generic #134-Ubuntu
> [9793445.035024] Hardware name: BCC, BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [9793445.036500] Call Trace:
> [9793445.036955]  <IRQ>
> [9793445.037339]  show_stack+0x52/0x5c
> [9793445.037892]  dump_stack_lvl+0x4a/0x63
> [9793445.038485]  dump_stack+0x10/0x16
> [9793445.039024]  panic+0x15c/0x33b
> [9793445.039540]  watchdog_timer_fn.cold+0xc/0x16
> [9793445.040204]  ? lockup_detector_update_enable+0x60/0x60
> [9793445.040999]  __hrtimer_run_queues+0x104/0x230
> [9793445.041678]  ? clockevents_program_event+0xaa/0x130
> [9793445.042427]  hrtimer_interrupt+0x101/0x220
> [9793445.043070]  __sysvec_apic_timer_interrupt+0x5e/0xe0
> [9793445.043826]  sysvec_apic_timer_interrupt+0x7b/0x90
> [9793445.044563]  </IRQ>
> [9793445.044968]  <TASK>
> [9793445.045353]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [9793445.046138] RIP: 0010:free_unref_page+0xff/0x190
> [9793445.046861] Code: b9 ae 72 44 89 ea 4c 89 e7 e8 5d ce ff ff 65 48 03 1d fd b8 ae 72 41 f7 c6 00 02 00 00 0f 84 30 ff ff ff fb 66 0f 1f 44 00 00 <5b> 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc ba 00 08 00 00 e9 41
> [9793445.049506] RSP: 0018:ff7f1fce1bb039a8 EFLAGS: 00000206
> [9793445.050315] RAX: 0000000000000000 RBX: ff40b8648212ea88 RCX: ffc2d725bc39b808
> [9793445.051371] RDX: ff7f1fce1bb03920 RSI: ffc2d725bc39b7c8 RDI: ff40b86a3ffd7010
> [9793445.052436] RBP: ff7f1fce1bb039d0 R08: 0000000000000010 R09: 0000000000000000
> [9793445.053498] R10: ffc2d725bc39b800 R11: dead000000000122 R12: ffc2d728f67bcdc0
> [9793445.054551] R13: 0000000000000000 R14: 0000000000000297 R15: 0000000000000000
> [9793445.055604]  ? free_unref_page+0xe3/0x190
> [9793445.056255]  __put_page+0x77/0xe0
> [9793445.056831]  put_compound_head+0xed/0x100
> [9793445.057504]  unpin_user_page_range_dirty_lock+0xb2/0x180
> [9793445.058344]  __ib_umem_release+0x57/0xb0 [ib_core]
> [9793445.059148]  ib_umem_release+0x3f/0xd0 [ib_core]
> [9793445.059916]  mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
> [9793445.060716]  ib_dereg_mr_user+0x43/0xb0 [ib_core]
> [9793445.061492]  uverbs_free_mr+0x15/0x20 [ib_uverbs]
> [9793445.062242]  destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
> [9793445.063091]  uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
> [9793445.063947]  __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
> [9793445.064806]  uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
> [9793445.065671]  ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
> [9793445.066450]  __fput+0x9c/0x280
> [9793445.066993]  ____fput+0xe/0x20
> [9793445.067541]  task_work_run+0x6a/0xb0
> [9793445.068149]  do_exit+0x217/0x3c0
> [9793445.068726]  do_group_exit+0x3b/0xb0
> [9793445.069327]  get_signal+0x150/0x900
> [9793445.069926]  arch_do_signal_or_restart+0xde/0x100
> [9793445.070679]  ? fput+0x13/0x20
> [9793445.071195]  ? do_epoll_wait+0x8f/0xe0
> [9793445.071817]  exit_to_user_mode_loop+0xc4/0x160
> [9793445.072526]  exit_to_user_mode_prepare+0xa0/0xb0
> [9793445.073253]  syscall_exit_to_user_mode+0x27/0x50
> [9793445.073985]  ? x64_sys_call+0xfab/0x1fa0
> [9793445.074632]  do_syscall_64+0x63/0xb0
> [9793445.075227]  ? exit_to_user_mode_prepare+0x37/0xb0
> [9793445.075985]  ? syscall_exit_to_user_mode+0x2c/0x50
> [9793445.076739]  ? x64_sys_call+0x1ea1/0x1fa0
> [9793445.077394]  ? do_syscall_64+0x63/0xb0
> [9793445.078011]  ? syscall_exit_to_user_mode+0x2c/0x50
> [9793445.078762]  ? x64_sys_call+0xfab/0x1fa0
> [9793445.079404]  ? do_syscall_64+0x63/0xb0
> [9793445.080018]  ? x64_sys_call+0x1ea1/0x1fa0
> [9793445.080674]  ? do_syscall_64+0x63/0xb0
> [9793445.081287]  ? do_syscall_64+0x63/0xb0
> [9793445.081903]  ? do_syscall_64+0x63/0xb0
> 
> 
> 
> -Li
> 
> 
> 
> > > > is called in a tight loop to unpin and release pages without
> > > > yielding the CPU.
> > > >
> > > > Fix the soft lockup by adding cond_resched() calls in
> > > > __ib_umem_release().
> > >
> > > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > > ---
> > >  drivers/infiniband/core/umem.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > > > index c5b6863..70c1520 100644
> > > --- a/drivers/infiniband/core/umem.c
> > > +++ b/drivers/infiniband/core/umem.c
> > > @@ -59,6 +59,7 @@ static void __ib_umem_release(struct ib_device *dev,
> > struct ib_umem *umem, int d
> > >  		unpin_user_page_range_dirty_lock(sg_page(sg),
> > >  			DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
> > >
> > > +	cond_resched();
> > >  	sg_free_append_table(&umem->sgt_append);
> > >  }
> > >
> > > --
> > > 2.9.4
> > >

Thread overview: 7+ messages
2025-11-11  7:01 [PATCH] RDMA/core: Prevent soft lockup during large user memory region cleanup lirongqing
2025-11-11  8:19 ` Junxian Huang
2025-11-13  7:38   ` [External Mail] " Li,Rongqing
2025-11-11 12:01 ` Leon Romanovsky
2025-11-11 12:09   ` [External Mail] " Li,Rongqing
2025-11-12 14:19     ` Leon Romanovsky [this message]
2025-11-13  7:25       ` [External Mail] " Li,Rongqing
