[BUG] mutex deadlock of dpm_resume() in low memory situation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Youngmin Nam <youngmin.nam@samsung.com>
To: rafael@kernel.org, len.brown@intel.com, pavel@ucw.cz,
	gregkh@linuxfoundation.org
Cc: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	d7271.choe@samsung.com, janghyuck.kim@samsung.com,
	hyesoo.yu@samsung.com, youngmin.nam@samsung.com
Subject: [BUG] mutex deadlock of dpm_resume() in low memory situation
Date: Wed, 27 Dec 2023 17:42:50 +0900	[thread overview]
Message-ID: <ZYvjiqX6EsL15moe@perf> (raw)
In-Reply-To: CGME20231227084252epcas2p3b063f7852f81f82cd0a31afd7f404db4@epcas2p3.samsung.com

[-- Attachment #1: Type: text/plain, Size: 6085 bytes --]

Hi all.

I'm reporting a issue which looks like a upstream kernel bug.
We are using 6.1 but I think all of version can cause this issue.

A mutex deadlock issue occured on low mem situation when the device runs dpm_resume().
Here's the problematic situation.

#1. Currently, the device is on low mem situation as below.
[4: binder:569_5:27109] SLUB: Unable to allocate memory on node -1, gfp=0xb20(GFP_ATOMIC|__GFP_ZERO)
[4: binder:569_5:27109] cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
[4: binder:569_5:27109] node 0: slabs: 1865, objs: 59680, free: 0
[4: binder:569_5:27109] binder:569_5: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[4: binder:569_5:27109] CPU: 4 PID: 27109 Comm: binder:569_5 Tainted: G S C E 6.1.43-android14-11-abS921BXXU1AWLB #1
[4: binder:569_5:27109] Hardware name: Samsung E1S EUR OPENX board based on S5E9945 (DT)
[4: binder:569_5:27109] Call trace:
[4: binder:569_5:27109] dump_backtrace+0xf4/0x118
[4: binder:569_5:27109] show_stack+0x18/0x24
[4: binder:569_5:27109] dump_stack_lvl+0x60/0x7c
[4: binder:569_5:27109] dump_stack+0x18/0x38
[4: binder:569_5:27109] warn_alloc+0xf4/0x190
[4: binder:569_5:27109] __alloc_pages_slowpath+0x10ec/0x12ac
[4: binder:569_5:27109] __alloc_pages+0x27c/0x2fc
[4: binder:569_5:27109] new_slab+0x17c/0x4e0
[4: binder:569_5:27109] ___slab_alloc+0x4e4/0x8a8
[4: binder:569_5:27109] __slab_alloc+0x34/0x6c
[4: binder:569_5:27109] __kmem_cache_alloc_node+0x1f4/0x260
[4: binder:569_5:27109] kmalloc_trace+0x4c/0x144
[4: binder:569_5:27109] async_schedule_node_domain+0x40/0x1ec
[4: binder:569_5:27109] async_schedule_node+0x18/0x28
[4: binder:569_5:27109] dpm_suspend+0xfc/0x48c

#2. The process A runs dpm_resume() and acquired "dpm_list_mtx" lock as below.
1000 void dpm_resume(pm_message_t state)
1001 {
1002         struct device *dev;
1003         ktime_t starttime = ktime_get();
1004
1005         trace_suspend_resume(TPS("dpm_resume"), state.event, true);
1006         might_sleep();
1007
1008         mutex_lock(&dpm_list_mtx);  <-------- process A acquired the lock
1009         pm_transition = state;
1010         async_error = 0;
1011
1012         list_for_each_entry(dev, &dpm_suspended_list, power.entry)
1013                 dpm_async_fn(dev, async_resume);

#3. The process A continues to run below functions as below.
dpm_async_fn()
      --> async_schedule_dev()
          --> async_schedule_node()
              --> async_schedule_node_domain()

#4. The kzalloc() will be failed because of lowmen situation.
165 async_cookie_t async_schedule_node_domain(async_func_t func, void *data,
166                                           int node, struct async_domain *domain)
167 {
168         struct async_entry *entry;
169         unsigned long flags;
170         async_cookie_t newcookie;
171
172         /* allow irq-off callers */
173         entry = kzalloc(sizeof(struct async_entry), GFP_ATOMIC); <--- Will be failied
174
175         /*
176          * If we're out of memory or if there's too much work
177          * pending already, we execute synchronously.
178          */
179         if (!entry || atomic_read(&entry_count) > MAX_WORK) {
180                 kfree(entry);
181                 spin_lock_irqsave(&async_lock, flags);
182                 newcookie = next_cookie++;
183                 spin_unlock_irqrestore(&async_lock, flags);
184
185                 /* low on memory.. run synchronously */
186                 func(data, newcookie); <--- the process A will run func() that is async_resume()
187                 return newcookie;

5. The process A continues to run below functions as below.
async_resume()
    --> device_resume()
        --> dpm_wait_for_superior()

6. The process A is trying to get the lock which was acquired by A.
278 static bool dpm_wait_for_superior(struct device *dev, bool async)
279 {
280         struct device *parent;
281
282         /*
283          * If the device is resumed asynchronously and the parent's callback
284          * deletes both the device and the parent itself, the parent object may
285          * be freed while this function is running, so avoid that by reference
286          * counting the parent once more unless the device has been deleted
287          * already (in which case return right away).
288          */
289         mutex_lock(&dpm_list_mtx);

So we think this situation can make mutex dead lock issue in suspend/resume sequence.

Here's the process A's callstack in kernel log. (binder:569:5)
I[4:      swapper/4:    0]      pid            uTime            sTime     last_arrival      last_queued   stat   cpu  task_struct           comm [wait channel]
I[4:      swapper/4:    0]    27109                0         1758074533    2396019774802             0    D(  2)  5   ffffff8044bc92c0      binder:569_5 [dpm_wait_for_superior]
I[4:      swapper/4:    0] Mutex: dpm_list_mtx+0x0/0x30: owner[0xffffff8044bc92c0 binder:569_5 :27109]
I[4:      swapper/4:    0] Call trace:
I[4:      swapper/4:    0]  __switch_to+0x174/0x338
I[4:      swapper/4:    0]  __schedule+0x5ec/0x9cc
I[4:      swapper/4:    0]  schedule+0x7c/0xe8
I[4:      swapper/4:    0]  schedule_preempt_disabled+0x24/0x40
I[4:      swapper/4:    0]  __mutex_lock+0x408/0xdac
I[4:      swapper/4:    0]  __mutex_lock_slowpath+0x14/0x24
I[4:      swapper/4:    0]  mutex_lock+0x40/0xec  <-------- trying to acquire dpm_list_mtx again
I[4:      swapper/4:    0]  dpm_wait_for_superior+0x30/0x148
I[4:      swapper/4:    0]  device_resume+0x38/0x1e4
I[4:      swapper/4:    0]  async_resume+0x24/0xf4
I[4:      swapper/4:    0]  async_schedule_node_domain+0xb0/0x1ec
I[4:      swapper/4:    0]  async_schedule_node+0x18/0x28
I[4:      swapper/4:    0]  dpm_resume+0xbc/0x578  <------- acquired dpm_list_mtx
I[4:      swapper/4:    0]  dpm_resume_end+0x1c/0x38
I[4:      swapper/4:    0]  suspend_devices_and_enter+0x83c/0xb2c
I[4:      swapper/4:    0]  pm_suspend+0x34c/0x618
I[4:      swapper/4:    0]  state_store+0x104/0x144

Could you look into this issue ?

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

next      parent reply	other threads:[~2023-12-27  8:43 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20231227084252epcas2p3b063f7852f81f82cd0a31afd7f404db4@epcas2p3.samsung.com>
2023-12-27  8:42 ` Youngmin Nam [this message]
2023-12-27 16:08   ` [BUG] mutex deadlock of dpm_resume() in low memory situation Greg KH
2023-12-27 17:44     ` Rafael J. Wysocki
2023-12-27 18:39     ` Rafael J. Wysocki
2023-12-27 18:58       ` Rafael J. Wysocki
2023-12-27 20:50         ` Rafael J. Wysocki
2023-12-28  6:40           ` Youngmin Nam
2023-12-27 20:35       ` [PATCH v1 0/3] PM: sleep: Fix possible device suspend-resume deadlocks Rafael J. Wysocki
2023-12-27 20:37         ` [PATCH v1 1/3] async: Split async_schedule_node_domain() Rafael J. Wysocki
2023-12-27 20:38         ` [PATCH v1 2/3] async: Introduce async_schedule_dev_nocall() Rafael J. Wysocki
2023-12-28 20:29           ` Stanislaw Gruszka
2023-12-29 13:37             ` Rafael J. Wysocki
2023-12-29  3:08               ` Stanislaw Gruszka
2023-12-29 16:36                 ` Rafael J. Wysocki
2024-01-02  7:09                   ` Stanislaw Gruszka
2024-01-02 13:15                     ` Rafael J. Wysocki
2023-12-27 20:41         ` [PATCH v1 3/3] PM: sleep: Fix possible deadlocks in core system-wide PM code Rafael J. Wysocki
2024-01-02 13:35           ` Ulf Hansson
2024-01-02 13:53             ` Rafael J. Wysocki
2024-01-03 10:17               ` Ulf Hansson
2024-01-03 10:27                 ` Rafael J. Wysocki
2024-01-03 10:33                 ` Greg KH
2024-01-02 13:18         ` [PATCH v1 0/3] PM: sleep: Fix possible device suspend-resume deadlocks Rafael J. Wysocki
2024-01-03  4:39           ` Youngmin Nam
2024-01-03 10:28             ` Rafael J. Wysocki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZYvjiqX6EsL15moe@perf \
    --to=youngmin.nam@samsung.com \
    --cc=d7271.choe@samsung.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hyesoo.yu@samsung.com \
    --cc=janghyuck.kim@samsung.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=pavel@ucw.cz \
    --cc=rafael@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.