All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Omar Sandoval <osandov@osandov.com>, linux-block@vger.kernel.org
Cc: kernel-team@fb.com, Josef Bacik <josef@toxicpanda.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
Date: Wed, 16 Oct 2024 07:23:28 -0600	[thread overview]
Message-ID: <c55bcb77-e39a-4591-ba4e-827770b4ba35@kernel.dk> (raw)
In-Reply-To: <d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com>

On 10/15/24 11:59 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We're seeing crashes from rq_qos_wake_function that look like this:
> 
>   BUG: unable to handle page fault for address: ffffafe180a40084
>   #PF: supervisor write access in kernel mode
>   #PF: error_code(0x0002) - not-present page
>   PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
>   Oops: Oops: 0002 [#1] PREEMPT SMP PTI
>   CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>   RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
>   Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
>   RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
>   RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
>   RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
>   RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
>   R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
>   R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
>   FS:  0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   PKRU: 55555554
>   Call Trace:
>    <IRQ>
>    try_to_wake_up+0x5a/0x6a0
>    rq_qos_wake_function+0x71/0x80
>    __wake_up_common+0x75/0xa0
>    __wake_up+0x36/0x60
>    scale_up.part.0+0x50/0x110
>    wb_timer_fn+0x227/0x450
>    ...
> 
> So rq_qos_wake_function() calls wake_up_process(data->task), which calls
> try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).
> 
> p comes from data->task, and data comes from the waitqueue entry, which
> is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
> dump with drgn, I found that the waiter had already woken up and moved
> on to a completely unrelated code path, clobbering what was previously
> data->task. Meanwhile, the waker was passing the clobbered garbage in
> data->task to wake_up_process(), leading to the crash.
> 
> What's happening is that in between rq_qos_wake_function() deleting the
> waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
> that it already got a token and returning. The race looks like this:
> 
> rq_qos_wait()                           rq_qos_wake_function()
> ==============================================================
> prepare_to_wait_exclusive()
>                                         data->got_token = true;
>                                         list_del_init(&curr->entry);
> if (data.got_token)
>         break;
> finish_wait(&rqw->wait, &data.wq);
>   ^- returns immediately because
>      list_empty_careful(&wq_entry->entry)
>      is true
> ... return, go do something else ...
>                                         wake_up_process(data->task)
>                                           (NO LONGER VALID!)-^
> 
> Normally, finish_wait() is supposed to synchronize against the waker.
> But, as noted above, it is returning immediately because the waitqueue
> entry has already been removed from the waitqueue.
> 
> The bug is that rq_qos_wake_function() is accessing the waitqueue entry
> AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
> and THEN deletes the waitqueue entry, which is the proper order.
> 
> Fix it by swapping the order. We also need to use
> list_del_init_careful() to match the list_empty_careful() in
> finish_wait().

Thanks Omar, nice debugging!

-- 
Jens Axboe


      parent reply	other threads:[~2024-10-16 13:23 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-15 17:59 [PATCH] blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race Omar Sandoval
2024-10-15 19:40 ` Tejun Heo
2024-10-16  6:13 ` Johannes Thumshirn
2024-10-16 13:20 ` Jens Axboe
2024-10-16 13:23 ` Jens Axboe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c55bcb77-e39a-4591-ba4e-827770b4ba35@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-block@vger.kernel.org \
    --cc=osandov@osandov.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.