From: Rishabh Bhatnagar <risbhat@amazon.com>
To: <gregkh@linuxfoundation.org>, <shakeelb@google.com>,
<viro@zeniv.linux.org.uk>, <bsegall@google.com>
Cc: <mdecandia@gmail.com>, <linux-kernel@vger.kernel.org>,
<stable@vger.kernel.org>, Rishabh Bhatnagar <risbhat@amazon.com>
Subject: [PATCH 5.4 0/2] Fix epoll issue in 5.4 kernels
Date: Thu, 24 Nov 2022 00:11:21 +0000
Message-ID: <20221124001123.3248571-1-risbhat@amazon.com>
Hi Greg
After upgrading to 5.4.211 we started seeing some nodes getting
stuck in our Kubernetes cluster. All nodes are running this kernel
version. On closer inspection, it seems that the runc command was getting
stuck. Looking at the stack, it appears the thread is stuck in epoll wait for
some time.
[<0>] do_syscall_64+0x48/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[<0>] ep_poll+0x48d/0x4e0
[<0>] do_epoll_wait+0xab/0xc0
[<0>] __x64_sys_epoll_pwait+0x4d/0xa0
[<0>] do_syscall_64+0x48/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[<0>] futex_wait_queue_me+0xb6/0x110
[<0>] futex_wait+0xe2/0x260
[<0>] do_futex+0x372/0x4f0
[<0>] __x64_sys_futex+0x134/0x180
[<0>] do_syscall_64+0x48/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
I noticed there are other discussions going on as well
regarding this.
https://lore.kernel.org/all/Y1pY2n6E1Xa58MXv@kroah.com/
Reverting the below patch does fix the issue:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=cf2db24ec4b8e9d399005ececd6f6336916ab6fc
We don't see this issue in the latest upstream kernel or even the latest 5.10
stable tree. Looking at the patches that went into 5.10 stable, there's
one that stands out and seems to be missing in 5.4:
289caf5d8f6c61c6d2b7fd752a7f483cd153f182 (epoll: check for events when removing
a timed out thread from the wait queue)
With this patch backported to 5.4, we no longer see the hangs. It looks like
this patch fixes timeout scenarios that might cause missed wakeups.
The other patch in the series also fixes a race and is needed for
the second patch to apply.
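
For readers unfamiliar with the wait pattern involved, here is a minimal
user-space sketch of a thread waiting on an eventfd through epoll_wait() with
a timeout. It is not a reproducer for the kernel race and is not taken from
the patches; the helper name wait_once() and its parameters are hypothetical,
chosen only for illustration. It shows the two outcomes a caller expects: one
ready event when the fd was signalled, zero when the timeout expires. A missed
wakeup means a waiter keeps sleeping even though an event is pending.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: register an eventfd with a fresh epoll instance,
 * optionally make it readable, then wait up to timeout_ms milliseconds.
 * Returns the epoll_wait() result (number of ready fds), or -1 on error.
 */
static int wait_once(int post_event, int timeout_ms)
{
    struct epoll_event ev, got;
    uint64_t one = 1;
    int ret = -1;
    int efd = eventfd(0, EFD_NONBLOCK);
    int epfd = epoll_create1(0);

    if (efd < 0 || epfd < 0)
        goto out;

    ev.events = EPOLLIN;
    ev.data.fd = efd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto out;

    /* Signal the eventfd before waiting so the event is already pending. */
    if (post_event && write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
        goto out;

    /* With a pending event this returns at once; otherwise it times out. */
    ret = epoll_wait(epfd, &got, 1, timeout_ms);
    if (ret == 1)
        assert(got.data.fd == efd);
out:
    if (epfd >= 0)
        close(epfd);
    if (efd >= 0)
        close(efd);
    return ret;
}
```

Calling wait_once(1, 100) should report one ready event, and wait_once(0, 10)
should report none once the 10 ms timeout expires.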
Roman Penyaev (1):
epoll: call final ep_events_available() check under the lock
Soheil Hassas Yeganeh (1):
epoll: check for events when removing a timed out thread from the wait
queue
fs/eventpoll.c | 68 ++++++++++++++++++++++++++++++--------------------
1 file changed, 41 insertions(+), 27 deletions(-)
--
2.37.1
Thread overview: 7+ messages
2022-11-24 0:11 Rishabh Bhatnagar [this message]
2022-11-24 0:11 ` [PATCH 5.4 1/2] epoll: call final ep_events_available() check under the lock Rishabh Bhatnagar
2022-11-24 7:48 ` Thadeu Lima de Souza Cascardo
2022-12-01 4:07 ` Samuel Mendoza-Jonas
2022-11-24 0:11 ` [PATCH 2/2] epoll: check for events when removing a timed out thread from the wait queue Rishabh Bhatnagar
2022-11-24 7:49 ` Thadeu Lima de Souza Cascardo
2022-11-28 21:05 ` [PATCH 5.4 0/2] Fix epoll issue in 5.4 kernels Benjamin Segall