From: Jeff Layton <jlayton@kernel.org>
To: Mike Galbraith <efault@gmx.de>,
dai.ngo@oracle.com, Chuck Lever III <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] NFSD: fix WARN_ON_ONCE in __queue_delayed_work
Date: Wed, 11 Jan 2023 05:55:49 -0500
Message-ID: <2067b4b4ce029ab5be982820b81241cd457ff475.camel@kernel.org>
In-Reply-To: <ce3724b88bb2987ac773057f523aa0ed2abacaed.camel@kernel.org>
On Wed, 2023-01-11 at 05:15 -0500, Jeff Layton wrote:
> On Wed, 2023-01-11 at 03:34 +0100, Mike Galbraith wrote:
> > On Tue, 2023-01-10 at 11:58 -0800, dai.ngo@oracle.com wrote:
> > >
> > > On 1/10/23 11:30 AM, Jeff Layton wrote:
> > >
> > > > >
> > > > >
> > > > Looking over the traces that Mike posted, I suspect this is the real
> > > > bug, particularly if the server is being restarted during this test.
> > >
> > > Yes, I noticed the WARN_ON_ONCE(timer->function != delayed_work_timer_fn)
> > > too and this seems to indicate some kind of corruption. However, I'm not
> > > sure if Mike's test restarts the nfs-server service. This could be a bug
> > > in work queue module when it's under stress.
> >
> > My reproducer was to merely mount and traverse/md5sum, while that was
> > going on, fire up LTP's min_free_kbytes testcase (memory hog from hell)
> > on the server. Systemthing may well be restarting the server service
> > in response to oomkill. In fact, the struct delayed_work in question
> > at WARN_ON_ONCE() time didn't look the least bit ready for business.
> >
> > FWIW, I had noticed the missing cancel while eyeballing, and stuck one
> > next to the existing one as a hail-mary, but that helped not at all.
> >
>
> Ok, thanks, that's good to know.
>
> I still doubt that the problem is the race that Dai seems to think it
> is. The workqueue infrastructure has been fairly stable for years. If
> there were problems with concurrent tasks queueing the same work, the
> kernel would be blowing up all over the place.
>
> > crash> delayed_work ffff8881601fab48
> > struct delayed_work {
> > work = {
> > data = {
> > counter = 1
> > },
> > entry = {
> > next = 0x0,
> > prev = 0x0
> > },
> > func = 0x0
> > },
> > timer = {
> > entry = {
> > next = 0x0,
> > pprev = 0x0
> > },
> > expires = 0,
> > function = 0x0,
> > flags = 0
> > },
> > wq = 0x0,
> > cpu = 0
> > }
>
> That looks more like a memory scribble or UAF. Merely having multiple
> tasks calling queue_work at the same time wouldn't be enough to trigger
> this, IMO. It's more likely that the extra locking is changing the
> timing of your reproducer somehow.
>
> It might be interesting to turn up KASAN if you're able.

If you still have this vmcore, it might be interesting to do the pointer
math and find the nfsd_net structure that contains the above
delayed_work. Does the rest of it also seem to be corrupt? My guess is
that the corrupted structure extends beyond just the delayed_work above.

Also, it might be helpful to do this:

    kmem -s ffff8881601fab48

...which should tell us whether and what part of the slab this object is
now a part of. That said, net-namespace object allocations are somewhat
weird, and I'm not 100% sure they come out of the slab.
--
Jeff Layton <jlayton@kernel.org>