Re: [syzbot] [xfs?] INFO: task hung in __fdget_pos (4)

From: Dave Chinner <david@fromorbit.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mjguzik@gmail.com>,
	syzbot <syzbot+e245f0516ee625aaa412@syzkaller.appspotmail.com>,
	brauner@kernel.org, djwong@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, llvm@lists.linux.dev,
	nathan@kernel.org, ndesaulniers@google.com,
	syzkaller-bugs@googlegroups.com, trix@redhat.com
Subject: Re: [syzbot] [xfs?] INFO: task hung in __fdget_pos (4)
Date: Mon, 4 Sep 2023 11:45:03 +1000
Message-ID: <ZPU2n48GoSRMBc7j@dread.disaster.area>
In-Reply-To: <20230903231338.GN3390869@ZenIV>

On Mon, Sep 04, 2023 at 12:13:38AM +0100, Al Viro wrote:
> On Mon, Sep 04, 2023 at 08:27:15AM +1000, Dave Chinner wrote:
> 
> > It already is (sysrq-t), but I'm not sure that will help - if it is
> > a leaked unlock then nothing will show up at all.
> 
> Unlikely; grep and you'll see - very few callers, and for all of them
> there's an fdput_pos() downstream of any fdget_pos() that had picked
> non-NULL file reference.
> 
> In theory, it's not impossible that something had stripped FDPUT_POS_UNLOCK
> from the flags, but that's basically "something might've corrupted the
> local variables" scenario.

Entirely possible - this is syzbot we are talking about here.
Especially if reiser or ntfs has been tested back before the logs we
have start, as both are known to corrupt memory and/or leak locks
when trying to parse corrupt filesystem images that syzbot feeds
them.  That's why we largely ignore syzbot reports that involve
those filesystems...

Unfortunately, the logs from what was being done around when the
tasks actually hung are long gone (seems like only the last 20-30s
of log activity is reported) so when the hung task timer goes off
at 143s, there is nothing left to tell us what might have caused it.

IOWs, it's entirely possible that it is a memory corruption that
has resulted in a leaked lock somewhere...
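
To make the "stripped FDPUT_POS_UNLOCK" case concrete, here is a toy
userspace model, not kernel code. It assumes fdget_pos() always takes
->f_pos_lock (the real one only does so for some files), and the File
class, the dict standing in for struct fd, and lock_leaked_after() are
all made up for illustration. The flag value is illustrative too.

```python
import threading

# Same name as the kernel flag; the value here is illustrative only.
FDPUT_POS_UNLOCK = 2


class File:
    """Hypothetical stand-in for struct file: just carries f_pos_lock."""

    def __init__(self):
        self.f_pos_lock = threading.Lock()


def fdget_pos(file):
    # Simplification: always take the position lock and record that
    # the matching fdput_pos() has to drop it.
    file.f_pos_lock.acquire()
    return {"file": file, "flags": FDPUT_POS_UNLOCK}


def fdput_pos(fd):
    # The unlock is driven purely by the flag stored in the local "fd".
    if fd["flags"] & FDPUT_POS_UNLOCK:
        fd["file"].f_pos_lock.release()


def lock_leaked_after(corrupt_flags):
    """Run one fdget_pos()/fdput_pos() pair; report whether the lock leaked."""
    f = File()
    fd = fdget_pos(f)
    if corrupt_flags:
        # Model "something corrupted the local variables".
        fd["flags"] &= ~FDPUT_POS_UNLOCK
    fdput_pos(fd)
    return f.f_pos_lock.locked()
```

With intact flags the lock is released; with the flag cleared between
the two calls, the lock stays held and no task is left that owns it.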

> There are 12 functions total where we might
> be calling fdget_pos() and all of them are pretty small (1 in alpha
> osf_sys.c, 6 in read_write.c and 5 in readdir.c); none of those takes
> an address of struct fd, none of them has assignments to it after fdget_pos()
> and the only accesses to its members are those to fd.file - all fetches.
> Control flow is also easy to check - they are all short.
> 
> IMO it's much more likely that we'll find something like
> 
> thread A:
> 	grabs some fs lock
> 	gets stuck on something
> thread B: write()
> 	finds file
> 	grabs ->f_pos_lock
> 	calls into filesystem
> 	blocks on fs lock held by A
> thread C: read()/write()/lseek() on the same file
> 	blocks on ->f_pos_lock

Yes, that's exactly what I said in a followup email - we need to
know what happened to thread A, because that might be where we are
stuck on a leaked lock.
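
The A/B/C chain above can be reproduced in userspace with two plain
locks. This is only a sketch: fs_lock stands in for "some fs lock",
f_pos_lock for ->f_pos_lock, and simulate_lock_chain() plus the event
names are made up for the example. Timed acquires are used so the
demo terminates instead of actually hanging.

```python
import threading


def simulate_lock_chain(timeout=0.1):
    fs_lock = threading.Lock()       # stand-in for some filesystem lock
    f_pos_lock = threading.Lock()    # stand-in for file->f_pos_lock
    a_has_fs_lock = threading.Event()
    b_has_pos_lock = threading.Event()
    c_done = threading.Event()
    release_a = threading.Event()
    state = {}

    def thread_a():
        # Grabs the fs lock, then "gets stuck on something".
        with fs_lock:
            a_has_fs_lock.set()
            release_a.wait()

    def thread_b():
        # write(): takes f_pos_lock, then blocks on the fs lock held by A.
        a_has_fs_lock.wait()
        with f_pos_lock:
            b_has_pos_lock.set()
            got = fs_lock.acquire(timeout=timeout)
            state["B"] = "ran" if got else "blocked on fs lock"
            if got:
                fs_lock.release()
            # Keep f_pos_lock held until C has made its attempt.
            c_done.wait()

    def thread_c():
        # read()/write()/lseek() on the same file: blocks on f_pos_lock.
        b_has_pos_lock.wait()
        got = f_pos_lock.acquire(timeout=timeout)
        state["C"] = "ran" if got else "blocked on f_pos_lock"
        if got:
            f_pos_lock.release()
        c_done.set()

    threads = [threading.Thread(target=fn)
               for fn in (thread_a, thread_b, thread_c)]
    for t in threads:
        t.start()
    threads[1].join()
    threads[2].join()
    release_a.set()      # let A go so the demo can exit
    threads[0].join()
    return state
```

Both B and C report being blocked, yet only A knows why. That is the
point: the traces for B and C say nothing about what A is stuck on.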

I saw quite a few reports where lookup/readdir are also stuck trying
to get an inode lock - those are the "thread B"s in the above example
- but there's no indication left of what happened with thread A.

If thread A was blocked all that time on something, then the hung
task timer should fire on it, too.  If it is running in a tight
loop, the NMI would have dumped a stack trace from it.

But neither of those things happened, so it has either leaked
something or it's in a loop with a short term sleep and so doesn't
trigger the hung task timer. sysrq-w output will capture that
without all the noise of sysrq-t....
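
A toy model of why a loop with a short term sleep stays invisible to
the hung task timer. It assumes the detector wakes once per timeout
period and only complains about a task that has not been scheduled
since the previous check; the function name and parameters are made
up, and the real detector has more conditions than this.

```python
from bisect import bisect_right


def hung_task_reports(wakeup_times, timeout, horizon):
    """Return the check times at which the toy detector would complain.

    wakeup_times: sorted times (seconds) at which the task got scheduled.
    timeout:      seconds between checks, and the "hung" threshold.
    horizon:      stop checking after this many seconds.
    """
    reports = []
    last_count = 0
    t = timeout
    while t <= horizon:
        # Number of times the task has been scheduled up to time t.
        count = bisect_right(wakeup_times, t)
        if count == last_count:
            # No scheduling activity since the previous check.
            reports.append(t)
        last_count = count
        t += timeout
    return reports
```

A task that never wakes (wakeup_times=[]) is reported at every check,
while one that wakes once a second is never reported, even though it
makes no forward progress either.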

-Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 22+ messages
2023-09-03  4:11 [syzbot] [xfs?] INFO: task hung in __fdget_pos (4) syzbot
2023-09-03  5:25 ` Dave Chinner
2023-09-03  8:33   ` Mateusz Guzik
2023-09-03 18:01     ` Al Viro
2023-09-03 18:57       ` Mateusz Guzik
2023-09-03 19:51         ` Al Viro
2023-09-03 20:04           ` Mateusz Guzik
2023-09-06 17:53             ` Aleksandr Nogikh
2023-09-03 22:27     ` Dave Chinner
2023-09-03 22:47       ` Mateusz Guzik
2023-09-03 23:09         ` Dave Chinner
2023-09-04  8:11           ` Christian Brauner
2023-09-04  8:23             ` Christian Brauner
2023-09-04  8:55               ` Dave Chinner
2023-09-03 23:13       ` Al Viro
2023-09-04  1:45         ` Dave Chinner [this message]
2023-09-04  3:02           ` Al Viro
2023-09-04  3:26           ` Theodore Ts'o
2023-09-04  6:09             ` Mateusz Guzik
2023-11-30 16:58 ` [syzbot] [fs] " syzbot
2024-09-21  5:58 ` syzbot
2024-10-31 13:38 ` syzbot