All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oleg Nesterov <oleg@redhat.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Latchesar Ionkov <lucho@ionkov.net>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Christian Schoenebeck <linux_oss@crudebyte.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>,
	syzbot <syzbot+62262fdc0e01d99573fc@syzkaller.appspotmail.com>,
	brauner@kernel.org, dhowells@redhat.com, jack@suse.cz,
	jlayton@kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, netfs@lists.linux.dev,
	swapnil.sapkal@amd.com, syzkaller-bugs@googlegroups.com,
	viro@zeniv.linux.org.uk, v9fs@lists.linux.dev
Subject: Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter
Date: Tue, 25 Mar 2025 13:15:26 +0100	[thread overview]
Message-ID: <20250325121526.GA7904@redhat.com> (raw)
In-Reply-To: <ff294b3c-cd24-4aa6-9d03-718ff7087158@amd.com>

Prateek, thanks again for your hard work!

Yes, I think we need the help from 9p folks. I know nothing about it, but
so far this still looks as a 9p problem to me...

All I can say right now is that the "sigpending" logic in p9_client_rpc()
looks wrong. If nothing else:

	- clear_thread_flag(TIF_SIGPENDING) is not enough, it won't make
	  signal_pending() false if TIF_NOTIFY_SIGNAL is set.

	- otoh, if signal_pending() was true because of pending SIGKILL,
	  then after clear_thread_flag(TIF_SIGPENDING) wait_event_killable()
	  will act as uninterruptible wait_event().

Oleg.

On 03/25, K Prateek Nayak wrote:
>
> On 3/24/2025 8:22 PM, K Prateek Nayak wrote:
> >Hello folks,
> >
> >Some updates.
> >
> >On 3/24/2025 6:49 PM, K Prateek Nayak wrote:
> >>>
> >>>Per syzbot this attempt did not work out either.
> >>>
> >>>I think the blind stabs taken by everyone here are enough.
> >>>
> >>>The report does not provide the crucial bit: what are the other
> >>>threads doing. Presumably someone else is stuck somewhere, possibly
> >>>not even in pipe code and that stuck thread was supposed to wake up
> >>>the one which trips over hung task detector. Figuring out what that
> >>>thread is imo the next step.
> >>>
> >>>I failed to find a relevant command in
> >>>https://github.com/google/syzkaller/blob/master/docs/syzbot.md
> >>>
> >>>So if you guys know someone on syzkaller side, maybe you can ask them
> >>>to tweak the report *or* maybe syzbot can test a "fix" which makes
> >>>hung task detector also report all backtraces? I don't know if that
> >>>can work, the output may be long enough that it will get trimmed by
> >>>something.
> >>>
> >>>I don't have to time work on this for now, just throwing ideas.
> >>
> >>I got the reproducer running locally. Tracing stuff currently to see
> >>what is tripping. Will report back once I find something interesting.
> >>Might take a while since the 9p bits are so far spread out.
> >>
> >
> >So far, with tracing, this is where I'm:
> >
> >o Mainline + Oleg's optimization reverted:
> >
> >     ...
> >     kworker/43:1-1723    [043] .....   115.309065: p9_read_work: Data read wait 55
> >     kworker/43:1-1723    [043] .....   115.309066: p9_read_work: Data read 55
> >     kworker/43:1-1723    [043] .....   115.309067: p9_read_work: Data read wait 7
> >     kworker/43:1-1723    [043] .....   115.309068: p9_read_work: Data read 7
> >            repro-4138    [043] .....   115.309084: netfs_wake_write_collector: Wake collector
> >            repro-4138    [043] .....   115.309085: netfs_wake_write_collector: Queuing collector work
> >            repro-4138    [043] .....   115.309088: netfs_unbuffered_write: netfs_unbuffered_write
> >            repro-4138    [043] .....   115.309088: netfs_end_issue_write: netfs_end_issue_write
> >            repro-4138    [043] .....   115.309089: netfs_end_issue_write: Write collector need poke 0
> >            repro-4138    [043] .....   115.309091: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
> >  kworker/u1030:1-1951    [168] .....   115.309096: netfs_wake_write_collector: Wake collector
> >  kworker/u1030:1-1951    [168] .....   115.309097: netfs_wake_write_collector: Queuing collector work
> >  kworker/u1030:1-1951    [168] .....   115.309102: netfs_write_collection_worker: Write collect clearing and waking up!
> >     ... (syzbot reproducer continues)
> >
> >o Mainline:
> >
> >    kworker/185:1-1767    [185] .....   109.485961: p9_read_work: Data read wait 7
> >    kworker/185:1-1767    [185] .....   109.485962: p9_read_work: Data read 7
> >    kworker/185:1-1767    [185] .....   109.485962: p9_read_work: Data read wait 55
> >    kworker/185:1-1767    [185] .....   109.485963: p9_read_work: Data read 55
> >            repro-4038    [185] .....   114.225717: netfs_wake_write_collector: Wake collector
> >            repro-4038    [185] .....   114.225723: netfs_wake_write_collector: Queuing collector work
> >            repro-4038    [185] .....   114.225727: netfs_unbuffered_write: netfs_unbuffered_write
> >            repro-4038    [185] .....   114.225727: netfs_end_issue_write: netfs_end_issue_write
> >            repro-4038    [185] .....   114.225728: netfs_end_issue_write: Write collector need poke 0
> >            repro-4038    [185] .....   114.225728: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
> >    ... (syzbot reproducer hangs)
> >
> >There is a third "kworker/u1030" component that never gets woken up for
> >reasons currently unknown to me with Oleg's optimization. I'll keep
> >digging.
>
> More data ...
>
> I chased this down to p9_client_rpc() net/9p/client.c specifically:
>
>         err = c->trans_mod->request(c, req);
>         if (err < 0) {
>                 /* write won't happen */
>                 p9_req_put(c, req);
>                 if (err != -ERESTARTSYS && err != -EFAULT)
>                         c->status = Disconnected;
>                 goto recalc_sigpending;
>         }
>
> c->trans_mod->request() calls p9_fd_request() in net/9p/trans_fd.c
> which basically does a p9_fd_poll().
>
> Previously, the above would fail with err as -EIO which would
> cause the client to "Disconnect" and the retry logic would make
> progress. Now however, the err returned is -ERESTARTSYS which
> will not cause a disconnect and the retry logic will hang
> somewhere in p9_client_rpc() later.
>
> I'll chase it a little more but if 9p folks can chime in it would
> be great since I'm out of my depths here.
>
> P.S. There are more interactions at play and I'm trying to still
> piece them together.
>
> Relevant traces:
>
> o Mainline + Oleg's optimization reverted:
>
>            repro-4161    [239] .....   107.785644: p9_client_write: p9_client_rpc done
>            repro-4161    [239] .....   107.785644: p9_client_write: p9_pdup
>            repro-4161    [239] .....   107.785644: p9_client_write: iter revert
>            repro-4161    [239] .....   107.785644: p9_client_write: p9_client_rpc
>            repro-4161    [239] .....   107.785653: p9_fd_request: p9_fd_request
>            repro-4161    [239] ...1.   107.785653: p9_fd_request: p9_fd_request error
>            repro-4161    [239] .....   107.785653: p9_client_rpc: Client disconnected (no write) <------------- "write won't happen" case
>            repro-4161    [239] .....   107.785655: p9_client_write: p9_client_rpc done
>            repro-4161    [239] .....   107.785655: p9_client_write: p9_client_rpc error (-5)     <------------- -EIO
>            repro-4161    [239] .....   107.785656: v9fs_issue_write: Issue write done 2 err(-5)
>            repro-4161    [239] .....   107.785657: netfs_write_subrequest_terminated: Collector woken up from netfs_write_subrequest_terminated
>            repro-4161    [239] .....   107.785657: netfs_wake_write_collector: Wake collector
>            repro-4161    [239] .....   107.785658: netfs_wake_write_collector: Queuing collector work
>            repro-4161    [239] .....   107.785660: v9fs_issue_write: Issue write subrequest terminated 2
>            repro-4161    [239] .....   107.785661: netfs_unbuffered_write: netfs_unbuffered_write
>            repro-4161    [239] .....   107.785661: netfs_end_issue_write: netfs_end_issue_write
>            repro-4161    [239] .....   107.785662: netfs_end_issue_write: Write collector need poke 0
>            repro-4161    [239] .....   107.785662: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>  kworker/u1038:0-1583    [105] .....   107.785667: netfs_retry_writes: netfs_reissue_write 1
>  kworker/u1038:0-1583    [105] .....   107.785668: v9fs_issue_write: v9fs_issue_write
>  kworker/u1038:0-1583    [105] .....   107.785669: p9_client_write: p9_client_rpc
>  kworker/u1038:0-1583    [105] .....   107.785669: p9_client_prepare_req: p9_client_prepare_req eio 1
>  kworker/u1038:0-1583    [105] .....   107.785670: p9_client_rpc: p9_client_rpc early err return
>  kworker/u1038:0-1583    [105] .....   107.785670: p9_client_write: p9_client_rpc done
>  kworker/u1038:0-1583    [105] .....   107.785671: p9_client_write: p9_client_rpc error (-5)
>  kworker/u1038:0-1583    [105] .....   107.785672: v9fs_issue_write: Issue write done 0 err(-5)
>  kworker/u1038:0-1583    [105] .....   107.785672: netfs_write_subrequest_terminated: Collector woken up
>  kworker/u1038:0-1583    [105] .....   107.785672: netfs_wake_write_collector: Wake collector
>  kworker/u1038:0-1583    [105] .....   107.785672: netfs_wake_write_collector: Queuing collector work
>  kworker/u1038:0-1583    [105] .....   107.785677: v9fs_issue_write: Issue write subrequest terminated 0
>  kworker/u1038:0-1583    [105] .....   107.785684: netfs_write_collection_worker: Write collect clearing and waking up!
>            repro-4161    [239] .....   107.785883: p9_client_prepare_req: p9_client_prepare_req eio 1
>            ... (continues)
>
> o Mainline:
>
>            repro-4161    [087] .....   123.474660: p9_client_write: p9_client_rpc done
>            repro-4161    [087] .....   123.474661: p9_client_write: p9_pdup
>            repro-4161    [087] .....   123.474661: p9_client_write: iter revert
>            repro-4161    [087] .....   123.474661: p9_client_write: p9_client_rpc
>            repro-4161    [087] .....   123.474672: p9_fd_request: p9_fd_request
>            repro-4161    [087] .....   123.474673: p9_fd_request: p9_fd_request EPOLL
>            repro-4161    [087] .....   123.474673: p9_fd_poll: p9_fd_poll rd poll
>            repro-4161    [087] .....   123.474674: p9_fd_poll: p9_fd_request wr poll
>            repro-4161    [087] .....   128.233025: p9_client_write: p9_client_rpc done
>            repro-4161    [087] .....   128.233033: p9_client_write: p9_client_rpc error (-512)     <------------- -ERESTARTSYS
>            repro-4161    [087] .....   128.233034: v9fs_issue_write: Issue write done 2 err(-512)
>            repro-4161    [087] .....   128.233035: netfs_write_subrequest_terminated: Collector woken
>            repro-4161    [087] .....   128.233036: netfs_wake_write_collector: Wake collector
>            repro-4161    [087] .....   128.233036: netfs_wake_write_collector: Queuing collector work
>            repro-4161    [087] .....   128.233040: v9fs_issue_write: Issue write subrequest terminated 2
>            repro-4161    [087] .....   128.233040: netfs_unbuffered_write: netfs_unbuffered_write
>            repro-4161    [087] .....   128.233040: netfs_end_issue_write: netfs_end_issue_write
>            repro-4161    [087] .....   128.233041: netfs_end_issue_write: Write collector need poke 0
>            repro-4161    [087] .....   128.233041: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>  kworker/u1035:0-1580    [080] .....   128.233077: netfs_retry_writes: netfs_reissue_write 1
>  kworker/u1035:0-1580    [080] .....   128.233078: v9fs_issue_write: v9fs_issue_write
>  kworker/u1035:0-1580    [080] .....   128.233079: p9_client_write: p9_client_rpc
>  kworker/u1035:0-1580    [080] .....   128.233099: p9_fd_request: p9_fd_request
>  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_request: p9_fd_request EPOLL
>  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_poll: p9_fd_poll rd poll
>  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_poll: p9_fd_request wr poll
>            (Hangs)
>
> --
> Thanks and Regards,
> Prateek
>


  reply	other threads:[~2025-03-25 12:16 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-22 15:54 [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter syzbot
2025-03-23 18:48 ` Oleg Nesterov
2025-03-23 19:17   ` syzbot
2025-03-23 19:47     ` Oleg Nesterov
2025-03-23 19:50       ` K Prateek Nayak
2025-03-23 19:52       ` Mateusz Guzik
2025-03-23 21:02         ` Oleg Nesterov
2025-03-24 10:47           ` K Prateek Nayak
2025-03-24 11:15             ` syzbot
2025-03-24 13:17             ` Mateusz Guzik
2025-03-24 13:19               ` K Prateek Nayak
2025-03-24 14:52                 ` K Prateek Nayak
2025-03-24 16:03                   ` Mateusz Guzik
2025-03-24 16:25                     ` K Prateek Nayak
2025-03-24 16:36                       ` K Prateek Nayak
2025-03-25  2:52                   ` K Prateek Nayak
2025-03-25 12:15                     ` Oleg Nesterov [this message]
2025-03-25 12:36                       ` Dominique Martinet
2025-03-25 13:04                       ` Oleg Nesterov
2025-03-25 14:49                         ` K Prateek Nayak
2025-03-25 14:58                           ` Dominique Martinet
2025-03-26 12:19                             ` Oleg Nesterov
2025-03-26 12:44                               ` Oleg Nesterov
2025-03-26 13:05                                 ` Oleg Nesterov
2025-03-27 17:46                             ` K Prateek Nayak
2025-03-27 21:19                               ` syzbot
2025-03-27 22:18                                 ` asmadeus
2025-03-28  4:01                                   ` K Prateek Nayak
2025-03-28  4:43                                     ` syzbot
2025-03-28 13:06                                     ` Oleg Nesterov
2025-03-28 13:07                                       ` syzbot
2025-03-28 13:25                                         ` Oleg Nesterov
2025-03-28 13:49                                           ` syzbot
2025-03-28 14:49                                             ` Oleg Nesterov
2025-03-28 15:22                                               ` syzbot
2025-03-28 17:00                                                 ` Oleg Nesterov
2025-03-28 17:56                                                   ` K Prateek Nayak
2025-03-28 18:20                                                     ` Oleg Nesterov
2025-03-29  0:00                                                   ` asmadeus
2025-03-29 14:21                                                     ` Oleg Nesterov
2025-03-29 23:27                                                       ` asmadeus
2025-03-30 10:21                                                         ` Oleg Nesterov
2025-03-23 20:03     ` Mateusz Guzik
2025-03-23 20:43       ` syzbot
2025-03-28 18:14   ` David Howells
2025-03-28 18:44     ` syzbot
2025-03-28 19:01     ` Oleg Nesterov
2025-03-24 12:49 ` Edward Adam Davis
2025-03-24 13:37   ` syzbot
2025-08-03 12:09 ` syzbot
     [not found] <20250323002028.3563-1-hdanton@sina.com>
2025-03-23  1:15 ` syzbot
     [not found] <20250323021219.3580-1-hdanton@sina.com>
2025-03-23  2:48 ` syzbot
     [not found] <20250324111613.3620-1-hdanton@sina.com>
2025-03-24 11:34 ` syzbot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250325121526.GA7904@redhat.com \
    --to=oleg@redhat.com \
    --cc=asmadeus@codewreck.org \
    --cc=brauner@kernel.org \
    --cc=dhowells@redhat.com \
    --cc=ericvh@kernel.org \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux_oss@crudebyte.com \
    --cc=lucho@ionkov.net \
    --cc=mjguzik@gmail.com \
    --cc=netfs@lists.linux.dev \
    --cc=swapnil.sapkal@amd.com \
    --cc=syzbot+62262fdc0e01d99573fc@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=v9fs@lists.linux.dev \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.