From: Christian Schoenebeck <linux_oss@crudebyte.com>
To: Dominique Martinet <asmadeus@codewreck.org>
Cc: Jens Axboe <axboe@kernel.dk>, Latchesar Ionkov <lucho@ionkov.net>,
Eric Van Hensbergen <ericvh@gmail.com>,
linux-kernel@vger.kernel.org, Pengfei Xu <pengfei.xu@intel.com>,
v9fs-developer@lists.sourceforge.net
Subject: Re: [V9fs-developer] [PATCH 0/5] Take 3 at async RPCs and no longer looping forever on signals
Date: Sat, 25 Mar 2023 13:45:44 +0100 [thread overview]
Message-ID: <1792249.25CL7UOqfo@silver> (raw)
In-Reply-To: <19996551.RFHlWWcHjK@silver>
On Thursday, March 23, 2023 4:58:51 PM CET Christian Schoenebeck wrote:
> On Sunday, March 19, 2023 12:53:46 PM CET Dominique Martinet wrote:
> > It's been a while but I didn't forget...
> >
> > Dominique Martinet wrote on Tue, Feb 14, 2023 at 08:16:38PM +0900:
> > > > Yes, apparently it tries to write dirty pages of the mapped file and keeps
> > > > hanging there [fs/9p/vfs_inode_dotl.c:586]:
> > >
> > > Yeah, it'd help to get the trace of the thread actually trying to do the
> > > IO, if it still exists.
> > > I had some hangs in the check that there are no flush in flight at some
> > > point, and I thought I fixed that, but I can't really see anywhere else
> > > that'd start hanging with this... it'll be clearer if I can reproduce.
> >
> > I couldn't reproduce this one, but manually inspecting
> > p9_client_wait_flush again I noticed the wait_event_interruptible was
> > waiting on req->flushed_req->wq but looking at req->status in the
> > condition; that was an error.
> > Also, we have a ref on req->flushed_req but not on req, so
> > req->flushed_req wasn't safe.
> >
> > I've changed the code to add a variable directly on req->flushed_req and
> > use it consistently, I'm not sure that's the problem you ran into but it
> > might help.
> > It's been a while but do you remember if that hang was consistently
> > happening on shutdown, or was it a one time thing?
> >
> > Either way, I'd appreciate if you could try my 9p-test branch again:
> > https://github.com/martinetd/linux/commits/9p-test
> >
> >
> > With that said, I expect that p9_client_wait_req will cause hangs on
> > broken servers.
> > If connection drops hopefully the reqs will just be marked as error and
> > free the thread, but I can see syzbot complaining about yet another
> > thread stuck.. Well it's interruptible at least, and bails out on
> > ERESTARTSYS.
>
> I gave your current test branch some spins today.
>
> And yes, the problem did not happen reliably each time, but consistently
> enough for me to reproduce it.
>
> With your latest test branch it appears to have mitigated the problem. Once in
> a while on shutdown I see it blocking for few minutes, but recovering:
>
> [** ] (2 of 3) A stop job is running for …ave Random Seed (2min 36s / 10min)
> [ 484.986388] INFO: task systemd-user-ru:8782 blocked for more than 120 seconds.
> [ 484.990150] Tainted: G E 6.3.0-rc2+ #63
> [ 484.992553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 484.993657] task:systemd-user-ru state:D stack:0 pid:8782 ppid:1 flags:0x00000004
> [ 484.994863] Call Trace:
> [ 484.995398] <TASK>
> [ 484.995866] __schedule (kernel/sched/core.c:5304 kernel/sched/core.c:6622)
> [ 484.996614] schedule (kernel/sched/core.c:6699 (discriminator 1))
> [ 484.997283] d_alloc_parallel (./include/linux/spinlock.h:350 fs/dcache.c:2626 fs/dcache.c:2707)
> [ 484.998150] ? __pfx_default_wake_function (kernel/sched/core.c:6944)
> [ 484.999213] __lookup_slow (fs/namei.c:1671)
> [ 485.000006] walk_component (./include/linux/fs.h:773 fs/namei.c:1704 fs/namei.c:1994)
> [ 485.000805] path_lookupat (fs/namei.c:2451 fs/namei.c:2475)
> [ 485.001594] filename_lookup (fs/namei.c:2504)
> [ 485.002452] ? __check_object_size (mm/usercopy.c:196 mm/usercopy.c:251 mm/usercopy.c:213)
> [ 485.003523] ? strncpy_from_user (./include/asm-generic/access_ok.h:40 ./arch/x86/include/asm/uaccess.h:551 lib/strncpy_from_user.c:138)
> [ 485.004537] user_path_at_empty (fs/namei.c:2879)
> [ 485.005508] do_faccessat (fs/open.c:484)
> [ 485.006410] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 485.007281] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> [ 485.008495] RIP: 0033:0x7f7b4527d8f7
> [ 485.009380] RSP: 002b:00007ffc9ee73c08 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
> [ 485.011118] RAX: ffffffffffffffda RBX: 00007f7b44ee2608 RCX: 00007f7b4527d8f7
> [ 485.012831] RDX: 000000000000001c RSI: 0000000000000000 RDI: 00007f7b44eda86b
> [ 485.014560] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f7b45360be0
> [ 485.016246] R10: 00005571ade756c0 R11: 0000000000000246 R12: 00007ffc9ee73cf8
> [ 485.017937] R13: 00007ffc9ee73d18 R14: 00007f7b44ee2610 R15: 0000000000000000
> [ 485.019669] </TASK>
>
> However that happens now before unmount rule is reached on shutdown?
> Previously it was hanging after the shutdown rule was reached. Does this make
> sense?
Correction: I still get the previous issue on unmount with your latest version
as well. :/
Call trace looks slightly different, but it seems to be the same issue:
[ OK ] Reached target Unmount All Filesystems.
[* ] A stop job is running for Load/Save Random Seed (3min 27s / 10min)
[ 243.031668] INFO: task systemd-journal:199 blocked for more than 120 seconds.
[ 243.036460] Tainted: G E 6.3.0-rc2+ #63
[ 243.037666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.039004] task:systemd-journal state:D stack:0 pid:199 ppid:1 flags:0x00004004
[ 243.040434] Call Trace:
[ 243.040878] <TASK>
[ 243.041265] __schedule (kernel/sched/core.c:5304 kernel/sched/core.c:6622)
[ 243.041887] ? __pagevec_release (./include/linux/pagevec.h:38 mm/swap.c:1063)
[ 243.042594] schedule (kernel/sched/core.c:6699 (discriminator 1))
[ 243.043150] io_schedule (kernel/sched/core.c:8859 kernel/sched/core.c:8885)
[ 243.043751] folio_wait_bit_common (mm/filemap.c:1293)
[ 243.044519] ? __pfx_wake_page_function (mm/filemap.c:1083)
[ 243.045321] folio_wait_writeback (./arch/x86/include/asm/bitops.h:207 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 ./include/linux/page-flags.h:518 mm/page-writeback.c:3125)
[ 243.046043] __filemap_fdatawait_range (./arch/x86/include/asm/bitops.h:75 (discriminator 3) ./include/asm-generic/bitops/instrumented-atomic.h:42 (discriminator 3) ./include/linux/page-flags.h:473 (discriminator 3) mm/filemap.c:526 (discriminator 3))
[ 243.046828] ? filemap_fdatawrite_wbc (mm/filemap.c:393)
[ 243.047622] ? __filemap_fdatawrite_range (mm/filemap.c:424)
[ 243.048531] filemap_write_and_wait_range (mm/filemap.c:689 mm/filemap.c:669)
[ 243.049358] v9fs_vfs_setattr_dotl (./include/linux/pagemap.h:60 fs/9p/vfs_inode_dotl.c:587) 9p
[ 243.050229] notify_change (fs/attr.c:486)
[ 243.050924] ? __vfs_getxattr (fs/xattr.c:426)
[ 243.051571] ? do_truncate (./include/linux/fs.h:763 fs/open.c:67)
[ 243.052224] do_truncate (./include/linux/fs.h:763 fs/open.c:67)
[ 243.052816] do_sys_ftruncate (fs/open.c:194)
[ 243.053548] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 243.054187] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 243.055057] RIP: 0033:0x7f6db123ab77
[ 243.055706] RSP: 002b:00007ffc4dda0ca8 EFLAGS: 00000206 ORIG_RAX: 000000000000004d
[ 243.056978] RAX: ffffffffffffffda RBX: 00007ffc4dda0cf0 RCX: 00007f6db123ab77
[ 243.058198] RDX: 000055715f9b2a00 RSI: 0000000001000000 RDI: 0000000000000018
[ 243.059478] RBP: 000055715f9b2bf0 R08: 0000000000000001 R09: 000055715f9b2c74
[ 243.060716] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc4dda0ce8
I just realized I don't have to do anything fancy to reproduce this. I just
have to boot and shutdown guest few times to trigger it.
prev parent reply other threads:[~2023-03-25 12:46 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-11 7:50 [PATCH 0/5] Take 3 at async RPCs and no longer looping forever on signals Dominique Martinet
2023-02-11 7:50 ` [PATCH 1/5] 9p/net: move code in preparation of async rpc Dominique Martinet
2023-02-13 17:46 ` Christian Schoenebeck
2023-02-11 7:50 ` [PATCH 2/5] 9p/net: share pooled receive buffers size exception in p9_tag_alloc Dominique Martinet
2023-02-13 18:06 ` Christian Schoenebeck
2023-02-11 7:50 ` [PATCH 3/5] 9p/net: implement asynchronous rpc skeleton Dominique Martinet
2023-02-11 7:50 ` [PATCH 4/5] 9p/net: add async clunk for retries Dominique Martinet
2023-02-11 7:50 ` [PATCH 5/5] 9p/net: make flush asynchronous Dominique Martinet
2023-02-11 8:05 ` [PATCH 0/5] Take 3 at async RPCs and no longer looping forever on signals Dominique Martinet
2023-02-13 18:26 ` Christian Schoenebeck
2023-02-13 18:45 ` Christian Schoenebeck
2023-02-13 22:37 ` Dominique Martinet
2023-02-14 9:34 ` Christian Schoenebeck
2023-02-14 11:16 ` Dominique Martinet
2023-03-19 11:53 ` [V9fs-developer] " Dominique Martinet
2023-03-23 15:58 ` Christian Schoenebeck
2023-03-25 12:45 ` Christian Schoenebeck [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1792249.25CL7UOqfo@silver \
--to=linux_oss@crudebyte.com \
--cc=asmadeus@codewreck.org \
--cc=axboe@kernel.dk \
--cc=ericvh@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lucho@ionkov.net \
--cc=pengfei.xu@intel.com \
--cc=v9fs-developer@lists.sourceforge.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.