From: Joanne Koong <joannelkoong@gmail.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-fsdevel@vger.kernel.org, josef@toxicpanda.com,
bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com,
laoar.shao@gmail.com, kernel-team@meta.com
Subject: Re: [PATCH v7 2/3] fuse: add optional kernel-enforced timeout for requests
Date: Wed, 9 Oct 2024 17:44:55 -0700 [thread overview]
Message-ID: <CAJnrk1b7bfAWWq_pFP=4XH3ddc_9GtAM2mE7EgWnx2Od+UUUjQ@mail.gmail.com> (raw)
In-Reply-To: <CAJfpegs9A7iBbZpPMF-WuR48Ho_=z_ZWfjrLQG2ob0k6NbcaUg@mail.gmail.com>
On Wed, Oct 9, 2024 at 1:14 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 7 Oct 2024 at 20:43, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > There are situations where fuse servers can become unresponsive or
> > stuck, for example if the server is deadlocked. Currently, there's no
> > good way to detect if a server is stuck and needs to be killed manually.
> >
> > This commit adds an option for enforcing a timeout (in minutes) for
> > requests where if the timeout elapses without the server responding to
> > the request, the connection will be automatically aborted.
> >
> > Please note that these timeouts are not 100% precise. The request may
> > take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the set max timeout
> > due to how it's internally implemented.
>
> One thing I worry about is adding more roadblocks on the way to making
> request queuing more scalable.
>
> Currently there's fc->num_waiting that's touched on all requests and
> bg_queue/bg_lock that are touched on background requests. We should
> be trying to fix these bottlenecks instead of adding more.
>
> Can't we use the existing lists to scan requests?
Hi Miklos,
The existing lists we have are:
* fiq pending list (per connection)
* fpq io list and fpq processing (for its associated hash) list (per fuse dev)
* bg queue (per connection)
If we wanted to reuse existing lists, we could do this in the timeout handler:
* grab the fiq lock, check the head entry of the pending list, release the lock
* grab the bg lock, check the head entry of the bg_queue, release the lock
* for each connection's fuse dev, grab the fpq lock, check the head
entry of the fpq->io list, iterate through the fpq->processing's lists
for 256 hashes and check against the head entry, release the lock
but some requests could slip through for the following cases:
-- resend:
* Request is on the resend's to_queue list when the timeout handler
check runs, in which case if that request is expired we won't get to
that until the next time the timeout handler kicks in
* A re-sent request may be moved to the head of the fiq->pending list,
but have a creation time newer than other entries on the fiq->pending
list , in which case we would not time out and abort the connection
when we should be doing so
-- transitioning between lists
* A request that is between lists (eg fpq->io and fpq->processing)
could be missed when the timeout handler check runs (but will probably
be caught the next time the timeout handler kicks in. We could also
modify the logic in dev_do_read to use list_move to avoid this case).
I think it's fine for these edge cases to slip through since most of
them will be caught eventually by the subsequent timeout handler runs,
but I was more worried about the increased lock contention while
iterating through all hashes of the fpq->processing list. But I think
for that we could just increase the timeout frequency to run less
frequently (eg once every 5 minutes instead of once every minute)
Do you think something like this sounds more reasonable?
Alternatively, I also still like the idea of something looser with
just periodically (according to whatever specified timeout) checking
if any requests are being serviced at all when fc->num-waiting is
non-zero. However, this would only protect against fully deadlocked
servers and miss malicious ones or half-deadlocked ones (eg
multithreaded fuse servers where only some threads are deadlocked).
Thanks,
Joanne
>
> It's more complex, obviously, but at least it doesn't introduce yet
> another per-fc list to worry about.
>
> Thanks,
> Miklos
next prev parent reply other threads:[~2024-10-10 0:45 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-07 18:42 [PATCH v7 0/3] fuse: add kernel-enforced request timeout option Joanne Koong
2024-10-07 18:42 ` [PATCH v7 1/3] fs_parser: add fsparam_u16 helper Joanne Koong
2024-10-07 18:42 ` [PATCH v7 2/3] fuse: add optional kernel-enforced timeout for requests Joanne Koong
2024-10-09 8:14 ` Miklos Szeredi
2024-10-10 0:44 ` Joanne Koong [this message]
2024-10-10 8:21 ` Miklos Szeredi
2024-10-10 23:08 ` Joanne Koong
2024-10-07 18:42 ` [PATCH v7 3/3] fuse: add default_request_timeout and max_request_timeout sysctls Joanne Koong
2024-10-08 20:58 ` [PATCH v7 0/3] fuse: add kernel-enforced request timeout option Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAJnrk1b7bfAWWq_pFP=4XH3ddc_9GtAM2mE7EgWnx2Od+UUUjQ@mail.gmail.com' \
--to=joannelkoong@gmail.com \
--cc=bernd.schubert@fastmail.fm \
--cc=jefflexu@linux.alibaba.com \
--cc=josef@toxicpanda.com \
--cc=kernel-team@meta.com \
--cc=laoar.shao@gmail.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).