From: Andreas Hindborg <nmi@metaspace.dk>
To: Benno Lossin <benno.lossin@proton.me>
Cc: "Jens Axboe" <axboe@kernel.dk>, "Christoph Hellwig" <hch@lst.de>,
"Keith Busch" <kbusch@kernel.org>,
"Damien Le Moal" <Damien.LeMoal@wdc.com>,
"Bart Van Assche" <bvanassche@acm.org>,
"Hannes Reinecke" <hare@suse.de>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"Andreas Hindborg" <a.hindborg@samsung.com>,
"Wedson Almeida Filho" <wedsonaf@gmail.com>,
"Niklas Cassel" <Niklas.Cassel@wdc.com>,
"Greg KH" <gregkh@linuxfoundation.org>,
"Matthew Wilcox" <willy@infradead.org>,
"Miguel Ojeda" <ojeda@kernel.org>,
"Alex Gaynor" <alex.gaynor@gmail.com>,
"Boqun Feng" <boqun.feng@gmail.com>,
"Gary Guo" <gary@garyguo.net>,
"Björn Roy Baron" <bjorn3_gh@protonmail.com>,
"Alice Ryhl" <aliceryhl@google.com>,
"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
"Luis Chamberlain" <mcgrof@kernel.org>,
"Yexuan Yang" <1182282462@bupt.edu.cn>,
"Sergio González Collado" <sergio.collado@gmail.com>,
"Joel Granados" <j.granados@samsung.com>,
"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>,
"Daniel Gomez" <da.gomez@samsung.com>,
"open list" <linux-kernel@vger.kernel.org>,
"rust-for-linux@vger.kernel.org" <rust-for-linux@vger.kernel.org>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>,
"gost.dev@samsung.com" <gost.dev@samsung.com>,
"Ming Lei" <ming.lei@redhat.com>
Subject: Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module
Date: Fri, 05 Apr 2024 10:43:39 +0200 [thread overview]
Message-ID: <875xwwmc78.fsf@metaspace.dk> (raw)
In-Reply-To: <7d64cb7b-0957-4acc-914a-3c950660ad07@proton.me> (Benno Lossin's message of "Thu, 04 Apr 2024 13:20:17 +0000")
Benno Lossin <benno.lossin@proton.me> writes:
> On 04.04.24 11:30, Andreas Hindborg wrote:
>> Benno Lossin <benno.lossin@proton.me> writes:
>>
>>> On 04.04.24 07:44, Andreas Hindborg wrote:
>>>> Benno Lossin <benno.lossin@proton.me> writes:
>>>>
>>>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>>>> Benno Lossin <benno.lossin@proton.me> writes:
>>>>>>
>>>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>>>> Benno Lossin <benno.lossin@proton.me> writes:
>>>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>>>> +//!
>>>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>>>> +//! processing by calling one of the [`Request::end`], methods. Failure to do so
>>>>>>>>>> +//! can lead to IO failures.
>>>>>>>>>
>>>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>>>
>>>>>>>>> request.handle(|req| {
>>>>>>>>> // do the stuff that would be done between start and end
>>>>>>>>> });
>>>>>>>>>
>>>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>>>> be possible, since you really need the freedom.
>>>>>>>>>
>>>>>>>>> Do you think that a guard approach might work better? ie `start` returns
>>>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>>>> operate on the request.
>>>>>>>>
>>>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>>>> driver.
>>>>>>>>
>>>>>>>> At any rate, since the request is reference counted now, we can
>>>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>>>> was not marked successfully completed. I would need to measure the
>>>>>>>> performance implications of such a feature.
>>>>>>>
>>>>>>> Are there cases where you still need access to the request after you
>>>>>>> have called `end`?
>>>>>>
>>>>>> In general no, there is no need to handle the request after calling end.
>>>>>> C drivers are not allowed to, because this transfers ownership of the
>>>>>> request back to the block layer. This patch series defer the transfer of
>>>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>>>> there should be no danger associated with touching the `Request` after
>>>>>> end.
>>>>>>
>>>>>>> If no, I think it would be better for the request to
>>>>>>> be consumed by the `end` function.
>>>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>>>> though... Do you think that it might be necessary to clone requests?
>>>>>>
>>>>>> Looking into the details now I see that calling `Request::end` more than
>>>>>> once will trigger UAF, because C code decrements the refcount on the
>>>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>>>> probably also messes with other things in C land. Good catch.
>>>>>>
>>>>>> I did implement `Request::end` to consume the request at one point
>>>>>> before I fell back on reference counting. It works fine for simple
>>>>>> drivers. However, most drivers will need to use the block layer tag set
>>>>>> service, that allows conversion of an integer id to a request pointer.
>>>>>> The abstraction for this feature is not part of this patch set. But the
>>>>>> block layer manages a mapping of integer to request mapping, and drivers
>>>>>> typically use this to identify the request that corresponds to
>>>>>> completion messages that arrive from hardware. When drivers are able to
>>>>>> turn integers into requests like this, consuming the request in the call
>>>>>> to `end` makes little sense (because we can just construct more).
>>>>>
>>>>> How do you ensure that this is fine?:
>>>>>
>>>>> let r1 = tagset.get(0);
>>>>> let r2 = tagset.get(0);
>>>>> r1.end_ok();
>>>>> r2.do_something_that_would_only_be_done_while_active();
>>>>>
>>>>> One thing that comes to my mind would be to only give out `&Request`
>>>>> from the tag set. And to destroy, you could have a separate operation
>>>>> that also removes the request from the tag set. (I am thinking of a tag
>>>>> set as a `HashMap<u64, Request>`.
>>>>
>>>> This would be similar to
>>>>
>>>> let r1 = tagset.get(0)?;
>>>> ler r2 = r1.clone();
>>>> r1.end_ok();
>>>> r2.do_something_requires_active();
>>>>
>>>> but it is not a problem because we do not implement any actions that are
>>>> illegal in that position (outside of `end` - that _is_ a problem).
>>>
>>> Makes sense, but I think it's a bit weird to still be able to access it
>>> after `end`ing.
>>
>> Yes, that is true.
>>
>>>
>>>>
>>>>
>>>>>>
>>>>>> What I do now is issue the an `Option<ARef<Request>>` with
>>>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>>>> is currently owned by the driver.
>>>>>>
>>>>>> I guess we can check the absolute value of the refcount, and only issue
>>>>>> a request handle if the count matches what we expect. Then we can be certain
>>>>>> that the handle is unique, and we can require transfer of ownership of
>>>>>> the handle to `Request::end` to make sure it can never be called more
>>>>>> than once.
>>>>>>
>>>>>> Another option is to error out in `Request::end` if the
>>>>>> refcount is not what we expect.
>>>>>
>>>>> I am a bit confused, why does the refcount matter in this case? Can't
>>>>> the user just have multiple `ARef`s?
>>>>
>>>> Because we want to assert that we are consuming the last handle to the
>>>> request. After we do that, the user cannot call `Request::end` again.
>>>> `TagSet::get` will not issue a request reference if the request is not
>>>> in flight. Although there might be a race condition to watch out for.
>>>>
>>>> When the block layer hands over ownership to Rust, the reference count
>>>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>>>> complete the request, we must have ownership of all reference counts
>>>> above 1. The block layer takes the last reference count when it takes
>>>> back ownership of the request.
>>>>
>>>>> I think it would be weird to use `ARef<Request>` if you expect the
>>>>> refcount to be 1.
>>>>
>>>> Yes, that would require a custom smart pointer with a `try_into_unique`
>>>> method that succeeds when the refcount is exactly 2. It would consume
>>>> the instance and decrement the refcount to 1. But as I said, there is a
>>>> potential race with `TagSet::get` when the refcount is 1 that needs to
>>>> be handled.
>>>>
>>>>> Maybe the API should be different?
>>>>
>>>> I needs to change a little, yes.
>>>>
>>>>> As I understand it, a request has the following life cycle (please
>>>>> correct me if I am wrong):
>>>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>>>> 2. The driver can now decide what to do with it (theoretically it can
>>>>> store it somewhere and later do something with it), but it should at
>>>>> some point call `Request::start`.
>>>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>>>
>>>>> To me this does not seem like something where we need a refcount (we
>>>>> still might need one for safety, but it does not need to be exposed to
>>>>> the user).
>>>>
>>>> It would not need to be exposed to the user, other than a) ending a request
>>>> can fail OR b) `TagSet::get` can fail.
>>>>
>>>> a) would require that ending a request must be done with a unique
>>>> reference. This could be done by the user by the user calling
>>>> `try_into_unique` or by making the `end` method fallible.
>>>>
>>>> b) would make the reference handle `!Clone` and add a failure mode to
>>>> `TagSet::get`, so it fails to construct a `Request` handle if there are
>>>> already one in existence.
>>>>
>>>> I gravitate towards a) because it allows the user to clone the Request
>>>> reference without adding an additional `Arc`.
>>>
>>> This confuses me a little, since I thought that `TagSet::get` returns
>>> `Option<ARef<Request>>`.
>>
>> It does, but in the current implementation the failure mode returning
>> `None` is triggered when the refcount is zero, meaning that the request
>> corresponding to that tag is not currently owned by the driver. For
>> solution b) we would change the type to be
>> `Option<CustomSmartPointerHandleThing<Request>>`.
>>
>>> (I tried to find the abstractions in your
>>> github, but I did not find them)
>>
>> It's here [1]. It was introduced in the `rnvme-v6.8` branch.
>
> Thanks for the pointer.
>
>>> I think that this could work: `queue_rq` takes a `OwnedRequest`, which
>>> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
>>> returns `Option<&Request>` and you can call `TagSet::remove` to get
>>> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
>>> With this pattern we also do not need to take an additional refcount.
>>
>> It would, but the `TagSet` is just a wrapper for the C block layer
>> `strugt blk_mq_tag_set`. This is a highly optimized data structure and
>> tag mapping is done before the driver sees the request. I would like to
>> reuse that logic.
>>
>> We could implement what you suggest anyhow, but I would not want to that
>> additional logic to the hot path.
>
> I overlooked an important detail: the `TagSet` is always stored in an
> `Arc` (IIRC since you want to be able to share it between different
> `Gendisk`s). This probably makes my suggestion impossible, since you
> can't mutably borrow the `TagSet` for removal of `Request`s.
> Depending on how `Request`s are associated to a `TagSet`, there might be
> a way around this: I saw the `qid` parameter to the `tag_to_rq`
> function, is that a unique identifier for a queue?
A tag set services a number of request queues. Each queue has a number
used to identify it within the tag set. It is unique within the tag set.
> Because in that case
> we might be able to have a unique `QueueTagSetRef` with
>
> fn remove(&mut self, tag: u32) -> OwnedRequest;
We would not need exclusive access. The tag set remove are synchronized
internally with some fancy atomic bit flipping [1].
>
> fn get(&self, tag: u32) -> Option<&Request>;
>
> The `TagSet` would still be shared, only the ability to "remove" (I
> don't know if you do that manually in C, if not, then this would just
> remove it in the abstraction, but keep it on the C side) is unique to
> the `QueueTagSetRef` struct.
I would not advice removing tag->request associations from the driver. I
understand your point and from the perspective of these patches it makes
sense. But it would be a major layer violation of the current block
layer architecture, as far as I can tell.
I am having trouble enough trying to justify deferred free of the
request structure as it is.
> But feel free to use your proposed option a), it is simpler and we can
> try to make this work when you send the `TagSet` abstractions.
> I just think that we should try a bit harder to make it even better.
I'll code it up a) and see how it looks (and what it costs in
performance) 👍
BR Andreas
[1] https://github.com/metaspace/linux/blob/bf25426ad5652319528c6f87b74dd026fbedb9e8/lib/sbitmap.c#L638
next prev parent reply other threads:[~2024-04-05 8:43 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-22 23:40 [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module Benno Lossin
2024-03-23 6:32 ` Andreas Hindborg
2024-04-02 23:09 ` Benno Lossin
2024-04-03 8:46 ` Andreas Hindborg
2024-04-03 19:37 ` Benno Lossin
2024-04-04 5:44 ` Andreas Hindborg
2024-04-04 8:46 ` Benno Lossin
2024-04-04 9:30 ` Andreas Hindborg
2024-04-04 13:20 ` Benno Lossin
2024-04-05 8:43 ` Andreas Hindborg [this message]
2024-04-05 9:40 ` Benno Lossin
-- strict thread matches above, loose matches on Subject: below --
2024-03-13 11:05 [RFC PATCH 0/5] Rust block device driver API and null block driver Andreas Hindborg
2024-03-13 11:05 ` [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module Andreas Hindborg
2024-03-13 23:55 ` Boqun Feng
2024-03-14 8:58 ` Andreas Hindborg
2024-03-14 18:55 ` Miguel Ojeda
2024-03-14 19:22 ` Andreas Hindborg
2024-03-14 19:41 ` Andreas Hindborg
2024-03-14 19:41 ` Miguel Ojeda
2024-03-14 20:56 ` Miguel Ojeda
2024-03-15 7:52 ` Andreas Hindborg
2024-03-15 12:17 ` Ming Lei
2024-03-15 12:46 ` Andreas Hindborg
2024-03-15 15:24 ` Ming Lei
2024-03-15 17:49 ` Andreas Hindborg
2024-03-16 14:48 ` Ming Lei
2024-03-16 17:27 ` Andreas Hindborg
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=875xwwmc78.fsf@metaspace.dk \
--to=nmi@metaspace.dk \
--cc=1182282462@bupt.edu.cn \
--cc=Damien.LeMoal@wdc.com \
--cc=Niklas.Cassel@wdc.com \
--cc=a.hindborg@samsung.com \
--cc=alex.gaynor@gmail.com \
--cc=aliceryhl@google.com \
--cc=axboe@kernel.dk \
--cc=benno.lossin@proton.me \
--cc=bjorn3_gh@protonmail.com \
--cc=boqun.feng@gmail.com \
--cc=bvanassche@acm.org \
--cc=chaitanyak@nvidia.com \
--cc=da.gomez@samsung.com \
--cc=gary@garyguo.net \
--cc=gost.dev@samsung.com \
--cc=gregkh@linuxfoundation.org \
--cc=hare@suse.de \
--cc=hch@lst.de \
--cc=j.granados@samsung.com \
--cc=kbusch@kernel.org \
--cc=kernel@pankajraghav.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=mcgrof@kernel.org \
--cc=ming.lei@redhat.com \
--cc=ojeda@kernel.org \
--cc=rust-for-linux@vger.kernel.org \
--cc=sergio.collado@gmail.com \
--cc=wedsonaf@gmail.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).