From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ming Lei <tom.leiming@gmail.com>
Cc: "Denis V. Lunev" <den@virtuozzo.com>,
	io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Kirill Tkhai <kirill.tkhai@openvz.org>,
	Manuel Bentele <development@manuel-bentele.de>,
	qemu-devel@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
	rjones@redhat.com, Xie Yongji <xieyongji@bytedance.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Mike Christie <mchristi@redhat.com>
Subject: Re: ublk-qcow2: ublk-qcow2 is available
Date: Thu, 6 Oct 2022 14:29:55 -0400
Message-ID: <Yz8eo0IWMAJOwKWn@fedora>
In-Reply-To: <Yz7vvNKSNRyBVObo@T590>

On Thu, Oct 06, 2022 at 11:09:48PM +0800, Ming Lei wrote:
> On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> > On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > ublk-qcow2 is available now.
> > > > > > Cool, thanks for sharing!
> > > > > yep
> > > > > 
> > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > handler, just like what ublk-loop does.
> > > > > > > 
> > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > 
> > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > 
> > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > >    might be useful for covering requirements in this field
> > > > > There is one important thing to keep in mind about all partly-userspace
> > > > > implementations though:
> > > > > * any single allocation happened in the context of the
> > > > >    userspace daemon through try_to_free_pages() in
> > > > >    kernel has a possibility to trigger the operation,
> > > > >    which will require userspace daemon action, which
> > > > >    is inside the kernel now.
> > > > > * the probability of this is higher in the overcommitted
> > > > >    environment
> > > > > 
> > > > > This was the main motivation of us in favor for the in-kernel
> > > > > implementation.
> > > > 
> > > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > > reclaim hangs in the past.
> > > > 
> > > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > > how to avoid hangs in memory reclaim?
> > > 
> > > If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> > > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> > > to support controlling memory reclaim").
> > 
> > Denis: I'm trying to understand the problem you described. Is this
> > correct:
> > 
> > Due to memory pressure, the kernel reclaims pages and submits a write to
> > a ublk block device. The userspace process attempts to allocate memory
> > in order to service the write request, but it gets stuck because there
> > is no memory available. As a result reclaim gets stuck, the system is
> > unable to free more memory and therefore it hangs?
> 
> The process should be killed in this situation if PR_SET_IO_FLUSHER
> is applied since the page allocation is done in VM fault handler.

Thanks for mentioning PR_SET_IO_FLUSHER. There is more info in commit
8d19f1c8e1937baf74e1962aae9f90fa3aeab463 ("prctl: PR_{G,S}ET_IO_FLUSHER
to support controlling memory reclaim").

It requires CAP_SYS_RESOURCE :/. This makes me wonder whether
unprivileged ublk will ever be possible.
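
For reference, a minimal sketch of how a userspace block daemon might opt
in, assuming Linux 5.6 or later. mark_io_flusher() is a hypothetical helper
name, and the fallback #defines only mirror the values in <linux/prctl.h>
for builds against older headers:

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for older userspace headers; values as in <linux/prctl.h>. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark the calling thread as an I/O flusher.
 * Returns 0 on success or a negative errno value on failure, for
 * example -EPERM when the caller lacks CAP_SYS_RESOURCE.
 */
int mark_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}
```

A daemon would call this once per I/O handling thread before it starts
servicing requests, and decide whether to continue or bail out on -EPERM.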

I think this addresses Denis' concern about hangs, but it doesn't solve
the underlying problem because I/O will fail instead of hanging. The real
solution is probably what you mentioned...

> Firstly in theory the userspace part should provide forward progress
> guarantee in code path for handling IO, such as reserving/mlock pages
> for such situation. However, this issue isn't unique for nbd or ublk,
> all userspace block device should have such potential risk, and vduse
> is no exception, IMO.

...here. Userspace needs to minimize memory allocations in the I/O code
path and reserve sufficient resources to make forward progress.
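
A minimal sketch of that idea, assuming a fixed upper bound on in-flight
requests. io_pool, POOL_BUFS, BUF_SIZE and lock_memory() are hypothetical
names for illustration, not libublksrv APIs; a real driver would probably
also want aligned buffers for O_DIRECT:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_BUFS 16          /* hypothetical: max in-flight requests */
#define BUF_SIZE  (64 * 1024) /* hypothetical: max bytes per request */

struct io_pool {
    void *bufs[POOL_BUFS];
    size_t nfree;
};

/* Allocate every buffer at startup so the I/O path never calls malloc(). */
int io_pool_init(struct io_pool *p)
{
    p->nfree = 0;
    for (size_t i = 0; i < POOL_BUFS; i++) {
        void *b = malloc(BUF_SIZE);
        if (!b) {
            while (p->nfree > 0)
                free(p->bufs[--p->nfree]);
            return -ENOMEM;
        }
        /* Touch the pages so they are faulted in now, not in the I/O path. */
        memset(b, 0, BUF_SIZE);
        p->bufs[p->nfree++] = b;
    }
    return 0;
}

/* Take a reserved buffer; NULL means the caller must wait for a completion. */
void *io_pool_get(struct io_pool *p)
{
    return p->nfree ? p->bufs[--p->nfree] : NULL;
}

/* Return a buffer to the pool once its request has completed. */
void io_pool_put(struct io_pool *p, void *b)
{
    if (p->nfree < POOL_BUFS)
        p->bufs[p->nfree++] = b;
}

/*
 * Pin current and future mappings in RAM. Returns 0 or a negative errno
 * value; this can fail under RLIMIT_MEMLOCK for unprivileged processes.
 */
int lock_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -errno;
    return 0;
}
```

The intended order is io_pool_init() followed by lock_memory() at startup,
so that handling a request only needs io_pool_get() and io_pool_put().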

Stefan
