From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5AF2AC433EF for ; Thu, 5 May 2022 17:04:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:In-Reply-To:References:Cc:To:From:Subject:MIME-Version:Date: Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=rXWsIQnWwnTUQnKv0nFYg7ghVoDY9SAWwWKxyJ6lurI=; b=YXEBrKPVHndjw3JjiVJIOSgnR+ DtQbXG5eUZVjrIouG6u8mMex2tDpz1GWv0qemdKfwzjr6igNO/kbFyofO9JV90cNol0iaP9m+XVq0 51Cjdg6qyhtTn4/00hgBUNpu49uh8DQ6nwBC8Zz0mMe4jAUciFvMMLDPcoBYP3FKRWAwQ3p+rpAOp rmnrKk4t9Lym8H1qUbDpVyx4vY8EPiLSpaNiTfGUJo/511nXAX/Lx5azmx90hdGGu3h9e7CTOGDQj Xf0ib2LTtEHiKC6WRh6rM9OvMMJ608oXFmpMzjKKwOYrt0VY/8uEU0DzNyg4VmQgTK6eKHGIzrf5d KFTccdBg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1nmeuG-00Gz8k-6E; Thu, 05 May 2022 17:04:48 +0000 Received: from mail-pj1-x102a.google.com ([2607:f8b0:4864:20::102a]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1nmeuC-00Gz7R-GU for linux-nvme@lists.infradead.org; Thu, 05 May 2022 17:04:46 +0000 Received: by mail-pj1-x102a.google.com with SMTP id iq2-20020a17090afb4200b001d93cf33ae9so8531827pjb.5 for ; Thu, 05 May 2022 10:04:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20210112.gappssmtp.com; s=20210112; h=message-id:date:mime-version:user-agent:subject:content-language :from:to:cc:references:in-reply-to:content-transfer-encoding; bh=rXWsIQnWwnTUQnKv0nFYg7ghVoDY9SAWwWKxyJ6lurI=; b=BbhA3q6Km4lr3y/fVIQpsFVgyISR7PAXxS+wgKsuuU66UKUI215RqxrC5xKLeuS7Eu ZkjbcretsKtvwdEJYWbLcmbhcqLHveOo9V9ZeyAAjtpVF9cpsICZACBFuUZA76om/Uh3 vL/HXnakRic5QeiNWmII0XNqc/tctpLUUj1yE3DQ1M9AHykreYmdKB5jMOx56NHExyIe 8l/2W49rKOvk019VpU9XqvcJ34jxah1LZrli/xnG9KG/g9rI7dwPp1n44utg+FnNgbb4 A5LO3DgxohwsjnJHS3xy3iX/HiaT2D9M3UR6Amgp2DnZP7eylUguju7fR+eKwCSPJ+Lk ittg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:from:to:cc:references:in-reply-to :content-transfer-encoding; bh=rXWsIQnWwnTUQnKv0nFYg7ghVoDY9SAWwWKxyJ6lurI=; b=6hQDLUsNXRg1Iat2+j+7Uo6ycadzaqNSPHG6079wBNOzep3xIYYHUn4ynRiDcul1NI qaR0BoUb/JlJVsXOhLl3gpMLJwU/gqABdJbuKjRHQm41TJZMXZBbQhQ723Q5XC5sLWkQ wP+rZKQFsI/c+S0q+4g4ViIyXEPYT/iIYNEx7/OXlNpzcow39mZGDH9N3uo/tKUwGCpa rBc1GJ3LQkWoRgDI5kdaHXE8sRhSYJtf5TV2xSriOC4M6xQiIXMDT9gWHPiQOV5ADnF9 a6rDoTy/mQlgi8wWrPl7zWOJiHZv/PskFA5+lSfwsBL4g6aUgacWNIajbkOnu8WcirYV u3Kg== X-Gm-Message-State: AOAM533+Sd5hMO4GbA5RtoQeUR1kKyzgQHFvu0ZDLDR3wlJfd1DE0PqZ wjcfivhH8CCh3/t3VlZqNcl+nA== X-Google-Smtp-Source: ABdhPJz2WqbyxLLuVFh38hIhiuZzlOCL45ivQ9D0X8HBbwljI6d3XECw5qM2Hsq0XPqUnZimxgrNwA== X-Received: by 2002:a17:903:22ce:b0:15e:bd57:5bec with SMTP id y14-20020a17090322ce00b0015ebd575becmr13010052plg.114.1651770283143; Thu, 05 May 2022 10:04:43 -0700 (PDT) Received: from [192.168.4.166] (cpe-72-132-29-68.dc.res.rr.com. [72.132.29.68]) by smtp.gmail.com with ESMTPSA id j8-20020a170903024800b0015e8d4eb210sm1814596plh.90.2022.05.05.10.04.41 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 05 May 2022 10:04:42 -0700 (PDT) Message-ID: Date: Thu, 5 May 2022 11:04:41 -0600 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 Subject: Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd Content-Language: en-US From: Jens Axboe To: Kanchan Joshi , hch@lst.de Cc: io-uring@vger.kernel.org, linux-nvme@lists.infradead.org, asml.silence@gmail.com, ming.lei@redhat.com, mcgrof@kernel.org, shr@fb.com, joshiiitr@gmail.com, anuj20.g@samsung.com, gost.dev@samsung.com References: <20220505060616.803816-1-joshi.k@samsung.com> <20220505060616.803816-2-joshi.k@samsung.com> <1af73495-d4a6-d6fd-0a03-367016385c92@kernel.dk> In-Reply-To: <1af73495-d4a6-d6fd-0a03-367016385c92@kernel.dk> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220505_100444_599910_97DCE3B6 X-CRM114-Status: GOOD ( 39.79 ) X-BeenThere: linux-nvme@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "Linux-nvme" Errors-To: linux-nvme-bounces+linux-nvme=archiver.kernel.org@lists.infradead.org On 5/5/22 10:17 AM, Jens Axboe wrote: > On 5/5/22 12:06 AM, Kanchan Joshi wrote: >> +static int io_uring_cmd_prep(struct io_kiocb *req, >> + const struct io_uring_sqe *sqe) >> +{ >> + struct io_uring_cmd *ioucmd = &req->uring_cmd; >> + struct io_ring_ctx *ctx = req->ctx; >> + >> + if (ctx->flags & IORING_SETUP_IOPOLL) >> + return -EOPNOTSUPP; >> + /* do not support uring-cmd without big SQE/CQE */ >> + if (!(ctx->flags & IORING_SETUP_SQE128)) >> + return -EOPNOTSUPP; >> + if (!(ctx->flags & IORING_SETUP_CQE32)) >> + return -EOPNOTSUPP; >> + if (sqe->ioprio || sqe->rw_flags) >> + return -EINVAL; >> + ioucmd->cmd = sqe->cmd; >> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op); >> + return 0; >> +} > > While looking at the other suggested changes, I noticed a more > fundamental issue with the passthrough support. For any other command, > SQE contents are stable once prep has been done. The above does do that > for the basic items, but this case is special as the lower level command > itself resides in the SQE. > > For cases where the command needs deferral, it's problematic. There are > two main cases where this can happen: > > - The issue attempt yields -EAGAIN (we ran out of requests, etc). If you > look at other commands, if they have data that doesn't fit in the > io_kiocb itself, then they need to allocate room for that data and have > it be persistent > > - Deferral is specified by the application, using eg IOSQE_IO_LINK or > IOSQE_ASYNC. > > We're totally missing support for both of these cases. Consider the case > where the ring is setup with an SQ size of 1. You prep a passthrough > command (request A) and issue it with io_uring_submit(). Due to one of > the two above mentioned conditions, the internal request is deferred. > Either it was sent to ->uring_cmd() but we got -EAGAIN, or it was > deferred even before that happened. The application doesn't know this > happened, it gets another SQE to submit a new request (request B). Fills > it in, calls io_uring_submit(). Since we only have one SQE available in > that ring, when request A gets re-issued, it's now happily reading SQE > contents from command B. Oops. > > This is why prep handlers are the only ones that get an sqe passed to > them. They are supposed to ensure that we no longer read from the SQE > past that. Applications can always rely on that fact that once > io_uring_submit() has been done, which consumes the SQE in the SQ ring, > that no further reads are done from that SQE. > > IOW, we need io_req_prep_async() handling for URING_CMD, which needs to > allocate the full 80 bytes and copy them over. Subsequent issue attempts > will then use that memory rather than the SQE parts. Just need to find a > sane way to do that so we don't need to make the usual prep + direct > issue path any slower. This one should take care of that. Some parts should be folded into patch 1, other bits into the nvme adoption. diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 8c3b15d3e86d..7daa95d50db1 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -411,8 +411,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec) { - struct nvme_uring_cmd *cmd = - (struct nvme_uring_cmd *)ioucmd->cmd; + struct nvme_uring_cmd *cmd = ioucmd->cmd; struct request_queue *q = ns ? ns->queue : ctrl->admin_q; struct nvme_command c; struct request *req; @@ -548,8 +547,8 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) return __nvme_ioctl(ns, cmd, (void __user *)arg); } -static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd, - unsigned int issue_flags) +static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd, + unsigned int issue_flags) { int ret; @@ -567,15 +566,15 @@ static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd, ret = -ENOTTY; } - if (ret != -EIOCBQUEUED) - io_uring_cmd_done(ioucmd, ret, 0); + return ret; } -void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) +int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) { struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev, struct nvme_ns, cdev); - nvme_ns_uring_cmd(ns, ioucmd, issue_flags); + + return nvme_ns_uring_cmd(ns, ioucmd, issue_flags); } #ifdef CONFIG_NVME_MULTIPATH diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 761ad6c629c4..26adc7660d28 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -783,7 +783,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg); long nvme_dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg); -void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, +int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); void nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); diff --git a/fs/io_uring.c b/fs/io_uring.c index cf7087dc12b3..cee7fc95f2c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1251,6 +1251,8 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_URING_CMD] = { .needs_file = 1, .plug = 1, + .async_size = 2 * sizeof(struct io_uring_sqe) - + offsetof(struct io_uring_sqe, cmd), }, }; @@ -4936,6 +4938,20 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2) } EXPORT_SYMBOL_GPL(io_uring_cmd_done); +static int io_uring_cmd_prep_async(struct io_kiocb *req) +{ + struct io_uring_cmd *ioucmd = &req->uring_cmd; + struct io_ring_ctx *ctx = req->ctx; + size_t cmd_size = sizeof(struct io_uring_sqe) - + offsetof(struct io_uring_sqe, cmd); + + if (ctx->flags & IORING_SETUP_SQE128) + cmd_size += sizeof(struct io_uring_sqe); + + memcpy(req->async_data, ioucmd->cmd, cmd_size); + return 0; +} + static int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -4951,7 +4967,7 @@ static int io_uring_cmd_prep(struct io_kiocb *req, return -EOPNOTSUPP; if (sqe->ioprio || sqe->rw_flags) return -EINVAL; - ioucmd->cmd = sqe->cmd; + ioucmd->cmd = (void *) sqe->cmd; ioucmd->cmd_op = READ_ONCE(sqe->cmd_op); return 0; } @@ -4960,10 +4976,18 @@ static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) { struct file *file = req->file; struct io_uring_cmd *ioucmd = &req->uring_cmd; + int ret; if (!req->file->f_op->uring_cmd) return -EOPNOTSUPP; - file->f_op->uring_cmd(ioucmd, issue_flags); + + if (req_has_async_data(req)) + ioucmd->cmd = req->async_data; + + ret = file->f_op->uring_cmd(ioucmd, issue_flags); + if (ret != -EIOCBQUEUED) + io_uring_cmd_done(ioucmd, ret, 0); + return 0; } @@ -7849,6 +7873,8 @@ static int io_req_prep_async(struct io_kiocb *req) return io_recvmsg_prep_async(req); case IORING_OP_CONNECT: return io_connect_prep_async(req); + case IORING_OP_URING_CMD: + return io_uring_cmd_prep_async(req); } printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n", req->opcode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 30c98d9993df..87b5af1d9fbe 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1996,7 +1996,7 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); - void (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); + int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); } __randomize_layout; struct inode_operations { diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index e4cbc80949ce..74371503478a 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -14,10 +14,11 @@ enum io_uring_cmd_flags { struct io_uring_cmd { struct file *file; - const u8 *cmd; + void *cmd; /* callback to defer completions to task context */ void (*task_work_cb)(struct io_uring_cmd *cmd); u32 cmd_op; + u32 pad; u8 pdu[32]; /* available inline for free use */ }; -- Jens Axboe