From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ED441C52D7D for ; Fri, 16 Aug 2024 02:16:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7CDF56B0165; Thu, 15 Aug 2024 22:16:11 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 77BB68D002B; Thu, 15 Aug 2024 22:16:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5F5956B0168; Thu, 15 Aug 2024 22:16:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 40DD46B0165 for ; Thu, 15 Aug 2024 22:16:11 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id E9708A8286 for ; Fri, 16 Aug 2024 02:16:10 +0000 (UTC) X-FDA: 82456493700.02.E30F3CC Received: from mail-lj1-f177.google.com (mail-lj1-f177.google.com [209.85.208.177]) by imf06.hostedemail.com (Postfix) with ESMTP id EA663180007 for ; Fri, 16 Aug 2024 02:16:08 +0000 (UTC) Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=keiWFBsm; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf06.hostedemail.com: domain of asml.silence@gmail.com designates 209.85.208.177 as permitted sender) smtp.mailfrom=asml.silence@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723774555; a=rsa-sha256; cv=none; b=8nms2IpvelSMnxpLm5PW5n+LA/psyaDhRRD94Dg/KXFSsD9wqUJKFB8NppU53hC11TZ1oy +PGNBO5Gvkcgw0x5Y41SPK6kuRrgnVGzoeTNM1xDG8//zp67cdT3vK3G668Zqz3O1HOjy6 8pKZDr+KvTD5crq+lkBBbjP/P/bJnYg= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=keiWFBsm; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf06.hostedemail.com: domain of asml.silence@gmail.com designates 209.85.208.177 as permitted sender) smtp.mailfrom=asml.silence@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723774555; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=poPIHMhE7auc6vPsKFHl1HlM8rO8Tq50d7kZO6/TfLc=; b=K/N28gTO/vPVZvn2m+p1T/4wNXYEq83Gm+y1ScWHPS8vHqEhmk9VWosPBxfcuJ+A480UQR ZXxXooNfHMnA64PoaGWP3h9hAxLbwxz8etiN+AkIeTvUy7hO3r0igpiwWD/j65PMt3aehF bRNOQ/u+eY3NcMZu2os98c2ZeLb0/sA= Received: by mail-lj1-f177.google.com with SMTP id 38308e7fff4ca-2f3bfcc2727so5802161fa.0 for ; Thu, 15 Aug 2024 19:16:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1723774567; x=1724379367; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=poPIHMhE7auc6vPsKFHl1HlM8rO8Tq50d7kZO6/TfLc=; b=keiWFBsm7fQ6+rGirtSka+rVYqW7XJ2yBXzisiTMf5qfNVtq6BQesPmPGIdkp/SVSj 5cFLDuebG0DvHFdmm+MPkaqboMuv3VOmKoKpAra/wYLOdP+rgfsjwmRw/atZM4JitdQ6 YV2u49q/C+9YckLU3MDsqLn97ncD8dWLI4TUV2zuCdheWGmXXjWJ41Fx+5/yTqV8650S 5w81EK27wJrrwRPHDcyWjzQbNMYHqmz2zgAebSodrUzZSAb/wOSsE/zow8E3NuPI0G9N c4jggyp9AKjKnJ20r9/klfrQ4rVt/90gB14lLk3QZhLYe0iP/sbAV0YCam9AYBRb9oBp T0aA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723774567; x=1724379367; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=poPIHMhE7auc6vPsKFHl1HlM8rO8Tq50d7kZO6/TfLc=; b=rCj96Gr4ibZlz4yivqdQK/8OdBXHrBOp4XMZReyL6NZ1Z1AnKHXQwLREx36hYrEiQR rod+fs0dvXkI58vwJ6SlLtXLd3PGTUVPsHBaQ7qn0BpvlMPwdn/BbONxytEBSx+KeyPd AfuDRZfzTXr8PDUulJ1XGTUI4But1zBrqcMyAd95ZZeS8YvyZNJqlocba5lZvU7chSjF aGRpRtWOzoNf5TvNvO/ZS51IS0Baa31K3alpkgUkw52K7S6VEQrIzQO8abH9G6j8uAcy xL+o1ivnTWe3aJYVsjkAuNInKySF56Rsqs/8OtXNFO9oEKQs0+4QQ5U1AyQeZZoT8NnG CHJQ== X-Forwarded-Encrypted: i=1; AJvYcCWFglL6lSxKcEtou03dWNzrLs6cdtAb0E+9X059t699zz/R06Ekk1tbhqgg3/vTG0C7dAofIH7JGua3cBTeEn+Kjbg= X-Gm-Message-State: AOJu0YzYrhGey6YrGJLJUw1BtVN9371tCzCfbDxRIcAu9OTbMLnfRtgl I+D/i2owcpRJubziS0QR50Hmf832tziH2/kGDmSHJQFjLMskwIuT X-Google-Smtp-Source: AGHT+IFcsPo0SUnsY2RA54cVGAUBSpKjFIs/ZlXkiXOl9/L3tALE7i16sJb0MaFGuh05X7Klxu56vA== X-Received: by 2002:a05:6512:ad0:b0:52c:83c7:936a with SMTP id 2adb3069b0e04-5331c6dbb04mr1196300e87.42.1723774566707; Thu, 15 Aug 2024 19:16:06 -0700 (PDT) Received: from [192.168.42.136] ([85.255.234.87]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-a83838c652asm185210966b.12.2024.08.15.19.16.06 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 15 Aug 2024 19:16:06 -0700 (PDT) Message-ID: Date: Fri, 16 Aug 2024 03:16:41 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 5/5] block: implement io_uring discard cmd To: Ming Lei Cc: Jens Axboe , io-uring@vger.kernel.org, Conrad Meyer , linux-block@vger.kernel.org, linux-mm@kvack.org References: <6ecd7ab3386f63f1656dc766c1b5b038ff5353c2.1723601134.git.asml.silence@gmail.com> <4d016a30-d258-4d0e-b3bc-18bf0bd48e32@kernel.dk> Content-Language: en-US From: Pavel Begunkov In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspam-User: X-Rspamd-Queue-Id: EA663180007 X-Rspamd-Server: rspam01 X-Stat-Signature: myng6mcyxdc9whft6gqne7uzodtsnhx3 X-HE-Tag: 1723774568-903643 X-HE-Meta: U2FsdGVkX1/h35EAd+RZZ+WPEH6zizyfStOPOnhLVYYGUU3zp8p+rWp4xSMkpSpGWxmlfu97rbHIx2p/J3FMt1dT8rL2hFgiPpdtAY+9640ZUgpHY7s/s+6f+L5cUjUaSq9QM9gHeunNdwtgEC1J/OpiQwpsJAZbcv36uMr5aF4ZYzBArailoXja/iHGt7Xm5Rgwjh1CuaY/SwFiH15D6W1/P2K5DtPZ5V1rWszEbj8pZq8zq5VAe6TygbMLS0t37ctl/9qfOMJQdTZI481mOqLgVvpmBHuwxK0HPsFsgTfon0vMVTJD/pF2Yjb8h2cP3Ca/GFGww2Umobtd1vi+evSIOz/nb9ZfZGeDNBAIju+i2Fo2ETacYPKZW8CDsYUKVuXaK2E8Z1VFLVXZC4sLV+D5nJhR4n4ZqqSu5/aLfbo5YPnBCaYHmxmikHC4W5cEV/OZTKle4uTzsl9T+I1WEasdO9q/j9DsVI6+Ab4Swtg8kKc1l8ylVsVqV6pd948cQn4Ai4DZ+iFcMXZQO1LlIjYEcEmJj6/vt1zDd6uPAmR3TMk/FQ/sR0GPVjBv6eQmx9OI0ZHy8PeDYa1C5jlf49S+5OhPF/OZjg2GKN4BVNkFmEpBO5eOjyTAUZStnrBEJmvJFmrnEdBuwqb1jrJjJPYHLLbHQb8FRMi5s0KtaUVJjs8GUGC+dbZVt0XQRRSj+g1VFcsE5q4iW8lphhszAgmrAwLQH4rgVDy0/56gkmwB1MJHb/fqd5ECPDvWORSmx97LR64dhgKfCe207sCC9UypdeIhTpgdbluS8lzXfyzlOy997nlr4licWhGbznd2smgLTFCTvUmLsSiF8g78VQi3AQ+f1R44bt5b9veZtgZ2GZHaXkLlxVuZVsdp3lP/6V4HEsC6hj1tgKYZ4L7OMTxATRIsMa5CgPoPJXPaTTQqqzhfbukoJXNcAbny+1SprJwetNNfzMXDn3FKSrz qsSoZIsh 3yz4D0BWXC4VKFGHAL7YEayoGL7/H2B7nhuKktDdq9rw0RwzE6pOdHxD7R+loowEi9PuDIDtVOWsk2iGg8giiTYUI54fO4QgO9ZqRxZdzTMyRbIMSCGS/7PqBRVrB3+/EugjxfY40oKAAXy+vzfVDCblJmf9FrakeeC376qT0BNr6Q6xWoqz0EZB5aQ7wWK4kagJlBwh16QlDreiqwka8uAUtNHcGjyP3ZPj9Yo5fjCkVZI6633wzYHgCgVPgh68GMdafd8JzzfyDJG1KJPSTK22y/mMqECoRAer7AOqmJPJnUSZeTfQdpB7gEH9Iik75MXXDGK8v1xyrKc3JujAlByCzpCusfqWCtk8ZdLHEeJCEgOTxtM1b2l+0EeyAEKKkcl4VIBI3NLxRkEDk4jaqWmSTP1q/XBO2n6fnzBTZsV96ClX8po1PgeYN3c1z4gxJrhQRUrLB+ZIxsglhqDgSJ6FAI111x8L2U+z86Z0PPwCy47FbnHI4JRqs1JLpDm502ahvmGV2nV0U3BTLIXlAXclrpwUWJRTVadkf X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 8/16/24 03:08, Ming Lei wrote: > On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote: >> On 8/16/24 02:45, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov wrote: >>>>>>>>> >>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>> it in a blocking context. >>>>>>>>> >>>>>>>>> Suggested-by: Conrad Meyer >>>>>>>>> Signed-off-by: Pavel Begunkov >>>>>>>>> --- >>>>>>>>> block/blk.h | 1 + >>>>>>>>> block/fops.c | 2 + >>>>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>>>> 4 files changed, 99 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>> --- a/block/blk.h >>>>>>>>> +++ b/block/blk.h >>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>> loff_t lstart, loff_t lend); >>>>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> >>>>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>> --- a/block/fops.c >>>>>>>>> +++ b/block/fops.c >>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>> #include >>>>>>>>> #include >>>>>>>>> #include >>>>>>>>> +#include >>>>>>>>> #include "blk.h" >>>>>>>>> >>>>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>> .splice_read = filemap_splice_read, >>>>>>>>> .splice_write = iter_file_splice_write, >>>>>>>>> .fallocate = blkdev_fallocate, >>>>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>>>> >>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>> >>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>> Probably can be patched up, which would involve changing the >>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>> really something completely new since we have it in ioctl. >>>>> >>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>> same with blkdev_fallocate(). >>>>> >>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>> be efficient anymore and io-wq may be inevitable, :-) >>>> >>>> If you want to grab the lock, you can still opportunistically grab it. >>>> For (by far) the common case, you'll get it, and you can still do it >>>> inline. >>> >>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>> interface cause there is at most one async discard cmd in-flight for each >>> device. >>> >>> Meantime the handling has to move to io-wq for avoiding to block current >>> context, the interface becomes same with IORING_OP_FALLOCATE? >> >> Right, and agree that we can't trylock because we'd need to keep it >> locked until IO completes, at least the sync versions does that. >> >> But I think *invalidate_pages() in the patch should be enough. That's >> what the write path does, so it shouldn't cause any problem to the >> kernel. As for user space, that'd be more relaxed than the ioctl, >> just as writes are, so nothing new to the user. I hope someone with >> better filemap understanding can confirm it (or not). > > I may not be familiar with filemap enough, but looks *invalidate_pages() > is only for removing pages from the page cache range, and the lock is added > for preventing new page cache read from being started, so stale data read > can be avoided when DISCARD is in-progress. Sounds like it, but the point is it's the same data race for the user as if it would've had a write in progress. -- Pavel Begunkov