From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1D1E6C433EF for ; Wed, 27 Apr 2022 01:42:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229754AbiD0Bpi (ORCPT ); Tue, 26 Apr 2022 21:45:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245154AbiD0Bpf (ORCPT ); Tue, 26 Apr 2022 21:45:35 -0400 Received: from esa6.hgst.iphmx.com (esa6.hgst.iphmx.com [216.71.154.45]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3B1072F032 for ; Tue, 26 Apr 2022 18:42:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1651023747; x=1682559747; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=eQ7XBnug6z28Gc0LOiGVTTqObhpPTVEc1gKirvGGTx4=; b=XSXlzkCV9v64c5B6QnwqZzxISuHFCEGjisMB4/nJAsIUZ7JF4sXmAjRq pb8NRJHrPJuKQjz4dWUuwYPq0kEeBWgfaHW6ONeNEtNjSavw+UgPYmzCZ xyYgncJJiK1h8jv/hYCPTpC7gQwJmm+wfq0kw3/OLjDj7YhDKiwJ2O47D ZvjvWxnRAF9t1Xtuc8uOvlwwe5bJUX/8HGbgGPWmgWtQPc0nKW/xcmXub qNFNRTt+k/W2Gviz3QcclSNTTiPFXuehZsiPwRaXzGKfCNypvUP1pLFGx 6BsXtaV8NhBsG4UppDr9Nlz8+omJEN5gr1YPtTsKUowKkwDF02/N/pSqM Q==; X-IronPort-AV: E=Sophos;i="5.90,292,1643644800"; d="scan'208";a="199800859" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 27 Apr 2022 09:42:23 +0800 IronPort-SDR: t2dVoqoaKTjJCptGId6pltn4C99j1ucIOVGBbqF0LF6gzHi9TcFyzRn53asnUvFqLv+uiFwZAR UAwl/6zdpVlln6kDZsv6JNZUOHDkf1YsSB90OJwDRQf5T9URfNIT38LanJzT4ds/xdMWZk8ofj S4DNaBA8bIqHk6HB6CoMPk2XZ/9aRtpG+DOYfeUbHYtfSXLewrOBA1Lr6c9bWrR39yAJtim7GR YT44vI9G84OnGlImOkw+qAqTlW0G1AS+LJGdSTh121FahyKGLUgoquuGP9RJGU8clt56oFTg/Q kq5PMjnFqSuy570LAwf9as2E Received: from uls-op-cesaip01.wdc.com ([10.248.3.36]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES128-GCM-SHA256; 26 Apr 2022 18:12:32 -0700 IronPort-SDR: O/wOaG4dL4OHHgDUoHQQTCXIVtypw1PMRrhRqEHrVEzEaHP1H8boZ8ug8dH0pya2d7DeU1p3Y+ mgSktQUz75Bqw+970+Hhg0gUcoKCtpm274c0W+Ju/kSe9qGSyaQWrAiQOWPIe8tG+eW/uiFtfX 0y0Wc2EmHKUe++ZQhHnKtqw3fM3+HTYq5vx8BmG+4nSDxemlbm2wGOo1oTidY3pIOkmssk9uBJ 7wA37i8Mag6vjzbXiQPB4FkuaS1LUPZjg1eJM/cDl6VsxjGA8uLa14754KXC+3EFy5o2lkaCMI ktY= WDCIronportException: Internal Received: from usg-ed-osssrv.wdc.com ([10.3.10.180]) by uls-op-cesaip01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES128-GCM-SHA256; 26 Apr 2022 18:42:22 -0700 Received: from usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTP id 4Kp1hY4qtYz1Rvlx for ; Tue, 26 Apr 2022 18:42:21 -0700 (PDT) Authentication-Results: usg-ed-osssrv.wdc.com (amavisd-new); dkim=pass reason="pass (just generated, assumed good)" header.d=opensource.wdc.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d= opensource.wdc.com; h=content-transfer-encoding:content-type :in-reply-to:organization:from:references:to:content-language :subject:user-agent:mime-version:date:message-id; s=dkim; t= 1651023738; x=1653615739; bh=eQ7XBnug6z28Gc0LOiGVTTqObhpPTVEc1gK irvGGTx4=; b=njNQcWxUls6Tl30uoSFLyICBEu87a2VTyIGUg5bC0KPzpwqo5eN 0ppUA28BBiZzJVEsYY4QAtrKZLKktlND4yfmY8X3xap4t1fsWjrCiqUw2Tvaz9Nc qN0qchzjQPued2CVVt29TLFc2KtgBgllp9/CyOxglRQSWas5O/OV6gK3/6Cud4gK Cp5TKYoKPlT2pBgc9maI2o80fixl4WTRJ3M2fKs1qNAm5hlrES71xyaD2S/jJ4sn +pJ7nAHrdse97lvMCWDTdVEO1qipfvjy6uu+iB7HJeB+DSozFrLbHHZyt/PsFO+J mVmZXybQ61/CtUhy/Kj9kAZqqW7kMK1jkhQ== X-Virus-Scanned: amavisd-new at usg-ed-osssrv.wdc.com Received: from usg-ed-osssrv.wdc.com ([127.0.0.1]) by usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id J7TyGBycoQzQ for ; Tue, 26 Apr 2022 18:42:18 -0700 (PDT) Received: from [10.225.163.27] (unknown [10.225.163.27]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTPSA id 4Kp1hK3pPvz1Rvlc; Tue, 26 Apr 2022 18:42:09 -0700 (PDT) Message-ID: Date: Wed, 27 Apr 2022 10:42:07 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 Subject: Re: [PATCH v4 10/10] fs: add support for copy file range in zonefs Content-Language: en-US To: Nitesh Shetty Cc: chaitanyak@nvidia.com, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, dm-devel@redhat.com, linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, msnitzer@redhat.com, bvanassche@acm.org, martin.petersen@oracle.com, hare@suse.de, kbusch@kernel.org, hch@lst.de, Frederick.Knight@netapp.com, osandov@fb.com, lsf-pc@lists.linux-foundation.org, djwong@kernel.org, josef@toxicpanda.com, clm@fb.com, dsterba@suse.com, tytso@mit.edu, jack@suse.com, nitheshshetty@gmail.com, gost.dev@samsung.com, Arnav Dawn , Alasdair Kergon , Mike Snitzer , Sagi Grimberg , James Smart , Chaitanya Kulkarni , Naohiro Aota , Johannes Thumshirn , Alexander Viro , linux-kernel@vger.kernel.org References: <20220426101241.30100-1-nj.shetty@samsung.com> <20220426101241.30100-11-nj.shetty@samsung.com> From: Damien Le Moal Organization: Western Digital Research In-Reply-To: <20220426101241.30100-11-nj.shetty@samsung.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On 4/26/22 19:12, Nitesh Shetty wrote: > From: Arnav Dawn > > copy_file_range is implemented using copy offload, > copy offloading to device is always enabled. > To disable copy offloading mount with "no_copy_offload" mount option. > At present copy offload is only used, if the source and destination files > are on same block device, otherwise copy file range is completed by > generic copy file range. Why not integrate copy offload inside generic_copy_file_range() ? > > copy file range implemented as following: > - write pending writes on the src and dest files > - drop page cache for dest file if its conv zone > - copy the range using offload > - update dest file info > > For all failure cases we fallback to generic file copy range > At present this implementation does not support conv aggregation > > Signed-off-by: Arnav Dawn > --- > fs/zonefs/super.c | 178 ++++++++++++++++++++++++++++++++++++++++++++- > fs/zonefs/zonefs.h | 1 + > 2 files changed, 178 insertions(+), 1 deletion(-) > > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c > index b3b0b71fdf6c..60563b592bf2 100644 > --- a/fs/zonefs/super.c > +++ b/fs/zonefs/super.c > @@ -901,6 +901,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from) > else > ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, > &zonefs_write_dio_ops, 0, 0); > + Unnecessary white line change. > if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && > (ret > 0 || ret == -EIOCBQUEUED)) { > if (ret > 0) > @@ -1189,6 +1190,171 @@ static int zonefs_file_release(struct inode *inode, struct file *file) > return 0; > } > > +static int zonefs_is_file_size_ok(struct inode *src_inode, struct inode *dst_inode, > + loff_t src_off, loff_t dst_off, size_t len) This function is badly named. It is not checking if the size of files is OK, is is checking if the copy offsets are OK. > +{ > + loff_t size, endoff; > + > + size = i_size_read(src_inode); > + /* Don't copy beyond source file EOF. */ > + if (src_off + len > size) { > + zonefs_err(src_inode->i_sb, "Copy beyond EOF (%llu + %zu > %llu)\n", > + src_off, len, size); > + return -EOPNOTSUPP; Reading beyond EOF returns 0, not an error, for any FS, including zonefs. So why return an error here ? > + } > + > + endoff = dst_off + len; > + if (inode_newsize_ok(dst_inode, endoff)) > + return -EOPNOTSUPP; This is not EOPNOTSUPP. This is EINVAL since the user is asking for something that we know will fail due to the unaligned destination. Furthermore, this code does not consider the zone type for the file. Since the dest file could be a an aggregated conventional zone file which is larger than a sequential zone file, this is not using the right limit. This must be checked against i_max_size of struct zonefs_inode_info. Note that inode_newsize_ok() must be called with inode->i_mutex held but you never took that lock. Also, the dest file could be a conventional zone file used for swap. And you are not checking that. You have plenty of other checks missing. See generic_copy_file_checks(). Calling that function should be fine for zonefs too. > + > + > + return 0; > +} > +static ssize_t __zonefs_send_copy(struct zonefs_inode_info *src_zi, loff_t src_off, > + struct zonefs_inode_info *dst_zi, loff_t dst_off, size_t len) Please rename this zonefs_issue_copy(). > +{ > + struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev; > + struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev; > + struct range_entry *rlist; > + int ret = -EIO; Initializing ret is not needed. > + > + rlist = kmalloc(sizeof(*rlist), GFP_KERNEL); No NULL check ? > + rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off; > + rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off; > + rlist[0].len = len; > + rlist[0].comp_len = 0; > + ret = blkdev_issue_copy(src_bdev, 1, rlist, dst_bdev, GFP_KERNEL); > + if (ret) { > + if (rlist[0].comp_len != len) { Pack this condition with the previous if using &&. > + ret = rlist[0].comp_len; > + kfree(rlist); > + return ret; These 2 lines are not needed, the same is done below. > + } > + } > + kfree(rlist); > + return len; And how do you handle this failing ? Where is zonefs_io_error() called ? Without a call to that function, there is no way to guarantee that the destination file state is still in sync with the zone state. This can fail for all sorts of reasons (e.g. zone went offline), and that needs to be checked. > +} > +static ssize_t __zonefs_copy_file_range(struct file *src_file, loff_t src_off, > + struct file *dst_file, loff_t dst_off, > + size_t len, unsigned int flags) > +{ > + struct inode *src_inode = file_inode(src_file); > + struct inode *dst_inode = file_inode(dst_file); > + struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode); > + struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode); > + struct block_device *src_bdev = src_inode->i_sb->s_bdev; > + struct block_device *dst_bdev = dst_inode->i_sb->s_bdev; > + struct super_block *src_sb = src_inode->i_sb; > + struct zonefs_sb_info *src_sbi = ZONEFS_SB(src_sb); > + struct super_block *dst_sb = dst_inode->i_sb; > + struct zonefs_sb_info *dst_sbi = ZONEFS_SB(dst_sb); > + ssize_t ret = -EIO, bytes; > + > + if (src_bdev != dst_bdev) { > + zonefs_err(src_sb, "Copying files across two devices\n"); > + return -EXDEV; Weird indentation. And the error message is not needed. The test can also be simplified to if (src_inode->i_sb != dst_inode->i_sb) > + } > + > + /* > + * Some of the checks below will return -EOPNOTSUPP, > + * which will force a generic copy > + */ > + > + if (!(src_sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE) > + || !(dst_sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE)) > + return -EOPNOTSUPP; I do not see the point of having this option. See below. > + > + /* Start by sync'ing the source and destination files ifor conv zones */ > + if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) { > + ret = file_write_and_wait_range(src_file, src_off, (src_off + len)); > + if (ret < 0) { > + zonefs_err(src_sb, "failed to write source file (%zd)\n", ret); > + goto out; > + } > + } > + if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) { > + ret = file_write_and_wait_range(dst_file, dst_off, (dst_off + len)); > + if (ret < 0) { > + zonefs_err(dst_sb, "failed to write destination file (%zd)\n", ret); > + goto out; > + } > + } And what about inode_dio_wait() for sequential dst file ? Not needed ? > + mutex_lock(&dst_zi->i_truncate_mutex); > + if (len > dst_zi->i_max_size - dst_zi->i_wpoffset) { > + /* Adjust length */ > + len -= dst_zi->i_max_size - dst_zi->i_wpoffset; > + if (len <= 0) { > + mutex_unlock(&dst_zi->i_truncate_mutex); > + return -EOPNOTSUPP; If len is 0, return 0. > + } > + } > + if (dst_off != dst_zi->i_wpoffset) { > + mutex_unlock(&dst_zi->i_truncate_mutex); > + return -EOPNOTSUPP; /* copy not at zone write ptr */ This must be an EINVAL. See zonefs_file_dio_write(). The condition is also invalid since the file can be a conventional zone file which allows to write anywhere. Did you really test this code properly ? > + } > + mutex_lock(&src_zi->i_truncate_mutex); > + ret = zonefs_is_file_size_ok(src_inode, dst_inode, src_off, dst_off, len); > + if (ret < 0) { > + mutex_unlock(&src_zi->i_truncate_mutex); > + mutex_unlock(&dst_zi->i_truncate_mutex); > + goto out; > + } > + mutex_unlock(&src_zi->i_truncate_mutex); > + > + /* Drop dst file cached pages for a conv zone*/ > + if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) { > + ret = invalidate_inode_pages2_range(dst_inode->i_mapping, > + dst_off >> PAGE_SHIFT, > + (dst_off + len) >> PAGE_SHIFT); > + if (ret < 0) { > + zonefs_err(dst_sb, "Failed to invalidate inode pages (%zd)\n", ret); > + ret = 0; And you still go on ? That will corrupt data. No way. This error must be returned to fail the copy. > + } > + } > + bytes = __zonefs_send_copy(src_zi, src_off, dst_zi, dst_off, len); > + ret += bytes; You cannot hold i_truncate_mutex while doing IOs because if there is an error, there will be a deadlock. Also, __zonefs_send_copy() can return an error and that is not checked. > + > + file_update_time(dst_file); > + zonefs_update_stats(dst_inode, dst_off + bytes); > + zonefs_i_size_write(dst_inode, dst_off + bytes); > + dst_zi->i_wpoffset += bytes; > + mutex_unlock(&dst_zi->i_truncate_mutex); > + > + > + 2 extra uneeded blank lines. > + /* > + * if we still have some bytes left, do splice copy > + */ This comment fits on a single line. > + if (bytes && (bytes < len)) { > + zonefs_info(src_sb, "Final partial copy of %zu bytes\n", len); > + bytes = do_splice_direct(src_file, &src_off, dst_file, > + &dst_off, len, flags); And this can fail because other writes may be coming in since you never locked inode->i_mutex. > + if (bytes > 0) > + ret += bytes; > + else > + zonefs_info(src_sb, "Failed partial copy (%zd)\n", bytes); Error message not needed. > + } > + > +out: > + > + return ret; > +} > + > +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off, > + struct file *dst_file, loff_t dst_off, > + size_t len, unsigned int flags) > +{ > + ssize_t ret; > + > + ret = __zonefs_copy_file_range(src_file, src_off, dst_file, dst_off, > + len, flags); > + Useless blank line. __zonefs_copy_file_range() needs to be split into zonefs_copy_file_checks() and zonefs_copy_file(). The below call to generic_copy_file_range() should go into zonefs_copy_file(). zonefs_copy_file() calling either generic_copy_file_range() if the device does not have copy offload or zonefs_issue_copy() if it does. > + if (ret == -EOPNOTSUPP || ret == -EXDEV)> + ret = generic_copy_file_range(src_file, src_off, dst_file, > + dst_off, len, flags); This function is not taking the inode_lock() for source and est files. This means that this can run with concurent regular writes and truncate and that potentially can result in some very weird results, unaligned write errors and the FS going read-only. > + return ret; > +} > + > static const struct file_operations zonefs_file_operations = { > .open = zonefs_file_open, > .release = zonefs_file_release, > @@ -1200,6 +1366,7 @@ static const struct file_operations zonefs_file_operations = { > .splice_read = generic_file_splice_read, > .splice_write = iter_file_splice_write, > .iopoll = iocb_bio_iopoll, > + .copy_file_range = zonefs_copy_file_range, > }; > > static struct kmem_cache *zonefs_inode_cachep; > @@ -1262,7 +1429,7 @@ static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf) > > enum { > Opt_errors_ro, Opt_errors_zro, Opt_errors_zol, Opt_errors_repair, > - Opt_explicit_open, Opt_err, > + Opt_explicit_open, Opt_no_copy_offload, Opt_err, This mount option does not make much sense. Copy file range was not supported until now. Existing applications are thus not using it. Adding support for copy_file_range op will not break these applications so it can be unconditionally supported. > }; > > static const match_table_t tokens = { > @@ -1271,6 +1438,7 @@ static const match_table_t tokens = { > { Opt_errors_zol, "errors=zone-offline"}, > { Opt_errors_repair, "errors=repair"}, > { Opt_explicit_open, "explicit-open" }, > + { Opt_no_copy_offload, "no_copy_offload" }, > { Opt_err, NULL} > }; > > @@ -1280,6 +1448,7 @@ static int zonefs_parse_options(struct super_block *sb, char *options) > substring_t args[MAX_OPT_ARGS]; > char *p; > > + sbi->s_mount_opts |= ZONEFS_MNTOPT_COPY_FILE; > if (!options) > return 0; > > @@ -1310,6 +1479,9 @@ static int zonefs_parse_options(struct super_block *sb, char *options) > case Opt_explicit_open: > sbi->s_mount_opts |= ZONEFS_MNTOPT_EXPLICIT_OPEN; > break; > + case Opt_no_copy_offload: > + sbi->s_mount_opts &= ~ZONEFS_MNTOPT_COPY_FILE; > + break; > default: > return -EINVAL; > } > @@ -1330,6 +1502,8 @@ static int zonefs_show_options(struct seq_file *seq, struct dentry *root) > seq_puts(seq, ",errors=zone-offline"); > if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_REPAIR) > seq_puts(seq, ",errors=repair"); > + if (sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE) > + seq_puts(seq, ",copy_offload"); > > return 0; > } > @@ -1769,6 +1943,8 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent) > atomic_set(&sbi->s_active_seq_files, 0); > sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev); > > + /* set copy support by default */ > + sbi->s_mount_opts |= ZONEFS_MNTOPT_COPY_FILE; > ret = zonefs_read_super(sb); > if (ret) > return ret; > diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h > index 4b3de66c3233..efa6632c4b6a 100644 > --- a/fs/zonefs/zonefs.h > +++ b/fs/zonefs/zonefs.h > @@ -162,6 +162,7 @@ enum zonefs_features { > (ZONEFS_MNTOPT_ERRORS_RO | ZONEFS_MNTOPT_ERRORS_ZRO | \ > ZONEFS_MNTOPT_ERRORS_ZOL | ZONEFS_MNTOPT_ERRORS_REPAIR) > #define ZONEFS_MNTOPT_EXPLICIT_OPEN (1 << 4) /* Explicit open/close of zones on open/close */ > +#define ZONEFS_MNTOPT_COPY_FILE (1 << 5) /* enable copy file range offload to kernel */ > > /* > * In-memory Super block information. -- Damien Le Moal Western Digital Research