From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45AB8C4332F for ; Thu, 24 Nov 2022 01:48:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229880AbiKXBsH (ORCPT ); Wed, 23 Nov 2022 20:48:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48330 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229920AbiKXBsG (ORCPT ); Wed, 23 Nov 2022 20:48:06 -0500 Received: from esa4.hgst.iphmx.com (esa4.hgst.iphmx.com [216.71.154.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D791DEE22 for ; Wed, 23 Nov 2022 17:48:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1669254484; x=1700790484; h=message-id:date:mime-version:subject:from:to:cc: references:in-reply-to:content-transfer-encoding; bh=N2FhLLtnNgmNj+JO4Jpf49iCpELlXJhmTuTgmijgAec=; b=ZShXNS5wv3BtEjtTLZ8g4R9f2H8ORIPDBv5u9m/0AT7S52vp6geM7knW YSrayOA+3SnJBCjz+x34G9/GKn+dy+rcOPc7G8sjY/nWg58enOzhL8qv7 EDXDsq4nU99pURSBbsrufr8ifBnqBSkcsogKq78cPRIbJQeLQ+NjXMxe3 VAehrPjbkUvdIqTwEl0O7YRIQf7JJLvqNsySFYbC91gAscM53vibOXCGv hNJ4jfAtqiKKgGmK1XHkznhHBZIEOYGci34fHLjqiuuJ8n70XNyACMNaJ bX1ToLq7DdtVcal75gQB4Z1t72xdEK/qykvH/q4dy3stpc78ZTn2LCNFQ w==; X-IronPort-AV: E=Sophos;i="5.96,189,1665417600"; d="scan'208";a="215291791" Received: from h199-255-45-14.hgst.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 24 Nov 2022 09:48:03 +0800 IronPort-SDR: VmVE4oTgzBpCHWzftLaixAe0g0Qk7SUZgYVH6iqJe4aAvt91L3Hy7a9CeF2ACL0nQAgQoPwG+1 rk5IGBNnJbzCFp9p2ilqMa1xwOE8B9IR/C/rbuGNIvKg8Q0lcFHN/gwu/Rzh8DJHKZm/7NkV7P h9AO2XlCJH3/661mVirRuUQt54rJ5EOcSrlq/qzq/Rznm/C6kBMXvqZB8Xa5LGA3gwzJMd0aQe atrzxcPKe6O67Fwb3xKOtvmH8q3GFyx5CLYdEukdfPeLVWV6Ca+BMk/egx5ON6f2jQ8QGr4pNm lo0= Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES128-GCM-SHA256; 23 Nov 2022 17:06:50 -0800 IronPort-SDR: kF4ykB1uS2NCW0hma58uTaEo47BfeU9tAQMHyCy/HMF/0yufS+GOqFJ7rGag8SdvFdD8QZOfoO iq5ItxN7Nk7FKp0Dm0n1e5Mw6DDT7GxkwZxzAqdkE+ZUYP3HbOujmFEIVO+HbniaFoEU/dQeXX ISbiVVnyoYjrrH0iqnbST2Bw4K2TeYr5FkvjisMREmeT/+3sQfWvbxa7w/SSJfzaHwmPKluNDS 0Y1QYnCVgMHFkOxl35MgWD/+aeobbHR/lxneTC7SHzaZpZ4V7qkaH4Wf0QYUmmDRe1u3qmOfiA ocg= WDCIronportException: Internal Received: from usg-ed-osssrv.wdc.com ([10.3.10.180]) by uls-op-cesaip02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES128-GCM-SHA256; 23 Nov 2022 17:48:04 -0800 Received: from usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTP id 4NHgql3jTJz1Rwt8 for ; Wed, 23 Nov 2022 17:48:03 -0800 (PST) Authentication-Results: usg-ed-osssrv.wdc.com (amavisd-new); dkim=pass reason="pass (just generated, assumed good)" header.d=opensource.wdc.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d= opensource.wdc.com; h=content-transfer-encoding:content-type :in-reply-to:organization:references:to:from:content-language :subject:user-agent:mime-version:date:message-id; s=dkim; t= 1669254480; x=1671846481; bh=N2FhLLtnNgmNj+JO4Jpf49iCpELlXJhmTuT gmijgAec=; b=YHoTm+dkvFiXcokW5oA423bAdkL6utRZKwmbtgN62IenxWVhbBs J9nWrFX10pCN9ltqb+SFTad22/y1UftQLYMCX2Ei+2gjFWN7GF1m7aGG7PQel8ba 5CupV61RMVVsYLlJAnftpYnmuglGYpwjKHfAlOd6UhHG/WazA5sVUE7Tgfzg1jsN Mxv6/1wdGzjTxM1YPr53bDyDzzjo6G43N/wZzhHCaELmI7goonffi+kllNbpZPhT zuFPWD/NjAQMPEOfcU1qOokdhLqzPtnSchvQpLG+nn2fnJ+0Jxma9bVTaYZ9at7p dElZZYC3eAbiPQzucKP6w8tBkuMunUrkmNw== X-Virus-Scanned: amavisd-new at usg-ed-osssrv.wdc.com Received: from usg-ed-osssrv.wdc.com ([127.0.0.1]) by usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id hOIstQJr2C9l for ; Wed, 23 Nov 2022 17:48:00 -0800 (PST) Received: from [10.225.163.55] (unknown [10.225.163.55]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTPSA id 4NHgqc3Zlkz1RvLy; Wed, 23 Nov 2022 17:47:56 -0800 (PST) Message-ID: <349a4d66-3a9f-a095-005c-1f180c5f3aac@opensource.wdc.com> Date: Thu, 24 Nov 2022 10:47:55 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0 Subject: Re: [PATCH v5 10/10] fs: add support for copy file range in zonefs Content-Language: en-US From: Damien Le Moal To: Nitesh Shetty , axboe@kernel.dk, agk@redhat.com, snitzer@kernel.org, dm-devel@redhat.com, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, james.smart@broadcom.com, kch@nvidia.com, naohiro.aota@wdc.com, jth@kernel.org, viro@zeniv.linux.org.uk Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, anuj20.g@samsung.com, joshi.k@samsung.com, p.raghav@samsung.com, nitheshshetty@gmail.com, gost.dev@samsung.com References: <20221123055827.26996-1-nj.shetty@samsung.com> <20221123055827.26996-11-nj.shetty@samsung.com> <729254f8-2468-e694-715e-72bcbef80ff3@opensource.wdc.com> Organization: Western Digital Research In-Reply-To: <729254f8-2468-e694-715e-72bcbef80ff3@opensource.wdc.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On 11/24/22 10:32, Damien Le Moal wrote: > On 11/23/22 14:58, Nitesh Shetty wrote: >> copy_file_range is implemented using copy offload, >> copy offloading to device is always enabled. >> To disable copy offloading mount with "no_copy_offload" mount option. > > And were is the code that handle this option ? > >> At present copy offload is only used, if the source and destination files >> are on same block device, otherwise copy file range is completed by >> generic copy file range. >> >> copy file range implemented as following: >> - write pending writes on the src and dest files >> - drop page cache for dest file if its conv zone >> - copy the range using offload >> - update dest file info >> >> For all failure cases we fallback to generic file copy range > > For all cases ? That would be weird. What would be the point of trying to > copy again if e.g. the dest zone has gone offline or read only ? > >> At present this implementation does not support conv aggregation > > Please check this commit message overall: the grammar and punctuation > could really be improved. > >> >> Signed-off-by: Nitesh Shetty >> Signed-off-by: Anuj Gupta >> --- >> fs/zonefs/super.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 179 insertions(+) >> >> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c >> index abc9a85106f2..15613433d4ae 100644 >> --- a/fs/zonefs/super.c >> +++ b/fs/zonefs/super.c >> @@ -1223,6 +1223,183 @@ static int zonefs_file_release(struct inode *inode, struct file *file) >> return 0; >> } >> >> +static int zonefs_is_file_copy_offset_ok(struct inode *src_inode, >> + struct inode *dst_inode, loff_t src_off, loff_t dst_off, >> + size_t *len) >> +{ >> + loff_t size, endoff; >> + struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode); >> + >> + inode_lock(src_inode); >> + size = i_size_read(src_inode); >> + inode_unlock(src_inode); >> + /* Don't copy beyond source file EOF. */ >> + if (src_off < size) { >> + if (src_off + *len > size) >> + *len = (size - (src_off + *len)); >> + } else >> + *len = 0; > > Missing curly brackets for the else. > >> + >> + mutex_lock(&dst_zi->i_truncate_mutex); >> + if (dst_zi->i_ztype == ZONEFS_ZTYPE_SEQ) { >> + if (*len > dst_zi->i_max_size - dst_zi->i_wpoffset) >> + *len -= dst_zi->i_max_size - dst_zi->i_wpoffset; >> + >> + if (dst_off != dst_zi->i_wpoffset) >> + goto err; >> + } >> + mutex_unlock(&dst_zi->i_truncate_mutex); >> + >> + endoff = dst_off + *len; >> + inode_lock(dst_inode); >> + if (endoff > dst_zi->i_max_size || >> + inode_newsize_ok(dst_inode, endoff)) { >> + inode_unlock(dst_inode); >> + goto err; > > And here truncate mutex is not locked, but goto err will unlock it. This > is broken... > >> + } >> + inode_unlock(dst_inode); > > ...The locking is completely broken in this function anyway. You take the > lock, look at something, then release the lock. Then what if a write or a > trunctate comes in when the inode is not locked ? This is completely > broken. The inode should be locked with no dio pending when this function > is called. This is only to check if everything is ok. This has no business > playing with the inode and truncate locks. > >> + >> + return 0; >> +err: >> + mutex_unlock(&dst_zi->i_truncate_mutex); >> + return -EINVAL; >> +} >> + >> +static ssize_t zonefs_issue_copy(struct zonefs_inode_info *src_zi, >> + loff_t src_off, struct zonefs_inode_info *dst_zi, >> + loff_t dst_off, size_t len) >> +{ >> + struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev; >> + struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev; >> + struct range_entry *rlist = NULL; >> + int ret = len; >> + >> + rlist = kmalloc(sizeof(*rlist), GFP_KERNEL); > > GFP_NOIO ? > >> + if (!rlist) >> + return -ENOMEM; >> + >> + rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off; >> + rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off; >> + rlist[0].len = len; >> + rlist[0].comp_len = 0; >> + ret = blkdev_issue_copy(src_bdev, dst_bdev, rlist, 1, NULL, NULL, >> + GFP_KERNEL); >> + if (rlist[0].comp_len > 0) >> + ret = rlist[0].comp_len; >> + kfree(rlist); >> + >> + return ret; >> +} >> + >> +/* Returns length of possible copy, else returns error */ >> +static ssize_t zonefs_copy_file_checks(struct file *src_file, loff_t src_off, >> + struct file *dst_file, loff_t dst_off, >> + size_t *len, unsigned int flags) >> +{ >> + struct inode *src_inode = file_inode(src_file); >> + struct inode *dst_inode = file_inode(dst_file); >> + struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode); >> + struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode); >> + ssize_t ret; >> + >> + if (src_inode->i_sb != dst_inode->i_sb) >> + return -EXDEV; >> + >> + /* Start by sync'ing the source and destination files for conv zones */ >> + if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) { >> + ret = file_write_and_wait_range(src_file, src_off, >> + (src_off + *len)); >> + if (ret < 0) >> + goto io_error; >> + } >> + inode_dio_wait(src_inode); > > That is not a "check". So having this in a function called > zonefs_copy_file_checks() is a little strange. > >> + >> + /* Start by sync'ing the source and destination files ifor conv zones */ > > Same comment repeated, with typos. > >> + if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) { >> + ret = file_write_and_wait_range(dst_file, dst_off, >> + (dst_off + *len)); >> + if (ret < 0) >> + goto io_error; >> + } >> + inode_dio_wait(dst_inode); >> + >> + /* Drop dst file cached pages for a conv zone*/ >> + if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) { >> + ret = invalidate_inode_pages2_range(dst_inode->i_mapping, >> + dst_off >> PAGE_SHIFT, >> + (dst_off + *len) >> PAGE_SHIFT); >> + if (ret < 0) >> + goto io_error; >> + } >> + >> + ret = zonefs_is_file_copy_offset_ok(src_inode, dst_inode, src_off, >> + dst_off, len); >> + if (ret < 0) > > if (ret) > >> + return ret; >> + >> + return *len; >> + >> +io_error: >> + zonefs_io_error(dst_inode, true); >> + return ret; >> +} >> + >> +static ssize_t zonefs_copy_file(struct file *src_file, loff_t src_off, >> + struct file *dst_file, loff_t dst_off, >> + size_t len, unsigned int flags) >> +{ >> + struct inode *src_inode = file_inode(src_file); >> + struct inode *dst_inode = file_inode(dst_file); >> + struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode); >> + struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode); >> + ssize_t ret = 0, bytes; >> + >> + inode_lock(src_inode); >> + inode_lock(dst_inode); > > So you did zonefs_copy_file_checks() outside of these locks, which mean > that everything about the source and destination files may have changed. > This does not work. I forgot to mention that locking 2 inodes blindly like this can leads to deadlocks if another process tries a copy range from dst to src at the same time (lock order is reversed and so can deadlock). > >> + bytes = zonefs_issue_copy(src_zi, src_off, dst_zi, dst_off, len); >> + if (bytes < 0) >> + goto unlock_exit; >> + >> + ret += bytes; >> + >> + file_update_time(dst_file); >> + mutex_lock(&dst_zi->i_truncate_mutex); >> + zonefs_update_stats(dst_inode, dst_off + bytes); >> + zonefs_i_size_write(dst_inode, dst_off + bytes); >> + dst_zi->i_wpoffset += bytes; > > This is wierd. iszie for dst will be dst_zi->i_wpoffset. So please do: > > dst_zi->i_wpoffset += bytes; > zonefs_i_size_write(dst_inode, dst_zi->i_wpoffset); > >> + mutex_unlock(&dst_zi->i_truncate_mutex); > > And you are not taking care of the accounting for active zones here. If > the copy made the dst zone full, it is not active anymore. You need to > call zonefs_account_active(); > >> + /* if we still have some bytes left, do splice copy */ >> + if (bytes && (bytes < len)) { >> + bytes = do_splice_direct(src_file, &src_off, dst_file, >> + &dst_off, len, flags); > > No way. > >> + if (bytes > 0) >> + ret += bytes; >> + } >> +unlock_exit: >> + if (ret < 0) >> + zonefs_io_error(dst_inode, true); > > How can you be sure that you even did an IO when you get an error ? > zonefs_issue_copy() may have failed on its kmalloc() and no IO was done. > >> + inode_unlock(src_inode); >> + inode_unlock(dst_inode); >> + return ret; >> +} >> + >> +static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off, >> + struct file *dst_file, loff_t dst_off, >> + size_t len, unsigned int flags) >> +{ >> + ssize_t ret = -EIO; > > This does not need to be initialized. > >> + >> + ret = zonefs_copy_file_checks(src_file, src_off, dst_file, dst_off, >> + &len, flags); > > These checks need to be done for the generic implementation too, no ? Why > would checking this automatically trigger the offload ? What if the device > does not support offloading ? > >> + if (ret > 0) >> + ret = zonefs_copy_file(src_file, src_off, dst_file, dst_off, >> + len, flags); > > return here, then no need for the else. But see above. This seems all > broken to me. > >> + else if (ret < 0 && ret == -EXDEV) >> + ret = generic_copy_file_range(src_file, src_off, dst_file, >> + dst_off, len, flags); >> + return ret; >> +} >> + >> static const struct file_operations zonefs_file_operations = { >> .open = zonefs_file_open, >> .release = zonefs_file_release, >> @@ -1234,6 +1411,7 @@ static const struct file_operations zonefs_file_operations = { >> .splice_read = generic_file_splice_read, >> .splice_write = iter_file_splice_write, >> .iopoll = iocb_bio_iopoll, >> + .copy_file_range = zonefs_copy_file_range, >> }; >> >> static struct kmem_cache *zonefs_inode_cachep; >> @@ -1804,6 +1982,7 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent) >> atomic_set(&sbi->s_active_seq_files, 0); >> sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev); >> >> + /* set copy support by default */ > > What is this comment supposed to be for ? > >> ret = zonefs_read_super(sb); >> if (ret) >> return ret; > -- Damien Le Moal Western Digital Research