From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: move the direct IO code into its own file
Date: Wed, 26 Jun 2024 09:18:57 +0930
References: <107dc9437ffcbf6751d018209d037c851a890f4d.1719328515.git.fdmanana@suse.com>
In-Reply-To: <107dc9437ffcbf6751d018209d037c851a890f4d.1719328515.git.fdmanana@suse.com>

On 2024/6/26 00:46, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The direct IO code is over a thousand lines and it's currently spread
> between file.c and inode.c, which makes it not easy to locate some parts
> of it sometimes. Also inode.c is about 11 thousand lines and file.c about
> 4 thousand lines, both too big. So move all the direct IO code into a
> dedicated file, so that it's easy to locate all its code and reduce the
> sizes of inode.c and file.c.
>
> This is a pure move of code without any other changes except exporting
> a couple of functions from inode.c (get_extent_allocation_hint() and
> create_io_em()) because they are used both in inode.c and in the new
> direct-io.c file, and a couple of functions from file.c
> (btrfs_buffered_write() and btrfs_write_check()) because they are used
> both in file.c and in the new direct-io.c file.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>

Thanks,
Qu

> ---
>  fs/btrfs/Makefile      |    2 +-
>  fs/btrfs/btrfs_inode.h |    9 +-
>  fs/btrfs/direct-io.c   | 1052 ++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/direct-io.h   |   14 +
>  fs/btrfs/file.c        |  287 +----------
>  fs/btrfs/file.h        |    2 +
>  fs/btrfs/inode.c       |  784 +-----------------------------
>  fs/btrfs/super.c       |    4 +
>  8 files changed, 1095 insertions(+), 1059 deletions(-)
>  create mode 100644 fs/btrfs/direct-io.c
>  create mode 100644 fs/btrfs/direct-io.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 50b19d15e956..87617f2968bc 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
>  	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> -	   lru_cache.o raid-stripe-tree.o fiemap.o
> +	   lru_cache.o raid-stripe-tree.o fiemap.o direct-io.o
>
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index b0fe610d5940..b33f147f2780 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -610,10 +610,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>  ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>  			       const struct btrfs_ioctl_encoded_io_args *encoded);
>
> -ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
> -		       size_t done_before);
> -struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
> -				  size_t done_before);
>  struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino);
>
>  extern const struct dentry_operations btrfs_dentry_operations;
> @@ -630,5 +626,10 @@ void btrfs_inode_unlock(struct btrfs_inode *inode, unsigned int ilock_flags);
>  void btrfs_update_inode_bytes(struct btrfs_inode *inode, const u64 add_bytes,
>  			      const u64 del_bytes);
>  void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end);
> +u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> +				     u64 num_bytes);
> +struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
> +				      const struct btrfs_file_extent *file_extent,
> +				      int type);
>
>  #endif
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> new file mode 100644
> index 000000000000..f9fb2db6a1e4
> --- /dev/null
> +++ b/fs/btrfs/direct-io.c
> @@ -0,0 +1,1052 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/fsverity.h>
> +#include <linux/iomap.h>
> +#include "ctree.h"
> +#include "delalloc-space.h"
> +#include "direct-io.h"
> +#include "extent-tree.h"
> +#include "file.h"
> +#include "fs.h"
> +#include "transaction.h"
> +#include "volumes.h"
> +
> +struct btrfs_dio_data {
> +	ssize_t submitted;
> +	struct extent_changeset *data_reserved;
> +	struct btrfs_ordered_extent *ordered;
> +	bool data_space_reserved;
> +	bool nocow_done;
> +};
> +
> +struct btrfs_dio_private {
> +	/* Range of I/O */
> +	u64 file_offset;
> +	u32 bytes;
> +
> +	/* This must be last */
> +	struct btrfs_bio bbio;
> +};
> +
> +static struct bio_set btrfs_dio_bioset;
> +
> +static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
> +			      struct extent_state **cached_state,
> +			      unsigned int iomap_flags)
> +{
> +	const bool writing = (iomap_flags & IOMAP_WRITE);
> +	const bool nowait = (iomap_flags & IOMAP_NOWAIT);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_ordered_extent *ordered;
> +	int ret = 0;
> +
> +	while (1) {
> +		if (nowait) {
> +			if (!try_lock_extent(io_tree, lockstart, lockend,
> +					     cached_state))
> +				return -EAGAIN;
> +		} else {
> +			lock_extent(io_tree, lockstart, lockend, cached_state);
> +		}
> +		/*
> +		 * We're concerned with the entire range that we're going to be
> +		 * doing DIO to, so we need to make sure there's no ordered
> +		 * extents in this range.
> +		 */
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), lockstart,
> +						     lockend - lockstart + 1);
> +
> +		/*
> +		 * We need to make sure there are no buffered pages in this
> +		 * range either, we could have raced between the invalidate in
> +		 * generic_file_direct_write and locking the extent. The
> +		 * invalidate needs to happen so that reads after a write do not
> +		 * get stale data.
> +		 */
> +		if (!ordered &&
> +		    (!writing || !filemap_range_has_page(inode->i_mapping,
> +							 lockstart, lockend)))
> +			break;
> +
> +		unlock_extent(io_tree, lockstart, lockend, cached_state);
> +
> +		if (ordered) {
> +			if (nowait) {
> +				btrfs_put_ordered_extent(ordered);
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			/*
> +			 * If we are doing a DIO read and the ordered extent we
> +			 * found is for a buffered write, we can not wait for it
> +			 * to complete and retry, because if we do so we can
> +			 * deadlock with concurrent buffered writes on page
> +			 * locks. This happens only if our DIO read covers more
> +			 * than one extent map, if at this point has already
> +			 * created an ordered extent for a previous extent map
> +			 * and locked its range in the inode's io tree, and a
> +			 * concurrent write against that previous extent map's
> +			 * range and this range started (we unlock the ranges
> +			 * in the io tree only when the bios complete and
> +			 * buffered writes always lock pages before attempting
> +			 * to lock range in the io tree).
> +			 */
> +			if (writing ||
> +			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> +				btrfs_start_ordered_extent(ordered);
> +			else
> +				ret = nowait ? -EAGAIN : -ENOTBLK;
> +			btrfs_put_ordered_extent(ordered);
> +		} else {
> +			/*
> +			 * We could trigger writeback for this range (and wait
> +			 * for it to complete) and then invalidate the pages for
> +			 * this range (through invalidate_inode_pages2_range()),
> +			 * but that can lead us to a deadlock with a concurrent
> +			 * call to readahead (a buffered read or a defrag call
> +			 * triggered a readahead) on a page lock due to an
> +			 * ordered dio extent we created before but did not have
> +			 * yet a corresponding bio submitted (whence it can not
> +			 * complete), which makes readahead wait for that
> +			 * ordered extent to complete while holding a lock on
> +			 * that page.
> +			 */
> +			ret = nowait ? -EAGAIN : -ENOTBLK;
> +		}
> +
> +		if (ret)
> +			break;
> +
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> +						  struct btrfs_dio_data *dio_data,
> +						  const u64 start,
> +						  const struct btrfs_file_extent *file_extent,
> +						  const int type)
> +{
> +	struct extent_map *em = NULL;
> +	struct btrfs_ordered_extent *ordered;
> +
> +	if (type != BTRFS_ORDERED_NOCOW) {
> +		em = btrfs_create_io_em(inode, start, file_extent, type);
> +		if (IS_ERR(em))
> +			goto out;
> +	}
> +
> +	ordered = btrfs_alloc_ordered_extent(inode, start, file_extent,
> +					     (1 << type) |
> +					     (1 << BTRFS_ORDERED_DIRECT));
> +	if (IS_ERR(ordered)) {
> +		if (em) {
> +			free_extent_map(em);
> +			btrfs_drop_extent_map_range(inode, start,
> +					start + file_extent->num_bytes - 1, false);
> +		}
> +		em = ERR_CAST(ordered);
> +	} else {
> +		ASSERT(!dio_data->ordered);
> +		dio_data->ordered = ordered;
> +	}
> + out:
> +
> +	return em;
> +}
> +
> +static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
> +						  struct btrfs_dio_data *dio_data,
> +						  u64 start, u64 len)
> +{
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_file_extent file_extent;
> +	struct extent_map *em;
> +	struct btrfs_key ins;
> +	u64 alloc_hint;
> +	int ret;
> +
> +	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, len);
> +again:
> +	ret = btrfs_reserve_extent(root, len, len, fs_info->sectorsize,
> +				   0, alloc_hint, &ins, 1, 1);
> +	if (ret == -EAGAIN) {
> +		ASSERT(btrfs_is_zoned(fs_info));
> +		wait_on_bit_io(&inode->root->fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
> +			       TASK_UNINTERRUPTIBLE);
> +		goto again;
> +	}
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	file_extent.disk_bytenr = ins.objectid;
> +	file_extent.disk_num_bytes = ins.offset;
> +	file_extent.num_bytes = ins.offset;
> +	file_extent.ram_bytes = ins.offset;
> +	file_extent.offset = 0;
> +	file_extent.compression = BTRFS_COMPRESS_NONE;
> +	em = btrfs_create_dio_extent(inode, dio_data, start, &file_extent,
> +				     BTRFS_ORDERED_REGULAR);
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +	if (IS_ERR(em))
> +		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
> +					   1);
> +
> +	return em;
> +}
> +
> +static int btrfs_get_blocks_direct_write(struct extent_map **map,
> +					 struct inode *inode,
> +					 struct btrfs_dio_data *dio_data,
> +					 u64 start, u64 *lenp,
> +					 unsigned int iomap_flags)
> +{
> +	const bool nowait = (iomap_flags & IOMAP_NOWAIT);
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_file_extent file_extent;
> +	struct extent_map *em = *map;
> +	int type;
> +	u64 block_start;
> +	struct btrfs_block_group *bg;
> +	bool can_nocow = false;
> +	bool space_reserved = false;
> +	u64 len = *lenp;
> +	u64 prev_len;
> +	int ret = 0;
> +
> +	/*
> +	 * We don't allocate a new extent in the following cases
> +	 *
> +	 * 1) The inode is marked as NODATACOW. In this case we'll just use the
> +	 *    existing extent.
> +	 * 2) The extent is marked as PREALLOC. We're good to go here and can
> +	 *    just use the extent.
> +	 *
> +	 */
> +	if ((em->flags & EXTENT_FLAG_PREALLOC) ||
> +	    ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
> +	     em->disk_bytenr != EXTENT_MAP_HOLE)) {
> +		if (em->flags & EXTENT_FLAG_PREALLOC)
> +			type = BTRFS_ORDERED_PREALLOC;
> +		else
> +			type = BTRFS_ORDERED_NOCOW;
> +		len = min(len, em->len - (start - em->start));
> +		block_start = extent_map_block_start(em) + (start - em->start);
> +
> +		if (can_nocow_extent(inode, start, &len,
> +				     &file_extent, false, false) == 1) {
> +			bg = btrfs_inc_nocow_writers(fs_info, block_start);
> +			if (bg)
> +				can_nocow = true;
> +		}
> +	}
> +
> +	prev_len = len;
> +	if (can_nocow) {
> +		struct extent_map *em2;
> +
> +		/* We can NOCOW, so only need to reserve metadata space. */
> +		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
> +						      nowait);
> +		if (ret < 0) {
> +			/* Our caller expects us to free the input extent map. */
> +			free_extent_map(em);
> +			*map = NULL;
> +			btrfs_dec_nocow_writers(bg);
> +			if (nowait && (ret == -ENOSPC || ret == -EDQUOT))
> +				ret = -EAGAIN;
> +			goto out;
> +		}
> +		space_reserved = true;
> +
> +		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start,
> +					      &file_extent, type);
> +		btrfs_dec_nocow_writers(bg);
> +		if (type == BTRFS_ORDERED_PREALLOC) {
> +			free_extent_map(em);
> +			*map = em2;
> +			em = em2;
> +		}
> +
> +		if (IS_ERR(em2)) {
> +			ret = PTR_ERR(em2);
> +			goto out;
> +		}
> +
> +		dio_data->nocow_done = true;
> +	} else {
> +		/* Our caller expects us to free the input extent map. */
> +		free_extent_map(em);
> +		*map = NULL;
> +
> +		if (nowait) {
> +			ret = -EAGAIN;
> +			goto out;
> +		}
> +
> +		/*
> +		 * If we could not allocate data space before locking the file
> +		 * range and we can't do a NOCOW write, then we have to fail.
> +		 */
> +		if (!dio_data->data_space_reserved) {
> +			ret = -ENOSPC;
> +			goto out;
> +		}
> +
> +		/*
> +		 * We have to COW and we have already reserved data space before,
> +		 * so now we reserve only metadata.
> +		 */
> +		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
> +						      false);
> +		if (ret < 0)
> +			goto out;
> +		space_reserved = true;
> +
> +		em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +		*map = em;
> +		len = min(len, em->len - (start - em->start));
> +		if (len < prev_len)
> +			btrfs_delalloc_release_metadata(BTRFS_I(inode),
> +							prev_len - len, true);
> +	}
> +
> +	/*
> +	 * We have created our ordered extent, so we can now release our reservation
> +	 * for an outstanding extent.
> +	 */
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), prev_len);
> +
> +	/*
> +	 * Need to update the i_size under the extent lock so buffered
> +	 * readers will get the updated i_size when we unlock.
> +	 */
> +	if (start + len > i_size_read(inode))
> +		i_size_write(inode, start + len);
> +out:
> +	if (ret && space_reserved) {
> +		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
> +		btrfs_delalloc_release_metadata(BTRFS_I(inode), len, true);
> +	}
> +	*lenp = len;
> +	return ret;
> +}
> +
> +static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> +				 loff_t length, unsigned int flags, struct iomap *iomap,
> +				 struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct extent_map *em;
> +	struct extent_state *cached_state = NULL;
> +	struct btrfs_dio_data *dio_data = iter->private;
> +	u64 lockstart, lockend;
> +	const bool write = !!(flags & IOMAP_WRITE);
> +	int ret = 0;
> +	u64 len = length;
> +	const u64 data_alloc_len = length;
> +	bool unlock_extents = false;
> +
> +	/*
> +	 * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> +	 * we're NOWAIT we may submit a bio for a partial range and return
> +	 * EIOCBQUEUED, which would result in an errant short read.
> +	 *
> +	 * The best way to handle this would be to allow for partial completions
> +	 * of iocb's, so we could submit the partial bio, return and fault in
> +	 * the rest of the pages, and then submit the io for the rest of the
> +	 * range.  However we don't have that currently, so simply return
> +	 * -EAGAIN at this point so that the normal path is used.
> +	 */
> +	if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
> +		return -EAGAIN;
> +
> +	/*
> +	 * Cap the size of reads to that usually seen in buffered I/O as we need
> +	 * to allocate a contiguous array for the checksums.
> +	 */
> +	if (!write)
> +		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_BIO_SECTORS);
> +
> +	lockstart = start;
> +	lockend = start + len - 1;
> +
> +	/*
> +	 * iomap_dio_rw() only does filemap_write_and_wait_range(), which isn't
> +	 * enough if we've written compressed pages to this area, so we need to
> +	 * flush the dirty pages again to make absolutely sure that any
> +	 * outstanding dirty pages are on disk - the first flush only starts
> +	 * compression on the data, while keeping the pages locked, so by the
> +	 * time the second flush returns we know bios for the compressed pages
> +	 * were submitted and finished, and the pages no longer under writeback.
> +	 *
> +	 * If we have a NOWAIT request and we have any pages in the range that
> +	 * are locked, likely due to compression still in progress, we don't want
> +	 * to block on page locks. We also don't want to block on pages marked as
> +	 * dirty or under writeback (same as for the non-compression case).
> +	 * iomap_dio_rw() did the same check, but after that and before we got
> +	 * here, mmap'ed writes may have happened or buffered reads started
> +	 * (readpage() and readahead(), which lock pages), as we haven't locked
> +	 * the file range yet.
> +	 */
> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> +		     &BTRFS_I(inode)->runtime_flags)) {
> +		if (flags & IOMAP_NOWAIT) {
> +			if (filemap_range_needs_writeback(inode->i_mapping,
> +							  lockstart, lockend))
> +				return -EAGAIN;
> +		} else {
> +			ret = filemap_fdatawrite_range(inode->i_mapping, start,
> +						       start + length - 1);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	memset(dio_data, 0, sizeof(*dio_data));
> +
> +	/*
> +	 * We always try to allocate data space and must do it before locking
> +	 * the file range, to avoid deadlocks with concurrent writes to the same
> +	 * range if the range has several extents and the writes don't expand the
> +	 * current i_size (the inode lock is taken in shared mode). If we fail to
> +	 * allocate data space here we continue and later, after locking the
> +	 * file range, we fail with ENOSPC only if we figure out we can not do a
> +	 * NOCOW write.
> +	 */
> +	if (write && !(flags & IOMAP_NOWAIT)) {
> +		ret = btrfs_check_data_free_space(BTRFS_I(inode),
> +						  &dio_data->data_reserved,
> +						  start, data_alloc_len, false);
> +		if (!ret)
> +			dio_data->data_space_reserved = true;
> +		else if (ret && !(BTRFS_I(inode)->flags &
> +				  (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
> +			goto err;
> +	}
> +
> +	/*
> +	 * If this errors out it's because we couldn't invalidate pagecache for
> +	 * this range and we need to fallback to buffered IO, or we are doing a
> +	 * NOWAIT read/write and we need to block.
> +	 */
> +	ret = lock_extent_direct(inode, lockstart, lockend, &cached_state, flags);
> +	if (ret < 0)
> +		goto err;
> +
> +	em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto unlock_err;
> +	}
> +
> +	/*
> +	 * Ok for INLINE and COMPRESSED extents we need to fallback on buffered
> +	 * io.  INLINE is special, and we could probably kludge it in here, but
> +	 * it's still buffered so for safety lets just fall back to the generic
> +	 * buffered path.
> +	 *
> +	 * For COMPRESSED we _have_ to read the entire extent in so we can
> +	 * decompress it, so there will be buffering required no matter what we
> +	 * do, so go ahead and fallback to buffered.
> +	 *
> +	 * We return -ENOTBLK because that's what makes DIO go ahead and go back
> +	 * to buffered IO.  Don't blame me, this is the price we pay for using
> +	 * the generic code.
> +	 */
> +	if (extent_map_is_compressed(em) || em->disk_bytenr == EXTENT_MAP_INLINE) {
> +		free_extent_map(em);
> +		/*
> +		 * If we are in a NOWAIT context, return -EAGAIN in order to
> +		 * fallback to buffered IO. This is not only because we can
> +		 * block with buffered IO (no support for NOWAIT semantics at
> +		 * the moment) but also to avoid returning short reads to user
> +		 * space - this happens if we were able to read some data from
> +		 * previous non-compressed extents and then when we fallback to
> +		 * buffered IO, at btrfs_file_read_iter() by calling
> +		 * filemap_read(), we fail to fault in pages for the read buffer,
> +		 * in which case filemap_read() returns a short read (the number
> +		 * of bytes previously read is > 0, so it does not return -EFAULT).
> +		 */
> +		ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
> +		goto unlock_err;
> +	}
> +
> +	len = min(len, em->len - (start - em->start));
> +
> +	/*
> +	 * If we have a NOWAIT request and the range contains multiple extents
> +	 * (or a mix of extents and holes), then we return -EAGAIN to make the
> +	 * caller fallback to a context where it can do a blocking (without
> +	 * NOWAIT) request. This way we avoid doing partial IO and returning
> +	 * success to the caller, which is not optimal for writes and for reads
> +	 * it can result in unexpected behaviour for an application.
> +	 *
> +	 * When doing a read, because we use IOMAP_DIO_PARTIAL when calling
> +	 * iomap_dio_rw(), we can end up returning less data than what the caller
> +	 * asked for, resulting in an unexpected, and incorrect, short read.
> +	 * That is, the caller asked to read N bytes and we return less than that,
> +	 * which is wrong unless we are crossing EOF. This happens if we get a
> +	 * page fault error when trying to fault in pages for the buffer that is
> +	 * associated to the struct iov_iter passed to iomap_dio_rw(), and we
> +	 * have previously submitted bios for other extents in the range, in
> +	 * which case iomap_dio_rw() may return us EIOCBQUEUED if not all of
> +	 * those bios have completed by the time we get the page fault error,
> +	 * which we return back to our caller - we should only return EIOCBQUEUED
> +	 * after we have submitted bios for all the extents in the range.
> +	 */
> +	if ((flags & IOMAP_NOWAIT) && len < length) {
> +		free_extent_map(em);
> +		ret = -EAGAIN;
> +		goto unlock_err;
> +	}
> +
> +	if (write) {
> +		ret = btrfs_get_blocks_direct_write(&em, inode, dio_data,
> +						    start, &len, flags);
> +		if (ret < 0)
> +			goto unlock_err;
> +		unlock_extents = true;
> +		/* Recalc len in case the new em is smaller than requested */
> +		len = min(len, em->len - (start - em->start));
> +		if (dio_data->data_space_reserved) {
> +			u64 release_offset;
> +			u64 release_len = 0;
> +
> +			if (dio_data->nocow_done) {
> +				release_offset = start;
> +				release_len = data_alloc_len;
> +			} else if (len < data_alloc_len) {
> +				release_offset = start + len;
> +				release_len = data_alloc_len - len;
> +			}
> +
> +			if (release_len > 0)
> +				btrfs_free_reserved_data_space(BTRFS_I(inode),
> +							       dio_data->data_reserved,
> +							       release_offset,
> +							       release_len);
> +		}
> +	} else {
> +		/*
> +		 * We need to unlock only the end area that we aren't using.
> +		 * The rest is going to be unlocked by the endio routine.
> +		 */
> +		lockstart = start + len;
> +		if (lockstart < lockend)
> +			unlock_extents = true;
> +	}
> +
> +	if (unlock_extents)
> +		unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> +			      &cached_state);
> +	else
> +		free_extent_state(cached_state);
> +
> +	/*
> +	 * Translate extent map information to iomap.
> +	 * We trim the extents (and move the addr) even though iomap code does
> +	 * that, since we have locked only the parts we are performing I/O in.
> +	 */
> +	if ((em->disk_bytenr == EXTENT_MAP_HOLE) ||
> +	    ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) {
> +		iomap->addr = IOMAP_NULL_ADDR;
> +		iomap->type = IOMAP_HOLE;
> +	} else {
> +		iomap->addr = extent_map_block_start(em) + (start - em->start);
> +		iomap->type = IOMAP_MAPPED;
> +	}
> +	iomap->offset = start;
> +	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
> +	iomap->length = len;
> +	free_extent_map(em);
> +
> +	return 0;
> +
> +unlock_err:
> +	unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> +		      &cached_state);
> +err:
> +	if (dio_data->data_space_reserved) {
> +		btrfs_free_reserved_data_space(BTRFS_I(inode),
> +					       dio_data->data_reserved,
> +					       start, data_alloc_len);
> +		extent_changeset_free(dio_data->data_reserved);
> +	}
> +
> +	return ret;
> +}
> +
> +static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> +			       ssize_t written, unsigned int flags, struct iomap *iomap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct btrfs_dio_data *dio_data = iter->private;
> +	size_t submitted = dio_data->submitted;
> +	const bool write = !!(flags & IOMAP_WRITE);
> +	int ret = 0;
> +
> +	if (!write && (iomap->type == IOMAP_HOLE)) {
> +		/* If reading from a hole, unlock and return */
> +		unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1,
> +			      NULL);
> +		return 0;
> +	}
> +
> +	if (submitted < length) {
> +		pos += submitted;
> +		length -= submitted;
> +		if (write)
> +			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> +						    pos, length, false);
> +		else
> +			unlock_extent(&BTRFS_I(inode)->io_tree, pos,
> +				      pos + length - 1, NULL);
> +		ret = -ENOTBLK;
> +	}
> +	if (write) {
> +		btrfs_put_ordered_extent(dio_data->ordered);
> +		dio_data->ordered = NULL;
> +	}
> +
> +	if (write)
> +		extent_changeset_free(dio_data->data_reserved);
> +	return ret;
> +}
> +
> +static void btrfs_dio_end_io(struct btrfs_bio *bbio)
> +{
> +	struct btrfs_dio_private *dip =
> +		container_of(bbio, struct btrfs_dio_private, bbio);
> +	struct btrfs_inode *inode = bbio->inode;
> +	struct bio *bio = &bbio->bio;
> +
> +	if (bio->bi_status) {
> +		btrfs_warn(inode->root->fs_info,
> +		"direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
> +			   btrfs_ino(inode), bio->bi_opf,
> +			   dip->file_offset, dip->bytes, bio->bi_status);
> +	}
> +
> +	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
> +		btrfs_finish_ordered_extent(bbio->ordered, NULL,
> +					    dip->file_offset, dip->bytes,
> +					    !bio->bi_status);
> +	} else {
> +		unlock_extent(&inode->io_tree, dip->file_offset,
> +			      dip->file_offset + dip->bytes - 1, NULL);
> +	}
> +
> +	bbio->bio.bi_private = bbio->private;
> +	iomap_dio_bio_end_io(bio);
> +}
> +
> +static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
> +					struct btrfs_ordered_extent *ordered)
> +{
> +	u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
> +	u64 len = bbio->bio.bi_iter.bi_size;
> +	struct btrfs_ordered_extent *new;
> +	int ret;
> +
> +	/* Must always be called for the beginning of an ordered extent. */
> +	if (WARN_ON_ONCE(start != ordered->disk_bytenr))
> +		return -EINVAL;
> +
> +	/* No need to split if the ordered extent covers the entire bio. */
> +	if (ordered->disk_num_bytes == len) {
> +		refcount_inc(&ordered->refs);
> +		bbio->ordered = ordered;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Don't split the extent_map for NOCOW extents, as we're writing into
> +	 * a pre-existing one.
> +	 */
> +	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) {
> +		ret = split_extent_map(bbio->inode, bbio->file_offset,
> +				       ordered->num_bytes, len,
> +				       ordered->disk_bytenr);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	new = btrfs_split_ordered_extent(ordered, len);
> +	if (IS_ERR(new))
> +		return PTR_ERR(new);
> +	bbio->ordered = new;
> +	return 0;
> +}
> +
> +static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
> +				loff_t file_offset)
> +{
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct btrfs_dio_private *dip =
> +		container_of(bbio, struct btrfs_dio_private, bbio);
> +	struct btrfs_dio_data *dio_data = iter->private;
> +
> +	btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
> +		       btrfs_dio_end_io, bio->bi_private);
> +	bbio->inode = BTRFS_I(iter->inode);
> +	bbio->file_offset = file_offset;
> +
> +	dip->file_offset = file_offset;
> +	dip->bytes = bio->bi_iter.bi_size;
> +
> +	dio_data->submitted += bio->bi_iter.bi_size;
> +
> +	/*
> +	 * Check if we are doing a partial write.  If we are, we need to split
> +	 * the ordered extent to match the submitted bio.  Hang on to the
> +	 * remaining unfinishable ordered_extent in dio_data so that it can be
> +	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
> +	 * remaining pages is blocked on the outstanding ordered extent.
> +	 */
> +	if (iter->flags & IOMAP_WRITE) {
> +		int ret;
> +
> +		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> +		if (ret) {
> +			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> +						    file_offset, dip->bytes,
> +						    !ret);
> +			bio->bi_status = errno_to_blk_status(ret);
> +			iomap_dio_bio_end_io(bio);
> +			return;
> +		}
> +	}
> +
> +	btrfs_submit_bio(bbio, 0);
> +}
> +
> +static const struct iomap_ops btrfs_dio_iomap_ops = {
> +	.iomap_begin = btrfs_dio_iomap_begin,
> +	.iomap_end = btrfs_dio_iomap_end,
> +};
> +
> +static const struct iomap_dio_ops btrfs_dio_ops = {
> +	.submit_io = btrfs_dio_submit_io,
> +	.bio_set = &btrfs_dio_bioset,
> +};
> +
> +static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
> +			      size_t done_before)
> +{
> +	struct btrfs_dio_data data = { 0 };
> +
> +	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			    IOMAP_DIO_PARTIAL, &data, done_before);
> +}
> +
> +static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
> +					 size_t done_before)
> +{
> +	struct btrfs_dio_data data = { 0 };
> +
> +	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			      IOMAP_DIO_PARTIAL, &data, done_before);
> +}
> +
> +static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
> +			       const struct iov_iter *iter, loff_t offset)
> +{
> +	const u32 blocksize_mask = fs_info->sectorsize - 1;
> +
> +	if (offset & blocksize_mask)
> +		return -EINVAL;
> +
> +	if (iov_iter_alignment(iter) & blocksize_mask)
> +		return -EINVAL;
> +
> +	return 0;
> +}
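
Small aside for readers following along: check_direct_IO() above is what
sends misaligned direct IO to the buffered path (see the "goto buffered"
in btrfs_direct_write() right below). From userspace that means an
O_DIRECT caller has to align both the file offset and the memory buffer
to the filesystem sector size. An illustrative userspace sketch, not
btrfs-specific (assumes a 4096 byte sectorsize; statx() can report the
real one):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* Read the first 4 KiB of a file with O_DIRECT. */
	static ssize_t dio_read_4k(const char *path, void **bufp)
	{
		void *buf = NULL;
		int fd = open(path, O_RDONLY | O_DIRECT);
		ssize_t n = -1;

		if (fd < 0)
			return -1;
		/* Buffer aligned to 4096, offset 0 is aligned too. */
		if (posix_memalign(&buf, 4096, 4096) == 0) {
			n = pread(fd, buf, 4096, 0);
			*bufp = buf;
		}
		close(fd);
		return n;
	}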
> +
> +ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	loff_t pos;
> +	ssize_t written = 0;
> +	ssize_t written_buffered;
> +	size_t prev_left = 0;
> +	loff_t endbyte;
> +	ssize_t ret;
> +	unsigned int ilock_flags = 0;
> +	struct iomap_dio *dio;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		ilock_flags |= BTRFS_ILOCK_TRY;
> +
> +	/*
> +	 * If the write DIO is within EOF, use a shared lock and also only if
> +	 * security bits will likely not be dropped by file_remove_privs() called
> +	 * from btrfs_write_check(). Either will need to be rechecked after the
> +	 * lock was acquired.
> +	 */
> +	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
> +		ilock_flags |= BTRFS_ILOCK_SHARED;
> +
> +relock:
> +	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* Shared lock cannot be used with security bits set. */
> +	if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) {
> +		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +		ilock_flags &= ~BTRFS_ILOCK_SHARED;
> +		goto relock;
> +	}
> +
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0) {
> +		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +		return ret;
> +	}
> +
> +	ret = btrfs_write_check(iocb, from, ret);
> +	if (ret < 0) {
> +		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +		goto out;
> +	}
> +
> +	pos = iocb->ki_pos;
> +	/*
> +	 * Re-check since file size may have changed just before taking the
> +	 * lock or pos may have changed because of O_APPEND in generic_write_checks()
> +	 */
> +	if ((ilock_flags & BTRFS_ILOCK_SHARED) &&
> +	    pos + iov_iter_count(from) > i_size_read(inode)) {
> +		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +		ilock_flags &= ~BTRFS_ILOCK_SHARED;
> +		goto relock;
> +	}
> +
> +	if (check_direct_IO(fs_info, from, pos)) {
> +		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +		goto buffered;
> +	}
> +
> +	/*
> +	 * The iov_iter can be mapped to the same file range we are writing to.
> +	 * If that's the case, then we will deadlock in the iomap code, because
> +	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
> +	 * an ordered extent, and after that it will fault in the pages that the
> +	 * iov_iter refers to. During the fault in we end up in the readahead
> +	 * pages code (starting at btrfs_readahead()), which will lock the range,
> +	 * find that ordered extent and then wait for it to complete (at
> +	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> +	 * obviously the ordered extent can never complete as we didn't submit
> +	 * yet the respective bio(s). This always happens when the buffer is
> +	 * memory mapped to the same file range, since the iomap DIO code always
> +	 * invalidates pages in the target file range (after starting and waiting
> +	 * for any writeback).
> +	 *
> +	 * So here we disable page faults in the iov_iter and then retry if we
> +	 * got -EFAULT, faulting in the pages before the retry.
> +	 */
> +	from->nofault = true;
> +	dio = btrfs_dio_write(iocb, from, written);
> +	from->nofault = false;
> +
> +	/*
> +	 * iomap_dio_complete() will call btrfs_sync_file() if we have a dsync
> +	 * iocb, and that needs to lock the inode. So unlock it before calling
> +	 * iomap_dio_complete() to avoid a deadlock.
> +	 */
> +	btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +
> +	if (IS_ERR_OR_NULL(dio))
> +		ret = PTR_ERR_OR_ZERO(dio);
> +	else
> +		ret = iomap_dio_complete(dio);
> +
> +	/* No increment (+=) because iomap returns a cumulative value. */
> +	if (ret > 0)
> +		written = ret;
> +
> +	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
> +		const size_t left = iov_iter_count(from);
> +		/*
> +		 * We have more data left to write. Try to fault in as many as
> +		 * possible of the remainder pages and retry. We do this without
> +		 * releasing and locking again the inode, to prevent races with
> +		 * truncate.
> +		 *
> +		 * Also, in case the iov refers to pages in the file range of the
> +		 * file we want to write to (due to a mmap), we could enter an
> +		 * infinite loop if we retry after faulting the pages in, since
> +		 * iomap will invalidate any pages in the range early on, before
> +		 * it tries to fault in the pages of the iov. So we keep track of
> +		 * how much was left of iov in the previous EFAULT and fallback
> +		 * to buffered IO in case we haven't made any progress.
> +		 */
> +		if (left == prev_left) {
> +			ret = -ENOTBLK;
> +		} else {
> +			fault_in_iov_iter_readable(from, left);
> +			prev_left = left;
> +			goto relock;
> +		}
> +	}
> +
> +	/*
> +	 * If 'ret' is -ENOTBLK or we have not written all data, then it means
> +	 * we must fallback to buffered IO.
> +	 */
> +	if ((ret < 0 && ret != -ENOTBLK) || !iov_iter_count(from))
> +		goto out;
> +
> +buffered:
> +	/*
> +	 * If we are in a NOWAIT context, then return -EAGAIN to signal the caller
> +	 * it must retry the operation in a context where blocking is acceptable,
> +	 * because even if we end up not blocking during the buffered IO attempt
> +	 * below, we will block when flushing and waiting for the IO.
> +	 */
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
> +	pos = iocb->ki_pos;
> +	written_buffered = btrfs_buffered_write(iocb, from);
> +	if (written_buffered < 0) {
> +		ret = written_buffered;
> +		goto out;
> +	}
> +	/*
> +	 * Ensure all data is persisted. We want the next direct IO read to be
> +	 * able to read what was just written.
> +	 */
> +	endbyte = pos + written_buffered - 1;
> +	ret = btrfs_fdatawrite_range(BTRFS_I(inode), pos, endbyte);
> +	if (ret)
> +		goto out;
> +	ret = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
> +	if (ret)
> +		goto out;
> +	written += written_buffered;
> +	iocb->ki_pos = pos + written_buffered;
> +	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> +				 endbyte >> PAGE_SHIFT);
> +out:
> +	return ret < 0 ? ret : written;
> +}
> +
> +static int check_direct_read(struct btrfs_fs_info *fs_info,
> +			     const struct iov_iter *iter, loff_t offset)
> +{
> +	int ret;
> +	int i, seg;
> +
> +	ret = check_direct_IO(fs_info, iter, offset);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!iter_is_iovec(iter))
> +		return 0;
> +
> +	for (seg = 0; seg < iter->nr_segs; seg++) {
> +		for (i = seg + 1; i < iter->nr_segs; i++) {
> +			const struct iovec *iov1 = iter_iov(iter) + seg;
> +			const struct iovec *iov2 = iter_iov(iter) + i;
> +
> +			if (iov1->iov_base == iov2->iov_base)
> +				return -EINVAL;
> +		}
> +	}
> +	return 0;
> +}
> +
> +ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	size_t prev_left = 0;
> +	ssize_t read = 0;
> +	ssize_t ret;
> +
> +	if (fsverity_active(inode))
> +		return 0;
> +
> +	if (check_direct_read(inode_to_fs_info(inode), to, iocb->ki_pos))
> +		return 0;
> +
> +	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
> +again:
> +	/*
> +	 * This is similar to what we do for direct IO writes, see the comment
> +	 * at btrfs_direct_write(), but we also disable page faults in addition
> +	 * to disabling them only at the iov_iter level. This is because when
> +	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> +	 * which can still trigger page fault ins despite having set ->nofault
> +	 * to true of our 'to' iov_iter.
> +	 *
> +	 * The difference to direct IO writes is that we deadlock when trying
> +	 * to lock the extent range in the inode's tree during the page reads
> +	 * triggered by the fault in (while for writes it is due to waiting for
> +	 * our own ordered extent). This is because for direct IO reads,
> +	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
> +	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
> +	 */
> +	pagefault_disable();
> +	to->nofault = true;
> +	ret = btrfs_dio_read(iocb, to, read);
> +	to->nofault = false;
> +	pagefault_enable();
> +
> +	/* No increment (+=) because iomap returns a cumulative value. */
> +	if (ret > 0)
> +		read = ret;
> +
> +	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +		const size_t left = iov_iter_count(to);
> +
> +		if (left == prev_left) {
> +			/*
> +			 * We didn't make any progress since the last attempt,
> +			 * fallback to a buffered read for the remainder of the
> +			 * range. This is just to avoid any possibility of looping
> +			 * for too long.
> +			 */
> +			ret = read;
> +		} else {
> +			/*
> +			 * We made some progress since the last retry or this is
> +			 * the first time we are retrying. Fault in as many pages
> +			 * as possible and retry.
> +			 */
> +			fault_in_iov_iter_writeable(to, left);
> +			prev_left = left;
> +			goto again;
> +		}
> +	}
> +	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
> +	return ret < 0 ? ret : read;
> +}
> +
> +int __init btrfs_init_dio(void)
> +{
> +	if (bioset_init(&btrfs_dio_bioset, BIO_POOL_SIZE,
> +			offsetof(struct btrfs_dio_private, bbio.bio),
> +			BIOSET_NEED_BVECS))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void __cold btrfs_destroy_dio(void)
> +{
> +	bioset_exit(&btrfs_dio_bioset);
> +}
> diff --git a/fs/btrfs/direct-io.h b/fs/btrfs/direct-io.h
> new file mode 100644
> index 000000000000..3dc3ea926afe
> --- /dev/null
> +++ b/fs/btrfs/direct-io.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef BTRFS_DIRECT_IO_H
> +#define BTRFS_DIRECT_IO_H
> +
> +#include <linux/types.h>
> +
> +int __init btrfs_init_dio(void);
> +void __cold btrfs_destroy_dio(void);
> +
> +ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from);
> +ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to);
> +
> +#endif /* BTRFS_DIRECT_IO_H */
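
For context, the consumers of this new header stay in file.c:
btrfs_file_read_iter() and btrfs_file_write_iter() keep dispatching to
these helpers. From memory, the read side looks roughly like the sketch
below (simplified, existing code, not part of this patch):

	static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		ssize_t ret = 0;

		if (iocb->ki_flags & IOCB_DIRECT) {
			ret = btrfs_direct_read(iocb, to);
			/* Done, failed, or nothing left for the buffered path. */
			if (ret < 0 || !iov_iter_count(to) ||
			    iocb->ki_pos >= i_size_read(file_inode(iocb->ki_filp)))
				return ret;
		}
		/* Buffered path picks up whatever DIO did not cover. */
		return filemap_read(iocb, to, ret);
	}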
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5834d452677f..21381de906f6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -17,8 +17,8 @@
>  #include <linux/uio.h>
>  #include <linux/iversion.h>
>  #include <linux/fsverity.h>
> -#include <linux/iomap.h>
>  #include "ctree.h"
> +#include "direct-io.h"
>  #include "disk-io.h"
>  #include "transaction.h"
>  #include "btrfs_inode.h"
> @@ -1140,8 +1140,7 @@ static void update_time_for_write(struct inode *inode)
>  	inode_inc_iversion(inode);
>  }
>
> -static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
> -			     size_t count)
> +int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from, size_t count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
> @@ -1187,8 +1186,7 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
>  	return 0;
>  }
>
> -static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> -					     struct iov_iter *i)
> +ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
>  {
>  	struct file *file = iocb->ki_filp;
>  	loff_t pos;
> @@ -1451,194 +1449,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>  	return num_written ? num_written : ret;
>  }
>
> -static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
> -			       const struct iov_iter *iter, loff_t offset)
> -{
> -	const u32 blocksize_mask = fs_info->sectorsize - 1;
> -
> -	if (offset & blocksize_mask)
> -		return -EINVAL;
> -
> -	if (iov_iter_alignment(iter) & blocksize_mask)
> -		return -EINVAL;
> -
> -	return 0;
> -}
> -
> -static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> -{
> -	struct file *file = iocb->ki_filp;
> -	struct inode *inode = file_inode(file);
> -	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> -	loff_t pos;
> -	ssize_t written = 0;
> -	ssize_t written_buffered;
> -	size_t prev_left = 0;
> -	loff_t endbyte;
> -	ssize_t ret;
> -	unsigned int ilock_flags = 0;
> -	struct iomap_dio *dio;
> -
> -	if (iocb->ki_flags & IOCB_NOWAIT)
> -		ilock_flags |= BTRFS_ILOCK_TRY;
> -
> -	/*
> -	 * If the write DIO is within EOF, use a shared lock and also only if
> -	 * security bits will likely not be dropped by file_remove_privs() called
> -	 * from btrfs_write_check(). Either will need to be rechecked after the
> -	 * lock was acquired.
> -	 */
> -	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
> -		ilock_flags |= BTRFS_ILOCK_SHARED;
> -
> -relock:
> -	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
> -	if (ret < 0)
> -		return ret;
> -
> -	/* Shared lock cannot be used with security bits set. */
> -	if ((ilock_flags & BTRFS_ILOCK_SHARED) && !IS_NOSEC(inode)) {
> -		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -		ilock_flags &= ~BTRFS_ILOCK_SHARED;
> -		goto relock;
> -	}
> -
> -	ret = generic_write_checks(iocb, from);
> -	if (ret <= 0) {
> -		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -		return ret;
> -	}
> -
> -	ret = btrfs_write_check(iocb, from, ret);
> -	if (ret < 0) {
> -		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -		goto out;
> -	}
> -
> -	pos = iocb->ki_pos;
> -	/*
> -	 * Re-check since file size may have changed just before taking the
> -	 * lock or pos may have changed because of O_APPEND in generic_write_checks()
> -	 */
> -	if ((ilock_flags & BTRFS_ILOCK_SHARED) &&
> -	    pos + iov_iter_count(from) > i_size_read(inode)) {
> -		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -		ilock_flags &= ~BTRFS_ILOCK_SHARED;
> -		goto relock;
> -	}
> -
> -	if (check_direct_IO(fs_info, from, pos)) {
> -		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -		goto buffered;
> -	}
> -
> -	/*
> -	 * The iov_iter can be mapped to the same file range we are writing to.
> -	 * If that's the case, then we will deadlock in the iomap code, because
> -	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
> -	 * an ordered extent, and after that it will fault in the pages that the
> -	 * iov_iter refers to. During the fault in we end up in the readahead
> -	 * pages code (starting at btrfs_readahead()), which will lock the range,
> -	 * find that ordered extent and then wait for it to complete (at
> -	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> -	 * obviously the ordered extent can never complete as we didn't submit
> -	 * yet the respective bio(s). This always happens when the buffer is
> -	 * memory mapped to the same file range, since the iomap DIO code always
> -	 * invalidates pages in the target file range (after starting and waiting
> -	 * for any writeback).
> -	 *
> -	 * So here we disable page faults in the iov_iter and then retry if we
> -	 * got -EFAULT, faulting in the pages before the retry.
> -	 */
> -	from->nofault = true;
> -	dio = btrfs_dio_write(iocb, from, written);
> -	from->nofault = false;
> -
> -	/*
> -	 * iomap_dio_complete() will call btrfs_sync_file() if we have a dsync
> -	 * iocb, and that needs to lock the inode. So unlock it before calling
> -	 * iomap_dio_complete() to avoid a deadlock.
> -	 */
> -	btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> -
> -	if (IS_ERR_OR_NULL(dio))
> -		ret = PTR_ERR_OR_ZERO(dio);
> -	else
> -		ret = iomap_dio_complete(dio);
> -
> -	/* No increment (+=) because iomap returns a cumulative value. */
> -	if (ret > 0)
> -		written = ret;
> -
> -	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
> -		const size_t left = iov_iter_count(from);
> -		/*
> -		 * We have more data left to write. Try to fault in as many as
> -		 * possible of the remainder pages and retry. We do this without
> -		 * releasing and locking again the inode, to prevent races with
> -		 * truncate.
> -		 *
> -		 * Also, in case the iov refers to pages in the file range of the
> -		 * file we want to write to (due to a mmap), we could enter an
> -		 * infinite loop if we retry after faulting the pages in, since
> -		 * iomap will invalidate any pages in the range early on, before
> -		 * it tries to fault in the pages of the iov. So we keep track of
> -		 * how much was left of iov in the previous EFAULT and fallback
> -		 * to buffered IO in case we haven't made any progress.
> -		 */
> -		if (left == prev_left) {
> -			ret = -ENOTBLK;
> -		} else {
> -			fault_in_iov_iter_readable(from, left);
> -			prev_left = left;
> -			goto relock;
> -		}
> -	}
> -
> -	/*
> -	 * If 'ret' is -ENOTBLK or we have not written all data, then it means
> -	 * we must fallback to buffered IO.
> -	 */
> -	if ((ret < 0 && ret != -ENOTBLK) || !iov_iter_count(from))
> -		goto out;
> -
> -buffered:
> -	/*
> -	 * If we are in a NOWAIT context, then return -EAGAIN to signal the caller
> -	 * it must retry the operation in a context where blocking is acceptable,
> -	 * because even if we end up not blocking during the buffered IO attempt
> -	 * below, we will block when flushing and waiting for the IO.
> -	 */
> -	if (iocb->ki_flags & IOCB_NOWAIT) {
> -		ret = -EAGAIN;
> -		goto out;
> -	}
> -
> -	pos = iocb->ki_pos;
> -	written_buffered = btrfs_buffered_write(iocb, from);
> -	if (written_buffered < 0) {
> -		ret = written_buffered;
> -		goto out;
> -	}
> -	/*
> -	 * Ensure all data is persisted. We want the next direct IO read to be
> -	 * able to read what was just written.
> -	 */
> -	endbyte = pos + written_buffered - 1;
> -	ret = btrfs_fdatawrite_range(BTRFS_I(inode), pos, endbyte);
> -	if (ret)
> -		goto out;
> -	ret = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
> -	if (ret)
> -		goto out;
> -	written += written_buffered;
> -	iocb->ki_pos = pos + written_buffered;
> -	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> -				 endbyte >> PAGE_SHIFT);
> -out:
> -	return ret < 0 ? ret : written;
> -}
> -
>  static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>  				   const struct btrfs_ioctl_encoded_io_args *encoded)
>  {
> @@ -3914,97 +3724,6 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>  	return generic_file_open(inode, filp);
>  }
>
> -static int check_direct_read(struct btrfs_fs_info *fs_info,
> -			     const struct iov_iter *iter, loff_t offset)
> -{
> -	int ret;
> -	int i, seg;
> -
> -	ret = check_direct_IO(fs_info, iter, offset);
> -	if (ret < 0)
> -		return ret;
> -
> -	if (!iter_is_iovec(iter))
> -		return 0;
> -
> -	for (seg = 0; seg < iter->nr_segs; seg++) {
> -		for (i = seg + 1; i < iter->nr_segs; i++) {
> -			const struct iovec *iov1 = iter_iov(iter) + seg;
> -			const struct iovec *iov2 = iter_iov(iter) + i;
> -
> -			if (iov1->iov_base == iov2->iov_base)
> -				return -EINVAL;
> -		}
> -	}
> -	return 0;
> -}
> -
> -static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> -{
> -	struct inode *inode = file_inode(iocb->ki_filp);
> -	size_t prev_left = 0;
> -	ssize_t read = 0;
> -	ssize_t ret;
> -
> -	if (fsverity_active(inode))
> -		return 0;
> -
> -	if (check_direct_read(inode_to_fs_info(inode), to, iocb->ki_pos))
> -		return 0;
> -
> -	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
> -again:
> -	/*
> -	 * This is similar to what we do for direct IO writes, see the comment
> -	 * at btrfs_direct_write(), but we also disable page faults in addition
> -	 * to disabling them only at the iov_iter level. This is because when
> -	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> -	 * which can still trigger page fault ins despite having set ->nofault
> -	 * to true of our 'to' iov_iter.
> -	 *
> -	 * The difference to direct IO writes is that we deadlock when trying
> -	 * to lock the extent range in the inode's tree during the page reads
> -	 * triggered by the fault in (while for writes it is due to waiting for
> -	 * our own ordered extent). This is because for direct IO reads,
> -	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
> -	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
> -	 */
> -	pagefault_disable();
> -	to->nofault = true;
> -	ret = btrfs_dio_read(iocb, to, read);
> -	to->nofault = false;
> -	pagefault_enable();
> -
> -	/* No increment (+=) because iomap returns a cumulative value. */
> -	if (ret > 0)
> -		read = ret;
> -
> -	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> -		const size_t left = iov_iter_count(to);
> -
> -		if (left == prev_left) {
> -			/*
> -			 * We didn't make any progress since the last attempt,
> -			 * fallback to a buffered read for the remainder of the
> -			 * range. This is just to avoid any possibility of looping
> -			 * for too long.
> -			 */
> -			ret = read;
> -		} else {
> -			/*
> -			 * We made some progress since the last retry or this is
> -			 * the first time we are retrying. Fault in as many pages
> -			 * as possible and retry.
> -			 */
> -			fault_in_iov_iter_writeable(to, left);
> -			prev_left = left;
> -			goto again;
> -		}
> -	}
> -	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_SHARED);
> -	return ret < 0 ? ret : read;
> -}
> -
>  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	ssize_t ret = 0;
> diff --git a/fs/btrfs/file.h b/fs/btrfs/file.h
> index ce93ed7083ab..912254e653cf 100644
> --- a/fs/btrfs/file.h
> +++ b/fs/btrfs/file.h
> @@ -44,5 +44,7 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
>  bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
>  				  struct extent_state **cached_state,
>  				  u64 *delalloc_start_ret, u64 *delalloc_end_ret);
> +int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from, size_t count);
> +ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i);
>
>  #endif
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d6c43120c5d3..41a3e3e73623 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -77,25 +77,6 @@ struct btrfs_iget_args {
>  	struct btrfs_root *root;
>  };
>
> -struct btrfs_dio_data {
> -	ssize_t submitted;
> -	struct extent_changeset *data_reserved;
> -	struct btrfs_ordered_extent *ordered;
> -	bool data_space_reserved;
> -	bool nocow_done;
> -};
> -
> -struct btrfs_dio_private {
> -	/* Range of I/O */
> -	u64 file_offset;
> -	u32 bytes;
> -
> -	/* This must be last */
> -	struct btrfs_bio bbio;
> -};
> -
> -static struct bio_set btrfs_dio_bioset;
> -
>  struct btrfs_rename_ctx {
>  	/* Output field. Stores the index number of the old directory entry. */
>  	u64 index;
> @@ -138,9 +119,6 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
>  				     struct page *locked_page, u64 start,
>  				     u64 end, struct writeback_control *wbc,
>  				     bool pages_dirty);
> -static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -				       const struct btrfs_file_extent *file_extent,
> -				       int type);
>
>  static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>  					  u64 root, void *warn_ctx)
> @@ -1205,7 +1183,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>  	file_extent.offset = 0;
>  	file_extent.compression = async_extent->compress_type;
>
> -	em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
> +	em = btrfs_create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
>  	if (IS_ERR(em)) {
>  		ret = PTR_ERR(em);
>  		goto out_free_reserve;
> @@ -1257,8 +1235,8 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>  	kfree(async_extent);
>  }
>
> -static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> -				      u64 num_bytes)
> +u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> +				     u64 num_bytes)
>  {
>  	struct extent_map_tree *em_tree = &inode->extent_tree;
>  	struct extent_map *em;
> @@ -1368,7 +1346,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		}
>  	}
>
> -	alloc_hint = get_extent_allocation_hint(inode, start, num_bytes);
> +	alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
>
>  	/*
>  	 * Relocation relies on the relocated extents to have exactly the same
> @@ -1435,7 +1413,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	lock_extent(&inode->io_tree, start, start + ram_size - 1,
>  		    &cached);
>
> -	em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_REGULAR);
> +	em = btrfs_create_io_em(inode, start, &file_extent,
> +				BTRFS_ORDERED_REGULAR);
>  	if (IS_ERR(em)) {
>  		unlock_extent(&inode->io_tree, start,
>  			      start + ram_size - 1, &cached);
> @@ -2152,8 +2131,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
> @@ -2152,8 +2131,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                  if (is_prealloc) {
>                          struct extent_map *em;
>  
> -                        em = create_io_em(inode, cur_offset, &nocow_args.file_extent,
> -                                          BTRFS_ORDERED_PREALLOC);
> +                        em = btrfs_create_io_em(inode, cur_offset,
> +                                                &nocow_args.file_extent,
> +                                                BTRFS_ORDERED_PREALLOC);
>                          if (IS_ERR(em)) {
>                                  unlock_extent(&inode->io_tree, cur_offset,
>                                                nocow_end, &cached_state);
> @@ -2582,44 +2562,6 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>          }
>  }
>  
> -static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
> -                                        struct btrfs_ordered_extent *ordered)
> -{
> -        u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
> -        u64 len = bbio->bio.bi_iter.bi_size;
> -        struct btrfs_ordered_extent *new;
> -        int ret;
> -
> -        /* Must always be called for the beginning of an ordered extent. */
> -        if (WARN_ON_ONCE(start != ordered->disk_bytenr))
> -                return -EINVAL;
> -
> -        /* No need to split if the ordered extent covers the entire bio. */
> -        if (ordered->disk_num_bytes == len) {
> -                refcount_inc(&ordered->refs);
> -                bbio->ordered = ordered;
> -                return 0;
> -        }
> -
> -        /*
> -         * Don't split the extent_map for NOCOW extents, as we're writing into
> -         * a pre-existing one.
> -         */
> -        if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) {
> -                ret = split_extent_map(bbio->inode, bbio->file_offset,
> -                                       ordered->num_bytes, len,
> -                                       ordered->disk_bytenr);
> -                if (ret)
> -                        return ret;
> -        }
> -
> -        new = btrfs_split_ordered_extent(ordered, len);
> -        if (IS_ERR(new))
> -                return PTR_ERR(new);
> -        bbio->ordered = new;
> -        return 0;
> -}
> -
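
Since btrfs_extract_ordered_extent() moves verbatim, a quick illustration of
the case it handles may help reviewers: a partial direct write submits a bio
that covers only the front of an ordered extent, so exactly the bio-sized
piece is split off to complete on its own. A userspace toy of just that
arithmetic (my own illustration, made-up numbers, not the kernel logic):

#include <stdio.h>

struct toy_ordered {
        unsigned long long disk_bytenr;
        unsigned long long disk_num_bytes;
};

/* Split the front 'len' bytes off 'oe'; the bio owns the returned piece. */
static struct toy_ordered split_front(struct toy_ordered *oe,
                                      unsigned long long len)
{
        struct toy_ordered front = { oe->disk_bytenr, len };

        oe->disk_bytenr += len;
        oe->disk_num_bytes -= len;
        return front;
}

int main(void)
{
        struct toy_ordered oe = { 1048576, 1048576 };   /* 1 MiB at 1 MiB */
        unsigned long long bio_len = 262144;            /* a 256 KiB bio */

        if (oe.disk_num_bytes == bio_len) {
                puts("bio covers the whole ordered extent: no split needed");
        } else {
                struct toy_ordered front = split_front(&oe, bio_len);

                printf("bio completes [%llu, +%llu); remainder [%llu, +%llu)\n",
                       front.disk_bytenr, front.disk_num_bytes,
                       oe.disk_bytenr, oe.disk_num_bytes);
        }
        return 0;
}
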
>  /*
>   * given a list of ordered sums record them in the inode.  This happens
>   * at IO completion time based on sums calculated at bio submission time.
> @@ -6995,81 +6937,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>          return em;
>  }
>  
> -static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
> -                                                  struct btrfs_dio_data *dio_data,
> -                                                  const u64 start,
> -                                                  const struct btrfs_file_extent *file_extent,
> -                                                  const int type)
> -{
> -        struct extent_map *em = NULL;
> -        struct btrfs_ordered_extent *ordered;
> -
> -        if (type != BTRFS_ORDERED_NOCOW) {
> -                em = create_io_em(inode, start, file_extent, type);
> -                if (IS_ERR(em))
> -                        goto out;
> -        }
> -
> -        ordered = btrfs_alloc_ordered_extent(inode, start, file_extent,
> -                                             (1 << type) |
> -                                             (1 << BTRFS_ORDERED_DIRECT));
> -        if (IS_ERR(ordered)) {
> -                if (em) {
> -                        free_extent_map(em);
> -                        btrfs_drop_extent_map_range(inode, start,
> -                                        start + file_extent->num_bytes - 1, false);
> -                }
> -                em = ERR_CAST(ordered);
> -        } else {
> -                ASSERT(!dio_data->ordered);
> -                dio_data->ordered = ordered;
> -        }
> - out:
> -
> -        return em;
> -}
> -
> -static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
> -                                                  struct btrfs_dio_data *dio_data,
> -                                                  u64 start, u64 len)
> -{
> -        struct btrfs_root *root = inode->root;
> -        struct btrfs_fs_info *fs_info = root->fs_info;
> -        struct btrfs_file_extent file_extent;
> -        struct extent_map *em;
> -        struct btrfs_key ins;
> -        u64 alloc_hint;
> -        int ret;
> -
> -        alloc_hint = get_extent_allocation_hint(inode, start, len);
> -again:
> -        ret = btrfs_reserve_extent(root, len, len, fs_info->sectorsize,
> -                                   0, alloc_hint, &ins, 1, 1);
> -        if (ret == -EAGAIN) {
> -                ASSERT(btrfs_is_zoned(fs_info));
> -                wait_on_bit_io(&inode->root->fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
> -                               TASK_UNINTERRUPTIBLE);
> -                goto again;
> -        }
> -        if (ret)
> -                return ERR_PTR(ret);
> -
> -        file_extent.disk_bytenr = ins.objectid;
> -        file_extent.disk_num_bytes = ins.offset;
> -        file_extent.num_bytes = ins.offset;
> -        file_extent.ram_bytes = ins.offset;
> -        file_extent.offset = 0;
> -        file_extent.compression = BTRFS_COMPRESS_NONE;
> -        em = btrfs_create_dio_extent(inode, dio_data, start, &file_extent,
> -                                     BTRFS_ORDERED_REGULAR);
> -        btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> -        if (IS_ERR(em))
> -                btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> -
> -        return em;
> -}
> -
>  static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
>  {
>          struct btrfs_block_group *block_group;
> @@ -7200,103 +7067,10 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>          return ret;
>  }
>  
> -static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
> -                              struct extent_state **cached_state,
> -                              unsigned int iomap_flags)
> -{
> -        const bool writing = (iomap_flags & IOMAP_WRITE);
> -        const bool nowait = (iomap_flags & IOMAP_NOWAIT);
> -        struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> -        struct btrfs_ordered_extent *ordered;
> -        int ret = 0;
> -
> -        while (1) {
> -                if (nowait) {
> -                        if (!try_lock_extent(io_tree, lockstart, lockend,
> -                                             cached_state))
> -                                return -EAGAIN;
> -                } else {
> -                        lock_extent(io_tree, lockstart, lockend, cached_state);
> -                }
> -                /*
> -                 * We're concerned with the entire range that we're going to be
> -                 * doing DIO to, so we need to make sure there are no ordered
> -                 * extents in this range.
> -                 */
> -                ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), lockstart,
> -                                                     lockend - lockstart + 1);
> -
> -                /*
> -                 * We need to make sure there are no buffered pages in this
> -                 * range either; we could have raced between the invalidate in
> -                 * generic_file_direct_write and locking the extent. The
> -                 * invalidate needs to happen so that reads after a write do not
> -                 * get stale data.
> -                 */
> -                if (!ordered &&
> -                    (!writing || !filemap_range_has_page(inode->i_mapping,
> -                                                         lockstart, lockend)))
> -                        break;
> -
> -                unlock_extent(io_tree, lockstart, lockend, cached_state);
> -
> -                if (ordered) {
> -                        if (nowait) {
> -                                btrfs_put_ordered_extent(ordered);
> -                                ret = -EAGAIN;
> -                                break;
> -                        }
> -                        /*
> -                         * If we are doing a DIO read and the ordered extent we
> -                         * found is for a buffered write, we cannot wait for it
> -                         * to complete and retry, because if we do so we can
> -                         * deadlock with concurrent buffered writes on page
> -                         * locks. This happens only if our DIO read covers more
> -                         * than one extent map, if at this point we have already
> -                         * created an ordered extent for a previous extent map
> -                         * and locked its range in the inode's io tree, and a
> -                         * concurrent write against that previous extent map's
> -                         * range and this range started (we unlock the ranges
> -                         * in the io tree only when the bios complete and
> -                         * buffered writes always lock pages before attempting
> -                         * to lock a range in the io tree).
> -                         */
> -                        if (writing ||
> -                            test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> -                                btrfs_start_ordered_extent(ordered);
> -                        else
> -                                ret = nowait ? -EAGAIN : -ENOTBLK;
> -                        btrfs_put_ordered_extent(ordered);
> -                } else {
> -                        /*
> -                         * We could trigger writeback for this range (and wait
> -                         * for it to complete) and then invalidate the pages for
> -                         * this range (through invalidate_inode_pages2_range()),
> -                         * but that can lead us to a deadlock with a concurrent
> -                         * call to readahead (a buffered read or a defrag call
> -                         * triggered a readahead) on a page lock due to an
> -                         * ordered dio extent we created before but did not yet
> -                         * have a corresponding bio submitted (hence it cannot
> -                         * complete), which makes readahead wait for that
> -                         * ordered extent to complete while holding a lock on
> -                         * that page.
> -                         */
> -                        ret = nowait ? -EAGAIN : -ENOTBLK;
> -                }
> -
> -                if (ret)
> -                        break;
> -
> -                cond_resched();
> -        }
> -
> -        return ret;
> -}
> -
>  /* The callers of this must take lock_extent() */
> -static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -                                       const struct btrfs_file_extent *file_extent,
> -                                       int type)
> +struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
> +                                      const struct btrfs_file_extent *file_extent,
> +                                      int type)
>  {
>          struct extent_map *em;
>          int ret;
> @@ -7363,527 +7137,6 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>          return em;
>  }
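
With the static dropped from create_io_em() and get_extent_allocation_hint()
above, declarations have to land in some shared header so the new direct-io.c
can call them. The hunk adding those declarations is not quoted here, so this
is only my guess at the shape (presumably in btrfs_inode.h or the new
direct-io.h):

u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
                                     u64 num_bytes);
struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
                                      const struct btrfs_file_extent *file_extent,
                                      int type);

The btrfs_ prefix on both renames matches the usual convention for symbols
that cross file boundaries, so no complaints there.
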
> -
> -static int btrfs_get_blocks_direct_write(struct extent_map **map,
> -                                         struct inode *inode,
> -                                         struct btrfs_dio_data *dio_data,
> -                                         u64 start, u64 *lenp,
> -                                         unsigned int iomap_flags)
> -{
> -        const bool nowait = (iomap_flags & IOMAP_NOWAIT);
> -        struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> -        struct btrfs_file_extent file_extent;
> -        struct extent_map *em = *map;
> -        int type;
> -        u64 block_start;
> -        struct btrfs_block_group *bg;
> -        bool can_nocow = false;
> -        bool space_reserved = false;
> -        u64 len = *lenp;
> -        u64 prev_len;
> -        int ret = 0;
> -
> -        /*
> -         * We don't allocate a new extent in the following cases:
> -         *
> -         * 1) The inode is marked as NODATACOW. In this case we'll just use the
> -         *    existing extent.
> -         * 2) The extent is marked as PREALLOC. We're good to go here and can
> -         *    just use the extent.
> -         */
> -        if ((em->flags & EXTENT_FLAG_PREALLOC) ||
> -            ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
> -             em->disk_bytenr != EXTENT_MAP_HOLE)) {
> -                if (em->flags & EXTENT_FLAG_PREALLOC)
> -                        type = BTRFS_ORDERED_PREALLOC;
> -                else
> -                        type = BTRFS_ORDERED_NOCOW;
> -                len = min(len, em->len - (start - em->start));
> -                block_start = extent_map_block_start(em) + (start - em->start);
> -
> -                if (can_nocow_extent(inode, start, &len,
> -                                     &file_extent, false, false) == 1) {
> -                        bg = btrfs_inc_nocow_writers(fs_info, block_start);
> -                        if (bg)
> -                                can_nocow = true;
> -                }
> -        }
> -
> -        prev_len = len;
> -        if (can_nocow) {
> -                struct extent_map *em2;
> -
> -                /* We can NOCOW, so only need to reserve metadata space. */
> -                ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
> -                                                      nowait);
> -                if (ret < 0) {
> -                        /* Our caller expects us to free the input extent map. */
> -                        free_extent_map(em);
> -                        *map = NULL;
> -                        btrfs_dec_nocow_writers(bg);
> -                        if (nowait && (ret == -ENOSPC || ret == -EDQUOT))
> -                                ret = -EAGAIN;
> -                        goto out;
> -                }
> -                space_reserved = true;
> -
> -                em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start,
> -                                              &file_extent, type);
> -                btrfs_dec_nocow_writers(bg);
> -                if (type == BTRFS_ORDERED_PREALLOC) {
> -                        free_extent_map(em);
> -                        *map = em2;
> -                        em = em2;
> -                }
> -
> -                if (IS_ERR(em2)) {
> -                        ret = PTR_ERR(em2);
> -                        goto out;
> -                }
> -
> -                dio_data->nocow_done = true;
> -        } else {
> -                /* Our caller expects us to free the input extent map. */
> -                free_extent_map(em);
> -                *map = NULL;
> -
> -                if (nowait) {
> -                        ret = -EAGAIN;
> -                        goto out;
> -                }
> -
> -                /*
> -                 * If we could not allocate data space before locking the file
> -                 * range and we can't do a NOCOW write, then we have to fail.
> -                 */
> -                if (!dio_data->data_space_reserved) {
> -                        ret = -ENOSPC;
> -                        goto out;
> -                }
> -
> -                /*
> -                 * We have to COW and we have already reserved data space before,
> -                 * so now we reserve only metadata.
> -                 */
> -                ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len,
> -                                                      false);
> -                if (ret < 0)
> -                        goto out;
> -                space_reserved = true;
> -
> -                em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
> -                if (IS_ERR(em)) {
> -                        ret = PTR_ERR(em);
> -                        goto out;
> -                }
> -                *map = em;
> -                len = min(len, em->len - (start - em->start));
> -                if (len < prev_len)
> -                        btrfs_delalloc_release_metadata(BTRFS_I(inode),
> -                                                        prev_len - len, true);
> -        }
> -
> -        /*
> -         * We have created our ordered extent, so we can now release our
> -         * reservation for an outstanding extent.
> -         */
> -        btrfs_delalloc_release_extents(BTRFS_I(inode), prev_len);
> -
> -        /*
> -         * Need to update the i_size under the extent lock so buffered
> -         * readers will get the updated i_size when we unlock.
> -         */
> -        if (start + len > i_size_read(inode))
> -                i_size_write(inode, start + len);
> -out:
> -        if (ret && space_reserved) {
> -                btrfs_delalloc_release_extents(BTRFS_I(inode), len);
> -                btrfs_delalloc_release_metadata(BTRFS_I(inode), len, true);
> -        }
> -        *lenp = len;
> -        return ret;
> -}
> -
> -static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> -                                 loff_t length, unsigned int flags,
> -                                 struct iomap *iomap, struct iomap *srcmap)
> -{
> -        struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> -        struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> -        struct extent_map *em;
> -        struct extent_state *cached_state = NULL;
> -        struct btrfs_dio_data *dio_data = iter->private;
> -        u64 lockstart, lockend;
> -        const bool write = !!(flags & IOMAP_WRITE);
> -        int ret = 0;
> -        u64 len = length;
> -        const u64 data_alloc_len = length;
> -        bool unlock_extents = false;
> -
> -        /*
> -         * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> -         * we're NOWAIT we may submit a bio for a partial range and return
> -         * EIOCBQUEUED, which would result in an errant short read.
> -         *
> -         * The best way to handle this would be to allow for partial completions
> -         * of iocbs, so we could submit the partial bio, return and fault in
> -         * the rest of the pages, and then submit the IO for the rest of the
> -         * range. However we don't have that currently, so simply return
> -         * -EAGAIN at this point so that the normal path is used.
> -         */
> -        if (!write && (flags & IOMAP_NOWAIT) && length > PAGE_SIZE)
> -                return -EAGAIN;
> -
> -        /*
> -         * Cap the size of reads to that usually seen in buffered I/O as we need
> -         * to allocate a contiguous array for the checksums.
> -         */
> -        if (!write)
> -                len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_BIO_SECTORS);
> -
> -        lockstart = start;
> -        lockend = start + len - 1;
> -
> -        /*
> -         * iomap_dio_rw() only does filemap_write_and_wait_range(), which isn't
> -         * enough if we've written compressed pages to this area, so we need to
> -         * flush the dirty pages again to make absolutely sure that any
> -         * outstanding dirty pages are on disk - the first flush only starts
> -         * compression on the data, while keeping the pages locked, so by the
> -         * time the second flush returns we know bios for the compressed pages
> -         * were submitted and finished, and the pages are no longer under
> -         * writeback.
> -         *
> -         * If we have a NOWAIT request and we have any pages in the range that
> -         * are locked, likely due to compression still in progress, we don't want
> -         * to block on page locks. We also don't want to block on pages marked as
> -         * dirty or under writeback (same as for the non-compression case).
> -         * iomap_dio_rw() did the same check, but after that and before we got
> -         * here, mmap'ed writes may have happened or buffered reads started
> -         * (readpage() and readahead(), which lock pages), as we haven't locked
> -         * the file range yet.
> -         */
> -        if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> -                     &BTRFS_I(inode)->runtime_flags)) {
> -                if (flags & IOMAP_NOWAIT) {
> -                        if (filemap_range_needs_writeback(inode->i_mapping,
> -                                                          lockstart, lockend))
> -                                return -EAGAIN;
> -                } else {
> -                        ret = filemap_fdatawrite_range(inode->i_mapping, start,
> -                                                       start + length - 1);
> -                        if (ret)
> -                                return ret;
> -                }
> -        }
> -
> -        memset(dio_data, 0, sizeof(*dio_data));
> -
> -        /*
> -         * We always try to allocate data space and must do it before locking
> -         * the file range, to avoid deadlocks with concurrent writes to the same
> -         * range if the range has several extents and the writes don't expand the
> -         * current i_size (the inode lock is taken in shared mode). If we fail to
> -         * allocate data space here we continue and later, after locking the
> -         * file range, we fail with ENOSPC only if we figure out we cannot do a
> -         * NOCOW write.
> -         */
> -        if (write && !(flags & IOMAP_NOWAIT)) {
> -                ret = btrfs_check_data_free_space(BTRFS_I(inode),
> -                                                  &dio_data->data_reserved,
> -                                                  start, data_alloc_len, false);
> -                if (!ret)
> -                        dio_data->data_space_reserved = true;
> -                else if (ret && !(BTRFS_I(inode)->flags &
> -                                  (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
> -                        goto err;
> -        }
> -
> -        /*
> -         * If this errors out it's because we couldn't invalidate the pagecache
> -         * for this range and we need to fall back to buffered IO, or we are
> -         * doing a NOWAIT read/write and we need to block.
> -         */
> -        ret = lock_extent_direct(inode, lockstart, lockend, &cached_state, flags);
> -        if (ret < 0)
> -                goto err;
> -
> -        em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
> -        if (IS_ERR(em)) {
> -                ret = PTR_ERR(em);
> -                goto unlock_err;
> -        }
> -
> -        /*
> -         * Ok, for INLINE and COMPRESSED extents we need to fall back to
> -         * buffered IO.  INLINE is special, and we could probably kludge it in
> -         * here, but it's still buffered so for safety let's just fall back to
> -         * the generic buffered path.
> -         *
> -         * For COMPRESSED we _have_ to read the entire extent in so we can
> -         * decompress it, so there will be buffering required no matter what we
> -         * do, so go ahead and fall back to buffered.
> -         *
> -         * We return -ENOTBLK because that's what makes DIO go ahead and go back
> -         * to buffered IO.  Don't blame me, this is the price we pay for using
> -         * the generic code.
> -         */
> -        if (extent_map_is_compressed(em) || em->disk_bytenr == EXTENT_MAP_INLINE) {
> -                free_extent_map(em);
> -                /*
> -                 * If we are in a NOWAIT context, return -EAGAIN in order to
> -                 * fall back to buffered IO. This is not only because we can
> -                 * block with buffered IO (no support for NOWAIT semantics at
> -                 * the moment) but also to avoid returning short reads to user
> -                 * space - this happens if we were able to read some data from
> -                 * previous non-compressed extents and then when we fall back to
> -                 * buffered IO, at btrfs_file_read_iter() by calling
> -                 * filemap_read(), we fail to fault in pages for the read buffer,
> -                 * in which case filemap_read() returns a short read (the number
> -                 * of bytes previously read is > 0, so it does not return -EFAULT).
> -                 */
> -                ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
> -                goto unlock_err;
> -        }
> -
> -        len = min(len, em->len - (start - em->start));
> -
> -        /*
> -         * If we have a NOWAIT request and the range contains multiple extents
> -         * (or a mix of extents and holes), then we return -EAGAIN to make the
> -         * caller fall back to a context where it can do a blocking (without
> -         * NOWAIT) request. This way we avoid doing partial IO and returning
> -         * success to the caller, which is not optimal for writes, and for reads
> -         * it can result in unexpected behaviour for an application.
> -         *
> -         * When doing a read, because we use IOMAP_DIO_PARTIAL when calling
> -         * iomap_dio_rw(), we can end up returning less data than what the caller
> -         * asked for, resulting in an unexpected, and incorrect, short read.
> -         * That is, the caller asked to read N bytes and we return less than that,
> -         * which is wrong unless we are crossing EOF. This happens if we get a
> -         * page fault error when trying to fault in pages for the buffer that is
> -         * associated to the struct iov_iter passed to iomap_dio_rw(), and we
> -         * have previously submitted bios for other extents in the range, in
> -         * which case iomap_dio_rw() may return us EIOCBQUEUED if not all of
> -         * those bios have completed by the time we get the page fault error,
> -         * which we return back to our caller - we should only return EIOCBQUEUED
> -         * after we have submitted bios for all the extents in the range.
> -         */
> -        if ((flags & IOMAP_NOWAIT) && len < length) {
> -                free_extent_map(em);
> -                ret = -EAGAIN;
> -                goto unlock_err;
> -        }
> -
> -        if (write) {
> -                ret = btrfs_get_blocks_direct_write(&em, inode, dio_data,
> -                                                    start, &len, flags);
> -                if (ret < 0)
> -                        goto unlock_err;
> -                unlock_extents = true;
> -                /* Recalc len in case the new em is smaller than requested */
> -                len = min(len, em->len - (start - em->start));
> -                if (dio_data->data_space_reserved) {
> -                        u64 release_offset;
> -                        u64 release_len = 0;
> -
> -                        if (dio_data->nocow_done) {
> -                                release_offset = start;
> -                                release_len = data_alloc_len;
> -                        } else if (len < data_alloc_len) {
> -                                release_offset = start + len;
> -                                release_len = data_alloc_len - len;
> -                        }
> -
> -                        if (release_len > 0)
> -                                btrfs_free_reserved_data_space(BTRFS_I(inode),
> -                                                               dio_data->data_reserved,
> -                                                               release_offset,
> -                                                               release_len);
> -                }
> -        } else {
> -                /*
> -                 * We need to unlock only the end area that we aren't using.
> -                 * The rest is going to be unlocked by the endio routine.
> -                 */
> -                lockstart = start + len;
> -                if (lockstart < lockend)
> -                        unlock_extents = true;
> -        }
> -
> -        if (unlock_extents)
> -                unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> -                              &cached_state);
> -        else
> -                free_extent_state(cached_state);
> -
> -        /*
> -         * Translate extent map information to iomap.
> -         * We trim the extents (and move the addr) even though iomap code does
> -         * that, since we have locked only the parts we are performing I/O in.
> -         */
> -        if ((em->disk_bytenr == EXTENT_MAP_HOLE) ||
> -            ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) {
> -                iomap->addr = IOMAP_NULL_ADDR;
> -                iomap->type = IOMAP_HOLE;
> -        } else {
> -                iomap->addr = extent_map_block_start(em) + (start - em->start);
> -                iomap->type = IOMAP_MAPPED;
> -        }
> -        iomap->offset = start;
> -        iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
> -        iomap->length = len;
> -        free_extent_map(em);
> -
> -        return 0;
> -
> -unlock_err:
> -        unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> -                      &cached_state);
> -err:
> -        if (dio_data->data_space_reserved) {
> -                btrfs_free_reserved_data_space(BTRFS_I(inode),
> -                                               dio_data->data_reserved,
> -                                               start, data_alloc_len);
> -                extent_changeset_free(dio_data->data_reserved);
> -        }
> -
> -        return ret;
> -}
> -
> -static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> -                               ssize_t written, unsigned int flags, struct iomap *iomap)
> -{
> -        struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> -        struct btrfs_dio_data *dio_data = iter->private;
> -        size_t submitted = dio_data->submitted;
> -        const bool write = !!(flags & IOMAP_WRITE);
> -        int ret = 0;
> -
> -        if (!write && (iomap->type == IOMAP_HOLE)) {
> -                /* If reading from a hole, unlock and return */
> -                unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1,
> -                              NULL);
> -                return 0;
> -        }
> -
> -        if (submitted < length) {
> -                pos += submitted;
> -                length -= submitted;
> -                if (write)
> -                        btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> -                                                    pos, length, false);
> -                else
> -                        unlock_extent(&BTRFS_I(inode)->io_tree, pos,
> -                                      pos + length - 1, NULL);
> -                ret = -ENOTBLK;
> -        }
> -        if (write) {
> -                btrfs_put_ordered_extent(dio_data->ordered);
> -                dio_data->ordered = NULL;
> -        }
> -
> -        if (write)
> -                extent_changeset_free(dio_data->data_reserved);
> -        return ret;
> -}
> -
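
For context on what the iomap_begin/iomap_end pair quoted above ultimately
serves, here is the userspace view of a direct IO read; aligning the buffer,
offset and length is the application's job (my own minimal example, assuming
a 4096-byte sector size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        void *buf;
        ssize_t n;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        /* O_DIRECT requires an aligned buffer, offset and length. */
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        n = pread(fd, buf, 4096, 0);
        if (n < 0)
                perror("pread");        /* the -ENOTBLK fallback never reaches here */
        else
                printf("read %zd bytes via direct IO\n", n);
        close(fd);
        free(buf);
        return 0;
}
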
> -static void btrfs_dio_end_io(struct btrfs_bio *bbio)
> -{
> -        struct btrfs_dio_private *dip =
> -                container_of(bbio, struct btrfs_dio_private, bbio);
> -        struct btrfs_inode *inode = bbio->inode;
> -        struct bio *bio = &bbio->bio;
> -
> -        if (bio->bi_status) {
> -                btrfs_warn(inode->root->fs_info,
> -                "direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
> -                           btrfs_ino(inode), bio->bi_opf,
> -                           dip->file_offset, dip->bytes, bio->bi_status);
> -        }
> -
> -        if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
> -                btrfs_finish_ordered_extent(bbio->ordered, NULL,
> -                                            dip->file_offset, dip->bytes,
> -                                            !bio->bi_status);
> -        } else {
> -                unlock_extent(&inode->io_tree, dip->file_offset,
> -                              dip->file_offset + dip->bytes - 1, NULL);
> -        }
> -
> -        bbio->bio.bi_private = bbio->private;
> -        iomap_dio_bio_end_io(bio);
> -}
> -
> -static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
> -                                loff_t file_offset)
> -{
> -        struct btrfs_bio *bbio = btrfs_bio(bio);
> -        struct btrfs_dio_private *dip =
> -                container_of(bbio, struct btrfs_dio_private, bbio);
> -        struct btrfs_dio_data *dio_data = iter->private;
> -
> -        btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
> -                       btrfs_dio_end_io, bio->bi_private);
> -        bbio->inode = BTRFS_I(iter->inode);
> -        bbio->file_offset = file_offset;
> -
> -        dip->file_offset = file_offset;
> -        dip->bytes = bio->bi_iter.bi_size;
> -
> -        dio_data->submitted += bio->bi_iter.bi_size;
> -
> -        /*
> -         * Check if we are doing a partial write. If we are, we need to split
> -         * the ordered extent to match the submitted bio. Hang on to the
> -         * remaining unfinishable ordered_extent in dio_data so that it can be
> -         * cancelled in iomap_end to avoid a deadlock wherein faulting the
> -         * remaining pages is blocked on the outstanding ordered extent.
> -         */
> -        if (iter->flags & IOMAP_WRITE) {
> -                int ret;
> -
> -                ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> -                if (ret) {
> -                        btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> -                                                    file_offset, dip->bytes,
> -                                                    !ret);
> -                        bio->bi_status = errno_to_blk_status(ret);
> -                        iomap_dio_bio_end_io(bio);
> -                        return;
> -                }
> -        }
> -
> -        btrfs_submit_bio(bbio, 0);
> -}
> -
> -static const struct iomap_ops btrfs_dio_iomap_ops = {
> -        .iomap_begin = btrfs_dio_iomap_begin,
> -        .iomap_end = btrfs_dio_iomap_end,
> -};
> -
> -static const struct iomap_dio_ops btrfs_dio_ops = {
> -        .submit_io = btrfs_dio_submit_io,
> -        .bio_set = &btrfs_dio_bioset,
> -};
> -
> -ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before)
> -{
> -        struct btrfs_dio_data data = { 0 };
> -
> -        return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -                            IOMAP_DIO_PARTIAL, &data, done_before);
> -}
> -
> -struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
> -                                  size_t done_before)
> -{
> -        struct btrfs_dio_data data = { 0 };
> -
> -        return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -                              IOMAP_DIO_PARTIAL, &data, done_before);
> -}
> -
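
One asymmetry worth keeping in mind when reading the moved code: btrfs_dio_read()
returns a byte count directly, while btrfs_dio_write() hands back the struct
iomap_dio so the caller can drop the inode lock before completion. Roughly how
the write side consumes it (my paraphrase of the caller, with locking elided;
example_direct_write_tail is a hypothetical name, not part of this patch):

static ssize_t example_direct_write_tail(struct kiocb *iocb,
                                         struct iov_iter *from)
{
        struct iomap_dio *dio;

        dio = btrfs_dio_write(iocb, from, 0);
        if (IS_ERR_OR_NULL(dio))
                return PTR_ERR_OR_ZERO(dio);

        /* The inode lock can be dropped here, before completing the dio. */
        return iomap_dio_complete(dio);
}
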
> */ > rcu_barrier(); > - bioset_exit(&btrfs_dio_bioset); > kmem_cache_destroy(btrfs_inode_cachep); > } > > @@ -8514,17 +7766,9 @@ int __init btrfs_init_cachep(void) > SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT, > init_once); > if (!btrfs_inode_cachep) > - goto fail; > - > - if (bioset_init(&btrfs_dio_bioset, BIO_POOL_SIZE, > - offsetof(struct btrfs_dio_private, bbio.bio), > - BIOSET_NEED_BVECS)) > - goto fail; > + return -ENOMEM; > > return 0; > -fail: > - btrfs_destroy_cachep(); > - return -ENOMEM; > } > > static int btrfs_getattr(struct mnt_idmap *idmap, > @@ -10267,7 +9511,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb= , struct iov_iter *from, > file_extent.ram_bytes =3D ram_bytes; > file_extent.offset =3D encoded->unencoded_offset; > file_extent.compression =3D compression; > - em =3D create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESS= ED); > + em =3D btrfs_create_io_em(inode, start, &file_extent, BTRFS_ORDERED_CO= MPRESSED); > if (IS_ERR(em)) { > ret =3D PTR_ERR(em); > goto out_free_reserved; > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 715686e8d4cb..5450a01cb69c 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -34,6 +34,7 @@ > #include "disk-io.h" > #include "transaction.h" > #include "btrfs_inode.h" > +#include "direct-io.h" > #include "props.h" > #include "xattr.h" > #include "bio.h" > @@ -2489,6 +2490,9 @@ static const struct init_sequence mod_init_seq[] = =3D { > }, { > .init_func =3D btrfs_init_cachep, > .exit_func =3D btrfs_destroy_cachep, > + }, { > + .init_func =3D btrfs_init_dio, > + .exit_func =3D btrfs_destroy_dio, > }, { > .init_func =3D btrfs_transaction_init, > .exit_func =3D btrfs_transaction_exit,