public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>, Boris Burkov <boris@bur.io>,
	David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>,
	clm@fb.com, josef@toxicpanda.com, linux-btrfs@vger.kernel.org
Subject: [PATCH AUTOSEL 6.0 28/39] btrfs: send: avoid unaligned encoded writes when attempting to clone range
Date: Mon, 28 Nov 2022 12:36:08 -0500	[thread overview]
Message-ID: <20221128173642.1441232-28-sashal@kernel.org> (raw)
In-Reply-To: <20221128173642.1441232-1-sashal@kernel.org>

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit a11452a3709e217492798cf3686ac2cc8eb3fb51 ]

When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.

Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.

The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj

   mkfs.btrfs -f $DEV > /dev/null
   mount -o compress $DEV $MNT

   # File foo has a size of 33K, not aligned to the sector size.
   xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo

   xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar

   # Now clone the first 32K of file bar into foo at offset 0.
   xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo

   # Snapshot the default subvolume and create a full send stream (v2).
   btrfs subvolume snapshot -r $MNT $MNT/snap

   btrfs send --compressed-data -f /tmp/test.send $MNT/snap

   echo -e "\nFile bar in the original filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT
   mkfs.btrfs -f $DEV > /dev/null
   mount $DEV $MNT

   echo -e "\nReceiving stream in a new filesystem..."
   btrfs receive -f /tmp/test.send $MNT

   echo -e "\nFile bar in the new filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT

Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   write bar - offset=32768 length=1024
   encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
   encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
   (...)

This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
   (...)

So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).

Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/send.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index e7671afcee4f..8cc038460bed 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5615,6 +5615,7 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		u64 ext_len;
 		u64 clone_len;
 		u64 clone_data_offset;
+		bool crossed_src_i_size = false;
 
 		if (slot >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(clone_root->root, path);
@@ -5672,8 +5673,10 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		if (key.offset >= clone_src_i_size)
 			break;
 
-		if (key.offset + ext_len > clone_src_i_size)
+		if (key.offset + ext_len > clone_src_i_size) {
 			ext_len = clone_src_i_size - key.offset;
+			crossed_src_i_size = true;
+		}
 
 		clone_data_offset = btrfs_file_extent_offset(leaf, ei);
 		if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte) {
@@ -5734,6 +5737,25 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 				ret = send_clone(sctx, offset, clone_len,
 						 clone_root);
 			}
+		} else if (crossed_src_i_size && clone_len < len) {
+			/*
+			 * If we are at i_size of the clone source inode and we
+			 * can not clone from it, terminate the loop. This is
+			 * to avoid sending two write operations, one with a
+			 * length matching clone_len and the final one after
+			 * this loop with a length of len - clone_len.
+			 *
+			 * When using encoded writes (BTRFS_SEND_FLAG_COMPRESSED
+			 * was passed to the send ioctl), this helps avoid
+			 * sending an encoded write for an offset that is not
+			 * sector size aligned, in case the i_size of the source
+			 * inode is not sector size aligned. That will make the
+			 * receiver fallback to decompression of the data and
+			 * writing it using regular buffered IO, therefore while
+			 * not incorrect, it's not optimal due decompression and
+			 * possible re-compression at the receiver.
+			 */
+			break;
 		} else {
 			ret = send_extent_data(sctx, dst_path, offset,
 					       clone_len);
-- 
2.35.1


  parent reply	other threads:[~2022-11-28 17:41 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-28 17:35 [PATCH AUTOSEL 6.0 01/39] arm64: dts: rockchip: Fix gmac failure of rgmii-id from rk3566-roc-pc Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 02/39] arm64: dts: rockchip: Fix i2c3 pinctrl on rk3566-roc-pc Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 03/39] arm64: dts: rockchip: remove i2c5 from rk3566-roc-pc Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 04/39] arm64: dts: rockchip: keep I2S1 disabled for GPIO function on ROCK Pi 4 series Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 05/39] arm64: dts: rockchip: fix node name for hym8563 rtc Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 06/39] arm: " Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 07/39] arm: dts: rockchip: remove clock-frequency from rtc Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 08/39] ARM: dts: rockchip: fix adc-keys sub node names Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 09/39] arm64: " Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 10/39] ARM: dts: rockchip: fix ir-receiver " Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 11/39] arm64: " Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 12/39] ARM: dts: rockchip: rk3188: fix lcdc1-rgb24 node name Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 13/39] fs: use acquire ordering in __fget_light() Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 14/39] ARM: 9251/1: perf: Fix stacktraces for tracepoint events in THUMB2 kernels Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 15/39] ARM: 9266/1: mm: fix no-MMU ZERO_PAGE() implementation Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 16/39] ASoC: wm8962: Wait for updated value of WM8962_CLOCKING1 register Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 17/39] spi: mediatek: Fix DEVAPC Violation at KO Remove Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 18/39] ARM: dts: rockchip: disable arm_global_timer on rk3066 and rk3188 Sasha Levin
2022-11-28 17:35 ` [PATCH AUTOSEL 6.0 19/39] ASoC: rt711-sdca: fix the latency time of clock stop prepare state machine transitions Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 20/39] 9p/fd: Use P9_HDRSZ for header size Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 21/39] regulator: slg51000: Wait after asserting CS pin Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 22/39] ALSA: seq: Fix function prototype mismatch in snd_seq_expand_var_event Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 23/39] LoongArch: Makefile: Use "grep -E" instead of "egrep" Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 24/39] LoongArch: Combine acpi_boot_table_init() and acpi_boot_init() Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 25/39] LoongArch: Set _PAGE_DIRTY only if _PAGE_MODIFIED is set in {pmd,pte}_mkwrite() Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 26/39] LoongArch: Fix unsigned comparison with less than zero Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 27/39] selftests/net: Find nettest in current directory Sasha Levin
2022-11-28 17:36 ` Sasha Levin [this message]
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 29/39] net/mlx5: Lag, avoid lockdep warnings Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 30/39] ASoC: soc-pcm: Add NULL check in BE reparenting Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 31/39] regulator: twl6030: fix get status of twl6032 regulators Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 32/39] fbcon: Use kzalloc() in fbcon_prepare_logo() Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 33/39] usb: dwc3: gadget: Disable GUSB2PHYCFG.SUSPHY for End Transfer Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 34/39] 9p/xen: check logical size for buffer size Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 35/39] net: usb: qmi_wwan: add u-blox 0x1342 composition Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 36/39] drm/amd/display: Use viewport height for subvp mall allocation size Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 37/39] drm/amd/display: Avoid setting pixel rate divider to N/A Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 38/39] drm/amd/display: Use new num clk levels struct for max mclk index Sasha Levin
2022-11-28 17:36 ` [PATCH AUTOSEL 6.0 39/39] drm/amdgpu: fix use-after-free during gpu recovery Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221128173642.1441232-28-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=boris@bur.io \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox