From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmx.com header.i=quwenruo.btrfs@gmx.com header.b="PRllUffr" Received: from mout.gmx.net (mout.gmx.net [212.227.15.19]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B8717C7 for ; Mon, 20 Nov 2023 12:25:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gmx.com; s=s31663417; t=1700511940; x=1701116740; i=quwenruo.btrfs@gmx.com; bh=bc6netv4urZODj2PyMPV3kP8R/zLZcobC7wRRyWelz4=; h=X-UI-Sender-Class:Date:Subject:To:Cc:References:From: In-Reply-To; b=PRllUffr/9+5jSkVRAKQfdC4mKWDW8QkA9OblDacO+Zo/frA6wODdEoHSb2a5aGZ HX/zbKI2EYo3dlzJ8KXAc4+A4DrCdOAWrmep6udqNcrhcUqIIqb59RCBQp9qq1yhj Cf3z7Dld+NJhej/3uPMogfvW3a5isMsYOqfQHrC2SqEMBu1FrVaO8/52bhbVrYiIU 0fXe4xbKz3HuPPGnWL8LQPQdcz2HazEEWQxOrra4XVxrrd9IWyMYobu38znTPrk55 qo98y2KGNgZXjUj3I5gEXe5B2wDUyu1p5c/wF/+XIp4c0N5Ox0KbDc4PxKQFs0Dqn zYYR6Dq2JOzjZdqixg== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from [172.16.0.117] ([122.151.37.21]) by mail.gmx.net (mrgmx005 [212.227.17.184]) with ESMTPSA (Nemesis) id 1Mlf4S-1rmWHJ26N8-00ioI0; Mon, 20 Nov 2023 21:25:40 +0100 Message-ID: Date: Tue, 21 Nov 2023 06:55:35 +1030 Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH RFC] btrfs: allow extent buffer helpers to skip cross-page handling To: dsterba@suse.cz, Qu Wenruo Cc: linux-btrfs@vger.kernel.org References: <721bab821198fc9b49d2795b2028ed6c436ab886.1700111928.git.wqu@suse.com> <20231120170015.GM11264@twin.jikos.cz> Content-Language: en-US From: Qu Wenruo Autocrypt: addr=quwenruo.btrfs@gmx.com; keydata= xsBNBFnVga8BCACyhFP3ExcTIuB73jDIBA/vSoYcTyysFQzPvez64TUSCv1SgXEByR7fju3o 8RfaWuHCnkkea5luuTZMqfgTXrun2dqNVYDNOV6RIVrc4YuG20yhC1epnV55fJCThqij0MRL 1NxPKXIlEdHvN0Kov3CtWA+R1iNN0RCeVun7rmOrrjBK573aWC5sgP7YsBOLK79H3tmUtz6b 9Imuj0ZyEsa76Xg9PX9Hn2myKj1hfWGS+5og9Va4hrwQC8ipjXik6NKR5GDV+hOZkktU81G5 gkQtGB9jOAYRs86QG/b7PtIlbd3+pppT0gaS+wvwMs8cuNG+Pu6KO1oC4jgdseFLu7NpABEB AAHNIlF1IFdlbnJ1byA8cXV3ZW5ydW8uYnRyZnNAZ214LmNvbT7CwJQEEwEIAD4CGwMFCwkI BwIGFQgJCgsCBBYCAwECHgECF4AWIQQt33LlpaVbqJ2qQuHCPZHzoSX+qAUCY00iVQUJDToH pgAKCRDCPZHzoSX+qNKACACkjDLzCvcFuDlgqCiS4ajHAo6twGra3uGgY2klo3S4JespWifr BLPPak74oOShqNZ8yWzB1Bkz1u93Ifx3c3H0r2vLWrImoP5eQdymVqMWmDAq+sV1Koyt8gXQ XPD2jQCrfR9nUuV1F3Z4Lgo+6I5LjuXBVEayFdz/VYK63+YLEAlSowCF72Lkz06TmaI0XMyj jgRNGM2MRgfxbprCcsgUypaDfmhY2nrhIzPUICURfp9t/65+/PLlV4nYs+DtSwPyNjkPX72+ LdyIdY+BqS8cZbPG5spCyJIlZonADojLDYQq4QnufARU51zyVjzTXMg5gAttDZwTH+8LbNI4 mm2YzsBNBFnVga8BCACqU+th4Esy/c8BnvliFAjAfpzhI1wH76FD1MJPmAhA3DnX5JDORcga CbPEwhLj1xlwTgpeT+QfDmGJ5B5BlrrQFZVE1fChEjiJvyiSAO4yQPkrPVYTI7Xj34FnscPj /IrRUUka68MlHxPtFnAHr25VIuOS41lmYKYNwPNLRz9Ik6DmeTG3WJO2BQRNvXA0pXrJH1fN GSsRb+pKEKHKtL1803x71zQxCwLh+zLP1iXHVM5j8gX9zqupigQR/Cel2XPS44zWcDW8r7B0 q1eW4Jrv0x19p4P923voqn+joIAostyNTUjCeSrUdKth9jcdlam9X2DziA/DHDFfS5eq4fEv ABEBAAHCwHwEGAEIACYCGwwWIQQt33LlpaVbqJ2qQuHCPZHzoSX+qAUCY00ibgUJDToHvwAK CRDCPZHzoSX+qK6vB/9yyZlsS+ijtsvwYDjGA2WhVhN07Xa5SBBvGCAycyGGzSMkOJcOtUUf tD+ADyrLbLuVSfRN1ke738UojphwkSFj4t9scG5A+U8GgOZtrlYOsY2+cG3R5vjoXUgXMP37 INfWh0KbJodf0G48xouesn08cbfUdlphSMXujCA8y5TcNyRuNv2q5Nizl8sKhUZzh4BascoK DChBuznBsucCTAGrwPgG4/ul6HnWE8DipMKvkV9ob1xJS2W4WJRPp6QdVrBWJ9cCdtpR6GbL iQi22uZXoSPv/0oUrGU+U5X4IvdnvT+8viPzszL5wXswJZfqfy8tmHM85yjObVdIG6AlnrrD In-Reply-To: <20231120170015.GM11264@twin.jikos.cz> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K1:Sp09LahtgtVggcbjKVcOYRGJIUU5+UMGKvIvjxI4lw0bcJ4DYBb 2Lp5yrF/qOk5szJYUlbQcScidQjArQJQB39yaAVehFwKLlUEn/nnPh1XkbspN2SjE3IJeIK baBiwkD2o4sCBqXAVvYH5vJZZdCAceF9JAjYmilgrNMb0Vt6yVzkNAdQWS+yZQXvAmUS+Ie 6IgRCo6GyNFvZCNa0kjSQ== UI-OutboundReport: notjunk:1;M01:P0:5bJeMTpYqss=;rUZQckHp4/+DdadFGwFbxe8wSBA CST7FPvA4tZUkkRVAaROTLjAPfZ/NZeGClvKoZWyAT8U8op6XY7C1ArlGq0NoPf1CsKPqTxIq lpT165nT55yEkNZkGmpBdchThAJGdgbD47zUqlWc5PUZDUtVGiedaC47InsXYe66mAjvvv1fE mRhoLZo8YACrY/cx+WkIiMxSuUQFg8Z8Tq5QIiBDRRmJgZSsr9Nuze8q3xSI6dAkl0Y8G/anL PvTfQ1e4b8PzSxxEpwmtiVhvE91PkdPB1vHa8Hjer3J+/CSwgcn2LUjHEK+EUmmUhIwXgghIC iaQ8CA1kZ7MlKrRlGf6B8rV6IrJjugEgj5keX3apYxYUmHXfz+dZCVdACgAFK6DI2bVH6t6lQ +okYOmAbQueQ3WU8ZrxctuahGenelwhsEjEyS7yrqU7bYGFCwwSY/sh8hThH5i/Y3w1WqDGE/ goA0Q0+6sPMEkD5k58eQNpG1rQH4mxYvlYHkBzUrpR1shgnLM4dp51Ozu5WdlAc9LpNU9TjcW eWsvvrJELhW4FRzUswTDeHELMQkVSQUuP9C3RaRDsgfN2haAyY5cEroHG6/C3y8frrFifa0+/ FgpyrqnHcw2yL3iiqfcnxU0vGhUK897y59G62NdSyc7nw/tnNRnVlmREu+9N+mF8MKcRvlmcE Yx2wv7MXMfrEeF07Rug7BZFJfZqUzeiORQSes6Q39HHZGWf3pzBjFb7VEweRHgiPNNOt3r/Fs m5EfYQOADkVzRpluc+ujc5tAXsvibZgWQYipZRUr7FLpeENHZvH6fg5tUPyWDd2BgSp1LXfH6 BlYjKG2mAHxrLn4SJG7FHSSBxV3F1L5LCK/PlK0+GNnqNHQisMPkjk083RMD8sEU2huHvb7FO t22nQZ0m920MWXIrM0F/srDG9taYCpGQlVrQCHdAo2sKQ6XuAG3I+ctt8C1Wg3x5yOPcZpmD/ ckAc7gPFLe+WZNFOEPeKMrC6dAw= On 2023/11/21 03:30, David Sterba wrote: > On Thu, Nov 16, 2023 at 03:49:06PM +1030, Qu Wenruo wrote: >> Currently btrfs extent buffer helpers are doing all the cross-page >> handling, as there is no guarantee that all those eb pages are >> contiguous. >> >> However on systems with enough memory, there is a very high chance the >> page cache for btree_inode are allocated with physically contiguous >> pages. >> >> In that case, we can skip all the complex cross-page handling, thus >> speeding up the code. >> >> This patch adds a new member, extent_buffer::addr, which is only set to >> non-NULL if all the extent buffer pages are physically contiguous. >> >> Signed-off-by: Qu Wenruo >> --- >> Reason for RFC: >> >> This change would increase the code size for all extent buffer helpers, >> and since there one more branch introduced, it may even slow down the >> system if most ebs do not have physically contiguous pages. >> >> But I still believe this is worthy trying, as my previous attempt to >> use virtually contiguous pages are rejected due to possible slow down i= n >> vm_map() call. >> >> I don't have convincing benchmark yet, but so far no obvious performanc= e >> drop observed either. >> --- >> fs/btrfs/disk-io.c | 9 +++++++- >> fs/btrfs/extent_io.c | 55 +++++++++++++++++++++++++++++++++++++++++++= + >> fs/btrfs/extent_io.h | 7 ++++++ >> 3 files changed, 70 insertions(+), 1 deletion(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 5ac6789ca55f..7fc78171a262 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -80,8 +80,16 @@ static void csum_tree_block(struct extent_buffer *bu= f, u8 *result) >> char *kaddr; >> int i; >> >> + memset(result, 0, BTRFS_CSUM_SIZE); >> shash->tfm =3D fs_info->csum_shash; >> crypto_shash_init(shash); >> + >> + if (buf->addr) { >> + crypto_shash_digest(shash, buf->addr + offset_in_page(buf->start) + = BTRFS_CSUM_SIZE, >> + buf->len - BTRFS_CSUM_SIZE, result); >> + return; >> + } >> + >> kaddr =3D page_address(buf->pages[0]) + offset_in_page(buf->start); >> crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE, >> first_page_part - BTRFS_CSUM_SIZE); >> @@ -90,7 +98,6 @@ static void csum_tree_block(struct extent_buffer *buf= , u8 *result) >> kaddr =3D page_address(buf->pages[i]); >> crypto_shash_update(shash, kaddr, PAGE_SIZE); >> } >> - memset(result, 0, BTRFS_CSUM_SIZE); > > This is not related to the contig pages but the result buffer for > checksum should be always cleared before storing the digest. This just get moved before the branch, as that clearing is shared for both cross-page and contig cases. > >> crypto_shash_final(shash, result); >> } >> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 03cef28d9e37..004b0ba6b1c7 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -3476,6 +3476,7 @@ struct extent_buffer *alloc_extent_buffer(struct = btrfs_fs_info *fs_info, >> struct address_space *mapping =3D fs_info->btree_inode->i_mapping; >> struct btrfs_subpage *prealloc =3D NULL; >> u64 lockdep_owner =3D owner_root; >> + bool page_contig =3D true; >> int uptodate =3D 1; >> int ret; >> >> @@ -3562,6 +3563,14 @@ struct extent_buffer *alloc_extent_buffer(struct= btrfs_fs_info *fs_info, >> >> WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len)); >> eb->pages[i] =3D p; >> + >> + /* >> + * Check if the current page is physically contiguous with previous = eb >> + * page. >> + */ >> + if (i && eb->pages[i - 1] + 1 !=3D p) >> + page_contig =3D false; > > This hasn't been fixed from last time, this has almost zero chance to > succeed once the system is up for some time. I have the same counter argument as the last time. If so, transparent huge page should not work, thus I strongly doubt about above statement. > Page addresses returned > from allocator are random. What I was suggesting is to use alloc_page() > with the given order (16K pages are 2). Nope, this patch is not intended to do that at all. This is just the preparation for the incoming changes. In fact alloc_page() with order needs more than those changes, it would only come after all the preparation, including: - Change how we allocate pages for eb It has to go allocation in one-go, then attaching those pages to filemap. - Extra changes to how concurrent eb allocation - Folio flags related changes Remember a lot of folio flags are not applied to all its pages. > > This works for all eb sizes we need, the prolematic one could be for 64K > because this is order 4 and PAGE_ALLOC_COSTLY_ORDER is 3, so this would > cost more on the MM side. But who uses 64K node size on x8_64. As long as you still want per-page allocation as fallback, this patch itself is still required. All the higher order allocation is only going to be an optimization or fast path. Furthermore, I found this suggestion is conflicting with your previous statement on contig pages. If you say the system can no longer provides contig pages after some uptime, then all above higher order page allocation should all fail. > >> --- a/fs/btrfs/extent_io.h >> +++ b/fs/btrfs/extent_io.h >> @@ -77,6 +77,13 @@ struct extent_buffer { >> unsigned long len; >> unsigned long bflags; >> struct btrfs_fs_info *fs_info; >> + >> + /* >> + * The address where the eb can be accessed without any cross-page ha= ndling. >> + * This can be NULL if not possible. >> + */ >> + void *addr; > > So this is a remnant of the vm_map, we would not need to store the > address in case all the pages are contiguous, it would be the address of > pages[0]. That it's contiguous could be tracked as a bit in the flags. It's the indicator of whether the pages are contig. Or you have to check all the pages' address then determine if we go fast path everytime, which is too slow afaik. Thanks, Qu > >> + >> spinlock_t refs_lock; >> atomic_t refs; >> int read_mirror; >> -- >> 2.42.1 >> >