From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: dsterba@suse.cz, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: allow extent buffer helpers to skip cross-page handling
Date: Tue, 21 Nov 2023 06:55:35 +1030 [thread overview]
Message-ID: <a73faeae-1925-4894-9512-7a049ff8353b@gmx.com> (raw)
In-Reply-To: <20231120170015.GM11264@twin.jikos.cz>
On 2023/11/21 03:30, David Sterba wrote:
> On Thu, Nov 16, 2023 at 03:49:06PM +1030, Qu Wenruo wrote:
>> Currently btrfs extent buffer helpers are doing all the cross-page
>> handling, as there is no guarantee that all those eb pages are
>> contiguous.
>>
>> However on systems with enough memory, there is a very high chance the
>> page cache for btree_inode are allocated with physically contiguous
>> pages.
>>
>> In that case, we can skip all the complex cross-page handling, thus
>> speeding up the code.
>>
>> This patch adds a new member, extent_buffer::addr, which is only set to
>> non-NULL if all the extent buffer pages are physically contiguous.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>>
>> This change would increase the code size for all extent buffer helpers,
>> and since there one more branch introduced, it may even slow down the
>> system if most ebs do not have physically contiguous pages.
>>
>> But I still believe this is worthy trying, as my previous attempt to
>> use virtually contiguous pages are rejected due to possible slow down in
>> vm_map() call.
>>
>> I don't have convincing benchmark yet, but so far no obvious performance
>> drop observed either.
>> ---
>> fs/btrfs/disk-io.c | 9 +++++++-
>> fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/extent_io.h | 7 ++++++
>> 3 files changed, 70 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5ac6789ca55f..7fc78171a262 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -80,8 +80,16 @@ static void csum_tree_block(struct extent_buffer *buf, u8 *result)
>> char *kaddr;
>> int i;
>>
>> + memset(result, 0, BTRFS_CSUM_SIZE);
>> shash->tfm = fs_info->csum_shash;
>> crypto_shash_init(shash);
>> +
>> + if (buf->addr) {
>> + crypto_shash_digest(shash, buf->addr + offset_in_page(buf->start) + BTRFS_CSUM_SIZE,
>> + buf->len - BTRFS_CSUM_SIZE, result);
>> + return;
>> + }
>> +
>> kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
>> crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
>> first_page_part - BTRFS_CSUM_SIZE);
>> @@ -90,7 +98,6 @@ static void csum_tree_block(struct extent_buffer *buf, u8 *result)
>> kaddr = page_address(buf->pages[i]);
>> crypto_shash_update(shash, kaddr, PAGE_SIZE);
>> }
>> - memset(result, 0, BTRFS_CSUM_SIZE);
>
> This is not related to the contig pages but the result buffer for
> checksum should be always cleared before storing the digest.
The memset() just gets moved before the branch, as the clearing is shared
by both the cross-page and the contiguous cases.
>
>> crypto_shash_final(shash, result);
>> }
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 03cef28d9e37..004b0ba6b1c7 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3476,6 +3476,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>> struct address_space *mapping = fs_info->btree_inode->i_mapping;
>> struct btrfs_subpage *prealloc = NULL;
>> u64 lockdep_owner = owner_root;
>> + bool page_contig = true;
>> int uptodate = 1;
>> int ret;
>>
>> @@ -3562,6 +3563,14 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>
>> WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len));
>> eb->pages[i] = p;
>> +
>> + /*
>> + * Check if the current page is physically contiguous with previous eb
>> + * page.
>> + */
>> + if (i && eb->pages[i - 1] + 1 != p)
>> + page_contig = false;
>
> This hasn't been fixed from last time, this has almost zero chance to
> succeed once the system is up for some time.
I have the same counter-argument as last time: if that were true,
transparent huge pages would not work either, so I strongly doubt the
statement above.
> Page addresses returned
> from allocator are random. What I was suggesting is to use alloc_page()
> with the given order (16K pages are 2).
Nope, this patch is not intended to do that at all; it is just
preparation for the incoming changes.
In fact, alloc_page() with an order needs more than these changes. It
can only come after all the preparation is done, including:
- Changing how we allocate pages for an eb
  The allocation has to happen in one go, with the pages then attached
  to the filemap.
- Extra changes to how concurrent eb allocations are handled
- Folio-flag related changes
  Remember that a lot of folio flags do not apply to all of a folio's
  pages.
>
> This works for all eb sizes we need, the problematic one could be 64K
> because this is order 4 and PAGE_ALLOC_COSTLY_ORDER is 3, so this would
> cost more on the MM side. But who uses a 64K node size on x86_64.
As long as you still want per-page allocation as the fallback, this
patch itself is still required; the higher-order allocation would only
ever be an optimization or fast path.
Furthermore, this suggestion conflicts with your previous statement on
contiguous pages: if the system can no longer provide contiguous pages
after some uptime, then all the higher-order page allocations above
should fail as well.
>
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -77,6 +77,13 @@ struct extent_buffer {
>> unsigned long len;
>> unsigned long bflags;
>> struct btrfs_fs_info *fs_info;
>> +
>> + /*
>> + * The address where the eb can be accessed without any cross-page handling.
>> + * This can be NULL if not possible.
>> + */
>> + void *addr;
>
> So this is a remnant of the vm_map, we would not need to store the
> address in case all the pages are contiguous, it would be the address of
> pages[0]. That it's contiguous could be tracked as a bit in the flags.
It's the indicator of whether the pages are contiguous.
Without it, we would have to check all the pages' addresses on every
access to decide whether to take the fast path, which is too slow AFAIK.
Thanks,
Qu
>
>> +
>> spinlock_t refs_lock;
>> atomic_t refs;
>> int read_mirror;
>> --
>> 2.42.1
>>
>
Thread overview: 15+ messages
2023-11-16 5:19 [PATCH RFC] btrfs: allow extent buffer helpers to skip cross-page handling Qu Wenruo
2023-11-20 17:00 ` David Sterba
2023-11-20 20:25 ` Qu Wenruo [this message]
2023-11-21 15:35 ` David Sterba
2023-11-21 20:37 ` Qu Wenruo
2023-11-21 21:14 ` David Sterba
2023-11-21 21:30 ` Qu Wenruo
2023-11-22 13:23 ` David Sterba
2023-11-22 13:34 ` David Sterba
2023-11-22 13:46 ` David Sterba
2023-11-22 20:01 ` Qu Wenruo
2023-11-22 22:05 ` David Sterba
2023-11-23 18:50 ` David Sterba
2023-11-23 20:51 ` Qu Wenruo
2023-11-24 16:03 ` David Sterba