Message-ID: <12a8b9ddbcb2da8431f77c5ec952ccfb2a77b7ec.camel@gmail.com>
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
From: Alexander H Duyck <alexander.duyck@gmail.com>
To: Yunsheng Lin, davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org
Date: Mon, 01 Jul 2024 17:08:28 -0700
In-Reply-To: <20240625135216.47007-7-linyunsheng@huawei.com>
References: <20240625135216.47007-1-linyunsheng@huawei.com> <20240625135216.47007-7-linyunsheng@huawei.com>
On Tue, 2024-06-25 at 21:52 +0800, Yunsheng Lin wrote:
> Currently there is one 'struct page_frag' for every 'struct
> sock' and 'struct task_struct', and we are about to replace
> 'struct page_frag' with 'struct page_frag_cache' for them.
> Before beginning the replacement, we need to ensure the size of
> 'struct page_frag_cache' is not bigger than the size of
> 'struct page_frag', as there may be tens of thousands of
> 'struct sock' and 'struct task_struct' instances in the
> system.
> 
> By or'ing the page order & pfmemalloc with the lower bits of
> 'va', instead of using 'u16' or 'u32' for the page size and 'u8'
> for pfmemalloc, we are able to avoid wasting 3 or 5 bytes of
> space. And as the page address & pfmemalloc & order are unchanged
> for the same page in the same 'page_frag_cache' instance, it makes
> sense to fit them together.
> 
> Also, it is better to replace 'offset' with 'remaining', which
> is the remaining size of the cache in a 'page_frag_cache'
> instance, so we are able to do a single 'fragsz > remaining'
> check for the case of the cache not being big enough, which should
> be the fast path if we ensure size is zero when 'va' == NULL by
> memset'ing 'struct page_frag_cache' in page_frag_cache_init()
> and page_frag_cache_drain().
> 
> After this patch, the size of 'struct page_frag_cache' should be
> the same as the size of 'struct page_frag'.
> 
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> Signed-off-by: Yunsheng Lin
> ---
>  include/linux/page_frag_cache.h | 76 +++++++++++++++++++++++-----
>  mm/page_frag_cache.c            | 90 ++++++++++++++++++++-------------
>  2 files changed, 118 insertions(+), 48 deletions(-)
> 
> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
> index 6ac3a25089d1..b33904d4494f 100644
> --- a/include/linux/page_frag_cache.h
> +++ b/include/linux/page_frag_cache.h
> @@ -8,29 +8,81 @@
>  #define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
>  #define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)
>  
> -struct page_frag_cache {
> -	void *va;
> +/*
> + * struct encoded_va - a nonexistent type marking this pointer
> + *
> + * An 'encoded_va' pointer is a pointer to an aligned virtual address, which is
> + * at least aligned to PAGE_SIZE, which means there are at least 12 lower bits
> + * available for other purposes.
> + *
> + * Currently we use the lower 8 bits and bit 9 for the order and PFMEMALLOC
> + * flag of the page this 'va' corresponds to.
> + *
> + * Use the supplied helper functions to encode/decode the pointer and bits.
> + */
> +struct encoded_va;
> +

Why did you create a struct for this? The way you use it below it is
just a pointer. No point in defining a struct that doesn't exist
anywhere.
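
For example, something like this (completely untested, and reusing the
mask/shift defines your patch adds below) is all the encode side would
need with no fake struct type involved:

	/* the encoded value is just an unsigned long, not a pointer */
	static inline unsigned long encode_aligned_va(void *va, unsigned int order,
						      bool pfmemalloc)
	{
		/* va is at least PAGE_SIZE aligned, so the low bits are
		 * free; order shown unconditionally here for brevity
		 */
		return (unsigned long)va | order |
		       ((unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
	}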
> +#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
> +
> +static inline struct encoded_va *encode_aligned_va(void *va,
> +						   unsigned int order,
> +						   bool pfmemalloc)
> +{
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	__u16 offset;
> -	__u16 size;
> +	return (struct encoded_va *)((unsigned long)va | order |
> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>  #else
> -	__u32 offset;
> +	return (struct encoded_va *)((unsigned long)va |
> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
> +#endif
> +}
> +
> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
> +{
> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> +	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
> +#else
> +	return 0;
> +#endif
> +}
> +
> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
> +{
> +	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
> +}
> +

My advice is that if you just make encoded_va an unsigned long this
just becomes some FIELD_GET and bit operations.

> +static inline void *encoded_page_address(struct encoded_va *encoded_va)
> +{
> +	return (void *)((unsigned long)encoded_va & PAGE_MASK);
> +}
> +
> +struct page_frag_cache {
> +	struct encoded_va *encoded_va;

This should be an unsigned long, not a pointer, since you are storing
data other than just a pointer in here. The pointer is just one of the
things you extract out of it.

> +
> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
> +	u16 pagecnt_bias;
> +	u16 remaining;
> +#else
> +	u32 pagecnt_bias;
> +	u32 remaining;
>  #endif
> -	/* we maintain a pagecount bias, so that we dont dirty cache line
> -	 * containing page->_refcount every time we allocate a fragment.
> -	 */
> -	unsigned int pagecnt_bias;
> -	bool pfmemalloc;
>  };
>  
>  static inline void page_frag_cache_init(struct page_frag_cache *nc)
>  {
> -	nc->va = NULL;
> +	memset(nc, 0, sizeof(*nc));

Shouldn't need to memset the whole thing to 0. Just setting page and
order to 0 should be enough to indicate that there isn't anything
there.

>  }
>  
>  static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
>  {
> -	return !!nc->pfmemalloc;
> +	return encoded_page_pfmemalloc(nc->encoded_va);
> +}
> +
> +static inline unsigned int page_frag_cache_page_size(struct encoded_va *encoded_va)
> +{
> +	return PAGE_SIZE << encoded_page_order(encoded_va);
>  }
>  
>  void page_frag_cache_drain(struct page_frag_cache *nc);
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index dd640af5607a..a3316dd50eff 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -18,34 +18,61 @@
>  #include
>  #include "internal.h"
>  
> +static void *page_frag_cache_current_va(struct page_frag_cache *nc)
> +{
> +	struct encoded_va *encoded_va = nc->encoded_va;
> +
> +	return (void *)(((unsigned long)encoded_va & PAGE_MASK) |
> +		(page_frag_cache_page_size(encoded_va) - nc->remaining));
> +}
> +

Rather than an OR here I would rather see this just use addition.
Otherwise this logic becomes overly complicated.
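
Putting the comments above together, this is roughly what I have in
mind (completely untested, and it assumes the other helpers such as
page_frag_cache_page_size() are converted to take the unsigned long as
well):

	#include <linux/bitfield.h>

	static inline unsigned long encoded_page_order(unsigned long encoded_va)
	{
	#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
		/* plain FIELD_GET on the low byte, no casts needed */
		return FIELD_GET(PAGE_FRAG_CACHE_ORDER_MASK, encoded_va);
	#else
		return 0;
	#endif
	}

	static inline bool encoded_page_pfmemalloc(unsigned long encoded_va)
	{
		return encoded_va & PAGE_FRAG_CACHE_PFMEMALLOC_BIT;
	}

	static inline void *encoded_page_address(unsigned long encoded_va)
	{
		return (void *)(encoded_va & PAGE_MASK);
	}

	/* and the current va becomes base address plus consumed bytes */
	static void *page_frag_cache_current_va(struct page_frag_cache *nc)
	{
		return encoded_page_address(nc->encoded_va) +
		       (page_frag_cache_page_size(nc->encoded_va) - nc->remaining);
	}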
>  static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  					     gfp_t gfp_mask)
>  {
>  	struct page *page = NULL;
>  	gfp_t gfp = gfp_mask;
> +	unsigned int order;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>  	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
>  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>  	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>  				PAGE_FRAG_CACHE_MAX_ORDER);
> -	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
>  #endif
> -	if (unlikely(!page))
> +	if (unlikely(!page)) {
>  		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
> +		if (unlikely(!page)) {
> +			memset(nc, 0, sizeof(*nc));
> +			return NULL;
> +		}
> +
> +		order = 0;
> +		nc->remaining = PAGE_SIZE;
> +	} else {
> +		order = PAGE_FRAG_CACHE_MAX_ORDER;
> +		nc->remaining = PAGE_FRAG_CACHE_MAX_SIZE;
> +	}
>  
> -	nc->va = page ? page_address(page) : NULL;
> +	/* Even if we own the page, we do not use atomic_set().
> +	 * This would break get_page_unless_zero() users.
> +	 */
> +	page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
>  
> +	/* reset page count bias of new frag */
> +	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;

I would rather keep the pagecnt_bias, page reference addition, and
resetting of remaining outside of this. The only fields we should be
setting are order, the virtual address, and pfmemalloc, since those are
what is encoded in your unsigned long variable.

> +	nc->encoded_va = encode_aligned_va(page_address(page), order,
> +					   page_is_pfmemalloc(page));
>  	return page;
>  }
>  
>  void page_frag_cache_drain(struct page_frag_cache *nc)
>  {
> -	if (!nc->va)
> +	if (!nc->encoded_va)
>  		return;
>  
> -	__page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
> -	nc->va = NULL;
> +	__page_frag_cache_drain(virt_to_head_page(nc->encoded_va),
> +				nc->pagecnt_bias);
> +	memset(nc, 0, sizeof(*nc));

Again, no need for memset when "nc->encoded_va = 0" will do.

>  }
>  EXPORT_SYMBOL(page_frag_cache_drain);
>  
> @@ -62,51 +89,41 @@ void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
>  				 unsigned int fragsz, gfp_t gfp_mask,
>  				 unsigned int align_mask)
>  {
> -	unsigned int size = PAGE_SIZE;
> +	struct encoded_va *encoded_va = nc->encoded_va;
>  	struct page *page;
> -	int offset;
> +	int remaining;
> +	void *va;
>  
> -	if (unlikely(!nc->va)) {
> +	if (unlikely(!encoded_va)) {
>  refill:
> -		page = __page_frag_cache_refill(nc, gfp_mask);
> -		if (!page)
> +		if (unlikely(!__page_frag_cache_refill(nc, gfp_mask)))
>  			return NULL;
>  
> -		/* Even if we own the page, we do not use atomic_set().
> -		 * This would break get_page_unless_zero() users.
> -		 */
> -		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
> -
> -		/* reset page count bias and offset to start of new frag */
> -		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> -		nc->offset = 0;
> +		encoded_va = nc->encoded_va;
>  	}
>  
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	/* if size can vary use size else just use PAGE_SIZE */
> -	size = nc->size;
> -#endif
> -
> -	offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask);
> -	if (unlikely(offset + fragsz > size)) {
> -		page = virt_to_page(nc->va);
> -
> +	remaining = nc->remaining & align_mask;
> +	remaining -= fragsz;
> +	if (unlikely(remaining < 0)) {

Now this is just getting confusing.
You essentially just added an additional addition step and went back
to the countdown approach I was using before, except for the fact that
you are starting at 0 whereas I was actually moving down through the
page.

What I would suggest doing, since "remaining" is a negative offset
anyway, would be to look at just storing it as a signed negative
number. At least with that you can keep to your original approach and
would only have to change your check to be for "remaining + fragsz <= 0".
With that you can still do your math, but it becomes an addition
instead of a subtraction. (See the rough sketch at the end of this
mail.)

> +		page = virt_to_page(encoded_va);
>  		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>  			goto refill;
>  
> -		if (unlikely(nc->pfmemalloc)) {
> -			free_unref_page(page, compound_order(page));
> +		if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
> +			VM_BUG_ON(compound_order(page) !=
> +				  encoded_page_order(encoded_va));
> +			free_unref_page(page, encoded_page_order(encoded_va));
>  			goto refill;
>  		}
>  
>  		/* OK, page count is 0, we can safely set it */
>  		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>  
> -		/* reset page count bias and offset to start of new frag */
> +		/* reset page count bias and remaining of new frag */
>  		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> -		offset = 0;
> -		if (unlikely(fragsz > PAGE_SIZE)) {
> +		nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
> +		remaining -= fragsz;
> +		if (unlikely(remaining < 0)) {
>  			/*
>  			 * The caller is trying to allocate a fragment
>  			 * with fragsz > PAGE_SIZE but the cache isn't big

I find it really amusing that you went to all the trouble of flipping
the logic just to flip it back to being a countdown setup. If you were
going to bother with all that, then why not just make the remaining
negative instead? You could save yourself a ton of trouble that way,
and all you would need to do is flip a few signs.

> @@ -120,10 +137,11 @@ void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
>  		}
>  	}
>  
> +	va = page_frag_cache_current_va(nc);
>  	nc->pagecnt_bias--;
> -	nc->offset = offset + fragsz;
> +	nc->remaining = remaining;
>  
> -	return nc->va + offset;
> +	return va;
>  }
>  EXPORT_SYMBOL(__page_frag_alloc_va_align);
>  

Not sure I am a huge fan of the way the order of operations has to get
so creative for this to work. Not that I see a better way to do it, but
my concern is that this is going to add technical debt, as I can easily
see somebody messing up the order of things at some point in the future
and generating a bad pointer.
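
To make the signed "remaining" suggestion above concrete, the fast path
would look something like this (completely untested sketch; it assumes
'remaining' becomes a signed type holding the negative number of bytes
left in the page, and the alignment handling is omitted for brevity):

	/* refill would initialize the countdown as a negative value:
	 *	nc->remaining = -(long)page_frag_cache_page_size(encoded_va);
	 */
	remaining = nc->remaining + fragsz;
	if (unlikely(remaining > 0))
		goto refill;	/* old remaining + fragsz > 0: doesn't fit */

	/* the offset into the page falls out of the addition naturally */
	va = encoded_page_address(nc->encoded_va) +
	     page_frag_cache_page_size(nc->encoded_va) + nc->remaining;

	nc->pagecnt_bias--;
	nc->remaining = remaining;

	return va;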