* [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag()
@ 2024-08-08 12:37 Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API Yunsheng Lin
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-08 12:37 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Matthias Brugger, AngeloGioacchino Del Regno, bpf,
linux-arm-kernel, linux-mediatek
After [1], there are still two implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
rx part with 'struct page_frag_cache' and the main API
being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
tx part with 'struct page_frag' and the main API being
skb_page_frag_refill().
This patchset tries to unify the page frag implementation
by first replacing page_frag with page_frag_cache for
sk_page_frag(). The net_high_order_alloc_disable_key used by
the implementation in net/core/sock.c doesn't seem to matter
that much now, as the pcp (per-cpu pages) allocator also
supports high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")
As the changes are mostly networking-related, this targets
net-next. The remaining page_frag usage will be replaced in a
follow-up patchset.
After this patchset:
1. The page frag implementation is unified by taking the best of
the two existing implementations: we are able to save some space
for the 'page_frag_cache' API users, and avoid 'get_page()' for
the old 'page_frag' API users.
2. Future bugfixes and performance work can be done in one place,
improving the maintainability of the page_frag implementation.
Kernel image size change:
Linux Kernel total | text data bss
------------------------------------------------------
after 45250307 | 27274279 17209996 766032
before 45254134 | 27278118 17209984 766032
delta -3827 | -3839 +12 +0
Performance validation:
1. Using the micro-benchmark ko added in patch 1 to test the
aligned and non-aligned API performance impact for the existing
users, there is no noticeable performance degradation. Instead
there seems to be a major performance boost for both the aligned
and non-aligned APIs after switching to ptr_ring for testing,
about 200% and 10% improvement respectively on an arm64 server,
as shown below.
2. Using the below netcat test case, there is also a minor
performance boost from replacing 'page_frag' with
'page_frag_cache' after this patchset.
server: taskset -c 32 nc -l -k 1234 > /dev/null
client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
In order to avoid performance noise as much as possible, the
testing is done on a system without any other load, with enough
iterations to show that the data is stable; the complete testing
log is below:
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
*After* this patchset:
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
17.758393 task-clock (msec) # 0.004 CPUs utilized ( +- 0.51% )
5 context-switches # 0.293 K/sec ( +- 0.65% )
0 cpu-migrations # 0.008 K/sec ( +- 17.21% )
74 page-faults # 0.004 M/sec ( +- 0.12% )
46128650 cycles # 2.598 GHz ( +- 0.51% )
60810511 instructions # 1.32 insn per cycle ( +- 0.04% )
14764914 branches # 831.433 M/sec ( +- 0.04% )
19281 branch-misses # 0.13% of all branches ( +- 0.13% )
4.240273854 seconds time elapsed ( +- 0.13% )
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
17.348690 task-clock (msec) # 0.019 CPUs utilized ( +- 0.66% )
5 context-switches # 0.310 K/sec ( +- 0.84% )
0 cpu-migrations # 0.009 K/sec ( +- 16.55% )
74 page-faults # 0.004 M/sec ( +- 0.11% )
45065287 cycles # 2.598 GHz ( +- 0.66% )
60755389 instructions # 1.35 insn per cycle ( +- 0.05% )
14747865 branches # 850.085 M/sec ( +- 0.05% )
19272 branch-misses # 0.13% of all branches ( +- 0.13% )
0.935251375 seconds time elapsed ( +- 0.07% )
Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
16626.042731 task-clock (msec) # 0.607 CPUs utilized ( +- 0.03% )
3291020 context-switches # 0.198 M/sec ( +- 0.05% )
1 cpu-migrations # 0.000 K/sec ( +- 0.50% )
85 page-faults # 0.005 K/sec ( +- 0.16% )
30581044838 cycles # 1.839 GHz ( +- 0.05% )
34962744631 instructions # 1.14 insn per cycle ( +- 0.01% )
6483883671 branches # 389.984 M/sec ( +- 0.02% )
99624551 branch-misses # 1.54% of all branches ( +- 0.17% )
27.370305077 seconds time elapsed ( +- 0.01% )
*Before* this patchset:
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
21.587934 task-clock (msec) # 0.005 CPUs utilized ( +- 0.72% )
6 context-switches # 0.281 K/sec ( +- 0.28% )
1 cpu-migrations # 0.047 K/sec ( +- 0.50% )
73 page-faults # 0.003 M/sec ( +- 0.12% )
56080697 cycles # 2.598 GHz ( +- 0.72% )
61605150 instructions # 1.10 insn per cycle ( +- 0.05% )
14950196 branches # 692.526 M/sec ( +- 0.05% )
19410 branch-misses # 0.13% of all branches ( +- 0.18% )
4.603530546 seconds time elapsed ( +- 0.11% )
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
20.988297 task-clock (msec) # 0.006 CPUs utilized ( +- 0.81% )
7 context-switches # 0.316 K/sec ( +- 0.54% )
1 cpu-migrations # 0.048 K/sec ( +- 0.70% )
73 page-faults # 0.003 M/sec ( +- 0.11% )
54512166 cycles # 2.597 GHz ( +- 0.81% )
61440941 instructions # 1.13 insn per cycle ( +- 0.08% )
14906043 branches # 710.207 M/sec ( +- 0.08% )
19927 branch-misses # 0.13% of all branches ( +- 0.17% )
3.438041238 seconds time elapsed ( +- 1.11% )
Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
17364.040855 task-clock (msec) # 0.624 CPUs utilized ( +- 0.02% )
3340375 context-switches # 0.192 M/sec ( +- 0.06% )
1 cpu-migrations # 0.000 K/sec
85 page-faults # 0.005 K/sec ( +- 0.15% )
32077623335 cycles # 1.847 GHz ( +- 0.03% )
35121047596 instructions # 1.09 insn per cycle ( +- 0.01% )
6519872824 branches # 375.481 M/sec ( +- 0.02% )
101877022 branch-misses # 1.56% of all branches ( +- 0.14% )
27.842745343 seconds time elapsed ( +- 0.02% )
Note, ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the below scripts:
nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N -u 127.0.0.1 1234
nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N ::1 1234
nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u -N ::1 1234
CC: Alexander Duyck <alexander.duyck@gmail.com>
1. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
Change log:
V13:
1. Move page_frag_test from mm/ to tools/testing/selftest/mm
2. Use ptr_ring to replace ptr_pool for page_frag_test.c
3. Retest based on the new testing ko, which shows a significantly
different result than using ptr_pool.
V12:
1. Do not treat page_frag_test ko as DEBUG feature.
2. Make some improvements to the refactoring in patch 8.
3. Some other minor improvements per Alexander's comments.
RFC v11:
1. Fold 'page_frag_cache' moving change into patch 2.
2. Optimize patch 3 according to the discussion in v9.
V10:
1. Change Subject to "Replace page_frag with page_frag_cache for sk_page_frag()".
2. Move 'struct page_frag_cache' to sched.h as suggested by Alexander.
3. Rename skb_copy_to_page_nocache().
4. Adjust the changes between patches to make them more reviewable, per
Alexander's comments.
5. Use an 'aligned_remaining' variable to generate the virtual address, per
Alexander's comment.
6. Some included-header and typo fixes per Alexander's comments.
7. Add back the get_order() optimization patch for the xtensa arch.
V9:
1. Add a check for test_alloc_len and change the perm of module_param()
to 0, per Wang Wei's comment.
2. Rebased on latest net-next.
V8: Remove patch 2 & 3 in V7, as free_unref_page() is changed to call
pcp_allowed_order() and used in page_frag API recently in:
commit 5b8d75913a0e ("mm: combine free_the_page() and free_unref_page()")
V7: Fix doc build warning and error.
V6:
1. Fix some typos and a compiler error for x86 pointed out by Jakub and
Simon.
2. Add two refactoring and optimization patches.
V5:
1. Add the page_frag_alloc_pg() API for the tls_device.c case and
refactor some of the implementation; update the kernel binary size
change, as the size increased after that.
2. Add ack from Mat.
RFC v4:
1. Update doc according to Randy and Mat's suggestion.
2. Change the probe API to "probe" for a specific amount of available
space, rather than "nonzero" space, per Mat's suggestion.
3. Retest and update the test result.
v3:
1. Use a new layout for 'struct page_frag_cache' per the discussion
with Alexander and other suggestions from Alexander.
2. Add a probe API to address Mat's comment about the mptcp use case.
3. Some doc updating according to Bagas' suggestion.
v2:
1. reorder test module to patch 1.
2. split doc and maintainer updating to two patches.
3. refactor the page_frag before moving.
4. fix a typo and a 'static' warning in the test module.
5. add a patch for xtensa arch to enable using get_order() in
BUILD_BUG_ON().
6. Add test case and performance data for the socket code.
Yunsheng Lin (14):
mm: page_frag: add a test module for page_frag
mm: move the page fragment allocator from page_alloc into its own file
mm: page_frag: use initial zero offset for page_frag_alloc_align()
mm: page_frag: add '_va' suffix to page_frag API
mm: page_frag: avoid caller accessing 'page_frag_cache' directly
xtensa: remove the get_order() implementation
mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
mm: page_frag: some minor refactoring before adding new API
mm: page_frag: use __alloc_pages() to replace alloc_pages_node()
net: rename skb_copy_to_page_nocache() helper
mm: page_frag: introduce prepare/probe/commit API
net: replace page_frag with page_frag_cache
mm: page_frag: update documentation for page_frag
mm: page_frag: add an entry in MAINTAINERS for page_frag
Documentation/mm/page_frags.rst | 169 +++++++-
MAINTAINERS | 11 +
arch/xtensa/include/asm/page.h | 18 -
.../chelsio/inline_crypto/chtls/chtls.h | 3 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 100 ++---
.../chelsio/inline_crypto/chtls/chtls_main.c | 3 -
drivers/net/ethernet/google/gve/gve_rx.c | 4 +-
drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
.../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 +-
.../marvell/octeontx2/nic/otx2_common.c | 2 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 +-
drivers/net/tun.c | 48 +--
drivers/nvme/host/tcp.c | 8 +-
drivers/nvme/target/tcp.c | 22 +-
drivers/vhost/net.c | 8 +-
include/linux/gfp.h | 22 --
include/linux/mm_types.h | 18 -
include/linux/mm_types_task.h | 18 +
include/linux/page_frag_cache.h | 272 +++++++++++++
include/linux/sched.h | 2 +-
include/linux/skbuff.h | 3 +-
include/net/sock.h | 23 +-
kernel/bpf/cpumap.c | 2 +-
kernel/exit.c | 3 +-
kernel/fork.c | 3 +-
mm/Makefile | 1 +
mm/page_alloc.c | 136 -------
mm/page_frag_cache.c | 367 ++++++++++++++++++
net/core/skbuff.c | 81 ++--
net/core/skmsg.c | 22 +-
net/core/sock.c | 46 ++-
net/core/xdp.c | 2 +-
net/ipv4/ip_output.c | 33 +-
net/ipv4/tcp.c | 35 +-
net/ipv4/tcp_output.c | 28 +-
net/ipv6/ip6_output.c | 33 +-
net/kcm/kcmsock.c | 30 +-
net/mptcp/protocol.c | 67 ++--
net/rxrpc/conn_object.c | 4 +-
net/rxrpc/local_object.c | 4 +-
net/rxrpc/txbuf.c | 15 +-
net/sched/em_meta.c | 2 +-
net/sunrpc/svcsock.c | 12 +-
net/tls/tls_device.c | 137 ++++---
tools/testing/selftests/mm/Makefile | 2 +
tools/testing/selftests/mm/page_frag/Makefile | 18 +
.../selftests/mm/page_frag/page_frag_test.c | 173 +++++++++
tools/testing/selftests/mm/run_vmtests.sh | 9 +-
50 files changed, 1449 insertions(+), 584 deletions(-)
create mode 100644 include/linux/page_frag_cache.h
create mode 100644 mm/page_frag_cache.c
create mode 100644 tools/testing/selftests/mm/page_frag/Makefile
create mode 100644 tools/testing/selftests/mm/page_frag/page_frag_test.c
--
2.33.0
* [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-08 12:37 [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
@ 2024-08-08 12:37 ` Yunsheng Lin
2024-08-14 15:49 ` Alexander H Duyck
2024-08-08 12:37 ` [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache Yunsheng Lin
2024-08-13 11:30 ` [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-08 12:37 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
Subbaraya Sundeep, Chuck Lever, Sagi Grimberg, Jeroen de Borst,
Praveen Kaligineedi, Shailend Chand, Eric Dumazet, Tony Nguyen,
Przemek Kitszel, Sunil Goutham, Geetha sowjanya, hariprasad,
Felix Fietkau, Sean Wang, Mark Lee, Lorenzo Bianconi,
Matthias Brugger, AngeloGioacchino Del Regno, Keith Busch,
Jens Axboe, Christoph Hellwig, Chaitanya Kulkarni,
Michael S. Tsirkin, Jason Wang, Eugenio Pérez, Andrew Morton,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, David Howells,
Marc Dionne, Jeff Layton, Neil Brown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, Shuah Khan,
intel-wired-lan, linux-arm-kernel, linux-mediatek, linux-nvme,
kvm, virtualization, linux-mm, bpf, linux-afs, linux-nfs,
linux-kselftest
Currently the page_frag API returns a 'virtual address' or 'va'
when allocating, and expects a 'virtual address' or 'va' as input
when freeing.
We are about to support new use cases in which the caller needs
to deal with 'struct page', or with both 'va' and 'struct page'.
In order to differentiate the API handling between 'va' and
'struct page', add a '_va' suffix to the corresponding APIs,
mirroring the page_pool_alloc_va() API of the page_pool, so that
callers expecting to deal with va, page, or both va and page may
call the page_frag_alloc_va*, page_frag_alloc_pg*, or
page_frag_alloc* APIs accordingly.
CC: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
.../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
.../marvell/octeontx2/nic/otx2_common.c | 2 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
drivers/nvme/host/tcp.c | 8 +++----
drivers/nvme/target/tcp.c | 22 +++++++++----------
drivers/vhost/net.c | 6 ++---
include/linux/page_frag_cache.h | 21 +++++++++---------
include/linux/skbuff.h | 2 +-
kernel/bpf/cpumap.c | 2 +-
mm/page_frag_cache.c | 12 +++++-----
net/core/skbuff.c | 16 +++++++-------
net/core/xdp.c | 2 +-
net/rxrpc/txbuf.c | 15 +++++++------
net/sunrpc/svcsock.c | 6 ++---
.../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
19 files changed, 75 insertions(+), 70 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index acb73d4d0de6..b6c10100e462 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -729,7 +729,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct gve_rx_ring *rx,
total_len = headroom + SKB_DATA_ALIGN(len) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- frame = page_frag_alloc(&rx->page_cache, total_len, GFP_ATOMIC);
+ frame = page_frag_alloc_va(&rx->page_cache, total_len, GFP_ATOMIC);
if (!frame) {
u64_stats_update_begin(&rx->statss);
rx->xdp_alloc_fails++;
@@ -742,7 +742,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct gve_rx_ring *rx,
err = xdp_do_redirect(dev, &new, xdp_prog);
if (err)
- page_frag_free(frame);
+ page_frag_free_va(frame);
return err;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 8d25b6981269..00c706de2b82 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -126,7 +126,7 @@ ice_unmap_and_free_tx_buf(struct ice_tx_ring *ring, struct ice_tx_buf *tx_buf)
dev_kfree_skb_any(tx_buf->skb);
break;
case ICE_TX_BUF_XDP_TX:
- page_frag_free(tx_buf->raw_buf);
+ page_frag_free_va(tx_buf->raw_buf);
break;
case ICE_TX_BUF_XDP_XMIT:
xdp_return_frame(tx_buf->xdpf);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index feba314a3fe4..6379f57d8228 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -148,7 +148,7 @@ static inline int ice_skb_pad(void)
* @ICE_TX_BUF_DUMMY: dummy Flow Director packet, unmap and kfree()
* @ICE_TX_BUF_FRAG: mapped skb OR &xdp_buff frag, only unmap DMA
* @ICE_TX_BUF_SKB: &sk_buff, unmap and consume_skb(), update stats
- * @ICE_TX_BUF_XDP_TX: &xdp_buff, unmap and page_frag_free(), stats
+ * @ICE_TX_BUF_XDP_TX: &xdp_buff, unmap and page_frag_free_va(), stats
* @ICE_TX_BUF_XDP_XMIT: &xdp_frame, unmap and xdp_return_frame(), stats
* @ICE_TX_BUF_XSK_TX: &xdp_buff on XSk queue, xsk_buff_free(), stats
*/
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 2719f0e20933..a1a41a14df0d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -250,7 +250,7 @@ ice_clean_xdp_tx_buf(struct device *dev, struct ice_tx_buf *tx_buf,
switch (tx_buf->type) {
case ICE_TX_BUF_XDP_TX:
- page_frag_free(tx_buf->raw_buf);
+ page_frag_free_va(tx_buf->raw_buf);
break;
case ICE_TX_BUF_XDP_XMIT:
xdp_return_frame_bulk(tx_buf->xdpf, bq);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 149911e3002a..eef16a909f85 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -302,7 +302,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
/* free the skb */
if (ring_is_xdp(tx_ring))
- page_frag_free(tx_buffer->data);
+ page_frag_free_va(tx_buffer->data);
else
napi_consume_skb(tx_buffer->skb, napi_budget);
@@ -2412,7 +2412,7 @@ static void ixgbevf_clean_tx_ring(struct ixgbevf_ring *tx_ring)
/* Free all the Tx ring sk_buffs */
if (ring_is_xdp(tx_ring))
- page_frag_free(tx_buffer->data);
+ page_frag_free_va(tx_buffer->data);
else
dev_kfree_skb_any(tx_buffer->skb);
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index 87d5776e3b88..a485e988fa1d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -553,7 +553,7 @@ static int __otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
*dma = dma_map_single_attrs(pfvf->dev, buf, pool->rbsize,
DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
if (unlikely(dma_mapping_error(pfvf->dev, *dma))) {
- page_frag_free(buf);
+ page_frag_free_va(buf);
return -ENOMEM;
}
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 7063c78bd35f..c4228719f8a4 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -142,8 +142,8 @@ mtk_wed_wo_queue_refill(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q,
dma_addr_t addr;
void *buf;
- buf = page_frag_alloc(&q->cache, q->buf_size,
- GFP_ATOMIC | GFP_DMA32);
+ buf = page_frag_alloc_va(&q->cache, q->buf_size,
+ GFP_ATOMIC | GFP_DMA32);
if (!buf)
break;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a2a47d3ab99f..86906bc505de 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -506,7 +506,7 @@ static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
{
struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
- page_frag_free(req->pdu);
+ page_frag_free_va(req->pdu);
}
static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
@@ -520,7 +520,7 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
u8 hdgst = nvme_tcp_hdgst_len(queue);
- req->pdu = page_frag_alloc(&queue->pf_cache,
+ req->pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
GFP_KERNEL | __GFP_ZERO);
if (!req->pdu)
@@ -1337,7 +1337,7 @@ static void nvme_tcp_free_async_req(struct nvme_tcp_ctrl *ctrl)
{
struct nvme_tcp_request *async = &ctrl->async_req;
- page_frag_free(async->pdu);
+ page_frag_free_va(async->pdu);
}
static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
@@ -1346,7 +1346,7 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
struct nvme_tcp_request *async = &ctrl->async_req;
u8 hdgst = nvme_tcp_hdgst_len(queue);
- async->pdu = page_frag_alloc(&queue->pf_cache,
+ async->pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
GFP_KERNEL | __GFP_ZERO);
if (!async->pdu)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 5bff0d5464d1..560df3db2f82 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1463,24 +1463,24 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
c->queue = queue;
c->req.port = queue->port->nport;
- c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
+ c->cmd_pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
if (!c->cmd_pdu)
return -ENOMEM;
c->req.cmd = &c->cmd_pdu->cmd;
- c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
+ c->rsp_pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
if (!c->rsp_pdu)
goto out_free_cmd;
c->req.cqe = &c->rsp_pdu->cqe;
- c->data_pdu = page_frag_alloc(&queue->pf_cache,
+ c->data_pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
if (!c->data_pdu)
goto out_free_rsp;
- c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
+ c->r2t_pdu = page_frag_alloc_va(&queue->pf_cache,
sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
if (!c->r2t_pdu)
goto out_free_data;
@@ -1495,20 +1495,20 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
return 0;
out_free_data:
- page_frag_free(c->data_pdu);
+ page_frag_free_va(c->data_pdu);
out_free_rsp:
- page_frag_free(c->rsp_pdu);
+ page_frag_free_va(c->rsp_pdu);
out_free_cmd:
- page_frag_free(c->cmd_pdu);
+ page_frag_free_va(c->cmd_pdu);
return -ENOMEM;
}
static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c)
{
- page_frag_free(c->r2t_pdu);
- page_frag_free(c->data_pdu);
- page_frag_free(c->rsp_pdu);
- page_frag_free(c->cmd_pdu);
+ page_frag_free_va(c->r2t_pdu);
+ page_frag_free_va(c->data_pdu);
+ page_frag_free_va(c->rsp_pdu);
+ page_frag_free_va(c->cmd_pdu);
}
static int nvmet_tcp_alloc_cmds(struct nvmet_tcp_queue *queue)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f16279351db5..6691fac01e0d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -686,8 +686,8 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
return -ENOSPC;
buflen += SKB_DATA_ALIGN(len + pad);
- buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
- SMP_CACHE_BYTES);
+ buf = page_frag_alloc_va_align(&net->pf_cache, buflen, GFP_KERNEL,
+ SMP_CACHE_BYTES);
if (unlikely(!buf))
return -ENOMEM;
@@ -734,7 +734,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
return 0;
err:
- page_frag_free(buf);
+ page_frag_free_va(buf);
return ret;
}
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index a758cb65a9b3..ef038a07925c 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -9,23 +9,24 @@
void page_frag_cache_drain(struct page_frag_cache *nc);
void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
- gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask,
+ unsigned int align_mask);
-static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align)
+static inline void *page_frag_alloc_va_align(struct page_frag_cache *nc,
+ unsigned int fragsz,
+ gfp_t gfp_mask, unsigned int align)
{
WARN_ON_ONCE(!is_power_of_2(align));
- return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
+ return __page_frag_alloc_va_align(nc, fragsz, gfp_mask, -align);
}
-static inline void *page_frag_alloc(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask)
+static inline void *page_frag_alloc_va(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask)
{
- return __page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+ return __page_frag_alloc_va_align(nc, fragsz, gfp_mask, ~0u);
}
-void page_frag_free(void *addr);
+void page_frag_free_va(void *addr);
#endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7482997c719f..f9f0393e6b12 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3381,7 +3381,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
static inline void skb_free_frag(void *addr)
{
- page_frag_free(addr);
+ page_frag_free_va(addr);
}
void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index fbdf5a1aabfe..3b70b6b071b9 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -323,7 +323,7 @@ static int cpu_map_kthread_run(void *data)
/* Bring struct page memory area to curr CPU. Read by
* build_skb_around via page_is_pfmemalloc(), and when
- * freed written by page_frag_free call.
+ * freed written by page_frag_free_va call.
*/
prefetchw(page);
}
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index c5bc72cf018a..70fb6dead624 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -59,9 +59,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
}
EXPORT_SYMBOL(__page_frag_cache_drain);
-void *__page_frag_alloc_align(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align_mask)
+void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask,
+ unsigned int align_mask)
{
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
unsigned int size = nc->size;
@@ -130,16 +130,16 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
return nc->va + (size - remaining);
}
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_alloc_va_align);
/*
* Frees a page fragment allocated out of either a compound or order 0 page.
*/
-void page_frag_free(void *addr)
+void page_frag_free_va(void *addr)
{
struct page *page = virt_to_head_page(addr);
if (unlikely(put_page_testzero(page)))
free_unref_page(page, compound_order(page));
}
-EXPORT_SYMBOL(page_frag_free);
+EXPORT_SYMBOL(page_frag_free_va);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index de2a044cc665..6cf2c51a34e1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -314,8 +314,8 @@ void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
fragsz = SKB_DATA_ALIGN(fragsz);
local_lock_nested_bh(&napi_alloc_cache.bh_lock);
- data = __page_frag_alloc_align(&nc->page, fragsz,
- GFP_ATOMIC | __GFP_NOWARN, align_mask);
+ data = __page_frag_alloc_va_align(&nc->page, fragsz,
+ GFP_ATOMIC | __GFP_NOWARN, align_mask);
local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
return data;
@@ -330,9 +330,9 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
fragsz = SKB_DATA_ALIGN(fragsz);
- data = __page_frag_alloc_align(nc, fragsz,
- GFP_ATOMIC | __GFP_NOWARN,
- align_mask);
+ data = __page_frag_alloc_va_align(nc, fragsz,
+ GFP_ATOMIC | __GFP_NOWARN,
+ align_mask);
} else {
local_bh_disable();
data = __napi_alloc_frag_align(fragsz, align_mask);
@@ -751,14 +751,14 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
if (in_hardirq() || irqs_disabled()) {
nc = this_cpu_ptr(&netdev_alloc_cache);
- data = page_frag_alloc(nc, len, gfp_mask);
+ data = page_frag_alloc_va(nc, len, gfp_mask);
pfmemalloc = nc->pfmemalloc;
} else {
local_bh_disable();
local_lock_nested_bh(&napi_alloc_cache.bh_lock);
nc = this_cpu_ptr(&napi_alloc_cache.page);
- data = page_frag_alloc(nc, len, gfp_mask);
+ data = page_frag_alloc_va(nc, len, gfp_mask);
pfmemalloc = nc->pfmemalloc;
local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
@@ -848,7 +848,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
} else {
len = SKB_HEAD_ALIGN(len);
- data = page_frag_alloc(&nc->page, len, gfp_mask);
+ data = page_frag_alloc_va(&nc->page, len, gfp_mask);
pfmemalloc = nc->page.pfmemalloc;
}
local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bcc5551c6424..7d4e09fb478f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -387,7 +387,7 @@ void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
page_pool_put_full_page(page->pp, page, napi_direct);
break;
case MEM_TYPE_PAGE_SHARED:
- page_frag_free(data);
+ page_frag_free_va(data);
break;
case MEM_TYPE_PAGE_ORDER0:
page = virt_to_page(data); /* Assumes order0 page*/
diff --git a/net/rxrpc/txbuf.c b/net/rxrpc/txbuf.c
index c3913d8a50d3..dccb0353ee84 100644
--- a/net/rxrpc/txbuf.c
+++ b/net/rxrpc/txbuf.c
@@ -33,8 +33,8 @@ struct rxrpc_txbuf *rxrpc_alloc_data_txbuf(struct rxrpc_call *call, size_t data_
data_align = umax(data_align, L1_CACHE_BYTES);
mutex_lock(&call->conn->tx_data_alloc_lock);
- buf = page_frag_alloc_align(&call->conn->tx_data_alloc, total, gfp,
- data_align);
+ buf = page_frag_alloc_va_align(&call->conn->tx_data_alloc, total, gfp,
+ data_align);
mutex_unlock(&call->conn->tx_data_alloc_lock);
if (!buf) {
kfree(txb);
@@ -96,17 +96,18 @@ struct rxrpc_txbuf *rxrpc_alloc_ack_txbuf(struct rxrpc_call *call, size_t sack_s
if (!txb)
return NULL;
- buf = page_frag_alloc(&call->local->tx_alloc,
- sizeof(*whdr) + sizeof(*ack) + 1 + 3 + sizeof(*trailer), gfp);
+ buf = page_frag_alloc_va(&call->local->tx_alloc,
+ sizeof(*whdr) + sizeof(*ack) + 1 + 3 + sizeof(*trailer), gfp);
if (!buf) {
kfree(txb);
return NULL;
}
if (sack_size) {
- buf2 = page_frag_alloc(&call->local->tx_alloc, sack_size, gfp);
+ buf2 = page_frag_alloc_va(&call->local->tx_alloc, sack_size,
+ gfp);
if (!buf2) {
- page_frag_free(buf);
+ page_frag_free_va(buf);
kfree(txb);
return NULL;
}
@@ -180,7 +181,7 @@ static void rxrpc_free_txbuf(struct rxrpc_txbuf *txb)
rxrpc_txbuf_free);
for (i = 0; i < txb->nr_kvec; i++)
if (txb->kvec[i].iov_base)
- page_frag_free(txb->kvec[i].iov_base);
+ page_frag_free_va(txb->kvec[i].iov_base);
kfree(txb);
atomic_dec(&rxrpc_nr_txbuf);
}
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 6b3f01beb294..42d20412c1c3 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1222,8 +1222,8 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
/* The stream record marker is copied into a temporary page
* fragment buffer so that it can be included in rq_bvec.
*/
- buf = page_frag_alloc(&svsk->sk_frag_cache, sizeof(marker),
- GFP_KERNEL);
+ buf = page_frag_alloc_va(&svsk->sk_frag_cache, sizeof(marker),
+ GFP_KERNEL);
if (!buf)
return -ENOMEM;
memcpy(buf, &marker, sizeof(marker));
@@ -1235,7 +1235,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
1 + count, sizeof(marker) + rqstp->rq_res.len);
ret = sock_sendmsg(svsk->sk_sock, &msg);
- page_frag_free(buf);
+ page_frag_free_va(buf);
if (ret < 0)
return ret;
*sentp += ret;
diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index 4a009122991e..e522611452c9 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -52,7 +52,7 @@ static int page_frag_pop_thread(void *arg)
if (obj) {
nr--;
- page_frag_free(obj);
+ page_frag_free_va(obj);
} else {
cond_resched();
}
@@ -80,13 +80,16 @@ static int page_frag_push_thread(void *arg)
int ret;
if (test_align) {
- va = page_frag_alloc_align(&test_frag, test_alloc_len,
- GFP_KERNEL, SMP_CACHE_BYTES);
+ va = page_frag_alloc_va_align(&test_frag,
+ test_alloc_len,
+ GFP_KERNEL,
+ SMP_CACHE_BYTES);
WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1),
"unaligned va returned\n");
} else {
- va = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL);
+ va = page_frag_alloc_va(&test_frag, test_alloc_len,
+ GFP_KERNEL);
}
if (!va)
@@ -94,7 +97,7 @@ static int page_frag_push_thread(void *arg)
ret = __ptr_ring_produce(ring, va);
if (ret) {
- page_frag_free(va);
+ page_frag_free_va(va);
cond_resched();
} else {
nr--;
--
2.33.0
* [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache
2024-08-08 12:37 [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API Yunsheng Lin
@ 2024-08-08 12:37 ` Yunsheng Lin
2024-08-14 22:01 ` Alexander H Duyck
2024-08-13 11:30 ` [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-08 12:37 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
Mat Martineau, Ayush Sawal, Eric Dumazet, Willem de Bruijn,
Jason Wang, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, John Fastabend, Jakub Sitnicki,
David Ahern, Matthieu Baerts, Geliang Tang, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Boris Pismenny, bpf, mptcp
Use the newly introduced prepare/probe/commit API to
replace page_frag with page_frag_cache for sk_page_frag().
CC: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Mat Martineau <martineau@kernel.org>
---
.../chelsio/inline_crypto/chtls/chtls.h | 3 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 100 ++++---------
.../chelsio/inline_crypto/chtls/chtls_main.c | 3 -
drivers/net/tun.c | 48 +++---
include/linux/sched.h | 2 +-
include/net/sock.h | 14 +-
kernel/exit.c | 3 +-
kernel/fork.c | 3 +-
net/core/skbuff.c | 59 +++++---
net/core/skmsg.c | 22 +--
net/core/sock.c | 46 ++++--
net/ipv4/ip_output.c | 33 +++--
net/ipv4/tcp.c | 32 ++--
net/ipv4/tcp_output.c | 28 ++--
net/ipv6/ip6_output.c | 33 +++--
net/kcm/kcmsock.c | 27 ++--
net/mptcp/protocol.c | 67 +++++----
net/sched/em_meta.c | 2 +-
net/tls/tls_device.c | 137 ++++++++++--------
19 files changed, 347 insertions(+), 315 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
index 7ff82b6778ba..fe2b6a8ef718 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
@@ -234,7 +234,6 @@ struct chtls_dev {
struct list_head list_node;
struct list_head rcu_node;
struct list_head na_node;
- unsigned int send_page_order;
int max_host_sndbuf;
u32 round_robin_cnt;
struct key_map kmap;
@@ -453,8 +452,6 @@ enum {
/* The ULP mode/submode of an skbuff */
#define skb_ulp_mode(skb) (ULP_SKB_CB(skb)->ulp_mode)
-#define TCP_PAGE(sk) (sk->sk_frag.page)
-#define TCP_OFF(sk) (sk->sk_frag.offset)
static inline struct chtls_dev *to_chtls_dev(struct tls_toe_device *tlsdev)
{
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
index d567e42e1760..334381c1587f 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
@@ -825,12 +825,6 @@ void skb_entail(struct sock *sk, struct sk_buff *skb, int flags)
ULP_SKB_CB(skb)->flags = flags;
__skb_queue_tail(&csk->txq, skb);
sk->sk_wmem_queued += skb->truesize;
-
- if (TCP_PAGE(sk) && TCP_OFF(sk)) {
- put_page(TCP_PAGE(sk));
- TCP_PAGE(sk) = NULL;
- TCP_OFF(sk) = 0;
- }
}
static struct sk_buff *get_tx_skb(struct sock *sk, int size)
@@ -882,16 +876,12 @@ static void push_frames_if_head(struct sock *sk)
chtls_push_frames(csk, 1);
}
-static int chtls_skb_copy_to_page_nocache(struct sock *sk,
- struct iov_iter *from,
- struct sk_buff *skb,
- struct page *page,
- int off, int copy)
+static int chtls_skb_copy_to_va_nocache(struct sock *sk, struct iov_iter *from,
+ struct sk_buff *skb, char *va, int copy)
{
int err;
- err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) +
- off, copy, skb->len);
+ err = skb_do_copy_data_nocache(sk, skb, from, va, copy, skb->len);
if (err)
return err;
@@ -1114,82 +1104,44 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
if (err)
goto do_fault;
} else {
+ struct page_frag_cache *pfrag = &sk->sk_frag;
int i = skb_shinfo(skb)->nr_frags;
- struct page *page = TCP_PAGE(sk);
- int pg_size = PAGE_SIZE;
- int off = TCP_OFF(sk);
- bool merge;
-
- if (page)
- pg_size = page_size(page);
- if (off < pg_size &&
- skb_can_coalesce(skb, i, page, off)) {
+ unsigned int offset, fragsz;
+ bool merge = false;
+ struct page *page;
+ void *va;
+
+ fragsz = 32U;
+ page = page_frag_alloc_prepare(pfrag, &offset, &fragsz,
+ &va, sk->sk_allocation);
+ if (unlikely(!page))
+ goto wait_for_memory;
+
+ if (skb_can_coalesce(skb, i, page, offset))
merge = true;
- goto copy;
- }
- merge = false;
- if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
- MAX_SKB_FRAGS))
+ else if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
+ MAX_SKB_FRAGS))
goto new_buf;
- if (page && off == pg_size) {
- put_page(page);
- TCP_PAGE(sk) = page = NULL;
- pg_size = PAGE_SIZE;
- }
-
- if (!page) {
- gfp_t gfp = sk->sk_allocation;
- int order = cdev->send_page_order;
-
- if (order) {
- page = alloc_pages(gfp | __GFP_COMP |
- __GFP_NOWARN |
- __GFP_NORETRY,
- order);
- if (page)
- pg_size <<= order;
- }
- if (!page) {
- page = alloc_page(gfp);
- pg_size = PAGE_SIZE;
- }
- if (!page)
- goto wait_for_memory;
- off = 0;
- }
-copy:
- if (copy > pg_size - off)
- copy = pg_size - off;
+ copy = min_t(int, copy, fragsz);
if (is_tls_tx(csk))
copy = min_t(int, copy, csk->tlshws.txleft);
- err = chtls_skb_copy_to_page_nocache(sk, &msg->msg_iter,
- skb, page,
- off, copy);
- if (unlikely(err)) {
- if (!TCP_PAGE(sk)) {
- TCP_PAGE(sk) = page;
- TCP_OFF(sk) = 0;
- }
+ err = chtls_skb_copy_to_va_nocache(sk, &msg->msg_iter,
+ skb, va, copy);
+ if (unlikely(err))
goto do_fault;
- }
+
/* Update the skb. */
if (merge) {
skb_frag_size_add(
&skb_shinfo(skb)->frags[i - 1],
copy);
+ page_frag_alloc_commit_noref(pfrag, copy);
} else {
- skb_fill_page_desc(skb, i, page, off, copy);
- if (off + copy < pg_size) {
- /* space left keep page */
- get_page(page);
- TCP_PAGE(sk) = page;
- } else {
- TCP_PAGE(sk) = NULL;
- }
+ skb_fill_page_desc(skb, i, page, offset, copy);
+ page_frag_alloc_commit(pfrag, copy);
}
- TCP_OFF(sk) = off + copy;
}
if (unlikely(skb->len == mss))
tx_skb_finalize(skb);
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
index 455a54708be4..ba88b2fc7cd8 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
@@ -34,7 +34,6 @@ static DEFINE_MUTEX(notify_mutex);
static RAW_NOTIFIER_HEAD(listen_notify_list);
static struct proto chtls_cpl_prot, chtls_cpl_protv6;
struct request_sock_ops chtls_rsk_ops, chtls_rsk_opsv6;
-static uint send_page_order = (14 - PAGE_SHIFT < 0) ? 0 : 14 - PAGE_SHIFT;
static void register_listen_notifier(struct notifier_block *nb)
{
@@ -273,8 +272,6 @@ static void *chtls_uld_add(const struct cxgb4_lld_info *info)
INIT_WORK(&cdev->deferq_task, process_deferq);
spin_lock_init(&cdev->listen_lock);
spin_lock_init(&cdev->idr_lock);
- cdev->send_page_order = min_t(uint, get_order(32768),
- send_page_order);
cdev->max_host_sndbuf = 48 * 1024;
if (lldi->vr->key.size)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1d06c560c5e6..51df92fd60db 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1598,21 +1598,19 @@ static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
}
static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
- struct page_frag *alloc_frag, char *buf,
- int buflen, int len, int pad)
+ char *buf, int buflen, int len, int pad)
{
struct sk_buff *skb = build_skb(buf, buflen);
- if (!skb)
+ if (!skb) {
+ page_frag_free_va(buf);
return ERR_PTR(-ENOMEM);
+ }
skb_reserve(skb, pad);
skb_put(skb, len);
skb_set_owner_w(skb, tfile->socket.sk);
- get_page(alloc_frag->page);
- alloc_frag->offset += buflen;
-
return skb;
}
@@ -1660,7 +1658,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
struct virtio_net_hdr *hdr,
int len, int *skb_xdp)
{
- struct page_frag *alloc_frag = &current->task_frag;
+ struct page_frag_cache *alloc_frag = &current->task_frag;
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct bpf_prog *xdp_prog;
int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
@@ -1676,16 +1674,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
buflen += SKB_DATA_ALIGN(len + pad);
rcu_read_unlock();
- alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
- if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
+ buf = page_frag_alloc_va_align(alloc_frag, buflen, GFP_KERNEL,
+ SMP_CACHE_BYTES);
+ if (unlikely(!buf))
return ERR_PTR(-ENOMEM);
- buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
- copied = copy_page_from_iter(alloc_frag->page,
- alloc_frag->offset + pad,
- len, from);
- if (copied != len)
+ copied = copy_from_iter(buf + pad, len, from);
+ if (copied != len) {
+ page_frag_alloc_abort(alloc_frag, buflen);
return ERR_PTR(-EFAULT);
+ }
/* There's a small window that XDP may be set after the check
* of xdp_prog above, this should be rare and for simplicity
@@ -1693,8 +1691,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
*/
if (hdr->gso_type || !xdp_prog) {
*skb_xdp = 1;
- return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
- pad);
+ return __tun_build_skb(tfile, buf, buflen, len, pad);
}
*skb_xdp = 0;
@@ -1711,21 +1708,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
xdp_prepare_buff(&xdp, buf, pad, len, false);
act = bpf_prog_run_xdp(xdp_prog, &xdp);
- if (act == XDP_REDIRECT || act == XDP_TX) {
- get_page(alloc_frag->page);
- alloc_frag->offset += buflen;
- }
err = tun_xdp_act(tun, xdp_prog, &xdp, act);
- if (err < 0) {
- if (act == XDP_REDIRECT || act == XDP_TX)
- put_page(alloc_frag->page);
- goto out;
- }
-
if (err == XDP_REDIRECT)
xdp_do_flush();
- if (err != XDP_PASS)
+
+ if (err == XDP_REDIRECT || err == XDP_TX) {
+ goto out;
+ } else if (err < 0 || err != XDP_PASS) {
+ page_frag_alloc_abort(alloc_frag, buflen);
goto out;
+ }
pad = xdp.data - xdp.data_hard_start;
len = xdp.data_end - xdp.data;
@@ -1734,7 +1726,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
rcu_read_unlock();
local_bh_enable();
- return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);
+ return __tun_build_skb(tfile, buf, buflen, len, pad);
out:
bpf_net_ctx_clear(bpf_net_ctx);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8d150343d42..bb9a8e9d6d2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1355,7 +1355,7 @@ struct task_struct {
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
- struct page_frag task_frag;
+ struct page_frag_cache task_frag;
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
diff --git a/include/net/sock.h b/include/net/sock.h
index b5e702298ab7..8f6cc0dd2f4f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -461,7 +461,7 @@ struct sock {
struct sk_buff_head sk_write_queue;
u32 sk_dst_pending_confirm;
u32 sk_pacing_status; /* see enum sk_pacing */
- struct page_frag sk_frag;
+ struct page_frag_cache sk_frag;
struct timer_list sk_timer;
unsigned long sk_pacing_rate; /* bytes per second */
@@ -2484,7 +2484,7 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
* Return: a per task page_frag if context allows that,
* otherwise a per socket one.
*/
-static inline struct page_frag *sk_page_frag(struct sock *sk)
+static inline struct page_frag_cache *sk_page_frag(struct sock *sk)
{
if (sk->sk_use_task_frag)
return &current->task_frag;
@@ -2492,7 +2492,15 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)
return &sk->sk_frag;
}
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
+struct page *sk_page_frag_alloc_prepare(struct sock *sk,
+ struct page_frag_cache *pfrag,
+ unsigned int *offset,
+ unsigned int *size, void **va);
+
+struct page *sk_page_frag_alloc_pg_prepare(struct sock *sk,
+ struct page_frag_cache *pfrag,
+ unsigned int *offset,
+ unsigned int *size);
/*
* Default write policy as shown to user space via poll/select/SIGIO
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..b5257e74ec1c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -917,8 +917,7 @@ void __noreturn do_exit(long code)
if (tsk->splice_pipe)
free_pipe_info(tsk->splice_pipe);
- if (tsk->task_frag.page)
- put_page(tsk->task_frag.page);
+ page_frag_cache_drain(&tsk->task_frag);
exit_task_stack_account(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..7d380a6fd64a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -80,6 +80,7 @@
#include <linux/tty.h>
#include <linux/fs_struct.h>
#include <linux/magic.h>
+#include <linux/page_frag_cache.h>
#include <linux/perf_event.h>
#include <linux/posix-timers.h>
#include <linux/user-return-notifier.h>
@@ -1157,10 +1158,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
tsk->btrace_seq = 0;
#endif
tsk->splice_pipe = NULL;
- tsk->task_frag.page = NULL;
tsk->wake_q.next = NULL;
tsk->worker_private = NULL;
+ page_frag_cache_init(&tsk->task_frag);
kcov_task_init(tsk);
kmsan_task_create(tsk);
kmap_local_fork(tsk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bb77c3fd192f..a7e86984bbcd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3040,25 +3040,6 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
put_page(spd->pages[i]);
}
-static struct page *linear_to_page(struct page *page, unsigned int *len,
- unsigned int *offset,
- struct sock *sk)
-{
- struct page_frag *pfrag = sk_page_frag(sk);
-
- if (!sk_page_frag_refill(sk, pfrag))
- return NULL;
-
- *len = min_t(unsigned int, *len, pfrag->size - pfrag->offset);
-
- memcpy(page_address(pfrag->page) + pfrag->offset,
- page_address(page) + *offset, *len);
- *offset = pfrag->offset;
- pfrag->offset += *len;
-
- return pfrag->page;
-}
-
static bool spd_can_coalesce(const struct splice_pipe_desc *spd,
struct page *page,
unsigned int offset)
@@ -3069,6 +3050,38 @@ static bool spd_can_coalesce(const struct splice_pipe_desc *spd,
spd->partial[spd->nr_pages - 1].len == offset);
}
+static bool spd_fill_linear_page(struct splice_pipe_desc *spd,
+ struct page *page, unsigned int offset,
+ unsigned int *len, struct sock *sk)
+{
+ struct page_frag_cache *pfrag = sk_page_frag(sk);
+ unsigned int frag_len, frag_offset;
+ struct page *frag_page;
+ void *va;
+
+ frag_page = sk_page_frag_alloc_prepare(sk, pfrag, &frag_offset,
+ &frag_len, &va);
+ if (!frag_page)
+ return true;
+
+ *len = min_t(unsigned int, *len, frag_len);
+ memcpy(va, page_address(page) + offset, *len);
+
+ if (spd_can_coalesce(spd, frag_page, frag_offset)) {
+ spd->partial[spd->nr_pages - 1].len += *len;
+ page_frag_alloc_commit_noref(pfrag, *len);
+ return false;
+ }
+
+ page_frag_alloc_commit(pfrag, *len);
+ spd->pages[spd->nr_pages] = frag_page;
+ spd->partial[spd->nr_pages].len = *len;
+ spd->partial[spd->nr_pages].offset = frag_offset;
+ spd->nr_pages++;
+
+ return false;
+}
+
/*
* Fill page/offset/length into spd, if it can hold more pages.
*/
@@ -3081,11 +3094,9 @@ static bool spd_fill_page(struct splice_pipe_desc *spd,
if (unlikely(spd->nr_pages == MAX_SKB_FRAGS))
return true;
- if (linear) {
- page = linear_to_page(page, len, &offset, sk);
- if (!page)
- return true;
- }
+ if (linear)
+ return spd_fill_linear_page(spd, page, offset, len, sk);
+
if (spd_can_coalesce(spd, page, offset)) {
spd->partial[spd->nr_pages - 1].len += *len;
return false;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index bbf40b999713..956fd6103909 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -27,23 +27,25 @@ static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
int elem_first_coalesce)
{
- struct page_frag *pfrag = sk_page_frag(sk);
+ struct page_frag_cache *pfrag = sk_page_frag(sk);
u32 osize = msg->sg.size;
int ret = 0;
len -= msg->sg.size;
while (len > 0) {
+ unsigned int frag_offset, frag_len;
struct scatterlist *sge;
- u32 orig_offset;
+ struct page *page;
int use, i;
- if (!sk_page_frag_refill(sk, pfrag)) {
+ page = sk_page_frag_alloc_pg_prepare(sk, pfrag, &frag_offset,
+ &frag_len);
+ if (!page) {
ret = -ENOMEM;
goto msg_trim;
}
- orig_offset = pfrag->offset;
- use = min_t(int, len, pfrag->size - orig_offset);
+ use = min_t(int, len, frag_len);
if (!sk_wmem_schedule(sk, use)) {
ret = -ENOMEM;
goto msg_trim;
@@ -54,9 +56,10 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
sge = &msg->sg.data[i];
if (sk_msg_try_coalesce_ok(msg, elem_first_coalesce) &&
- sg_page(sge) == pfrag->page &&
- sge->offset + sge->length == orig_offset) {
+ sg_page(sge) == page &&
+ sge->offset + sge->length == frag_offset) {
sge->length += use;
+ page_frag_alloc_commit_noref(pfrag, use);
} else {
if (sk_msg_full(msg)) {
ret = -ENOSPC;
@@ -65,14 +68,13 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
sge = &msg->sg.data[msg->sg.end];
sg_unmark_end(sge);
- sg_set_page(sge, pfrag->page, use, orig_offset);
- get_page(pfrag->page);
+ sg_set_page(sge, page, use, frag_offset);
+ page_frag_alloc_commit(pfrag, use);
sk_msg_iter_next(msg, end);
}
sk_mem_charge(sk, use);
msg->sg.size += use;
- pfrag->offset += use;
len -= use;
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 9abc4fe25953..26c100ee9001 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2207,10 +2207,7 @@ static void __sk_destruct(struct rcu_head *head)
pr_debug("%s: optmem leakage (%d bytes) detected\n",
__func__, atomic_read(&sk->sk_omem_alloc));
- if (sk->sk_frag.page) {
- put_page(sk->sk_frag.page);
- sk->sk_frag.page = NULL;
- }
+ page_frag_cache_drain(&sk->sk_frag);
/* We do not need to acquire sk->sk_peer_lock, we are the last user. */
put_cred(sk->sk_peer_cred);
@@ -2956,16 +2953,43 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
}
EXPORT_SYMBOL(skb_page_frag_refill);
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+struct page *sk_page_frag_alloc_prepare(struct sock *sk,
+ struct page_frag_cache *pfrag,
+ unsigned int *offset,
+ unsigned int *size, void **va)
{
- if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
- return true;
+ struct page *page;
+
+ *size = 32U;
+ page = page_frag_alloc_prepare(pfrag, offset, size, va,
+ sk->sk_allocation);
+ if (likely(page))
+ return page;
sk_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
- return false;
+ return NULL;
+}
+EXPORT_SYMBOL(sk_page_frag_alloc_prepare);
+
+struct page *sk_page_frag_alloc_pg_prepare(struct sock *sk,
+ struct page_frag_cache *pfrag,
+ unsigned int *offset,
+ unsigned int *size)
+{
+ struct page *page;
+
+ *size = 32U;
+ page = page_frag_alloc_pg_prepare(pfrag, offset, size,
+ sk->sk_allocation);
+ if (likely(page))
+ return page;
+
+ sk_enter_memory_pressure(sk);
+ sk_stream_moderate_sndbuf(sk);
+ return NULL;
}
-EXPORT_SYMBOL(sk_page_frag_refill);
+EXPORT_SYMBOL(sk_page_frag_alloc_pg_prepare);
void __lock_sock(struct sock *sk)
__releases(&sk->sk_lock.slock)
@@ -3487,8 +3511,8 @@ void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid)
sk->sk_error_report = sock_def_error_report;
sk->sk_destruct = sock_def_destruct;
- sk->sk_frag.page = NULL;
- sk->sk_frag.offset = 0;
+ page_frag_cache_init(&sk->sk_frag);
+
sk->sk_peek_off = -1;
sk->sk_peer_pid = NULL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8a10a7c67834..57499e3ed9e5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -952,7 +952,7 @@ static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
struct sk_buff_head *queue,
struct inet_cork *cork,
- struct page_frag *pfrag,
+ struct page_frag_cache *pfrag,
int getfrag(void *from, char *to, int offset,
int len, int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen,
@@ -1228,31 +1228,38 @@ static int __ip_append_data(struct sock *sk,
wmem_alloc_delta += copy;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;
+ unsigned int frag_offset, frag_size;
+ struct page *page;
+ void *va;
err = -ENOMEM;
- if (!sk_page_frag_refill(sk, pfrag))
+ page = sk_page_frag_alloc_prepare(sk, pfrag,
+ &frag_offset,
+ &frag_size, &va);
+ if (!page)
goto error;
skb_zcopy_downgrade_managed(skb);
- if (!skb_can_coalesce(skb, i, pfrag->page,
- pfrag->offset)) {
+ copy = min_t(int, copy, frag_size);
+
+ if (!skb_can_coalesce(skb, i, page, frag_offset)) {
err = -EMSGSIZE;
if (i == MAX_SKB_FRAGS)
goto error;
- __skb_fill_page_desc(skb, i, pfrag->page,
- pfrag->offset, 0);
+ __skb_fill_page_desc(skb, i, page, frag_offset,
+ copy);
skb_shinfo(skb)->nr_frags = ++i;
- get_page(pfrag->page);
+ page_frag_alloc_commit(pfrag, copy);
+ } else {
+ skb_frag_size_add(
+ &skb_shinfo(skb)->frags[i - 1], copy);
+ page_frag_alloc_commit_noref(pfrag, copy);
}
- copy = min_t(int, copy, pfrag->size - pfrag->offset);
- if (getfrag(from,
- page_address(pfrag->page) + pfrag->offset,
- offset, copy, skb->len, skb) < 0)
+
+ if (getfrag(from, va, offset, copy, skb->len, skb) < 0)
goto error_efault;
- pfrag->offset += copy;
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
skb_len_add(skb, copy);
wmem_alloc_delta += copy;
} else {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7c392710ae15..815ec53b16d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1189,13 +1189,17 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (zc == 0) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
- struct page_frag *pfrag = sk_page_frag(sk);
-
- if (!sk_page_frag_refill(sk, pfrag))
+ struct page_frag_cache *pfrag = sk_page_frag(sk);
+ unsigned int frag_offset, frag_size;
+ struct page *page;
+ void *va;
+
+ page = sk_page_frag_alloc_prepare(sk, pfrag, &frag_offset,
+ &frag_size, &va);
+ if (!page)
goto wait_for_space;
- if (!skb_can_coalesce(skb, i, pfrag->page,
- pfrag->offset)) {
+ if (!skb_can_coalesce(skb, i, page, frag_offset)) {
if (i >= READ_ONCE(net_hotdata.sysctl_max_skb_frags)) {
tcp_mark_push(tp, skb);
goto new_segment;
@@ -1203,7 +1207,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
merge = false;
}
- copy = min_t(int, copy, pfrag->size - pfrag->offset);
+ copy = min_t(int, copy, frag_size);
if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
if (tcp_downgrade_zcopy_pure(sk, skb))
@@ -1216,20 +1220,18 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto wait_for_space;
err = skb_copy_to_va_nocache(sk, &msg->msg_iter, skb,
- page_address(pfrag->page) +
- pfrag->offset, copy);
+ va, copy);
if (err)
goto do_error;
/* Update the skb. */
if (merge) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ page_frag_alloc_commit_noref(pfrag, copy);
} else {
- skb_fill_page_desc(skb, i, pfrag->page,
- pfrag->offset, copy);
- page_ref_inc(pfrag->page);
+ skb_fill_page_desc(skb, i, page, frag_offset, copy);
+ page_frag_alloc_commit(pfrag, copy);
}
- pfrag->offset += copy;
} else if (zc == MSG_ZEROCOPY) {
/* First append to a fragless skb builds initial
* pure zerocopy skb
@@ -3131,11 +3133,7 @@ int tcp_disconnect(struct sock *sk, int flags)
WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
- if (sk->sk_frag.page) {
- put_page(sk->sk_frag.page);
- sk->sk_frag.page = NULL;
- sk->sk_frag.offset = 0;
- }
+ page_frag_cache_drain(&sk->sk_frag);
sk_error_report(sk);
return 0;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 16c48df8df4c..43208092b89c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3970,9 +3970,12 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_fastopen_request *fo = tp->fastopen_req;
- struct page_frag *pfrag = sk_page_frag(sk);
+ struct page_frag_cache *pfrag = sk_page_frag(sk);
+ unsigned int offset, size;
struct sk_buff *syn_data;
int space, err = 0;
+ struct page *page;
+ void *va;
tp->rx_opt.mss_clamp = tp->advmss; /* If MSS is not cached */
if (!tcp_fastopen_cookie_check(sk, &tp->rx_opt.mss_clamp, &fo->cookie))
@@ -3991,30 +3994,31 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
space = min_t(size_t, space, fo->size);
- if (space &&
- !skb_page_frag_refill(min_t(size_t, space, PAGE_SIZE),
- pfrag, sk->sk_allocation))
- goto fallback;
+ if (space) {
+ size = min_t(size_t, space, PAGE_SIZE);
+ page = page_frag_alloc_prepare(pfrag, &offset, &size, &va,
+ sk->sk_allocation);
+ if (!page)
+ goto fallback;
+ }
+
syn_data = tcp_stream_alloc_skb(sk, sk->sk_allocation, false);
if (!syn_data)
goto fallback;
memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
if (space) {
- space = min_t(size_t, space, pfrag->size - pfrag->offset);
+ space = min_t(size_t, space, size);
space = tcp_wmem_schedule(sk, space);
}
if (space) {
- space = copy_page_from_iter(pfrag->page, pfrag->offset,
- space, &fo->data->msg_iter);
+ space = _copy_from_iter(va, space, &fo->data->msg_iter);
if (unlikely(!space)) {
tcp_skb_tsorted_anchor_cleanup(syn_data);
kfree_skb(syn_data);
goto fallback;
}
- skb_fill_page_desc(syn_data, 0, pfrag->page,
- pfrag->offset, space);
- page_ref_inc(pfrag->page);
- pfrag->offset += space;
+ skb_fill_page_desc(syn_data, 0, page, offset, space);
+ page_frag_alloc_commit(pfrag, space);
skb_len_add(syn_data, space);
skb_zcopy_set(syn_data, fo->uarg, NULL);
}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ab504d31f0cd..86086b4bb55c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1405,7 +1405,7 @@ static int __ip6_append_data(struct sock *sk,
struct sk_buff_head *queue,
struct inet_cork_full *cork_full,
struct inet6_cork *v6_cork,
- struct page_frag *pfrag,
+ struct page_frag_cache *pfrag,
int getfrag(void *from, char *to, int offset,
int len, int odd, struct sk_buff *skb),
void *from, size_t length, int transhdrlen,
@@ -1746,32 +1746,39 @@ static int __ip6_append_data(struct sock *sk,
copy = err;
wmem_alloc_delta += copy;
} else if (!zc) {
+ unsigned int frag_offset, frag_size;
int i = skb_shinfo(skb)->nr_frags;
+ struct page *page;
+ void *va;
err = -ENOMEM;
- if (!sk_page_frag_refill(sk, pfrag))
+ page = sk_page_frag_alloc_prepare(sk, pfrag,
+ &frag_offset,
+ &frag_size, &va);
+ if (!page)
goto error;
skb_zcopy_downgrade_managed(skb);
- if (!skb_can_coalesce(skb, i, pfrag->page,
- pfrag->offset)) {
+ copy = min_t(int, copy, frag_size);
+
+ if (!skb_can_coalesce(skb, i, page, frag_offset)) {
err = -EMSGSIZE;
if (i == MAX_SKB_FRAGS)
goto error;
- __skb_fill_page_desc(skb, i, pfrag->page,
- pfrag->offset, 0);
+ __skb_fill_page_desc(skb, i, page, frag_offset,
+ copy);
skb_shinfo(skb)->nr_frags = ++i;
- get_page(pfrag->page);
+ page_frag_alloc_commit(pfrag, copy);
+ } else {
+ skb_frag_size_add(
+ &skb_shinfo(skb)->frags[i - 1], copy);
+ page_frag_alloc_commit_noref(pfrag, copy);
}
- copy = min_t(int, copy, pfrag->size - pfrag->offset);
- if (getfrag(from,
- page_address(pfrag->page) + pfrag->offset,
- offset, copy, skb->len, skb) < 0)
+
+ if (getfrag(from, va, offset, copy, skb->len, skb) < 0)
goto error_efault;
- pfrag->offset += copy;
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
skb->len += copy;
skb->data_len += copy;
skb->truesize += copy;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index eec6c56b7f3e..e52ddf716fa5 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -803,13 +803,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
while (msg_data_left(msg)) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
- struct page_frag *pfrag = sk_page_frag(sk);
-
- if (!sk_page_frag_refill(sk, pfrag))
+ struct page_frag_cache *pfrag = sk_page_frag(sk);
+ unsigned int offset, size;
+ struct page *page;
+ void *va;
+
+ page = sk_page_frag_alloc_prepare(sk, pfrag, &offset, &size,
+ &va);
+ if (!page)
goto wait_for_memory;
- if (!skb_can_coalesce(skb, i, pfrag->page,
- pfrag->offset)) {
+ if (!skb_can_coalesce(skb, i, page, offset)) {
if (i == MAX_SKB_FRAGS) {
struct sk_buff *tskb;
@@ -850,14 +854,12 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
if (head != skb)
head->truesize += copy;
} else {
- copy = min_t(int, msg_data_left(msg),
- pfrag->size - pfrag->offset);
+ copy = min_t(int, msg_data_left(msg), size);
if (!sk_wmem_schedule(sk, copy))
goto wait_for_memory;
err = skb_copy_to_va_nocache(sk, &msg->msg_iter, skb,
- page_address(pfrag->page) +
- pfrag->offset, copy);
+ va, copy);
if (err)
goto out_error;
@@ -865,13 +867,12 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
if (merge) {
skb_frag_size_add(
&skb_shinfo(skb)->frags[i - 1], copy);
+ page_frag_alloc_commit_noref(pfrag, copy);
} else {
- skb_fill_page_desc(skb, i, pfrag->page,
- pfrag->offset, copy);
- get_page(pfrag->page);
+ skb_fill_page_desc(skb, i, page, offset, copy);
+ page_frag_alloc_commit(pfrag, copy);
}
- pfrag->offset += copy;
}
copied += copy;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0d536b183a6c..3d27ede2c781 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -960,17 +960,16 @@ static bool mptcp_skb_can_collapse_to(u64 write_seq,
}
/* we can append data to the given data frag if:
- * - there is space available in the backing page_frag
- * - the data frag tail matches the current page_frag free offset
+ * - the data frag tail matches the current page and offset
* - the data frag end sequence number matches the current write seq
*/
static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
- const struct page_frag *pfrag,
+ const struct page *page,
+ const unsigned int offset,
const struct mptcp_data_frag *df)
{
- return df && pfrag->page == df->page &&
- pfrag->size - pfrag->offset > 0 &&
- pfrag->offset == (df->offset + df->data_len) &&
+ return df && page == df->page &&
+ offset == (df->offset + df->data_len) &&
df->data_seq + df->data_len == msk->write_seq;
}
@@ -1085,30 +1084,36 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
/* ensure we get enough memory for the frag hdr, beyond some minimal amount of
* data
*/
-static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+static struct page *mptcp_page_frag_alloc_prepare(struct sock *sk,
+ struct page_frag_cache *pfrag,
+ unsigned int *offset,
+ unsigned int *size, void **va)
{
- if (likely(skb_page_frag_refill(32U + sizeof(struct mptcp_data_frag),
- pfrag, sk->sk_allocation)))
- return true;
+ struct page *page;
+
+ page = page_frag_alloc_prepare(pfrag, offset, size, va,
+ sk->sk_allocation);
+ if (likely(page))
+ return page;
mptcp_enter_memory_pressure(sk);
- return false;
+ return NULL;
}
static struct mptcp_data_frag *
-mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
- int orig_offset)
+mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page *page,
+ unsigned int orig_offset)
{
int offset = ALIGN(orig_offset, sizeof(long));
struct mptcp_data_frag *dfrag;
- dfrag = (struct mptcp_data_frag *)(page_to_virt(pfrag->page) + offset);
+ dfrag = (struct mptcp_data_frag *)(page_to_virt(page) + offset);
dfrag->data_len = 0;
dfrag->data_seq = msk->write_seq;
dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);
dfrag->offset = offset + sizeof(struct mptcp_data_frag);
dfrag->already_sent = 0;
- dfrag->page = pfrag->page;
+ dfrag->page = page;
return dfrag;
}
@@ -1795,7 +1800,7 @@ static u32 mptcp_send_limit(const struct sock *sk)
static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
struct mptcp_sock *msk = mptcp_sk(sk);
- struct page_frag *pfrag;
+ struct page_frag_cache *pfrag;
size_t copied = 0;
int ret = 0;
long timeo;
@@ -1834,9 +1839,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
while (msg_data_left(msg)) {
int total_ts, frag_truesize = 0;
struct mptcp_data_frag *dfrag;
+ unsigned int offset = 0, size;
bool dfrag_collapsed;
- size_t psize, offset;
+ struct page *page;
u32 copy_limit;
+ size_t psize;
+ void *va;
/* ensure fitting the notsent_lowat() constraint */
copy_limit = mptcp_send_limit(sk);
@@ -1847,21 +1855,27 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
* page allocator
*/
dfrag = mptcp_pending_tail(sk);
- dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
+ size = 1;
+ page = page_frag_alloc_probe(pfrag, &offset, &size, &va);
+ dfrag_collapsed = mptcp_frag_can_collapse_to(msk, page, offset,
+ dfrag);
if (!dfrag_collapsed) {
- if (!mptcp_page_frag_refill(sk, pfrag))
+ size = 32U + sizeof(struct mptcp_data_frag);
+ page = mptcp_page_frag_alloc_prepare(sk, pfrag, &offset,
+ &size, &va);
+ if (!page)
goto wait_for_memory;
- dfrag = mptcp_carve_data_frag(msk, pfrag, pfrag->offset);
+ dfrag = mptcp_carve_data_frag(msk, page, offset);
frag_truesize = dfrag->overhead;
+ va += dfrag->overhead;
}
/* we do not bound vs wspace, to allow a single packet.
* memory accounting will prevent execessive memory usage
* anyway
*/
- offset = dfrag->offset + dfrag->data_len;
- psize = pfrag->size - offset;
+ psize = size - frag_truesize;
psize = min_t(size_t, psize, msg_data_left(msg));
psize = min_t(size_t, psize, copy_limit);
total_ts = psize + frag_truesize;
@@ -1869,8 +1883,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
if (!sk_wmem_schedule(sk, total_ts))
goto wait_for_memory;
- ret = do_copy_data_nocache(sk, psize, &msg->msg_iter,
- page_address(dfrag->page) + offset);
+ ret = do_copy_data_nocache(sk, psize, &msg->msg_iter, va);
if (ret)
goto do_error;
@@ -1879,7 +1892,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
copied += psize;
dfrag->data_len += psize;
frag_truesize += psize;
- pfrag->offset += frag_truesize;
WRITE_ONCE(msk->write_seq, msk->write_seq + psize);
/* charge data on mptcp pending queue to the msk socket
@@ -1887,11 +1899,14 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
*/
sk_wmem_queued_add(sk, frag_truesize);
if (!dfrag_collapsed) {
- get_page(dfrag->page);
+ page_frag_alloc_commit(pfrag, frag_truesize);
list_add_tail(&dfrag->list, &msk->rtx_queue);
if (!msk->first_pending)
WRITE_ONCE(msk->first_pending, dfrag);
+ } else {
+ page_frag_alloc_commit_noref(pfrag, frag_truesize);
}
+
pr_debug("msk=%p dfrag at seq=%llu len=%u sent=%u new=%d", msk,
dfrag->data_seq, dfrag->data_len, dfrag->already_sent,
!dfrag_collapsed);
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 8996c73c9779..4da465af972f 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -590,7 +590,7 @@ META_COLLECTOR(int_sk_sendmsg_off)
*err = -1;
return;
}
- dst->value = sk->sk_frag.offset;
+ dst->value = page_frag_cache_page_offset(&sk->sk_frag);
}
META_COLLECTOR(int_sk_write_pend)
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index dc063c2c7950..02925c25ae12 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -253,24 +253,42 @@ static void tls_device_resync_tx(struct sock *sk, struct tls_context *tls_ctx,
}
static void tls_append_frag(struct tls_record_info *record,
- struct page_frag *pfrag,
- int size)
+ struct page_frag_cache *pfrag, struct page *page,
+ unsigned int offset, unsigned int size)
{
skb_frag_t *frag;
frag = &record->frags[record->num_frags - 1];
- if (skb_frag_page(frag) == pfrag->page &&
- skb_frag_off(frag) + skb_frag_size(frag) == pfrag->offset) {
+ if (skb_frag_page(frag) == page &&
+ skb_frag_off(frag) + skb_frag_size(frag) == offset) {
skb_frag_size_add(frag, size);
+ page_frag_alloc_commit_noref(pfrag, size);
} else {
++frag;
- skb_frag_fill_page_desc(frag, pfrag->page, pfrag->offset,
- size);
+ skb_frag_fill_page_desc(frag, page, offset, size);
++record->num_frags;
- get_page(pfrag->page);
+ page_frag_alloc_commit(pfrag, size);
+ }
+
+ record->len += size;
+}
+
+static void tls_append_page(struct tls_record_info *record, struct page *page,
+ unsigned int offset, unsigned int size)
+{
+ skb_frag_t *frag;
+
+ frag = &record->frags[record->num_frags - 1];
+ if (skb_frag_page(frag) == page &&
+ skb_frag_off(frag) + skb_frag_size(frag) == offset) {
+ skb_frag_size_add(frag, size);
+ } else {
+ ++frag;
+ skb_frag_fill_page_desc(frag, page, offset, size);
+ ++record->num_frags;
+ get_page(page);
}
- pfrag->offset += size;
record->len += size;
}
@@ -311,11 +329,12 @@ static int tls_push_record(struct sock *sk,
static void tls_device_record_close(struct sock *sk,
struct tls_context *ctx,
struct tls_record_info *record,
- struct page_frag *pfrag,
+ struct page_frag_cache *pfrag,
unsigned char record_type)
{
struct tls_prot_info *prot = &ctx->prot_info;
- struct page_frag dummy_tag_frag;
+ unsigned int offset, size;
+ struct page *page;
/* append tag
* device will fill in the tag, we just need to append a placeholder
@@ -323,13 +342,14 @@ static void tls_device_record_close(struct sock *sk,
* increases frag count)
* if we can't allocate memory now use the dummy page
*/
- if (unlikely(pfrag->size - pfrag->offset < prot->tag_size) &&
- !skb_page_frag_refill(prot->tag_size, pfrag, sk->sk_allocation)) {
- dummy_tag_frag.page = dummy_page;
- dummy_tag_frag.offset = 0;
- pfrag = &dummy_tag_frag;
+ size = prot->tag_size;
+ page = page_frag_alloc_pg_prepare(pfrag, &offset, &size,
+ sk->sk_allocation);
+ if (unlikely(!page)) {
+ tls_append_page(record, dummy_page, 0, prot->tag_size);
+ } else {
+ tls_append_frag(record, pfrag, page, offset, prot->tag_size);
}
- tls_append_frag(record, pfrag, prot->tag_size);
/* fill prepend */
tls_fill_prepend(ctx, skb_frag_address(&record->frags[0]),
@@ -337,57 +357,52 @@ static void tls_device_record_close(struct sock *sk,
record_type);
}
-static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
- struct page_frag *pfrag,
+static int tls_create_new_record(struct sock *sk,
+ struct tls_offload_context_tx *offload_ctx,
+ struct page_frag_cache *pfrag,
size_t prepend_size)
{
struct tls_record_info *record;
+ unsigned int offset;
+ struct page *page;
skb_frag_t *frag;
record = kmalloc(sizeof(*record), GFP_KERNEL);
if (!record)
return -ENOMEM;
- frag = &record->frags[0];
- skb_frag_fill_page_desc(frag, pfrag->page, pfrag->offset,
- prepend_size);
-
- get_page(pfrag->page);
- pfrag->offset += prepend_size;
+ page = page_frag_alloc_pg(pfrag, &offset, prepend_size,
+ sk->sk_allocation);
+ if (!page) {
+ kfree(record);
+ READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
+ sk_stream_moderate_sndbuf(sk);
+ return -ENOMEM;
+ }
+ frag = &record->frags[0];
+ skb_frag_fill_page_desc(frag, page, offset, prepend_size);
record->num_frags = 1;
record->len = prepend_size;
offload_ctx->open_record = record;
return 0;
}
-static int tls_do_allocation(struct sock *sk,
- struct tls_offload_context_tx *offload_ctx,
- struct page_frag *pfrag,
- size_t prepend_size)
+static struct page *tls_do_allocation(struct sock *sk,
+ struct tls_offload_context_tx *ctx,
+ struct page_frag_cache *pfrag,
+ size_t prepend_size, unsigned int *offset,
+ unsigned int *size, void **va)
{
- int ret;
-
- if (!offload_ctx->open_record) {
- if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
- sk->sk_allocation))) {
- READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
- sk_stream_moderate_sndbuf(sk);
- return -ENOMEM;
- }
+ if (!ctx->open_record) {
+ int ret;
- ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+ ret = tls_create_new_record(sk, ctx, pfrag, prepend_size);
if (ret)
- return ret;
-
- if (pfrag->size > pfrag->offset)
- return 0;
+ return NULL;
}
- if (!sk_page_frag_refill(sk, pfrag))
- return -ENOMEM;
-
- return 0;
+ return sk_page_frag_alloc_prepare(sk, pfrag, offset, size, va);
}
static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
@@ -424,8 +439,8 @@ static int tls_push_data(struct sock *sk,
struct tls_prot_info *prot = &tls_ctx->prot_info;
struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
struct tls_record_info *record;
+ struct page_frag_cache *pfrag;
int tls_push_record_flags;
- struct page_frag *pfrag;
size_t orig_size = size;
u32 max_open_record_len;
bool more = false;
@@ -462,8 +477,13 @@ static int tls_push_data(struct sock *sk,
max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
prot->prepend_size;
do {
- rc = tls_do_allocation(sk, ctx, pfrag, prot->prepend_size);
- if (unlikely(rc)) {
+ unsigned int frag_offset, frag_size;
+ struct page *page;
+ void *va;
+
+ page = tls_do_allocation(sk, ctx, pfrag, prot->prepend_size,
+ &frag_offset, &frag_size, &va);
+ if (unlikely(!page)) {
rc = sk_stream_wait_memory(sk, &timeo);
if (!rc)
continue;
@@ -491,8 +511,8 @@ static int tls_push_data(struct sock *sk,
copy = min_t(size_t, size, max_open_record_len - record->len);
if (copy && (flags & MSG_SPLICE_PAGES)) {
- struct page_frag zc_pfrag;
- struct page **pages = &zc_pfrag.page;
+ struct page *splice_page;
+ struct page **pages = &splice_page;
size_t off;
rc = iov_iter_extract_pages(iter, &pages,
@@ -504,24 +524,21 @@ static int tls_push_data(struct sock *sk,
}
copy = rc;
- if (WARN_ON_ONCE(!sendpage_ok(zc_pfrag.page))) {
+ if (WARN_ON_ONCE(!sendpage_ok(splice_page))) {
iov_iter_revert(iter, copy);
rc = -EIO;
goto handle_error;
}
- zc_pfrag.offset = off;
- zc_pfrag.size = copy;
- tls_append_frag(record, &zc_pfrag, copy);
+ tls_append_page(record, splice_page, off, copy);
} else if (copy) {
- copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
+ copy = min_t(size_t, copy, frag_size);
- rc = tls_device_copy_data(page_address(pfrag->page) +
- pfrag->offset, copy,
- iter);
+ rc = tls_device_copy_data(va, copy, iter);
if (rc)
goto handle_error;
- tls_append_frag(record, pfrag, copy);
+
+ tls_append_frag(record, pfrag, page, frag_offset, copy);
}
size -= copy;
--
2.33.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag()
2024-08-08 12:37 [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache Yunsheng Lin
@ 2024-08-13 11:30 ` Yunsheng Lin
2024-08-13 15:11 ` Alexander Duyck
2 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-13 11:30 UTC (permalink / raw)
To: davem, kuba, pabeni, Alexander Duyck
Cc: netdev, linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Matthias Brugger,
AngeloGioacchino Del Regno, bpf, linux-arm-kernel, linux-mediatek
On 2024/8/8 20:37, Yunsheng Lin wrote:
...
>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
>
> 1. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
>
> Change log:
> V13:
> 1. Move page_frag_test from mm/ to tools/testing/selftest/mm
> 2. Use ptr_ring to replace ptr_pool for page_frag_test.c
> 3. Retest based on the new testing ko, which shows a very different
> result than using ptr_pool.
Hi, Davem & Jakub & Paolo
It seems the state of this patchset has been changed to 'Deferred' in
patchwork. The info regarding that state in 'Documentation/process/
maintainer-netdev.rst' says:
Deferred patch needs to be reposted later, usually due to dependency
or because it was posted for a closed tree
Obviously it was not the closed-tree reason here, so I guess it was a
dependency causing the 'Deferred' state? I am not sure I understand what
sort of dependency this patchset is running into. It would be good to
mention what needs to be done to avoid that kind of dependency, too.
Hi, Alexander
The v13 mainly addresses your comments about the page_frag_test module.
It seems your main remaining comment about this patchset is the new API
naming, and there was no feedback on the previous version for about a
week:
https://lore.kernel.org/all/ca6be29e-ab53-4673-9624-90d41616a154@huawei.com/
If there is still disagreement about the new API naming or other things, it
would be good to continue the discussion, so that we can reach a better
understanding of each other's main concerns, and better ideas might come up
too, like in the discussion about the new layout for 'struct page_frag_cache'
and the new refactoring in patch 8.
>
* Re: [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag()
2024-08-13 11:30 ` [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
@ 2024-08-13 15:11 ` Alexander Duyck
2024-08-14 11:55 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2024-08-13 15:11 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Matthias Brugger, AngeloGioacchino Del Regno, bpf,
linux-arm-kernel, linux-mediatek
On Tue, Aug 13, 2024 at 4:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/8 20:37, Yunsheng Lin wrote:
>
> ...
>
> >
> > CC: Alexander Duyck <alexander.duyck@gmail.com>
> >
> > 1. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
> >
> > Change log:
> > V13:
> > 1. Move page_frag_test from mm/ to tools/testing/selftest/mm
> > 2. Use ptr_ring to replace ptr_pool for page_frag_test.c
> > 3. Retest based on the new testing ko, which shows a very different
> > result than using ptr_pool.
>
> Hi, Davem & Jakub & Paolo
> It seems the state of this patchset has been changed to 'Deferred' in
> patchwork. The info regarding that state in 'Documentation/process/
> maintainer-netdev.rst' says:
>
> Deferred patch needs to be reposted later, usually due to dependency
> or because it was posted for a closed tree
>
> Obviously it was not the closed-tree reason here, so I guess it was a
> dependency causing the 'Deferred' state? I am not sure I understand what
> sort of dependency this patchset is running into. It would be good to
> mention what needs to be done to avoid that kind of dependency, too.
>
>
> Hi, Alexander
> The v13 mainly addresses your comments about the page_frag_test module.
> It seems your main remaining comment about this patchset is the new API
> naming, and there was no feedback on the previous version for about a
> week:
>
> https://lore.kernel.org/all/ca6be29e-ab53-4673-9624-90d41616a154@huawei.com/
>
> If there is still disagreement about the new API naming or other things, it
> would be good to continue the discussion, so that we can reach a better
> understanding of each other's main concerns, and better ideas might come up
> too, like in the discussion about the new layout for 'struct page_frag_cache'
> and the new refactoring in patch 8.
Sorry for not getting to this sooner. I have been travelling for the
last week and a half. I just got home on Sunday and I am suffering
from a pretty bad bout of jet lag as I am overcoming a 12.5 hour time
change. The earliest I can probably get to this for review would be
tomorrow morning (8/14 in the AM PDT) as my calendar has me fully
booked with meetings most of today.
Thanks,
- Alex
* Re: [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag()
2024-08-13 15:11 ` Alexander Duyck
@ 2024-08-14 11:55 ` Yunsheng Lin
0 siblings, 0 replies; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-14 11:55 UTC (permalink / raw)
To: Alexander Duyck
Cc: davem, kuba, pabeni, netdev, linux-kernel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Matthias Brugger, AngeloGioacchino Del Regno, bpf,
linux-arm-kernel, linux-mediatek
On 2024/8/13 23:11, Alexander Duyck wrote:
> On Tue, Aug 13, 2024 at 4:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/8/8 20:37, Yunsheng Lin wrote:
>>
>> ...
>>
>>>
>>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>>>
>>> 1. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
>>>
>>> Change log:
>>> V13:
>>> 1. Move page_frag_test from mm/ to tools/testing/selftest/mm
>>> 2. Use ptr_ring to replace ptr_pool for page_frag_test.c
>>> 3. Retest based on the new testing ko, which shows a very different
>>> result than using ptr_pool.
>>
>> Hi, Davem & Jakub & Paolo
>> It seems the state of this patchset has been changed to 'Deferred' in
>> patchwork. The info regarding that state in 'Documentation/process/
>> maintainer-netdev.rst' says:
>>
>> Deferred patch needs to be reposted later, usually due to dependency
>> or because it was posted for a closed tree
>>
>> Obviously it was not the closed-tree reason here, so I guess it was a
>> dependency causing the 'Deferred' state? I am not sure I understand what
>> sort of dependency this patchset is running into. It would be good to
>> mention what needs to be done to avoid that kind of dependency, too.
>>
>>
>> Hi, Alexander
>> The v13 mainly addresses your comments about the page_frag_test module.
>> It seems your main remaining comment about this patchset is the new API
>> naming, and there was no feedback on the previous version for about a
>> week:
>>
>> https://lore.kernel.org/all/ca6be29e-ab53-4673-9624-90d41616a154@huawei.com/
>>
>> If there is still disagreement about the new API naming or other things, it
>> would be good to continue the discussion, so that we can reach a better
>> understanding of each other's main concerns, and better ideas might come up
>> too, like in the discussion about the new layout for 'struct page_frag_cache'
>> and the new refactoring in patch 8.
>
> Sorry for not getting to this sooner. I have been travelling for the
> last week and a half. I just got home on Sunday and I am suffering
> from a pretty bad bout of jet lag as I am overcoming a 12.5 hour time
> change. The earliest I can probably get to this for review would be
> tomorrow morning (8/14 in the AM PDT) as my calendar has me fully
> booked with meetings most of today.
Thanks for clarifying.
I appreciate the time and effort spent on reviewing.
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-08 12:37 ` [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API Yunsheng Lin
@ 2024-08-14 15:49 ` Alexander H Duyck
2024-08-15 2:59 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander H Duyck @ 2024-08-14 15:49 UTC (permalink / raw)
To: Yunsheng Lin, davem, kuba, pabeni
Cc: netdev, linux-kernel, Subbaraya Sundeep, Chuck Lever,
Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> Currently the page_frag API is returning 'virtual address'
> or 'va' when allocing and expecting 'virtual address' or
> 'va' as input when freeing.
>
> As we are about to support new use cases that the caller
> need to deal with 'struct page' or need to deal with both
> 'va' and 'struct page'. In order to differentiate the API
> handling between 'va' and 'struct page', add '_va' suffix
> to the corresponding API mirroring the page_pool_alloc_va()
> API of the page_pool. So that callers expecting to deal with
> va, page or both va and page may call page_frag_alloc_va*,
> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
> Acked-by: Chuck Lever <chuck.lever@oracle.com>
> Acked-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
> drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
> drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
> drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
> .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
> .../marvell/octeontx2/nic/otx2_common.c | 2 +-
> drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
> drivers/nvme/host/tcp.c | 8 +++----
> drivers/nvme/target/tcp.c | 22 +++++++++----------
> drivers/vhost/net.c | 6 ++---
> include/linux/page_frag_cache.h | 21 +++++++++---------
> include/linux/skbuff.h | 2 +-
> kernel/bpf/cpumap.c | 2 +-
> mm/page_frag_cache.c | 12 +++++-----
> net/core/skbuff.c | 16 +++++++-------
> net/core/xdp.c | 2 +-
> net/rxrpc/txbuf.c | 15 +++++++------
> net/sunrpc/svcsock.c | 6 ++---
> .../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
> 19 files changed, 75 insertions(+), 70 deletions(-)
>
I still say no to this patch. It is an unnecessary name change and adds
no value. If you insist on this patch I will reject the set every time.
The fact is it pollutes the git history and just makes things harder to
maintain without adding any value, as you aren't changing what the
functions do. In addition, it makes it that much harder to backport
fixes in the future, as people will have to work around the rename.
* Re: [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache
2024-08-08 12:37 ` [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache Yunsheng Lin
@ 2024-08-14 22:01 ` Alexander H Duyck
2024-08-18 14:17 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander H Duyck @ 2024-08-14 22:01 UTC (permalink / raw)
To: Yunsheng Lin, davem, kuba, pabeni
Cc: netdev, linux-kernel, Mat Martineau, Ayush Sawal, Eric Dumazet,
Willem de Bruijn, Jason Wang, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, John Fastabend,
Jakub Sitnicki, David Ahern, Matthieu Baerts, Geliang Tang,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Boris Pismenny, bpf,
mptcp
On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> Use the newly introduced prepare/probe/commit API to
> replace page_frag with page_frag_cache for sk_page_frag().
>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Acked-by: Mat Martineau <martineau@kernel.org>
> ---
> .../chelsio/inline_crypto/chtls/chtls.h | 3 -
> .../chelsio/inline_crypto/chtls/chtls_io.c | 100 ++++---------
> .../chelsio/inline_crypto/chtls/chtls_main.c | 3 -
> drivers/net/tun.c | 48 +++---
> include/linux/sched.h | 2 +-
> include/net/sock.h | 14 +-
> kernel/exit.c | 3 +-
> kernel/fork.c | 3 +-
> net/core/skbuff.c | 59 +++++---
> net/core/skmsg.c | 22 +--
> net/core/sock.c | 46 ++++--
> net/ipv4/ip_output.c | 33 +++--
> net/ipv4/tcp.c | 32 ++--
> net/ipv4/tcp_output.c | 28 ++--
> net/ipv6/ip6_output.c | 33 +++--
> net/kcm/kcmsock.c | 27 ++--
> net/mptcp/protocol.c | 67 +++++----
> net/sched/em_meta.c | 2 +-
> net/tls/tls_device.c | 137 ++++++++++--------
> 19 files changed, 347 insertions(+), 315 deletions(-)
>
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
> index 7ff82b6778ba..fe2b6a8ef718 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
> @@ -234,7 +234,6 @@ struct chtls_dev {
> struct list_head list_node;
> struct list_head rcu_node;
> struct list_head na_node;
> - unsigned int send_page_order;
> int max_host_sndbuf;
> u32 round_robin_cnt;
> struct key_map kmap;
> @@ -453,8 +452,6 @@ enum {
>
> /* The ULP mode/submode of an skbuff */
> #define skb_ulp_mode(skb) (ULP_SKB_CB(skb)->ulp_mode)
> -#define TCP_PAGE(sk) (sk->sk_frag.page)
> -#define TCP_OFF(sk) (sk->sk_frag.offset)
>
> static inline struct chtls_dev *to_chtls_dev(struct tls_toe_device *tlsdev)
> {
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
> index d567e42e1760..334381c1587f 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
> @@ -825,12 +825,6 @@ void skb_entail(struct sock *sk, struct sk_buff *skb, int flags)
> ULP_SKB_CB(skb)->flags = flags;
> __skb_queue_tail(&csk->txq, skb);
> sk->sk_wmem_queued += skb->truesize;
> -
> - if (TCP_PAGE(sk) && TCP_OFF(sk)) {
> - put_page(TCP_PAGE(sk));
> - TCP_PAGE(sk) = NULL;
> - TCP_OFF(sk) = 0;
> - }
> }
>
> static struct sk_buff *get_tx_skb(struct sock *sk, int size)
> @@ -882,16 +876,12 @@ static void push_frames_if_head(struct sock *sk)
> chtls_push_frames(csk, 1);
> }
>
> -static int chtls_skb_copy_to_page_nocache(struct sock *sk,
> - struct iov_iter *from,
> - struct sk_buff *skb,
> - struct page *page,
> - int off, int copy)
> +static int chtls_skb_copy_to_va_nocache(struct sock *sk, struct iov_iter *from,
> + struct sk_buff *skb, char *va, int copy)
> {
> int err;
>
> - err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) +
> - off, copy, skb->len);
> + err = skb_do_copy_data_nocache(sk, skb, from, va, copy, skb->len);
> if (err)
> return err;
>
> @@ -1114,82 +1104,44 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> if (err)
> goto do_fault;
> } else {
> + struct page_frag_cache *pfrag = &sk->sk_frag;
Is this even valid? Shouldn't it be using sk_page_frag to get the
reference here? Seems like it might be trying to instantiate an unused
cache.
As per my earlier suggestion, this could be made very simple if we just
pulled a bio_vec out of the page frag cache at the start. With that we
could essentially plug it into the TCP_PAGE/TCP_OFF block here and most
of it would function the same.
> int i = skb_shinfo(skb)->nr_frags;
> - struct page *page = TCP_PAGE(sk);
> - int pg_size = PAGE_SIZE;
> - int off = TCP_OFF(sk);
> - bool merge;
> -
> - if (page)
> - pg_size = page_size(page);
> - if (off < pg_size &&
> - skb_can_coalesce(skb, i, page, off)) {
> + unsigned int offset, fragsz;
> + bool merge = false;
> + struct page *page;
> + void *va;
> +
> + fragsz = 32U;
> + page = page_frag_alloc_prepare(pfrag, &offset, &fragsz,
> + &va, sk->sk_allocation);
> + if (unlikely(!page))
> + goto wait_for_memory;
> +
> + if (skb_can_coalesce(skb, i, page, offset))
> merge = true;
> - goto copy;
> - }
> - merge = false;
> - if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
> - MAX_SKB_FRAGS))
> + else if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
> + MAX_SKB_FRAGS))
> goto new_buf;
>
> - if (page && off == pg_size) {
> - put_page(page);
> - TCP_PAGE(sk) = page = NULL;
> - pg_size = PAGE_SIZE;
> - }
> -
> - if (!page) {
> - gfp_t gfp = sk->sk_allocation;
> - int order = cdev->send_page_order;
> -
> - if (order) {
> - page = alloc_pages(gfp | __GFP_COMP |
> - __GFP_NOWARN |
> - __GFP_NORETRY,
> - order);
> - if (page)
> - pg_size <<= order;
> - }
> - if (!page) {
> - page = alloc_page(gfp);
> - pg_size = PAGE_SIZE;
> - }
> - if (!page)
> - goto wait_for_memory;
> - off = 0;
> - }
> -copy:
> - if (copy > pg_size - off)
> - copy = pg_size - off;
> + copy = min_t(int, copy, fragsz);
> if (is_tls_tx(csk))
> copy = min_t(int, copy, csk->tlshws.txleft);
>
> - err = chtls_skb_copy_to_page_nocache(sk, &msg->msg_iter,
> - skb, page,
> - off, copy);
> - if (unlikely(err)) {
> - if (!TCP_PAGE(sk)) {
> - TCP_PAGE(sk) = page;
> - TCP_OFF(sk) = 0;
> - }
> + err = chtls_skb_copy_to_va_nocache(sk, &msg->msg_iter,
> + skb, va, copy);
> + if (unlikely(err))
> goto do_fault;
> - }
> +
> /* Update the skb. */
> if (merge) {
> skb_frag_size_add(
> &skb_shinfo(skb)->frags[i - 1],
> copy);
> + page_frag_alloc_commit_noref(pfrag, copy);
> } else {
> - skb_fill_page_desc(skb, i, page, off, copy);
> - if (off + copy < pg_size) {
> - /* space left keep page */
> - get_page(page);
> - TCP_PAGE(sk) = page;
> - } else {
> - TCP_PAGE(sk) = NULL;
> - }
> + skb_fill_page_desc(skb, i, page, offset, copy);
> + page_frag_alloc_commit(pfrag, copy);
> }
> - TCP_OFF(sk) = off + copy;
> }
> if (unlikely(skb->len == mss))
> tx_skb_finalize(skb);
Really, there is so much refactoring here that it is hard to tell what
is what. I would suggest just plugging in an intermediate value for now;
you can save the refactor for later.
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
> index 455a54708be4..ba88b2fc7cd8 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
> @@ -34,7 +34,6 @@ static DEFINE_MUTEX(notify_mutex);
> static RAW_NOTIFIER_HEAD(listen_notify_list);
> static struct proto chtls_cpl_prot, chtls_cpl_protv6;
> struct request_sock_ops chtls_rsk_ops, chtls_rsk_opsv6;
> -static uint send_page_order = (14 - PAGE_SHIFT < 0) ? 0 : 14 - PAGE_SHIFT;
>
> static void register_listen_notifier(struct notifier_block *nb)
> {
> @@ -273,8 +272,6 @@ static void *chtls_uld_add(const struct cxgb4_lld_info *info)
> INIT_WORK(&cdev->deferq_task, process_deferq);
> spin_lock_init(&cdev->listen_lock);
> spin_lock_init(&cdev->idr_lock);
> - cdev->send_page_order = min_t(uint, get_order(32768),
> - send_page_order);
> cdev->max_host_sndbuf = 48 * 1024;
>
> if (lldi->vr->key.size)
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 1d06c560c5e6..51df92fd60db 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1598,21 +1598,19 @@ static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
> }
>
> static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
> - struct page_frag *alloc_frag, char *buf,
> - int buflen, int len, int pad)
> + char *buf, int buflen, int len, int pad)
> {
> struct sk_buff *skb = build_skb(buf, buflen);
>
> - if (!skb)
> + if (!skb) {
> + page_frag_free_va(buf);
> return ERR_PTR(-ENOMEM);
> + }
>
> skb_reserve(skb, pad);
> skb_put(skb, len);
> skb_set_owner_w(skb, tfile->socket.sk);
>
> - get_page(alloc_frag->page);
> - alloc_frag->offset += buflen;
> -
Rather than freeing the buf it would be better if you were to just
stick to the existing pattern and commit the alloc_frag at the end here
instead of calling get_page.
> return skb;
> }
>
> @@ -1660,7 +1658,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> struct virtio_net_hdr *hdr,
> int len, int *skb_xdp)
> {
> - struct page_frag *alloc_frag = &current->task_frag;
> + struct page_frag_cache *alloc_frag = &current->task_frag;
> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> struct bpf_prog *xdp_prog;
> int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> @@ -1676,16 +1674,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> buflen += SKB_DATA_ALIGN(len + pad);
> rcu_read_unlock();
>
> - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> - if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
> + buf = page_frag_alloc_va_align(alloc_frag, buflen, GFP_KERNEL,
> + SMP_CACHE_BYTES);
> + if (unlikely(!buf))
> return ERR_PTR(-ENOMEM);
>
> - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> - copied = copy_page_from_iter(alloc_frag->page,
> - alloc_frag->offset + pad,
> - len, from);
> - if (copied != len)
> + copied = copy_from_iter(buf + pad, len, from);
> + if (copied != len) {
> + page_frag_alloc_abort(alloc_frag, buflen);
> return ERR_PTR(-EFAULT);
> + }
>
> /* There's a small window that XDP may be set after the check
> * of xdp_prog above, this should be rare and for simplicity
> @@ -1693,8 +1691,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> */
> if (hdr->gso_type || !xdp_prog) {
> *skb_xdp = 1;
> - return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
> - pad);
> + return __tun_build_skb(tfile, buf, buflen, len, pad);
> }
>
> *skb_xdp = 0;
> @@ -1711,21 +1708,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> xdp_prepare_buff(&xdp, buf, pad, len, false);
>
> act = bpf_prog_run_xdp(xdp_prog, &xdp);
> - if (act == XDP_REDIRECT || act == XDP_TX) {
> - get_page(alloc_frag->page);
> - alloc_frag->offset += buflen;
> - }
> err = tun_xdp_act(tun, xdp_prog, &xdp, act);
> - if (err < 0) {
> - if (act == XDP_REDIRECT || act == XDP_TX)
> - put_page(alloc_frag->page);
> - goto out;
> - }
> -
> if (err == XDP_REDIRECT)
> xdp_do_flush();
> - if (err != XDP_PASS)
> +
> + if (err == XDP_REDIRECT || err == XDP_TX) {
> + goto out;
> + } else if (err < 0 || err != XDP_PASS) {
> + page_frag_alloc_abort(alloc_frag, buflen);
> goto out;
> + }
>
Your abort function here is not necessarily safe. It is assuming that
nothing else might have taken a reference to the page or modified it in
some way. Generally we shouldn't allow rewinding the pointer until we
check the page count to guarantee nobody else is now working with a
copy of the page.
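A minimal userspace model of the constraint being asked for here — all
names and the bookkeeping below are illustrative assumptions, not the
actual kernel implementation — might look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the cache hands out fragments from one page and
 * tracks how many references it still owns via pagecnt_bias. */
struct frag_cache_model {
	unsigned int offset;        /* next free byte in the page */
	unsigned int pagecnt_bias;  /* references the cache still owns */
	unsigned int page_refcount; /* total references on the page */
};

/* Hand out 'size' bytes; each allocation consumes one owned reference. */
static unsigned int frag_alloc(struct frag_cache_model *fc, unsigned int size)
{
	unsigned int off = fc->offset;

	fc->offset += size;
	fc->pagecnt_bias--;
	return off;
}

/* Abort the last allocation. The offset may only be rewound when the
 * cache owns every reference to the page; otherwise another entity may
 * already be looking at the memory we would hand out again. */
static bool frag_abort(struct frag_cache_model *fc, unsigned int size)
{
	fc->pagecnt_bias++;
	if (fc->pagecnt_bias == fc->page_refcount) {
		fc->offset -= size;
		return true;	/* safely rewound */
	}
	return false;		/* forward progress only */
}
```

The point of the sketch is only the guard in frag_abort(): rewinding is
conditional on the bias matching the page count, which is exactly the
check the comment above says must happen before rolling the pointer
back.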
> pad = xdp.data - xdp.data_hard_start;
> len = xdp.data_end - xdp.data;
> @@ -1734,7 +1726,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
> rcu_read_unlock();
> local_bh_enable();
>
> - return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);
> + return __tun_build_skb(tfile, buf, buflen, len, pad);
>
> out:
> bpf_net_ctx_clear(bpf_net_ctx);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f8d150343d42..bb9a8e9d6d2d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1355,7 +1355,7 @@ struct task_struct {
> /* Cache last used pipe for splice(): */
> struct pipe_inode_info *splice_pipe;
>
> - struct page_frag task_frag;
> + struct page_frag_cache task_frag;
>
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b5e702298ab7..8f6cc0dd2f4f 100644
>
It occurs to me that bio_vec and page_frag are essentially the same
thing. Instead of having your functions pass a bio_vec, it might make
more sense to work with just a page_frag as the unit to be probed and
committed, with the page_frag_cache being what is borrowed from.
With that I think you could clean up a bunch of the churn this code is
generating, as there is too much refactoring to make this easy to
review. If you were to change things so that you keep working with a
page_frag, just probing it out of the page_frag_cache and committing
your change back in, it would make the diff much more readable in my
opinion.
The general idea would be that the page and offset should not change
from probe to commit, and the size should only ever shrink or stay the
same. That would make this more readable than returning a page while
passing pointers to offset and length/size.
Also, as I mentioned earlier, we cannot roll the offset backwards. It
needs to always make forward progress unless we own all instances of
the page, as it is possible that a section may have been shared out
from underneath us when we showed some other entity the memory.
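A rough userspace sketch of the probe/commit flow suggested above —
every name, signature, and the flat-buffer stand-in for a page here is
an assumption for illustration, not the proposed kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the page_frag unit being handed out. */
struct frag_model {
	void *page;		/* models struct page * */
	unsigned int offset;
	unsigned int size;
};

/* Hypothetical stand-in for the page_frag_cache borrowed from. */
struct frag_cache_model {
	char buf[4096];		/* models the backing page */
	unsigned int used;
};

/* Probe: borrow up to 'want' bytes from the cache without consuming. */
static int frag_probe(struct frag_cache_model *nc, unsigned int want,
		      struct frag_model *pfrag)
{
	if (sizeof(nc->buf) - nc->used < want)
		return -1;
	pfrag->page = nc->buf;
	pfrag->offset = nc->used;
	pfrag->size = want;
	return 0;
}

/* Commit: consume 'used' bytes. Page and offset must be unchanged from
 * the probe, and the committed size may only shrink, never grow. */
static void frag_commit(struct frag_cache_model *nc,
			const struct frag_model *pfrag, unsigned int used)
{
	assert(pfrag->page == nc->buf);
	assert(pfrag->offset == nc->used);
	assert(used <= pfrag->size);
	nc->used += used;
}
```

The invariants asserted in frag_commit() encode the rule stated above:
the frag's identity is fixed between probe and commit, and only the
consumed size may vary downward.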
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-14 15:49 ` Alexander H Duyck
@ 2024-08-15 2:59 ` Yunsheng Lin
2024-08-15 15:00 ` Alexander Duyck
0 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-15 2:59 UTC (permalink / raw)
To: Alexander H Duyck, davem, kuba, pabeni
Cc: netdev, linux-kernel, Subbaraya Sundeep, Chuck Lever,
Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On 2024/8/14 23:49, Alexander H Duyck wrote:
> On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
>> Currently the page_frag API returns a 'virtual address'
>> or 'va' when allocating and expects a 'virtual address' or
>> 'va' as input when freeing.
>>
>> As we are about to support new use cases where the caller
>> needs to deal with 'struct page', or with both 'va' and
>> 'struct page', add a '_va' suffix to the corresponding API
>> to differentiate the API handling between 'va' and 'struct
>> page', mirroring the page_pool_alloc_va() API of the
>> page_pool. That way, callers expecting to deal with va,
>> page, or both va and page may call the page_frag_alloc_va*,
>> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
>> Acked-by: Chuck Lever <chuck.lever@oracle.com>
>> Acked-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
>> drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
>> drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
>> drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
>> .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
>> .../marvell/octeontx2/nic/otx2_common.c | 2 +-
>> drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
>> drivers/nvme/host/tcp.c | 8 +++----
>> drivers/nvme/target/tcp.c | 22 +++++++++----------
>> drivers/vhost/net.c | 6 ++---
>> include/linux/page_frag_cache.h | 21 +++++++++---------
>> include/linux/skbuff.h | 2 +-
>> kernel/bpf/cpumap.c | 2 +-
>> mm/page_frag_cache.c | 12 +++++-----
>> net/core/skbuff.c | 16 +++++++-------
>> net/core/xdp.c | 2 +-
>> net/rxrpc/txbuf.c | 15 +++++++------
>> net/sunrpc/svcsock.c | 6 ++---
>> .../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
>> 19 files changed, 75 insertions(+), 70 deletions(-)
>>
>
> I still say no to this patch. It is an unnecessary name change and adds
> no value. If you insist on this patch I will reject the set every time.
>
> The fact is it is polluting the git history and just makes things
> harder to maintain without adding any value as you aren't changing what
> the function does and there is no need for this. In addition it just
> makes it that much harder to backport fixes in the future as people
> will have to work around the rename.

I guess I have to disagree with the above 'no need for this' part for
now, as mentioned in [1]:

"There are three types of API as proposed in this patchset instead of
two types of API:
1. page_frag_alloc_va() returns [va].
2. page_frag_alloc_pg() returns [page, offset].
3. page_frag_alloc() returns [va] & [page, offset].

You seemed to miss that we need a third naming for the type 3 API.
Do you see the type 3 API as a valid API? If yes, what naming are you
suggesting for it? If no, why is it not a valid API?"

1. https://lore.kernel.org/all/ca6be29e-ab53-4673-9624-90d41616a154@huawei.com/
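To make the three call styles concrete, here is a userspace sketch of
the shapes being discussed — the names, the flat-buffer stand-in for a
page, and the signatures are all illustrative assumptions, not the
patchset's actual declarations:

```c
#include <assert.h>
#include <stddef.h>

struct cache_model {
	char buf[4096];		/* models the backing page */
	unsigned int used;
};

/* Type 1: caller only needs the virtual address. */
static void *alloc_va(struct cache_model *c, unsigned int sz)
{
	void *va = c->buf + c->used;

	c->used += sz;
	return va;
}

/* Type 2: caller only needs [page, offset]. */
static void *alloc_pg(struct cache_model *c, unsigned int sz,
		      unsigned int *offset)
{
	*offset = c->used;
	c->used += sz;
	return c->buf;		/* models struct page * */
}

/* Type 3: caller needs both [va] and [page, offset]. */
static void *alloc_both(struct cache_model *c, unsigned int sz,
			unsigned int *offset, void **va)
{
	*offset = c->used;
	*va = c->buf + c->used;
	c->used += sz;
	return c->buf;
}
```

The disagreement above is essentially whether type 2 and type 3 need
separate entry points, given that type 3 already returns everything
type 2 does.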
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-15 2:59 ` Yunsheng Lin
@ 2024-08-15 15:00 ` Alexander Duyck
2024-08-16 11:55 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2024-08-15 15:00 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On Wed, Aug 14, 2024 at 8:00 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/14 23:49, Alexander H Duyck wrote:
> > On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> >> Currently the page_frag API returns a 'virtual address'
> >> or 'va' when allocating and expects a 'virtual address' or
> >> 'va' as input when freeing.
> >>
> >> As we are about to support new use cases where the caller
> >> needs to deal with 'struct page', or with both 'va' and
> >> 'struct page', add a '_va' suffix to the corresponding API
> >> to differentiate the API handling between 'va' and 'struct
> >> page', mirroring the page_pool_alloc_va() API of the
> >> page_pool. That way, callers expecting to deal with va,
> >> page, or both va and page may call the page_frag_alloc_va*,
> >> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
> >>
> >> CC: Alexander Duyck <alexander.duyck@gmail.com>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
> >> Acked-by: Chuck Lever <chuck.lever@oracle.com>
> >> Acked-by: Sagi Grimberg <sagi@grimberg.me>
> >> ---
> >> drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
> >> drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
> >> drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
> >> drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
> >> .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
> >> .../marvell/octeontx2/nic/otx2_common.c | 2 +-
> >> drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
> >> drivers/nvme/host/tcp.c | 8 +++----
> >> drivers/nvme/target/tcp.c | 22 +++++++++----------
> >> drivers/vhost/net.c | 6 ++---
> >> include/linux/page_frag_cache.h | 21 +++++++++---------
> >> include/linux/skbuff.h | 2 +-
> >> kernel/bpf/cpumap.c | 2 +-
> >> mm/page_frag_cache.c | 12 +++++-----
> >> net/core/skbuff.c | 16 +++++++-------
> >> net/core/xdp.c | 2 +-
> >> net/rxrpc/txbuf.c | 15 +++++++------
> >> net/sunrpc/svcsock.c | 6 ++---
> >> .../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
> >> 19 files changed, 75 insertions(+), 70 deletions(-)
> >>
> >
> > I still say no to this patch. It is an unnecessary name change and adds
> > no value. If you insist on this patch I will reject the set every time.
> >
> > The fact is it is polluting the git history and just makes things
> > harder to maintain without adding any value as you aren't changing what
> > the function does and there is no need for this. In addition it just
>
> I guess I have to disagree with the above 'no need for this' part for
> now, as mentioned in [1]:
>
> "There are three types of API as proposed in this patchset instead of
> two types of API:
> 1. page_frag_alloc_va() returns [va].
> 2. page_frag_alloc_pg() returns [page, offset].
> 3. page_frag_alloc() returns [va] & [page, offset].
>
> You seemed to miss that we need a third naming for the type 3 API.
> Do you see the type 3 API as a valid API? If yes, what naming are you
> suggesting for it? If no, why is it not a valid API?"
I didn't. I just don't see the point in pushing out the existing API
to support that. In reality 2 and 3 are redundant. You probably only
need 3. Like I mentioned earlier you can essentially just pass a
page_frag via pointer to the function. With that you could also look
at just returning a virtual address as well if you insist on having
something that returns all of the above. No point in having 2 and 3 be
separate functions.
I am going to nack this patch set if you insist on this pointless
renaming. The fact is it is just adding noise that adds no value.
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-15 15:00 ` Alexander Duyck
@ 2024-08-16 11:55 ` Yunsheng Lin
2024-08-19 15:54 ` Alexander Duyck
0 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-16 11:55 UTC (permalink / raw)
To: Alexander Duyck
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On 2024/8/15 23:00, Alexander Duyck wrote:
> On Wed, Aug 14, 2024 at 8:00 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/8/14 23:49, Alexander H Duyck wrote:
>>> On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
>>>> Currently the page_frag API returns a 'virtual address'
>>>> or 'va' when allocating and expects a 'virtual address' or
>>>> 'va' as input when freeing.
>>>>
>>>> As we are about to support new use cases where the caller
>>>> needs to deal with 'struct page', or with both 'va' and
>>>> 'struct page', add a '_va' suffix to the corresponding API
>>>> to differentiate the API handling between 'va' and 'struct
>>>> page', mirroring the page_pool_alloc_va() API of the
>>>> page_pool. That way, callers expecting to deal with va,
>>>> page, or both va and page may call the page_frag_alloc_va*,
>>>> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>>>>
>>>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
>>>> Acked-by: Chuck Lever <chuck.lever@oracle.com>
>>>> Acked-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
>>>> drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
>>>> drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
>>>> drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
>>>> .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
>>>> .../marvell/octeontx2/nic/otx2_common.c | 2 +-
>>>> drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
>>>> drivers/nvme/host/tcp.c | 8 +++----
>>>> drivers/nvme/target/tcp.c | 22 +++++++++----------
>>>> drivers/vhost/net.c | 6 ++---
>>>> include/linux/page_frag_cache.h | 21 +++++++++---------
>>>> include/linux/skbuff.h | 2 +-
>>>> kernel/bpf/cpumap.c | 2 +-
>>>> mm/page_frag_cache.c | 12 +++++-----
>>>> net/core/skbuff.c | 16 +++++++-------
>>>> net/core/xdp.c | 2 +-
>>>> net/rxrpc/txbuf.c | 15 +++++++------
>>>> net/sunrpc/svcsock.c | 6 ++---
>>>> .../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
>>>> 19 files changed, 75 insertions(+), 70 deletions(-)
>>>>
>>>
>>> I still say no to this patch. It is an unnecessary name change and adds
>>> no value. If you insist on this patch I will reject the set every time.
>>>
>>> The fact is it is polluting the git history and just makes things
>>> harder to maintain without adding any value as you aren't changing what
>>> the function does and there is no need for this. In addition it just
>>
>> I guess I have to disagree with the above 'no need for this' part for
>> now, as mentioned in [1]:
>>
>> "There are three types of API as proposed in this patchset instead of
>> two types of API:
>> 1. page_frag_alloc_va() returns [va].
>> 2. page_frag_alloc_pg() returns [page, offset].
>> 3. page_frag_alloc() returns [va] & [page, offset].
>>
>> You seemed to miss that we need a third naming for the type 3 API.
>> Do you see the type 3 API as a valid API? If yes, what naming are you
>> suggesting for it? If no, why is it not a valid API?"
>
> I didn't. I just don't see the point in pushing out the existing API
> to support that. In reality 2 and 3 are redundant. You probably only
> need 3. Like I mentioned earlier you can essentially just pass a
> page_frag via pointer to the function. With that you could also look
> at just returning a virtual address as well if you insist on having
> something that returns all of the above. No point in having 2 and 3 be
> separate functions.

If the caller just expects [page, offset], do you expect the caller to
also use the type 3 API, which returns both [va] and [page, offset]?
I am not sure I understand why you think 2 and 3 are redundant here.
If you think 2 and 3 are redundant, aren't 1 and 3 also redundant by
the same argument?
Let's be more specific about your suggestion here: which way is the
preferred way to return the virtual address? It seems there are two
options:
1. Return the virtual address by function returning as below:
void *page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio);
2. Return the virtual address by double pointer as below:
int page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio,
void **va);
If one of the above options is what you have in mind, please be
specific about which one you prefer and why. If neither is what you
have in mind, please list out the declaration of the API you have in
mind.
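As a neutral illustration of the two conventions (the bio_vec-style
struct and both signatures below are sketched from the declarations
above under assumed semantics, not real kernel code), the difference
for the caller looks like:

```c
#include <assert.h>
#include <stddef.h>

/* Models [page, offset, len] the way a bio_vec would carry them. */
struct vec_model {
	void *page;
	unsigned int offset;
	unsigned int len;
};

struct cache_model {
	char buf[4096];		/* models the backing page */
	unsigned int used;
};

/* Option 1: the va comes back as the function's return value. */
static void *alloc_ret_va(struct cache_model *nc, struct vec_model *bio,
			  unsigned int sz)
{
	bio->page = nc->buf;
	bio->offset = nc->used;
	bio->len = sz;
	nc->used += sz;
	return nc->buf + bio->offset;
}

/* Option 2: the va comes back through a double pointer, and the
 * return value becomes an error code. */
static int alloc_out_va(struct cache_model *nc, struct vec_model *bio,
			unsigned int sz, void **va)
{
	if (sizeof(nc->buf) - nc->used < sz)
		return -1;
	bio->page = nc->buf;
	bio->offset = nc->used;
	bio->len = sz;
	*va = nc->buf + bio->offset;
	nc->used += sz;
	return 0;
}
```

Option 1 signals failure with a NULL return, while option 2 frees the
return value for an errno-style code at the cost of an extra out
parameter.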
>
> I am going to nack this patch set if you insist on this pointless
> renaming. The fact is it is just adding noise that adds no value.
* Re: [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache
2024-08-14 22:01 ` Alexander H Duyck
@ 2024-08-18 14:17 ` Yunsheng Lin
0 siblings, 0 replies; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-18 14:17 UTC (permalink / raw)
To: Alexander H Duyck, Yunsheng Lin, davem, kuba, pabeni
Cc: netdev, linux-kernel, Mat Martineau, Ayush Sawal, Eric Dumazet,
Willem de Bruijn, Jason Wang, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, John Fastabend,
Jakub Sitnicki, David Ahern, Matthieu Baerts, Geliang Tang,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Boris Pismenny, bpf,
mptcp
On 8/15/2024 6:01 AM, Alexander H Duyck wrote:
> On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
>> Use the newly introduced prepare/probe/commit API to
>> replace page_frag with page_frag_cache for sk_page_frag().
>>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Acked-by: Mat Martineau <martineau@kernel.org>
>> ---
>> .../chelsio/inline_crypto/chtls/chtls.h | 3 -
>> .../chelsio/inline_crypto/chtls/chtls_io.c | 100 ++++---------
>> .../chelsio/inline_crypto/chtls/chtls_main.c | 3 -
>> drivers/net/tun.c | 48 +++---
>> include/linux/sched.h | 2 +-
>> include/net/sock.h | 14 +-
>> kernel/exit.c | 3 +-
>> kernel/fork.c | 3 +-
>> net/core/skbuff.c | 59 +++++---
>> net/core/skmsg.c | 22 +--
>> net/core/sock.c | 46 ++++--
>> net/ipv4/ip_output.c | 33 +++--
>> net/ipv4/tcp.c | 32 ++--
>> net/ipv4/tcp_output.c | 28 ++--
>> net/ipv6/ip6_output.c | 33 +++--
>> net/kcm/kcmsock.c | 27 ++--
>> net/mptcp/protocol.c | 67 +++++----
>> net/sched/em_meta.c | 2 +-
>> net/tls/tls_device.c | 137 ++++++++++--------
>> 19 files changed, 347 insertions(+), 315 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
>> index 7ff82b6778ba..fe2b6a8ef718 100644
>> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
>> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
>> @@ -234,7 +234,6 @@ struct chtls_dev {
>> struct list_head list_node;
>> struct list_head rcu_node;
>> struct list_head na_node;
>> - unsigned int send_page_order;
>> int max_host_sndbuf;
>> u32 round_robin_cnt;
>> struct key_map kmap;
>> @@ -453,8 +452,6 @@ enum {
>>
>> /* The ULP mode/submode of an skbuff */
>> #define skb_ulp_mode(skb) (ULP_SKB_CB(skb)->ulp_mode)
>> -#define TCP_PAGE(sk) (sk->sk_frag.page)
>> -#define TCP_OFF(sk) (sk->sk_frag.offset)
>>
>> static inline struct chtls_dev *to_chtls_dev(struct tls_toe_device *tlsdev)
>> {
>> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
>> index d567e42e1760..334381c1587f 100644
>> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
>> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
>> @@ -825,12 +825,6 @@ void skb_entail(struct sock *sk, struct sk_buff *skb, int flags)
>> ULP_SKB_CB(skb)->flags = flags;
>> __skb_queue_tail(&csk->txq, skb);
>> sk->sk_wmem_queued += skb->truesize;
>> -
>> - if (TCP_PAGE(sk) && TCP_OFF(sk)) {
>> - put_page(TCP_PAGE(sk));
>> - TCP_PAGE(sk) = NULL;
>> - TCP_OFF(sk) = 0;
>> - }
>> }
>>
>> static struct sk_buff *get_tx_skb(struct sock *sk, int size)
>> @@ -882,16 +876,12 @@ static void push_frames_if_head(struct sock *sk)
>> chtls_push_frames(csk, 1);
>> }
>>
>> -static int chtls_skb_copy_to_page_nocache(struct sock *sk,
>> - struct iov_iter *from,
>> - struct sk_buff *skb,
>> - struct page *page,
>> - int off, int copy)
>> +static int chtls_skb_copy_to_va_nocache(struct sock *sk, struct iov_iter *from,
>> + struct sk_buff *skb, char *va, int copy)
>> {
>> int err;
>>
>> - err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) +
>> - off, copy, skb->len);
>> + err = skb_do_copy_data_nocache(sk, skb, from, va, copy, skb->len);
>> if (err)
>> return err;
>>
>> @@ -1114,82 +1104,44 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>> if (err)
>> goto do_fault;
>> } else {
>> + struct page_frag_cache *pfrag = &sk->sk_frag;
>
> Is this even valid? Shouldn't it be using sk_page_frag to get the
> reference here? Seems like it might be trying to instantiate an unused
> cache.

chtls_sendmsg() only uses sk->sk_frag, see below.
I am not sure if I understand what you meant by "trying to instantiate
an unused cache". sk->sk_frag is supposed to be instantiated in
sock_init_data_uid() by calling page_frag_cache_init() in this patch.
>
> As per my earlier suggestion this could be made very simple if we are
> just pulling a bio_vec out from the page cache at the start. With that
> we could essentially plug it into the TCP_PAGE/TCP_OFF block here and
> most of it would just function the same.
I am not sure we have the same understanding of why chtls needs more
changes than other places when replacing page_frag with
page_frag_cache.
chtls_sendmsg() was duplicating its own implementation of page_frag
refilling instead of using skb_page_frag_refill(); we can remove that
implementation by using the new API, which is why there are more
changes for chtls than elsewhere. Are you suggesting we keep chtls's
own implementation of page_frag refilling by plugging it into the
TCP_PAGE/TCP_OFF block here?
>
>> int i = skb_shinfo(skb)->nr_frags;
>> - struct page *page = TCP_PAGE(sk);
The TCP_PAGE macro is defined as below; that is why sk->sk_frag is used
instead of sk_page_frag() in the chtls case, if I understand your
question correctly:
#define TCP_PAGE(sk) (sk->sk_frag.page)
#define TCP_OFF(sk) (sk->sk_frag.offset)
>> - int pg_size = PAGE_SIZE;
>> - int off = TCP_OFF(sk);
>> - bool merge;
>> -
>> - if (page)
>> - pg_size = page_size(page);
>> - if (off < pg_size &&
>> - skb_can_coalesce(skb, i, page, off)) {
>> + unsigned int offset, fragsz;
>> + bool merge = false;
>> + struct page *page;
>> + void *va;
>> +
>> + fragsz = 32U;
>> + page = page_frag_alloc_prepare(pfrag, &offset, &fragsz,
>> + &va, sk->sk_allocation);
>> + if (unlikely(!page))
>> + goto wait_for_memory;
>> +
>> + if (skb_can_coalesce(skb, i, page, offset))
>> merge = true;
>> - goto copy;
>> - }
>> - merge = false;
>> - if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
>> - MAX_SKB_FRAGS))
>> + else if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
>> + MAX_SKB_FRAGS))
>> goto new_buf;
>>
>> - if (page && off == pg_size) {
>> - put_page(page);
>> - TCP_PAGE(sk) = page = NULL;
>> - pg_size = PAGE_SIZE;
>> - }
>> -
>> - if (!page) {
>> - gfp_t gfp = sk->sk_allocation;
>> - int order = cdev->send_page_order;
>> -
>> - if (order) {
>> - page = alloc_pages(gfp | __GFP_COMP |
>> - __GFP_NOWARN |
>> - __GFP_NORETRY,
>> - order);
>> - if (page)
>> - pg_size <<= order;
>> - }
>> - if (!page) {
>> - page = alloc_page(gfp);
>> - pg_size = PAGE_SIZE;
>> - }
>> - if (!page)
>> - goto wait_for_memory;
>> - off = 0;
>> - }
>> -copy:
>> - if (copy > pg_size - off)
>> - copy = pg_size - off;
>> + copy = min_t(int, copy, fragsz);
>> if (is_tls_tx(csk))
>> copy = min_t(int, copy, csk->tlshws.txleft);
>>
>> - err = chtls_skb_copy_to_page_nocache(sk, &msg->msg_iter,
>> - skb, page,
>> - off, copy);
>> - if (unlikely(err)) {
>> - if (!TCP_PAGE(sk)) {
>> - TCP_PAGE(sk) = page;
>> - TCP_OFF(sk) = 0;
>> - }
>> + err = chtls_skb_copy_to_va_nocache(sk, &msg->msg_iter,
>> + skb, va, copy);
>> + if (unlikely(err))
>> goto do_fault;
>> - }
>> +
>> /* Update the skb. */
>> if (merge) {
>> skb_frag_size_add(
>> &skb_shinfo(skb)->frags[i - 1],
>> copy);
>> + page_frag_alloc_commit_noref(pfrag, copy);
>> } else {
>> - skb_fill_page_desc(skb, i, page, off, copy);
>> - if (off + copy < pg_size) {
>> - /* space left keep page */
>> - get_page(page);
>> - TCP_PAGE(sk) = page;
>> - } else {
>> - TCP_PAGE(sk) = NULL;
>> - }
>> + skb_fill_page_desc(skb, i, page, offset, copy);
>> + page_frag_alloc_commit(pfrag, copy);
>> }
>> - TCP_OFF(sk) = off + copy;
>> }
>> if (unlikely(skb->len == mss))
>> tx_skb_finalize(skb);
>
> Really there is so much refactoring here that it is hard to tell what
> is what. I would suggest just plugging in an intermediary value and
> you can save the refactoring for later.
I am not sure your above suggestion works; even if it does, I am not
sure it is as simple as just plugging in an intermediary value, as the
fields in 'struct page_frag_cache' are quite different from the fields
in 'struct page_frag' (shown below) when replacing page_frag with
page_frag_cache for sk->sk_frag:
struct page_frag_cache {
unsigned long encoded_va;
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
__u16 remaining;
__u16 pagecnt_bias;
#else
__u32 remaining;
__u32 pagecnt_bias;
#endif
};
struct page_frag {
struct page *page;
#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
__u32 offset;
__u32 size;
#else
__u16 offset;
__u16 size;
#endif
};
>
>> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
>> index 455a54708be4..ba88b2fc7cd8 100644
>> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
>> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
>> @@ -34,7 +34,6 @@ static DEFINE_MUTEX(notify_mutex);
>> static RAW_NOTIFIER_HEAD(listen_notify_list);
>> static struct proto chtls_cpl_prot, chtls_cpl_protv6;
>> struct request_sock_ops chtls_rsk_ops, chtls_rsk_opsv6;
>> -static uint send_page_order = (14 - PAGE_SHIFT < 0) ? 0 : 14 - PAGE_SHIFT;
>>
>> static void register_listen_notifier(struct notifier_block *nb)
>> {
>> @@ -273,8 +272,6 @@ static void *chtls_uld_add(const struct cxgb4_lld_info *info)
>> INIT_WORK(&cdev->deferq_task, process_deferq);
>> spin_lock_init(&cdev->listen_lock);
>> spin_lock_init(&cdev->idr_lock);
>> - cdev->send_page_order = min_t(uint, get_order(32768),
>> - send_page_order);
>> cdev->max_host_sndbuf = 48 * 1024;
>>
>> if (lldi->vr->key.size)
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 1d06c560c5e6..51df92fd60db 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1598,21 +1598,19 @@ static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
>> }
>>
>> static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
>> - struct page_frag *alloc_frag, char *buf,
>> - int buflen, int len, int pad)
>> + char *buf, int buflen, int len, int pad)
>> {
>> struct sk_buff *skb = build_skb(buf, buflen);
>>
>> - if (!skb)
>> + if (!skb) {
>> + page_frag_free_va(buf);
>> return ERR_PTR(-ENOMEM);
>> + }
>>
>> skb_reserve(skb, pad);
>> skb_put(skb, len);
>> skb_set_owner_w(skb, tfile->socket.sk);
>>
>> - get_page(alloc_frag->page);
>> - alloc_frag->offset += buflen;
>> -
>
> Rather than freeing the buf it would be better if you were to just
> stick to the existing pattern and commit the alloc_frag at the end here
> instead of calling get_page.
>
>> return skb;
>> }
>>
>> @@ -1660,7 +1658,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> struct virtio_net_hdr *hdr,
>> int len, int *skb_xdp)
>> {
>> - struct page_frag *alloc_frag = &current->task_frag;
>> + struct page_frag_cache *alloc_frag = &current->task_frag;
>> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>> struct bpf_prog *xdp_prog;
>> int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> @@ -1676,16 +1674,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> buflen += SKB_DATA_ALIGN(len + pad);
>> rcu_read_unlock();
>>
>> - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
>> - if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
>> + buf = page_frag_alloc_va_align(alloc_frag, buflen, GFP_KERNEL,
>> + SMP_CACHE_BYTES);
>> + if (unlikely(!buf))
>> return ERR_PTR(-ENOMEM);
>>
>> - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>> - copied = copy_page_from_iter(alloc_frag->page,
>> - alloc_frag->offset + pad,
>> - len, from);
>> - if (copied != len)
>> + copied = copy_from_iter(buf + pad, len, from);
>> + if (copied != len) {
>> + page_frag_alloc_abort(alloc_frag, buflen);
>> return ERR_PTR(-EFAULT);
>> + }
>>
>> /* There's a small window that XDP may be set after the check
>> * of xdp_prog above, this should be rare and for simplicity
>> @@ -1693,8 +1691,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> */
>> if (hdr->gso_type || !xdp_prog) {
>> *skb_xdp = 1;
>> - return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
>> - pad);
>> + return __tun_build_skb(tfile, buf, buflen, len, pad);
>> }
>>
>> *skb_xdp = 0;
>> @@ -1711,21 +1708,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> xdp_prepare_buff(&xdp, buf, pad, len, false);
>>
>> act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> - if (act == XDP_REDIRECT || act == XDP_TX) {
>> - get_page(alloc_frag->page);
>> - alloc_frag->offset += buflen;
The above is only executed for the XDP_REDIRECT and XDP_TX cases.
>> - }
>> err = tun_xdp_act(tun, xdp_prog, &xdp, act);
>> - if (err < 0) {
>> - if (act == XDP_REDIRECT || act == XDP_TX)
>> - put_page(alloc_frag->page);
And there is a put_page() immediately when xdp_do_redirect() or
tun_xdp_tx() fails in tun_xdp_act(), so could something else have
taken a reference to the page and modified it in some way even when
tun_xdp_act() returns an error? Could you be more specific about why
the above happens?
>> - goto out;
>> - }
>> -
>> if (err == XDP_REDIRECT)
>> xdp_do_flush();
>> - if (err != XDP_PASS)
>> +
>> + if (err == XDP_REDIRECT || err == XDP_TX) {
>> + goto out;
>> + } else if (err < 0 || err != XDP_PASS) {
>> + page_frag_alloc_abort(alloc_frag, buflen);
>> goto out;
>> + }
>>
>
> Your abort function here is not necessarily safe. It is assuming that
> nothing else might have taken a reference to the page or modified it in
> some way. Generally we shouldn't allow rewinding the pointer until we
> check the page count to guarantee nobody else is now working with a
> copy of the page.
>
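The concern about rewinding can be made concrete with a small userspace sketch: an abort helper should only move the offset back while the page's refcount shows no other user. All names and the refcount model below are illustrative only, not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace model of a frag cache whose backing page is refcounted.
 * Field names are made up for illustration.
 */
struct frag_cache {
	char *base;		/* stands in for the backing page */
	unsigned int offset;	/* first free byte */
	atomic_int refcount;	/* stands in for the page refcount */
};

/* Rewinding the offset is only safe while we hold the sole reference
 * to the page; otherwise someone else may be using (or may have
 * modified) the bytes we handed out, so they must stay consumed.
 */
static void frag_abort(struct frag_cache *nc, unsigned int fragsz)
{
	if (atomic_load(&nc->refcount) == 1)
		nc->offset -= fragsz;
}
```

If another reference exists, the fragment simply stays consumed and is reclaimed when the page itself is freed.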
>> pad = xdp.data - xdp.data_hard_start;
>> len = xdp.data_end - xdp.data;
>> @@ -1734,7 +1726,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> rcu_read_unlock();
>> local_bh_enable();
>>
>> - return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);
>> + return __tun_build_skb(tfile, buf, buflen, len, pad);
>>
>> out:
>> bpf_net_ctx_clear(bpf_net_ctx);
>
...
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-16 11:55 ` Yunsheng Lin
@ 2024-08-19 15:54 ` Alexander Duyck
2024-08-20 13:07 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2024-08-19 15:54 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On Fri, Aug 16, 2024 at 4:55 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/15 23:00, Alexander Duyck wrote:
> > On Wed, Aug 14, 2024 at 8:00 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/8/14 23:49, Alexander H Duyck wrote:
> >>> On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> >>>> Currently the page_frag API is returning 'virtual address'
> >>>> or 'va' when allocing and expecting 'virtual address' or
> >>>> 'va' as input when freeing.
> >>>>
> >>>> As we are about to support new use cases that the caller
> >>>> need to deal with 'struct page' or need to deal with both
> >>>> 'va' and 'struct page'. In order to differentiate the API
> >>>> handling between 'va' and 'struct page', add '_va' suffix
> >>>> to the corresponding API mirroring the page_pool_alloc_va()
> >>>> API of the page_pool. So that callers expecting to deal with
> >>>> va, page or both va and page may call page_frag_alloc_va*,
> >>>> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
> >>>>
> >>>> CC: Alexander Duyck <alexander.duyck@gmail.com>
> >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >>>> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
> >>>> Acked-by: Chuck Lever <chuck.lever@oracle.com>
> >>>> Acked-by: Sagi Grimberg <sagi@grimberg.me>
> >>>> ---
> >>>> drivers/net/ethernet/google/gve/gve_rx.c | 4 ++--
> >>>> drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
> >>>> drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
> >>>> drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
> >>>> .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
> >>>> .../marvell/octeontx2/nic/otx2_common.c | 2 +-
> >>>> drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 ++--
> >>>> drivers/nvme/host/tcp.c | 8 +++----
> >>>> drivers/nvme/target/tcp.c | 22 +++++++++----------
> >>>> drivers/vhost/net.c | 6 ++---
> >>>> include/linux/page_frag_cache.h | 21 +++++++++---------
> >>>> include/linux/skbuff.h | 2 +-
> >>>> kernel/bpf/cpumap.c | 2 +-
> >>>> mm/page_frag_cache.c | 12 +++++-----
> >>>> net/core/skbuff.c | 16 +++++++-------
> >>>> net/core/xdp.c | 2 +-
> >>>> net/rxrpc/txbuf.c | 15 +++++++------
> >>>> net/sunrpc/svcsock.c | 6 ++---
> >>>> .../selftests/mm/page_frag/page_frag_test.c | 13 ++++++-----
> >>>> 19 files changed, 75 insertions(+), 70 deletions(-)
> >>>>
> >>>
> >>> I still say no to this patch. It is an unnecessary name change and adds
> >>> no value. If you insist on this patch I will reject the set every time.
> >>>
> >>> The fact is it is polluting the git history and just makes things
> >>> harder to maintain without adding any value as you aren't changing what
> >>> the function does and there is no need for this. In addition it just
> >>
> >> I guess I have to disagree with the above 'no need for this' part for
> >> now, as mentioned in [1]:
> >>
> >> "There are three types of API as proposed in this patchset instead of
> >> two types of API:
> >> 1. page_frag_alloc_va() returns [va].
> >> 2. page_frag_alloc_pg() returns [page, offset].
> >> 3. page_frag_alloc() returns [va] & [page, offset].
> >>
> >> You seemed to miss that we need a third naming for the type 3 API.
> >> Do you see type 3 API as a valid API? if yes, what naming are you
> >> suggesting for it? if no, why it is not a valid API?"
> >
> > I didn't. I just don't see the point in pushing out the existing API
> > to support that. In reality 2 and 3 are redundant. You probably only
> > need 3. Like I mentioned earlier you can essentially just pass a
>
> If the caller just expects [page, offset], do you expect the caller to
> also call the type 3 API, which returns both [va] and [page, offset]?
>
> I am not sure if I understand why you think 2 and 3 are redundant here?
> If you think 2 and 3 are redundant here, aren't 1 and 3 also redundant
> by a similar argument?
The big difference is the need to return page and offset. Basically to
support returning page and offset you need to pass at least one value
as a pointer so you can store the return there.
The reason why 3 is just a redundant form of 2 is that you will
normally just be converting from a va to a page and offset so the va
should already be easily accessible.
> > page_frag via pointer to the function. With that you could also look
> > at just returning a virtual address as well if you insist on having
> > something that returns all of the above. No point in having 2 and 3 be
> separate functions.
>
> Let's be more specific about what are your suggestion here: which way
> is the prefer way to return the virtual address. It seems there are two
> options:
>
> 1. Return the virtual address by function returning as below:
> void *page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio);
>
> 2. Return the virtual address by double pointer as below:
> int page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio,
> void **va);
I was thinking more of option 1. Basically this is a superset of
page_frag_alloc_va that is also returning the page and offset via a
page frag. However instead of bio_vec I would be good with "struct
page_frag *" being the value passed to the function to play the role
of container. Basically the big difference between 1 and 2/3 if I am
not mistaken is the fact that for 1 you pass the size, whereas with
2/3 you are peeling off the page frag from the larger page frag cache
after the fact via a commit type action.
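As a rough userspace model of what option 1 with a "struct page_frag" container could look like — the names and signature below are a sketch of the suggestion, not an existing API:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal userspace model of a page_frag cache; everything here is
 * illustrative and does not match the in-kernel implementation.
 */
struct frag_cache {
	char *base;		/* start of the current "page" */
	unsigned int size;	/* total size of the page */
	unsigned int offset;	/* first free byte */
};

struct frag {			/* stand-in for struct page_frag */
	char *page;		/* the backing "page" */
	unsigned int offset;	/* offset of this fragment */
	unsigned int size;	/* committed fragment size */
};

/* Option 1 from the discussion: one call returns the va and also
 * fills a page_frag-style container with [page, offset, size], with
 * fragsz passed up front rather than committed after the fact.
 */
static void *frag_alloc(struct frag_cache *nc, struct frag *pfrag,
			unsigned int fragsz)
{
	void *va;

	if (nc->offset + fragsz > nc->size)
		return NULL;	/* a real cache would refill here */

	va = nc->base + nc->offset;
	pfrag->page = nc->base;
	pfrag->offset = nc->offset;
	pfrag->size = fragsz;
	nc->offset += fragsz;

	return va;
}
```

A caller that only wants the va can ignore the container; one that wants [page, offset] reads it from the same call.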
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-19 15:54 ` Alexander Duyck
@ 2024-08-20 13:07 ` Yunsheng Lin
2024-08-20 16:02 ` Alexander Duyck
0 siblings, 1 reply; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-20 13:07 UTC (permalink / raw)
To: Alexander Duyck
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On 2024/8/19 23:54, Alexander Duyck wrote:
...
>>>>
>>>> "There are three types of API as proposed in this patchset instead of
>>>> two types of API:
>>>> 1. page_frag_alloc_va() returns [va].
>>>> 2. page_frag_alloc_pg() returns [page, offset].
>>>> 3. page_frag_alloc() returns [va] & [page, offset].
>>>>
>>>> You seemed to miss that we need a third naming for the type 3 API.
>>>> Do you see type 3 API as a valid API? if yes, what naming are you
>>>> suggesting for it? if no, why it is not a valid API?"
>>>
>>> I didn't. I just don't see the point in pushing out the existing API
>>> to support that. In reality 2 and 3 are redundant. You probably only
>>> need 3. Like I mentioned earlier you can essentially just pass a
>>
>> If the caller just expects [page, offset], do you expect the caller to
>> also call the type 3 API, which returns both [va] and [page, offset]?
>>
>> I am not sure if I understand why you think 2 and 3 are redundant here?
>> If you think 2 and 3 are redundant here, aren't 1 and 3 also redundant
>> by a similar argument?
>
> The big difference is the need to return page and offset. Basically to
> support returning page and offset you need to pass at least one value
> as a pointer so you can store the return there.
>
> The reason why 3 is just a redundant form of 2 is that you will
> normally just be converting from a va to a page and offset so the va
> should already be easily accessible.
I am assuming that by 'easily accessible', you meant the 'va' can be
calculated as below, right?
va = encoded_page_address(encoded_va) +
(page_frag_cache_page_size(encoded_va) - remaining);
I guess it is easily accessible, but it is not without some overhead
to calculate the 'va' here.
>
>>> page_frag via pointer to the function. With that you could also look
>>> at just returning a virtual address as well if you insist on having
>>> something that returns all of the above. No point in having 2 and 3 be
>>> separate functions.
>>
>> Let's be more specific about what are your suggestion here: which way
>> is the prefer way to return the virtual address. It seems there are two
>> options:
>>
>> 1. Return the virtual address by function returning as below:
>> void *page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio);
>>
>> 2. Return the virtual address by double pointer as below:
>> int page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio,
>> void **va);
>
> I was thinking more of option 1. Basically this is a superset of
> page_frag_alloc_va that is also returning the page and offset via a
> page frag. However instead of bio_vec I would be good with "struct
> page_frag *" being the value passed to the function to play the role
> of container. Basically the big difference between 1 and 2/3 if I am
> not mistaken is the fact that for 1 you pass the size, whereas with
> 2/3 you are peeling off the page frag from the larger page frag cache
Let's be clear here: do the callers just expecting [page, offset] also need
to call the type 3 API, which returns both [va] and [page, offset]? And is
it ok to ignore the overhead of calculating the 'va' for those kinds of
callers just because we don't want to do the renaming for an existing
API and can't come up with a good naming for that?
> after the fact via a commit type action.
Just to be clear here, there is no commit type action for some subtypes of
the type 2/3 API.

For example, the type 2 API in this patchset has the below subtypes:

subtype 1: it does not need a commit type action, it just returns
           [page, offset] instead of page_frag_alloc_va() returning [va],
           and it does not return the allocated fragsz back to the caller,
           just as page_frag_alloc_va() does not:
struct page *page_frag_alloc_pg(struct page_frag_cache *nc,
				unsigned int *offset, unsigned int fragsz,
				gfp_t gfp)

subtype 2: it does need a commit type action, and @fragsz is returned to
           the caller, which uses that to decide how much fragsz to commit:
struct page *page_frag_alloc_pg_prepare(struct page_frag_cache *nc,
					unsigned int *offset,
					unsigned int *fragsz, gfp_t gfp)

Do you see subtype 1 as a valid API? If no, why?

If yes, do you also expect the caller to use "struct page_frag *" as the
container? If yes, what is the caller expected to do with the size field in
"struct page_frag *" from an API perspective? Just ignore it?
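To make the difference between the two subtypes concrete, a minimal userspace model (illustrative names only, not the proposed kernel signatures) might look like:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model contrasting the two subtypes of the [page, offset]
 * API; all names here are placeholders.
 */
struct frag_cache {
	char *base;		/* stands in for the cached page */
	unsigned int size;
	unsigned int offset;
};

/* Subtype 1: fragsz is fixed up front, the allocation is committed
 * immediately, and only [page, offset] comes back.
 */
static char *alloc_pg(struct frag_cache *nc, unsigned int *offset,
		      unsigned int fragsz)
{
	if (nc->offset + fragsz > nc->size)
		return NULL;
	*offset = nc->offset;
	nc->offset += fragsz;
	return nc->base;	/* stands in for the struct page */
}

/* Subtype 2: prepare hands out all remaining room, and the caller
 * commits only what it actually used afterwards.
 */
static char *alloc_pg_prepare(struct frag_cache *nc, unsigned int *offset,
			      unsigned int *fragsz)
{
	if (nc->offset >= nc->size)
		return NULL;
	*offset = nc->offset;
	*fragsz = nc->size - nc->offset;
	return nc->base;
}

static void commit(struct frag_cache *nc, unsigned int used)
{
	nc->offset += used;
}
```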
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-20 13:07 ` Yunsheng Lin
@ 2024-08-20 16:02 ` Alexander Duyck
2024-08-21 12:30 ` Yunsheng Lin
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Duyck @ 2024-08-20 16:02 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On Tue, Aug 20, 2024 at 6:07 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/19 23:54, Alexander Duyck wrote:
>
> ...
>
> >>>>
> >>>> "There are three types of API as proposed in this patchset instead of
> >>>> two types of API:
> >>>> 1. page_frag_alloc_va() returns [va].
> >>>> 2. page_frag_alloc_pg() returns [page, offset].
> >>>> 3. page_frag_alloc() returns [va] & [page, offset].
> >>>>
> >>>> You seemed to miss that we need a third naming for the type 3 API.
> >>>> Do you see type 3 API as a valid API? if yes, what naming are you
> >>>> suggesting for it? if no, why it is not a valid API?"
> >>>
> >>> I didn't. I just don't see the point in pushing out the existing API
> >>> to support that. In reality 2 and 3 are redundant. You probably only
> >>> need 3. Like I mentioned earlier you can essentially just pass a
> >>
> >> If the caller just expect [page, offset], do you expect the caller also
> >> type 3 API, which return both [va] and [page, offset]?
> >>
> >> I am not sure if I understand why you think 2 and 3 are redundant here?
> >> If you think 2 and 3 are redundant here, aren't 1 and 3 also redundant
> >> as the similar agrument?
> >
> > The big difference is the need to return page and offset. Basically to
> > support returning page and offset you need to pass at least one value
> > as a pointer so you can store the return there.
> >
> > The reason why 3 is just a redundant form of 2 is that you will
> > normally just be converting from a va to a page and offset so the va
> > should already be easily accessible.
>
> I am assuming that by 'easily accessible', you meant the 'va' can be
> calculated as below, right?
>
> va = encoded_page_address(encoded_va) +
> (page_frag_cache_page_size(encoded_va) - remaining);
>
> I guess it is easily accessible, but it is not without some overhead
> to calculate the 'va' here.
It is just the encoded_page_address + offset that you have to
calculate anyway. So the only bit you actually have to do is 2
instructions, one to mask the encoded_va and then the addition of the
offset that you provided to the page. As it stands those instructions
can easily be slipped in while you are working on converting the va to
a page.
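The "mask plus add" point can be sketched in plain C; the packing below (order stored in the low bits of a page-aligned address) is only an illustration of the idea, not the actual encoded_va layout:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Pretend pages are 4 KiB aligned, so the low 12 bits of a page
 * address are free to carry metadata such as the page order.
 */
#define PAGE_SIZE_	4096UL
#define META_MASK	(PAGE_SIZE_ - 1)

static uintptr_t encode(void *page_addr, unsigned int order)
{
	return (uintptr_t)page_addr | order;
}

static void *encoded_page_address(uintptr_t encoded)
{
	return (void *)(encoded & ~META_MASK);	/* one mask ... */
}

static void *frag_va(uintptr_t encoded, unsigned int offset)
{
	/* ... plus one add recovers the fragment's va */
	return (char *)encoded_page_address(encoded) + offset;
}
```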
> >
> >>> page_frag via pointer to the function. With that you could also look
> >>> at just returning a virtual address as well if you insist on having
> >>> something that returns all of the above. No point in having 2 and 3 be
> >>> seperate functions.
> >>
> >> Let's be more specific about what are your suggestion here: which way
> >> is the prefer way to return the virtual address. It seems there are two
> >> options:
> >>
> >> 1. Return the virtual address by function returning as below:
> >> void *page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio);
> >>
> >> 2. Return the virtual address by double pointer as below:
> >> int page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio,
> >> void **va);
> >
> > I was thinking more of option 1. Basically this is a superset of
> > page_frag_alloc_va that is also returning the page and offset via a
> > page frag. However instead of bio_vec I would be good with "struct
> > page_frag *" being the value passed to the function to play the role
> > of container. Basically the big difference between 1 and 2/3 if I am
> > not mistaken is the fact that for 1 you pass the size, whereas with
> > 2/3 you are peeling off the page frag from the larger page frag cache
>
> Let's be clear here: do the callers just expecting [page, offset] also need
> to call the type 3 API, which returns both [va] and [page, offset]? And is
> it ok to ignore the overhead of calculating the 'va' for those kinds of
> callers just because we don't want to do the renaming for an existing
> API and can't come up with a good naming for that?
>
> > after the fact via a commit type action.
>
> Just be clear here, there is no commit type action for some subtype of
> type 2/3 API.
>
> For example, for type 2 API in this patchset, it has below subtypes:
>
> subtype 1: it does not need a commit type action, it just return
> [page, offset] instead of page_frag_alloc_va() returning [va],
> and it does not return the allocated fragsz back to the caller
> as page_frag_alloc_va() does not too:
> struct page *page_frag_alloc_pg(struct page_frag_cache *nc,
> unsigned int *offset, unsigned int fragsz,
> gfp_t gfp)
>
> subtype 2: it does need a commit type action, and @fragsz is returned to
> the caller and caller used that to commit how much fragsz to
> commit.
> struct page *page_frag_alloc_pg_prepare(struct page_frag_cache *nc,
> unsigned int *offset,
> unsigned int *fragsz, gfp_t gfp)
>
> Do you see subtype 1 as valid API? If no, why?
Not really, it is just a wrapper for page_frag_alloc that is
converting the virtual address to a page and offset. They are the same
data and don't justify the need for two functions. It kind of explains
one of the complaints I had about this code. Supposedly it was
refactoring and combining several different callers into one, but what
it is actually doing is fracturing the code path into 3 different
variants based on little if any actual difference as it is doing
unnecessary optimization.
> If yes, do you also expect the caller to use "struct page_frag *" as the
> container? If yes, what is the caller expected to do with the size field in
> "struct page_frag *" from API perspective? Just ignore it?
It should be populated. You passed a fragsz, so you should populate
the output fragsz so you can get the truesize in the case of network
packets. The removal of the page_frag from the other callers is making
it much harder to review your code anyway. If we keep the page_frag
there it should reduce the amount of change needed when you replace
page_frag with the page_frag_cache.
Honestly this is eating up too much of my time. As I said before this
patch set is too big and it is trying to squeeze in more than it
really should for a single patch set to be reviewable. Going forward
please split up the patch set as I had suggested before and address my
comments. Ideally you would have your first patch just be some
refactor and cleanup to get the "offset" pointer moving in the
direction you want. With that we can at least get half of this set
digested before we start chewing into all this refactor for the
replacement of page_frag with the page_frag_cache.
* Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API
2024-08-20 16:02 ` Alexander Duyck
@ 2024-08-21 12:30 ` Yunsheng Lin
0 siblings, 0 replies; 16+ messages in thread
From: Yunsheng Lin @ 2024-08-21 12:30 UTC (permalink / raw)
To: Alexander Duyck
Cc: davem, kuba, pabeni, netdev, linux-kernel, Subbaraya Sundeep,
Chuck Lever, Sagi Grimberg, Jeroen de Borst, Praveen Kaligineedi,
Shailend Chand, Eric Dumazet, Tony Nguyen, Przemek Kitszel,
Sunil Goutham, Geetha sowjanya, hariprasad, Felix Fietkau,
Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
Christoph Hellwig, Chaitanya Kulkarni, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Andrew Morton, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Howells, Marc Dionne, Jeff Layton, Neil Brown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Shuah Khan, intel-wired-lan, linux-arm-kernel,
linux-mediatek, linux-nvme, kvm, virtualization, linux-mm, bpf,
linux-afs, linux-nfs, linux-kselftest
On 2024/8/21 0:02, Alexander Duyck wrote:
> On Tue, Aug 20, 2024 at 6:07 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/8/19 23:54, Alexander Duyck wrote:
>>
>> ...
>>
>>>>>>
>>>>>> "There are three types of API as proposed in this patchset instead of
>>>>>> two types of API:
>>>>>> 1. page_frag_alloc_va() returns [va].
>>>>>> 2. page_frag_alloc_pg() returns [page, offset].
>>>>>> 3. page_frag_alloc() returns [va] & [page, offset].
>>>>>>
>>>>>> You seemed to miss that we need a third naming for the type 3 API.
>>>>>> Do you see type 3 API as a valid API? if yes, what naming are you
>>>>>> suggesting for it? if no, why it is not a valid API?"
>>>>>
>>>>> I didn't. I just don't see the point in pushing out the existing API
>>>>> to support that. In reality 2 and 3 are redundant. You probably only
>>>>> need 3. Like I mentioned earlier you can essentially just pass a
>>>>
>>>> If the caller just expects [page, offset], do you expect the caller to
>>>> also call the type 3 API, which returns both [va] and [page, offset]?
>>>>
>>>> I am not sure if I understand why you think 2 and 3 are redundant here?
>>>> If you think 2 and 3 are redundant here, aren't 1 and 3 also redundant
>>>> by a similar argument?
>>>
>>> The big difference is the need to return page and offset. Basically to
>>> support returning page and offset you need to pass at least one value
>>> as a pointer so you can store the return there.
>>>
>>> The reason why 3 is just a redundant form of 2 is that you will
>>> normally just be converting from a va to a page and offset so the va
>>> should already be easily accessible.
>>
>> I am assuming that by 'easily accessible', you meant the 'va' can be
>> calculated as below, right?
>>
>> va = encoded_page_address(encoded_va) +
>> (page_frag_cache_page_size(encoded_va) - remaining);
>>
>> I guess it is easily accessible, but it is not without some overhead
>> to calculate the 'va' here.
>
> It is just the encoded_page_address + offset that you have to
> calculate anyway. So the only bit you actually have to do is 2
> instructions, one to mask the encoded_va and then the addition of the
> offset that you provided to the page. As it stands those instruction
> can easily be slipped in while you are working on converting the va to
> a page.
Well, with your suggestions against other optimizations, like avoiding
a check in the fast path and avoiding the virt_to_page() call, the
overhead kind of adds up.

And I am really surprised by your above suggestion of deciding the
API for users according to an internal implementation detail here. As
the overhead of calculating 'va' really depends on the layout of
'struct page_frag_cache', what if we change the implementation and the
overhead of calculating 'va' becomes bigger? Do we expect to change the
API for the callers whenever we change the internal implementation of
page_frag_cache?
>
>
>>>
>>>>> page_frag via pointer to the function. With that you could also look
>>>>> at just returning a virtual address as well if you insist on having
>>>>> something that returns all of the above. No point in having 2 and 3 be
>>>>> separate functions.
>>>>
>>>> Let's be more specific about what are your suggestion here: which way
>>>> is the prefer way to return the virtual address. It seems there are two
>>>> options:
>>>>
>>>> 1. Return the virtual address by function returning as below:
>>>> void *page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio);
>>>>
>>>> 2. Return the virtual address by double pointer as below:
>>>> int page_frag_alloc_bio(struct page_frag_cache *nc, struct bio_vec *bio,
>>>> void **va);
>>>
>>> I was thinking more of option 1. Basically this is a superset of
>>> page_frag_alloc_va that is also returning the page and offset via a
>>> page frag. However instead of bio_vec I would be good with "struct
>>> page_frag *" being the value passed to the function to play the role
>>> of container. Basically the big difference between 1 and 2/3 if I am
>>> not mistaken is the fact that for 1 you pass the size, whereas with
>>> 2/3 you are peeling off the page frag from the larger page frag cache
>>
>> Let's be clear here: do the callers just expecting [page, offset] also need
>> to call the type 3 API, which returns both [va] and [page, offset]? And is
>> it ok to ignore the overhead of calculating the 'va' for those kinds of
>> callers just because we don't want to do the renaming for an existing
>> API and can't come up with a good naming for that?
>>
>>> after the fact via a commit type action.
>>
>> Just be clear here, there is no commit type action for some subtype of
>> type 2/3 API.
>>
>> For example, for type 2 API in this patchset, it has below subtypes:
>>
>> subtype 1: it does not need a commit type action, it just return
>> [page, offset] instead of page_frag_alloc_va() returning [va],
>> and it does not return the allocated fragsz back to the caller
>> as page_frag_alloc_va() does not too:
>> struct page *page_frag_alloc_pg(struct page_frag_cache *nc,
>> unsigned int *offset, unsigned int fragsz,
>> gfp_t gfp)
>>
>> subtype 2: it does need a commit type action, and @fragsz is returned to
>> the caller and caller used that to commit how much fragsz to
>> commit.
>> struct page *page_frag_alloc_pg_prepare(struct page_frag_cache *nc,
>> unsigned int *offset,
>> unsigned int *fragsz, gfp_t gfp)
>>
>> Do you see subtype 1 as valid API? If no, why?
>
> Not really, it is just a wrapper for page_frag_alloc that is
> converting the virtual address to a page and offset. They are the same
> data and don't justify the need for two functions. It kind of explains
I am supposing you meant something like below:
struct page *page_frag_alloc_pg(struct page_frag_cache *nc,
				unsigned int *offset, unsigned int fragsz,
				gfp_t gfp)
{
	struct page *page;
	void *va;

	va = page_frag_alloc_va(nc, fragsz, gfp);
	if (!va)
		return NULL;

	page = virt_to_head_page(va);
	*offset = va - page_to_virt(page);

	return page;
}
If yes, I really think you are caring about maintainability too much,
trading off too much performance here: not only is the offset
recalculated, but virt_to_head_page() is also sometimes called
unnecessarily.

If no, please share the pseudo code in your mind.
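For a single base page, the conversion such a wrapper performs amounts to alignment arithmetic; a userspace sketch (ignoring compound pages, which is why the kernel needs virt_to_head_page()):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE_	4096UL

/* For a single page-aligned base page, virt_to_head_page() and
 * page_to_virt() reduce to rounding the va down to the page boundary;
 * the offset is the remainder.
 */
static void *va_to_page(void *va)
{
	return (void *)((uintptr_t)va & ~(PAGE_SIZE_ - 1));
}

static unsigned int va_to_offset(void *va)
{
	return (unsigned int)((uintptr_t)va & (PAGE_SIZE_ - 1));
}
```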
> one of the complaints I had about this code. Supposedly it was
> refactoring and combining several different callers into one, but what
> it is actually doing is fracturing the code path into 3 different
> variants based on little if any actual difference as it is doing
> unnecessary optimization.
I am supposing the 3 different variants meant the below, right?

1. page_frag_alloc_va() returns [va].
2. page_frag_alloc_pg() returns [page, offset].
3. page_frag_alloc() returns [va] & [page, offset].

And there are 3 other variants for the prepare API too:

4. page_frag_alloc_va_prepare() returns [va].
5. page_frag_alloc_pg_prepare() returns [page, offset].
6. page_frag_alloc_prepare() returns [va] & [page, offset].

Side note: I just found that the '4. page_frag_alloc_va_prepare()' API is
not used/called currently and can be removed in the next revision of this
patchset.

It seems what you really want is for 3 & 2 to be wrappers around 1, and
for 5 & 6 to be wrappers around 4?
If yes, too much performance is traded off here, in my understanding.
Doesn't the introduction of __page_frag_cache_reload() in patch 8
already balance performance and maintainability as much as possible?
>
>> If yes, do you also expect the caller to use "struct page_frag *" as the
>> container? If yes, what is the caller expected to do with the size field in
>> "struct page_frag *" from API perspective? Just ignore it?
>
> It should be populated. You passed a fragsz, so you should populate
> the output fragsz so you can get the truesize in the case of network
> packets. The removal of the page_frag from the other callers is making
> it much harder to review your code anyway. If we keep the page_frag
> there it should reduce the amount of change needed when you replace
> page_frag with the page_frag_cache.
I have not started using page_frag as the container yet, but the above
is something I can probably agree with.
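If I read the suggestion correctly, the caller-facing container would
be filled roughly as below. This is only a hypothetical userspace
sketch: page_frag_model and frag_fill_model are made-up stand-ins for
struct page_frag and the refill/alloc helper, and a malloc'd buffer
models the backing page.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* Stand-in for struct page_frag: all three fields are populated on
 * success, so the caller can compute truesize from pfrag->size. */
struct page_frag_model {
        void *page;             /* models the struct page pointer */
        unsigned int offset;
        unsigned int size;
};

/* Stand-in for a refill/alloc helper that populates the container.
 * Returns 1 on success, 0 if the page has no room for fragsz. */
static int frag_fill_model(struct page_frag_model *pfrag, void *page_va,
                           unsigned int *cache_offset, unsigned int fragsz)
{
        if (*cache_offset + fragsz > MODEL_PAGE_SIZE)
                return 0;

        pfrag->page = page_va;
        pfrag->offset = *cache_offset;
        pfrag->size = fragsz;           /* caller reads this for truesize */
        *cache_offset += fragsz;
        return 1;
}
```

The design point being modeled is that keeping struct page_frag as the
output container means the caller's bookkeeping (page, offset, size)
stays unchanged when page_frag_cache replaces page_frag underneath.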
>
> Honestly this is eating up too much of my time. As I said before this
> patch set is too big and it is trying to squeeze in more than it
> really should for a single patch set to be reviewable. Going forward
> please split up the patch set as I had suggested before and address my
> comments. Ideally you would have your first patch just be some
> refactor and cleanup to get the "offset" pointer moving in the
> direction you want. With that we can at least get half of this set
> digested before we start chewing into all this refactor for the
> replacement of page_frag with the page_frag_cache.
I don't really think breaking this patchset into more patchsets
without a new use case will help speed up the process here; it might
slow it down instead, as the differing ideas about the refactoring and
the new API naming are not going to disappear by splitting the
patchset, and the splitting may make the discussion harder without the
bigger picture and context.
Thread overview: 16+ messages
2024-08-08 12:37 [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API Yunsheng Lin
2024-08-14 15:49 ` Alexander H Duyck
2024-08-15 2:59 ` Yunsheng Lin
2024-08-15 15:00 ` Alexander Duyck
2024-08-16 11:55 ` Yunsheng Lin
2024-08-19 15:54 ` Alexander Duyck
2024-08-20 13:07 ` Yunsheng Lin
2024-08-20 16:02 ` Alexander Duyck
2024-08-21 12:30 ` Yunsheng Lin
2024-08-08 12:37 ` [PATCH net-next v13 12/14] net: replace page_frag with page_frag_cache Yunsheng Lin
2024-08-14 22:01 ` Alexander H Duyck
2024-08-18 14:17 ` Yunsheng Lin
2024-08-13 11:30 ` [PATCH net-next v13 00/14] Replace page_frag with page_frag_cache for sk_page_frag() Yunsheng Lin
2024-08-13 15:11 ` Alexander Duyck
2024-08-14 11:55 ` Yunsheng Lin