Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: PRP with VLAN support - or how to contribute to a Linux network driver
From: Heiko Gerstung @ 2023-11-09  8:08 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev@vger.kernel.org
In-Reply-To: <6ab3289a-2ede-41a3-8c57-b01df37bab7f@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 2026 bytes --]

Am 08.11.23, 16:17 schrieb "Andrew Lunn" <andrew@lunn.ch <mailto:andrew@lunn.ch>>:

>> I would like to discuss if it makes sense to remove the PRP
>> functionality from the HSR driver (which is based on the bridge
>> kernel module AFAICS) and instead implement PRP as a separate module
>> (based on the Bonding driver, which would make more sense for PRP).

> Seems like nobody replied. I don't know PRP or HSR, so i can only make
> general remarks.

Thank you for responding!

> The general policy is that we don't rip something out and replace it
> with new code. We try to improve what already exists to meet the
> demands. This is partially because of backwards compatibility. There
> could be users using the code as is. You cannot break that. Can you
> step by step modify the current code to make use of bonding, and in
> the process show you don't break the current use cases? 

Understood. I am not sure if we can change the hsr driver to gradually use a more bonding-like approach for prp and I believe this might not be required, as long as we can get VLAN support into it. 

> You also need to consider offloading to hardware. The bridge code has infrastructure
> to offload. Does the bond driver? I've no idea about that.

I do not know this either but would expect that the nature of bonding would not require offloading support (I do not see a potential for efficiency/performance improvements here, unlike HSR or PRP). 

>> Hoping for advise what the next steps could be. Happy to discuss
>> this off-list as it may not be of interest for most people.

> You probably want to get together with others who are interested in
> PRP and HSR. linutronix, ti, microchip, etc.

Yes, would love to do that and my hope was that I would find them here. I am not familiar with the "orphaned" status for a kernel module, but I would have expected that one of the mentioned parties interested in PRP/HSR would have adopted the module. 

> Andrew

Again, thanks a lot for your comments and remarks, very useful.

Heiko

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 8165 bytes --]

^ permalink raw reply

* Re: [PATCHv2] selftests: bpf: xskxceiver: ksft_print_msg: fix format type error
From: Magnus Karlsson @ 2023-11-09  8:08 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Anders Roxell, bjorn, magnus.karlsson, maciej.fijalkowski, netdev,
	bpf, linux-kernel
In-Reply-To: <CAEf4Bzbbix1KpCKGhK3dnFK99YNyyQzXHp9RzDtd72x7-c6M3A@mail.gmail.com>

On Wed, 8 Nov 2023 at 18:03, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Nov 8, 2023 at 3:00 AM Anders Roxell <anders.roxell@linaro.org> wrote:
> >
> > Crossbuilding selftests/bpf for architecture arm64, format specifies
> > type error show up like.
> >
> > xskxceiver.c:912:34: error: format specifies type 'int' but the argument
> > has type '__u64' (aka 'unsigned long long') [-Werror,-Wformat]
> >  ksft_print_msg("[%s] expected meta_count [%d], got meta_count [%d]\n",
> >                                                                 ~~
> >                                                                 %llu
> >                 __func__, pkt->pkt_nb, meta->count);
> >                                        ^~~~~~~~~~~
> > xskxceiver.c:929:55: error: format specifies type 'unsigned long long' but
> >  the argument has type 'u64' (aka 'unsigned long') [-Werror,-Wformat]
> >  ksft_print_msg("Frag invalid addr: %llx len: %u\n", addr, len);
> >                                     ~~~~             ^~~~
> >
> > Fixing the issues by casting to (unsigned long long) and changing the
> > specifiers to be %llx, since with u64s it might be %llx or %lx,
> > depending on architecture.
> >
> > Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> > ---
> >  tools/testing/selftests/bpf/xskxceiver.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
> > index 591ca9637b23..1ab9512f5aa2 100644
> > --- a/tools/testing/selftests/bpf/xskxceiver.c
> > +++ b/tools/testing/selftests/bpf/xskxceiver.c
> > @@ -908,8 +908,9 @@ static bool is_metadata_correct(struct pkt *pkt, void *buffer, u64 addr)
> >         struct xdp_info *meta = data - sizeof(struct xdp_info);
> >
> >         if (meta->count != pkt->pkt_nb) {
> > -               ksft_print_msg("[%s] expected meta_count [%d], got meta_count [%d]\n",
> > -                              __func__, pkt->pkt_nb, meta->count);
> > +               ksft_print_msg("[%s] expected meta_count [%d], got meta_count [%llx]\n",
>
> why hex? %llu?

You are correct, it should be %llu in both these cases. The original
%d was incorrect here and good that it was corrected to a 64-bit
unsigned.

>
> > +                              __func__, pkt->pkt_nb,
> > +                              (unsigned long long)meta->count);
> >                 return false;
> >         }
> >
> > @@ -926,11 +927,13 @@ static bool is_frag_valid(struct xsk_umem_info *umem, u64 addr, u32 len, u32 exp
> >
> >         if (addr >= umem->num_frames * umem->frame_size ||
> >             addr + len > umem->num_frames * umem->frame_size) {
> > -               ksft_print_msg("Frag invalid addr: %llx len: %u\n", addr, len);
> > +               ksft_print_msg("Frag invalid addr: %llx len: %u\n",
> > +                              (unsigned long long)addr, len);
> >                 return false;
> >         }
> >         if (!umem->unaligned_mode && addr % umem->frame_size + len > umem->frame_size) {
> > -               ksft_print_msg("Frag crosses frame boundary addr: %llx len: %u\n", addr, len);
> > +               ksft_print_msg("Frag crosses frame boundary addr: %llx len: %u\n",
> > +                              (unsigned long long)addr, len);
> >                 return false;
> >         }
> >
> > @@ -1029,7 +1032,8 @@ static int complete_pkts(struct xsk_socket_info *xsk, int batch_size)
> >                         u64 addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx + rcvd - 1);
> >
> >                         ksft_print_msg("[%s] Too many packets completed\n", __func__);
> > -                       ksft_print_msg("Last completion address: %llx\n", addr);
> > +                       ksft_print_msg("Last completion address: %llx\n",
> > +                                      (unsigned long long)addr);
> >                         return TEST_FAILURE;
> >                 }
> >
> > @@ -1513,8 +1517,9 @@ static int validate_tx_invalid_descs(struct ifobject *ifobject)
> >         }
> >
> >         if (stats.tx_invalid_descs != ifobject->xsk->pkt_stream->nb_pkts / 2) {
> > -               ksft_print_msg("[%s] tx_invalid_descs incorrect. Got [%u] expected [%u]\n",
> > -                              __func__, stats.tx_invalid_descs,
> > +               ksft_print_msg("[%s] tx_invalid_descs incorrect. Got [%llx] expected [%u]\n",
>
> should this be %llu? Or the switch to the hex was intentional?

Same thing here. The original %u should really have been %llu since
the stats is 64-bits. But no reason for the hex here since it is not
an address.

> > +                              __func__,
> > +                              (unsigned long long)stats.tx_invalid_descs,
> >                                ifobject->xsk->pkt_stream->nb_pkts);
> >                 return TEST_FAILURE;
> >         }
> > --
> > 2.42.0
> >
>

^ permalink raw reply

* Re: [PATCH net-next v2 12/21] virtio_net: xsk: tx: support tx
From: Michael S. Tsirkin @ 2023-11-09  8:09 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231107031227.100015-13-xuanzhuo@linux.alibaba.com>

On Tue, Nov 07, 2023 at 11:12:18AM +0800, Xuan Zhuo wrote:
> The driver's tx napi is very important for XSK. It is responsible for
> obtaining data from the XSK queue and sending it out.
> 
> At the beginning, we need to trigger tx napi.
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  drivers/net/virtio/main.c       |  12 +++-
>  drivers/net/virtio/virtio_net.h |   3 +-
>  drivers/net/virtio/xsk.c        | 110 ++++++++++++++++++++++++++++++++
>  drivers/net/virtio/xsk.h        |  13 ++++
>  4 files changed, 136 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> index 6c608b3ce27d..ff6bc764089d 100644
> --- a/drivers/net/virtio/main.c
> +++ b/drivers/net/virtio/main.c
> @@ -2074,6 +2074,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	unsigned int index = vq2txq(sq->vq);
>  	struct netdev_queue *txq;
> +	int busy = 0;
>  	int opaque;
>  	bool done;
>  
> @@ -2086,11 +2087,20 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>  	txq = netdev_get_tx_queue(vi->dev, index);
>  	__netif_tx_lock(txq, raw_smp_processor_id());
>  	virtqueue_disable_cb(sq->vq);
> -	free_old_xmit(sq, true);
> +
> +	if (sq->xsk.pool)
> +		busy |= virtnet_xsk_xmit(sq, sq->xsk.pool, budget);

You use bitwise or on errno values? What's going on here?


> +	else
> +		free_old_xmit(sq, true);
>  
>  	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>  		netif_tx_wake_queue(txq);
>  
> +	if (busy) {
> +		__netif_tx_unlock(txq);
> +		return budget;
> +	}
> +
>  	opaque = virtqueue_enable_cb_prepare(sq->vq);
>  
>  	done = napi_complete_done(napi, 0);
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index 442af4673bf8..1c21af47e13c 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -9,7 +9,8 @@
>  #include <net/xdp_sock_drv.h>
>  
>  #define VIRTIO_XDP_FLAG	BIT(0)
> -#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
> +#define VIRTIO_XSK_FLAG	BIT(1)
> +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)
>  
>  /* RX packet size EWMA. The average packet size is used to determine the packet
>   * buffer size when refilling RX rings. As the entire RX ring may be refilled
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index 8b397787603f..caa448308232 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -4,9 +4,119 @@
>   */
>  
>  #include "virtio_net.h"
> +#include "xsk.h"
>  
>  static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
>  
> +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> +{
> +	sg->dma_address = addr;
> +	sg->length = len;
> +}
> +
> +static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> +{
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct net_device *dev = vi->dev;
> +	int qnum = sq - vi->sq;
> +
> +	/* If it is a raw buffer queue, it does not check whether the status
> +	 * of the queue is stopped when sending. So there is no need to check
> +	 * the situation of the raw buffer queue.
> +	 */
> +	if (virtnet_is_xdp_raw_buffer_queue(vi, qnum))
> +		return;
> +
> +	/* If this sq is not the exclusive queue of the current cpu,
> +	 * then it may be called by start_xmit, so check it running out
> +	 * of space.
> +	 *
> +	 * Stop the queue to avoid getting packets that we are
> +	 * then unable to transmit. Then wait the tx interrupt.
> +	 */
> +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)

what does MAX_SKB_FRAGS have to do with it? And where's 2 coming from?

> +		netif_stop_subqueue(dev, qnum);
> +}
> +
> +static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
> +				struct xsk_buff_pool *pool,
> +				struct xdp_desc *desc)
> +{
> +	struct virtnet_info *vi;
> +	dma_addr_t addr;
> +
> +	vi = sq->vq->vdev->priv;
> +
> +	addr = xsk_buff_raw_get_dma(pool, desc->addr);
> +	xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
> +
> +	sg_init_table(sq->sg, 2);
> +
> +	sg_fill_dma(sq->sg, sq->xsk.hdr_dma_address, vi->hdr_len);
> +	sg_fill_dma(sq->sg + 1, addr, desc->len);
> +
> +	return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
> +				    virtnet_xsk_to_ptr(desc->len), GFP_ATOMIC);
> +}
> +
> +static int virtnet_xsk_xmit_batch(struct virtnet_sq *sq,
> +				  struct xsk_buff_pool *pool,
> +				  unsigned int budget,
> +				  u64 *kicks)
> +{
> +	struct xdp_desc *descs = pool->tx_descs;
> +	u32 nb_pkts, max_pkts, i;
> +	bool kick = false;
> +	int err;
> +
> +	/* Every xsk tx packet needs two desc(virtnet header and packet). So we
> +	 * use sq->vq->num_free / 2 as the limitation.
> +	 */
> +	max_pkts = min_t(u32, budget, sq->vq->num_free / 2);
> +
> +	nb_pkts = xsk_tx_peek_release_desc_batch(pool, max_pkts);
> +	if (!nb_pkts)
> +		return 0;
> +
> +	for (i = 0; i < nb_pkts; i++) {
> +		err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
> +		if (unlikely(err))
> +			break;
> +
> +		kick = true;
> +	}
> +
> +	if (kick && virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> +		(*kicks)++;
> +
> +	return i;
> +}
> +
> +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> +		      int budget)
> +{
> +	u64 bytes = 0, packets = 0, kicks = 0;
> +	int sent;
> +
> +	virtnet_free_old_xmit(sq, true, &bytes, &packets);
> +
> +	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> +
> +	virtnet_xsk_check_queue(sq);
> +
> +	u64_stats_update_begin(&sq->stats.syncp);
> +	u64_stats_add(&sq->stats.packets, packets);
> +	u64_stats_add(&sq->stats.bytes, bytes);
> +	u64_stats_add(&sq->stats.kicks, kicks);
> +	u64_stats_add(&sq->stats.xdp_tx,  sent);
> +	u64_stats_update_end(&sq->stats.syncp);
> +
> +	if (xsk_uses_need_wakeup(pool))
> +		xsk_set_tx_need_wakeup(pool);
> +
> +	return sent == budget;
> +}
> +
>  static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *rq,
>  				    struct xsk_buff_pool *pool)
>  {
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> index 1918285c310c..73ca8cd5308b 100644
> --- a/drivers/net/virtio/xsk.h
> +++ b/drivers/net/virtio/xsk.h
> @@ -3,5 +3,18 @@
>  #ifndef __XSK_H__
>  #define __XSK_H__
>  
> +#define VIRTIO_XSK_FLAG_OFFSET	4
> +
> +static inline void *virtnet_xsk_to_ptr(u32 len)
> +{
> +	unsigned long p;
> +
> +	p = len << VIRTIO_XSK_FLAG_OFFSET;
> +
> +	return (void *)(p | VIRTIO_XSK_FLAG);
> +}
> +
>  int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> +		      int budget);
>  #endif
> -- 
> 2.32.0.3.g01195cf9f


^ permalink raw reply

* Re: [PATCH net-next 00/15] net: page_pool: add netlink-based introspection
From: Ilias Apalodimas @ 2023-11-09  8:11 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, almasrymina, hawk
In-Reply-To: <20231024160220.3973311-1-kuba@kernel.org>

Hi Jakub,

On Tue, 24 Oct 2023 at 19:02, Jakub Kicinski <kuba@kernel.org> wrote:
>
> This is a new revision of the RFC posted in August:
> https://lore.kernel.org/all/20230816234303.3786178-1-kuba@kernel.org/
> There's been a handful of fixes and tweaks but the overall
> architecture is unchanged.
>
> As a reminder the RFC was posted as the first step towards
> an API which could configure the page pools (GET API as a stepping
> stone for a SET API to come later). I wasn't sure whether we should
> commit to the GET API before the SET takes shape, hence the large
> delay between versions.
>
> Unfortunately, real deployment experience made this series much more
> urgent. We recently started to deploy newer kernels / drivers
> at Meta, making significant use of page pools for the first time.

That's nice and scary at the same time!

> We immediately run into page pool leaks both real and false positive
> warnings. As Eric pointed out/predicted there's no guarantee that
> applications will read / close their sockets so a page pool page
> may be stuck in a socket (but not leaked) forever. This happens
> a lot in our fleet. Most of these are obviously due to application
> bugs but we should not be printing kernel warnings due to minor
> application resource leaks.

Fair enough, I guess you mean 'continuous warnings'?

>
> Conversely the page pool memory may get leaked at runtime, and
> we have no way to detect / track that, unless someone reconfigures
> the NIC and destroys the page pools which leaked the pages.
>
> The solution presented here is to expose the memory use of page
> pools via netlink. This allows for continuous monitoring of memory
> used by page pools, regardless if they were destroyed or not.
> Sample in patch 15 can print the memory use and recycling
> efficiency:
>
> $ ./page-pool
>     eth0[2]     page pools: 10 (zombies: 0)
>                 refs: 41984 bytes: 171966464 (refs: 0 bytes: 0)
>                 recycling: 90.3% (alloc: 656:397681 recycle: 89652:270201)
>

That's reasonable, and the recycling rate is pretty impressive.  Any
idea how that translated to enhancements overall? mem/cpu pressure etc

Thanks
/Ilias

> The main change compared to the RFC is that the API now exposes
> outstanding references and byte counts even for "live" page pools.
> The warning is no longer printed if page pool is accessible via netlink.
>
> Jakub Kicinski (15):
>   net: page_pool: split the page_pool_params into fast and slow
>   net: page_pool: avoid touching slow on the fastpath
>   net: page_pool: factor out uninit
>   net: page_pool: id the page pools
>   net: page_pool: record pools per netdev
>   net: page_pool: stash the NAPI ID for easier access
>   eth: link netdev to page_pools in drivers
>   net: page_pool: add nlspec for basic access to page pools
>   net: page_pool: implement GET in the netlink API
>   net: page_pool: add netlink notifications for state changes
>   net: page_pool: report amount of memory held by page pools
>   net: page_pool: report when page pool was destroyed
>   net: page_pool: expose page pool stats via netlink
>   net: page_pool: mute the periodic warning for visible page pools
>   tools: ynl: add sample for getting page-pool information
>
>  Documentation/netlink/specs/netdev.yaml       | 161 +++++++
>  Documentation/networking/page_pool.rst        |  10 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   1 +
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |   1 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
>  include/linux/list.h                          |  20 +
>  include/linux/netdevice.h                     |   4 +
>  include/linux/poison.h                        |   2 +
>  include/net/page_pool/helpers.h               |   8 +-
>  include/net/page_pool/types.h                 |  43 +-
>  include/uapi/linux/netdev.h                   |  36 ++
>  net/core/Makefile                             |   2 +-
>  net/core/netdev-genl-gen.c                    |  52 +++
>  net/core/netdev-genl-gen.h                    |  11 +
>  net/core/page_pool.c                          |  78 ++--
>  net/core/page_pool_priv.h                     |  12 +
>  net/core/page_pool_user.c                     | 414 +++++++++++++++++
>  tools/include/uapi/linux/netdev.h             |  36 ++
>  tools/net/ynl/generated/netdev-user.c         | 419 ++++++++++++++++++
>  tools/net/ynl/generated/netdev-user.h         | 171 +++++++
>  tools/net/ynl/lib/ynl.h                       |   2 +-
>  tools/net/ynl/samples/.gitignore              |   1 +
>  tools/net/ynl/samples/Makefile                |   2 +-
>  tools/net/ynl/samples/page-pool.c             | 147 ++++++
>  24 files changed, 1586 insertions(+), 48 deletions(-)
>  create mode 100644 net/core/page_pool_priv.h
>  create mode 100644 net/core/page_pool_user.c
>  create mode 100644 tools/net/ynl/samples/page-pool.c
>
> --
> 2.41.0
>

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] virtio_net: xsk: rx: introduce add_recvbuf_xsk()
From: Michael S. Tsirkin @ 2023-11-09  8:12 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231107031227.100015-17-xuanzhuo@linux.alibaba.com>

On Tue, Nov 07, 2023 at 11:12:22AM +0800, Xuan Zhuo wrote:
> Implement the logic of filling rq with XSK buffers.
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  drivers/net/virtio/main.c       |  4 ++-
>  drivers/net/virtio/virtio_net.h |  5 ++++
>  drivers/net/virtio/xsk.c        | 49 ++++++++++++++++++++++++++++++++-
>  drivers/net/virtio/xsk.h        |  2 ++
>  4 files changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> index 6210a6e37396..15943a22e17d 100644
> --- a/drivers/net/virtio/main.c
> +++ b/drivers/net/virtio/main.c
> @@ -1798,7 +1798,9 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
>  	bool oom;
>  
>  	do {
> -		if (vi->mergeable_rx_bufs)
> +		if (rq->xsk.pool)
> +			err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk.pool, gfp);
> +		else if (vi->mergeable_rx_bufs)
>  			err = add_recvbuf_mergeable(vi, rq, gfp);
>  		else if (vi->big_packets)
>  			err = add_recvbuf_big(vi, rq, gfp);

I'm not sure I understand. How does this handle mergeable flag still being set?


> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index a13d6d301fdb..1242785e311e 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -140,6 +140,11 @@ struct virtnet_rq {
>  
>  		/* xdp rxq used by xsk */
>  		struct xdp_rxq_info xdp_rxq;
> +
> +		struct xdp_buff **xsk_buffs;
> +		u32 nxt_idx;
> +		u32 num;
> +		u32 size;
>  	} xsk;
>  };
>  
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index ea5804ddd44e..e737c3353212 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -38,6 +38,41 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
>  		netif_stop_subqueue(dev, qnum);
>  }
>  
> +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> +			    struct xsk_buff_pool *pool, gfp_t gfp)
> +{
> +	struct xdp_buff **xsk_buffs;
> +	dma_addr_t addr;
> +	u32 len, i;
> +	int err = 0;
> +
> +	xsk_buffs = rq->xsk.xsk_buffs;
> +
> +	if (rq->xsk.nxt_idx >= rq->xsk.num) {
> +		rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->xsk.size);
> +		if (!rq->xsk.num)
> +			return -ENOMEM;
> +		rq->xsk.nxt_idx = 0;
> +	}

Another manually rolled linked list implementation.
Please, don't.


> +
> +	i = rq->xsk.nxt_idx;
> +
> +	/* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space */
> +	addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len;
> +	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> +
> +	sg_init_table(rq->sg, 1);
> +	sg_fill_dma(rq->sg, addr, len);
> +
> +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> +	if (err)
> +		return err;
> +
> +	rq->xsk.nxt_idx++;
> +
> +	return 0;
> +}
> +
>  static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
>  				struct xsk_buff_pool *pool,
>  				struct xdp_desc *desc)
> @@ -213,7 +248,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
>  	struct virtnet_sq *sq;
>  	struct device *dma_dev;
>  	dma_addr_t hdr_dma;
> -	int err;
> +	int err, size;
>  
>  	/* In big_packets mode, xdp cannot work, so there is no need to
>  	 * initialize xsk of rq.
> @@ -249,6 +284,16 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
>  	if (!dma_dev)
>  		return -EPERM;
>  
> +	size = virtqueue_get_vring_size(rq->vq);
> +
> +	rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), GFP_KERNEL);
> +	if (!rq->xsk.xsk_buffs)
> +		return -ENOMEM;
> +
> +	rq->xsk.size = size;
> +	rq->xsk.nxt_idx = 0;
> +	rq->xsk.num = 0;
> +
>  	hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE);
>  	if (dma_mapping_error(dma_dev, hdr_dma))
>  		return -ENOMEM;
> @@ -307,6 +352,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
>  
>  	dma_unmap_single(dma_dev, sq->xsk.hdr_dma_address, vi->hdr_len, DMA_TO_DEVICE);
>  
> +	kfree(rq->xsk.xsk_buffs);
> +
>  	return err1 | err2;
>  }
>  
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> index 7ebc9bda7aee..bef41a3f954e 100644
> --- a/drivers/net/virtio/xsk.h
> +++ b/drivers/net/virtio/xsk.h
> @@ -23,4 +23,6 @@ int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
>  bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
>  		      int budget);
>  int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> +			    struct xsk_buff_pool *pool, gfp_t gfp);
>  #endif
> -- 
> 2.32.0.3.g01195cf9f


^ permalink raw reply

* Re: [PATCH net-next 01/15] net: page_pool: split the page_pool_params into fast and slow
From: Ilias Apalodimas @ 2023-11-09  8:13 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, almasrymina, hawk
In-Reply-To: <20231024160220.3973311-2-kuba@kernel.org>

On Tue, 24 Oct 2023 at 19:02, Jakub Kicinski <kuba@kernel.org> wrote:
>
> struct page_pool is rather performance critical and we use
> 16B of the first cache line to store 2 pointers used only
> by test code. Future patches will add more informational
> (non-fast path) attributes.
>
> It's convenient for the user of the API to not have to worry
> which fields are fast and which are slow path. Use struct
> groups to split the params into the two categories internally.
>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/page_pool/types.h | 31 +++++++++++++++++++------------
>  net/core/page_pool.c          |  7 ++++---
>  2 files changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 6fc5134095ed..23950fcc4eca 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -54,18 +54,22 @@ struct pp_alloc_cache {
>   * @offset:    DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
>   */
>  struct page_pool_params {
> -       unsigned int    flags;
> -       unsigned int    order;
> -       unsigned int    pool_size;
> -       int             nid;
> -       struct device   *dev;
> -       struct napi_struct *napi;
> -       enum dma_data_direction dma_dir;
> -       unsigned int    max_len;
> -       unsigned int    offset;
> +       struct_group_tagged(page_pool_params_fast, fast,
> +               unsigned int    flags;
> +               unsigned int    order;
> +               unsigned int    pool_size;
> +               int             nid;
> +               struct device   *dev;
> +               struct napi_struct *napi;
> +               enum dma_data_direction dma_dir;
> +               unsigned int    max_len;
> +               unsigned int    offset;
> +       );
> +       struct_group_tagged(page_pool_params_slow, slow,
>  /* private: used by test code only */
> -       void (*init_callback)(struct page *page, void *arg);
> -       void *init_arg;
> +               void (*init_callback)(struct page *page, void *arg);
> +               void *init_arg;
> +       );
>  };
>
>  #ifdef CONFIG_PAGE_POOL_STATS
> @@ -119,7 +123,7 @@ struct page_pool_stats {
>  #endif
>
>  struct page_pool {
> -       struct page_pool_params p;
> +       struct page_pool_params_fast p;
>
>         long frag_users;
>         struct page *frag_page;
> @@ -178,6 +182,9 @@ struct page_pool {
>         refcount_t user_cnt;
>
>         u64 destroy_cnt;
> +
> +       /* Slow/Control-path information follows */
> +       struct page_pool_params_slow slow;
>  };
>
>  struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 5e409b98aba0..5cae413de7cc 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -173,7 +173,8 @@ static int page_pool_init(struct page_pool *pool,
>  {
>         unsigned int ring_qsize = 1024; /* Default */
>
> -       memcpy(&pool->p, params, sizeof(pool->p));
> +       memcpy(&pool->p, &params->fast, sizeof(pool->p));
> +       memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
>
>         /* Validate only known flags were used */
>         if (pool->p.flags & ~(PP_FLAG_ALL))
> @@ -384,8 +385,8 @@ static void page_pool_set_pp_info(struct page_pool *pool,
>          * the overhead is negligible.
>          */
>         page_pool_fragment_page(page, 1);
> -       if (pool->p.init_callback)
> -               pool->p.init_callback(page, pool->p.init_arg);
> +       if (pool->slow.init_callback)
> +               pool->slow.init_callback(page, pool->slow.init_arg);
>  }
>
>  static void page_pool_clear_pp_info(struct page *page)
> --
> 2.41.0
>

Had time for a close look, feel free to replace my ack with
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

^ permalink raw reply

* Re: [PATCH net-next v2 17/21] virtio_net: xsk: rx: skip dma unmap when rq is bind with AF_XDP
From: Michael S. Tsirkin @ 2023-11-09  8:15 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231107031227.100015-18-xuanzhuo@linux.alibaba.com>

On Tue, Nov 07, 2023 at 11:12:23AM +0800, Xuan Zhuo wrote:
> When rq is bound with AF_XDP, the buffer dma is managed
> by the AF_XDP APIs. So the buffer got from the virtio core should
> skip the dma unmap operation.
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>


I don't get it - is this like a bugfix?
And why do we need our own flag and checks?
Doesn't virtio core DTRT?

> ---
>  drivers/net/virtio/main.c       | 8 +++++---
>  drivers/net/virtio/virtio_net.h | 3 +++
>  drivers/net/virtio/xsk.c        | 1 +
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> index 15943a22e17d..a318b2533b94 100644
> --- a/drivers/net/virtio/main.c
> +++ b/drivers/net/virtio/main.c
> @@ -430,7 +430,7 @@ static void *virtnet_rq_get_buf(struct virtnet_rq *rq, u32 *len, void **ctx)
>  	void *buf;
>  
>  	buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
> -	if (buf && rq->do_dma)
> +	if (buf && rq->do_dma_unmap)
>  		virtnet_rq_unmap(rq, buf, *len);
>  
>  	return buf;
> @@ -561,8 +561,10 @@ static void virtnet_set_premapped(struct virtnet_info *vi)
>  
>  		/* disable for big mode */
>  		if (vi->mergeable_rx_bufs || !vi->big_packets) {
> -			if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
> +			if (!virtqueue_set_dma_premapped(vi->rq[i].vq)) {
>  				vi->rq[i].do_dma = true;
> +				vi->rq[i].do_dma_unmap = true;
> +			}
>  		}
>  	}
>  }
> @@ -3944,7 +3946,7 @@ void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
>  
>  	rq = &vi->rq[i];
>  
> -	if (rq->do_dma)
> +	if (rq->do_dma_unmap)
>  		virtnet_rq_unmap(rq, buf, 0);
>  
>  	virtnet_rq_free_buf(vi, rq, buf);
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index 1242785e311e..2005d0cd22e2 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -135,6 +135,9 @@ struct virtnet_rq {
>  	/* Do dma by self */
>  	bool do_dma;
>  
> +	/* Do dma unmap after getting buf from virtio core. */
> +	bool do_dma_unmap;
> +
>  	struct {
>  		struct xsk_buff_pool *pool;
>  
> diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> index e737c3353212..b09c473c29fb 100644
> --- a/drivers/net/virtio/xsk.c
> +++ b/drivers/net/virtio/xsk.c
> @@ -210,6 +210,7 @@ static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *
>  		xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
>  
>  	rq->xsk.pool = pool;
> +	rq->do_dma_unmap = !pool;
>  
>  	virtnet_rx_resume(vi, rq);
>  
> -- 
> 2.32.0.3.g01195cf9f


^ permalink raw reply

* Re: [PATCH net-next v2 00/21] virtio-net: support AF_XDP zero copy
From: Michael S. Tsirkin @ 2023-11-09  8:19 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231107031227.100015-1-xuanzhuo@linux.alibaba.com>

On Tue, Nov 07, 2023 at 11:12:06AM +0800, Xuan Zhuo wrote:
> ## AF_XDP
> 
> XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero
> copy feature of xsk (XDP socket) needs to be supported by the driver. The
> performance of zero copy is very good. mlx5 and intel ixgbe already support
> this feature, This patch set allows virtio-net to support xsk's zerocopy xmit
> feature.
> 
> At present, we have completed some preparation:
> 
> 1. vq-reset (virtio spec and kernel code)
> 2. virtio-core premapped dma
> 3. virtio-net xdp refactor
> 
> So it is time for Virtio-Net to complete the support for the XDP Socket
> Zerocopy.
> 
> Virtio-net can not increase the queue num at will, so xsk shares the queue with
> kernel.
> 
> On the other hand, Virtio-Net does not support generate interrupt from driver
> manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX
> NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If it
> is also the local CPU, then we wake up napi directly.
> 
> This patch set includes some refactor to the virtio-net to let that to support
> AF_XDP.
> 
> ## performance
> 
> ENV: Qemu with vhost-user(polling mode).
> Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
> 
> ### virtio PMD in guest with testpmd
> 
> testpmd> show port stats all
> 
>  ######################## NIC statistics for port 0 ########################
>  RX-packets: 19531092064 RX-missed: 0     RX-bytes: 1093741155584
>  RX-errors: 0
>  RX-nombuf: 0
>  TX-packets: 5959955552 TX-errors: 0     TX-bytes: 371030645664
> 
> 
>  Throughput (since last show)
>  Rx-pps:   8861574     Rx-bps:  3969985208
>  Tx-pps:   8861493     Tx-bps:  3969962736
>  ############################################################################
> 
> ### AF_XDP PMD in guest with testpmd
> 
> testpmd> show port stats all
> 
>   ######################## NIC statistics for port 0  ########################
>   RX-packets: 68152727   RX-missed: 0          RX-bytes:  3816552712
>   RX-errors: 0
>   RX-nombuf:  0
>   TX-packets: 68114967   TX-errors: 33216      TX-bytes:  3814438152
> 
>   Throughput (since last show)
>   Rx-pps:      6333196          Rx-bps:   2837272088
>   Tx-pps:      6333227          Tx-bps:   2837285936
>   ############################################################################
> 
> But AF_XDP consumes more CPU for tx and rx napi(100% and 86%).
> 
> ## maintain
> 
> I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support in
> virtio-net.
> 
> Please review.
> 
> Thanks.
> 
> v2
>     1. wakeup uses the way of GVE. No send ipi to wakeup napi on remote cpu.
>     2. remove rcu. Because we synchronize all operat, so the rcu is not needed.
>     3. split the commit "move to virtio_net.h" in last patch set. Just move the
>        struct/api to header when we use them.
>     4. add comments for some code
> 
> v1:
>     1. remove two virtio commits. Push this patchset to net-next
>     2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: support tx
>     3. fix some warnings
> 
> 
> Xuan Zhuo (21):
>   virtio_net: rename free_old_xmit_skbs to free_old_xmit
>   virtio_net: unify the code for recycling the xmit ptr
>   virtio_net: independent directory
>   virtio_net: move core structures to virtio_net.h
>   virtio_net: add prefix virtnet to all struct inside virtio_net.h
>   virtio_net: separate virtnet_rx_resize()
>   virtio_net: separate virtnet_tx_resize()
>   virtio_net: sq support premapped mode
>   virtio_net: xsk: bind/unbind xsk
>   virtio_net: xsk: prevent disable tx napi
>   virtio_net: move some api to header
>   virtio_net: xsk: tx: support tx
>   virtio_net: xsk: tx: support wakeup
>   virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
>   virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer
>   virtio_net: xsk: rx: introduce add_recvbuf_xsk()
>   virtio_net: xsk: rx: skip dma unmap when rq is bind with AF_XDP
>   virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer
>   virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer
>   virtio_net: update tx timeout record
>   virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
> 
>  MAINTAINERS                                 |   2 +-
>  drivers/net/Kconfig                         |   8 +-
>  drivers/net/Makefile                        |   2 +-
>  drivers/net/virtio/Kconfig                  |  13 +
>  drivers/net/virtio/Makefile                 |   8 +
>  drivers/net/{virtio_net.c => virtio/main.c} | 630 +++++++++-----------
>  drivers/net/virtio/virtio_net.h             | 346 +++++++++++
>  drivers/net/virtio/xsk.c                    | 498 ++++++++++++++++
>  drivers/net/virtio/xsk.h                    |  32 +
>  9 files changed, 1174 insertions(+), 365 deletions(-)
>  create mode 100644 drivers/net/virtio/Kconfig
>  create mode 100644 drivers/net/virtio/Makefile
>  rename drivers/net/{virtio_net.c => virtio/main.c} (92%)
>  create mode 100644 drivers/net/virtio/virtio_net.h
>  create mode 100644 drivers/net/virtio/xsk.c
>  create mode 100644 drivers/net/virtio/xsk.h


Too much code duplications for my taste. Would there maybe be less if
everything was in a single file?


> --
> 2.32.0.3.g01195cf9f


^ permalink raw reply

* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: Paolo Abeni @ 2023-11-09  8:29 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-5-almasrymina@google.com>

On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> +		       struct netdev_dmabuf_binding **out)
> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	int err;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);
> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;

I'm trying to wrap my head around the whole infra... the above line is
confusing. Why do you increment dma_addr? it will be re-initialized in
the next iteration.

Cheers,

Paolo


^ permalink raw reply

* Re: [PATCH net-next v2 4/5] virtio-net: support rx netdim
From: Heng Qi @ 2023-11-09  8:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, virtualization, Michael S. Tsirkin, Eric Dumazet,
	Paolo Abeni, David S. Miller, Jesper Dangaard Brouer,
	John Fastabend, Alexei Starovoitov, Simon Horman, Jakub Kicinski,
	Liu, Yujie
In-Reply-To: <CACGkMEunoNEA1KFKmH+RhHkFreMgAEa-NgL+XVNROi8qvL42=A@mail.gmail.com>



在 2023/11/9 下午12:43, Jason Wang 写道:
> On Thu, Nov 2, 2023 at 9:10 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>> By comparing the traffic information in the complete napi processes,
>> let the virtio-net driver automatically adjust the coalescing
>> moderation parameters of each receive queue.
>>
>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>> ---
>> v1->v2:
>> - Improved the judgment of dim switch conditions.
>> - Fix the safe problem of work thread.
> Nit: it's better to be more concrete here, e.g what kind of "safe
> problem" it is.

Maybe it shouldn't be called a "safe problem" because it won't be 
destructive.
When the specific rxq is reset, the device will respond with errno when 
the work
tries to modifiy parameters for the rxq.

>
>
>>   drivers/net/virtio_net.c | 187 ++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 165 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 69fe09e99b3c..5473aa1ee5cd 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/average.h>
>>   #include <linux/filter.h>
>>   #include <linux/kernel.h>
>> +#include <linux/dim.h>
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>   #include <net/net_failover.h>
>> @@ -172,6 +173,17 @@ struct receive_queue {
>>
>>          struct virtnet_rq_stats stats;
>>
>> +       /* The number of rx notifications */
>> +       u16 calls;
>> +
>> +       /* Is dynamic interrupt moderation enabled? */
>> +       bool dim_enabled;
>> +
>> +       /* Dynamic Interrupt Moderation */
>> +       struct dim dim;
>> +
>> +       u32 packets_in_napi;
>> +
>>          struct virtnet_interrupt_coalesce intr_coal;
>>
>>          /* Chain pages by the private ptr. */
>> @@ -305,6 +317,9 @@ struct virtnet_info {
>>          u8 duplex;
>>          u32 speed;
>>
>> +       /* Is rx dynamic interrupt moderation enabled? */
>> +       bool rx_dim_enabled;
>> +
>>          /* Interrupt coalescing settings */
>>          struct virtnet_interrupt_coalesce intr_coal_tx;
>>          struct virtnet_interrupt_coalesce intr_coal_rx;
>> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
>>          struct virtnet_info *vi = rvq->vdev->priv;
>>          struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
>>
>> +       rq->calls++;
>>          virtqueue_napi_schedule(&rq->napi, rvq);
>>   }
>>
>> @@ -2141,6 +2157,26 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>          }
>>   }
>>
>> +static void virtnet_rx_dim_work(struct work_struct *work);
>> +
>> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct receive_queue *rq)
>> +{
>> +       struct dim_sample cur_sample = {};
>> +
>> +       if (!rq->packets_in_napi)
>> +               return;
>> +
>> +       u64_stats_update_begin(&rq->stats.syncp);
>> +       dim_update_sample(rq->calls,
>> +                         u64_stats_read(&rq->stats.packets),
>> +                         u64_stats_read(&rq->stats.bytes),
>> +                         &cur_sample);
>> +       u64_stats_update_end(&rq->stats.syncp);
>> +
>> +       net_dim(&rq->dim, cur_sample);
>> +       rq->packets_in_napi = 0;
>> +}
>> +
>>   static int virtnet_poll(struct napi_struct *napi, int budget)
>>   {
>>          struct receive_queue *rq =
>> @@ -2149,17 +2185,22 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>>          struct send_queue *sq;
>>          unsigned int received;
>>          unsigned int xdp_xmit = 0;
>> +       bool napi_complete;
>>
>>          virtnet_poll_cleantx(rq);
>>
>>          received = virtnet_receive(rq, budget, &xdp_xmit);
>> +       rq->packets_in_napi += received;
>>
>>          if (xdp_xmit & VIRTIO_XDP_REDIR)
>>                  xdp_do_flush();
>>
>>          /* Out of packets? */
>> -       if (received < budget)
>> -               virtqueue_napi_complete(napi, rq->vq, received);
>> +       if (received < budget) {
>> +               napi_complete = virtqueue_napi_complete(napi, rq->vq, received);
>> +               if (napi_complete && rq->dim_enabled)
>> +                       virtnet_rx_dim_update(vi, rq);
>> +       }
>>
>>          if (xdp_xmit & VIRTIO_XDP_TX) {
>>                  sq = virtnet_xdp_get_sq(vi);
>> @@ -2179,6 +2220,7 @@ static void virtnet_disable_queue_pair(struct virtnet_info *vi, int qp_index)
>>          virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
>>          napi_disable(&vi->rq[qp_index].napi);
>>          xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
>> +       cancel_work_sync(&vi->rq[qp_index].dim.work);
>>   }
>>
>>   static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>> @@ -2196,6 +2238,9 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>>          if (err < 0)
>>                  goto err_xdp_reg_mem_model;
>>
>> +       INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);
> Any reason we need to do the INIT_WORK here but not probe?

This can initialize the work when the device status changes (open, restore).
At this time, rxq is ready and once it's napi is enabled, the work can 
start working.

>
>> +       vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
>> +
>>          virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
>>          virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
>>
>> @@ -2393,8 +2438,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
>>
>>          qindex = rq - vi->rq;
>>
>> -       if (running)
>> +       if (running) {
>>                  napi_disable(&rq->napi);
>> +               cancel_work_sync(&rq->dim.work);
>> +       }
>>
>>          err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_free_unused_buf);
>>          if (err)
>> @@ -2403,8 +2450,10 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
>>          if (!try_fill_recv(vi, rq, GFP_KERNEL))
>>                  schedule_delayed_work(&vi->refill, 0);
>>
>> -       if (running)
>> +       if (running) {
>> +               INIT_WORK(&rq->dim.work, virtnet_rx_dim_work);
>>                  virtnet_napi_enable(rq->vq, &rq->napi);
>> +       }
>>          return err;
>>   }
>>
>> @@ -3341,24 +3390,51 @@ static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
>>   static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
>>                                            struct ethtool_coalesce *ec)
>>   {
>> +       bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
>>          struct scatterlist sgs_rx;
>> +       bool switch_dim, update;
>>          int i;
>>
>> -       vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
>> -       vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
>> -       sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
>> -
>> -       if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
>> -                                 VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
>> -                                 &sgs_rx))
>> -               return -EINVAL;
>> +       switch_dim = rx_ctrl_dim_on != vi->rx_dim_enabled;
>> +       if (switch_dim) {
>> +               if (rx_ctrl_dim_on) {
>> +                       if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL)) {
>> +                               vi->rx_dim_enabled = true;
>> +                               for (i = 0; i < vi->max_queue_pairs; i++)
>> +                                       vi->rq[i].dim_enabled = true;
>> +                       } else {
>> +                               return -EOPNOTSUPP;
>> +                       }
>> +               } else {
>> +                       vi->rx_dim_enabled = false;
>> +                       for (i = 0; i < vi->max_queue_pairs; i++)
>> +                               vi->rq[i].dim_enabled = false;
>> +               }
>> +       } else {
>> +               if (ec->rx_coalesce_usecs != vi->intr_coal_rx.max_usecs ||
>> +                   ec->rx_max_coalesced_frames != vi->intr_coal_rx.max_packets)
>> +                       update = true;
>>
>> -       /* Save parameters */
>> -       vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
>> -       vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
>> -       for (i = 0; i < vi->max_queue_pairs; i++) {
>> -               vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
>> -               vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
>> +               if (vi->rx_dim_enabled) {
>> +                       if (update)
>> +                               return -EINVAL;
> update could be used without initialization?

It defaults to false, but I will initialize it explicitly in the next 
version.

>
> Btw under what condition could we reach here?

If the current dim status is on, and this condition is triggered when 
the user
tries to modify parameters.

Thanks!

>
> Thanks
>
>> +               } else {
>> +                       vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
>> +                       vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
>> +                       sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
>> +
>> +                       if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
>> +                                                 VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
>> +                                                 &sgs_rx))
>> +                               return -EINVAL;
>> +
>> +                       vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
>> +                       vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
>> +                       for (i = 0; i < vi->max_queue_pairs; i++) {
>> +                               vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
>> +                               vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
>> +                       }
>> +               }
>>          }
>>
>>          return 0;
>> @@ -3380,15 +3456,54 @@ static int virtnet_send_notf_coal_cmds(struct virtnet_info *vi,
>>          return 0;
>>   }
>>
>> +static int virtnet_send_rx_notf_coal_vq_cmds(struct virtnet_info *vi,
>> +                                            struct ethtool_coalesce *ec,
>> +                                            u16 queue)
>> +{
>> +       bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
>> +       bool cur_rx_dim = vi->rq[queue].dim_enabled;
>> +       u32 max_usecs, max_packets;
>> +       bool switch_dim, update;
>> +       int err;
>> +
>> +       switch_dim = rx_ctrl_dim_on != cur_rx_dim;
>> +       if (switch_dim) {
>> +               if (rx_ctrl_dim_on)
>> +                       vi->rq[queue].dim_enabled = true;
>> +               else
>> +                       vi->rq[queue].dim_enabled = false;
>> +       } else {
>> +               max_usecs = vi->rq[queue].intr_coal.max_usecs;
>> +               max_packets = vi->rq[queue].intr_coal.max_packets;
>> +               if (ec->rx_coalesce_usecs != max_usecs ||
>> +                   ec->rx_max_coalesced_frames != max_packets)
>> +                       update = true;
>> +
>> +               if (cur_rx_dim) {
>> +                       if (update)
>> +                               return -EINVAL;
>> +               } else {
>> +                       if (!update)
>> +                               return 0;
>> +
>> +                       err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
>> +                                                              ec->rx_coalesce_usecs,
>> +                                                              ec->rx_max_coalesced_frames);
>> +                       if (err)
>> +                               return err;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>   static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>                                            struct ethtool_coalesce *ec,
>>                                            u16 queue)
>>   {
>>          int err;
>>
>> -       err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
>> -                                              ec->rx_coalesce_usecs,
>> -                                              ec->rx_max_coalesced_frames);
>> +       err = virtnet_send_rx_notf_coal_vq_cmds(vi, ec, queue);
>>          if (err)
>>                  return err;
>>
>> @@ -3401,6 +3516,32 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>          return 0;
>>   }
>>
>> +static void virtnet_rx_dim_work(struct work_struct *work)
>> +{
>> +       struct dim *dim = container_of(work, struct dim, work);
>> +       struct receive_queue *rq = container_of(dim,
>> +                       struct receive_queue, dim);
>> +       struct virtnet_info *vi = rq->vq->vdev->priv;
>> +       struct net_device *dev = vi->dev;
>> +       struct dim_cq_moder update_moder;
>> +       int qnum = rq - vi->rq, err;
>> +
>> +       update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
>> +       if (update_moder.usec != vi->rq[qnum].intr_coal.max_usecs ||
>> +           update_moder.pkts != vi->rq[qnum].intr_coal.max_packets) {
>> +               rtnl_lock();
>> +               err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>> +                                                      update_moder.usec,
>> +                                                      update_moder.pkts);
>> +               if (err)
>> +                       pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>> +                                dev->name, (int)(rq - vi->rq));
>> +               rtnl_unlock();
>> +       }
>> +
>> +       dim->state = DIM_START_MEASURE;
>> +}
>> +
>>   static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)
>>   {
>>          /* usecs coalescing is supported only if VIRTIO_NET_F_NOTF_COAL
>> @@ -3482,6 +3623,7 @@ static int virtnet_get_coalesce(struct net_device *dev,
>>                  ec->tx_coalesce_usecs = vi->intr_coal_tx.max_usecs;
>>                  ec->tx_max_coalesced_frames = vi->intr_coal_tx.max_packets;
>>                  ec->rx_max_coalesced_frames = vi->intr_coal_rx.max_packets;
>> +               ec->use_adaptive_rx_coalesce = vi->rx_dim_enabled;
>>          } else {
>>                  ec->rx_max_coalesced_frames = 1;
>>
>> @@ -3539,6 +3681,7 @@ static int virtnet_get_per_queue_coalesce(struct net_device *dev,
>>                  ec->tx_coalesce_usecs = vi->sq[queue].intr_coal.max_usecs;
>>                  ec->tx_max_coalesced_frames = vi->sq[queue].intr_coal.max_packets;
>>                  ec->rx_max_coalesced_frames = vi->rq[queue].intr_coal.max_packets;
>> +               ec->use_adaptive_rx_coalesce = vi->rq[queue].dim_enabled;
>>          } else {
>>                  ec->rx_max_coalesced_frames = 1;
>>
>> @@ -3664,7 +3807,7 @@ static int virtnet_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info)
>>
>>   static const struct ethtool_ops virtnet_ethtool_ops = {
>>          .supported_coalesce_params = ETHTOOL_COALESCE_MAX_FRAMES |
>> -               ETHTOOL_COALESCE_USECS,
>> +               ETHTOOL_COALESCE_USECS | ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
>>          .get_drvinfo = virtnet_get_drvinfo,
>>          .get_link = ethtool_op_get_link,
>>          .get_ringparam = virtnet_get_ringparam,
>> --
>> 2.19.1.6.gb485710b
>>


^ permalink raw reply

* Re: [PATCH v2 1/3] net: phy: at803x: add QCA8084 ethernet phy support
From: Jie Luo @ 2023-11-09  8:32 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <20231108131250.66d1c236@fedora>



On 11/8/2023 8:12 PM, Maxime Chevallier wrote:
> Hello,
> 
> On Wed, 8 Nov 2023 19:34:43 +0800
> Luo Jie <quic_luoj@quicinc.com> wrote:
> 
>> Add qca8084 PHY support, which is four-port PHY with maximum
>> link capability 2.5G, the features of each port is almost same
>> as QCA8081 and slave seed config is not needed.
>>
>> Three kind of interface modes supported by qca8084.
>> PHY_INTERFACE_MODE_QUSGMII, PHY_INTERFACE_MODE_2500BASEX and
>> PHY_INTERFACE_MODE_SGMII.
>>
>> The PCS(serdes) and clock are also needed to be configured to
>> bringup qca8084 PHY, which will be added in the pcs driver.
>>
>> The additional CDT configurations used for qca8084.
>>
>> Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
>> ---
>>   drivers/net/phy/at803x.c | 48 ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 48 insertions(+)
> 
> [...]
> 
>> @@ -1824,6 +1828,21 @@ static int qca808x_read_status(struct phy_device *phydev)
>>   		return ret;
>>   
>>   	if (phydev->link) {
>> +		/* There are two PCSs available for QCA8084, which support the following
>> +		 * interface modes.
>> +		 *
>> +		 * 1. PHY_INTERFACE_MODE_QUSGMII utilizes PCS1 for all available 4 ports,
>> +		 * which is for all link speeds.
>> +		 *
>> +		 * 2. PHY_INTERFACE_MODE_2500BASEX utilizes PCS0 for the fourth port,
>> +		 * which is only for the link speed 2500M same as QCA8081.
>> +		 *
>> +		 * 3. PHY_INTERFACE_MODE_SGMII utilizes PCS0 for the fourth port,
>> +		 * which is for the link speed 10M, 100M and 1000M same as QCA8081.
>> +		 */
>> +		if (phydev->interface == PHY_INTERFACE_MODE_QUSGMII)
>> +			return 0;
>> +
> 
> What I understand from this is that this PHY can be used either as a
> switch, in which case port 4 would be connected to the host interface
> at up to 2.5G, or as a quad-phy, but since it uses QUSGMII the link
> speed would be limited to 1G per-port, is that correct ?

When the PHY works on the interface mode QUSGMII for quad-phy, all 4
PHYs can support to the max link speed 2.5G, actually the PHY can
support to max link speed 2.5G for all supported interface modes
including qusgmii and sgmii.

> 
> However the get_features function seems to build the supported modes
> set by reading some capabilities registers :
> 
> static int qca808x_get_features(struct phy_device *phydev)
> {
> [...]
> 	ret = phy_read_mmd(phydev, MDIO_MMD_AN, QCA808X_PHY_MMD7_CHIP_TYPE);
> 	if (ret < 0)
> 		return ret;
> 
> 	if (QCA808X_PHY_CHIP_TYPE_1G & ret)
> 		linkmode_clear_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT, phydev->supported);
> [...]
> }
> 
> Wouldn't port 4 report 2.5G capabilities then ? Maybe you need to
> mask-out the 2.5G bit if the interface is qusgmii.
> 
> Best regards,
> 
> Maxime

All ports including port4 support 2.5G capability for the supported
interface mode.

Thanks Maxime for the review.

^ permalink raw reply

* Re: [PATCH net-next v2 5/5] virtio-net: return -EOPNOTSUPP for adaptive-tx
From: Heng Qi @ 2023-11-09  8:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, virtualization, Michael S. Tsirkin, Eric Dumazet,
	Paolo Abeni, David S. Miller, Jesper Dangaard Brouer,
	John Fastabend, Alexei Starovoitov, Simon Horman, Jakub Kicinski,
	Liu, Yujie
In-Reply-To: <CACGkMEt23Xm=dpwJMwX9dnwVjmQZqBp0SBxnpY19fgc=xMpcjA@mail.gmail.com>



在 2023/11/9 下午12:45, Jason Wang 写道:
> On Thu, Nov 2, 2023 at 9:10 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>> We do not currently support tx dim, so respond to -EOPNOTSUPP.
>>
>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>> ---
>> v1->v2:
>> - Use -EOPNOTSUPP instead of specific implementation.
>>
>>   drivers/net/virtio_net.c | 29 ++++++++++++++++++++++++++---
>>   1 file changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 5473aa1ee5cd..03edeadd0725 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -3364,9 +3364,15 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
>>   static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
>>                                            struct ethtool_coalesce *ec)
>>   {
>> +       bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce;
>>          struct scatterlist sgs_tx;
>>          int i;
>>
>> +       if (tx_ctrl_dim_on) {
>> +               pr_debug("Failed to enable adaptive-tx, which is not supported\n");
>> +               return -EOPNOTSUPP;
>> +       }
> When can we hit this?

When user tries to enable tx dim using following cmd:
'ethtool -C eth0 adaptive-tx on'

Thanks!

>
> Thanks
>
>> +
>>          vi->ctrl->coal_tx.tx_usecs = cpu_to_le32(ec->tx_coalesce_usecs);
>>          vi->ctrl->coal_tx.tx_max_packets = cpu_to_le32(ec->tx_max_coalesced_frames);
>>          sg_init_one(&sgs_tx, &vi->ctrl->coal_tx, sizeof(vi->ctrl->coal_tx));
>> @@ -3497,6 +3503,25 @@ static int virtnet_send_rx_notf_coal_vq_cmds(struct virtnet_info *vi,
>>          return 0;
>>   }
>>
>> +static int virtnet_send_tx_notf_coal_vq_cmds(struct virtnet_info *vi,
>> +                                            struct ethtool_coalesce *ec,
>> +                                            u16 queue)
>> +{
>> +       bool tx_ctrl_dim_on = !!ec->use_adaptive_tx_coalesce;
>> +       int err;
>> +
>> +       if (tx_ctrl_dim_on) {
>> +               pr_debug("Enabling adaptive-tx for txq%d is not supported\n", queue);
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
>> +                                              ec->tx_coalesce_usecs,
>> +                                              ec->tx_max_coalesced_frames);
>> +
>> +       return err;
>> +}
>> +
>>   static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>                                            struct ethtool_coalesce *ec,
>>                                            u16 queue)
>> @@ -3507,9 +3532,7 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>          if (err)
>>                  return err;
>>
>> -       err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
>> -                                              ec->tx_coalesce_usecs,
>> -                                              ec->tx_max_coalesced_frames);
>> +       err = virtnet_send_tx_notf_coal_vq_cmds(vi, ec, queue);
>>          if (err)
>>                  return err;
>>
>> --
>> 2.19.1.6.gb485710b
>>


^ permalink raw reply

* [syzbot] [net?] [nfc?] KASAN: slab-use-after-free Read in nfc_alloc_send_skb
From: syzbot @ 2023-11-09  8:36 UTC (permalink / raw)
  To: davem, edumazet, krzysztof.kozlowski, kuba, linux-kernel, netdev,
	pabeni, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    aea6bf908d73 Merge tag 'f2fs-for-6.7-rc1' of git://git.ker..
git tree:       upstream
console+strace: https://syzkaller.appspot.com/x/log.txt?x=134b06eb680000
kernel config:  https://syzkaller.appspot.com/x/.config?x=93ac5233c138249e
dashboard link: https://syzkaller.appspot.com/bug?extid=bbe84a4010eeea00982d
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1336f6eb680000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=165d67f3680000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/3e461e471e00/disk-aea6bf90.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/3f1784e1f86e/vmlinux-aea6bf90.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4f7a660ae0d9/bzImage-aea6bf90.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+bbe84a4010eeea00982d@syzkaller.appspotmail.com

llcp: nfc_llcp_send_ui_frame: Could not allocate PDU (error=-6)
==================================================================
BUG: KASAN: slab-use-after-free in nfc_alloc_send_skb+0x149/0x1c0 net/nfc/core.c:722
Read of size 4 at addr ffff888075231548 by task syz-executor193/5062

CPU: 1 PID: 5062 Comm: syz-executor193 Not tainted 6.6.0-syzkaller-14263-gaea6bf908d73 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:364 [inline]
 print_report+0x163/0x540 mm/kasan/report.c:475
 kasan_report+0x142/0x170 mm/kasan/report.c:588
 nfc_alloc_send_skb+0x149/0x1c0 net/nfc/core.c:722
 nfc_llcp_send_ui_frame+0x2ac/0x670 net/nfc/llcp_commands.c:766
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg net/socket.c:745 [inline]
 ____sys_sendmsg+0x592/0x890 net/socket.c:2588
 ___sys_sendmsg net/socket.c:2642 [inline]
 __sys_sendmmsg+0x3b2/0x730 net/socket.c:2728
 __do_sys_sendmmsg net/socket.c:2757 [inline]
 __se_sys_sendmmsg net/socket.c:2754 [inline]
 __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2754
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7fa0a3c77969
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa0a3c38208 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007fa0a3d03328 RCX: 00007fa0a3c77969
RDX: 0000000000000001 RSI: 00000000200013c0 RDI: 0000000000000004
RBP: 00007fa0a3d03320 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa0a3cd01a4
R13: 0000000000000039 R14: 00007fa0a3d010f0 R15: 00007fa0a3c38670
 </TASK>

Allocated by task 5062:
 kasan_save_stack mm/kasan/common.c:45 [inline]
 kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
 ____kasan_kmalloc mm/kasan/common.c:374 [inline]
 __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
 kmalloc include/linux/slab.h:600 [inline]
 kzalloc include/linux/slab.h:721 [inline]
 nfc_allocate_device+0x12f/0x520 net/nfc/core.c:1065
 nci_allocate_device+0x1e2/0x360 net/nfc/nci/core.c:1179
 virtual_ncidev_open+0x75/0x1b0 drivers/nfc/virtual_ncidev.c:136
 misc_open+0x30b/0x380 drivers/char/misc.c:165
 chrdev_open+0x5ab/0x630 fs/char_dev.c:414
 do_dentry_open+0x8fd/0x1590 fs/open.c:948
 do_open fs/namei.c:3622 [inline]
 path_openat+0x2845/0x3280 fs/namei.c:3779
 do_filp_open+0x234/0x490 fs/namei.c:3809
 do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
 do_sys_open fs/open.c:1455 [inline]
 __do_sys_openat fs/open.c:1471 [inline]
 __se_sys_openat fs/open.c:1466 [inline]
 __x64_sys_openat+0x247/0x290 fs/open.c:1466
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Freed by task 5061:
 kasan_save_stack mm/kasan/common.c:45 [inline]
 kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
 kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
 ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
 kasan_slab_free include/linux/kasan.h:164 [inline]
 slab_free_hook mm/slub.c:1800 [inline]
 slab_free_freelist_hook mm/slub.c:1826 [inline]
 slab_free mm/slub.c:3809 [inline]
 __kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
 device_release+0x95/0x1c0
 kobject_cleanup lib/kobject.c:682 [inline]
 kobject_release lib/kobject.c:716 [inline]
 kref_put include/linux/kref.h:65 [inline]
 kobject_put+0x1ee/0x430 lib/kobject.c:733
 nfc_free_device include/net/nfc/nfc.h:213 [inline]
 nci_free_device+0x38/0x50 net/nfc/nci/core.c:1209
 virtual_ncidev_close+0x70/0x90 drivers/nfc/virtual_ncidev.c:164
 __fput+0x3cc/0xa10 fs/file_table.c:394
 __do_sys_close fs/open.c:1590 [inline]
 __se_sys_close+0x15f/0x220 fs/open.c:1575
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

The buggy address belongs to the object at ffff888075231000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1352 bytes inside of
 freed 2048-byte region [ffff888075231000, ffff888075231800)

The buggy address belongs to the physical page:
page:ffffea0001d48c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x75230
head:ffffea0001d48c00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000840 ffff888012c42000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000080080008 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5062, tgid 5061 (syz-executor193), ts 56272329966, free_ts 55738397890
 set_page_owner include/linux/page_owner.h:31 [inline]
 post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1537
 prep_new_page mm/page_alloc.c:1544 [inline]
 get_page_from_freelist+0x339a/0x3530 mm/page_alloc.c:3312
 __alloc_pages+0x255/0x670 mm/page_alloc.c:4568
 alloc_pages_mpol+0x3de/0x640 mm/mempolicy.c:2133
 alloc_slab_page+0x6a/0x160 mm/slub.c:1870
 allocate_slab mm/slub.c:2017 [inline]
 new_slab+0x84/0x2f0 mm/slub.c:2070
 ___slab_alloc+0xc85/0x1310 mm/slub.c:3223
 __slab_alloc mm/slub.c:3322 [inline]
 __slab_alloc_node mm/slub.c:3375 [inline]
 slab_alloc_node mm/slub.c:3468 [inline]
 __kmem_cache_alloc_node+0x21d/0x300 mm/slub.c:3517
 kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
 kmalloc include/linux/slab.h:600 [inline]
 kzalloc include/linux/slab.h:721 [inline]
 nci_allocate_device+0xe9/0x360 net/nfc/nci/core.c:1163
 virtual_ncidev_open+0x75/0x1b0 drivers/nfc/virtual_ncidev.c:136
 misc_open+0x30b/0x380 drivers/char/misc.c:165
 chrdev_open+0x5ab/0x630 fs/char_dev.c:414
 do_dentry_open+0x8fd/0x1590 fs/open.c:948
 do_open fs/namei.c:3622 [inline]
 path_openat+0x2845/0x3280 fs/namei.c:3779
 do_filp_open+0x234/0x490 fs/namei.c:3809
page last free stack trace:
 reset_page_owner include/linux/page_owner.h:24 [inline]
 free_pages_prepare mm/page_alloc.c:1137 [inline]
 free_unref_page_prepare+0x92a/0xa50 mm/page_alloc.c:2347
 free_unref_page+0x37/0x3f0 mm/page_alloc.c:2487
 __slab_free+0x2f6/0x390 mm/slub.c:3715
 qlink_free mm/kasan/quarantine.c:168 [inline]
 qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:187
 kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:294
 __kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
 kasan_slab_alloc include/linux/kasan.h:188 [inline]
 slab_post_alloc_hook+0x6c/0x3c0 mm/slab.h:763
 slab_alloc_node mm/slub.c:3478 [inline]
 __kmem_cache_alloc_node+0x1d0/0x300 mm/slub.c:3517
 __do_kmalloc_node mm/slab_common.c:1006 [inline]
 __kmalloc+0xa8/0x230 mm/slab_common.c:1020
 kmalloc include/linux/slab.h:604 [inline]
 tomoyo_add_entry security/tomoyo/common.c:2023 [inline]
 tomoyo_supervisor+0xe06/0x11f0 security/tomoyo/common.c:2095
 tomoyo_audit_env_log security/tomoyo/environ.c:36 [inline]
 tomoyo_env_perm+0x178/0x210 security/tomoyo/environ.c:63
 tomoyo_environ security/tomoyo/domain.c:672 [inline]
 tomoyo_find_next_domain+0x1383/0x1cf0 security/tomoyo/domain.c:878
 tomoyo_bprm_check_security+0x114/0x170 security/tomoyo/tomoyo.c:101
 security_bprm_check+0x63/0xa0 security/security.c:1103
 search_binary_handler fs/exec.c:1725 [inline]
 exec_binprm fs/exec.c:1779 [inline]
 bprm_execve+0x95f/0x18a0 fs/exec.c:1854
 do_execveat_common+0x580/0x720 fs/exec.c:1962

Memory state around the buggy address:
 ffff888075231400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888075231480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888075231500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
 ffff888075231580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888075231600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: Paolo Abeni @ 2023-11-09  8:44 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-6-almasrymina@google.com>

On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> +{
> +	struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> +
> +	refcount_set(&ppiov->refcount, 1);
> +
> +	if (gen_pool_has_addr(binding->chunk_pool,
> +			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
> +		gen_pool_free(binding->chunk_pool,
> +			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE);

Minor nit: what about caching the dma_addr value to make the above more
readable?

Cheers,

Paolo


^ permalink raw reply

* Re: [PATCH v9 bpf-next 02/17] bpf: add BPF token delegation mount options to BPF FS
From: Christian Brauner @ 2023-11-09  8:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev, paul, linux-fsdevel,
	linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <CAEf4BzbanZO_QPhzyFgBEuB0i+uZZO4rZn7mO1qNp3aoPx+32g@mail.gmail.com>

On Wed, Nov 08, 2023 at 01:09:27PM -0800, Andrii Nakryiko wrote:
> On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> > > Add few new mount options to BPF FS that allow to specify that a given
> > > BPF FS instance allows creation of BPF token (added in the next patch),
> > > and what sort of operations are allowed under BPF token. As such, we get
> > > 4 new mount options, each is a bit mask
> > >   - `delegate_cmds` allow to specify which bpf() syscall commands are
> > >     allowed with BPF token derived from this BPF FS instance;
> > >   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
> > >     a set of allowable BPF map types that could be created with BPF token;
> > >   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
> > >     a set of allowable BPF program types that could be loaded with BPF token;
> > >   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
> > >     a set of allowable BPF program attach types that could be loaded with
> > >     BPF token; delegate_progs and delegate_attachs are meant to be used
> > >     together, as full BPF program type is, in general, determined
> > >     through both program type and program attach type.
> > >
> > > Currently, these mount options accept the following forms of values:
> > >   - a special value "any", that enables all possible values of a given
> > >   bit set;
> > >   - numeric value (decimal or hexadecimal, determined by kernel
> > >   automatically) that specifies a bit mask value directly;
> > >   - all the values for a given mount option are combined, if specified
> > >   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
> > >   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
> > >   mask.
> > >
> > > Ideally, more convenient (for humans) symbolic form derived from
> > > corresponding UAPI enums would be accepted (e.g., `-o
> > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> > > it requires a bunch of UAPI header churn, so I postponed it until this
> > > feature lands upstream or at least there is a definite consensus that
> > > this feature is acceptable and is going to make it, just to minimize
> > > amount of wasted effort and not increase amount of non-essential code to
> > > be reviewed.
> > >
> > > Attentive reader will notice that BPF FS is now marked as
> > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> > > user namespace as long as the process has sufficient *namespaced*
> > > capabilities within that user namespace. But in reality we still
> > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> > > inside unprivileged process inside non-init userns, to capture that
> > > userns as the owning userns. It will still be required to pass this
> > > context object back to privileged process to instantiate and mount it.
> > >
> > > This manipulation is important, because capturing non-init userns as the
> > > owning userns of BPF FS instance (super block) allows to use that userns
> > > to constraint BPF token to that userns later on (see next patch). So
> > > creating BPF FS with delegation inside unprivileged userns will restrict
> > > derived BPF token objects to only "work" inside that intended userns,
> > > making it scoped to a intended "container".
> > >
> > > There is a set of selftests at the end of the patch set that simulates
> > > this sequence of steps and validates that everything works as intended.
> > > But careful review is requested to make sure there are no missed gaps in
> > > the implementation and testing.
> > >
> > > All this is based on suggestions and discussions with Christian Brauner
> > > ([0]), to the best of my ability to follow all the implications.
> >
> > "who will not be held responsible for any CVE future or present as he's
> >  not sure whether bpf token is a good idea in general"
> >
> > I'm not opposing it because it's really not my subsystem. But it'd be
> > nice if you also added a disclaimer that I'm not endorsing this. :)
> >
> 
> Sure, I'll clarify. I still appreciate your reviewing everything and
> pointing out all the gotchas (like the reconfiguration and other
> stuff), thanks!
> 
> > A comment below.
> >
> > >
> > >   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h | 10 ++++++
> > >  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
> > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > >
> 
> [...]
> 
> > >       opt = fs_parse(fc, bpf_fs_parameters, param, &result);
> > >       if (opt < 0) {
> > > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > >       case OPT_MODE:
> > >               opts->mode = result.uint_32 & S_IALLUGO;
> > >               break;
> > > +     case OPT_DELEGATE_CMDS:
> > > +     case OPT_DELEGATE_MAPS:
> > > +     case OPT_DELEGATE_PROGS:
> > > +     case OPT_DELEGATE_ATTACHS:
> > > +             if (strcmp(param->string, "any") == 0) {
> > > +                     msk = ~0ULL;
> > > +             } else {
> > > +                     err = kstrtou64(param->string, 0, &msk);
> > > +                     if (err)
> > > +                             return err;
> > > +             }
> > > +             switch (opt) {
> > > +             case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> > > +             case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> > > +             case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> > > +             case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> > > +             default: return -EINVAL;
> > > +             }
> > > +             break;
> > >       }
> >
> > So just to repeat that this will allow a container to set it's own
> > delegation options:
> >
> >         # unprivileged container
> >
> >         fd_fs = fsopen();
> >         fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");
> >
> >         # Now hand of that fd_fs to a privileged process
> >
> >         fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)
> >
> > This means the container manager can't be part of your threat model
> > because you need to trust it to set delegation options.
> >
> > But if the container manager is part of your threat model then you can
> > never trust an fd_fs handed to you because the container manager might
> > have enabled arbitrary delegation privileges.
> >
> > There's ways around this:
> >
> > (1) kernel: Account for this in the kernel and require privileges when
> >     setting delegation options.
> 
> What sort of privilege would that be? We are in an unprivileged user
> namespace, so that would have to be some ns_capable() checks or
> something? I can add ns_capable(CAP_BPF), but what else did you have
> in mind?

You would require privileges in the initial namespace aka capable()
checks similar to what you require for superblock creation.

> 
> I think even if we say that privileged parent does FSCONFIG_SET_STRING
> and unprivileged child just does sys_fsopen("bpf", 0) and nothing
> more, we still can't be sure that child won't race with parent and set
> FSCONFIG_SET_STRING at the same time. Because they both have access to
> the same fs_fd.

Unless you require privileges as outlined above to set delegation
options in which case an unprivileged container cannot change delegation
options at all.

> 
> > (2) userspace: A trusted helper that allocates an fs_context fd in
> >     the target user namespace, then sets delegation options and creates
> >     superblock.
> >
> > (1) Is more restrictive but also more secure. (2) is less restrictive
> > but requires more care from userspace.
> >
> > Either way I would probably consider writing a document detailing
> > various delegation scenarios and possible pitfalls and implications
> > before advertising it.
> >
> > If you choose (2) then you also need to be aware that the security of
> > this also hinges on bpffs not allowing to reconfigure parameters once it
> > has been mounted. Otherwise an unprivileged container can change
> > delegation options.
> >
> > I would recommend that you either add a dummy bpf_reconfigure() method
> > with a comment in it or you add a comment on top of bpf_context_ops.
> > Something like:
> >
> > /*
> >  * Unprivileged mounts of bpffs are owned by the user namespace they are
> >  * mounted in. That means unprivileged users can change vfs mount
> >  * options (ro<->rw, nosuid, etc.).
> >  *
> >  * They currently cannot change bpffs specific mount options such as
> >  * delegation settings. If that is ever implemented it is necessary to
> >  * require rivileges in the initial namespace. Otherwise unprivileged
> >  * users can change delegation options to whatever they want.
> >  */
> 
> Yep, I will add a custom callback. I think we can allow reconfiguring
> towards less permissive delegation subset, but I'll need to look at
> the specifics to see if we can support that easily.

^ permalink raw reply

* Re: [PATCH net-next 02/15] net: page_pool: avoid touching slow on the fastpath
From: Ilias Apalodimas @ 2023-11-09  9:00 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, almasrymina, hawk
In-Reply-To: <20231024160220.3973311-3-kuba@kernel.org>

On Tue, 24 Oct 2023 at 19:02, Jakub Kicinski <kuba@kernel.org> wrote:
>
> To fully benefit from previous commit add one byte of state
> in the first cache line recording if we need to look at
> the slow part.
>
> The packing isn't all that impressive right now, we create
> a 7B hole. I'm expecting Olek's rework will reshuffle this,
> anyway.
>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/page_pool/types.h | 2 ++
>  net/core/page_pool.c          | 4 +++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 23950fcc4eca..e1bb92c192de 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -125,6 +125,8 @@ struct page_pool_stats {
>  struct page_pool {
>         struct page_pool_params_fast p;
>
> +       bool has_init_callback;
> +
>         long frag_users;
>         struct page *frag_page;
>         unsigned int frag_offset;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 5cae413de7cc..08af9de8e8eb 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -212,6 +212,8 @@ static int page_pool_init(struct page_pool *pool,
>                  */
>         }
>
> +       pool->has_init_callback = !!pool->slow.init_callback;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>         pool->recycle_stats = alloc_percpu(struct page_pool_recycle_stats);
>         if (!pool->recycle_stats)
> @@ -385,7 +387,7 @@ static void page_pool_set_pp_info(struct page_pool *pool,
>          * the overhead is negligible.
>          */
>         page_pool_fragment_page(page, 1);
> -       if (pool->slow.init_callback)
> +       if (pool->has_init_callback)
>                 pool->slow.init_callback(page, pool->slow.init_arg);
>  }
>
> --
> 2.41.0
>

Same here, please swap my ack with
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

^ permalink raw reply

* Re: [RFC PATCH v3 07/12] page-pool: device memory support
From: Paolo Abeni @ 2023-11-09  9:01 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi
In-Reply-To: <20231106024413.2801438-8-almasrymina@google.com>

On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page* into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 
> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
>  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
>  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
>  2 files changed, 118 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index b93243c2a640..08f1a2cc70d2 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>  	return NULL;
>  }
>  
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +	return page_ref_count(page);
> +}
> +
> +static inline void page_pool_page_get_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_get_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	return page_ref_add(page, count);
> +}
> +
> +static inline void page_pool_page_put_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_put_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	if (count > 1)
> +		page_ref_sub(page, count - 1);
> +
> +	put_page(page);
> +}
> +
> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return false;
> +
> +	return page_is_pfmemalloc(page);
> +}
> +
> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> +{
> +	/* Assume page_pool_iov are on the preferred node without actually
> +	 * checking...
> +	 *
> +	 * This check is only used to check for recycling memory in the page
> +	 * pool's fast paths. Currently the only implementation of page_pool_iov
> +	 * is dmabuf device memory. It's a deliberate decision by the user to
> +	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> +	 * would not be able to reallocate memory from another dmabuf that
> +	 * exists on the preferred node, so, this check doesn't make much sense
> +	 * in this case. Assume all page_pool_iovs can be recycled for now.
> +	 */
> +	if (page_is_page_pool_iov(page))
> +		return true;
> +
> +	return page_to_nid(page) == pref_nid;
> +}
> +
>  /**
>   * page_pool_dev_alloc_pages() - allocate a page.
>   * @pool:	pool from which to allocate
> @@ -301,6 +359,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>  {
>  	long ret;
>  
> +	if (page_is_page_pool_iov(page))
> +		return -EINVAL;
> +
>  	/* If nr == pp_frag_count then we have cleared all remaining
>  	 * references to the page:
>  	 * 1. 'n == 1': no need to actually overwrite it.
> @@ -431,7 +492,12 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>   */
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret;
> +
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));

Should the above conditional be guarded by the page_pool_mem_providers
static key? this looks like fast-path. Same question for the refcount
helper above.

Minor nit: possibly cache 'page_is_page_pool_iov(page)' to make the
code more readable.

> +
> +	ret = page->dma_addr;
>  
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
>  		ret <<= PAGE_SHIFT;
> @@ -441,6 +507,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  
>  static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>  {
> +	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return false;
> +	}

Quickly skimming over the page_pool_code it looks like
page_pool_set_dma_addr() usage is guarded by the PP_FLAG_DMA_MAP page
pool flag. Could the device mem provider enforce such flag being
cleared on the page pool?

> +
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
>  		page->dma_addr = addr >> PAGE_SHIFT;
>  
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 138ddea0b28f..d211996d423b 100644
> --- a/net/core/page_pool.cnn
> +++ b/net/core/page_pool.c
> @@ -317,7 +317,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>  		if (unlikely(!page))
>  			break;
>  
> -		if (likely(page_to_nid(page) == pref_nid)) {
> +		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>  			pool->alloc.cache[pool->alloc.count++] = page;
>  		} else {
>  			/* NUMA mismatch;
> @@ -362,7 +362,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  					  struct page *page,
>  					  unsigned int dma_sync_size)
>  {
> -	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +	dma_addr_t dma_addr;
> +
> +	/* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}

Similar to the above point, mutatis mutandis.

> +
> +	dma_addr = page_pool_get_dma_addr(page);
>  
>  	dma_sync_size = min(dma_sync_size, pool->p.max_len);
>  	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> @@ -374,6 +382,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>  {
>  	dma_addr_t dma;
>  
> +	if (page_is_page_pool_iov(page)) {
> +		/* page_pool_iovs are already mapped */
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return true;
> +	}

Ditto.

Cheers,

Paolo


^ permalink raw reply

* [PATCH net v4 0/3] Fix large frames in the Gemini ethernet driver
From: Linus Walleij @ 2023-11-09  9:03 UTC (permalink / raw)
  To: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michał Mirosław, Vladimir Oltean,
	Andrew Lunn
  Cc: linux-arm-kernel, netdev, linux-kernel, Linus Walleij

This is the result of a bug hunt for a problem with the
RTL8366RB DSA switch leading me wrong all over the place.

I am indebted to Vladimir Oltean who as usual pointed
out where the real problem was, many thanks!

Tryig to actually use big ("jumbo") frames on this
hardware uncovered the real bugs. Then I tested it on
the DSA switch and it indeed fixes the issue.

To make sure it also works fine with big frames on
non-DSA devices I also copied a large video file over
scp to a device with maximum frame size, the data
was transported in large TCP packets ending up in
0x7ff sized frames using software checksumming at
~2.0 MB/s.

If I set down the MTU to the standard 1500 bytes so
that hardware checksumming is used, the scp transfer
of the same file was slightly lower, ~1.8-1.9 MB/s.

Despite this not being the best test it shows that
we can now stress the hardware with large frames
and that software checksum works fine.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
Changes in v4:
- Strip stray v1-related comment from the commit message on patch 1
- Move the hunks deleting gmac_fix_features() from patch
  "net: ethernet: cortina: Handle large frames" to
  "net: ethernet: cortina: Fix MTU max setting" as it is
  perfectly motivated by the MTU change, then move this patch
  later in the series.
- Drop the last patch only activating the checksum engine for
  TCP and UDP explicitly. It's not fixing a regression,
  so let's reconsider it for net-next rather than net.
- Link to v3: https://lore.kernel.org/r/20231107-gemini-largeframe-fix-v3-0-e3803c080b75@linaro.org

Changes in v3:
- Do not reimplement the existing oversize check (sigh what is
  wrong with me). Drop that patch.
- Drop the gmac_fix_features() since we are better off falling
  back to software checksums dynamically per-frame.
- Add a new patch to bypass the checksumming engine if we are not
  handling TCP or UDP.
- Link to v2: https://lore.kernel.org/r/20231105-gemini-largeframe-fix-v2-0-cd3a5aa6c496@linaro.org

Changes in v2:
- Don't check for oversized MTU request: the framework makes sure it doesn't
  happen.
- Drop unrelated BIT() macro cleanups (I might send these later for net-next)
- Use a special error code if the skbuff is too big and fail gracefully
  is this happens.
- Do proper checksum of the frame using a software fallback when the frame
  is too long for hardware checksumming.
- Link to v1: https://lore.kernel.org/r/20231104-gemini-largeframe-fix-v1-0-9c5513f22f33@linaro.org

---
Linus Walleij (3):
      net: ethernet: cortina: Fix max RX frame define
      net: ethernet: cortina: Handle large frames
      net: ethernet: cortina: Fix MTU max setting

 drivers/net/ethernet/cortina/gemini.c | 45 ++++++++++++++++++++++-------------
 drivers/net/ethernet/cortina/gemini.h |  4 ++--
 2 files changed, 31 insertions(+), 18 deletions(-)
---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231104-gemini-largeframe-fix-c143d2c781b5

Best regards,
-- 
Linus Walleij <linus.walleij@linaro.org>


^ permalink raw reply

* [PATCH net v4 1/3] net: ethernet: cortina: Fix max RX frame define
From: Linus Walleij @ 2023-11-09  9:03 UTC (permalink / raw)
  To: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michał Mirosław, Vladimir Oltean,
	Andrew Lunn
  Cc: linux-arm-kernel, netdev, linux-kernel, Linus Walleij
In-Reply-To: <20231109-gemini-largeframe-fix-v4-0-6e611528db08@linaro.org>

Enumerator 3 is 1548 bytes according to the datasheet.
Not 1542.

Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 drivers/net/ethernet/cortina/gemini.c | 4 ++--
 drivers/net/ethernet/cortina/gemini.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index a8b9d1a3e4d5..5bdd1b252840 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -432,8 +432,8 @@ static const struct gmac_max_framelen gmac_maxlens[] = {
 		.val = CONFIG0_MAXLEN_1536,
 	},
 	{
-		.max_l3_len = 1542,
-		.val = CONFIG0_MAXLEN_1542,
+		.max_l3_len = 1548,
+		.val = CONFIG0_MAXLEN_1548,
 	},
 	{
 		.max_l3_len = 9212,
diff --git a/drivers/net/ethernet/cortina/gemini.h b/drivers/net/ethernet/cortina/gemini.h
index 9fdf77d5eb37..99efb1155743 100644
--- a/drivers/net/ethernet/cortina/gemini.h
+++ b/drivers/net/ethernet/cortina/gemini.h
@@ -787,7 +787,7 @@ union gmac_config0 {
 #define  CONFIG0_MAXLEN_1536	0
 #define  CONFIG0_MAXLEN_1518	1
 #define  CONFIG0_MAXLEN_1522	2
-#define  CONFIG0_MAXLEN_1542	3
+#define  CONFIG0_MAXLEN_1548	3
 #define  CONFIG0_MAXLEN_9k	4	/* 9212 */
 #define  CONFIG0_MAXLEN_10k	5	/* 10236 */
 #define  CONFIG0_MAXLEN_1518__6	6

-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 2/3] net: ethernet: cortina: Handle large frames
From: Linus Walleij @ 2023-11-09  9:03 UTC (permalink / raw)
  To: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michał Mirosław, Vladimir Oltean,
	Andrew Lunn
  Cc: linux-arm-kernel, netdev, linux-kernel, Linus Walleij
In-Reply-To: <20231109-gemini-largeframe-fix-v4-0-6e611528db08@linaro.org>

The Gemini ethernet controller provides hardware checksumming
for frames up to 1514 bytes including ethernet headers but not
FCS.

If we start sending bigger frames (after first bumping up the MTU
on both interfaces sending and receiving the frames), truncated
packets start to appear on the target such as in this tcpdump
resulting from ping -s 1474:

23:34:17.241983 14:d6:4d:a8:3c:4f (oui Unknown) > bc:ae:c5:6b:a8:3d (oui Unknown),
ethertype IPv4 (0x0800), length 1514: truncated-ip - 2 bytes missing!
(tos 0x0, ttl 64, id 32653, offset 0, flags [DF], proto ICMP (1), length 1502)
OpenWrt.lan > Fecusia: ICMP echo request, id 1672, seq 50, length 1482

If we bypass the hardware checksumming and provide a software
fallback, everything starts working fine up to the max TX MTU
of 2047 bytes, for example ping -s2000 192.168.1.2:

00:44:29.587598 bc:ae:c5:6b:a8:3d (oui Unknown) > 14:d6:4d:a8:3c:4f (oui Unknown),
ethertype IPv4 (0x0800), length 2042:
(tos 0x0, ttl 64, id 51828, offset 0, flags [none], proto ICMP (1), length 2028)
Fecusia > OpenWrt.lan: ICMP echo reply, id 1683, seq 4, length 2008

The bit enabling to bypass hardware checksum (or any of the
"TSS" bits) are undocumented in the hardware reference manual.
The entire hardware checksum unit appears undocumented. The
conclusion that we need to use the "bypass" bit was found by
trial-and-error.

Since no hardware checksum will happen, we slot in a software
checksum fallback.

Check for the condition where we need to compute checksum on the
skb with either hardware or software using == CHECKSUM_PARTIAL instead
of != CHECKSUM_NONE which is an incomplete check according to
<linux/skbuff.h>.

On the D-Link DIR-685 router this fixes a bug on the conduit
interface to the RTL8366RB DSA switch: as the switch needs to add
space for its tag it increases the MTU on the conduit interface
to 1504 and that means that when the router sends packages
of 1500 bytes these get an extra 4 bytes of DSA tag and the
transfer fails because of the erroneous hardware checksumming,
affecting such basic functionality as the LuCI web interface.

Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 drivers/net/ethernet/cortina/gemini.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 5bdd1b252840..dbbccef86516 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -1145,6 +1145,7 @@ static int gmac_map_tx_bufs(struct net_device *netdev, struct sk_buff *skb,
 	dma_addr_t mapping;
 	unsigned short mtu;
 	void *buffer;
+	int ret;
 
 	mtu  = ETH_HLEN;
 	mtu += netdev->mtu;
@@ -1159,9 +1160,30 @@ static int gmac_map_tx_bufs(struct net_device *netdev, struct sk_buff *skb,
 		word3 |= mtu;
 	}
 
-	if (skb->ip_summed != CHECKSUM_NONE) {
+	if (skb->len >= ETH_FRAME_LEN) {
+		/* Hardware offloaded checksumming isn't working on frames
+		 * bigger than 1514 bytes. A hypothesis about this is that the
+		 * checksum buffer is only 1518 bytes, so when the frames get
+		 * bigger they get truncated, or the last few bytes get
+		 * overwritten by the FCS.
+		 *
+		 * Just use software checksumming and bypass on bigger frames.
+		 */
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			ret = skb_checksum_help(skb);
+			if (ret)
+				return ret;
+		}
+		word1 |= TSS_BYPASS_BIT;
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		int tcp = 0;
 
+		/* We do not switch off the checksumming on non TCP/UDP
+		 * frames: as is shown from tests, the checksumming engine
+		 * is smart enough to see that a frame is not actually TCP
+		 * or UDP and then just pass it through without any changes
+		 * to the frame.
+		 */
 		if (skb->protocol == htons(ETH_P_IP)) {
 			word1 |= TSS_IP_CHKSUM_BIT;
 			tcp = ip_hdr(skb)->protocol == IPPROTO_TCP;

-- 
2.34.1


^ permalink raw reply related

* [PATCH net v4 3/3] net: ethernet: cortina: Fix MTU max setting
From: Linus Walleij @ 2023-11-09  9:03 UTC (permalink / raw)
  To: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michał Mirosław, Vladimir Oltean,
	Andrew Lunn
  Cc: linux-arm-kernel, netdev, linux-kernel, Linus Walleij
In-Reply-To: <20231109-gemini-largeframe-fix-v4-0-6e611528db08@linaro.org>

The RX max frame size is over 10000 for the Gemini ethernet,
but the TX max frame size is actually just 2047 (0x7ff after
checking the datasheet). Reflect this in what we offer to Linux,
cap the MTU at the TX max frame minus ethernet headers.

We delete the code disabling the hardware checksum for large
MTUs as netdev->mtu can no longer be larger than
netdev->max_mtu meaning the if()-clause in gmac_fix_features()
is never true.

Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 drivers/net/ethernet/cortina/gemini.c | 17 ++++-------------
 drivers/net/ethernet/cortina/gemini.h |  2 +-
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index dbbccef86516..636949737d72 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -2000,15 +2000,6 @@ static int gmac_change_mtu(struct net_device *netdev, int new_mtu)
 	return 0;
 }
 
-static netdev_features_t gmac_fix_features(struct net_device *netdev,
-					   netdev_features_t features)
-{
-	if (netdev->mtu + ETH_HLEN + VLAN_HLEN > MTU_SIZE_BIT_MASK)
-		features &= ~GMAC_OFFLOAD_FEATURES;
-
-	return features;
-}
-
 static int gmac_set_features(struct net_device *netdev,
 			     netdev_features_t features)
 {
@@ -2234,7 +2225,6 @@ static const struct net_device_ops gmac_351x_ops = {
 	.ndo_set_mac_address	= gmac_set_mac_address,
 	.ndo_get_stats64	= gmac_get_stats64,
 	.ndo_change_mtu		= gmac_change_mtu,
-	.ndo_fix_features	= gmac_fix_features,
 	.ndo_set_features	= gmac_set_features,
 };
 
@@ -2486,11 +2476,12 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
 
 	netdev->hw_features = GMAC_OFFLOAD_FEATURES;
 	netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO;
-	/* We can handle jumbo frames up to 10236 bytes so, let's accept
-	 * payloads of 10236 bytes minus VLAN and ethernet header
+	/* We can receive jumbo frames up to 10236 bytes but only
+	 * transmit 2047 bytes so, let's accept payloads of 2047
+	 * bytes minus VLAN and ethernet header
 	 */
 	netdev->min_mtu = ETH_MIN_MTU;
-	netdev->max_mtu = 10236 - VLAN_ETH_HLEN;
+	netdev->max_mtu = MTU_SIZE_BIT_MASK - VLAN_ETH_HLEN;
 
 	port->freeq_refill = 0;
 	netif_napi_add(netdev, &port->napi, gmac_napi_poll);
diff --git a/drivers/net/ethernet/cortina/gemini.h b/drivers/net/ethernet/cortina/gemini.h
index 99efb1155743..24bb989981f2 100644
--- a/drivers/net/ethernet/cortina/gemini.h
+++ b/drivers/net/ethernet/cortina/gemini.h
@@ -502,7 +502,7 @@ union gmac_txdesc_3 {
 #define SOF_BIT			0x80000000
 #define EOF_BIT			0x40000000
 #define EOFIE_BIT		BIT(29)
-#define MTU_SIZE_BIT_MASK	0x1fff
+#define MTU_SIZE_BIT_MASK	0x7ff /* Max MTU 2047 bytes */
 
 /* GMAC Tx Descriptor */
 struct gmac_txdesc {

-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net-next 07/15] eth: link netdev to page_pools in drivers
From: Ilias Apalodimas @ 2023-11-09  9:11 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, almasrymina, hawk
In-Reply-To: <20231024160220.3973311-8-kuba@kernel.org>

Hi Jakub,

On Tue, 24 Oct 2023 at 19:02, Jakub Kicinski <kuba@kernel.org> wrote:
>
> Link page pool instances to netdev for the drivers which
> already link to NAPI. Unless the driver is doing something
> very weird per-NAPI should imply per-netdev.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
>  drivers/net/ethernet/microsoft/mana/mana_en.c     | 1 +
>  3 files changed, 3 insertions(+)

Mind add a line for netsec.c (probably in netsec_setup_rx_dring),
since that's the only NIC I currently have around with pp support?

[...]

Thanks
/Ilias

^ permalink raw reply

* Re: [RFC PATCH v3 08/12] net: support non paged skb frags
From: Paolo Abeni @ 2023-11-09  9:14 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi
In-Reply-To: <20231106024413.2801438-9-almasrymina@google.com>

On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> @@ -3421,7 +3446,7 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> -	get_page(skb_frag_page(frag));
> +	page_pool_page_get_many(frag->bv_page, 1);

I guess the above needs #ifdef CONFIG_PAGE_POOL guards and explicit
skb_frag_is_page_pool_iov() check ?


Cheers,

Paolo


^ permalink raw reply

* Re: [PATCH net 1/1] net: stmmac: fix MAC and phylink mismatch issue after resume with STMMAC_FLAG_USE_PHY_WOL enabled
From: Russell King (Oracle) @ 2023-11-09  9:15 UTC (permalink / raw)
  To: Gan Yi Fang
  Cc: Alexandre Torgue, Jose Abreu, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, Joakim Zhang,
	netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Looi Hong Aun, Voon Weifeng, Song Yoong Siang
In-Reply-To: <20231109050027.2545000-1-yi.fang.gan@intel.com>

On Thu, Nov 09, 2023 at 01:00:27PM +0800, Gan Yi Fang wrote:
> From: "Gan, Yi Fang" <yi.fang.gan@intel.com>
> 
> The issue happened when flag STMMAC_FLAG_USE_PHY_WOL is enabled.
> It can be reproduced with steps below:
> 1. Advertise only one speed on the host
> 2. Enable the WoL on the host
> 3. Suspend the host
> 4. Wake up the host
> 
> When the WoL is disabled, both the PHY and MAC will suspend and wake up
> with everything configured well. When WoL is enabled, the PHY needs to be
> stay awake to receive the signal from remote client but MAC will enter
> suspend mode.
> 
> When the MAC resumes from suspend, phylink_resume() will call
> phylink_start() to start the phylink instance which will trigger the
> phylink machine to invoke the mac_link_up callback function. The
> stmmac_mac_link_up() will configure the MAC_CTRL_REG based on the current
> link state. Then the stmmac_hw_setup() will be called to configure the MAC.
> 
> This sequence might cause mismatch of the link state between MAC and
> phylink. This patch moves the phylink_resume() after stmamc_hw_setup() to
> ensure the MAC is initialized before phylink is being configured.

Isn't this going to cause problems?

stmamc_hw_setup() calls stmmac_init_dma_engine(), which then calls
stmmac_reset() - and stmmac_reset() can fail if the PHY clock isn't
running, which is why phylink_resume() gets called before this.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH v2 1/3] net: phy: at803x: add QCA8084 ethernet phy support
From: Maxime Chevallier @ 2023-11-09  9:16 UTC (permalink / raw)
  To: Jie Luo
  Cc: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <423a3ee3-bed5-02f9-f872-7b5dba64f994@quicinc.com>

Hello,

On Thu, 9 Nov 2023 16:32:36 +0800
Jie Luo <quic_luoj@quicinc.com> wrote:

[...]

> > What I understand from this is that this PHY can be used either as a
> > switch, in which case port 4 would be connected to the host interface
> > at up to 2.5G, or as a quad-phy, but since it uses QUSGMII the link
> > speed would be limited to 1G per-port, is that correct ?  
> 
> When the PHY works on the interface mode QUSGMII for quad-phy, all 4
> PHYs can support to the max link speed 2.5G, actually the PHY can
> support to max link speed 2.5G for all supported interface modes
> including qusgmii and sgmii.

I'm a bit confused then, as the USGMII spec says that Quad USGMII really
is for quad 10/100/1000 speeds, using 10b/8b encoding.

Aren't you using the USXGMII mode instead, which can convey 4 x 2.5Gbps
 with 66b/64b encoding ?

Thanks,

Maxime

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox