* Re: [PATCH net v2 1/2] r8169: add handling DASH when DASH is disabled
From: Paolo Abeni @ 2023-11-09 11:49 UTC (permalink / raw)
To: ChunHao Lin, hkallweit1
Cc: nic_swsd, davem, edumazet, kuba, netdev, linux-kernel, stable
In-Reply-To: <20231108184849.2925-2-hau@realtek.com>
On Thu, 2023-11-09 at 02:48 +0800, ChunHao Lin wrote:
> For devices that support DASH, even DASH is disabled, there may still
> exist a default firmware that will influence device behavior.
> So driver needs to handle DASH for devices that support DASH, no
> matter the DASH status is.
>
> This patch also prepare for "fix DASH deviceis network lost issue".
>
> Signed-off-by: ChunHao Lin <hau@realtek.com>
You should include the fixes tag you already added in v1 and your Sob
should come as the last tag
The same applies to the next patch
> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
It's not clear where/when Heiner provided the above tag for this patch.
I hope that was off-list.
> Cc: stable@vger.kernel.org
> ---
> drivers/net/ethernet/realtek/r8169_main.c | 35 ++++++++++++++++-------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 0c76c162b8a9..108dc75050ba 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -624,6 +624,7 @@ struct rtl8169_private {
>
> unsigned supports_gmii:1;
> unsigned aspm_manageable:1;
> + unsigned dash_enabled:1;
> dma_addr_t counters_phys_addr;
> struct rtl8169_counters *counters;
> struct rtl8169_tc_offsets tc_offset;
> @@ -1253,14 +1254,26 @@ static bool r8168ep_check_dash(struct rtl8169_private *tp)
> return r8168ep_ocp_read(tp, 0x128) & BIT(0);
> }
>
> -static enum rtl_dash_type rtl_check_dash(struct rtl8169_private *tp)
> +static bool rtl_dash_is_enabled(struct rtl8169_private *tp)
> +{
> + switch (tp->dash_type) {
> + case RTL_DASH_DP:
> + return r8168dp_check_dash(tp);
> + case RTL_DASH_EP:
> + return r8168ep_check_dash(tp);
> + default:
> + return false;
> + }
> +}
> +
> +static enum rtl_dash_type rtl_get_dash_type(struct rtl8169_private *tp)
> {
> switch (tp->mac_version) {
> case RTL_GIGA_MAC_VER_28:
> case RTL_GIGA_MAC_VER_31:
> - return r8168dp_check_dash(tp) ? RTL_DASH_DP : RTL_DASH_NONE;
> + return RTL_DASH_DP;
> case RTL_GIGA_MAC_VER_51 ... RTL_GIGA_MAC_VER_53:
> - return r8168ep_check_dash(tp) ? RTL_DASH_EP : RTL_DASH_NONE;
> + return RTL_DASH_EP;
> default:
> return RTL_DASH_NONE;
> }
> @@ -1453,7 +1466,7 @@ static void __rtl8169_set_wol(struct rtl8169_private *tp, u32 wolopts)
>
> device_set_wakeup_enable(tp_to_dev(tp), wolopts);
>
> - if (tp->dash_type == RTL_DASH_NONE) {
> + if (!tp->dash_enabled) {
> rtl_set_d3_pll_down(tp, !wolopts);
> tp->dev->wol_enabled = wolopts ? 1 : 0;
> }
> @@ -2512,7 +2525,7 @@ static void rtl_wol_enable_rx(struct rtl8169_private *tp)
>
> static void rtl_prepare_power_down(struct rtl8169_private *tp)
> {
> - if (tp->dash_type != RTL_DASH_NONE)
> + if (tp->dash_enabled)
> return;
>
> if (tp->mac_version == RTL_GIGA_MAC_VER_32 ||
> @@ -4869,7 +4882,7 @@ static int rtl8169_runtime_idle(struct device *device)
> {
> struct rtl8169_private *tp = dev_get_drvdata(device);
>
> - if (tp->dash_type != RTL_DASH_NONE)
> + if (tp->dash_enabled)
> return -EBUSY;
>
> if (!netif_running(tp->dev) || !netif_carrier_ok(tp->dev))
> @@ -4896,7 +4909,7 @@ static void rtl_shutdown(struct pci_dev *pdev)
> rtl_rar_set(tp, tp->dev->perm_addr);
>
> if (system_state == SYSTEM_POWER_OFF &&
> - tp->dash_type == RTL_DASH_NONE) {
> + !tp->dash_enabled) {
Since you have to repost, please maintain the correct indentation
above:
if (system_state == SYSTEM_POWER_OFF &&
!tp->dash_enabled) {
^^^^
spaces here.
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH v4] net: mvneta: fix calls to page_pool_get_stats
From: Lorenzo Bianconi @ 2023-11-09 11:31 UTC (permalink / raw)
To: Sven Auhagen
Cc: netdev@vger.kernel.org, thomas.petazzoni@bootlin.com,
brouer@redhat.com, ilias.apalodimas@linaro.org,
mcroce@microsoft.com, leon@kernel.org, kuba@kernel.org
In-Reply-To: <4fxnidhi7gfpzmeels363loksphtifgsan6w64n5y7dxzi7dyx@jwbe4gp37mwy>
[-- Attachment #1: Type: text/plain, Size: 1606 bytes --]
[...]
> >
> > do we need to check pp->is_stopped here? (we already check if page_pool
> > pointer is NULL in mvneta_ethtool_pp_stats).
> > Moreover in mvneta_ethtool_get_sset_count() and in mvneta_ethtool_get_strings()
> > we just check pp->bm_priv pointer. Are the stats disaligned in this case?
>
> Hi Lorenzo,
>
> so the buffer manager (bm) does not support the page pool.
> If this mode is used we can skip any page pool references.
>
> The question is do we end up with a race condition when we skip the is_stopped check
> as the variable is set to true just before the page pools are
> deallocated on suspend or interface stop calls.
Do we really have a race here? e.g. both ndo_stop and ethtool_get_stats() run under rtnl.
Moreover it seems is_stopped is not set for armada-3700 devices (e.g. my
espressobin). Do we have the issue for this kind of devices?
Regards,
Lorenzo
>
> Best
> Sven
>
> >
> > > + mvneta_ethtool_pp_stats(pp, data);
> > > }
> > >
> > > static int mvneta_ethtool_get_sset_count(struct net_device *dev, int sset)
> > > {
> > > - if (sset == ETH_SS_STATS)
> > > - return ARRAY_SIZE(mvneta_statistics) +
> > > - page_pool_ethtool_stats_get_count();
> > > + if (sset == ETH_SS_STATS) {
> > > + int count = ARRAY_SIZE(mvneta_statistics);
> > > + struct mvneta_port *pp = netdev_priv(dev);
> > > +
> > > + if (!pp->bm_priv)
> > > + count += page_pool_ethtool_stats_get_count();
> > > +
> > > + return count;
> > > + }
> > >
> > > return -EOPNOTSUPP;
> > > }
> > > --
> > > 2.42.0
> > >
>
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH net] virtio_net: fix missing dma unmap for resize
From: Paolo Abeni @ 2023-11-09 11:26 UTC (permalink / raw)
To: Xuan Zhuo, netdev
Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
Jakub Kicinski, virtualization
In-Reply-To: <20231106081832.668-1-xuanzhuo@linux.alibaba.com>
On Mon, 2023-11-06 at 16:18 +0800, Xuan Zhuo wrote:
> For rq, we have three cases getting buffers from virtio core:
>
> 1. virtqueue_get_buf{,_ctx}
> 2. virtqueue_detach_unused_buf
> 3. callback for virtqueue_resize
>
> But in commit 295525e29a5b("virtio_net: merge dma operations when
> filling mergeable buffers"), I missed the dma unmap for the #3 case.
>
> That will leak some memory, because I did not release the pages referred
> by the unused buffers.
>
> If we do such script, we will make the system OOM.
>
> while true
> do
> ethtool -G ens4 rx 128
> ethtool -G ens4 rx 256
> free -m
> done
>
> Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
@Micheal, @Jason: this fix LGTM, but I guess it deserve an explicit ack
from you before merging, thanks!
Paolo
^ permalink raw reply
* Re: [PATCH] Documentation: Document the Netlink spec
From: Donald Hunter @ 2023-11-09 11:22 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: Breno Leitao, linux-doc, netdev, kuba, pabeni, edumazet
In-Reply-To: <875y2cxa6n.fsf@meer.lwn.net>
Jonathan Corbet <corbet@lwn.net> writes:
> Breno Leitao <leitao@debian.org> writes:
>
>> This is a Sphinx extension that parses the Netlink YAML spec files
>> (Documentation/netlink/specs/), and generates a rst file to be
>> displayed into Documentation pages.
>>
>> Create a new Documentation/networking/netlink_spec page, and a sub-page
>> for each Netlink spec that needs to be documented, such as ethtool,
>> devlink, netdev, etc.
>>
>> Create a Sphinx directive extension that reads the YAML spec
>> (located under Documentation/netlink/specs), parses it and returns a RST
>> string that is inserted where the Sphinx directive was called.
>
> So I finally had a chance to look a bit at this; I have a few
> impressions.
>
> First of all, if you put something silly into one of the YAML files, it
> kills the whole docs build, which is ... not desirable:
>
>> Exception occurred:
>> File "/usr/lib64/python3.11/site-packages/yaml/scanner.py", line 577, in fetch_value
>> raise ScannerError(None, None,
>> yaml.scanner.ScannerError: mapping values are not allowed here
>> in "/stuff/k/git/kernel/Documentation/netlink/specs/ovs_datapath.yaml", line 14, column 9
>>
>
> That error needs to be caught and handled in some more graceful way.
>
> I do have to wonder, though, whether a sphinx extension is the right way
> to solve this problem. You're essentially implementing a filter that
> turns one YAML file into one RST file; might it be better to keep that
> outside of sphinx as a standalone script, invoked by the Makefile?
>
> Note that I'm asking because I wonder, I'm not saying I would block an
> extension-based implementation.
+1 to this. The .rst generation can then be easily tested independently
of the doc build and the stub files could be avoided.
Just a note that last year you offered the opposite guidance:
https://lore.kernel.org/linux-doc/87tu4zsfse.fsf@meer.lwn.net/
If the preference now is for standalone scripts invoked by the Makefile
then this previous patch might be useful:
https://lore.kernel.org/linux-doc/20220922115257.99815-2-donald.hunter@gmail.com/
It would be good to document the preferred approach to this kind of doc
extension and I'd be happy to contribute an 'Extensions' section for
contributing.rst in the doc-guide.
> Thanks,
>
> jon
^ permalink raw reply
* Re: [PATCH net-next v2 14/21] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
From: Xuan Zhuo @ 2023-11-09 11:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109061056-mutt-send-email-mst@kernel.org>
On Thu, 9 Nov 2023 06:11:49 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:20AM +0800, Xuan Zhuo wrote:
> > virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
> > buffer) by the last bits of the pointer.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/net/virtio/virtio_net.h | 18 ++++++++++++++++--
> > drivers/net/virtio/xsk.h | 5 +++++
> > 2 files changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index a431a2c1ee47..a13d6d301fdb 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -225,6 +225,11 @@ struct virtnet_info {
> > struct failover *failover;
> > };
> >
> > +static inline bool virtnet_is_skb_ptr(void *ptr)
> > +{
> > + return !((unsigned long)ptr & VIRTIO_XMIT_DATA_MASK);
> > +}
> > +
> > static inline bool virtnet_is_xdp_frame(void *ptr)
> > {
> > return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> > @@ -235,6 +240,8 @@ static inline struct xdp_frame *virtnet_ptr_to_xdp(void *ptr)
> > return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> > }
> >
> > +static inline u32 virtnet_ptr_to_xsk(void *ptr);
> > +
>
> I don't understand why you need this here.
The below function virtnet_free_old_xmit needs this.
Thanks.
>
>
> > static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > {
> > struct virtnet_sq_dma *next, *head;
> > @@ -261,11 +268,12 @@ static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > u64 *bytes, u64 *packets)
> > {
> > + unsigned int xsknum = 0;
> > unsigned int len;
> > void *ptr;
> >
> > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > - if (!virtnet_is_xdp_frame(ptr)) {
> > + if (virtnet_is_skb_ptr(ptr)) {
> > struct sk_buff *skb;
> >
> > if (sq->do_dma)
> > @@ -277,7 +285,7 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> >
> > *bytes += skb->len;
> > napi_consume_skb(skb, in_napi);
> > - } else {
> > + } else if (virtnet_is_xdp_frame(ptr)) {
> > struct xdp_frame *frame;
> >
> > if (sq->do_dma)
> > @@ -287,9 +295,15 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> >
> > *bytes += xdp_get_frame_len(frame);
> > xdp_return_frame(frame);
> > + } else {
> > + *bytes += virtnet_ptr_to_xsk(ptr);
> > + ++xsknum;
> > }
> > (*packets)++;
> > }
> > +
> > + if (xsknum)
> > + xsk_tx_completed(sq->xsk.pool, xsknum);
> > }
> >
> > static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > index 1bd19dcda649..7ebc9bda7aee 100644
> > --- a/drivers/net/virtio/xsk.h
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -14,6 +14,11 @@ static inline void *virtnet_xsk_to_ptr(u32 len)
> > return (void *)(p | VIRTIO_XSK_FLAG);
> > }
> >
> > +static inline u32 virtnet_ptr_to_xsk(void *ptr)
> > +{
> > + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> > +}
> > +
> > int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > int budget);
> > --
> > 2.32.0.3.g01195cf9f
>
>
^ permalink raw reply
* Re: [PATCH net-next v2 16/21] virtio_net: xsk: rx: introduce add_recvbuf_xsk()
From: Xuan Zhuo @ 2023-11-09 11:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109031003-mutt-send-email-mst@kernel.org>
On Thu, 9 Nov 2023 03:12:27 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:22AM +0800, Xuan Zhuo wrote:
> > Implement the logic of filling rq with XSK buffers.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/net/virtio/main.c | 4 ++-
> > drivers/net/virtio/virtio_net.h | 5 ++++
> > drivers/net/virtio/xsk.c | 49 ++++++++++++++++++++++++++++++++-
> > drivers/net/virtio/xsk.h | 2 ++
> > 4 files changed, 58 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > index 6210a6e37396..15943a22e17d 100644
> > --- a/drivers/net/virtio/main.c
> > +++ b/drivers/net/virtio/main.c
> > @@ -1798,7 +1798,9 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
> > bool oom;
> >
> > do {
> > - if (vi->mergeable_rx_bufs)
> > + if (rq->xsk.pool)
> > + err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk.pool, gfp);
> > + else if (vi->mergeable_rx_bufs)
> > err = add_recvbuf_mergeable(vi, rq, gfp);
> > else if (vi->big_packets)
> > err = add_recvbuf_big(vi, rq, gfp);
>
> I'm not sure I understand. How does this handle mergeable flag still being set?
You has the same question as Jason.
So I think maybe I should put the handle into the
add_recvbuf_mergeable and add_recvbuf_small.
Let me think about this.
>
>
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index a13d6d301fdb..1242785e311e 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -140,6 +140,11 @@ struct virtnet_rq {
> >
> > /* xdp rxq used by xsk */
> > struct xdp_rxq_info xdp_rxq;
> > +
> > + struct xdp_buff **xsk_buffs;
> > + u32 nxt_idx;
> > + u32 num;
> > + u32 size;
> > } xsk;
> > };
> >
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index ea5804ddd44e..e737c3353212 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -38,6 +38,41 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> > netif_stop_subqueue(dev, qnum);
> > }
> >
> > +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> > + struct xsk_buff_pool *pool, gfp_t gfp)
> > +{
> > + struct xdp_buff **xsk_buffs;
> > + dma_addr_t addr;
> > + u32 len, i;
> > + int err = 0;
> > +
> > + xsk_buffs = rq->xsk.xsk_buffs;
> > +
> > + if (rq->xsk.nxt_idx >= rq->xsk.num) {
> > + rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->xsk.size);
> > + if (!rq->xsk.num)
> > + return -ENOMEM;
> > + rq->xsk.nxt_idx = 0;
> > + }
>
> Another manually rolled linked list implementation.
> Please, don't.
The array is for speedup.
xsk_buff_alloc_batch will return many xsk_buff that will be more efficient than
the xsk_buff_alloc.
Thanks.
>
>
> > +
> > + i = rq->xsk.nxt_idx;
> > +
> > + /* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space */
> > + addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len;
> > + len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> > +
> > + sg_init_table(rq->sg, 1);
> > + sg_fill_dma(rq->sg, addr, len);
> > +
> > + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> > + if (err)
> > + return err;
> > +
> > + rq->xsk.nxt_idx++;
> > +
> > + return 0;
> > +}
> > +
> > static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
> > struct xsk_buff_pool *pool,
> > struct xdp_desc *desc)
> > @@ -213,7 +248,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
> > struct virtnet_sq *sq;
> > struct device *dma_dev;
> > dma_addr_t hdr_dma;
> > - int err;
> > + int err, size;
> >
> > /* In big_packets mode, xdp cannot work, so there is no need to
> > * initialize xsk of rq.
> > @@ -249,6 +284,16 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
> > if (!dma_dev)
> > return -EPERM;
> >
> > + size = virtqueue_get_vring_size(rq->vq);
> > +
> > + rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), GFP_KERNEL);
> > + if (!rq->xsk.xsk_buffs)
> > + return -ENOMEM;
> > +
> > + rq->xsk.size = size;
> > + rq->xsk.nxt_idx = 0;
> > + rq->xsk.num = 0;
> > +
> > hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE);
> > if (dma_mapping_error(dma_dev, hdr_dma))
> > return -ENOMEM;
> > @@ -307,6 +352,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> >
> > dma_unmap_single(dma_dev, sq->xsk.hdr_dma_address, vi->hdr_len, DMA_TO_DEVICE);
> >
> > + kfree(rq->xsk.xsk_buffs);
> > +
> > return err1 | err2;
> > }
> >
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > index 7ebc9bda7aee..bef41a3f954e 100644
> > --- a/drivers/net/virtio/xsk.h
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -23,4 +23,6 @@ int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > int budget);
> > int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> > +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> > + struct xsk_buff_pool *pool, gfp_t gfp);
> > #endif
> > --
> > 2.32.0.3.g01195cf9f
>
>
^ permalink raw reply
* Re: [PATCH net-next v2 14/21] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
From: Michael S. Tsirkin @ 2023-11-09 11:11 UTC (permalink / raw)
To: Xuan Zhuo
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231107031227.100015-15-xuanzhuo@linux.alibaba.com>
On Tue, Nov 07, 2023 at 11:12:20AM +0800, Xuan Zhuo wrote:
> virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
> buffer) by the last bits of the pointer.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> drivers/net/virtio/virtio_net.h | 18 ++++++++++++++++--
> drivers/net/virtio/xsk.h | 5 +++++
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> index a431a2c1ee47..a13d6d301fdb 100644
> --- a/drivers/net/virtio/virtio_net.h
> +++ b/drivers/net/virtio/virtio_net.h
> @@ -225,6 +225,11 @@ struct virtnet_info {
> struct failover *failover;
> };
>
> +static inline bool virtnet_is_skb_ptr(void *ptr)
> +{
> + return !((unsigned long)ptr & VIRTIO_XMIT_DATA_MASK);
> +}
> +
> static inline bool virtnet_is_xdp_frame(void *ptr)
> {
> return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> @@ -235,6 +240,8 @@ static inline struct xdp_frame *virtnet_ptr_to_xdp(void *ptr)
> return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> }
>
> +static inline u32 virtnet_ptr_to_xsk(void *ptr);
> +
I don't understand why you need this here.
> static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> {
> struct virtnet_sq_dma *next, *head;
> @@ -261,11 +268,12 @@ static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> u64 *bytes, u64 *packets)
> {
> + unsigned int xsknum = 0;
> unsigned int len;
> void *ptr;
>
> while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - if (!virtnet_is_xdp_frame(ptr)) {
> + if (virtnet_is_skb_ptr(ptr)) {
> struct sk_buff *skb;
>
> if (sq->do_dma)
> @@ -277,7 +285,7 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
>
> *bytes += skb->len;
> napi_consume_skb(skb, in_napi);
> - } else {
> + } else if (virtnet_is_xdp_frame(ptr)) {
> struct xdp_frame *frame;
>
> if (sq->do_dma)
> @@ -287,9 +295,15 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
>
> *bytes += xdp_get_frame_len(frame);
> xdp_return_frame(frame);
> + } else {
> + *bytes += virtnet_ptr_to_xsk(ptr);
> + ++xsknum;
> }
> (*packets)++;
> }
> +
> + if (xsknum)
> + xsk_tx_completed(sq->xsk.pool, xsknum);
> }
>
> static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> index 1bd19dcda649..7ebc9bda7aee 100644
> --- a/drivers/net/virtio/xsk.h
> +++ b/drivers/net/virtio/xsk.h
> @@ -14,6 +14,11 @@ static inline void *virtnet_xsk_to_ptr(u32 len)
> return (void *)(p | VIRTIO_XSK_FLAG);
> }
>
> +static inline u32 virtnet_ptr_to_xsk(void *ptr)
> +{
> + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> +}
> +
> int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> int budget);
> --
> 2.32.0.3.g01195cf9f
^ permalink raw reply
* Re: [PATCH v2 net 4/7] net/sched: taprio: get corrected value of cycle_time and interval
From: Vladimir Oltean @ 2023-11-09 11:11 UTC (permalink / raw)
To: Faizal Rahim, Vinicius Costa Gomes
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20231107112023.676016-5-faizal.abdul.rahim@linux.intel.com>
On Tue, Nov 07, 2023 at 06:20:20AM -0500, Faizal Rahim wrote:
> Retrieve adjusted cycle_time and interval values through new APIs.
> Note that in some cases where the original values are required,
> such as in dump_schedule() and setup_first_end_time(), direct calls
> to cycle_time and interval are retained without using the new APIs.
>
> Added a new field, correction_active, in the sched_entry struct to
> determine the entry's correction state. This field is required due
> to specific flow like find_entry_to_transmit() -> get_interval_end_time()
> which retrieves the interval for each entry. During positive cycle
> time correction, it's known that the last entry interval requires
> correction. However, for negative correction, the affected entry
> is unknown, which is why this new field is necessary.
I agree with the motivation, but I'm not sure if the chosen solution is
correct.
static u32 get_interval(const struct sched_entry *entry,
const struct sched_gate_list *oper)
{
if (entry->correction_active)
return entry->interval + oper->cycle_time_correction;
return entry->interval;
}
What if the schedule looks like this:
sched-entry S 0x01 125000000
sched-entry S 0x02 125000000
sched-entry S 0x04 125000000
sched-entry S 0x08 125000000
sched-entry S 0x10 125000000
sched-entry S 0x20 125000000
sched-entry S 0x40 125000000
sched-entry S 0x80 125000000
and the calculated cycle_time_correction is -200000000? That would
eliminate the entire last sched-entry (0x80), and the previous one
(0x40) would run for just 75000000 ns. But your calculation would say
that its interval is −75000000 ns (actually reported as an u32 positive
integer, so it would be a completely bogus value).
So not only is the affected entry unknown, but also the amount of cycle
time correction that applies to it is unknown.
I'm looking at where we need get_interval(), and it's from:
taprio_enqueue_one()
-> is_valid_interval()
-> find_entry_to_transmit()
-> get_interval_end_time()
-> get_packet_txtime()
-> find_entry_to_transmit()
I admit it's a part of taprio which I don't understand too well. Why do
we perform such complex calculations in get_interval_end_time() when we
should have struct sched_entry :: end_time precomputed and available for
this purpose (although it was primarily inteded for advance_sched() and
not for enqueue())?
Vinicius, do you know?
^ permalink raw reply
* Re: [PATCH net-next v2 17/21] virtio_net: xsk: rx: skip dma unmap when rq is bind with AF_XDP
From: Xuan Zhuo @ 2023-11-09 11:10 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109031347-mutt-send-email-mst@kernel.org>
On Thu, 9 Nov 2023 03:15:03 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:23AM +0800, Xuan Zhuo wrote:
> > When rq is bound with AF_XDP, the buffer dma is managed
> > by the AF_XDP APIs. So the buffer got from the virtio core should
> > skip the dma unmap operation.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>
>
> I don't get it - is this like a bugfix?
I want focus on this. So let it as an independent commit.
> And why do we need our own flag and checks?
> Doesn't virtio core DTRT?
struct vring_virtqueue {
[....]
/* Do DMA mapping by driver */
bool premapped;
We can not.
So I add own flag.
Thanks.
>
> > ---
> > drivers/net/virtio/main.c | 8 +++++---
> > drivers/net/virtio/virtio_net.h | 3 +++
> > drivers/net/virtio/xsk.c | 1 +
> > 3 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > index 15943a22e17d..a318b2533b94 100644
> > --- a/drivers/net/virtio/main.c
> > +++ b/drivers/net/virtio/main.c
> > @@ -430,7 +430,7 @@ static void *virtnet_rq_get_buf(struct virtnet_rq *rq, u32 *len, void **ctx)
> > void *buf;
> >
> > buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
> > - if (buf && rq->do_dma)
> > + if (buf && rq->do_dma_unmap)
> > virtnet_rq_unmap(rq, buf, *len);
> >
> > return buf;
> > @@ -561,8 +561,10 @@ static void virtnet_set_premapped(struct virtnet_info *vi)
> >
> > /* disable for big mode */
> > if (vi->mergeable_rx_bufs || !vi->big_packets) {
> > - if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
> > + if (!virtqueue_set_dma_premapped(vi->rq[i].vq)) {
> > vi->rq[i].do_dma = true;
> > + vi->rq[i].do_dma_unmap = true;
> > + }
> > }
> > }
> > }
> > @@ -3944,7 +3946,7 @@ void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
> >
> > rq = &vi->rq[i];
> >
> > - if (rq->do_dma)
> > + if (rq->do_dma_unmap)
> > virtnet_rq_unmap(rq, buf, 0);
> >
> > virtnet_rq_free_buf(vi, rq, buf);
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index 1242785e311e..2005d0cd22e2 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -135,6 +135,9 @@ struct virtnet_rq {
> > /* Do dma by self */
> > bool do_dma;
> >
> > + /* Do dma unmap after getting buf from virtio core. */
> > + bool do_dma_unmap;
> > +
> > struct {
> > struct xsk_buff_pool *pool;
> >
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index e737c3353212..b09c473c29fb 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -210,6 +210,7 @@ static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *
> > xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
> >
> > rq->xsk.pool = pool;
> > + rq->do_dma_unmap = !pool;
> >
> > virtnet_rx_resume(vi, rq);
> >
> > --
> > 2.32.0.3.g01195cf9f
>
>
^ permalink raw reply
* Re: [RFC PATCH v3 02/12] net: page_pool: create hooks for custom page providers
From: Paolo Abeni @ 2023-11-09 11:09 UTC (permalink / raw)
To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi
In-Reply-To: <20231106024413.2801438-3-almasrymina@google.com>
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 6fc5134095ed..d4bea053bb7e 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -60,6 +60,8 @@ struct page_pool_params {
> int nid;
> struct device *dev;
> struct napi_struct *napi;
> + u8 memory_provider;
> + void *mp_priv;
Minor nit: swapping the above 2 fields should make the struct smaller.
> enum dma_data_direction dma_dir;
> unsigned int max_len;
> unsigned int offset;
> @@ -118,6 +120,19 @@ struct page_pool_stats {
> };
> #endif
>
> +struct mem_provider;
> +
> +enum pp_memory_provider_type {
> + __PP_MP_NONE, /* Use system allocator directly */
> +};
> +
> +struct pp_memory_provider_ops {
> + int (*init)(struct page_pool *pool);
> + void (*destroy)(struct page_pool *pool);
> + struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> + bool (*release_page)(struct page_pool *pool, struct page *page);
> +};
> +
> struct page_pool {
> struct page_pool_params p;
>
> @@ -165,6 +180,9 @@ struct page_pool {
> */
> struct ptr_ring ring;
>
> + const struct pp_memory_provider_ops *mp_ops;
> + void *mp_priv;
Why the mp_ops are not part of page_pool_params? why mp_priv is
duplicated here?
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH net-next v2 12/21] virtio_net: xsk: tx: support tx
From: Xuan Zhuo @ 2023-11-09 11:06 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109030424-mutt-send-email-mst@kernel.org>
On Thu, 9 Nov 2023 03:09:00 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:18AM +0800, Xuan Zhuo wrote:
> > The driver's tx napi is very important for XSK. It is responsible for
> > obtaining data from the XSK queue and sending it out.
> >
> > At the beginning, we need to trigger tx napi.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/net/virtio/main.c | 12 +++-
> > drivers/net/virtio/virtio_net.h | 3 +-
> > drivers/net/virtio/xsk.c | 110 ++++++++++++++++++++++++++++++++
> > drivers/net/virtio/xsk.h | 13 ++++
> > 4 files changed, 136 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > index 6c608b3ce27d..ff6bc764089d 100644
> > --- a/drivers/net/virtio/main.c
> > +++ b/drivers/net/virtio/main.c
> > @@ -2074,6 +2074,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > struct virtnet_info *vi = sq->vq->vdev->priv;
> > unsigned int index = vq2txq(sq->vq);
> > struct netdev_queue *txq;
> > + int busy = 0;
> > int opaque;
> > bool done;
> >
> > @@ -2086,11 +2087,20 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > txq = netdev_get_tx_queue(vi->dev, index);
> > __netif_tx_lock(txq, raw_smp_processor_id());
> > virtqueue_disable_cb(sq->vq);
> > - free_old_xmit(sq, true);
> > +
> > + if (sq->xsk.pool)
> > + busy |= virtnet_xsk_xmit(sq, sq->xsk.pool, budget);
>
> You use bitwise or on errno values? What's going on here?
virtnet_xsk_xmit() return that it is busy or not. Not the errno.
Here just record whether this handler is busy or not.
>
>
> > + else
> > + free_old_xmit(sq, true);
> >
> > if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> > netif_tx_wake_queue(txq);
> >
> > + if (busy) {
> > + __netif_tx_unlock(txq);
> > + return budget;
> > + }
> > +
> > opaque = virtqueue_enable_cb_prepare(sq->vq);
> >
> > done = napi_complete_done(napi, 0);
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index 442af4673bf8..1c21af47e13c 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -9,7 +9,8 @@
> > #include <net/xdp_sock_drv.h>
> >
> > #define VIRTIO_XDP_FLAG BIT(0)
> > -#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
> > +#define VIRTIO_XSK_FLAG BIT(1)
> > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)
> >
> > /* RX packet size EWMA. The average packet size is used to determine the packet
> > * buffer size when refilling RX rings. As the entire RX ring may be refilled
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index 8b397787603f..caa448308232 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -4,9 +4,119 @@
> > */
> >
> > #include "virtio_net.h"
> > +#include "xsk.h"
> >
> > static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> >
> > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > +{
> > + sg->dma_address = addr;
> > + sg->length = len;
> > +}
> > +
> > +static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> > +{
> > + struct virtnet_info *vi = sq->vq->vdev->priv;
> > + struct net_device *dev = vi->dev;
> > + int qnum = sq - vi->sq;
> > +
> > + /* If it is a raw buffer queue, it does not check whether the status
> > + * of the queue is stopped when sending. So there is no need to check
> > + * the situation of the raw buffer queue.
> > + */
> > + if (virtnet_is_xdp_raw_buffer_queue(vi, qnum))
> > + return;
> > +
> > + /* If this sq is not the exclusive queue of the current cpu,
> > + * then it may be called by start_xmit, so check it running out
> > + * of space.
> > + *
> > + * Stop the queue to avoid getting packets that we are
> > + * then unable to transmit. Then wait the tx interrupt.
> > + */
> > + if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
>
> what does MAX_SKB_FRAGS have to do with it? And where's 2 coming from?
check_sq_full_and_disable()
Thanks.
>
> > + netif_stop_subqueue(dev, qnum);
> > +}
> > +
> > +static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
> > + struct xsk_buff_pool *pool,
> > + struct xdp_desc *desc)
> > +{
> > + struct virtnet_info *vi;
> > + dma_addr_t addr;
> > +
> > + vi = sq->vq->vdev->priv;
> > +
> > + addr = xsk_buff_raw_get_dma(pool, desc->addr);
> > + xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
> > +
> > + sg_init_table(sq->sg, 2);
> > +
> > + sg_fill_dma(sq->sg, sq->xsk.hdr_dma_address, vi->hdr_len);
> > + sg_fill_dma(sq->sg + 1, addr, desc->len);
> > +
> > + return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
> > + virtnet_xsk_to_ptr(desc->len), GFP_ATOMIC);
> > +}
> > +
> > +static int virtnet_xsk_xmit_batch(struct virtnet_sq *sq,
> > + struct xsk_buff_pool *pool,
> > + unsigned int budget,
> > + u64 *kicks)
> > +{
> > + struct xdp_desc *descs = pool->tx_descs;
> > + u32 nb_pkts, max_pkts, i;
> > + bool kick = false;
> > + int err;
> > +
> > + /* Every xsk tx packet needs two desc(virtnet header and packet). So we
> > + * use sq->vq->num_free / 2 as the limitation.
> > + */
> > + max_pkts = min_t(u32, budget, sq->vq->num_free / 2);
> > +
> > + nb_pkts = xsk_tx_peek_release_desc_batch(pool, max_pkts);
> > + if (!nb_pkts)
> > + return 0;
> > +
> > + for (i = 0; i < nb_pkts; i++) {
> > + err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
> > + if (unlikely(err))
> > + break;
> > +
> > + kick = true;
> > + }
> > +
> > + if (kick && virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> > + (*kicks)++;
> > +
> > + return i;
> > +}
> > +
> > +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > + int budget)
> > +{
> > + u64 bytes = 0, packets = 0, kicks = 0;
> > + int sent;
> > +
> > + virtnet_free_old_xmit(sq, true, &bytes, &packets);
> > +
> > + sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> > +
> > + virtnet_xsk_check_queue(sq);
> > +
> > + u64_stats_update_begin(&sq->stats.syncp);
> > + u64_stats_add(&sq->stats.packets, packets);
> > + u64_stats_add(&sq->stats.bytes, bytes);
> > + u64_stats_add(&sq->stats.kicks, kicks);
> > + u64_stats_add(&sq->stats.xdp_tx, sent);
> > + u64_stats_update_end(&sq->stats.syncp);
> > +
> > + if (xsk_uses_need_wakeup(pool))
> > + xsk_set_tx_need_wakeup(pool);
> > +
> > + return sent == budget;
> > +}
> > +
> > static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *rq,
> > struct xsk_buff_pool *pool)
> > {
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > index 1918285c310c..73ca8cd5308b 100644
> > --- a/drivers/net/virtio/xsk.h
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -3,5 +3,18 @@
> > #ifndef __XSK_H__
> > #define __XSK_H__
> >
> > +#define VIRTIO_XSK_FLAG_OFFSET 4
> > +
> > +static inline void *virtnet_xsk_to_ptr(u32 len)
> > +{
> > + unsigned long p;
> > +
> > + p = len << VIRTIO_XSK_FLAG_OFFSET;
> > +
> > + return (void *)(p | VIRTIO_XSK_FLAG);
> > +}
> > +
> > int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > + int budget);
> > #endif
> > --
> > 2.32.0.3.g01195cf9f
>
^ permalink raw reply
* Re: [PATCH net v3] net: ti: icssg-prueth: Add missing icss_iep_put to error path
From: Jan Kiszka @ 2023-11-09 11:08 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, MD Danish Anwar,
netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D),
Wojciech Drewek, Roger Quadros, Grygorii Strashko
In-Reply-To: <20231107183256.2d19981b@kernel.org>
On 08.11.23 03:32, Jakub Kicinski wrote:
> On Mon, 6 Nov 2023 12:47:42 +0100 Jan Kiszka wrote:
>> Analogously to prueth_remove, just also taking care for NULL'ing the
>> iep pointers.
>>
>> Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
>> Fixes: 443a2367ba3c ("net: ti: icssg-prueth: am65x SR2.0 add 10M full duplex support")
>
> Is there a reason you're not CCing authors of these changes?
> Please make sure you run get_maintainer on the patch, and CC
> folks appropriately.
I was only interacting (directly) with Danish in the past years on this
driver, and he also upstreamed it. So I assumed "ownership" moved on.
Adding both, Roger with updated email (where get_maintainer does not help).
Jan
--
Siemens AG, Technology
Linux Expert Center
^ permalink raw reply
* Re: [PATCH v4] net: mvneta: fix calls to page_pool_get_stats
From: Sven Auhagen @ 2023-11-09 11:06 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: netdev@vger.kernel.org, thomas.petazzoni@bootlin.com,
brouer@redhat.com, ilias.apalodimas@linaro.org,
mcroce@microsoft.com, leon@kernel.org, kuba@kernel.org
In-Reply-To: <ZUyOsB7p6j21e42c@lore-desk>
On Thu, Nov 09, 2023 at 08:48:00AM +0100, Lorenzo Bianconi wrote:
> > Calling page_pool_get_stats in the mvneta driver without checks
> > leads to kernel crashes.
> > First the page pool is only available if the bm is not used.
> > The page pool is also not allocated when the port is stopped.
> > It can also be not allocated in case of errors.
> >
> > The current implementation leads to the following crash calling
> > ethstats on a port that is down or when calling it at the wrong moment:
> >
> > ble to handle kernel NULL pointer dereference at virtual address 00000070
> > [00000070] *pgd=00000000
> > Internal error: Oops: 5 [#1] SMP ARM
> > Hardware name: Marvell Armada 380/385 (Device Tree)
> > PC is at page_pool_get_stats+0x18/0x1cc
> > LR is at mvneta_ethtool_get_stats+0xa0/0xe0 [mvneta]
> > pc : [<c0b413cc>] lr : [<bf0a98d8>] psr: a0000013
> > sp : f1439d48 ip : f1439dc0 fp : 0000001d
> > r10: 00000100 r9 : c4816b80 r8 : f0d75150
> > r7 : bf0b400c r6 : c238f000 r5 : 00000000 r4 : f1439d68
> > r3 : c2091040 r2 : ffffffd8 r1 : f1439d68 r0 : 00000000
> > Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> > Control: 10c5387d Table: 066b004a DAC: 00000051
> > Register r0 information: NULL pointer
> > Register r1 information: 2-page vmalloc region starting at 0xf1438000 allocated at kernel_clone+0x9c/0x390
> > Register r2 information: non-paged memory
> > Register r3 information: slab kmalloc-2k start c2091000 pointer offset 64 size 2048
> > Register r4 information: 2-page vmalloc region starting at 0xf1438000 allocated at kernel_clone+0x9c/0x390
> > Register r5 information: NULL pointer
> > Register r6 information: slab kmalloc-cg-4k start c238f000 pointer offset 0 size 4096
> > Register r7 information: 15-page vmalloc region starting at 0xbf0a8000 allocated at load_module+0xa30/0x219c
> > Register r8 information: 1-page vmalloc region starting at 0xf0d75000 allocated at ethtool_get_stats+0x138/0x208
> > Register r9 information: slab task_struct start c4816b80 pointer offset 0
> > Register r10 information: non-paged memory
> > Register r11 information: non-paged memory
> > Register r12 information: 2-page vmalloc region starting at 0xf1438000 allocated at kernel_clone+0x9c/0x390
> > Process snmpd (pid: 733, stack limit = 0x38de3a88)
> > Stack: (0xf1439d48 to 0xf143a000)
> > 9d40: 000000c0 00000001 c238f000 bf0b400c f0d75150 c4816b80
> > 9d60: 00000100 bf0a98d8 00000000 00000000 00000000 00000000 00000000 00000000
> > 9d80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > 9da0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > 9dc0: 00000dc0 5335509c 00000035 c238f000 bf0b2214 01067f50 f0d75000 c0b9b9c8
> > 9de0: 0000001d 00000035 c2212094 5335509c c4816b80 c238f000 c5ad6e00 01067f50
> > 9e00: c1b0be80 c4816b80 00014813 c0b9d7f0 00000000 00000000 0000001d 0000001d
> > 9e20: 00000000 00001200 00000000 00000000 c216ed90 c73943b8 00000000 00000000
> > 9e40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > 9e60: 00000000 c0ad9034 00000000 00000000 00000000 00000000 00000000 00000000
> > 9e80: 00000000 00000000 00000000 5335509c c1b0be80 f1439ee4 00008946 c1b0be80
> > 9ea0: 01067f50 f1439ee3 00000000 00000046 b6d77ae0 c0b383f0 00008946 becc83e8
> > 9ec0: c1b0be80 00000051 0000000b c68ca480 c7172d00 c0ad8ff0 f1439ee3 cf600e40
> > 9ee0: 01600e40 32687465 00000000 00000000 00000000 01067f50 00000000 00000000
> > 9f00: 00000000 5335509c 00008946 00008946 00000000 c68ca480 becc83e8 c05e2de0
> > 9f20: f1439fb0 c03002f0 00000006 5ac3c35a c4816b80 00000006 b6d77ae0 c030caf0
> > 9f40: c4817350 00000014 f1439e1c 0000000c 00000000 00000051 01000000 00000014
> > 9f60: 00003fec f1439edc 00000001 c0372abc b6d77ae0 c0372abc cf600e40 5335509c
> > 9f80: c21e6800 01015c9c 0000000b 00008946 00000036 c03002f0 c4816b80 00000036
> > 9fa0: b6d77ae0 c03000c0 01015c9c 0000000b 0000000b 00008946 becc83e8 00000000
> > 9fc0: 01015c9c 0000000b 00008946 00000036 00000035 010678a0 b6d797ec b6d77ae0
> > 9fe0: b6dbf738 becc838c b6d186d7 b6baa858 40000030 0000000b 00000000 00000000
> > page_pool_get_stats from mvneta_ethtool_get_stats+0xa0/0xe0 [mvneta]
> > mvneta_ethtool_get_stats [mvneta] from ethtool_get_stats+0x154/0x208
> > ethtool_get_stats from dev_ethtool+0xf48/0x2480
> > dev_ethtool from dev_ioctl+0x538/0x63c
> > dev_ioctl from sock_ioctl+0x49c/0x53c
> > sock_ioctl from sys_ioctl+0x134/0xbd8
> > sys_ioctl from ret_fast_syscall+0x0/0x1c
> > Exception stack(0xf1439fa8 to 0xf1439ff0)
> > 9fa0: 01015c9c 0000000b 0000000b 00008946 becc83e8 00000000
> > 9fc0: 01015c9c 0000000b 00008946 00000036 00000035 010678a0 b6d797ec b6d77ae0
> > 9fe0: b6dbf738 becc838c b6d186d7 b6baa858
> > Code: e28dd004 e1a05000 e2514000 0a00006a (e5902070)
> >
> > This commit adds the proper checks before calling page_pool_get_stats.
> >
> > Fixes: b3fc79225f05 ("net: mvneta: add support for page_pool_get_stats")
> > Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
> > Reported-by: Paulo Da Silva <Paulo.DaSilva@kyberna.com>
> > ---
>
> Hi Sven,
>
> first of all thx for fixing it. Just minor comments inline.
>
> Regards,
> Lorenzo
>
> >
> > Change from v3:
> > * Move the page pool check back to mvneta
> >
> > Change from v2:
> > * Fix the fixes tag
> >
> > Change from v1:
> > * Add cover letter
> > * Move the page pool check in mvneta to the ethtool stats
> > function
> >
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index 8b0f12a0e0f2..bbb5d972657a 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -4734,13 +4734,16 @@ static void mvneta_ethtool_get_strings(struct net_device *netdev, u32 sset,
> > {
> > if (sset == ETH_SS_STATS) {
> > int i;
> > + struct mvneta_port *pp = netdev_priv(netdev);
>
> nit: reverse christmas tree here (just if you need to repost)
>
> >
> > for (i = 0; i < ARRAY_SIZE(mvneta_statistics); i++)
> > memcpy(data + i * ETH_GSTRING_LEN,
> > mvneta_statistics[i].name, ETH_GSTRING_LEN);
> >
> > - data += ETH_GSTRING_LEN * ARRAY_SIZE(mvneta_statistics);
> > - page_pool_ethtool_stats_get_strings(data);
> > + if (!pp->bm_priv) {
> > + data += ETH_GSTRING_LEN * ARRAY_SIZE(mvneta_statistics);
> > + page_pool_ethtool_stats_get_strings(data);
> > + }
> > }
> > }
> >
> > @@ -4858,8 +4861,10 @@ static void mvneta_ethtool_pp_stats(struct mvneta_port *pp, u64 *data)
> > struct page_pool_stats stats = {};
> > int i;
> >
> > - for (i = 0; i < rxq_number; i++)
> > - page_pool_get_stats(pp->rxqs[i].page_pool, &stats);
> > + for (i = 0; i < rxq_number; i++) {
> > + if (pp->rxqs[i].page_pool)
> > + page_pool_get_stats(pp->rxqs[i].page_pool, &stats);
> > + }
> >
> > page_pool_ethtool_stats_get(data, &stats);
> > }
> > @@ -4875,14 +4880,21 @@ static void mvneta_ethtool_get_stats(struct net_device *dev,
> > for (i = 0; i < ARRAY_SIZE(mvneta_statistics); i++)
> > *data++ = pp->ethtool_stats[i];
> >
> > - mvneta_ethtool_pp_stats(pp, data);
> > + if (!pp->bm_priv && !pp->is_stopped)
>
> do we need to check pp->is_stopped here? (we already check if page_pool
> pointer is NULL in mvneta_ethtool_pp_stats).
> Moreover in mvneta_ethtool_get_sset_count() and in mvneta_ethtool_get_strings()
> we just check pp->bm_priv pointer. Are the stats disaligned in this case?
Hi Lorenzo,
so the buffer manager (bm) does not support the page pool.
If this mode is used we can skip any page pool references.
The question is do we end up with a race condition when we skip the is_stopped check
as the variable is set to true just before the page pools are
deallocated on suspend or interface stop calls.
Best
Sven
>
> > + mvneta_ethtool_pp_stats(pp, data);
> > }
> >
> > static int mvneta_ethtool_get_sset_count(struct net_device *dev, int sset)
> > {
> > - if (sset == ETH_SS_STATS)
> > - return ARRAY_SIZE(mvneta_statistics) +
> > - page_pool_ethtool_stats_get_count();
> > + if (sset == ETH_SS_STATS) {
> > + int count = ARRAY_SIZE(mvneta_statistics);
> > + struct mvneta_port *pp = netdev_priv(dev);
> > +
> > + if (!pp->bm_priv)
> > + count += page_pool_ethtool_stats_get_count();
> > +
> > + return count;
> > + }
> >
> > return -EOPNOTSUPP;
> > }
> > --
> > 2.42.0
> >
^ permalink raw reply
* Re: [PATCH net-next v2 08/21] virtio_net: sq support premapped mode
From: Xuan Zhuo @ 2023-11-09 10:58 UTC (permalink / raw)
To: Jason Wang
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Michael S. Tsirkin, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
virtualization, bpf
In-Reply-To: <CACGkMEtLee8ELzqFnV_zOu3p5tU6hivouKM=WjtNAq+2wQzAFQ@mail.gmail.com>
On Thu, 9 Nov 2023 14:37:38 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Tue, Nov 7, 2023 at 11:12 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > If the xsk is enabling, the xsk tx will share the send queue.
> > But the xsk requires that the send queue use the premapped mode.
> > So the send queue must support premapped mode.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> > drivers/net/virtio/main.c | 163 ++++++++++++++++++++++++++++----
> > drivers/net/virtio/virtio_net.h | 16 ++++
> > 2 files changed, 163 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > index 16e75c08639e..f052db459156 100644
> > --- a/drivers/net/virtio/main.c
> > +++ b/drivers/net/virtio/main.c
> > @@ -46,6 +46,7 @@ module_param(napi_tx, bool, 0644);
> > #define VIRTIO_XDP_REDIR BIT(1)
> >
> > #define VIRTIO_XDP_FLAG BIT(0)
> > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
> >
> > #define VIRTNET_DRIVER_VERSION "1.0.0"
> >
> > @@ -167,6 +168,29 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
> > return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> > }
> >
> > +static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > +{
> > + struct virtnet_sq_dma *next, *head;
> > +
> > + head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK);
> > +
> > + data = head->data;
> > +
> > + while (head) {
> > + virtqueue_dma_unmap_single_attrs(sq->vq, head->addr, head->len,
> > + DMA_TO_DEVICE, 0);
> > +
> > + next = head->next;
> > +
> > + head->next = sq->dmainfo.free;
> > + sq->dmainfo.free = head;
> > +
> > + head = next;
> > + }
> > +
> > + return data;
> > +}
> > +
> > static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > u64 *bytes, u64 *packets)
> > {
> > @@ -175,14 +199,24 @@ static void __free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> >
> > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > if (!is_xdp_frame(ptr)) {
> > - struct sk_buff *skb = ptr;
> > + struct sk_buff *skb;
> > +
> > + if (sq->do_dma)
> > + ptr = virtnet_sq_unmap(sq, ptr);
> > +
> > + skb = ptr;
> >
> > pr_debug("Sent skb %p\n", skb);
> >
> > *bytes += skb->len;
> > napi_consume_skb(skb, in_napi);
> > } else {
> > - struct xdp_frame *frame = ptr_to_xdp(ptr);
> > + struct xdp_frame *frame;
> > +
> > + if (sq->do_dma)
> > + ptr = virtnet_sq_unmap(sq, ptr);
> > +
> > + frame = ptr_to_xdp(ptr);
> >
> > *bytes += xdp_get_frame_len(frame);
> > xdp_return_frame(frame);
> > @@ -567,22 +601,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq *rq, u32 size, gfp_t gfp)
> > return buf;
> > }
> >
> > -static void virtnet_rq_set_premapped(struct virtnet_info *vi)
> > +static int virtnet_sq_set_premapped(struct virtnet_sq *sq)
> > {
> > - int i;
> > + struct virtnet_sq_dma *d;
> > + int err, size, i;
> >
> > - /* disable for big mode */
> > - if (!vi->mergeable_rx_bufs && vi->big_packets)
> > - return;
> > + size = virtqueue_get_vring_size(sq->vq);
> > +
> > + size += MAX_SKB_FRAGS + 2;
>
> Btw, the dmainfo seems per sg? If I'm correct, how can vq_size +
> MAX_SKB_FRAGS + 2 work?
We may alloc dmainfo items when the vq is full. So I prepare more dmainfo items.
>
> > +
> > + sq->dmainfo.head = kcalloc(size, sizeof(*sq->dmainfo.head), GFP_KERNEL);
> > + if (!sq->dmainfo.head)
> > + return -ENOMEM;
> > +
> > + err = virtqueue_set_dma_premapped(sq->vq);
> > + if (err) {
> > + kfree(sq->dmainfo.head);
> > + return err;
> > + }
>
> Allocating after set_dma_premapped() seems easier.
Yes. But, we donot has the way to disable premapped mode.
That is my mistake. :)
>
> Btw, is there a benchmark of TX PPS just for this patch to demonstrate
> the impact of the performance?
We will have that.
>
> > +
> > + sq->dmainfo.free = NULL;
> > +
> > + sq->do_dma = true;
> > +
> > + for (i = 0; i < size; ++i) {
> > + d = &sq->dmainfo.head[i];
> > +
> > + d->next = sq->dmainfo.free;
> > + sq->dmainfo.free = d;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static void virtnet_set_premapped(struct virtnet_info *vi)
> > +{
> > + int i;
> >
> > for (i = 0; i < vi->max_queue_pairs; i++) {
> > - if (virtqueue_set_dma_premapped(vi->rq[i].vq))
> > - continue;
> > + virtnet_sq_set_premapped(&vi->sq[i]);
> >
> > - vi->rq[i].do_dma = true;
> > + /* disable for big mode */
> > + if (vi->mergeable_rx_bufs || !vi->big_packets) {
> > + if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
> > + vi->rq[i].do_dma = true;
>
> How about sticking a virtnet_rq_set_premapped() and calling it here?
>
> It seems more clean.
OK.
>
> Btw, the big mode support for pre mapping is still worthwhile
> regardless whether or not XDP is supported. It has a page pool so we
> can avoid redundant DMA map/unmap there.
Yes.
I post other patch set to do this.
>
> > + }
> > }
> > }
> >
> > +static struct virtnet_sq_dma *virtnet_sq_map_sg(struct virtnet_sq *sq, int nents, void *data)
> > +{
> > + struct virtnet_sq_dma *d, *head;
> > + struct scatterlist *sg;
> > + int i;
> > +
> > + head = NULL;
> > +
> > + for_each_sg(sq->sg, sg, nents, i) {
> > + sg->dma_address = virtqueue_dma_map_single_attrs(sq->vq, sg_virt(sg),
> > + sg->length,
> > + DMA_TO_DEVICE, 0);
> > + if (virtqueue_dma_mapping_error(sq->vq, sg->dma_address))
> > + goto err;
> > +
> > + d = sq->dmainfo.free;
> > + sq->dmainfo.free = d->next;
> > +
> > + d->addr = sg->dma_address;
> > + d->len = sg->length;
> > +
> > + d->next = head;
> > + head = d;
> > + }
> > +
> > + head->data = data;
> > +
> > + return (void *)((unsigned long)head | ((unsigned long)data & VIRTIO_XMIT_DATA_MASK));
>
> So head contains a pointer to data, any reason we still need to pack a
> data pointer here?
Maybe you are right. We can skip this.
>
>
> > +err:
> > + virtnet_sq_unmap(sq, head);
> > + return NULL;
> > +}
> > +
> > +static int virtnet_add_outbuf(struct virtnet_sq *sq, u32 num, void *data)
> > +{
> > + int ret;
> > +
> > + if (sq->do_dma) {
> > + data = virtnet_sq_map_sg(sq, num, data);
> > + if (!data)
> > + return -ENOMEM;
> > + }
> > +
> > + ret = virtqueue_add_outbuf(sq->vq, sq->sg, num, data, GFP_ATOMIC);
> > + if (ret && sq->do_dma)
> > + virtnet_sq_unmap(sq, data);
> > +
> > + return ret;
> > +}
> > +
> > static void free_old_xmit(struct virtnet_sq *sq, bool in_napi)
> > {
> > u64 bytes, packets = 0;
> > @@ -686,8 +802,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
> > skb_frag_size(frag), skb_frag_off(frag));
> > }
> >
> > - err = virtqueue_add_outbuf(sq->vq, sq->sg, nr_frags + 1,
> > - xdp_to_ptr(xdpf), GFP_ATOMIC);
> > + err = virtnet_add_outbuf(sq, nr_frags + 1, xdp_to_ptr(xdpf));
> > if (unlikely(err))
> > return -ENOSPC; /* Caller handle free/refcnt */
> >
> > @@ -2126,7 +2241,8 @@ static int xmit_skb(struct virtnet_sq *sq, struct sk_buff *skb)
> > return num_sg;
> > num_sg++;
> > }
> > - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> > +
> > + return virtnet_add_outbuf(sq, num_sg, skb);
> > }
> >
> > static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> > @@ -3818,6 +3934,8 @@ static void virtnet_free_queues(struct virtnet_info *vi)
> > for (i = 0; i < vi->max_queue_pairs; i++) {
> > __netif_napi_del(&vi->rq[i].napi);
> > __netif_napi_del(&vi->sq[i].napi);
> > +
> > + kfree(vi->sq[i].dmainfo.head);
> > }
> >
> > /* We called __netif_napi_del(),
> > @@ -3866,10 +3984,23 @@ static void free_receive_page_frags(struct virtnet_info *vi)
> >
> > static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
> > {
> > - if (!is_xdp_frame(buf))
> > + struct virtnet_info *vi = vq->vdev->priv;
> > + struct virtnet_sq *sq;
> > + int i = vq2rxq(vq);
> > +
> > + sq = &vi->sq[i];
> > +
> > + if (!is_xdp_frame(buf)) {
> > + if (sq->do_dma)
> > + buf = virtnet_sq_unmap(sq, buf);
> > +
> > dev_kfree_skb(buf);
> > - else
> > + } else {
> > + if (sq->do_dma)
> > + buf = virtnet_sq_unmap(sq, buf);
> > +
> > xdp_return_frame(ptr_to_xdp(buf));
> > + }
> > }
> >
> > static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
> > @@ -4075,7 +4206,7 @@ static int init_vqs(struct virtnet_info *vi)
> > if (ret)
> > goto err_free;
> >
> > - virtnet_rq_set_premapped(vi);
> > + virtnet_set_premapped(vi);
> >
> > cpus_read_lock();
> > virtnet_set_affinity(vi);
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index d814341d9f97..ce806afb6d64 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -48,6 +48,18 @@ struct virtnet_rq_dma {
> > u16 need_sync;
> > };
> >
> > +struct virtnet_sq_dma {
> > + struct virtnet_sq_dma *next;
> > + dma_addr_t addr;
> > + u32 len;
> > + void *data;
>
> I think we need to seek a way to reuse what has been stored by virtio
> core. It should be much more efficient.
Yes.
But that is for net-next branch.
Can we do that as a fix after that is merged to 6.8?
Thanks.
>
> Thanks
>
> > +};
> > +
> > +struct virtnet_sq_dma_head {
> > + struct virtnet_sq_dma *free;
> > + struct virtnet_sq_dma *head;
> > +};
> > +
> > /* Internal representation of a send virtqueue */
> > struct virtnet_sq {
> > /* Virtqueue associated with this virtnet_sq */
> > @@ -67,6 +79,10 @@ struct virtnet_sq {
> >
> > /* Record whether sq is in reset state. */
> > bool reset;
> > +
> > + bool do_dma;
> > +
> > + struct virtnet_sq_dma_head dmainfo;
> > };
> >
> > /* Internal representation of a receive virtqueue */
> > --
> > 2.32.0.3.g01195cf9f
> >
>
>
^ permalink raw reply
* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Paolo Abeni @ 2023-11-09 11:05 UTC (permalink / raw)
To: Willem de Bruijn, Stanislav Fomichev
Cc: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, David Ahern, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAF=yD-JZ88j+44MYgX-=oYJngz4Z0zw6Y0V3nHXisZJtNu7q6A@mail.gmail.com>
On Mon, 2023-11-06 at 14:55 -0800, Willem de Bruijn wrote:
> On Mon, Nov 6, 2023 at 2:34 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > IMHO, we need a better UAPI to receive the tokens and give them back to
> > > > > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > > > > but look dated and hacky :-(
> > > > >
> > > > > We should either do some kind of user/kernel shared memory queue to
> > > > > receive/return the tokens (similar to what Jonathan was doing in his
> > > > > proposal?)
> > > >
> > > > I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> > > > familiar but I wanted to respond :-) But is the suggestion here to
> > > > build a new kernel-user communication channel primitive for the
> > > > purpose of passing the information in the devmem cmsg? IMHO that seems
> > > > like an overkill. Why add 100-200 lines of code to the kernel to add
> > > > something that can already be done with existing primitives? I don't
> > > > see anything concretely wrong with cmsg & setsockopt approach, and if
> > > > we switch to something I'd prefer to switch to an existing primitive
> > > > for simplicity?
> > > >
> > > > The only other existing primitive to pass data outside of the linear
> > > > buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> > > > preferred? Any other suggestions or existing primitives I'm not aware
> > > > of?
> > > >
> > > > > or bite the bullet and switch to io_uring.
> > > > >
> > > >
> > > > IMO io_uring & socket support are orthogonal, and one doesn't preclude
> > > > the other. As you know we like to use sockets and I believe there are
> > > > issues with io_uring adoption at Google that I'm not familiar with
> > > > (and could be wrong). I'm interested in exploring io_uring support as
> > > > a follow up but I think David Wei will be interested in io_uring
> > > > support as well anyway.
> > >
> > > I also disagree that we need to replace a standard socket interface
> > > with something "faster", in quotes.
> > >
> > > This interface is not the bottleneck to the target workload.
> > >
> > > Replacing the synchronous sockets interface with something more
> > > performant for workloads where it is, is an orthogonal challenge.
> > > However we do that, I think that traditional sockets should continue
> > > to be supported.
> > >
> > > The feature may already even work with io_uring, as both recvmsg with
> > > cmsg and setsockopt have io_uring support now.
> >
> > I'm not really concerned with faster. I would prefer something cleaner :-)
> >
> > Or maybe we should just have it documented. With some kind of path
> > towards beautiful world where we can create dynamic queues..
>
> I suppose we just disagree on the elegance of the API.
>
> The concise notification API returns tokens as a range for
> compression, encoding as two 32-bit unsigned integers start + length.
> It allows for even further batching by returning multiple such ranges
> in a single call.
>
> This is analogous to the MSG_ZEROCOPY notification mechanism from
> kernel to user.
>
> The synchronous socket syscall interface can be replaced by something
> asynchronous like io_uring. This already works today? Whatever
> asynchronous ring-based API would be selected, io_uring or otherwise,
> I think the concise notification encoding would remain as is.
>
> Since this is an operation on a socket, I find a setsockopt the
> fitting interface.
FWIW, I think sockopt +cmsg is the right API. It would deserve some
explicit addition to the documentation, both in the kernel and in the
man-pages.
Cheers,
Paolo
^ permalink raw reply
* Re: [RFC PATCH v3 12/12] selftests: add ncdevmem, netcat for devmem TCP
From: Paolo Abeni @ 2023-11-09 11:03 UTC (permalink / raw)
To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Stanislav Fomichev
In-Reply-To: <20231106024413.2801438-13-almasrymina@google.com>
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
> @@ -91,6 +95,7 @@ TEST_PROGS += test_bridge_neigh_suppress.sh
> TEST_PROGS += test_vxlan_nolocalbypass.sh
> TEST_PROGS += test_bridge_backup_port.sh
> TEST_PROGS += fdb_flush.sh
> +TEST_GEN_FILES += ncdevmem
I guess we want something added to TEST_PROGS, too ;)
> TEST_FILES := settings
>
> diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c
> new file mode 100644
> index 000000000000..78bc3ad767ca
> --- /dev/null
> +++ b/tools/testing/selftests/net/ncdevmem.c
> @@ -0,0 +1,546 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#define __EXPORTED_HEADERS__
> +
> +#include <linux/uio.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <errno.h>
> +#define __iovec_defined
> +#include <fcntl.h>
> +#include <malloc.h>
> +
> +#include <arpa/inet.h>
> +#include <sys/socket.h>
> +#include <sys/mman.h>
> +#include <sys/ioctl.h>
> +#include <sys/syscall.h>
> +
> +#include <linux/memfd.h>
> +#include <linux/if.h>
> +#include <linux/dma-buf.h>
> +#include <linux/udmabuf.h>
> +#include <libmnl/libmnl.h>
> +#include <linux/types.h>
> +#include <linux/netlink.h>
> +#include <linux/genetlink.h>
> +#include <linux/netdev.h>
> +#include <time.h>
> +
> +#include "netdev-user.h"
> +#include <ynl.h>
> +
> +#define PAGE_SHIFT 12
> +#define TEST_PREFIX "ncdevmem"
> +#define NUM_PAGES 16000
> +
> +#ifndef MSG_SOCK_DEVMEM
> +#define MSG_SOCK_DEVMEM 0x2000000
> +#endif
> +
> +/*
> + * tcpdevmem netcat. Works similarly to netcat but does device memory TCP
> + * instead of regular TCP. Uses udmabuf to mock a dmabuf provider.
> + *
> + * Usage:
> + *
> + * * Without validation:
> + *
> + * On server:
> + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -l \
> + * -p 5201
> + *
> + * On client:
> + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5201
> + *
> + * * With Validation:
> + * On server:
> + * ncdevmem -s <server IP> -c <client IP> -l -f eth1 -n 0000:06:00.0 \
> + * -p 5202 -v 1
> + *
> + * On client:
> + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5202 \
> + * -v 100000
> + *
> + * Note this is compatible with regular netcat. i.e. the sender or receiver can
> + * be replaced with regular netcat to test the RX or TX path in isolation.
> + */
> +
> +static char *server_ip = "192.168.1.4";
> +static char *client_ip = "192.168.1.2";
> +static char *port = "5201";
> +static size_t do_validation;
> +static int queue_num = 15;
> +static char *ifname = "eth1";
> +static char *nic_pci_addr = "0000:06:00.0";
> +static unsigned int iterations;
> +
> +void print_bytes(void *ptr, size_t size)
> +{
> + unsigned char *p = ptr;
> + int i;
> +
> + for (i = 0; i < size; i++) {
> + printf("%02hhX ", p[i]);
> + }
> + printf("\n");
> +}
> +
> +void print_nonzero_bytes(void *ptr, size_t size)
> +{
> + unsigned char *p = ptr;
> + unsigned int i;
> +
> + for (i = 0; i < size; i++)
> + putchar(p[i]);
> + printf("\n");
> +}
> +
> +void validate_buffer(void *line, size_t size)
> +{
> + static unsigned char seed = 1;
> + unsigned char *ptr = line;
> + int errors = 0;
> + size_t i;
> +
> + for (i = 0; i < size; i++) {
> + if (ptr[i] != seed) {
> + fprintf(stderr,
> + "Failed validation: expected=%u, actual=%u, index=%lu\n",
> + seed, ptr[i], i);
> + errors++;
> + if (errors > 20)
> + exit(1);
> + }
> + seed++;
> + if (seed == do_validation)
> + seed = 0;
> + }
> +
> + fprintf(stdout, "Validated buffer\n");
> +}
> +
> +static void reset_flow_steering(void)
> +{
> + char command[256];
> +
> + memset(command, 0, sizeof(command));
> + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off",
> + "eth1");
> + system(command);
> +
> + memset(command, 0, sizeof(command));
> + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on",
> + "eth1");
> + system(command);
> +}
> +
> +static void configure_flow_steering(void)
> +{
> + char command[256];
> +
> + memset(command, 0, sizeof(command));
> + snprintf(command, sizeof(command),
> + "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s src-port %s dst-port %s queue %d",
> + ifname, client_ip, server_ip, port, port, queue_num);
> + system(command);
> +}
> +
> +/* Triggers a driver reset...
> + *
> + * The proper way to do this is probably 'ethtool --reset', but I don't have
> + * that supported on my current test bed. I resort to changing this
> + * configuration in the driver which also causes a driver reset...
> + */
> +static void trigger_device_reset(void)
> +{
> + char command[256];
> +
> + memset(command, 0, sizeof(command));
> + snprintf(command, sizeof(command),
> + "sudo ethtool --set-priv-flags %s enable-header-split off",
> + ifname);
> + system(command);
> +
> + memset(command, 0, sizeof(command));
> + snprintf(command, sizeof(command),
> + "sudo ethtool --set-priv-flags %s enable-header-split on",
> + ifname);
> + system(command);
> +}
> +
> +static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
> + __u32 *queue_idx, unsigned int n_queue_index,
> + struct ynl_sock **ys)
> +{
> + struct netdev_bind_rx_req *req = NULL;
> + struct ynl_error yerr;
> + int ret = 0;
> +
> + *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
> + if (!*ys) {
> + fprintf(stderr, "YNL: %s\n", yerr.msg);
> + return -1;
> + }
> +
> + if (ynl_subscribe(*ys, "mgmt"))
> + goto err_close;
> +
> + req = netdev_bind_rx_req_alloc();
> + netdev_bind_rx_req_set_ifindex(req, ifindex);
> + netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
> + __netdev_bind_rx_req_set_queues(req, queue_idx, n_queue_index);
> +
> + ret = netdev_bind_rx(*ys, req);
> + if (!ret) {
> + perror("netdev_bind_rx");
> + goto err_close;
> + }
> +
> + netdev_bind_rx_req_free(req);
> +
> + return 0;
> +
> +err_close:
> + fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg);
> + netdev_bind_rx_req_free(req);
> + ynl_sock_destroy(*ys);
> + return -1;
> +}
> +
> +static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size)
> +{
> + struct udmabuf_create create;
> + int ret;
> +
> + *devfd = open("/dev/udmabuf", O_RDWR);
> + if (*devfd < 0) {
> + fprintf(stderr,
> + "%s: [skip,no-udmabuf: Unable to access DMA "
> + "buffer device file]\n",
> + TEST_PREFIX);
> + exit(70);
> + }
> +
> + *memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
> + if (*memfd < 0) {
> + printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
> + exit(72);
> + }
> +
> + ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK);
> + if (ret < 0) {
> + printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
> + exit(73);
> + }
> +
> + ret = ftruncate(*memfd, dmabuf_size);
> + if (ret == -1) {
> + printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
> + exit(74);
> + }
> +
> + memset(&create, 0, sizeof(create));
> +
> + create.memfd = *memfd;
> + create.offset = 0;
> + create.size = dmabuf_size;
> + *buf = ioctl(*devfd, UDMABUF_CREATE, &create);
> + if (*buf < 0) {
> + printf("%s: [FAIL, create udmabuf]\n", TEST_PREFIX);
> + exit(75);
> + }
> +}
> +
> +int do_server(void)
> +{
> + char ctrl_data[sizeof(int) * 20000];
> + size_t non_page_aligned_frags = 0;
> + struct sockaddr_in client_addr;
> + struct sockaddr_in server_sin;
> + size_t page_aligned_frags = 0;
> + int devfd, memfd, buf, ret;
> + size_t total_received = 0;
> + bool is_devmem = false;
> + char *buf_mem = NULL;
> + struct ynl_sock *ys;
> + size_t dmabuf_size;
> + char iobuf[819200];
> + char buffer[256];
> + int socket_fd;
> + int client_fd;
> + size_t i = 0;
> + int opt = 1;
> +
> + dmabuf_size = getpagesize() * NUM_PAGES;
> +
> + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size);
> +
> + __u32 *queue_idx = malloc(sizeof(__u32) * 2);
> +
> + queue_idx[0] = 14;
> + queue_idx[1] = 15;
> + if (bind_rx_queue(3 /* index for eth1 */, buf, queue_idx, 2, &ys)) {
^^^^^^^^^^^^^^^^^^^
I guess we want to explicitly fetch the "ifname" index.
Side note: I'm wondering if we could extend some kind of virtual device
to allow single host self-tests? e.g. veth, if that would not cause
excessive bloat in the device driver?
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH net-next v2 00/21] virtio-net: support AF_XDP zero copy
From: Xuan Zhuo @ 2023-11-09 10:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109031728-mutt-send-email-mst@kernel.org>
On Thu, 9 Nov 2023 03:19:04 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:06AM +0800, Xuan Zhuo wrote:
> > ## AF_XDP
> >
> > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero
> > copy feature of xsk (XDP socket) needs to be supported by the driver. The
> > performance of zero copy is very good. mlx5 and intel ixgbe already support
> > this feature, This patch set allows virtio-net to support xsk's zerocopy xmit
> > feature.
> >
> > At present, we have completed some preparation:
> >
> > 1. vq-reset (virtio spec and kernel code)
> > 2. virtio-core premapped dma
> > 3. virtio-net xdp refactor
> >
> > So it is time for Virtio-Net to complete the support for the XDP Socket
> > Zerocopy.
> >
> > Virtio-net can not increase the queue num at will, so xsk shares the queue with
> > kernel.
> >
> > On the other hand, Virtio-Net does not support generate interrupt from driver
> > manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX
> > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If it
> > is also the local CPU, then we wake up napi directly.
> >
> > This patch set includes some refactor to the virtio-net to let that to support
> > AF_XDP.
> >
> > ## performance
> >
> > ENV: Qemu with vhost-user(polling mode).
> > Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
> >
> > ### virtio PMD in guest with testpmd
> >
> > testpmd> show port stats all
> >
> > ######################## NIC statistics for port 0 ########################
> > RX-packets: 19531092064 RX-missed: 0 RX-bytes: 1093741155584
> > RX-errors: 0
> > RX-nombuf: 0
> > TX-packets: 5959955552 TX-errors: 0 TX-bytes: 371030645664
> >
> >
> > Throughput (since last show)
> > Rx-pps: 8861574 Rx-bps: 3969985208
> > Tx-pps: 8861493 Tx-bps: 3969962736
> > ############################################################################
> >
> > ### AF_XDP PMD in guest with testpmd
> >
> > testpmd> show port stats all
> >
> > ######################## NIC statistics for port 0 ########################
> > RX-packets: 68152727 RX-missed: 0 RX-bytes: 3816552712
> > RX-errors: 0
> > RX-nombuf: 0
> > TX-packets: 68114967 TX-errors: 33216 TX-bytes: 3814438152
> >
> > Throughput (since last show)
> > Rx-pps: 6333196 Rx-bps: 2837272088
> > Tx-pps: 6333227 Tx-bps: 2837285936
> > ############################################################################
> >
> > But AF_XDP consumes more CPU for tx and rx napi(100% and 86%).
> >
> > ## maintain
> >
> > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support in
> > virtio-net.
> >
> > Please review.
> >
> > Thanks.
> >
> > v2
> > 1. wakeup uses the way of GVE. No send ipi to wakeup napi on remote cpu.
> > 2. remove rcu. Because we synchronize all operat, so the rcu is not needed.
> > 3. split the commit "move to virtio_net.h" in last patch set. Just move the
> > struct/api to header when we use them.
> > 4. add comments for some code
> >
> > v1:
> > 1. remove two virtio commits. Push this patchset to net-next
> > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: support tx
> > 3. fix some warnings
> >
> >
> > Xuan Zhuo (21):
> > virtio_net: rename free_old_xmit_skbs to free_old_xmit
> > virtio_net: unify the code for recycling the xmit ptr
> > virtio_net: independent directory
> > virtio_net: move core structures to virtio_net.h
> > virtio_net: add prefix virtnet to all struct inside virtio_net.h
> > virtio_net: separate virtnet_rx_resize()
> > virtio_net: separate virtnet_tx_resize()
> > virtio_net: sq support premapped mode
> > virtio_net: xsk: bind/unbind xsk
> > virtio_net: xsk: prevent disable tx napi
> > virtio_net: move some api to header
> > virtio_net: xsk: tx: support tx
> > virtio_net: xsk: tx: support wakeup
> > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
> > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer
> > virtio_net: xsk: rx: introduce add_recvbuf_xsk()
> > virtio_net: xsk: rx: skip dma unmap when rq is bind with AF_XDP
> > virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer
> > virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer
> > virtio_net: update tx timeout record
> > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
> >
> > MAINTAINERS | 2 +-
> > drivers/net/Kconfig | 8 +-
> > drivers/net/Makefile | 2 +-
> > drivers/net/virtio/Kconfig | 13 +
> > drivers/net/virtio/Makefile | 8 +
> > drivers/net/{virtio_net.c => virtio/main.c} | 630 +++++++++-----------
> > drivers/net/virtio/virtio_net.h | 346 +++++++++++
> > drivers/net/virtio/xsk.c | 498 ++++++++++++++++
> > drivers/net/virtio/xsk.h | 32 +
> > 9 files changed, 1174 insertions(+), 365 deletions(-)
> > create mode 100644 drivers/net/virtio/Kconfig
> > create mode 100644 drivers/net/virtio/Makefile
> > rename drivers/net/{virtio_net.c => virtio/main.c} (92%)
> > create mode 100644 drivers/net/virtio/virtio_net.h
> > create mode 100644 drivers/net/virtio/xsk.c
> > create mode 100644 drivers/net/virtio/xsk.h
>
>
> Too much code duplications for my taste. Would there maybe be less if
> everything was in a single file?
What duplication?
Yes. If everything is in a single file, something maybe simpler. But I
do not like that, that will be more trouble in the future. We want to
make the virtnet more like the modern network card, then we will
introduce more features to the virtio-net. I think it's time to
split the single file to modules. I think letting every module simple is
beneficial for the maintaining.
About the duplications, I do not understand. No duplication,
We just refer the functions from each other.
Do you mean the receive_xsk()?
Let we resolve this problem if you think it's a duplicate. I think this is the
right way.
Thanks.
>
>
> > --
> > 2.32.0.3.g01195cf9f
>
>
^ permalink raw reply
* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Paolo Abeni @ 2023-11-09 10:52 UTC (permalink / raw)
To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-11-almasrymina@google.com>
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> +/* On error, returns the -errno. On success, returns number of bytes sent to the
> + * user. May not consume all of @remaining_len.
> + */
> +static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb,
> + unsigned int offset, struct msghdr *msg,
> + int remaining_len)
> +{
> + struct cmsg_devmem cmsg_devmem = { 0 };
> + unsigned int start;
> + int i, copy, n;
> + int sent = 0;
> + int err = 0;
> +
> + do {
> + start = skb_headlen(skb);
> +
> + if (!skb_frags_not_readable(skb)) {
As 'skb_frags_not_readable()' is intended to be a possibly wider scope
test then skb->devmem, should the above test explicitly skb->devmem?
> + err = -ENODEV;
> + goto out;
> + }
> +
> + /* Copy header. */
> + copy = start - offset;
> + if (copy > 0) {
> + copy = min(copy, remaining_len);
> +
> + n = copy_to_iter(skb->data + offset, copy,
> + &msg->msg_iter);
> + if (n != copy) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + offset += copy;
> + remaining_len -= copy;
> +
> + /* First a cmsg_devmem for # bytes copied to user
> + * buffer.
> + */
> + memset(&cmsg_devmem, 0, sizeof(cmsg_devmem));
> + cmsg_devmem.frag_size = copy;
> + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_HEADER,
> + sizeof(cmsg_devmem), &cmsg_devmem);
> + if (err || msg->msg_flags & MSG_CTRUNC) {
> + msg->msg_flags &= ~MSG_CTRUNC;
> + if (!err)
> + err = -ETOOSMALL;
> + goto out;
> + }
> +
> + sent += copy;
> +
> + if (remaining_len == 0)
> + goto out;
> + }
> +
> + /* after that, send information of devmem pages through a
> + * sequence of cmsg
> + */
> + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> + const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + struct page_pool_iov *ppiov;
> + u64 frag_offset;
> + u32 user_token;
> + int end;
> +
> + /* skb_frags_not_readable() should indicate that ALL the
> + * frags in this skb are unreadable page_pool_iovs.
> + * We're checking for that flag above, but also check
> + * individual pages here. If the tcp stack is not
> + * setting skb->devmem correctly, we still don't want to
> + * crash here when accessing pgmap or priv below.
> + */
> + if (!skb_frag_page_pool_iov(frag)) {
> + net_err_ratelimited("Found non-devmem skb with page_pool_iov");
> + err = -ENODEV;
> + goto out;
> + }
> +
> + ppiov = skb_frag_page_pool_iov(frag);
> + end = start + skb_frag_size(frag);
> + copy = end - offset;
> +
> + if (copy > 0) {
> + copy = min(copy, remaining_len);
> +
> + frag_offset = page_pool_iov_virtual_addr(ppiov) +
> + skb_frag_off(frag) + offset -
> + start;
> + cmsg_devmem.frag_offset = frag_offset;
> + cmsg_devmem.frag_size = copy;
> + err = xa_alloc((struct xarray *)&sk->sk_user_pages,
> + &user_token, frag->bv_page,
> + xa_limit_31b, GFP_KERNEL);
> + if (err)
> + goto out;
> +
> + cmsg_devmem.frag_token = user_token;
> +
> + offset += copy;
> + remaining_len -= copy;
> +
> + err = put_cmsg(msg, SOL_SOCKET,
> + SO_DEVMEM_OFFSET,
> + sizeof(cmsg_devmem),
> + &cmsg_devmem);
> + if (err || msg->msg_flags & MSG_CTRUNC) {
> + msg->msg_flags &= ~MSG_CTRUNC;
> + xa_erase((struct xarray *)&sk->sk_user_pages,
> + user_token);
> + if (!err)
> + err = -ETOOSMALL;
> + goto out;
> + }
> +
> + page_pool_iov_get_many(ppiov, 1);
> +
> + sent += copy;
> +
> + if (remaining_len == 0)
> + goto out;
> + }
> + start = end;
> + }
> +
> + if (!remaining_len)
> + goto out;
> +
> + /* if remaining_len is not satisfied yet, we need to go to the
> + * next frag in the frag_list to satisfy remaining_len.
> + */
> + skb = skb_shinfo(skb)->frag_list ?: skb->next;
I think at this point the 'skb' is still on the sk receive queue. The
above will possibly walk the queue.
Later on, only the current queue tail could be possibly consumed by
tcp_recvmsg_locked(). This feel confusing to me?!? Why don't limit the
loop only the 'current' skb and it's frags?
> +
> + offset = offset - start;
> + } while (skb);
> +
> + if (remaining_len) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> +out:
> + if (!sent)
> + sent = err;
> +
> + return sent;
> +}
> +
> /*
> * This routine copies from a sock struct into the user buffer.
> *
> @@ -2314,6 +2463,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int *cmsg_flags)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> + int last_copied_devmem = -1; /* uninitialized */
> int copied = 0;
> u32 peek_seq;
> u32 *seq;
> @@ -2491,15 +2641,44 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if (!(flags & MSG_TRUNC)) {
> - err = skb_copy_datagram_msg(skb, offset, msg, used);
> - if (err) {
> - /* Exception. Bailout! */
> - if (!copied)
> - copied = -EFAULT;
> + if (last_copied_devmem != -1 &&
> + last_copied_devmem != skb->devmem)
> break;
> +
> + if (!skb->devmem) {
> + err = skb_copy_datagram_msg(skb, offset, msg,
> + used);
> + if (err) {
> + /* Exception. Bailout! */
> + if (!copied)
> + copied = -EFAULT;
> + break;
> + }
> + } else {
> + if (!(flags & MSG_SOCK_DEVMEM)) {
> + /* skb->devmem skbs can only be received
> + * with the MSG_SOCK_DEVMEM flag.
> + */
> + if (!copied)
> + copied = -EFAULT;
> +
> + break;
> + }
> +
> + err = tcp_recvmsg_devmem(sk, skb, offset, msg,
> + used);
> + if (err <= 0) {
> + if (!copied)
> + copied = -EFAULT;
> +
> + break;
> + }
> + used = err;
Minor nit: I personally would find the above more readable, placing
this whole chunk in a single helper (e.g. the current
tcp_recvmsg_devmem(), renamed to something more appropriate).
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH net 3/3] dpll: fix register pin with unregistered parent pin
From: Vadim Fedorenko @ 2023-11-09 10:56 UTC (permalink / raw)
To: Kubalewski, Arkadiusz, Jiri Pirko
Cc: netdev@vger.kernel.org, Michalik, Michal, Olech, Milena,
pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <DM6PR11MB46572BD8C43DACA0FF15C2CF9BAFA@DM6PR11MB4657.namprd11.prod.outlook.com>
On 09/11/2023 09:59, Kubalewski, Arkadiusz wrote:
>> From: Jiri Pirko <jiri@resnulli.us>
>> Sent: Wednesday, November 8, 2023 4:08 PM
>>
>> Wed, Nov 08, 2023 at 11:32:26AM CET, arkadiusz.kubalewski@intel.com wrote:
>>> In case of multiple kernel module instances using the same dpll device:
>>> if only one registers dpll device, then only that one can register
>>
>> They why you don't register in multiple instances? See mlx5 for a
>> reference.
>>
>
> Every registration requires ops, but for our case only PF0 is able to
> control dpll pins and device, thus only this can provide ops.
> Basically without PF0, dpll is not able to be controlled, as well
> as directly connected pins.
>
But why do you need other pins then, if FP0 doesn't exist?
>>
>>> directly connected pins with a dpll device. If unregistered parent
>>> determines if the muxed pin can be register with it or not, it forces
>>> serialized driver load order - first the driver instance which
>>> registers the direct pins needs to be loaded, then the other instances
>>> could register muxed type pins.
>>>
>>> Allow registration of a pin with a parent even if the parent was not
>>> yet registered, thus allow ability for unserialized driver instance
>>
>> Weird.
>>
>
> Yeah, this is issue only for MUX/parent pin part, couldn't find better
> way, but it doesn't seem to break things around..
>
I just wonder how do you see the registration procedure? How can parent
pin exist if it's not registered? I believe you cannot get it through
DPLL API, then the only possible way is to create it within the same
driver code, which can be simply re-arranged. Am I wrong here?
> Thank you!
> Arkadiusz
>
>>
>>> load order.
>>> Do not WARN_ON notification for unregistered pin, which can be invoked
>>> for described case, instead just return error.
>>>
>>> Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
>>> Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>> functions")
>>> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
>>> ---
>>> drivers/dpll/dpll_core.c | 4 ----
>>> drivers/dpll/dpll_netlink.c | 2 +-
>>> 2 files changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c index
>>> 4077b562ba3b..ae884b92d68c 100644
>>> --- a/drivers/dpll/dpll_core.c
>>> +++ b/drivers/dpll/dpll_core.c
>>> @@ -28,8 +28,6 @@ static u32 dpll_xa_id;
>>> WARN_ON_ONCE(!xa_get_mark(&dpll_device_xa, (d)->id, DPLL_REGISTERED))
>>> #define ASSERT_DPLL_NOT_REGISTERED(d) \
>>> WARN_ON_ONCE(xa_get_mark(&dpll_device_xa, (d)->id, DPLL_REGISTERED))
>>> -#define ASSERT_PIN_REGISTERED(p) \
>>> - WARN_ON_ONCE(!xa_get_mark(&dpll_pin_xa, (p)->id, DPLL_REGISTERED))
>>>
>>> struct dpll_device_registration {
>>> struct list_head list;
>>> @@ -641,8 +639,6 @@ int dpll_pin_on_pin_register(struct dpll_pin *parent,
>> struct dpll_pin *pin,
>>> WARN_ON(!ops->state_on_pin_get) ||
>>> WARN_ON(!ops->direction_get))
>>> return -EINVAL;
>>> - if (ASSERT_PIN_REGISTERED(parent))
>>> - return -EINVAL;
>>>
>>> mutex_lock(&dpll_lock);
>>> ret = dpll_xa_ref_pin_add(&pin->parent_refs, parent, ops, priv); diff
>>> --git a/drivers/dpll/dpll_netlink.c b/drivers/dpll/dpll_netlink.c index
>>> 963bbbbe6660..ff430f43304f 100644
>>> --- a/drivers/dpll/dpll_netlink.c
>>> +++ b/drivers/dpll/dpll_netlink.c
>>> @@ -558,7 +558,7 @@ dpll_pin_event_send(enum dpll_cmd event, struct
>> dpll_pin *pin)
>>> int ret = -ENOMEM;
>>> void *hdr;
>>>
>>> - if (WARN_ON(!xa_get_mark(&dpll_pin_xa, pin->id, DPLL_REGISTERED)))
>>> + if (!xa_get_mark(&dpll_pin_xa, pin->id, DPLL_REGISTERED))
>>> return -ENODEV;
>>>
>>> msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>>> --
>>> 2.38.1
>>>
^ permalink raw reply
* Re: [PATCH net] net: microchip: lan743x : bidirectional throughuput improvement
From: VishvambarPanth.S @ 2023-11-09 10:53 UTC (permalink / raw)
To: kuba, f.fainelli
Cc: Bryan.Whitehead, andrew, davem, linux-kernel, pabeni, netdev,
UNGLinuxDriver, edumazet
In-Reply-To: <ee81b2128f5178df95a1678d2cf94ad4edf2c9e9.camel@microchip.com>
On Wed, 2023-11-01 at 12:52 +0530, Vishvambar Panth S wrote:
> On Wed, 2023-10-04 at 13:09 -0700, Jakub Kicinski wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you
> > know the content is safe
> >
> > On Wed, 4 Oct 2023 13:02:17 -0700 Florian Fainelli wrote:
> > > > Nobody complained for 5 years, and it's not a regression.
> > > > Let's not treat this as a fix, please repost without the Fixes
> > > > tag for
> > > > net-next.
> > >
> > > As a driver maintainer, you may want to provide some guarantees
> > > to
> > > your
> > > end users/customers that from stable version X.Y.Z the
> > > performance
> > > issues have been fixed. Performance improvements are definitively
> > > border
> > > line in terms of being considered as bug fixes though.
> >
> > I understand that, but too often people just "feel like a device
> > which
> > advertises X Mbps / Gbps should reach line rate" while no end user
> > cares.
> >
> > Luckily stable rules are pretty clear about this (search for
> > "performance"):
> > https://docs.kernel.org/process/stable-kernel-rules.html
> >
> > As posted it doesn't fulfill the requirements
>
> Thanks for your feedback. I apologize for the delayed response.
>
> The data presented in the patch description was aimed to convince a
> reviewer with the visible impact of the performance boosts in both
> x64
> and ARM platforms. However, the main motivation behind the patch was
> not merely a "good-to-have" improvement but a solution to the
> throughput issues reported by multiple customers in several
> platforms.
> We received lots of customer requests through our ticket site system
> urging us to address the performance issues on multiple kernel
> versions
> including LTS. While it's acknowledged that stable branch rules
> typically do not consider performance fixes that are not documented
> in
> public Bugzilla, this performance enhancement is essential to many of
> our customers and their end users and we believe should therefore be
> considered for stable branch on the basis of it’s visible user
> impact.
>
> Few issues reported by our customers are mentioned below, even though
> these issues have existed for a long time, the data presented below
> is
> collected from the customer within last 3 months.
>
> Customer-A using lan743x with Hisilicon- Kirin 990 processor in 5.10
> kernel, reported a mere ~300Mbps in Rx UDP. The fix significantly
> improved the performance to ~900Mbps Rx in their platform.
>
> Customer-B using lan743x with v5.10 has an issue with Tx UDP being
> only
> 157Mbps in their platform. Including the fix in the patch boosts the
> performance to ~600Mbps in Tx UDP.
>
> Customer-C using lan743x with ADAS Ref Design in v5.10 reported UDP
> Tx/Rx to be 126/723 Mbps and the fix improved the performance to
> 828/956 Mbps.
>
> Customer-D using lan743x with Qcom 6490 with v5.4 wanted improvements
> for their platform from UDP Rx 200Mbps. The fix along with few other
> changes helped us to bring Rx perf to 800Mbps in customer’s platform
>
> This is a kind request for considering the acceptance of this patch
> into the net branch, as it has a significant positive impact on users
> and does not have any adverse effects.
>
> Thanks,
> Vishvambar Panth S
>
>
It has come to my attention that some people may not have received my
whole reply dated Nov 1st (as per
https://patchwork.kernel.org/project/netdevbpf/patch/20230927111623.9966-1-vishvambarpanth.s@microchip.com/#25577895
), possibly due to a non-ASCII character at the cut-off point.
Therefore, I am resending the part that was cut short below.
Jakub, would it be possible for you to apply the patch to the net
branch given the additional justification now posted below?
-----
Thanks for your feedback. I apologize for the delayed response.
The data presented in the patch description was aimed to convince a
reviewer with the visible impact of the performance boosts in both x64
and ARM platforms. However, the main motivation behind the patch was
not merely a "good-to-have" improvement but a solution to the
throughput issues reported by multiple customers in several platforms.
We received lots of customer requests through our ticket site system
urging us to address the performance issues on multiple kernel versions
including LTS. While it's acknowledged that stable branch rules
typically do not consider performance fixes that are not documented in
public Bugzilla, this performance enhancement is essential to many of
our customers and their end users and we believe should therefore be
considered for stable branch on the basis of it’s visible user impact.
Few issues reported by our customers are mentioned below, even though
these issues have existed for a long time, the data presented below is
collected from the customer within last 3 months.
Customer-A using lan743x with Hisilicon- Kirin 990 processor in 5.10
kernel, reported a mere ~300Mbps in Rx UDP. The fix significantly
improved the performance to ~900Mbps Rx in their platform.
Customer-B using lan743x with v5.10 has an issue with Tx UDP being only
157Mbps in their platform. Including the fix in the patch boosts the
performance to ~600Mbps in Tx UDP.
Customer-C using lan743x with ADAS Ref Design in v5.10 reported UDP
Tx/Rx to be 126/723 Mbps and the fix improved the performance to
828/956 Mbps.
Customer-D using lan743x with Qcom 6490 with v5.4 wanted improvements
for their platform from UDP Rx 200Mbps. The fix along with few other
changes helped us to bring Rx perf to 800Mbps in customer’s platform
This is a kind request for considering the acceptance of this patch
into the net branch, as it has a significant positive impact on users
and does not have any adverse effects.
Thanks,
Vishvambar Panth S
^ permalink raw reply
* Re: [PATCH net 0/3] dpll: fix unordered unbind/bind registerer issues
From: Vadim Fedorenko @ 2023-11-09 10:50 UTC (permalink / raw)
To: Arkadiusz Kubalewski, netdev
Cc: jiri, michal.michalik, milena.olech, pabeni, kuba
In-Reply-To: <20231108103226.1168500-1-arkadiusz.kubalewski@intel.com>
On 08/11/2023 10:32, Arkadiusz Kubalewski wrote:
> Fix issues when performing unordered unbind/bind of a kernel modules
> which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
> Currently only serialized bind/unbind of such use case works, fix
> the issues and allow for unserialized kernel module bind order.
>
> The issues are observed on the ice driver, i.e.,
>
> $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
> $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
>
> results in:
>
> ice 0000:af:00.0: Removed PTP clock
> BUG: kernel NULL pointer dereference, address: 0000000000000010
> PF: supervisor read access in kernel mode
> PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5_next-queue_19th-Oct-2023-01625-g039e5d15e451 #1
> Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
> RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
> Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b 66 08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b 10 41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
> RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
> RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
> R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
> R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
> FS: 00007fdc7dae0740(0000) GS:ffff888c105c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die+0x20/0x70
> ? page_fault_oops+0x76/0x170
> ? exc_page_fault+0x65/0x150
> ? asm_exc_page_fault+0x22/0x30
> ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
> ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
> dpll_msg_add_pin_parents+0x142/0x1d0
> dpll_pin_event_send+0x7d/0x150
> dpll_pin_on_pin_unregister+0x3f/0x100
> ice_dpll_deinit_pins+0xa1/0x230 [ice]
> ice_dpll_deinit+0x29/0xe0 [ice]
> ice_remove+0xcd/0x200 [ice]
> pci_device_remove+0x33/0xa0
> device_release_driver_internal+0x193/0x200
> unbind_store+0x9d/0xb0
> kernfs_fop_write_iter+0x128/0x1c0
> vfs_write+0x2bb/0x3e0
> ksys_write+0x5f/0xe0
> do_syscall_64+0x59/0x90
> ? filp_close+0x1b/0x30
> ? do_dup2+0x7d/0xd0
> ? syscall_exit_work+0x103/0x130
> ? syscall_exit_to_user_mode+0x22/0x40
> ? do_syscall_64+0x69/0x90
> ? syscall_exit_work+0x103/0x130
> ? syscall_exit_to_user_mode+0x22/0x40
> ? do_syscall_64+0x69/0x90
> entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> RIP: 0033:0x7fdc7d93eb97
> Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
> RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
> RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
> R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
> R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
> </TASK>
> Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1 vfio irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs libcrc32c rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr i2c_i801 ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded: iavf]
> CR2: 0000000000000010
>
> Arkadiusz Kubalewski (3):
> dpll: fix pin dump crash after module unbind
> dpll: fix pin dump crash for rebound module
> dpll: fix register pin with unregistered parent pin
>
> drivers/dpll/dpll_core.c | 8 ++------
> drivers/dpll/dpll_core.h | 4 ++--
> drivers/dpll/dpll_netlink.c | 37 ++++++++++++++++++++++---------------
> 3 files changed, 26 insertions(+), 23 deletions(-)
>
I still don't get how can we end up with unregistered pin. And shouldn't
drivers do unregister of dpll/pin during release procedure? I thought it
was kind of agreement we reached while developing the subsystem.
^ permalink raw reply
* Re: [PATCH net v4 0/3] Fix large frames in the Gemini ethernet driver
From: Vladimir Oltean @ 2023-11-09 10:50 UTC (permalink / raw)
To: Linus Walleij
Cc: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Michał Mirosław, Andrew Lunn,
linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <20231109-gemini-largeframe-fix-v4-0-6e611528db08@linaro.org>
On Thu, Nov 09, 2023 at 10:03:11AM +0100, Linus Walleij wrote:
> This is the result of a bug hunt for a problem with the
> RTL8366RB DSA switch leading me wrong all over the place.
>
> I am indebted to Vladimir Oltean who as usual pointed
> out where the real problem was, many thanks!
>
> Tryig to actually use big ("jumbo") frames on this
> hardware uncovered the real bugs. Then I tested it on
> the DSA switch and it indeed fixes the issue.
>
> To make sure it also works fine with big frames on
> non-DSA devices I also copied a large video file over
> scp to a device with maximum frame size, the data
> was transported in large TCP packets ending up in
> 0x7ff sized frames using software checksumming at
> ~2.0 MB/s.
>
> If I set down the MTU to the standard 1500 bytes so
> that hardware checksumming is used, the scp transfer
> of the same file was slightly lower, ~1.8-1.9 MB/s.
>
> Despite this not being the best test it shows that
> we can now stress the hardware with large frames
> and that software checksum works fine.
>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
Thanks for being persistent with this! I hope we didn't miss today's
"net" pull request :)
^ permalink raw reply
* Re: [PATCH 15/22] arch: vdso: consolidate gettime prototypes
From: Christophe Leroy @ 2023-11-09 10:49 UTC (permalink / raw)
To: Michael Ellerman, Arnd Bergmann, Arnd Bergmann, Andrew Morton,
linux-kernel@vger.kernel.org, Masahiro Yamada,
linux-kbuild@vger.kernel.org
Cc: Matt Turner, Vineet Gupta, Russell King, Catalin Marinas,
Will Deacon, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
guoren, Peter Zijlstra, Ard Biesheuvel, Huacai Chen, Greg Ungerer,
Michal Simek, Thomas Bogendoerfer, Dinh Nguyen, Nicholas Piggin,
Geoff Levand, Palmer Dabbelt, Heiko Carstens,
John Paul Adrian Glaubitz, David S . Miller, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, x86@kernel.org, Helge Deller,
Sudip Mukherjee, Greg Kroah-Hartman, Timur Tabi, Kent Overstreet,
David Woodhouse, Naveen N. Rao, Anil S Keshavamurthy, Kees Cook,
Vincenzo Frascino, Juri Lelli, Vincent Guittot, Nathan Chancellor,
Nick Desaulniers, Nicolas Schier, Alexander Viro,
Uwe Kleine-König, linux-alpha@vger.kernel.org,
linux-snps-arc@lists.infradead.org,
linux-arm-kernel@lists.infradead.org,
linux-trace-kernel@vger.kernel.org, linux-csky@vger.kernel.org,
loongarch@lists.linux.dev, linux-m68k@lists.linux-m68k.org,
linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
linux-sh@vger.kernel.org, sparclinux@vger.kernel.org, Netdev,
linux-parisc@vger.kernel.org, linux-usb@vger.kernel.org,
linux-fbdev@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-bcachefs@vger.kernel.org, linux-mtd@lists.infradead.org
In-Reply-To: <87o7g3qlf5.fsf@mail.lhotse>
Le 09/11/2023 à 11:18, Michael Ellerman a écrit :
> "Arnd Bergmann" <arnd@arndb.de> writes:
>> On Wed, Nov 8, 2023, at 19:31, Christophe Leroy wrote:
>>> Le 08/11/2023 à 13:58, Arnd Bergmann a écrit :
>>
>>> powerpc has functions doing more or less the same, they are called
>>> __c_kernel_clock_gettime() and alike with their prototypes siting in
>>> arch/powerpc/include/asm/vdso/gettimeofday.h
>>>
>>> Should those prototypes be moved to include/vdso/gettime.h too and
>>> eventually renamed, or are they considered too powerpc specific ?
>>
>> I don't actually know, my initial interpretation was that
>> these function names are part of the user ABI for the vdso,
>> but I never looked closely enough at how vdso works to
>> be sure what the actual ABI is.
>
> AFAIK the ABI is just the symbols we export, as defined in the linker
> script:
>
> /*
> * This controls what symbols we export from the DSO.
> */
> VERSION
> {
> VDSO_VERSION_STRING {
> global:
> __kernel_get_syscall_map;
> __kernel_gettimeofday;
> __kernel_clock_gettime;
> __kernel_clock_getres;
> __kernel_get_tbfreq;
> __kernel_sync_dicache;
> __kernel_sigtramp_rt64;
> __kernel_getcpu;
> __kernel_time;
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/vdso/vdso64.lds.S?h=v6.6&#n117
>
>> If __c_kernel_clock_gettime() etc are not part of the user-facing
>> ABI, I think renaming them for consistency with the other
>> architectures would be best.
>
> The __c symbols are not part of the ABI, so we could rename them.
>
> At the moment though they don't have the same prototype as the generic
> versions, because we find the VDSO data in asm and pass it to the C
> functions, eg:
>
> int __c_kernel_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz,
> const struct vdso_data *vd);
>
> I think we can rework that though, by implementing
> __arch_get_vdso_data() and getting the vdso_data in C. Then we'd be able
> to share the prototypes.
I think it would not a been good idea, it would be less performant, for
explanation see commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e876f0b69dc993e86ca7795e63e98385aa9a7ef3
Christophe
^ permalink raw reply
* Re: [PATCH net v4 3/3] net: ethernet: cortina: Fix MTU max setting
From: Vladimir Oltean @ 2023-11-09 10:47 UTC (permalink / raw)
To: Linus Walleij
Cc: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Michał Mirosław, Andrew Lunn,
linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <20231109-gemini-largeframe-fix-v4-3-6e611528db08@linaro.org>
On Thu, Nov 09, 2023 at 10:03:14AM +0100, Linus Walleij wrote:
> The RX max frame size is over 10000 for the Gemini ethernet,
> but the TX max frame size is actually just 2047 (0x7ff after
> checking the datasheet). Reflect this in what we offer to Linux,
> cap the MTU at the TX max frame minus ethernet headers.
>
> We delete the code disabling the hardware checksum for large
> MTUs as netdev->mtu can no longer be larger than
> netdev->max_mtu meaning the if()-clause in gmac_fix_features()
> is never true.
>
> Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
^ permalink raw reply
* Re: [PATCH net v4 2/3] net: ethernet: cortina: Handle large frames
From: Vladimir Oltean @ 2023-11-09 10:46 UTC (permalink / raw)
To: Linus Walleij
Cc: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Michał Mirosław, Andrew Lunn,
linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <20231109-gemini-largeframe-fix-v4-2-6e611528db08@linaro.org>
On Thu, Nov 09, 2023 at 10:03:13AM +0100, Linus Walleij wrote:
> The Gemini ethernet controller provides hardware checksumming
> for frames up to 1514 bytes including ethernet headers but not
> FCS.
>
> If we start sending bigger frames (after first bumping up the MTU
> on both interfaces sending and receiving the frames), truncated
> packets start to appear on the target such as in this tcpdump
> resulting from ping -s 1474:
>
> 23:34:17.241983 14:d6:4d:a8:3c:4f (oui Unknown) > bc:ae:c5:6b:a8:3d (oui Unknown),
> ethertype IPv4 (0x0800), length 1514: truncated-ip - 2 bytes missing!
> (tos 0x0, ttl 64, id 32653, offset 0, flags [DF], proto ICMP (1), length 1502)
> OpenWrt.lan > Fecusia: ICMP echo request, id 1672, seq 50, length 1482
>
> If we bypass the hardware checksumming and provide a software
> fallback, everything starts working fine up to the max TX MTU
> of 2047 bytes, for example ping -s2000 192.168.1.2:
>
> 00:44:29.587598 bc:ae:c5:6b:a8:3d (oui Unknown) > 14:d6:4d:a8:3c:4f (oui Unknown),
> ethertype IPv4 (0x0800), length 2042:
> (tos 0x0, ttl 64, id 51828, offset 0, flags [none], proto ICMP (1), length 2028)
> Fecusia > OpenWrt.lan: ICMP echo reply, id 1683, seq 4, length 2008
>
> The bit enabling to bypass hardware checksum (or any of the
> "TSS" bits) are undocumented in the hardware reference manual.
> The entire hardware checksum unit appears undocumented. The
> conclusion that we need to use the "bypass" bit was found by
> trial-and-error.
>
> Since no hardware checksum will happen, we slot in a software
> checksum fallback.
>
> Check for the condition where we need to compute checksum on the
> skb with either hardware or software using == CHECKSUM_PARTIAL instead
> of != CHECKSUM_NONE which is an incomplete check according to
> <linux/skbuff.h>.
>
> On the D-Link DIR-685 router this fixes a bug on the conduit
> interface to the RTL8366RB DSA switch: as the switch needs to add
> space for its tag it increases the MTU on the conduit interface
> to 1504 and that means that when the router sends packages
> of 1500 bytes these get an extra 4 bytes of DSA tag and the
> transfer fails because of the erroneous hardware checksumming,
> affecting such basic functionality as the LuCI web interface.
>
> Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox