Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH rdma-next 0/3] RDMA RX RoCE Steering Support
From: Doug Ledford @ 2019-08-22 15:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Mark Zhang,
	Saeed Mahameed, linux-netdev
In-Reply-To: <20190821140204.GG4459@mtr-leonro.mtl.com>

[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]

On Wed, 2019-08-21 at 17:02 +0300, Leon Romanovsky wrote:
> On Tue, Aug 20, 2019 at 01:54:59PM -0400, Doug Ledford wrote:
> > On Mon, 2019-08-19 at 14:36 +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > > 
> > > Hi,
> > > 
> > > This series from Mark extends mlx5 with RDMA_RX RoCE flow steering
> > > support
> > > for DEVX and QP objects.
> > > 
> > > Thanks
> > > 
> > > Mark Zhang (3):
> > >   net/mlx5: Add per-namespace flow table default miss action
> > > support
> > >   net/mlx5: Create bypass and loopback flow steering namespaces
> > > for
> > > RDMA
> > >     RX
> > >   RDMA/mlx5: RDMA_RX flow type support for user applications
> > 
> > I have no objection to this series.
> 
> Thanks, first two patches were applied to mlx5-next
> 
> e6806e9a63a7 net/mlx5: Create bypass and loopback flow steering
> namespaces for RDMA RX
> f66ad830b114 net/mlx5: Add per-namespace flow table default miss
> action support

mlx5-next merged into for-next, final patch applied, thanks.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Aw: [PATCH net-next v2 0/3] net: dsa: mt7530: Convert to PHYLINK and add support for port 5
From: Frank Wunderlich @ 2019-08-22 15:44 UTC (permalink / raw)
  To: "René van Dorst"
  Cc: Sean Wang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S . Miller, Matthias Brugger, netdev, linux-mips,
	"René van Dorst", linux-mediatek, John Crispin,
	linux-arm-kernel
In-Reply-To: <20190821144547.15113-1-opensource@vdorst.com>

Hi,

tested on BPI-R2 (mt7623) with 2 Problems (already reported to Rene, just to inform everyone)...maybe anybody has an idea

- linux-next (i know it's not part of the series, but a pitfall on testing other devices) seems to break power-regulator somewhere here:

priv->core_pwr = devm_regulator_get(&mdiodev->dev, "core"); returns 517

#define EPROBE_DEFER517/* Driver requests probe retry */

https://elixir.bootlin.com/linux/latest/source/drivers/regulator/core.c#L1726

without linux-next switch came up including dsa-ports

- RX-traffic (run iperf3 -c x.x.x.x -R) is only 780 Mbits/sec (TX=940 Mbits/sec), same measure with 5.3-rc4 gives 940 MBit/s with same devices,
maybe caused by changes for mt76x8?

regards Frank

^ permalink raw reply

* Re: [PATCH rdma-next 0/3] RDMA RX RoCE Steering Support
From: Leon Romanovsky @ 2019-08-22 15:45 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Mark Zhang,
	Saeed Mahameed, linux-netdev
In-Reply-To: <c7caa8eece02f3d15a0928663e9f64f99572f3ab.camel@redhat.com>

On Thu, Aug 22, 2019 at 11:29:02AM -0400, Doug Ledford wrote:
> On Wed, 2019-08-21 at 17:02 +0300, Leon Romanovsky wrote:
> > On Tue, Aug 20, 2019 at 01:54:59PM -0400, Doug Ledford wrote:
> > > On Mon, 2019-08-19 at 14:36 +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > >
> > > > Hi,
> > > >
> > > > This series from Mark extends mlx5 with RDMA_RX RoCE flow steering
> > > > support
> > > > for DEVX and QP objects.
> > > >
> > > > Thanks
> > > >
> > > > Mark Zhang (3):
> > > >   net/mlx5: Add per-namespace flow table default miss action
> > > > support
> > > >   net/mlx5: Create bypass and loopback flow steering namespaces
> > > > for
> > > > RDMA
> > > >     RX
> > > >   RDMA/mlx5: RDMA_RX flow type support for user applications
> > >
> > > I have no objection to this series.
> >
> > Thanks, first two patches were applied to mlx5-next
> >
> > e6806e9a63a7 net/mlx5: Create bypass and loopback flow steering
> > namespaces for RDMA RX
> > f66ad830b114 net/mlx5: Add per-namespace flow table default miss
> > action support
>
> mlx5-next merged into for-next, final patch applied, thanks.

Thanks
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



^ permalink raw reply

* Re: [PATCH v12 3/5] dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver
From: Marc Kleine-Budde @ 2019-08-22 15:46 UTC (permalink / raw)
  To: Dan Murphy, wg, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <ff9e007b-6e39-3d64-b62b-93c281d69113@ti.com>


[-- Attachment #1.1: Type: text/plain, Size: 1334 bytes --]

On 8/22/19 4:20 PM, Dan Murphy wrote:
>>> +tcan4x5x: tcan4x5x@0 {
>>> +		compatible = "ti,tcan4x5x";
>>> +		reg = <0>;
>>> +		#address-cells = <1>;
>>> +		#size-cells = <1>;
>>> +		spi-max-frequency = <10000000>;
>>> +		bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
>>> +		data-ready-gpios = <&gpio1 14 GPIO_ACTIVE_LOW>;
>> Can you convert this into a proper interrupt property? E.g.:
> 
> OK.  Do you want v13 or do you want patches on top for net-next?

Please use net-next/master as the base.

>>>                  interrupt-parent = <&gpio4>;
>>>                  interrupts = <13 0x2>;
>> See:
>> https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/net/can/microchip,mcp251x.txt#L21
>> https://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git/tree/drivers/net/can/spi/mcp251x.c?h=mcp251x#n945

I've removed the branch, as the patches are already upstream, have a
look here:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/can/spi/mcp251x.c#n921

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net/mlx5e: Use PTR_ERR_OR_ZERO in mlx5e_tc_add_nic_flow()
From: Leon Romanovsky @ 2019-08-22 15:50 UTC (permalink / raw)
  To: YueHaibing
  Cc: Saeed Mahameed, davem, netdev, linux-rdma, kernel-janitors,
	linux-kernel
In-Reply-To: <20190822065219.73945-1-yuehaibing@huawei.com>

On Thu, Aug 22, 2019 at 06:52:19AM +0000, YueHaibing wrote:
> Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

^ permalink raw reply

* Re: [PATCH v12 3/5] dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver
From: Dan Murphy @ 2019-08-22 15:58 UTC (permalink / raw)
  To: Marc Kleine-Budde, wg, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <6c2bf55f-e360-c51a-e7bb-effc86aa5b6c@pengutronix.de>

Marc

On 8/22/19 10:46 AM, Marc Kleine-Budde wrote:
> On 8/22/19 4:20 PM, Dan Murphy wrote:
>>>> +tcan4x5x: tcan4x5x@0 {
>>>> +		compatible = "ti,tcan4x5x";
>>>> +		reg = <0>;
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <1>;
>>>> +		spi-max-frequency = <10000000>;
>>>> +		bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
>>>> +		data-ready-gpios = <&gpio1 14 GPIO_ACTIVE_LOW>;
>>> Can you convert this into a proper interrupt property? E.g.:
>> OK.  Do you want v13 or do you want patches on top for net-next?
> Please use net-next/master as the base.


Thanks for the reply.  I see that that there are patches on top of the 
driver so I will send patches on top of that.

Dan

<snip>

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-22 16:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hpJm-3svfV93pYYrpoiV12jDjuROHCgvCjPivAjXTB_VA@mail.gmail.com>

On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> I don't think I understand the problem here.

On the contrary, I do.

> You'd have something like this:
> 
> Master (DSA master port)         Slave (switch CPU port)
> 
>     |                            |         Tstamps known
>     |                            |         to slave
>     |       Local_sync_req       |
>  t1 |------\                     |         t1
>     |       \-----\              |
>     |              \-----\       |
>     |                     \----->| t2      t1, t2
>     |                            |
>     |     Local_sync_resp /------| t3      t1, t2, t3
>     |              /-----/       |
>     |       /-----/              |
>  t4 |<-----/                     |         t1, t2, t3, t4
>     |                            |
>     |                            |
>     v           time             v

And who generates Local_sync_resp?

Also, what sort of frame is it?  PTP has no Sync request or response.

> But you don't mean a TX timestamp at the egress of swp4 here, do you?

Yes, I do.
 
> Why would that matter?

Because in order to synchronize to an external GM, you need to measure
two things:

1. the (unchanging) delay from MAC to MAC
2. the (per-packet) switch residence time

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-22 16:07 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Alexander Duyck, Björn Töpel, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <cbf7c51b-9ce7-6ef6-32c4-981258d4af4c@samsung.com>

On Thu, Aug 22, 2019 at 1:17 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 22.08.2019 0:38, William Tu wrote:
> > On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >>
> >> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>
> >>> On 21.08.2019 4:17, Alexander Duyck wrote:
> >>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>>>
> >>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
> >>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>>>>>
> >>>>>>> Tx code doesn't clear the descriptor status after cleaning.
> >>>>>>> So, if the budget is larger than number of used elems in a ring, some
> >>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
> >>>>>>>
> >>>>>>> Fix that by limiting the number of descriptors to clean by the number
> >>>>>>> of used descriptors in the tx ring.
> >>>>>>>
> >>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>>>>>
> >>>>>> I'm not sure this is the best way to go. My preference would be to
> >>>>>> have something in the ring that would prevent us from racing which I
> >>>>>> don't think this really addresses. I am pretty sure this code is safe
> >>>>>> on x86 but I would be worried about weak ordered systems such as
> >>>>>> PowerPC.
> >>>>>>
> >>>>>> It might make sense to look at adding the eop_desc logic like we have
> >>>>>> in the regular path with a proper barrier before we write it and after
> >>>>>> we read it. So for example we could hold of on writing the bytecount
> >>>>>> value until the end of an iteration and call smp_wmb before we write
> >>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
> >>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
> >>>>>> clearing the value. Otherwise this code is going to just keep popping
> >>>>>> up with issues.
> >>>>>
> >>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
> >>>>> tx ring always happens in the same NAPI context and even on the same
> >>>>> CPU core.
> >>>>>
> >>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
> >>>>> use 'next_to_watch' field just as a flag of descriptor existence,
> >>>>> but it seems unnecessarily complicated. Am I missing something?
> >>>>>
> >>>>
> >>>> So is it always in the same NAPI context?. I forgot, I was thinking
> >>>> that somehow the socket could possibly make use of XDP for transmit.
> >>>
> >>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
> >>> is used in zero-copy mode. Real xmit happens inside
> >>> ixgbe_poll()
> >>>  -> ixgbe_clean_xdp_tx_irq()
> >>>     -> ixgbe_xmit_zc()
> >>>
> >>> This should be not possible to bound another XDP socket to the same netdev
> >>> queue.
> >>>
> >>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
> >>> actions. REDIRECT could happen from different netdev with different NAPI
> >>> context, but this operation is bound to specific CPU core and each core has
> >>> its own xdp_ring.
> >>>
> >>> However, I'm not an expert here.
> >>> Björn, maybe you could comment on this?
> >>>
> >>>>
> >>>> As far as the logic to use I would be good with just using a value you
> >>>> are already setting such as the bytecount value. All that would need
> >>>> to happen is to guarantee that the value is cleared in the Tx path. So
> >>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
> >>>> theoretically just use that as well to flag that a descriptor has been
> >>>> populated and is ready to be cleaned. Assuming the logic about this
> >>>> all being in the same NAPI context anyway you wouldn't need to mess
> >>>> with the barrier stuff I mentioned before.
> >>>
> >>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
> >>> makes iteration in this function logically equal to the iteration inside
> >>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
> >>> function too to follow same 'bytecount' approach? I don't like having
> >>> two different ways to determine number of used descriptors in the same file.
> >>>
> >>> Best regards, Ilya Maximets.
> >>
> >> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
> >> would say that if you got rid of budget and framed things more like
> >> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
> >> obvious I would prefer to see us go that route.
> >>
> >> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
> >> are going to be working with a static ntu value since you will only
> >> ever process one iteration through the ring anyway. It might make more
> >> sense if you just went through and got rid of budget and i, and
> >> instead used ntc and ntu like what was done in
> >> ixgbe_xsk_clean_tx_ring().
> >>
> >> Thanks.
> >>
> >> - Alex
> >
> > Not familiar with the driver details.
> > I tested this patch and the issue mentioned in OVS mailing list.
> > https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
> > and indeed the problem goes away.
>
> Good. Thanks for testing!
>
> > But I saw a huge performance drop,
> > my AF_XDP tx performance drops from >9Mpps to <5Mpps.
>
> I didn't expect so big performance difference with this change.
> What is your test scenario?

I was using OVS with dual port NIC, setting one OpenFlow rule
in_port=eth2 actions=output:eth3
and eth2 for rx and measure eth3 tx
'sar -n DEV 1'  shows pretty huge drop on eth3 tx.

> Is it possible that you're accounting same
> packet several times due to broken completion queue?

That's possible.
Let me double check on your v2 patch.

@Eelco: do you also see some performance difference?

Regards,
William

>
> Looking at samples/bpf/xdpsock_user.c:complete_tx_only(), it accounts
> sent packets (tx_npkts) by accumulating results of xsk_ring_cons__peek()
> for completion queue, so it's not a trusted source of pps information.
>
> Best regards, Ilya Maximets.
>
> >
> > Tested using kernel 5.3.0-rc3+
> > 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> > 10-Gigabit X540-AT2 (rev 01)
> > Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR- FastB2B- DisINTx+
> >
> > Regards,
> > William

^ permalink raw reply

* [PATCH] net/core/skmsg: Delete an unnecessary check before the function call “consume_skb”
From: Markus Elfring @ 2019-08-22 16:08 UTC (permalink / raw)
  To: netdev, bpf, Daniel Borkmann, David S. Miller, John Fastabend
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 22 Aug 2019 18:00:40 +0200

The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 net/core/skmsg.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6832eeb4b785..cf390e0aa73d 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -190,8 +190,7 @@ static int __sk_msg_free(struct sock *sk, struct sk_msg *msg, u32 i,
 		sk_msg_check_to_free(msg, i, msg->sg.size);
 		sge = sk_msg_elem(msg, i);
 	}
-	if (msg->skb)
-		consume_skb(msg->skb);
+	consume_skb(msg->skb);
 	sk_msg_init(msg);
 	return freed;
 }
--
2.23.0

^ permalink raw reply related

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-22 16:10 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hpJm-3svfV93pYYrpoiV12jDjuROHCgvCjPivAjXTB_VA@mail.gmail.com>

On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> If you mean the latter, then yes, having HWTSTAMP_FILTER_ALL in the
> rx_filter of the DSA master is a hard requirement.

And to clear, the marvell only recognizes PTP frames.  That means that
DSA frames cannot be time stamped.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Vladimir Oltean @ 2019-08-22 16:13 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822160521.GC4522@localhost>

On Thu, 22 Aug 2019 at 19:05, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> > I don't think I understand the problem here.
>
> On the contrary, I do.
>

You do think that I understand the problem? But I don't!

> > You'd have something like this:
> >
> > Master (DSA master port)         Slave (switch CPU port)
> >
> >     |                            |         Tstamps known
> >     |                            |         to slave
> >     |       Local_sync_req       |
> >  t1 |------\                     |         t1
> >     |       \-----\              |
> >     |              \-----\       |
> >     |                     \----->| t2      t1, t2
> >     |                            |
> >     |     Local_sync_resp /------| t3      t1, t2, t3
> >     |              /-----/       |
> >     |       /-----/              |
> >  t4 |<-----/                     |         t1, t2, t3, t4
> >     |                            |
> >     |                            |
> >     v           time             v
>
> And who generates Local_sync_resp?
>

Local_sync_resp is the same as Local_sync_req except maybe with a
custom tag added by the switch. Irrelevant as long as the DSA master
can timestamp it.

> Also, what sort of frame is it?  PTP has no Sync request or response.
>

A frame that can be timestamped on RX and TX by the DSA switch and
master, that is not a PTP frame.

> > But you don't mean a TX timestamp at the egress of swp4 here, do you?
>
> Yes, I do.
>
> > Why would that matter?
>
> Because in order to synchronize to an external GM, you need to measure
> two things:
>
> 1. the (unchanging) delay from MAC to MAC
> 2. the (per-packet) switch residence time
>

But since when are we discussing the synchronization to an external
grandmaster? I don't see the connection.

> Thanks,
> Richard

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-22 16:23 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Linux Kernel Network Developers, LKML, bpf, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron, Alexander Duyck
In-Reply-To: <20190822123037.28068-1-i.maximets@samsung.com>

On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the comletion queue ring.

s/comletion/completion/

>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
> complications implemented in the regular 'ixgbe_clean_tx_irq()'
> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
> indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---

Tested-by: William Tu <u9012063@gmail.com>

Instead of measuring tx performance at the tx machine, I measured the TX
performance at the other side (the traffic generating machine).  This time it
is more consistent and showing not much difference with (5.9Mpps) and
without this patch (6.1Mpps).

>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
>  1 file changed, 13 insertions(+), 21 deletions(-)
>

<snip>

^ permalink raw reply

* Aw: [PATCH net-next v2 0/3] net: dsa: mt7530: Convert to PHYLINK and add support for port 5
From: Frank Wunderlich @ 2019-08-22 16:26 UTC (permalink / raw)
  To: Frank Wunderlich
  Cc: "René van Dorst", Andrew Lunn, Florian Fainelli,
	netdev, Sean Wang, linux-mips, David S . Miller, linux-mediatek,
	John Crispin, Matthias Brugger, Vivien Didelot, linux-arm-kernel
In-Reply-To: <trinity-b1f48e51-af73-466d-9ecf-d560a7d7c1ee-1566488653737@3c-app-gmx-bap07>

tested now also on bpi-r64 (mt7622) v0.1 (rtl8367 switch), without linux-next to avoid power-regulator-problems like on bpi-r2

dmesg without warnings/errors caused by this patches
link came up as desired
iperf3 looks good: 943 Mbits/sec in both directions and no other issues

so it is currently only the rx-throughput-problem on mt7623/bpi-r2

regards Frank

^ permalink raw reply

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22 16:30 UTC (permalink / raw)
  To: William Tu
  Cc: Alexander Duyck, Björn Töpel, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <CALDO+SaRNMvmXrQqOtNiRsOkgfOQAW4EA2yVgmeoGQto2zvfMQ@mail.gmail.com>

On 22.08.2019 19:07, William Tu wrote:
> On Thu, Aug 22, 2019 at 1:17 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> On 22.08.2019 0:38, William Tu wrote:
>>> On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>
>>>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>
>>>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>>>
>>>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>>>> So, if the budget is larger than number of used elems in a ring, some
>>>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
>>>>>>>>>
>>>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>>>> of used descriptors in the tx ring.
>>>>>>>>>
>>>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>>>
>>>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>>>> have something in the ring that would prevent us from racing which I
>>>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>>>> PowerPC.
>>>>>>>>
>>>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>>>> up with issues.
>>>>>>>
>>>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>>>> CPU core.
>>>>>>>
>>>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>>>
>>>>>>
>>>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>>>
>>>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>>>> is used in zero-copy mode. Real xmit happens inside
>>>>> ixgbe_poll()
>>>>>  -> ixgbe_clean_xdp_tx_irq()
>>>>>     -> ixgbe_xmit_zc()
>>>>>
>>>>> This should be not possible to bound another XDP socket to the same netdev
>>>>> queue.
>>>>>
>>>>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>>>> actions. REDIRECT could happen from different netdev with different NAPI
>>>>> context, but this operation is bound to specific CPU core and each core has
>>>>> its own xdp_ring.
>>>>>
>>>>> However, I'm not an expert here.
>>>>> Björn, maybe you could comment on this?
>>>>>
>>>>>>
>>>>>> As far as the logic to use I would be good with just using a value you
>>>>>> are already setting such as the bytecount value. All that would need
>>>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>>>> theoretically just use that as well to flag that a descriptor has been
>>>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>>>> with the barrier stuff I mentioned before.
>>>>>
>>>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>>>> makes iteration in this function logically equal to the iteration inside
>>>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
>>>>> function too to follow same 'bytecount' approach? I don't like having
>>>>> two different ways to determine number of used descriptors in the same file.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
>>>> would say that if you got rid of budget and framed things more like
>>>> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
>>>> obvious I would prefer to see us go that route.
>>>>
>>>> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
>>>> are going to be working with a static ntu value since you will only
>>>> ever process one iteration through the ring anyway. It might make more
>>>> sense if you just went through and got rid of budget and i, and
>>>> instead used ntc and ntu like what was done in
>>>> ixgbe_xsk_clean_tx_ring().
>>>>
>>>> Thanks.
>>>>
>>>> - Alex
>>>
>>> Not familiar with the driver details.
>>> I tested this patch and the issue mentioned in OVS mailing list.
>>> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
>>> and indeed the problem goes away.
>>
>> Good. Thanks for testing!
>>
>>> But I saw a huge performance drop,
>>> my AF_XDP tx performance drops from >9Mpps to <5Mpps.
>>
>> I didn't expect so big performance difference with this change.
>> What is your test scenario?
> 
> I was using OVS with dual port NIC, setting one OpenFlow rule
> in_port=eth2 actions=output:eth3
> and eth2 for rx and measure eth3 tx
> 'sar -n DEV 1'  shows pretty huge drop on eth3 tx.

'sar' just parses net procfs to obtain interface stats, but interface stats
are not correct due to this bug (same packets accounted several times).

> 
>> Is it possible that you're accounting same
>> packet several times due to broken completion queue?
> 
> That's possible.
> Let me double check on your v2 patch.
> 
> @Eelco: do you also see some performance difference?
> 
> Regards,
> William
> 
>>
>> Looking at samples/bpf/xdpsock_user.c:complete_tx_only(), it accounts
>> sent packets (tx_npkts) by accumulating results of xsk_ring_cons__peek()
>> for completion queue, so it's not a trusted source of pps information.
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Tested using kernel 5.3.0-rc3+
>>> 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
>>> 10-Gigabit X540-AT2 (rev 01)
>>> Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
>>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>>> Stepping- SERR- FastB2B- DisINTx+
>>>
>>> Regards,
>>> William
> 
> 

^ permalink raw reply

* Re: New skb extension for use by LSMs (skb "security blob")?
From: Paul Moore @ 2019-08-22 16:32 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, linux-security-module, selinux
In-Reply-To: <20190822070358.GE20113@breakpoint.cc>

On Thu, Aug 22, 2019 at 3:03 AM Florian Westphal <fw@strlen.de> wrote:
> Paul Moore <paul@paul-moore.com> wrote:
> > Hello netdev,
> >
> > I was just made aware of the skb extension work, and it looks very
> > appealing from a LSM perspective.  As some of you probably remember,
> > we (the LSM folks) have wanted a proper security blob in the skb for
> > quite some time, but netdev has been resistant to this idea thus far.
>
> Is that "blob" in addition to skb->secmark, or a replacement?

That's a good question.  While I thought about that, I wasn't sure if
that was worth bringing up as previous attempts to trade the secmark
field for a void pointer met with failure.  Last time I played with it
I was able to take the additional 32-bits from holes in the skb, and
possibly even improve some of the cacheline groupings (but that is
always going to be a dependent on use case I think), but that wasn't
enough.

I think we could consider freeing up the secmark in the main skb, and
move it to a skb extension, but this would potentially increase the
chances that we would need to add an extension to a skb.  I don't have
any hard numbers, but based on discussions and questions I suspect
Secmark is more widely used than NetLabel and/or labeled IPsec;
although I'm confident it is still a minor percentage of the overall
Linux installed base.

For me the big question is what would it take for us to get a security
blob associated with the skb?  Would moving the secmark into the skb
extension be enough?  Something else?  Or is this simply never going
to happen?  I want to remain optimistic, but I've been trying for this
off-and-on for over a decade and keep running into a brick wall ;)

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH] net/ipv4/tcp_bpf: Delete an unnecessary check before the function call “consume_skb”
From: Markus Elfring @ 2019-08-22 16:32 UTC (permalink / raw)
  To: netdev, bpf, Alexei Starovoitov, Alexey Kuznetsov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Hideaki Yoshifuji,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 22 Aug 2019 18:20:42 +0200

The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 net/ipv4/tcp_bpf.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 8a56e09cfb0e..4ae18bd431a0 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -103,8 +103,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		msg_rx->sg.start = i;
 		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
 			list_del(&msg_rx->list);
-			if (msg_rx->skb)
-				consume_skb(msg_rx->skb);
+			consume_skb(msg_rx->skb);
 			kfree(msg_rx);
 		}
 		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
--
2.23.0


^ permalink raw reply related

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: Alexander Duyck @ 2019-08-22 16:38 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu
In-Reply-To: <20190822123037.28068-1-i.maximets@samsung.com>

On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the comletion queue ring.
>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
> complications implemented in the regular 'ixgbe_clean_tx_irq()'
> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
> indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
>  1 file changed, 13 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> index 6b609553329f..d1297660e14a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> @@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>                             struct ixgbe_ring *tx_ring, int napi_budget)
>  {
> +       u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>         unsigned int total_packets = 0, total_bytes = 0;
> -       u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>         unsigned int budget = q_vector->tx.work_limit;
>         struct xdp_umem *umem = tx_ring->xsk_umem;
> -       union ixgbe_adv_tx_desc *tx_desc;
> -       struct ixgbe_tx_buffer *tx_bi;
> +       u32 xsk_frames = 0;
>         bool xmit_done;
>
> -       tx_bi = &tx_ring->tx_buffer_info[i];
> -       tx_desc = IXGBE_TX_DESC(tx_ring, i);
> -       i -= tx_ring->count;
> +       while (likely(ntc != ntu && budget)) {

I would say you can get rid of budget entirely. It was only really
needed for the regular Tx case where you can have multiple CPUs
feeding a single Tx queue and causing a stall. Since we have a 1:1
mapping we should never have more than the Rx budget worth of packets
to really process. In addition we can only make one pass through the
ring since the ntu value is not updated while running the loop.

> +               union ixgbe_adv_tx_desc *tx_desc;
> +               struct ixgbe_tx_buffer *tx_bi;
> +
> +               tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>
> -       do {
>                 if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>                         break;
>
> +               tx_bi = &tx_ring->tx_buffer_info[ntc];

Please don't move this logic into the loop. We were intentionally
processing this outside of the loop once and then just doing the
increments because it is faster that way. It takes several operations
to compute tx_bi based on ntc, whereas just incrementing is a single
operation.

>                 total_bytes += tx_bi->bytecount;
>                 total_packets += tx_bi->gso_segs;
>
> @@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>
>                 tx_bi->xdpf = NULL;
>
> -               tx_bi++;
> -               tx_desc++;
> -               i++;
> -               if (unlikely(!i)) {
> -                       i -= tx_ring->count;

So these two lines can probably just be replaced by:
if (unlikely(ntc == tx_ring->count)) {
        ntc = 0;

> -                       tx_bi = tx_ring->tx_buffer_info;
> -                       tx_desc = IXGBE_TX_DESC(tx_ring, 0);
> -               }
> -
> -               /* issue prefetch for next Tx descriptor */
> -               prefetch(tx_desc);

Did you just drop the prefetch? You are changing way too much with
this patch. All you should need to do is replace i with ntc, replace
the "do {" with "while (ntc != ntu) {", and remove the while at the
end.

> +               ntc++;
> +               if (unlikely(ntc == tx_ring->count))
> +                       ntc = 0;
>
>                 /* update budget accounting */
>                 budget--;
> -       } while (likely(budget));

As I stated earlier, budget can be removed entirely.

> +       }
>
> -       i += tx_ring->count;
> -       tx_ring->next_to_clean = i;
> +       tx_ring->next_to_clean = ntc;
>
>         u64_stats_update_begin(&tx_ring->syncp);
>         tx_ring->stats.bytes += total_bytes;
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22 16:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu
In-Reply-To: <CAKgT0Uf26P53EA4m503aehq3tWCX9b3C+17TW2Ursbue9Kp=_w@mail.gmail.com>

On 22.08.2019 19:38, Alexander Duyck wrote:
> On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> Tx code doesn't clear the descriptors' status after cleaning.
>> So, if the budget is larger than number of used elems in a ring, some
>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>> prod_tail far beyond the prod_head breaking the comletion queue ring.
>>
>> Fix that by limiting the number of descriptors to clean by the number
>> of used descriptors in the tx ring.
>>
>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
>> complications implemented in the regular 'ixgbe_clean_tx_irq()'
>> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
>> indexes.
>>
>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> ---
>>
>> Version 2:
>>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>>     'ixgbe_xsk_clean_tx_ring()'.
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
>>  1 file changed, 13 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 6b609553329f..d1297660e14a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>                             struct ixgbe_ring *tx_ring, int napi_budget)
>>  {
>> +       u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>>         unsigned int total_packets = 0, total_bytes = 0;
>> -       u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>>         unsigned int budget = q_vector->tx.work_limit;
>>         struct xdp_umem *umem = tx_ring->xsk_umem;
>> -       union ixgbe_adv_tx_desc *tx_desc;
>> -       struct ixgbe_tx_buffer *tx_bi;
>> +       u32 xsk_frames = 0;
>>         bool xmit_done;
>>
>> -       tx_bi = &tx_ring->tx_buffer_info[i];
>> -       tx_desc = IXGBE_TX_DESC(tx_ring, i);
>> -       i -= tx_ring->count;
>> +       while (likely(ntc != ntu && budget)) {
> 
> I would say you can get rid of budget entirely. It was only really
> needed for the regular Tx case where you can have multiple CPUs
> feeding a single Tx queue and causing a stall. Since we have a 1:1
> mapping we should never have more than the Rx budget worth of packets
> to really process. In addition we can only make one pass through the
> ring since the ntu value is not updated while running the loop.

OK. Will remove.

> 
>> +               union ixgbe_adv_tx_desc *tx_desc;
>> +               struct ixgbe_tx_buffer *tx_bi;
>> +
>> +               tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>>
>> -       do {
>>                 if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>>                         break;
>>
>> +               tx_bi = &tx_ring->tx_buffer_info[ntc];
> 
> Please don't move this logic into the loop. We were intentionally
> processing this outside of the loop once and then just doing the
> increments because it is faster that way. It takes several operations
> to compute tx_bi based on ntc, whereas just incrementing is a single
> operation.

OK.

> 
>>                 total_bytes += tx_bi->bytecount;
>>                 total_packets += tx_bi->gso_segs;
>>
>> @@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>
>>                 tx_bi->xdpf = NULL;
>>
>> -               tx_bi++;
>> -               tx_desc++;
>> -               i++;
>> -               if (unlikely(!i)) {
>> -                       i -= tx_ring->count;
> 
> So these two lines can probably just be replaced by:
> if (unlikely(ntc == tx_ring->count)) {
>         ntc = 0;

Sure.

> 
>> -                       tx_bi = tx_ring->tx_buffer_info;
>> -                       tx_desc = IXGBE_TX_DESC(tx_ring, 0);
>> -               }
>> -
>> -               /* issue prefetch for next Tx descriptor */
>> -               prefetch(tx_desc);
> 
> Did you just drop the prefetch?

I'll keep the prefetch in v3 because, as you fairly mentioned, it's not
related to this patch. However, I'm not sure if this prefetch makes any
sense here, because there is only one comparison operation between the
prefetch and the data usage:

 while (ntc != ntu) {
     if (!(tx_desc->wb.status ...
     <...>
     prefetch(tx_desc);
 }


> You are changing way too much with
> this patch. All you should need to do is replace i with ntc, replace
> the "do {" with "while (ntc != ntu) {", and remove the while at the
> end.
> 
>> +               ntc++;
>> +               if (unlikely(ntc == tx_ring->count))
>> +                       ntc = 0;
>>
>>                 /* update budget accounting */
>>                 budget--;
>> -       } while (likely(budget));
> 
> As I stated earlier, budget can be removed entirely.

Sure.

> 
>> +       }
>>
>> -       i += tx_ring->count;
>> -       tx_ring->next_to_clean = i;
>> +       tx_ring->next_to_clean = ntc;
>>
>>         u64_stats_update_begin(&tx_ring->syncp);
>>         tx_ring->stats.bytes += total_bytes;
>> --
>> 2.17.1

^ permalink raw reply

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: Alexander Duyck @ 2019-08-22 17:10 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu
In-Reply-To: <7e9e426c-92eb-ebf8-2447-6c804a0c7135@samsung.com>

On Thu, Aug 22, 2019 at 9:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 22.08.2019 19:38, Alexander Duyck wrote:
> > On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>
> >> Tx code doesn't clear the descriptors' status after cleaning.
> >> So, if the budget is larger than number of used elems in a ring, some
> >> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >> prod_tail far beyond the prod_head breaking the comletion queue ring.
> >>
> >> Fix that by limiting the number of descriptors to clean by the number
> >> of used descriptors in the tx ring.
> >>
> >> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> >> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
> >> complications implemented in the regular 'ixgbe_clean_tx_irq()'
> >> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
> >> indexes.
> >>
> >> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >> ---
> >>
> >> Version 2:
> >>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
> >>     'ixgbe_xsk_clean_tx_ring()'.
> >>
> >>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
> >>  1 file changed, 13 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> >> index 6b609553329f..d1297660e14a 100644
> >> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> >> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> >> @@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
> >>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
> >>                             struct ixgbe_ring *tx_ring, int napi_budget)
> >>  {
> >> +       u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
> >>         unsigned int total_packets = 0, total_bytes = 0;
> >> -       u32 i = tx_ring->next_to_clean, xsk_frames = 0;
> >>         unsigned int budget = q_vector->tx.work_limit;
> >>         struct xdp_umem *umem = tx_ring->xsk_umem;
> >> -       union ixgbe_adv_tx_desc *tx_desc;
> >> -       struct ixgbe_tx_buffer *tx_bi;
> >> +       u32 xsk_frames = 0;
> >>         bool xmit_done;
> >>
> >> -       tx_bi = &tx_ring->tx_buffer_info[i];
> >> -       tx_desc = IXGBE_TX_DESC(tx_ring, i);
> >> -       i -= tx_ring->count;
> >> +       while (likely(ntc != ntu && budget)) {
> >
> > I would say you can get rid of budget entirely. It was only really
> > needed for the regular Tx case where you can have multiple CPUs
> > feeding a single Tx queue and causing a stall. Since we have a 1:1
> > mapping we should never have more than the Rx budget worth of packets
> > to really process. In addition we can only make one pass through the
> > ring since the ntu value is not updated while running the loop.
>
> OK. Will remove.
>
> >
> >> +               union ixgbe_adv_tx_desc *tx_desc;
> >> +               struct ixgbe_tx_buffer *tx_bi;
> >> +
> >> +               tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
> >>
> >> -       do {
> >>                 if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
> >>                         break;
> >>
> >> +               tx_bi = &tx_ring->tx_buffer_info[ntc];
> >
> > Please don't move this logic into the loop. We were intentionally
> > processing this outside of the loop once and then just doing the
> > increments because it is faster that way. It takes several operations
> > to compute tx_bi based on ntc, whereas just incrementing is a single
> > operation.
>
> OK.
>
> >
> >>                 total_bytes += tx_bi->bytecount;
> >>                 total_packets += tx_bi->gso_segs;
> >>
> >> @@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
> >>
> >>                 tx_bi->xdpf = NULL;
> >>
> >> -               tx_bi++;
> >> -               tx_desc++;
> >> -               i++;
> >> -               if (unlikely(!i)) {
> >> -                       i -= tx_ring->count;
> >
> > So these two lines can probably just be replaced by:
> > if (unlikely(ntc == tx_ring->count)) {
> >         ntc = 0;
>
> Sure.
>
> >
> >> -                       tx_bi = tx_ring->tx_buffer_info;
> >> -                       tx_desc = IXGBE_TX_DESC(tx_ring, 0);
> >> -               }
> >> -
> >> -               /* issue prefetch for next Tx descriptor */
> >> -               prefetch(tx_desc);
> >
> > Did you just drop the prefetch?
>
> I'll keep the prefetch in v3 because, as you fairly mentioned, it's not
> related to this patch. However, I'm not sure if this prefetch makes any
> sense here, because there is only one comparison operation between the
> prefetch and the data usage:
>
>  while (ntc != ntu) {
>      if (!(tx_desc->wb.status ...
>      <...>
>      prefetch(tx_desc);
>  }

I'm not opposed to dropping the prefetch, but if you are going to do
it you should do it in a separate patch.

^ permalink raw reply

* [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22 17:12 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu, Alexander Duyck, Ilya Maximets
In-Reply-To: <CGME20190822171243eucas1p12213f2239d6c36be515dade41ed7470b@eucas1p1.samsung.com>

Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.

Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the tx ring.

'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
'next_to_clean' and 'next_to_use' indexes.

Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---

Version 3:
  * Reverted some refactoring made for v2.
  * Eliminated 'budget' for tx clean.
  * prefetch returned.

Version 2:
  * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
    'ixgbe_xsk_clean_tx_ring()'.

 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..a3b6d8c89127 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -633,19 +633,17 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
 bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 			    struct ixgbe_ring *tx_ring, int napi_budget)
 {
+	u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
 	unsigned int total_packets = 0, total_bytes = 0;
-	u32 i = tx_ring->next_to_clean, xsk_frames = 0;
-	unsigned int budget = q_vector->tx.work_limit;
 	struct xdp_umem *umem = tx_ring->xsk_umem;
 	union ixgbe_adv_tx_desc *tx_desc;
 	struct ixgbe_tx_buffer *tx_bi;
-	bool xmit_done;
+	u32 xsk_frames = 0;
 
-	tx_bi = &tx_ring->tx_buffer_info[i];
-	tx_desc = IXGBE_TX_DESC(tx_ring, i);
-	i -= tx_ring->count;
+	tx_bi = &tx_ring->tx_buffer_info[ntc];
+	tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
 
-	do {
+	while (ntc != ntu) {
 		if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
 			break;
 
@@ -661,22 +659,18 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		tx_bi++;
 		tx_desc++;
-		i++;
-		if (unlikely(!i)) {
-			i -= tx_ring->count;
+		ntc++;
+		if (unlikely(ntc == tx_ring->count)) {
+			ntc = 0;
 			tx_bi = tx_ring->tx_buffer_info;
 			tx_desc = IXGBE_TX_DESC(tx_ring, 0);
 		}
 
 		/* issue prefetch for next Tx descriptor */
 		prefetch(tx_desc);
+	}
 
-		/* update budget accounting */
-		budget--;
-	} while (likely(budget));
-
-	i += tx_ring->count;
-	tx_ring->next_to_clean = i;
+	tx_ring->next_to_clean = ntc;
 
 	u64_stats_update_begin(&tx_ring->syncp);
 	tx_ring->stats.bytes += total_bytes;
@@ -688,8 +682,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 	if (xsk_frames)
 		xsk_umem_complete_tx(umem, xsk_frames);
 
-	xmit_done = ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
-	return budget > 0 && xmit_done;
+	return ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
 }
 
 int ixgbe_xsk_async_xmit(struct net_device *dev, u32 qid)
-- 
2.17.1


^ permalink raw reply related

* RE: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Charles.Hyde @ 2019-08-22 17:14 UTC (permalink / raw)
  To: oneukum, gregkh
  Cc: Mario.Limonciello, nic_swsd, linux-acpi, linux-usb, netdev
In-Reply-To: <1566461295.8347.19.camel@suse.com>

> > <snipped>
> > >
> > > This is a VERY cdc-net-specific function.  It is not a "generic" USB
> > > function at all.  Why does it belong in the USB core?  Shouldn't it
> > > live in the code that handles the other cdc-net-specific logic?
> > >
> > > thanks,
> > >
> > > greg k-h
> >
> >
> > Thank you for this feedback, Greg.  I was not sure about adding this to
> message.c, because of the USB_CDC_GET_NET_ADDRESS.  I had found
> references to SET_ADDRESS in the USB protocol at
> https://wiki.osdev.org/Universal_Serial_Bus#USB_Protocol.  If one wanted a
> generic USB function for SET_ADDRESS, to be used for both sending a MAC
> address and receiving one, how would you suggest this be implemented?  This
> is a legit question because I am curious.
> 
> Your implementation was, except for missing error handling, usable.
> The problem is where you put it. CDC messages exist only for CDC devices. Now
> it is true that there is no generic CDC driver.
> Creating a module just for that would cost more memory than it saves in most
> cases.
> But MACs are confined to network devices. Hence the functionality can be put
> into usbnet. It should not be put into any individual driver, so that every
> network driver can use it without duplication.
> 
> > Your feedback led to moving the functionality into cdc_ncm.c for today's
> testing, and removing all changes from messages.c, usb.h, usbnet.c, and
> usbnet.h.  This may be where I end up long term, but I would like to learn if
> there is a possible solution that could live in message.c and be callable from
> other USB-to-Ethernet aware drivers.
> 
> All those drivers use usbnet. Hence there it should be.
> 
> 	Regards
> 		Oliver


Some of the drivers in drivers/net/usb/ do call functions in drivers/net/usb/usbnet, but not all.  As Greg pointed out, the USB change I developed is cdc specific, so putting it into usbnet would raise the same concerns Greg mentioned.  Leaving my newest implementation in cdc_ncm.c will be most appropriate, as it also fits with what other drivers in this folder have done.  My original code was rather short sighted, at best.

Charles

^ permalink raw reply

* Re: [PATCH net-next 03/10] net: dsa: mv88e6xxx: move hidden registers operations in own file
From: Vivien Didelot @ 2019-08-22 17:21 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Marek Behún, netdev, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190822131047.GE13020@lunn.ch>

Hi Marek, Andrew,

On Thu, 22 Aug 2019 15:10:47 +0200, Andrew Lunn <andrew@lunn.ch> wrote:
> On Thu, Aug 22, 2019 at 01:27:17AM +0200, Marek Behún wrote:
> > This patch moves the functions operating on the hidden debug registers
> > into it's own file, hidden.c.
> 
> Humm, actually...
> 
> These are in the port register space. Maybe it would be better to move
> them into port.c/port.h?
> 
> What you really need is that they have global scope within the driver
> so you can call them. So add the functions definitions to port.h.
> 
> Vivien, what do you think?

Andrew's correct. Code accessing internal registers in the mv88e6xxx driver
is split per internal SMI device. An internal SMI device is a "column" of 32
registers found in the datasheet, accessed via its dedicated internal SMI
device address, like ports, Global 1 (often 0x1b), Global 2 (often 0x1c),
and so on. Each internal SMI device has its unique header file, describing
all registers it contains. Then if the corresponding .c file has a portion
specific enough, it can be moved to its own .c file, like global1_atu.c,
global1_vtu.c, etc.

So keep these port registers definitions ordered in port.h with a naming as
closed to the documentation as possible, prefixing them with MV88E6XXX_PORT_
(or the model of reference if that is specific to a few models only), and
please describe the bits with an ordered 0x1234 format as well. The port.h
header is fortunately already a good example of how it should be done.

Then you can include it in a new port_hidden.c file, which implements
mv88e6xxx_port_hidden_* internal helpers (or mv88e6789_port_ if specific to a
model again) to access these hidden port registers, and use them as convenience
in chip.c:mv88e6xxx_errata_setup() or wherever a feature is implemented.

Thanks a lot,

	Vivien

^ permalink raw reply

* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: Alexander Duyck @ 2019-08-22 17:21 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu
In-Reply-To: <20190822171237.20798-1-i.maximets@samsung.com>

On Thu, Aug 22, 2019 at 10:12 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the completion queue ring.
>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
> 'next_to_clean' and 'next_to_use' indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
>
> Version 3:
>   * Reverted some refactoring made for v2.
>   * Eliminated 'budget' for tx clean.
>   * prefetch returned.
>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
>  1 file changed, 11 insertions(+), 18 deletions(-)

Thanks for addressing my concerns.

Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

^ permalink raw reply

* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-22 17:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Ilya Maximets, Netdev, LKML, bpf, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <CAKgT0UepBGqx=FiqrdC-r3kvkMxVAHonkfc6rDt_HVQuzahZPQ@mail.gmail.com>

On Thu, Aug 22, 2019 at 10:21 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, Aug 22, 2019 at 10:12 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >
> > Tx code doesn't clear the descriptors' status after cleaning.
> > So, if the budget is larger than number of used elems in a ring, some
> > descriptors will be accounted twice and xsk_umem_complete_tx will move
> > prod_tail far beyond the prod_head breaking the completion queue ring.
> >
> > Fix that by limiting the number of descriptors to clean by the number
> > of used descriptors in the tx ring.
> >
> > 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> > 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
> > 'next_to_clean' and 'next_to_use' indexes.
> >
> > Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> > Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> > ---
> >
> > Version 3:
> >   * Reverted some refactoring made for v2.
> >   * Eliminated 'budget' for tx clean.
> >   * prefetch returned.
> >
> > Version 2:
> >   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
> >     'ixgbe_xsk_clean_tx_ring()'.
> >
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
> >  1 file changed, 11 insertions(+), 18 deletions(-)
>
> Thanks for addressing my concerns.
>
> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Thanks.

Tested-by: William Tu <u9012063@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next,v4, 3/6] net/mlx5: Add wrappers for HyperV PCIe operations
From: Leon Romanovsky @ 2019-08-22 17:38 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
	eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, linux-kernel@vger.kernel.org
In-Reply-To: <1566450236-36757-4-git-send-email-haiyangz@microsoft.com>

On Thu, Aug 22, 2019 at 05:05:47AM +0000, Haiyang Zhang wrote:
> From: Eran Ben Elisha <eranbe@mellanox.com>
>
> Add wrapper functions for HyperV PCIe read / write /
> block_invalidate_register operations.  This will be used as an
> infrastructure in the downstream patch for software communication.
>
> This will be enabled by default if CONFIG_PCI_HYPERV_INTERFACE is set.
>
> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile |  1 +
>  drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c | 64 ++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h | 22 ++++++++
>  3 files changed, 87 insertions(+)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index bcf3655..fd32a5b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -45,6 +45,7 @@ mlx5_core-$(CONFIG_MLX5_ESWITCH)   += eswitch.o eswitch_offloads.o eswitch_offlo
>  mlx5_core-$(CONFIG_MLX5_MPFS)      += lib/mpfs.o
>  mlx5_core-$(CONFIG_VXLAN)          += lib/vxlan.o
>  mlx5_core-$(CONFIG_PTP_1588_CLOCK) += lib/clock.o
> +mlx5_core-$(CONFIG_PCI_HYPERV_INTERFACE) += lib/hv.o
>
>  #
>  # Ipoib netdev
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
> new file mode 100644
> index 0000000..cf08d02
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +// Copyright (c) 2018 Mellanox Technologies
> +
> +#include <linux/hyperv.h>
> +#include "mlx5_core.h"
> +#include "lib/hv.h"
> +
> +static int mlx5_hv_config_common(struct mlx5_core_dev *dev, void *buf, int len,
> +				 int offset, bool read)
> +{
> +	int rc = -EOPNOTSUPP;
> +	int bytes_returned;
> +	int block_id;
> +
> +	if (offset % HV_CONFIG_BLOCK_SIZE_MAX || len % HV_CONFIG_BLOCK_SIZE_MAX)
> +		return -EINVAL;
> +
> +	block_id = offset / HV_CONFIG_BLOCK_SIZE_MAX;
> +
> +	rc = read ?
> +	     hyperv_read_cfg_blk(dev->pdev, buf,
> +				 HV_CONFIG_BLOCK_SIZE_MAX, block_id,
> +				 &bytes_returned) :
> +	     hyperv_write_cfg_blk(dev->pdev, buf,
> +				  HV_CONFIG_BLOCK_SIZE_MAX, block_id);
> +
> +	/* Make sure len bytes were read successfully  */
> +	if (read)
> +		rc |= !(len == bytes_returned);

Unclear what do you want to achieve here, for read == true, "rc" will
have result of hyperv_read_cfg_blk(), which can be an error too.
It means that your "rc |= .." will give interesting results.

> +
> +	if (rc) {
> +		mlx5_core_err(dev, "Failed to %s hv config, err = %d, len = %d, offset = %d\n",
> +			      read ? "read" : "write", rc, len,
> +			      offset);
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +int mlx5_hv_read_config(struct mlx5_core_dev *dev, void *buf, int len,
> +			int offset)
> +{
> +	return mlx5_hv_config_common(dev, buf, len, offset, true);
> +}
> +
> +int mlx5_hv_write_config(struct mlx5_core_dev *dev, void *buf, int len,
> +			 int offset)
> +{
> +	return mlx5_hv_config_common(dev, buf, len, offset, false);
> +}
> +
> +int mlx5_hv_register_invalidate(struct mlx5_core_dev *dev, void *context,
> +				void (*block_invalidate)(void *context,
> +							 u64 block_mask))
> +{
> +	return hyperv_reg_block_invalidate(dev->pdev, context,
> +					   block_invalidate);
> +}
> +
> +void mlx5_hv_unregister_invalidate(struct mlx5_core_dev *dev)
> +{
> +	hyperv_reg_block_invalidate(dev->pdev, NULL, NULL);
> +}
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
> new file mode 100644
> index 0000000..f9a4557
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/* Copyright (c) 2019 Mellanox Technologies. */
> +
> +#ifndef __LIB_HV_H__
> +#define __LIB_HV_H__
> +
> +#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)

The common way to write it if(IS_ENABLED(CONFIG_FOO)) inside the code
and #ifdef CONFIG_FOO for header files.

> +
> +#include <linux/hyperv.h>
> +#include <linux/mlx5/driver.h>
> +
> +int mlx5_hv_read_config(struct mlx5_core_dev *dev, void *buf, int len,
> +			int offset);
> +int mlx5_hv_write_config(struct mlx5_core_dev *dev, void *buf, int len,
> +			 int offset);
> +int mlx5_hv_register_invalidate(struct mlx5_core_dev *dev, void *context,
> +				void (*block_invalidate)(void *context,
> +							 u64 block_mask));
> +void mlx5_hv_unregister_invalidate(struct mlx5_core_dev *dev);
> +#endif
> +
> +#endif /* __LIB_HV_H__ */
> --
> 1.8.3.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox