Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: XDP redirect measurements, gotchas and tracepoints
From: Duyck, Alexander H @ 2017-08-22 18:30 UTC (permalink / raw)
  To: john.fastabend@gmail.com, brouer@redhat.com,
	michael.chan@broadcom.com
  Cc: pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, andy@greyhouse.net,
	borkmann@iogearbox.net
In-Reply-To: <599C7530.2010405@gmail.com>

On Tue, 2017-08-22 at 11:17 -0700, John Fastabend wrote:
> On 08/22/2017 11:02 AM, Michael Chan wrote:
> > On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > 
> > > I'be been playing with the latest XDP_REDIRECT feature, that was
> > > accepted in net-next (for ixgbe), see merge commit[1].
> > >  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
> > > 
> > 
> > Just catching on XDP_REDIRECT and I have a very basic question.  The
> > ingress device passes the XDP buffer to the egress device for XDP
> > redirect transmission.  When the egress device has transmitted the
> > packet, is it supposed to just free the buffer?  Or is it supposed to
> > be recycled?
> > 
> > In XDP_TX, the buffer is recycled back to the rx ring.
> > 
> 
> With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
> page_frag_free() on the data. There is no way to know where the xdp
> buffer came from it could be a different NIC for example.
> 
> However with how ixgbe is coded up recycling will work as long as
> the memory is free'd before the driver ring tries to use it again. In
> normal usage this should be the case. And if we are over-running a device
> it doesn't really hurt to slow down the sender a bit.
> 
> I think this is a pretty good model, we could probably provide a set
> of APIs for drivers to use so that we get some consistency across
> vendors here, ala Jesper's page pool ideas.
> 
> (+Alex, for ixgbe details)
> 
> Thanks,
> John

I think you pretty much covered the inner workings for the ixgbe bits.

The only piece I would add is that the recycling trick normally only
works if the same interface/driver is doing both the Tx and the Rx. The
redirect code cannot assume that is the case and that is the reason why
it must always be freeing the traffic on clean-up.

- Alex

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Eric Dumazet @ 2017-08-22 18:28 UTC (permalink / raw)
  To: David Miller; +Cc: willemdebruijn.kernel, mst, netdev, den, virtualization
In-Reply-To: <20170822.110108.343109469263087166.davem@davemloft.net>

On Tue, 2017-08-22 at 11:01 -0700, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Tue, 22 Aug 2017 20:55:56 +0300
> 
> > Which reminds me that skb_linearize in net core seems to be
> > fundamentally racy - I suspect that if skb is cloned, and someone is
> > trying to use the shared frags while another thread calls skb_linearize,
> > we get some use after free bugs which likely mostly go undetected
> > because the corrupted packets mostly go on wire and get dropped
> > by checksum code.
> 
> Indeed, it does assume that the skb from which the clone was made
> never has it's geometry changed.
> 
> I don't think even the TCP retransmit queue has this guarantee.

TCP retransmit makes sure to avoid that.

if (skb_unclone(skb, GFP_ATOMIC))
     return -ENOMEM;

( Before cloning again skb )

^ permalink raw reply

* Re: [PATCH net] net: dsa: skb_put_padto() already frees nskb
From: David Miller @ 2017-08-22 18:27 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, andrew, vivien.didtelot, woojung.huh
In-Reply-To: <20170821194131.27839-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 21 Aug 2017 12:41:31 -0700

> skb_put_padto() already frees the passed sk_buff reference upon error,
> so calling kfree_skb() on it again is not necessary.
> 
> Detected by CoverityScan, CID#1416687 ("USE_AFTER_FREE")
> 
> Fixes: e71cb9e00922 ("net: dsa: ksz: fix skb freeing")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

I know it seems like a lot of work, but with appropriate wrappers we
can control the freeing that skb_pad() does at the deepest part of
this call chain.

int __skb_pad(struct sk_buff *skb, int pad, bool free_skb_on_err);
static inline int skb_pad(struct sk_buff *skb, int pad)
{
	return __skb_pad(skb, pad, true);
}

static inline int __skb_put_padto(struct sk_buff *skb, unsigned int len,
				  bool free_skb_on_err)
{
	unsigned int size = skb->len;

	if (unlikely(size < len)) {
		len -= size;
		if (__skb_pad(skb, len, free_skb_on_err))
			return -ENOMEM;
		__skb_put(skb, len);
	}
	return 0;
}

static inline int skb_put_padto(struct sk_buff *skb, unsigned int len)
{
	return __skb_put_padto(skb, len, true);
}

And then here in the ksz_xmit() code, invoke __skb_put_padto() with the
boolean set appropriately.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bpf/verifier: when pruning a branch, ignore its write marks
From: Alexei Starovoitov via iovisor-dev @ 2017-08-22 18:25 UTC (permalink / raw)
  To: Edward Cree, davem-fT/PcQaiUtIeIZ0/mPfg9Q, Alexei Starovoitov,
	Daniel Borkmann
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev
In-Reply-To: <9a8076d1-a88b-b0e8-3c47-969a022feb60-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>

On 8/22/17 11:03 AM, Edward Cree wrote:
> On 22/08/17 16:50, Edward Cree wrote:
>> On 22/08/17 16:24, Alexei Starovoitov wrote:
>>> Do you have a test case for this by any chance?
>> I think something like
>>     if (cond)
>>         r0=0;
>>     if (cond)
>>         r0=0;
>>     return r0;
>> might tickle the bug, but I'm not sure.
> It turns out that (cond) has to be constructed not to alter our knowledge
>  of whatever register we're testing, but apart from that, this works.
>     {
>         "liveness pruning and write screening",
>         .insns = {
>             /* Get an unknown value */
>             BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
>             /* branch conditions teach us nothing about R2 */
>             BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
>             BPF_MOV64_IMM(BPF_REG_0, 0),
>             BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
>             BPF_MOV64_IMM(BPF_REG_0, 0),
>             BPF_EXIT_INSN(),
>         },
>         .errstr = "R0 !read_ok",
>         .result = REJECT,
>         .prog_type = BPF_PROG_TYPE_LWT_IN,
>     },
> This test fails on net-next, but passes with my patch.
> I'll include it in the next spin of the series.

nice. thanks for the test!

^ permalink raw reply

* Re: [PATCH net-next v3 2/2] tcp: Remove the unused parameter for tcp_try_fastopen.
From: Eric Dumazet @ 2017-08-22 18:19 UTC (permalink / raw)
  To: Tonghao Zhang; +Cc: netdev
In-Reply-To: <1503383629-12392-3-git-send-email-xiangxia.m.yue@gmail.com>

On Mon, 2017-08-21 at 23:33 -0700, Tonghao Zhang wrote:
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> ---
>  include/net/tcp.h       | 3 +--
>  net/ipv4/tcp_fastopen.c | 6 ++----
>  net/ipv4/tcp_input.c    | 2 +-
>  3 files changed, 4 insertions(+), 7 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: John Fastabend @ 2017-08-22 18:17 UTC (permalink / raw)
  To: Michael Chan, Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, Daniel Borkmann, Andy Gospodarek,
	netdev@vger.kernel.org, Paweł Staszewski, Alexander H Duyck
In-Reply-To: <CACKFLimAF9OYwz8EqG=1-N40hWuA-8C8k8ut0oFKYdduvZOxHA@mail.gmail.com>

On 08/22/2017 11:02 AM, Michael Chan wrote:
> On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> I'be been playing with the latest XDP_REDIRECT feature, that was
>> accepted in net-next (for ixgbe), see merge commit[1].
>>  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>>
> 
> Just catching on XDP_REDIRECT and I have a very basic question.  The
> ingress device passes the XDP buffer to the egress device for XDP
> redirect transmission.  When the egress device has transmitted the
> packet, is it supposed to just free the buffer?  Or is it supposed to
> be recycled?
> 
> In XDP_TX, the buffer is recycled back to the rx ring.
> 

With XDP_REDIRECT we must "just free the buffer" in ixgbe this means
page_frag_free() on the data. There is no way to know where the xdp
buffer came from it could be a different NIC for example.

However with how ixgbe is coded up recycling will work as long as
the memory is free'd before the driver ring tries to use it again. In
normal usage this should be the case. And if we are over-running a device
it doesn't really hurt to slow down the sender a bit.

I think this is a pretty good model, we could probably provide a set
of APIs for drivers to use so that we get some consistency across
vendors here, ala Jesper's page pool ideas.

(+Alex, for ixgbe details)

Thanks,
John

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] tcp: Get a proper dst before checking it.
From: Eric Dumazet @ 2017-08-22 18:14 UTC (permalink / raw)
  To: Tonghao Zhang; +Cc: netdev
In-Reply-To: <1503383629-12392-2-git-send-email-xiangxia.m.yue@gmail.com>

On Mon, 2017-08-21 at 23:33 -0700, Tonghao Zhang wrote:
> tcp_peer_is_proven needs a proper route to make the
> determination, but dst always is NULL. This bug may
> be there at the beginning of git tree. This does not
> look serious enough to deserve backports to stable
> versions.
> 
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH v3 3/4] net: stmmac: register parent MDIO node for sun8i-h3-emac
From: Corentin Labbe @ 2017-08-22 18:11 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Chen-Yu Tsai, Andrew Lunn, Maxime Ripard, Rob Herring,
	Mark Rutland, Russell King, Giuseppe Cavallaro, Alexandre Torgue,
	devicetree, linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <bfe784b6-5137-8fcd-dbd0-ad96974f4060@gmail.com>

On Tue, Aug 22, 2017 at 09:40:24AM -0700, Florian Fainelli wrote:
> On 08/22/2017 08:39 AM, Chen-Yu Tsai wrote:
> > On Mon, Aug 21, 2017 at 10:23 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >>> All muxes are mostly always represented the same way afaik, or do you
> >>> want to simply introduce a new compatible / property?
> >>
> >> +         mdio-mux {
> >> +               compatible = "allwinner,sun8i-h3-mdio-switch";
> >> +               mdio-parent-bus = <&mdio_parent>;
> >> +               #address-cells = <1>;
> >> +               #size-cells = <0>;
> >> +
> >> +               internal_mdio: mdio@1 {
> >>                         reg = <1>;
> >> -                       clocks = <&ccu CLK_BUS_EPHY>;
> >> -                       resets = <&ccu RST_BUS_EPHY>;
> >> +                       #address-cells = <1>;
> >> +                       #size-cells = <0>;
> >> +                       int_mii_phy: ethernet-phy@1 {
> >> +                               compatible = "ethernet-phy-ieee802.3-c22";
> >> +                               reg = <1>;
> >> +                               clocks = <&ccu CLK_BUS_EPHY>;
> >> +                               resets = <&ccu RST_BUS_EPHY>;
> >> +                               phy-is-integrated;
> >> +                       };
> >> +               };
> >> +               mdio: mdio@0 {
> >> +                       reg = <0>;
> >> +                       #address-cells = <1>;
> >> +                       #size-cells = <0>;
> >>                 };
> >>
> >> Hi Maxim
> >>
> >> Anybody who knows the MDIO-mux code/binding, knows that it is a run
> >> time mux. You swap the mux per MDIO transaction. You can access all
> >> the PHY and switches on the mux'ed MDIO bus.
> >>
> >> However here, it is effectively a boot-time MUX. You cannot change it
> >> on the fly. What happens when somebody has a phandle to a PHY on the
> >> internal and a phandle to a phy on the external? Does the driver at
> >> least return -EINVAL, or -EBUSY? Is there a representation which
> >> eliminates this possibility?
> > 
> > There is only one controller. Either you use the internal PHY, which
> > is then directly coupled (no magnetics needed) to the RJ45 port, or
> > you use an external PHY over MII/RMII/RGMII. You could supposedly
> > have both on a board, and let the user choose one. But why bother
> > with the extra complexity and cost? Either you use the internal PHY
> > at 100M, or an external RGMII PHY for gigabit speeds.
> 
> I agree, there is no point in over-engineering any of this. I don't
> think there is actually any MDIO mux per-se in that the MDIO clock and
> data lines are muxed, however there has to be some kind of built-in port
> multiplexer that lets you chose between connecting to the internal PHY
> and any external PHY/MAC, but that is not what a "mdio-mux" node represents.
> 
> > 
> > So I think what you are saying is either impossible or engineering-wise
> > a very stupid design, like using an external MAC with a discrete PHY
> > connected to the internal MAC's MDIO bus, while using the internal MAC
> > with the internal PHY.
> > 
> > Now can we please decide on something? We're a week and a half from
> > the 4.13 release. If mdio-mux is wrong, then we could have two mdio
> > nodes (internal-mdio & external-mdio).
> 
> I really don't see a need for a mdio-mux in the first place, just have
> one MDIO controller (current state) sub-node which describes the
> built-in STMMAC MDIO controller and declare the internal PHY as a child
> node (along with 'phy-is-integrated'). If a different configuration is
> used, then just put the external PHY as a child node there.
> 
> If fixed-link is required, the mdio node becomes unused anyway.
> 
> Works for everyone?

If we put an external PHY with reg=1 as a child of internal MDIO, il will be merged with internal PHY node and get phy-is-integrated.

Does two MDIO node "internal-mdio" and "mdio" works for you ?
(We keep "mdio" for external MDIO for reducing the number of patchs)

Thanks
Regards

diff --git a/arch/arm/boot/dts/sunxi-h3-h5.dtsi b/arch/arm/boot/dts/sunxi-h3-h5.dtsi
index 4b599b5d26f6..d5e7cf0d9454 100644
--- a/arch/arm/boot/dts/sunxi-h3-h5.dtsi
+++ b/arch/arm/boot/dts/sunxi-h3-h5.dtsi
@@ -417,7 +417,8 @@
 			#size-cells = <0>;
 			status = "disabled";
 
-			mdio: mdio {
+			/* Only one MDIO is usable at the time */
+			internal_mdio: mdio@1 {
 				#address-cells = <1>;
 				#size-cells = <0>;
 				int_mii_phy: ethernet-phy@1 {
@@ -425,8 +426,13 @@
 					reg = <1>;
 					clocks = <&ccu CLK_BUS_EPHY>;
 					resets = <&ccu RST_BUS_EPHY>;
+					phy-is-integrated;
 				};
 			};
+			mdio: mdio@0 {
+				#address-cells = <1>;
+				#size-cells = <0>;
+			};
 		};
 
 		spi0: spi@01c68000 {

^ permalink raw reply related

* [PATCH] net: amd: constify zorro_device_id
From: Arvind Yadav @ 2017-08-22 18:11 UTC (permalink / raw)
  To: davem, jarod; +Cc: linux-kernel, netdev

zorro_device_id are not supposed to change at runtime. All functions
working with zorro_device_id provided by <linux/zorro.h> work with
const zorro_device_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
---
 drivers/net/ethernet/amd/a2065.c   | 2 +-
 drivers/net/ethernet/amd/ariadne.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/a2065.c b/drivers/net/ethernet/amd/a2065.c
index ee4b94e..e22f976 100644
--- a/drivers/net/ethernet/amd/a2065.c
+++ b/drivers/net/ethernet/amd/a2065.c
@@ -643,7 +643,7 @@ static int a2065_init_one(struct zorro_dev *z,
 static void a2065_remove_one(struct zorro_dev *z);
 
 
-static struct zorro_device_id a2065_zorro_tbl[] = {
+static const struct zorro_device_id a2065_zorro_tbl[] = {
 	{ ZORRO_PROD_CBM_A2065_1 },
 	{ ZORRO_PROD_CBM_A2065_2 },
 	{ ZORRO_PROD_AMERISTAR_A2065 },
diff --git a/drivers/net/ethernet/amd/ariadne.c b/drivers/net/ethernet/amd/ariadne.c
index 5fd7b15..4b6a5cb 100644
--- a/drivers/net/ethernet/amd/ariadne.c
+++ b/drivers/net/ethernet/amd/ariadne.c
@@ -692,7 +692,7 @@ static void ariadne_remove_one(struct zorro_dev *z)
 	free_netdev(dev);
 }
 
-static struct zorro_device_id ariadne_zorro_tbl[] = {
+static const struct zorro_device_id ariadne_zorro_tbl[] = {
 	{ ZORRO_PROD_VILLAGE_TRONIC_ARIADNE },
 	{ 0 }
 };
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH V2 net-next 0/2] liquidio: VF driver will notify NIC firmware of MTU change
From: David Miller @ 2017-08-22 18:08 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	veerasenareddy.burru
In-Reply-To: <20170821193527.GA6102@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Mon, 21 Aug 2017 12:35:27 -0700

> From: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>
> 
> Make VF driver notify NIC firmware of MTU change.  Firmware needs this
> information for MTU propagation and enforcement.
> 
> The first patch in this series moves a macro definition to a proper place
> to prevent a build error in the second patch which has the code that sends
> the notification.
> 
> Change Log:
>   V1 -> V2
>     * Add "From:" line to patch #1 and #2 to give credit to the author.
>     * In patch #2, order local variable declarations from longest to
>       shortest line.

Series applied.

^ permalink raw reply

* Re: [PATCH net-next v2 00/10] net: mvpp2: MAC/GoP configuration
From: Andrew Lunn @ 2017-08-22 18:07 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, jason, gregory.clement, sebastian.hesselbarth,
	thomas.petazzoni, nadavh, linux, mw, stefanc, netdev,
	linux-arm-kernel
In-Reply-To: <20170822170830.32413-1-antoine.tenart@free-electrons.com>

On Tue, Aug 22, 2017 at 07:08:20PM +0200, Antoine Tenart wrote:
> Hi all,
> 
> This is based on net-next (e2a7c34fb285).
> 
> I removed the GoP interrupt and PHY optional parts in this v2 to ease
> the review process as the MAC/GoP initialization seemed to start less
> discussions :)

Hi Antoine

By that, do you mean setting the SMI PHY address?

   Andrew

^ permalink raw reply

* Re: [PATCH] ptp: make ptp_clock_info const
From: David Miller @ 2017-08-22 18:05 UTC (permalink / raw)
  To: bhumirks; +Cc: julia.lawall, richardcochran, netdev, linux-kernel
In-Reply-To: <1503336672-22918-1-git-send-email-bhumirks@gmail.com>

From: Bhumika Goyal <bhumirks@gmail.com>
Date: Mon, 21 Aug 2017 23:01:12 +0530

> Make these const as they are only used in a copy operation.
> Done using Coccinelle.
 ...
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: ethernet: make ptp_clock_info const
From: David Miller @ 2017-08-22 18:05 UTC (permalink / raw)
  To: bhumirks
  Cc: julia.lawall, nicolas.ferre, richardcochran, peppe.cavallaro,
	alexandre.torgue, cmetcalf, adi-buildroot-devel, netdev,
	linux-kernel
In-Reply-To: <1503335210-16179-1-git-send-email-bhumirks@gmail.com>

From: Bhumika Goyal <bhumirks@gmail.com>
Date: Mon, 21 Aug 2017 22:36:50 +0530

> Make these const as they are only used in a copy operation.
> Done using Coccinelle.
 ...
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net v3] ipv6: add rcu grace period before freeing fib6_node
From: David Miller @ 2017-08-22 18:03 UTC (permalink / raw)
  To: weiwan; +Cc: netdev, edumazet, kafai
In-Reply-To: <20170821164710.134367-1-tracywwnj@gmail.com>

From: Wei Wang <weiwan@google.com>
Date: Mon, 21 Aug 2017 09:47:10 -0700

> From: Wei Wang <weiwan@google.com>
> 
> We currently keep rt->rt6i_node pointing to the fib6_node for the route.
> And some functions make use of this pointer to dereference the fib6_node
> from rt structure, e.g. rt6_check(). However, as there is neither
> refcount nor rcu taken when dereferencing rt->rt6i_node, it could
> potentially cause crashes as rt->rt6i_node could be set to NULL by other
> CPUs when doing a route deletion.
> This patch introduces an rcu grace period before freeing fib6_node and
> makes sure the functions that dereference it takes rcu_read_lock().
> 
> Note: there is no "Fixes" tag because this bug was there in a very
> early stage.
> 
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Martin KaFai Lau <kafai@fb.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bpf/verifier: when pruning a branch, ignore its write marks
From: Edward Cree @ 2017-08-22 18:03 UTC (permalink / raw)
  To: Alexei Starovoitov, davem, Alexei Starovoitov, Daniel Borkmann
  Cc: netdev, iovisor-dev
In-Reply-To: <97f3da2d-abb0-ba1f-07ca-ea426f30c7fb@solarflare.com>

On 22/08/17 16:50, Edward Cree wrote:
> On 22/08/17 16:24, Alexei Starovoitov wrote:
>> Do you have a test case for this by any chance?
> I think something like
>     if (cond)
>         r0=0;
>     if (cond)
>         r0=0;
>     return r0;
> might tickle the bug, but I'm not sure.
It turns out that (cond) has to be constructed not to alter our knowledge
 of whatever register we're testing, but apart from that, this works.
    {
        "liveness pruning and write screening",
        .insns = {
            /* Get an unknown value */
            BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0),
            /* branch conditions teach us nothing about R2 */
            BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
            BPF_MOV64_IMM(BPF_REG_0, 0),
            BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 0, 1),
            BPF_MOV64_IMM(BPF_REG_0, 0),
            BPF_EXIT_INSN(),
        },
        .errstr = "R0 !read_ok",
        .result = REJECT,
        .prog_type = BPF_PROG_TYPE_LWT_IN,
    },
This test fails on net-next, but passes with my patch.
I'll include it in the next spin of the series.

^ permalink raw reply

* Re: [PATCH V4 net-next] net: hns3: Add support to change MTU in HNS3 hardware
From: David Miller @ 2017-08-22 18:03 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil.lnk, netdev, linux-kernel,
	linux-rdma, linuxarm
In-Reply-To: <20170821160524.32760-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Mon, 21 Aug 2017 17:05:24 +0100

> This patch adds the following support to the HNS3 driver:
> 1. Support to change the Maximum Transmission Unit of a
>    port in the HNS NIC hardware.
> 2. Initializes the supported MTU range for the netdevice.
> 
> Signed-off-by: lipeng <lipeng321@huawei.com>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

Applied.

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: Michael Chan @ 2017-08-22 18:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, John Fastabend, Daniel Borkmann,
	Andy Gospodarek, netdev@vger.kernel.org, Paweł Staszewski
In-Reply-To: <20170821212506.1cb0d5d6@redhat.com>

On Mon, Aug 21, 2017 at 12:25 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> I'be been playing with the latest XDP_REDIRECT feature, that was
> accepted in net-next (for ixgbe), see merge commit[1].
>  [1] https://git.kernel.org/davem/net-next/c/6093ec2dc31
>

Just catching on XDP_REDIRECT and I have a very basic question.  The
ingress device passes the XDP buffer to the egress device for XDP
redirect transmission.  When the egress device has transmitted the
packet, is it supposed to just free the buffer?  Or is it supposed to
be recycled?

In XDP_TX, the buffer is recycled back to the rx ring.

^ permalink raw reply

* RE: [PATCH net] net: dsa: skb_put_padto() already frees nskb
From: Woojung.Huh @ 2017-08-22 18:01 UTC (permalink / raw)
  To: f.fainelli, netdev; +Cc: davem, andrew, vivien.didelot
In-Reply-To: <52a0e01e-8a55-2360-ca11-b65c613a812c@gmail.com>

> > Because skb_put_padto() frees skb when it fails,  below lines in
> e71cb9e00922
> > ("net: dsa: ksz: fix skb freeing") will be an issue to.
> >
> > 	if (skb_tailroom(skb) >= padlen + KSZ_INGRESS_TAG_LEN) {
> > +		if (skb_put_padto(skb, skb->len + padlen))
> > +			return NULL;
> > +
> >
> > When it fails skb will be freed twice in skb_put_padto() and
> > caller of dsa_slave_xmit().
> 
> You are right, I am not sure what is the best way to fix tag_ksz.c other
> than somehow open coding skb_put_padto() minus the freeing on error part?

Agree. May need to go back to 1st submission without skb_put_padto().
Will check more if any other ideas.

Thanks.
Woojung

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: David Miller @ 2017-08-22 18:01 UTC (permalink / raw)
  To: mst; +Cc: jasowang, willemdebruijn.kernel, den, virtualization, netdev
In-Reply-To: <20170822204015-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 22 Aug 2017 20:55:56 +0300

> Which reminds me that skb_linearize in net core seems to be
> fundamentally racy - I suspect that if skb is cloned, and someone is
> trying to use the shared frags while another thread calls skb_linearize,
> we get some use after free bugs which likely mostly go undetected
> because the corrupted packets mostly go on wire and get dropped
> by checksum code.

Indeed, it does assume that the skb from which the clone was made
never has it's geometry changed.

I don't think even the TCP retransmit queue has this guarantee.

^ permalink raw reply

* Re: [PATCH v12 0/8] Replace PCI pool by DMA pool API
From: Doug Ledford @ 2017-08-22 18:00 UTC (permalink / raw)
  To: Romain Perier, Dan Williams, Sean Hefty, Hal Rosenstock,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	stas.yakovlev-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170822114701.14907-1-romain.perier-ZGY8ohtN/8qB+jHODAdFcQ@public.gmane.org>

On Tue, 2017-08-22 at 13:46 +0200, Romain Perier wrote:
> The current PCI pool API are simple macro functions direct expanded
> to
> the appropriate dma pool functions. The prototypes are almost the
> same
> and semantically, they are very similar. I propose to use the DMA
> pool
> API directly and get rid of the old API.
> 
> This set of patches, replaces the old API by the dma pool API
> and remove the defines.
> 
> Changes in v12:
> - Rebased series onto next-20170822

Hi Romain,

I've applied the three that are in my purview to my for-next branch
(mthca, mlx4, and mlx5 patches).  Thanks.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH V8 net-next 00/22] Huawei HiNIC Ethernet Driver
From: David Miller @ 2017-08-22 17:58 UTC (permalink / raw)
  To: aviad.krawczyk
  Cc: linux-kernel, netdev, bc.y, victor.gissin, zhaochen6, tony.qu
In-Reply-To: <cover.1503330613.git.aviad.krawczyk@huawei.com>

From: Aviad Krawczyk <aviad.krawczyk@huawei.com>
Date: Mon, 21 Aug 2017 23:55:46 +0800

> The patch-set contains the support of the HiNIC Ethernet driver for
> hinic family of PCIE Network Interface Cards.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-22 17:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Willem de Bruijn, Koichiro Den, virtualization,
	Network Development
In-Reply-To: <64d451ae-9944-e978-5a05-54bb1a62aaad@redhat.com>

On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > Perhaps the descriptor pool should also be
> > revised to allow out of order completions. Then there is no need to
> > copy zerocopy packets whenever they may experience delay.
> 
> Yes, but as replied in the referenced thread, windows driver may treat out
> of order completion as a bug.

That would be a windows driver bug then, but I don't think it makes this
assumption. What the referenced thread
(https://patchwork.kernel.org/patch/3787671/) is saying is that host
must use any buffers made available on a tx vq within a reasonable
timeframe otherwise windows guests panic.

Ideally we would detect that a packet is actually experiencing delay and
trigger the copy at that point e.g. by calling skb_linearize. But it
isn't easy to track these packets though and even harder to do a data
copy without races.

Which reminds me that skb_linearize in net core seems to be
fundamentally racy - I suspect that if skb is cloned, and someone is
trying to use the shared frags while another thread calls skb_linearize,
we get some use after free bugs which likely mostly go undetected
because the corrupted packets mostly go on wire and get dropped
by checksum code.

-- 
MST

^ permalink raw reply

* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-22 17:50 UTC (permalink / raw)
  To: David Ahern; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn
In-Reply-To: <724144b0-ba06-9d10-56d5-24e22c574506@gmail.com>

On Tue, 2017-08-22 at 10:46 -0700, David Ahern wrote:
> On 8/22/17 10:44 AM, Eric Dumazet wrote:
> > Willem wrote this doc in 2013, before we finally went back to 1000.
> > 
> > We should update this doc.
> 
> 
> And these too:
> 
> $ egrep -r netdev_max_backlog Documentation/networking/
> Documentation/networking//cxgb.txt:      sysctl -w
> net.core.netdev_max_backlog=300000
> Documentation/networking//ixgb.txt:net.core.netdev_max_backlog = 300000

Yes, whoever wrote this had no idea of the implications I guess.

^ permalink raw reply

* Re: Something hitting my total number of connections to the server
From: David Ahern @ 2017-08-22 17:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn
In-Reply-To: <1503423863.2499.39.camel@edumazet-glaptop3.roam.corp.google.com>

On 8/22/17 10:44 AM, Eric Dumazet wrote:
> Willem wrote this doc in 2013, before we finally went back to 1000.
> 
> We should update this doc.


And these too:

$ egrep -r netdev_max_backlog Documentation/networking/
Documentation/networking//cxgb.txt:      sysctl -w
net.core.netdev_max_backlog=300000
Documentation/networking//ixgb.txt:net.core.netdev_max_backlog = 300000

^ permalink raw reply

* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-22 17:44 UTC (permalink / raw)
  To: David Ahern; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn
In-Reply-To: <bf580154-aea9-4143-5d9c-df3764b5e745@gmail.com>

On Tue, 2017-08-22 at 09:43 -0700, David Ahern wrote:
> On 8/22/17 6:02 AM, Eric Dumazet wrote:
> >>
> >> net.core.netdev_max_backlog=10000
> > This is an insane backlog.
> > 
> 
> https://www.kernel.org/doc/Documentation/networking/scaling.txt
> 
> "== Suggested Configuration
> 
> Flow limit is useful on systems with many concurrent connections,
> where a single connection taking up 50% of a CPU indicates a problem.
> In such environments, enable the feature on all CPUs that handle
> network rx interrupts (as set in /proc/irq/N/smp_affinity).
> 
> The feature depends on the input packet queue length to exceed
> the flow limit threshold (50%) + the flow history length (256).
> Setting net.core.netdev_max_backlog to either 1000 or 10000
> performed well in experiments."

10000 is adding tail latencies.

At Google we run all the fleet with backlog of 1000

And yes, it took time to get rid of the backlog of 10000 that was setup
years ago, because of old constraints and some fears.

Willem wrote this doc in 2013, before we finally went back to 1000.

We should update this doc.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox