* Re: DSA vs. SWTICHDEV ?
From: Florian Fainelli @ 2016-11-30 19:39 UTC (permalink / raw)
To: Joakim Tjernlund, andrew@lunn.ch; +Cc: netdev@vger.kernel.org
In-Reply-To: <1480531441.3563.148.camel@infinera.com>
On 11/30/2016 10:44 AM, Joakim Tjernlund wrote:
> On Wed, 2016-11-30 at 10:10 -0800, Florian Fainelli wrote:
>> On 11/30/2016 09:44 AM, Joakim Tjernlund wrote:
>>> On Wed, 2016-11-30 at 17:55 +0100, Andrew Lunn wrote:
>>>>> This is an embedded system with several boards in a subrack.
>>>>> Each board has eth I/F connected to a switch to communicate with each other.
>>>>> One of the board will also house the actual switch device and manage the switch.
>>>>> Then the normal app just communicates over the physical eth I/F like any other board
>>>>> in the system. There is a "manage switch app" which brings the switch up and partition
>>>>> phys VLANs etc. (each phys I/F would be a a separate domain so no loop)
>>>>
>>>> So you are planning on throwing away the "manage switch app", and just
>>>> use standard linux networking commands? That is what switchdev is all
>>>> about really, throwing away the vendor SDK for the switch, making a
>>>> switch just a bunch on interfaces on the host which you manage as
>>>> normal interfaces.
>>>
>>> Something like that. I need to run routing protocols on the switch I/Fs and egress
>>> pkgs on selected switch I/Fs bypassing ARP, just like DSA does with its vendor
>>> tags.
>>>
>>>>
>>>>> I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe
>>>>
>>>> No need to create this interface. It will exist if you go the
>>>> switchdev route.
>>>>
>>>>>>> And switchdev can do all this over PCIe instead? Can you have a
>>>>>>> switch tree in switchdev too?
>>>>>>
>>>>>> Mellonex says so, but i don't think they have actually implemented it.
>>>>>
>>>>> Not impl. any of DSAs features? What can you do with a Mellonex switch then?
>>>>
>>>> They don't have a tree of switches, as far as i know. Just a single
>>>> switch. But DSA does support a tree of switches, that is what the D in
>>>> DSA means, distributed. And there are a couple of boards which have 2
>>>> to 4 switches in a tree.
>>>
>>> We might have a tree as well so now I really wonder: Given we write a
>>> proper switchdev driver, can it support switchtrees without touching
>>> switchdev infra structure? If not I guess we will attach a physical
>>> eth I/F to the switch and use both DSA and switchdev to support both trees
>>> and HW offload.
>>
>> switchdev in itself really is the glue layer between the networking
>> stack and how to push specific objects down to the Ethernet switch
>> driver, and that Ethernet switch driver. Switchdev does not enforce a
>> specific network device driver model object, and just provides standard
>> hooks for your network devices to register with switchdev in order to
>> push/receive offloads. DSA on the other hand, utilizes switchdev to get
>> notifications about offloads from the networking stack, but also exposes
>> a clearly and well defined Ethernet switch device driver model, as
>> Andrew described, it creates per-port network devices, binds the ports
>> to their PHYs (built-in, or external), and also takes care of
>> encapsulating/decapsulating the switch specific tagging protocol.
>
> Lets see, switchdev is mainly for offloading L2/L3 into HW and does NOT create
> virtual I/F(one for each phys sw port) so if my only goal is to offload I don't
> need DSA? (How do one create routes if no virtual I/Fs I wonder ..)
switchdev does not create network devices per-se, but a driver utilizing
switchdev should do that instead. How packet transmission/reception
works on these network devices is outside of the scope of switchdev.
>
> DSA then does create virt. I/Fs and manages switch trees, to actually tx pkg it needs
> a phys I/F using vendor specific tags(for PCIe connected switches I can envision using
> the PCIe connection instead).
DSA can operate without a tag (DSA_TAG_PROTO_NONE) but the general idea
is that the management entity, in order to properly identify which
per-port network device is sending traffic to/from, needs some kind of
tag. Its most rudimentary form can be a per-port VLAN ID, but ideally a
proper switch tag (EDSA, DSA, Broadcom tag) is better suited for that.
The "master" network device in DSA is just a normal Ethernet interface
which receives these tags, and transmits then to the managemenet/CPU
port of the switch it is attached to.
>
> Who is the master when using both DSA and switchdev w.r.t initil conf/bring up of switch?
> configuring VLANs?
You start with all ports being all segregated and just able to talk to
the CPU/management port. If you are in a switch tree (distributed), then
the uplinks between the switches should make such traffic pass through
all the way to the management port. In order to configure VLANs, you
need to create a bridge device, and then use "bridge vlan ..." to
configure the VLAN member ship of ports.
>
>>
>> We should probably put that in some crystal clear sentence somewhere in
>> Documentation/networking/ but switchdev and DSA are complementary and
>> not competitors, they just do not tackle the problems from the same angle.
>
> Yes :)
>
>>
>>>
>>>>
>>>> I think this is partially down to market segments. Mellonex market is
>>>> top of rack switches. High port count, very high bandwidth. DSA is
>>>> more wireless access points, set top boxes, generally up to 7 ports of
>>>> 1Gbps and a few custom embedded products which need more ports, so
>>>> build a tree of switches.
>>>
>>> We have on an existing board with a BCM ROBO switch with lots of ports(>24),
>>> managed over SPI. Looking at BCM DSA tag code it looks like it only supports
>>> some 8 ports or so. I still have to find out if this is a limitation in BCM tagging
>>> protocol or if just not impl. in DSA yet.
>>
>> Oh cool, can you share the model by chance? I suspect the tagging format
>
> I think it is an BCM5322
OK, so this one is actually capable of supporting something that is a
real distributed architecture design, cool! The format of the tag is way
different than what net/dsa/tag_brcm.c supports today, and it also
requires a lot more help from the networking stack with respect to
unicast/broadcast/multicast since you need to adjust the opcode of the
tag accordingly. Nothing that cannot be supported though.
>
>> of that switch is going to be different than what net/dsa/tag_brcm.c, so
>> feel free to add something NET_DSA_TAG_BRCM8B (for 8 bytes) or something
>> like that.
>
> I will have to look at that, don't have an docs handy.
The device looks largely register compatible with the definitions in
drivers/net/dsa/b53/b53_regs.h with a few notable exceptions:
- port control overrides B53_PORT_CTRL are not at offset 0 + (port
number), but A0 + port number
- link status summary in the status page is slightly offset down
But for the most part, in particular the ARL and VLAN pages, seems like
b53 would not require too much conditionals to get this switch to work.
>
>>
>> Note that DSA currently hardcodes the maximum number of ports to 14
>> (DSA_MAX_PORTS), but this should obviously be something dynamically
>> determined based on probing the switch device.
>>
>> Can you also evaluate if using drivers/net/dsa/b53/ would work for you?
>> My hope would be that they preserved the register compatibility here,
>> but since this has a large number of ports, it may have completely
>> offset most registers.
>
> I will have a look, I sure want to try out DSA/switchdev on existing boards if I can :)
Good!
--
Florian
^ permalink raw reply
* Re: [PATCH 0/5] cpsw: add per channel shaper configuration
From: David Miller @ 2016-11-30 19:37 UTC (permalink / raw)
To: ivan.khoronzhuk
Cc: netdev, linux-kernel, mugunthanvnm, grygorii.strashko, linux-omap
In-Reply-To: <1480431651-6081-1-git-send-email-ivan.khoronzhuk@linaro.org>
From: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date: Tue, 29 Nov 2016 17:00:46 +0200
> This series is intended to allow user to set rate for per channel
> shapers at cpdma level. This patchset doesn't have impact on performance.
> The rate can be set with:
>
> echo 100 > /sys/class/net/ethX/queues/tx-0/tx_maxrate
>
> Tested on am572xx
> Based on net-next/master
Series applied, thanks.
^ permalink raw reply
* Re: pull-request: wireless-drivers 2016-11-29
From: David Miller @ 2016-11-30 19:34 UTC (permalink / raw)
To: kvalo-sgV2jX0FEOL9JmXXK+q4OQ
Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87r35utjq7.fsf-HodKDYzPHsUD5k0oWYwrnHL1okKdlPRT@public.gmane.org>
From: Kalle Valo <kvalo-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
Date: Tue, 29 Nov 2016 16:59:44 +0200
> if there's still time here's one more patch to 3.9. I think this is good
> to have in 3.9 as it fixes an issue where we were printing uninitialised
> memory in mwifiex. I had this in wireless-drivers already for some time
> as I was waiting for other fixes and nothing serious actually came up.
>
> If this doesn't make it to 3.9 that's not a problem, I'll just merge
> this to wireless-drivers-next. Let me know what you prefer.
Unless you are in a time-machine of some sort I assume you mean "4.9" not
"3.9" :-)
Pulled, thanks.
^ permalink raw reply
* Re: [PATCH v2 net-next 00/11] qed*: Add XDP support
From: David Miller @ 2016-11-30 19:32 UTC (permalink / raw)
To: Yuval.Mintz; +Cc: netdev
In-Reply-To: <1480430830-17671-1-git-send-email-Yuval.Mintz@cavium.com>
From: Yuval Mintz <Yuval.Mintz@cavium.com>
Date: Tue, 29 Nov 2016 16:46:59 +0200
> This patch series is intended to add XDP to the qede driver, although
> it contains quite a bit of cleanups, refactorings and infrastructure
> changes as well.
>
> The content of this series can be roughly divided into:
>
> - Datapath improvements - mostly focused on having the datapath utilize
> parameters which can be more tightly contained in cachelines.
> Patches #1, #2, #8, #9 belong to this group.
>
> - Refactoring - done mostly in favour of XDP. Patches #3, #4, #5, #9.
>
> - Infrastructure changes - done in favour of XDP. Paches #6 and #7 belong
> to this category [#7 being by far the biggest patch in the series].
>
> - Actual XDP support - last two patches [#10, #11].
Series applied, thank you.
^ permalink raw reply
* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-11-30 19:30 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
In-Reply-To: <20161130201711.2f353a76@redhat.com>
On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> Don't take is as critique Eric. I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments. You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kind of
> adjustments.
Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundred of usec in some cases.
I would not enable this behavior by default.
^ permalink raw reply
* Re: [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue
From: Eric Dumazet @ 2016-11-30 19:27 UTC (permalink / raw)
To: Soheil Hassas Yeganeh
Cc: davem, netdev, edumazet, willemb, maze, hannes,
Soheil Hassas Yeganeh
In-Reply-To: <1480532468-1610-1-git-send-email-soheil.kdev@gmail.com>
On Wed, 2016-11-30 at 14:01 -0500, Soheil Hassas Yeganeh wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Only when ICMP packets are enqueued onto the error queue,
> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
> in sock_dequeue_err_skb), a subsequent error queue read
> would set sk_err to the next error on the queue, or 0 if empty.
> As no error types other than ICMP set this field, sk_err should
> not be modified upon dequeuing them.
>
> Only for ICMP errors, reset the (racy) sk_err. Some applications,
> like traceroute, rely on it and go into a futile busy POLLERR
> loop otherwise.
>
> In principle, sk_err has to be set while an ICMP error is queued.
> Testing is_icmp_err_skb(skb_next) approximates this without
> requiring a full queue walk. Applications that receive both ICMP
> and other errors cannot rely on this legacy behavior, as other
> errors do not set sk_err in the first place.
>
> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
Acked-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply
* Re: [PATCH] ehea: Remove unnecessary memset of stats in netdev private data
From: David Miller @ 2016-11-30 19:26 UTC (permalink / raw)
To: tklauser; +Cc: dougmill, netdev
In-Reply-To: <20161129133507.5008-1-tklauser@distanz.ch>
From: Tobias Klauser <tklauser@distanz.ch>
Date: Tue, 29 Nov 2016 14:35:07 +0100
> The memory for netdev private data is allocated using kzalloc/vzalloc in
> alloc_netdev_mqs, thus there is no need to zero the stats portion of it
> again in the driver's probe function.
>
> In any case, the size for the memset is wrong as the stats member is of
> type rtnl_link_stats64, not net_device_stats.
>
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: David Miller @ 2016-11-30 19:25 UTC (permalink / raw)
To: salil.mehta; +Cc: yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129130945.919372-1-salil.mehta@huawei.com>
From: Salil Mehta <salil.mehta@huawei.com>
Date: Tue, 29 Nov 2016 13:09:45 +0000
> + /* We only support checksum for IPv4,UDP(over IPv4 or IPv6), TCP(over
> + * IPv4 or IPv6) and SCTP but we support many L3(IPv4, IPv6, MPLS,
> + * PPPoE etc) and L4(TCP, UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.
> + * We want to filter out L3 and L4 protocols early on for which checksum
> + * is not supported.
...
> + */
> + l3id = hnae_get_field(flag, HNS_RXD_L3ID_M, HNS_RXD_L3ID_S);
> + l4id = hnae_get_field(flag, HNS_RXD_L4ID_M, HNS_RXD_L4ID_S);
> + if ((l3id != HNS_RX_FLAG_L3ID_IPV4) &&
> + ((l3id != HNS_RX_FLAG_L3ID_IPV6) ||
> + (l4id != HNS_RX_FLAG_L4ID_UDP)) &&
> + ((l3id != HNS_RX_FLAG_L3ID_IPV6) ||
> + (l4id != HNS_RX_FLAG_L4ID_TCP)) &&
> + (l4id != HNS_RX_FLAG_L4ID_SCTP))
> + return;
I have a hard time understanding this seemingly overcomplicated
check.
It looks like if L3 is IPV4 it will accept any underlying L4 protocol,
but is that what is really intended? That doesn't match what this new
comment states.
My understanding is that the chip supports checksums for:
UDP/IPV4
UDP/IPV6
TCP/IPV4
TCP/IPV6
SCTP/IPV4
SCTP/IPV6
So the simplest thing is to validate each level one at a time:
if (l3 != IPV4 && l3 != IPV6)
return;
if (l4 != UDP && l4 != TCP && l4 != SCTP)
return;
^ permalink raw reply
* Re: net/sctp: vmalloc allocation failure in sctp_setsockopt/xt_alloc_table_info
From: Marcelo Ricardo Leitner @ 2016-11-30 19:21 UTC (permalink / raw)
To: Florian Westphal; +Cc: Neil Horman, netdev, LKML, netfilter-devel
In-Reply-To: <20161128181803.GA13159@localhost.localdomain>
On Mon, Nov 28, 2016 at 04:18:03PM -0200, Marcelo Ricardo Leitner wrote:
> On Mon, Nov 28, 2016 at 07:09:25PM +0100, Florian Westphal wrote:
> > Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > [ trimming CCs ]
> >
> > > On Mon, Nov 28, 2016 at 06:47:10PM +0100, Florian Westphal wrote:
> > > > Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > I'm not sure I agree with that. Generally speaking it seems like the right
> > > > > thing to do, if you want to avoid filling logs with warnings, but this is the
> > > > > sort of error that is going to be accompanied by severe service interruption.
> > > > > I'd rather see a reason behind that in the logs, than just have it occur
> > > > > silently.
> > > >
> > > > Its not silent -- the setsockopt call will fail and userspace should
> > > > display an error.
> > > >
> > > Thats not true. If the OOM succedes in freeing enough memory to fulfill the
> > > request the setsockopt may complete without error, you're just left with a
> > > killed process...somewhere. Thats seems a bit dodgy to me
> >
>
> __GFP_NOWARN is about allocation failures only and it won't disable OOM
> kill messages. oom_kill_process() has no idea on GFP_NOWARN when doing
> the logging.
>
> > We should prevent OOM killer from running in first place (GFP_NORETRY should work).
>
> Oh. Really?
>
Now I see why. Then we're basically saying that's better to fail this
operation than to kill some random process around.
And kmalloc() is already using GFP_NORETRY in this same place.
Marcelo
^ permalink raw reply
* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-11-30 19:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat,
brouer
In-Reply-To: <1480521386.18162.189.camel@edumazet-glaptop3.roam.corp.google.com>
On Wed, 30 Nov 2016 07:56:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached). While it gives huge improvements, the problem I ran
> > into was that; TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> >
> > Can your patch be affected by this too?
>
> Like all TX business, you should know this Jesper.
> No need to constantly remind us something very well known.
Don't take is as critique Eric. I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments. You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kind of
adjustments.
> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> >
> > These +75% number is pktgen without "burst", and definitely show that
> > your patch activate xmit_more.
> > What is the pps performance number when using pktgen "burst" option?
>
> About the same really. About all packets now get the xmit_more effect.
Perfect!
> > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > [...]
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > > {
> > > - return ring->prod - ring->cons > ring->full_size;
> > > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > > }
> > [...]
> >
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > > }
> > > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >
> > > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > + * will happen shortly.
> > > + */
> > > + if (send_doorbell &&
> > > + dev->doorbell_opt &&
> > > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> >
> > It would be nice with a function call with an appropriate name, instead
> > of an open-coded queue size check. I'm also confused by the "ncons" name.
> >
> > > + send_doorbell = false;
> > > +
> > [...]
> >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > > */
> > > u32 last_nr_txbb;
> > > u32 cons;
> > > + u32 ncons;
> >
> > Maybe we can find a better name than "ncons" ?
>
> Thats because 'cons' in this driver really means 'old cons'
>
> and new cons = old cons + last_nr_txbb;
It was not clear to me that "n" meant "new". And also not clear that
this drive have an issue of "cons" (consumer) is tracking "old" cons.
> > > unsigned long wake_queue;
> > > struct netdev_queue *tx_queue;
> > > u32 (*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >
> > > /* cache line used and dirtied in mlx4_en_xmit() */
> > > u32 prod ____cacheline_aligned_in_smp;
> > > + u32 prod_bell;
> >
> > Good descriptive variable name.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH 1/1] ax25: Fix segfault when receiving an iframe with net2kiss loaded
From: Basil Gunn @ 2016-11-30 19:15 UTC (permalink / raw)
To: Joerg Reuter, Ralf Baechle, David S. Miller, linux-hams, netdev,
linux-kernel
Cc: stable, Edouard Lafargue, Jeremy McDermond
AX.25 uses sock_queue_rcv_skb() to queue an iframe received packet.
This routine writes NULL to the socket buffer device structure
pointer. The socket buffer is subsequently serviced by
__netif_receiv_skb_core() which dereferences the device structure
pointer & segfaults.
The fix puts the ax25 device structure pointer back in the socket
buffer struct after sock_queue_rcv_skb() is called.
To trigger the segfault setup an ax.25 device (ax0) then run net2kiss
(net2kiss -v -i ax0 /dev/ptmx). In another console make an ax.25
connection (call udr0 jnbbs). Within 2 received packets a segfault
will occur.
Please submit to -stable.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
---
net/ax25/ax25_in.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/ax25/ax25_in.c b/net/ax25/ax25_in.c
index bb5a0e4..417f21a 100644
--- a/net/ax25/ax25_in.c
+++ b/net/ax25/ax25_in.c
@@ -144,10 +144,15 @@ int ax25_rx_iframe(ax25_cb *ax25, struct sk_buff *skb)
if (ax25->sk != NULL && ax25->ax25_dev->values[AX25_VALUES_CONMODE] == 2) {
if ((!ax25->pidincl && ax25->sk->sk_protocol == pid) ||
ax25->pidincl) {
+ /* Will set socket buffer device struct pointer,
+ * skb->dev to NULL
+ */
if (sock_queue_rcv_skb(ax25->sk, skb) == 0)
queued = 1;
else
ax25->condition |= AX25_COND_OWN_RX_BUSY;
+
+ skb->dev = ax25->ax25_dev->dev;
}
}
--
2.1.4
^ permalink raw reply related
* Re: [PATCH v2 net 0/5] l2tp: fixes for l2tp_ip and l2tp_ip6 socket handling
From: David Miller @ 2016-11-30 19:14 UTC (permalink / raw)
To: jchapman; +Cc: g.nault, netdev, celston
In-Reply-To: <7de5b95e-b33a-fce4-cb59-4a4415e39697@katalix.com>
From: James Chapman <jchapman@katalix.com>
Date: Tue, 29 Nov 2016 13:47:06 +0000
> On 29/11/16 12:09, Guillaume Nault wrote:
>> This series addresses problems found while working on commit 32c231164b76
>> ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").
>>
>> The first three patches fix races in socket's connect, recv and bind
>> operations. The last two ones fix scenarios where l2tp fails to
>> correctly lookup its userspace sockets.
>>
>> Apart from the last patch, which is l2tp_ip6 specific, every patch
>> fixes the same problem in the L2TP IPv4 and IPv6 code.
>>
>> All problems fixed by this series exist since the creation of the
>> l2tp_ip and l2tp_ip6 modules.
>>
>> Changes since v1:
>> * Patch #3: fix possible uninitialised use of 'ret' in l2tp_ip_bind().
...
> Looks good.
>
> Acked-by: James Chapman <jchapman@katalix.com>
Series applied.
^ permalink raw reply
* Re: [PATCH net] cxgb4: Add PCI device ID for new adapter
From: David Miller @ 2016-11-30 19:12 UTC (permalink / raw)
To: hariprasad; +Cc: netdev, leedom, swise, nirranjan
In-Reply-To: <1480419892-30842-1-git-send-email-hariprasad@chelsio.com>
From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Tue, 29 Nov 2016 17:14:52 +0530
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next] net: thunderx: Fix transmit queue timeout issue
From: David Miller @ 2016-11-30 19:11 UTC (permalink / raw)
To: sunil.kovvuri; +Cc: netdev, linux-kernel, linux-arm-kernel, sgoutham
In-Reply-To: <1480419620-32500-1-git-send-email-sunil.kovvuri@gmail.com>
From: sunil.kovvuri@gmail.com
Date: Tue, 29 Nov 2016 17:10:20 +0530
> + /* Check again, incase another cpu freed descriptors */
> + if (atomic_read(&sq->free_cnt) > MIN_SQ_DESC_PER_PKT_XMIT) {
> + netif_tx_start_queue(txq);
You have to use netif_tx_wake_queue() any time you restart a queue after bringing
the device up.
^ permalink raw reply
* Re: [PATCH net] cdc_ether: Fix handling connection notification
From: Henning Schild @ 2016-11-30 19:09 UTC (permalink / raw)
To: Kristian Evensen
Cc: oliver-GvhC2dPhHPQdnm+yROfE0A, linux-usb-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161130184216.4224-1-kristian.evensen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Am Wed, 30 Nov 2016 19:42:16 +0100
schrieb Kristian Evensen <kristian.evensen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Commit bfe9b9d2df66 ("cdc_ether: Improve ZTE MF823/831/910 handling")
> introduced a work-around in usbnet_cdc_status() for devices that
> exported cdc carrier on twice on connect. Before the commit, this
> behavior caused the link state to be incorrect. It was assumed that
> all CDC Ethernet devices would either export this behavior, or send
> one off and then one on notification (which seems to be the default
> behavior).
>
> Unfortunately, it turns out multiple devices sends a connection
> notification multiple times per second (via an interrupt), even when
> connection state does not change. This has been observed with several
> different USB LAN dongles (at least), for example 13b1:0041 (Linksys).
> After bfe9b9d2df66, the link state has been set as down and then up
> for each notification. This has caused a flood of Netlink NEWLINK
> messages and syslog to be flooded with messages similar to:
>
> cdc_ether 2-1:2.0 eth1: kevent 12 may have been dropped
>
> This commit fixes the behavior by reverting usbnet_cdc_status() to
> how it was before bfe9b9d2df66. The work-around has been moved to a
> separate status-function which is only called when a known, affect
> device is detected.
>
> Fixes: bfe9b9d2df66 ("cdc_ether: Improve ZTE MF823/831/910 handling")
> Reported-by: Henning Schild <henning.schild-kv7WeFo6aLtBDgjK7y7TUQ@public.gmane.org>
> Signed-off-by: Kristian Evensen <kristian.evensen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> drivers/net/usb/cdc_ether.c | 38
> +++++++++++++++++++++++++++++++------- 1 file changed, 31
> insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
> index 45e5e43..8c628ea 100644
> --- a/drivers/net/usb/cdc_ether.c
> +++ b/drivers/net/usb/cdc_ether.c
> @@ -388,12 +388,6 @@ void usbnet_cdc_status(struct usbnet *dev,
> struct urb *urb) case USB_CDC_NOTIFY_NETWORK_CONNECTION:
> netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
> event->wValue ? "on" : "off");
> -
> - /* Work-around for devices with broken
> off-notifications */
> - if (event->wValue &&
> - !test_bit(__LINK_STATE_NOCARRIER,
> &dev->net->state))
> - usbnet_link_change(dev, 0, 0);
> -
> usbnet_link_change(dev, !!event->wValue, 0);
> break;
> case USB_CDC_NOTIFY_SPEED_CHANGE: /* tx/rx rates */
> @@ -466,6 +460,36 @@ static int usbnet_cdc_zte_rx_fixup(struct usbnet
> *dev, struct sk_buff *skb) return 1;
> }
>
> +/* Ensure correct link state
> + *
> + * Some devices (ZTE MF823/831/910) export two carrier on
> notifications when
> + * connected. This causes the link state to be incorrect. Work
> around this by
> + * always setting the state to off, then on.
> + */
> +void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
> +{
> + struct usb_cdc_notification *event;
> +
> + if (urb->actual_length < sizeof(*event))
> + return;
> +
> + event = urb->transfer_buffer;
> +
> + if (event->bNotificationType !=
> USB_CDC_NOTIFY_NETWORK_CONNECTION) {
> + usbnet_cdc_status(dev, urb);
> + return;
> + }
> +
> + netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
> + event->wValue ? "on" : "off");
> +
> + if (event->wValue &&
> + !test_bit(__LINK_STATE_NOCARRIER, &dev->net->state))
You should probably use netif_carrier_ok(dev->net) instead.
> + usbnet_link_change(dev, 0, 0);
> +
> + usbnet_link_change(dev, !!event->wValue, 0);
Would the following work?
usbnet_link_change(dev, !!event->wValue, event->wValue &&
netif_carrier_ok(dev->net));
> +}
> +
> static const struct driver_info cdc_info = {
> .description = "CDC Ethernet Device",
> .flags = FLAG_ETHER | FLAG_POINTTOPOINT,
> @@ -481,7 +505,7 @@ static const struct driver_info
> zte_cdc_info = { .flags = FLAG_ETHER | FLAG_POINTTOPOINT,
> .bind = usbnet_cdc_zte_bind,
> .unbind = usbnet_cdc_unbind,
> - .status = usbnet_cdc_status,
> + .status = usbnet_cdc_zte_status,
> .set_rx_mode = usbnet_cdc_update_filter,
> .manage_power = usbnet_manage_power,
> .rx_fixup = usbnet_cdc_zte_rx_fixup,
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue
From: Soheil Hassas Yeganeh @ 2016-11-30 19:01 UTC (permalink / raw)
To: davem, netdev; +Cc: edumazet, willemb, maze, hannes, Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh <soheil@google.com>
Only when ICMP packets are enqueued onto the error queue,
sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
in sock_dequeue_err_skb), a subsequent error queue read
would set sk_err to the next error on the queue, or 0 if empty.
As no error types other than ICMP set this field, sk_err should
not be modified upon dequeuing them.
Only for ICMP errors, reset the (racy) sk_err. Some applications,
like traceroute, rely on it and go into a futile busy POLLERR
loop otherwise.
In principle, sk_err has to be set while an ICMP error is queued.
Testing is_icmp_err_skb(skb_next) approximates this without
requiring a full queue walk. Applications that receive both ICMP
and other errors cannot rely on this legacy behavior, as other
errors do not set sk_err in the first place.
Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
net/core/skbuff.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d1d1a5a..8dad391 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3714,20 +3714,29 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
}
EXPORT_SYMBOL(sock_queue_err_skb);
+static bool is_icmp_err_skb(const struct sk_buff *skb)
+{
+ return skb && (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP ||
+ SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ICMP6);
+}
+
struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
{
struct sk_buff_head *q = &sk->sk_error_queue;
- struct sk_buff *skb, *skb_next;
+ struct sk_buff *skb, *skb_next = NULL;
+ bool icmp_next = false;
unsigned long flags;
- int err = 0;
spin_lock_irqsave(&q->lock, flags);
skb = __skb_dequeue(q);
if (skb && (skb_next = skb_peek(q)))
- err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
+ icmp_next = is_icmp_err_skb(skb_next);
spin_unlock_irqrestore(&q->lock, flags);
- if (err)
+ if (is_icmp_err_skb(skb) && !icmp_next)
+ sk->sk_err = 0;
+
+ if (skb_next)
sk->sk_error_report(sk);
return skb;
--
2.8.0.rc3.226.g39d4020
^ permalink raw reply related
* Re: DSA vs. SWTICHDEV ?
From: Joakim Tjernlund @ 2016-11-30 18:44 UTC (permalink / raw)
To: f.fainelli@gmail.com, andrew@lunn.ch; +Cc: netdev@vger.kernel.org
In-Reply-To: <922521a8-3a33-bbc4-2e07-2b946590c829@gmail.com>
On Wed, 2016-11-30 at 10:10 -0800, Florian Fainelli wrote:
> On 11/30/2016 09:44 AM, Joakim Tjernlund wrote:
> > On Wed, 2016-11-30 at 17:55 +0100, Andrew Lunn wrote:
> > > > This is an embedded system with several boards in a subrack.
> > > > Each board has eth I/F connected to a switch to communicate with each other.
> > > > One of the board will also house the actual switch device and manage the switch.
> > > > Then the normal app just communicates over the physical eth I/F like any other board
> > > > in the system. There is a "manage switch app" which brings the switch up and partition
> > > > phys VLANs etc. (each phys I/F would be a a separate domain so no loop)
> > >
> > > So you are planning on throwing away the "manage switch app", and just
> > > use standard linux networking commands? That is what switchdev is all
> > > about really, throwing away the vendor SDK for the switch, making a
> > > switch just a bunch on interfaces on the host which you manage as
> > > normal interfaces.
> >
> > Something like that. I need to run routing protocols on the switch I/Fs and egress
> > pkgs on selected switch I/Fs bypassing ARP, just like DSA does with its vendor
> > tags.
> >
> > >
> > > > I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe
> > >
> > > No need to create this interface. It will exist if you go the
> > > switchdev route.
> > >
> > > > > > And switchdev can do all this over PCIe instead? Can you have a
> > > > > > switch tree in switchdev too?
> > > > >
> > > > > Mellonex says so, but i don't think they have actually implemented it.
> > > >
> > > > Not impl. any of DSAs features? What can you do with a Mellonex switch then?
> > >
> > > They don't have a tree of switches, as far as i know. Just a single
> > > switch. But DSA does support a tree of switches, that is what the D in
> > > DSA means, distributed. And there are a couple of boards which have 2
> > > to 4 switches in a tree.
> >
> > We might have a tree as well so now I really wonder: Given we write a
> > proper switchdev driver, can it support switchtrees without touching
> > switchdev infra structure? If not I guess we will attach a physical
> > eth I/F to the switch and use both DSA and switchdev to support both trees
> > and HW offload.
>
> switchdev in itself really is the glue layer between the networking
> stack and how to push specific objects down to the Ethernet switch
> driver, and that Ethernet switch driver. Switchdev does not enforce a
> specific network device driver model object, and just provides standard
> hooks for your network devices to register with switchdev in order to
> push/receive offloads. DSA on the other hand, utilizes switchdev to get
> notifications about offloads from the networking stack, but also exposes
> a clearly and well defined Ethernet switch device driver model, as
> Andrew described, it creates per-port network devices, binds the ports
> to their PHYs (built-in, or external), and also takes care of
> encapsulating/decapsulating the switch specific tagging protocol.
Lets see, switchdev is mainly for offloading L2/L3 into HW and does NOT create
virtual I/F(one for each phys sw port) so if my only goal is to offload I don't
need DSA? (How do one create routes if no virtual I/Fs I wonder ..)
DSA then does create virt. I/Fs and manages switch trees, to actually tx pkg it needs
a phys I/F using vendor specific tags(for PCIe connected switches I can envision using
the PCIe connection instead).
Who is the master when using both DSA and switchdev w.r.t initil conf/bring up of switch?
configuring VLANs?
>
> We should probably put that in some crystal clear sentence somewhere in
> Documentation/networking/ but switchdev and DSA are complementary and
> not competitors, they just do not tackle the problems from the same angle.
Yes :)
>
> >
> > >
> > > I think this is partially down to market segments. Mellonex market is
> > > top of rack switches. High port count, very high bandwidth. DSA is
> > > more wireless access points, set top boxes, generally up to 7 ports of
> > > 1Gbps and a few custom embedded products which need more ports, so
> > > build a tree of switches.
> >
> > We have on an existing board with a BCM ROBO switch with lots of ports(>24),
> > managed over SPI. Looking at BCM DSA tag code it looks like it only supports
> > some 8 ports or so. I still have to find out if this is a limitation in BCM tagging
> > protocol or if just not impl. in DSA yet.
>
> Oh cool, can you share the model by chance? I suspect the tagging format
I think it is an BCM5322
> of that switch is going to be different than what net/dsa/tag_brcm.c, so
> feel free to add something NET_DSA_TAG_BRCM8B (for 8 bytes) or something
> like that.
I will have to look at that, don't have an docs handy.
>
> Note that DSA currently hardcodes the maximum number of ports to 14
> (DSA_MAX_PORTS), but this should obviously be something dynamically
> determined based on probing the switch device.
>
> Can you also evaluate if using drivers/net/dsa/b53/ would work for you?
> My hope would be that they preserved the register compatibility here,
> but since this has a large number of ports, it may have completely
> offset most registers.
I will have a look, I sure want to try out DSA/switchdev on existing boards if I can :)
>
> BTW, there is #linuxswitch on Freenode if you want to chat!
^ permalink raw reply
* Re: [PATCH net-next] cgroup, bpf: remove unnecessary #include
From: David Miller @ 2016-11-30 18:58 UTC (permalink / raw)
To: ast; +Cc: daniel, daniel, roszenrami, netdev
In-Reply-To: <1480529768-1954199-1-git-send-email-ast@fb.com>
From: Alexei Starovoitov <ast@fb.com>
Date: Wed, 30 Nov 2016 10:16:08 -0800
> this #include is unnecessary and brings whole set of
> other headers into cgroup-defs.h. Remove it.
>
> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Rami Rosen <roszenrami@gmail.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Daniel Mack <daniel@zonque.org>
> ---
> Dave,
> this patch got lost somehow (marked accepted, but not in net-next).
> Resending.
Applied, thanks ALexei.
^ permalink raw reply
* Re: [PATCH v4 net-next 3/7] net: mvneta: Use cacheable memory to store the rx buffer virtual address
From: David Miller @ 2016-11-30 18:57 UTC (permalink / raw)
To: gregory.clement
Cc: linux-kernel, netdev, jszhang, arnd, jason, andrew,
sebastian.hesselbarth, thomas.petazzoni, linux-arm-kernel, nadavh,
mw, dima, yelena
In-Reply-To: <bb2f1a61e9b912905f8390f7a06c36f9f1e7ab66.1480431285.git-series.gregory.clement@free-electrons.com>
From: Gregory CLEMENT <gregory.clement@free-electrons.com>
Date: Tue, 29 Nov 2016 15:55:21 +0100
> + /* Virtual address of the RX buffer */
> + void **buf_virt_addr;
> +
> /* Virtual address of the RX DMA descriptors array */
> struct mvneta_rx_desc *descs;
>
...
> + data = (unsigned char *)rxq->buf_virt_addr[index];
This cast is unnecessary, please remove it.
^ permalink raw reply
* Re: [net-next PATCH v3 0/6] XDP for virtio_net
From: Michael S. Tsirkin @ 2016-11-30 18:56 UTC (permalink / raw)
To: John Fastabend
Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
netdev, bblanco, brouer
In-Reply-To: <583F1D40.7060408@gmail.com>
On Wed, Nov 30, 2016 at 10:41:04AM -0800, John Fastabend wrote:
> On 16-11-30 10:35 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 29, 2016 at 12:05:20PM -0800, John Fastabend wrote:
> >> This implements virtio_net for the mergeable buffers and big_packet
> >> modes. I tested this with vhost_net running on qemu and did not see
> >> any issues. For testing num_buf > 1 I added a hack to vhost driver
> >> to only but 100 bytes per buffer.
> >>
> >> There are some restrictions for XDP to be enabled and work well
> >> (see patch 3) for more details.
> >>
> >> 1. LRO must be off
> >> 2. MTU must be less than PAGE_SIZE
> >> 3. queues must be available to dedicate to XDP
> >> 4. num_bufs received in mergeable buffers must be 1
> >> 5. big_packet mode must have all data on single page
> >>
> >> Please review any comments/feedback welcome as always.
> >>
> >> v2, fixes rcu usage throughout thanks to Eric and the use of
> >> num_online_cpus() usage thanks to Jakub.
> >>
> >> v3, add slowpath patch to handle num_bufs > 1
> >>
> >> Thanks,
> >> John
> >
> > BTW this is threaded incorrectly: patch 1/6 isn't a reply to 0/6,
> > patches 2 and on are replies to patch 1.
> >
>
> Ah yep, if you mangle the command line git will send the
> cover letter even if you have mangled 'to' email addresses but when
> it hits a real patch it aborts. At least on my version of git.
>
> > I'm busy until end of week, I'll review Monday. Sorry about the delay.
>
> In the meantime I'll post a v4 with better commit message (Alexei) and
> address a corner cases Jakub pointed out.
I did a quick look and found some too, but a detailed review will
have to wait till next week.
> >
> >> ---
> >>
> >> John Fastabend (6):
> >> net: virtio dynamically disable/enable LRO
> >> net: xdp: add invalid buffer warning
> >> virtio_net: Add XDP support
> >> virtio_net: add dedicated XDP transmit queues
> >> virtio_net: add XDP_TX support
> >> virtio_net: xdp, add slowpath case for non contiguous buffers
> >>
> >>
> >> drivers/net/virtio_net.c | 344 +++++++++++++++++++++++++++++++++++++++++++++-
> >> include/linux/filter.h | 1
> >> net/core/filter.c | 6 +
> >> 3 files changed, 346 insertions(+), 5 deletions(-)
> >>
> >> --
> >> Signature
^ permalink raw reply
* Re: [net-next PATCH v3 3/6] virtio_net: Add XDP support
From: Michael S. Tsirkin @ 2016-11-30 18:54 UTC (permalink / raw)
To: John Fastabend
Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161129201021.26851.34352.stgit@john-Precision-Tower-5810>
On Tue, Nov 29, 2016 at 12:10:21PM -0800, John Fastabend wrote:
> From: John Fastabend <john.fastabend@gmail.com>
>
> This adds XDP support to virtio_net. Some requirements must be
> met for XDP to be enabled depending on the mode. First it will
> only be supported with LRO disabled so that data is not pushed
> across multiple buffers. Second the MTU must be less than a page
> size to avoid having to handle XDP across multiple pages.
>
> If mergeable receive is enabled this patch only supports the case
> where header and data are in the same buf which we can check when
> a packet is received by looking at num_buf. If the num_buf is
> greater than 1 and a XDP program is loaded the packet is dropped
> and a warning is thrown. When any_header_sg is set this does not
> happen and both header and data is put in a single buffer as expected
> so we check this when XDP programs are loaded. Subsequent patches
> will process the packet in a degraded mode to ensure connectivity
> and correctness is not lost even if backend pushes packets into
> multiple buffers.
>
> If big packets mode is enabled and MTU/LRO conditions above are
> met then XDP is allowed.
>
> This patch was tested with qemu with vhost=on and vhost=off where
> mergable and big_packet modes were forced via hard coding feature
> negotiation. Multiple buffers per packet was forced via a small
> test patch to vhost.c in the vhost=on qemu mode.
>
> Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> drivers/net/virtio_net.c | 154 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 150 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 8189e5b..32126bf 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -22,6 +22,7 @@
> #include <linux/module.h>
> #include <linux/virtio.h>
> #include <linux/virtio_net.h>
> +#include <linux/bpf.h>
> #include <linux/scatterlist.h>
> #include <linux/if_vlan.h>
> #include <linux/slab.h>
> @@ -81,6 +82,8 @@ struct receive_queue {
>
> struct napi_struct napi;
>
> + struct bpf_prog __rcu *xdp_prog;
> +
> /* Chain pages by the private ptr. */
> struct page *pages;
>
> @@ -324,6 +327,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> return skb;
> }
>
> +static u32 do_xdp_prog(struct virtnet_info *vi,
> + struct bpf_prog *xdp_prog,
> + struct page *page, int offset, int len)
> +{
> + int hdr_padded_len;
> + struct xdp_buff xdp;
> + u32 act;
> + u8 *buf;
> +
> + buf = page_address(page) + offset;
> +
> + if (vi->mergeable_rx_bufs)
> + hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> + else
> + hdr_padded_len = sizeof(struct padded_vnet_hdr);
> +
> + xdp.data = buf + hdr_padded_len;
> + xdp.data_end = xdp.data + (len - vi->hdr_len);
so header seems to be ignored completely.
but the packet could be from the time when
e.g. checksum offloading was on, and
so it might gave DATA_VALID (from CHECKSUM_UNNECESSARY
in host).
I think you want to verify that flags and gso type
are 0.
> +
> + act = bpf_prog_run_xdp(xdp_prog, &xdp);
> + switch (act) {
> + case XDP_PASS:
> + return XDP_PASS;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + case XDP_TX:
> + case XDP_ABORTED:
> + case XDP_DROP:
> + return XDP_DROP;
> + }
> +}
do we really want this switch just to warn?
How about doing != XDP_PASS in the caller?
> +
> static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
> {
> struct sk_buff * skb = buf;
> @@ -340,14 +375,28 @@ static struct sk_buff *receive_big(struct net_device *dev,
> void *buf,
> unsigned int len)
> {
> + struct bpf_prog *xdp_prog;
> struct page *page = buf;
> - struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
> + struct sk_buff *skb;
>
> + rcu_read_lock();
> + xdp_prog = rcu_dereference(rq->xdp_prog);
> + if (xdp_prog) {
> + u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
> +
> + if (act == XDP_DROP)
> + goto err_xdp;
> + }
> + rcu_read_unlock();
> +
> + skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
> if (unlikely(!skb))
> goto err;
>
> return skb;
>
> +err_xdp:
> + rcu_read_unlock();
> err:
> dev->stats.rx_dropped++;
> give_pages(rq, page);
> @@ -366,10 +415,27 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> struct page *page = virt_to_head_page(buf);
> int offset = buf - page_address(page);
> unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
This is some useless computation when XDP is used, isn't it?
> + struct sk_buff *head_skb, *curr_skb;
> + struct bpf_prog *xdp_prog;
>
> - struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
> - truesize);
> - struct sk_buff *curr_skb = head_skb;
> + rcu_read_lock();
> + xdp_prog = rcu_dereference(rq->xdp_prog);
> + if (xdp_prog) {
> + u32 act;
> +
> + if (num_buf > 1) {
> + bpf_warn_invalid_xdp_buffer();
> + goto err_xdp;
> + }
> +
> + act = do_xdp_prog(vi, xdp_prog, page, offset, len);
> + if (act == XDP_DROP)
> + goto err_xdp;
> + }
> + rcu_read_unlock();
> +
> + head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
> + curr_skb = head_skb;
>
> if (unlikely(!curr_skb))
> goto err_skb;
I'm confused. Did the requirement to have a page per packet go away?
I don't think this mode is doing it here.
> @@ -423,6 +489,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
> return head_skb;
>
> +err_xdp:
> + rcu_read_unlock();
> err_skb:
> put_page(page);
> while (--num_buf) {
> @@ -1328,6 +1396,13 @@ static int virtnet_set_channels(struct net_device *dev,
> if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
> return -EINVAL;
>
> + /* For now we don't support modifying channels while XDP is loaded
> + * also when XDP is loaded all RX queues have XDP programs so we only
> + * need to check a single RX queue.
> + */
> + if (vi->rq[0].xdp_prog)
> + return -EINVAL;
> +
> get_online_cpus();
> err = virtnet_set_queues(vi, queue_pairs);
> if (!err) {
> @@ -1454,6 +1529,68 @@ static int virtnet_set_features(struct net_device *netdev,
> return 0;
> }
>
> +static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + struct bpf_prog *old_prog;
> + int i;
> +
> + if ((dev->features & NETIF_F_LRO) && prog) {
> + netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
> + return -EINVAL;
> + }
> +
> + if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
> + netdev_warn(dev, "XDP expects header/data in single page\n");
> + return -EINVAL;
> + }
> +
> + if (dev->mtu > PAGE_SIZE) {
> + netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
> + return -EINVAL;
> + }
> +
> + if (prog) {
> + prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
> + if (IS_ERR(prog))
> + return PTR_ERR(prog);
> + }
> +
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
> + rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
> + if (old_prog)
> + bpf_prog_put(old_prog);
don't we need to sync before put?
> + }
> +
> + return 0;
> +}
> +
> +static bool virtnet_xdp_query(struct net_device *dev)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + int i;
> +
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + if (vi->rq[i].xdp_prog)
> + return true;
> + }
> + return false;
> +}
> +
> +static int virtnet_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> + switch (xdp->command) {
> + case XDP_SETUP_PROG:
> + return virtnet_xdp_set(dev, xdp->prog);
> + case XDP_QUERY_PROG:
> + xdp->prog_attached = virtnet_xdp_query(dev);
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +}
> +
> static const struct net_device_ops virtnet_netdev = {
> .ndo_open = virtnet_open,
> .ndo_stop = virtnet_close,
> @@ -1471,6 +1608,7 @@ static int virtnet_set_features(struct net_device *netdev,
> .ndo_busy_poll = virtnet_busy_poll,
> #endif
> .ndo_set_features = virtnet_set_features,
> + .ndo_xdp = virtnet_xdp,
> };
>
> static void virtnet_config_changed_work(struct work_struct *work)
> @@ -1527,12 +1665,20 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>
> static void free_receive_bufs(struct virtnet_info *vi)
> {
> + struct bpf_prog *old_prog;
> int i;
>
> + rtnl_lock();
> for (i = 0; i < vi->max_queue_pairs; i++) {
> while (vi->rq[i].pages)
> __free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
> +
> + old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
> + RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
> + if (old_prog)
> + bpf_prog_put(old_prog);
> }
> + rtnl_unlock();
> }
>
> static void free_receive_page_frags(struct virtnet_info *vi)
^ permalink raw reply
* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: Michael S. Tsirkin @ 2016-11-30 18:49 UTC (permalink / raw)
To: John Fastabend
Cc: Jakub Kicinski, eric.dumazet, daniel, shm, davem, tgraf,
alexei.starovoitov, john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <583F0361.5030804@gmail.com>
On Wed, Nov 30, 2016 at 08:50:41AM -0800, John Fastabend wrote:
> On 16-11-30 06:30 AM, Jakub Kicinski wrote:
> > [add MST]
> >
>
> Thanks sorry MST. I did a cut'n'paste of an old list of CC's and missed
> you were not on the list.
>
> [...]
>
> >> + memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
> >> + while (--num_buf) {
> >> + unsigned int buflen;
> >> + unsigned long ctx;
> >> + void *buf;
> >> + int off;
> >> +
> >> + ctx = (unsigned long)virtqueue_get_buf(rq->vq, &buflen);
> >> + if (unlikely(!ctx))
> >> + goto err_buf;
> >> +
> >> + buf = mergeable_ctx_to_buf_address(ctx);
> >> + p = virt_to_head_page(buf);
> >> + off = buf - page_address(p);
> >> +
> >> + memcpy(page_address(page) + page_off,
> >> + page_address(p) + off, buflen);
> >> + page_off += buflen;
> >
> > Could malicious user potentially submit a frame bigger than MTU?
>
> Well presumably if the MTU is greater than PAGE_SIZE the xdp program
> would not have been loaded. And the malicious user in this case would
> have to be qemu which seems like everything is already lost if qemu
> is trying to attack its VM.
>
> But this is a good point because it looks like there is nothing in
> virtio or qemu that drops frames with MTU greater than the virtio
> configured setting. Maybe Michael can confirm this or I'll poke at it
> more. I think qemu should drop these frames in general.
>
> So I think adding a guard here is sensible I'll go ahead and do that.
> Also the MTU guard at set_xdp time needs to account for header length.
I agree. Further, offloads are disabled dynamically and we could
get a packet that was processed with LRO.
> Thanks nice catch.
>
> >
> >> + }
> >> +
> >> + *len = page_off;
> >> + return page;
> >> +err_buf:
> >> + __free_pages(page, 0);
> >> + return NULL;
> >> +}
> >> +
> >> static struct sk_buff *receive_mergeable(struct net_device *dev,
> >> struct virtnet_info *vi,
> >> struct receive_queue *rq,
> >> @@ -469,21 +519,37 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >> rcu_read_lock();
> >> xdp_prog = rcu_dereference(rq->xdp_prog);
> >> if (xdp_prog) {
> >> + struct page *xdp_page;
> >> u32 act;
> >>
> >> if (num_buf > 1) {
> >> bpf_warn_invalid_xdp_buffer();
> >> - goto err_xdp;
> >> +
> >> + /* linearize data for XDP */
> >> + xdp_page = xdp_linearize_page(rq, num_buf,
> >> + page, offset, &len);
> >> + if (!xdp_page)
> >> + goto err_xdp;
> >> + offset = len;
> >> + } else {
> >> + xdp_page = page;
> >> }
> >>
> >> - act = do_xdp_prog(vi, xdp_prog, page, offset, len);
> >> + act = do_xdp_prog(vi, xdp_prog, xdp_page, offset, len);
> >> switch (act) {
> >> case XDP_PASS:
> >> + if (unlikely(xdp_page != page))
> >> + __free_pages(xdp_page, 0);
> >> break;
> >> case XDP_TX:
> >> + if (unlikely(xdp_page != page))
> >> + goto err_xdp;
> >> + rcu_read_unlock();
> >
> > Only if there is a reason for v4 - this unlock could go to the previous
> > patch.
> >
>
> Sure will do this.
^ permalink raw reply
* Re: [net-next v2] neigh: remove duplicate check for same neigh
From: David Miller @ 2016-11-30 18:46 UTC (permalink / raw)
To: zhangshengju; +Cc: netdev, dsa
In-Reply-To: <1480476282-3650-1-git-send-email-zhangshengju@cmss.chinamobile.com>
From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 30 Nov 2016 11:24:42 +0800
> Currently loop index 'idx' is used as the index in the neigh list of interest.
> It's increased only when the neigh is dumped. It's not the absolute index in
> the list. Because there is no info to record which neigh has already be scanned
> by previous loop. This will cause the filtered out neighs to be scanned mulitple
> times.
>
> This patch make idx as the absolute index in the list, it will increase no matter
> whether the neigh is filtered. This will prevent the above problem.
>
> And this is in line with other dump functions.
>
> v2:
> - take David Ahern's advice to do simple change
>
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Applied, thanks.
^ permalink raw reply
* Re: [net-next PATCH v3 5/6] virtio_net: add XDP_TX support
From: Michael S. Tsirkin @ 2016-11-30 18:45 UTC (permalink / raw)
To: John Fastabend
Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161129201108.26851.1114.stgit@john-Precision-Tower-5810>
On Tue, Nov 29, 2016 at 12:11:08PM -0800, John Fastabend wrote:
> This adds support for the XDP_TX action to virtio_net. When an XDP
> program is run and returns the XDP_TX action the virtio_net XDP
> implementation will transmit the packet on a TX queue that aligns
> with the current CPU that the XDP packet was processed on.
>
> Before sending the packet the header is zeroed. Also XDP is expected
> to handle checksum correctly so no checksum offload support is
> provided.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> drivers/net/virtio_net.c | 59 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a1bfa99..9604e55 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -330,12 +330,40 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> return skb;
> }
>
> +static void virtnet_xdp_xmit(struct virtnet_info *vi,
> + unsigned int qnum, struct xdp_buff *xdp)
> +{
> + struct send_queue *sq = &vi->sq[qnum];
> + struct virtio_net_hdr_mrg_rxbuf *hdr;
> + unsigned int num_sg, len;
> + void *xdp_sent;
> +
> + /* Free up any pending old buffers before queueing new ones. */
> + while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> + struct page *page = virt_to_head_page(xdp_sent);
> +
> + put_page(page);
> + }
> +
> + /* Zero header and leave csum up to XDP layers */
> + hdr = xdp->data;
> + memset(hdr, 0, vi->hdr_len);
> + hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
> + hdr->hdr.flags = VIRTIO_NET_HDR_F_DATA_VALID;
Do we really want this?
This is CHECKSUM_UNNECESSARY.
Does not XDP pass checksummed packets?
> +
> + num_sg = 1;
> + sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
> + virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, xdp->data, GFP_ATOMIC);
This might fail. If it does, you want to at least free up the memory.
> + virtqueue_kick(sq->vq);
> +}
> +
> static u32 do_xdp_prog(struct virtnet_info *vi,
> struct bpf_prog *xdp_prog,
> struct page *page, int offset, int len)
> {
> int hdr_padded_len;
> struct xdp_buff xdp;
> + unsigned int qp;
> u32 act;
> u8 *buf;
>
> @@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
> switch (act) {
> case XDP_PASS:
> return XDP_PASS;
> + case XDP_TX:
> + qp = vi->curr_queue_pairs -
> + vi->xdp_queue_pairs +
> + smp_processor_id();
> + xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
> + virtnet_xdp_xmit(vi, qp, &xdp);
> + return XDP_TX;
> default:
> bpf_warn_invalid_xdp_action(act);
> - case XDP_TX:
> case XDP_ABORTED:
> case XDP_DROP:
> return XDP_DROP;
> @@ -387,8 +421,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
> if (xdp_prog) {
> u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
>
> - if (act == XDP_DROP)
> + switch (act) {
> + case XDP_PASS:
> + break;
> + case XDP_TX:
> + rcu_read_unlock();
> + goto xdp_xmit;
> + case XDP_DROP:
> + default:
> goto err_xdp;
> + }
> }
> rcu_read_unlock();
>
> @@ -403,6 +445,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
> err:
> dev->stats.rx_dropped++;
> give_pages(rq, page);
> +xdp_xmit:
> return NULL;
> }
>
> @@ -421,6 +464,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> struct sk_buff *head_skb, *curr_skb;
> struct bpf_prog *xdp_prog;
>
> + head_skb = NULL;
> +
> rcu_read_lock();
> xdp_prog = rcu_dereference(rq->xdp_prog);
> if (xdp_prog) {
> @@ -432,8 +477,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> }
>
> act = do_xdp_prog(vi, xdp_prog, page, offset, len);
> - if (act == XDP_DROP)
> + switch (act) {
> + case XDP_PASS:
> + break;
> + case XDP_TX:
> + goto xdp_xmit;
> + case XDP_DROP:
> + default:
> goto err_xdp;
> + }
> }
> rcu_read_unlock();
>
> @@ -510,6 +562,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> err_buf:
> dev->stats.rx_dropped++;
> dev_kfree_skb(head_skb);
> +xdp_xmit:
> return NULL;
> }
>
^ permalink raw reply
* Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support
From: Richard Cochran @ 2016-11-30 18:45 UTC (permalink / raw)
To: Grygorii Strashko
Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
linux-omap, Rob Herring, devicetree, Murali Karicheri,
Wingman Kwok
In-Reply-To: <20161128230428.6872-5-grygorii.strashko@ti.com>
On Mon, Nov 28, 2016 at 05:04:26PM -0600, Grygorii Strashko wrote:
> +static cycle_t cpts_cc_ns2cyc(struct cpts *cpts, u64 nsecs)
> +{
> + cycle_t cyc = (nsecs << cpts->cc.shift) + nsecs;
> +
> + do_div(cyc, cpts->cc.mult);
> +
> + return cyc;
> +}
So you set the comparison value once per second, based on cc.mult.
But when the clock is being actively synchronized, user space calls to
clock_adjtimex() will change cc.mult. This can happen several times
per second, depending on the PTP Sync rate.
In order to produce the PPS edge correctly, you would have to adjust
the comparison value whenever cc.mult changes, but of course this is
unworkable.
So I'll have to say NAK for this patch.
Thanks,
Richard
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox