* Re: [PATCH net 0/3] mlx4 fix for shutdown flow
From: David Miller @ 2016-11-18 18:47 UTC (permalink / raw)
To: tariqt; +Cc: netdev, eranbe, saeedm, eugenia
In-Reply-To: <1479397251-6932-1-git-send-email-tariqt@mellanox.com>
From: Tariq Toukan <tariqt@mellanox.com>
Date: Thu, 17 Nov 2016 17:40:48 +0200
> This patchset fixes an invalid reference to mdev in mlx4 shutdown flow.
>
> In patch 1, we make sure netif_device_detach() is called from shutdown flow only,
> since we want to keep the device present during a simple configuration change.
>
> In patches 2 and 3, we add checks that were missing in:
> * dev_get_phys_port_id
> * dev_get_phys_port_name
> We check the presence of the network device before calling the driver's
> callbacks. This already exists for all other ndo's.
>
> Series generated against net commit:
> e5f6f564fd19 bnxt: add a missing rcu synchronization
I don't like where this is going nor the precedent it is setting.
If you are taking the device into a state where it cannot be safely
accessed by ndo operations, then you _MUST_ do whatever is necessary
to make sure the device is unregistered and cannot be found in the
various global lists and tables of network devices.
This is mandatory.
And this is how we must fix these kinds of problems instead of
peppering device presence tests all over the place. That will be
error prone and in the long term a huge maintenance burden.
I'm not applying this series, sorry. You have to fix this properly.
^ permalink raw reply
* Re: [PATCH net v2 7/7] net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
From: David Miller @ 2016-11-18 18:48 UTC (permalink / raw)
To: johan; +Cc: mugunthanvnm, grygorii.strashko, linux-omap, netdev, linux-kernel
In-Reply-To: <20161117171920.GD10490@localhost>
From: Johan Hovold <johan@kernel.org>
Date: Thu, 17 Nov 2016 18:19:20 +0100
> On Thu, Nov 17, 2016 at 12:04:16PM -0500, David Miller wrote:
>> From: Johan Hovold <johan@kernel.org>
>> Date: Thu, 17 Nov 2016 17:40:04 +0100
>>
>> > Make sure to propagate errors from of_phy_register_fixed_link() which
>> > can fail with -EPROBE_DEFER.
>> >
>> > Fixes: 1f71e8c96fc6 ("drivers: net: cpsw: Add support for fixed-link
>> > PHY")
>> > Signed-off-by: Johan Hovold <johan@kernel.org>
>>
>> Johan, when you update a patch within a series you must post the
>> entire series freshly to the lists, cover posting and all.
>
> I'm quite sure that is exactly what I did. Did you only get this last
> patch out of the seven?
I ended up getting it delayed, thanks.
^ permalink raw reply
* Re: [PATCH net v2 0/7] net: cpsw: fix leaks and probe deferral
From: David Miller @ 2016-11-18 18:49 UTC (permalink / raw)
To: johan; +Cc: mugunthanvnm, grygorii.strashko, linux-omap, netdev, linux-kernel
In-Reply-To: <1479400804-9847-1-git-send-email-johan@kernel.org>
From: Johan Hovold <johan@kernel.org>
Date: Thu, 17 Nov 2016 17:39:57 +0100
> This series fixes a number of leaks and issues in the cpsw probe-error
> and driver-unbind paths, some of which specifically prevented deferred
> probing.
...
> v2
> - Keep platform device runtime-resumed throughout probe instead of
> resuming in the probe error path as suggested by Grygorii (patch
> 1/7).
>
> - Runtime-resume platform device before registering any children in
> order to make sure it is synchronously suspended after deregistering
> children in the error path (patch 3/7).
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next v4 0/5] net: Enable COMPILE_TEST for Marvell & Freescale drivers
From: David Miller @ 2016-11-18 18:54 UTC (permalink / raw)
To: f.fainelli; +Cc: netdev, mw, arnd, gregory.clement, Shaohui.Xie, andrew
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 17 Nov 2016 11:19:09 -0800
> This patch series allows building the Freescale and Marvell Ethernet
> network drivers with COMPILE_TEST.
Thanks for doing this work, this kind of thing helps me a lot.
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules
From: Florian Westphal @ 2016-11-18 18:54 UTC (permalink / raw)
To: David Miller; +Cc: fw, netdev, edumazet, ycheng, ncardwell
In-Reply-To: <20161118.134304.1574614508508170507.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Thu, 17 Nov 2016 13:56:51 +0100
>
> > The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
> > which un-does reno halving behaviour.
> >
> > It seems more appropriate to let congctl algorithms pair .ssthresh
> > and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
> > up for all congestion algorithms that used to rely on the fallback.
> >
> > highspeed, illinois, scalable, veno and yeah use 'reno undo' while their
> > .ssthresh implementation doesn't halve the slowstart threshold; this
> > might point to a similar issue to the one fixed for dctcp in
> > ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss").
> >
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Yuchung Cheng <ycheng@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
>
> If you really suspect that highspeed et al. need to implement their own
> undo_cwnd instead of using the default reno fallback, I would really
> rather that this gets either fixed or explicitly marked as likely wrong
> (in an "XXX" comment or similar).
Ok, fair enough. I am not familiar with these algorithms, I will check
what they're doing in more detail and if absolutely needed resubmit this
patch with XXX/FIXME/TODO comments added.
> Otherwise nobody is going to remember this down the road.
Agreed.
^ permalink raw reply
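The concern raised in the thread above, that a reno-style undo only restores the window correctly when it is paired with a reno-style halving, can be illustrated with a small userspace sketch. All function names here are hypothetical, and the "gentle" reduction by one eighth is an assumption chosen for illustration rather than a statement about any specific kernel module.

```c
#include <stdint.h>

/* Hypothetical helper, not a kernel symbol. */
static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Reno-style loss response: halve the window, never below 2 segments. */
static uint32_t reno_style_ssthresh(uint32_t cwnd)
{
	return max_u32(cwnd / 2, 2);
}

/* Assumed gentler loss response: reduce the window by one eighth only. */
static uint32_t gentle_ssthresh(uint32_t cwnd)
{
	return max_u32(cwnd - (cwnd >> 3), 2);
}

/* Reno-style undo: restore the window by doubling ssthresh. */
static uint32_t reno_style_undo(uint32_t cwnd_now, uint32_t ssthresh)
{
	return max_u32(cwnd_now, ssthresh * 2);
}
```

With a window of 100 segments, the reno pair returns to 100 after an undo, while the gentle reduction combined with the same undo overshoots to 176, which is the kind of mismatch the commit message hints at.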
* Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
From: David Miller @ 2016-11-18 18:57 UTC (permalink / raw)
To: Woojung.Huh; +Cc: f.fainelli, netdev, andrew, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40966AA4@CHN-SV-EXMX02.mchp-main.com>
From: <Woojung.Huh@microchip.com>
Date: Thu, 17 Nov 2016 22:10:02 +0000
> From: Woojung Huh <woojung.huh@microchip.com>
>
> Relocate mdix code to phy driver to be called at config_init().
>
> Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
Applied, thank you.
^ permalink raw reply
* Re: [Patch net v2] af_unix: conditionally use freezable blocking calls in read
From: David Miller @ 2016-11-18 18:59 UTC (permalink / raw)
To: xiyou.wangcong; +Cc: netdev, dvyukov, tj, ccross, rafael.j.wysocki, hannes
In-Reply-To: <1479426926-28197-1-git-send-email-xiyou.wangcong@gmail.com>
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 17 Nov 2016 15:55:26 -0800
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably
> correct at that time, but later, commit 2b514574f7e8
> ("net: af_unix: implement splice for stream af_unix sockets") breaks
> the strong requirement for a freezable sleep, according to
> commit 0f9548ca1091:
>
> We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
>
> The pipe_lock is still held at that point.
>
> So use freezable version only for the recvmsg call path, avoid impact for
> Android.
>
> Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* [PATCH 0/5] XDP for virtio_net
From: John Fastabend @ 2016-11-18 18:59 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
This implements XDP for virtio_net in the mergeable buffers and big_packet
modes. I tested this with vhost_net running on qemu and did not see
any issues.
There are some restrictions for XDP to be enabled (see patch 3 for
more details).
1. LRO must be off
2. MTU must be less than PAGE_SIZE
3. queues must be available to dedicate to XDP
4. num_bufs received in mergeable buffers must be 1
5. big_packet mode must have all data on single page
Please review; any comments/feedback are welcome as always.
Thanks,
John
---
John Fastabend (4):
net: virtio dynamically disable/enable LRO
net: xdp: add invalid buffer warning
virtio_net: add dedicated XDP transmit queues
virtio_net: add XDP_TX support
Shrijeet Mukherjee (1):
virtio_net: Add XDP support
drivers/net/virtio_net.c | 264 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/filter.h | 1
net/core/filter.c | 6 +
3 files changed, 267 insertions(+), 4 deletions(-)
^ permalink raw reply
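The load-time restrictions listed in the cover letter above (items 1 to 3) could be sketched as a single precondition check. This is a userspace illustration only: the struct and function names are hypothetical, the MTU comparison mirrors the `mtu > PAGE_SIZE` test posted in patch 3, and items 4 and 5 are per-packet checks that are not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the device state relevant to attaching XDP. */
struct xdp_precheck {
	bool lro_enabled;
	unsigned int mtu;
	unsigned int page_size;
	unsigned int curr_queue_pairs;
	unsigned int max_queue_pairs;
	unsigned int online_cpus;
};

/* Returns NULL if XDP may be attached, else a short reason string. */
static const char *xdp_precheck_fail(const struct xdp_precheck *s)
{
	if (s->lro_enabled)
		return "LRO must be off";
	if (s->mtu > s->page_size)
		return "MTU must fit in one page";
	/* One extra queue pair per online CPU is reserved for XDP. */
	if (s->curr_queue_pairs + s->online_cpus > s->max_queue_pairs)
		return "not enough queue pairs for XDP";
	return NULL;
}
```

Returning a reason string rather than a bare error code is a choice made for this sketch; the posted patches log a warning and return an errno value instead.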
* [PATCH 1/5] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-11-18 18:59 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
This adds support for dynamically setting the LRO feature flag. The
message to control guest features in the backend uses the
CTRL_GUEST_OFFLOADS msg type.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/virtio_net.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2cafd12..0758cae 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
};
+static int virtnet_set_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+ struct virtnet_info *vi = netdev_priv(netdev);
+ struct virtio_device *vdev = vi->vdev;
+ struct scatterlist sg;
+ u64 offloads = 0;
+
+ if (features & NETIF_F_LRO)
+ offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
+ (1 << VIRTIO_NET_F_GUEST_TSO6);
+
+ if (features & NETIF_F_RXCSUM)
+ offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
+ sg_init_one(&sg, &offloads, sizeof(uint64_t));
+ if (!virtnet_send_command(vi,
+ VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+ VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
+ &sg)) {
+ dev_warn(&netdev->dev,
+ "Failed to set guest offloads by virtnet command.\n");
+ return -EINVAL;
+ }
+ } else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
+ !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+ dev_warn(&netdev->dev,
+ "No support for setting offloads pre version_1.\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static const struct net_device_ops virtnet_netdev = {
.ndo_open = virtnet_open,
.ndo_stop = virtnet_close,
@@ -1435,6 +1470,7 @@ static void virtnet_init_settings(struct net_device *dev)
#ifdef CONFIG_NET_RX_BUSY_POLL
.ndo_busy_poll = virtnet_busy_poll,
#endif
+ .ndo_set_features = virtnet_set_features,
};
static void virtnet_config_changed_work(struct work_struct *work)
@@ -1810,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
dev->features |= NETIF_F_RXCSUM;
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
+ virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+ dev->features |= NETIF_F_LRO;
+ dev->hw_features |= NETIF_F_LRO;
+ }
+
dev->vlan_features = dev->features;
/* MTU range: 68 - 65535 */
@@ -2049,6 +2091,7 @@ static int virtnet_restore(struct virtio_device *vdev)
VIRTIO_NET_F_CTRL_MAC_ADDR,
VIRTIO_F_ANY_LAYOUT,
VIRTIO_NET_F_MTU,
+ VIRTIO_NET_F_CTRL_GUEST_OFFLOADS,
};
static struct virtio_driver virtio_net_driver = {
^ permalink raw reply related
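The mapping from netdev feature flags to the guest offloads bitmask in patch 1/5 above can be sketched in userspace as follows. The virtio feature bit numbers are taken from the virtio-net specification; the `DEMO_F_*` flags are hypothetical stand-ins for `NETIF_F_RXCSUM` and `NETIF_F_LRO` with arbitrary values. The sketch uses `1ULL` for the shifts so that the 64-bit mask is built without relying on `int` width.

```c
#include <stdint.h>

/* Feature bit numbers as defined by the virtio-net specification. */
#define VIRTIO_NET_F_GUEST_CSUM 1
#define VIRTIO_NET_F_GUEST_TSO4 7
#define VIRTIO_NET_F_GUEST_TSO6 8

/* Hypothetical stand-ins for netdev feature flags; values are arbitrary. */
#define DEMO_F_RXCSUM (1u << 0)
#define DEMO_F_LRO    (1u << 1)

/* Build the 64-bit offloads mask that would be sent to the backend. */
static uint64_t guest_offloads_from_features(uint32_t features)
{
	uint64_t offloads = 0;

	if (features & DEMO_F_LRO)
		offloads |= (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
			    (1ULL << VIRTIO_NET_F_GUEST_TSO6);

	if (features & DEMO_F_RXCSUM)
		offloads |= 1ULL << VIRTIO_NET_F_GUEST_CSUM;

	return offloads;
}
```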
* [PATCH 2/5] net: xdp: add invalid buffer warning
From: John Fastabend @ 2016-11-18 18:59 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
This adds a warning for drivers to use when encountering an invalid
buffer for XDP. In normal cases this should not happen, but a standard
warning is useful for catching behaviour from the emulation layer in
virtual/qemu setups that I may not have expected.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/filter.h | 1 +
net/core/filter.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..0c79004 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -595,6 +595,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
const struct bpf_insn *patch, u32 len);
void bpf_warn_invalid_xdp_action(u32 act);
+void bpf_warn_invalid_xdp_buffer(void);
#ifdef CONFIG_BPF_JIT
extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index cd9e2ba..b8fb57c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2722,6 +2722,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
}
EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+void bpf_warn_invalid_xdp_buffer(void)
+{
+ WARN_ONCE(1, "Illegal XDP buffer encountered, expect packet loss\n");
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
+
static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
int src_reg, int ctx_off,
struct bpf_insn *insn_buf,
^ permalink raw reply related
* [PATCH 3/5] virtio_net: Add XDP support
From: John Fastabend @ 2016-11-18 19:00 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
From: Shrijeet Mukherjee <shrijeet@gmail.com>
This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. The MTU must be less than a page size
to avoid having to handle XDP across multiple pages.
If mergeable receive is enabled this first series only supports
the case where header and data are in the same buf which we can
check when a packet is received by looking at num_buf. If the
num_buf is greater than 1 and an XDP program is loaded the packet
is dropped and a warning is thrown. When any_header_sg is set this
does not happen and both header and data are put in a single buffer
as expected so we check this when XDP programs are loaded. Note I
have only tested this with Linux vhost backend.
If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.
A follow-on patch can be generated to solve the mergeable receive
case with num_bufs equal to 2. More than two buffers may not
be handled as easily.
Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/virtio_net.c | 144 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 140 insertions(+), 4 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0758cae..16c257d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,6 +22,7 @@
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_net.h>
+#include <linux/bpf.h>
#include <linux/scatterlist.h>
#include <linux/if_vlan.h>
#include <linux/slab.h>
@@ -81,6 +82,8 @@ struct receive_queue {
struct napi_struct napi;
+ struct bpf_prog *xdp_prog;
+
/* Chain pages by the private ptr. */
struct page *pages;
@@ -324,6 +327,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
return skb;
}
+static u32 do_xdp_prog(struct virtnet_info *vi,
+ struct bpf_prog *xdp_prog,
+ struct page *page, int offset, int len)
+{
+ int hdr_padded_len;
+ struct xdp_buff xdp;
+ u32 act;
+ u8 *buf;
+
+ buf = page_address(page) + offset;
+
+ if (vi->mergeable_rx_bufs)
+ hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+ else
+ hdr_padded_len = sizeof(struct padded_vnet_hdr);
+
+ xdp.data = buf + hdr_padded_len;
+ xdp.data_end = xdp.data + (len - vi->hdr_len);
+
+ act = bpf_prog_run_xdp(xdp_prog, &xdp);
+ switch (act) {
+ case XDP_PASS:
+ return XDP_PASS;
+ default:
+ bpf_warn_invalid_xdp_action(act);
+ case XDP_TX:
+ case XDP_ABORTED:
+ case XDP_DROP:
+ return XDP_DROP;
+ }
+}
+
static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
{
struct sk_buff * skb = buf;
@@ -340,9 +375,19 @@ static struct sk_buff *receive_big(struct net_device *dev,
void *buf,
unsigned int len)
{
+ struct bpf_prog *xdp_prog;
struct page *page = buf;
- struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
+ struct sk_buff *skb;
+
+ xdp_prog = rcu_dereference(rq->xdp_prog);
+ if (xdp_prog) {
+ u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
+
+ if (act == XDP_DROP)
+ goto err;
+ }
+ skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
if (unlikely(!skb))
goto err;
@@ -366,10 +411,25 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+ struct sk_buff *head_skb, *curr_skb;
+ struct bpf_prog *xdp_prog;
- struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
- truesize);
- struct sk_buff *curr_skb = head_skb;
+ xdp_prog = rcu_dereference(rq->xdp_prog);
+ if (xdp_prog) {
+ u32 act;
+
+ if (num_buf > 1) {
+ bpf_warn_invalid_xdp_buffer();
+ goto err_skb;
+ }
+
+ act = do_xdp_prog(vi, xdp_prog, page, offset, len);
+ if (act == XDP_DROP)
+ goto err_skb;
+ }
+
+ head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
+ curr_skb = head_skb;
if (unlikely(!curr_skb))
goto err_skb;
@@ -1328,6 +1388,13 @@ static int virtnet_set_channels(struct net_device *dev,
if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
return -EINVAL;
+ /* For now we don't support modifying channels while XDP is loaded
+ * also when XDP is loaded all RX queues have XDP programs so we only
+ * need to check a single RX queue.
+ */
+ if (vi->rq[0].xdp_prog)
+ return -EINVAL;
+
get_online_cpus();
err = virtnet_set_queues(vi, queue_pairs);
if (!err) {
@@ -1454,6 +1521,68 @@ static int virtnet_set_features(struct net_device *netdev,
return 0;
}
+static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ struct bpf_prog *old_prog;
+ int i;
+
+ if ((dev->features & NETIF_F_LRO) && prog) {
+ netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
+ return -EINVAL;
+ }
+
+ if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
+ netdev_warn(dev, "XDP expects header/data in single page\n");
+ return -EINVAL;
+ }
+
+ if (dev->mtu > PAGE_SIZE) {
+ netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
+ return -EINVAL;
+ }
+
+ if (prog) {
+ prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+ }
+
+ for (i = 0; i < vi->max_queue_pairs; i++) {
+ old_prog = rcu_dereference(vi->rq[i].xdp_prog);
+ rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
+ if (old_prog)
+ bpf_prog_put(old_prog);
+ }
+
+ return 0;
+}
+
+static bool virtnet_xdp_query(struct net_device *dev)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ int i;
+
+ for (i = 0; i < vi->max_queue_pairs; i++) {
+ if (vi->rq[i].xdp_prog)
+ return true;
+ }
+ return false;
+}
+
+static int virtnet_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+ switch (xdp->command) {
+ case XDP_SETUP_PROG:
+ return virtnet_xdp_set(dev, xdp->prog);
+ case XDP_QUERY_PROG:
+ xdp->prog_attached = virtnet_xdp_query(dev);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
static const struct net_device_ops virtnet_netdev = {
.ndo_open = virtnet_open,
.ndo_stop = virtnet_close,
@@ -1471,6 +1600,7 @@ static int virtnet_set_features(struct net_device *netdev,
.ndo_busy_poll = virtnet_busy_poll,
#endif
.ndo_set_features = virtnet_set_features,
+ .ndo_xdp = virtnet_xdp,
};
static void virtnet_config_changed_work(struct work_struct *work)
@@ -1527,11 +1657,17 @@ static void virtnet_free_queues(struct virtnet_info *vi)
static void free_receive_bufs(struct virtnet_info *vi)
{
+ struct bpf_prog *old_prog;
int i;
for (i = 0; i < vi->max_queue_pairs; i++) {
while (vi->rq[i].pages)
__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
+
+ old_prog = rcu_dereference(vi->rq[i].xdp_prog);
+ RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
+ if (old_prog)
+ bpf_prog_put(old_prog);
}
}
^ permalink raw reply related
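The receive-path decision in patch 3/5 above can be summarized in a small userspace sketch: a packet spanning more than one mergeable buffer is dropped, and every program result other than "pass" is treated as a drop. The enum and function names are hypothetical; the numeric ordering of the actions follows the kernel's `enum xdp_action`. As a simplification, the sketch takes the program's result as an argument, whereas the posted code checks num_buf before running the program at all.

```c
/* Hypothetical mirror of the XDP action codes used in the patch. */
enum demo_xdp_action {
	DEMO_XDP_ABORTED = 0,
	DEMO_XDP_DROP,
	DEMO_XDP_PASS,
	DEMO_XDP_TX,
};

/*
 * Collapse the program's result into pass or drop, as patch 3/5 does
 * before XDP_TX support is added.
 */
static enum demo_xdp_action demo_rx_verdict(unsigned int num_buf,
					    unsigned int act)
{
	/* Multi-buffer packets are not supported while XDP is loaded. */
	if (num_buf > 1)
		return DEMO_XDP_DROP;

	/* Unknown, aborted, TX and drop results all end in a drop. */
	return act == DEMO_XDP_PASS ? DEMO_XDP_PASS : DEMO_XDP_DROP;
}
```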
* Re: pull-request: mac80211 2016-11-18
From: David Miller @ 2016-11-18 19:00 UTC (permalink / raw)
To: johannes-cdvu00un1VgdHxzADdlk8Q
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161118075201.30081-1-johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
From: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
Date: Fri, 18 Nov 2016 08:52:00 +0100
> Due to travel/vacation, this is a bit late, but there aren't
> that many fixes either. Most interesting/important are the
> fixes from Felix and perhaps the scan entry limit.
>
> Please pull and let me know if there's any problem.
Pulled, thanks a lot Johannes.
^ permalink raw reply
* [PATCH 4/5] virtio_net: add dedicated XDP transmit queues
From: John Fastabend @ 2016-11-18 19:00 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
XDP requires using isolated transmit queues to avoid interference
with the normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds an XDP queue per cpu when an XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.
However, the virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/virtio_net.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 16c257d..631ee07 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -114,6 +114,9 @@ struct virtnet_info {
/* # of queue pairs currently used by the driver */
u16 curr_queue_pairs;
+ /* # of XDP queue pairs currently used by the driver */
+ u16 xdp_queue_pairs;
+
/* I like... big packets and I cannot lie! */
bool big_packets;
@@ -1525,7 +1528,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
{
struct virtnet_info *vi = netdev_priv(dev);
struct bpf_prog *old_prog;
- int i;
+ u16 xdp_qp = 0, curr_qp;
+ int err, i;
if ((dev->features & NETIF_F_LRO) && prog) {
netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
@@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
return -EINVAL;
}
+ curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
+ if (prog)
+ xdp_qp = num_online_cpus();
+
+ /* XDP requires extra queues for XDP_TX */
+ if (curr_qp + xdp_qp > vi->max_queue_pairs) {
+ netdev_warn(dev, "request %i queues but max is %i\n",
+ curr_qp + xdp_qp, vi->max_queue_pairs);
+ return -ENOMEM;
+ }
+
+ err = virtnet_set_queues(vi, curr_qp + xdp_qp);
+ if (err) {
+ dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+ return err;
+ }
+
if (prog) {
- prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
- if (IS_ERR(prog))
+ prog = bpf_prog_add(prog, vi->max_queue_pairs);
+ if (IS_ERR(prog)) {
+ virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
+ }
}
+ vi->xdp_queue_pairs = xdp_qp;
+ netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
+
for (i = 0; i < vi->max_queue_pairs; i++) {
old_prog = rcu_dereference(vi->rq[i].xdp_prog);
rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
^ permalink raw reply related
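The queue accounting in patch 4/5 above can be sketched as a pure function: the pairs used by the stack are the current pairs minus those already dedicated to XDP, and attaching a program asks for one more pair per online CPU. The function name and the choice to return the new total are assumptions made for this illustration; only the arithmetic and the `-ENOMEM` result follow the posted diff.

```c
#include <errno.h>

/*
 * Returns the total number of queue pairs to configure, or -ENOMEM if
 * the device cannot provide enough pairs. 'attach' is nonzero when a
 * program is being loaded and zero when it is being removed.
 */
static int xdp_queue_budget(unsigned int curr_queue_pairs,
			    unsigned int xdp_queue_pairs,
			    unsigned int online_cpus,
			    unsigned int max_queue_pairs,
			    int attach)
{
	/* Pairs visible to the normal networking stack. */
	unsigned int stack_qp = curr_queue_pairs - xdp_queue_pairs;
	/* One dedicated pair per online CPU while a program is attached. */
	unsigned int xdp_qp = attach ? online_cpus : 0;

	if (stack_qp + xdp_qp > max_queue_pairs)
		return -ENOMEM;

	return (int)(stack_qp + xdp_qp);
}
```

For example, a device with 4 stack pairs, 4 online CPUs and a maximum of 8 pairs can attach a program, while the same device limited to 6 pairs cannot.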
* [PATCH 5/5] virtio_net: add XDP_TX support
From: John Fastabend @ 2016-11-18 19:01 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem
Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.
Before sending the packet the header is zeroed. Also XDP is expected
to handle checksum correctly so no checksum offload support is
provided.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/virtio_net.c | 57 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 3 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 631ee07..4b22938 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -330,12 +330,40 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
return skb;
}
+static void virtnet_xdp_xmit(struct virtnet_info *vi,
+ unsigned int qnum, struct xdp_buff *xdp)
+{
+ struct send_queue *sq = &vi->sq[qnum];
+ struct virtio_net_hdr_mrg_rxbuf *hdr;
+ unsigned int num_sg, len;
+ void *xdp_sent;
+
+ /* Free up any pending old buffers before queueing new ones. */
+ while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+ struct page *page = virt_to_head_page(xdp_sent);
+
+ put_page(page);
+ }
+
+ /* Zero header and leave csum up to XDP layers */
+ hdr = xdp->data;
+ memset(hdr, 0, vi->hdr_len);
+ hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+ hdr->hdr.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+
+ num_sg = 1;
+ sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+ virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, xdp->data, GFP_ATOMIC);
+ virtqueue_kick(sq->vq);
+}
+
static u32 do_xdp_prog(struct virtnet_info *vi,
struct bpf_prog *xdp_prog,
struct page *page, int offset, int len)
{
int hdr_padded_len;
struct xdp_buff xdp;
+ unsigned int qp;
u32 act;
u8 *buf;
@@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
switch (act) {
case XDP_PASS:
return XDP_PASS;
+ case XDP_TX:
+ qp = vi->curr_queue_pairs -
+ vi->xdp_queue_pairs +
+ smp_processor_id();
+ xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
+ virtnet_xdp_xmit(vi, qp, &xdp);
+ return XDP_TX;
default:
bpf_warn_invalid_xdp_action(act);
- case XDP_TX:
case XDP_ABORTED:
case XDP_DROP:
return XDP_DROP;
@@ -386,8 +420,15 @@ static struct sk_buff *receive_big(struct net_device *dev,
if (xdp_prog) {
u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
- if (act == XDP_DROP)
+ switch (act) {
+ case XDP_PASS:
+ break;
+ case XDP_TX:
+ goto xdp_xmit;
+ case XDP_DROP:
+ default:
goto err;
+ }
}
skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
@@ -399,6 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
err:
dev->stats.rx_dropped++;
give_pages(rq, page);
+xdp_xmit:
return NULL;
}
@@ -417,6 +459,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
struct sk_buff *head_skb, *curr_skb;
struct bpf_prog *xdp_prog;
+ head_skb = NULL;
xdp_prog = rcu_dereference(rq->xdp_prog);
if (xdp_prog) {
u32 act;
@@ -427,8 +470,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
}
act = do_xdp_prog(vi, xdp_prog, page, offset, len);
- if (act == XDP_DROP)
+ switch (act) {
+ case XDP_PASS:
+ break;
+ case XDP_TX:
+ goto xdp_xmit;
+ case XDP_DROP:
+ default:
goto err_skb;
+ }
}
head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
@@ -502,6 +552,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
err_buf:
dev->stats.rx_dropped++;
dev_kfree_skb(head_skb);
+xdp_xmit:
return NULL;
}
^ permalink raw reply related
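The transmit queue selection described in patch 5/5 above places the XDP queues after the stack's queues and indexes them by CPU. A minimal userspace sketch of that index calculation follows. The function name is hypothetical, and the bounds guard is an addition for this illustration that does not appear in the posted patch.

```c
/*
 * Returns the send-queue index for XDP_TX on the given CPU, or -1 if
 * the CPU has no dedicated XDP queue (guard added for this sketch).
 */
static int xdp_tx_queue_index(unsigned int curr_queue_pairs,
			      unsigned int xdp_queue_pairs,
			      unsigned int cpu)
{
	if (xdp_queue_pairs > curr_queue_pairs || cpu >= xdp_queue_pairs)
		return -1;

	/* XDP queues occupy the tail of the configured queue pairs. */
	return (int)(curr_queue_pairs - xdp_queue_pairs + cpu);
}
```

With 8 configured pairs of which 4 are dedicated to XDP, CPU 0 maps to queue 4 and CPU 3 maps to queue 7.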
* Re: [PATCH] netns: fix get_net_ns_by_fd(int pid) typo
From: David Miller @ 2016-11-18 19:02 UTC (permalink / raw)
To: stefanha; +Cc: netdev
In-Reply-To: <1479462106-28529-1-git-send-email-stefanha@redhat.com>
From: Stefan Hajnoczi <stefanha@redhat.com>
Date: Fri, 18 Nov 2016 09:41:46 +0000
> The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
> descriptor not a pid. Fix the typo.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Applied.
^ permalink raw reply
* Re: [patch net-next] liquidio CN23XX: bitwise vs logical AND typo
From: David Miller @ 2016-11-18 19:04 UTC (permalink / raw)
To: dan.carpenter
Cc: derek.chickles, raghu.vatsavayi, satananda.burla, felix.manlunas,
netdev, kernel-janitors
In-Reply-To: <20161118114734.GB3281@mwanda>
From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 18 Nov 2016 14:47:35 +0300
> We obviously intended a bitwise AND here, not a logical one.
>
> Fixes: 8c978d059224 ("liquidio CN23XX: Mailbox support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Applied.
^ permalink raw reply
* Re: [PATCH] liquidio CN23XX: check if PENDING bit is clear using logical and
From: David Miller @ 2016-11-18 19:04 UTC (permalink / raw)
To: colin.king
Cc: derek.chickles, satananda.burla, felix.manlunas, raghu.vatsavayi,
netdev, linux-kernel
In-Reply-To: <20161118184532.5282-1-colin.king@canonical.com>
From: Colin King <colin.king@canonical.com>
Date: Fri, 18 Nov 2016 18:45:32 +0000
> From: Colin Ian King <colin.king@canonical.com>
>
> The mbox state should be bitwise ANDed rather than logically ANDed
> with OCTEON_MBOX_STATE_RESPONSE_PENDING. Fix this by using the
> correct & operator instead of &&.
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
Dan Carpenter already submitted a fix for this.
^ permalink raw reply
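The difference between the two operators discussed in the liquidio fixes above can be shown with a short userspace sketch. The function names and the flag values used in the example are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>

/* Buggy form: true whenever both operands are nonzero. */
static bool flag_set_logical(unsigned int state, unsigned int mask)
{
	return state && mask;
}

/* Intended form: true only if the masked bit is actually set. */
static bool flag_set_bitwise(unsigned int state, unsigned int mask)
{
	return (state & mask) != 0;
}
```

With a state of 0x2 and a mask of 0x4, the logical form reports the flag as set even though the bit is clear, while the bitwise form correctly reports it as clear.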
* Re: [PATCH net-next] cxgb4: Allocate Tx queues dynamically
From: David Miller @ 2016-11-18 19:04 UTC (permalink / raw)
To: atul.gupta-ut6Up61K2wZBDgjK7y7TUQ
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
target-devel-u79uwXL29TY76Z2rM5mHXA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA, nab-IzHhD5pYlfBP7FQvKIMDCQ,
jejb-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q,
leedom-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
varun-ut6Up61K2wZBDgjK7y7TUQ,
swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
hariprasad-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <1479467260-6509-1-git-send-email-atul.gupta-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
From: Atul Gupta <atul.gupta-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Date: Fri, 18 Nov 2016 16:37:40 +0530
> From: Hariprasad Shenai <hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>
> Allocate resources dynamically for upper layer drivers (ULDs) like
> cxgbit, iw_cxgb4, cxgb4i and chcr. The resources allocated include Tx
> queues, which are allocated when a ULD registers with the cxgb4 driver
> and freed when it unregisters. The Tx queues which are shared by ULDs
> shall be allocated by the first registering driver and freed by the
> last unregistering driver.
>
> Signed-off-by: Atul Gupta <atul.gupta-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Applied.
^ permalink raw reply
* Re: [PATCH 0/5] XDP for virtio_net
From: John Fastabend @ 2016-11-18 19:06 UTC (permalink / raw)
To: tgraf, shm, alexei.starovoitov, daniel, davem, Michael S. Tsirkin
Cc: john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161118185517.16137.92123.stgit@john-Precision-Tower-5810>
On 16-11-18 10:59 AM, John Fastabend wrote:
> This implements virtio_net for the mergeable buffers and big_packet
> modes. I tested this with vhost_net running on qemu and did not see
> any issues.
>
> There are some restrictions for XDP to be enabled (see patch 3) for
> more details.
>
> 1. LRO must be off
> 2. MTU must be less than PAGE_SIZE
> 3. queues must be available to dedicate to XDP
> 4. num_bufs received in mergeable buffers must be 1
> 5. big_packet mode must have all data on single page
>
> Please review any comments/feedback welcome as always.
>
> Thanks,
> John
> ---
>
Hi Dave,
Should be obvious, but this is for net-next; I dropped the tag from my
git-send command.
Also, I missed probably the most important person on the CC/TO list:
+Michael Tsirkin.
Thanks,
John
^ permalink raw reply
* Re: [PATCH net] rtnetlink: fix FDB size computation
From: David Miller @ 2016-11-18 19:10 UTC (permalink / raw)
To: sd; +Cc: netdev, hubert.sokolowski
In-Reply-To: <9bef2a8afa6f8193fcca61ea381b67a52ab878b2.1479477903.git.sd@queasysnail.net>
From: Sabrina Dubroca <sd@queasysnail.net>
Date: Fri, 18 Nov 2016 15:50:39 +0100
> Add missing NDA_VLAN attribute's size.
>
> Fixes: 1e53d5bb8878 ("net: Pass VLAN ID to rtnl_fdb_notify.")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Applied and queued up for -stable, thanks.
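As a rough illustration of what a missing attribute size means for a netlink message, the sketch below computes an FDB notification size with and without room for a 16-bit VLAN attribute. It is a userspace approximation, not the kernel's rtnetlink code: the helper names are hypothetical, and the header and payload sizes are assumptions that mirror netlink's usual 4-byte alignment rules.

```c
#include <stddef.h>

/* Assumed sizes, for illustration only. */
#define SKETCH_NLA_HDRLEN 4u  /* aligned attribute header */
#define SKETCH_NDMSG_LEN  12u /* assumed sizeof(struct ndmsg) */
#define SKETCH_ETH_ALEN   6u  /* MAC address payload (NDA_LLADDR) */
#define SKETCH_VLAN_LEN   2u  /* 16-bit VLAN ID payload (NDA_VLAN) */

/* Round a length up to the next multiple of 4. */
size_t sketch_align4(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Header plus payload, padded to the alignment boundary. */
size_t sketch_nla_total_size(size_t payload)
{
	return sketch_align4(SKETCH_NLA_HDRLEN + payload);
}

/* Size that forgets the VLAN attribute. */
size_t fdb_size_without_vlan(void)
{
	return sketch_align4(SKETCH_NDMSG_LEN) +
	       sketch_nla_total_size(SKETCH_ETH_ALEN);
}

/* Size that reserves room for the VLAN attribute as well. */
size_t fdb_size_with_vlan(void)
{
	return fdb_size_without_vlan() +
	       sketch_nla_total_size(SKETCH_VLAN_LEN);
}
```

Under these assumptions the two sizes differ by 8 bytes, so a buffer sized without the VLAN attribute can run out of room when that attribute is appended to the notification.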
^ permalink raw reply
* Re: [PATCH] mlxsw: switchib: add MLXSW_PCI dependency
From: David Miller @ 2016-11-18 19:13 UTC (permalink / raw)
To: arnd; +Cc: jiri, idosch, ivecera, eladr, netdev, linux-kernel
In-Reply-To: <20161118160127.473555-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 18 Nov 2016 17:01:14 +0100
> The newly added switchib driver fails to link if MLXSW_PCI=m:
>
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_exit':
> switchib.c:(.exit.text+0x8): undefined reference to `mlxsw_pci_driver_unregister'
> switchib.c:(.exit.text+0x10): undefined reference to `mlxsw_pci_driver_unregister'
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_init':
> switchib.c:(.init.text+0x28): undefined reference to `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x38): undefined reference to `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x48): undefined reference to `mlxsw_pci_driver_unregister'
>
> The other two such sub-drivers have a dependency, so add the same one
> here. In theory we could allow this driver if MLXSW_PCI is disabled,
> but it's probably not worth it.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Please resubmit this with a proper fixes tag that identifies the commit that
added the switchib driver.
Thanks.
^ permalink raw reply
* [PATCH net-next] mlx4: avoid unnecessary dirtying of critical fields
From: Eric Dumazet @ 2016-11-18 20:15 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Tariq Toukan
From: Eric Dumazet <edumazet@google.com>
While stressing a 40Gbit mlx4 NIC with busy polling, I found false
sharing in mlx4 driver that can be easily avoided.
This patch brings an additional 7 % performance improvement in UDP_RR
workload.
1) If we received no frame during one mlx4_en_process_rx_cq()
invocation, no need to call mlx4_cq_set_ci() and/or dirty ring->cons
2) Do not refill rx buffers if we have plenty of them.
This avoids false sharing and allows some bulk/batch optimizations.
Page allocator and its locks will thank us.
Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined
cpu handling NIC IRQ should be changed. We should return budget-1
instead, to not fool net_rx_action() and its netdev_budget.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 51 +++++++++++--------
1 file changed, 32 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 22f08f9ef464..2112494ff43b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -688,18 +688,23 @@ static void validate_loopback(struct mlx4_en_priv *priv, struct sk_buff *skb)
dev_kfree_skb_any(skb);
}
-static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
- struct mlx4_en_rx_ring *ring)
+static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
+ struct mlx4_en_rx_ring *ring)
{
- int index = ring->prod & ring->size_mask;
+ u32 missing = ring->actual_size - (ring->prod - ring->cons);
- while ((u32) (ring->prod - ring->cons) < ring->actual_size) {
- if (mlx4_en_prepare_rx_desc(priv, ring, index,
+ /* Try to batch allocations, but not too much. */
+ if (missing < 8)
+ return false;
+ do {
+ if (mlx4_en_prepare_rx_desc(priv, ring,
+ ring->prod & ring->size_mask,
GFP_ATOMIC | __GFP_COLD))
break;
ring->prod++;
- index = ring->prod & ring->size_mask;
- }
+ } while (--missing);
+
+ return true;
}
/* When hardware doesn't strip the vlan, we need to calculate the checksum
@@ -1081,15 +1086,20 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
out:
rcu_read_unlock();
- if (doorbell_pending)
- mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
-
- AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
- mlx4_cq_set_ci(&cq->mcq);
- wmb(); /* ensure HW sees CQ consumer before we post new buffers */
- ring->cons = cq->mcq.cons_index;
- mlx4_en_refill_rx_buffers(priv, ring);
- mlx4_en_update_rx_prod_db(ring);
+
+ if (polled) {
+ if (doorbell_pending)
+ mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
+
+ AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
+ mlx4_cq_set_ci(&cq->mcq);
+ wmb(); /* ensure HW sees CQ consumer before we post new buffers */
+ ring->cons = cq->mcq.cons_index;
+ }
+
+ if (mlx4_en_refill_rx_buffers(priv, ring))
+ mlx4_en_update_rx_prod_db(ring);
+
return polled;
}
@@ -1131,10 +1141,13 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
return budget;
/* Current cpu is not according to smp_irq_affinity -
- * probably affinity changed. need to stop this NAPI
- * poll, and restart it on the right CPU
+ * probably affinity changed. Need to stop this NAPI
+ * poll, and restart it on the right CPU.
+ * Try to avoid returning a too small value (like 0),
+ * to not fool net_rx_action() and its netdev_budget
*/
- done = 0;
+ if (done)
+ done--;
}
/* Done for now */
if (napi_complete_done(napi, done))
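
The batching rule in the refill path can be tried out in userspace. The sketch below is a simplified model, not the driver: the struct and function names are hypothetical, descriptor allocation is reduced to a counter increment, and only the threshold of 8 and the unsigned `prod - cons` arithmetic are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct mlx4_en_rx_ring
 * that the refill logic reads and writes. */
struct rx_ring_sketch {
	uint32_t prod;        /* free-running producer counter */
	uint32_t cons;        /* free-running consumer counter */
	uint32_t actual_size; /* number of usable descriptors */
};

#define REFILL_BATCH_MIN 8u

/* Returns true if buffers were posted, meaning the caller should
 * update the producer doorbell; false if the ring was left alone. */
bool refill_sketch(struct rx_ring_sketch *ring)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	uint32_t missing = ring->actual_size - (ring->prod - ring->cons);

	/* Skip small refills so ring->prod is not dirtied on every poll. */
	if (missing < REFILL_BATCH_MIN)
		return false;

	do {
		/* The real driver prepares a descriptor here and
		 * breaks out of the loop if allocation fails. */
		ring->prod++;
	} while (--missing);

	return true;
}
```

A ring with 95 of 100 buffers posted is left untouched, while a ring with only 5 posted is filled back up to `actual_size` in one batch.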
^ permalink raw reply related
* Re: [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules
From: Neal Cardwell @ 2016-11-18 20:38 UTC (permalink / raw)
To: Florian Westphal; +Cc: David Miller, Netdev, Eric Dumazet, Yuchung Cheng
In-Reply-To: <20161118185434.GB18071@breakpoint.cc>
On Fri, Nov 18, 2016 at 1:54 PM, Florian Westphal <fw@strlen.de> wrote:
> David Miller <davem@davemloft.net> wrote:
>> If you really suspect that highspeed et al. need to implement their own
>> undo_cwnd instead of using the default reno fallback, I would really
>> rather that this gets either fixed or explicitly marked as likely wrong
>> (in an "XXX" comment or similar).
>
> Ok, fair enough. I am not familiar with these algorithms, I will check
> what they're doing in more detail and if absolutely needed resubmit this
> patch with XXX/FIXME/TODO comments added.
BTW, FWIW I really like the idea of making undo_cwnd required. It
simplifies the core code and forces CC modules to think about what
undo should look like for their CC module.
And I suspect you are right that those CC modules have an issue that
should be fixed.
neal
^ permalink raw reply
* Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues
From: Jakub Kicinski @ 2016-11-18 21:09 UTC (permalink / raw)
To: John Fastabend
Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
netdev, bblanco, brouer
In-Reply-To: <20161118190041.16137.48399.stgit@john-Precision-Tower-5810>
Looks very cool! :)
On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:
> @@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> return -EINVAL;
> }
>
> + curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
> + if (prog)
> + xdp_qp = num_online_cpus();
Is num_online_cpus() correct here?
^ permalink raw reply
* [PATCH net] l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
From: Guillaume Nault @ 2016-11-18 21:13 UTC (permalink / raw)
To: netdev; +Cc: James Chapman, Chris Elston, Baozeng Ding, Andrey Konovalov
Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.
BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
ffff880031d97838 ffffffff829f835b ffff88001b5a1640 ffff8800081b0ec0
ffff8800081b15a0 ffff8800081b6d20 ffff880031d97860 ffffffff8174d3cc
ffff880031d978f0 ffff8800081b0e80 ffff88001b5a1640 ffff880031d978e0
Call Trace:
[<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
[<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
[< inline >] print_address_description mm/kasan/report.c:194
[<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
[< inline >] kasan_report mm/kasan/report.c:303
[<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
[< inline >] __write_once_size ./include/linux/compiler.h:249
[< inline >] __hlist_del ./include/linux/list.h:622
[< inline >] hlist_del_init ./include/linux/list.h:637
[<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
[<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813774f9>] task_work_run+0xf9/0x170
[<ffffffff81324aae>] do_exit+0x85e/0x2a00
[<ffffffff81326dc8>] do_group_exit+0x108/0x330
[<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[<ffffffff811b49af>] do_signal+0x7f/0x18f0
[<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
[ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
[ 1116.897025] [< inline >] slab_post_alloc_hook mm/slab.h:417
[ 1116.897025] [< inline >] slab_alloc_node mm/slub.c:2708
[ 1116.897025] [< inline >] slab_alloc mm/slub.c:2716
[ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
[ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
[ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
[ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
[ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
[ 1116.897025] [< inline >] sock_create net/socket.c:1193
[ 1116.897025] [< inline >] SYSC_socket net/socket.c:1223
[ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
[ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
[ 1116.897025] [< inline >] slab_free_hook mm/slub.c:1352
[ 1116.897025] [< inline >] slab_free_freelist_hook mm/slub.c:1374
[ 1116.897025] [< inline >] slab_free mm/slub.c:2951
[ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
[ 1116.897025] [< inline >] sk_prot_free net/core/sock.c:1369
[ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
[ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
[ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
[ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
[ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
[ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
[ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
[ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
[ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
[ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
[ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[ 1116.897025] [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Memory state around the buggy address:
ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table.
Fixes: c51ce49735c1 ("l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
net/l2tp/l2tp_ip.c | 5 +++--
net/l2tp/l2tp_ip6.c | 5 +++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index fce25af..982f6c4 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -251,8 +251,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
int ret;
int chk_addr_ret;
- if (!sock_flag(sk, SOCK_ZAPPED))
- return -EINVAL;
if (addr_len < sizeof(struct sockaddr_l2tpip))
return -EINVAL;
if (addr->l2tp_family != AF_INET)
@@ -267,6 +265,9 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
read_unlock_bh(&l2tp_ip_lock);
lock_sock(sk);
+ if (!sock_flag(sk, SOCK_ZAPPED))
+ goto out;
+
if (sk->sk_state != TCP_CLOSE || addr_len < sizeof(struct sockaddr_l2tpip))
goto out;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index ad3468c..9978d01 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -269,8 +269,6 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
int addr_type;
int err;
- if (!sock_flag(sk, SOCK_ZAPPED))
- return -EINVAL;
if (addr->l2tp_family != AF_INET6)
return -EINVAL;
if (addr_len < sizeof(*addr))
@@ -296,6 +294,9 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
lock_sock(sk);
err = -EINVAL;
+ if (!sock_flag(sk, SOCK_ZAPPED))
+ goto out_unlock;
+
if (sk->sk_state != TCP_CLOSE)
goto out_unlock;
--
2.10.2
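
The locking pattern of the fix, testing the flag only after the lock is held, can be modelled in userspace. The sketch below is an analogy under stated assumptions, not the l2tp code: a pthread mutex stands in for lock_sock(), a boolean for SOCK_ZAPPED, a counter for insertion into the bind table, and all names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the socket state that bind() touches. */
struct sock_sketch {
	pthread_mutex_t lock; /* stand-in for the socket lock */
	bool zapped;          /* stand-in for SOCK_ZAPPED: true until bound */
	int bind_count;       /* how many times the socket was "inserted" */
};

/* Returns 0 on success, -1 (standing in for -EINVAL) if already bound.
 * The flag is checked under the lock, so two concurrent callers cannot
 * both observe zapped == true and both insert the socket. */
int bind_sketch(struct sock_sketch *sk)
{
	int ret = -1;

	pthread_mutex_lock(&sk->lock);
	if (!sk->zapped)
		goto out;

	sk->bind_count++;   /* models adding the socket to the bind table */
	sk->zapped = false; /* models clearing the flag once bound */
	ret = 0;
out:
	pthread_mutex_unlock(&sk->lock);
	return ret;
}
```

Checking `zapped` before taking the lock would reopen the window described in the commit message: both callers could pass the test, and the socket would be inserted twice.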
^ permalink raw reply related