* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 10:34 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1480960639.18162.556.camel@edumazet-glaptop3.roam.corp.google.com>
Hi Eric,
On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
>
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
>
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
>
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
>
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
>
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Thank you for all the good work.
After all your improvement, I see the cacheline miss in inet_recvmsg()
as a major perf offender for the user space process in the udp flood
scenario due to skc_rxhash sharing the same sk_drops cacheline.
Using an udp-specific drop counter (and an sk_drops accessor to wrap
sk_drops access where needed), we could avoid such cache miss. With that
- patch for udp.h only below - I get 3% improvement on top of all the
pending udp patches, and the gain should be more relevant after the 2
queues rework. What do you think ?
Cheers,
Paolo.
---
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..2bbf5db 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,11 @@ struct udp_sock {
unsigned int corkflag; /* Cork is required */
__u8 encap_type; /* Is this an Encapsulation socket? */
unsigned char no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
- no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+ no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+ pcflag:6; /* UDP-Lite specific, moved here to */
+ /* fill an hole, marks socket as */
+ /* UDP-Lite if > 0 */
+
/*
* Following member retains the information to create a UDP header
* when the socket is uncorked.
@@ -64,8 +68,7 @@ struct udp_sock {
#define UDPLITE_BIT 0x1 /* set by udplite proto init function */
#define UDPLITE_SEND_CC 0x2 /* set via udplite setsockopt */
#define UDPLITE_RECV_CC 0x4 /* set via udplite setsocktopt */
- __u8 pcflag; /* marks socket as UDP-Lite if > 0 */
- __u8 unused[3];
+
/*
* For encapsulation sockets.
*/
@@ -79,6 +82,9 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+
+ /* since we are prone to drops, avoid dirtying any sk cacheline */
+ atomic_t drops ____cacheline_aligned_in_smp;
};
static inline struct udp_sock *udp_sk(const struct sock *sk)
^ permalink raw reply related
* Re: [PATCH iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code
From: Simon Horman @ 2016-12-06 10:18 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480703079-25422-4-git-send-email-simon.horman@netronome.com>
On Fri, Dec 02, 2016 at 07:24:39PM +0100, Simon Horman wrote:
> Support matching on ICMP type and code.
>
> Example usage:
>
> tc qdisc add dev eth0 ingress
>
> tc filter add dev eth0 protocol ip parent ffff: flower \
> indev eth0 ip_proto icmp type 8 code 0 action drop
>
> tc filter add dev eth0 protocol ipv6 parent ffff: flower \
> indev eth0 ip_proto icmpv6 type 128 code 0 action drop
>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>
This series no longer applies to net-next.
...
> @@ -524,6 +597,12 @@ static void flower_print_port(FILE *f, char *name, struct rtattr *attr)
> fprintf(f, "\n %s %d", name, ntohs(rta_getattr_u16(attr)));
> }
>
> +static void flower_print_icmp(FILE *f, char *name, struct rtattr *attr)
> +{
> + if (attr)
> + fprintf(f, "\n %s %d", name, ntohs(rta_getattr_u8(attr)));
ntohs() should be removed from the above as its argument is a u8.
...
^ permalink raw reply
* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 9:53 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1480960639.18162.556.camel@edumazet-glaptop3.roam.corp.google.com>
Hi Eric,
On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
>
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
>
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
>
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
I'm wondering if peeking with offset is going to complicate the 2 queues
patch, too.
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
>
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/datagram.c | 19 ++++++++++---------
> 1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> if (error)
> goto no_packet;
>
> + *peeked = 0;
> do {
> /* Again only user level code calls this function, so nothing
> * interrupt level will suddenly eat the receive_queue.
> @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> spin_lock_irqsave(&queue->lock, cpu_flags);
> skb_queue_walk(queue, skb) {
> *last = skb;
> - *peeked = skb->peeked;
> if (flags & MSG_PEEK) {
> if (_off >= skb->len && (skb->len || _off ||
> skb->peeked)) {
> _off -= skb->len;
> continue;
> }
> -
> - skb = skb_set_peeked(skb);
> - error = PTR_ERR(skb);
> - if (IS_ERR(skb)) {
> - spin_unlock_irqrestore(&queue->lock,
> - cpu_flags);
> - goto no_packet;
> + if (!skb->len) {
> + skb = skb_set_peeked(skb);
> + if (IS_ERR(skb)) {
> + error = PTR_ERR(skb);
> + spin_unlock_irqrestore(&queue->lock,
> + cpu_flags);
> + goto no_packet;
> + }
> }
I don't understand why we can avoid setting skb->peek if len > 0. I
think that will change the kernel behavior if:
- peek with offset is set
- 3 skbs with len > 0 are enqueued
- the u/s peek (with offset) the second one
- the u/s disable peeking with offset and peeks 2 more skbs.
With the current code in the last step the u/s is going to peek the 1#
and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
missing something ? Probably the new behavior is more correct, but still
is a change.
I gave this a run in my test bed on top of your udp-related patches I
see additional ~3 improvement in the udp flood scenario, and a bit more
in the un-contended scenario.
Thank you,
Paolo
^ permalink raw reply
* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Jiri Pirko @ 2016-12-06 8:22 UTC (permalink / raw)
To: Xin Long; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <62f7aa9c00b68beed68a907c51dde22e7847d6c9.1481009348.git.lucien.xin@gmail.com>
Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>Now when users add a nic to team dev, the option 'enable' of the port
>is true by default, as team_port_enable enables it after dev_open in
>team_port_add.
>
>But even if the port_dev has no carrier, like it's cable was unpluged,
>the port is still enabled. It leads to that team dev couldn't work well
>if this port was chosen to connect, and has no chance to change to use
>other ports if link_watch is ethtool.
>
>This patch is to enable the port only when the port_dev has carrier in
>team_port_add.
>
>v1 -> v2:
> use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
> bool now.
>
>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>---
> drivers/net/team/team.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index a380649..4bc0103 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
> struct net_device *dev = team->dev;
> struct team_port *port;
> char *portname = port_dev->name;
>+ bool linkup;
> int err;
>
> if (port_dev->flags & IFF_LOOPBACK) {
>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>
> port->index = -1;
> list_add_tail_rcu(&port->list, &team->port_list);
>- team_port_enable(team, port);
>+ linkup = netif_carrier_ok(port_dev);
>+ if (linkup)
>+ team_port_enable(team, port);
This is obviously wrong change. It you use a simple setup without
userspace part (e.g. round robin), The port gets never enabled.
team_port_enabl is called from here and team_port_en_option_set.
By default, all ports should be enabled. Only in case the userspace
daemon decides to disable, it does so.
Could you send me the exact configuration where you see the issue?
This should be definitelly fixed in user part.
Thanks.
^ permalink raw reply
* Re: [PATCH]net:witchdev:release the lock before function return to avoid dead lock
From: Jiri Pirko @ 2016-12-06 8:06 UTC (permalink / raw)
To: Feng Deng; +Cc: David S. Miller, netdev, linux-kernel, feng.deng
In-Reply-To: <CAEX1niVLkt=LQhjrALsUrN+oEa9ZEUJ7X+tbNFs71RLm6ryZCA@mail.gmail.com>
Tue, Dec 06, 2016 at 03:58:29AM CET, cxdx2006@gmail.com wrote:
>From : Feng Deng <cxdx2006@gmail.com>
>
>In switchdev_deferred_dequeue(),only get the lock at the beginning,
>but forgot to release before return,
>we must release the lock to avoid dead lock
Feng, your patches are corrupted. Please use "git send-email"
Also, when you send a fit for -net tree, you have to pass byt the
"Fixes" tag. Please see the git log for details.
Thanks!
>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>
>---
> net/switchdev/switchdev.c | 1 +
> 1 file changed, 1 insertion(+)
> mode change 100644 => 100755 net/switchdev/switchdev.c
>
>diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>index 3b95fe9..c0a1ad4
>--- a/net/switchdev/switchdev.c
>+++ b/net/switchdev/switchdev.c
>@@ -120,6 +120,7 @@ static struct switchdev_deferred_item
>*switchdev_deferred_dequeue(void)
> dfitem = list_first_entry(&deferred,
> struct switchdev_deferred_item, list);
> list_del(&dfitem->list);
>+ spin_unlock_bh(&deferred_lock);
> unlock:
> spin_unlock_bh(&deferred_lock);
> return dfitem;
>--
>1.8.3.1
^ permalink raw reply
* Re: commit : ppp: add rtnetlink device creation support - breaks netcf on my machine.
From: Brad Campbell @ 2016-12-06 7:47 UTC (permalink / raw)
To: Guillaume Nault; +Cc: netdev
In-Reply-To: <20161205175320.mbtswpmdjumfgh7y@alphalink.fr>
On 06/12/16 01:53, Guillaume Nault wrote:
>>
> Probably not a mistake on your side. I've started looking at netcf'
> source code, but haven't found anything that could explain your issue.
> It'd really help if you could provide steps to reproduce the bug.
Further to my message this morning, I started with a clean linux.git
4.9.0-rc7-00198-g0cb65c8 and did two runs. One untouched and one with
the identified patch reverted. I logged both of these with NLCB=debug,
then split out the ppp section and diffed them.
It appears the only difference of note is the new ATTR 18. I did a diff
of the entire dump for both and nothing else popped out.
brad@test:~$ diff -u ppp-ok ppp-fail
--- ppp-ok 2016-12-06 13:32:04.358393578 +0800
+++ ppp-fail 2016-12-06 13:32:18.577864406 +0800
@@ -1,10 +1,10 @@
-------------------------- BEGIN NETLINK MESSAGE
---------------------------
[HEADER] 16 octets
- .nlmsg_len = 628
+ .nlmsg_len = 644
.nlmsg_type = 16 <route/link::new>
.nlmsg_flags = 2 <MULTI>
- .nlmsg_seq = 1481001940
- .nlmsg_pid = 7462
+ .nlmsg_seq = 1481002252
+ .nlmsg_pid = 7376
[PAYLOAD] 16 octets
00 00 00 02 0a 00 00 00 d1 10 01 00 00 00 00 00 ................
[ATTR 03] 5 octets
@@ -71,6 +71,8 @@
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
..................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
..................
00 00 00 00 00 00 ......
+ [ATTR 18] 12 octets
+ 08 00 01 00 70 70 70 00 04 00 02 00 ....ppp.....
[ATTR 26] 132 octets
84 00 02 00 80 00 01 00 01 00 00 00 00 00 00 00 00 00
..................
00 00 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
..................
@@ -81,3 +83,4 @@
00 00 00 00 10 27 00 00 e8 03 00 00 00 00 00 00 00 00
.....'............
00 00 00 00 00 00 ......
--------------------------- END NETLINK MESSAGE
---------------------------
Running with NLDBG=4 seems to generate this :
DBG<2>: While picking up for 0x26d2e00 <route/link>, recvmsgs() returned
-34: (errno = Numerical result out of range)DBG<1>: Clearing cache
0x26d2e00 <route/link>...
(skip forward 4 hours)
Ok, so I've spent the afternoon compiling and installing software.
I'm afraid I gave you a bum steer. The issue only manifests itself on
libnl1. I had both installed and netcf was compiling against 1 and not 3.
I spent the afternoon compiling and installing various combinations of
libnl and netcf and can only reproduce the issue if netcf is compiled
against libnl <= 1.1.4. It won't compile against 2, 3, or 3.1 and it
works against 3.2. That explains why it manifests itself on my clean
Debian 7 machines.
I can work around it locally by recompiling all my stuff against libnl3
if you don't feel inclined to chase it down, but it is certainly
reproducible on nl1. I compiled up 1.1.4 and compiled netcf-0.2.8
against that and the problem shows.
Regards,
Brad
^ permalink raw reply
* [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Xin Long @ 2016-12-06 7:29 UTC (permalink / raw)
To: network dev; +Cc: davem, Marcelo Ricardo Leitner, Jiri Pirko
Now when users add a nic to team dev, the option 'enable' of the port
is true by default, as team_port_enable enables it after dev_open in
team_port_add.
But even if the port_dev has no carrier, like it's cable was unpluged,
the port is still enabled. It leads to that team dev couldn't work well
if this port was chosen to connect, and has no chance to change to use
other ports if link_watch is ethtool.
This patch is to enable the port only when the port_dev has carrier in
team_port_add.
v1 -> v2:
use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
bool now.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
drivers/net/team/team.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a380649..4bc0103 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
struct net_device *dev = team->dev;
struct team_port *port;
char *portname = port_dev->name;
+ bool linkup;
int err;
if (port_dev->flags & IFF_LOOPBACK) {
@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
port->index = -1;
list_add_tail_rcu(&port->list, &team->port_list);
- team_port_enable(team, port);
+ linkup = netif_carrier_ok(port_dev);
+ if (linkup)
+ team_port_enable(team, port);
+
__team_compute_features(team);
- __team_port_change_port_added(port, !!netif_carrier_ok(port_dev));
+ __team_port_change_port_added(port, linkup);
__team_options_change_check(team);
netdev_info(dev, "Port device %s added\n", portname);
--
2.1.0
^ permalink raw reply related
* [PATCH] net: return value of skb_linearize should be handled in Linux kernel
From: Zhouyi Zhou @ 2016-12-06 7:10 UTC (permalink / raw)
To: faisal.latif, dledford, sean.hefty, hal.rosenstock,
jeffrey.t.kirsher, QLogic-Storage-Upstream, jejb, martin.petersen,
jth, jon.maloy, ying.xue, davem, linux-rdma, linux-kernel,
intel-wired-lan, netdev, linux-scsi, fcoe-devel, tipc-discussion
Cc: Zhouyi Zhou
kmalloc_reserve may fail to allocate memory inside skb_linearize,
which means skb_linearize's return value should not be ignored.
Following patch correct the uses of skb_linearize.
Compiled in x86_64
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
---
drivers/infiniband/hw/nes/nes_nic.c | 5 +++--
drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++++-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 7 +++++--
drivers/scsi/fcoe/fcoe.c | 5 ++++-
net/tipc/link.c | 3 ++-
net/tipc/name_distr.c | 5 ++++-
7 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 2b27d13..69372ea 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -662,10 +662,11 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev)
nesnic->sq_head &= nesnic->sq_size-1;
}
} else {
- nesvnic->linearized_skbs++;
hoffset = skb_transport_header(skb) - skb->data;
nhoffset = skb_network_header(skb) - skb->data;
- skb_linearize(skb);
+ if (skb_linearize(skb))
+ return NETDEV_TX_BUSY;
+ nesvnic->linearized_skbs++;
skb_set_transport_header(skb, hoffset);
skb_set_network_header(skb, nhoffset);
if (!nes_nic_send(skb, netdev))
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 2a653ec..ab787cb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -490,7 +490,11 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
*/
if ((fh->fh_r_ctl == FC_RCTL_DD_SOL_DATA) &&
(fctl & FC_FC_END_SEQ)) {
- skb_linearize(skb);
+ int err = 0;
+
+ err = skb_linearize(skb);
+ if (err)
+ return err;
crc = (struct fcoe_crc_eof *)skb_put(skb, sizeof(*crc));
crc->fcoe_eof = FC_EOF_T;
}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee1f29..4926d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
total_rx_bytes += ddp_bytes;
total_rx_packets += DIV_ROUND_UP(ddp_bytes,
mss);
- }
- if (!ddp_bytes) {
+ } else {
dev_kfree_skb_any(skb);
continue;
}
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index f9ddb61..197d02e 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -542,8 +542,11 @@ static void bnx2fc_recv_frame(struct sk_buff *skb)
return;
}
- if (skb_is_nonlinear(skb))
- skb_linearize(skb);
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
+ return;
+ }
+
mac = eth_hdr(skb)->h_source;
dest_mac = eth_hdr(skb)->h_dest;
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 9bd41a3..f691b97 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1685,7 +1685,10 @@ static void fcoe_recv_frame(struct sk_buff *skb)
skb->dev ? skb->dev->name : "<NULL>");
port = lport_priv(lport);
- skb_linearize(skb); /* check for skb_is_nonlinear is within skb_linearize */
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
+ return;
+ }
/*
* Frame length checks and setting up the header pointers
diff --git a/net/tipc/link.c b/net/tipc/link.c
index bda89bf..077c570 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1446,7 +1446,8 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
if (tipc_own_addr(l->net) > msg_prevnode(hdr))
l->net_plane = msg_net_plane(hdr);
- skb_linearize(skb);
+ if (skb_linearize(skb))
+ goto exit;
hdr = buf_msg(skb);
data = msg_data(hdr);
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index c1cfd92..4e05d2a 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -356,7 +356,10 @@ void tipc_named_rcv(struct net *net, struct sk_buff_head *inputq)
spin_lock_bh(&tn->nametbl_lock);
for (skb = skb_dequeue(inputq); skb; skb = skb_dequeue(inputq)) {
- skb_linearize(skb);
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
+ continue;
+ }
msg = buf_msg(skb);
mtype = msg_type(msg);
item = (struct distr_item *)msg_data(msg);
--
1.9.1
^ permalink raw reply related
* Re: [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: Gao Feng @ 2016-12-06 7:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Mahesh Bandewar, Eric Dumazet,
Linux Kernel Network Developers
In-Reply-To: <1481007190.18162.570.camel@edumazet-glaptop3.roam.corp.google.com>
Hi Eric,
On Tue, Dec 6, 2016 at 2:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-12-06 at 14:31 +0800, Gao Feng wrote:
>
>> Because I don't fully hold the ipvlan codes now, I am afraid of that
>> there is someone which may get the port address when
>> ipvlan_port_destroy. So the original ipvlan_port_destroy uses the
>> kfree_rcu to avoid it.
>>
>> I am sure there is unnecessary to use kfree in ipvlan_port_create.
>
> And I am pretty sure it is unnecessary to use kfree_rcu() in
> ipvlan_port_destroy() as well.
>
> I highly suggest you spend time on learning why.
>
>
>
Thanks your suggestion.
I will send v2 patch after get the reason by myself.
Begards
Feng
^ permalink raw reply
* Re: [PATCH net] team: team_port_add should check link_up before enable port
From: Xin Long @ 2016-12-06 6:54 UTC (permalink / raw)
To: Marcelo Ricardo Leitner; +Cc: network dev, davem, Jiri Pirko
In-Reply-To: <20161203145745.GB13159@localhost.localdomain>
On Sat, Dec 3, 2016 at 10:57 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Sat, Dec 03, 2016 at 09:42:11PM +0800, Xin Long wrote:
>> Now when users add a nic to team dev, the option 'enable' of the port
>> is true by default, as team_port_enable enables it after dev_open in
>> team_port_add.
>>
>> But even if the port_dev has no carrier, like it's cable was unpluged,
>> the port is still enabled. It leads to that team dev couldn't work well
>> if this port was chosen to connect, and has no chance to change to use
>> other ports if link_watch is ethtool.
>>
>> This patch is to enable the port only when the port_dev has carrier in
>> team_port_add.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> ---
>> drivers/net/team/team.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> index a380649..42004ac 100644
>> --- a/drivers/net/team/team.c
>> +++ b/drivers/net/team/team.c
>> @@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>> struct net_device *dev = team->dev;
>> struct team_port *port;
>> char *portname = port_dev->name;
>> + bool linkup;
>> int err;
>>
>> if (port_dev->flags & IFF_LOOPBACK) {
>> @@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>
>> port->index = -1;
>> list_add_tail_rcu(&port->list, &team->port_list);
>> - team_port_enable(team, port);
>> + linkup = !!netif_carrier_ok(port_dev);
>
> The !! here is not needed anymore, netif_carrier_ok already returns a
> bool.
> static inline bool netif_carrier_ok(const struct net_device *dev)
will repost, thanks.
>
>
>> + if (linkup)
>> + team_port_enable(team, port);
>> +
>> __team_compute_features(team);
>> - __team_port_change_port_added(port, !!netif_carrier_ok(port_dev));
>> + __team_port_change_port_added(port, linkup);
>> __team_options_change_check(team);
>>
>> netdev_info(dev, "Port device %s added\n", portname);
>> --
>> 2.1.0
>>
^ permalink raw reply
* Re: [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: Eric Dumazet @ 2016-12-06 6:53 UTC (permalink / raw)
To: Gao Feng
Cc: David S. Miller, Mahesh Bandewar, Eric Dumazet,
Linux Kernel Network Developers
In-Reply-To: <CA+6hz4pGw-+HvKVDCiFmXRd6YwioUXdZfOghtx0aFb6EPxcj5A@mail.gmail.com>
On Tue, 2016-12-06 at 14:31 +0800, Gao Feng wrote:
> Because I don't fully hold the ipvlan codes now, I am afraid of that
> there is someone which may get the port address when
> ipvlan_port_destroy. So the original ipvlan_port_destroy uses the
> kfree_rcu to avoid it.
>
> I am sure there is unnecessary to use kfree in ipvlan_port_create.
And I am pretty sure it is unnecessary to use kfree_rcu() in
ipvlan_port_destroy() as well.
I highly suggest you spend time on learning why.
^ permalink raw reply
* Re: [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: Gao Feng @ 2016-12-06 6:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Mahesh Bandewar, Eric Dumazet,
Linux Kernel Network Developers
In-Reply-To: <1481005511.18162.564.camel@edumazet-glaptop3.roam.corp.google.com>
Hi Eric,
On Tue, Dec 6, 2016 at 2:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-12-06 at 12:29 +0800, fgao@ikuai8.com wrote:
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> There is no one which may reference the "port" in ipvlan_port_create
>> when netdev_rx_handler_register failed. So it could free it directly
>> with kfree instead of kfree_rcu.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>> drivers/net/ipvlan/ipvlan_main.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>> index c6aa667..1a601151 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
>> return 0;
>>
>> err:
>> - kfree_rcu(port, rcu);
>> + kfree(port);
>> return err;
>> }
>>
>
> This looks a partial patch.
>
> If you really care, why don't you also replace the kfree_rcu() in
> ipvlan_port_destroy() ?
Because I don't fully hold the ipvlan codes now, I am afraid of that
there is someone which may get the port address when
ipvlan_port_destroy. So the original ipvlan_port_destroy uses the
kfree_rcu to avoid it.
I am sure there is unnecessary to use kfree in ipvlan_port_create.
Regards
Feng
>
>
>
> diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
> index 05a62d2216c54651f6158c35d446d2e395b38dc3..031093e1c25f55244e6bdfde4ebeb65c0f2f10c1 100644
> --- a/drivers/net/ipvlan/ipvlan.h
> +++ b/drivers/net/ipvlan/ipvlan.h
> @@ -97,7 +97,6 @@ struct ipvl_port {
> struct work_struct wq;
> struct sk_buff_head backlog;
> int count;
> - struct rcu_head rcu;
> };
>
> static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d)
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 5430460167b5e8945d29a3febdd324461bf5af5c..ffe8994e64fc1791ef07d80ad2340bc82d541bba 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
> return 0;
>
> err:
> - kfree_rcu(port, rcu);
> + kfree(port);
> return err;
> }
>
> @@ -145,7 +145,7 @@ static void ipvlan_port_destroy(struct net_device *dev)
> netdev_rx_handler_unregister(dev);
> cancel_work_sync(&port->wq);
> __skb_queue_purge(&port->backlog);
> - kfree_rcu(port, rcu);
> + kfree(port);
> }
>
> #define IPVLAN_FEATURES \
>
>
>
^ permalink raw reply
* [patch net v4] net: fec: fix compile with CONFIG_M5272
From: Nikita Yushchenko @ 2016-12-06 6:26 UTC (permalink / raw)
To: David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
Eric Nelson, Philippe Reynes, Johannes Berg, netdev
Cc: Chris Healy, Fabio Estevam, linux-kernel, Nikita Yushchenko
Commit 80cca775cdc4 ("net: fec: cache statistics while device is down")
introduced unconditional statistics-related actions.
However, when driver is compiled with CONFIG_M5272, staticsics-related
definitions do not exist, which results into build errors.
Fix that by adding explicit handling of !defined(CONFIG_M5272) case.
Fixes: 80cca775cdc4 ("net: fec: cache statistics while device is down")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
---
Changes from v3:
- fix reference commit id to match upstream tree
drivers/net/ethernet/freescale/fec_main.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 5f77caa59534..12aef1b15356 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -2313,6 +2313,8 @@ static const struct fec_stat {
{ "IEEE_rx_octets_ok", IEEE_R_OCTETS_OK },
};
+#define FEC_STATS_SIZE (ARRAY_SIZE(fec_stats) * sizeof(u64))
+
static void fec_enet_update_ethtool_stats(struct net_device *dev)
{
struct fec_enet_private *fep = netdev_priv(dev);
@@ -2330,7 +2332,7 @@ static void fec_enet_get_ethtool_stats(struct net_device *dev,
if (netif_running(dev))
fec_enet_update_ethtool_stats(dev);
- memcpy(data, fep->ethtool_stats, ARRAY_SIZE(fec_stats) * sizeof(u64));
+ memcpy(data, fep->ethtool_stats, FEC_STATS_SIZE);
}
static void fec_enet_get_strings(struct net_device *netdev,
@@ -2355,6 +2357,12 @@ static int fec_enet_get_sset_count(struct net_device *dev, int sset)
return -EOPNOTSUPP;
}
}
+
+#else /* !defined(CONFIG_M5272) */
+#define FEC_STATS_SIZE 0
+static inline void fec_enet_update_ethtool_stats(struct net_device *dev)
+{
+}
#endif /* !defined(CONFIG_M5272) */
static int fec_enet_nway_reset(struct net_device *dev)
@@ -3293,8 +3301,7 @@ fec_probe(struct platform_device *pdev)
/* Init network device */
ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
- ARRAY_SIZE(fec_stats) * sizeof(u64),
- num_tx_qs, num_rx_qs);
+ FEC_STATS_SIZE, num_tx_qs, num_rx_qs);
if (!ndev)
return -ENOMEM;
--
2.1.4
^ permalink raw reply related
* Re: [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: Eric Dumazet @ 2016-12-06 6:25 UTC (permalink / raw)
To: fgao; +Cc: davem, maheshb, edumazet, netdev, gfree.wind
In-Reply-To: <1480998547-9013-1-git-send-email-fgao@ikuai8.com>
On Tue, 2016-12-06 at 12:29 +0800, fgao@ikuai8.com wrote:
> From: Gao Feng <fgao@ikuai8.com>
>
> There is no one which may reference the "port" in ipvlan_port_create
> when netdev_rx_handler_register failed. So it could free it directly
> with kfree instead of kfree_rcu.
>
> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> ---
> drivers/net/ipvlan/ipvlan_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index c6aa667..1a601151 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
> return 0;
>
> err:
> - kfree_rcu(port, rcu);
> + kfree(port);
> return err;
> }
>
This looks a partial patch.
If you really care, why don't you also replace the kfree_rcu() in
ipvlan_port_destroy() ?
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 05a62d2216c54651f6158c35d446d2e395b38dc3..031093e1c25f55244e6bdfde4ebeb65c0f2f10c1 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -97,7 +97,6 @@ struct ipvl_port {
struct work_struct wq;
struct sk_buff_head backlog;
int count;
- struct rcu_head rcu;
};
static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 5430460167b5e8945d29a3febdd324461bf5af5c..ffe8994e64fc1791ef07d80ad2340bc82d541bba 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
return 0;
err:
- kfree_rcu(port, rcu);
+ kfree(port);
return err;
}
@@ -145,7 +145,7 @@ static void ipvlan_port_destroy(struct net_device *dev)
netdev_rx_handler_unregister(dev);
cancel_work_sync(&port->wq);
__skb_queue_purge(&port->backlog);
- kfree_rcu(port, rcu);
+ kfree(port);
}
#define IPVLAN_FEATURES \
^ permalink raw reply related
* Re: [PATCH net v2] ipv6: Allow IPv4-mapped address as next-hop
From: Erik Nordmark @ 2016-12-06 6:08 UTC (permalink / raw)
To: David Miller, nordmark; +Cc: netdev, gilligan
In-Reply-To: <20161205.145223.1574324293082144579.davem@davemloft.net>
On 12/5/16 11:52 AM, David Miller wrote:
> From: Erik Nordmark <nordmark@arista.com>
> Date: Sat, 3 Dec 2016 20:57:09 -0800
>
>> Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.
>>
>> It is possible to configure IP interfaces with IPv4-mapped addresses, and
>> one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
>> to this fix the kernel returned an EINVAL when attempting to add an IPv6
>> route with an IPv4-mapped address as a nexthop/gateway.
>>
>> RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
>> thus in order to support that type of address configuration the kernel
>> needs to allow IPv4-mapped addresses as nexthops.
>>
>> Signed-off-by: Erik Nordmark <nordmark@arista.com>
>> Signed-off-by: Bob Gilligan <gilligan@arista.com>
> Applied to net-next, thanks.
>
Thanks, especially for moving it from net to net-next.
I guess I don't fully understand what is considered a bug fix for net as
opposed to new stuff for net-next. Is the former mostly for regressions
and serious bugs? This was a fix for a bug that's been there since the
beginning of IPv6 time AFAICT.
Erik
^ permalink raw reply
* Re: [flamebait] xdp Was: Re: bpf bounded loops. Was: [flamebait] xdp
From: Alexei Starovoitov @ 2016-12-06 6:04 UTC (permalink / raw)
To: Tom Herbert
Cc: Hannes Frederic Sowa, Thomas Graf,
Linux Kernel Network Developers, Daniel Borkmann, David S. Miller
In-Reply-To: <CALx6S34zRN0MJOLR=xA9V936QtJ5FHzUHvQT9QK_383Eozu-CQ@mail.gmail.com>
On Mon, Dec 05, 2016 at 09:08:36PM -0800, Tom Herbert wrote:
> On Mon, Dec 5, 2016 at 7:05 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sun, Dec 04, 2016 at 05:05:28PM +0100, Hannes Frederic Sowa wrote:
> >>
> >> If one of those eBPF verifiers only accepts a certain number of INSN, as
> >> fundamental as backwards jumps, we might end up with two compiler?
> >
> > two compilers? We already have five. There is gcc bpf backend (unmaintained)
> > and now lua, python and ply project can generate bpf code without llvm.
> > The kernel verifier has to become smarter. Right now it understands
> > only certain instruction patterns which caused all five bpf generators to
> > do extra work to satisfy the verifier. The solution is to do
> > data flow analysis using proper compiler techniques.
> >
> >> program thinks). Ergo, more complexity. What do you do when one of those
> >> two systems fail? What is the reference data? What do you do if on a
> >> highly busy box during DoS constant reloading of your vmalloc happens (I
> >> don't know if it is a problem under DoS)?
> >
> > ddos is one of the key use cases for xdp. If the system is about to oom
> > during ddos, it has to be fixed. The faster we move with xdp development
> > the sooner we will find and fix those issues.
> > And xdp being a core component of the linux kernel we will fix ddos
> > for the whole internet. Anyone going dpdk route are simply in
> > business of selling ddos protection with proprietary solutions.
> >
> Hi Alexei,
>
> I am wondering exactly how XDP fixes DDOS in a non-proprietary
> fashion. While the XDP infrastructure is part of the core kernel, the
> programs are not part of the kernel as you mention below. So what will
> a DDOS solution based on XDP for the whole Internet look like? Do you
> envision a set of "blessed" DDOS programs that various sites can use
> and configure (maybe some maintained open source repository), or will
> each site need to come up with their own XDP programs for DDOS?
At some point we would need a repository of these 'blessed' programs.
Some of them will not be programs, but program generators
similar to existing Cloudflare bpf setup:
https://github.com/cloudflare/bpftools
and instead of doing things like:
https://github.com/cloudflare/lua-aho-corasick
and reimplementing them in proprietary c++,
the dfa/aho-corasick will be implemented as a kernel helper.
That's what I was alluding to in
https://github.com/iovisor/bcc/issues/471
Then all of the research in that area like:
https://ir.nctu.edu.tw/bitstream/11536/26033/1/000288319400006.pdf
will be applicable and researchers will be sharing
these detector programs.
Of course, not everyone will open up their secret sauce,
but a lot of folks will do and it will drive the innovation.
^ permalink raw reply
* [PATCH net] be2net: Add DEVSEC privilege to SET_HSW_CONFIG command.
From: Suresh Reddy @ 2016-12-06 5:33 UTC (permalink / raw)
To: netdev, venkatkumar.duvvuru
From: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
OPCODE_COMMON_GET_FN_PRIVILEGES is returning only DEVSEC
privilege (Unrestricted Administrative Privilege) for Lancer NIC functions.
So, driver is failing SET_HSW_CONFIG command, as DEVSEC privilege was not
set in the privilege bitmap. This patch fixes the problem by setting DEVSEC
privilege in SET_HSW_CONFIG’s privilege bitmap.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
---
drivers/net/ethernet/emulex/benet/be_cmds.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 1fb5d72..0e74529 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -90,7 +90,8 @@ static struct be_cmd_priv_map cmd_priv_map[] = {
{
OPCODE_COMMON_SET_HSW_CONFIG,
CMD_SUBSYSTEM_COMMON,
- BE_PRIV_DEVCFG | BE_PRIV_VHADM
+ BE_PRIV_DEVCFG | BE_PRIV_VHADM |
+ BE_PRIV_DEVSEC
},
{
OPCODE_COMMON_GET_EXT_FAT_CAPABILITIES,
--
2.10.1
^ permalink raw reply related
* Re: [flamebait] xdp Was: Re: bpf bounded loops. Was: [flamebait] xdp
From: Tom Herbert @ 2016-12-06 5:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Hannes Frederic Sowa, Thomas Graf,
Linux Kernel Network Developers, Daniel Borkmann, David S. Miller
In-Reply-To: <20161206030525.GA89307@ast-mbp.thefacebook.com>
On Mon, Dec 5, 2016 at 7:05 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, Dec 04, 2016 at 05:05:28PM +0100, Hannes Frederic Sowa wrote:
>>
>> If one of those eBPF verifiers only accepts a certain number of INSN, as
>> fundamental as backwards jumps, we might end up with two compiler?
>
> two compilers? We already have five. There is gcc bpf backend (unmaintained)
> and now lua, python and ply project can generate bpf code without llvm.
> The kernel verifier has to become smarter. Right now it understands
> only certain instruction patterns which caused all five bpf generators to
> do extra work to satisfy the verifier. The solution is to do
> data flow analysis using proper compiler techniques.
>
>> program thinks). Ergo, more complexity. What do you do when one of those
>> two systems fail? What is the reference data? What do you do if on a
>> highly busy box during DoS constant reloading of your vmalloc happens (I
>> don't know if it is a problem under DoS)?
>
> ddos is one of the key use cases for xdp. If the system is about to oom
> during ddos, it has to be fixed. The faster we move with xdp development
> the sooner we will find and fix those issues.
> And xdp being a core component of the linux kernel we will fix ddos
> for the whole internet. Anyone going dpdk route are simply in
> business of selling ddos protection with proprietary solutions.
>
Hi Alexei,
I am wondering exactly how XDP fixes DDOS in a non-proprietary
fashion. While the XDP infrastructure is part of the core kernel, the
programs are not part of the kernel as you mention below. So what will
a DDOS solution based on XDP for the whole Internet look like? Do you
envision a set of "blessed" DDOS programs that various sites can use
and configure (maybe some maintained open source repository), or will
each site need to come up with their own XDP programs for DDOS?
Thanks,
Tom
>> I tried to argue that someone wanting to build netmap/DPDK-alike things
>> in XDP, one faces the problem of synchronized IPC. Hashmaps solve this
>> to some degree but cannot be synchronized.
>
> I don't see ipc as a problem and, yes, xdp is the best platform so far
> to deliver packets to user space. I think that the dataplane-in-the-driver
> is going to be faster than the fastest streaming to user space approach,
> but we cannot rule one way or the other without trying multiple
> approaches first and benchmarking them against each other.
> So I very much in favor of Jesper's effort to deliver packets to user space.
>
>> DPDK even can configure various hw offloads already before the kernel
>> can do so.
>
> that's a harsh lesson that the kernel needs to learn. Since people went
> to dpdk to do hw offload it means it's our fault that we were not
> accommodative and flexible enough to provide such frameworks within
> the kernel. imo John's flow/match api should have been accepted
> and it would have been solid building block towards such offloads.
>
>> If users want to use those, they switch to DPDK also, as I
>> have seen the industry always wanting the best performance. DPDK can use
>> SIMD instructions, all AVX, SSE and MMX stuff, and they do it.
>
> agree as well. The kernel needs to find a way to use all of these
> fancy instructions where performance matters.
> People who say "kernel cannot do simd" just didn't try hard enough.
>
>> Debugging is harder but currently worked on. But will probably always be
>> harder than simply using a debugger.
>
> That's actually the important value proposition of xdp+bpf, since
> non-working bpf program is not a concern for the kernel support team.
> Unlike kernel modules that the kernel team needs to bless and support
> in production, bpf programs are outside of that scope. They are part
> of user space apps and part of user space responsibility.
>
>> This all leads to gigantic user space control planes like neutron and
>> others that just make everyone's life much harder. The model requires
>> this. And that is what I fear.
>
> the neutron is complex and fragile, since it's using bridges on
> top of bridges with ebtables and ovs in the mix. Trying to manage
> many different kernel technologies and a mix of smaller control planes
> by this mega control plane is not an easy task.
>
>> I am not at all that negative against a hook before allocating the
>> packet, but making everyone using it and marketing as an alternative to
>> DPDK doesn't seem to fit for me.
>
> I don't see developers that are forced to use xdp. I see developers
> that are eager to use xdp as soon as support for it is available
> in their nics. Those like maglev who developed their own bypass
> are not going to use dpdk and people who already using dpdk are
> not going to switch to xdp, but there are lots of others who
> welcome xdp with open arms.
>
> Thanks
>
^ permalink raw reply
* [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: fgao @ 2016-12-06 4:29 UTC (permalink / raw)
To: davem, maheshb, edumazet, netdev, gfree.wind; +Cc: Gao Feng
From: Gao Feng <fgao@ikuai8.com>
There is no one which may reference the "port" in ipvlan_port_create
when netdev_rx_handler_register failed. So it could free it directly
with kfree instead of kfree_rcu.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
---
drivers/net/ipvlan/ipvlan_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index c6aa667..1a601151 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
return 0;
err:
- kfree_rcu(port, rcu);
+ kfree(port);
return err;
}
--
1.9.1
^ permalink raw reply related
* Re: "af_unix: conditionally use freezable blocking calls in read" is wrong
From: Cong Wang @ 2016-12-06 4:24 UTC (permalink / raw)
To: Al Viro; +Cc: David Miller, Linux Kernel Network Developers, Colin Cross
In-Reply-To: <20161205035241.GJ1555@ZenIV.linux.org.uk>
On Sun, Dec 4, 2016 at 7:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Dec 04, 2016 at 09:42:14PM -0500, David Miller wrote:
>> > I've run into that converting AF_UNIX to generic_file_splice_read();
>> > I can kludge around that ("freezable unless ->msg_iter is ITER_PIPE"), but
>> > that only delays trouble.
>> >
>> > Note that the only other user of freezable_schedule_timeout() is
>> > a very different story - it's a kernel thread, which *does* have a guaranteed
>> > locking environment. Making such assumptions in unix_stream_recvmsg(),
>> > OTOH, is insane...
>>
>> We have to otherwise Android phones drain their batteries in 10
>> minutes.
>>
>> I'm not going to revert this and be responsible for that.
>>
>> So you have to find a way to make the freezable calls legitimate.
>
> Oh, well... As I said, I can kludge around that - call from
> generic_file_splice_read() can be distinguished by looking at the
> ->msg_iter->type; it still means unpleasantness for kernel_recvmsg()
> users - in effect, it can only be called with locks held if you know that
> the socket is not an AF_UNIX one.
>
> BTW, how do they deal with plain pipes?
I suppose this question is for Colin. ;)
^ permalink raw reply
* Re: [flamebait] xdp Was: Re: bpf bounded loops. Was: [flamebait] xdp
From: Alexei Starovoitov @ 2016-12-06 3:05 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Tom Herbert, Thomas Graf, Linux Kernel Network Developers,
Daniel Borkmann, David S. Miller
In-Reply-To: <eee898ad-9a77-2987-c38d-984231139475@stressinduktion.org>
On Sun, Dec 04, 2016 at 05:05:28PM +0100, Hannes Frederic Sowa wrote:
>
> If one of those eBPF verifiers only accepts a certain number of INSN, as
> fundamental as backwards jumps, we might end up with two compiler?
two compilers? We already have five. There is gcc bpf backend (unmaintained)
and now lua, python and ply project can generate bpf code without llvm.
The kernel verifier has to become smarter. Right now it understands
only certain instruction patterns which caused all five bpf generators to
do extra work to satisfy the verifier. The solution is to do
data flow analysis using proper compiler techniques.
> program thinks). Ergo, more complexity. What do you do when one of those
> two systems fail? What is the reference data? What do you do if on a
> highly busy box during DoS constant reloading of your vmalloc happens (I
> don't know if it is a problem under DoS)?
ddos is one of the key use cases for xdp. If the system is about to oom
during ddos, it has to be fixed. The faster we move with xdp development
the sooner we will find and fix those issues.
And xdp being a core component of the linux kernel we will fix ddos
for the whole internet. Anyone going dpdk route are simply in
business of selling ddos protection with proprietary solutions.
> I tried to argue that someone wanting to build netmap/DPDK-alike things
> in XDP, one faces the problem of synchronized IPC. Hashmaps solve this
> to some degree but cannot be synchronized.
I don't see ipc as a problem and, yes, xdp is the best platform so far
to deliver packets to user space. I think that the dataplane-in-the-driver
is going to be faster than the fastest streaming to user space approach,
but we cannot rule one way or the other without trying multiple
approaches first and benchmarking them against each other.
So I very much in favor of Jesper's effort to deliver packets to user space.
> DPDK even can configure various hw offloads already before the kernel
> can do so.
that's a harsh lesson that the kernel needs to learn. Since people went
to dpdk to do hw offload it means it's our fault that we were not
accommodative and flexible enough to provide such frameworks within
the kernel. imo John's flow/match api should have been accepted
and it would have been solid building block towards such offloads.
> If users want to use those, they switch to DPDK also, as I
> have seen the industry always wanting the best performance. DPDK can use
> SIMD instructions, all AVX, SSE and MMX stuff, and they do it.
agree as well. The kernel needs to find a way to use all of these
fancy instructions where performance matters.
People who say "kernel cannot do simd" just didn't try hard enough.
> Debugging is harder but currently worked on. But will probably always be
> harder than simply using a debugger.
That's actually the important value proposition of xdp+bpf, since
non-working bpf program is not a concern for the kernel support team.
Unlike kernel modules that the kernel team needs to bless and support
in production, bpf programs are outside of that scope. They are part
of user space apps and part of user space responsibility.
> This all leads to gigantic user space control planes like neutron and
> others that just make everyone's life much harder. The model requires
> this. And that is what I fear.
the neutron is complex and fragile, since it's using bridges on
top of bridges with ebtables and ovs in the mix. Trying to manage
many different kernel technologies and a mix of smaller control planes
by this mega control plane is not an easy task.
> I am not at all that negative against a hook before allocating the
> packet, but making everyone using it and marketing as an alternative to
> DPDK doesn't seem to fit for me.
I don't see developers that are forced to use xdp. I see developers
that are eager to use xdp as soon as support for it is available
in their nics. Those like maglev who developed their own bypass
are not going to use dpdk and people who already using dpdk are
not going to switch to xdp, but there are lots of others who
welcome xdp with open arms.
Thanks
^ permalink raw reply
* Re: [PATCH] virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()
From: Michael S. Tsirkin @ 2016-12-06 3:00 UTC (permalink / raw)
To: Andy Lutomirski
Cc: netdev, linux-kernel, virtualization, Jason Wang, Laura Abbott
In-Reply-To: <fe889e578d5dffa9ae0834b449a35fcfd1e10694.1480990173.git.luto@kernel.org>
On Mon, Dec 05, 2016 at 06:10:58PM -0800, Andy Lutomirski wrote:
> With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
> pointer to the stack and it will OOPS. Copy the address to the heap
> to prevent the crash.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Laura Abbott <labbott@redhat.com>
> Reported-by: zbyszek@in.waw.pl
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> Very lightly tested.
>
> drivers/net/virtio_net.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7276d5a95bd0..cbf1c613c67a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -969,12 +969,17 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> struct virtnet_info *vi = netdev_priv(dev);
> struct virtio_device *vdev = vi->vdev;
> int ret;
> - struct sockaddr *addr = p;
> + struct sockaddr *addr;
> struct scatterlist sg;
>
> - ret = eth_prepare_mac_addr_change(dev, p);
> + addr = kmalloc(sizeof(*addr), GFP_KERNEL);
> + if (!addr)
> + return -ENOMEM;
> + memcpy(addr, p, sizeof(*addr));
> +
> + ret = eth_prepare_mac_addr_change(dev, addr);
> if (ret)
> - return ret;
> + goto out;
>
> if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
> sg_init_one(&sg, addr->sa_data, dev->addr_len);
> @@ -982,7 +987,8 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> VIRTIO_NET_CTRL_MAC_ADDR_SET, &sg)) {
> dev_warn(&vdev->dev,
> "Failed to set mac address by vq command.\n");
> - return -EINVAL;
> + ret = -EINVAL;
> + goto out;
> }
> } else if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC) &&
> !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> @@ -996,8 +1002,11 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> }
>
> eth_commit_mac_addr_change(dev, p);
> + ret = 0;
>
> - return 0;
> +out:
> + kfree(addr);
> + return ret;
> }
>
> static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
> --
> 2.9.3
^ permalink raw reply
* Re: [PATCH] virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()
From: Jason Wang @ 2016-12-06 2:30 UTC (permalink / raw)
To: Andy Lutomirski, netdev
Cc: Laura Abbott, Michael S . Tsirkin, linux-kernel, virtualization
In-Reply-To: <fe889e578d5dffa9ae0834b449a35fcfd1e10694.1480990173.git.luto@kernel.org>
On 2016年12月06日 10:10, Andy Lutomirski wrote:
> With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
> pointer to the stack and it will OOPS. Copy the address to the heap
> to prevent the crash.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Laura Abbott <labbott@redhat.com>
> Reported-by: zbyszek@in.waw.pl
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>
> Very lightly tested.
>
> drivers/net/virtio_net.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7276d5a95bd0..cbf1c613c67a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -969,12 +969,17 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> struct virtnet_info *vi = netdev_priv(dev);
> struct virtio_device *vdev = vi->vdev;
> int ret;
> - struct sockaddr *addr = p;
> + struct sockaddr *addr;
> struct scatterlist sg;
>
> - ret = eth_prepare_mac_addr_change(dev, p);
> + addr = kmalloc(sizeof(*addr), GFP_KERNEL);
> + if (!addr)
> + return -ENOMEM;
> + memcpy(addr, p, sizeof(*addr));
> +
> + ret = eth_prepare_mac_addr_change(dev, addr);
> if (ret)
> - return ret;
> + goto out;
>
> if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
> sg_init_one(&sg, addr->sa_data, dev->addr_len);
> @@ -982,7 +987,8 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> VIRTIO_NET_CTRL_MAC_ADDR_SET, &sg)) {
> dev_warn(&vdev->dev,
> "Failed to set mac address by vq command.\n");
> - return -EINVAL;
> + ret = -EINVAL;
> + goto out;
> }
> } else if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC) &&
> !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> @@ -996,8 +1002,11 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> }
>
> eth_commit_mac_addr_change(dev, p);
> + ret = 0;
>
> - return 0;
> +out:
> + kfree(addr);
> + return ret;
> }
>
> static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
Acked-by: Jason Wang <jasowang@redhat.com>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH next] Revert "dctcp: update cwnd on congestion event"
From: Neal Cardwell @ 2016-12-06 2:11 UTC (permalink / raw)
To: Florian Westphal; +Cc: Netdev
In-Reply-To: <1480980180-23349-1-git-send-email-fw@strlen.de>
On Mon, Dec 5, 2016 at 6:23 PM, Florian Westphal <fw@strlen.de> wrote:
> Neal Cardwell says:
> If I am reading the code correctly, then I would have two concerns:
> 1) Has that been tested? That seems like an extremely dramatic
> decrease in cwnd. For example, if the cwnd is 80, and there are 40
> ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
> calculations seem to suggest that after just 11 ACKs the cwnd would be
> down to a minimal value of 2 [..]
> 2) That seems to contradict another passage in the draft [..] where it
> sazs:
> Just as specified in [RFC3168], DCTCP does not react to congestion
> indications more than once for every window of data.
>
> Neal is right. Fortunately we don't have to complicate this by testing
> vs. current rtt estimate, we can just revert the patch.
>
> Normal stack already handles this for us: receiving ACKs with ECE
> set causes a call to tcp_enter_cwr(), from there on the ssthresh gets
> adjusted and prr will take care of cwnd adjustment.
>
> Fixes: 4780566784b396 ("dctcp: update cwnd on congestion event")
> Cc: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
Acked-by: Neal Cardwell <ncardwell@google.com>
Looks good to me. :-)
Thanks,
neal
^ permalink raw reply
* [PATCH] virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()
From: Andy Lutomirski @ 2016-12-06 2:10 UTC (permalink / raw)
To: netdev
Cc: Michael S . Tsirkin, linux-kernel, virtualization,
Andy Lutomirski, Laura Abbott
With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
pointer to the stack and it will OOPS. Copy the address to the heap
to prevent the crash.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Reported-by: zbyszek@in.waw.pl
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
Very lightly tested.
drivers/net/virtio_net.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7276d5a95bd0..cbf1c613c67a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -969,12 +969,17 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
struct virtnet_info *vi = netdev_priv(dev);
struct virtio_device *vdev = vi->vdev;
int ret;
- struct sockaddr *addr = p;
+ struct sockaddr *addr;
struct scatterlist sg;
- ret = eth_prepare_mac_addr_change(dev, p);
+ addr = kmalloc(sizeof(*addr), GFP_KERNEL);
+ if (!addr)
+ return -ENOMEM;
+ memcpy(addr, p, sizeof(*addr));
+
+ ret = eth_prepare_mac_addr_change(dev, addr);
if (ret)
- return ret;
+ goto out;
if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
sg_init_one(&sg, addr->sa_data, dev->addr_len);
@@ -982,7 +987,8 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
VIRTIO_NET_CTRL_MAC_ADDR_SET, &sg)) {
dev_warn(&vdev->dev,
"Failed to set mac address by vq command.\n");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out;
}
} else if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC) &&
!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
@@ -996,8 +1002,11 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
}
eth_commit_mac_addr_change(dev, p);
+ ret = 0;
- return 0;
+out:
+ kfree(addr);
+ return ret;
}
static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
--
2.9.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox