Netdev List


* [RFC PATCH 0/3] net: Alloc NAPI page frags from their own pool
From: Alexander Duyck @ 2014-11-27  0:05 UTC (permalink / raw)
  To: netdev; +Cc: davem, brouer, jeffrey.t.kirsher, eric.dumazet, ast

This patch series implements a means of allocating page fragments without
the need for the local_irq_save/restore in __netdev_alloc_frag.  By doing
this I am able to decrease packet processing time by 11ns per packet in my
test environment.
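
To illustrate the idea only, here is a minimal userspace sketch, not the code from this series. All names in it (`frag_cache`, `alloc_frag_core()`, `netdev_alloc_frag_model()`, `napi_alloc_frag_model()`, `irq_toggles`) are hypothetical, and the interrupt masking is modeled by a simple counter. It assumes a fixed-size page and does no refill.

```c
#include <assert.h>
#include <stddef.h>

#define FRAG_PAGE_SIZE 4096

struct frag_cache {
	unsigned char page[FRAG_PAGE_SIZE];
	size_t offset;                 /* next free byte in the page */
};

static struct frag_cache netdev_cache; /* may be used from any context */
static struct frag_cache napi_cache;   /* only used from NAPI polling */
static unsigned long irq_toggles;      /* counts modeled irq save/restore pairs */

/* Shared core: carve fragsz bytes out of the cache, NULL when exhausted. */
static void *alloc_frag_core(struct frag_cache *c, size_t fragsz)
{
	void *p;

	assert(c != NULL);
	if (fragsz > FRAG_PAGE_SIZE - c->offset)
		return NULL;
	p = c->page + c->offset;
	c->offset += fragsz;
	return p;
}

/* Generic path: the cache is shared with IRQ context, so guard it. */
static void *netdev_alloc_frag_model(size_t fragsz)
{
	void *p;

	irq_toggles++;                 /* stands in for local_irq_save() */
	p = alloc_frag_core(&netdev_cache, fragsz);
	/* local_irq_restore() would go here */
	return p;
}

/* NAPI path: separate pool, so no interrupt masking is modeled. */
static void *napi_alloc_frag_model(size_t fragsz)
{
	return alloc_frag_core(&napi_cache, fragsz);
}
```

The point of the sketch is that the two paths share the allocation core but not the pool, so only the generic path pays for the save/restore pair.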

---

Alexander Duyck (3):
      net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
      net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb
      fm10k/igb/ixgbe: Use napi_alloc_skb


 drivers/net/ethernet/intel/fm10k/fm10k_main.c |    4 -
 drivers/net/ethernet/intel/igb/igb_main.c     |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 -
 include/linux/skbuff.h                        |   11 ++
 net/core/dev.c                                |    2 
 net/core/skbuff.c                             |  160 ++++++++++++++++++-------
 6 files changed, 133 insertions(+), 51 deletions(-)

--

^ permalink raw reply

* Re: [patch net-next v3 04/17] net: introduce generic switch devices support
From: Jamal Hadi Salim @ 2014-11-26 23:32 UTC (permalink / raw)
  To: Thomas Graf, Jiri Pirko
  Cc: Scott Feldman, Netdev, David S. Miller, nhorman@tuxdriver.com,
	Andy Gospodarek, dborkman@redhat.com, ogerlitz@mellanox.com,
	jesse@nicira.com, pshelar@nicira.com, azhou@nicira.com,
	ben@decadent.org.uk, stephen@networkplumber.org,
	Kirsher, Jeffrey T, vyasevic@redhat.com, Cong Wang,
	Fastabend, John R, Eric Dumazet, Florian Fainelli, Roopa Prabhu,
	John Linville, "jasowang@redhat.c
In-Reply-To: <20141126215034.GA32174@casper.infradead.org>

On 11/26/14 16:50, Thomas Graf wrote:

> You are requesting a name change for a proprietary driver after
> confirming that you can't publish the code. We don't even know what
> the piece of hardware you refer to is capable of.
>

I am not sure why there is such a misunderstanding. Here's the
sequence of events.

Jiri/Scott: We'll call this offload thing hanging off a port_ops
a "switch". It does one or more of L2, L3 and flows.
Jamal: I am not fond of that name because not everything that offloads
off a port is a switch (some mention of fitting even with dpdk)
Jiri: What do you have - an L3 "switch"?
Jamal: No, it is something that does offloading of packet processing off
a port with flows and actions. For example, a Netronome would be a good fit
(if you are to ignore Simon going for OVS).

And then things get out of control. This has *nothing* to do with any
driver or any code or anything specialized.
Not every packet processing offload hanging off ports is a switch (I
don't think even the patch was claiming that although by now I've lost
track of where it started).

Yes, I cannot publish this code. You know that; Scott knows that and
Jiri knows. (And that's why I thought it passive aggressive when Scott
asked about the code when we are discussing a name change.)
The reason I am even involved in all this is so we can actually
publish code and I can stop using proprietary SDK stuff.
While I can't release the current code, I want to share my experiences
in trying to help make that API sane. Because I want to use it.
I have been doing this offload shit for at least 15 years on Linux.
I have something to say about it. Just throwing down some gauntlet
when it serves some convenience and treating me like some guy who
showed up off the street making claims is bordering on the ridiculous.

cheers,
jamal

^ permalink raw reply

* [PATCH net] bond: Check length of IFLA_BOND_ARP_IP_TARGET attributes
From: Thomas Graf @ 2014-11-26 23:22 UTC (permalink / raw)
  To: davem; +Cc: netdev, Scott Feldman

Fixes: 7f28fa10 ("bonding: add arp_ip_target netlink support")
Reported-by: John Fastabend <john.fastabend@gmail.com>
Cc: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 drivers/net/bonding/bond_netlink.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index c13d83e..45f09a6 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -225,7 +225,12 @@ static int bond_changelink(struct net_device *bond_dev,
 
 		bond_option_arp_ip_targets_clear(bond);
 		nla_for_each_nested(attr, data[IFLA_BOND_ARP_IP_TARGET], rem) {
-			__be32 target = nla_get_be32(attr);
+			__be32 target;
+
+			if (nla_len(attr) < sizeof(target))
+				return -EINVAL;
+
+			target = nla_get_be32(attr);
 
 			bond_opt_initval(&newval, (__force u64)target);
 			err = __bond_opt_set(bond, BOND_OPT_ARP_TARGETS,
-- 
1.9.3
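
The check added by this patch can be illustrated outside the kernel. The following is a minimal userspace sketch, not the bonding code: `struct attr` and `attr_get_be32()` are hypothetical stand-ins for `struct nlattr` and `nla_get_be32()`, and only show why the payload length is compared against `sizeof(target)` before the payload is read.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a netlink attribute: a length plus payload. */
struct attr {
	uint16_t len;              /* payload length in bytes */
	const unsigned char *data; /* payload supplied by userspace */
};

/* Read a 32-bit target only if the payload is large enough. */
static int attr_get_be32(const struct attr *a, uint32_t *target)
{
	assert(a != NULL && target != NULL);
	if (a->len < sizeof(*target))
		return -EINVAL;    /* short attribute: reject, do not read */

	memcpy(target, a->data, sizeof(*target));
	return 0;
}
```

Without the length test, a two-byte attribute would cause a four-byte read past the end of the payload.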

^ permalink raw reply related

* Re: [PATCH 0/5 net] bridge: Fix missing Netlink message validations
From: Thomas Graf @ 2014-11-26 23:14 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jiri Pirko, davem, stephen, netdev
In-Reply-To: <54760D1D.3070201@gmail.com>

On 11/26/14 at 09:25am, John Fastabend wrote:
> >--- a/net/ipv4/devinet.c
> >+++ b/net/ipv4/devinet.c
> >@@ -1687,8 +1687,11 @@ static int inet_set_link_af(struct net_device *dev, const struct nlattr *nla)
> >                BUG();
> >
> >        if (tb[IFLA_INET_CONF]) {
> >-               nla_for_each_nested(a, tb[IFLA_INET_CONF], rem)
> >+               nla_for_each_nested(a, tb[IFLA_INET_CONF], rem) {
> >+                       if (nla_len(a) < sizeof(u32))
> >+                               return -EINVAL;
> >                        ipv4_devconf_set(in_dev, nla_type(a), nla_get_u32(a));
> >+               }

Looked into this and found a validation function
inet_validate_link_af(). It's split to keep the updates atomic.
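
The split follows a validate-then-apply pattern. Here is a minimal userspace sketch under assumed names (`struct attr`, `validate_all()`, `apply_all()`, `set_link()`; none of these are kernel functions): every attribute is checked first, so a malformed attribute late in the list cannot leave earlier ones already applied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NCONF 4

struct attr {
	uint16_t type;             /* index into the config array */
	uint16_t len;              /* payload length in bytes */
	const unsigned char *data;
};

/* Pass 1: reject the whole request if any attribute is malformed. */
static int validate_all(const struct attr *attrs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (attrs[i].len < sizeof(uint32_t) || attrs[i].type >= NCONF)
			return -EINVAL;
	}
	return 0;
}

/* Pass 2: cannot fail, so the config is never left half-updated. */
static void apply_all(uint32_t *conf, const struct attr *attrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		memcpy(&conf[attrs[i].type], attrs[i].data, sizeof(uint32_t));
}

static int set_link(uint32_t *conf, const struct attr *attrs, size_t n)
{
	int err;

	assert(conf != NULL && attrs != NULL);
	err = validate_all(attrs, n);
	if (err)
		return err;
	apply_all(conf, attrs, n);
	return 0;
}
```

Checking inside the apply loop instead would return -EINVAL after some values had already been written.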

^ permalink raw reply

* Re: [PATCH rfc] packet: zerocopy packet_snd
From: Willem de Bruijn @ 2014-11-26 23:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Network Development, David Miller, Eric Dumazet, Daniel Borkmann
In-Reply-To: <20141126212038.GA12118@redhat.com>

On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>> > The main problem with zero copy ATM is with queueing disciplines
>> > which might keep the socket around essentially forever.
>> > The case was described here:
>> > https://lkml.org/lkml/2014/1/17/105
>> > and of course this will make it more serious now that
>> > more applications will be able to do this, so
>> > chances that an administrator enables this
>> > are higher.
>>
>> The denial of service issue raised there, that a single queue can
>> block an entire virtio-net device, is less problematic in the case of
>> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>> application can increase the limit or use separate sockets for
>> separate flows.
>
> Sounds like this interface is very hard to use correctly.

Actually, this socket alloc issue is the same for zerocopy and
non-zerocopy. Packets can be held in deep queues at which point
the packet socket is blocked. This is accepted behavior.

From the above thread:

"It's ok for non-zerocopy packet to be blocked since VM1 thought the
packets has been sent instead of pending in the virtqueue. So VM1 can
still send packet to other destination."

This is very specific to virtio and vhost-net. I don't think that that
concern applies to a packet interface.

Another issue, though, is that the method currently really only helps
TSO because all other paths cause a deep copy. There are more use
cases once it can send up to 64KB MTU over loopback or send out
GSO datagrams without triggering skb_copy_ubufs. I have not looked
into how (or if) that can be achieved yet.

>
>> > One possible solution is some kind of timer orphaning frags
>> > for skbs that have been around for too long.
>>
>> Perhaps this can be approximated without an explicit timer by calling
>> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>
> Not sure.  I'll have to see that patch to judge.
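
As a rough illustration of the threshold idea, here is a userspace sketch only, not a patch. `struct pkt`, `struct queue`, `enqueue()` and `pkt_copy_ubufs()` are hypothetical; the last is merely an analogue of what skb_copy_ubufs does, and `threshold` is an assumed tunable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct pkt {
	const char *data;   /* points at user memory while zerocopy is set */
	char *copy;         /* private copy, owned by the packet */
	size_t len;
	bool zerocopy;
	bool user_released; /* set once the user buffer may be reused */
};

struct queue {
	size_t qlen;
	size_t threshold;   /* hypothetical tunable */
};

/* Hypothetical analogue of skb_copy_ubufs(): copy, then release user memory. */
static int pkt_copy_ubufs(struct pkt *p)
{
	p->copy = malloc(p->len ? p->len : 1);
	if (!p->copy)
		return -1;
	memcpy(p->copy, p->data, p->len);
	p->data = p->copy;
	p->zerocopy = false;
	p->user_released = true;
	return 0;
}

/* Enqueue; if this packet would push qlen past the threshold, stop pinning. */
static int enqueue(struct queue *q, struct pkt *p)
{
	assert(q != NULL && p != NULL);
	if (p->zerocopy && q->qlen >= q->threshold && pkt_copy_ubufs(p))
		return -1;
	q->qlen++;
	return 0;
}
```

With this shape, packets in a shallow queue stay zerocopy, while packets entering a deep queue are copied so the user buffer is released right away.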
>
>> >
>> >> ---
>> >>  include/linux/skbuff.h        |   1 +
>> >>  include/linux/socket.h        |   1 +
>> >>  include/uapi/linux/errqueue.h |   1 +
>> >>  net/packet/af_packet.c        | 110 ++++++++++++++++++++++++++++++++++++++----
>> >>  4 files changed, 103 insertions(+), 10 deletions(-)
>> >>
>> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> >> index 78c299f..8e661d2 100644
>> >> --- a/include/linux/skbuff.h
>> >> +++ b/include/linux/skbuff.h
>> >> @@ -295,6 +295,7 @@ struct ubuf_info {
>> >>       void (*callback)(struct ubuf_info *, bool zerocopy_success);
>> >>       void *ctx;
>> >>       unsigned long desc;
>> >> +     void *callback_priv;
>> >>  };
>> >>
>> >>  /* This data is invariant across clones and lives at
>> >> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> >> index de52228..0a2e0ea 100644
>> >> --- a/include/linux/socket.h
>> >> +++ b/include/linux/socket.h
>> >> @@ -265,6 +265,7 @@ struct ucred {
>> >>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>> >>  #define MSG_EOF         MSG_FIN
>> >>
>> >> +#define MSG_ZEROCOPY 0x4000000       /* Pin user pages */
>> >>  #define MSG_FASTOPEN 0x20000000      /* Send data in TCP SYN */
>> >>  #define MSG_CMSG_CLOEXEC 0x40000000  /* Set close_on_exec for file
>> >>                                          descriptor received through
>> >> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
>> >> index 07bdce1..61bf318 100644
>> >> --- a/include/uapi/linux/errqueue.h
>> >> +++ b/include/uapi/linux/errqueue.h
>> >> @@ -19,6 +19,7 @@ struct sock_extended_err {
>> >>  #define SO_EE_ORIGIN_ICMP6   3
>> >>  #define SO_EE_ORIGIN_TXSTATUS        4
>> >>  #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
>> >> +#define SO_EE_ORIGIN_UPAGE   5
>> >>
>> >>  #define SO_EE_OFFENDER(ee)   ((struct sockaddr*)((ee)+1))
>> >>
>> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> >> index 58af5802..367c23a 100644
>> >> --- a/net/packet/af_packet.c
>> >> +++ b/net/packet/af_packet.c
>> >> @@ -2370,28 +2370,112 @@ out:
>> >>
>> >>  static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
>> >>                                       size_t reserve, size_t len,
>> >> -                                     size_t linear, int noblock,
>> >> +                                     size_t linear, int flags,
>> >>                                       int *err)
>> >>  {
>> >>       struct sk_buff *skb;
>> >> +     size_t data_len;
>> >>
>> >> -     /* Under a page?  Don't bother with paged skb. */
>> >> -     if (prepad + len < PAGE_SIZE || !linear)
>> >> -             linear = len;
>> >> +     if (flags & MSG_ZEROCOPY) {
>> >> +             /* Minimize linear, but respect header lower bound */
>> >> +             linear = min(len, max_t(size_t, linear, MAX_HEADER));
>> >> +             data_len = 0;
>> >> +     } else {
>> >> +             /* Under a page? Don't bother with paged skb. */
>> >> +             if (prepad + len < PAGE_SIZE || !linear)
>> >> +                     linear = len;
>> >> +             data_len = len - linear;
>> >> +     }
>> >>
>> >> -     skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
>> >> -                                err, 0);
>> >> +     skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
>> >> +                                flags & MSG_DONTWAIT, err, 0);
>> >>       if (!skb)
>> >>               return NULL;
>> >>
>> >>       skb_reserve(skb, reserve);
>> >>       skb_put(skb, linear);
>> >> -     skb->data_len = len - linear;
>> >> -     skb->len += len - linear;
>> >> +     skb->data_len = data_len;
>> >> +     skb->len += data_len;
>> >>
>> >>       return skb;
>> >>  }
>> >>
>> >> +/* release zerocopy resources and avoid destructor callback on kfree */
>> >> +static void packet_snd_zerocopy_free(struct sk_buff *skb)
>> >> +{
>> >> +     struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
>> >> +
>> >> +     if (uarg) {
>> >> +             kfree_skb(uarg->callback_priv);
>> >> +             sock_put((struct sock *) uarg->ctx);
>> >> +             kfree(uarg);
>> >> +
>> >> +             skb_shinfo(skb)->destructor_arg = NULL;
>> >> +             skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>> >> +     }
>> >> +}
>> >> +
>> >> +static void packet_snd_zerocopy_callback(struct ubuf_info *uarg,
>> >> +                                      bool zerocopy_success)
>> >> +{
>> >> +     struct sk_buff *err_skb;
>> >> +     struct sock *err_sk;
>> >> +     struct sock_exterr_skb *serr;
>> >> +
>> >> +     err_sk = uarg->ctx;
>> >> +     err_skb = uarg->callback_priv;
>> >> +
>> >> +     serr = SKB_EXT_ERR(err_skb);
>> >> +     memset(serr, 0, sizeof(*serr));
>> >> +     serr->ee.ee_errno = ENOMSG;
>> >> +     serr->ee.ee_origin = SO_EE_ORIGIN_UPAGE;
>> >> +     serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;
>> >> +     serr->ee.ee_info = ((u64) uarg->desc) >> 32;
>> >> +     if (sock_queue_err_skb(err_sk, err_skb))
>> >> +             kfree_skb(err_skb);
>> >> +
>> >> +     kfree(uarg);
>> >> +     sock_put(err_sk);
>> >> +}
>> >> +
>> >> +static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
>> >> +                                      struct msghdr *msg)
>> >> +{
>> >> +     struct ubuf_info *uarg = NULL;
>> >> +     int ret;
>> >> +
>> >> +     if (iov_pages(msg->msg_iov, 0, msg->msg_iovlen) > MAX_SKB_FRAGS)
>> >> +             return -EMSGSIZE;
>> >> +
>> >> +     uarg = kzalloc(sizeof(*uarg), GFP_KERNEL);
>> >> +     if (!uarg)
>> >> +             return -ENOMEM;
>> >> +
>> >> +     uarg->callback_priv = alloc_skb(0, GFP_KERNEL);
>> >> +     if (!uarg->callback_priv) {
>> >> +             kfree(uarg);
>> >> +             return -ENOMEM;
>> >> +     }
>> >> +
>> >> +     ret = zerocopy_sg_from_iovec(skb, msg->msg_iov, 0, msg->msg_iovlen);
>> >> +     if (ret) {
>> >> +             kfree_skb(uarg->callback_priv);
>> >> +             kfree(uarg);
>> >> +             return -EIO;
>> >> +     }
>> >> +
>> >> +     sock_hold(skb->sk);
>> >> +     uarg->ctx = skb->sk;
>> >> +     uarg->callback = packet_snd_zerocopy_callback;
>> >> +     uarg->desc = (unsigned long) msg->msg_iov[0].iov_base;
>> >> +
>> >> +     skb_shinfo(skb)->destructor_arg = uarg;
>> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
>> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>> >> +
>> >> +     return 0;
>> >> +}
>> >> +
>> >>  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>> >>  {
>> >>       struct sock *sk = sock->sk;
>> >> @@ -2408,6 +2492,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>> >>       unsigned short gso_type = 0;
>> >>       int hlen, tlen;
>> >>       int extra_len = 0;
>> >> +     bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
>> >>
>> >>       /*
>> >>        *      Get and verify the address.
>> >> @@ -2501,7 +2586,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>> >>       hlen = LL_RESERVED_SPACE(dev);
>> >>       tlen = dev->needed_tailroom;
>> >>       skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, vnet_hdr.hdr_len,
>> >> -                            msg->msg_flags & MSG_DONTWAIT, &err);
>> >> +                            msg->msg_flags, &err);
>> >>       if (skb == NULL)
>> >>               goto out_unlock;
>> >>
>> >> @@ -2518,7 +2603,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>> >>       }
>> >>
>> >>       /* Returns -EFAULT on error */
>> >> -     err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
>> >> +     if (zerocopy)
>> >> +             err = packet_zerocopy_sg_from_iovec(skb, msg);
>> >> +     else
>> >> +             err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov,
>> >> +                                                0, len);
>> >>       if (err)
>> >>               goto out_free;
>> >>
>> >> @@ -2578,6 +2667,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>> >>       return len;
>> >>
>> >>  out_free:
>> >> +     packet_snd_zerocopy_free(skb);
>> >>       kfree_skb(skb);
>> >>  out_unlock:
>> >>       if (dev)
>> >> --
>> >> 2.1.0.rc2.206.gedb03e5

^ permalink raw reply

* Re: 3.12.33 Bug with ipvs
From: Smart Weblications GmbH - Florian Wiessner @ 2014-11-26 22:38 UTC (permalink / raw)
  To: netdev
In-Reply-To: <54763E3F.4020306@smart-weblications.de>

On 26.11.2014 21:55, Smart Weblications GmbH - Florian Wiessner wrote:

> 
> I can't make sense of that output. I have now rebuilt the kernel with
> 
> CONFIG_IP_VS=m
> # CONFIG_IP_VS_IPV6 is not set
> # CONFIG_IP_VS_DEBUG is not set
> CONFIG_IP_VS_TAB_BITS=18
> CONFIG_IP_VS_PROTO_TCP=y
> CONFIG_IP_VS_PROTO_UDP=y
> # CONFIG_IP_VS_PROTO_AH_ESP is not set
> # CONFIG_IP_VS_PROTO_ESP is not set
> # CONFIG_IP_VS_PROTO_AH is not set
> # CONFIG_IP_VS_PROTO_SCTP is not set
> CONFIG_IP_VS_RR=m
> CONFIG_IP_VS_WRR=m
> CONFIG_IP_VS_LC=m
> CONFIG_IP_VS_WLC=m
> CONFIG_IP_VS_LBLC=m
> CONFIG_IP_VS_LBLCR=m
> CONFIG_IP_VS_DH=m
> CONFIG_IP_VS_SH=m
> CONFIG_IP_VS_SED=m
> CONFIG_IP_VS_NQ=m
> CONFIG_IP_VS_SH_TAB_BITS=12
> CONFIG_IP_VS_FTP=m
> CONFIG_IP_VS_NFCT=y
> CONFIG_IP_VS_PE_SIP=m
> 

Unfortunately the problem still exists even with IP_VS_IPV6 disabled.


-- 

Best regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register no.: HRB 3840, Amtsgericht Hof
*from German landlines; prices from mobile networks may differ

^ permalink raw reply

* [PATCH] add an empty ndo_poll_controller to veth to make bridges happy to support poll with veth devices attached
From: Smart Weblications GmbH - Florian Wiessner @ 2014-11-26 22:32 UTC (permalink / raw)
  To: netdev

Hi netdev,


What do I need to do to get these patches into the 3.10, 3.12 and 3.14 LTS kernels?

for kernels 3.10, 3.12:


This patch adds netpoll "support" to veth. As veth is a virtual device, there
is no need to actually support netpoll. We just need to tell the kernel that
veth supports it, so that netpoll keeps working on a bridge while veth devices
are attached to it. An example is the netconsole driver on a bridge.

Signed-off-by: Stefan Priebe <s.priebe@profihost.ag>
---
  drivers/net/veth.c | 9 +++++++++
  1 file changed, 9 insertions(+)
--- veth.c-orig 2014-11-26 23:30:26.104210917 +0100
+++ veth.c      2014-11-26 23:29:37.357444217 +0100
@@ -188,6 +188,12 @@
        return tot;
 }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void veth_poll_controller(struct net_device *dev)
+{
+}
+#endif
+
 static int veth_open(struct net_device *dev)
 {
        struct veth_priv *priv = netdev_priv(dev);
@@ -251,6 +257,9 @@
        .ndo_change_mtu      = veth_change_mtu,
        .ndo_get_stats64     = veth_get_stats64,
        .ndo_set_mac_address = eth_mac_addr,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+       .ndo_poll_controller = veth_poll_controller,
+#endif
 };

 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \



for kernels 3.14.25, 3.16.7:


This patch adds netpoll "support" to veth. As veth is a virtual device, there
is no need to actually support netpoll. We just need to tell the kernel that
veth supports it, so that netpoll keeps working on a bridge while veth devices
are attached to it. An example is the netconsole driver on a bridge.

Signed-off-by: Florian Wiessner <f.wiessner@smart-kvm.com>

--- veth.c.orig 2014-11-26 23:22:57.926041383 +0100
+++ veth.c      2014-11-26 23:23:42.584757995 +0100
@@ -188,6 +188,12 @@
        return tot;
 }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void veth_poll_controller(struct net_device *dev)
+{
+}
+#endif
+
 /* fake multicast ability */
 static void veth_set_multicast_list(struct net_device *dev)
 {
@@ -265,6 +271,9 @@
        .ndo_get_stats64     = veth_get_stats64,
        .ndo_set_rx_mode     = veth_set_multicast_list,
        .ndo_set_mac_address = eth_mac_addr,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+        .ndo_poll_controller    = veth_poll_controller,
+#endif
 };

 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |    \
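
To show why an empty handler is enough, here is a small userspace model of the behaviour described in the commit message. It is an assumption-laden sketch, not kernel code: `struct net_dev`, `struct net_dev_ops`, `veth_like_poll_controller()` and `bridge_netpoll_setup()` are hypothetical, and `ENOTSUP` stands in for whatever error the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct net_dev;

struct net_dev_ops {
	void (*poll_controller)(struct net_dev *dev); /* NULL if unsupported */
};

struct net_dev {
	const struct net_dev_ops *ops;
};

/* Empty handler: there is no hardware to poll, but the hook now exists. */
static void veth_like_poll_controller(struct net_dev *dev)
{
	(void)dev;
}

/* Hypothetical model: the bridge refuses netpoll if any port lacks the hook. */
static int bridge_netpoll_setup(struct net_dev *const *ports, size_t n)
{
	assert(ports != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!ports[i]->ops->poll_controller)
			return -ENOTSUP;
	}
	return 0;
}
```

In this model a single port without the hook makes setup fail for the whole bridge, which is the situation the patch avoids.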


-- 

Best regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register no.: HRB 3840, Amtsgericht Hof
*from German landlines; prices from mobile networks may differ

^ permalink raw reply

* Re: [PATCH net-next] bridge: add vlan id to mdb notifications
From: Thomas Graf @ 2014-11-26 21:53 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Stephen Hemminger, vyasevich, netdev, wkok, gospo, jtoppins,
	sashok
In-Reply-To: <54763AA6.4000804@cumulusnetworks.com>

On 11/26/14 at 12:40pm, Roopa Prabhu wrote:
> I have always wondered, if binary netlink attributes have this restriction,
> they should be discouraged. especially when the other extensible option is
> to add them as a separate netlink attribute.

This is pretty much exactly the reason why attributes have been
introduced ;-)

^ permalink raw reply

* Is this 32-bit NCM?
From: Enrico Mioso @ 2014-11-26 21:51 UTC (permalink / raw)
  To: Bjòrn Mork; +Cc: youtux, linux-usb, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 338 bytes --]

I am sorry for the poor capture, but I have not managed to get a better one
so far.
We modified the driver, but I wanted to be sure: is this 32-bit NCM?
If so, we might proceed with testing on the non-working device to see if this
was the problem, and then decide what to do next. Thank you.
Please CC us; we are not on the mailing list.

[-- Attachment #2: Type: APPLICATION/x-xz, Size: 859768 bytes --]

^ permalink raw reply

* Re: [patch net-next v3 04/17] net: introduce generic switch devices support
From: Thomas Graf @ 2014-11-26 21:50 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, Scott Feldman, Netdev, David S. Miller,
	nhorman@tuxdriver.com, Andy Gospodarek, dborkman@redhat.com,
	ogerlitz@mellanox.com, jesse@nicira.com, pshelar@nicira.com,
	azhou@nicira.com, ben@decadent.org.uk, stephen@networkplumber.org,
	Kirsher, Jeffrey T, vyasevic@redhat.com, Cong Wang,
	Fastabend, John R, Eric Dumazet, Florian Fainelli, Roopa Prabhu,
	John Linville <linvi
In-Reply-To: <20141126175944.GW1875@nanopsycho.orion>

On 11/26/14 at 06:59pm, Jiri Pirko wrote:
> Wed, Nov 26, 2014 at 06:09:13PM CET, jhs@mojatatu.com wrote:
> >On 11/26/14 11:08, Thomas Graf wrote:
> >>On 11/26/14 at 06:36am, Jamal Hadi Salim wrote:
> >>What is irritating in this context is that you are pushing back on
> >>Jiri and others while referring to proprietary and closed code which
> >>you are unwilling or unable to share. I don't see this as being
> >>passive aggressive, everybody is treated the same way in this regard.
> >
> >WTF? I said I have hardware that is not a switch because it doesn't
> >do switching. This all started with the name being "switch" which
> >I objected to. You ask me to describe hardware and then you come
> >back and say I am using that to stop progress?
> 
> Stay calm, I'm sure that this is just a misunderstanding.
> 
> >Where the hell did i push back on Jiri? Stop going around
> >telling people i do.

You are requesting a name change for a proprietary driver after
confirming that you can't publish the code. We don't even know what
the piece of hardware you refer to is capable of.

We've always written driver facing APIs for the drivers that are
*in* the kernel which in this case is rocker, modelled after OF-DPA,
existing NIC drivers, and DSA drivers.

I can live with the term switch, but if somebody can come up with a
better name, cool. "Chip" or "ASIC" are probably not better choices
though.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2014-11-26 21:48 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


Several small fixes here:

1) Don't crash in tg3 driver when the number of tx queues has
   been configured to be different from the number of rx queues.
   From Thadeu Lima de Souza Cascardo.

2) VLAN filter not disabled properly in promisc mode in ixgbe
   driver, from Vlad Yasevich.

3) Fix OOPS on dellink op in VTI tunnel driver, from Xin Long.

4) IPV6 GRE driver WCCP code checks skb->protocol for ETH_P_IP
   instead of ETH_P_IPV6, whoops.  From Yuri Chislov.

5) Socket matching in ping driver is buggy when packet AF does
   not match socket's AF.  Fix from Jane Zhou.

6) Fix checksum calculation errors in VXLAN due to where
   the udp_tunnel6_xmit_skb() helper gets its saddr/daddr from.
   From Alexander Duyck.

7) Fix 5G detection problem in rtlwifi driver, from Larry Finger.

8) Fix NULL deref in tcp_v{4,6}_send_reset, from Eric Dumazet.

9) Various missing netlink attribute verifications in bridging
   code, from Thomas Graf.

10) tcp_recvmsg() unconditionally calls ipv4 ip_recv_error
    even for ipv6 sockets, whoops.  Fix from Willem de Bruijn.

Please pull, thanks a lot!

The following changes since commit 8a84e01e147f44111988f9d8ccd2eaa30215a0f2:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2014-11-21 17:20:36 -0800)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

for you to fetch changes up to d1c637c51d87e021c12ed66baddf6cfbd11a3e2b:

  Merge tag 'master-2014-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless (2014-11-26 16:38:35 -0500)

----------------------------------------------------------------

Alexander Duyck (3):
      ipv6: Do not treat a GSO_TCPV4 request from UDP tunnel over IPv6 as invalid
      ip6_udp_tunnel: Fix checksum calculation
      vxlan: Fix boolean flip in VXLAN_F_UDP_ZERO_CSUM6_[TX|RX]

Alexey Khoroshilov (1):
      xen-netback: do not report success if backend_create_xenvif() fails

Andrew Lutomirski (1):
      net-timestamp: Fix a documentation typo

Carolyn Wyborny (1):
      igb: Fixes needed for surprise removal support

Daniel Borkmann (1):
      ixgbe: fix use after free adapter->state test in ixgbe_remove/ixgbe_probe

David S. Miller (4):
      Merge branch 'ipv6_vxlan_outer_udp_csum'
      Merge branch 'bcm_sf2'
      Merge branch 'bridge_nl_validation'
      Merge tag 'master-2014-11-25' of git://git.kernel.org/.../linville/wireless

Eric Dumazet (1):
      tcp: fix possible NULL dereference in tcp_vX_send_reset()

Florian Fainelli (2):
      net: dsa: bcm_sf2: fix unmapping registers in case of errors
      net: dsa: bcm_sf2: reset switch prior to initialization

Huacai Chen (1):
      stmmac: platform: fix default values of the filter bins setting

Jack Morgenstein (1):
      net/mlx4_core: Limit count field to 24 bits in qp_alloc_res

Jane Zhou (1):
      net/ping: handle protocol mismatching scenario

John W. Linville (1):
      Merge tag 'iwlwifi-for-john-2014-11-23' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Julia Lawall (1):
      solos-pci: fix error return code

Larry Finger (2):
      rtlwifi: rtl8821ae: Fix 5G detection problem
      rtlwifi: Change order in device startup

Luciano Coelho (1):
      iwlwifi: mvm: check TLV flag before trying to use hotspot firmware commands

Michael S. Tsirkin (1):
      af_packet: fix sparse warning

Pablo Neira (1):
      Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse"

Thadeu Lima de Souza Cascardo (1):
      tg3: fix ring init when there are more TX than RX channels

Thomas Graf (5):
      bridge: Validate IFLA_BRIDGE_FLAGS attribute length
      net: Validate IFLA_BRIDGE_MODE attribute length
      net: Check for presence of IFLA_AF_SPEC
      bridge: Add missing policy entry for IFLA_BRPORT_FAST_LEAVE
      bridge: Sanitize IFLA_EXT_MASK for AF_BRIDGE:RTM_GETLINK

Vlad Yasevich (1):
      ixgbe: Correctly disable VLAN filter in promiscuous mode

Willem de Bruijn (1):
      net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks

Yuri Chislov (1):
      ipv6: gre: fix wrong skb->protocol in WCCP

lucien (1):
      ip_tunnel: the lack of vti_link_ops' dellink() cause kernel panic

 Documentation/networking/timestamping.txt             |  2 +-
 drivers/atm/solos-pci.c                               |  2 ++
 drivers/net/dsa/bcm_sf2.c                             | 58 +++++++++++++++++++++++++++++----------------------
 drivers/net/ethernet/broadcom/tg3.c                   |  3 ++-
 drivers/net/ethernet/emulex/benet/be_main.c           |  5 +++++
 drivers/net/ethernet/intel/igb/igb_main.c             | 23 +++++++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c         | 17 +++++++++++----
 drivers/net/ethernet/mellanox/mlx4/resource_tracker.c |  2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 13 ++++++------
 drivers/net/vxlan.c                                   |  4 ++--
 drivers/net/wireless/iwlwifi/iwl-fw.h                 |  2 ++
 drivers/net/wireless/iwlwifi/mvm/mac80211.c           | 12 ++++++++---
 drivers/net/wireless/rtlwifi/pci.c                    | 20 +++++++++---------
 drivers/net/wireless/rtlwifi/rtl8821ae/hw.c           |  5 +++--
 drivers/net/xen-netback/xenbus.c                      | 15 +++++++------
 include/net/inet_common.h                             |  2 ++
 net/bridge/br_netlink.c                               |  1 +
 net/core/rtnetlink.c                                  | 23 +++++++++++++++-----
 net/ipv4/af_inet.c                                    | 11 ++++++++++
 net/ipv4/ip_vti.c                                     |  1 +
 net/ipv4/ping.c                                       | 14 ++++---------
 net/ipv4/tcp.c                                        |  2 +-
 net/ipv4/tcp_ipv4.c                                   |  5 ++++-
 net/ipv6/ip6_gre.c                                    |  4 ++--
 net/ipv6/ip6_offload.c                                |  3 ++-
 net/ipv6/ip6_udp_tunnel.c                             |  4 +---
 net/ipv6/ip6_vti.c                                    | 11 ++++++++++
 net/ipv6/tcp_ipv6.c                                   |  5 ++++-
 net/netfilter/nf_conntrack_core.c                     | 14 ++++++-------
 net/packet/af_packet.c                                |  2 +-
 30 files changed, 184 insertions(+), 101 deletions(-)

^ permalink raw reply

* Re: pull request: wireless 2014-11-26
From: David Miller @ 2014-11-26 21:39 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20141126194743.GA12284@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Wed, 26 Nov 2014 14:47:43 -0500

> Please pull this little batch of fixes intended for the 3.18 stream...
> 
> For the iwlwifi one, Emmanuel says:
> 
> "Not all the firmware know how to handle the HOT_SPOT_CMD.
> Make sure that the firmware will know this command before
> sending it. This avoids a firmware crash."
> 
> Along with that, Larry sends a pair of rtlwifi fixes to address some
> discrepancies from moving drivers out of staging.  Larry says:
> 
> "These two patches are needed to fix a regression introduced when
> driver rtl8821ae was moved from staging to the regular wireless tree."
> 
> Please let me know if there are problems!

Pulled John, thanks a lot!

^ permalink raw reply

* Re: [PATCH net] net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks
From: David Miller @ 2014-11-26 21:37 UTC (permalink / raw)
  To: willemb; +Cc: netdev, eric.dumazet, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <CA+FuTSdf=3cw7rGhAmPo+hh2qnn_GgPG3Xy8Er259_FxjTur3w@mail.gmail.com>

From: Willem de Bruijn <willemb@google.com>
Date: Wed, 26 Nov 2014 15:03:01 -0500

> On Wed, Nov 26, 2014 at 2:53 PM, Willem de Bruijn <willemb@google.com> wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets.
>> If the socket is of family AF_INET6, call ipv6_recv_error instead
>> of ip_recv_error.
>>
>> This change is more complex than a single branch due to the loadable
>> ipv6 module. It reuses a pre-existing indirect function call from
>> ping. The ping code is safe to call, because it is part of the core
>> ipv6 module and always present when AF_INET6 sockets are active.
> 
> I forgot to add:
> 
> Fixes: 4ed2d765 (net-timestamp: TCP timestamping)
> 
>> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied and queued up for -stable, thanks.

>> It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6)
>> to ip_recv_error.

Agreed.  These kinds of mistakes do happen, and the routines end up
essentially referencing garbage when they do.

Thanks again.
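
The family dispatch described in the commit message can be pictured
with a small user-space model. This is only a sketch under stated
assumptions: struct sock_model, ipv6_recv_error_hook and the model_*
functions are hypothetical names, and the real patch reuses an existing
indirect call from the ping code rather than adding a new hook.

```c
#include <stddef.h>

/* Numeric values of AF_INET and AF_INET6 on Linux. */
enum { FAM_INET = 2, FAM_INET6 = 10 };

/* Hypothetical stand-in for struct sock; only the family matters here. */
struct sock_model {
    int family;
};

typedef int (*recv_error_fn)(struct sock_model *sk);

/* Filled in when the (modelled) ipv6 module loads, NULL before that. */
static recv_error_fn ipv6_recv_error_hook;

/* IPv4 handler; returns -1 if handed a socket of the wrong family,
 * which is the situation the suggested WARN_ON_ONCE would flag. */
static int model_ip_recv_error(struct sock_model *sk)
{
    return sk->family == FAM_INET ? 4 : -1;
}

/* IPv6 handler, only reachable through the hook. */
static int model_ipv6_recv_error(struct sock_model *sk)
{
    (void)sk;
    return 6;
}

/* Pick the error-queue handler by socket family. */
static int model_tcp_recv_errqueue(struct sock_model *sk)
{
    if (sk->family == FAM_INET6 && ipv6_recv_error_hook)
        return ipv6_recv_error_hook(sk);
    return model_ip_recv_error(sk);
}
```

The indirection is what lets core TCP code reach the IPv6 handler
without a link-time dependency on the loadable module.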

^ permalink raw reply

* Re: [PATCH rfc] packet: zerocopy packet_snd
From: Michael S. Tsirkin @ 2014-11-26 21:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Eric Dumazet, Daniel Borkmann
In-Reply-To: <CA+FuTSdYB4rDMH3gAAMwaRdbqN58f_SxLz=C3fcCACw_KosGXw@mail.gmail.com>

On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> > The main problem with zero copy ATM is with queueing disciplines
> > which might keep the socket around essentially forever.
> > The case was described here:
> > https://lkml.org/lkml/2014/1/17/105
> > and of course this will make it more serious now that
> > more applications will be able to do this, so
> > chances that an administrator enables this
> > are higher.
> 
> The denial of service issue raised there, that a single queue can
> block an entire virtio-net device, is less problematic in the case of
> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> application can increase the limit or use separate sockets for
> separate flows.

Sounds like this interface is very hard to use correctly.

> > One possible solution is some kind of timer orphaning frags
> > for skbs that have been around for too long.
> 
> Perhaps this can be approximated without an explicit timer by calling
> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?

Not sure.  I'll have to see that patch to judge.
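
For the sake of discussion, the threshold idea could look roughly like
the user-space model below. Everything here is hypothetical: struct
pkt_model and struct queue_model stand in for an skb and a qdisc, and
pkt_copy_ubufs() only imitates what skb_copy_ubufs does (copy the data
so the user pages can be released). It is a sketch, not the patch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb whose payload may reference user pages. */
struct pkt_model {
    const char *data;   /* points into user memory while zerocopy is true */
    size_t len;
    bool zerocopy;
    char *owned;        /* private copy, set once the payload was copied */
};

/* Hypothetical queue: only its length and the copy threshold matter. */
struct queue_model {
    size_t qlen;
    size_t copy_threshold;
};

/* Copy the payload so the user buffer is no longer referenced. */
static int pkt_copy_ubufs(struct pkt_model *p)
{
    char *copy = malloc(p->len ? p->len : 1);

    if (!copy)
        return -1;
    memcpy(copy, p->data, p->len);
    p->owned = copy;
    p->data = copy;
    p->zerocopy = false;
    return 0;
}

/* Enqueue a packet; fall back to copying when the queue is already long. */
static int queue_enqueue(struct queue_model *q, struct pkt_model *p)
{
    if (p->zerocopy && q->qlen > q->copy_threshold) {
        if (pkt_copy_ubufs(p))
            return -1;
    }
    q->qlen++;
    return 0;
}
```

With a threshold of 1, the first two packets stay zerocopy and the
third one is copied on enqueue. This bounds how many user pages a long
queue can pin, but it does nothing for packets that were queued before
the queue grew.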

> >
> >> ---
> >>  include/linux/skbuff.h        |   1 +
> >>  include/linux/socket.h        |   1 +
> >>  include/uapi/linux/errqueue.h |   1 +
> >>  net/packet/af_packet.c        | 110 ++++++++++++++++++++++++++++++++++++++----
> >>  4 files changed, 103 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 78c299f..8e661d2 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -295,6 +295,7 @@ struct ubuf_info {
> >>       void (*callback)(struct ubuf_info *, bool zerocopy_success);
> >>       void *ctx;
> >>       unsigned long desc;
> >> +     void *callback_priv;
> >>  };
> >>
> >>  /* This data is invariant across clones and lives at
> >> diff --git a/include/linux/socket.h b/include/linux/socket.h
> >> index de52228..0a2e0ea 100644
> >> --- a/include/linux/socket.h
> >> +++ b/include/linux/socket.h
> >> @@ -265,6 +265,7 @@ struct ucred {
> >>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> >>  #define MSG_EOF         MSG_FIN
> >>
> >> +#define MSG_ZEROCOPY 0x4000000       /* Pin user pages */
> >>  #define MSG_FASTOPEN 0x20000000      /* Send data in TCP SYN */
> >>  #define MSG_CMSG_CLOEXEC 0x40000000  /* Set close_on_exec for file
> >>                                          descriptor received through
> >> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> >> index 07bdce1..61bf318 100644
> >> --- a/include/uapi/linux/errqueue.h
> >> +++ b/include/uapi/linux/errqueue.h
> >> @@ -19,6 +19,7 @@ struct sock_extended_err {
> >>  #define SO_EE_ORIGIN_ICMP6   3
> >>  #define SO_EE_ORIGIN_TXSTATUS        4
> >>  #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
> >> +#define SO_EE_ORIGIN_UPAGE   5
> >>
> >>  #define SO_EE_OFFENDER(ee)   ((struct sockaddr*)((ee)+1))
> >>
> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> >> index 58af5802..367c23a 100644
> >> --- a/net/packet/af_packet.c
> >> +++ b/net/packet/af_packet.c
> >> @@ -2370,28 +2370,112 @@ out:
> >>
> >>  static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
> >>                                       size_t reserve, size_t len,
> >> -                                     size_t linear, int noblock,
> >> +                                     size_t linear, int flags,
> >>                                       int *err)
> >>  {
> >>       struct sk_buff *skb;
> >> +     size_t data_len;
> >>
> >> -     /* Under a page?  Don't bother with paged skb. */
> >> -     if (prepad + len < PAGE_SIZE || !linear)
> >> -             linear = len;
> >> +     if (flags & MSG_ZEROCOPY) {
> >> +             /* Minimize linear, but respect header lower bound */
> >> +             linear = min(len, max_t(size_t, linear, MAX_HEADER));
> >> +             data_len = 0;
> >> +     } else {
> >> +             /* Under a page? Don't bother with paged skb. */
> >> +             if (prepad + len < PAGE_SIZE || !linear)
> >> +                     linear = len;
> >> +             data_len = len - linear;
> >> +     }
> >>
> >> -     skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> >> -                                err, 0);
> >> +     skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
> >> +                                flags & MSG_DONTWAIT, err, 0);
> >>       if (!skb)
> >>               return NULL;
> >>
> >>       skb_reserve(skb, reserve);
> >>       skb_put(skb, linear);
> >> -     skb->data_len = len - linear;
> >> -     skb->len += len - linear;
> >> +     skb->data_len = data_len;
> >> +     skb->len += data_len;
> >>
> >>       return skb;
> >>  }
> >>
> >> +/* release zerocopy resources and avoid destructor callback on kfree */
> >> +static void packet_snd_zerocopy_free(struct sk_buff *skb)
> >> +{
> >> +     struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> >> +
> >> +     if (uarg) {
> >> +             kfree_skb(uarg->callback_priv);
> >> +             sock_put((struct sock *) uarg->ctx);
> >> +             kfree(uarg);
> >> +
> >> +             skb_shinfo(skb)->destructor_arg = NULL;
> >> +             skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> >> +     }
> >> +}
> >> +
> >> +static void packet_snd_zerocopy_callback(struct ubuf_info *uarg,
> >> +                                      bool zerocopy_success)
> >> +{
> >> +     struct sk_buff *err_skb;
> >> +     struct sock *err_sk;
> >> +     struct sock_exterr_skb *serr;
> >> +
> >> +     err_sk = uarg->ctx;
> >> +     err_skb = uarg->callback_priv;
> >> +
> >> +     serr = SKB_EXT_ERR(err_skb);
> >> +     memset(serr, 0, sizeof(*serr));
> >> +     serr->ee.ee_errno = ENOMSG;
> >> +     serr->ee.ee_origin = SO_EE_ORIGIN_UPAGE;
> >> +     serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;
> >> +     serr->ee.ee_info = ((u64) uarg->desc) >> 32;
> >> +     if (sock_queue_err_skb(err_sk, err_skb))
> >> +             kfree_skb(err_skb);
> >> +
> >> +     kfree(uarg);
> >> +     sock_put(err_sk);
> >> +}
> >> +
> >> +static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
> >> +                                      struct msghdr *msg)
> >> +{
> >> +     struct ubuf_info *uarg = NULL;
> >> +     int ret;
> >> +
> >> +     if (iov_pages(msg->msg_iov, 0, msg->msg_iovlen) > MAX_SKB_FRAGS)
> >> +             return -EMSGSIZE;
> >> +
> >> +     uarg = kzalloc(sizeof(*uarg), GFP_KERNEL);
> >> +     if (!uarg)
> >> +             return -ENOMEM;
> >> +
> >> +     uarg->callback_priv = alloc_skb(0, GFP_KERNEL);
> >> +     if (!uarg->callback_priv) {
> >> +             kfree(uarg);
> >> +             return -ENOMEM;
> >> +     }
> >> +
> >> +     ret = zerocopy_sg_from_iovec(skb, msg->msg_iov, 0, msg->msg_iovlen);
> >> +     if (ret) {
> >> +             kfree_skb(uarg->callback_priv);
> >> +             kfree(uarg);
> >> +             return -EIO;
> >> +     }
> >> +
> >> +     sock_hold(skb->sk);
> >> +     uarg->ctx = skb->sk;
> >> +     uarg->callback = packet_snd_zerocopy_callback;
> >> +     uarg->desc = (unsigned long) msg->msg_iov[0].iov_base;
> >> +
> >> +     skb_shinfo(skb)->destructor_arg = uarg;
> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> >> +
> >> +     return 0;
> >> +}
> >> +
> >>  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>  {
> >>       struct sock *sk = sock->sk;
> >> @@ -2408,6 +2492,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       unsigned short gso_type = 0;
> >>       int hlen, tlen;
> >>       int extra_len = 0;
> >> +     bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
> >>
> >>       /*
> >>        *      Get and verify the address.
> >> @@ -2501,7 +2586,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       hlen = LL_RESERVED_SPACE(dev);
> >>       tlen = dev->needed_tailroom;
> >>       skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, vnet_hdr.hdr_len,
> >> -                            msg->msg_flags & MSG_DONTWAIT, &err);
> >> +                            msg->msg_flags, &err);
> >>       if (skb == NULL)
> >>               goto out_unlock;
> >>
> >> @@ -2518,7 +2603,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       }
> >>
> >>       /* Returns -EFAULT on error */
> >> -     err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
> >> +     if (zerocopy)
> >> +             err = packet_zerocopy_sg_from_iovec(skb, msg);
> >> +     else
> >> +             err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov,
> >> +                                                0, len);
> >>       if (err)
> >>               goto out_free;
> >>
> >> @@ -2578,6 +2667,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       return len;
> >>
> >>  out_free:
> >> +     packet_snd_zerocopy_free(skb);
> >>       kfree_skb(skb);
> >>  out_unlock:
> >>       if (dev)
> >> --
> >> 2.1.0.rc2.206.gedb03e5

^ permalink raw reply

* Re: [PATCH rfc] packet: zerocopy packet_snd
From: Michael S. Tsirkin @ 2014-11-26 21:17 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Eric Dumazet, Daniel Borkmann
In-Reply-To: <CA+FuTSdYB4rDMH3gAAMwaRdbqN58f_SxLz=C3fcCACw_KosGXw@mail.gmail.com>

On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> > The main problem with zero copy ATM is with queueing disciplines
> > which might keep the socket around essentially forever.
> > The case was described here:
> > https://lkml.org/lkml/2014/1/17/105
> > and of course this will make it more serious now that
> > more applications will be able to do this, so
> > chances that an administrator enables this
> > are higher.
> 
> The denial of service issue raised there, that a single queue can
> block an entire virtio-net device, is less problematic in the case of
> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> application can increase the limit or use separate sockets for
> separate flows.

Socket per flow? Maybe just use TCP then?  Increasing the limit
sounds like the wrong solution; it hurts security.

> > One possible solution is some kind of timer orphaning frags
> > for skbs that have been around for too long.
> 
> Perhaps this can be approximated without an explicit timer by calling
> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?

Hard to say. Will have to see that patch to judge how robust this is.


> >
> >> ---
> >>  include/linux/skbuff.h        |   1 +
> >>  include/linux/socket.h        |   1 +
> >>  include/uapi/linux/errqueue.h |   1 +
> >>  net/packet/af_packet.c        | 110 ++++++++++++++++++++++++++++++++++++++----
> >>  4 files changed, 103 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 78c299f..8e661d2 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -295,6 +295,7 @@ struct ubuf_info {
> >>       void (*callback)(struct ubuf_info *, bool zerocopy_success);
> >>       void *ctx;
> >>       unsigned long desc;
> >> +     void *callback_priv;
> >>  };
> >>
> >>  /* This data is invariant across clones and lives at
> >> diff --git a/include/linux/socket.h b/include/linux/socket.h
> >> index de52228..0a2e0ea 100644
> >> --- a/include/linux/socket.h
> >> +++ b/include/linux/socket.h
> >> @@ -265,6 +265,7 @@ struct ucred {
> >>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> >>  #define MSG_EOF         MSG_FIN
> >>
> >> +#define MSG_ZEROCOPY 0x4000000       /* Pin user pages */
> >>  #define MSG_FASTOPEN 0x20000000      /* Send data in TCP SYN */
> >>  #define MSG_CMSG_CLOEXEC 0x40000000  /* Set close_on_exec for file
> >>                                          descriptor received through
> >> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> >> index 07bdce1..61bf318 100644
> >> --- a/include/uapi/linux/errqueue.h
> >> +++ b/include/uapi/linux/errqueue.h
> >> @@ -19,6 +19,7 @@ struct sock_extended_err {
> >>  #define SO_EE_ORIGIN_ICMP6   3
> >>  #define SO_EE_ORIGIN_TXSTATUS        4
> >>  #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
> >> +#define SO_EE_ORIGIN_UPAGE   5
> >>
> >>  #define SO_EE_OFFENDER(ee)   ((struct sockaddr*)((ee)+1))
> >>
> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> >> index 58af5802..367c23a 100644
> >> --- a/net/packet/af_packet.c
> >> +++ b/net/packet/af_packet.c
> >> @@ -2370,28 +2370,112 @@ out:
> >>
> >>  static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
> >>                                       size_t reserve, size_t len,
> >> -                                     size_t linear, int noblock,
> >> +                                     size_t linear, int flags,
> >>                                       int *err)
> >>  {
> >>       struct sk_buff *skb;
> >> +     size_t data_len;
> >>
> >> -     /* Under a page?  Don't bother with paged skb. */
> >> -     if (prepad + len < PAGE_SIZE || !linear)
> >> -             linear = len;
> >> +     if (flags & MSG_ZEROCOPY) {
> >> +             /* Minimize linear, but respect header lower bound */
> >> +             linear = min(len, max_t(size_t, linear, MAX_HEADER));
> >> +             data_len = 0;
> >> +     } else {
> >> +             /* Under a page? Don't bother with paged skb. */
> >> +             if (prepad + len < PAGE_SIZE || !linear)
> >> +                     linear = len;
> >> +             data_len = len - linear;
> >> +     }
> >>
> >> -     skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> >> -                                err, 0);
> >> +     skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
> >> +                                flags & MSG_DONTWAIT, err, 0);
> >>       if (!skb)
> >>               return NULL;
> >>
> >>       skb_reserve(skb, reserve);
> >>       skb_put(skb, linear);
> >> -     skb->data_len = len - linear;
> >> -     skb->len += len - linear;
> >> +     skb->data_len = data_len;
> >> +     skb->len += data_len;
> >>
> >>       return skb;
> >>  }
> >>
> >> +/* release zerocopy resources and avoid destructor callback on kfree */
> >> +static void packet_snd_zerocopy_free(struct sk_buff *skb)
> >> +{
> >> +     struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> >> +
> >> +     if (uarg) {
> >> +             kfree_skb(uarg->callback_priv);
> >> +             sock_put((struct sock *) uarg->ctx);
> >> +             kfree(uarg);
> >> +
> >> +             skb_shinfo(skb)->destructor_arg = NULL;
> >> +             skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> >> +     }
> >> +}
> >> +
> >> +static void packet_snd_zerocopy_callback(struct ubuf_info *uarg,
> >> +                                      bool zerocopy_success)
> >> +{
> >> +     struct sk_buff *err_skb;
> >> +     struct sock *err_sk;
> >> +     struct sock_exterr_skb *serr;
> >> +
> >> +     err_sk = uarg->ctx;
> >> +     err_skb = uarg->callback_priv;
> >> +
> >> +     serr = SKB_EXT_ERR(err_skb);
> >> +     memset(serr, 0, sizeof(*serr));
> >> +     serr->ee.ee_errno = ENOMSG;
> >> +     serr->ee.ee_origin = SO_EE_ORIGIN_UPAGE;
> >> +     serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;
> >> +     serr->ee.ee_info = ((u64) uarg->desc) >> 32;
> >> +     if (sock_queue_err_skb(err_sk, err_skb))
> >> +             kfree_skb(err_skb);
> >> +
> >> +     kfree(uarg);
> >> +     sock_put(err_sk);
> >> +}
> >> +
> >> +static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
> >> +                                      struct msghdr *msg)
> >> +{
> >> +     struct ubuf_info *uarg = NULL;
> >> +     int ret;
> >> +
> >> +     if (iov_pages(msg->msg_iov, 0, msg->msg_iovlen) > MAX_SKB_FRAGS)
> >> +             return -EMSGSIZE;
> >> +
> >> +     uarg = kzalloc(sizeof(*uarg), GFP_KERNEL);
> >> +     if (!uarg)
> >> +             return -ENOMEM;
> >> +
> >> +     uarg->callback_priv = alloc_skb(0, GFP_KERNEL);
> >> +     if (!uarg->callback_priv) {
> >> +             kfree(uarg);
> >> +             return -ENOMEM;
> >> +     }
> >> +
> >> +     ret = zerocopy_sg_from_iovec(skb, msg->msg_iov, 0, msg->msg_iovlen);
> >> +     if (ret) {
> >> +             kfree_skb(uarg->callback_priv);
> >> +             kfree(uarg);
> >> +             return -EIO;
> >> +     }
> >> +
> >> +     sock_hold(skb->sk);
> >> +     uarg->ctx = skb->sk;
> >> +     uarg->callback = packet_snd_zerocopy_callback;
> >> +     uarg->desc = (unsigned long) msg->msg_iov[0].iov_base;
> >> +
> >> +     skb_shinfo(skb)->destructor_arg = uarg;
> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> >> +     skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> >> +
> >> +     return 0;
> >> +}
> >> +
> >>  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>  {
> >>       struct sock *sk = sock->sk;
> >> @@ -2408,6 +2492,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       unsigned short gso_type = 0;
> >>       int hlen, tlen;
> >>       int extra_len = 0;
> >> +     bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
> >>
> >>       /*
> >>        *      Get and verify the address.
> >> @@ -2501,7 +2586,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       hlen = LL_RESERVED_SPACE(dev);
> >>       tlen = dev->needed_tailroom;
> >>       skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, vnet_hdr.hdr_len,
> >> -                            msg->msg_flags & MSG_DONTWAIT, &err);
> >> +                            msg->msg_flags, &err);
> >>       if (skb == NULL)
> >>               goto out_unlock;
> >>
> >> @@ -2518,7 +2603,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       }
> >>
> >>       /* Returns -EFAULT on error */
> >> -     err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
> >> +     if (zerocopy)
> >> +             err = packet_zerocopy_sg_from_iovec(skb, msg);
> >> +     else
> >> +             err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov,
> >> +                                                0, len);
> >>       if (err)
> >>               goto out_free;
> >>
> >> @@ -2578,6 +2667,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
> >>       return len;
> >>
> >>  out_free:
> >> +     packet_snd_zerocopy_free(skb);
> >>       kfree_skb(skb);
> >>  out_unlock:
> >>       if (dev)
> >> --
> >> 2.1.0.rc2.206.gedb03e5

^ permalink raw reply

* Re: [PATCH rfc 1/4] net-timestamp: pull headers for SOCK_STREAM
From: Willem de Bruijn @ 2014-11-26 21:03 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: David Miller, Network Development, Richard Cochran
In-Reply-To: <CALCETrXHdSVC-YLsdjHxAp3rU2gGFZaz8NNgaTKYgjM-ah3z+A@mail.gmail.com>

On Tue, Nov 25, 2014 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 25, 2014 at 11:54 AM, David Miller <davem@davemloft.net> wrote:
>> From: Willem de Bruijn <willemb@google.com>
>> Date: Tue, 25 Nov 2014 14:52:00 -0500
>>
>>> On Tue, Nov 25, 2014 at 1:42 PM, David Miller <davem@davemloft.net> wrote:
>>>> From: Willem de Bruijn <willemb@google.com>
>>>> Date: Tue, 25 Nov 2014 12:58:03 -0500
>>>>
>>>> What's the harm in exposing the headers?  Either it's harmful, and
>>>> therefore doing so for UDP is bad too, or it's harmless and
>>>
>>> Headers may expose information not available otherwise. I don't
>>> immediately see critical problems, but that does not mean that they
>>> might not lurk there.
>>>
>>> We so far avoid exposing the sequence number, though keeping it hidden
>>> is more about third parties. More in general, unprivileged processes
>>> may start requesting timestamps only to learn tcp state that they
>>> should either get from tcpinfo or cannot currently get at all, likely
>>> for good reason. A far-fetched example is identifying admin iptables
>>> tos mangling rules by reading the tos bits at the driver layer. At least
>>> on my machine, iptables -L is privileged.
>>>
>>>> we should probably leave it alone to not risk breaking anyone.
>>>
>>> That's fair. I sent it for rfc first for that reason. I won't resubmit
>>> unless more serious concerns are raised.
>>
>> I just worry about the potential breakage.
>>
>> Your concerns are valid... I honestly don't know what we should do here.
>> Both choices have merit.
>
> Here's a scenario in which giving the headers might be dangerous:
>
> Suppose I create a network namespace that's designed to contain
> something, e.g. a Tor or Tor-like client, that shouldn't know any of
> its public addressing information.  I might assign something like a
> tunnel interface to the namespace, but, if the contained code can get
> lower-level headers, it might learn something that would identify the
> *other* end of the tunnel, which wouldn't be so good.  Admittedly,
> this would be just one of several things that would require care to
> get this right.

Network namespaces are an interesting case, indeed.

>
> Also, what happens if the output is transformed by ipsec?  Does the
> timestamp message show the ciphertext?
>
> TBH, I'd rather send no payload at all and have an scm message that
> the sender provides that specifies a cookie identifying the particular
> sent data.  But that ship mostly sailed awhile ago.
>
> For bytestreams, though, isn't this all new in 3.18?  Or am I off by a release.

It was added in 3.17. That is still very recent.

A third option, though hardly pretty, is to put display of headers
under administrator control. An application cannot easily infer whether
headers are stripped, and legacy applications do not even know to try.
So, this is a bit too crude:

+    if (sk->sk_protocol == IPPROTO_TCP && sysctl_net_blind_errqueue)
+        skb_pull(skb, skb_transport_offset(skb) + tcp_hdrlen(skb));
+    else if (sk->sk_protocol == IPPROTO_UDP && sysctl_net_blind_errqueue >= 2)
+        skb_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr));

An alternative is to add a timestamping option to skip headers (or
even full payload, basically
http://patchwork.ozlabs.org/patch/366967/) and give the administrator
a sysctl to drop all requests that do not pass this flag. The intent
is that future-proof applications will start requesting the flag and
relying on the ts counter. Hardened installations can set the sysctl
from the start, accepting possible breakage.
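
To make the alternative concrete, the policy could be modelled as
below. The flag name TS_OPT_NO_HEADERS and the sysctl parameter are
hypothetical placeholders, not existing kernel symbols; this is only a
sketch of the decision table.

```c
#include <stdbool.h>

/* Hypothetical per-socket timestamping option: "do not loop headers". */
#define TS_OPT_NO_HEADERS 0x1u

enum ts_verdict {
    TS_DROP,                /* request refused by administrator policy */
    TS_QUEUE_WITH_HEADERS,  /* legacy behaviour: headers are visible */
    TS_QUEUE_NO_HEADERS     /* headers stripped at the caller's request */
};

/*
 * Decide what to do with a timestamp request.
 * sysctl_require_no_headers models the proposed administrator knob.
 */
static enum ts_verdict ts_request_verdict(unsigned int opt_flags,
                                          bool sysctl_require_no_headers)
{
    if (opt_flags & TS_OPT_NO_HEADERS)
        return TS_QUEUE_NO_HEADERS;
    return sysctl_require_no_headers ? TS_DROP : TS_QUEUE_WITH_HEADERS;
}
```

Applications that pass the flag behave the same under either sysctl
setting, so they never have to guess whether headers were stripped.
Only legacy requests are affected by the hardened setting.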

^ permalink raw reply

* 3.12.33 Bug with ipvs
From: Smart Weblications GmbH - Florian Wiessner @ 2014-11-26 20:55 UTC (permalink / raw)
  To: netdev

Hi netdev,

On 3.12.33 I see this every 3 hours or so on a box running ip_vs, with a
setup that caused no problems on 3.10.40. Could someone give me hints on how
to debug this? It seems to happen instantly when I add ip_vs_ftp and have some
NAT rules. The setup is like this:

host connected to net with bond0 over eth0 eth1 (bonding mode6)
bond0 added to br0

running 5 lxc using veth on br0 as real servers to use for ipvs

We use the nets 10.10.1.0/24 and 10.10.0.0/24 on the lxc containers, 10.10.1.1
as gateway IP on the host, and the VIP bound to the host, so we do some
additional NAT:

iptables -t nat -A POSTROUTING -o br0 -s 10.10.0.0/24 -j SNAT --to 192.168.1.61
iptables -t nat -A POSTROUTING -o br0 -s 10.10.1.0/24 ! -d 192.168.1.0/26 -j SNAT --to 192.168.1.62

Then we set up additional NAT for passive FTP to a real server:
iptables -t nat -A PREROUTING -i br0 -d 192.168.1.62 -p tcp -m multiport --dports 64000:64444 -j DNAT --to 10.10.1.20

we also use ipv6 in the lxc container, but do not use any ip_vs ipv6 rules


[13230.422498] BUG: unable to handle kernel paging request at 00000000000600d0
[13230.422541] IP: [<ffffffff814ff2fc>] xfrm_selector_match+0x25/0x2f6
[13230.422577] PGD 57fb0d067 PUD 718403067 PMD 0
[13230.422682] Oops: 0000 [#1] SMP
[13230.422711] Modules linked in: ip6table_filter ip6_tables ebt_arp ebt_ip
ebtable_nat ebtables act_police cls_u32 sch_ingress arptable_filter arp_tables
netconsole xmand cpufreq_powersave cpufreq_conservative cpufreq_userspace
ocfs2_stack_o2cb ocfs2_dlm bridge stp llc bonding fuse nf_conntrack_ftp 8021q
openvswitch gre vxlan xt_collia_generic serpent_generic blowfish_generic
blowfish_common cast5_generic cast_common xcbc sha512_generic crypto_null af_key
psmouse serio_raw lpc_ich i2c_i801 mfd_c
[13230.423318] CPU: 6 PID: 18038 Comm: kvm.php Not tainted 3.12.33 #6
[13230.423348] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 1.1a 09/28/2011
[13230.423395] task: ffff88043803c680 ti: ffff880162836000 task.ti: ffff880162836000
[13230.423440] RIP: 0010:[<ffffffff814ff2fc>]  [<ffffffff814ff2fc>] xfrm_selector_match+0x25/0x2f6
[13230.423491] RSP: 0018:ffff88083fd83a68  EFLAGS: 00010246
[13230.423519] RAX: 0000000000000001 RBX: ffff88083fd83b88 RCX: ffff8804ce5c68c0
[13230.423549] RDX: 0000000000000002 RSI: ffff88083fd83b88 RDI: 00000000000600a6
[13230.423580] RBP: 00000000000600a6 R08: 0000000000000000 R09: ffff88083fd83b08
[13230.423611] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88083fd83b88
[13230.423641] R13: 0000000000000001 R14: ffffffff81812040 R15: ffffffffa01ab3b0
[13230.423672] FS:  00007f6fd48e4720(0000) GS:ffff88083fd80000(0000) knlGS:0000000000000000
[13230.423725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13230.423758] CR2: 00000000000600d0 CR3: 00000007188b1000 CR4: 00000000000407e0
[13230.423790] Stack:
[13230.423817]  0000000000000000 0000000000060002 ffff8804ce5c68c0 ffff88083fd83b88
[13230.423877]  0000000000000001 ffffffff814ff611 0000000000000000 ffff8800907be740
[13230.423935]  ffff88043803c680 ffffffff81812040 000000003c9041bc ffffffff814ffa8c
[13230.423992] Call Trace:
[13230.424019]  <IRQ>
[13230.424024]  [<ffffffff814ff611>] ? xfrm_sk_policy_lookup+0x44/0x9b
[13230.424076]  [<ffffffff814ffa8c>] ? xfrm_lookup+0x91/0x446
[13230.424111]  [<ffffffff814f76a6>] ? ip_route_me_harder+0x150/0x1b0
[13230.424146]  [<ffffffffa019a457>] ? ip_vs_route_me_harder+0x86/0x91 [ip_vs]
[13230.424182]  [<ffffffffa019b97a>] ? ip_vs_out+0x2d3/0x5bc [ip_vs]
[13230.424213]  [<ffffffff814b537c>] ? ip_rcv_finish+0x2b8/0x2b8
[13230.424244]  [<ffffffff814b0237>] ? nf_iterate+0x42/0x80
[13230.424277]  [<ffffffff814b02de>] ? nf_hook_slow+0x69/0xff
[13230.424308]  [<ffffffff814b537c>] ? ip_rcv_finish+0x2b8/0x2b8
[13230.424339]  [<ffffffff814b56a0>] ? ip_local_deliver+0x6f/0x7e
[13230.424371]  [<ffffffff8148c94c>] ? __netif_receive_skb_core+0x5c6/0x62d
[13230.424404]  [<ffffffff8148cb48>] ? process_backlog+0x13e/0x13e
[13230.424438]  [<ffffffffa041adfd>] ? br_handle_frame_finish+0x382/0x382 [bridge]
[13230.424493]  [<ffffffff8148cb94>] ? netif_receive_skb+0x4c/0x7d
[13230.424526]  [<ffffffffa041ad89>] ? br_handle_frame_finish+0x30e/0x382 [bridge]
[13230.430400]  [<ffffffffa041afce>] ? br_handle_frame+0x1d1/0x217 [bridge]
[13230.430431]  [<ffffffff8148c7fb>] ? __netif_receive_skb_core+0x475/0x62d
[13230.430468]  [<ffffffff8145cf3a>] ? intel_pstate_cpu_exit+0x3c/0x3c
[13230.430504]  [<ffffffff8103eb48>] ? call_timer_fn.isra.24+0x1c/0x6f
[13230.430539]  [<ffffffff8148ca94>] ? process_backlog+0x8a/0x13e
[13230.430577]  [<ffffffff8148cd96>] ? net_rx_action+0x9e/0x175
[13230.430612]  [<ffffffff8103a4b7>] ? __do_softirq+0xb8/0x176
[13230.430643]  [<ffffffff81566c3c>] ? call_softirq+0x1c/0x30
[13230.430671]  <EOI>
[13230.430676]  [<ffffffff810040b1>] ? do_softirq+0x2c/0x5f
[13230.430727]  [<ffffffff81039ffd>] ? local_bh_enable+0x67/0x85
[13230.430756]  [<ffffffff814b8c6a>] ? ip_finish_output+0x2e1/0x33a
[13230.430790]  [<ffffffffa01a11f6>] ? ip_vs_nat_xmit+0x267/0x2b2 [ip_vs]
[13230.430822]  [<ffffffffa019b34a>] ? ip_vs_in+0x442/0x4c5 [ip_vs]
[13230.430852]  [<ffffffff814b7cec>] ? ip_forward_options+0x163/0x163
[13230.430882]  [<ffffffff814b0237>] ? nf_iterate+0x42/0x80
[13230.430910]  [<ffffffff814b02de>] ? nf_hook_slow+0x69/0xff
[13230.430939]  [<ffffffff814b7cec>] ? ip_forward_options+0x163/0x163
[13230.430970]  [<ffffffff814b9705>] ? __ip_local_out+0x69/0x76
[13230.431000]  [<ffffffff8147d5e3>] ? __sk_dst_check+0x24/0x4c
[13230.431029]  [<ffffffff814b971b>] ? ip_local_out+0x9/0x22
[13230.431058]  [<ffffffff814b99eb>] ? ip_queue_xmit+0x2b7/0x2f0
[13230.431088]  [<ffffffff814cbdd0>] ? tcp_transmit_skb+0x6f5/0x75b
[13230.431119]  [<ffffffff814cde61>] ? tcp_connect+0x44a/0x4d9
[13230.431149]  [<ffffffff8106fcf8>] ? ktime_get_real+0xc/0x3f
[13230.431180]  [<ffffffff814883cb>] ? secure_tcp_sequence_number+0x4d/0x5e
[13230.431211]  [<ffffffff814d0ae4>] ? tcp_v4_connect+0x3ab/0x402
[13230.431241]  [<ffffffff814e10b7>] ? __inet_stream_connect+0x80/0x27c
[13230.431272]  [<ffffffff81125353>] ? fsnotify_clear_marks_by_inode+0x26/0x103
[13230.431304]  [<ffffffff814e12e3>] ? inet_stream_connect+0x30/0x48
[13230.431334]  [<ffffffff8147b52e>] ? SyS_connect+0x6e/0x93
[13230.431365]  [<ffffffff8104bc74>] ? task_work_run+0x7d/0x8d
[13230.431394]  [<ffffffff81103804>] ? SyS_fcntl+0x232/0x45e
[13230.431430]  [<ffffffff81565a22>] ? system_call_fastpath+0x16/0x1b
[13230.431464] Code: 5d 41 5e 41 5f c3 41 55 66 83 fa 02 41 54 55 48 89 fd 53 48
89 f3 41 50 74 11 31 c0 66 83 fa 0a 0f 85 ce 02 00 00 e9 fd 00 00 00 <0f> b6 47
2a 8b
[13230.431740] RIP  [<ffffffff814ff2fc>] xfrm_selector_match+0x25/0x2f6
[13230.431772]  RSP <ffff88083fd83a68>
[13230.431795] CR2: 00000000000600d0
[13230.432240] ---[ end trace 103912aa204977dc ]---

node01:/ocfs2/usr/src/linux-3.12.33/scripts# ./decodecode </tmp/oops.log
[13230.431464] Code: 5d 41 5e 41 5f c3 41 55 66 83 fa 02 41 54 55 48 89 fd 53 48
89 f3 41 50 74 11 31 c0 66 83 fa 0a 0f 85 ce 02 00 00 e9 fd 00 00 00 <0f> b6 47
2a 8b 17 8b 76 18 84 c0 74 1a b9 20 00 00 00 31 f2 29
All code
========
   0:   5d                      pop    %rbp
   1:   41 5e                   pop    %r14
   3:   41 5f                   pop    %r15
   5:   c3                      retq
   6:   41 55                   push   %r13
   8:   66 83 fa 02             cmp    $0x2,%dx
   c:   41 54                   push   %r12
   e:   55                      push   %rbp
   f:   48 89 fd                mov    %rdi,%rbp
  12:   53                      push   %rbx
  13:   48 89 f3                mov    %rsi,%rbx
  16:   41 50                   push   %r8
  18:   74 11                   je     0x2b
  1a:   31 c0                   xor    %eax,%eax
  1c:   66 83 fa 0a             cmp    $0xa,%dx
  20:   0f 85 ce 02 00 00       jne    0x2f4
  26:   e9 fd 00 00 00          jmpq   0x128
  2b:*  0f b6 47 2a             movzbl 0x2a(%rdi),%eax          <-- trapping instruction
  2f:   8b 17                   mov    (%rdi),%edx
  31:   8b 76 18                mov    0x18(%rsi),%esi
  34:   84 c0                   test   %al,%al
  36:   74 1a                   je     0x52
  38:   b9 20 00 00 00          mov    $0x20,%ecx
  3d:   31 f2                   xor    %esi,%edx
  3f:   29                      .byte 0x29

Code starting with the faulting instruction
===========================================
   0:   0f b6 47 2a             movzbl 0x2a(%rdi),%eax
   4:   8b 17                   mov    (%rdi),%edx
   6:   8b 76 18                mov    0x18(%rsi),%esi
   9:   84 c0                   test   %al,%al
   b:   74 1a                   je     0x27
   d:   b9 20 00 00 00          mov    $0x20,%ecx
  12:   31 f2                   xor    %esi,%edx
  14:   29                      .byte 0x29


I can't make sense of that output. I am now rebuilding the kernel with

CONFIG_IP_VS=m
# CONFIG_IP_VS_IPV6 is not set
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=18
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
# CONFIG_IP_VS_PROTO_AH_ESP is not set
# CONFIG_IP_VS_PROTO_ESP is not set
# CONFIG_IP_VS_PROTO_AH is not set
# CONFIG_IP_VS_PROTO_SCTP is not set
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m
CONFIG_IP_VS_SH_TAB_BITS=12
CONFIG_IP_VS_FTP=m
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PE_SIP=m

instead of:

CONFIG_IP_VS=m
CONFIG_IP_VS_IPV6=y
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y
# CONFIG_IP_VS_PROTO_SCTP is not set
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m
CONFIG_IP_VS_SH_TAB_BITS=11
# CONFIG_IP_VS_FTP is not set
CONFIG_IP_VS_NFCT=y
# CONFIG_IP_VS_PE_SIP is not set


and will try again, as I think it might be IPv6 related.


Could someone shed some light on the decoded output and point me somewhere so I
can debug this further?
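
One way to read the decoded output, offered as a sketch rather than a diagnosis: the trapping instruction `movzbl 0x2a(%rdi),%eax` loads one byte at offset 0x2a from %rdi, and CR2 holds the address that faulted. The `cmp $0x2,%dx` / `je 0x2b` pair just before it compares the family argument against 2 (AF_INET), so the faulting path looks like the IPv4 branch, not the IPv6 one. Assuming the x86-64 calling convention (first argument in %rdi) and that the first argument of xfrm_selector_match() is the `struct xfrm_selector` pointer, the C below mirrors the leading fields of that struct in user space. The names `addr_mirror`, `sel_mirror` and `base_from_fault` are hypothetical, and the layout is copied by hand from the UAPI header, so it should be checked against the 3.12.33 tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-written user-space mirror of the leading fields of the kernel's
 * struct xfrm_selector. Assumption: layout as in include/uapi/linux/xfrm.h;
 * verify against the tree that produced the oops. */
typedef union {
	uint32_t a4;
	uint32_t a6[4];
} addr_mirror;

struct sel_mirror {
	addr_mirror daddr;	/* offset 0x00, the "mov (%rdi),%edx" load */
	addr_mirror saddr;	/* offset 0x10 */
	uint16_t dport;
	uint16_t dport_mask;
	uint16_t sport;
	uint16_t sport_mask;
	uint16_t family;
	uint8_t prefixlen_d;	/* the byte read by "movzbl 0x2a(%rdi),%eax" */
	uint8_t prefixlen_s;
};

/* CR2 is the faulting address; subtracting the displacement of the
 * trapping load gives the base pointer that was held in %rdi. */
uint64_t base_from_fault(uint64_t cr2, uint64_t displacement)
{
	return cr2 - displacement;
}
```

With the values from the oops, `base_from_fault(0x600d0, 0x2a)` gives 0x600a6, which does not look like a kernel pointer. Under these assumptions the selector pointer was already bogus when xfrm_selector_match() was entered, so the caller (and whatever state it took the selector from) may be the more useful place to look.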




-- 

Kind regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

phone: +49 9282 9638 200
fax: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0.99 EUR/min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register no.: HRB 3840, Amtsgericht Hof (district court)
*from German landlines; prices from mobile networks may differ

^ permalink raw reply

* travelling...
From: David Miller @ 2014-11-26 20:47 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless, netfilter-devel

I will be travelling for about 10 days starting tomorrow.

Email will be read and patches will be reviewed, but my response
time will be a little bit longer.

I've purged patchwork of every patch I can apply right now, the two in
there right now require either an expert review or a response to
feedback.

I plan to ask Linus to pull my 'net' tree in a little bit and I have
a stable queue submission in the works as well.

Just FYI...

^ permalink raw reply

* Re: [PATCH net-next] bridge: add vlan id to mdb notifications
From: Roopa Prabhu @ 2014-11-26 20:45 UTC (permalink / raw)
  To: Jonathan Toppins
  Cc: Stephen Hemminger, vyasevich, netdev, wkok, gospo, sashok
In-Reply-To: <54762CB3.4090101@cumulusnetworks.com>

On 11/26/14, 11:40 AM, Jonathan Toppins wrote:
> On 11/26/14 1:56 PM, Stephen Hemminger wrote:
>> On Wed, 26 Nov 2014 05:53:33 -0800
>> roopa@cumulusnetworks.com wrote:
>>
>>> diff --git a/include/uapi/linux/if_bridge.h 
>>> b/include/uapi/linux/if_bridge.h
>>> index da17e45..db061fd 100644
>>> --- a/include/uapi/linux/if_bridge.h
>>> +++ b/include/uapi/linux/if_bridge.h
>>> @@ -185,6 +185,7 @@ struct br_mdb_entry {
>>>               struct in6_addr ip6;
>>>           } u;
>>>           __be16        proto;
>>> +        __be16        vid;
>>>       } addr;
>>>   };
>>>
>>
>> You can't add fields to existing binary API
>>
>
> Roopa, maybe a description of what use case this is trying to solve 
> would better justify the addition to the UAPI?
>
I don't think a description of use case can be used to justify a UAPI 
breakage. Getting the patch out was mainly to see if this really breaks 
UAPI.
Basically to get some feedback.

^ permalink raw reply

* Re: [net-next PATCH 2/5] ethernet/intel: Use eth_skb_pad helper
From: David Miller @ 2014-11-26 20:41 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.duyck, alexander.h.duyck, netdev, jeffrey.t.kirsher,
	kuznet
In-Reply-To: <1416974473.29427.49.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 25 Nov 2014 20:01:13 -0800

[ I am still intrigued, CC:'ing Alexey ]

> On Tue, 2014-11-25 at 22:19 -0500, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 25 Nov 2014 17:43:05 -0800
>> 
>> > I believe I finally have an idea why we had various + 15 in skb
>> > allocations in TCP stack !
>> 
>> It was so that you could do one level of tunneling with "for
>> free".  Or that is my recollection.
>> 
>> Those + 15 existed way before any of these padto() calls even
>> existed.
> 
> Well, tunneling is added in front of the packet. Thats why we use
> MAX_TCP_HEADER.
> 
> The +15 is in fact because TCP stack wanted to make sure the eventual
> padding (needing tailroom, not headroom) was possible...
> 
> Note that ack packets never used the +15, but other packets did.

Alexey, do you remember the exact reason for that +15 everywhere in TCP
packet allocation sizing?

I thought it was for headroom, but as Eric shows that's illogical,
it can only be for tailroom considerations.

^ permalink raw reply

* Re: [PATCH net-next] bridge: add vlan id to mdb notifications
From: Roopa Prabhu @ 2014-11-26 20:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: vyasevich, netdev, wkok, gospo, jtoppins, sashok
In-Reply-To: <20141126105614.6a42d697@urahara>

On 11/26/14, 10:56 AM, Stephen Hemminger wrote:
> On Wed, 26 Nov 2014 05:53:33 -0800
> roopa@cumulusnetworks.com wrote:
>
>> diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
>> index da17e45..db061fd 100644
>> --- a/include/uapi/linux/if_bridge.h
>> +++ b/include/uapi/linux/if_bridge.h
>> @@ -185,6 +185,7 @@ struct br_mdb_entry {
>>   			struct in6_addr ip6;
>>   		} u;
>>   		__be16		proto;
>> +		__be16		vid;
>>   	} addr;
>>   };
>>   
> You can't add fields to existing binary API

Ack, we know the concern. Since the change did not alter the size
of the struct (it fits in existing padding, and I verified that it works
with an older iproute2), we wanted to get the patch out and get some
feedback.
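
The padding argument can be checked in user space with a sketch like the following. The struct names `mdb_entry_old` and `mdb_entry_new`, the `ip6_mirror` stand-in for `struct in6_addr`, and the two helper functions are hypothetical; the field layout follows the quoted diff, and the 4-byte alignment of the IPv6 address type is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct in6_addr: 16 bytes, 4-byte aligned (assumption). */
union ip6_mirror {
	uint8_t b[16];
	uint32_t w[4];
};

/* Layout before the patch, following the quoted diff context. */
struct mdb_entry_old {
	uint32_t ifindex;
	uint8_t state;
	struct {
		union {
			uint32_t ip4;
			union ip6_mirror ip6;
		} u;
		uint16_t proto;
		/* 2 bytes of implicit tail padding follow here */
	} addr;
};

/* Layout after the patch: vid occupies the former padding bytes. */
struct mdb_entry_new {
	uint32_t ifindex;
	uint8_t state;
	struct {
		union {
			uint32_t ip4;
			union ip6_mirror ip6;
		} u;
		uint16_t proto;
		uint16_t vid;
	} addr;
};

size_t mdb_old_size(void) { return sizeof(struct mdb_entry_old); }
size_t mdb_new_size(void) { return sizeof(struct mdb_entry_new); }
```

Equal sizes only show that the message length does not change. They do not show that the former padding bytes carry a meaningful value: a sender built against the old header never initializes them explicitly, so a receiver cannot tell a real vid from leftover bytes.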

Getting the vlan in the notification is important, and the only other option I 
see is to add a new netlink attribute in the mdb msg.

I have always wondered: if binary netlink attributes have this 
restriction, shouldn't they be discouraged, especially when the other, 
extensible option is to add the fields as separate netlink attributes?

Thanks for the review.

^ permalink raw reply

* Re: [PATCH net-next] macvlan: delay the header check for dodgy packets into lower device
From: David Miller @ 2014-11-26 20:37 UTC (permalink / raw)
  To: jasowang; +Cc: kaber, netdev, linux-kernel, mst, vyasevic
In-Reply-To: <1416993674-11177-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Wed, 26 Nov 2014 17:21:14 +0800

> We do header check twice for a dodgy packet. One is done before
> macvlan_start_xmit(), another is done before lower device's
> ndo_start_xmit(). The first one seems redundant so this patch tries to
> delay header check until a packet reaches its lower device (or macvtap)
> through always enabling NETIF_F_GSO_ROBUST for macvlan device.
> 
> Cc: Patrick McHardy <kaber@trash.net>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Hmmm, isn't the idea that if we have a dodgy packet, we want to
notice that as early as possible in the packet processing path?

^ permalink raw reply

* Re: [PATCH net] r8152: drop the tx packet with invalid length
From: David Miller @ 2014-11-26 20:33 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hayeswang, netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1417027459.29427.63.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 26 Nov 2014 10:44:19 -0800

> On Wed, 2014-11-26 at 12:06 -0500, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 26 Nov 2014 08:52:28 -0800
>> 
>> > On Wed, 2014-11-26 at 17:56 +0800, Hayes Wang wrote:
>> >> Drop the tx packet which is more than the size of agg_buf_sz. When
>> >> creating a bridge with the device, we may get the tx packet with
>> >> TSO and the length is more than the gso_max_size which is set by
>> >> the driver through netif_set_gso_max_size(). Such packets couldn't
>> >> be transmitted and should be dropped directly.
>> >> 
>> >> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
>>  ...
>> > Looks like a candidate for ndo_gso_check(), so that we do not drop, but
>> > instead segment from netif_needs_gso()/validate_xmit_skb()
>> 
>> You mean have the bridge implement the ndo_gso_check() method right?
> 
> No, I meant this particular driver.
> 
> Note that netif_skb_features() does only this check :
> 
> if (gso_segs > dev->gso_max_segs || gso_segs < dev->gso_min_segs)
>       features &= ~NETIF_F_GSO_MASK;
> 
> Ie not testing gso_max_size
> 
> It looks like all these particular tests should be moved on
> ndo_gso_check(), to remove code from netif_skb_features()

A check against gso_max_size is generic enough that it ought to be put
right into netif_needs_gso() rather than duplicating it into every
driver's ndo_gso_check() method, don't you think?
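
As a rough user-space illustration of such a generic check (not the kernel's actual code): the types `dev_limits` and `gso_pkt` and the function `hw_gso_allowed` below are hypothetical stand-ins for the relevant net_device and skb fields. The segment-count test mirrors the one Eric quotes from netif_skb_features(); the size test is the addition under discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's struct net_device / sk_buff. */
struct dev_limits {
	uint32_t gso_max_size;
	uint16_t gso_max_segs;
	uint16_t gso_min_segs;
};

struct gso_pkt {
	uint32_t len;		/* total length of the GSO packet */
	uint16_t gso_segs;	/* number of segments it would produce */
};

/* Returns true when the device could take the packet as a single GSO
 * unit. A false result would mean "segment in software first", so the
 * packet is not dropped by the driver. */
bool hw_gso_allowed(const struct dev_limits *dev, const struct gso_pkt *pkt)
{
	/* Existing segment-count test, as quoted in the thread. */
	if (pkt->gso_segs > dev->gso_max_segs ||
	    pkt->gso_segs < dev->gso_min_segs)
		return false;

	/* Generic size test, so each driver need not repeat it. */
	if (pkt->len > dev->gso_max_size)
		return false;

	return true;
}
```

Keeping both tests in one common helper is the design point here: a driver such as r8152 would then only advertise its limit through netif_set_gso_max_size() and never see an oversized packet.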

^ permalink raw reply

* Re: [PATCH 0/5 net] bridge: Fix missing Netlink message validations
From: David Miller @ 2014-11-26 20:29 UTC (permalink / raw)
  To: tgraf; +Cc: stephen, netdev
In-Reply-To: <cover.1417005245.git.tgraf@suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Wed, 26 Nov 2014 13:42:15 +0100

> Adds various missing length checks in the bridging code for Netlink
> messages and corresponding attributes provided by user space.

Series applied, thanks Thomas.

^ permalink raw reply

* Re: [PATCH net-next V1 1/2] ethtool: Support for configurable RSS hash function
From: Eyal perry @ 2014-11-26 20:29 UTC (permalink / raw)
  To: David Miller, ben
  Cc: amirv, netdev, ogerlitz, yevgenyp, eyalpe, Tom Lendacky,
	ariel.elior, prashant, mchan, hariprasad, sathya.perla,
	subbu.seetharaman, ajit.khaparde, jeffrey.t.kirsher,
	jesse.brandeburg, bruce.w.allan, carolyn.wyborny,
	donald.c.skidmore, gregory.v.rose, matthew.vick, john.ronciak,
	mitch.a.williams, linux-net-drivers, sshah, sbhatewara,
	pv-drivers
In-Reply-To: <20141122.165407.641057904952001007.davem@davemloft.net>

On 11/22/2014 11:54 PM, David Miller wrote:
> From: Amir Vadai <amirv@mellanox.com>
> Date: Thu, 20 Nov 2014 16:26:49 +0200
> 
>> +	/* We require at least one supported parameter to be changed and no
>> +	 * change in any of the unsupported parameters
>> +	 */
>> +	if ((!indir && !key) || hfunc != ETH_RSS_HASH_NO_CHANGE)
>> +		return -EOPNOTSUPP;
>> +
> 
> I know it will make more work for you, but all of these driver
> implementations of this hook should:
> 
> 1) Accept hfunc of whatever hash function the chip is using,
>    not just ETH_RSS_HASH_NO_CHANGE.
> 
> 2) Provide an accurate hfunc value in the ->get() call.
Hello David, Ben, et al,
Before submitting V2, I'd like to consult you regarding the
implementation shown above. I am thinking of dropping the validity check
described in the comment above ("We require at least one supported
parameter..."). Instead, I think it's better to fail the ->set() call
only when an unsupported action is requested, e.g.:
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE &&
+	    hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+	if (indir)
+		/* set indirection table code ... */
+	if (key)
+		/* set hash key code ... */
The drawback is a change in previous behavior (previously, only requests
that changed at least one parameter were accepted); however, it seems more
reasonable and makes the code much more readable.
In a similar manner, for the ->get() call, I would remove the validity
checks (as I suggested in V1) and just protect against a NULL pointer
dereference, e.g.:
-	if (!indir && !key)
-		return -EOPNOTSUPP;
+	if (indir)
+		/* fill in the given indirection table array */
+	if (key)
+		/* fill in the given hash key array */
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
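
The two fragments above can be put together as a self-contained user-space sketch. Everything here is illustrative: `rss_state`, `rss_set`, `rss_get`, the table sizes and the numeric values of `HASH_NO_CHANGE` and `HASH_TOP` are hypothetical stand-ins, not the ethtool definitions or any driver's code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values and sizes only; the real constants are defined
 * by the ethtool UAPI header and the real sizes by the device. */
#define HASH_NO_CHANGE	0u
#define HASH_TOP	1u
#define INDIR_LEN	4
#define KEY_LEN		8

struct rss_state {
	uint32_t indir[INDIR_LEN];
	uint8_t key[KEY_LEN];
};

/* Fail only when an unsupported hash function is requested; a NULL
 * indir or key means "leave that part unchanged". */
int rss_set(struct rss_state *st, const uint32_t *indir,
	    const uint8_t *key, uint8_t hfunc)
{
	if (hfunc != HASH_NO_CHANGE && hfunc != HASH_TOP)
		return -EOPNOTSUPP;
	if (indir)
		memcpy(st->indir, indir, sizeof(st->indir));
	if (key)
		memcpy(st->key, key, sizeof(st->key));
	return 0;
}

/* Fill in only what the caller asked for; NULL pointers are skipped. */
int rss_get(const struct rss_state *st, uint32_t *indir,
	    uint8_t *key, uint8_t *hfunc)
{
	if (indir)
		memcpy(indir, st->indir, sizeof(st->indir));
	if (key)
		memcpy(key, st->key, sizeof(st->key));
	if (hfunc)
		*hfunc = HASH_TOP;
	return 0;
}
```

Note that in this shape a call with no indirection table, no key and no hash function change succeeds as a no-op, which is exactly the behavior change described above.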
Please advise,
Thanks,
Eyal.

^ permalink raw reply
