Netdev List


* Re: [patch net-next 1/2 v3] tc: add BPF based action
From: Alexei Starovoitov @ 2015-01-14 15:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jiri Pirko, Network Development, David S. Miller,
	Jamal Hadi Salim, Hannes Frederic Sowa
In-Reply-To: <54B66F08.2010305@redhat.com>

On Wed, Jan 14, 2015 at 5:28 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>
> I'm still wondering about the drop semantics ... wouldn't it be more
> intuitive to use 0 for drops in this context?

Good point.
I think it must be 0 to match the behavior of socket filters, etc.
If a program tries to access beyond the packet size or divides
by zero, it will be terminated and will return 0.
So zero should be the safest action from the caller's point of view.
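
The reasoning above can be sketched in a few lines of C. This is only a toy model, not the kernel's BPF interpreter: run_toy_filter() and VERDICT_DROP are hypothetical names made up for illustration. It shows why 0 is a safe "drop" value when every faulting path also returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdict convention for this sketch: 0 means drop. */
enum { VERDICT_DROP = 0 };

/*
 * Toy "program": load one byte at offset off, divide it by divisor and
 * return the quotient plus one as a non-zero verdict.  Both faulting
 * cases terminate early and return 0, so a caller that treats 0 as
 * "drop" never acts on a packet the program failed to examine.
 */
static uint32_t run_toy_filter(const uint8_t *pkt, size_t len,
                               size_t off, uint32_t divisor)
{
    assert(pkt != NULL || len == 0);

    if (off >= len)         /* access beyond packet size: terminate */
        return VERDICT_DROP;
    if (divisor == 0)       /* divide by zero: terminate */
        return VERDICT_DROP;

    return pkt[off] / divisor + 1u;
}
```

With a 4-byte packet {10, 20, 30, 40}, run_toy_filter(pkt, 4, 1, 5) yields a non-zero verdict, while an offset of 4 or a divisor of 0 yields 0.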


* non-OVS based vxlan config broken on 3.19-rc ?!
From: Or Gerlitz @ 2015-01-14 15:18 UTC (permalink / raw)
  To: tom Herbert, thomas Graf, Marcelo Leitner, Jesse Gross
  Cc: netdev@vger.kernel.org

Guys, I just realized that non-OVS based vxlan config is broken with
3.19-rc... It works for me on 3.18.2 and breaks on 3.19-rc3
(Linus' tree). Tested over mlx4 (both offloaded and non-offloaded modes)
and igb. See below the simplest setup I found that breaks on 3.19-rc and
works on 3.18.

Looking at tcpdump and stats, the ARP reply arrives at the 3.19-rc host
NIC driver but is dropped along the stack before being handed to the vxlan
driver; not sure where and why...

Or.

> $ tcpdump -nni vxlan42 -e
> 16:48:40.961852 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
> 16:48:41.961864 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
> 16:48:42.979948 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
>
> $ tcpdump -nni eth3 -e
> 16:48:46.993870 00:02:c9:e9:bf:32 > 01:00:5e:00:00:2a, ethertype IPv4 
> (0x0800), length 92: 192.168.31.17.33434 > 239.0.0.42.4789: UDP, length 50
> 16:48:46.993905 f4:52:14:01:da:82 > 00:02:c9:e9:bf:32, ethertype IPv4 
> (0x0800), length 92: 192.168.31.18.39155 > 192.168.31.17.4789: UDP, 
> length 50
> 16:48:47.993855 00:02:c9:e9:bf:32 > 01:00:5e:00:00:2a, ethertype IPv4 
> (0x0800), length 92: 192.168.31.17.33434 > 239.0.0.42.4789: UDP, length 50
> 16:48:47.993881 f4:52:14:01:da:82 > 00:02:c9:e9:bf:32, ethertype IPv4 
> (0x0800), length 92: 192.168.31.18.39155 > 192.168.31.17.4789: UDP, 
> length 50
>
> $ nstat
> Wed Jan 14 16:54:04 2015
>
> #kernel
> IpInReceives                    2                  0.0
> IpInDelivers                    2                  0.0
> IpOutRequests                   4                  0.0
> IcmpOutErrors                   2                  0.0
> IcmpOutEchoReps                 2                  0.0
> IcmpMsgOutType8                 2                  0.0
> TcpInSegs                       1                  0.0
> TcpOutSegs                      1                  0.0
> UdpInDatagrams                  1                  0.0
> TcpExtTCPPureAcks               1                  0.0
> TcpExtTCPOrigDataSent           1                  0.0
> IpExtOutMcastPkts               1                  0.0
> IpExtInOctets                   124                0.0
> IpExtOutOctets                  670                0.0
> IpExtOutMcastOctets             78                 0.0
> IpExtInNoECTPkts                2                  0.0
>
> Every 1.0s: netstat -s -w
> Wed Jan 14 16:54:56 2015
>
> Ip:
>     1077125083 total packets received
>     182 with invalid addresses
>     0 forwarded
>     0 incoming packets discarded
>     1077124621 incoming packets delivered
>     545614475 requests sent out
> Icmp:
>     67936 ICMP messages received
>     60 input ICMP message failed.
>     ICMP input histogram:
>         destination unreachable: 3319
>         echo requests: 528
>         echo replies: 64089
>     71890 ICMP messages sent
>     0 ICMP messages failed
>     ICMP output histogram:
>         destination unreachable: 3542
>         echo request: 67821
>         echo replies: 527
> IcmpMsg:
>         InType0: 64089
>         InType3: 3319
>         InType8: 528
>         OutType0: 527
>         OutType3: 3542
>         OutType8: 67821
> UdpLite:
> IpExt:
>     InMcastPkts: 827
>     OutMcastPkts: 2993
>     InBcastPkts: 1424
>     InOctets: 2102729994314
>     OutOctets: 35397269627
>     InMcastOctets: 31854
>     OutMcastOctets: 237154
>     InBcastOctets: 453473
>     InNoECTPkts: 1440876685



# host A with IP address 192.168.31.17

IP=ip

# mlx4
ETH=eth0

$IP link del vxlan42
$IP link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dstport 4789 dev $ETH
$IP link set vxlan42 up
ifconfig vxlan42 192.168.92.17/24
ifconfig vxlan42 mtu 1450

# plain ping
ping 192.168.31.18 -c 2
# encapsulated ping <-- BREAKS
ping 192.168.92.18 -c 2

# host B with IP address 192.168.31.18

IP=ip

# mlx4
ETH=eth0

$IP link del vxlan42
$IP link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dstport 4789 dev $ETH
$IP link set vxlan42 up
ifconfig vxlan42 192.168.92.18/24
ifconfig vxlan42 mtu 1450

ping 192.168.31.17 -c 2
ping 192.168.92.17 -c 2
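
As a sanity check on the captures above, the frame sizes can be reproduced with a little arithmetic. The sketch below assumes an untagged outer Ethernet header and an IPv4 header without options. The function and constant names are hypothetical, and the 1500-byte underlay MTU is an assumption rather than something stated in this report.

```c
#include <assert.h>

enum {
    OUTER_ETH_LEN  = 14, /* Ethernet header, no VLAN tag (assumed) */
    OUTER_IPV4_LEN = 20, /* IPv4 header without options (assumed) */
    OUTER_UDP_LEN  = 8,
    VXLAN_HDR_LEN  = 8,
    INNER_ETH_LEN  = 14,
};

/* UDP payload carried for an inner Ethernet frame of inner_frame_len bytes. */
static unsigned int vxlan_udp_payload_len(unsigned int inner_frame_len)
{
    return VXLAN_HDR_LEN + inner_frame_len;
}

/* Length of the outer frame as seen on the underlay interface. */
static unsigned int vxlan_outer_frame_len(unsigned int inner_frame_len)
{
    return OUTER_ETH_LEN + OUTER_IPV4_LEN + OUTER_UDP_LEN +
           vxlan_udp_payload_len(inner_frame_len);
}

/* Largest inner MTU that still fits into one underlay packet. */
static unsigned int vxlan_inner_mtu(unsigned int underlay_mtu)
{
    const unsigned int overhead = OUTER_IPV4_LEN + OUTER_UDP_LEN +
                                  VXLAN_HDR_LEN + INNER_ETH_LEN;

    assert(underlay_mtu > overhead);
    return underlay_mtu - overhead;
}
```

For the 42-byte ARP request seen on vxlan42 this gives a UDP payload of 50 bytes and an outer frame of 92 bytes, which matches the eth3 capture. A 1500-byte underlay MTU gives the 1450 configured on vxlan42.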


* [bisected regression] e1000e: "Detected Hardware Unit Hang"
From: Thomas Jarosch @ 2015-01-14 15:32 UTC (permalink / raw)
  To: 'Linux Netdev List'; +Cc: Eric Dumazet, Jeff Kirsher, e1000-devel

Hello,

after updating a good bunch of production level machines
from kernel 3.4.101 to kernel 3.14.25, a few of them started
to show serious trouble when there was a lot of network traffic.

---------------------------------------------------------------
Jan 14 11:14:57 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Jan 14 11:14:57 intrartc kernel:  TDH                  <3b>
Jan 14 11:14:57 intrartc kernel:  TDT                  <76>
Jan 14 11:14:57 intrartc kernel:  next_to_use          <76>
Jan 14 11:14:57 intrartc kernel:  next_to_clean        <31>
Jan 14 11:14:57 intrartc kernel: buffer_info[next_to_clean]:
Jan 14 11:14:57 intrartc kernel:  time_stamp           <ffff328c>
Jan 14 11:14:57 intrartc kernel:  next_to_watch        <3b>
Jan 14 11:14:57 intrartc kernel:  jiffies              <ffff33b9>
Jan 14 11:14:57 intrartc kernel:  next_to_watch.status <0>
Jan 14 11:14:57 intrartc kernel: MAC Status             <40080083>
Jan 14 11:14:57 intrartc kernel: PHY Status             <796d>
Jan 14 11:14:57 intrartc kernel: PHY 1000BASE-T Status  <3800>
Jan 14 11:14:57 intrartc kernel: PHY Extended Status    <3000>
Jan 14 11:14:57 intrartc kernel: PCI Status             <10>
Jan 14 11:14:59 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
..
---------------------------------------------------------------

All of those troubled machines use an Intel DH61CR board and
are driven by the e1000e driver. Kernels 3.7.0 to 3.19-rc4 are affected.

The problem vanishes when you disable TSO. This is the
recommended "solution" on serverfault and others.
http://ehc.ac/p/e1000/bugs/378/
http://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang

I have a test setup that can trigger the problem within seconds
and bisected it down to this commit (hi Eric!):
---------------------------------------------------------------
commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag

    We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

    - Better filling of space (the ending hole overhead is less an issue)

    - Less calls to page allocator or accesses to page->_count

    - Could allow struct skb_shared_info futures changes without major
    performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@intel.com>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
---------------------------------------------------------------
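
To illustrate the "better filling of space" point in the commit message, here is a rough sketch of how many equally sized fragments fit into one chunk and how large the ending hole is. It is not the kernel's __netdev_alloc_frag() code. The names and the 1920-byte fragment size are hypothetical and chosen only to make the arithmetic visible.

```c
#include <assert.h>
#include <stddef.h>

struct frag_fit {
    size_t count; /* whole fragments that fit into the chunk */
    size_t hole;  /* unused bytes left at the end of the chunk */
};

/* Carve fixed-size fragments out of one chunk, front to back. */
static struct frag_fit compute_frag_fit(size_t chunk_size, size_t frag_size)
{
    struct frag_fit fit;

    assert(frag_size > 0 && frag_size <= chunk_size);
    fit.count = chunk_size / frag_size;
    fit.hole = chunk_size % frag_size;
    return fit;
}
```

With a 1920-byte fragment, a 4096-byte page holds 2 fragments and wastes 256 bytes (about 6%), while a 32768-byte chunk holds 17 fragments and wastes 128 bytes (under 0.5%).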

Reverting the commit, e.g. in kernel 3.7.0, solves the issue.
I've done some more tests:

    3.18.0 32bit + PAE: broken
    3.6.0 32bit + PAE: works
    3.7.0 32bit + PAE: broken
    3.7.0 32bit + PAE + revert 69b08f62e17439ee3d436faf0b9a7ca6fffb78db -> works

    3.7.0 32bit (without PAE) -> broken
    3.7.0 32bit + "GFP_COMP" flag removed in __netdev_alloc_frag(): broken
    3.7.0 32bit + "GFP_COMP" flag replaced with
                              "GFP_DMA" in __netdev_alloc_frag(): works!
    3.7.0 32bit + "GFP_COMP" flag + "GFP_DMA" flag: broken
    3.19-rc4 32bit: broken


The problem is triggered only when the traffic is forwarded to another client
(this client is behind NAT). Generating traffic directly
on the system did not trigger the issue.

To me it looks like Eric's change uncovered a memory allocation
issue in the e1000e driver: It probably uses a memory address
unsuitable for DMA or so. This is just a guess though.

Funny fact: I have another Intel DH61CR board that does not show the problem.
I've borrowed (...) the mainboard from one affected box for my bisect test setup.

Please CC: comments. Thanks.

Best regards,
Thomas


* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Zhou, Danny @ 2015-01-14 15:28 UTC (permalink / raw)
  To: David Miller, David.Laight@ACULAB.COM
  Cc: john.fastabend@gmail.com, dborkman@redhat.com,
	hannes@stressinduktion.org, netdev@vger.kernel.org,
	nhorman@tuxdriver.com, Ronciak, John, brouer@redhat.com
In-Reply-To: <20150113.122733.1459000731112321026.davem@davemloft.net>



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, January 14, 2015 1:28 AM
> To: David.Laight@ACULAB.COM
> Cc: john.fastabend@gmail.com; dborkman@redhat.com; hannes@stressinduktion.org; netdev@vger.kernel.org; Zhou, Danny;
> nhorman@tuxdriver.com; Ronciak, John; brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 13 Jan 2015 17:15:30 +0000
> 
> > How about something like:
> >
> > struct tpacket_dma_mem_region {
> >     __u64 addr;        /* userspace virtual address */
> >     __u64 phys_addr;    /* physical address */
> >     __u64 iova;        /* IO virtual address used for DMA */
> >     __u64 size;    /* size of region */
> >     int direction;        /* dma data direction */
> > } aligned(8);
> >
> > So that it is independent of 32/64 bits.
> > It is a shame that gcc has no way of defining a 64bit 'void *' on 32bit systems.
> > You can use a union, but you still need to zero extend the value on LE (worse on BE).
> 
> We have an __aligned_u64, please use that.

Thanks, will do.
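
For illustration, a layout along those lines might look like the sketch below. It is not the struct from the next patch revision: sketch_aligned_u64 is a local stand-in for the kernel's __aligned_u64, and the struct name, the fixed-width direction field and the explicit pad member are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's __aligned_u64 from <linux/types.h>. */
typedef uint64_t sketch_aligned_u64 __attribute__((aligned(8)));

/* Hypothetical layout: the same size and offsets on 32-bit and 64-bit ABIs. */
struct sketch_dma_mem_region {
    sketch_aligned_u64 addr;      /* userspace virtual address */
    sketch_aligned_u64 phys_addr; /* physical address */
    sketch_aligned_u64 iova;      /* IO virtual address used for DMA */
    sketch_aligned_u64 size;      /* size of region */
    int32_t direction;            /* dma data direction */
    uint32_t pad;                 /* explicit padding, assumed to be zero */
};

static_assert(sizeof(struct sketch_dma_mem_region) == 40,
              "layout must not depend on the ABI");
static_assert(offsetof(struct sketch_dma_mem_region, direction) == 32,
              "direction must follow the four 64-bit fields");

/* Zero-extend a user pointer into the 64-bit addr field. */
static uint64_t user_ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}
```

Going through uintptr_t zero-extends the pointer on 32-bit systems, which covers the conversion David Laight mentions without needing a union.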


* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Zhou, Danny @ 2015-01-14 15:26 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Network Development, Neil Horman, Daniel Borkmann, Ronciak, John,
	Hannes Frederic Sowa, brouer@redhat.com
In-Reply-To: <CA+FuTSdCV8puZVe-6aWQt5mPk0i3_CBK7hecOgviLc0GpdUmNw@mail.gmail.com>



> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Wednesday, January 14, 2015 2:53 AM
> To: John Fastabend
> Cc: Network Development; Zhou, Danny; Neil Horman; Daniel Borkmann; Ronciak, John; Hannes Frederic Sowa;
> brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> On Mon, Jan 12, 2015 at 11:35 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > This patch adds net_device ops to split off a set of driver queues
> > from the driver and map the queues into user space via mmap. This
> > allows the queues to be directly manipulated from user space. For
> > raw packet interface this removes any overhead from the kernel network
> > stack.
> 
> Can you elaborate how packet payload mapping is handled?
> Processes are still responsible for translating from user virtual to
> physical (and bus) addresses, correct? The IOMMU is only there
> to restrict the physical address ranges that may be written.
> 

User space processes have to use the IOVA returned from af_packet to fill
the NIC's Rx (as well as Tx) descriptors. When a DMA request is triggered to transfer an
incoming packet from the NIC to host memory, the device ID field in the DMA request
(specified by the PCIe device's B:N:F) is used by the IOMMU to find the address translation
structure for this domain/device. The IOMMU then uses the IOVA field in the
DMA request as the match field to look up the per-device address translation structure
and get the corresponding physical address where the packet should be transferred.

If an invalid IOVA (e.g. an arbitrary address or a physical address) is filled into the NIC's descriptors,
the IOMMU prevents the DMA from happening because the lookup described above fails.
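
The lookup described above can be modelled roughly as follows. This is a toy sketch, not how any real IOMMU stores its translations (those use hardware page tables). All names are hypothetical and device_id stands in for the requester ID carried in the DMA request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One IOVA range mapped to a physical range. */
struct toy_iommu_entry {
    uint64_t iova;
    uint64_t phys;
    uint64_t size;
};

/* Per-device translation structure, selected by the requester ID. */
struct toy_iommu_domain {
    uint16_t device_id;
    const struct toy_iommu_entry *entries;
    size_t nentries;
};

/*
 * Translate iova for device_id.  Returns false when the device has no
 * mapping covering iova, which is the case where DMA gets blocked.
 */
static bool toy_iommu_translate(const struct toy_iommu_domain *domains,
                                size_t ndomains, uint16_t device_id,
                                uint64_t iova, uint64_t *phys_out)
{
    assert(phys_out != NULL);

    for (size_t i = 0; i < ndomains; i++) {
        if (domains[i].device_id != device_id)
            continue;
        for (size_t j = 0; j < domains[i].nentries; j++) {
            const struct toy_iommu_entry *e = &domains[i].entries[j];

            if (iova >= e->iova && iova - e->iova < e->size) {
                *phys_out = e->phys + (iova - e->iova);
                return true;
            }
        }
    }
    return false;
}
```

A descriptor filled with a raw physical address, or with an IOVA belonging to another device's domain, simply finds no matching entry and the transfer is refused.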

> >
> > With these operations we bypass the network stack and packet_type
> > handlers that would typically send traffic to an af_packet socket.
> > This means hardware must do the forwarding. To do this we can use
> > the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> > currently supported by multiple drivers including sfc, mlx4, niu,
> > ixgbe, and i40e. Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support this interface.
> >
> > A follow on patch adds support for ixgbe but we expect at least
> > the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> > implemented later.
> >
> > The high level flow, leveraging the af_packet control path, looks
> > like:
> >
> >         bind(fd, &sockaddr, sizeof(sockaddr));
> >
> >         /* Get the device type and info */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> >                    &optlen);
> >
> >         /* With device info we can look up descriptor format */
> >
> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, &optlen);
> >
> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> >
> >         /* Now we have some user space queues to read/write to*/
> >
> > There is one critical difference when running with these interfaces
> > vs running without them. In the normal case the af_packet module
> > uses a standard descriptor format exported by the af_packet user
> > space headers. In this model because we are working directly with
> > driver queues the descriptor format maps to the descriptor format
> > used by the device. User space applications can learn device
> > information from the socket option PACKET_DEV_DESC_INFO. These
> > are described by giving the vendor/deviceid and a descriptor layout
> > in offset/length/width/alignment/byte_ordering.
> 
> Raising the issue of exposed vs. virtualized interface just once
> more. I wonder if it is possible to keep the virtual to physical
> translation in the kernel while avoiding syscall latency, by doing
> the translation in a kernel thread on a coupled hyperthread that
> waits with mwait on the virtual queue producer index. The page
> table operations that Neil proposed in v1 of this patch may work
> even better.
> 

This is a one-shot request during initialization, so it should be OK from a latency
perspective. The NIC needs a physically contiguous host memory region
to be used as the rx/tx packet buffer, so the physical address is provided for af_packet
or the NIC driver to do this check. Otherwise, it is hard to check contiguity given only the
virtual address and size of the memory region.
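
The contiguity check boils down to comparing each pinned page's physical address against the expected address derived from the supplied base, much like the page_index comparison in get_umem_pages() in the quoted patch. Below is a stand-alone sketch of that idea. The function name, the plain array of physical addresses and the 4096-byte page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/*
 * Return true if page i of the region sits exactly i pages above
 * phys_base for every page, i.e. the region is physically contiguous
 * and starts at the physical address user space claimed.
 */
static bool pages_are_contiguous(const uint64_t *page_phys, size_t npages,
                                 uint64_t phys_base)
{
    assert(phys_base % SKETCH_PAGE_SIZE == 0);

    for (size_t i = 0; i < npages; i++) {
        if (page_phys[i] != phys_base + (uint64_t)i * SKETCH_PAGE_SIZE)
            return false;
    }
    return true;
}
```

Without the claimed physical base there is nothing to compare the first page against, which is why the physical address is passed in alongside the virtual address and size.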

> > To protect against arbitrary DMA writes IOMMU devices put memory
> > in a single domain to stop arbitrary DMA to memory. Note it would
> > be possible to dma into another sockets pages because most NIC
> > devices only support a single domain. This would require being
> > able to guess another sockets page layout. However the socket
> > operation does require CAP_NET_ADMIN privileges.
> >
> > Additionally we have a set of DPDK patches to enable DPDK with this
> > interface. DPDK can be downloaded @ dpdk.org although as I hope is
> > clear from above DPDK is just our particular test environment; we
> > expect other libraries could be built on this interface.
> >
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  include/linux/netdevice.h      |   79 ++++++++
> >  include/uapi/linux/if_packet.h |   88 +++++++++
> >  net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
> >  net/packet/internal.h          |   10 +
> >  4 files changed, 573 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 679e6e9..b71c97d 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -52,6 +52,8 @@
> >  #include <linux/neighbour.h>
> >  #include <uapi/linux/netdevice.h>
> >
> > +#include <linux/if_packet.h>
> > +
> >  struct netpoll_info;
> >  struct device;
> >  struct phy_device;
> > @@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
> >   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
> >   *     Called to notify switch device port of bridge port STP
> >   *     state change.
> > + *
> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be handed to the
> > + *     callee for management. After this returns the driver will not use the
> > + *     queues.
> > + *
> > + * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int *qpairs_start_from,
> > + *                              unsigned int *qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to get the location of queues that have been split for user
> > + *     space to use. The socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs successfully.
> > + *
> > + * int (*ndo_return_queue_pairs) (struct net_device *dev,
> > + *                               struct sock *sk)
> > + *     Called to return a set of queues identified by sock to the driver. The
> > + *     socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs for this action to be performed.
> > + *
> > + * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
> > + *                             struct tpacket_dev_qpair_map_region_info *info)
> > + *     Called to return mapping of queue memory region.
> > + *
> > + * int (*ndo_get_device_desc_info) (struct net_device *dev,
> > + *                                 struct tpacket_dev_info *dev_info)
> > + *     Called to get device specific information. This should uniquely identify
> > + *     the hardware so that descriptor formats can be learned by the stack/user
> > + *     space.
> > + *
> > + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> > + *                                  struct net_device *dev)
> > + *     Called to map queue pair range from split_queue_pairs into mmap region.
> > + *
> > + * int (*ndo_validate_dma_mem_region_map)
> > + *                                     (struct net_device *dev,
> > + *                                      struct tpacket_dma_mem_region *region,
> > + *                                      struct sock *sk)
> > + *     Called to validate DMA address remapping for userspace memory region
> > + *
> > + * int (*ndo_get_dma_region_info)
> > + *                              (struct net_device *dev,
> > + *                               struct tpacket_dma_mem_region *region,
> > + *                               struct sock *sk)
> > + *     Called to get a dma region's information such as iova.
> >   */
> >  struct net_device_ops {
> >         int                     (*ndo_init)(struct net_device *dev);
> > @@ -1190,6 +1240,35 @@ struct net_device_ops {
> >         int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
> >                                                               u8 state);
> >  #endif
> > +       int                     (*ndo_split_queue_pairs)(struct net_device *dev,
> > +                                        unsigned int qpairs_start_from,
> > +                                        unsigned int qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_split_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        unsigned int *qpairs_start_from,
> > +                                        unsigned int *qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_return_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_device_qpair_map_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_qpair_map_region_info *info);
> > +       int                     (*ndo_get_device_desc_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_info *dev_info);
> > +       int                     (*ndo_direct_qpair_page_map)
> > +                                       (struct vm_area_struct *vma,
> > +                                        struct net_device *dev);
> > +       int                     (*ndo_validate_dma_mem_region_map)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_dma_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> >  };
> >
> >  /**
> > diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> > index da2d668..eb7a727 100644
> > --- a/include/uapi/linux/if_packet.h
> > +++ b/include/uapi/linux/if_packet.h
> > @@ -54,6 +54,13 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT                  18
> >  #define PACKET_TX_HAS_OFF              19
> >  #define PACKET_QDISC_BYPASS            20
> > +#define PACKET_RXTX_QPAIRS_SPLIT       21
> > +#define PACKET_RXTX_QPAIRS_RETURN      22
> > +#define PACKET_DEV_QPAIR_MAP_REGION_INFO       23
> > +#define PACKET_DEV_DESC_INFO           24
> > +#define PACKET_DMA_MEM_REGION_MAP       25
> > +#define PACKET_DMA_MEM_REGION_RELEASE   26
> > +
> >
> >  #define PACKET_FANOUT_HASH             0
> >  #define PACKET_FANOUT_LB               1
> > @@ -64,6 +71,87 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT_FLAG_ROLLOVER    0x1000
> >  #define PACKET_FANOUT_FLAG_DEFRAG      0x8000
> >
> > +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> > +#define PACKET_MAX_NUM_DESC_FORMATS      8
> > +#define PACKET_MAX_NUM_DESC_FIELDS       64
> > +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> > +               .seqn = (__u8)fseq,                             \
> > +               .offset = (__u8)foffset,                        \
> > +               .width = (__u8)fwidth,                          \
> > +               .align = (__u8)falign,                          \
> > +               .byte_order = (__u8)fbo
> > +
> > +#define MAX_MAP_MEMORY_REGIONS 64
> > +
> > +/* setsockopt takes addr, size, direction parameters, getsockopt takes
> > + * iova, size, direction.
> > + * */
> > +struct tpacket_dma_mem_region {
> > +       void *addr;             /* userspace virtual address */
> > +       __u64 phys_addr;        /* physical address */
> > +       __u64 iova;             /* IO virtual address used for DMA */
> > +       unsigned long size;     /* size of region */
> > +       int direction;          /* dma data direction */
> > +};
> > +
> > +struct tpacket_dev_qpair_map_region_info {
> > +       unsigned int tp_dev_bar_sz;             /* size of BAR */
> > +       unsigned int tp_dev_sysm_sz;            /* size of system memory */
> > +       /* number of contiguous memory on BAR mapping to user space */
> > +       unsigned int tp_num_map_regions;
> > +       /* number of contiguous memory on system mapping to user space */
> > +       unsigned int tp_num_sysm_map_regions;
> > +       struct map_page_region {
> > +               unsigned page_offset;   /* offset to start of region */
> > +               unsigned page_sz;       /* size of page */
> > +               unsigned page_cnt;      /* number of pages */
> > +       } tp_regions[MAX_MAP_MEMORY_REGIONS];
> > +};
> > +
> > +struct tpacket_dev_qpairs_info {
> > +       unsigned int tp_qpairs_start_from;      /* qpairs index to start from */
> > +       unsigned int tp_qpairs_num;             /* number of qpairs */
> > +};
> > +
> > +enum tpack_desc_byte_order {
> > +       BO_NATIVE = 0,
> > +       BO_NETWORK,
> > +       BO_BIG_ENDIAN,
> > +       BO_LITTLE_ENDIAN,
> > +};
> > +
> > +struct tpacket_nic_desc_fld {
> > +       __u8 seqn;      /* Sequence index of descriptor field */
> > +       __u8 offset;    /* Offset to start */
> > +       __u8 width;     /* Width of field */
> > +       __u8 align;     /* Alignment in bits */
> > +       enum tpack_desc_byte_order byte_order;  /* Endian flag */
> > +};
> > +
> > +struct tpacket_nic_desc_expr {
> > +       __u8 version;           /* Version number */
> > +       __u8 size;              /* Descriptor size in bytes */
> > +       enum tpack_desc_byte_order byte_order;          /* Endian flag */
> > +       __u8 num_of_fld;        /* Number of valid fields */
> > +       /* List of each descriptor field */
> > +       struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
> > +};
> > +
> > +struct tpacket_dev_info {
> > +       __u16   tp_device_id;
> > +       __u16   tp_vendor_id;
> > +       __u16   tp_subsystem_device_id;
> > +       __u16   tp_subsystem_vendor_id;
> > +       __u32   tp_numa_node;
> > +       __u32   tp_revision_id;
> > +       __u32   tp_num_total_qpairs;
> > +       __u32   tp_num_inuse_qpairs;
> > +       __u32   tp_num_rx_desc_fmt;
> > +       __u32   tp_num_tx_desc_fmt;
> > +       struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +       struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +};
> > +
> >  struct tpacket_stats {
> >         unsigned int    tp_packets;
> >         unsigned int    tp_drops;
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 6880f34..8cd17da 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
> >  static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
> >                 struct tpacket3_hdr *);
> >  static void packet_flush_mclist(struct sock *sk);
> > +static int umem_release(struct net_device *dev, struct packet_sock *po);
> > +static int get_umem_pages(struct tpacket_dma_mem_region *region,
> > +                         struct packet_umem_region *umem);
> >
> >  struct packet_skb_cb {
> >         unsigned int origlen;
> > @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
> >         sock_prot_inuse_add(net, sk->sk_prot, -1);
> >         preempt_enable();
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (dev) {
> > +                       dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +                       umem_release(dev, po);
> > +               }
> > +       }
> > +
> >         spin_lock(&po->bind_lock);
> >         unregister_prot_hook(sk, false);
> >         packet_cached_dev_reset(po);
> > @@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
> >         po->num = proto;
> >         po->xmit = dev_queue_xmit;
> >
> > +       INIT_LIST_HEAD(&po->umem_list);
> > +
> >         err = packet_alloc_pending(po);
> >         if (err)
> >                 goto out2;
> > @@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
> >  }
> >
> >  static int
> > +get_umem_pages(struct tpacket_dma_mem_region *region,
> > +              struct packet_umem_region *umem)
> > +{
> > +       struct page **page_list;
> > +       unsigned long npages;
> > +       unsigned long offset;
> > +       unsigned long base;
> > +       unsigned long i;
> > +       int ret;
> > +       dma_addr_t phys_base;
> > +
> > +       phys_base = (region->phys_addr) & PAGE_MASK;
> > +       base = ((unsigned long)region->addr) & PAGE_MASK;
> > +       offset = ((unsigned long)region->addr) & (~PAGE_MASK);
> > +       npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
> > +
> > +       npages = min_t(unsigned long, npages, umem->nents);
> > +       sg_init_table(umem->sglist, npages);
> > +
> > +       umem->nmap = 0;
> > +       page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > +       if (!page_list)
> > +               return -ENOMEM;
> > +
> > +       while (npages) {
> > +               unsigned long min = min_t(unsigned long, npages,
> > +                                         PAGE_SIZE / sizeof(struct page *));
> > +
> > +               ret = get_user_pages(current, current->mm, base, min,
> > +                                    1, 0, page_list, NULL);
> > +               if (ret < 0)
> > +                       break;
> > +
> > +               base += ret * PAGE_SIZE;
> > +               npages -= ret;
> > +
> > +               /* validate if the memory region is physically contiguous */
> > +               for (i = 0; i < ret; i++) {
> > +                       unsigned int page_index =
> > +                               (page_to_phys(page_list[i]) - phys_base) /
> > +                               PAGE_SIZE;
> > +
> > +                       if (page_index != umem->nmap + i) {
> > +                               int j;
> > +
> > +                               for (j = 0; j < (umem->nmap + i); j++)
> > +                                       put_page(sg_page(&umem->sglist[j]));
> > +
> > +                               free_page((unsigned long)page_list);
> > +                               return -EFAULT;
> > +                       }
> > +
> > +                       sg_set_page(&umem->sglist[umem->nmap + i],
> > +                                   page_list[i], PAGE_SIZE, 0);
> > +               }
> > +
> > +               umem->nmap += ret;
> > +       }
> > +
> > +       free_page((unsigned long)page_list);
> > +       return 0;
> > +}
> > +
> > +static int
> > +umem_release(struct net_device *dev, struct packet_sock *po)
> > +{
> > +       struct packet_umem_region *umem, *tmp;
> > +       int i;
> > +
> > +       list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int
> >  packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
> >  {
> >         struct sock *sk = sock->sk;
> > @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
> >                 po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
> >                 return 0;
> >         }
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
> > +                       return -EFAULT;
> > +
> > +               /* Only allow one set of queues to be owned by userspace */
> > +               if (po->tp_owns_queue_pairs)
> > +                       return -EBUSY;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  ops->ndo_split_queue_pairs(dev,
> > +                                                 qpairs.tp_qpairs_start_from,
> > +                                                 qpairs.tp_qpairs_num, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = true;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_RXTX_QPAIRS_RETURN:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_return_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = false;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region region;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               struct packet_umem_region *umem;
> > +               unsigned long npages;
> > +               unsigned long offset;
> > +               unsigned long i;
> > +               int err;
> > +
> > +               if (optlen != sizeof(region))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&region, optval, sizeof(region)))
> > +                       return -EFAULT;
> > +               if ((region.direction != DMA_BIDIRECTIONAL) &&
> > +                   (region.direction != DMA_TO_DEVICE) &&
> > +                   (region.direction != DMA_FROM_DEVICE))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               offset = ((unsigned long)region.addr) & (~PAGE_MASK);
> > +               npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
> > +
> > +               umem = vzalloc(sizeof(*umem) +
> > +                              sizeof(struct scatterlist) * npages);
> > +               if (!umem)
> > +                       return -ENOMEM;
> > +
> > +               umem->nents = npages;
> > +               umem->direction = region.direction;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               if (get_umem_pages(&region, umem) < 0) {
> > +                       ret = -EFAULT;
> > +                       goto exit;
> > +               }
> > +
> > +               if ((umem->nmap == npages) &&
> > +                   (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> > +                                    umem->nmap, region.direction))) {
> > +                       region.iova = sg_dma_address(umem->sglist) + offset;
> > +
> > +                       ops = dev->netdev_ops;
> > +                       if (!ops->ndo_validate_dma_mem_region_map) {
> > +                               ret = -EOPNOTSUPP;
> > +                               goto unmap;
> > +                       }
> > +
> > +                       /* use driver to validate mapping of dma memory */
> > +                       err = ops->ndo_validate_dma_mem_region_map(dev,
> > +                                                                  &region,
> > +                                                                  sk);
> > +                       if (!err) {
> > +                               list_add_tail(&umem->list, &po->umem_list);
> > +                               ret = 0;
> > +                               goto exit;
> > +                       }
> > +               }
> > +
> > +unmap:
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +exit:
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_RELEASE:
> > +       {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               ret = umem_release(dev, po);
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         case PACKET_QDISC_BYPASS:
> >                 val = packet_use_direct_xmit(po);
> >                 break;
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               int err;
> > +
> > +               if (len != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a successful queue pairs split-off
> > +                * operation via setsockopt()
> > +                */
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_get_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> > +                                       &qpairs_info.tp_qpairs_start_from,
> > +                                       &qpairs_info.tp_qpairs_num, sk);
> > +
> > +               lv = sizeof(qpairs_info);
> > +               data = &qpairs_info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_QPAIR_MAP_REGION_INFO:
> > +       {
> > +               struct tpacket_dev_qpair_map_region_info info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_get_device_qpair_map_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_qpair_map_region_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_DESC_INFO:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_info info;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_get_device_desc_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region info;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               if (!dev->netdev_ops->ndo_get_dma_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dma_mem_region);
> > +               data = &info;
> > +               break;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         return 0;
> >  }
> >
> > -
> >  static int packet_notifier(struct notifier_block *this,
> >                            unsigned long msg, void *ptr)
> >  {
> > @@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         struct packet_sock *po = pkt_sk(sk);
> >         unsigned long size, expected_size;
> >         struct packet_ring_buffer *rb;
> > +       const struct net_device_ops *ops;
> > +       struct net_device *dev;
> >         unsigned long start;
> >         int err = -EINVAL;
> >         int i;
> > @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> > +       dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +       if (!dev)
> > +               return -EINVAL;
> > +
> >         mutex_lock(&po->pg_vec_lock);
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               ops = dev->netdev_ops;
> > +               err = ops->ndo_direct_qpair_page_map(vma, dev);
> > +               if (err)
> > +                       goto out;
> > +               goto done;
> > +       }
> > +
> >         expected_size = 0;
> >         for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
> >                 if (rb->pg_vec) {
> > @@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >                 }
> >         }
> >
> > +done:
> >         atomic_inc(&po->mapped);
> >         vma->vm_ops = &packet_mmap_ops;
> >         err = 0;
> > diff --git a/net/packet/internal.h b/net/packet/internal.h
> > index cdddf6a..55d2fce 100644
> > --- a/net/packet/internal.h
> > +++ b/net/packet/internal.h
> > @@ -90,6 +90,14 @@ struct packet_fanout {
> >         struct packet_type      prot_hook ____cacheline_aligned_in_smp;
> >  };
> >
> > +struct packet_umem_region {
> > +       struct list_head        list;
> > +       int                     nents;
> > +       int                     nmap;
> > +       int                     direction;
> > +       struct scatterlist      sglist[0];
> > +};
> > +
> >  struct packet_sock {
> >         /* struct sock has to be the first member of packet_sock */
> >         struct sock             sk;
> > @@ -97,6 +105,7 @@ struct packet_sock {
> >         union  tpacket_stats_u  stats;
> >         struct packet_ring_buffer       rx_ring;
> >         struct packet_ring_buffer       tx_ring;
> > +       struct list_head        umem_list;
> >         int                     copy_thresh;
> >         spinlock_t              bind_lock;
> >         struct mutex            pg_vec_lock;
> > @@ -113,6 +122,7 @@ struct packet_sock {
> >         unsigned int            tp_reserve;
> >         unsigned int            tp_loss:1;
> >         unsigned int            tp_tx_has_off:1;
> > +       unsigned int            tp_owns_queue_pairs:1;
> >         unsigned int            tp_tstamp;
> >         struct net_device __rcu *cached_dev;
> >         int                     (*xmit)(struct sk_buff *skb);
> >
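
For readers following the get_umem_pages() arithmetic in the quoted patch
(offset within the first page, then PAGE_ALIGN(size + offset) >> PAGE_SHIFT),
here is a minimal userspace sketch of how the page count falls out of a
region's address and size. It assumes 4 KiB pages; the SKETCH_* macros and
region_npages() are hypothetical stand-ins written for this note, not the
kernel's own definitions.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes these from the architecture. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	((uintptr_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK	(~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_PAGE_ALIGN(x)	(((x) + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK)

/* Number of pages touched by the byte range [addr, addr + size). */
static unsigned long region_npages(uintptr_t addr, uintptr_t size)
{
	/* Bytes between the start of the first page and addr. */
	uintptr_t offset = addr & ~SKETCH_PAGE_MASK;

	/* Round the span up to whole pages, then convert bytes to pages. */
	return (unsigned long)(SKETCH_PAGE_ALIGN(size + offset) >>
			       SKETCH_PAGE_SHIFT);
}
```

Under these assumptions a 0x20-byte region starting at 0x1ff0 straddles a
page boundary and so needs two pages, which is why the offset has to be added
before rounding up.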

^ permalink raw reply

* Re: [PATCH 1/2] net/macb: Adding comments to various #defs to make interpretation easier
From: Brad Mouring @ 2015-01-14 15:11 UTC (permalink / raw)
  To: David Miller
  Cc: xander.huff, nicolas.ferre, jaeden.amero, rich.tollerton,
	ben.shelton, brad.mouring, netdev, linux-kernel
In-Reply-To: <20150114.002609.1903467089677912749.davem@davemloft.net>

On Wed, Jan 14, 2015 at 12:26:09AM -0500, David Miller wrote:
> From: Xander Huff <xander.huff@ni.com>
> Date: Tue, 13 Jan 2015 16:15:50 -0600
> 
> > This change is to help improve at-a-glance knowledge of the purpose of the
> > various Cadence MACB/GEM registers. Comments are more helpful for human
> > readability than short acronyms.
> > 
> > Describe various #define variables Cadence MACB/GEM registers as documented
> > in Xilinix's "Zynq-7000 All Programmable SoC TechnicalReference Manual, v1.9.1
s/Xilinix/Xilinx/. Sorry for the previous html-spam. Didn't follow the rule to always get coffee in the system prior to responding.
> > (UG-585)"
> > 
> > Signed-off-by: Xander Huff <xander.huff@ni.com>
> 
> Applied.

^ permalink raw reply

* Re: [PATCH 1/2] net/macb: Adding comments to various #defs to make interpretation easier
From: Nicolas Ferre @ 2015-01-14 15:10 UTC (permalink / raw)
  To: Xander Huff
  Cc: jaeden.amero, rich.tollerton, ben.shelton, brad.mouring, netdev,
	linux-kernel
In-Reply-To: <1421187351-27279-1-git-send-email-xander.huff@ni.com>

Le 13/01/2015 23:15, Xander Huff a écrit :
> This change is to help improve at-a-glance knowledge of the purpose of the
> various Cadence MACB/GEM registers. Comments are more helpful for human
> readability than short acronyms.
> 
> Describe various #define variables Cadence MACB/GEM registers as documented
> in Xilinx's "Zynq-7000 All Programmable SoC Technical Reference Manual, v1.9.1
> (UG-585)"
> 
> Signed-off-by: Xander Huff <xander.huff@ni.com>

For the record:
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>

> ---
>  drivers/net/ethernet/cadence/macb.h | 269 ++++++++++++++++++++++--------------
>  1 file changed, 162 insertions(+), 107 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 084191b..8e8c3c9 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -15,20 +15,20 @@
>  #define MACB_MAX_QUEUES 8
>  
>  /* MACB register offsets */
> -#define MACB_NCR				0x0000
> -#define MACB_NCFGR				0x0004
> -#define MACB_NSR				0x0008
> +#define MACB_NCR				0x0000 /* Network Control */
> +#define MACB_NCFGR				0x0004 /* Network Config */
> +#define MACB_NSR				0x0008 /* Network Status */
>  #define MACB_TAR				0x000c /* AT91RM9200 only */
>  #define MACB_TCR				0x0010 /* AT91RM9200 only */
> -#define MACB_TSR				0x0014
> -#define MACB_RBQP				0x0018
> -#define MACB_TBQP				0x001c
> -#define MACB_RSR				0x0020
> -#define MACB_ISR				0x0024
> -#define MACB_IER				0x0028
> -#define MACB_IDR				0x002c
> -#define MACB_IMR				0x0030
> -#define MACB_MAN				0x0034
> +#define MACB_TSR				0x0014 /* Transmit Status */
> +#define MACB_RBQP				0x0018 /* RX Q Base Address */
> +#define MACB_TBQP				0x001c /* TX Q Base Address */
> +#define MACB_RSR				0x0020 /* Receive Status */
> +#define MACB_ISR				0x0024 /* Interrupt Status */
> +#define MACB_IER				0x0028 /* Interrupt Enable */
> +#define MACB_IDR				0x002c /* Interrupt Disable */
> +#define MACB_IMR				0x0030 /* Interrupt Mask */
> +#define MACB_MAN				0x0034 /* PHY Maintenance */
>  #define MACB_PTR				0x0038
>  #define MACB_PFR				0x003c
>  #define MACB_FTO				0x0040
> @@ -68,27 +68,27 @@
>  #define MACB_MID				0x00fc
>  
>  /* GEM register offsets. */
> -#define GEM_NCFGR				0x0004
> -#define GEM_USRIO				0x000c
> -#define GEM_DMACFG				0x0010
> -#define GEM_HRB					0x0080
> -#define GEM_HRT					0x0084
> -#define GEM_SA1B				0x0088
> -#define GEM_SA1T				0x008C
> -#define GEM_SA2B				0x0090
> -#define GEM_SA2T				0x0094
> -#define GEM_SA3B				0x0098
> -#define GEM_SA3T				0x009C
> -#define GEM_SA4B				0x00A0
> -#define GEM_SA4T				0x00A4
> -#define GEM_OTX					0x0100
> -#define GEM_DCFG1				0x0280
> -#define GEM_DCFG2				0x0284
> -#define GEM_DCFG3				0x0288
> -#define GEM_DCFG4				0x028c
> -#define GEM_DCFG5				0x0290
> -#define GEM_DCFG6				0x0294
> -#define GEM_DCFG7				0x0298
> +#define GEM_NCFGR				0x0004 /* Network Config */
> +#define GEM_USRIO				0x000c /* User IO */
> +#define GEM_DMACFG				0x0010 /* DMA Configuration */
> +#define GEM_HRB					0x0080 /* Hash Bottom */
> +#define GEM_HRT					0x0084 /* Hash Top */
> +#define GEM_SA1B				0x0088 /* Specific1 Bottom */
> +#define GEM_SA1T				0x008C /* Specific1 Top */
> +#define GEM_SA2B				0x0090 /* Specific2 Bottom */
> +#define GEM_SA2T				0x0094 /* Specific2 Top */
> +#define GEM_SA3B				0x0098 /* Specific3 Bottom */
> +#define GEM_SA3T				0x009C /* Specific3 Top */
> +#define GEM_SA4B				0x00A0 /* Specific4 Bottom */
> +#define GEM_SA4T				0x00A4 /* Specific4 Top */
> +#define GEM_OTX					0x0100 /* Octets transmitted */
> +#define GEM_DCFG1				0x0280 /* Design Config 1 */
> +#define GEM_DCFG2				0x0284 /* Design Config 2 */
> +#define GEM_DCFG3				0x0288 /* Design Config 3 */
> +#define GEM_DCFG4				0x028c /* Design Config 4 */
> +#define GEM_DCFG5				0x0290 /* Design Config 5 */
> +#define GEM_DCFG6				0x0294 /* Design Config 6 */
> +#define GEM_DCFG7				0x0298 /* Design Config 7 */
>  
>  #define GEM_ISR(hw_q)				(0x0400 + ((hw_q) << 2))
>  #define GEM_TBQP(hw_q)				(0x0440 + ((hw_q) << 2))
> @@ -98,67 +98,73 @@
>  #define GEM_IMR(hw_q)				(0x0640 + ((hw_q) << 2))
>  
>  /* Bitfields in NCR */
> -#define MACB_LB_OFFSET				0
> +#define MACB_LB_OFFSET				0 /* reserved */
>  #define MACB_LB_SIZE				1
> -#define MACB_LLB_OFFSET				1
> +#define MACB_LLB_OFFSET				1 /* Loop back local */
>  #define MACB_LLB_SIZE				1
> -#define MACB_RE_OFFSET				2
> +#define MACB_RE_OFFSET				2 /* Receive enable */
>  #define MACB_RE_SIZE				1
> -#define MACB_TE_OFFSET				3
> +#define MACB_TE_OFFSET				3 /* Transmit enable */
>  #define MACB_TE_SIZE				1
> -#define MACB_MPE_OFFSET				4
> +#define MACB_MPE_OFFSET				4 /* Management port enable */
>  #define MACB_MPE_SIZE				1
> -#define MACB_CLRSTAT_OFFSET			5
> +#define MACB_CLRSTAT_OFFSET			5 /* Clear stats regs */
>  #define MACB_CLRSTAT_SIZE			1
> -#define MACB_INCSTAT_OFFSET			6
> +#define MACB_INCSTAT_OFFSET			6 /* Incremental stats regs */
>  #define MACB_INCSTAT_SIZE			1
> -#define MACB_WESTAT_OFFSET			7
> +#define MACB_WESTAT_OFFSET			7 /* Write enable stats regs */
>  #define MACB_WESTAT_SIZE			1
> -#define MACB_BP_OFFSET				8
> +#define MACB_BP_OFFSET				8 /* Back pressure */
>  #define MACB_BP_SIZE				1
> -#define MACB_TSTART_OFFSET			9
> +#define MACB_TSTART_OFFSET			9 /* Start transmission */
>  #define MACB_TSTART_SIZE			1
> -#define MACB_THALT_OFFSET			10
> +#define MACB_THALT_OFFSET			10 /* Transmit halt */
>  #define MACB_THALT_SIZE				1
> -#define MACB_NCR_TPF_OFFSET			11
> +#define MACB_NCR_TPF_OFFSET			11 /* Transmit pause frame */
>  #define MACB_NCR_TPF_SIZE			1
> -#define MACB_TZQ_OFFSET				12
> +#define MACB_TZQ_OFFSET				12 /* Transmit zero quantum
> +						    * pause frame
> +						    */
>  #define MACB_TZQ_SIZE				1
>  
>  /* Bitfields in NCFGR */
> -#define MACB_SPD_OFFSET				0
> +#define MACB_SPD_OFFSET				0 /* Speed */
>  #define MACB_SPD_SIZE				1
> -#define MACB_FD_OFFSET				1
> +#define MACB_FD_OFFSET				1 /* Full duplex */
>  #define MACB_FD_SIZE				1
> -#define MACB_BIT_RATE_OFFSET			2
> +#define MACB_BIT_RATE_OFFSET			2 /* Discard non-VLAN frames */
>  #define MACB_BIT_RATE_SIZE			1
> -#define MACB_JFRAME_OFFSET			3
> +#define MACB_JFRAME_OFFSET			3 /* reserved */
>  #define MACB_JFRAME_SIZE			1
> -#define MACB_CAF_OFFSET				4
> +#define MACB_CAF_OFFSET				4 /* Copy all frames */
>  #define MACB_CAF_SIZE				1
> -#define MACB_NBC_OFFSET				5
> +#define MACB_NBC_OFFSET				5 /* No broadcast */
>  #define MACB_NBC_SIZE				1
> -#define MACB_NCFGR_MTI_OFFSET			6
> +#define MACB_NCFGR_MTI_OFFSET			6 /* Multicast hash enable */
>  #define MACB_NCFGR_MTI_SIZE			1
> -#define MACB_UNI_OFFSET				7
> +#define MACB_UNI_OFFSET				7 /* Unicast hash enable */
>  #define MACB_UNI_SIZE				1
> -#define MACB_BIG_OFFSET				8
> +#define MACB_BIG_OFFSET				8 /* Receive 1536 byte frames */
>  #define MACB_BIG_SIZE				1
> -#define MACB_EAE_OFFSET				9
> +#define MACB_EAE_OFFSET				9 /* External address match
> +						   * enable
> +						   */
>  #define MACB_EAE_SIZE				1
>  #define MACB_CLK_OFFSET				10
>  #define MACB_CLK_SIZE				2
> -#define MACB_RTY_OFFSET				12
> +#define MACB_RTY_OFFSET				12 /* Retry test */
>  #define MACB_RTY_SIZE				1
> -#define MACB_PAE_OFFSET				13
> +#define MACB_PAE_OFFSET				13 /* Pause enable */
>  #define MACB_PAE_SIZE				1
>  #define MACB_RM9200_RMII_OFFSET			13 /* AT91RM9200 only */
>  #define MACB_RM9200_RMII_SIZE			1  /* AT91RM9200 only */
> -#define MACB_RBOF_OFFSET			14
> +#define MACB_RBOF_OFFSET			14 /* Receive buffer offset */
>  #define MACB_RBOF_SIZE				2
> -#define MACB_RLCE_OFFSET			16
> +#define MACB_RLCE_OFFSET			16 /* Length field error frame
> +						    * discard
> +						    */
>  #define MACB_RLCE_SIZE				1
> -#define MACB_DRFCS_OFFSET			17
> +#define MACB_DRFCS_OFFSET			17 /* FCS remove */
>  #define MACB_DRFCS_SIZE				1
>  #define MACB_EFRHD_OFFSET			18
>  #define MACB_EFRHD_SIZE				1
> @@ -166,111 +172,160 @@
>  #define MACB_IRXFCS_SIZE			1
>  
>  /* GEM specific NCFGR bitfields. */
> -#define GEM_GBE_OFFSET				10
> +#define GEM_GBE_OFFSET				10 /* Gigabit mode enable */
>  #define GEM_GBE_SIZE				1
> -#define GEM_CLK_OFFSET				18
> +#define GEM_CLK_OFFSET				18 /* MDC clock division */
>  #define GEM_CLK_SIZE				3
> -#define GEM_DBW_OFFSET				21
> +#define GEM_DBW_OFFSET				21 /* Data bus width */
>  #define GEM_DBW_SIZE				2
>  #define GEM_RXCOEN_OFFSET			24
>  #define GEM_RXCOEN_SIZE				1
>  
>  /* Constants for data bus width. */
> -#define GEM_DBW32				0
> -#define GEM_DBW64				1
> -#define GEM_DBW128				2
> +#define GEM_DBW32				0 /* 32 bit AMBA AHB data bus
> +						   * width
> +						   */
> +#define GEM_DBW64				1 /* 64 bit AMBA AHB data bus
> +						   * width
> +						   */
> +#define GEM_DBW128				2 /* 128 bit AMBA AHB data bus
> +						   * width
> +						   */
>  
>  /* Bitfields in DMACFG. */
> -#define GEM_FBLDO_OFFSET			0
> +#define GEM_FBLDO_OFFSET			0 /* AHB fixed burst length for
> +						   * DMA data operations
> +						   */
>  #define GEM_FBLDO_SIZE				5
> -#define GEM_ENDIA_OFFSET			7
> +#define GEM_ENDIA_OFFSET			7 /* AHB endian swap mode enable
> +						   * for packet data accesses
> +						   */
>  #define GEM_ENDIA_SIZE				1
> -#define GEM_RXBMS_OFFSET			8
> +#define GEM_RXBMS_OFFSET			8 /* Receiver packet buffer
> +						   * memory size select
> +						   */
>  #define GEM_RXBMS_SIZE				2
> -#define GEM_TXPBMS_OFFSET			10
> +#define GEM_TXPBMS_OFFSET			10 /* Transmitter packet buffer
> +						    * memory size select
> +						    */
>  #define GEM_TXPBMS_SIZE				1
> -#define GEM_TXCOEN_OFFSET			11
> +#define GEM_TXCOEN_OFFSET			11 /* Transmitter IP, TCP and
> +						    * UDP checksum generation
> +						    * offload enable
> +						    */
>  #define GEM_TXCOEN_SIZE				1
> -#define GEM_RXBS_OFFSET				16
> +#define GEM_RXBS_OFFSET				16 /* DMA receive buffer size in
> +						    * AHB system memory
> +						    */
>  #define GEM_RXBS_SIZE				8
> -#define GEM_DDRP_OFFSET				24
> +#define GEM_DDRP_OFFSET				24 /* disc_when_no_ahb */
>  #define GEM_DDRP_SIZE				1
>  
>  
>  /* Bitfields in NSR */
> -#define MACB_NSR_LINK_OFFSET			0
> +#define MACB_NSR_LINK_OFFSET			0 /* pcs_link_state */
>  #define MACB_NSR_LINK_SIZE			1
> -#define MACB_MDIO_OFFSET			1
> +#define MACB_MDIO_OFFSET			1 /* status of the mdio_in
> +						   * pin
> +						   */
>  #define MACB_MDIO_SIZE				1
> -#define MACB_IDLE_OFFSET			2
> +#define MACB_IDLE_OFFSET			2 /* The PHY management logic is
> +						   * idle (i.e. has completed)
> +						   */
>  #define MACB_IDLE_SIZE				1
>  
>  /* Bitfields in TSR */
> -#define MACB_UBR_OFFSET				0
> +#define MACB_UBR_OFFSET				0 /* Used bit read */
>  #define MACB_UBR_SIZE				1
> -#define MACB_COL_OFFSET				1
> +#define MACB_COL_OFFSET				1 /* Collision occurred */
>  #define MACB_COL_SIZE				1
> -#define MACB_TSR_RLE_OFFSET			2
> +#define MACB_TSR_RLE_OFFSET			2 /* Retry limit exceeded */
>  #define MACB_TSR_RLE_SIZE			1
> -#define MACB_TGO_OFFSET				3
> +#define MACB_TGO_OFFSET				3 /* Transmit go */
>  #define MACB_TGO_SIZE				1
> -#define MACB_BEX_OFFSET				4
> +#define MACB_BEX_OFFSET				4 /* Transmit frame corruption
> +						   * due to AHB error
> +						   */
>  #define MACB_BEX_SIZE				1
>  #define MACB_RM9200_BNQ_OFFSET			4 /* AT91RM9200 only */
>  #define MACB_RM9200_BNQ_SIZE			1 /* AT91RM9200 only */
> -#define MACB_COMP_OFFSET			5
> +#define MACB_COMP_OFFSET			5 /* Transmit complete */
>  #define MACB_COMP_SIZE				1
> -#define MACB_UND_OFFSET				6
> +#define MACB_UND_OFFSET				6 /* Transmit under run */
>  #define MACB_UND_SIZE				1
>  
>  /* Bitfields in RSR */
> -#define MACB_BNA_OFFSET				0
> +#define MACB_BNA_OFFSET				0 /* Buffer not available */
>  #define MACB_BNA_SIZE				1
> -#define MACB_REC_OFFSET				1
> +#define MACB_REC_OFFSET				1 /* Frame received */
>  #define MACB_REC_SIZE				1
> -#define MACB_OVR_OFFSET				2
> +#define MACB_OVR_OFFSET				2 /* Receive overrun */
>  #define MACB_OVR_SIZE				1
>  
>  /* Bitfields in ISR/IER/IDR/IMR */
> -#define MACB_MFD_OFFSET				0
> +#define MACB_MFD_OFFSET				0 /* Management frame sent */
>  #define MACB_MFD_SIZE				1
> -#define MACB_RCOMP_OFFSET			1
> +#define MACB_RCOMP_OFFSET			1 /* Receive complete */
>  #define MACB_RCOMP_SIZE				1
> -#define MACB_RXUBR_OFFSET			2
> +#define MACB_RXUBR_OFFSET			2 /* RX used bit read */
>  #define MACB_RXUBR_SIZE				1
> -#define MACB_TXUBR_OFFSET			3
> +#define MACB_TXUBR_OFFSET			3 /* TX used bit read */
>  #define MACB_TXUBR_SIZE				1
> -#define MACB_ISR_TUND_OFFSET			4
> +#define MACB_ISR_TUND_OFFSET			4 /* Enable transmit buffer
> +						   * under run interrupt
> +						   */
>  #define MACB_ISR_TUND_SIZE			1
> -#define MACB_ISR_RLE_OFFSET			5
> +#define MACB_ISR_RLE_OFFSET			5 /* Enable retry limit exceeded
> +						   * or late collision interrupt
> +						   */
>  #define MACB_ISR_RLE_SIZE			1
> -#define MACB_TXERR_OFFSET			6
> +#define MACB_TXERR_OFFSET			6 /* Enable transmit frame
> +						   * corruption due to AHB error
> +						   * interrupt
> +						   */
>  #define MACB_TXERR_SIZE				1
> -#define MACB_TCOMP_OFFSET			7
> +#define MACB_TCOMP_OFFSET			7 /* Enable transmit complete
> +						   * interrupt
> +						   */
>  #define MACB_TCOMP_SIZE				1
> -#define MACB_ISR_LINK_OFFSET			9
> +#define MACB_ISR_LINK_OFFSET			9 /* Enable link change
> +						   * interrupt
> +						   */
>  #define MACB_ISR_LINK_SIZE			1
> -#define MACB_ISR_ROVR_OFFSET			10
> +#define MACB_ISR_ROVR_OFFSET			10 /* Enable receive overrun
> +						    * interrupt
> +						    */
>  #define MACB_ISR_ROVR_SIZE			1
> -#define MACB_HRESP_OFFSET			11
> +#define MACB_HRESP_OFFSET			11 /* Enable hresp not OK
> +						    * interrupt
> +						    */
>  #define MACB_HRESP_SIZE				1
> -#define MACB_PFR_OFFSET				12
> +#define MACB_PFR_OFFSET				12 /* Enable pause frame with
> +						    * non-zero pause quantum
> +						    * interrupt
> +						    */
>  #define MACB_PFR_SIZE				1
> -#define MACB_PTZ_OFFSET				13
> +#define MACB_PTZ_OFFSET				13 /* Enable pause time zero
> +						    * interrupt
> +						    */
>  #define MACB_PTZ_SIZE				1
>  
>  /* Bitfields in MAN */
> -#define MACB_DATA_OFFSET			0
> +#define MACB_DATA_OFFSET			0 /* data */
>  #define MACB_DATA_SIZE				16
> -#define MACB_CODE_OFFSET			16
> +#define MACB_CODE_OFFSET			16 /* Must be written to 10 */
>  #define MACB_CODE_SIZE				2
> -#define MACB_REGA_OFFSET			18
> +#define MACB_REGA_OFFSET			18 /* Register address */
>  #define MACB_REGA_SIZE				5
> -#define MACB_PHYA_OFFSET			23
> +#define MACB_PHYA_OFFSET			23 /* PHY address */
>  #define MACB_PHYA_SIZE				5
> -#define MACB_RW_OFFSET				28
> +#define MACB_RW_OFFSET				28 /* Operation. 10 is read. 01
> +						    * is write.
> +						    */
>  #define MACB_RW_SIZE				2
> -#define MACB_SOF_OFFSET				30
> +#define MACB_SOF_OFFSET				30 /* Must be written to 1 for
> +						    * Clause 22 operation
> +						    */
>  #define MACB_SOF_SIZE				2
>  
>  /* Bitfields in USRIO (AVR32) */
> @@ -286,7 +341,7 @@
>  /* Bitfields in USRIO (AT91) */
>  #define MACB_RMII_OFFSET			0
>  #define MACB_RMII_SIZE				1
> -#define GEM_RGMII_OFFSET			0	/* GEM gigabit mode */
> +#define GEM_RGMII_OFFSET			0 /* GEM gigabit mode */
>  #define GEM_RGMII_SIZE				1
>  #define MACB_CLKEN_OFFSET			1
>  #define MACB_CLKEN_SIZE				1
> 
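
For readers following the MAN bitfield comments, here is a minimal
user-space sketch of how the OFFSET/SIZE pairs combine into one Clause 22
management frame word. The DEMO_* names, DEMO_BF() and both helper
functions are illustrative assumptions and not part of the patch; only
the offsets, sizes and the "must be written" values are taken from the
comments in the hunk above.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout as documented in the "Bitfields in MAN" comments. */
#define DEMO_DATA_OFFSET	0
#define DEMO_DATA_SIZE		16
#define DEMO_CODE_OFFSET	16
#define DEMO_CODE_SIZE		2
#define DEMO_REGA_OFFSET	18
#define DEMO_REGA_SIZE		5
#define DEMO_PHYA_OFFSET	23
#define DEMO_PHYA_SIZE		5
#define DEMO_RW_OFFSET		28
#define DEMO_RW_SIZE		2
#define DEMO_SOF_OFFSET		30
#define DEMO_SOF_SIZE		2

/* Hypothetical helper: mask a value to the field width, then shift it
 * to the field offset.
 */
#define DEMO_BF(name, value)						\
	(((uint32_t)(value) & ((1u << DEMO_##name##_SIZE) - 1u))	\
	 << DEMO_##name##_OFFSET)

/* Build a Clause 22 read frame: SOF = 01, RW = 10, CODE = 10. */
uint32_t demo_man_read(unsigned int phy, unsigned int reg)
{
	assert(phy < 32 && reg < 32);	/* both addresses are 5 bits wide */
	return DEMO_BF(SOF, 1) | DEMO_BF(RW, 2) | DEMO_BF(PHYA, phy) |
	       DEMO_BF(REGA, reg) | DEMO_BF(CODE, 2);
}

/* Build a Clause 22 write frame: SOF = 01, RW = 01, CODE = 10. */
uint32_t demo_man_write(unsigned int phy, unsigned int reg, uint16_t data)
{
	assert(phy < 32 && reg < 32);
	return DEMO_BF(SOF, 1) | DEMO_BF(RW, 1) | DEMO_BF(PHYA, phy) |
	       DEMO_BF(REGA, reg) | DEMO_BF(CODE, 2) | DEMO_BF(DATA, data);
}
```

With this layout, demo_man_read(1, 2) yields 0x608A0000 and
demo_man_write(1, 0, 0x1200) yields 0x50821200.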


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [REMINDER] iproute2 ss: Some thoughts about additional info output layout
From: Vadim Kochan @ 2015-01-14 14:57 UTC (permalink / raw)
  To: netdev; +Cc: vadim4j
In-Reply-To: <20150108221240.GA23636@angus-think.lan>

On Fri, Jan 09, 2015 at 12:12:40AM +0200, Vadim Kochan wrote:
> Hi,
> 
> I think the current output of the ss utility looks a little weird
> when additional info options are specified. I expect many of you
> will say that changing it will break existing scripts that parse ss
> output, but I will try anyway. This is how I think the ss output would
> look better when additional info options are specified (I am not sure
> how it will look in the email):
> 
> 
> u_str  ESTAB      0      0                                                   * 14875                                                             * 14876
> users:("terminator",pid=30585,fd=8),
>       ("terminator",pid=29450,fd=8),
>       ("firefox",pid=21863,fd=8),
>       ("terminator",pid=18927,fd=8),
>       ("terminator",pid=17955,fd=8),
>       ("terminator",pid=9620,fd=8),
>       ("terminator",pid=7894,fd=8),
>       ("terminator",pid=7245,fd=8),
>       ("terminator",pid=2542,fd=8),
>       ("qtile",pid=654,fd=8)
> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0) <-> 
> 
> u_str  ESTAB      0      0                                                   * 7550221                                                           * 7550222      
> users:("terminator",pid=9620,fd=12)
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0) <-> 
> 
> udp    ESTAB      0      0                                           10.3.5.18:ipproto-58744                                           192.168.1.1:swipe        
> users:("firefox",pid=21863,fd=54) 
> uid:1000 ino:7328962 sk:ffff880212f77bc0 <->
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0)
> 
> udp    ESTAB      0      0                           fe80::6e88:14ff:feac:51e4:ipproto-33582                                     fd6b:a256:8c2e::1:swipe        
> users:("firefox",pid=21863,fd=62) 
> uid:1000 ino:7328974 sk:ffff880205a76e80 <->
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0)
> 
> tcp    ESTAB      0      0                                       192.168.1.198:57851                                               173.194.113.205:https        
> users:("firefox",pid=21863,fd=67) 
> uid:1000 ino:7601516 sk:ffff880102e94d00 <->
> skmem:(r0,rb372480,t0,tb87040,f0,w0,o0,bl0)
> ts sack cubic
> wscale:7,11
> rto:253.333 rtt:50.609/18.209 ato:40
> mss:1448 cwnd:10
> send 2.3Mbps lastsnd:46636 lastrcv:46570 lastack:46593 pacing_rate 4.6Mbps rcv_rtt:73.333 rcv_space:42340
> 
> tcp    ESTAB      0      0                                       192.168.1.198:47385                                                74.125.143.104:https        
> users:("firefox",pid=21863,fd=63)
> uid:1000 ino:7602727 sk:ffff880007a1db00 <->
> skmem:(r0,rb372480,t0,tb87040,f0,w0,o0,bl0)
> ts sack cubic wscale:7,11 rto:296.666 rtt:95.496/22.756 ato:40
> mss:1418 cwnd:10
> send 1.2Mbps lastsnd:58640 lastrcv:58550 lastack:58550 pacing_rate 2.4Mbps rcv_rtt:96.666 rcv_space:42340
> 
> Regards,
> Vadim Kochan

Just a reminder, in case this was missed among a lot of emails.

Thanks,

^ permalink raw reply

* Re: [net-next PATCH v2 00/12] Flow API
From: Or Gerlitz @ 2015-01-14 15:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, simon.horman, Scott Feldman, Linux Netdev List,
	Jamal Hadi Salim, Andy Gospodarek, David Miller,
	Alexei Starovoitov
In-Reply-To: <54B680EB.8040404@gmail.com>

On Wed, Jan 14, 2015 at 4:44 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 01/13/2015 10:29 PM, Or Gerlitz wrote:
>>
>> On Tue, Jan 13, 2015 at 11:35 PM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>>
>>> I tried to roll in all the feedback from v1 into this series annotated
>>> here,

>> Can you please drop V2 to your
>> https://github.com/jrfastab/rocker-net-next.git tree?

> Should be there now, but as Alexei noted I seem to have missed updating
> the patch commit messages so they still reference the old names.

Oh, I see it now.. yep, with the old names and dated Dec 30th...

^ permalink raw reply

* Re: [net-next PATCH v2 02/12] net: flow_table: add flow, delete flow
From: John Fastabend @ 2015-01-14 14:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, simon.horman, Scott Feldman, netdev@vger.kernel.org,
	gerlitz.or@gmail.com, Jamal Hadi Salim, Andy Gospodarek,
	David S. Miller
In-Reply-To: <CAADnVQK0FqxD9iu5UzFu0Pizi9tQPgjmAs3WBbV7y8jk0Sdi5w@mail.gmail.com>

On 01/13/2015 03:00 PM, Alexei Starovoitov wrote:
> On Tue, Jan 13, 2015 at 1:35 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> Now that the device capabilities are exposed we can add support to
>> add and delete flows from the tables.
>>
>> The two operations are
>>
>> table_set_flows :
>>
>>    The set flow operations is used to program a set of flows into a
>>    hardware device table. The message is consumed via netlink encoded
>
> should now netlink cmd be called table_set_rules ?
> and s/flow/rule/ everywhere in commit log?
>
>>    message which is then decoded into a null terminated  array of
>>    flow entry structures. A flow entry structure is defined as
>>
>>       struct net_flow_flow {
>
> commit log no longer matches implementation ;)
> should be net_flow_rule ?
>

Oops, I guess I'll update it after waiting a bit for more feedback.

> can you update your .html writeup?

Took a quick scan at this; I think I caught most cases and some typos
as well.

> I hope to see more real examples in there.

Sure, I'll put together some more interesting examples in the next
day or so.

>
> btw how the whole thing will work with queue splitting from
> your other patch?
>

If one of the actions supported by the device is forward_to_queue() or
forward_to_socket() we can use the API to steer potentially interesting
packets to a user space application for processing.

Going the other way applications could tell the hardware to
drop/mangle/fwd packets. At some point I thought it would be interesting
to use both the flow API here and the queue splitting from the
other patch with a tool like Suricata.

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* RE: [PATCH v2 1/1] atm: remove deprecated use of pci api
From: David Laight @ 2015-01-14 14:47 UTC (permalink / raw)
  To: 'David Miller', lambert.quentin@gmail.com
  Cc: chas@cmf.nrl.navy.mil, linux-atm-general@lists.sourceforge.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20150113.215944.1703977685948196746.davem@davemloft.net>

From: David Miller
> From: Quentin Lambert <lambert.quentin@gmail.com>
> Date: Mon, 12 Jan 2015 17:10:42 +0100
> 
> > @@ -2246,7 +2246,8 @@ static int eni_init_one(struct pci_dev *pci_dev,
> >  		goto err_disable;
> >
> >  	zero = &eni_dev->zero;
> > -	zero->addr = pci_alloc_consistent(pci_dev, ENI_ZEROES_SIZE, &zero->dma);
> > +	zero->addr = dma_alloc_coherent(&pci_dev->dev, ENI_ZEROES_SIZE,
> > +					&zero->dma, GFP_ATOMIC);
> >  	if (!zero->addr)
> >  		goto err_kfree;
> >
> 
> I really would like you to look at these locations and see if
> GFP_KERNEL can be used instead of GFP_ATOMIC.  I bet that nearly
> all of these can, and it is preferred.

And there isn't much point inlining the wrapper until that has been done.
Not only that, the corresponding pci_free_consistent() calls need changing
at (much) the same time.

	David

^ permalink raw reply

* Re: [net-next PATCH v2 00/12] Flow API
From: John Fastabend @ 2015-01-14 14:44 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Thomas Graf, simon.horman, Scott Feldman, Linux Netdev List,
	Jamal Hadi Salim, Andy Gospodarek, David Miller,
	Alexei Starovoitov
In-Reply-To: <CAJ3xEMhSkHMTqvvaP2jS5Q8dh3FgqSt2JnQc+WuPeYwh0P5dkw@mail.gmail.com>

On 01/13/2015 10:29 PM, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 11:35 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> I tried to roll in all the feedback from v1 into this series annotated
>> here,
>
> Hi John,
>
> Can you please drop V2 to your
> https://github.com/jrfastab/rocker-net-next.git tree?
>
> thanks,
>
> Or.
>

Should be there now, but as Alexei noted I seem to have missed updating
the patch commit messages so they still reference the old names.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* [PATCH] ixgbe: Re-enable relaxed ordering as part of init/restart sequence for non-DCA config
From: Sowmini Varadhan @ 2015-01-14 14:40 UTC (permalink / raw)
  To: sowmini.varadhan, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, carolyn.wyborny, donald.c.skidmore, gregory.v.rose,
	matthew.vick, john.ronciak, mitch.a.williams
  Cc: linux.nics, e1000-devel, netdev, linux-kernel, sparclinux,
	emil.s.tantilov


Relaxed ordering (RO) is disabled by default at driver initialization
and re-enabled when DCA is used. It was disabled because of an issue
on some chipsets (see comments in ixgbe_update_tx_dca()).
But when DCA is not used, RO needs to be re-enabled, else we have
a serialization bottleneck on platforms like SPARC.

This patch eliminates the bottleneck for ixgbe when DCA is not configured.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: Emil Tantilov <emil.s.tantilov@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c  |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c |   20 ++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.h |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   11 +++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h   |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c   |    1 +
 6 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
index c5c97b4..85c7a28 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
@@ -1161,6 +1161,7 @@ static struct ixgbe_mac_operations mac_ops_82598 = {
 	.clear_hw_cntrs		= &ixgbe_clear_hw_cntrs_generic,
 	.get_media_type		= &ixgbe_get_media_type_82598,
 	.enable_rx_dma          = &ixgbe_enable_rx_dma_generic,
+	.enable_relaxed_ordering = &ixgbe_enable_relaxed_ordering,
 	.get_mac_addr		= &ixgbe_get_mac_addr_generic,
 	.stop_adapter		= &ixgbe_stop_adapter_generic,
 	.get_bus_info           = &ixgbe_get_bus_info_generic,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index 9c66bab..4453d92 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -338,6 +338,26 @@ s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw)
 	return 0;
 }
 
+void ixgbe_enable_relaxed_ordering(struct ixgbe_hw *hw)
+{
+	u32 i;
+	u32 regval;
+
+	/* Enable relaxed ordering */
+	for (i = 0; i < hw->mac.max_tx_queues; i++) {
+		regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL_82599(i));
+		regval |= IXGBE_DCA_TXCTRL_DESC_WRO_EN;
+		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL_82599(i), regval);
+	}
+
+	for (i = 0; i < hw->mac.max_rx_queues; i++) {
+		regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
+		regval |= (IXGBE_DCA_RXCTRL_DATA_WRO_EN |
+			    IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
+		IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
+	}
+}
+
 /**
  *  ixgbe_init_hw_generic - Generic hardware initialization
  *  @hw: pointer to hardware structure
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h
index 8cfadcb..c399c18 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h
@@ -37,6 +37,7 @@ s32 ixgbe_init_ops_generic(struct ixgbe_hw *hw);
 s32 ixgbe_init_hw_generic(struct ixgbe_hw *hw);
 s32 ixgbe_start_hw_generic(struct ixgbe_hw *hw);
 s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw);
+void ixgbe_enable_relaxed_ordering(struct ixgbe_hw *hw);
 s32 ixgbe_clear_hw_cntrs_generic(struct ixgbe_hw *hw);
 s32 ixgbe_read_pba_string_generic(struct ixgbe_hw *hw, u8 *pba_num,
 				  u32 pba_num_size);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2ed2c7d..dec815b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4898,6 +4898,12 @@ void ixgbe_reset(struct ixgbe_adapter *adapter)
 
 	if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
 		ixgbe_ptp_reset(adapter);
+
+#ifndef CONFIG_IXGBE_DCA
+	/* ixgbe_update_tx_dca() will re-enable RO when DCA is enabled */
+	if (hw->mac.ops.enable_relaxed_ordering)
+		hw->mac.ops.enable_relaxed_ordering(hw);
+#endif
 }
 
 /**
@@ -8470,6 +8476,11 @@ skip_sriov:
 			   "representative who provided you with this "
 			   "hardware.\n");
 	}
+#ifndef CONFIG_IXGBE_DCA
+	/* ixgbe_update_tx_dca() will re-enable RO when DCA is enabled */
+	if (hw->mac.ops.enable_relaxed_ordering)
+		hw->mac.ops.enable_relaxed_ordering(hw);
+#endif
 	strcpy(netdev->name, "eth%d");
 	err = register_netdev(netdev);
 	if (err)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index d101b25..b967241 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3013,6 +3013,7 @@ struct ixgbe_mac_operations {
 	void (*release_swfw_sync)(struct ixgbe_hw *, u32);
 	s32 (*prot_autoc_read)(struct ixgbe_hw *, bool *, u32 *);
 	s32 (*prot_autoc_write)(struct ixgbe_hw *, u32, bool);
+	void (*enable_relaxed_ordering)(struct ixgbe_hw *);
 
 	/* Link */
 	void (*disable_tx_laser)(struct ixgbe_hw *);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
index ba54ff0..88adad2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
@@ -781,6 +781,7 @@ static struct ixgbe_mac_operations mac_ops_X540 = {
 	.clear_hw_cntrs         = &ixgbe_clear_hw_cntrs_generic,
 	.get_media_type         = &ixgbe_get_media_type_X540,
 	.enable_rx_dma          = &ixgbe_enable_rx_dma_generic,
+	.enable_relaxed_ordering = &ixgbe_enable_relaxed_ordering,
 	.get_mac_addr           = &ixgbe_get_mac_addr_generic,
 	.get_san_mac_addr       = &ixgbe_get_san_mac_addr_generic,
 	.get_device_caps        = &ixgbe_get_device_caps_generic,
-- 
1.7.1
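
The core of ixgbe_enable_relaxed_ordering() is a per-queue
read-modify-write that sets the write-relaxed-ordering bits while
leaving the other control bits untouched. A minimal user-space sketch
of that pattern follows. The register bank is modeled as plain arrays,
and the DEMO_* names and bit positions are placeholders chosen for
illustration, not values from the ixgbe headers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_QUEUES		4

/* Placeholder bit positions, assumed for this sketch only. */
#define DEMO_TX_DESC_WRO_EN	(1u << 11)
#define DEMO_RX_DATA_WRO_EN	(1u << 13)
#define DEMO_RX_HEAD_WRO_EN	(1u << 15)

/* Stand-in for the device: one control word per queue. */
struct demo_hw {
	uint32_t txctrl[DEMO_MAX_QUEUES];
	uint32_t rxctrl[DEMO_MAX_QUEUES];
	unsigned int max_tx_queues;
	unsigned int max_rx_queues;
};

/* Set the relaxed-ordering bits on every active queue and preserve
 * whatever else is already set in each control word.
 */
void demo_enable_relaxed_ordering(struct demo_hw *hw)
{
	unsigned int i;
	uint32_t regval;

	assert(hw->max_tx_queues <= DEMO_MAX_QUEUES);
	assert(hw->max_rx_queues <= DEMO_MAX_QUEUES);

	for (i = 0; i < hw->max_tx_queues; i++) {
		regval = hw->txctrl[i];			/* read */
		regval |= DEMO_TX_DESC_WRO_EN;		/* modify */
		hw->txctrl[i] = regval;			/* write */
	}

	for (i = 0; i < hw->max_rx_queues; i++) {
		regval = hw->rxctrl[i];
		regval |= DEMO_RX_DATA_WRO_EN | DEMO_RX_HEAD_WRO_EN;
		hw->rxctrl[i] = regval;
	}
}
```

Queues beyond max_tx_queues/max_rx_queues are left alone, which mirrors
the loop bounds used in the patch.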

^ permalink raw reply related

* Re: [PATCH v2 1/1] atm: remove deprecated use of pci api
From: Quentin Lambert @ 2015-01-14 14:25 UTC (permalink / raw)
  To: chas williams - CONTRACTOR, David Miller
  Cc: linux-atm-general, netdev, linux-kernel
In-Reply-To: <20150114085851.234db50d@thirdoffive.cmf.nrl.navy.mil>


On 14/01/2015 14:58, chas williams - CONTRACTOR wrote:
> I think I would like to go through and just fix all the usages of the
> older pci interface.  This patch isn't very complete due to its
> automated nature.
>
> I will make some time this weekend.
It was my original intent to produce a complete patch; I obviously
misused the tool and forgot a significant number of cases.

But I am ok with fixing that and submitting a complete version of this
patch if you like.

^ permalink raw reply

* Re: [PATCH v2 1/1] atm: remove deprecated use of pci api
From: chas williams - CONTRACTOR @ 2015-01-14 13:58 UTC (permalink / raw)
  To: David Miller; +Cc: lambert.quentin, linux-atm-general, netdev, linux-kernel
In-Reply-To: <20150113.215944.1703977685948196746.davem@davemloft.net>

On Tue, 13 Jan 2015 21:59:44 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Quentin Lambert <lambert.quentin@gmail.com>
> Date: Mon, 12 Jan 2015 17:10:42 +0100
> 
> > @@ -2246,7 +2246,8 @@ static int eni_init_one(struct pci_dev *pci_dev,
> >  		goto err_disable;
> >  
> >  	zero = &eni_dev->zero;
> > -	zero->addr = pci_alloc_consistent(pci_dev, ENI_ZEROES_SIZE, &zero->dma);
> > +	zero->addr = dma_alloc_coherent(&pci_dev->dev, ENI_ZEROES_SIZE,
> > +					&zero->dma, GFP_ATOMIC);
> >  	if (!zero->addr)
> >  		goto err_kfree;
> >  
> 
> I really would like you to look at these locations and see if
> GFP_KERNEL can be used instead of GFP_ATOMIC.  I bet that nearly
> all of these can, and it is preferred.
> 
> Thanks.

I think I would like to go through and just fix all the usages of the
older pci interface.  This patch isn't very complete due to its
automated nature.

I will make some time this weekend.

^ permalink raw reply

* [PATCH net] openvswitch: packet messages need their own probe attribute
From: Thomas Graf @ 2015-01-14 13:56 UTC (permalink / raw)
  To: Florian Westphal, davem, Sander Eikelenboom
  Cc: Pravin B Shelar, netdev, dev, jrajahalme
In-Reply-To: <20150114130351.GB13716@breakpoint.cc>

User space is currently sending an OVS_FLOW_ATTR_PROBE for both flow
and packet messages. This leads to an out-of-bounds access in
ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
OVS_PACKET_ATTR_MAX.

Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
while remaining binary compatible with existing OVS binaries.

Fixes: 05da589 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tracked-down-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/uapi/linux/openvswitch.h | 4 ++++
 net/openvswitch/datapath.c       | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..f714e86 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -174,6 +174,10 @@ enum ovs_packet_attr {
 	OVS_PACKET_ATTR_USERDATA,    /* OVS_ACTION_ATTR_USERSPACE arg. */
 	OVS_PACKET_ATTR_EGRESS_TUN_KEY,  /* Nested OVS_TUNNEL_KEY_ATTR_*
 					    attributes. */
+	OVS_PACKET_ATTR_UNUSED1,
+	OVS_PACKET_ATTR_UNUSED2,
+	OVS_PACKET_ATTR_PROBE,      /* Packet operation is a feature probe,
+				       error logging should be suppressed. */
 	__OVS_PACKET_ATTR_MAX
 };
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 4e9a5f0..b07349e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -524,7 +524,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	struct vport *input_vport;
 	int len;
 	int err;
-	bool log = !a[OVS_FLOW_ATTR_PROBE];
+	bool log = !a[OVS_PACKET_ATTR_PROBE];
 
 	err = -EINVAL;
 	if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] ||
@@ -610,6 +610,7 @@ static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 	[OVS_PACKET_ATTR_PACKET] = { .len = ETH_HLEN },
 	[OVS_PACKET_ATTR_KEY] = { .type = NLA_NESTED },
 	[OVS_PACKET_ATTR_ACTIONS] = { .type = NLA_NESTED },
+	[OVS_PACKET_ATTR_PROBE] = { .type = NLA_FLAG },
 };
 
 static const struct genl_ops dp_packet_genl_ops[] = {
-- 
1.9.3
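
To make the out-of-bounds condition concrete, here is a small
user-space model of the attribute numbering. It is only a sketch: the
DEMO_* names are invented, the enum entries before USERDATA are not
visible in the hunk and are assumed, and the value 8 for the flow probe
attribute is inferred from the two UNUSED placeholders the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the packet attribute enum before and after the patch. */
enum demo_packet_attr {
	DEMO_PACKET_ATTR_UNSPEC,
	DEMO_PACKET_ATTR_PACKET,
	DEMO_PACKET_ATTR_KEY,
	DEMO_PACKET_ATTR_ACTIONS,
	DEMO_PACKET_ATTR_USERDATA,
	DEMO_PACKET_ATTR_EGRESS_TUN_KEY,
	DEMO_PACKET_ATTR_OLD_END,	/* position of __MAX before the patch */
	DEMO_PACKET_ATTR_UNUSED1 = DEMO_PACKET_ATTR_OLD_END,
	DEMO_PACKET_ATTR_UNUSED2,
	DEMO_PACKET_ATTR_PROBE,
	DEMO_PACKET_ATTR_NEW_END	/* position of __MAX after the patch */
};

#define DEMO_OLD_MAX	(DEMO_PACKET_ATTR_OLD_END - 1)
#define DEMO_NEW_MAX	(DEMO_PACKET_ATTR_NEW_END - 1)

/* Assumed numeric value of the flow probe attribute. */
#define DEMO_FLOW_ATTR_PROBE	8

/* True when attr is a valid index into a table declared as tb[max + 1]. */
bool demo_attr_in_bounds(int attr, int max)
{
	return attr >= 0 && attr <= max;
}

/* Checked lookup: returns NULL instead of reading past the table. */
const void *demo_attr_get(const void *const *tb, int max, int attr)
{
	assert(tb != NULL);
	return demo_attr_in_bounds(attr, max) ? tb[attr] : NULL;
}
```

In this model, index 8 falls outside a table sized for the old maximum
of 5 and inside one sized for the new maximum of 8, which is the
situation the commit message describes.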

^ permalink raw reply related

* Re: [patch net-next 1/2 v3] tc: add BPF based action
From: Daniel Borkmann @ 2015-01-14 13:28 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, jhs, ast, hannes
In-Reply-To: <1421229297-14473-1-git-send-email-jiri@resnulli.us>

On 01/14/2015 10:54 AM, Jiri Pirko wrote:
> This action provides a possibility to exec custom BPF code.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
...
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index c54c9d9..cc311e9 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -698,6 +698,17 @@ config NET_ACT_VLAN
>   	  To compile this code as a module, choose M here: the
>   	  module will be called act_vlan.
>
> +config NET_ACT_BPF
> +        tristate "BPF based action"
> +        depends on NET_CLS_ACT
> +        ---help---
> +	  Say Y here to execute BFP code on packets.
                                 ^^^
                                (typo)

Technically correct, but I'd be a bit more precise. When we add eBPF
support one day, this description should be extended to better explain
what it can do, for now it would be good to mention that it can
filter + drop packets.

> +	  If unsure, say N.
> +
> +	  To compile this code as a module, choose M here: the
> +	  module will be called act_bpf.
> +
...
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> new file mode 100644
> index 0000000..0e2a912
> --- /dev/null
> +++ b/net/sched/act_bpf.c
> @@ -0,0 +1,206 @@
...
> +static int tcf_bpf(struct sk_buff *skb, const struct tc_action *a,
> +		   struct tcf_result *res)
> +{
> +	struct tcf_bpf *b = a->priv;
> +	int action;
> +	int filter_res;
> +
> +	spin_lock(&b->tcf_lock);
> +	b->tcf_tm.lastuse = jiffies;
> +	bstats_update(&b->tcf_bstats, skb);
> +	action = b->tcf_action;
> +
> +	filter_res = BPF_PROG_RUN(b->filter, skb);
> +	if (filter_res == -1)
> +		goto drop;
> +
> +	goto unlock;
> +

Why this double goto stuff? Wouldn't it be easier to just write it as:

	filter_res = BPF_PROG_RUN(b->filter, skb);
	if (filter_res == -1) {
		/* #-1 return code from the BPF program in act_bpf
		 * is being interpreted as a drop.
		 */
		action = TC_ACT_SHOT;
		b->tcf_qstats.drops++;
	}

	spin_unlock(&b->tcf_lock);
	return action;

I'm still wondering about the drop semantics ... wouldn't it be more
intuitive to use 0 for drops in this context?

> +drop:
> +	action = TC_ACT_SHOT;
> +	b->tcf_qstats.drops++;
> +unlock:
> +	spin_unlock(&b->tcf_lock);
> +	return action;
> +}
...

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next v3] rhashtable: Fix race in rhashtable_destroy() and use regular work_struct
From: Sergei Shtylyov @ 2015-01-14 13:25 UTC (permalink / raw)
  To: Ying Xue, tgraf; +Cc: davem, netdev
In-Reply-To: <1421228256-740-1-git-send-email-ying.xue@windriver.com>

Hello.

On 1/14/2015 12:37 PM, Ying Xue wrote:

> When we put our declared work task in the global workqueue with
> schedule_delayed_work(), its delay parameter is always zero.
> Therefore, we should define a regular work in rhashtable structure
> instead of a delayed work.

> By the way, we add a condition to check whether resizing functions
> are NULL before cancel the work, avoiding to cancel an uninitialized

    s/before cancel/before cancelling/.

> work.

[...]

> Fixes: 97defe1 ("rhashtable: Per bucket locks & deferred expansion/shrinking")
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> Acked-by: Thomas Graf <tgraf@suug.ch>

[...]

> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index ed6ae1a..1f56189 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
[...]
> @@ -911,12 +911,11 @@ EXPORT_SYMBOL_GPL(rhashtable_init);
>   void rhashtable_destroy(struct rhashtable *ht)
>   {
>   	ht->being_destroyed = true;

    Empty line needed here.

> +	if (ht->p.grow_decision || ht->p.shrink_decision)
> +		cancel_work_sync(&ht->run_work);
>
>   	mutex_lock(&ht->mutex);
> -
> -	cancel_delayed_work(&ht->run_work);
>   	bucket_table_free(rht_dereference(ht->tbl, ht));
> -
>   	mutex_unlock(&ht->mutex);
>   }
>   EXPORT_SYMBOL_GPL(rhashtable_destroy);

WBR, Sergei

^ permalink raw reply

* [net] i40e: adds FCoE configure option
From: Jeff Kirsher @ 2015-01-14 13:14 UTC (permalink / raw)
  To: davem; +Cc: Vasu Dev, netdev, nhorman, sassmann, jogreene, stable,
	Jeff Kirsher

From: Vasu Dev <vasu.dev@intel.com>

Adds FCoE config option I40E_FCOE, so that FCoE can be enabled
as needed but is otherwise disabled by default.

This also eliminates multiple FCoE config checks; there is now just
one config check for CONFIG_I40E_FCOE.

I40E FCoE support was added in the 3.17 kernel, so this patch
should also be applied to the stable 3.17 kernel.

CC: <stable@vger.kernel.org>
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/Kconfig           | 11 +++++++++++
 drivers/net/ethernet/intel/i40e/Makefile     |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_osdep.h |  4 ++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 5b8300a..4d61ef5 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -281,6 +281,17 @@ config I40E_DCB
 
 	  If unsure, say N.
 
+config I40E_FCOE
+	bool "Fibre Channel over Ethernet (FCoE)"
+	default n
+	depends on I40E && DCB && FCOE
+	---help---
+	  Say Y here if you want to use Fibre Channel over Ethernet (FCoE)
+	  in the driver. This will create new netdev for exclusive FCoE
+	  use with XL710 FCoE offloads enabled.
+
+	  If unsure, say N.
+
 config I40EVF
 	tristate "Intel(R) XL710 X710 Virtual Function Ethernet support"
 	depends on PCI_MSI
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 4b94ddb..c405819 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -44,4 +44,4 @@ i40e-objs := i40e_main.o \
 	i40e_virtchnl_pf.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
-i40e-$(CONFIG_FCOE:m=y) += i40e_fcoe.o
+i40e-$(CONFIG_I40E_FCOE) += i40e_fcoe.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index 045b5c4..ad802dd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -78,7 +78,7 @@ do {                                                            \
 } while (0)
 
 typedef enum i40e_status_code i40e_status;
-#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
+#ifdef CONFIG_I40E_FCOE
 #define I40E_FCOE
-#endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
+#endif
 #endif /* _I40E_OSDEP_H_ */
-- 
1.9.3

^ permalink raw reply related

* Re: 3.19-rc4:  BUG: unable to handle kernel paging request at ffff880055f15000 ovs_packet_cmd_execute+0x1f/0x229
From: Thomas Graf @ 2015-01-14 13:13 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Sander Eikelenboom, Pravin B Shelar, David S. Miller, netdev, dev
In-Reply-To: <20150114130351.GB13716@breakpoint.cc>

On 01/14/15 at 02:03pm, Florian Westphal wrote:
> Thomas Graf <tgraf@suug.ch> wrote:
> > Copying ovs-dev mailing list and thus quoting full message.
> > 
> > On 01/14/15 at 01:14pm, Sander Eikelenboom wrote:
> > > Hi,
> > > 
> > > I was testing 3.19-rc4 with openvswitch and encountered the splat below.
> > 
> > What version of OVS are you using? Did this work properly with rc3 or
> > an older kernel?
> 
> seems like it was introduced via 05da5898a96c
> (openvswitch: Add support for OVS_FLOW_ATTR_PROBE).
> 
> It adds test for OVS_FLOW_ATTR_PROBE to ovs_packet_cmd_execute() but
> this function seems to only expect OVS_PACKET_ATTR_* (so we get
> out-of-bounds access)?

Absolutely, just came to the same conclusion independently. I'll send
a fix.

^ permalink raw reply

* Re: 3.19-rc4:  BUG: unable to handle kernel paging request at ffff880055f15000 ovs_packet_cmd_execute+0x1f/0x229
From: Sander Eikelenboom @ 2015-01-14 13:05 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Pravin B Shelar, David S. Miller, netdev, dev
In-Reply-To: <20150114130005.GA564@casper.infradead.org>


Wednesday, January 14, 2015, 2:00:05 PM, you wrote:

> Copying ovs-dev mailing list and thus quoting full message.

> On 01/14/15 at 01:14pm, Sander Eikelenboom wrote:
>> Hi,
>> 
>> I was testing 3.19-rc4 with openvswitch and encountered the splat below.

> What version of OVS are you using? Did this work properly with rc3 or
> an older kernel?

Hi Thomas,

Don't know for sure, I haven't seen it before, but on the other hand, after a
reboot it has been running fine for some time now.
So it seems it's not reliably reproducible :(.

OVS version is current Debian wheezy:
ii  openvswitch-common                                                          1.4.2+git20120612-9.1~deb7u1                                             amd64        Open vSwitch common components
ii  openvswitch-controller                                                      1.4.2+git20120612-9.1~deb7u1                                             amd64        Open vSwitch controller implementation
ii  openvswitch-pki                                                             1.4.2+git20120612-9.1~deb7u1                                             all          Open vSwitch public key infrastructure dependency package
ii  openvswitch-switch                                                          1.4.2+git20120612-9.1~deb7u1                                             amd64        Open vSwitch switch implementations

--
Sander

>> #addr2line -e /boot/vmlinux-3.19.0-rc4-creanuc-20150114-doflr-apicpatchv3-apicrevert+ ffffffff818a1690
>> /mnt/kernelbuild/linux-tip/net/openvswitch/datapath.c:527
>> --
>> Sander
>> 
>> [  463.033308] BUG: unable to handle kernel paging request at ffff880055f15000
>> [  463.072154] IP: [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
>> [  463.106202] PGD 1e10067 PUD 2097067 PMD 5ff54067 PTE 0
>> [  463.126940] Oops: 0000 [#1] SMP
>> [  463.147505] Modules linked in:
>> [  463.166938] CPU: 2 PID: 3049 Comm: ovs-vswitchd Not tainted 3.19.0-rc4-creanuc-20150114-doflr-apicpatchv3-apicrevert+ #1
>> [  463.187507] Hardware name:                  /D53427RKE, BIOS RKPPT10H.86A.0017.2013.0425.1251 04/25/2013
>> [  463.208553] task: ffff880058d30000 ti: ffff880055c38000 task.ti: ffff880055c38000
>> [  463.229734] RIP: e030:[<ffffffff818a1690>]  [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
>> [  463.251082] RSP: e02b:ffff880055c3ba48  EFLAGS: 00010296
>> [  463.271786] RAX: ffff88004fe38818 RBX: ffffffff81ed4cc0 RCX: 0000000000000000
>> [  463.293072] RDX: ffff880055c3bb00 RSI: ffff880055c3bad0 RDI: ffff8800559dc700
>> [  463.314521] RBP: ffff8800559dc700 R08: ffffffff81b08d00 R09: 000000007ffff000
>> [  463.336189] R10: ffff88004fe38814 R11: ffffffff81ed4cc0 R12: ffff880055f14fc0
>> [  463.356906] R13: ffff88004fe38800 R14: ffff880055f14fc0 R15: ffffffff81b08c60
>> [  463.377482] FS:  00007f196321c700(0000) GS:ffff88005f700000(0000) knlGS:ffff88005f680000
>> [  463.398646] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  463.419995] CR2: ffff880055f15000 CR3: 000000005622e000 CR4: 0000000000042660
>> [  463.441577] Stack:
>> [  463.462975]  000000000000000c ffff88004fe38814 0000000000000005 ffffffff8130b116
>> [  463.485114]  ffffffff81ed4cc0 ffffffff81ed4cc0 ffff8800559dc700 ffff880055f14fc0
>> [  463.507367]  ffff88004fe38800 0000000000000008 ffffffff81b08c60 ffffffff81794364
>> [  463.530186] Call Trace:
>> [  463.552330]  [<ffffffff8130b116>] ? nla_parse+0x57/0xe7
>> [  463.574869]  [<ffffffff81794364>] ? genl_family_rcv_msg+0x243/0x2a9
>> [  463.597276]  [<ffffffff818ad9a2>] ? __slab_alloc.constprop.63+0x2bb/0x2e5
>> [  463.619394]  [<ffffffff81794402>] ? genl_rcv_msg+0x38/0x5b
>> [  463.641361]  [<ffffffff817911a6>] ? __netlink_lookup+0x3a/0x40
>> [  463.663192]  [<ffffffff817943ca>] ? genl_family_rcv_msg+0x2a9/0x2a9
>> [  463.685141]  [<ffffffff81793686>] ? netlink_rcv_skb+0x36/0x7c
>> [  463.706874]  [<ffffffff81793987>] ? genl_rcv+0x1f/0x2c
>> [  463.729152]  [<ffffffff817930ea>] ? netlink_unicast+0x100/0x19c
>> [  463.751315]  [<ffffffff8179350d>] ? netlink_sendmsg+0x311/0x36b
>> [  463.772483]  [<ffffffff8173d574>] ? do_sock_sendmsg+0x62/0x7b
>> [  463.793309]  [<ffffffff8173e6ce>] ? copy_msghdr_from_user+0x158/0x17c
>> [  463.814032]  [<ffffffff8173e94c>] ? ___sys_sendmsg+0x11f/0x197
>> [  463.834595]  [<ffffffff8173ddfd>] ? sock_poll+0xf2/0xfd
>> [  463.854970]  [<ffffffff81136d27>] ? ep_send_events_proc+0x91/0x153
>> [  463.875603]  [<ffffffff81136c96>] ? ep_read_events_proc+0x92/0x92
>> [  463.896168]  [<ffffffff818b6929>] ? _raw_spin_unlock_irqrestore+0x42/0x5b
>> [  463.917050]  [<ffffffff8113759c>] ? ep_scan_ready_list.isra.14+0x163/0x182
>> [  463.938458]  [<ffffffff81137820>] ? ep_poll+0x250/0x2c4
>> [  463.958214]  [<ffffffff8173f410>] ? __sys_sendmsg+0x3b/0x5d
>> [  463.977581]  [<ffffffff818b6da9>] ? system_call_fastpath+0x12/0x17
>> [  463.996860] Code: ff 89 d8 5b 5d 41 5c 41 5d 41 5e c3 41 57 41 56 41 55 41 54 55 53 48 83 ec 28 48 8b 46 18 4c 8b 76 20 48 89 44 24 08 49 8b 46 08 <49> 8b 6e 40 48 85 c0 0f 84 e0 01 00 00 49 83 7e 10 00 0f 84 d5
>> [  464.037236] RIP  [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
>> [  464.056926]  RSP <ffff880055c3ba48>
>> [  464.076182] CR2: ffff880055f15000
>> [  464.095097] ---[ end trace 8bcb28ced5309e55 ]---

^ permalink raw reply

* Re: 3.19-rc4:  BUG: unable to handle kernel paging request at ffff880055f15000 ovs_packet_cmd_execute+0x1f/0x229
From: Florian Westphal @ 2015-01-14 13:03 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Sander Eikelenboom, Pravin B Shelar, David S. Miller, netdev, dev
In-Reply-To: <20150114130005.GA564@casper.infradead.org>

Thomas Graf <tgraf@suug.ch> wrote:
> Copying ovs-dev mailing list and thus quoting full message.
> 
> On 01/14/15 at 01:14pm, Sander Eikelenboom wrote:
> > Hi,
> > 
> > I was testing 3.19-rc4 with openvswitch and encountered the splat below.
> 
> What version of OVS are you using? Did this work properly with rc3 or
> an older kernel?

Seems like it was introduced via commit 05da5898a96c
(openvswitch: Add support for OVS_FLOW_ATTR_PROBE).

It adds a test for OVS_FLOW_ATTR_PROBE to ovs_packet_cmd_execute(), but
this function seems to expect only OVS_PACKET_ATTR_* attributes (so we
get an out-of-bounds access)?
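
To illustrate the suspected failure mode, here is a minimal userspace sketch of how an attribute type from one family can index past a table sized for another family. The enum names and values are illustrative stand-ins, not the kernel's uapi definitions, and attr_get() is a hypothetical helper rather than a netlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins, NOT the kernel's uapi definitions or values. */
enum pkt_attr {
	PKT_ATTR_UNSPEC,
	PKT_ATTR_PACKET,
	PKT_ATTR_KEY,
	PKT_ATTR_ACTIONS,
	__PKT_ATTR_MAX
};
#define PKT_ATTR_MAX (__PKT_ATTR_MAX - 1)

enum flow_attr {
	FLOW_ATTR_UNSPEC,
	FLOW_ATTR_KEY,
	FLOW_ATTR_ACTIONS,
	FLOW_ATTR_STATS,
	FLOW_ATTR_MASK,
	FLOW_ATTR_PROBE,
	__FLOW_ATTR_MAX
};

/* In this sketch the flow-family index lies beyond the packet-family table. */
static_assert((int)FLOW_ATTR_PROBE > (int)PKT_ATTR_MAX,
	      "foreign index lies past the table");

/* A parsed attribute table has maxtype + 1 slots, indexed 0..maxtype. */
static bool attr_index_valid(int type, int maxtype)
{
	return type >= 0 && type <= maxtype;
}

/* Hypothetical lookup: returns NULL instead of reading past the table. */
static const void *attr_get(const void *const *attrs, int maxtype, int type)
{
	if (!attr_index_valid(type, maxtype))
		return NULL;
	return attrs[type];
}
```

With a table of PKT_ATTR_MAX + 1 slots, a direct attrs[FLOW_ATTR_PROBE] would read beyond the array, while the checked lookup above returns NULL for that index.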

^ permalink raw reply

* Re: [PATCH 2/2] mdio-mux-gpio: use new gpiod_get_array and gpiod_put_array functions
From: Rojhalat Ibrahim @ 2015-01-14 13:01 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: linux-gpio@vger.kernel.org, Alexandre Courbot, Linus Walleij,
	David Miller, netdev
In-Reply-To: <CAAVeFuLqV7CvR2-SNWsrQuBzctQx1wiMtWwjw=cPbp5GRnMdcw@mail.gmail.com>

On Wednesday 14 January 2015 14:11:27 Alexandre Courbot wrote:
> On Sat, Jan 10, 2015 at 12:19 AM, Rojhalat Ibrahim <imr@rtschenk.de> wrote:
> > Use the new gpiod_get_array and gpiod_put_array functions for obtaining and
> > disposing of GPIO descriptors.
> >
> > Signed-off-by: Rojhalat Ibrahim <imr@rtschenk.de>
> > ---
> > This patch depends on my previous patch "gpiolib: add gpiod_get_array and
> > gpiod_put_array functions".
> >
> >  drivers/net/phy/mdio-mux-gpio.c |   28 ++++++++--------------------
> >  1 file changed, 8 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/net/phy/mdio-mux-gpio.c b/drivers/net/phy/mdio-mux-gpio.c
> > index 1eaf81e..35c37da 100644
> > --- a/drivers/net/phy/mdio-mux-gpio.c
> > +++ b/drivers/net/phy/mdio-mux-gpio.c
> > @@ -47,7 +47,6 @@ static int mdio_mux_gpio_probe(struct platform_device *pdev)
> >  {
> >         struct mdio_mux_gpio_state *s;
> >         int num_gpios;
> > -       unsigned int n;
> >         int r;
> >
> >         if (!pdev->dev.of_node)
> > @@ -63,16 +62,10 @@ static int mdio_mux_gpio_probe(struct platform_device *pdev)
> >
> >         s->num_gpios = num_gpios;
> >
> > -       for (n = 0; n < num_gpios; ) {
> > -               struct gpio_desc *gpio = gpiod_get_index(&pdev->dev, NULL, n,
> > -                                                        GPIOD_OUT_LOW);
> > -               if (IS_ERR(gpio)) {
> > -                       r = PTR_ERR(gpio);
> > -                       goto err;
> > -               }
> > -               s->gpio[n] = gpio;
> > -               n++;
> > -       }
> > +       r = gpiod_get_array(&pdev->dev, NULL, s->gpio, num_gpios,
> > +                           GPIOD_OUT_LOW);
> > +       if (r != num_gpios)
> > +               return r;
> >
> >         r = mdio_mux_init(&pdev->dev,
> >                           mdio_mux_gpio_switch_fn, &s->mux_handle, s);
> > @@ -80,22 +73,17 @@ static int mdio_mux_gpio_probe(struct platform_device *pdev)
> >         if (r == 0) {
> >                 pdev->dev.platform_data = s;
> >                 return 0;
> > +       } else {
> > +               gpiod_put_array(s->gpio, num_gpios);
> > +               return r;
> >         }
> 
> Suggestion: handle the errors in the if condition, and let normal
> execution be visible at the first level of indentation of the
> function. I.e:
> 
>     if (r != 0) {
>         gpiod_put_array(s->gpio, num_gpios);
>         return r;
>     }
> 
>     pdev->dev.platform_data = s;
>     return 0;
> 
> This is how previous errors are handled in this function and is
> generally a good thing to do as it makes the function's logic easier
> to follow.
> 

Ok, will do.

Thanks for the review.

^ permalink raw reply

* Re: 3.19-rc4:  BUG: unable to handle kernel paging request at ffff880055f15000 ovs_packet_cmd_execute+0x1f/0x229
From: Thomas Graf @ 2015-01-14 13:00 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: Pravin B Shelar, David S. Miller, netdev, dev
In-Reply-To: <1645184106.20150114131420@eikelenboom.it>

Copying ovs-dev mailing list and thus quoting full message.

On 01/14/15 at 01:14pm, Sander Eikelenboom wrote:
> Hi,
> 
> I was testing 3.19-rc4 with openvswitch and encountered the splat below.

What version of OVS are you using? Did this work properly with rc3 or
an older kernel?

> #addr2line -e /boot/vmlinux-3.19.0-rc4-creanuc-20150114-doflr-apicpatchv3-apicrevert+ ffffffff818a1690
> /mnt/kernelbuild/linux-tip/net/openvswitch/datapath.c:527
> --
> Sander
> 
> [  463.033308] BUG: unable to handle kernel paging request at ffff880055f15000
> [  463.072154] IP: [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
> [  463.106202] PGD 1e10067 PUD 2097067 PMD 5ff54067 PTE 0
> [  463.126940] Oops: 0000 [#1] SMP
> [  463.147505] Modules linked in:
> [  463.166938] CPU: 2 PID: 3049 Comm: ovs-vswitchd Not tainted 3.19.0-rc4-creanuc-20150114-doflr-apicpatchv3-apicrevert+ #1
> [  463.187507] Hardware name:                  /D53427RKE, BIOS RKPPT10H.86A.0017.2013.0425.1251 04/25/2013
> [  463.208553] task: ffff880058d30000 ti: ffff880055c38000 task.ti: ffff880055c38000
> [  463.229734] RIP: e030:[<ffffffff818a1690>]  [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
> [  463.251082] RSP: e02b:ffff880055c3ba48  EFLAGS: 00010296
> [  463.271786] RAX: ffff88004fe38818 RBX: ffffffff81ed4cc0 RCX: 0000000000000000
> [  463.293072] RDX: ffff880055c3bb00 RSI: ffff880055c3bad0 RDI: ffff8800559dc700
> [  463.314521] RBP: ffff8800559dc700 R08: ffffffff81b08d00 R09: 000000007ffff000
> [  463.336189] R10: ffff88004fe38814 R11: ffffffff81ed4cc0 R12: ffff880055f14fc0
> [  463.356906] R13: ffff88004fe38800 R14: ffff880055f14fc0 R15: ffffffff81b08c60
> [  463.377482] FS:  00007f196321c700(0000) GS:ffff88005f700000(0000) knlGS:ffff88005f680000
> [  463.398646] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  463.419995] CR2: ffff880055f15000 CR3: 000000005622e000 CR4: 0000000000042660
> [  463.441577] Stack:
> [  463.462975]  000000000000000c ffff88004fe38814 0000000000000005 ffffffff8130b116
> [  463.485114]  ffffffff81ed4cc0 ffffffff81ed4cc0 ffff8800559dc700 ffff880055f14fc0
> [  463.507367]  ffff88004fe38800 0000000000000008 ffffffff81b08c60 ffffffff81794364
> [  463.530186] Call Trace:
> [  463.552330]  [<ffffffff8130b116>] ? nla_parse+0x57/0xe7
> [  463.574869]  [<ffffffff81794364>] ? genl_family_rcv_msg+0x243/0x2a9
> [  463.597276]  [<ffffffff818ad9a2>] ? __slab_alloc.constprop.63+0x2bb/0x2e5
> [  463.619394]  [<ffffffff81794402>] ? genl_rcv_msg+0x38/0x5b
> [  463.641361]  [<ffffffff817911a6>] ? __netlink_lookup+0x3a/0x40
> [  463.663192]  [<ffffffff817943ca>] ? genl_family_rcv_msg+0x2a9/0x2a9
> [  463.685141]  [<ffffffff81793686>] ? netlink_rcv_skb+0x36/0x7c
> [  463.706874]  [<ffffffff81793987>] ? genl_rcv+0x1f/0x2c
> [  463.729152]  [<ffffffff817930ea>] ? netlink_unicast+0x100/0x19c
> [  463.751315]  [<ffffffff8179350d>] ? netlink_sendmsg+0x311/0x36b
> [  463.772483]  [<ffffffff8173d574>] ? do_sock_sendmsg+0x62/0x7b
> [  463.793309]  [<ffffffff8173e6ce>] ? copy_msghdr_from_user+0x158/0x17c
> [  463.814032]  [<ffffffff8173e94c>] ? ___sys_sendmsg+0x11f/0x197
> [  463.834595]  [<ffffffff8173ddfd>] ? sock_poll+0xf2/0xfd
> [  463.854970]  [<ffffffff81136d27>] ? ep_send_events_proc+0x91/0x153
> [  463.875603]  [<ffffffff81136c96>] ? ep_read_events_proc+0x92/0x92
> [  463.896168]  [<ffffffff818b6929>] ? _raw_spin_unlock_irqrestore+0x42/0x5b
> [  463.917050]  [<ffffffff8113759c>] ? ep_scan_ready_list.isra.14+0x163/0x182
> [  463.938458]  [<ffffffff81137820>] ? ep_poll+0x250/0x2c4
> [  463.958214]  [<ffffffff8173f410>] ? __sys_sendmsg+0x3b/0x5d
> [  463.977581]  [<ffffffff818b6da9>] ? system_call_fastpath+0x12/0x17
> [  463.996860] Code: ff 89 d8 5b 5d 41 5c 41 5d 41 5e c3 41 57 41 56 41 55 41 54 55 53 48 83 ec 28 48 8b 46 18 4c 8b 76 20 48 89 44 24 08 49 8b 46 08 <49> 8b 6e 40 48 85 c0 0f 84 e0 01 00 00 49 83 7e 10 00 0f 84 d5
> [  464.037236] RIP  [<ffffffff818a1690>] ovs_packet_cmd_execute+0x1f/0x229
> [  464.056926]  RSP <ffff880055c3ba48>
> [  464.076182] CR2: ffff880055f15000
> [  464.095097] ---[ end trace 8bcb28ced5309e55 ]---

^ permalink raw reply

* Re: [PATCH net] be2net: Allow GRE to work concurrently while a VxLAN tunnel is configured
From: Sergei Shtylyov @ 2015-01-14 12:55 UTC (permalink / raw)
  To: Sriharsha Basavapatna, netdev
In-Reply-To: <1421293069-30518-1-git-send-email-sriharsha.basavapatna@emulex.com>

Hello.

On 1/15/2015 6:37 AM, Sriharsha Basavapatna wrote:

> Other tunnels like GRE break while VxLAN offloads are enabled in Skyhawk-R. To
> avoid this, we should restrict offload features on a per-packet basis in such
> conditions.

> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@emulex.com>
> ---
>   drivers/net/ethernet/emulex/benet/be_main.c |   41 +++++++++++++++++++++++++--
>   1 file changed, 38 insertions(+), 3 deletions(-)

    Some nitpicking, mostly grammatical...

> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
> index 41a0a54..726a4a4 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
[...]
> @@ -4463,7 +4464,41 @@ static netdev_features_t be_features_check(struct sk_buff *skb,
>   					   struct net_device *dev,
>   					   netdev_features_t features)
>   {
> -	return vxlan_features_check(skb, features);
> +	struct be_adapter *adapter = netdev_priv(dev);
> +	u8 l4_hdr = 0;
> +
> +	/* The code below restricts offload features for some tunneled packets.
> +	 * Offload features for normal(non tunnel) packets are unchanged.

    You forgot a space before (.

> +	 */
> +	if (!skb->encapsulation ||
> +	    !(adapter->flags & BE_FLAGS_VXLAN_OFFLOADS))
> +		return features;
> +
> +	/* It's an encapsulated packet and VxLAN offloads are enabled. We
> +	 * should disable tunnel offload features if it's not a VxLAN packet,
> +	 * as tunnel offloads have been enabled only for VxLAN. This is done to
> +	 * allow other tunneled trafffic like GRE work fine while VxLAN

    Just "traffic", too many f's.

> +	 * offloads are configured in Skyhawk-R.
> +	 */
> +	switch (vlan_get_protocol(skb)) {
> +	case htons(ETH_P_IP):
> +		l4_hdr = ip_hdr(skb)->protocol;
> +		break;
> +	case htons(ETH_P_IPV6):
> +		l4_hdr = ipv6_hdr(skb)->nexthdr;
> +		break;
> +	default:
> +		return features;
> +	}
> +
> +	if (l4_hdr != IPPROTO_UDP ||
> +	    skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
> +	    skb->inner_protocol != htons(ETH_P_TEB) ||
> +	    (skb_inner_mac_header(skb) - skb_transport_header(skb) !=
> +	    sizeof(struct udphdr) + sizeof(struct vxlanhdr)))

    () not needed around !=.
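
    For illustration only, a reduced userspace sketch of the same test without the extra parentheses. The struct and function names here are hypothetical stand-ins (not the kernel's udphdr/vxlanhdr or the driver's code), and the inner-protocol checks from the patch are left out. It relies on '-' and '+' binding tighter than '!=', which in turn binds tighter than '||'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IPPROTO_UDP 17	/* IANA protocol number for UDP */

/* Userspace stand-ins with the same 8-byte sizes as the wire headers. */
struct sketch_udphdr {
	uint16_t source, dest, len, check;
};

struct sketch_vxlanhdr {
	uint32_t vx_flags, vx_vni;
};

static_assert(sizeof(struct sketch_udphdr) == 8, "UDP header is 8 bytes");
static_assert(sizeof(struct sketch_vxlanhdr) == 8, "VXLAN header is 8 bytes");

/*
 * Offsets are byte offsets from the start of the packet. The packet is
 * treated as VXLAN only if the outer L4 protocol is UDP and the inner
 * MAC header starts exactly one UDP + one VXLAN header after the outer
 * transport header.
 */
static bool looks_like_vxlan(uint8_t l4_proto, size_t transport_off,
			     size_t inner_mac_off)
{
	/* No parentheses needed: arithmetic, then '!=', then '||'. */
	if (l4_proto != SKETCH_IPPROTO_UDP ||
	    inner_mac_off - transport_off !=
	    sizeof(struct sketch_udphdr) + sizeof(struct sketch_vxlanhdr))
		return false;

	return true;
}
```

    A UDP packet whose inner MAC header sits 16 bytes past the outer transport header passes; any other gap, or a non-UDP outer protocol such as GRE, does not.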

[...]

WBR, Sergei

^ permalink raw reply
