Netdev List
* Re: [PATCH net-next] mlx4: fix use-after-free in mlx4_en_fold_software_stats()
From: Saeed Mahameed @ 2016-12-01 15:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480597326.18162.276.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Dec 1, 2016 at 3:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> My recent commit to get more precise rx/tx counters in ndo_get_stats64()
> can lead to crashes at device dismantle, as Jesper found out.
>
> We must prevent mlx4_en_fold_software_stats() trying to access
> tx/rx rings if they are deleted.
>
> Fix this by adding a test against priv->port_up in
> mlx4_en_fold_software_stats()
>
> Calling mlx4_en_fold_software_stats() from mlx4_en_stop_port()
> allows us to eventually broadcast the latest/current counters to
> rtnetlink monitors.
>
> Fixes: 40931b85113d ("mlx4: give precise rx/tx bytes/packets counters")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-and-bisected-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@dev.mellanox.co.il>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>


* Re: [PATCH] ip6_offload: check segs for NULL in ipv6_gso_segment.
From: Eric Dumazet @ 2016-12-01 15:27 UTC (permalink / raw)
  To: Artem Savkov
  Cc: davem, netdev, linux-kernel, jstancek, steffen.klassert,
	alexander.h.duyck,
	YOSHIFUJI Hideaki / 吉藤英明
In-Reply-To: <20161201150703.dv463hxgqajqo6pm@shodan.usersys.redhat.com>

On Thu, 2016-12-01 at 16:07 +0100, Artem Savkov wrote:

> I am not, but this would have the same behavior as pre-07b26c9 code and
> IS_ERR_OR_NULL is used in ipv4's inet_gso_segment().

My concern might have been that IS_ERR_OR_NULL() considers the !ptr to
be unlikely.

But in this code path, we really can not tell.

segs == NULL can be quite likely in the TUN case, because of the DODGY bit.

Commit 50c3a487d50756 replaced the perfectly fine :

if (!segs || IS_ERR(segs))

into dubious

if (IS_ERR_OR_NULL(segs))

segs == NULL is not an error, but use of IS_ERR_OR_NULL() might mislead
programmers trying to understand this code.
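
To make the distinction concrete, here is a minimal userspace sketch. The
helpers are simplified stand-ins modeled on the kernel's err.h (an
assumption, not a copy of it), and classify_segs() is a hypothetical name
used only for illustration. The point is that the explicit form treats
NULL as "nothing to do" with no branch hint, while the combined helper
marks the NULL test as unlikely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for kernel helpers (assumption: modeled on err.h). */
#define MAX_ERRNO	4095
#define unlikely(x)	__builtin_expect(!!(x), 0)

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline bool IS_ERR(const void *ptr)
{
	return unlikely((uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO);
}

/* The combined helper hints that a NULL pointer is the rare case. */
static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return unlikely(!ptr) || IS_ERR(ptr);
}

enum seg_result { SEG_NONE, SEG_ERROR, SEG_LIST };

/* Hypothetical helper: classify a segmentation result explicitly. */
static enum seg_result classify_segs(const void *segs)
{
	if (!segs)		/* no hint: NULL may well be the common case */
		return SEG_NONE;
	if (IS_ERR(segs))	/* a real error, encoded in the pointer */
		return SEG_ERROR;
	return SEG_LIST;
}
```

Both forms accept and reject the same pointers; they differ only in the
branch hint and in what they suggest to the reader about NULL.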


* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-12-01 15:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480539652.18162.205.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 11:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 22:42 +0200, Saeed Mahameed wrote:
>> On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
>> >
>> >> we had/still have the proper stats they are the ones that
>> >> mlx4_en_fold_software_stats is trying to cache into  (they always
>> >> exist),
>> >> but the ones that you are trying to read from (the mlx4 rings) are gone !
>> >>
>> >> This bug is totally new and as i warned, this is another symptom of
>> >> the real root cause (can't sleep while reading stats).
>> >>
>> >> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
>> >> always iterate over all of them to query stats ?
>> >> what if you have one ring/none/1K ? how would you know how many to query ?
>> >
>> > I am suggesting I will fix the bug I introduced.
>> >
>> > Do not panic.
>> >
>> >
>>
>> Not at all, I trust you are the only one who is capable of providing
>> the best solution.
>> I am just trying to read your mind :-).
>>
>> As i said i like the solution and i want to adapt it to mlx5, so I am
>> a little bit enthusiastic :)
>
> What about the following fix guys ?
>
> As a bonus we update the stats right before they are sent to monitors
> via rtnetlink ;)

Hi Eric, Thanks for the patch, I already acked it.

I have one educational question (not related to this patch, but
related to stats reading in general).
I was wondering why do we need to disable bh every time we read stats
"spin_lock_bh" ? is it essential ?

I checked and in mlx4 we don't hold stats_lock in softirq
(en_rx.c/en_tx.c), so I don't see any deadlock risk in here..

 Thanks
Saeed.


* [PATCH 0/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
From: Daniele Palmas @ 2016-12-01 15:52 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: netdev, Daniele Palmas

This patch adds support for PID 0x1040 of Telit LE922A.

The QMI adapter requires DTR to be set in order to work properly,
so QMI_WWAN_QUIRK_DTR has been enabled.

The verbose lsusb output of the composition follows:

Bus 003 Device 006: ID 1bc7:1040 Telit Wireless Solutions 
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x1bc7 Telit Wireless Solutions
  idProduct          0x1040 
  bcdDevice            3.10
  iManufacturer           1 Android
  iProduct                2 Android
  iSerial                 3 359fb2
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength          281
    bNumInterfaces          7
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              500mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass    255 Vendor Specific Subclass
      bInterfaceProtocol    255 Vendor Specific Protocol
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x01  EP 1 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass     66 
      bInterfaceProtocol      1 
      iInterface              4 ADB Interface
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        2
      bAlternateSetting       0
      bNumEndpoints           3
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass    255 Vendor Specific Subclass
      bInterfaceProtocol    255 Vendor Specific Protocol
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x84  EP 4 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0008  1x 8 bytes
        bInterval               9
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x83  EP 3 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x03  EP 3 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        3
      bAlternateSetting       0
      bNumEndpoints           3
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      ** UNRECOGNIZED:  05 24 00 10 01
      ** UNRECOGNIZED:  05 24 01 00 00
      ** UNRECOGNIZED:  04 24 02 02
      ** UNRECOGNIZED:  05 24 06 00 00
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x86  EP 6 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x000a  1x 10 bytes
        bInterval               9
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x85  EP 5 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x04  EP 4 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        4
      bAlternateSetting       0
      bNumEndpoints           3
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      ** UNRECOGNIZED:  05 24 00 10 01
      ** UNRECOGNIZED:  05 24 01 00 00
      ** UNRECOGNIZED:  04 24 02 02
      ** UNRECOGNIZED:  05 24 06 00 00
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x88  EP 8 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x000a  1x 10 bytes
        bInterval               9
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x87  EP 7 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x05  EP 5 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        5
      bAlternateSetting       0
      bNumEndpoints           3
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      ** UNRECOGNIZED:  05 24 00 10 01
      ** UNRECOGNIZED:  05 24 01 00 00
      ** UNRECOGNIZED:  04 24 02 02
      ** UNRECOGNIZED:  05 24 06 00 00
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8a  EP 10 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x000a  1x 10 bytes
        bInterval               9
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x89  EP 9 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x06  EP 6 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        6
      bAlternateSetting       0
      bNumEndpoints           3
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      ** UNRECOGNIZED:  05 24 00 10 01
      ** UNRECOGNIZED:  05 24 01 00 00
      ** UNRECOGNIZED:  04 24 02 02
      ** UNRECOGNIZED:  05 24 06 00 00
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8c  EP 12 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x000a  1x 10 bytes
        bInterval               9
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8b  EP 11 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x07  EP 7 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0000
  (Bus Powered)


Daniele Palmas (1):
  NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040

 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.7.4


* [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
From: Daniele Palmas @ 2016-12-01 15:52 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: netdev, Daniele Palmas
In-Reply-To: <1480607525-23044-1-git-send-email-dnlplm@gmail.com>

This patch adds support for PID 0x1040 of Telit LE922A.

The QMI adapter requires DTR to be set in order to work properly,
so QMI_WWAN_QUIRK_DTR has been enabled.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 3ff76c6..6fe1cdb 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -894,6 +894,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},	/* Alcatel L800MA */
 	{QMI_FIXED_INTF(0x2357, 0x0201, 4)},	/* TP-LINK HSUPA Modem MA180 */
 	{QMI_FIXED_INTF(0x2357, 0x9000, 4)},	/* TP-LINK MA260 */
+	{QMI_QUIRK_SET_DTR(0x1bc7, 0x1040, 2)},	/* Telit LE922A */
 	{QMI_FIXED_INTF(0x1bc7, 0x1200, 5)},	/* Telit LE920 */
 	{QMI_FIXED_INTF(0x1bc7, 0x1201, 2)},	/* Telit LE920 */
 	{QMI_FIXED_INTF(0x1c9e, 0x9b01, 3)},	/* XS Stick W100-2 from 4G Systems */
-- 
2.7.4


* Re: [flamebait] xdp, well meaning but pointless
From: Hannes Frederic Sowa @ 2016-12-01 15:52 UTC (permalink / raw)
  To: Thomas Graf, Florian Westphal; +Cc: netdev
In-Reply-To: <20161201145834.GA569@pox.localdomain>

Hi,

On 01.12.2016 15:58, Thomas Graf wrote:
> On 12/01/16 at 10:11am, Florian Westphal wrote:
>> Aside from this, XDP, like DPDK, is a kernel bypass.
>> You might say 'Its just stack bypass, not a kernel bypass!'.
>> But what does that mean exactly?  That packets can still be passed
>> onward to normal stack?
>> Bypass solutions like netmap can also inject packets back to
>> kernel stack again.
> 
> I have a fundamental issue with the approach of exporting packets into
> user space and reinjecting them: Once the packet leaves the kernel,
> any security guarantees are off. I have no control over what is
> running in user space and whether whatever listener up there has been
> compromised or not. To me, that's a no go, in particular for servers
> hosting multi tenant workloads. This is one of the main reasons why
> XDP, in particular in combination with BPF, is very interesting to me.

First of all, this is a rant targeted at XDP and not at eBPF as a whole.
XDP manipulates packets at will, and thus all security guarantees
are off, just as in any user space solution.

Secondly, user space provides policy, ACLs, more controlled memory
protection, restartability and better debuggability. If I had multi
tenant workloads I would definitely put more complex "business/acl"
logic into user space, so I can make use of LSM and other features,
especially to prevent a network-facing service from attacking the
tenants. If stuff gets put into the kernel you run user controlled code
in the kernel, exposing a much bigger attack surface.

What use case do you see in XDP specifically e.g. for container networking?

>> b). with regards to a programmable data path: IFF one wants to do this
>> in kernel (and that's a big if), it seems much more preferable to provide
>> a config/data-based approach rather than a programmable one.  If you want
>> full freedom DPDK is architecturally just too powerful to compete with.
> 
> I must have missed the legal disclaimer that is usually put in front
> of the DPDK marketing show :-)
>
> I don't want full freedom. I want programmability with stack integration
> at sufficient speed and the ability to benefit from the hardware
> abstractions that the kernel provides.
> 
>> Proponents of XDP sometimes provide usage examples.
>> Lets look at some of these.
> 
> [ I won't comment on any of the other use cases because they are of no
>   interest to me ]
> 
>> * Load balancer
>> State holding algorithm need sorting and searching, so also no fit for
>> eBPF (could be exposed by function exports, but then can we do DoS by
>> finding worst case scenarios?).
>>
>> Also again needs way to forward frame out via another interface.
>>
>> For cases where packet gets sent out via same interface it would appear
>> to be easier to use port mirroring in a switch and use stochastic filtering
>> on end nodes to determine which host should take responsibility.
>>
>> XDP plus: central authority over how distribution will work in case
>> nodes are added/removed from pool.
>> But then again, it will be easier to handle this with netmap/dpdk where
>> more complicated scheduling algorithms can be used.
> 
> I agree with you if the LB is a software based appliance in either a
> dedicated VM or on dedicated baremetal.
> 
> The reality is turning out to be different in many cases though, LB
> needs to be performed not only for north south but east west as well.
> So even if I would handle LB for traffic entering my datacenter in user
> space, I will need the same LB for packets from my applications and
> I definitely don't want to move all of that into user space.

The open question to me is why is programmability needed here.

Look at the discussion about ECMP and consistent hashing. It is not very
easy to actually write this code correctly. Why can't we just put C code
into the kernel that implements this once and for all and let user space
update the policies?
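
As an illustration of the kind of logic that could be written once and
driven by user-supplied policy, here is a minimal userspace sketch of
rendezvous (highest-random-weight) hashing for next-hop selection. This is
not the kernel's ECMP code: mix32(), pick_nexthop() and the use of plain
integer next-hop IDs are assumptions made for the example, and IDs are
assumed to be distinct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit mixer; any well-mixing hash would do here. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/*
 * Score every candidate next hop against the flow hash and return the
 * ID with the highest score. Ties are broken by the smaller ID so the
 * result does not depend on the order of the array.
 */
static uint32_t pick_nexthop(uint32_t flow_hash, const uint32_t *nh_ids,
			     size_t n)
{
	uint32_t best_id, best_score;
	size_t i;

	assert(nh_ids && n > 0);
	best_id = nh_ids[0];
	best_score = mix32(flow_hash ^ mix32(best_id));
	for (i = 1; i < n; i++) {
		uint32_t score = mix32(flow_hash ^ mix32(nh_ids[i]));

		if (score > best_score ||
		    (score == best_score && nh_ids[i] < best_id)) {
			best_id = nh_ids[i];
			best_score = score;
		}
	}
	return best_id;
}
```

Because each flow simply takes the maximum over the current set, removing
a next hop only remaps the flows that had selected it; all other flows
keep their choice. The cost is a linear scan per lookup, which is one of
the trade-offs a real implementation would have to weigh.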

Load balancers have to deal correctly with ICMP packets, e.g. they even
have to be duplicated to every ECMP route. This seems to be problematic
to do in eBPF programs due to looping constructs so you end up with
complicated user space anyway.

>> * early drop/filtering.
>> While it's possible to do "u32" like filters with ebpf, all modern nics
>> support ntuple filtering in hardware, which is going to be faster because
>> such packet will never even be signalled to the operating system.
>> For more complicated cases (e.g. doing socket lookup to check if particular
>> packet does match bound socket (and expected sequence numbers etc) I don't
>> see easy ways to do that with XDP (and without sk_buff context).
>> Providing it via function exports is possible of course, but that will only
>> result in an "arms race" where we will see special-sauce functions
>> all over the place -- DoS will always attempt to go for something
>> that is difficult to filter against, cf. all the recent volume-based
>> floodings.
> 
> You probably put this last because this was the most difficult to
> shoot down ;-)
> 
> The benefits of XDP for this use case are extremely obvious in combination
> with local applications which need to be protected. ntuple filters won't
> cut it. They are limited and subject to a certain rate at which they
> can be configured. Any serious mitigation will require stateful filtering
> with at least minimal L7 matching abilities and this is exactly where XDP
> will excel.

In my experience and research of DoS attacks you certainly want to put a
bit more logic into a filter than to look up something from hash tables
and drop it then. You certainly also want to have some more logic than
32 * 4096 instructions to execute, e.g. parsing and matching of DNS/NTP
packets with certain conditions and side look-ups. If you seriously do
that stuff you end up with highly optimized programs containing
stochastic filters and also complex database logic.

If I want to drop based on hash table lookups, as Florian wrote, I would
let the hardware do that and assemble the tables in user space.

Bye,
Hannes


* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-12-01 15:55 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <CALzJLG8O28ZBS3sdqLpohPZtpLbLs228K4aO6hYd4URsA3yefA@mail.gmail.com>

On Thu, 2016-12-01 at 17:38 +0200, Saeed Mahameed wrote:

> 
> Hi Eric, Thanks for the patch, I already acked it.

Thanks !

> 
> I have one educational question (not related to this patch, but
> related to stats reading in general).
> I was wondering why do we need to disable bh every time we read stats
> "spin_lock_bh" ? is it essential ?
> 
> I checked and in mlx4 we don't hold stats_lock in softirq
> (en_rx.c/en_tx.c), so I don't see any deadlock risk in here..

Excellent question, and I chose to keep the spinlock.

That would be doable, only if we do not overwrite dev->stats.

Current code is :

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        spin_lock_bh(&priv->stats_lock);
        mlx4_en_fold_software_stats(dev);
        netdev_stats_to_stats64(stats, &dev->stats);
        spin_unlock_bh(&priv->stats_lock);

        return stats;
}

If you remove the spin_lock_bh() :

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        mlx4_en_fold_software_stats(dev); // possible races

        netdev_stats_to_stats64(stats, &dev->stats);

        return stats;
}

1) one mlx4_en_fold_software_stats(dev) could be preempted
on a CONFIG_PREEMPT kernel, or interrupted by long irqs.

2) Another cpu would also call mlx4_en_fold_software_stats(dev) while
   first cpu is busy.

3) Then when resuming first cpu/thread, part of the dev->stats fields
would be updated with 'old counters',
while another thread might have updated them with newer values.

4) A SNMP reader could then get counters that are not monotonically
increasing,
which would be confusing/buggy.

So removing the spinlock is doable, but it requires adding a new parameter
to mlx4_en_fold_software_stats() and calling netdev_stats_to_stats64()
before mlx4_en_fold_software_stats(dev):

static struct rtnl_link_stats64 *
mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
        struct mlx4_en_priv *priv = netdev_priv(dev);

        netdev_stats_to_stats64(stats, &dev->stats);

        // Passing a non NULL stats asks mlx4_en_fold_software_stats()
        // to not update dev->stats, but stats directly.

        mlx4_en_fold_software_stats(dev, stats);


        return stats;
}
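
The idea in this last variant can be sketched in plain userspace C. This
is only a model under stated assumptions: struct port, struct ring_stats,
fold_ring_stats() and get_stats() are hypothetical names, the "up" flag
stands in for priv->port_up, and the sketch is single-threaded (the real
driver still has to read per-ring counters safely against concurrent
writers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ring_stats {
	uint64_t packets;
	uint64_t bytes;
};

struct link_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
};

/* Hypothetical model of a port: rings exist only while it is up. */
struct port {
	bool up;
	size_t nr_rings;
	const struct ring_stats *rings;
	struct link_stats cached;	/* last folded totals, always valid */
};

/* Sum ring counters into the caller's buffer, never into shared state. */
static void fold_ring_stats(const struct port *p, struct link_stats *out)
{
	uint64_t packets = 0, bytes = 0;
	size_t i;

	assert(p && out);
	if (!p->up)
		return;	/* rings may be freed: leave *out untouched */

	for (i = 0; i < p->nr_rings; i++) {
		packets += p->rings[i].packets;
		bytes += p->rings[i].bytes;
	}
	out->rx_packets = packets;
	out->rx_bytes = bytes;
}

static void get_stats(const struct port *p, struct link_stats *out)
{
	*out = p->cached;		/* start from the cached copy */
	fold_ring_stats(p, out);	/* then overwrite the software fields */
}
```

Since each reader writes only to its own buffer, two readers that
interleave cannot leave a mix of old and new values in shared state,
which is the race described in steps 1) to 4) above.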


* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-12-01 16:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer
In-Reply-To: <1480602274.18162.285.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 01 Dec 2016 06:24:34 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-12-01 at 13:05 +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 30 Nov 2016 18:27:45 +0200
> > Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> >   
> > > >> All in all, this is risky business :),  the right way to go is to
> > > >> force the upper layer to use xmit-more and delay doorbells/use bulking
> > > >> but from the same context (xmit routine).  For example see
> > > >> Achiad's suggestion (attached in Jesper's response), he used stop
> > > >> queue to force the stack to queue up packets (TX bulking)
> > > >> which would set xmit-more and will use the next completion to
> > > >> release the "stopped" ring TXQ rather than hit the doorbell on
> > > >> behalf of it.    
> > > >
> > > > Well, you depend on having a higher level queue like a qdisc.
> > > >
> > > > Some users do not use a qdisc.
> > > > If you stop the queue, they no longer can send anything -> drops.
> > > >  
> > 
> > You do have a point that stopping the device might not be the best way
> > to create a push-back (to allow the stack to queue packets).
> > 
> >  netif_tx_stop_queue() / __QUEUE_STATE_DRV_XOFF
> > 
> >   
> > > In this case, i think they should implement their own bulking (pktgen
> > > is not a good example) but XDP can predict if it has more packets to
> > > xmit  as long as all of them fall in the same NAPI cycle.
> > > Others should try and do the same.  
> > 
> > I actually agree with Saeed here.
> > 
> > Maybe we can come up with another __QUEUE_STATE_xxx that informs the
> > upper layer what the driver is doing.  Then users not using a qdisc can
> > use this indication (like the qdisc could).  (qdisc-bypass users already
> > check the QUEUE_STATE flags e.g. via netif_xmit_frozen_or_drv_stopped).  
> 
> Can you explain how this is going to help trafgen using AF_PACKET with
> Qdisc bypass ?
> 
> Say trafgen wants to send 10 or 1000 packets back to back (as fast as
> possible)
>
> With my proposal, only the first is triggering a doorbell from
> ndo_start_xmit(). Following ones are driven by TX completion logic, or
> BQL if we can push packets faster than TX interrupt can be
> delivered/handled.
> 
> If you stop the queue (with yet another atomic operations to stop/unstop
> btw), packet_direct_xmit() will have to drop trafgen packets on the
> floor.

I think you misunderstood my concept[1].  I don't want to stop the
queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue; it
is just indicating that someone needs to flush/ring the doorbell.  Maybe
it needs another name, because it also indicates that the driver can see
that its TX queue is so busy that we don't need to call it immediately.
The qdisc layer can then choose to enqueue instead of doing direct xmit.

When the qdisc layer or trafgen/af_packet sees this indication, it knows it
should/must flush the queue when it doesn't have more work left.  Perhaps
through net_tx_action(), by registering itself and e.g. if qdisc_run()
is called and queue is empty then check if queue needs a flush. I would
also allow driver to flush and clear this bit.
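
A rough userspace model of this concept, to make the intended behavior
concrete. Everything here is hypothetical: struct txq, its counters, and
the rule "defer the doorbell while earlier packets are still in flight"
are assumptions for illustration, and a real implementation would need
atomic bit operations rather than a plain bool.

```c
#include <assert.h>
#include <stdbool.h>

struct txq {
	unsigned int posted;	/* descriptors written to the ring */
	unsigned int announced;	/* descriptors the NIC was told about */
	unsigned int completed;	/* descriptors the NIC has finished */
	unsigned int doorbells;	/* doorbell writes issued so far */
	bool flush_needed;	/* stand-in for __QUEUE_STATE_FLUSH_NEEDED */
};

static void ring_doorbell(struct txq *q)
{
	q->announced = q->posted;
	q->doorbells++;
	q->flush_needed = false;
}

/*
 * Driver side: post one packet. If the NIC is idle, ring the doorbell
 * now; otherwise defer it and tell the upper layer a flush is pending.
 * The queue is never stopped.
 */
static void txq_xmit(struct txq *q)
{
	assert(q);
	q->posted++;
	if (q->announced == q->completed)
		ring_doorbell(q);
	else
		q->flush_needed = true;
}

/*
 * Upper layer (qdisc or af_packet in the text): call this when there is
 * no more work left. Returns true if a doorbell was actually rung.
 */
static bool txq_flush_if_needed(struct txq *q)
{
	assert(q);
	if (!q->flush_needed)
		return false;
	ring_doorbell(q);
	return true;
}
```

In this model a burst of four packets costs two doorbells: one for the
first packet and one for the final flush. The driver could equally clear
the flag itself from its completion path.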

I just see it as an extension of your solution, as we still need the
driver to figure out when the doorbell/flush can be delayed.
p.s. don't be discouraged by this feedback, I'm just very excited and
happy that you are working on a solution in this area. This is a
problem area that I've not been able to solve myself for the last
approx 2 years. Keep up the good work!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lkml.kernel.org/r/20161130233015.3de95356@redhat.com


* Re: [PATCH v5 net-next 7/7] ARM64: dts: marvell: Add network support for Armada 3700
From: Marcin Wojtas @ 2016-12-01 16:05 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: David S. Miller, linux-kernel, netdev, Jisheng Zhang,
	Arnd Bergmann, Jason Cooper, Andrew Lunn, Sebastian Hesselbarth,
	Thomas Petazzoni, linux-arm-kernel@lists.infradead.org,
	Nadav Haklai, Dmitri Epshtein, Yelena Krivosheev
In-Reply-To: <f205231e640a2324b8f007073cf54b166a263ed4.1480542157.git-series.gregory.clement@free-electrons.com>

Hi Gregory,

2016-11-30 22:42 GMT+01:00 Gregory CLEMENT <gregory.clement@free-electrons.com>:
> Add neta nodes for network support both in device tree for the SoC and
> the board.
>
> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
> ---
>  arch/arm64/boot/dts/marvell/armada-3720-db.dts | 23 +++++++++++++++++++-
>  arch/arm64/boot/dts/marvell/armada-37xx.dtsi   | 23 +++++++++++++++++++-
>  2 files changed, 46 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/marvell/armada-3720-db.dts b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
> index 1372e9a6aaa4..c8b82e4145de 100644
> --- a/arch/arm64/boot/dts/marvell/armada-3720-db.dts
> +++ b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
> @@ -81,3 +81,26 @@
>  &pcie0 {
>         status = "okay";
>  };
> +
> +&mdio {
> +       status = "okay";
> +       phy0: ethernet-phy@0 {
> +               reg = <0>;
> +       };
> +
> +       phy1: ethernet-phy@1 {
> +               reg = <1>;
> +       };
> +};
> +
> +&eth0 {
> +       phy-mode = "rgmii-id";
> +       phy = <&phy0>;
> +       status = "okay";
> +};
> +
> +&eth1 {
> +       phy-mode = "rgmii-id";

Should be "sgmii".

Best regards,
Marcin

^ permalink raw reply

* Re: [PATCH v5 net-next 0/7] Support Armada 37xx SoC (ARMv8 64-bits) in mvneta driver
From: Marcin Wojtas @ 2016-12-01 16:07 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: David S. Miller, linux-kernel, netdev, Jisheng Zhang,
	Arnd Bergmann, Jason Cooper, Andrew Lunn, Sebastian Hesselbarth,
	Thomas Petazzoni, linux-arm-kernel@lists.infradead.org,
	Nadav Haklai, Dmitri Epshtein, Yelena Krivosheev
In-Reply-To: <cover.0270f6d2413a709521fe2c8c17fbebea6f2e78d1.1480542157.git-series.gregory.clement@free-electrons.com>

Hi Gregory,

Checked on a388-gp with and without HWBM, also both ports work on
a3700 (second one after changing to sgmii).

Tested-by: Marcin Wojtas <mw@semihalf.com>

Best regards,
Marcin

2016-11-30 22:42 GMT+01:00 Gregory CLEMENT <gregory.clement@free-electrons.com>:
> Hi,
>
> The Armada 37xx is a new ARMv8 SoC from Marvell using same network
> controller as the older Armada 370/38x/XP SoCs. This series adapts the
> driver in order to be able to use it on this new SoC. The main changes
> are:
>
> - 64-bits support: the first patches allow using the driver on a 64-bit
>   architecture.
>
> - MBUS support: the mbus configuration is different on Armada 37xx
>   from the older SoCs.
>
> - per cpu interrupt: Armada 37xx do not support per cpu interrupt for
>   the NETA IP, the non-per-CPU behavior was added back.
>
> The first patch is an optimization in the rx path in swbm mode.
> The second patch remove unnecessary allocation for HWBM.
> The first item is solved by patches 4 and 5.
> The 2 last items are solved by patch 6.
> In patch 7 the dt support is added.
>
> Beside Armada 37xx, this series have been again tested on Armada XP
> and Armada 38x (with Hardware Buffer Management and with Software
> Buffer Management).
>
> This is the 5th version of the series:
> - 1st version:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/469588.html
>
> - 2nd version:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470476.html
>
> - 3rd version:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470901.html
>
> - 4th version:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/471039.html
>
> Changelog:
> v4 -> v5:
>  - remove unnecessary cast in patch 3
>
> v3 -> v4:
>  - Adding new patch: "net: mvneta: do not allocate buffer in rxq init
>    with HWBM"
>
>  - Simplify the HWBM case in patch 3 as suggested by Marcin
>
> v2 -> v3:
>  - Adding patch 1 "Optimize rx path for small frame"
>
>  - Fix the kbuild error by moving the "phys_addr += pp->rx_offset_correction;"
>   line from patch 2 to patch 3 where rx_offset_correction is introduced.
>
>  - Move the memory allocation of the buf_virt_addr of the rxq to be
>    called by the probe function in order to avoid a memory leak.
>
> Thanks,
>
> Gregory
>
> Gregory CLEMENT (5):
>   net: mvneta: Optimize rx path for small frame
>   net: mvneta: Do not allocate buffer in rxq init with HWBM
>   net: mvneta: Use cacheable memory to store the rx buffer virtual address
>   net: mvneta: Only disable mvneta_bm for 64-bits
>   ARM64: dts: marvell: Add network support for Armada 3700
>
> Marcin Wojtas (2):
>   net: mvneta: Convert to be 64 bits compatible
>   net: mvneta: Add network support for Armada 3700 SoC
>
>  Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt |   7 +-
>  arch/arm64/boot/dts/marvell/armada-3720-db.dts                    |  23 +++++-
>  arch/arm64/boot/dts/marvell/armada-37xx.dtsi                      |  23 +++++-
>  drivers/net/ethernet/marvell/Kconfig                              |  10 +-
>  drivers/net/ethernet/marvell/mvneta.c                             | 344 +++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
>  5 files changed, 305 insertions(+), 102 deletions(-)
>
> base-commit: 436accebb53021ef7c63535f60bda410aa87c136
> --
> git-series 0.8.10

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Florian Westphal @ 2016-12-01 16:06 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Florian Westphal, netdev
In-Reply-To: <20161201145834.GA569@pox.localdomain>

Thomas Graf <tgraf@suug.ch> wrote:
> On 12/01/16 at 10:11am, Florian Westphal wrote:
> > Aside from this, XDP, like DPDK, is a kernel bypass.
> > You might say 'Its just stack bypass, not a kernel bypass!'.
> > But what does that mean exactly?  That packets can still be passed
> > onward to normal stack?
> > Bypass solutions like netmap can also inject packets back to
> > kernel stack again.
> 
> I have a fundamental issue with the approach of exporting packets into
> user space and reinjecting them: Once the packet leaves the kernel,
> any security guarantees are off. I have no control over what is
> running in user space and whether whatever listener up there has been
> compromised or not. To me, that's a no go, in particular for servers
> hosting multi tenant workloads. This is one of the main reasons why
> XDP, in particular in combination with BPF, is very interesting to me.

Funny, I see it exactly the other way around :)

To me packet coming from this "userspace injection" is no different than
a tun/tap, or any other packet coming from network.

I see no change or increase in attack surface.

^ permalink raw reply

* Re: [PATCH v5 net-next 0/7] Support Armada 37xx SoC (ARMv8 64-bits) in mvneta driver
From: Gregory CLEMENT @ 2016-12-01 16:09 UTC (permalink / raw)
  To: Marcin Wojtas
  Cc: David S. Miller, linux-kernel, netdev, Jisheng Zhang,
	Arnd Bergmann, Jason Cooper, Andrew Lunn, Sebastian Hesselbarth,
	Thomas Petazzoni, linux-arm-kernel@lists.infradead.org,
	Nadav Haklai, Dmitri Epshtein, Yelena Krivosheev
In-Reply-To: <CAPv3WKdWyS0wCVsRKR86qpMx4r8NN9=RmjM9oHFy2mmvaR5PAA@mail.gmail.com>

Hi Marcin,
 
On Thu, Dec 01 2016, Marcin Wojtas <mw@semihalf.com> wrote:

> Hi Gregory,
>
> Checked on a388-gp with and without HWBM, also both ports work on
> a3700 (second one after changing to sgmii).
>
> Tested-by: Marcin Wojtas <mw@semihalf.com>

Thanks, I am going to send a new version with your Tested-by and the dts
fix for the second port.

Gregory

>
> Best regards,
> Marcin
>
> 2016-11-30 22:42 GMT+01:00 Gregory CLEMENT <gregory.clement@free-electrons.com>:
>> Hi,
>>
>> The Armada 37xx is a new ARMv8 SoC from Marvell using same network
>> controller as the older Armada 370/38x/XP SoCs. This series adapts the
>> driver in order to be able to use it on this new SoC. The main changes
>> are:
>>
>> - 64-bits support: the first patches allow using the driver on a 64-bit
>>   architecture.
>>
>> - MBUS support: the mbus configuration is different on Armada 37xx
>>   from the older SoCs.
>>
>> - per cpu interrupt: Armada 37xx do not support per cpu interrupt for
>>   the NETA IP, the non-per-CPU behavior was added back.
>>
>> The first patch is an optimization in the rx path in swbm mode.
>> The second patch remove unnecessary allocation for HWBM.
>> The first item is solved by patches 4 and 5.
>> The 2 last items are solved by patch 6.
>> In patch 7 the dt support is added.
>>
>> Beside Armada 37xx, this series have been again tested on Armada XP
>> and Armada 38x (with Hardware Buffer Management and with Software
>> Buffer Management).
>>
>> This is the 5th version of the series:
>> - 1st version:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/469588.html
>>
>> - 2nd version:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470476.html
>>
>> - 3rd version:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470901.html
>>
>> - 4th version:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/471039.html
>>
>> Changelog:
>> v4 -> v5:
>>  - remove unnecessary cast in patch 3
>>
>> v3 -> v4:
>>  - Adding new patch: "net: mvneta: do not allocate buffer in rxq init
>>    with HWBM"
>>
>>  - Simplify the HWBM case in patch 3 as suggested by Marcin
>>
>> v2 -> v3:
>>  - Adding patch 1 "Optimize rx path for small frame"
>>
>>  - Fix the kbuild error by moving the "phys_addr += pp->rx_offset_correction;"
>>   line from patch 2 to patch 3 where rx_offset_correction is introduced.
>>
>>  - Move the memory allocation of the buf_virt_addr of the rxq to be
>>    called by the probe function in order to avoid a memory leak.
>>
>> Thanks,
>>
>> Gregory
>>
>> Gregory CLEMENT (5):
>>   net: mvneta: Optimize rx path for small frame
>>   net: mvneta: Do not allocate buffer in rxq init with HWBM
>>   net: mvneta: Use cacheable memory to store the rx buffer virtual address
>>   net: mvneta: Only disable mvneta_bm for 64-bits
>>   ARM64: dts: marvell: Add network support for Armada 3700
>>
>> Marcin Wojtas (2):
>>   net: mvneta: Convert to be 64 bits compatible
>>   net: mvneta: Add network support for Armada 3700 SoC
>>
>>  Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt |   7 +-
>>  arch/arm64/boot/dts/marvell/armada-3720-db.dts                    |  23 +++++-
>>  arch/arm64/boot/dts/marvell/armada-37xx.dtsi                      |  23 +++++-
>>  drivers/net/ethernet/marvell/Kconfig                              |  10 +-
>>  drivers/net/ethernet/marvell/mvneta.c                             | 344 +++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
>>  5 files changed, 305 insertions(+), 102 deletions(-)
>>
>> base-commit: 436accebb53021ef7c63535f60bda410aa87c136
>> --
>> git-series 0.8.10

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-12-01 16:08 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480607729.18162.311.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 2016-12-01 at 07:55 -0800, Eric Dumazet wrote:

> So removing the spinlock is doable, but needs to add a new parameter
> to mlx4_en_fold_software_stats() and call netdev_stats_to_stats64()
> before mlx4_en_fold_software_stats(dev)

Untested patch would be :

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 -
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   10 +----
 drivers/net/ethernet/mellanox/mlx4/en_port.c    |   24 +++++++++-----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    3 +
 4 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index d9c9f86a30df953fa555934c5406057dcaf28960..676050e352703cebe7fcaa5202a06496f7a5a0df 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -367,7 +367,7 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
 
 	spin_lock_bh(&priv->stats_lock);
 
-	mlx4_en_fold_software_stats(dev);
+	mlx4_en_fold_software_stats(dev, NULL);
 
 	for (i = 0; i < NUM_MAIN_STATS; i++, bitmap_iterator_inc(&it))
 		if (bitmap_iterator_test(&it))
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 091b904262bc7932d3edf99cf850affb23b9ce6e..6ee9e31e59c392cb88faedf9c541b3bc6d195228 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1321,13 +1321,9 @@ static void mlx4_en_tx_timeout(struct net_device *dev)
 static struct rtnl_link_stats64 *
 mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 {
-	struct mlx4_en_priv *priv = netdev_priv(dev);
-
-	spin_lock_bh(&priv->stats_lock);
-	mlx4_en_fold_software_stats(dev);
 	netdev_stats_to_stats64(stats, &dev->stats);
-	spin_unlock_bh(&priv->stats_lock);
-
+	/* Must be called after netdev_stats_to_stats64() */
+	mlx4_en_fold_software_stats(dev, stats);
 	return stats;
 }
 
@@ -1810,7 +1806,7 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 	netif_tx_disable(dev);
 
 	spin_lock_bh(&priv->stats_lock);
-	mlx4_en_fold_software_stats(dev);
+	mlx4_en_fold_software_stats(dev, NULL);
 	/* Set port as not active */
 	priv->port_up = false;
 	spin_unlock_bh(&priv->stats_lock);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 9166d90e732858610b1407fe85cbf6cbe27f5e0b..eea042a18e3cfba62745ece4ca673c2db967b9aa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -147,7 +147,8 @@ static unsigned long en_stats_adder(__be64 *start, __be64 *next, int num)
 	return ret;
 }
 
-void mlx4_en_fold_software_stats(struct net_device *dev)
+void mlx4_en_fold_software_stats(struct net_device *dev,
+				 struct rtnl_link_stats64 *stats)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
@@ -165,9 +166,13 @@ void mlx4_en_fold_software_stats(struct net_device *dev)
 		packets += READ_ONCE(ring->packets);
 		bytes   += READ_ONCE(ring->bytes);
 	}
-	dev->stats.rx_packets = packets;
-	dev->stats.rx_bytes = bytes;
-
+	if (stats) {
+		stats->rx_packets = packets;
+		stats->rx_bytes = bytes;
+	} else {
+		dev->stats.rx_packets = packets;
+		dev->stats.rx_bytes = bytes;
+	}
 	packets = 0;
 	bytes = 0;
 	for (i = 0; i < priv->tx_ring_num[TX]; i++) {
@@ -176,8 +181,13 @@ void mlx4_en_fold_software_stats(struct net_device *dev)
 		packets += READ_ONCE(ring->packets);
 		bytes   += READ_ONCE(ring->bytes);
 	}
-	dev->stats.tx_packets = packets;
-	dev->stats.tx_bytes = bytes;
+	if (stats) {
+		stats->tx_packets = packets;
+		stats->tx_bytes = bytes;
+	} else {
+		dev->stats.tx_packets = packets;
+		dev->stats.tx_bytes = bytes;
+	}
 }
 
 int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
@@ -208,7 +218,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 
 	spin_lock_bh(&priv->stats_lock);
 
-	mlx4_en_fold_software_stats(dev);
+	mlx4_en_fold_software_stats(dev, NULL);
 
 	priv->port_stats.rx_chksum_good = 0;
 	priv->port_stats.rx_chksum_none = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 20a936428f4a44c8ca0a7161855da310f9166b50..92dbb41f425b282e9ab7c8d534f091da0ba661c3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -755,7 +755,8 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq);
 int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
 int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
 
-void mlx4_en_fold_software_stats(struct net_device *dev);
+void mlx4_en_fold_software_stats(struct net_device *dev,
+				 struct rtnl_link_stats64 *stats);
 int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset);
 int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 

^ permalink raw reply related

* Re: [PATCH net v2] tipc: check minimum bearer MTU
From: Ben Hutchings @ 2016-12-01 16:11 UTC (permalink / raw)
  To: Michal Kubecek, Jon Maloy, Ying Xue
  Cc: David S. Miller, tipc-discussion, netdev, linux-kernel,
	Qian Zhang
In-Reply-To: <20161201110205.10749A0F33@unicorn.suse.cz>

On Thu, 2016-12-01 at 12:02 +0100, Michal Kubecek wrote:
[...] 
> +/* check if device MTU is sufficient for tipc headers */
> +static inline bool tipc_check_mtu(struct net_device *dev, unsigned int reserve)
> +{
> +	if (dev->mtu >= TIPC_MIN_BEARER_MTU + reserve)
> +		return false;
> +	netdev_warn(dev, "MTU too low for tipc bearer\n");
> +	return true;
> +}
[...]

The comment says "check if ... sufficient" but the return value
indicates the opposite.  Could you make these consistent?
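
One way to make the two agree is to name the helper after the condition
it reports. The following is only a user-space sketch with stand-in
types: struct fake_dev, MIN_BEARER_MTU (a placeholder value) and the
name mtu_bad() are assumptions for illustration, not the TIPC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Placeholder value for illustration; the real constant is defined by TIPC. */
#define MIN_BEARER_MTU 100u

/* Stand-in for struct net_device; only the fields used here. */
struct fake_dev {
	const char *name;
	unsigned int mtu;
};

/* Returns true when the device MTU is too small for the headers plus
 * 'reserve' bytes, so the return value matches the function name.
 */
static bool mtu_bad(const struct fake_dev *dev, unsigned int reserve)
{
	assert(dev != NULL);

	if (dev->mtu >= MIN_BEARER_MTU + reserve)
		return false;

	fprintf(stderr, "%s: MTU too low for tipc bearer\n", dev->name);
	return true;
}
```

A caller then reads naturally as "if (mtu_bad(dev, 0)) reject the
bearer". Keeping the original name and inverting the return value would
work equally well; the point is only that comment, name and return
value say the same thing.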

Other than that, this looks OK to me.  I haven't tested any version as
I don't know how to use TIPC.

Ben.

-- 
Ben Hutchings
A free society is one where it is safe to be unpopular. - Adlai
Stevenson


^ permalink raw reply

* Re: [PATCH 00/11] Netfilter fixes for net
From: David Miller @ 2016-12-01 16:13 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1480543045-3389-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 30 Nov 2016 22:57:14 +0100

> This is a large batch of Netfilter fixes for net, they are:
 ...
> I know is late but I think these are important, specifically the NAT
> bits, as they are mostly addressing fallout from recent changes. I also
> read there are chances to have -rc8, if that is the case, that would
> also give us a bit more time to test this.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: David Miller @ 2016-12-01 16:19 UTC (permalink / raw)
  To: tgraf; +Cc: fw, netdev
In-Reply-To: <20161201145834.GA569@pox.localdomain>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 1 Dec 2016 15:58:34 +0100

> The benefits of XDP for this use case are extremely obvious in combination
> with local applications which need to be protected. ntuple filters won't
> cut it. They are limited and subject to a certain rate at which they
> can be configured. Any serious mitigation will require stateful filtering
> with at least minimal L7 matching abilities and this is exactly where XDP
> will excel.

+1

Saying that ntuple filters can handle the early drop use case doesn't
take into consideration the nature of the tables (hundreds of
thousands of "evil" IP addresses), whether hardware can actually
handle that (it can't), and whether simple IP address matching is the
full extent of it (it isn't).

Most of the time when I hear anti-XDP rhetoric, it usually comes
from a crowd who for some reason feels threatened by the technology
and what it might replace and make useless.

That to me says that we are _exactly_ going down the right path.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/4] Adding PHY MDI(X) support
From: David Miller @ 2016-12-01 16:27 UTC (permalink / raw)
  To: Raju.Lakkaraju; +Cc: netdev, f.fainelli, Allan.Nielsen, andrew
In-Reply-To: <1480412809-6122-1-git-send-email-Raju.Lakkaraju@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Date: Tue, 29 Nov 2016 15:16:45 +0530

> I updated all review comments which were given by Andrew and Florian.
> 
> This series add support for PHY MDI(X), and implement it for MSCC phys.
> 
> Tested on Beaglebone Black with VSC 8531 PHY.

Series applied, thanks.

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Thomas Graf @ 2016-12-01 16:28 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Florian Westphal, netdev
In-Reply-To: <7e2be2fc-7c04-b333-59c7-43d4fcfcb451@stressinduktion.org>

On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
> XDP manipulates packets at free will and thus all security guarantees
> are off as well as in any user space solution.
> 
> Secondly user space provides policy, acl, more controlled memory
> protection, restartability and better debugability. If I had multi
> tenant workloads I would definitely put more complex "business/acl"
> logic into user space, so I can make use of LSM and other features to
> especially prevent a network facing service to attack the tenants. If
> stuff gets put into the kernel you run user controlled code in the
> kernel exposing a much bigger attack vector.
> 
> What use case do you see in XDP specifically e.g. for container networking?

DDOS mitigation to protect distributed applications in large clusters.
Relying on CDN works to protect API gateways and frontends (as long as
they don't throw you out of their network) but offers no protection
beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
level and allowing the mitigation capability to scale up with the number
of servers is natural and cheap.

> > I agree with you if the LB is a software based appliance in either a
> > dedicated VM or on dedicated baremetal.
> > 
> > The reality is turning out to be different in many cases though, LB
> > needs to be performed not only for north south but east west as well.
> > So even if I would handle LB for traffic entering my datacenter in user
> > space, I will need the same LB for packets from my applications and
> > I definitely don't want to move all of that into user space.
> 
> The open question to me is why is programmability needed here.
> 
> Look at the discussion about ECMP and consistent hashing. It is not very
> easy to actually write this code correctly. Why can't we just put C code
> into the kernel that implements this once and for all and let user space
> update the policies?

Whatever LB logic is put in place with native C code now is unlikely to
be the logic we need in two years. We can't really predict the future.
If we could, networking would have been done long ago and we would all
be working on self-eating ice cream now.

> Load balancers have to deal correctly with ICMP packets, e.g. they even
> have to be duplicated to every ECMP route. This seems to be problematic
> to do in eBPF programs due to looping constructs so you end up with
> complicated user space anyway.

Feel free to implement such complex LBs in user space or natively. It is
not required for the majority of use cases. The most popular LBs for
application load balancing have no idea of ECMP and themselves rely on
ECMP-aware routers for redundancy.

^ permalink raw reply

* Re: [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Thomas Graf @ 2016-12-01 16:28 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, alexei.starovoitov, tom, roopa, hannes
In-Reply-To: <584012CC.4030004@iogearbox.net>

On 12/01/16 at 01:08pm, Daniel Borkmann wrote:
> For the verifier change in may_access_direct_pkt_data(), would be
> great if you could later on follow up with a selftest-suite case,
> one where BPF_PROG_TYPE_LWT_IN/OUT prog tries to write and fails,
> and one where BPF_PROG_TYPE_LWT_IN/OUT prog uses pkt data to pass
> to helpers, for example, so that we can keep testing it when future
> changes in that area are made. Thanks.

Good idea, will do.

^ permalink raw reply

* Re: [Patch net-next] audit: remove useless synchronize_net()
From: David Miller @ 2016-12-01 16:29 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, rgb
In-Reply-To: <1480439696-21818-1-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 29 Nov 2016 09:14:56 -0800

> netlink kernel socket is protected by refcount, not RCU.
> Its rcv path is neither protected by RCU. So the synchronize_net()
> is just pointless.
> 
> Cc: Richard Guy Briggs <rgb@redhat.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-12-01 16:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480607729.18162.311.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Dec 1, 2016 at 5:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-01 at 17:38 +0200, Saeed Mahameed wrote:
>
>>
>> Hi Eric, Thanks for the patch, I already acked it.
>
> Thanks !
>
>>
>> I have one educational question (not related to this patch, but
>> related to stats reading in general).
>> I was wondering why do we need to disable bh every time we read stats
>> "spin_lock_bh" ? is it essential ?
>>
>> I checked and in mlx4 we don't hold stats_lock in softirq
>> (en_rx.c/en_tx.c), so I don't see any deadlock risk in here..
>
> Excellent question, and I chose to keep the spinlock.
>
> That would be doable, only if we do not overwrite dev->stats.
>
> Current code is :
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         spin_lock_bh(&priv->stats_lock);
>         mlx4_en_fold_software_stats(dev);
>         netdev_stats_to_stats64(stats, &dev->stats);
>         spin_unlock_bh(&priv->stats_lock);
>
>         return stats;
> }
>
> If you remove the spin_lock_bh() :
>
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         mlx4_en_fold_software_stats(dev); // possible races
>
>         netdev_stats_to_stats64(stats, &dev->stats);
>
>         return stats;
> }
>
> 1) one mlx4_en_fold_software_stats(dev) could be preempted
> on a CONFIG_PREEMPT kernel, or interrupted by long irqs.
>
> 2) Another cpu would also call mlx4_en_fold_software_stats(dev) while
>    first cpu is busy.
>
> 3) Then when resuming first cpu/thread, part of the dev->stats fieds
> would be updated with 'old counters',
> while another thread might have updated them with newer values.
>
> 4) A SNMP reader could then get counters that are not monotonically
> increasing,
> which would be confusing/buggy.
>
> So removing the spinlock is doable, but needs to add a new parameter
> to mlx4_en_fold_software_stats() and call netdev_stats_to_stats64()
> before mlx4_en_fold_software_stats(dev)
>
> static struct rtnl_link_stats64 *
> mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
> {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
>
>         netdev_stats_to_stats64(stats, &dev->stats);
>
>         // Passing a non NULL stats asks mlx4_en_fold_software_stats()
>         // to not update dev->stats, but stats directly.
>
>         mlx4_en_fold_software_stats(dev, stats)
>
>
>         return stats;
> }
>
>

Thanks for the detailed answer!

BTW you went 5 steps ahead of my original question :), so far you
already have a patch without locking at all (really impressive).

What I wanted to ask originally was about the "_bh": I didn't
mean to completely remove the "spin_lock_bh".
I meant, what happens if we replace "spin_lock_bh" with "spin_lock",
without disabling bh?
I guess a raw "spin_lock" handles points (2 to 4) from above, but it
won't handle long irqs.
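
For reference, the ordering used in the lock-free variant quoted above
(copy the shared stats first, then fold the ring counters into the
caller's buffer) can be sketched in user-space C. This is only an
illustration: struct device, struct ring, NUM_RINGS and the function
names are stand-ins, and the relaxed atomic loads are only roughly
analogous to READ_ONCE() in the driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 4	/* hypothetical ring count */

/* Per-ring counters, updated by the datapath. */
struct ring {
	_Atomic uint64_t packets;
	_Atomic uint64_t bytes;
};

struct stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t rx_errors;	/* not folded, comes from the shared copy */
};

struct device {
	struct ring rx_ring[NUM_RINGS];
	struct stats stats;	/* shared copy, plays the role of dev->stats */
};

/* Sum the per-ring counters. With a non-NULL 'out' the result goes to
 * the caller's private buffer and the shared copy is left untouched.
 */
static void fold_software_stats(struct device *dev, struct stats *out)
{
	struct stats *dst = out ? out : &dev->stats;
	uint64_t packets = 0, bytes = 0;

	for (size_t i = 0; i < NUM_RINGS; i++) {
		packets += atomic_load_explicit(&dev->rx_ring[i].packets,
						memory_order_relaxed);
		bytes += atomic_load_explicit(&dev->rx_ring[i].bytes,
					      memory_order_relaxed);
	}
	dst->rx_packets = packets;
	dst->rx_bytes = bytes;
}

/* Reader path: copy first, then overwrite only the folded fields. */
static void get_stats(struct device *dev, struct stats *out)
{
	assert(out != NULL);
	*out = dev->stats;
	fold_software_stats(dev, out);
}
```

Because each reader writes only into its own buffer, a reader that gets
preempted cannot later overwrite the shared copy with older sums, which
is the race described in points 1 to 4. Paths that pass NULL still
update the shared copy and would keep their existing locking.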

^ permalink raw reply

* Re: pull request (net): ipsec 2016-12-01
From: David Miller @ 2016-12-01 16:36 UTC (permalink / raw)
  To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <1480592692-3653-1-git-send-email-steffen.klassert@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 1 Dec 2016 12:44:49 +0100

> 1) Change the error value when someone tries to run 32bit
>    userspace on a 64bit host from -ENOTSUPP to the userspace
>    exported -EOPNOTSUPP. Fix from Yi Zhao.
> 
> 2) On inbound, ESN sequence numbers are already in network
>    byte order. So don't try to convert it again, this fixes
>    integrity verification for ESN. Fixes from Tobias Brunner.
> 
> Please pull or let me know if there are problems.

Pulled, thanks Steffen.

^ permalink raw reply

* Re: [PATCH net] RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
From: Santosh Shilimkar @ 2016-12-01 16:40 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev; +Cc: davem
In-Reply-To: <1480596283-204869-1-git-send-email-sowmini.varadhan@oracle.com>

On 12/1/2016 4:44 AM, Sowmini Varadhan wrote:
> If some error is encountered in rds_tcp_init_net, make sure to
> unregister_netdevice_notifier(), else we could trigger a panic
> later on, when the modprobe from a netns fails.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: pull request (net-next): ipsec-next 2016-12-01
From: David Miller @ 2016-12-01 16:45 UTC (permalink / raw)
  To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <1480592885-3903-1-git-send-email-steffen.klassert@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 1 Dec 2016 12:48:04 +0100

>   git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master
> 
> for you to fetch changes up to 2258d927a691ddd2ab585adb17ea9f96e89d0638:
> 
>   xfrm: remove unused helper (2016-09-30 08:20:56 +0200)

Hmmm, when I try to pull I don't get anything:

[davem@dhcp-10-15-49-210 net-next]$ git pull --no-ff git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master
From git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
 * branch            master     -> FETCH_HEAD
Already up-to-date.

^ permalink raw reply

* [PATCH v7 net-next 0/6] net: Add bpf support for sockets
From: David Ahern @ 2016-12-01 16:48 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern

The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.

For example:
 1. configure vrf (e.g., using ifupdown2)
        auto eth0
        iface eth0 inet dhcp
            vrf mgmt

        auto mgmt
        iface mgmt
            vrf-table auto

 2. configure cgroup
        mount -t cgroup2 none /tmp/cgroupv2
        mkdir /tmp/cgroupv2/mgmt
        test_cgrp2_sock /tmp/cgroupv2/mgmt 15

 3. set shell into cgroup (e.g., can be done at login using pam)
        echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs

At this point all commands run in the shell (e.g., apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

This patchset also exports the socket family, type and protocol as
read-only allowing bpf filters to deny a process in a cgroup the ability
to open specific types of AF_INET or AF_INET6 sockets.

v7
- comments from Alexei

v6
- add export of socket family, type and protocol


David Ahern (6):
  bpf: Refactor cgroups code in prep for new type
  bpf: Add new cgroup attach type to enable sock modifications
  samples: bpf: add userspace example for modifying sk_bound_dev_if
  bpf: Add support for reading socket family, type, protocol
  samples/bpf: Update bpf loader for cgroup section names
  samples/bpf: add userspace example for prohibiting sockets

 include/linux/bpf-cgroup.h      | 60 +++++++++++++++++------------
 include/net/sock.h              | 15 ++++++++
 include/uapi/linux/bpf.h        |  9 +++++
 kernel/bpf/cgroup.c             | 43 ++++++++++++++++++---
 kernel/bpf/syscall.c            | 33 +++++++++-------
 net/core/filter.c               | 83 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c              | 12 +++++-
 net/ipv6/af_inet6.c             |  8 ++++
 samples/bpf/Makefile            |  6 +++
 samples/bpf/bpf_load.c          | 14 +++++--
 samples/bpf/bpf_load.h          |  1 +
 samples/bpf/sock_flags_kern.c   | 44 ++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.c   | 83 +++++++++++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock.sh  | 47 +++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock2.c  | 66 ++++++++++++++++++++++++++++++++
 samples/bpf/test_cgrp2_sock2.sh | 81 ++++++++++++++++++++++++++++++++++++++++
 16 files changed, 559 insertions(+), 46 deletions(-)
 create mode 100644 samples/bpf/sock_flags_kern.c
 create mode 100644 samples/bpf/test_cgrp2_sock.c
 create mode 100755 samples/bpf/test_cgrp2_sock.sh
 create mode 100644 samples/bpf/test_cgrp2_sock2.c
 create mode 100755 samples/bpf/test_cgrp2_sock2.sh

-- 
2.1.4

^ permalink raw reply

