* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-02 16:59 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jesper Dangaard Brouer, Thomas Graf, Florian Westphal,
Linux Kernel Network Developers
In-Reply-To: <163220cc-a91b-5627-ea93-cd43dc0079c6@stressinduktion.org>
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>>>>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>>>>> exclusively focussed on DDOS in light of the recent attack against
>>>>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>>>>> presentation by Nick Sullivan
>>>>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>>>>> alluded to some implementation of DDOS mitigation. In particular, on
>>>>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
>>>>> numbers he gave we're based in iptables+BPF and that was a whole
>>>>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>>>>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>>>>> the best we can do the Internet is in a world hurt. DDOS mitigation
>>>>> alone is probably a sufficient motivation to look at XDP. We need
>>>>> something that drops bad packets as quickly as possible when under
>>>>> attack, we need this to be integrated into the stack, we need it to be
>>>>> programmable to deal with the increasing savvy of attackers, and we
>>>>> don't want to be forced to be dependent on HW solutions. This is why
>>>>> we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> ballpark.
>>
>>>> I totally understand that. But in my reply to David in this thread I
>>>> mentioned DNS apex processing as being problematic which is actually
>>>> being referred in your linked slide deck on page 9 ("What do floods look
>>>> like") and the problematic of parsing DNS packets in XDP due to string
>>>> processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually use eBPF to
>> do this exact filtering, and (so-far) eBPF for parsing DNS have been
>> sufficient for them.
>
> You are talking about this code on the following slides (I actually
> transcribed it for you here and disassembled):
>
> l0: ld #0x14
> l1: ldxb 4*([0]&0xf)
> l2: add x
> l3: tax
> l4: ld [x+0]
> l5: jeq #0x7657861, l6, l13
> l6: ld [x+4]
> l7: jeq #0x6d706c65, l8, l13
> l8: ld [x+8]
> l9: jeq #0x3636f6d, l10, l13
> l10: ldb [x+12]
> l11: jeq #0, l12, l13
> l12: ret #0x1
> l13: ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to
> dynamically generate BPF programs and insert and delete them from
> chains, do intersection or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a
> different one and not comparable to this at all.
>
I don't understand this comment about complexity especially in regards
to the idea of offloading u32 to hardware. Relying on hardware to do
anything always leads to more complexity than an equivalent SW
implementation for the same functionality. The only reason we ever use
a hardware mechanisms is if it gives *significantly* better
performance. If the performance difference isn't there then doing
things in SW is going to be the better path (as we see in XDP).
Tom
> Bye,
> Hannes
>
^ permalink raw reply
* Re: [PATCH net-next v4 0/4] bpf: BPF for lightweight tunnel encapsulation
From: David Miller @ 2016-12-02 16:57 UTC (permalink / raw)
To: tgraf; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
In-Reply-To: <cover.1480522144.git.tgraf@suug.ch>
From: Thomas Graf <tgraf@suug.ch>
Date: Wed, 30 Nov 2016 17:10:07 +0100
> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure.
Nice work, applied, thanks Thomas.
^ permalink raw reply
* Re: [PATCH net-next] net/mlx5e: skip loopback selftest with !CONFIG_INET
From: David Miller @ 2016-12-02 16:56 UTC (permalink / raw)
To: arnd; +Cc: saeedm, matanb, leonro, kamalh, netdev, linux-rdma, linux-kernel
In-Reply-To: <20161130210552.1756085-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 30 Nov 2016 22:05:39 +0100
> When CONFIG_INET is disabled, the new selftest results in a link
> error:
>
> drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function `mlx5e_test_loopback':
> en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to `ip_send_check'
> en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to `udp4_hwcsum'
>
> This hides the specific test in that configuration.
>
> Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Applied, thanks Arnd.
^ permalink raw reply
* Re: arp_filter and IPv6 ND
From: Hannes Frederic Sowa @ 2016-12-02 16:45 UTC (permalink / raw)
To: Saku Ytti; +Cc: netdev
In-Reply-To: <CAAeewD_2tfNHhWmtwXb_28DicmipReimRvGqv1Q6PQ0AjkBPGw@mail.gmail.com>
Hello,
On 02.12.2016 16:42, Saku Ytti wrote:
> On 2 December 2016 at 16:08, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>
> Hey,
>
>> May I ask why you want to turn it off?
>
> Certainly. I don't want device to answer with link address for L3
> address it does not have on the link. In my case it triggers this bug
> https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop
Okay, that should not happen.
Redirects and neighbor advertisements are the only way how you can
announce prefixes on-link. Unfortunately historically we automatically
add device routes for prefixes, too. We can't change this anymore but
this is wrong.
> In this particular case, for one reason or another my Cisco device
> would have ND entry for Linux loopback pointing to an interface with
> completely different network. Which itself would be just weird, but
> combined with weird behaviour of Cisco it actually causes the loopback
> route advertised by BGP not to be installed. If the ND entry didn't
> exist, the BGP route would be installed.
Hmmmm... Loopback route advertised by BGP? Do you use filter to get rid
of that on your AS-border? So you probably don't use an IGP? Do you use
next-hop-self attribute on your neighbor in that direction? BGP in
general doesn't lead to ND entry installs, protocols like IS-IS afair
can short circuit here.
Hmm, I would keep the Loopback announcements out of the BGP.
> I don't really even know why the ND entry exists, all I can think of
> is that Linux must have sent gratuitous reply, because I don't se why
> Cisco would have tried to discover it.
>
> Expected behaviour is that the loopback/128 BGP route resolves to
> on-link next-hop, and on-link next hop is then ND'd. Observed
> behaviour is that loopback/128 BGP route also appears in ND cache.
Yep, exactly.
>> In IPv6 this depends on the scope. In IPv4 this concept doesn't really
>> exist.
>>
>> Please notice that in IPv4 arp_filter does not necessarily mean that the
>> system is operating in strong end system mode but you end up in an
>> hybrid clone where arp is acting strong but routing not and thus you
>> also have to add fib rules to simulate that.
>
> It's just very peculiar behaviour to have ARP or ND entries on a
> interface where given subnet does not exist, it rudimentarily causes
> difficult to troubleshoot problems and is surprising/unexpected
> behaviour.
For enterprise and cloud stuff it is certainly very surprising, as some
isolations don't work as expected. OTOH it is really easy to build up
home networks and things are more plug and play.
> Of course well behaving device wouldn't accept such replies, because
> it itself could be attack vector (imagine me telling you 8.8.8.8 is on
> the link, or worse, your bank).
>
> I'm curious, why does this behaviour exist? When is this desirable?
> I've never seen any other device than Linux behave like this, and when
> ever I've heard about the problem, I've only seen surprised faces that
> it does behave like this.
I don't feel comfortable to answer that, just some thoughts...
Some RFCs require that for some router implementations (CPE), on the
other hand weak end model in Linux was probably inherited by IPv4. The
addition of duplicate address detection (which of course only makes
sense in strong end systems) to IPv6, basically shows that IPv6 is more
or less designed to be a strong end system model.
Anyway, a patch to suppress ndisc requests on those interfaces will
probably be accepted.
For unicast reverse filtering e.g. there is actually no sysctl available
anymore, instead you are supposed to install a netfilter rule to handle
this, which automatically takes care of this.
Bye,
Hannes
^ permalink raw reply
* Re: [PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
From: Grygorii Strashko @ 2016-12-02 16:45 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
linux-omap
In-Reply-To: <20161202110321.GA1213@khorivan>
On 12/02/2016 05:03 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:27PM -0600, Grygorii Strashko wrote:
>> The currently processing cpdma descriptor with EOQ flag set may
>> contain two values in Next Descriptor Pointer field:
>> - valid pointer: means CPDMA missed addition of new desc in queue;
> It shouldn't happen in normal circumstances, right?
it might happen, because desc push compete with desc pop.
You can check stats values:
chan->stats.misqueued
chan->stats.requeue
under different types of net-loads.
TRM:
"
If the pNext pointer is initially NULL, and more packets need to be queued for transmit, the software
application may alter this pointer to point to a newly appended descriptor. The EMAC will use the new
pointer value and proceed to the next descriptor unless the pNext value has already been read. In this
latter case, the transmitter will halt on the transmit channel in question, and the software application may
restart it at that time. The software can detect this case by checking for an end of queue (EOQ) condition
flag on the updated packet descriptor when it is returned by the EMAC.
"
> So, why it happens only for egress channels? And Does that mean
> there is some resynchronization between submit and process function,
> or this is h/w issue?
no hw issues. this patch just removes one unnecessary I/O access
>
>> - null: no more descriptors in queue.
>> In the later case, it's not required to write to HDP register, but now
>> CPDMA does it.
>>
>> Hence, add additional check for Next Descriptor Pointer != null in
>> cpdma_chan_process() function before writing in HDP register.
>>
>> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
>> ---
>> drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
>> index 0924014..379314f 100644
>> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
>> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
>> @@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan *chan)
>> chan->count--;
>> chan->stats.good_dequeue++;
>>
>> - if (status & CPDMA_DESC_EOQ) {
>> + if ((status & CPDMA_DESC_EOQ) && chan->head) {
>> chan->stats.requeue++;
>> chan_write(chan, hdp, desc_phys(pool, chan->head));
>> }
>> --
>> 2.10.1
>>
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error
From: Keller, Jacob E @ 2016-12-02 16:41 UTC (permalink / raw)
To: gpiccoli@linux.vnet.ibm.com, intel-wired-lan@lists.osuosl.org,
Kirsher, Jeffrey T, alexander.duyck@gmail.com
Cc: netdev@vger.kernel.org, Fujinaka, Todd
In-Reply-To: <58416F5C.20602@linux.vnet.ibm.com>
On Fri, 2016-12-02 at 10:55 -0200, Guilherme G. Piccoli wrote:
> On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote:
> > Whenever the igb driver detects the result of a read operation
> > returns
> > a value composed only by F's (like 0xFFFFFFFF), it will detach the
> > net_device, clear the hw_addr pointer and warn to the user that
> > adapter's
> > link is lost - those steps happen on igb_rd32().
> >
> > In case a PCI error happens on Power architecture, there's a
> > recovery
> > mechanism called EEH, that will reset the PCI slot and call
> > driver's
> > handlers to reset the adapter and network functionality as well.
> >
> > We observed that once hw_addr is NULL after the error is detected
> > on
> > igb_rd32(), it's never assigned back, so in the process of
> > resetting
> > the network functionality we got a NULL pointer dereference in both
> > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to
> > avoid
> > such bug, this patch re-assigns the hw_addr value in the slot_reset
> > handler.
> >
> > Reported-by: Anthony H. Thai <ahthai@us.ibm.com>
> > Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
> > ---
> > drivers/net/ethernet/intel/igb/igb_main.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index edc9a6a..136ee9e 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -7878,6 +7878,11 @@ static pci_ers_result_t
> > igb_io_slot_reset(struct pci_dev *pdev)
> > pci_enable_wake(pdev, PCI_D3hot, 0);
> > pci_enable_wake(pdev, PCI_D3cold, 0);
> >
> > + /* In case of PCI error, adapter lose its HW
> > address
> > + * so we should re-assign it here.
> > + */
> > + hw->hw_addr = adapter->io_addr;
> > +
> > igb_reset(adapter);
> > wr32(E1000_WUS, ~0);
> > result = PCI_ERS_RESULT_RECOVERED;
> >
>
> Ping?
>
> Sorry to annoy, any news on this?
> Thanks in advance!
>
> Cheers,
>
>
> Guilherme
>
This seems reasonable. It's similar to what fm10k driver does under
this circumstance.
Thanks,
Jake
^ permalink raw reply
* [PATCH net-next] udp: be less conservative with sock rmem accounting
From: Paolo Abeni @ 2016-12-02 16:35 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Jesper Dangaard Brouer
Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this cause a performance regression
for some (small) values of rcvbuf.
This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv4/udp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..16d88ba 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1205,14 +1205,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
* queue is full; always allow at least a packet
*/
rmem = atomic_read(&sk->sk_rmem_alloc);
- if (rmem && (rmem + size > sk->sk_rcvbuf))
+ if (rmem > sk->sk_rcvbuf)
goto drop;
/* we drop only if the receive buf is full and the receive
* queue contains some other skb
*/
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
- if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+ if (rmem > (size + sk->sk_rcvbuf))
goto uncharge_drop;
spin_lock(&list->lock);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 1/1] ax25: Fix segfault when receiving an iframe with net2kiss loaded
From: David Miller @ 2016-12-02 16:05 UTC (permalink / raw)
To: basil; +Cc: jreuter, ralf, linux-hams, netdev, linux-kernel, stable, ed,
mcdermj
In-Reply-To: <20161130111525.13f74728@brox.localnet>
From: Basil Gunn <basil@pacabunga.com>
Date: Wed, 30 Nov 2016 11:15:25 -0800
> AX.25 uses sock_queue_rcv_skb() to queue an iframe received packet.
> This routine writes NULL to the socket buffer device structure
> pointer. The socket buffer is subsequently serviced by
> __netif_receiv_skb_core() which dereferences the device structure
> pointer & segfaults.
>
> The fix puts the ax25 device structure pointer back in the socket
> buffer struct after sock_queue_rcv_skb() is called.
You cannot do this.
We have no reference count held on the device object, therefore once
access to this SKB leaves the receive packet processing path and thus
the RCU protected region, we must set skb->dev to NULL.
Queueing the SKB to the socket causes the SKB to leave the receive
processing path.
You will have to find another way to propagate this device pointer
to the code that needs it.
^ permalink raw reply
* [PATCH] batman-adv: Check for alloc errors when preparing TT local data
From: Simon Wunderlich @ 2016-12-02 16:13 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
In-Reply-To: <20161202161323.5990-1-sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
From: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the
new TVLV block. The caller is informed about this problem with the returned
length of 0. Not checking this value results in an invalid memory access
when either tt_data or tt_change is accessed.
Reported-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
Signed-off-by: Simon Wunderlich <sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
---
net/batman-adv/translation-table.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 7f66309..0dc85eb 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -3282,7 +3282,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv *bat_priv,
&tvlv_tt_data,
&tt_change,
&tt_len);
- if (!tt_len)
+ if (!tt_len || !tvlv_len)
goto unlock;
/* Copy the last orig_node's OGM buffer */
@@ -3300,7 +3300,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv *bat_priv,
&tvlv_tt_data,
&tt_change,
&tt_len);
- if (!tt_len)
+ if (!tt_len || !tvlv_len)
goto out;
/* fill the rest of the tvlv with the real TT entries */
--
2.10.2
^ permalink raw reply related
* [PATCH] pull request for net: batman-adv 2016-12-02
From: Simon Wunderlich @ 2016-12-02 16:13 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
Hi David,
here is another bugfix which we would like to see integrated into net,
if this is possible now.
Please pull or let me know of any problem!
Thank you,
Simon
The following changes since commit e13258f38e927b61cdb5f4ad25309450d3b127d1:
batman-adv: Detect missing primaryif during tp_send as error (2016-11-04 12:27:39 +0100)
are available in the git repository at:
git://git.open-mesh.org/linux-merge.git tags/batadv-net-for-davem-20161202
for you to fetch changes up to c2d0f48a13e53b4747704c9e692f5e765e52041a:
batman-adv: Check for alloc errors when preparing TT local data (2016-12-02 10:46:59 +0100)
----------------------------------------------------------------
Here is another batman-adv bugfix:
- fix checking for failed allocation of TVLV blocks in TT local data,
by Sven Eckelmann
----------------------------------------------------------------
Sven Eckelmann (1):
batman-adv: Check for alloc errors when preparing TT local data
net/batman-adv/translation-table.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
^ permalink raw reply
* [PATCH net-next] net_sched: gen_estimator: account for timer drifts
From: Eric Dumazet @ 2016-12-02 16:11 UTC (permalink / raw)
To: David Miller; +Cc: netdev
From: Eric Dumazet <edumazet@google.com>
Under heavy stress, timer used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.
Lets remember what was the last scheduled jiffies so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/gen_estimator.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index cad8e791f28e..0993844faeea 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -78,8 +78,7 @@
#define EST_MAX_INTERVAL 5
-struct gen_estimator
-{
+struct gen_estimator {
struct list_head list;
struct gnet_stats_basic_packed *bstats;
struct gnet_stats_rate_est64 *rate_est;
@@ -96,8 +95,8 @@ struct gen_estimator
struct rcu_head head;
};
-struct gen_estimator_head
-{
+struct gen_estimator_head {
+ unsigned long next_jiffies;
struct timer_list timer;
struct list_head list;
};
@@ -146,8 +145,15 @@ static void est_timer(unsigned long arg)
spin_unlock(e->stats_lock);
}
- if (!list_empty(&elist[idx].list))
- mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
+ if (!list_empty(&elist[idx].list)) {
+ elist[idx].next_jiffies += ((HZ/4) << idx);
+
+ if (unlikely(time_after_eq(jiffies, elist[idx].next_jiffies))) {
+ /* Ouch... timer was delayed. */
+ elist[idx].next_jiffies = jiffies + 1;
+ }
+ mod_timer(&elist[idx].timer, elist[idx].next_jiffies);
+ }
rcu_read_unlock();
}
@@ -251,9 +257,10 @@ int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
setup_timer(&elist[idx].timer, est_timer, idx);
}
- if (list_empty(&elist[idx].list))
- mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
-
+ if (list_empty(&elist[idx].list)) {
+ elist[idx].next_jiffies = jiffies + ((HZ/4) << idx);
+ mod_timer(&elist[idx].timer, elist[idx].next_jiffies);
+ }
list_add_rcu(&est->list, &elist[idx].list);
gen_add_node(est);
spin_unlock_bh(&est_tree_lock);
^ permalink raw reply related
* Re: [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
From: David Miller @ 2016-12-02 16:09 UTC (permalink / raw)
To: daniel; +Cc: alexei.starovoitov, saeedm, kubakici, Yuval.Mintz, netdev
In-Reply-To: <b70c47e0284823e6a5db600ae75c01eac4cf7922.1480538565.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 30 Nov 2016 22:16:06 +0100
> After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
> need to hold rcu_read_lock() already to make sure BPF program doesn't
> get released in the background.
>
> Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
> Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
> in XDP supported drivers and to keep the typecheck on the context intact.
> For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
> just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
> out of the helper. When the driver gets atomic replace support, this will
> move to call-sites eventually.
>
> mlx5 needs actual fixing as it has the same issue as described already in
> 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> that is, we're under RCU bh at this time, BPF programs are released via
> call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
> read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
> reset.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
> ( Also here net-next is just fine, imho. )
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue
From: David Miller @ 2016-12-02 15:56 UTC (permalink / raw)
To: soheil.kdev; +Cc: netdev, edumazet, willemb, maze, hannes, soheil
In-Reply-To: <1480532468-1610-1-git-send-email-soheil.kdev@gmail.com>
From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Wed, 30 Nov 2016 14:01:08 -0500
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Only when ICMP packets are enqueued onto the error queue,
> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
> in sock_dequeue_err_skb), a subsequent error queue read
> would set sk_err to the next error on the queue, or 0 if empty.
> As no error types other than ICMP set this field, sk_err should
> not be modified upon dequeuing them.
>
> Only for ICMP errors, reset the (racy) sk_err. Some applications,
> like traceroute, rely on it and go into a futile busy POLLERR
> loop otherwise.
>
> In principle, sk_err has to be set while an ICMP error is queued.
> Testing is_icmp_err_skb(skb_next) approximates this without
> requiring a full queue walk. Applications that receive both ICMP
> and other errors cannot rely on this legacy behavior, as other
> errors do not set sk_err in the first place.
>
> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
Applied, thanks.
^ permalink raw reply
* RE: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Duyck, Alexander H @ 2016-12-02 15:52 UTC (permalink / raw)
To: Robert Shearman, Alexander Duyck; +Cc: David Miller, Netdev
In-Reply-To: <f470ac98-9b23-1374-9c93-b33ea0ae149c@brocade.com>
> -----Original Message-----
> From: Robert Shearman [mailto:rshearma@brocade.com]
> Sent: Friday, December 2, 2016 5:54 AM
> To: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: David Miller <davem@davemloft.net>; Netdev <netdev@vger.kernel.org>;
> Duyck, Alexander H <alexander.h.duyck@intel.com>
> Subject: Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when
> not required
>
> On 01/12/16 18:29, Alexander Duyck wrote:
> > On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman
> <rshearma@brocade.com> wrote:
> >> On 29/11/16 23:14, Alexander Duyck wrote:
> >>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman
> >>> <rshearma@brocade.com>
> >>>>
> >>>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix
> >>>> length")
> >>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> >>>
> >>>
> >>> So I am not entirely convinced this is a regression, I was wondering
> >>> what your numbers looked like before the patch you are saying this
> >>> fixes? I recall I fixed a number of issues leading up to this so I
> >>> am just curious.
> >>
> >>
> >> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie:
> Remove
> >> checks for index >= tnode_child_length from tnode_get_child") which
> >> is the parent of 5405afd1a306:
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real 0m3.451s
> >> user 0m0.184s
> >> sys 0m2.004s
> >>
> >>
> >> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add
> >> tracking value for suffix length"):
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real 0m48.624s
> >> user 0m0.360s
> >> sys 0m46.960s
> >>
> >> So does that warrant a fixes tag for this patch?
> >
> > Yes, please include it in the patch description next time.
> >
> > I've been giving it thought and the fact is the patch series has
> > merit. I just think it was being a bit too aggressive in terms of
> > some of the changes. So placing any changes in put_child seem to be
> > the one piece in this that make this a bit ugly.
>
> Does that imply the current updating of the parent's slen value should be
> removed from put_child then?
No, that is where it should be. We want to leave the put_child code as is. That way all the code for halving and inflating nodes works correctly.
The only reason for having the update_suffix code update the parent was because before it was propagating the suffix length for increases as well. Since we aren't using it for that anymore there isn't much point in having update_suffix touching the parent in the code below.
> I started out without doing the changes in put_child, but that meant the changes
> to push the slen up through the trie were spread all over the place. This seemed
> like the cleaner solution.
It actually makes it much messier. The put_child call is used all over the place in many situations where it doesn't make sense to go through updating the entire trie. The big issue is you can't walk up the trie if you are working on assembling a node off on the side that you plan to insert into the trie later. The only places we need to worry about updating the suffix are when we have added a new leaf/suffix, or when we have deleted a leaf/suffix. That is why in the patches I submitted I only update in those two places.
> >>>> + }
> >>>> +}
> >>>> +
> >>>> /* Add a child at position i overwriting the old value.
> >>>> * Update the value of full_children and empty_children.
> >>>> + *
> >>>> + * The suffix length of the parent node and the rest of the tree
> >>>> + is
> >>>> + * updated (if required) when adding/replacing a node, but is
> >>>> + caller's
> >>>> + * responsibility on removal.
> >>>> */
> >>>> static void put_child(struct key_vector *tn, unsigned long i,
> >>>> struct key_vector *n) @@ -447,8 +461,8 @@
> >>>> static void put_child(struct key_vector *tn, unsigned long i,
> >>>> else if (!wasfull && isfull)
> >>>> tn_info(tn)->full_children++;
> >>>>
> >>>> - if (n && (tn->slen < n->slen))
> >>>> - tn->slen = n->slen;
> >>>> + if (n)
> >>>> + node_push_suffix(tn, n);
> >>>
> >>>
> >>> This is just creating redundant work if we have to perform a resize.
> >>
> >>
> >> The only real redundant work is the assignment of slen in tn, since
> >> the propagation up the trie has to happen regardless and if a
> >> subsequent resize results in changes to the trie then the propagation
> >> will stop at tn's parent since the suffix length of the tn's parent
> >> will not be less than tn's suffix length.
> >
> >
> > This isn't going to work. We want to avoid trying to push the suffix
> > when we place a child. There are scenarios where we are placing
> > children in nodes that don't have parents yet because we are doing
> > rebalances and such. I suspect this could be pretty expensive on a
> > resize call.
>
> It's not expensive because the new parents that are being built up are orphaned
> from the rest of the trie, so the push up won't proceed into the rest of the trie
> until the new node is inserted into the trie.
That isn't entirely true. For example every time you have to add a new leaf where one already exists we create a new tnode, add the original leaf, and then add the new one. I would rather not have this code trying to walk the trie twice. It would make more sense just to update things only when we add the new one.
> > We want to pull the suffix pushing out of the resize completely. The
> > problem is this is going to cause us to start pushing suffixes for
> > every node moved as a part of resize.
>
> This will mean that nodes will have to be visited multiple times, which could well
> be more expensive than doing it during the resize.
I'm not sure what you are talking about here. Basically when we add a node we will start walking up the list just like you do with the changes you made to put_child. The only difference is that I only have us doing it when we insert a new leaf or when we insert a new fib alias to the tail of the list. It should be less expensive than the approach you took as I don't try walking up any lists unless there is actually something to pull on an add.
Basically the only real difference between what I am suggesting and what you have submitted was the fact that I would prefer to see put_child left as is and instead for us to just add a call to node_push_suffix in fib_insert_node. Doing it that way is much cleaner in my opinion.
> >
> >>>
> >>>> rcu_assign_pointer(tn->tnode[i], n); }
> >>
> >> ...
> >>
> >>>> -static void leaf_pull_suffix(struct key_vector *tp, struct
> >>>> key_vector
> >>>> *l)
> >>>> +static void node_set_suffix(struct key_vector *tp, unsigned char
> >>>> +slen)
> >>>> {
> >>>> - while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
> >>>> - if (update_suffix(tp) > l->slen)
> >>>> + if (slen > tp->slen)
> >>>> + tp->slen = slen;
> >>>> +}
> >>>> +
> >>>> +static void node_pull_suffix(struct key_vector *tn) {
> >>>> + struct key_vector *tp;
> >>>> + unsigned char slen;
> >>>> +
> >>>> + slen = update_suffix(tn);
> >>>> + tp = node_parent(tn);
> >>>> + while ((tp->slen > tp->pos) && (tp->slen > slen)) {
> >>>> + if (update_suffix(tp) > slen)
> >>>> break;
> >>>> tp = node_parent(tp);
> >>>> }
> >>>> }
> >>>>
> >>>
> >
> > Actually I realized that there is a bug here. The check for
> > update_suffix(tp) > slen can cause us to stop prematurely if I am not
> > mistaken. What we should probably be doing is pulling out tp->slen
> > and if the result of update_suffix is the same value then we can break
> > the loop. Otherwise we can stop early if a "second runner up" shows
> > up that is supposed to be pushed up the trie.
>
> Excellent point. This also a problem in the existing code when removing a fib
> alias with other aliases still on the leaf. I'll send a separate patch for this and
> base my changes on top of it.
>
> > I actually started around with patches to do much the same as what you
> > are doing and I will probably submit them as an RFC so if you need you
> > can pick pieces out as needed, or I could submit them if they work for
> > you.
>
> I'd be happy to test out any patches you send me.
I sent them out yesterday. If you could get me the timing on those patches I would appreciate it. I'm suspecting we should be getting close to your original time required to plug in 200K routes.
> >>> I really hate all the renaming. Can't you just leave these as
> >>> leaf_pull_suffix and leaf_push _suffix. I'm fine with us dropping
> >>> the leaf of the suffix but the renaming just makes adds a bunch of noise.
> >>> Really it might work better if you broke the replacement of the
> >>> key_vector as a leaf with the suffix length into a separate patch,
> >>> then you could do the rename as a part of that.
> >>
> >>
> >> Ok, I'll leave the naming of leaf_push_suffix alone. Note that
> >> leaf_pull_suffix hasn't been renamed, the below in the diff is just
> >> an artifact of how diff decided to present the changes along with the
> >> moving of leaf_push_suffix.
> >
> > So I have been looking this over for the last couple days and actually
> > I have started to be okay with the renaming.
> >
> > However I would ask to be consistent. If you are going to have
> > node_pull_suffix and node_push_suffix then just pass a node and a
> > suffix length. You can drop the leaf key vector from both functions
> > instead of just one.
>
> Ok, I can do that.
>
> >
> >>>
> >>>> -static void leaf_push_suffix(struct key_vector *tn, struct
> >>>> key_vector
> >>>> *l)
> >>>> +static void leaf_pull_suffix(struct key_vector *tp, struct
> >>>> +key_vector
> >>>> *l)
> >>>> {
> >>>> - /* if this is a new leaf then tn will be NULL and we can sort
> >>>> - * out parent suffix lengths as a part of trie_rebalance
> >>>> - */
> >>>> - while (tn->slen < l->slen) {
> >>>> - tn->slen = l->slen;
> >>>> - tn = node_parent(tn);
> >>>> + while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
> >>>> + if (update_suffix(tp) > l->slen)
> >>>> + break;
> >>>> + tp = node_parent(tp);
> >>>> }
> >>>> }
> >>>
> >>>
> >>> If possible it would work better if you could avoid moving functions
> >>> around as a result of your changes.
> >>
> >>
> >> Ok, I can add a forward declaration instead.
> >
> > It shouldn't be necessary. I think you are doing way too much with
> > this patch. We just need this to be a fix, you are doing a bit too
> > much like adding this new node_set_suffix function which isn't really
> > needed.
>
> As per your comment below the node_set_suffix function isn't necessary.
> However, I'm not sure exactly what you're suggesting with this patch doing too
> much. Are you saying that you'd like to see a series of patches starting with
> some of the restructuring/renaming of the push/pull functions, or are you saying
> that the sum total is too much?
Basically what it all came down to is the patch itself is a bit noisy and hard to follow. That is why I broke this down into two patches when I submitted it back out with what I consider to be fixes.
> >>>>
> >>>> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct
> >>>> fib_table
> >>>> *tb)
> >>>> if (IS_TRIE(pn))
> >>>> break;
> >>>>
> >>>> + /* push the suffix length to the parent node,
> >>>> + * since a previous leaf removal may have
> >>>> + * caused it to change
> >>>> + */
> >>>> + if (pn->slen > pn->pos) {
> >>>> + unsigned char slen =
> >>>> + update_suffix(pn);
> >>>> +
> >>>> + node_set_suffix(node_parent(pn), slen);
> >>>> + }
> >>>> +
> >>>
> >>>
> >>> Why bother with the local variable? You could just pass
> >>> update_suffix(pn) as the parameter to your node_set_suffix function.
> >>
> >>
> >> To avoid a long line on the node_set_suffix call or splitting it
> >> across lines, but I'll remove the local variable as you suggest and the same
> below.
> >
> > Actually I think there is a bug here. You shouldn't be setting the
> > suffix for the parent based on the child. You can just call
> > update_suffix(pn) and that should be enough.
>
> Yes, you're right. BTW, this logic is transferred from the existing resize function
> and I think what saves us both before and after my changes is that the logic is
> repeated for the parent node which fixes the value and so on all the way up the
> trie.
Right, but the logic was changed so the bug wasn't really introduced until we stopped using update_suffix to push the suffix length up the trie.
Odds are the call to node_set_suffix always did nothing anyway since the parent should have always had an equal or greater suffix than the child.
- Alex
^ permalink raw reply
* [PATCH net-next] sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead
From: Edward Cree @ 2016-12-02 15:51 UTC (permalink / raw)
To: linux-net-drivers, davem; +Cc: bkenward, netdev
Logically, EFX_BUG_ON_PARANOID can never be correct. For, BUG_ON should
only be used if it is not possible to continue without potential harm;
and since the non-DEBUG driver will continue regardless (as the BUG_ON is
compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
or the newly defined EFX_WARN_ON_ONCE_PARANOID.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/ef10.c | 2 +-
drivers/net/ethernet/sfc/efx.c | 2 +-
drivers/net/ethernet/sfc/ethtool.c | 4 ++--
drivers/net/ethernet/sfc/farch.c | 6 ++---
drivers/net/ethernet/sfc/mcdi.h | 4 ++--
drivers/net/ethernet/sfc/mcdi_mon.c | 4 ++--
drivers/net/ethernet/sfc/mcdi_port.c | 2 +-
drivers/net/ethernet/sfc/net_driver.h | 22 +++++++++----------
drivers/net/ethernet/sfc/ptp.c | 2 +-
drivers/net/ethernet/sfc/rx.c | 8 +++----
drivers/net/ethernet/sfc/siena.c | 2 +-
drivers/net/ethernet/sfc/tx.c | 12 +++++-----
drivers/net/ethernet/sfc/tx_tso.c | 41 +++++++++++++++++------------------
13 files changed, 55 insertions(+), 56 deletions(-)
diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 0f58ea8..de2947c 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -2100,7 +2100,7 @@ static int efx_ef10_tx_tso_desc(struct efx_tx_queue *tx_queue,
u32 seqnum;
u32 mss;
- EFX_BUG_ON_PARANOID(tx_queue->tso_version != 2);
+ EFX_WARN_ON_ONCE_PARANOID(tx_queue->tso_version != 2);
mss = skb_shinfo(skb)->gso_size;
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index da7028d..5a5dcad 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -355,7 +355,7 @@ static int efx_probe_eventq(struct efx_channel *channel)
/* Build an event queue with room for one event per tx and rx buffer,
* plus some extra for link state events and MCDI completions. */
entries = roundup_pow_of_two(efx->rxq_entries + efx->txq_entries + 128);
- EFX_BUG_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE);
+ EFX_WARN_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE);
channel->eventq_mask = max(entries, EFX_MIN_EVQ_SIZE) - 1;
return efx_nic_probe_eventq(channel);
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index ca29d3d..f644216 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -333,12 +333,12 @@ static int efx_ethtool_fill_self_tests(struct efx_nic *efx,
"core", 0, "registers", NULL);
if (efx->phy_op->run_tests != NULL) {
- EFX_BUG_ON_PARANOID(efx->phy_op->test_name == NULL);
+ EFX_WARN_ON_PARANOID(efx->phy_op->test_name == NULL);
for (i = 0; true; ++i) {
const char *name;
- EFX_BUG_ON_PARANOID(i >= EFX_MAX_PHY_TESTS);
+ EFX_WARN_ON_PARANOID(i >= EFX_MAX_PHY_TESTS);
name = efx->phy_op->test_name(efx, i);
if (name == NULL)
break;
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 91aa3ec..e4ca216 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -177,7 +177,7 @@ efx_init_special_buffer(struct efx_nic *efx, struct efx_special_buffer *buffer)
dma_addr_t dma_addr;
int i;
- EFX_BUG_ON_PARANOID(!buffer->buf.addr);
+ EFX_WARN_ON_PARANOID(!buffer->buf.addr);
/* Write buffer descriptors to NIC */
for (i = 0; i < buffer->entries; i++) {
@@ -332,7 +332,7 @@ void efx_farch_tx_write(struct efx_tx_queue *tx_queue)
txd = efx_tx_desc(tx_queue, write_ptr);
++tx_queue->write_count;
- EFX_BUG_ON_PARANOID(buffer->flags & EFX_TX_BUF_OPTION);
+ EFX_WARN_ON_ONCE_PARANOID(buffer->flags & EFX_TX_BUF_OPTION);
/* Create TX descriptor ring entry */
BUILD_BUG_ON(EFX_TX_BUF_CONT != 1);
@@ -2041,7 +2041,7 @@ efx_farch_filter_from_gen_spec(struct efx_farch_filter_spec *spec,
__be32 rhost, host1, host2;
__be16 rport, port1, port2;
- EFX_BUG_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX));
+ EFX_WARN_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX));
if (gen_spec->ether_type != htons(ETH_P_IP))
return -EPROTONOSUPPORT;
diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h
index c9aeb07..4472107 100644
--- a/drivers/net/ethernet/sfc/mcdi.h
+++ b/drivers/net/ethernet/sfc/mcdi.h
@@ -129,14 +129,14 @@ struct efx_mcdi_data {
static inline struct efx_mcdi_iface *efx_mcdi(struct efx_nic *efx)
{
- EFX_BUG_ON_PARANOID(!efx->mcdi);
+ EFX_WARN_ON_PARANOID(!efx->mcdi);
return &efx->mcdi->iface;
}
#ifdef CONFIG_SFC_MCDI_MON
static inline struct efx_mcdi_mon *efx_mcdi_mon(struct efx_nic *efx)
{
- EFX_BUG_ON_PARANOID(!efx->mcdi);
+ EFX_WARN_ON_PARANOID(!efx->mcdi);
return &efx->mcdi->hwmon;
}
#endif
diff --git a/drivers/net/ethernet/sfc/mcdi_mon.c b/drivers/net/ethernet/sfc/mcdi_mon.c
index bc27d5b..f97da05 100644
--- a/drivers/net/ethernet/sfc/mcdi_mon.c
+++ b/drivers/net/ethernet/sfc/mcdi_mon.c
@@ -121,9 +121,9 @@ void efx_mcdi_sensor_event(struct efx_nic *efx, efx_qword_t *ev)
}
if (!name)
name = "No sensor name available";
- EFX_BUG_ON_PARANOID(state >= ARRAY_SIZE(sensor_status_names));
+ EFX_WARN_ON_PARANOID(state >= ARRAY_SIZE(sensor_status_names));
state_txt = sensor_status_names[state];
- EFX_BUG_ON_PARANOID(hwmon_type >= EFX_HWMON_TYPES_COUNT);
+ EFX_WARN_ON_PARANOID(hwmon_type >= EFX_HWMON_TYPES_COUNT);
unit = efx_hwmon_unit[hwmon_type];
if (!unit)
unit = "";
diff --git a/drivers/net/ethernet/sfc/mcdi_port.c b/drivers/net/ethernet/sfc/mcdi_port.c
index 0f0eb27..9dcd396 100644
--- a/drivers/net/ethernet/sfc/mcdi_port.c
+++ b/drivers/net/ethernet/sfc/mcdi_port.c
@@ -840,7 +840,7 @@ void efx_mcdi_process_link_change(struct efx_nic *efx, efx_qword_t *ev)
u32 flags, fcntl, speed, lpa;
speed = EFX_QWORD_FIELD(*ev, MCDI_EVENT_LINKCHANGE_SPEED);
- EFX_BUG_ON_PARANOID(speed >= ARRAY_SIZE(efx_mcdi_event_link_speed));
+ EFX_WARN_ON_PARANOID(speed >= ARRAY_SIZE(efx_mcdi_event_link_speed));
speed = efx_mcdi_event_link_speed[speed];
flags = EFX_QWORD_FIELD(*ev, MCDI_EVENT_LINKCHANGE_LINK_FLAGS);
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 0bbd7e2..8692e82 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -44,10 +44,10 @@
#define EFX_DRIVER_VERSION "4.1"
#ifdef DEBUG
-#define EFX_BUG_ON_PARANOID(x) BUG_ON(x)
+#define EFX_WARN_ON_ONCE_PARANOID(x) WARN_ON_ONCE(x)
#define EFX_WARN_ON_PARANOID(x) WARN_ON(x)
#else
-#define EFX_BUG_ON_PARANOID(x) do {} while (0)
+#define EFX_WARN_ON_ONCE_PARANOID(x) do {} while (0)
#define EFX_WARN_ON_PARANOID(x) do {} while (0)
#endif
@@ -1409,7 +1409,7 @@ struct efx_nic_type {
static inline struct efx_channel *
efx_get_channel(struct efx_nic *efx, unsigned index)
{
- EFX_BUG_ON_PARANOID(index >= efx->n_channels);
+ EFX_WARN_ON_ONCE_PARANOID(index >= efx->n_channels);
return efx->channel[index];
}
@@ -1430,8 +1430,8 @@ efx_get_channel(struct efx_nic *efx, unsigned index)
static inline struct efx_tx_queue *
efx_get_tx_queue(struct efx_nic *efx, unsigned index, unsigned type)
{
- EFX_BUG_ON_PARANOID(index >= efx->n_tx_channels ||
- type >= EFX_TXQ_TYPES);
+ EFX_WARN_ON_ONCE_PARANOID(index >= efx->n_tx_channels ||
+ type >= EFX_TXQ_TYPES);
return &efx->channel[efx->tx_channel_offset + index]->tx_queue[type];
}
@@ -1444,8 +1444,8 @@ static inline bool efx_channel_has_tx_queues(struct efx_channel *channel)
static inline struct efx_tx_queue *
efx_channel_get_tx_queue(struct efx_channel *channel, unsigned type)
{
- EFX_BUG_ON_PARANOID(!efx_channel_has_tx_queues(channel) ||
- type >= EFX_TXQ_TYPES);
+ EFX_WARN_ON_ONCE_PARANOID(!efx_channel_has_tx_queues(channel) ||
+ type >= EFX_TXQ_TYPES);
return &channel->tx_queue[type];
}
@@ -1482,7 +1482,7 @@ static inline bool efx_channel_has_rx_queue(struct efx_channel *channel)
static inline struct efx_rx_queue *
efx_channel_get_rx_queue(struct efx_channel *channel)
{
- EFX_BUG_ON_PARANOID(!efx_channel_has_rx_queue(channel));
+ EFX_WARN_ON_ONCE_PARANOID(!efx_channel_has_rx_queue(channel));
return &channel->rx_queue;
}
@@ -1578,9 +1578,9 @@ efx_tx_queue_get_insert_buffer(const struct efx_tx_queue *tx_queue)
struct efx_tx_buffer *buffer =
__efx_tx_queue_get_insert_buffer(tx_queue);
- EFX_BUG_ON_PARANOID(buffer->len);
- EFX_BUG_ON_PARANOID(buffer->flags);
- EFX_BUG_ON_PARANOID(buffer->unmap_len);
+ EFX_WARN_ON_ONCE_PARANOID(buffer->len);
+ EFX_WARN_ON_ONCE_PARANOID(buffer->flags);
+ EFX_WARN_ON_ONCE_PARANOID(buffer->unmap_len);
return buffer;
}
diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index 77a5364..60cdb97 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -835,7 +835,7 @@ static int efx_ptp_synchronize(struct efx_nic *efx, unsigned int num_readings)
ACCESS_ONCE(*start) = 0;
rc = efx_mcdi_rpc_start(efx, MC_CMD_PTP, synch_buf,
MC_CMD_PTP_IN_SYNCHRONIZE_LEN);
- EFX_BUG_ON_PARANOID(rc);
+ EFX_WARN_ON_ONCE_PARANOID(rc);
/* Wait for start from MCDI (or timeout) */
timeout = jiffies + msecs_to_jiffies(MAX_SYNCHRONISE_WAIT_MS);
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 7893a73..5f4ad4f 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -335,7 +335,7 @@ void efx_fast_push_rx_descriptors(struct efx_rx_queue *rx_queue, bool atomic)
/* Calculate current fill level, and exit if we don't need to fill */
fill_level = (rx_queue->added_count - rx_queue->removed_count);
- EFX_BUG_ON_PARANOID(fill_level > rx_queue->efx->rxq_entries);
+ EFX_WARN_ON_ONCE_PARANOID(fill_level > rx_queue->efx->rxq_entries);
if (fill_level >= rx_queue->fast_fill_trigger)
goto out;
@@ -347,7 +347,7 @@ void efx_fast_push_rx_descriptors(struct efx_rx_queue *rx_queue, bool atomic)
batch_size = efx->rx_pages_per_batch * efx->rx_bufs_per_page;
space = rx_queue->max_fill - fill_level;
- EFX_BUG_ON_PARANOID(space < batch_size);
+ EFX_WARN_ON_ONCE_PARANOID(space < batch_size);
netif_vdbg(rx_queue->efx, rx_status, rx_queue->efx->net_dev,
"RX queue %d fast-filling descriptor ring from"
@@ -475,7 +475,7 @@ static struct sk_buff *efx_rx_mk_skb(struct efx_channel *channel,
return NULL;
}
- EFX_BUG_ON_PARANOID(rx_buf->len < hdr_len);
+ EFX_WARN_ON_ONCE_PARANOID(rx_buf->len < hdr_len);
memcpy(skb->data + efx->rx_ip_align, eh - efx->rx_prefix_size,
efx->rx_prefix_size + hdr_len);
@@ -682,7 +682,7 @@ int efx_probe_rx_queue(struct efx_rx_queue *rx_queue)
/* Create the smallest power-of-two aligned ring */
entries = max(roundup_pow_of_two(efx->rxq_entries), EFX_MIN_DMAQ_SIZE);
- EFX_BUG_ON_PARANOID(entries > EFX_MAX_DMAQ_SIZE);
+ EFX_WARN_ON_PARANOID(entries > EFX_MAX_DMAQ_SIZE);
rx_queue->ptr_mask = entries - 1;
netif_dbg(efx, probe, efx->net_dev,
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index 0c4a8dd..a3901bc 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -717,7 +717,7 @@ static void siena_mcdi_request(struct efx_nic *efx,
unsigned int i;
unsigned int inlen_dw = DIV_ROUND_UP(sdu_len, 4);
- EFX_BUG_ON_PARANOID(hdr_len != 4);
+ EFX_WARN_ON_PARANOID(hdr_len != 4);
efx_writed(efx, hdr, pdu);
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index f11a36a..3c01514 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -142,7 +142,7 @@ static void efx_tx_maybe_stop_queue(struct efx_tx_queue *txq1)
fill_level = max(txq1->insert_count - txq1->old_read_count,
txq2->insert_count - txq2->old_read_count);
- EFX_BUG_ON_PARANOID(fill_level >= efx->txq_entries);
+ EFX_WARN_ON_ONCE_PARANOID(fill_level >= efx->txq_entries);
if (likely(fill_level < efx->txq_stop_thresh)) {
smp_mb();
if (likely(!efx->loopback_selftest))
@@ -158,7 +158,7 @@ static int efx_enqueue_skb_copy(struct efx_tx_queue *tx_queue,
u8 *copy_buffer;
int rc;
- EFX_BUG_ON_PARANOID(copy_len > EFX_TX_CB_SIZE);
+ EFX_WARN_ON_ONCE_PARANOID(copy_len > EFX_TX_CB_SIZE);
buffer = efx_tx_queue_get_insert_buffer(tx_queue);
@@ -268,7 +268,7 @@ static void efx_skb_copy_bits_to_pio(struct efx_nic *efx, struct sk_buff *skb,
kunmap_atomic(vaddr);
}
- EFX_BUG_ON_PARANOID(skb_shinfo(skb)->frag_list);
+ EFX_WARN_ON_ONCE_PARANOID(skb_shinfo(skb)->frag_list);
}
static int efx_enqueue_skb_pio(struct efx_tx_queue *tx_queue,
@@ -503,7 +503,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
* size limit.
*/
if (segments) {
- EFX_BUG_ON_PARANOID(!tx_queue->handle_tso);
+ EFX_WARN_ON_ONCE_PARANOID(!tx_queue->handle_tso);
rc = tx_queue->handle_tso(tx_queue, skb, &data_mapped);
if (rc == -EINVAL) {
rc = efx_tx_tso_fallback(tx_queue, skb);
@@ -724,7 +724,7 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index)
struct efx_tx_queue *txq2;
unsigned int pkts_compl = 0, bytes_compl = 0;
- EFX_BUG_ON_PARANOID(index > tx_queue->ptr_mask);
+ EFX_WARN_ON_ONCE_PARANOID(index > tx_queue->ptr_mask);
efx_dequeue_buffers(tx_queue, index, &pkts_compl, &bytes_compl);
tx_queue->pkts_compl += pkts_compl;
@@ -772,7 +772,7 @@ int efx_probe_tx_queue(struct efx_tx_queue *tx_queue)
/* Create the smallest power-of-two aligned ring */
entries = max(roundup_pow_of_two(efx->txq_entries), EFX_MIN_DMAQ_SIZE);
- EFX_BUG_ON_PARANOID(entries > EFX_MAX_DMAQ_SIZE);
+ EFX_WARN_ON_PARANOID(entries > EFX_MAX_DMAQ_SIZE);
tx_queue->ptr_mask = entries - 1;
netif_dbg(efx, probe, efx->net_dev,
diff --git a/drivers/net/ethernet/sfc/tx_tso.c b/drivers/net/ethernet/sfc/tx_tso.c
index 6032887..e0cbda9 100644
--- a/drivers/net/ethernet/sfc/tx_tso.c
+++ b/drivers/net/ethernet/sfc/tx_tso.c
@@ -109,15 +109,15 @@ static void efx_tx_queue_insert(struct efx_tx_queue *tx_queue,
struct efx_tx_buffer *buffer;
unsigned int dma_len;
- EFX_BUG_ON_PARANOID(len <= 0);
+ EFX_WARN_ON_ONCE_PARANOID(len <= 0);
while (1) {
buffer = efx_tx_queue_get_insert_buffer(tx_queue);
++tx_queue->insert_count;
- EFX_BUG_ON_PARANOID(tx_queue->insert_count -
- tx_queue->read_count >=
- tx_queue->efx->txq_entries);
+ EFX_WARN_ON_ONCE_PARANOID(tx_queue->insert_count -
+ tx_queue->read_count >=
+ tx_queue->efx->txq_entries);
buffer->dma_addr = dma_addr;
@@ -134,7 +134,7 @@ static void efx_tx_queue_insert(struct efx_tx_queue *tx_queue,
len -= dma_len;
}
- EFX_BUG_ON_PARANOID(!len);
+ EFX_WARN_ON_ONCE_PARANOID(!len);
buffer->len = len;
*final_buffer = buffer;
}
@@ -147,8 +147,8 @@ static __be16 efx_tso_check_protocol(struct sk_buff *skb)
{
__be16 protocol = skb->protocol;
- EFX_BUG_ON_PARANOID(((struct ethhdr *)skb->data)->h_proto !=
- protocol);
+ EFX_WARN_ON_ONCE_PARANOID(((struct ethhdr *)skb->data)->h_proto !=
+ protocol);
if (protocol == htons(ETH_P_8021Q)) {
struct vlan_ethhdr *veh = (struct vlan_ethhdr *)skb->data;
@@ -156,19 +156,18 @@ static __be16 efx_tso_check_protocol(struct sk_buff *skb)
}
if (protocol == htons(ETH_P_IP)) {
- EFX_BUG_ON_PARANOID(ip_hdr(skb)->protocol != IPPROTO_TCP);
+ EFX_WARN_ON_ONCE_PARANOID(ip_hdr(skb)->protocol != IPPROTO_TCP);
} else {
- EFX_BUG_ON_PARANOID(protocol != htons(ETH_P_IPV6));
- EFX_BUG_ON_PARANOID(ipv6_hdr(skb)->nexthdr != NEXTHDR_TCP);
+ EFX_WARN_ON_ONCE_PARANOID(protocol != htons(ETH_P_IPV6));
+ EFX_WARN_ON_ONCE_PARANOID(ipv6_hdr(skb)->nexthdr != NEXTHDR_TCP);
}
- EFX_BUG_ON_PARANOID((PTR_DIFF(tcp_hdr(skb), skb->data)
- + (tcp_hdr(skb)->doff << 2u)) >
- skb_headlen(skb));
+ EFX_WARN_ON_ONCE_PARANOID((PTR_DIFF(tcp_hdr(skb), skb->data) +
+ (tcp_hdr(skb)->doff << 2u)) >
+ skb_headlen(skb));
return protocol;
}
-
/* Parse the SKB header and initialise state. */
static int tso_start(struct tso_state *st, struct efx_nic *efx,
struct efx_tx_queue *tx_queue,
@@ -193,9 +192,9 @@ static int tso_start(struct tso_state *st, struct efx_nic *efx,
}
st->seqnum = ntohl(tcp_hdr(skb)->seq);
- EFX_BUG_ON_PARANOID(tcp_hdr(skb)->urg);
- EFX_BUG_ON_PARANOID(tcp_hdr(skb)->syn);
- EFX_BUG_ON_PARANOID(tcp_hdr(skb)->rst);
+ EFX_WARN_ON_ONCE_PARANOID(tcp_hdr(skb)->urg);
+ EFX_WARN_ON_ONCE_PARANOID(tcp_hdr(skb)->syn);
+ EFX_WARN_ON_ONCE_PARANOID(tcp_hdr(skb)->rst);
st->out_len = skb->len - header_len;
@@ -245,8 +244,8 @@ static void tso_fill_packet_with_fragment(struct efx_tx_queue *tx_queue,
if (st->packet_space == 0)
return;
- EFX_BUG_ON_PARANOID(st->in_len <= 0);
- EFX_BUG_ON_PARANOID(st->packet_space <= 0);
+ EFX_WARN_ON_ONCE_PARANOID(st->in_len <= 0);
+ EFX_WARN_ON_ONCE_PARANOID(st->packet_space <= 0);
n = min(st->in_len, st->packet_space);
@@ -379,7 +378,7 @@ int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
/* Find the packet protocol and sanity-check it */
state.protocol = efx_tso_check_protocol(skb);
- EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
+ EFX_WARN_ON_ONCE_PARANOID(tx_queue->write_count != tx_queue->insert_count);
rc = tso_start(&state, efx, tx_queue, skb);
if (rc)
@@ -387,7 +386,7 @@ int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
if (likely(state.in_len == 0)) {
/* Grab the first payload fragment. */
- EFX_BUG_ON_PARANOID(skb_shinfo(skb)->nr_frags < 1);
+ EFX_WARN_ON_ONCE_PARANOID(skb_shinfo(skb)->nr_frags < 1);
frag_i = 0;
rc = tso_get_fragment(&state, efx,
skb_shinfo(skb)->frags + frag_i);
^ permalink raw reply related
* Re: arp_filter and IPv6 ND
From: Saku Ytti @ 2016-12-02 15:42 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
In-Reply-To: <3bf09622-afa5-506c-8407-e3e1a70341c0@stressinduktion.org>
On 2 December 2016 at 16:08, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
Hey,
> May I ask why you want to turn it off?
Certainly. I don't want device to answer with link address for L3
address it does not have on the link. In my case it triggers this bug
https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop
In this particular case, for one reason or another my Cisco device
would have ND entry for Linux loopback pointing to an interface with
completely different network. Which itself would be just weird, but
combined with weird behaviour of Cisco it actually causes the loopback
route advertised by BGP not to be installed. If the ND entry didn't
exist, the BGP route would be installed.
I don't really even know why the ND entry exists, all I can think of
is that Linux must have sent gratuitous reply, because I don't se why
Cisco would have tried to discover it.
Expected behaviour is that the loopback/128 BGP route resolves to
on-link next-hop, and on-link next hop is then ND'd. Observed
behaviour is that loopback/128 BGP route also appears in ND cache.
> In IPv6 this depends on the scope. In IPv4 this concept doesn't really
> exist.
>
> Please notice that in IPv4 arp_filter does not necessarily mean that the
> system is operating in strong end system mode but you end up in an
> hybrid clone where arp is acting strong but routing not and thus you
> also have to add fib rules to simulate that.
It's just very peculiar behaviour to have ARP or ND entries on a
interface where given subnet does not exist, it rudimentarily causes
difficult to troubleshoot problems and is surprising/unexpected
behaviour.
Of course well behaving device wouldn't accept such replies, because
it itself could be attack vector (imagine me telling you 8.8.8.8 is on
the link, or worse, your bank).
I'm curious, why does this behaviour exist? When is this desirable?
I've never seen any other device than Linux behave like this, and when
ever I've heard about the problem, I've only seen surprised faces that
it does behave like this.
--
++ytti
^ permalink raw reply
* [PATCH net] geneve: avoid use-after-free of skb->data
From: Sabrina Dubroca @ 2016-12-02 15:49 UTC (permalink / raw)
To: netdev; +Cc: Sabrina Dubroca, John W . Linville
geneve{,6}_build_skb can end up doing a pskb_expand_head(), which
makes the ip_hdr(skb) reference we stashed earlier stale. Since it's
only needed as an argument to ip_tunnel_ecn_encap(), move this
directly in the function call.
Fixes: 08399efc6319 ("geneve: ensure ECN info is handled properly in all tx/rx paths")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
drivers/net/geneve.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 532695c8209b..bab65647b80f 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -859,7 +859,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
struct geneve_dev *geneve = netdev_priv(dev);
struct geneve_sock *gs4;
struct rtable *rt = NULL;
- const struct iphdr *iip; /* interior IP header */
int err = -EINVAL;
struct flowi4 fl4;
__u8 tos, ttl;
@@ -890,8 +889,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
skb_reset_mac_header(skb);
- iip = ip_hdr(skb);
-
if (info) {
const struct ip_tunnel_key *key = &info->key;
u8 *opts = NULL;
@@ -911,7 +908,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
if (unlikely(err))
goto tx_error;
- tos = ip_tunnel_ecn_encap(key->tos, iip, skb);
+ tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
ttl = key->ttl;
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
} else {
@@ -920,7 +917,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
if (unlikely(err))
goto tx_error;
- tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, iip, skb);
+ tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
ttl = geneve->ttl;
if (!ttl && IN_MULTICAST(ntohl(fl4.daddr)))
ttl = 1;
@@ -952,7 +949,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
{
struct geneve_dev *geneve = netdev_priv(dev);
struct dst_entry *dst = NULL;
- const struct iphdr *iip; /* interior IP header */
struct geneve_sock *gs6;
int err = -EINVAL;
struct flowi6 fl6;
@@ -982,8 +978,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
skb_reset_mac_header(skb);
- iip = ip_hdr(skb);
-
if (info) {
const struct ip_tunnel_key *key = &info->key;
u8 *opts = NULL;
@@ -1004,7 +998,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
if (unlikely(err))
goto tx_error;
- prio = ip_tunnel_ecn_encap(key->tos, iip, skb);
+ prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
ttl = key->ttl;
label = info->key.label;
} else {
@@ -1014,7 +1008,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
goto tx_error;
prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
- iip, skb);
+ ip_hdr(skb), skb);
ttl = geneve->ttl;
if (!ttl && ipv6_addr_is_multicast(&fl6.daddr))
ttl = 1;
--
2.10.2
^ permalink raw reply related
* Re: [PATCH net-next V2 0/7] Mellanox 100G mlx5 updates 2016-11-29
From: David Miller @ 2016-12-02 15:47 UTC (permalink / raw)
To: saeedm; +Cc: netdev, tariqt, ogerlitz, roid, sebott
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 30 Nov 2016 17:59:36 +0200
> The following series from Tariq and Roi, provides some critical fixes
> and updates for the mlx5e driver.
>
> From Tariq:
> - Fix driver coherent memory huge allocation issues by fragmenting
> completion queues, in a way that is transparent to the netdev driver by
> providing a new buffer type "mlx5_frag_buf" with the same access API.
> - Create UMR MKey per RQ to have better scalability.
>
> From Roi:
> - Some fixes for the encap-decap support and tc flower added lately to the
> mlx5e driver.
>
> v1->v2:
> - Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by Eric.
>
> This series was generated against commit:
> 31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
From: Paolo Abeni @ 2016-12-02 15:44 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
Rick Jones, netdev@vger.kernel.org, Hannes Frederic Sowa
In-Reply-To: <20161202163758.0d8cc9bf@redhat.com>
On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 23:17:48 +0100
> Paolo Abeni <pabeni@redhat.com> wrote:
>
> > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > > small socket queues)
> > >
> > > On Wed, 30 Nov 2016 16:35:20 +0000
> > > Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > > > I don't quite get why you are setting the socket recv size
> > > > > (with -- -s and -S) to such a small number, size + 256.
> > > > >
> > > >
> > > > Maybe I missed something at the time I wrote that but why would it
> > > > need to be larger?
> > >
> > > Well, to me it is quite obvious that we need some queue to avoid packet
> > > drops. We have two processes netperf and netserver, that are sending
> > > packets between each-other (UDP_STREAM mostly netperf -> netserver).
> > > These PIDs are getting scheduled and migrated between CPUs, and thus
> > > does not get executed equally fast, thus a queue is need absorb the
> > > fluctuations.
> > >
> > > The network stack is even partly catching your config "mistake" and
> > > increase the socket queue size, so we minimum can handle one max frame
> > > (due skb "truesize" concept approx PAGE_SIZE + overhead).
> > >
> > > Hopefully for localhost testing a small queue should hopefully not
> > > result in packet drops. Testing... ups, this does result in packet
> > > drops.
> > >
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > >
> > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> > > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > >
> > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > > Socket Message Elapsed Messages
> > > Size Size Time Okay Errors Throughput
> > > bytes bytes secs # # 10^6bits/sec
> > >
> > > 4608 1024 60.00 50024301 0 6829.98
> > > 2560 60.00 46133211 6298.72
> > >
> > > Dropped packets: 50024301-46133211=3891090
> > >
> > > To get a better drop indication, during this I run a command, to get
> > > system-wide network counters from the last second, so below numbers are
> > > per second.
> > >
> > > $ nstat > /dev/null && sleep 1 && nstat
> > > #kernel
> > > IpInReceives 885162 0.0
> > > IpInDelivers 885161 0.0
> > > IpOutRequests 885162 0.0
> > > UdpInDatagrams 776105 0.0
> > > UdpInErrors 109056 0.0
> > > UdpOutDatagrams 885160 0.0
> > > UdpRcvbufErrors 109056 0.0
> > > IpExtInOctets 931190476 0.0
> > > IpExtOutOctets 931189564 0.0
> > > IpExtInNoECTPkts 885162 0.0
> > >
> > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > > process didn't empty it's queue fast enough see [2].
> > >
> > > Although upstream changes are coming in this area, [2] is replaced with
> > > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > >
> > > Retesting with kernel 4.7.0-baseline+ ... show something else.
> > > To Paolo, you might want to look into this. And it could also explain why
> > > I've not see the mentioned speedup by mm-change, as I've been testing
> > > this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.
> >
> > Thank you for reporting this.
> >
> > It seems that the commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict while checking the rcvbuf.
> >
> > For very small value of rcvbuf, it allows a single skb to be enqueued,
> > while previously we allowed 2 of them to enter the queue, even if the
> > first one truesize exceeded rcvbuf, as in your test-case.
> >
> > Can you please try the following patch ?
>
> Sure, it looks much better with this patch.
Thank you for testing. I'll send a formal patch to David soon.
BTW I see I nice performance improvement compared to 4.7...
Cheers,
Paolo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH net 0/7] net: stmmac: fix probe error handling and phydev leaks
From: David Miller @ 2016-12-02 15:43 UTC (permalink / raw)
To: johan
Cc: peppe.cavallaro, alexandre.torgue, manabian, carlo, khilman,
mcoquelin.stm32, maxime.ripard, wens, netdev, linux-kernel
In-Reply-To: <1480516195-27696-1-git-send-email-johan@kernel.org>
From: Johan Hovold <johan@kernel.org>
Date: Wed, 30 Nov 2016 15:29:48 +0100
> This series fixes a number of issues with the stmmac-driver probe error
> handling, which for example left clocks enabled after probe failures.
>
> The final patch fixes a failure to deregister and free any fixed-link
> PHYs that were registered during probe on probe errors and on driver
> unbind. It also fixes a related of-node leak on late probe errors.
>
> This series depends on the of_phy_deregister_fixed_link() helper that
> was just merged to net.
>
> As mentioned earlier, one staging driver also suffers from a similar
> leak and can be fixed up once the above mentioned helper hits mainline.
>
> Note that these patches have only been compile tested.
Series applied, thanks.
^ permalink raw reply
* Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
From: Marc Kleine-Budde @ 2016-12-02 15:42 UTC (permalink / raw)
To: Oliver Hartkopp, Andrey Konovalov, David S. Miller, linux-can,
netdev, LKML
Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <73b78023-bc11-25c0-33e3-3a748dbc81cd@hartkopp.net>
[-- Attachment #1.1: Type: text/plain, Size: 1601 bytes --]
On 12/02/2016 04:11 PM, Oliver Hartkopp wrote:
>
>
> On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:
>> On 12/02/2016 01:43 PM, Andrey Konovalov wrote:
>
>
>>> [<ffffffff8369e0de>] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506
>>
>> We should add a check for a sensible optlen....
>>
>>> static int raw_setsockopt(struct socket *sock, int level, int optname,
>>> char __user *optval, unsigned int optlen)
>>> {
>>> struct sock *sk = sock->sk;
>>> struct raw_sock *ro = raw_sk(sk);
>>> struct can_filter *filter = NULL; /* dyn. alloc'ed filters */
>>> struct can_filter sfilter; /* single filter */
>>> struct net_device *dev = NULL;
>>> can_err_mask_t err_mask = 0;
>>> int count = 0;
>>> int err = 0;
>>>
>>> if (level != SOL_CAN_RAW)
>>> return -EINVAL;
>>>
>>> switch (optname) {
>>>
>>> case CAN_RAW_FILTER:
>>> if (optlen % sizeof(struct can_filter) != 0)
>>> return -EINVAL;
>>
>> here...
>>
>> if (optlen > 64 * sizeof(struct can_filter))
>> return -EINVAL;
>>
>
> Agreed.
>
> But what is sensible here?
> 64 filters is way to small IMO.
>
> When thinking about picking a bunch of single CAN IDs I would tend to
> something like 512 filters.
Ok - 64 was just an arbitrary chosen value for demonstration purposes :)
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Andrew Lunn @ 2016-12-02 15:43 UTC (permalink / raw)
To: Vivien Didelot
Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <87vav3gz67.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>
On Thu, Dec 01, 2016 at 03:41:20PM -0500, Vivien Didelot wrote:
> Hi Andrew,
>
> Andrew Lunn <andrew@lunn.ch> writes:
>
> >> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> index ab52c37..9e51405 100644
> >> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
> >> int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
> >> u16 val);
> >>
> >> + /* Switch Software Reset */
> >> + int (*reset)(struct mv88e6xxx_chip *chip);
> >> +
> >
> > Hi Vivien
> >
> > In my huge patch series of 6390, i've been using a g1_ prefix for
> > functionality which is in global 1, g2_ for global 2, etc. This has
> > worked for everything so far with the exception of setting which
> > reserved MAC addresses should be sent to the CPU. Most devices have it
> > in g2, but 6390 has it in g1.
> >
> > Please could you add the prefix.
>
> I don't understand. It looks like you are talking about the second part
> of the comment I made on your RFC patchset, about the Rsvd2CPU feature:
Hi Vivien
I mean
+ /* Switch Software Reset */
+ int (*g1_reset)(struct mv88e6xxx_chip *chip);
+
We have a collection of function pointers with port_ prefix, another
collection with stats_, and a third with ppu_, etc. And then we have
some which do not fit a specific category. Those i have prefixed with
g1_ or g2_. I think we should have some prefix, and that is my
suggestion.
Andrew
^ permalink raw reply
* Re: DSA vs. SWTICHDEV ?
From: Andrew Lunn @ 2016-12-02 15:38 UTC (permalink / raw)
To: Murali Karicheri
Cc: Joakim Tjernlund, netdev@vger.kernel.org, Roger Quadros,
Grygorii Strashko
In-Reply-To: <58409867.50001@ti.com>
On Thu, Dec 01, 2016 at 04:38:47PM -0500, Murali Karicheri wrote:
> Hi Andrew,
> On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> > Hi Murali
> >
> >> 2. Switch mode where it implements a simple Ethernet switch. Currently
> >> it doesn't have address learning capability, but in future it
> >> can.
> >
> > If it does not have address learning capabilities, does it act like a
> > plain old hub? What comes in one port goes out all others?
>
> Thanks for the response!
>
> Yes. It is a plain hub. it replicates frame to both ports. So need to
> run a bridge layer for address learning in software.
Hi Murali
It would be good if you start thinking about all the different
directions. It is not just host to port A and host to port B. What
about port A to Port B? Can it do that in hardware?
> I think not. I see we have a non Linux implementation that does address
> learning in software using a hash table and look up MAC for each packet
> to see which port it needs to be sent to.
I think i need to read more about the switch. I'm starting to wonder
if it has enough intelligence to be usable. Switchdev is about pushing
configuration down into the switch. It does not however sound like
there is that much which is configurable in this switch.
Andrew
^ permalink raw reply
* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
From: Jesper Dangaard Brouer @ 2016-12-02 15:37 UTC (permalink / raw)
To: Paolo Abeni
Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
Rick Jones, netdev@vger.kernel.org, Hannes Frederic Sowa, brouer
In-Reply-To: <1480630668.5345.7.camel@redhat.com>
On Thu, 01 Dec 2016 23:17:48 +0100
Paolo Abeni <pabeni@redhat.com> wrote:
> On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > small socket queues)
> >
> > On Wed, 30 Nov 2016 16:35:20 +0000
> > Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > > > I don't quite get why you are setting the socket recv size
> > > > (with -- -s and -S) to such a small number, size + 256.
> > > >
> > >
> > > Maybe I missed something at the time I wrote that but why would it
> > > need to be larger?
> >
> > Well, to me it is quite obvious that we need some queue to avoid packet
> > drops. We have two processes netperf and netserver, that are sending
> > packets between each-other (UDP_STREAM mostly netperf -> netserver).
> > These PIDs are getting scheduled and migrated between CPUs, and thus
> > does not get executed equally fast, thus a queue is need absorb the
> > fluctuations.
> >
> > The network stack is even partly catching your config "mistake" and
> > increase the socket queue size, so we minimum can handle one max frame
> > (due skb "truesize" concept approx PAGE_SIZE + overhead).
> >
> > Hopefully for localhost testing a small queue should hopefully not
> > result in packet drops. Testing... ups, this does result in packet
> > drops.
> >
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> >
> > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> > -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> >
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > Socket Message Elapsed Messages
> > Size Size Time Okay Errors Throughput
> > bytes bytes secs # # 10^6bits/sec
> >
> > 4608 1024 60.00 50024301 0 6829.98
> > 2560 60.00 46133211 6298.72
> >
> > Dropped packets: 50024301-46133211=3891090
> >
> > To get a better drop indication, during this I run a command, to get
> > system-wide network counters from the last second, so below numbers are
> > per second.
> >
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives 885162 0.0
> > IpInDelivers 885161 0.0
> > IpOutRequests 885162 0.0
> > UdpInDatagrams 776105 0.0
> > UdpInErrors 109056 0.0
> > UdpOutDatagrams 885160 0.0
> > UdpRcvbufErrors 109056 0.0
> > IpExtInOctets 931190476 0.0
> > IpExtOutOctets 931189564 0.0
> > IpExtInNoECTPkts 885162 0.0
> >
> > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > process didn't empty it's queue fast enough see [2].
> >
> > Although upstream changes are coming in this area, [2] is replaced with
> > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> >
> > Retesting with kernel 4.7.0-baseline+ ... show something else.
> > To Paolo, you might want to look into this. And it could also explain why
> > I've not see the mentioned speedup by mm-change, as I've been testing
> > this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.
>
> Thank you for reporting this.
>
> It seems that the commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict while checking the rcvbuf.
>
> For very small value of rcvbuf, it allows a single skb to be enqueued,
> while previously we allowed 2 of them to enter the queue, even if the
> first one truesize exceeded rcvbuf, as in your test-case.
>
> Can you please try the following patch ?
Sure, it looks much better with this patch.
$ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
4608 1024 60.00 50191555 0 6852.82
2560 60.00 50189872 6852.59
Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec
$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 885417 0.0
IpInDelivers 885416 0.0
IpOutRequests 885417 0.0
UdpInDatagrams 885382 0.0
UdpInErrors 29 0.0
UdpOutDatagrams 885410 0.0
UdpRcvbufErrors 29 0.0
IpExtInOctets 931534428 0.0
IpExtOutOctets 931533376 0.0
IpExtInNoECTPkts 885488 0.0
> Thank you,
>
> Paolo
> ---
> net/ipv4/udp.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index e1d0bf8..2f5dc92 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> struct sk_buff_head *list = &sk->sk_receive_queue;
> int rmem, delta, amt, err = -ENOMEM;
> int size = skb->truesize;
> + int limit;
>
> /* try to avoid the costly atomic add/sub pair when the receive
> * queue is full; always allow at least a packet
> */
> rmem = atomic_read(&sk->sk_rmem_alloc);
> - if (rmem && (rmem + size > sk->sk_rcvbuf))
> + limit = size + sk->sk_rcvbuf;
> + if (rmem > limit)
> goto drop;
>
> /* we drop only if the receive buf is full and the receive
> * queue contains some other skb
> */
> rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> - if ((rmem > sk->sk_rcvbuf) && (rmem > size))
> + if (rmem > limit)
> goto uncharge_drop;
>
> spin_lock(&list->lock);
>
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 0/2] net: Add support for SGMII PCS on Altera TSE MAC
From: David Miller @ 2016-12-02 15:37 UTC (permalink / raw)
To: neill.whillans
Cc: netdev, linux-kernel, linux-kernel, nios2-dev, vbridger,
f.fainelli
In-Reply-To: <cover.1480497966.git.neill.whillans@codethink.co.uk>
From: Neill Whillans <neill.whillans@codethink.co.uk>
Date: Wed, 30 Nov 2016 13:41:03 +0000
> These patches were created as part of work to add support for SGMII
> PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6
> git tree.
>
> The first patch in the series adds support for the VSC8572 dual-port
> Gigabit Ethernet transceiver, used in integration testing.
>
> The second patch adds support for the SGMII PCS functionality to the
> Altera TSE driver.
Series applied, thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox