Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net v2] tipc: check minimum bearer MTU
From: Michal Kubecek @ 2016-12-01 17:52 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Jon Maloy, Ying Xue, David S. Miller, tipc-discussion, netdev,
	linux-kernel, Qian Zhang
In-Reply-To: <1480608678.16599.94.camel@decadent.org.uk>

On Thu, Dec 01, 2016 at 04:11:18PM +0000, Ben Hutchings wrote:
> On Thu, 2016-12-01 at 12:02 +0100, Michal Kubecek wrote:
> [...] 
> > +/* check if device MTU is sufficient for tipc headers */
> > +static inline bool tipc_check_mtu(struct net_device *dev, unsigned int reserve)
> > +{
> > +	if (dev->mtu >= TIPC_MIN_BEARER_MTU + reserve)
> > +		return false;
> > +	netdev_warn(dev, "MTU too low for tipc bearer\n");
> > +	return true;
> > +}
> [...]
> 
> The comment says "check if ... sufficient" but the return value
> indicates the opposite.  Could you make these consistent?

Good point. I suppose renaming the function to e.g. tipc_mtu_bad() (and
rewording the commment accordingly) would also make the code more 
readable without looking at the definition.

I'll wait for other comments and send v3 tomorrow.

> Other than that, this looks OK to me.  I haven't tested any version as
> I don't know how to use TIPC.

I checked that the patch doesn't allow enabling a bearer on top of
device with insufficient MTU and it does for sufficient (100/128), both
in eth and udp case. I also checked that lowering MTU under 100 in eth
case disables attached bearer. I didn't run any deeper test like sending
an actual traffic but the patch shouldn't affect that.

Michal Kubecek

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Roopa Prabhu @ 2016-12-01 17:55 UTC (permalink / raw)
  To: David Lebrun; +Cc: Hannes Frederic Sowa, netdev@vger.kernel.org
In-Reply-To: <583E8646.2010402@uclouvain.be>

On Tue, Nov 29, 2016 at 11:56 PM, David Lebrun
<david.lebrun@uclouvain.be> wrote:
> On 11/30/2016 04:52 AM, Hannes Frederic Sowa wrote:

>
>> Also please convert the sysctl to a netlink attribute if you pursue this
>> because if I change the sysctl while my quagga is hammering the routing
>> table I would like to know which nodes allocate what amount of memory.
>
> Yes, that was the idea.
>

I think Its best for it to be a global setting, and thats why sysctl
seems like the best way (unless there are other ways to set this
globally via rtnetlink). If it helps, most hw switch vendors
supporting this feature also provide a globally tunable knob and it is
not on by default.

^ permalink raw reply

* Re: [PATCH v7 net-next 2/6] bpf: Add new cgroup attach type to enable sock modifications
From: Alexei Starovoitov @ 2016-12-01 16:56 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480610888-31082-3-git-send-email-dsa@cumulusnetworks.com>

On Thu, Dec 01, 2016 at 08:48:04AM -0800, David Ahern wrote:
> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> Currently only sk_bound_dev_if is exported to userspace for modification
> by a bpf program.
> 
> This allows a cgroup to be configured such that AF_INET{6} sockets opened
> by processes are automatically bound to a specific device. In turn, this
> enables the running of programs that do not support SO_BINDTODEVICE in a
> specific VRF context / L3 domain.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 18:02 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35R_ZStV=DbD-7Gf_y5xXqQq113_6m5p-p0GQfv46v0Ow@mail.gmail.com>

On Thu, Dec 1, 2016 at 10:01 AM, Tom Herbert <tom@herbertland.com> wrote:
>
>
> On Thu, Dec 1, 2016 at 1:11 AM, Florian Westphal <fw@strlen.de> wrote:
>>
>> [ As already mentioned in my reply to Tom, here is
>> the xdp flamebait/critique ]
>>
>> Lots of XDP related patches started to appear on netdev.
>> I'd prefer if it would stop...
>>
>> To me XDP combines all disadvantages of stack bypass solutions like dpdk
>> with the disadvantages of kernel programming with a more limited
>> instruction set and toolchain.
>>
>> Unlike XDP userspace bypass (dpdk et al) allow use of any programming
>> model or language you want (including scripting languages), which
>> makes things a lot easier, e.g. garbage collection, debuggers vs.
>> crash+vmcore+printk...
>>
>> I have heared the argument that these restrictions that come with
>> XDP are great because it allows to 'limit what users can do'.
>>
>> Given existence of DPDK/netmap/userspace bypass is a reality, this is
>> a very weak argument -- why would anyone pick XDP over a dpdk/netmap
>> based solution?
>
Because, we've seen time an time again that attempts to bypass the
stack and run parallel stacks under the banner of "the kernel is too
slow" does not scale for large deployment. We've seen this with RDMA,
TOE, OpenOnload, and we'll see this for DPDK, FD.io, VPP and whatever
else people are going to dream up. If I have a couple hundred machines
running a single application like the HFT guys do, then sure I'd
probably look into such solutions. But when I have datacenters with
100Ks running an assortment of applications even contemplating the
possibility of deploying a parallel stacks gives me headache. We need
to consider an seemingly endless list of security issues,
manageability. robustness, protocol compatibility, etc. I really have
little interest in bringing a huge pile of 3rd party code that I have
to support, and I definitely have no interest in constantly replacing
all of my hardware to get the latest and greatest support for these
offloads as vendors leak them out. Given a choice between buying into
some kernel bypass solution versus hacking Linux a little bit to carve
out an accelerated data path to address the "kernel is too slow"
argument, I will choose the latter any day of the week.

Tom

>
> Because, we've seen time an time again that attempts to bypass the stack and
> run parallel stacks under the banner of "the kernel is too slow" does not
> scale for large deployment. We've seen this with RDMA, TOE, OpenOnload, and
> we'll see this for DPDK, FD.io, VPP and whatever else people are going to
> dream up. If I have a couple hundred machines running a single application
> like the HFT guys do, then sure I'd probably look into such solutions. But
> when I have datacenters with 100Ks running an assortment of applications
> even contemplating the possibility of deploying a parallel stacks gives me
> headache. We need to consider an seemingly endless list of security issues,
> manageability. robustness, protocol compatibility, etc. I really have little
> interest in bringing a huge pile of 3rd party code that I have to support,
> and I definitely have no interest in constantly replacing all of my hardware
> to get the latest and greatest support for these offloads as vendors leak
> them out. Given a choice between buying into some kernel bypass solution
> versus hacking Linux a little bit to carve out an accelerated data path to
> address the "kernel is too slow" argument, I will choose the latter any day
> of the week.
>
> Tom
>
>> XDP will always be less powerful and a lot more complicated,
>> especially considering users of dpdk (or toolkits built on top of it)
>> are not kernel programmers and userspace has more powerful ipc
>> (or storage) mechanisms.
>>
>> Aside from this, XDP, like DPDK, is a kernel bypass.
>> You might say 'Its just stack bypass, not a kernel bypass!'.
>> But what does that mean exactly?  That packets can still be passed
>> onward to normal stack?
>> Bypass solutions like netmap can also inject packets back to
>> kernel stack again.
>>
>> Running less powerful user code in a restricted environment in the kernel
>> address space is certainly a worse idea than separating this logic out
>> to user space.
>>
>> In light of DPDKs existence it make a lot more sense to me to provide
>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
>> to map nic directly into userspace, detaching tx/rx queue from kernel.
>>
>> John Fastabend sent something like this last year as a proof of
>> concept, iirc it was rejected because register space got exposed directly
>> to userspace.  I think we should re-consider merging netmap
>> (or something conceptually close to its design).
>>
>> b). with regards to a programmable data path: IFF one wants to do this
>> in kernel (and thats a big if), it seems much more preferrable to provide
>> a config/data-based approach rather than a programmable one.  If you want
>> full freedom DPDK is architecturally just too powerful to compete with.
>>
>> Proponents of XDP sometimes provide usage examples.
>> Lets look at some of these.
>>
>> == Application developement: ==
>> * DNS Server
>> data structures and algorithms need to be implemented in a mostly touring
>> complete language, so eBPF cannot readily be be used for that.
>> At least it will be orders of magnitude harder than in userspace.
>>
>> * TCP Endpoint
>> TCP processing in eBPF is a bit out of question while userspace tcp stacks
>> based on both netmap and dpdk already exist today.
>>
>> == Forwarding dataplane: ==
>>
>> * Router/Switch
>> Router and switches should actually adhere to standardized and specified
>> protocols and thus don't need a lot of custom software and specialized
>> software.  Still a lot more work compared to userspace offloads where
>> you can do things like allocating a 4GB array to perform nexthop lookup.
>> Also needs ability to perform tx on another interface.
>>
>> * Load balancer
>> State holding algorithm need sorting and searching, so also no fit for
>> eBPF (could be exposed by function exports, but then can we do DoS by
>> finding worst case scenarios?).
>>
>> Also again needs way to forward frame out via another interface.
>>
>> For cases where packet gets sent out via same interface it would appear
>> to be easier to use port mirroring in a switch and use stochastic
>> filtering
>> on end nodes to determine which host should take responsibility.
>>
>> XDP plus: central authority over how distribution will work in case
>> nodes are added/removed from pool.
>> But then again, it will be easier to hande this with netmap/dpdk where
>> more complicated scheduling algorithms can be used.
>>
>> * early drop/filtering.
>> While its possible to do "u32" like filters with ebpf, all modern nics
>> support ntuple filtering in hardware, which is going to be faster because
>> such packet will never even be signalled to the operating system.
>> For more complicated cases (e.g. doing socket lookup to check if
>> particular
>> packet does match bound socket (and expected sequence numbers etc) I don't
>> see easy ways to do that with XDP (and without sk_buff context).
>> Providing it via function exports is possible of course, but that will
>> only
>> result in an "arms race" where we will see special-sauce functions
>> all over the place -- DoS will always attempt to go for something
>> that is difficult to filter against, cf. all the recent volume-based
>> floodings.
>>
>> Thanks, Florian
>
>

^ permalink raw reply

* Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
From: kbuild test robot @ 2016-12-01 18:06 UTC (permalink / raw)
  To: Michał Kępień
  Cc: kbuild-all, Johannes Berg, David S . Miller, linux-wireless,
	netdev, linux-kernel
In-Reply-To: <20161130120317.11851-2-kernel@kempniu.pl>

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

Hi Michał,

[auto build test WARNING on mac80211-next/master]
[also build test WARNING on v4.9-rc7 next-20161201]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
config: blackfin-allmodconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All warnings (new ones prefixed by >>):

>> WARNING: net/rfkill/rfkill.o(.init.text+0xa2): Section mismatch in reference from the function _init_module() to the function .exit.text:_rfkill_handler_exit()
   The function __init _init_module() references
   a function __exit _rfkill_handler_exit().
   This is often seen when error handling in the init function
   uses functionality in the exit path.
   The fix is often to remove the __exit annotation of
   _rfkill_handler_exit() so it may be used outside an exit section.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 42125 bytes --]

^ permalink raw reply

* [PATCH iproute2 1/2] ss: print new tcp_info fields: delivery_rate and app_limited
From: Neal Cardwell @ 2016-12-01 18:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Eric Dumazet,
	Soheil Hassas Yeganeh

Dump the new delivery_rate and delivery_rate_app_limited fields that
were added to tcp_info in Linux v4.9.

Example output:
  pacing_rate 65.7Mbps delivery_rate 62.9Mbps

And for the application-limited case this looks like:
  pacing_rate 1031.1Mbps delivery_rate 87.4Mbps app_limited

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 misc/ss.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index dd77b81..18cfa93 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -766,6 +766,7 @@ struct tcpstat {
 	unsigned int	    lastack;
 	double		    pacing_rate;
 	double		    pacing_rate_max;
+	double		    delivery_rate;
 	unsigned long long  bytes_acked;
 	unsigned long long  bytes_received;
 	unsigned int	    segs_out;
@@ -789,6 +790,7 @@ struct tcpstat {
 	bool		    has_ecnseen_opt;
 	bool		    has_fastopen_opt;
 	bool		    has_wscale_opt;
+	bool		    app_limited;
 	struct dctcpstat    *dctcp;
 	struct tcp_bbr_info *bbr_info;
 };
@@ -1868,6 +1870,11 @@ static void tcp_stats_print(struct tcpstat *s)
 							s->pacing_rate_max));
 	}
 
+	if (s->delivery_rate)
+		printf(" delivery_rate %sbps", sprint_bw(b1, s->delivery_rate));
+	if (s->app_limited)
+		printf(" app_limited");
+
 	if (s->unacked)
 		printf(" unacked:%u", s->unacked);
 	if (s->retrans || s->retrans_total)
@@ -2162,6 +2169,8 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r,
 		s.not_sent = info->tcpi_notsent_bytes;
 		if (info->tcpi_min_rtt && info->tcpi_min_rtt != ~0U)
 			s.min_rtt = (double) info->tcpi_min_rtt / 1000;
+		s.delivery_rate = info->tcpi_delivery_rate * 8.;
+		s.app_limited = info->tcpi_delivery_rate_app_limited;
 		tcp_stats_print(&s);
 		free(s.dctcp);
 		free(s.bbr_info);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH iproute2 2/2] ss: print new tcp_info fields: busy, rwnd-limited, sndbuf-limited times
From: Neal Cardwell @ 2016-12-01 18:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <1480616500-16919-1-git-send-email-ncardwell@google.com>

From: Yuchung Cheng <ycheng@google.com>

Dump some new fields added to tcp_info in v4.10: tcpi_busy_time,
tcpi_rwnd_limited, tcpi_sndbuf_limited.

Example output for a flow busy for 110ms but never measurably limited by
receive window or send buffer:
   busy:110ms

Example output for a flow usually limited by receive window:
   busy:111ms rwnd_limited:101ms(91.0%)

Example output for a flow sometimes limited by send buffer:
   busy:50ms sndbuf_limited:10ms(20.0%)

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 misc/ss.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index 18cfa93..392dbf7 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -784,6 +784,9 @@ struct tcpstat {
 	double		    rcv_rtt;
 	double		    min_rtt;
 	int		    rcv_space;
+	unsigned long long  busy_time;
+	unsigned long long  rwnd_limited;
+	unsigned long long  sndbuf_limited;
 	bool		    has_ts_opt;
 	bool		    has_sack_opt;
 	bool		    has_ecn_opt;
@@ -1875,6 +1878,18 @@ static void tcp_stats_print(struct tcpstat *s)
 	if (s->app_limited)
 		printf(" app_limited");
 
+	if (s->busy_time) {
+		printf(" busy:%llums", s->busy_time / 1000);
+		if (s->rwnd_limited)
+			printf(" rwnd_limited:%llums(%.1f%%)",
+			       s->rwnd_limited / 1000,
+			       100.0 * s->rwnd_limited / s->busy_time);
+		if (s->sndbuf_limited)
+			printf(" sndbuf_limited:%llums(%.1f%%)",
+			       s->sndbuf_limited / 1000,
+			       100.0 * s->sndbuf_limited / s->busy_time);
+	}
+
 	if (s->unacked)
 		printf(" unacked:%u", s->unacked);
 	if (s->retrans || s->retrans_total)
@@ -2171,6 +2186,9 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r,
 			s.min_rtt = (double) info->tcpi_min_rtt / 1000;
 		s.delivery_rate = info->tcpi_delivery_rate * 8.;
 		s.app_limited = info->tcpi_delivery_rate_app_limited;
+		s.busy_time = info->tcpi_busy_time;
+		s.rwnd_limited = info->tcpi_rwnd_limited;
+		s.sndbuf_limited = info->tcpi_sndbuf_limited;
 		tcp_stats_print(&s);
 		free(s.dctcp);
 		free(s.bbr_info);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* RE: [PATCH v2] sh_eth: remove unchecked interrupts
From: Chris Brandt @ 2016-12-01 18:24 UTC (permalink / raw)
  To: Sergei Shtylyov, David Miller
  Cc: Simon Horman, Geert Uytterhoeven, netdev@vger.kernel.org,
	linux-renesas-soc@vger.kernel.org
In-Reply-To: <7d50b5e0-ae7c-a39d-625a-c9cfc11e398f@cogentembedded.com>

On 12/1/2016, Sergei Shtylyov wrote:
>     One thing you've missed so far is mentioning R7S72100 (RZ/A1) in the
> subject. This driver supports many SoCs, you're only fixing one of them...

For the last sh_eth.c patch I submitted, I had:

"net: ethernet: renesas: sh_eth: add POST registers for rz"


Should I resend the patch as:

"[PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1"


??

Chris


^ permalink raw reply

* Re: [PATCH v2] sh_eth: remove unchecked interrupts
From: Sergei Shtylyov @ 2016-12-01 18:28 UTC (permalink / raw)
  To: Chris Brandt, David Miller
  Cc: Simon Horman, Geert Uytterhoeven, netdev@vger.kernel.org,
	linux-renesas-soc@vger.kernel.org
In-Reply-To: <SG2PR06MB1165DFB4E735FC10696F08CA8A8F0@SG2PR06MB1165.apcprd06.prod.outlook.com>

On 12/01/2016 09:24 PM, Chris Brandt wrote:

>>     One thing you've missed so far is mentioning R7S72100 (RZ/A1) in the
>> subject. This driver supports many SoCs, you're only fixing one of them...
>
> For the last sh_eth.c patch I submitted, I had:
>
> "net: ethernet: renesas: sh_eth: add POST registers for rz"
>
>
> Should I resend the patch as:
>
> "[PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1"
>
>
> ??

    Yes, that will do.

> Chris

MBR, Sergei

^ permalink raw reply

* [net PATCH 0/2] IPv4 FIB suffix length fixes
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem

In reviewing the patch from Robert Shearman and looking over the code I
realized there were a few different bugs we were still carrying in the IPv4
FIB lookup code.

These two patches are based off of Robert's original patch, but take things
one step further by splitting them up to address two additional issues I
found.

So first have Robert's original patch which was addressing the fact that
us calling update_suffix in resize is expensive when it is called per add.
To address that I incorporated the core bit of the patch which was us
dropping the update_suffix call from resize.

The first patch in the series does a rename and fix on the push_suffix and
pull_suffix code.  Specifically we drop the need to pass a leaf and
secondly we fix things so we pull the suffix as long as the value of the
suffix in the node is dropping.

The second patch addresses the original issue reported as well as
optimizing the code for the fact that update_suffix is only really meant to
go through and clean things up when we are decreasing a suffix.  I had
originally added code for it to somehow cause an increase, but if we push
the suffix when a new leaf is added we only ever have to handle pulling
down the suffix with update_suffix so I updated the code to reflect that.

As far as side effects the only ones I think that will be obvious should be
the fact that some routes may be able to be found earlier since before we
relied on resize to update the suffix lengths, and now we are updating them
before we add or remove the leaf.

---

Alexander Duyck (2):
      ipv4: Drop leaf from suffix pull/push functions
      ipv4: Drop suffix update from resize code


 net/ipv4/fib_trie.c |   68 ++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

^ permalink raw reply

* [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem
In-Reply-To: <20161201121605.15499.13176.stgit@ahduyck-blue-test.jf.intel.com>

It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it.  Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues.  Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix.  I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/fib_trie.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index e2ffc2a..c5cd226 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -892,22 +892,24 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	return tp;
 }
 
-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
+static void node_pull_suffix(struct key_vector *tn, unsigned char slen)
 {
-	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-		if (update_suffix(tp) > l->slen)
+	unsigned char node_slen = tn->slen;
+
+	while ((node_slen > tn->pos) && (node_slen > slen)) {
+		slen = update_suffix(tn);
+		if (node_slen == slen)
 			break;
-		tp = node_parent(tp);
+
+		tn = node_parent(tn);
+		node_slen = tn->slen;
 	}
 }
 
-static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
+static void node_push_suffix(struct key_vector *tn, unsigned char slen)
 {
-	/* if this is a new leaf then tn will be NULL and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
-	while (tn->slen < l->slen) {
-		tn->slen = l->slen;
+	while (tn->slen < slen) {
+		tn->slen = slen;
 		tn = node_parent(tn);
 	}
 }
@@ -1069,7 +1071,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
 	/* if we added to the tail node then we need to update slen */
 	if (l->slen < new->fa_slen) {
 		l->slen = new->fa_slen;
-		leaf_push_suffix(tp, l);
+		node_push_suffix(tp, new->fa_slen);
 	}
 
 	return 0;
@@ -1482,7 +1484,7 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 
 	/* update the trie with the latest suffix length */
 	l->slen = fa->fa_slen;
-	leaf_pull_suffix(tp, l);
+	node_pull_suffix(tp, fa->fa_slen);
 }
 
 /* Caller must hold RTNL. */

^ permalink raw reply related

* Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Alexander Duyck @ 2016-12-01 18:29 UTC (permalink / raw)
  To: Robert Shearman; +Cc: David Miller, Netdev, Alexander Duyck
In-Reply-To: <50c4f236-c486-1224-e3b0-c3e322d68d58@brocade.com>

On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman <rshearma@brocade.com> wrote:
> On 29/11/16 23:14, Alexander Duyck wrote:
>>
>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman <rshearma@brocade.com>
>> wrote:
>>>
>>> With certain distributions of routes it can take a long time to add
>>> and delete further routes at scale. For example, with a route for
>>> 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
>>> from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
>>> is update_suffix:
>>>
>>>       40.39%  [kernel]                      [k] update_suffix
>>>        8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup
>>
>>
>> Well yeah, it will be expensive when you have over 512K entries in a
>> single node.  I'm assuming that is the case based on the fact that
>> 200K routes will double up in the trie node due to broadcast and the
>> route ending up in adjacent key vectors.
>
>
> The example scenario was in fact using a large scale of just routes rather
> than addresses so there are no broadcast leafs (nor local leafs):
>
>         +-- 8.0.0.0/6 18 2 52436 suffix/20
>            |-- 8.0.0.0
>               /24 universe UNICAST
>            |-- 8.0.1.0
>               /24 universe UNICAST
>            |-- 8.0.2.0
>               /24 universe UNICAST
>
> (the "suffix/20" is for test purposes as per below). In this case the
> 8.0.0.0/6 node has a child array of size 2^18 = 262144.
>
>>
>>> With these changes, the time is reduced to 4s for the same scale and
>>> distribution of routes.
>>>
>>> The issue is that update_suffix does an O(n) walk on the children of a
>>> node and the with a dense distribtion of routes the number of children
>>> in a node tends towards the number of nodes in the tree.
>>
>>
>> Other than the performance I was just wondering if you did any other
>> validation to confirm that nothing else differs.  Basically it would
>> just be a matter of verifying that /proc/net/fib_trie is the same
>> before and after your patch.
>
>
> Yes, to verify these changes I applied some local changes to dump the slen
> field of trie nodes from /proc/net/fib_trie. I presumed that the format of
> /proc/net/fib_trie is a public API that we can't break so I intend to submit
> this as a patch. I verified by inspection that the suffix length listed in
> the nodes was as expected when exercising the insert and remove functions
> for routes with both larger and smaller suffixes than in the current nodes,
> and then for the inflate, halve and flush cases.
>
> I've now verified that a diff of /proc/net/fib_trie before and after my
> patch with all the routes added of my example scenario shows no changes.
>
>>
> ...
>>>
>>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>>
>>
>> So I am not entirely convinced this is a regression, I was wondering
>> what your numbers looked like before the patch you are saying this
>> fixes?  I recall I fixed a number of issues leading up to this so I am
>> just curious.
>
>
> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks
> for index >= tnode_child_length from tnode_get_child") which is the parent
> of 5405afd1a306:
>
> $ time sudo ip route restore < ~/allroutes
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
>
> real    0m3.451s
> user    0m0.184s
> sys     0m2.004s
>
>
> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking
> value for suffix length"):
>
> $ time sudo ip route restore < ~/allroutes
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
>
> real    0m48.624s
> user    0m0.360s
> sys     0m46.960s
>
> So does that warrant a fixes tag for this patch?

Yes, please include it in the patch description next time.

I've been giving it thought and the fact is the patch series has
merit.  I just think it was being a bit too aggressive in terms of
some of the changes.  So placing any changes in put_child seem to be
the one piece in this that make this a bit ugly.  I'll submit a
counter proposal, if you could try testing it I would appreciate it,
or if you could look at incorporating some of it into your patch it
would be useful.

>>
>> My thought is this seems more like a performance optimization than a
>> fix.  If that is the case this probably belongs in net-next.
>>
>> It seems to me we might be able to simplify update_suffix rather than
>> going through all this change.  That might be something that is more
>> acceptable for net.  Have you looked at simply adding code so that
>> there is a break inside update_suffix if (slen == tn->slen)?  We don't
>> have to call it for if a leaf is added so it might make sense to add
>> that check.
>
>
> That doesn't really help since the search always starts from the smallest
> suffix length and thus could potentially require visiting a large number of
> children before finding the node that makes slen == tn->slen.
>
> In the example above, 10.37.96.0/20 would be child number 140640 in node
> 8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that it
> requires 70320 iterations round to find 10.37.96.0/20 which contributes the
> largest suffix length to the node.

Yes, however that is still 60K fewer iterations we would have to do.

> Now it may be possible to improve the algorithm by starting the search from
> the largest suffix length towards the smallest suffix length instead of the
> current smallest to largest, but there would still be distributions of
> routes that would lead to needing to visit a large number of nodes only to
> find that nothing has changed.

The largest possible suffix always starts at 0.  Any bits in the index
mean that the suffix has to stop before that bit since suffix
represents the number of bits that are 0.  By starting at 0 and
increasing the stride as we hit bigger values we should be about as
optimal there as we can be.

>>
>>> ---
>>>  net/ipv4/fib_trie.c | 86
>>> ++++++++++++++++++++++++++++++++++++++---------------
>>>  1 file changed, 62 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 026f309c51e9..701cae8af44a 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn,
>>> struct key_vector *n)
>>>         return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
>>>  }
>>>
>>> +static void node_push_suffix(struct key_vector *tn, struct key_vector
>>> *l)
>>> +{
>>> +       while (tn->slen < l->slen) {
>>> +               tn->slen = l->slen;
>>> +               tn = node_parent(tn);
>>> +               if (!tn)
>>> +                       break;
>>
>>
>> I don't think the NULL check is necessary, at least it wasn't with the old
>> code.
>
>
> It wasn't necessary before because the root node had the largest possible
> suffix length of KEYLENGTH. This isn't the case for the nodes this is now
> called on since they're either nodes without parents or sub-tries that end
> up in node without a parent, where they're initialised to have the smallest
> possible suffix length for the node (equal to their position).
>
>>
>>> +       }
>>> +}
>>> +
>>>  /* Add a child at position i overwriting the old value.
>>>   * Update the value of full_children and empty_children.
>>> + *
>>> + * The suffix length of the parent node and the rest of the tree is
>>> + * updated (if required) when adding/replacing a node, but is caller's
>>> + * responsibility on removal.
>>>   */
>>>  static void put_child(struct key_vector *tn, unsigned long i,
>>>                       struct key_vector *n)
>>> @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned
>>> long i,
>>>         else if (!wasfull && isfull)
>>>                 tn_info(tn)->full_children++;
>>>
>>> -       if (n && (tn->slen < n->slen))
>>> -               tn->slen = n->slen;
>>> +       if (n)
>>> +               node_push_suffix(tn, n);
>>
>>
>> This is just creating redundant work if we have to perform a resize.
>
>
> The only real redundant work is the assignment of slen in tn, since the
> propagation up the trie has to happen regardless and if a subsequent resize
> results in changes to the trie then the propagation will stop at tn's parent
> since the suffix length of the tn's parent will not be less than tn's suffix
> length.


This isn't going to work.  We want to avoid trying to push the suffix
when we place a child.  There are scenarios where we are placing
children in nodes that don't have parents yet because we are doing
rebalances and such.  I suspect this could be pretty expensive on a
resize call.

We want to pull the suffix pushing out of the resize completely.  The
problem is this is going to cause us to start pushing suffixes for
every node moved as a part of resize.

>>
>>>         rcu_assign_pointer(tn->tnode[i], n);
>>>  }
>
> ...
>
>>> -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
>>> *l)
>>> +static void node_set_suffix(struct key_vector *tp, unsigned char slen)
>>>  {
>>> -       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>>> -               if (update_suffix(tp) > l->slen)
>>> +       if (slen > tp->slen)
>>> +               tp->slen = slen;
>>> +}
>>> +
>>> +static void node_pull_suffix(struct key_vector *tn)
>>> +{
>>> +       struct key_vector *tp;
>>> +       unsigned char slen;
>>> +
>>> +       slen = update_suffix(tn);
>>> +       tp = node_parent(tn);
>>> +       while ((tp->slen > tp->pos) && (tp->slen > slen)) {
>>> +               if (update_suffix(tp) > slen)
>>>                         break;
>>>                 tp = node_parent(tp);
>>>         }
>>>  }
>>>
>>

Actually I realized that there is a bug here.  The check for
update_suffix(tp) > slen can cause us to stop prematurely if I am not
mistaken.  What we should probably be doing is pulling out tp->slen
and if the result of update_suffix is the same value then we can break
the loop.  Otherwise we can stop early if a "second runner up" shows
up that is supposed to be pushed up the trie.

I actually started around with patches to do much the same as what you
are doing and I will probably submit them as an RFC so if you need you
can pick pieces out as needed, or I could submit them if they work for
you.

>> I really hate all the renaming.  Can't you just leave these as
>> leaf_pull_suffix and leaf_push _suffix.  I'm fine with us dropping the
>> leaf of the suffix but the renaming just makes adds a bunch of noise.
>> Really it might work better if you broke the replacement of the
>> key_vector as a leaf with the suffix length into a separate patch,
>> then you could do the rename as a part of that.
>
>
> Ok, I'll leave the naming of leaf_push_suffix alone. Note that
> leaf_pull_suffix hasn't been renamed, the below in the diff is just an
> artifact of how diff decided to present the changes along with the moving of
> leaf_push_suffix.

So I have been looking this over for the last couple days and actually
I have started to be okay with the renaming.

However I would ask to be consistent.  If you are going to have
node_pull_suffix and node_push_suffix then just pass a node and a
suffix length.  You can drop the leaf key vector from both functions
instead of just one.

>>
>>> -static void leaf_push_suffix(struct key_vector *tn, struct key_vector
>>> *l)
>>> +static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
>>> *l)
>>>  {
>>> -       /* if this is a new leaf then tn will be NULL and we can sort
>>> -        * out parent suffix lengths as a part of trie_rebalance
>>> -        */
>>> -       while (tn->slen < l->slen) {
>>> -               tn->slen = l->slen;
>>> -               tn = node_parent(tn);
>>> +       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>>> +               if (update_suffix(tp) > l->slen)
>>> +                       break;
>>> +               tp = node_parent(tp);
>>>         }
>>>  }
>>
>>
>> If possible it would work better if you could avoid moving functions
>> around as a result of your changes.
>
>
> Ok, I can add a forward declaration instead.

It shouldn't be necessary.  I think you are doing way too much with
this patch.  We just need this to be a fix, you are doing a bit too
much like adding this new node_set_suffix function which isn't really
needed.

>>
>>> @@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct
>>> key_vector *tp,
>>>         /* if we added to the tail node then we need to update slen */
>>>         if (l->slen < new->fa_slen) {
>>>                 l->slen = new->fa_slen;
>>> -               leaf_push_suffix(tp, l);
>>> +               node_push_suffix(tp, l);
>>>         }
>>>
>>>         return 0;
>
> ...
>>>
>>> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table
>>> *tb)
>>>                         if (IS_TRIE(pn))
>>>                                 break;
>>>
>>> +                       /* push the suffix length to the parent node,
>>> +                        * since a previous leaf removal may have
>>> +                        * caused it to change
>>> +                        */
>>> +                       if (pn->slen > pn->pos) {
>>> +                               unsigned char slen = update_suffix(pn);
>>> +
>>> +                               node_set_suffix(node_parent(pn), slen);
>>> +                       }
>>> +
>>
>>
>> Why bother with the local variable?  You could just pass
>> update_suffix(pn) as the parameter to your node_set_suffix function.
>
>
> To avoid a long line on the node_set_suffix call or splitting it across
> lines, but I'll remove the local variable as you suggest and the same below.

Actually I think there is a bug here.  You shouldn't be setting the
suffix for the parent based on the child.  You can just call
update_suffix(pn) and that should be enough.

^ permalink raw reply

* [net PATCH 2/2] ipv4: Drop suffix update from resize code
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem
In-Reply-To: <20161201121605.15499.13176.stgit@ahduyck-blue-test.jf.intel.com>

It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same.  The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/fib_trie.c |   42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index c5cd226..2cfa9f4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -681,6 +681,13 @@ static unsigned char update_suffix(struct key_vector *tn)
 {
 	unsigned char slen = tn->pos;
 	unsigned long stride, i;
+	unsigned char slen_max;
+
+	/* only vector 0 can have a suffix length greater than or equal to
+	 * tn->pos + tn->bits, the second highest node will have a suffix
+	 * length at most of tn->pos + tn->bits - 1
+	 */
+	slen_max = min_t(unsigned char, tn->pos + tn->bits - 1, tn->slen);
 
 	/* search though the list of children looking for nodes that might
 	 * have a suffix greater than the one we currently have.  This is
@@ -698,12 +705,8 @@ static unsigned char update_suffix(struct key_vector *tn)
 		slen = n->slen;
 		i &= ~(stride - 1);
 
-		/* if slen covers all but the last bit we can stop here
-		 * there will be nothing longer than that since only node
-		 * 0 and 1 << (bits - 1) could have that as their suffix
-		 * length.
-		 */
-		if ((slen + 1) >= (tn->pos + tn->bits))
+		/* stop searching if we have hit the maximum possible value */
+		if (slen >= slen_max)
 			break;
 	}
 
@@ -875,21 +878,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 		return collapse(t, tn);
 
 	/* update parent in case halve failed */
-	tp = node_parent(tn);
-
-	/* Return if at least one deflate was run */
-	if (max_work != MAX_WORK)
-		return tp;
-
-	/* push the suffix length to the parent node */
-	if (tn->slen > tn->pos) {
-		unsigned char slen = update_suffix(tn);
-
-		if (slen > tp->slen)
-			tp->slen = slen;
-	}
-
-	return tp;
+	return node_parent(tn);
 }
 
 static void node_pull_suffix(struct key_vector *tn, unsigned char slen)
@@ -1030,6 +1019,7 @@ static int fib_insert_node(struct trie *t, struct key_vector *tp,
 	}
 
 	/* Case 3: n is NULL, and will just insert a new leaf */
+	node_push_suffix(tp, new->fa_slen);
 	NODE_INIT_PARENT(l, tp);
 	put_child_root(tp, key, l);
 	trie_rebalance(t, tp);
@@ -1472,6 +1462,8 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 	 * out parent suffix lengths as a part of trie_rebalance
 	 */
 	if (hlist_empty(&l->leaf)) {
+		if (tp->slen == l->slen)
+			node_pull_suffix(tp, tp->pos);
 		put_child_root(tp, l->key, NULL);
 		node_free(l);
 		trie_rebalance(t, tp);
@@ -1753,6 +1745,10 @@ void fib_table_flush_external(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* update the suffix to address pulled leaves */
+			if (pn->slen > pn->pos)
+				update_suffix(pn);
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
@@ -1828,6 +1824,10 @@ int fib_table_flush(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* update the suffix to address pulled leaves */
+			if (pn->slen > pn->pos)
+				update_suffix(pn);
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);

^ permalink raw reply related

* [PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1
From: Chris Brandt @ 2016-12-01 18:32 UTC (permalink / raw)
  To: David Miller, Sergei Shtylyov
  Cc: Simon Horman, Geert Uytterhoeven, netdev, linux-renesas-soc,
	Chris Brandt

When streaming a lot of data and the RZ/A1 can't keep up, some status bits
will get set that are not being checked or cleared which cause the
following messages and the Ethernet driver to stop working. This
patch fixes that issue.

irq 21: nobody cared (try booting with the "irqpoll" option)
handlers:
[<c036b71c>] sh_eth_interrupt
Disabling IRQ #21

Fixes: db893473d313a4ad ("sh_eth: Add support for r7s72100")
Signed-off-by: Chris Brandt <chris.brandt@renesas.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
---
v3:
* add RZ/A1 to subject line
v2:
* switched from modifying eesr_err_check to modifying eesipr_value
---
 drivers/net/ethernet/renesas/sh_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..1a92de7 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
 
 	.ecsr_value	= ECSR_ICD,
 	.ecsipr_value	= ECSIPR_ICDIP,
-	.eesipr_value	= 0xff7f009f,
+	.eesipr_value	= 0xe77f009f,
 
 	.tx_check	= EESR_TC1 | EESR_FTC,
 	.eesr_err_check	= EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
-- 
2.10.1

^ permalink raw reply related

* pull-request: wireless-drivers-next 2016-12-01
From: Kalle Valo @ 2016-12-01 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel

Hi Dave,

here's another pull request for net-next. Nothing special to mention
about, the details are in the signed tag below.

This time there's a trivial conflict in
drivers/net/wireless/ath/ath10k/mac.c:

<<<<<<< HEAD
	ieee80211_hw_set(ar->hw, SUPPORTS_TX_FRAG);
=======
	ieee80211_hw_set(ar->hw, REPORTS_LOW_ACK);
>>>>>>> d5fb3a138048798ce4cc4b4ced47d07d1794c577

We want to have both flags enabled in ath10k.

I'm planning to submit at least one more pull request, if Linus gives us
one more week I might send even two. For example there are patches to
convert wcn36xx to use the real SMD bus subsystem but they depend on few
arm-soc patches. I'll send a separate email about that, they are not
part of this pull request.

Please let me know if there are any problems.

Kalle

The following changes since commit 159a55a64d44acbbd6f0d8f3c082e628d6d75670:

  rt2800: disable CCK rates on HT (2016-11-23 17:38:53 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2016-12-01

for you to fetch changes up to d5fb3a138048798ce4cc4b4ced47d07d1794c577:

  Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2016-12-01 15:09:14 +0200)

----------------------------------------------------------------

wireless-drivers-next patches for 4.10

Major changes:

rsi

* filter rx frames
* configure tx power
* make it possible to select antenna
* support 802.11d

brcmfmac

* cleanup of scheduled scan code
* support for bcm43341 chipset with different chip id
* support rev6 of PCIe device interface

ath10k

* add spectral scan support for QCA6174 and QCA9377 families
* show used tx bitrate with 10.4 firmware

wil6210

* add power save mode support
* add abort scan functionality
* add support settings retry limit for short frames

bcma

* add Dell Inspiron 3148

----------------------------------------------------------------
Anilkumar Kolli (2):
      ath10k: add per peer htt tx stats support for 10.4
      ath10k: add support for per sta tx bitrate

Anthony Romano (2):
      mt7601u: wait for clear rxq when stopping mac
      ath9k_htc: don't use HZ for usb msg timeouts

Arend Van Spriel (11):
      brcmfmac: add support for 43341 chip
      brcmfmac: move pno helper functions in separate source file
      brcmfmac: fix handling ssids in .sched_scan_start() callback
      brcmfmac: change prototype for brcmf_do_escan()
      brcmfmac: make internal escan more generic
      brcmfmac: split up brcmf_pno_config() function
      brcmfmac: move scheduled scan activation to pno source file
      brcmfmac: use provided channels for scheduled scan
      brcmfmac: remove restriction from .sched_scan_start() callback
      brcmfmac: use requested scan interval in scheduled scan
      brcmfmac: fix scheduled scan result handling for newer chips

Barry Day (1):
      rtl8xxxu: tx rate reported before set

Ben Greear (1):
      ath10k: wmi-alloc-chunk should use DMA_BIDIRECTIONAL

Bhumika Goyal (1):
      ath9k: constify ath_bus_ops structure

Brian Norris (3):
      mwifiex: cleanup wake-IRQ handling if suspend fails
      mwifiex: avoid double-disable_irq() race
      mwifiex: pcie: implement timeout loop for FW programming doorbell

Dedy Lansky (1):
      wil6210: fix net queue stop/wake

Erik Stromdahl (1):
      ath10k: fix TLV set regdomain command

Franky Lin (1):
      brcmfmac: add pcie host dongle interface rev6 support

Geliang Tang (1):
      ath5k: drop duplicate header vmalloc.h

Jes Sorensen (7):
      rtl8xxxu: Fix memory leak in handling rxdesc16 packets
      rtl8xxxu: Fix big-endian problem reporting mactime
      rtl8xxxu: Fix rtl8723bu driver reload issue
      rtl8xxxu: Fix rtl8192eu driver reload issue
      rtl8xxxu: Obtain RTS rates from mac80211
      rtl8xxxu: Pass tx_info to fill_txdesc in order to have access to retry count
      rtl8xxxu: Work around issue with 8192eu and 8723bu devices not reconnecting

Jiri Slaby (1):
      bcma: add Dell Inspiron 3148

Kalle Valo (1):
      Merge ath-next from git://git.kernel.org/.../kvalo/ath.git

Karthik D A (1):
      mwifiex: Disable adhoc feature based on firmware capability

Kirtika Ruchandani (7):
      mwifiex: Removed unused mwifiex_private* 'priv' variable
      mwifiex: Remove unused 'chan_num' variable
      mwifiex: Remove unused 'sta_ptr' variable
      mwifiex: Remove unused 'adapter'variable
      mwifiex: Remove unused 'pm_flag' variable
      mwifiex: Removed unused 'pkt_type' variable
      mwifiex: Remove unused 'bcd_usb' variable

Larry Finger (1):
      rtlwifi: Fix enter/exit power_save

Lior David (6):
      wil6210: fix deadlock when using fw_no_recovery option
      wil6210: align to latest auto generated wmi.h
      wil6210: support NL80211_ATTR_WIPHY_RETRY_SHORT
      wil6210: delay remain on channel when scan is active
      wil6210: add debugfs blobs for UCODE code and data
      wil6210: align to latest auto generated wmi.h

Manoharan, Rajkumar (1):
      ath10k: fix monitor vdev for receiving other bss frames

Matthias Schiffer (1):
      ath9k: fix ath9k_hw_gpio_get() to return 0 or 1 on success

Maya Erez (3):
      wil6210: add support for power save enable / disable
      wil6210: add support for abort scan
      wil6210: validate wil_pmc_alloc parameters

Miaoqing Pan (1):
      ath9k: fix NULL pointer dereference

Michal Kazior (2):
      ath10k: fix null deref on wmi-tlv when trying spectral scan
      ath10k: add spectral scan support to wmi-tlv

Mohammed Shafi Shajakhan (2):
      ath10k: fix soft lockup during firmware crash/hw-restart
      ath10k: fix Tx DMA alloc failure during continuous wifi down/up

Pedersen, Thomas (2):
      ath10k: implement offset_tsf ieee80211_op
      ath10k: remove set/get_tsf ieee80211_ops

Prameela Rani Garnepudi (4):
      rsi: Add support to filter rx frames
      rsi: Add support for configuring tx power
      rsi: Add support for antenna selection
      rsi: Add support for 802.11d

Rajkumar Manoharan (1):
      ath10k: advertize hardware packet loss mechanism

Tobias Regnery (1):
      brcmsmac: fix array out-of-bounds access in qm_log10

Wei Yongjun (1):
      rtl8xxxu: Fix non static symbol warning

Zefir Kurtisi (1):
      ath9k: feed only active spectral / dfs-detector

 drivers/bcma/host_pci.c                            |    1 +
 drivers/net/wireless/ath/ath10k/core.c             |    8 +-
 drivers/net/wireless/ath/ath10k/core.h             |   24 +
 drivers/net/wireless/ath/ath10k/debugfs_sta.c      |   13 +
 drivers/net/wireless/ath/ath10k/htt.c              |    2 +
 drivers/net/wireless/ath/ath10k/htt.h              |   31 +-
 drivers/net/wireless/ath/ath10k/htt_rx.c           |  125 +++++
 drivers/net/wireless/ath/ath10k/htt_tx.c           |   54 +-
 drivers/net/wireless/ath/ath10k/mac.c              |   51 +-
 drivers/net/wireless/ath/ath10k/wmi-ops.h          |    6 +
 drivers/net/wireless/ath/ath10k/wmi-tlv.c          |   77 ++-
 drivers/net/wireless/ath/ath10k/wmi.c              |    8 +-
 drivers/net/wireless/ath/ath10k/wmi.h              |   18 +-
 drivers/net/wireless/ath/ath5k/debug.c             |    1 -
 drivers/net/wireless/ath/ath9k/ahb.c               |    2 +-
 drivers/net/wireless/ath/ath9k/common-spectral.c   |    8 +-
 drivers/net/wireless/ath/ath9k/hif_usb.c           |    9 +-
 drivers/net/wireless/ath/ath9k/hif_usb.h           |    2 +
 drivers/net/wireless/ath/ath9k/hw.c                |    2 +-
 drivers/net/wireless/ath/ath9k/recv.c              |   17 +-
 drivers/net/wireless/ath/wil6210/cfg80211.c        |  129 ++++-
 drivers/net/wireless/ath/wil6210/main.c            |  100 ++--
 drivers/net/wireless/ath/wil6210/netdev.c          |    2 +-
 drivers/net/wireless/ath/wil6210/p2p.c             |  160 ++++--
 drivers/net/wireless/ath/wil6210/pmc.c             |   55 +-
 drivers/net/wireless/ath/wil6210/txrx.c            |  110 +++-
 drivers/net/wireless/ath/wil6210/wil6210.h         |   25 +-
 drivers/net/wireless/ath/wil6210/wil_crash_dump.c  |    6 +
 drivers/net/wireless/ath/wil6210/wmi.c             |  160 +++++-
 drivers/net/wireless/ath/wil6210/wmi.h             |  586 ++++++++++++++++----
 .../wireless/broadcom/brcm80211/brcmfmac/Makefile  |    3 +-
 .../net/wireless/broadcom/brcm80211/brcmfmac/bus.h |   10 +-
 .../broadcom/brcm80211/brcmfmac/cfg80211.c         |  381 +++++--------
 .../broadcom/brcm80211/brcmfmac/cfg80211.h         |    4 +-
 .../broadcom/brcm80211/brcmfmac/fwil_types.h       |   23 +
 .../wireless/broadcom/brcm80211/brcmfmac/msgbuf.c  |   38 +-
 .../wireless/broadcom/brcm80211/brcmfmac/msgbuf.h  |    4 +
 .../wireless/broadcom/brcm80211/brcmfmac/pcie.c    |  171 +++---
 .../net/wireless/broadcom/brcm80211/brcmfmac/pno.c |  242 ++++++++
 .../net/wireless/broadcom/brcm80211/brcmfmac/pno.h |   40 ++
 .../wireless/broadcom/brcm80211/brcmfmac/sdio.c    |    1 +
 .../broadcom/brcm80211/brcmsmac/phy/phy_qmath.c    |    5 +-
 .../broadcom/brcm80211/include/brcm_hw_ids.h       |    1 +
 drivers/net/wireless/marvell/mwifiex/cfg80211.c    |   12 +-
 drivers/net/wireless/marvell/mwifiex/fw.h          |    1 +
 drivers/net/wireless/marvell/mwifiex/main.c        |    3 -
 drivers/net/wireless/marvell/mwifiex/main.h        |    9 +-
 drivers/net/wireless/marvell/mwifiex/pcie.c        |   17 +-
 drivers/net/wireless/marvell/mwifiex/scan.c        |    8 +-
 drivers/net/wireless/marvell/mwifiex/sdio.c        |    6 +-
 drivers/net/wireless/marvell/mwifiex/sta_cmd.c     |   38 +-
 drivers/net/wireless/marvell/mwifiex/usb.c         |    3 +-
 drivers/net/wireless/mediatek/mt7601u/init.c       |   14 +-
 drivers/net/wireless/mediatek/mt7601u/regs.h       |    3 +
 drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h   |   31 +-
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c |   10 +-
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c |    4 +
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c  |  122 ++--
 drivers/net/wireless/realtek/rtlwifi/base.c        |    8 +-
 drivers/net/wireless/realtek/rtlwifi/core.c        |    9 +-
 drivers/net/wireless/realtek/rtlwifi/pci.c         |   14 +-
 drivers/net/wireless/realtek/rtlwifi/ps.c          |   36 +-
 drivers/net/wireless/rsi/rsi_91x_mac80211.c        |  156 +++++-
 drivers/net/wireless/rsi/rsi_91x_mgmt.c            |  129 ++++-
 drivers/net/wireless/rsi/rsi_main.h                |    4 +
 drivers/net/wireless/rsi/rsi_mgmt.h                |   23 +-
 66 files changed, 2581 insertions(+), 794 deletions(-)
 create mode 100644 drivers/net/wireless/broadcom/brcm80211/brcmfmac/pno.c
 create mode 100644 drivers/net/wireless/broadcom/brcm80211/brcmfmac/pno.h

^ permalink raw reply

* Re: linux-next: Tree for Dec 1 (ethernet/mellanox)
From: Randy Dunlap @ 2016-12-01 18:38 UTC (permalink / raw)
  To: Stephen Rothwell, linux-next
  Cc: linux-kernel, netdev@vger.kernel.org, Tariq Toukan,
	Saeed Mahameed
In-Reply-To: <20161201174235.21689436@canb.auug.org.au>

On 11/30/16 22:42, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20161130:
> 

on i386:

when CONFIG_INET is not enabled:

drivers/built-in.o: In function `mlx5e_test_loopback':
en_selftest.c:(.text+0x690bde): undefined reference to `ip_send_check'
en_selftest.c:(.text+0x690c42): undefined reference to `udp4_hwcsum'


-- 
~Randy

^ permalink raw reply

* Re: [iproute PATCH 04/18] ss: Use sockstat->type in all socket types
From: Stephen Hemminger @ 2016-12-01 18:41 UTC (permalink / raw)
  To: Phil Sutter; +Cc: netdev
In-Reply-To: <20161111131014.21865-5-phil@nwl.cc>

On Fri, 11 Nov 2016 14:10:00 +0100
Phil Sutter <phil@nwl.cc> wrote:

> Unix sockets used that field already to hold info about the socket type.
> By replicating this approach in all other socket types, we can get rid
> of protocol parameter in inet_stats_print() and have sock_state_print()
> figure things out by itself.
> 
> Signed-off-by: Phil Sutter <phil@nwl.cc>

Cleaning out the patch backlog...
This patch does not apply to current master branch, rejects are:

--- misc/ss.c
+++ misc/ss.c
@@ -1719,27 +1768,11 @@ void *parse_markmask(const char *markmask)
 	return res;
 }
 
-static char *proto_name(int protocol)
-{
-	switch (protocol) {
-	case 0:
-		return "raw";
-	case IPPROTO_UDP:
-		return "udp";
-	case IPPROTO_TCP:
-		return "tcp";
-	case IPPROTO_DCCP:
-		return "dccp";
-	}
-
-	return "???";
-}
-
-static void inet_stats_print(struct sockstat *s, int protocol)
+static void inet_stats_print(struct sockstat *s)
 {
 	char *buf = NULL;
 
-	sock_state_print(s, proto_name(protocol));
+	sock_state_print(s);
 
 	inet_addr_print(&s->local, s->lport, s->iface);
 	inet_addr_print(&s->remote, s->rport, 0);


Please fix and resubmit whole series.

^ permalink raw reply

* Re: [PATCH iproute2 2/3] ifstat: Add extended statistics to ifstat
From: Stephen Hemminger @ 2016-12-01 18:46 UTC (permalink / raw)
  To: Nogah Frankel; +Cc: netdev, eladr, yotamg, jiri, idosch, ogerlitz
In-Reply-To: <1479996760-61271-3-git-send-email-nogahf@mellanox.com>

On Thu, 24 Nov 2016 16:12:39 +0200
Nogah Frankel <nogahf@mellanox.com> wrote:

> Add extended stats option for ifstat. It supports stats that are in the
> nesting level as the "normal" stats or one lower, as long as they are in
> the same struct type as the "normal" stats.
> Every extension is unaware of data from other extension and is being
> presented by itself.
> The extension can be called by its name or any shorten of it. If there is
> more then one matched, the first one will be picked.
> 
> To get the extended stats the flag -x <stats type> is used.
> 
> Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>

Finally clearing up iproute2 patch backlog. This feature looks good,
but does not apply cleanly to current git master branch.

--- misc/ifstat.c
+++ misc/ifstat.c
@@ -733,7 +790,8 @@ static void usage(void)
 "   -s, --noupdate	don\'t update history\n"
 "   -t, --interval=SECS	report average over the last SECS\n"
 "   -V, --version	output version information\n"
-"   -z, --zeros		show entries with zero activity\n");
+"   -z, --zeros		show entries with zero activity\n"
+"   -x, --extended=TYPE	show extended stats of TYPE\n");
 
 	exit(-1);
 }

Please rebase your patches and resubmit.

^ permalink raw reply

* Re: [PATCH iproute2 net-next 2/4] tc: flower: document SCTP ip_proto
From: Stephen Hemminger @ 2016-12-01 18:50 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev
In-Reply-To: <1480434693-29397-3-git-send-email-simon.horman@netronome.com>

On Tue, 29 Nov 2016 16:51:31 +0100
Simon Horman <simon.horman@netronome.com> wrote:

> Add SCTP ip_proto to help text and man page.
> 
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

Sorry doesn't apply to current net-next branch in iproute2 git.
Probably some of the other changes modified formatting.

^ permalink raw reply

* RE: [PATCH v2] sh_eth: remove unchecked interrupts
From: Chris Brandt @ 2016-12-01 18:53 UTC (permalink / raw)
  To: Sergei Shtylyov, Geert Uytterhoeven
  Cc: David Miller, Simon Horman, Geert Uytterhoeven,
	netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <25b92de0-806b-342b-7556-06b96b948b2c@cogentembedded.com>

Hi Geert,

On 12/1/2016, Sergei Shtylyov wrote:
> 
> On 12/01/2016 05:42 PM, Geert Uytterhoeven wrote:
> 
> >> --- a/drivers/net/ethernet/renesas/sh_eth.c
> >> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> >> @@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
> >>
> >>         .ecsr_value     = ECSR_ICD,
> >>         .ecsipr_value   = ECSIPR_ICDIP,
> >> -       .eesipr_value   = 0xff7f009f,
> >> +       .eesipr_value   = 0xe77f009f,
> >
> > Comment not directly related to the merits of this patch: the EESIPR
> > bit definitions seem to be identical to those for EESR, so we can get
> > rid of the hardcoded values here?
> 
>     Do you mean using the @define's? We have EESIPR bits also declared,
> see enum DMAC_IM_BIT,


Is your idea to get rid of .eesipr_value for devices that have values
that are the same as .eesr_err_check?


For example in sh_eth_dev_init():

	sh_eth_modify(ndev, EESR, 0, 0);
	mdp->irq_enabled = true;
-	sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	if (mdp->cd->eesipr_value)
+		sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	else
+		sh_eth_write(ndev, mdp->cd->eesr_err_check, EESIPR);


Chris


^ permalink raw reply

* Re: [PATCH iproute2 V4 0/3] tc: Support for ip tunnel metadata set/unset/classify
From: Stephen Hemminger @ 2016-12-01 18:54 UTC (permalink / raw)
  To: Amir Vadai
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161201114446.30333-1-amir@vadai.me>

On Thu,  1 Dec 2016 13:44:43 +0200
Amir Vadai <amir@vadai.me> wrote:

> Hi,
> 
> This short series adds support for matching and setting metadata for ip tunnel
> shared device using the TC system, introduced in kernel 4.9 [1].
> 
> Applied and tested on top of commit f3f339e9590a ("cleanup debris from revert")
> 
> Example usage:
> 
> $ tc filter add dev vxlan0 protocol ip parent ffff: \
>     flower \
>       enc_src_ip 11.11.0.2 \
>       enc_dst_ip 11.11.0.1 \
>       enc_key_id 11 \
>       dst_ip 11.11.11.1 \
>     action mirred egress redirect dev vnet0
> 
> $ tc filter add dev net0 protocol ip parent ffff: \
>     flower \
>       ip_proto 1 \
>       dst_ip 11.11.11.2 \
>     action tunnel_key set \
>       src_ip 11.11.0.1 \
>       dst_ip 11.11.0.2 \
>       id 11 \
>     action mirred egress redirect dev vxlan0
> 
> [1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")
> 
> Thanks,
> Amir
> 
> Changes from V3:
> - Fix bad wording in the man page about the use of the 'unset' operation
> 
> Changes from V2:
> - Use const where needed
> - Don't lose return value
> - Introduce rta_getattr_be16() and rta_getattr_be32()
> 
> Changes from V1:
> - Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log
> 	and the man page tc-tunnel_key to reflect the fact that 'unset' operation is
> 	no mandatory.
> 	And describe when it might be needed.
> - Rename the 'release' operation to 'unset'
> 
> Amir Vadai (3):
>   libnetlink: Introduce rta_getattr_be*()
>   tc/cls_flower: Classify packet in ip tunnels
>   tc/act_tunnel: Introduce ip tunnel action
> 
>  bridge/fdb.c                         |   4 +-
>  include/libnetlink.h                 |   9 ++
>  include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
>  ip/iplink_geneve.c                   |   2 +-
>  ip/iplink_vxlan.c                    |   2 +-
>  man/man8/tc-flower.8                 |  17 ++-
>  man/man8/tc-tunnel_key.8             | 112 +++++++++++++++
>  tc/Makefile                          |   1 +
>  tc/f_flower.c                        |  84 +++++++++++-
>  tc/m_tunnel_key.c                    | 258 +++++++++++++++++++++++++++++++++++
>  10 files changed, 522 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/tc_act/tc_tunnel_key.h
>  create mode 100644 man/man8/tc-tunnel_key.8
>  create mode 100644 tc/m_tunnel_key.c

I cleared up patch backlog and got net-next branch up to date with kernel
headers, and this patch series does not apply cleanly anymore.

Please rebase and resubmit.

^ permalink raw reply

* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Stephen Hemminger @ 2016-12-01 18:57 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: netdev, avagin
In-Reply-To: <1478092496-7540-3-git-send-email-gorcunov@gmail.com>

On Wed,  2 Nov 2016 16:14:56 +0300
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> unix, tcp, udp[lite], packet, netlink sockets already support diag
> interface for their collection and killing. Implement support
> for raw sockets.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

Applied both patches, but needed to remove inet_diag.h since
already updated kernel headers.

^ permalink raw reply

* Re: [PATCH iproute2 1/2] ss: print new tcp_info fields: delivery_rate and app_limited
From: Stephen Hemminger @ 2016-12-01 19:01 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: netdev, Yuchung Cheng, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1480616500-16919-1-git-send-email-ncardwell@google.com>

On Thu,  1 Dec 2016 13:21:39 -0500
Neal Cardwell <ncardwell@google.com> wrote:

> Dump the new delivery_rate and delivery_rate_app_limited fields that
> were added to tcp_info in Linux v4.9.
> 
> Example output:
>   pacing_rate 65.7Mbps delivery_rate 62.9Mbps
> 
> And for the application-limited case this looks like:
>   pacing_rate 1031.1Mbps delivery_rate 87.4Mbps app_limited
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Looks good, applied to net-next branch

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:05 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201135508.GB24547@oracle.com>

On Thu, Dec 1, 2016 at 5:55 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (11/30/16 14:54), Tom Herbert wrote:
>>
>> Posting for discussion....
>    :
>> One simplifying assumption we might make is that TXDP is primarily for
>> optimizing latency, specifically request/response type operations
>> (think HPC, HFT, flash server, or other tightly coupled communications
>> within the datacenter). Notably, I don't think that saving CPU is as
>> relevant to TXDP, in fact we have already seen that CPU utilization
>> can be traded off for lower latency via spin polling. Similar to XDP
>> though, we might assume that single CPU performance is relevant (i.e.
>> on a cache server we'd like to spin as few CPUs as needed and no more
>> to handle the load an maintain throughput and latency requirements).
>> High throughput (ops/sec) and low variance should be side effects of
>> any design.
>
> I'm sending this with some hesitation (esp as the flamebait threads
> are starting up - I have no interest in getting into food-fights!!),
> because it sounds like the HPC/request-response use-case you have in mind
> (HTTP based?) is very likely different than the one the DB use-cases in
> my environment (RDBMS, Cluster req/responses). But to provide some
> perspective from the latter use-case..
>
> We also have request-response transactions, but CPU utilization
> is extremely critical- many DB operations are highly CPU bound,
> so it's not acceptable for the network to hog CPU util by polling.
> In that sense, the DB req/resp model has a lot of overlap with the
> Suricata use-case.
>
Hi Sowmini,

Polling does not necessarily imply that networking monopolizes the CPU
except when the CPU is otherwise idle. Presumably the application
drives the polling when it is ready to receive work.

> Also we need a select()able socket, because we have to deal with
> input from several sources- network I/O, but also disk, and
> file-system I/O. So need to make sure there is no starvation,
> and that we multiplex between  I/O sources efficiently
>
Yes, that is a requirement.

> and one other critical difference from the hot-potato-forwarding
> model (the sort of OVS model that DPDK etc might aruguably be a fit for)
> does not apply: in order to figure out the ethernet and IP headers
> in the response correctly at all times (in the face of things like VRRP,
> gw changes, gw's mac addr changes etc) the application should really
> be listening on NETLINK sockets for modifications to the networking
> state - again points to needing a select() socket set where you can
> have both the I/O fds and the netlink socket,
>
I would think that that is management would not be implemented in a
fast path processing thread for an application.

> For all of these reasons, we are investigating approaches similar ot
> Suricata- PF_PACKET with TPACKETV2 (since we need both Tx and Rx,
> and so far, tpacketv2 seems "good enough"). FWIW, we also took
> a look at netmap and so far have not seen any significant benefits
> to netmap over pf_packet.. investigation still ongoing.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>
> I'm curious- one thing that came out of the IPsec evaluation
> is that TSO is very valuable for performance, and this is most easily
> accessed via the sk_buff interfaces.  I have not had a chance
> to review your patches yet, but isnt that an issue if you bypass
> sk_buff usage? But I should probably go and review your patchset..
>
The *SOs are always an interesting question. They make for great
benchmarks, but in real life the amount of benefit is somewhat
unclear. Under the wrong conditions, like all cwnds have collapsed or
received packets for flows are small or so mixed that we can't get
much aggregation, SO provides no benefit and in fact becomes
overhead. Relying on any amount of segmentation offload in real
deployment is risky; for instance we've seen some video servers
deployed that were able to serve line rate at 90% CPU in testing (SO
was effective) but ended up needing 110% CPU in deployment when a
hiccup caused all cwnds to collapse. Moral of the story is provision
your servers assuming the worse case conditions that would render
opportunistic offloads unless.

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.

Tom

> --Sowmini

^ permalink raw reply

* Re: [PATCH] netfilter: avoid warn and OOM on vmalloc call
From: Marcelo Ricardo Leitner @ 2016-12-01 19:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Florian Westphal, Neil Horman, netdev, netfilter-devel, LKML
In-Reply-To: <CAAeHK+wjP=h_4YxB6VUc+FjKcZi9igmyTs3nPAuUJeNomYSA0w@mail.gmail.com>

On Thu, Dec 01, 2016 at 10:42:22AM +0100, Andrey Konovalov wrote:
> On Wed, Nov 30, 2016 at 8:42 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > Hi Andrey,
> >
> > Please let me know how this works for you. It seems good here, though
> > your poc may still trigger OOM through other means.
> 
> Hi Marcelo,
> 
> Don't see any reports with this patch.
> 
> Thanks!

Thanks Andrey.
I'll post a v2 after a few more tests here and to s/OOM/OOM killer/ in
most of the changelog.

> 
> >
> > Thanks,
> > Marcelo
> >
> > ---8<---
> >
> > Andrey Konovalov reported that this vmalloc call is based on an
> > userspace request and that it's spewing traces, which may flood the logs
> > and cause DoS if abused.
> >
> > Florian Westphal also mentioned that this call should not trigger OOM,
> > as kmalloc one is already not triggering it.
> >
> > This patch brings the vmalloc call in sync to kmalloc and disables the
> > warn trace on allocation failure and also disable OOM invocation.
> >
> > Note, however, that under such stress situation, other places may
> > trigger OOM invocation.
> >
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Cc: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/netfilter/x_tables.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> >         if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >                 info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> >         if (!info) {
> > -               info = vmalloc(sz);
> > +               info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> > +                                    __GFP_NORETRY | __GFP_HIGHMEM,
> > +                                PAGE_KERNEL);
> >                 if (!info)
> >                         return NULL;
> >         }
> > --
> > 2.9.3
> >
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox