Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Hannes Frederic Sowa @ 2016-09-21 16:49 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Sowmini Varadhan, netdev, jbenc, davem
In-Reply-To: <20160921161457.GA17116@ast-mbp.thefacebook.com>

On 21.09.2016 18:14, Alexei Starovoitov wrote:
> On Wed, Sep 21, 2016 at 12:10:55PM +0200, Hannes Frederic Sowa wrote:
>> On 20.09.2016 20:57, Sowmini Varadhan wrote:
>>> The vxlan header may not be aligned to 4 bytes in
>>> vxlan_build_skb (e.g., for MLD packets). This patch
>>> avoids unaligned access traps from vxlan_build_skb
>>> (in platforms like sparc) by making struct vxlanhdr __packed.
>>>
>>> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>
>> Performance wise this should only affect code generation for archs where
>> it matters anyway.
> 
> I think it's the opposite. Even on x86 compiler will use byte loads.

I checked that on x86-64 actually. Also clearly visible here:

https://godbolt.org/g/xsW2P1

Bye,
Hannes

^ permalink raw reply

* Re: [ANNOUNCE] ndiv: line-rate network traffic processing
From: Jesper Dangaard Brouer @ 2016-09-21 16:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: brouer, netdev, Tom Herbert, Alexei Starovoitov, Brenden Blanco,
	Tariq Toukan
In-Reply-To: <20160921112852.GA991@1wt.eu>


On Wed, 21 Sep 2016 13:28:52 +0200 Willy Tarreau <w@1wt.eu> wrote:

> Over the last 3 years I've been working a bit on high traffic processing
> for various reasons. It started with the wish to capture line-rate GigE
> traffic on very small fanless ARM machines and the framework has evolved
> to be used at my company as a basis for our anti-DDoS engine capable of
> dealing with multiple 10G links saturated with floods.
> 
> I know it comes a bit late now that there is XDP, but it's my first
> vacation since then and I needed to have a bit of calm time to collect
> the patches from the various branches and put them together. Anyway I'm
> sending this here in case it can be of interest to anyone, for use or
> just to study it.

I definitely want to study it!

You mention XDP.  If you didn't notice, I've created some documentation
on XDP (it is very "live" documentation at this point and it will
hopefully "materialized" later in the process).  But it should be a good
starting point for understanding XDP:

 https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html


> I presented it in 2014 at kernel recipes :
>   http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/

Cool, and it even have a video!
 
> It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> very light, and retrieves the packets in the NIC's driver before they
> are converted to an skb, then submits them to a registered RX handler
> running in softirq context so we have the best of all worlds by
> benefitting from CPU scalability, delayed processing, and not paying
> the cost of switching to userland. Also an rx_done() function allows
> handlers to batch their processing. 

Wow - it does sound a lot like XDP!  I would say that is sort of
validate the current direction of XDP, and that there are real
use-cases for this stuff.

> The RX handler returns an action
> among accepting the packet as-is, accepting it modified (eg: vlan or
> tunnel decapsulation), dropping it, postponing the processing
> (equivalent to EAGAIN), or building a new packet to send back.

I'll be very interested in studying in-details how you implemented and
choose what actions to implement.

What was the need for postponing the processing (EAGAIN)?

 
> This last function is the one requiring the most changes in existing
> drivers, but offers the widest range of possibilities. We use it to
> send SYN cookies, but I have also implemented a stateless HTTP server
> supporting keep-alive using it, achieving line-rate traffic processing
> on a single CPU core when the NIC supports it. It's very convenient to
> test various stateful TCP components as it's easy to sustain millions
> of connections per second on it.

Interesting, and controversial use-case.  One controversial use-case
for XDP, that I imagine was implementing a DNS accelerator, what
answers simple and frequent requests.  You took it a step further with
a HTTP server!


> It does not support forwarding between NICs. It was my first goal
> because I wanted to implement a TAP with it, bridging the traffic
> between two ports, but figured it was adding some complexity to the
> system back then. 

With all the XDP features at the moment, we have avoided going through
the page allocator, by relying on different page recycling tricks.

When doing forwarding between NICs is it harder to do these page
recycling tricks.  I've measured that page allocators fast-path
("recycling" same page) cost approx 270 cycles, and the 14Mpps cycle
count on this 4GHz CPU is 268 cycles.  Thus, it is a non-starter...

Did you have to modify the page allocator?
Or implement some kind of recycling?

> However since then we've implemented traffic
> capture in our product, exploiting this framework to capture without
> losses at 14 Mpps. I may find some time to try to extract it later.
> It uses the /sys API so that you can simply plug tcpdump -r on a
> file there, though there's also an mmap version which uses less CPU
> (that's important at 10G).

Interesting.  I do see a XDP use-case for RAW packet capture, but I've
postponed that work until later.  I would interested in how you solved
it?  E.g. Do you support zero-copy?


> In its current form since the initial code's intent was to limit
> core changes, it happens not to modify anything in the kernel by
> default and to reuse the net_device's ax25_ptr to attach devices
> (idea borrowed from netmap), so it can be used on an existing
> kernel just by loading the patched network drivers (yes, I know
> it's not a valid solution for the long term).
> 
> The current code is available here :
> 
>   http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/

I was just about to complain that the link was broken... but it fixed
itself while writing this email ;-)

Can you instead explain what branch to look at?

> 
> Please let me know if there could be some interest in rebasing it
> on more recent versions (currently 3.10, 3.14 and 4.4 are supported).

What, no support for 2.4 ;-)

> I don't have much time to assign to it since it works fine as-is,
> but will be glad to do so if that can be useful.
> 
> Also the stateless HTTP server provided in it definitely is a nice
> use case for testing such a framework.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH net] act_ife: Add support for machines with hard_header_len != mac_len
From: Yotam Gigi @ 2016-09-21 12:54 UTC (permalink / raw)
  To: jhs, davem, netdev; +Cc: Yotam Gigi

Without that fix, the following could occur:
 - On encode ingress, the total amount of skb_pushes (in lines 751 and
   753) was more than specified in cow.
 - On machines with hard_header_len > mac_len, the packet format was not
   according to the ife specifications, as the place reserved at the
   beginning of the packet is for hard_header_len, but only mac_len bytes
   get initialized in it. Acting upon simple ping packet, The following
   tc commands:

   tc qdisc add dev sw1p5 handle 1: root prio
   tc filter add dev sw1p5 parent 1: protocol ip \
		matchall skip_hw \
		action ife encode type 0xdead use mark 0x12

   when netdev sw1p5 has hard_header_len of 30, created the following
   packet:

   0x0000:  e41d 2da5 f1d3 e41d 2d46 ffb5 dead 0000  ..-.....-F......
   0x0010:  0000 0000 0000 0000 0000 0000 0000 000a  ................
   0x0020:  0001 0004 0000 0012 e41d 2da5 f1d3 e41d  ..........-.....
   0x0031:  2d46 ffb5 0800 4500 0054 55e9 4000 4001  -F....E..TU.@.@.
   0x0040:  ceba 0b00 0005 0b00 0001 0800 d2ea 63b7  ..............c.
   0x0050:  0002 a360 e257 0000 0000 7ad0 0200 0000  ...`.W....z.....
   0x0060:  0000 1011 1213 1415 1617 1819 1a1b 1c1d  ................
   0x0070:  1e1f 2021 2223 2425 2627 2829 2a2b 2c2d  ...!"#$%&'()*+,-
   0x0080:  2e2f 3031 3233 3435 3637                 ./01234567

   and it can be seen that bytes 0x0e to 0x01e are not initialized and
   contain random memory data, where bytes 0x01e to 0x028 contain the ife
   header.

To fix those problems, add the total_push and reserve variables, which
indicates the exact amount of pushes needed and the exact amount of
headroom the packet should have. Thus, it is possible to take care of the
cases:
 - on ingress, the mac header must be pushed back as the ingress parser
   already parses the mac header and pulls it
 - on egress, the code should reserve hard_header_len space extra in
   headroom for driver to put the rest of the headers

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
---
 net/sched/act_ife.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index e87cd81..27b19ca 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -708,11 +708,13 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	   where ORIGDATA = original ethernet header ...
 	 */
 	u16 metalen = ife_get_sz(skb, ife);
-	int hdrm = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
-	unsigned int skboff = skb->dev->hard_header_len;
 	u32 at = G_TC_AT(skb->tc_verd);
-	int new_len = skb->len + hdrm;
 	bool exceed_mtu = false;
+	unsigned int skboff;
+	int total_push;
+	int reserve;
+	int new_len;
+	int hdrm;
 	int err;
 
 	if (at & AT_EGRESS) {
@@ -724,6 +726,22 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	bstats_update(&ife->tcf_bstats, skb);
 	tcf_lastuse_update(&ife->tcf_tm);
 
+	if (at & AT_EGRESS) {
+		/* on egress, reserve space for hard_header_len instead of
+		 * mac_len
+		 */
+		skb_reset_mac_len(skb);
+		hdrm = metalen + skb->mac_len + IFE_METAHDRLEN;
+		total_push = hdrm;
+		reserve = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
+	} else {
+		/* on ingress, push mac_len as it already get parsed from tc */
+		hdrm = metalen + skb->mac_len + IFE_METAHDRLEN;
+		total_push = hdrm + skb->mac_len;
+		reserve = total_push;
+	}
+	new_len =  skb->len + hdrm;
+
 	if (!metalen) {		/* no metadata to send */
 		/* abuse overlimits to count when we allow packet
 		 * with no metadata
@@ -742,19 +760,17 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 
 	iethh = eth_hdr(skb);
 
-	err = skb_cow_head(skb, hdrm);
+	err = skb_cow_head(skb, reserve);
 	if (unlikely(err)) {
 		ife->tcf_qstats.drops++;
 		spin_unlock(&ife->tcf_lock);
 		return TC_ACT_SHOT;
 	}
 
-	if (!(at & AT_EGRESS))
-		skb_push(skb, skb->dev->hard_header_len);
-
-	__skb_push(skb, hdrm);
+	__skb_push(skb, total_push);
 	memcpy(skb->data, iethh, skb->mac_len);
 	skb_reset_mac_header(skb);
+	skboff += skb->mac_len;
 	oethh = eth_hdr(skb);
 
 	/*total metadata length */
@@ -792,7 +808,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	oethh->h_proto = htons(ife->eth_type);
 
 	if (!(at & AT_EGRESS))
-		skb_pull(skb, skb->dev->hard_header_len);
+		skb_pull(skb, skb->mac_len);
 
 	spin_unlock(&ife->tcf_lock);
 
-- 
2.4.11

^ permalink raw reply related

* Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show
From: hejianet @ 2016-09-21 16:18 UTC (permalink / raw)
  To: Marcelo
  Cc: netdev, linux-sctp, linux-kernel, davem, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Vlad Yasevich,
	Neil Horman, Steffen Klassert, Herbert Xu
In-Reply-To: <20160914115543.GF17689@localhost.localdomain>

Hi Marcelo

sorry for the late, just came back from a vacation.

On 9/14/16 7:55 PM, Marcelo wrote:
> Hi Jia,
>
> On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:
>> Hi Marcelo
>>
>>
>> On 9/13/16 2:57 AM, Marcelo wrote:
>>> On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
>>>> This is to use the generic interface snmp_get_cpu_field{,64}_batch to
>>>> aggregate the data by going through all the items of each cpu sequentially.
>>>> Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
>>>> warning "the frame size" larger than 1024 on s390.
>>> Yeah about that, did you test it with stack overflow detection?
>>> These arrays can be quite large.
>>>
>>> One more below..
>> Do you think it is acceptable if the stack usage is a little larger than 1024?
>> e.g. 1120
>> I can't find any other way to reduce the stack usage except use "static" before
>> unsigned long buff[TCP_MIB_MAX]
>>
>> PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
>> B.R.
> That's pretty much the question. Linux has the option on some archs to
> run with 4Kb (4KSTACKS option), so this function alone would be using
> 25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
> ("x86_64: expand kernel stack to 16K")).
>
> Adding static to it is not an option as it actually makes the variable
> shared amongst the CPUs (and then you have concurrency issues), plus the
> fact that it's always allocated, even while not in use.
>
> Others here certainly know better than me if it's okay to make such
> usage of the stach.
What about this patch instead?
It is a trade-off. I split the aggregation process into 2 parts, it will
increase the cache miss a little bit, but it can reduce the stack usage.
After this, stack usage is 672bytes
objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
0xc0000000007f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index c6ee8a2..cc41590 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
   */
  static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
  {
-       int i;
-       unsigned long buff[LINUX_MIB_MAX];
+       int i, c;
+       unsigned long buff[LINUX_MIB_MAX/2 + 1];
         struct net *net = seq->private;

-       memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
+       memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));

         seq_puts(seq, "TcpExt:");
         for (i = 0; snmp4_net_list[i].name; i++)
                 seq_printf(seq, " %s", snmp4_net_list[i].name);

         seq_puts(seq, "\nTcpExt:");
-       snmp_get_cpu_field_batch(buff, snmp4_net_list,
-                                net->mib.net_statistics);
-       for (i = 0; snmp4_net_list[i].name; i++)
+       for_each_possible_cpu(c) {
+               for (i = 0; i < LINUX_MIB_MAX/2; i++)
+                       buff[i] += snmp_get_cpu_field(
+ net->mib.net_statistics,
+                                               c, snmp4_net_list[i].entry);
+       }
+       for (i = 0; i < LINUX_MIB_MAX/2; i++)
                 seq_printf(seq, " %lu", buff[i]);

+       memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
+       for_each_possible_cpu(c) {
+               for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+                       buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
+                               net->mib.net_statistics,
+                               c,
+                               snmp4_net_list[i].entry);
+       }
+        for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+                seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
+
         return 0;
  }

>>>> +static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
>>>> +{
>>>> +	int i;
>>>> +	u64 buff64[IPSTATS_MIB_MAX];
>>>> +	struct net *net = seq->private;
>>>>    	seq_puts(seq, "\nIpExt:");
>>>>    	for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
>>>>    		seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
>>>>    	seq_puts(seq, "\nIpExt:");
>>> You're missing a memset() call here.
> Not sure if you missed this one or not..
indeed, thanks
B.R.
Jia
> Thanks,
> Marcelo
>

^ permalink raw reply related

* Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Alexei Starovoitov @ 2016-09-21 16:14 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Sowmini Varadhan, netdev, jbenc, davem
In-Reply-To: <4617aca3-8d44-e99a-cc5a-1088b30ae327@stressinduktion.org>

On Wed, Sep 21, 2016 at 12:10:55PM +0200, Hannes Frederic Sowa wrote:
> On 20.09.2016 20:57, Sowmini Varadhan wrote:
> > The vxlan header may not be aligned to 4 bytes in
> > vxlan_build_skb (e.g., for MLD packets). This patch
> > avoids unaligned access traps from vxlan_build_skb
> > (in platforms like sparc) by making struct vxlanhdr __packed.
> > 
> > Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> 
> Performance wise this should only affect code generation for archs where
> it matters anyway.

I think it's the opposite. Even on x86 compiler will use byte loads.

^ permalink raw reply

* [PATCH iproute2] ss: Support displaying and filtering on socket marks.
From: Lorenzo Colitti @ 2016-09-21 16:02 UTC (permalink / raw)
  To: netdev; +Cc: shemming, zenczykowski, Lorenzo Colitti

This allows the user to dump sockets with a given mark (via
"fwmark = 0x1234/0x1234" or "fwmark = 12345", etc.) , and to
display the socket marks of dumped sockets.

The relevant kernel commits are: d545caca827b ("net: inet: diag:
expose the socket mark to privileged processes.") and
- a52e95abf772 ("net: diag: allow socket bytecode filters to
match socket marks")

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 include/linux/inet_diag.h |  7 +++++++
 misc/ss.c                 | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 misc/ssfilter.h           |  2 ++
 misc/ssfilter.y           | 23 +++++++++++++++++++--
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index beb74ee..016de88 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -73,6 +73,7 @@ enum {
 	INET_DIAG_BC_S_COND,
 	INET_DIAG_BC_D_COND,
 	INET_DIAG_BC_DEV_COND,   /* u32 ifindex */
+	INET_DIAG_BC_MARK_COND,
 };
 
 struct inet_diag_hostcond {
@@ -82,6 +83,11 @@ struct inet_diag_hostcond {
 	__be32	addr[0];
 };
 
+struct inet_diag_markcond {
+	__u32 mark;
+	__u32 mask;
+};
+
 /* Base info structure. It contains socket identity (addrs/ports/cookie)
  * and, alas, the information shown by netstat. */
 struct inet_diag_msg {
@@ -117,6 +123,7 @@ enum {
 	INET_DIAG_LOCALS,
 	INET_DIAG_PEERS,
 	INET_DIAG_PAD,
+	INET_DIAG_MARK,
 	__INET_DIAG_MAX,
 };
 
diff --git a/misc/ss.c b/misc/ss.c
index 3b268d9..83fb01f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -737,6 +737,7 @@ struct sockstat {
 	unsigned long long  sk;
 	char *name;
 	char *peer_name;
+	__u32		    mark;
 };
 
 struct dctcpstat {
@@ -807,6 +808,9 @@ static void sock_details_print(struct sockstat *s)
 
 	printf(" ino:%u", s->ino);
 	printf(" sk:%llx", s->sk);
+
+	if (s->mark)
+		printf(" fwmark:0x%x", s->mark);
 }
 
 static void sock_addr_print_width(int addr_len, const char *addr, char *delim,
@@ -1046,6 +1050,8 @@ struct aafilter {
 	inet_prefix	addr;
 	int		port;
 	unsigned int	iface;
+	__u32		mark;
+	__u32		mask;
 	struct aafilter *next;
 };
 
@@ -1166,6 +1172,12 @@ static int run_ssfilter(struct ssfilter *f, struct sockstat *s)
 
 		return s->iface == a->iface;
 	}
+		case SSF_MARKMASK:
+	{
+		struct aafilter *a = (void *)f->pred;
+
+		return (s->mark & a->mask) == a->mark;
+	}
 		/* Yup. It is recursion. Sorry. */
 		case SSF_AND:
 		return run_ssfilter(f->pred, s) && run_ssfilter(f->post, s);
@@ -1341,6 +1353,23 @@ static int ssfilter_bytecompile(struct ssfilter *f, char **bytecode)
 		/* bytecompile for SSF_DEVCOND not supported yet */
 		return 0;
 	}
+		case SSF_MARKMASK:
+	{
+		struct aafilter *a = (void *)f->pred;
+		struct instr {
+			struct inet_diag_bc_op op;
+			struct inet_diag_markcond cond;
+		};
+		int inslen = sizeof(struct instr);
+
+		if (!(*bytecode = malloc(inslen))) abort();
+		((struct instr *)*bytecode)[0] = (struct instr) {
+			{ INET_DIAG_BC_MARK_COND, inslen, inslen + 4 },
+			{ a->mark, a->mask},
+		};
+
+		return inslen;
+	}
 		default:
 		abort();
 	}
@@ -1620,6 +1649,25 @@ out:
 	return res;
 }
 
+void *parse_markmask(const char *markmask)
+{
+	struct aafilter a, *res;
+
+	if (strchr(markmask, '/')) {
+		if (sscanf(markmask, "%i/%i", &a.mark, &a.mask) != 2)
+			return NULL;
+	} else {
+		a.mask = 0xffffffff;
+		if (sscanf(markmask, "%i", &a.mark) != 1)
+			return NULL;
+	}
+
+	res = malloc(sizeof(*res));
+	if (res)
+		memcpy(res, &a, sizeof(a));
+	return res;
+}
+
 static char *proto_name(int protocol)
 {
 	switch (protocol) {
@@ -2107,6 +2155,10 @@ static void parse_diag_msg(struct nlmsghdr *nlh, struct sockstat *s)
 	s->iface	= r->id.idiag_if;
 	s->sk		= cookie_sk_get(&r->id.idiag_cookie[0]);
 
+	s->mark = 0;
+	if (tb[INET_DIAG_MARK])
+		s->mark = *(__u32 *) RTA_DATA(tb[INET_DIAG_MARK]);
+
 	if (s->local.family == AF_INET)
 		s->local.bytelen = s->remote.bytelen = 4;
 	else
diff --git a/misc/ssfilter.h b/misc/ssfilter.h
index c7db8ee..dfc5b93 100644
--- a/misc/ssfilter.h
+++ b/misc/ssfilter.h
@@ -9,6 +9,7 @@
 #define SSF_S_LE  8
 #define SSF_S_AUTO  9
 #define SSF_DEVCOND 10
+#define SSF_MARKMASK 11
 
 #include <stdbool.h>
 
@@ -22,3 +23,4 @@ struct ssfilter
 int ssfilter_parse(struct ssfilter **f, int argc, char **argv, FILE *fp);
 void *parse_hostcond(char *addr, bool is_port);
 void *parse_devcond(char *name);
+void *parse_markmask(const char *markmask);
diff --git a/misc/ssfilter.y b/misc/ssfilter.y
index 14bf981..ba82b65 100644
--- a/misc/ssfilter.y
+++ b/misc/ssfilter.y
@@ -36,7 +36,7 @@ static void yyerror(char *s)
 
 %}
 
-%token HOSTCOND DCOND SCOND DPORT SPORT LEQ GEQ NEQ AUTOBOUND DEVCOND DEVNAME
+%token HOSTCOND DCOND SCOND DPORT SPORT LEQ GEQ NEQ AUTOBOUND DEVCOND DEVNAME MARKMASK FWMARK
 %left '|'
 %left '&'
 %nonassoc '!'
@@ -116,7 +116,14 @@ expr:	DCOND HOSTCOND
         {
 		$$ = alloc_node(SSF_NOT, alloc_node(SSF_DEVCOND, $3));
         }
-
+        | FWMARK '=' MARKMASK
+        {
+                $$ = alloc_node(SSF_MARKMASK, $3);
+        }
+        | FWMARK NEQ MARKMASK
+        {
+                $$ = alloc_node(SSF_NOT, alloc_node(SSF_MARKMASK, $3));
+        }
         | AUTOBOUND
         {
                 $$ = alloc_node(SSF_S_AUTO, NULL);
@@ -249,6 +256,10 @@ int yylex(void)
 		tok_type = DEVNAME;
 		return DEVNAME;
 	}
+	if (strcmp(curtok, "fwmark") == 0) {
+		tok_type = FWMARK;
+		return FWMARK;
+	}
 	if (strcmp(curtok, ">=") == 0 ||
 	    strcmp(curtok, "ge") == 0 ||
 	    strcmp(curtok, "geq") == 0)
@@ -283,6 +294,14 @@ int yylex(void)
 		}
 		return DEVCOND;
 	}
+	if (tok_type == FWMARK) {
+		yylval = (void*)parse_markmask(curtok);
+		if (yylval == NULL) {
+			fprintf(stderr, "Cannot parse mark %s.\n", curtok);
+			exit(1);
+		}
+		return MARKMASK;
+	}
 	yylval = (void*)parse_hostcond(curtok, tok_type == SPORT || tok_type == DPORT);
 	if (yylval == NULL) {
 		fprintf(stderr, "Cannot parse dst/src address.\n");
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [RFC v2 06/12] qedr: Add support for QP verbs
From: Jason Gunthorpe @ 2016-09-21 15:55 UTC (permalink / raw)
  To: Amrani, Ram
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Elior, Ariel,
	Kalderon, Michal, Mintz, Yuval, Borundia, Rajesh,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <SN1PR07MB2207B94372F7A910C0E6AF0AF8F60-mikhvbZlbf8TSoR2DauN2+FPX92sqiQdvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

On Wed, Sep 21, 2016 at 02:23:46PM +0000, Amrani, Ram wrote:
> > Ugh, each patch keeps adding to this?
> 
> The logic in the patch series is to have each patch contain only
> what is necessary for the specific functionality it adds.  This made
> it harder on us to prepare but, IMHO, easier for the reviewer.  If
> you'd like to have this file in one chunk, I can do this. What do
> you prefer?

I wouldn't change anything at this point, but I'm not sure it is
easier to review like this than the one patch per file scheme.

Do you have a git tree?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] netfilter: replace list_head with single linked list
From: Aaron Conole @ 2016-09-21 15:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-8-git-send-email-aconole@bytheb.org>

Aaron Conole <aconole@bytheb.org> writes:

> The netfilter hook list never uses the prev pointer, and so can be trimmed to
> be a simple singly-linked list.
>
> In addition to having a more light weight structure for hook traversal,
> struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
> 2176 bytes (down from 2240).
>
> Signed-off-by: Aaron Conole <aconole@bytheb.org>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Apologies, all.  This subject prefix is incorrect.  It should be:
[PATCH nf-next v3 7/7]

Should I repost?

> ---
>  include/linux/netdevice.h         |   2 +-
>  include/linux/netfilter.h         |  61 +++++++++--------
>  include/linux/netfilter_ingress.h |  16 +++--
>  include/net/netfilter/nf_queue.h  |   3 +-
>  include/net/netns/netfilter.h     |   2 +-
>  net/bridge/br_netfilter_hooks.c   |  19 ++---
>  net/netfilter/core.c              | 141 +++++++++++++++++++++++++-------------
>  net/netfilter/nf_internals.h      |  10 +--
>  net/netfilter/nf_queue.c          |  18 ++---
>  net/netfilter/nfnetlink_queue.c   |   8 ++-
>  10 files changed, 165 insertions(+), 115 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 67bb978..41f49f5 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1783,7 +1783,7 @@ struct net_device {
>  #endif
>  	struct netdev_queue __rcu *ingress_queue;
>  #ifdef CONFIG_NETFILTER_INGRESS
> -	struct list_head	nf_hooks_ingress;
> +	struct nf_hook_entry __rcu *nf_hooks_ingress;
>  #endif
>  
>  	unsigned char		broadcast[MAX_ADDR_LEN];
> diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
> index ad444f0..17c90b0 100644
> --- a/include/linux/netfilter.h
> +++ b/include/linux/netfilter.h
> @@ -55,12 +55,34 @@ struct nf_hook_state {
>  	struct net_device *out;
>  	struct sock *sk;
>  	struct net *net;
> -	struct list_head *hook_list;
> +	struct nf_hook_entry __rcu *hook_entries;
>  	int (*okfn)(struct net *, struct sock *, struct sk_buff *);
>  };
>  
> +typedef unsigned int nf_hookfn(void *priv,
> +			       struct sk_buff *skb,
> +			       const struct nf_hook_state *state);
> +struct nf_hook_ops {
> +	struct list_head	list;
> +
> +	/* User fills in from here down. */
> +	nf_hookfn		*hook;
> +	struct net_device	*dev;
> +	void			*priv;
> +	u_int8_t		pf;
> +	unsigned int		hooknum;
> +	/* Hooks are ordered in ascending priority. */
> +	int			priority;
> +};
> +
> +struct nf_hook_entry {
> +	struct nf_hook_entry __rcu	*next;
> +	struct nf_hook_ops		ops;
> +	const struct nf_hook_ops	*orig_ops;
> +};
> +
>  static inline void nf_hook_state_init(struct nf_hook_state *p,
> -				      struct list_head *hook_list,
> +				      struct nf_hook_entry *hook_entry,
>  				      unsigned int hook,
>  				      int thresh, u_int8_t pf,
>  				      struct net_device *indev,
> @@ -76,26 +98,11 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
>  	p->out = outdev;
>  	p->sk = sk;
>  	p->net = net;
> -	p->hook_list = hook_list;
> +	RCU_INIT_POINTER(p->hook_entries, hook_entry);
>  	p->okfn = okfn;
>  }
>  
> -typedef unsigned int nf_hookfn(void *priv,
> -			       struct sk_buff *skb,
> -			       const struct nf_hook_state *state);
>  
> -struct nf_hook_ops {
> -	struct list_head 	list;
> -
> -	/* User fills in from here down. */
> -	nf_hookfn		*hook;
> -	struct net_device	*dev;
> -	void			*priv;
> -	u_int8_t		pf;
> -	unsigned int		hooknum;
> -	/* Hooks are ordered in ascending priority. */
> -	int			priority;
> -};
>  
>  struct nf_sockopt_ops {
>  	struct list_head list;
> @@ -161,7 +168,8 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
>  				 int (*okfn)(struct net *, struct sock *, struct sk_buff *),
>  				 int thresh)
>  {
> -	struct list_head *hook_list;
> +	struct nf_hook_entry *hook_head;
> +	int ret = 1;
>  
>  #ifdef HAVE_JUMP_LABEL
>  	if (__builtin_constant_p(pf) &&
> @@ -170,22 +178,19 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
>  		return 1;
>  #endif
>  
> -	hook_list = &net->nf.hooks[pf][hook];
> +	rcu_read_lock();
>  
> -	if (!list_empty(hook_list)) {
> +	hook_head = rcu_dereference(net->nf.hooks[pf][hook]);
> +	if (hook_head) {
>  		struct nf_hook_state state;
> -		int ret;
>  
> -		/* We may already have this, but read-locks nest anyway */
> -		rcu_read_lock();
> -		nf_hook_state_init(&state, hook_list, hook, thresh,
> +		nf_hook_state_init(&state, hook_head, hook, thresh,
>  				   pf, indev, outdev, sk, net, okfn);
>  
>  		ret = nf_hook_slow(skb, &state);
> -		rcu_read_unlock();
> -		return ret;
>  	}
> -	return 1;
> +	rcu_read_unlock();
> +	return ret;
>  }
>  
>  static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
> diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
> index 6965ba0..b02fd80 100644
> --- a/include/linux/netfilter_ingress.h
> +++ b/include/linux/netfilter_ingress.h
> @@ -11,23 +11,29 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
>  	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
>  		return false;
>  #endif
> -	return !list_empty(&skb->dev->nf_hooks_ingress);
> +	return rcu_access_pointer(skb->dev->nf_hooks_ingress);
>  }
>  
>  /* caller must hold rcu_read_lock */
>  static inline int nf_hook_ingress(struct sk_buff *skb)
>  {
> +	struct nf_hook_entry *e = rcu_dereference(skb->dev->nf_hooks_ingress);
>  	struct nf_hook_state state;
>  
> -	nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
> -			   NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV,
> -			   skb->dev, NULL, NULL, dev_net(skb->dev), NULL);
> +	/* must recheck the ingress hook head, in the event it became NULL
> +	 * after the check in nf_hook_ingress_active evaluated to true. */
> +	if (unlikely(!e))
> +		return 0;
> +
> +	nf_hook_state_init(&state, e, NF_NETDEV_INGRESS, INT_MIN,
> +			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
> +			   dev_net(skb->dev), NULL);
>  	return nf_hook_slow(skb, &state);
>  }
>  
>  static inline void nf_hook_ingress_init(struct net_device *dev)
>  {
> -	INIT_LIST_HEAD(&dev->nf_hooks_ingress);
> +	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
>  }
>  #else /* CONFIG_NETFILTER_INGRESS */
>  static inline int nf_hook_ingress_active(struct sk_buff *skb)
> diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
> index 66b3f3f..ef4f75d 100644
> --- a/include/net/netfilter/nf_queue.h
> +++ b/include/net/netfilter/nf_queue.h
> @@ -11,7 +11,6 @@ struct nf_queue_entry {
>  	struct sk_buff		*skb;
>  	unsigned int		id;
>  
> -	struct nf_hook_ops	*elem;
>  	struct nf_hook_state	state;
>  	u16			size; /* sizeof(entry) + saved route keys */
>  
> @@ -25,7 +24,7 @@ struct nf_queue_handler {
>  	int		(*outfn)(struct nf_queue_entry *entry,
>  				 unsigned int queuenum);
>  	void		(*nf_hook_drop)(struct net *net,
> -					struct nf_hook_ops *ops);
> +					const struct nf_hook_entry *hooks);
>  };
>  
>  void nf_register_queue_handler(struct net *net, const struct nf_queue_handler *qh);
> diff --git a/include/net/netns/netfilter.h b/include/net/netns/netfilter.h
> index 36d7235..58487b1 100644
> --- a/include/net/netns/netfilter.h
> +++ b/include/net/netns/netfilter.h
> @@ -16,6 +16,6 @@ struct netns_nf {
>  #ifdef CONFIG_SYSCTL
>  	struct ctl_table_header *nf_log_dir_header;
>  #endif
> -	struct list_head hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
> +	struct nf_hook_entry __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
>  };
>  #endif
> diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
> index 6029af4..2fe9345 100644
> --- a/net/bridge/br_netfilter_hooks.c
> +++ b/net/bridge/br_netfilter_hooks.c
> @@ -1002,28 +1002,21 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net,
>  		      int (*okfn)(struct net *, struct sock *,
>  				  struct sk_buff *))
>  {
> -	struct nf_hook_ops *elem;
> +	struct nf_hook_entry *elem;
>  	struct nf_hook_state state;
> -	struct list_head *head;
>  	int ret;
>  
> -	head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
> +	elem = rcu_dereference(net->nf.hooks[NFPROTO_BRIDGE][hook]);
>  
> -	list_for_each_entry_rcu(elem, head, list) {
> -		struct nf_hook_ops *next;
> +	while (elem && (elem->ops.priority <= NF_BR_PRI_BRNF))
> +		elem = rcu_dereference(elem->next);
>  
> -		next = list_entry_rcu(list_next_rcu(&elem->list),
> -				      struct nf_hook_ops, list);
> -		if (next->priority <= NF_BR_PRI_BRNF)
> -			continue;
> -	}
> -
> -	if (&elem->list == head)
> +	if (!elem)
>  		return okfn(net, sk, skb);
>  
>  	/* We may already have this, but read-locks nest anyway */
>  	rcu_read_lock();
> -	nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
> +	nf_hook_state_init(&state, elem, hook, NF_BR_PRI_BRNF + 1,
>  			   NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
>  
>  	ret = nf_hook_slow(skb, &state);
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index 67b7428..360c63d 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -22,6 +22,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/mutex.h>
>  #include <linux/slab.h>
> +#include <linux/rcupdate.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  
> @@ -61,33 +62,50 @@ EXPORT_SYMBOL(nf_hooks_needed);
>  #endif
>  
>  static DEFINE_MUTEX(nf_hook_mutex);
> +#define nf_entry_dereference(e) \
> +	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
>  
> -static struct list_head *nf_find_hook_list(struct net *net,
> -					   const struct nf_hook_ops *reg)
> +static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
> +						const struct nf_hook_ops *reg)
>  {
> -	struct list_head *hook_list = NULL;
> +	struct nf_hook_entry *hook_head = NULL;
>  
>  	if (reg->pf != NFPROTO_NETDEV)
> -		hook_list = &net->nf.hooks[reg->pf][reg->hooknum];
> +		hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
> +						 [reg->hooknum]);
>  	else if (reg->hooknum == NF_NETDEV_INGRESS) {
>  #ifdef CONFIG_NETFILTER_INGRESS
>  		if (reg->dev && dev_net(reg->dev) == net)
> -			hook_list = &reg->dev->nf_hooks_ingress;
> +			hook_head =
> +				nf_entry_dereference(
> +					reg->dev->nf_hooks_ingress);
>  #endif
>  	}
> -	return hook_list;
> +	return hook_head;
>  }
>  
> -struct nf_hook_entry {
> -	const struct nf_hook_ops	*orig_ops;
> -	struct nf_hook_ops		ops;
> -};
> +/* must hold nf_hook_mutex */
> +static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
> +			      struct nf_hook_entry *entry)
> +{
> +	switch (reg->pf) {
> +	case NFPROTO_NETDEV:
> +		/* We already checked in nf_register_net_hook() that this is
> +		 * used from ingress.
> +		 */
> +		rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
> +		break;
> +	default:
> +		rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
> +				   entry);
> +		break;
> +	}
> +}
>  
>  int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  {
> -	struct list_head *hook_list;
> +	struct nf_hook_entry *hooks_entry;
>  	struct nf_hook_entry *entry;
> -	struct nf_hook_ops *elem;
>  
>  	if (reg->pf == NFPROTO_NETDEV &&
>  	    (reg->hooknum != NF_NETDEV_INGRESS ||
> @@ -100,19 +118,30 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  
>  	entry->orig_ops	= reg;
>  	entry->ops	= *reg;
> +	entry->next	= NULL;
> +
> +	mutex_lock(&nf_hook_mutex);
> +	hooks_entry = nf_hook_entry_head(net, reg);
>  
> -	hook_list = nf_find_hook_list(net, reg);
> -	if (!hook_list) {
> -		kfree(entry);
> -		return -ENOENT;
> +	if (hooks_entry && hooks_entry->orig_ops->priority > reg->priority) {
> +		/* This is the case where we need to insert at the head */
> +		entry->next = hooks_entry;
> +		hooks_entry = NULL;
>  	}
>  
> -	mutex_lock(&nf_hook_mutex);
> -	list_for_each_entry(elem, hook_list, list) {
> -		if (reg->priority < elem->priority)
> -			break;
> +	while (hooks_entry &&
> +	       reg->priority >= hooks_entry->orig_ops->priority &&
> +	       nf_entry_dereference(hooks_entry->next)) {
> +		hooks_entry = nf_entry_dereference(hooks_entry->next);
> +	}
> +
> +	if (hooks_entry) {
> +		entry->next = nf_entry_dereference(hooks_entry->next);
> +		rcu_assign_pointer(hooks_entry->next, entry);
> +	} else {
> +		nf_set_hooks_head(net, reg, entry);
>  	}
> -	list_add_rcu(&entry->ops.list, elem->list.prev);
> +
>  	mutex_unlock(&nf_hook_mutex);
>  #ifdef CONFIG_NETFILTER_INGRESS
>  	if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
> @@ -127,24 +156,33 @@ EXPORT_SYMBOL(nf_register_net_hook);
>  
>  void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  {
> -	struct list_head *hook_list;
> -	struct nf_hook_entry *entry;
> -	struct nf_hook_ops *elem;
> -
> -	hook_list = nf_find_hook_list(net, reg);
> -	if (!hook_list)
> -		return;
> +	struct nf_hook_entry *hooks_entry;
>  
>  	mutex_lock(&nf_hook_mutex);
> -	list_for_each_entry(elem, hook_list, list) {
> -		entry = container_of(elem, struct nf_hook_entry, ops);
> -		if (entry->orig_ops == reg) {
> -			list_del_rcu(&entry->ops.list);
> -			break;
> +	hooks_entry = nf_hook_entry_head(net, reg);
> +	if (hooks_entry->orig_ops == reg) {
> +		nf_set_hooks_head(net, reg,
> +				  nf_entry_dereference(hooks_entry->next));
> +		goto unlock;
> +	}
> +	while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
> +		struct nf_hook_entry *next =
> +			nf_entry_dereference(hooks_entry->next);
> +		struct nf_hook_entry *nnext;
> +
> +		if (next->orig_ops != reg) {
> +			hooks_entry = next;
> +			continue;
>  		}
> +		nnext = nf_entry_dereference(next->next);
> +		rcu_assign_pointer(hooks_entry->next, nnext);
> +		hooks_entry = next;
> +		break;
>  	}
> +
> +unlock:
>  	mutex_unlock(&nf_hook_mutex);
> -	if (&elem->list == hook_list) {
> +	if (!hooks_entry) {
>  		WARN(1, "nf_unregister_net_hook: hook not found!\n");
>  		return;
>  	}
> @@ -156,10 +194,10 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
>  #endif
>  	synchronize_net();
> -	nf_queue_nf_hook_drop(net, &entry->ops);
> +	nf_queue_nf_hook_drop(net, hooks_entry);
>  	/* other cpu might still process nfqueue verdict that used reg */
>  	synchronize_net();
> -	kfree(entry);
> +	kfree(hooks_entry);
>  }
>  EXPORT_SYMBOL(nf_unregister_net_hook);
>  
> @@ -258,10 +296,9 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
>  }
>  EXPORT_SYMBOL(nf_unregister_hooks);
>  
> -unsigned int nf_iterate(struct list_head *head,
> -			struct sk_buff *skb,
> +unsigned int nf_iterate(struct sk_buff *skb,
>  			struct nf_hook_state *state,
> -			struct nf_hook_ops **elemp)
> +			struct nf_hook_entry **entryp)
>  {
>  	unsigned int verdict;
>  
> @@ -269,20 +306,23 @@ unsigned int nf_iterate(struct list_head *head,
>  	 * The caller must not block between calls to this
>  	 * function because of risk of continuing from deleted element.
>  	 */
> -	list_for_each_entry_continue_rcu((*elemp), head, list) {
> -		if (state->thresh > (*elemp)->priority)
> +	while (*entryp) {
> +		if (state->thresh > (*entryp)->ops.priority) {
> +			*entryp = rcu_dereference((*entryp)->next);
>  			continue;
> +		}
>  
>  		/* Optimization: we don't need to hold module
>  		   reference here, since function can't sleep. --RR */
>  repeat:
> -		verdict = (*elemp)->hook((*elemp)->priv, skb, state);
> +		verdict = (*entryp)->ops.hook((*entryp)->ops.priv, skb, state);
>  		if (verdict != NF_ACCEPT) {
>  #ifdef CONFIG_NETFILTER_DEBUG
>  			if (unlikely((verdict & NF_VERDICT_MASK)
>  							> NF_MAX_VERDICT)) {
>  				NFDEBUG("Evil return from %p(%u).\n",
> -					(*elemp)->hook, state->hook);
> +					(*entryp)->ops.hook, state->hook);
> +				*entryp = rcu_dereference((*entryp)->next);
>  				continue;
>  			}
>  #endif
> @@ -290,6 +330,7 @@ repeat:
>  				return verdict;
>  			goto repeat;
>  		}
> +		*entryp = rcu_dereference((*entryp)->next);
>  	}
>  	return NF_ACCEPT;
>  }
> @@ -299,13 +340,13 @@ repeat:
>   * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
>  int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
>  {
> -	struct nf_hook_ops *elem;
> +	struct nf_hook_entry *entry;
>  	unsigned int verdict;
>  	int ret = 0;
>  
> -	elem = list_entry_rcu(state->hook_list, struct nf_hook_ops, list);
> +	entry = rcu_dereference(state->hook_entries);
>  next_hook:
> -	verdict = nf_iterate(state->hook_list, skb, state, &elem);
> +	verdict = nf_iterate(skb, state, &entry);
>  	if (verdict == NF_ACCEPT || verdict == NF_STOP) {
>  		ret = 1;
>  	} else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
> @@ -314,8 +355,10 @@ next_hook:
>  		if (ret == 0)
>  			ret = -EPERM;
>  	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
> -		int err = nf_queue(skb, elem, state,
> -				   verdict >> NF_VERDICT_QBITS);
> +		int err;
> +
> +		RCU_INIT_POINTER(state->hook_entries, entry);
> +		err = nf_queue(skb, state, verdict >> NF_VERDICT_QBITS);
>  		if (err < 0) {
>  			if (err == -ESRCH &&
>  			   (verdict & NF_VERDICT_FLAG_QUEUE_BYPASS))
> @@ -442,7 +485,7 @@ static int __net_init netfilter_net_init(struct net *net)
>  
>  	for (i = 0; i < ARRAY_SIZE(net->nf.hooks); i++) {
>  		for (h = 0; h < NF_MAX_HOOKS; h++)
> -			INIT_LIST_HEAD(&net->nf.hooks[i][h]);
> +			RCU_INIT_POINTER(net->nf.hooks[i][h], NULL);
>  	}
>  
>  #ifdef CONFIG_PROC_FS
> diff --git a/net/netfilter/nf_internals.h b/net/netfilter/nf_internals.h
> index 0655225..e0adb59 100644
> --- a/net/netfilter/nf_internals.h
> +++ b/net/netfilter/nf_internals.h
> @@ -13,13 +13,13 @@
>  
>  
>  /* core.c */
> -unsigned int nf_iterate(struct list_head *head, struct sk_buff *skb,
> -			struct nf_hook_state *state, struct nf_hook_ops **elemp);
> +unsigned int nf_iterate(struct sk_buff *skb, struct nf_hook_state *state,
> +			struct nf_hook_entry **entryp);
>  
>  /* nf_queue.c */
> -int nf_queue(struct sk_buff *skb, struct nf_hook_ops *elem,
> -	     struct nf_hook_state *state, unsigned int queuenum);
> -void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops);
> +int nf_queue(struct sk_buff *skb, struct nf_hook_state *state,
> +	     unsigned int queuenum);
> +void nf_queue_nf_hook_drop(struct net *net, const struct nf_hook_entry *entry);
>  int __init netfilter_queue_init(void);
>  
>  /* nf_log.c */
> diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
> index b19ad20..96964a0 100644
> --- a/net/netfilter/nf_queue.c
> +++ b/net/netfilter/nf_queue.c
> @@ -96,14 +96,14 @@ void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
>  }
>  EXPORT_SYMBOL_GPL(nf_queue_entry_get_refs);
>  
> -void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops)
> +void nf_queue_nf_hook_drop(struct net *net, const struct nf_hook_entry *entry)
>  {
>  	const struct nf_queue_handler *qh;
>  
>  	rcu_read_lock();
>  	qh = rcu_dereference(net->nf.queue_handler);
>  	if (qh)
> -		qh->nf_hook_drop(net, ops);
> +		qh->nf_hook_drop(net, entry);
>  	rcu_read_unlock();
>  }
>  
> @@ -112,7 +112,6 @@ void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops)
>   * through nf_reinject().
>   */
>  int nf_queue(struct sk_buff *skb,
> -	     struct nf_hook_ops *elem,
>  	     struct nf_hook_state *state,
>  	     unsigned int queuenum)
>  {
> @@ -141,7 +140,6 @@ int nf_queue(struct sk_buff *skb,
>  
>  	*entry = (struct nf_queue_entry) {
>  		.skb	= skb,
> -		.elem	= elem,
>  		.state	= *state,
>  		.size	= sizeof(*entry) + afinfo->route_key_size,
>  	};
> @@ -165,11 +163,15 @@ err:
>  
>  void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
>  {
> +	struct nf_hook_entry *hook_entry;
>  	struct sk_buff *skb = entry->skb;
> -	struct nf_hook_ops *elem = entry->elem;
>  	const struct nf_afinfo *afinfo;
> +	struct nf_hook_ops *elem;
>  	int err;
>  
> +	hook_entry = rcu_dereference(entry->state.hook_entries);
> +	elem = &hook_entry->ops;
> +
>  	nf_queue_entry_release_refs(entry);
>  
>  	/* Continue traversal iff userspace said ok... */
> @@ -186,8 +188,7 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
>  
>  	if (verdict == NF_ACCEPT) {
>  	next_hook:
> -		verdict = nf_iterate(entry->state.hook_list,
> -				     skb, &entry->state, &elem);
> +		verdict = nf_iterate(skb, &entry->state, &hook_entry);
>  	}
>  
>  	switch (verdict & NF_VERDICT_MASK) {
> @@ -198,7 +199,8 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
>  		local_bh_enable();
>  		break;
>  	case NF_QUEUE:
> -		err = nf_queue(skb, elem, &entry->state,
> +		RCU_INIT_POINTER(entry->state.hook_entries, hook_entry);
> +		err = nf_queue(skb, &entry->state,
>  			       verdict >> NF_VERDICT_QBITS);
>  		if (err < 0) {
>  			if (err == -ESRCH &&
> diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
> index 7caa8b0..af832c5 100644
> --- a/net/netfilter/nfnetlink_queue.c
> +++ b/net/netfilter/nfnetlink_queue.c
> @@ -917,12 +917,14 @@ static struct notifier_block nfqnl_dev_notifier = {
>  	.notifier_call	= nfqnl_rcv_dev_event,
>  };
>  
> -static int nf_hook_cmp(struct nf_queue_entry *entry, unsigned long ops_ptr)
> +static int nf_hook_cmp(struct nf_queue_entry *entry, unsigned long entry_ptr)
>  {
> -	return entry->elem == (struct nf_hook_ops *)ops_ptr;
> +	return rcu_access_pointer(entry->state.hook_entries) ==
> +		(struct nf_hook_entry *)entry_ptr;
>  }
>  
> -static void nfqnl_nf_hook_drop(struct net *net, struct nf_hook_ops *hook)
> +static void nfqnl_nf_hook_drop(struct net *net,
> +			       const struct nf_hook_entry *hook)
>  {
>  	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
>  	int i;

^ permalink raw reply

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Alexei Starovoitov @ 2016-09-21 15:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, davem, netdev, kernel-team, tariqt, bblanco,
	alexei.starovoitov, eric.dumazet
In-Reply-To: <20160921083955.7cdba944@redhat.com>

On 9/20/16 11:39 PM, Jesper Dangaard Brouer wrote:
> We are the early stages of XDP development. Users cannot consider XDP a
> stable UAPI yet.  I added a big fat warning to the docs here[1].
>
> If you already consider this a stable API, then I will suggest that we
> disable XDP or rip the hole thing out again!!!

the doc that you wrote is a great description of your understanding
of what XDP is about. It's not an official spec or design document.
Until it is reviewed and lands in kernel.org, please do not
make any assumption about present or future XDP API based on it.

^ permalink raw reply

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Pablo Neira Ayuso @ 2016-09-21 15:45 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <6584b975-fa3e-8d98-f0c7-a2c6b194b2b6-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

Hi Daniel,

On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 09/20/2016 04:29 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> > [...]
> >> Why would we artificially limit the use-cases of this implementation if
> >> the way it stands, both filtering and introspection are possible?
> > 
> > Why should we place infrastructure in the kernel to filter packets so
> > late, and why at postrouting btw, when we can do this way earlier
> > before any packet is actually sent?
> 
> The point is that from an application's perspective, restricting the
> ability to bind a port and dropping packets that are being sent is a
> very different thing. Applications will start to behave differently if
> they can't bind to a port, and that's something we do not want to happen.

What is exactly the problem? Applications are not checking for return
value from bind? They should be fixed. If you want to collect
statistics, I see no reason why you couldn't collect them for every
EACCESS on each bind() call.

> Looking at packets and making a verdict on them is the only way to
> implement what we have in mind. Given that's in line with what netfilter
> does, it can't be all that wrong, can it?

That output hook was added ~20 years ago... At that time we didn't
have anything better than dropping locally generated packets. Today we
can probably do something better.

> > No performance impact, no need for
> > skbuff allocation and *no cycles wasted to evaluate if every packet is
> > wanted or not*.
> 
> Hmm, not sure why this keeps coming up. As I said - for accounting,
> there is no other option than to look at every packet and its size.
> 
> Regarding the performance concerns, are you saying a netfilter based
> implementation that uses counters for that purpose would be more
> efficient?

> Because in my tests, just loading the netfilter modules with no
> rules in place at all has more impact than running the code from 6/6
> on every packet.

You must be talking on iptables. When did you test this? We now have
on-demand hook registration per-netns, anyway, in nftables you only
register what you need.

Everytime you mention about performance, it sounds like there is no
room to improve what we have... and we indeed have room and ideas to
get this flying faster, but keeping in mind good integration with our
generic network stack and extensible interfaces, that's important too.
On top of that, I started working on some preliminary patches to add
nftables jit, will be talking on this during NetDev 1.2
netfilter/nftables workshop. I would expect numbers close to what
you're observing with this solution.

^ permalink raw reply

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Alexei Starovoitov @ 2016-09-21 15:39 UTC (permalink / raw)
  To: Tom Herbert, Thomas Graf
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Tariq Toukan, Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <CALx6S35Xctd-aA8DF1_vypMDejiGXcud6=UO33dqgvO60W0DZQ@mail.gmail.com>

On 9/21/16 7:19 AM, Tom Herbert wrote:
> #1: Should we allow alternate code to run in XDP other than BPF?

separate nft hook - yes
generic hook - no
since it's one step away from kernel modules abusing this hook.
pass/drop/tx of raw buffer at the driver level is a perfect
interface to bypass everything in the stack.
The tighter we make it the better.

If nft and bpf are both not flexible enough to express
dataplane functionality we should extend them instead of
writing C code or kernel modules.

On bpf side we're trying very hard to kill any dream of
interoperability with kernel modules.
The map and prog type registration is done in a way to make
it impossible for kernel modules to register their own
map and program types or provide their own helper functions.

nfhooks approach is very lax at that and imo it was a mistake,
since there are plenty of out of tree modules that using nf hooks
and plenty of in-tree modules that are barely maintained.

> #2: If #1 is true what is the best way to implement that?

Add separate nft hook that doesn't interfere in any way
with bpf hook at xdp level.
The order nft-first or bpf-first or exclusive attach
doesn't matter to me. These are details to be discussed.

^ permalink raw reply

* Re: [PATCH next v3 0/2] Rename WORD_TRUNC/ROUND macros and use them
From: Neil Horman @ 2016-09-21 15:37 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev, linux-sctp, Vlad Yasevich, David Miller, David Laight
In-Reply-To: <cover.1474457954.git.marcelo.leitner@gmail.com>

On Wed, Sep 21, 2016 at 08:45:54AM -0300, Marcelo Ricardo Leitner wrote:
> This patchset aims to rename these macros to a non-confusing name, as
> reported by David Laight and David Miller, and to update all remaining
> places to make use of it, which was 1 last remaining spot.
> 
> v3:
> - Name it SCTP_PAD4 instead of SCTP_ALIGN4, as suggested by David Laight
> v2:
> - fixed 2nd patch summary
> 
> Details on the specific changelogs.
> 
> Thanks!
> 
> Marcelo Ricardo Leitner (2):
>   sctp: rename WORD_TRUNC/ROUND macros
>   sctp: make use of WORD_TRUNC macro
> 
>  include/net/sctp/sctp.h  | 10 +++++-----
>  net/netfilter/xt_sctp.c  |  2 +-
>  net/sctp/associola.c     |  2 +-
>  net/sctp/chunk.c         | 13 +++++++------
>  net/sctp/input.c         |  8 ++++----
>  net/sctp/inqueue.c       |  2 +-
>  net/sctp/output.c        | 12 ++++++------
>  net/sctp/sm_make_chunk.c | 28 ++++++++++++++--------------
>  net/sctp/sm_statefuns.c  |  6 +++---
>  net/sctp/transport.c     |  4 ++--
>  net/sctp/ulpevent.c      |  4 ++--
>  11 files changed, 46 insertions(+), 45 deletions(-)
> 
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: ethernet: mediatek: add the dts property to set if TRGMII supported on GMAC0
From: Sean Wang @ 2016-09-21 15:37 UTC (permalink / raw)
  To: andrew; +Cc: john, davem, nbd, netdev, linux-kernel, linux-mediatek, keyhaede
In-Reply-To: <20160921141720.GN22292@lunn.ch>

Date: Wed, 21 Sep 2016 16:17:20 +0200, Andrew Lunn <andrew@lunn.ch> wrote:
>On Wed, Sep 21, 2016 at 02:16:30PM +0800, Sean Wang wrote:
>> Date: Tue, 20 Sep 2016 21:37:58 +0200, Andrew Lunn <andrew@lunn.ch> wrote:
>> >On Tue, Sep 20, 2016 at 03:59:20PM +0800, sean.wang@mediatek.com wrote:
>> >> From: Sean Wang <sean.wang@mediatek.com>
>> >>
>> >> Add the dts property for the capability if TRGMII supported on GAMC0
>> >>
>> >> Signed-off-by: Sean Wang <sean.wang@mediatek.com>

.... deleted

>> >In this case the switch is an MDIO device, not an PHY. It will not
>> >have an phy-mode. It cannot have a phy mode, it is not a PHY.
>> >
>> >Or am i missing something here?
>> >
>> >Thanks
>> >
>> 
>> 1)
>> 
>> The switch driver is not supported for DSA so far yet 
>> but DSA is good thing and I will try make it happen
>> in the near future.
>
>O.K. But if i understand correctly, the TRGMII is so you can use the
>switch. So it needs to work when you have DSA.
>

yes, you are right. TRGMII for now is dedicated for switch
and furthermore it needs doing calibration between the host and
the switch before it works, that I expect to put
the logic of calibration into setup callback of DSA driver.


>> And another question about DSA, that is
>> if I use DSA for switch, how to know the relationship
>> between MAC and DSA ? such like I could know relationship 
>> between MAC and PHY by phy-handle.
>
>It will look like what i stated above. But i missed the cpu node in
>the ports, which is what you are asking about. There will also be a
>node like:
>
>                            port@6 {
>                                     reg = <6>;
>                                     label = "cpu";
>                                     ethernet = <&gmac1>;
>                             };
>
>And this is how you couple the MAC to DSA.

thanks, it is answerig my question : i can get the relationship from 
the node of cpu port pointing to what MAC it runs for.

>> The cause I ask is becasue I think it's good if the topology
>> about MAC/PHYs/Switch is known just by dts files.
>> 
>> 2)
>> 
>> The phy-mode I mention is for fixed-link. For current MAC driver, 
>> it just uses fixed phy to adapt into the part of switch, so the 
>> device tree looks something like the below. 
>> 
>> &eth {
>>         status = "okay";
>>         gmac0: mac@0 {
>>                 compatible = "mediatek,eth-mac";
>>                 reg = <0>;
>>                 phy-mode = "trgmii";
>>                 fixed-link {
>>                         speed = <1000>;
>>                         full-duplex;
>>                         pause;
>>                 };
>>         };
>> 
>>         gmac1: mac@1 {
>>                 compatible = "mediatek,eth-mac";
>>                 reg = <1>;
>>                 phy-handle = <&phy5>;
>>         };
>
>
>static int mtk_phy_connect(struct mtk_mac *mac)
>{
>        struct mtk_eth *eth = mac->hw;
>        struct device_node *np;
>        u32 val;
>
>        np = of_parse_phandle(mac->of_node, "phy-handle", 0);
>        if (!np && of_phy_is_fixed_link(mac->of_node))
>                if (!of_phy_register_fixed_link(mac->of_node))
>                        np = of_node_get(mac->of_node);
>	...
>        ...
>        mtk_phy_connect_node(eth, mac, np);
>
>
>So in the case of a fixed-phy, you do look in the MAC node, and when
>there is a phy-handle, you look in the PHY node.
>
>So this does work....

yes , it is all

>
>   Andrew

^ permalink raw reply

* [PATCH nf-next v3 6/7] nf_queue_handler: whitespace cleanup
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

A future patch will modify the hook drop and outfn functions.  This will
cause the line lengths to take up too much space.  This is simply a
readability change.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 include/net/netfilter/nf_queue.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
index cc8a11f..66b3f3f 100644
--- a/include/net/netfilter/nf_queue.h
+++ b/include/net/netfilter/nf_queue.h
@@ -22,10 +22,10 @@ struct nf_queue_entry {
 
 /* Packet queuing */
 struct nf_queue_handler {
-	int			(*outfn)(struct nf_queue_entry *entry,
-					 unsigned int queuenum);
-	void			(*nf_hook_drop)(struct net *net,
-						struct nf_hook_ops *ops);
+	int		(*outfn)(struct nf_queue_entry *entry,
+				 unsigned int queuenum);
+	void		(*nf_hook_drop)(struct net *net,
+					struct nf_hook_ops *ops);
 };
 
 void nf_register_queue_handler(struct net *net, const struct nf_queue_handler *qh);
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 5/7] nf_register_net_hook: Only allow sane values
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

This commit adds an upfront check for sane values to be passed when
registering a netfilter hook.  This will be used in a future patch for a
simplified hook list traversal.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 net/netfilter/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c8faf81..67b7428 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -89,6 +89,11 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	struct nf_hook_entry *entry;
 	struct nf_hook_ops *elem;
 
+	if (reg->pf == NFPROTO_NETDEV &&
+	    (reg->hooknum != NF_NETDEV_INGRESS ||
+	     !reg->dev || dev_net(reg->dev) != net))
+		return -EINVAL;
+
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
 		return -ENOMEM;
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 4/7] nf_hook_slow: Remove explicit rcu_read_lock
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call.  This is just a cleanup, as the locking
code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 net/bridge/netfilter/ebt_redirect.c            | 2 +-
 net/bridge/netfilter/ebtables.c                | 2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   | 2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 2 +-
 net/netfilter/core.c                           | 6 +-----
 net/netfilter/nf_conntrack_core.c              | 2 +-
 net/netfilter/nf_conntrack_h323_main.c         | 2 +-
 net/netfilter/nf_conntrack_helper.c            | 2 +-
 net/netfilter/nfnetlink_cthelper.c             | 2 +-
 net/netfilter/nfnetlink_log.c                  | 6 ++++--
 net/netfilter/nfnetlink_queue.c                | 2 +-
 net/netfilter/xt_helper.c                      | 2 +-
 14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/net/bridge/netfilter/ebt_redirect.c b/net/bridge/netfilter/ebt_redirect.c
index 20396499..2e7c4f9 100644
--- a/net/bridge/netfilter/ebt_redirect.c
+++ b/net/bridge/netfilter/ebt_redirect.c
@@ -24,7 +24,7 @@ ebt_redirect_tg(struct sk_buff *skb, const struct xt_action_param *par)
 		return EBT_DROP;
 
 	if (par->hooknum != NF_BR_BROUTING)
-		/* rcu_read_lock()ed by nf_hook_slow */
+		/* rcu_read_lock()ed by nf_hook_thresh */
 		ether_addr_copy(eth_hdr(skb)->h_dest,
 				br_port_get_rcu(par->in)->br->dev->dev_addr);
 	else
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index cceac5b..dd71332 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -146,7 +146,7 @@ ebt_basic_match(const struct ebt_entry *e, const struct sk_buff *skb,
 		return 1;
 	if (NF_INVF(e, EBT_IOUT, ebt_dev_check(e->out, out)))
 		return 1;
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	if (in && (p = br_port_get_rcu(in)) != NULL &&
 	    NF_INVF(e, EBT_ILOGICALIN,
 		    ebt_dev_check(e->logical_in, p->br->dev)))
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 870aebd..713c09a 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -110,7 +110,7 @@ static unsigned int ipv4_helper(void *priv,
 	if (!help)
 		return NF_ACCEPT;
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	helper = rcu_dereference(help->helper);
 	if (!helper)
 		return NF_ACCEPT;
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 4b5904b..d075b3c 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -149,7 +149,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
 		return -NF_ACCEPT;
 	}
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	innerproto = __nf_ct_l4proto_find(PF_INET, origtuple.dst.protonum);
 
 	/* Ordinarily, we'd expect the inverted tupleproto, but it's
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 1aa5848..963ee38 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -115,7 +115,7 @@ static unsigned int ipv6_helper(void *priv,
 	help = nfct_help(ct);
 	if (!help)
 		return NF_ACCEPT;
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	helper = rcu_dereference(help->helper);
 	if (!helper)
 		return NF_ACCEPT;
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 660bc10..f5a61bc 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -165,7 +165,7 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 		return -NF_ACCEPT;
 	}
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	inproto = __nf_ct_l4proto_find(PF_INET6, origtuple.dst.protonum);
 
 	/* Ordinarily, we'd expect the inverted tupleproto, but it's
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d..c8faf81 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -291,16 +291,13 @@ repeat:
 
 
 /* Returns 1 if okfn() needs to be executed by the caller,
- * -EPERM for NF_DROP, 0 otherwise. */
+ * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
 {
 	struct nf_hook_ops *elem;
 	unsigned int verdict;
 	int ret = 0;
 
-	/* We may already have this, but read-locks nest anyway */
-	rcu_read_lock();
-
 	elem = list_entry_rcu(state->hook_list, struct nf_hook_ops, list);
 next_hook:
 	verdict = nf_iterate(state->hook_list, skb, state, &elem);
@@ -321,7 +318,6 @@ next_hook:
 			kfree_skb(skb);
 		}
 	}
-	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL(nf_hook_slow);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 8d1ddb9..c94ec19 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1275,7 +1275,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		skb->nfct = NULL;
 	}
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	l3proto = __nf_ct_l3proto_find(pf);
 	ret = l3proto->get_l4proto(skb, skb_network_offset(skb),
 				   &dataoff, &protonum);
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 5c0db5c..f65d936 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -736,7 +736,7 @@ static int callforward_do_filter(struct net *net,
 	const struct nf_afinfo *afinfo;
 	int ret = 0;
 
-	/* rcu_read_lock()ed by nf_hook_slow() */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	afinfo = nf_get_afinfo(family);
 	if (!afinfo)
 		return 0;
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 4ffe388..336e215 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -346,7 +346,7 @@ void nf_ct_helper_log(struct sk_buff *skb, const struct nf_conn *ct,
 	/* Called from the helper function, this call never fails */
 	help = nfct_help(ct);
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	helper = rcu_dereference(help->helper);
 
 	nf_log_packet(nf_ct_net(ct), nf_ct_l3num(ct), 0, skb, NULL, NULL, NULL,
diff --git a/net/netfilter/nfnetlink_cthelper.c b/net/netfilter/nfnetlink_cthelper.c
index e924e95..3b79f34 100644
--- a/net/netfilter/nfnetlink_cthelper.c
+++ b/net/netfilter/nfnetlink_cthelper.c
@@ -43,7 +43,7 @@ nfnl_userspace_cthelper(struct sk_buff *skb, unsigned int protoff,
 	if (help == NULL)
 		return NF_DROP;
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	helper = rcu_dereference(help->helper);
 	if (helper == NULL)
 		return NF_DROP;
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 6577db5..f2446e7 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -442,7 +442,8 @@ __build_packet_message(struct nfnl_log_net *log,
 			if (nla_put_be32(inst->skb, NFULA_IFINDEX_PHYSINDEV,
 					 htonl(indev->ifindex)) ||
 			/* this is the bridge group "brX" */
-			/* rcu_read_lock()ed by nf_hook_slow or nf_log_packet */
+			/* rcu_read_lock()ed by nf_hook_thresh or
+			 * nf_log_packet */
 			    nla_put_be32(inst->skb, NFULA_IFINDEX_INDEV,
 					 htonl(br_port_get_rcu(indev)->br->dev->ifindex)))
 				goto nla_put_failure;
@@ -477,7 +478,8 @@ __build_packet_message(struct nfnl_log_net *log,
 			if (nla_put_be32(inst->skb, NFULA_IFINDEX_PHYSOUTDEV,
 					 htonl(outdev->ifindex)) ||
 			/* this is the bridge group "brX" */
-			/* rcu_read_lock()ed by nf_hook_slow or nf_log_packet */
+			/* rcu_read_lock()ed by nf_hook_thresh or
+			 * nf_log_packet */
 			    nla_put_be32(inst->skb, NFULA_IFINDEX_OUTDEV,
 					 htonl(br_port_get_rcu(outdev)->br->dev->ifindex)))
 				goto nla_put_failure;
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 808da34..7caa8b0 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -740,7 +740,7 @@ nfqnl_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
 	struct net *net = entry->state.net;
 	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
 
-	/* rcu_read_lock()ed by nf_hook_slow() */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	queue = instance_lookup(q, queuenum);
 	if (!queue)
 		return -ESRCH;
diff --git a/net/netfilter/xt_helper.c b/net/netfilter/xt_helper.c
index 9f4ab00..a883edb 100644
--- a/net/netfilter/xt_helper.c
+++ b/net/netfilter/xt_helper.c
@@ -41,7 +41,7 @@ helper_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (!master_help)
 		return ret;
 
-	/* rcu_read_lock()ed by nf_hook_slow */
+	/* rcu_read_lock()ed by nf_hook_thresh */
 	helper = rcu_dereference(master_help->helper);
 	if (!helper)
 		return ret;
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 3/7] netfilter: call nf_hook_ingress with rcu_read_lock
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

This commit ensures that the rcu read-side lock is held while the
ingress hook is called.  This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 net/core/dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..0649194 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4040,12 +4040,17 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
 {
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (nf_hook_ingress_active(skb)) {
+		int ingress_retval;
+
 		if (*pt_prev) {
 			*ret = deliver_skb(skb, *pt_prev, orig_dev);
 			*pt_prev = NULL;
 		}
 
-		return nf_hook_ingress(skb);
+		rcu_read_lock();
+		ingress_retval = nf_hook_ingress(skb);
+		rcu_read_unlock();
+		return ingress_retval;
 	}
 #endif /* CONFIG_NETFILTER_INGRESS */
 	return 0;
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 2/7] netfilter: call nf_hook_state_init with rcu_read_lock held
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

From: Florian Westphal <fw@strlen.de>

This makes things simpler because we can store the head of the list
in the nf_state structure without worrying about concurrent add/delete
of hook elements from the list.

A future commit will make use of this to implement a simpler
linked-list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 include/linux/netfilter.h         | 8 +++++++-
 include/linux/netfilter_ingress.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9a..ad444f0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -174,10 +174,16 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 
 	if (!list_empty(hook_list)) {
 		struct nf_hook_state state;
+		int ret;
 
+		/* We may already have this, but read-locks nest anyway */
+		rcu_read_lock();
 		nf_hook_state_init(&state, hook_list, hook, thresh,
 				   pf, indev, outdev, sk, net, okfn);
-		return nf_hook_slow(skb, &state);
+
+		ret = nf_hook_slow(skb, &state);
+		rcu_read_unlock();
+		return ret;
 	}
 	return 1;
 }
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index 5fcd375..6965ba0 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -14,6 +14,7 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
 	return !list_empty(&skb->dev->nf_hooks_ingress);
 }
 
+/* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
 	struct nf_hook_state state;
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 1/7] netfilter: bridge: add and use br_nf_hook_thresh
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

From: Florian Westphal <fw@strlen.de>

This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
 include/net/netfilter/br_netfilter.h |  6 ++++
 net/bridge/br_netfilter_hooks.c      | 60 ++++++++++++++++++++++++++++++------
 net/bridge/br_netfilter_ipv6.c       | 12 +++-----
 3 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index e8d1448..0b0c35c 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -15,6 +15,12 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
 
+int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
+		      struct sk_buff *skb, struct net_device *indev,
+		      struct net_device *outdev,
+		      int (*okfn)(struct net *, struct sock *,
+				  struct sk_buff *));
+
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 77e7f69..6029af4 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -30,6 +30,7 @@
 #include <linux/netfilter_ipv6.h>
 #include <linux/netfilter_arp.h>
 #include <linux/in_route.h>
+#include <linux/rculist.h>
 #include <linux/inetdevice.h>
 
 #include <net/ip.h>
@@ -395,11 +396,10 @@ bridged_dnat:
 				skb->dev = nf_bridge->physindev;
 				nf_bridge_update_protocol(skb);
 				nf_bridge_push_encap_header(skb);
-				NF_HOOK_THRESH(NFPROTO_BRIDGE,
-					       NF_BR_PRE_ROUTING,
-					       net, sk, skb, skb->dev, NULL,
-					       br_nf_pre_routing_finish_bridge,
-					       1);
+				br_nf_hook_thresh(NF_BR_PRE_ROUTING,
+						  net, sk, skb, skb->dev,
+						  NULL,
+						  br_nf_pre_routing_finish);
 				return 0;
 			}
 			ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr);
@@ -417,10 +417,8 @@ bridged_dnat:
 	skb->dev = nf_bridge->physindev;
 	nf_bridge_update_protocol(skb);
 	nf_bridge_push_encap_header(skb);
-	NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb,
-		       skb->dev, NULL,
-		       br_handle_frame_finish, 1);
-
+	br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL,
+			  br_handle_frame_finish);
 	return 0;
 }
 
@@ -992,6 +990,50 @@ static struct notifier_block brnf_notifier __read_mostly = {
 	.notifier_call = brnf_device_event,
 };
 
+/* recursively invokes nf_hook_slow (again), skipping already-called
+ * hooks (< NF_BR_PRI_BRNF).
+ *
+ * Called with rcu read lock held.
+ */
+int br_nf_hook_thresh(unsigned int hook, struct net *net,
+		      struct sock *sk, struct sk_buff *skb,
+		      struct net_device *indev,
+		      struct net_device *outdev,
+		      int (*okfn)(struct net *, struct sock *,
+				  struct sk_buff *))
+{
+	struct nf_hook_ops *elem;
+	struct nf_hook_state state;
+	struct list_head *head;
+	int ret;
+
+	head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
+
+	list_for_each_entry_rcu(elem, head, list) {
+		struct nf_hook_ops *next;
+
+		next = list_entry_rcu(list_next_rcu(&elem->list),
+				      struct nf_hook_ops, list);
+		if (next->priority <= NF_BR_PRI_BRNF)
+			continue;
+	}
+
+	if (&elem->list == head)
+		return okfn(net, sk, skb);
+
+	/* We may already have this, but read-locks nest anyway */
+	rcu_read_lock();
+	nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
+			   NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
+
+	ret = nf_hook_slow(skb, &state);
+	rcu_read_unlock();
+	if (ret == 1)
+		ret = okfn(net, sk, skb);
+
+	return ret;
+}
+
 #ifdef CONFIG_SYSCTL
 static
 int brnf_sysctl_call_tables(struct ctl_table *ctl, int write,
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 5e59a84..5989661 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -187,10 +187,9 @@ static int br_nf_pre_routing_finish_ipv6(struct net *net, struct sock *sk, struc
 			skb->dev = nf_bridge->physindev;
 			nf_bridge_update_protocol(skb);
 			nf_bridge_push_encap_header(skb);
-			NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING,
-				       net, sk, skb, skb->dev, NULL,
-				       br_nf_pre_routing_finish_bridge,
-				       1);
+			br_nf_hook_thresh(NF_BR_PRE_ROUTING,
+					  net, sk, skb, skb->dev, NULL,
+					  br_nf_pre_routing_finish_bridge);
 			return 0;
 		}
 		ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr);
@@ -207,9 +206,8 @@ static int br_nf_pre_routing_finish_ipv6(struct net *net, struct sock *sk, struc
 	skb->dev = nf_bridge->physindev;
 	nf_bridge_update_protocol(skb);
 	nf_bridge_push_encap_header(skb);
-	NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb,
-		       skb->dev, NULL,
-		       br_handle_frame_finish, 1);
+	br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb,
+			  skb->dev, NULL, br_handle_frame_finish);
 
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH nf-next v3 0/7] Compact netfilter hooks list
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso

This series makes a simple change to shrink the netfilter hook list
from a double linked list, to a singly linked list.  Since the hooks
are always traversed in-order, there is no need to maintain a previous
pointer.

This was jointly developed by Florian Westphal.

It has been tested with RCU debugging and lockdep debugging enabled.  A
more rigorous stress test is underway, but this is being submitted for
early feedback.

Apologies for the size of patch 7/7, particularly the refactor in
nf_hook_thresh.  It didn't make sense to split the refactor out at the
time, but if desired, it can be reworked.

After this series, the hook entry head in nf_hook_state will not always
be a valid pointer.  I don't know if the circular nature of the hook list
could have ever been abused with a string of custom queue and non-queue
hook handlers.  If so, this patch would likely break that behavior.

Previous series can be found at:
http://www.spinics.net/lists/netdev/msg386080.html

Aaron Conole (5):
  netfilter: call nf_hook_ingress with rcu_read_lock
  nf_hook_slow: Remove explicit rcu_read_lock
  nf_register_net_hook: Only allow sane values
  nf_queue_handler: whitespace cleanup
  netfilter: replace list_head with single linked list

Florian Westphal (2):
  netfilter: bridge: add and use br_nf_hook_thresh
  netfilter: call nf_hook_state_init with rcu_read_lock held

 include/linux/netdevice.h                      |   2 +-
 include/linux/netfilter.h                      |  61 ++++++----
 include/linux/netfilter_ingress.h              |  16 ++-
 include/net/netfilter/br_netfilter.h           |   6 +
 include/net/netfilter/nf_queue.h               |   9 +-
 include/net/netns/netfilter.h                  |   2 +-
 net/bridge/br_netfilter_hooks.c                |  53 +++++++--
 net/bridge/br_netfilter_ipv6.c                 |  12 +-
 net/bridge/netfilter/ebt_redirect.c            |   2 +-
 net/bridge/netfilter/ebtables.c                |   2 +-
 net/core/dev.c                                 |   7 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |   2 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |   2 +-
 net/netfilter/core.c                           | 152 ++++++++++++++++---------
 net/netfilter/nf_conntrack_core.c              |   2 +-
 net/netfilter/nf_conntrack_h323_main.c         |   2 +-
 net/netfilter/nf_conntrack_helper.c            |   2 +-
 net/netfilter/nf_internals.h                   |  10 +-
 net/netfilter/nf_queue.c                       |  18 +--
 net/netfilter/nfnetlink_cthelper.c             |   2 +-
 net/netfilter/nfnetlink_log.c                  |   6 +-
 net/netfilter/nfnetlink_queue.c                |  10 +-
 net/netfilter/xt_helper.c                      |   2 +-
 25 files changed, 249 insertions(+), 137 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH] netfilter: replace list_head with single linked list
From: Aaron Conole @ 2016-09-21 15:35 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <1474472107-12992-1-git-send-email-aconole@bytheb.org>

The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netdevice.h         |   2 +-
 include/linux/netfilter.h         |  61 +++++++++--------
 include/linux/netfilter_ingress.h |  16 +++--
 include/net/netfilter/nf_queue.h  |   3 +-
 include/net/netns/netfilter.h     |   2 +-
 net/bridge/br_netfilter_hooks.c   |  19 ++---
 net/netfilter/core.c              | 141 +++++++++++++++++++++++++-------------
 net/netfilter/nf_internals.h      |  10 +--
 net/netfilter/nf_queue.c          |  18 ++---
 net/netfilter/nfnetlink_queue.c   |   8 ++-
 10 files changed, 165 insertions(+), 115 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67bb978..41f49f5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1783,7 +1783,7 @@ struct net_device {
 #endif
 	struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
-	struct list_head	nf_hooks_ingress;
+	struct nf_hook_entry __rcu *nf_hooks_ingress;
 #endif
 
 	unsigned char		broadcast[MAX_ADDR_LEN];
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ad444f0..17c90b0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -55,12 +55,34 @@ struct nf_hook_state {
 	struct net_device *out;
 	struct sock *sk;
 	struct net *net;
-	struct list_head *hook_list;
+	struct nf_hook_entry __rcu *hook_entries;
 	int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
+typedef unsigned int nf_hookfn(void *priv,
+			       struct sk_buff *skb,
+			       const struct nf_hook_state *state);
+struct nf_hook_ops {
+	struct list_head	list;
+
+	/* User fills in from here down. */
+	nf_hookfn		*hook;
+	struct net_device	*dev;
+	void			*priv;
+	u_int8_t		pf;
+	unsigned int		hooknum;
+	/* Hooks are ordered in ascending priority. */
+	int			priority;
+};
+
+struct nf_hook_entry {
+	struct nf_hook_entry __rcu	*next;
+	struct nf_hook_ops		ops;
+	const struct nf_hook_ops	*orig_ops;
+};
+
 static inline void nf_hook_state_init(struct nf_hook_state *p,
-				      struct list_head *hook_list,
+				      struct nf_hook_entry *hook_entry,
 				      unsigned int hook,
 				      int thresh, u_int8_t pf,
 				      struct net_device *indev,
@@ -76,26 +98,11 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 	p->out = outdev;
 	p->sk = sk;
 	p->net = net;
-	p->hook_list = hook_list;
+	RCU_INIT_POINTER(p->hook_entries, hook_entry);
 	p->okfn = okfn;
 }
 
-typedef unsigned int nf_hookfn(void *priv,
-			       struct sk_buff *skb,
-			       const struct nf_hook_state *state);
 
-struct nf_hook_ops {
-	struct list_head 	list;
-
-	/* User fills in from here down. */
-	nf_hookfn		*hook;
-	struct net_device	*dev;
-	void			*priv;
-	u_int8_t		pf;
-	unsigned int		hooknum;
-	/* Hooks are ordered in ascending priority. */
-	int			priority;
-};
 
 struct nf_sockopt_ops {
 	struct list_head list;
@@ -161,7 +168,8 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 				 int (*okfn)(struct net *, struct sock *, struct sk_buff *),
 				 int thresh)
 {
-	struct list_head *hook_list;
+	struct nf_hook_entry *hook_head;
+	int ret = 1;
 
 #ifdef HAVE_JUMP_LABEL
 	if (__builtin_constant_p(pf) &&
@@ -170,22 +178,19 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 		return 1;
 #endif
 
-	hook_list = &net->nf.hooks[pf][hook];
+	rcu_read_lock();
 
-	if (!list_empty(hook_list)) {
+	hook_head = rcu_dereference(net->nf.hooks[pf][hook]);
+	if (hook_head) {
 		struct nf_hook_state state;
-		int ret;
 
-		/* We may already have this, but read-locks nest anyway */
-		rcu_read_lock();
-		nf_hook_state_init(&state, hook_list, hook, thresh,
+		nf_hook_state_init(&state, hook_head, hook, thresh,
 				   pf, indev, outdev, sk, net, okfn);
 
 		ret = nf_hook_slow(skb, &state);
-		rcu_read_unlock();
-		return ret;
 	}
-	return 1;
+	rcu_read_unlock();
+	return ret;
 }
 
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index 6965ba0..b02fd80 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -11,23 +11,29 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
 	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
 		return false;
 #endif
-	return !list_empty(&skb->dev->nf_hooks_ingress);
+	return rcu_access_pointer(skb->dev->nf_hooks_ingress);
 }
 
 /* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
+	struct nf_hook_entry *e = rcu_dereference(skb->dev->nf_hooks_ingress);
 	struct nf_hook_state state;
 
-	nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
-			   NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV,
-			   skb->dev, NULL, NULL, dev_net(skb->dev), NULL);
+	/* must recheck the ingress hook head, in the event it became NULL
+	 * after the check in nf_hook_ingress_active evaluated to true. */
+	if (unlikely(!e))
+		return 0;
+
+	nf_hook_state_init(&state, e, NF_NETDEV_INGRESS, INT_MIN,
+			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
+			   dev_net(skb->dev), NULL);
 	return nf_hook_slow(skb, &state);
 }
 
 static inline void nf_hook_ingress_init(struct net_device *dev)
 {
-	INIT_LIST_HEAD(&dev->nf_hooks_ingress);
+	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 }
 #else /* CONFIG_NETFILTER_INGRESS */
 static inline int nf_hook_ingress_active(struct sk_buff *skb)
diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
index 66b3f3f..ef4f75d 100644
--- a/include/net/netfilter/nf_queue.h
+++ b/include/net/netfilter/nf_queue.h
@@ -11,7 +11,6 @@ struct nf_queue_entry {
 	struct sk_buff		*skb;
 	unsigned int		id;
 
-	struct nf_hook_ops	*elem;
 	struct nf_hook_state	state;
 	u16			size; /* sizeof(entry) + saved route keys */
 
@@ -25,7 +24,7 @@ struct nf_queue_handler {
 	int		(*outfn)(struct nf_queue_entry *entry,
 				 unsigned int queuenum);
 	void		(*nf_hook_drop)(struct net *net,
-					struct nf_hook_ops *ops);
+					const struct nf_hook_entry *hooks);
 };
 
 void nf_register_queue_handler(struct net *net, const struct nf_queue_handler *qh);
diff --git a/include/net/netns/netfilter.h b/include/net/netns/netfilter.h
index 36d7235..58487b1 100644
--- a/include/net/netns/netfilter.h
+++ b/include/net/netns/netfilter.h
@@ -16,6 +16,6 @@ struct netns_nf {
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header *nf_log_dir_header;
 #endif
-	struct list_head hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
+	struct nf_hook_entry __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 };
 #endif
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 6029af4..2fe9345 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -1002,28 +1002,21 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net,
 		      int (*okfn)(struct net *, struct sock *,
 				  struct sk_buff *))
 {
-	struct nf_hook_ops *elem;
+	struct nf_hook_entry *elem;
 	struct nf_hook_state state;
-	struct list_head *head;
 	int ret;
 
-	head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
+	elem = rcu_dereference(net->nf.hooks[NFPROTO_BRIDGE][hook]);
 
-	list_for_each_entry_rcu(elem, head, list) {
-		struct nf_hook_ops *next;
+	while (elem && (elem->ops.priority <= NF_BR_PRI_BRNF))
+		elem = rcu_dereference(elem->next);
 
-		next = list_entry_rcu(list_next_rcu(&elem->list),
-				      struct nf_hook_ops, list);
-		if (next->priority <= NF_BR_PRI_BRNF)
-			continue;
-	}
-
-	if (&elem->list == head)
+	if (!elem)
 		return okfn(net, sk, skb);
 
 	/* We may already have this, but read-locks nest anyway */
 	rcu_read_lock();
-	nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
+	nf_hook_state_init(&state, elem, hook, NF_BR_PRI_BRNF + 1,
 			   NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
 
 	ret = nf_hook_slow(skb, &state);
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 67b7428..360c63d 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -22,6 +22,7 @@
 #include <linux/proc_fs.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/rcupdate.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 
@@ -61,33 +62,50 @@ EXPORT_SYMBOL(nf_hooks_needed);
 #endif
 
 static DEFINE_MUTEX(nf_hook_mutex);
+#define nf_entry_dereference(e) \
+	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
 
-static struct list_head *nf_find_hook_list(struct net *net,
-					   const struct nf_hook_ops *reg)
+static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
+						const struct nf_hook_ops *reg)
 {
-	struct list_head *hook_list = NULL;
+	struct nf_hook_entry *hook_head = NULL;
 
 	if (reg->pf != NFPROTO_NETDEV)
-		hook_list = &net->nf.hooks[reg->pf][reg->hooknum];
+		hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
+						 [reg->hooknum]);
 	else if (reg->hooknum == NF_NETDEV_INGRESS) {
 #ifdef CONFIG_NETFILTER_INGRESS
 		if (reg->dev && dev_net(reg->dev) == net)
-			hook_list = &reg->dev->nf_hooks_ingress;
+			hook_head =
+				nf_entry_dereference(
+					reg->dev->nf_hooks_ingress);
 #endif
 	}
-	return hook_list;
+	return hook_head;
 }
 
-struct nf_hook_entry {
-	const struct nf_hook_ops	*orig_ops;
-	struct nf_hook_ops		ops;
-};
+/* must hold nf_hook_mutex */
+static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
+			      struct nf_hook_entry *entry)
+{
+	switch (reg->pf) {
+	case NFPROTO_NETDEV:
+		/* We already checked in nf_register_net_hook() that this is
+		 * used from ingress.
+		 */
+		rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
+		break;
+	default:
+		rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
+				   entry);
+		break;
+	}
+}
 
 int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct list_head *hook_list;
+	struct nf_hook_entry *hooks_entry;
 	struct nf_hook_entry *entry;
-	struct nf_hook_ops *elem;
 
 	if (reg->pf == NFPROTO_NETDEV &&
 	    (reg->hooknum != NF_NETDEV_INGRESS ||
@@ -100,19 +118,30 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 
 	entry->orig_ops	= reg;
 	entry->ops	= *reg;
+	entry->next	= NULL;
+
+	mutex_lock(&nf_hook_mutex);
+	hooks_entry = nf_hook_entry_head(net, reg);
 
-	hook_list = nf_find_hook_list(net, reg);
-	if (!hook_list) {
-		kfree(entry);
-		return -ENOENT;
+	if (hooks_entry && hooks_entry->orig_ops->priority > reg->priority) {
+		/* This is the case where we need to insert at the head */
+		entry->next = hooks_entry;
+		hooks_entry = NULL;
 	}
 
-	mutex_lock(&nf_hook_mutex);
-	list_for_each_entry(elem, hook_list, list) {
-		if (reg->priority < elem->priority)
-			break;
+	while (hooks_entry &&
+	       reg->priority >= hooks_entry->orig_ops->priority &&
+	       nf_entry_dereference(hooks_entry->next)) {
+		hooks_entry = nf_entry_dereference(hooks_entry->next);
+	}
+
+	if (hooks_entry) {
+		entry->next = nf_entry_dereference(hooks_entry->next);
+		rcu_assign_pointer(hooks_entry->next, entry);
+	} else {
+		nf_set_hooks_head(net, reg, entry);
 	}
-	list_add_rcu(&entry->ops.list, elem->list.prev);
+
 	mutex_unlock(&nf_hook_mutex);
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
@@ -127,24 +156,33 @@ EXPORT_SYMBOL(nf_register_net_hook);
 
 void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct list_head *hook_list;
-	struct nf_hook_entry *entry;
-	struct nf_hook_ops *elem;
-
-	hook_list = nf_find_hook_list(net, reg);
-	if (!hook_list)
-		return;
+	struct nf_hook_entry *hooks_entry;
 
 	mutex_lock(&nf_hook_mutex);
-	list_for_each_entry(elem, hook_list, list) {
-		entry = container_of(elem, struct nf_hook_entry, ops);
-		if (entry->orig_ops == reg) {
-			list_del_rcu(&entry->ops.list);
-			break;
+	hooks_entry = nf_hook_entry_head(net, reg);
+	if (hooks_entry->orig_ops == reg) {
+		nf_set_hooks_head(net, reg,
+				  nf_entry_dereference(hooks_entry->next));
+		goto unlock;
+	}
+	while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
+		struct nf_hook_entry *next =
+			nf_entry_dereference(hooks_entry->next);
+		struct nf_hook_entry *nnext;
+
+		if (next->orig_ops != reg) {
+			hooks_entry = next;
+			continue;
 		}
+		nnext = nf_entry_dereference(next->next);
+		rcu_assign_pointer(hooks_entry->next, nnext);
+		hooks_entry = next;
+		break;
 	}
+
+unlock:
 	mutex_unlock(&nf_hook_mutex);
-	if (&elem->list == hook_list) {
+	if (!hooks_entry) {
 		WARN(1, "nf_unregister_net_hook: hook not found!\n");
 		return;
 	}
@@ -156,10 +194,10 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
 	synchronize_net();
-	nf_queue_nf_hook_drop(net, &entry->ops);
+	nf_queue_nf_hook_drop(net, hooks_entry);
 	/* other cpu might still process nfqueue verdict that used reg */
 	synchronize_net();
-	kfree(entry);
+	kfree(hooks_entry);
 }
 EXPORT_SYMBOL(nf_unregister_net_hook);
 
@@ -258,10 +296,9 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
-unsigned int nf_iterate(struct list_head *head,
-			struct sk_buff *skb,
+unsigned int nf_iterate(struct sk_buff *skb,
 			struct nf_hook_state *state,
-			struct nf_hook_ops **elemp)
+			struct nf_hook_entry **entryp)
 {
 	unsigned int verdict;
 
@@ -269,20 +306,23 @@ unsigned int nf_iterate(struct list_head *head,
 	 * The caller must not block between calls to this
 	 * function because of risk of continuing from deleted element.
 	 */
-	list_for_each_entry_continue_rcu((*elemp), head, list) {
-		if (state->thresh > (*elemp)->priority)
+	while (*entryp) {
+		if (state->thresh > (*entryp)->ops.priority) {
+			*entryp = rcu_dereference((*entryp)->next);
 			continue;
+		}
 
 		/* Optimization: we don't need to hold module
 		   reference here, since function can't sleep. --RR */
 repeat:
-		verdict = (*elemp)->hook((*elemp)->priv, skb, state);
+		verdict = (*entryp)->ops.hook((*entryp)->ops.priv, skb, state);
 		if (verdict != NF_ACCEPT) {
 #ifdef CONFIG_NETFILTER_DEBUG
 			if (unlikely((verdict & NF_VERDICT_MASK)
 							> NF_MAX_VERDICT)) {
 				NFDEBUG("Evil return from %p(%u).\n",
-					(*elemp)->hook, state->hook);
+					(*entryp)->ops.hook, state->hook);
+				*entryp = rcu_dereference((*entryp)->next);
 				continue;
 			}
 #endif
@@ -290,6 +330,7 @@ repeat:
 				return verdict;
 			goto repeat;
 		}
+		*entryp = rcu_dereference((*entryp)->next);
 	}
 	return NF_ACCEPT;
 }
@@ -299,13 +340,13 @@ repeat:
  * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
 {
-	struct nf_hook_ops *elem;
+	struct nf_hook_entry *entry;
 	unsigned int verdict;
 	int ret = 0;
 
-	elem = list_entry_rcu(state->hook_list, struct nf_hook_ops, list);
+	entry = rcu_dereference(state->hook_entries);
 next_hook:
-	verdict = nf_iterate(state->hook_list, skb, state, &elem);
+	verdict = nf_iterate(skb, state, &entry);
 	if (verdict == NF_ACCEPT || verdict == NF_STOP) {
 		ret = 1;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
@@ -314,8 +355,10 @@ next_hook:
 		if (ret == 0)
 			ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
-		int err = nf_queue(skb, elem, state,
-				   verdict >> NF_VERDICT_QBITS);
+		int err;
+
+		RCU_INIT_POINTER(state->hook_entries, entry);
+		err = nf_queue(skb, state, verdict >> NF_VERDICT_QBITS);
 		if (err < 0) {
 			if (err == -ESRCH &&
 			   (verdict & NF_VERDICT_FLAG_QUEUE_BYPASS))
@@ -442,7 +485,7 @@ static int __net_init netfilter_net_init(struct net *net)
 
 	for (i = 0; i < ARRAY_SIZE(net->nf.hooks); i++) {
 		for (h = 0; h < NF_MAX_HOOKS; h++)
-			INIT_LIST_HEAD(&net->nf.hooks[i][h]);
+			RCU_INIT_POINTER(net->nf.hooks[i][h], NULL);
 	}
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/netfilter/nf_internals.h b/net/netfilter/nf_internals.h
index 0655225..e0adb59 100644
--- a/net/netfilter/nf_internals.h
+++ b/net/netfilter/nf_internals.h
@@ -13,13 +13,13 @@
 
 
 /* core.c */
-unsigned int nf_iterate(struct list_head *head, struct sk_buff *skb,
-			struct nf_hook_state *state, struct nf_hook_ops **elemp);
+unsigned int nf_iterate(struct sk_buff *skb, struct nf_hook_state *state,
+			struct nf_hook_entry **entryp);
 
 /* nf_queue.c */
-int nf_queue(struct sk_buff *skb, struct nf_hook_ops *elem,
-	     struct nf_hook_state *state, unsigned int queuenum);
-void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops);
+int nf_queue(struct sk_buff *skb, struct nf_hook_state *state,
+	     unsigned int queuenum);
+void nf_queue_nf_hook_drop(struct net *net, const struct nf_hook_entry *entry);
 int __init netfilter_queue_init(void);
 
 /* nf_log.c */
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index b19ad20..96964a0 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -96,14 +96,14 @@ void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
 }
 EXPORT_SYMBOL_GPL(nf_queue_entry_get_refs);
 
-void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops)
+void nf_queue_nf_hook_drop(struct net *net, const struct nf_hook_entry *entry)
 {
 	const struct nf_queue_handler *qh;
 
 	rcu_read_lock();
 	qh = rcu_dereference(net->nf.queue_handler);
 	if (qh)
-		qh->nf_hook_drop(net, ops);
+		qh->nf_hook_drop(net, entry);
 	rcu_read_unlock();
 }
 
@@ -112,7 +112,6 @@ void nf_queue_nf_hook_drop(struct net *net, struct nf_hook_ops *ops)
  * through nf_reinject().
  */
 int nf_queue(struct sk_buff *skb,
-	     struct nf_hook_ops *elem,
 	     struct nf_hook_state *state,
 	     unsigned int queuenum)
 {
@@ -141,7 +140,6 @@ int nf_queue(struct sk_buff *skb,
 
 	*entry = (struct nf_queue_entry) {
 		.skb	= skb,
-		.elem	= elem,
 		.state	= *state,
 		.size	= sizeof(*entry) + afinfo->route_key_size,
 	};
@@ -165,11 +163,15 @@ err:
 
 void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 {
+	struct nf_hook_entry *hook_entry;
 	struct sk_buff *skb = entry->skb;
-	struct nf_hook_ops *elem = entry->elem;
 	const struct nf_afinfo *afinfo;
+	struct nf_hook_ops *elem;
 	int err;
 
+	hook_entry = rcu_dereference(entry->state.hook_entries);
+	elem = &hook_entry->ops;
+
 	nf_queue_entry_release_refs(entry);
 
 	/* Continue traversal iff userspace said ok... */
@@ -186,8 +188,7 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 
 	if (verdict == NF_ACCEPT) {
 	next_hook:
-		verdict = nf_iterate(entry->state.hook_list,
-				     skb, &entry->state, &elem);
+		verdict = nf_iterate(skb, &entry->state, &hook_entry);
 	}
 
 	switch (verdict & NF_VERDICT_MASK) {
@@ -198,7 +199,8 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 		local_bh_enable();
 		break;
 	case NF_QUEUE:
-		err = nf_queue(skb, elem, &entry->state,
+		RCU_INIT_POINTER(entry->state.hook_entries, hook_entry);
+		err = nf_queue(skb, &entry->state,
 			       verdict >> NF_VERDICT_QBITS);
 		if (err < 0) {
 			if (err == -ESRCH &&
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 7caa8b0..af832c5 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -917,12 +917,14 @@ static struct notifier_block nfqnl_dev_notifier = {
 	.notifier_call	= nfqnl_rcv_dev_event,
 };
 
-static int nf_hook_cmp(struct nf_queue_entry *entry, unsigned long ops_ptr)
+static int nf_hook_cmp(struct nf_queue_entry *entry, unsigned long entry_ptr)
 {
-	return entry->elem == (struct nf_hook_ops *)ops_ptr;
+	return rcu_access_pointer(entry->state.hook_entries) ==
+		(struct nf_hook_entry *)entry_ptr;
 }
 
-static void nfqnl_nf_hook_drop(struct net *net, struct nf_hook_ops *hook)
+static void nfqnl_nf_hook_drop(struct net *net,
+			       const struct nf_hook_entry *hook)
 {
 	struct nfnl_queue_net *q = nfnl_queue_pernet(net);
 	int i;
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH] ptp_clock: future-proofing drivers against PTP subsystem becoming optional
From: Edward Cree @ 2016-09-21 15:10 UTC (permalink / raw)
  To: Nicolas Pitre, Richard Cochran, David S. Miller
  Cc: John Stultz, Thomas Gleixner, Josh Triplett, netdev, linux-kernel
In-Reply-To: <alpine.LFD.2.20.1609201915070.9311@knanqh.ubzr>

On 21/09/16 00:25, Nicolas Pitre wrote:
> Drivers must be ready to accept NULL from ptp_clock_register() if the
> PTP clock subsystem is configured out.
>
> This patch documents that and ensures that all drivers cope well
> with a NULL return.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> Reviewed-by: Eugenia Emantayev <eugenia@mellanox.com>
[...]
> diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
> index c771e0af4e..f105a170b4 100644
> --- a/drivers/net/ethernet/sfc/ptp.c
> +++ b/drivers/net/ethernet/sfc/ptp.c
> @@ -1269,13 +1269,13 @@ int efx_ptp_probe(struct efx_nic *efx, struct efx_channel *channel)
>  		if (IS_ERR(ptp->phc_clock)) {
>  			rc = PTR_ERR(ptp->phc_clock);
>  			goto fail3;
> -		}
> -
> -		INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
> -		ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
> -		if (!ptp->pps_workwq) {
> -			rc = -ENOMEM;
> -			goto fail4;
> +		} else if (ptp->phc_clock) {
> +			INIT_WORK(&ptp->pps_work, efx_ptp_pps_worker);
> +			ptp->pps_workwq = create_singlethread_workqueue("sfc_pps");
> +			if (!ptp->pps_workwq) {
> +				rc = -ENOMEM;
> +				goto fail4;
> +			}
>  		}
>  	}
>  	ptp->nic_ts_enabled = false;
For the sfc change:
Acked-by: Edward Cree <ecree@solarflare.com>

^ permalink raw reply

* [PATCH -next] cxgb4: Convert to use simple_open()
From: Wei Yongjun @ 2016-09-21 15:09 UTC (permalink / raw)
  To: Hariprasad S; +Cc: Wei Yongjun, netdev

From: Wei Yongjun <weiyongjun1@huawei.com>

Remove an open coded simple_open() function and replace file
operations references to the function with simple_open()
instead.

Generated by: scripts/coccinelle/api/simple_open.cocci

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 52be9a4..20455d0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2748,12 +2748,6 @@ static void add_debugfs_mem(struct adapter *adap, const char *name,
 				 size_mb << 20);
 }
 
-static int blocked_fl_open(struct inode *inode, struct file *file)
-{
-	file->private_data = inode->i_private;
-	return 0;
-}
-
 static ssize_t blocked_fl_read(struct file *filp, char __user *ubuf,
 			       size_t count, loff_t *ppos)
 {
@@ -2797,7 +2791,7 @@ static ssize_t blocked_fl_write(struct file *filp, const char __user *ubuf,
 
 static const struct file_operations blocked_fl_fops = {
 	.owner   = THIS_MODULE,
-	.open    = blocked_fl_open,
+	.open    = simple_open,
 	.read    = blocked_fl_read,
 	.write   = blocked_fl_write,
 	.llseek  = generic_file_llseek,

^ permalink raw reply related

* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Tom Herbert @ 2016-09-21 15:08 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Tariq Toukan, Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
	Jesper Dangaard Brouer
In-Reply-To: <20160921144800.GB13991@pox.localdomain>

On Wed, Sep 21, 2016 at 7:48 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/21/16 at 07:19am, Tom Herbert wrote:
>> certain design that because of constraints on one kernel interface. As
>> a kernel developer I want flexibility on how we design and implement
>> things!
>
> Perfectly valid argument. I reviewed your ILA changes and did not
> object to them.
>
>
>> I think there are two questions that this patch set poses for the
>> community wrt XDP:
>>
>> #1: Should we allow alternate code to run in XDP other than BPF?
>> #2: If #1 is true what is the best way to implement that?
>>
>> If the answer to #1 is "no" then the answer to #2 is irrelevant. So
>> with this RFC I'm hoping we can come the agreement on questions #1.
>
> I'm not opposed to running non-BPF code at XDP. I'm against adding
> a linked list of hook consumers.
>
> Would anyone require to run XDP-BPF in combination ILA? Or XDP-BPF
> in combination with a potential XDP-nftables? We don't know yet I
> guess.
>
Right. Admittedly, I feel like we owe a bit of reciprocity to
nftables. For ILA we are using the NF_INET_PRE_ROUTING hook with our
own code (looks like ipvlan set nfhooks as well). This works really
well and saves the value of early demux in ILA. Had we not had the
ability to use nfhooks in this fashion it's likely we would have had
to create another hook (we did try putting translation in nftables
rules but that was too inefficient for ILA).

> Maybe exclusive access to the hook for one consumer as selected by
> the user is good enough.
>
> If that is not good enough: BPF (and potentially nftables in the
> future) could provide means to perform a selection process where a
> helper call can run another XDP prog or return a verdict to trigger
> another XDP prog. Definitely more flexible and faster than a linear
> list doing  if, else if, else if, else if, ...

It seems reasonable that the the output of one program may be an
indication of another program. We've already talked about something
like that in regards to splitting BPF programs into device independent
program and device dependent program.

Tom

^ permalink raw reply

* Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping
From: Willem de Bruijn @ 2016-09-21 15:07 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, Alexander Duyck, Network Development
In-Reply-To: <3a1d23f2-d2c6-8d28-96cc-1b25b61b948c@hpe.com>

On Thu, Aug 25, 2016 at 5:28 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 08/25/2016 02:08 PM, Eric Dumazet wrote:
>>
>> When XPS was submitted, it was _not_ enabled by default and 'magic'
>>
>> Some NIC vendors decided it was a good thing, you should complain to
>> them ;)
>
>
> I kindasorta am with the emails I've been sending to netdev :)  And also
> hopefully precluding others going down that path.

I recently came across another effect of configuring devices at
ndo_open. The behavior of `ip link set dev ${dev} down && ip link set
dev ${dev} up` is no longer consistent across devices. Drivers that
call netif_set_xps_queue overwrite a user supplied xps configuration,
others leave it in place.

This is easily demonstrated by adding this snippet to the loopback driver:

+static int loopback_dev_open(struct net_device *dev)
+{
+       cpumask_t mask;
+
+       cpumask_clear(&mask);
+       cpumask_set_cpu(0, &mask);
+
+       return netif_set_xps_queue(dev, &mask, 0);
+}
+
 static void loopback_dev_free(struct net_device *dev)
 {
        free_percpu(dev->lstats);
@@ -158,6 +168,7 @@ static void loopback_dev_free(struct net_device *dev)

 static const struct net_device_ops loopback_ops = {
        .ndo_init      = loopback_dev_init,
+       .ndo_open       = loopback_dev_open,

and running

fp=/sys/class/net/lo/queues/tx-0/xps_cpus

cat ${fp}
echo 8 > ${fp}
cat ${fp}
ip link set dev lo down
ip link set dev lo up
cat ${fp}

On a vanilla kernel, the output is {0, 8, 8} : the user-supplied xps
setting persists
With the patch, it is {1, 8, 1} : the driver sets, and resets, xps

A third patch that adds netif_set_xps_queue to loopback_dev_init gives
{1, 8, 8}. That is arguably the preferred behavior (sidestepping the
issue of whether device driver initialization is a good thing in
itself): init with a sane default, but do not override user
preference.

The fm10k and i40e drivers makes the call conditional on
test_and_set_bit(__FM10K_TX_XPS_INIT_DONE), so probably already
implement those semantics. In which case other drivers should probably
be updated to do the same. If all drivers are essentially trying to
set the same basic load balancing policy, this could also be lifted
out of drivers into __dev_open for consistency and code deduplication.

I'm pointing this out less for changing this xps feature, than as a
subtle effect to be aware of with any subsequent device driver policy
patches.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox