Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH RFC 0/2] xfrm: Remove ancient sleeping code
From: Fan Du @ 2013-10-11  7:18 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev
In-Reply-To: <20131010085703.GR7660@secunet.com>

hehe, I didn't object this cleanup except learning from it :)

On 2013年10月10日 16:57, Steffen Klassert wrote:
> On Thu, Oct 10, 2013 at 03:02:14PM +0800, Fan Du wrote:
>>
>>
>> On 2013年10月10日 14:33, Steffen Klassert wrote:
>>> Does anyone still rely on the ancient sleeping when the SA is in
>>> acquire state? It is disabled by default since more that five years,
>>> but can cause indefinite task hangs if enabled and the needed state
>>> does not get resolved.
>>
>> I saw that "can_sleep" is set true in ip_route_connect which upper layer
>> protocol relies on it, which ensure not dropping *any* skb.
>
> 'Any' means one per task in this context. Also, we can't ensure that
> this packet reaches it's destination. So where is the difference
> between dropping the packet locally or on the network?
>
>> And acquire timer will make sure the task will not hangs indefinitely.
>>
>
> Did you try that? It makes sure that the task wakes up from time to time,
> but it goes immediately back to sleep if the needed state is not resolved.
> The only terminating contition is when the task gets a signal to exit.
>
>> In xfrm policy queue, XFRM_MAX_QUEUE_LEN is 100, which means 101th skb
>> will be dropped, how about make it configurable?
>
> IMO we would have yet another useless knob then. Currently we send all
> packets by default to a blackhole as long as the state is not resolved
> and most people are fine with it. The queueing is mostly to speed up
> tcp handshakes,

I cannot follow on this part. Would you please mind to explain how making a
policy queue will speed up TCP handshakes than orignal CAN_SLEEP mechanism?


Thanks

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* [PATCHv2 2/2] Don't compute checksum value for SCTP skb with, CHECKSUM_PARTIAL set
From: Fan Du @ 2013-10-11  7:08 UTC (permalink / raw)
  To: Neil Horman, vyasevich, steffen.klassert, davem; +Cc: netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>



On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?
>
> Or am I missing something?
> Neil
>
>


 From cff25810910603ff991f0c56441d6de8dc33a822 Mon Sep 17 00:00:00 2001
From: Fan Du <fan.du@windriver.com>
Date: Fri, 11 Oct 2013 14:31:57 +0800
Subject: [PATCH 2/2] Don't compute checksum value for SCTP skb with
  CHECKSUM_PARTIAL set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

IPsec is not enabled in this scenario:

SCTP skb set CHECKSUM_PARTIAL to indicate hardware is capable of doing
SCTP checksum(crc32-c) scoping the whole SCTP packet range. However when
such kind of skb is delivered through IPv4/v6 output handler, IPv4/v6 stack
interpret CHECKSUM_PARTIAL by calling skb_checksum_help to compute 16bits
checksum value by summing everything up, the whole SCTP packet in software
manner! After this skb reach NIC, after hardware doing its SCTP checking
business, a crc32-c value will overwrite the value IPv4/v6 stack computed
before.

This patch solves this by introducing skb_is_sctpv4/6 to optimize such case.

Signed-off-by: Fan Du <fan.du@windriver.com>
---
v2:
    Rework this problem by introducing skb_is_scktv4/6

---
  include/linux/ip.h     |    5 +++++
  include/linux/ipv6.h   |    6 ++++++
  include/linux/skbuff.h |    1 -
  net/core/skbuff.c      |    1 +
  net/ipv4/ip_output.c   |    4 +++-
  net/ipv6/ip6_output.c  |    1 +
  net/xfrm/xfrm_output.c |   20 +++++++++++++++-----
  7 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 492bc65..f556292 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -19,6 +19,7 @@

  #include <linux/skbuff.h>
  #include <uapi/linux/ip.h>
+#include <uapi/linux/in.h>

  static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
  {
@@ -34,4 +35,8 @@ static inline struct iphdr *ipip_hdr(const struct sk_buff *skb)
  {
  	return (struct iphdr *)skb_transport_header(skb);
  }
+static inline int skb_is_sctpv4(const struct sk_buff *skb)
+{
+	return ip_hdr(skb)->protocol == IPPROTO_SCTP;
+}
  #endif	/* _LINUX_IP_H */
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 28ea384..6e17c04 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -2,6 +2,7 @@
  #define _IPV6_H

  #include <uapi/linux/ipv6.h>
+#include <uapi/linux/in.h>

  #define ipv6_optlen(p)  (((p)->hdrlen+1) << 3)
  /*
@@ -387,4 +388,9 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
  	  ((__sk)->sk_bound_dev_if == (__dif)))				&& \
  	 net_eq(sock_net(__sk), (__net)))

+static inline int skb_is_sctpv6(const struct sk_buff *skb)
+{
+	return ipv6_hdr(skb)->nexthdr == IPPROTO_SCTP;
+}
+
  #endif /* _IPV6_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2ddb48d..b36d0cc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2393,7 +2393,6 @@ extern void	       skb_split(struct sk_buff *skb,
  extern int	       skb_shift(struct sk_buff *tgt, struct sk_buff *skb,
  				 int shiftlen);
  extern void	       skb_scrub_packet(struct sk_buff *skb, bool xnet);
-
  extern struct sk_buff *skb_segment(struct sk_buff *skb,
  				   netdev_features_t features);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d81cff1..54d6172 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3526,3 +3526,4 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
  	nf_reset_trace(skb);
  }
  EXPORT_SYMBOL_GPL(skb_scrub_packet);
+
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a04d872..8676b2c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -587,7 +587,9 @@ slow_path_clean:

  slow_path:
  	/* for offloaded checksums cleanup checksum before fragmentation */
-	if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) &&
+	    !skb_is_sctpv4(skb) &&
+	    skb_checksum_help(skb))
  		goto fail;
  	iph = ip_hdr(skb);

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3a692d5..9b27d95 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -671,6 +671,7 @@ slow_path_clean:

  slow_path:
  	if ((skb->ip_summed == CHECKSUM_PARTIAL) &&
+	    !skb_is_sctpv6(skb) &&
  	    skb_checksum_help(skb))
  		goto fail;

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 3bb2cdc..ddef94a 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -180,6 +180,14 @@ static int xfrm_output_gso(struct sk_buff *skb)
  	return 0;
  }

+static int skb_is_sctp(const struct sk_buff *skb)
+{
+	if (skb->protocol == __constant_htons(ETH_P_IP))
+		return skb_is_sctpv4(skb);
+	else
+		return skb_is_sctpv6(skb);
+}
+
  int xfrm_output(struct sk_buff *skb)
  {
  	struct net *net = dev_net(skb_dst(skb)->dev);
@@ -189,11 +197,13 @@ int xfrm_output(struct sk_buff *skb)
  		return xfrm_output_gso(skb);

  	if (skb->ip_summed == CHECKSUM_PARTIAL) {
-		err = skb_checksum_help(skb);
-		if (err) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
-			kfree_skb(skb);
-			return err;
+		if (!skb_is_sctp(skb)) {
+			err = skb_checksum_help(skb);
+			if (err) {
+				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+				kfree_skb(skb);
+				return err;
+			}
  		}
  	}

-- 
1.7.9.5






-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply related

* [PATCHv2 1/2 ] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:05 UTC (permalink / raw)
  To: Neil Horman, vyasevich, steffen.klassert, davem; +Cc: netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>



On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?
>
> Or am I missing something?
> Neil
>
>


 From 014276de0877f11d46e1704114a7d91f19221a63 Mon Sep 17 00:00:00 2001
From: Fan Du <fan.du@windriver.com>
Date: Fri, 11 Oct 2013 14:24:33 +0800
Subject: [PATCH 1/2] {xfrm, sctp} Stick to software crc32 even if hardware is
  capable of that

igb/ixgbe have hardware sctp checksum support, when this feature is enabled
and also IPsec is armed to protect sctp traffic, ugly things happened as
xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
up and pack the 16bits result in the checksum field). The result is fail
establishment of sctp communication.

Signed-off-by: Fan Du <fan.du@windriver.com>
---
v2:
   Leave ip_summed as CHECKSUM_PARTIAL as before, the second patch will fix this.

---
  net/sctp/output.c |   14 +++++++++++++-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/sctp/output.c b/net/sctp/output.c
index 0ac3a65..6de6402 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -372,6 +372,16 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
  	atomic_inc(&sk->sk_wmem_alloc);
  }

+static int is_xfrm_armed(struct dst_entry *dst)
+{
+#ifdef CONFIG_XFRM
+	/* If dst->xfrm is valid, this skb needs to be transformed */
+	return dst->xfrm != NULL;
+#else
+	return 0;
+#endif
+}
+
  /* All packets are sent to the network through this function from
   * sctp_outq_tail().
   *
@@ -536,7 +546,9 @@ int sctp_packet_transmit(struct sctp_packet *packet)
  	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
  	 */
  	if (!sctp_checksum_disable) {
-		if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) {
+		if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
+			is_xfrm_armed(dst)) {
+
  			__u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);

  			/* 3) Put the resultant value into the checksum field in the
-- 
1.7.9.5




-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply related

* Re: [PATCH net] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:02 UTC (permalink / raw)
  To: Neil Horman, steffen.klassert; +Cc: vyasevich, davem, netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>



On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?

sigh... same as my first thought.
However why should xfrm_output check whether the skb is a SCTP one as well as whether
the associated dev is able to do hw SCTP crc32-c checksum in the first place, and then
call SCTP crc checksum value API to write the correct checksum *after* this skb has
leaven SCTP layer???

The checksum operation in xfrm_ouput fits TCP/UDP/IP layer checksum semantics, e.g.
calculate 16bits value by sum almost everything up. Make a SCTP specific exception
in this common path sound wired, as the fix can be done in SCTP codes easily with
minimum changes, so not any bit of consequence for non-SCTP traffic.

And If you/Vlad insist to do so as well as no objection from Steffen/David, I'm
happy to send this revised version as your suggested.

Anyway, I should have separated this patch for two which makes the issues more clear
for you and Vlad to review.

> Or am I missing something?
> Neil
>
>

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: [PATCH net] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:02 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: nhorman, steffen.klassert, davem, netdev
In-Reply-To: <5256B578.8040101@gmail.com>



On 2013年10月10日 22:11, Vlad Yasevich wrote:
> On 10/10/2013 01:51 AM, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
>> And I saw another point in this part of code, when IPsec is not armed,
>> sctp communication is good, however setting setting CHECKSUM_PARTIAL will
>> make xfrm_output compute dummy checksum values which will be overwritten by
>> hardware lately.
>>
>> So this patch try to solve above two issues together.
>>
>> Signed-off-by: Fan Du <fan.du@windriver.com>
>> ---
>> note:
>> igb/ixgbe hardware is not handy on my side, so just build test only.
>>
>> ---
>> net/sctp/output.c | 27 +++++++++++++++++++--------
>> 1 file changed, 19 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/sctp/output.c b/net/sctp/output.c
>> index 0ac3a65..f0b9cc5 100644
>> --- a/net/sctp/output.c
>> +++ b/net/sctp/output.c
>> @@ -372,6 +372,16 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
>> atomic_inc(&sk->sk_wmem_alloc);
>> }
>>
>> +static int is_xfrm_armed(struct dst_entry *dst)
>> +{
>> +#ifdef CONFIG_XFRM
>> + /* If dst->xfrm is valid, this skb needs to be transformed */
>> + return dst->xfrm != NULL;
>> +#else
>> + return 0;
>> +#endif
>> +}
>> +
>> /* All packets are sent to the network through this function from
>> * sctp_outq_tail().
>> *
>> @@ -536,20 +546,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>> * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
>> */
>> if (!sctp_checksum_disable) {
>> - if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) {
>> + if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
>> + is_xfrm_armed(dst)) {
>> +
>> __u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);
>>
>> /* 3) Put the resultant value into the checksum field in the
>> * common header, and leave the rest of the bits unchanged.
>> */
>> sh->checksum = sctp_end_cksum(crc32);
>> - } else {
>> - /* no need to seed pseudo checksum for SCTP */
>> - nskb->ip_summed = CHECKSUM_PARTIAL;
>> - nskb->csum_start = (skb_transport_header(nskb) -
>> - nskb->head);
>> - nskb->csum_offset = offsetof(struct sctphdr, checksum);
>> - }
>> + } else
>> + /* Mark skb as CHECKSUM_UNNECESSARY to let hardware compute
>> + * the checksum, and also avoid xfrm_output to do unceccessary
>> + * checksum.
>> + */
>> + nskb->ip_summed = CHECKSUM_UNNECESSARY;
>> }
>
> In addition to what Niel said, the use of CHECKSUM_UNNECESSARY is
> incorrect as it will cause the nic to not compute the checksum.
> The checksum offload depends on the use of CHECKSUM_PARTIAL.

My bad, my head is cramed with IPsec recently :( We should fix this in ipv4/v6 output path.

> -vlad
>
>
>>
>> /* IP layer ECN support
>>
>
>

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: pull request: batman-adv 2013-10-09b
From: Antonio Quartulli @ 2013-10-11  6:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <20131009.155658.131342560291162660.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 573 bytes --]

On Wed, Oct 09, 2013 at 03:56:58PM -0400, David Miller wrote:
> From: Antonio Quartulli <antonio@meshcoding.com>
> Date: Wed,  9 Oct 2013 21:32:38 +0200
> 
> > here you have my fixed pull request intended for net-next.
> > 
> > The previous build error was due to an accidental remotion of the beginning of a
> > batadv_dbg() statement during a merge conflict resolution.
> > Sorry for that.
> 
> This looks better, pulled, thanks a lot.

Hello David,

I can't find my patchset in net-next, hasn't it been pushed yet?

Regards,


-- 
Antonio Quartulli

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: David Miller @ 2013-10-11  4:57 UTC (permalink / raw)
  To: herbert; +Cc: vyasevic, netdev, xiyou.wangcong, stephen
In-Reply-To: <20131010234856.GA31951@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 11 Oct 2013 07:48:56 +0800

> On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
>> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
>> 	bridge: only expire the mdb entry when query is received
>> changed the mdb expiration timer to be armed only when QUERY is
>> received.  Howerver, this causes issues in an environment where
>> the multicast server socket comes and goes very fast while a client
>> is trying to send traffic to it.
>> 
>> The root cause is a race where a sequence of LEAVE followed by REPORT
>> messages can race against QUERY messages generated in response to LEAVE.
>> The QUERY ends up starting the expiration timer, and that timer can
>> potentially expire after the new REPORT message has been received signaling
>> the new join operation.  This leads to a significant drop in multicast
>> traffic and possible complete stall. 
>> 
>> The solution is to have REPORT messages update the expiration timer
>> on entries that already exist.
>> 
>> CC: Cong Wang <xiyou.wangcong@gmail.com>
>> CC: Herbert Xu <herbert@gondor.apana.org.au>
>> CC: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
> 
> Good catch.  Thanks!
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, and queued up for -stable, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  4:48 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAEP_g=_meOUPrYc7S8rM4NwT_7qG1JZrHJkOg6CmbjDc-rFjAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 8:56 PM, Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>>>> The combination of two commits
>>>>>>>>>>
>>>>>>>>>> commit 8e4e1713e4
>>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> commit 2537b4dd0a
>>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>>
>>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>>> netdev_unregister notification
>>>>>>>>>>
>>>>>>>>>> The following steps:
>>>>>>>>>>
>>>>>>>>>>   modprobe openvswitch
>>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>>
>>>>>>>>>> are causing multiple warnings:
>>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>>
>>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>>> and let workq do rest of work.
>>>>>>>>
>>>>>>>> isn't it what it's doing?
>>>>>>>
>>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>>> rest of vport destroy can be done in workq.
>>>>>>
>>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>>> that's dangerous.
>>>>> why is it dangerous? ovs already had ref to net-device.
>>>>
>>>> comment from dev.c:
>>>>                 /* Notify protocols, that we are about to destroy
>>>>                    this device. They should clean all the things.
>>>>                 */
>>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>>
>>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>>> the warning,
>>>> but then do dev_set_promisc(-1) in workqueue?
>>>>
>>> promiscuity setting is different issue. If you want to have discussion
>>> then you can post separate patch for same. Lets fix the warning here.
>>>
>>>> well, as a minimum audit of promiscuity will be wrong.
>>>> ndo_change_rx_flags will be called after ndo_uninit,
>>>> all sorts of other cleanups are done.
>>>
>>> change_rx_flags() checks for UP flag for given device.
>>>
>>>> I cannot track all possible scenarios, but it seems much safer to
>>>> cleanup everything possible
>>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>>
>>>> May be all these risks are worth taking, then please explain what is
>>>> the problem with the proposed patch?
>>>>
>>> Problem is that this is causing layering issues in OVS. dp_notify is
>>> suppose to work at dp layer. your patch directly calls vport-netdev
>>> implementation function from dp_notify.
>>> I could not think of a simple approach that will do this in completely
>>> clean manner. Therefore I am trying to minimize layering issues. So
>>> just calling netdev_upper_dev_unlink() looks better than doing
>>> anything extra.
>>
>> dp_notify is per net, not per dp.
>> notifier can only be called for net_device and the first thing it does:
>>         if (!ovs_is_internal_dev(dev))
>>                 vport = ovs_netdev_get_vport(dev);
>> where ovs_netdev_get_vport() is defined in vport-netdev.c
>>
>> once it gets into workq, it checks for:
>>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>>              continue;
>> and
>>         netdev_vport = netdev_vport_priv(vport);
>> where netdev_vport_priv() is defined in netdev-vport.h
>>
>> only then it proceeds with generic ovs_dp_detach_port().
>>
>> Is that the layering you talking about?
>>
>> So to minimize layering issues you want to call 'upper_dev_link' from
>> netdev_create() in vport-netdev.c
>> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>>
>> That will look better than calling ovs_netdev_unlink_dev() ?
>
> At a minimum, this function name is misleading because it does a bunch
> of other things beyond unlink the device.

sure. Please propose a different name.

> At this point, it basically seems like six of one, half dozen of the
> other. I don't think that either solution is ideal but it doesn't
> really matter all that much.
>
> However, the check dev->reg_state in netdev_destroy() looks racy to
> me, as it could already be in NETREG_UNREGISTERED even if we already
> processed this device.

you mean that netdev_destroy() will see reg_state == netreg_unregistered,
while dp_device_event() didn't see reg_state == netreg_unregistering yet?
or dp_device_event() saw it, proceeded to do unlink and
netdev_destroy() ran in parallel?
well, that's why reg_state == netreg_unregistering check in netdev_destroy()
is done with rtnl_lock() held.
reg_state cannot go into netreg_unregistered state skipping
netreg_unregistering and notifier.
therefore I don't think it's racy.

In ovs_dp_notify_wq() you're checking for both unregistering and
unregistered and that makes
sense, since workq can run after unregistering notifier called and
netdev_run_todo()
already changed the state to unregistered.
But here it's not the case.

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  4:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381463075.4971.99.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
>> I do have control over the net device. Is there something there I can
>> set, or is tx-nocache-copy also a new feature? I'll start digging.
>>
>
> nocache-copy was added in 3.0, but I do find its not a gain for current
> cpus.
>
> You could get a fresh copy of ethtool sources :
>
> git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> cd ethtool
> ./autogen.sh  ...

That did the trick. Thanks for the help! Is there somewhere I can read
up on this feature? A lot of the netdev features are opaque to me.
Also, I can set NETIF_F_NOCACHE_COPY in the netdev->features to set
this by default, yes?

This at least mirrors the performance improvement that I see when
forwarding, however I still see reserved CPU time. Is there a way to
push it even farther?

-Kyle

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Jesse Gross @ 2013-10-11  3:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Pravin Shelar, David S. Miller, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CAMEtUuyEw81g8+68HUMpR1RERAA9rjL=WHRE58se83cdur67eg@mail.gmail.com>

On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>>> The combination of two commits
>>>>>>>>>
>>>>>>>>> commit 8e4e1713e4
>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> commit 2537b4dd0a
>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>
>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>> netdev_unregister notification
>>>>>>>>>
>>>>>>>>> The following steps:
>>>>>>>>>
>>>>>>>>>   modprobe openvswitch
>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>
>>>>>>>>> are causing multiple warnings:
>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>
>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>> and let workq do rest of work.
>>>>>>>
>>>>>>> isn't it what it's doing?
>>>>>>
>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>> rest of vport destroy can be done in workq.
>>>>>
>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>> that's dangerous.
>>>> why is it dangerous? ovs already had ref to net-device.
>>>
>>> comment from dev.c:
>>>                 /* Notify protocols, that we are about to destroy
>>>                    this device. They should clean all the things.
>>>                 */
>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>
>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>> the warning,
>>> but then do dev_set_promisc(-1) in workqueue?
>>>
>> promiscuity setting is different issue. If you want to have discussion
>> then you can post separate patch for same. Lets fix the warning here.
>>
>>> well, as a minimum audit of promiscuity will be wrong.
>>> ndo_change_rx_flags will be called after ndo_uninit,
>>> all sorts of other cleanups are done.
>>
>> change_rx_flags() checks for UP flag for given device.
>>
>>> I cannot track all possible scenarios, but it seems much safer to
>>> cleanup everything possible
>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>
>>> May be all these risks are worth taking, then please explain what is
>>> the problem with the proposed patch?
>>>
>> Problem is that this is causing layering issues in OVS. dp_notify is
>> suppose to work at dp layer. your patch directly calls vport-netdev
>> implementation function from dp_notify.
>> I could not think of a simple approach that will do this in completely
>> clean manner. Therefore I am trying to minimize layering issues. So
>> just calling netdev_upper_dev_unlink() looks better than doing
>> anything extra.
>
> dp_notify is per net, not per dp.
> notifier can only be called for net_device and the first thing it does:
>         if (!ovs_is_internal_dev(dev))
>                 vport = ovs_netdev_get_vport(dev);
> where ovs_netdev_get_vport() is defined in vport-netdev.c
>
> once it gets into workq, it checks for:
>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>              continue;
> and
>         netdev_vport = netdev_vport_priv(vport);
> where netdev_vport_priv() is defined in netdev-vport.h
>
> only then it proceeds with generic ovs_dp_detach_port().
>
> Is that the layering you talking about?
>
> So to minimize layering issues you want to call 'upper_dev_link' from
> netdev_create() in vport-netdev.c
> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>
> That will look better than calling ovs_netdev_unlink_dev() ?

At a minimum, this function name is misleading because it does a bunch
of other things beyond unlink the device.

At this point, it basically seems like six of one, half dozen of the
other. I don't think that either solution is ideal but it doesn't
really matter all that much.

However, the check dev->reg_state in netdev_destroy() looks racy to
me, as it could already be in NETREG_UNREGISTERED even if we already
processed this device.

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:44 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U0tpKhknMLd7kN4uQmq-E81=EbH1Q816nX_eig=tGreAQ@mail.gmail.com>

On Thu, 2013-10-10 at 23:33 -0400, Kyle Hubert wrote:

> Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
> commit. In fact, the code looks quite different. There must have been
> a lot of changes?
> 

Probably ;)

> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
> I do have control over the net device. Is there something there I can
> set, or is tx-nocache-copy also a new feature? I'll start digging.
> 

nocache-copy was added in 3.0, but I do find its not a gain for current
cpus.

You could get a fresh copy of ethtool sources :

git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
cd ethtool
./autogen.sh  ...

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  3:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381461664.4971.97.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:21 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
>> I'm working on a device, I consistently get 27gbps via netperf-2.6.
>> UDP reports 54gbps.
>>
>> TCP is maxed out at 100% CPU on the transmit side. On the receive
>> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
>> performance out of it.
>>
>> However, very oddly, if I enabled bridged mode to forward some
>> packets, TCP performance goes up to 32gbps. The thing that bothers me
>> is that transmit CPU utilization drops to 65%, and receive CPU
>> utilization increases to 60%.
>>
>> What happens when the device becomes bridged to gain so much
>> performance? Also, can I now take advantage of the extra CPU time
>> available to drive more traffic? No tunable seems to have any effect..
>> (except down)
>
>
> You do not give what version of linux you use, but my guess is that
> using latest trees should help, because of
> http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10
>
> Also try to disable tx-nocache-copy
>
> ethtool -K eth0 tx-nocache-copy off

Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
commit. In fact, the code looks quite different. There must have been
a lot of changes?

Also, my copy of ethtool does not recognize tx-nocache-copy. However,
I do have control over the net device. Is there something there I can
set, or is tx-nocache-copy also a new feature? I'll start digging.

Thanks for your response.
-Kyle

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:21 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U2eVUd4JiGQ6UFYMdoEwyN=MO8+9S3jGcsaYGN2vqt=3g@mail.gmail.com>

On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
> I'm working on a device, I consistently get 27gbps via netperf-2.6.
> UDP reports 54gbps.
> 
> TCP is maxed out at 100% CPU on the transmit side. On the receive
> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
> performance out of it.
> 
> However, very oddly, if I enabled bridged mode to forward some
> packets, TCP performance goes up to 32gbps. The thing that bothers me
> is that transmit CPU utilization drops to 65%, and receive CPU
> utilization increases to 60%.
> 
> What happens when the device becomes bridged to gain so much
> performance? Also, can I now take advantage of the extra CPU time
> available to drive more traffic? No tunable seems to have any effect..
> (except down)


You do not give what version of linux you use, but my guess is that
using latest trees should help, because of
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10

Also try to disable tx-nocache-copy

ethtool -K eth0 tx-nocache-copy off

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-11  2:35 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

This is what I was thinking would be better.

Don't want these packets leaking into PRE_ROUTEING chain or have
any chance to get flooded out other ports.

Compile tested only...

I could use another goto instead but that becomes spaghetti and
never like to jump back into a block.


--- a/net/bridge/br_input.c	2013-10-06 14:48:24.946450042 -0700
+++ b/net/bridge/br_input.c	2013-10-10 19:32:14.227926344 -0700
@@ -152,6 +152,16 @@ static int br_handle_local_finish(struct
 	return 0;	 /* process further */
 }
 
+/* Deliver packet to local host only */
+static rx_handler_result_t br_local_only(struct sk_buff *skb)
+{
+	if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
+		    NULL, br_handle_local_finish))
+		return RX_HANDLER_CONSUMED; /* consumed by filter */
+	else
+		return RX_HANDLER_PASS;	/* continue processing */
+}
+
 /*
  * Return NULL if skb is handled
  * note: already called with rcu_read_lock
@@ -206,18 +216,20 @@ rx_handler_result_t br_handle_frame(stru
 				goto forward;
 		}
 
-		/* Deliver packet to local host only */
-		if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
-			    NULL, br_handle_local_finish)) {
-			return RX_HANDLER_CONSUMED; /* consumed by filter */
-		} else {
-			*pskb = skb;
-			return RX_HANDLER_PASS;	/* continue processing */
-		}
+		*pskb = skb;
+		return br_local_only(skb);
 	}
 
 forward:
 	switch (p->state) {
+	case BR_STATE_DISABLED:
+		if (!ether_addr_equal(p->br->dev->dev_addr, dest))
+			goto drop;
+
+		skb->pkt_type = PACKET_HOST;
+		*pskb = skb;
+		return br_local_only(skb);
+
 	case BR_STATE_FORWARDING:
 		rhook = rcu_dereference(br_should_route_hook);
 		if (rhook) {

^ permalink raw reply

* Peak TCP performance
From: Kyle Hubert @ 2013-10-11  2:34 UTC (permalink / raw)
  To: netdev

I'm working on a device, I consistently get 27gbps via netperf-2.6.
UDP reports 54gbps.

TCP is maxed out at 100% CPU on the transmit side. On the receive
side, 40% of the CPU. Thus, I didn't believe I could eek anymore
performance out of it.

However, very oddly, if I enabled bridged mode to forward some
packets, TCP performance goes up to 32gbps. The thing that bothers me
is that transmit CPU utilization drops to 65%, and receive CPU
utilization increases to 60%.

What happens when the device becomes bridged to gain so much
performance? Also, can I now take advantage of the extra CPU time
available to drive more traffic? No tunable seems to have any effect..
(except down)

Any ideas?

Cheers,
-Kyle

^ permalink raw reply

* Re: [PATCH] usbnet: smsc95xx: Add device tree input for MAC address
From: Ming Lei @ 2013-10-11  1:51 UTC (permalink / raw)
  To: Dan Murphy
  Cc: Steve Glendinning, Network Development, Linux Kernel Mailing List,
	linux-usb, mugunthanvnm
In-Reply-To: <5256A1DE.8010709@ti.com>

On Thu, Oct 10, 2013 at 8:47 PM, Dan Murphy <dmurphy@ti.com> wrote:
>
> Is there a board that has 2 built in smsc devices?

I don't know, maybe there isn't, but driver should be generic enough, and
as I said it is a generic problem, and people are discussing it, so suggest
to read previous discussions first before post your V2.

Below link is one I remembered, but I think you can find more:

https://lkml.org/lkml/2013/8/11/100

> This is all dependent on what the dev.id is of udev.

I am not sure why it depends on udev, and udev can't
see the device util it is added to system.

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  0:47 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+rgasL+XShQN20b9PkFNGq38+UoXX4uLogrZ2igAwfDew@mail.gmail.com>

On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>> The combination of two commits
>>>>>>>>
>>>>>>>> commit 8e4e1713e4
>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> commit 2537b4dd0a
>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>
>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>> netdev_unregister notification
>>>>>>>>
>>>>>>>> The following steps:
>>>>>>>>
>>>>>>>>   modprobe openvswitch
>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>
>>>>>>>> are causing multiple warnings:
>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>> index c323567..e9380bd 100644
>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>
>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>> +
>>>>>>>
>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>> and let workq do rest of work.
>>>>>>
>>>>>> isn't it what it's doing?
>>>>>
>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>> rest of vport destroy can be done in workq.
>>>>
>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>> that's dangerous.
>>> why is it dangerous? ovs already had ref to net-device.
>>
>> comment from dev.c:
>>                 /* Notify protocols, that we are about to destroy
>>                    this device. They should clean all the things.
>>                 */
>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>
>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>> the warning,
>> but then do dev_set_promisc(-1) in workqueue?
>>
> promiscuity setting is different issue. If you want to have discussion
> then you can post separate patch for same. Lets fix the warning here.
>
>> well, as a minimum audit of promiscuity will be wrong.
>> ndo_change_rx_flags will be called after ndo_uninit,
>> all sorts of other cleanups are done.
>
> change_rx_flags() checks for UP flag for given device.
>
>> I cannot track all possible scenarios, but it seems much safer to
>> cleanup everything possible
>> as soon as ovs received NETDEV_UNREGISTER event.
>>
>> May be all these risks are worth taking, then please explain what is
>> the problem with the proposed patch?
>>
> Problem is that this is causing layering issues in OVS. dp_notify is
> suppose to work at dp layer. your patch directly calls vport-netdev
> implementation function from dp_notify.
> I could not think of a simple approach that will do this in completely
> clean manner. Therefore I am trying to minimize layering issues. So
> just calling netdev_upper_dev_unlink() looks better than doing
> anything extra.

dp_notify is per net, not per dp.
notifier can only be called for net_device and the first thing it does:
        if (!ovs_is_internal_dev(dev))
                vport = ovs_netdev_get_vport(dev);
where ovs_netdev_get_vport() is defined in vport-netdev.c

once it gets into workq, it checks for:
       if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
             continue;
and
        netdev_vport = netdev_vport_priv(vport);
where netdev_vport_priv() is defined in netdev-vport.h

only then it proceeds with generic ovs_dp_detach_port().

Is that the layering you talking about?

So to minimize layering issues you want to call 'upper_dev_link' from
netdev_create() in vport-netdev.c
and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?

That will look better than calling ovs_netdev_unlink_dev() ?

Having both register+link and unregister+unlink in the same
vport-netdev.c, is not an advantage?

I'm still missing something here.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Josh Triplett @ 2013-10-11  0:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells, edumazet,
	darren, fweisbec, sbw, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <20131010002833.GJ5790@linux.vnet.ibm.com>

On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > 
> > > that.  Constructs like list_del_rcu are much clearer, and not
> > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > Idea.
> > 
> > OK, so you think there is synchronization code.
> > 
> > I will shut up then, no need to waste time.
> 
> As you said earlier, we should at least get rid of the memory barrier
> as long as we are changing the code.
> 
> Josh, what would you suggest as the best way to avoid the memory barrier,
> keep sparse happy, and not be too ugly?

The more I think about it, the more I realize that assigning an __rcu
pointer to an __rcu pointer *without* a memory barrier is a sufficiently
uncommon case that you probably *should* just write an open-coded
assignment.  Just please put a very clear comment right before it.

I'd originally thought it might make sense to have a macro similar to
rcu_assign_pointer, but I just don't think this is a common enough case,
and we don't want people thinking they can use this in general for __rcu
to __rcu assignments (most of which still need a memory barrier).

- Josh Triplett

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: Herbert Xu @ 2013-10-10 23:48 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Cong Wang, Stephen Hemminger
In-Reply-To: <1381435079-15387-1-git-send-email-vyasevic@redhat.com>

On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
> 	bridge: only expire the mdb entry when query is received
> changed the mdb expiration timer to be armed only when QUERY is
> received.  Howerver, this causes issues in an environment where
> the multicast server socket comes and goes very fast while a client
> is trying to send traffic to it.
> 
> The root cause is a race where a sequence of LEAVE followed by REPORT
> messages can race against QUERY messages generated in response to LEAVE.
> The QUERY ends up starting the expiration timer, and that timer can
> potentially expire after the new REPORT message has been received signaling
> the new join operation.  This leads to a significant drop in multicast
> traffic and possible complete stall. 
> 
> The solution is to have REPORT messages update the expiration timer
> on entries that already exist.
> 
> CC: Cong Wang <xiyou.wangcong@gmail.com>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>

Good catch.  Thanks!

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Mark Lord @ 2013-10-10 23:17 UTC (permalink / raw)
  To: Alexander Gordeev, H. Peter Anvin
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <20131010180704.GA15719@dhcp-26-207.brq.redhat.com>

Just to help us all understand "the loop" issue..

Here's an example of driver code which uses the existing MSI-X interfaces,
for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
This is from a new driver I'm working on right now:


static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
{
        xx_disable_all_irqs(dev);
        do {
                if (nvec < 2)
                        xx_prep_for_1_msix_vector(dev);
                else if (nvec < 4)
                        xx_prep_for_2_msix_vectors(dev);
                else if (nvec < 8)
                        xx_prep_for_4_msix_vectors(dev);
                else if (nvec < 16)
                        xx_prep_for_8_msix_vectors(dev);
                else
                        xx_prep_for_16_msix_vectors(dev);
                nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors);
        } while (nvec > 0);

        if (nvec) {
                kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
                dev->num_vectors = 0;
                return nvec;
        }
        return 0;       /* success */
}

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Pravin Shelar @ 2013-10-10 22:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAMEtUuxH24G6pQtSFznrK7w5mdP3za4uQfKKEEaEngW4SJajbA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>> The combination of two commits
>>>>>>>
>>>>>>> commit 8e4e1713e4
>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> commit 2537b4dd0a
>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>
>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>> netdev_unregister notification
>>>>>>>
>>>>>>> The following steps:
>>>>>>>
>>>>>>>   modprobe openvswitch
>>>>>>>   ovs-dpctl add-dp test
>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>
>>>>>>> are causing multiple warnings:
>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>> index c323567..e9380bd 100644
>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>                 return NOTIFY_DONE;
>>>>>>>
>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>> +
>>>>>>
>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>> and let workq do rest of work.
>>>>>
>>>>> isn't it what it's doing?
>>>>
>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>> rest of vport destroy can be done in workq.
>>>
>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>> that's dangerous.
>> why is it dangerous? ovs already had ref to net-device.
>
> comment from dev.c:
>                 /* Notify protocols, that we are about to destroy
>                    this device. They should clean all the things.
>                 */
>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>
> so here you're suggesting to just netdev_upper_dev_unlink() to silence
> the warning,
> but then do dev_set_promisc(-1) in workqueue?
>
promiscuity setting is different issue. If you want to have discussion
then you can post separate patch for same. Lets fix the warning here.

> well, as a minimum audit of promiscuity will be wrong.
> ndo_change_rx_flags will be called after ndo_uninit,
> all sorts of other cleanups are done.

change_rx_flags() checks for UP flag for given device.

> I cannot track all possible scenarios, but it seems much safer to
> cleanup everything possible
> as soon as ovs received NETDEV_UNREGISTER event.
>
> May be all these risks are worth taking, then please explain what is
> the problem with the proposed patch?
>
Problem is that this is causing layering issues in OVS. dp_notify is
suppose to work at dp layer. your patch directly calls vport-netdev
implementation function from dp_notify.
I could not think of a simple approach that will do this in completely
clean manner. Therefore I am trying to minimize layering issues. So
just calling netdev_upper_dev_unlink() looks better than doing
anything extra.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 22:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010145255.16cf7c09@nehalam.linuxnetplumber.net>

On 2013-10-10 11:52 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 22:56:33 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
>> > On Thu, 10 Oct 2013 14:52:50 +0200
>> > Felix Fietkau <nbd@openwrt.org> wrote:
>> > 
>> >> When an ethernet device is enslaved to a bridge, and the bridge STP
>> >> detects loss of carrier (or operational state down), then normally
>> >> packet receiption is blocked.
>> >> 
>> >> This breaks control applications like WPA which maybe expecting to
>> >> receive packets to negotiate to bring link up. The bridge needs to
>> >> block forwarding packets from these disabled ports, but there is no
>> >> hard requirement to not allow local packet delivery.
>> >> 
>> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>> > 
>> > No. This will cause duplicate packets to be delivered.
>> How? I haven't observed any duplications in my tests with this patch.
> 
> The purpose of DISABLED state is to break loops in the bridge tree.
> If packet is flooded by another bridge (Broadcast Unknown or Multicast)
> then it will go down both paths.
Ah, right.

>> > If doing a link layer protocol like WPA then it should be done directly
>> > on the underlying device, not the bridge itself.
>> When the ETH_P_PAE protocol is set for the packet socket inside
>> wpa_supplicant, the bridge steals all packets before the protocol
>> handler gets them.
>> In __netif_receive_skb_core, only ptype_all gets processed before the rx
>> handler, not ptype_base.
> 
> Thought it was using direct type all. Or at least the link local multicast
> address.
> 
> Can you revise it to only accept packets directed to link local multicast
> address or local address, and go through the local_finish handler.
The destination address in WPA EAPOL packets from the AP is set to the
MAC address of the client, which may not be the same as the one from the
bridge.

- Felix

^ permalink raw reply

* Re: [PATCH nf-next] netfilter: xtables: lightweight process control group matching
From: Daniel Borkmann @ 2013-10-10 21:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: pablo-Cap9r6Oaw4JrovVCs/uTlw,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Borkmann
In-Reply-To: <20131009170409.GH22495-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

Hi Tejun,

On 10/09/2013 07:04 PM, Tejun Heo wrote:
> On Tue, Oct 08, 2013 at 10:05:02AM +0200, Daniel Borkmann wrote:
>> Could you elaborate on "Wouldn't it be more logical to implement netfilter
>> rule to match the target cgroup paths?". I don't think (or hope) you mean
>> some string comparison on the dentry path here? :) With our proposal, we
>> have in the network stack's critical path only the following code that is
>> being executed here to match the cgroup ...
>
> Comparing path each time obviously doesn't make sense but you can
> determine the cgroup on config and hold onto the pointer while the
> rule exists.
>
>> ... where ``info->id == skb->sk->sk_cgrp_fwid'' is the actual work, so very
>> lightweight, which is good for high loads (1Gbit/s, 10Gbit/s and beyond), of
>> course. Also, it would be intuitive for admins familiar with other subsystems
>> to just set up and use these cgroup ids in iptabels. I'm not yet quite sure
>> how your suggestion would look like, so you would need to setup some "dummy"
>> subgroups first just to have a path that you can match on?
>
> Currently, it's tricky because we have multiple hierarchies to
> consider and there isn't an efficient way to map from task to cgroup
> on a specific hierarchy.  I'm not sure whether we should add another
> mapping table in css_set or just allow using path matching on the
> unified hierarchy.  The latter should be cleaner and easier but more
> restrictive.
>
> Anyways, it isn't manageable in the long term to keep adding
> controllers simply to tag tasks differently.  If we want to do this,
> let's please work on a way to match a task's cgroup affiliation
> efficiently.

Here's a draft (!) of an alternative w/o using a new cgroup subsystem. I've
tested it and it would basically work this way as well. I've used serial_nr
as an identifier of cgroups here, as we'd actually want the xt_cgroup_info
structure as small as possible for rule sets (since they can be large and
are flat-copied to kernel). Logic in cgroup_mt() needs to change a bit as
we cannot hold css_set_lock here. Anyway, iptables would match here against
cgroup.serial (that can probably also be widely used otherwise). The way we
do it here is to cache the corresponding task in socket structure, and walk
all cgroups belonging to that task, comparing if serial_nr's match.

Still, I think my original patch is more clean, user friendly and intuitive,
and has a better performance (main work is one comparison instead of walking
all corresponding cgroups), so I'd still consider this the better tradeoff
to go with, I think netfilter is a large enough candidate for a subsys. ;)

Thanks,
Daniel

Draft patch:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3561d30..3c5e953 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -399,6 +399,26 @@ struct cgroup_map_cb {
  };

  /*
+ * A cgroup can be associated with multiple css_sets as different tasks may
+ * belong to different cgroups on different hierarchies.  In the other
+ * direction, a css_set is naturally associated with multiple cgroups.
+ * This M:N relationship is represented by the following link structure
+ * which exists for each association and allows traversing the associations
+ * from both sides.
+ */
+struct cgrp_cset_link {
+	/* the cgroup and css_set this link associates */
+	struct cgroup		*cgrp;
+	struct css_set		*cset;
+
+	/* list of cgrp_cset_links anchored at cgrp->cset_links */
+	struct list_head	cset_link;
+
+	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
+	struct list_head	cgrp_link;
+};
+
+/*
   * struct cftype: handler definitions for cgroup control files
   *
   * When reading/writing to a file:
diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..b035ba3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -406,6 +406,9 @@ struct sock {
  	__u32			sk_mark;
  	u32			sk_classid;
  	struct cg_proto		*sk_cgrp;
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	struct task_struct	*sk_task_cached;
+#endif
  	void			(*sk_state_change)(struct sock *sk);
  	void			(*sk_data_ready)(struct sock *sk, int bytes);
  	void			(*sk_write_space)(struct sock *sk);
@@ -2098,6 +2101,22 @@ static inline gfp_t gfp_any(void)
  	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
  }

+static inline struct task_struct *sock_task(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	return sk->sk_task_cached;
+#else
+	return NULL;
+#endif
+}
+
+static inline void sock_task_set(struct sock *sk, struct task_struct *task)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	sk->sk_task_cached = task;
+#endif
+}
+
  static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
  {
  	return noblock ? 0 : sk->sk_rcvtimeo;
diff --git a/include/uapi/linux/netfilter/Kbuild b/include/uapi/linux/netfilter/Kbuild
index 1749154..94a4890 100644
--- a/include/uapi/linux/netfilter/Kbuild
+++ b/include/uapi/linux/netfilter/Kbuild
@@ -37,6 +37,7 @@ header-y += xt_TEE.h
  header-y += xt_TPROXY.h
  header-y += xt_addrtype.h
  header-y += xt_bpf.h
+header-y += xt_cgroup.h
  header-y += xt_cluster.h
  header-y += xt_comment.h
  header-y += xt_connbytes.h
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
new file mode 100644
index 0000000..c59ff53
--- /dev/null
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -0,0 +1,11 @@
+#ifndef _UAPI_XT_CGROUP_H
+#define _UAPI_XT_CGROUP_H
+
+#include <linux/types.h>
+
+struct xt_cgroup_info {
+	__u64 serial_nr;
+	__u32 invert;
+};
+
+#endif /* _UAPI_XT_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2418b6e..1f9dc5b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -357,26 +357,6 @@ static void cgroup_release_agent(struct work_struct *work);
  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
  static void check_for_release(struct cgroup *cgrp);

-/*
- * A cgroup can be associated with multiple css_sets as different tasks may
- * belong to different cgroups on different hierarchies.  In the other
- * direction, a css_set is naturally associated with multiple cgroups.
- * This M:N relationship is represented by the following link structure
- * which exists for each association and allows traversing the associations
- * from both sides.
- */
-struct cgrp_cset_link {
-	/* the cgroup and css_set this link associates */
-	struct cgroup		*cgrp;
-	struct css_set		*cset;
-
-	/* list of cgrp_cset_links anchored at cgrp->cset_links */
-	struct list_head	cset_link;
-
-	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
-	struct list_head	cgrp_link;
-};
-
  /* The default css_set - used by init and its children prior to any
   * hierarchies being mounted. It contains a pointer to the root state
   * for each subsystem. Also used to anchor the list of css_sets. Not
@@ -4163,6 +4143,12 @@ static int cgroup_clone_children_write(struct cgroup_subsys_state *css,
  	return 0;
  }

+static u64 cgroup_serial_read(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	return css->cgroup->serial_nr;
+}
+
  static struct cftype cgroup_base_files[] = {
  	{
  		.name = "cgroup.procs",
@@ -4187,6 +4173,11 @@ static struct cftype cgroup_base_files[] = {
  		.flags = CFTYPE_ONLY_ON_ROOT,
  		.read_seq_string = cgroup_sane_behavior_show,
  	},
+	{
+		.name = "cgroup.serial",
+		.read_u64 = cgroup_serial_read,
+		.mode = S_IRUGO,
+	},

  	/*
  	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
diff --git a/net/core/scm.c b/net/core/scm.c
index b442e7e..9a40ab0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -292,6 +292,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
  		if (sock) {
  			sock_update_netprioidx(sock->sk);
  			sock_update_classid(sock->sk);
+			sock_task_set(sock->sk, current);
  		}
  		fd_install(new_fd, get_file(fp[i]));
  	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 2bd9b3f..ab53afc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1359,6 +1359,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
  		sk->sk_prot = sk->sk_prot_creator = prot;
  		sock_lock_init(sk);
  		sock_net_set(sk, get_net(net));
+		sock_task_set(sk, current);
  		atomic_set(&sk->sk_wmem_alloc, 1);

  		sock_update_classid(sk);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 6e839b6..d276ff4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -806,6 +806,14 @@ config NETFILTER_XT_MATCH_BPF

  	  To compile it as a module, choose M here.  If unsure, say N.

+config NETFILTER_XT_MATCH_CGROUP
+	tristate '"control group" match support'
+	depends on NETFILTER_ADVANCED
+	depends on CGROUPS
+	---help---
+	Socket/process control group matching allows you to match locally
+	generated packets based on which control group processes belong to.
+
  config NETFILTER_XT_MATCH_CLUSTER
  	tristate '"cluster" match support'
  	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index c3a0a12..12f014f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_MULTIPORT) += xt_multiport.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_NFACCT) += xt_nfacct.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OSF) += xt_osf.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP) += xt_cgroup.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
new file mode 100644
index 0000000..f04cba8
--- /dev/null
+++ b/net/netfilter/xt_cgroup.c
@@ -0,0 +1,79 @@
+#include <linux/skbuff.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/cgroup.h>
+#include <linux/fdtable.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_cgroup.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>");
+MODULE_DESCRIPTION("Xtables: process control group matching");
+MODULE_ALIAS("ipt_cgroup");
+MODULE_ALIAS("ip6t_cgroup");
+
+static int cgroup_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_cgroup_info *info = par->matchinfo;
+
+	return (info->invert & ~1) ? -EINVAL : 0;
+}
+
+static bool
+cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_cgroup_info *info = par->matchinfo;
+	struct cgrp_cset_link *link, *link_tmp;
+	const struct sock *sk = skb->sk;
+	struct task_struct *task;
+	struct css_set *cset;
+	bool found = false;
+
+	if (sk == NULL)
+		return false;
+
+	task = sock_task(sk);
+	if (task == NULL)
+		return false;
+
+	rcu_read_lock();
+	/* XXX: read_lock(&css_set_lock); */
+	cset = task_css_set(task);
+	list_for_each_entry_safe(link, link_tmp, &cset->cgrp_links, cgrp_link) {
+		struct cgroup *cgrp = link->cgrp;
+		if (cgrp->serial_nr == info->serial_nr) {
+			found = true;
+			break;
+		}
+	}
+	/* XXX: read_unlock(&css_set_lock); */
+	rcu_read_unlock();
+
+	return found ^ info->invert;
+}
+
+static struct xt_match cgroup_mt_reg __read_mostly = {
+	.name       = "cgroup",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.checkentry = cgroup_mt_check,
+	.match      = cgroup_mt,
+	.matchsize  = sizeof(struct xt_cgroup_info),
+	.me         = THIS_MODULE,
+	.hooks      = (1 << NF_INET_LOCAL_OUT) |
+	              (1 << NF_INET_POST_ROUTING),
+};
+
+static int __init cgroup_mt_init(void)
+{
+	return xt_register_match(&cgroup_mt_reg);
+}
+
+static void __exit cgroup_mt_exit(void)
+{
+	xt_unregister_match(&cgroup_mt_reg);
+}
+
+module_init(cgroup_mt_init);
+module_exit(cgroup_mt_exit);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 21:52 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

On Thu, 10 Oct 2013 22:56:33 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> > On Thu, 10 Oct 2013 14:52:50 +0200
> > Felix Fietkau <nbd@openwrt.org> wrote:
> > 
> >> When an ethernet device is enslaved to a bridge, and the bridge STP
> >> detects loss of carrier (or operational state down), then normally
> >> packet receiption is blocked.
> >> 
> >> This breaks control applications like WPA which maybe expecting to
> >> receive packets to negotiate to bring link up. The bridge needs to
> >> block forwarding packets from these disabled ports, but there is no
> >> hard requirement to not allow local packet delivery.
> >> 
> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> > 
> > No. This will cause duplicate packets to be delivered.
> How? I haven't observed any duplications in my tests with this patch.

The purpose of DISABLED state is to break loops in the bridge tree.
If packet is flooded by another bridge (Broadcast Unknown or Multicast)
then it will go down both paths.

> 
> > If doing a link layer protocol like WPA then it should be done directly
> > on the underlying device, not the bridge itself.
> When the ETH_P_PAE protocol is set for the packet socket inside
> wpa_supplicant, the bridge steals all packets before the protocol
> handler gets them.
> In __netif_receive_skb_core, only ptype_all gets processed before the rx
> handler, not ptype_base.

Thought it was using direct type all. Or at least the link local multicast
address.

Can you revise it to only accept packets directed to link local multicast
address or local address, and go through the local_finish handler.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 20:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010133646.1bdd42c1@nehalam.linuxnetplumber.net>

On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 14:52:50 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> When an ethernet device is enslaved to a bridge, and the bridge STP
>> detects loss of carrier (or operational state down), then normally
>> packet receiption is blocked.
>> 
>> This breaks control applications like WPA which maybe expecting to
>> receive packets to negotiate to bring link up. The bridge needs to
>> block forwarding packets from these disabled ports, but there is no
>> hard requirement to not allow local packet delivery.
>> 
>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> 
> No. This will cause duplicate packets to be delivered.
How? I haven't observed any duplications in my tests with this patch.

> If doing a link layer protocol like WPA then it should be done directly
> on the underlying device, not the bridge itself.
When the ETH_P_PAE protocol is set for the packet socket inside
wpa_supplicant, the bridge steals all packets before the protocol
handler gets them.
In __netif_receive_skb_core, only ptype_all gets processed before the rx
handler, not ptype_base.


- Felix

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox