Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: pull request: batman-adv 2013-10-09b
From: Antonio Quartulli @ 2013-10-11  6:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <20131009.155658.131342560291162660.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 573 bytes --]

On Wed, Oct 09, 2013 at 03:56:58PM -0400, David Miller wrote:
> From: Antonio Quartulli <antonio@meshcoding.com>
> Date: Wed,  9 Oct 2013 21:32:38 +0200
> 
> > here you have my fixed pull request intended for net-next.
> > 
> > The previous build error was due to an accidental remotion of the beginning of a
> > batadv_dbg() statement during a merge conflict resolution.
> > Sorry for that.
> 
> This looks better, pulled, thanks a lot.

Hello David,

I can't find my patchset in net-next, hasn't it been pushed yet?

Regards,


-- 
Antonio Quartulli

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: David Miller @ 2013-10-11  4:57 UTC (permalink / raw)
  To: herbert; +Cc: vyasevic, netdev, xiyou.wangcong, stephen
In-Reply-To: <20131010234856.GA31951@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 11 Oct 2013 07:48:56 +0800

> On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
>> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
>> 	bridge: only expire the mdb entry when query is received
>> changed the mdb expiration timer to be armed only when QUERY is
>> received.  Howerver, this causes issues in an environment where
>> the multicast server socket comes and goes very fast while a client
>> is trying to send traffic to it.
>> 
>> The root cause is a race where a sequence of LEAVE followed by REPORT
>> messages can race against QUERY messages generated in response to LEAVE.
>> The QUERY ends up starting the expiration timer, and that timer can
>> potentially expire after the new REPORT message has been received signaling
>> the new join operation.  This leads to a significant drop in multicast
>> traffic and possible complete stall. 
>> 
>> The solution is to have REPORT messages update the expiration timer
>> on entries that already exist.
>> 
>> CC: Cong Wang <xiyou.wangcong@gmail.com>
>> CC: Herbert Xu <herbert@gondor.apana.org.au>
>> CC: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
> 
> Good catch.  Thanks!
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, and queued up for -stable, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  4:48 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAEP_g=_meOUPrYc7S8rM4NwT_7qG1JZrHJkOg6CmbjDc-rFjAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 8:56 PM, Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>>>> The combination of two commits
>>>>>>>>>>
>>>>>>>>>> commit 8e4e1713e4
>>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> commit 2537b4dd0a
>>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>>
>>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>>> netdev_unregister notification
>>>>>>>>>>
>>>>>>>>>> The following steps:
>>>>>>>>>>
>>>>>>>>>>   modprobe openvswitch
>>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>>
>>>>>>>>>> are causing multiple warnings:
>>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>>
>>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>>> and let workq do rest of work.
>>>>>>>>
>>>>>>>> isn't it what it's doing?
>>>>>>>
>>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>>> rest of vport destroy can be done in workq.
>>>>>>
>>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>>> that's dangerous.
>>>>> why is it dangerous? ovs already had ref to net-device.
>>>>
>>>> comment from dev.c:
>>>>                 /* Notify protocols, that we are about to destroy
>>>>                    this device. They should clean all the things.
>>>>                 */
>>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>>
>>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>>> the warning,
>>>> but then do dev_set_promisc(-1) in workqueue?
>>>>
>>> promiscuity setting is different issue. If you want to have discussion
>>> then you can post separate patch for same. Lets fix the warning here.
>>>
>>>> well, as a minimum audit of promiscuity will be wrong.
>>>> ndo_change_rx_flags will be called after ndo_uninit,
>>>> all sorts of other cleanups are done.
>>>
>>> change_rx_flags() checks for UP flag for given device.
>>>
>>>> I cannot track all possible scenarios, but it seems much safer to
>>>> cleanup everything possible
>>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>>
>>>> May be all these risks are worth taking, then please explain what is
>>>> the problem with the proposed patch?
>>>>
>>> Problem is that this is causing layering issues in OVS. dp_notify is
>>> suppose to work at dp layer. your patch directly calls vport-netdev
>>> implementation function from dp_notify.
>>> I could not think of a simple approach that will do this in completely
>>> clean manner. Therefore I am trying to minimize layering issues. So
>>> just calling netdev_upper_dev_unlink() looks better than doing
>>> anything extra.
>>
>> dp_notify is per net, not per dp.
>> notifier can only be called for net_device and the first thing it does:
>>         if (!ovs_is_internal_dev(dev))
>>                 vport = ovs_netdev_get_vport(dev);
>> where ovs_netdev_get_vport() is defined in vport-netdev.c
>>
>> once it gets into workq, it checks for:
>>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>>              continue;
>> and
>>         netdev_vport = netdev_vport_priv(vport);
>> where netdev_vport_priv() is defined in netdev-vport.h
>>
>> only then it proceeds with generic ovs_dp_detach_port().
>>
>> Is that the layering you talking about?
>>
>> So to minimize layering issues you want to call 'upper_dev_link' from
>> netdev_create() in vport-netdev.c
>> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>>
>> That will look better than calling ovs_netdev_unlink_dev() ?
>
> At a minimum, this function name is misleading because it does a bunch
> of other things beyond unlink the device.

sure. Please propose a different name.

> At this point, it basically seems like six of one, half dozen of the
> other. I don't think that either solution is ideal but it doesn't
> really matter all that much.
>
> However, the check dev->reg_state in netdev_destroy() looks racy to
> me, as it could already be in NETREG_UNREGISTERED even if we already
> processed this device.

you mean that netdev_destroy() will see reg_state == netreg_unregistered,
while dp_device_event() didn't see reg_state == netreg_unregistering yet?
or dp_device_event() saw it, proceeded to do unlink and
netdev_destroy() ran in parallel?
well, that's why reg_state == netreg_unregistering check in netdev_destroy()
is done with rtnl_lock() held.
reg_state cannot go into netreg_unregistered state skipping
netreg_unregistering and notifier.
therefore I don't think it's racy.

In ovs_dp_notify_wq() you're checking for both unregistering and
unregistered and that makes
sense, since workq can run after unregistering notifier called and
netdev_run_todo()
already changed the state to unregistered.
But here it's not the case.

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  4:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381463075.4971.99.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
>> I do have control over the net device. Is there something there I can
>> set, or is tx-nocache-copy also a new feature? I'll start digging.
>>
>
> nocache-copy was added in 3.0, but I do find its not a gain for current
> cpus.
>
> You could get a fresh copy of ethtool sources :
>
> git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> cd ethtool
> ./autogen.sh  ...

That did the trick. Thanks for the help! Is there somewhere I can read
up on this feature? A lot of the netdev features are opaque to me.
Also, I can set NETIF_F_NOCACHE_COPY in the netdev->features to set
this by default, yes?

This at least mirrors the performance improvement that I see when
forwarding, however I still see reserved CPU time. Is there a way to
push it even farther?

-Kyle

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Jesse Gross @ 2013-10-11  3:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Pravin Shelar, David S. Miller, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CAMEtUuyEw81g8+68HUMpR1RERAA9rjL=WHRE58se83cdur67eg@mail.gmail.com>

On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>>> The combination of two commits
>>>>>>>>>
>>>>>>>>> commit 8e4e1713e4
>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> commit 2537b4dd0a
>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>
>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>> netdev_unregister notification
>>>>>>>>>
>>>>>>>>> The following steps:
>>>>>>>>>
>>>>>>>>>   modprobe openvswitch
>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>
>>>>>>>>> are causing multiple warnings:
>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>
>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>> and let workq do rest of work.
>>>>>>>
>>>>>>> isn't it what it's doing?
>>>>>>
>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>> rest of vport destroy can be done in workq.
>>>>>
>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>> that's dangerous.
>>>> why is it dangerous? ovs already had ref to net-device.
>>>
>>> comment from dev.c:
>>>                 /* Notify protocols, that we are about to destroy
>>>                    this device. They should clean all the things.
>>>                 */
>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>
>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>> the warning,
>>> but then do dev_set_promisc(-1) in workqueue?
>>>
>> promiscuity setting is different issue. If you want to have discussion
>> then you can post separate patch for same. Lets fix the warning here.
>>
>>> well, as a minimum audit of promiscuity will be wrong.
>>> ndo_change_rx_flags will be called after ndo_uninit,
>>> all sorts of other cleanups are done.
>>
>> change_rx_flags() checks for UP flag for given device.
>>
>>> I cannot track all possible scenarios, but it seems much safer to
>>> cleanup everything possible
>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>
>>> May be all these risks are worth taking, then please explain what is
>>> the problem with the proposed patch?
>>>
>> Problem is that this is causing layering issues in OVS. dp_notify is
>> suppose to work at dp layer. your patch directly calls vport-netdev
>> implementation function from dp_notify.
>> I could not think of a simple approach that will do this in completely
>> clean manner. Therefore I am trying to minimize layering issues. So
>> just calling netdev_upper_dev_unlink() looks better than doing
>> anything extra.
>
> dp_notify is per net, not per dp.
> notifier can only be called for net_device and the first thing it does:
>         if (!ovs_is_internal_dev(dev))
>                 vport = ovs_netdev_get_vport(dev);
> where ovs_netdev_get_vport() is defined in vport-netdev.c
>
> once it gets into workq, it checks for:
>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>              continue;
> and
>         netdev_vport = netdev_vport_priv(vport);
> where netdev_vport_priv() is defined in netdev-vport.h
>
> only then it proceeds with generic ovs_dp_detach_port().
>
> Is that the layering you talking about?
>
> So to minimize layering issues you want to call 'upper_dev_link' from
> netdev_create() in vport-netdev.c
> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>
> That will look better than calling ovs_netdev_unlink_dev() ?

At a minimum, this function name is misleading because it does a bunch
of other things beyond unlink the device.

At this point, it basically seems like six of one, half dozen of the
other. I don't think that either solution is ideal but it doesn't
really matter all that much.

However, the check dev->reg_state in netdev_destroy() looks racy to
me, as it could already be in NETREG_UNREGISTERED even if we already
processed this device.

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:44 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U0tpKhknMLd7kN4uQmq-E81=EbH1Q816nX_eig=tGreAQ@mail.gmail.com>

On Thu, 2013-10-10 at 23:33 -0400, Kyle Hubert wrote:

> Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
> commit. In fact, the code looks quite different. There must have been
> a lot of changes?
> 

Probably ;)

> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
> I do have control over the net device. Is there something there I can
> set, or is tx-nocache-copy also a new feature? I'll start digging.
> 

nocache-copy was added in 3.0, but I do find its not a gain for current
cpus.

You could get a fresh copy of ethtool sources :

git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
cd ethtool
./autogen.sh  ...

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  3:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381461664.4971.97.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:21 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
>> I'm working on a device, I consistently get 27gbps via netperf-2.6.
>> UDP reports 54gbps.
>>
>> TCP is maxed out at 100% CPU on the transmit side. On the receive
>> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
>> performance out of it.
>>
>> However, very oddly, if I enabled bridged mode to forward some
>> packets, TCP performance goes up to 32gbps. The thing that bothers me
>> is that transmit CPU utilization drops to 65%, and receive CPU
>> utilization increases to 60%.
>>
>> What happens when the device becomes bridged to gain so much
>> performance? Also, can I now take advantage of the extra CPU time
>> available to drive more traffic? No tunable seems to have any effect..
>> (except down)
>
>
> You do not give what version of linux you use, but my guess is that
> using latest trees should help, because of
> http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10
>
> Also try to disable tx-nocache-copy
>
> ethtool -K eth0 tx-nocache-copy off

Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
commit. In fact, the code looks quite different. There must have been
a lot of changes?

Also, my copy of ethtool does not recognize tx-nocache-copy. However,
I do have control over the net device. Is there something there I can
set, or is tx-nocache-copy also a new feature? I'll start digging.

Thanks for your response.
-Kyle

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:21 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U2eVUd4JiGQ6UFYMdoEwyN=MO8+9S3jGcsaYGN2vqt=3g@mail.gmail.com>

On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
> I'm working on a device, I consistently get 27gbps via netperf-2.6.
> UDP reports 54gbps.
> 
> TCP is maxed out at 100% CPU on the transmit side. On the receive
> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
> performance out of it.
> 
> However, very oddly, if I enabled bridged mode to forward some
> packets, TCP performance goes up to 32gbps. The thing that bothers me
> is that transmit CPU utilization drops to 65%, and receive CPU
> utilization increases to 60%.
> 
> What happens when the device becomes bridged to gain so much
> performance? Also, can I now take advantage of the extra CPU time
> available to drive more traffic? No tunable seems to have any effect..
> (except down)


You do not give what version of linux you use, but my guess is that
using latest trees should help, because of
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10

Also try to disable tx-nocache-copy

ethtool -K eth0 tx-nocache-copy off

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-11  2:35 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

This is what I was thinking would be better.

Don't want these packets leaking into PRE_ROUTEING chain or have
any chance to get flooded out other ports.

Compile tested only...

I could use another goto instead but that becomes spaghetti and
never like to jump back into a block.


--- a/net/bridge/br_input.c	2013-10-06 14:48:24.946450042 -0700
+++ b/net/bridge/br_input.c	2013-10-10 19:32:14.227926344 -0700
@@ -152,6 +152,16 @@ static int br_handle_local_finish(struct
 	return 0;	 /* process further */
 }
 
+/* Deliver packet to local host only */
+static rx_handler_result_t br_local_only(struct sk_buff *skb)
+{
+	if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
+		    NULL, br_handle_local_finish))
+		return RX_HANDLER_CONSUMED; /* consumed by filter */
+	else
+		return RX_HANDLER_PASS;	/* continue processing */
+}
+
 /*
  * Return NULL if skb is handled
  * note: already called with rcu_read_lock
@@ -206,18 +216,20 @@ rx_handler_result_t br_handle_frame(stru
 				goto forward;
 		}
 
-		/* Deliver packet to local host only */
-		if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
-			    NULL, br_handle_local_finish)) {
-			return RX_HANDLER_CONSUMED; /* consumed by filter */
-		} else {
-			*pskb = skb;
-			return RX_HANDLER_PASS;	/* continue processing */
-		}
+		*pskb = skb;
+		return br_local_only(skb);
 	}
 
 forward:
 	switch (p->state) {
+	case BR_STATE_DISABLED:
+		if (!ether_addr_equal(p->br->dev->dev_addr, dest))
+			goto drop;
+
+		skb->pkt_type = PACKET_HOST;
+		*pskb = skb;
+		return br_local_only(skb);
+
 	case BR_STATE_FORWARDING:
 		rhook = rcu_dereference(br_should_route_hook);
 		if (rhook) {

^ permalink raw reply

* Peak TCP performance
From: Kyle Hubert @ 2013-10-11  2:34 UTC (permalink / raw)
  To: netdev

I'm working on a device, I consistently get 27gbps via netperf-2.6.
UDP reports 54gbps.

TCP is maxed out at 100% CPU on the transmit side. On the receive
side, 40% of the CPU. Thus, I didn't believe I could eek anymore
performance out of it.

However, very oddly, if I enabled bridged mode to forward some
packets, TCP performance goes up to 32gbps. The thing that bothers me
is that transmit CPU utilization drops to 65%, and receive CPU
utilization increases to 60%.

What happens when the device becomes bridged to gain so much
performance? Also, can I now take advantage of the extra CPU time
available to drive more traffic? No tunable seems to have any effect..
(except down)

Any ideas?

Cheers,
-Kyle

^ permalink raw reply

* Re: [PATCH] usbnet: smsc95xx: Add device tree input for MAC address
From: Ming Lei @ 2013-10-11  1:51 UTC (permalink / raw)
  To: Dan Murphy
  Cc: Steve Glendinning, Network Development, Linux Kernel Mailing List,
	linux-usb, mugunthanvnm
In-Reply-To: <5256A1DE.8010709@ti.com>

On Thu, Oct 10, 2013 at 8:47 PM, Dan Murphy <dmurphy@ti.com> wrote:
>
> Is there a board that has 2 built in smsc devices?

I don't know, maybe there isn't, but driver should be generic enough, and
as I said it is a generic problem, and people are discussing it, so suggest
to read previous discussions first before post your V2.

Below link is one I remembered, but I think you can find more:

https://lkml.org/lkml/2013/8/11/100

> This is all dependent on what the dev.id is of udev.

I am not sure why it depends on udev, and udev can't
see the device util it is added to system.

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  0:47 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+rgasL+XShQN20b9PkFNGq38+UoXX4uLogrZ2igAwfDew@mail.gmail.com>

On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>> The combination of two commits
>>>>>>>>
>>>>>>>> commit 8e4e1713e4
>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> commit 2537b4dd0a
>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>
>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>> netdev_unregister notification
>>>>>>>>
>>>>>>>> The following steps:
>>>>>>>>
>>>>>>>>   modprobe openvswitch
>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>
>>>>>>>> are causing multiple warnings:
>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>> index c323567..e9380bd 100644
>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>
>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>> +
>>>>>>>
>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>> and let workq do rest of work.
>>>>>>
>>>>>> isn't it what it's doing?
>>>>>
>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>> rest of vport destroy can be done in workq.
>>>>
>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>> that's dangerous.
>>> why is it dangerous? ovs already had ref to net-device.
>>
>> comment from dev.c:
>>                 /* Notify protocols, that we are about to destroy
>>                    this device. They should clean all the things.
>>                 */
>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>
>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>> the warning,
>> but then do dev_set_promisc(-1) in workqueue?
>>
> promiscuity setting is different issue. If you want to have discussion
> then you can post separate patch for same. Lets fix the warning here.
>
>> well, as a minimum audit of promiscuity will be wrong.
>> ndo_change_rx_flags will be called after ndo_uninit,
>> all sorts of other cleanups are done.
>
> change_rx_flags() checks for UP flag for given device.
>
>> I cannot track all possible scenarios, but it seems much safer to
>> cleanup everything possible
>> as soon as ovs received NETDEV_UNREGISTER event.
>>
>> May be all these risks are worth taking, then please explain what is
>> the problem with the proposed patch?
>>
> Problem is that this is causing layering issues in OVS. dp_notify is
> suppose to work at dp layer. your patch directly calls vport-netdev
> implementation function from dp_notify.
> I could not think of a simple approach that will do this in completely
> clean manner. Therefore I am trying to minimize layering issues. So
> just calling netdev_upper_dev_unlink() looks better than doing
> anything extra.

dp_notify is per net, not per dp.
notifier can only be called for net_device and the first thing it does:
        if (!ovs_is_internal_dev(dev))
                vport = ovs_netdev_get_vport(dev);
where ovs_netdev_get_vport() is defined in vport-netdev.c

once it gets into workq, it checks for:
       if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
             continue;
and
        netdev_vport = netdev_vport_priv(vport);
where netdev_vport_priv() is defined in netdev-vport.h

only then it proceeds with generic ovs_dp_detach_port().

Is that the layering you talking about?

So to minimize layering issues you want to call 'upper_dev_link' from
netdev_create() in vport-netdev.c
and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?

That will look better than calling ovs_netdev_unlink_dev() ?

Having both register+link and unregister+unlink in the same
vport-netdev.c, is not an advantage?

I'm still missing something here.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Josh Triplett @ 2013-10-11  0:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells, edumazet,
	darren, fweisbec, sbw, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <20131010002833.GJ5790@linux.vnet.ibm.com>

On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > 
> > > that.  Constructs like list_del_rcu are much clearer, and not
> > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > Idea.
> > 
> > OK, so you think there is synchronization code.
> > 
> > I will shut up then, no need to waste time.
> 
> As you said earlier, we should at least get rid of the memory barrier
> as long as we are changing the code.
> 
> Josh, what would you suggest as the best way to avoid the memory barrier,
> keep sparse happy, and not be too ugly?

The more I think about it, the more I realize that assigning an __rcu
pointer to an __rcu pointer *without* a memory barrier is a sufficiently
uncommon case that you probably *should* just write an open-coded
assignment.  Just please put a very clear comment right before it.

I'd originally thought it might make sense to have a macro similar to
rcu_assign_pointer, but I just don't think this is a common enough case,
and we don't want people thinking they can use this in general for __rcu
to __rcu assignments (most of which still need a memory barrier).

- Josh Triplett

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: Herbert Xu @ 2013-10-10 23:48 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Cong Wang, Stephen Hemminger
In-Reply-To: <1381435079-15387-1-git-send-email-vyasevic@redhat.com>

On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
> 	bridge: only expire the mdb entry when query is received
> changed the mdb expiration timer to be armed only when QUERY is
> received.  Howerver, this causes issues in an environment where
> the multicast server socket comes and goes very fast while a client
> is trying to send traffic to it.
> 
> The root cause is a race where a sequence of LEAVE followed by REPORT
> messages can race against QUERY messages generated in response to LEAVE.
> The QUERY ends up starting the expiration timer, and that timer can
> potentially expire after the new REPORT message has been received signaling
> the new join operation.  This leads to a significant drop in multicast
> traffic and possible complete stall. 
> 
> The solution is to have REPORT messages update the expiration timer
> on entries that already exist.
> 
> CC: Cong Wang <xiyou.wangcong@gmail.com>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>

Good catch.  Thanks!

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Mark Lord @ 2013-10-10 23:17 UTC (permalink / raw)
  To: Alexander Gordeev, H. Peter Anvin
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <20131010180704.GA15719@dhcp-26-207.brq.redhat.com>

Just to help us all understand "the loop" issue..

Here's an example of driver code which uses the existing MSI-X interfaces,
for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
This is from a new driver I'm working on right now:


static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
{
        xx_disable_all_irqs(dev);
        do {
                if (nvec < 2)
                        xx_prep_for_1_msix_vector(dev);
                else if (nvec < 4)
                        xx_prep_for_2_msix_vectors(dev);
                else if (nvec < 8)
                        xx_prep_for_4_msix_vectors(dev);
                else if (nvec < 16)
                        xx_prep_for_8_msix_vectors(dev);
                else
                        xx_prep_for_16_msix_vectors(dev);
                nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors);
        } while (nvec > 0);

        if (nvec) {
                kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
                dev->num_vectors = 0;
                return nvec;
        }
        return 0;       /* success */
}

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Pravin Shelar @ 2013-10-10 22:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAMEtUuxH24G6pQtSFznrK7w5mdP3za4uQfKKEEaEngW4SJajbA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>> The combination of two commits
>>>>>>>
>>>>>>> commit 8e4e1713e4
>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> commit 2537b4dd0a
>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>
>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>> netdev_unregister notification
>>>>>>>
>>>>>>> The following steps:
>>>>>>>
>>>>>>>   modprobe openvswitch
>>>>>>>   ovs-dpctl add-dp test
>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>
>>>>>>> are causing multiple warnings:
>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>> index c323567..e9380bd 100644
>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>                 return NOTIFY_DONE;
>>>>>>>
>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>> +
>>>>>>
>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>> and let workq do rest of work.
>>>>>
>>>>> isn't it what it's doing?
>>>>
>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>> rest of vport destroy can be done in workq.
>>>
>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>> that's dangerous.
>> why is it dangerous? ovs already had ref to net-device.
>
> comment from dev.c:
>                 /* Notify protocols, that we are about to destroy
>                    this device. They should clean all the things.
>                 */
>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>
> so here you're suggesting to just netdev_upper_dev_unlink() to silence
> the warning,
> but then do dev_set_promisc(-1) in workqueue?
>
promiscuity setting is different issue. If you want to have discussion
then you can post separate patch for same. Lets fix the warning here.

> well, as a minimum audit of promiscuity will be wrong.
> ndo_change_rx_flags will be called after ndo_uninit,
> all sorts of other cleanups are done.

change_rx_flags() checks for UP flag for given device.

> I cannot track all possible scenarios, but it seems much safer to
> cleanup everything possible
> as soon as ovs received NETDEV_UNREGISTER event.
>
> May be all these risks are worth taking, then please explain what is
> the problem with the proposed patch?
>
Problem is that this is causing layering issues in OVS. dp_notify is
suppose to work at dp layer. your patch directly calls vport-netdev
implementation function from dp_notify.
I could not think of a simple approach that will do this in completely
clean manner. Therefore I am trying to minimize layering issues. So
just calling netdev_upper_dev_unlink() looks better than doing
anything extra.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 22:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010145255.16cf7c09@nehalam.linuxnetplumber.net>

On 2013-10-10 11:52 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 22:56:33 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
>> > On Thu, 10 Oct 2013 14:52:50 +0200
>> > Felix Fietkau <nbd@openwrt.org> wrote:
>> > 
>> >> When an ethernet device is enslaved to a bridge, and the bridge STP
>> >> detects loss of carrier (or operational state down), then normally
>> >> packet receiption is blocked.
>> >> 
>> >> This breaks control applications like WPA which maybe expecting to
>> >> receive packets to negotiate to bring link up. The bridge needs to
>> >> block forwarding packets from these disabled ports, but there is no
>> >> hard requirement to not allow local packet delivery.
>> >> 
>> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>> > 
>> > No. This will cause duplicate packets to be delivered.
>> How? I haven't observed any duplications in my tests with this patch.
> 
> The purpose of DISABLED state is to break loops in the bridge tree.
> If packet is flooded by another bridge (Broadcast Unknown or Multicast)
> then it will go down both paths.
Ah, right.

>> > If doing a link layer protocol like WPA then it should be done directly
>> > on the underlying device, not the bridge itself.
>> When the ETH_P_PAE protocol is set for the packet socket inside
>> wpa_supplicant, the bridge steals all packets before the protocol
>> handler gets them.
>> In __netif_receive_skb_core, only ptype_all gets processed before the rx
>> handler, not ptype_base.
> 
> Thought it was using direct type all. Or at least the link local multicast
> address.
> 
> Can you revise it to only accept packets directed to link local multicast
> address or local address, and go through the local_finish handler.
The destination address in WPA EAPOL packets from the AP is set to the
MAC address of the client, which may not be the same as the one from the
bridge.

- Felix

^ permalink raw reply

* Re: [PATCH nf-next] netfilter: xtables: lightweight process control group matching
From: Daniel Borkmann @ 2013-10-10 21:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: pablo-Cap9r6Oaw4JrovVCs/uTlw,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Borkmann
In-Reply-To: <20131009170409.GH22495-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

Hi Tejun,

On 10/09/2013 07:04 PM, Tejun Heo wrote:
> On Tue, Oct 08, 2013 at 10:05:02AM +0200, Daniel Borkmann wrote:
>> Could you elaborate on "Wouldn't it be more logical to implement netfilter
>> rule to match the target cgroup paths?". I don't think (or hope) you mean
>> some string comparison on the dentry path here? :) With our proposal, we
>> have in the network stack's critical path only the following code that is
>> being executed here to match the cgroup ...
>
> Comparing path each time obviously doesn't make sense but you can
> determine the cgroup on config and hold onto the pointer while the
> rule exists.
>
>> ... where ``info->id == skb->sk->sk_cgrp_fwid'' is the actual work, so very
>> lightweight, which is good for high loads (1Gbit/s, 10Gbit/s and beyond), of
>> course. Also, it would be intuitive for admins familiar with other subsystems
>> to just set up and use these cgroup ids in iptabels. I'm not yet quite sure
>> how your suggestion would look like, so you would need to setup some "dummy"
>> subgroups first just to have a path that you can match on?
>
> Currently, it's tricky because we have multiple hierarchies to
> consider and there isn't an efficient way to map from task to cgroup
> on a specific hierarchy.  I'm not sure whether we should add another
> mapping table in css_set or just allow using path matching on the
> unified hierarchy.  The latter should be cleaner and easier but more
> restrictive.
>
> Anyways, it isn't manageable in the long term to keep adding
> controllers simply to tag tasks differently.  If we want to do this,
> let's please work on a way to match a task's cgroup affiliation
> efficiently.

Here's a draft (!) of an alternative w/o using a new cgroup subsystem. I've
tested it and it would basically work this way as well. I've used serial_nr
as an identifier of cgroups here, as we'd actually want the xt_cgroup_info
structure as small as possible for rule sets (since they can be large and
are flat-copied to kernel). Logic in cgroup_mt() needs to change a bit as
we cannot hold css_set_lock here. Anyway, iptables would match here against
cgroup.serial (that can probably also be widely used otherwise). The way we
do it here is to cache the corresponding task in socket structure, and walk
all cgroups belonging to that task, comparing if serial_nr's match.

Still, I think my original patch is more clean, user friendly and intuitive,
and has a better performance (main work is one comparison instead of walking
all corresponding cgroups), so I'd still consider this the better tradeoff
to go with, I think netfilter is a large enough candidate for a subsys. ;)

Thanks,
Daniel

Draft patch:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3561d30..3c5e953 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -399,6 +399,26 @@ struct cgroup_map_cb {
  };

  /*
+ * A cgroup can be associated with multiple css_sets as different tasks may
+ * belong to different cgroups on different hierarchies.  In the other
+ * direction, a css_set is naturally associated with multiple cgroups.
+ * This M:N relationship is represented by the following link structure
+ * which exists for each association and allows traversing the associations
+ * from both sides.
+ */
+struct cgrp_cset_link {
+	/* the cgroup and css_set this link associates */
+	struct cgroup		*cgrp;
+	struct css_set		*cset;
+
+	/* list of cgrp_cset_links anchored at cgrp->cset_links */
+	struct list_head	cset_link;
+
+	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
+	struct list_head	cgrp_link;
+};
+
+/*
   * struct cftype: handler definitions for cgroup control files
   *
   * When reading/writing to a file:
diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..b035ba3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -406,6 +406,9 @@ struct sock {
  	__u32			sk_mark;
  	u32			sk_classid;
  	struct cg_proto		*sk_cgrp;
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	struct task_struct	*sk_task_cached;
+#endif
  	void			(*sk_state_change)(struct sock *sk);
  	void			(*sk_data_ready)(struct sock *sk, int bytes);
  	void			(*sk_write_space)(struct sock *sk);
@@ -2098,6 +2101,22 @@ static inline gfp_t gfp_any(void)
  	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
  }

+static inline struct task_struct *sock_task(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	return sk->sk_task_cached;
+#else
+	return NULL;
+#endif
+}
+
+static inline void sock_task_set(struct sock *sk, struct task_struct *task)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	sk->sk_task_cached = task;
+#endif
+}
+
  static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
  {
  	return noblock ? 0 : sk->sk_rcvtimeo;
diff --git a/include/uapi/linux/netfilter/Kbuild b/include/uapi/linux/netfilter/Kbuild
index 1749154..94a4890 100644
--- a/include/uapi/linux/netfilter/Kbuild
+++ b/include/uapi/linux/netfilter/Kbuild
@@ -37,6 +37,7 @@ header-y += xt_TEE.h
  header-y += xt_TPROXY.h
  header-y += xt_addrtype.h
  header-y += xt_bpf.h
+header-y += xt_cgroup.h
  header-y += xt_cluster.h
  header-y += xt_comment.h
  header-y += xt_connbytes.h
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
new file mode 100644
index 0000000..c59ff53
--- /dev/null
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -0,0 +1,11 @@
+#ifndef _UAPI_XT_CGROUP_H
+#define _UAPI_XT_CGROUP_H
+
+#include <linux/types.h>
+
+struct xt_cgroup_info {
+	__u64 serial_nr;
+	__u32 invert;
+};
+
+#endif /* _UAPI_XT_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2418b6e..1f9dc5b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -357,26 +357,6 @@ static void cgroup_release_agent(struct work_struct *work);
  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
  static void check_for_release(struct cgroup *cgrp);

-/*
- * A cgroup can be associated with multiple css_sets as different tasks may
- * belong to different cgroups on different hierarchies.  In the other
- * direction, a css_set is naturally associated with multiple cgroups.
- * This M:N relationship is represented by the following link structure
- * which exists for each association and allows traversing the associations
- * from both sides.
- */
-struct cgrp_cset_link {
-	/* the cgroup and css_set this link associates */
-	struct cgroup		*cgrp;
-	struct css_set		*cset;
-
-	/* list of cgrp_cset_links anchored at cgrp->cset_links */
-	struct list_head	cset_link;
-
-	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
-	struct list_head	cgrp_link;
-};
-
  /* The default css_set - used by init and its children prior to any
   * hierarchies being mounted. It contains a pointer to the root state
   * for each subsystem. Also used to anchor the list of css_sets. Not
@@ -4163,6 +4143,12 @@ static int cgroup_clone_children_write(struct cgroup_subsys_state *css,
  	return 0;
  }

+static u64 cgroup_serial_read(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	return css->cgroup->serial_nr;
+}
+
  static struct cftype cgroup_base_files[] = {
  	{
  		.name = "cgroup.procs",
@@ -4187,6 +4173,11 @@ static struct cftype cgroup_base_files[] = {
  		.flags = CFTYPE_ONLY_ON_ROOT,
  		.read_seq_string = cgroup_sane_behavior_show,
  	},
+	{
+		.name = "cgroup.serial",
+		.read_u64 = cgroup_serial_read,
+		.mode = S_IRUGO,
+	},

  	/*
  	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
diff --git a/net/core/scm.c b/net/core/scm.c
index b442e7e..9a40ab0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -292,6 +292,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
  		if (sock) {
  			sock_update_netprioidx(sock->sk);
  			sock_update_classid(sock->sk);
+			sock_task_set(sock->sk, current);
  		}
  		fd_install(new_fd, get_file(fp[i]));
  	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 2bd9b3f..ab53afc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1359,6 +1359,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
  		sk->sk_prot = sk->sk_prot_creator = prot;
  		sock_lock_init(sk);
  		sock_net_set(sk, get_net(net));
+		sock_task_set(sk, current);
  		atomic_set(&sk->sk_wmem_alloc, 1);

  		sock_update_classid(sk);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 6e839b6..d276ff4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -806,6 +806,14 @@ config NETFILTER_XT_MATCH_BPF

  	  To compile it as a module, choose M here.  If unsure, say N.

+config NETFILTER_XT_MATCH_CGROUP
+	tristate '"control group" match support'
+	depends on NETFILTER_ADVANCED
+	depends on CGROUPS
+	---help---
+	Socket/process control group matching allows you to match locally
+	generated packets based on which control group processes belong to.
+
  config NETFILTER_XT_MATCH_CLUSTER
  	tristate '"cluster" match support'
  	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index c3a0a12..12f014f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_MULTIPORT) += xt_multiport.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_NFACCT) += xt_nfacct.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OSF) += xt_osf.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP) += xt_cgroup.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
new file mode 100644
index 0000000..f04cba8
--- /dev/null
+++ b/net/netfilter/xt_cgroup.c
@@ -0,0 +1,79 @@
+#include <linux/skbuff.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/cgroup.h>
+#include <linux/fdtable.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_cgroup.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>");
+MODULE_DESCRIPTION("Xtables: process control group matching");
+MODULE_ALIAS("ipt_cgroup");
+MODULE_ALIAS("ip6t_cgroup");
+
+static int cgroup_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_cgroup_info *info = par->matchinfo;
+
+	return (info->invert & ~1) ? -EINVAL : 0;
+}
+
+static bool
+cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_cgroup_info *info = par->matchinfo;
+	struct cgrp_cset_link *link, *link_tmp;
+	const struct sock *sk = skb->sk;
+	struct task_struct *task;
+	struct css_set *cset;
+	bool found = false;
+
+	if (sk == NULL)
+		return false;
+
+	task = sock_task(sk);
+	if (task == NULL)
+		return false;
+
+	rcu_read_lock();
+	/* XXX: read_lock(&css_set_lock); */
+	cset = task_css_set(task);
+	list_for_each_entry_safe(link, link_tmp, &cset->cgrp_links, cgrp_link) {
+		struct cgroup *cgrp = link->cgrp;
+		if (cgrp->serial_nr == info->serial_nr) {
+			found = true;
+			break;
+		}
+	}
+	/* XXX: read_unlock(&css_set_lock); */
+	rcu_read_unlock();
+
+	return found ^ info->invert;
+}
+
+static struct xt_match cgroup_mt_reg __read_mostly = {
+	.name       = "cgroup",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.checkentry = cgroup_mt_check,
+	.match      = cgroup_mt,
+	.matchsize  = sizeof(struct xt_cgroup_info),
+	.me         = THIS_MODULE,
+	.hooks      = (1 << NF_INET_LOCAL_OUT) |
+	              (1 << NF_INET_POST_ROUTING),
+};
+
+static int __init cgroup_mt_init(void)
+{
+	return xt_register_match(&cgroup_mt_reg);
+}
+
+static void __exit cgroup_mt_exit(void)
+{
+	xt_unregister_match(&cgroup_mt_reg);
+}
+
+module_init(cgroup_mt_init);
+module_exit(cgroup_mt_exit);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 21:52 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

On Thu, 10 Oct 2013 22:56:33 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> > On Thu, 10 Oct 2013 14:52:50 +0200
> > Felix Fietkau <nbd@openwrt.org> wrote:
> > 
> >> When an ethernet device is enslaved to a bridge, and the bridge STP
> >> detects loss of carrier (or operational state down), then normally
> >> packet receiption is blocked.
> >> 
> >> This breaks control applications like WPA which maybe expecting to
> >> receive packets to negotiate to bring link up. The bridge needs to
> >> block forwarding packets from these disabled ports, but there is no
> >> hard requirement to not allow local packet delivery.
> >> 
> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> > 
> > No. This will cause duplicate packets to be delivered.
> How? I haven't observed any duplications in my tests with this patch.

The purpose of DISABLED state is to break loops in the bridge tree.
If packet is flooded by another bridge (Broadcast Unknown or Multicast)
then it will go down both paths.

> 
> > If doing a link layer protocol like WPA then it should be done directly
> > on the underlying device, not the bridge itself.
> When the ETH_P_PAE protocol is set for the packet socket inside
> wpa_supplicant, the bridge steals all packets before the protocol
> handler gets them.
> In __netif_receive_skb_core, only ptype_all gets processed before the rx
> handler, not ptype_base.

Thought it was using direct type all. Or at least the link local multicast
address.

Can you revise it to only accept packets directed to link local multicast
address or local address, and go through the local_finish handler.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 20:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010133646.1bdd42c1@nehalam.linuxnetplumber.net>

On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 14:52:50 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> When an ethernet device is enslaved to a bridge, and the bridge STP
>> detects loss of carrier (or operational state down), then normally
>> packet receiption is blocked.
>> 
>> This breaks control applications like WPA which maybe expecting to
>> receive packets to negotiate to bring link up. The bridge needs to
>> block forwarding packets from these disabled ports, but there is no
>> hard requirement to not allow local packet delivery.
>> 
>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> 
> No. This will cause duplicate packets to be delivered.
How? I haven't observed any duplications in my tests with this patch.

> If doing a link layer protocol like WPA then it should be done directly
> on the underlying device, not the bridge itself.
When the ETH_P_PAE protocol is set for the packet socket inside
wpa_supplicant, the bridge steals all packets before the protocol
handler gets them.
In __netif_receive_skb_core, only ptype_all gets processed before the rx
handler, not ptype_base.


- Felix

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-10 20:48 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+oXJQoPG4io2JOn1zBbf7XtPKaBE+vHPNKXm2TBygvB5Q@mail.gmail.com>

On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> The combination of two commits
>>>>>>
>>>>>> commit 8e4e1713e4
>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>
>>>>>> and
>>>>>>
>>>>>> commit 2537b4dd0a
>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>
>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>> netdev_unregister notification
>>>>>>
>>>>>> The following steps:
>>>>>>
>>>>>>   modprobe openvswitch
>>>>>>   ovs-dpctl add-dp test
>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>   ovs-dpctl add-if test tap1
>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>
>>>>>> are causing multiple warnings:
>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>> index c323567..e9380bd 100644
>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>                 return NOTIFY_DONE;
>>>>>>
>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>> +
>>>>>
>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>> and let workq do rest of work.
>>>>
>>>> isn't it what it's doing?
>>>
>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>> rest of vport destroy can be done in workq.
>>
>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>> that's dangerous.
> why is it dangerous? ovs already had ref to net-device.

comment from dev.c:
                /* Notify protocols, that we are about to destroy
                   this device. They should clean all the things.
                */
                call_netdevice_notifiers(NETDEV_UNREGISTER, dev);

so here you're suggesting to just netdev_upper_dev_unlink() to silence
the warning,
but then do dev_set_promisc(-1) in workqueue?

well, as a minimum audit of promiscuity will be wrong.
ndo_change_rx_flags will be called after ndo_uninit,
all sorts of other cleanups are done.
I cannot track all possible scenarios, but it seems much safer to
cleanup everything possible
as soon as ovs received NETDEV_UNREGISTER event.

May be all these risks are worth taking, then please explain what is
the problem with the proposed patch?

Thanks
Alex

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 20:36 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <1381409570-1892-1-git-send-email-nbd@openwrt.org>

On Thu, 10 Oct 2013 14:52:50 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> When an ethernet device is enslaved to a bridge, and the bridge STP
> detects loss of carrier (or operational state down), then normally
> packet receiption is blocked.
> 
> This breaks control applications like WPA which maybe expecting to
> receive packets to negotiate to bring link up. The bridge needs to
> block forwarding packets from these disabled ports, but there is no
> hard requirement to not allow local packet delivery.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Felix Fietkau <nbd@openwrt.org>

No. This will cause duplicate packets to be delivered.

If doing a link layer protocol like WPA then it should be done directly
on the underlying device, not the bridge itself.

^ permalink raw reply

* [PATCH] bridge: update mdb expiration timer upon reports.
From: Vlad Yasevich @ 2013-10-10 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Yasevich, Cong Wang, Herbert Xu, Stephen Hemminger

commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
	bridge: only expire the mdb entry when query is received
changed the mdb expiration timer to be armed only when QUERY is
received.  Howerver, this causes issues in an environment where
the multicast server socket comes and goes very fast while a client
is trying to send traffic to it.

The root cause is a race where a sequence of LEAVE followed by REPORT
messages can race against QUERY messages generated in response to LEAVE.
The QUERY ends up starting the expiration timer, and that timer can
potentially expire after the new REPORT message has been received signaling
the new join operation.  This leads to a significant drop in multicast
traffic and possible complete stall. 

The solution is to have REPORT messages update the expiration timer
on entries that already exist.

CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/bridge/br_multicast.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index d1c5786..1085f21 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -611,6 +611,9 @@ rehash:
 		break;

 	default:
+		/* If we have an existing entry, update it's expire timer */
+		mod_timer(&mp->timer,
+			  jiffies + br->multicast_membership_interval);
 		goto out;
 	}

@@ -680,8 +683,12 @@ static int br_multicast_add_group(struct net_bridge *br,
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
-		if (p->port == port)
+		if (p->port == port) {
+			/* We already have a portgroup, update the timer.  */
+			mod_timer(&p->timer,
+				  jiffies + br->multicast_membership_interval);
 			goto out;
+		}
 		if ((unsigned long)p->port < (unsigned long)port)
 			break;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2013-10-10 19:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1381387271-29605-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  9 Oct 2013 23:40:58 -0700

> This series contains updates to i40e only.
> 
> Alex provides the majority of the patches against i40e, where he does
> cleanup of the Tx and RX queues and to align the code with the known
> good Tx/Rx queue code in the ixgbe driver.
> 
> Anjali provides an i40e patch to update link events to not print to
> the log until the device is administratively up.
> 
> Catherine provides a patch to update the driver version.

Pulled, thanks a lot Jeff.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Paul E. McKenney @ 2013-10-10 19:05 UTC (permalink / raw)
  To: Eric Dumazet, Josh Triplett, linux-kernel, mingo, laijs, dipankar,
	akpm, mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev
In-Reply-To: <20131010020422.GB24368@order.stressinduktion.org>

On Thu, Oct 10, 2013 at 04:04:22AM +0200, Hannes Frederic Sowa wrote:
> On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > > 
> > > > that.  Constructs like list_del_rcu are much clearer, and not
> > > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > > Idea.
> > > 
> > > OK, so you think there is synchronization code.
> > > 
> > > I will shut up then, no need to waste time.
> > 
> > As you said earlier, we should at least get rid of the memory barrier
> > as long as we are changing the code.
> 
> Interesting thread!
> 
> Sorry to chime in and asking a question:
> 
> Why do we need an ACCESS_ONCE here if rcu_assign_pointer can do without one?
> In other words I wonder why rcu_assign_pointer is not a static inline function
> to use the sequence point in argument evaluation (if I remember correctly this
> also holds for inline functions) to not allow something like this:
> 
> E.g. we want to publish which lock to take first to prevent an ABBA problem
> (extreme example):
> 
> rcu_assign_pointer(lockptr, min(lptr1, lptr2));
> 
> Couldn't a compiler spill the lockptr memory location as a temporary buffer
> if the compiler is under register pressure? (yes, this seems unlikely if we
> flushed out most registers to memory because of the barrier, but still... ;) )
> 
> This seems to be also the case if we publish a multi-dereferencing pointers
> e.g. ptr->ptr->ptr.

IIRC, sequence points only confine volatile accesses.  For non-volatile
accesses, the so-called "as-if rule" allows compiler writers to do some
surprisingly global reordering.

The reason that rcu_assign_pointer() isn't an inline function is because
it needs to be type-generic, in other words, it needs to be OK to use
it on any type of pointers as long as the C types of the two pointers
match (the sparse types can vary a bit).

One of the reasons for wanting a volatile cast in rcu_assign_pointer() is
to prevent compiler mischief such as you described in your last two
paragraphs.  That said, it would take a very brave compiler to pull
a pointer-referenced memory location into a register and keep it there.
Unfortunately, increasing compiler bravery seems to be a solid long-term
trend.

							Thanx, Paul

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox