Netdev List
 help / color / mirror / Atom feed
* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:44 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U0tpKhknMLd7kN4uQmq-E81=EbH1Q816nX_eig=tGreAQ@mail.gmail.com>

On Thu, 2013-10-10 at 23:33 -0400, Kyle Hubert wrote:

> Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
> commit. In fact, the code looks quite different. There must have been
> a lot of changes?
> 

Probably ;)

> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
> I do have control over the net device. Is there something there I can
> set, or is tx-nocache-copy also a new feature? I'll start digging.
> 

nocache-copy was added in 3.0, but I do find its not a gain for current
cpus.

You could get a fresh copy of ethtool sources :

git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
cd ethtool
./autogen.sh  ...

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  3:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381461664.4971.97.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:21 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
>> I'm working on a device, I consistently get 27gbps via netperf-2.6.
>> UDP reports 54gbps.
>>
>> TCP is maxed out at 100% CPU on the transmit side. On the receive
>> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
>> performance out of it.
>>
>> However, very oddly, if I enabled bridged mode to forward some
>> packets, TCP performance goes up to 32gbps. The thing that bothers me
>> is that transmit CPU utilization drops to 65%, and receive CPU
>> utilization increases to 60%.
>>
>> What happens when the device becomes bridged to gain so much
>> performance? Also, can I now take advantage of the extra CPU time
>> available to drive more traffic? No tunable seems to have any effect..
>> (except down)
>
>
> You do not give what version of linux you use, but my guess is that
> using latest trees should help, because of
> http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10
>
> Also try to disable tx-nocache-copy
>
> ethtool -K eth0 tx-nocache-copy off

Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
commit. In fact, the code looks quite different. There must have been
a lot of changes?

Also, my copy of ethtool does not recognize tx-nocache-copy. However,
I do have control over the net device. Is there something there I can
set, or is tx-nocache-copy also a new feature? I'll start digging.

Thanks for your response.
-Kyle

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:21 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U2eVUd4JiGQ6UFYMdoEwyN=MO8+9S3jGcsaYGN2vqt=3g@mail.gmail.com>

On Thu, 2013-10-10 at 22:34 -0400, Kyle Hubert wrote:
> I'm working on a device, I consistently get 27gbps via netperf-2.6.
> UDP reports 54gbps.
> 
> TCP is maxed out at 100% CPU on the transmit side. On the receive
> side, 40% of the CPU. Thus, I didn't believe I could eek anymore
> performance out of it.
> 
> However, very oddly, if I enabled bridged mode to forward some
> packets, TCP performance goes up to 32gbps. The thing that bothers me
> is that transmit CPU utilization drops to 65%, and receive CPU
> utilization increases to 60%.
> 
> What happens when the device becomes bridged to gain so much
> performance? Also, can I now take advantage of the extra CPU time
> available to drive more traffic? No tunable seems to have any effect..
> (except down)


You do not give what version of linux you use, but my guess is that
using latest trees should help, because of
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10

Also try to disable tx-nocache-copy

ethtool -K eth0 tx-nocache-copy off

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-11  2:35 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

This is what I was thinking would be better.

Don't want these packets leaking into PRE_ROUTEING chain or have
any chance to get flooded out other ports.

Compile tested only...

I could use another goto instead but that becomes spaghetti and
never like to jump back into a block.


--- a/net/bridge/br_input.c	2013-10-06 14:48:24.946450042 -0700
+++ b/net/bridge/br_input.c	2013-10-10 19:32:14.227926344 -0700
@@ -152,6 +152,16 @@ static int br_handle_local_finish(struct
 	return 0;	 /* process further */
 }
 
+/* Deliver packet to local host only */
+static rx_handler_result_t br_local_only(struct sk_buff *skb)
+{
+	if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
+		    NULL, br_handle_local_finish))
+		return RX_HANDLER_CONSUMED; /* consumed by filter */
+	else
+		return RX_HANDLER_PASS;	/* continue processing */
+}
+
 /*
  * Return NULL if skb is handled
  * note: already called with rcu_read_lock
@@ -206,18 +216,20 @@ rx_handler_result_t br_handle_frame(stru
 				goto forward;
 		}
 
-		/* Deliver packet to local host only */
-		if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
-			    NULL, br_handle_local_finish)) {
-			return RX_HANDLER_CONSUMED; /* consumed by filter */
-		} else {
-			*pskb = skb;
-			return RX_HANDLER_PASS;	/* continue processing */
-		}
+		*pskb = skb;
+		return br_local_only(skb);
 	}
 
 forward:
 	switch (p->state) {
+	case BR_STATE_DISABLED:
+		if (!ether_addr_equal(p->br->dev->dev_addr, dest))
+			goto drop;
+
+		skb->pkt_type = PACKET_HOST;
+		*pskb = skb;
+		return br_local_only(skb);
+
 	case BR_STATE_FORWARDING:
 		rhook = rcu_dereference(br_should_route_hook);
 		if (rhook) {

^ permalink raw reply

* Peak TCP performance
From: Kyle Hubert @ 2013-10-11  2:34 UTC (permalink / raw)
  To: netdev

I'm working on a device, I consistently get 27gbps via netperf-2.6.
UDP reports 54gbps.

TCP is maxed out at 100% CPU on the transmit side. On the receive
side, 40% of the CPU. Thus, I didn't believe I could eek anymore
performance out of it.

However, very oddly, if I enabled bridged mode to forward some
packets, TCP performance goes up to 32gbps. The thing that bothers me
is that transmit CPU utilization drops to 65%, and receive CPU
utilization increases to 60%.

What happens when the device becomes bridged to gain so much
performance? Also, can I now take advantage of the extra CPU time
available to drive more traffic? No tunable seems to have any effect..
(except down)

Any ideas?

Cheers,
-Kyle

^ permalink raw reply

* Re: [PATCH] usbnet: smsc95xx: Add device tree input for MAC address
From: Ming Lei @ 2013-10-11  1:51 UTC (permalink / raw)
  To: Dan Murphy
  Cc: Steve Glendinning, Network Development, Linux Kernel Mailing List,
	linux-usb, mugunthanvnm
In-Reply-To: <5256A1DE.8010709@ti.com>

On Thu, Oct 10, 2013 at 8:47 PM, Dan Murphy <dmurphy@ti.com> wrote:
>
> Is there a board that has 2 built in smsc devices?

I don't know, maybe there isn't, but driver should be generic enough, and
as I said it is a generic problem, and people are discussing it, so suggest
to read previous discussions first before post your V2.

Below link is one I remembered, but I think you can find more:

https://lkml.org/lkml/2013/8/11/100

> This is all dependent on what the dev.id is of udev.

I am not sure why it depends on udev, and udev can't
see the device util it is added to system.

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  0:47 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+rgasL+XShQN20b9PkFNGq38+UoXX4uLogrZ2igAwfDew@mail.gmail.com>

On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>> The combination of two commits
>>>>>>>>
>>>>>>>> commit 8e4e1713e4
>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> commit 2537b4dd0a
>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>
>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>> netdev_unregister notification
>>>>>>>>
>>>>>>>> The following steps:
>>>>>>>>
>>>>>>>>   modprobe openvswitch
>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>
>>>>>>>> are causing multiple warnings:
>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>> index c323567..e9380bd 100644
>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>
>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>> +
>>>>>>>
>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>> and let workq do rest of work.
>>>>>>
>>>>>> isn't it what it's doing?
>>>>>
>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>> rest of vport destroy can be done in workq.
>>>>
>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>> that's dangerous.
>>> why is it dangerous? ovs already had ref to net-device.
>>
>> comment from dev.c:
>>                 /* Notify protocols, that we are about to destroy
>>                    this device. They should clean all the things.
>>                 */
>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>
>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>> the warning,
>> but then do dev_set_promisc(-1) in workqueue?
>>
> promiscuity setting is different issue. If you want to have discussion
> then you can post separate patch for same. Lets fix the warning here.
>
>> well, as a minimum audit of promiscuity will be wrong.
>> ndo_change_rx_flags will be called after ndo_uninit,
>> all sorts of other cleanups are done.
>
> change_rx_flags() checks for UP flag for given device.
>
>> I cannot track all possible scenarios, but it seems much safer to
>> cleanup everything possible
>> as soon as ovs received NETDEV_UNREGISTER event.
>>
>> May be all these risks are worth taking, then please explain what is
>> the problem with the proposed patch?
>>
> Problem is that this is causing layering issues in OVS. dp_notify is
> suppose to work at dp layer. your patch directly calls vport-netdev
> implementation function from dp_notify.
> I could not think of a simple approach that will do this in completely
> clean manner. Therefore I am trying to minimize layering issues. So
> just calling netdev_upper_dev_unlink() looks better than doing
> anything extra.

dp_notify is per net, not per dp.
notifier can only be called for net_device and the first thing it does:
        if (!ovs_is_internal_dev(dev))
                vport = ovs_netdev_get_vport(dev);
where ovs_netdev_get_vport() is defined in vport-netdev.c

once it gets into workq, it checks for:
       if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
             continue;
and
        netdev_vport = netdev_vport_priv(vport);
where netdev_vport_priv() is defined in netdev-vport.h

only then it proceeds with generic ovs_dp_detach_port().

Is that the layering you talking about?

So to minimize layering issues you want to call 'upper_dev_link' from
netdev_create() in vport-netdev.c
and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?

That will look better than calling ovs_netdev_unlink_dev() ?

Having both register+link and unregister+unlink in the same
vport-netdev.c, is not an advantage?

I'm still missing something here.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Josh Triplett @ 2013-10-11  0:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells, edumazet,
	darren, fweisbec, sbw, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <20131010002833.GJ5790@linux.vnet.ibm.com>

On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > 
> > > that.  Constructs like list_del_rcu are much clearer, and not
> > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > Idea.
> > 
> > OK, so you think there is synchronization code.
> > 
> > I will shut up then, no need to waste time.
> 
> As you said earlier, we should at least get rid of the memory barrier
> as long as we are changing the code.
> 
> Josh, what would you suggest as the best way to avoid the memory barrier,
> keep sparse happy, and not be too ugly?

The more I think about it, the more I realize that assigning an __rcu
pointer to an __rcu pointer *without* a memory barrier is a sufficiently
uncommon case that you probably *should* just write an open-coded
assignment.  Just please put a very clear comment right before it.

I'd originally thought it might make sense to have a macro similar to
rcu_assign_pointer, but I just don't think this is a common enough case,
and we don't want people thinking they can use this in general for __rcu
to __rcu assignments (most of which still need a memory barrier).

- Josh Triplett

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: Herbert Xu @ 2013-10-10 23:48 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Cong Wang, Stephen Hemminger
In-Reply-To: <1381435079-15387-1-git-send-email-vyasevic@redhat.com>

On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
> 	bridge: only expire the mdb entry when query is received
> changed the mdb expiration timer to be armed only when QUERY is
> received.  Howerver, this causes issues in an environment where
> the multicast server socket comes and goes very fast while a client
> is trying to send traffic to it.
> 
> The root cause is a race where a sequence of LEAVE followed by REPORT
> messages can race against QUERY messages generated in response to LEAVE.
> The QUERY ends up starting the expiration timer, and that timer can
> potentially expire after the new REPORT message has been received signaling
> the new join operation.  This leads to a significant drop in multicast
> traffic and possible complete stall. 
> 
> The solution is to have REPORT messages update the expiration timer
> on entries that already exist.
> 
> CC: Cong Wang <xiyou.wangcong@gmail.com>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>

Good catch.  Thanks!

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Mark Lord @ 2013-10-10 23:17 UTC (permalink / raw)
  To: Alexander Gordeev, H. Peter Anvin
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <20131010180704.GA15719@dhcp-26-207.brq.redhat.com>

Just to help us all understand "the loop" issue..

Here's an example of driver code which uses the existing MSI-X interfaces,
for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
This is from a new driver I'm working on right now:


static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
{
        xx_disable_all_irqs(dev);
        do {
                if (nvec < 2)
                        xx_prep_for_1_msix_vector(dev);
                else if (nvec < 4)
                        xx_prep_for_2_msix_vectors(dev);
                else if (nvec < 8)
                        xx_prep_for_4_msix_vectors(dev);
                else if (nvec < 16)
                        xx_prep_for_8_msix_vectors(dev);
                else
                        xx_prep_for_16_msix_vectors(dev);
                nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors);
        } while (nvec > 0);

        if (nvec) {
                kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
                dev->num_vectors = 0;
                return nvec;
        }
        return 0;       /* success */
}

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Pravin Shelar @ 2013-10-10 22:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAMEtUuxH24G6pQtSFznrK7w5mdP3za4uQfKKEEaEngW4SJajbA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>> The combination of two commits
>>>>>>>
>>>>>>> commit 8e4e1713e4
>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> commit 2537b4dd0a
>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>
>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>> netdev_unregister notification
>>>>>>>
>>>>>>> The following steps:
>>>>>>>
>>>>>>>   modprobe openvswitch
>>>>>>>   ovs-dpctl add-dp test
>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>
>>>>>>> are causing multiple warnings:
>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>> index c323567..e9380bd 100644
>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>                 return NOTIFY_DONE;
>>>>>>>
>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>> +
>>>>>>
>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>> and let workq do rest of work.
>>>>>
>>>>> isn't it what it's doing?
>>>>
>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>> rest of vport destroy can be done in workq.
>>>
>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>> that's dangerous.
>> why is it dangerous? ovs already had ref to net-device.
>
> comment from dev.c:
>                 /* Notify protocols, that we are about to destroy
>                    this device. They should clean all the things.
>                 */
>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>
> so here you're suggesting to just netdev_upper_dev_unlink() to silence
> the warning,
> but then do dev_set_promisc(-1) in workqueue?
>
promiscuity setting is different issue. If you want to have discussion
then you can post separate patch for same. Lets fix the warning here.

> well, as a minimum audit of promiscuity will be wrong.
> ndo_change_rx_flags will be called after ndo_uninit,
> all sorts of other cleanups are done.

change_rx_flags() checks for UP flag for given device.

> I cannot track all possible scenarios, but it seems much safer to
> cleanup everything possible
> as soon as ovs received NETDEV_UNREGISTER event.
>
> May be all these risks are worth taking, then please explain what is
> the problem with the proposed patch?
>
Problem is that this is causing layering issues in OVS. dp_notify is
suppose to work at dp layer. your patch directly calls vport-netdev
implementation function from dp_notify.
I could not think of a simple approach that will do this in completely
clean manner. Therefore I am trying to minimize layering issues. So
just calling netdev_upper_dev_unlink() looks better than doing
anything extra.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 22:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010145255.16cf7c09@nehalam.linuxnetplumber.net>

On 2013-10-10 11:52 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 22:56:33 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
>> > On Thu, 10 Oct 2013 14:52:50 +0200
>> > Felix Fietkau <nbd@openwrt.org> wrote:
>> > 
>> >> When an ethernet device is enslaved to a bridge, and the bridge STP
>> >> detects loss of carrier (or operational state down), then normally
>> >> packet receiption is blocked.
>> >> 
>> >> This breaks control applications like WPA which maybe expecting to
>> >> receive packets to negotiate to bring link up. The bridge needs to
>> >> block forwarding packets from these disabled ports, but there is no
>> >> hard requirement to not allow local packet delivery.
>> >> 
>> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>> > 
>> > No. This will cause duplicate packets to be delivered.
>> How? I haven't observed any duplications in my tests with this patch.
> 
> The purpose of DISABLED state is to break loops in the bridge tree.
> If packet is flooded by another bridge (Broadcast Unknown or Multicast)
> then it will go down both paths.
Ah, right.

>> > If doing a link layer protocol like WPA then it should be done directly
>> > on the underlying device, not the bridge itself.
>> When the ETH_P_PAE protocol is set for the packet socket inside
>> wpa_supplicant, the bridge steals all packets before the protocol
>> handler gets them.
>> In __netif_receive_skb_core, only ptype_all gets processed before the rx
>> handler, not ptype_base.
> 
> Thought it was using direct type all. Or at least the link local multicast
> address.
> 
> Can you revise it to only accept packets directed to link local multicast
> address or local address, and go through the local_finish handler.
The destination address in WPA EAPOL packets from the AP is set to the
MAC address of the client, which may not be the same as the one from the
bridge.

- Felix

^ permalink raw reply

* Re: [PATCH nf-next] netfilter: xtables: lightweight process control group matching
From: Daniel Borkmann @ 2013-10-10 21:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: pablo-Cap9r6Oaw4JrovVCs/uTlw,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Borkmann
In-Reply-To: <20131009170409.GH22495-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

Hi Tejun,

On 10/09/2013 07:04 PM, Tejun Heo wrote:
> On Tue, Oct 08, 2013 at 10:05:02AM +0200, Daniel Borkmann wrote:
>> Could you elaborate on "Wouldn't it be more logical to implement netfilter
>> rule to match the target cgroup paths?". I don't think (or hope) you mean
>> some string comparison on the dentry path here? :) With our proposal, we
>> have in the network stack's critical path only the following code that is
>> being executed here to match the cgroup ...
>
> Comparing path each time obviously doesn't make sense but you can
> determine the cgroup on config and hold onto the pointer while the
> rule exists.
>
>> ... where ``info->id == skb->sk->sk_cgrp_fwid'' is the actual work, so very
>> lightweight, which is good for high loads (1Gbit/s, 10Gbit/s and beyond), of
>> course. Also, it would be intuitive for admins familiar with other subsystems
>> to just set up and use these cgroup ids in iptabels. I'm not yet quite sure
>> how your suggestion would look like, so you would need to setup some "dummy"
>> subgroups first just to have a path that you can match on?
>
> Currently, it's tricky because we have multiple hierarchies to
> consider and there isn't an efficient way to map from task to cgroup
> on a specific hierarchy.  I'm not sure whether we should add another
> mapping table in css_set or just allow using path matching on the
> unified hierarchy.  The latter should be cleaner and easier but more
> restrictive.
>
> Anyways, it isn't manageable in the long term to keep adding
> controllers simply to tag tasks differently.  If we want to do this,
> let's please work on a way to match a task's cgroup affiliation
> efficiently.

Here's a draft (!) of an alternative w/o using a new cgroup subsystem. I've
tested it and it would basically work this way as well. I've used serial_nr
as an identifier of cgroups here, as we'd actually want the xt_cgroup_info
structure as small as possible for rule sets (since they can be large and
are flat-copied to kernel). Logic in cgroup_mt() needs to change a bit as
we cannot hold css_set_lock here. Anyway, iptables would match here against
cgroup.serial (that can probably also be widely used otherwise). The way we
do it here is to cache the corresponding task in socket structure, and walk
all cgroups belonging to that task, comparing if serial_nr's match.

Still, I think my original patch is more clean, user friendly and intuitive,
and has a better performance (main work is one comparison instead of walking
all corresponding cgroups), so I'd still consider this the better tradeoff
to go with, I think netfilter is a large enough candidate for a subsys. ;)

Thanks,
Daniel

Draft patch:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3561d30..3c5e953 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -399,6 +399,26 @@ struct cgroup_map_cb {
  };

  /*
+ * A cgroup can be associated with multiple css_sets as different tasks may
+ * belong to different cgroups on different hierarchies.  In the other
+ * direction, a css_set is naturally associated with multiple cgroups.
+ * This M:N relationship is represented by the following link structure
+ * which exists for each association and allows traversing the associations
+ * from both sides.
+ */
+struct cgrp_cset_link {
+	/* the cgroup and css_set this link associates */
+	struct cgroup		*cgrp;
+	struct css_set		*cset;
+
+	/* list of cgrp_cset_links anchored at cgrp->cset_links */
+	struct list_head	cset_link;
+
+	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
+	struct list_head	cgrp_link;
+};
+
+/*
   * struct cftype: handler definitions for cgroup control files
   *
   * When reading/writing to a file:
diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..b035ba3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -406,6 +406,9 @@ struct sock {
  	__u32			sk_mark;
  	u32			sk_classid;
  	struct cg_proto		*sk_cgrp;
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	struct task_struct	*sk_task_cached;
+#endif
  	void			(*sk_state_change)(struct sock *sk);
  	void			(*sk_data_ready)(struct sock *sk, int bytes);
  	void			(*sk_write_space)(struct sock *sk);
@@ -2098,6 +2101,22 @@ static inline gfp_t gfp_any(void)
  	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
  }

+static inline struct task_struct *sock_task(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	return sk->sk_task_cached;
+#else
+	return NULL;
+#endif
+}
+
+static inline void sock_task_set(struct sock *sk, struct task_struct *task)
+{
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+	sk->sk_task_cached = task;
+#endif
+}
+
  static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
  {
  	return noblock ? 0 : sk->sk_rcvtimeo;
diff --git a/include/uapi/linux/netfilter/Kbuild b/include/uapi/linux/netfilter/Kbuild
index 1749154..94a4890 100644
--- a/include/uapi/linux/netfilter/Kbuild
+++ b/include/uapi/linux/netfilter/Kbuild
@@ -37,6 +37,7 @@ header-y += xt_TEE.h
  header-y += xt_TPROXY.h
  header-y += xt_addrtype.h
  header-y += xt_bpf.h
+header-y += xt_cgroup.h
  header-y += xt_cluster.h
  header-y += xt_comment.h
  header-y += xt_connbytes.h
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
new file mode 100644
index 0000000..c59ff53
--- /dev/null
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -0,0 +1,11 @@
+#ifndef _UAPI_XT_CGROUP_H
+#define _UAPI_XT_CGROUP_H
+
+#include <linux/types.h>
+
+struct xt_cgroup_info {
+	__u64 serial_nr;
+	__u32 invert;
+};
+
+#endif /* _UAPI_XT_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2418b6e..1f9dc5b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -357,26 +357,6 @@ static void cgroup_release_agent(struct work_struct *work);
  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
  static void check_for_release(struct cgroup *cgrp);

-/*
- * A cgroup can be associated with multiple css_sets as different tasks may
- * belong to different cgroups on different hierarchies.  In the other
- * direction, a css_set is naturally associated with multiple cgroups.
- * This M:N relationship is represented by the following link structure
- * which exists for each association and allows traversing the associations
- * from both sides.
- */
-struct cgrp_cset_link {
-	/* the cgroup and css_set this link associates */
-	struct cgroup		*cgrp;
-	struct css_set		*cset;
-
-	/* list of cgrp_cset_links anchored at cgrp->cset_links */
-	struct list_head	cset_link;
-
-	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
-	struct list_head	cgrp_link;
-};
-
  /* The default css_set - used by init and its children prior to any
   * hierarchies being mounted. It contains a pointer to the root state
   * for each subsystem. Also used to anchor the list of css_sets. Not
@@ -4163,6 +4143,12 @@ static int cgroup_clone_children_write(struct cgroup_subsys_state *css,
  	return 0;
  }

+static u64 cgroup_serial_read(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	return css->cgroup->serial_nr;
+}
+
  static struct cftype cgroup_base_files[] = {
  	{
  		.name = "cgroup.procs",
@@ -4187,6 +4173,11 @@ static struct cftype cgroup_base_files[] = {
  		.flags = CFTYPE_ONLY_ON_ROOT,
  		.read_seq_string = cgroup_sane_behavior_show,
  	},
+	{
+		.name = "cgroup.serial",
+		.read_u64 = cgroup_serial_read,
+		.mode = S_IRUGO,
+	},

  	/*
  	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
diff --git a/net/core/scm.c b/net/core/scm.c
index b442e7e..9a40ab0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -292,6 +292,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
  		if (sock) {
  			sock_update_netprioidx(sock->sk);
  			sock_update_classid(sock->sk);
+			sock_task_set(sock->sk, current);
  		}
  		fd_install(new_fd, get_file(fp[i]));
  	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 2bd9b3f..ab53afc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1359,6 +1359,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
  		sk->sk_prot = sk->sk_prot_creator = prot;
  		sock_lock_init(sk);
  		sock_net_set(sk, get_net(net));
+		sock_task_set(sk, current);
  		atomic_set(&sk->sk_wmem_alloc, 1);

  		sock_update_classid(sk);
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 6e839b6..d276ff4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -806,6 +806,14 @@ config NETFILTER_XT_MATCH_BPF

  	  To compile it as a module, choose M here.  If unsure, say N.

+config NETFILTER_XT_MATCH_CGROUP
+	tristate '"control group" match support'
+	depends on NETFILTER_ADVANCED
+	depends on CGROUPS
+	---help---
+	Socket/process control group matching allows you to match locally
+	generated packets based on which control group processes belong to.
+
  config NETFILTER_XT_MATCH_CLUSTER
  	tristate '"cluster" match support'
  	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index c3a0a12..12f014f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_MULTIPORT) += xt_multiport.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_NFACCT) += xt_nfacct.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OSF) += xt_osf.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP) += xt_cgroup.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
  obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
new file mode 100644
index 0000000..f04cba8
--- /dev/null
+++ b/net/netfilter/xt_cgroup.c
@@ -0,0 +1,79 @@
+#include <linux/skbuff.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/cgroup.h>
+#include <linux/fdtable.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_cgroup.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>");
+MODULE_DESCRIPTION("Xtables: process control group matching");
+MODULE_ALIAS("ipt_cgroup");
+MODULE_ALIAS("ip6t_cgroup");
+
+static int cgroup_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_cgroup_info *info = par->matchinfo;
+
+	return (info->invert & ~1) ? -EINVAL : 0;
+}
+
+static bool
+cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_cgroup_info *info = par->matchinfo;
+	struct cgrp_cset_link *link, *link_tmp;
+	const struct sock *sk = skb->sk;
+	struct task_struct *task;
+	struct css_set *cset;
+	bool found = false;
+
+	if (sk == NULL)
+		return false;
+
+	task = sock_task(sk);
+	if (task == NULL)
+		return false;
+
+	rcu_read_lock();
+	/* XXX: read_lock(&css_set_lock); */
+	cset = task_css_set(task);
+	list_for_each_entry_safe(link, link_tmp, &cset->cgrp_links, cgrp_link) {
+		struct cgroup *cgrp = link->cgrp;
+		if (cgrp->serial_nr == info->serial_nr) {
+			found = true;
+			break;
+		}
+	}
+	/* XXX: read_unlock(&css_set_lock); */
+	rcu_read_unlock();
+
+	return found ^ info->invert;
+}
+
+static struct xt_match cgroup_mt_reg __read_mostly = {
+	.name       = "cgroup",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.checkentry = cgroup_mt_check,
+	.match      = cgroup_mt,
+	.matchsize  = sizeof(struct xt_cgroup_info),
+	.me         = THIS_MODULE,
+	.hooks      = (1 << NF_INET_LOCAL_OUT) |
+	              (1 << NF_INET_POST_ROUTING),
+};
+
+static int __init cgroup_mt_init(void)
+{
+	return xt_register_match(&cgroup_mt_reg);
+}
+
+static void __exit cgroup_mt_exit(void)
+{
+	xt_unregister_match(&cgroup_mt_reg);
+}
+
+module_init(cgroup_mt_init);
+module_exit(cgroup_mt_exit);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 21:52 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <52571481.5010907@openwrt.org>

On Thu, 10 Oct 2013 22:56:33 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> > On Thu, 10 Oct 2013 14:52:50 +0200
> > Felix Fietkau <nbd@openwrt.org> wrote:
> > 
> >> When an ethernet device is enslaved to a bridge, and the bridge STP
> >> detects loss of carrier (or operational state down), then normally
> >> packet receiption is blocked.
> >> 
> >> This breaks control applications like WPA which maybe expecting to
> >> receive packets to negotiate to bring link up. The bridge needs to
> >> block forwarding packets from these disabled ports, but there is no
> >> hard requirement to not allow local packet delivery.
> >> 
> >> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> >> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> > 
> > No. This will cause duplicate packets to be delivered.
> How? I haven't observed any duplications in my tests with this patch.

The purpose of DISABLED state is to break loops in the bridge tree.
If packet is flooded by another bridge (Broadcast Unknown or Multicast)
then it will go down both paths.

> 
> > If doing a link layer protocol like WPA then it should be done directly
> > on the underlying device, not the bridge itself.
> When the ETH_P_PAE protocol is set for the packet socket inside
> wpa_supplicant, the bridge steals all packets before the protocol
> handler gets them.
> In __netif_receive_skb_core, only ptype_all gets processed before the rx
> handler, not ptype_base.

Thought it was using direct type all. Or at least the link local multicast
address.

Can you revise it to only accept packets directed to link local multicast
address or local address, and go through the local_finish handler.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 20:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010133646.1bdd42c1@nehalam.linuxnetplumber.net>

On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 14:52:50 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> When an ethernet device is enslaved to a bridge, and the bridge STP
>> detects loss of carrier (or operational state down), then normally
>> packet receiption is blocked.
>> 
>> This breaks control applications like WPA which maybe expecting to
>> receive packets to negotiate to bring link up. The bridge needs to
>> block forwarding packets from these disabled ports, but there is no
>> hard requirement to not allow local packet delivery.
>> 
>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> 
> No. This will cause duplicate packets to be delivered.
How? I haven't observed any duplications in my tests with this patch.

> If doing a link layer protocol like WPA then it should be done directly
> on the underlying device, not the bridge itself.
When the ETH_P_PAE protocol is set for the packet socket inside
wpa_supplicant, the bridge steals all packets before the protocol
handler gets them.
In __netif_receive_skb_core, only ptype_all gets processed before the rx
handler, not ptype_base.


- Felix

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-10 20:48 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+oXJQoPG4io2JOn1zBbf7XtPKaBE+vHPNKXm2TBygvB5Q@mail.gmail.com>

On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> The combination of two commits
>>>>>>
>>>>>> commit 8e4e1713e4
>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>
>>>>>> and
>>>>>>
>>>>>> commit 2537b4dd0a
>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>
>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>> netdev_unregister notification
>>>>>>
>>>>>> The following steps:
>>>>>>
>>>>>>   modprobe openvswitch
>>>>>>   ovs-dpctl add-dp test
>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>   ovs-dpctl add-if test tap1
>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>
>>>>>> are causing multiple warnings:
>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>> index c323567..e9380bd 100644
>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>                 return NOTIFY_DONE;
>>>>>>
>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>> +
>>>>>
>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>> and let workq do rest of work.
>>>>
>>>> isn't it what it's doing?
>>>
>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>> rest of vport destroy can be done in workq.
>>
>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>> that's dangerous.
> why is it dangerous? ovs already had ref to net-device.

comment from dev.c:
                /* Notify protocols, that we are about to destroy
                   this device. They should clean all the things.
                */
                call_netdevice_notifiers(NETDEV_UNREGISTER, dev);

so here you're suggesting to just netdev_upper_dev_unlink() to silence
the warning,
but then do dev_set_promisc(-1) in workqueue?

well, as a minimum audit of promiscuity will be wrong.
ndo_change_rx_flags will be called after ndo_uninit,
all sorts of other cleanups are done.
I cannot track all possible scenarios, but it seems much safer to
cleanup everything possible
as soon as ovs received NETDEV_UNREGISTER event.

May be all these risks are worth taking, then please explain what is
the problem with the proposed patch?

Thanks
Alex

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 20:36 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <1381409570-1892-1-git-send-email-nbd@openwrt.org>

On Thu, 10 Oct 2013 14:52:50 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> When an ethernet device is enslaved to a bridge, and the bridge STP
> detects loss of carrier (or operational state down), then normally
> packet receiption is blocked.
> 
> This breaks control applications like WPA which maybe expecting to
> receive packets to negotiate to bring link up. The bridge needs to
> block forwarding packets from these disabled ports, but there is no
> hard requirement to not allow local packet delivery.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Felix Fietkau <nbd@openwrt.org>

No. This will cause duplicate packets to be delivered.

If doing a link layer protocol like WPA then it should be done directly
on the underlying device, not the bridge itself.

^ permalink raw reply

* [PATCH] bridge: update mdb expiration timer upon reports.
From: Vlad Yasevich @ 2013-10-10 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Yasevich, Cong Wang, Herbert Xu, Stephen Hemminger

commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
	bridge: only expire the mdb entry when query is received
changed the mdb expiration timer to be armed only when QUERY is
received.  Howerver, this causes issues in an environment where
the multicast server socket comes and goes very fast while a client
is trying to send traffic to it.

The root cause is a race where a sequence of LEAVE followed by REPORT
messages can race against QUERY messages generated in response to LEAVE.
The QUERY ends up starting the expiration timer, and that timer can
potentially expire after the new REPORT message has been received signaling
the new join operation.  This leads to a significant drop in multicast
traffic and possible complete stall. 

The solution is to have REPORT messages update the expiration timer
on entries that already exist.

CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/bridge/br_multicast.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index d1c5786..1085f21 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -611,6 +611,9 @@ rehash:
 		break;
 
 	default:
+		/* If we have an existing entry, update it's expire timer */
+		mod_timer(&mp->timer,
+			  jiffies + br->multicast_membership_interval);
 		goto out;
 	}
 
@@ -680,8 +683,12 @@ static int br_multicast_add_group(struct net_bridge *br,
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
-		if (p->port == port)
+		if (p->port == port) {
+			/* We already have a portgroup, update the timer.  */
+			mod_timer(&p->timer,
+				  jiffies + br->multicast_membership_interval);
 			goto out;
+		}
 		if ((unsigned long)p->port < (unsigned long)port)
 			break;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2013-10-10 19:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1381387271-29605-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  9 Oct 2013 23:40:58 -0700

> This series contains updates to i40e only.
> 
> Alex provides the majority of the patches against i40e, where he does
> cleanup of the Tx and RX queues and to align the code with the known
> good Tx/Rx queue code in the ixgbe driver.
> 
> Anjali provides an i40e patch to update link events to not print to
> the log until the device is administratively up.
> 
> Catherine provides a patch to update the driver version.

Pulled, thanks a lot Jeff.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Paul E. McKenney @ 2013-10-10 19:05 UTC (permalink / raw)
  To: Eric Dumazet, Josh Triplett, linux-kernel, mingo, laijs, dipankar,
	akpm, mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev
In-Reply-To: <20131010020422.GB24368@order.stressinduktion.org>

On Thu, Oct 10, 2013 at 04:04:22AM +0200, Hannes Frederic Sowa wrote:
> On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > > 
> > > > that.  Constructs like list_del_rcu are much clearer, and not
> > > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > > Idea.
> > > 
> > > OK, so you think there is synchronization code.
> > > 
> > > I will shut up then, no need to waste time.
> > 
> > As you said earlier, we should at least get rid of the memory barrier
> > as long as we are changing the code.
> 
> Interesting thread!
> 
> Sorry to chime in and asking a question:
> 
> Why do we need an ACCESS_ONCE here if rcu_assign_pointer can do without one?
> In other words I wonder why rcu_assign_pointer is not a static inline function
> to use the sequence point in argument evaluation (if I remember correctly this
> also holds for inline functions) to not allow something like this:
> 
> E.g. we want to publish which lock to take first to prevent an ABBA problem
> (extreme example):
> 
> rcu_assign_pointer(lockptr, min(lptr1, lptr2));
> 
> Couldn't a compiler spill the lockptr memory location as a temporary buffer
> if the compiler is under register pressure? (yes, this seems unlikely if we
> flushed out most registers to memory because of the barrier, but still... ;) )
> 
> This seems to be also the case if we publish a multi-dereferencing pointers
> e.g. ptr->ptr->ptr.

IIRC, sequence points only confine volatile accesses.  For non-volatile
accesses, the so-called "as-if rule" allows compiler writers to do some
surprisingly global reordering.

The reason that rcu_assign_pointer() isn't an inline function is because
it needs to be type-generic, in other words, it needs to be OK to use
it on any type of pointers as long as the C types of the two pointers
match (the sparse types can vary a bit).

One of the reasons for wanting a volatile cast in rcu_assign_pointer() is
to prevent compiler mischief such as you described in your last two
paragraphs.  That said, it would take a very brave compiler to pull
a pointer-referenced memory location into a register and keep it there.
Unfortunately, increasing compiler bravery seems to be a solid long-term
trend.

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH net-next] inet: rename ir_loc_port to ir_num
From: David Miller @ 2013-10-10 18:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1381388677.4971.65.camel@edumazet-glaptop.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 10 Oct 2013 00:04:37 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
> I forgot that the two ports in sock_common do not have same byte order :
> 
> skc_dport is __be16 (network order), but skc_num is __u16 (host order)
> 
> So sparse complains because ir_loc_port (mapped into skc_num) is
> considered as __u16 while it should be __be16
> 
> Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
> and perform appropriate htons/ntohs conversions.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Wu Fengguang <fengguang.wu@intel.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Pravin Shelar @ 2013-10-10 18:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CAMEtUuzP1b_VNqP1texkrQ7kvCRxEukUoRz=Dj-pMX9gsO9AWw@mail.gmail.com>

On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> The combination of two commits
>>>>>
>>>>> commit 8e4e1713e4
>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>
>>>>> and
>>>>>
>>>>> commit 2537b4dd0a
>>>>> ("openvswitch:: link upper device for port devices")
>>>>>
>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>> netdev_unregister notification
>>>>>
>>>>> The following steps:
>>>>>
>>>>>   modprobe openvswitch
>>>>>   ovs-dpctl add-dp test
>>>>>   ip tuntap add dev tap1 mode tap
>>>>>   ovs-dpctl add-if test tap1
>>>>>   ip tuntap del dev tap1 mode tap
>>>>>
>>>>> are causing multiple warnings:
>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>> index c323567..e9380bd 100644
>>>>> --- a/net/openvswitch/dp_notify.c
>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>                 return NOTIFY_DONE;
>>>>>
>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>> +
>>>>
>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>> and let workq do rest of work.
>>>
>>> isn't it what it's doing?
>>
>> I meant just call netdev_upper_dev_unlink() here in event handler and
>> rest of vport destroy can be done in workq.
>
> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
> that's dangerous.
why is it dangerous? ovs already had ref to net-device.

> If that is acceptable, then there was no reason to link them in the first place.
>
I do not see any harm in linking device hierarchy for ovs.

> notifier asks to unregister. imo the only acceptable deferred task
> here is to delay dev_put,
> since ovs structures are still referring to it.

^ permalink raw reply

* Re: [PATCH] carl9170: fix leaks at failure path in carl9170_usb_probe()
From: Christian Lamparter @ 2013-10-10 18:17 UTC (permalink / raw)
  To: John W. Linville
  Cc: Alexey Khoroshilov, Fabio Estevam, linux-wireless,
	netdev@vger.kernel.org, linux-kernel, ldv-project
In-Reply-To: <20131010175952.GG2691@tuxdriver.com>

On Thursday, October 10, 2013 01:59:52 PM John W. Linville wrote:
> On Sat, Sep 28, 2013 at 01:16:20AM -0400, Alexey Khoroshilov wrote:
> > On 28.09.2013 00:17, Fabio Estevam wrote:
> > >On Sat, Sep 28, 2013 at 12:51 AM, Alexey Khoroshilov
> > ><khoroshilov@ispras.ru> wrote:
> > >
> > >>-       return request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> > >>+       err = request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> > >>                 &ar->udev->dev, GFP_KERNEL, ar, carl9170_usb_firmware_step2);
> > >>+       if (err) {
> > >>+               usb_put_dev(udev);
> > >>+               usb_put_dev(udev);
> > >You are doing the same free twice.
> > Yes, because it was get twice.
> > >I guess you meant to also free: usb_put_dev(ar->udev)
> > udev and ar->udev are equal, so technically the patch is correct.
> > 
> > I agree that there is some inconsistency, but I would prefer to fix
> > it at usb_get_dev() side with a comment about reasons for the double
> > get.
> 
> What is the reason for the double get?

The idea is:
One (extra) reference protects the asynchronous firmware loader callback
from disappearing "udev".

Regards,
Chr
 
 

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Alexander Gordeev @ 2013-10-10 18:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <5256D5AB.4050105@zytor.com>

On Thu, Oct 10, 2013 at 09:28:27AM -0700, H. Peter Anvin wrote:
> On 10/10/2013 03:17 AM, Alexander Gordeev wrote:
> > On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote:
> > 
> > Ok, this suggestion sounded in one or another form by several people.
> > What about name it pcim_enable_msix_range() and wrap in couple more
> > helpers to complete an API:
> > 
> > int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where minvec >= result <= nvec
> > 
> > int pcim_enable_msix(pdev, msix_entries, nvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where 1 >= result <= nvec 
> > 
> > int pcim_enable_msix_exact(pdev, msix_entries, nvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where result == nvec
> > 
> > The latter's return value seems odd, but I can not help to make
> > it consistent with the first two.
> > 
> 
> Is there a reason for the wrappers, as opposed to just specifying either
> 1 or nvec as the minimum?

The wrappers are more handy IMO.

I.e. can imagine people start struggling to figure out what minvec to provide:
1 or 0? Why 1? Oh.. okay.. Or should we tolerate 0 (as opposite to -ERANGE)?

Well, do not know.. pcim_enable_msix(pdev, msix_entries, nvec, nvec) is 
less readable for me than just pcim_enable_msix_exact(pdev, msix_entries,
nvec).

> 	-hpa

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply

* Re: [PATCH] carl9170: fix leaks at failure path in carl9170_usb_probe()
From: John W. Linville @ 2013-10-10 17:59 UTC (permalink / raw)
  To: Alexey Khoroshilov
  Cc: Fabio Estevam, Christian Lamparter, linux-wireless,
	netdev@vger.kernel.org, linux-kernel, ldv-project
In-Reply-To: <52466624.4040106@ispras.ru>

On Sat, Sep 28, 2013 at 01:16:20AM -0400, Alexey Khoroshilov wrote:
> On 28.09.2013 00:17, Fabio Estevam wrote:
> >On Sat, Sep 28, 2013 at 12:51 AM, Alexey Khoroshilov
> ><khoroshilov@ispras.ru> wrote:
> >
> >>-       return request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> >>+       err = request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> >>                 &ar->udev->dev, GFP_KERNEL, ar, carl9170_usb_firmware_step2);
> >>+       if (err) {
> >>+               usb_put_dev(udev);
> >>+               usb_put_dev(udev);
> >You are doing the same free twice.
> Yes, because it was get twice.
> >I guess you meant to also free: usb_put_dev(ar->udev)
> udev and ar->udev are equal, so technically the patch is correct.
> 
> I agree that there is some inconsistency, but I would prefer to fix
> it at usb_get_dev() side with a comment about reasons for the double
> get.

What is the reason for the double get?

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox