Netdev List


* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-28  9:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1408044.6czCGhbHJH@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:29 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> But it doesn't have to walk the children.  Moreover, with counters it only

Yeah, I got it; that is the advantage of the counter, but it comes with an
extra 'int' field introduced in 'struct device'.

> needs to walk the whole path if all devices in it need to be updated.  For
> example, if you call pm_runtime_set_memalloc_noio(dev, true) for a device
> whose parent's counter is greater than zero already, you don't need to
> walk the path above the parent.

We can still do that with the flag only: pm_runtime_set_memalloc_noio(dev, true)
can return immediately if the flag of 'dev' or of one of its parents is already
true.

But considering that pm_runtime_set_memalloc_noio(dev, false) is only
called in a very infrequent path (network/block device->remove()), the
introduced cost does not look worth the obtained advantage.

So could you accept not introducing the counter? I will then update the patch
with the above improvement you suggested.

Thanks,
--
Ming Lei
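
[Editor's sketch] The flag-only early return described above can be modelled in
userspace C. This is a minimal sketch and not the kernel patch: `struct dev`,
its fields and `set_memalloc_noio_true()` are hypothetical names, and the model
assumes that a flagged node always implies flagged ancestors.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: only what the sketch needs. */
struct dev {
	struct dev *parent;
	bool memalloc_noio;
};

/*
 * Flag-only variant of the "set" direction: mark dev and its ancestors,
 * stopping at the first node that is already marked. Under the stated
 * assumption, everything above that node was marked by an earlier call.
 * Returns the number of nodes that were actually updated.
 */
static int set_memalloc_noio_true(struct dev *dev)
{
	int updated = 0;

	for (; dev && !dev->memalloc_noio; dev = dev->parent) {
		dev->memalloc_noio = true;
		updated++;
	}
	return updated;
}
```

The sketch covers only the (dev, true) direction. Clearing the flag without a
counter still needs to look at sibling devices, which is the cost being
debated in this thread.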

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* RE: Re: RTL 8169  linux driver question
From: David Laight @ 2012-11-28  9:33 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Stéphane ANCELOT, netdev, sancelot, Hayes Wang
In-Reply-To: <20121127224605.GA10228@electric-eye.fr.zoreil.com>

> +static struct rtl_coalesce_scale *rtl_coalesce_scale(struct net_device *dev)
> +{
...
> +}
> +
> +static int rtl_get_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
...
> +}
> +
> +static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
> +{
...
> +}

Those functions are horrid - so horrid I've deleted the contents.

	David


^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-28  9:31 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: Alan Stern, netdev, linux-usb
In-Reply-To: <CAKh2mn7+0itrJNxmh6Wv9njGaNOY48N_zfJm6=gTeMoF2VzW4g@mail.gmail.com>

Steve Glendinning <steve@shawell.net> writes:

>>> udev->do_remote_wakeup is set in choose_wakeup() in
>>> drivers/usb/core/driver.c.  AFAICS it is always set as long as
>>> device_may_wakeup(&udev->dev) is true.
>>
>> That's right.  But is device_may_wakeup(&udev->dev) true?
>>
>> By default it wouldn't be.  The normal way to set it is for the user or
>> a program to do:
>>
>>         echo enabled >/sys/bus/usb/devices/.../power/wakeup
>>
>> Of course, a driver could disregard the user's choice and set the flag
>> by itself.
>
> If I set that from userspace the system is able to resume, but I can't
> work out how to successfully set this from the driver.  I believe the
> driver should be overriding this as if the user has asked for the
> device to wake on lan they're expecting this to resume the system.
>
> I've tried placing various combinations of device_set_wakeup_capable
> and device_set_wakeup_enable in different places (bind, suspend), but
> it still doesn't allow the device to resume from suspend.  How should
> I do this?

I may be completely wrong here, but this is how I believe it is supposed
to work...  The device can be suspended for two possible reasons:

1) system suspend.  If the user wants the device to wake the system, then
   (s)he will do

       echo enabled >/sys/bus/usb/devices/.../power/wakeup

   If this isn't set, then there is no reason for the driver to request
   remote wakeup while the system is suspended.

2) autosuspend.  Any interface driver needing remote wakeup will set
   intf->needs_remote_wakeup, which makes autosuspend_check() set
   udev->do_remote_wakeup

If all my guesses and assumptions are right, then you want to set
intf->needs_remote_wakeup unconditionally.  This will make the USB core
enable remote wakeup on autosuspend.

Remote wakeup will not be enabled on system suspend unless the user (or
a userspace program on the user's behalf) has requested it.

Bjørn
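
[Editor's sketch] The two cases above can be restated as a small decision
function. This only mirrors the description in this mail; it is not the USB
core code, and `struct usb_model`, `enum suspend_reason` and
`want_remote_wakeup()` are hypothetical names.

```c
#include <stdbool.h>

/* Why the device is being suspended (model of the two cases above). */
enum suspend_reason { SYSTEM_SUSPEND, AUTOSUSPEND };

/* Hypothetical model of the two inputs discussed in the thread. */
struct usb_model {
	bool may_wakeup;          /* user wrote "enabled" to power/wakeup */
	bool needs_remote_wakeup; /* interface driver asked for it */
};

/* Should remote wakeup be requested for this suspend? */
static bool want_remote_wakeup(const struct usb_model *m,
			       enum suspend_reason why)
{
	if (why == SYSTEM_SUSPEND)
		return m->may_wakeup;       /* case 1: user policy decides */
	return m->needs_remote_wakeup;      /* case 2: driver request decides */
}
```

Under this model, a driver that sets needs_remote_wakeup unconditionally gets
wakeup on autosuspend, while system suspend still follows the sysfs setting.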

^ permalink raw reply

* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Rafael J. Wysocki @ 2012-11-28  9:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <CACVXFVODD9fRqQc3kR58OJm3ERgBWojnx=790xGwu=MPGaSmMA@mail.gmail.com>

On Wednesday, November 28, 2012 12:34:36 PM Ming Lei wrote:
> On Wed, Nov 28, 2012 at 5:19 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > Please use counters instead of walking the whole path every time.  Ie. in
> > addition to the flag add a counter to store the number of the device's
> > children having that flag set.
> 
> Even though counter is added, walking the whole path can't be avoided too,
> and may be a explicit walking or recursion, because pm_runtime_set_memalloc_noio
> is required to set or clear the flag(or increase/decrease the counter) of
> devices in the whole path.

But it doesn't have to walk the children.  Moreover, with counters it only
needs to walk the whole path if all devices in it need to be updated.  For
example, if you call pm_runtime_set_memalloc_noio(dev, true) for a device
whose parent's counter is greater than zero already, you don't need to
walk the path above the parent.

Thanks,
Rafael
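
[Editor's sketch] One possible reading of the counter proposal, modelled in
userspace C. All names (`struct dev`, `noio`, `noio_kids`,
`set_memalloc_noio()`) are hypothetical, and the exact bookkeeping is an
assumption: each node counts its direct children that are "active", meaning
they have the flag set themselves or have an active child.

```c
#include <stdbool.h>
#include <stddef.h>

struct dev {
	struct dev *parent;
	bool noio;              /* set explicitly on this device */
	unsigned int noio_kids; /* direct children that are active */
};

static bool active(const struct dev *d)
{
	return d->noio || d->noio_kids > 0;
}

/*
 * Set or clear the flag on dev, then walk up only while the state of the
 * node just handled actually changed. Children are never visited.
 * Returns the number of ancestors whose counter was touched.
 */
static int set_memalloc_noio(struct dev *dev, bool enable)
{
	bool was_active = active(dev);
	int steps = 0;

	dev->noio = enable;
	while (dev->parent && active(dev) != was_active) {
		struct dev *p = dev->parent;
		bool now = active(dev);

		was_active = active(p);  /* parent's state before the update */
		if (now)
			p->noio_kids++;
		else
			p->noio_kids--;
		dev = p;
		steps++;
	}
	return steps;
}
```

With a root, a hub and two leaves, enabling the first leaf touches both
ancestors, while enabling the second leaf stops at the hub because the hub's
counter is already greater than zero.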


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


^ permalink raw reply

* Re: [PATCH v3 8/7] pppoatm: fix missing wakeup in pppoatm_send()
From: David Woodhouse @ 2012-11-28  9:24 UTC (permalink / raw)
  To: Krzysztof Mazur; +Cc: chas williams - CONTRACTOR, netdev, linux-kernel, davem
In-Reply-To: <20121128081237.GA30488@shrek.podlesie.net>

On Wed, 2012-11-28 at 09:12 +0100, Krzysztof Mazur wrote:
> On Wed, Nov 28, 2012 at 12:48:17AM +0000, David Woodhouse wrote:
> > On Tue, 2012-11-27 at 10:23 -0500, chas williams - CONTRACTOR wrote:
> > > yes, but dont call it 8/7 since that doesnt make sense.
> > 
> > It made enough sense when it was a single patch appended to a thread of
> > 7 other patches from Krzysztof. But now it's all got a little more
> > complex, so I've tried to collect together the latest version of
> > everything we've discussed:
> 
> There was also discussion about patch 9/7 "pppoatm: wakeup after ATM
> unlock only when it's needed".

True. Is that really necessary? How often is the lock actually taken? Is
it once per packet that PPP sends (which is mostly just LCP
echo/response during an active connection)? And does that really warrant
the optimisation?

This is a tasklet that we used to run after absolutely *every* packet,
remember. Optimising *that* made sense, but I'm less sure it's worth the
added complexity for this case. As I have a vague recollection that we
decided we couldn't use the existing BLOCKED bit for it... or can we? 

Can this work? Feel free to replace that test_bit() and the
corresponding comment, with a test_and_clear_bit() and a new comment
explaining *why* it's safe... while I go make another cup of tea.

diff --git a/net/atm/pppoatm.c b/net/atm/pppoatm.c
index 446a7f0..da58863 100644
--- a/net/atm/pppoatm.c
+++ b/net/atm/pppoatm.c
@@ -113,7 +113,13 @@ static void pppoatm_release_cb(struct atm_vcc *atmvcc)
 {
 	struct pppoatm_vcc *pvcc = atmvcc_to_pvcc(atmvcc);
 
-	tasklet_schedule(&pvcc->wakeup_tasklet);
+	/*
+	 * We can't clear it here because I haven't had enough caffeine
+	 * this morning to deal with the concurrency issues. Just leave
+	 * it set, and let pppoatm_pop() clear it later.
+	 */
+	if (test_bit(BLOCKED, &pvcc->blocked))
+		tasklet_schedule(&pvcc->wakeup_tasklet);
 	if (pvcc->old_release_cb)
 		pvcc->old_release_cb(atmvcc);
 }
@@ -342,6 +348,12 @@ static int pppoatm_send(struct ppp_channel *chan, struct sk_buff *skb)
 	bh_unlock_sock(sk_atm(vcc));
 	return ret;
 nospace:
+	/*
+	 * Needs to happen (and be flushed, hence test_and_) before we unlock
+	 * the socket. It needs to be seen by the time our ->release_cb gets
+	 * called.
+	 */
+	test_and_set_bit(BLOCKED, &pvcc->blocked);
 	bh_unlock_sock(sk_atm(vcc));
 	/*
 	 * We don't have space to send this SKB now, but we might have


> > David Woodhouse (5):
> >       atm: Add release_cb() callback to vcc
> >       pppoatm: fix missing wakeup in pppoatm_send()
> >       br2684: fix module_put() race
> 
> for the three patches above:
> 
> Acked-by: Krzysztof Mazur <krzysiek@podlesie.net>

Ta.
-- 
dwmw2
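
[Editor's sketch] The intended life cycle of the BLOCKED bit in the diff above
can be modelled with C11 atomics. This is a single-threaded model with
hypothetical names (`struct pvcc_model`, `send_nospace()`, `release_cb()`,
`pop()`); it shows who sets, tests and clears the bit, and it does not settle
the concurrency question raised in the mail.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct pvcc_model {
	atomic_bool blocked;   /* stands in for the BLOCKED bit */
	int wakeups_scheduled; /* stands in for tasklet_schedule() calls */
};

static void pvcc_init(struct pvcc_model *p)
{
	atomic_init(&p->blocked, false);
	p->wakeups_scheduled = 0;
}

/* Send path ran out of space: publish BLOCKED before the lock is dropped. */
static void send_nospace(struct pvcc_model *p)
{
	atomic_exchange(&p->blocked, true); /* like test_and_set_bit() */
}

/* Release callback: schedule a wakeup only if a sender is blocked.
 * The bit is left set, as in the diff's test_bit() variant. */
static void release_cb(struct pvcc_model *p)
{
	if (atomic_load(&p->blocked))
		p->wakeups_scheduled++;
}

/* Pop path: clear the bit and report whether it had been set. */
static bool pop(struct pvcc_model *p)
{
	return atomic_exchange(&p->blocked, false); /* like test_and_clear_bit() */
}
```

In this model a release with no blocked sender schedules nothing, which is the
saving the optimisation is after.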



^ permalink raw reply related

* RE: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: David Laight @ 2012-11-28  9:21 UTC (permalink / raw)
  To: chas williams - CONTRACTOR, David Woodhouse
  Cc: Krzysztof Mazur, davem, netdev, linux-kernel, nathan
In-Reply-To: <20121127135434.0728cd4f@thirdoffive.cmf.nrl.navy.mil>

> On Tue, 27 Nov 2012 18:02:29 +0000
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > In solos-pci at least, the ops->close() function doesn't flush all
> > pending skbs for this vcc before returning. So can be a tasklet
> > somewhere which has loaded the address of the vcc->pop function from one
> > of them, and is going to call it in some unspecified amount of time.
> >
> > Should we make the device's ->close function wait for all TX and RX skbs
> > for this vcc to complete?
> 
> the driver's close routine should wait for any of the pending tx and rx
> to complete.  take a look at the he.c in driver/atm

I'm not sure that sleeping for long periods in close() is always a
good idea. If the process is event driven it will be unable to
handle events on other fd until the close completes.
This may be known not to be true in this case, but is more generally
a problem.
In this case the close should probably (IMHO at least) only sleep
while pending tx and rx are aborted/discarded.

Even when it might make sense to sleep in close until tx drains,
there needs to be a finite timeout before it becomes abortive.

	David
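
[Editor's sketch] A bounded wait followed by an abortive close, as argued for
above, might look like this in a userspace model. The names
(`struct vcc_model`, `close_with_timeout()`, the two tick helpers) are
hypothetical, and "ticks" stand in for whatever time unit a driver would use.

```c
enum close_result { CLOSE_DRAINED, CLOSE_ABORTED };

/* Hypothetical model of a connection with queued tx/rx buffers. */
struct vcc_model {
	int pending;
};

typedef void (*tick_fn)(struct vcc_model *);

/* Example device behaviours used to drive the model. */
static void tick_drain_one(struct vcc_model *v)
{
	if (v->pending > 0)
		v->pending--;
}

static void tick_stalled(struct vcc_model *v)
{
	(void)v; /* hardware makes no progress */
}

/*
 * Wait at most max_ticks for pending buffers to complete; after that,
 * discard whatever is left so that close() always returns in finite time.
 */
static enum close_result close_with_timeout(struct vcc_model *v,
					    int max_ticks, tick_fn tick)
{
	for (int t = 0; t < max_ticks && v->pending > 0; t++)
		tick(v);

	if (v->pending == 0)
		return CLOSE_DRAINED;

	v->pending = 0; /* abortive part: drop the remaining buffers */
	return CLOSE_ABORTED;
}
```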

^ permalink raw reply

* [PATCH] net: ICMPv6 packets transmitted on wrong interface if nfmark is mangled
From: Dries De Winter @ 2012-11-28  9:09 UTC (permalink / raw)
  To: David S. Miller, Pablo Neira Ayuso, Patrick McHardy
  Cc: netdev, netfilter-devel
In-Reply-To: <22884633.2468.1354092935228.JavaMail.driesdw@sahwcmp0020>

From: Dries De Winter <dries.dewinter@gmail.com>

The IPv6 mangle table may change the source/destination address and skb->mark
of a packet. Therefore it may be necessary to "reroute" a packet after it
traversed this table. But this should not happen for some special packets like
neighbour solicitations and MLD reports: they have an explicit destination, not
originating from the routing table. Rerouting these packets may cause them to
go out on the wrong interface or not to go out at all depending on the routing
table.

I propose a patch which allows marking a dst_entry as "non-reroutable".
icmp6_dst_alloc() (used by the ndisc and MLD implementations) will always mark
the allocated dst_entry as such. A check is added to netfilter (IPv6-only) so
packets heading for a non-reroutable destination are never rerouted.

Detailed discussion about the patch:
- It is based on 3.6.7.
- Are there other examples of dsts besides ICMPv6 that should be non-reroutable?
- Are there other situations besides rerouting by netfilter in which this new
  flag should be considered?
- Similar logic exists in IPv4 so local multicast/broadcast messages are
  potentially transmitted on the wrong interface. However, it's a less likely
  corner case there because those packets are treated differently by
  local output routing: multicast/broadcast messages are by default routed to
  the interface with a matching source IP-address. But this logic is invalid
  because (1) it is allowed to send messages with a source IP-address
  different from your own and (2) it is allowed to assign the same IP-address
  on multiple interfaces. So I feel that also in the case of IPv4 it should
  be possible to forbid rerouting for some special packets.

Regards,

Dries De Winter
SoftAtHome

Signed-off-by: Dries De Winter <dries.dewinter@gmail.com>
---
diff --git a/include/net/dst.h b/include/net/dst.h
index 621e351..8b92678 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -61,6 +61,7 @@ struct dst_entry {
 #define DST_NOPEER                0x0040
 #define DST_FAKE_RTABLE                0x0080
 #define DST_XFRM_TUNNEL                0x0100
+#define DST_NOREROUTE                0x0200

         unsigned short                pending_confirm;

diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index db31561..5b98145 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -23,6 +23,10 @@ int ip6_route_me_harder(struct sk_buff *skb)
                 .saddr = iph->saddr,
         };

+        dst = skb_dst(skb);
+        if (dst && (dst->flags & DST_NOREROUTE))
+                return 0;
+
         dst = ip6_route_output(net, skb->sk, &fl6);
         if (dst->error) {
                 IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 070a3ce..1c7d377 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1234,7 +1234,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
                 }
         }

-        rt->dst.flags |= DST_HOST;
+        rt->dst.flags |= DST_HOST | DST_NOREROUTE;
         rt->dst.output  = ip6_output;
         rt->n = neigh;
         atomic_set(&rt->dst.__refcnt, 1);
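
[Editor's sketch] The effect of the new flag can be illustrated with a small
userspace model. Only DST_NOREROUTE and its value come from the patch above;
`struct dst_model`, `route_lookup_model()` and `route_me_harder_model()` are
hypothetical, and the "routing table" is a made-up rule that sends marked
packets to another interface.

```c
#define DST_NOREROUTE 0x0200 /* value taken from the patch above */

/* Hypothetical model of a dst entry: flags plus the output interface. */
struct dst_model {
	unsigned short flags;
	int ifindex;
};

/* Made-up policy routing: unmarked traffic uses if 1, marked uses if 2. */
static int route_lookup_model(unsigned int mark)
{
	return mark == 0 ? 1 : 2;
}

/*
 * Model of the reroute step after the mangle table changed the mark:
 * a non-reroutable dst keeps its explicit output interface.
 */
static int route_me_harder_model(struct dst_model *dst, unsigned int mark)
{
	if (dst->flags & DST_NOREROUTE)
		return 0; /* explicit destination, leave it alone */

	dst->ifindex = route_lookup_model(mark);
	return 0;
}
```

In this model a neighbour solicitation bound to interface 1 stays there even
when its mark is mangled, while an ordinary dst follows the new lookup.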

^ permalink raw reply related

* Re: TCP and reordering
From: David Woodhouse @ 2012-11-28  9:08 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, saku, rick.jones2, netdev
In-Reply-To: <CAGK4HS8=NfcyvcNjC3h1wEjgQFCYoNeuWXj8n5Ruukeg+6j=SQ@mail.gmail.com>

On Wed, 2012-11-28 at 00:22 -0800, Vijay Subramanian wrote:
> 
> I don't believe reordering is tracked on the receiver side, but on the
> sender there are SNMP MIB items.
> They can be viewed using nstat/netstat:
> 
> # nstat -az | grep -i reorder
> TcpExtTCPFACKReorder            0                  0.0
> TcpExtTCPSACKReorder            0                  0.0
> TcpExtTCPRenoReorder            0                  0.0
> TcpExtTCPTSReorder              0                  0.0

Thanks. For me after a 64MiB download, I have an increase of one FACK,
one SACK and one TS reorder. So my connection probably does even less
reordering than I thought, and thus isn't particularly relevant to this
conversation. I'll shut up now and go back to playing with ATM.

-- 
dwmw2



^ permalink raw reply

* Re: TCP and reordering
From: Saku Ytti @ 2012-11-28  8:54 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAGK4HS_GVZqoGVjd87N6skM-SBWyPtY-AhiR9FDcR8ty5x6Xbg@mail.gmail.com>

On (2012-11-28 00:35 -0800), Vijay Subramanian wrote:

> Also note that reordering is tracked on the sender side using the per
> flow variable tp->reordering . This measures the amount of reordering
> on the connection so that
> fast retransmit and other loss recovery mechanisms are not entered
> prematurely. Doesn't this behavior at the  sender already provide the
> behavior you seek?

Sorry I don't seem to understand what you mean. Do you mind explaining how
the sender can help to restore performance on reordering network?

-- 
  ++ytti

^ permalink raw reply

* Re: TCP and reordering
From: Vijay Subramanian @ 2012-11-28  8:35 UTC (permalink / raw)
  To: Saku Ytti; +Cc: David Miller, rick.jones2, netdev
In-Reply-To: <20121128072611.GA26010@pob.ytti.fi>

>
> My proposal (or question more accurately) was to add 'reorder' counter to
> sockets, which would increment when duplicate ACK is followed by same
> sequence twice.
> Then you could automatically/dynamically delay duplicate acks, as you'd
> start to expect to receive the frames, out-of-order. Giving non-lossy
> reordering links pretty much 100% same performance as non-lossy in-order
> links.

RFC 5681 says that out-of-order packets should be acked immediately.
Please see section 4.2 for detailed reasoning.
It also explains why acks should not be delayed too much.

Also note that reordering is tracked on the sender side using the per
flow variable tp->reordering. This measures the amount of reordering
on the connection so that fast retransmit and other loss recovery
mechanisms are not entered prematurely. Doesn't this behavior at the
sender already provide the behavior you seek?

Regards,
Vijay
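
[Editor's sketch] The idea of a sender-side reordering estimate that delays
fast retransmit can be modelled as follows. This is a simplified model, not
the Linux implementation: the struct, the function names and the exact
comparison are assumptions, and only the initial threshold of three duplicate
ACKs is taken from RFC 5681.

```c
#include <stdbool.h>

#define INITIAL_REORDERING 3u /* RFC 5681 duplicate-ACK threshold */

/* Hypothetical per-flow sender state. */
struct sender_model {
	unsigned int reordering; /* current duplicate-ACK threshold */
	unsigned int dupacks;    /* duplicate ACKs seen for the current hole */
};

/* Raise the threshold when a larger reordering distance was observed. */
static void note_reordering(struct sender_model *s, unsigned int distance)
{
	if (distance > s->reordering)
		s->reordering = distance;
}

/* Count one duplicate ACK; return true when fast retransmit should start. */
static bool on_dupack(struct sender_model *s)
{
	s->dupacks++;
	return s->dupacks >= s->reordering;
}
```

On a path where segments were seen to arrive up to five positions late, this
model waits for the fifth duplicate ACK instead of the third.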

^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Steve Glendinning @ 2012-11-28  8:34 UTC (permalink / raw)
  To: Alan Stern; +Cc: Bjørn Mork, netdev, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <Pine.LNX.4.44L0.1211271443150.1489-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>

Hi Alan,

>> udev->do_remote_wakeup is set in choose_wakeup() in
>> drivers/usb/core/driver.c.  AFAICS it is always set as long as
>> device_may_wakeup(&udev->dev) is true.
>
> That's right.  But is device_may_wakeup(&udev->dev) true?
>
> By default it wouldn't be.  The normal way to set it is for the user or
> a program to do:
>
>         echo enabled >/sys/bus/usb/devices/.../power/wakeup
>
> Of course, a driver could disregard the user's choice and set the flag
> by itself.

If I set that from userspace the system is able to resume, but I can't
work out how to successfully set this from the driver.  I believe the
driver should be overriding this as if the user has asked for the
device to wake on lan they're expecting this to resume the system.

I've tried placing various combinations of device_set_wakeup_capable
and device_set_wakeup_enable in different places (bind, suspend), but
it still doesn't allow the device to resume from suspend.  How should
I do this?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-11-28  8:31 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Fujinaka, Todd, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <1354039840.2701.14.camel@bwh-desktop.uk.solarflarecom.com>

On 11/28/12 02:10, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>> Forgive me if I'm being too repetitious as I think some of this has
>> been mentioned in the past.
>>
>> We (and by we I mean the Ethernet part and driver) can only change the
>> advertised availability of a larger MaxPayloadSize. The size is
>> negotiated by both sides of the link when the link is established. The
>> driver should not change the size of the link as it would be poking at
>> registers outside of its scope and is controlled by the upstream
>> bridge (not us).
> [...]
> 
> MaxPayloadSize (MPS) is not negotiated between devices but is programmed
> by the system firmware (at least for devices present at boot - the
> kernel may be responsible in case of hotplug).  You can use the kernel
> parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
> that overrides this, but no policy will allow setting MPS above the
> device's MaxPayloadSizeSupported (MPSS).
> 

Ben,

Unfortunately I'm using a 3.0.x kernel, which does not include this option.
So I'm trying to use ethtool to modify it in the EEPROM to see whether that
helps.


Todd, I'll review MaxPayload for all devices, but I have to say that if there
is a mismatch, the customer cannot change it from the BIOS because there is no
entry for it there. To test this and verify whether it is the root cause, we
still need to find the offset in the EEPROM.

Thanks in advance,
Joe
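
[Editor's sketch] For checking MPS against MPSS, the encodings in PCIe
configuration space can be decoded as below. The field positions follow the
PCI Express capability layout (Max_Payload_Size Supported in Device
Capabilities bits 2:0, Max_Payload_Size in Device Control bits 7:5, both
encoded as 128 << value). The function names are hypothetical, and the sketch
does not address the EEPROM offset asked about here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device Capabilities, bits 2:0: Max_Payload_Size Supported (MPSS). */
static unsigned int mpss_bytes(uint32_t devcap)
{
	return 128u << (devcap & 0x7);
}

/* Device Control, bits 7:5: programmed Max_Payload_Size (MPS). */
static unsigned int mps_bytes(uint16_t devctl)
{
	return 128u << ((devctl >> 5) & 0x7);
}

/* A device must never be programmed above what it supports. */
static bool mps_within_mpss(uint32_t devcap, uint16_t devctl)
{
	return mps_bytes(devctl) <= mpss_bytes(devcap);
}
```

Encodings 6 and 7 are reserved, so a decoded value above 4096 bytes indicates
a bad register read rather than a real payload size.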

^ permalink raw reply

* Re: TCP and reordering
From: Vijay Subramanian @ 2012-11-28  8:22 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354089566.21562.20.camel@shinybook.infradead.org>

>
> Short of going through whole dumps and looking, is there a good way to
> get statistics?
>

Hi David,

I don't believe reordering is tracked on the receiver side, but on the
sender there are SNMP MIB items.
They can be viewed using nstat/netstat:

# nstat -az | grep -i reorder
TcpExtTCPFACKReorder            0                  0.0
TcpExtTCPSACKReorder            0                  0.0
TcpExtTCPRenoReorder            0                  0.0
TcpExtTCPTSReorder              0                  0.0

Regards,
Vijay
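
[Editor's sketch] The same counters can be read programmatically. nstat
prints them with a TcpExt prefix; in /proc/net/netstat they appear as a line
of names followed by a line of values. The helper below, with the hypothetical
name `netstat_lookup()`, pairs the two lines and returns one counter. Reading
the file itself is left out so that the sketch stays testable.

```c
#include <stdlib.h>
#include <string.h>

/* Return the next space-separated token in s, or NULL at end of line. */
static const char *next_tok(const char *s, size_t *len)
{
	s += strspn(s, " \n");
	*len = strcspn(s, " \n");
	return *len ? s : NULL;
}

/*
 * names:  "TcpExt: SyncookiesSent ... TCPSACKReorder ..."
 * values: "TcpExt: 0 ... 1 ..."
 * Looks up key (without the TcpExt prefix). Returns 0 and stores the value
 * in *out on success, -1 if the key is not present.
 */
static int netstat_lookup(const char *names, const char *values,
			  const char *key, long long *out)
{
	size_t nlen, vlen, klen = strlen(key);
	const char *n = next_tok(names, &nlen);
	const char *v = next_tok(values, &vlen);

	while (n && v) {
		if (nlen == klen && memcmp(n, key, klen) == 0) {
			*out = strtoll(v, NULL, 10);
			return 0;
		}
		n = next_tok(n + nlen, &nlen);
		v = next_tok(v + vlen, &vlen);
	}
	return -1;
}
```

Sampling a counter such as TCPSACKReorder before and after a transfer gives
the same per-download delta as comparing two nstat runs.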

^ permalink raw reply

* Re: TCP and reordering
From: Christoph Paasch @ 2012-11-28  8:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354089566.21562.20.camel@shinybook.infradead.org>

On Wednesday 28 November 2012 07:59:26 David Woodhouse wrote:
> My 'strange justification' for reordering, albeit not entirely on
> purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
> bandwidth than I had to my dorm room 16 years ago. So I bond two of
> them, and naturally expect a certain amount of reordering.

You might want to have a look at MultiPath TCP [1], which allows the use of 
multiple interfaces for a single TCP connection. It is somewhat similar to 
SCTP-CMT, with the difference that MPTCP is able to pass through today's
firewalls and NATs and does not require any modifications to the applications.

E.g., you could install MPTCP on your end host and set up an HTTP-proxy on a 
public web hoster to terminate your MPTCP session -- as servers don't (yet) 
support MPTCP, you will have to terminate the MPTCP session somewhere.

Cheers,
Christoph

[1] http://multipath-tcp.org

-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://mptcp.info.ucl.ac.be
Université Catholique de Louvain
--

^ permalink raw reply

* Re: [PATCH v3 8/7] pppoatm: fix missing wakeup in pppoatm_send()
From: Krzysztof Mazur @ 2012-11-28  8:12 UTC (permalink / raw)
  To: David Woodhouse; +Cc: chas williams - CONTRACTOR, netdev, linux-kernel, davem
In-Reply-To: <1354063697.21562.4.camel@shinybook.infradead.org>

On Wed, Nov 28, 2012 at 12:48:17AM +0000, David Woodhouse wrote:
> On Tue, 2012-11-27 at 10:23 -0500, chas williams - CONTRACTOR wrote:
> > yes, but dont call it 8/7 since that doesnt make sense.
> 
> It made enough sense when it was a single patch appended to a thread of
> 7 other patches from Krzysztof. But now it's all got a little more
> complex, so I've tried to collect together the latest version of
> everything we've discussed:

There was also discussion about patch 9/7 "pppoatm: wakeup after ATM
unlock only when it's needed".

> 
>  http://git.infradead.org/users/dwmw2/atm.git
>   git://git.infradead.org/users/dwmw2/atm.git
> 
> David Woodhouse (5):
>       atm: Add release_cb() callback to vcc
>       pppoatm: fix missing wakeup in pppoatm_send()
>       br2684: fix module_put() race

for the three patches above:

Acked-by: Krzysztof Mazur <krzysiek@podlesie.net>

Krzysiek

^ permalink raw reply

* Re: [net-next RFC v2] net_cls: traffic counter based on classification control cgroup
From: Daniel Wagner @ 2012-11-28  8:09 UTC (permalink / raw)
  To: Alexey Perevalov
  Cc: Glauber Costa, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50B59F54.8080401-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Hi Alexey,

On 28.11.2012 06:21, Alexey Perevalov wrote:
>>> Daniel Wagner is working on something a lot similar.
>> Yes, basically what I try to do is explained by this excellent article
>>
>> https://lwn.net/Articles/523058/
> I read articles and agreed with aspects.
> But problem of selecting preferred network for application can be solved 
> using netprio cgroup.

Choosing which network to connect to is the job of a connection manager.
I don't see how a cgroup controller can help you there. I guess I do not 
understand your statement. Can you rephrase please?

>> The second implementation is adding a new iptables matcher which matches
>> on LSM contexts. Then you can do something like this:
>>
>> iptables -t mangle -A OUTPUT -m secmark --secctx 
>> unconfined_u:unconfined_r:foo_t:s0-s0:c0.c1023 -j MARK --set-mark 200
> As I understand in LSM context it works for egress and ingress.

Yes, I am using CONNMARK in conjunction with the above LSM context
matcher. I am still playing around, but it looks quite promising.

>>> 2) When Daniel exposed his use case to me, it gave me the impression
>>> that "counting traffic" is something that is totally doable by having a
>>> dedicated interface in a separate namespace. Basically, we already count
>>> traffic (rx and tx) for all interfaces anyway, so it suggests that it
>>> could be an interesting way to see the problem.
>> Moving applications into separate net namespaces is for sure a valid 
>> solution.
>> Though there is a one drawback in this approach. The namespaces need 
>> to be
>> attached to a bridge and then some NATting. That means every application
>> would get it's own IP address. This might be okay for your certain use
>> cases but I am still trying to work around this. Glauber and I had some
>> discussion about this and he suggested to allow the physical networking
>> device to be attached to several namespaces (e.g. via macvlan). Every
>> namespace would get the same IP address. Unfortunately, this would 
>> result in
>> the same mess as several physical devices on a network get the same
>> IP address assigned.
> Is I truly understand what to make statistics works we need to put 
> process to separate namespace?

If a process lives in its own network namespace then you can
count the packets/bytes on the network interface level. The side effect
is that each namespace is obviously a new network and has to be
treated as such.

> Approach to keep counter in cgroup hasn't such side effects, but it has 
> another ).

cgroups do not come for free. Currently a lot of effort is being put into
getting reasonable performance and behavior out of cgroups. In this situation
any new feature added to cgroups will need a pretty good justification
why it is needed and why it can't be done with existing infrastructure.

Here is some background information on the state of cgroups:

http://thread.gmane.org/gmane.linux.kernel.containers/23698

cheers,
daniel

^ permalink raw reply

* Re: [PATCH] br2684: don't send frames on not-ready vcc
From: Krzysztof Mazur @ 2012-11-28  8:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: chas williams - CONTRACTOR, davem, netdev, linux-kernel, nathan
In-Reply-To: <1354064086.21562.10.camel@shinybook.infradead.org>

On Wed, Nov 28, 2012 at 12:54:46AM +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 00:51 +0100, Krzysztof Mazur wrote:
> > If you do this actually it's better to don't use patch 1/7 because
> > it introduces race condition that you found earlier.
> 
> Right. I've omitted that from the git tree I just pushed out.
> 
> > With this patch you have still theoretical race that was fixed in patches
> > 5 and 8 in pppoatm series, but I never seen that in practice.
> 
> And I think it's even less likely for br2684. At least with pppoatm you
> might have had pppd sending frames. But for br2684 they *only* come from
> its start_xmit function... which is serialised anyway.
> 
> I do get strange oopses when I try to add BQL to br2684, but that's not
> something to be looking at at 1am...
> 
> I *do* need the equivalent of your patch 4, which is the module_put
> race.
> 

I think you might need also an equivalent of
"[PATCH v3 3/7] pppoatm: allow assign only on a connected socket".

I'm not sure yet. I will test if I can trigger that Oops on pppoatm
without that patch. Testing vcc flags might be sufficient - that's
what I did in the first patch, but you asked what about SOCK_CONNECTED,
and I think it was really needed.

Krzysiek
-- >8 --
Subject: [PATCH] br2684: allow assign only on a connected socket

The br2684 does not check if used vcc is in connected state,
causing potential Oops in pppoatm_send() when vcc->send() is called
on not fully connected socket.

Now br2684 can be assigned only on connected sockets; otherwise
-EINVAL error is returned.

Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
---
 net/atm/br2684.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/atm/br2684.c b/net/atm/br2684.c
index 59e8edb..e88998c 100644
--- a/net/atm/br2684.c
+++ b/net/atm/br2684.c
@@ -704,10 +704,13 @@ static int br2684_ioctl(struct socket *sock, unsigned int cmd,
 			return -ENOIOCTLCMD;
 		if (!capable(CAP_NET_ADMIN))
 			return -EPERM;
-		if (cmd == ATM_SETBACKEND)
+		if (cmd == ATM_SETBACKEND) {
+			if (sock->state != SS_CONNECTED)
+				return -EINVAL;
 			return br2684_regvcc(atmvcc, argp);
-		else
+		} else {
 			return br2684_create(argp);
+		}
 #ifdef CONFIG_ATM_BR2684_IPFILTER
 	case BR2684_SETFILT:
 		if (atmvcc->push != br2684_push)
-- 
1.8.0.411.g71a7da8

^ permalink raw reply related

* Re: TCP and reordering
From: David Woodhouse @ 2012-11-28  7:59 UTC (permalink / raw)
  To: David Miller; +Cc: saku, rick.jones2, netdev
In-Reply-To: <20121127.210611.1127622873924794001.davem@davemloft.net>

On Tue, 2012-11-27 at 21:06 -0500, David Miller wrote:
> And the gains of fast retransmit far outweigh whatever strange
> justification would give for reordering packets on purpose.

My 'strange justification' for reordering, albeit not entirely on
purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
bandwidth than I had to my dorm room 16 years ago. So I bond two of
them, and naturally expect a certain amount of reordering.

I've never really done much analysis of this though, and it's never
seemed to be a problem. Then again, I don't think I get *much*
reordering. Big downloads tend to look fairly much like this:

07:36:02.272979 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67016473:67017881, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273478 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67017881:67019289, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273507 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67019289, win 11198, options [nop,nop,TS val 1096912356 ecr 2564943119], length 0
07:36:02.274727 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67019289:67020697, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275151 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67020697:67022105, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275184 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67022105, win 11198, options [nop,nop,TS val 1096912358 ecr 2564943119], length 0

I suppose it might be worse if the lines weren't running at the same
speed, and if the packets weren't running over the same path through the
telco between the ISP's LNS (which alternates one packet per line) and
my local DSLAM.

Short of going through whole dumps and looking, is there a good way to
get statistics?
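
One rough way to get such a number from text dumps like the one above is sketched below. This is only an illustration under stated assumptions: the names `reorder_stats` and `reorder_feed` are made up for this example, the input is assumed to be one direction of a single flow, sequence-number wraparound is ignored, and a retransmitted segment is counted the same way as a reordered one.

```c
#include <stdio.h>
#include <string.h>

/* Running totals for one direction of one TCP flow (hypothetical helper). */
struct reorder_stats {
	unsigned long segments;    /* data segments seen */
	unsigned long late;        /* segments starting below the highest end seen */
	unsigned long highest_end; /* highest "end" sequence number seen so far */
};

/*
 * Feed one line of tcpdump text output. Lines without a "seq A:B" field
 * (pure ACKs, for instance) are ignored. Returns 1 if the line was counted.
 */
static int reorder_feed(struct reorder_stats *st, const char *line)
{
	const char *p = strstr(line, " seq ");
	unsigned long start, end;

	if (!p || sscanf(p, " seq %lu:%lu", &start, &end) != 2)
		return 0;

	st->segments++;
	if (start < st->highest_end)
		st->late++;	/* arrived after a later segment, or is a retransmit */
	if (end > st->highest_end)
		st->highest_end = end;
	return 1;
}
```

Calling `reorder_feed()` on every line read with `fgets()` and then comparing `late` against `segments` gives a crude out-of-order ratio; it cannot by itself tell reordering apart from loss followed by retransmission.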

-- 
dwmw2

^ permalink raw reply

* Re: TCP and reordering
From: Saku Ytti @ 2012-11-28  7:26 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev
In-Reply-To: <20121127.210611.1127622873924794001.davem@davemloft.net>

On (2012-11-27 21:06 -0500), David Miller wrote:

> And the gains of fast retransmit far outweigh whatever strange
> justification would give for reordering packets on purpose.

I don't disagree. I'm not proposing to turn off fast retransmits.

My proposal (or, more accurately, question) was to add a 'reorder' counter
to sockets, which would increment when a duplicate ACK is followed by the
same sequence twice.
Then you could automatically/dynamically delay duplicate ACKs, as you'd
start to expect to receive the frames out of order, giving non-lossy
reordering links pretty much 100% the same performance as non-lossy in-order
links.

There is a good amount of optimization in TCP for corner cases, and, well,
that is what a TCP stack does: it tries to work within the limitations
imposed by the network.

My main question is: am I underestimating the complexity needed to add such
a counter? Or does such a counter actually already exist? (I've not looked
at whether the netstat -s reordering counters are attributable to a
particular socket.)

-- 
  ++ytti

^ permalink raw reply

* Re: [RFC PATCH] 8139cp: properly support change of MTU values
From: Rami Rosen @ 2012-11-28  7:23 UTC (permalink / raw)
  To: John Greene; +Cc: netdev
In-Reply-To: <1354046932-13606-1-git-send-email-jogreene@redhat.com>

Hi,

In cp_change_mtu(), there is the following check:
...
if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
		return -EINVAL;
...
Later on, we set dev->mtu to new_mtu.

CP_MIN_MTU is defined as 60; shouldn't it be 68?


The reason for 68 is (RFC 791,  Internet Protocol,
http://www.ietf.org/rfc/rfc791.txt):
"Every internet module must be able to forward a datagram of 68 octets
without further fragmentation.  This is because an internet  header
may be up to 60 octets, and the minimum fragment is 8 octets."

See also the generic Ethernet method eth_change_mtu() (net/ethernet/eth.c):

int eth_change_mtu(struct net_device *dev, int new_mtu)
{
	if (new_mtu < 68 || new_mtu > ETH_DATA_LEN)
		return -EINVAL;
	dev->mtu = new_mtu;
	return 0;
}
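
The arithmetic behind the 68-octet lower bound can be written out as a small standalone check. This is a userspace sketch, not driver code: the macro names and `check_mtu()` are hypothetical and only mirror the shape of the range test quoted above.

```c
#include <errno.h>

#define IP_MAX_HEADER_LEN   60	/* largest IPv4 header, per RFC 791 */
#define IP_MIN_FRAGMENT_LEN  8	/* smallest fragment payload, per RFC 791 */
#define MIN_MTU (IP_MAX_HEADER_LEN + IP_MIN_FRAGMENT_LEN)	/* 60 + 8 = 68 */

/*
 * Hypothetical helper: returns 0 if new_mtu lies in [MIN_MTU, max_mtu],
 * and -EINVAL otherwise.
 */
static int check_mtu(int new_mtu, int max_mtu)
{
	if (new_mtu < MIN_MTU || new_mtu > max_mtu)
		return -EINVAL;
	return 0;
}
```

With a lower bound of 60 instead, values from 60 to 67 would be accepted even though they cannot carry a maximum-size IPv4 header plus the minimum 8-octet fragment.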


regards,
Rami Rosen

http://ramirose.wix.com/ramirosen

On Tue, Nov 27, 2012 at 10:08 PM, John Greene <jogreene@redhat.com> wrote:
> The 8139cp driver has a change_mtu function that has not been
> enabled since the dawn of the git repository. However, the
> generic eth_change_mtu is not used in its place, so that
> invalid MTU values can be set on the interface.
>
> Original patch salvages the broken code for the single case of
> setting the MTU while the interface is down, which is safe
> and also includes the range check.  Now enhanced to support up
> or down interface.
>
> Original patch from
> http://lkml.indiana.edu/hypermail/linux/kernel/1202.2/00770.html
>
> Testing: has been tested on a virtual 8139cp setup without issue;
> awaiting real hardware to retest, but wanted to post now.
>
> Signed-Off-By: "John Greene" <jogreene@redhat.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---
>  drivers/net/ethernet/realtek/8139cp.c | 22 +++-------------------
>  1 file changed, 3 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
> index 6cb96b4..7847c83 100644
> --- a/drivers/net/ethernet/realtek/8139cp.c
> +++ b/drivers/net/ethernet/realtek/8139cp.c
> @@ -1226,12 +1226,9 @@ static void cp_tx_timeout(struct net_device *dev)
>         spin_unlock_irqrestore(&cp->lock, flags);
>  }
>
> -#ifdef BROKEN
>  static int cp_change_mtu(struct net_device *dev, int new_mtu)
>  {
>         struct cp_private *cp = netdev_priv(dev);
> -       int rc;
> -       unsigned long flags;
>
>         /* check for invalid MTU, according to hardware limits */
>         if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
> @@ -1244,22 +1241,11 @@ static int cp_change_mtu(struct net_device *dev, int new_mtu)
>                 return 0;
>         }
>
> -       spin_lock_irqsave(&cp->lock, flags);
> -
> -       cp_stop_hw(cp);                 /* stop h/w and free rings */
> -       cp_clean_rings(cp);
> -
> +       /* network IS up, close it, reset MTU, and come up again. */
> +       cp_close(dev);
>         dev->mtu = new_mtu;
> -       cp_set_rxbufsize(cp);           /* set new rx buf size */
> -
> -       rc = cp_init_rings(cp);         /* realloc and restart h/w */
> -       cp_start_hw(cp);
> -
> -       spin_unlock_irqrestore(&cp->lock, flags);
> -
> -       return rc;
> +       return cp_open(dev);
>  }
> -#endif /* BROKEN */
>
>  static const char mii_2_8139_map[8] = {
>         BasicModeCtrl,
> @@ -1835,9 +1821,7 @@ static const struct net_device_ops cp_netdev_ops = {
>         .ndo_start_xmit         = cp_start_xmit,
>         .ndo_tx_timeout         = cp_tx_timeout,
>         .ndo_set_features       = cp_set_features,
> -#ifdef BROKEN
>         .ndo_change_mtu         = cp_change_mtu,
> -#endif
>
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>         .ndo_poll_controller    = cp_poll_controller,
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
From: Jason Wang @ 2012-11-28  6:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: mst, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20121127084919.1587c647@nehalam.linuxnetplumber.net>

On 11/28/2012 12:49 AM, Stephen Hemminger wrote:
> On Tue, 27 Nov 2012 14:45:13 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 11/27/2012 01:37 AM, Stephen Hemminger wrote:
>>> On Mon, 26 Nov 2012 15:56:52 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> Some devices do not free the old tx skbs immediately after they have been sent
>>>> (usually in the tx interrupt). One such example is virtio-net, which optimizes for
>>>> virt and only frees the possible old tx skbs during the next packet sending. This
>>>> would lead pktgen to wait forever on the refcount of the skb if no other
>>>> packet is sent afterwards.
>>>>
>>>> Solve this issue by introducing a new flag, IFF_TX_SKB_FREE_DELAY, which
>>>> notifies pktgen that the device does not free the skb immediately after it has
>>>> been sent, so that it does not wait for the refcount to be one.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> Another alternative would be using skb_orphan() and skb->destructor.
>>> There are other cases where skb's are not freed right away.
>> Hi Stephen:
>>
>> Do you mean registering a skb->destructor for pktgen then set and check
>> bits in skb->tx_flag?
> Yes. Register a destructor that does something like update a counter (number of packets pending),
> then just spin while number of packets pending is over threshold.
> --

I'm not sure this is the best method, since pktgen is used to test the tx
path of the device driver and NIC. If we use skb_orphan(), we would
miss testing the tx completion part.
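
For reference, the pending-counter idea quoted above can be modelled in plain userspace C. This is a sketch only: `fake_pkt` and every function name here are invented for illustration, nothing below is kernel API, and the model is single-threaded, so a real version would need atomic counters.

```c
#include <stddef.h>

struct fake_pkt;
typedef void (*pkt_destructor_t)(struct fake_pkt *pkt);

/* Stand-in for an skb: only the fields this sketch needs. */
struct fake_pkt {
	pkt_destructor_t destructor;	/* called when the packet is released */
	int *pending;			/* counter owned by the generator */
};

/* Destructor: one fewer packet in flight. */
static void gen_pkt_destructor(struct fake_pkt *pkt)
{
	(*pkt->pending)--;
}

/* Generator side: account for a packet before handing it to the driver. */
static void gen_pkt_send(struct fake_pkt *pkt, int *pending)
{
	pkt->pending = pending;
	pkt->destructor = gen_pkt_destructor;
	(*pending)++;
}

/* Driver side: releasing the packet, whenever that happens, runs the destructor once. */
static void drv_pkt_free(struct fake_pkt *pkt)
{
	if (pkt->destructor)
		pkt->destructor(pkt);
	pkt->destructor = NULL;
}

/* The generator spins while the number of pending packets is over the threshold. */
static int gen_may_send(const int *pending, int threshold)
{
	return *pending <= threshold;
}
```

In this model the counter only drops when the driver side actually releases the packet. If the destructor ran earlier, at hand-off time, the counter would no longer reflect tx completion, which is the concern raised in the reply above.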

^ permalink raw reply

* Re: Fwd: Re: [PATCH] net: ipv6: change %8s to %s for rt->dst.dev->name in seq_printf of rt6_info_route
From: Chen Gang @ 2012-11-28  5:54 UTC (permalink / raw)
  To: Shan Wei; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <50B4551C.6030505@asianux.com>


addition: "right alignment to 8 columns should not belong to the os api level"
  an api should contain as little as possible.
  for an output format api: "contents"+"topology"+"separator mark"+"space redundancy" are enough.
  so "right alignment to 8 columns" should not belong to the api level (it only belongs to "User Experience").
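
The difference between the two format strings can be seen with a small userspace sketch. The helper names are hypothetical and `snprintf()` stands in for `seq_printf()`; the point is only that `%8s` sets a minimum field width, so short names are padded on the left while longer names are printed in full.

```c
#include <stdio.h>

/* Format a name with "%8s": minimum width 8, right-aligned. Returns the length produced. */
static int format_name_padded(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "%8s", name);
}

/* Format a name with "%s": no padding at all. Returns the length produced. */
static int format_name_plain(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "%s", name);
}
```

For "eth0" the padded form yields eight characters (four spaces, then the name) and the plain form yields four; for a 13-character name both forms yield the same 13 characters, so only the amount of leading whitespace ever differs.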


  "User Experience" is mostly about looking 'beautiful' !!

  !-)


gchen.


On 2012-11-27 13:52, Chen Gang wrote:
> On 2012-11-27 13:40, Chen Gang wrote:
>>
>> and now, I think:
>>   A) both input and output through /proc/* are at the os api level.
>>   B) but neither %8s nor %s changes the output interface format (including contents, topology, separator mark, space redundancy).
>>   C) so it belongs to 'User Experience', not to the os api.
>>
>>   any other members are welcome to give suggestions and additions.
>>
>>   thanks.
>>
>>   :-)
>>
> 
>   addition: right alignment to 8 columns does not belong to the interface format.
>     if it did belong to the interface format,
>     it would cause a correctness issue (the name length may be larger than 8).
>     so if "right alignment to 8 columns" belonged to the os api, the api would be incorrect and need changing.
> 
>   :-)
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* [PATCH net-next] be2net: fix INTx ISR for interrupt behaviour on BE2
From: Sathya Perla @ 2012-11-28  5:50 UTC (permalink / raw)
  To: netdev; +Cc: Sathya Perla

On BE2 chip, an interrupt may be raised even when EQ is in un-armed state.
As a result be_intx()::events_get() and be_poll()::events_get() can race and
notify an EQ wrongly.

Fix this by counting events only in be_poll(). Commit 0b545a629 fixes
the same issue in the MSI-x path.

But, on Lancer, INTx can be de-asserted only by notifying num evts. This
is not an issue as the above BE2 behavior doesn't exist/has never been
seen on Lancer.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |   54 +++++++++++---------------
 1 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index adef536..0661e93 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1675,24 +1675,6 @@ static inline int events_get(struct be_eq_obj *eqo)
 	return num;
 }
 
-static int event_handle(struct be_eq_obj *eqo)
-{
-	bool rearm = false;
-	int num = events_get(eqo);
-
-	/* Deal with any spurious interrupts that come without events */
-	if (!num)
-		rearm = true;
-
-	if (num || msix_enabled(eqo->adapter))
-		be_eq_notify(eqo->adapter, eqo->q.id, rearm, true, num);
-
-	if (num)
-		napi_schedule(&eqo->napi);
-
-	return num;
-}
-
 /* Leaves the EQ is disarmed state */
 static void be_eq_clean(struct be_eq_obj *eqo)
 {
@@ -2014,15 +1996,23 @@ static int be_rx_cqs_create(struct be_adapter *adapter)
 
 static irqreturn_t be_intx(int irq, void *dev)
 {
-	struct be_adapter *adapter = dev;
-	int num_evts;
+	struct be_eq_obj *eqo = dev;
+	struct be_adapter *adapter = eqo->adapter;
+	int num_evts = 0;
 
-	/* With INTx only one EQ is used */
-	num_evts = event_handle(&adapter->eq_obj[0]);
-	if (num_evts)
-		return IRQ_HANDLED;
-	else
-		return IRQ_NONE;
+	/* On Lancer, clear-intr bit of the EQ DB does not work.
+	 * INTx is de-asserted only on notifying num evts.
+	 */
+	if (lancer_chip(adapter))
+		num_evts = events_get(eqo);
+
+	/* The EQ-notify may not de-assert INTx rightaway, causing
+	 * the ISR to be invoked again. So, return HANDLED even when
+	 * num_evts is zero.
+	 */
+	be_eq_notify(adapter, eqo->q.id, false, true, num_evts);
+	napi_schedule(&eqo->napi);
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t be_msix(int irq, void *dev)
@@ -2342,10 +2332,10 @@ static int be_irq_register(struct be_adapter *adapter)
 			return status;
 	}
 
-	/* INTx */
+	/* INTx: only the first EQ is used */
 	netdev->irq = adapter->pdev->irq;
 	status = request_irq(netdev->irq, be_intx, IRQF_SHARED, netdev->name,
-			adapter);
+			     &adapter->eq_obj[0]);
 	if (status) {
 		dev_err(&adapter->pdev->dev,
 			"INTx request IRQ failed - err %d\n", status);
@@ -2367,7 +2357,7 @@ static void be_irq_unregister(struct be_adapter *adapter)
 
 	/* INTx */
 	if (!msix_enabled(adapter)) {
-		free_irq(netdev->irq, adapter);
+		free_irq(netdev->irq, &adapter->eq_obj[0]);
 		goto done;
 	}
 
@@ -3023,8 +3013,10 @@ static void be_netpoll(struct net_device *netdev)
 	struct be_eq_obj *eqo;
 	int i;
 
-	for_all_evt_queues(adapter, eqo, i)
-		event_handle(eqo);
+	for_all_evt_queues(adapter, eqo, i) {
+		be_eq_notify(eqo->adapter, eqo->q.id, false, true, 0);
+		napi_schedule(&eqo->napi);
+	}
 
 	return;
 }
-- 
1.7.1

^ permalink raw reply related

* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Cong Wang @ 2012-11-28  5:47 UTC (permalink / raw)
  To: Linda Walsh; +Cc: LKML, Linux Kernel Network Developers
In-Reply-To: <50B5248A.5010908@tlinx.org>

Cc netdev...

On Wed, Nov 28, 2012 at 4:37 AM, Linda Walsh <lkml@tlinx.org> wrote:
>
>
> Is this a known problem / bug, or should I file a bug on it?  It doesn't
> cause a complete failure, and it happens multiple times (~28 times
> in 2.5 days?... so maybe 10x/day?)  about 8 start with ifup, and the rest
> start @ kworker -- both happen upon enabling the bonding driver
> on a 10Gb dual port adapter (trying to get 1 20Gb adapter).
>
> The 2 traceback types (ifup-bonding + kworker) follow:


Does this quick fix help?

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5f5b69f..4a4d9eb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1785,7 +1785,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
                new_slave->link == BOND_LINK_DOWN ? "DOWN" :
                        (new_slave->link == BOND_LINK_UP ? "UP" : "BACK"));

+       read_unlock(&bond->lock);
        bond_update_speed_duplex(new_slave);
+       read_lock(&bond->lock);

        if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
                /* if there is a primary slave, remember it */

Thanks!

>
>
> ----------------- ifup-bonding traceback:
>
> [  229.208603] bonding: bond0: Setting MII monitoring interval to 100.
> [  229.222336] bonding: bond0: Adding slave p2p1.
> [  229.685599] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  229.692166] 4 locks held by ifup-bonding/3711:
> [  229.696645]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  229.705721]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  229.714538]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  229.722772]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  229.732188] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  229.747197] Pid: 3711, comm: ifup-bonding Not tainted 3.6.7-Isht-Van #1
> [  229.753843] Call Trace:
> [  229.756333]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  229.761863]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  229.767214]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  229.772214]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  229.779210]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.784645]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  229.790950]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  229.797254]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  229.802611]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  229.809429]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  229.816078]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  229.824022]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  229.830581]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  229.836534]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  229.842837]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  229.850092]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  229.856478]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  229.863556]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  229.869074]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  229.874856]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  229.880034]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  229.885125]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  229.891259] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  229.897839] 4 locks held by ifup-bonding/3711:
> [  229.902320]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  229.911395]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  229.920212]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  229.928449]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  229.937866] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  229.952904] Pid: 3711, comm: ifup-bonding Tainted: G        W
> 3.6.7-Isht-Van #1
> [  229.960507] Call Trace:
> [  229.962997]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  229.968526]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  229.973875]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  229.978876]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  229.985871]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.991303]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.996739]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.003040]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.009344]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.014698]  [<ffffffff814c50ce>] ixgbe_release_swfw_sync_X540+0x4e/0x60
> [  230.021435]  [<ffffffff814c1531>] ixgbe_read_phy_reg_generic+0x101/0x120
> [  230.028171]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.036117]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.042677]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.048630]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.054934]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.062189]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.068580]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.075660]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.081189]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.086971]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.092148]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.097245]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.103427] bonding: bond0: enslaving p2p1 as an active interface with a
> down link.
> [  230.120623] bonding: bond0: Adding slave p2p2.
> [  230.575194] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  230.581782] 4 locks held by ifup-bonding/3711:
> [  230.586262]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  230.595287]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  230.604105]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  230.612393]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  230.621801] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  230.636922] Pid: 3711, comm: ifup-bonding Tainted: G        W
> 3.6.7-Isht-Van #1
> [  230.644516] Call Trace:
> [  230.647009]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  230.652537]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  230.657886]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  230.662889]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  230.669884]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.675320]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.681622]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.687921]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.693283]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  230.700113]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  230.706763]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.714713]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.721275]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.727231]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.733535]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.740790]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.747189]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.754270]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.759792]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.765576]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.770753]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.775840]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.781933] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  230.788499] 4 locks held by ifup-bonding/3711:
> [  230.793021]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  230.802051]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  230.810872]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  230.819160]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  230.828575] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  230.843673] Pid: 3711, comm: ifup-bonding Tainted: G        W
> 3.6.7-Isht-Van #1
> [  230.851271] Call Trace:
> [  230.853759]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  230.859282]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  230.864637]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  230.869598]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  230.876548]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.881966]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.887359]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.893649]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.899907]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.905213]  [<ffffffff814c50ce>] ixgbe_release_swfw_sync_X540+0x4e/0x60
> [  230.911908]  [<ffffffff814c1531>] ixgbe_read_phy_reg_generic+0x101/0x120
> [  230.918599]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.926501]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.933018]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.938929]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.945185]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.952392]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.958739]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.965774]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.971251]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.976988]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.982146]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.987192]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.993297] bonding: bond0: enslaving p2p2 as an active interface with a
> down link.
> [  231.014761] ixgbe 0000:06:00.0: p2p1: changing MTU from 1500 to 9000
> [  231.863728] ixgbe 0000:06:00.1: p2p2: changing MTU from 1500 to 9000
>
>
> ---------- kworker traceback:
> [  236.268690] ixgbe 0000:06:00.0: p2p1: NIC Link is Up 10 Gbps, Flow
> Control: None
> [  236.305628] BUG: scheduling while atomic: kworker/u:2/106/0x00000002
> [  236.312025] 4 locks held by kworker/u:2/106:
> [  236.312037]  #0:  ((bond_dev->name)){......}, at: [<ffffffff8105a956>]
> process_one_work+0x146/0x680
> [  236.312049]  #1:  ((&(&bond->mii_work)->work)){......}, at:
> [<ffffffff8105a956>] process_one_work+0x146/0x680
> [  236.312056]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  236.312065]  #3:  (&bond->lock){......}, at: [<ffffffffa02955ad>]
> bond_mii_monitor+0x2ed/0x640 [bonding]
> [  236.312078] Modules linked in: ipv6 bonding fan mousedev kvm_intel
> iTCO_wdt iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm
> tpm_bios processor button
> [  236.312082] Pid: 106, comm: kworker/u:2 Tainted: G        W
> 3.6.7-Isht-Van #1
> [  236.312083] Call Trace:
> [  236.312092]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  236.312102]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  236.312108]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  236.312114]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  236.312121]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  236.312129]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  236.312134]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  236.312144]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  236.312150]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  236.312157]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  236.312161]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  236.312166]  [<ffffffffa02955ad>] ? bond_mii_monitor+0x2ed/0x640
> [bonding]
> [  236.312170]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  236.312177]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  236.312182]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  236.312188]  [<ffffffffa0295614>] bond_mii_monitor+0x354/0x640 [bonding]
> [  236.312198]  [<ffffffff8105a9b7>] process_one_work+0x1a7/0x680
> [  236.312203]  [<ffffffff8105a956>] ? process_one_work+0x146/0x680
> [  236.312210]  [<ffffffff8108c7ce>] ? put_lock_stats.isra.21+0xe/0x40
> [  236.312215]  [<ffffffffa02952c0>] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
> [bonding]
> [  236.312234]  [<ffffffff8105b9ed>] worker_thread+0x18d/0x4f0
> [  236.312239]  [<ffffffff81070991>] ? sub_preempt_count+0x51/0x60
> [  236.312242]  [<ffffffff8105b860>] ? manage_workers+0x320/0x320
> [  236.312247]  [<ffffffff81060f7d>] kthread+0x9d/0xb0
> [  236.312250]  [<ffffffff8169c264>] kernel_thread_helper+0x4/0x10
> [  236.312254]  [<ffffffff8106c197>] ? finish_task_switch+0x77/0x100
> [  236.312262]  [<ffffffff8169a4a6>] ? _raw_spin_unlock_irq+0x36/0x60
> [  236.312268]  [<ffffffff8169a9dd>] ? retint_restore_args+0xe/0xe
> [  236.312273]  [<ffffffff81060ee0>] ? flush_kthread_worker+0x160/0x160
> [  236.312277]  [<ffffffff8169c260>] ? gs_change+0xb/0xb
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related

* Re: performance regression on HiperSockets depending on MTU size
From: Cong Wang @ 2012-11-28  5:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <1353998800.7553.873.camel@edumazet-glaptop>

On Tue, Nov 27, 2012 at 2:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 06:21 +0000, Cong Wang wrote:
>
>> Eric,
>>
>> Do you have a full list of such commits? I am trying to backport TSQ
>> to 2.6.32, and of course I don't want to miss these commits either.
>
> I don't think there are other known issues.
>
> mlx4 had a 'problem' because only recently we removed the skb_orphan()
> call it used to do in its start_xmit() function.
>
> I remember David had to revert BQL on NIU driver, but NIU does the
> skb_orphan() call as well so TSQ is basically disabled.
>
>


Good news! Thanks!

^ permalink raw reply
