Netdev List
* Requeues and ECN marking
From: Greg Kuperman @ 2014-02-03 14:50 UTC (permalink / raw)
  To: netdev

Hi all,

I am testing a new congestion control protocol that relies on explicit
congestion notifications (ECN) to notify the receiver of a congestion
event. I have a rate limited link of 1 Mbps, and I am using the RED
queuing discipline with ECN enabled. What I have noticed is that no
matter how small I set my queue size, or how low I set my minimum
marking level, the first ECN marked packet does not get sent out for
about 10 seconds after the input rate exceeds the output rate. Further
examination shows that ECN marking does not occur until the number of
requeues hits 1000. Below are two queries of tc -s -d qdisc ls dev
eth1.

qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
30 Plog 21 Scell_log 31
 Sent 1307892 bytes 1247 pkt (dropped 0, overlimits 0 requeues 960)
 backlog 1052118b 962p requeues 960
  marked 0 early 0 pdrop 0 other 0

qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
30 Plog 21 Scell_log 31
 Sent 1379262 bytes 1312 pkt (dropped 0, overlimits 72 requeues 1024)
 backlog 1122468b 1027p requeues 1024
  marked 72 early 0 pdrop 0 other 0


The txqueuelen defaults to 1000 for the interface, so I figured that
packets may be buffering there, and then dequeuing, before any packets
are marked. I set txqueuelen to lower values (all the way down to 1),
but the exact same behavior occurs (no marked packets until the number of
requeues hits 1000). In contrast, if I set txqueuelen to something very
high, I get no requeues, drops, or marked packets.

My goal is for packets to be marked as soon as the ingress rate
exceeds the egress. Am I correct in thinking that the requeuing
operation is the culprit? Can I eliminate requeues? Is there something
else I can do to get the behavior I am looking for?

Thank you all for the help. And please cc me in your replies; I'm not
100% sure if I get all the messages from this mailing list.

Best,
Greg


* [PATCH] net:phy:dp83640: Declare that TX timestamping possible
From: Stefan Sørensen @ 2014-02-03 14:36 UTC (permalink / raw)
  To: richardcochran, netdev; +Cc: Stefan Sørensen

Set the SKBTX_IN_PROGRESS bit in tx_flags in dp83640_txtstamp when doing
tx timestamps, as per Documentation/networking/timestamping.txt.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 drivers/net/phy/dp83640.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index dfb132e..ae95a9a 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1273,6 +1273,7 @@ static void dp83640_txtstamp(struct phy_device *phydev,
 		}
 		/* fall through */
 	case HWTSTAMP_TX_ON:
+		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 		skb_queue_tail(&dp83640->tx_queue, skb);
 		schedule_work(&dp83640->ts_work);
 		break;
-- 
1.8.5.3


* [PATCH] net:phy:dp83640: Do not hardcode timestamping event edge
From: Stefan Sørensen @ 2014-02-03 14:36 UTC (permalink / raw)
  To: richardcochran, netdev; +Cc: Stefan Sørensen

Currently the external timestamping code is hardcoded to use
the rising edge even though the hardware has configurable event
edge detection. This patch changes the code to use falling edge
detection if PTP_FALLING_EDGE is set in the user supplied flags.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 drivers/net/phy/dp83640.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index baa1a75..80c5fc8 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -440,7 +440,10 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
 		if (on) {
 			gpio_num = extts_gpio[index];
 			evnt |= (gpio_num & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
-			evnt |= EVNT_RISE;
+			if (rq->extts.flags & PTP_FALLING_EDGE)
+				evnt |= EVNT_FALL;
+			else
+				evnt |= EVNT_RISE;
 		}
 		ext_write(0, phydev, PAGE5, PTP_EVNT, evnt);
 		return 0;
-- 
1.8.5.3
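
For illustration, a userspace request that exercises the new flag handling
might be built as below. This is only a sketch and not part of the patch:
make_extts_request() is a hypothetical helper, and it assumes the
struct ptp_extts_request layout and the PTP_ENABLE_FEATURE, PTP_RISING_EDGE
and PTP_FALLING_EDGE macros from <linux/ptp_clock.h>.

```c
#include <string.h>
#include <linux/ptp_clock.h>

/* Hypothetical helper: fill in an external timestamp request for the
 * given event index, selecting the edge through the request flags. */
static struct ptp_extts_request make_extts_request(unsigned int index,
						   int falling_edge)
{
	struct ptp_extts_request req;

	memset(&req, 0, sizeof(req));
	req.index = index;
	req.flags = PTP_ENABLE_FEATURE;
	/* With the patch above, the driver picks EVNT_FALL only when
	 * PTP_FALLING_EDGE is set and uses EVNT_RISE otherwise. */
	req.flags |= falling_edge ? PTP_FALLING_EDGE : PTP_RISING_EDGE;
	return req;
}
```

The resulting structure would then be handed to the clock device with
ioctl(fd, PTP_EXTTS_REQUEST, &req), the same call testptp uses.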


* [PATCH] net:phy:dp83640: Initialize PTP clocks at device init.
From: Stefan Sørensen @ 2014-02-03 14:36 UTC (permalink / raw)
  To: richardcochran, netdev; +Cc: Stefan Sørensen

The trigger and events functionality can be useful even if packet
timestamping is not used, but the required PTP clock is only enabled
when packet timestamping is started. This patch moves the clock enable
to when the interface is configured.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 drivers/net/phy/dp83640.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 80c5fc8..14616e4 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1064,6 +1064,13 @@ static void dp83640_remove(struct phy_device *phydev)
 	kfree(dp83640);
 }
 
+static int dp83640_config_init(struct phy_device *phydev)
+{
+	enable_status_frames(phydev, true);
+	ext_write(0, phydev, PAGE4, PTP_CTL, PTP_ENABLE);
+	return 0;
+}
+
 static int dp83640_ack_interrupt(struct phy_device *phydev)
 {
 	int err = phy_read(phydev, MII_DP83640_MISR);
@@ -1201,11 +1208,6 @@ static int dp83640_hwtstamp(struct phy_device *phydev, struct ifreq *ifr)
 
 	mutex_lock(&dp83640->clock->extreg_lock);
 
-	if (dp83640->hwts_tx_en || dp83640->hwts_rx_en) {
-		enable_status_frames(phydev, true);
-		ext_write(0, phydev, PAGE4, PTP_CTL, PTP_ENABLE);
-	}
-
 	ext_write(0, phydev, PAGE5, PTP_TXCFG0, txcfg0);
 	ext_write(0, phydev, PAGE5, PTP_RXCFG0, rxcfg0);
 
@@ -1337,6 +1339,7 @@ static struct phy_driver dp83640_driver = {
 	.flags		= PHY_HAS_INTERRUPT,
 	.probe		= dp83640_probe,
 	.remove		= dp83640_remove,
+	.config_init	= dp83640_config_init,
 	.config_aneg	= genphy_config_aneg,
 	.read_status	= genphy_read_status,
 	.ack_interrupt  = dp83640_ack_interrupt,
-- 
1.8.5.3


* [PATCH] ptp: Allow selecting trigger/event index in testptp
From: Stefan Sørensen @ 2014-02-03 14:36 UTC (permalink / raw)
  To: richardcochran, netdev; +Cc: Stefan Sørensen

Currently the trigger/event is hardcoded to 0. This patch adds
a new command line argument -i to select an arbitrary trigger/
event.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 Documentation/ptp/testptp.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
index a74d0a8..04b21cd 100644
--- a/Documentation/ptp/testptp.c
+++ b/Documentation/ptp/testptp.c
@@ -123,7 +123,8 @@ static void usage(char *progname)
 		" -P val     enable or disable (val=1|0) the system clock PPS\n"
 		" -s         set the ptp clock time from the system time\n"
 		" -S         set the system time from the ptp clock time\n"
-		" -t val     shift the ptp clock time by 'val' seconds\n",
+		" -t val     shift the ptp clock time by 'val' seconds\n"
+		" -i val     index for event/trigger\n",
 		progname);
 }
 
@@ -161,13 +162,14 @@ int main(int argc, char *argv[])
 	int perout = -1;
 	int pps = -1;
 	int settime = 0;
+	int index = 0;
 
 	int64_t t1, t2, tp;
 	int64_t interval, offset;
 
 	progname = strrchr(argv[0], '/');
 	progname = progname ? 1+progname : argv[0];
-	while (EOF != (c = getopt(argc, argv, "a:A:cd:e:f:ghk:p:P:sSt:v"))) {
+	while (EOF != (c = getopt(argc, argv, "a:A:cd:e:f:ghk:p:P:sSt:vi:"))) {
 		switch (c) {
 		case 'a':
 			oneshot = atoi(optarg);
@@ -209,6 +211,9 @@ int main(int argc, char *argv[])
 		case 't':
 			adjtime = atoi(optarg);
 			break;
+		case 'i':
+			index = atoi(optarg);
+			break;
 		case 'h':
 			usage(progname);
 			return 0;
@@ -301,7 +306,7 @@ int main(int argc, char *argv[])
 
 	if (extts) {
 		memset(&extts_request, 0, sizeof(extts_request));
-		extts_request.index = 0;
+		extts_request.index = index;
 		extts_request.flags = PTP_ENABLE_FEATURE;
 		if (ioctl(fd, PTP_EXTTS_REQUEST, &extts_request)) {
 			perror("PTP_EXTTS_REQUEST");
@@ -375,7 +380,7 @@ int main(int argc, char *argv[])
 			return -1;
 		}
 		memset(&perout_request, 0, sizeof(perout_request));
-		perout_request.index = 0;
+		perout_request.index = index;
 		perout_request.start.sec = ts.tv_sec + 2;
 		perout_request.start.nsec = 0;
 		perout_request.period.sec = 0;
-- 
1.8.5.3


* Re: [PATCH] ipv6: default route for link local address is not added while assigning a address
From: Nicolas Dichtel @ 2014-02-03 15:23 UTC (permalink / raw)
  To: Sohny Thomas, netdev, linux-kernel, yoshfuji, davem, kumuda,
	Hannes Frederic Sowa
In-Reply-To: <52EF42FF.60907@linux.vnet.ibm.com>

On 03/02/2014 08:19, Sohny Thomas wrote:
>
>> Actually I am not so sure, there is no defined semantic of flush. I would
>> be ok with all three solutions: leave it as is, always add link-local
>> address (it does not matter if we don't have a link-local address on
It matters. This address is required.
RFC 4291
Section 2.1:
    All interfaces are required to have at least one Link-Local unicast
    address (see Section 2.8 for additional required addresses).
Section 2.8:
       o Its required Link-Local address for each interface.

>> that interface, as a global scoped one is just fine enough) or make flush not
>> remove the link-local address (but this seems a bit too special cased for me).
>
> 1) In case if we leave it as it is, there is rfc 6724 rule 2 to be considered (
> previously rfc 3484)
>
> Rule 2: Prefer appropriate scope.
>     If Scope(SA) < Scope(SB): If Scope(SA) < Scope(D), then prefer SB and
>     otherwise prefer SA.  Similarly, if Scope(SB) < Scope(SA): If
>     Scope(SB) < Scope(D), then prefer SA and otherwise prefer SB.
>
> Test:
>
>     Destination: fe80::2(LS)
>      Candidate Source Addresses: 3ffe::1(GS) or fec0::1(SS) or LLA(LS)
>      Result: LLA(LS)
>      Scope(LLA) < Scope(fec0::1): If Scope(LLA) < Scope(fe80::2),  no, prefer LLA
>      Scope(LLA) < Scope(3ffe::1): If Scope(LLA) < Scope(fe80::2),  no, prefer LLA
>
>
> Now the above test fails since the route itself is not present, and the test
> assumes that the route gets added since the LLA is not removed during the test
In your scenario, the link local route has been removed manually, not by the
kernel. What is your network manager?

>
> 2) having a LLA always helps in NDP i think
A link-local Address yes, it's a MUST. But having only the link local route will
not help.

>
> 3) making flush not remove link-local address will be changing functionality of
> ip flush command
You can flush by specifying the protocol:
ip -6 route flush proto static


Regards,
Nicolas


* Re: OOPS in nf_ct_unlink_expect_report using Polycom RealPresence Mobile
From: astx @ 2014-02-03 15:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: linux-kernel, netdev, netfilter, Alexey Dobriyan, netfilter-devel
In-Reply-To: <20140203121415.GA12777@localhost>

Test results / tested kernel versions:

3.2.54
3.8.13
3.10.28

The kernel versions above, without the patch, die with the same error when
trying to start H.323 connections using "Polycom RealPresence Mobile".

I can confirm that with this patch all three kernel versions are
pretty stable again.

Thank you all for your fast and competent help.

Best Regards,

Toni


Quoting Pablo Neira Ayuso <pablo@netfilter.org>:

> On Fri, Jan 31, 2014 at 05:04:02PM +0100, astx wrote:
>> Dear Alexey,
>>
>> seems to help. Thank you for your quick response. Kernel 3.10.28 is
>> now stable using h323 / Polycom.
>
> Thanks, if no objection, will pass this patch to David.


* Re: [PATCH] ptp: Allow selecting trigger/event index in testptp
From: Richard Cochran @ 2014-02-03 15:59 UTC (permalink / raw)
  To: Stefan Sørensen; +Cc: netdev
In-Reply-To: <1391438187-21834-1-git-send-email-stefan.sorensen@spectralink.com>

On Mon, Feb 03, 2014 at 03:36:27PM +0100, Stefan Sørensen wrote:
> Currently the trigger/event is hardcoded to 0. This patch adds
> a new command line argument -i to select an arbitrary trigger/
> event.

This is a nice extension of the program, but ...

> diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
> index a74d0a8..04b21cd 100644
> --- a/Documentation/ptp/testptp.c
> +++ b/Documentation/ptp/testptp.c
> @@ -123,7 +123,8 @@ static void usage(char *progname)
>  		" -P val     enable or disable (val=1|0) the system clock PPS\n"
>  		" -s         set the ptp clock time from the system time\n"
>  		" -S         set the system time from the ptp clock time\n"
> -		" -t val     shift the ptp clock time by 'val' seconds\n",
> +		" -t val     shift the ptp clock time by 'val' seconds\n"
> +		" -i val     index for event/trigger\n",

can we please keep the options in alphabetical order?

Thanks,
Richard


* Re: [PATCH] ipv6: default route for link local address is not added while assigning a address
From: Hannes Frederic Sowa @ 2014-02-03 16:08 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Sohny Thomas, netdev, linux-kernel, yoshfuji, davem, kumuda
In-Reply-To: <52EFB454.1040908@6wind.com>

Hello!

On Mon, Feb 03, 2014 at 04:23:00PM +0100, Nicolas Dichtel wrote:
> On 03/02/2014 08:19, Sohny Thomas wrote:
> >
> >>Actually I am not so sure, there is no defined semantic of flush. I would
> >>be ok with all three solutions: leave it as is, always add link-local
> >>address (it does not matter if we don't have a link-local address on
> It matters. This address is required.
> RFC 4291
> Section 2.1:
>    All interfaces are required to have at least one Link-Local unicast
>    address (see Section 2.8 for additional required addresses).
> Section 2.8:
>       o Its required Link-Local address for each interface.

Yes, sure, it is required. But you also can manually delete the LL address and
we don't guard against that.

> >>that interface, as a global scoped one is just fine enough) or make flush 
> >>not
> >>remove the link-local address (but this seems a bit too special cased for 
> >>me).
> >
> >1) In case if we leave it as it is, there is rfc 6724 rule 2 to be 
> >considered (
> >previously rfc 3484)
> >
> >Rule 2: Prefer appropriate scope.
> >    If Scope(SA) < Scope(SB): If Scope(SA) < Scope(D), then prefer SB and
> >    otherwise prefer SA.  Similarly, if Scope(SB) < Scope(SA): If
> >    Scope(SB) < Scope(D), then prefer SA and otherwise prefer SB.
> >
> >Test:
> >
> >    Destination: fe80::2(LS)
> >     Candidate Source Addresses: 3ffe::1(GS) or fec0::1(SS) or LLA(LS)
> >     Result: LLA(LS)
> >     Scope(LLA) < Scope(fec0::1): If Scope(LLA) < Scope(fe80::2),  no, 
> >     prefer LLA
> >     Scope(LLA) < Scope(3ffe::1): If Scope(LLA) < Scope(fe80::2),  no, 
> >     prefer LLA
> >
> >
> >Now the above test fails since the route itself is not present, and the 
> >test
> >assumes that the route gets added since the LLA is not removed during the 
> >test
> In your scenario, the link local route has been removed manually, not by the
> kernel. What is your network manager?

The test scenario is outlined here:
<https://bugzilla.kernel.org/show_bug.cgi?id=68511>

Basically, the command in question is this one:

	[root@localhost ~]# ip -6 -statistics -statistics route flush dev eth0

which removes the fe80::/64 route.

> >2) having a LLA always helps in NDP i think
> A link-local Address yes, it's a MUST. But having only the link local route 
> will
> not help.

Agreed, the LL address should be available, too. I currently don't know
what will break if LL address is not available. I guess MLD won't work
properly and thus even basic connectivity won't work with some switches.

> >3) making flush not remove link-local address will be changing
> >functionality of
> >ip flush command
> You can flush by specifying the protocol:
> ip -6 route flush proto static

So we have four possibilities now:

1) leave it as is

	seems still acceptable to me

2) add fe80::/64 route unconditionally if any address gets added

	Sohny's patch already looks good in doing so at first look.

3) add fe80::/64 route in case LL address gets added via inet6_rtm_newaddr

	would be ok, too. I tend towards this solution somehow by now.

4) make flush not remove the fe80::/64 address

	Least favourable to me. I guess this also would need iproute change
	and seems most difficult to do.

Any opinions?

Greetings,

  Hannes


* Re: [PATCH] net:phy:dp83640: Declare that TX timestamping possible
From: Richard Cochran @ 2014-02-03 16:19 UTC (permalink / raw)
  To: Stefan Sørensen; +Cc: netdev
In-Reply-To: <1391438195-21888-1-git-send-email-stefan.sorensen@spectralink.com>

On Mon, Feb 03, 2014 at 03:36:35PM +0100, Stefan Sørensen wrote:
> Set the SKBTX_IN_PROGRESS bit in tx_flags in dp83640_txtstamp when doing
> tx timestamps, as per Documentation/networking/timestamping.txt.
> 
> Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
> ---

Acked-by: Richard Cochran <richardcochran@gmail.com>


* Re: [PATCH] net:phy:dp83640: Do not hardcode timestamping event edge
From: Richard Cochran @ 2014-02-03 16:20 UTC (permalink / raw)
  To: Stefan Sørensen; +Cc: netdev
In-Reply-To: <1391438210-21941-1-git-send-email-stefan.sorensen@spectralink.com>

On Mon, Feb 03, 2014 at 03:36:50PM +0100, Stefan Sørensen wrote:
> Currently the external timestamping code is hardcoded to use
> the rising edge even though the hardware has configurable event
> edge detection. This patch changes the code to use falling edge
> detection if PTP_FALLING_EDGE is set in the user supplied flags.
> 
> Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
> ---

Acked-by: Richard Cochran <richardcochran@gmail.com>


* Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
From: Eric Dumazet @ 2014-02-03 16:22 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, Andrew Vagin, Andrey Vagin, netfilter-devel,
	netfilter, netdev, linux-kernel, vvs, Cyrill Gorcunov,
	Vasiliy Averin
In-Reply-To: <20140202233046.GA4137@localhost>

On Mon, 2014-02-03 at 00:30 +0100, Pablo Neira Ayuso wrote:
>          */
>         smp_wmb();
> -       atomic_set(&ct->ct_general.use, 1);
> +       atomic_set(&ct->ct_general.use, 0);
>         return ct; 

Hi Pablo !

I think your patch is the way to go, but might need some extra care
with memory barriers.

I believe the smp_wmb() here is no longer needed.

If it's newly allocated memory, no other user can access ct;
if it's a recycled ct, its content is already 0 anyway.

After your patch, nf_conntrack_get(&tmpl->ct_general) should increment
an already non-zero refcnt, so no memory barrier is needed.

But one smp_wmb() is needed right before this point :

	/* The caller holds a reference to this object */
	atomic_set(&ct->ct_general.use, 2);
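
The ordering described above can be sketched in userspace with C11 atomics.
This is only an analogue under stated assumptions, not the conntrack code:
struct obj, obj_publish() and obj_try_get() are hypothetical names, the
release fence stands in for smp_wmb(), and the lookup side imitates an
"increment unless zero" operation such as atomic_inc_not_zero().

```c
#include <stdatomic.h>

/* Hypothetical object: a payload guarded by a reference count. */
struct obj {
	int payload;
	atomic_int use;
};

/* Publisher: finish initialising the object, then make the refcount
 * non-zero. The release fence orders the payload store before the
 * refcount store, which is the role smp_wmb() plays in the mail. */
static void obj_publish(struct obj *o, int payload)
{
	o->payload = payload;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&o->use, 2, memory_order_relaxed);
}

/* Lookup side: take a reference only if the count is already non-zero.
 * Returns 1 on success and 0 if the object is not (or no longer) live. */
static int obj_try_get(struct obj *o)
{
	int old = atomic_load_explicit(&o->use, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&o->use, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return 1;
	}
	return 0;
}
```

With the count left at 0 until the object is fully set up, a concurrent
lookup either fails to take a reference or sees the initialised payload.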

Thanks !




* Re: [PATCH] ipv6: default route for link local address is not added while assigning a address
From: Nicolas Dichtel @ 2014-02-03 16:26 UTC (permalink / raw)
  To: Sohny Thomas, netdev, linux-kernel, yoshfuji, davem, kumuda
In-Reply-To: <20140203160838.GA17999@order.stressinduktion.org>

On 03/02/2014 17:08, Hannes Frederic Sowa wrote:
> Hello!
>
> On Mon, Feb 03, 2014 at 04:23:00PM +0100, Nicolas Dichtel wrote:
>> On 03/02/2014 08:19, Sohny Thomas wrote:
>>>
>>>> Actually I am not so sure, there is no defined semantic of flush. I would
>>>> be ok with all three solutions: leave it as is, always add link-local
>>>> address (it does not matter if we don't have a link-local address on
>> It matters. This address is required.
>> RFC 4291
>> Section 2.1:
>>     All interfaces are required to have at least one Link-Local unicast
>>     address (see Section 2.8 for additional required addresses).
>> Section 2.8:
>>        o Its required Link-Local address for each interface.
>
> Yes, sure, it is required. But you also can manually delete the LL address and
> we don't guard against that.
Sure. That's why I don't like this patch: it fixes a user error.

>
>>>> that interface, as a global scoped one is just fine enough) or make flush
>>>> not
>>>> remove the link-local address (but this seems a bit too special cased for
>>>> me).
>>>
>>> 1) In case if we leave it as it is, there is rfc 6724 rule 2 to be
>>> considered (
>>> previously rfc 3484)
>>>
>>> Rule 2: Prefer appropriate scope.
>>>     If Scope(SA) < Scope(SB): If Scope(SA) < Scope(D), then prefer SB and
>>>     otherwise prefer SA.  Similarly, if Scope(SB) < Scope(SA): If
>>>     Scope(SB) < Scope(D), then prefer SA and otherwise prefer SB.
>>>
>>> Test:
>>>
>>>     Destination: fe80::2(LS)
>>>      Candidate Source Addresses: 3ffe::1(GS) or fec0::1(SS) or LLA(LS)
>>>      Result: LLA(LS)
>>>      Scope(LLA) < Scope(fec0::1): If Scope(LLA) < Scope(fe80::2),  no,
>>>      prefer LLA
>>>      Scope(LLA) < Scope(3ffe::1): If Scope(LLA) < Scope(fe80::2),  no,
>>>      prefer LLA
>>>
>>>
>>> Now the above test fails since the route itself is not present, and the
>>> test
>>> assumes that the route gets added since the LLA is not removed during the
>>> test
>> In your scenario, the link local route has been removed manually, not by the
>> kernel. What is your network manager?
>
> The test scenario is outlined here:
> <https://bugzilla.kernel.org/show_bug.cgi?id=68511>
>
> Basically, the command in question is this one:
>
> 	[root@localhost ~]# ip -6 -statistics -statistics route flush dev eth0
>
> which removes the fe80::/64 route.
>
>>> 2) having a LLA always helps in NDP i think
>> A link-local Address yes, it's a MUST. But having only the link local route
>> will
>> not help.
>
> Agreed, the LL address should be available, too. I currently don't know
> what will break if LL address is not available. I guess MLD won't work
> properly and thus even basic connectivity won't work with some switches.
>
>>> 3) making flush not remove link-local address will be changing
>>> functionality of
>>> ip flush command
>> You can flush by specifying the protocol:
>> ip -6 route flush proto static
>
> So we have four possibilities now:
>
> 1) leave it as is
>
> 	seems still acceptable to me
>
> 2) add fe80::/64 route unconditionally if any address gets added
>
> 	Sohny's patch already looks good in doing so at first look.
I don't like this solution, because it's a kernel patch to fix a configuration
problem.

>
> 3) add fe80::/64 route in case LL address gets added via inet6_rtm_newaddr
>
> 	would be ok, too. I tend towards this solution somehow by now.
This seems right also, but I'm not sure that this will fix Sohny's problem.

>
> 4) make flush not remove the fe80::/64 address
>
> 	Least favourable to me. I guess this also woud need iproute change
> 	and seems most difficult to do.
Why isn't it possible to use the command 'ip -6 route flush proto static'?
I think that we know what kind of routes are added for these TAHI tests, hence
it's better to remove only routes added manually (or by a routing daemon if
that's the case).
Removing kernel routes may hide bugs: imagine the kernel adds a wrong route;
TAHI will not detect it.


Regards,
Nicolas


* Re: [PATCH] net:phy:dp83640: Initialize PTP clocks at device init.
From: Richard Cochran @ 2014-02-03 16:33 UTC (permalink / raw)
  To: Stefan Sørensen; +Cc: netdev
In-Reply-To: <1391438218-21994-1-git-send-email-stefan.sorensen@spectralink.com>

On Mon, Feb 03, 2014 at 03:36:58PM +0100, Stefan Sørensen wrote:
> The trigger and events functionality can be useful even if packet
> timestamping is not used, but the required PTP clock is only enabled
> when packet timestamping is started. This patch moves the clock enable
> to when the interface is configured.

Hm, I vaguely recall that there might have been some reason not to enable
the clock too early. (Maybe this was related to multiple PHYs?)

Quickly looking at the code once again, I can't see anything wrong with
this now, but I'll look at it again tomorrow.

Thanks,
Richard


* Re: [PATCH net 4/5] openvswitch: Fix ovs_flow_free() ovs-lock assert.
From: Sergei Shtylyov @ 2014-02-03 16:42 UTC (permalink / raw)
  To: Jesse Gross, David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0SM8Zsap4Y0gw
In-Reply-To: <1391389686-34303-5-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

Hello.

On 03-02-2014 5:08, Jesse Gross wrote:

> From: Pravin B Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

> ovs_flow_free() is not called under ovs-lock during packet
> execute path (ovs_packet_cmd_execute()). Since packet execute
> does not touch flow->mask, there is no need to take that
> lock either. So move assert in case where flow->mask is checked.

> Found by code inspection.

> Signed-off-by: Pravin B Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> ---
>   net/openvswitch/flow_table.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)

> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index bd14052..ad0bda0 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
> @@ -158,11 +158,12 @@ void ovs_flow_free(struct sw_flow *flow, bool deferred)
>   	if (!flow)
>   		return;
>
> -	ASSERT_OVSL();
> -
>   	if (flow->mask) {
>   		struct sw_flow_mask *mask = flow->mask;
>
> +		/* ovs-lock is required to protect mask-refcount and
> +		 * mask list. */

    Networking multi-line comment style is:

/* bla
 * bla
 */

WBR, Sergei


* Re: [PATCH net 1/5] openvswitch: Pad OVS_PACKET_ATTR_PACKET if linear copy was performed
From: Sergei Shtylyov @ 2014-02-03 16:43 UTC (permalink / raw)
  To: Jesse Gross, David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, dev-yBygre7rU0SM8Zsap4Y0gw
In-Reply-To: <1391389686-34303-2-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

Hello.

On 03-02-2014 5:08, Jesse Gross wrote:

> From: Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>

> While the zerocopy method is correctly omitted if user space
> does not support unaligned Netlink messages. The attribute is
> still not padded correctly as skb_zerocopy() will not ensure
> padding and the attribute size is no longer pre calculated
> though nla_reserve() which ensured padding previously.

> This patch applies appropriate padding if a linear data copy
> was performed in skb_zerocopy().

> Signed-off-by: Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>
> Acked-by: Zoltan Kiss <zoltan.kiss-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> ---
>   net/openvswitch/datapath.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)

> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index df46928..3ca9121 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
[...]
> @@ -466,6 +466,11 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>
>   	skb_zerocopy(user_skb, skb, skb->len, hlen);
>
> +	/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
> +	if (!(dp->user_features & OVS_DP_F_UNALIGNED) &&
> +	    (plen = (ALIGN(user_skb->len, NLA_ALIGNTO) - user_skb->len)) > 0)

    This shouldn't pass checkpatch.pl which complains about assignments inside 
*if* statements.
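
One way to keep the assignment out of the condition is to compute the pad
length first and test it afterwards. The sketch below is a userspace
illustration only: nla_pad_len() and pad_to_nla_boundary() are hypothetical
helpers, SKETCH_NLA_ALIGNTO is assumed to match the kernel's NLA_ALIGNTO
value of 4, and the caller is assumed to provide a buffer with room for the
padding.

```c
#include <stddef.h>
#include <string.h>

/* Assumed to match NLA_ALIGNTO from linux/netlink.h. */
#define SKETCH_NLA_ALIGNTO 4u
/* Round len up to the next multiple of a (a must be a power of two). */
#define SKETCH_ALIGN(len, a) (((len) + (a) - 1) & ~((size_t)(a) - 1))

/* Number of pad bytes needed to reach the next attribute boundary. */
static size_t nla_pad_len(size_t len)
{
	return SKETCH_ALIGN(len, SKETCH_NLA_ALIGNTO) - len;
}

/* Zero-fill the padding after len bytes of data and return the padded
 * length. The assignment is a separate statement, so the condition
 * contains only a comparison. */
static size_t pad_to_nla_boundary(unsigned char *buf, size_t len)
{
	size_t plen = nla_pad_len(len);

	if (plen > 0)
		memset(buf + len, 0, plen);
	return len + plen;
}
```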

WBR, Sergei


* usb interrupt storms from the ax88179 hardware
From: David Laight @ 2014-02-03 17:17 UTC (permalink / raw)
  To: netdev, linux-usb@vger.kernel.org, Freddy Xin

On one system (an amd motherboard with the ASMedia xhci controller)
I'm seeing almost back to back USB (7 or 8 a second) 'interrupt'
packets from an ax88179 Ge card.
It may be that other systems behave similarly.
I'm sure this didn't use to happen!

I don't know what the interrupt status means, the value (as LE u32)
is 0x900a1 0xe1cd6d79
The only bit the driver looks at is the 0x10000 bit in the first word.
This is the 'link up/down' flag.
The two halves of the second word are probably different fields, the
high bits appear after about a second.

Now I'd guess that the driver ought to be doing something about some
of these values. While in this mode transmits are delayed for anything
up to 100ms, at least for some time they do get sent.

However the processing of the 'link up/down' flag is clearly wrong.
The code currently does:

	event = urb->transfer_buffer;
	le32_to_cpus((void *)&event->intdata1);

	link = (((__force u32)event->intdata1) & AX_INT_PPLS_LINK) >> 16;

	if (netif_carrier_ok(dev->net) != link) {
		usbnet_link_change(dev, link, 1);
		netdev_info(dev->net, "ax88179 - Link status is: %d\n", link);
	}

Which ends up doing repeated calls to usbnet_link_change() and confusing
that code a lot.

I presume there is some delay before the return value from netif_carrier_ok()
matches the set state.

I think the code should be remembering the link state locally.
It might also need to clear some other interrupt flags, only ASIX know
what they mean.
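
Remembering the link state locally could look roughly like the following.
This is a standalone sketch rather than driver code: struct link_tracker and
link_tracker_update() are hypothetical names, and SKETCH_INT_LINK_BIT simply
encodes the 0x10000 bit mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* The 0x10000 bit of the first interrupt word (link up/down flag). */
#define SKETCH_INT_LINK_BIT 0x10000u

/* Hypothetical per-device state: the last link state that was reported. */
struct link_tracker {
	bool link_up;
	unsigned int changes;	/* number of transitions reported so far */
};

/* Compare the link bit against the remembered state instead of asking
 * the network stack. Returns true only on a real transition, which is
 * when a driver would report the link change. */
static bool link_tracker_update(struct link_tracker *t, uint32_t intdata1)
{
	bool link = (intdata1 & SKETCH_INT_LINK_BIT) != 0;

	if (link == t->link_up)
		return false;

	t->link_up = link;
	t->changes++;
	return true;
}
```

Repeated interrupt packets carrying the same link bit then cause no further
link change reports, however often they arrive.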

	David


* Re: Requeues and ECN marking
From: Eric Dumazet @ 2014-02-03 17:28 UTC (permalink / raw)
  To: Greg Kuperman; +Cc: netdev
In-Reply-To: <CAMvx-beWMC8awScfEtHs8sSkzz0fGNqBH1fn7hC=k0iaFpJvSA@mail.gmail.com>

On Mon, 2014-02-03 at 09:50 -0500, Greg Kuperman wrote:
> Hi all,
> 
> I am testing a new congestion control protocol that relies on explicit
> congestion notifications (ECN) to notify the receiver of a congestion
> event. I have a rate limited link of 1 Mbps, and I am using the RED
> queuing discipline with ECN enabled. What I have noticed is that no
> matter how small I set my queue size, or how low I set my minimum
> marking level, the first ECN marked packet does not get sent out for
> about 10 seconds after the input rate exceeds the output rate. Further
> examination shows that ECN marking does not occur until the number of
> requeues hits 1000. Below are two queries of tc -s -d qdisc ls dev
> eth1.
> 
> qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
> 30 Plog 21 Scell_log 31
>  Sent 1307892 bytes 1247 pkt (dropped 0, overlimits 0 requeues 960)
>  backlog 1052118b 962p requeues 960
>   marked 0 early 0 pdrop 0 other 0
> 
> qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
> 30 Plog 21 Scell_log 31
>  Sent 1379262 bytes 1312 pkt (dropped 0, overlimits 72 requeues 1024)
>  backlog 1122468b 1027p requeues 1024
>   marked 72 early 0 pdrop 0 other 0
> 
> 
> The txqueuelen defaults to 1000 for the interface, so I figured that
> packets may be buffering there, and then dequeuing, before any packets
> are marked. I set txqueuelen to lower values (all the way down to 1),
> but the exact same behavior occurs (no marked packets until the number of
> requeues hits 1000). In contrast, if I set txqueuelen to something very
> high, I get no requeues, drops, or marked packets.
> 
> My goal is for packets to be marked as soon as the ingress rate
> exceeds the egress. Am I correct in thinking that the requeuing
> operation is the culprit? Can I eliminate requeues? Is there something
> else I can do to get the behavior I am looking for?
> 
> Thank you all for the help. And please cc me in your replies; I'm not
> 100% sure if I get all the messages from this mailing list.

Requeues have nothing to do with ECN marking.

How is your rate limiting done?

Post the whole setup, not just part of it; it will help to spot the problem
in one go, instead of over many mail exchanges.


* Re: Requeues and ECN marking
From: Greg Kuperman @ 2014-02-03 17:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1391448503.28432.101.camel@edumazet-glaptop2.roam.corp.google.com>

Thanks for the response. I agree that requeues should have nothing to
do with ECN marking, and that is why I am confused about what is
happening.

The entire setup is as follows. I am using kernel version 3.2.44.
I am running a network emulator (CORE,
http://www.nrl.navy.mil/itd/ncs/products/core), within which I have
four nodes. Each node becomes its own Linux container, running its own
network control on its interfaces. The four nodes are the sender node
s, receiver node r, and two intermediate nodes 1 and 2. Node s is
connected to node 1, which is connected to node 2, which is connected
to node r. Link (1,2) is rate-limited to 1 Mbps (this rate limiting is
handled by another application that applies back pressure to the node
when its buffers are full and it can no longer send packets; the
buffer for that application is variable, and I have set it to hold up
to 10 packets).

I am running RED queuing discipline on the egress of node 1 with the
following setup:
tc qdisc add dev eth1 root red burst 1000000 limit 10000000 avpkt 1000
ecn bandwidth 125 probability 1

I also run it with the following (and have no change in behavior):
tc qdisc add dev eth1 root red min 2000 max 10000 probability 1.0
limit 1000000 burst 10 avpkt 1000 bandwidth 125 ecn probability 1

The odd thing that seems to be happening is that I can see the backlog
and requeues increasing, and once they hit 1000, then packet marking
begins. This is even though I have the minimum in RED set to 1 byte,
and max set to 0 (which, from my understanding, means that packet
marking should begin when the backlog is 1 byte, at the maximum
marking probability right away because the max is set to 0). The
explanation I came up with is that it had something to do with the
requeues, but that may be entirely off base. I have no idea why it
does not begin marking packets right away (which is the desired
behavior).

Thank you again for all of your time, and please let me know if there
is any more info that you guys need.

Some more queue statistics (I'm not sure how helpful this will be):

qdisc red 8004: root refcnt 2 limit 10000000b min 1b max 0b ecn
 Sent 1044606 bytes 996 pkt (dropped 0, overlimits 0 requeues 905)
 backlog 993072b 913p requeues 905
  marked 0 early 0 pdrop 0 other 0

qdisc red 8004: root refcnt 2 limit 10000000b min 1b max 0b ecn
 Sent 1131390 bytes 1076 pkt (dropped 0, overlimits 0 requeues 984)
 backlog 1080870b 992p requeues 984
  marked 0 early 0 pdrop 0 other 0

qdisc red 8004: root refcnt 2 limit 10000000b min 1b max 0b ecn
 Sent 1231386 bytes 1168 pkt (dropped 0, overlimits 179 requeues 1075)
 backlog 1179690b 1082p requeues 1075
  marked 179 early 0 pdrop 0 other 0

qdisc red 8004: root refcnt 2 limit 10000000b min 1b max 0b ecn
 Sent 1334640 bytes 1263 pkt (dropped 0, overlimits 368 requeues 1169)
 backlog 1283958b 1176p requeues 1169
  marked 368 early 0 pdrop 0 other 0



On Mon, Feb 3, 2014 at 12:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-02-03 at 09:50 -0500, Greg Kuperman wrote:
>> Hi all,
>>
>> I am testing a new congestion control protocol that relies on explicit
>> congestion notifications (ECN) to notify the receiver of a congestion
>> event. I have a rate limited link of 1 Mbps, and I am using the RED
>> queuing discipline with ECN enabled. What I have noticed is that no
>> matter how small I set my queue size, or how low I set my minimum
>> marking level, the first ECN marked packet does not get sent out for
>> about 10 seconds after the input rate exceeds the output rate. Further
>> examination shows that ECN marking does not occur until the number or
>> requeues hits 1000. Below are two queries of tc -s -d qdisc ls dev
>> eth1.
>>
>> qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
>> 30 Plog 21 Scell_log 31
>>  Sent 1307892 bytes 1247 pkt (dropped 0, overlimits 0 requeues 960)
>>  backlog 1052118b 962p requeues 960
>>   marked 0 early 0 pdrop 0 other 0
>>
>> qdisc red 8028: root refcnt 2 limit 10000000b min 1b max 0b ecn ewma
>> 30 Plog 21 Scell_log 31
>>  Sent 1379262 bytes 1312 pkt (dropped 0, overlimits 72 requeues 1024)
>>  backlog 1122468b 1027p requeues 1024
>>   marked 72 early 0 pdrop 0 other 0
>>
>>
>> The txqueuelen defaults to 1000 for the interface, so I figured that
>> packets maybe buffering there, and then dequeuing, before any packets
>> are marked. I set txqueuelen to lower values (all the way down to 1),
>> but the exact same behavior occurs (no marked packets until number of
>> dequeues hits 1000). In contrast, if I set txqueuele to something very
>> high, I get no requeues, drops, or marked packets.
>>
>> My goal is for packets to be marked as soon as the ingress rate
>> exceeds the egress. Am I correct in thinking that the requeuing
>> operation is the culprit? Can I eliminate requeues? Is there something
>> else I can do to get the behavior I am looking for?
>>
>> Thank you all for the help. And please cc me in your replies; I'm not
>> 100% sure if I get all the messages from this mailing list.
>
> requeues have nothing to do with ECN marking.
>
> How is done your rate limiting ?
>
> Post the whole setup, not part of it, it will help to spot the problem
> in one go, instead of many mail exchanges.
>
>


* Re: [PATCH RFC 1/1] usb: Tell xhci when usb data might be misaligned
From: Sarah Sharp @ 2014-02-03 17:55 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ming Lei, Bjørn Mork, David Laight,
	linux-usb@vger.kernel.org, netdev@vger.kernel.org,
	Greg Kroah-Hartman, David Miller, Dan Williams, Nyman, Mathias,
	Alan Stern, Freddy Xin
In-Reply-To: <52ED5381.2010106-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

On Sat, Feb 01, 2014 at 03:05:21PM -0500, Mark Lord wrote:
> On 14-02-01 09:18 AM, Ming Lei wrote:
> >
> > Even real regressions are easily/often introduced, and we are discussing
> > how to fix that. I suggest to unset the flag only for the known buggy
> > controllers.

Ming, the regression cannot be easily fixed in this case.  We tried the
"easy, quick fix" and it broke USB storage and usbfs.  The patches to
paper over those issues started to creep into the upper layers, and I'm
not willing to add more code to hack around the issues caused by the
"quick fix".  We need to do this right, not wall-paper over the issues.

> It is not the controllers that are particularly "buggy" here.
> But rather the drivers and design of parts of the kernel.

As Mark mentioned, the host controllers aren't buggy.  The xHCI driver
simply doesn't handle a 1.0 host controller requirement, TD fragments,
very well.  Only the USB ethernet layer triggers this bug, because the
USB storage layer hands down scatter-gather lists in multiples of the
max packet size.

You tested on a 1.0 host controller, and it apparently didn't need the
TD fragments requirement.  It seems that Intel 1.0 xHCI host controllers
do need that requirement.  Perhaps we can add an xHCI driver quirk for
an exception so that your host can allow any kind of scatter-gather?

Sarah Sharp


* Re: [PATCH RFC 1/1] usb: Tell xhci when usb data might be misaligned
From: Sarah Sharp @ 2014-02-03 17:56 UTC (permalink / raw)
  To: David Laight
  Cc: 'Mark Lord', Ming Lei, Bjørn Mork,
	linux-usb@vger.kernel.org, netdev@vger.kernel.org,
	Greg Kroah-Hartman, David Miller, Dan Williams, Nyman, Mathias,
	Alan Stern, Freddy Xin
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6B7707@AcuExch.aculab.com>

On Mon, Feb 03, 2014 at 09:54:09AM +0000, David Laight wrote:
> From: Mark Lord
> > On 14-02-01 09:18 AM, Ming Lei wrote:
> > >
> > > Even real regressions are easily/often introduced, and we are discussing
> > > how to fix that. I suggest to unset the flag only for the known buggy
> > > controllers.
> > 
> > It is not the controllers that are particularly "buggy" here.
> > But rather the drivers and design of parts of the kernel.
> 
> I suspect that the documentation is describing the actual implementation
> of a specific hardware implementation, not necessarily how the hardware was
> intended to behave.

You are speculating.  Please stop speculating without evidence.  It does
not add to this conversation.

Sarah Sharp


* Re: Requeues and ECN marking
From: Eric Dumazet @ 2014-02-03 17:59 UTC (permalink / raw)
  To: Greg Kuperman; +Cc: netdev
In-Reply-To: <CAMvx-bccJce7U4=KAO-uaot8uguORtBK0kPxdFUUgDkNWJVmMQ@mail.gmail.com>

On Mon, 2014-02-03 at 12:48 -0500, Greg Kuperman wrote:

> I am running RED queuing discipline on the egress of node 1 with the
> following setup:
> tc qdisc add dev eth1 root red burst 1000000 limit 10000000 avpkt 1000
> ecn bandwidth 125 probability 1

But these parameters are huge, and you get what you asked for.

burst=1000000 and avpkt=1000 mean you have to queue more than 1000
packets before RED becomes active and eventually marks packets (or
drops them if they are not ECN capable).

Even probability=1 makes little sense.

Please read http://linux.die.net/man/8/tc-red for some guidance.
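
To illustrate why a large burst delays marking, here is a minimal
Python sketch of the averaged queue size that RED compares against
min (it does not compare the instantaneous backlog). It assumes the
textbook update avg += W * (backlog - avg) with W = 2**-ewma. The ewma
value of 30 is taken from the tc output earlier in the thread; the
value 3 is only a hypothetical small setting for comparison. The
function name, the 1000-byte packets and the arrival pattern with no
dequeues are made up for illustration. This is not the kernel's
fixed-point code and it does not try to reproduce the exact numbers
reported.

```python
# Sketch of RED's average-queue estimator: an EWMA with weight W = 2**-wlog.
# Illustrative model only; names and traffic pattern are assumptions.

def packets_until_avg_exceeds(threshold_bytes, wlog,
                              pkt_bytes=1000, max_packets=1_000_000):
    """Count back-to-back arrivals until the averaged queue size exceeds
    threshold_bytes, with nothing being dequeued in the meantime.
    Returns None if the threshold is not reached within max_packets."""
    weight = 2.0 ** -wlog
    avg = 0.0
    backlog = 0
    for n in range(1, max_packets + 1):
        # The average is updated from the backlog the arriving packet sees.
        avg += weight * (backlog - avg)
        if avg > threshold_bytes:
            return n
        backlog += pkt_bytes
    return None


if __name__ == "__main__":
    for wlog in (3, 30):
        n = packets_until_avg_exceeds(threshold_bytes=1, wlog=wlog)
        print(f"wlog={wlog}: average exceeds 1 byte after {n} arrivals")
```

With the small weight exponent the average crosses a 1-byte threshold
almost immediately. With wlog=30 each arrival moves the average by
roughly a billionth of the backlog, so in this model it takes on the
order of a thousand queued packets before the average gets past even
min=1b.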


* Re: Requeues and ECN marking
From: Greg Kuperman @ 2014-02-03 18:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1391450360.28432.104.camel@edumazet-glaptop2.roam.corp.google.com>

That was silly. That was just the latest test to see what I can
change. My actual setup that I have been running is the following:

tc qdisc add dev eth1 root red min 2000 max 10000 probability 1.0
limit 1000000 burst 10 avpkt 1000 bandwidth 125 ecn

Thank you for pointing that out, because the problem went away (when
earlier it was still present). I also changed the txqueuelen to 1,
which, in conjunction with an appropriately sized burst, allows RED to
work properly. I unfortunately forgot that I had set the burst too
high.

What I don't understand then, is if I see the following:
qdisc red 8004: root refcnt 2 limit 10000000b min 1b max 0b ecn
should it not begin marking right away? Is the burst creating a large
buffer between the queue and the device?

Thank you again for the help.

Best,
Greg

On Mon, Feb 3, 2014 at 12:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-02-03 at 12:48 -0500, Greg Kuperman wrote:
>
>> I am running RED queuing discipline on the egress of node 1 with the
>> following setup:
>> tc qdisc add dev eth1 root red burst 1000000 limit 10000000 avpkt 1000
>> ecn bandwidth 125 probability 1
>
> But these parameters are huge, and you get what you asked for.
>
> burst=1000000, and avpkt=1000 means you have to queue more than 1000
> packets before red being active, and marking packets eventually (or drop
> them if non ECN capable)
>
> Even probability=1 makes little sense.
>
> Please read http://linux.die.net/man/8/tc-red for some guidance.
>
>
>


* Re: TI CPSW Ethernet Tx performance regression
From: Mugunthan V N @ 2014-02-03 18:34 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Ben Hutchings, netdev
In-Reply-To: <CAGVrzcbdDAAfFXjYc-ksxqxeJWeY_Jyh1DbwFiOW=p7WqRvzFQ@mail.gmail.com>

Hi

On Friday 17 January 2014 05:05 AM, Florian Fainelli wrote:
> Whenever I had bad TX performance with hardware, the culprit was that
> transmit buffers were not freed quickly enough so the transmit
> scheduler cannot push as many packets as expected. When this happens,
> the root cause for me was bad TX interrupt which messed up the TX flow
> control, but there are plenty other stuff that can go wrong.
>
> You could try to check a few things like TX interrupt rate for the
> same workload on both kernels, dump the queue usage every few seconds
> etc...

I did some further analysis using oprofile and found some more info. In
the v3.2 kernel most of the time is spent in csum_partial_copy_from_user
and cpdma_chan_submit, which are in the Tx path, but in the v3.12 dump
the CPU is held more in __do_softirq and __irq_put_desc_unlock. I think
Tx performance is affected because of this. Since __do_softirq is used
to invoke NAPI, how can I reduce its priority, or is there any other
code that I should be looking into?

Pasting the oprofile dumps with iperf running on the v3.2 and v3.12 kernels:

v3.2:
====
samples  %        app name                 symbol name
33152     9.3792  vmlinux-3.2              csum_partial_copy_from_user
23960     6.7786  vmlinux-3.2              cpdma_chan_submit
19288     5.4569  vmlinux-3.2              __do_softirq
13425     3.7981  vmlinux-3.2              __irq_put_desc_unlock
11065     3.1305  vmlinux-3.2              tcp_packet
8458      2.3929  vmlinux-3.2              __cpdma_chan_free
8386      2.3725  vmlinux-3.2              cpdma_ctlr_int_ctrl
7316      2.0698  vmlinux-3.2              __cpdma_chan_process
5186      1.4672  vmlinux-3.2              tcp_transmit_skb
5118      1.4480  vmlinux-3.2              ipt_do_table
4954      1.4016  vmlinux-3.2              kfree
4857      1.3741  vmlinux-3.2              nf_iterate
4797      1.3571  vmlinux-3.2              tcp_ack
4511      1.2762  vmlinux-3.2              __kmalloc
4433      1.2542  vmlinux-3.2              v7_dma_inv_range
4393      1.2428  vmlinux-3.2              nf_conntrack_in
4069      1.1512  vmlinux-3.2              tcp_sendmsg
3607      1.0205  vmlinux-3.2              local_bh_enable
3148      0.8906  vmlinux-3.2              __memzero
3127      0.8847  vmlinux-3.2              csum_partial
2850      0.8063  vmlinux-3.2              __alloc_skb
2825      0.7992  vmlinux-3.2              ip_queue_xmit
2559      0.7240  vmlinux-3.2              tcp_write_xmit
2399      0.6787  vmlinux-3.2              clocksource_read_cycles
2091      0.5916  vmlinux-3.2              dev_hard_start_xmit


v3.12:
=====
samples  %        app name                 symbol name
9040     15.8034  vmlinux                  __do_softirq
6410     11.2057  vmlinux                  __irq_put_desc_unlock
3584      6.2654  vmlinux                  cpdma_chan_submit
3250      5.6815  vmlinux                  csum_partial_copy_from_user
3070      5.3669  vmlinux                  __cpdma_chan_process
2894      5.0592  vmlinux                  resend_irqs
2567      4.4875  vmlinux                  cpdma_ctlr_int_ctrl
2214      3.8704  vmlinux                  mod_timer
1922      3.3600  vmlinux                  lock_acquire
1402      2.4509  vmlinux                  __cpdma_chan_free
1063      1.8583  vmlinux                  local_bh_enable
783       1.3688  vmlinux                  cpdma_check_free_tx_desc
668       1.1678  vmlinux                  lock_is_held
610       1.0664  vmlinux                  __kmalloc_track_caller
584       1.0209  vmlinux                  lock_release
559       0.9772  vmlinux                  kmem_cache_alloc
557       0.9737  vmlinux                  kfree
460       0.8042  vmlinux                  tcp_transmit_skb
429       0.7500  vmlinux                  tcp_ack
418       0.7307  vmlinux                  tcp_sendmsg
378       0.6608  vmlinux                  kmem_cache_free
366       0.6398  vmlinux                  ip_queue_xmit
363       0.6346  vmlinux                  cache_alloc_refill
351       0.6136  vmlinux                  sub_preempt_count
347       0.6066  vmlinux                  napi_complete
335       0.5856  vmlinux                  __alloc_skb
311       0.5437  vmlinux                  ip_finish_output
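
As a rough way to read the two dumps side by side, the Python sketch
below computes how each symbol's share of samples shifts between the
two kernels. The percentages are copied from the tables above for a
handful of symbols that appear in both; the dictionary and function
names are made up for illustration. Note the assumption baked in: each
percentage is relative to that run's own sample total, and the totals
differ a lot between the runs, so a larger share does not by itself
mean more absolute time in that symbol.

```python
# Percent-of-samples figures copied from the two oprofile dumps above.
V3_2 = {
    "csum_partial_copy_from_user": 9.3792,
    "cpdma_chan_submit": 6.7786,
    "__do_softirq": 5.4569,
    "__irq_put_desc_unlock": 3.7981,
    "cpdma_ctlr_int_ctrl": 2.3725,
    "__cpdma_chan_process": 2.0698,
}
V3_12 = {
    "__do_softirq": 15.8034,
    "__irq_put_desc_unlock": 11.2057,
    "cpdma_chan_submit": 6.2654,
    "csum_partial_copy_from_user": 5.6815,
    "__cpdma_chan_process": 5.3669,
    "cpdma_ctlr_int_ctrl": 4.4875,
}


def share_shift(old, new):
    """Return (symbol, new% - old%) for symbols present in both
    profiles, sorted with the largest increase first."""
    common = old.keys() & new.keys()
    return sorted(((sym, new[sym] - old[sym]) for sym in common),
                  key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for sym, delta in share_shift(V3_2, V3_12):
        print(f"{sym:30s} {delta:+8.4f} percentage points")
```

Sorted this way, the interrupt and softirq symbols move to the top and
the copy/checksum routine moves to the bottom, which matches the
observation in the message.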


* Re: TI CPSW Ethernet Tx performance regression
From: Florian Fainelli @ 2014-02-03 19:24 UTC (permalink / raw)
  To: Mugunthan V N; +Cc: netdev, Ben Hutchings
In-Reply-To: <52D77716.1020205@ti.com>

2014-01-15 Mugunthan V N <mugunthanvnm@ti.com>:
> Hi
>
> On Thursday 16 January 2014 02:51 AM, Florian Fainelli wrote:
>> 2014/1/15 Ben Hutchings <bhutchings@solarflare.com>:
>>> On Wed, 2014-01-15 at 18:18 +0530, Mugunthan V N wrote:
>>>> Hi
>>>>
>>>> I am seeing a performance regression with CPSW driver on AM335x EVM. AM335x EVM
>>>> CPSW has 3.2 kernel support [1] and Mainline support from 3.7. When I am
>>>> comparing the performance between 3.2 and 3.13-rc4. TCP receive performance of
>>>> CPSW between 3.2 and 3.13-rc4 is same (~180Mbps) but TCP Transmit performance
>>>> is poor comparing to 3.2 kernel. In 3.2 kernel is it *256Mbps* and in 3.13-rc4
>>>> it is *70Mbps*
>>>>
>>>> Iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both PC and EVM
>>>>
>>>> On UDP transmit also performance is down comparing to 3.2 kernel. In 3.2 it is
>>>> 196Mbps for 200Mbps band width and in 3.13-rc4 it is 92Mbps
>>>>
>>>> Can someone point me out where can I look for improving Tx performance. I also
>>>> checked whether there is Tx descriptor over flow and there is none. I have
>>>> tries 3.11 and some older kernel, all are giving ~75Mbps Transmit performance
>>>> only.
>>>>
>>>> [1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary
>>> If you don't get any specific suggestions, you could try bisecting to
>>> find out which specific commit(s) changed the performance.
>> Not necessarily related to that issue, but there are a few
>> weird/unusual things done in the CPSW interrupt handler:
>>
>> static irqreturn_t cpsw_interrupt(int irq, void *dev_id)
>> {
>>         struct cpsw_priv *priv = dev_id;
>>
>>         cpsw_intr_disable(priv);
>>         if (priv->irq_enabled == true) {
>>                 cpsw_disable_irq(priv);
>>                 priv->irq_enabled = false;
>>         }
>>
>>         if (netif_running(priv->ndev)) {
>>                 napi_schedule(&priv->napi);
>>                 return IRQ_HANDLED;
>>         }
>>
>> Checking for netif_running() should not be required, you should not
>> get any TX/RX interrupts if your interface is not running.
>
> The driver also supports Dual EMAC with one physical device. More
> description can be found in [1] under the topic *9.2.1.5.2 Dual Mac
> Mode*. If the first interface is down and the second interface is up,
> without checking the interface we will not know which napi to schedule.
>
>>
>>
>>         priv = cpsw_get_slave_priv(priv, 1);
>>         if (!priv)
>>                 return IRQ_NONE;
>>
>> Should not this be moved up as the very first conditional check to do?
>> is not there a risk to leave the interrupts disabled and not
>> re-enabled due to the first 5 lines at the top?
>
> This has to be kept here to check if the interrupt is triggered by the
> second Ethernet port interface when the first interface is down.

Ok, the priv pointer when we enter the interrupt handler could point
to, e.g., slave 0, so we need to get it re-assigned to the second slave
using cpsw_get_slave_priv(). How do you ensure that "priv" at the
beginning of the interrupt handler does not already point to slave 1?
In that case, isn't there a chance of starving slave 0, or at least
of causing excessive latency by exiting the interrupt handler for
slave 1 and then re-entering it for slave 0?

-- 
Florian
