Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH iproute2 0/3] bridge: filtering by vlan id
From: Nikolay Aleksandrov @ 2016-04-11 15:30 UTC (permalink / raw)
  To: netdev; +Cc: stephen, roopa
In-Reply-To: <1460380710-29583-1-git-send-email-nikolay@cumulusnetworks.com>

On 04/11/2016 03:18 PM, Nikolay Aleksandrov wrote:
> Hi,
> This set adds support for filtering by a vlan id when showing fdb/mdb/vlan
> entries. Currently the filtering is implemented entirely in user-space, but
> the plan is to add kernel support as well. The vlan show part is also needed
> for the future per-vlan statistics in order to be able to show them only for
> a specific vlan. I plan to update the bridge man page soon as it's missing
> other options too and it seemed inconsistent to add this given that there're
> potential paragraphs missing, thus I'll post a separate patch for that.
> 
> Thank you,
>  Nik
> 

Self-NAK, after discussing with colleagues, we think it'd be better not to print
the non-matching ports at all (right now they're printed with empty "vlan ids"
column). I'll post a v2 with updated patch 03.

Cheers,
 Nik

^ permalink raw reply

* [PATCH iproute2 v2 0/3] bridge: filtering by vlan id
From: Nikolay Aleksandrov @ 2016-04-11 15:45 UTC (permalink / raw)
  To: netdev; +Cc: roopa, stephen, Nikolay Aleksandrov
In-Reply-To: <570BC2FD.8010201@cumulusnetworks.com>

Hi,
This set adds support for filtering by a vlan id when showing fdb/mdb/vlan
entries. Currently the filtering is implemented entirely in user-space, but
the plan is to add kernel support as well. The vlan show part is also needed
for the future per-vlan statistics in order to be able to show them only for
a specific vlan. I plan to update the bridge man page soon as it's missing
other options too and it seemed inconsistent to add this given that there're
potential paragraphs missing, thus I'll post a separate patch for that.

v2: in patch 03 print only the ports having the vlan instead of empty
"vlan ids" column

Thank you,
 Nik

Nikolay Aleksandrov (3):
  bridge: fdb: add support to filter by vlan id
  bridge: mdb: add support to filter by vlan id
  bridge: vlan: add support to filter by vlan id

 bridge/fdb.c  | 21 +++++++++++++++------
 bridge/mdb.c  | 11 +++++++++--
 bridge/vlan.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 3 files changed, 74 insertions(+), 18 deletions(-)

-- 
2.4.3

^ permalink raw reply

* [PATCH iproute2 v2 1/3] bridge: fdb: add support to filter by vlan id
From: Nikolay Aleksandrov @ 2016-04-11 15:45 UTC (permalink / raw)
  To: netdev; +Cc: roopa, stephen, Nikolay Aleksandrov
In-Reply-To: <1460389516-1643-1-git-send-email-nikolay@cumulusnetworks.com>

Add the optional keyword "vlan" to bridge fdb show so the user can request
filtering by a specific vlan id. Currently the filtering is implemented
only in user-space. The argument name has been chosen to match the
add/del one - "vlan".

Example:
$ bridge fdb show vlan 400
52:54:00:bf:57:16 dev eth2 vlan 400 master br0 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
v2: no change

 bridge/fdb.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index df55e86df83f..be849f980a80 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -27,7 +27,7 @@
 #include "rt_names.h"
 #include "utils.h"
 
-static unsigned int filter_index;
+static unsigned int filter_index, filter_vlan;
 
 static void usage(void)
 {
@@ -35,7 +35,7 @@ static void usage(void)
 			"              [ self ] [ master ] [ use ] [ router ]\n"
 			"              [ local | static | dynamic ] [ dst IPADDR ] [ vlan VID ]\n"
 			"              [ port PORT] [ vni VNI ] [ via DEV ]\n");
-	fprintf(stderr, "       bridge fdb [ show [ br BRDEV ] [ brport DEV ] ]\n");
+	fprintf(stderr, "       bridge fdb [ show [ br BRDEV ] [ brport DEV ] [ vlan VID ] ]\n");
 	exit(-1);
 }
 
@@ -65,6 +65,7 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 	struct ndmsg *r = NLMSG_DATA(n);
 	int len = n->nlmsg_len;
 	struct rtattr *tb[NDA_MAX+1];
+	__u16 vid = 0;
 
 	if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH) {
 		fprintf(stderr, "Not RTM_NEWNEIGH: %08x %08x %08x\n",
@@ -88,6 +89,12 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 	parse_rtattr(tb, NDA_MAX, NDA_RTA(r),
 		     n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
 
+	if (tb[NDA_VLAN])
+		vid = rta_getattr_u16(tb[NDA_VLAN]);
+
+	if (filter_vlan && filter_vlan != vid)
+		return 0;
+
 	if (n->nlmsg_type == RTM_DELNEIGH)
 		fprintf(fp, "Deleted ");
 
@@ -115,11 +122,8 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 				    RTA_DATA(tb[NDA_DST])));
 	}
 
-	if (tb[NDA_VLAN]) {
-		__u16 vid = rta_getattr_u16(tb[NDA_VLAN]);
-
+	if (vid)
 		fprintf(fp, "vlan %hu ", vid);
-	}
 
 	if (tb[NDA_PORT])
 		fprintf(fp, "port %d ", ntohs(rta_getattr_u16(tb[NDA_PORT])));
@@ -190,6 +194,11 @@ static int fdb_show(int argc, char **argv)
 		} else if (strcmp(*argv, "br") == 0) {
 			NEXT_ARG();
 			br = *argv;
+		} else if (strcmp(*argv, "vlan") == 0) {
+			NEXT_ARG();
+			if (filter_vlan)
+				duparg("vlan", *argv);
+			filter_vlan = atoi(*argv);
 		} else {
 			if (matches(*argv, "help") == 0)
 				usage();
-- 
2.4.3

^ permalink raw reply related

* [PATCH iproute2 v2 2/3] bridge: mdb: add support to filter by vlan id
From: Nikolay Aleksandrov @ 2016-04-11 15:45 UTC (permalink / raw)
  To: netdev; +Cc: roopa, stephen, Nikolay Aleksandrov
In-Reply-To: <1460389516-1643-1-git-send-email-nikolay@cumulusnetworks.com>

Add the optional keyword "vid" to bridge mdb show so the user can
request filtering by a specific vlan id. Currently the filtering is
implemented only in user-space. The argument name has been chosen to match
the add/del one - "vid".

Example:
$ bridge mdb show vid 200
dev br0 port eth2 grp 239.0.0.1 permanent vid 200

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
v2: no change

 bridge/mdb.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/bridge/mdb.c b/bridge/mdb.c
index 842536ec003c..6c904f8e6ae8 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -24,12 +24,12 @@
 	((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct br_port_msg))))
 #endif
 
-static unsigned int filter_index;
+static unsigned int filter_index, filter_vlan;
 
 static void usage(void)
 {
 	fprintf(stderr, "Usage: bridge mdb { add | del } dev DEV port PORT grp GROUP [permanent | temp] [vid VID]\n");
-	fprintf(stderr, "       bridge mdb {show} [ dev DEV ]\n");
+	fprintf(stderr, "       bridge mdb {show} [ dev DEV ] [ vid VID ]\n");
 	exit(-1);
 }
 
@@ -92,6 +92,8 @@ static void print_mdb_entry(FILE *f, int ifindex, struct br_mdb_entry *e,
 	const void *src;
 	int af;
 
+	if (filter_vlan && e->vid != filter_vlan)
+		return;
 	af = e->addr.proto == htons(ETH_P_IP) ? AF_INET : AF_INET6;
 	src = af == AF_INET ? (const void *)&e->addr.u.ip4 :
 			      (const void *)&e->addr.u.ip6;
@@ -195,6 +197,11 @@ static int mdb_show(int argc, char **argv)
 			if (filter_dev)
 				duparg("dev", *argv);
 			filter_dev = *argv;
+		} else if (strcmp(*argv, "vid") == 0) {
+			NEXT_ARG();
+			if (filter_vlan)
+				duparg("vid", *argv);
+			filter_vlan = atoi(*argv);
 		}
 		argc--; argv++;
 	}
-- 
2.4.3

^ permalink raw reply related

* [PATCH iproute2 v2 3/3] bridge: vlan: add support to filter by vlan id
From: Nikolay Aleksandrov @ 2016-04-11 15:45 UTC (permalink / raw)
  To: netdev; +Cc: roopa, stephen, Nikolay Aleksandrov
In-Reply-To: <1460389516-1643-1-git-send-email-nikolay@cumulusnetworks.com>

Add the optional keyword "vid" to bridge vlan show so the user can
request filtering by a specific vlan id. Currently the filtering is
implemented only in user-space. The argument name has been chosen to
match the add/del one - "vid". This filtering can be used also with the
"-compressvlans" option to see in which range is a vlan (if in any).
Also this will be used to show only specific per-vlan statistics later
when support is added to the kernel for it.

Examples:
$ bridge vlan show vid 450
port	vlan ids
eth2	 450

$ bridge -c vlan show vid 450
port	vlan ids
eth2	 400-500

$ bridge vlan show vid 1
port	vlan ids
eth1	 1 PVID Egress Untagged
eth2	 1 PVID
br0	 1 PVID Egress Untagged

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
v2: don't print ports which are not matching the vlan filter
The vcheck_ret == 1 case implicit use is to avoid a nesting level which
breaks a lot of lines and produces uglier result.

 bridge/vlan.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 10 deletions(-)

diff --git a/bridge/vlan.c b/bridge/vlan.c
index ae588323d9b1..717025ae6eec 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -13,13 +13,13 @@
 #include "br_common.h"
 #include "utils.h"
 
-static unsigned int filter_index;
+static unsigned int filter_index, filter_vlan;
 
 static void usage(void)
 {
 	fprintf(stderr, "Usage: bridge vlan { add | del } vid VLAN_ID dev DEV [ pvid] [ untagged ]\n");
 	fprintf(stderr, "                                                     [ self ] [ master ]\n");
-	fprintf(stderr, "       bridge vlan { show } [ dev DEV ]\n");
+	fprintf(stderr, "       bridge vlan { show } [ dev DEV ] [ vid VLAN_ID ]\n");
 	exit(-1);
 }
 
@@ -138,6 +138,26 @@ static int vlan_modify(int cmd, int argc, char **argv)
 	return 0;
 }
 
+/* In order to use this function for both filtering and non-filtering cases
+ * we need to make it a tristate:
+ * return -1 - if filtering we've gone over so don't continue
+ * return  0 - skip entry and continue (applies to range start or to entries
+ *             which are less than filter_vlan)
+ * return  1 - print the entry and continue
+ */
+static int filter_vlan_check(struct bridge_vlan_info *vinfo)
+{
+	/* if we're filtering we should stop on the first greater entry */
+	if (filter_vlan && vinfo->vid > filter_vlan &&
+	    !(vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END))
+		return -1;
+	if ((vinfo->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) ||
+	    vinfo->vid < filter_vlan)
+		return 0;
+
+	return 1;
+}
+
 static int print_vlan(const struct sockaddr_nl *who,
 		      struct nlmsghdr *n,
 		      void *arg)
@@ -169,26 +189,40 @@ static int print_vlan(const struct sockaddr_nl *who,
 
 	/* if AF_SPEC isn't there, vlan table is not preset for this port */
 	if (!tb[IFLA_AF_SPEC]) {
-		fprintf(fp, "%s\tNone\n", ll_index_to_name(ifm->ifi_index));
+		if (!filter_vlan)
+			fprintf(fp, "%s\tNone\n",
+				ll_index_to_name(ifm->ifi_index));
 		return 0;
 	} else {
 		struct rtattr *i, *list = tb[IFLA_AF_SPEC];
 		int rem = RTA_PAYLOAD(list);
+		__u16 last_vid_start = 0;
 
-		fprintf(fp, "%s", ll_index_to_name(ifm->ifi_index));
+		if (!filter_vlan)
+			fprintf(fp, "%s", ll_index_to_name(ifm->ifi_index));
 		for (i = RTA_DATA(list); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
 			struct bridge_vlan_info *vinfo;
+			int vcheck_ret;
 
 			if (i->rta_type != IFLA_BRIDGE_VLAN_INFO)
 				continue;
 
 			vinfo = RTA_DATA(i);
-			if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END)
-				fprintf(fp, "-%hu", vinfo->vid);
-			else
-				fprintf(fp, "\t %hu", vinfo->vid);
-			if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN)
+
+			if (!(vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END))
+				last_vid_start = vinfo->vid;
+			vcheck_ret = filter_vlan_check(vinfo);
+			if (vcheck_ret == -1)
+				break;
+			else if (vcheck_ret == 0)
 				continue;
+
+			if (filter_vlan)
+				fprintf(fp, "%s",
+					ll_index_to_name(ifm->ifi_index));
+			fprintf(fp, "\t %hu", last_vid_start);
+			if (last_vid_start != vinfo->vid)
+				fprintf(fp, "-%hu", vinfo->vid);
 			if (vinfo->flags & BRIDGE_VLAN_INFO_PVID)
 				fprintf(fp, " PVID");
 			if (vinfo->flags & BRIDGE_VLAN_INFO_UNTAGGED)
@@ -196,7 +230,8 @@ static int print_vlan(const struct sockaddr_nl *who,
 			fprintf(fp, "\n");
 		}
 	}
-	fprintf(fp, "\n");
+	if (!filter_vlan)
+		fprintf(fp, "\n");
 	fflush(fp);
 	return 0;
 }
@@ -211,6 +246,11 @@ static int vlan_show(int argc, char **argv)
 			if (filter_dev)
 				duparg("dev", *argv);
 			filter_dev = *argv;
+		} else if (strcmp(*argv, "vid") == 0) {
+			NEXT_ARG();
+			if (filter_vlan)
+				duparg("vid", *argv);
+			filter_vlan = atoi(*argv);
 		}
 		argc--; argv++;
 	}
-- 
2.4.3

^ permalink raw reply related

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-11 15:52 UTC (permalink / raw)
  To: Lars Persson; +Cc: Lars Persson, netdev, jhs, linux-kernel, xiyou.wangcong
In-Reply-To: <570BC024.1070504@axis.com>

On Mon, 2016-04-11 at 17:17 +0200, Lars Persson wrote:
> 
> On 04/11/2016 04:22 PM, Eric Dumazet wrote:
> > On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote:
> >
> >> I though it would be prudent because the queue can be non-empty even for
> >> the case of skb=NULL. So should it be there in this patch, another patch
> >> or not at all ?
> >
> > Then maybe change return code ?
> >
> > It seems strange that a validate_xmit_skb_list() failure stops the
> > __qdisc_run() loop but schedules another round.
> >
> >
> 
> It was suggested by Cong Wang to return 0 in order to stop the loop. Do 
> you guys agree that the loop should be stopped for such failures ? Then 
> I will put the schedule call inside the if as you proposed earlier.

What are the causes of validate_xmit_skb_list() failures ?

If gso segmentations fail because of memory pressure, better free more
skbs right now.

In any case, having a single test " if (skb)  " sounds better to me,
to have a fast path.

So your first patch was probably a better idea.

v2 has two tests instead of one.

^ permalink raw reply

* Re: [PATCH net-next 00/11] FUJITSU Extended Socket driver version 1.1
From: David Miller @ 2016-04-11 15:56 UTC (permalink / raw)
  To: izumi.taku; +Cc: netdev
In-Reply-To: <1460362136-14968-1-git-send-email-izumi.taku@jp.fujitsu.com>

This submission is of an extremely low quality.

All of your ioctl additions are completely inappropriate, as are your
debugfs facilities.  You must remove all of them completely.

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Jesper Dangaard Brouer @ 2016-04-11 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, netdev@vger.kernel.org, Brenden Blanco, lsf,
	linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov,
	brouer
In-Reply-To: <20160411130826.GB32073@techsingularity.net>


On Mon, 11 Apr 2016 14:08:27 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
[...]
> > 
> > It is always great if you can optimized the page allocator.  IMHO the
> > page allocator is too slow.  
> 
> It's why I spent some time on it as any improvement in the allocator is
> an unconditional win without requiring driver modifications.
> 
> > At least for my performance needs (67ns
> > per packet, approx 201 cycles at 3GHz).  I've measured[1]
> > alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc).
> >   
> 
> It'd be worth retrying this with the branch
> 
> http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> 

The cost decreased to: 228 cycles(tsc), but there are some variations,
sometimes it increase to 238 cycles(tsc).

Nice, but there is still a looong way to my performance target, where I
can spend 201 cycles for the entire forwarding path....


> This is an unreleased series that contains both the page allocator
> optimisations and the one-LRU-per-node series which in combination remove a
> lot of code from the page allocator fast paths. I have no data on how the
> combined series behaves but each series individually is known to improve
> page allocator performance.
>
> Once you have that, do a hackjob to remove the debugging checks from both the
> alloc and free path and see what that leaves. They could be bypassed properly
> with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> as quickly as possible and willing to be less safe to get that performance.

I would be interested in testing/benchmarking a patch where you remove
the debugging checks...

You are also welcome to try out my benchmarking modules yourself:
 https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst

This is really simple stuff (for rapid prototyping) I'm just doing:
 modprobe page_bench01; rmmod page_bench01 ; dmesg | tail -n40

[...]
> 
> Be aware that compound order allocs like this are a double edged sword as
> it'll be fast sometimes and other times require reclaim/compaction which
> can stall for prolonged periods of time.

Yes, I've notice that there can be a fairly high variation, when doing
compound order allocs, which is not so nice!  I really don't like these
variations....

Drivers also do tricks where they fallback to smaller order pages. E.g.
lookup function mlx4_alloc_pages().  I've tried to simulate that
function here:
https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

It does not seem very optimal. I tried to mem pressure the system a bit
to cause the alloc_pages() to fail, and then the result were very bad,
something like 2500 cycles, and it usually got the next order pages.


> > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.

The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
seen it jump to 611 cycles.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Matthew Wilcox @ 2016-04-11 16:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, James Bottomley, netdev@vger.kernel.org,
	Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc,
	Alexei Starovoitov
In-Reply-To: <20160411130826.GB32073@techsingularity.net>

On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > (dma_map compound page, handout page-fragments for RX ring, and later
> > dma_unmap when last RX page-fragments is seen).
> 
> So, IMO only holding onto the DMA pages is all that is justified but not a
> recycle of order-0 pages built on top of the core allocator. For DMA pages,
> it would take a bit of legwork but the per-cpu allocator could be split
> and converted to hold arbitrary sized pages with a constructer/destructor
> to do the DMA coherency step when pages are taken from or handed back to
> the core allocator. I'm not volunteering to do that unfortunately but I
> estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> aware in which case the memory footprint will be high.

Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
could hold onto a cache of recently used mappings and recycle them if that
still works.  It trades off a bit of security (a device can continue to DMA
after the memory should no longer be accessible to it) for speed, but then
so does the per-driver hack of keeping pages around still mapped.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: pull-request: wireless-drivers-next 2016-04-11
From: David Miller @ 2016-04-11 16:21 UTC (permalink / raw)
  To: kvalo-sgV2jX0FEOL9JmXXK+q4OQ
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87vb3o5n0v.fsf-HodKDYzPHsUD5k0oWYwrnHL1okKdlPRT@public.gmane.org>

From: Kalle Valo <kvalo-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
Date: Mon, 11 Apr 2016 15:48:48 +0300

> here's a pull request for 4.7. More features, but nothing really
> standing out. Please let me know if you have any problems.

Pulled, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net] cxgb4: Stop Rx Queues before freeing it up
From: David Miller @ 2016-04-11 16:21 UTC (permalink / raw)
  To: hariprasad; +Cc: netdev, leedom, nirranjan
In-Reply-To: <1460353078-949-1-git-send-email-hariprasad@chelsio.com>

From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Mon, 11 Apr 2016 11:07:58 +0530

> Stop all Ethernet RX Queues before freeing up various Ingress/Egress
> Queues, etc. We were seeing cases of Ingress Queues not getting serviced
> during the shutdown process leading to Ingress Paths jamming up through
> the chip and blocking the shutdown effort itself.
> 
> One such case involved the Firmware sending a "Flush Token" through the
> ULP-TX -> ULP-RX path for an Ethernet TX Queue being freed in order to
> make sure there weren't any remaining TX Work Requests in the pipeline.
> But the return path was stalled by Ingress Data unable to be delivered to
> the Host because those Ingress Queues were no longer being serviced.
> 
> Based on original work by Casey Leedom <leedom@chelsio.com>
> 
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>

Applied, thank you.

^ permalink raw reply

* Unhandled fault during system suspend in sky2_shutdown
From: Sudeep Holla @ 2016-04-11 16:24 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Sudeep Holla, Stephen Hemminger, Mirko Lindner

Hi,

I am seeing unhandled fault during system suspend in sky2_shutdown.
I am not sure if it's something missing in the firmware, but just wanted
to check. I see that networkmanager is invoking calling to 
netlink_sendmsg which calls sky2_get_stats after the device is shutdown.

Unhandled fault: synchronous external abort (0x96000210) at 
0xffff0000091c2918
Internal error: : 96000210 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126
Hardware name: ARM Juno development board (r2) (DT)
task: ffff80007a673000 ti: ffff800940b5c000 task.ti: ffff800940b5c000
PC is at sky2_get_stats+0x44/0x3b8
LR is at dev_get_stats+0x58/0xc8
  sky2_get_stats+0x44/0x3b8
  rtnl_fill_stats+0x20/0x138
  rtnl_fill_ifinfo+0x440/0xb38
  rtnl_getlink+0xe8/0x198
  rtnetlink_rcv_msg+0xe4/0x220
  netlink_rcv_skb+0xc4/0xf8
  rtnetlink_rcv+0x2c/0x40
  netlink_unicast+0x160/0x238
  netlink_sendmsg+0x2f0/0x358
  sock_sendmsg+0x18/0x30
  ___sys_sendmsg+0x204/0x218
  __sys_sendmsg+0x44/0x88
  SyS_sendmsg+0xc/0x18
  el0_svc_naked+0x24/0x28

The below patch is the hack I came up to check if the netdev is detached 
and unregistered, I no longer see the issue.

Regards,
Sudeep

-->8

diff --git i/drivers/net/ethernet/marvell/sky2.c 
w/drivers/net/ethernet/marvell/sky2.c
index ec0a22119e09..0ff0434e32fc 100644
--- i/drivers/net/ethernet/marvell/sky2.c
+++ w/drivers/net/ethernet/marvell/sky2.c
@@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, 
sky2_suspend, sky2_resume);

  static void sky2_shutdown(struct pci_dev *pdev)
  {
+       struct sky2_hw *hw = pci_get_drvdata(pdev);
+       int i;
+
+       for (i = hw->ports - 1; i >= 0; --i) {
+               sky2_detach(hw->dev[i]);
+               unregister_netdev(hw->dev[i]);
+       }
         sky2_suspend(&pdev->dev);
         pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
         pci_set_power_state(pdev, PCI_D3hot);

^ permalink raw reply related

* Re: [PATCH net-next 2/2] net: exit busy loop when another process is runnable
From: Michael S. Tsirkin @ 2016-04-11 16:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Jason Wang, davem, netdev, linux-kernel,
	Peter Zijlstra, Ingo Molnar
In-Reply-To: <20140822073653.GA7372@gmail.com>

On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote:
> 
> > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > index 1d67fb6..8a33fb2 100644
> > > --- a/include/net/busy_poll.h
> > > +++ b/include/net/busy_poll.h
> > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > >  		cpu_relax();
> > >  
> > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > +		 nr_running_this_cpu() < 2);
> 
> So it's generally a bad idea to couple to the scheduler through 
> such a low level, implementation dependent value like 
> 'nr_running', causing various problems:
> 
>  - It misses important work that might be pending on this CPU,
>    like RCU callbacks.
> 
>  - It will also over-credit task contexts that might be
>    runnable, but which are less important than the currently
>    running one: such as a SCHED_IDLE task
> 
>  - It will also over-credit even regular SCHED_NORMAL tasks, if
>    this current task is more important than them: say
>    SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
>    with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
>    on an otherwise idle system.
> 
> So what you want is a more sophisticated query to the 
> scheduler, a sched_expected_runtime() method that returns the 
> number of nsecs this task is expected to run in the future, 
> which returns 0 if you will be scheduled away on the next 
> schedule(), and returns infinity for a high prio SCHED_FIFO 
> task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
> 
> It will return a regular time slice value in other cases, when 
> there's some load on the CPU.
> 
> The polling logic can then do its decision based on that time 
> value.
> 
> All this can be done reasonably fast and lockless in most 
> cases, so that it can be called from busy-polling code.
>
> An added advantage would be that this approach consolidates the 
> somewhat random need_resched() checks into this method as well.
> 
> In any case I don't agree with the nr_running_this_cpu() 
> method.
> 
> (Please Cc: me and lkml to future iterations of this patchset.)
> 
> Thanks,
> 
> 	Ingo

I tried to look into this: it might be even nicer to add
sched_expected_to_run(time) which tells us whether we expect the current
task to keep running for the next XX nsecs.

For the fair scheduler, it seems that it could be as simple as

+static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
+{
+       struct sched_entity *left;
+       struct sched_entity *curr = cfs_rq->curr;
+
+       if (!curr || !curr->on_rq)
+               return false;
+
+       left = __pick_first_entity(cfs_rq);
+       if (!left)
+               return true;
+
+       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
+                    left->vruntime) < 0;
+}

The reason it seems easier is because that way we can reuse
calc_delta_fair and don't have to do the reverse translation
from vruntime to nsec.

And I guess if we do this with interrupts disabled, and only poke
at the current CPU's rq, we know first entity
won't go away so we don't need locks? 

Is this close to what you had in mind?

Thanks,

-- 
MST

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Eric Dumazet @ 2016-04-11 16:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, James Bottomley, netdev@vger.kernel.org,
	Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc,
	Alexei Starovoitov
In-Reply-To: <20160411181907.15fdb8b9@redhat.com>

On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

We use order-0 pages on mlx4 at Google, as order-3 pages are very
dangerous for some kind of attacks...

An out of order TCP packet can hold an order-3 pages, while claiming to
use 1.5 KBvia skb->truesize.

order-0 only pages allow the page recycle trick used by Intel driver,
and we hardly see any page allocations in typical workloads.

While order-3 pages are 'nice' for friendly datacenter kind of traffic,
they also are a higher risk on hosts connected to the wild Internet.

Maybe I should upstream this patch ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] sh_eth: re-enable-E-MAC interrupts in sh_eth_set_ringparam()
From: Sergei Shtylyov @ 2016-04-11 17:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-renesas-soc
In-Reply-To: <20160409.212725.649496142802966013.davem@davemloft.net>

Hello.

On 04/10/2016 04:27 AM, David Miller wrote:

>> The E-MAC interrupts are left disabled when the ring parameters are changed
>> via 'ethtool'. In order to fix this, it's enough to call sh_eth_dev_init()
>> with 'true' instead of 'false' for the second argument (which conveniently
>> allows us to remove the following code re-enabling E-DMAC interrupts and
>> reception).
>>
>> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>
> Applied, thanks.

    Thanks!
    Unfortunately, this patch isn't fit for the stable kernels because there's 
a prerequisite cleanup (most recent patch merged to this driver). I clearly 
hadn't thought this out well -- should have enabled the E-MAC interrupts 
outside sh_eth_dev_init() and then removed the duplicate code like I did here.

MBR, Sergei

^ permalink raw reply

* Re: [PATCH RFT 2/2] macb: kill PHY reset code
From: Sergei Shtylyov @ 2016-04-11 17:41 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: nicolas.ferre, netdev, linux-kernel
In-Reply-To: <20160411022802.GB4307@lunn.ch>

Hello.

On 04/11/2016 05:28 AM, Andrew Lunn wrote:

>> With  the 'phylib' now  being aware of  the "reset-gpios" PHY node property,
>> there should be no need to frob the PHY reset in this  driver anymore...
>>
>> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>>
>> ---
>>   drivers/net/ethernet/cadence/macb.c |   17 -----------------
>>   drivers/net/ethernet/cadence/macb.h |    1 -
>>   2 files changed, 18 deletions(-)
>>
>> Index: net-next/drivers/net/ethernet/cadence/macb.c
>> ===================================================================
>> --- net-next.orig/drivers/net/ethernet/cadence/macb.c
>> +++ net-next/drivers/net/ethernet/cadence/macb.c
[...]
>> @@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de
>>   	else
>>   		macb_get_hwaddr(bp);
>>
>> -	/* Power up the PHY if there is a GPIO reset */
>> -	phy_node =  of_get_next_available_child(np, NULL);
>> -	if (phy_node) {
>> -		int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
>> -
>> -		if (gpio_is_valid(gpio)) {
>> -			bp->reset_gpio = gpio_to_desc(gpio);
>> -			gpiod_direction_output(bp->reset_gpio, 1);
>
> Hi Sergei
>
> The code you are deleting would of ignored the flags in the gpio
> property, i.e. active low.

    Hm, you're right -- I forgot about that... :-/

> The new code in the previous patch does
> however take the flags into account. Did you check if there are any
> device trees which have flags, which were never used, but are now
> going to be used and thus break...

    Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. Looks 
like it needs to be fixed indeed...

>        Andrew

MBR, Sergei

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Thadeu Lima de Souza Cascardo @ 2016-04-11 17:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mel Gorman, Jesper Dangaard Brouer, James Bottomley,
	netdev@vger.kernel.org, Brenden Blanco, lsf, linux-mm, Mel Gorman,
	Tom Herbert, lsf-pc, Alexei Starovoitov
In-Reply-To: <20160411162047.GJ2781@linux.intel.com>

On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote:
> On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > > (dma_map compound page, handout page-fragments for RX ring, and later
> > > dma_unmap when last RX page-fragments is seen).
> > 
> > So, IMO only holding onto the DMA pages is all that is justified but not a
> > recycle of order-0 pages built on top of the core allocator. For DMA pages,
> > it would take a bit of legwork but the per-cpu allocator could be split
> > and converted to hold arbitrary sized pages with a constructer/destructor
> > to do the DMA coherency step when pages are taken from or handed back to
> > the core allocator. I'm not volunteering to do that unfortunately but I
> > estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> > aware in which case the memory footprint will be high.
> 
> Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
> could hold onto a cache of recently used mappings and recycle them if that
> still works.  It trades off a bit of security (a device can continue to DMA
> after the memory should no longer be accessible to it) for speed, but then
> so does the per-driver hack of keeping pages around still mapped.
> 

There are two problems on the DMA calls on Power servers. One is scalability. A
new allocation method for the address space would be necessary to fix it.

The other one is the latency or the cost of updating the TCE tables. The only
number I have is that I could push around 1M updates per second. So, we could
guess 1us per operation, which is pretty much a no-no for Jesper use case.

Your solution could address both. But I am concerned about the security problem.
Here is why I think this problem should be ignored if we go this way. IOMMU can
be used for three problems: virtualization, paranoia security and debuggability.

For virtualization, there is a solution already, and it's in place for Power and
x86. Power servers have the ability to enlarge the DMA window, allowing the
entire VM memory to be mapped during PCI driver probe time. After that, dma_map
is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even
before booting the guest. Unless we want to fix this for old Power servers, I
see no point in fixing it.

Now, if you are using IOMMU on the host with no passthrough or linear system
memory mapping, you are paranoid. It's not just a matter of security, in fact.
It's also a matter of stability. Hardware, firmware and drivers can be buggy,
and they are. When I worked with drivers on Power servers, I found and fixed a
lot of driver bugs that caused the device to write to memory it was not supposed
to. Good thing is that IOMMU prevented that memory write to happen and the
driver would be reset by EEH. If we can make this scenario faster, and if we
want it to be the default we need to, then your solution might not be desired.
Otherwise, just turn your IOMMU off or put it into passthrough.

Now, the driver keeps pages mapped, but those pages belong to the driver. They
are not pages we decide to give to a userspace process because it's no longer in
use by the driver. So, I don't quite agree this would be a good tradeoff.
Certainly not if we can do it in a way that does not require this.

So, Jesper, please take into consideration that this pool design would rather be
per device. Otherwise, we allow some device to write into another's
device/driver memory.

Cascardo.

^ permalink raw reply

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Cong Wang @ 2016-04-11 17:53 UTC (permalink / raw)
  To: Lars Persson
  Cc: Eric Dumazet, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <570BC024.1070504@axis.com>

On Mon, Apr 11, 2016 at 8:17 AM, Lars Persson <lars.persson@axis.com> wrote:
>
>
> On 04/11/2016 04:22 PM, Eric Dumazet wrote:
>>
>> On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote:
>>
>>> I though it would be prudent because the queue can be non-empty even for
>>> the case of skb=NULL. So should it be there in this patch, another patch
>>> or not at all ?
>>
>>
>> Then maybe change return code ?
>>
>> It seems strange that a validate_xmit_skb_list() failure stops the
>> __qdisc_run() loop but schedules another round.
>>
>>
>
> It was suggested by Cong Wang to return 0 in order to stop the loop. Do you
> guys agree that the loop should be stopped for such failures ? Then I will
> put the schedule call inside the if as you proposed earlier.

I don't see any reason why we should continue on failure. And, I didn't
suggest you to return reschedule it either. I was suggesting to just return
0 for skb == NULL case.

^ permalink raw reply

* Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
From: Martin KaFai Lau @ 2016-04-11 17:56 UTC (permalink / raw)
  To: David Miller, xiyou.wangcong; +Cc: netdev, edumazet, weiwan, kernel-team
In-Reply-To: <20160405.195654.681016570979164308.davem@davemloft.net>

On Tue, Apr 05, 2016 at 07:56:54PM -0400, David Miller wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Mon, 4 Apr 2016 13:45:02 -0700
>
> > On Sat, Apr 2, 2016 at 7:33 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> >> One thing to note is that this patch uses the addresses from the sk
> >> instead of iph when updating sk->sk_dst_cache.  It is basically the
> >> same logic that the __ip6_datagram_connect() is doing, so some
> >> refactoring works in the first two patches.
> >>
> >> AFAIK, a UDP socket can become connected after sending out some
> >> datagrams in un-connected state.  or It can be connected
> >> multiple times to different destinations.  I did some quick
> >> tests but I could be wrong.
> >>
> >> I am thinking if there could be a chance that the skb->data, which
> >> has the original outgoing iph, is not related to the current
> >> connected address.  If it is possible, we have to specifically
> >> use the addresses in the sk instead of skb->data (i.e. iph) when
> >> updating the sk->sk_dst_cache.
> >>
> >> If we need to use the sk addresses (and other info) to find out a
> >> new dst for a connected udp socket, it is better not doing it while
> >> the userland is connecting to somewhere else.
> >>
> >> If the above case is impossible, we can keep using the info from iph to
> >> do the dst update for a connected-udp sk without taking the lock.
> >
> > I see your point, but calling __ip6_datagram_connect() seems overkill
> > here, we don't need to update so many things in the pmtu update context,
> > at least IPv4 doesn't do that either. I don't think you have to do that.
> >
> > So why just updating the dst cache (also some addr cache) here is not
> > enough?
>
> I think we are steadily getting closer to a version of this fix that
> we have some agreement on, right?
>
> Martin can you address Cong's feedback and spin another version of this
> series?
Dave, I will spin v2 shortly with some minor changes.

Cong, I believe the loose ends have been tied up after the last email thread
since last Thursday?

^ permalink raw reply

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Cong Wang @ 2016-04-11 18:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <1460389951.6473.555.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Apr 11, 2016 at 8:52 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-04-11 at 17:17 +0200, Lars Persson wrote:
>>
>> On 04/11/2016 04:22 PM, Eric Dumazet wrote:
>> > On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote:
>> >
>> >> I though it would be prudent because the queue can be non-empty even for
>> >> the case of skb=NULL. So should it be there in this patch, another patch
>> >> or not at all ?
>> >
>> > Then maybe change return code ?
>> >
>> > It seems strange that a validate_xmit_skb_list() failure stops the
>> > __qdisc_run() loop but schedules another round.
>> >
>> >
>>
>> It was suggested by Cong Wang to return 0 in order to stop the loop. Do
>> you guys agree that the loop should be stopped for such failures ? Then
>> I will put the schedule call inside the if as you proposed earlier.
>
> What are the causes of validate_xmit_skb_list() failures ?
>
> If gso segmentations fail because of memory pressure, better free more
> skbs right now.
>
> In any case, having a single test " if (skb)  " sounds better to me,
> to have a fast path.
>
> So your first patch was probably a better idea.
>
> v2 has two tests instead of one.

I am fine with either way as long as the loop stops on failure.
Folding the test "if (skb)" into one also requires to retake the spinlock.

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Mel Gorman @ 2016-04-11 18:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, James Bottomley, netdev@vger.kernel.org,
	Brenden Blanco, lsf, linux-mm, Tom Herbert, lsf-pc,
	Alexei Starovoitov
In-Reply-To: <20160411181907.15fdb8b9@redhat.com>

On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > 
> 
> The cost decreased to: 228 cycles(tsc), but there are some variations,
> sometimes it increase to 238 cycles(tsc).
> 

In the free path, a bulk pcp free adds to the cycles. In the alloc path,
a refill of the pcp lists costs quite a bit. Either option introduces
variances. The bulk free path can be optimised a little so I chucked
some additional patches at it that are not released yet but I suspect the
benefit will be marginal. The real heavy costs there are splitting/merging
buddies. Fixing that is much more fundamental but even fronting the allocator
with a new recycle allocator would not offset that as the refill of the
page-recycling thing would incur high costs.

> Nice, but there is still a looong way to my performance target, where I
> can spend 201 cycles for the entire forwarding path....
> 

While I accept the cost is still too high, I think the effort should still
be spent on improving the allocator in general than trying to bypass it.

> 
> > This is an unreleased series that contains both the page allocator
> > optimisations and the one-LRU-per-node series which in combination remove a
> > lot of code from the page allocator fast paths. I have no data on how the
> > combined series behaves but each series individually is known to improve
> > page allocator performance.
> >
> > Once you have that, do a hackjob to remove the debugging checks from both the
> > alloc and free path and see what that leaves. They could be bypassed properly
> > with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> > as quickly as possible and willing to be less safe to get that performance.
> 
> I would be interested in testing/benchmarking a patch where you remove
> the debugging checks...
> 

Right now, I'm not proposing to remove the debugging checks despite their
cost. They catch really difficult problems in the field unfortunately
including corruption from buggy hardware. A GFP flag that disables them
for a very specific case would be ok but I expect it to be resisted by
others if it's done for the general case. Even a static branch for runtime
debugging checks may be resisted.

Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
the grounds it is of questionable value. Applying that would free a flag
for __GFP_NOACCT that bypasses debugging checks and statistic updates.
That would work for the allocation side at least but doing the same for
the free side would be hard (potentially impossible) to do transparently
for drivers.

> You are also welcome to try out my benchmarking modules yourself:
>  https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> 

I took a quick look and functionally it's similar to the systemtap-based
microbenchmark I'm using in mmtests so I don't think we have a problem
with reproduction at the moment.

> > Be aware that compound order allocs like this are a double edged sword as
> > it'll be fast sometimes and other times require reclaim/compaction which
> > can stall for prolonged periods of time.
> 
> Yes, I've notice that there can be a fairly high variation, when doing
> compound order allocs, which is not so nice!  I really don't like these
> variations....
> 

They can cripple you which is why I'm very wary of performance patches that
require compound pages. It tends to look great only on benchmarks and then
the corner cases hit in the real world and the bug reports are unpleasant.

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> 
> It does not seem very optimal. I tried to mem pressure the system a bit
> to cause the alloc_pages() to fail, and then the result were very bad,
> something like 2500 cycles, and it usually got the next order pages.

The options for fallback tend to have one hazard after the next. It's
partially why the last series focused on order-0 pages only.

> > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.
> 
> The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
> seen it jump to 611 cycles.
> 

The corner cases can be minimised to some extent -- lazy buddy merging for
example but it unfortunately has other consequences for users that require
high-order pages for functional reasons. I tried something like that once
(http://thread.gmane.org/gmane.linux.kernel/807683) but didn't pursue it
to the end as it was a small part of the problem I was dealing with at the
time. It shouldn't be ruled out but it should be considered a last resort.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH RFT 2/2] macb: kill PHY reset code
From: Andrew Lunn @ 2016-04-11 18:19 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: nicolas.ferre, netdev, linux-kernel
In-Reply-To: <570BE1C5.70502@cogentembedded.com>

> >The code you are deleting would of ignored the flags in the gpio
> >property, i.e. active low.
> 
>    Hm, you're right -- I forgot about that... :-/
> 
> >The new code in the previous patch does
> >however take the flags into account. Did you check if there are any
> >device trees which have flags, which were never used, but are now
> >going to be used and thus break...
> 
>    Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts.
> Looks like it needs to be fixed indeed...

And this is where it gets tricky. You are breaking backwards
compatibility by now respecting the flag. An old DT blob is not going
to work.

You potentially need to add a new property and deprecate the old one.

    Andrew

^ permalink raw reply

* Re: Unhandled fault during system suspend in sky2_shutdown
From: Stephen Hemminger @ 2016-04-11 18:24 UTC (permalink / raw)
  To: Sudeep Holla; +Cc: linux-kernel, netdev, Mirko Lindner
In-Reply-To: <570BCFC5.4070208@arm.com>

On Mon, 11 Apr 2016 17:24:37 +0100
Sudeep Holla <sudeep.holla@arm.com> wrote:

> Hi,
> 
> I am seeing unhandled fault during system suspend in sky2_shutdown.
> I am not sure if it's something missing in the firmware, but just wanted
> to check. I see that networkmanager is invoking calling to 
> netlink_sendmsg which calls sky2_get_stats after the device is shutdown.
> 
> Unhandled fault: synchronous external abort (0x96000210) at 
> 0xffff0000091c2918
> Internal error: : 96000210 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126
> Hardware name: ARM Juno development board (r2) (DT)
> task: ffff80007a673000 ti: ffff800940b5c000 task.ti: ffff800940b5c000
> PC is at sky2_get_stats+0x44/0x3b8
> LR is at dev_get_stats+0x58/0xc8
>   sky2_get_stats+0x44/0x3b8
>   rtnl_fill_stats+0x20/0x138
>   rtnl_fill_ifinfo+0x440/0xb38
>   rtnl_getlink+0xe8/0x198
>   rtnetlink_rcv_msg+0xe4/0x220
>   netlink_rcv_skb+0xc4/0xf8
>   rtnetlink_rcv+0x2c/0x40
>   netlink_unicast+0x160/0x238
>   netlink_sendmsg+0x2f0/0x358
>   sock_sendmsg+0x18/0x30
>   ___sys_sendmsg+0x204/0x218
>   __sys_sendmsg+0x44/0x88
>   SyS_sendmsg+0xc/0x18
>   el0_svc_naked+0x24/0x28
> 
> The below patch is the hack I came up to check if the netdev is detached 
> and unregistered, I no longer see the issue.
> 
> Regards,
> Sudeep
> 
> -->8  
> 
> diff --git i/drivers/net/ethernet/marvell/sky2.c 
> w/drivers/net/ethernet/marvell/sky2.c
> index ec0a22119e09..0ff0434e32fc 100644
> --- i/drivers/net/ethernet/marvell/sky2.c
> +++ w/drivers/net/ethernet/marvell/sky2.c
> @@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, 
> sky2_suspend, sky2_resume);
> 
>   static void sky2_shutdown(struct pci_dev *pdev)
>   {
> +       struct sky2_hw *hw = pci_get_drvdata(pdev);
> +       int i;
> +
> +       for (i = hw->ports - 1; i >= 0; --i) {
> +               sky2_detach(hw->dev[i]);
> +               unregister_netdev(hw->dev[i]);
> +       }
>          sky2_suspend(&pdev->dev);
>          pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
>          pci_set_power_state(pdev, PCI_D3hot);

This is not the correct fix, the device is supposed to stay registered.
The correct way to fix this would be to make get_stats ignore requests for device
when suspended.

^ permalink raw reply

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-11 18:26 UTC (permalink / raw)
  To: Cong Wang
  Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <CAM_iQpWM0cce9PtvkenywGpvh0MEMe_0qvkKC00kYFAdFG=BxQ@mail.gmail.com>

On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:

> I am fine with either way as long as the loop stops on failure.
> Folding the test "if (skb)" into one also requires to retake the spinlock.

Adding the likely() in this path would probably help as well.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f18c35024207..07202d9ac4f6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -159,12 +159,14 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
        if (validate)
                skb = validate_xmit_skb_list(skb, dev);
 
-       if (skb) {
+       if (likely(skb)) {
                HARD_TX_LOCK(dev, txq, smp_processor_id());
                if (!netif_xmit_frozen_or_stopped(txq))
                        skb = dev_hard_start_xmit(skb, dev, txq, &ret);
 
                HARD_TX_UNLOCK(dev, txq);
+       } else {
+               ... does all and return...
        }
        spin_lock(root_lock);

^ permalink raw reply related

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-11 18:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <1460399171.6473.562.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:
> 
> > I am fine with either way as long as the loop stops on failure.

Note that skb that could not be validated is already freed.

So I do not see any value from stopping the loop, since
we need to schedule the queue to avoid tx hang.

Just process following skb if there is one, fact that skb is sent or
dropped does not matter.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox