Netdev List


* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: David Miller @ 2012-05-17 19:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: w, netdev
In-Reply-To: <1337276056.3403.37.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 May 2012 19:34:16 +0200

> [PATCH net-next] net: netdev_alloc_skb() use build_skb()
> 
> netdev_alloc_skb() is used by network drivers in their RX path to
> allocate an skb to receive an incoming frame.
> 
> With recent skb->head_frag infrastructure, it makes sense to change
> netdev_alloc_skb() to use build_skb() and a frag allocator.
> 
> This permits a zero copy splice(socket->pipe), and better GRO or TCP
> coalescing.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, we can sort out any fallout very easily before 3.5 is released.

Awesome work Eric.

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: David Miller @ 2012-05-17 19:55 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev
In-Reply-To: <20120517150157.GA19274@1wt.eu>

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 17:01:57 +0200

> From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 
> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.

Eric, please indicate whether we need Willy's patch here.

I want to propagate this fix as fast as possible if so.

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 20:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20120517.155503.2294382162578627387.davem@davemloft.net>

Hi David,

On Thu, May 17, 2012 at 03:55:03PM -0400, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 17:01:57 +0200
> 
> > From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 17 May 2012 16:48:56 +0200
> > Subject: [PATCH] tcp: force push data out when buffers are missing
> > 
> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> > 
> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> 
> Eric, please indicate whether we need Willy's patch here.
> 
> I want to propagate this fix as fast as possible if so.

I think you should hold off for now, because it's possible that my patch
hides another issue instead of fixing it.

I'm having the same stall issue again since I applied Eric's build_skb
patch, but not for all data sizes. So if the same issue is still there,
it's possible that we're playing hide-and-seek with it. That's rather
strange.

Thanks,
Willy

* [PATCH v3] drop_monitor: convert to modular building
From: Neil Horman @ 2012-05-17 20:04 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, David S. Miller, Eric Dumazet, Ben Hutchings
In-Reply-To: <1337178426-2470-1-git-send-email-nhorman@tuxdriver.com>

When I first wrote drop monitor I wrote it to just build monolithically.  There
is no reason it can't be built modularly as well, so let's give it that
flexibility.

I've tested this by building it as both a module and monolithically, and it
seems to work quite well.

Change notes:

v2)
* fixed for_each_present_cpu loops to be more correct as per Eric D.
* Converted exit path failures to BUG_ON as per Ben H.

v3)
* Converted del_timer to del_timer_sync to close race noted by Ben H.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
---
 net/Kconfig             |    2 +-
 net/core/drop_monitor.c |   46 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index e07272d..76ad6fa 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -295,7 +295,7 @@ config NET_TCPPROBE
 	module will be called tcp_probe.
 
 config NET_DROP_MONITOR
-	boolean "Network packet drop alerting service"
+	tristate "Network packet drop alerting service"
 	depends on INET && EXPERIMENTAL && TRACEPOINTS
 	---help---
 	This feature provides an alerting service to userspace in the
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index cfeeef8..f93f985 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -24,6 +24,7 @@
 #include <linux/timer.h>
 #include <linux/bitops.h>
 #include <linux/slab.h>
+#include <linux/module.h>
 #include <net/genetlink.h>
 #include <net/netevent.h>
 
@@ -225,9 +226,15 @@ static int set_all_monitor_traces(int state)
 
 	switch (state) {
 	case TRACE_ON:
+		if (!try_module_get(THIS_MODULE)) {
+			rc = -ENODEV;
+			break;
+		}
+
 		rc |= register_trace_kfree_skb(trace_kfree_skb_hit, NULL);
 		rc |= register_trace_napi_poll(trace_napi_poll_hit, NULL);
 		break;
+
 	case TRACE_OFF:
 		rc |= unregister_trace_kfree_skb(trace_kfree_skb_hit, NULL);
 		rc |= unregister_trace_napi_poll(trace_napi_poll_hit, NULL);
@@ -243,6 +250,9 @@ static int set_all_monitor_traces(int state)
 				kfree_rcu(new_stat, rcu);
 			}
 		}
+
+		module_put(THIS_MODULE);
+
 		break;
 	default:
 		rc = 1;
@@ -368,7 +378,7 @@ static int __init init_net_drop_monitor(void)
 
 	rc = 0;
 
-	for_each_present_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
 		data = &per_cpu(dm_cpu_data, cpu);
 		reset_per_cpu_data(data);
 		INIT_WORK(&data->dm_alert_work, send_dm_alert);
@@ -385,4 +395,36 @@ out:
 	return rc;
 }
 
-late_initcall(init_net_drop_monitor);
+static void exit_net_drop_monitor(void)
+{
+	struct per_cpu_dm_data *data;
+	int cpu;
+
+	BUG_ON(unregister_netdevice_notifier(&dropmon_net_notifier));
+
+	/*
+	 * Because of the module_get/put we do in the trace state change path
+	 * we are guaranteed not to have any current users when we get here
+	 * all we need to do is make sure that we don't have any running timers
+	 * or pending schedule calls
+	 */
+
+	for_each_possible_cpu(cpu) {
+		data = &per_cpu(dm_cpu_data, cpu);
+		del_timer_sync(&data->send_timer);
+		cancel_work_sync(&data->dm_alert_work);
+		/*
+		 * At this point, we should have exclusive access
+		 * to this struct and can free the skb inside it
+		 */
+		kfree_skb(data->skb);
+	}
+
+	BUG_ON(genl_unregister_family(&net_drop_monitor_family));
+}
+
+module_init(init_net_drop_monitor);
+module_exit(exit_net_drop_monitor);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Neil Horman <nhorman@tuxdriver.com>");
-- 
1.7.7.6
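
The reference counting this patch relies on can be illustrated outside the
kernel. What follows is a minimal userspace sketch, not kernel code:
struct mod_sketch, mod_try_get(), mod_put(), mod_try_unload() and
set_traces() are hypothetical stand-ins for the module loader state,
try_module_get(), module_put(), module removal and
set_all_monitor_traces(). It assumes that taking a reference only fails
once unloading has begun, and that unloading is refused while a reference
is held, which is the property the exit-path comment in the patch depends
on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a module's reference state (not a kernel type). */
struct mod_sketch {
	int refs;        /* references held by active trace users */
	bool unloading;  /* set once removal has begun */
};

/* Models try_module_get(): refuse new references once unloading started. */
static bool mod_try_get(struct mod_sketch *m)
{
	if (m->unloading)
		return false;
	m->refs++;
	return true;
}

/* Models module_put(): drop one previously taken reference. */
static void mod_put(struct mod_sketch *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Models module removal: only allowed when nobody holds a reference. */
static bool mod_try_unload(struct mod_sketch *m)
{
	if (m->refs > 0)
		return false;
	m->unloading = true;
	return true;
}

enum { TRACE_ON, TRACE_OFF };

/* Mirrors the shape of the patched switch: get on TRACE_ON, put on TRACE_OFF. */
static int set_traces(struct mod_sketch *m, int state)
{
	switch (state) {
	case TRACE_ON:
		if (!mod_try_get(m))
			return -ENODEV;
		return 0;
	case TRACE_OFF:
		mod_put(m);
		return 0;
	default:
		return 1;
	}
}
```

In this model, an unload attempt between TRACE_ON and TRACE_OFF is refused,
so the exit function never runs while tracing is active.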

* [PATCH] netfilter: xt_recent: Add optional mask option for xt_recent
From: Denys Fedoryshchenko @ 2012-05-17 20:07 UTC (permalink / raw)
  To: Linux netdev; +Cc: Pablo Neira Ayuso, Denys Fedoryshchenko

Use cases for this feature:
1) When you need to allow, block, or match a specific subnet.
2) I can use recent as a trigger when a netfilter rule matches, by using mask 0.0.0.0.

Tested for backward compatibility:
1) old (userspace) iptables, new kernel
2) old kernel, new iptables
3) new kernel, new iptables

For v2:
 As Pablo Neira Ayuso suggested, moved nf_inet_addr_mask to xt_recent.h
 and made info_v1 as a stack variable.

Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/xt_recent.h |   20 +++++++++++
 net/netfilter/xt_recent.c           |   62 ++++++++++++++++++++++++++++++----
 2 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter/xt_recent.h b/include/linux/netfilter/xt_recent.h
index 83318e0..5f69ebc 100644
--- a/include/linux/netfilter/xt_recent.h
+++ b/include/linux/netfilter/xt_recent.h
@@ -32,4 +32,24 @@ struct xt_recent_mtinfo {
 	__u8 side;
 };
 
+struct xt_recent_mtinfo_v1 {
+	__u32 seconds;
+	__u32 hit_count;
+	__u8 check_set;
+	__u8 invert;
+	char name[XT_RECENT_NAME_LEN];
+	__u8 side;
+	union nf_inet_addr mask;
+};
+
+static inline void nf_inet_addr_mask(const union nf_inet_addr *a1,
+				    union nf_inet_addr *result,
+				    const union nf_inet_addr *mask)
+{
+	result->all[0] = a1->all[0] & mask->all[0];
+	result->all[1] = a1->all[1] & mask->all[1];
+	result->all[2] = a1->all[2] & mask->all[2];
+	result->all[3] = a1->all[3] & mask->all[3];
+}
+
 #endif /* _LINUX_NETFILTER_XT_RECENT_H */
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index fc0d6db..ca4375c 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -75,6 +75,7 @@ struct recent_entry {
 struct recent_table {
 	struct list_head	list;
 	char			name[XT_RECENT_NAME_LEN];
+	union nf_inet_addr	mask;
 	unsigned int		refcnt;
 	unsigned int		entries;
 	struct list_head	lru_list;
@@ -228,10 +229,11 @@ recent_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	struct net *net = dev_net(par->in ? par->in : par->out);
 	struct recent_net *recent_net = recent_pernet(net);
-	const struct xt_recent_mtinfo *info = par->matchinfo;
+	const struct xt_recent_mtinfo_v1 *info = par->matchinfo;
 	struct recent_table *t;
 	struct recent_entry *e;
 	union nf_inet_addr addr = {};
+	union nf_inet_addr addr_masked;
 	u_int8_t ttl;
 	bool ret = info->invert;
 
@@ -261,12 +263,15 @@ recent_mt(const struct sk_buff *skb, struct xt_action_param *par)
 
 	spin_lock_bh(&recent_lock);
 	t = recent_table_lookup(recent_net, info->name);
-	e = recent_entry_lookup(t, &addr, par->family,
+
+	nf_inet_addr_mask(&addr, &addr_masked, &t->mask);
+
+	e = recent_entry_lookup(t, &addr_masked, par->family,
 				(info->check_set & XT_RECENT_TTL) ? ttl : 0);
 	if (e == NULL) {
 		if (!(info->check_set & XT_RECENT_SET))
 			goto out;
-		e = recent_entry_init(t, &addr, par->family, ttl);
+		e = recent_entry_init(t, &addr_masked, par->family, ttl);
 		if (e == NULL)
 			par->hotdrop = true;
 		ret = !ret;
@@ -306,10 +311,10 @@ out:
 	return ret;
 }
 
-static int recent_mt_check(const struct xt_mtchk_param *par)
+static int recent_mt_check(const struct xt_mtchk_param *par,
+	const struct xt_recent_mtinfo_v1 *info)
 {
 	struct recent_net *recent_net = recent_pernet(par->net);
-	const struct xt_recent_mtinfo *info = par->matchinfo;
 	struct recent_table *t;
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry *pde;
@@ -361,6 +366,8 @@ static int recent_mt_check(const struct xt_mtchk_param *par)
 		goto out;
 	}
 	t->refcnt = 1;
+
+	memcpy(&t->mask, &info->mask, sizeof(t->mask));
 	strcpy(t->name, info->name);
 	INIT_LIST_HEAD(&t->lru_list);
 	for (i = 0; i < ip_list_hash_size; i++)
@@ -385,10 +392,29 @@ out:
 	return ret;
 }
 
+static int recent_mt_check_v0(const struct xt_mtchk_param *par)
+{
+	const struct xt_recent_mtinfo_v0 *info_v0 = par->matchinfo;
+	struct xt_recent_mtinfo_v1 info_v1;
+	int ret;
+
+	/* Copy old data */
+	memcpy(&info_v1, info_v0, sizeof(struct xt_recent_mtinfo));
+	/* Default mask will make same behavior as old recent */
+	memset(info_v1.mask.all, 0xFF, sizeof(info_v1.mask.all));
+	ret = recent_mt_check(par, &info_v1);
+	return ret;
+}
+
+static int recent_mt_check_v1(const struct xt_mtchk_param *par)
+{
+	return recent_mt_check(par, par->matchinfo);
+}
+
 static void recent_mt_destroy(const struct xt_mtdtor_param *par)
 {
 	struct recent_net *recent_net = recent_pernet(par->net);
-	const struct xt_recent_mtinfo *info = par->matchinfo;
+	const struct xt_recent_mtinfo_v1 *info = par->matchinfo;
 	struct recent_table *t;
 
 	mutex_lock(&recent_mutex);
@@ -625,7 +651,7 @@ static struct xt_match recent_mt_reg[] __read_mostly = {
 		.family     = NFPROTO_IPV4,
 		.match      = recent_mt,
 		.matchsize  = sizeof(struct xt_recent_mtinfo),
-		.checkentry = recent_mt_check,
+		.checkentry = recent_mt_check_v0,
 		.destroy    = recent_mt_destroy,
 		.me         = THIS_MODULE,
 	},
@@ -635,10 +661,30 @@ static struct xt_match recent_mt_reg[] __read_mostly = {
 		.family     = NFPROTO_IPV6,
 		.match      = recent_mt,
 		.matchsize  = sizeof(struct xt_recent_mtinfo),
-		.checkentry = recent_mt_check,
+		.checkentry = recent_mt_check_v0,
+		.destroy    = recent_mt_destroy,
+		.me         = THIS_MODULE,
+	},
+	{
+		.name       = "recent",
+		.revision   = 1,
+		.family     = NFPROTO_IPV4,
+		.match      = recent_mt,
+		.matchsize  = sizeof(struct xt_recent_mtinfo_v1),
+		.checkentry = recent_mt_check_v1,
 		.destroy    = recent_mt_destroy,
 		.me         = THIS_MODULE,
 	},
+	{
+		.name       = "recent",
+		.revision   = 1,
+		.family     = NFPROTO_IPV6,
+		.match      = recent_mt,
+		.matchsize  = sizeof(struct xt_recent_mtinfo_v1),
+		.checkentry = recent_mt_check_v1,
+		.destroy    = recent_mt_destroy,
+		.me         = THIS_MODULE,
+	}
 };
 
 static int __init recent_mt_init(void)
-- 
1.7.3.4
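
The effect of the mask on lookups can be shown with a small userspace
sketch. It is an illustration only: union addr_sketch is a hypothetical
stand-in for union nf_inet_addr, addr_mask() copies the word-by-word AND
done by nf_inet_addr_mask() in the patch, and same_entry() is an invented
helper that models "these two addresses would hit the same table entry".
The address values are written in host byte order for readability; the
AND itself does not depend on byte order as long as address and mask use
the same one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's union nf_inet_addr. */
union addr_sketch {
	uint32_t all[4];  /* IPv6 uses all four words, IPv4 only all[0] */
};

/* Same operation as nf_inet_addr_mask() in the patch: result = a & mask. */
static void addr_mask(const union addr_sketch *a,
		      union addr_sketch *result,
		      const union addr_sketch *mask)
{
	assert(a && result && mask);
	for (int i = 0; i < 4; i++)
		result->all[i] = a->all[i] & mask->all[i];
}

/* Invented helper: do two addresses collapse to the same masked key? */
static int same_entry(const union addr_sketch *a,
		      const union addr_sketch *b,
		      const union addr_sketch *mask)
{
	union addr_sketch ma, mb;

	addr_mask(a, &ma, mask);
	addr_mask(b, &mb, mask);
	return memcmp(&ma, &mb, sizeof(ma)) == 0;
}
```

With an all-ones mask every address keeps its own entry, which matches the
default the patch installs for revision 0. With a /24 mask all hosts of one
subnet share an entry, and with mask 0.0.0.0 every packet maps to a single
entry, which is the "trigger" use case from the description.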

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: David Miller @ 2012-05-17 20:07 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev
In-Reply-To: <20120517200404.GO14498@1wt.eu>

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:04:04 +0200

> I'm having the same stall issue again since I applied Eric's build_skb
> patch, but not for all data sizes. So if the same issue is still there,
> it's possible that we're playing hide-and-seek with it. That's rather
> strange.

Ok, a Heisenbug :-)  Let me know when you guys resolve this.

* Re: [PATCH v3] drop_monitor: convert to modular building
From: Ben Hutchings @ 2012-05-17 20:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller, Eric Dumazet
In-Reply-To: <1337285040-20848-1-git-send-email-nhorman@tuxdriver.com>

On Thu, 2012-05-17 at 16:04 -0400, Neil Horman wrote:
> When I first wrote drop monitor I wrote it to just build monolithically.  There
> is no reason it can't be built modularly as well, so let's give it that
> flexibility.
> 
> I've tested this by building it as both a module and monolithically, and it
> seems to work quite well.
> 
> Change notes:
> 
> v2)
> * fixed for_each_present_cpu loops to be more correct as per Eric D.
> * Converted exit path failures to BUG_ON as per Ben H.
> 
> v3)
> * Converted del_timer to del_timer_sync to close race noted by Ben H.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Ben Hutchings <bhutchings@solarflare.com>
[...]

Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

Thanks,
Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* [PATCH] iptables: xt_recent: Add optional mask option for xt_recent
From: Denys Fedoryshchenko @ 2012-05-17 20:08 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Linux netdev, Pablo Neira Ayuso, Denys Fedoryshchenko

Use cases for this feature:
1) When you need to allow, block, or match a specific subnet.
2) I can use recent as a trigger when a netfilter rule matches, by using mask 0.0.0.0.

Tested for backward compatibility:
1) old (userspace) iptables, new kernel
2) old kernel, new iptables
3) new kernel, new iptables

Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
---
 extensions/libxt_recent.c           |  152 ++++++++++++++++++++++++++++++----
 include/linux/netfilter/xt_recent.h |   11 +++-
 2 files changed, 144 insertions(+), 19 deletions(-)

diff --git a/extensions/libxt_recent.c b/extensions/libxt_recent.c
index c7dce4e..930da29 100644
--- a/extensions/libxt_recent.c
+++ b/extensions/libxt_recent.c
@@ -16,6 +16,7 @@ enum {
 	O_NAME,
 	O_RSOURCE,
 	O_RDEST,
+	O_MASK,
 	F_SET    = 1 << O_SET,
 	F_RCHECK = 1 << O_RCHECK,
 	F_UPDATE = 1 << O_UPDATE,
@@ -25,7 +26,7 @@ enum {
 };
 
 #define s struct xt_recent_mtinfo
-static const struct xt_option_entry recent_opts[] = {
+static const struct xt_option_entry recent_opts_v0[] = {
 	{.name = "set", .id = O_SET, .type = XTTYPE_NONE,
 	 .excl = F_ANY_OP, .flags = XTOPT_INVERT},
 	{.name = "rcheck", .id = O_RCHECK, .type = XTTYPE_NONE,
@@ -50,6 +51,33 @@ static const struct xt_option_entry recent_opts[] = {
 };
 #undef s
 
+#define s struct xt_recent_mtinfo_v1
+static const struct xt_option_entry recent_opts_v1[] = {
+	{.name = "set", .id = O_SET, .type = XTTYPE_NONE,
+	 .excl = F_ANY_OP, .flags = XTOPT_INVERT},
+	{.name = "rcheck", .id = O_RCHECK, .type = XTTYPE_NONE,
+	 .excl = F_ANY_OP, .flags = XTOPT_INVERT},
+	{.name = "update", .id = O_UPDATE, .type = XTTYPE_NONE,
+	 .excl = F_ANY_OP, .flags = XTOPT_INVERT},
+	{.name = "remove", .id = O_REMOVE, .type = XTTYPE_NONE,
+	 .excl = F_ANY_OP, .flags = XTOPT_INVERT},
+	{.name = "seconds", .id = O_SECONDS, .type = XTTYPE_UINT32,
+	 .flags = XTOPT_PUT, XTOPT_POINTER(s, seconds)},
+	{.name = "hitcount", .id = O_HITCOUNT, .type = XTTYPE_UINT32,
+	 .flags = XTOPT_PUT, XTOPT_POINTER(s, hit_count)},
+	{.name = "rttl", .id = O_RTTL, .type = XTTYPE_NONE,
+	 .excl = F_SET | F_REMOVE},
+	{.name = "name", .id = O_NAME, .type = XTTYPE_STRING,
+	 .flags = XTOPT_PUT, XTOPT_POINTER(s, name)},
+	{.name = "rsource", .id = O_RSOURCE, .type = XTTYPE_NONE},
+	{.name = "rdest", .id = O_RDEST, .type = XTTYPE_NONE},
+	{.name = "mask", .id = O_MASK, .type = XTTYPE_HOST,
+	 .flags = XTOPT_PUT, XTOPT_POINTER(s, mask)},
+	XTOPT_TABLEEND,
+};
+#undef s
+
+
 static void recent_help(void)
 {
 	printf(
@@ -74,24 +102,27 @@ static void recent_help(void)
 "    --name name                 Name of the recent list to be used.  DEFAULT used if none given.\n"
 "    --rsource                   Match/Save the source address of each packet in the recent list table (default).\n"
 "    --rdest                     Match/Save the destination address of each packet in the recent list table.\n"
+"    --mask netmask              Netmask that will be applied to this recent list.\n"
 "xt_recent by: Stephen Frost <sfrost@snowman.net>.  http://snowman.net/projects/ipt_recent/\n");
 }
 
-static void recent_init(struct xt_entry_match *match)
+static void recent_init(struct xt_entry_match *match,unsigned int family)
 {
-	struct xt_recent_mtinfo *info = (void *)(match)->data;
+	struct xt_recent_mtinfo    *info_v0 = (void *)(match)->data;
+	struct xt_recent_mtinfo_v1 *info_v1 = (void *)(match)->data;
 
-	strncpy(info->name,"DEFAULT", XT_RECENT_NAME_LEN);
+	strncpy(info_v0->name,"DEFAULT", XT_RECENT_NAME_LEN);
 	/* even though XT_RECENT_NAME_LEN is currently defined as 200,
 	 * better be safe, than sorry */
-	info->name[XT_RECENT_NAME_LEN-1] = '\0';
-	info->side = XT_RECENT_SOURCE;
+	info_v0->name[XT_RECENT_NAME_LEN-1] = '\0';
+	info_v0->side = XT_RECENT_SOURCE;
+	if (family == NFPROTO_IPV6)
+	    memset(&info_v1->mask,0xFF,sizeof(info_v1->mask));
 }
 
 static void recent_parse(struct xt_option_call *cb)
 {
 	struct xt_recent_mtinfo *info = cb->data;
-
 	xtables_option_parse(cb);
 	switch (cb->entry->id) {
 	case O_SET:
@@ -140,9 +171,9 @@ static void recent_check(struct xt_fcheck_call *cb)
 }
 
 static void recent_print(const void *ip, const struct xt_entry_match *match,
-                         int numeric)
+                         unsigned int family)
 {
-	const struct xt_recent_mtinfo *info = (const void *)match->data;
+	const struct xt_recent_mtinfo_v1 *info = (const void *)match->data;
 
 	if (info->invert)
 		printf(" !");
@@ -167,11 +198,17 @@ static void recent_print(const void *ip, const struct xt_entry_match *match,
 		printf(" side: source");
 	if (info->side == XT_RECENT_DEST)
 		printf(" side: dest");
+	if (family == NFPROTO_IPV4)
+	    printf(" mask: %s",
+		xtables_ipaddr_to_numeric(&info->mask.in));
+	if (family == NFPROTO_IPV6)
+	    printf(" mask: %s",
+		xtables_ip6addr_to_numeric(&info->mask.in6));
 }
 
-static void recent_save(const void *ip, const struct xt_entry_match *match)
+static void recent_save(const void *ip, const struct xt_entry_match *match,unsigned int family)
 {
-	const struct xt_recent_mtinfo *info = (const void *)match->data;
+	const struct xt_recent_mtinfo_v1 *info = (const void *)match->data;
 
 	if (info->invert)
 		printf(" !");
@@ -191,28 +228,107 @@ static void recent_save(const void *ip, const struct xt_entry_match *match)
 	if (info->check_set & XT_RECENT_TTL)
 		printf(" --rttl");
 	if(info->name) printf(" --name %s",info->name);
+	if (family == NFPROTO_IPV4)
+	    printf(" --mask %s",
+		xtables_ipaddr_to_numeric(&info->mask.in));
+	if (family == NFPROTO_IPV6)
+	    printf(" --mask %s",
+		xtables_ip6addr_to_numeric(&info->mask.in6));
+		
 	if (info->side == XT_RECENT_SOURCE)
 		printf(" --rsource");
 	if (info->side == XT_RECENT_DEST)
 		printf(" --rdest");
 }
 
-static struct xtables_match recent_mt_reg = {
-	.name          = "recent",
+static void recent_init_v0(struct xt_entry_match *match) {
+	recent_init(match,NFPROTO_UNSPEC);
+}
+
+static void recent_init_v1(struct xt_entry_match *match) {
+	recent_init(match,NFPROTO_IPV6);
+}
+
+static void recent_save_v0(const void *ip, const struct xt_entry_match *match)
+{
+	recent_save(ip,match,NFPROTO_UNSPEC);
+}
+
+static void recent_save_v4(const void *ip, const struct xt_entry_match *match)
+{
+	recent_save(ip,match,NFPROTO_IPV4);
+}
+
+static void recent_save_v6(const void *ip, const struct xt_entry_match *match)
+{
+	recent_save(ip,match,NFPROTO_IPV6);
+}
+
+static void recent_print_v0(const void *ip, const struct xt_entry_match *match,
+                         int numeric)
+{
+	recent_print(ip,match,NFPROTO_UNSPEC);
+}
+
+static void recent_print_v4(const void *ip, const struct xt_entry_match *match,
+                         int numeric)
+{
+	recent_print(ip,match,NFPROTO_IPV4);
+}
+
+static void recent_print_v6(const void *ip, const struct xt_entry_match *match,
+                         int numeric)
+{
+	recent_print(ip,match,NFPROTO_IPV6);
+}
+
+static struct xtables_match recent_mt_reg[] = {
+    {	.name          = "recent",
 	.version       = XTABLES_VERSION,
+	.revision      = 0,
 	.family        = NFPROTO_UNSPEC,
 	.size          = XT_ALIGN(sizeof(struct xt_recent_mtinfo)),
 	.userspacesize = XT_ALIGN(sizeof(struct xt_recent_mtinfo)),
 	.help          = recent_help,
-	.init          = recent_init,
+	.init          = recent_init_v0,
+	.x6_parse      = recent_parse,
+	.x6_fcheck     = recent_check,
+	.print         = recent_print_v0,
+	.save          = recent_save_v0,
+	.x6_options    = recent_opts_v0,
+    },
+    {	.name          = "recent",
+	.version       = XTABLES_VERSION,
+	.revision      = 1,
+	.family        = NFPROTO_IPV4,
+	.size          = XT_ALIGN(sizeof(struct xt_recent_mtinfo_v1)),
+	.userspacesize = XT_ALIGN(sizeof(struct xt_recent_mtinfo_v1)),
+	.help          = recent_help,
+	.init          = recent_init_v1,
+	.x6_parse      = recent_parse,
+	.x6_fcheck     = recent_check,
+	.print         = recent_print_v4,
+	.save          = recent_save_v4,
+	.x6_options    = recent_opts_v1,
+    },
+    {	.name          = "recent",
+	.version       = XTABLES_VERSION,
+	.revision      = 1,
+	.family        = NFPROTO_IPV6,
+	.size          = XT_ALIGN(sizeof(struct xt_recent_mtinfo_v1)),
+	.userspacesize = XT_ALIGN(sizeof(struct xt_recent_mtinfo_v1)),
+	.help          = recent_help,
+	.init          = recent_init_v1,
 	.x6_parse      = recent_parse,
 	.x6_fcheck     = recent_check,
-	.print         = recent_print,
-	.save          = recent_save,
-	.x6_options    = recent_opts,
+	.print         = recent_print_v6,
+	.save          = recent_save_v6,
+	.x6_options    = recent_opts_v1,
+    }
 };
 
 void _init(void)
 {
-	xtables_register_match(&recent_mt_reg);
+	xtables_register_matches(recent_mt_reg,
+				 ARRAY_SIZE(recent_mt_reg));
 }
diff --git a/include/linux/netfilter/xt_recent.h b/include/linux/netfilter/xt_recent.h
index 83318e0..b8d58c6 100644
--- a/include/linux/netfilter/xt_recent.h
+++ b/include/linux/netfilter/xt_recent.h
@@ -22,7 +22,6 @@ enum {
 
 #define XT_RECENT_VALID_FLAGS (XT_RECENT_CHECK|XT_RECENT_SET|XT_RECENT_UPDATE|\
 			       XT_RECENT_REMOVE|XT_RECENT_TTL|XT_RECENT_REAP)
-
 struct xt_recent_mtinfo {
 	__u32 seconds;
 	__u32 hit_count;
@@ -32,4 +31,14 @@ struct xt_recent_mtinfo {
 	__u8 side;
 };
 
+struct xt_recent_mtinfo_v1 {
+	__u32 seconds;
+	__u32 hit_count;
+	__u8 check_set;
+	__u8 invert;
+	char name[XT_RECENT_NAME_LEN];
+	__u8 side;
+	union nf_inet_addr mask;
+};
+
 #endif /* _LINUX_NETFILTER_XT_RECENT_H */
-- 
1.7.3.4


* Re: [PATCH 08/17] net: Introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket
From: David Miller @ 2012-05-17 20:10 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-9-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:22 +0100

> Introduce sk_gfp_atomic(). This function allows sock-specific flags to be
> injected into each sock-related allocation. It is only used on allocation
> paths that may be required for writing pages back to network storage.
> 
> [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>
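
The patch body is not quoted here, so the following is only one plausible
shape for such a helper, written as a userspace sketch. Everything in it is
an assumption made for illustration: gfp_sketch_t, the SKETCH_GFP_* bit
values, struct sock_sketch and sk_gfp_atomic_sketch() are invented names,
and the bit values are placeholders rather than the kernel's. The idea
shown is that the caller always gets an atomic allocation mask, plus at
most the one per-socket flag that the socket has opted into.

```c
#include <assert.h>

typedef unsigned int gfp_sketch_t;

/* Placeholder bit values chosen for this sketch; the kernel's differ. */
#define SKETCH_GFP_ATOMIC   0x01u
#define SKETCH_GFP_MEMALLOC 0x02u
#define SKETCH_GFP_OTHER    0x04u

/* Hypothetical socket carrying its own allocation flags. */
struct sock_sketch {
	gfp_sketch_t sk_allocation;
};

/* Atomic mask plus only the per-socket MEMALLOC bit, if the socket has it. */
static gfp_sketch_t sk_gfp_atomic_sketch(const struct sock_sketch *sk)
{
	assert(sk);
	return SKETCH_GFP_ATOMIC | (sk->sk_allocation & SKETCH_GFP_MEMALLOC);
}
```

Masking with a single bit keeps unrelated per-socket flags from leaking
into atomic allocations.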

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH v3] drop_monitor: convert to modular building
From: David Miller @ 2012-05-17 20:09 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, eric.dumazet, bhutchings
In-Reply-To: <1337285040-20848-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu, 17 May 2012 16:04:00 -0400

> When I first wrote drop monitor I wrote it to just build monolithically.  There
> is no reason it can't be built modularly as well, so let's give it that
> flexibility.
> 
> I've tested this by building it as both a module and monolithically, and it
> seems to work quite well.
> 
> Change notes:
> 
> v2)
> * fixed for_each_present_cpu loops to be more correct as per Eric D.
> * Converted exit path failures to BUG_ON as per Ben H.
> 
> v3)
> * Converted del_timer to del_timer_sync to close race noted by Ben H.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Applied, although it didn't apply cleanly to net-next.

* Re: [PATCH 09/17] netvm: Allow the use of __GFP_MEMALLOC by specific sockets
From: David Miller @ 2012-05-17 20:11 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-10-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:23 +0100

> Allow specific sockets to be tagged SOCK_MEMALLOC and use
> __GFP_MEMALLOC for their allocations. These sockets will be able to go
> below watermarks and allocate from the emergency reserve. Such sockets
> are to be used to service the VM (iow. to swap over). They must be
> handled kernel side, exposing such a socket to user-space is a bug.
> 
> There is a risk that the reserves be depleted so for now, the
> administrator is responsible for increasing min_free_kbytes as
> necessary to prevent deadlock for their workloads.
> 
> [a.p.zijlstra@chello.nl: Original patches]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: David Miller @ 2012-05-17 20:12 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-11-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:24 +0100

> Change the skb allocation API to indicate RX usage and use this to fall
> back to the PFMEMALLOC reserve when needed. SKBs allocated from the
> reserve are tagged in skb->pfmemalloc. If an SKB is allocated from
> the reserve and the socket is later found to be unrelated to page
> reclaim, the packet is dropped so that the memory remains available
> for page reclaim. Network protocols are expected to recover from this
> packet loss.
> 
> [davem@davemloft.net: Use static branches, coding style corrections]
> [a.p.zijlstra@chello.nl: Ideas taken from various patches]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>
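
The drop rule described in the quoted text can be written down as a small
predicate. This is a userspace sketch under stated assumptions, not the
patch's code: struct rx_skb_sketch, struct memalloc_sock_sketch and
must_drop_sketch() are hypothetical names, and the two booleans stand in
for skb->pfmemalloc and for the socket being one that serves page reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical packet: records whether its memory came from the reserve. */
struct rx_skb_sketch {
	bool pfmemalloc;
};

/* Hypothetical socket: records whether it is used to service the VM. */
struct memalloc_sock_sketch {
	bool memalloc;
};

/*
 * A packet built from reserve memory is only kept if the receiving socket
 * is involved in page reclaim; otherwise it is dropped so the reserve
 * memory becomes available again.
 */
static bool must_drop_sketch(const struct rx_skb_sketch *skb,
			     const struct memalloc_sock_sketch *sk)
{
	assert(skb && sk);
	return skb->pfmemalloc && !sk->memalloc;
}
```

Packets allocated from normal memory are never dropped by this rule,
whichever socket they end up on.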

* Re: [PATCH 11/17] netvm: Propagate page->pfmemalloc to skb
From: David Miller @ 2012-05-17 20:12 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-12-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:25 +0100

> The skb->pfmemalloc flag gets set to true iff, during the slab
> allocation of data in __alloc_skb, the PFMEMALLOC reserves
> were used. If the packet is fragmented, it is possible that pages
> will be allocated from the PFMEMALLOC reserve without propagating
> this information to the skb. This patch propagates page->pfmemalloc
> from pages allocated for fragments to the skb.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>

* Re: [PATCH 12/17] netvm: Propagate page->pfmemalloc from skb_alloc_page to skb
From: David Miller @ 2012-05-17 20:13 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-13-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:26 +0100

> The skb->pfmemalloc flag gets set to true iff, during the slab
> allocation of data in __alloc_skb, the PFMEMALLOC reserves
> were used. If page splitting is used, it is possible that pages will
> be allocated from the PFMEMALLOC reserve without propagating this
> information to the skb. This patch propagates page->pfmemalloc from
> pages allocated for fragments to the skb.
> 
> It works by reintroducing and expanding the skb_alloc_page() API
> to take an skb. If the page was allocated from pfmemalloc reserves,
> it is automatically copied. If the driver allocates the page before
> the skb, it should call skb_propagate_pfmemalloc() after the skb is
> allocated to ensure the flag is copied properly.
> 
> Failure to do so is not critical. The resulting driver may perform
> slower if it is used for swap-over-NBD or swap-over-NFS but it should
> not result in failure.
> 
> [davem@davemloft.net: API rename and consistency]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>
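
For readers following the series, the propagation step described in the quoted commit message can be sketched in userspace as below. This is only an illustration under stated assumptions: `struct demo_page`, `struct demo_skb` and `demo_propagate_pfmemalloc()` are hypothetical stand-ins, not the kernel's `struct page`, `struct sk_buff` or the helper the patch adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct sk_buff. */
struct demo_page {
	bool pfmemalloc;	/* page came from the emergency reserves */
};

struct demo_skb {
	bool pfmemalloc;	/* some memory backing this skb came from reserves */
};

/*
 * Copy the "allocated from reserves" mark from a page to the skb that
 * uses it. The flag is only ever set here, never cleared, so an skb that
 * is already marked stays marked when an ordinary page is attached.
 */
static void demo_propagate_pfmemalloc(const struct demo_page *page,
				      struct demo_skb *skb)
{
	if (page && page->pfmemalloc)
		skb->pfmemalloc = true;
}
```

A driver that allocates the page before the skb would call such a helper once the skb exists, which is the ordering the commit message asks for.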

^ permalink raw reply

* Re: [PATCH 13/17] netvm: Set PF_MEMALLOC as appropriate during SKB processing
From: David Miller @ 2012-05-17 20:13 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <1337266231-8031-14-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:50:27 +0100

> In order to make sure pfmemalloc packets receive all memory
> needed to proceed, ensure processing of pfmemalloc SKBs happens
> under PF_MEMALLOC. This is limited to a subset of protocols that
> are expected to be used for writing to swap. Taps are not allowed to
> use PF_MEMALLOC as these are expected to communicate with userspace
> processes which could be paged out.
> 
> [a.p.zijlstra@chello.nl: Ideas taken from various patches]
> [jslaby@suse.cz: Lock imbalance fix]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: David Miller @ 2012-05-17 20:14 UTC (permalink / raw)
  To: mgorman
  Cc: akpm, linux-mm, netdev, linux-nfs, linux-kernel, Trond.Myklebust,
	neilb, hch, a.p.zijlstra, michaelc, emunson
In-Reply-To: <1337266285-8102-2-git-send-email-mgorman@suse.de>

From: Mel Gorman <mgorman@suse.de>
Date: Thu, 17 May 2012 15:51:14 +0100

> It could happen that all !SOCK_MEMALLOC sockets have buffered so
> much data that we're over the global rmem limit. This will prevent
> SOCK_MEMALLOC buffers from receiving data, which will prevent userspace
> from running, which is needed to reduce the buffered data.
> 
> Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
> Once this change is applied, it is important that sockets that set
> SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
> If this happens, a warning is generated and the tokens reclaimed to
> avoid accounting errors until the bug is fixed.
> 
> [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David S. Miller <davem@davemloft.net>
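
The exemption described in the quoted commit message can be pictured with the minimal userspace sketch below. It assumes a single global receive-memory counter and limit; `struct demo_sock` and `demo_rmem_admit()` are hypothetical names for illustration and do not mirror the kernel's actual accounting functions.

```c
#include <stdbool.h>

/* Hypothetical socket with only the field this sketch needs. */
struct demo_sock {
	bool memalloc;		/* stands in for the SOCK_MEMALLOC flag */
};

/*
 * Decide whether a receive buffer of 'size' bytes may be charged.
 * Ordinary sockets are refused once the global limit would be exceeded;
 * sockets marked memalloc are always admitted, so the traffic needed to
 * make forward progress on swap is not starved by other sockets.
 */
static bool demo_rmem_admit(const struct demo_sock *sk, long global_rmem,
			    long global_limit, long size)
{
	if (sk->memalloc)
		return true;
	return global_rmem + size <= global_limit;
}
```

With the counter already over the limit, an ordinary socket is refused while a memalloc socket still receives data, which is the deadlock the patch is meant to break.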

^ permalink raw reply

* Re: [PATCH net-next] net/mlx4_en: num cores tx rings for every UP
From: David Miller @ 2012-05-17 20:19 UTC (permalink / raw)
  To: amirv; +Cc: netdev, oren, john.r.fastabend, liranl
In-Reply-To: <1337252290-20444-1-git-send-email-amirv@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>
Date: Thu, 17 May 2012 13:58:10 +0300

> Change the TX ring scheme such that the number of rings for untagged packets
> and for tagged packets (per each of the vlan priorities) is the same, unlike
> the current situation where tagged traffic gets one ring per priority
> and untagged traffic gets as many rings as there are cores.
> 
> Queue selection is done as follows:
> 
> If the mqprio qdisc operates on the interface, such that the core networking
> code invoked the device setup_tc ndo callback, a mapping of skb->priority =>
> queue set is forced for both tagged and untagged traffic.
> 
> Else, the egress map skb->priority => User priority is used for tagged traffic, and
> all untagged traffic is sent through tx rings of UP 0.
> 
> The patch follows the consensus reached while discussing that issue with
> John Fastabend in this thread http://comments.gmane.org/gmane.linux.network/229877
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Cc: Liran Liss <liranl@mellanox.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.com>

Applied, thanks.
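
The queue selection rule in the quoted description can be sketched roughly as follows. This is an illustration, not the mlx4_en code: `struct demo_tx_cfg` and `demo_select_ring()` are hypothetical, a single table stands in for both the mqprio mapping and the egress map, and picking a ring inside a user priority (UP) by flow hash is an assumption the commit message does not state.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NUM_UP 8	/* VLAN user priorities 0..7 */

/* Hypothetical per-device TX configuration. */
struct demo_tx_cfg {
	bool mqprio_active;		/* setup_tc was invoked by the core */
	unsigned int rings_per_up;	/* same count for every UP, must be > 0 */
	uint8_t prio_to_up[16];		/* skb->priority => UP (or queue set) */
};

/*
 * With mqprio active the priority mapping is forced for tagged and
 * untagged traffic alike. Otherwise only tagged traffic uses the map and
 * untagged traffic goes through the rings of UP 0.
 */
static unsigned int demo_select_ring(const struct demo_tx_cfg *cfg,
				     uint32_t skb_priority, bool vlan_tagged,
				     uint32_t flow_hash)
{
	unsigned int up = 0;

	if (cfg->mqprio_active || vlan_tagged)
		up = cfg->prio_to_up[skb_priority & 15] % DEMO_NUM_UP;

	/* Assumed: spread flows over the rings that belong to this UP. */
	return up * cfg->rings_per_up + flow_hash % cfg->rings_per_up;
}
```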

^ permalink raw reply

* Re: [PATCH net-next v4] be2net: Fix to allow get/set of debug levels in the firmware.
From: David Miller @ 2012-05-17 20:21 UTC (permalink / raw)
  To: somnath.kotur; +Cc: netdev, bhutchings, suresh.reddy
In-Reply-To: <23cb36ad-2827-44bd-a91a-c6d8b01db70e@exht1.ad.emulex.com>

From: Somnath Kotur <somnath.kotur@emulex.com>
Date: Mon, 14 May 2012 21:59:28 +0530

> Fixed missing parenthesis warning
> Incorporated review comments by Ben Hutchings.
> 
> Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>

This doesn't apply cleanly to net-next, please respin.

Also, you should convert your driver now to use ->msg_enable to gate
all of your driver's kernel message logging, not just this firmware
stuff.
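
As a rough picture of what gating on ->msg_enable means, here is a minimal userspace sketch. The `DEMO_MSG_*` bits, `struct demo_adapter` and `demo_msg()` are hypothetical stand-ins chosen for illustration; a real driver would use the kernel's own message-type bits and logging helpers rather than these names.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical message classes, one bit each. */
enum {
	DEMO_MSG_DRV  = 0x0001,
	DEMO_MSG_LINK = 0x0004,
	DEMO_MSG_HW   = 0x2000,
};

/* Hypothetical adapter private data holding the enable bitmap. */
struct demo_adapter {
	uint32_t msg_enable;
};

/*
 * Print 'text' only if the message class is enabled for this adapter.
 * Returns 1 when the message was emitted and 0 when it was filtered.
 */
static int demo_msg(const struct demo_adapter *adapter, uint32_t type,
		    const char *text)
{
	if (!(adapter->msg_enable & type))
		return 0;
	printf("demo: %s\n", text);
	return 1;
}
```

Every log site in the driver then names its class, and one bitmap (settable at runtime) decides which classes reach the log, instead of a separate switch for the firmware messages alone.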

^ permalink raw reply

* Re: [PATCH v3] drop_monitor: convert to modular building
From: Neil Horman @ 2012-05-17 20:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, bhutchings
In-Reply-To: <20120517.160937.586334759945738635.davem@davemloft.net>

On Thu, May 17, 2012 at 04:09:37PM -0400, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Thu, 17 May 2012 16:04:00 -0400
> 
> > When I first wrote drop monitor I wrote it to just build monolithically.  There
> > is no reason it can't be built modularly as well, so let's give it that
> > flexibility.
> > 
> > I've tested this by building it as both a module and monolithically, and it
> > seems to work quite well
> > 
> > Change notes:
> > 
> > v2)
> > * fixed for_each_present_cpu loops to be more correct as per Eric D.
> > * Converted exit path failures to BUG_ON as per Ben H.
> > 
> > v3)
> > * Converted del_timer to del_timer_sync to close race noted by Ben H.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> Applied, although it didn't apply cleanly to net-next.
> 

Apologies Dave, should have told you that I was carrying Joe P.'s cleanup patch
in my net-next tree as well:
http://marc.info/?l=linux-netdev&m=133727344816140&w=2

Since you noted that you had applied it, I applied it myself here.
Neil

^ permalink raw reply

* Re: Severe regression in bnx2 driver with bonding in post 2.6.30 kernels
From: Bo Mackey @ 2012-05-17 20:21 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4BB3463A.2000801@openobjects.com>



Stuart Shelton <stuart <at> openobjects.com> writes:

> 
> 
> Hi all,
> 
> The Broadcom NetXtreme II driver appears to have a severe regression in 
> all kernels post 2.6.30 - I've observed problems with 2.6.31, 2.6.32,
> and 2.6.33.
> 
> The hardware impacted is an IBM Bladecenter LS21 Blade, model 7971.  We 
> have a large number of these, and all are affected.
> 
> We use generic channel-bonding, with the following options in modprobe.conf:
> 


Hi Stuart, et al.,

Has the above issue been fixed? If so, can you please share the root cause and
the diffs? Seeing a similar issue in version 2.0.8 of the bnx2 driver.

Thank you,
Bo

^ permalink raw reply

* [PATCH net-next] lapb: Neaten debugging
From: Joe Perches @ 2012-05-17 20:25 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-x25, netdev, linux-kernel

Enable dynamic debugging and remove a bunch of #ifdef/#endifs.

Add a lapb_dbg(level, fmt, ...) macro and replace the
printk(KERN_DEBUG ...) uses.
Add pr_fmt and remove embedded prefixes.

Signed-off-by: Joe Perches <joe@perches.com>
---
 include/net/lapb.h    |    6 +
 net/lapb/lapb_iface.c |   22 +---
 net/lapb/lapb_in.c    |  320 ++++++++++++++-----------------------------------
 net/lapb/lapb_out.c   |   38 ++----
 net/lapb/lapb_subr.c  |   28 ++---
 net/lapb/lapb_timer.c |   32 ++----
 6 files changed, 140 insertions(+), 306 deletions(-)
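
The level check in the new macro maps each old `#if LAPB_DEBUG > N` guard to `lapb_dbg(N, ...)`, since `N < LAPB_DEBUG` is the same condition. The userspace sketch below illustrates that gating only: `demo_pr_debug()`, `lapb_demo_emitted` and `lapb_demo_run()` are hypothetical stand-ins for pr_debug() and a caller, and `LAPB_DEBUG` is set to 2 here purely for demonstration (the header in the patch keeps it at 0).

```c
#include <stdio.h>

/* Assumed verbosity for this demo; the patched header defines it as 0. */
#define LAPB_DEBUG 2

/* Userspace stand-in for pr_debug(): counts and prints each message. */
static int lapb_demo_emitted;
#define demo_pr_debug(fmt, ...) \
	(lapb_demo_emitted++, fprintf(stderr, "lapb: " fmt, ##__VA_ARGS__))

/* Same shape as the macro added to include/net/lapb.h. */
#define lapb_dbg(level, fmt, ...)			\
do {							\
	if (level < LAPB_DEBUG)				\
		demo_pr_debug(fmt, ##__VA_ARGS__);	\
} while (0)

/* Issue one message per level and report how many got through. */
static int lapb_demo_run(void)
{
	lapb_demo_emitted = 0;
	lapb_dbg(0, "(%p) S0 -> S1\n", (void *)0);	/* 0 < 2: emitted */
	lapb_dbg(1, "(%p) S1 TX DISC(1)\n", (void *)0);	/* 1 < 2: emitted */
	lapb_dbg(2, "S%d TX %02X\n", 1, 0x3f);		/* 2 < 2: dropped */
	return lapb_demo_emitted;
}
```

Because the comparison uses compile-time constants, the compiler can discard disabled calls, while enabled ones still pass through pr_debug() and so remain subject to dynamic debug control in the kernel.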

diff --git a/include/net/lapb.h b/include/net/lapb.h
index fd2bf57..df892a9 100644
--- a/include/net/lapb.h
+++ b/include/net/lapb.h
@@ -149,4 +149,10 @@ extern int  lapb_t1timer_running(struct lapb_cb *lapb);
  */
 #define	LAPB_DEBUG	0
 
+#define lapb_dbg(level, fmt, ...)			\
+do {							\
+	if (level < LAPB_DEBUG)				\
+		pr_debug(fmt, ##__VA_ARGS__);		\
+} while (0)
+
 #endif
diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c
index ab3d35f..3cdaa04 100644
--- a/net/lapb/lapb_iface.c
+++ b/net/lapb/lapb_iface.c
@@ -15,6 +15,8 @@
  *	2000-10-29	Henner Eisen	lapb_data_indication() return status.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/module.h>
 #include <linux/errno.h>
 #include <linux/types.h>
@@ -279,9 +281,7 @@ int lapb_connect_request(struct net_device *dev)
 
 	lapb_establish_data_link(lapb);
 
-#if LAPB_DEBUG > 0
-	printk(KERN_DEBUG "lapb: (%p) S0 -> S1\n", lapb->dev);
-#endif
+	lapb_dbg(0, "(%p) S0 -> S1\n", lapb->dev);
 	lapb->state = LAPB_STATE_1;
 
 	rc = LAPB_OK;
@@ -305,12 +305,8 @@ int lapb_disconnect_request(struct net_device *dev)
 		goto out_put;
 
 	case LAPB_STATE_1:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 TX DISC(1)\n", lapb->dev);
-#endif
-#if LAPB_DEBUG > 0
-		printk(KERN_DEBUG "lapb: (%p) S1 -> S0\n", lapb->dev);
-#endif
+		lapb_dbg(1, "(%p) S1 TX DISC(1)\n", lapb->dev);
+		lapb_dbg(0, "(%p) S1 -> S0\n", lapb->dev);
 		lapb_send_control(lapb, LAPB_DISC, LAPB_POLLON, LAPB_COMMAND);
 		lapb->state = LAPB_STATE_0;
 		lapb_start_t1timer(lapb);
@@ -329,12 +325,8 @@ int lapb_disconnect_request(struct net_device *dev)
 	lapb_stop_t2timer(lapb);
 	lapb->state = LAPB_STATE_2;
 
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S3 DISC(1)\n", lapb->dev);
-#endif
-#if LAPB_DEBUG > 0
-	printk(KERN_DEBUG "lapb: (%p) S3 -> S2\n", lapb->dev);
-#endif
+	lapb_dbg(1, "(%p) S3 DISC(1)\n", lapb->dev);
+	lapb_dbg(0, "(%p) S3 -> S2\n", lapb->dev);
 
 	rc = LAPB_OK;
 out_put:
diff --git a/net/lapb/lapb_in.c b/net/lapb/lapb_in.c
index f4e3c1a..5dba899 100644
--- a/net/lapb/lapb_in.c
+++ b/net/lapb/lapb_in.c
@@ -15,6 +15,8 @@
  *	2000-10-29	Henner Eisen	lapb_data_indication() return status.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/socket.h>
@@ -44,25 +46,16 @@ static void lapb_state0_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 {
 	switch (frame->type) {
 	case LAPB_SABM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S0 RX SABM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S0 RX SABM(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S0 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S0 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S0 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S0 -> S3\n", lapb->dev);
-#endif
+			lapb_dbg(1, "(%p) S0 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
+			lapb_dbg(0, "(%p) S0 -> S3\n", lapb->dev);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -78,18 +71,11 @@ static void lapb_state0_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_SABME:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S0 RX SABME(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S0 RX SABME(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S0 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S0 -> S3\n", lapb->dev);
-#endif
+			lapb_dbg(1, "(%p) S0 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
+			lapb_dbg(0, "(%p) S0 -> S3\n", lapb->dev);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -102,22 +88,16 @@ static void lapb_state0_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->va        = 0;
 			lapb_connect_indication(lapb, LAPB_OK);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S0 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S0 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		}
 		break;
 
 	case LAPB_DISC:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S0 RX DISC(%d)\n",
-		       lapb->dev, frame->pf);
-		printk(KERN_DEBUG "lapb: (%p) S0 TX UA(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S0 RX DISC(%d)\n", lapb->dev, frame->pf);
+		lapb_dbg(1, "(%p) S0 TX UA(%d)\n", lapb->dev, frame->pf);
 		lapb_send_control(lapb, LAPB_UA, frame->pf, LAPB_RESPONSE);
 		break;
 
@@ -137,68 +117,45 @@ static void lapb_state1_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 {
 	switch (frame->type) {
 	case LAPB_SABM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 RX SABM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S1 RX SABM(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S1 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S1 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S1 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S1 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 		}
 		break;
 
 	case LAPB_SABME:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 RX SABME(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S1 RX SABME(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S1 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S1 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S1 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S1 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		}
 		break;
 
 	case LAPB_DISC:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 RX DISC(%d)\n",
-		       lapb->dev, frame->pf);
-		printk(KERN_DEBUG "lapb: (%p) S1 TX DM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S1 RX DISC(%d)\n", lapb->dev, frame->pf);
+		lapb_dbg(1, "(%p) S1 TX DM(%d)\n", lapb->dev, frame->pf);
 		lapb_send_control(lapb, LAPB_DM, frame->pf, LAPB_RESPONSE);
 		break;
 
 	case LAPB_UA:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 RX UA(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S1 RX UA(%d)\n", lapb->dev, frame->pf);
 		if (frame->pf) {
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S1 -> S3\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S1 -> S3\n", lapb->dev);
 			lapb_stop_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
 			lapb->state     = LAPB_STATE_3;
@@ -212,14 +169,9 @@ static void lapb_state1_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_DM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S1 RX DM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S1 RX DM(%d)\n", lapb->dev, frame->pf);
 		if (frame->pf) {
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S1 -> S0\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S1 -> S0\n", lapb->dev);
 			lapb_clear_queues(lapb);
 			lapb->state = LAPB_STATE_0;
 			lapb_start_t1timer(lapb);
@@ -242,34 +194,22 @@ static void lapb_state2_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 	switch (frame->type) {
 	case LAPB_SABM:
 	case LAPB_SABME:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S2 RX {SABM,SABME}(%d)\n",
-		       lapb->dev, frame->pf);
-		printk(KERN_DEBUG "lapb: (%p) S2 TX DM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S2 RX {SABM,SABME}(%d)\n",
+			 lapb->dev, frame->pf);
+		lapb_dbg(1, "(%p) S2 TX DM(%d)\n", lapb->dev, frame->pf);
 		lapb_send_control(lapb, LAPB_DM, frame->pf, LAPB_RESPONSE);
 		break;
 
 	case LAPB_DISC:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S2 RX DISC(%d)\n",
-		       lapb->dev, frame->pf);
-		printk(KERN_DEBUG "lapb: (%p) S2 TX UA(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S2 RX DISC(%d)\n", lapb->dev, frame->pf);
+		lapb_dbg(1, "(%p) S2 TX UA(%d)\n", lapb->dev, frame->pf);
 		lapb_send_control(lapb, LAPB_UA, frame->pf, LAPB_RESPONSE);
 		break;
 
 	case LAPB_UA:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S2 RX UA(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S2 RX UA(%d)\n", lapb->dev, frame->pf);
 		if (frame->pf) {
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S2 -> S0\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S2 -> S0\n", lapb->dev);
 			lapb->state = LAPB_STATE_0;
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
@@ -278,14 +218,9 @@ static void lapb_state2_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_DM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S2 RX DM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S2 RX DM(%d)\n", lapb->dev, frame->pf);
 		if (frame->pf) {
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S2 -> S0\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S2 -> S0\n", lapb->dev);
 			lapb->state = LAPB_STATE_0;
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
@@ -297,12 +232,9 @@ static void lapb_state2_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 	case LAPB_REJ:
 	case LAPB_RNR:
 	case LAPB_RR:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S2 RX {I,REJ,RNR,RR}(%d)\n",
-		       lapb->dev, frame->pf);
-		printk(KERN_DEBUG "lapb: (%p) S2 RX DM(%d)\n",
+		lapb_dbg(1, "(%p) S2 RX {I,REJ,RNR,RR}(%d)\n",
 		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S2 RX DM(%d)\n", lapb->dev, frame->pf);
 		if (frame->pf)
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
@@ -325,22 +257,15 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 
 	switch (frame->type) {
 	case LAPB_SABM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX SABM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S3 RX SABM(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S3 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S3 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S3 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S3 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -355,15 +280,10 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_SABME:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX SABME(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S3 RX SABME(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S3 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S3 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -375,23 +295,16 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->va        = 0;
 			lapb_requeue_frames(lapb);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S3 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S3 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		}
 		break;
 
 	case LAPB_DISC:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX DISC(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-		printk(KERN_DEBUG "lapb: (%p) S3 -> S0\n", lapb->dev);
-#endif
+		lapb_dbg(1, "(%p) S3 RX DISC(%d)\n", lapb->dev, frame->pf);
+		lapb_dbg(0, "(%p) S3 -> S0\n", lapb->dev);
 		lapb_clear_queues(lapb);
 		lapb_send_control(lapb, LAPB_UA, frame->pf, LAPB_RESPONSE);
 		lapb_start_t1timer(lapb);
@@ -401,13 +314,8 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_DM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX DM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-		printk(KERN_DEBUG "lapb: (%p) S3 -> S0\n", lapb->dev);
-#endif
+		lapb_dbg(1, "(%p) S3 RX DM(%d)\n", lapb->dev, frame->pf);
+		lapb_dbg(0, "(%p) S3 -> S0\n", lapb->dev);
 		lapb_clear_queues(lapb);
 		lapb->state = LAPB_STATE_0;
 		lapb_start_t1timer(lapb);
@@ -416,10 +324,8 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_RNR:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX RNR(%d) R%d\n",
-		       lapb->dev, frame->pf, frame->nr);
-#endif
+		lapb_dbg(1, "(%p) S3 RX RNR(%d) R%d\n",
+			 lapb->dev, frame->pf, frame->nr);
 		lapb->condition |= LAPB_PEER_RX_BUSY_CONDITION;
 		lapb_check_need_response(lapb, frame->cr, frame->pf);
 		if (lapb_validate_nr(lapb, frame->nr)) {
@@ -428,9 +334,7 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->frmr_data = *frame;
 			lapb->frmr_type = LAPB_FRMR_Z;
 			lapb_transmit_frmr(lapb);
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S3 -> S4\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S3 -> S4\n", lapb->dev);
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
 			lapb->state   = LAPB_STATE_4;
@@ -439,10 +343,8 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_RR:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX RR(%d) R%d\n",
-		       lapb->dev, frame->pf, frame->nr);
-#endif
+		lapb_dbg(1, "(%p) S3 RX RR(%d) R%d\n",
+			 lapb->dev, frame->pf, frame->nr);
 		lapb->condition &= ~LAPB_PEER_RX_BUSY_CONDITION;
 		lapb_check_need_response(lapb, frame->cr, frame->pf);
 		if (lapb_validate_nr(lapb, frame->nr)) {
@@ -451,9 +353,7 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->frmr_data = *frame;
 			lapb->frmr_type = LAPB_FRMR_Z;
 			lapb_transmit_frmr(lapb);
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S3 -> S4\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S3 -> S4\n", lapb->dev);
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
 			lapb->state   = LAPB_STATE_4;
@@ -462,10 +362,8 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_REJ:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX REJ(%d) R%d\n",
-		       lapb->dev, frame->pf, frame->nr);
-#endif
+		lapb_dbg(1, "(%p) S3 RX REJ(%d) R%d\n",
+			 lapb->dev, frame->pf, frame->nr);
 		lapb->condition &= ~LAPB_PEER_RX_BUSY_CONDITION;
 		lapb_check_need_response(lapb, frame->cr, frame->pf);
 		if (lapb_validate_nr(lapb, frame->nr)) {
@@ -477,9 +375,7 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->frmr_data = *frame;
 			lapb->frmr_type = LAPB_FRMR_Z;
 			lapb_transmit_frmr(lapb);
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S3 -> S4\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S3 -> S4\n", lapb->dev);
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
 			lapb->state   = LAPB_STATE_4;
@@ -488,17 +384,13 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_I:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX I(%d) S%d R%d\n",
-		       lapb->dev, frame->pf, frame->ns, frame->nr);
-#endif
+		lapb_dbg(1, "(%p) S3 RX I(%d) S%d R%d\n",
+			 lapb->dev, frame->pf, frame->ns, frame->nr);
 		if (!lapb_validate_nr(lapb, frame->nr)) {
 			lapb->frmr_data = *frame;
 			lapb->frmr_type = LAPB_FRMR_Z;
 			lapb_transmit_frmr(lapb);
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S3 -> S4\n", lapb->dev);
-#endif
+			lapb_dbg(0, "(%p) S3 -> S4\n", lapb->dev);
 			lapb_start_t1timer(lapb);
 			lapb_stop_t2timer(lapb);
 			lapb->state   = LAPB_STATE_4;
@@ -522,7 +414,7 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			 * a frame lost on the wire.
 			 */
 			if (cn == NET_RX_DROP) {
-				printk(KERN_DEBUG "LAPB: rx congestion\n");
+				pr_debug("rx congestion\n");
 				break;
 			}
 			lapb->vr = (lapb->vr + 1) % modulus;
@@ -541,11 +433,8 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 				if (frame->pf)
 					lapb_enquiry_response(lapb);
 			} else {
-#if LAPB_DEBUG > 1
-				printk(KERN_DEBUG
-				       "lapb: (%p) S3 TX REJ(%d) R%d\n",
-				       lapb->dev, frame->pf, lapb->vr);
-#endif
+				lapb_dbg(1, "(%p) S3 TX REJ(%d) R%d\n",
+					 lapb->dev, frame->pf, lapb->vr);
 				lapb->condition |= LAPB_REJECT_CONDITION;
 				lapb_send_control(lapb, LAPB_REJ, frame->pf,
 						  LAPB_RESPONSE);
@@ -555,31 +444,22 @@ static void lapb_state3_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_FRMR:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX FRMR(%d) %02X "
-		       "%02X %02X %02X %02X\n", lapb->dev, frame->pf,
-		       skb->data[0], skb->data[1], skb->data[2],
-		       skb->data[3], skb->data[4]);
-#endif
+		lapb_dbg(1, "(%p) S3 RX FRMR(%d) %02X %02X %02X %02X %02X\n",
+			 lapb->dev, frame->pf,
+			 skb->data[0], skb->data[1], skb->data[2],
+			 skb->data[3], skb->data[4]);
 		lapb_establish_data_link(lapb);
-#if LAPB_DEBUG > 0
-		printk(KERN_DEBUG "lapb: (%p) S3 -> S1\n", lapb->dev);
-#endif
+		lapb_dbg(0, "(%p) S3 -> S1\n", lapb->dev);
 		lapb_requeue_frames(lapb);
 		lapb->state = LAPB_STATE_1;
 		break;
 
 	case LAPB_ILLEGAL:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S3 RX ILLEGAL(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S3 RX ILLEGAL(%d)\n", lapb->dev, frame->pf);
 		lapb->frmr_data = *frame;
 		lapb->frmr_type = LAPB_FRMR_W;
 		lapb_transmit_frmr(lapb);
-#if LAPB_DEBUG > 0
-		printk(KERN_DEBUG "lapb: (%p) S3 -> S4\n", lapb->dev);
-#endif
+		lapb_dbg(0, "(%p) S3 -> S4\n", lapb->dev);
 		lapb_start_t1timer(lapb);
 		lapb_stop_t2timer(lapb);
 		lapb->state   = LAPB_STATE_4;
@@ -600,25 +480,16 @@ static void lapb_state4_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 {
 	switch (frame->type) {
 	case LAPB_SABM:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S4 RX SABM(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S4 RX SABM(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S4 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S4 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S4 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S4 -> S3\n", lapb->dev);
-#endif
+			lapb_dbg(1, "(%p) S4 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
+			lapb_dbg(0, "(%p) S4 -> S3\n", lapb->dev);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -634,18 +505,11 @@ static void lapb_state4_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 		break;
 
 	case LAPB_SABME:
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S4 RX SABME(%d)\n",
-		       lapb->dev, frame->pf);
-#endif
+		lapb_dbg(1, "(%p) S4 RX SABME(%d)\n", lapb->dev, frame->pf);
 		if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S4 TX UA(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
-#if LAPB_DEBUG > 0
-			printk(KERN_DEBUG "lapb: (%p) S4 -> S3\n", lapb->dev);
-#endif
+			lapb_dbg(1, "(%p) S4 TX UA(%d)\n",
+				 lapb->dev, frame->pf);
+			lapb_dbg(0, "(%p) S4 -> S3\n", lapb->dev);
 			lapb_send_control(lapb, LAPB_UA, frame->pf,
 					  LAPB_RESPONSE);
 			lapb_stop_t1timer(lapb);
@@ -658,10 +522,8 @@ static void lapb_state4_machine(struct lapb_cb *lapb, struct sk_buff *skb,
 			lapb->va        = 0;
 			lapb_connect_indication(lapb, LAPB_OK);
 		} else {
-#if LAPB_DEBUG > 1
-			printk(KERN_DEBUG "lapb: (%p) S4 TX DM(%d)\n",
-			       lapb->dev, frame->pf);
-#endif
+			lapb_dbg(1, "(%p) S4 TX DM(%d)\n",
+				 lapb->dev, frame->pf);
 			lapb_send_control(lapb, LAPB_DM, frame->pf,
 					  LAPB_RESPONSE);
 		}
diff --git a/net/lapb/lapb_out.c b/net/lapb/lapb_out.c
index baab276..ba4d015 100644
--- a/net/lapb/lapb_out.c
+++ b/net/lapb/lapb_out.c
@@ -14,6 +14,8 @@
  *	LAPB 002	Jonathan Naylor	New timer architecture.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/socket.h>
@@ -60,10 +62,8 @@ static void lapb_send_iframe(struct lapb_cb *lapb, struct sk_buff *skb, int poll
 		*frame |= lapb->vs << 1;
 	}
 
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S%d TX I(%d) S%d R%d\n",
-	       lapb->dev, lapb->state, poll_bit, lapb->vs, lapb->vr);
-#endif
+	lapb_dbg(1, "(%p) S%d TX I(%d) S%d R%d\n",
+		 lapb->dev, lapb->state, poll_bit, lapb->vs, lapb->vr);
 
 	lapb_transmit_buffer(lapb, skb, LAPB_COMMAND);
 }
@@ -148,11 +148,9 @@ void lapb_transmit_buffer(struct lapb_cb *lapb, struct sk_buff *skb, int type)
 		}
 	}
 
-#if LAPB_DEBUG > 2
-	printk(KERN_DEBUG "lapb: (%p) S%d TX %02X %02X %02X\n",
-	       lapb->dev, lapb->state,
-	       skb->data[0], skb->data[1], skb->data[2]);
-#endif
+	lapb_dbg(2, "(%p) S%d TX %02X %02X %02X\n",
+		 lapb->dev, lapb->state,
+		 skb->data[0], skb->data[1], skb->data[2]);
 
 	if (!lapb_data_transmit(lapb, skb))
 		kfree_skb(skb);
@@ -164,16 +162,10 @@ void lapb_establish_data_link(struct lapb_cb *lapb)
 	lapb->n2count   = 0;
 
 	if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S%d TX SABME(1)\n",
-		       lapb->dev, lapb->state);
-#endif
+		lapb_dbg(1, "(%p) S%d TX SABME(1)\n", lapb->dev, lapb->state);
 		lapb_send_control(lapb, LAPB_SABME, LAPB_POLLON, LAPB_COMMAND);
 	} else {
-#if LAPB_DEBUG > 1
-		printk(KERN_DEBUG "lapb: (%p) S%d TX SABM(1)\n",
-		       lapb->dev, lapb->state);
-#endif
+		lapb_dbg(1, "(%p) S%d TX SABM(1)\n", lapb->dev, lapb->state);
 		lapb_send_control(lapb, LAPB_SABM, LAPB_POLLON, LAPB_COMMAND);
 	}
 
@@ -183,10 +175,8 @@ void lapb_establish_data_link(struct lapb_cb *lapb)
 
 void lapb_enquiry_response(struct lapb_cb *lapb)
 {
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S%d TX RR(1) R%d\n",
-	       lapb->dev, lapb->state, lapb->vr);
-#endif
+	lapb_dbg(1, "(%p) S%d TX RR(1) R%d\n",
+		 lapb->dev, lapb->state, lapb->vr);
 
 	lapb_send_control(lapb, LAPB_RR, LAPB_POLLON, LAPB_RESPONSE);
 
@@ -195,10 +185,8 @@ void lapb_enquiry_response(struct lapb_cb *lapb)
 
 void lapb_timeout_response(struct lapb_cb *lapb)
 {
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S%d TX RR(0) R%d\n",
-	       lapb->dev, lapb->state, lapb->vr);
-#endif
+	lapb_dbg(1, "(%p) S%d TX RR(0) R%d\n",
+		 lapb->dev, lapb->state, lapb->vr);
 	lapb_send_control(lapb, LAPB_RR, LAPB_POLLOFF, LAPB_RESPONSE);
 
 	lapb->condition &= ~LAPB_ACK_PENDING_CONDITION;
diff --git a/net/lapb/lapb_subr.c b/net/lapb/lapb_subr.c
index 066225b..9d0a426 100644
--- a/net/lapb/lapb_subr.c
+++ b/net/lapb/lapb_subr.c
@@ -13,6 +13,8 @@
  *	LAPB 001	Jonathan Naylor	Started Coding
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/socket.h>
@@ -111,11 +113,9 @@ int lapb_decode(struct lapb_cb *lapb, struct sk_buff *skb,
 {
 	frame->type = LAPB_ILLEGAL;
 
-#if LAPB_DEBUG > 2
-	printk(KERN_DEBUG "lapb: (%p) S%d RX %02X %02X %02X\n",
-	       lapb->dev, lapb->state,
-	       skb->data[0], skb->data[1], skb->data[2]);
-#endif
+	lapb_dbg(2, "(%p) S%d RX %02X %02X %02X\n",
+		 lapb->dev, lapb->state,
+		 skb->data[0], skb->data[1], skb->data[2]);
 
 	/* We always need to look at 2 bytes, sometimes we need
 	 * to look at 3 and those cases are handled below.
@@ -284,12 +284,10 @@ void lapb_transmit_frmr(struct lapb_cb *lapb)
 		dptr++;
 		*dptr++ = lapb->frmr_type;
 
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S%d TX FRMR %02X %02X %02X %02X %02X\n",
-	       lapb->dev, lapb->state,
-	       skb->data[1], skb->data[2], skb->data[3],
-	       skb->data[4], skb->data[5]);
-#endif
+		lapb_dbg(1, "(%p) S%d TX FRMR %02X %02X %02X %02X %02X\n",
+			 lapb->dev, lapb->state,
+			 skb->data[1], skb->data[2], skb->data[3],
+			 skb->data[4], skb->data[5]);
 	} else {
 		dptr    = skb_put(skb, 4);
 		*dptr++ = LAPB_FRMR;
@@ -301,11 +299,9 @@ void lapb_transmit_frmr(struct lapb_cb *lapb)
 		dptr++;
 		*dptr++ = lapb->frmr_type;
 
-#if LAPB_DEBUG > 1
-	printk(KERN_DEBUG "lapb: (%p) S%d TX FRMR %02X %02X %02X\n",
-	       lapb->dev, lapb->state, skb->data[1],
-	       skb->data[2], skb->data[3]);
-#endif
+		lapb_dbg(1, "(%p) S%d TX FRMR %02X %02X %02X\n",
+			 lapb->dev, lapb->state, skb->data[1],
+			 skb->data[2], skb->data[3]);
 	}
 
 	lapb_transmit_buffer(lapb, skb, LAPB_RESPONSE);
diff --git a/net/lapb/lapb_timer.c b/net/lapb/lapb_timer.c
index f8cd641..54563ad 100644
--- a/net/lapb/lapb_timer.c
+++ b/net/lapb/lapb_timer.c
@@ -14,6 +14,8 @@
  *	LAPB 002	Jonathan Naylor	New timer architecture.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/socket.h>
@@ -105,21 +107,17 @@ static void lapb_t1timer_expiry(unsigned long param)
 				lapb_clear_queues(lapb);
 				lapb->state = LAPB_STATE_0;
 				lapb_disconnect_indication(lapb, LAPB_TIMEDOUT);
-#if LAPB_DEBUG > 0
-				printk(KERN_DEBUG "lapb: (%p) S1 -> S0\n", lapb->dev);
-#endif
+				lapb_dbg(0, "(%p) S1 -> S0\n", lapb->dev);
 				return;
 			} else {
 				lapb->n2count++;
 				if (lapb->mode & LAPB_EXTENDED) {
-#if LAPB_DEBUG > 1
-					printk(KERN_DEBUG "lapb: (%p) S1 TX SABME(1)\n", lapb->dev);
-#endif
+					lapb_dbg(1, "(%p) S1 TX SABME(1)\n",
+						 lapb->dev);
 					lapb_send_control(lapb, LAPB_SABME, LAPB_POLLON, LAPB_COMMAND);
 				} else {
-#if LAPB_DEBUG > 1
-					printk(KERN_DEBUG "lapb: (%p) S1 TX SABM(1)\n", lapb->dev);
-#endif
+					lapb_dbg(1, "(%p) S1 TX SABM(1)\n",
+						 lapb->dev);
 					lapb_send_control(lapb, LAPB_SABM, LAPB_POLLON, LAPB_COMMAND);
 				}
 			}
@@ -133,15 +131,11 @@ static void lapb_t1timer_expiry(unsigned long param)
 				lapb_clear_queues(lapb);
 				lapb->state = LAPB_STATE_0;
 				lapb_disconnect_confirmation(lapb, LAPB_TIMEDOUT);
-#if LAPB_DEBUG > 0
-				printk(KERN_DEBUG "lapb: (%p) S2 -> S0\n", lapb->dev);
-#endif
+				lapb_dbg(0, "(%p) S2 -> S0\n", lapb->dev);
 				return;
 			} else {
 				lapb->n2count++;
-#if LAPB_DEBUG > 1
-				printk(KERN_DEBUG "lapb: (%p) S2 TX DISC(1)\n", lapb->dev);
-#endif
+				lapb_dbg(1, "(%p) S2 TX DISC(1)\n", lapb->dev);
 				lapb_send_control(lapb, LAPB_DISC, LAPB_POLLON, LAPB_COMMAND);
 			}
 			break;
@@ -155,9 +149,7 @@ static void lapb_t1timer_expiry(unsigned long param)
 				lapb->state = LAPB_STATE_0;
 				lapb_stop_t2timer(lapb);
 				lapb_disconnect_indication(lapb, LAPB_TIMEDOUT);
-#if LAPB_DEBUG > 0
-				printk(KERN_DEBUG "lapb: (%p) S3 -> S0\n", lapb->dev);
-#endif
+				lapb_dbg(0, "(%p) S3 -> S0\n", lapb->dev);
 				return;
 			} else {
 				lapb->n2count++;
@@ -173,9 +165,7 @@ static void lapb_t1timer_expiry(unsigned long param)
 				lapb_clear_queues(lapb);
 				lapb->state = LAPB_STATE_0;
 				lapb_disconnect_indication(lapb, LAPB_TIMEDOUT);
-#if LAPB_DEBUG > 0
-				printk(KERN_DEBUG "lapb: (%p) S4 -> S0\n", lapb->dev);
-#endif
+				lapb_dbg(0, "(%p) S4 -> S0\n", lapb->dev);
 				return;
 			} else {
 				lapb->n2count++;
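
The conversion above replaces each "#if LAPB_DEBUG > N" block with a
lapb_dbg(N, ...) call. A minimal userspace sketch of that gating follows,
assuming the macro compares the level against LAPB_DEBUG the same way the
removed #if blocks did. sketch_pr_debug() and sketch_emitted are
hypothetical stand-ins for pr_debug() and are not kernel names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Compile-time verbosity, standing in for the kernel's LAPB_DEBUG. */
#ifndef LAPB_DEBUG
#define LAPB_DEBUG 2
#endif

/* Counts emitted messages so the gating can be observed. */
static int sketch_emitted;

/*
 * Hypothetical stand-in for pr_debug(): prints a "lapb: " prefix, as a
 * pr_fmt() definition based on the module name would, then the message.
 */
static void sketch_pr_debug(const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	fputs("lapb: ", stdout);
	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
	sketch_emitted++;
}

/* Emit only when LAPB_DEBUG is greater than level ("#if LAPB_DEBUG > level"). */
#define lapb_dbg(level, ...)				\
do {							\
	if ((level) < LAPB_DEBUG)			\
		sketch_pr_debug(__VA_ARGS__);		\
} while (0)
```

With LAPB_DEBUG set to 2 in this sketch, level 0 and level 1 messages are
printed and level 2 messages are skipped, which matches what the old
preprocessor guards did.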

^ permalink raw reply related

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Eric Dumazet @ 2012-05-17 20:41 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20120517121800.GA18052@1wt.eu>

On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
> 
> I'm seeing this on haproxy running on a small ARM machine (a
> dockstar), which exchanges data through a gig switch with my
> development PC. The NIC (mv643xx) doesn't support TSO but has
> GSO enabled. If I disable GSO, the problem remains. I can however
> make the problem disappear by disabling SG or Tx checksumming.
> BTW, using recv/send() instead of splice() also gets rid of the
> problem.
> 
> I can also reduce the risk of seeing the problem by increasing
> the default TCP buffer sizes in tcp_wmem. By default I'm running
> at 16kB, but if I increase the output buffer size above the pipe
> size, the problem *seems* to disappear though I can't be certain,
> since larger buffers generally mean the problem takes longer to
> appear, probably due to the fact that the buffers don't need to
> be filled. Still I'm certain that with 64k TCP buffers and 128k
> pipes I'm still seeing it.
> 
> With strace, I'm seeing data fill up the pipe with the splice()
> call responsible for pushing the data to the output socket returning
> -1 EAGAIN. During this time, the client receives no data.
> 
> Something bugs me, I have tested with a dummy server of mine,
> httpterm, which uses tee+splice() to push data outside, and it
> has no problem filling the gig pipe, and correctly recovers
> from the EAGAIN :
> 
>   send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
>   tee(0x3, 0x6, 0x10000, 0x2)             = 42552
>   splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
>   ...
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
>   tee(0x3, 0x6, 0x10000, 0x2)             = 50744
>   splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
>   tee(0x3, 0x6, 0x10000, 0x2)             = 30264
>   splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> 
> etc...
> 
> It's only with haproxy which uses splice() to copy data between
> two sockets that I'm getting the issue (data forwarded from fd 0xe
> to fd 0x6) :
> 
>   16:03:17.797144 pipe([36, 37])          = 0
>   16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
>   16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
>   16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
>   16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
>   16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
>   16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
>   16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
>   16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
>   16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
>   16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
>   16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
>   16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
>   16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
>   16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
>   16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
>   16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
>   16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
>   16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
>   16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
>   16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
>   16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
>   16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
> 

Willy, you say output to fd 6 hangs, but splice() returns EAGAIN here?
(because the socket buffer is full)

> So output fd 6 hangs here and will not appear anymore until
> here where I pressed Ctrl-C to stop the test :
> 

I just want to make sure it's not a userland error that now triggers much
faster than with prior kernels.

You drain bytes from fd 0xe to pipe buffers, but I don't see you check
writability on the destination socket prior to the splice(pipe -> socket)
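
For illustration, the kind of check meant here can be sketched in userspace
as below. This is a hedged sketch, not haproxy's code: fd_writable_now()
and write_if_writable() are hypothetical names, and write() stands in for
the splice(pipe -> socket) call only to keep the example plain POSIX.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Return 1 if fd can accept output right now, 0 if not, -1 on poll() error. */
static int fd_writable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int ret;

	do {
		ret = poll(&pfd, 1, 0);		/* timeout 0: never blocks */
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLOUT)) ? 1 : 0;
}

/*
 * Attempt output only when the destination reports writability.
 * When it does not, fail with EAGAIN without touching the data, so the
 * caller can wait for a POLLOUT/EPOLLOUT event and retry.
 * In a forwarding loop, the write() below is where the
 * splice(pipe -> socket) call would go.
 */
static ssize_t write_if_writable(int out_fd, const void *buf, size_t len)
{
	int ready;

	assert(buf != NULL);

	ready = fd_writable_now(out_fd);
	if (ready < 0)
		return -1;
	if (ready == 0) {
		errno = EAGAIN;
		return -1;
	}
	return write(out_fd, buf, len);
}
```

An event loop that already waits for EPOLLOUT on the destination, as the
epoll_ctl(ADD, fd=6, dir=OUT) call in the trace suggests, gets the same
information without the extra poll() call.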

^ permalink raw reply

* [PATCH] STA2X11 CAN: CAN driver for the STA2X11 board
From: Federico Vaga @ 2012-05-17 20:59 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, linux-can, netdev,
	linux-kernel
  Cc: Federico Vaga, Giancarlo Asnaghi, Alan Cox

Signed-off-by: Federico Vaga <federico.vaga@gmail.com>
Acked-by: Giancarlo Asnaghi <giancarlo.asnaghi@st.com>
Cc: Alan Cox <alan@linux.intel.com>
---
 drivers/net/can/Kconfig       |   11 +
 drivers/net/can/Makefile      |    1 +
 drivers/net/can/sta2x11_can.c | 1085 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1097 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/can/sta2x11_can.c

diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
index bb709fd..5b1baef 100644
--- a/drivers/net/can/Kconfig
+++ b/drivers/net/can/Kconfig
@@ -122,6 +122,17 @@ source "drivers/net/can/usb/Kconfig"
 
 source "drivers/net/can/softing/Kconfig"
 
+config CAN_STA2X11
+	depends on CAN_DEV && HAS_IOMEM && MFD_STA2X11
+	tristate "CAN STA2X11"
+	---help---
+	  Driver for the STA2x11 CAN controller
+	  Supports CAN protocol version 2.0 part A and B
+	  Bit rates up to 1 MBit/s
+	  32 Message Objects
+	  Programmable loop-back mode for self-test operation
+	  8-bit non-multiplex Motorola HC08 compatible module interface
+
 config CAN_DEBUG_DEVICES
 	bool "CAN devices debugging messages"
 	depends on CAN
diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
index 938be37..00474b6 100644
--- a/drivers/net/can/Makefile
+++ b/drivers/net/can/Makefile
@@ -22,5 +22,6 @@ obj-$(CONFIG_CAN_BFIN)		+= bfin_can.o
 obj-$(CONFIG_CAN_JANZ_ICAN3)	+= janz-ican3.o
 obj-$(CONFIG_CAN_FLEXCAN)	+= flexcan.o
 obj-$(CONFIG_PCH_CAN)		+= pch_can.o
+obj-$(CONFIG_CAN_STA2X11)	+= sta2x11_can.o
 
 ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
diff --git a/drivers/net/can/sta2x11_can.c b/drivers/net/can/sta2x11_can.c
new file mode 100644
index 0000000..9194b02
--- /dev/null
+++ b/drivers/net/can/sta2x11_can.c
@@ -0,0 +1,1085 @@
+/*
+ * Copyright (c) 2010-2011 Wind River Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/pci.h>
+#include <linux/netdevice.h>
+#include <linux/interrupt.h>
+#include <linux/debugfs.h>
+#include <linux/mfd/sta2x11-mfd.h>
+
+#include <linux/can.h>
+#include <linux/can/dev.h>
+#include <linux/can/error.h>
+
+#define PCI_DEVICE_ID_STMICRO_CAN	0xCC11
+
+#define CAN_CR			0x0	/* Control Register */
+#define CAN_CR_INI		0x01	/* Initialization */
+#define CAN_CR_IE		0x02	/* Interrupt Enable */
+#define CAN_CR_SIE		0x04	/* Status Interrupt Enable */
+#define CAN_CR_EIE		0x08	/* Error Interrupt Enable */
+#define CAN_CR_DAR		0x20	/* Disable Automatic Re-transmission */
+#define CAN_CR_CCE		0x40	/* Change Configuration Enable */
+#define CAN_CR_TME		0x80	/* Test Mode Enable */
+
+#define CAN_SR			0x04	/* Status Register */
+#define CAN_SR_LEC		0x07	/* Last Error Code */
+#define CAN_SR_LEC_STUFF	0x01	/* Stuff error */
+#define CAN_SR_LEC_FORM		0x02	/* Form error */
+#define CAN_SR_LEC_ACK		0x03	/* Acknowledgement error */
+#define CAN_SR_LEC_BIT1		0x04	/* Bit1 error */
+#define CAN_SR_LEC_BIT0		0x05	/* Bit0 error */
+#define CAN_SR_LEC_CRC		0x06	/* CRC error */
+#define CAN_SR_TXOK		0x08	/* Transmit Message Successfully */
+#define CAN_SR_RXOK		0x10	/* Receive Message Successfully */
+#define CAN_SR_EPAS		0x20	/* Error Passive */
+#define CAN_SR_WARN		0x40	/* Warning Status */
+#define CAN_SR_BOFF		0x80	/* Bus Off Status */
+
+#define CAN_ERR			0x08	/* Error Counter Register */
+#define CAN_ERR_TEC		0xFF	/* Transmit Error Counter */
+#define CAN_ERR_REC		0x7F00	/* Receive Error Counter */
+#define CAN_ERR_RP		0x8000	/* Receive Error Passive */
+
+#define CAN_BTR			0x0C	/* Bit Timing Register */
+
+#define CAN_BRPR		0x18	/* BRP Extension Register */
+
+#define CAN_IDR			0x10	/* Interrupt Identifier Register */
+#define CAN_IDR_STATUS		0x8000	/* Status Interrupt Identifier */
+
+#define CAN_TXR1R		0x100	/* Transmission Request Register */
+#define CAN_TXR2R		0x104	/* Transmission Request Register */
+
+#define CAN_ND1R		0x120	/* New Data Register */
+#define CAN_ND2R		0x124	/* New Data Register */
+
+#define CAN_IP1R		0x140	/* Interrupt Pending Register */
+#define CAN_IP2R		0x144	/* Interrupt Pending Register */
+
+#define CAN_MV1R		0x160	/* Message Valid Register */
+#define CAN_MV2R		0x164	/* Message Valid Register */
+
+#define CAN_IF1_CRR		0x20	/* Command Request Register */
+#define CAN_IF2_CRR		0x80	/* Command Request Register */
+#define CAN_IF_CRR_BUSY		0x8000  /* Busy Flag */
+#define CAN_IF_CRR_MSG		0x3F	/* Message Number */
+
+#define CAN_IF1_CMR		0x24	/* Command Mask Register */
+#define CAN_IF2_CMR		0x84	/* Command Mask Register */
+#define CAN_IF_CMR_WR		0x80	/* Write/Read to/from Message Object */
+#define CAN_IF_CMR_MSK		0x40	/* Transfer Mask Bits */
+#define CAN_IF_CMR_AR		0x20	/* Transfer Arbitration Bits */
+#define CAN_IF_CMR_CTL		0x10	/* Transfer Control Bits */
+#define CAN_IF_CMR_CPI		0x08	/* Clear Interrupt Pending Bit */
+#define CAN_IF_CMR_TXR		0x04	/* Clear TxRqst/NewDat Bit */
+#define CAN_IF_CMR_CND		0x04	/* Clear TxRqst/NewDat Bit */
+#define CAN_IF_CMR_D30		0x02	/* Transfer Data Bytes 3:0 */
+#define CAN_IF_CMR_C74		0x01	/* Transfer Data Bytes 7:4 */
+
+#define CAN_IF1_M1R		0x28	/* Mask Register */
+#define CAN_IF2_M1R		0x88	/* Mask Register */
+
+#define CAN_IF1_M2R		0x2C	/* Mask Register */
+#define CAN_IF2_M2R		0x8C	/* Mask Register */
+#define CAN_IF_M2R_MXTD		0x8000
+#define CAN_IF_M2R_MDIR		0x4000
+
+#define CAN_IF1_A1R		0x30	/* Message Arbitration Register */
+#define CAN_IF2_A1R		0x90	/* Message Arbitration Register */
+
+#define CAN_IF1_A2R		0x34	/* Message Arbitration Register */
+#define CAN_IF2_A2R		0x94	/* Message Arbitration Register */
+#define CAN_IF_A2R_MSGVAL	0x8000
+#define CAN_IF_A2R_XTD		0x4000
+#define CAN_IF_A2R_DIR		0x2000
+
+#define CAN_IF1_MCR		0x38	/* Message Control Register */
+#define CAN_IF2_MCR		0x98	/* Message Control Register */
+#define CAN_IF_MCR_NEWD		0x8000
+#define CAN_IF_MCR_MSGL		0x4000
+#define CAN_IF_MCR_INTP		0x2000
+#define CAN_IF_MCR_UMSK		0x1000
+#define CAN_IF_MCR_TXIE		0x800
+#define CAN_IF_MCR_RXIE		0x400
+#define CAN_IF_MCR_RMT		0x200
+#define CAN_IF_MCR_TXR		0x100
+#define CAN_IF_MCR_EOB		0x80
+
+#define CAN_IF1_DATA1		0x3C	/* Buffer Register */
+#define CAN_IF1_DATA2		0x40	/* Buffer Register */
+#define CAN_IF1_DATB1		0x44	/* Buffer Register */
+#define CAN_IF1_DATB2		0x48	/* Buffer Register */
+#define CAN_IF1_DATAV {CAN_IF1_DATA1, CAN_IF1_DATA2, \
+			CAN_IF1_DATB1, CAN_IF1_DATB2}
+
+#define CAN_IF2_DATA1		0x9C	/* Buffer Register */
+#define CAN_IF2_DATA2		0xA0	/* Buffer Register */
+#define CAN_IF2_DATB1		0xA4	/* Buffer Register */
+#define CAN_IF2_DATB2		0xA8	/* Buffer Register */
+#define CAN_IF2_DATAV {CAN_IF2_DATA1, CAN_IF2_DATA2, \
+			CAN_IF2_DATB1, CAN_IF2_DATB2}
+
+#define STA2X11_ECHO_SKB_MAX	1
+
+#define MSGOBJ_FIRST		0x01
+#define MSGOBJ_LAST		0x20
+
+/* max. number of interrupts handled in ISR */
+#define STA2X11_MAX_IRQ		20
+
+/*
+ * STA2X11 private data structure
+ */
+struct sta2x11_priv {
+	struct can_priv can;	/* must be the first member */
+	int open_time;
+	struct net_device *dev;
+	void __iomem *reg_base;	/* ioremap'ed address to registers */
+	struct dentry *dentry;
+	struct timer_list txtimer;
+};
+
+#define STA2X11_APB_FREQ 104000000
+
+/*
+ * 32 messages are available, but only 2 messages are used.
+ * TX and RX message objects
+ */
+#define STA2X11_OBJ_TX 1
+#define STA2X11_OBJ_RX 2
+
+static struct can_bittiming_const sta2x11_can_bittiming_const = {
+	.name = KBUILD_MODNAME,
+	.tseg1_min = 2,
+	.tseg1_max = 16,
+	.tseg2_min = 1,
+	.tseg2_max = 8,
+	.sjw_max = 4,
+	.brp_min = 1,
+	.brp_max = 1024,
+	.brp_inc = 1,
+};
+
+static void sta2x11_can_write_reg(struct sta2x11_priv *priv, uint32_t val, int reg)
+{
+	writel(val, priv->reg_base + reg);
+}
+
+static uint32_t sta2x11_can_read_reg(struct sta2x11_priv *priv, int reg)
+{
+	return readl(priv->reg_base + reg);
+}
+
+static void sta2x11_can_clear_interrupts(struct sta2x11_priv *priv)
+{
+	uint32_t mo;
+
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_CPI, CAN_IF1_CMR);
+	for (mo = MSGOBJ_FIRST; mo <= MSGOBJ_LAST; mo++)
+		sta2x11_can_write_reg(priv, mo, CAN_IF1_CRR);
+}
+
+static void sta2x11_can_enable_objs(const struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	/* RX message object */
+	/* command mask */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR |
+				    CAN_IF_CMR_MSK | CAN_IF_CMR_CTL,
+				    CAN_IF1_CMR);
+	/* mask */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_M1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_M2R);
+	/* arb */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A1R);
+	sta2x11_can_write_reg(priv, CAN_IF_A2R_MSGVAL, CAN_IF1_A2R);
+	/* control */
+	sta2x11_can_write_reg(priv, CAN_IF_MCR_RXIE | CAN_IF_MCR_UMSK |
+				    CAN_IF_MCR_EOB, CAN_IF1_MCR);
+
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_RX, CAN_IF1_CRR);
+
+	/* TX message object */
+	/* command mask */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR |
+				    CAN_IF_CMR_CTL, CAN_IF1_CMR);
+	/* arb */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A2R);
+	/* control */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_MCR);
+	/* Write to RAM */
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_TX, CAN_IF1_CRR);
+}
+
+static void sta2x11_can_disable_objs(struct sta2x11_priv *priv)
+{
+	/* RX message object */
+	/* command mask */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR |
+				    CAN_IF_CMR_CTL, CAN_IF1_CMR);
+	/* arb */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A2R);
+	/* control */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_MCR);
+	/* command */
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_RX, CAN_IF1_CRR);
+
+	/* TX message object */
+	/* command mask */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR |
+				    CAN_IF_CMR_CTL, CAN_IF1_CMR);
+	/* arb */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A2R);
+	/* control */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_MCR);
+	/* command */
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_TX, CAN_IF1_CRR);
+}
+
+static void sta2x11_can_reset_mode(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	if (priv->can.ctrlmode & CAN_CTRLMODE_ONE_SHOT) {
+		/* cancel timer handling tx request poll */
+		del_timer_sync(&priv->txtimer);
+	}
+
+	/* enable configuration and put chip in bus-off, disable interrupts */
+	sta2x11_can_write_reg(priv, CAN_CR_CCE | CAN_CR_INI, CAN_CR);
+
+	priv->can.state = CAN_STATE_STOPPED;
+
+	sta2x11_can_clear_interrupts(priv);
+
+	/* clear status interrupt */
+	sta2x11_can_read_reg(priv, CAN_SR);
+	/* clear status register */
+	sta2x11_can_write_reg(priv, 0x0, CAN_SR);
+
+	/* disable all used message objects */
+	sta2x11_can_disable_objs(priv);
+}
+
+static void sta2x11_can_normal_mode(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	uint32_t ctrl;
+
+	sta2x11_can_clear_interrupts(priv);
+
+	/* clear status interrupt */
+	sta2x11_can_read_reg(priv, CAN_SR);
+	/* clear status register */
+	sta2x11_can_write_reg(priv, CAN_SR_LEC, CAN_SR);
+
+	/* enable all used message objects */
+	sta2x11_can_enable_objs(dev);
+
+	/* clear bus-off */
+	ctrl = CAN_CR_IE | CAN_CR_EIE;
+	if (priv->can.ctrlmode & CAN_CTRLMODE_BERR_REPORTING)
+		ctrl |= CAN_CR_SIE;
+
+	if (priv->can.ctrlmode & CAN_CTRLMODE_ONE_SHOT)
+		ctrl |= CAN_CR_DAR;
+
+	sta2x11_can_write_reg(priv, ctrl, CAN_CR);
+
+	priv->can.state = CAN_STATE_ERROR_ACTIVE;
+}
+
+static void sta2x11_can_chipset_init(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct pci_dev *pdev = to_pci_dev(dev->dev.parent);
+	int data_reg[] = CAN_IF1_DATAV;
+	uint32_t mo;
+	unsigned int i;
+
+
+	/* config clock and release device from reset */
+	sta2x11_apbreg_mask(pdev, APBREG_PCG, APBREG_CAN, 0);
+	sta2x11_apbreg_mask(pdev, APBREG_PUR, APBREG_CAN, 0);
+	msleep_interruptible(100);
+	sta2x11_apbreg_mask(pdev, APBREG_PCG, APBREG_CAN, APBREG_CAN);
+	sta2x11_apbreg_mask(pdev, APBREG_PUR, APBREG_CAN, APBREG_CAN);
+	msleep_interruptible(100);
+
+	/* enable configuration and put chip in bus-off, disable interrupts */
+	sta2x11_can_write_reg(priv, CAN_CR_CCE | CAN_CR_INI, CAN_CR);
+
+	/* clear status interrupt */
+	sta2x11_can_read_reg(priv, CAN_SR);
+	/* clear status register */
+	sta2x11_can_write_reg(priv, CAN_SR_LEC, CAN_SR);
+
+	sta2x11_can_clear_interrupts(priv);
+
+	/* Invalidate message objects */
+	/* command mask */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_MSK |
+				    CAN_IF_CMR_AR | CAN_IF_CMR_CTL |
+				    CAN_IF_CMR_D30 | CAN_IF_CMR_C74,
+				    CAN_IF1_CMR);
+	/* mask */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_M1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_M2R);
+	/* arb */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A2R);
+	/* control */
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_MCR);
+	/* data */
+	for (i = 0; i < 4; i++)
+		sta2x11_can_write_reg(priv, 0x0, data_reg[i]);
+
+	/* send command to all 32 messages */
+	for (mo = MSGOBJ_FIRST; mo <= MSGOBJ_LAST; mo++)
+		sta2x11_can_write_reg(priv, mo, CAN_IF1_CRR);
+}
+
+static void sta2x11_can_start(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	if (priv->can.state != CAN_STATE_STOPPED)
+		sta2x11_can_reset_mode(dev);
+
+	sta2x11_can_normal_mode(dev);
+}
+
+static void sta2x11_can_write_data(struct sta2x11_priv *priv,
+			       struct can_frame *cf, u8 dlc)
+{
+	int data_reg[] = CAN_IF1_DATAV;
+	uint32_t val = 0;
+	int i;
+
+	for (i = 0; i < dlc; i++) {
+		if (i & 0x1) {
+			val |= cf->data[i] << 8;
+			sta2x11_can_write_reg(priv, val, data_reg[i / 2]);
+		} else {
+			val = cf->data[i];
+		}
+	}
+	/* if dlc is an odd number the last byte must still be written */
+	if (i & 0x1)
+		sta2x11_can_write_reg(priv, val, data_reg[i / 2]);
+
+}
+
+static void sta2x11_can_ar_config(struct sta2x11_priv *priv, uint32_t id,
+				  uint32_t dir)
+{
+	/* Arbitration configuration */
+	if (id & CAN_EFF_FLAG) { /* extended identifier */
+		id &= CAN_EFF_MASK;
+		sta2x11_can_write_reg(priv, id & 0xFFFF, CAN_IF1_A1R);
+		sta2x11_can_write_reg(priv, CAN_IF_A2R_MSGVAL | CAN_IF_A2R_XTD |
+					    dir | (id >> 16), CAN_IF1_A2R);
+	} else { /* standard identifier */
+		id &= CAN_SFF_MASK;
+		sta2x11_can_write_reg(priv, 0X0, CAN_IF1_A1R);
+		sta2x11_can_write_reg(priv, CAN_IF_A2R_MSGVAL | dir | (id << 2),
+					    CAN_IF1_A2R);
+	}
+}
+
+static int sta2x11_can_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct can_frame *cf = (struct can_frame *)skb->data;
+	uint32_t dlc, id, dir = 0, cmr, ctrl;
+
+	if (can_dropped_invalid_skb(dev, skb))
+		return NETDEV_TX_OK;
+
+	if ((sta2x11_can_read_reg(priv, CAN_TXR1R) & STA2X11_OBJ_TX)) {
+		dev_err(dev->dev.parent, "TX register is still occupied!\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	/* Do not accept new messages during transmission */
+	netif_stop_queue(dev);
+
+	dlc = cf->can_dlc & 0x0f;
+	id = cf->can_id;
+
+	/* Command Mask Register configuration */
+	cmr = CAN_IF_CMR_WR | CAN_IF_CMR_AR | CAN_IF_CMR_CTL;
+	if (!(id & CAN_RTR_FLAG)) {
+		/* transmission */
+		dir = CAN_IF_A2R_DIR;
+		cmr |= CAN_IF_CMR_D30 | CAN_IF_CMR_C74;
+	}
+	sta2x11_can_write_reg(priv, cmr, CAN_IF1_CMR);
+
+	sta2x11_can_ar_config(priv, id, dir);
+
+	/* control */
+	ctrl = CAN_IF_MCR_TXR | CAN_IF_MCR_EOB;
+
+	if (dir) {
+		/* control */
+		ctrl |= dlc;
+		/* Write data to IF1 data registers */
+		sta2x11_can_write_data(priv, cf, dlc);
+	}
+
+	if (priv->can.ctrlmode & CAN_CTRLMODE_ONE_SHOT)
+		/* use polling in one-shot mode */
+		ctrl |= CAN_IF_MCR_NEWD;
+	else
+		ctrl |= CAN_IF_MCR_TXIE;
+
+	sta2x11_can_write_reg(priv, ctrl, CAN_IF1_MCR);
+
+	/* writing the object number to CRR starts the transfer to RAM */
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_TX, CAN_IF1_CRR);
+
+	stats->tx_bytes += dlc;
+	dev->trans_start = jiffies;
+
+	can_put_echo_skb(skb, dev, 0);
+
+	if (priv->can.ctrlmode & CAN_CTRLMODE_ONE_SHOT) {
+		/*
+		 * when automatic re-transmission mode is disabled the txRqst
+		 * bit of the respective message buffer is not set,
+		 * we don't know if the transmission started or not ...
+		 */
+		mod_timer(&priv->txtimer, jiffies + HZ / 100);
+	}
+	return NETDEV_TX_OK;
+}
+
+static int sta2x11_can_err(struct net_device *dev, uint32_t status)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct can_frame *pcf;
+	struct sk_buff *skb;
+	uint32_t err, rec, tec, lec;
+	uint32_t can_data[4] = {0};
+	canid_t can_id = 0;
+	int i;
+
+	dev_dbg(dev->dev.parent, "status interrupt (0x%04x)\n", status);
+
+	if (status & CAN_SR_BOFF) {
+		if (priv->can.state != CAN_STATE_BUS_OFF) {
+			dev_dbg(dev->dev.parent, "entering busoff state\n");
+			/* disable interrupts */
+			sta2x11_can_write_reg(priv, CAN_CR_INI, CAN_CR);
+			can_id |= CAN_ERR_BUSOFF;
+			priv->can.state = CAN_STATE_BUS_OFF;
+			can_bus_off(dev);
+		}
+	} else if (status & CAN_SR_EPAS) {
+		if (priv->can.state != CAN_STATE_ERROR_PASSIVE) {
+			dev_dbg(dev->dev.parent,
+				"entering error passive state\n");
+			can_id |= CAN_ERR_CRTL;
+
+			err = sta2x11_can_read_reg(priv, CAN_ERR);
+			tec = (err & CAN_ERR_TEC);
+			rec = (err & CAN_ERR_REC) >> 8;
+
+			if (tec > rec)
+				can_data[1] |= CAN_ERR_CRTL_TX_PASSIVE;
+			else
+				can_data[1] |= CAN_ERR_CRTL_RX_PASSIVE;
+
+			priv->can.state = CAN_STATE_ERROR_PASSIVE;
+			priv->can.can_stats.error_passive++;
+		}
+	} else if (status & CAN_SR_WARN) {
+		if (priv->can.state != CAN_STATE_ERROR_WARNING) {
+			dev_dbg(dev->dev.parent,
+				"entering error warning state\n");
+			can_id |= CAN_ERR_CRTL;
+
+			err = sta2x11_can_read_reg(priv, CAN_ERR);
+			tec = (err & CAN_ERR_TEC);
+			rec = (err & CAN_ERR_REC) >> 8;
+
+			if (tec > rec)
+				can_data[1] |= CAN_ERR_CRTL_TX_WARNING;
+			else
+				can_data[1] |= CAN_ERR_CRTL_RX_WARNING;
+
+			priv->can.state = CAN_STATE_ERROR_WARNING;
+			priv->can.can_stats.error_warning++;
+		}
+	} else if (priv->can.state != CAN_STATE_ERROR_ACTIVE) {
+		dev_dbg(dev->dev.parent, "entering error active state\n");
+		priv->can.state = CAN_STATE_ERROR_ACTIVE;
+	}
+
+	lec = status & CAN_SR_LEC;
+
+	if (lec && (lec != CAN_SR_LEC)) {
+		if (lec == CAN_SR_LEC_ACK) {
+			dev_dbg(dev->dev.parent, "ack error\n");
+			can_id |= CAN_ERR_ACK;
+			stats->tx_errors++;
+		} else {
+			priv->can.can_stats.bus_error++;
+			stats->rx_errors++;
+
+			can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
+			switch (lec) {
+			case CAN_SR_LEC_STUFF:
+				dev_dbg(dev->dev.parent, "stuff error\n");
+				can_data[2] |= CAN_ERR_PROT_STUFF;
+				break;
+			case CAN_SR_LEC_FORM:
+				dev_dbg(dev->dev.parent, "form error\n");
+				can_data[2] |= CAN_ERR_PROT_FORM;
+				break;
+			case CAN_SR_LEC_BIT1:
+				dev_dbg(dev->dev.parent, "bit1 error\n");
+				can_data[2] |= CAN_ERR_PROT_BIT1;
+				break;
+			case CAN_SR_LEC_BIT0:
+				dev_dbg(dev->dev.parent, "bit0 error\n");
+				can_data[2] |= CAN_ERR_PROT_BIT0;
+				break;
+			case CAN_SR_LEC_CRC:
+				dev_dbg(dev->dev.parent, "crc error\n");
+				can_data[3] |= CAN_ERR_PROT_LOC_CRC_SEQ;
+				break;
+			}
+		}
+	}
+
+	if (can_id) {
+		skb = alloc_can_err_skb(dev, &pcf);
+		if (unlikely(!skb))
+			return -ENOMEM;
+		pcf->can_id |= can_id;
+		for (i = 0; i < 4; i++)
+			pcf->data[i] = can_data[i];
+		netif_rx(skb);
+
+		stats->rx_packets++;
+		stats->rx_bytes += pcf->can_dlc;
+	}
+	return 0;
+}
+
+static int sta2x11_can_status_interrupt(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	uint32_t status;
+
+	/* get status */
+	status = sta2x11_can_read_reg(priv, CAN_SR);
+	/* reset the status register including RXOK and TXOK */
+	sta2x11_can_write_reg(priv, CAN_SR_LEC, CAN_SR);
+
+	return sta2x11_can_err(dev, status);
+}
+
+/*
+ * Reading data from the Interface Register 2
+ */
+static void sta2x11_can_read_data(struct sta2x11_priv *priv,
+			      struct can_frame *cf, u8 dlc)
+{
+	int data_reg[] = CAN_IF2_DATAV;
+	uint32_t val = 0;
+	int i;
+
+	for (i = 0; i < dlc; i++) {
+		if (i & 0x1) {
+			cf->data[i] = val >> 8;
+		} else {
+			val = sta2x11_can_read_reg(priv, data_reg[i / 2]);
+			cf->data[i] = val & 0xFF;
+		}
+	}
+}
+
+static void sta2x11_can_rx(struct net_device *dev, unsigned int mo, uint32_t ctrl)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct can_frame *cf;
+	struct sk_buff *skb;
+	uint32_t arb1, arb2;
+
+	skb = alloc_can_skb(dev, &cf);
+	if (skb == NULL)
+		return;
+
+	/* Read Arbitration 2 Register */
+	arb2 = sta2x11_can_read_reg(priv, CAN_IF2_A2R);
+	if (arb2 & CAN_IF_A2R_XTD) {
+		/* Message has an extended identifier */
+		arb1 = sta2x11_can_read_reg(priv, CAN_IF2_A1R);
+		cf->can_id = (((arb2 & 0x1FFF) << 16) | arb1 | CAN_EFF_FLAG);
+	} else {
+		/* Message has a standard (non-extended) identifier */
+		cf->can_id = ((arb2 & 0x1FFF) >> 2);
+	}
+
+	if (arb2 & CAN_IF_A2R_DIR) {
+		cf->can_id |= CAN_RTR_FLAG;
+		cf->can_dlc = 0;
+	} else {
+		cf->can_dlc = get_can_dlc(ctrl & 0xF);
+		sta2x11_can_read_data(priv, cf, cf->can_dlc);
+	}
+
+	netif_rx(skb);
+
+	stats->rx_packets++;
+	stats->rx_bytes += cf->can_dlc;
+}
+
+static void sta2x11_can_rx_interrupt(struct net_device *dev, unsigned int mo)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct can_frame *pcf;
+	struct sk_buff *skb;
+	uint32_t ctrl;
+
+	/* clear interrupt, read control, data, arbitration */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_CPI | CAN_IF_CMR_CND |
+				    CAN_IF_CMR_AR | CAN_IF_CMR_CTL |
+				    CAN_IF_CMR_D30 | CAN_IF_CMR_C74,
+				    CAN_IF2_CMR);
+	sta2x11_can_write_reg(priv, mo, CAN_IF2_CRR);
+
+	ctrl = sta2x11_can_read_reg(priv, CAN_IF2_MCR);
+
+	if (ctrl & CAN_IF_MCR_MSGL) {
+		dev_dbg(dev->dev.parent, "rx overrun error\n");
+		stats->rx_over_errors++;
+		stats->rx_errors++;
+		skb = alloc_can_err_skb(dev, &pcf);
+		if (likely(skb)) {
+			pcf->can_id |= CAN_ERR_CRTL;
+			pcf->data[1] = CAN_ERR_CRTL_RX_OVERFLOW;
+			netif_rx(skb);
+
+			stats->rx_packets++;
+			stats->rx_bytes += pcf->can_dlc;
+		}
+	}
+
+	sta2x11_can_rx(dev, mo, ctrl);
+
+	/* reset message object */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR |
+				    CAN_IF_CMR_CTL, CAN_IF2_CMR);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF2_M1R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF2_M2R);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF2_A1R);
+	sta2x11_can_write_reg(priv, CAN_IF_A2R_MSGVAL, CAN_IF2_A2R);
+	sta2x11_can_write_reg(priv, CAN_IF_MCR_RXIE | CAN_IF_MCR_UMSK |
+				    CAN_IF_MCR_EOB, CAN_IF2_MCR);
+	sta2x11_can_write_reg(priv, mo, CAN_IF2_CRR);
+}
+
+static void sta2x11_can_tx_interrupt(struct net_device *dev, unsigned int mo)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+
+	/* clear interrupt */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_CPI | CAN_IF_CMR_CTL,
+				    CAN_IF2_CMR);
+	sta2x11_can_write_reg(priv, mo, CAN_IF2_CRR);
+
+	/* invalidate */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR,
+				    CAN_IF2_CMR);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF2_A2R);
+	sta2x11_can_write_reg(priv, mo, CAN_IF2_CRR);
+
+	stats->tx_packets++;
+	can_get_echo_skb(dev, 0);
+	netif_wake_queue(dev);
+}
+
+static irqreturn_t sta2x11_can_interrupt(int irq, void *dev_id)
+{
+	struct net_device *dev = (struct net_device *)dev_id;
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	uint32_t intid;
+	int n = 0;
+
+	/* shared interrupts and IRQ off? */
+	if (priv->can.state == CAN_STATE_STOPPED)
+		return IRQ_NONE;
+
+	while (n < STA2X11_MAX_IRQ) {
+
+		/* read the highest pending interrupt request */
+		intid = sta2x11_can_read_reg(priv, CAN_IDR);
+		if (!intid)
+			break;
+
+		switch (intid) {
+		case CAN_IDR_STATUS:
+			sta2x11_can_status_interrupt(dev);
+			break;
+		case STA2X11_OBJ_RX:
+			sta2x11_can_rx_interrupt(dev, intid);
+			break;
+		case STA2X11_OBJ_TX:
+			sta2x11_can_tx_interrupt(dev, intid);
+			break;
+		default:
+			dev_err(dev->dev.parent, "Unexpected interrupt %u\n",
+				intid);
+			sta2x11_can_clear_interrupts(priv);
+			break;
+		}
+
+		n++;
+	}
+
+	return n ? IRQ_HANDLED : IRQ_NONE;
+}
+
+static int sta2x11_can_open(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	int err;
+
+	sta2x11_can_reset_mode(dev);
+
+	err = open_candev(dev);
+	if (err)
+		return err;
+
+	err = request_irq(dev->irq, &sta2x11_can_interrupt, 0 /* FIXME */,
+			  dev->name, (void *)dev);
+	if (err) {
+		close_candev(dev);
+		return -EAGAIN;
+	}
+
+	sta2x11_can_start(dev);
+	priv->open_time = jiffies;
+	netif_start_queue(dev);
+
+	return 0;
+}
+
+static int sta2x11_can_close(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+	sta2x11_can_reset_mode(dev);
+
+	free_irq(dev->irq, (void *)dev);
+	close_candev(dev);
+
+	priv->open_time = 0;
+
+	return 0;
+}
+
+static int sta2x11_can_set_bittiming(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct can_bittiming *bt = &priv->can.bittiming;
+	uint32_t reg;
+
+	reg = ((bt->prop_seg + bt->phase_seg1 - 1) & 0xf) |
+	    (((bt->phase_seg2 - 1) & 0x7) << 4);
+	reg <<= 8;
+	reg |= ((bt->brp - 1) & 0x3f) | (((bt->sjw - 1) & 0x3) << 6);
+	sta2x11_can_write_reg(priv, reg, CAN_BTR);
+
+	reg = ((bt->brp - 1) >> 6) & 0xf;
+	sta2x11_can_write_reg(priv, reg, CAN_BRPR);
+
+	return 0;
+}
+
+static int sta2x11_can_set_mode(struct net_device *dev, enum can_mode mode)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	if (!priv->open_time)
+		return -EINVAL;
+
+	switch (mode) {
+	case CAN_MODE_START:
+		sta2x11_can_start(dev);
+		if (netif_queue_stopped(dev))
+			netif_wake_queue(dev);
+		break;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static struct net_device *sta2x11_can_alloc(void)
+{
+	struct net_device *dev;
+	struct sta2x11_priv *priv;
+
+	dev = alloc_candev(sizeof(struct sta2x11_priv), STA2X11_ECHO_SKB_MAX);
+	if (!dev)
+		return NULL;
+
+	priv = netdev_priv(dev);
+
+	priv->dev = dev;
+
+	/* Configuring CAN */
+	priv->can.bittiming_const = &sta2x11_can_bittiming_const;
+	priv->can.do_set_bittiming = sta2x11_can_set_bittiming;
+	priv->can.do_set_mode = sta2x11_can_set_mode;
+	priv->can.clock.freq = STA2X11_APB_FREQ / 2;
+	priv->can.ctrlmode_supported = CAN_CTRLMODE_ONE_SHOT |
+				       CAN_CTRLMODE_BERR_REPORTING;
+
+	return dev;
+}
+
+static void sta2x11_can_free(struct net_device *dev)
+{
+	free_candev(dev);
+}
+
+static const struct net_device_ops sta2x11_can_netdev_ops = {
+	.ndo_open = sta2x11_can_open,
+	.ndo_stop = sta2x11_can_close,
+	.ndo_start_xmit = sta2x11_can_start_xmit,
+};
+
+static void sta2x11_can_tx_poll(unsigned long xdev)
+{
+	struct net_device *dev = (struct net_device *)xdev;
+	struct sta2x11_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+
+	if (sta2x11_can_read_reg(priv, CAN_ND1R) & STA2X11_OBJ_TX) {
+		dev_dbg(dev->dev.parent, "one-shot tx failed\n");
+		stats->tx_errors++;
+		stats->tx_dropped++;
+		can_free_echo_skb(dev, 0);
+	} else {
+		stats->tx_packets++;
+		can_get_echo_skb(dev, 0);
+	}
+
+	/* invalidate */
+	sta2x11_can_write_reg(priv, CAN_IF_CMR_WR | CAN_IF_CMR_AR,
+				    CAN_IF1_CMR);
+	sta2x11_can_write_reg(priv, 0x0, CAN_IF1_A2R);
+	sta2x11_can_write_reg(priv, STA2X11_OBJ_TX, CAN_IF1_CRR);
+
+	netif_wake_queue(dev);
+}
+
+static int sta2x11_can_register(struct net_device *dev)
+{
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+	/* local echo */
+	dev->flags |= IFF_ECHO;
+	dev->netdev_ops = &sta2x11_can_netdev_ops;
+
+	/* init timer handling tx request poll for one-shot mode */
+	init_timer(&priv->txtimer);
+	priv->txtimer.data = (unsigned long)dev;
+	priv->txtimer.function = sta2x11_can_tx_poll;
+
+	sta2x11_can_chipset_init(dev);
+	sta2x11_can_reset_mode(dev);
+
+	return register_candev(dev);
+}
+
+static void sta2x11_can_unregister(struct net_device *dev)
+{
+	sta2x11_can_reset_mode(dev);
+	unregister_candev(dev);
+}
+
+DEFINE_PCI_DEVICE_TABLE(sta2x11_can_pci_tbl) = {
+	{
+		PCI_DEVICE(PCI_VENDOR_ID_STMICRO, PCI_DEVICE_ID_STMICRO_CAN),
+		.driver_data = 0,
+	},
+	{},
+};
+
+/*
+ * Static definition of debugfs 32bit registers, on sta2x11 there is only
+ * one CAN bus
+ */
+#define REG(regname) {.name = #regname, .offset = regname}
+static struct debugfs_reg32 sta2x11_can_regs[] = {
+	REG(CAN_CR), REG(CAN_SR), REG(CAN_ERR), REG(CAN_BTR),
+	REG(CAN_BRPR), REG(CAN_IDR), REG(CAN_TXR1R), REG(CAN_TXR2R),
+	REG(CAN_ND1R), REG(CAN_ND2R), REG(CAN_IP1R), REG(CAN_IP2R),
+	REG(CAN_MV1R), REG(CAN_MV2R),
+};
+#undef REG
+static struct debugfs_regset32 sta2x11_can_regset = {
+	.regs = sta2x11_can_regs,
+	.nregs = ARRAY_SIZE(sta2x11_can_regs),
+};
+
+static int __devinit
+sta2x11_can_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	struct net_device *dev;
+	struct sta2x11_priv *priv;
+	int rc = 0;
+
+	rc = pci_enable_device(pdev);
+	if (rc) {
+		dev_err(&pdev->dev, "pci_enable_device FAILED\n");
+		goto out;
+	}
+
+	rc = pci_request_regions(pdev, KBUILD_MODNAME);
+	if (rc) {
+		dev_err(&pdev->dev, "pci_request_regions FAILED\n");
+		goto out_disable_device;
+	}
+
+	pci_set_master(pdev);
+	pci_enable_msi(pdev);
+
+	dev = sta2x11_can_alloc();
+	if (!dev) {
+		rc = -ENOMEM;
+		goto out_release_regions;
+	}
+
+	dev->irq = pdev->irq;
+	priv = netdev_priv(dev);
+
+	priv->reg_base = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
+
+	if (!priv->reg_base) {
+		dev_err(&pdev->dev,
+			"device has no PCI memory resources, "
+			"failing adapter\n");
+		rc = -ENOMEM;
+		goto out_kfree_sta2x11;
+	}
+
+	SET_NETDEV_DEV(dev, &pdev->dev);
+
+	rc = sta2x11_can_register(dev);
+	if (rc) {
+		dev_err(&pdev->dev, "registering %s failed (err=%d)\n",
+			KBUILD_MODNAME, rc);
+		goto out_iounmap;
+	}
+
+	pci_set_drvdata(pdev, dev);
+
+	/* Configure debugfs */
+	sta2x11_can_regset.base = priv->reg_base;
+	priv->dentry = debugfs_create_regset32("sta2x11_can", S_IFREG | S_IRUGO,
+			NULL, &sta2x11_can_regset);
+
+	return 0;
+
+out_iounmap:
+	pci_iounmap(pdev, priv->reg_base);
+out_kfree_sta2x11:
+	sta2x11_can_free(dev);
+out_release_regions:
+	pci_disable_msi(pdev);
+	pci_release_regions(pdev);
+out_disable_device:
+	/*
+	 * do not call pci_disable_device on sta2x11 because it
+	 * breaks all other bus masters on this EP
+	 */
+out:
+	return rc;
+}
+
+static void __devexit sta2x11_can_pci_remove(struct pci_dev *pdev)
+{
+	struct net_device *dev = pci_get_drvdata(pdev);
+	struct sta2x11_priv *priv = netdev_priv(dev);
+
+
+	if (priv->dentry)
+		debugfs_remove(priv->dentry);
+
+	pci_set_drvdata(pdev, NULL);
+
+	sta2x11_can_unregister(dev);
+	pci_iounmap(pdev, priv->reg_base);
+	sta2x11_can_free(dev);
+
+	pci_disable_msi(pdev);
+	pci_release_regions(pdev);
+	/*
+	 * do not call pci_disable_device on sta2x11 because it
+	 * breaks all other bus masters on this EP
+	 */
+}
+
+static struct pci_driver sta2x11_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = sta2x11_can_pci_tbl,
+	.probe = sta2x11_can_pci_probe,
+	.remove = __devexit_p(sta2x11_can_pci_remove),
+};
+
+static __init int sta2x11_can_init(void)
+{
+	return pci_register_driver(&sta2x11_pci_driver);
+}
+
+/* needs to be started after the sta2x11_apbreg driver */
+late_initcall(sta2x11_can_init);
+
+static __exit void sta2x11_can_exit(void)
+{
+	pci_unregister_driver(&sta2x11_pci_driver);
+}
+
+module_exit(sta2x11_can_exit);
+
+MODULE_AUTHOR("Wind River");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION(KBUILD_MODNAME " CAN netdevice driver");
+MODULE_DEVICE_TABLE(pci, sta2x11_can_pci_tbl);
-- 
1.7.7.6
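
For readers following the bit-timing setup in sta2x11_can_set_bittiming()
above, here is a small user-space sketch of the same packing. It is not part
of the patch: struct bt_params, pack_btr() and pack_brpr() are made-up names,
and the field positions in the comment are read off the shifts in the driver
code rather than taken from a datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct can_bittiming fields used here. */
struct bt_params {
	uint32_t brp;        /* bit-rate prescaler, 1-based */
	uint32_t prop_seg;   /* propagation segment, in time quanta */
	uint32_t phase_seg1; /* phase buffer segment 1 */
	uint32_t phase_seg2; /* phase buffer segment 2 */
	uint32_t sjw;        /* synchronisation jump width */
};

/*
 * Layout implied by the driver's shifts:
 *   TSEG2 [14:12] | TSEG1 [11:8] | SJW [7:6] | BRP [5:0]
 * Every field is stored as (value - 1).
 */
uint32_t pack_btr(const struct bt_params *bt)
{
	uint32_t tseg1, tseg2, sjw, brp;

	/* All fields are "minus one" encoded, so zero is not a valid input. */
	assert(bt->brp >= 1 && bt->sjw >= 1 && bt->phase_seg2 >= 1);
	assert(bt->prop_seg + bt->phase_seg1 >= 1);

	tseg1 = (bt->prop_seg + bt->phase_seg1 - 1) & 0xf;
	tseg2 = (bt->phase_seg2 - 1) & 0x7;
	sjw = (bt->sjw - 1) & 0x3;
	brp = (bt->brp - 1) & 0x3f;

	return (tseg2 << 12) | (tseg1 << 8) | (sjw << 6) | brp;
}

/* The upper four bits of (brp - 1) go to the prescaler extension register. */
uint32_t pack_brpr(const struct bt_params *bt)
{
	return ((bt->brp - 1) >> 6) & 0xf;
}
```

With brp = 100 the low six bits of 99 (0x23) land in BTR and the remaining
bit (1) lands in BRPR, which is why the driver writes two registers.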


^ permalink raw reply related

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 21:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337287279.3403.44.camel@edumazet-glaptop>

On Thu, May 17, 2012 at 10:41:19PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> > 
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> > 
> > I'm seeing this on haproxy running on a small ARM machine (a
> > dockstar), which exchanges data through a gig switch with my
> > development PC. The NIC (mv643xx) doesn't support TSO but has
> > GSO enabled. If I disable GSO, the problem remains. I can however
> > make the problem disappear by disabling SG or Tx checksumming.
> > BTW, using recv/send() instead of splice() also gets rid of the
> > problem.
> > 
> > I can also reduce the risk of seeing the problem by increasing
> > the default TCP buffer sizes in tcp_wmem. By default I'm running
> > at 16kB, but if I increase the output buffer size above the pipe
> > size, the problem *seems* to disappear though I can't be certain,
> > since larger buffers generally means the problem takes longer to
> > appear, probably due to the fact that the buffers don't need to
> > be filled. Still I'm certain that with 64k TCP buffers and 128k
> > pipes I'm still seeing it.
> > 
> > With strace, I'm seeing data fill up the pipe with the splice()
> > call responsible for pushing the data to the output socket returning
> > -1 EAGAIN. During this time, the client receives no data.
> > 
> > Something bugs me, I have tested with a dummy server of mine,
> > httpterm, which uses tee+splice() to push data outside, and it
> > has no problem filling the gig pipe, and correctly recovers
> > from the EAGAIN :
> > 
> >   send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 42552
> >   splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
> >   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> >   ...
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
> >   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 50744
> >   splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 30264
> >   splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> > 
> > etc...
> > 
> > It's only with haproxy which uses splice() to copy data between
> > two sockets that I'm getting the issue (data forwarded from fd 0xe
> > to fd 0x6) :
> > 
> >   16:03:17.797144 pipe([36, 37])          = 0
> >   16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
> >   16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
> >   16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
> >   16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
> >   16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
> >   16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
> >   16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
> >   16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
> >   16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
> >   16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
> >   16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
> >   16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
> >   16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
> >   16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
> >   16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
> >   16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
> >   16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
> >   16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
> >   16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
> >   16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
> >   16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
> >   16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
> >   16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
> >   16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
> > 
> 
> Willy you say output to fd 6 hangs, but splice() returns EAGAIN here ?
> (because socket buffer is full)

Exactly.

> > So output fd 6 hangs here and will not appear anymore until
> > here where I pressed Ctrl-C to stop the test :
> > 
> 
> I just want to make sure it's not a userland error that triggers now much
> faster than with prior kernels.

I understand and that could be possible indeed. Still, this precise code
has been used for a few years now in prod at 10+ Gbps, so while that does
not mean it's exempt from any race condition or bug, we have not observed
this behaviour earlier. In fact, what I've not tested much was the small
ARM based machine which is just a convenient development system to try to
optimize network performance. One difference from the usual deployments
is that the NIC doesn't support TSO, whereas until now I have enabled
splicing only where TSO was supported, because before your recent
optimizations it performed worse than recv/send.

> You drain bytes from fd 0xe to pipe buffers, but I don't see you check
> write ability on destination socket prior to the splice(pipe -> socket)

I don't, I only rely on EAGAIN to re-enable polling for write (otherwise
it becomes a real mess of epoll_ctl which noticeably hurts performance). I
only re-enable polling if FDs can't move anymore.

Before doing a splice(read), if any data are left pending in the pipe, I
first try a splice(write) to try to flush the pipe, then I perform the
splice(read) then try to flush the pipe again using a splice(write).
Then polling is enabled if we block on EAGAIN.
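
To make the sequence above concrete, here is a minimal sketch of one
forwarding pass. It is not the haproxy code: struct fwd, fwd_flush(),
fwd_step() and the status names are invented for illustration, and it assumes
a Linux system where both fds support splice().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-connection state; not haproxy's real structures. */
struct fwd {
	int src;        /* fd we read from */
	int dst;        /* fd we write to */
	int pipe_rd;    /* read end of the intermediate pipe */
	int pipe_wr;    /* write end of the intermediate pipe */
	size_t in_pipe; /* bytes currently sitting in the pipe */
};

enum fwd_status { FWD_OK, FWD_WANT_READ, FWD_WANT_WRITE, FWD_EOF, FWD_ERROR };

#define FWD_FLAGS (SPLICE_F_MOVE | SPLICE_F_NONBLOCK)

/* splice(write): drain the pipe into dst. 0 once empty, -1 with errno set. */
int fwd_flush(struct fwd *f)
{
	while (f->in_pipe > 0) {
		ssize_t n = splice(f->pipe_rd, NULL, f->dst, NULL,
				   f->in_pipe, FWD_FLAGS);
		if (n < 0)
			return -1;
		if (n == 0) {
			errno = EIO;	/* pipe unexpectedly empty */
			return -1;
		}
		f->in_pipe -= (size_t)n;
	}
	return 0;
}

/* One pass: flush leftovers, splice(read), then flush again. */
enum fwd_status fwd_step(struct fwd *f, size_t max)
{
	ssize_t n;

	assert(f != NULL && max > 0);

	/* Data left over from an earlier pass goes out first. */
	if (fwd_flush(f) < 0)
		return errno == EAGAIN ? FWD_WANT_WRITE : FWD_ERROR;

	/* splice(read): the pipe is empty here, so EAGAIN means "no input". */
	n = splice(f->src, NULL, f->pipe_wr, NULL, max, FWD_FLAGS);
	if (n == 0)
		return FWD_EOF;
	if (n < 0)
		return errno == EAGAIN ? FWD_WANT_READ : FWD_ERROR;
	f->in_pipe += (size_t)n;

	if (fwd_flush(f) < 0)
		return errno == EAGAIN ? FWD_WANT_WRITE : FWD_ERROR;
	return FWD_OK;
}
```

Only when fwd_step() reports FWD_WANT_WRITE would the caller register dst for
write events, which matches the single epoll_ctl() call visible in the trace.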

I could fix the issue here by reworking my first patch. I think I was
too conservative, because if we leave do_tcp_sendpages() on OOM
with copied == 0, in my opinion we never push. My first attempt tried
to call tcp_push() only once but it seems like this is a wrong idea
because it doesn't allow new attempts if for example tcp_write_xmit()
cannot send upon first attempt.

After calling tcp_push() unconditionally on OOM, I cannot reproduce
the issue at all, and the traffic reaches a steady 950 Mbps in each
direction.

I'm appending the patch, you'll know better than me if it's correct or
not.

Best regards,
Willy

---

>From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:43:20 +0200
Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions

Since recent changes on TCP splicing (starting with commits 2f533844
and 35f9c09f), I started seeing massive stalls when forwarding traffic
between two sockets using splice() when pipe buffers were larger than
socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
---
 net/ipv4/tcp.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63ddaee..231fe53 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -917,8 +917,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
-			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
-- 
1.7.2.1.45.g54fbc
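
For illustration only, here is a toy user-space model of the reasoning in the
commit message above. It is not kernel code: toy_sock, toy_push(),
toy_sendpages() and the always_push switch are invented, transmission is
modelled as instantaneous, and ACK-driven sending is ignored entirely. It only
shows why skipping the push when nothing was copied can leave queued data
stuck.

```c
#include <assert.h>
#include <stddef.h>

/* Toy socket: a send buffer limit and the bytes queued but not yet sent. */
struct toy_sock {
	size_t sndbuf;
	size_t queued;
};

/* Stand-in for tcp_push(): pretend everything queued leaves at once. */
void toy_push(struct toy_sock *sk)
{
	sk->queued = 0;
}

/*
 * Stand-in for a non-blocking do_tcp_sendpages().
 * more        : caller promises more data, so the final push is deferred.
 * always_push : 0 models the old "if (copied)" check, 1 models the patch.
 * Returns the bytes accepted, or -1 to model EAGAIN.
 */
long toy_sendpages(struct toy_sock *sk, size_t len, int more, int always_push)
{
	size_t copied = 0;

	assert(sk->queued <= sk->sndbuf);

	while (copied < len) {
		size_t room = sk->sndbuf - sk->queued;
		size_t n;

		if (room == 0) {
			/* wait_for_memory: we cannot sleep, so push and bail out */
			if (copied || always_push)
				toy_push(sk);
			break;
		}
		n = len - copied < room ? len - copied : room;
		sk->queued += n;
		copied += n;
	}

	if (copied && !more)
		toy_push(sk);

	return copied ? (long)copied : -1;
}
```

In this model, a first call that fills the buffer with more set leaves
everything queued. With always_push = 0 every later call returns -1 and
nothing ever drains; with always_push = 1 the first failing call flushes the
queue and the next one succeeds.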

^ permalink raw reply related

* Re: [RFC] API to modify /proc/sys/net/ipv4/ip_local_reserved_ports
From: Helge Deller @ 2012-05-17 21:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cong Wang, Octavian Purdila, netdev, David Miller, Andrew Morton,
	Frank Danapfel, Laszlo Ersek, shemminger
In-Reply-To: <m18vi3w7zd.fsf@fess.ebiederm.org>

On 04/11/2012 12:13 AM, Eric W. Biederman wrote:
> Helge Deller <deller@gmx.de> writes:
> 
>> On 04/09/2012 10:43 AM, Cong Wang wrote:
>>> On Wed, 2012-04-04 at 22:24 +0200, Helge Deller wrote:
>>>> I would like to follow up on my last patch series to be able to modify
>>>> the contents of the /proc/sys/net/ipv4/ip_local_reserved_ports port list
>>>> from userspace.
>>>>
>>>> My last patch (https://lkml.org/lkml/2012/3/10/187) was based on
>>>> modifications to the proc interface, which - based on the feedback here
>>>> on the list - seemed to not be the right way to go (although I personally
>>>> still like the idea very much :-)).
>>>>
>>>> Anyway, with this RFC I would like to get feedback about a new proposed
>>>> API and attached kernel patch.
>>>>
>>>> The idea is to introduce a new <optname> value for get/setsockopt()
>>>> named SO_RESERVED_PORTS to get/set the ip_local_reserved_ports
>>>> bitmap via standard get/setsockopt() syscalls.
>>>> As far as I understand this seems to be similar to how iptables works.
>>>>
>>>> An untested kernel patch for review and feedback is attached below.
>>>>
>>>> In userspace it then would be possible to write a new tool or to extend
>>>> for example the "ip" tool to accept commands like:
>>>> $>  ip reserved_ports add 100-2000
>>>> $>  ip reserved_ports remove 50-60
>>>> $>  ip reserved_ports list     (to show current reserved port list)
>>>>
>>>> This userspace tool could then read the port bitmap from kernel via
>>>> a) socket(PF_INET, SOCK_RAW, IPPROTO_RAW)
>>>> b) getsockopt(3, SOL_SOCKET, SO_RESERVED_PORTS,<bitmaplist>)
>>>> and write back the results after modification via
>>>> c) setsockopt(3, SOL_SOCKET, SO_RESERVED_PORTS,<bitmaplist>)
>>>>
>>>> Would that be an acceptable solution?
>>> Hmm, a bitmap is indeed a better fit for a syscall than for a /proc file.
>>>
>>> But it seems that using getsockopt()/setsockopt() makes it like it is a
>>> per-socket setting, actually it is a system-wide setting.
>> Yes, that's the reason why I used SOL_SOCKET which configures at least
>> a few system-wide settings too.
>>
>>> So I am
>>> wondering if exporting a binary /proc file for this is a better
>>> solution.
>> Yeah - that's another solution, but (65536 ports)/(8 bits per byte) = 8 KByte,
>> so we
>> may again hit the 4k limit of /proc (unless you do binary reads which should
>> be done with a binary /proc-entry anyway).
>>
>> Again, I'm open to develop any kind of solution which would get an OK
>> here.
> 
> Just looking at proc_do_large_bitmap, it does appear that there is a
> very local 4k limit on writes.
> 
> Can you please just modify proc_do_large_bitmap so that there is not a
> 4k limit on writes.  Ideally the code would just read another 4k from
> userspace when it is getting close to the end of its 4k buffer, or
> perhaps we just read everything directly from userspace and run slowly.

Hi Eric,

sorry for the very late reply.
Yes, you are right, this is only a local 4K limit. Increasing it allowed me
to write more ports at once.

With your tips I was now able to build a simple solution which fits my needs.
Based on standard tools like echo and dd (with the seek option) I can
block all ports which I need.
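
For anyone wanting to script this, here is a rough sketch of the bookkeeping
involved: a 65536-bit bitmap (8 KByte, as computed earlier in the thread) and
a formatter that renders it as the comma-separated range list that
ip_local_reserved_ports takes. The names port_bitmap, ports_set_range(),
ports_clear_range() and ports_format() are made up for this example; it is
not the tool I am using.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NPORTS 65536u

/* One bit per port: 65536 / 8 = 8192 bytes. */
struct port_bitmap {
	uint8_t bits[NPORTS / 8];
};

static_assert(sizeof(struct port_bitmap) == 8192, "bitmap must be 8 KByte");

void ports_set_range(struct port_bitmap *bm, uint32_t lo, uint32_t hi)
{
	for (uint32_t p = lo; p <= hi && p < NPORTS; p++)
		bm->bits[p / 8] |= (uint8_t)(1u << (p % 8));
}

void ports_clear_range(struct port_bitmap *bm, uint32_t lo, uint32_t hi)
{
	for (uint32_t p = lo; p <= hi && p < NPORTS; p++)
		bm->bits[p / 8] &= (uint8_t)~(1u << (p % 8));
}

int ports_test(const struct port_bitmap *bm, uint32_t p)
{
	return (bm->bits[p / 8] >> (p % 8)) & 1;
}

/* Render as "a-b,c,d-e". Returns the string length, or -1 if out is too small. */
int ports_format(const struct port_bitmap *bm, char *out, size_t outlen)
{
	size_t used = 0;
	uint32_t p = 0;

	if (outlen == 0)
		return -1;
	out[0] = '\0';

	while (p < NPORTS) {
		uint32_t start;
		int n;

		if (!ports_test(bm, p)) {
			p++;
			continue;
		}
		/* Extend the run while the next port is also reserved. */
		start = p;
		while (p + 1 < NPORTS && ports_test(bm, p + 1))
			p++;

		if (start == p)
			n = snprintf(out + used, outlen - used, "%s%u",
				     used ? "," : "", (unsigned)start);
		else
			n = snprintf(out + used, outlen - used, "%s%u-%u",
				     used ? "," : "", (unsigned)start,
				     (unsigned)p);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
		p++;
	}
	return (int)used;
}
```

The resulting string is what one would then write to the sysctl file; how the
write is chunked is a separate matter and is not shown here.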

Nevertheless, the current kernel interface is not very flexible.
So, my proposal for a new interface (with tools) still stands. I just need
advice on what would be acceptable. Without any advice I will just leave
everything as is (since I'm now fine with it).

Helge

^ permalink raw reply
