Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 6/6 v2] IB: userspace support for RDMA connection manager
From: Michael S. Tsirkin @ 2006-03-22  8:30 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, linux-kernel, openib-general
In-Reply-To: <adawten6yu1.fsf@cisco.com>

Quoting r. Roland Dreier <rdreier@cisco.com>:
> Subject: Re: [PATCH 6/6 v2] IB: userspace support for RDMA connection manager
> 
> I added this patch to the rdma_cm branch in my git tree.

BTW, is there some way to see your git tree e.g. with gitweb?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

^ permalink raw reply

* [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
From: Balbir Singh @ 2006-03-22  7:49 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
In-Reply-To: <1142304506.5219.34.camel@jzny2>

On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
> 
> 
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are 
> either PID and/or TGID etc. Anyways, theres a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
> 

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLV's for data exchange between kernel and
  user space

Thanks,
Balbir


Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

---

 include/linux/delayacct.h |   11 ++
 include/linux/taskstats.h |  112 ++++++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   44 ++++++++
 kernel/taskstats.c        |  251 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h	2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
 	return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+				    struct task_struct *tsk)
+{
+	if (!tsk->delays)
+		return -EINVAL;
+	return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h	2006-03-22 13:12:01.000000000 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+#define TASKSTATS_NOPID	-1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+	/* Version 1 */
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_CMD_GET,		/* user->kernel request */
+	TASKSTATS_CMD_NEW,		/* kernel->user event */
+	__TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+	TASKSTATS_TYPE_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_TYPE_TGID,		/* Thread group id */
+	TASKSTATS_TYPE_PID,		/* Process id */
+	TASKSTATS_TYPE_STATS,		/* taskstats structure */
+	__TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+	TASKSTATS_CMD_ATTR_UNSPEC = 0,
+	TASKSTATS_CMD_ATTR_PID,
+	TASKSTATS_CMD_ATTR_TGID,
+	__TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+enum {
+	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
+	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
+};
+
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig	2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c	2006-03-22 13:12:01.000000000 +0530
@@ -1,6 +1,7 @@
 /* delayacct.c - per-task delay accounting
  *
  * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
 
 static int __init delayacct_setup_enable(char *str)
 {
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
 		tsk->delays = NULL;
 	}
+	mutex_unlock(&delayacct_exit_mutex);
 }
 
 /*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
 	spin_unlock(&tsk->delays->lock);
 	return ret;
 }
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+	return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c	2006-03-22 13:12:07.000000000 +0530
@@ -0,0 +1,251 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.hdrsize        = 0,
+	.maxattr        = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
+	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+			 void **replyp)
+{
+	struct sk_buff *skb;
+	void *reply;
+
+	skb = nlmsg_new(NLMSG_GOODSIZE);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, NLM_F_REQUEST,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, info->nlhdr->nlmsg_flags,
+				    info->genlhdr->cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	void *reply;
+	int rc;
+
+	reply = genlmsg_data(genlhdr);
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event == TASKSTATS_MSG_MULTICAST)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+			   struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(stats, tsk);
+	put_task_struct(tsk);
+
+	return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+			    struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(stats, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats stats;
+	void *reply;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc < 0)
+		return rc;
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+		rc = fill_pid((pid_t)pid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+	}
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+		rc = fill_tgid((pid_t)tgid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+	}
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+	return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	void *reply;
+	struct taskstats stats;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+	rc = fill_pid(tsk->pid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+	if (rc || thread_group_empty(tsk))
+		return;
+
+	/* Send tgid data too */
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	rc = fill_tgid(tsk->tgid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+	genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+	.cmd            = TASKSTATS_CMD_GET,
+	.doit           = taskstats_send_stats,
+	.policy		= taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+        family_registered = 1;
+
+	if (genl_register_ops(&family, &taskstats_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

^ permalink raw reply

* [patch 2/2] net: Node aware multipath device round robin -- device locality check
From: Ravikiran G Thirumalai @ 2006-03-22  6:08 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, Andrew Morton, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar
In-Reply-To: <20060322060540.GA3603@localhost.localdomain>

This patch checks device locality on every ip packet xmit.
In multipath configuration tcp connection to route association is done at 
session startup time. The tcp session process is migrated to different nodes 
after this association.  This would mean a remote NIC is chosen for xmit,
although a local NIC could be available.   Following patch checks if a 
local NIC is available for the desitnation, and recalculates routes if so.
his leads to remote NIC  transfer  in some tcp work load such as AB.
    
Downside: adds a bitmap to struct rtable.  But only if 
CONFIG_IP_ROUTE_MULTIPATH_NODE is enabled.

Comments, suggestions welcome. 

Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16/include/net/route.h
===================================================================
--- linux-2.6.16.orig/include/net/route.h	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/net/route.h	2006-03-20 14:52:24.000000000 -0800
@@ -75,6 +75,13 @@ struct rtable
 	/* Miscellaneous cached information */
 	__u32			rt_spec_dst; /* RFC1122 specific destination */
 	struct inet_peer	*peer; /* long-living peer info */
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+	/*  bitmap bit is set if current node has a local multi-path device for
+	 *  this route.
+	 */
+	DECLARE_BITMAP          (mp_if_bitmap, MAX_NUMNODES);
+#endif
+
 };
 
 struct ip_rt_acct
@@ -201,4 +208,21 @@ static inline struct inet_peer *rt_get_p
 
 extern ctl_table ipv4_route_table[];
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+
+#include <linux/ip_mp_alg.h>
+
+static inline int dst_dev_node_check(struct rtable *rt)
+{
+	int cnode = numa_node_id();
+	if (unlikely(netdev_node(rt->u.dst.dev) != cnode)) {
+		if (test_bit(cnode, rt->mp_if_bitmap))
+			return 1;
+	}
+	return 0;
+}
+#else
+#define dst_dev_node_check(rt)			0
+#endif
+
 #endif	/* _ROUTE_H */
Index: linux-2.6.16/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.16.orig/net/ipv4/ip_output.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/ip_output.c	2006-03-20 14:52:24.000000000 -0800
@@ -309,7 +309,7 @@ int ip_queue_xmit(struct sk_buff *skb, i
 
 	/* Make sure we can route this packet. */
 	rt = (struct rtable *)__sk_dst_check(sk, 0);
-	if (rt == NULL) {
+	if ((rt == NULL ) || dst_dev_node_check(rt)) {
 		u32 daddr;
 
 		/* Use correct destination address if we have options. */
Index: linux-2.6.16/net/ipv4/route.c
===================================================================
--- linux-2.6.16.orig/net/ipv4/route.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/route.c	2006-03-20 14:52:24.000000000 -0800
@@ -2313,6 +2313,22 @@ static inline int ip_mkroute_output(stru
 	if (res->fi && res->fi->fib_nhs > 1) {
 		unsigned char hopcount = res->fi->fib_nhs;
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+		DECLARE_BITMAP (mp_if_bitmap, MAX_NUMNODES);
+		bitmap_zero(mp_if_bitmap, MAX_NUMNODES);
+		/* Calculating device bitmap for this multipath route */
+		if (res->fi->fib_mp_alg == IP_MP_ALG_DRR) {
+			for (hop = 0; hop < hopcount; hop++) {
+				struct net_device *dev2nexthop;
+
+				res->nh_sel = hop;
+				dev2nexthop = FIB_RES_DEV(*res);
+				dev_hold(dev2nexthop);
+				set_bit(netdev_node(dev2nexthop), mp_if_bitmap);
+				dev_put(dev2nexthop);
+			}
+		}
+#endif
 		for (hop = 0; hop < hopcount; hop++) {
 			struct net_device *dev2nexthop;
 
@@ -2343,6 +2359,10 @@ static inline int ip_mkroute_output(stru
 					     FIB_RES_NETMASK(*res),
 					     res->prefixlen,
 					     &FIB_RES_NH(*res));
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+			bitmap_copy(rth->mp_if_bitmap, mp_if_bitmap, MAX_NUMNODES);
+#endif
 		cleanup:
 			/* release work reference to output device */
 			dev_put(dev2nexthop);

^ permalink raw reply

* [patch 1/2] net: Node aware multipath device round robin
From: Ravikiran G Thirumalai @ 2006-03-22  6:05 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, Andrew Morton, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar

Following patch adds in node aware, device round robin ip multipathing.  
It is based on multipath_drr.c, the multipath device round robin algorithm, and
is derived from it.  This implementation maintians per node state table, and 
round robins between interfaces on the same node.  The implementation needs to 
be aware of the NIC proximity to a node.  Hence we have added a nodeid field to 
struct netdevice.  NIC device drivers can initialize this with the node id 
the NIC belongs to.  This patch uses IP_MP_ALG_DRR slot like the regular 
multipath_drr too.  So either SMP multipath_drr or node aware 
multipath_node_drr should be used for device round robin, based on system having
proximity information for the NICs.

Performance results:
1. Single NIC test -- 1 client targets 1 nic on the server with 300 concurrent 
requests.
2. 4 NIC test -- 1 client targets 4 nics, all on different nodes on the server with 300 concurrent requests.

We see about 135% improvement on AB requests per second with this patch and 
the device_locality_check patch on single NIC test, on the Rackable c5100 
machine (server).  We see about 64% improvement  when all 4 NICS are targeted.

Credits:  This work was originally done by Justin Forbes 

Comments?

Signed-off by: Pravin B. Shelar <pravin.shelar@calsoftinc.com>
Signed-off by: Shobhit Dayal <shobhit.dayal@calsoftinc.com>
Signed-off by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.16.orig/drivers/net/e1000/e1000_main.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/drivers/net/e1000/e1000_main.c	2006-03-20 14:52:23.000000000 -0800
@@ -692,6 +692,7 @@ e1000_probe(struct pci_dev *pdev,
 
 	SET_MODULE_OWNER(netdev);
 	SET_NETDEV_DEV(netdev, &pdev->dev);
+	SET_NETDEV_NODE(netdev, pcibus_to_node(pdev->bus));
 
 	pci_set_drvdata(pdev, netdev);
 	adapter = netdev_priv(netdev);
Index: linux-2.6.16/drivers/net/tg3.c
===================================================================
--- linux-2.6.16.orig/drivers/net/tg3.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/drivers/net/tg3.c	2006-03-20 14:52:23.000000000 -0800
@@ -10705,6 +10705,7 @@ static int __devinit tg3_init_one(struct
 
 	SET_MODULE_OWNER(dev);
 	SET_NETDEV_DEV(dev, &pdev->dev);
+	SET_NETDEV_NODE(dev, pcibus_to_node(pdev->bus));
 
 	dev->features |= NETIF_F_LLTX;
 #if TG3_VLAN_TAG_USED
Index: linux-2.6.16/include/linux/netdevice.h
===================================================================
--- linux-2.6.16.orig/include/linux/netdevice.h	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/linux/netdevice.h	2006-03-20 14:52:23.000000000 -0800
@@ -315,7 +315,9 @@ struct net_device
 	/* Interface index. Unique device identifier	*/
 	int			ifindex;
 	int			iflink;
-
+#ifdef CONFIG_NUMA
+	int			node;	/* NUMA node this IF is close to */
+#endif
 
 	struct net_device_stats* (*get_stats)(struct net_device *dev);
 	struct iw_statistics*	(*get_wireless_stats)(struct net_device *dev);
@@ -520,6 +522,14 @@ static inline void *netdev_priv(struct n
  */
 #define SET_NETDEV_DEV(net, pdev)	((net)->class_dev.dev = (pdev))
 
+#ifdef CONFIG_NUMA
+#define SET_NETDEV_NODE(dev, nodeid)	((dev)->node = (nodeid))
+#define netdev_node(dev)		((dev)->node)
+#else
+#define SET_NETDEV_NODE(dev, nodeid)	do {} while (0)
+#define netdev_node(dev)		(-1)
+#endif
+
 struct packet_type {
 	__be16			type;	/* This is really htons(ether_type). */
 	struct net_device	*dev;	/* NULL is wildcarded here	     */
Index: linux-2.6.16/net/core/dev.c
===================================================================
--- linux-2.6.16.orig/net/core/dev.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/core/dev.c	2006-03-20 14:52:23.000000000 -0800
@@ -3003,7 +3003,8 @@ struct net_device *alloc_netdev(int size
 
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
-
+	
+	SET_NETDEV_NODE(dev, -1);
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
Index: linux-2.6.16/net/ipv4/Kconfig
===================================================================
--- linux-2.6.16.orig/net/ipv4/Kconfig	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/Kconfig	2006-03-20 14:52:23.000000000 -0800
@@ -164,6 +164,15 @@ config IP_ROUTE_MULTIPATH_DRR
 	  available interfaces. This policy makes sense if the connections 
 	  should be primarily distributed on interfaces and not on routes. 
 
+config IP_ROUTE_MULTIPATH_NODE
+	tristate "MULTIPATH: interface RR algorithm with node affinity"
+	depends on IP_ROUTE_MULTIPATH_CACHED && NUMA && !IP_ROUTE_MULTIPATH_DRR
+	help
+	  This allows equal cost multipath device round robin alogorithm  to
+	  use node affinity when choosing the device for outbound traffic. This
+	  is similar to CONFIG_IP_ROUTE_MULTIPATH_DRR. Choose this if you
+	  have a NUMA system, and the NICs have node proximity.
+
 config IP_ROUTE_VERBOSE
 	bool "IP: verbose route monitoring"
 	depends on IP_ADVANCED_ROUTER
Index: linux-2.6.16/net/ipv4/Makefile
===================================================================
--- linux-2.6.16.orig/net/ipv4/Makefile	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/Makefile	2006-03-20 14:52:23.000000000 -0800
@@ -28,6 +28,7 @@ obj-$(CONFIG_IP_ROUTE_MULTIPATH_RR) += m
 obj-$(CONFIG_IP_ROUTE_MULTIPATH_RANDOM) += multipath_random.o
 obj-$(CONFIG_IP_ROUTE_MULTIPATH_WRANDOM) += multipath_wrandom.o
 obj-$(CONFIG_IP_ROUTE_MULTIPATH_DRR) += multipath_drr.o
+obj-$(CONFIG_IP_ROUTE_MULTIPATH_NODE) += multipath_node_drr.o
 obj-$(CONFIG_NETFILTER)	+= netfilter.o netfilter/
 obj-$(CONFIG_IP_VS) += ipvs/
 obj-$(CONFIG_INET_DIAG) += inet_diag.o 
Index: linux-2.6.16/net/ipv4/multipath_node_drr.c
===================================================================
--- linux-2.6.16.orig/net/ipv4/multipath_node_drr.c	2006-02-28 01:25:15.174738088 -0800
+++ linux-2.6.16/net/ipv4/multipath_node_drr.c	2006-03-20 14:52:23.000000000 -0800
@@ -0,0 +1,264 @@
+/*
+ *  Node aware device round robin policy for multipath.
+ *  Extension of multipath device round robin for NUMA node based multipathing.
+ *  Derived from net/ipv4/multipath_drr.c
+ */
+
+#include <linux/netdevice.h>
+#include <linux/module.h>
+#include <net/ip_mp_alg.h>
+
+struct multipath_device {
+	int		ifi; /* interface index of device */
+	atomic_t	usecount;
+	int 		allocated;
+	int 		node; /* node id of device */
+};
+
+#define MULTIPATH_MAX_DEVICECANDIDATES 16
+
+static struct multipath_device *local_state[MAX_NUMNODES] __read_mostly;
+static DEFINE_SPINLOCK(state_lock);
+
+static int inline __multipath_findslot(int ifindex, int nid)
+{
+	int i, idx, mx;
+	struct multipath_device *state = local_state[nid];
+
+	i = ifindex % MULTIPATH_MAX_DEVICECANDIDATES;
+	if (likely(state[i].allocated == 0))
+		return i;
+
+	mx = i + MULTIPATH_MAX_DEVICECANDIDATES;
+
+	for (; i < mx; i++) {
+		idx = i % MULTIPATH_MAX_DEVICECANDIDATES;
+		if (state[idx].allocated == 0)
+			return idx;
+	}
+	return -1;
+}
+
+static int inline __multipath_finddev(int ifindex, int nid)
+{
+	int i, mx, idx;
+	struct multipath_device *state = local_state[nid];
+
+	i = ifindex % MULTIPATH_MAX_DEVICECANDIDATES;
+	if (likely(state[i].ifi == ifindex))
+		return i;
+
+	mx = i + MULTIPATH_MAX_DEVICECANDIDATES;
+
+	for (; i < mx; i++) {
+		idx = i % MULTIPATH_MAX_DEVICECANDIDATES;
+
+		if (state[idx].ifi == ifindex)
+			return idx;
+	}
+	return -1;
+}
+
+static int drr_dev_event(struct notifier_block *this,
+			 unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	int devidx, nid;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+	case NETDEV_DOWN:
+		spin_lock_bh(&state_lock);
+		for_each_node(nid) {
+			devidx = __multipath_finddev(dev->ifindex, nid);
+			if (devidx != -1) {
+				local_state[nid][devidx].ifi = 0;
+				local_state[nid][devidx].allocated = 0;
+			}
+		}
+
+		spin_unlock_bh(&state_lock);
+		break;
+	};
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block drr_dev_notifier = {
+	.notifier_call	= drr_dev_event,
+};
+
+static void inline drr_safe_inc(atomic_t *usecount)
+{
+	int n;
+
+	atomic_inc(usecount);
+	n = atomic_read(usecount);
+	if (unlikely(n <= 0)) {
+		int i;
+		struct multipath_device *state = local_state[numa_node_id()];
+
+		for (i = 0; i < MULTIPATH_MAX_DEVICECANDIDATES; i++)
+			atomic_set(&state[i].usecount, 0);
+
+	}
+}
+
+static int update_state_table(struct rtable *nh, int node)
+{
+	int devidx = -1;
+	struct multipath_device *state;
+	int nh_ifidx = nh->u.dst.dev->ifindex;
+	/* add the interface to the array
+	 * SMP safe
+	 */
+	spin_lock_bh(&state_lock);
+
+	/* due to SMP: search again */
+	devidx = __multipath_finddev(nh_ifidx, node);
+	if (devidx == -1) {
+		/* add entry for device */
+		state = local_state[node];
+		/* find free slot in state table */
+		devidx = __multipath_findslot(nh_ifidx, node);
+		if (devidx == -1) {
+			/* unlikely but possible */
+			goto out;
+		} else {
+			state[devidx].allocated = 1;
+			state[devidx].ifi = nh_ifidx;
+			atomic_set(&state[devidx].usecount, 0);
+			state[devidx].node = netdev_node(nh->u.dst.dev);
+		}
+	}
+out:
+	spin_unlock_bh(&state_lock);
+	return devidx;
+}
+
+static void drr_select_route(const struct flowi *flp,
+			     struct rtable *first, struct rtable **rp)
+{
+	struct rtable *nh, *cur_min = NULL, *cur_min_nrr = NULL;
+	int devidx = -1;
+	int cur_min_devidx = -1, cur_min_devidx_nrr = -1;
+	int min_usecount = INT_MAX, min_usecount_nrr = INT_MAX;
+	int node = numa_node_id();
+	struct multipath_device *state;
+
+	/* 1. make sure all alt. nexthops have the same GC related data */
+	/* 2. determine the new candidate to be returned */
+	state = local_state[node];
+	for (nh = rcu_dereference(first); nh;
+	     nh = rcu_dereference(nh->u.rt_next)) {
+		if ((nh->u.dst.flags & DST_BALANCED) != 0 &&
+		    multipath_comparekeys(&nh->fl, flp)) {
+			int count;
+			int nh_ifidx = nh->u.dst.dev->ifindex;
+
+			nh->u.dst.lastuse = jiffies;
+			nh->u.dst.__use++;
+
+			/* search for the output interface */
+
+			/* this is not SMP safe, only add/remove are
+			 * SMP safe as wrong usecount updates have no big
+			 * impact
+			 */
+			devidx = __multipath_finddev(nh_ifidx, node);
+			if (devidx == -1) {
+				devidx = update_state_table(nh, node);
+				if (devidx == -1)
+					continue;
+			}
+			count = atomic_read(&state[devidx].usecount);
+
+			/* RR on node local interfaces if available */
+			if (state[devidx].node == node) {
+				if (count < min_usecount_nrr) {
+					cur_min_nrr = nh;
+					cur_min_devidx_nrr = devidx;
+					min_usecount_nrr = count;
+					/* lowest used.  So use this IF */
+					if (min_usecount_nrr == 0)
+						break;
+				}
+			} else {
+				if (count < min_usecount) {
+					cur_min = nh;
+					cur_min_devidx = devidx;
+					min_usecount = count;
+				}
+			}
+		}
+	}
+
+	/* If node local route is present, choose it.  Else choose SMP RR */
+	if (cur_min_devidx_nrr != -1) {
+		drr_safe_inc(&state[cur_min_devidx_nrr].usecount);
+		*rp = cur_min_nrr;
+		return ;
+	}
+	
+	if (cur_min_devidx != -1) {
+		drr_safe_inc(&state[cur_min_devidx].usecount);
+		*rp = cur_min;
+	} else
+		*rp = first;
+}
+
+static struct ip_mp_alg_ops drr_ops = {
+	.mp_alg_select_route	=	drr_select_route,
+};
+
+static int __init drr_init(void)
+{
+	int err, nid;
+	int size = MULTIPATH_MAX_DEVICECANDIDATES * 
+			sizeof(struct multipath_device);
+	for_each_node(nid) {
+		int i;
+		local_state[nid] = kmalloc_node(size, GFP_KERNEL, nid);
+		if (local_state[nid] == NULL) {
+			int i;
+			for_each_node(i){
+				if (i < nid)
+					kfree(local_state[i]);
+			}
+			printk(KERN_CRIT"drr_init: Cannot allocate state table\n");
+			return -ENOMEM;
+		}
+		for (i = 0; i < MULTIPATH_MAX_DEVICECANDIDATES; i++) {
+			local_state[nid][i].allocated = 0;
+			local_state[nid][i].ifi = 0;
+		}
+	}
+	err = register_netdevice_notifier(&drr_dev_notifier);
+
+	if (err)
+		return err;
+
+	err = multipath_alg_register(&drr_ops, IP_MP_ALG_DRR);
+	if (err)
+		goto fail;
+
+	return 0;
+
+fail:
+	unregister_netdevice_notifier(&drr_dev_notifier);
+	return err;
+}
+
+static void __exit drr_exit(void)
+{
+	int nid;
+	unregister_netdevice_notifier(&drr_dev_notifier);
+	multipath_alg_unregister(&drr_ops, IP_MP_ALG_DRR);
+	for_each_node(nid){
+		kfree(local_state[nid]);
+	}
+}
+
+module_init(drr_init);
+module_exit(drr_exit);
+MODULE_LICENSE("GPL");

^ permalink raw reply

* Re: [PATCH 2.6.16-rc6 1/1] ipw2200: Add Kconfig entries for QOS and Monitor mode
From: Zhu Yi @ 2006-03-22  3:11 UTC (permalink / raw)
  To: Andreas Happe; +Cc: Adrian Bunk, Andrew Morton, linux-kernel, jgarzik, netdev
In-Reply-To: <20060318174703.GA22072@localdomain>

On Sat, 2006-03-18 at 18:47 +0100, Andreas Happe wrote:
> Adds Kconfig entries for enabling Monitor mode and Quality of service
> to the ipw2200 driver. It also renames the IPW_QOS define to
> IPW2200_QOS.
> 
> As Monitor mode generates lots of firmware errors it depends upon
> BROKEN. QOS is under development, so it depends upon EXPERIMENTAL.

Ack the rename and QoS description changes.

The IPW2200_MONITOR and monitor mode firmware error are already fixed in
wireless-2.6 GIT
http://kernel.org/git/?p=linux/kernel/git/linville/wireless-2.6.git;a=summary

Wireless related development happens there. I'd suggest you create
patches against that tree.

Thanks,
-yi

^ permalink raw reply

* Re: [PATCH 6/6 v2] IB: userspace support for RDMA connection manager
From: Roland Dreier @ 2006-03-22  1:40 UTC (permalink / raw)
  To: Sean Hefty; +Cc: netdev, linux-kernel, openib-general
In-Reply-To: <ORSMSX4011XvpFVjCRG0000000f@orsmsx401.amr.corp.intel.com>

I added this patch to the rdma_cm branch in my git tree.  When I was
doing that, I noticed that it builds rdma_ucm.ko unconditionally.  It
seems that we want this to depend on CONFIG_INFINIBAND_USER_ACCESS,
since that controls ib_uverbs.ko and ib_ucm.ko.

To do this I rejiggered the Kconfig and Makefile changes I made
before.  I made CONFIG_INFINIBAND_ADDR_TRANS into a bool (instead of a
tristate), so that it's 'y' if INFINIBAND and INET are on, and made
the top of the Makefile look like:

infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS)	:= ib_addr.o rdma_cm.o
user_access-$(CONFIG_INFINIBAND_ADDR_TRANS)	:= rdma_ucm.o

obj-$(CONFIG_INFINIBAND) +=		ib_core.o ib_mad.o ib_sa.o \
					ib_cm.o $(infiniband-y)
obj-$(CONFIG_INFINIBAND_USER_MAD) +=	ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o $(user_access-y)

I'm pretty sure this does exactly what we want.

 - R.

^ permalink raw reply

* Re: Re: [iproute2] IPoIB link layer address bug
From: Jason Gunthorpe @ 2006-03-22  1:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, openib-general, lartc
In-Reply-To: <20060321155617.3ae419b2@localhost.localdomain>

On Tue, Mar 21, 2006 at 03:56:17PM -0800, Stephen Hemminger wrote:

> Okay, but there are number of other places in iproute2 that call ll_addr_a2n()
> with ifr.ifr_hwaddr.sa_data. And that is 14 bytes.  If you want to fix those
> it will be harder since it would increase the sizeof(struct sockaddr) and potentially
> break compatibility.

Maybe the best thing is to upgrade ip (and or netlink?) to use netlink
messages instead of ioctls for the remaining problematic operations.
Since netlink already supports an arbitary length hwaddr there should
be no compatability problem.

Just browsing I see usages of SIOCSIFHWBROADCAST, SIOCSIFHWADDR,
SIOCADDMULTI, SIOCDELMULTI and SIOCGIFHWADDR that use a struct ifreq..

I know SIOCGIFHWADDR can be done over netlink, but I'm not too
familiar with the others..

Jason

^ permalink raw reply

* Re: [iproute2] IPoIB link layer address bug
From: Stephen Hemminger @ 2006-03-21 23:56 UTC (permalink / raw)
  To: James Lentini; +Cc: netdev, openib-general, lartc
In-Reply-To: <Pine.LNX.4.61.0603161707360.23670@jlentini-linux.nane.netapp.com>

On Thu, 16 Mar 2006 17:24:41 -0500 (EST)
James Lentini <jlentini@netapp.com> wrote:

> 
> The ip(8) command has a bug when dealing with IPoIB link layer 
> addresses. Specifically it does not correctly handle the addition of 
> new entries in the neighbor/arp table. For example, this command will 
> fail:
> 
> ip neigh add 192.168.0.138 lladdr 00:00:04:04:fe:80:00:00:00:00:00:00:00:01:73:00:00:00:8a:91 nud permanent dev ib0
> 
> An IPoIB link layer address is 20-bytes (see 
> http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-09.txt, 
> section 9.1.1).
> 
> The command line parsing code expects link layer addresses to be a 
> maximum of 16-bytes. Addresses over 16-bytes are truncated.
> 
> This patch (against the iproute2 cvs repository) fixes the problem:
> 

Okay, but there are number of other places in iproute2 that call ll_addr_a2n()
with ifr.ifr_hwaddr.sa_data. And that is 14 bytes.  If you want to fix those
it will be harder since it would increase the sizeof(struct sockaddr) and potentially
break compatibility.

^ permalink raw reply

* Re: Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
From: Roland Dreier @ 2006-03-21 22:39 UTC (permalink / raw)
  To: Sean Hefty; +Cc: netdev, linux-kernel, openib-general
In-Reply-To: <44206B53.8020701@ichips.intel.com>

    Sean> "This is simply an attempt to reduce/combine work queues
    Sean> used by the Infiniband code.  This keeps the threading a
    Sean> little simpler in the rdma_cm, since all callbacks are
    Sean> invoked using the same work queue.  (I'm also using this
    Sean> with the local SA/multicast code, but that's not ready for
    Sean> merging.)"

How does it keep the threading model simpler?  Is this an inter-module
dependency.

    Sean> There's no specific ordering constraint that's required.
    Sean> We're just ending up with several Infiniband modules
    Sean> creating their own work queues (ib_mad, ib_cm, ib_addr,
    Sean> rdma_cm, plus a couple more in modules under development),
    Sean> and this is an attempt to reduce that.  If having separate
    Sean> work queues would work better, there shouldn't be anything
    Sean> that prevents this.

It seems like it would be cleaner for each module to have its own
workqueue if it needs one.  There's also schedule_work(), although
that goes to a multi-threaded workqueue.  Michael Tsirkin has
suggested creating a system-wide single-threaded workqueue (ie
something like schedule_ordered_work()) for everyone that occasionally
needs a single-threaded workqueue.

 - R.

^ permalink raw reply

* [git patches] net driver updates
From: Jeff Garzik @ 2006-03-21 21:55 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

[just pushed upstream; patch too big to inline]

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

to receive the following updates:

 Documentation/networking/e100.txt  |  158 -
 Documentation/networking/e1000.txt |  634 +++--
 MAINTAINERS                        |   16 
 drivers/net/mv643xx_eth.h          |   18 
 drivers/net/pcnet32.c              | 4161 +++++++++++++++++++------------------
 drivers/net/skfp/fplustm.c         |   12 
 drivers/net/skge.c                 |  275 +-
 drivers/net/skge.h                 |    1 
 drivers/net/sky2.c                 |  583 ++---
 drivers/net/sky2.h                 |   22 
 drivers/net/smc91x.c               |   53 
 drivers/net/smc91x.h               |  478 ++--
 12 files changed, 3467 insertions(+), 2944 deletions(-)

Andrew Morton:
      skfp warning fixes

Dale Farnsworth:
      mv643xx_eth: Cache align skb->data if CONFIG_NOT_COHERENT_CACHE

Don Fry:
      pcnet32: support boards with multiple phys

Jeff Garzik:
      [netdrvr] pcnet32: Lindent
      [netdrvr] pcnet32: other source formatting cleanups

Jesse Brandeburg:
      e100/e1000/ixgb: update MAINTAINERS to current developers
      e100: update e100.txt
      e1000: update the readme with the latest text

Nicolas Pitre:
      smc91x: allow for dynamic bus access configs

Stephen Hemminger:
      skge: use NAPI for tx cleanup.
      skge: use auto masking of irqs
      skge: check the allocation of ring buffer
      skge: dma configuration cleanup
      skge: use kcalloc
      skge: use mmiowb
      skge: formmating and whitespace cleanup
      skge: handle pci errors better
      skge: version 1.4
      sky2: remove support for untested Yukon EC/rev 0
      sky2: drop broken wake on lan support
      sky2: rework of NAPI and IRQ management
      sky2: coalescing parameters
      sky2: add MSI support
      sky2: whitespace fixes
      sky2: transmit recovery
      sky2: handle all error irqs
      sky2 version 1.1

^ permalink raw reply

* Re: Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
From: Sean Hefty @ 2006-03-21 21:08 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, linux-kernel, openib-general
In-Reply-To: <adabqvza53f.fsf@cisco.com>

Roland Dreier wrote:
>  > +struct workqueue_struct *rdma_wq;
>  > +EXPORT_SYMBOL(rdma_wq);
> 
> Sean, I don't think I saw an answer when I asked you this before.  Why
> is ib_addr exporting a workqueue?  Is there some sort of ordering
> constraint that is forcing other modules to go through the same
> workqueue for things?
> 
> This seems like a very fragile internal thing to be exposing, and I'm
> wondering if there's a better way to handle it.

I responded in a different thread, but here's what I wrote:

"This is simply an attempt to reduce/combine work queues used by the Infiniband 
code.  This keeps the threading a little simpler in the rdma_cm, since all 
callbacks are invoked using the same work queue.  (I'm also using this with the 
local SA/multicast code, but that's not ready for merging.)"

There's no specific ordering constraint that's required.  We're just ending up 
with several Infiniband modules creating their own work queues (ib_mad, ib_cm, 
ib_addr, rdma_cm, plus a couple more in modules under development), and this is 
an attempt to reduce that.  If having separate work queues would work better, 
there shouldn't be anything that prevents this.

- Sean

^ permalink raw reply

* Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
From: Roland Dreier @ 2006-03-21 20:57 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-kernel, netdev, openib-general
In-Reply-To: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>

 > +struct workqueue_struct *rdma_wq;
 > +EXPORT_SYMBOL(rdma_wq);

Sean, I don't think I saw an answer when I asked you this before.  Why
is ib_addr exporting a workqueue?  Is there some sort of ordering
constraint that is forcing other modules to go through the same
workqueue for things?

This seems like a very fragile internal thing to be exposing, and I'm
wondering if there's a better way to handle it.

 - R.

^ permalink raw reply

* Re: [PATCH] wireless.git: update acxsm to 0.4.7
From: John W. Linville @ 2006-03-21 19:10 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: acx100-devel, Carlos Martín, netdev, Christoph Hellwig
In-Reply-To: <200603011558.14968.vda@ilport.com.ua>

On Wed, Mar 01, 2006 at 03:58:14PM +0200, Denis Vlasenko wrote:
> On Tuesday 28 February 2006 03:34, John W. Linville wrote:
> > On Mon, Feb 27, 2006 at 11:44:38AM +0100, Carlos Martín wrote:
> > > On Monday 27 February 2006 11:20, Denis Vlasenko wrote:
> > > > > Comments are welcome and I'll split the patch if needed.
> > 
> > Denis are you applying this patch to your tree?  If so, I'll rely on
> > you to push it to me when you are ready.
> > 
> > If not, then I will need Carlos to generate the diffs so that they
> > can be applied to the top of the tree with -p1.
> > 
> > 	http://linux.yyz.us/patch-format.html
> 
> Changelog:

Merged to softmac branch of wireless-2.6...thanks!

John
-- 
John W. Linville
linville@tuxdriver.com


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid\x110944&bid$1720&dat\x121642

^ permalink raw reply

* Re: Please pull bcm43xx softmac-upstream and dscape-upstream branches
From: John W. Linville @ 2006-03-21 19:09 UTC (permalink / raw)
  To: Michael Buesch
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w
In-Reply-To: <200603052147.55557.mbuesch-KuiJ5kEpwI6ELgA04lAiVw@public.gmane.org>

On Sun, Mar 05, 2006 at 09:47:55PM +0100, Michael Buesch wrote:

> Please pull branches "softmac-upstream" and "dscape-upstream"
> from my repository at:
> git://bu3sch.de/wireless-2.6.git

Merged to softmac and dscape branches of wireless-2.6...thanks!

John
-- 
John W. Linville
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Stephen Smalley @ 2006-03-21 13:42 UTC (permalink / raw)
  To: Chris Wright
  Cc: James Morris, Andrew Morton, cxzhang, netdev, ioe-lkml, davem,
	linux-kernel, netdev
In-Reply-To: <1142947952.28120.29.camel@moss-spartans.epoch.ncsc.mil>

On Tue, 2006-03-21 at 08:32 -0500, Stephen Smalley wrote:
> > I don't expect security_sk_sid() to be terribly expensive.  It's not
> > an AVC check, it's just propagating a label.  But I've not done any
> > benchmarking on that.
> 
> No permission check there, but it looks like it does read lock
> sk_callback_lock.  Not sure if that is truly justified here.

Ah, that is because it is also called from the xfrm code, introduced by
Trent's patches.  But that locking shouldn't be necessary from scm_send,
right?  So she likely wants a separate hook for it to avoid that
overhead, or even just a direct SELinux interface?
  
-- 
Stephen Smalley
National Security Agency

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Stephen Smalley @ 2006-03-21 13:32 UTC (permalink / raw)
  To: Chris Wright
  Cc: James Morris, Andrew Morton, cxzhang, netdev, ioe-lkml, davem,
	linux-kernel, netdev
In-Reply-To: <20060320231508.GV15997@sorel.sous-sol.org>

On Mon, 2006-03-20 at 15:15 -0800, Chris Wright wrote:
> * Andrew Morton (akpm@osdl.org) wrote:
> > Chris Wright <chrisw@sous-sol.org> wrote:
> > > Catherine, the security_sid_to_context() is a raw SELinux function which
> > > crept into core code and should not have been there.  The fallout fixes
> > > included conditionally exporting security_sid_to_context, and finally
> > > scm_send/recv unlining.
> > 
> > Yes.  So we're OK up the uninlining, right?
> 
> Yes, although sid_to_context is meant to be analog to the other
> get_peersec calls, and should really be made a proper part of the
> interface (can be done later, correctness is the issue at hand).

Yes, Catherine was told that she shouldn't be directly exporting
security_sid_to_context, and was allegedly working on a fix.  Note
however that the expected solution is not a LSM interface but a set of
properly encapsulated interfaces exported directly from SELinux, based
on the iptables context matching patches by James.  The same style of
interface is being put forth for the audit LSPP work.  The indirection
of LSM serves no purpose here, as these users are specifically looking
for functionality provided only by SELinux.

> I don't expect security_sk_sid() to be terribly expensive.  It's not
> an AVC check, it's just propagating a label.  But I've not done any
> benchmarking on that.

No permission check there, but it looks like it does read lock
sk_callback_lock.  Not sure if that is truly justified here.

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply

* [PATCH] au1000_tx_timeout and promiscuous mode
From: elmar gerdes @ 2006-03-21  0:50 UTC (permalink / raw)
  To: netdev; +Cc: linux-mips

[-- Attachment #1: Type: TEXT/PLAIN, Size: 622 bytes --]


hello,

the attached patch fixes a problem I had when running 2 different
Au15xx-based boards (Au1500 and Au1550) in bridging mode under high load.
au1000_tx_timeout() would reset the device for which it was called, but
promiscuous mode was not re-established.

tested under 2.6.14, 2.6.16-rc6 and 2.4.31.

on the 2.6.1x systems, the timeout appears after some 5 seconds, under
2.4.31 I thought it did not appear, but after half an hour of full load
I've seen it there, too.  now this makes me wonder why it takes only a
few seconds to have a tx timeout under 2.6 and half an hour under 2.4.
any ideas?

regards,
 	elmar

[-- Attachment #2: Type: TEXT/PLAIN, Size: 10214 bytes --]

1) au1000_tx_timeout(): Re-establish promiscuous mode (when bridging)
   after the re-init.

2) Do not dereference device private structure but use netdev_priv()
   instead (according to Rubini et.al. "Linux Device Drivers", 3rd
   edition, o-Reilly.

Apply to kernel trees 2.6.x and 2.4.x.

Tested on
 - myCable XXS1500, Rev.B, Au1500-based, and
 - Kurz NVB, Au1550-based, not in kernel-tree.

Signed-off-by: elmar gerdes <elmar.gerdes@engel-kg.com>


--- a/drivers/net/au1000_eth.c
+++ b/drivers/net/au1000_eth.c
@@ -206,7 +206,7 @@
 		printk(KERN_ERR "bcm_5201_status error: NULL dev\n");
 		return -1;
 	}
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 	if (mii_data & MII_STAT_LINK) {
@@ -293,7 +293,7 @@
 		printk(KERN_ERR "lsi_80227_status error: NULL dev\n");
 		return -1;
 	}
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 	if (mii_data & MII_STAT_LINK) {
@@ -404,7 +404,7 @@
 		return -1;
 	}
 
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 
 	if (mii_data & MII_STAT_LINK) {
@@ -486,7 +486,7 @@
 		printk(KERN_ERR "lxt971a_status error: NULL dev\n");
 		return -1;
 	}
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 	if (mii_data & MII_STAT_LINK) {
@@ -571,7 +571,7 @@
 		printk(KERN_ERR "ks8995m_status error: NULL dev\n");
 		return -1;
 	}
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 	if (mii_data & MII_STAT_LINK) {
@@ -664,7 +664,7 @@
 		return -1;
 	}
 
-	aup = (struct au1000_private *) dev->priv;
+	aup = netdev_priv(dev);
 	mii_data = mdio_read(dev, aup->phy_addr, MII_STATUS);
 
 	if (mii_data & MII_STAT_LINK) {
@@ -794,7 +794,7 @@
 
 static int mdio_read(struct net_device *dev, int phy_id, int reg)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	volatile u32 *mii_control_reg;
 	volatile u32 *mii_data_reg;
 	u32 timedout = 20;
@@ -854,7 +854,7 @@
 
 static void mdio_write(struct net_device *dev, int phy_id, int reg, u16 value)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	volatile u32 *mii_control_reg;
 	volatile u32 *mii_data_reg;
 	u32 timedout = 20;
@@ -911,7 +911,7 @@
 
 static int mii_probe (struct net_device * dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	int phy_addr;
 #ifdef CONFIG_MIPS_BOSPORUS
 	int phy_found=0;
@@ -1078,7 +1078,7 @@
 
 static void enable_rx_tx(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4)
 		printk(KERN_INFO "%s: enable_rx_tx\n", dev->name);
@@ -1089,7 +1089,7 @@
 
 static void hard_stop(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4)
 		printk(KERN_INFO "%s: hard stop\n", dev->name);
@@ -1103,7 +1103,7 @@
 {
 	int i;
 	u32 flags;
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4) 
 		printk(KERN_INFO "%s: reset mac, aup %x\n", 
@@ -1240,7 +1240,7 @@
 
 static int au1000_setup_aneg(struct net_device *dev, u32 advertise)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u16 ctl, adv;
 
 	/* Setup standard advertise */
@@ -1266,7 +1266,7 @@
 
 static int au1000_setup_forced(struct net_device *dev, int speed, int fd)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u16 ctl;
 
 	ctl = mdio_read(dev, aup->phy_addr, MII_BMCR);
@@ -1297,7 +1297,7 @@
 static void
 au1000_start_link(struct net_device *dev, struct ethtool_cmd *cmd)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u32 advertise;
 	int autoneg;
 	int forced_speed;
@@ -1333,7 +1333,7 @@
 
 static int au1000_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u16 link, speed;
 
 	cmd->supported = GENMII_DEFAULT_FEATURES;
@@ -1358,7 +1358,7 @@
 
 static int au1000_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
-	 struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	 struct au1000_private *aup = netdev_priv(dev);
 	  unsigned long features = GENMII_DEFAULT_FEATURES;
 
 	 if (!capable(CAP_NET_ADMIN))
@@ -1402,7 +1402,7 @@
 
 static int au1000_nway_reset(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (!aup->want_autoneg)
 		return -EINVAL;
@@ -1415,7 +1415,7 @@
 static void
 au1000_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	strcpy(info->driver, DRV_NAME);
 	strcpy(info->version, DRV_VERSION);
@@ -1470,7 +1470,7 @@
 	printk("%s: Au1x Ethernet found at 0x%x, irq %d\n", 
 			dev->name, ioaddr, irq);
 
-	aup = dev->priv;
+	aup = netdev_priv(dev);
 
 	/* Allocate the data buffers */
 	/* Snooping works fine with eth on all au1xxx */
@@ -1637,7 +1637,7 @@
  */
 static int au1000_init(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u32 flags;
 	int i;
 	u32 control;
@@ -1689,7 +1689,7 @@
 static void au1000_timer(unsigned long data)
 {
 	struct net_device *dev = (struct net_device *)data;
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	unsigned char if_port;
 	u16 link, speed;
 
@@ -1742,7 +1742,7 @@
 static int au1000_open(struct net_device *dev)
 {
 	int retval;
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4)
 		printk("%s: open: dev=%p\n", dev->name, dev);
@@ -1776,7 +1776,7 @@
 static int au1000_close(struct net_device *dev)
 {
 	u32 flags;
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4)
 		printk("%s: close: dev=%p\n", dev->name, dev);
@@ -1804,7 +1804,7 @@
 	for (i = 0; i < num_ifs; i++) {
 		dev = iflist[i].dev;
 		if (dev) {
-			aup = (struct au1000_private *) dev->priv;
+			aup = netdev_priv(dev);
 			unregister_netdev(dev);
 			kfree(aup->mii);
 			for (j = 0; j < NUM_RX_DMA; j++) {
@@ -1829,7 +1829,7 @@
 static inline void 
 update_tx_stats(struct net_device *dev, u32 status, u32 pkt_len)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	struct net_device_stats *ps = &aup->stats;
 
 	ps->tx_packets++;
@@ -1861,7 +1861,7 @@
  */
 static void au1000_tx_ack(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	volatile tx_dma_t *ptxd;
 
 	ptxd = aup->tx_dma_ring[aup->tx_tail];
@@ -1888,7 +1888,7 @@
  */
 static int au1000_tx(struct sk_buff *skb, struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	volatile tx_dma_t *ptxd;
 	u32 buff_stat;
 	db_dest_t *pDB;
@@ -1939,7 +1939,7 @@
 
 static inline void update_rx_stats(struct net_device *dev, u32 status)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	struct net_device_stats *ps = &aup->stats;
 
 	ps->rx_packets++;
@@ -1967,7 +1967,7 @@
  */
 static int au1000_rx(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	struct sk_buff *skb;
 	volatile rx_dma_t *prxd;
 	u32 buff_stat, status;
@@ -2070,6 +2070,8 @@
 	printk(KERN_ERR "%s: au1000_tx_timeout: dev=%p\n", dev->name, dev);
 	reset_mac(dev);
 	au1000_init(dev);
+	/* Set promiscuous mode */
+	set_rx_mode(dev);
 	dev->trans_start = jiffies;
 	netif_wake_queue(dev);
 }
@@ -2093,7 +2095,7 @@
 
 static void set_rx_mode(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4) 
 		printk("%s: set_rx_mode: flags=%x\n", dev->name, dev->flags);
@@ -2127,7 +2129,7 @@
 
 static int au1000_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
-	struct au1000_private *aup = (struct au1000_private *)dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u16 *data = (u16 *)&rq->ifr_ifru;
 
 	switch(cmd) { 
@@ -2154,7 +2156,7 @@
 
 static int au1000_set_config(struct net_device *dev, struct ifmap *map)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 	u16 control;
 
 	if (au1000_debug > 4)  {
@@ -2248,7 +2250,7 @@
 
 static struct net_device_stats *au1000_get_stats(struct net_device *dev)
 {
-	struct au1000_private *aup = (struct au1000_private *) dev->priv;
+	struct au1000_private *aup = netdev_priv(dev);
 
 	if (au1000_debug > 4)
 		printk("%s: au1000_get_stats: dev=%p\n", dev->name, dev);

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: David S. Miller @ 2006-03-21  0:50 UTC (permalink / raw)
  To: jmorris; +Cc: chrisw, cxzhang, netdev, ioe-lkml, linux-kernel, netdev, akpm
In-Reply-To: <Pine.LNX.4.64.0603201936030.6749@excalibur.intercode>

From: James Morris <jmorris@namei.org>
Date: Mon, 20 Mar 2006 19:37:51 -0500 (EST)

> I believe Catherine is away this week, so it's probably best to drop the 
> code and wait till she gets back and we can get it 100% right.

Ok, agreed.

> Sorry, this is my fault, I should have caught this problem.

No worries.

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: James Morris @ 2006-03-21  0:37 UTC (permalink / raw)
  To: David S. Miller
  Cc: chrisw, cxzhang, netdev, ioe-lkml, linux-kernel, netdev, akpm
In-Reply-To: <20060320.152838.68858441.davem@davemloft.net>

On Mon, 20 Mar 2006, David S. Miller wrote:

> I'm seriously considering backing out Catherine's AF_UNIX patch from
> the net-2.6.17 tree before submitting it to Linus later today so that
> none of this crap goes in right now.

I believe Catherine is away this week, so it's probably best to drop the 
code and wait till she gets back and we can get it 100% right.

Sorry, this is my fault, I should have caught this problem.

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Chris Wright @ 2006-03-20 23:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: chrisw, cxzhang, netdev, ioe-lkml, linux-kernel, netdev, akpm
In-Reply-To: <20060320.152838.68858441.davem@davemloft.net>

* David S. Miller (davem@davemloft.net) wrote:
> From: Chris Wright <chrisw@sous-sol.org>
> Date: Mon, 20 Mar 2006 13:36:36 -0800
> 
> > The point of Catherine's original patch was to make sure there's always
> > a security identifier associated with AF_UNIX messages.  So receiver
> > can always check it (same as having credentials even w/out sender
> > control message passing them).  Now we will have garbage for sid.
> 
> I'm seriously considering backing out Catherine's AF_UNIX patch from
> the net-2.6.17 tree before submitting it to Linus later today so that
> none of this crap goes in right now.
> 
> It appears that this needs a lot more sorting out, so for now that's
> probably the right thing to do.

I won't object.  I checked your tree, it looks OK to me.  The actual
broken patch appears in -mm, and the security_sid_to_context snafu
is primarily cosmetic at this point (the exports, etc fixed the real
compilation issues AFAICT).  But, again, if you want to drop that's fine
w/ me.  I'm sure Catherine can cleanup and resend as needed.

thanks,
-chris

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: David S. Miller @ 2006-03-20 23:28 UTC (permalink / raw)
  To: chrisw; +Cc: cxzhang, netdev, ioe-lkml, linux-kernel, netdev, akpm
In-Reply-To: <20060320213636.GT15997@sorel.sous-sol.org>

From: Chris Wright <chrisw@sous-sol.org>
Date: Mon, 20 Mar 2006 13:36:36 -0800

> The point of Catherine's original patch was to make sure there's always
> a security identifier associated with AF_UNIX messages.  So receiver
> can always check it (same as having credentials even w/out sender
> control message passing them).  Now we will have garbage for sid.

I'm seriously considering backing out Catherine's AF_UNIX patch from
the net-2.6.17 tree before submitting it to Linus later today so that
none of this crap goes in right now.

It appears that this needs a lot more sorting out, so for now that's
probably the right thing to do.

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Chris Wright @ 2006-03-20 23:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Wright, cxzhang, netdev, ioe-lkml, davem, linux-kernel,
	netdev
In-Reply-To: <20060320143103.31b7d933.akpm@osdl.org>

* Andrew Morton (akpm@osdl.org) wrote:
> Chris Wright <chrisw@sous-sol.org> wrote:
> > Catherine, the security_sid_to_context() is a raw SELinux function which
> > crept into core code and should not have been there.  The fallout fixes
> > included conditionally exporting security_sid_to_context, and finally
> > scm_send/recv unlining.
> 
> Yes.  So we're OK up the uninlining, right?

Yes, although sid_to_context is meant to be analog to the other
get_peersec calls, and should really be made a proper part of the
interface (can be done later, correctness is the issue at hand).

> >  The end result in -mm looks broken to me.
> > Specifically, it now does:
> > 
> > 	ucred->uid = tsk->uid;
> > 	ucred->gid = tsk->gid;
> > 	ucred->pid = tsk->tgid;
> > 	scm->fp = NULL;
> > 	scm->seq = 0;
> > 	if (msg->msg_controllen <= 0)
> > 		return 0;
> > 
> > 	scm->sid = security_sk_sid(sock->sk, NULL, 0);
> > 
> > The point of Catherine's original patch was to make sure there's always
> > a security identifier associated with AF_UNIX messages.  So receiver
> > can always check it (same as having credentials even w/out sender
> > control message passing them).  Now we will have garbage for sid.
> 
> This answers the question I've been asking all and sundry for a week, thanks ;)
> So:
> 
> - scm-fold-__scm_send-into-scm_send.patch is OK

Yes.

> - scm_send-speedup.patch is wrong

Yes.

> - Catherine's patch introduces a possibly-significant performance
>   problem: we're now calling the expensive-on-SELinux security_sk_sid()
>   more frequently than we used to.

I don't expect security_sk_sid() to be terribly expensive.  It's not
an AVC check, it's just propagating a label.  But I've not done any
benchmarking on that.

thanks,
-chris

^ permalink raw reply

* Re: TSO and IPoIB performance degradation
From: David S. Miller @ 2006-03-20 23:00 UTC (permalink / raw)
  To: bcrl
  Cc: netdev, rdreier, rick.jones2, linux-kernel, openib-general,
	buytenh, arjan
In-Reply-To: <20060320150941.GD16108@kvack.org>

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Mon, 20 Mar 2006 10:09:42 -0500

> Wouldn't it make sense to strech the ACK when the previous ACK is still in 
> the TX queue of the device?  I know that sort of behaviour was always an 
> issue on modem links where you don't want to send out redundant ACKs.

I thought about doing some similar trick with TSO, wherein we would
not defer a TSO send if all the previous packets sent are out of the
device transmit queue.  The idea was the prevent the pipe from ever
emptying which is the danger of deferring too much for TSO.

This has several problems.  It's hard to implement.  You have to
decide if you want precise state, thereby checking the TX descriptors.
Or you go for imprecise but easier to implement, which is very
imprecise and therefore not very useful state, by just checking the
SKB refcount or similar which means that you find out it's left the TX
queue after the TX purge interrupt which can be a long time after the
event and by then the pipe has empties which is what you were trying
to prevent.

Lastly, you don't want to touch remote cpu state which is what such
a hack is going to end up doing much of the time.

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Andrew Morton @ 2006-03-20 22:31 UTC (permalink / raw)
  To: Chris Wright
  Cc: cxzhang, netdev, chrisw, ioe-lkml, davem, linux-kernel, netdev
In-Reply-To: <20060320213636.GT15997@sorel.sous-sol.org>

Chris Wright <chrisw@sous-sol.org> wrote:
>
> * Chris Wright (chrisw@sous-sol.org) wrote:
> > * Ingo Oeser (netdev@axxeo.de) wrote:
> > > Hi Chris,
> > > 
> > > Andrew Morton wrote:
> > > > Ingo Oeser <ioe-lkml@rameria.de> wrote:
> > > > >
> > > > >  -int scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm)
> > > > >  -{
> > > > >  -	struct task_struct *p = current;
> > > > >  -	scm->creds = (struct ucred) {
> > > > >  -		.uid = p->uid,
> > > > >  -		.gid = p->gid,
> > > > >  -		.pid = p->tgid
> > > > >  -	};
> > > > >  -	scm->fp = NULL;
> > > > >  -	scm->sid = security_sk_sid(sock->sk, NULL, 0);
> > > > >  -	scm->seq = 0;
> > > > >  -	if (msg->msg_controllen <= 0)
> > > > >  -		return 0;
> > > > >  -	return __scm_send(sock, msg, scm);
> > > > >  -}
> > > > 
> > > > It's worth noting that scm_send() will call security_sk_sid() even if
> > > > (msg->msg_controllen <= 0).
> > > 
> > > Chris, do you know if this is needed in this case?
> > 
> > This whole thing is looking broken.  I'm still trying to find the original
> > patch which caused the series of broken patches on top.
> 
> OK, it starts here from Catherine's patch:
> 
> include/net/scm.h::scm_recv()
> +	if (test_bit(SOCK_PASSSEC, &sock->flags)) {
> +		err = security_sid_to_context(scm->sid, &scontext, &scontext_len);
> +		if (!err)
> + 			put_cmsg(msg, SOL_SOCKET, SCM_SECURITY, scontext_len, scontext);
> + }
> 
> Catherine, the security_sid_to_context() is a raw SELinux function which
> crept into core code and should not have been there.  The fallout fixes
> included conditionally exporting security_sid_to_context, and finally
> scm_send/recv unlining.

Yes.  So we're OK up the uninlining, right?

>  The end result in -mm looks broken to me.
> Specifically, it now does:
> 
> 	ucred->uid = tsk->uid;
> 	ucred->gid = tsk->gid;
> 	ucred->pid = tsk->tgid;
> 	scm->fp = NULL;
> 	scm->seq = 0;
> 	if (msg->msg_controllen <= 0)
> 		return 0;
> 
> 	scm->sid = security_sk_sid(sock->sk, NULL, 0);
> 
> The point of Catherine's original patch was to make sure there's always
> a security identifier associated with AF_UNIX messages.  So receiver
> can always check it (same as having credentials even w/out sender
> control message passing them).  Now we will have garbage for sid.

This answers the question I've been asking all and sundry for a week, thanks ;)

So:

- scm-fold-__scm_send-into-scm_send.patch is OK

- scm_send-speedup.patch is wrong

- Catherine's patch introduces a possibly-significant performance
  problem: we're now calling the expensive-on-SELinux security_sk_sid()
  more frequently than we used to.

- That "initialise scm->creds via a temporary struct" trick still
  generates bad code.


I actually have enough to be going on with here - I'll drop it all.

^ permalink raw reply

* Re: [PATCH] scm: fold __scm_send() into scm_send()
From: Chris Wright @ 2006-03-20 21:36 UTC (permalink / raw)
  To: Catherine Zhang
  Cc: Ingo Oeser, Chris Wright, Ingo Oeser, davem, linux-kernel, netdev,
	Andrew Morton
In-Reply-To: <20060320201802.GS15997@sorel.sous-sol.org>

* Chris Wright (chrisw@sous-sol.org) wrote:
> * Ingo Oeser (netdev@axxeo.de) wrote:
> > Hi Chris,
> > 
> > Andrew Morton wrote:
> > > Ingo Oeser <ioe-lkml@rameria.de> wrote:
> > > >
> > > >  -int scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm)
> > > >  -{
> > > >  -	struct task_struct *p = current;
> > > >  -	scm->creds = (struct ucred) {
> > > >  -		.uid = p->uid,
> > > >  -		.gid = p->gid,
> > > >  -		.pid = p->tgid
> > > >  -	};
> > > >  -	scm->fp = NULL;
> > > >  -	scm->sid = security_sk_sid(sock->sk, NULL, 0);
> > > >  -	scm->seq = 0;
> > > >  -	if (msg->msg_controllen <= 0)
> > > >  -		return 0;
> > > >  -	return __scm_send(sock, msg, scm);
> > > >  -}
> > > 
> > > It's worth noting that scm_send() will call security_sk_sid() even if
> > > (msg->msg_controllen <= 0).
> > 
> > Chris, do you know if this is needed in this case?
> 
> This whole thing is looking broken.  I'm still trying to find the original
> patch which caused the series of broken patches on top.

OK, it starts here from Catherine's patch:

include/net/scm.h::scm_recv()
+	if (test_bit(SOCK_PASSSEC, &sock->flags)) {
+		err = security_sid_to_context(scm->sid, &scontext, &scontext_len);
+		if (!err)
+ 			put_cmsg(msg, SOL_SOCKET, SCM_SECURITY, scontext_len, scontext);
+ }

Catherine, the security_sid_to_context() is a raw SELinux function which
crept into core code and should not have been there.  The fallout fixes
included conditionally exporting security_sid_to_context, and finally
scm_send/recv unlining.  The end result in -mm looks broken to me.
Specifically, it now does:

	ucred->uid = tsk->uid;
	ucred->gid = tsk->gid;
	ucred->pid = tsk->tgid;
	scm->fp = NULL;
	scm->seq = 0;
	if (msg->msg_controllen <= 0)
		return 0;

	scm->sid = security_sk_sid(sock->sk, NULL, 0);

The point of Catherine's original patch was to make sure there's always
a security identifier associated with AF_UNIX messages.  So receiver
can always check it (same as having credentials even w/out sender
control message passing them).  Now we will have garbage for sid.

thanks,
-chris

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox