Netdev List
 help / color / mirror / Atom feed
* Re: RCU error in networking
From: David Miller @ 2010-04-29 23:04 UTC (permalink / raw)
  To: mpatocka; +Cc: netdev
In-Reply-To: <Pine.LNX.4.64.1004291811420.26080@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Thu, 29 Apr 2010 18:14:11 -0400 (EDT)

> BTW. when I enabled lockdep, I got this (2.6.34-rc5; the machine is no 
> longer bridging, it has just a single interface):
> 

Fixed in Linus's tree for more than a week:

commit 05d17608a69b3ae653ea5c9857283bef3439c733
Author: David Howells <dhowells@redhat.com>
Date:   Tue Apr 20 00:25:58 2010 +0000

    net: Fix an RCU warning in dev_pick_tx()
    
    Fix the following RCU warning in dev_pick_tx():
    
    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    net/core/dev.c:1993 invoked rcu_dereference_check() without protection!
    
    other info that might help us debug this:
    
    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by swapper/0:
     #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
     #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc
    
    stack backtrace:
    Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
    Call Trace:
     <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
     [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
     [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
     [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
     [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
     [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
     [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
     [<ffffffff81350c34>] ip6_output2+0x256/0x261
     [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
     [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
     [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
     [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
     [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
     [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
     [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
     [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
     [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
     [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
     [<ffffffff81035531>] ? __do_softirq+0x69/0x140
     [<ffffffff8103556a>] __do_softirq+0xa2/0x140
     [<ffffffff81002e0c>] call_softirq+0x1c/0x28
     [<ffffffff81004b54>] do_softirq+0x38/0x80
     [<ffffffff81034f06>] irq_exit+0x45/0x47
     [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
     [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
     <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
     [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
     [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
     [<ffffffff810011cb>] cpu_idle+0x4d/0x83
     [<ffffffff81380b05>] rest_init+0xb9/0xc0
     [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
     [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
     [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
     [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb
    
    An rcu_dereference() should be an rcu_dereference_bh().
    
    Signed-off-by: David Howells <dhowells@redhat.com>
    Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 92584bf..f769098 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1990,7 +1990,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 				queue_index = skb_tx_hash(dev, skb);
 
 			if (sk) {
-				struct dst_entry *dst = rcu_dereference(sk->sk_dst_cache);
+				struct dst_entry *dst = rcu_dereference_bh(sk->sk_dst_cache);
 
 				if (dst && skb_dst(skb) == dst)
 					sk_tx_queue_set(sk, queue_index);

^ permalink raw reply related

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Greg KH @ 2010-04-30  4:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ebiederm, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <4BDA5A2D.6080904@kernel.org>

On Fri, Apr 30, 2010 at 06:18:53AM +0200, Tejun Heo wrote:
> On 04/29/2010 10:29 PM, gregkh@suse.de wrote:
> > 
> > This is a note to let you know that I've just added the patch titled
> > 
> >     Subject: sysfs: Implement sysfs tagged directory support.
> > 
> > to my gregkh-2.6 tree.  Its filename is
> > 
> >     sysfs-implement-sysfs-tagged-directory-support.patch
> > 
> > This tree can be found at 
> >     http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/
> 
> I wish at least more comments are added before it goes mainline.  I
> don't really understand the current form.

Ok, that's fine with me, I'll pull it back out.

thanks,

greg k-h

^ permalink raw reply

* [PATCH 3/3] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Richard Cochran @ 2010-04-29  9:20 UTC (permalink / raw)
  To: netdev

The eTSEC includes a PTP clock with quite a few features. This patch adds
support for the basic clock adjustment functions.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
 arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
 arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
 drivers/net/Makefile                  |    1 +
 drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
 drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
 drivers/ptp/Kconfig                   |   13 ++
 7 files changed, 470 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/gianfar_ptp.c
 create mode 100644 drivers/net/gianfar_ptp_reg.h

diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
index 183f2aa..b760aee 100644
--- a/arch/powerpc/boot/dts/mpc8313erdb.dts
+++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
@@ -208,6 +208,20 @@
 			sleep = <&pmc 0x00300000>;
 		};
 
+		ptp_clock@24E00 {
+			device_type = "ptp_clock";
+			model = "eTSEC";
+			reg = <0x24E00 0xB0>;
+			interrupts = <0x0C 2 0x0D 2>;
+			interrupt-parent = < &ipic >;
+			tclk_period = <10>;
+			tmr_prsc    = <100>;
+			tmr_add     = <0x999999A4>;
+			cksel       = <0x1>;
+			tmr_fiper1  = <0x3B9AC9F6>;
+			tmr_fiper2  = <0x00018696>;
+		};
+
 		enet0: ethernet@24000 {
 			#address-cells = <1>;
 			#size-cells = <1>;
diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
index 1101914..1dcf790 100644
--- a/arch/powerpc/boot/dts/p2020ds.dts
+++ b/arch/powerpc/boot/dts/p2020ds.dts
@@ -336,6 +336,19 @@
 			phy_type = "ulpi";
 		};
 
+		ptp_clock@24E00 {
+			device_type = "ptp_clock";
+			model = "eTSEC";
+			reg = <0x24E00 0xB0>;
+			interrupts = <0x0C 2 0x0D 2>;
+			interrupt-parent = < &mpic >;
+			tclk_period = <5>;
+			tmr_prsc = <200>;
+			tmr_add = <0xCCCCCCCD>;
+			tmr_fiper1 = <0x3B9AC9FB>;
+			tmr_fiper2 = <0x0001869B>;
+		};
+
 		enet0: ethernet@24000 {
 			#address-cells = <1>;
 			#size-cells = <1>;
diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
index da4cb0d..ba61e8e 100644
--- a/arch/powerpc/boot/dts/p2020rdb.dts
+++ b/arch/powerpc/boot/dts/p2020rdb.dts
@@ -396,6 +396,20 @@
 			phy_type = "ulpi";
 		};
 
+		ptp_clock@24E00 {
+			device_type = "ptp_clock";
+			model = "eTSEC";
+			reg = <0x24E00 0xB0>;
+			interrupts = <0x0C 2 0x0D 2>;
+			interrupt-parent = < &mpic >;
+			tclk_period = <5>;
+			tmr_prsc = <200>;
+			tmr_add = <0xCCCCCCCD>;
+			cksel = <1>;
+			tmr_fiper1 = <0x3B9AC9FB>;
+			tmr_fiper2 = <0x0001869B>;
+		};
+
 		enet0: ethernet@24000 {
 			#address-cells = <1>;
 			#size-cells = <1>;
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0a0512a..389c0d9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_ATL2) += atlx/
 obj-$(CONFIG_ATL1E) += atl1e/
 obj-$(CONFIG_ATL1C) += atl1c/
 obj-$(CONFIG_GIANFAR) += gianfar_driver.o
+obj-$(CONFIG_PTP_1588_CLOCK_GIANFAR) += gianfar_ptp.o
 obj-$(CONFIG_TEHUTI) += tehuti.o
 obj-$(CONFIG_ENIC) += enic/
 obj-$(CONFIG_JME) += jme.o
diff --git a/drivers/net/gianfar_ptp.c b/drivers/net/gianfar_ptp.c
new file mode 100644
index 0000000..20be423
--- /dev/null
+++ b/drivers/net/gianfar_ptp.c
@@ -0,0 +1,308 @@
+/*
+ * PTP 1588 clock using the eTSEC
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/device.h>
+#include <linux/hrtimer.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
+#include <linux/timex.h>
+#include <asm/io.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+#include "gianfar_ptp_reg.h"
+
+#define REG_SIZE	(4 + TMR_ETTS2_L)
+
+struct etsects {
+	void *regs;
+	u32 tclk_period;  /* nanoseconds */
+	u32 tmr_prsc;
+	u32 tmr_add;
+	u32 cksel;
+	u32 tmr_fiper1;
+	u32 tmr_fiper2;
+};
+
+/* Private globals */
+static struct ptp_clock *gianfar_clock;
+static struct etsects the_clock;
+DEFINE_SPINLOCK(adjtime_lock);
+
+/*
+ * Register access functions
+ */
+
+static inline u32 reg_read(struct etsects *etsects, unsigned long offset)
+{
+	return in_be32(etsects->regs + offset);
+}
+
+static inline void reg_write(struct etsects *etsects,
+			     unsigned long offset, u32 val)
+{
+	out_be32(etsects->regs + offset, val);
+}
+
+static u64 tmr_cnt_read(struct etsects *etsects)
+{
+	u64 ns;
+	u32 lo, hi;
+	lo = reg_read(etsects, TMR_CNT_L);
+	hi = reg_read(etsects, TMR_CNT_H);
+	ns = ((u64) hi) << 32;
+	ns |= lo;
+	return ns;
+}
+
+static void tmr_cnt_write(struct etsects *etsects, u64 ns)
+{
+	u32 hi = ns >> 32;
+	u32 lo = ns & 0xffffffff;
+	reg_write(etsects, TMR_CNT_L, lo);
+	reg_write(etsects, TMR_CNT_H, hi);
+}
+
+static void set_alarm(struct etsects *etsects)
+{
+	u64 ns;
+	u32 lo, hi;
+	ns = tmr_cnt_read(etsects) + 1500000000ULL;
+	ns = div_u64(ns, 1000000000UL) * 1000000000ULL;
+	ns -= etsects->tclk_period;
+	hi = ns >> 32;
+	lo = ns & 0xffffffff;
+	reg_write(etsects, TMR_ALARM1_L, lo);
+	reg_write(etsects, TMR_ALARM1_H, hi);
+}
+
+static void set_fipers(struct etsects *etsects)
+{
+	u32 tmr_ctrl = reg_read(etsects, TMR_CTRL);
+
+	reg_write(etsects, TMR_CTRL,   tmr_ctrl & (~TE));
+	reg_write(etsects, TMR_PRSC,   etsects->tmr_prsc);
+	reg_write(etsects, TMR_FIPER1, etsects->tmr_fiper1);
+	reg_write(etsects, TMR_FIPER2, etsects->tmr_fiper2);
+	set_alarm(etsects);
+	reg_write(etsects, TMR_CTRL,   tmr_ctrl|TE);
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_gianfar_adjfreq(void *priv, s32 ppb)
+{
+	u64 adj;
+	u32 diff, tmr_add;
+	int neg_adj = 0;
+	struct etsects *etsects = priv;
+
+	if (!ppb)
+		return 0;
+
+	if (ppb < 0) {
+		neg_adj = 1;
+		ppb = -ppb;
+	}
+	tmr_add = etsects->tmr_add;
+	adj = tmr_add;
+	adj *= ppb;
+	diff = div_u64(adj, 1000000000ULL);
+
+	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
+
+	reg_write(etsects, TMR_ADD, tmr_add);
+
+	return 0;
+}
+
+static int ptp_gianfar_adjtime(void *priv, struct timespec *ts)
+{
+	s64 delta, now;
+	unsigned long flags;
+	struct etsects *etsects = priv;
+
+	delta = 1000000000LL * ts->tv_sec;
+	delta += ts->tv_nsec;
+
+	spin_lock_irqsave(&adjtime_lock, flags);
+
+	now = tmr_cnt_read(etsects);
+	now += delta;
+	tmr_cnt_write(etsects, now);
+
+	spin_unlock_irqrestore(&adjtime_lock, flags);
+
+	set_fipers(etsects);
+
+	return 0;
+}
+
+static int ptp_gianfar_gettime(void *priv, struct timespec *ts)
+{
+	u64 ns;
+	u32 remainder;
+	struct etsects *etsects = priv;
+	ns = tmr_cnt_read(etsects);
+	ts->tv_sec = div_u64_rem(ns, 1000000000, &remainder);
+	ts->tv_nsec = remainder;
+	return 0;
+}
+
+static int ptp_gianfar_settime(void *priv, struct timespec *ts)
+{
+	u64 ns;
+	struct etsects *etsects = priv;
+	ns = ts->tv_sec * 1000000000ULL;
+	ns += ts->tv_nsec;
+	tmr_cnt_write(etsects, ns);
+	set_fipers(etsects);
+	return 0;
+}
+
+static int ptp_gianfar_enable(void *priv, struct ptp_clock_request rq, int on)
+{
+	/* We do not (yet) offer any ancillary features. */
+	return -EINVAL;
+}
+
+struct ptp_clock_info ptp_gianfar_caps = {
+	.owner		= THIS_MODULE,
+	.name		= "gianfar clock",
+	.max_adj	= 512000,
+	.n_alarm	= 0,
+	.n_ext_ts	= 0,
+	.n_per_out	= 0,
+	.pps		= 0,
+	.priv		= &the_clock,
+	.adjfreq	= ptp_gianfar_adjfreq,
+	.adjtime	= ptp_gianfar_adjtime,
+	.gettime	= ptp_gianfar_gettime,
+	.settime	= ptp_gianfar_settime,
+	.enable		= ptp_gianfar_enable,
+};
+
+/* OF device tree */
+
+static int get_of_u32(struct device_node *node, char *str, u32 *val)
+{
+	int plen;
+	const u32 *prop = of_get_property(node, str, &plen);
+
+	if (!prop || plen != sizeof(*prop))
+		return -1;
+	*val = *prop;
+	return 0;
+}
+
+static int gianfar_ptp_probe(struct of_device* dev,
+			     const struct of_device_id *match)
+{
+	u64 addr, size;
+	struct device_node *node = dev->node;
+	struct etsects *etsects = &the_clock;
+	struct timespec now;
+	phys_addr_t reg_addr;
+	unsigned long reg_size;
+	u32 tmr_ctrl;
+
+	if (get_of_u32(node, "tclk_period", &etsects->tclk_period) ||
+	    get_of_u32(node, "tmr_prsc", &etsects->tmr_prsc) ||
+	    get_of_u32(node, "tmr_add", &etsects->tmr_add) ||
+	    get_of_u32(node, "cksel", &etsects->cksel) ||
+	    get_of_u32(node, "tmr_fiper1", &etsects->tmr_fiper1) ||
+	    get_of_u32(node, "tmr_fiper2", &etsects->tmr_fiper2))
+		return -ENODEV;
+
+	addr = of_translate_address(node, of_get_address(node, 0, &size, NULL));
+	reg_addr = addr;
+	reg_size = size;
+	if (reg_size < REG_SIZE) {
+		pr_warning("device tree reg range %lu too small\n", reg_size);
+		reg_size = REG_SIZE;
+	}
+	etsects->regs = ioremap(reg_addr, reg_size);
+	if (!etsects->regs) {
+		pr_err("ioremap ptp registers failed\n");
+		return -EINVAL;
+	}
+
+	tmr_ctrl =
+	  (etsects->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
+	  (etsects->cksel & CKSEL_MASK) << CKSEL_SHIFT;
+
+	getnstimeofday(&now);
+	ptp_gianfar_settime(etsects, &now);
+
+	reg_write(etsects, TMR_CTRL,   tmr_ctrl);
+	reg_write(etsects, TMR_ADD,    etsects->tmr_add);
+	reg_write(etsects, TMR_PRSC,   etsects->tmr_prsc);
+	reg_write(etsects, TMR_FIPER1, etsects->tmr_fiper1);
+	reg_write(etsects, TMR_FIPER2, etsects->tmr_fiper2);
+	set_alarm(etsects);
+	reg_write(etsects, TMR_CTRL,   tmr_ctrl|FS|RTPE|TE);
+
+	gianfar_clock = ptp_clock_register(&ptp_gianfar_caps);
+
+	return IS_ERR(gianfar_clock) ? PTR_ERR(gianfar_clock) : 0;
+}
+
+static int gianfar_ptp_remove(struct of_device* dev)
+{
+	ptp_clock_unregister(gianfar_clock);
+	iounmap(the_clock.regs);
+	return 0;
+}
+
+static struct of_device_id match_table[] = {
+	{ .type = "ptp_clock" },
+	{},
+};
+
+static struct of_platform_driver gianfar_ptp_driver = {
+	.name        = "gianfar_ptp",
+	.match_table = match_table,
+	.owner       = THIS_MODULE,
+	.probe       = gianfar_ptp_probe,
+	.remove      = gianfar_ptp_remove,
+};
+
+/* module operations */
+
+static void __exit ptp_gianfar_exit(void)
+{
+	of_unregister_platform_driver(&gianfar_ptp_driver);
+}
+
+static int __init ptp_gianfar_init(void)
+{
+	return of_register_platform_driver(&gianfar_ptp_driver);
+}
+
+subsys_initcall(ptp_gianfar_init);
+module_exit(ptp_gianfar_exit);
+
+MODULE_AUTHOR("Richard Cochran <richard.cochran@omicron.at>");
+MODULE_DESCRIPTION("PTP clock using the eTSEC");
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/gianfar_ptp_reg.h b/drivers/net/gianfar_ptp_reg.h
new file mode 100644
index 0000000..9c94677
--- /dev/null
+++ b/drivers/net/gianfar_ptp_reg.h
@@ -0,0 +1,107 @@
+/* gianfar_ptp_reg.h
+ * Generated by regen.tcl on Thu Apr 15 02:21:02 PM CEST 2010
+ *
+ * PTP 1588 clock using the gianfar eTSEC
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#ifndef _GIANFAR_PTP_REG_H_
+#define _GIANFAR_PTP_REG_H_
+
+#define TMR_CTRL              0x00000000 /* Timer control register */
+#define TMR_TEVENT            0x00000004 /* Timestamp event register */
+#define TMR_TEMASK            0x00000008 /* Timer event mask register */
+#define TMR_PEVENT            0x0000000c /* Timestamp event register */
+#define TMR_PEMASK            0x00000010 /* Timer event mask register */
+#define TMR_STAT              0x00000014 /* Timestamp status register */
+#define TMR_CNT_H             0x00000018 /* Timer counter high register */
+#define TMR_CNT_L             0x0000001c /* Timer counter low register */
+#define TMR_ADD               0x00000020 /* Timer drift compensation addend register */
+#define TMR_ACC               0x00000024 /* Timer accumulator register */
+#define TMR_PRSC              0x00000028 /* Timer prescale */
+#define TMROFF_H              0x00000030 /* Timer offset high */
+#define TMROFF_L              0x00000034 /* Timer offset low */
+#define TMR_ALARM1_H          0x00000040 /* Timer alarm 1 high register */
+#define TMR_ALARM1_L          0x00000044 /* Timer alarm 1 high register */
+#define TMR_ALARM2_H          0x00000048 /* Timer alarm 2 high register */
+#define TMR_ALARM2_L          0x0000004c /* Timer alarm 2 high register */
+#define TMR_FIPER1            0x00000080 /* Timer fixed period interval */
+#define TMR_FIPER2            0x00000084 /* Timer fixed period interval */
+#define TMR_FIPER3            0x00000088 /* Timer fixed period interval */
+#define TMR_ETTS1_H           0x000000a0 /* Timestamp of general purpose external trigger */
+#define TMR_ETTS1_L           0x000000a4 /* Timestamp of general purpose external trigger */
+#define TMR_ETTS2_H           0x000000a8 /* Timestamp of general purpose external trigger */
+#define TMR_ETTS2_L           0x000000ac /* Timestamp of general purpose external trigger */
+
+/* Bit definitions for the TMR_CTRL register */
+#define ALM1P                 (1<<31) /* Alarm1 output polarity */
+#define ALM2P                 (1<<30) /* Alarm2 output polarity */
+#define FS                    (1<<28) /* FIPER start indication */
+#define PP1L                  (1<<27) /* Fiper1 pulse loopback mode enabled. */
+#define PP2L                  (1<<26) /* Fiper2 pulse loopback mode enabled. */
+#define TCLK_PERIOD_SHIFT     (16) /* 1588 timer reference clock period. */
+#define TCLK_PERIOD_MASK      (0x3ff)
+#define RTPE                  (1<<15) /* Record Tx Timestamp to PAL Enable. */
+#define FRD                   (1<<14) /* FIPER Realignment Disable */
+#define ESFDP                 (1<<11) /* External Tx/Rx SFD Polarity. */
+#define ESFDE                 (1<<10) /* External Tx/Rx SFD Enable. */
+#define ETEP2                 (1<<9) /* External trigger 2 edge polarity */
+#define ETEP1                 (1<<8) /* External trigger 1 edge polarity */
+#define COPH                  (1<<7) /* Generated clock (TSEC_1588_GCLK) output phase. */
+#define CIPH                  (1<<6) /* External oscillator input clock phase. */
+#define TMSR                  (1<<5) /* Timer soft reset. When enabled, it resets all the timer registers and state machines. */
+#define BYP                   (1<<3) /* Bypass drift compensated clock */
+#define TE                    (1<<2) /* 1588 timer enable. If not enabled, all the timer registers and state machines are disabled. */
+#define CKSEL_SHIFT           (0) /* 1588 Timer reference clock source select. */
+#define CKSEL_MASK            (0x3)
+
+/* Bit definitions for the TMR_TEVENT register */
+#define ETS2                  (1<<25) /* External trigger 2 timestamp sampled */
+#define ETS1                  (1<<24) /* External trigger 1 timestamp sampled */
+#define ALM2                  (1<<17) /* Current time equaled alarm time register 2 */
+#define ALM1                  (1<<16) /* Current time equaled alarm time register 1 */
+#define PP1                   (1<<7) /* Indicates that a periodic pulse has been generated based on FIPER1 register */
+#define PP2                   (1<<6) /* Indicates that a periodic pulse has been generated based on FIPER2 register */
+#define PP3                   (1<<5) /* Indicates that a periodic pulse has been generated based on FIPER3 register */
+
+/* Bit definitions for the TMR_TEMASK register */
+#define ETS2EN                (1<<25) /* External trigger 2 timestamp sample event enable */
+#define ETS1EN                (1<<24) /* External trigger 1 timestamp sample event enable */
+#define ALM2EN                (1<<17) /* Timer ALM2 event enable */
+#define ALM1EN                (1<<16) /* Timer ALM1 event enable */
+#define PP1EN                 (1<<7) /* Periodic pulse event 1 enable */
+#define PP2EN                 (1<<6) /* Periodic pulse event 2 enable */
+
+/* Bit definitions for the TMR_PEVENT register */
+#define TXP2                  (1<<9) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS2 register */
+#define TXP1                  (1<<8) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS1 register */
+#define RXP                   (1<<0) /* Indicates that a PTP frame has been received */
+
+/* Bit definitions for the TMR_PEMASK register */
+#define TXP2EN                (1<<9) /* Transmit PTP packet event 2 enable */
+#define TXP1EN                (1<<8) /* Transmit PTP packet event 1 enable */
+#define RXPEN                 (1<<0) /* Receive PTP packet event enable */
+
+/* Bit definitions for the TMR_STAT register */
+#define STAT_VEC_SHIFT        (0) /* Timer general purpose status vector */
+#define STAT_VEC_MASK         (0x3f)
+
+/* Bit definitions for the TMR_PRSC register */
+#define PRSC_OCK_SHIFT        (0) /* Output clock division/prescale factor. */
+#define PRSC_OCK_MASK         (0xffff)
+
+#endif
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 9390d44..3b7bd73 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -35,4 +35,17 @@ config PTP_1588_CLOCK_LINUX
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_linux.
 
+config PTP_1588_CLOCK_GIANFAR
+	tristate "Freescale eTSEC as PTP clock"
+	depends on PTP_1588_CLOCK
+	depends on GIANFAR
+	help
+	  This driver adds support for using the eTSEC as a PTP
+	  clock. This clock is only useful if your PTP programs are
+	  getting hardware time stamps on the PTP Ethernet packets
+	  using the SO_TIMESTAMPING API.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called gianfar_ptp.
+
 endmenu
-- 
1.6.0.4


^ permalink raw reply related

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Eric W. Biederman @ 2010-04-30  5:24 UTC (permalink / raw)
  To: Greg KH
  Cc: Tejun Heo, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <20100430044522.GA29845@suse.de>

Greg KH <gregkh@suse.de> writes:

> On Fri, Apr 30, 2010 at 06:18:53AM +0200, Tejun Heo wrote:
>> On 04/29/2010 10:29 PM, gregkh@suse.de wrote:
>> > 
>> > This is a note to let you know that I've just added the patch titled
>> > 
>> >     Subject: sysfs: Implement sysfs tagged directory support.
>> > 
>> > to my gregkh-2.6 tree.  Its filename is
>> > 
>> >     sysfs-implement-sysfs-tagged-directory-support.patch
>> > 
>> > This tree can be found at 
>> >     http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/
>> 
>> I wish at least more comments are added before it goes mainline.  I
>> don't really understand the current form.
>
> Ok, that's fine with me, I'll pull it back out.

?????

Tejun you have offered nothing constructive to the review, except looking
and saying you don't understand what is going on.

I have a tree posted with all of my code.  I have given snippets of the
pieces yet to be merged, and still your reaction is you don't understand
please break it down for you in itty-bitty little pieces, that you don't
need to think about it, to understand it.

Tejun I think for the code to make any sense to you I would need to rip
out out and/or rewrite the kobject layer, and possible the device
model code as well.

Tejun I'm sorry you can't understand the code, and I'm sorry the code
may be over-general.  In part that is because making the code
over-general is what you asked for when reviewing it the first time.


Greg I have not gotten any constructive feedback.  Not a specific
please fix/or comment a specific thing.  Not a comment that
says something is a bug and just wrong.  The closest I have gotten
is a request to make the code even more complicated and intrusive,
and harder to keep correct by adding an ns member to kobjects
which comes to 3 copies of the same state for the same objects
which ultimately is more difficult to keep in sync.

I am more than happy to improve the code, but at this point I really
think the code needs to be merged so people are forced to deal with
it, instead of saying "I don't understand the code" in the review and
blocking the merge.  I don't think the code will improve any more by
being out of tree.

Eric

^ permalink raw reply

* [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Amerigo Wang @ 2010-04-30  8:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, ebiederm, adobriyan, David Miller
In-Reply-To: <20100430082912.5630.82405.sendpatchset@localhost.localdomain>

From: Octavian Purdila <opurdila@ixiacom.com>

The new function can be used to read/write large bitmaps via /proc. A
comma separated range format is used for compact output and input
(e.g. 1,3-4,10-10).

Writing into the file will first reset the bitmap then update it
based on the given input.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/include/linux/sysctl.h
===================================================================
--- linux-2.6.orig/include/linux/sysctl.h
+++ linux-2.6/include/linux/sysctl.h
@@ -980,6 +980,8 @@ extern int proc_doulongvec_minmax(struct
 				  void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
 				      void __user *, size_t *, loff_t *);
+extern int proc_do_large_bitmap(struct ctl_table *, int,
+				void __user *, size_t *, loff_t *);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2049,6 +2049,16 @@ static size_t proc_skip_spaces(char **bu
 	return ret;
 }
 
+static void proc_skip_char(char **buf, size_t *size, const char v)
+{
+	while (*size) {
+		if (**buf != v)
+			break;
+		(*size)--;
+		(*buf)++;
+	}
+}
+
 #define TMPBUFLEN 22
 /**
  * proc_get_long - reads an ASCII formated integer from a user buffer
@@ -2675,6 +2685,153 @@ static int proc_do_cad_pid(struct ctl_ta
 	return 0;
 }
 
+/**
+ * proc_do_large_bitmap - read/write from/to a large bitmap
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * The bitmap is stored at table->data and the bitmap length (in bits)
+ * in table->maxlen.
+ *
+ * We use a range comma separated format (e.g. 1,3-4,10-10) so that
+ * large bitmaps may be represented in a compact manner. Writing into
+ * the file will clear the bitmap then update it with the given input.
+ *
+ * Returns 0 on success.
+ */
+int proc_do_large_bitmap(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err = 0;
+	bool first = 1;
+	size_t left = *lenp;
+	unsigned long bitmap_len = table->maxlen;
+	unsigned long *bitmap = (unsigned long *) table->data;
+	unsigned long *tmp_bitmap = NULL;
+	char tr_a[] = { '-', ',', '\n' }, tr_b[] = { ',', '\n', 0 }, c;
+
+	if (!bitmap_len || !left || (*ppos && !write)) {
+		*lenp = 0;
+		return 0;
+	}
+
+	if (write) {
+		unsigned long page = 0;
+		char *kbuf;
+
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			free_page(page);
+			return -EFAULT;
+                }
+		kbuf[left] = 0;
+
+		tmp_bitmap = kzalloc(BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long),
+				     GFP_KERNEL);
+		if (!tmp_bitmap) {
+			free_page(page);
+			return -ENOMEM;
+		}
+		proc_skip_char(&kbuf, &left, '\n');
+		while (!err && left) {
+			unsigned long val_a, val_b;
+			bool neg;
+
+			err = proc_get_long(&kbuf, &left, &val_a, &neg, tr_a,
+					     sizeof(tr_a), &c);
+			if (err)
+				break;
+			if (val_a >= bitmap_len || neg) {
+				err = -EINVAL;
+				break;
+			}
+
+			val_b = val_a;
+			if (left) {
+				kbuf++;
+				left--;
+			}
+
+			if (c == '-') {
+				err = proc_get_long(&kbuf, &left, &val_b,
+						     &neg, tr_b, sizeof(tr_b),
+						     &c);
+				if (err)
+					break;
+				if (val_b >= bitmap_len || neg ||
+				    val_a > val_b) {
+					err = -EINVAL;
+					break;
+				}
+				if (left) {
+					kbuf++;
+					left--;
+				}
+			}
+
+			while (val_a <= val_b)
+				set_bit(val_a++, tmp_bitmap);
+
+			first = 0;
+			proc_skip_char(&kbuf, &left, '\n');
+		}
+		free_page(page);
+	} else {
+		unsigned long bit_a, bit_b = 0;
+
+		while (left) {
+			bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
+			if (bit_a >= bitmap_len)
+				break;
+			bit_b = find_next_zero_bit(bitmap, bitmap_len,
+						   bit_a + 1) - 1;
+
+			if (!first) {
+				err = proc_put_char(&buffer, &left, ',');
+				if (err)
+					break;
+			}
+			err = proc_put_long(&buffer, &left, bit_a, false);
+			if (err)
+				break;
+			if (bit_a != bit_b) {
+				err = proc_put_char(&buffer, &left, '-');
+				if (err)
+					break;
+				err = proc_put_long(&buffer, &left, bit_b, false);
+				if (err)
+					break;
+			}
+
+			first = 0; bit_b++;
+		}
+		if (!err)
+			err = proc_put_char(&buffer, &left, '\n');
+	}
+
+	if (!err) {
+		if (write)
+			memcpy(bitmap, tmp_bitmap,
+			       BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long));
+		kfree(tmp_bitmap);
+		*lenp -= left;
+		*ppos += *lenp;
+		return 0;
+	} else {
+		kfree(tmp_bitmap);
+		return err;
+	}
+}
+
 #else /* CONFIG_PROC_FS */
 
 int proc_dostring(struct ctl_table *table, int write,

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-04-29 14:32 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <201004291427.31540.arnd@arndb.de>

On 4/29/10 5:27 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Thursday 29 April 2010, Scott Feldman wrote:
>> Yes, that's right.  Without going into implementation details, assume any
>> enic has firmware with private mgmt channel to switch to do the equivalent
>> of your base device->LLDP->switch.
> 
> Ok, now it all makes a *lot* more sense, thanks for the clarification!
> 
> For my curiosity, can you point to any documentation about what's actually
> going on on the wire?

I don't believe those links are available at this time.

> Is it possible or planned to implement the same protocol in Linux so you
> can do it with Cisco switches and cheap non-IOV NICs?

That seems very possible from a technical standpoint.  I don't think the
port-profile netlink API we're specing out excludes that option.
 
>>    ip port_profile set DEVICE [ base DEVICE ] [ { pre_associate |
>>                                                   pre_associate_rr } ]
>>                               { name PORT-PROFILE | vsi MGR:VTID:VER }

BTW, I was meaning to ask: is there a way to role the vsi tuple and the
flags up into a single identifier, say a string like PORT-PROFILE?  I'm
asking because it seems awkward from an admin's perspective to know how to
construct a vsi tuple or to know what pre_associate_rr means. I have to
admit I didn't fully grok what pre_associate_rr means myself.  Even if there
was a simple local database to map named port-profiles to the underlying
{vsi tuple, flags}, that would bring us closer to a more consistent user
interface.  Is this possible?

>> 2. Future enic for pass-thru case where base != target.  We get:
>> 
>>     ip port_profile set eth1 base eth0 name joes-garage ...
>> 
>> And
>> 
>>     eth0:ndi_set_port_profile(eth1, ...)
> 
> Is eth1 the static device and eth0 the dynamic device in this scenario
> or the other way round?

eth0 is the static and eth1 is the dynamic.  So eth0 is the base device.
(The PF in SR-IOV parlance).
 
> Wouldn't you still require access to both devices from the host root
> network namespace here or do you just ignore the identifier for the
> dynamic device here?

The dynamic device is the one to apply the port-profile to (we'll, I should
say to apply to the dynamic's devices switch port).  So we need the dynamic
device identified.
  
>> 3. Future VEPA, we get:
>> 
>>     ip port_profile set eth11 base eth10 vsi 1:23456:7
>> 
>> And (here netlink msg handled in user-space):
>> 
>>     VDP msg sent on eth10 to set port-profile on eth11 using vis tuple
>      
> Yes. I'd prefer still requiring to pass the mac and vlan addresses in this
> case, but seems workable.

Sorry, I forgot the "..." in this use-case.  I didn't mean to exclude the
other params in the example cmd line.
 
>> Does this work?  I want to get agreement before coding up patch attempt #4.
> 
> Seems ok for all I can see at this point, other than the complexity
> that results from doing two network protocols through a single netlink
> protocol. Maybe Jens and Chris can comment some more on this.

Ok, thanks Arnd.  I'll start coding this up now, hedging that the design is
set before hearing back from Jens/Chris.

-scott


^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-29 13:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272547307.2222.83.camel@edumazet-laptop>

On Thu, 2010-04-29 at 15:21 +0200, Eric Dumazet wrote:


> 
> You could try following program :
> 

Will do later today (test machine is not on the network and is about 20
minutes from here; so worst case i will get you results by end of day)
I guess this program is good enough since it tells me the system wide
ipi count - what my patch did was also to break it down by which cpu got
how many IPIs (served to check if there was uneven distribution)

> 
> Is your application mono threaded and receiving data to 8 sockets ?
> 

I fork one instance per detected cpu and bind to different ports each
time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.

cheers,
jamal


^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: Thomas Gleixner @ 2010-04-29 19:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Andi Kleen, netdev, Andi Kleen, Peter Zijlstra
In-Reply-To: <20100429111047.031eeff9@nehalam>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2800 bytes --]

On Thu, 29 Apr 2010, Stephen Hemminger wrote:
> > Le jeudi 29 avril 2010 à 19:42 +0200, Andi Kleen a écrit :
> > > > Andi, what do you think of this one ?
> > > > Dont we have a function to send an IPI to an individual cpu instead ?  
> > > 
> > > That's what this function already does. You only set a single CPU 
> > > in the target mask, right?
> > > 
> > > IPIs are unfortunately always a bit slow. Nehalem-EX systems have X2APIC
> > > which is a bit faster for this, but that's not available in the lower
> > > end Nehalems. But even then it's not exactly fast.
> > > 
> > > I don't think the IPI primitive can be optimized much. It's not a cheap 
> > > operation.
> > > 
> > > If it's a problem do it less often and batch IPIs.
> > > 
> > > It's essentially the same problem as interrupt mitigation or NAPI 
> > > are solving for NICs. I guess just need a suitable mitigation mechanism.
> > > 
> > > Of course that would move more work to the sending CPU again, but 
> > > perhaps there's no alternative. I guess you could make it cheaper it by
> > > minimizing access to packet data.
> > > 
> > > -Andi  
> > 
> > Well, IPI are already batched, and rate is auto adaptative.
> > 
> > After various changes, it seems things are going better, maybe there is
> > something related to cache line trashing.
> > 
> > I 'solved' it by using idle=poll, but you might take a look at
> > clockevents_notify (acpi_idle_enter_bm) abuse of a shared and higly
> > contended spinlock...

Say thanks to Intel/AMD for providing us timers which stop in lower
c-states.

Not much we can do about the broadcast lock when several cores are
going idle and we need to setup a global timer to work around the
lapic timer stops in C2/C3 issue.

Simply the C-state timer broadcasting does not scale. And it was never
meant to scale. It's a workaround for laptops to have functional NOHZ.

There are several ways to work around that on larger machines:

 - Restrict c-states
 - Disable NOHZ and highres timers
 - idle=poll is definitely the worst of all possible solutions

> I keep getting asked about taking some core's away from clock and scheduler
> to be reserved just for network processing. Seeing this kind of stuff
> makes me wonder if maybe that isn't a half bad idea.

This comes up every few month and we pointed out several times what
needs to be done to make this work w/o these weird hacks which put a
core offline and then start some magic undebugable binary blob on it.
We have not seen anyone working on this, but the "set cores aside and
let them do X" idea seems to stick in peoples heads.

Seriously, that's not a solution. It's going to be some hacked up
nightmare which is completely unmaintainable.

Aside of that I seriously doubt that you can do networking w/o time
and timers.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Thomas Gleixner @ 2010-04-29 20:39 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1272572930.9614.3.camel@localhost>

On Thu, 29 Apr 2010, Peter P Waskiewicz Jr wrote:
> On Thu, 2010-04-29 at 12:48 -0700, Thomas Gleixner wrote:
> > Thinking more about it, I wonder whether you have a cpu_mask in your
> > driver/device private data anyway. I bet you have :)
> 
> Well, at this point we don't, but nothing says we can't.

Somewhere you need to store that information in your driver, right ?

> > So it should be sufficient to set a pointer to that cpu_mask in
> > irq_desc and get rid of the callback completely.
> 
> So "register" would just assign the pointer, and "unregister" would make
> sure to NULL the mask pointer out.  I like it.  It'll sure clean things
> up too.

Yep, that'd be like the set_irq_chip() function. Just assign the
pointer under desc->lock.

Thanks,

	tglx

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Tejun Heo @ 2010-04-30  6:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg KH, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <4BDA6C90.9010303@kernel.org>

Hello,

On 04/30/2010 07:37 AM, Tejun Heo wrote:
>> Tejun I think for the code to make any sense to you I would need to rip
>> out out and/or rewrite the kobject layer, and possible the device
>> model code as well.
> 
> And yes, in the long run, please do that.

Let me add a little bit here just in case.

IIRC, the initial sysfs tag support wasn't too different from the
interface side but the implementation inside sysfs was very hacky, so
I complained on both accounts.  You did a wonderful job of
restructuing sysfs so that it basically behaves as a distributed file
system and exposing different subsets is not too hacky and doesn't
violate layering.  Your work there was admirable and much better than
what I had in mind.

I don't think the hooking part from NSes would be as major an overhaul
as you're suggesting above.  I haven't tried it so I can't tell with
certainty (again so no nack) but I just can't imagine it being that
difficult or major after all the things you've done in the area and
was hoping that you repost something which is more digestible soonish.

I'm sorry that I couldn't be more specific but I'm almost sure you'll
be able to come up with something better than what I can think of.
So, yeah, please,

* Add comments to the current implementation.  If for nothing else,
  for cases where you get sick of it for some time and other people
  have to work on it before you feel like coming back.

* While doing that, please add "FIXME: or TODO:" comments describing
  what would be a better direction in the long run.

Thanks.

-- 
tejun

^ permalink raw reply

* patch driver-core-implement-ns-directory-support-for-device-classes.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	gregkh, kay.sievers, netdev, serue
In-Reply-To: <1269973889-25260-6-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: driver core: Implement ns directory support for device classes.

to my gregkh-2.6 tree.  Its filename is

    driver-core-implement-ns-directory-support-for-device-classes.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:47:44 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:29 -0700
Subject: driver core: Implement ns directory support for device classes.
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@xmission.com>, Benjamin Thery <benjamin.thery@bull.net>
Message-ID: <1269973889-25260-6-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@xmission.com>

device_del and device_rename were modified to use
sysfs_delete_link and sysfs_rename_link respectively to ensure
when these operations happen on devices whose classes
are in namespace directories they work properly.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/base/core.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -786,7 +786,7 @@ out_device:
 out_busid:
 	if (dev->kobj.parent != &dev->class->p->class_subsys.kobj &&
 	    device_is_not_partition(dev))
-		sysfs_remove_link(&dev->class->p->class_subsys.kobj,
+		sysfs_delete_link(&dev->class->p->class_subsys.kobj, &dev->kobj,
 				  dev_name(dev));
 #else
 	/* link in the class directory pointing to the device */
@@ -804,7 +804,7 @@ out_busid:
 	return 0;
 
 out_busid:
-	sysfs_remove_link(&dev->class->p->class_subsys.kobj, dev_name(dev));
+	sysfs_delete_link(&dev->class->p->class_subsys.kobj, &dev->kobj, dev_name(dev));
 #endif
 
 out_subsys:
@@ -832,13 +832,13 @@ static void device_remove_class_symlinks
 
 	if (dev->kobj.parent != &dev->class->p->class_subsys.kobj &&
 	    device_is_not_partition(dev))
-		sysfs_remove_link(&dev->class->p->class_subsys.kobj,
+		sysfs_delete_link(&dev->class->p->class_subsys.kobj, &dev->kobj,
 				  dev_name(dev));
 #else
 	if (dev->parent && device_is_not_partition(dev))
 		sysfs_remove_link(&dev->kobj, "device");
 
-	sysfs_remove_link(&dev->class->p->class_subsys.kobj, dev_name(dev));
+	sysfs_delete_link(&dev->class->p->class_subsys.kobj, &dev->kobj, dev_name(dev));
 #endif
 
 	sysfs_remove_link(&dev->kobj, "subsystem");
@@ -1624,6 +1624,14 @@ int device_rename(struct device *dev, ch
 		goto out;
 	}
 
+#ifndef CONFIG_SYSFS_DEPRECATED
+	if (dev->class) {
+		error = sysfs_rename_link(&dev->class->p->class_subsys.kobj,
+			&dev->kobj, old_device_name, new_name);
+		if (error)
+			goto out;
+	}
+#endif
 	error = kobject_rename(&dev->kobj, new_name);
 	if (error)
 		goto out;
@@ -1638,11 +1646,6 @@ int device_rename(struct device *dev, ch
 						  new_class_name);
 		}
 	}
-#else
-	if (dev->class) {
-		error = sysfs_rename_link(&dev->class->p->class_subsys.kobj,
-					  &dev->kobj, old_device_name, new_name);
-	}
 #endif
 
 out:


^ permalink raw reply

* patch sysfs-implement-sysfs_delete_link.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, benjamin.thery, cornelia.huck, dlezcano,
	eric.dumazet, gregkh, kay.sievers, netdev
In-Reply-To: <1269973889-25260-5-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: sysfs: Implement sysfs_delete_link

to my gregkh-2.6 tree.  Its filename is

    sysfs-implement-sysfs_delete_link.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:47:27 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:28 -0700
Subject: sysfs: Implement sysfs_delete_link
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@xmission.com>, Benjamin Thery <benjamin.thery@bull.net>, Daniel Lezcano <dlezcano@fr.ibm.com>
Message-ID: <1269973889-25260-5-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@xmission.com>

When removing a symlink sysfs_remove_link does not provide
enough information to figure out which tagged directory the symlink
falls in.  So I need sysfs_delete_link which is passed the target
of the symlink to delete.

sysfs_rename_link is updated to call sysfs_delete_link instead
of sysfs_remove_link as we have all of the information necessary
and the callers are interesting.

Both of these functions now have enough information to find a symlink
in a tagged directory.  The only restriction is that they must be called
before the target kobject is renamed or deleted.  If they are called
later I loose track of which tag the target kobject was marked with
and can no longer find the old symlink to remove it.

This patch was split from an earlier patch.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/sysfs/symlink.c    |   20 ++++++++++++++++++++
 include/linux/sysfs.h |    8 ++++++++
 2 files changed, 28 insertions(+)

--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -109,6 +109,26 @@ int sysfs_create_link_nowarn(struct kobj
 }
 
 /**
+ *	sysfs_delete_link - remove symlink in object's directory.
+ *	@kobj:	object we're acting for.
+ *	@targ:	object we're pointing to.
+ *	@name:	name of the symlink to remove.
+ *
+ *	Unlike sysfs_remove_link sysfs_delete_link has enough information
+ *	to successfully delete symlinks in tagged directories.
+ */
+void sysfs_delete_link(struct kobject *kobj, struct kobject *targ,
+			const char *name)
+{
+	const void *ns = NULL;
+	spin_lock(&sysfs_assoc_lock);
+	if (targ->sd)
+		ns = targ->sd->s_ns;
+	spin_unlock(&sysfs_assoc_lock);
+	sysfs_hash_and_remove(kobj->sd, ns, name);
+}
+
+/**
  *	sysfs_remove_link - remove symlink in object's directory.
  *	@kobj:	object we're acting for.
  *	@name:	name of the symlink to remove.
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -155,6 +155,9 @@ void sysfs_remove_link(struct kobject *k
 int sysfs_rename_link(struct kobject *kobj, struct kobject *target,
 			const char *old_name, const char *new_name);
 
+void sysfs_delete_link(struct kobject *dir, struct kobject *targ,
+			const char *name);
+
 int __must_check sysfs_create_group(struct kobject *kobj,
 				    const struct attribute_group *grp);
 int sysfs_update_group(struct kobject *kobj,
@@ -269,6 +272,11 @@ static inline int sysfs_rename_link(stru
 	return 0;
 }
 
+static inline void sysfs_delete_link(struct kobject *k, struct kobject *t,
+				     const char *name)
+{
+}
+
 static inline int sysfs_create_group(struct kobject *kobj,
 				     const struct attribute_group *grp)
 {


^ permalink raw reply

* patch sysfs-add-support-for-tagged-directories-with-untagged-members.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, cornelia.huck, ebiederm, ebiederm, eric.dumazet,
	gregkh, kay.sievers
In-Reply-To: <1269973889-25260-4-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: sysfs: Add support for tagged directories with untagged members.

to my gregkh-2.6 tree.  Its filename is

    sysfs-add-support-for-tagged-directories-with-untagged-members.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:47:00 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:27 -0700
Subject: sysfs: Add support for tagged directories with untagged members.
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@maxwell.aristanetworks.com>, "Eric W. Biederman" <ebiederm@aristanetworks.com>
Message-ID: <1269973889-25260-4-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>

I had hopped to avoid this but the bonding driver adds a file
to /sys/class/net/  and the easiest way to handle that file is
to make it untagged and to register it only once.

So relax the rules on tagged directories, and make bonding work.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/sysfs/dir.c   |   12 +++---------
 fs/sysfs/inode.c |    2 ++
 2 files changed, 5 insertions(+), 9 deletions(-)

--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -383,12 +383,6 @@ int __sysfs_add_one(struct sysfs_addrm_c
 	if (sysfs_find_dirent(acxt->parent_sd, sd->s_ns, sd->s_name))
 		return -EEXIST;
 
-	if (sysfs_ns_type(acxt->parent_sd) && !sd->s_ns) {
-		WARN(1, KERN_WARNING "sysfs: ns required in '%s' for '%s'\n",
-			acxt->parent_sd->s_name, sd->s_name);
-		return -EINVAL;
-	}
-
 	sd->s_parent = sysfs_get(acxt->parent_sd);
 
 	sysfs_link_sibling(sd);
@@ -545,7 +539,7 @@ struct sysfs_dirent *sysfs_find_dirent(s
 	struct sysfs_dirent *sd;
 
 	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
-		if (sd->s_ns != ns)
+		if (ns && sd->s_ns && (sd->s_ns != ns))
 			continue;
 		if (!strcmp(sd->s_name, name))
 			return sd;
@@ -879,7 +873,7 @@ static struct sysfs_dirent *sysfs_dir_po
 		while (pos && (ino > pos->s_ino))
 			pos = pos->s_sibling;
 	}
-	while (pos && pos->s_ns != ns)
+	while (pos && pos->s_ns && pos->s_ns != ns)
 		pos = pos->s_sibling;
 	return pos;
 }
@@ -890,7 +884,7 @@ static struct sysfs_dirent *sysfs_dir_ne
 	pos = sysfs_dir_pos(ns, parent_sd, ino, pos);
 	if (pos)
 		pos = pos->s_sibling;
-	while (pos && pos->s_ns != ns)
+	while (pos && pos->s_ns && pos->s_ns != ns)
 		pos = pos->s_sibling;
 	return pos;
 }
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -335,6 +335,8 @@ int sysfs_hash_and_remove(struct sysfs_d
 	sysfs_addrm_start(&acxt, dir_sd);
 
 	sd = sysfs_find_dirent(dir_sd, ns, name);
+	if (sd && (sd->s_ns != ns))
+		sd = NULL;
 	if (sd)
 		sysfs_remove_one(&acxt, sd);
 


^ permalink raw reply

* patch kobj-add-basic-infrastructure-for-dealing-with-namespaces.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, cornelia.huck, eric.dumazet, gregkh, kay.sievers,
	netdev, serue, tj
In-Reply-To: <1269973889-25260-2-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: kobj: Add basic infrastructure for dealing with namespaces.

to my gregkh-2.6 tree.  Its filename is

    kobj-add-basic-infrastructure-for-dealing-with-namespaces.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:44:32 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:25 -0700
Subject: kobj: Add basic infrastructure for dealing with namespaces.
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@xmission.com>
Message-ID: <1269973889-25260-2-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@xmission.com>

Move complete knowledge of namespaces into the kobject layer
so we can use that information when reporting kobjects to
userspace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/base/class.c    |    9 ++++
 drivers/base/core.c     |   77 +++++++++++++++++++++++++++++------
 include/linux/device.h  |    3 +
 include/linux/kobject.h |   26 ++++++++++++
 lib/kobject.c           |  103 ++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 204 insertions(+), 14 deletions(-)

--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -63,6 +63,14 @@ static void class_release(struct kobject
 	kfree(cp);
 }
 
+static const struct kobj_ns_type_operations *class_child_ns_type(struct kobject *kobj)
+{
+	struct class_private *cp = to_class(kobj);
+	struct class *class = cp->class;
+
+	return class->ns_type;
+}
+
 static const struct sysfs_ops class_sysfs_ops = {
 	.show	= class_attr_show,
 	.store	= class_attr_store,
@@ -71,6 +79,7 @@ static const struct sysfs_ops class_sysf
 static struct kobj_type class_ktype = {
 	.sysfs_ops	= &class_sysfs_ops,
 	.release	= class_release,
+	.child_ns_type	= class_child_ns_type,
 };
 
 /* Hotplug events for classes go to the class class_subsys */
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -131,9 +131,21 @@ static void device_release(struct kobjec
 	kfree(p);
 }
 
+static const void *device_namespace(struct kobject *kobj)
+{
+	struct device *dev = to_dev(kobj);
+	const void *ns = NULL;
+
+	if (dev->class && dev->class->ns_type)
+		ns = dev->class->namespace(dev);
+
+	return ns;
+}
+
 static struct kobj_type device_ktype = {
 	.release	= device_release,
 	.sysfs_ops	= &dev_sysfs_ops,
+	.namespace	= device_namespace,
 };
 
 
@@ -595,11 +607,59 @@ static struct kobject *virtual_device_pa
 	return virtual_dir;
 }
 
-static struct kobject *get_device_parent(struct device *dev,
-					 struct device *parent)
+struct class_dir {
+	struct kobject kobj;
+	struct class *class;
+};
+
+#define to_class_dir(obj) container_of(obj, struct class_dir, kobj)
+
+static void class_dir_release(struct kobject *kobj)
+{
+	struct class_dir *dir = to_class_dir(kobj);
+	kfree(dir);
+}
+
+static const
+struct kobj_ns_type_operations *class_dir_child_ns_type(struct kobject *kobj)
+{
+	struct class_dir *dir = to_class_dir(kobj);
+	return dir->class->ns_type;
+}
+
+static struct kobj_type class_dir_ktype = {
+	.release	= class_dir_release,
+	.sysfs_ops	= &kobj_sysfs_ops,
+	.child_ns_type	= class_dir_child_ns_type
+};
+
+static struct kobject *
+class_dir_create_and_add(struct class *class, struct kobject *parent_kobj)
 {
+	struct class_dir *dir;
 	int retval;
 
+	dir = kzalloc(sizeof(*dir), GFP_KERNEL);
+	if (!dir)
+		return NULL;
+
+	dir->class = class;
+	kobject_init(&dir->kobj, &class_dir_ktype);
+
+	dir->kobj.kset = &class->p->class_dirs;
+
+	retval = kobject_add(&dir->kobj, parent_kobj, "%s", class->name);
+	if (retval < 0) {
+		kobject_put(&dir->kobj);
+		return NULL;
+	}
+	return &dir->kobj;
+}
+
+
+static struct kobject *get_device_parent(struct device *dev,
+					 struct device *parent)
+{
 	if (dev->class) {
 		static DEFINE_MUTEX(gdp_mutex);
 		struct kobject *kobj = NULL;
@@ -634,18 +694,7 @@ static struct kobject *get_device_parent
 		}
 
 		/* or create a new class-directory at the parent device */
-		k = kobject_create();
-		if (!k) {
-			mutex_unlock(&gdp_mutex);
-			return NULL;
-		}
-		k->kset = &dev->class->p->class_dirs;
-		retval = kobject_add(k, parent_kobj, "%s", dev->class->name);
-		if (retval < 0) {
-			mutex_unlock(&gdp_mutex);
-			kobject_put(k);
-			return NULL;
-		}
+		k = class_dir_create_and_add(dev->class, parent_kobj);
 		/* do not emit an uevent for this simple "glue" directory */
 		mutex_unlock(&gdp_mutex);
 		return k;
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -202,6 +202,9 @@ struct class {
 	int (*suspend)(struct device *dev, pm_message_t state);
 	int (*resume)(struct device *dev);
 
+	const struct kobj_ns_type_operations *ns_type;
+	const void *(*namespace)(struct device *dev);
+
 	const struct dev_pm_ops *pm;
 
 	struct class_private *p;
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ struct kobj_type {
 	void (*release)(struct kobject *kobj);
 	const struct sysfs_ops *sysfs_ops;
 	struct attribute **default_attrs;
+	const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
+	const void *(*namespace)(struct kobject *kobj);
 };
 
 struct kobj_uevent_env {
@@ -134,6 +136,30 @@ struct kobj_attribute {
 
 extern const struct sysfs_ops kobj_sysfs_ops;
 
+enum kobj_ns_type {
+	KOBJ_NS_TYPE_NONE = 0,
+	KOBJ_NS_TYPES
+};
+
+struct sock;
+struct kobj_ns_type_operations {
+	enum kobj_ns_type type;
+	const void *(*current_ns)(void);
+	const void *(*netlink_ns)(struct sock *sk);
+	const void *(*initial_ns)(void);
+};
+
+int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
+int kobj_ns_type_registered(enum kobj_ns_type type);
+const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
+const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
+
+const void *kobj_ns_current(enum kobj_ns_type type);
+const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
+const void *kobj_ns_initial(enum kobj_ns_type type);
+void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
+
+
 /**
  * struct kset - a set of kobjects of a specific type, belonging to a specific subsystem.
  *
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -850,6 +850,109 @@ struct kset *kset_create_and_add(const c
 }
 EXPORT_SYMBOL_GPL(kset_create_and_add);
 
+
+static DEFINE_SPINLOCK(kobj_ns_type_lock);
+static const struct kobj_ns_type_operations *kobj_ns_ops_tbl[KOBJ_NS_TYPES];
+
+int kobj_ns_type_register(const struct kobj_ns_type_operations *ops)
+{
+	enum kobj_ns_type type = ops->type;
+	int error;
+
+	spin_lock(&kobj_ns_type_lock);
+
+	error = -EINVAL;
+	if (type >= KOBJ_NS_TYPES)
+		goto out;
+
+	error = -EINVAL;
+	if (type <= KOBJ_NS_TYPE_NONE)
+		goto out;
+
+	error = -EBUSY;
+	if (kobj_ns_ops_tbl[type])
+		goto out;
+
+	error = 0;
+	kobj_ns_ops_tbl[type] = ops;
+
+out:
+	spin_unlock(&kobj_ns_type_lock);
+	return error;
+}
+
+int kobj_ns_type_registered(enum kobj_ns_type type)
+{
+	int registered = 0;
+
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES))
+		registered = kobj_ns_ops_tbl[type] != NULL;
+	spin_unlock(&kobj_ns_type_lock);
+
+	return registered;
+}
+
+const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent)
+{
+	const struct kobj_ns_type_operations *ops = NULL;
+
+	if (parent && parent->ktype->child_ns_type)
+		ops = parent->ktype->child_ns_type(parent);
+
+	return ops;
+}
+
+const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
+{
+	return kobj_child_ns_ops(kobj->parent);
+}
+
+
+const void *kobj_ns_current(enum kobj_ns_type type)
+{
+	const void *ns = NULL;
+
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type])
+		ns = kobj_ns_ops_tbl[type]->current_ns();
+	spin_unlock(&kobj_ns_type_lock);
+
+	return ns;
+}
+
+const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk)
+{
+	const void *ns = NULL;
+
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type])
+		ns = kobj_ns_ops_tbl[type]->netlink_ns(sk);
+	spin_unlock(&kobj_ns_type_lock);
+
+	return ns;
+}
+
+const void *kobj_ns_initial(enum kobj_ns_type type)
+{
+	const void *ns = NULL;
+
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type])
+		ns = kobj_ns_ops_tbl[type]->initial_ns();
+	spin_unlock(&kobj_ns_type_lock);
+
+	return ns;
+}
+
+void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
+{
+}
+
+
 EXPORT_SYMBOL(kobject_get);
 EXPORT_SYMBOL(kobject_put);
 EXPORT_SYMBOL(kobject_del);


^ permalink raw reply

* patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	gregkh, kay.sievers, netdev, serue
In-Reply-To: <1269973889-25260-3-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: sysfs: Implement sysfs tagged directory support.

to my gregkh-2.6 tree.  Its filename is

    sysfs-implement-sysfs-tagged-directory-support.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:46:06 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:26 -0700
Subject: sysfs: Implement sysfs tagged directory support.
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@xmission.com>, Benjamin Thery <benjamin.thery@bull.net>
Message-ID: <1269973889-25260-3-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@xmission.com>

The problem.  When implementing a network namespace I need to be able
to have multiple network devices with the same name.  Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.

What this patch does is to add an additional tag field to the
sysfs dirent structure.  For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible.  Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.

I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.

For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded.  Which means I need
a simple race free way to setup those directories as tagged.

To achieve a reace free design all tagged directories are created
and managed by sysfs itself.

Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid

- Implement mount_ns() which returns the ns of the calling process
  so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.

Everything else is left up to sysfs and the driver layer.

For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.

Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.

The work needed in sysfs is more extensive.  At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent.  Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.

Currently only directories which hold kobjects, and
symlinks are supported.  There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/gpio/gpiolib.c |    2 
 drivers/md/bitmap.c    |    4 -
 drivers/md/md.c        |    6 +-
 fs/sysfs/bin.c         |    2 
 fs/sysfs/dir.c         |  112 ++++++++++++++++++++++++++++++++++++++-----------
 fs/sysfs/file.c        |   17 ++++---
 fs/sysfs/group.c       |    6 +-
 fs/sysfs/inode.c       |    4 -
 fs/sysfs/mount.c       |   33 ++++++++++++++
 fs/sysfs/symlink.c     |   15 +++++-
 fs/sysfs/sysfs.h       |   20 +++++++-
 include/linux/sysfs.h  |   10 ++++
 lib/kobject.c          |    1 
 13 files changed, 181 insertions(+), 51 deletions(-)

--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -399,7 +399,7 @@ static int gpio_setup_irq(struct gpio_de
 			goto free_id;
 		}
 
-		pdesc->value_sd = sysfs_get_dirent(dev->kobj.sd, "value");
+		pdesc->value_sd = sysfs_get_dirent(dev->kobj.sd, NULL, "value");
 		if (!pdesc->value_sd) {
 			ret = -ENODEV;
 			goto free_id;
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1678,9 +1678,9 @@ int bitmap_create(mddev_t *mddev)
 
 	bitmap->mddev = mddev;
 
-	bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap");
+	bm = sysfs_get_dirent(mddev->kobj.sd, NULL, "bitmap");
 	if (bm) {
-		bitmap->sysfs_can_clear = sysfs_get_dirent(bm, "can_clear");
+		bitmap->sysfs_can_clear = sysfs_get_dirent(bm, NULL, "can_clear");
 		sysfs_put(bm);
 	} else
 		bitmap->sysfs_can_clear = NULL;
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1766,7 +1766,7 @@ static int bind_rdev_to_array(mdk_rdev_t
 		kobject_del(&rdev->kobj);
 		goto fail;
 	}
-	rdev->sysfs_state = sysfs_get_dirent(rdev->kobj.sd, "state");
+	rdev->sysfs_state = sysfs_get_dirent(rdev->kobj.sd, NULL, "state");
 
 	list_add_rcu(&rdev->same_set, &mddev->disks);
 	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
@@ -4183,7 +4183,7 @@ static int md_alloc(dev_t dev, char *nam
 	mutex_unlock(&disks_mutex);
 	if (!error) {
 		kobject_uevent(&mddev->kobj, KOBJ_ADD);
-		mddev->sysfs_state = sysfs_get_dirent(mddev->kobj.sd, "array_state");
+		mddev->sysfs_state = sysfs_get_dirent(mddev->kobj.sd, NULL, "array_state");
 	}
 	mddev_put(mddev);
 	return error;
@@ -4392,7 +4392,7 @@ static int do_md_run(mddev_t * mddev)
 			printk(KERN_WARNING
 			       "md: cannot register extra attributes for %s\n",
 			       mdname(mddev));
-		mddev->sysfs_action = sysfs_get_dirent(mddev->kobj.sd, "sync_action");
+		mddev->sysfs_action = sysfs_get_dirent(mddev->kobj.sd, NULL, "sync_action");
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
 
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -501,7 +501,7 @@ int sysfs_create_bin_file(struct kobject
 void sysfs_remove_bin_file(struct kobject *kobj,
 			   const struct bin_attribute *attr)
 {
-	sysfs_hash_and_remove(kobj->sd, attr->attr.name);
+	sysfs_hash_and_remove(kobj->sd, NULL, attr->attr.name);
 }
 
 EXPORT_SYMBOL_GPL(sysfs_create_bin_file);
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -380,9 +380,15 @@ int __sysfs_add_one(struct sysfs_addrm_c
 {
 	struct sysfs_inode_attrs *ps_iattr;
 
-	if (sysfs_find_dirent(acxt->parent_sd, sd->s_name))
+	if (sysfs_find_dirent(acxt->parent_sd, sd->s_ns, sd->s_name))
 		return -EEXIST;
 
+	if (sysfs_ns_type(acxt->parent_sd) && !sd->s_ns) {
+		WARN(1, KERN_WARNING "sysfs: ns required in '%s' for '%s'\n",
+			acxt->parent_sd->s_name, sd->s_name);
+		return -EINVAL;
+	}
+
 	sd->s_parent = sysfs_get(acxt->parent_sd);
 
 	sysfs_link_sibling(sd);
@@ -533,13 +539,17 @@ void sysfs_addrm_finish(struct sysfs_add
  *	Pointer to sysfs_dirent if found, NULL if not.
  */
 struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
+				       const void *ns,
 				       const unsigned char *name)
 {
 	struct sysfs_dirent *sd;
 
-	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
+	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
+		if (sd->s_ns != ns)
+			continue;
 		if (!strcmp(sd->s_name, name))
 			return sd;
+	}
 	return NULL;
 }
 
@@ -558,12 +568,13 @@ struct sysfs_dirent *sysfs_find_dirent(s
  *	Pointer to sysfs_dirent if found, NULL if not.
  */
 struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
+				      const void *ns,
 				      const unsigned char *name)
 {
 	struct sysfs_dirent *sd;
 
 	mutex_lock(&sysfs_mutex);
-	sd = sysfs_find_dirent(parent_sd, name);
+	sd = sysfs_find_dirent(parent_sd, ns, name);
 	sysfs_get(sd);
 	mutex_unlock(&sysfs_mutex);
 
@@ -572,7 +583,8 @@ struct sysfs_dirent *sysfs_get_dirent(st
 EXPORT_SYMBOL_GPL(sysfs_get_dirent);
 
 static int create_dir(struct kobject *kobj, struct sysfs_dirent *parent_sd,
-		      const char *name, struct sysfs_dirent **p_sd)
+	enum kobj_ns_type type, const void *ns, const char *name,
+	struct sysfs_dirent **p_sd)
 {
 	umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO;
 	struct sysfs_addrm_cxt acxt;
@@ -583,6 +595,9 @@ static int create_dir(struct kobject *ko
 	sd = sysfs_new_dirent(name, mode, SYSFS_DIR);
 	if (!sd)
 		return -ENOMEM;
+
+	sd->s_flags |= (type << SYSFS_NS_TYPE_SHIFT);
+	sd->s_ns = ns;
 	sd->s_dir.kobj = kobj;
 
 	/* link in */
@@ -601,7 +616,25 @@ static int create_dir(struct kobject *ko
 int sysfs_create_subdir(struct kobject *kobj, const char *name,
 			struct sysfs_dirent **p_sd)
 {
-	return create_dir(kobj, kobj->sd, name, p_sd);
+	return create_dir(kobj, kobj->sd,
+			  KOBJ_NS_TYPE_NONE, NULL, name, p_sd);
+}
+
+static enum kobj_ns_type sysfs_read_ns_type(struct kobject *kobj)
+{
+	const struct kobj_ns_type_operations *ops;
+	enum kobj_ns_type type;
+
+	ops = kobj_child_ns_ops(kobj);
+	if (!ops)
+		return KOBJ_NS_TYPE_NONE;
+
+	type = ops->type;
+	BUG_ON(type <= KOBJ_NS_TYPE_NONE);
+	BUG_ON(type >= KOBJ_NS_TYPES);
+	BUG_ON(!kobj_ns_type_registered(type));
+
+	return type;
 }
 
 /**
@@ -610,7 +643,9 @@ int sysfs_create_subdir(struct kobject *
  */
 int sysfs_create_dir(struct kobject * kobj)
 {
+	enum kobj_ns_type type;
 	struct sysfs_dirent *parent_sd, *sd;
+	const void *ns = NULL;
 	int error = 0;
 
 	BUG_ON(!kobj);
@@ -620,7 +655,11 @@ int sysfs_create_dir(struct kobject * ko
 	else
 		parent_sd = &sysfs_root;
 
-	error = create_dir(kobj, parent_sd, kobject_name(kobj), &sd);
+	if (sysfs_ns_type(parent_sd))
+		ns = kobj->ktype->namespace(kobj);
+	type = sysfs_read_ns_type(kobj);
+
+	error = create_dir(kobj, parent_sd, type, ns, kobject_name(kobj), &sd);
 	if (!error)
 		kobj->sd = sd;
 	return error;
@@ -630,13 +669,19 @@ static struct dentry * sysfs_lookup(stru
 				struct nameidata *nd)
 {
 	struct dentry *ret = NULL;
-	struct sysfs_dirent *parent_sd = dentry->d_parent->d_fsdata;
+	struct dentry *parent = dentry->d_parent;
+	struct sysfs_dirent *parent_sd = parent->d_fsdata;
 	struct sysfs_dirent *sd;
 	struct inode *inode;
+	enum kobj_ns_type type;
+	const void *ns;
 
 	mutex_lock(&sysfs_mutex);
 
-	sd = sysfs_find_dirent(parent_sd, dentry->d_name.name);
+	type = sysfs_ns_type(parent_sd);
+	ns = sysfs_info(dir->i_sb)->ns[type];
+
+	sd = sysfs_find_dirent(parent_sd, ns, dentry->d_name.name);
 
 	/* no such entry */
 	if (!sd) {
@@ -735,7 +780,8 @@ void sysfs_remove_dir(struct kobject * k
 }
 
 int sysfs_rename(struct sysfs_dirent *sd,
-	struct sysfs_dirent *new_parent_sd, const char *new_name)
+	struct sysfs_dirent *new_parent_sd, const void *new_ns,
+	const char *new_name)
 {
 	const char *dup_name = NULL;
 	int error;
@@ -743,12 +789,12 @@ int sysfs_rename(struct sysfs_dirent *sd
 	mutex_lock(&sysfs_mutex);
 
 	error = 0;
-	if ((sd->s_parent == new_parent_sd) &&
+	if ((sd->s_parent == new_parent_sd) && (sd->s_ns == new_ns) &&
 	    (strcmp(sd->s_name, new_name) == 0))
 		goto out;	/* nothing to rename */
 
 	error = -EEXIST;
-	if (sysfs_find_dirent(new_parent_sd, new_name))
+	if (sysfs_find_dirent(new_parent_sd, new_ns, new_name))
 		goto out;
 
 	/* rename sysfs_dirent */
@@ -770,6 +816,7 @@ int sysfs_rename(struct sysfs_dirent *sd
 		sd->s_parent = new_parent_sd;
 		sysfs_link_sibling(sd);
 	}
+	sd->s_ns = new_ns;
 
 	error = 0;
  out:
@@ -780,19 +827,28 @@ int sysfs_rename(struct sysfs_dirent *sd
 
 int sysfs_rename_dir(struct kobject *kobj, const char *new_name)
 {
-	return sysfs_rename(kobj->sd, kobj->sd->s_parent, new_name);
+	struct sysfs_dirent *parent_sd = kobj->sd->s_parent;
+	const void *new_ns = NULL;
+
+	if (sysfs_ns_type(parent_sd))
+		new_ns = kobj->ktype->namespace(kobj);
+
+	return sysfs_rename(kobj->sd, parent_sd, new_ns, new_name);
 }
 
 int sysfs_move_dir(struct kobject *kobj, struct kobject *new_parent_kobj)
 {
 	struct sysfs_dirent *sd = kobj->sd;
 	struct sysfs_dirent *new_parent_sd;
+	const void *new_ns = NULL;
 
 	BUG_ON(!sd->s_parent);
+	if (sysfs_ns_type(sd->s_parent))
+		new_ns = kobj->ktype->namespace(kobj);
 	new_parent_sd = new_parent_kobj && new_parent_kobj->sd ?
 		new_parent_kobj->sd : &sysfs_root;
 
-	return sysfs_rename(sd, new_parent_sd, sd->s_name);
+	return sysfs_rename(sd, new_parent_sd, new_ns, sd->s_name);
 }
 
 /* Relationship between s_mode and the DT_xxx types */
@@ -807,32 +863,35 @@ static int sysfs_dir_release(struct inod
 	return 0;
 }
 
-static struct sysfs_dirent *sysfs_dir_pos(struct sysfs_dirent *parent_sd,
-	ino_t ino, struct sysfs_dirent *pos)
+static struct sysfs_dirent *sysfs_dir_pos(const void *ns,
+	struct sysfs_dirent *parent_sd,	ino_t ino, struct sysfs_dirent *pos)
 {
 	if (pos) {
 		int valid = !(pos->s_flags & SYSFS_FLAG_REMOVED) &&
 			pos->s_parent == parent_sd &&
 			ino == pos->s_ino;
 		sysfs_put(pos);
-		if (valid)
-			return pos;
+		if (!valid)
+			pos = NULL;
 	}
-	pos = NULL;
-	if ((ino > 1) && (ino < INT_MAX)) {
+	if (!pos && (ino > 1) && (ino < INT_MAX)) {
 		pos = parent_sd->s_dir.children;
 		while (pos && (ino > pos->s_ino))
 			pos = pos->s_sibling;
 	}
+	while (pos && pos->s_ns != ns)
+		pos = pos->s_sibling;
 	return pos;
 }
 
-static struct sysfs_dirent *sysfs_dir_next_pos(struct sysfs_dirent *parent_sd,
-	ino_t ino, struct sysfs_dirent *pos)
+static struct sysfs_dirent *sysfs_dir_next_pos(const void *ns,
+	struct sysfs_dirent *parent_sd,	ino_t ino, struct sysfs_dirent *pos)
 {
-	pos = sysfs_dir_pos(parent_sd, ino, pos);
+	pos = sysfs_dir_pos(ns, parent_sd, ino, pos);
 	if (pos)
 		pos = pos->s_sibling;
+	while (pos && pos->s_ns != ns)
+		pos = pos->s_sibling;
 	return pos;
 }
 
@@ -841,8 +900,13 @@ static int sysfs_readdir(struct file * f
 	struct dentry *dentry = filp->f_path.dentry;
 	struct sysfs_dirent * parent_sd = dentry->d_fsdata;
 	struct sysfs_dirent *pos = filp->private_data;
+	enum kobj_ns_type type;
+	const void *ns;
 	ino_t ino;
 
+	type = sysfs_ns_type(parent_sd);
+	ns = sysfs_info(dentry->d_sb)->ns[type];
+
 	if (filp->f_pos == 0) {
 		ino = parent_sd->s_ino;
 		if (filldir(dirent, ".", 1, filp->f_pos, ino, DT_DIR) == 0)
@@ -857,9 +921,9 @@ static int sysfs_readdir(struct file * f
 			filp->f_pos++;
 	}
 	mutex_lock(&sysfs_mutex);
-	for (pos = sysfs_dir_pos(parent_sd, filp->f_pos, pos);
+	for (pos = sysfs_dir_pos(ns, parent_sd, filp->f_pos, pos);
 	     pos;
-	     pos = sysfs_dir_next_pos(parent_sd, filp->f_pos, pos)) {
+	     pos = sysfs_dir_next_pos(ns, parent_sd, filp->f_pos, pos)) {
 		const char * name;
 		unsigned int type;
 		int len, ret;
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -478,9 +478,12 @@ void sysfs_notify(struct kobject *k, con
 	mutex_lock(&sysfs_mutex);
 
 	if (sd && dir)
-		sd = sysfs_find_dirent(sd, dir);
+		/* Only directories are tagged, so no need to pass
+		 * a tag explicitly.
+		 */
+		sd = sysfs_find_dirent(sd, NULL, dir);
 	if (sd && attr)
-		sd = sysfs_find_dirent(sd, attr);
+		sd = sysfs_find_dirent(sd, NULL, attr);
 	if (sd)
 		sysfs_notify_dirent(sd);
 
@@ -569,7 +572,7 @@ int sysfs_add_file_to_group(struct kobje
 	int error;
 
 	if (group)
-		dir_sd = sysfs_get_dirent(kobj->sd, group);
+		dir_sd = sysfs_get_dirent(kobj->sd, NULL, group);
 	else
 		dir_sd = sysfs_get(kobj->sd);
 
@@ -599,7 +602,7 @@ int sysfs_chmod_file(struct kobject *kob
 	mutex_lock(&sysfs_mutex);
 
 	rc = -ENOENT;
-	sd = sysfs_find_dirent(kobj->sd, attr->name);
+	sd = sysfs_find_dirent(kobj->sd, NULL, attr->name);
 	if (!sd)
 		goto out;
 
@@ -624,7 +627,7 @@ EXPORT_SYMBOL_GPL(sysfs_chmod_file);
 
 void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr)
 {
-	sysfs_hash_and_remove(kobj->sd, attr->name);
+	sysfs_hash_and_remove(kobj->sd, NULL, attr->name);
 }
 
 void sysfs_remove_files(struct kobject * kobj, const struct attribute **ptr)
@@ -646,11 +649,11 @@ void sysfs_remove_file_from_group(struct
 	struct sysfs_dirent *dir_sd;
 
 	if (group)
-		dir_sd = sysfs_get_dirent(kobj->sd, group);
+		dir_sd = sysfs_get_dirent(kobj->sd, NULL, group);
 	else
 		dir_sd = sysfs_get(kobj->sd);
 	if (dir_sd) {
-		sysfs_hash_and_remove(dir_sd, attr->name);
+		sysfs_hash_and_remove(dir_sd, NULL, attr->name);
 		sysfs_put(dir_sd);
 	}
 }
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -23,7 +23,7 @@ static void remove_files(struct sysfs_di
 	int i;
 
 	for (i = 0, attr = grp->attrs; *attr; i++, attr++)
-		sysfs_hash_and_remove(dir_sd, (*attr)->name);
+		sysfs_hash_and_remove(dir_sd, NULL, (*attr)->name);
 }
 
 static int create_files(struct sysfs_dirent *dir_sd, struct kobject *kobj,
@@ -39,7 +39,7 @@ static int create_files(struct sysfs_dir
 		 * visibility.  Do this by first removing then
 		 * re-adding (if required) the file */
 		if (update)
-			sysfs_hash_and_remove(dir_sd, (*attr)->name);
+			sysfs_hash_and_remove(dir_sd, NULL, (*attr)->name);
 		if (grp->is_visible) {
 			mode = grp->is_visible(kobj, *attr, i);
 			if (!mode)
@@ -132,7 +132,7 @@ void sysfs_remove_group(struct kobject *
 	struct sysfs_dirent *sd;
 
 	if (grp->name) {
-		sd = sysfs_get_dirent(dir_sd, grp->name);
+		sd = sysfs_get_dirent(dir_sd, NULL, grp->name);
 		if (!sd) {
 			WARN(!sd, KERN_WARNING "sysfs group %p not found for "
 				"kobject '%s'\n", grp, kobject_name(kobj));
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -324,7 +324,7 @@ void sysfs_delete_inode(struct inode *in
 	sysfs_put(sd);
 }
 
-int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const char *name)
+int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const void *ns, const char *name)
 {
 	struct sysfs_addrm_cxt acxt;
 	struct sysfs_dirent *sd;
@@ -334,7 +334,7 @@ int sysfs_hash_and_remove(struct sysfs_d
 
 	sysfs_addrm_start(&acxt, dir_sd);
 
-	sd = sysfs_find_dirent(dir_sd, name);
+	sd = sysfs_find_dirent(dir_sd, ns, name);
 	if (sd)
 		sysfs_remove_one(&acxt, sd);
 
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -35,7 +35,7 @@ static const struct super_operations sys
 struct sysfs_dirent sysfs_root = {
 	.s_name		= "",
 	.s_count	= ATOMIC_INIT(1),
-	.s_flags	= SYSFS_DIR,
+	.s_flags	= SYSFS_DIR | (KOBJ_NS_TYPE_NONE << SYSFS_NS_TYPE_SHIFT),
 	.s_mode		= S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
 	.s_ino		= 1,
 };
@@ -76,7 +76,13 @@ static int sysfs_test_super(struct super
 {
 	struct sysfs_super_info *sb_info = sysfs_info(sb);
 	struct sysfs_super_info *info = data;
+	enum kobj_ns_type type;
 	int found = 1;
+
+	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++) {
+		if (sb_info->ns[type] != info->ns[type])
+			found = 0;
+	}
 	return found;
 }
 
@@ -93,6 +99,7 @@ static int sysfs_get_sb(struct file_syst
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
 	struct sysfs_super_info *info;
+	enum kobj_ns_type type;
 	struct super_block *sb;
 	int error;
 
@@ -100,6 +107,10 @@ static int sysfs_get_sb(struct file_syst
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info)
 		goto out;
+
+	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+		info->ns[type] = kobj_ns_current(type);
+
 	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
 	if (IS_ERR(sb) || sb->s_fs_info != info)
 		kfree(info);
@@ -137,6 +148,26 @@ static struct file_system_type sysfs_fs_
 	.kill_sb	= sysfs_kill_sb,
 };
 
+void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
+{
+	struct super_block *sb;
+
+	mutex_lock(&sysfs_mutex);
+	spin_lock(&sb_lock);
+	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
+		struct sysfs_super_info *info = sysfs_info(sb);
+		/* Ignore superblocks that are in the process of unmounting */
+		if (sb->s_count <= S_BIAS)
+			continue;
+		/* Ignore superblocks with the wrong ns */
+		if (info->ns[type] != ns)
+			continue;
+		info->ns[type] = NULL;
+	}
+	spin_unlock(&sb_lock);
+	mutex_unlock(&sysfs_mutex);
+}
+
 int __init sysfs_init(void)
 {
 	int err = -ENOMEM;
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -58,6 +58,8 @@ static int sysfs_do_create_link(struct k
 	if (!sd)
 		goto out_put;
 
+	if (sysfs_ns_type(parent_sd))
+		sd->s_ns = target->ktype->namespace(target);
 	sd->s_symlink.target_sd = target_sd;
 	target_sd = NULL;	/* reference is now owned by the symlink */
 
@@ -121,7 +123,7 @@ void sysfs_remove_link(struct kobject *
 	else
 		parent_sd = kobj->sd;
 
-	sysfs_hash_and_remove(parent_sd, name);
+	sysfs_hash_and_remove(parent_sd, NULL, name);
 }
 
 /**
@@ -137,6 +139,7 @@ int sysfs_rename_link(struct kobject *ko
 			const char *old, const char *new)
 {
 	struct sysfs_dirent *parent_sd, *sd = NULL;
+	const void *old_ns = NULL, *new_ns = NULL;
 	int result;
 
 	if (!kobj)
@@ -144,8 +147,11 @@ int sysfs_rename_link(struct kobject *ko
 	else
 		parent_sd = kobj->sd;
 
+	if (targ->sd)
+		old_ns = targ->sd->s_ns;
+
 	result = -ENOENT;
-	sd = sysfs_get_dirent(parent_sd, old);
+	sd = sysfs_get_dirent(parent_sd, old_ns, old);
 	if (!sd)
 		goto out;
 
@@ -155,7 +161,10 @@ int sysfs_rename_link(struct kobject *ko
 	if (sd->s_symlink.target_sd->s_dir.kobj != targ)
 		goto out;
 
-	result = sysfs_rename(sd, parent_sd, new);
+	if (sysfs_ns_type(parent_sd))
+		new_ns = targ->ktype->namespace(targ);
+
+	result = sysfs_rename(sd, parent_sd, new_ns, new);
 
 out:
 	sysfs_put(sd);
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -58,6 +58,7 @@ struct sysfs_dirent {
 	struct sysfs_dirent	*s_sibling;
 	const char		*s_name;
 
+	const void		*s_ns;
 	union {
 		struct sysfs_elem_dir		s_dir;
 		struct sysfs_elem_symlink	s_symlink;
@@ -81,14 +82,22 @@ struct sysfs_dirent {
 #define SYSFS_COPY_NAME			(SYSFS_DIR | SYSFS_KOBJ_LINK)
 #define SYSFS_ACTIVE_REF		(SYSFS_KOBJ_ATTR | SYSFS_KOBJ_BIN_ATTR)
 
-#define SYSFS_FLAG_MASK			~SYSFS_TYPE_MASK
-#define SYSFS_FLAG_REMOVED		0x0200
+#define SYSFS_NS_TYPE_MASK		0xff00
+#define SYSFS_NS_TYPE_SHIFT		8
+
+#define SYSFS_FLAG_MASK			~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK)
+#define SYSFS_FLAG_REMOVED		0x020000
 
 static inline unsigned int sysfs_type(struct sysfs_dirent *sd)
 {
 	return sd->s_flags & SYSFS_TYPE_MASK;
 }
 
+static inline enum kobj_ns_type sysfs_ns_type(struct sysfs_dirent *sd)
+{
+	return (sd->s_flags & SYSFS_NS_TYPE_MASK) >> SYSFS_NS_TYPE_SHIFT;
+}
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define sysfs_dirent_init_lockdep(sd)				\
 do {								\
@@ -115,6 +124,7 @@ struct sysfs_addrm_cxt {
  * mount.c
  */
 struct sysfs_super_info {
+	const void *ns[KOBJ_NS_TYPES];
 };
 #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
 extern struct sysfs_dirent sysfs_root;
@@ -140,8 +150,10 @@ void sysfs_remove_one(struct sysfs_addrm
 void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt);
 
 struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
+				       const void *ns,
 				       const unsigned char *name);
 struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
+				      const void *ns,
 				      const unsigned char *name);
 struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type);
 
@@ -152,7 +164,7 @@ int sysfs_create_subdir(struct kobject *
 void sysfs_remove_subdir(struct sysfs_dirent *sd);
 
 int sysfs_rename(struct sysfs_dirent *sd,
-	struct sysfs_dirent *new_parent_sd, const char *new_name);
+	struct sysfs_dirent *new_parent_sd, const void *ns, const char *new_name);
 
 static inline struct sysfs_dirent *__sysfs_get(struct sysfs_dirent *sd)
 {
@@ -182,7 +194,7 @@ int sysfs_setattr(struct dentry *dentry,
 int sysfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat);
 int sysfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 		size_t size, int flags);
-int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const char *name);
+int sysfs_hash_and_remove(struct sysfs_dirent *dir_sd, const void *ns, const char *name);
 int sysfs_inode_init(void);
 
 /*
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -20,6 +20,7 @@
 
 struct kobject;
 struct module;
+enum kobj_ns_type;
 
 /* FIXME
  * The *owner field is no longer used.
@@ -168,10 +169,14 @@ void sysfs_remove_file_from_group(struct
 void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
 void sysfs_notify_dirent(struct sysfs_dirent *sd);
 struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
+				      const void *ns,
 				      const unsigned char *name);
 struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
 void sysfs_put(struct sysfs_dirent *sd);
 void sysfs_printk_last_file(void);
+
+void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
+
 int __must_check sysfs_init(void);
 
 #else /* CONFIG_SYSFS */
@@ -301,6 +306,7 @@ static inline void sysfs_notify_dirent(s
 }
 static inline
 struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
+				      const void *ns,
 				      const unsigned char *name)
 {
 	return NULL;
@@ -313,6 +319,10 @@ static inline void sysfs_put(struct sysf
 {
 }
 
+static inline void sysfs_exit_ns(enum kobj_ns_type type, const void *tag)
+{
+}
+
 static inline int __must_check sysfs_init(void)
 {
 	return 0;
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -950,6 +950,7 @@ const void *kobj_ns_initial(enum kobj_ns
 
 void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
 {
+	sysfs_exit_ns(type, ns);
 }
 
 


^ permalink raw reply

* patch sysfs-basic-support-for-multiple-super-blocks.patch added to gregkh-2.6 tree
From: gregkh @ 2010-04-29 20:29 UTC (permalink / raw)
  To: ebiederm, bcrl, cornelia.huck, eric.dumazet, gregkh, kay.sievers,
	netdev, serue, tj
In-Reply-To: <1269973889-25260-1-git-send-email-ebiederm@xmission.com>


This is a note to let you know that I've just added the patch titled

    Subject: sysfs: Basic support for multiple super blocks

to my gregkh-2.6 tree.  Its filename is

    sysfs-basic-support-for-multiple-super-blocks.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Thu Apr 29 12:41:57 2010
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, 30 Mar 2010 11:31:24 -0700
Subject: sysfs: Basic support for multiple super blocks
To: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>, Cornelia Huck <cornelia.huck@de.ibm.com>, linux-fsdevel@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>, Benjamin LaHaise <bcrl@lhnet.ca>, Serge Hallyn <serue@us.ibm.com>, <netdev@vger.kernel.org>, "Eric W. Biederman" <ebiederm@xmission.com>
Message-ID: <1269973889-25260-1-git-send-email-ebiederm@xmission.com>


From: Eric W. Biederman <ebiederm@xmission.com>

Add all of the necessary bioler plate to support
multiple superblocks in sysfs.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/sysfs/mount.c |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/sysfs/sysfs.h |    3 ++
 2 files changed, 59 insertions(+), 2 deletions(-)

--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -72,16 +72,70 @@ static int sysfs_fill_super(struct super
 	return 0;
 }
 
+static int sysfs_test_super(struct super_block *sb, void *data)
+{
+	struct sysfs_super_info *sb_info = sysfs_info(sb);
+	struct sysfs_super_info *info = data;
+	int found = 1;
+	return found;
+}
+
+static int sysfs_set_super(struct super_block *sb, void *data)
+{
+	int error;
+	error = set_anon_super(sb, data);
+	if (!error)
+		sb->s_fs_info = data;
+	return error;
+}
+
 static int sysfs_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_single(fs_type, flags, data, sysfs_fill_super, mnt);
+	struct sysfs_super_info *info;
+	struct super_block *sb;
+	int error;
+
+	error = -ENOMEM;
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		goto out;
+	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
+	if (IS_ERR(sb) || sb->s_fs_info != info)
+		kfree(info);
+	if (IS_ERR(sb)) {
+		kfree(info);
+		error = PTR_ERR(sb);
+		goto out;
+	}
+	if (!sb->s_root) {
+		sb->s_flags = flags;
+		error = sysfs_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			deactivate_locked_super(sb);
+			goto out;
+		}
+		sb->s_flags |= MS_ACTIVE;
+	}
+
+	simple_set_mnt(mnt, sb);
+	error = 0;
+out:
+	return error;
+}
+
+static void sysfs_kill_sb(struct super_block *sb)
+{
+	struct sysfs_super_info *info = sysfs_info(sb);
+
+	kill_anon_super(sb);
+	kfree(info);
 }
 
 static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.get_sb		= sysfs_get_sb,
-	.kill_sb	= kill_anon_super,
+	.kill_sb	= sysfs_kill_sb,
 };
 
 int __init sysfs_init(void)
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -114,6 +114,9 @@ struct sysfs_addrm_cxt {
 /*
  * mount.c
  */
+struct sysfs_super_info {
+};
+#define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
 extern struct sysfs_dirent sysfs_root;
 extern struct kmem_cache *sysfs_dir_cachep;
 


^ permalink raw reply

* [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-04-30  6:07 UTC (permalink / raw)
  To: davem, netdev

Implement SO_TIMESTAMP{NS} for TCP.  When this socket option is enabled
on a TCP socket, a timestamp for received data can be returned in the
ancillary data of a recvmsg with control message type SCM_TIMESTAMP{NS}.
The timestamp chosen is that of the skb most recently received from
which data was copied.  This is useful in debugging and timing
network operations.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..7dbb662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -314,6 +314,7 @@ struct tcp_sock {
  	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
 	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
 	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
+	ktime_t lrxcopytime;	/* timestamp of newest data in copy to user space */
 
 	/* Data for direct copy to user */
 	struct {
@@ -466,6 +467,15 @@ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
 	return (struct tcp_sock *)sk;
 }
 
+static inline void tcp_update_lrxcopytime(struct sock *sk,
+					  const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (unlikely(skb->tstamp.tv64 >  tp->lrxcopytime.tv64))
+		tp->lrxcopytime = skb->tstamp;
+}
+
 struct tcp_timewait_sock {
 	struct inet_timewait_sock tw_sk;
 	u32			  tw_rcv_nxt;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8ce2974..c7f107a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1381,6 +1381,27 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	return copied;
 }
 
+static inline void tcp_sock_recv_timestamp(struct msghdr *msg, struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (likely(!sock_flag(sk, SOCK_RCVTSTAMP)))
+		return;
+
+	if (msg->msg_controllen >= sizeof(struct timeval) &&
+	    tp->lrxcopytime.tv64) {
+		if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
+			struct timespec ts = ktime_to_timespec(tp->lrxcopytime);
+			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
+				 sizeof(ts), &ts);
+		} else {
+			struct timeval tv = ktime_to_timeval(tp->lrxcopytime);
+			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+			    sizeof(tv), &tv);
+		}
+	}
+}
+
 /*
  *	This routine copies from a sock struct into the user buffer.
  *
@@ -1414,6 +1435,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		goto out;
 
 	timeo = sock_rcvtimeo(sk, nonblock);
+	tp->lrxcopytime.tv64 = 0;
 
 	/* Urgent data needs to be handled specially. */
 	if (flags & MSG_OOB)
@@ -1691,6 +1713,8 @@ do_prequeue:
 					break;
 				}
 			}
+
+			tcp_update_lrxcopytime(sk, skb);
 		}
 
 		*seq += used;
@@ -1758,6 +1782,8 @@ skip_copy:
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
+	tcp_sock_recv_timestamp(msg, sk);
+
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e82162c..b94ad16 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4397,6 +4397,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 				tp->copied_seq += chunk;
 				eaten = (chunk == skb->len && !th->fin);
 				tcp_rcv_space_adjust(sk);
+				tcp_update_lrxcopytime(sk, skb);
 			}
 			local_bh_disable();
 		}
@@ -5061,6 +5062,7 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 		tp->ucopy.len -= chunk;
 		tp->copied_seq += chunk;
 		tcp_rcv_space_adjust(sk);
+		tcp_update_lrxcopytime(sk, skb);
 	}
 
 	local_bh_disable();
@@ -5120,6 +5122,7 @@ static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 		tp->ucopy.len -= chunk;
 		tp->copied_seq += chunk;
 		tcp_rcv_space_adjust(sk);
+		tcp_update_lrxcopytime(sk, skb);
 
 		if ((tp->ucopy.len == 0) ||
 		    (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||

^ permalink raw reply related

* OFT - reserving CPU's for networking
From: Stephen Hemminger @ 2010-04-29 18:10 UTC (permalink / raw)
  To: Eric Dumazet, Thomas Gleixner; +Cc: Andi Kleen, netdev, Andi Kleen
In-Reply-To: <1272563772.2222.301.camel@edumazet-laptop>

> Le jeudi 29 avril 2010 à 19:42 +0200, Andi Kleen a écrit :
> > > Andi, what do you think of this one ?
> > > Dont we have a function to send an IPI to an individual cpu instead ?  
> > 
> > That's what this function already does. You only set a single CPU 
> > in the target mask, right?
> > 
> > IPIs are unfortunately always a bit slow. Nehalem-EX systems have X2APIC
> > which is a bit faster for this, but that's not available in the lower
> > end Nehalems. But even then it's not exactly fast.
> > 
> > I don't think the IPI primitive can be optimized much. It's not a cheap 
> > operation.
> > 
> > If it's a problem do it less often and batch IPIs.
> > 
> > It's essentially the same problem as interrupt mitigation or NAPI 
> > are solving for NICs. I guess just need a suitable mitigation mechanism.
> > 
> > Of course that would move more work to the sending CPU again, but 
> > perhaps there's no alternative. I guess you could make it cheaper it by
> > minimizing access to packet data.
> > 
> > -Andi  
> 
> Well, IPI are already batched, and rate is auto adaptative.
> 
> After various changes, it seems things are going better, maybe there is
> something related to cache line trashing.
> 
> I 'solved' it by using idle=poll, but you might take a look at
> clockevents_notify (acpi_idle_enter_bm) abuse of a shared and higly
> contended spinlock...
> 
> 
> 
> 
>     23.52%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
>                       |
>                       --- _raw_spin_lock_irqsave
>                          |          
>                          |--94.74%-- clockevents_notify
>                          |          lapic_timer_state_broadcast
>                          |          acpi_idle_enter_bm
>                          |          cpuidle_idle_call
>                          |          cpu_idle
>                          |          start_secondary
>                          |          
>                          |--4.10%-- tick_broadcast_oneshot_control
>                          |          tick_notify
>                          |          notifier_call_chain
>                          |          __raw_notifier_call_chain
>                          |          raw_notifier_call_chain
>                          |          clockevents_do_notify
>                          |          clockevents_notify
>                          |          lapic_timer_state_broadcast
>                          |          acpi_idle_enter_bm
>                          |          cpuidle_idle_call
>                          |          cpu_idle
>                          |          start_secondary
>                          |          
> 


I keep getting asked about taking some core's away from clock and scheduler
to be reserved just for network processing. Seeing this kind of stuff
makes me wonder if maybe that isn't a half bad idea.


-- 

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Tejun Heo @ 2010-04-30  4:18 UTC (permalink / raw)
  To: gregkh
  Cc: ebiederm, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <12725729473590@kroah.org>

On 04/29/2010 10:29 PM, gregkh@suse.de wrote:
> 
> This is a note to let you know that I've just added the patch titled
> 
>     Subject: sysfs: Implement sysfs tagged directory support.
> 
> to my gregkh-2.6 tree.  Its filename is
> 
>     sysfs-implement-sysfs-tagged-directory-support.patch
> 
> This tree can be found at 
>     http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/

I wish at least more comments are added before it goes mainline.  I
don't really understand the current form.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Tejun Heo @ 2010-04-30  5:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg KH, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <m1aaslbjh4.fsf@fess.ebiederm.org>

Hello,

On 04/30/2010 07:24 AM, Eric W. Biederman wrote:
>>> I wish at least more comments are added before it goes mainline.  I
>>> don't really understand the current form.
>>
>> Ok, that's fine with me, I'll pull it back out.
> 
> ?????
> 
> Tejun you have offered nothing constructive to the review, except looking
> and saying you don't understand what is going on.

Eric, no need to get too touchy and you're right in part in saying all
I'm saying is basically "I don't understand it" which is the same
reason why I'm not nacking it and explicitly stated that I would be
okay with the series going in if Greg/Kay would be okay with it.
Again, about the same thing with the above comment, I was *wishing*
for more comments *before it goes mainline*.

> Tejun I think for the code to make any sense to you I would need to rip
> out out and/or rewrite the kobject layer, and possible the device
> model code as well.

And yes, in the long run, please do that.

> Tejun I'm sorry you can't understand the code, and I'm sorry the code
> may be over-general.  In part that is because making the code
> over-general is what you asked for when reviewing it the first time.

Please give me some credit.  I mean that the code is difficult to
follow and justify when I say I don't understand it.  Yeah, I tried to
understand it and I think I understand how it *works* in its current
form but I just don't think the design is justified or logical.  You
say it's infeasible to do it in straightforward manner in reasonable
amount of time and that's why I neither acked or nacked the series and
deferred the decision to the subsystem maintainer.

But, at the very least, please add some comments.  Try to explain what
each callbacks are supposed to do and why they're there.  Not everyone
lives in your head.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Richard Cochran @ 2010-04-29  9:42 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BD95048.7050606@grandegger.com>

On Thu, Apr 29, 2010 at 11:24:24AM +0200, Wolfgang Grandegger wrote:
> I used these interrupt number fixes as well but it was not necessary for
> the actual net-next-2.6 tree. Need to check why? I remember some version
> dependent re-mapping code.

I argued on the ppc list with Scott Wood about adding dts files, one
for each of mpc8313 rev A, B, and C, but he advocated fixing this
problem in uboot instead. Is the fix in uboot, or in the kernel?

> That's missing to get the PPS signal output. But it should probably go
> to gianfar_ptp.c.

Well, this fix is specific to the mpc8313, but the gianfar_ptp driver
is for all eTECs. For example, I have the ptp code running on the
p2020rdb and p2020ds, too.

I don't think board fixups really belong in the PTP clock driver.

Just my 2 cents,
Richard

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-04-29 16:31 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <201004291748.38702.arnd@arndb.de>

On 4/29/10 8:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> I believe Chris is the one that was pushing most for having a single interface
> for both VDP/LLDPAD and enic.
> While I now understand your reasons for doing it in firmware and requiring the
> kernel interface in addition to the user interface, my doubts on whether VDP
> and your protocol should be part of the same interface are increasing.
> 
> While I'm convinced that you can make it work for both now, the alternative
> to split the two may turn out to be cleaner. We'd still be able to do
> either of the two in kernel or user space. Using iproute2 syntax to describe
> this again, it would mean an interface like
> 
>    ip iov set  port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
>                              [ host_uuid HOST_UUID ]
>                      [ client_name CLIENT_NAME ]
>                                       [ client_uuid CLIENT_UUID ]
>    ip iov set  vsi { associate | pre-associate | pre-associate-rr }
> BASE-DEVICE
>                                       vsi MGR:VTID:VER
>                                       mac LLADDR [ vlan VID ]
>                                       client_uuid CLIENT_UUID
> 
>    ip iov del  port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov del  vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
>        [ client_uuid CLIENT_UUID ]
> 
>    ip iov show port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov show vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> [ client_uuid CLIENT_UUID ]
> 
> You would obvioulsy only implement the kernel support for the port-profile
> stuff as callbacks, because no driver yet does VDP in the kernel, but we
> should
> have a common netlink header that defines both variants.
> 
> Chris, any opinion on this interface as opposed to the combined one?
> Either one should work, but splitting it seems cleaner to me.

I'm OK with either version.  Your latest does seem cleanest.  Let's let
Chris be the final decider.  Chris, door #1 or door #2?

-scott


^ permalink raw reply

* [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Dan Smith @ 2010-04-30 17:00 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg
  Cc: Vlad Yasevich, netdev-u79uwXL29TY76Z2rM5mHXA, David Miller

This patch adds support for checkpointing and restoring route information.
It keeps enough information to restore basic routes at the level of detail
of /proc/net/route.  It uses RTNETLINK to extract the information during
checkpoint and also to insert it back during restore.  This gives us a
nice layer of isolation between us and the various "fib" implementations.

Changes in v2:

This version of the patch actually moves the current task into the
desired network namespace temporarily, for the purposes of examining and
restoring the route information.  This is a instead of creating a cross-
namespace socket to do the job, as was done in v1.

This is just an RFC to see if this is an acceptable method.  For a final
version, adding a helper to nsproxy.c would allow us to create a new
nsproxy with the desired netns instead of creating one with
copy_namespaces() just to kill it off and use the target one.

I still think the previous method is cleaner, but this way may violate
fewer namespace boundaries (I'm still undecided :)

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: Vlad Yasevich <vladislav.yasevich-VXdhtT5mjnY@public.gmane.org>
Cc: jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |   31 +++
 net/checkpoint_dev.c           |  463 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 493 insertions(+), 1 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 790214f..28b268a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -20,6 +20,7 @@
 #ifndef CONFIG_CHECKPOINT
 #warn linux/checkpoint_hdr.h included directly (without CONFIG_CHECKPOINT)
 #endif
+#include <linux/if.h>
 
 #else /* __KERNEL__ */
 
@@ -782,6 +783,7 @@ struct ckpt_hdr_file_socket {
 struct ckpt_hdr_netns {
 	struct ckpt_hdr h;
 	__s32 this_ref;
+	__u32 routes;
 } __attribute__((aligned(8)));
 
 enum ckpt_netdev_types {
@@ -826,6 +828,35 @@ struct ckpt_netdev_addr {
 	} __attribute__((aligned(8)));
 } __attribute__((aligned(8)));
 
+enum ckpt_route_types {
+	CKPT_ROUTE_IPV4,
+	CKPT_ROUTE_IPV6,
+	CKPT_ROUTE_MAX
+};
+
+#define CKPT_ROUTE_FLAG_GW 1
+
+struct ckpt_route {
+	__u16 type;
+	__u16 flags;
+
+	union {
+		struct {
+			__be32 inet4_len;          /* mask length (bits) */
+			__u32  inet4_met;          /* metric             */
+			__be32 inet4_dst;          /* route address      */
+			__be32 inet4_gwy;          /* gateway address    */
+		};
+		struct {
+			__u32 inet6_len;           /* mask length (bits) */
+			__u32 inet6_met;           /* metric             */
+			struct in6_addr inet6_dst; /* route address      */
+			struct in6_addr inet6_gwy; /* gateway address    */
+		};
+	} __attribute__((aligned(8)));
+	char dev[IFNAMSIZ+1];
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index a8e3341..cc5f0ac 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -16,9 +16,11 @@
 #include <linux/veth.h>
 #include <linux/checkpoint.h>
 #include <linux/deferqueue.h>
+#include <linux/fib_rules.h>
 
 #include <net/net_namespace.h>
 #include <net/sch_generic.h>
+#include <net/ipv6.h>
 
 struct veth_newlink {
 	char *peer;
@@ -59,6 +61,22 @@ static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
 	return ret;
 }
 
+static void debug_route(struct ckpt_route *route)
+{
+       if (route->type == CKPT_ROUTE_IPV4)
+               ckpt_debug("inet4 route %pI4/%i gw %pI4 metric %i dev %s\n",
+                          &route->inet4_dst, route->inet4_len,
+                          &route->inet4_gwy, route->inet4_met,
+                          route->dev);
+       else if (route->type == CKPT_ROUTE_IPV6)
+               ckpt_debug("inet6 route %pI6/%i gw %pI6 metric %i dev %s\n",
+                          &route->inet6_dst, route->inet6_len,
+                          &route->inet6_gwy, route->inet6_met,
+                          route->dev);
+       else
+               ckpt_debug("unknown route type %i\n", route->type);
+}
+
 static struct socket *rtnl_open(void)
 {
 	struct socket *sock;
@@ -250,11 +268,280 @@ int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
 	return ret;
 }
 
+static int rtnl_do_dump_routes(struct socket *rtnl, int family)
+{
+	struct sk_buff *skb = NULL;
+	struct rtmsg *rtm;
+	int flags = NLM_F_ROOT | NLM_F_REQUEST;
+	struct msghdr msg;
+	struct kvec kvec;
+	struct nlmsghdr *nlh;
+	int ret = -ENOMEM;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_GETROUTE, sizeof(*rtm), flags);
+	if (!nlh)
+		goto out;
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+	rtm->rtm_family = family;
+
+	nlmsg_end(skb, nlh);
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if ((ret >= 0) && (ret != skb->len))
+		ret = -EIO;
+ out:
+	kfree_skb(skb);
+
+	return ret;
+}
+
+static int rtnl_dump_routes(struct socket *rtnl, int family,
+			    struct sk_buff **skb)
+{
+	int ret = -ENOMEM;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+
+	*skb = NULL;
+
+	ret = rtnl_do_dump_routes(rtnl, family);
+	if (ret < 0)
+		return ret;
+
+	lock_sock(rtnl->sk);
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (ret)
+		*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
+	if (!*skb)
+		ret = -EIO;
+
+	return ret;
+}
+
+static int rtnl_process_inet4_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV4;
+	route->inet4_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		route->inet4_dst = htonl(nla_get_u32(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		route->inet4_gwy = htonl(nla_get_u32(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet4_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1; /* save this route */
+}
+
+static int rtnl_process_inet6_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV6;
+	route->inet6_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		ipv6_addr_copy(&route->inet6_dst, nla_data(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		ipv6_addr_copy(&route->inet6_gwy, nla_data(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet6_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1;
+}
+
+static int rtnl_process_routes(struct net *net,
+			       struct nlmsghdr *nlh, int len,
+			       struct ckpt_route *routes,
+			       int idx, int max)
+{
+	struct nlmsghdr *i;
+
+	for (i = nlh; NLMSG_OK(i, len); i = NLMSG_NEXT(i, len)) {
+		struct ckpt_route *route = &routes[idx];
+		struct rtmsg *rtm = NLMSG_DATA(i);
+		struct nlattr *tb[FRA_MAX+1];
+		int ret;
+
+		if (idx >= max)
+			return -E2BIG;
+
+		if (i->nlmsg_type == NLMSG_DONE)
+			break;
+		else if (nlh->nlmsg_type != RTM_NEWROUTE) {
+			struct nlmsgerr *errmsg = nlmsg_data(nlh);
+			return errmsg->error;
+		}
+
+		ret = nlmsg_parse(i, sizeof(*rtm), tb, FRA_MAX, NULL);
+		if (ret < 0)
+			return ret;
+
+		memset(route, 0, sizeof(*route));
+
+		if (rtm->rtm_family == AF_INET)
+			ret = rtnl_process_inet4_route(net, rtm, tb, route);
+		else if (rtm->rtm_family == AF_INET6)
+			ret = rtnl_process_inet6_route(net, rtm, tb, route);
+		else
+			ret = 0; /* skip */
+		if (ret < 0)
+			return ret;
+		else if (ret)
+			idx += 1;
+	}
+
+	return idx;
+}
+
+static int temp_netns_enter(struct net *net)
+{
+	int ret;
+	struct net *tmp_netns;
+
+	ret = copy_namespaces(CLONE_NEWNET, current);
+	if (ret)
+		return ret;
+
+	tmp_netns = current->nsproxy->net_ns;
+	get_net(net);
+	current->nsproxy->net_ns = net;
+	put_net(tmp_netns);
+
+	return 0;
+}
+
+static void temp_netns_exit(struct nsproxy *prev)
+{
+	switch_task_namespaces(current, prev);
+}
+
+static int rtnl_get_routes(struct net *net, int family,
+			   struct ckpt_route *routes, int idx, int max)
+{
+	int ret;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb = NULL;
+	struct socket *rtnl = NULL;
+	struct nsproxy *prev = current->nsproxy;
+
+	ret = temp_netns_enter(net);
+	if (ret)
+		return ret;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		goto out;
+	}
+
+	ret = rtnl_dump_routes(rtnl, family, &skb);
+	if (ret < 0)
+		goto out;
+
+	nlh = nlmsg_hdr(skb);
+	if (!nlh) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = rtnl_process_routes(net, nlh, skb->len, routes, idx, max);
+ out:
+	kfree_skb(skb);
+	rtnl_close(rtnl);
+	temp_netns_exit(prev);
+
+	return ret;
+}
+
+int checkpoint_netns_routes(struct ckpt_ctx *ctx, struct net *net,
+			    struct ckpt_route **_routes)
+{
+	struct ckpt_route *routes = NULL;
+	int max = 32;
+	int idx;
+	int families[] = {AF_INET, AF_INET6, 0};
+	int family;
+ retry:
+	idx = 0;
+	kfree(routes);
+	routes = kmalloc(max * sizeof(*routes), GFP_KERNEL);
+	if (!routes)
+		return -ENOMEM;
+
+	for (family = 0; families[family]; family++) {
+		idx = rtnl_get_routes(net, families[family], routes, idx, max);
+		if (idx == -E2BIG) {
+			max *= 2;
+			goto retry;
+		} else if (idx < 0)
+			break;
+	}
+
+	if (idx < 0) {
+		kfree(routes);
+		routes = NULL;
+		ckpt_err(ctx, idx, "error saving routes\n");
+	}
+	*_routes = routes;
+
+	return idx;
+}
+
 int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 {
 	struct net *net = ptr;
 	struct net_device *dev;
 	struct ckpt_hdr_netns *h;
+	struct ckpt_route *routes = NULL;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
@@ -264,10 +551,19 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
 	BUG_ON(h->this_ref <= 0);
 
+	ret = checkpoint_netns_routes(ctx, net, &routes);
+	if (ret < 0)
+		goto out;
+	h->routes = ret;
+
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (ret < 0)
 		goto out;
 
+	ret = ckpt_write_buffer(ctx, routes, h->routes * sizeof(*routes));
+	if (ret < 0)
+		goto out;
+
 	for_each_netdev(net, dev) {
 		if (dev->netdev_ops->ndo_checkpoint)
 			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
@@ -284,6 +580,7 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	}
  out:
 	ckpt_hdr_put(ctx, h);
+	kfree(routes);
 
 	return ret;
 }
@@ -825,10 +1122,152 @@ void *restore_netdev(struct ckpt_ctx *ctx)
 	return dev;
 }
 
+static int rtnl_restore_route(struct net *net, struct ckpt_route *route)
+{
+	struct sk_buff *skb;
+	struct rtmsg *rtm;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	int ret = 0;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWROUTE, sizeof(*rtm), flags);
+	if (!nlh) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+
+	rtm->rtm_table = RT_TABLE_MAIN;
+	rtm->rtm_protocol = RTPROT_BOOT;
+	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
+	rtm->rtm_type = RTN_UNICAST;
+
+	if (route->dev[0]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_name(net, route->dev);
+		if (!dev) {
+			ckpt_debug("unable to find dev %s for route\n",
+				   route->dev);
+			ret = -EINVAL;
+			goto out;
+		}
+		nla_put_u32(skb, RTA_OIF, dev->ifindex);
+		dev_put(dev);
+	}
+
+	if (route->type == CKPT_ROUTE_IPV4) {
+		rtm->rtm_family = AF_INET;
+		rtm->rtm_dst_len = route->inet4_len;
+
+		nla_put_u32(skb, RTA_DST, ntohl(route->inet4_dst));
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put_u32(skb, RTA_GATEWAY, ntohl(route->inet4_gwy));
+		nla_put_u32(skb, RTA_PRIORITY, route->inet4_met);
+	} else if (route->type == CKPT_ROUTE_IPV6) {
+		int len = sizeof(route->inet6_dst);
+
+		if (ipv6_addr_scope(&route->inet6_dst))
+			goto out; /* Skip non-global scope routes */
+
+		rtm->rtm_family = AF_INET6;
+		rtm->rtm_dst_len = route->inet6_len;
+
+		nla_put(skb, RTA_DST, len, &route->inet6_dst);
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put(skb, RTA_GATEWAY, len, &route->inet6_gwy);
+		nla_put_u32(skb, RTA_PRIORITY, route->inet6_met);
+	} else {
+		ckpt_debug("unsupported route type %i\n", route->type);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nlmsg_end(skb, nlh);
+
+	debug_route(route);
+
+	ret = rtnl_do(skb);
+ out:
+	kfree_skb(skb);
+	return ret;
+}
+
+static int restore_routes(struct net *net, struct ckpt_route *routes, int count)
+{
+	int i;
+	int ret = 0;
+	struct nsproxy *prev = current->nsproxy;
+
+	ret = temp_netns_enter(net);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < count; i++) {
+		struct ckpt_route *route = &routes[i];
+
+		ret = rtnl_restore_route(net, route);
+		if (ret == -EEXIST)
+			/* Some routes have been implied by device addresses */
+			continue;
+		else if (ret < 0)
+			break;
+	}
+
+	temp_netns_exit(prev);
+
+	return ret;
+}
+
+struct dq_routes {
+	struct ckpt_ctx *ctx;
+	struct net *net;
+	struct ckpt_route *routes;
+	int count;
+};
+
+static int deferred_restore_routes(void *data)
+{
+	struct dq_routes *dq = data;
+	int ret;
+
+	ret = restore_routes(dq->net, dq->routes, dq->count);
+	if (ret < 0)
+		ckpt_err(dq->ctx, ret, "failed to restore routes\n");
+
+	kfree(dq->routes);
+
+	return ret;
+}
+
+static int defer_restore_routes(struct ckpt_ctx *ctx,
+				struct net *net,
+				struct ckpt_route *routes,
+				int count)
+{
+	struct dq_routes dq;
+
+	dq.ctx = ctx;
+	dq.net = net;
+	dq.routes = routes;
+	dq.count = count;
+
+	return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+			      deferred_restore_routes, NULL);
+}
+
 void *restore_netns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_netns *h;
 	struct net *net;
+	struct ckpt_route *routes = NULL;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
 	if (IS_ERR(h)) {
@@ -836,12 +1275,34 @@ void *restore_netns(struct ckpt_ctx *ctx)
 		return h;
 	}
 
+	ret = ckpt_read_payload(ctx, (void **)&routes,
+				h->routes * sizeof(*routes), CKPT_HDR_BUFFER);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "Unable to read routes buffer\n");
+		net = ERR_PTR(ret);
+		goto out;
+	}
+
 	if (h->this_ref != 0) {
 		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
 		if (IS_ERR(net))
 			goto out;
-	} else
+
+		ret = defer_restore_routes(ctx, net, routes, h->routes);
+		if (ret < 0) {
+			kfree(routes);
+			put_net(net);
+			net = ERR_PTR(ret);
+		}
+	} else {
+		if (h->routes) {
+			net = ERR_PTR(-EINVAL);
+			ckpt_err(ctx, -EINVAL,
+				 "Parent netns claims to have routes\n");
+			goto out;
+		}
 		net = current->nsproxy->net_ns;
+	}
  out:
 	ckpt_hdr_put(ctx, h);
 
-- 
1.6.2.5

^ permalink raw reply related

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-04-30  7:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100429.233958.212393607.davem@davemloft.net>

> That's not what you're implementing here.
>
> You're only updating the socket timestamp if the SKB passed into
> the update function has a more recent timestamp.
>
Yes that is the intent.  This provides a measure time from when all
the data in the recvmsg is present on the socket and when the
application actually consumes it.  It has been quite useful for
demonstrating to apps writers when their application is being slow to
read the data, as opposed to the stack being slow.

> There is nothing that says the timestamps have to be increasing and
> with retransmits and such if it were me I would want to see the real
> timestamp even if it was earlier than the most recently reported
> timestamp.
>
> I don't know, I really don't like this feature at all.  SO_TIMESTAMP
> is really meant for datagram oriented sockets, where there is a
> clearly defined "packet" whose timestamp you get.  A TCP receive can
> involve hundreds of tiny packets so the timestamp can lack any
> meaning.
>
And a UDP datagram could be composed of hundreds of IP fragments, so
there's still no clear definition of a "packet"...  in fact the choice
of which skb to use as the representative timestamp seems pretty
arbitrary (if the semantics of the timestamp is for when the "datagram
is received", I would think that is when reassembly is complete, i.e.
timestamp of last packet received).

> All these new checks and branches for a feature of questionable value.

> If you can modify you apps to grab this information you can also probe
> for the information using external probing tools.
>
I don't see an nice way to do that, we're profiling a significant
percentage of millions of connections over thousands of paths as part
of standard operations while incurring negligible overhead.  The app
can can easily timestamp its operations, but without some mechanism
for getting timestamps out of a TCP connection, the networking portion
of servicing requests is pretty much a black box in that.

> Sorry, I don't think I'll be applying this.
>
Thanks for consideration!

Tom

^ permalink raw reply

* RE: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Allan, Bruce W @ 2010-04-29 17:19 UTC (permalink / raw)
  To: Anton Blanchard, Kirsher, Jeffrey T
  Cc: davem@davemloft.net, netdev@vger.kernel.org, gospo@redhat.com,
	Matthew Garret
In-Reply-To: <20100429074606.GA12437@kryten>

On Thursday, April 29, 2010 12:46 AM, Anton Blanchard wrote:
> This oopses on one of my ppc64 boxes with a NULL pointer (0x4a):
> 
> Unable to handle kernel paging request for data at address 0x0000004a
> Faulting instruction address: 0xc0000000004d2f1c
> cpu 0xe: Vector: 300 (Data Access) at [c000000bec1833a0]
>     pc: c0000000004d2f1c: .e1000e_disable_aspm+0xe0/0x150
>     lr: c0000000004d2f0c: .e1000e_disable_aspm+0xd0/0x150
>    dar: 4a
> 
> [c000000bec1836d0] c00000000069b9d8 .e1000_probe+0x84/0xe8c
> [c000000bec1837b0] c000000000386d90 .local_pci_probe+0x4c/0x68
> [c000000bec183840] c0000000003872ac .pci_device_probe+0xfc/0x148
> [c000000bec183900] c000000000409e8c .driver_probe_device+0xe4/0x1d0
> [c000000bec1839a0] c00000000040a024 .__driver_attach+0xac/0xf4
> [c000000bec183a40] c000000000409124 .bus_for_each_dev+0x9c/0x10c
> [c000000bec183b00] c000000000409c1c .driver_attach+0x40/0x60
> [c000000bec183b90] c0000000004085dc .bus_add_driver+0x150/0x328
> [c000000bec183c40] c00000000040a58c .driver_register+0x100/0x1c4
> [c000000bec183cf0] c00000000038764c .__pci_register_driver+0x78/0x128
> 
> Seems like pdev->bus->self == NULL. I haven't touched pci in a long
> time 
> so I'm trying to remember what this means (no pcie bridge perhaps?)
> 
> The patch below fixes the oops for me.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> 
> Index: linux-2.6.34-rc5/drivers/net/e1000e/netdev.c
> ===================================================================
> --- linux-2.6.34-rc5.orig/drivers/net/e1000e/netdev.c	2010-04-29
> 00:10:58.000000000 -0500 +++
> linux-2.6.34-rc5/drivers/net/e1000e/netdev.c	2010-04-29
>  	02:20:50.000000000 -0500 @@ -4633,6 +4633,9 @@ reg16 &= ~state;
>  	pci_write_config_word(pdev, pos + PCI_EXP_LNKCTL, reg16);
> 
> +	if (!pdev->bus->self)
> +		return;
> +
>  	pos = pci_pcie_cap(pdev->bus->self);
>  	pci_read_config_word(pdev->bus->self, pos + PCI_EXP_LNKCTL, &reg16);
>  	reg16 &= ~state;

Your patch is probably the correct thing to do but I'm not all that familiar with the ppc64 architecture.  Would you please provide the output of 'lspci -t' and 'lspci -vvv -xxx'.

Thanks,
Bruce.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox