Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Tejun Heo @ 2010-04-30  5:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg KH, bcrl, benjamin.thery, cornelia.huck, eric.dumazet,
	kay.sievers, netdev, serue
In-Reply-To: <m1aaslbjh4.fsf@fess.ebiederm.org>

Hello,

On 04/30/2010 07:24 AM, Eric W. Biederman wrote:
>>> I wish at least more comments are added before it goes mainline.  I
>>> don't really understand the current form.
>>
>> Ok, that's fine with me, I'll pull it back out.
> 
> ?????
> 
> Tejun you have offered nothing constructive to the review, except looking
> and saying you don't understand what is going on.

Eric, no need to get too touchy and you're right in part in saying all
I'm saying is basically "I don't understand it" which is the same
reason why I'm not nacking it and explicitly stated that I would be
okay with the series going in if Greg/Kay would be okay with it.
Again, about the same thing with the above comment, I was *wishing*
for more comments *before it goes mainline*.

> Tejun I think for the code to make any sense to you I would need to rip
> out out and/or rewrite the kobject layer, and possible the device
> model code as well.

And yes, in the long run, please do that.

> Tejun I'm sorry you can't understand the code, and I'm sorry the code
> may be over-general.  In part that is because making the code
> over-general is what you asked for when reviewing it the first time.

Please give me some credit.  I mean that the code is difficult to
follow and justify when I say I don't understand it.  Yeah, I tried to
understand it and I think I understand how it *works* in its current
form but I just don't think the design is justified or logical.  You
say it's infeasible to do it in straightforward manner in reasonable
amount of time and that's why I neither acked or nacked the series and
deferred the decision to the subsystem maintainer.

But, at the very least, please add some comments.  Try to explain what
each callbacks are supposed to do and why they're there.  Not everyone
lives in your head.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Richard Cochran @ 2010-04-29  9:42 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BD95048.7050606@grandegger.com>

On Thu, Apr 29, 2010 at 11:24:24AM +0200, Wolfgang Grandegger wrote:
> I used these interrupt number fixes as well but it was not necessary for
> the actual net-next-2.6 tree. Need to check why? I remember some version
> dependent re-mapping code.

I argued on the ppc list with Scott Wood about adding dts files, one
for each of mpc8313 rev A, B, and C, but he advocated fixing this
problem in uboot instead. Is the fix in uboot, or in the kernel?

> That's missing to get the PPS signal output. But it should probably go
> to gianfar_ptp.c.

Well, this fix is specific to the mpc8313, but the gianfar_ptp driver
is for all eTECs. For example, I have the ptp code running on the
p2020rdb and p2020ds, too.

I don't think board fixups really belong in the PTP clock driver.

Just my 2 cents,
Richard

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-04-29 16:31 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <201004291748.38702.arnd@arndb.de>

On 4/29/10 8:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> I believe Chris is the one that was pushing most for having a single interface
> for both VDP/LLDPAD and enic.
> While I now understand your reasons for doing it in firmware and requiring the
> kernel interface in addition to the user interface, my doubts on whether VDP
> and your protocol should be part of the same interface are increasing.
> 
> While I'm convinced that you can make it work for both now, the alternative
> to split the two may turn out to be cleaner. We'd still be able to do
> either of the two in kernel or user space. Using iproute2 syntax to describe
> this again, it would mean an interface like
> 
>    ip iov set  port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
>                              [ host_uuid HOST_UUID ]
>                      [ client_name CLIENT_NAME ]
>                                       [ client_uuid CLIENT_UUID ]
>    ip iov set  vsi { associate | pre-associate | pre-associate-rr }
> BASE-DEVICE
>                                       vsi MGR:VTID:VER
>                                       mac LLADDR [ vlan VID ]
>                                       client_uuid CLIENT_UUID
> 
>    ip iov del  port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov del  vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
>        [ client_uuid CLIENT_UUID ]
> 
>    ip iov show port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov show vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> [ client_uuid CLIENT_UUID ]
> 
> You would obvioulsy only implement the kernel support for the port-profile
> stuff as callbacks, because no driver yet does VDP in the kernel, but we
> should
> have a common netlink header that defines both variants.
> 
> Chris, any opinion on this interface as opposed to the combined one?
> Either one should work, but splitting it seems cleaner to me.

I'm OK with either version.  Your latest does seem cleanest.  Let's let
Chris be the final decider.  Chris, door #1 or door #2?

-scott


^ permalink raw reply

* [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Dan Smith @ 2010-04-30 17:00 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg
  Cc: Vlad Yasevich, netdev-u79uwXL29TY76Z2rM5mHXA, David Miller

This patch adds support for checkpointing and restoring route information.
It keeps enough information to restore basic routes at the level of detail
of /proc/net/route.  It uses RTNETLINK to extract the information during
checkpoint and also to insert it back during restore.  This gives us a
nice layer of isolation between us and the various "fib" implementations.

Changes in v2:

This version of the patch actually moves the current task into the
desired network namespace temporarily, for the purposes of examining and
restoring the route information.  This is a instead of creating a cross-
namespace socket to do the job, as was done in v1.

This is just an RFC to see if this is an acceptable method.  For a final
version, adding a helper to nsproxy.c would allow us to create a new
nsproxy with the desired netns instead of creating one with
copy_namespaces() just to kill it off and use the target one.

I still think the previous method is cleaner, but this way may violate
fewer namespace boundaries (I'm still undecided :)

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: Vlad Yasevich <vladislav.yasevich-VXdhtT5mjnY@public.gmane.org>
Cc: jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |   31 +++
 net/checkpoint_dev.c           |  463 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 493 insertions(+), 1 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 790214f..28b268a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -20,6 +20,7 @@
 #ifndef CONFIG_CHECKPOINT
 #warn linux/checkpoint_hdr.h included directly (without CONFIG_CHECKPOINT)
 #endif
+#include <linux/if.h>
 
 #else /* __KERNEL__ */
 
@@ -782,6 +783,7 @@ struct ckpt_hdr_file_socket {
 struct ckpt_hdr_netns {
 	struct ckpt_hdr h;
 	__s32 this_ref;
+	__u32 routes;
 } __attribute__((aligned(8)));
 
 enum ckpt_netdev_types {
@@ -826,6 +828,35 @@ struct ckpt_netdev_addr {
 	} __attribute__((aligned(8)));
 } __attribute__((aligned(8)));
 
+enum ckpt_route_types {
+	CKPT_ROUTE_IPV4,
+	CKPT_ROUTE_IPV6,
+	CKPT_ROUTE_MAX
+};
+
+#define CKPT_ROUTE_FLAG_GW 1
+
+struct ckpt_route {
+	__u16 type;
+	__u16 flags;
+
+	union {
+		struct {
+			__be32 inet4_len;          /* mask length (bits) */
+			__u32  inet4_met;          /* metric             */
+			__be32 inet4_dst;          /* route address      */
+			__be32 inet4_gwy;          /* gateway address    */
+		};
+		struct {
+			__u32 inet6_len;           /* mask length (bits) */
+			__u32 inet6_met;           /* metric             */
+			struct in6_addr inet6_dst; /* route address      */
+			struct in6_addr inet6_gwy; /* gateway address    */
+		};
+	} __attribute__((aligned(8)));
+	char dev[IFNAMSIZ+1];
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index a8e3341..cc5f0ac 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -16,9 +16,11 @@
 #include <linux/veth.h>
 #include <linux/checkpoint.h>
 #include <linux/deferqueue.h>
+#include <linux/fib_rules.h>
 
 #include <net/net_namespace.h>
 #include <net/sch_generic.h>
+#include <net/ipv6.h>
 
 struct veth_newlink {
 	char *peer;
@@ -59,6 +61,22 @@ static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
 	return ret;
 }
 
+static void debug_route(struct ckpt_route *route)
+{
+       if (route->type == CKPT_ROUTE_IPV4)
+               ckpt_debug("inet4 route %pI4/%i gw %pI4 metric %i dev %s\n",
+                          &route->inet4_dst, route->inet4_len,
+                          &route->inet4_gwy, route->inet4_met,
+                          route->dev);
+       else if (route->type == CKPT_ROUTE_IPV6)
+               ckpt_debug("inet6 route %pI6/%i gw %pI6 metric %i dev %s\n",
+                          &route->inet6_dst, route->inet6_len,
+                          &route->inet6_gwy, route->inet6_met,
+                          route->dev);
+       else
+               ckpt_debug("unknown route type %i\n", route->type);
+}
+
 static struct socket *rtnl_open(void)
 {
 	struct socket *sock;
@@ -250,11 +268,280 @@ int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
 	return ret;
 }
 
+static int rtnl_do_dump_routes(struct socket *rtnl, int family)
+{
+	struct sk_buff *skb = NULL;
+	struct rtmsg *rtm;
+	int flags = NLM_F_ROOT | NLM_F_REQUEST;
+	struct msghdr msg;
+	struct kvec kvec;
+	struct nlmsghdr *nlh;
+	int ret = -ENOMEM;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_GETROUTE, sizeof(*rtm), flags);
+	if (!nlh)
+		goto out;
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+	rtm->rtm_family = family;
+
+	nlmsg_end(skb, nlh);
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if ((ret >= 0) && (ret != skb->len))
+		ret = -EIO;
+ out:
+	kfree_skb(skb);
+
+	return ret;
+}
+
+static int rtnl_dump_routes(struct socket *rtnl, int family,
+			    struct sk_buff **skb)
+{
+	int ret = -ENOMEM;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+
+	*skb = NULL;
+
+	ret = rtnl_do_dump_routes(rtnl, family);
+	if (ret < 0)
+		return ret;
+
+	lock_sock(rtnl->sk);
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (ret)
+		*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
+	if (!*skb)
+		ret = -EIO;
+
+	return ret;
+}
+
+static int rtnl_process_inet4_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV4;
+	route->inet4_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		route->inet4_dst = htonl(nla_get_u32(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		route->inet4_gwy = htonl(nla_get_u32(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet4_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1; /* save this route */
+}
+
+static int rtnl_process_inet6_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV6;
+	route->inet6_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		ipv6_addr_copy(&route->inet6_dst, nla_data(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		ipv6_addr_copy(&route->inet6_gwy, nla_data(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet6_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1;
+}
+
+static int rtnl_process_routes(struct net *net,
+			       struct nlmsghdr *nlh, int len,
+			       struct ckpt_route *routes,
+			       int idx, int max)
+{
+	struct nlmsghdr *i;
+
+	for (i = nlh; NLMSG_OK(i, len); i = NLMSG_NEXT(i, len)) {
+		struct ckpt_route *route = &routes[idx];
+		struct rtmsg *rtm = NLMSG_DATA(i);
+		struct nlattr *tb[FRA_MAX+1];
+		int ret;
+
+		if (idx >= max)
+			return -E2BIG;
+
+		if (i->nlmsg_type == NLMSG_DONE)
+			break;
+		else if (nlh->nlmsg_type != RTM_NEWROUTE) {
+			struct nlmsgerr *errmsg = nlmsg_data(nlh);
+			return errmsg->error;
+		}
+
+		ret = nlmsg_parse(i, sizeof(*rtm), tb, FRA_MAX, NULL);
+		if (ret < 0)
+			return ret;
+
+		memset(route, 0, sizeof(*route));
+
+		if (rtm->rtm_family == AF_INET)
+			ret = rtnl_process_inet4_route(net, rtm, tb, route);
+		else if (rtm->rtm_family == AF_INET6)
+			ret = rtnl_process_inet6_route(net, rtm, tb, route);
+		else
+			ret = 0; /* skip */
+		if (ret < 0)
+			return ret;
+		else if (ret)
+			idx += 1;
+	}
+
+	return idx;
+}
+
+static int temp_netns_enter(struct net *net)
+{
+	int ret;
+	struct net *tmp_netns;
+
+	ret = copy_namespaces(CLONE_NEWNET, current);
+	if (ret)
+		return ret;
+
+	tmp_netns = current->nsproxy->net_ns;
+	get_net(net);
+	current->nsproxy->net_ns = net;
+	put_net(tmp_netns);
+
+	return 0;
+}
+
+static void temp_netns_exit(struct nsproxy *prev)
+{
+	switch_task_namespaces(current, prev);
+}
+
+static int rtnl_get_routes(struct net *net, int family,
+			   struct ckpt_route *routes, int idx, int max)
+{
+	int ret;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb = NULL;
+	struct socket *rtnl = NULL;
+	struct nsproxy *prev = current->nsproxy;
+
+	ret = temp_netns_enter(net);
+	if (ret)
+		return ret;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		goto out;
+	}
+
+	ret = rtnl_dump_routes(rtnl, family, &skb);
+	if (ret < 0)
+		goto out;
+
+	nlh = nlmsg_hdr(skb);
+	if (!nlh) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = rtnl_process_routes(net, nlh, skb->len, routes, idx, max);
+ out:
+	kfree_skb(skb);
+	rtnl_close(rtnl);
+	temp_netns_exit(prev);
+
+	return ret;
+}
+
+int checkpoint_netns_routes(struct ckpt_ctx *ctx, struct net *net,
+			    struct ckpt_route **_routes)
+{
+	struct ckpt_route *routes = NULL;
+	int max = 32;
+	int idx;
+	int families[] = {AF_INET, AF_INET6, 0};
+	int family;
+ retry:
+	idx = 0;
+	kfree(routes);
+	routes = kmalloc(max * sizeof(*routes), GFP_KERNEL);
+	if (!routes)
+		return -ENOMEM;
+
+	for (family = 0; families[family]; family++) {
+		idx = rtnl_get_routes(net, families[family], routes, idx, max);
+		if (idx == -E2BIG) {
+			max *= 2;
+			goto retry;
+		} else if (idx < 0)
+			break;
+	}
+
+	if (idx < 0) {
+		kfree(routes);
+		routes = NULL;
+		ckpt_err(ctx, idx, "error saving routes\n");
+	}
+	*_routes = routes;
+
+	return idx;
+}
+
 int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 {
 	struct net *net = ptr;
 	struct net_device *dev;
 	struct ckpt_hdr_netns *h;
+	struct ckpt_route *routes = NULL;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
@@ -264,10 +551,19 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
 	BUG_ON(h->this_ref <= 0);
 
+	ret = checkpoint_netns_routes(ctx, net, &routes);
+	if (ret < 0)
+		goto out;
+	h->routes = ret;
+
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (ret < 0)
 		goto out;
 
+	ret = ckpt_write_buffer(ctx, routes, h->routes * sizeof(*routes));
+	if (ret < 0)
+		goto out;
+
 	for_each_netdev(net, dev) {
 		if (dev->netdev_ops->ndo_checkpoint)
 			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
@@ -284,6 +580,7 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	}
  out:
 	ckpt_hdr_put(ctx, h);
+	kfree(routes);
 
 	return ret;
 }
@@ -825,10 +1122,152 @@ void *restore_netdev(struct ckpt_ctx *ctx)
 	return dev;
 }
 
+static int rtnl_restore_route(struct net *net, struct ckpt_route *route)
+{
+	struct sk_buff *skb;
+	struct rtmsg *rtm;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	int ret = 0;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWROUTE, sizeof(*rtm), flags);
+	if (!nlh) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+
+	rtm->rtm_table = RT_TABLE_MAIN;
+	rtm->rtm_protocol = RTPROT_BOOT;
+	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
+	rtm->rtm_type = RTN_UNICAST;
+
+	if (route->dev[0]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_name(net, route->dev);
+		if (!dev) {
+			ckpt_debug("unable to find dev %s for route\n",
+				   route->dev);
+			ret = -EINVAL;
+			goto out;
+		}
+		nla_put_u32(skb, RTA_OIF, dev->ifindex);
+		dev_put(dev);
+	}
+
+	if (route->type == CKPT_ROUTE_IPV4) {
+		rtm->rtm_family = AF_INET;
+		rtm->rtm_dst_len = route->inet4_len;
+
+		nla_put_u32(skb, RTA_DST, ntohl(route->inet4_dst));
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put_u32(skb, RTA_GATEWAY, ntohl(route->inet4_gwy));
+		nla_put_u32(skb, RTA_PRIORITY, route->inet4_met);
+	} else if (route->type == CKPT_ROUTE_IPV6) {
+		int len = sizeof(route->inet6_dst);
+
+		if (ipv6_addr_scope(&route->inet6_dst))
+			goto out; /* Skip non-global scope routes */
+
+		rtm->rtm_family = AF_INET6;
+		rtm->rtm_dst_len = route->inet6_len;
+
+		nla_put(skb, RTA_DST, len, &route->inet6_dst);
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put(skb, RTA_GATEWAY, len, &route->inet6_gwy);
+		nla_put_u32(skb, RTA_PRIORITY, route->inet6_met);
+	} else {
+		ckpt_debug("unsupported route type %i\n", route->type);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nlmsg_end(skb, nlh);
+
+	debug_route(route);
+
+	ret = rtnl_do(skb);
+ out:
+	kfree_skb(skb);
+	return ret;
+}
+
+static int restore_routes(struct net *net, struct ckpt_route *routes, int count)
+{
+	int i;
+	int ret = 0;
+	struct nsproxy *prev = current->nsproxy;
+
+	ret = temp_netns_enter(net);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < count; i++) {
+		struct ckpt_route *route = &routes[i];
+
+		ret = rtnl_restore_route(net, route);
+		if (ret == -EEXIST)
+			/* Some routes have been implied by device addresses */
+			continue;
+		else if (ret < 0)
+			break;
+	}
+
+	temp_netns_exit(prev);
+
+	return ret;
+}
+
+struct dq_routes {
+	struct ckpt_ctx *ctx;
+	struct net *net;
+	struct ckpt_route *routes;
+	int count;
+};
+
+static int deferred_restore_routes(void *data)
+{
+	struct dq_routes *dq = data;
+	int ret;
+
+	ret = restore_routes(dq->net, dq->routes, dq->count);
+	if (ret < 0)
+		ckpt_err(dq->ctx, ret, "failed to restore routes\n");
+
+	kfree(dq->routes);
+
+	return ret;
+}
+
+static int defer_restore_routes(struct ckpt_ctx *ctx,
+				struct net *net,
+				struct ckpt_route *routes,
+				int count)
+{
+	struct dq_routes dq;
+
+	dq.ctx = ctx;
+	dq.net = net;
+	dq.routes = routes;
+	dq.count = count;
+
+	return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+			      deferred_restore_routes, NULL);
+}
+
 void *restore_netns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_netns *h;
 	struct net *net;
+	struct ckpt_route *routes = NULL;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
 	if (IS_ERR(h)) {
@@ -836,12 +1275,34 @@ void *restore_netns(struct ckpt_ctx *ctx)
 		return h;
 	}
 
+	ret = ckpt_read_payload(ctx, (void **)&routes,
+				h->routes * sizeof(*routes), CKPT_HDR_BUFFER);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "Unable to read routes buffer\n");
+		net = ERR_PTR(ret);
+		goto out;
+	}
+
 	if (h->this_ref != 0) {
 		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
 		if (IS_ERR(net))
 			goto out;
-	} else
+
+		ret = defer_restore_routes(ctx, net, routes, h->routes);
+		if (ret < 0) {
+			kfree(routes);
+			put_net(net);
+			net = ERR_PTR(ret);
+		}
+	} else {
+		if (h->routes) {
+			net = ERR_PTR(-EINVAL);
+			ckpt_err(ctx, -EINVAL,
+				 "Parent netns claims to have routes\n");
+			goto out;
+		}
 		net = current->nsproxy->net_ns;
+	}
  out:
 	ckpt_hdr_put(ctx, h);
 
-- 
1.6.2.5

^ permalink raw reply related

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-04-30  7:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100429.233958.212393607.davem@davemloft.net>

> That's not what you're implementing here.
>
> You're only updating the socket timestamp if the SKB passed into
> the update function has a more recent timestamp.
>
Yes that is the intent.  This provides a measure time from when all
the data in the recvmsg is present on the socket and when the
application actually consumes it.  It has been quite useful for
demonstrating to apps writers when their application is being slow to
read the data, as opposed to the stack being slow.

> There is nothing that says the timestamps have to be increasing and
> with retransmits and such if it were me I would want to see the real
> timestamp even if it was earlier than the most recently reported
> timestamp.
>
> I don't know, I really don't like this feature at all.  SO_TIMESTAMP
> is really meant for datagram oriented sockets, where there is a
> clearly defined "packet" whose timestamp you get.  A TCP receive can
> involve hundreds of tiny packets so the timestamp can lack any
> meaning.
>
And a UDP datagram could be composed of hundreds of IP fragments, so
there's still no clear definition of a "packet"...  in fact the choice
of which skb to use as the representative timestamp seems pretty
arbitrary (if the semantics of the timestamp is for when the "datagram
is received", I would think that is when reassembly is complete, i.e.
timestamp of last packet received).

> All these new checks and branches for a feature of questionable value.

> If you can modify you apps to grab this information you can also probe
> for the information using external probing tools.
>
I don't see an nice way to do that, we're profiling a significant
percentage of millions of connections over thousands of paths as part
of standard operations while incurring negligible overhead.  The app
can can easily timestamp its operations, but without some mechanism
for getting timestamps out of a TCP connection, the networking portion
of servicing requests is pretty much a black box in that.

> Sorry, I don't think I'll be applying this.
>
Thanks for consideration!

Tom

^ permalink raw reply

* RE: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Allan, Bruce W @ 2010-04-29 17:19 UTC (permalink / raw)
  To: Anton Blanchard, Kirsher, Jeffrey T
  Cc: davem@davemloft.net, netdev@vger.kernel.org, gospo@redhat.com,
	Matthew Garret
In-Reply-To: <20100429074606.GA12437@kryten>

On Thursday, April 29, 2010 12:46 AM, Anton Blanchard wrote:
> This oopses on one of my ppc64 boxes with a NULL pointer (0x4a):
> 
> Unable to handle kernel paging request for data at address 0x0000004a
> Faulting instruction address: 0xc0000000004d2f1c
> cpu 0xe: Vector: 300 (Data Access) at [c000000bec1833a0]
>     pc: c0000000004d2f1c: .e1000e_disable_aspm+0xe0/0x150
>     lr: c0000000004d2f0c: .e1000e_disable_aspm+0xd0/0x150
>    dar: 4a
> 
> [c000000bec1836d0] c00000000069b9d8 .e1000_probe+0x84/0xe8c
> [c000000bec1837b0] c000000000386d90 .local_pci_probe+0x4c/0x68
> [c000000bec183840] c0000000003872ac .pci_device_probe+0xfc/0x148
> [c000000bec183900] c000000000409e8c .driver_probe_device+0xe4/0x1d0
> [c000000bec1839a0] c00000000040a024 .__driver_attach+0xac/0xf4
> [c000000bec183a40] c000000000409124 .bus_for_each_dev+0x9c/0x10c
> [c000000bec183b00] c000000000409c1c .driver_attach+0x40/0x60
> [c000000bec183b90] c0000000004085dc .bus_add_driver+0x150/0x328
> [c000000bec183c40] c00000000040a58c .driver_register+0x100/0x1c4
> [c000000bec183cf0] c00000000038764c .__pci_register_driver+0x78/0x128
> 
> Seems like pdev->bus->self == NULL. I haven't touched pci in a long
> time 
> so I'm trying to remember what this means (no pcie bridge perhaps?)
> 
> The patch below fixes the oops for me.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> 
> Index: linux-2.6.34-rc5/drivers/net/e1000e/netdev.c
> ===================================================================
> --- linux-2.6.34-rc5.orig/drivers/net/e1000e/netdev.c	2010-04-29
> 00:10:58.000000000 -0500 +++
> linux-2.6.34-rc5/drivers/net/e1000e/netdev.c	2010-04-29
>  	02:20:50.000000000 -0500 @@ -4633,6 +4633,9 @@ reg16 &= ~state;
>  	pci_write_config_word(pdev, pos + PCI_EXP_LNKCTL, reg16);
> 
> +	if (!pdev->bus->self)
> +		return;
> +
>  	pos = pci_pcie_cap(pdev->bus->self);
>  	pci_read_config_word(pdev->bus->self, pos + PCI_EXP_LNKCTL, &reg16);
>  	reg16 &= ~state;

Your patch is probably the correct thing to do but I'm not all that familiar with the ppc64 architecture.  Would you please provide the output of 'lspci -t' and 'lspci -vvv -xxx'.

Thanks,
Bruce.

^ permalink raw reply

* [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-04-29  9:19 UTC (permalink / raw)
  To: netdev

This patch adds an infrastructure for hardware clocks that implement
IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
registration method to particular hardware clock drivers. Each clock is
exposed to user space as a character device with ioctls that allow tuning
of the PTP clock.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 Documentation/ptp/ptp.txt        |   78 ++++++++++
 Documentation/ptp/testptp.c      |  130 ++++++++++++++++
 Documentation/ptp/testptp.mk     |   33 ++++
 drivers/Kconfig                  |    2 +
 drivers/Makefile                 |    1 +
 drivers/ptp/Kconfig              |   26 ++++
 drivers/ptp/Makefile             |    5 +
 drivers/ptp/ptp_clock.c          |  302 ++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild             |    1 +
 include/linux/ptp_clock.h        |   37 +++++
 include/linux/ptp_clock_kernel.h |  134 +++++++++++++++++
 11 files changed, 749 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ptp/ptp.txt
 create mode 100644 Documentation/ptp/testptp.c
 create mode 100644 Documentation/ptp/testptp.mk
 create mode 100644 drivers/ptp/Kconfig
 create mode 100644 drivers/ptp/Makefile
 create mode 100644 drivers/ptp/ptp_clock.c
 create mode 100644 include/linux/ptp_clock.h
 create mode 100644 include/linux/ptp_clock_kernel.h

diff --git a/Documentation/ptp/ptp.txt b/Documentation/ptp/ptp.txt
new file mode 100644
index 0000000..35dc01a
--- /dev/null
+++ b/Documentation/ptp/ptp.txt
@@ -0,0 +1,78 @@
+
+* PTP infrastructure for Linux
+
+  This patch set introduces support for IEEE 1588 PTP clocks in
+  Linux. Together with the SO_TIMESTAMPING socket options, this
+  presents standardized method for developing PTP user space programs,
+  synchronizing Linux with external clocks, and using the ancillary
+  features of PTP hardware clocks.
+
+  A new class driver exports a kernel interface for specific clock
+  drivers and a user space interface. The infrastructure supports a
+  complete set of PTP functionality.
+
+  + Basic clock operations
+    - Set time
+    - Get time
+    - Shift the clock by a given offset atomically
+    - Adjust clock frequency
+
+  + Ancillary clock features
+    - One short or periodic alarms, with signal delivery to user program
+    - Time stamp external events
+    - Period output signals configurable from user space
+    - Synchronization of the Linux system time via the PPS subsystem
+
+** PTP kernel API
+
+   A PTP clock driver registers itself with the class driver. The
+   class driver handles all of the dealings with user space. The
+   author of a clock driver need only implement the details of
+   programming the clock hardware. The clock driver notifies the class
+   driver of asynchronous events (alarms and external time stamps) via
+   a simple message passing interface.
+
+   The class driver supports multiple PTP clock drivers. In normal use
+   cases, only one PTP clock is needed. However, for testing and
+   development, it can be useful to have more than one clock in a
+   single system, in order to allow performance comparisons.
+
+** PTP user space API
+
+   The class driver creates a character device for each registered PTP
+   clock. User space programs may control the clock via standardized
+   ioctls. A program may query, enable, configure, and disable the
+   ancillary clock features. User space can receive time stamped
+   events via blocking read() and poll(). One shot and periodic
+   signals may be configured via an ioctl API with similar semantics
+   to the POSIX timer_settime() system call.
+
+   As an real life example, the following two patches for ptpd version
+   1.0.0 demonstrate how the API works.
+
+   https://sourceforge.net/tracker/?func=detail&aid=2992845&group_id=139814&atid=744634
+
+   https://sourceforge.net/tracker/?func=detail&aid=2992847&group_id=139814&atid=744634
+
+** Supported hardware
+
+   + Standard Linux system timer
+     - No special PTP features
+     - For use with software time stamping
+
+   + Freescale eTSEC gianfar
+     - 2 Time stamp external triggers, programmable polarity (opt. interrupt)
+     - 2 Alarm registers (optional interrupt)
+     - 3 Periodic signals (optional interrupt)
+
+   + National DP83640
+     - 6 GPIOs programmable as inputs or outputs
+     - 6 GPIOs with dedicated functions (LED/JTAG/clock) can also be
+       used as general inputs or outputs
+     - GPIO inputs can time stamp external triggers
+     - GPIO outputs can produce periodic signals
+     - 1 interrupt pin
+
+   + Intel IXP465
+     - Auxiliary Slave/Master Mode Snapshot (optional interrupt)
+     - Target Time (optional interrupt)
diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
new file mode 100644
index 0000000..d3ad918
--- /dev/null
+++ b/Documentation/ptp/testptp.c
@@ -0,0 +1,130 @@
+/*
+ * PTP 1588 clock support - User space test program
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <linux/ptp_clock.h>
+
+static void usage (char* progname)
+{
+	fprintf(stderr,
+		"usage: %s [options] \n"
+		" -d name    device to open \n"
+		" -f val     adjust the ptp clock frequency by 'val' PPB \n"
+		" -g         get the ptp clock time \n"
+		" -h         prints this message \n"
+		" -s         set the ptp clock time from the system time \n"
+		" -t val     shift the ptp clock time by 'val' seconds \n"
+		" -v         query the ptp clock api version \n"
+		,progname);
+}
+
+int main(int argc,char *argv[])
+{
+	struct timespec ts;
+	char *progname;
+	int c, fd, val=0;
+
+	char *device = "/dev/ptp_clock_0";
+	int adjfreq=0x7fffffff;
+	int adjtime=0;
+	int gettime=0;
+	int settime=0;
+	int version=0;
+
+	progname = strrchr(argv[0],'/');
+	progname = progname ? 1+progname : argv[0];
+	while (EOF != (c = getopt(argc, argv, "d:f:ghst:v"))) {
+		switch (c) {
+		case 'd': device = optarg; break;
+		case 'f': adjfreq = atoi(optarg); break;
+		case 'g': gettime = 1; break;
+		case 's': settime = 1; break;
+		case 't': adjtime = atoi(optarg); break;
+		case 'v': version = 1; break;
+		case 'h': usage(progname); return 0;
+		case '?': usage(progname); return -1;
+		default:  usage(progname); return -1;
+		}
+	}
+
+	fd = open(device, O_RDWR);
+	if (fd < 0) {
+		fprintf(stderr,"cannot open %s: %s", device, strerror(errno));
+		return -1;
+	}
+
+	if (version) {
+		if (ioctl(fd, PTP_CLOCK_APIVERS, &val)) {
+			perror("PTP_CLOCK_APIVERS");
+		} else {
+			printf("version = 0x%08x \n",val);
+		}
+	}
+
+	if (0x7fffffff != adjfreq) {
+		if (ioctl(fd, PTP_CLOCK_ADJFREQ, adjfreq)) {
+			perror("PTP_CLOCK_ADJFREQ");
+		} else {
+			puts("frequency adjustment okay");
+		}
+	}
+
+	if (adjtime) {
+		ts.tv_sec = adjtime;
+		ts.tv_nsec = 0;
+		if (ioctl(fd, PTP_CLOCK_ADJTIME, &ts)) {
+			perror("PTP_CLOCK_ADJTIME");
+		} else {
+			puts("time shift okay");
+		}
+	}
+
+	if (gettime) {
+		if (ioctl(fd, PTP_CLOCK_GETTIME, &ts)) {
+			perror("PTP_CLOCK_GETTIME");
+		} else {
+			printf("clock time: %ld.%09ld or %s",
+			       ts.tv_sec, ts.tv_nsec, ctime(&ts.tv_sec));
+		}
+	}
+
+	if (settime) {
+		clock_gettime(CLOCK_REALTIME, &ts);
+		if (ioctl(fd, PTP_CLOCK_SETTIME, &ts)) {
+			perror("PTP_CLOCK_SETTIME");
+		} else {
+			puts("set time okay");
+		}
+	}
+
+	close(fd);
+	return 0;
+}
diff --git a/Documentation/ptp/testptp.mk b/Documentation/ptp/testptp.mk
new file mode 100644
index 0000000..4ef2d97
--- /dev/null
+++ b/Documentation/ptp/testptp.mk
@@ -0,0 +1,33 @@
+# PTP 1588 clock support - User space test program
+#
+# Copyright (C) 2010 OMICRON electronics GmbH
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+CC        = $(CROSS_COMPILE)gcc
+INC       = -I$(KBUILD_OUTPUT)/usr/include
+CFLAGS    = -Wall $(INC)
+LDLIBS    = -lrt
+PROGS     = testptp
+
+all: $(PROGS)
+
+testptp: testptp.o
+
+clean:
+	rm -f testptp.o
+
+distclean: clean
+	rm -f $(PROGS)
diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..774fbd7 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -52,6 +52,8 @@ source "drivers/spi/Kconfig"
 
 source "drivers/pps/Kconfig"
 
+source "drivers/ptp/Kconfig"
+
 source "drivers/gpio/Kconfig"
 
 source "drivers/w1/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 34f1e10..7dea7c8 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -74,6 +74,7 @@ obj-$(CONFIG_I2O)		+= message/
 obj-$(CONFIG_RTC_LIB)		+= rtc/
 obj-y				+= i2c/ media/
 obj-$(CONFIG_PPS)		+= pps/
+obj-$(CONFIG_PTP_1588_CLOCK)	+= ptp/
 obj-$(CONFIG_W1)		+= w1/
 obj-$(CONFIG_POWER_SUPPLY)	+= power/
 obj-$(CONFIG_HWMON)		+= hwmon/
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
new file mode 100644
index 0000000..c80a25b
--- /dev/null
+++ b/drivers/ptp/Kconfig
@@ -0,0 +1,26 @@
+#
+# PTP clock support configuration
+#
+
+menu "PTP clock support"
+
+config PTP_1588_CLOCK
+	tristate "PTP clock support"
+	depends on EXPERIMENTAL
+	help
+	  The IEEE 1588 standard defines a method to precisely
+	  synchronize distributed clocks over Ethernet networks. The
+	  standard defines a Precision Time Protocol (PTP), which can
+	  be used to achieve synchronization within a few dozen
+	  microseconds. In addition, with the help of special hardware
+	  time stamping units, it can be possible to achieve
+	  synchronization to within a few hundred nanoseconds.
+
+	  This driver adds support for PTP clocks as character
+	  devices. If you want to use a PTP clock, then you should
+	  also enable at least one clock driver as well.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called ptp_clock.
+
+endmenu
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
new file mode 100644
index 0000000..b86695c
--- /dev/null
+++ b/drivers/ptp/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for PTP 1588 clock support.
+#
+
+obj-$(CONFIG_PTP_1588_CLOCK)		+= ptp_clock.o
diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
new file mode 100644
index 0000000..a5acac4
--- /dev/null
+++ b/drivers/ptp/ptp_clock.c
@@ -0,0 +1,302 @@
+/*
+ * PTP 1588 clock support
+ *
+ * Partially adapted from the Linux PPS driver.
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/bitops.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <linux/ptp_clock_kernel.h>
+#include <linux/ptp_clock.h>
+
+#define PTP_MAX_CLOCKS BITS_PER_LONG
+
+struct ptp_clock {
+	struct list_head list;
+	struct cdev cdev;
+	struct device *dev;
+	struct ptp_clock_info *info;
+	dev_t devid;
+	int index; /* index into clocks.map, also the minor number */
+	struct mutex mux; /* one process at a time on a device */
+};
+
+
+/* private globals */
+
+static const struct file_operations ptp_fops;
+static dev_t ptp_devt;
+static struct class *ptp_class;
+
+static struct {
+	struct list_head list;
+	DECLARE_BITMAP(map, PTP_MAX_CLOCKS);
+} clocks;
+static DEFINE_SPINLOCK(clocks_lock); /* protects 'clocks' */
+
+/* public interface */
+
+struct ptp_clock* ptp_clock_register(struct ptp_clock_info *info)
+{
+	struct ptp_clock *ptp;
+	int err = 0, index, major = MAJOR(ptp_devt);
+	unsigned long flags;
+
+	/* Find a free clock slot and reserve it. */
+	err = -EBUSY;
+	spin_lock_irqsave(&clocks_lock, flags);
+	index = find_first_zero_bit(clocks.map, PTP_MAX_CLOCKS);
+	if (index < PTP_MAX_CLOCKS) {
+		set_bit(index, clocks.map);
+		spin_unlock_irqrestore(&clocks_lock, flags);
+	} else {
+		spin_unlock_irqrestore(&clocks_lock, flags);
+		goto no_clock;
+	}
+
+	/* Initialize a clock structure. */
+	err = -ENOMEM;
+	ptp = kzalloc(sizeof(struct ptp_clock), GFP_KERNEL);
+	if (ptp == NULL)
+		goto no_memory;
+
+	ptp->info = info;
+	ptp->devid = MKDEV(major, index);
+	ptp->index = index;
+	mutex_init(&ptp->mux);
+
+	/* Create a new device in our class. */
+	ptp->dev = device_create(ptp_class, NULL, ptp->devid, ptp,
+				 "ptp_clock_%d", ptp->index);
+	if (IS_ERR(ptp->dev))
+		goto no_device;
+
+	dev_set_drvdata(ptp->dev, ptp);
+
+	/* Register a character device. */
+	cdev_init(&ptp->cdev, &ptp_fops);
+	ptp->cdev.owner = info->owner;
+	err = cdev_add(&ptp->cdev, ptp->devid, 1);
+	if (err)
+		goto no_cdev;
+
+	/* Clock is ready, add it into the list. */
+	spin_lock_irqsave(&clocks_lock, flags);
+	list_add(&ptp->list, &clocks.list);
+	spin_unlock_irqrestore(&clocks_lock, flags);
+
+	return ptp;
+
+no_cdev:
+	device_destroy(ptp_class, ptp->devid);
+no_device:
+	mutex_destroy(&ptp->mux);
+	kfree(ptp);
+no_memory:
+	spin_lock_irqsave(&clocks_lock, flags);
+	clear_bit(index, clocks.map);
+	spin_unlock_irqrestore(&clocks_lock, flags);
+no_clock:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(ptp_clock_register);
+
+int ptp_clock_unregister(struct ptp_clock *ptp)
+{
+	unsigned long flags;
+
+	/* Release the clock's resources. */
+	cdev_del(&ptp->cdev);
+	device_destroy(ptp_class, ptp->devid);
+	mutex_destroy(&ptp->mux);
+
+	/* Remove the clock from the list. */
+	spin_lock_irqsave(&clocks_lock, flags);
+	list_del(&ptp->list);
+	clear_bit(ptp->index, clocks.map);
+	spin_unlock_irqrestore(&clocks_lock, flags);
+
+	kfree(ptp);
+
+	return 0;
+}
+EXPORT_SYMBOL(ptp_clock_unregister);
+
+void ptp_clock_event(struct ptp_clock_event event)
+{
+}
+EXPORT_SYMBOL(ptp_clock_event);
+
+/* character device operations */
+
+static int ptp_ioctl(struct inode *node, struct file *fp,
+		      unsigned int cmd, unsigned long arg)
+{
+	struct ptp_clock *ptp = fp->private_data;
+	struct ptp_clock_info *ops = ptp->info;
+	void *priv = ops->priv;
+	struct timespec ts;
+	int err = 0;
+
+	switch (cmd) {
+
+	case PTP_CLOCK_APIVERS:
+		err = put_user(PTP_CLOCK_VERSION, (u32 __user*)arg);
+		break;
+
+	case PTP_CLOCK_ADJFREQ:
+		err = ops->adjfreq(priv, arg);
+		break;
+
+	case PTP_CLOCK_ADJTIME:
+		if (copy_from_user(&ts, (void __user*)arg, sizeof(ts)))
+			err = -EFAULT;
+		else
+			err = ops->adjtime(priv, &ts);
+		break;
+
+	case PTP_CLOCK_GETTIME:
+		err = ops->gettime(priv, &ts);
+		if (err)
+			break;
+		err = copy_to_user((void __user*)arg, &ts, sizeof(ts));
+		break;
+
+	case PTP_CLOCK_SETTIME:
+		if (copy_from_user(&ts, (void __user*)arg, sizeof(ts)))
+			err = -EFAULT;
+		else
+			err = ops->settime(priv, &ts);
+		break;
+
+	default:
+		err = -ENOTTY;
+		break;
+	}
+	return err;
+}
+
+static int ptp_open(struct inode *inode, struct file *fp)
+{
+	struct ptp_clock *ptp;
+	ptp = container_of(inode->i_cdev, struct ptp_clock, cdev);
+
+	if (mutex_lock_interruptible(&ptp->mux))
+		return -ERESTARTSYS;
+
+	fp->private_data = ptp;
+
+	return 0;
+}
+
+static unsigned int ptp_poll(struct file *fp, poll_table *wait)
+{
+	return 0;
+}
+
+static int ptp_release(struct inode *inode, struct file *fp)
+{
+	struct ptp_clock *ptp = fp->private_data;
+	mutex_unlock(&ptp->mux);
+	return 0;
+}
+
+static const struct file_operations ptp_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= ptp_ioctl,
+	.open		= ptp_open,
+	.poll		= ptp_poll,
+	.release	= ptp_release,
+};
+
+/* sysfs */
+
+static ssize_t ptp_show_status(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct ptp_clock *ptp = dev_get_drvdata(dev);
+	return sprintf(buf,
+		       "maximum adjustment:  %d\n"
+		       "programmable alarms: %d\n"
+		       "external timestamps: %d\n"
+		       "periodic outputs:    %d\n"
+		       "has pps:             %d\n"
+		       "device index:        %d\n"
+		       ,ptp->info->max_adj
+		       ,ptp->info->n_alarm
+		       ,ptp->info->n_ext_ts
+		       ,ptp->info->n_per_out
+		       ,ptp->info->pps
+		       ,ptp->index);
+}
+
+struct device_attribute ptp_attrs[] = {
+	__ATTR(capabilities, S_IRUGO, ptp_show_status, NULL),
+	__ATTR_NULL,
+};
+
+/* module operations */
+
+static void __exit ptp_exit(void)
+{
+	class_destroy(ptp_class);
+	unregister_chrdev_region(ptp_devt, PTP_MAX_CLOCKS);
+}
+
+static int __init ptp_init(void)
+{
+	int err;
+
+	INIT_LIST_HEAD(&clocks.list);
+
+	ptp_class = class_create(THIS_MODULE, "ptp");
+	if (!ptp_class) {
+		printk(KERN_ERR "ptp: failed to allocate class\n");
+		return -ENOMEM;
+	}
+	ptp_class->dev_attrs = ptp_attrs;
+
+	err = alloc_chrdev_region(&ptp_devt, 0, PTP_MAX_CLOCKS, "ptp");
+	if (err < 0) {
+		printk(KERN_ERR "ptp: failed to allocate char device region\n");
+		goto no_region;
+	}
+
+	pr_info("PTP clock support registered\n");
+	return 0;
+
+no_region:
+	class_destroy(ptp_class);
+	return err;
+}
+
+subsys_initcall(ptp_init);
+module_exit(ptp_exit);
+
+MODULE_AUTHOR("Richard Cochran <richard.cochran@omicron.at>");
+MODULE_DESCRIPTION("PTP clocks support");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 2fc8e14..9959fe4 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -140,6 +140,7 @@ header-y += pkt_sched.h
 header-y += posix_types.h
 header-y += ppdev.h
 header-y += prctl.h
+header-y += ptp_clock.h
 header-y += qnxtypes.h
 header-y += qnx4_fs.h
 header-y += radeonfb.h
diff --git a/include/linux/ptp_clock.h b/include/linux/ptp_clock.h
new file mode 100644
index 0000000..5064b19
--- /dev/null
+++ b/include/linux/ptp_clock.h
@@ -0,0 +1,37 @@
+/*
+ * PTP 1588 clock support - user space interface
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _PTP_CLOCK_H_
+#define _PTP_CLOCK_H_
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define PTP_CLOCK_VERSION 0x00000001
+
+#define PTP_CLK_MAGIC '='
+
+#define PTP_CLOCK_APIVERS _IOR (PTP_CLK_MAGIC, 1, __u32)
+#define PTP_CLOCK_ADJFREQ _IO  (PTP_CLK_MAGIC, 2)
+#define PTP_CLOCK_ADJTIME _IOW (PTP_CLK_MAGIC, 3, struct timespec)
+#define PTP_CLOCK_GETTIME _IOR (PTP_CLK_MAGIC, 4, struct timespec)
+#define PTP_CLOCK_SETTIME _IOW (PTP_CLK_MAGIC, 5, struct timespec)
+
+#endif
diff --git a/include/linux/ptp_clock_kernel.h b/include/linux/ptp_clock_kernel.h
new file mode 100644
index 0000000..d43bdd2
--- /dev/null
+++ b/include/linux/ptp_clock_kernel.h
@@ -0,0 +1,134 @@
+/*
+ * PTP 1588 clock support
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _PTP_CLOCK_KERNEL_H_
+#define _PTP_CLOCK_KERNEL_H_
+
+enum ptp_clock_events {
+	PTP_CLOCK_ALARM,
+	PTP_CLOCK_EXTTS,
+	PTP_CLOCK_PEROUT,
+	PTP_CLOCK_PPS,
+};
+
+/**
+ * struct ptp_clock_request - parameter for the enable() operation
+ *
+ * @type:  One of the ptp_clock_events enumeration values.
+ * @index: Selects one resource of the given type.
+ * @ts:    Desired one shot or recurring period for alarms and output signals.
+ */
+
+struct ptp_clock_request {
+	int type;
+	int index;
+	struct itimerspec ts;
+};
+
+/**
+ * struct ptp_clock_info - decribes a PTP hardware clock
+ *
+ * @owner:     The clock driver should set to THIS_MODULE.
+ * @name:      A short name to identify the clock.
+ * @max_adj:   The maximum possible frequency adjustment, in parts per billon.
+ * @n_alarm:   The number of programmable alarms.
+ * @n_ext_ts:  The number of external time stamp channels.
+ * @n_per_out: The number of programmable periodic signals.
+ * @pps:       Indicates whether the clock supports a PPS callback.
+ * @priv:      Passed to the clock operations, for the driver's private use.
+ *
+ * clock operations
+ *
+ * @adjfreq: Adjusts the frequency of the hardware clock.
+ *           parameter delta: Desired period change in parts per billion.
+ *
+ * @adjtime: Shifts the time of the hardware clock.
+ *           parameter ts: Desired change in seconds and nanoseconds.
+ *
+ * @gettime: Reads the current time from the hardware clock.
+ *           parameter ts: Holds the result.
+ *
+ * @settime: Set the current time on the hardware clock.
+ *           parameter ts: Time value to set.
+ *
+ * @enable:  Request driver to enable or disable an ancillary feature.
+ *           parameter request: Desired resource to enable or disable.
+ *           parameter on: Caller passes one to enable or zero to disable.
+ *
+ * The callbacks must all return zero on success, non-zero otherwise.
+ */
+
+struct ptp_clock_info {
+	struct module *owner;
+	char name[16];
+	s32 max_adj;
+	int n_alarm;
+	int n_ext_ts;
+	int n_per_out;
+	int pps;
+	void *priv;
+	int (*adjfreq)(void *priv, s32 delta);
+	int (*adjtime)(void *priv, struct timespec *ts);
+	int (*gettime)(void *priv, struct timespec *ts);
+	int (*settime)(void *priv, struct timespec *ts);
+	int (*enable) (void *priv, struct ptp_clock_request request, int on);
+};
+
+struct ptp_clock;
+
+/**
+ * ptp_clock_register() - register a PTP hardware clock driver
+ *
+ * @info:  Structure describing the new clock.
+ */
+
+extern struct ptp_clock* ptp_clock_register(struct ptp_clock_info *info);
+
+/**
+ * ptp_clock_unregister() - unregister a PTP hardware clock driver
+ *
+ * @ptp:  The clock to remove from service.
+ */
+
+extern int ptp_clock_unregister(struct ptp_clock *ptp);
+
+/**
+ * struct ptp_clock_event - decribes a PTP hardware clock event
+ *
+ * @type:  One of the ptp_clock_events enumeration values.
+ * @index: Identifies the source of the event.
+ * @timestamp: When the event occured.
+ */
+
+struct ptp_clock_event {
+	int type;
+	int index;
+	u64 timestamp;
+};
+
+/**
+ * ptp_clock_event() - notify the PTP layer about an event
+ *
+ * @event:  Message structure describing the event.
+ */
+
+extern void ptp_clock_event(struct ptp_clock_event event);
+
+#endif
-- 
1.6.0.4


^ permalink raw reply related

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Tejun Heo @ 2010-04-30 15:22 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Greg KH, bcrl, benjamin.thery, cornelia.huck,
	eric.dumazet, kay.sievers, netdev
In-Reply-To: <20100430142940.GC7187@us.ibm.com>

Hello,

On 04/30/2010 04:29 PM, Serge E. Hallyn wrote:
> Hmm, but looking back over the previous thread (Mar 31) I guess you
> mean more in-line comments around the callbacks, presumably things
> like class_dir_child_ns_type() and struct kobj_ns_type_operations
> members?

In-line.  What they're, how they're supposed to be used, which calling
context is expected, what can be returned and so on.

> It sounds like what you'd really like is to have any explicit
> mention to namespaces pulled out of drivers/base (layering as you
> keep saying)?  But will there be a use for this outside of
> namespaces?  Does trying to anticipate that fall into the category
> of over-abstraction?

I wouldn't mind limited amount of layering exceptions as long as
they're clearly documented.  What I'm primarily worried about is not
the possibility of other users but more the obfuscation of the whole
sysfs-kobject-driver model thing which is already overly abstracted
and obfuscated (at least it seems to me that way).

NS needs tagged support in the driver model which in itself is fine
and I also understand that from someone who's primarily working on NS,
adding a bit on top of the whole thing wouldn't seem like much of a
problem.  To me it seems like worsening a problem which is already
pretty bad.  I hope you could understand my POV too.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 3/3] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Wolfgang Grandegger @ 2010-04-29 12:41 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100427091449.GA5122@riccoc20.at.omicron.at>

Richard Cochran wrote:
> The eTSEC includes a PTP clock with quite a few features. This patch adds
> support for the basic clock adjustment functions.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
>  drivers/net/Makefile          |    1 +
>  drivers/net/gianfar_ptp.c     |  269 +++++++++++++++++++++++++++++++++++++++++
>  drivers/net/gianfar_ptp_reg.h |  107 ++++++++++++++++
>  drivers/ptp/Kconfig           |   13 ++
>  4 files changed, 390 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/gianfar_ptp.c
>  create mode 100644 drivers/net/gianfar_ptp_reg.h
> 
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index ebf80b9..8f2c2ff 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -28,6 +28,7 @@ obj-$(CONFIG_ATL2) += atlx/
>  obj-$(CONFIG_ATL1E) += atl1e/
>  obj-$(CONFIG_ATL1C) += atl1c/
>  obj-$(CONFIG_GIANFAR) += gianfar_driver.o
> +obj-$(CONFIG_PTP_1588_CLOCK_GIANFAR) += gianfar_ptp.o
>  obj-$(CONFIG_TEHUTI) += tehuti.o
>  obj-$(CONFIG_ENIC) += enic/
>  obj-$(CONFIG_JME) += jme.o
> diff --git a/drivers/net/gianfar_ptp.c b/drivers/net/gianfar_ptp.c
> new file mode 100644
> index 0000000..eed3246
> --- /dev/null
> +++ b/drivers/net/gianfar_ptp.c
> @@ -0,0 +1,269 @@
> +/*
> + * PTP 1588 clock using the eTSEC
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#include <linux/device.h>
> +#include <linux/hrtimer.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/timex.h>
> +#include <asm/io.h>
> +
> +#include <linux/ptp_clock_kernel.h>
> +
> +#include "gianfar_ptp_reg.h"
> +
> +/*
> + *
> + * TODO - get the following from device tree
> + *
> + */
> +#define TMR_BASE_KERNEL	0xe0024e00 // CONFIG_PPC_85xx 0xffe24e00
> +#define TIMER_OSC	166666666

fsl_get_sys_freq() does return that frequency. But that part is
temporary anyway and you already posted a patch to get them from the
device tree.

> +#define TCLK_PERIOD	10
> +#define NOMINAL_FREQ	100000000
> +#define DEF_TMR_PRSC	100
> +#define DEF_TMR_ADD	0x999999A4
> +#define DEFAULT_CKSEL   1
> +
> +#define REG_SIZE	(4 + TMR_ETTS2_L)
> +
> +struct etsects {
> +	void *regs;
> +	u32 timer_osc;    /* Hz */
> +	u32 tclk_period;  /* nanoseconds */
> +	s64 nominal_freq; /* Hz */
> +	u32 tmr_prsc;
> +	u32 tmr_add;
> +	u32 cksel;
> +};
> +
> +/* Private globals */
> +static struct etsects the_clock;
> +DEFINE_SPINLOCK(adjtime_lock);
> +
> +/*
> + * Register access functions
> + */
> +
> +static inline u32 reg_read(struct etsects *etsects, unsigned long offset)
> +{
> +	return in_be32(etsects->regs + offset);
> +}
> +
> +static inline void reg_write(struct etsects *etsects,
> +			     unsigned long offset, u32 val)
> +{
> +	out_be32(etsects->regs + offset, val);
> +}

For compatibility with the gianfar implementation and to profit from
type checking I would prefer to use a structure to describe the register
layout. This would make the functions above obsolete.

> +static u64 tmr_cnt_read(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;

Empty line? While I'm at it, some coding style fixes.

> +	lo = reg_read(etsects, TMR_CNT_L);
> +	hi = reg_read(etsects, TMR_CNT_H);
> +	ns = ((u64) hi) << 32;
> +	ns |= lo;
> +	return ns;
> +}
> +
> +static void tmr_cnt_write(struct etsects *etsects, u64 ns)
> +{
> +	u32 hi = ns >> 32;
> +	u32 lo = ns & 0xffffffff;

Ditto.

> +	reg_write(etsects, TMR_CNT_L, lo);
> +	reg_write(etsects, TMR_CNT_H, hi);
> +}
> +
> +static void set_alarm(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;

Ditto.

> +	ns = tmr_cnt_read(etsects) + 1500000000ULL;
> +	ns = div_u64(ns, 1000000000UL) * 1000000000ULL;
> +	ns -= etsects->tclk_period;
> +	hi = ns >> 32;
> +	lo = ns & 0xffffffff;
> +	reg_write(etsects, TMR_ALARM1_L, lo);
> +	reg_write(etsects, TMR_ALARM1_H, hi);
> +}
> +
> +static void set_fipers(struct etsects *etsects)
> +{
> +	u32 tmr_ctrl = reg_read(etsects, TMR_CTRL);
> +
> +	reg_write(etsects, TMR_CTRL,   tmr_ctrl & (~TE));
> +	reg_write(etsects, TMR_PRSC,   etsects->tmr_prsc);
> +	reg_write(etsects, TMR_FIPER1, 0x3B9AC9F6);
> +	reg_write(etsects, TMR_FIPER2, 0x00018696);
> +	set_alarm(etsects);
> +	reg_write(etsects, TMR_CTRL,   tmr_ctrl|TE);
> +}

These hardcode values should be replaced with values derived from the
input clock frequency, clock period, etc.

> +/*
> + * PTP clock operations
> + */
> +
> +static int ptp_gianfar_adjfreq(void *priv, s32 ppb)
> +{
> +	u64 adj;
> +	u32 diff, tmr_add;
> +	int neg_adj = 0;
> +	struct etsects *etsects = priv;
> +
> +	if (!ppb)
> +		return 0;
> +
> +	if (ppb < 0) {
> +		neg_adj = 1;
> +		ppb = -ppb;
> +	}
> +	tmr_add = etsects->tmr_add;
> +	adj = tmr_add;
> +	adj *= ppb;
> +	diff = div_u64(adj, 1000000000ULL);
> +
> +	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
> +
> +	reg_write(etsects, TMR_ADD, tmr_add);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_adjtime(void *priv, struct timespec *ts)
> +{
> +	s64 delta, now;
> +	unsigned long flags;
> +	struct etsects *etsects = priv;
> +
> +	delta = 1000000000LL * ts->tv_sec;
> +	delta += ts->tv_nsec;
> +
> +	spin_lock_irqsave(&adjtime_lock, flags);
> +
> +	now = tmr_cnt_read(etsects);
> +	now += delta;
> +	tmr_cnt_write(etsects, now);
> +
> +	spin_unlock_irqrestore(&adjtime_lock, flags);
> +
> +	set_fipers(etsects);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_gettime(void *priv, struct timespec *ts)
> +{
> +	u64 ns;
> +	u32 remainder;
> +	struct etsects *etsects = priv;

Empty line?

> +	ns = tmr_cnt_read(etsects);
> +	ts->tv_sec = div_u64_rem(ns, 1000000000, &remainder);
> +	ts->tv_nsec = remainder;
> +	return 0;
> +}
> +
> +static int ptp_gianfar_settime(void *priv, struct timespec *ts)
> +{
> +	u64 ns;
> +	struct etsects *etsects = priv;

Ditto.

> +	ns = ts->tv_sec * 1000000000ULL;
> +	ns += ts->tv_nsec;
> +	tmr_cnt_write(etsects, ns);
> +	set_fipers(etsects);
> +	return 0;
> +}
> +
> +static int ptp_gianfar_enable(void *priv, struct ptp_clock_request rq, int on)
> +{
> +	/* We do not (yet) offer any ancillary features. */
> +	return -EINVAL;
> +}
> +
> +struct ptp_clock_info ptp_gianfar_caps = {
> +	.owner		= THIS_MODULE,
> +	.name		= "gianfar clock",
> +	.max_adj	= 512000,
> +	.n_alarm	= 0,
> +	.n_ext_ts	= 0,
> +	.n_per_out	= 0,
> +	.pps		= 0,
> +	.priv		= &the_clock,
> +	.adjfreq	= ptp_gianfar_adjfreq,
> +	.adjtime	= ptp_gianfar_adjtime,
> +	.gettime	= ptp_gianfar_gettime,
> +	.settime	= ptp_gianfar_settime,
> +	.enable		= ptp_gianfar_enable,
> +};
> +
> +/* module operations */
> +
> +static void __exit ptp_gianfar_exit(void)
> +{
> +	ptp_clock_unregister(&ptp_gianfar_caps);
> +	iounmap(the_clock.regs);
> +}

Maybe some more cleanup is necessary, e.g. stopping the PPS.

> +static int __init ptp_gianfar_init(void)
> +{
> +	struct etsects *etsects = &the_clock;
> +	struct timespec now;
> +	phys_addr_t reg_addr = TMR_BASE_KERNEL;
> +	unsigned long reg_size = REG_SIZE;
> +	u32 tmr_ctrl;
> +	int err;
> +
> +	etsects->regs = ioremap(reg_addr, reg_size);
> +	if (!etsects->regs) {
> +		pr_err("ioremap ptp registers failed\n");
> +		return -EINVAL;
> +	}
> +	etsects->timer_osc = TIMER_OSC;
> +	etsects->tclk_period = TCLK_PERIOD;
> +	etsects->nominal_freq = NOMINAL_FREQ;
> +	etsects->tmr_prsc = DEF_TMR_PRSC;
> +	etsects->tmr_add = DEF_TMR_ADD;
> +	etsects->cksel = DEFAULT_CKSEL;
> +
> +	tmr_ctrl =
> +	  (etsects->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
> +	  (etsects->cksel & CKSEL_MASK) << CKSEL_SHIFT;
> +
> +	getnstimeofday(&now);
> +	ptp_gianfar_settime(etsects, &now);
> +
> +	reg_write(etsects, TMR_CTRL,   tmr_ctrl);
> +	reg_write(etsects, TMR_ADD,    etsects->tmr_add);
> +	reg_write(etsects, TMR_PRSC,   etsects->tmr_prsc);
> +	reg_write(etsects, TMR_FIPER1, 0x3B9AC9F6);
> +	reg_write(etsects, TMR_FIPER2, 0x00018696);
> +	set_alarm(etsects);
> +	reg_write(etsects, TMR_CTRL,   tmr_ctrl|FS|RTPE|TE);
> +
> +	err = ptp_clock_register(&ptp_gianfar_caps);
> +	return err;
> +}
> +
> +subsys_initcall(ptp_gianfar_init);
> +module_exit(ptp_gianfar_exit);

As already mentioned, ptp_clock_register() did fail when this driver is
statically linked into the kernel.

Wolfgang.

^ permalink raw reply

* Re: TCP MD5 issue
From: Bijay Singh @ 2010-04-29 17:39 UTC (permalink / raw)
  To: Bijay Singh; +Cc: netdev@vger.kernel.org
In-Reply-To: <7F9EF72D-8D9E-4C36-983F-96589B6F7D02@guavus.com>


I did a simple experiment to conclude that the md5sig_pool is indeed having an integrity issue.

In the .cal_md5_hash function after the calculations i printed the values of psuedo header which is kept in the md5sig_pool variable and the values after completing the calculations, it has different values for at-least one of src,dest or length in a few cases. 

Does anyone get a cue from this?

On 29-Apr-2010, at 7:19 AM, Bijay Singh wrote:

> 
> 
>> Hi,
>> 
>> I have hit upon bug in the TCP MD5 implementation. The bug shows up in load conditions. I haven't  been able to exactly identify the issue, but is easily reproducible.
>> 
>> I run multiple instances (44 pairs) of servers and clients. All servers running on ubuntu 2.6.31 and all clients running on 2.6.26.
>> 
>> I observe that few messages from either end have invalid MD5 Hash signatures.
>> 
>> I browsed thru the code and did not find anything suspicious.
>> 
>> I hacked the code (2.6.26) to call the .calc_md5_hash function twice at the same point and printked the hashes from the .calc_md5_hash function. I noticed that sometimes the 1st call gives the in correct value and sometimes the 2nd call gives the in correct value, which essentially leads me the believe that the input to the fucntions are not getting modified. 
>> 
>> I short-circuited the crypto and called the md5 functions directly and allocated the md5 context from the stack, to remove any kind of sharing violation that may be happening. (the context otherwise is saved per cpu in the md5_sig_pool and the exection of the hash generation code is made non-preemtbale to prevent sharing voilation). btw what will happen if there is an interupt.
>> 
>> After this change the code is running much more smoothly, however i did manage to get 6 error in 4 hours, earlier i was seen invalid cheksum errors with secs of starting the load.
>> 
>> I have rerun the test and haven't observed any error since last 2 hours.
>> 
>> I need to fix this issue and am heavily dependent on you to provide me with some clues to proceed further. Pls. let me know if you need any more data.
>> 
>> Looking forward to your response.
>> 
>> Thanks
>> BIjay
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-29 11:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272514176.2201.85.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

On Thu, 2010-04-29 at 06:09 +0200, Eric Dumazet wrote:

> I dont see in your results the number of pps, number of udp ports,
> number of flows.

My test scenario is still the same: send 1M packets of 8 flows
round-robin at 750Kpps. Repeat test 4-6 times and average out. 8 flows
map to 8 cpus. Any rate above 750Kpps and the driver starts dropping.
The flows are {Fixed dst IP, fixed src IP, fixed src port, 8 variable
dst port}. ip_rcv and friends show up in profile as we have already
discussed - but i dont want to change the test characteristic because i
cant do fair backward comparison. Also i use rps mask ee to use all the
cpus except the core doing demux (core 0).
In the results when i say "udp sink 90%" it means 90% of 750Kpps was
successfuly received by the app (on the multiple cpus).

> In my latest results, I can handle more pps than before, regardless of
> rps being on or off, 

Same here - even in my worst case scenario 88.5% of 750Kpps > 600Kpps.
Attached is history results to make more sense of what i am saying:
we have net-next kernels from apr14, apr23, apr23 with changlis change,
apr28, apr28 with your change. What you'll see is non-rps (blue) gets
better and rps (Orange) gets better slowly then by apr28 it is worse.

> and with various number of udp ports (one user
> thread per port), number of flows (many src addr so that rps spread
> packets on many cpus)
> 

This is true for me except for non rps getting relatively better and rps
getting worse in plain net-next for Apr 28. Sorry, dont have time to
dissect where things changed but i figured if i reported it will point
to something obvious.

> If/when contention windows are smaller, cpu can run uncontended, and can
> consume more cycles to process more frames ?
> 
> With a non yet published patch, I even can reach 600.000 pps in DDOS
> situations, instead of 400.000.

So my tests are simpler. What i was hoping to see was at minimum rps
maintains its gap of 6-7% more capacity. I dont mind seeing
rps get better. If both rps and non-rps get better that even more
interesting.

cheers,
jamal

[-- Attachment #2: rps-hist.pdf --]
[-- Type: application/pdf, Size: 212033 bytes --]

^ permalink raw reply

* Re: [PATCH 5/5] sctp: Fix oops when sending queued ASCONF chunks
From: Vlad Yasevich @ 2010-04-29 14:26 UTC (permalink / raw)
  To: Shuaijun Zhang; +Cc: netdev, davem, linux-sctp, Yuansong Qiao
In-Reply-To: <4BD99310.8020207@research.ait.ie>



Shuaijun Zhang wrote:
> Vlad Yasevich wrote:
>> When we finish processing ASCONF_ACK chunk, we try to send
>> the next queued ASCONF.  This action runs the sctp state
>> machine recursively and it's not prepared to do so.
>>
>> kernel BUG at kernel/timer.c:790!
>> invalid opcode: 0000 [#1] SMP
>> last sysfs file: /sys/module/ipv6/initstate
>> Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath
>> uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev
>> floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan]
>>
>> Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs
>> EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0
>> EIP is at add_timer+0xd/0x1b
>> EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4
>> ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0
>>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>> Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000)
>> Stack:
>>  c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004
>> <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14
>> 00000004
>> <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14
>> 000000d0
>>   
> According to the call trace below, it seems that our modification did
> not take affect.
> sctp_primitive_ASCONF should be invoked after sctp_side_effects().
> Our code fixed the same problem in kernel 2.6.27.28.
> Not sure about the difference between 2.6.34-rc4 kernel and 2.6.27.28
> kernel.

This is the patch that you sent me, that included comments that Wei made on
the list.  It corrects this problem and will be pushed back to stable tries
that are still maintained.

The call trace in the commit log is for reference.  The patch has been tested
and resolves the issue you reported.

-vlad

>> Call Trace:
>>  [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp]
>>  [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp]
>>  [<d1863386>] ? sctp_pname+0x0/0x1d [sctp]
>>  [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp]        <---
>> sctp_side_effects() should show up here before send next asconf
>>  [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp]
>>  [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp]
>>  [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp]
>>  [<d1863334>] ? sctp_cname+0x0/0x52 [sctp]
>>  [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp]
>>  [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp]
>>  [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp]
>>
>> Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
>> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie>
>> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie>
>> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>> ---
>>  include/net/sctp/command.h |    1 +
>>  net/sctp/sm_make_chunk.c   |   15 ---------------
>>  net/sctp/sm_sideeffect.c   |   26 ++++++++++++++++++++++++++
>>  net/sctp/sm_statefuns.c    |    8 +++++++-
>>  4 files changed, 34 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
>> index 8be5135..2c55a7e 100644
>> --- a/include/net/sctp/command.h
>> +++ b/include/net/sctp/command.h
>> @@ -107,6 +107,7 @@ typedef enum {
>>      SCTP_CMD_T1_RETRAN,     /* Mark for retransmission after T1
>> timeout  */
>>      SCTP_CMD_UPDATE_INITTAG, /* Update peer inittag */
>>      SCTP_CMD_SEND_MSG,     /* Send the whole use message */
>> +    SCTP_CMD_SEND_NEXT_ASCONF, /* Send the next ASCONF after ACK */
>>      SCTP_CMD_LAST
>>  } sctp_verb_t;
>>  
>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index f6fc5c1..0fd5b4c 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -3318,21 +3318,6 @@ int sctp_process_asconf_ack(struct
>> sctp_association *asoc,
>>      sctp_chunk_free(asconf);
>>      asoc->addip_last_asconf = NULL;
>>  
>> -    /* Send the next asconf chunk from the addip chunk queue. */
>> -    if (!list_empty(&asoc->addip_chunk_list)) {
>> -        struct list_head *entry = asoc->addip_chunk_list.next;
>> -        asconf = list_entry(entry, struct sctp_chunk, list);
>> -
>> -        list_del_init(entry);
>> -
>> -        /* Hold the chunk until an ASCONF_ACK is received. */
>> -        sctp_chunk_hold(asconf);
>> -        if (sctp_primitive_ASCONF(asoc, asconf))
>> -            sctp_chunk_free(asconf);
>> -        else
>> -            asoc->addip_last_asconf = asconf;
>> -    }
>> -
>>      return retval;
>>  }
>>  
>> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
>> index 4c5bed9..d5ae450 100644
>> --- a/net/sctp/sm_sideeffect.c
>> +++ b/net/sctp/sm_sideeffect.c
>> @@ -962,6 +962,29 @@ static int sctp_cmd_send_msg(struct
>> sctp_association *asoc,
>>  }
>>  
>>  
>> +/* Sent the next ASCONF packet currently stored in the association.
>> + * This happens after the ASCONF_ACK was succeffully processed.
>> + */
>> +static void sctp_cmd_send_asconf(struct sctp_association *asoc)
>> +{
>> +    /* Send the next asconf chunk from the addip chunk
>> +     * queue.
>> +     */
>> +    if (!list_empty(&asoc->addip_chunk_list)) {
>> +        struct list_head *entry = asoc->addip_chunk_list.next;
>> +        struct sctp_chunk *asconf = list_entry(entry,
>> +                        struct sctp_chunk, list);
>> +        list_del_init(entry);
>> +
>> +        /* Hold the chunk until an ASCONF_ACK is received. */
>> +        sctp_chunk_hold(asconf);
>> +        if (sctp_primitive_ASCONF(asoc, asconf))
>> +            sctp_chunk_free(asconf);
>> +        else
>> +            asoc->addip_last_asconf = asconf;
>> +    }
>> +}
>> +
>>  
>>  /* These three macros allow us to pull the debugging code out of the
>>   * main flow of sctp_do_sm() to keep attention focused on the real
>> @@ -1617,6 +1640,9 @@ static int sctp_cmd_interpreter(sctp_event_t
>> event_type,
>>              }
>>              error = sctp_cmd_send_msg(asoc, cmd->obj.msg);
>>              break;
>> +        case SCTP_CMD_SEND_NEXT_ASCONF:
>> +            sctp_cmd_send_asconf(asoc);
>> +            break;
>>          default:
>>              printk(KERN_WARNING "Impossible command: %u, %p\n",
>>                     cmd->verb, cmd->obj.ptr);
>> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
>> index abf601a..24b2cd5 100644
>> --- a/net/sctp/sm_statefuns.c
>> +++ b/net/sctp/sm_statefuns.c
>> @@ -3676,8 +3676,14 @@ sctp_disposition_t sctp_sf_do_asconf_ack(const
>> struct sctp_endpoint *ep,
>>                  SCTP_TO(SCTP_EVENT_TIMEOUT_T4_RTO));
>>  
>>          if (!sctp_process_asconf_ack((struct sctp_association *)asoc,
>> -                         asconf_ack))
>> +                         asconf_ack)) {
>> +            /* Successfully processed ASCONF_ACK.  We can
>> +             * release the next asconf if we have one.
>> +             */
>> +            sctp_add_cmd_sf(commands, SCTP_CMD_SEND_NEXT_ASCONF,
>> +                    SCTP_NULL());
>>              return SCTP_DISPOSITION_CONSUME;
>> +        }
>>  
>>          abort = sctp_make_abort(asoc, asconf_ack,
>>                      sizeof(sctp_errhdr_t));
>>   
> 

^ permalink raw reply

* Re: [PATCHv7] add mergeable buffers support to vhost_net
From: David Stevens @ 2010-04-30 16:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20100429134515.GA26287@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 04/29/2010 06:45:15 AM:

> On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
> > This patch adds mergeable receive buffer support to vhost_net.
> > 
> > Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
> 
> I have applied this, thanks very much!
> I have also applied some tweaks on top,
> please take a look.
> 
> Thanks,
> MSt
> 

Looks fine to me.

Acked-by: David L Stevens <dlstevens@us.ibm.com>

> commit 2809e94f5f26d89dc5232aaec753ffda95c4d95e
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date:   Thu Apr 29 16:18:08 2010 +0300
> 
>     vhost-net: minor tweaks in mergeable buffer code
> 
>     Applies the following tweaks on top of mergeable buffers patch:
>     1. vhost_get_desc_n assumes that all desriptors are 'in' only.
>        It's also unlikely to be useful for any vhost frontend
>        besides vhost_net, so just move it to net.c, and rename
>        get_rx_bufs for brevity.
> 
>     2. for rx, we called iov_length within vhost_get_desc_n
>        (now get_rx_bufs) already, so we don't
>        need an extra call to iov_length to avoid overflow anymore.
>        Accordingly, copy_iovec_hdr can return void now.
> 
>     3. for rx, do some further code tweaks:
>        do not assign len = err as we check that err == len
>        handle data length in a way similar to how we handle
>        header length: datalen -> sock_len, len -> vhost_len.
>        add sock_hlen as a local variable, for symmetry with vhost_hlen.
> 
>     4. add some likely/unlikely annotations
> 
>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index d61d945..85519b4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -74,9 +74,9 @@ static int move_iovec_hdr(struct iovec *from, struct 
iovec *to,
>     }
>     return seg;
>  }
> -/* Copy iovec entries for len bytes from iovec. Return segments used. 
*/
> -static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
> -           size_t len, int iovcount)
> +/* Copy iovec entries for len bytes from iovec. */
> +static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
> +            size_t len, int iovcount)
>  {
>     int seg = 0;
>     size_t size;
> @@ -89,7 +89,6 @@ static int copy_iovec_hdr(const struct iovec *from, 
struct iovec *to,
>        ++to;
>        ++seg;
>     }
> -   return seg;
>  }
> 
>  /* Caller must have TX VQ lock */
> @@ -204,7 +203,7 @@ static void handle_tx(struct vhost_net *net)
>     unuse_mm(net->dev.mm);
>  }
> 
> -static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
> +static int peek_head_len(struct vhost_virtqueue *vq, struct sock *sk)
>  {
>     struct sk_buff *head;
>     int len = 0;
> @@ -212,17 +211,70 @@ static int vhost_head_len(struct vhost_virtqueue 
*vq, 
> struct sock *sk)
>     lock_sock(sk);
>     head = skb_peek(&sk->sk_receive_queue);
>     if (head)
> -      len = head->len + vq->sock_hlen;
> +      len = head->len;
>     release_sock(sk);
>     return len;
>  }
> 
> +/* This is a multi-buffer version of vhost_get_desc, that works if
> + *   vq has read descriptors only.
> + * @vq      - the relevant virtqueue
> + * @datalen   - data length we'll be reading
> + * @iovcount   - returned count of io vectors we fill
> + * @log      - vhost log
> + * @log_num   - log offset
> + *   returns number of buffer heads allocated, negative on error
> + */
> +static int get_rx_bufs(struct vhost_virtqueue *vq,
> +             struct vring_used_elem *heads,
> +             int datalen,
> +             unsigned *iovcount,
> +             struct vhost_log *log,
> +             unsigned *log_num)
> +{
> +   unsigned int out, in;
> +   int seg = 0;
> +   int headcount = 0;
> +   unsigned d;
> +   int r;
> +
> +   while (datalen > 0) {
> +      if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
> +         r = -ENOBUFS;
> +         goto err;
> +      }
> +      d = vhost_get_desc(vq->dev, vq, vq->iov + seg,
> +               ARRAY_SIZE(vq->iov) - seg, &out,
> +               &in, log, log_num);
> +      if (d == vq->num) {
> +         r = 0;
> +         goto err;
> +      }
> +      if (unlikely(out || in <= 0)) {
> +         vq_err(vq, "unexpected descriptor format for RX: "
> +            "out %d, in %d\n", out, in);
> +         r = -EINVAL;
> +         goto err;
> +      }
> +      heads[headcount].id = d;
> +      heads[headcount].len = iov_length(vq->iov + seg, in);
> +      datalen -= heads[headcount].len;
> +      ++headcount;
> +      seg += in;
> +   }
> +   *iovcount = seg;
> +   return headcount;
> +err:
> +   vhost_discard_desc(vq, headcount);
> +   return r;
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_rx(struct vhost_net *net)
>  {
>     struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> -   unsigned in, log, s;
> +   unsigned uninitialized_var(in), log;
>     struct vhost_log *vq_log;
>     struct msghdr msg = {
>        .msg_name = NULL,
> @@ -238,9 +290,10 @@ static void handle_rx(struct vhost_net *net)
>        .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
>     };
> 
> -   size_t len, total_len = 0;
> -   int err, headcount, datalen;
> -   size_t vhost_hlen;
> +   size_t total_len = 0;
> +   int err, headcount;
> +   size_t vhost_hlen, sock_hlen;
> +   size_t vhost_len, sock_len;
>     struct socket *sock = rcu_dereference(vq->private_data);
>     if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
>        return;
> @@ -249,14 +302,16 @@ static void handle_rx(struct vhost_net *net)
>     mutex_lock(&vq->mutex);
>     vhost_disable_notify(vq);
>     vhost_hlen = vq->vhost_hlen;
> +   sock_hlen = vq->sock_hlen;
> 
>     vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>        vq->log : NULL;
> 
> -   while ((datalen = vhost_head_len(vq, sock->sk))) {
> -      headcount = vhost_get_desc_n(vq, vq->heads,
> -                    datalen + vhost_hlen,
> -                    &in, vq_log, &log);
> +   while ((sock_len = peek_head_len(vq, sock->sk))) {
> +      sock_len += sock_hlen;
> +      vhost_len = sock_len + vhost_hlen;
> +      headcount = get_rx_bufs(vq, vq->heads, vhost_len,
> +               &in, vq_log, &log);
>        if (headcount < 0)
>           break;
>        /* OK, now we need to know about added descriptors. */
> @@ -272,34 +327,25 @@ static void handle_rx(struct vhost_net *net)
>           break;
>        }
>        /* We don't need to be notified again. */
> -      if (vhost_hlen)
> +      if (unlikely((vhost_hlen)))
>           /* Skip header. TODO: support TSO. */
> -         s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
> +         move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
>        else
> -         s = copy_iovec_hdr(vq->iov, vq->hdr, vq->sock_hlen, in);
> +         copy_iovec_hdr(vq->iov, vq->hdr, sock_hlen, in);
>        msg.msg_iovlen = in;
> -      len = iov_length(vq->iov, in);
> -      /* Sanity check */
> -      if (!len) {
> -         vq_err(vq, "Unexpected header len for RX: "
> -                "%zd expected %zd\n",
> -                iov_length(vq->hdr, s), vhost_hlen);
> -         break;
> -      }
>        err = sock->ops->recvmsg(NULL, sock, &msg,
> -                len, MSG_DONTWAIT | MSG_TRUNC);
> +                sock_len, MSG_DONTWAIT | MSG_TRUNC);
>        /* TODO: Check specific error and bomb out unless EAGAIN? */
>        if (err < 0) {
>           vhost_discard_desc(vq, headcount);
>           break;
>        }
> -      if (err != datalen) {
> +      if (err != sock_len) {
>           pr_err("Discarded rx packet: "
> -                " len %d, expected %zd\n", err, datalen);
> +                " len %d, expected %zd\n", err, sock_len);
>           vhost_discard_desc(vq, headcount);
>           continue;
>        }
> -      len = err;
>        if (vhost_hlen &&
>            memcpy_toiovecend(vq->hdr, (unsigned char *)&hdr, 0,
>                    vhost_hlen)) {
> @@ -311,17 +357,16 @@ static void handle_rx(struct vhost_net *net)
>        if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
>            memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
>                    offsetof(typeof(hdr), num_buffers),
> -                  sizeof(hdr.num_buffers))) {
> +                  sizeof hdr.num_buffers)) {
>           vq_err(vq, "Failed num_buffers write");
>           vhost_discard_desc(vq, headcount);
>           break;
>        }
> -      len += vhost_hlen;
>        vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
>                     headcount);
>        if (unlikely(vq_log))
> -         vhost_log_write(vq, vq_log, log, len);
> -      total_len += len;
> +         vhost_log_write(vq, vq_log, log, vhost_len);
> +      total_len += vhost_len;
>        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>           vhost_poll_queue(&vq->poll);
>           break;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8ef5e3f..4b49991 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -862,53 +862,6 @@ static unsigned get_indirect(struct vhost_dev *dev, 

> struct vhost_virtqueue *vq,
>     return 0;
>  }
> 
> -/* This is a multi-buffer version of vhost_get_desc
> - * @vq      - the relevant virtqueue
> - * datalen   - data length we'll be reading
> - * @iovcount   - returned count of io vectors we fill
> - * @log      - vhost log
> - * @log_num   - log offset
> - *   returns number of buffer heads allocated, negative on error
> - */
> -int vhost_get_desc_n(struct vhost_virtqueue *vq, struct vring_used_elem 
*heads,
> -           int datalen, unsigned *iovcount, struct vhost_log *log,
> -           unsigned int *log_num)
> -{
> -   unsigned int out, in;
> -   int seg = 0;
> -   int headcount = 0;
> -   int r;
> -
> -   while (datalen > 0) {
> -      if (headcount >= VHOST_NET_MAX_SG) {
> -         r = -ENOBUFS;
> -         goto err;
> -      }
> -      heads[headcount].id = vhost_get_desc(vq->dev, vq, vq->iov + seg,
> -                     ARRAY_SIZE(vq->iov) - seg, &out,
> -                     &in, log, log_num);
> -      if (heads[headcount].id == vq->num) {
> -         r = 0;
> -         goto err;
> -      }
> -      if (out || in <= 0) {
> -         vq_err(vq, "unexpected descriptor format for RX: "
> -            "out %d, in %d\n", out, in);
> -         r = -EINVAL;
> -         goto err;
> -      }
> -      heads[headcount].len = iov_length(vq->iov + seg, in);
> -      datalen -= heads[headcount].len;
> -      ++headcount;
> -      seg += in;
> -   }
> -   *iovcount = seg;
> -   return headcount;
> -err:
> -   vhost_discard_desc(vq, headcount);
> -   return r;
> -}
> -
>  /* This looks in the virtqueue and for the first available buffer, and 
converts
>   * it to an iovec for convenient access.  Since descriptors consist of 
some
>   * number of output then some number of input descriptors, it's 
actually two
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 08d740a..4c9809e 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -122,9 +122,6 @@ long vhost_dev_ioctl(struct vhost_dev *, unsigned 
int 
> ioctl, unsigned long arg);
>  int vhost_vq_access_ok(struct vhost_virtqueue *vq);
>  int vhost_log_access_ok(struct vhost_dev *);
> 
> -int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem 
*heads,
> -           int datalen, unsigned int *iovcount, struct vhost_log *log,
> -           unsigned int *log_num);
>  unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
>              struct iovec iov[], unsigned int iov_count,
>              unsigned int *out_num, unsigned int *in_num,
> 
> -- 
> MST


^ permalink raw reply

* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: David Miller @ 2010-04-29 19:04 UTC (permalink / raw)
  To: bruce.w.allan; +Cc: anton, jeffrey.t.kirsher, netdev, gospo, mjg
In-Reply-To: <8DD2590731AB5D4C9DBF71A877482A9062274501@orsmsx509.amr.corp.intel.com>

From: "Allan, Bruce W" <bruce.w.allan@intel.com>
Date: Thu, 29 Apr 2010 10:19:56 -0700

> Your patch is probably the correct thing to do but I'm not all that
> familiar with the ppc64 architecture.  Would you please provide the
> output of 'lspci -t' and 'lspci -vvv -xxx'.

You're not guarenteed for there to be a pci_dev backing the top-level
host controller, at the very least.  Some platforms don't even implement
the PCI config space for the host controller, whilst on others access
to them is protected by the hypervisor.

So you can't go poking around the PCI host controller registers
unconditionally.

The same OOPS probably would happen on Sparc64 in some configurations
too.  Although all of my PCI-E slots do have PCI-E express switch port
nodes, so maybe it wouldn't trigger here.

^ permalink raw reply

* Re: [PATCH] bonding: fix arp_validate on bonds inside a bridge
From: Jay Vosburgh @ 2010-04-29 18:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: bonding-devel, netdev
In-Reply-To: <20100429163921.GA4228@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>On Wed, Apr 28, 2010 at 12:05:11PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> >--- a/include/linux/netdevice.h
>> >+++ b/include/linux/netdevice.h
>> >@@ -2055,6 +2055,8 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
>> > 	}
>> > }
>> >
>> >+extern void (*bond_handle_arp_hook)(struct sk_buff *skb);
>> 
>> 	Should this be inside the the skb_bond_should_drop function to
>> limit its scope?  Just wondering if that's a little tidier.
>
>No, this needs to be global, so that the bonding module can
>initialize it with the correct address.
>
>
>> >--- a/net/core/dev.c
>> >+++ b/net/core/dev.c
>> >@@ -2314,6 +2314,9 @@ static inline int deliver_skb(struct sk_buff *skb,
>> > 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>> > }
>> >
>> >+void (*bond_handle_arp_hook)(struct sk_buff *skb);
>> >+EXPORT_SYMBOL_GPL(bond_handle_arp_hook);
>> 
>> 	Should this be hidden by
>> 
>> #if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
>> 
>> 	with some parallel changes to skb_bond_should_drop so it
>> vanishes if bonding is not configured?  Granted, distros will all turn
>> it on anyway, but the embedded size might be a bit smaller.
>
>Sure, good idea. I put the whole skb_bonding_should_drop function
>in an #ifdef, which will actually speed up the rx path if
>CONFIG_BONDING is not set. A new version follows:

	This doesn't apply to the current net-next-2.6 (because
skb_bond_should_drop was pulled out of line a few weeks ago); can you
update the patch?

	-J

>----
>bonding with arp_validate does not currently work when the
>bonding master is part of a bridge. This is because
>bond_arp_rcv() is registered as a packet type handler for ARP,
>but before netif_receive_skb() processes the ptype_base hash
>table, handle_bridge() is called and changes the skb->dev to
>point to the bridge device.
>
>This patch makes bonding_should_drop() call the bonding ARP
>handler directly if a IFF_MASTER_NEEDARP flag is set on the
>bonding master.  bond_register_arp() now only needs to set the
>IFF_MASTER_NEEDARP flag.
>
>We no longer need special ARP handling for inactive slaves, hence
>IFF_SLAVE_NEEDARP is not needed.
>
>skb_reset_network_header() and skb_reset_transport_header() need
>to be called before the call to bonding_should_drop() because
>bond_handle_arp() needs the offsets initialized.
>
>As a side-effect, skb_bond_should_drop is #defined as 0
>when CONFIG_BONDING is not set.
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 0075514..cafd404 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1940,8 +1940,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
> 	}
>
> 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
>-				   IFF_SLAVE_INACTIVE | IFF_BONDING |
>-				   IFF_SLAVE_NEEDARP);
>+				   IFF_SLAVE_INACTIVE | IFF_BONDING);
>
> 	kfree(slave);
>
>@@ -2612,11 +2611,12 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
> 	}
> }
>
>-static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
>+static void bond_handle_arp(struct sk_buff *skb)
> {
> 	struct arphdr *arp;
> 	struct slave *slave;
> 	struct bonding *bond;
>+	struct net_device *dev = skb->dev->master, *orig_dev = skb->dev;
> 	unsigned char *arp_ptr;
> 	__be32 sip, tip;
>
>@@ -2637,9 +2637,8 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
> 	bond = netdev_priv(dev);
> 	read_lock(&bond->lock);
>
>-	pr_debug("bond_arp_rcv: bond %s skb->dev %s orig_dev %s\n",
>-		 bond->dev->name, skb->dev ? skb->dev->name : "NULL",
>-		 orig_dev ? orig_dev->name : "NULL");
>+	pr_debug("bond_handle_arp: bond: %s, master: %s, slave: %s\n",
>+		bond->dev->name, dev->name, orig_dev->name);
>
> 	slave = bond_get_slave_by_dev(bond, orig_dev);
> 	if (!slave || !slave_do_arp_validate(bond, slave))
>@@ -2684,8 +2683,7 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
> out_unlock:
> 	read_unlock(&bond->lock);
> out:
>-	dev_kfree_skb(skb);
>-	return NET_RX_SUCCESS;
>+	return;
> }
>
> /*
>@@ -3567,23 +3565,12 @@ static void bond_unregister_lacpdu(struct bonding *bond)
>
> void bond_register_arp(struct bonding *bond)
> {
>-	struct packet_type *pt = &bond->arp_mon_pt;
>-
>-	if (pt->type)
>-		return;
>-
>-	pt->type = htons(ETH_P_ARP);
>-	pt->dev = bond->dev;
>-	pt->func = bond_arp_rcv;
>-	dev_add_pack(pt);
>+	bond->dev->priv_flags |= IFF_MASTER_NEEDARP;
> }
>
> void bond_unregister_arp(struct bonding *bond)
> {
>-	struct packet_type *pt = &bond->arp_mon_pt;
>-
>-	dev_remove_pack(pt);
>-	pt->type = 0;
>+	bond->dev->priv_flags &= ~IFF_MASTER_NEEDARP;
> }
>
> /*---------------------------- Hashing Policies -----------------------------*/
>@@ -5041,6 +5028,7 @@ static int __init bonding_init(void)
> 	register_netdevice_notifier(&bond_netdev_notifier);
> 	register_inetaddr_notifier(&bond_inetaddr_notifier);
> 	bond_register_ipv6_notifier();
>+	bond_handle_arp_hook = bond_handle_arp;
> out:
> 	return res;
> err:
>@@ -5061,6 +5049,7 @@ static void __exit bonding_exit(void)
>
> 	rtnl_link_unregister(&bond_link_ops);
> 	unregister_pernet_subsys(&bond_net_ops);
>+	bond_handle_arp_hook = NULL;
> }
>
> module_init(bonding_init);
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 257a7a4..57adfe5 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -212,7 +212,6 @@ struct bonding {
> 	struct   bond_params params;
> 	struct   list_head vlan_list;
> 	struct   vlan_group *vlgrp;
>-	struct   packet_type arp_mon_pt;
> 	struct   workqueue_struct *wq;
> 	struct   delayed_work mii_work;
> 	struct   delayed_work arp_work;
>@@ -292,14 +291,12 @@ static inline void bond_set_slave_inactive_flags(struct slave *slave)
> 	if (!bond_is_lb(bond))
> 		slave->state = BOND_STATE_BACKUP;
> 	slave->dev->priv_flags |= IFF_SLAVE_INACTIVE;
>-	if (slave_do_arp_validate(bond, slave))
>-		slave->dev->priv_flags |= IFF_SLAVE_NEEDARP;
> }
>
> static inline void bond_set_slave_active_flags(struct slave *slave)
> {
> 	slave->state = BOND_STATE_ACTIVE;
>-	slave->dev->priv_flags &= ~(IFF_SLAVE_INACTIVE | IFF_SLAVE_NEEDARP);
>+	slave->dev->priv_flags &= ~IFF_SLAVE_INACTIVE;
> }
>
> static inline void bond_set_master_3ad_flags(struct bonding *bond)
>diff --git a/include/linux/if.h b/include/linux/if.h
>index 3a9f410..84ab2c8 100644
>--- a/include/linux/if.h
>+++ b/include/linux/if.h
>@@ -63,7 +63,7 @@
> #define IFF_MASTER_8023AD	0x8	/* bonding master, 802.3ad. 	*/
> #define IFF_MASTER_ALB	0x10		/* bonding master, balance-alb.	*/
> #define IFF_BONDING	0x20		/* bonding master or slave	*/
>-#define IFF_SLAVE_NEEDARP 0x40		/* need ARPs for validation	*/
>+#define IFF_MASTER_NEEDARP 0x40		/* need ARPs for validation	*/
> #define IFF_ISATAP	0x80		/* ISATAP interface (RFC4214)	*/
> #define IFF_MASTER_ARPMON 0x100		/* bonding master, ARP mon in use */
> #define IFF_WAN_HDLC	0x200		/* WAN HDLC device		*/
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index fa8b476..1ad9b71 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -2055,6 +2055,9 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
> 	}
> }
>
>+#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
>+extern void (*bond_handle_arp_hook)(struct sk_buff *skb);
>+
> /* On bonding slaves other than the currently active slave, suppress
>  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>  * ARP on active-backup slaves with arp_validate enabled.
>@@ -2076,11 +2079,13 @@ static inline int skb_bond_should_drop(struct sk_buff *skb,
> 			skb_bond_set_mac_by_master(skb, master);
> 		}
>
>-		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
>-			if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
>-			    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>-				return 0;
>+		/* pass ARP frames directly to bonding
>+		   before bridging or other hooks change them */
>+		if ((master->priv_flags & IFF_MASTER_NEEDARP) &&
>+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>+			bond_handle_arp_hook(skb);
>
>+		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
> 			if (master->priv_flags & IFF_MASTER_ALB) {
> 				if (skb->pkt_type != PACKET_BROADCAST &&
> 				    skb->pkt_type != PACKET_MULTICAST)
>@@ -2095,6 +2100,9 @@ static inline int skb_bond_should_drop(struct sk_buff *skb,
> 	}
> 	return 0;
> }
>+#else
>+#define skb_bond_should_drop(a, b) 0
>+#endif
>
> extern struct pernet_operations __net_initdata loopback_net_ops;
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index f769098..f9821f1 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2314,6 +2314,11 @@ static inline int deliver_skb(struct sk_buff *skb,
> 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> }
>
>+#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
>+void (*bond_handle_arp_hook)(struct sk_buff *skb);
>+EXPORT_SYMBOL_GPL(bond_handle_arp_hook);
>+#endif
>+
> #if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
>
> #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
>@@ -2507,6 +2512,10 @@ int netif_receive_skb(struct sk_buff *skb)
> 	if (!skb->skb_iif)
> 		skb->skb_iif = skb->dev->ifindex;
>
>+	skb_reset_network_header(skb);
>+	skb_reset_transport_header(skb);
>+	skb->mac_len = skb->network_header - skb->mac_header;
>+
> 	null_or_orig = NULL;
> 	orig_dev = skb->dev;
> 	master = ACCESS_ONCE(orig_dev->master);
>@@ -2519,10 +2528,6 @@ int netif_receive_skb(struct sk_buff *skb)
>
> 	__get_cpu_var(netdev_rx_stat).total++;
>
>-	skb_reset_network_header(skb);
>-	skb_reset_transport_header(skb);
>-	skb->mac_len = skb->network_header - skb->mac_header;
>-
> 	pt_prev = NULL;
>
> 	rcu_read_lock();
>
>
>Thanks,
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Greg KH @ 2010-04-30 15:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Eric W. Biederman, bcrl, benjamin.thery, cornelia.huck,
	eric.dumazet, kay.sievers, netdev
In-Reply-To: <20100430154320.GC13977@us.ibm.com>

On Fri, Apr 30, 2010 at 10:43:21AM -0500, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
> > Hello,
> > 
> > On 04/30/2010 04:29 PM, Serge E. Hallyn wrote:
> > > Hmm, but looking back over the previous thread (Mar 31) I guess you
> > > mean more in-line comments around the callbacks, presumably things
> > > like class_dir_child_ns_type() and struct kobj_ns_type_operations
> > > members?
> > 
> > In-line.  What they're, how they're supposed to be used, which calling
> > context is expected, what can be returned and so on.
> > 
> > > It sounds like what you'd really like is to have any explicit
> > > mention to namespaces pulled out of drivers/base (layering as you
> > > keep saying)?  But will there be a use for this outside of
> > > namespaces?  Does trying to anticipate that fall into the category
> > > of over-abstraction?
> > 
> > I wouldn't mind limited amount of layering exceptions as long as
> > they're clearly documented.  What I'm primarily worried about is not
> > the possibility of other users but more the obfuscation of the whole
> > sysfs-kobject-driver model thing which is already overly abstracted
> > and obfuscated (at least it seems to me that way).
> > 
> > NS needs tagged support in the driver model which in itself is fine
> > and I also understand that from someone who's primarily working on NS,
> > adding a bit on top of the whole thing wouldn't seem like much of a
> > problem.  To me it seems like worsening a problem which is already
> > pretty bad.  I hope you could understand my POV too.
> 
> I do.  I can take a stab monday at pushing a cloned version of Eric's
> tree with comments added, if Eric doesn't have time.  (Or a patch on
> top of Greg's tree)

On top of Greg's tree please :)

thanks,

greg k-h

^ permalink raw reply

* Re: NULL pointer dereference panic in stable (2.6.33.2), amd64
From: Denys Fedorysychenko @ 2010-04-29 10:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, krkumar2, netdev
In-Reply-To: <1271327830.16881.2370.camel@edumazet-laptop>

On Thursday 15 April 2010 13:37:10 Eric Dumazet wrote:
> Le jeudi 15 avril 2010 à 12:11 +0300, Denys Fedorysychenko a écrit :
> > Btw i have application using tun.
> 
> Could you add following sanity test to catch the error ?
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index fa8b476..b67274a 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -988,6 +988,7 @@ static inline
>  struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
>  					 unsigned int index)
>  {
> +	WARN_ON(index >= dev->num_tx_queues);
>  	return &dev->_tx[index];
>  }
> 
Very sorry for being late, just i found way to stabilize kernel for me and to 
solve my personal life issues. It took 2 weeks...
I will try this patch, but i'm sure i dont have multiqueue card there.

I had shaper on eth0.33 and eth0 so i disable all offloading except 
checksumming there. Recently i found  that on some interfaces (where is no 
shaper) gso left on. And yes, probably some traffic was queued, then route 
changed, and maybe it went from gso off interface to gso on... 
It can be just pure luck, that i dont have crashes anymore, but maybe trick is 
in gso...

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Serge E. Hallyn @ 2010-04-30 15:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Greg KH, bcrl, benjamin.thery, cornelia.huck,
	eric.dumazet, kay.sievers, netdev
In-Reply-To: <4BDAF5B0.3080801@kernel.org>

Quoting Tejun Heo (tj@kernel.org):
> Hello,
> 
> On 04/30/2010 04:29 PM, Serge E. Hallyn wrote:
> > Hmm, but looking back over the previous thread (Mar 31) I guess you
> > mean more in-line comments around the callbacks, presumably things
> > like class_dir_child_ns_type() and struct kobj_ns_type_operations
> > members?
> 
> In-line.  What they're, how they're supposed to be used, which calling
> context is expected, what can be returned and so on.
> 
> > It sounds like what you'd really like is to have any explicit
> > mention to namespaces pulled out of drivers/base (layering as you
> > keep saying)?  But will there be a use for this outside of
> > namespaces?  Does trying to anticipate that fall into the category
> > of over-abstraction?
> 
> I wouldn't mind limited amount of layering exceptions as long as
> they're clearly documented.  What I'm primarily worried about is not
> the possibility of other users but more the obfuscation of the whole
> sysfs-kobject-driver model thing which is already overly abstracted
> and obfuscated (at least it seems to me that way).
> 
> NS needs tagged support in the driver model which in itself is fine
> and I also understand that from someone who's primarily working on NS,
> adding a bit on top of the whole thing wouldn't seem like much of a
> problem.  To me it seems like worsening a problem which is already
> pretty bad.  I hope you could understand my POV too.

I do.  I can take a stab monday at pushing a cloned version of Eric's
tree with comments added, if Eric doesn't have time.  (Or a patch on
top of Greg's tree)

thanks,
-serge

^ permalink raw reply

* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 16:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004301249540.2951@localhost.localdomain>

On Fri, 30 Apr 2010, Thomas Gleixner wrote:

> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>
>> This patch adds a cpumask affinity hint to the irq_desc
>> structure, along with a registration function and a read-only
>> proc entry for each interrupt.
>>
>> This affinity_hint handle for each interrupt can be used by
>> underlying drivers that need a better mechanism to control
>> interrupt affinity.  The underlying driver can register a
>> cpumask for the interrupt, which will allow the driver to
>> provide the CPU mask for the interrupt to anything that
>> requests it.  The intent is to extend the userspace daemon,
>> irqbalance, to help hint to it a preferred CPU mask to balance
>> the interrupt into.
>>
>> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>> ---
>>
>>  include/linux/interrupt.h |   13 +++++++++++++
>>  include/linux/irq.h       |    1 +
>>  kernel/irq/manage.c       |   28 ++++++++++++++++++++++++++++
>>  kernel/irq/proc.c         |   33 +++++++++++++++++++++++++++++++++
>>  4 files changed, 75 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>> index 75f3f00..9c9ea2a 100644
>> --- a/include/linux/interrupt.h
>> +++ b/include/linux/interrupt.h
>> @@ -209,6 +209,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
>>  extern int irq_can_set_affinity(unsigned int irq);
>>  extern int irq_select_affinity(unsigned int irq);
>>
>> +extern int irq_register_affinity_hint(unsigned int irq,
>> +                                      const struct cpumask *m);
>
> I think we can do with a single funtion irq_set_affinity_hint() and
> let the caller set the pointer to NULL.

That works too.  I like it.  :-)

>
>>
>> +int irq_register_affinity_hint(unsigned int irq, const struct cpumask *m)
>> +{
>> +	struct irq_desc *desc = irq_to_desc(irq);
>> +	unsigned long flags;
>
>  desc needs to be checked. It might be NULL !

Doh, good point!

>
>> +
>> +	raw_spin_lock_irqsave(&desc->lock, flags);
>> +	desc->affinity_hint = m;
>> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(irq_register_affinity_hint);
>
>  EXPORT_SYMBOL_GPL please

Will do.

>
>> +static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
>> +{
>> +	struct irq_desc *desc = irq_to_desc((long)m->private);
>> +	unsigned long flags;
>> +	int ret = -EINVAL;
>> +
>> +	raw_spin_lock_irqsave(&desc->lock, flags);
>> +	if (desc->affinity_hint) {
>> +		seq_cpumask(m, desc->affinity_hint);
>
>  Please make a local copy under desc->mask and do the seq_cpumask()
>  stuff on the local copy outside of desc->lock

Will do.

Cheers,

-PJ

^ permalink raw reply

* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Thomas Gleixner @ 2010-04-30 10:53 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: davem, arjan, netdev, linux-kernel
In-Reply-To: <20100430085150.4630.46790.stgit@ppwaskie-hc2.jf.intel.com>

On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:

> This patch adds a cpumask affinity hint to the irq_desc
> structure, along with a registration function and a read-only
> proc entry for each interrupt.
> 
> This affinity_hint handle for each interrupt can be used by
> underlying drivers that need a better mechanism to control
> interrupt affinity.  The underlying driver can register a
> cpumask for the interrupt, which will allow the driver to
> provide the CPU mask for the interrupt to anything that
> requests it.  The intent is to extend the userspace daemon,
> irqbalance, to help hint to it a preferred CPU mask to balance
> the interrupt into.
> 
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> ---
> 
>  include/linux/interrupt.h |   13 +++++++++++++
>  include/linux/irq.h       |    1 +
>  kernel/irq/manage.c       |   28 ++++++++++++++++++++++++++++
>  kernel/irq/proc.c         |   33 +++++++++++++++++++++++++++++++++
>  4 files changed, 75 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 75f3f00..9c9ea2a 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -209,6 +209,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
>  extern int irq_can_set_affinity(unsigned int irq);
>  extern int irq_select_affinity(unsigned int irq);
>  
> +extern int irq_register_affinity_hint(unsigned int irq,
> +                                      const struct cpumask *m);

I think we can do with a single funtion irq_set_affinity_hint() and
let the caller set the pointer to NULL.

>  
> +int irq_register_affinity_hint(unsigned int irq, const struct cpumask *m)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +	unsigned long flags;

  desc needs to be checked. It might be NULL !

> +
> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +	desc->affinity_hint = m;
> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(irq_register_affinity_hint);

  EXPORT_SYMBOL_GPL please

> +static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
> +{
> +	struct irq_desc *desc = irq_to_desc((long)m->private);
> +	unsigned long flags;
> +	int ret = -EINVAL;
> +
> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +	if (desc->affinity_hint) {
> +		seq_cpumask(m, desc->affinity_hint);

  Please make a local copy under desc->mask and do the seq_cpumask()
  stuff on the local copy outside of desc->lock

Thanks,

	tglx

^ permalink raw reply

* [PATCH linux-next v2 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-30  8:52 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100430085150.4630.46790.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   20 +++++++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..28f9d6b 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1083,6 +1083,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint for this irq.
+		 */
+		if (!alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL))
+			return;
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           q_vector->affinity_mask);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3228,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3261,14 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* unregister the affinity_hint */
+		irq_unregister_affinity_hint(adapter->msix_entries[i].vector);
+		/* release the CPU mask memory */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);

^ permalink raw reply related

* [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30  8:51 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a cpumask affinity hint to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
cpumask for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |   13 +++++++++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   28 ++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   33 +++++++++++++++++++++++++++++++++
 4 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..9c9ea2a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -209,6 +209,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq,
+                                      const struct cpumask *m);
+extern int irq_unregister_affinity_hint(unsigned int irq);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +226,16 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq,
+                                             const struct cpumask *m)
+{
+	return -EINVAL;
+}
+
+static inline int irq_unregister_affinity_hint(unsigned int irq);
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..83b16d7 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	struct cpumask		*affinity_hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..bce7e38 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,31 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, const struct cpumask *m)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	desc->affinity_hint = m;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_register_affinity_hint);
+
+int irq_unregister_affinity_hint(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	desc->affinity_hint = NULL;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_unregister_affinity_hint);
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +941,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint is cleaned up */
+	desc->affinity_hint = NULL;
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..8b85f77 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,23 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	unsigned long flags;
+	int ret = -EINVAL;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	if (desc->affinity_hint) {
+		seq_cpumask(m, desc->affinity_hint);
+		seq_putc(m, '\n');
+		ret = 0;
+	}
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -84,6 +101,11 @@ static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +114,13 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +260,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,

^ permalink raw reply related

* [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-04-30  8:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, David Miller, adobriyan, ebiederm
In-Reply-To: <20100430082912.5630.82405.sendpatchset@localhost.localdomain>


(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/Documentation/networking/ip-sysctl.txt
===================================================================
--- linux-2.6.orig/Documentation/networking/ip-sysctl.txt
+++ linux-2.6/Documentation/networking/ip-sysctl.txt
@@ -588,6 +588,37 @@ ip_local_port_range - 2 INTEGERS
 	(i.e. by default) range 1024-4999 is enough to issue up to
 	2000 connections per second to systems supporting timestamps.
 
+ip_local_reserved_ports - list of comma separated ranges
+	Specify the ports which are reserved for known third-party
+	applications. These ports will not be used by automatic port
+	assignments (e.g. when calling connect() or bind() with port
+	number 0). Explicit port allocation behavior is unchanged.
+
+	The format used for both input and output is a comma separated
+	list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
+	10). Writing to the file will clear all previously reserved
+	ports and update the current list with the one given in the
+	input.
+
+	Note that ip_local_port_range and ip_local_reserved_ports
+	settings are independent and both are considered by the kernel
+	when determining which ports are available for automatic port
+	assignments.
+
+	You can reserve ports which are not in the current
+	ip_local_port_range, e.g.:
+
+	$ cat /proc/sys/net/ipv4/ip_local_port_range
+	32000	61000
+	$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+	8080,9148
+
+	although this is redundant. However such a setting is useful
+	if later the port range is changed to a value that will
+	include the reserved ports.
+
+	Default: Empty
+
 ip_nonlocal_bind - BOOLEAN
 	If set, allows processes to bind() to non-local IP addresses,
 	which can be quite useful - but may break some applications.
Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h
+++ linux-2.6/include/net/ip.h
@@ -184,6 +184,12 @@ extern struct local_ports {
 } sysctl_local_ports;
 extern void inet_get_local_port_range(int *low, int *high);
 
+extern unsigned long *sysctl_local_reserved_ports;
+static inline int inet_is_reserved_local_port(int port)
+{
+	return test_bit(port, sysctl_local_reserved_ports);
+}
+
 extern int sysctl_ip_default_ttl;
 extern int sysctl_ip_nonlocal_bind;
 
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -1552,9 +1552,13 @@ static int __init inet_init(void)
 
 	BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
 
+	sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
+	if (!sysctl_local_reserved_ports)
+		goto out;
+
 	rc = proto_register(&tcp_prot, 1);
 	if (rc)
-		goto out;
+		goto out_free_reserved_ports;
 
 	rc = proto_register(&udp_prot, 1);
 	if (rc)
@@ -1653,6 +1657,8 @@ out_unregister_udp_proto:
 	proto_unregister(&udp_prot);
 out_unregister_tcp_proto:
 	proto_unregister(&tcp_prot);
+out_free_reserved_ports:
+	kfree(sysctl_local_reserved_ports);
 	goto out;
 }
 
Index: linux-2.6/net/ipv4/inet_connection_sock.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_connection_sock.c
+++ linux-2.6/net/ipv4/inet_connection_sock.c
@@ -37,6 +37,9 @@ struct local_ports sysctl_local_ports __
 	.range = { 32768, 61000 },
 };
 
+unsigned long *sysctl_local_reserved_ports;
+EXPORT_SYMBOL(sysctl_local_reserved_ports);
+
 void inet_get_local_port_range(int *low, int *high)
 {
 	unsigned seq;
@@ -112,6 +115,8 @@ again:
 
 		smallest_size = -1;
 		do {
+			if (inet_is_reserved_local_port(rover))
+				goto next_nolock;
 			head = &hashinfo->bhash[inet_bhashfn(net, rover,
 					hashinfo->bhash_size)];
 			spin_lock(&head->lock);
@@ -136,6 +141,7 @@ again:
 			break;
 		next:
 			spin_unlock(&head->lock);
+		next_nolock:
 			if (++rover > high)
 				rover = low;
 		} while (--remaining > 0);
Index: linux-2.6/net/ipv4/inet_hashtables.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_hashtables.c
+++ linux-2.6/net/ipv4/inet_hashtables.c
@@ -456,6 +456,8 @@ int __inet_hash_connect(struct inet_time
 		local_bh_disable();
 		for (i = 1; i <= remaining; i++) {
 			port = low + (i + offset) % remaining;
+			if (inet_is_reserved_local_port(port))
+				continue;
 			head = &hinfo->bhash[inet_bhashfn(net, port,
 					hinfo->bhash_size)];
 			spin_lock(&head->lock);
Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -299,6 +299,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_local_port_range,
 	},
+	{
+		.procname	= "ip_local_reserved_ports",
+		.data		= NULL, /* initialized in sysctl_ipv4_init */
+		.maxlen		= 65536,
+		.mode		= 0644,
+		.proc_handler	= proc_do_large_bitmap,
+	},
 #ifdef CONFIG_IP_MULTICAST
 	{
 		.procname	= "igmp_max_memberships",
@@ -736,6 +743,16 @@ static __net_initdata struct pernet_oper
 static __init int sysctl_ipv4_init(void)
 {
 	struct ctl_table_header *hdr;
+	struct ctl_table *i;
+
+	for (i = ipv4_table; i->procname; i++) {
+		if (strcmp(i->procname, "ip_local_reserved_ports") == 0) {
+			i->data = sysctl_local_reserved_ports;
+			break;
+		}
+	}
+	if (!i->procname)
+		return -EINVAL;
 
 	hdr = register_sysctl_paths(net_ipv4_ctl_path, ipv4_table);
 	if (hdr == NULL)
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -233,7 +233,8 @@ int udp_lib_get_port(struct sock *sk, un
 			 */
 			do {
 				if (low <= snum && snum <= high &&
-				    !test_bit(snum >> udptable->log, bitmap))
+				    !test_bit(snum >> udptable->log, bitmap) &&
+				    !inet_is_reserved_local_port(snum))
 					goto found;
 				snum += rand;
 			} while (snum != first);
Index: linux-2.6/net/sctp/socket.c
===================================================================
--- linux-2.6.orig/net/sctp/socket.c
+++ linux-2.6/net/sctp/socket.c
@@ -5436,6 +5436,8 @@ static long sctp_get_port_local(struct s
 			rover++;
 			if ((rover < low) || (rover > high))
 				rover = low;
+			if (inet_is_reserved_local_port(rover))
+				continue;
 			index = sctp_phashfn(rover);
 			head = &sctp_port_hashtable[index];
 			sctp_spin_lock(&head->lock);

^ permalink raw reply

* [Patch 1/3] sysctl: refactor integer handling proc code
From: Amerigo Wang @ 2010-04-30  8:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, ebiederm, David Miller, adobriyan, Amerigo Wang
In-Reply-To: <20100430082912.5630.82405.sendpatchset@localhost.localdomain>

(Based on Octavian's work, and I modified a lot.)

As we are about to add another integer handling proc function a little
bit of cleanup is in order: add a few helper functions to improve code
readability and decrease code duplication.

In the process a bug is also fixed: if the user specifies a number
with more then 20 digits it will be interpreted as two integers
(e.g. 10000...13 will be interpreted as 100.... and 13).

Behavior for EFAULT handling was changed as well. Previous to this
patch, when an EFAULT error occurred in the middle of a write
operation, although some of the elements were set, that was not
acknowledged to the user (by shorting the write and returning the
number of bytes accepted). EFAULT is now treated just like any other
errors by acknowledging the amount of bytes accepted.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2040,8 +2040,122 @@ int proc_dostring(struct ctl_table *tabl
 			       buffer, lenp, ppos);
 }
 
+static size_t proc_skip_spaces(char **buf)
+{
+	size_t ret;
+	char *tmp = skip_spaces(*buf);
+	ret = tmp - *buf;
+	*buf = tmp;
+	return ret;
+}
+
+#define TMPBUFLEN 22
+/**
+ * proc_get_long - reads an ASCII formated integer from a user buffer
+ *
+ * @buf - a kernel buffer
+ * @size - size of the kernel buffer
+ * @val - this is where the number will be stored
+ * @neg - set to %TRUE if number is negative
+ * @perm_tr - a vector which contains the allowed trailers
+ * @perm_tr_len - size of the perm_tr vector
+ * @tr - pointer to store the trailer character
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read. If tr is non NULL and a trailing
+ * character exist (size is non zero after returning from this
+ * function) tr is updated with the trailing character.
+ */
+static int proc_get_long(char **buf, size_t *size,
+			  unsigned long *val, bool *neg,
+			  const char *perm_tr, unsigned perm_tr_len, char *tr)
+{
+	int len;
+	char *p, tmp[TMPBUFLEN];
+
+	if (!*size)
+		return -EINVAL;
+
+	len = *size;
+	if (len > TMPBUFLEN-1)
+		len = TMPBUFLEN-1;
+
+	memcpy(tmp, *buf, len);
+
+	tmp[len] = 0;
+	p = tmp;
+	if (*p == '-' && *size > 1) {
+		*neg = 1;
+		p++;
+	} else
+		*neg = 0;
+	if (!isdigit(*p))
+		return -EINVAL;
+
+	*val = simple_strtoul(p, &p, 0);
 
-static int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
+	len = p - tmp;
+
+	/* We don't know if the next char is whitespace thus we may accept
+	 * invalid integers (e.g. 1234...a) or two integers instead of one
+	 * (e.g. 123...1). So lets not allow such large numbers. */
+	if (len == TMPBUFLEN - 1)
+		return -EINVAL;
+
+	if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
+		return -EINVAL;
+
+	if (tr && (len < *size))
+		*tr = *p;
+
+	*buf += len;
+	*size -= len;
+
+	return 0;
+}
+
+/**
+ * proc_put_long - coverts an integer to a decimal ASCII formated string
+ *
+ * @buf - the user buffer
+ * @size - the size of the user buffer
+ * @val - the integer to be converted
+ * @neg - sign of the number, %TRUE for negative
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read.
+ */
+static int proc_put_long(void __user **buf, size_t *size, unsigned long val,
+			  bool neg)
+{
+	int len;
+	char tmp[TMPBUFLEN], *p = tmp;
+
+	sprintf(p, "%s%lu", neg ? "-" : "", val);
+	len = strlen(tmp);
+	if (len > *size)
+		len = *size;
+	if (copy_to_user(*buf, tmp, len))
+		return -EFAULT;
+	*size -= len;
+	*buf += len;
+	return 0;
+}
+#undef TMPBUFLEN
+
+static int proc_put_char(void __user **buf, size_t *size, char c)
+{
+	if (*size) {
+		char __user **buffer = (char __user **)buf;
+		if (put_user(c, *buffer))
+			return -EFAULT;
+		(*size)--, (*buffer)++;
+		*buf = *buffer;
+	}
+	return 0;
+}
+
+static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 				 int *valp,
 				 int write, void *data)
 {
@@ -2050,7 +2164,7 @@ static int do_proc_dointvec_conv(int *ne
 	} else {
 		int val = *valp;
 		if (val < 0) {
-			*negp = -1;
+			*negp = 1;
 			*lvalp = (unsigned long)-val;
 		} else {
 			*negp = 0;
@@ -2060,23 +2174,21 @@ static int do_proc_dointvec_conv(int *ne
 	return 0;
 }
 
+static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
+
 static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
 		  int write, void __user *buffer,
 		  size_t *lenp, loff_t *ppos,
-		  int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
 			      int write, void *data),
 		  void *data)
 {
-#define TMPBUFLEN 21
-	int *i, vleft, first = 1, neg;
-	unsigned long lval;
-	size_t left, len;
+	int *i, vleft, first = 1, err = 0;
+	unsigned long page = 0;
+	size_t left;
+	char *kbuf;
 	
-	char buf[TMPBUFLEN], *p;
-	char __user *s = buffer;
-	
-	if (!tbl_data || !table->maxlen || !*lenp ||
-	    (*ppos && !write)) {
+	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
@@ -2088,89 +2200,69 @@ static int __do_proc_dointvec(void *tbl_
 	if (!conv)
 		conv = do_proc_dointvec_conv;
 
+	if (write) {
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			err = -EFAULT;
+			goto free;
+		}
+		kbuf[left] = 0;
+	}
+
 	for (; left && vleft--; i++, first=0) {
-		if (write) {
-			while (left) {
-				char c;
-				if (get_user(c, s))
-					return -EFAULT;
-				if (!isspace(c))
-					break;
-				left--;
-				s++;
-			}
-			if (!left)
-				break;
-			neg = 0;
-			len = left;
-			if (len > sizeof(buf) - 1)
-				len = sizeof(buf) - 1;
-			if (copy_from_user(buf, s, len))
-				return -EFAULT;
-			buf[len] = 0;
-			p = buf;
-			if (*p == '-' && left > 1) {
-				neg = 1;
-				p++;
-			}
-			if (*p < '0' || *p > '9')
-				break;
+		unsigned long lval;
+		bool neg;
 
-			lval = simple_strtoul(p, &p, 0);
+		if (write) {
+			left -= proc_skip_spaces(&kbuf);
 
-			len = p-buf;
-			if ((len < left) && *p && !isspace(*p))
+			err = proc_get_long(&kbuf, &left, &lval, &neg,
+					     proc_wspace_sep,
+					     sizeof(proc_wspace_sep), NULL);
+			if (err)
 				break;
-			s += len;
-			left -= len;
-
-			if (conv(&neg, &lval, i, 1, data))
+			if (conv(&neg, &lval, i, 1, data)) {
+				err = -EINVAL;
 				break;
+			}
 		} else {
-			p = buf;
+			if (conv(&neg, &lval, i, 0, data)) {
+				err = -EINVAL;
+				break;
+			}
 			if (!first)
-				*p++ = '\t';
-	
-			if (conv(&neg, &lval, i, 0, data))
+				err = proc_put_char(&buffer, &left, '\t');
+			if (err)
+				break;
+			err = proc_put_long(&buffer, &left, lval, neg);
+			if (err)
 				break;
-
-			sprintf(p, "%s%lu", neg ? "-" : "", lval);
-			len = strlen(buf);
-			if (len > left)
-				len = left;
-			if(copy_to_user(s, buf, len))
-				return -EFAULT;
-			left -= len;
-			s += len;
 		}
 	}
 
-	if (!write && !first && left) {
-		if(put_user('\n', s))
-			return -EFAULT;
-		left--, s++;
-	}
+	if (!write && !first && left && !err)
+		err = proc_put_char(&buffer, &left, '\n');
+	if (write && !err)
+		left -= proc_skip_spaces(&kbuf);
+free:
 	if (write) {
-		while (left) {
-			char c;
-			if (get_user(c, s++))
-				return -EFAULT;
-			if (!isspace(c))
-				break;
-			left--;
-		}
+		free_page(page);
+		if (first)
+			return err ? : -EINVAL;
 	}
-	if (write && first)
-		return -EINVAL;
 	*lenp -= left;
 	*ppos += *lenp;
-	return 0;
-#undef TMPBUFLEN
+	return err;
 }
 
 static int do_proc_dointvec(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos,
-		  int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
 			      int write, void *data),
 		  void *data)
 {
@@ -2238,8 +2330,8 @@ struct do_proc_dointvec_minmax_conv_para
 	int *max;
 };
 
-static int do_proc_dointvec_minmax_conv(int *negp, unsigned long *lvalp, 
-					int *valp, 
+static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
+					int *valp,
 					int write, void *data)
 {
 	struct do_proc_dointvec_minmax_conv_param *param = data;
@@ -2252,7 +2344,7 @@ static int do_proc_dointvec_minmax_conv(
 	} else {
 		int val = *valp;
 		if (val < 0) {
-			*negp = -1;
+			*negp = 1;
 			*lvalp = (unsigned long)-val;
 		} else {
 			*negp = 0;
@@ -2295,102 +2387,78 @@ static int __do_proc_doulongvec_minmax(v
 				     unsigned long convmul,
 				     unsigned long convdiv)
 {
-#define TMPBUFLEN 21
-	unsigned long *i, *min, *max, val;
-	int vleft, first=1, neg;
-	size_t len, left;
-	char buf[TMPBUFLEN], *p;
-	char __user *s = buffer;
-	
-	if (!data || !table->maxlen || !*lenp ||
-	    (*ppos && !write)) {
+	unsigned long *i, *min, *max;
+	int vleft, first = 1, err = 0;
+	unsigned long page = 0;
+	size_t left;
+	char *kbuf;
+
+	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
-	
+
 	i = (unsigned long *) data;
 	min = (unsigned long *) table->extra1;
 	max = (unsigned long *) table->extra2;
 	vleft = table->maxlen / sizeof(unsigned long);
 	left = *lenp;
-	
+
+	if (write) {
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			err = -EFAULT;
+			goto free;
+		}
+		kbuf[left] = 0;
+	}
+
 	for (; left && vleft--; i++, min++, max++, first=0) {
+		unsigned long val;
+
 		if (write) {
-			while (left) {
-				char c;
-				if (get_user(c, s))
-					return -EFAULT;
-				if (!isspace(c))
-					break;
-				left--;
-				s++;
-			}
-			if (!left)
-				break;
-			neg = 0;
-			len = left;
-			if (len > TMPBUFLEN-1)
-				len = TMPBUFLEN-1;
-			if (copy_from_user(buf, s, len))
-				return -EFAULT;
-			buf[len] = 0;
-			p = buf;
-			if (*p == '-' && left > 1) {
-				neg = 1;
-				p++;
-			}
-			if (*p < '0' || *p > '9')
-				break;
-			val = simple_strtoul(p, &p, 0) * convmul / convdiv ;
-			len = p-buf;
-			if ((len < left) && *p && !isspace(*p))
+			bool neg;
+
+			left -= proc_skip_spaces(&kbuf);
+
+			err = proc_get_long(&kbuf, &left, &val, &neg,
+					     proc_wspace_sep,
+					     sizeof(proc_wspace_sep), NULL);
+			if (err)
 				break;
 			if (neg)
-				val = -val;
-			s += len;
-			left -= len;
-
-			if(neg)
 				continue;
 			if ((min && val < *min) || (max && val > *max))
 				continue;
 			*i = val;
 		} else {
-			p = buf;
+			val = convdiv * (*i) / convmul;
 			if (!first)
-				*p++ = '\t';
-			sprintf(p, "%lu", convdiv * (*i) / convmul);
-			len = strlen(buf);
-			if (len > left)
-				len = left;
-			if(copy_to_user(s, buf, len))
-				return -EFAULT;
-			left -= len;
-			s += len;
+				err = proc_put_char(&buffer, &left, '\t');
+			err = proc_put_long(&buffer, &left, val, false);
+			if (err)
+				break;
 		}
 	}
 
-	if (!write && !first && left) {
-		if(put_user('\n', s))
-			return -EFAULT;
-		left--, s++;
-	}
+	if (!write && !first && left && !err)
+		err = proc_put_char(&buffer, &left, '\n');
+	if (write && !err)
+		left -= proc_skip_spaces(&kbuf);
+free:
 	if (write) {
-		while (left) {
-			char c;
-			if (get_user(c, s++))
-				return -EFAULT;
-			if (!isspace(c))
-				break;
-			left--;
-		}
+		free_page(page);
+		if (first)
+			return err ? : -EINVAL;
 	}
-	if (write && first)
-		return -EINVAL;
 	*lenp -= left;
 	*ppos += *lenp;
-	return 0;
-#undef TMPBUFLEN
+	return err;
 }
 
 static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
@@ -2451,7 +2519,7 @@ int proc_doulongvec_ms_jiffies_minmax(st
 }
 
 
-static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
 					 int *valp,
 					 int write, void *data)
 {
@@ -2463,7 +2531,7 @@ static int do_proc_dointvec_jiffies_conv
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = 1;
 			lval = (unsigned long)-val;
 		} else {
 			*negp = 0;
@@ -2474,7 +2542,7 @@ static int do_proc_dointvec_jiffies_conv
 	return 0;
 }
 
-static int do_proc_dointvec_userhz_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
 						int *valp,
 						int write, void *data)
 {
@@ -2486,7 +2554,7 @@ static int do_proc_dointvec_userhz_jiffi
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = 1;
 			lval = (unsigned long)-val;
 		} else {
 			*negp = 0;
@@ -2497,7 +2565,7 @@ static int do_proc_dointvec_userhz_jiffi
 	return 0;
 }
 
-static int do_proc_dointvec_ms_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
 					    int *valp,
 					    int write, void *data)
 {
@@ -2507,7 +2575,7 @@ static int do_proc_dointvec_ms_jiffies_c
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = 1;
 			lval = (unsigned long)-val;
 		} else {
 			*negp = 0;

^ permalink raw reply

* Re: 2.6.34-rc5-git7 (plus all patches) -- another suspicious rcu_dereference_check() usage.
From: Paul E. McKenney @ 2010-04-30  0:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg, shemminger
In-Reply-To: <20100428204403.GT2540@linux.vnet.ibm.com>

On Wed, Apr 28, 2010 at 01:44:03PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 28, 2010 at 10:18:43PM +0200, Eric Dumazet wrote:
> > Le mercredi 28 avril 2010 à 13:09 -0700, Paul E. McKenney a écrit :
> > > On Wed, Apr 28, 2010 at 09:38:11PM +0200, Eric Dumazet wrote:
> > > > Le mercredi 28 avril 2010 à 10:54 -0700, Paul E. McKenney a écrit :
> > > > > On Mon, Apr 26, 2010 at 08:51:06PM -0400, Miles Lane wrote:
> > > > > > This one occurred during the wakeup from suspend to RAM.
> > > > > > 
> > > > > > [  984.724697] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > > > > [  984.724700] ---------------------------------------------------
> > > > > > [  984.724703] include/linux/fdtable.h:88 invoked
> > > > > > rcu_dereference_check() without protection!
> > > > > > [  984.724706]
> > > > > > [  984.724707] other info that might help us debug this:
> > > > > > [  984.724708]
> > > > > > [  984.724711]
> > > > > > [  984.724711] rcu_scheduler_active = 1, debug_locks = 1
> > > > > > [  984.724714] no locks held by dbus-daemon/4680.
> > > > > > [  984.724717]
> > > > > > [  984.724717] stack backtrace:
> > > > > > [  984.724721] Pid: 4680, comm: dbus-daemon Not tainted 2.6.34-rc5-git7 #33
> > > > > > [  984.724724] Call Trace:
> > > > > > [  984.724734]  [<ffffffff81074556>] lockdep_rcu_dereference+0x9d/0xa6
> > > > > > [  984.724740]  [<ffffffff810fc785>] fcheck_files+0xb1/0xc9
> > > > > > [  984.724745]  [<ffffffff810fc7f5>] fget_light+0x35/0xab
> > > > > > [  984.724751]  [<ffffffff81433e1b>] ? sock_poll_wait+0x13/0x18
> > > > > > [  984.724755]  [<ffffffff81433e39>] ? unix_poll+0x19/0x95
> > > > > > [  984.724762]  [<ffffffff8110aa95>] do_sys_poll+0x1ff/0x3e5
> > > > > > [  984.724766]  [<ffffffff8110a19e>] ? __pollwait+0x0/0xc7
> > > > > > [  984.724771]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724776]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724780]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724784]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724788]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724793]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724797]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724802]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724806]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > > [  984.724812]  [<ffffffff8110ae0f>] sys_poll+0x50/0xbb
> > > > > > [  984.724818]  [<ffffffff81009d82>] system_call_fastpath+0x16/0x1b
> > > > > 
> > > > > Hmmm...  I am not convinced that this is a false positive.  Couldn't
> > > > > there be a multi-threaded process where one thread is invoking poll()
> > > > > on a UNIX socket just as another thread is calling close() on it?
> > > > > 
> > > > > The current fcheck_files() logic requires that the caller either (1) be in
> > > > > an RCU read-side critical section, (2) hold ->files_lock, or (3) passing
> > > > > in a files_struct with ->count equal to 1 (initialization or cleanup).
> > > > > 
> > > > > So I don't feel comfortable just slapping an RCU read-side critical
> > > > > section around this one, at least not unless someone who understands
> > > > > the locking says that doing so is OK.
> > > > > 
> > > > > 		
> > > > 
> > > > Its a single threaded program.
> > > > 
> > > > So fget_light() calls fcheck_files(files, fd); without rcu lock,
> > > > but some /proc/pid/fd/... user temporarly raised files->count just
> > > > before we perform the condition check.
> > > 
> > > So I should add a single-threaded check.  My first thought was to use
> > > current_is_single_threaded(), but the bit about scanning the full list
> > > of processes does give me pause.  However, thread_group_empty() looks
> > > like a much lighter-weight alternative.
> > > 
> > > I believe that it is possible for a pair of single-threaded processes
> > > to share a file descriptor, but that should not be a problem, as both
> > > of them would need to close it for it to go away.
> > > 
> > > But what happens if someone does a clone() with CLONE_FILES, as some
> > > of the AIO stuff seems to do?  Won't that allow one of the resulting
> > > processes to close the file for both of them, even though both are
> > > otherwise single-threaded?  And the ->count seems to be the only
> > > distinction between these two cases.
> > > 
> > > And AIO does CLONE_VM as well as CLONE_FILES, but that seems to mean that
> > > the check must scan the processes with current_is_single_threaded().
> > > Besides which, a user could invoke clone() with only CLONE_FILES
> > > specified, right?
> > > 
> > > Or am I just confused here?

Not confused, just stupid.  You would think that after almost 20 years
doing parallel programming, I would understand that concurrent debugging
assertions usually cannot be exact.  In this case, I really don't want
false positives, so I must accept some false negatives.  -That- I should
be able to do without adding runtime overhead when !PROVE_RCU.

I will put together a patch, test it, and send it out.

							Thanx, Paul

> > If a program is mono threaded, and doing a fget_light() syscall, it
> > cannot possibly do a clone() in // ;)
> 
> The sequence of events that I am worried about is as follows:
> 
> 1.	Single-threaded process does clone(CLONE_FILES).  The
> 	result is a pair of single-threaded processes that share
> 	file descriptors.
> 
> 2.	One of these processes does files_fdtable(i) at the same
> 	time as the other process closes file descriptor i.
> 
> So, clone and -then- do fget_light().
> 
> > If we want to be picky, we could add a user provided condition, aka "we
> > are sure we are allowed to do this because we are the owner of the files
> > struct".
> 
> But yes, if I understand your trick below, the race conditions from
> the above sequence of events would simply force the processes off
> of the fget_light() path, which should be OK.
> 
> 						Thanx, Paul
> 
> > diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
> > index 6da962c..027f5e1 100644
> > --- a/drivers/char/tty_io.c
> > +++ b/drivers/char/tty_io.c
> > @@ -2694,7 +2694,7 @@ void __do_SAK(struct tty_struct *tty)
> >  			spin_lock(&p->files->file_lock);
> >  			fdt = files_fdtable(p->files);
> >  			for (i = 0; i < fdt->max_fds; i++) {
> > -				filp = fcheck_files(p->files, i);
> > +				filp = fcheck_files(p->files, i, false);
> >  				if (!filp)
> >  					continue;
> >  				if (filp->f_op->read == tty_read &&
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 452d02f..dabf4d8 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -119,7 +119,7 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
> >  		int retval = oldfd;
> > 
> >  		rcu_read_lock();
> > -		if (!fcheck_files(files, oldfd))
> > +		if (!fcheck_files(files, oldfd, false))
> >  			retval = -EBADF;
> >  		rcu_read_unlock();
> >  		return retval;
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 32d12b7..2865f72 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -274,7 +274,7 @@ struct file *fget(unsigned int fd)
> >  	struct files_struct *files = current->files;
> > 
> >  	rcu_read_lock();
> > -	file = fcheck_files(files, fd);
> > +	file = fcheck_files(files, fd, false);
> >  	if (file) {
> >  		if (!atomic_long_inc_not_zero(&file->f_count)) {
> >  			/* File object ref couldn't be taken */
> > @@ -303,10 +303,10 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
> > 
> >  	*fput_needed = 0;
> >  	if (likely((atomic_read(&files->count) == 1))) {
> > -		file = fcheck_files(files, fd);
> > +		file = fcheck_files(files, fd, true);
> >  	} else {
> >  		rcu_read_lock();
> > -		file = fcheck_files(files, fd);
> > +		file = fcheck_files(files, fd, false);
> >  		if (file) {
> >  			if (atomic_long_inc_not_zero(&file->f_count))
> >  				*fput_needed = 1;
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 8418fcc..0e89448 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1716,7 +1716,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
> >  		 * hold ->file_lock.
> >  		 */
> >  		spin_lock(&files->file_lock);
> > -		file = fcheck_files(files, fd);
> > +		file = fcheck_files(files, fd, false);
> >  		if (file) {
> >  			if (path) {
> >  				*path = file->f_path;
> > @@ -1755,7 +1755,7 @@ static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
> >  		files = get_files_struct(task);
> >  		if (files) {
> >  			rcu_read_lock();
> > -			if (fcheck_files(files, fd)) {
> > +			if (fcheck_files(files, fd, false)) {
> >  				rcu_read_unlock();
> >  				put_files_struct(files);
> >  				if (task_dumpable(task)) {
> > @@ -1813,7 +1813,7 @@ static struct dentry *proc_fd_instantiate(struct inode *dir,
> >  	 * hold ->file_lock.
> >  	 */
> >  	spin_lock(&files->file_lock);
> > -	file = fcheck_files(files, fd);
> > +	file = fcheck_files(files, fd, false);
> >  	if (!file)
> >  		goto out_unlock;
> >  	if (file->f_mode & FMODE_READ)
> > @@ -1899,7 +1899,7 @@ static int proc_readfd_common(struct file * filp, void * dirent,
> >  				char name[PROC_NUMBUF];
> >  				int len;
> > 
> > -				if (!fcheck_files(files, fd))
> > +				if (!fcheck_files(files, fd, false))
> >  					continue;
> >  				rcu_read_unlock();
> > 
> > diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> > index 013dc52..76423ad 100644
> > --- a/include/linux/fdtable.h
> > +++ b/include/linux/fdtable.h
> > @@ -57,11 +57,12 @@ struct files_struct {
> >  	struct file * fd_array[NR_OPEN_DEFAULT];
> >  };
> > 
> > -#define rcu_dereference_check_fdtable(files, fdtfd) \
> > +#define rcu_dereference_check_fdtable(files, fdtfd, cond) \
> >  	(rcu_dereference_check((fdtfd), \
> >  			       rcu_read_lock_held() || \
> >  			       lockdep_is_held(&(files)->file_lock) || \
> > -			       atomic_read(&(files)->count) == 1))
> > +			       atomic_read(&(files)->count) == 1 || \
> > +			       cond))
> > 
> >  #define files_fdtable(files) \
> >  		(rcu_dereference_check_fdtable((files), (files)->fdt))
> > @@ -79,13 +80,13 @@ static inline void free_fdtable(struct fdtable *fdt)
> >  	call_rcu(&fdt->rcu, free_fdtable_rcu);
> >  }
> > 
> > -static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
> > +static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd, bool cond)
> >  {
> >  	struct file * file = NULL;
> >  	struct fdtable *fdt = files_fdtable(files);
> > 
> >  	if (fd < fdt->max_fds)
> > -		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> > +		file = rcu_dereference_check_fdtable(files, fdt->fd[fd], cond);
> >  	return file;
> >  }
> > 
> > 
> > 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox