* Re: [PATCH 3/3] vhost: fix the memory leak which will happen when memory_access_ok fails
From: Michael S. Tsirkin @ 2010-05-27 10:49 UTC (permalink / raw)
To: Takuya Yoshikawa; +Cc: kvm, virtualization, netdev
In-Reply-To: <20100527190356.cbf2aac7.yoshikawa.takuya@oss.ntt.co.jp>
On Thu, May 27, 2010 at 07:03:56PM +0900, Takuya Yoshikawa wrote:
> We need to free newmem when vhost_set_memory() fails to complete.
>
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
Thanks, applied.
> drivers/vhost/vhost.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 9633a3c..1241a22 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -337,8 +337,10 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> return -EFAULT;
> }
>
> - if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL)))
> + if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) {
> + kfree(newmem);
> return -EFAULT;
> + }
> oldmem = d->memory;
> rcu_assign_pointer(d->memory, newmem);
> synchronize_rcu();
> --
> 1.7.0.4
^ permalink raw reply
* [GIT PULL net-next-2.6] vhost-net cleanups
From: Michael S. Tsirkin @ 2010-05-27 10:58 UTC (permalink / raw)
To: David Miller; +Cc: kvm, virtualization, netdev, linux-kernel, jdike, tklauser
David,
The following tree is on top of net-next-2.6.
Please merge it for 2.6.36.
Thanks!
The following changes since commit 7a9b149212f3716c598afe973b6261fd58453b7a:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 (2010-05-20 21:26:12 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next
Jeff Dike (1):
vhost-net: minor cleanup
Michael S. Tsirkin (1):
vhost: whitespace fix
Tobias Klauser (1):
vhost: Storage class should be before const qualifier
drivers/vhost/net.c | 13 ++++++-------
drivers/vhost/vhost.c | 4 ++--
2 files changed, 8 insertions(+), 9 deletions(-)
--
MST
^ permalink raw reply
* Re: ipv6: Add GSO support on forwarding path
From: Herbert Xu @ 2010-05-27 11:54 UTC (permalink / raw)
To: Ralf Baechle; +Cc: David S. Miller, netdev
In-Reply-To: <20100527102429.GA9017@gondor.apana.org.au>
On Thu, May 27, 2010 at 08:24:29PM +1000, Herbert Xu wrote:
>
> Actually, this patch is still not quite right for the GSO case.
> Right now it's only adding the IP header size, not the full header
> size.
>
> I need to think a bit more about this.
Let's do it the old way first, so that it at least works for the
common case. I'll fix it properly later.
ipv6: Add GSO support on forwarding path
Currently we disallow GSO packets on the IPv6 forward path.
This patch fixes this.
Note that I discovered that our existing GSO MTU checks (e.g.,
IPv4 forwarding) are buggy in that they skip the check altogether,
hen they really should be checking gso_size + header instead.
I have also been lazy here in that I haven't bothered to segment
the GSO packet by hand before generating an ICMP message. Someone
should add that to be 100% correct.
Reported-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index cd963f6..89425af 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -507,7 +507,7 @@ int ip6_forward(struct sk_buff *skb)
if (mtu < IPV6_MIN_MTU)
mtu = IPV6_MIN_MTU;
- if (skb->len > mtu) {
+ if (skb->len > mtu && !skb_is_gso(skb)) {
/* Again, force OUTPUT device used as source address */
skb->dev = dst->dev;
icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related
* [GIT PULL net-2.6] vhost-net: error handling fixes
From: Michael S. Tsirkin @ 2010-05-27 11:57 UTC (permalink / raw)
To: David Miller; +Cc: kvm, virtualization, netdev, linux-kernel, krkumar2
David,
The following tree includes fixes dealing with error handling
in vhost-net. It is on top of net-2.6.
Please merge it for 2.6.35.
Thanks!
The following changes since commit 8a74ad60a546b13bd1096b2a61a7a5c6fd9ae17c:
net: fix lock_sock_bh/unlock_sock_bh (2010-05-27 00:30:53 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net
Krishna Kumar (1):
vhost: Fix host panic if ioctl called with wrong index
Takuya Yoshikawa (3):
vhost: fix to check the return value of copy_to/from_user() correctly
vhost-net: fix to check the return value of copy_to/from_user() correctly
vhost: fix the memory leak which will happen when memory_access_ok fails
drivers/vhost/net.c | 14 ++++++------
drivers/vhost/vhost.c | 57 +++++++++++++++++++++++++++---------------------
2 files changed, 39 insertions(+), 32 deletions(-)
--
MST
^ permalink raw reply
* [PATCH 4/11] drivers/net/hamradio: Eliminate a NULL pointer dereference
From: Julia Lawall @ 2010-05-27 12:33 UTC (permalink / raw)
To: Jean-Paul Roubelat, linux-hams, netdev, linux-kernel,
kernel-janitors
From: Julia Lawall <julia@diku.dk>
At the point of the print, dev is NULL.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
expression E,E1;
identifier f;
statement S1,S2,S3;
@@
if ((E == NULL && ...) || ...)
{
... when != if (...) S1 else S2
when != E = E1
* E->f
... when any
return ...;
}
else S3
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
---
drivers/net/hamradio/yam.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index 694132e..4e7d1d0 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -1151,8 +1151,7 @@ static int __init yam_init_driver(void)
dev = alloc_netdev(sizeof(struct yam_port), name,
yam_setup);
if (!dev) {
- printk(KERN_ERR "yam: cannot allocate net device %s\n",
- dev->name);
+ pr_err("yam: cannot allocate net device\n");
err = -ENOMEM;
goto error;
}
^ permalink raw reply related
* [PATCH 6/11] drivers/net: Eliminate a NULL pointer dereference
From: Julia Lawall @ 2010-05-27 12:33 UTC (permalink / raw)
To: netdev, linux-kernel, kernel-janitors
From: Julia Lawall <julia@diku.dk>
At the point of the print, dev is NULL.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
expression E,E1;
identifier f;
statement S1,S2,S3;
@@
if ((E == NULL && ...) || ...)
{
... when != if (...) S1 else S2
when != E = E1
* E->f
... when any
return ...;
}
else S3
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
---
drivers/net/3c507.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/3c507.c b/drivers/net/3c507.c
index 82eaf65..ea9b7a0 100644
--- a/drivers/net/3c507.c
+++ b/drivers/net/3c507.c
@@ -551,8 +551,7 @@ static irqreturn_t el16_interrupt(int irq, void *dev_id)
void __iomem *shmem;
if (dev == NULL) {
- pr_err("%s: net_interrupt(): irq %d for unknown device.\n",
- dev->name, irq);
+ pr_err("net_interrupt(): irq %d for unknown device.\n", irq);
return IRQ_NONE;
}
^ permalink raw reply related
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Oleg Nesterov @ 2010-05-27 12:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Sridhar Samudrala, netdev, lkml, kvm@vger.kernel.org,
Andrew Morton, Dmitri Vorobiev, Tejun Heo, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100527091426.GA6308@redhat.com>
On 05/27, Michael S. Tsirkin wrote:
>
> On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> > Add a new kernel API to create a singlethread workqueue and attach it's
> > task to current task's cgroup and cpumask.
> >
> > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
>
> Could someone familiar with workqueue code please comment on whether
> this patch is suitable for 2.6.35?
>
> It is needed to fix the case where vhost user might cause a kernel
> thread to consume more CPU than allowed by the cgroup.
> Should I merge it through the vhost tree?
> Ack for this?
I don't understand the reasons for this patch, but this doesn't matter.
I don't really see any need to change workqueue.c,
> > +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
> > +{
> > + return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
> > +}
(Not sure this trivial static helper with the single caller makes sense, but
see below)
> > +/* Create a singlethread workqueue and attach it's task to the current task's
> > + * cgroup and set it's cpumask to the current task's cpumask.
> > + */
> > +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> > +{
> > + struct workqueue_struct *wq;
> > + struct task_struct *task;
> > + cpumask_var_t mask;
> > +
> > + wq = create_singlethread_workqueue(name);
> > + if (!wq)
> > + goto out;
> > +
> > + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> > + goto err;
> > +
> > + if (sched_getaffinity(current->pid, mask))
> > + goto err;
> > +
> > + task = get_singlethread_wq_task(wq);
> > + if (sched_setaffinity(task->pid, mask))
> > + goto err;
> > +
> > + if (cgroup_attach_task_current_cg(task))
> > + goto err;
> > +out:
> > + return wq;
> > +err:
> > + destroy_workqueue(wq);
> > + wq = NULL;
> > + goto out;
> > +}
Instead, cgroup.c (or whoever needs this) can do
struct move_struct {
struct work_struct work;
int ret;
};
static void move_func(struct work_struct *work)
{
struct move_struct *move = container_of(...);
if (cgroup_attach_task_current_cg(current))
ret = -EANY;
}
static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
{
struct workqueue_struct *wq;
struct move_struct move = {
.work = __WORK_INITIALIZER(move_func);
};
wq = create_singlethread_workqueue(name);
if (!wq)
return NULL;
queue_work(&move.work);
flush_work(&move.work);
if (move.ret) {
destroy_workqueue(wq);
wq = NULL;
}
return wq;
}
Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and
use it like the patch does.
But, imho, create_singlethread_workqueue_in_current_cg() does not belong
to workqueue.c.
Oleg.
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Michael S. Tsirkin @ 2010-05-27 13:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Sridhar Samudrala, netdev, lkml, kvm@vger.kernel.org,
Andrew Morton, Dmitri Vorobiev, Tejun Heo, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100527124448.GA4241@redhat.com>
On Thu, May 27, 2010 at 02:44:48PM +0200, Oleg Nesterov wrote:
> On 05/27, Michael S. Tsirkin wrote:
> >
> > On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> > > Add a new kernel API to create a singlethread workqueue and attach it's
> > > task to current task's cgroup and cpumask.
> > >
> > > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> >
> > Could someone familiar with workqueue code please comment on whether
> > this patch is suitable for 2.6.35?
> >
> > It is needed to fix the case where vhost user might cause a kernel
> > thread to consume more CPU than allowed by the cgroup.
> > Should I merge it through the vhost tree?
> > Ack for this?
>
> I don't understand the reasons for this patch, but this doesn't matter.
Depending on userspace application, driver can create a lot of work
for a workqueue to handle. By making the workqueue thread
belong in a cgroup, we make it possible to the CPU and other
resources thus consumed.
--
MST
^ permalink raw reply
* RE: [PATCH 2/2] net: ll_temac: fix checksum offload logic
From: John Linn @ 2010-05-27 13:26 UTC (permalink / raw)
To: David Miller
Cc: netdev, linuxppc-dev, grant.likely, jwboyer, john.williams,
michal.simek, Brian Hill
In-Reply-To: <20100526.204517.35032653.davem@davemloft.net>
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, May 26, 2010 9:45 PM
> To: John Linn
> Cc: netdev@vger.kernel.org; linuxppc-dev@ozlabs.org;
grant.likely@secretlab.ca;
> jwboyer@linux.vnet.ibm.com; john.williams@petalogix.com;
michal.simek@petalogix.com; Brian Hill
> Subject: Re: [PATCH 2/2] net: ll_temac: fix checksum offload logic
>
> aFrom: John Linn <john.linn@xilinx.com>
> Date: Wed, 26 May 2010 11:29:19 -0600
>
> > The current checksum offload code does not work and this corrects
> > that functionality. It also updates the interrupt coallescing
> > initialization so than there are fewer interrupts and performance
> > is increased.
> >
> > Signed-off-by: Brian Hill <brian.hill@xilinx.com>
> > Signed-off-by: John Linn <john.linn@xilinx.com>
>
> Applied, although I changed "temac_features" to be explicitly
> "unsigned int" instead of just plain "unsigned"
Awesome! Appreciate the help with both patches David.
>
This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Oleg Nesterov @ 2010-05-27 13:48 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Sridhar Samudrala, netdev, lkml, kvm@vger.kernel.org,
Andrew Morton, Dmitri Vorobiev, Tejun Heo, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100527131254.GB7974@redhat.com>
On 05/27, Michael S. Tsirkin wrote:
>
> On Thu, May 27, 2010 at 02:44:48PM +0200, Oleg Nesterov wrote:
> > On 05/27, Michael S. Tsirkin wrote:
> > >
> > > On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> > > > Add a new kernel API to create a singlethread workqueue and attach it's
> > > > task to current task's cgroup and cpumask.
> > > >
> > > > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> > >
> > > Could someone familiar with workqueue code please comment on whether
> > > this patch is suitable for 2.6.35?
> > >
> > > It is needed to fix the case where vhost user might cause a kernel
> > > thread to consume more CPU than allowed by the cgroup.
> > > Should I merge it through the vhost tree?
> > > Ack for this?
> >
> > I don't understand the reasons for this patch, but this doesn't matter.
>
> Depending on userspace application, driver can create a lot of work
> for a workqueue to handle. By making the workqueue thread
> belong in a cgroup, we make it possible to the CPU and other
> resources thus consumed.
OK, I see, thanks for your explanation.
in case I wasn't clear... I didn't mean I dislike this idea, only the
the implementation of create_singlethread_workqueue_in_current_cg(),
it doesn't belong to workqueue.c imho.
Oleg.
^ permalink raw reply
* [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Jiri Pirko @ 2010-05-27 13:49 UTC (permalink / raw)
To: netdev; +Cc: davem, shemminger, kaber, eric.dumazet
What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a general
list of rx_handlers which is iterated thru instead.
Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device to be called from __netif_receive_skb.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
drivers/net/macvlan.c | 24 +++++--
include/linux/if_bridge.h | 2 -
include/linux/if_macvlan.h | 4 -
include/linux/netdevice.h | 18 +++++
net/bridge/br.c | 2 -
net/bridge/br_if.c | 14 ++++
net/bridge/br_input.c | 12 +++-
net/bridge/br_private.h | 3 +-
net/core/dev.c | 158 ++++++++++++++++++++++++++------------------
9 files changed, 155 insertions(+), 82 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87e8d4c..100b16a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_buff *skb,
}
/* called under rcu_read_lock() from netif_receive_skb */
-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
- struct sk_buff *skb)
+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
{
+ struct macvlan_port *port;
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
const struct macvlan_dev *src;
struct net_device *dev;
unsigned int len;
+ port = rcu_dereference(skb->dev->macvlan_port);
if (is_multicast_ether_addr(eth->h_dest)) {
src = macvlan_hash_lookup(port, eth->h_source);
if (!src)
@@ -511,10 +512,16 @@ static void macvlan_setup(struct net_device *dev)
dev->tx_queue_len = 0;
}
+static struct netdev_rx_handler macvlan_rx_handler = {
+ .order = NETDEV_RX_HANDLER_ORDER_MACVLAN,
+ .callback = macvlan_handle_frame,
+};
+
static int macvlan_port_create(struct net_device *dev)
{
struct macvlan_port *port;
unsigned int i;
+ int err;
if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
return -EINVAL;
@@ -528,6 +535,15 @@ static int macvlan_port_create(struct net_device *dev)
for (i = 0; i < MACVLAN_HASH_SIZE; i++)
INIT_HLIST_HEAD(&port->vlan_hash[i]);
rcu_assign_pointer(dev->macvlan_port, port);
+
+ err = netdev_rx_handler_register(dev, &macvlan_rx_handler);
+ if (err) {
+ rcu_assign_pointer(dev->macvlan_port, NULL);
+ synchronize_rcu();
+ kfree(port);
+ return err;
+ }
+
return 0;
}
@@ -535,6 +551,7 @@ static void macvlan_port_destroy(struct net_device *dev)
{
struct macvlan_port *port = dev->macvlan_port;
+ netdev_rx_handler_unregister(dev, &macvlan_rx_handler);
rcu_assign_pointer(dev->macvlan_port, NULL);
synchronize_rcu();
kfree(port);
@@ -767,14 +784,12 @@ static int __init macvlan_init_module(void)
int err;
register_netdevice_notifier(&macvlan_notifier_block);
- macvlan_handle_frame_hook = macvlan_handle_frame;
err = macvlan_link_register(&macvlan_link_ops);
if (err < 0)
goto err1;
return 0;
err1:
- macvlan_handle_frame_hook = NULL;
unregister_netdevice_notifier(&macvlan_notifier_block);
return err;
}
@@ -782,7 +797,6 @@ err1:
static void __exit macvlan_cleanup_module(void)
{
rtnl_link_unregister(&macvlan_link_ops);
- macvlan_handle_frame_hook = NULL;
unregister_netdevice_notifier(&macvlan_notifier_block);
}
diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 938b7e8..0d241a5 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -102,8 +102,6 @@ struct __fdb_entry {
#include <linux/netdevice.h>
extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
- struct sk_buff *skb);
extern int (*br_should_route_hook)(struct sk_buff *skb);
#endif
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9ea047a..c26a0e4 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct rtnl_link_ops *ops);
extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
struct net_device *dev);
-
-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
- struct sk_buff *);
-
#endif /* _LINUX_IF_MACVLAN_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a1bff65..8e95b2d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -254,6 +254,15 @@ struct netdev_hw_addr_list {
#define netdev_for_each_mc_addr(ha, dev) \
netdev_hw_addr_list_for_each(ha, &(dev)->mc)
+
+struct netdev_rx_handler {
+ struct list_head list;
+ unsigned int order;
+#define NETDEV_RX_HANDLER_ORDER_BRIDGE 1
+#define NETDEV_RX_HANDLER_ORDER_MACVLAN 2
+ struct sk_buff *(*callback)(struct sk_buff *skb);
+};
+
struct hh_cache {
struct hh_cache *hh_next; /* Next entry */
atomic_t hh_refcnt; /* number of users */
@@ -1031,6 +1040,10 @@ struct net_device {
/* GARP */
struct garp_port *garp_port;
+ /* receive handlers (hooks) list */
+ spinlock_t rx_handlers_lock;
+ struct list_head rx_handlers;
+
/* class/net/name entry */
struct device dev;
/* space for optional device, statistics, and wireless sysfs groups */
@@ -1685,6 +1698,11 @@ static inline void napi_free_frags(struct napi_struct *napi)
napi->skb = NULL;
}
+extern int netdev_rx_handler_register(struct net_device *dev,
+ struct netdev_rx_handler *rh);
+extern void netdev_rx_handler_unregister(struct net_device *dev,
+ struct netdev_rx_handler *rh);
+
extern void netif_nit_deliver(struct sk_buff *skb);
extern int dev_valid_name(const char *name);
extern int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
diff --git a/net/bridge/br.c b/net/bridge/br.c
index 76357b5..c8436fa 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -63,7 +63,6 @@ static int __init br_init(void)
goto err_out4;
brioctl_set(br_ioctl_deviceless_stub);
- br_handle_frame_hook = br_handle_frame;
#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
br_fdb_test_addr_hook = br_fdb_test_addr;
@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
br_fdb_test_addr_hook = NULL;
#endif
- br_handle_frame_hook = NULL;
br_fdb_fini();
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 18b245e..1d6bf64 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -119,6 +119,11 @@ static void destroy_nbp_rcu(struct rcu_head *head)
destroy_nbp(p);
}
+static struct netdev_rx_handler br_rx_handler = {
+ .order = NETDEV_RX_HANDLER_ORDER_BRIDGE,
+ .callback = br_handle_frame,
+};
+
/* Delete port(interface) from bridge is done in two steps.
* via RCU. First step, marks device as down. That deletes
* all the timers and stops new packets from flowing through.
@@ -147,6 +152,7 @@ static void del_nbp(struct net_bridge_port *p)
list_del_rcu(&p->list);
+ netdev_rx_handler_unregister(dev, &br_rx_handler);
rcu_assign_pointer(dev->br_port, NULL);
br_multicast_del_port(p);
@@ -429,6 +435,11 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
goto err2;
rcu_assign_pointer(dev->br_port, p);
+
+ err = netdev_rx_handler_register(dev, &br_rx_handler);
+ if (err)
+ goto err3;
+
dev_disable_lro(dev);
list_add_rcu(&p->list, &br->port_list);
@@ -451,6 +462,9 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
br_netpoll_enable(br, dev);
return 0;
+err3:
+ rcu_assign_pointer(dev->br_port, NULL);
+ synchronize_rcu();
err2:
br_fdb_delete_by_port(br, p, 1);
err1:
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index d36e700..99647d8 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -131,15 +131,19 @@ static inline int is_link_local(const unsigned char *dest)
}
/*
- * Called via br_handle_frame_hook.
* Return NULL if skb is handled
- * note: already called with rcu_read_lock (preempt_disabled)
+ * note: already called with rcu_read_lock (preempt_disabled) from
+ * netif_receive_skb
*/
-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
+struct sk_buff *br_handle_frame(struct sk_buff *skb)
{
+ struct net_bridge_port *p;
const unsigned char *dest = eth_hdr(skb)->h_dest;
int (*rhook)(struct sk_buff *skb);
+ if (skb->pkt_type == PACKET_LOOPBACK)
+ return skb;
+
if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
goto drop;
@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
if (!skb)
return NULL;
+ p = rcu_dereference(skb->dev->br_port);
+
if (unlikely(is_link_local(dest))) {
/* Pause frames shouldn't be passed up by driver anyway */
if (skb->protocol == htons(ETH_P_PAUSE))
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 0f4a74b..c83519b 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -331,8 +331,7 @@ extern void br_features_recompute(struct net_bridge *br);
/* br_input.c */
extern int br_handle_frame_finish(struct sk_buff *skb);
-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
- struct sk_buff *skb);
+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
/* br_ioctl.c */
extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c82065..8d4a817 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2585,70 +2585,14 @@ static inline int deliver_skb(struct sk_buff *skb,
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}
-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
-
-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
+ (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
/* This hook is defined here for ATM LANE */
int (*br_fdb_test_addr_hook)(struct net_device *dev,
unsigned char *addr) __read_mostly;
EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
#endif
-/*
- * If bridge module is loaded call bridging hook.
- * returns NULL if packet was consumed.
- */
-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
- struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
-
-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
- struct packet_type **pt_prev, int *ret,
- struct net_device *orig_dev)
-{
- struct net_bridge_port *port;
-
- if (skb->pkt_type == PACKET_LOOPBACK ||
- (port = rcu_dereference(skb->dev->br_port)) == NULL)
- return skb;
-
- if (*pt_prev) {
- *ret = deliver_skb(skb, *pt_prev, orig_dev);
- *pt_prev = NULL;
- }
-
- return br_handle_frame_hook(port, skb);
-}
-#else
-#define handle_bridge(skb, pt_prev, ret, orig_dev) (skb)
-#endif
-
-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
- struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
-
-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
- struct packet_type **pt_prev,
- int *ret,
- struct net_device *orig_dev)
-{
- struct macvlan_port *port;
-
- port = rcu_dereference(skb->dev->macvlan_port);
- if (!port)
- return skb;
-
- if (*pt_prev) {
- *ret = deliver_skb(skb, *pt_prev, orig_dev);
- *pt_prev = NULL;
- }
- return macvlan_handle_frame_hook(port, skb);
-}
-#else
-#define handle_macvlan(skb, pt_prev, ret, orig_dev) (skb)
-#endif
-
#ifdef CONFIG_NET_CLS_ACT
/* TODO: Maybe we should just force sch_ingress to be compiled in
* when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
@@ -2744,6 +2688,82 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
}
+static bool rx_handlers_equal(struct netdev_rx_handler *rh1,
+ struct netdev_rx_handler *rh2)
+{
+ return (rh1->order == rh2->order) && (rh1->callback == rh2->callback);
+}
+
+/**
+ * netdev_rx_handler_register - register receive handler
+ * @dev: device to register a handler for
+ * @rh: receive handler to register
+ *
+ * Register a receive hander for a device. This handler will then be
+ * called from __netif_receive_skb. A negative errno code is returned
+ * on a failure.
+ */
+int netdev_rx_handler_register(struct net_device *dev,
+ struct netdev_rx_handler *rh)
+{
+ struct list_head *list, *add_after;
+ struct netdev_rx_handler *rh1;
+ int err = 0;
+
+ spin_lock_bh(&dev->rx_handlers_lock);
+ add_after = &dev->rx_handlers;
+ list_for_each(list, &dev->rx_handlers) {
+ rh1 = list_entry(list, struct netdev_rx_handler, list);
+ if (rx_handlers_equal(rh, rh1)) {
+ err = -EEXIST;
+ goto unlock;
+ }
+ if (rh1->order > rh->order)
+ break;
+ add_after = list;
+ }
+ rh1 = kzalloc(sizeof(*rh), GFP_KERNEL);
+ if (!rh1) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ rh1->order = rh->order;
+ rh1->callback = rh->callback;
+ list_add_rcu(&rh1->list, add_after);
+
+unlock:
+ spin_unlock_bh(&dev->rx_handlers_lock);
+
+ return err;
+}
+EXPORT_SYMBOL(netdev_rx_handler_register);
+
+/**
+ * netdev_rx_handler_unregister - unregister receive handler
+ * @dev: device to unregister a handler from
+ * @rh: receive handler to unregister
+ *
+ * Unregister a receive hander from a device.
+ */
+void netdev_rx_handler_unregister(struct net_device *dev,
+ struct netdev_rx_handler *rh)
+{
+ struct netdev_rx_handler *rh1;
+
+ spin_lock_bh(&dev->rx_handlers_lock);
+ list_for_each_entry(rh1, &dev->rx_handlers, list) {
+ if (rx_handlers_equal(rh, rh1)) {
+ list_del_rcu(&rh1->list);
+ synchronize_net();
+ kfree(rh1);
+ break;
+ }
+ }
+ spin_unlock_bh(&dev->rx_handlers_lock);
+}
+EXPORT_SYMBOL(netdev_rx_handler_unregister);
+
static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
struct net_device *master)
{
@@ -2796,6 +2816,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
+ struct netdev_rx_handler *rh;
struct net_device *orig_dev;
struct net_device *master;
struct net_device *null_or_orig;
@@ -2859,12 +2880,18 @@ static int __netif_receive_skb(struct sk_buff *skb)
ncls:
#endif
- skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
- if (!skb)
- goto out;
- skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
- if (!skb)
- goto out;
+ /*
+ * Go through various rx handlers, like bridge, macvlan etc.
+ */
+ list_for_each_entry_rcu(rh, &skb->dev->rx_handlers, list) {
+ if (pt_prev) {
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = NULL;
+ }
+ skb = rh->callback(skb);
+ if (!skb)
+ goto out;
+ }
/*
* Make sure frames received on VLAN interfaces stacked on
@@ -4932,6 +4959,7 @@ int register_netdevice(struct net_device *dev)
BUG_ON(!net);
spin_lock_init(&dev->addr_list_lock);
+ spin_lock_init(&dev->rx_handlers_lock);
netdev_set_addr_lockdep_class(dev);
netdev_init_queue_locks(dev);
@@ -5371,6 +5399,8 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
dev_mc_init(dev);
dev_uc_init(dev);
+ INIT_LIST_HEAD(&dev->rx_handlers);
+
dev_net_set(dev, &init_net);
dev->_tx = tx;
--
1.6.6.1
^ permalink raw reply related
* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Patrick McHardy @ 2010-05-27 14:06 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, shemminger, eric.dumazet
In-Reply-To: <20100527134935.GA6916@psychotron.lab.eng.brq.redhat.com>
Jiri Pirko wrote:
> @@ -511,10 +512,16 @@ static void macvlan_setup(struct net_device *dev)
> dev->tx_queue_len = 0;
> }
>
> +static struct netdev_rx_handler macvlan_rx_handler = {
> + .order = NETDEV_RX_HANDLER_ORDER_MACVLAN,
> + .callback = macvlan_handle_frame,
> +};
It seems this could be const since you duplicate it on
registration.
> +
> static int macvlan_port_create(struct net_device *dev)
> {
> struct macvlan_port *port;
> unsigned int i;
> + int err;
>
> if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
> return -EINVAL;
> @@ -528,6 +535,15 @@ static int macvlan_port_create(struct net_device *dev)
> for (i = 0; i < MACVLAN_HASH_SIZE; i++)
> INIT_HLIST_HEAD(&port->vlan_hash[i]);
> rcu_assign_pointer(dev->macvlan_port, port);
> +
> + err = netdev_rx_handler_register(dev, &macvlan_rx_handler);
> + if (err) {
> + rcu_assign_pointer(dev->macvlan_port, NULL);
> + synchronize_rcu();
> + kfree(port);
> + return err;
> + }
I'd prefer goto-based unroll since that makes changes in the
future easier.
> +
> return 0;
> }
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a1bff65..8e95b2d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -254,6 +254,15 @@ struct netdev_hw_addr_list {
> #define netdev_for_each_mc_addr(ha, dev) \
> netdev_hw_addr_list_for_each(ha, &(dev)->mc)
>
> +
> +struct netdev_rx_handler {
> + struct list_head list;
> + unsigned int order;
> +#define NETDEV_RX_HANDLER_ORDER_BRIDGE 1
> +#define NETDEV_RX_HANDLER_ORDER_MACVLAN 2
Any reason for not using an enum?
> + struct sk_buff *(*callback)(struct sk_buff *skb);
> +};
> +
> struct hh_cache {
> struct hh_cache *hh_next; /* Next entry */
> atomic_t hh_refcnt; /* number of users */
> @@ -1031,6 +1040,10 @@ struct net_device {
> /* GARP */
> struct garp_port *garp_port;
>
> + /* receive handlers (hooks) list */
> + spinlock_t rx_handlers_lock;
> + struct list_head rx_handlers;
> +
> /* class/net/name entry */
> struct device dev;
> /* space for optional device, statistics, and wireless sysfs groups */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 6c82065..8d4a817 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2744,6 +2688,82 @@ void netif_nit_deliver(struct sk_buff *skb)
> rcu_read_unlock();
> }
>
> +static bool rx_handlers_equal(struct netdev_rx_handler *rh1,
> + struct netdev_rx_handler *rh2)
> +{
> + return (rh1->order == rh2->order) && (rh1->callback == rh2->callback);
> +}
> +
> +/**
> + * netdev_rx_handler_register - register receive handler
> + * @dev: device to register a handler for
> + * @rh: receive handler to register
> + *
> + * Register a receive hander for a device. This handler will then be
> + * called from __netif_receive_skb. A negative errno code is returned
> + * on a failure.
> + */
> +int netdev_rx_handler_register(struct net_device *dev,
> + struct netdev_rx_handler *rh)
> +{
> + struct list_head *list, *add_after;
> + struct netdev_rx_handler *rh1;
> + int err = 0;
> +
> + spin_lock_bh(&dev->rx_handlers_lock);
Why are you using a spin lock and even disable BHs? This function
should only be called from user context, so a mutex will work fine
(and would fix the use of GFP_KERNEL in an atomic section).
> + add_after = &dev->rx_handlers;
> + list_for_each(list, &dev->rx_handlers) {
Naming the element "list" is confusing. Also this should be
using list_for_each_entry().
> + rh1 = list_entry(list, struct netdev_rx_handler, list);
> + if (rx_handlers_equal(rh, rh1)) {
> + err = -EEXIST;
> + goto unlock;
> + }
> + if (rh1->order > rh->order)
> + break;
> + add_after = list;
> + }
> + rh1 = kzalloc(sizeof(*rh), GFP_KERNEL);
> + if (!rh1) {
> + err = -ENOMEM;
> + goto unlock;
> + }
> +
> + rh1->order = rh->order;
> + rh1->callback = rh->callback;
> + list_add_rcu(&rh1->list, add_after);
> +
> +unlock:
> + spin_unlock_bh(&dev->rx_handlers_lock);
> +
> + return err;
> +}
> +EXPORT_SYMBOL(netdev_rx_handler_register);
^ permalink raw reply
* Re: [net-next PATCH] be2net: Adding PCI SRIOV support
From: Ajit Khaparde @ 2010-05-27 13:59 UTC (permalink / raw)
To: Venugopal Busireddy; +Cc: netdev
In-Reply-To: <5AC56E9F58E6EE4681447596FAECEC6DAA6506@ad-1.austin.nextio>
On 26/05/10 12:55 -0500, Venugopal Busireddy wrote:
>
> David,
> >> From: Sarveshwar Bandi <sarveshwarb@serverengines.com>
> >> Date: Wed, 31 Mar 2010 18:26:12 +0530
> >> - Patch adds support to enable PCI SRIOV in the driver and changes to
> handle \
> >> initialization of PCI virtual functions.
> >> - Function handler to change mac addresses for VF from its
> corresponding PF.
> >>
> >> Signed-off-by: Sarveshwar Bandi <sarveshwarb@serverengines.com>
> >Applied, thanks.
> I checked the 2.6.34 kernel, and could not see this patch applied in
> it. In which kernel would this patch be available? Also, if I were to
> apply this patch, which kernel version would this patch require?
You wouldn't see these changes in 2.6.34
They were pushed to Linus tree post 2.6.34 release.
You can take the latest driver from Linus git repository.
-Ajit
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Eric Dumazet @ 2010-05-27 14:17 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, shemminger, kaber
In-Reply-To: <20100527134935.GA6916@psychotron.lab.eng.brq.redhat.com>
Le jeudi 27 mai 2010 à 15:49 +0200, Jiri Pirko a écrit :
> +/**
> + * netdev_rx_handler_unregister - unregister receive handler
> + * @dev: device to unregister a handler from
> + * @rh: receive handler to unregister
> + *
> + * Unregister a receive hander from a device.
> + */
> +void netdev_rx_handler_unregister(struct net_device *dev,
> + struct netdev_rx_handler *rh)
> +{
> + struct netdev_rx_handler *rh1;
> +
> + spin_lock_bh(&dev->rx_handlers_lock);
> + list_for_each_entry(rh1, &dev->rx_handlers, list) {
> + if (rx_handlers_equal(rh, rh1)) {
> + list_del_rcu(&rh1->list);
> + synchronize_net();
> + kfree(rh1);
> + break;
> + }
> + }
> + spin_unlock_bh(&dev->rx_handlers_lock);
> +}
> +EXPORT_SYMBOL(netdev_rx_handler_unregister);
> +
Please dont synchronize_net(); inside the spin_lock_bh() section, at a
very minimum.
void netdev_rx_handler_unregister(struct net_device *dev,
struct netdev_rx_handler *rh)
{
struct netdev_rx_handler *rh1, *found = NULL;
spin_lock_bh(&dev->rx_handlers_lock);
list_for_each_entry(rh1, &dev->rx_handlers, list) {
if (rx_handlers_equal(rh, rh1)) {
list_del_rcu(&rh1->list);
found = rh1;
break;
}
}
spin_unlock_bh(&dev->rx_handlers_lock);
if (found) {
synchronize_net();
kfree(rh1);
}
}
This synchronize_net() proliferation makes me very nervous.
Am I the only one that think this thing is/should be avoided as much as
possible ?
Please dont use synchronize_net() but a call_rcu(), there is absolutely
no point making this thread waits 30 or 40 ms, there is no risk here.
Thanks
^ permalink raw reply
* [PATCH] skb: make skb_recycle_check() return a bool value
From: Changli Gao @ 2010-05-27 14:33 UTC (permalink / raw)
To: David S. Miller; +Cc: Eric Dumazet, linux-kernel, netdev, Changli Gao
make skb_recycle_check() return a bool value
make skb_recycle_check() return a bool value
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/linux/skbuff.h | 2 +-
net/core/skbuff.c | 12 ++++++------
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7cdfb4d..bf243fc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -501,7 +501,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
return __alloc_skb(size, priority, 1, -1);
}
-extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
+extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
extern struct sk_buff *skb_clone(struct sk_buff *skb,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c543dd2..d430876 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -482,22 +482,22 @@ EXPORT_SYMBOL(consume_skb);
* reference count dropping and cleans up the skbuff as if it
* just came from __alloc_skb().
*/
-int skb_recycle_check(struct sk_buff *skb, int skb_size)
+bool skb_recycle_check(struct sk_buff *skb, int skb_size)
{
struct skb_shared_info *shinfo;
if (irqs_disabled())
- return 0;
+ return false;
if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
- return 0;
+ return false;
skb_size = SKB_DATA_ALIGN(skb_size + NET_SKB_PAD);
if (skb_end_pointer(skb) - skb->head < skb_size)
- return 0;
+ return false;
if (skb_shared(skb) || skb_cloned(skb))
- return 0;
+ return false;
skb_release_head_state(skb);
@@ -509,7 +509,7 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
skb->data = skb->head + NET_SKB_PAD;
skb_reset_tail_pointer(skb);
- return 1;
+ return true;
}
EXPORT_SYMBOL(skb_recycle_check);
^ permalink raw reply related
* Question about netns & AF_UNIX
From: Martín Ferrari @ 2010-05-27 14:38 UTC (permalink / raw)
To: netdev; +Cc: Mathieu Lacage
Hi, again a question about netns...
I seem to recall being able to use AF_UNIX sockets across network name
spaces, but I cannot reproduce that with a current kernel. Probably my
test was fubar (I've lost the script).
In any case: is a design decision to forbid this, even when the file
system is shared? I found some discussions from 2008, but I don't see
an agreement being reached...
I also wonder if filedescriptor passing thru ancilliary messages will
work (that is, with unix sockets that I've created before the netns
change).
Thanks
--
Martín Ferrari
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Stephen Hemminger @ 2010-05-27 14:47 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, kaber, eric.dumazet
In-Reply-To: <20100527134935.GA6916@psychotron.lab.eng.brq.redhat.com>
On Thu, 27 May 2010 15:49:36 +0200
Jiri Pirko <jpirko@redhat.com> wrote:
> What this patch does is it removes two receive frame hooks (for bridge and for
> macvlan) from __netif_receive_skb. These are replaced them with a general
> list of rx_handlers which is iterated thru instead.
>
> Then a network driver (of virtual netdev like macvlan or bridge) can register
> an rx_handler for needed net device to be called from __netif_receive_skb.
>
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
This is almost identical to my previous RFC but it is per device.
Why bother with a rx handler lock, why not just use RTNL since the
hook should only be changed from control operations.
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Jiri Pirko @ 2010-05-27 14:57 UTC (permalink / raw)
To: Patrick McHardy; +Cc: netdev, davem, shemminger, eric.dumazet
In-Reply-To: <4BFE7C7E.8090803@trash.net>
Thu, May 27, 2010 at 04:06:54PM CEST, kaber@trash.net wrote:
>Jiri Pirko wrote:
>> @@ -511,10 +512,16 @@ static void macvlan_setup(struct net_device *dev)
>> dev->tx_queue_len = 0;
>> }
>>
>> +static struct netdev_rx_handler macvlan_rx_handler = {
>> + .order = NETDEV_RX_HANDLER_ORDER_MACVLAN,
>> + .callback = macvlan_handle_frame,
>> +};
>
>It seems this could be const since you duplicate it on
>registration.
Noted.
>
>> +
>> static int macvlan_port_create(struct net_device *dev)
>> {
>> struct macvlan_port *port;
>> unsigned int i;
>> + int err;
>>
>> if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
>> return -EINVAL;
>> @@ -528,6 +535,15 @@ static int macvlan_port_create(struct net_device *dev)
>> for (i = 0; i < MACVLAN_HASH_SIZE; i++)
>> INIT_HLIST_HEAD(&port->vlan_hash[i]);
>> rcu_assign_pointer(dev->macvlan_port, port);
>> +
>> + err = netdev_rx_handler_register(dev, &macvlan_rx_handler);
>> + if (err) {
>> + rcu_assign_pointer(dev->macvlan_port, NULL);
>> + synchronize_rcu();
>> + kfree(port);
>> + return err;
>> + }
>
>I'd prefer goto-based unroll since that makes changes in the
>future easier.
Ok, I thought this won't be necessary in this case, but right, looks better.
>
>> +
>> return 0;
>> }
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index a1bff65..8e95b2d 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -254,6 +254,15 @@ struct netdev_hw_addr_list {
>> #define netdev_for_each_mc_addr(ha, dev) \
>> netdev_hw_addr_list_for_each(ha, &(dev)->mc)
>>
>> +
>> +struct netdev_rx_handler {
>> + struct list_head list;
>> + unsigned int order;
>> +#define NETDEV_RX_HANDLER_ORDER_BRIDGE 1
>> +#define NETDEV_RX_HANDLER_ORDER_MACVLAN 2
>
>Any reason for not using an enum?
No, I can use enum, but this "inlining" gives a person who is looking at the
code the connection on the first look.
>
>> + struct sk_buff *(*callback)(struct sk_buff *skb);
>> +};
>> +
>> struct hh_cache {
>> struct hh_cache *hh_next; /* Next entry */
>> atomic_t hh_refcnt; /* number of users */
>> @@ -1031,6 +1040,10 @@ struct net_device {
>> /* GARP */
>> struct garp_port *garp_port;
>>
>> + /* receive handlers (hooks) list */
>> + spinlock_t rx_handlers_lock;
>> + struct list_head rx_handlers;
>> +
>> /* class/net/name entry */
>> struct device dev;
>> /* space for optional device, statistics, and wireless sysfs groups */
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 6c82065..8d4a817 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2744,6 +2688,82 @@ void netif_nit_deliver(struct sk_buff *skb)
>> rcu_read_unlock();
>> }
>>
>> +static bool rx_handlers_equal(struct netdev_rx_handler *rh1,
>> + struct netdev_rx_handler *rh2)
>> +{
>> + return (rh1->order == rh2->order) && (rh1->callback == rh2->callback);
>> +}
>> +
>> +/**
>> + * netdev_rx_handler_register - register receive handler
>> + * @dev: device to register a handler for
>> + * @rh: receive handler to register
>> + *
>> + * Register a receive hander for a device. This handler will then be
>> + * called from __netif_receive_skb. A negative errno code is returned
>> + * on a failure.
>> + */
>> +int netdev_rx_handler_register(struct net_device *dev,
>> + struct netdev_rx_handler *rh)
>> +{
>> + struct list_head *list, *add_after;
>> + struct netdev_rx_handler *rh1;
>> + int err = 0;
>> +
>> + spin_lock_bh(&dev->rx_handlers_lock);
>
>Why are you using a spin lock and even disable BHs? This function
>should only be called from user context, so a mutex will work fine
>(and would fix the use of GFP_KERNEL in an atomic section).
Right, I will use rather rtnl_lock as suggested by Stephen.
>
>> + add_after = &dev->rx_handlers;
>> + list_for_each(list, &dev->rx_handlers) {
>
>Naming the element "list" is confusing. Also this should be
>using list_for_each_entry().
Well I'm not using list_for_each_entry because I use the list_head cursor as a
head in list_add_rcu (add_after assignment)
>
>> + rh1 = list_entry(list, struct netdev_rx_handler, list);
>> + if (rx_handlers_equal(rh, rh1)) {
>> + err = -EEXIST;
>> + goto unlock;
>> + }
>> + if (rh1->order > rh->order)
>> + break;
>> + add_after = list;
>> + }
>> + rh1 = kzalloc(sizeof(*rh), GFP_KERNEL);
>> + if (!rh1) {
>> + err = -ENOMEM;
>> + goto unlock;
>> + }
>> +
>> + rh1->order = rh->order;
>> + rh1->callback = rh->callback;
>> + list_add_rcu(&rh1->list, add_after);
>> +
>> +unlock:
>> + spin_unlock_bh(&dev->rx_handlers_lock);
>> +
>> + return err;
>> +}
>> +EXPORT_SYMBOL(netdev_rx_handler_register);
Thanks all for comments, I'll send V2 soon
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb
From: Jiri Pirko @ 2010-05-27 15:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, davem, shemminger, kaber
In-Reply-To: <1274969866.2523.22.camel@edumazet-laptop>
Thu, May 27, 2010 at 04:17:46PM CEST, eric.dumazet@gmail.com wrote:
>Le jeudi 27 mai 2010 à 15:49 +0200, Jiri Pirko a écrit :
>
>> +/**
>> + * netdev_rx_handler_unregister - unregister receive handler
>> + * @dev: device to unregister a handler from
>> + * @rh: receive handler to unregister
>> + *
>> + * Unregister a receive hander from a device.
>> + */
>> +void netdev_rx_handler_unregister(struct net_device *dev,
>> + struct netdev_rx_handler *rh)
>> +{
>> + struct netdev_rx_handler *rh1;
>> +
>> + spin_lock_bh(&dev->rx_handlers_lock);
>> + list_for_each_entry(rh1, &dev->rx_handlers, list) {
>> + if (rx_handlers_equal(rh, rh1)) {
>> + list_del_rcu(&rh1->list);
>> + synchronize_net();
>> + kfree(rh1);
>> + break;
>> + }
>> + }
>> + spin_unlock_bh(&dev->rx_handlers_lock);
>> +}
>> +EXPORT_SYMBOL(netdev_rx_handler_unregister);
>> +
>
>Please dont synchronize_net(); inside the spin_lock_bh() section, at a
>very minimum.
>
>void netdev_rx_handler_unregister(struct net_device *dev,
> struct netdev_rx_handler *rh)
>{
> struct netdev_rx_handler *rh1, *found = NULL;
>
> spin_lock_bh(&dev->rx_handlers_lock);
> list_for_each_entry(rh1, &dev->rx_handlers, list) {
> if (rx_handlers_equal(rh, rh1)) {
> list_del_rcu(&rh1->list);
> found = rh1;
> break;
> }
> }
> spin_unlock_bh(&dev->rx_handlers_lock);
> if (found) {
> synchronize_net();
> kfree(rh1);
> }
>}
I had it done in the same way originally. But I though that's not a problem to
do in inside the lock.
>
>
>This synchronize_net() proliferation makes me very nervous.
>
>Am I the only one that think this thing is/should be avoided as much as
>possible ?
>
>Please dont use synchronize_net() but a call_rcu(), there is absolutely
>no point making this thread waits 30 or 40 ms, there is no risk here.
Ok, will do. Thanks a lot.
>
>Thanks
>
>
^ permalink raw reply
* Re: [REGRESSION,BISECTED] MIPv6 support broken by f4f914b58019f0
From: Arnaud Ebalard @ 2010-05-27 15:14 UTC (permalink / raw)
To: Brian Haley
Cc: David Miller,
YOSHIFUJI Hideaki / 吉藤英明, Jiri Olsa,
Scott Otto, netdev
In-Reply-To: <4BFDC14F.6050407@hp.com>
Hi,
Thanks for your reply Brian and sorry for the length of this response. If
Hideaki and David can comment on the IPv6/XFRM and SO_BINDTODEVICE
aspects discussed below that would be helpful, IMHO.
Brian Haley <brian.haley@hp.com> writes:
> On 05/26/2010 01:01 PM, Arnaud Ebalard wrote:
>> Hi,
>>
>> I just updated my laptop's kernel to 2.6.34 (previously running .33 and
>> configured to act as an IPsec/IKE-protected MIPv6 Mobile Node using
>> racoon and umip): after rebooting on the new kernel, the transport mode
>> SA protecting MIPv6 signaling traffic are missing.
>>
>> I bisected the issue down to f4f914b58019f0e50d521bbbadfaee260d766f95
>> (net: ipv6 bind to device issue) which was added after 2.6.34-rc5:
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index c2438e8..05ebd78 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>> {
>> int flags = 0;
>>
>> - if (rt6_need_strict(&fl->fl6_dst))
>> + if (fl->oif || rt6_need_strict(&fl->fl6_dst))
>> flags |= RT6_LOOKUP_F_IFACE;
>
> Can you see if fl->oif is at least a sane value here? Maybe there's some
> partially un-initialized flowi getting passed-in, a quick source code check
> didn't find anything obvious.
When it's not 0, fl->oif is a sane value: it is set to the index of the
interface on which the current *Care-of Address* is configured. All the
traffic is expected to leave the host via this interface.
> The other thought is that it's the tunnel code calling it, as it's going
> to set 'oif' (actually it caches a whole flowi) from the tunnel parms ifindex/link
> value. It could have been setting it forever, but ip6_route_output() just
> never enforced it until now.
I added some printk in the code of ip6_route_output(), rt6_score_route()
and find_rr_leaf(). Below are respectivevly what I get for a 2.6.34 with
and without f4f914b58019f0e50d521bbbadfaee260d766f95. I removed the
beginning as it is the same and only started when it starts diverging.:
...
ip6_route_output() called from ip6_dst_lookup_tail() 1
ip6_route_output: fl->oif is wlan0
2001:XXXX:XXXX:0002:020d:93ff:fe55:f897 (HoA) => 2001:XXXX:XXXX:f002:021e:0bff:fe4e:04b5 (HA@) proto 135
rt6_score_route: oif is wlan0. rt->rt6i_dev->ifindex: lo. Leaving due to strict.
rt6_score_route: oif is wlan0. rt->rt6i_dev->ifindex: lo. Leaving due to strict.
rt6_score_route: oif is wlan0. rt->rt6i_dev->ifindex: ip6tnl1. Leaving due to strict.
rt6_score_route: oif is wlan0. rt->rt6i_dev->ifindex: ip6tnl1. Leaving due to strict.
...
On a working kernel:
...
ip6_route_output() called from ip6_dst_lookup_tail() 1
ip6_route_output: fl->oif is wlan0
2001:XXXX:XXXX:0002:020d:93ff:fe55:f897 (HoA) => 2001:XXXX:XXXX:f002:021e:0bff:fe4e:04b5 (HA@) proto 135
find_rr_leaf: match is 1. oif is wlan0
find_rr_leaf: match is 1. oif is wlan0
find_rr_leaf: match is 8. oif is wlan0
ip6_route_output() called from ip6_dst_lookup_tail() 1
ip6_route_output: fl->oif is 0
...
Above, a Binding Update message (a Mobility Header (proto 135) type 5)
has to be sent to the Home Agent. It is expected to leave the system via
the wlan0 interface, which is the interface on which the Care-of Address
of the packet is configured. The *wire* format of the packet is the
following:
IPv6(src=CoA, dst=HA@)/DestOpt(HoA)/ESP()/MH(type=5)
The addition of Destination Option header (containing a Home Address
Option) and ESP extension header is performed via XFRM. Initially, the
packet created by userland looks like this:
IPv6(src=HoA, dst=HA@)/MH(type=5)
In previous debug outputs, the content of the fl->oif is ok, i.e. it is
set to the interface on which the CoA is configured, i.e. the output
interface. But the commit results in flags |= RT6_LOOKUP_F_IFACE.
Later, in rt6_score_route(), the call to rt6_check_dev() returns 0
(dev->ifindex is ip6tnl1 but oif is wlan0). Because of the change to flags
flags, we quickly return -1 in rt6_score_route():
static int rt6_score_route(struct rt6_info *rt, int oif,
int strict)
{
int m, n;
m = rt6_check_dev(rt, oif);
if (!m && (strict & RT6_LOOKUP_F_IFACE))
return -1;
...
Now, I wonder if the following is correct. Don't hesitate to correct me
if I am wrong:
Initially (before f4f914b58019f0), the purpose of the test using
rt6_need_strict() in ip6_route_output() (introduced by c71099ac) was to
allow the multiple routing table logic to be applied to all global
addresses but to preserve the addresses for which it would not make
sense (link-local, multicast, ). The change introduced by f4f914b58019f0
basically reduces the ability to route traffic as you want and forces
the traffic to leave the device by the interface on which it is
configured (if fl->oif is set).
>From my (very limited and possibly wrong) understanding, the change
introduced by f4f914b58019f0 looks like a workaround for the
SO_BINDTODEVICE issue. Looking at the code, there is something I don't
understand: if SO_BINDTODEVICE has been used on a socket, the socket
should have its sk_bound_dev_if attribute set to the correct ifindex
value. Hence the following (naive) question: why is that information not
used to inflect the selection of the route cached for the socket? And
why would the fix be at the adress level instead of being at the
interface level (ifindex)?
Cheers,
a+
^ permalink raw reply
* Re: [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-27 16:14 UTC (permalink / raw)
To: Hagen Paul Pfeifer; +Cc: David Miller, andi, shemminger, netdev, ycheng
In-Reply-To: <20100527070827.GB2728@nuttenaction>
> And this is the most important aspect in this email: core network components
> rely on end hosts to behave in a fair manner. Disable Slow Start/Congestion
> Avoidance and the network will instantly collapse (mmh, net-next? ;-)
>
> The mechanism as proposed in the patch is not fair. There are a lot of
> publications available that analyse the impact CWND in great detail as well as
> several RFC that talk about the CWND.
The mechanism proposed in the patch is merely an API change; misuse,
abuse, or unfairness are inferences of how it might be used. Proper
safeguards should be applied to prevent misuse, but I don't see that
it should be any more insidious than 350 other mechanisms in the
system that could be used to screw things up.
Yes, there has been a lot of talk about CWND, but the standard has not
changed since 2002. In the meantime, browsers have increased the
number of parallel connections they open to a destination, and servers
hide behind multiple domains-- the end result of this is that browsers
use aggregate initial congestion windows much larger than the
standard, which sidesteps slowstart and is a source of unfairness.
This is contrary to RFC 3390:
"When web browsers open simultaneous TCP connections to the same
destination, they are working against TCP's congestion control
mechanisms"
I have yet to find any paper on CWND that analyzed the effect of this
phenomena on the Internet which is quite unfortunate. In our own full
scale experiments
(http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf), we
anlayzed the effects of using larger initial congestion windows on the
Internet which might be the closest thing to such an analysis. I know
in LEDBAT WG of IETF they are trying to come up with new
recommendations for number of connections a browser can open, this is
good but I hope it's not after the fact.
It would be better, by almost any perspective, to rein in the number
of connections servers are allowing clients to open. However this
isn't going to happen if this means increase latency for end users,
there's is no competitive rationale for servers to do that. That's
where a primary motivation of this patch becomes evident. Instead of
a server allowing 6 connections from a client, for instance, it could
allow just one connection but with a initial congestion window equal
to the aggregate of the 6 connections. This reduces connections and
does not change the size of initial data burst going into the
Internet. The Internet is happy because there a fewer connections
(better for fairness) and fewer packets (fewer 3WHS); the server and
client are happy because there are fewer connections to deal without
and no increased latency.
Tom
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Tejun Heo @ 2010-05-27 16:15 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100527131254.GB7974@redhat.com>
Hello,
On 05/27/2010 03:12 PM, Michael S. Tsirkin wrote:
>> I don't understand the reasons for this patch, but this doesn't matter.
>
> Depending on userspace application, driver can create a lot of work
> for a workqueue to handle. By making the workqueue thread
> belong in a cgroup, we make it possible to the CPU and other
> resources thus consumed.
Hmmm.... I don't really get it. The unit of scheduling in workqueue
is a work. Unless you're gonna convert every driver to use this
special kind of workqueue (and what happens when multiple tasks from
different cgroups share the driver?), I can't see how this is gonna be
useful. If you really wanna impose cgroup control on workqueue items,
you'll have to do it per work item which might lead to the problem of
priority inversion. Can you please describe what you're trying to do
in more detail?
Thank you.
--
tejun
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Sridhar Samudrala @ 2010-05-27 16:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Michael S. Tsirkin, netdev, lkml, kvm@vger.kernel.org,
Andrew Morton, Dmitri Vorobiev, Tejun Heo, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100527124448.GA4241@redhat.com>
On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote:
> On 05/27, Michael S. Tsirkin wrote:
> >
> > On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> > > Add a new kernel API to create a singlethread workqueue and attach it's
> > > task to current task's cgroup and cpumask.
> > >
> > > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> >
> > Could someone familiar with workqueue code please comment on whether
> > this patch is suitable for 2.6.35?
> >
> > It is needed to fix the case where vhost user might cause a kernel
> > thread to consume more CPU than allowed by the cgroup.
> > Should I merge it through the vhost tree?
> > Ack for this?
>
> I don't understand the reasons for this patch, but this doesn't matter.
>
> I don't really see any need to change workqueue.c,
>
> > > +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
> > > +{
> > > + return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
> > > +}
>
> (Not sure this trivial static helper with the single caller makes sense, but
> see below)
>
> > > +/* Create a singlethread workqueue and attach it's task to the current task's
> > > + * cgroup and set it's cpumask to the current task's cpumask.
> > > + */
> > > +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> > > +{
> > > + struct workqueue_struct *wq;
> > > + struct task_struct *task;
> > > + cpumask_var_t mask;
> > > +
> > > + wq = create_singlethread_workqueue(name);
> > > + if (!wq)
> > > + goto out;
> > > +
> > > + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> > > + goto err;
> > > +
> > > + if (sched_getaffinity(current->pid, mask))
> > > + goto err;
> > > +
> > > + task = get_singlethread_wq_task(wq);
> > > + if (sched_setaffinity(task->pid, mask))
> > > + goto err;
> > > +
> > > + if (cgroup_attach_task_current_cg(task))
> > > + goto err;
> > > +out:
> > > + return wq;
> > > +err:
> > > + destroy_workqueue(wq);
> > > + wq = NULL;
> > > + goto out;
> > > +}
>
> Instead, cgroup.c (or whoever needs this) can do
>
> struct move_struct {
> struct work_struct work;
> int ret;
> };
>
> static void move_func(struct work_struct *work)
> {
> struct move_struct *move = container_of(...);
>
> if (cgroup_attach_task_current_cg(current))
We are trying to attach the task associated with workqueue to the
current task's cgroup. So what we need is
cgroup_attach_task_current_cg(wq->task);
However there is no interface currently that exports the task associated
with a workqueue. It is hidden in cpu_workqueue_struct and is only
accessible within workqueue.c.
> ret = -EANY;
> }
>
> static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> {
> struct workqueue_struct *wq;
> struct move_struct move = {
> .work = __WORK_INITIALIZER(move_func);
> };
>
> wq = create_singlethread_workqueue(name);
> if (!wq)
> return NULL;
>
> queue_work(&move.work);
> flush_work(&move.work);
>
> if (move.ret) {
> destroy_workqueue(wq);
> wq = NULL;
> }
>
> return wq;
> }
>
> Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and
> use it like the patch does.
This requires that struct cpu_workqueue_struct and struct
workqueue_struct are made externally visible by moving them to
kernel/workqueue.h.
Instead what about adding the simple helper get_singlethread_wq_task()
in workqueue.c and exporting it.
I can add create_singlethread_workqueue_in_current_cg() to cgroup.c
using this helper routine.
Thanks
Sridhar
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Michael S. Tsirkin @ 2010-05-27 16:39 UTC (permalink / raw)
To: Tejun Heo
Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <4BFE9ABA.6030907@kernel.org>
On Thu, May 27, 2010 at 06:15:54PM +0200, Tejun Heo wrote:
> Hello,
>
> On 05/27/2010 03:12 PM, Michael S. Tsirkin wrote:
> >> I don't understand the reasons for this patch, but this doesn't matter.
> >
> > Depending on userspace application, driver can create a lot of work
> > for a workqueue to handle. By making the workqueue thread
> > belong in a cgroup, we make it possible to the CPU and other
> > resources thus consumed.
>
> Hmmm.... I don't really get it. The unit of scheduling in workqueue
> is a work.
Yes. However, we use cgroups to limit when the workqueue itself is scheduled.
This affects all of work done on this workqueue, so it's a bit
of a blunt intrument. Thus we are not trying to apply this
to all drivers, we intend to start with vhost-net.
> Unless you're gonna convert every driver to use this
> special kind of workqueue (and what happens when multiple tasks from
> different cgroups share the driver?),
We'll then create a workqueue per task. Each workqueue will have the
right cgroup. But we are not trying to selve the problem for
every driver.
> I can't see how this is gonna be
> useful. If you really wanna impose cgroup control on workqueue items,
> you'll have to do it per work item which might lead to the problem of
> priority inversion.
Exactly. cgroup is per-workqueue not per work item.
If driver wants to let administrators control priority
for different kinds of items separately, driver will have
to submit them to separate workqueues.
> Can you please describe what you're trying to do
> in more detail?
>
> Thank you.
vhost-net driver is under control from userspace,
it queues potentially a lot of work into the workqueue, which
might load the system beyond the cgroup limits.
And staying within cgroups limits is important for virtualization
where vhost is used.
> --
> tejun
^ permalink raw reply
* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Michael S. Tsirkin @ 2010-05-27 16:41 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: Oleg Nesterov, netdev, lkml, kvm@vger.kernel.org, Andrew Morton,
Dmitri Vorobiev, Tejun Heo, Jiri Kosina, Thomas Gleixner,
Ingo Molnar, Andi Kleen
In-Reply-To: <1274977458.10161.11.camel@w-sridhar.beaverton.ibm.com>
On Thu, May 27, 2010 at 09:24:18AM -0700, Sridhar Samudrala wrote:
> On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote:
> > On 05/27, Michael S. Tsirkin wrote:
> > >
> > > On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> > > > Add a new kernel API to create a singlethread workqueue and attach it's
> > > > task to current task's cgroup and cpumask.
> > > >
> > > > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> > >
> > > Could someone familiar with workqueue code please comment on whether
> > > this patch is suitable for 2.6.35?
> > >
> > > It is needed to fix the case where vhost user might cause a kernel
> > > thread to consume more CPU than allowed by the cgroup.
> > > Should I merge it through the vhost tree?
> > > Ack for this?
> >
> > I don't understand the reasons for this patch, but this doesn't matter.
> >
> > I don't really see any need to change workqueue.c,
> >
> > > > +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
> > > > +{
> > > > + return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
> > > > +}
> >
> > (Not sure this trivial static helper with the single caller makes sense, but
> > see below)
> >
> > > > +/* Create a singlethread workqueue and attach it's task to the current task's
> > > > + * cgroup and set it's cpumask to the current task's cpumask.
> > > > + */
> > > > +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> > > > +{
> > > > + struct workqueue_struct *wq;
> > > > + struct task_struct *task;
> > > > + cpumask_var_t mask;
> > > > +
> > > > + wq = create_singlethread_workqueue(name);
> > > > + if (!wq)
> > > > + goto out;
> > > > +
> > > > + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> > > > + goto err;
> > > > +
> > > > + if (sched_getaffinity(current->pid, mask))
> > > > + goto err;
> > > > +
> > > > + task = get_singlethread_wq_task(wq);
> > > > + if (sched_setaffinity(task->pid, mask))
> > > > + goto err;
> > > > +
> > > > + if (cgroup_attach_task_current_cg(task))
> > > > + goto err;
> > > > +out:
> > > > + return wq;
> > > > +err:
> > > > + destroy_workqueue(wq);
> > > > + wq = NULL;
> > > > + goto out;
> > > > +}
> >
> > Instead, cgroup.c (or whoever needs this) can do
> >
> > struct move_struct {
> > struct work_struct work;
> > int ret;
> > };
> >
> > static void move_func(struct work_struct *work)
> > {
> > struct move_struct *move = container_of(...);
> >
> > if (cgroup_attach_task_current_cg(current))
>
> We are trying to attach the task associated with workqueue to the
> current task's cgroup. So what we need is
> cgroup_attach_task_current_cg(wq->task);
>
> However there is no interface currently that exports the task associated
> with a workqueue. It is hidden in cpu_workqueue_struct and is only
> accessible within workqueue.c.
>
>
> > ret = -EANY;
> > }
> >
> > static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> > {
> > struct workqueue_struct *wq;
> > struct move_struct move = {
> > .work = __WORK_INITIALIZER(move_func);
> > };
> >
> > wq = create_singlethread_workqueue(name);
> > if (!wq)
> > return NULL;
> >
> > queue_work(&move.work);
> > flush_work(&move.work);
> >
> > if (move.ret) {
> > destroy_workqueue(wq);
> > wq = NULL;
> > }
> >
> > return wq;
> > }
> >
> > Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and
> > use it like the patch does.
> This requires that struct cpu_workqueue_struct and struct
> workqueue_struct are made externally visible by moving them to
> kernel/workqueue.h.
>
> Instead what about adding the simple helper get_singlethread_wq_task()
> in workqueue.c and exporting it.
> I can add create_singlethread_workqueue_in_current_cg() to cgroup.c
> using this helper routine.
Or to our driver, if that's more palatable.
> Thanks
> Sridhar
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox