* [PATCH] AF_RXRPC: Handle receiving ACKALL packets
From: David Howells @ 2011-02-28 13:27 UTC
To: netdev; +Cc: linux-kernel, David Howells
The OpenAFS server is now sending ACKALL packets, so we need to handle them.
Otherwise we report a protocol error and abort.
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/ar-input.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/net/rxrpc/ar-input.c b/net/rxrpc/ar-input.c
index a4fc974..996d3ef 100644
--- a/net/rxrpc/ar-input.c
+++ b/net/rxrpc/ar-input.c
@@ -423,6 +423,7 @@ void rxrpc_fast_process_packet(struct rxrpc_call *call, struct sk_buff *skb)
goto protocol_error;
}
+ case RXRPC_PACKET_TYPE_ACKALL:
case RXRPC_PACKET_TYPE_ACK:
/* ACK processing is done in process context */
read_lock_bh(&call->state_lock);
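For readers following along: the one-line change relies on C switch fall-through, so the new case label shares the body of the existing ACK case. A minimal, self-contained sketch of that pattern follows. The toy_* names and enum values are hypothetical and are not taken from the rxrpc sources.

```c
/* Hypothetical packet types; names and values are illustrative only. */
enum toy_pkt_type {
	TOY_PKT_DATA = 1,
	TOY_PKT_ACK,
	TOY_PKT_ACKALL,
	TOY_PKT_UNKNOWN,
};

/* Hypothetical outcomes of classifying a received packet. */
enum toy_action {
	TOY_DELIVER,
	TOY_DEFER_ACK,
	TOY_PROTOCOL_ERROR,
};

static enum toy_action toy_classify(enum toy_pkt_type type)
{
	switch (type) {
	case TOY_PKT_DATA:
		return TOY_DELIVER;
	case TOY_PKT_ACKALL:	/* no body: falls through to the ACK case */
	case TOY_PKT_ACK:
		return TOY_DEFER_ACK;
	default:		/* anything unrecognised is a protocol error */
		return TOY_PROTOCOL_ERROR;
	}
}
```

Without the extra label, the ACKALL value would land in the default branch, which corresponds to the protocol error described in the changelog.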
* Re: txqueuelen has wrong units; should be time
From: Eric Dumazet @ 2011-02-28 13:10 UTC
To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <20110228134338.1241484mkljbz4w0@hayate.sektori.org>
On Monday, 28 February 2011 at 13:43 +0200, Jussi Kivilinna wrote:
> Quoting Eric Dumazet <eric.dumazet@gmail.com>:
>
> > On Sunday, 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
> >> Quoting Albert Cahalan <acahalan@gmail.com>:
> >>
> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> >> On Sunday, 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> >> >>>
> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
> >> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
> >> > ...
> >> >> Problem is some machines have slow High Resolution timing services.
> >> >>
> >> >> _If_ we have a time limit, it will probably use the low resolution (aka
> >> >> jiffies), unless high resolution services are cheap.
> >> >
> >> > As long as that is totally internal to the kernel and never
> >> > getting exposed by some API for setting the amount, sure.
> >> >
> >> >> I was thinking not having an absolute hard limit, but an EWMA based one.
> >> >
> >> > The whole point is to prevent stale packets, especially to prevent
> >> > them from messing with TCP, so I really don't think so. I suppose
> >> > you do get this to some extent via early drop.
> >>
> >> I made simple hack on sch_fifo with per packet time limits
> >> (attachment) this weekend and have been doing limited testing on
> >> wireless link. I think hardlimit is fine, it's simple and does
> >> somewhat same as what packet(-hard)limited buffer does, drops packets
> >> when buffer is 'full'. My hack checks for timed out packets on
> >> enqueue, might be wrong approach (on other hand might allow some more
> >> burstiness).
> >>
> >
> >
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> >
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
> >
> > This is why I suggested using an EWMA plus a probabilist drop or
> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
> >
> > The absolute time limit you are trying to implement should be checked at
> > dequeue time, to cope with enqueue bursts or pauses on wire.
> >
>
> Would it be better to implement this as generic feature instead of
> qdisc specific? Have qdisc_enqueue_root do ewma check:
The problem is that a qdisc can have several virtual queues.
For example, pfifo_fast has 3 bands. You could have a global EWMA with
a high value, but you would still want to let a high-priority packet
go through...
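To make the point about bands concrete, here is a minimal sketch, not kernel code. It keeps one EWMA of queueing delay per band, so a congested low-priority band does not penalize a high-priority packet. All toy_* names, the shift-based smoothing, and the thresholds are assumptions made for illustration.

```c
#include <stdint.h>

#define TOY_NBANDS 3	/* mirrors the three pfifo_fast bands mentioned above */

enum toy_verdict { TOY_OK, TOY_CONGESTION, TOY_DROP };

struct toy_band {
	uint32_t avg_delay_us;	/* EWMA of queueing delay for this band only */
};

/* avg += (sample - avg) / 2^shift, in unsigned integer arithmetic */
static void toy_band_update(struct toy_band *b, uint32_t sample_us,
			    unsigned int shift)
{
	if (sample_us >= b->avg_delay_us)
		b->avg_delay_us += (sample_us - b->avg_delay_us) >> shift;
	else
		b->avg_delay_us -= (b->avg_delay_us - sample_us) >> shift;
}

/* Decide at enqueue time, looking only at the band the packet maps to. */
static enum toy_verdict toy_band_verdict(const struct toy_band *b,
					 uint32_t cn_us, uint32_t drop_us)
{
	if (b->avg_delay_us >= drop_us)
		return TOY_DROP;
	if (b->avg_delay_us >= cn_us)
		return TOY_CONGESTION;
	return TOY_OK;
}
```

With a single global average, the delay built up in the lowest band would also trigger the verdict for band 0. Keeping the state per band avoids that, at the cost of more bookkeeping in each qdisc.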
* [Patch] bonding: move procfs into bond_procfs.c
From: Amerigo Wang @ 2011-02-28 12:30 UTC
To: linux-kernel; +Cc: WANG Cong, Jay Vosburgh, netdev
bond_main.c is bloated; separate out the procfs code
and move it to bond_procfs.c.
Signed-off-by: WANG Cong <amwang@redhat.com>
---
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 0e2737e..1d8de09 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,7 @@
obj-$(CONFIG_BONDING) += bonding.o
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o bond_procfs.o
ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
bonding-objs += $(ipv6-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 584f97b..7abed73 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -65,8 +65,6 @@
#include <linux/skbuff.h>
#include <net/sock.h>
#include <linux/rtnetlink.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
#include <linux/smp.h>
#include <linux/if_ether.h>
#include <net/arp.h>
@@ -173,9 +171,6 @@ MODULE_PARM_DESC(resend_igmp, "Number of IGMP membership reports to send on link
atomic_t netpoll_block_tx = ATOMIC_INIT(0);
#endif
-static const char * const version =
- DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n";
-
int bond_net_id __read_mostly;
static __be32 arp_target[BOND_MAX_ARP_TARGETS];
@@ -245,7 +240,7 @@ static void bond_uninit(struct net_device *bond_dev);
/*---------------------------- General routines -----------------------------*/
-static const char *bond_mode_name(int mode)
+const char *bond_mode_name(int mode)
{
static const char *names[] = {
[BOND_MODE_ROUNDROBIN] = "load balancing (round-robin)",
@@ -3292,299 +3287,6 @@ out:
read_unlock(&bond->lock);
}
-/*------------------------------ proc/seq_file-------------------------------*/
-
-#ifdef CONFIG_PROC_FS
-
-static void *bond_info_seq_start(struct seq_file *seq, loff_t *pos)
- __acquires(RCU)
- __acquires(&bond->lock)
-{
- struct bonding *bond = seq->private;
- loff_t off = 0;
- struct slave *slave;
- int i;
-
- /* make sure the bond won't be taken away */
- rcu_read_lock();
- read_lock(&bond->lock);
-
- if (*pos == 0)
- return SEQ_START_TOKEN;
-
- bond_for_each_slave(bond, slave, i) {
- if (++off == *pos)
- return slave;
- }
-
- return NULL;
-}
-
-static void *bond_info_seq_next(struct seq_file *seq, void *v, loff_t *pos)
-{
- struct bonding *bond = seq->private;
- struct slave *slave = v;
-
- ++*pos;
- if (v == SEQ_START_TOKEN)
- return bond->first_slave;
-
- slave = slave->next;
-
- return (slave == bond->first_slave) ? NULL : slave;
-}
-
-static void bond_info_seq_stop(struct seq_file *seq, void *v)
- __releases(&bond->lock)
- __releases(RCU)
-{
- struct bonding *bond = seq->private;
-
- read_unlock(&bond->lock);
- rcu_read_unlock();
-}
-
-static void bond_info_show_master(struct seq_file *seq)
-{
- struct bonding *bond = seq->private;
- struct slave *curr;
- int i;
-
- read_lock(&bond->curr_slave_lock);
- curr = bond->curr_active_slave;
- read_unlock(&bond->curr_slave_lock);
-
- seq_printf(seq, "Bonding Mode: %s",
- bond_mode_name(bond->params.mode));
-
- if (bond->params.mode == BOND_MODE_ACTIVEBACKUP &&
- bond->params.fail_over_mac)
- seq_printf(seq, " (fail_over_mac %s)",
- fail_over_mac_tbl[bond->params.fail_over_mac].modename);
-
- seq_printf(seq, "\n");
-
- if (bond->params.mode == BOND_MODE_XOR ||
- bond->params.mode == BOND_MODE_8023AD) {
- seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
- xmit_hashtype_tbl[bond->params.xmit_policy].modename,
- bond->params.xmit_policy);
- }
-
- if (USES_PRIMARY(bond->params.mode)) {
- seq_printf(seq, "Primary Slave: %s",
- (bond->primary_slave) ?
- bond->primary_slave->dev->name : "None");
- if (bond->primary_slave)
- seq_printf(seq, " (primary_reselect %s)",
- pri_reselect_tbl[bond->params.primary_reselect].modename);
-
- seq_printf(seq, "\nCurrently Active Slave: %s\n",
- (curr) ? curr->dev->name : "None");
- }
-
- seq_printf(seq, "MII Status: %s\n", netif_carrier_ok(bond->dev) ?
- "up" : "down");
- seq_printf(seq, "MII Polling Interval (ms): %d\n", bond->params.miimon);
- seq_printf(seq, "Up Delay (ms): %d\n",
- bond->params.updelay * bond->params.miimon);
- seq_printf(seq, "Down Delay (ms): %d\n",
- bond->params.downdelay * bond->params.miimon);
-
-
- /* ARP information */
- if (bond->params.arp_interval > 0) {
- int printed = 0;
- seq_printf(seq, "ARP Polling Interval (ms): %d\n",
- bond->params.arp_interval);
-
- seq_printf(seq, "ARP IP target/s (n.n.n.n form):");
-
- for (i = 0; (i < BOND_MAX_ARP_TARGETS); i++) {
- if (!bond->params.arp_targets[i])
- break;
- if (printed)
- seq_printf(seq, ",");
- seq_printf(seq, " %pI4", &bond->params.arp_targets[i]);
- printed = 1;
- }
- seq_printf(seq, "\n");
- }
-
- if (bond->params.mode == BOND_MODE_8023AD) {
- struct ad_info ad_info;
-
- seq_puts(seq, "\n802.3ad info\n");
- seq_printf(seq, "LACP rate: %s\n",
- (bond->params.lacp_fast) ? "fast" : "slow");
- seq_printf(seq, "Aggregator selection policy (ad_select): %s\n",
- ad_select_tbl[bond->params.ad_select].modename);
-
- if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
- seq_printf(seq, "bond %s has no active aggregator\n",
- bond->dev->name);
- } else {
- seq_printf(seq, "Active Aggregator Info:\n");
-
- seq_printf(seq, "\tAggregator ID: %d\n",
- ad_info.aggregator_id);
- seq_printf(seq, "\tNumber of ports: %d\n",
- ad_info.ports);
- seq_printf(seq, "\tActor Key: %d\n",
- ad_info.actor_key);
- seq_printf(seq, "\tPartner Key: %d\n",
- ad_info.partner_key);
- seq_printf(seq, "\tPartner Mac Address: %pM\n",
- ad_info.partner_system);
- }
- }
-}
-
-static void bond_info_show_slave(struct seq_file *seq,
- const struct slave *slave)
-{
- struct bonding *bond = seq->private;
-
- seq_printf(seq, "\nSlave Interface: %s\n", slave->dev->name);
- seq_printf(seq, "MII Status: %s\n",
- (slave->link == BOND_LINK_UP) ? "up" : "down");
- seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
- seq_printf(seq, "Duplex: %s\n", slave->duplex ? "full" : "half");
- seq_printf(seq, "Link Failure Count: %u\n",
- slave->link_failure_count);
-
- seq_printf(seq, "Permanent HW addr: %pM\n", slave->perm_hwaddr);
-
- if (bond->params.mode == BOND_MODE_8023AD) {
- const struct aggregator *agg
- = SLAVE_AD_INFO(slave).port.aggregator;
-
- if (agg)
- seq_printf(seq, "Aggregator ID: %d\n",
- agg->aggregator_identifier);
- else
- seq_puts(seq, "Aggregator ID: N/A\n");
- }
- seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
-}
-
-static int bond_info_seq_show(struct seq_file *seq, void *v)
-{
- if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "%s\n", version);
- bond_info_show_master(seq);
- } else
- bond_info_show_slave(seq, v);
-
- return 0;
-}
-
-static const struct seq_operations bond_info_seq_ops = {
- .start = bond_info_seq_start,
- .next = bond_info_seq_next,
- .stop = bond_info_seq_stop,
- .show = bond_info_seq_show,
-};
-
-static int bond_info_open(struct inode *inode, struct file *file)
-{
- struct seq_file *seq;
- struct proc_dir_entry *proc;
- int res;
-
- res = seq_open(file, &bond_info_seq_ops);
- if (!res) {
- /* recover the pointer buried in proc_dir_entry data */
- seq = file->private_data;
- proc = PDE(inode);
- seq->private = proc->data;
- }
-
- return res;
-}
-
-static const struct file_operations bond_info_fops = {
- .owner = THIS_MODULE,
- .open = bond_info_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = seq_release,
-};
-
-static void bond_create_proc_entry(struct bonding *bond)
-{
- struct net_device *bond_dev = bond->dev;
- struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
-
- if (bn->proc_dir) {
- bond->proc_entry = proc_create_data(bond_dev->name,
- S_IRUGO, bn->proc_dir,
- &bond_info_fops, bond);
- if (bond->proc_entry == NULL)
- pr_warning("Warning: Cannot create /proc/net/%s/%s\n",
- DRV_NAME, bond_dev->name);
- else
- memcpy(bond->proc_file_name, bond_dev->name, IFNAMSIZ);
- }
-}
-
-static void bond_remove_proc_entry(struct bonding *bond)
-{
- struct net_device *bond_dev = bond->dev;
- struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
-
- if (bn->proc_dir && bond->proc_entry) {
- remove_proc_entry(bond->proc_file_name, bn->proc_dir);
- memset(bond->proc_file_name, 0, IFNAMSIZ);
- bond->proc_entry = NULL;
- }
-}
-
-/* Create the bonding directory under /proc/net, if doesn't exist yet.
- * Caller must hold rtnl_lock.
- */
-static void __net_init bond_create_proc_dir(struct bond_net *bn)
-{
- if (!bn->proc_dir) {
- bn->proc_dir = proc_mkdir(DRV_NAME, bn->net->proc_net);
- if (!bn->proc_dir)
- pr_warning("Warning: cannot create /proc/net/%s\n",
- DRV_NAME);
- }
-}
-
-/* Destroy the bonding directory under /proc/net, if empty.
- * Caller must hold rtnl_lock.
- */
-static void __net_exit bond_destroy_proc_dir(struct bond_net *bn)
-{
- if (bn->proc_dir) {
- remove_proc_entry(DRV_NAME, bn->net->proc_net);
- bn->proc_dir = NULL;
- }
-}
-
-#else /* !CONFIG_PROC_FS */
-
-static void bond_create_proc_entry(struct bonding *bond)
-{
-}
-
-static void bond_remove_proc_entry(struct bonding *bond)
-{
-}
-
-static inline void bond_create_proc_dir(struct bond_net *bn)
-{
-}
-
-static inline void bond_destroy_proc_dir(struct bond_net *bn)
-{
-}
-
-#endif /* CONFIG_PROC_FS */
-
-
/*-------------------------- netdev event handling --------------------------*/
/*
@@ -5388,7 +5090,7 @@ static int __init bonding_init(void)
int i;
int res;
- pr_info("%s", version);
+ pr_info("%s", bond_version);
res = bond_check_params(&bonding_defaults);
if (res)
diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
new file mode 100644
index 0000000..4db0529
--- /dev/null
+++ b/drivers/net/bonding/bond_procfs.c
@@ -0,0 +1,297 @@
+#include <linux/proc_fs.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include "bonding.h"
+
+#ifdef CONFIG_PROC_FS
+
+extern const char *bond_mode_name(int mode);
+
+static void *bond_info_seq_start(struct seq_file *seq, loff_t *pos)
+ __acquires(RCU)
+ __acquires(&bond->lock)
+{
+ struct bonding *bond = seq->private;
+ loff_t off = 0;
+ struct slave *slave;
+ int i;
+
+ /* make sure the bond won't be taken away */
+ rcu_read_lock();
+ read_lock(&bond->lock);
+
+ if (*pos == 0)
+ return SEQ_START_TOKEN;
+
+ bond_for_each_slave(bond, slave, i) {
+ if (++off == *pos)
+ return slave;
+ }
+
+ return NULL;
+}
+
+static void *bond_info_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct bonding *bond = seq->private;
+ struct slave *slave = v;
+
+ ++*pos;
+ if (v == SEQ_START_TOKEN)
+ return bond->first_slave;
+
+ slave = slave->next;
+
+ return (slave == bond->first_slave) ? NULL : slave;
+}
+
+static void bond_info_seq_stop(struct seq_file *seq, void *v)
+ __releases(&bond->lock)
+ __releases(RCU)
+{
+ struct bonding *bond = seq->private;
+
+ read_unlock(&bond->lock);
+ rcu_read_unlock();
+}
+
+static void bond_info_show_master(struct seq_file *seq)
+{
+ struct bonding *bond = seq->private;
+ struct slave *curr;
+ int i;
+
+ read_lock(&bond->curr_slave_lock);
+ curr = bond->curr_active_slave;
+ read_unlock(&bond->curr_slave_lock);
+
+ seq_printf(seq, "Bonding Mode: %s",
+ bond_mode_name(bond->params.mode));
+
+ if (bond->params.mode == BOND_MODE_ACTIVEBACKUP &&
+ bond->params.fail_over_mac)
+ seq_printf(seq, " (fail_over_mac %s)",
+ fail_over_mac_tbl[bond->params.fail_over_mac].modename);
+
+ seq_printf(seq, "\n");
+
+ if (bond->params.mode == BOND_MODE_XOR ||
+ bond->params.mode == BOND_MODE_8023AD) {
+ seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
+ xmit_hashtype_tbl[bond->params.xmit_policy].modename,
+ bond->params.xmit_policy);
+ }
+
+ if (USES_PRIMARY(bond->params.mode)) {
+ seq_printf(seq, "Primary Slave: %s",
+ (bond->primary_slave) ?
+ bond->primary_slave->dev->name : "None");
+ if (bond->primary_slave)
+ seq_printf(seq, " (primary_reselect %s)",
+ pri_reselect_tbl[bond->params.primary_reselect].modename);
+
+ seq_printf(seq, "\nCurrently Active Slave: %s\n",
+ (curr) ? curr->dev->name : "None");
+ }
+
+ seq_printf(seq, "MII Status: %s\n", netif_carrier_ok(bond->dev) ?
+ "up" : "down");
+ seq_printf(seq, "MII Polling Interval (ms): %d\n", bond->params.miimon);
+ seq_printf(seq, "Up Delay (ms): %d\n",
+ bond->params.updelay * bond->params.miimon);
+ seq_printf(seq, "Down Delay (ms): %d\n",
+ bond->params.downdelay * bond->params.miimon);
+
+
+ /* ARP information */
+ if (bond->params.arp_interval > 0) {
+ int printed = 0;
+ seq_printf(seq, "ARP Polling Interval (ms): %d\n",
+ bond->params.arp_interval);
+
+ seq_printf(seq, "ARP IP target/s (n.n.n.n form):");
+
+ for (i = 0; (i < BOND_MAX_ARP_TARGETS); i++) {
+ if (!bond->params.arp_targets[i])
+ break;
+ if (printed)
+ seq_printf(seq, ",");
+ seq_printf(seq, " %pI4", &bond->params.arp_targets[i]);
+ printed = 1;
+ }
+ seq_printf(seq, "\n");
+ }
+
+ if (bond->params.mode == BOND_MODE_8023AD) {
+ struct ad_info ad_info;
+
+ seq_puts(seq, "\n802.3ad info\n");
+ seq_printf(seq, "LACP rate: %s\n",
+ (bond->params.lacp_fast) ? "fast" : "slow");
+ seq_printf(seq, "Aggregator selection policy (ad_select): %s\n",
+ ad_select_tbl[bond->params.ad_select].modename);
+
+ if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
+ seq_printf(seq, "bond %s has no active aggregator\n",
+ bond->dev->name);
+ } else {
+ seq_printf(seq, "Active Aggregator Info:\n");
+
+ seq_printf(seq, "\tAggregator ID: %d\n",
+ ad_info.aggregator_id);
+ seq_printf(seq, "\tNumber of ports: %d\n",
+ ad_info.ports);
+ seq_printf(seq, "\tActor Key: %d\n",
+ ad_info.actor_key);
+ seq_printf(seq, "\tPartner Key: %d\n",
+ ad_info.partner_key);
+ seq_printf(seq, "\tPartner Mac Address: %pM\n",
+ ad_info.partner_system);
+ }
+ }
+}
+
+static void bond_info_show_slave(struct seq_file *seq,
+ const struct slave *slave)
+{
+ struct bonding *bond = seq->private;
+
+ seq_printf(seq, "\nSlave Interface: %s\n", slave->dev->name);
+ seq_printf(seq, "MII Status: %s\n",
+ (slave->link == BOND_LINK_UP) ? "up" : "down");
+ seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
+ seq_printf(seq, "Duplex: %s\n", slave->duplex ? "full" : "half");
+ seq_printf(seq, "Link Failure Count: %u\n",
+ slave->link_failure_count);
+
+ seq_printf(seq, "Permanent HW addr: %pM\n", slave->perm_hwaddr);
+
+ if (bond->params.mode == BOND_MODE_8023AD) {
+ const struct aggregator *agg
+ = SLAVE_AD_INFO(slave).port.aggregator;
+
+ if (agg)
+ seq_printf(seq, "Aggregator ID: %d\n",
+ agg->aggregator_identifier);
+ else
+ seq_puts(seq, "Aggregator ID: N/A\n");
+ }
+ seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
+}
+
+static int bond_info_seq_show(struct seq_file *seq, void *v)
+{
+ if (v == SEQ_START_TOKEN) {
+ seq_printf(seq, "%s\n", bond_version);
+ bond_info_show_master(seq);
+ } else
+ bond_info_show_slave(seq, v);
+
+ return 0;
+}
+
+static const struct seq_operations bond_info_seq_ops = {
+ .start = bond_info_seq_start,
+ .next = bond_info_seq_next,
+ .stop = bond_info_seq_stop,
+ .show = bond_info_seq_show,
+};
+
+static int bond_info_open(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq;
+ struct proc_dir_entry *proc;
+ int res;
+
+ res = seq_open(file, &bond_info_seq_ops);
+ if (!res) {
+ /* recover the pointer buried in proc_dir_entry data */
+ seq = file->private_data;
+ proc = PDE(inode);
+ seq->private = proc->data;
+ }
+
+ return res;
+}
+
+static const struct file_operations bond_info_fops = {
+ .owner = THIS_MODULE,
+ .open = bond_info_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+void bond_create_proc_entry(struct bonding *bond)
+{
+ struct net_device *bond_dev = bond->dev;
+ struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
+
+ if (bn->proc_dir) {
+ bond->proc_entry = proc_create_data(bond_dev->name,
+ S_IRUGO, bn->proc_dir,
+ &bond_info_fops, bond);
+ if (bond->proc_entry == NULL)
+ pr_warning("Warning: Cannot create /proc/net/%s/%s\n",
+ DRV_NAME, bond_dev->name);
+ else
+ memcpy(bond->proc_file_name, bond_dev->name, IFNAMSIZ);
+ }
+}
+
+void bond_remove_proc_entry(struct bonding *bond)
+{
+ struct net_device *bond_dev = bond->dev;
+ struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
+
+ if (bn->proc_dir && bond->proc_entry) {
+ remove_proc_entry(bond->proc_file_name, bn->proc_dir);
+ memset(bond->proc_file_name, 0, IFNAMSIZ);
+ bond->proc_entry = NULL;
+ }
+}
+
+/* Create the bonding directory under /proc/net, if doesn't exist yet.
+ * Caller must hold rtnl_lock.
+ */
+void __net_init bond_create_proc_dir(struct bond_net *bn)
+{
+ if (!bn->proc_dir) {
+ bn->proc_dir = proc_mkdir(DRV_NAME, bn->net->proc_net);
+ if (!bn->proc_dir)
+ pr_warning("Warning: cannot create /proc/net/%s\n",
+ DRV_NAME);
+ }
+}
+
+/* Destroy the bonding directory under /proc/net, if empty.
+ * Caller must hold rtnl_lock.
+ */
+void __net_exit bond_destroy_proc_dir(struct bond_net *bn)
+{
+ if (bn->proc_dir) {
+ remove_proc_entry(DRV_NAME, bn->net->proc_net);
+ bn->proc_dir = NULL;
+ }
+}
+
+#else /* !CONFIG_PROC_FS */
+
+void bond_create_proc_entry(struct bonding *bond)
+{
+}
+
+void bond_remove_proc_entry(struct bonding *bond)
+{
+}
+
+void bond_create_proc_dir(struct bond_net *bn)
+{
+}
+
+void bond_destroy_proc_dir(struct bond_net *bn)
+{
+}
+
+#endif /* CONFIG_PROC_FS */
+
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index a401b8d..dfe41df 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -29,6 +29,8 @@
#define DRV_NAME "bonding"
#define DRV_DESCRIPTION "Ethernet Channel Bonding Driver"
+#define bond_version DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n"
+
#define BOND_MAX_ARP_TARGETS 16
#define IS_UP(dev) \
@@ -413,6 +415,11 @@ struct bond_net {
#endif
};
+void bond_create_proc_entry(struct bonding *bond);
+void bond_remove_proc_entry(struct bonding *bond);
+void bond_create_proc_dir(struct bond_net *bn);
+void bond_destroy_proc_dir(struct bond_net *bn);
+
/* exported from bond_main.c */
extern int bond_net_id;
extern const struct bond_parm_tbl bond_lacp_tbl[];
* [question] fcoe: bonding support
From: Jiri Pirko @ 2011-02-28 12:29 UTC
To: robert.w.love; +Cc: netdev
Hi Robert.
I wonder what's the meaning of the following code in fcoe_interface_setup():
	/* Do not support for bonding device */
	if ((netdev->priv_flags & IFF_MASTER_ALB) ||
	    (netdev->priv_flags & IFF_SLAVE_INACTIVE) ||
	    (netdev->priv_flags & IFF_MASTER_8023AD)) {
		FCOE_NETDEV_DBG(netdev, "Bonded interfaces not supported\n");
		return -EOPNOTSUPP;
	}
From this I cannot tell whether bonding is not supported at all or
whether only the alb and 8023ad modes are unsupported (leaving aside the
completely bogus check for IFF_SLAVE_INACTIVE).
How about checking IFF_BONDING only (in case bonding should not be
supported at all)?
Thanks.
Jirka
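The difference between the two checks can be sketched in plain C. This is only an illustration: the TOY_* flag bits are hypothetical stand-ins for the kernel's priv_flags values, and it assumes that a bond device in a mode such as active-backup carries the general bonding flag but none of the three mode-specific ones.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real values live in the kernel headers. */
#define TOY_IFF_SLAVE_INACTIVE	0x1u
#define TOY_IFF_MASTER_8023AD	0x2u
#define TOY_IFF_MASTER_ALB	0x4u
#define TOY_IFF_BONDING		0x8u

/* Mirrors the quoted check: rejects only three specific flags. */
static bool toy_rejected_by_mode_flags(unsigned int priv_flags)
{
	return (priv_flags & (TOY_IFF_MASTER_ALB |
			      TOY_IFF_SLAVE_INACTIVE |
			      TOY_IFF_MASTER_8023AD)) != 0;
}

/* The suggested alternative: reject anything marked as bonding. */
static bool toy_rejected_by_bonding_flag(unsigned int priv_flags)
{
	return (priv_flags & TOY_IFF_BONDING) != 0;
}
```

Under that assumption, a device whose flags are just TOY_IFF_BONDING passes the first check but is rejected by the second, which is the ambiguity the question is about.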
* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-28 11:43 UTC
To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <1298837273.8726.128.camel@edumazet-laptop>
Quoting Eric Dumazet <eric.dumazet@gmail.com>:
> On Sunday, 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
>> Quoting Albert Cahalan <acahalan@gmail.com>:
>>
>> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> On Sunday, 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >>>
>> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> > ...
>> >> Problem is some machines have slow High Resolution timing services.
>> >>
>> >> _If_ we have a time limit, it will probably use the low resolution (aka
>> >> jiffies), unless high resolution services are cheap.
>> >
>> > As long as that is totally internal to the kernel and never
>> > getting exposed by some API for setting the amount, sure.
>> >
>> >> I was thinking not having an absolute hard limit, but an EWMA based one.
>> >
>> > The whole point is to prevent stale packets, especially to prevent
>> > them from messing with TCP, so I really don't think so. I suppose
>> > you do get this to some extent via early drop.
>>
>> I made simple hack on sch_fifo with per packet time limits
>> (attachment) this weekend and have been doing limited testing on
>> wireless link. I think hardlimit is fine, it's simple and does
>> somewhat same as what packet(-hard)limited buffer does, drops packets
>> when buffer is 'full'. My hack checks for timed out packets on
>> enqueue, might be wrong approach (on other hand might allow some more
>> burstiness).
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.
>
> This is why I suggested using an EWMA plus a probabilist drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>
Would it be better to implement this as a generic feature instead of a
qdisc-specific one? Have qdisc_enqueue_root do the EWMA check:
static inline int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
	qdisc_skb_cb(skb)->pkt_len = skb->len;

	if (likely(!sch->use_timeout)) {
ewma_ok:
		return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
	}

	status = qdisc_check_ewma_status();
	if (status == ok)
		goto ewma_ok;
	if (status == overlimits)
		...drop...
	if (status == congestion) {
		ret = qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
		return (ret == success) ? NET_XMIT_CN : ret;
	}
}
And add qdisc_dequeue_root:
static inline struct sk_buff *qdisc_dequeue_root(struct Qdisc *sch)
{
	skb = sch->dequeue(sch);
	if (skb && unlikely(sch->use_timeout))
		qdisc_update_ewma(skb);
	return skb;
}
Then the user could specify with tc whether any given qdisc uses a
timeout. Maybe even go as far as having some default timeout for the
default qdisc(?)
-Jussi
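A self-contained userspace sketch of the root-level idea above, for experimenting outside the kernel. It is not the proposed kernel code: the toy_* names, the ring buffer, the 7/8 smoothing weight, and the two thresholds are all assumptions. Enqueue consults the smoothed delay before accepting a packet, and dequeue feeds the measured sojourn time back into the average.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_QLEN 64	/* hypothetical fixed queue capacity */

enum toy_xmit { TOY_XMIT_SUCCESS, TOY_XMIT_DROP, TOY_XMIT_CN };

struct toy_pkt {
	uint64_t enq_time_us;	/* timestamp taken at enqueue */
};

struct toy_qdisc {
	struct toy_pkt ring[TOY_QLEN];
	size_t head, len;
	int use_timeout;	/* 0: behave like a plain FIFO */
	uint64_t avg_delay_us;	/* EWMA of time spent in the queue */
	uint64_t cn_us;		/* above this: report congestion */
	uint64_t drop_us;	/* above this: drop at enqueue */
};

/* The caller learns the packet's fate at enqueue time, not later. */
static enum toy_xmit toy_enqueue_root(struct toy_qdisc *q, uint64_t now_us)
{
	enum toy_xmit ret = TOY_XMIT_SUCCESS;

	if (q->use_timeout) {
		if (q->avg_delay_us >= q->drop_us)
			return TOY_XMIT_DROP;
		if (q->avg_delay_us >= q->cn_us)
			ret = TOY_XMIT_CN;
	}
	if (q->len == TOY_QLEN)
		return TOY_XMIT_DROP;
	q->ring[(q->head + q->len) % TOY_QLEN].enq_time_us = now_us;
	q->len++;
	return ret;
}

/* Returns 1 if a packet was dequeued, 0 if the queue was empty. */
static int toy_dequeue_root(struct toy_qdisc *q, uint64_t now_us)
{
	uint64_t delay;

	if (q->len == 0)
		return 0;
	delay = now_us - q->ring[q->head].enq_time_us;
	q->head = (q->head + 1) % TOY_QLEN;
	q->len--;
	if (q->use_timeout)	/* avg = 7/8 * avg + 1/8 * delay */
		q->avg_delay_us = (q->avg_delay_us * 7 + delay) / 8;
	return 1;
}
```

Note that this sketch keeps a single average for the whole queue, so it has exactly the limitation raised for multi-band qdiscs: it cannot tell which internal queue the delay came from.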
* Re: [PATCH 4/5] udp: Add lockless transmit path
From: Herbert Xu @ 2011-02-28 11:41 UTC
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev, Thomas Graf
In-Reply-To: <E1Pu1TJ-0005RA-DV@gondolin.me.apana.org.au>
On Mon, Feb 28, 2011 at 07:41:01PM +0800, Herbert Xu wrote:
> udp: Add lockless transmit path
Doh! There are only 4 patches in the series. So you didn't
miss anything, yet :)
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data
From: Herbert Xu @ 2011-02-28 11:41 UTC
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>
net: Remove explicit write references to sk/inet in ip_append_data
In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than those that are designed to allow that such as atomic counters).
This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.
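As a rough illustration of the refactoring idea described above (not the patch itself), the following sketch passes the queue and the cork state as explicit arguments instead of reaching into the socket. A thin wrapper then preserves the old socket-based entry point. Every toy_* name here is hypothetical, and the length check is simplified.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the per-send state and the pending queue. */
struct toy_cork {
	int length;		/* total length queued so far */
	unsigned int fragsize;
};

struct toy_queue {
	size_t npkts;
};

struct toy_sock {
	struct toy_queue write_queue;	/* shared, per-socket queue */
	struct toy_cork cork;		/* shared, per-socket cork state */
};

/* Core routine: touches only what it is handed, never the socket. */
static int toy_append_core(struct toy_queue *queue, struct toy_cork *cork,
			   int length)
{
	if (cork->length + length > 0xFFFF)
		return -1;	/* would exceed the maximum datagram size */
	cork->length += length;
	queue->npkts++;
	return 0;
}

/* Legacy entry point: supplies the socket's own queue and cork. */
static int toy_append(struct toy_sock *sk, int length)
{
	return toy_append_core(&sk->write_queue, &sk->cork, length);
}
```

A caller that wants to run in parallel with other senders can hand toy_append_core() a queue and cork that live on its own stack, leaving the shared socket state untouched.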
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/inet_sock.h | 23 ++--
net/ipv4/ip_output.c | 238 ++++++++++++++++++++++++++++--------------------
2 files changed, 154 insertions(+), 107 deletions(-)
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 8181498..b3de102 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
return (struct inet_request_sock *)sk;
}
+struct inet_cork {
+ unsigned int flags;
+ unsigned int fragsize;
+ struct ip_options *opt;
+ struct dst_entry *dst;
+ int length; /* Total length of all frames */
+ __be32 addr;
+ struct flowi fl;
+ struct page *page;
+ u32 off;
+ u8 tx_flags;
+};
+
struct ip_mc_socklist;
struct ipv6_pinfo;
struct rtable;
@@ -143,15 +156,7 @@ struct inet_sock {
int mc_index;
__be32 mc_addr;
struct ip_mc_socklist __rcu *mc_list;
- struct {
- unsigned int flags;
- unsigned int fragsize;
- struct ip_options *opt;
- struct dst_entry *dst;
- int length; /* Total length of all frames */
- __be32 addr;
- struct flowi fl;
- } cork;
+ struct inet_cork cork;
};
#define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d3a4540..1dd5ecc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy)
}
static inline int ip_ufo_append_data(struct sock *sk,
+ struct sk_buff_head *queue,
int getfrag(void *from, char *to, int offset, int len,
int odd, struct sk_buff *skb),
void *from, int length, int hh_len, int fragheaderlen,
@@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk,
* device, so create one single skb packet containing complete
* udp datagram
*/
- if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) {
+ if ((skb = skb_peek_tail(queue)) == NULL) {
skb = sock_alloc_send_skb(sk,
hh_len + fragheaderlen + transhdrlen + 20,
(flags & MSG_DONTWAIT), &err);
@@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk,
/* specify the length of each IP datagram fragment */
skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
- __skb_queue_tail(&sk->sk_write_queue, skb);
+ __skb_queue_tail(queue, skb);
}
return skb_append_datato_frags(sk, skb, getfrag, from,
(length - transhdrlen));
}
-/*
- * ip_append_data() and ip_append_page() can make one large IP datagram
- * from many pieces of data. Each pieces will be holded on the socket
- * until ip_push_pending_frames() is called. Each piece can be a page
- * or non-page data.
- *
- * Not only UDP, other transport protocols - e.g. raw sockets - can use
- * this interface potentially.
- *
- * LATER: length must be adjusted by pad at tail, when it is required.
- */
-int ip_append_data(struct sock *sk,
- int getfrag(void *from, char *to, int offset, int len,
- int odd, struct sk_buff *skb),
- void *from, int length, int transhdrlen,
- struct ipcm_cookie *ipc, struct rtable **rtp,
- unsigned int flags)
+static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
+ struct inet_cork *cork,
+ int getfrag(void *from, char *to, int offset,
+ int len, int odd, struct sk_buff *skb),
+ void *from, int length, int transhdrlen,
+ unsigned int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct sk_buff *skb;
- struct ip_options *opt = NULL;
+ struct ip_options *opt = inet->cork.opt;
int hh_len;
int exthdrlen;
int mtu;
@@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk,
int offset = 0;
unsigned int maxfraglen, fragheaderlen;
int csummode = CHECKSUM_NONE;
- struct rtable *rt;
-
- if (flags&MSG_PROBE)
- return 0;
+ struct rtable *rt = (struct rtable *)cork->dst;
- if (skb_queue_empty(&sk->sk_write_queue)) {
- /*
- * setup for corking.
- */
- opt = ipc->opt;
- if (opt) {
- if (inet->cork.opt == NULL) {
- inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation);
- if (unlikely(inet->cork.opt == NULL))
- return -ENOBUFS;
- }
- memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen);
- inet->cork.flags |= IPCORK_OPT;
- inet->cork.addr = ipc->addr;
- }
- rt = *rtp;
- if (unlikely(!rt))
- return -EFAULT;
- /*
- * We steal reference to this route, caller should not release it
- */
- *rtp = NULL;
- inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
- rt->dst.dev->mtu :
- dst_mtu(rt->dst.path);
- inet->cork.dst = &rt->dst;
- inet->cork.length = 0;
- sk->sk_sndmsg_page = NULL;
- sk->sk_sndmsg_off = 0;
- exthdrlen = rt->dst.header_len;
- length += exthdrlen;
- transhdrlen += exthdrlen;
- } else {
- rt = (struct rtable *)inet->cork.dst;
- if (inet->cork.flags & IPCORK_OPT)
- opt = inet->cork.opt;
+ exthdrlen = transhdrlen ? rt->dst.header_len : 0;
+ length += exthdrlen;
+ transhdrlen += exthdrlen;
+ mtu = inet->cork.fragsize;
- transhdrlen = 0;
- exthdrlen = 0;
- mtu = inet->cork.fragsize;
- }
hh_len = LL_RESERVED_SPACE(rt->dst.dev);
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
- if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+ if (cork->length + length > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
mtu-exthdrlen);
return -EMSGSIZE;
@@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk,
!exthdrlen)
csummode = CHECKSUM_PARTIAL;
- skb = skb_peek_tail(&sk->sk_write_queue);
+ skb = skb_peek_tail(queue);
- inet->cork.length += length;
+ cork->length += length;
if (((length > mtu) || (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
(rt->dst.dev->features & NETIF_F_UFO)) {
- err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
- fragheaderlen, transhdrlen, mtu,
- flags);
+ err = ip_ufo_append_data(sk, queue, getfrag, from, length,
+ hh_len, fragheaderlen, transhdrlen,
+ mtu, flags);
if (err)
goto error;
return 0;
@@ -960,7 +911,7 @@ alloc_new_skb:
else
/* only the initial fragment is
time stamped */
- ipc->tx_flags = 0;
+ cork->tx_flags = 0;
}
if (skb == NULL)
goto error;
@@ -971,7 +922,7 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
- skb_shinfo(skb)->tx_flags = ipc->tx_flags;
+ skb_shinfo(skb)->tx_flags = cork->tx_flags;
/*
* Find where to start putting bytes.
@@ -1008,7 +959,7 @@ alloc_new_skb:
/*
* Put the packet on the pending queue.
*/
- __skb_queue_tail(&sk->sk_write_queue, skb);
+ __skb_queue_tail(queue, skb);
continue;
}
@@ -1028,8 +979,8 @@ alloc_new_skb:
} else {
int i = skb_shinfo(skb)->nr_frags;
skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1];
- struct page *page = sk->sk_sndmsg_page;
- int off = sk->sk_sndmsg_off;
+ struct page *page = cork->page;
+ int off = cork->off;
unsigned int left;
if (page && (left = PAGE_SIZE - off) > 0) {
@@ -1041,7 +992,7 @@ alloc_new_skb:
goto error;
}
get_page(page);
- skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
+ skb_fill_page_desc(skb, i, page, off, 0);
frag = &skb_shinfo(skb)->frags[i];
}
} else if (i < MAX_SKB_FRAGS) {
@@ -1052,8 +1003,8 @@ alloc_new_skb:
err = -ENOMEM;
goto error;
}
- sk->sk_sndmsg_page = page;
- sk->sk_sndmsg_off = 0;
+ cork->page = page;
+ cork->off = 0;
skb_fill_page_desc(skb, i, page, 0, 0);
frag = &skb_shinfo(skb)->frags[i];
@@ -1065,7 +1016,7 @@ alloc_new_skb:
err = -EFAULT;
goto error;
}
- sk->sk_sndmsg_off += copy;
+ cork->off += copy;
frag->size += copy;
skb->len += copy;
skb->data_len += copy;
@@ -1079,11 +1030,87 @@ alloc_new_skb:
return 0;
error:
- inet->cork.length -= length;
+ cork->length -= length;
IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
return err;
}
+static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
+ struct ipcm_cookie *ipc, struct rtable **rtp)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct ip_options *opt;
+ struct rtable *rt;
+
+ /*
+ * setup for corking.
+ */
+ opt = ipc->opt;
+ if (opt) {
+ if (cork->opt == NULL) {
+ cork->opt = kmalloc(sizeof(struct ip_options) + 40,
+ sk->sk_allocation);
+ if (unlikely(cork->opt == NULL))
+ return -ENOBUFS;
+ }
+ memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+ cork->flags |= IPCORK_OPT;
+ cork->addr = ipc->addr;
+ }
+ rt = *rtp;
+ if (unlikely(!rt))
+ return -EFAULT;
+ /*
+ * We steal reference to this route, caller should not release it
+ */
+ *rtp = NULL;
+ cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+ rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+ cork->dst = &rt->dst;
+ cork->length = 0;
+ cork->tx_flags = ipc->tx_flags;
+ cork->page = NULL;
+ cork->off = 0;
+
+ return 0;
+}
+
+/*
+ * ip_append_data() and ip_append_page() can make one large IP datagram
+ * from many pieces of data. Each pieces will be holded on the socket
+ * until ip_push_pending_frames() is called. Each piece can be a page
+ * or non-page data.
+ *
+ * Not only UDP, other transport protocols - e.g. raw sockets - can use
+ * this interface potentially.
+ *
+ * LATER: length must be adjusted by pad at tail, when it is required.
+ */
+int ip_append_data(struct sock *sk,
+ int getfrag(void *from, char *to, int offset, int len,
+ int odd, struct sk_buff *skb),
+ void *from, int length, int transhdrlen,
+ struct ipcm_cookie *ipc, struct rtable **rtp,
+ unsigned int flags)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ int err;
+
+ if (flags&MSG_PROBE)
+ return 0;
+
+ if (skb_queue_empty(&sk->sk_write_queue)) {
+ err = ip_setup_cork(sk, &inet->cork, ipc, rtp);
+ if (err)
+ return err;
+ } else {
+ transhdrlen = 0;
+ }
+
+ return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag,
+ from, length, transhdrlen, flags);
+}
+
ssize_t ip_append_page(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
@@ -1227,40 +1254,42 @@ error:
return err;
}
-static void ip_cork_release(struct inet_sock *inet)
+static void ip_cork_release(struct inet_cork *cork)
{
- inet->cork.flags &= ~IPCORK_OPT;
- kfree(inet->cork.opt);
- inet->cork.opt = NULL;
- dst_release(inet->cork.dst);
- inet->cork.dst = NULL;
+ cork->flags &= ~IPCORK_OPT;
+ kfree(cork->opt);
+ cork->opt = NULL;
+ dst_release(cork->dst);
+ cork->dst = NULL;
}
/*
* Combined all pending IP fragments on the socket as one IP datagram
* and push them out.
*/
-int ip_push_pending_frames(struct sock *sk)
+static int __ip_push_pending_frames(struct sock *sk,
+ struct sk_buff_head *queue,
+ struct inet_cork *cork)
{
struct sk_buff *skb, *tmp_skb;
struct sk_buff **tail_skb;
struct inet_sock *inet = inet_sk(sk);
struct net *net = sock_net(sk);
struct ip_options *opt = NULL;
- struct rtable *rt = (struct rtable *)inet->cork.dst;
+ struct rtable *rt = (struct rtable *)cork->dst;
struct iphdr *iph;
__be16 df = 0;
__u8 ttl;
int err = 0;
- if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL)
+ if ((skb = __skb_dequeue(queue)) == NULL)
goto out;
tail_skb = &(skb_shinfo(skb)->frag_list);
/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));
- while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) {
+ while ((tmp_skb = __skb_dequeue(queue)) != NULL) {
__skb_pull(tmp_skb, skb_network_header_len(skb));
*tail_skb = tmp_skb;
tail_skb = &(tmp_skb->next);
@@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk)
ip_dont_fragment(sk, &rt->dst)))
df = htons(IP_DF);
- if (inet->cork.flags & IPCORK_OPT)
- opt = inet->cork.opt;
+ if (cork->flags & IPCORK_OPT)
+ opt = cork->opt;
if (rt->rt_type == RTN_MULTICAST)
ttl = inet->mc_ttl;
@@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk)
iph->ihl = 5;
if (opt) {
iph->ihl += opt->optlen>>2;
- ip_options_build(skb, opt, inet->cork.addr, rt, 0);
+ ip_options_build(skb, opt, cork->addr, rt, 0);
}
iph->tos = inet->tos;
iph->frag_off = df;
@@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk)
* Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
* on dst refcount
*/
- inet->cork.dst = NULL;
+ cork->dst = NULL;
skb_dst_set(skb, &rt->dst);
if (iph->protocol == IPPROTO_ICMP)
@@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk)
}
out:
- ip_cork_release(inet);
+ ip_cork_release(cork);
return err;
error:
@@ -1340,17 +1369,30 @@ error:
goto out;
}
+int ip_push_pending_frames(struct sock *sk)
+{
+ return __ip_push_pending_frames(sk, &sk->sk_write_queue,
+ &inet_sk(sk)->cork);
+}
+
/*
* Throw away all pending data on the socket.
*/
-void ip_flush_pending_frames(struct sock *sk)
+static void __ip_flush_pending_frames(struct sock *sk,
+ struct sk_buff_head *queue,
+ struct inet_cork *cork)
{
struct sk_buff *skb;
- while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL)
+ while ((skb = __skb_dequeue_tail(queue)) != NULL)
kfree_skb(skb);
- ip_cork_release(inet_sk(sk));
+ ip_cork_release(cork);
+}
+
+void ip_flush_pending_frames(struct sock *sk)
+{
+ __ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
}
^ permalink raw reply related
* [PATCH 4/5] udp: Add lockless transmit path
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>
udp: Add lockless transmit path
The UDP transmit path has been running under the socket lock
for a long time because of the corking feature. This means that
transmitting to the same socket in multiple threads does not
scale at all.
However, as most users don't actually use corking, the locking
can be removed in the common case.
This patch creates a lockless fast path where corking is not used.
Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits. In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.
As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.
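To make the bound above concrete, here is a minimal userspace sketch (not kernel code) of the worst case. It assumes every CPU performs its limit check against the same stale value before any of them charges its packet to the socket. The function names and parameters are hypothetical and only model the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of an unlocked "check, then charge" sequence.
 * Assumption: all ncpus senders read the same wmem value and pass the
 * limit check before any of them adds its packet.
 */
size_t worst_case_wmem(size_t wmem, size_t sndbuf,
		       size_t ncpus, size_t pkt_size)
{
	size_t passed = 0;
	size_t cpu;

	/* Phase 1: every CPU checks against the same stale value. */
	for (cpu = 0; cpu < ncpus; cpu++)
		if (wmem < sndbuf)
			passed++;

	/* Phase 2: every CPU that passed charges its packet. */
	return wmem + passed * pkt_size;
}

/* How far the result lies above the configured limit (0 if it does not). */
size_t overshoot(size_t wmem, size_t sndbuf, size_t ncpus, size_t pkt_size)
{
	size_t total;

	assert(pkt_size > 0);
	total = worst_case_wmem(wmem, sndbuf, ncpus, pkt_size);
	return total > sndbuf ? total - sndbuf : 0;
}
```

With a 10000-byte limit, 9999 bytes already charged, four CPUs and 1500-byte packets, this model ends 5999 bytes over the limit, just under 4 * 1500.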
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/udp.h | 11 +++++
include/net/udplite.h | 12 +++++
net/ipv4/udp.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 126 insertions(+), 1 deletion(-)
diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..b8563ba 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb)
return csum;
}
+static inline __wsum udp_csum(struct sk_buff *skb)
+{
+ __wsum csum = csum_partial(skb_transport_header(skb),
+ sizeof(struct udphdr), skb->csum);
+
+ for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) {
+ csum = csum_add(csum, skb->csum);
+ }
+ return csum;
+}
+
/* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
static inline void udp_lib_hash(struct sock *sk)
{
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..673a024 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
return csum;
}
+static inline __wsum udplite_csum(struct sk_buff *skb)
+{
+ struct sock *sk = skb->sk;
+ int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb));
+ const int off = skb_transport_offset(skb);
+ const int len = skb->len - off;
+
+ skb->ip_summed = CHECKSUM_NONE; /* no HW support for checksumming */
+
+ return skb_checksum(skb, off, min(cscov, len), 0);
+}
+
extern void udplite4_register(void);
extern int udplite_get_port(struct sock *sk, unsigned short snum,
int (*scmp)(const struct sock *, const struct sock *));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8157b17..7fd3664 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -769,6 +769,95 @@ out:
return err;
}
+static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst)
+{
+ struct udphdr *uh = udp_hdr(skb);
+ struct sk_buff *frags = skb_shinfo(skb)->frag_list;
+ int offset = skb_transport_offset(skb);
+ int len = skb->len - offset;
+ int hlen = len;
+ __wsum csum = 0;
+
+ if (!frags) {
+ /*
+ * Only one fragment on the socket.
+ */
+ skb->csum_start = skb_transport_header(skb) - skb->head;
+ skb->csum_offset = offsetof(struct udphdr, check);
+ uh->check = ~csum_tcpudp_magic(src, dst, len,
+ IPPROTO_UDP, 0);
+ } else {
+ /*
+ * HW-checksum won't work as there are two or more
+ * fragments on the socket so that all csums of sk_buffs
+ * should be together
+ */
+ do {
+ csum = csum_add(csum, frags->csum);
+ hlen -= frags->len;
+ } while ((frags = frags->next));
+
+ csum = skb_checksum(skb, offset, hlen, csum);
+ skb->ip_summed = CHECKSUM_NONE;
+
+ uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
+ if (uh->check == 0)
+ uh->check = CSUM_MANGLED_0;
+ }
+}
+
+static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport)
+{
+ struct sock *sk = skb->sk;
+ struct inet_sock *inet = inet_sk(sk);
+ struct udphdr *uh;
+ struct rtable *rt = (struct rtable *)skb_dst(skb);
+ int err = 0;
+ int is_udplite = IS_UDPLITE(sk);
+ int offset = skb_transport_offset(skb);
+ int len = skb->len - offset;
+ __wsum csum = 0;
+
+ /*
+ * Create a UDP header
+ */
+ uh = udp_hdr(skb);
+ uh->source = inet->inet_sport;
+ uh->dest = dport;
+ uh->len = htons(len);
+ uh->check = 0;
+
+ if (is_udplite)
+ csum = udplite_csum(skb);
+ else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {
+ skb->ip_summed = CHECKSUM_NONE;
+ goto send;
+ } else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+ udp4_hwcsum(skb, rt->rt_src, daddr);
+ goto send;
+ } else
+ csum = udp_csum(skb);
+
+ /* add protocol-dependent pseudo-header */
+ uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len,
+ sk->sk_protocol, csum);
+ if (uh->check == 0)
+ uh->check = CSUM_MANGLED_0;
+
+send:
+ err = ip_send_skb(skb);
+ if (err) {
+ if (err == -ENOBUFS && !inet->recverr) {
+ UDP_INC_STATS_USER(sock_net(sk),
+ UDP_MIB_SNDBUFERRORS, is_udplite);
+ err = 0;
+ }
+ } else
+ UDP_INC_STATS_USER(sock_net(sk),
+ UDP_MIB_OUTDATAGRAMS, is_udplite);
+ return err;
+}
+
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len)
{
@@ -785,6 +874,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
int err, is_udplite = IS_UDPLITE(sk);
int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+ struct sk_buff *skb;
if (len > 0xFFFF)
return -EMSGSIZE;
@@ -799,6 +889,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
ipc.opt = NULL;
ipc.tx_flags = 0;
+ getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
+
if (up->pending) {
/*
* There are pending frames.
@@ -923,6 +1015,17 @@ back_from_confirm:
if (!ipc.addr)
daddr = ipc.addr = rt->rt_dst;
+ /* Lockless fast path for the non-corking case. */
+ if (!corkreq) {
+ skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen,
+ sizeof(struct udphdr), &ipc, &rt,
+ msg->msg_flags);
+ err = PTR_ERR(skb);
+ if (skb && !IS_ERR(skb))
+ err = udp_send_skb(skb, daddr, dport);
+ goto out;
+ }
+
lock_sock(sk);
if (unlikely(up->pending)) {
/* The socket is already corked while preparing it. */
@@ -944,7 +1047,6 @@ back_from_confirm:
do_append_data:
up->len += ulen;
- getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
sizeof(struct udphdr), &ipc, &rt,
corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
^ permalink raw reply related
* [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>
net: Remove unused sk_sndmsg_* from UFO
UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless. It can't use them anyway since the whole
point of UFO is to use the original pages without copying.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
net/core/skbuff.c | 3 ---
net/ipv4/ip_output.c | 1 -
net/ipv6/ip6_output.c | 1 -
3 files changed, 5 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d883dcc..97011a7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
return -ENOMEM;
/* initialize the next frag */
- sk->sk_sndmsg_page = page;
- sk->sk_sndmsg_off = 0;
skb_fill_page_desc(skb, frg_cnt, page, 0, 0);
skb->truesize += PAGE_SIZE;
atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc);
@@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
return -EFAULT;
/* copy was successful so update the size parameters */
- sk->sk_sndmsg_off += copy;
frag->size += copy;
skb->len += copy;
skb->data_len += copy;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04c7b3b..d3a4540 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk,
skb->ip_summed = CHECKSUM_PARTIAL;
skb->csum = 0;
- sk->sk_sndmsg_off = 0;
/* specify the length of each IP datagram fragment */
skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5f8d242..9965182 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk,
skb->ip_summed = CHECKSUM_PARTIAL;
skb->csum = 0;
- sk->sk_sndmsg_off = 0;
}
err = skb_append_datato_frags(sk,skb, getfrag, from,
^ permalink raw reply related
* [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>
inet: Add ip_make_skb and ip_send_skb
This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced. The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.
It is meant to be called in cases where corking is not used and should
have a one-to-one correspondence to sendmsg.
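As an illustration of the calling convention, the sketch below models in userspace the three outcomes a caller has to handle. In the patch, ip_make_skb returns a packet, NULL when MSG_PROBE is set, or an ERR_PTR-encoded error. Everything here is a hypothetical stand-in: err_ptr/ptr_err/is_err imitate the kernel's ERR_PTR/PTR_ERR/IS_ERR, and fake_skb, fake_make_skb and fake_send_skb are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static void *err_ptr(long error) { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct fake_skb { int len; };	/* hypothetical stand-in for struct sk_buff */

enum make_mode { MAKE_OK, MAKE_PROBE, MAKE_FAIL };

/* Models the three outcomes of the "make" step: packet, NULL, or error. */
static struct fake_skb *fake_make_skb(struct fake_skb *storage,
				      enum make_mode mode)
{
	assert(storage != NULL);
	if (mode == MAKE_PROBE)
		return NULL;			/* nothing to send, not an error */
	if (mode == MAKE_FAIL)
		return err_ptr(-EMSGSIZE);	/* error encoded in the pointer */
	storage->len = 100;
	return storage;
}

/* Models the "send" step; it always succeeds in this sketch. */
static int fake_send_skb(struct fake_skb *skb)
{
	(void)skb;
	return 0;
}

/* Caller pattern: only a real packet reaches the send step. */
long make_and_send(enum make_mode mode, int *sent)
{
	struct fake_skb storage;
	struct fake_skb *skb = fake_make_skb(&storage, mode);
	long err = ptr_err(skb);	/* 0 for NULL, -errno for an error */

	*sent = 0;
	if (skb && !is_err(skb)) {
		err = fake_send_skb(skb);
		*sent = 1;
	}
	return err;
}
```

Taking the error code from the pointer first and overwriting it only when a real packet exists lets a NULL return fall through as success without a separate branch.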
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/ip.h | 8 ++++++
net/ipv4/ip_output.c | 65 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 59 insertions(+), 14 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index 67fac78..a96e525 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,8 +116,16 @@ extern int ip_append_data(struct sock *sk,
extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
extern ssize_t ip_append_page(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+extern int ip_send_skb(struct sk_buff *skb);
extern int ip_push_pending_frames(struct sock *sk);
extern void ip_flush_pending_frames(struct sock *sk);
+extern struct sk_buff *ip_make_skb(struct sock *sk,
+ int getfrag(void *from, char *to, int offset, int len,
+ int odd, struct sk_buff *skb),
+ void *from, int length, int transhdrlen,
+ struct ipcm_cookie *ipc,
+ struct rtable **rtp,
+ unsigned int flags);
/* datagram.c */
extern int ip4_datagram_connect(struct sock *sk,
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1dd5ecc..dba14c6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork)
* Combined all pending IP fragments on the socket as one IP datagram
* and push them out.
*/
-static int __ip_push_pending_frames(struct sock *sk,
- struct sk_buff_head *queue,
- struct inet_cork *cork)
+static struct sk_buff *__ip_make_skb(struct sock *sk,
+ struct sk_buff_head *queue,
+ struct inet_cork *cork)
{
struct sk_buff *skb, *tmp_skb;
struct sk_buff **tail_skb;
@@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk,
struct iphdr *iph;
__be16 df = 0;
__u8 ttl;
- int err = 0;
if ((skb = __skb_dequeue(queue)) == NULL)
goto out;
@@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk,
icmp_out_count(net, ((struct icmphdr *)
skb_transport_header(skb))->type);
- /* Netfilter gets whole the not fragmented skb. */
+ ip_cork_release(cork);
+out:
+ return skb;
+}
+
+int ip_send_skb(struct sk_buff *skb)
+{
+ struct net *net = sock_net(skb->sk);
+ int err;
+
err = ip_local_out(skb);
if (err) {
if (err > 0)
err = net_xmit_errno(err);
if (err)
- goto error;
+ IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
}
-out:
- ip_cork_release(cork);
return err;
-
-error:
- IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
- goto out;
}
int ip_push_pending_frames(struct sock *sk)
{
- return __ip_push_pending_frames(sk, &sk->sk_write_queue,
- &inet_sk(sk)->cork);
+ struct sk_buff *skb;
+
+ skb = __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
+ if (!skb)
+ return 0;
+
+ /* Netfilter gets whole the not fragmented skb. */
+ return ip_send_skb(skb);
}
/*
@@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk)
__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
}
+struct sk_buff *ip_make_skb(struct sock *sk,
+ int getfrag(void *from, char *to, int offset,
+ int len, int odd, struct sk_buff *skb),
+ void *from, int length, int transhdrlen,
+ struct ipcm_cookie *ipc, struct rtable **rtp,
+ unsigned int flags)
+{
+ struct inet_cork cork = {};
+ struct sk_buff_head queue;
+ int err;
+
+ if (flags & MSG_PROBE)
+ return NULL;
+
+ __skb_queue_head_init(&queue);
+
+ err = ip_setup_cork(sk, &cork, ipc, rtp);
+ if (err)
+ return ERR_PTR(err);
+
+ err = __ip_append_data(sk, &queue, &cork, getfrag,
+ from, length, transhdrlen, flags);
+ if (err) {
+ __ip_flush_pending_frames(sk, &queue, &cork);
+ return ERR_PTR(err);
+ }
+
+ return __ip_make_skb(sk, &queue, &cork);
+}
/*
* Fetch data from kernel space and fill in checksum if needed.
^ permalink raw reply related
* Re: Bug in kvm_set_irq
From: Michael S. Tsirkin @ 2011-02-28 11:39 UTC (permalink / raw)
To: Jean-Philippe Menil; +Cc: kvm, netdev, virtualization
In-Reply-To: <4D6B7BAB.9070907@univ-nantes.fr>
On Mon, Feb 28, 2011 at 11:40:43AM +0100, Jean-Philippe Menil wrote:
> Le 28/02/2011 11:11, Michael S. Tsirkin a écrit :
> >On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
> >>Le 27/02/2011 18:00, Michael S. Tsirkin a écrit :
> >>>On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> >>>>Hi,
> >>>>
> >>>>Each time I try to use vhost_net, I hit a kernel bug.
> >>>>I do a "modprobe vhost_net", and start the guest with vhost=on.
> >>>>
> >>>>Following is a trace with kernel 2.6.37, but I had the same
> >>>>problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
> >>>2.6.36 had a theoretical race that could explain this,
> >>>but it should be ok in 2.6.37.
> >>>
> >>>>The bug only occurs with vhost_net loaded, so I don't know if this
> >>>>is a bug in kvm module code or in the vhost_net code.
> >>>It could be a bug in eventfd which is the interface
> >>>used by both kvm and vhost_net.
> >>>Just for fun, you can try 2.6.38 - eventfd code has been changed
> >>>a lot in 2.6.38 and if it does not trigger there
> >>>it's a hint that irqfd is the reason.
> >>>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243100] BUG: unable to handle kernel paging request at
> >>>>0000000000002458
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>>Could you run markup_oops/ ksymoops on this please?
> >>>As far as I can see kvm_set_irq can only get a wrong
> >>>kvm pointer. Unless there's some general memory corruption,
> >>>I'd guess
> >>>
> >>>You can also try comparing the irqfd->kvm pointer in
> >>>kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in
> >>>virt/kvm/eventfd.c.
> >>>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243556] Oops: 0000 [#1] SMP
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243692] last sysfs file:
> >>>>/sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243820] Modules linked in: vhost_net macvtap macvlan tun
> >>>>powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> >>>>cpufreq_ondemand freq_table cpufreq_conservative fuse xt_physdev
> >>>>ip6t_LOG ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit
> >>>>xt_tcpudp xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> >>>>nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2
> >>>>mbcache dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> >>>>nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> >>>>snd_page_alloc tpm_tis tpm psmouse dcdbas tpm_bios processor i2c_nforce2
> >>>>shpchp pcspkr ghes serio_raw joydev evdev pci_hotplug i2c_core hed button
> >>>>thermal_sys xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage
> >>>>ses sd_mod enclosure megaraid_sas ohci_hcd lpfc scsi_transport_fc
> >>>>scsi_tgt bnx2 scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> >>>>2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RIP: 0010:[<ffffffffa041aa8a>] [<ffffffffa041aa8a>]
> >>>>kvm_set_irq+0x2a/0x130 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RSP: 0018:ffff88045fc89d30 EFLAGS: 00010246
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
> >>>>0000000000000001
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
> >>>>ffff880856a91e48
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] FS: 00007f617986c710(0000) GS:ffff88007f800000(0000)
> >>>>knlGS:0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
> >>>>00000000000006f0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >>>>0000000000000400
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Process kworker/0:1 (pid: 10, threadinfo
> >>>>ffff88045fc88000, task ffff88085fc53c30)
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack:
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
> >>>>ffff88085fc53ee8
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
> >>>>00000000000119c0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] 00000000000119c0 ffffffff8137f7ce ffff88007f80df40
> >>>>00000000ffffffff
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Call Trace:
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106be25>] ? worker_thread+0x145/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106f786>] ? kthread+0x96/0xa0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
> >>>>fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
> >>>>00 00 00 <49> 8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RIP [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RSP <ffff88045fc89d30>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CR2: 0000000000002458
> >>>>
> >>>>
> >>>>If someone can help me, on how to solve this.
> >>>>
> >>>>Regards.
> >>>>_______________________________________________
> >>>>Virtualization mailing list
> >>>>Virtualization@lists.linux-foundation.org
> >>>>https://lists.linux-foundation.org/mailman/listinfo/virtualization
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>>the body of a message to majordomo@vger.kernel.org
> >>>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>Hi,
> >>
> >>thanks for your response.
> >>
> >>This is what markup_oops.pl returns:
> >>"No matching code found"
> >Well, let's try to understand what's there.
> >
> >Do objdump -ldS kvm.ko
> >look for <kvm_set_irq>
> >
> >and paste the content from start of that function
> >to offset 0x2a and a bit beyond.
> >
> >You can also upload your kvm.ko somewhere, I'll try to take a look.
> >
> >
> >>So this is either not a vhost_net bug, or my oops is incomplete and
> >>markup_oops can't find the right vma offset.
> >>
> >>I will try to compare the pointers you pointed out, even though it
> >>could be a little difficult for me.
> >Hmm you know how to add printk to code and rebuild, right?
> >
> >>Maybe I will try 2.6.38; I will wait for a response from the kvm team.
> >>
> >>Regards.
> >>
> >>--
> >>Jean-Philippe Menil - Pôle réseau Service IRTS
> >>DSI Université de Nantes
> >>jean-philippe.menil@univ-nantes.fr
> >>Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
> So, here is the result for the objdump against the kvm.ko (the
> kvm_set_irq part) :
Can you try building with -g and adding -l and -S to objdump
please? I'd rather make the tool do the legwork than
do it manually.
>
> 0000000000006a60 <kvm_set_irq>:
> kvm_set_irq():
> 6a60: 41 57 push %r15
> 6a62: 41 89 f7 mov %esi,%r15d
> 6a65: 41 56 push %r14
> 6a67: 41 55 push %r13
> 6a69: 41 89 cd mov %ecx,%r13d
> 6a6c: 41 54 push %r12
> 6a6e: 49 89 fc mov %rdi,%r12
> 6a71: 55 push %rbp
> 6a72: 53 push %rbx
> 6a73: 89 d3 mov %edx,%ebx
> 6a75: 48 81 ec 98 00 00 00 sub $0x98,%rsp
> 6a7c: 8b 15 00 00 00 00 mov 0x0(%rip),%edx
> # 6a82 <kvm_set_irq+0x22>
> 6a82: 85 d2 test %edx,%edx
> 6a84: 0f 85 c4 00 00 00 jne 6b4e <kvm_set_irq+0xee>
> 6a8a: 49 8b 84 24 58 24 00 mov 0x2458(%r12),%rax
OK, 0x6a8a is the offset.
After you build with -g, try
addr2line -e kvm.ko 0x6a8a
and see which line this points to.
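For reference, the arithmetic behind that offset can be sketched as below. This is only an illustration: the helper names are made up, and the values are copied from the oops (kvm_set_irq+0x2a, CR2 = 0x2458) and from the disassembly above (function start 0x6a60, displacement 0x2458 off %r12). If that reading is right, the base register held 0 at the time of the fault, which would match R12 = 0 in the register dump, i.e. a NULL kvm pointer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset of an instruction inside the object file: start of the symbol
 * (from objdump) plus the offset reported in the oops.
 */
static uint64_t insn_offset(uint64_t sym_start, uint64_t oops_off)
{
	return sym_start + oops_off;
}

/*
 * For a load of the form "mov disp(%reg),..." the faulting address is
 * reg + disp, so the register value can be recovered from CR2.
 */
static uint64_t base_from_fault(uint64_t fault_addr, uint64_t disp)
{
	assert(fault_addr >= disp);
	return fault_addr - disp;
}
```

With these numbers, insn_offset(0x6a60, 0x2a) gives 0x6a8a and base_from_fault(0x2458, 0x2458) gives 0.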
> 6a91: 00
> 6a92: 3b 98 28 01 00 00 cmp 0x128(%rax),%ebx
> 6a98: 73 5e jae 6af8 <kvm_set_irq+0x98>
> 6a9a: 89 db mov %ebx,%ebx
> 6a9c: 48 8b 84 d8 30 01 00 mov 0x130(%rax,%rbx,8),%rax
> 6aa3: 00
> 6aa4: 48 85 c0 test %rax,%rax
> 6aa7: 74 4f je 6af8 <kvm_set_irq+0x98>
> 6aa9: 48 89 e2 mov %rsp,%rdx
> 6aac: 31 db xor %ebx,%ebx
> 6aae: 48 8b 08 mov (%rax),%rcx
> 6ab1: 83 c3 01 add $0x1,%ebx
> 6ab4: 0f 18 09 prefetcht0 (%rcx)
> 6ab7: 48 8b 48 e0 mov -0x20(%rax),%rcx
> 6abb: 48 89 0a mov %rcx,(%rdx)
> 6abe: 48 8b 48 e8 mov -0x18(%rax),%rcx
> 6ac2: 48 89 4a 08 mov %rcx,0x8(%rdx)
> 6ac6: 48 8b 48 f0 mov -0x10(%rax),%rcx
> 6aca: 48 89 4a 10 mov %rcx,0x10(%rdx)
> 6ace: 48 8b 48 f8 mov -0x8(%rax),%rcx
> 6ad2: 48 89 4a 18 mov %rcx,0x18(%rdx)
> 6ad6: 48 8b 08 mov (%rax),%rcx
> 6ad9: 48 89 4a 20 mov %rcx,0x20(%rdx)
> 6add: 48 8b 48 08 mov 0x8(%rax),%rcx
> 6ae1: 48 89 4a 28 mov %rcx,0x28(%rdx)
> 6ae5: 48 8b 00 mov (%rax),%rax
> 6ae8: 48 83 c2 30 add $0x30,%rdx
> 6aec: 48 85 c0 test %rax,%rax
> 6aef: 75 bd jne 6aae <kvm_set_irq+0x4e>
> 6af1: eb 07 jmp 6afa <kvm_set_irq+0x9a>
> 6af3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 6af8: 31 db xor %ebx,%ebx
> 6afa: bd ff ff ff ff mov $0xffffffff,%ebp
> 6aff: 49 89 e6 mov %rsp,%r14
> 6b02: 85 db test %ebx,%ebx
> 6b04: 74 34 je 6b3a <kvm_set_irq+0xda>
> 6b06: 83 eb 01 sub $0x1,%ebx
> 6b09: 44 89 e9 mov %r13d,%ecx
> 6b0c: 44 89 fa mov %r15d,%edx
> 6b0f: 48 63 c3 movslq %ebx,%rax
> 6b12: 4c 89 e6 mov %r12,%rsi
> 6b15: 48 8d 04 40 lea (%rax,%rax,2),%rax
> 6b19: 48 c1 e0 04 shl $0x4,%rax
> 6b1d: 49 8d 3c 06 lea (%r14,%rax,1),%rdi
> 6b21: ff 54 04 08 callq *0x8(%rsp,%rax,1)
> 6b25: 85 c0 test %eax,%eax
> 6b27: 78 d9 js 6b02 <kvm_set_irq+0xa2>
> 6b29: 85 ed test %ebp,%ebp
> 6b2b: ba 00 00 00 00 mov $0x0,%edx
> 6b30: 0f 48 ea cmovs %edx,%ebp
> 6b33: 85 db test %ebx,%ebx
> 6b35: 8d 2c 28 lea (%rax,%rbp,1),%ebp
> 6b38: 75 cc jne 6b06 <kvm_set_irq+0xa6>
> 6b3a: 48 81 c4 98 00 00 00 add $0x98,%rsp
> 6b41: 89 e8 mov %ebp,%eax
> 6b43: 5b pop %rbx
> 6b44: 5d pop %rbp
> 6b45: 41 5c pop %r12
> 6b47: 41 5d pop %r13
> 6b49: 41 5e pop %r14
> 6b4b: 41 5f pop %r15
> 6b4d: c3 retq
> 6b4e: 48 8b 2d 00 00 00 00 mov 0x0(%rip),%rbp
> # 6b55 <kvm_set_irq+0xf5>
> 6b55: 48 85 ed test %rbp,%rbp
> 6b58: 0f 84 2c ff ff ff je 6a8a <kvm_set_irq+0x2a>
> 6b5e: 48 8b 45 00 mov 0x0(%rbp),%rax
> 6b62: 48 8b 7d 08 mov 0x8(%rbp),%rdi
> 6b66: 48 83 c5 10 add $0x10,%rbp
> 6b6a: 44 89 f9 mov %r15d,%ecx
> 6b6d: 44 89 ea mov %r13d,%edx
> 6b70: 89 de mov %ebx,%esi
> 6b72: ff d0 callq *%rax
> 6b74: 48 8b 45 00 mov 0x0(%rbp),%rax
> 6b78: 48 85 c0 test %rax,%rax
> 6b7b: 75 e5 jne 6b62 <kvm_set_irq+0x102>
> 6b7d: e9 08 ff ff ff jmpq 6a8a <kvm_set_irq+0x2a>
> 6b82: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
> 6b89: 1f 84 00 00 00 00 00
>
> I admit that this analysis is too complicated for me.
> I, effectively, can rebuild a kernel with more printk, and program a reboot.
>
> The kvm.ko is available through the following address:
> http://filex.univ-nantes.fr/get?k=k1jKhQghdcHLz12Z50H
>
> Regards.
This has no debug data. Can you rebuild with -g please?
BTW if you want to rerun and get a more reliable backtrace,
try enabling frame pointers (do you know how to?). But this will change the code,
so the backtrace will no longer be valid and we will need
a new one.
> --
> Jean-Philippe Menil - Pôle réseau Service IRTS
> DSI Université de Nantes
> jean-philippe.menil@univ-nantes.fr
> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
^ permalink raw reply
* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-28 11:36 UTC (permalink / raw)
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>
On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> I'm working on this right now.
OK I think I was definitely on the right track. With the send
patch made lockless I now get numbers which are even better than
those obtained with running named with multiple sockets. That's
right, a single socket is now faster than what multiple sockets
were without the patch (of course, multiple sockets may still be
faster with the patch vs. a single socket for obvious reasons,
but I couldn't measure any significant difference).
Also worthy of note is that prior to the patch all CPUs showed
idleness (lazy bastards!), with the patch they're all maxed out.
In retrospect, the idleness was simply the result of the socket
lock scheduling away and was an indication of lock contention.
Here are the patches I used. Please don't apply them yet as I intend
to clean them up quite a bit.
But please do test them heavily, especially if you have an AMD
NUMA machine as that's where scalability problems really show
up. Intel tends to be a lot more forgiving. My last AMD machine
blew up years ago :)
Thanks!
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* dccp test-tree [RFC] [Patch 1/1] dccp: Only activate NN values after receiving the Confirm option
From: Gerrit Renker @ 2011-02-28 11:25 UTC (permalink / raw)
To: Samuel Jero; +Cc: dccp, netdev
I am sending this as RFC since I have not yet deeply tested this. It makes
the exchange of NN options in established state conform to RFC 4340, 6.6.1
and thus actually is a bug fix.
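To illustrate the intended behaviour, here is a small userspace model of one NN feature. It is only a sketch under my own assumptions: the struct and function names are hypothetical and are not the ones used in net/dccp/feat.c. The point it shows is that sending a Change only records the new value, and the value in use is switched when a matching Confirm arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a single NN feature on an established connection. */
struct nn_feature {
	uint64_t active;	/* value currently in use */
	uint64_t pending;	/* value sent in a Change, awaiting Confirm */
	bool has_pending;	/* true while a Change is outstanding */
};

/* Send a Change: remember the new value, keep using the old one. */
static void nn_signal_change(struct nn_feature *f, uint64_t val)
{
	f->pending = val;
	f->has_pending = true;
}

/*
 * Handle a Confirm: activate the pending value only if the confirmed
 * value matches it. Returns true if the value was activated.
 */
static bool nn_handle_confirm(struct nn_feature *f, uint64_t confirmed)
{
	if (!f->has_pending || confirmed != f->pending)
		return false;
	f->active = f->pending;
	f->has_pending = false;
	return true;
}
```

In this model a Confirm carrying a different value is simply ignored, which mirrors the "fval.nn != entry->val.nn" check in the patch below.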
>>>>>>>>>>>>>>>>>>>>>>>>> Patch <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
dccp: Only activate NN values after receiving the Confirm option
This defers changing local values using exchange of NN options in established
connection state by only activating the value after receiving the Confirm
option, as mandated by RFC 4340, 6.6.1.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
net/dccp/ccids/ccid2.c | 7 ++-----
net/dccp/feat.c | 13 ++++---------
2 files changed, 6 insertions(+), 14 deletions(-)
--- a/net/dccp/feat.c
+++ b/net/dccp/feat.c
@@ -775,12 +775,7 @@ int dccp_feat_register_sp(struct sock *s
* @sk: DCCP socket of an established connection
* @feat: NN feature number from %dccp_feature_numbers
* @nn_val: the new value to use
- * This function is used to communicate NN updates out-of-band. The difference
- * to feature negotiation during connection setup is that values are activated
- * immediately after validation, i.e. we don't wait for the Confirm: either the
- * value is accepted by the peer (and then the waiting is futile), or it is not
- * (Reset or empty Confirm). We don't accept empty Confirms - transmitted values
- * are validated, and the peer "MUST accept any valid value" (RFC 4340, 6.3.2).
+ * This function is used to communicate NN updates out-of-band.
*/
int dccp_feat_signal_nn_change(struct sock *sk, u8 feat, u64 nn_val)
{
@@ -805,9 +800,6 @@ int dccp_feat_signal_nn_change(struct so
dccp_feat_list_pop(entry);
}
- if (dccp_feat_activate(sk, feat, 1, &fval))
- return -EADV;
-
inet_csk_schedule_ack(sk);
return dccp_feat_push_change(fn, feat, 1, 0, &fval);
}
@@ -1356,6 +1348,9 @@ static u8 dccp_feat_handle_nn_establishe
if (fval.nn != entry->val.nn)
return 0;
+ /* Only activate after receiving the Confirm option (6.6.1). */
+ dccp_feat_activate(sk, feat, local, &fval);
+
/* It has been confirmed - so remove the entry */
dccp_feat_list_pop(entry);
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -105,7 +105,6 @@ static void ccid2_change_l_ack_ratio(str
return;
ccid2_pr_debug("changing local ack ratio to %u\n", val);
- dp->dccps_l_ack_ratio = val;
dccp_feat_signal_nn_change(sk, DCCPF_ACK_RATIO, val);
}
@@ -117,11 +116,9 @@ static void ccid2_change_l_seq_window(st
val = DCCPF_SEQ_WMIN;
if (val > DCCPF_SEQ_WMAX)
val = DCCPF_SEQ_WMAX;
- if (val == dp->dccps_l_seq_win)
- return;
- dp->dccps_l_seq_win = val;
- dccp_feat_signal_nn_change(sk, DCCPF_SEQUENCE_WINDOW, val);
+ if (val != dp->dccps_l_seq_win)
+ dccp_feat_signal_nn_change(sk, DCCPF_SEQUENCE_WINDOW, val);
}
static void ccid2_hc_tx_rto_expire(unsigned long data)
--
^ permalink raw reply
* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-28 11:23 UTC (permalink / raw)
To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <AANLkTimofhhH5omyk=HhkyaNG+MGqoac4rDf=dPuR7K-@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]
Quoting Albert Cahalan <acahalan@gmail.com>:
> On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
> <jussi.kivilinna@mbnet.fi> wrote:
>
>> I made simple hack on sch_fifo with per packet time limits (attachment) this
>> weekend and have been doing limited testing on wireless link. I think
>> hardlimit is fine, it's simple and does somewhat same as what
>> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
>> hack checks for timed out packets on enqueue, might be wrong approach (on
>> other hand might allow some more burstiness).
>
> Thanks!
>
> I think the default is too high. 1 ms may even be a bit high.
Well, with a 10ms buffer timeout, latency drops from >500ms to 10-20ms
on a 54Mbit wifi link (zd1211rw driver), measured as ping rtt while
iperf is running at the same time. So for that it's good enough.
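As a rough sanity check on the numbers, the amount of data a time limit corresponds to can be estimated from the link rate. The sketch below assumes the nominal 54 Mbit/s rate (real wifi throughput is lower) and uses made-up helper names; it is not part of the attached qdisc.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes a link can carry within timeout_us microseconds at rate_bps bits/s. */
static uint64_t bytes_per_timeout(uint64_t rate_bps, uint64_t timeout_us)
{
	return rate_bps * timeout_us / 8 / 1000000;
}

/* Time in microseconds needed to send len bytes at rate_bps bits/s. */
static uint64_t tx_time_us(uint64_t len, uint64_t rate_bps)
{
	assert(rate_bps > 0);
	return len * 8 * 1000000 / rate_bps;
}
```

At the nominal rate, bytes_per_timeout(54000000, 10000) is 67500 bytes, or about 45 packets of 1500 bytes. On a slow link the picture flips: tx_time_us(1500, 9600) is 1250000, so a single 1500-byte packet at an illustrative 9600 bit/s already takes 1.25 s.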
>
> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.
>
I made an EWMA version of my fifo hack (attached). I added a minimum
queue limit of 2 packets and probabilistic (1%) ECN marking/dropping
once the average delay exceeds timeout/2.
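The decision logic can be modelled in userspace roughly as follows. This is a simplified sketch, not the attached module: the names are hypothetical, the EWMA is a plain integer average with weight 64 (the value passed to ewma_init() in the attachment), and the random number is passed in by the caller so the behaviour is reproducible.

```c
#include <assert.h>

/* Possible outcomes for an arriving packet (hypothetical names). */
enum verdict { V_OK, V_MARK_OR_DROP, V_DROP };

struct delay_model {
	unsigned long avg;	/* EWMA of queueing delay, microseconds */
	unsigned long timeout;	/* per-packet time limit, microseconds */
	unsigned int weight;	/* EWMA weight */
};

/* Fold one delay sample into the average: avg = (avg*(w-1) + sample) / w. */
static void model_add(struct delay_model *m, unsigned long sample)
{
	assert(m->weight > 0);
	m->avg = m->avg ? (m->avg * (m->weight - 1) + sample) / m->weight
			: sample;
}

/*
 * Decide what to do with an arriving packet.
 * qlen: packets already queued; roll: random value in 0..999.
 */
static enum verdict model_check(const struct delay_model *m,
				unsigned int qlen, unsigned int roll)
{
	if (qlen <= 2)			/* always allow a couple of packets */
		return V_OK;
	if (m->avg > m->timeout)	/* hard limit exceeded */
		return V_DROP;
	if (m->avg > m->timeout / 2 && roll < 10)	/* 10/1000 = 1% */
		return V_MARK_OR_DROP;
	return V_OK;
}
```

V_MARK_OR_DROP stands for "set ECN CE if the packet supports it, otherwise drop", which is what the attachment does with INET_ECN_set_ce().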
-Jussi
[-- Attachment #2: sch_fifo_ewma.c --]
[-- Type: text/x-csrc, Size: 7809 bytes --]
/*
* sch_fifo_ewma.c Simple FIFO EWMA timelimit queue.
*
* This program is free software; you can redistribute it and/or modify it under
* the terms of the GNU General Public License as published by the Free Software
* Foundation; either version 2 of the License, or (at your option) any later
* version.
*
*/
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>
#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.h"
#else
#include <linux/average.h>
#endif
#define DEFAULT_PKT_TIMEOUT_MS 10
#define DEFAULT_PKT_TIMEOUT PSCHED_NS2TICKS(NSEC_PER_MSEC * \
DEFAULT_PKT_TIMEOUT_MS)
#define DEFAULT_PROB_HALF_DROP 10 /* 1% */
#define FIFO_EWMA_MIN_QDISC_LEN 2
struct tc_fifo_ewma_qopt {
__u64 timeout; /* Max time packet may stay in buffer */
__u32 limit; /* Queue length: bytes for bfifo, packets for pfifo */
};
struct fifo_ewma_skb_cb {
psched_time_t time_queued;
};
struct fifo_ewma_sched_data {
psched_tdiff_t timeout;
u32 limit;
struct ewma ewma;
};
static inline
struct fifo_ewma_skb_cb *fifo_ewma_skb_cb(struct sk_buff *skb)
{
BUILD_BUG_ON(sizeof(skb->cb) <
sizeof(struct qdisc_skb_cb) +
sizeof(struct fifo_ewma_skb_cb));
return (struct fifo_ewma_skb_cb *)qdisc_skb_cb(skb)->data;
}
static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_ewma_sched_data *q = qdisc_priv(sch);
if (likely(skb_queue_len(&sch->q) < q->limit))
return qdisc_enqueue_tail(skb, sch);
/* queue full, remove one skb to fulfill the limit */
__qdisc_queue_drop_head(sch, &sch->q);
sch->qstats.drops++;
qdisc_enqueue_tail(skb, sch);
return NET_XMIT_CN;
}
static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_ewma_sched_data *q = qdisc_priv(sch);
if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
return qdisc_enqueue_tail(skb, sch);
return qdisc_reshape_fail(skb, sch);
}
static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_ewma_sched_data *q = qdisc_priv(sch);
if (likely(skb_queue_len(&sch->q) < q->limit))
return qdisc_enqueue_tail(skb, sch);
return qdisc_reshape_fail(skb, sch);
}
/* Return a pseudo-random value in the range 0..1000 (per-mille scale). */
static inline int fifo_get_prob(void)
{
	return (net_random() & 0xffff) * 1000 / 0xffff;
}
static struct sk_buff *fifo_ewma_dequeue(struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	psched_tdiff_t tdiff;

	/* no time limit configured: behave like a plain FIFO */
	if (likely(!q->timeout))
		goto no_ewma;

	skb = qdisc_peek_head(sch);
	if (!skb)
		return NULL;

	/* update EWMA with the time this packet spent in the queue */
	tdiff = psched_get_time() - fifo_ewma_skb_cb(skb)->time_queued;
	ewma_add(&q->ewma, tdiff);

no_ewma:
	return qdisc_dequeue_head(sch);
}
#define FIFO_EWMA_OK 0
#define FIFO_EWMA_DROP 1
#define FIFO_EWMA_CN 2
static int fifo_check_ewma_drop(struct sk_buff *skb, struct Qdisc *sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	unsigned long fifo_latency_avg;
	int ret = FIFO_EWMA_OK;

	/*
	 * A zero timeout would disable the time limit; note that
	 * fifo_ewma_init() never leaves it at zero.
	 */
	if (likely(!q->timeout))
		goto no_ewma;

	/* lower limit: never drop while the queue is this short */
	if (skb_queue_len(&sch->q) <= FIFO_EWMA_MIN_QDISC_LEN)
		goto no_drop;

	fifo_latency_avg = ewma_read(&q->ewma);

	/* hard drop: average queueing delay exceeds the timeout */
	if (fifo_latency_avg > q->timeout) {
		/*printk(KERN_WARNING "fifo_ewma: hard drop\n");*/
		return FIFO_EWMA_DROP;
	}

	/* probabilistic drop: above timeout/2, mark (or drop) 1% of packets */
	if (fifo_latency_avg > q->timeout / 2 &&
	    fifo_get_prob() < DEFAULT_PROB_HALF_DROP) {
		/* packet is not ECN capable, so drop it instead of marking */
		if (!INET_ECN_set_ce(skb)) {
			/*printk(KERN_WARNING "fifo_ewma: prob drop\n");*/
			return FIFO_EWMA_DROP;
		}

		/*printk(KERN_WARNING "fifo_ewma: prob mark\n");*/
		ret = FIFO_EWMA_CN;
	}

no_drop:
	/* timestamp the packet so dequeue can measure its queueing delay */
	fifo_ewma_skb_cb(skb)->time_queued = psched_get_time();
no_ewma:
	return ret;
}
static int pfifo_ewma_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
int ewma_action, ret;
ewma_action = fifo_check_ewma_drop(skb, sch);
if (unlikely(ewma_action == FIFO_EWMA_DROP))
return qdisc_drop(skb, sch);
ret = pfifo_tail_enqueue(skb, sch);
if (unlikely(ret != NET_XMIT_SUCCESS))
return ret;
return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}
static int bfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
int ewma_action, ret;
ewma_action = fifo_check_ewma_drop(skb, sch);
if (unlikely(ewma_action == FIFO_EWMA_DROP))
return qdisc_drop(skb, sch);
ret = bfifo_enqueue(skb, sch);
if (unlikely(ret != NET_XMIT_SUCCESS))
return ret;
return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}
static int pfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
int ewma_action, ret;
ewma_action = fifo_check_ewma_drop(skb, sch);
if (unlikely(ewma_action == FIFO_EWMA_DROP))
return qdisc_drop(skb, sch);
ret = pfifo_enqueue(skb, sch);
if (unlikely(ret != NET_XMIT_SUCCESS))
return ret;
return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}
static int fifo_ewma_init(struct Qdisc *sch, struct nlattr *opt)
{
struct fifo_ewma_sched_data *q = qdisc_priv(sch);
if (opt == NULL) {
u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;
q->limit = limit;
q->timeout = DEFAULT_PKT_TIMEOUT;
} else {
struct tc_fifo_ewma_qopt *ctl = nla_data(opt);
if (nla_len(opt) < sizeof(*ctl))
return -EINVAL;
q->limit = ctl->limit;
q->timeout = ctl->timeout ? : DEFAULT_PKT_TIMEOUT;
}
ewma_init(&q->ewma, 1, 64);
return 0;
}
static int fifo_ewma_dump(struct Qdisc *sch, struct sk_buff *skb)
{
struct fifo_ewma_sched_data *q = qdisc_priv(sch);
struct tc_fifo_ewma_qopt opt = {
.limit = q->limit,
.timeout = q->timeout
};
NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
return skb->len;
nla_put_failure:
return -1;
}
static struct Qdisc_ops pfifo_ewma_qdisc_ops __read_mostly = {
.id = "pfifo_ewma",
.priv_size = sizeof(struct fifo_ewma_sched_data),
.enqueue = pfifo_ewma_enqueue,
.dequeue = fifo_ewma_dequeue,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop,
.init = fifo_ewma_init,
.reset = qdisc_reset_queue,
.change = fifo_ewma_init,
.dump = fifo_ewma_dump,
.owner = THIS_MODULE,
};
static struct Qdisc_ops bfifo_ewma_qdisc_ops __read_mostly = {
.id = "bfifo_ewma",
.priv_size = sizeof(struct fifo_ewma_sched_data),
.enqueue = bfifo_ewma_enqueue,
.dequeue = fifo_ewma_dequeue,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop,
.init = fifo_ewma_init,
.reset = qdisc_reset_queue,
.change = fifo_ewma_init,
.dump = fifo_ewma_dump,
.owner = THIS_MODULE,
};
static struct Qdisc_ops pfifo_head_drop_ewma_qdisc_ops __read_mostly = {
.id = "pfifo_hd_ewma",
.priv_size = sizeof(struct fifo_ewma_sched_data),
.enqueue = pfifo_ewma_tail_enqueue,
.dequeue = fifo_ewma_dequeue,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop_head,
.init = fifo_ewma_init,
.reset = qdisc_reset_queue,
.change = fifo_ewma_init,
.dump = fifo_ewma_dump,
.owner = THIS_MODULE,
};
static int __init fifo_ewma_module_init(void)
{
int retval;
retval = register_qdisc(&pfifo_ewma_qdisc_ops);
if (retval)
goto cleanup;
retval = register_qdisc(&bfifo_ewma_qdisc_ops);
if (retval)
goto cleanup;
retval = register_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
if (retval)
goto cleanup;
return 0;
cleanup:
unregister_qdisc(&pfifo_ewma_qdisc_ops);
unregister_qdisc(&bfifo_ewma_qdisc_ops);
unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
return retval;
}
static void __exit fifo_ewma_module_exit(void)
{
unregister_qdisc(&pfifo_ewma_qdisc_ops);
unregister_qdisc(&bfifo_ewma_qdisc_ops);
unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
}
module_init(fifo_ewma_module_init)
module_exit(fifo_ewma_module_exit)
MODULE_LICENSE("GPL");
#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.c"
#endif
^ permalink raw reply
* Re: dccp: null-pointer dereference on close
From: Gerrit Renker @ 2011-02-28 11:21 UTC (permalink / raw)
To: Johan Hovold; +Cc: Arnaldo Carvalho de Melo, David S. Miller, dccp, netdev
In-Reply-To: <20110226174505.GB3609@localhost>
On 32/64-bit x86 the problem has not been seen so far.
The problem seems to be that icsk->icsk_bind_hash is NULL in
__inet_twsk_hashdance():
140 tw->tw_tb = icsk->icsk_bind_hash;
141 WARN_ON(!icsk->icsk_bind_hash);
Will be looking at this later on today - any hints how to reproduce would be appreciated.
Gerrit
Quoting Johan Hovold:
| Hi,
|
| I triggered the null-pointer dereference below when closing a dccp
| socket on 2.6.37 the other day. The receive path is hit during
| close, and the socket has already been unhashed in dccp_set_state from
| dccp_close.
|
| Thanks,
| Johan
|
|
| root@overo:~# [84140.128631] ------------[ cut here ]------------
| [84140.133575] WARNING: at net/ipv4/inet_timewait_sock.c:141 __inet_twsk_hashdance+0x48/0x128()
| [84140.142517] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
| [84140.151794] [<c0038850>] (unwind_backtrace+0x0/0xec) from [<c0055364>] (warn_slowpath_common)
| [84140.161743] [<c0055364>] (warn_slowpath_common+0x4c/0x64) from [<c0055398>] (warn_slowpath_n)
| [84140.171966] [<c0055398>] (warn_slowpath_null+0x1c/0x24) from [<c02b72d0>] (__inet_twsk_hashd)
| [84140.182373] [<c02b72d0>] (__inet_twsk_hashdance+0x48/0x128) from [<c031caa0>] (dccp_time_wai)
| [84140.192413] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
| [84140.202636] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
| [84140.213043] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
| [84140.222442] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
| [84140.231475] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
| [84140.240386] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
| [84140.249328] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
| [84140.258087] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
| [84140.266296] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
| [84140.274505] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
| [84140.283081] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
| [84140.292114] ---[ end trace b8877ec9d542c32e ]---
| [84140.296997] Unable to handle kernel NULL pointer dereference at virtual address 00000010
| [84140.305541] pgd = cedb0000
| [84140.308410] [00000010] *pgd=8ed22031, *pte=00000000, *ppte=00000000
| [84140.315032] Internal error: Oops: 17 [#1] PREEMPT
| [84140.320007] last sysfs file: /sys/kernel/uevent_seqnum
| [84140.325408] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
| [84140.334533] CPU: 0 Tainted: G WC (2.6.37+ #47)
| [84140.340332] PC is at __inet_twsk_hashdance+0x4c/0x128
| [84140.345642] LR is at warn_slowpath_null+0x1c/0x24
| [84140.350616] pc : [<c02b72d4>] lr : [<c0055398>] psr: 60000013
| [84140.350616] sp : ce975e68 ip : ce975db8 fp : cfbc5c00
| [84140.362701] r10: cfa3e400 r9 : cfbc5c18 r8 : 00000000
| [84140.368225] r7 : 00000006 r6 : cfa96110 r5 : cfa3e400 r4 : cfb54000
| [84140.375091] r3 : 00000002 r2 : 00000006 r1 : 00000000 r0 : 00000000
| [84140.381988] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
| [84140.389495] Control: 10c5387d Table: 8edb0019 DAC: 00000015
| [84140.395538] Process be2p_ctrl (pid: 2207, stack limit = 0xce9742f0)
| [84140.402160] Stack: (0xce975e68 to 0xce976000)
| [84140.406738] 5e60: cfb54000 00000180 cfa3e400 c031caa0 00000007 cfbc5c00
| [84140.415374] 5e80: cfbc9824 00000020 00000007 c031c15c 00000000 00000022 00000000 00000008
| [84140.424011] 5ea0: 00000001 cfbc5c00 cfbc5c00 cfa3e400 cfbc9824 00000000 00000001 c04c11b8
| [84140.432617] 5ec0: be8ffc1c c032609c fa200000 c0033608 cfa3e400 cfa3e7b0 be8ffc1c ce975ee8
| [84140.441253] 5ee0: be8ffc1c cfbc5c00 cfa3e400 ce974000 00000000 c0286594 cfa3e474 cfa3e400
| [84140.449859] 5f00: cfa3e408 00000007 cf487c20 cf805840 cf60ca00 c031fd34 00000000 00000000
| [84140.458496] 5f20: cfb20288 cfa3e400 cf487c00 00000008 00000000 c02d9a78 00000003 00000000
| [84140.467102] 5f40: cf487c00 c0284ddc 00000000 cfb20288 cfb20280 c0284e94 00000000 c00c2e4c
| [84140.475738] 5f60: 00000000 00000000 cfb20280 00000000 cfbc50c0 00000006 c0033c04 ce974000
| [84140.484375] 5f80: 00000000 c00c0104 00000004 cfbc50c0 cfb20280 c00c01c4 400a1000 00000000
| [84140.492980] 5fa0: 0000891c c0033a80 400a1000 00000000 00000004 00000000 403d3014 00000000
| [84140.501617] 5fc0: 400a1000 00000000 0000891c 00000006 00000000 00000000 400a9000 be8ffc1c
| [84140.510223] 5fe0: 00000000 be8ffbe0 00009584 4036320c 60000010 00000004 00005153 bf0fa7d0
| [84140.518859] [<c02b72d4>] (__inet_twsk_hashdance+0x4c/0x128) from [<c031caa0>] (dccp_time_wai)
| [84140.528869] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
| [84140.539062] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
| [84140.549407] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
| [84140.558776] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
| [84140.567779] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
| [84140.576660] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
| [84140.585571] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
| [84140.594299] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
| [84140.602447] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
| [84140.610626] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
| [84140.619171] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
| [84140.628143] Code: e59f00dc e3a0108d ebf6782a e5941044 (e5912010)
| [84140.634643] ---[ end trace b8877ec9d542c32f ]---
| [84140.639526] Kernel panic - not syncing: Fatal exception in interrupt
|
| --
| To unsubscribe from this list: send the line "unsubscribe dccp" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at http://vger.kernel.org/majordomo-info.html
|
--
^ permalink raw reply
* Re: Bug inkvm_set_irq
From: Jean-Philippe Menil @ 2011-02-28 10:40 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20110228101139.GD28006@redhat.com>
On 28/02/2011 11:11, Michael S. Tsirkin wrote:
> On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
>> On 27/02/2011 18:00, Michael S. Tsirkin wrote:
>>> On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
>>>> Hi,
>>>>
>>>> Each time I try to use vhost_net, I'm facing a kernel bug.
>>>> I do a "modprobe vhost_net", and start the guest with vhost=on.
>>>>
>>>> Following is a trace with a kernel 2.6.37, but i had the same
>>>> problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
>>> 2.6.36 had a theoretical race that could explain this,
>>> but it should be ok in 2.6.37.
>>>
>>>> The bug only occurs whith vhost_net charged, so i don't know if this
>>>> is a bug in kvm module code or in the vhost_net code.
>>> It could be a bug in eventfd which is the interface
>>> used by both kvm and vhost_net.
>>> Just for fun, you can try 2.6.38 - eventfd code has been changed
>>> a lot in 2.6.38 and if it does not trigger there
>>> it's a hint that irqfd is the reason.
>>>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243100] BUG: unable to handle kernel paging request at
>>>> 0000000000002458
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>> Could you run markup_oops/ ksymoops on this please?
>>> As far as I can see kvm_set_irq can only get a wrong
>>> kvm pointer. Unless there's some general memory corruption,
>>> I'd guess
>>>
>>> You can also try comparing the irqfd->kvm pointer in
>>> kvm_irqfd_assign irqfd_wakeup and kvm_set_irq in
>>> virt/kvm/eventfd.c.
>>>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243556] Oops: 0000 [#1] SMP
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243692] last sysfs file:
>>>> /sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243820] Modules linked in: vhost_net macvtap macvlan tun
>>>> powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
>>>> cpufreq_ondemand fre
>>>> q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
>>>> ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
>>>> xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
>>>> nf_conntrack_ftp nf_connt
>>>> rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
>>>> dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
>>>> nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
>>>> snd_page_alloc tpm_tis tpm ps
>>>> mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
>>>> serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
>>>> xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
>>>> sd_mod enclosu
>>>> re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
>>>> scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Pid: 10, comm: kworker/0:1 Not tainted
>>>> 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RIP: 0010:[<ffffffffa041aa8a>] [<ffffffffa041aa8a>]
>>>> kvm_set_irq+0x2a/0x130 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RSP: 0018:ffff88045fc89d30 EFLAGS: 00010246
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
>>>> 0000000000000001
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
>>>> ffff880856a91e48
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] FS: 00007f617986c710(0000) GS:ffff88007f800000(0000)
>>>> knlGS:0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
>>>> 00000000000006f0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>> 0000000000000400
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Process kworker/0:1 (pid: 10, threadinfo
>>>> ffff88045fc88000, task ffff88085fc53c30)
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack:
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
>>>> ffff88085fc53ee8
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
>>>> 00000000000119c0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] 00000000000119c0 ffffffff8137f7ce ffff88007f80df40
>>>> 00000000ffffffff
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Call Trace:
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106be25>] ? worker_thread+0x145/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106f786>] ? kthread+0x96/0xa0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
>>>> fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
>>>> 00 00 00<4
>>>> 9> 8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RIP [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RSP<ffff88045fc89d30>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CR2: 0000000000002458
>>>>
>>>>
>>>> If someone can help me, on how to solve this.
>>>>
>>>> Regards.
>>>> _______________________________________________
>>>> Virtualization mailing list
>>>> Virtualization@lists.linux-foundation.org
>>>> https://lists.linux-foundation.org/mailman/listinfo/virtualization
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Hi,
>>
>> thanks for your response.
>>
>> This is what markup_oops.pl returns:
>> "No matching code found"
> Well, let's try to understand what's there.
>
> Do objdump -ldS kvm.ko
> look for <kvm_set_irq>
>
> and paste the content from start of that function
> to offset 0x2a and a bit beyond.
>
> You can also upload your kvm.ko somewhere, I'll try to take a look.
>
>
>> So either this is not a vhost_net bug, or my oops is incomplete and
>> markup_oops can't find the right vma offset.
>>
>> I will try to compare the pointers you mentioned, even though it could be
>> a little difficult for me.
> Hmm you know how to add printk to code and rebuild, right?
>
>> Maybe I will try 2.6.38; I will wait for a response from the kvm team.
>>
>> Regards.
>>
>> --
>> Jean-Philippe Menil - Pôle réseau Service IRTS
>> DSI Université de Nantes
>> jean-philippe.menil@univ-nantes.fr
>> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
So, here is the result for the objdump against the kvm.ko (the
kvm_set_irq part) :
0000000000006a60 <kvm_set_irq>:
kvm_set_irq():
6a60: 41 57 push %r15
6a62: 41 89 f7 mov %esi,%r15d
6a65: 41 56 push %r14
6a67: 41 55 push %r13
6a69: 41 89 cd mov %ecx,%r13d
6a6c: 41 54 push %r12
6a6e: 49 89 fc mov %rdi,%r12
6a71: 55 push %rbp
6a72: 53 push %rbx
6a73: 89 d3 mov %edx,%ebx
6a75: 48 81 ec 98 00 00 00 sub $0x98,%rsp
6a7c: 8b 15 00 00 00 00 mov 0x0(%rip),%edx #
6a82 <kvm_set_irq+0x22>
6a82: 85 d2 test %edx,%edx
6a84: 0f 85 c4 00 00 00 jne 6b4e <kvm_set_irq+0xee>
6a8a: 49 8b 84 24 58 24 00 mov 0x2458(%r12),%rax
6a91: 00
6a92: 3b 98 28 01 00 00 cmp 0x128(%rax),%ebx
6a98: 73 5e jae 6af8 <kvm_set_irq+0x98>
6a9a: 89 db mov %ebx,%ebx
6a9c: 48 8b 84 d8 30 01 00 mov 0x130(%rax,%rbx,8),%rax
6aa3: 00
6aa4: 48 85 c0 test %rax,%rax
6aa7: 74 4f je 6af8 <kvm_set_irq+0x98>
6aa9: 48 89 e2 mov %rsp,%rdx
6aac: 31 db xor %ebx,%ebx
6aae: 48 8b 08 mov (%rax),%rcx
6ab1: 83 c3 01 add $0x1,%ebx
6ab4: 0f 18 09 prefetcht0 (%rcx)
6ab7: 48 8b 48 e0 mov -0x20(%rax),%rcx
6abb: 48 89 0a mov %rcx,(%rdx)
6abe: 48 8b 48 e8 mov -0x18(%rax),%rcx
6ac2: 48 89 4a 08 mov %rcx,0x8(%rdx)
6ac6: 48 8b 48 f0 mov -0x10(%rax),%rcx
6aca: 48 89 4a 10 mov %rcx,0x10(%rdx)
6ace: 48 8b 48 f8 mov -0x8(%rax),%rcx
6ad2: 48 89 4a 18 mov %rcx,0x18(%rdx)
6ad6: 48 8b 08 mov (%rax),%rcx
6ad9: 48 89 4a 20 mov %rcx,0x20(%rdx)
6add: 48 8b 48 08 mov 0x8(%rax),%rcx
6ae1: 48 89 4a 28 mov %rcx,0x28(%rdx)
6ae5: 48 8b 00 mov (%rax),%rax
6ae8: 48 83 c2 30 add $0x30,%rdx
6aec: 48 85 c0 test %rax,%rax
6aef: 75 bd jne 6aae <kvm_set_irq+0x4e>
6af1: eb 07 jmp 6afa <kvm_set_irq+0x9a>
6af3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
6af8: 31 db xor %ebx,%ebx
6afa: bd ff ff ff ff mov $0xffffffff,%ebp
6aff: 49 89 e6 mov %rsp,%r14
6b02: 85 db test %ebx,%ebx
6b04: 74 34 je 6b3a <kvm_set_irq+0xda>
6b06: 83 eb 01 sub $0x1,%ebx
6b09: 44 89 e9 mov %r13d,%ecx
6b0c: 44 89 fa mov %r15d,%edx
6b0f: 48 63 c3 movslq %ebx,%rax
6b12: 4c 89 e6 mov %r12,%rsi
6b15: 48 8d 04 40 lea (%rax,%rax,2),%rax
6b19: 48 c1 e0 04 shl $0x4,%rax
6b1d: 49 8d 3c 06 lea (%r14,%rax,1),%rdi
6b21: ff 54 04 08 callq *0x8(%rsp,%rax,1)
6b25: 85 c0 test %eax,%eax
6b27: 78 d9 js 6b02 <kvm_set_irq+0xa2>
6b29: 85 ed test %ebp,%ebp
6b2b: ba 00 00 00 00 mov $0x0,%edx
6b30: 0f 48 ea cmovs %edx,%ebp
6b33: 85 db test %ebx,%ebx
6b35: 8d 2c 28 lea (%rax,%rbp,1),%ebp
6b38: 75 cc jne 6b06 <kvm_set_irq+0xa6>
6b3a: 48 81 c4 98 00 00 00 add $0x98,%rsp
6b41: 89 e8 mov %ebp,%eax
6b43: 5b pop %rbx
6b44: 5d pop %rbp
6b45: 41 5c pop %r12
6b47: 41 5d pop %r13
6b49: 41 5e pop %r14
6b4b: 41 5f pop %r15
6b4d: c3 retq
6b4e: 48 8b 2d 00 00 00 00 mov 0x0(%rip),%rbp #
6b55 <kvm_set_irq+0xf5>
6b55: 48 85 ed test %rbp,%rbp
6b58: 0f 84 2c ff ff ff je 6a8a <kvm_set_irq+0x2a>
6b5e: 48 8b 45 00 mov 0x0(%rbp),%rax
6b62: 48 8b 7d 08 mov 0x8(%rbp),%rdi
6b66: 48 83 c5 10 add $0x10,%rbp
6b6a: 44 89 f9 mov %r15d,%ecx
6b6d: 44 89 ea mov %r13d,%edx
6b70: 89 de mov %ebx,%esi
6b72: ff d0 callq *%rax
6b74: 48 8b 45 00 mov 0x0(%rbp),%rax
6b78: 48 85 c0 test %rax,%rax
6b7b: 75 e5 jne 6b62 <kvm_set_irq+0x102>
6b7d: e9 08 ff ff ff jmpq 6a8a <kvm_set_irq+0x2a>
6b82: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
6b89: 1f 84 00 00 00 00 00
I admit that this analysis is too complicated for me.
I can indeed rebuild a kernel with extra printk calls and schedule a reboot.
The kvm.ko is available at the following address:
http://filex.univ-nantes.fr/get?k=k1jKhQghdcHLz12Z50H
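For what it is worth, the listing can be lined up with the oops as follows (a sketch only): 6a60 + 0x2a = 6a8a, which is the "mov 0x2458(%r12),%rax" instruction, and its bytes (49 8b 84 24 58 24 00 00) match the marked position in the Code: line. %r12 is loaded from %rdi at 6a6e, the register dump shows R12 = 0, and 0 + 0x2458 equals the reported CR2. The small userspace C check below only restates that arithmetic; the macro and function names are made up for illustration and are not kernel names.

```c
#include <stdint.h>

/* Values copied from the oops and the objdump listing quoted in this thread. */
#define FUNC_START   0x6a60ULL  /* address of <kvm_set_irq> in the objdump output */
#define RIP_OFFSET   0x2aULL    /* kvm_set_irq+0x2a from the RIP line */
#define DISPLACEMENT 0x2458ULL  /* from "mov 0x2458(%r12),%rax" */
#define R12_VALUE    0x0ULL     /* R12 in the register dump */
#define CR2_VALUE    0x2458ULL  /* faulting address reported in CR2 */

/* Address of the faulting instruction inside the objdump listing. */
static uint64_t faulting_insn(void)
{
    return FUNC_START + RIP_OFFSET;
}

/* Effective address of a base+displacement memory operand. */
static uint64_t effective_address(uint64_t base, uint64_t disp)
{
    return base + disp;
}

/* Returns 1 if a zero base register explains the CR2 value, 0 otherwise. */
static int fault_matches_zero_base(void)
{
    return effective_address(R12_VALUE, DISPLACEMENT) == CR2_VALUE;
}
```

If this reading is right, it is consistent with the guess that kvm_set_irq was handed a wrong (here zero) first argument; it says nothing yet about where that pointer came from.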
Regards.
--
Jean-Philippe Menil - Pôle réseau Service IRTS
DSI Université de Nantes
jean-philippe.menil@univ-nantes.fr
Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
^ permalink raw reply
* Re: Bug in kvm_set_irq
From: Michael S. Tsirkin @ 2011-02-28 10:11 UTC (permalink / raw)
To: Jean-Philippe Menil; +Cc: kvm, netdev, virtualization
In-Reply-To: <4D6B634E.9090801@univ-nantes.fr>
On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
> On 27/02/2011 18:00, Michael S. Tsirkin wrote:
> >On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> >>Hi,
> >>
> >>Each time I try to use vhost_net, I'm facing a kernel bug.
> >>I do a "modprobe vhost_net", and start the guest with vhost=on.
> >>
> >>Following is a trace with a kernel 2.6.37, but i had the same
> >>problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
> >2.6.36 had a theoretical race that could explain this,
> >but it should be ok in 2.6.37.
> >
> >>The bug only occurs with vhost_net loaded, so I don't know if this
> >>is a bug in kvm module code or in the vhost_net code.
> >It could be a bug in eventfd which is the interface
> >used by both kvm and vhost_net.
> >Just for fun, you can try 2.6.38 - eventfd code has been changed
> >a lot in 2.6.38 and if it does not trigger there
> >it's a hint that irqfd is the reason.
> >
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243100] BUG: unable to handle kernel paging request at
> >>0000000000002458
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >
> >Could you run markup_oops/ ksymoops on this please?
> >As far as I can see kvm_set_irq can only get a wrong
> >kvm pointer. Unless there's some general memory corruption,
> >I'd guess
> >
> >You can also try comparing the irqfd->kvm pointer in
> >kvm_irqfd_assign irqfd_wakeup and kvm_set_irq in
> >virt/kvm/eventfd.c.
> >
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243556] Oops: 0000 [#1] SMP
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243692] last sysfs file:
> >>/sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243820] Modules linked in: vhost_net macvtap macvlan tun
> >>powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> >>cpufreq_ondemand fre
> >>q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
> >>ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
> >>xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> >>nf_conntrack_ftp nf_connt
> >>rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
> >>dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> >>nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> >>snd_page_alloc tpm_tis tpm ps
> >>mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
> >>serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
> >>xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
> >>sd_mod enclosu
> >>re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
> >>scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> >>2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RIP: 0010:[<ffffffffa041aa8a>] [<ffffffffa041aa8a>]
> >>kvm_set_irq+0x2a/0x130 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RSP: 0018:ffff88045fc89d30 EFLAGS: 00010246
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
> >>0000000000000001
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
> >>ffff880856a91e48
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] FS: 00007f617986c710(0000) GS:ffff88007f800000(0000)
> >>knlGS:0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
> >>00000000000006f0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >>0000000000000400
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Process kworker/0:1 (pid: 10, threadinfo
> >>ffff88045fc88000, task ffff88085fc53c30)
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack:
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
> >>ffff88085fc53ee8
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
> >>00000000000119c0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] 00000000000119c0 ffffffff8137f7ce ffff88007f80df40
> >>00000000ffffffff
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Call Trace:
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106be25>] ? worker_thread+0x145/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106f786>] ? kthread+0x96/0xa0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
> >>fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
> >>00 00 00<4
> >>9> 8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RIP [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RSP<ffff88045fc89d30>
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CR2: 0000000000002458
> >>
> >>
> >>If someone can help me, on how to solve this.
> >>
> >>Regards.
>
> Hi,
>
> thanks for your response.
>
> This is what markup_oops.pl returns:
> "No matching code found"
Well, let's try to understand what's there.
Do objdump -ldS kvm.ko
look for <kvm_set_irq>
and paste the content from start of that function
to offset 0x2a and a bit beyond.
You can also upload your kvm.ko somewhere, I'll try to take a look.
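In case it helps with reading the oops line: kvm_set_irq+0x2a/0x130 means offset 0x2a into a function that is 0x130 bytes long, so the instruction to look at in the objdump listing sits at the function's start address plus 0x2a. A minimal sketch of that arithmetic in plain userspace C follows; the struct and helper names are made up for illustration, and the start address has to come from your own objdump output.

```c
#include <stdint.h>
#include <stdio.h>

/* Parsed form of an oops location such as "kvm_set_irq+0x2a/0x130". */
struct oops_loc {
    char symbol[64];
    uint64_t offset;  /* bytes from the start of the function */
    uint64_t length;  /* total size of the function in bytes */
};

/* Parse "sym+0xOFF/0xLEN"; returns 0 on success, -1 on malformed input. */
static int parse_oops_loc(const char *text, struct oops_loc *out)
{
    unsigned long long off, len;

    if (sscanf(text, "%63[^+]+0x%llx/0x%llx", out->symbol, &off, &len) != 3)
        return -1;
    if (off >= len)  /* the offset must fall inside the function */
        return -1;
    out->offset = off;
    out->length = len;
    return 0;
}

/* Address to look for in the listing, given the function's start address. */
static uint64_t listing_address(uint64_t func_start, const struct oops_loc *loc)
{
    return func_start + loc->offset;
}
```

For example, with a hypothetical start address of 0x1000, listing_address() would point at 0x102a.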
> So either this is not a vhost_net bug, or my oops is incomplete and
> markup_oops can't find the right vma offset.
>
> I will try to compare the pointers you mentioned, even though it could be
> a little difficult for me.
Hmm you know how to add printk to code and rebuild, right?
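To give an idea of the kind of tracing meant here, a rough userspace stand-in (fprintf instead of printk, and a made-up struct and helper rather than the real ones in virt/kvm/eventfd.c): log the pointer at each site and check that it is still the one stored at assign time.

```c
#include <stdio.h>

/* Hypothetical stand-in for the irqfd object; only the fields we log. */
struct irqfd_sketch {
    void *kvm;  /* pointer stored when the irqfd was assigned */
    int gsi;
};

/*
 * Log the kvm pointer seen at a call site and report whether it still
 * matches the value recorded at assign time. Returns 1 on a match and
 * 0 otherwise. In the kernel this would be a printk() at the same places.
 */
static int trace_kvm_ptr(const char *site, const struct irqfd_sketch *irqfd,
                         const void *expected_kvm)
{
    int same = (irqfd->kvm == expected_kvm);

    fprintf(stderr, "%s: irqfd=%p kvm=%p gsi=%d%s\n", site,
            (const void *)irqfd, irqfd->kvm, irqfd->gsi,
            same ? "" : " (MISMATCH)");
    return same;
}
```

Calling this once where the irqfd is set up and once just before the interrupt is injected should show whether the pointer changes in between.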
>
> Maybe I will try 2.6.38; I will wait for a response from the kvm team.
>
> Regards.
>
> --
> Jean-Philippe Menil - Pôle réseau Service IRTS
> DSI Université de Nantes
> jean-philippe.menil@univ-nantes.fr
> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
^ permalink raw reply
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy and source mac selection mode
From: Oleg V. Ukhno @ 2011-02-28 10:09 UTC (permalink / raw)
To: David Miller; +Cc: netdev, fubar
In-Reply-To: <20110227.160823.35047612.davem@davemloft.net>
David, thank you for the reply.
This is actually the second version of the patch discussed previously in
http://patchwork.ozlabs.org/patch/78994/
I've reworked that patch into the current version
(patchwork.ozlabs.org/patch/83389/) in the way Jay suggested.
Jay, could you please comment on the reworked patch?
On 02/28/2011 03:08 AM, David Miller wrote:
> From: "Oleg V. Ukhno"<olegu@yandex-team.ru>
> Date: Wed, 16 Feb 2011 22:13:41 +0300
>
>>
> Can we get some feedback on this patch from bonding folks?
>
> I'm not applying it blindly without at least one bonding developer
> saying it at least looks ok.
>
> Thanks.
>
--
Best regards,
Head of Operations for
Commercial and Financial Services
Yandex LLC
Oleg Ukhno
^ permalink raw reply
* Re: [PATCH 3/3] [RFC] Changes for MQ vhost
From: Michael S. Tsirkin @ 2011-02-28 10:04 UTC (permalink / raw)
To: Krishna Kumar
Cc: rusty, davem, eric.dumazet, arnd, netdev, horms, avi, anthony,
kvm
In-Reply-To: <20110228063443.24908.38147.sendpatchset@krkumar2.in.ibm.com>
On Mon, Feb 28, 2011 at 12:04:43PM +0530, Krishna Kumar wrote:
> Changes for mq vhost.
>
> vhost_net_open is changed to allocate a vhost_net and return.
> The remaining initializations are delayed till SET_OWNER.
> SET_OWNER is changed so that the argument is used to determine
> how many txqs to use. Unmodified qemu's will pass NULL, so
> this is recognized and handled as numtxqs=1.
>
> The number of vhost threads is <= #txqs. Threads handle more
> than one txq when #txqs is more than MAX_VHOST_THREADS (4).
Is it this sharing that prevents us from just reusing multiple vhost
descriptors? 4 seems a bit arbitrary - do you have an explanation
for why this is a good number?
> The same thread handles both RX and TX - tested with tap/bridge
> so far (TBD: needs some changes in macvtap driver to support
> the same).
>
> I had to convert space->tab in vhost_attach_cgroups* to avoid
> checkpatch errors.
Separate patch pls, I'll apply that right away.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
> drivers/vhost/net.c | 295 ++++++++++++++++++++++++++--------------
> drivers/vhost/vhost.c | 225 +++++++++++++++++++-----------
> drivers/vhost/vhost.h | 39 ++++-
> 3 files changed, 378 insertions(+), 181 deletions(-)
>
> diff -ruNp org/drivers/vhost/vhost.h new/drivers/vhost/vhost.h
> --- org/drivers/vhost/vhost.h 2011-02-08 09:05:09.000000000 +0530
> +++ new/drivers/vhost/vhost.h 2011-02-28 11:48:06.000000000 +0530
> @@ -35,11 +35,11 @@ struct vhost_poll {
> wait_queue_t wait;
> struct vhost_work work;
> unsigned long mask;
> - struct vhost_dev *dev;
> + struct vhost_virtqueue *vq; /* points back to vq */
> };
>
> void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> - unsigned long mask, struct vhost_dev *dev);
> + unsigned long mask, struct vhost_virtqueue *vq);
> void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> void vhost_poll_stop(struct vhost_poll *poll);
> void vhost_poll_flush(struct vhost_poll *poll);
> @@ -108,8 +108,14 @@ struct vhost_virtqueue {
> /* Log write descriptors */
> void __user *log_base;
> struct vhost_log *log;
> + struct task_struct *worker; /* worker for this vq */
> + spinlock_t *work_lock; /* points to a dev->work_lock[] entry */
> + struct list_head *work_list; /* points to a dev->work_list[] entry */
> + int qnum; /* 0 for RX, 1 -> n-1 for TX */
Is this right?
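For comparison, the net.c hunks further down lay the queues out as numtxqs RX queues followed by numtxqs TX queues, and test qnum < nvqs / 2 to decide RX. A minimal sketch of that layout, in userspace C with helper names made up for illustration; if that reading is right, the comment above only matches the numtxqs == 1 case.

```c
#include <stdbool.h>

/*
 * Queue layout as read from the net.c hunks of the quoted patch:
 * nvqs = 2 * numtxqs, with the RX queues first and the TX queues after.
 */

/* True if qnum falls in the first half of the vq array (RX). */
static bool vq_is_rx(int qnum, int nvqs)
{
    return qnum >= 0 && qnum < nvqs / 2;
}

/* True if qnum falls in the second half of the vq array (TX). */
static bool vq_is_tx(int qnum, int nvqs)
{
    return qnum >= nvqs / 2 && qnum < nvqs;
}
```

With numtxqs = 4 (nvqs = 8), qnum 1 is an RX queue under this layout, which is where the "0 for RX, 1 -> n-1 for TX" comment stops matching.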
> };
>
> +#define MAX_VHOST_THREADS 4
> +
> struct vhost_dev {
> /* Readers use RCU to access memory table pointer
> * log base pointer and features.
> @@ -122,12 +128,33 @@ struct vhost_dev {
> int nvqs;
> struct file *log_file;
> struct eventfd_ctx *log_ctx;
> - spinlock_t work_lock;
> - struct list_head work_list;
> - struct task_struct *worker;
> + spinlock_t *work_lock[MAX_VHOST_THREADS];
> + struct list_head *work_list[MAX_VHOST_THREADS];
This looks a bit strange. Won't sticking everything in a single
array of structures rather than multiple arrays be better for cache
utilization?
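Something along these lines is the idea, as a sketch only: the struct names and the stub types are made up so that it builds in userspace, and the real code would use spinlock_t, struct list_head and struct task_struct.

```c
#include <stddef.h>

#define MAX_VHOST_THREADS 4

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct fake_spinlock { int locked; };
struct fake_list_head { struct fake_list_head *next, *prev; };
struct fake_task;

/* Hypothetical per-thread context: lock, list and worker kept together. */
struct vhost_worker_ctx {
    struct fake_spinlock work_lock;
    struct fake_list_head work_list;
    struct fake_task *worker;
};

/* Hypothetical device struct holding one array of contexts. */
struct vhost_dev_sketch {
    struct vhost_worker_ctx workers[MAX_VHOST_THREADS];
};

/* Initialize one context: unlocked, empty list, no worker yet. */
static void vhost_worker_ctx_init(struct vhost_worker_ctx *w)
{
    w->work_lock.locked = 0;
    w->work_list.next = &w->work_list;
    w->work_list.prev = &w->work_list;
    w->worker = NULL;
}

/* Initialize every context embedded in the device. */
static void vhost_dev_sketch_init(struct vhost_dev_sketch *dev)
{
    int i;

    for (i = 0; i < MAX_VHOST_THREADS; i++)
        vhost_worker_ctx_init(&dev->workers[i]);
}
```

Embedding the array also removes the per-entry kmalloc calls and the alignment check; in the kernel, an attribute such as ____cacheline_aligned_in_smp on the struct might cover the alignment concern, though that is only a suggestion.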
> };
>
> -long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> +/*
> + * Return maximum number of vhost threads needed to handle RX & TX.
> + * Upto MAX_VHOST_THREADS are started, and threads can be shared
> + * among different vq's if numtxqs > MAX_VHOST_THREADS.
> + */
> +static inline int get_nvhosts(int nvqs)
nvhosts -> nthreads?
> +{
> + return min_t(int, nvqs / 2, MAX_VHOST_THREADS);
> +}
> +
> +/*
> + * Get index of an existing thread that will handle this txq/rxq.
> + * The same thread handles both rx[index] and tx[index].
> + */
> +static inline int vhost_get_thread_index(int index, int numtxqs, int nvhosts)
> +{
> + return (index % numtxqs) % nvhosts;
> +}
> +
As the only caller passes MAX_VHOST_THREADS,
just use that?
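A quick userspace check of the mapping, copying the two helpers from the patch (min_int stands in for the kernel's min_t, and same_mapping is a made-up test helper). As far as I can tell, passing MAX_VHOST_THREADS and passing get_nvhosts(nvqs) give the same thread index for any numtxqs >= 1, because index % numtxqs is already smaller than numtxqs:

```c
#define MAX_VHOST_THREADS 4

/* Stand-in for the kernel's min_t(int, a, b). */
static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Mirrors get_nvhosts() from the quoted patch. */
static int get_nvhosts(int nvqs)
{
    return min_int(nvqs / 2, MAX_VHOST_THREADS);
}

/* Mirrors vhost_get_thread_index() from the quoted patch. */
static int vhost_get_thread_index(int index, int numtxqs, int nvhosts)
{
    return (index % numtxqs) % nvhosts;
}

/* Returns 1 if both choices of nvhosts map every vq to the same thread. */
static int same_mapping(int numtxqs)
{
    int nvqs = numtxqs * 2;
    int i;

    for (i = 0; i < nvqs; i++)
        if (vhost_get_thread_index(i, numtxqs, get_nvhosts(nvqs)) !=
            vhost_get_thread_index(i, numtxqs, MAX_VHOST_THREADS))
            return 0;
    return 1;
}
```

With numtxqs = 8 (16 vqs), rx[5] (index 5) and tx[5] (index 13) both land on thread 1.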
> +int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
> +void vhost_free_vqs(struct vhost_dev *dev);
> +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs,
> + int nvhosts);
> long vhost_dev_check_owner(struct vhost_dev *);
> long vhost_dev_reset_owner(struct vhost_dev *);
> void vhost_dev_cleanup(struct vhost_dev *);
> diff -ruNp org/drivers/vhost/net.c new/drivers/vhost/net.c
> --- org/drivers/vhost/net.c 2011-02-08 09:05:09.000000000 +0530
> +++ new/drivers/vhost/net.c 2011-02-28 11:48:53.000000000 +0530
> @@ -32,12 +32,6 @@
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> -enum {
> - VHOST_NET_VQ_RX = 0,
> - VHOST_NET_VQ_TX = 1,
> - VHOST_NET_VQ_MAX = 2,
> -};
> -
> enum vhost_net_poll_state {
> VHOST_NET_POLL_DISABLED = 0,
> VHOST_NET_POLL_STARTED = 1,
> @@ -46,12 +40,13 @@ enum vhost_net_poll_state {
>
> struct vhost_net {
> struct vhost_dev dev;
> - struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> - struct vhost_poll poll[VHOST_NET_VQ_MAX];
> + struct vhost_virtqueue *vqs;
> + struct vhost_poll *poll;
> + struct socket **socks;
> /* Tells us whether we are polling a socket for TX.
> * We only do this when socket buffer fills up.
> * Protected by tx vq lock. */
> - enum vhost_net_poll_state tx_poll_state;
> + enum vhost_net_poll_state *tx_poll_state;
another array?
> };
>
> /* Pop first len bytes from iovec. Return number of segments used. */
> @@ -91,28 +86,28 @@ static void copy_iovec_hdr(const struct
> }
>
> /* Caller must have TX VQ lock */
> -static void tx_poll_stop(struct vhost_net *net)
> +static void tx_poll_stop(struct vhost_net *net, int qnum)
> {
> - if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
> + if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
> return;
> - vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
> - net->tx_poll_state = VHOST_NET_POLL_STOPPED;
> + vhost_poll_stop(&net->poll[qnum]);
> + net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
> }
>
> /* Caller must have TX VQ lock */
> -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> +static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
> {
> - if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
> + if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
> return;
> - vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
> - net->tx_poll_state = VHOST_NET_POLL_STARTED;
> + vhost_poll_start(&net->poll[qnum], sock->file);
> + net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
> }
>
> /* Expects to be always run from workqueue - which acts as
> * read-size critical section for our kind of RCU. */
> -static void handle_tx(struct vhost_net *net)
> +static void handle_tx(struct vhost_virtqueue *vq)
> {
> - struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> + struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
> unsigned out, in, s;
> int head;
> struct msghdr msg = {
> @@ -136,7 +131,7 @@ static void handle_tx(struct vhost_net *
> wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> if (wmem >= sock->sk->sk_sndbuf) {
> mutex_lock(&vq->mutex);
> - tx_poll_start(net, sock);
> + tx_poll_start(net, sock, vq->qnum);
> mutex_unlock(&vq->mutex);
> return;
> }
> @@ -145,7 +140,7 @@ static void handle_tx(struct vhost_net *
> vhost_disable_notify(vq);
>
> if (wmem < sock->sk->sk_sndbuf / 2)
> - tx_poll_stop(net);
> + tx_poll_stop(net, vq->qnum);
> hdr_size = vq->vhost_hlen;
>
> for (;;) {
> @@ -160,7 +155,7 @@ static void handle_tx(struct vhost_net *
> if (head == vq->num) {
> wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
> - tx_poll_start(net, sock);
> + tx_poll_start(net, sock, vq->qnum);
> set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> break;
> }
> @@ -190,7 +185,7 @@ static void handle_tx(struct vhost_net *
> err = sock->ops->sendmsg(NULL, sock, &msg, len);
> if (unlikely(err < 0)) {
> vhost_discard_vq_desc(vq, 1);
> - tx_poll_start(net, sock);
> + tx_poll_start(net, sock, vq->qnum);
> break;
> }
> if (err != len)
> @@ -282,9 +277,9 @@ err:
>
> /* Expects to be always run from workqueue - which acts as
> * read-size critical section for our kind of RCU. */
> -static void handle_rx_big(struct vhost_net *net)
> +static void handle_rx_big(struct vhost_virtqueue *vq,
> + struct vhost_net *net)
> {
> - struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> unsigned out, in, log, s;
> int head;
> struct vhost_log *vq_log;
> @@ -392,9 +387,9 @@ static void handle_rx_big(struct vhost_n
>
> /* Expects to be always run from workqueue - which acts as
> * read-size critical section for our kind of RCU. */
> -static void handle_rx_mergeable(struct vhost_net *net)
> +static void handle_rx_mergeable(struct vhost_virtqueue *vq,
> + struct vhost_net *net)
> {
> - struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> unsigned uninitialized_var(in), log;
> struct vhost_log *vq_log;
> struct msghdr msg = {
> @@ -498,99 +493,196 @@ static void handle_rx_mergeable(struct v
> mutex_unlock(&vq->mutex);
> }
>
> -static void handle_rx(struct vhost_net *net)
> +static void handle_rx(struct vhost_virtqueue *vq)
> {
> + struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
> +
> if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
> - handle_rx_mergeable(net);
> + handle_rx_mergeable(vq, net);
> else
> - handle_rx_big(net);
> + handle_rx_big(vq, net);
> }
>
> static void handle_tx_kick(struct vhost_work *work)
> {
> struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> poll.work);
> - struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
>
> - handle_tx(net);
> + handle_tx(vq);
> }
>
> static void handle_rx_kick(struct vhost_work *work)
> {
> struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> poll.work);
> - struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
>
> - handle_rx(net);
> + handle_rx(vq);
> }
>
> static void handle_tx_net(struct vhost_work *work)
> {
> - struct vhost_net *net = container_of(work, struct vhost_net,
> - poll[VHOST_NET_VQ_TX].work);
> - handle_tx(net);
> + struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
> + work)->vq;
> +
> + handle_tx(vq);
> }
>
> static void handle_rx_net(struct vhost_work *work)
> {
> - struct vhost_net *net = container_of(work, struct vhost_net,
> - poll[VHOST_NET_VQ_RX].work);
> - handle_rx(net);
> + struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
> + work)->vq;
> +
> + handle_rx(vq);
> }
>
> -static int vhost_net_open(struct inode *inode, struct file *f)
> +void vhost_free_vqs(struct vhost_dev *dev)
> {
> - struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
> - struct vhost_dev *dev;
> - int r;
> + struct vhost_net *n = container_of(dev, struct vhost_net, dev);
> + int i;
>
> - if (!n)
> - return -ENOMEM;
> + if (!n->vqs)
> + return;
>
> - dev = &n->dev;
> - n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> - n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
> - r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
> - if (r < 0) {
> - kfree(n);
> - return r;
> + /* vhost_net_open does kzalloc, so this loop will not panic */
> + for (i = 0; i < get_nvhosts(dev->nvqs); i++) {
> + kfree(dev->work_list[i]);
> + kfree(dev->work_lock[i]);
> }
>
> - vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> - n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> + kfree(n->socks);
> + kfree(n->tx_poll_state);
> + kfree(n->poll);
> + kfree(n->vqs);
> +
> + /*
> + * Reset so that vhost_net_release (which gets called when
> + * vhost_dev_set_owner() call fails) will notice.
> + */
> + n->vqs = NULL;
> +}
>
> - f->private_data = n;
> +int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
> +{
> + struct vhost_net *n = container_of(dev, struct vhost_net, dev);
> + int nvhosts;
> + int i, nvqs;
> + int ret = -ENOMEM;
> +
> + if (numtxqs < 0 || numtxqs > VIRTIO_MAX_TXQS)
> + return -EINVAL;
> +
> + if (numtxqs == 0) {
> + /* Old qemu doesn't pass arguments to set_owner, use 1 txq */
> + numtxqs = 1;
> + }
> +
> + /* Get total number of virtqueues */
> + nvqs = numtxqs * 2;
> +
> + n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
> + n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
> + n->socks = kmalloc(nvqs * sizeof(*n->socks), GFP_KERNEL);
> + n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
> + GFP_KERNEL);
> + if (!n->vqs || !n->poll || !n->socks || !n->tx_poll_state)
> + goto err;
> +
> + /* Get total number of vhost threads */
> + nvhosts = get_nvhosts(nvqs);
> +
> + for (i = 0; i < nvhosts; i++) {
> + dev->work_lock[i] = kmalloc(sizeof(*dev->work_lock[i]),
> + GFP_KERNEL);
> + dev->work_list[i] = kmalloc(sizeof(*dev->work_list[i]),
> + GFP_KERNEL);
> + if (!dev->work_lock[i] || !dev->work_list[i])
> + goto err;
> + if (((unsigned long) dev->work_lock[i] & (SMP_CACHE_BYTES - 1))
> + ||
> + ((unsigned long) dev->work_list[i] & SMP_CACHE_BYTES - 1))
> + pr_debug("Unaligned buffer @ %d - Lock: %p List: %p\n",
> + i, dev->work_lock[i], dev->work_list[i]);
> + }
> +
> + /* 'numtxqs' RX followed by 'numtxqs' TX queues */
> + for (i = 0; i < numtxqs; i++)
> + n->vqs[i].handle_kick = handle_rx_kick;
> + for (; i < nvqs; i++)
> + n->vqs[i].handle_kick = handle_tx_kick;
> +
> + ret = vhost_dev_init(dev, n->vqs, nvqs, nvhosts);
> + if (ret < 0)
> + goto err;
> +
> + for (i = 0; i < numtxqs; i++)
> + vhost_poll_init(&n->poll[i], handle_rx_net, POLLIN, &n->vqs[i]);
> +
> + for (; i < nvqs; i++) {
> + vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
> + &n->vqs[i]);
> + n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
> + }
>
> return 0;
> +
> +err:
> + /* Free all pointers that may have been allocated */
> + vhost_free_vqs(dev);
> +
> + return ret;
> +}
> +
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> + int ret = -ENOMEM;
> +
> + if (n) {
> + struct vhost_dev *dev = &n->dev;
> +
> + f->private_data = n;
> + mutex_init(&dev->mutex);
> +
> + /* Defer all other initialization till user does SET_OWNER */
> + ret = 0;
> + }
> +
> + return ret;
> }
>
> static void vhost_net_disable_vq(struct vhost_net *n,
> struct vhost_virtqueue *vq)
> {
> + int qnum = vq->qnum;
> +
> if (!vq->private_data)
> return;
> - if (vq == n->vqs + VHOST_NET_VQ_TX) {
> - tx_poll_stop(n);
> - n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> - } else
> - vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
> + if (qnum < n->dev.nvqs / 2) {
> + /* qnum is less than half, we are RX */
> + vhost_poll_stop(&n->poll[qnum]);
> + } else { /* otherwise we are TX */
> + tx_poll_stop(n, qnum);
> + n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
> + }
> }
>
> static void vhost_net_enable_vq(struct vhost_net *n,
> struct vhost_virtqueue *vq)
> {
> struct socket *sock;
> + int qnum = vq->qnum;
>
> sock = rcu_dereference_protected(vq->private_data,
> lockdep_is_held(&vq->mutex));
> if (!sock)
> return;
> - if (vq == n->vqs + VHOST_NET_VQ_TX) {
> - n->tx_poll_state = VHOST_NET_POLL_STOPPED;
> - tx_poll_start(n, sock);
> - } else
> - vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
> + if (qnum < n->dev.nvqs / 2) {
> + /* qnum is less than half, we are RX */
> + vhost_poll_start(&n->poll[qnum], sock->file);
> + } else {
> + n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
> + tx_poll_start(n, sock, qnum);
> + }
> }
>
> static struct socket *vhost_net_stop_vq(struct vhost_net *n,
> @@ -607,11 +699,12 @@ static struct socket *vhost_net_stop_vq(
> return sock;
> }
>
> -static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
> - struct socket **rx_sock)
> +static void vhost_net_stop(struct vhost_net *n)
> {
> - *tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
> - *rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
> + int i;
> +
> + for (i = n->dev.nvqs - 1; i >= 0; i--)
> + n->socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
> }
>
> static void vhost_net_flush_vq(struct vhost_net *n, int index)
> @@ -622,26 +715,33 @@ static void vhost_net_flush_vq(struct vh
>
> static void vhost_net_flush(struct vhost_net *n)
> {
> - vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
> - vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> + int i;
> +
> + for (i = n->dev.nvqs - 1; i >= 0; i--)
> + vhost_net_flush_vq(n, i);
> }
>
> static int vhost_net_release(struct inode *inode, struct file *f)
> {
> struct vhost_net *n = f->private_data;
> - struct socket *tx_sock;
> - struct socket *rx_sock;
> + struct vhost_dev *dev = &n->dev;
> + int i;
>
> - vhost_net_stop(n, &tx_sock, &rx_sock);
> + vhost_net_stop(n);
> vhost_net_flush(n);
> - vhost_dev_cleanup(&n->dev);
> - if (tx_sock)
> - fput(tx_sock->file);
> - if (rx_sock)
> - fput(rx_sock->file);
> + vhost_dev_cleanup(dev);
> +
> + for (i = n->dev.nvqs - 1; i >= 0; i--)
> + if (n->socks[i])
> + fput(n->socks[i]->file);
> +
> /* We do an extra flush before freeing memory,
> * since jobs can re-queue themselves. */
> vhost_net_flush(n);
> +
> + /* Free all old pointers */
> + vhost_free_vqs(dev);
> +
> kfree(n);
> return 0;
> }
> @@ -719,7 +819,7 @@ static long vhost_net_set_backend(struct
> if (r)
> goto err;
>
> - if (index >= VHOST_NET_VQ_MAX) {
> + if (index >= n->dev.nvqs) {
> r = -ENOBUFS;
> goto err;
> }
> @@ -741,9 +841,9 @@ static long vhost_net_set_backend(struct
> oldsock = rcu_dereference_protected(vq->private_data,
> lockdep_is_held(&vq->mutex));
> if (sock != oldsock) {
> - vhost_net_disable_vq(n, vq);
> - rcu_assign_pointer(vq->private_data, sock);
> - vhost_net_enable_vq(n, vq);
> + vhost_net_disable_vq(n, vq);
> + rcu_assign_pointer(vq->private_data, sock);
> + vhost_net_enable_vq(n, vq);
> }
>
> mutex_unlock(&vq->mutex);
> @@ -765,22 +865,25 @@ err:
>
> static long vhost_net_reset_owner(struct vhost_net *n)
> {
> - struct socket *tx_sock = NULL;
> - struct socket *rx_sock = NULL;
> long err;
> + int i;
> +
> mutex_lock(&n->dev.mutex);
> err = vhost_dev_check_owner(&n->dev);
> - if (err)
> - goto done;
> - vhost_net_stop(n, &tx_sock, &rx_sock);
> + if (err) {
> + mutex_unlock(&n->dev.mutex);
> + return err;
> + }
> +
> + vhost_net_stop(n);
> vhost_net_flush(n);
> err = vhost_dev_reset_owner(&n->dev);
> -done:
> mutex_unlock(&n->dev.mutex);
> - if (tx_sock)
> - fput(tx_sock->file);
> - if (rx_sock)
> - fput(rx_sock->file);
> +
> + for (i = n->dev.nvqs - 1; i >= 0; i--)
> + if (n->socks[i])
> + fput(n->socks[i]->file);
> +
> return err;
> }
>
> @@ -809,7 +912,7 @@ static int vhost_net_set_features(struct
> }
> n->dev.acked_features = features;
> smp_wmb();
> - for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
> + for (i = 0; i < n->dev.nvqs; ++i) {
> mutex_lock(&n->vqs[i].mutex);
> n->vqs[i].vhost_hlen = vhost_hlen;
> n->vqs[i].sock_hlen = sock_hlen;
> diff -ruNp org/drivers/vhost/vhost.c new/drivers/vhost/vhost.c
> --- org/drivers/vhost/vhost.c 2011-01-19 20:01:29.000000000 +0530
> +++ new/drivers/vhost/vhost.c 2011-02-25 21:18:14.000000000 +0530
> @@ -70,12 +70,12 @@ static void vhost_work_init(struct vhost
>
> /* Init poll structure */
> void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> - unsigned long mask, struct vhost_dev *dev)
> + unsigned long mask, struct vhost_virtqueue *vq)
> {
> init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> init_poll_funcptr(&poll->table, vhost_poll_func);
> poll->mask = mask;
> - poll->dev = dev;
> + poll->vq = vq;
>
> vhost_work_init(&poll->work, fn);
> }
> @@ -97,29 +97,30 @@ void vhost_poll_stop(struct vhost_poll *
> remove_wait_queue(poll->wqh, &poll->wait);
> }
>
> -static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
> - unsigned seq)
> +static bool vhost_work_seq_done(struct vhost_virtqueue *vq,
> + struct vhost_work *work, unsigned seq)
> {
> int left;
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(vq->work_lock);
> left = seq - work->done_seq;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(vq->work_lock);
> return left <= 0;
> }
>
> -static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
> +static void vhost_work_flush(struct vhost_virtqueue *vq,
> + struct vhost_work *work)
> {
> unsigned seq;
> int flushing;
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(vq->work_lock);
> seq = work->queue_seq;
> work->flushing++;
> - spin_unlock_irq(&dev->work_lock);
> - wait_event(work->done, vhost_work_seq_done(dev, work, seq));
> - spin_lock_irq(&dev->work_lock);
> + spin_unlock_irq(vq->work_lock);
> + wait_event(work->done, vhost_work_seq_done(vq, work, seq));
> + spin_lock_irq(vq->work_lock);
> flushing = --work->flushing;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(vq->work_lock);
> BUG_ON(flushing < 0);
> }
>
> @@ -127,26 +128,26 @@ static void vhost_work_flush(struct vhos
> * locks that are also used by the callback. */
> void vhost_poll_flush(struct vhost_poll *poll)
> {
> - vhost_work_flush(poll->dev, &poll->work);
> + vhost_work_flush(poll->vq, &poll->work);
> }
>
> -static inline void vhost_work_queue(struct vhost_dev *dev,
> +static inline void vhost_work_queue(struct vhost_virtqueue *vq,
> struct vhost_work *work)
> {
> unsigned long flags;
>
> - spin_lock_irqsave(&dev->work_lock, flags);
> + spin_lock_irqsave(vq->work_lock, flags);
> if (list_empty(&work->node)) {
> - list_add_tail(&work->node, &dev->work_list);
> + list_add_tail(&work->node, vq->work_list);
> work->queue_seq++;
> - wake_up_process(dev->worker);
> + wake_up_process(vq->worker);
> }
> - spin_unlock_irqrestore(&dev->work_lock, flags);
> + spin_unlock_irqrestore(vq->work_lock, flags);
> }
>
> void vhost_poll_queue(struct vhost_poll *poll)
> {
> - vhost_work_queue(poll->dev, &poll->work);
> + vhost_work_queue(poll->vq, &poll->work);
> }
>
> static void vhost_vq_reset(struct vhost_dev *dev,
> @@ -176,17 +177,17 @@ static void vhost_vq_reset(struct vhost_
>
> static int vhost_worker(void *data)
> {
> - struct vhost_dev *dev = data;
> + struct vhost_virtqueue *vq = data;
> struct vhost_work *work = NULL;
> unsigned uninitialized_var(seq);
>
> - use_mm(dev->mm);
> + use_mm(vq->dev->mm);
>
> for (;;) {
> /* mb paired w/ kthread_stop */
> set_current_state(TASK_INTERRUPTIBLE);
>
> - spin_lock_irq(&dev->work_lock);
> + spin_lock_irq(vq->work_lock);
> if (work) {
> work->done_seq = seq;
> if (work->flushing)
> @@ -194,18 +195,18 @@ static int vhost_worker(void *data)
> }
>
> if (kthread_should_stop()) {
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(vq->work_lock);
> __set_current_state(TASK_RUNNING);
> break;
> }
> - if (!list_empty(&dev->work_list)) {
> - work = list_first_entry(&dev->work_list,
> + if (!list_empty(vq->work_list)) {
> + work = list_first_entry(vq->work_list,
> struct vhost_work, node);
> list_del_init(&work->node);
> seq = work->queue_seq;
> } else
> work = NULL;
> - spin_unlock_irq(&dev->work_lock);
> + spin_unlock_irq(vq->work_lock);
>
> if (work) {
> __set_current_state(TASK_RUNNING);
> @@ -214,7 +215,7 @@ static int vhost_worker(void *data)
> schedule();
>
> }
> - unuse_mm(dev->mm);
> + unuse_mm(vq->dev->mm);
> return 0;
> }
>
> @@ -258,7 +259,7 @@ static void vhost_dev_free_iovecs(struct
> }
>
> long vhost_dev_init(struct vhost_dev *dev,
> - struct vhost_virtqueue *vqs, int nvqs)
> + struct vhost_virtqueue *vqs, int nvqs, int nvhosts)
> {
> int i;
>
> @@ -269,20 +270,34 @@ long vhost_dev_init(struct vhost_dev *de
> dev->log_file = NULL;
> dev->memory = NULL;
> dev->mm = NULL;
> - spin_lock_init(&dev->work_lock);
> - INIT_LIST_HEAD(&dev->work_list);
> - dev->worker = NULL;
>
> for (i = 0; i < dev->nvqs; ++i) {
> - dev->vqs[i].log = NULL;
> - dev->vqs[i].indirect = NULL;
> - dev->vqs[i].heads = NULL;
> - dev->vqs[i].dev = dev;
> - mutex_init(&dev->vqs[i].mutex);
> + struct vhost_virtqueue *vq = &dev->vqs[i];
> + int j;
> +
> + if (i < nvhosts) {
> + spin_lock_init(dev->work_lock[i]);
> + INIT_LIST_HEAD(dev->work_list[i]);
> + j = i;
> + } else {
> + /* Share work with another thread */
> + j = vhost_get_thread_index(i, nvqs / 2, nvhosts);
> + }
> +
> + vq->work_lock = dev->work_lock[j];
> + vq->work_list = dev->work_list[j];
> +
> + vq->worker = NULL;
> + vq->qnum = i;
> + vq->log = NULL;
> + vq->indirect = NULL;
> + vq->heads = NULL;
> + vq->dev = dev;
> + mutex_init(&vq->mutex);
> vhost_vq_reset(dev, dev->vqs + i);
> - if (dev->vqs[i].handle_kick)
> - vhost_poll_init(&dev->vqs[i].poll,
> - dev->vqs[i].handle_kick, POLLIN, dev);
> + if (vq->handle_kick)
> + vhost_poll_init(&vq->poll,
> + vq->handle_kick, POLLIN, vq);
> }
>
> return 0;
> @@ -296,65 +311,124 @@ long vhost_dev_check_owner(struct vhost_
> }
>
> struct vhost_attach_cgroups_struct {
> - struct vhost_work work;
> - struct task_struct *owner;
> - int ret;
> + struct vhost_work work;
> + struct task_struct *owner;
> + int ret;
> };
>
> static void vhost_attach_cgroups_work(struct vhost_work *work)
> {
> - struct vhost_attach_cgroups_struct *s;
> - s = container_of(work, struct vhost_attach_cgroups_struct, work);
> - s->ret = cgroup_attach_task_all(s->owner, current);
> + struct vhost_attach_cgroups_struct *s;
> + s = container_of(work, struct vhost_attach_cgroups_struct, work);
> + s->ret = cgroup_attach_task_all(s->owner, current);
> +}
> +
> +static int vhost_attach_cgroups(struct vhost_virtqueue *vq)
> +{
> + struct vhost_attach_cgroups_struct attach;
> + attach.owner = current;
> + vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> + vhost_work_queue(vq, &attach.work);
> + vhost_work_flush(vq, &attach.work);
> + return attach.ret;
> +}
> +
> +static void __vhost_stop_workers(struct vhost_dev *dev, int nvhosts)
> +{
> + int i;
> +
> + for (i = 0; i < nvhosts; i++) {
> + WARN_ON(!list_empty(dev->vqs[i].work_list));
> + if (dev->vqs[i].worker) {
> + kthread_stop(dev->vqs[i].worker);
> + dev->vqs[i].worker = NULL;
> + }
> + }
> +
> + if (dev->mm)
> + mmput(dev->mm);
> + dev->mm = NULL;
> +}
> +
> +static void vhost_stop_workers(struct vhost_dev *dev)
> +{
> + __vhost_stop_workers(dev, get_nvhosts(dev->nvqs));
> }
>
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> - struct vhost_attach_cgroups_struct attach;
> - attach.owner = current;
> - vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> - vhost_work_queue(dev, &attach.work);
> - vhost_work_flush(dev, &attach.work);
> - return attach.ret;
> +static int vhost_start_workers(struct vhost_dev *dev)
> +{
> + int nvhosts = get_nvhosts(dev->nvqs);
> + int i, err;
> +
> + for (i = 0; i < dev->nvqs; ++i) {
> + struct vhost_virtqueue *vq = &dev->vqs[i];
> +
> + if (i < nvhosts) {
> + /* Start a new thread */
> + vq->worker = kthread_create(vhost_worker, vq,
> + "vhost-%d-%d",
> + current->pid, i);
> + if (IS_ERR(vq->worker)) {
> + i--; /* no thread to clean at this index */
> + err = PTR_ERR(vq->worker);
> + goto err;
> + }
> +
> + wake_up_process(vq->worker);
> +
> + /* avoid contributing to loadavg */
> + err = vhost_attach_cgroups(vq);
> + if (err)
> + goto err;
> + } else {
> + /* Share work with an existing thread */
> + int j = vhost_get_thread_index(i, dev->nvqs / 2,
> + nvhosts);
> +
> + vq->worker = dev->vqs[j].worker;
> + }
> + }
> + return 0;
> +
> +err:
> + __vhost_stop_workers(dev, i);
> + return err;
> }
>
> /* Caller should have device mutex */
> -static long vhost_dev_set_owner(struct vhost_dev *dev)
> +static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
> {
> - struct task_struct *worker;
> int err;
> /* Is there an owner already? */
> if (dev->mm) {
> err = -EBUSY;
> goto err_mm;
> }
> +
> + err = vhost_setup_vqs(dev, numtxqs);
> + if (err)
> + goto err_mm;
> +
> /* No owner, become one */
> dev->mm = get_task_mm(current);
> - worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
> - if (IS_ERR(worker)) {
> - err = PTR_ERR(worker);
> - goto err_worker;
> - }
> -
> - dev->worker = worker;
> - wake_up_process(worker); /* avoid contributing to loadavg */
>
> - err = vhost_attach_cgroups(dev);
> + /* Start threads */
> + err = vhost_start_workers(dev);
> if (err)
> - goto err_cgroup;
> + goto free_vqs;
>
> err = vhost_dev_alloc_iovecs(dev);
> if (err)
> - goto err_cgroup;
> + goto clean_workers;
>
> return 0;
> -err_cgroup:
> - kthread_stop(worker);
> - dev->worker = NULL;
> -err_worker:
> +clean_workers:
> + vhost_stop_workers(dev);
> +free_vqs:
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> + vhost_free_vqs(dev);
> err_mm:
> return err;
> }
> @@ -408,14 +482,7 @@ void vhost_dev_cleanup(struct vhost_dev
> kfree(rcu_dereference_protected(dev->memory,
> lockdep_is_held(&dev->mutex)));
> RCU_INIT_POINTER(dev->memory, NULL);
> - WARN_ON(!list_empty(&dev->work_list));
> - if (dev->worker) {
> - kthread_stop(dev->worker);
> - dev->worker = NULL;
> - }
> - if (dev->mm)
> - mmput(dev->mm);
> - dev->mm = NULL;
> + vhost_stop_workers(dev);
> }
>
> static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> @@ -775,7 +842,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
>
> /* If you are not the owner, you can become one */
> if (ioctl == VHOST_SET_OWNER) {
> - r = vhost_dev_set_owner(d);
> + r = vhost_dev_set_owner(d, arg);
> goto done;
> }
>
* [patch net-next-2.6] net: convert bonding to use rx_handler - second part
From: Jiri Pirko @ 2011-02-28 9:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1298885725.2941.36.camel@edumazet-laptop>
This patch converts bonding to use rx_handler. The result is a cleaner
__netif_receive_skb() with far fewer special cases.
I ran a performance test using pktgen, counting incoming packets with
iptables. No regression was observed.
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
net/core/dev.c | 119 ++++++++++++++-----------------------------------------
1 files changed, 31 insertions(+), 88 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 69a3c08..30440e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3096,63 +3096,31 @@ void netdev_rx_handler_unregister(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
- struct net_device *master)
+static void vlan_on_bond_hook(struct sk_buff *skb)
{
- if (skb->pkt_type == PACKET_HOST) {
- u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+ /*
+ * Make sure ARP frames received on VLAN interfaces stacked on
+ * bonding interfaces still make their way to any base bonding
+ * device that may have registered for a specific ptype.
+ */
+ if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
+ vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
+ skb->protocol == htons(ETH_P_ARP)) {
+ struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
- memcpy(dest, master->dev_addr, ETH_ALEN);
+ if (!skb2)
+ return;
+ skb2->dev = vlan_dev_real_dev(skb->dev);
+ netif_rx(skb2);
}
}
-/* On bonding slaves other than the currently active slave, suppress
- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
- * ARP on active-backup slaves with arp_validate enabled.
- */
-static int __skb_bond_should_drop(struct sk_buff *skb,
- struct net_device *master)
-{
- struct net_device *dev = skb->dev;
-
- if (master->priv_flags & IFF_MASTER_ARPMON)
- dev->last_rx = jiffies;
-
- if ((master->priv_flags & IFF_MASTER_ALB) &&
- (master->priv_flags & IFF_BRIDGE_PORT)) {
- /* Do address unmangle. The local destination address
- * will be always the one master has. Provides the right
- * functionality in a bridge.
- */
- skb_bond_set_mac_by_master(skb, master);
- }
-
- if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
- if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
- skb->protocol == __cpu_to_be16(ETH_P_ARP))
- return 0;
-
- if (master->priv_flags & IFF_MASTER_ALB) {
- if (skb->pkt_type != PACKET_BROADCAST &&
- skb->pkt_type != PACKET_MULTICAST)
- return 0;
- }
- if (master->priv_flags & IFF_MASTER_8023AD &&
- skb->protocol == __cpu_to_be16(ETH_P_SLOW))
- return 0;
-
- return 1;
- }
- return 0;
-}
-
static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
rx_handler_func_t *rx_handler;
struct net_device *orig_dev;
- struct net_device *null_or_orig;
- struct net_device *orig_or_bond;
+ struct net_device *null_or_dev;
int ret = NET_RX_DROP;
__be16 type;
@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
-
- /*
- * bonding note: skbs received on inactive slaves should only
- * be delivered to pkt handlers that are exact matches. Also
- * the deliver_no_wcard flag will be set. If packet handlers
- * are sensitive to duplicate packets these skbs will need to
- * be dropped at the handler.
- */
- null_or_orig = NULL;
orig_dev = skb->dev;
- if (skb->deliver_no_wcard)
- null_or_orig = orig_dev;
- else if (netif_is_bond_slave(orig_dev)) {
- struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
-
- if (likely(bond_master)) {
- if (__skb_bond_should_drop(skb, bond_master)) {
- skb->deliver_no_wcard = 1;
- /* deliver only exact match */
- null_or_orig = orig_dev;
- } else
- skb->dev = bond_master;
- }
- }
- __this_cpu_inc(softnet_data.processed);
skb_reset_network_header(skb);
skb_reset_transport_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
@@ -3201,6 +3145,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
rcu_read_lock();
+another_round:
+
+ __this_cpu_inc(softnet_data.processed);
+
#ifdef CONFIG_NET_CLS_ACT
if (skb->tc_verd & TC_NCLS) {
skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
#endif
list_for_each_entry_rcu(ptype, &ptype_all, list) {
- if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
- ptype->dev == orig_dev) {
+ if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
ncls:
#endif
- /* Handle special case of bridge or macvlan */
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler) {
+ struct net_device *prev_dev;
+
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = NULL;
}
+ prev_dev = skb->dev;
skb = rx_handler(skb);
if (!skb)
goto out;
+ if (skb->dev != prev_dev)
+ goto another_round;
}
if (vlan_tx_tag_present(skb)) {
@@ -3248,24 +3199,16 @@ ncls:
goto out;
}
- /*
- * Make sure frames received on VLAN interfaces stacked on
- * bonding interfaces still make their way to any base bonding
- * device that may have registered for a specific ptype. The
- * handler may have to adjust skb->dev and orig_dev.
- */
- orig_or_bond = orig_dev;
- if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
- (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
- orig_or_bond = vlan_dev_real_dev(skb->dev);
- }
+ vlan_on_bond_hook(skb);
+
+ /* deliver only exact match when indicated */
+ null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
type = skb->protocol;
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
- if (ptype->type == type && (ptype->dev == null_or_orig ||
- ptype->dev == skb->dev || ptype->dev == orig_dev ||
- ptype->dev == orig_or_bond)) {
+ if (ptype->type == type &&
+ (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
--
1.7.3.4
* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Vasiliy Kulikov @ 2011-02-28 9:51 UTC (permalink / raw)
To: Michael Tokarev
Cc: Arnd Bergmann, Michał Mirosław, Ben Hutchings,
David Miller, netdev, linux-kernel, kuznet, pekkas, jmorris,
yoshfuji, kaber, eric.dumazet, therbert, xiaosuo, jesse,
kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <4D6B6AE7.2050202@msgid.tls.msk.ru>
On Mon, Feb 28, 2011 at 12:29 +0300, Michael Tokarev wrote:
> 27.02.2011 23:22, Arnd Bergmann wrote:
> > The backwards compatibility should mostly be for systems that today don't
> > use split capabilities, right?
> >
> > The fallback could therefore rely on CAP_SYS_MODULE as well:
> >
> > if (request_module("netdev-%s", name)) {
> > if (capable(CAP_SYS_MODULE))
> > request_module("%s", name);
> > }
> >
> > Not 100% solution, but should solve the capability escalation nicely without
> > causing much pain.
>
> To me this looks like the best solution so far - trivial and
> compatible.
Agreed, it looks good. But before the request_module() there is a check
for capable(CAP_NET_ADMIN). IMO it's better to require either
CAP_NET_ADMIN or CAP_SYS_MODULE, not both of them.
	if (!dev) {
		if (capable(CAP_NET_ADMIN))
			request_module("netdev-%s", name);
		if (capable(CAP_SYS_MODULE)) {
			if (!request_module("%s", name))
				WARN_ONCE(1, "Loading kernel module for a network device"
					" with CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias"
					" netdev-%s instead\n", name);
		}
	}
The only drawback is distributions/setups that already use
CAP_SYS_MODULE'less network scripts.
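
To make the intended behaviour concrete, here is a minimal user-space
sketch of the decision logic. It is not kernel code: struct caps,
try_load() and dev_load_model() are hypothetical stand-ins for
capable(), request_module() and the device-load path. Unlike the
snippet above, the sketch skips the bare-name lookup once the netdev-
alias has loaded; that is my assumption about the intent, not something
settled in this thread.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the capable() results of the calling task. */
struct caps {
	bool net_admin;		/* models CAP_NET_ADMIN */
	bool sys_module;	/* models CAP_SYS_MODULE */
};

enum load_result {
	LOAD_NONE,		/* nothing was loaded */
	LOAD_ALIAS,		/* loaded via the "netdev-<name>" alias */
	LOAD_LEGACY,		/* loaded by bare name (deprecated path) */
};

/*
 * Hypothetical model of request_module(): returns 0 on success.
 * 'available' is the single module name or alias that exists on
 * this imaginary system.
 */
static int try_load(const char *available, const char *wanted)
{
	return strcmp(available, wanted) == 0 ? 0 : -1;
}

static enum load_result dev_load_model(struct caps c, const char *available,
				       const char *name)
{
	char alias[64];

	snprintf(alias, sizeof(alias), "netdev-%s", name);

	/* CAP_NET_ADMIN may only load modules that carry the netdev- alias. */
	if (c.net_admin && try_load(available, alias) == 0)
		return LOAD_ALIAS;

	/* CAP_SYS_MODULE may still load by bare name, with a warning. */
	if (c.sys_module && try_load(available, name) == 0) {
		fprintf(stderr, "deprecated: use CAP_NET_ADMIN and alias %s\n",
			alias);
		return LOAD_LEGACY;
	}

	return LOAD_NONE;
}
```

In this model a task holding only CAP_NET_ADMIN that asks for a
non-netdev module by its bare name gets LOAD_NONE, which is the
escalation the patch is meant to close.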
David, are you OK with this way?
Thanks,
--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-28 9:35 UTC (permalink / raw)
To: Jiri Pirko; +Cc: David Miller, netdev
In-Reply-To: <20110228092222.GA2831@psychotron.brq.redhat.com>
On Monday, 28 February 2011 at 10:22 +0100, Jiri Pirko wrote:
> Applied incorrectly. net/core/dev.c part is missing
Hmm, could you please provide a second patch with this part?
Thanks
* Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
From: Michael S. Tsirkin @ 2011-02-28 9:43 UTC (permalink / raw)
To: Krishna Kumar
Cc: rusty, davem, eric.dumazet, arnd, netdev, horms, avi, anthony,
kvm
In-Reply-To: <20110228063437.24908.29435.sendpatchset@krkumar2.in.ibm.com>
On Mon, Feb 28, 2011 at 12:04:37PM +0530, Krishna Kumar wrote:
> Implement mq virtio-net driver.
>
> Though struct virtio_net_config changes, it works with the old
> qemu since the last element is not accessed unless qemu sets
> VIRTIO_NET_F_NUMTXQS. Patch also adds a macro for the maximum
> number of TX vq's (VIRTIO_MAX_TXQS) that the user can specify.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Overall looks good.
Having numtxqs also mean the number of RX queues needs some cleanup.
init/cleanup routines need more symmetry.
Error handling on setup also seems slightly buggy or at least asymmetrical.
Finally, this will use up a large number of MSI vectors,
while TX interrupts mostly stay unused.
Some comments below.
> ---
> drivers/net/virtio_net.c | 543 ++++++++++++++++++++++++-----------
> include/linux/virtio_net.h | 6
> 2 files changed, 386 insertions(+), 163 deletions(-)
>
> diff -ruNp org/include/linux/virtio_net.h new/include/linux/virtio_net.h
> --- org/include/linux/virtio_net.h 2010-10-11 10:20:22.000000000 +0530
> +++ new/include/linux/virtio_net.h 2011-02-25 16:24:15.000000000 +0530
> @@ -7,6 +7,9 @@
> #include <linux/virtio_config.h>
> #include <linux/if_ether.h>
>
> +/* Maximum number of individual RX/TX queues supported */
> +#define VIRTIO_MAX_TXQS 16
> +
This also does not seem to belong in the header.
> /* The feature bitmap for virtio net */
> #define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */
> #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
> @@ -26,6 +29,7 @@
> #define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */
> #define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */
> #define VIRTIO_NET_F_CTRL_RX_EXTRA 20 /* Extra RX mode control support */
> +#define VIRTIO_NET_F_NUMTXQS 21 /* Device supports multiple TX queue */
VIRTIO_NET_F_MULTIQUEUE ?
>
> #define VIRTIO_NET_S_LINK_UP 1 /* Link is up */
>
> @@ -34,6 +38,8 @@ struct virtio_net_config {
> __u8 mac[6];
> /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
> __u16 status;
> + /* number of RX/TX queues */
> + __u16 numtxqs;
The interface here is a bit ugly:
- this is really both # of tx and rx queues but called numtxqs
- there's a hardcoded max value
- 0 is assumed to be same as 1
- assumptions above are undocumented.
One way to address this could be num_queue_pairs, and something like
/* The actual number of TX and RX queues is num_queue_pairs + 1 each. */
__u16 num_queue_pairs;
(and tweak code to match).
Alternatively, have separate registers for the number of tx and rx queues.
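
As a rough illustration of the num_queue_pairs idea, the driver-side
decoding could look like the sketch below. struct queue_counts and
decode_queue_pairs() are hypothetical names for this example only, and
the feature flag is passed in as a plain bool rather than checked
through the virtio feature API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type, not part of the patch. */
struct queue_counts {
	unsigned int rx;
	unsigned int tx;
};

/*
 * Sketch of the proposed encoding: the config field stores
 * "pairs minus one", so a zeroed field (or a device that never
 * offered the feature) naturally means one RX and one TX queue.
 */
static struct queue_counts decode_queue_pairs(bool feature_negotiated,
					      uint16_t num_queue_pairs)
{
	struct queue_counts qc = { 1, 1 };

	if (feature_negotiated) {
		/* Widen before adding so 0xffff does not wrap to 0. */
		qc.rx = (unsigned int)num_queue_pairs + 1;
		qc.tx = (unsigned int)num_queue_pairs + 1;
	}
	return qc;
}
```

With this encoding there is no special case for 0, and the field is
only read when the feature bit was negotiated.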
> } __attribute__((packed));
>
> /* This is the first element of the scatter-gather list. If you don't
> diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
> --- org/drivers/net/virtio_net.c 2011-02-21 17:55:42.000000000 +0530
> +++ new/drivers/net/virtio_net.c 2011-02-25 16:23:41.000000000 +0530
> @@ -40,31 +40,53 @@ module_param(gso, bool, 0444);
>
> #define VIRTNET_SEND_COMMAND_SG_MAX 2
>
> -struct virtnet_info {
> - struct virtio_device *vdev;
> - struct virtqueue *rvq, *svq, *cvq;
> - struct net_device *dev;
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> + struct virtqueue *svq;
> +
> + /* TX: fragments + linear part + virtio header */
> + struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> + /* Virtqueue associated with this receive_queue */
> + struct virtqueue *rvq;
> +
> + /* Back pointer to the virtnet_info */
> + struct virtnet_info *vi;
> +
> struct napi_struct napi;
> - unsigned int status;
>
> /* Number of input buffers, and max we've ever had. */
> unsigned int num, max;
>
> - /* I like... big packets and I cannot lie! */
> - bool big_packets;
> -
> - /* Host will merge rx buffers for big packets (shake it! shake it!) */
> - bool mergeable_rx_bufs;
> -
> /* Work struct for refilling if we run low on memory. */
> struct delayed_work refill;
>
> /* Chain pages by the private ptr. */
> struct page *pages;
>
> - /* fragments + linear part + virtio header */
> + /* RX: fragments + linear part + virtio header */
> struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +struct virtnet_info {
> + struct send_queue **sq;
> + struct receive_queue **rq;
> +
> + /* read-mostly variables */
> + int numtxqs ____cacheline_aligned_in_smp; /* # of rxqs/txqs */
Why do you think this alignment is a win?
> + struct virtio_device *vdev;
> + struct virtqueue *cvq;
> + struct net_device *dev;
> + unsigned int status;
> +
> + /* I like... big packets and I cannot lie! */
> + bool big_packets;
> +
> + /* Host will merge rx buffers for big packets (shake it! shake it!) */
> + bool mergeable_rx_bufs;
> };
>
> struct skb_vnet_hdr {
> @@ -94,22 +116,22 @@ static inline struct skb_vnet_hdr *skb_v
> * private is used to chain pages for big packets, put the whole
> * most recent used list in the beginning for reuse
> */
> -static void give_pages(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct receive_queue *rq, struct page *page)
> {
> struct page *end;
>
> /* Find end of list, sew whole thing into vi->pages. */
> for (end = page; end->private; end = (struct page *)end->private);
> - end->private = (unsigned long)vi->pages;
> - vi->pages = page;
> + end->private = (unsigned long)rq->pages;
> + rq->pages = page;
> }
>
> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> {
> - struct page *p = vi->pages;
> + struct page *p = rq->pages;
>
> if (p) {
> - vi->pages = (struct page *)p->private;
> + rq->pages = (struct page *)p->private;
> /* clear private here, it is used to chain pages */
> p->private = 0;
> } else
> @@ -117,15 +139,20 @@ static struct page *get_a_page(struct vi
> return p;
> }
>
> +/*
> + * Note for 'qnum' below:
> + * first 'numtxqs' vqs are RX, next 'numtxqs' vqs are TX.
> + */
Another option to consider is to have them RX,TX,RX,TX:
this way vq->queue_index / 2 gives you the
queue pair number, no need to read numtxqs. On the other hand, it makes the
#RX==#TX assumption even more entrenched.
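
To compare the two layouts, here is a small sketch of the index
arithmetic. The helper names are made up for illustration, and the
control vq is ignored on the assumption that it sits after all the
RX/TX vqs.

```c
#include <stdbool.h>

/* Blocked layout from the patch: RX vqs 0..n-1, TX vqs n..2n-1. */
static unsigned int blocked_pair(unsigned int queue_index, unsigned int numtxqs)
{
	return queue_index < numtxqs ? queue_index : queue_index - numtxqs;
}

static bool blocked_is_tx(unsigned int queue_index, unsigned int numtxqs)
{
	return queue_index >= numtxqs;
}

/* Interleaved layout: RX0, TX0, RX1, TX1, ... */
static unsigned int interleaved_pair(unsigned int queue_index)
{
	return queue_index / 2;
}

static bool interleaved_is_tx(unsigned int queue_index)
{
	return queue_index % 2 == 1;
}
```

The interleaved helpers need no per-device state at all, which is the
attraction; the blocked ones need numtxqs on every lookup.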
> static void skb_xmit_done(struct virtqueue *svq)
> {
> struct virtnet_info *vi = svq->vdev->priv;
> + int qnum = svq->queue_index - vi->numtxqs;
>
> /* Suppress further interrupts. */
> virtqueue_disable_cb(svq);
>
> /* We were probably waiting for more output buffers. */
> - netif_wake_queue(vi->dev);
> + netif_wake_subqueue(vi->dev, qnum);
> }
>
> static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -145,9 +172,10 @@ static void set_skb_frag(struct sk_buff
> *len -= f->size;
> }
>
> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
> struct page *page, unsigned int len)
> {
> + struct virtnet_info *vi = rq->vi;
> struct sk_buff *skb;
> struct skb_vnet_hdr *hdr;
> unsigned int copy, hdr_len, offset;
> @@ -190,12 +218,12 @@ static struct sk_buff *page_to_skb(struc
> }
>
> if (page)
> - give_pages(vi, page);
> + give_pages(rq, page);
>
> return skb;
> }
>
> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
> {
> struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
> struct page *page;
> @@ -210,7 +238,7 @@ static int receive_mergeable(struct virt
> return -EINVAL;
> }
>
> - page = virtqueue_get_buf(vi->rvq, &len);
> + page = virtqueue_get_buf(rq->rvq, &len);
> if (!page) {
> pr_debug("%s: rx error: %d buffers missing\n",
> skb->dev->name, hdr->mhdr.num_buffers);
> @@ -222,13 +250,14 @@ static int receive_mergeable(struct virt
>
> set_skb_frag(skb, page, 0, &len);
>
> - --vi->num;
> + --rq->num;
> }
> return 0;
> }
>
> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
> {
> + struct net_device *dev = rq->vi->dev;
> struct virtnet_info *vi = netdev_priv(dev);
> struct sk_buff *skb;
> struct page *page;
> @@ -238,7 +267,7 @@ static void receive_buf(struct net_devic
> pr_debug("%s: short packet %i\n", dev->name, len);
> dev->stats.rx_length_errors++;
> if (vi->mergeable_rx_bufs || vi->big_packets)
> - give_pages(vi, buf);
> + give_pages(rq, buf);
> else
> dev_kfree_skb(buf);
> return;
> @@ -250,14 +279,14 @@ static void receive_buf(struct net_devic
> skb_trim(skb, len);
> } else {
> page = buf;
> - skb = page_to_skb(vi, page, len);
> + skb = page_to_skb(rq, page, len);
> if (unlikely(!skb)) {
> dev->stats.rx_dropped++;
> - give_pages(vi, page);
> + give_pages(rq, page);
> return;
> }
> if (vi->mergeable_rx_bufs)
> - if (receive_mergeable(vi, skb)) {
> + if (receive_mergeable(rq, skb)) {
> dev_kfree_skb(skb);
> return;
> }
> @@ -323,184 +352,200 @@ frame_err:
> dev_kfree_skb(skb);
> }
>
> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
> {
> struct sk_buff *skb;
> struct skb_vnet_hdr *hdr;
> int err;
>
> - skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
> + skb = netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN);
> if (unlikely(!skb))
> return -ENOMEM;
>
> skb_put(skb, MAX_PACKET_LEN);
>
> hdr = skb_vnet_hdr(skb);
> - sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
> + sg_set_buf(rq->rx_sg, &hdr->hdr, sizeof hdr->hdr);
>
> - skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
> + skb_to_sgvec(skb, rq->rx_sg + 1, 0, skb->len);
>
> - err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
> + err = virtqueue_add_buf_gfp(rq->rvq, rq->rx_sg, 0, 2, skb, gfp);
> if (err < 0)
> dev_kfree_skb(skb);
>
> return err;
> }
>
> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
> {
> struct page *first, *list = NULL;
> char *p;
> int i, err, offset;
>
> - /* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
> + /* page in rq->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
> for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
> - first = get_a_page(vi, gfp);
> + first = get_a_page(rq, gfp);
> if (!first) {
> if (list)
> - give_pages(vi, list);
> + give_pages(rq, list);
> return -ENOMEM;
> }
> - sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
> + sg_set_buf(&rq->rx_sg[i], page_address(first), PAGE_SIZE);
>
> /* chain new page in list head to match sg */
> first->private = (unsigned long)list;
> list = first;
> }
>
> - first = get_a_page(vi, gfp);
> + first = get_a_page(rq, gfp);
> if (!first) {
> - give_pages(vi, list);
> + give_pages(rq, list);
> return -ENOMEM;
> }
> p = page_address(first);
>
> - /* vi->rx_sg[0], vi->rx_sg[1] share the same page */
> - /* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
> - sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
> + /* rq->rx_sg[0], rq->rx_sg[1] share the same page */
> + /* a separated rq->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
> + sg_set_buf(&rq->rx_sg[0], p, sizeof(struct virtio_net_hdr));
>
> - /* vi->rx_sg[1] for data packet, from offset */
> + /* rq->rx_sg[1] for data packet, from offset */
> offset = sizeof(struct padded_vnet_hdr);
> - sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
> + sg_set_buf(&rq->rx_sg[1], p + offset, PAGE_SIZE - offset);
>
> /* chain first in list head */
> first->private = (unsigned long)list;
> - err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
> + err = virtqueue_add_buf_gfp(rq->rvq, rq->rx_sg, 0, MAX_SKB_FRAGS + 2,
> first, gfp);
> if (err < 0)
> - give_pages(vi, first);
> + give_pages(rq, first);
>
> return err;
> }
>
> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
> {
> struct page *page;
> int err;
>
> - page = get_a_page(vi, gfp);
> + page = get_a_page(rq, gfp);
> if (!page)
> return -ENOMEM;
>
> - sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
> + sg_init_one(rq->rx_sg, page_address(page), PAGE_SIZE);
>
> - err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
> + err = virtqueue_add_buf_gfp(rq->rvq, rq->rx_sg, 0, 1, page, gfp);
> if (err < 0)
> - give_pages(vi, page);
> + give_pages(rq, page);
>
> return err;
> }
>
> /* Returns false if we couldn't fill entirely (OOM). */
> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
> {
> + struct virtnet_info *vi = rq->vi;
> int err;
> bool oom;
>
> do {
> if (vi->mergeable_rx_bufs)
> - err = add_recvbuf_mergeable(vi, gfp);
> + err = add_recvbuf_mergeable(rq, gfp);
> else if (vi->big_packets)
> - err = add_recvbuf_big(vi, gfp);
> + err = add_recvbuf_big(rq, gfp);
> else
> - err = add_recvbuf_small(vi, gfp);
> + err = add_recvbuf_small(rq, gfp);
>
> oom = err == -ENOMEM;
> if (err < 0)
> break;
> - ++vi->num;
> + ++rq->num;
> } while (err > 0);
> - if (unlikely(vi->num > vi->max))
> - vi->max = vi->num;
> - virtqueue_kick(vi->rvq);
> + if (unlikely(rq->num > rq->max))
> + rq->max = rq->num;
> + virtqueue_kick(rq->rvq);
> return !oom;
> }
>
> static void skb_recv_done(struct virtqueue *rvq)
> {
> + int qnum = rvq->queue_index;
> struct virtnet_info *vi = rvq->vdev->priv;
> + struct napi_struct *napi = &vi->rq[qnum]->napi;
> +
> /* Schedule NAPI, Suppress further interrupts if successful. */
> - if (napi_schedule_prep(&vi->napi)) {
> + if (napi_schedule_prep(napi)) {
> virtqueue_disable_cb(rvq);
> - __napi_schedule(&vi->napi);
> + __napi_schedule(napi);
> }
> }
>
> -static void virtnet_napi_enable(struct virtnet_info *vi)
> +static void virtnet_napi_enable(struct receive_queue *rq)
> {
> - napi_enable(&vi->napi);
> + napi_enable(&rq->napi);
>
> /* If all buffers were filled by other side before we napi_enabled, we
> * won't get another interrupt, so process any outstanding packets
> * now. virtnet_poll wants re-enable the queue, so we disable here.
> * We synchronize against interrupts via NAPI_STATE_SCHED */
> - if (napi_schedule_prep(&vi->napi)) {
> - virtqueue_disable_cb(vi->rvq);
> - __napi_schedule(&vi->napi);
> + if (napi_schedule_prep(&rq->napi)) {
> + virtqueue_disable_cb(rq->rvq);
> + __napi_schedule(&rq->napi);
> }
> }
>
> +static void virtnet_napi_enable_all_queues(struct virtnet_info *vi)
> +{
> + int i;
> +
> + for (i = 0; i < vi->numtxqs; i++)
> + virtnet_napi_enable(vi->rq[i]);
> +}
> +
> static void refill_work(struct work_struct *work)
> {
> - struct virtnet_info *vi;
> + struct napi_struct *napi;
> + struct receive_queue *rq;
> bool still_empty;
>
> - vi = container_of(work, struct virtnet_info, refill.work);
> - napi_disable(&vi->napi);
> - still_empty = !try_fill_recv(vi, GFP_KERNEL);
> - virtnet_napi_enable(vi);
> + rq = container_of(work, struct receive_queue, refill.work);
> + napi = &rq->napi;
> +
> + napi_disable(napi);
> + still_empty = !try_fill_recv(rq, GFP_KERNEL);
> + virtnet_napi_enable(rq);
>
> /* In theory, this can happen: if we don't get any buffers in
> * we will *never* try to fill again. */
> if (still_empty)
> - schedule_delayed_work(&vi->refill, HZ/2);
> + schedule_delayed_work(&rq->refill, HZ/2);
> }
>
> static int virtnet_poll(struct napi_struct *napi, int budget)
> {
> - struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
> + struct receive_queue *rq = container_of(napi, struct receive_queue,
> + napi);
> void *buf;
> unsigned int len, received = 0;
>
> again:
> while (received < budget &&
> - (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
> - receive_buf(vi->dev, buf, len);
> - --vi->num;
> + (buf = virtqueue_get_buf(rq->rvq, &len)) != NULL) {
> + receive_buf(rq, buf, len);
> + --rq->num;
> received++;
> }
>
> - if (vi->num < vi->max / 2) {
> - if (!try_fill_recv(vi, GFP_ATOMIC))
> - schedule_delayed_work(&vi->refill, 0);
> + if (rq->num < rq->max / 2) {
> + if (!try_fill_recv(rq, GFP_ATOMIC))
> + schedule_delayed_work(&rq->refill, 0);
> }
>
> /* Out of packets? */
> if (received < budget) {
> napi_complete(napi);
> - if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
> + if (unlikely(!virtqueue_enable_cb(rq->rvq)) &&
> napi_schedule_prep(napi)) {
> - virtqueue_disable_cb(vi->rvq);
> + virtqueue_disable_cb(rq->rvq);
> __napi_schedule(napi);
> goto again;
> }
> @@ -509,12 +554,13 @@ again:
> return received;
> }
>
> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
> + struct virtqueue *svq)
> {
> struct sk_buff *skb;
> unsigned int len, tot_sgs = 0;
>
> - while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> + while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
> pr_debug("Sent skb %p\n", skb);
> vi->dev->stats.tx_bytes += skb->len;
> vi->dev->stats.tx_packets++;
> @@ -524,7 +570,8 @@ static unsigned int free_old_xmit_skbs(s
> return tot_sgs;
> }
>
> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
> + struct virtqueue *svq, struct scatterlist *tx_sg)
> {
> struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
> const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -562,12 +609,12 @@ static int xmit_skb(struct virtnet_info
>
> /* Encode metadata header at front. */
> if (vi->mergeable_rx_bufs)
> - sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> + sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> else
> - sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
> + sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
>
> - hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> - return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> + hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
> + return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
> 0, skb);
> }
>
> @@ -575,31 +622,34 @@ static netdev_tx_t start_xmit(struct sk_
> {
> struct virtnet_info *vi = netdev_priv(dev);
> int capacity;
> + int qnum = skb_get_queue_mapping(skb);
> + struct virtqueue *svq = vi->sq[qnum]->svq;
>
> /* Free up any pending old buffers before queueing new ones. */
> - free_old_xmit_skbs(vi);
> + free_old_xmit_skbs(vi, svq);
>
> /* Try to transmit */
> - capacity = xmit_skb(vi, skb);
> + capacity = xmit_skb(vi, skb, svq, vi->sq[qnum]->tx_sg);
>
> /* This can happen with OOM and indirect buffers. */
> if (unlikely(capacity < 0)) {
> if (net_ratelimit()) {
> if (likely(capacity == -ENOMEM)) {
> dev_warn(&dev->dev,
> - "TX queue failure: out of memory\n");
> + "TXQ (%d) failure: out of memory\n",
> + qnum);
> } else {
> dev->stats.tx_fifo_errors++;
> dev_warn(&dev->dev,
> - "Unexpected TX queue failure: %d\n",
> - capacity);
> + "Unexpected TXQ (%d) failure: %d\n",
> + qnum, capacity);
> }
> }
> dev->stats.tx_dropped++;
> kfree_skb(skb);
> return NETDEV_TX_OK;
> }
> - virtqueue_kick(vi->svq);
> + virtqueue_kick(svq);
>
> /* Don't wait up for transmitted skbs to be freed. */
> skb_orphan(skb);
> @@ -608,13 +658,13 @@ static netdev_tx_t start_xmit(struct sk_
> /* Apparently nice girls don't return TX_BUSY; stop the queue
> * before it gets out of hand. Naturally, this wastes entries. */
> if (capacity < 2+MAX_SKB_FRAGS) {
> - netif_stop_queue(dev);
> - if (unlikely(!virtqueue_enable_cb(vi->svq))) {
> + netif_stop_subqueue(dev, qnum);
> + if (unlikely(!virtqueue_enable_cb(svq))) {
> /* More just got used, free them then recheck. */
> - capacity += free_old_xmit_skbs(vi);
> + capacity += free_old_xmit_skbs(vi, svq);
> if (capacity >= 2+MAX_SKB_FRAGS) {
> - netif_start_queue(dev);
> - virtqueue_disable_cb(vi->svq);
> + netif_start_subqueue(dev, qnum);
> + virtqueue_disable_cb(svq);
> }
> }
> }
> @@ -643,8 +693,10 @@ static int virtnet_set_mac_address(struc
> static void virtnet_netpoll(struct net_device *dev)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> + int i;
>
> - napi_schedule(&vi->napi);
> + for (i = 0; i < vi->numtxqs; i++)
> + napi_schedule(&vi->rq[i]->napi);
> }
> #endif
>
> @@ -652,7 +704,7 @@ static int virtnet_open(struct net_devic
> {
> struct virtnet_info *vi = netdev_priv(dev);
>
> - virtnet_napi_enable(vi);
> + virtnet_napi_enable_all_queues(vi);
> return 0;
> }
>
> @@ -704,8 +756,10 @@ static bool virtnet_send_command(struct
> static int virtnet_close(struct net_device *dev)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> + int i;
>
> - napi_disable(&vi->napi);
> + for (i = 0; i < vi->numtxqs; i++)
> + napi_disable(&vi->rq[i]->napi);
>
> return 0;
> }
> @@ -876,10 +930,10 @@ static void virtnet_update_status(struct
>
> if (vi->status & VIRTIO_NET_S_LINK_UP) {
> netif_carrier_on(vi->dev);
> - netif_wake_queue(vi->dev);
> + netif_tx_wake_all_queues(vi->dev);
> } else {
> netif_carrier_off(vi->dev);
> - netif_stop_queue(vi->dev);
> + netif_tx_stop_all_queues(vi->dev);
> }
> }
>
> @@ -890,18 +944,173 @@ static void virtnet_config_changed(struc
> virtnet_update_status(vi);
> }
>
> +static void free_receive_bufs(struct virtnet_info *vi)
> +{
> + int i;
> +
> + for (i = 0; i < vi->numtxqs; i++) {
> + BUG_ON(vi->rq[i] == NULL);
> + while (vi->rq[i]->pages)
> + __free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
> + }
> +}
> +
> +static void free_rq_sq(struct virtnet_info *vi)
> +{
> + int i;
> +
> + if (vi->rq) {
> + for (i = 0; i < vi->numtxqs; i++)
> + kfree(vi->rq[i]);
> + kfree(vi->rq);
> + }
> +
> + if (vi->sq) {
> + for (i = 0; i < vi->numtxqs; i++)
> + kfree(vi->sq[i]);
> + kfree(vi->sq);
> + }
> +}
> +
> +#define MAX_DEVICE_NAME 16
> +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
> +{
> + vq_callback_t **callbacks;
> + struct virtqueue **vqs;
> + int i, err = -ENOMEM;
> + int totalvqs;
> + char **names;
> +
> + /* Allocate receive queues */
> + vi->rq = kzalloc(numtxqs * sizeof(*vi->rq), GFP_KERNEL);
> + if (!vi->rq)
> + goto out;
> + for (i = 0; i < numtxqs; i++) {
> + vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
> + if (!vi->rq[i])
> + goto out;
> + }
> +
> + /* Allocate send queues */
> + vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
> + if (!vi->sq)
> + goto out;
> + for (i = 0; i < numtxqs; i++) {
> + vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> + if (!vi->sq[i])
> + goto out;
> + }
> +
> + /* setup initial receive and send queue parameters */
> + for (i = 0; i < numtxqs; i++) {
> + vi->rq[i]->vi = vi;
> + vi->rq[i]->pages = NULL;
> + INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
> + netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
> + napi_weight);
> +
> + sg_init_table(vi->rq[i]->rx_sg, ARRAY_SIZE(vi->rq[i]->rx_sg));
> + sg_init_table(vi->sq[i]->tx_sg, ARRAY_SIZE(vi->sq[i]->tx_sg));
> + }
> +
> + /*
> + * We expect 'numtxqs' RX virtqueue followed by 'numtxqs' TX
> + * virtqueues, and optionally one control virtqueue.
> + */
> + totalvqs = numtxqs * 2 +
> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> + /* Allocate space for find_vqs parameters */
> + vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
> + callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
> + names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
> + if (!vqs || !callbacks || !names)
> + goto free_params;
> +
> + /* Allocate/initialize parameters for recv virtqueues */
> + for (i = 0; i < numtxqs; i++) {
> + callbacks[i] = skb_recv_done;
> + names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
> + GFP_KERNEL);
> + if (!names[i])
> + goto free_params;
> + sprintf(names[i], "input.%d", i);
> + }
> +
> + /* Allocate/initialize parameters for send virtqueues */
> + for (; i < numtxqs * 2; i++) {
> + callbacks[i] = skb_xmit_done;
> + names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
> + GFP_KERNEL);
> + if (!names[i])
> + goto free_params;
> + sprintf(names[i], "output.%d", i - numtxqs);
> + }
> +
> + /* Parameters for control virtqueue, if any */
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> + callbacks[i] = NULL;
> + names[i] = "control";
> + }
> +
> + err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
> + (const char **)names);
> + if (err)
> + goto free_params;
> +
This would use up quite a lot of vectors. However, the tx interrupt
is, in fact, slow path. So, assuming we don't have enough vectors for
one per vq, I think it's a good idea to support reducing MSI vector
usage by mapping all TX VQs to the same vector while keeping separate
vectors for RX.
The hypervisor actually allows this, but at the moment we don't have an
API at the virtio level to pass that info to virtio pci.
Any idea what a good API to use would be?
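As a rough illustration of the mapping described above (a sketch only,
assuming the vq layout this patch uses: numtxqs RX vqs first, then
numtxqs TX vqs). The names vq_to_vector() and vectors_needed() are
hypothetical and are not an existing virtio or virtio pci interface;
the config-change vector and the control vq are left out.

```c
#include <assert.h>

/* Hypothetical helper: with numtxqs RX vqs at indices [0, numtxqs) and
 * numtxqs TX vqs at [numtxqs, 2 * numtxqs), give each RX vq its own
 * vector and let every TX vq share one extra vector. */
static inline int vq_to_vector(int vq_index, int numtxqs)
{
	assert(numtxqs > 0);
	assert(vq_index >= 0 && vq_index < 2 * numtxqs);

	if (vq_index < numtxqs)
		return vq_index;	/* RX: one vector per vq */
	return numtxqs;			/* TX: single shared vector */
}

/* Vectors used for the data vqs: numtxqs for RX plus one shared for TX,
 * instead of 2 * numtxqs with one vector per vq. */
static inline int vectors_needed(int numtxqs)
{
	assert(numtxqs > 0);
	return numtxqs + 1;
}
```

With 4 queue pairs this would need 5 vectors for the data vqs rather
than 8.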
> + for (i = 0; i < numtxqs; i++) {
> + vi->rq[i]->rvq = vqs[i];
> + vi->sq[i]->svq = vqs[i + numtxqs];
This logic is spread all over. We need some kind of macro to
convert a queue number to a vq number and back.
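A minimal sketch of what such helpers could look like, assuming the
layout this patch passes to find_vqs: RX vqs at [0, numtxqs), TX vqs at
[numtxqs, 2 * numtxqs), and the optional control vq last. All four
helper names are hypothetical, chosen here only for illustration.

```c
#include <assert.h>

/* RX queue number -> vq index: RX vqs come first, so it is the identity. */
static inline int rxq_to_vq(int qnum, int numtxqs)
{
	assert(qnum >= 0 && qnum < numtxqs);
	return qnum;
}

/* TX queue number -> vq index: TX vqs follow all the RX vqs. */
static inline int txq_to_vq(int qnum, int numtxqs)
{
	assert(qnum >= 0 && qnum < numtxqs);
	return qnum + numtxqs;
}

/* vq index -> queue number, for either an RX or a TX vq. */
static inline int vq_to_qnum(int vq_index, int numtxqs)
{
	assert(vq_index >= 0 && vq_index < 2 * numtxqs);
	return vq_index < numtxqs ? vq_index : vq_index - numtxqs;
}

/* Index of the optional control vq, placed after all data vqs. */
static inline int ctrl_vq_index(int numtxqs)
{
	return 2 * numtxqs;
}
```

Keeping the arithmetic in one place would let the vq layout change
later without touching every user.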
> + }
> +
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> + vi->cvq = vqs[i + numtxqs];
> +
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> + vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
This bit does not seem to belong in initialize_vqs.
> + }
> +
> +free_params:
> + if (names) {
> + for (i = 0; i < numtxqs * 2; i++)
> + kfree(names[i]);
> + kfree(names);
> + }
> +
> + kfree(callbacks);
> + kfree(vqs);
> +
> +out:
> + if (err)
> + free_rq_sq(vi);
> +
> + return err;
> +}
> +
> static int virtnet_probe(struct virtio_device *vdev)
> {
> - int err;
> + int i, err;
> + u16 numtxqs;
> struct net_device *dev;
> struct virtnet_info *vi;
> - struct virtqueue *vqs[3];
> - vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
> - const char *names[] = { "input", "output", "control" };
> - int nvqs;
> +
> + /*
> + * Find if host passed the number of RX/TX queues supported
> + * by the device
> + */
> + err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
> + offsetof(struct virtio_net_config, numtxqs),
> + &numtxqs);
> +
> + /* We need atleast one txq */
> + if (err || !numtxqs)
> + numtxqs = 1;
err is okay, but should we just fail on illegal values?
Or change the semantics so that the config field holds the number of
queues beyond the first one:
n = 0;
err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
offsetof(struct virtio_net_config, numtxqs),
&n);
numtxqs = n + 1;
> +
> + if (numtxqs > VIRTIO_MAX_TXQS)
> + return -EINVAL;
>
Do we strictly need this?
I think we should just use whatever hardware has,
or alternatively somehow ignore the unused queues
(easy for tx, not sure about rx).
> /* Allocate ourselves a network device with room for our info */
> - dev = alloc_etherdev(sizeof(struct virtnet_info));
> + dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
> if (!dev)
> return -ENOMEM;
>
> @@ -940,14 +1149,10 @@ static int virtnet_probe(struct virtio_d
>
> /* Set up our device-specific information */
> vi = netdev_priv(dev);
> - netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
> vi->dev = dev;
> vi->vdev = vdev;
> vdev->priv = vi;
> - vi->pages = NULL;
> - INIT_DELAYED_WORK(&vi->refill, refill_work);
> - sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
> - sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
> + vi->numtxqs = numtxqs;
>
> /* If we can receive ANY GSO packets, we must allocate large ones. */
> if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -958,23 +1163,10 @@ static int virtnet_probe(struct virtio_d
> if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> vi->mergeable_rx_bufs = true;
>
> - /* We expect two virtqueues, receive then send,
> - * and optionally control. */
> - nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
> -
> - err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
> + /* Initialize our rx/tx queue parameters, and invoke find_vqs */
> + err = initialize_vqs(vi, numtxqs);
> if (err)
> - goto free;
> -
> - vi->rvq = vqs[0];
> - vi->svq = vqs[1];
> -
> - if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> - vi->cvq = vqs[2];
> -
> - if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> - dev->features |= NETIF_F_HW_VLAN_FILTER;
> - }
> + goto free_netdev;
>
> err = register_netdev(dev);
> if (err) {
> @@ -983,14 +1175,19 @@ static int virtnet_probe(struct virtio_d
> }
>
> /* Last of all, set up some receive buffers. */
> - try_fill_recv(vi, GFP_KERNEL);
> + for (i = 0; i < numtxqs; i++) {
> + try_fill_recv(vi->rq[i], GFP_KERNEL);
>
> - /* If we didn't even get one input buffer, we're useless. */
> - if (vi->num == 0) {
> - err = -ENOMEM;
> - goto unregister;
> + /* If we didn't even get one input buffer, we're useless. */
> + if (vi->rq[i]->num == 0) {
> + err = -ENOMEM;
> + goto free_recv_bufs;
> + }
> }
If this fails for vq > 0, you have to detach the bufs already added
to the earlier vqs.
>
> + dev_info(&dev->dev, "(virtio-net) Allocated %d RX & TX vq's\n",
> + numtxqs);
> +
> /* Assume link up if device can't report link status,
> otherwise get link status from config. */
> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> @@ -1001,15 +1198,21 @@ static int virtnet_probe(struct virtio_d
> netif_carrier_on(dev);
> }
>
> - pr_debug("virtnet: registered device %s\n", dev->name);
> + pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
> + dev->name, numtxqs);
> return 0;
>
> -unregister:
> +free_recv_bufs:
> + free_receive_bufs(vi);
> unregister_netdev(dev);
> - cancel_delayed_work_sync(&vi->refill);
> +
> free_vqs:
> + for (i = 0; i < numtxqs; i++)
> + cancel_delayed_work_sync(&vi->rq[i]->refill);
> vdev->config->del_vqs(vdev);
> -free:
> + free_rq_sq(vi);
If we have a wrapper to init all vqs, please add a wrapper to clean up
all vqs as well.
> +
> +free_netdev:
> free_netdev(dev);
> return err;
> }
> @@ -1017,44 +1220,58 @@ free:
> static void free_unused_bufs(struct virtnet_info *vi)
> {
> void *buf;
> - while (1) {
> - buf = virtqueue_detach_unused_buf(vi->svq);
> - if (!buf)
> - break;
> - dev_kfree_skb(buf);
> - }
> - while (1) {
> - buf = virtqueue_detach_unused_buf(vi->rvq);
> - if (!buf)
> - break;
> - if (vi->mergeable_rx_bufs || vi->big_packets)
> - give_pages(vi, buf);
> - else
> + int i;
> +
> + for (i = 0; i < vi->numtxqs; i++) {
> + struct virtqueue *svq = vi->sq[i]->svq;
> +
> + while (1) {
> + buf = virtqueue_detach_unused_buf(svq);
> + if (!buf)
> + break;
> dev_kfree_skb(buf);
> - --vi->num;
> + }
> + }
> +
> + for (i = 0; i < vi->numtxqs; i++) {
> + struct virtqueue *rvq = vi->rq[i]->rvq;
> +
> + while (1) {
> + buf = virtqueue_detach_unused_buf(rvq);
> + if (!buf)
> + break;
> + if (vi->mergeable_rx_bufs || vi->big_packets)
> + give_pages(vi->rq[i], buf);
> + else
> + dev_kfree_skb(buf);
> + --vi->rq[i]->num;
> + }
> + BUG_ON(vi->rq[i]->num != 0);
> }
> - BUG_ON(vi->num != 0);
> +
> + free_rq_sq(vi);
This looks wrong here. This function should detach
and free all bufs, not internal malloc stuff.
I think we should have free_unused_bufs that handles
a single queue, and call it in a loop.
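The split suggested above can be modelled in user space as follows.
This is only a toy sketch: struct mock_vq and mock_detach_unused_buf()
are hypothetical stand-ins for a virtqueue and
virtqueue_detach_unused_buf(), and the real driver would call
dev_kfree_skb() or give_pages() instead of free().

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a virtqueue: a stack of heap-allocated buffers. */
struct mock_vq {
	void **bufs;
	int num;
};

/* Mimics the detach contract: returns one unused buffer, or NULL once
 * the queue is empty. */
static inline void *mock_detach_unused_buf(struct mock_vq *vq)
{
	if (vq->num == 0)
		return NULL;
	return vq->bufs[--vq->num];
}

/* Detach and free every buffer of a single queue; returns the count. */
static inline int free_unused_bufs_one(struct mock_vq *vq)
{
	void *buf;
	int freed = 0;

	while ((buf = mock_detach_unused_buf(vq)) != NULL) {
		free(buf);
		freed++;
	}
	return freed;
}

/* The caller loops over the queues. The queue bookkeeping itself (the
 * bufs arrays here) is deliberately left alone; freeing it belongs in
 * a separate teardown step. */
static inline int free_unused_bufs_all(struct mock_vq *vqs, int nvqs)
{
	int i, total = 0;

	for (i = 0; i < nvqs; i++)
		total += free_unused_bufs_one(&vqs[i]);
	return total;
}
```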
> }
>
> static void __devexit virtnet_remove(struct virtio_device *vdev)
> {
> struct virtnet_info *vi = vdev->priv;
> + int i;
>
> /* Stop all the virtqueues. */
> vdev->config->reset(vdev);
>
>
> unregister_netdev(vi->dev);
> - cancel_delayed_work_sync(&vi->refill);
> +
> + for (i = 0; i < vi->numtxqs; i++)
> + cancel_delayed_work_sync(&vi->rq[i]->refill);
>
> /* Free unused buffers in both send and recv, if any. */
> free_unused_bufs(vi);
>
> vdev->config->del_vqs(vi->vdev);
>
> - while (vi->pages)
> - __free_pages(get_a_page(vi, GFP_KERNEL), 0);
> -
> + free_receive_bufs(vi);
> free_netdev(vi->dev);
> }
>
> @@ -1070,7 +1287,7 @@ static unsigned int features[] = {
> VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
> VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
> VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
> - VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
> + VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
> };
>
> static struct virtio_driver virtio_net_driver = {
^ permalink raw reply
* [PATCH] bonding: use the correct size for _simple_hash()
From: Amerigo Wang @ 2011-02-28 9:34 UTC (permalink / raw)
To: linux-kernel; +Cc: WANG Cong, Jay Vosburgh, netdev
Clearly it should be the size of ->ip_dst here.
Although this is harmless, it still reads odd.
Signed-off-by: WANG Cong <amwang@redhat.com>
---
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 5c6fba8..9bc5de3 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -604,7 +604,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
_lock_rx_hashtbl(bond);
- hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_src));
+ hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_dst));
client_info = &(bond_info->rx_hashtbl[hash_index]);
if (client_info->assigned) {
^ permalink raw reply related
* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Michael Tokarev @ 2011-02-28 9:29 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Michał Mirosław, Ben Hutchings, David Miller, segoon,
netdev, linux-kernel, kuznet, pekkas, jmorris, yoshfuji, kaber,
eric.dumazet, therbert, xiaosuo, jesse, kees.cook, eugene,
dan.j.rosenberg, akpm
In-Reply-To: <201102272122.52643.arnd@arndb.de>
27.02.2011 23:22, Arnd Bergmann wrote:
> On Friday 25 February 2011, Michał Mirosław wrote:
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 54aaca6..0d09baa 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -1120,8 +1120,20 @@ void dev_load(struct net *net, const char *name)
>>> dev = dev_get_by_name_rcu(net, name);
>>> rcu_read_unlock();
>>>
>>> - if (!dev && capable(CAP_NET_ADMIN))
>>> - request_module("%s", name);
>>> + if (!dev && capable(CAP_NET_ADMIN)) {
>>> + /* Check whether the name looks like one that a net
>>> + * driver will generate initially. If not, require a
>>> + * module alias with a suitable prefix, so that this
>>> + * can't be used to load arbitrary modules.
>>> + */
>>> + if ((strncmp(name, "eth", 3) == 0 &&
>>> + isdigit((unsigned char)name[3])) ||
>>> + (strncmp(name, "wlan", 4) == 0 &&
>>> + isdigit((unsigned char)name[4])))
>>> + request_module("%s", name);
>>> + else
>>> + request_module("netdev-%s", name);
>>> + }
>>> }
>>> EXPORT_SYMBOL(dev_load);
>>>
>>
>> This might be better as:
>>
>> if (request_module("netdev-%s", name))
>> ... fallback
>>
>> Then after some years the fallback could be removed if announced properly.
>
> The backwards compatibility should mostly be for systems that today don't
> use split capabilities, right?
>
> The fallback could therefore rely on CAP_SYS_MODULE as well:
>
> if (request_module("netdev-%s", name)) {
> if (capable(CAP_SYS_MODULE))
> request_module("%s", name);
> }
>
> Not 100% solution, but should solve the capability escalation nicely without
> causing much pain.
To me this looks like the best solution so far - trivial and
compatible.
Thanks!
/mjt
^ permalink raw reply