Netdev List
 help / color / mirror / Atom feed
* [PATCH] e100: Fix race condition in e100_down() while testing
From: prasanna.panchamukhi @ 2011-05-17 22:20 UTC (permalink / raw)
  To: bruce.w.allan, jeffrey.t.kirsher, jesse.brandeburg
  Cc: e1000-devel, netdev, prasanna.panchamukhi

There is a race condition between e100_down() and e100_diag_test().
During device testing e100_diag_test() ends up calling e100_up()/e100_down()
even while the driver is already in e100_down().
This patch fixes the above race condition by changing
e100_up()/e100_down() to dev_open() and dev_close().
Also fixes the race between e100_open() and e100_diag_test().

Signed-off-by: Prasanna S. Panchamukh <prasanna.panchamukhi@riverbed.com>
---
 drivers/net/e100.c |   21 +++++++++++++++++----
 1 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index b0aa9e6..abbf229 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -425,6 +425,10 @@ enum cb_command {
 	cb_el     = 0x8000,
 };
 
+enum e100_state_t {
+	__E100_TESTING
+};
+
 struct rfd {
 	__le16 status;
 	__le16 command;
@@ -623,6 +627,7 @@ struct nic {
 	__le16 eeprom[256];
 	spinlock_t mdio_lock;
 	const struct firmware *fw;
+	unsigned long state;
 };
 
 static inline void e100_write_flush(struct nic *nic)
@@ -2570,8 +2575,10 @@ static void e100_diag_test(struct net_device *netdev,
 {
 	struct ethtool_cmd cmd;
 	struct nic *nic = netdev_priv(netdev);
+	int if_running = netif_running(netdev);
 	int i, err;
 
+	set_bit(__E100_TESTING, &nic->state);
 	memset(data, 0, E100_TEST_LEN * sizeof(u64));
 	data[0] = !mii_link_ok(&nic->mii);
 	data[1] = e100_eeprom_load(nic);
@@ -2580,8 +2587,9 @@ static void e100_diag_test(struct net_device *netdev,
 		/* save speed, duplex & autoneg settings */
 		err = mii_ethtool_gset(&nic->mii, &cmd);
 
-		if (netif_running(netdev))
-			e100_down(nic);
+		if (if_running)
+			/* indicate we're in test mode */
+			dev_open(netdev);
 		data[2] = e100_self_test(nic);
 		data[3] = e100_loopback_test(nic, lb_mac);
 		data[4] = e100_loopback_test(nic, lb_phy);
@@ -2589,12 +2597,13 @@ static void e100_diag_test(struct net_device *netdev,
 		/* restore speed, duplex & autoneg settings */
 		err = mii_ethtool_sset(&nic->mii, &cmd);
 
-		if (netif_running(netdev))
-			e100_up(nic);
+		if (if_running)
+			dev_open(netdev);
 	}
 	for (i = 0; i < E100_TEST_LEN; i++)
 		test->flags |= data[i] ? ETH_TEST_FL_FAILED : 0;
 
+	clear_bit(__E100_TESTING, &nic->state);
 	msleep_interruptible(4 * 1000);
 }
 
@@ -2724,6 +2733,10 @@ static int e100_open(struct net_device *netdev)
 	struct nic *nic = netdev_priv(netdev);
 	int err = 0;
 
+	/* disallow open during testing */
+	if (test_bit(__E100_TESTING, &nic->state))
+		return -EBUSY;
+
 	netif_carrier_off(netdev);
 	if ((err = e100_up(nic)))
 		netif_err(nic, ifup, nic->netdev, "Cannot open interface, aborting\n");
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH net-next-2.6] net: add rcu protection to netdev->ifalias
From: Eric Dumazet @ 2011-05-17 22:18 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, kaber, netdev, remi.denis-courmont
In-Reply-To: <20110502.152733.48516094.davem@davemloft.net>

Le lundi 02 mai 2011 à 15:27 -0700, David Miller a écrit :
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Thu, 28 Apr 2011 08:43:37 -0700
> 
> > On Thu, 28 Apr 2011 10:56:07 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> >> Four years ago, Patrick made a change to hold rtnl mutex during netlink
> >> dump callbacks.
> >> 
> >> I believe it was a wrong move. This slows down concurrent dumps, making
> >> good old /proc/net/ files faster than rtnetlink in some situations.
> >> 
> >> This occurred to me because one "ip link show dev ..." was _very_ slow
> >> on a workload adding/removing network devices in background.
> >> 
> >> All dump callbacks are able to use RCU locking now, so this patch does
> >> roughly a revert of commits :
> >> 
> >> 1c2d670f366 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
> >> 6313c1e0992 : [RTNETLINK]: Remove unnecessary locking in dump callbacks
> >> 
> >> This let writers fight for rtnl mutex and readers going full speed.
> >> 
> >> It also takes care of phonet : phonet_route_get() is now called from rcu
> >> read section. I renamed it to phonet_route_get_rcu()
> >> 
> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >> Cc: Patrick McHardy <kaber@trash.net>
> >> Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
> > 
> > Acked-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Applied, thanks everyone!

I think we probably have to make some fixes, because rtnl_fill_ifinfo()
can access fields that were previously protected with RTNL

Let's see if it's doable, before considering re-adding rtnl_lock() in
rtnl_dump_ifinfo() ?

First step : dev->ifalias

Its late here, I'll continue the work tomorrow.

Thanks

[PATCH net-next-2.6] net: add rcu protection to netdev->ifalias

Avoid holding RTNL in show_ifalias() and rtnl_fill_ifinfo() to
manipulate netdev->ifalias

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    7 ++++++-
 net/core/dev.c            |   28 ++++++++++++++++------------
 net/core/net-sysfs.c      |   13 +++++++------
 net/core/rtnetlink.c      |    8 ++++++--
 4 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a134d80..09b3df4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -980,6 +980,11 @@ struct net_device_ops {
 						    u32 features);
 };
 
+struct ifalias {
+	struct rcu_head	rcu;
+	char		name[0];
+};
+
 /*
  *	The DEVICE structure.
  *	Actually, this whole structure is a big mistake.  It mixes I/O
@@ -1004,7 +1009,7 @@ struct net_device {
 	/* device name hash chain */
 	struct hlist_node	name_hlist;
 	/* snmp alias */
-	char 			*ifalias;
+	struct ifalias __rcu 	*ifalias;
 
 	/*
 	 *	I/O specific fields
diff --git a/net/core/dev.c b/net/core/dev.c
index 155de20..01e3be0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1035,6 +1035,12 @@ rollback:
 	return err;
 }
 
+
+static void ifalias_kfree_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ifalias, rcu));
+}
+
 /**
  *	dev_set_alias - change ifalias of a device
  *	@dev: device
@@ -1045,24 +1051,22 @@ rollback:
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
+	struct ifalias *ifalias, *new = NULL;
 	ASSERT_RTNL();
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
-	if (!len) {
-		if (dev->ifalias) {
-			kfree(dev->ifalias);
-			dev->ifalias = NULL;
-		}
-		return 0;
+	ifalias = rtnl_dereference(dev->ifalias);
+	if (len) {
+		new = kmalloc(sizeof(*new) + len + 1, GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+		strlcpy(new->name, alias, len + 1);
 	}
-
-	dev->ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!dev->ifalias)
-		return -ENOMEM;
-
-	strlcpy(dev->ifalias, alias, len+1);
+	if (ifalias)
+		call_rcu(&ifalias->rcu, ifalias_kfree_rcu);
+	rcu_assign_pointer(dev->ifalias, new);
 	return len;
 }
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1b12217..e2acf15 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -281,13 +281,14 @@ static ssize_t show_ifalias(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	struct ifalias *ifalias;
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	rcu_read_lock();
+	ifalias = rcu_dereference(netdev->ifalias);
+	if (ifalias)
+		ret = sprintf(buf, "%s\n", ifalias->name);
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -1265,7 +1266,7 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	kfree(rcu_dereference_protected(dev->ifalias, 1));
 	kfree((char *)dev - dev->padded);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d2ba259..c8f0a11 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -849,6 +849,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	const struct rtnl_link_stats64 *stats;
 	struct nlattr *attr, *af_spec;
 	struct rtnl_af_ops *af_ops;
+	struct ifalias *ifalias;
 
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
 	if (nlh == NULL)
@@ -879,8 +880,11 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (dev->qdisc)
 		NLA_PUT_STRING(skb, IFLA_QDISC, dev->qdisc->ops->id);
 
-	if (dev->ifalias)
-		NLA_PUT_STRING(skb, IFLA_IFALIAS, dev->ifalias);
+	rcu_read_lock();
+	ifalias = rcu_dereference(dev->ifalias);
+	if (ifalias)
+		NLA_PUT_STRING(skb, IFLA_IFALIAS, ifalias->name);
+	rcu_read_unlock();
 
 	if (1) {
 		struct rtnl_link_ifmap map = {



^ permalink raw reply related

* Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-17 22:16 UTC (permalink / raw)
  To: netdev

 Just got recently. 32Bit, PPPoE NAS, shapers, firewall, NAT
 Kernel i mention in subject, 2.6.39-rc7-git11
 If required i can give more information

 sharanal (sorry for ugly name) is libpcap based traffic analyser, sure 
 userspace

  [44528.153869] BUG: unable to handle kernel NULL pointer dereference 
 at 0000001a
  [44528.153869] IP: [<c02e8614>] cleanup_once+0x49/0x1cf
  [44528.153869] *pdpt = 0000000034a73001 *pde = 0000000000000000
  [44528.153869] Oops: 0002 [#1] SMP
  [44528.153869] last sysfs file: 
 /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
  [44528.153869] Modules linked in: nf_conntrack_netlink nfnetlink 
 ipt_LOG rtc_cmos rtc_core rtc_lib act_skbedit sch_ingress sch_prio 
 bridge cls_flow cls_u32 em_meta cls_basic xt_dscp ipt_REJECT xt_hl ifb 
 cls_fw sch_tbf sch_htb a
  [44528.153869]
  [44528.153869] Pid: 1744, comm: sharanal Not tainted 
 2.6.39-rc7-git11-build-0058 #6 Intel                                  
 /SE7520BD2S
  [44528.153869] EIP: 0060:[<c02e8614>] EFLAGS: 00010286 CPU: 1
  [44528.153869] EIP is at cleanup_once+0x49/0x1cf
  [44528.153869] EAX: dd378920 EBX: dd378900 ECX: 00000016 EDX: 06000001
  [44528.153869] ESI: 00000000 EDI: e4b6a59c EBP: f5485b8c ESP: f5485b68
  [44528.153869]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  [44528.153869] Process sharanal (pid: 1744, ti=f5484000 task=f4c9b6c0 
 task.ti=f4fa4000)
  [44528.153869] Stack:
  [44528.153869]  f5485c6c f5485ba4 f5485ba4 00000013 f5485ba4 f5485b84 
 e4b6a500 f5485c6c
  [44528.153869]  e4b6a59c f5485c50 c02e8b30 f5485bec f5485c58 00000000 
 c04602d4 c04602d4
  [44528.153869]  e4252504 f0aecc04 e3ee3200 e8d1b200 efd0ab00 e489dc00 
 e422e900 e9b18100
  [44528.153869] Call Trace:
  [44528.153869]  [<c02e8b30>] inet_getpeer+0x2ab/0x2cf
  [44528.153869]  [<c03080bc>] ? 
 icmp_route_lookup.clone.19.clone.20+0x197/0x1cb
  [44528.153869]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
  [44528.153869]  [<c0132106>] ? local_bh_enable_ip+0x8/0xa
  [44528.153869]  [<f88c0929>] ? nf_nat_setup_info+0x3b0/0x3db [nf_nat]
  [44528.153869]  [<c0318591>] ? ipt_do_table+0x41c/0x437
  [44528.153869]  [<c02e56a8>] inet_getpeer_v4+0x17/0x19
  [44528.153869]  [<c02e7a72>] rt_bind_peer+0xe/0x39
  [44528.153869]  [<c030818d>] icmpv4_xrlim_allow.clone.22+0x4b/0x5f
  [44528.153869]  [<c03083d4>] icmp_send+0x203/0x282
  [44528.153869]  [<f88b23aa>] ? ipv4_confirm+0x131/0x13e 
 [nf_conntrack_ipv4]
  [44528.153869]  [<c030633b>] __udp4_lib_rcv+0x2d7/0x3f9
  [44528.153869]  [<c02e2d25>] ? nf_iterate+0x52/0x65
  [44528.153869]  [<c02e8efb>] ? xfrm4_policy_check.clone.10+0x47/0x47
  [44528.153869]  [<c030646f>] udp_rcv+0x12/0x14
  [44528.153869]  [<c02e8fc5>] ip_local_deliver_finish+0xca/0x171
  [44528.153869]  [<c02e8efb>] ? xfrm4_policy_check.clone.10+0x47/0x47
  [44528.153869]  [<c02e90b2>] NF_HOOK.clone.11+0x46/0x4d
  [44528.153869]  [<c02e8efb>] ? xfrm4_policy_check.clone.10+0x47/0x47
  [44528.153869]  [<c02e91b6>] ip_local_deliver+0x41/0x45
  [44528.153869]  [<c02e8efb>] ? xfrm4_policy_check.clone.10+0x47/0x47
  [44528.153869]  [<c02e8e92>] ip_rcv_finish+0x2b4/0x2d6
  [44528.153869]  [<c02e8bde>] ? pskb_may_pull+0x30/0x30
  [44528.153869]  [<c02e90b2>] NF_HOOK.clone.11+0x46/0x4d
  [44528.153869]  [<c02e8bde>] ? pskb_may_pull+0x30/0x30
  [44528.153869]  [<c02e93a0>] ip_rcv+0x1e6/0x21a
  [44528.153869]  [<c02e8bde>] ? pskb_may_pull+0x30/0x30
  [44528.153869]  [<c02c929d>] __netif_receive_skb+0x351/0x379
  [44528.153869]  [<c02c93f5>] netif_receive_skb+0x46/0x4c
  [44528.153869]  [<c02c9826>] ? __napi_gro_receive+0x9e/0xa6
  [44528.153869]  [<c02c94b6>] napi_skb_finish+0x1e/0x34
  [44528.153869]  [<c02c987d>] napi_gro_receive+0x20/0x24
  [44528.153869]  [<f84e6516>] e1000_receive_skb+0x5a/0x62 [e1000]
  [44528.153869]  [<f84e8ac9>] e1000_clean_rx_irq+0x28d/0x323 [e1000]
  [44528.153869]  [<f84e845d>] e1000_clean+0x2cc/0x43e [e1000]
  [44528.153869]  [<c02db37e>] ? qdisc_watchdog_schedule+0x39/0x3e
  [44528.153869]  [<f8cd8221>] ? tbf_dequeue+0x1d/0x1b6 [sch_tbf]
  [44528.153869]  [<c02b0d50>] ? dma_issue_pending_all+0x60/0x6e
  [44528.153869]  [<c02c9961>] net_rx_action+0x86/0x139
  [44528.153869]  [<c0132179>] __do_softirq+0x67/0xf3
  [44528.153869]  [<c0132112>] ? local_bh_enable+0xa/0xa
  [44528.153869]  <IRQ>
  [44528.153869]  [<c0132362>] ? irq_exit+0x35/0x70
  [44528.153869]  [<c0103d13>] ? do_IRQ+0x79/0x8d
  [44528.153869]  [<c0132379>] ? irq_exit+0x4c/0x70
  [44528.153869]  [<c011598a>] ? smp_apic_timer_interrupt+0x66/0x73
  [44528.153869]  [<c0337c29>] ? common_interrupt+0x29/0x30
  [44528.153869]  [<c033007b>] ? cpu_init+0x65/0x1c5
  [44528.153869] Code: c8 02 46 c0 74 3a 8d 58 e0 8b 15 40 9a 44 c0 2b 
 53 28 39 f2 73 0f b8 d0 02 46 c0 e8 6b e7 04 00 e9 81 01 00 00 8b 4b 20 
 8b 53 24
  9>[44528.153869]  51 04 89 0a 89 43 20 89 43 24 83 c0 0c e8 cd fd ff 
 ff eb 02
  [44528.153869] EIP: [<c02e8614>] cleanup_once+0x49/0x1cf SS:ESP 
 0068:f5485b68
  [44528.153869] CR2: 000000000000001a
  [44528.167278] ---[ end trace dd3639ec5ab2f01f ]---
  [44528.167468] Kernel panic - not syncing: Fatal exception in 
 interrupt
  [44528.167660] Pid: 1744, comm: sharanal Tainted: G      D     
 2.6.39-rc7-git11-build-0058 #6
  [44528.167992] Call Trace:
  [44528.168190]  [<c0335548>] ? printk+0x18/0x20
  [44528.168374]  [<c0335435>] panic+0x57/0x152
  [44528.168567]  [<c0104e7a>] oops_end+0x92/0x9f
  [44528.168759]  [<c011b95e>] no_context+0x151/0x159
  [44528.168948]  [<c011ba72>] __bad_area_nosemaphore+0x10c/0x114
  [44528.169154]  [<c02ca7df>] ? dev_hard_start_xmit+0x338/0x3f8
  [44528.169353]  [<c011ba8c>] bad_area_nosemaphore+0x12/0x14
  [44528.169544]  [<c011bd0c>] do_page_fault+0x12e/0x2ee
  [44528.169737]  [<c021ba8f>] ? memcmp+0xe/0x25
  [44528.169922]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
  [44528.170125]  [<c0132110>] ? local_bh_enable+0x8/0xa
  [44528.170326]  [<c02d1a97>] ? neigh_lookup+0x8b/0x95
  [44528.170517]  [<c011bbde>] ? vmalloc_sync_all+0x5/0x5
  [44528.170713]  [<c03374da>] error_code+0x5a/0x60
  [44528.170907]  [<c01300d8>] ? wait_consider_task+0x3f2/0x76f
  [44528.171111]  [<c011bbde>] ? vmalloc_sync_all+0x5/0x5
  [44528.171304]  [<c02e8614>] ? cleanup_once+0x49/0x1cf
  [44528.171484]  [<c02e8b30>] inet_getpeer+0x2ab/0x2cf
  [44528.171672]  [<c03080bc>] ? 
 icmp_route_lookup.clone.19.clone.20+0x197/0x1cb
  [44528.171861]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
  [44528.172063]  [<c0132106>] ? local_bh_enable_ip+0x8/0xa
  [44528.172262]  [<f88c0929>] ? nf_nat_setup_info+0x3b0/0x3db [nf_nat]
  [44528.172445]  [<c0318591>] ? ipt_do_table+0x41c/0x437
  [44528.172626]  [<c02e56a8>] inet_getpeer_v4+0x17/0x19
  [44528.172809]  [<c02e7a72>] rt_bind_peer+0xe/0x39
  [44528.172990]  [<c030818d>] icmpv4_xrlim_allow.clone.22+0x4b/0x5f
  [44528.173196]  [<c03083d4>] icmp_send+0x203/0x282+++

^ permalink raw reply

* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110517.175044.2057517197524794568.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:
>
> We're discussing the idea to do the defragmentation first
> so we can choose the flow properly and steer the packet
> to the correct cpu.
>
> This also would allos each fragmented packet to traverse the
> stack only once (one route lookup etc.) instead of once per
> fragment.

You could always check first in a cheap way (e.g. a small hash table) if
it's local or not (and bypass the defragmentation if routing is turned
off or the hash table would have collisions)

On the other hand if fragmentation is expensive it's probably
better to do it later anyways to spread it out better.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1305669140.6741.4.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

>> They're not rwlocks, but especially if the locking was more finegrained
>> that's likely not needed anymore.
>
> Well, there is the rehashing stuff, and this locks the whole table.
>
> Not easy to switch to rcu or something like that.

No need to switch to RCU, just a more finegrained bucket lock.

If you move a chain between queues you just lock both for the move.
It sounds easy enough. I should probably just code it up.

>
> Anyway I hardly use frags here at work, so never considered it was a
> field to spend time ;)

Yes that's the problem. On the other hand most scalability problems hurt
sooner or later, so sometimes it's good to fix them in advance.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: sfc: an enumeration is not a bitmask
From: Jeff Garzik @ 2011-05-17 22:00 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: David Miller, bhutchings, netdev
In-Reply-To: <BANLkTinaO0VVGpRRqRauH6FucvwbcXyRUg@mail.gmail.com>

On 05/17/2011 03:09 PM, Michał Mirosław wrote:
> 2011/5/17 Jeff Garzik<jgarzik@pobox.com>:
>> 2011/5/17 David Miller<davem@davemloft.net>:
>>> An enumeration is not a bitmask, instead it means one out of the set
>>> of enumerated values will be used.
>>
>> It's a decade-old kernel practice to use 'enum' to define typed
>> constants, preferred over  macros that convey no type information and
>> disappear after cpp phase.
>>
>> So your assertion about enumerations is demonstrably not true, as it
>> is often used in the kernel.  Call it enum abuse if you want, but it
>> is consistent with code all over the kernel.

> Old age of the mistake doesn't make it correct.

It is not a mistake, but a useful coding tool.

	Jeff



^ permalink raw reply

* net-2.6 --> net-next-2.6 merge...
From: David Miller @ 2011-05-17 21:55 UTC (permalink / raw)
  To: netdev


I've merged the two trees in order to resolve some
conflicts.

Just FYI...

^ permalink raw reply

* [PATCH] sfc: Don't use enums as a bitmask.
From: David Miller @ 2011-05-17 21:54 UTC (permalink / raw)
  To: netdev


This fixes:

drivers/net/sfc/mcdi_mac.c: In function ‘efx_mcdi_set_mac’:
drivers/net/sfc/mcdi_mac.c:36:2: warning: case value ‘3’ not in enumerated type ‘enum efx_fc_type’

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/sfc/efx.c        |    2 +-
 drivers/net/sfc/efx.h        |    2 +-
 drivers/net/sfc/ethtool.c    |    2 +-
 drivers/net/sfc/mdio_10g.c   |    2 +-
 drivers/net/sfc/mdio_10g.h   |    2 +-
 drivers/net/sfc/net_driver.h |   12 +++++-------
 6 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 05502b3..c914729 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -833,7 +833,7 @@ void efx_link_set_advertising(struct efx_nic *efx, u32 advertising)
 	}
 }
 
-void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type wanted_fc)
+void efx_link_set_wanted_fc(struct efx_nic *efx, u8 wanted_fc)
 {
 	efx->wanted_fc = wanted_fc;
 	if (efx->link_advertising) {
diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
index 3d83a1f..b0d1209 100644
--- a/drivers/net/sfc/efx.h
+++ b/drivers/net/sfc/efx.h
@@ -142,6 +142,6 @@ static inline void efx_schedule_channel(struct efx_channel *channel)
 
 extern void efx_link_status_changed(struct efx_nic *efx);
 extern void efx_link_set_advertising(struct efx_nic *efx, u32);
-extern void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type);
+extern void efx_link_set_wanted_fc(struct efx_nic *efx, u8);
 
 #endif /* EFX_EFX_H */
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index 348437a..d229027 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -698,7 +698,7 @@ static int efx_ethtool_set_pauseparam(struct net_device *net_dev,
 				      struct ethtool_pauseparam *pause)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
-	enum efx_fc_type wanted_fc, old_fc;
+	u8 wanted_fc, old_fc;
 	u32 old_adv;
 	bool reset;
 	int rc = 0;
diff --git a/drivers/net/sfc/mdio_10g.c b/drivers/net/sfc/mdio_10g.c
index 7115914..7ab385c 100644
--- a/drivers/net/sfc/mdio_10g.c
+++ b/drivers/net/sfc/mdio_10g.c
@@ -284,7 +284,7 @@ void efx_mdio_an_reconfigure(struct efx_nic *efx)
 	efx_mdio_write(efx, MDIO_MMD_AN, MDIO_CTRL1, reg);
 }
 
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx)
+u8 efx_mdio_get_pause(struct efx_nic *efx)
 {
 	BUILD_BUG_ON(EFX_FC_AUTO & (EFX_FC_RX | EFX_FC_TX));
 
diff --git a/drivers/net/sfc/mdio_10g.h b/drivers/net/sfc/mdio_10g.h
index df07039..a97dbbd 100644
--- a/drivers/net/sfc/mdio_10g.h
+++ b/drivers/net/sfc/mdio_10g.h
@@ -92,7 +92,7 @@ extern void efx_mdio_an_reconfigure(struct efx_nic *efx);
 /* Get pause parameters from AN if available (otherwise return
  * requested pause parameters)
  */
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx);
+u8 efx_mdio_get_pause(struct efx_nic *efx);
 
 /* Wait for specified MMDs to exit reset within a timeout */
 extern int efx_mdio_wait_reset_mmds(struct efx_nic *efx,
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index ce9697b..e8d5f03 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -449,11 +449,9 @@ enum nic_state {
 struct efx_nic;
 
 /* Pseudo bit-mask flow control field */
-enum efx_fc_type {
-	EFX_FC_RX = FLOW_CTRL_RX,
-	EFX_FC_TX = FLOW_CTRL_TX,
-	EFX_FC_AUTO = 4,
-};
+#define EFX_FC_RX	FLOW_CTRL_RX
+#define EFX_FC_TX	FLOW_CTRL_TX
+#define EFX_FC_AUTO	4
 
 /**
  * struct efx_link_state - Current state of the link
@@ -465,7 +463,7 @@ enum efx_fc_type {
 struct efx_link_state {
 	bool up;
 	bool fd;
-	enum efx_fc_type fc;
+	u8 fc;
 	unsigned int speed;
 };
 
@@ -784,7 +782,7 @@ struct efx_nic {
 
 	bool promiscuous;
 	union efx_multicast_hash multicast_hash;
-	enum efx_fc_type wanted_fc;
+	u8 wanted_fc;
 
 	atomic_t rx_reset;
 	enum efx_loopback_mode loopback_mode;
-- 
1.7.4.4


^ permalink raw reply related

* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, netdev
In-Reply-To: <m2aaelypbp.fsf@firstfloor.org>

Le mardi 17 mai 2011 à 14:44 -0700, Andi Kleen a écrit :
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > OK but do we have workloads actually needing this optimization at all ?
> 
> That's a good a question.
> >
> > (IP defrag hits a read_lock(&ip4_frags.lock)), so maybe steer all frags
> > on a given cpu ?)
> 
> Couldn't the lock just be replaced with a hashed or bitmap lock or 
> bit in low bits of pointer lock?
> 
> iirc it just protects the heads of the hash table.
> 
> They're not rwlocks, but especially if the locking was more finegrained
> that's likely not needed anymore.

Well, there is the rehashing stuff, and this locks the whole table.

Not easy to switch to rcu or something like that.

Anyway I hardly use frags here at work, so never considered it was a
field to spend time ;)




^ permalink raw reply

* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:50 UTC (permalink / raw)
  To: andi; +Cc: netdev
In-Reply-To: <m262p9yp5f.fsf@firstfloor.org>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 17 May 2011 14:48:28 -0700

> David Miller <davem@davemloft.net> writes:
> 
>> Guys we can't time out fragments if we are not the final
>> destination.
> 
> If you're not the final destination you should never even
> try to reassemble them?
> 
> I'm probably missing something...

We're discussing the idea to do the defragmentation first
so we can choose the flow properly and steer the packet
to the correct cpu.

This also would allos each fragmented packet to traverse the
stack only once (one route lookup etc.) instead of once per
fragment.

Please read the rest of this thread, we have discussed this
and now I'm repeating information solely for your benefit.

^ permalink raw reply

* [PATCH v2] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-17 21:48 UTC (permalink / raw)
  To: netdev; +Cc: Micha Nelissen, David Miller
In-Reply-To: <1305405386-25187-1-git-send-email-micha@neli.hopto.org>

Currently the ip auto configuration has a hardcoded delay of 1 second.
When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
nfs root may not be found.

Remove the hardcoded delay, and wait for carrier on at least one network
device.

Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
Cc: David Miller <davem@davemloft.net>
---
 net/ipv4/ipconfig.c |   32 +++++++++++++++++++++-----------
 1 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 2b09775..953808f 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -87,8 +87,8 @@
 #endif
 
 /* Define the friendly delay before and after opening net devices */
-#define CONF_PRE_OPEN		500	/* Before opening: 1/2 second */
-#define CONF_POST_OPEN		1	/* After opening: 1 second */
+#define CONF_POST_OPEN		10	/* After opening: 10 msecs */
+#define CONF_CARRIER_TIMEOUT	120000	/* Wait for carrier timeout */
 
 /* Define the timeout for waiting for a DHCP/BOOTP/RARP reply */
 #define CONF_OPEN_RETRIES 	2	/* (Re)open devices twice */
@@ -188,14 +188,14 @@ struct ic_device {
 static struct ic_device *ic_first_dev __initdata = NULL;/* List of open device */
 static struct net_device *ic_dev __initdata = NULL;	/* Selected device */
 
-static bool __init ic_device_match(struct net_device *dev)
+static bool __init ic_is_init_dev(struct net_device *dev)
 {
-	if (user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
+	if (dev->flags & IFF_LOOPBACK)
+		return 0;
+	return user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
 	    (!(dev->flags & IFF_LOOPBACK) &&
 	     (dev->flags & (IFF_POINTOPOINT|IFF_BROADCAST)) &&
-	     strncmp(dev->name, "dummy", 5)))
-		return true;
-	return false;
+	     strncmp(dev->name, "dummy", 5));
 }
 
 static int __init ic_open_devs(void)
@@ -203,6 +203,7 @@ static int __init ic_open_devs(void)
 	struct ic_device *d, **last;
 	struct net_device *dev;
 	unsigned short oflags;
+	unsigned long start;
 
 	last = &ic_first_dev;
 	rtnl_lock();
@@ -216,9 +217,7 @@ static int __init ic_open_devs(void)
 	}
 
 	for_each_netdev(&init_net, dev) {
-		if (dev->flags & IFF_LOOPBACK)
-			continue;
-		if (ic_device_match(dev)) {
+		if (ic_is_init_dev(dev)) {
 			int able = 0;
 			if (dev->mtu >= 364)
 				able |= IC_BOOTP;
@@ -252,6 +251,17 @@ static int __init ic_open_devs(void)
 				dev->name, able, d->xid));
 		}
 	}
+
+	/* wait for a carrier on at least one device */
+	start = jiffies;
+	while (jiffies - start < msecs_to_jiffies(CONF_CARRIER_TIMEOUT)) {
+		for_each_netdev(&init_net, dev)
+			if (ic_is_init_dev(dev) && netif_carrier_ok(dev))
+				goto have_carrier;
+
+		msleep(1);
+	}
+have_carrier:
 	rtnl_unlock();
 
 	*last = NULL;
@@ -1378,7 +1388,7 @@ static int __init ip_auto_config(void)
 		return err;
 
 	/* Give drivers a chance to settle */
-	ssleep(CONF_POST_OPEN);
+	msleep(CONF_POST_OPEN);
 
 	/*
 	 * If the config information is insufficient (e.g., our IP address or
-- 
1.7.4.4


^ permalink raw reply related

* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:40 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, eric.dumazet, therbert, netdev
In-Reply-To: <20110517.172630.1789843473242620898.davem@davemloft.net>

On Tue, 2011-05-17 at 17:26 -0400, David Miller wrote:
> The idea to do RFS post fragmentation is interesting, it's sort of
> another form of GRO.  We would need to re-fragment (like GRO does)
> in the forwarding case.
> 
> But it would be nice since it would reduce the number of calls into
> the stack (and thus route lookups, etc.) per fragmented frame.
> 
> There is of course the issue of fragmentation queue timeouts, and
> what semantics of that means when we are not the final destination
> and those fragments would have been forwarded rather than consumed
> by us.

If we are not the final destination, should there be any reassembly
going-on in the first place?

And if reassembly times-out, don't the frags just get dropped like they
would anyway?

Eric keeps asking about (real) workload :)  About the only one I can
think of at this point that would have much in the way of UDP fragments
is EDNS.  Apart from that we may be worrying about how many fragments
can dance on the header of an IP datagram?-)

rick


^ permalink raw reply

* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110517.174431.1004332995189918916.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> Guys we can't time out fragments if we are not the final
> destination.

If you're not the final destination you should never even
try to reassemble them?

I'm probably missing something...

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-17 21:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305588738.3456.65.camel@localhost.localdomain>

2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> Hello Michael,
>
> Looks like to use a new flag requires more time/work. I am thinking
> whether we can just use HIGHDMA flag to enable zero-copy in macvtap to
> avoid the new flag for now since mavctap uses real NICs as lower device?

Is there any other restriction besides requiring driver to not recycle
the skb? Are there any drivers that recycle TX skbs?

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with macros and type alias
From: David Miller @ 2011-05-17 21:45 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev
In-Reply-To: <1305668611.2848.58.camel@bwh-desktop>


Too late Ben, I've already written this patch myself using "u8"
throughout since that also matches the return type of the mii_*()
routines that feed this value.

^ permalink raw reply

* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1305663288.2691.2.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> OK but do we have workloads actually needing this optimization at all ?

That's a good a question.
>
> (IP defrag hits a read_lock(&ip4_frags.lock)), so maybe steer all frags
> on a given cpu ?)

Couldn't the lock just be replaced with a hashed or bitmap lock or 
bit in low bits of pointer lock?

iirc it just protects the heads of the hash table.

They're not rwlocks, but especially if the locking was more finegrained
that's likely not needed anymore.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:44 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20110517.143342.1566027350038182221.davem@davemloft.net>


Guys we can't time out fragments if we are not the final
destination.

Due to assymetric routing, the fragment pieces we don't
see might reach the final destiantion not through us.

So we have to pass them onwards, we can't just drop them.

^ permalink raw reply

* [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with macros and type alias
From: Ben Hutchings @ 2011-05-17 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110517.171412.1017451005914294196.davem@davemloft.net>

gcc 4.5 doesn't like enumerators as flags, and neither does davem.
Replace the enumerated type with macros and a type alias which makes
it clear where the flags are being used.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Tue, 17 May 2011 22:06:19 +0100
> 
> > But that seems to result in discarding all type information for the
> > flags. I would prefer to improve type checking.
> 
> That's why I don't want a solution like your cast patch, as that
> takes the typing information away.
> 
> Accept that the compiler currently doesn't want to allow enums to be
> used as bit-masks, don't paper around it.
> 
> I'm not applying this patch either, you think all of this "__force"
> casting stuff is better?  No way.

OK, here it is with an ordinary type alias.  I still don't think this is
the right way to go, as neither gcc nor sparse can do any meaningful
type-checking on it.  The type alias should at least help readers.

Ben.

 drivers/net/sfc/efx.c        |    2 +-
 drivers/net/sfc/efx.h        |    2 +-
 drivers/net/sfc/ethtool.c    |    2 +-
 drivers/net/sfc/mdio_10g.c   |    2 +-
 drivers/net/sfc/mdio_10g.h   |    2 +-
 drivers/net/sfc/net_driver.h |   15 +++++++--------
 6 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 05502b3..f840879 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -833,7 +833,7 @@ void efx_link_set_advertising(struct efx_nic *efx, u32 advertising)
 	}
 }
 
-void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type wanted_fc)
+void efx_link_set_wanted_fc(struct efx_nic *efx, efx_fc_mode wanted_fc)
 {
 	efx->wanted_fc = wanted_fc;
 	if (efx->link_advertising) {
diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
index 3d83a1f..242d68b 100644
--- a/drivers/net/sfc/efx.h
+++ b/drivers/net/sfc/efx.h
@@ -142,6 +142,6 @@ static inline void efx_schedule_channel(struct efx_channel *channel)
 
 extern void efx_link_status_changed(struct efx_nic *efx);
 extern void efx_link_set_advertising(struct efx_nic *efx, u32);
-extern void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type);
+extern void efx_link_set_wanted_fc(struct efx_nic *efx, efx_fc_mode);
 
 #endif /* EFX_EFX_H */
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index 348437a..114829d 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -698,7 +698,7 @@ static int efx_ethtool_set_pauseparam(struct net_device *net_dev,
 				      struct ethtool_pauseparam *pause)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
-	enum efx_fc_type wanted_fc, old_fc;
+	efx_fc_mode wanted_fc, old_fc;
 	u32 old_adv;
 	bool reset;
 	int rc = 0;
diff --git a/drivers/net/sfc/mdio_10g.c b/drivers/net/sfc/mdio_10g.c
index 7115914..1e748cf 100644
--- a/drivers/net/sfc/mdio_10g.c
+++ b/drivers/net/sfc/mdio_10g.c
@@ -284,7 +284,7 @@ void efx_mdio_an_reconfigure(struct efx_nic *efx)
 	efx_mdio_write(efx, MDIO_MMD_AN, MDIO_CTRL1, reg);
 }
 
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx)
+efx_fc_mode efx_mdio_get_pause(struct efx_nic *efx)
 {
 	BUILD_BUG_ON(EFX_FC_AUTO & (EFX_FC_RX | EFX_FC_TX));
 
diff --git a/drivers/net/sfc/mdio_10g.h b/drivers/net/sfc/mdio_10g.h
index df07039..1616b13 100644
--- a/drivers/net/sfc/mdio_10g.h
+++ b/drivers/net/sfc/mdio_10g.h
@@ -92,7 +92,7 @@ extern void efx_mdio_an_reconfigure(struct efx_nic *efx);
 /* Get pause parameters from AN if available (otherwise return
  * requested pause parameters)
  */
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx);
+extern efx_fc_mode efx_mdio_get_pause(struct efx_nic *efx);
 
 /* Wait for specified MMDs to exit reset within a timeout */
 extern int efx_mdio_wait_reset_mmds(struct efx_nic *efx,
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index ce9697b..a5aeba8 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -448,12 +448,11 @@ enum nic_state {
 /* Forward declaration */
 struct efx_nic;
 
-/* Pseudo bit-mask flow control field */
-enum efx_fc_type {
-	EFX_FC_RX = FLOW_CTRL_RX,
-	EFX_FC_TX = FLOW_CTRL_TX,
-	EFX_FC_AUTO = 4,
-};
+/* Flow control flags */
+typedef unsigned int efx_fc_mode;
+#define EFX_FC_RX	FLOW_CTRL_RX
+#define EFX_FC_TX	FLOW_CTRL_TX
+#define EFX_FC_AUTO	4
 
 /**
  * struct efx_link_state - Current state of the link
@@ -465,7 +464,7 @@ enum efx_fc_type {
 struct efx_link_state {
 	bool up;
 	bool fd;
-	enum efx_fc_type fc;
+	efx_fc_mode fc;
 	unsigned int speed;
 };
 
@@ -784,7 +783,7 @@ struct efx_nic {
 
 	bool promiscuous;
 	union efx_multicast_hash multicast_hash;
-	enum efx_fc_type wanted_fc;
+	efx_fc_mode wanted_fc;
 
 	atomic_t rx_reset;
 	enum efx_loopback_mode loopback_mode;
-- 
1.7.4

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: Joe Perches @ 2011-05-17 21:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhutchings, netdev
In-Reply-To: <1305668197.2691.15.camel@edumazet-laptop>

On Tue, 2011-05-17 at 23:36 +0200, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 14:22 -0700, Joe Perches a écrit :
> > On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
> > > Accept that the compiler currently doesn't want to allow enums to be
> > > used as bit-masks, don't paper around it.
> > A patch applied yesterday using enums as bitmasks:
> > http://patchwork.ozlabs.org/patch/95802/
> Well, quite frankly I focused on solving a bug, not making beautiful
> code. After reading convoluted RFC, going back to C is not exactly
> straightforward.

Given some RFCs, not necessarily possible either.

> I filled a (small) table with C99 initializers, its not the problem
> David raised

True, but it was a similar use of or'd enums.

> To be honest, I was using plain #define and right before sending patch I
> added one enum just because its less chars on screen for the reader.
> No strong opinion on this.

Nor I honestly.  I don't mind either.

cheers, Joe


^ permalink raw reply

* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: Eric Dumazet @ 2011-05-17 21:36 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, bhutchings, netdev
In-Reply-To: <1305667378.1722.65.camel@Joe-Laptop>

Le mardi 17 mai 2011 à 14:22 -0700, Joe Perches a écrit :
> On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
> > Accept that the compiler currently doesn't want to allow enums to be
> > used as bit-masks, don't paper around it.
> 
> A patch applied yesterday using enums as bitmasks:
> http://patchwork.ozlabs.org/patch/95802/

Well, quite frankly I focused on solving a bug, not making beautiful
code. After reading convoluted RFC, going back to C is not exactly
straightforward.

I filled a (small) table with C99 initializers, its not the problem
David raised with :

switch (value) {
case XXX | YYY:   /* gcc warning */

To be honest, I was using plain #define and right before sending patch I
added one enum just because its less chars on screen for the reader.

No strong opinion on this.




^ permalink raw reply

* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: Michał Mirosław @ 2011-05-17 21:30 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, bhutchings, netdev
In-Reply-To: <1305667378.1722.65.camel@Joe-Laptop>

2011/5/17 Joe Perches <joe@perches.com>:
> On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
>> Accept that the compiler currently doesn't want to allow enums to be
>> used as bit-masks, don't paper around it.
> A patch applied yesterday using enums as bitmasks:
> http://patchwork.ozlabs.org/patch/95802/

The compiler warning is about using enum-masks in switch() statement,
IIRC. In this case it might be enough to just add the mask to the
enum.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:28 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <BANLkTin3hRwB39LGfOSxet61Cdm3F1MsKg@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 14:27:10 -0700

>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> I remember observing this a while back, what's the rationale for it?

That's the cheapest way to build the fragments.

Regardless of the reason we have to handle it forever.

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-17 21:28 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305665419.10756.33.camel@localhost.localdomain>

On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> Resubmit the patch with most update. This patch passed some
> live-migration test against RHEL6.2. I will run more stress test w/i
> live migration.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>

Cool. cleanup path needs a fix - are you use you can
not use kobj or some other existing refcounting?

Is perf regressiion caused by tx ring overrun gone now?

I added some comments about how we might be aqble
to complete requests out of order but it's not a must.

> ---
> 
>  drivers/vhost/net.c   |   37 +++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h |   12 ++++++++++
>  3 files changed, 101 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..6bd6e28 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,9 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_ZEROCOPY_PEND 128 
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	struct skb_ubuf_info pend;
>  
>  	/* TODO: check that we are running from vhost_worker? */
>  	sock = rcu_dereference_check(vq->private_data, 1);
> @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
>  	hdr_size = vq->vhost_hlen;
>  
>  	for (;;) {
> +		/* Release DMAs done buffers first */
> +		if (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)
> +			vhost_zerocopy_signal_used(vq, false);
> +
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
>  				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>  				break;
>  			}
> +			/* If more outstanding DMAs, queue the work */
> +			if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> +			    (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
>  			if (unlikely(vhost_enable_notify(vq))) {
>  				vhost_disable_notify(vq);
>  				continue;
> @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> +		/* use msg_control to pass vhost zerocopy ubuf info to skb */
> +		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> +			vq->heads[vq->upend_idx].id = head;
> +			if (len <= 128)

I thought we have a constant for that?

> +				vq->heads[vq->upend_idx].len = VHOST_DMA_DONE_LEN;
> +			else {
> +				vq->heads[vq->upend_idx].len = len;
> +				pend.callback = vhost_zerocopy_callback;
> +				pend.arg = vq;
> +				pend.desc = vq->upend_idx;
> +				msg.msg_control = &pend;
> +				msg.msg_controllen = sizeof(pend);
> +			}
> +			atomic_inc(&vq->refcnt);
> +			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;

Ok, so we deal with a cyclic ring apparently? What guarantees we don't
overrun it?


> +		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> -			vhost_discard_vq_desc(vq, 1);
> +			if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> +				vhost_discard_vq_desc(vq, 1);

How are errors handled with zerocopy?


>  			tx_poll_start(net, sock);
>  			break;
>  		}
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..ce799d6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->upend_idx = 0;
> +	vq->done_idx = 0;
> +	atomic_set(&vq->refcnt, 0);
>  }
>  
>  static int vhost_worker(void *data)
> @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>  					       UIO_MAXIOV, GFP_KERNEL);
>  		dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
>  					  GFP_KERNEL);
> -		dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
> +		dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
>  					    UIO_MAXIOV, GFP_KERNEL);

Which fields need to be initialized actually?

>  
>  		if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
>  	return 0;
>  }
>  
> +/* 
> +	comments
> +*/

Hmm.

> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> +	int i, j = 0;
> +
> +	i = vq->done_idx;
> +	while (i != vq->upend_idx) {

A for loop might be clearer.

> +		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {

On shutdown, we signal all buffers used to the guest?
Why?


> +			/* reset len = 0 */

comment not very helpful.
Could you explain what this does instead?
Or better use some constant instead of 0 ...

> +			vq->heads[i].len = 0;
> +			i = (i + 1) % UIO_MAXIOV;
> +			++j;
> +		} else
> +			break;

Hmm so if the 1st entry does not complete, you do not signal anything?

> +	}

Looking at this loop, done idx is the consumer and used idx
is the producer, right?

> +	if (j) {
> +		/* comments */

Yes?

> +		if (i > vq->done_idx)
> +			vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
> +		else {
> +			vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> +					 UIO_MAXIOV - vq->done_idx);
> +			vhost_add_used_n(vq, vq->heads, i);
> +		}
> +		vq->done_idx = i;
> +		vhost_signal(vq->dev, vq);
> +		atomic_sub(j, &vq->refcnt);

Code will likely be simpler if you call vhost_add_used once for
each head in the first loop. Possibly add_used_signal might be
a good idea too.

> +	}
> +}
> +
>  /* Caller should have device mutex */
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
> @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  			vhost_poll_stop(&dev->vqs[i].poll);
>  			vhost_poll_flush(&dev->vqs[i].poll);
>  		}
> +		/* wait for all lower device DMAs done, then notify guest */
> +		if (atomic_read(&dev->vqs[i].refcnt)) {
> +			msleep(1000);
> +			vhost_zerocopy_signal_used(&dev->vqs[i], true);
> +		}

This needs to be fixed somehow. Use a completion object and wait
on it?

>  		if (dev->vqs[i].error_ctx)
>  			eventfd_ctx_put(dev->vqs[i].error_ctx);
>  		if (dev->vqs[i].error)
> @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* force all lower device DMAs done */
> +	if (atomic_read(&vq->refcnt)) 
> +		vhost_zerocopy_signal_used(vq, true);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
>  		vq_err(vq, "Failed to enable notification at %p: %d\n",
>  		       &vq->used->flags, r);
>  }
> +
> +void vhost_zerocopy_callback(struct sk_buff *skb)
> +{
> +	int idx = skb_shinfo(skb)->ubuf.desc;
> +	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> +
> +	/* set len = 1 to mark this desc buffers done DMA */

this comment can now go.

> +	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..8e3ecc7 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,10 @@
>  #include <linux/virtio_ring.h>
>  #include <asm/atomic.h>
>  
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN	1
> +
>  struct vhost_device;
>  
>  struct vhost_work;
> @@ -108,6 +112,12 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* vhost zerocopy support */
> +	atomic_t refcnt; /* num of outstanding zerocopy DMAs */

future enhancement idea: this is used apparently under vq lock
so no need for an atomic?

> +	/* copy of avail idx to monitor outstanding DMA zerocopy buffers */

looking at code upend_idx seems to be calculated independently
of guest avail idx - could you clarify pls?

> +	int upend_idx;
> +	/* copy of used idx to monintor DMA done zerocopy buffers */

monitor

> +	int done_idx;


I think in reality these are just producer and consumer
in the head structure which for zero copy is used



>  };
>  
>  struct vhost_dev {
> @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(struct sk_buff *skb);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
>  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> 

^ permalink raw reply

* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: David Miller @ 2011-05-17 21:28 UTC (permalink / raw)
  To: joe; +Cc: bhutchings, netdev
In-Reply-To: <1305667378.1722.65.camel@Joe-Laptop>

From: Joe Perches <joe@perches.com>
Date: Tue, 17 May 2011 14:22:58 -0700

> On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
>> Accept that the compiler currently doesn't want to allow enums to be
>> used as bit-masks, don't paper around it.
> 
> A patch applied yesterday using enums as bitmasks:
> http://patchwork.ozlabs.org/patch/95802/

Thanks, fixed:

--------------------
ipv4: Don't use enums as bitmasks in ip_fragment.c

Noticed by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_fragment.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9e1458d..0ad6035 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -81,12 +81,10 @@ struct ipq {
  * We want to check ECN values of all fragments, do detect invalid combinations.
  * In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
  */
-enum {
-	IPFRAG_ECN_NOT_ECT	= 0x01, /* one frag had ECN_NOT_ECT */
-	IPFRAG_ECN_ECT_1	= 0x02, /* one frag had ECN_ECT_1 */
-	IPFRAG_ECN_ECT_0	= 0x04, /* one frag had ECN_ECT_0 */
-	IPFRAG_ECN_CE		= 0x08, /* one frag had ECN_CE */
-};
+#define	IPFRAG_ECN_NOT_ECT	0x01 /* one frag had ECN_NOT_ECT */
+#define	IPFRAG_ECN_ECT_1	0x02 /* one frag had ECN_ECT_1 */
+#define	IPFRAG_ECN_ECT_0	0x04 /* one frag had ECN_ECT_0 */
+#define	IPFRAG_ECN_CE		0x08 /* one frag had ECN_CE */
 
 static inline u8 ip4_frag_ecn(u8 tos)
 {
-- 
1.7.4.4


^ permalink raw reply related

* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:27 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, therbert, netdev
In-Reply-To: <1305666822.2848.51.camel@bwh-desktop>

Le mardi 17 mai 2011 à 22:13 +0100, Ben Hutchings a écrit :
> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 17 May 2011 23:00:50 +0200
> > 
> > > Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> > >> From: Tom Herbert <therbert@google.com>
> > >> Date: Tue, 17 May 2011 13:02:25 -0700
> > >> 
> > >> > I like it!  And this sounds like the sort of algorithm that NICs might
> > >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > >> > better.
> > >> 
> > >> Actually, I think it won't work.  Even Linux emits fragments last to
> > >> first, so we won't see the UDP header until the last packet where it's
> > >> no longer useful.
> > >> 
> > >> Back to the drawing board. :-/
> > > 
> > > Well, we could just use the iph->id in the rxhash computation for frags.
> > > 
> > > At least all frags of a given datagram should be reassembled on same
> > > cpu, so we get RPS (but not RFS)
> > 
> > That's true, but one could also argue that in the existing code at least
> > one of the packets (the one with the UDP header) would make it to the
> > proper flow cpu.
> 
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

Exactly

As is, RPS (based on our software rxhash computation) should be working
fine with frags, unless we receive different flows with same
(src_addr,dst_addr) pair.

This is why I asked David if real workloads could hit one cpu instead of
many ones.




^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox