Netdev List
 help / color / mirror / Atom feed
* [PATCH] ipv6: addrconf: don't remove address state on ifdown if the address is being kept
From: Lorenzo Colitti @ 2010-10-28  4:16 UTC (permalink / raw)
  To: netdev; +Cc: Maciej Żenczykowski, David Miller, Brian Haley

Currently, addrconf_ifdown does not delete statically configured IPv6
addresses when the interface is brought down. The intent is that when
the interface comes back up the address will be usable again. However,
this doesn't actually work, because the system stops listening on the
corresponding solicited-node multicast address, so the address cannot
respond to neighbor solicitations and thus receive traffic. Also, the
code notifies the rest of the system that the address is being deleted
(e.g, RTM_DELADDR), even though it is not. Fix it so that none of this
state is updated if the address is being kept on the interface.

--- a/net/ipv6/addrconf.c	2010-10-20 13:30:22.000000000 -0700
+++ b/net/ipv6/addrconf.c	2010-10-27 21:04:39.000000000 -0700
@@ -2737,10 +2737,7 @@ static int addrconf_ifdown(struct net_de
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
 			ifa->state = INET6_IFADDR_STATE_DAD;
-
-			write_unlock_bh(&idev->lock);
-
-			in6_ifa_hold(ifa);
+			continue;
 		} else {
 			list_del(&ifa->if_list);

^ permalink raw reply

* RE: [PATCH net-next] ixgbe: fix stats handling
From: Eric Dumazet @ 2010-10-28  4:24 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: David Miller, Waskiewicz Jr, Peter P, Kirsher, Jeffrey T,
	Brattain, Ross B, netdev
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A136602807B460@rrsmsx501.amr.corp.intel.com>

Le mercredi 27 octobre 2010 à 16:35 -0600, Tantilov, Emil S a écrit :
> >-----Original Message-----
> >From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> >Sent: Monday, October 11, 2010 5:17 AM
> >To: David Miller; Waskiewicz Jr, Peter P; Tantilov, Emil S; Kirsher,
> >Jeffrey T
> >Cc: netdev
> >Subject: [PATCH net-next] ixgbe: fix stats handling
> >
> >Hi
> >
> >I am sending this patch for Intel people review/test and acknowledge.
> >
> >Thanks !
> >
> >[PATCH net-next] ixgbe: fix stats handling
> >
> >Current ixgbe stats have following problems :
> >
> >- Not 64 bit safe (on 32bit arches)
> >
> >- Not safe in ixgbe_clean_rx_irq() :
> >   All cpus dirty a common location (netdev->stats.rx_bytes &
> >netdev->stats.rx_packets) without proper synchronization.
> >   This slow down a bit multiqueue operations, and possibly miss some
> >updates.
> >
> >Fixes :
> >
> >Implement ndo_get_stats64() method to provide accurate 64bit rx|tx
> >bytes/packets counters, using 64bit safe infrastructure.
> >
> >ixgbe_get_ethtool_stats() also use this infrastructure to provide 64bit
> >safe counters.
> >
> >Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >CC: Peter Waskiewicz <peter.p.waskiewicz.jr@intel.com>
> >CC: Emil Tantilov <emil.s.tantilov@intel.com>
> >CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> >---
> > drivers/net/ixgbe/ixgbe.h         |    3 +-
> > drivers/net/ixgbe/ixgbe_ethtool.c |   29 +++++++++++---------
> > drivers/net/ixgbe/ixgbe_main.c    |   40 +++++++++++++++++++++++++---
> > 3 files changed, 56 insertions(+), 16 deletions(-)
> >
> >diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
> >index a8c47b0..944d9e2 100644
> >--- a/drivers/net/ixgbe/ixgbe.h
> >+++ b/drivers/net/ixgbe/ixgbe.h
> >@@ -180,8 +180,9 @@ struct ixgbe_ring {
> > 					 */
> >
> > 	struct ixgbe_queue_stats stats;
> >-	unsigned long reinit_state;
> >+	struct u64_stats_sync syncp;
> > 	int numa_node;
> >+	unsigned long reinit_state;
> > 	u64 rsc_count;			/* stat for coalesced packets */
> > 	u64 rsc_flush;			/* stats for flushed packets */
> > 	u32 restart_queue;		/* track tx queue restarts */
> >diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c
> >b/drivers/net/ixgbe/ixgbe_ethtool.c
> >index d4ac943..3c7f15d 100644
> >--- a/drivers/net/ixgbe/ixgbe_ethtool.c
> >+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
> >@@ -999,12 +999,11 @@ static void ixgbe_get_ethtool_stats(struct net_device
> >*netdev,
> >                                     struct ethtool_stats *stats, u64
> >*data)
> > {
> > 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
> >-	u64 *queue_stat;
> >-	int stat_count = sizeof(struct ixgbe_queue_stats) / sizeof(u64);
> > 	struct rtnl_link_stats64 temp;
> > 	const struct rtnl_link_stats64 *net_stats;
> >-	int j, k;
> >-	int i;
> >+	unsigned int start;
> >+	struct ixgbe_ring *ring;
> >+	int i, j;
> > 	char *p = NULL;
> >
> > 	ixgbe_update_stats(adapter);
> >@@ -1025,16 +1024,22 @@ static void ixgbe_get_ethtool_stats(struct
> >net_device *netdev,
> > 		           sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
> > 	}
> > 	for (j = 0; j < adapter->num_tx_queues; j++) {
> >-		queue_stat = (u64 *)&adapter->tx_ring[j]->stats;
> >-		for (k = 0; k < stat_count; k++)
> >-			data[i + k] = queue_stat[k];
> >-		i += k;
> >+		ring = adapter->tx_ring[j];
> >+		do {
> >+			start = u64_stats_fetch_begin_bh(&ring->syncp);
> >+			data[i]   = ring->stats.packets;
> >+			data[i+1] = ring->stats.bytes;
> >+		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
> >+		i += 2;
> > 	}
> > 	for (j = 0; j < adapter->num_rx_queues; j++) {
> >-		queue_stat = (u64 *)&adapter->rx_ring[j]->stats;
> >-		for (k = 0; k < stat_count; k++)
> >-			data[i + k] = queue_stat[k];
> >-		i += k;
> >+		ring = adapter->rx_ring[j];
> >+		do {
> >+			start = u64_stats_fetch_begin_bh(&ring->syncp);
> >+			data[i]   = ring->stats.packets;
> >+			data[i+1] = ring->stats.bytes;
> >+		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
> >+		i += 2;
> > 	}
> > 	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
> > 		for (j = 0; j < MAX_TX_PACKET_BUFFERS; j++) {
> >diff --git a/drivers/net/ixgbe/ixgbe_main.c
> >b/drivers/net/ixgbe/ixgbe_main.c
> >index 95dbf60..1efbcde 100644
> >--- a/drivers/net/ixgbe/ixgbe_main.c
> >+++ b/drivers/net/ixgbe/ixgbe_main.c
> >@@ -824,8 +824,10 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector
> >*q_vector,
> >
> > 	tx_ring->total_bytes += total_bytes;
> > 	tx_ring->total_packets += total_packets;
> >+	u64_stats_update_begin(&tx_ring->syncp);
> > 	tx_ring->stats.packets += total_packets;
> > 	tx_ring->stats.bytes += total_bytes;
> >+	u64_stats_update_end(&tx_ring->syncp);
> > 	return count < tx_ring->work_limit;
> > }
> >
> >@@ -1172,7 +1174,6 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector
> >*q_vector,
> > 			       int *work_done, int work_to_do)
> > {
> > 	struct ixgbe_adapter *adapter = q_vector->adapter;
> >-	struct net_device *netdev = adapter->netdev;
> > 	struct pci_dev *pdev = adapter->pdev;
> > 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
> > 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
> >@@ -1298,8 +1299,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector
> >*q_vector,
> > 					rx_ring->rsc_count++;
> > 				rx_ring->rsc_flush++;
> > 			}
> >+			u64_stats_update_begin(&rx_ring->syncp);
> > 			rx_ring->stats.packets++;
> > 			rx_ring->stats.bytes += skb->len;
> >+			u64_stats_update_end(&rx_ring->syncp);
> > 		} else {
> > 			if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
> > 				rx_buffer_info->skb = next_buffer->skb;
> >@@ -1375,8 +1378,6 @@ next_desc:
> >
> > 	rx_ring->total_packets += total_rx_packets;
> > 	rx_ring->total_bytes += total_rx_bytes;
> >-	netdev->stats.rx_bytes += total_rx_bytes;
> >-	netdev->stats.rx_packets += total_rx_packets;
> >
> > 	return cleaned;
> > }
> >@@ -6559,6 +6560,38 @@ static void ixgbe_netpoll(struct net_device *netdev)
> > }
> > #endif
> >
> >+static struct rtnl_link_stats64 *ixgbe_get_stats64(struct net_device
> >*netdev,
> >+						   struct rtnl_link_stats64 *stats)
> >+{
> >+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
> >+	int i;
> >+
> >+	/* accurate rx/tx bytes/packets stats */
> >+	dev_txq_stats_fold(netdev, stats);
> >+	for (i = 0; i < adapter->num_rx_queues; i++) {
> >+		struct ixgbe_ring *ring = adapter->rx_ring[i];
> >+		u64 bytes, packets;
> >+		unsigned int start;
> >+
> >+		do {
> >+			start = u64_stats_fetch_begin_bh(&ring->syncp);
> >+			packets = ring->stats.packets;
> >+			bytes   = ring->stats.bytes;
> >+		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
> >+		stats->rx_packets += packets;
> >+		stats->rx_bytes   += bytes;
> >+	}
> >+
> >+	/* following stats updated by ixgbe_watchdog_task() */
> >+	stats->multicast	= netdev->stats.multicast;
> >+	stats->rx_errors	= netdev->stats.rx_errors;
> >+	stats->rx_length_errors	= netdev->stats.rx_length_errors;
> >+	stats->rx_crc_errors	= netdev->stats.rx_crc_errors;
> >+	stats->rx_missed_errors	= netdev->stats.rx_missed_errors;
> >+	return stats;
> >+}
> >+
> >+
> > static const struct net_device_ops ixgbe_netdev_ops = {
> > 	.ndo_open		= ixgbe_open,
> > 	.ndo_stop		= ixgbe_close,
> >@@ -6578,6 +6611,7 @@ static const struct net_device_ops ixgbe_netdev_ops =
> >{
> > 	.ndo_set_vf_vlan	= ixgbe_ndo_set_vf_vlan,
> > 	.ndo_set_vf_tx_rate	= ixgbe_ndo_set_vf_bw,
> > 	.ndo_get_vf_config	= ixgbe_ndo_get_vf_config,
> >+	.ndo_get_stats64	= ixgbe_get_stats64,
> > #ifdef CONFIG_NET_POLL_CONTROLLER
> > 	.ndo_poll_controller	= ixgbe_netpoll,
> > #endif
> >
> 
> Eric,
> 
> We are seeing intermittent hangs on ia32 arch which seem to point to this patch:
> 
> BUG: unable to handle kernel NULL pointer dereference at 00000040
>  IP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe]
>  *pdpt = 000000002dc83001 *pde = 000000032d7e5067
>  Oops: 0000 [#2] SMP
>  last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
>  Modules linked in: act_skbedit cls_u32 sch_multiq ixgbe mdio netconsole configfs autofs4 8021q garp stp llc sunrpc ipv6 e
> xt3 jbd dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath power_meter hwmon sg ses enclosure dcdbas pcspkr serio_raw iTCO_wdt iTCO_vendor_support io
> atdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mod [last unloaded: mperf
> ]
> 
>  Pid: 1939, comm: irqbalance Tainted: G      D W   2.6.36-rc7-upstream-net-next-2.6-ixgbe-queue-i386-g55e1a84 #1 09CGW2/Po
> werEdge T610
>  EIP: 0060:[<f7f6b537>] EFLAGS: 00010206 CPU: 0
>  EIP is at ixgbe_get_stats64+0x47/0x120 [ixgbe]
>  EAX: 00000000 EBX: ecc45e4c ECX: ebea0400 EDX: 00000000
>  ESI: ebea0000 EDI: 00000018 EBP: f7f846a0 ESP: ecc45d88
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>  Process irqbalance (pid: 1939, ti=ecc44000 task=efc63a50 task.ti=ecc44000)
>  Stack:
>  ecc45e4c 00000000 ebea0400 ebea0000 ecc45e4c ebea0000 ecc45f04 f7f846a0
>  <0> c0750593 ebea0000 edfec340 ebea0000 000002b5 c075063a edfec340 c0993eec
>  <0> ebe78000 0000021c 00000000 00000009 00000000 00000000 00000000 00000000
>  Call Trace:
>  [<c0750593>] ? dev_get_stats+0x33/0xc0
>  [<c075063a>] ? dev_seq_printf_stats+0x1a/0x180
>  [<c07507aa>] ? dev_seq_show+0xa/0x20
>  [<c052398f>] ? seq_read+0x22f/0x3d0
>  [<c0523760>] ? seq_read+0x0/0x3d0
>  [<c054fdde>] ? proc_reg_read+0x5e/0x90
>  [<c054fd80>] ? proc_reg_read+0x0/0x90
>  [<c050a1dd>] ? vfs_read+0x9d/0x160
>  [<c049d4ef>] ? audit_syscall_entry+0x20f/0x230
>  [<c050a971>] ? sys_read+0x41/0x70
>  [<c0409cdf>] ? sysenter_do_call+0x12/0x28
>  Code: 60 4f 7e c8 8b 44 24 08 8b b8 20 06 00 00 85 ff 7e 63 c7 44 24 04 00 00 00 00 66 90 8b 54 24 04 8b 4c 24 08 8b 84 9
> 1 00 05 00 00 <8b> 50 40 eb 06 8d 74 26 00 89 ca f6 c2 01 0f 85 ae 00 00 00 8b
>  EIP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe] SS:ESP 0068:ecc45d88
>  CR2: 0000000000000040
>  ---[ end trace 51ea89f4e57f54f1 ]---
> 
> Emil

Code disassembly makes no sense to me : this doesnt match C code at all

60 4f 7e c8 
8b 44 24 08             mov    0x08(%esp),%eax 
8b b8 20 06 00 00       mov    0x620(%eax),%edi
85 ff                   test   %edi,%edi
7e 63                   jle    +0x63
c7 44 24 04 00 00 00 00 movl   $0,0x4(%esp)
66 90                   xchg   %ax,%ax
8b 54 24 04             mov    0x04(%esp),%edx   (0 in your report)
8b 4c 24 08             mov    0x08(%esp),%ecx   (ebea0400 in your rep)
8b 84 91 00 05 00 00    mov    0x500(,),eax

<8b> 50 40              mov    0x40(%eax),%edx    NULL pointer deref

eb 06                   jmp    +6
8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
89 ca                   mov    %ecx,%edx
f6 c2 01                test   $0x1,%dl 
0f 85 ae 00 00 00       jne 
8b

Could you disassemble your ixgbe_get_stats64() function ?

Thanks

Mine is :

00000e29 <ixgbe_get_stats64>:
     e29:       55                      push   %ebp
     e2a:       89 e5                   mov    %esp,%ebp
     e2c:       57                      push   %edi
     e2d:       56                      push   %esi
     e2e:       89 c6                   mov    %eax,%esi
     e30:       53                      push   %ebx
     e31:       89 d3                   mov    %edx,%ebx
     e33:       83 ec 18                sub    $0x18,%esp
     e36:       8d b8 40 03 00 00       lea    0x340(%eax),%edi
     e3c:       e8 fc ff ff ff          call   dev_txq_stats_fold
     e41:       31 c0                   xor    %eax,%eax
     e43:       eb 54                   jmp    e99 <ixgbe_get_stats64+0x70>
     e45:       8b 94 87 40 0d 00 00    mov    0xd40(%edi,%eax,4),%edx
     e4c:       89 75 e0                mov    %esi,-0x20(%ebp)
     e4f:       89 5d dc                mov    %ebx,-0x24(%ebp)
     e52:       8b 4a 34                mov    0x34(%edx),%ecx
     e55:       f6 c1 01                test   $0x1,%cl
     e58:       74 04                   je     e5e <ixgbe_get_stats64+0x35>
     e5a:       f3 90                   pause
     e5c:       eb f4                   jmp    e52 <ixgbe_get_stats64+0x29>
     e5e:       8b 5a 24                mov    0x24(%edx),%ebx
     e61:       8b 72 28                mov    0x28(%edx),%esi
     e64:       89 5d ec                mov    %ebx,-0x14(%ebp)
     e67:       8b 5a 2c                mov    0x2c(%edx),%ebx
     e6a:       89 75 f0                mov    %esi,-0x10(%ebp)
     e6d:       8b 72 30                mov    0x30(%edx),%esi
     e70:       89 5d e4                mov    %ebx,-0x1c(%ebp)
     e73:       89 75 e8                mov    %esi,-0x18(%ebp)
     e76:       39 4a 34                cmp    %ecx,0x34(%edx)
     e79:       75 d7                   jne    e52 <ixgbe_get_stats64+0x29>
     e7b:       8b 5d dc                mov    -0x24(%ebp),%ebx
     e7e:       8b 55 ec                mov    -0x14(%ebp),%edx
     e81:       8b 75 e0                mov    -0x20(%ebp),%esi
     e84:       8b 4d f0                mov    -0x10(%ebp),%ecx
     e87:       01 13                   add    %edx,(%ebx)
     e89:       8b 55 e4                mov    -0x1c(%ebp),%edx
     e8c:       11 4b 04                adc    %ecx,0x4(%ebx)
     e8f:       01 53 10                add    %edx,0x10(%ebx)
     e92:       8b 4d e8                mov    -0x18(%ebp),%ecx
     e95:       11 4b 14                adc    %ecx,0x14(%ebx)
     e98:       40                      inc    %eax
     e99:       3b 87 40 0e 00 00       cmp    0xe40(%edi),%eax
     e9f:       7c a4                   jl     e45 <ixgbe_get_stats64+0x1c>
     ea1:       8b 8e 90 00 00 00       mov    0x90(%esi),%ecx
     ea7:       c7 43 44 00 00 00 00    movl   $0x0,0x44(%ebx)
     eae:       89 4b 40                mov    %ecx,0x40(%ebx)
     eb1:       8b 86 80 00 00 00       mov    0x80(%esi),%eax
     eb7:       c7 43 24 00 00 00 00    movl   $0x0,0x24(%ebx)
     ebe:       89 43 20                mov    %eax,0x20(%ebx)
     ec1:       8b 96 98 00 00 00       mov    0x98(%esi),%edx
     ec7:       c7 43 54 00 00 00 00    movl   $0x0,0x54(%ebx)
     ece:       89 53 50                mov    %edx,0x50(%ebx)
     ed1:       8b 8e a0 00 00 00       mov    0xa0(%esi),%ecx
     ed7:       c7 43 64 00 00 00 00    movl   $0x0,0x64(%ebx)
     ede:       89 4b 60                mov    %ecx,0x60(%ebx)
     ee1:       8b 86 ac 00 00 00       mov    0xac(%esi),%eax
     ee7:       c7 43 7c 00 00 00 00    movl   $0x0,0x7c(%ebx)
     eee:       89 43 78                mov    %eax,0x78(%ebx)
     ef1:       83 c4 18                add    $0x18,%esp
     ef4:       89 d8                   mov    %ebx,%eax
     ef6:       5b                      pop    %ebx
     ef7:       5e                      pop    %esi
     ef8:       5f                      pop    %edi
     ef9:       5d                      pop    %ebp
     efa:       c3                      ret

No 0x40(ptr) deref for me

Could adapter->rx_ring[i] becomes NULL sometime in your driver ?

If yes, a lock/protection is needed (same for ethtool stats I guess)



^ permalink raw reply

* Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
From: Shirley Ma @ 2010-10-28  4:40 UTC (permalink / raw)
  To: mst@redhat.com; +Cc: David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288216693.17571.38.camel@localhost.localdomain>

Resubmit this patch for fixing some minor error (white space, typo).

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
 drivers/vhost/net.c   |   20 +++++++++++++++++++-
 drivers/vhost/vhost.c |   32 ++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |    3 +++
 3 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4b4da5b..3eb8016 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,6 +128,7 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int max_pend = vq->num - (vq->num >> 2);
 
 	sock = rcu_dereference_check(vq->private_data,
 				     lockdep_is_held(&vq->mutex));
@@ -198,7 +199,24 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		/*
+		 * if no pending buffer size allocate, signal used buffer
+		 * one by one, otherwise, signal used buffer when reaching
+		 * 3/4 ring size to reduce CPU utilization.
+		 */
+		if (unlikely(vq->pend))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else {
+			vq->pend[vq->num_pend].id = head;
+			vq->pend[vq->num_pend].len = 0;
+			++vq->num_pend;
+			if (vq->num_pend == max_pend) {
+				vhost_add_used_and_signal_n(&net->dev, vq,
+							    vq->pend,
+							    vq->num_pend);
+				vq->num_pend = 0;
+			}
+		}
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 94701ff..f2f3288 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	/* signal pending used buffers */
+	if (vq->pend) {
+		if (vq->num_pend != 0) {
+			vhost_add_used_and_signal_n(dev, vq, vq->pend,
+						    vq->num_pend);
+			vq->num_pend = 0;
+		}
+		kfree(vq->pend);
+	}
+	vq->pend = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -273,7 +283,14 @@ long vhost_dev_init(struct vhost_dev *dev,
 		dev->vqs[i].heads = NULL;
 		dev->vqs[i].dev = dev;
 		mutex_init(&dev->vqs[i].mutex);
+		dev->vqs[i].num_pend = 0;
+		dev->vqs[i].pend = NULL;
 		vhost_vq_reset(dev, dev->vqs + i);
+		/* signal 3/4 of ring size used buffers */
+		dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
+					   (dev->vqs[i].num >> 2)) *
+					   sizeof *dev->vqs[i].pend,
+					   GFP_KERNEL);
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick, POLLIN, dev);
@@ -599,6 +616,21 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 			r = -EINVAL;
 			break;
 		}
+		if (vq->num != s.num) {
+			/* signal used buffers first */
+			if (vq->pend) {
+				if (vq->num_pend != 0) {
+					vhost_add_used_and_signal_n(vq->dev, vq,
+								    vq->pend,
+								    vq->num_pend);
+					vq->num_pend = 0;
+				}
+				kfree(vq->pend);
+			}
+			/* realloc pending used buffers size */
+			vq->pend = kmalloc((s.num - (s.num >> 2)) *
+					   sizeof *vq->pend, GFP_KERNEL);
+		}
 		vq->num = s.num;
 		break;
 	case VHOST_SET_VRING_BASE:
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 073d06a..78949c0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -108,6 +108,9 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	/* delay multiple used buffers to signal once */
+	int num_pend;
+	struct vring_used_elem *pend;
 };
 
 struct vhost_dev {



^ permalink raw reply related

* [PATCH] via-velocity: Codestyle fixes
From: alex @ 2010-10-28  5:00 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, alex

From: Alexander Wigen <alex@wigen.net>

checkpatch.pl --terse --file drivers/net/via-velocity.c
Before: total: 13 errors, 106 warnings, 3537 lines checked
After: total: 2 errors, 96 warnings, 3526 lines checked

Signed-off-by: Alexander Wigen <alex@wigen.net>
---
 drivers/net/via-velocity.c |  181 +++++++++++++++++++++-----------------------
 1 files changed, 85 insertions(+), 96 deletions(-)

diff --git a/drivers/net/via-velocity.c b/drivers/net/via-velocity.c
index cab96ad..fe80537 100644
--- a/drivers/net/via-velocity.c
+++ b/drivers/net/via-velocity.c
@@ -84,7 +84,7 @@ static int velocity_nics;
 static int msglevel = MSG_LEVEL_INFO;
 
 /**
- *	mac_get_cam_mask	-	Read a CAM mask
+ *	mac_get_cam_mask - Read a CAM mask
  *	@regs: register block for this velocity
  *	@mask: buffer to store mask
  *
@@ -113,7 +113,7 @@ static void mac_get_cam_mask(struct mac_regs __iomem *regs, u8 *mask)
 
 
 /**
- *	mac_set_cam_mask	-	Set a CAM mask
+ *	mac_set_cam_mask - Set a CAM mask
  *	@regs: register block for this velocity
  *	@mask: CAM mask to load
  *
@@ -156,7 +156,7 @@ static void mac_set_vlan_cam_mask(struct mac_regs __iomem *regs, u8 *mask)
 }
 
 /**
- *	mac_set_cam	-	set CAM data
+ *	mac_set_cam - set CAM data
  *	@regs: register block of this velocity
  *	@idx: Cam index
  *	@addr: 2 or 6 bytes of CAM data
@@ -211,11 +211,11 @@ static void mac_set_vlan_cam(struct mac_regs __iomem *regs, int idx,
 
 
 /**
- *	mac_wol_reset	-	reset WOL after exiting low power
+ *	mac_wol_reset - reset WOL after exiting low power
  *	@regs: register block of this velocity
  *
  *	Called after we drop out of wake on lan mode in order to
- *	reset the Wake on lan features. This function doesn't restore
+ *	reset the Wake On Lan features. This function doesn't restore
  *	the rest of the logic from the result of sleep/wakeup
  */
 static void mac_wol_reset(struct mac_regs __iomem *regs)
@@ -342,7 +342,7 @@ VELOCITY_PARAM(ValPktLen, "Receiving or Drop invalid 802.3 frame");
    1: Wake up if link status is on/off.
    2: Wake up if recevied an arp packet.
    4: Wake up if recevied any unicast packet.
-   Those value can be sumed up to support more than one option.
+   Those value can be summed up to support more than one option.
 */
 VELOCITY_PARAM(wol_opts, "Wake On Lan options");
 
@@ -370,7 +370,7 @@ static DEFINE_PCI_DEVICE_TABLE(velocity_id_table) = {
 MODULE_DEVICE_TABLE(pci, velocity_id_table);
 
 /**
- *	get_chip_name	- 	identifier to name
+ *	get_chip_name - identifier to name
  *	@id: chip identifier
  *
  *	Given a chip identifier return a suitable description. Returns
@@ -386,7 +386,7 @@ static const char __devinit *get_chip_name(enum chip_type chip_id)
 }
 
 /**
- *	velocity_remove1	-	device unplug
+ *	velocity_remove1 - device unplug
  *	@pdev: PCI device being removed
  *
  *	Device unload callback. Called on an unplug or on module
@@ -409,7 +409,7 @@ static void __devexit velocity_remove1(struct pci_dev *pdev)
 }
 
 /**
- *	velocity_set_int_opt	-	parser for integer options
+ *	velocity_set_int_opt - parser for integer options
  *	@opt: pointer to option value
  *	@val: value the user requested (or -1 for default)
  *	@min: lowest value allowed
@@ -438,7 +438,7 @@ static void __devinit velocity_set_int_opt(int *opt, int val, int min, int max,
 }
 
 /**
- *	velocity_set_bool_opt	-	parser for boolean options
+ *	velocity_set_bool_opt - parser for boolean options
  *	@opt: pointer to option value
  *	@val: value the user requested (or -1 for default)
  *	@def: default value (yes/no)
@@ -467,7 +467,7 @@ static void __devinit velocity_set_bool_opt(u32 *opt, int val, int def, u32 flag
 }
 
 /**
- *	velocity_get_options	-	set options on device
+ *	velocity_get_options - set options on device
  *	@opts: option structure for the device
  *	@index: index of option to use in module options array
  *	@devname: device name
@@ -492,7 +492,7 @@ static void __devinit velocity_get_options(struct velocity_opt *opts, int index,
 }
 
 /**
- *	velocity_init_cam_filter	-	initialise CAM
+ *	velocity_init_cam_filter - initialize CAM
  *	@vptr: velocity to program
  *
  *	Initialize the content addressable memory used for filters. Load
@@ -507,8 +507,8 @@ static void velocity_init_cam_filter(struct velocity_info *vptr)
 	WORD_REG_BITS_ON(MCFG_VIDFR, &regs->MCFG);
 
 	/* Disable all CAMs */
-	memset(vptr->vCAMmask, 0, sizeof(u8) * 8);
-	memset(vptr->mCAMmask, 0, sizeof(u8) * 8);
+	memset(vptr->vCAMmask, 0, sizeof(u8) *8);
+	memset(vptr->mCAMmask, 0, sizeof(u8) *8);
 	mac_set_vlan_cam_mask(regs, vptr->vCAMmask);
 	mac_set_cam_mask(regs, vptr->mCAMmask);
 
@@ -564,7 +564,7 @@ static void velocity_init_rx_ring_indexes(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_rx_reset	-	handle a receive reset
+ *	velocity_rx_reset - handle a receive reset
  *	@vptr: velocity we are resetting
  *
  *	Reset the ownership and status for the receive ring side.
@@ -591,7 +591,7 @@ static void velocity_rx_reset(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_get_opt_media_mode	-	get media selection
+ *	velocity_get_opt_media_mode - get media selection
  *	@vptr: velocity adapter
  *
  *	Get the media mode stored in EEPROM or module options and load
@@ -627,7 +627,7 @@ static u32 velocity_get_opt_media_mode(struct velocity_info *vptr)
 }
 
 /**
- *	safe_disable_mii_autopoll	-	autopoll off
+ *	safe_disable_mii_autopoll - autopoll off
  *	@regs: velocity registers
  *
  *	Turn off the autopoll and wait for it to disable on the chip
@@ -646,7 +646,7 @@ static void safe_disable_mii_autopoll(struct mac_regs __iomem *regs)
 }
 
 /**
- *	enable_mii_autopoll	-	turn on autopolling
+ *	enable_mii_autopoll - turn on autopolling
  *	@regs: velocity registers
  *
  *	Enable the MII link status autopoll feature on the Velocity
@@ -676,7 +676,7 @@ static void enable_mii_autopoll(struct mac_regs __iomem *regs)
 }
 
 /**
- *	velocity_mii_read	-	read MII data
+ *	velocity_mii_read - read MII data
  *	@regs: velocity registers
  *	@index: MII register index
  *	@data: buffer for received data
@@ -712,7 +712,7 @@ static int velocity_mii_read(struct mac_regs __iomem *regs, u8 index, u16 *data)
 
 
 /**
- *	mii_check_media_mode	-	check media state
+ *	mii_check_media_mode - check media state
  *	@regs: velocity registers
  *
  *	Check the current MII status and determine the link status
@@ -755,7 +755,7 @@ static u32 mii_check_media_mode(struct mac_regs __iomem *regs)
 }
 
 /**
- *	velocity_mii_write	-	write MII data
+ *	velocity_mii_write - write MII data
  *	@regs: velocity registers
  *	@index: MII register index
  *	@data: 16bit data for the MII register
@@ -794,7 +794,7 @@ static int velocity_mii_write(struct mac_regs __iomem *regs, u8 mii_addr, u16 da
 }
 
 /**
- *	set_mii_flow_control	-	flow control setup
+ *	set_mii_flow_control - flow control setup
  *	@vptr: velocity interface
  *
  *	Set up the flow control on this interface according to
@@ -829,10 +829,10 @@ static void set_mii_flow_control(struct velocity_info *vptr)
 }
 
 /**
- *	mii_set_auto_on		-	autonegotiate on
+ *	mii_set_auto_on - autonegotiate on
  *	@vptr: velocity
  *
- *	Enable autonegotation on this interface
+ *	Enable autonegotiation on this interface
  */
 static void mii_set_auto_on(struct velocity_info *vptr)
 {
@@ -879,7 +879,7 @@ static u32 check_connection_type(struct mac_regs __iomem *regs)
 
 
 /**
- *	velocity_set_media_mode		-	set media mode
+ *	velocity_set_media_mode - set media mode
  *	@mii_status: old MII link state
  *
  *	Check the media link state and configure the flow control
@@ -999,7 +999,7 @@ static int velocity_set_media_mode(struct velocity_info *vptr, u32 mii_status)
 }
 
 /**
- *	velocity_print_link_status	-	link status reporting
+ *	velocity_print_link_status - link status reporting
  *	@vptr: velocity to report on
  *
  *	Turn the link status of the velocity card into a kernel log
@@ -1050,7 +1050,7 @@ static void velocity_print_link_status(struct velocity_info *vptr)
 }
 
 /**
- *	enable_flow_control_ability	-	flow control
+ *	enable_flow_control_ability - flow control
  *	@vptr: veloity to configure
  *
  *	Set up flow control according to the flow control options
@@ -1102,7 +1102,7 @@ static void enable_flow_control_ability(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_soft_reset	-	soft reset
+ *	velocity_soft_reset - soft reset
  *	@vptr: velocity to reset
  *
  *	Kick off a soft reset of the velocity adapter and then poll
@@ -1131,7 +1131,7 @@ static int velocity_soft_reset(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_set_multi	-	filter list change callback
+ *	velocity_set_multi - filter list change callback
  *	@dev: network device
  *
  *	Called by the network layer when the filter lists need to change
@@ -1181,7 +1181,7 @@ static void velocity_set_multi(struct net_device *dev)
  */
 
 /**
- *	mii_init	-	set up MII
+ *	mii_init - set up MII
  *	@vptr: velocity adapter
  *	@mii_status:  links tatus
  *
@@ -1250,7 +1250,7 @@ static void mii_init(struct velocity_info *vptr, u32 mii_status)
 }
 
 /**
- * setup_queue_timers	-	Setup interrupt timers
+ * setup_queue_timers - Setup interrupt timers
  *
  * Setup interrupt frequency during suppression (timeout if the frame
  * count isn't filled).
@@ -1273,7 +1273,7 @@ static void setup_queue_timers(struct velocity_info *vptr)
 	}
 }
 /**
- * setup_adaptive_interrupts  -  Setup interrupt suppression
+ * setup_adaptive_interrupts - Setup interrupt suppression
  *
  * @vptr velocity adapter
  *
@@ -1311,11 +1311,11 @@ static void setup_adaptive_interrupts(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_init_registers	-	initialise MAC registers
+ *	velocity_init_registers - initialise MAC registers
  *	@vptr: velocity to init
  *	@type: type of initialisation (hot or cold)
  *
- *	Initialise the MAC on a reset or on first set up on the
+ *	Initialize the MAC on a reset or on first set up on the
  *	hardware.
  */
 static void velocity_init_registers(struct velocity_info *vptr,
@@ -1459,7 +1459,7 @@ static void velocity_give_many_rx_descs(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_init_dma_rings	-	set up DMA rings
+ *	velocity_init_dma_rings - set up DMA rings
  *	@vptr: Velocity to set up
  *
  *	Allocate PCI mapped DMA rings for the receive and transmit layer
@@ -1511,7 +1511,7 @@ static void velocity_set_rxbufsize(struct velocity_info *vptr, int mtu)
 }
 
 /**
- *	velocity_alloc_rx_buf	-	allocate aligned receive buffer
+ *	velocity_alloc_rx_buf - allocate aligned receive buffer
  *	@vptr: velocity
  *	@idx: ring index
  *
@@ -1542,7 +1542,7 @@ static int velocity_alloc_rx_buf(struct velocity_info *vptr, int idx)
 	 *	Fill in the descriptor to match
 	 */
 
-	*((u32 *) & (rd->rdesc0)) = 0;
+	*((u32 *) &(rd->rdesc0)) = 0;
 	rd->size = cpu_to_le16(vptr->rx.buf_sz) | RX_INTEN;
 	rd->pa_low = cpu_to_le32(rd_info->skb_dma);
 	rd->pa_high = 0;
@@ -1578,7 +1578,7 @@ static int velocity_rx_refill(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_free_rd_ring	-	free receive ring
+ *	velocity_free_rd_ring - free receive ring
  *	@vptr: velocity to clean up
  *
  *	Free the receive buffers for each ring slot and any
@@ -1614,7 +1614,7 @@ static void velocity_free_rd_ring(struct velocity_info *vptr)
 
 
 /**
- *	velocity_init_rd_ring	-	set up receive ring
+ *	velocity_init_rd_ring - set up receive ring
  *	@vptr: velocity to configure
  *
  *	Allocate and set up the receive buffers for each ring slot and
@@ -1644,7 +1644,7 @@ out:
 }
 
 /**
- *	velocity_init_td_ring	-	set up transmit ring
+ *	velocity_init_td_ring - set up transmit ring
  *	@vptr:	velocity
  *
  *	Set up the transmit ring and chain the ring pointers together.
@@ -1673,7 +1673,7 @@ static int velocity_init_td_ring(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_free_dma_rings	-	free PCI ring pointers
+ *	velocity_free_dma_rings - free PCI ring pointers
  *	@vptr: Velocity to free from
  *
  *	Clean up the PCI ring buffers allocated to this velocity.
@@ -1715,7 +1715,7 @@ err_free_dma_rings_0:
 }
 
 /**
- *	velocity_free_tx_buf	-	free transmit buffer
+ *	velocity_free_tx_buf - free transmit buffer
  *	@vptr: velocity
  *	@tdinfo: buffer
  *
@@ -1776,7 +1776,7 @@ static void velocity_free_td_ring_entry(struct velocity_info *vptr,
 }
 
 /**
- *	velocity_free_td_ring	-	free td ring
+ *	velocity_free_td_ring - free td ring
  *	@vptr: velocity
  *
  *	Free up the transmit ring for this particular velocity adapter.
@@ -1806,7 +1806,7 @@ static void velocity_free_rings(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_error	-	handle error from controller
+ *	velocity_error - handle error from controller
  *	@vptr: velocity
  *	@status: card status
  *
@@ -1895,7 +1895,7 @@ static void velocity_error(struct velocity_info *vptr, int status)
 }
 
 /**
- *	tx_srv		-	transmit interrupt service
+ *	tx_srv - transmit interrupt service
  *	@vptr; Velocity
  *
  *	Scan the queues looking for transmitted packets that
@@ -1963,7 +1963,7 @@ static int velocity_tx_srv(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_rx_csum	-	checksum process
+ *	velocity_rx_csum - checksum process
  *	@rd: receive packet descriptor
  *	@skb: network layer packet buffer
  *
@@ -1987,14 +1987,14 @@ static inline void velocity_rx_csum(struct rx_desc *rd, struct sk_buff *skb)
 }
 
 /**
- *	velocity_rx_copy	-	in place Rx copy for small packets
+ *	velocity_rx_copy - in place Rx copy for small packets
  *	@rx_skb: network layer packet buffer candidate
  *	@pkt_size: received data size
  *	@rd: receive packet descriptor
  *	@dev: network device
  *
  *	Replace the current skb that is scheduled for Rx processing by a
- *	shorter, immediatly allocated skb, if the received packet is small
+ *	shorter, immediately allocated skb, if the received packet is small
  *	enough. This function returns a negative value if the received
  *	packet is too big or if memory is exhausted.
  */
@@ -2018,7 +2018,7 @@ static int velocity_rx_copy(struct sk_buff **rx_skb, int pkt_size,
 }
 
 /**
- *	velocity_iph_realign	-	IP header alignment
+ *	velocity_iph_realign - IP header alignment
  *	@vptr: velocity we are handling
  *	@skb: network layer packet buffer
  *	@pkt_size: received data size
@@ -2037,7 +2037,7 @@ static inline void velocity_iph_realign(struct velocity_info *vptr,
 
 
 /**
- *	velocity_receive_frame	-	received packet processor
+ *	velocity_receive_frame - received packet processor
  *	@vptr: velocity we are handling
  *	@idx: ring index
  *
@@ -2107,7 +2107,7 @@ static int velocity_receive_frame(struct velocity_info *vptr, int idx)
 
 
 /**
- *	velocity_rx_srv		-	service RX interrupt
+ *	velocity_rx_srv - service RX interrupt
  *	@vptr: velocity
  *
  *	Walk the receive ring of the velocity adapter and remove
@@ -2191,7 +2191,7 @@ static int velocity_poll(struct napi_struct *napi, int budget)
 }
 
 /**
- *	velocity_intr		-	interrupt callback
+ *	velocity_intr - interrupt callback
  *	@irq: interrupt number
  *	@dev_instance: interrupting device
  *
@@ -2232,14 +2232,14 @@ static irqreturn_t velocity_intr(int irq, void *dev_instance)
 }
 
 /**
- *	velocity_open		-	interface activation callback
+ *	velocity_open - interface activation callback
  *	@dev: network layer device to open
  *
  *	Called when the network layer brings the interface up. Returns
  *	a negative posix error code on failure, or zero on success.
  *
  *	All the ring allocation and set up is done on open for this
- *	adapter to minimise memory usage when inactive
+ *	adapter to minimize memory usage when inactive
  */
 static int velocity_open(struct net_device *dev)
 {
@@ -2275,7 +2275,7 @@ out:
 }
 
 /**
- *	velocity_shutdown	-	shut down the chip
+ *	velocity_shutdown - shut down the chip
  *	@vptr: velocity to deactivate
  *
  *	Shuts down the internal operations of the velocity and
@@ -2293,7 +2293,7 @@ static void velocity_shutdown(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_change_mtu	-	MTU change callback
+ *	velocity_change_mtu - MTU change callback
  *	@dev: network device
  *	@new_mtu: desired MTU
  *
@@ -2374,7 +2374,7 @@ out_0:
 }
 
 /**
- *	velocity_mii_ioctl		-	MII ioctl handler
+ *	velocity_mii_ioctl - MII ioctl handler
  *	@dev: network device
  *	@ifr: the ifreq block for the ioctl
  *	@cmd: the command
@@ -2415,7 +2415,7 @@ static int velocity_mii_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd
 
 
 /**
- *	velocity_ioctl		-	ioctl entry point
+ *	velocity_ioctl - ioctl entry point
  *	@dev: network device
  *	@rq: interface request ioctl
  *	@cmd: command code
@@ -2452,7 +2452,7 @@ static int velocity_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 }
 
 /**
- *	velocity_get_status	-	statistics callback
+ *	velocity_get_status - statistics callback
  *	@dev: network device
  *
  *	Callback from the network layer to allow driver statistics
@@ -2477,24 +2477,14 @@ static struct net_device_stats *velocity_get_stats(struct net_device *dev)
 	dev->stats.rx_errors = vptr->mib_counter[HW_MIB_ifRxErrorPkts];
 	dev->stats.rx_length_errors = vptr->mib_counter[HW_MIB_ifInRangeLengthErrors];
 
-//  unsigned long   rx_dropped;     /* no space in linux buffers    */
 	dev->stats.collisions = vptr->mib_counter[HW_MIB_ifTxEtherCollisions];
-	/* detailed rx_errors: */
-//  unsigned long   rx_length_errors;
-//  unsigned long   rx_over_errors;     /* receiver ring buff overflow  */
 	dev->stats.rx_crc_errors = vptr->mib_counter[HW_MIB_ifRxPktCRCE];
-//  unsigned long   rx_frame_errors;    /* recv'd frame alignment error */
-//  unsigned long   rx_fifo_errors;     /* recv'r fifo overrun      */
-//  unsigned long   rx_missed_errors;   /* receiver missed packet   */
-
-	/* detailed tx_errors */
-//  unsigned long   tx_fifo_errors;
 
 	return &dev->stats;
 }
 
 /**
- *	velocity_close		-	close adapter callback
+ *	velocity_close - close adapter callback
  *	@dev: network device
  *
  *	Callback from the network layer when the velocity is being
@@ -2523,11 +2513,11 @@ static int velocity_close(struct net_device *dev)
 }
 
 /**
- *	velocity_xmit		-	transmit packet callback
+ *	velocity_xmit - transmit packet callback
  *	@skb: buffer to transmit
  *	@dev: network device
  *
- *	Called by the networ layer to request a packet is queued to
+ *	Called by the network layer to request a packet is queued to
  *	the velocity. Returns zero on success.
  */
 static netdev_tx_t velocity_xmit(struct sk_buff *skb,
@@ -2636,7 +2626,7 @@ static const struct net_device_ops velocity_netdev_ops = {
 	.ndo_start_xmit		= velocity_xmit,
 	.ndo_get_stats		= velocity_get_stats,
 	.ndo_validate_addr	= eth_validate_addr,
-	.ndo_set_mac_address 	= eth_mac_addr,
+	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_set_multicast_list	= velocity_set_multi,
 	.ndo_change_mtu		= velocity_change_mtu,
 	.ndo_do_ioctl		= velocity_ioctl,
@@ -2646,7 +2636,7 @@ static const struct net_device_ops velocity_netdev_ops = {
 };
 
 /**
- *	velocity_init_info	-	init private data
+ *	velocity_init_info - init private data
  *	@pdev: PCI device
  *	@vptr: Velocity info
  *	@info: Board type
@@ -2668,7 +2658,7 @@ static void __devinit velocity_init_info(struct pci_dev *pdev,
 }
 
 /**
- *	velocity_get_pci_info	-	retrieve PCI info for device
+ *	velocity_get_pci_info - retrieve PCI info for device
  *	@vptr: velocity device
  *	@pdev: PCI device it matches
  *
@@ -2706,7 +2696,7 @@ static int __devinit velocity_get_pci_info(struct velocity_info *vptr, struct pc
 }
 
 /**
- *	velocity_print_info	-	per driver data
+ *	velocity_print_info - per driver data
  *	@vptr: velocity
  *
  *	Print per driver data as the kernel driver finds Velocity
@@ -2730,7 +2720,7 @@ static u32 velocity_get_link(struct net_device *dev)
 
 
 /**
- *	velocity_found1		-	set up discovered velocity card
+ *	velocity_found1 - set up discovered velocity card
  *	@pdev: PCI device
  *	@ent: PCI device table entry that matched
  *
@@ -2877,7 +2867,7 @@ err_free_dev:
 
 #ifdef CONFIG_PM
 /**
- *	wol_calc_crc		-	WOL CRC
+ *	wol_calc_crc - WOL CRC
  *	@pattern: data pattern
  *	@mask_pattern: mask
  *
@@ -2912,7 +2902,7 @@ static u16 wol_calc_crc(int size, u8 *pattern, u8 *mask_pattern)
 }
 
 /**
- *	velocity_set_wol	-	set up for wake on lan
+ *	velocity_set_wol - set up for wake on lan
  *	@vptr: velocity to set WOL status on
  *
  *	Set a card up for wake on lan either by unicast or by
@@ -2957,7 +2947,7 @@ static int velocity_set_wol(struct velocity_info *vptr)
 		memcpy(arp->ar_tip, vptr->ip_addr, 4);
 
 		crc = wol_calc_crc((sizeof(struct arp_packet) + 7) / 8, buf,
-				(u8 *) & mask_pattern[0][0]);
+				(u8 *)mask_pattern[0][0]);
 
 		writew(crc, &regs->PatternCRC[0]);
 		writew(WOLCR_ARP_EN, &regs->WOLCRSet);
@@ -2997,7 +2987,7 @@ static int velocity_set_wol(struct velocity_info *vptr)
 }
 
 /**
- *	velocity_save_context	-	save registers
+ *	velocity_save_context - save registers
  *	@vptr: velocity
  *	@context: buffer for stored context
  *
@@ -3058,7 +3048,7 @@ static int velocity_suspend(struct pci_dev *pdev, pm_message_t state)
 }
 
 /**
- *	velocity_restore_context	-	restore registers
+ *	velocity_restore_context - restore registers
  *	@vptr: velocity
  *	@context: buffer for stored context
  *
@@ -3133,19 +3123,19 @@ static int velocity_resume(struct pci_dev *pdev)
  *	uses this to handle all our card discover and plugging
  */
 static struct pci_driver velocity_driver = {
-      .name	= VELOCITY_NAME,
-      .id_table	= velocity_id_table,
-      .probe	= velocity_found1,
-      .remove	= __devexit_p(velocity_remove1),
+	.name	= VELOCITY_NAME,
+	.id_table	= velocity_id_table,
+	.probe	= velocity_found1,
+	.remove	= __devexit_p(velocity_remove1),
 #ifdef CONFIG_PM
-      .suspend	= velocity_suspend,
-      .resume	= velocity_resume,
+	.suspend	= velocity_suspend,
+	.resume	= velocity_resume,
 #endif
 };
 
 
 /**
- *	velocity_ethtool_up	-	pre hook for ethtool
+ *	velocity_ethtool_up - pre hook for ethtool
  *	@dev: network device
  *
  *	Called before an ethtool operation. We need to make sure the
@@ -3160,7 +3150,7 @@ static int velocity_ethtool_up(struct net_device *dev)
 }
 
 /**
- *	velocity_ethtool_down	-	post hook for ethtool
+ *	velocity_ethtool_down - post hook for ethtool
  *	@dev: network device
  *
  *	Called after an ethtool operation. Restore the chip back to D3
@@ -3352,8 +3342,7 @@ static int get_pending_timer_val(int val)
 	int mult_bits = val >> 6;
 	int mult = 1;
 
-	switch (mult_bits)
-	{
+	switch (mult_bits) {
 	case 1:
 		mult = 4; break;
 	case 2:
@@ -3454,7 +3443,7 @@ static const struct ethtool_ops velocity_ethtool_ops = {
 	.set_wol	=	velocity_ethtool_set_wol,
 	.get_msglevel	=	velocity_get_msglevel,
 	.set_msglevel	=	velocity_set_msglevel,
-	.set_sg 	=	ethtool_op_set_sg,
+	.set_sg		=	ethtool_op_set_sg,
 	.get_link	=	velocity_get_link,
 	.get_coalesce	=	velocity_get_coalesce,
 	.set_coalesce	=	velocity_set_coalesce,
@@ -3480,7 +3469,7 @@ static int velocity_netdev_event(struct notifier_block *nb, unsigned long notifi
 
 #if defined(CONFIG_PM) && defined(CONFIG_INET)
 static struct notifier_block velocity_inetaddr_notifier = {
-      .notifier_call	= velocity_netdev_event,
+	.notifier_call	= velocity_netdev_event,
 };
 
 static void velocity_register_notifier(void)
@@ -3501,7 +3490,7 @@ static void velocity_unregister_notifier(void)
 #endif	/* defined(CONFIG_PM) && defined(CONFIG_INET) */
 
 /**
- *	velocity_init_module	-	load time function
+ *	velocity_init_module - load time function
  *
  *	Called when the velocity module is loaded. The PCI driver
  *	is registered with the PCI layer, and in turn will call
@@ -3520,7 +3509,7 @@ static int __init velocity_init_module(void)
 }
 
 /**
- *	velocity_cleanup	-	module unload
+ *	velocity_cleanup - module unload
  *
  *	When the velocity hardware is unloaded this function is called.
  *	It will clean up the notifiers and the unregister the PCI
-- 
1.7.3.2





^ permalink raw reply related

* [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Nishanth Aravamudan @ 2010-10-28  5:06 UTC (permalink / raw)
  To: eric.dumazet
  Cc: Nishanth Aravamudan, sonnyrao, Divy Le Ray, Dimitris Michailidis,
	netdev, linux-kernel
In-Reply-To: <1288236563.2658.59.camel@edumazet-laptop>

Hi Eric,

Something like the following?:

Thanks,
Nish


Along the same lines as "cxgb4: fix crash due to manipulating queues
before registration" (8f6d9f40476895571df039b6f1f5230ec7faebad), before
commit "net: allocate tx queues in register_netdevice"
netif_tx_stop_all_queues and related functions could be used between
device allocation and registration but now only after registration.
cxgb4 has such a call before registration and crashes now.  Move it
after register_netdev.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: eric.dumazet@gmail.com
Cc: sonnyrao@us.ibm.com
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Dimitris Michailidis <dm@chelsio.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/cxgb3/cxgb3_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 4e3c123..96c70a5 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -3301,7 +3301,6 @@ static int __devinit init_one(struct pci_dev *pdev,
 		pi->rx_offload = T3_RX_CSUM | T3_LRO;
 		pi->port_id = i;
 		netif_carrier_off(netdev);
-		netif_tx_stop_all_queues(netdev);
 		netdev->irq = pdev->irq;
 		netdev->mem_start = mmio_start;
 		netdev->mem_end = mmio_start + mmio_len - 1;
@@ -3342,6 +3341,7 @@ static int __devinit init_one(struct pci_dev *pdev,
 				adapter->name = adapter->port[i]->name;
 
 			__set_bit(i, &adapter->registered_device_map);
+			 netif_tx_stop_all_queues(adapter->port[i]);
 		}
 	}
 	if (!adapter->registered_device_map) {
-- 
1.7.1


^ permalink raw reply related

* [PATCH] cxgb3: Fix panic in free_tx_desc()
From: Krishna Kumar @ 2010-10-28  5:10 UTC (permalink / raw)
  To: davem, divy; +Cc: netdev, Krishna Kumar

I got a few of these panics (on 2.6.36-rc7) when running high
number of netperf sessions:

BUG: unable to handle kernel paging request at 0000100000000000
IP: [<ffffffff813125f0>] skb_release_data+0xa0/0xd0
Oops: 0000 [#1] SMP
Pid: 2155, comm: vhost-2115 Not tainted 2.6.36-rc7-ORG #1 49Y6512     /System x3650 M2 -[7947AC1]-
RIP: 0010:[<ffffffff813125f0>]  [<ffffffff813125f0>] skb_release_data+0xa0/0xd0
RSP: 0018:ffff880001803738  EFLAGS: 00010206
RAX: ffff880179b0fc00 RBX: ffff880178b441c0 RCX: 0000000000000000
RSP: 0018:ffff880001803738  EFLAGS: 00010206
RAX: ffff880179b0fc00 RBX: ffff880178b441c0 RCX: 0000000000000000
RDX: ffff880179b0fd40 RSI: 0000000000000000 RDI: 0000100000000000
RBP: ffff880001803748 R08: 0000000000000001 R09: ffff88017f117000
R10: ffff88017b990608 R11: ffff88017f117090 R12: ffff880178b441c0
R13: ffff88017f117090 R14: 0000000000000000 R15: ffff880178b441c0
FS:  0000000000000000(0000) GS:ffff880001800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000100000000000 CR3: 000000017ea64000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost-2115 (pid: 2155, threadinfo ffff88017d872000, task ffff88017e954680)
Stack:
ffff880178b441c0 0000000000000007 ffff880001803768 ffffffff81312119
<0> 0000000000000000 0000000000000002 ffff880001803778 ffffffff813121f9
<0> ffff880001803818 ffffffffa012d14c ffffffffa02de076 ffff880001803700
Call Trace:
<IRQ>
[<ffffffff81312119>] __kfree_skb+0x19/0xa0
[<ffffffff813121f9>] kfree_skb+0x19/0x40
[<ffffffffa012d14c>] free_tx_desc+0x2fc/0x350 [cxgb3]
[<ffffffffa02de076>] ? vhost_poll_wakeup+0x16/0x20 [vhost_net]
[<ffffffffa01323db>] t3_eth_xmit+0x28b/0x380 [cxgb3]
[<ffffffff8131ce47>] dev_hard_start_xmit+0x377/0x5a0
[<ffffffff81335a4a>] sch_direct_xmit+0xfa/0x1d0
[<ffffffff8131d1a9>] dev_queue_xmit+0x139/0x450
[<ffffffff81326225>] neigh_resolve_output+0x125/0x340
[<ffffffff8135a77c>] ip_finish_output+0x14c/0x320
[<ffffffff8135a9fe>] ip_output+0xae/0xc0
[<ffffffff8135620f>] ip_forward_finish+0x3f/0x50
[<ffffffff8135641f>] ip_forward+0x1ff/0x400
[<ffffffff81354789>] ip_rcv_finish+0x119/0x3e0
[<ffffffff81354c7d>] ip_rcv+0x22d/0x300
[<ffffffff8131a95b>] __netif_receive_skb+0x29b/0x570
[<ffffffff8131ba70>] ? netif_receive_skb+0x0/0x80
[<ffffffff8131bae8>] netif_receive_skb+0x78/0x80
[<ffffffffa02a96d8>] br_handle_frame_finish+0x198/0x260 [bridge]
[<ffffffffa02aebc8>] br_nf_pre_routing_finish+0x238/0x380 [bridge]
[<ffffffff813424bc>] ? nf_hook_slow+0x6c/0x100
[<ffffffffa02ae990>] ? br_nf_pre_routing_finish+0x0/0x380 [bridge]
[<ffffffffa02afb08>] br_nf_pre_routing+0x698/0x7a0 [bridge]
[<ffffffff81342414>] nf_iterate+0x64/0xa0
[<ffffffffa02a9540>] ? br_handle_frame_finish+0x0/0x260 [bridge]
[<ffffffff813424bc>] nf_hook_slow+0x6c/0x100
[<ffffffffa02a9540>] ? br_handle_frame_finish+0x0/0x260 [bridge]
[<ffffffffa02a9931>] br_handle_frame+0x191/0x240 [bridge]
[<ffffffffa02a97a0>] ? br_handle_frame+0x0/0x240 [bridge]
[<ffffffff8131a863>] __netif_receive_skb+0x1a3/0x570
[<ffffffff812ef3f6>] ? dma_issue_pending_all+0x76/0xa0
[<ffffffff8131ad32>] process_backlog+0x102/0x200
[<ffffffff8131c2d0>] net_rx_action+0x100/0x220
[<ffffffff810548ef>] __do_softirq+0xaf/0x140
[<ffffffff8100bcdc>] call_softirq+0x1c/0x30
[<ffffffff8100dfc5>] ? do_softirq+0x65/0xa0
[<ffffffff8131c6b8>] netif_rx_ni+0x28/0x30
[<ffffffffa02c305d>] tun_sendmsg+0x2cd/0x4b0 [tun]
[<ffffffffa02e01af>] handle_tx+0x1df/0x340 [vhost_net]
[<ffffffffa02e0340>] handle_tx_kick+0x10/0x20 [vhost_net]
[<ffffffffa02de29b>] vhost_worker+0xbb/0x130 [vhost_net]
[<ffffffffa02de1e0>] ? vhost_worker+0x0/0x130 [vhost_net]
[<ffffffffa02de1e0>] ? vhost_worker+0x0/0x130 [vhost_net]
[<ffffffff81069686>] kthread+0x96/0xa0
[<ffffffff8100bbe4>] kernel_thread_helper+0x4/0x10
[<ffffffff810695f0>] ? kthread+0x0/0xa0
[<ffffffff8100bbe0>] ? kernel_thread_helper+0x0/0x10
Code: 8b 94 24 d0 00 00 00 49 8b 84 24 d8 00 00 00 48 8d 14 10 0f b7 0a 39 d9 7f d1 48 8b 7a 10 48 85 ff 74 20 48 c7 42 10 00 00 00 00 <48> 8b 1f e8 e8 fb ff ff 48 85 db 48 89 df 75 f0 49 8b 84 24 d8

Patch below fixes the panic. cxgb4 and cxgb4vf already have this fix.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/cxgb3/sge.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff -ruNp org/drivers/net/cxgb3/sge.c new/drivers/net/cxgb3/sge.c
--- org/drivers/net/cxgb3/sge.c	2010-10-25 09:10:47.000000000 +0530
+++ new/drivers/net/cxgb3/sge.c	2010-10-28 10:22:47.000000000 +0530
@@ -296,8 +296,10 @@ static void free_tx_desc(struct adapter 
 		if (d->skb) {	/* an SGL is present */
 			if (need_unmap)
 				unmap_skb(d->skb, q, cidx, pdev);
-			if (d->eop)
+			if (d->eop) {
 				kfree_skb(d->skb);
+				d->skb = NULL;
+			}
 		}
 		++d;
 		if (++cidx == q->size) {

^ permalink raw reply

* [PATCH] virtio_net: Fix queue full check
From: Krishna Kumar @ 2010-10-28  5:10 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, Krishna Kumar

I get many queue full errors being wrongly reported when running
parallel netperfs:

Oct 17 10:22:40 localhost kernel: net eth0: Unexpected TX queue failure: -28
Oct 17 10:28:22 localhost kernel: net eth0: Unexpected TX queue failure: -28
Oct 17 10:35:58 localhost kernel: net eth0: Unexpected TX queue failure: -28
Oct 17 10:41:06 localhost kernel: net eth0: Unexpected TX queue failure: -28

I initially changed the check from -ENOMEM to -ENOSPC, but
virtqueue_add_buf can return only -ENOSPC when it doesn't have
space for new request.  Patch removes redundant checks but
displays the failure errno.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/virtio_net.c |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2010-10-11 10:20:02.000000000 +0530
+++ new/drivers/net/virtio_net.c	2010-10-21 17:37:45.000000000 +0530
@@ -570,17 +570,10 @@ static netdev_tx_t start_xmit(struct sk_
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
-		if (net_ratelimit()) {
-			if (likely(capacity == -ENOMEM)) {
-				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
-			} else {
-				dev->stats.tx_fifo_errors++;
-				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
-			}
-		}
+		if (net_ratelimit())
+			dev_warn(&dev->dev,
+				 "TX queue failure (%d): out of memory\n",
+				 capacity); 
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101026110913.GC7922@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

> I think we discussed the need for external to guest testing
> over 10G. For large messages we should not see any change
> but you should be able to get better numbers for small messages
> assuming a MQ NIC card.

For external host, there is a contention among different
queues (vhosts) when packets are processed in tun/bridge,
unless I implement MQ TX for macvtap (tun/bridge?).  So
my testing shows a small improvement (1 to 1.5% average)
in BW and a rise in SD (between 10-15%).  For remote host,
I think tun/macvtap needs MQ TX support?

> > > > Results for UDP BW tests (unidirectional, sum across
> > > > 3 iterations, each iteration of 45 seconds, default
> > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > >
> > > Is binding vhost threads to CPUs really required?
> > > What happens if we let the scheduler do its job?
> >
> > Nothing drastic, I remember BW% and SD% both improved a
> > bit as a result of binding.
>
> If there's a significant improvement this would mean that
> we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the
results later today.

Thanks,

- KK


^ permalink raw reply

* net-next-2.6 [PATCH 0/4] dccp: CCID packet dequeueing interface
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev
In-Reply-To: <dccp_packet_dequeueing_interface_set_of_4>

Dave,

can you please consider the following set of 4, which solves several problems
of the CCID packet-dequeueing interface. This set has been tested for about
3 years.

 Patch #1: sets up a return code convention to handle both rate-based and
           window-based congestion-control IDs (CCIDs) with one function.
 Patch #2: generalizes the xmit interface: rate-based CCIDs continue to use 
           timers, whereas window-based ones use a tasklet.
 Patch #3: fixes issues in cleaning up CCID state when closing a connection,
           like patch #2 this also divides into rate-based/window-based.
 Patch #4: using the previous patches, takes CCID-2 out of its polling mode
           (that was the problem of the xmit interface - CCID-2 kept on 
	    abusing it by faking a small timeout, constantly polling until
	    its internal congestion state allowed it to send again).

The set has also been placed into a fresh (today's) copy of net-next-2.6, on

    git://eden-feed.erg.abdn.ac.uk/net-next-2.6        [subtree 'dccp']

The set is fully bisectable.    
---
 include/linux/dccp.h   |    4 +-
 net/dccp/ccid.h        |   34 +++++++-
 net/dccp/ccids/ccid2.c |   23 ++++--
 net/dccp/ccids/ccid2.h |    5 +
 net/dccp/ccids/ccid3.c |   12 ++--
 net/dccp/dccp.h        |    5 +-
 net/dccp/output.c      |  209 +++++++++++++++++++++++++++++-------------------
 net/dccp/proto.c       |   21 +++++-
 net/dccp/timer.c       |   27 ++++---
 9 files changed, 223 insertions(+), 117 deletions(-)

^ permalink raw reply

* [PATCH 4/4] dccp ccid-2: Stop polling
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1288242988-5333-4-git-send-email-gerrit@erg.abdn.ac.uk>

This updates CCID-2 to use the CCID dequeuing mechanism, converting from
previous continuous-polling to a now event-driven mechanism.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/ccids/ccid2.c |   23 +++++++++++++++--------
 net/dccp/ccids/ccid2.h |    5 +++++
 2 files changed, 20 insertions(+), 8 deletions(-)

--- a/net/dccp/ccids/ccid2.h
+++ b/net/dccp/ccids/ccid2.h
@@ -81,6 +81,11 @@ struct ccid2_hc_tx_sock {
 	u64			tx_high_ack;
 };
 
+static inline bool ccid2_cwnd_network_limited(struct ccid2_hc_tx_sock *hc)
+{
+	return hc->tx_pipe >= hc->tx_cwnd;
+}
+
 struct ccid2_hc_rx_sock {
 	int	rx_data;
 };
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -78,12 +78,9 @@ static int ccid2_hc_tx_alloc_seq(struct ccid2_hc_tx_sock *hc)
 
 static int ccid2_hc_tx_send_packet(struct sock *sk, struct sk_buff *skb)
 {
-	struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
-
-	if (hc->tx_pipe < hc->tx_cwnd)
-		return 0;
-
-	return 1; /* XXX CCID should dequeue when ready instead of polling */
+	if (ccid2_cwnd_network_limited(ccid2_hc_tx_sk(sk)))
+		return CCID_PACKET_WILL_DEQUEUE_LATER;
+	return CCID_PACKET_SEND_AT_ONCE;
 }
 
 static void ccid2_change_l_ack_ratio(struct sock *sk, u32 val)
@@ -115,6 +112,7 @@ static void ccid2_hc_tx_rto_expire(unsigned long data)
 {
 	struct sock *sk = (struct sock *)data;
 	struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
+	const bool sender_was_blocked = ccid2_cwnd_network_limited(hc);
 
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk)) {
@@ -129,8 +127,6 @@ static void ccid2_hc_tx_rto_expire(unsigned long data)
 	if (hc->tx_rto > DCCP_RTO_MAX)
 		hc->tx_rto = DCCP_RTO_MAX;
 
-	sk_reset_timer(sk, &hc->tx_rtotimer, jiffies + hc->tx_rto);
-
 	/* adjust pipe, cwnd etc */
 	hc->tx_ssthresh = hc->tx_cwnd / 2;
 	if (hc->tx_ssthresh < 2)
@@ -146,6 +142,12 @@ static void ccid2_hc_tx_rto_expire(unsigned long data)
 	hc->tx_rpseq    = 0;
 	hc->tx_rpdupack = -1;
 	ccid2_change_l_ack_ratio(sk, 1);
+
+	/* if we were blocked before, we may now send cwnd=1 packet */
+	if (sender_was_blocked)
+		tasklet_schedule(&dccp_sk(sk)->dccps_xmitlet);
+	/* restart backed-off timer */
+	sk_reset_timer(sk, &hc->tx_rtotimer, jiffies + hc->tx_rto);
 out:
 	bh_unlock_sock(sk);
 	sock_put(sk);
@@ -434,6 +436,7 @@ static void ccid2_hc_tx_packet_recv(struct sock *sk, struct sk_buff *skb)
 {
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
+	const bool sender_was_blocked = ccid2_cwnd_network_limited(hc);
 	u64 ackno, seqno;
 	struct ccid2_seq *seqp;
 	unsigned char *vector;
@@ -631,6 +634,10 @@ static void ccid2_hc_tx_packet_recv(struct sock *sk, struct sk_buff *skb)
 		sk_stop_timer(sk, &hc->tx_rtotimer);
 	else
 		sk_reset_timer(sk, &hc->tx_rtotimer, jiffies + hc->tx_rto);
+
+	/* check if incoming Acks allow pending packets to be sent */
+	if (sender_was_blocked && !ccid2_cwnd_network_limited(hc))
+		tasklet_schedule(&dccp_sk(sk)->dccps_xmitlet);
 }
 
 static int ccid2_hc_tx_init(struct ccid *ccid, struct sock *sk)

^ permalink raw reply

* [PATCH 1/4] dccp: Return-value convention of hc_tx_send_packet()
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1288242988-5333-1-git-send-email-gerrit@erg.abdn.ac.uk>

This patch reorganises the return value convention of the CCID TX sending
function, to permit more flexible schemes, as required by subsequent patches.

Currently the convention is
 * values < 0     mean error,
 * a value == 0   means "send now", and
 * a value x > 0  means "send in x milliseconds".

The patch provides symbolic constants and a function to interpret return values.

In addition, it caps the maximum positive return value to 0xFFFF milliseconds,
corresponding to 65.535 seconds.  This is possible since in CCID-3/4 the
maximum possible inter-packet gap is fixed at t_mbi = 64 sec.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/ccid.h        |   34 +++++++++++++++++++++++++++++++---
 net/dccp/ccids/ccid3.c |   12 ++++++------
 2 files changed, 37 insertions(+), 9 deletions(-)

--- a/net/dccp/ccid.h
+++ b/net/dccp/ccid.h
@@ -134,13 +134,41 @@ static inline int ccid_get_current_tx_ccid(struct dccp_sock *dp)
 extern void ccid_hc_rx_delete(struct ccid *ccid, struct sock *sk);
 extern void ccid_hc_tx_delete(struct ccid *ccid, struct sock *sk);
 
+/*
+ * Congestion control of queued data packets via CCID decision.
+ *
+ * The TX CCID performs its congestion-control by indicating whether and when a
+ * queued packet may be sent, using the return code of ccid_hc_tx_send_packet().
+ * The following modes are supported via the symbolic constants below:
+ * - timer-based pacing    (CCID returns a delay value in milliseconds);
+ * - autonomous dequeueing (CCID internally schedules dccps_xmitlet).
+ */
+
+enum ccid_dequeueing_decision {
+	CCID_PACKET_SEND_AT_ONCE =	 0x00000,  /* "green light": no delay */
+	CCID_PACKET_DELAY_MAX =		 0x0FFFF,  /* maximum delay in msecs  */
+	CCID_PACKET_DELAY =		 0x10000,  /* CCID msec-delay mode */
+	CCID_PACKET_WILL_DEQUEUE_LATER = 0x20000,  /* CCID autonomous mode */
+	CCID_PACKET_ERR =		 0xF0000,  /* error condition */
+};
+
+static inline int ccid_packet_dequeue_eval(const int return_code)
+{
+	if (return_code < 0)
+		return CCID_PACKET_ERR;
+	if (return_code == 0)
+		return CCID_PACKET_SEND_AT_ONCE;
+	if (return_code <= CCID_PACKET_DELAY_MAX)
+		return CCID_PACKET_DELAY;
+	return return_code;
+}
+
 static inline int ccid_hc_tx_send_packet(struct ccid *ccid, struct sock *sk,
 					 struct sk_buff *skb)
 {
-	int rc = 0;
 	if (ccid->ccid_ops->ccid_hc_tx_send_packet != NULL)
-		rc = ccid->ccid_ops->ccid_hc_tx_send_packet(sk, skb);
-	return rc;
+		return ccid->ccid_ops->ccid_hc_tx_send_packet(sk, skb);
+	return CCID_PACKET_SEND_AT_ONCE;
 }
 
 static inline void ccid_hc_tx_packet_sent(struct ccid *ccid, struct sock *sk,
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -268,11 +268,11 @@ out:
 	sock_put(sk);
 }
 
-/*
- * returns
- *   > 0: delay (in msecs) that should pass before actually sending
- *   = 0: can send immediately
- *   < 0: error condition; do not send packet
+/**
+ * ccid3_hc_tx_send_packet  -  Delay-based dequeueing of TX packets
+ * @skb: next packet candidate to send on @sk
+ * This function uses the convention of ccid_packet_dequeue_eval() and
+ * returns a millisecond-delay value between 0 and t_mbi = 64000 msec.
  */
 static int ccid3_hc_tx_send_packet(struct sock *sk, struct sk_buff *skb)
 {
@@ -348,7 +348,7 @@ static int ccid3_hc_tx_send_packet(struct sock *sk, struct sk_buff *skb)
 
 	/* set the nominal send time for the next following packet */
 	hc->tx_t_nom = ktime_add_us(hc->tx_t_nom, hc->tx_t_ipi);
-	return 0;
+	return CCID_PACKET_SEND_AT_ONCE;
 }
 
 static void ccid3_hc_tx_packet_sent(struct sock *sk, unsigned int len)

^ permalink raw reply

* [PATCH 2/4] dccp: Extend CCID packet dequeueing interface
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1288242988-5333-2-git-send-email-gerrit@erg.abdn.ac.uk>

This extends the packet dequeuing interface of dccp_write_xmit() to allow
 1. CCIDs to take care of timing when the next packet may be sent;
 2. delayed sending (as before, with an inter-packet gap up to 65.535 seconds).

The main purpose is to take CCID-2 out of its polling mode (when it is network-
limited, it tries every millisecond to send, without interruption).

The mode of operation for (2) is as follows:
 * new packet is enqueued via dccp_sendmsg() => dccp_write_xmit(),
 * ccid_hc_tx_send_packet() detects that it may not send (e.g. window full),
 * it signals this condition via `CCID_PACKET_WILL_DEQUEUE_LATER',
 * dccp_write_xmit() returns without further action;
 * after some time the wait-condition for CCID becomes true,
 * that CCID schedules the tasklet,
 * tasklet function calls ccid_hc_tx_send_packet() via dccp_write_xmit(),
 * since the wait-condition is now true, ccid_hc_tx_packet() returns "send now",
 * packet is sent, and possibly more (since dccp_write_xmit() loops).

Code reuse: the taskled function calls dccp_write_xmit(), the timer function
            reduces to a wrapper around the same code.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 include/linux/dccp.h |    4 +-
 net/dccp/output.c    |  118 ++++++++++++++++++++++++++++++-------------------
 net/dccp/timer.c     |   25 ++++++-----
 3 files changed, 89 insertions(+), 58 deletions(-)

--- a/include/linux/dccp.h
+++ b/include/linux/dccp.h
@@ -462,7 +462,8 @@ struct dccp_ackvec;
  * @dccps_hc_rx_insert_options - receiver wants to add options when acking
  * @dccps_hc_tx_insert_options - sender wants to add options when sending
  * @dccps_server_timewait - server holds timewait state on close (RFC 4340, 8.3)
- * @dccps_xmit_timer - timer for when CCID is not ready to send
+ * @dccps_xmitlet - tasklet scheduled by the TX CCID to dequeue data packets
+ * @dccps_xmit_timer - used by the TX CCID to delay sending (rate-based pacing)
  * @dccps_syn_rtt - RTT sample from Request/Response exchange (in usecs)
  */
 struct dccp_sock {
@@ -502,6 +503,7 @@ struct dccp_sock {
 	__u8				dccps_hc_rx_insert_options:1;
 	__u8				dccps_hc_tx_insert_options:1;
 	__u8				dccps_server_timewait:1;
+	struct tasklet_struct		dccps_xmitlet;
 	struct timer_list		dccps_xmit_timer;
 };
 
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -254,63 +254,89 @@ do_interrupted:
 	goto out;
 }
 
+/**
+ * dccp_xmit_packet  -  Send data packet under control of CCID
+ * Transmits next-queued payload and informs CCID to account for the packet.
+ */
+static void dccp_xmit_packet(struct sock *sk)
+{
+	int err, len;
+	struct dccp_sock *dp = dccp_sk(sk);
+	struct sk_buff *skb = skb_dequeue(&sk->sk_write_queue);
+
+	if (unlikely(skb == NULL))
+		return;
+	len = skb->len;
+
+	if (sk->sk_state == DCCP_PARTOPEN) {
+		const u32 cur_mps = dp->dccps_mss_cache - DCCP_FEATNEG_OVERHEAD;
+		/*
+		 * See 8.1.5 - Handshake Completion.
+		 *
+		 * For robustness we resend Confirm options until the client has
+		 * entered OPEN. During the initial feature negotiation, the MPS
+		 * is smaller than usual, reduced by the Change/Confirm options.
+		 */
+		if (!list_empty(&dp->dccps_featneg) && len > cur_mps) {
+			DCCP_WARN("Payload too large (%d) for featneg.\n", len);
+			dccp_send_ack(sk);
+			dccp_feat_list_purge(&dp->dccps_featneg);
+		}
+
+		inet_csk_schedule_ack(sk);
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+					      inet_csk(sk)->icsk_rto,
+					      DCCP_RTO_MAX);
+		DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_DATAACK;
+	} else if (dccp_ack_pending(sk)) {
+		DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_DATAACK;
+	} else {
+		DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_DATA;
+	}
+
+	err = dccp_transmit_skb(sk, skb);
+	if (err)
+		dccp_pr_debug("transmit_skb() returned err=%d\n", err);
+	/*
+	 * Register this one as sent even if an error occurred. To the remote
+	 * end a local packet drop is indistinguishable from network loss, i.e.
+	 * any local drop will eventually be reported via receiver feedback.
+	 */
+	ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, len);
+}
+
 void dccp_write_xmit(struct sock *sk, int block)
 {
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct sk_buff *skb;
 
 	while ((skb = skb_peek(&sk->sk_write_queue))) {
-		int err = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
+		int rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
 
-		if (err > 0) {
+		switch (ccid_packet_dequeue_eval(rc)) {
+		case CCID_PACKET_WILL_DEQUEUE_LATER:
+			return;
+		case CCID_PACKET_DELAY:
 			if (!block) {
 				sk_reset_timer(sk, &dp->dccps_xmit_timer,
-						msecs_to_jiffies(err)+jiffies);
+						msecs_to_jiffies(rc)+jiffies);
+				return;
+			}
+			rc = dccp_wait_for_ccid(sk, skb, rc);
+			if (rc && rc != -EINTR) {
+				DCCP_BUG("err=%d after dccp_wait_for_ccid", rc);
+				skb_dequeue(&sk->sk_write_queue);
+				kfree_skb(skb);
 				break;
-			} else
-				err = dccp_wait_for_ccid(sk, skb, err);
-			if (err && err != -EINTR)
-				DCCP_BUG("err=%d after dccp_wait_for_ccid", err);
-		}
-
-		skb_dequeue(&sk->sk_write_queue);
-		if (err == 0) {
-			struct dccp_skb_cb *dcb = DCCP_SKB_CB(skb);
-			const int len = skb->len;
-
-			if (sk->sk_state == DCCP_PARTOPEN) {
-				const u32 cur_mps = dp->dccps_mss_cache - DCCP_FEATNEG_OVERHEAD;
-				/*
-				 * See 8.1.5 - Handshake Completion.
-				 *
-				 * For robustness we resend Confirm options until the client has
-				 * entered OPEN. During the initial feature negotiation, the MPS
-				 * is smaller than usual, reduced by the Change/Confirm options.
-				 */
-				if (!list_empty(&dp->dccps_featneg) && len > cur_mps) {
-					DCCP_WARN("Payload too large (%d) for featneg.\n", len);
-					dccp_send_ack(sk);
-					dccp_feat_list_purge(&dp->dccps_featneg);
-				}
-
-				inet_csk_schedule_ack(sk);
-				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
-						  inet_csk(sk)->icsk_rto,
-						  DCCP_RTO_MAX);
-				dcb->dccpd_type = DCCP_PKT_DATAACK;
-			} else if (dccp_ack_pending(sk))
-				dcb->dccpd_type = DCCP_PKT_DATAACK;
-			else
-				dcb->dccpd_type = DCCP_PKT_DATA;
-
-			err = dccp_transmit_skb(sk, skb);
-			ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, len);
-			if (err)
-				DCCP_BUG("err=%d after ccid_hc_tx_packet_sent",
-					 err);
-		} else {
-			dccp_pr_debug("packet discarded due to err=%d\n", err);
+			}
+			/* fall through */
+		case CCID_PACKET_SEND_AT_ONCE:
+			dccp_xmit_packet(sk);
+			break;
+		case CCID_PACKET_ERR:
+			skb_dequeue(&sk->sk_write_queue);
 			kfree_skb(skb);
+			dccp_pr_debug("packet discarded due to err=%d\n", rc);
 		}
 	}
 }
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -237,32 +237,35 @@ out:
 	sock_put(sk);
 }
 
-/* Transmit-delay timer: used by the CCIDs to delay actual send time */
-static void dccp_write_xmit_timer(unsigned long data)
+/**
+ * dccp_write_xmitlet  -  Workhorse for CCID packet dequeueing interface
+ * See the comments above %ccid_dequeueing_decision for supported modes.
+ */
+static void dccp_write_xmitlet(unsigned long data)
 {
 	struct sock *sk = (struct sock *)data;
-	struct dccp_sock *dp = dccp_sk(sk);
 
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk))
-		sk_reset_timer(sk, &dp->dccps_xmit_timer, jiffies+1);
+		sk_reset_timer(sk, &dccp_sk(sk)->dccps_xmit_timer, jiffies + 1);
 	else
 		dccp_write_xmit(sk, 0);
 	bh_unlock_sock(sk);
-	sock_put(sk);
 }
 
-static void dccp_init_write_xmit_timer(struct sock *sk)
+static void dccp_write_xmit_timer(unsigned long data)
 {
-	struct dccp_sock *dp = dccp_sk(sk);
-
-	setup_timer(&dp->dccps_xmit_timer, dccp_write_xmit_timer,
-			(unsigned long)sk);
+	dccp_write_xmitlet(data);
+	sock_put((struct sock *)data);
 }
 
 void dccp_init_xmit_timers(struct sock *sk)
 {
-	dccp_init_write_xmit_timer(sk);
+	struct dccp_sock *dp = dccp_sk(sk);
+
+	tasklet_init(&dp->dccps_xmitlet, dccp_write_xmitlet, (unsigned long)sk);
+	setup_timer(&dp->dccps_xmit_timer, dccp_write_xmit_timer,
+							     (unsigned long)sk);
 	inet_csk_init_xmit_timers(sk, &dccp_write_timer, &dccp_delack_timer,
 				  &dccp_keepalive_timer);
 }

^ permalink raw reply

* [PATCH 3/4] dccp: Refine the wait-for-ccid mechanism
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1288242988-5333-3-git-send-email-gerrit@erg.abdn.ac.uk>

This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID, addressing the following problems:

 1) The queue-drain mechanism only works with rate-based CCIDs. If CCID-2 for
    example has a full TX queue and becomes network-limited just as the
    application wants to close, then waiting for CCID-2 to become unblocked
    could lead to an indefinite  delay (i.e., application "hangs").
 2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
    in its sending policy while the queue is being drained. This can lead to
    further delays during which the application will not be able to terminate.
 3) The minimum wait time for CCID-3/4 can be expected to be the queue length
    times the current inter-packet delay. For example if tx_qlen=100 and a delay
    of 15 ms is used for each packet, then the application would have to wait
    for a minimum of 1.5 seconds before being allowed to exit.
 4) There is no way for the user/application to control this behaviour. It would
    be good to use the timeout argument of dccp_close() as an upper bound. Then
    the maximum time that an application is willing to wait for its CCIDs to can
    be set via the SO_LINGER option.

These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.

The wait-for-ccid function is, as before, used when the application
 (a) has read all the data in its receive buffer and
 (b) if SO_LINGER was set with a non-zero linger time, or
 (c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
     state (client application closes after receiving CloseReq).

In addition, there is a catch-all case of __skb_queue_purge() after waiting for
the CCID. This is necessary since the write queue may still have data when
 (a) the host has been passively-closed,
 (b) abnormal termination (unread data, zero linger time),
 (c) wait-for-ccid could not finish within the given time limit.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/dccp.h   |    5 +-
 net/dccp/output.c |  115 ++++++++++++++++++++++++++++++-----------------------
 net/dccp/proto.c  |   21 +++++++++-
 net/dccp/timer.c  |    2 +-
 4 files changed, 89 insertions(+), 54 deletions(-)

--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -243,8 +243,9 @@ extern void dccp_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
 extern void dccp_send_sync(struct sock *sk, const u64 seq,
 			   const enum dccp_pkt_type pkt_type);
 
-extern void dccp_write_xmit(struct sock *sk, int block);
-extern void dccp_write_space(struct sock *sk);
+extern void   dccp_write_xmit(struct sock *sk);
+extern void   dccp_write_space(struct sock *sk);
+extern void   dccp_flush_write_queue(struct sock *sk, long *time_budget);
 
 extern void dccp_init_xmit_timers(struct sock *sk);
 static inline void dccp_clear_xmit_timers(struct sock *sk)
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -209,49 +209,29 @@ void dccp_write_space(struct sock *sk)
 }
 
 /**
- * dccp_wait_for_ccid - Wait for ccid to tell us we can send a packet
+ * dccp_wait_for_ccid  -  Await CCID send permission
  * @sk:    socket to wait for
- * @skb:   current skb to pass on for waiting
- * @delay: sleep timeout in milliseconds (> 0)
- * This function is called by default when the socket is closed, and
- * when a non-zero linger time is set on the socket. For consistency
+ * @delay: timeout in jiffies
+ * This is used by CCIDs which need to delay the send time in process context.
  */
-static int dccp_wait_for_ccid(struct sock *sk, struct sk_buff *skb, int delay)
+static int dccp_wait_for_ccid(struct sock *sk, unsigned long delay)
 {
-	struct dccp_sock *dp = dccp_sk(sk);
 	DEFINE_WAIT(wait);
-	unsigned long jiffdelay;
-	int rc;
-
-	do {
-		dccp_pr_debug("delayed send by %d msec\n", delay);
-		jiffdelay = msecs_to_jiffies(delay);
-
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	long remaining;
 
-		sk->sk_write_pending++;
-		release_sock(sk);
-		schedule_timeout(jiffdelay);
-		lock_sock(sk);
-		sk->sk_write_pending--;
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	sk->sk_write_pending++;
+	release_sock(sk);
 
-		if (sk->sk_err)
-			goto do_error;
-		if (signal_pending(current))
-			goto do_interrupted;
+	remaining = schedule_timeout(delay);
 
-		rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
-	} while ((delay = rc) > 0);
-out:
+	lock_sock(sk);
+	sk->sk_write_pending--;
 	finish_wait(sk_sleep(sk), &wait);
-	return rc;
-
-do_error:
-	rc = -EPIPE;
-	goto out;
-do_interrupted:
-	rc = -EINTR;
-	goto out;
+
+	if (signal_pending(current) || sk->sk_err)
+		return -1;
+	return remaining;
 }
 
 /**
@@ -305,7 +285,53 @@ static void dccp_xmit_packet(struct sock *sk)
 	ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, len);
 }
 
-void dccp_write_xmit(struct sock *sk, int block)
+/**
+ * dccp_flush_write_queue  -  Drain queue at end of connection
+ * Since dccp_sendmsg queues packets without waiting for them to be sent, it may
+ * happen that the TX queue is not empty at the end of a connection. We give the
+ * HC-sender CCID a grace period of up to @time_budget jiffies. If this function
+ * returns with a non-empty write queue, it will be purged later.
+ */
+void dccp_flush_write_queue(struct sock *sk, long *time_budget)
+{
+	struct dccp_sock *dp = dccp_sk(sk);
+	struct sk_buff *skb;
+	long delay, rc;
+
+	while (*time_budget > 0 && (skb = skb_peek(&sk->sk_write_queue))) {
+		rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
+
+		switch (ccid_packet_dequeue_eval(rc)) {
+		case CCID_PACKET_WILL_DEQUEUE_LATER:
+			/*
+			 * If the CCID determines when to send, the next sending
+			 * time is unknown or the CCID may not even send again
+			 * (e.g. remote host crashes or lost Ack packets).
+			 */
+			DCCP_WARN("CCID did not manage to send all packets\n");
+			return;
+		case CCID_PACKET_DELAY:
+			delay = msecs_to_jiffies(rc);
+			if (delay > *time_budget)
+				return;
+			rc = dccp_wait_for_ccid(sk, delay);
+			if (rc < 0)
+				return;
+			*time_budget -= (delay - rc);
+			/* check again if we can send now */
+			break;
+		case CCID_PACKET_SEND_AT_ONCE:
+			dccp_xmit_packet(sk);
+			break;
+		case CCID_PACKET_ERR:
+			skb_dequeue(&sk->sk_write_queue);
+			kfree_skb(skb);
+			dccp_pr_debug("packet discarded due to err=%ld\n", rc);
+		}
+	}
+}
+
+void dccp_write_xmit(struct sock *sk)
 {
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct sk_buff *skb;
@@ -317,19 +343,9 @@ void dccp_write_xmit(struct sock *sk, int block)
 		case CCID_PACKET_WILL_DEQUEUE_LATER:
 			return;
 		case CCID_PACKET_DELAY:
-			if (!block) {
-				sk_reset_timer(sk, &dp->dccps_xmit_timer,
-						msecs_to_jiffies(rc)+jiffies);
-				return;
-			}
-			rc = dccp_wait_for_ccid(sk, skb, rc);
-			if (rc && rc != -EINTR) {
-				DCCP_BUG("err=%d after dccp_wait_for_ccid", rc);
-				skb_dequeue(&sk->sk_write_queue);
-				kfree_skb(skb);
-				break;
-			}
-			/* fall through */
+			sk_reset_timer(sk, &dp->dccps_xmit_timer,
+				       jiffies + msecs_to_jiffies(rc));
+			return;
 		case CCID_PACKET_SEND_AT_ONCE:
 			dccp_xmit_packet(sk);
 			break;
@@ -648,7 +664,6 @@ void dccp_send_close(struct sock *sk, const int active)
 		DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_CLOSE;
 
 	if (active) {
-		dccp_write_xmit(sk, 1);
 		dccp_skb_entail(sk, skb);
 		dccp_transmit_skb(sk, skb_clone(skb, prio));
 		/*
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -726,7 +726,13 @@ int dccp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		goto out_discard;
 
 	skb_queue_tail(&sk->sk_write_queue, skb);
-	dccp_write_xmit(sk,0);
+	/*
+	 * The xmit_timer is set if the TX CCID is rate-based and will expire
+	 * when congestion control permits to release further packets into the
+	 * network. Window-based CCIDs do not use this timer.
+	 */
+	if (!timer_pending(&dp->dccps_xmit_timer))
+		dccp_write_xmit(sk);
 out_release:
 	release_sock(sk);
 	return rc ? : len;
@@ -951,9 +957,22 @@ void dccp_close(struct sock *sk, long timeout)
 		/* Check zero linger _after_ checking for unread data. */
 		sk->sk_prot->disconnect(sk, 0);
 	} else if (sk->sk_state != DCCP_CLOSED) {
+		/*
+		 * Normal connection termination. May need to wait if there are
+		 * still packets in the TX queue that are delayed by the CCID.
+		 */
+		dccp_flush_write_queue(sk, &timeout);
 		dccp_terminate_connection(sk);
 	}
 
+	/*
+	 * Flush write queue. This may be necessary in several cases:
+	 * - we have been closed by the peer but still have application data;
+	 * - abortive termination (unread data or zero linger time),
+	 * - normal termination but queue could not be flushed within time limit
+	 */
+	__skb_queue_purge(&sk->sk_write_queue);
+
 	sk_stream_wait_close(sk, timeout);
 
 adjudge_to_death:
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -249,7 +249,7 @@ static void dccp_write_xmitlet(unsigned long data)
 	if (sock_owned_by_user(sk))
 		sk_reset_timer(sk, &dccp_sk(sk)->dccps_xmit_timer, jiffies + 1);
 	else
-		dccp_write_xmit(sk, 0);
+		dccp_write_xmit(sk);
 	bh_unlock_sock(sk);
 }
 

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Eric Dumazet @ 2010-10-28  5:18 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: sonnyrao, Divy Le Ray, Dimitris Michailidis, netdev, linux-kernel
In-Reply-To: <1288242390-28574-1-git-send-email-nacc@us.ibm.com>

Le mercredi 27 octobre 2010 à 22:06 -0700, Nishanth Aravamudan a écrit :
> Hi Eric,
> 
> Something like the following?:
> 
> Thanks,
> Nish

Yes, probably, but I dont have the hardware so cannot test.



^ permalink raw reply

* Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
From: Michael S. Tsirkin @ 2010-10-28  5:20 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288240804.14342.1.camel@localhost.localdomain>

On Wed, Oct 27, 2010 at 09:40:04PM -0700, Shirley Ma wrote:
> Resubmit this patch for fixing some minor error (white space, typo).
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>

My concern is this can delay signalling for unlimited time.
Could you pls test this with guests that do not have
2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb
b0c39dbdc204006ef3558a66716ff09797619778
that is 2.6.31 and older?

This seems to be slighltly out of spec, even though
for TX, signals are less important.

Two ideas:
1. How about writing out used, just delaying the signal?
   This way we don't have to queue separately.
2. How about flushing out queued stuff before we exit
   the handle_tx loop? That would address most of
   the spec issue.

> ---
>  drivers/vhost/net.c   |   20 +++++++++++++++++++-
>  drivers/vhost/vhost.c |   32 ++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |    3 +++
>  3 files changed, 54 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4b4da5b..3eb8016 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -128,6 +128,7 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	int max_pend = vq->num - (vq->num >> 2);
>  
>  	sock = rcu_dereference_check(vq->private_data,
>  				     lockdep_is_held(&vq->mutex));
> @@ -198,7 +199,24 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		/*
> +		 * if no pending buffer size allocate, signal used buffer
> +		 * one by one, otherwise, signal used buffer when reaching
> +		 * 3/4 ring size to reduce CPU utilization.
> +		 */
> +		if (unlikely(vq->pend))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		else {
> +			vq->pend[vq->num_pend].id = head;
> +			vq->pend[vq->num_pend].len = 0;
> +			++vq->num_pend;
> +			if (vq->num_pend == max_pend) {
> +				vhost_add_used_and_signal_n(&net->dev, vq,
> +							    vq->pend,
> +							    vq->num_pend);
> +				vq->num_pend = 0;
> +			}
> +		}
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94701ff..f2f3288 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	/* signal pending used buffers */
> +	if (vq->pend) {
> +		if (vq->num_pend != 0) {
> +			vhost_add_used_and_signal_n(dev, vq, vq->pend,
> +						    vq->num_pend);
> +			vq->num_pend = 0;
> +		}
> +		kfree(vq->pend);
> +	}
> +	vq->pend = NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -273,7 +283,14 @@ long vhost_dev_init(struct vhost_dev *dev,
>  		dev->vqs[i].heads = NULL;
>  		dev->vqs[i].dev = dev;
>  		mutex_init(&dev->vqs[i].mutex);
> +		dev->vqs[i].num_pend = 0;
> +		dev->vqs[i].pend = NULL;
>  		vhost_vq_reset(dev, dev->vqs + i);
> +		/* signal 3/4 of ring size used buffers */
> +		dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
> +					   (dev->vqs[i].num >> 2)) *
> +					   sizeof *dev->vqs[i].pend,
> +					   GFP_KERNEL);
>  		if (dev->vqs[i].handle_kick)
>  			vhost_poll_init(&dev->vqs[i].poll,
>  					dev->vqs[i].handle_kick, POLLIN, dev);
> @@ -599,6 +616,21 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  			r = -EINVAL;
>  			break;
>  		}
> +		if (vq->num != s.num) {
> +			/* signal used buffers first */
> +			if (vq->pend) {
> +				if (vq->num_pend != 0) {
> +					vhost_add_used_and_signal_n(vq->dev, vq,
> +								    vq->pend,
> +								    vq->num_pend);
> +					vq->num_pend = 0;
> +				}
> +				kfree(vq->pend);
> +			}
> +			/* realloc pending used buffers size */
> +			vq->pend = kmalloc((s.num - (s.num >> 2)) *
> +					   sizeof *vq->pend, GFP_KERNEL);
> +		}
>  		vq->num = s.num;
>  		break;
>  	case VHOST_SET_VRING_BASE:
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 073d06a..78949c0 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -108,6 +108,9 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* delay multiple used buffers to signal once */
> +	int num_pend;
> +	struct vring_used_elem *pend;
>  };
>  
>  struct vhost_dev {
> 

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Nishanth Aravamudan @ 2010-10-28  5:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: sonnyrao, Divy Le Ray, Dimitris Michailidis, netdev, linux-kernel
In-Reply-To: <1288243110.2658.126.camel@edumazet-laptop>

On 28.10.2010 [07:18:30 +0200], Eric Dumazet wrote:
> Le mercredi 27 octobre 2010 à 22:06 -0700, Nishanth Aravamudan a écrit :
> > Hi Eric,
> > 
> > Something like the following?:
> > 
> > Thanks,
> > Nish
> 
> Yes, probably, but I dont have the hardware so cannot test.

Sorry, should have been clearer. I do have the hardware, tested with the
patch, it does work. Wanted to make sure that it made sense to the
maintainers, of course, but:

Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>

Can be added, if needed.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

^ permalink raw reply

* Re: [PATCH] via-velocity: Codestyle fixes
From: Alexander Wigen @ 2010-10-28  5:02 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, linux-kernel
In-Reply-To: <1288239318.1836.7.camel@Joe-Laptop>

Hi Joe,

Thanks for that very quick review, this is my first kernel patch.

On Thu, 28 Oct 2010, you wrote:
> On Thu, 2010-10-28 at 13:48 +1000, alex@wigen.net wrote:
> > From: Alexander Wigen <alex@wigen.net>
> > @@ -1542,7 +1542,7 @@ static int velocity_alloc_rx_buf(struct
> > velocity_info *vptr, int idx)
> > 
> >  	 *	Fill in the descriptor to match
> >  	 */
> > 
> > -	*((u32 *) & (rd->rdesc0)) = 0;
> > +	*((u32 *) &(rd->rdesc0)) = 0;
> 
> I think this style change isn't really that good.
> rdesc0 is:
> 
> struct rdesc0 {
> 	__le16 RSR;		/* Receive status */
> 	__le16 len;		/* bits 0--13; bit 15 - owner */
> };
> 
> I think it more sensible to either set the fields
> in the struct to 0 or to memset the struct to 0.

Not sure what is best so I reverted the change.

> > @@ -1681,7 +1681,7 @@ static int velocity_init_td_ring(struct
> > velocity_info *vptr)
> > 
> >  static void velocity_free_dma_rings(struct velocity_info *vptr)
> >  {
> >  
> >  	const int size = vptr->options.numrx * sizeof(struct rx_desc) +
> > 
> > -		vptr->options.numtx * sizeof(struct tx_desc) * vptr->tx.numq;
> > +		vptr->options.numtx * sizeof(struct tx_desc) *vptr->tx.numq;
> 
> If checkpatch warns about this, it's wrong.

It does produce a error acctually, but I have reverted the change:

drivers/net/via-velocity.c:1684: ERROR: space prohibited after that '*' 
(ctx:WxW)


> 
> > @@ -2477,24 +2477,24 @@ static struct net_device_stats
> > *velocity_get_stats(struct net_device *dev)
> > 
> >  	dev->stats.rx_errors = vptr->mib_counter[HW_MIB_ifRxErrorPkts];
> >  	dev->stats.rx_length_errors =
> >  	vptr->mib_counter[HW_MIB_ifInRangeLengthErrors];
> > 
> > -//  unsigned long   rx_dropped;     /* no space in linux buffers    */
> > +/*  unsigned long   rx_dropped;     /* no space in linux buffers    */
> 
> why not just remove all these commented-out lines?

Done.

> > @@ -2957,7 +2957,7 @@ static int velocity_set_wol(struct velocity_info
> > *vptr)
> > 
> >  		memcpy(arp->ar_tip, vptr->ip_addr, 4);
> >  		
> >  		crc = wol_calc_crc((sizeof(struct arp_packet) + 7) / 8, buf,
> > 
> > -				(u8 *) & mask_pattern[0][0]);
> > +				(u8 *) &mask_pattern[0][0]);
> 
> perhaps just (u8 *)mask_pattern?
> 
> > @@ -3454,7 +3453,7 @@ static const struct ethtool_ops
> > velocity_ethtool_ops = {
> > 
> >  	.set_wol	=	velocity_ethtool_set_wol,
> >  	.get_msglevel	=	velocity_get_msglevel,
> >  	.set_msglevel	=	velocity_set_msglevel,
> > 
> > -	.set_sg 	=	ethtool_op_set_sg,
> > +	.set_sg	=	ethtool_op_set_sg,
> 
> bad alignment?

Sorry, wrong editor tab width.

Cheers,
Alexander Wigen

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/2] be2net: Adding an option to use INTx instead of MSI-X
From: Grant Grundler @ 2010-10-28  5:37 UTC (permalink / raw)
  To: David Miller; +Cc: michael, bhutchings, somnath.kotur, netdev, linux-pci
In-Reply-To: <20101027.084604.226761472.davem@davemloft.net>

On Wed, Oct 27, 2010 at 08:46:04AM -0700, David Miller wrote:
> From: Michael Ellerman <michael@ellerman.id.au>
> Date: Wed, 27 Oct 2010 10:20:35 +1100
> 
> > On Tue, 2010-10-26 at 14:32 +0100, Ben Hutchings wrote:
> >> Michael Ellerman wrote:
> >> > On Mon, 2010-10-25 at 16:25 -0700, David Miller wrote:
> >> > > From: Ben Hutchings <bhutchings@solarflare.com>
> >> > > Date: Mon, 25 Oct 2010 23:38:53 +0100
> > 
> >> > Ethtool would be nice, but only for network drivers. Is there a generic
> >> > solution, quirks are obviously not keeping people happy.
> >>  
> >> Since this is (normally) a property of the system, pci=nomsi is the
> >> generic solution.
> > 
> > Sort of, it's a big hammer. Did all these driver writers not know about
> > pci=nomsi or did they prefer to add a parameter to their driver for some
> > reason?
> 
> Every time I've actually done the work to try and track down the
> true issue, it always turned out to be a PCI chipset problem rather
> than a device specific issue.

I agree with your generalization. I can think of only one exception:
ISTR pre-5705 tg3 chips would send both MSI and assert IRQ line at the same time.

My guess is driver writers just want knob to "work around" any issues
*their* driver might see with chipset. Disabling MSI for all drivers
doesn't leave opportunity for experimenting with individual drivers.

hth,
grant

> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-28  5:50 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@in.ibm.com>

On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 04:39:13 PM:
> 
> (merging two posts into one)
> 
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> For external host, there is a contention among different
> queues (vhosts) when packets are processed in tun/bridge,
> unless I implement MQ TX for macvtap (tun/bridge?).  So
> my testing shows a small improvement (1 to 1.5% average)
> in BW and a rise in SD (between 10-15%).  For remote host,
> I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap?
bridge does not do any queueing AFAIK ...
I think we need to fix the contention. With migration what was guest to
host a minute ago might become guest to external now ...

> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
> 
> I will get a test run with and without binding and post the
> results later today.
> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re: [patch v2] fix stack overflow in pktgen_if_write()
From: Dan Carpenter @ 2010-10-28  6:05 UTC (permalink / raw)
  To: Nelson Elhage
  Cc: Eric Dumazet, David S. Miller, Robert Olsson, Andy Shevchenko,
	netdev
In-Reply-To: <20101027230657.GT16803@ksplice.com>

On Wed, Oct 27, 2010 at 07:06:57PM -0400, Nelson Elhage wrote:
> You want to add a trailing NUL, or else printk will read off the end of the
> buffer.
> 
> Also, by memdup()ing count + 1 bytes, you're technically reading one more byte
> than userspace asked for, which could in principle lead to a spurious EFAULT.
> 

That's a lot of bugs per line.  :(  I'm eating humble pie today...

regards,
dan carpenter



^ permalink raw reply

* [patch v3] fix stack overflow in pktgen_if_write()
From: Dan Carpenter @ 2010-10-28  6:05 UTC (permalink / raw)
  To: Nelson Elhage
  Cc: Eric Dumazet, David S. Miller, Robert Olsson, Andy Shevchenko,
	netdev
In-Reply-To: <20101027230657.GT16803@ksplice.com>

Nelson Elhage says he was able to oops both amd64 and i386 test 
machines with 8k writes to the pktgen file.  Let's just allocate the
buffer on the heap instead of on the stack.

This can only be triggered by root so there are no security issues here.

Reported-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
v3:  just use kmalloc()

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 2c0df0f..c8d3620 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,12 +887,17 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		char tb[count + 1];
+		char *tb;
+
+		tb = kmalloc(count + 1, GFP_KERNEL);
+		if (!tb)
+			return -ENOMEM;
 		if (copy_from_user(tb, user_buffer, count))
 			return -EFAULT;
 		tb[count] = 0;
 		printk(KERN_DEBUG "pktgen: %s,%lu  buffer -:%s:-\n", name,
 		       (unsigned long)count, tb);
+		kfree(tb);
 	}
 
 	if (!strcmp(name, "min_pkt_size")) {

^ permalink raw reply related

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <20101028055054.GG5599@redhat.com>

> "Michael S. Tsirkin" <mst@redhat.com>

> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > For external host, there is a contention among different
> > queues (vhosts) when packets are processed in tun/bridge,
> > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > my testing shows a small improvement (1 to 1.5% average)
> > in BW and a rise in SD (between 10-15%).  For remote host,
> > I think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap?
> bridge does not do any queueing AFAIK ...
> I think we need to fix the contention. With migration what was guest to
> host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?

Thanks,

- KK


^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-28  6:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <OF03CB03BB.27A87B85-ON652577CA.002069A5-652577CA.0021C2E2@in.ibm.com>

On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com>
> 
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > For external host, there is a contention among different
> > > queues (vhosts) when packets are processed in tun/bridge,
> > > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > > my testing shows a small improvement (1 to 1.5% average)
> > > in BW and a rise in SD (between 10-15%).  For remote host,
> > > I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap?
> > bridge does not do any queueing AFAIK ...
> > I think we need to fix the contention. With migration what was guest to
> > host a minute ago might become guest to external now ...
> 
> Macvtap RX is MQ but not TX. I don't think MQ TX support is
> required for macvtap, though. Is it enough for existing
> macvtap sendmsg to work, since it calls dev_queue_xmit
> which selects the txq for the outgoing device?
> 
> Thanks,
> 
> - KK

I think there would be an issue with using a single poll notifier and
contention on send buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST

^ permalink raw reply

* Re: [PATCH] ipv6: addrconf: don't remove address state on ifdown if the address is being kept
From: Lorenzo Colitti @ 2010-10-28  6:45 UTC (permalink / raw)
  To: netdev; +Cc: Maciej Żenczykowski, David Miller, Brian Haley
In-Reply-To: <AANLkTimyqFoUvjQM_qLS2YeVmJQ_6-LJc6KfLVE-mzH3@mail.gmail.com>

On Wed, Oct 27, 2010 at 9:16 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Fix it so that none of this state is updated if the address is being kept on the interface.

Or, since the logic of the patched code is a little hard to follow,
here's an alternative patch against 2.6.36. I think it does exactly
the same as the previous patch, but it moves things around to make the
result more readable.

Tested: Added a statically configured IPv6 address to an interface,
started ping, brought link down, brought link up again. When link came
up ping kept on going and "ip -6 maddr" showed that the host was still
subscribed to there

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>

--- 2.6.36-orig/net/ipv6/addrconf.c	2010-10-20 13:30:22.000000000 -0700
+++ linux-2.6.36/net/ipv6/addrconf.c	2010-10-27 23:26:57.000000000 -0700
@@ -2737,10 +2737,6 @@ static int addrconf_ifdown(struct net_de
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
 			ifa->state = INET6_IFADDR_STATE_DAD;
-
-			write_unlock_bh(&idev->lock);
-
-			in6_ifa_hold(ifa);
 		} else {
 			list_del(&ifa->if_list);

@@ -2755,19 +2751,15 @@ static int addrconf_ifdown(struct net_de
 			ifa->state = INET6_IFADDR_STATE_DEAD;
 			spin_unlock_bh(&ifa->state_lock);

-			if (state == INET6_IFADDR_STATE_DEAD)
-				goto put_ifa;
+			if (state == INET6_IFADDR_STATE_DEAD) {
+				in6_ifa_put(ifa);
+			} else {
+				__ipv6_ifa_notify(RTM_DELADDR, ifa);
+				atomic_notifier_call_chain(&inet6addr_chain,
+							   NETDEV_DOWN, ifa);
+			}
+			write_lock_bh(&idev->lock);
 		}
-
-		__ipv6_ifa_notify(RTM_DELADDR, ifa);
-		if (ifa->state == INET6_IFADDR_STATE_DEAD)
-			atomic_notifier_call_chain(&inet6addr_chain,
-						   NETDEV_DOWN, ifa);
-
-put_ifa:
-		in6_ifa_put(ifa);
-
-		write_lock_bh(&idev->lock);
 	}

 	list_splice(&keep_list, &idev->addr_list);

^ permalink raw reply

* About the Davicom PHY in drivers/net/phy in Linux kernel
From: macpaul @ 2010-10-28  6:33 UTC (permalink / raw)
  To: netdev; +Cc: afleming, jeff, f.rodo, joseph_chang

Hi all,

According to the source code of Davicom PHY in Linux,
We should do "ISOLATE" command to PHY before setting "auto negotiation" or "MII/RMII".

However, I've found that with Faraday's MAC/GMAC controller (ftmac100/ftgmac100), setting ISOLATE for multiple PHY configuration will lead MDC become stop because the flaw inside the MAC controller. Faraday's MAC/GMAC will leverage the TX_CLK as the MDC source. When FTMAC100 send "ISOLATE" to Davicom's PHY, the TX_CLK send out from PHY will be stopped, then MDC will also become stop.

However, this mail wasn't meant to discuss about the design flaw inside the IC.

We've done two test to the following codes.

1st: if we just skip the " BMCR_ISOLATE" setting command, we will get PHY sometimes become unstable, for example, could not do DHCP request successfully.

2nd: if we replace "BMCR_ISOLATE" to "BMCR_RESET", we could get rid of the problem occurred by Faraday GMAC. And the PHY works well still.

I've found that in some other PHY implementation, for example, in "marvell.c", there are seems no ISOLATE commands. There are only RESET commands.
If we could replace BMCR_ISOLATE to BMCR_RESET in current kernel source, will there be any unpredictable behavior happened?

Please give us suggestion according to your experiences.
Thanks a lot.

+static int dm9161_config_aneg(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE);
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best regards,
Macpaul Lin

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox