Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 1/5] be2net: Fix race in posting rx buffers.
From: Sathya Perla @ 2011-08-23  5:41 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314078115-8121-1-git-send-email-sathya.perla@emulex.com>

There is a possibility of be_post_rx_frags() being called simultaneously from
both be_worker() (when rx_post_starved) and be_poll_rx() (when rxq->used is 0).
This can be avoided by posting rx buffers only when some completions have been
reaped.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index ef62594..09eb699 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1862,7 +1862,7 @@ loop_continue:
 	}

 	/* Refill the queue */
-	if (atomic_read(&rxo->q.used) < RX_FRAGS_REFILL_WM)
+	if (work_done && atomic_read(&rxo->q.used) < RX_FRAGS_REFILL_WM)
 		be_post_rx_frags(rxo, GFP_ATOMIC);

 	/* All consumed */
-- 
1.7.4

^ permalink raw reply related

* [PATCH net-next 0/5] be2net fixes v2
From: Sathya Perla @ 2011-08-23  5:41 UTC (permalink / raw)
  To: netdev

Incorporated Eric's feedback on [3/5] be2net: fix erx->rx_drops_no_frags wrap around
Pls apply. Thanks.

Sathya Perla (5):
  be2net: Fix race in posting rx buffers.
  be2net: get rid of memory mapped pci-cfg space address
  be2net: fix erx->rx_drops_no_frags wrap around
  be2net: increase FW update completion timeout
  be2net: remove unused variable

 drivers/net/ethernet/emulex/benet/be.h      |    2 -
 drivers/net/ethernet/emulex/benet/be_cmds.c |    2 +-
 drivers/net/ethernet/emulex/benet/be_main.c |   51 +++++++++++++++------------
 3 files changed, 29 insertions(+), 26 deletions(-)

-- 
1.7.4

^ permalink raw reply

* [PATCH net-next 2/5] be2net: get rid of memory mapped pci-cfg space address
From: Sathya Perla @ 2011-08-23  5:41 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314078115-8121-1-git-send-email-sathya.perla@emulex.com>

Get rid of adapter->pcicfg and its use. Use pci_config_read/write_dword()
instead.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be.h      |    1 -
 drivers/net/ethernet/emulex/benet/be_main.c |   27 ++++++++-------------------
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h
index 12b5b51..868d7f4 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -298,7 +298,6 @@ struct be_adapter {
 
 	u8 __iomem *csr;
 	u8 __iomem *db;		/* Door Bell */
-	u8 __iomem *pcicfg;	/* PCI config space */
 
 	struct mutex mbox_lock; /* For serializing mbox cmds to BE card */
 	struct be_dma_mem mbox_mem;
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 09eb699..2375c0c 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -141,13 +141,15 @@ static int be_queue_alloc(struct be_adapter *adapter, struct be_queue_info *q,
 
 static void be_intr_set(struct be_adapter *adapter, bool enable)
 {
-	u8 __iomem *addr = adapter->pcicfg + PCICFG_MEMBAR_CTRL_INT_CTRL_OFFSET;
-	u32 reg = ioread32(addr);
-	u32 enabled = reg & MEMBAR_CTRL_INT_CTRL_HOSTINTR_MASK;
+	u32 reg, enabled;
 
 	if (adapter->eeh_err)
 		return;
 
+	pci_read_config_dword(adapter->pdev, PCICFG_MEMBAR_CTRL_INT_CTRL_OFFSET,
+				&reg);
+	enabled = reg & MEMBAR_CTRL_INT_CTRL_HOSTINTR_MASK;
+
 	if (!enabled && enable)
 		reg |= MEMBAR_CTRL_INT_CTRL_HOSTINTR_MASK;
 	else if (enabled && !enable)
@@ -155,7 +157,8 @@ static void be_intr_set(struct be_adapter *adapter, bool enable)
 	else
 		return;
 
-	iowrite32(reg, addr);
+	pci_write_config_dword(adapter->pdev,
+			PCICFG_MEMBAR_CTRL_INT_CTRL_OFFSET, reg);
 }
 
 static void be_rxq_notify(struct be_adapter *adapter, u16 qid, u16 posted)
@@ -2951,14 +2954,12 @@ static void be_unmap_pci_bars(struct be_adapter *adapter)
 		iounmap(adapter->csr);
 	if (adapter->db)
 		iounmap(adapter->db);
-	if (adapter->pcicfg && be_physfn(adapter))
-		iounmap(adapter->pcicfg);
 }
 
 static int be_map_pci_bars(struct be_adapter *adapter)
 {
 	u8 __iomem *addr;
-	int pcicfg_reg, db_reg;
+	int db_reg;
 
 	if (lancer_chip(adapter)) {
 		addr = ioremap_nocache(pci_resource_start(adapter->pdev, 0),
@@ -2978,10 +2979,8 @@ static int be_map_pci_bars(struct be_adapter *adapter)
 	}
 
 	if (adapter->generation == BE_GEN2) {
-		pcicfg_reg = 1;
 		db_reg = 4;
 	} else {
-		pcicfg_reg = 0;
 		if (be_physfn(adapter))
 			db_reg = 4;
 		else
@@ -2993,16 +2992,6 @@ static int be_map_pci_bars(struct be_adapter *adapter)
 		goto pci_map_err;
 	adapter->db = addr;
 
-	if (be_physfn(adapter)) {
-		addr = ioremap_nocache(
-				pci_resource_start(adapter->pdev, pcicfg_reg),
-				pci_resource_len(adapter->pdev, pcicfg_reg));
-		if (addr == NULL)
-			goto pci_map_err;
-		adapter->pcicfg = addr;
-	} else
-		adapter->pcicfg = adapter->db + SRIOV_VF_PCICFG_OFFSET;
-
 	return 0;
 pci_map_err:
 	be_unmap_pci_bars(adapter);
-- 
1.7.4


^ permalink raw reply related

* Re: (3.1.0-rc2-git7) include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
From: Eric Dumazet @ 2011-08-23  5:32 UTC (permalink / raw)
  To: Dave Jones, David Miller; +Cc: netdev
In-Reply-To: <20110823024423.GA23874@redhat.com>

Le lundi 22 août 2011 à 22:44 -0400, Dave Jones a écrit :
> Just hit this..
> Is the fix as straight forward as changing that __in_dev_get_rcu to
> an in_dev_get() ?
> 
> 	Dave
> 
> 
>  ===================================================
>  [ INFO: suspicious rcu_dereference_check() usage. ]
>  ---------------------------------------------------
>  include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
>  
>  other info that might help us debug this:
>  
> 
>  rcu_scheduler_active = 1, debug_locks = 0
>  4 locks held by setfiles/2123:
>   #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8
>   #1:  (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f
>   #2:  (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372
>   #3:  (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process+0x36/0x103
>  
>  stack backtrace:
>  Pid: 2123, comm: setfiles Tainted: G        W   3.1.0-0.rc2.git7.2.fc16.x86_64 #1
>  Call Trace:
>   <IRQ>  [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf
>   [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d
>   [<ffffffff8146a751>] arp_process+0x25/0x4d7
>   [<ffffffff8146ac11>] parp_redo+0xe/0x10
>   [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103
>   [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372
>   [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372
>   [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41
>   [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95
>   [<ffffffff81062bb6>] __do_softirq+0x112/0x25a
>   [<ffffffff8150d27c>] call_softirq+0x1c/0x30
>   [<ffffffff81010bf5>] do_softirq+0x4b/0xa2
>   [<ffffffff81062f65>] irq_exit+0x5d/0xcf
>   [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a
>   [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80
>   <EOI>  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
>   [<ffffffff814fc285>] ? __slab_free+0x30/0x24c
>   [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81130cb0>] kfree+0x108/0x131
>   [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e
>   [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23
>   [<ffffffff81154625>] d_instantiate+0x5c/0x61
>   [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2
>   [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb
>   [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b
>   [<ffffffff8114cbea>] walk_component+0x215/0x3e8
>   [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d
>   [<ffffffff8114daf3>] path_lookupat+0x82/0x2af
>   [<ffffffff8110fc53>] ? might_fault+0xa5/0xac
>   [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac
>   [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca
>   [<ffffffff8114dd48>] do_path_lookup+0x28/0x97
>   [<ffffffff8114df2c>] user_path_at+0x59/0x96
>   [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d
>   [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e
>   [<ffffffff811469ee>] vfs_lstat+0x1e/0x20
>   [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33
>   [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
>   [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>   [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b
> 

Dave, thanks again for your report, here the fix I cooked to address
this.

Kernels >= 2.6.36 are affected, probably only RT ones, since
neigh_proxy_process() holds a spinlock before the call to proxy_redo().

David, IPv6 side is affected too only in net-next, but following patch
should take care of both protocols.

Thanks

[PATCH] arp: fix rcu lockdep splat in arp_process()

Dave Jones reported a lockdep splat triggered by an arp_process() call
from parp_redo().

Commit faa9dcf793be (arp: RCU changes) is the origin of the bug, since
it assumed arp_process() was called under rcu_read_lock(), which is not
true in this particular path.

Instead of adding rcu_read_lock() in parp_redo(), I chose to add it in 
neigh_proxy_process() to take care of IPv6 side too.

 ===================================================
 [ INFO: suspicious rcu_dereference_check() usage. ]
 ---------------------------------------------------
 include/linux/inetdevice.h:209 invoked rcu_dereference_check() without
protection!
 
 other info that might help us debug this:
 
 
 rcu_scheduler_active = 1, debug_locks = 0
 4 locks held by setfiles/2123:
  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>]
walk_component+0x1ef/0x3e8
  #1:  (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>]
inode_doinit_with_dentry+0x3f/0x41f
  #2:  (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>]
run_timer_softirq+0x157/0x372
  #3:  (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process
+0x36/0x103
 
 stack backtrace:
 Pid: 2123, comm: setfiles Tainted: G        W
3.1.0-0.rc2.git7.2.fc16.x86_64 #1
 Call Trace:
  <IRQ>  [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf
  [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d
  [<ffffffff8146a751>] arp_process+0x25/0x4d7
  [<ffffffff8146ac11>] parp_redo+0xe/0x10
  [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103
  [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372
  [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372
  [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41
  [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95
  [<ffffffff81062bb6>] __do_softirq+0x112/0x25a
  [<ffffffff8150d27c>] call_softirq+0x1c/0x30
  [<ffffffff81010bf5>] do_softirq+0x4b/0xa2
  [<ffffffff81062f65>] irq_exit+0x5d/0xcf
  [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a
  [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80
  <EOI>  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
  [<ffffffff814fc285>] ? __slab_free+0x30/0x24c
  [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81130cb0>] kfree+0x108/0x131
  [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e
  [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23
  [<ffffffff81154625>] d_instantiate+0x5c/0x61
  [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2
  [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb
  [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b
  [<ffffffff8114cbea>] walk_component+0x215/0x3e8
  [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d
  [<ffffffff8114daf3>] path_lookupat+0x82/0x2af
  [<ffffffff8110fc53>] ? might_fault+0xa5/0xac
  [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac
  [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca
  [<ffffffff8114dd48>] do_path_lookup+0x28/0x97
  [<ffffffff8114df2c>] user_path_at+0x59/0x96
  [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d
  [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e
  [<ffffffff811469ee>] vfs_lstat+0x1e/0x20
  [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33
  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
  [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b


Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/neighbour.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 8fab9b0..1334d7e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1319,11 +1319,15 @@ static void neigh_proxy_process(unsigned long arg)
 
 		if (tdif <= 0) {
 			struct net_device *dev = skb->dev;
+
 			__skb_unlink(skb, &tbl->proxy_queue);
-			if (tbl->proxy_redo && netif_running(dev))
+			if (tbl->proxy_redo && netif_running(dev)) {
+				rcu_read_lock();
 				tbl->proxy_redo(skb);
-			else
+				rcu_read_unlock();
+			} else {
 				kfree_skb(skb);
+			}
 
 			dev_put(dev);
 		} else if (!sched_next || tdif < sched_next)



^ permalink raw reply related

* Re: (3.1.0-rc2-git7) include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
From: Eric Dumazet @ 2011-08-23  5:02 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev
In-Reply-To: <20110823024423.GA23874@redhat.com>

Le lundi 22 août 2011 à 22:44 -0400, Dave Jones a écrit :
> Just hit this..
> Is the fix as straight forward as changing that __in_dev_get_rcu to
> an in_dev_get() ?
> 
> 	Dave
> 
> 
>  ===================================================
>  [ INFO: suspicious rcu_dereference_check() usage. ]
>  ---------------------------------------------------
>  include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
>  
>  other info that might help us debug this:
>  
> 
>  rcu_scheduler_active = 1, debug_locks = 0
>  4 locks held by setfiles/2123:
>   #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8
>   #1:  (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f
>   #2:  (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372
>   #3:  (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process+0x36/0x103
>  
>  stack backtrace:
>  Pid: 2123, comm: setfiles Tainted: G        W   3.1.0-0.rc2.git7.2.fc16.x86_64 #1
>  Call Trace:
>   <IRQ>  [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf
>   [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d
>   [<ffffffff8146a751>] arp_process+0x25/0x4d7
>   [<ffffffff8146ac11>] parp_redo+0xe/0x10
>   [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103
>   [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372
>   [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372
>   [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41
>   [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95
>   [<ffffffff81062bb6>] __do_softirq+0x112/0x25a
>   [<ffffffff8150d27c>] call_softirq+0x1c/0x30
>   [<ffffffff81010bf5>] do_softirq+0x4b/0xa2
>   [<ffffffff81062f65>] irq_exit+0x5d/0xcf
>   [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a
>   [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80
>   <EOI>  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
>   [<ffffffff814fc285>] ? __slab_free+0x30/0x24c
>   [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81130cb0>] kfree+0x108/0x131
>   [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f
>   [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e
>   [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23
>   [<ffffffff81154625>] d_instantiate+0x5c/0x61
>   [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2
>   [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb
>   [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b
>   [<ffffffff8114cbea>] walk_component+0x215/0x3e8
>   [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d
>   [<ffffffff8114daf3>] path_lookupat+0x82/0x2af
>   [<ffffffff8110fc53>] ? might_fault+0xa5/0xac
>   [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac
>   [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca
>   [<ffffffff8114dd48>] do_path_lookup+0x28/0x97
>   [<ffffffff8114df2c>] user_path_at+0x59/0x96
>   [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d
>   [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e
>   [<ffffffff811469ee>] vfs_lstat+0x1e/0x20
>   [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33
>   [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
>   [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>   [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b

Thanks for the report Dave, I am preparing a patch to fix this.




^ permalink raw reply

* Re: [net-next 03/10] ixgbe: Drop the TX work limit and instead just leave it to budget
From: Alexander Duyck @ 2011-08-23  4:04 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.h.duyck, bhutchings, jeffrey.t.kirsher, netdev, gospo
In-Reply-To: <20110822.164027.1830363266993513959.davem@davemloft.net>

On Mon, Aug 22, 2011 at 4:40 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Mon, 22 Aug 2011 15:57:51 -0700
>
>> The problem was occurring even without large rings.
>> I was seeing issues with rings just 256 descriptors in size.
>
> And the default in the ixgbe driver is 512 entries which I think
> itself is quite excessive.  Something like 128 is more in line with
> what I'd call a sane default.

Are you suggesting I change the the ring size, the TX quota, or both?

> So the only side effect of your change is to decrease the TX quota to
> 64 (the default NAPI quota) from it's current value of 512
> (IXGBE_DEFAULT_TXD).

Yeah, that pretty much sums it up.  However as I said in the earlier
email I am counting SKBs freed instead of descriptors.  As such we
will probably end up cleaning more like 128 descriptors per TX clean
just due to our context descriptors that are also occupying space in
the ring.

> Talking about the existing code, I can't even see how the current
> driver private TX quota can trigger except in the most extreme cases.
> This is because the quota is set the same as the size you're setting
> the TX ring to.

I'm not sure if it ever met the quota either.  I suspect not since
under the routing workloads I would see the RX interrupts get disabled
but the TX interrupts keep going.

>> The problem seemed to be that the TX cleanup being a multiple of
>> budget was allowing one CPU to overwhelm the other and the fact that
>> the TX was essentially unbounded was just allowing the issue to
>> feedback on itself.
>
> I still don't understand what issue you could even be running into.
>
> On each CPU we round robin against all NAPI requestors for that CPU.
>
> In your routing test setup, we should have one cpu doing the RX and
> another different cpu doing TX.
>
> Therefore if the TX cpu simply spins in a loop doing nothing but TX
> reclaim work it should not really matter.

Doing a unidirectional test was fine and everything worked as you
describe.  It was doing a bidirectional test with two ports and a
single queue for each port that was the issue.  Specifically what
would happen is that one direction would tend to dominate over the
other so I would end up with a 60/40 split with either upstream or
downstream dominating.  By using the budget as the quota I found the
results were generally much closer to 50/50 and the overall result of
the two flows combined was higher.  I suspect it had to do with TX
work getting backlogged on the ring and acting as a feedback mechanism
to prevent the RX work on that CPU from getting squashed.

> And if you hit the TX budget on the TX cpu, it's just going to come
> right back into the ixgbe NAPI handler and thus the TX reclaim
> processing not even a dozen cycles later.
>
> The only effect is to have us go through the whole function call
> sequence and data structure setup into local variables more than you
> would be doing so before.

I suppose that is possible.  It all depends on the type of packets
being sent.  For single sends without any offloads we would only be
cleaning 64 descriptors per call to ixgbe_poll,  with an offloaded
checksum or VLAN it would be 128 descriptors per call, and with TSO we
probably still wouldn't consume the budget since the sk_buff
consumption rate would be too low.

I can try testing the throughput with pktgen tomorrow to see if it
improves by increasing the TX budget.  I suppose there could be a few
factors affecting this since the budget value also determines the
number of buffers we clean before we call netif_wake_queue to
re-enable the transmit path.

>> In addition since the RX and TX workload was balanced it kept both
>> locked into polling while the CPU was saturated instead of allowing
>> the TX to become interrupt driven.  In addition since the TX was
>> working on the same budget as the RX the number of SKBs freed up in
>> the TX path would match the number consumed when being reallocated
>> on the RX path.
>
> So the only conclusion I can come to is that what happens is we're now
> executing what are essentially wasted cpu cycles and this takes us
> over the threshold such that we poll more and take interrupts less.
> And this improves performance.
>
> That's pretty unwise if you ask me, we should do something useful with
> cpu cycles instead of wasting them merely to make us poll more.

The thing is we are probably going to be wasting those cycles anyway.
In the case of bidirectional routing I was always locked into RX
polling with this change in place or not.  The only difference is that
the TX will likely clean itself completely with each poll versus the
possibility of leaving a few buffers behind when it hits the 64 quota.

>> The problem seemed to be present as long as I allowed the TX budget to
>> be a multiple of the RX budget.  The easiest way to keep things
>> balanced and avoid allowing the TX from one CPU to overwhelm the RX on
>> another was just to keep the budgets equal.
>
> You're executing 10 or 20 cpu cycles after every 64 TX reclaims,
> that's the only effect of these changes.  That's not even long enough
> for a cache line to transfer between two cpus.

It sounds like I may not have been seeing this due to the type of
workload I was focusing on.  I'll try generating some data with pktgen
and netperf tomorrow to see how this holds up under small packet
transmit only traffic since those are the cases most likely to get
into the state you mention.

Also I would appreciate it if you had any suggestions on other
workloads I might need to focus on in order to determine the impact of
this change.

Thanks,

Alex

^ permalink raw reply

* Re: linux-next: boot test failure (net tree)
From: David Miller @ 2011-08-23  4:02 UTC (permalink / raw)
  To: lacombar
  Cc: sfr, netdev, linux-next, linux-kernel, jeffrey.t.kirsher, mikey,
	torvalds, akpm, linuxppc-dev, benh, paulus, linux-kbuild
In-Reply-To: <CACqU3MW4BnucRt3gxPrKPDvWEjaVuxRF1VOPWk5hTRfneyANkg@mail.gmail.com>

From: Arnaud Lacombe <lacombar@gmail.com>
Date: Mon, 22 Aug 2011 23:50:02 -0400

> Are you implying we need some kind of way to migrate config ?

The issue is that the dependencies for every single ethernet driver
have changed.  Some dependencies have been dropped (f.e. NETDEV_10000
and some have been added (f.e. ETHERNET, NET_VENDOR_****)

So right now an automated (non-prompted, default to no on all new
options) run on an existing config results in all ethernet drivers
getting disabled because the new dependencies don't get enabled.

This wouldn't be so bad if it was just one or two drivers, but in
this case it's every single ethernet driver which will have and hit
this problem.

^ permalink raw reply

* Re: linux-next: boot test failure (net tree)
From: Arnaud Lacombe @ 2011-08-23  3:50 UTC (permalink / raw)
  To: David Miller
  Cc: sfr, netdev, linux-next, linux-kernel, jeffrey.t.kirsher, mikey,
	torvalds, akpm, linuxppc-dev, benh, paulus, linux-kbuild
In-Reply-To: <20110822.191348.2099822249437201579.davem@davemloft.net>

Hi,

[Added linux-kbuild@ to the Cc: list.]

On Mon, Aug 22, 2011 at 10:13 PM, David Miller <davem@davemloft.net> wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Tue, 23 Aug 2011 11:41:29 +1000
>
>> On Tue, 23 Aug 2011 11:40:11 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>>>
>>> On Mon, 22 Aug 2011 11:30:32 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>>> >
>>> > Here's what I am applying as a merge fixup to the net tree today so that
>>> > my ppc64_defconfig builds actually build more or less the same set of
>>> > drivers as before this rearrangement.
>>>
>>> And this today:
>>
>> And this:
>
> I'm starting to get uncomfortable with this whole situation, and I
> feel more and more that these new kconfig guards are not tenable.
>
> Changing defconfig files might fix the "automated test boot with
> defconfig" case but it won't fix the case of someone trying to
> automate a build and boot using a different, existing, config file.
> It ought to work too, and I do know people really do this.
>
> And just the fact that we would have to merge all of these defconfig changes
> through the networking tree is evidence of how it's really not reasonable
> to be doing things this way.
>
> Jeff, I think we need to revert the dependencies back to what they were
> before the drivers/net moves.  Could you prepare a patch which does that?
>
Are you implying we need some kind of way to migrate config ?

 - Arnaud
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] virtio-net: Read MAC only after initializing MSI-X
From: Rusty Russell @ 2011-08-23  3:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sasha Levin, linux-kernel, virtualization, netdev, kvm
In-Reply-To: <20110822083613.GA18556@redhat.com>

On Mon, 22 Aug 2011 11:36:13 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Yes. Or maybe 'WHEN' - since if MSI-X is disabled again, it moves back.
> Let's update the spec to make it clearer?

I left the language as is, but added a footnote at the end of the
sentence:

If MSI-X is enabled for the device, two additional fields immediately
follow this header:
[footnote: ie. once you enable MSI-X on the device, the other
           fields move. If you turn it off again, they move back!]

Thanks,
Rusty.

^ permalink raw reply

* linux-next: manual merge of the net tree with the unicore32 tree
From: Stephen Rothwell @ 2011-08-23  3:02 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Guan Xuetao, Zhang Shu, Chu Xiaowei,
	Su Yonggang, Jeff Kirsher

Hi all,

Today's linux-next merge of the net tree got conflicts in
drivers/net/Kconfig and drivers/net/Makefile between commit e8787de6fa83
("unicore32: add pkunity-v3 mac/net driver (umal)") from the unicore32
tree and the network driver rearrangement from the net tree.

I just added the new driver from the unicore32 tree commit into each file
(see below).
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index ef6b6be..df5e990 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -215,6 +215,12 @@ source "drivers/s390/net/Kconfig"
 
 source "drivers/net/caif/Kconfig"
 
+config MAC_PUV3
+	tristate "PKUnity v3 UMAL Gigabit Network Adapter support"
+	depends on UNICORE32 && ARCH_PUV3
+	select MII
+	select PHYLIB
+
 config XEN_NETDEV_FRONTEND
 	tristate "Xen network device frontend driver"
 	depends on XEN
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c33009b..9896ad1 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_RIONET) += rionet.o
 
 obj-$(CONFIG_NET) += Space.o loopback.o
 obj-$(CONFIG_NET_SB1000) += sb1000.o
+obj-$(CONFIG_MAC_PUV3) += mac-puv3.o
 obj-$(CONFIG_PPP) += ppp_generic.o
 obj-$(CONFIG_PPP_ASYNC) += ppp_async.o
 obj-$(CONFIG_PPP_SYNC_TTY) += ppp_synctty.o

^ permalink raw reply related

* (3.1.0-rc2-git7) include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
From: Dave Jones @ 2011-08-23  2:44 UTC (permalink / raw)
  To: netdev

Just hit this..
Is the fix as straight forward as changing that __in_dev_get_rcu to
an in_dev_get() ?

	Dave


 ===================================================
 [ INFO: suspicious rcu_dereference_check() usage. ]
 ---------------------------------------------------
 include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!
 
 other info that might help us debug this:
 
 
 rcu_scheduler_active = 1, debug_locks = 0
 4 locks held by setfiles/2123:
  #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8
  #1:  (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f
  #2:  (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372
  #3:  (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process+0x36/0x103
 
 stack backtrace:
 Pid: 2123, comm: setfiles Tainted: G        W   3.1.0-0.rc2.git7.2.fc16.x86_64 #1
 Call Trace:
  <IRQ>  [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf
  [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d
  [<ffffffff8146a751>] arp_process+0x25/0x4d7
  [<ffffffff8146ac11>] parp_redo+0xe/0x10
  [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103
  [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372
  [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372
  [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41
  [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95
  [<ffffffff81062bb6>] __do_softirq+0x112/0x25a
  [<ffffffff8150d27c>] call_softirq+0x1c/0x30
  [<ffffffff81010bf5>] do_softirq+0x4b/0xa2
  [<ffffffff81062f65>] irq_exit+0x5d/0xcf
  [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a
  [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80
  <EOI>  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
  [<ffffffff814fc285>] ? __slab_free+0x30/0x24c
  [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81130cb0>] kfree+0x108/0x131
  [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f
  [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e
  [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23
  [<ffffffff81154625>] d_instantiate+0x5c/0x61
  [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2
  [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb
  [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b
  [<ffffffff8114cbea>] walk_component+0x215/0x3e8
  [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d
  [<ffffffff8114daf3>] path_lookupat+0x82/0x2af
  [<ffffffff8110fc53>] ? might_fault+0xa5/0xac
  [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac
  [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca
  [<ffffffff8114dd48>] do_path_lookup+0x28/0x97
  [<ffffffff8114df2c>] user_path_at+0x59/0x96
  [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d
  [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e
  [<ffffffff811469ee>] vfs_lstat+0x1e/0x20
  [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33
  [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
  [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b



^ permalink raw reply

* Re: linux-next: boot test failure (net tree)
From: Jeff Kirsher @ 2011-08-23  2:26 UTC (permalink / raw)
  To: David Miller
  Cc: sfr@canb.auug.org.au, netdev@vger.kernel.org,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	mikey@neuling.org, torvalds@linux-foundation.org,
	akpm@linux-foundation.org, linuxppc-dev@lists.ozlabs.org,
	benh@kernel.crashing.org, paulus@samba.org
In-Reply-To: <20110822.191348.2099822249437201579.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1586 bytes --]

On Mon, 2011-08-22 at 19:13 -0700, David Miller wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Tue, 23 Aug 2011 11:41:29 +1000
> 
> > On Tue, 23 Aug 2011 11:40:11 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >>
> >> On Mon, 22 Aug 2011 11:30:32 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >> >
> >> > Here's what I am applying as a merge fixup to the net tree today so that
> >> > my ppc64_defconfig builds actually build more or less the same set of
> >> > drivers as before this rearrangement.
> >> 
> >> And this today:
> > 
> > And this:
> 
> I'm starting to get uncomfortable with this whole situation, and I
> feel more and more that these new kconfig guards are not tenable.
> 
> Changing defconfig files might fix the "automated test boot with
> defconfig" case but it won't fix the case of someone trying to
> automate a build and boot using a different, existing, config file.
> It ought to work too, and I do know people really do this.
> 
> And just the fact that we would have to merge all of these defconfig changes
> through the networking tree is evidence of how it's really not reasonable
> to be doing things this way.
> 
> Jeff, I think we need to revert the dependencies back to what they were
> before the drivers/net moves.  Could you prepare a patch which does that?
> 

I was just finishing up those patches (not including any defconfig
changes) and started looking at a patch to fix/resolve the issues that
Stephen is seeing.

Let me see what I can come up with tonight to resolve this.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* Re: linux-next: boot test failure (net tree)
From: David Miller @ 2011-08-23  2:13 UTC (permalink / raw)
  To: sfr
  Cc: netdev, linux-next, linux-kernel, jeffrey.t.kirsher, mikey,
	torvalds, akpm, linuxppc-dev, benh, paulus
In-Reply-To: <20110823114129.ceb18da164bf7df3c145941b@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 23 Aug 2011 11:41:29 +1000

> On Tue, 23 Aug 2011 11:40:11 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>>
>> On Mon, 22 Aug 2011 11:30:32 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>> >
>> > Here's what I am applying as a merge fixup to the net tree today so that
>> > my ppc64_defconfig builds actually build more or less the same set of
>> > drivers as before this rearrangement.
>> 
>> And this today:
> 
> And this:

I'm starting to get uncomfortable with this whole situation, and I
feel more and more that these new kconfig guards are not tenable.

Changing defconfig files might fix the "automated test boot with
defconfig" case but it won't fix the case of someone trying to
automate a build and boot using a different, existing, config file.
It ought to work too, and I do know people really do this.

And just the fact that we would have to merge all of these defconfig changes
through the networking tree is evidence of how it's really not reasonable
to be doing things this way.

Jeff, I think we need to revert the dependencies back to what they were
before the drivers/net moves.  Could you prepare a patch which does that?

^ permalink raw reply

* Re: linux-next: boot test failure (net tree)
From: Stephen Rothwell @ 2011-08-23  1:41 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-next, linux-kernel, jeffrey.t.kirsher, mikey,
	torvalds, akpm, ppc-dev, Benjamin Herrenschmidt, Paul Mackerras
In-Reply-To: <20110823114011.a059aea0138b75bfa7eed1ce@canb.auug.org.au>

Hi Dave,

On Tue, 23 Aug 2011 11:40:11 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Mon, 22 Aug 2011 11:30:32 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > Here's what I am applying as a merge fixup to the net tree today so that
> > my ppc64_defconfig builds actually build more or less the same set of
> > drivers as before this rearrangement.
> 
> And this today:

And this:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 23 Aug 2011 11:35:18 +1000
Subject: [PATCH] sparc: update sparc64_defconfig for net device movement

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/sparc/configs/sparc64_defconfig |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/sparc/configs/sparc64_defconfig b/arch/sparc/configs/sparc64_defconfig
index 3c1e858..5732728 100644
--- a/arch/sparc/configs/sparc64_defconfig
+++ b/arch/sparc/configs/sparc64_defconfig
@@ -37,8 +37,6 @@ CONFIG_NET_KEY_MIGRATE=y
 CONFIG_INET=y
 CONFIG_IP_MULTICAST=y
 CONFIG_NET_IPIP=m
-CONFIG_NET_IPGRE=m
-CONFIG_NET_IPGRE_BROADCAST=y
 CONFIG_IP_MROUTE=y
 CONFIG_IP_PIMSM_V1=y
 CONFIG_IP_PIMSM_V2=y
@@ -95,17 +93,19 @@ CONFIG_DM_SNAPSHOT=m
 CONFIG_DM_MIRROR=m
 CONFIG_DM_ZERO=m
 CONFIG_NETDEVICES=y
-CONFIG_NET_ETHERNET=y
 CONFIG_MII=m
+CONFIG_NET_VENDOR_AMD=y
 CONFIG_SUNLANCE=m
+CONFIG_NET_VENDOR_BROADCOM=y
+CONFIG_BNX2=m
+CONFIG_TIGON3=m
+CONFIG_NET_VENDOR_INTEL=y
+CONFIG_E1000=m
+CONFIG_E1000E=m
+CONFIG_NET_VENDOR_SUN=y
 CONFIG_HAPPYMEAL=m
 CONFIG_SUNGEM=m
 CONFIG_SUNVNET=m
-CONFIG_NET_PCI=y
-CONFIG_E1000=m
-CONFIG_E1000E=m
-CONFIG_TIGON3=m
-CONFIG_BNX2=m
 CONFIG_NIU=m
 # CONFIG_WLAN is not set
 CONFIG_PPP=m
@@ -126,13 +126,13 @@ CONFIG_INPUT_SPARCSPKR=y
 # CONFIG_SERIO_SERPORT is not set
 CONFIG_SERIO_PCIPS2=m
 CONFIG_SERIO_RAW=m
+# CONFIG_LEGACY_PTYS is not set
 # CONFIG_DEVKMEM is not set
 CONFIG_SERIAL_SUNSU=y
 CONFIG_SERIAL_SUNSU_CONSOLE=y
 CONFIG_SERIAL_SUNSAB=y
 CONFIG_SERIAL_SUNSAB_CONSOLE=y
 CONFIG_SERIAL_SUNHV=y
-# CONFIG_LEGACY_PTYS is not set
 CONFIG_FB=y
 CONFIG_FB_TILEBLITTING=y
 CONFIG_FB_SBUS=y
@@ -206,10 +206,8 @@ CONFIG_PRINTK_TIME=y
 CONFIG_MAGIC_SYSRQ=y
 CONFIG_DEBUG_KERNEL=y
 CONFIG_LOCKUP_DETECTOR=y
-CONFIG_DETECT_HUNG_TASK=y
 # CONFIG_SCHED_DEBUG is not set
 CONFIG_SCHEDSTATS=y
-# CONFIG_RCU_CPU_STALL_DETECTOR is not set
 CONFIG_SYSCTL_SYSCALL_CHECK=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_KEYS=y
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply related

* Re: linux-next: boot test failure (net tree)
From: Stephen Rothwell @ 2011-08-23  1:40 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-next, linux-kernel, jeffrey.t.kirsher, mikey,
	torvalds, akpm, ppc-dev, Benjamin Herrenschmidt, Paul Mackerras
In-Reply-To: <20110822113032.15087c2e190e2b0c3ee7dfb8@canb.auug.org.au>

Hi Dave,

On Mon, 22 Aug 2011 11:30:32 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Here's what I am applying as a merge fixup to the net tree today so that
> my ppc64_defconfig builds actually build more or less the same set of
> drivers as before this rearrangement.

And this today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 23 Aug 2011 11:23:40 +1000
Subject: [PATCH] sparc: update sparc32_defconfig for net device movement

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/sparc/configs/sparc32_defconfig |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/sparc/configs/sparc32_defconfig b/arch/sparc/configs/sparc32_defconfig
index fb23fd6..9bc241a 100644
--- a/arch/sparc/configs/sparc32_defconfig
+++ b/arch/sparc/configs/sparc32_defconfig
@@ -2,9 +2,7 @@ CONFIG_EXPERIMENTAL=y
 CONFIG_SYSVIPC=y
 CONFIG_POSIX_MQUEUE=y
 CONFIG_LOG_BUF_SHIFT=14
-CONFIG_SYSFS_DEPRECATED_V2=y
 CONFIG_BLK_DEV_INITRD=y
-# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
 CONFIG_SLAB=y
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
@@ -42,9 +40,10 @@ CONFIG_SCSI_QLOGICPTI=m
 CONFIG_SCSI_SUNESP=y
 CONFIG_NETDEVICES=y
 CONFIG_DUMMY=m
-CONFIG_NET_ETHERNET=y
 CONFIG_MII=m
+CONFIG_NET_VENDOR_AMD=y
 CONFIG_SUNLANCE=y
+CONFIG_NET_VENDOR_SUN=y
 CONFIG_HAPPYMEAL=m
 CONFIG_SUNBMAC=m
 CONFIG_SUNQE=m
@@ -64,26 +63,22 @@ CONFIG_SERIAL_SUNSU=y
 CONFIG_SERIAL_SUNSU_CONSOLE=y
 CONFIG_SPI=y
 CONFIG_SPI_XILINX=m
-CONFIG_SPI_XILINX_PLTFM=m
 CONFIG_SUN_OPENPROMIO=m
 CONFIG_EXT2_FS=y
 CONFIG_EXT2_FS_XATTR=y
 CONFIG_EXT2_FS_POSIX_ACL=y
 CONFIG_EXT2_FS_SECURITY=y
-CONFIG_AUTOFS_FS=m
 CONFIG_AUTOFS4_FS=m
 CONFIG_ISO9660_FS=m
 CONFIG_PROC_KCORE=y
 CONFIG_ROMFS_FS=m
 CONFIG_NFS_FS=y
 CONFIG_ROOT_NFS=y
-CONFIG_RPCSEC_GSS_KRB5=m
 CONFIG_NLS=y
 # CONFIG_ENABLE_WARN_DEPRECATED is not set
 CONFIG_DEBUG_KERNEL=y
 CONFIG_DETECT_HUNG_TASK=y
 # CONFIG_SCHED_DEBUG is not set
-# CONFIG_RCU_CPU_STALL_DETECTOR is not set
 CONFIG_KGDB=y
 CONFIG_KGDB_TESTS=y
 CONFIG_CRYPTO_NULL=m
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply related

* Re: [PATCH 00/11] various fixes v3
From: Greg KH @ 2011-08-23  1:32 UTC (permalink / raw)
  To: Jason Baron
  Cc: gregkh, joe, jim.cromie, bvanassche, linux-kernel, davem,
	aloisio.almeida, netdev
In-Reply-To: <cover.1313085588.git.jbaron@redhat.com>

On Thu, Aug 11, 2011 at 02:36:17PM -0400, Jason Baron wrote:
> Hi,
> 
> Dynamic debug fixes and cleanups. Only changes from v2 are a re-base against
> Linus's latest tree and the inclusion of the re-posted version for patch #11.

I applied the first 8, as it seemed like there was lots of discussion
about #9.

Care to send the rest when you all have those worked out?

thanks,


greg k-h

^ permalink raw reply

* Re: [Patch] Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path
From: Tim Chen @ 2011-08-23  0:57 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, viro, ebiederm, ak, matt.fleming, linux-kernel,
	netdev
In-Reply-To: <20110822.174013.353730308624819543.davem@davemloft.net>

On Mon, 2011-08-22 at 17:40 -0700, David Miller wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Fri, 19 Aug 2011 09:44:58 -0700
> 
> > -		/* Only send the fds in the first buffer */
> > -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent);
> > +		/* Only send the fds and no ref to pid in the first buffer */
> > +		if (fds_sent)
> > +			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, true);
> > +		else
> > +			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, false);
> 
> Just set this final boolean the way the third argument is, there is no
> reason to replicate the entire function call twice just to set the
> final argument to what "fds_sent" evaluates to as a boolean.
> 
> 		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> 
> ought to suffice.

Good point.  Patch updated as suggested.

Tim

-----------

Patch series 109f6e39..7361c36c back in 2.6.36 added functionality to
allow credentials to work across pid namespaces for packets sent via
UNIX sockets.  However, the atomic reference counts on pid and
credentials caused plenty of cache bouncing when there are numerous
threads of the same pid sharing a UNIX socket.  This patch mitigates the
problem by eliminating extraneous reference counts on pid and
credentials on both send and receive path of UNIX sockets. I found a 2x
improvement in hackbench's threaded case.

On the receive path in unix_dgram_recvmsg, currently there is an
increment of reference count on pid and credentials in scm_set_cred.
Then there are two decrement of the reference counts.  Once in scm_recv
and once when skb_free_datagram call skb->destructor function
unix_destruct_scm.  One pair of increment and decrement of ref count on
pid and credentials can be eliminated from the receive path.  Until we
destroy the skb, we already set a reference when we created the skb on
the send side.

On the send path, there are two increments of ref count on pid and
credentials, once in scm_send and once in unix_scm_to_skb.  Then there
is a decrement of the reference counts in scm_destroy's call to
scm_destroy_cred at the end of unix_dgram_sendmsg functions.   One pair
of increment and decrement of the reference counts can be removed so we
only need to increment the ref counts once.

By incorporating these changes, for hackbench running on a 4 socket
NHM-EX machine with 40 cores, the execution of hackbench on
50 groups of 20 threads sped up by factor of 2.

Hackbench command used for testing:
./hackbench 50 thread 2000

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/include/net/scm.h b/include/net/scm.h
index 745460f..68e1e48 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -53,6 +53,14 @@ static __inline__ void scm_set_cred(struct scm_cookie *scm,
 	cred_to_ucred(pid, cred, &scm->creds);
 }
 
+static __inline__ void scm_set_cred_noref(struct scm_cookie *scm,
+				    struct pid *pid, const struct cred *cred)
+{
+	scm->pid  = pid;
+	scm->cred = cred;
+	cred_to_ucred(pid, cred, &scm->creds);
+}
+
 static __inline__ void scm_destroy_cred(struct scm_cookie *scm)
 {
 	put_pid(scm->pid);
@@ -70,6 +78,15 @@ static __inline__ void scm_destroy(struct scm_cookie *scm)
 		__scm_destroy(scm);
 }
 
+static __inline__ void scm_release(struct scm_cookie *scm)
+{
+	/* keep ref on pid and cred */
+	scm->pid = NULL;
+	scm->cred = NULL;
+	if (scm->fp)
+		__scm_destroy(scm);
+}
+
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm)
 {
@@ -108,15 +125,14 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 	if (!msg->msg_control) {
 		if (test_bit(SOCK_PASSCRED, &sock->flags) || scm->fp)
 			msg->msg_flags |= MSG_CTRUNC;
-		scm_destroy(scm);
+		if (scm && scm->fp)
+			__scm_destroy(scm);
 		return;
 	}
 
 	if (test_bit(SOCK_PASSCRED, &sock->flags))
 		put_cmsg(msg, SOL_SOCKET, SCM_CREDENTIALS, sizeof(scm->creds), &scm->creds);
 
-	scm_destroy_cred(scm);
-
 	scm_passec(sock, msg, scm);
 
 	if (!scm->fp)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0722a25..136298c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1382,11 +1382,17 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	return max_level;
 }
 
-static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
+static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
+			   bool send_fds, bool ref)
 {
 	int err = 0;
-	UNIXCB(skb).pid  = get_pid(scm->pid);
-	UNIXCB(skb).cred = get_cred(scm->cred);
+	if (ref) {
+		UNIXCB(skb).pid  = get_pid(scm->pid);
+		UNIXCB(skb).cred = get_cred(scm->cred);
+	} else {
+		UNIXCB(skb).pid  = scm->pid;
+		UNIXCB(skb).cred = scm->cred;
+	}
 	UNIXCB(skb).fp = NULL;
 	if (scm->fp && send_fds)
 		err = unix_attach_fds(scm, skb);
@@ -1411,7 +1417,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int namelen = 0; /* fake GCC */
 	int err;
 	unsigned hash;
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	long timeo;
 	struct scm_cookie tmp_scm;
 	int max_level;
@@ -1452,7 +1458,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
-	err = unix_scm_to_skb(siocb->scm, skb, true);
+	err = unix_scm_to_skb(siocb->scm, skb, true, false);
 	if (err < 0)
 		goto out_free;
 	max_level = err + 1;
@@ -1548,7 +1554,7 @@ restart:
 	unix_state_unlock(other);
 	other->sk_data_ready(other, len);
 	sock_put(other);
-	scm_destroy(siocb->scm);
+	scm_release(siocb->scm);
 	return len;
 
 out_unlock:
@@ -1558,7 +1564,8 @@ out_free:
 out:
 	if (other)
 		sock_put(other);
-	scm_destroy(siocb->scm);
+	if (skb == NULL)
+		scm_destroy(siocb->scm);
 	return err;
 }
 
@@ -1570,7 +1577,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct sock *other = NULL;
 	int err, size;
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	int sent = 0;
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
@@ -1635,11 +1642,11 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		size = min_t(int, size, skb_tailroom(skb));
 
 
-		/* Only send the fds in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent);
+		/* Only send the fds and no ref to pid in the first buffer */
+		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
 		if (err < 0) {
 			kfree_skb(skb);
-			goto out_err;
+			goto out;
 		}
 		max_level = err + 1;
 		fds_sent = true;
@@ -1647,7 +1654,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
 		if (err) {
 			kfree_skb(skb);
-			goto out_err;
+			goto out;
 		}
 
 		unix_state_lock(other);
@@ -1664,7 +1671,10 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	scm_destroy(siocb->scm);
+	if (skb)
+		scm_release(siocb->scm);
+	else
+		scm_destroy(siocb->scm);
 	siocb->scm = NULL;
 
 	return sent;
@@ -1677,7 +1687,9 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	scm_destroy(siocb->scm);
+	if (skb == NULL)
+		scm_destroy(siocb->scm);
+out:
 	siocb->scm = NULL;
 	return sent ? : err;
 }
@@ -1781,7 +1793,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 		siocb->scm = &tmp_scm;
 		memset(&tmp_scm, 0, sizeof(tmp_scm));
 	}
-	scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+	scm_set_cred_noref(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
 	unix_set_secdata(siocb->scm, skb);
 
 	if (!(flags & MSG_PEEK)) {
@@ -1943,7 +1955,8 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			}
 		} else {
 			/* Copy credentials */
-			scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+			scm_set_cred_noref(siocb->scm, UNIXCB(skb).pid,
+					   UNIXCB(skb).cred);
 			check_creds = 1;
 		}
 



^ permalink raw reply related

* Re: Miscalculated TCP ACK ?
From: Gabriel Beddingfield @ 2011-08-23  0:41 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314038973.3777.0.camel@edumazet-laptop>

On 08/22/2011 01:49 PM, Eric Dumazet wrote:
> Le lundi 22 août 2011 à 13:05 -0500, Gabriel Beddingfield a écrit :
>> On Mon, Aug 22, 2011 at 12:19 PM, Daniel Baluta<daniel.baluta@gmail.com>  wrote:
>>>>> It appears to me that the ACK is off-by-one (should have been 27676).
>>>>> The server replies again with retransmissions.  This is with several
>>>>> kernel versions on local machines, ranging from 2.6.31 to 2.6.38.
>>>
>>>> Not a bug if frame carried a FIN
>>
>> The server packet had the FIN bit set.
>>
>
> then its fine.

Thank you!


>> I clipped the conversation to the vicinity of the problem.
>
> This sounds as a TSO bug, maybe driver/NIC related
>
> What is the NIC/driver used on your server ?

New news:  If I remove my router (Linksys WRTU54G-TM), the problem goes 
away.

Thank you guys for your helpful insights. :-)

-gabriel

^ permalink raw reply

* Re: [Patch] Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path
From: David Miller @ 2011-08-23  0:40 UTC (permalink / raw)
  To: tim.c.chen
  Cc: eric.dumazet, viro, ebiederm, ak, matt.fleming, linux-kernel,
	netdev
In-Reply-To: <1313772298.2576.2916.camel@schen9-DESK>

From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Fri, 19 Aug 2011 09:44:58 -0700

> -		/* Only send the fds in the first buffer */
> -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent);
> +		/* Only send the fds and no ref to pid in the first buffer */
> +		if (fds_sent)
> +			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, true);
> +		else
> +			err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, false);

Just set this final boolean the way the third argument is, there is no
reason to replicate the entire function call twice just to set the
final argument to what "fds_sent" evaluates to as a boolean.

		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);

ought to suffice.

^ permalink raw reply

* Re: pull request: wireless-next 2011-08-22
From: David Miller @ 2011-08-23  0:14 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev
In-Reply-To: <20110822195857.GL2534@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Mon, 22 Aug 2011 15:58:57 -0400

> This is a batch of updates intended for 3.2.
> 
> Highlights include a wireless extensions dependency cleanup amongst
> a number of drivers, and the inclusion of the ath6kl driver
> in drivers/net/wireless (along with its removal from staging).
> Also included are a number of updates to mac80211, bcma, mwifiex,
> libertas, ath9k, iwlagn, and a number of other drivers.
> 
> Please let me know if there are problems!

Pulled, thanks!

^ permalink raw reply

* Re: [PATCH] net: add APIs for manipulating skb page fragments.
From: David Miller @ 2011-08-23  0:00 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, linux-kernel, eric.dumazet, mirq-linux
In-Reply-To: <1313915037.17978.686.camel@dagon.hellion.org.uk>

From: Ian Campbell <Ian.Campbell@eu.citrix.com>
Date: Sun, 21 Aug 2011 09:23:57 +0100

> On Sun, 2011-08-21 at 01:31 +0100, David Miller wrote:
>> From: Ian Campbell <ian.campbell@citrix.com>
>> Date: Fri, 19 Aug 2011 17:25:00 +0100
>> 
>> > The primary aim is to add skb_frag_(ref|unref) in order to remove the use of
>> > bare get/put_page on SKB pages fragments and to isolate users from subsequent
>> > changes to the skb_frag_t data structure.
>> > 
>> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>> 
>> You're going to have to protect all of the things using the interfaces
>> from linux/dma-mapping.h with CONFIG_HAS_DMA otherwise it won't build
>> on platforms like S390.
> 
> s390 is one of the arches which I build tested and I initially saw this
> issue too. I did add CONFIG_HAS_DMA but it turns out that
> linux/dma-mapping.h takes care of this by including
> asm-generic/dma-mapping-broken.h for you so I removed the #ifdef again.
> The header defines the prototypes to allow building but causes a link
> time failure if anything actually uses the interfaces.
> 
> I just tested a s390x defconfig build again and it appears to be fine.

Thanks for explaining this.

Patch applied, thanks!

^ permalink raw reply

* Re: net_device leak after "bridge: allow creating bridge devices with netlink"?
From: David Miller @ 2011-08-22 23:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, JBeulich, netdev
In-Reply-To: <1314029159.2307.99.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 22 Aug 2011 18:05:59 +0200

> [PATCH] bridge: fix a possible net_device leak
> 
> Jan Beulich reported a possible net_device leak in bridge code after
> commit bb900b27a2f4 (bridge: allow creating bridge devices with netlink)
> 
> Reported-by: Jan Beulich <JBeulich@novell.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Acked-by: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [net-next 04/12] bna: TX Path and RX Path Changes
From: David Miller @ 2011-08-22 23:45 UTC (permalink / raw)
  To: rmody; +Cc: netdev, adapter_linux_open_src_team, gkaraje
In-Reply-To: <1314052869-7407-5-git-send-email-rmody@brocade.com>

From: Rasesh Mody <rmody@brocade.com>
Date: Mon, 22 Aug 2011 15:41:01 -0700

> +		if (unlikely(skb_is_gso(skb) > netdev->mtu)) {

skb_is_gso() is mean to be a boolean test, it happens to return
the value of ->gso_size as it's implementation but that's not
a good reason for you to abuse this interface in this way.

If we every fix skb_is_gso() to return a true 'bool' your driver
will break.

Access ->gso_size directly if that's what you're interested in,
that way it's future proof and won't break in the future.

^ permalink raw reply

* Re: [net-next 03/10] ixgbe: Drop the TX work limit and instead just leave it to budget
From: David Miller @ 2011-08-22 23:40 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: bhutchings, jeffrey.t.kirsher, netdev, gospo
In-Reply-To: <4E52DEEF.40504@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Mon, 22 Aug 2011 15:57:51 -0700

> The problem was occurring even without large rings.
> I was seeing issues with rings just 256 descriptors in size.

And the default in the ixgbe driver is 512 entries which I think
itself is quite excessive.  Something like 128 is more in line with
what I'd call a sane default.

So the only side effect of your change is to decrease the TX quota to
64 (the default NAPI quota) from it's current value of 512
(IXGBE_DEFAULT_TXD).

Talking about the existing code, I can't even see how the current
driver private TX quota can trigger except in the most extreme cases.
This is because the quota is set the same as the size you're setting
the TX ring to.

> The problem seemed to be that the TX cleanup being a multiple of
> budget was allowing one CPU to overwhelm the other and the fact that
> the TX was essentially unbounded was just allowing the issue to
> feedback on itself.

I still don't understand what issue you could even be running into.

On each CPU we round robin against all NAPI requestors for that CPU.

In your routing test setup, we should have one cpu doing the RX and
another different cpu doing TX.

Therefore if the TX cpu simply spins in a loop doing nothing but TX
reclaim work it should not really matter.

And if you hit the TX budget on the TX cpu, it's just going to come
right back into the ixgbe NAPI handler and thus the TX reclaim
processing not even a dozen cycles later.

The only effect is to have us go through the whole function call
sequence and data structure setup into local variables more than you
would be doing so before.

> In addition since the RX and TX workload was balanced it kept both
> locked into polling while the CPU was saturated instead of allowing
> the TX to become interrupt driven.  In addition since the TX was
> working on the same budget as the RX the number of SKBs freed up in
> the TX path would match the number consumed when being reallocated
> on the RX path.

So the only conclusion I can come to is that what happens is we're now
executing what are essentially wasted cpu cycles and this takes us
over the threshold such that we poll more and take interrupts less.
And this improves performance.

That's pretty unwise if you ask me, we should do something useful with
cpu cycles instead of wasting them merely to make us poll more.

> The problem seemed to be present as long as I allowed the TX budget to
> be a multiple of the RX budget.  The easiest way to keep things
> balanced and avoid allowing the TX from one CPU to overwhelm the RX on
> another was just to keep the budgets equal.

You're executing 10 or 20 cpu cycles after every 64 TX reclaims,
that's the only effect of these changes.  That's not even long enough
for a cache line to transfer between two cpus.

^ permalink raw reply

* Re: [net-next 03/10] ixgbe: Drop the TX work limit and instead just leave it to budget
From: Alexander Duyck @ 2011-08-22 22:57 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, jeffrey.t.kirsher, netdev, gospo
In-Reply-To: <20110822.135644.683110224886588181.davem@davemloft.net>

On 08/22/2011 01:56 PM, David Miller wrote:
> From: Alexander Duyck<alexander.h.duyck@intel.com>
> Date: Mon, 22 Aug 2011 10:29:51 -0700
>
>> The only problem I was seeing with that was that in certain cases it
>> seemed like the TX cleanup could consume enough CPU time to cause
>> pretty significant delays in processing the RX cleanup.  This in turn
>> was causing single queue bi-directional routing tests to come out
>> pretty unbalanced since what seemed to happen is that one CPUs RX work
>> would overwhelm the other CPU with the TX processing resulting in an
>> unbalanced flow that was something like a 60/40 split between the
>> upstream and downstream throughput.
> But the problem is that now you're applying the budget to two operations
> that have much differing costs.  Freeing up a TX ring packet is probably
> on the order of 1/10th the cost of processing an incoming RX ring frame.
>
> I've advocated to not apply the budget at all to TX ring processing.
I fully understand that the TX path is much cheaper than the RX path.  
One step I have taken in all of this code is that the TX path only 
counts SKBs cleaned, it doesn't count descriptors.  So a single 
descriptor 60byte transmit will cost the same as a 64K 18 descriptor 
TSO.  All I am really counting is the number of times I have called 
dev_kfree_skb_any();
> I can see your delimma with respect to RX ring processing being delayed,
> but if that's really happening you can consider whether the TX ring is
> simply too large.
The problem was occurring even without large rings.  I was seeing issues 
with rings just 256 descriptors in size.  The problem seemed to be that 
the TX cleanup being a multiple of budget was allowing one CPU to 
overwhelm the other and the fact that the TX was essentially unbounded 
was just allowing the issue to feedback on itself.

In the routing test case I was actually seeing significant advantages to 
this approach as we were essentially cleaning just the right number of 
buffers to make room for the next set of transmits when the RX cleanup 
came though.  In addition since the RX and TX workload was balanced it 
kept both locked into polling while the CPU was saturated instead of 
allowing the TX to become interrupt driven.  In addition since the TX 
was working on the same budget as the RX the number of SKBs freed up in 
the TX path would match the number consumed when being reallocated on 
the RX path.

> In any event can you try something like dampening the cost applied to
> budget for TX work (1/2, 1/4, etc.)?  Because as far as I can tell, if
> you are really hitting the budget limit on TX then you won't be doing
> any RX work on that device until a future NAPI round that depletes the
> TX ring work without going over the budget.
The problem seemed to be present as long as I allowed the TX budget to 
be a multiple of the RX budget.  The easiest way to keep things balanced 
and avoid allowing the TX from one CPU to overwhelm the RX on another 
was just to keep the budgets equal.

I'm a bit confused by this last comment.  The full budget is used for TX 
and RX, it isn't divided.  I do a budget worth of TX cleanup and a 
budget worth of RX cleanup within the ixgbe_poll routine, and if either 
of them consume their full budget then I return the budget value as the 
work done.

If you are referring to the case where two devices are sharing the CPU 
then I would suspect this might lead to faster consumption of the 
netdev_budget, but other than that I don't see any starvation issues for 
RX or TX.

Thanks,

Alex

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox