* Re: [PATCH net-next-2.6] drivers: net: use skb_headlen()
From: David Miller @ 2010-04-14 23:15 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <20100414.155949.210580528.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Wed, 14 Apr 2010 15:59:49 -0700 (PDT)
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 13 Apr 2010 22:48:20 +0200
>
>> replaces (skb->len - skb->data_len) occurrences by skb_headlen(skb)
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, thanks Eric.
Eric, there were several typos in this patch and it is clear
that even e1000 wasn't build tested.
You used the mystical "skb_headsize()" instead of "skb_headlen()"
in several locations.
I fixed this up, but please build test your changes no matter
how seemingly trivial next time.
Thanks.
^ permalink raw reply
* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-14 23:13 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, netdev
In-Reply-To: <1271258409.16881.1701.camel@edumazet-laptop>
On Wed, Apr 14, 2010 at 11:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
>> batch skb dequeueing from softnet input_pkt_queue
>>
>> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
>> contention and irq disabling/enabling.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>
> Adding stop_machine() with no explanation ?
stop_machine() is added to flush the processing_queue. Because the old
flush_backlog() runs in IRQ context, it can't touch the things, which
are only valid in softirq context.
>
> No ack from my previous comments, suggestions, and still same logic ?
In this patch, the volatile variable flush_processing_queue is
removed, and the corresponding lines are removed too.
Oh, I should splice the old message back, as stop_machine is used
instead. So, the processing queue will be removed from softnet_data,
and a single int counter will be used instead to count the packets
which are being processing.
one issue you concern is potential cache miss when summing in enqueue
function. It won't happen all the time, in fact, I think it should
happen seldom, and it is the responsibility of hardware to cache the
data frequently used.
>
> Are we supposed to read patch, test it, make some benches, correct bugs,
> say Amen ?
>
OK, I'll test it.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* Re: [BUG net-next-2.6] fib: Some rcu warning
From: David Miller @ 2010-04-14 23:12 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, paulmck
In-Reply-To: <1271275859.16881.1753.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 22:10:59 +0200
> [PATCH] fib: suppress lockdep-RCU false positive in FIB trie.
>
> Followup of commit 634a4b20
>
> Allow tnode_get_child_rcu() to be called either under rcu_read_lock()
> protection or with RTNL held.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Since both your and Paul both posted an identical patch I'll
just add both of your signoffs to this one :-)
Since 634a4b20 is in net-2.6, this should probably go into
net-2.6 as well.
Thanks guys.
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 23:02 UTC (permalink / raw)
To: Changli Gao
Cc: Eric Dumazet, Tom Herbert, hadi, netdev, robert, David Miller,
Andi Kleen
In-Reply-To: <v2s412e6f7f1004141551u5764604h3ab092ee9e70a650@mail.gmail.com>
On Thu, 15 Apr 2010 06:51:29 +0800
Changli Gao <xiaosuo@gmail.com> wrote:
> On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > RPS can be tuned (Changli wants a finer tuning...), it would be
> > intereting to tune multiqueue devices too. I dont know if its possible
> > right now.
> >
> > On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> > 10Gigabit has 16 queues. It might be good to use less queues according
> > to your results on some workloads, and eventually use RPS on a second
> > layering.
>
> My idear is: run a daemon in userland to monitor the softnet
> statistics, and tun the RPS setting if necessary. It seems that the
> current softnet statistics data isn't correct.
>
> Long time ago, I did a test, and the conclution was
> call_function_single IPI was more expensive than resched IPI, so I
> moved to kernel thread from softirq for packet processing. I'll redo
> the test later.
>
The big thing is data, data, data... Performance can only be examined
with real hard data with multiple different kind of hardware. Also, check for
regressions in lmbench and TPC benchmarks. Yes this is hard, but papers
on this would allow for rational rather than speculative choices.
Adding more tuning knobs is not the answer unless you can show when
the tuning helps.
--
^ permalink raw reply
* Re: [net-next-2.6 PATCH] ixgbe: fix bug with vlan strip in promsic mode
From: David Miller @ 2010-04-14 23:04 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, jesse.brandeburg
In-Reply-To: <20100414221015.11752.42490.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 14 Apr 2010 15:11:12 -0700
> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
>
> The ixgbe driver was setting up 82598 hardware correctly, so that
> when promiscuous mode was enabled hardware stripping was turned
> off. But on 82599 the logic to disable/enable hardware stripping
> is different, and the code was not updated correctly when the
> hardware vlan stripping was enabled as default.
>
> This change comprises the creation of two new helper functions
> and calling them from the right locations to disable and enable
> hardware stripping of vlan tags at appropriate times.
>
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied, for real this time! :-)
Thanks.
^ permalink raw reply
* Re: [PATCH net-next-2.6] drivers: net: use skb_headlen()
From: David Miller @ 2010-04-14 22:59 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271191700.16881.574.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 13 Apr 2010 22:48:20 +0200
> replaces (skb->len - skb->data_len) occurrences by skb_headlen(skb)
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [GIT PULL] vhost-net fix for 2.6.34-rc4
From: David Miller @ 2010-04-14 22:57 UTC (permalink / raw)
To: mst; +Cc: netdev, hch
In-Reply-To: <20100414141733.GA5370@redhat.com>
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 14 Apr 2010 17:17:33 +0300
> Christoph Hellwig (1):
> vhost: fix sparse warnings
Do the sparse warnings show actual real bugs, or are we just
adding annotations to make the tool happy?
Unless the bugs being fixed are earth shattering, this belongs
in net-next-2.6 not net-2.6
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-14 22:51 UTC (permalink / raw)
To: Eric Dumazet
Cc: Tom Herbert, hadi, Stephen Hemminger, netdev, robert,
David Miller, Andi Kleen
In-Reply-To: <1271278661.16881.1761.camel@edumazet-laptop>
On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> RPS can be tuned (Changli wants a finer tuning...), it would be
> intereting to tune multiqueue devices too. I dont know if its possible
> right now.
>
> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.
My idear is: run a daemon in userland to monitor the softnet
statistics, and tun the RPS setting if necessary. It seems that the
current softnet statistics data isn't correct.
Long time ago, I did a test, and the conclution was
call_function_single IPI was more expensive than resched IPI, so I
moved to kernel thread from softirq for packet processing. I'll redo
the test later.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* Re: hdlc_ppp: why no detach()?
From: Krzysztof Halasa @ 2010-04-14 22:48 UTC (permalink / raw)
To: Michael Barkowski
Cc: David S. Miller, Julia Lawall, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <4BC32B00.1030600@ruggedcom.com>
Hello Michael,
Michael Barkowski <michaelbarkowski@ruggedcom.com> writes:
> I am looking at your hdlc_ppp code and I don't understand: why is there
> not the equivalent of fr_detach() in there?
I assume you mean .detach = fr_destroy(). It's used only to kill
subdevices, i.e. it has nothing to do with the interface being up/down.
> pc8300_drv:cpc_remove_one() frees netdevs quite confidently but I wonder
> how it can be so sure that there are not skbs in hdlc_ppp's tx_queue
> associated with those devices before freeing them....q
Theoretically all paths adding skbs to the tx_queue should send them out
before returning (possibly also on behalf of other devices). However I
wonder if it's the case. Let's see: Only ppp_tx_cp() adds to the queue
directly:
- ppp_rx() (calls ppp_tx_flush())
- ppp_timer (calls ppp_tx_flush())
- ppp_cp_event():
- ppp_cp_parse_cr() (calls ppp_tx_flush())
- ppp_stop() calls ppp_cp_event(), but it won't queue any skb, it only
marks the connection as closed and does the same to IPCP and IPV6CP.
This means the problematic part is ppp_start() which calls
ppp_cp_event(LCP, START) = IRC | SCR | 3 meaning
Initialize-Restart-Count, Send-Configure-Request and change state to
REQ_SENT. This causes two problems:
1. The SCR packet will be delayed by 2 seconds (both first and second
SCR will be sent the same time). Perhaps we delay only a little
(instead of full 2 seconds) and only then send the initial packet.
2. (as you noted) the skb will be added to tx_queue and left there. If
we happen to "ifconfig up" and "rmmod driver" before receiving any
packet and before ppp->req_timeout (2 seconds) and before any other
PPP interface does the same, we will eventually get skb with invalid
->dev. This is simple to drain in .close (detach is a wrong place
since it may be called long after the interface is deactivated, there
is no need to delay it past .close). The fix for #1 will already fix
#2, but the redundant safety doesn't cost us anything.
Thanks for noting the problem, I'll post a patch shortly.
Also it seems the timeouts etc. should be configurable. ATM we're only
fixing bugs, good.
--
Krzysztof Halasa
^ permalink raw reply
* [net-next-2.6 PATCH] ixgbe: fix bug with vlan strip in promsic mode
From: Jeff Kirsher @ 2010-04-14 22:11 UTC (permalink / raw)
To: davem; +Cc: netdev, gospo, Jesse Brandeburg, Jeff Kirsher
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
The ixgbe driver was setting up 82598 hardware correctly, so that
when promiscuous mode was enabled hardware stripping was turned
off. But on 82599 the logic to disable/enable hardware stripping
is different, and the code was not updated correctly when the
hardware vlan stripping was enabled as default.
This change comprises the creation of two new helper functions
and calling them from the right locations to disable and enable
hardware stripping of vlan tags at appropriate times.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ixgbe/ixgbe_main.c | 115 +++++++++++++++++++++++++---------------
1 files changed, 72 insertions(+), 43 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..a98ff0e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2482,12 +2482,74 @@ static void ixgbe_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
hw->mac.ops.set_vfta(&adapter->hw, vid, pool_ndx, false);
}
+/**
+ * ixgbe_vlan_filter_disable - helper to disable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_disable(struct ixgbe_adapter *adapter)
+{
+ struct ixgbe_hw *hw = &adapter->hw;
+ u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+ int i, j;
+
+ switch (hw->mac.type) {
+ case ixgbe_mac_82598EB:
+ vlnctrl &= ~(IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE);
+ vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+ IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+ break;
+ case ixgbe_mac_82599EB:
+ vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+ vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+ IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+ for (i = 0; i < adapter->num_rx_queues; i++) {
+ j = adapter->rx_ring[i]->reg_idx;
+ vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+ vlnctrl &= ~IXGBE_RXDCTL_VME;
+ IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+ }
+ break;
+ default:
+ break;
+ }
+}
+
+/**
+ * ixgbe_vlan_filter_enable - helper to enable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_enable(struct ixgbe_adapter *adapter)
+{
+ struct ixgbe_hw *hw = &adapter->hw;
+ u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+ int i, j;
+
+ switch (hw->mac.type) {
+ case ixgbe_mac_82598EB:
+ vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
+ vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+ IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+ break;
+ case ixgbe_mac_82599EB:
+ vlnctrl |= IXGBE_VLNCTRL_VFE;
+ vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+ IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+ for (i = 0; i < adapter->num_rx_queues; i++) {
+ j = adapter->rx_ring[i]->reg_idx;
+ vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+ vlnctrl |= IXGBE_RXDCTL_VME;
+ IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+ }
+ break;
+ default:
+ break;
+ }
+}
+
static void ixgbe_vlan_rx_register(struct net_device *netdev,
struct vlan_group *grp)
{
struct ixgbe_adapter *adapter = netdev_priv(netdev);
- u32 ctrl;
- int i, j;
if (!test_bit(__IXGBE_DOWN, &adapter->state))
ixgbe_irq_disable(adapter);
@@ -2498,25 +2560,7 @@ static void ixgbe_vlan_rx_register(struct net_device *netdev,
* still receive traffic from a DCB-enabled host even if we're
* not in DCB mode.
*/
- ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
-
- /* Disable CFI check */
- ctrl &= ~IXGBE_VLNCTRL_CFIEN;
-
- /* enable VLAN tag stripping */
- if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
- ctrl |= IXGBE_VLNCTRL_VME;
- } else if (adapter->hw.mac.type == ixgbe_mac_82599EB) {
- for (i = 0; i < adapter->num_rx_queues; i++) {
- u32 ctrl;
- j = adapter->rx_ring[i]->reg_idx;
- ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_RXDCTL(j));
- ctrl |= IXGBE_RXDCTL_VME;
- IXGBE_WRITE_REG(&adapter->hw, IXGBE_RXDCTL(j), ctrl);
- }
- }
-
- IXGBE_WRITE_REG(&adapter->hw, IXGBE_VLNCTRL, ctrl);
+ ixgbe_vlan_filter_enable(adapter);
ixgbe_vlan_rx_add_vid(netdev, 0);
@@ -2551,17 +2595,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
{
struct ixgbe_adapter *adapter = netdev_priv(netdev);
struct ixgbe_hw *hw = &adapter->hw;
- u32 fctrl, vlnctrl;
+ u32 fctrl;
/* Check for Promiscuous and All Multicast modes */
fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
- vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
if (netdev->flags & IFF_PROMISC) {
hw->addr_ctrl.user_set_promisc = 1;
fctrl |= (IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
- vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+ /* don't hardware filter vlans in promisc mode */
+ ixgbe_vlan_filter_disable(adapter);
} else {
if (netdev->flags & IFF_ALLMULTI) {
fctrl |= IXGBE_FCTRL_MPE;
@@ -2569,12 +2613,11 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
} else {
fctrl &= ~(IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
}
- vlnctrl |= IXGBE_VLNCTRL_VFE;
+ ixgbe_vlan_filter_enable(adapter);
hw->addr_ctrl.user_set_promisc = 0;
}
IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
- IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
/* reprogram secondary unicast list */
hw->mac.ops.update_uc_addr_list(hw, netdev);
@@ -2641,7 +2684,7 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
{
struct ixgbe_hw *hw = &adapter->hw;
- u32 txdctl, vlnctrl;
+ u32 txdctl;
int i, j;
ixgbe_dcb_check_config(&adapter->dcb_cfg);
@@ -2659,22 +2702,8 @@ static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
}
/* Enable VLAN tag insert/strip */
- vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
- if (hw->mac.type == ixgbe_mac_82598EB) {
- vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
- vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
- IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
- } else if (hw->mac.type == ixgbe_mac_82599EB) {
- vlnctrl |= IXGBE_VLNCTRL_VFE;
- vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
- IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
- for (i = 0; i < adapter->num_rx_queues; i++) {
- j = adapter->rx_ring[i]->reg_idx;
- vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
- vlnctrl |= IXGBE_RXDCTL_VME;
- IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
- }
- }
+ ixgbe_vlan_filter_enable(adapter);
+
hw->mac.ops.set_vfta(&adapter->hw, 0, 0, true);
}
^ permalink raw reply related
* Re: HTB - What's the minimal value for 'rate' parameter?
From: Jarek Poplawski @ 2010-04-14 21:45 UTC (permalink / raw)
To: Antonio Almeida; +Cc: netdev, kaber, davem, devik
In-Reply-To: <p2q298f5c051004140322hc54f5315taa5c7d85cb74f06@mail.gmail.com>
Antonio Almeida wrote, On 04/14/2010 12:22 PM:
> What do you mean with "1:2 has grandchildren with overflown rate tables"?
> I couldn't understand your idea. Is there any mistake in the
> configuration I sent?
> How would you set rates for this particular example?
class htb 1:1 root rate 1000Mbit ceil 1000Mbit
class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit
class htb 1:10 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:11 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:101 parent 1:10 prio 3 rate 8bit ceil 4096Kbit
class htb 1:111 parent 1:11 prio 3 rate 8bit ceil 4096Kbit
Classes 1:101 and 1:111 have too low rates, which causes wrong (overflowed!)
values in their rate tables, so their rates could be practically
uncontrollable. They are limited by their ceils instead, so something like:
class htb 1:101 parent 1:10 leaf 101: prio 3 rate 4096Kbit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 4096Kbit ceil 4096Kbit
But then their guaranteed rates are higher than their parents, and the
sum is higher than grandparent's rate, which means the config is wrong.
(You have to control these sums - HTB doesn't.)
As I wrote before, the minimal (overflow safe) rate depends on max
packet size, and for 1500 byte it would be something around:
1500b/2min, so if your clients can wait so long, try this:
class htb 1:101 parent 1:10 leaf 101: prio 3 rate 100bit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 100bit ceil 4096Kbit
Regards,
Jarek P.
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 20:57 UTC (permalink / raw)
To: Tom Herbert
Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <z2y65634d661004141345o9bdcf759sf266866931823baf@mail.gmail.com>
Le mercredi 14 avril 2010 à 13:45 -0700, Tom Herbert a écrit :
> > Only if more than one flow is involved.
> >
> > And if you have many flows, chance they will spread several queues...
> >
>
> But use too many queues and the efficiency of NAPI drops and cost of
> device interrupts becomes dominant, so that the overhead from
> additional hard interrupts can surpass the overhead of doing RPS and
> the IPIs. I believe we are seeing this is in some of our results
> which shows that a combination of multi-queue and RPS can be better
> than just multi-queue (see rps changelog). Again, I'm not claiming
> that is generally true, but there are a lot of factors to consider.
> --
RPS can be tuned (Changli wants a finer tuning...), it would be
intereting to tune multiqueue devices too. I dont know if its possible
right now.
On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
10Gigabit has 16 queues. It might be good to use less queues according
to your results on some workloads, and eventually use RPS on a second
layering.
^ permalink raw reply
* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:52 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414204003.GC11321@redhat.com>
On Wednesday 14 April 2010 22:40:03 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> > Well, if the guest not only wants to send data but also receive
> > frames coming from other machines, they need to get from the kernel
> > into qemu, and the only way I can see for doing that is to read
> > from this device if there is no vhost support around on the new
> > machine.
> >
> > Maybe we're talking about different things here.
>
> mpassthrough is currently useless without vhost.
> If the new machine has no vhost, it can't use mpassthrough :)
Ok. Is that a planned feature though? vhost is currently limited
to guests with a virtio-net driver and even if you extend it to other
guest emulations, it will probably always be a subset of the qemu
supported drivers, but it may be useful to support zero-copy on other
drivers as well.
Arnd
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-14 20:45 UTC (permalink / raw)
To: Eric Dumazet
Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <1271276855.16881.1756.camel@edumazet-laptop>
> Only if more than one flow is involved.
>
> And if you have many flows, chance they will spread several queues...
>
But use too many queues and the efficiency of NAPI drops and cost of
device interrupts becomes dominant, so that the overhead from
additional hard interrupts can surpass the overhead of doing RPS and
the IPIs. I believe we are seeing this is in some of our results
which shows that a combination of multi-queue and RPS can be better
than just multi-queue (see rps changelog). Again, I'm not claiming
that is generally true, but there are a lot of factors to consider.
^ permalink raw reply
* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 20:40 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004142239.50299.arnd@arndb.de>
On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> > On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > >
> > > > > > qemu needs the ability to inject raw packets into device
> > > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > > >
> > > > > Ok, but since there is only a write callback and no read, it won't
> > > > > actually be able to do this with the current code, right?
> > > >
> > > > I think it'll work as is, with vhost qemu only ever writes,
> > > > never reads from device. We'll also never need GSO etc
> > > > which is a large part of what tap does (and macvtap will
> > > > have to do).
> > >
> > > Ah, I see. I didn't realize that qemu needs to write to the
> > > device even if vhost is used. But for the case of migration to
> > > another machine without vhost, wouldn't qemu also need to read?
> >
> > Not that I know. Why?
>
> Well, if the guest not only wants to send data but also receive
> frames coming from other machines, they need to get from the kernel
> into qemu, and the only way I can see for doing that is to read
> from this device if there is no vhost support around on the new
> machine.
>
> Maybe we're talking about different things here.
>
> Arnd
mpassthrough is currently useless without vhost.
If the new machine has no vhost, it can't use mpassthrough :)
--
MST
^ permalink raw reply
* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:39 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414203142.GA11321@redhat.com>
On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > >
> > > > > qemu needs the ability to inject raw packets into device
> > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > >
> > > > Ok, but since there is only a write callback and no read, it won't
> > > > actually be able to do this with the current code, right?
> > >
> > > I think it'll work as is, with vhost qemu only ever writes,
> > > never reads from device. We'll also never need GSO etc
> > > which is a large part of what tap does (and macvtap will
> > > have to do).
> >
> > Ah, I see. I didn't realize that qemu needs to write to the
> > device even if vhost is used. But for the case of migration to
> > another machine without vhost, wouldn't qemu also need to read?
>
> Not that I know. Why?
Well, if the guest not only wants to send data but also receive
frames coming from other machines, they need to get from the kernel
into qemu, and the only way I can see for doing that is to read
from this device if there is no vhost support around on the new
machine.
Maybe we're talking about different things here.
Arnd
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 20:38 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <1271276855.16881.1756.camel@edumazet-laptop>
On Wed, 2010-04-14 at 22:27 +0200, Eric Dumazet wrote:
> Only if more than one flow is involved.
>
> And if you have many flows, chance they will spread several queues...
Over long period of time measurement, true; but even with > 1 flows, it
is possible that one flow is more active/intense than others (rtp vs
some bulk file transfer) or more processor intensive than others(eg
ipsec vs clear text) etc.
BTW: just poking at intel doc on turbo boost and it seems the max a
core can steal from others is 400Mhz; so a core can go from 2.8Ghz
to 3.2Ghz. I am sure theres a lot of interesting dynamics from this ;->
I think i will try turning this thing in my tests since i have an i7.
cheers,
jamal
^ permalink raw reply
* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 20:31 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141835.57673.arnd@arndb.de>
On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > >
> > > > qemu needs the ability to inject raw packets into device
> > > > from userspace, bypassing vhost/virtio (for live migration).
> > >
> > > Ok, but since there is only a write callback and no read, it won't
> > > actually be able to do this with the current code, right?
> >
> > I think it'll work as is, with vhost qemu only ever writes,
> > never reads from device. We'll also never need GSO etc
> > which is a large part of what tap does (and macvtap will
> > have to do).
>
> Ah, I see. I didn't realize that qemu needs to write to the
> device even if vhost is used. But for the case of migration to
> another machine without vhost, wouldn't qemu also need to read?
Not that I know. Why?
> > > Moreover, it seems weird to have a new type of interface here that
> > > duplicates tap/macvtap with less functionality. Coming back
> > > to your original comment, this means that while mpassthru is currently
> > > not duplicating the actual code from macvtap, it would need to do
> > > exactly that to get the qemu interface right!
> > >
> > I don't think so, see above. anyway, both can reuse tun.c :)
>
> There is one significant difference between macvtap/mpassthru and
> tun/tap in that the directions are reversed. While macvtap and
> mpassthru forward data from write into dev_queue_xmit and from
> skb_receive into read, tun/tap forwards data from write into
> skb_receive and from start_xmit into read.
>
> Also, I'm not really objecting to duplicating code between
> macvtap and mpassthru, as the implementation can always be merged.
> My main objection is instead to having two different _user_interfaces_
> for doing the same thing.
>
> Arnd
They *could* do the same thing :)
--
MST
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-14 20:34 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>
> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.
In addition to Turbo using less cores can also help to save power.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply
* Re: [BUG net-next-2.6] fib: Some rcu warning
From: Paul E. McKenney @ 2010-04-14 20:33 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271274844.16881.1743.camel@edumazet-laptop>
On Wed, Apr 14, 2010 at 09:54:04PM +0200, Eric Dumazet wrote:
> [ 27.756998] IPv4 FIB: Using LC-trie version 0.409
> [ 27.757121]
> [ 27.757121] ===================================================
> [ 27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
> [ 27.757285] ---------------------------------------------------
> [ 27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
> without protection!
Looks like this needs to allow for RTNL. Please see below for a
patch. Other places in this file might also need this.
Thanx, Paul
> [ 27.757417]
> [ 27.757417] other info that might help us debug this:
> [ 27.757418]
> [ 27.757569]
> [ 27.757570] rcu_scheduler_active = 1, debug_locks = 0
> [ 27.757674] 2 locks held by ip/5686:
> [ 27.757727] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
> rtnl_lock+0x17/0x20
> [ 27.757936] #1: ((inetaddr_chain).rwsem){.+.+.+}, at:
> [<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
> [ 27.758148]
> [ 27.758149] stack backtrace:
> [ 27.758249] Pid: 5686, comm: ip Not tainted
> 2.6.34-rc3-03164-gb4bf665-dirty #10
> [ 27.758323] Call Trace:
> [ 27.758377] [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
> [ 27.758437] [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
> [ 27.758495] [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
> [ 27.758552] [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
> [ 27.758609] [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
> [ 27.758667] [<ffffffff813aa561>] fib_magic+0x111/0x120
> [ 27.758723] [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
> [ 27.758780] [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
> [ 27.758838] [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [ 27.758896] [<ffffffff810654bd>] __blocking_notifier_call_chain
> +0x5d/0x90
> [ 27.758956] [<ffffffff81065506>] blocking_notifier_call_chain
> +0x16/0x20
> [ 27.759016] [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
> [ 27.759089] [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
> [ 27.759146] [<ffffffff813a2140>] inetdev_event+0x400/0x430
> [ 27.759204] [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
> [ 27.759263] [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
> [ 27.759320] [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [ 27.759378] [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
> [ 27.759436] [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
> [ 27.761102] [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
> [ 27.761161] [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
> [ 27.761219] [<ffffffff81340005>] dev_change_flags+0x45/0x70
> [ 27.761278] [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
> [ 27.761336] [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
> [ 27.761392] [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
> [ 27.761449] [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
> [ 27.761507] [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
> [ 27.761567] [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
> [ 27.761624] [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
> [ 27.761683] [<ffffffff810f7104>] ? fget_light+0x174/0x360
> [ 27.761740] [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
> [ 27.761798] [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
> [ 27.761856] [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
> [ 27.761915] [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
> [ 27.761972] [<ffffffff81064703>] ? up_read+0x23/0x40
> [ 27.762028] [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
> [ 27.762087] [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
> [ 27.762145] [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
> [ 27.762202] [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
> [ 27.762259] [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
> [ 27.762331] [<ffffffff8107292d>] ? trace_hardirqs_on_caller
> +0x10d/0x190
> [ 27.762391] [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
> +0x3a/0x3c
> [ 27.762450] [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
> [ 27.762507] [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
commit 3f69d3dbdd2468f13591b676ca36b3d61d1a1294
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Wed Apr 14 13:30:13 2010 -0700
net: fix RCU lockdep splat in tnode_get_child_rcu()
Located-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index af5d897..45bda1f 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -208,7 +208,8 @@ static inline struct node *tnode_get_child_rcu(struct tnode *tn, unsigned int i)
{
struct node *ret = tnode_get_child(tn, i);
- return rcu_dereference(ret);
+ return rcu_dereference_check(ret,
+ rcu_read_lock_held() || rtnl_is_locked());
}
static inline int tnode_child_length(const struct tnode *tn)
^ permalink raw reply related
* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 20:27 UTC (permalink / raw)
To: hadi
Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <1271276568.4567.59.camel@bigi>
Le mercredi 14 avril 2010 à 16:22 -0400, jamal a écrit :
> On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:
>
> > RPS might also interact with the core turbo boost functionality on Intel chips.
> > Newer chips will make a single core faster if other core can be kept idle.
>
> how well does it work with Linux? Sounds like all i need to do is turn
> on some BIOS feature.
> One of the negatives with multiqueue nics is because the core selection
> is static, you could end up overloading one core while others stay idle.
> This seems to steal cycle capacity from the idle cores and gives it to
> the busy cpus. nice. So i see it as a boost to multiqueue.
Only if more than one flow is involved.
And if you have many flows, chance they will spread several queues...
^ permalink raw reply
* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 20:22 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
Changli Gao, Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>
On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:
> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.
how well does it work with Linux? Sounds like all i need to do is turn
on some BIOS feature.
One of the negatives with multiqueue nics is because the core selection
is static, you could end up overloading one core while others stay idle.
This seems to steal cycle capacity from the idle cores and gives it to
the busy cpus. nice. So i see it as a boost to multiqueue.
cheers,
jamal
^ permalink raw reply
* Re: [BUG net-next-2.6] fib: Some rcu warning
From: Eric Dumazet @ 2010-04-14 20:10 UTC (permalink / raw)
To: netdev, David Miller; +Cc: Paul E. McKenney
In-Reply-To: <1271274844.16881.1743.camel@edumazet-laptop>
Le mercredi 14 avril 2010 à 21:54 +0200, Eric Dumazet a écrit :
> [ 27.756998] IPv4 FIB: Using LC-trie version 0.409
> [ 27.757121]
> [ 27.757121] ===================================================
> [ 27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
> [ 27.757285] ---------------------------------------------------
> [ 27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
> without protection!
> [ 27.757417]
> [ 27.757417] other info that might help us debug this:
> [ 27.757418]
> [ 27.757569]
> [ 27.757570] rcu_scheduler_active = 1, debug_locks = 0
> [ 27.757674] 2 locks held by ip/5686:
> [ 27.757727] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
> rtnl_lock+0x17/0x20
> [ 27.757936] #1: ((inetaddr_chain).rwsem){.+.+.+}, at:
> [<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
> [ 27.758148]
> [ 27.758149] stack backtrace:
> [ 27.758249] Pid: 5686, comm: ip Not tainted
> 2.6.34-rc3-03164-gb4bf665-dirty #10
> [ 27.758323] Call Trace:
> [ 27.758377] [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
> [ 27.758437] [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
> [ 27.758495] [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
> [ 27.758552] [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
> [ 27.758609] [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
> [ 27.758667] [<ffffffff813aa561>] fib_magic+0x111/0x120
> [ 27.758723] [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
> [ 27.758780] [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
> [ 27.758838] [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [ 27.758896] [<ffffffff810654bd>] __blocking_notifier_call_chain
> +0x5d/0x90
> [ 27.758956] [<ffffffff81065506>] blocking_notifier_call_chain
> +0x16/0x20
> [ 27.759016] [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
> [ 27.759089] [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
> [ 27.759146] [<ffffffff813a2140>] inetdev_event+0x400/0x430
> [ 27.759204] [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
> [ 27.759263] [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
> [ 27.759320] [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [ 27.759378] [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
> [ 27.759436] [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
> [ 27.761102] [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
> [ 27.761161] [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
> [ 27.761219] [<ffffffff81340005>] dev_change_flags+0x45/0x70
> [ 27.761278] [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
> [ 27.761336] [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
> [ 27.761392] [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
> [ 27.761449] [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
> [ 27.761507] [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
> [ 27.761567] [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
> [ 27.761624] [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
> [ 27.761683] [<ffffffff810f7104>] ? fget_light+0x174/0x360
> [ 27.761740] [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
> [ 27.761798] [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
> [ 27.761856] [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
> [ 27.761915] [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
> [ 27.761972] [<ffffffff81064703>] ? up_read+0x23/0x40
> [ 27.762028] [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
> [ 27.762087] [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
> [ 27.762145] [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
> [ 27.762202] [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
> [ 27.762259] [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
> [ 27.762331] [<ffffffff8107292d>] ? trace_hardirqs_on_caller
> +0x10d/0x190
> [ 27.762391] [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
> +0x3a/0x3c
> [ 27.762450] [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
> [ 27.762507] [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>
>
Here is the patch to remove this problem.
[PATCH] fib: suppress lockdep-RCU false positive in FIB trie.
Followup of commit 634a4b20
Allow tnode_get_child_rcu() to be called either under rcu_read_lock()
protection or with RTNL held.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 59a8387..c98f115 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -209,7 +209,9 @@ static inline struct node *tnode_get_child_rcu(struct tnode *tn, unsigned int i)
{
struct node *ret = tnode_get_child(tn, i);
- return rcu_dereference(ret);
+ return rcu_dereference_check(ret,
+ rcu_read_lock_held() ||
+ lockdep_rtnl_is_held());
}
static inline int tnode_child_length(const struct tnode *tn)
^ permalink raw reply related
* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 19:58 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, Tom Herbert, netdev, robert, David Miller, Changli Gao,
Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>
Le mercredi 14 avril 2010 à 12:44 -0700, Stephen Hemminger a écrit :
> On Wed, 14 Apr 2010 14:53:42 -0400
> jamal <hadi@cyberus.ca> wrote:
>
> > Agreed. So to enumerate, the benefits come in if:
> > a) you have many processors
> > b) you have single-queue nic
> > c) at sub-threshold traffic you dont care about a little latency
>
> There probably needs to be better autotuning for this, there is no reason
> that RPS to be steering packets unless the queue is getting backed up.
> Some kind of high / low water mark mechanism is needed.
>
> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.
>
>
This was discussed a while ago, and Out Of Order packet delivery was the
thing that frightened us a bit.
Every time we change RPS to be on or off, we might have some extra
noise. Maybe we already have this problem with irqbalance ?
^ permalink raw reply
* [PATCH net-next-2.6 v3] fasync: RCU and fine grained locking
From: Eric Dumazet @ 2010-04-14 19:55 UTC (permalink / raw)
To: David Miller; +Cc: Paul E. McKenney, netdev, linux-kernel, Lai Jiangshan
In-Reply-To: <1271259264.16881.1703.camel@edumazet-laptop>
Here is V3 of the patch. This patch is a preliminary work to full RCU
conversion of sock_def_readable() & sock_def_write_space() functions,
called nearly for each packet...
I based it against David net-next-2.6 tree.
Thanks
[PATCH net-next-2.6 v3] fasync: RCU and fine grained locking
kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.
fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.
Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
doesnt need its own implementation and can use fasync_helper(), to
reduce code size and complexity.
We can remove __kill_fasync() direct use in net/socket.c, and rename it
to kill_fasync_rcu().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
---
v3: sock_fasync() can use generic fasync_helper(), this gives a nice
cleanup.
v2: As Lai Jiangshan noticed, we need a mutual exclusion between
fasync_{remove|add}_entry() and kill_fasync_rcu().
fs/fcntl.c | 66 ++++++++++++++++++++++++--------------
include/linux/fs.h | 12 +++----
net/socket.c | 73 ++++++-------------------------------------
3 files changed, 59 insertions(+), 92 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..0a14074 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
return ret;
}
-static DEFINE_RWLOCK(fasync_lock);
+static DEFINE_SPINLOCK(fasync_lock);
static struct kmem_cache *fasync_cache __read_mostly;
+static void fasync_free_rcu(struct rcu_head *head)
+{
+ kmem_cache_free(fasync_cache,
+ container_of(head, struct fasync_struct, fa_rcu));
+}
+
/*
* Remove a fasync entry. If successfully removed, return
* positive and clear the FASYNC flag. If no entry exists,
@@ -625,8 +631,6 @@ static struct kmem_cache *fasync_cache __read_mostly;
* NOTE! It is very important that the FASYNC flag always
* match the state "is the filp on a fasync list".
*
- * We always take the 'filp->f_lock', in since fasync_lock
- * needs to be irq-safe.
*/
static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
{
@@ -634,17 +638,22 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
int result = 0;
spin_lock(&filp->f_lock);
- write_lock_irq(&fasync_lock);
+ spin_lock(&fasync_lock);
for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
if (fa->fa_file != filp)
continue;
+
+ spin_lock_irq(&fa->fa_lock);
+ fa->fa_file = NULL;
+ spin_unlock_irq(&fa->fa_lock);
+
*fp = fa->fa_next;
- kmem_cache_free(fasync_cache, fa);
+ call_rcu(&fa->fa_rcu, fasync_free_rcu);
filp->f_flags &= ~FASYNC;
result = 1;
break;
}
- write_unlock_irq(&fasync_lock);
+ spin_unlock(&fasync_lock);
spin_unlock(&filp->f_lock);
return result;
}
@@ -666,25 +675,30 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
return -ENOMEM;
spin_lock(&filp->f_lock);
- write_lock_irq(&fasync_lock);
+ spin_lock(&fasync_lock);
for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
if (fa->fa_file != filp)
continue;
+
+ spin_lock_irq(&fa->fa_lock);
fa->fa_fd = fd;
+ spin_unlock_irq(&fa->fa_lock);
+
kmem_cache_free(fasync_cache, new);
goto out;
}
+ spin_lock_init(&new->fa_lock);
new->magic = FASYNC_MAGIC;
new->fa_file = filp;
new->fa_fd = fd;
new->fa_next = *fapp;
- *fapp = new;
+ rcu_assign_pointer(*fapp, new);
result = 1;
filp->f_flags |= FASYNC;
out:
- write_unlock_irq(&fasync_lock);
+ spin_unlock(&fasync_lock);
spin_unlock(&filp->f_lock);
return result;
}
@@ -704,37 +718,41 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
EXPORT_SYMBOL(fasync_helper);
-void __kill_fasync(struct fasync_struct *fa, int sig, int band)
+/*
+ * rcu_read_lock() is held
+ */
+static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
{
while (fa) {
- struct fown_struct * fown;
+ struct fown_struct *fown;
if (fa->magic != FASYNC_MAGIC) {
printk(KERN_ERR "kill_fasync: bad magic number in "
"fasync_struct!\n");
return;
}
- fown = &fa->fa_file->f_owner;
- /* Don't send SIGURG to processes which have not set a
- queued signum: SIGURG has its own default signalling
- mechanism. */
- if (!(sig == SIGURG && fown->signum == 0))
- send_sigio(fown, fa->fa_fd, band);
- fa = fa->fa_next;
+ spin_lock(&fa->fa_lock);
+ if (fa->fa_file) {
+ fown = &fa->fa_file->f_owner;
+ /* Don't send SIGURG to processes which have not set a
+ queued signum: SIGURG has its own default signalling
+ mechanism. */
+ if (!(sig == SIGURG && fown->signum == 0))
+ send_sigio(fown, fa->fa_fd, band);
+ }
+ spin_unlock(&fa->fa_lock);
+ fa = rcu_dereference(fa->fa_next);
}
}
-EXPORT_SYMBOL(__kill_fasync);
-
void kill_fasync(struct fasync_struct **fp, int sig, int band)
{
/* First a quick test without locking: usually
* the list is empty.
*/
if (*fp) {
- read_lock(&fasync_lock);
- /* reread *fp after obtaining the lock */
- __kill_fasync(*fp, sig, band);
- read_unlock(&fasync_lock);
+ rcu_read_lock();
+ kill_fasync_rcu(rcu_dereference(*fp), sig, band);
+ rcu_read_unlock();
}
}
EXPORT_SYMBOL(kill_fasync);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..018d382 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1280,10 +1280,12 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
struct fasync_struct {
- int magic;
- int fa_fd;
- struct fasync_struct *fa_next; /* singly linked list */
- struct file *fa_file;
+ spinlock_t fa_lock;
+ int magic;
+ int fa_fd;
+ struct fasync_struct *fa_next; /* singly linked list */
+ struct file *fa_file;
+ struct rcu_head fa_rcu;
};
#define FASYNC_MAGIC 0x4601
@@ -1292,8 +1294,6 @@ struct fasync_struct {
extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
/* can be called from interrupts */
extern void kill_fasync(struct fasync_struct **, int, int);
-/* only for net: no internal synchronization */
-extern void __kill_fasync(struct fasync_struct *, int, int);
extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
extern int f_setown(struct file *filp, unsigned long arg, int force);
diff --git a/net/socket.c b/net/socket.c
index 35bc198..9822081 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1067,78 +1067,27 @@ static int sock_close(struct inode *inode, struct file *filp)
* 1. fasync_list is modified only under process context socket lock
* i.e. under semaphore.
* 2. fasync_list is used under read_lock(&sk->sk_callback_lock)
- * or under socket lock.
- * 3. fasync_list can be used from softirq context, so that
- * modification under socket lock have to be enhanced with
- * write_lock_bh(&sk->sk_callback_lock).
- * --ANK (990710)
+ * or under socket lock
*/
static int sock_fasync(int fd, struct file *filp, int on)
{
- struct fasync_struct *fa, *fna = NULL, **prev;
- struct socket *sock;
- struct sock *sk;
-
- if (on) {
- fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
- if (fna == NULL)
- return -ENOMEM;
- }
-
- sock = filp->private_data;
+ struct socket *sock = filp->private_data;
+ struct sock *sk = sock->sk;
- sk = sock->sk;
- if (sk == NULL) {
- kfree(fna);
+ if (sk == NULL)
return -EINVAL;
- }
lock_sock(sk);
- spin_lock(&filp->f_lock);
- if (on)
- filp->f_flags |= FASYNC;
- else
- filp->f_flags &= ~FASYNC;
- spin_unlock(&filp->f_lock);
-
- prev = &(sock->fasync_list);
+ fasync_helper(fd, filp, on, &sock->fasync_list);
- for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
- if (fa->fa_file == filp)
- break;
-
- if (on) {
- if (fa != NULL) {
- write_lock_bh(&sk->sk_callback_lock);
- fa->fa_fd = fd;
- write_unlock_bh(&sk->sk_callback_lock);
-
- kfree(fna);
- goto out;
- }
- fna->fa_file = filp;
- fna->fa_fd = fd;
- fna->magic = FASYNC_MAGIC;
- fna->fa_next = sock->fasync_list;
- write_lock_bh(&sk->sk_callback_lock);
- sock->fasync_list = fna;
+ if (!sock->fasync_list)
+ sock_reset_flag(sk, SOCK_FASYNC);
+ else
sock_set_flag(sk, SOCK_FASYNC);
- write_unlock_bh(&sk->sk_callback_lock);
- } else {
- if (fa != NULL) {
- write_lock_bh(&sk->sk_callback_lock);
- *prev = fa->fa_next;
- if (!sock->fasync_list)
- sock_reset_flag(sk, SOCK_FASYNC);
- write_unlock_bh(&sk->sk_callback_lock);
- kfree(fa);
- }
- }
-out:
- release_sock(sock->sk);
+ release_sock(sk);
return 0;
}
@@ -1159,10 +1108,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
/* fall through */
case SOCK_WAKE_IO:
call_kill:
- __kill_fasync(sock->fasync_list, SIGIO, band);
+ kill_fasync(&sock->fasync_list, SIGIO, band);
break;
case SOCK_WAKE_URG:
- __kill_fasync(sock->fasync_list, SIGURG, band);
+ kill_fasync(&sock->fasync_list, SIGURG, band);
}
return 0;
}
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox