Netdev List
* Re: [patch 2/4] ipset: make IPv4 and IPv6 address handling similar
From: Jozsef Kadlecsik @ 2011-01-18 20:54 UTC
  To: Jan Engelhardt; +Cc: holger, netfilter-devel, netdev
In-Reply-To: <alpine.LNX.2.01.1101182139030.19166@obet.zrqbmnf.qr>

On Tue, 18 Jan 2011, Jan Engelhardt wrote:

> On Tuesday 2011-01-18 21:37, Jozsef Kadlecsik wrote:
> >> >> 
> >> >> this does not work for AF_INET6:
> >> >> 
> >> >>  ipset add foo6 20a1:1:2:3:4:5:6:7/128
> >> >>  ipset v5.2: Syntax error: plain IP address must be supplied: 20a1:1:2:3:4:5:6:7/128
> >> >
> >> >Yeah, the usual issue: should IPv4/32 and IPv6/128 be handled as a plain 
> >> >IPv4/v6 address when the manual says "enter a plain IPv4/v6 address" :-).
> >> 
> >> (Assuming this was a question, heuristically based on the word order
> >> you used:) I don't think so. iptables, resp. its modules, do not
> >> allow that either.
> >
> >I know, but the situation is a little bit more complicated: the set type 
> >in question works differently with IPv4 and IPv6. In the IPv4 case, a 
> >range of IP addresses as IPv4/prefix is accepted as input (thus 
> >192.168.1.1/32 too), while for IPv6, only plain IPv6 addresses are allowed 
> >and therefore 20a1:1:2:3:4:5:6:7/128 was rejected.
> 
> Is there a specific reason that there is no IPv6 net support?

Call it laziness: for IPv6, the hash:ip* types do *not* accept a 
range of elements to be added/deleted in one command, expressed as

ipset add foo6 20a1:1:2:3:4:5:6:7/120

or

ipset add foo6 20A1:1:2:3:4:5:6:0-20A1:1:2:3:4:5:6:FF

For IPv4 the syntax is accepted and handled.
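
To make the two notations concrete: the /120 prefix and the explicit 0-FF
range above describe the same 256 addresses. A minimal Python sketch using
the standard ipaddress module (illustration only; this is an assumption-free
arithmetic check, not how ipset parses its arguments):

```python
import ipaddress

# strict=False accepts an address with host bits set (the ...:7/120 form
# used on the command line); the resulting network starts at ...:0.
net = ipaddress.ip_network("20a1:1:2:3:4:5:6:7/120", strict=False)

first = net[0]              # lowest address in the prefix
last = net[-1]              # highest address in the prefix
count = net.num_addresses   # number of individual addresses covered

# Going the other way: the explicit range collapses back into one prefix.
summarized = list(ipaddress.summarize_address_range(first, last))

print(first, last, count)
print(summarized)
```

So accepting either form for IPv6 would mean adding or deleting 256
elements with one command, as is already done for IPv4.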

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 20:55 UTC
  To: Pablo Neira Ayuso
  Cc: David Miller, arthur.marsh, jengelh, eric.dumazet, netdev, hadi
In-Reply-To: <4D35F8A3.1010600@netfilter.org>

On Tue, Jan 18, 2011 at 09:31:31PM +0100, Pablo Neira Ayuso wrote:
> On 18/01/11 11:24, Jarek Poplawski wrote:
> > On Tue, Jan 18, 2011 at 02:07:02AM -0800, David Miller wrote:
> >> From: Jarek Poplawski <jarkao2@gmail.com>
> >> Date: Tue, 18 Jan 2011 09:38:11 +0000
> >>
> >>> Even if I'm wrong, this change added to stable will break many configs.
> >>> My proposal is to revert commit 0ab03c2b147 until proper fix is found.
> >>
> >> The flag combination is, at best ambiguous, it has no proper
> >> definition without the check we added.
> > 
> > Do you all expect all users manage to upgrade avahi app before
> > changing their stable kernel? I mean "own distro" users especially.
> 
> The combination that avahi uses makes no sense.

I don't agree, as explained in the reverting patch. Anyway, again,
this is an old problem, so there is no reason to force "fixing" it just
now at the expense of an obvious regression, especially in stable
kernels. Anyway, I'll accept whatever David decides on this problem.
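
For reference, the disputed check can be sketched in a few lines. The flag
values below are the ones defined in linux/netlink.h; the two helper
functions are hypothetical names used only for this illustration, and the
"before/after" reading is based on the commit title, not on the kernel
source itself:

```python
# Flag values as defined in linux/netlink.h.
NLM_F_ROOT = 0x100
NLM_F_MATCH = 0x200
NLM_F_ATOMIC = 0x400
NLM_F_DUMP = NLM_F_ROOT | NLM_F_MATCH  # composite of two bits


def is_dump_any_bit(flags):
    """Looser test: any bit of the composite is treated as a dump request."""
    return bool(flags & NLM_F_DUMP)


def is_dump_all_bits(flags):
    """Stricter test: every bit of the composite must be set."""
    return (flags & NLM_F_DUMP) == NLM_F_DUMP


# A request that sets only one half of the composite is a dump under the
# looser test but not under the stricter one.
for flags in (NLM_F_ROOT, NLM_F_MATCH, NLM_F_DUMP):
    print(hex(flags), is_dump_any_bit(flags), is_dump_all_bits(flags))
```

An application that sets only one of the two bits therefore gets different
behaviour depending on which test the kernel applies.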

Jarek P.

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-01-18 21:20 UTC
  To: Jay Vosburgh
  Cc: Oleg V. Ukhno, John Fastabend, David S. Miller,
	netdev@vger.kernel.org, Sébastien Barré,
	Christophe Paasch
In-Reply-To: <28837.1295382268@death>

Le 18/01/2011 21:24, Jay Vosburgh a écrit :
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> 	I don't think this is true, at least not for the case of
> balance-rr.  Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.

Can't we achieve the expected ARP monitoring by using the exact same artifice that Oleg suggested:
using a different source MAC per slave for ARP monitoring, so that the return path matches the sending path?

>>>> - changing the destination MAC address of egress packets are not
>>>> necessary, because egress path selection force ingress path selection
>>>> due to the VLAN.
>
> 	This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.

Yes, you're right.

> 	That's true.  The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.

Yes, unless we find a way to make ARP monitoring reliable in a load-balancing situation.

[snip]

> 	This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts.  The multiple VLANs take the place of multiple discrete switches.
>
> 	This could also be accomplished via bridge groups (in
> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> 	Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."

Can't we "arp monitor" the destination MAC address of host B on both paths? That way,
host A would know that a given path is down, because the return path would be the same. The target host
should send the reply on the slave on which it received the request, which is the normal way to reply
to an ARP request.

> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.

This is exactly the reason why I asked Oleg to do some tests with balance-rr. I cannot find a good
reason for a possibly new xmit_hash_policy to provide better throughput than the current balance-rr. If
the throughput increases by, let's say, less than 20%, whatever the tcp_reordering value, then it is
probably a dead end.

> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
>
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
>
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7, P8, P9 on slave 3, possibly
by sending grouped packets, changing the sending slave every N packets instead of every packet? I
think we already discussed this possibility a few months or years ago on the bonding-devel ML. As
far as I remember, the idea was not developed because it was not easy to find the number of packets
to send through the same slave. Anyway, this might help reduce out-of-order delivery.
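
The effect of grouping can be illustrated with a toy model. This is only a
sketch under strong assumptions: the function name receive_order and the
batch parameter are made up for illustration, and the receiver is assumed to
drain one slave completely before polling the next, as in Jay's diagram:

```python
def receive_order(num_packets, num_slaves, batch=1):
    """Order in which packets reach the stack when slaves are polled in turn.

    Packets are numbered from 1. The sender switches slave every `batch`
    packets; the receiver drains each slave's queue fully, one slave
    after another.
    """
    queues = [[] for _ in range(num_slaves)]
    for pkt in range(1, num_packets + 1):
        slave = ((pkt - 1) // batch) % num_slaves
        queues[slave].append(pkt)
    return [pkt for queue in queues for pkt in queue]


print(receive_order(9, 3))            # per-packet round robin
print(receive_order(9, 3, batch=3))   # switch slave every 3 packets
```

In this model, per-packet striping yields 1, 4, 7, 2, 5, 8, 3, 6, 9 while
batches of 3 arrive in order. The model is idealized: it only stays in order
when one poll covers exactly one batch per slave, which is the difficulty of
choosing N mentioned above.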

> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Yes, and this is already true for some bonding modes and in particular for balance-rr.

	Nicolas.

* Re: Flow Control and Port Mirroring Revisited
From: Rick Jones @ 2011-01-18 21:28 UTC
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Eric Dumazet, Rusty Russell,
	virtualization, dev, virtualization, netdev, kvm
In-Reply-To: <20110118201333.GD18760@redhat.com>

Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> 
>>PS - the enhanced latency statistics from -j are only available in
>>the "omni" version of the TCP_RR test.  To get that add a
>>--enable-omni to the ./configure - and in this case both netperf and
>>netserver have to be recompiled.
> 
> Is this TCP only? I would love to get latency data from UDP as well.

I believe it will work with UDP request response as well.  The omni test code 
strives to be protocol agnostic.  (I'm sure there are bugs of course, there 
always are.)

There is though the added complication of there being no specific matching of 
requests to responses.  The code as written takes advantage of TCP's in-order 
semantics and recovery from packet loss.  In a "plain" UDP_RR test, with one at 
a time transactions, if either the request or response are lost, data flow 
effectively stops there until the timer expires.  So, one has "reasonable" RTT 
numbers from before that point.  In a burst UDP RR test, the code doesn't know 
which request/response was lost and so the matching being done to get RTTs will 
be off by each lost datagram.  And if something were re-ordered the timestamps 
would be off even without a datagram loss event.

To "fix" that would require netperf do something it has not yet done in 18-odd 
years :)  That is actually echo something back from the netserver on the RR test 
- either an id, or a timestamp.  That means "dirtying" the buffers which means 
still more cache misses, from places other than the actual stack. Not beyond the 
realm of the possible, but it would be a bit of departure for "normal" operation 
(*) and could enforce a minimum request/response size beyond the present single 
byte (ok, perhaps only two or four bytes :).  But that, perhaps, is a discussion 
best left to netperf-talk at netperf.org.
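
The matching problem can be sketched with made-up numbers. This is not
netperf code; the function names and the integer millisecond timestamps are
assumptions chosen purely for illustration:

```python
def rtts_fifo(send_times, recv_times):
    """Pair the n-th response with the n-th request (nothing echoed back)."""
    return [recv - send for send, recv in zip(send_times, recv_times)]


def rtts_by_id(send_times, responses):
    """Pair each response with its request using an echoed request id."""
    return {rid: recv - send_times[rid] for rid, recv in responses}


# Hypothetical burst: four requests sent 1 ms apart, true RTT 10 ms,
# and the response to request 1 is lost.
send_times = [0, 1, 2, 3]
responses = [(0, 10), (2, 12), (3, 13)]   # (echoed id, arrival time)

print(rtts_fifo(send_times, [recv for _, recv in responses]))
print(rtts_by_id(send_times, responses))
```

Positional matching reports 10, 11, 11 ms after the loss, while matching on
an echoed id reports 10 ms for every response that did arrive.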

happy benchmarking,

rick jones

(*) netperf does have the concept of reading from and/or dirtying buffers, 
put in back in the days of COW/page-remapping in HP-UX 9.0, but that was mainly 
to force COW and/or show the effect of the required data cache purges/flushes. 
As such it was made conditional on DIRTY being defined.

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 21:14 UTC
  To: Pablo Neira Ayuso
  Cc: David Miller, arthur.marsh, jengelh, eric.dumazet, netdev, hadi
In-Reply-To: <4D35F8A3.1010600@netfilter.org>

On Tue, Jan 18, 2011 at 09:31:31PM +0100, Pablo Neira Ayuso wrote:
> On 18/01/11 11:24, Jarek Poplawski wrote:
> > On Tue, Jan 18, 2011 at 02:07:02AM -0800, David Miller wrote:
> >> From: Jarek Poplawski <jarkao2@gmail.com>
> >> Date: Tue, 18 Jan 2011 09:38:11 +0000
> >>
> >>> Even if I'm wrong, this change added to stable will break many configs.
> >>> My proposal is to revert commit 0ab03c2b147 until proper fix is found.
> >>
> >> The flag combination is, at best ambiguous, it has no proper
> >> definition without the check we added.
> > 
> > Do you all expect all users manage to upgrade avahi app before
> > changing their stable kernel? I mean "own distro" users especially.
> 
> The combination that avahi uses makes no sense.
> 
> I've been auditing user-space tools that may have problems with this change:
> 
> * iw (it uses libnl)
> * acpid (it uses a mangled version of libnetlink shipped in iproute)
> * tstime, for taskstats, it uses libnl
> * wimax-tools, it uses libnl
> * quota-tools, it uses libnl
> * keepalived, no libs used
> 
> Well, I can keep looking for more, but I think that avahi is the only
> one doing this incorrectly.

BTW, could you answer my earlier question: why isn't the NLM_F_ATOMIC flag
handled now with dumps?

Jarek P.

* Re: [regression] 2.6.37+ commit 0363466866d9.... breaks tcp ipv6
From: Hans de Bruin @ 2011-01-18 21:42 UTC
  To: Jesse Gross; +Cc: LKML, netdev
In-Reply-To: <AANLkTinejKw0pY5HncB2e0ocjqYuCwTBUwd2xkroWu+E@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3491 bytes --]

On 01/18/2011 09:06 PM, Jesse Gross wrote:
> On Tue, Jan 18, 2011 at 11:52 AM, Hans de Bruin<jmdebruin@xmsnet.nl>  wrote:
>> On 01/16/2011 09:24 PM, Hans de Bruin wrote:
>>>
>>> After last nights compile i lost the possibility to connect to ssh and
>>> http over ipv6. The connection stops at syn_sent. connections to my
>>> machine end in syn_recv. ping6 still works.
>>>
>>
>> The bisect ended in:
>>
>> 0363466866d901fbc658f4e63dd61e7cc93dd0af is the first bad commit
>> commit 0363466866d901fbc658f4e63dd61e7cc93dd0af
>> Author: Jesse Gross<jesse@nicira.com>
>> Date:   Sun Jan 9 06:23:35 2011 +0000
>>
>>     net offloading: Convert checksums to use centrally computed features.
>>
>>     In order to compute the features for other offloads (primarily
>>     scatter/gather), we need to first check the ability of the NIC to
>>     offload the checksum for the packet.  Since we have already computed
>>     this, we can directly use the result instead of figuring it out
>>     again.
>>
>>     Signed-off-by: Jesse Gross<jesse@nicira.com>
>>     Signed-off-by: David S. Miller<davem@davemloft.net>
>>
>>
>> ssh ::1  still works. And since dns still works I guess udp is not affected.
>> My nic is a:
>>
>> 09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit
>> Ethernet PCI Express (rev 02)
>
> Are you using vlans?  If so, can you please test this patch?
> http://patchwork.ozlabs.org/patch/79264/
>

No, I am not using VLANs. The option is not even set in the config file.
Except for the diskless bit, it is a straightforward setup:

bash-4.1# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
     link/ether 00:1c:23:2d:73:87 brd ff:ff:ff:ff:ff:ff
     inet 10.10.0.6/16 brd 10.10.255.255 scope global eth0
     inet6 2001:610:76e:0:21c:23ff:fe2d:7387/64 scope global dynamic
        valid_lft 86399sec preferred_lft 14399sec
     inet6 fe80::21c:23ff:fe2d:7387/64 scope link
        valid_lft forever preferred_lft forever
3: sit0: <NOARP> mtu 1480 qdisc noop state DOWN
     link/sit 0.0.0.0 brd 0.0.0.0
bash-4.1# ip route show
10.10.0.0/16 dev eth0  proto kernel  scope link  src 10.10.0.6
default via 10.10.0.1 dev eth0
bash-4.1# ip -6 route show
2001:610:76e::/64 dev eth0  proto kernel  metric 256  expires 86404sec
fe80::/64 dev eth0  proto kernel  metric 256
ff00::/8 dev eth0  metric 256
default via fe80::230:18ff:feae:75d8 dev eth0  proto kernel  metric 1024  expires 29sec hoplimit 64
bash-4.1# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
bash-4.1# ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
bash-4.1# brctl show
bridge name     bridge id               STP enabled     interfaces
bash-4.1# cat /proc/cmdline
root=/dev/nfs nfsroot=10.10.0.2:/nfs/root/psion/,v3,tcp ro ip=::: ::dhcp BOOT_IMAGE=nightlybuild-psion

-- 
Hans


[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 14483 bytes --]

* Re: [patch 2/4] ipset: make IPv4 and IPv6 address handling similar
From: Holger Eitzenberger @ 2011-01-18 21:43 UTC
  To: Jan Engelhardt; +Cc: Jozsef Kadlecsik, netfilter-devel, netdev
In-Reply-To: <alpine.LNX.2.01.1101182139030.19166@obet.zrqbmnf.qr>

On Tue, Jan 18, 2011 at 09:39:32PM +0100, Jan Engelhardt wrote:
> On Tuesday 2011-01-18 21:37, Jozsef Kadlecsik wrote:
> >> >> 
> >> >> this does not work for AF_INET6:
> >> >> 
> >> >>  ipset add foo6 20a1:1:2:3:4:5:6:7/128
> >> >>  ipset v5.2: Syntax error: plain IP address must be supplied: 20a1:1:2:3:4:5:6:7/128
> >> >
> >> >Yeah, the usual issue: should IPv4/32 and IPv6/128 be handled as a plain 
> >> >IPv4/v6 address when the manual says "enter a plain IPv4/v6 address" :-).
> >> 
> >> (Assuming this was a question, heuristically based on the word order
> >> you used:) I don't think so. iptables, resp. its modules, do not
> >> allow that either.
> >
> >I know, but the situation is a little bit more complicated: the set type 
> >in question works differently with IPv4 and IPv6. In the IPv4 case, a 
> >range of IP addresses as IPv4/prefix is accepted as input (thus 
> >192.168.1.1/32 too), while for IPv6, only plain IPv6 addresses are allowed 
> >and therefore 20a1:1:2:3:4:5:6:7/128 was rejected.
> 
> Is there a specific reason that there is no IPv6 net support?

You shouldn't use hash:ip with ranges for IPv4 either, because the range
members are added individually, which is less efficient in both memory
and performance; see:

 $ ipset create foo hash:ip hashsize 64
 $ ipset add foo 192.168.1.0/30
 $ ipset list foo
 Name: foo
 Type: hash:ip
 Header: family inet hashsize 64 maxelem 65536 
 Size in memory: 628
 References: 0
 Members:
 192.168.1.3
 192.168.1.2
 192.168.1.0
 192.168.1.1
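
The same expansion can be reproduced outside ipset. A minimal Python sketch
(the helper name expand_members is made up for illustration; it only mimics
the visible result above, not ipset's internals):

```python
import ipaddress


def expand_members(spec):
    """Return every address covered by a prefix, one entry per address."""
    return [str(addr) for addr in ipaddress.ip_network(spec)]


members = expand_members("192.168.1.0/30")
print(len(members), members)

# A /16 already expands to 65536 separate entries, which equals the
# maxelem value shown in the set header above.
print(ipaddress.ip_network("10.0.0.0/16").num_addresses)
```

Every prefix bit removed doubles the number of stored elements, so a type
that stores the network itself scales much better for large ranges.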

> Call it laziness: for IPv6, the hash:ip* types does *not* accept a 
> range of elements to be added/deleted in one command, expressed as
> 
> ipset add foo6 20a1:1:2:3:4:5:6:7/120
> 
> or
> 
> ipset add foo6 20A1:1:2:3:4:5:6:0-20A1:1:2:3:4:5:6:FF
> 
> For IPv4 the syntax is accepted and handled.


* Re: [regression] 2.6.37+ commit 0363466866d9.... breaks tcp ipv6
From: Eric Dumazet @ 2011-01-18 22:03 UTC
  To: Hans de Bruin; +Cc: Jesse Gross, LKML, netdev
In-Reply-To: <4D36092B.1090406@xmsnet.nl>

Le mardi 18 janvier 2011 à 22:42 +0100, Hans de Bruin a écrit :
> On 01/18/2011 09:06 PM, Jesse Gross wrote:
> > On Tue, Jan 18, 2011 at 11:52 AM, Hans de Bruin<jmdebruin@xmsnet.nl>  wrote:
> >> On 01/16/2011 09:24 PM, Hans de Bruin wrote:
> >>>
> >>> After last nights compile i lost the possibility to connect to ssh and
> >>> http over ipv6. The connection stops at syn_sent. connections to my
> >>> machine end in syn_recv. ping6 still works.
> >>>
> >>
> >> The bisect ended in:
> >>
> >> 0363466866d901fbc658f4e63dd61e7cc93dd0af is the first bad commit
> >> commit 0363466866d901fbc658f4e63dd61e7cc93dd0af
> >> Author: Jesse Gross<jesse@nicira.com>
> >> Date:   Sun Jan 9 06:23:35 2011 +0000
> >>
> >>     net offloading: Convert checksums to use centrally computed features.
> >>
> >>     In order to compute the features for other offloads (primarily
> >>     scatter/gather), we need to first check the ability of the NIC to
> >>     offload the checksum for the packet.  Since we have already computed
> >>     this, we can directly use the result instead of figuring it out
> >>     again.
> >>
> >>     Signed-off-by: Jesse Gross<jesse@nicira.com>
> >>     Signed-off-by: David S. Miller<davem@davemloft.net>
> >>
> >>
> >> ssh ::1  still works. And since dns still works I guess udp is not affected.
> >> My nic is a:
> >>
> >> 09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit
> >> Ethernet PCI Express (rev 02)
> >
> > Are you using vlans?  If so, can you please test this patch?
> > http://patchwork.ozlabs.org/patch/79264/
> >
> 
> No I am not using vlans. The option is even not set in the config file. 
> Except for the disk less bit is a straightforward setup:
> 
>    bash-4.1# ip addr show
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>      inet 127.0.0.1/8 scope host lo
>      inet6 ::1/128 scope host
>         valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP 
> qlen 1000
>      link/ether 00:1c:23:2d:73:87 brd ff:ff:ff:ff:ff:ff
>      inet 10.10.0.6/16 brd 10.10.255.255 scope global eth0
>      inet6 2001:610:76e:0:21c:23ff:fe2d:7387/64 scope global dynamic
>         valid_lft 86399sec preferred_lft 14399sec
>      inet6 fe80::21c:23ff:fe2d:7387/64 scope link
>         valid_lft forever preferred_lft forever
> 3: sit0: <NOARP> mtu 1480 qdisc noop state DOWN
>      link/sit 0.0.0.0 brd 0.0.0.0
> bash-4.1# ip route show
> 10.10.0.0/16 dev eth0  proto kernel  scope link  src 10.10.0.6
> default via 10.10.0.1 dev eth0
> bash-4.1# ip -6 route show
> 2001:610:76e::/64 dev eth0  proto kernel  metric 256  expires 86404sec
> fe80::/64 dev eth0  proto kernel  metric 256
> ff00::/8 dev eth0  metric 256
> default via fe80::230:18ff:feae:75d8 dev eth0  proto kernel  metric 1024 
>   expires 29sec hoplimit 64
> bash-4.1# iptables -L
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
> bash-4.1# ip6tables -L
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
> bash-4.1# brctl show
> bridge name     bridge id               STP enabled     interfaces
> bash-4.1# cat /proc/cmdline
> root=/dev/nfs nfsroot=10.10.0.2:/nfs/root/psion/,v3,tcp ro ip=::: ::dhcp 
> BOOT_IMAGE=nightlybuild-psion
> 


You could try "tcpdump -i eth0 ip6 -v"

I guess you receive frames with bad checksums

You can ask the other system not to offload TX checksums:

ethtool -K eth0 tx off

Please give the result of (on both machines):
lspci -v
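
For anyone who wants to check a captured segment by hand, the TCP checksum
over IPv6 can be sketched as below. This is a minimal illustration of the
standard one's-complement checksum with the IPv6 pseudo-header; the function
names, the documentation-prefix addresses and the sample header are all
assumptions, and real captures should simply be checked with tcpdump -v:

```python
import ipaddress
import struct


def ones_complement_sum(data):
    """16-bit one's-complement sum of data, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp6_checksum(src, dst, segment):
    """Checksum of a TCP segment carried over IPv6 (pseudo-header included)."""
    pseudo = (ipaddress.IPv6Address(src).packed
              + ipaddress.IPv6Address(dst).packed
              # upper-layer length, 3 zero bytes, next header 6 (TCP)
              + struct.pack("!I3xB", len(segment), 6))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF


# Hypothetical 20-byte TCP header (SYN, ports 12345 -> 22), checksum zeroed.
header = struct.pack("!HHIIBBHHH", 12345, 22, 0, 0, 5 << 4, 0x02, 65535, 0, 0)
csum = tcp6_checksum("2001:db8::1", "2001:db8::2", header)
filled = header[:16] + struct.pack("!H", csum) + header[18:]

# Recomputing over a segment with a correct checksum field gives 0.
print(tcp6_checksum("2001:db8::1", "2001:db8::2", filled))
```

If the sender leaves the checksum to hardware that then does not fill it
in, the receiver's recomputation is non-zero and the segment is dropped,
which would match connections stuck in syn_sent.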

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 22:22 UTC
  To: Jay Vosburgh
  Cc: Nicolas de Pesloüan, John Fastabend, David S. Miller,
	netdev@vger.kernel.org, Sébastien Barré,
	Christophe Paasch
In-Reply-To: <28837.1295382268@death>



Jay Vosburgh wrote:
> 
> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.
> 
> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
> 
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
> 
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
> 
> 	I haven't done much testing with this lately, but I suspect this
> behavior hasn't really changed.  Raising the tcp_reordering sysctl value
> can mitigate this somewhat (by making TCP more tolerant of this), but
> that doesn't help non-TCP protocols.
> 
> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Jay, here are some network stats from one of my iSCSI targets with an avg
load of 1.5-2.5 Gbit/sec (4 slaves in etherchannel). Not perfect and not
very "clean" (there are more interfaces on the host than these 4):
[root@<somehost> ~]# netstat -st
IcmpMsg:
     InType0: 6
     InType3: 1872
     InType8: 60557
     InType11: 23
     OutType0: 60528
     OutType3: 1755
     OutType8: 6
Tcp:
     1298909 active connections openings
     61090 passive connection openings
     2374 failed connection attempts
     62781 connection resets received
     3 connections established
     1268233942 segments received
     1198020318 segments send out
     18939618 segments retransmited
     0 bad segments received.
     23643 resets sent
TcpExt:
     294935 TCP sockets finished time wait in fast timer
     472 time wait sockets recycled by time stamp
     819481 delayed acks sent
     295332 delayed acks further delayed because of locked socket
     Quick ack mode was activated 30616377 times
     3516920 packets directly queued to recvmsg prequeue.
     4353 packets directly received from backlog
     44873453 packets directly received from prequeue
     1442812750 packets header predicted
     1077442 packets header predicted and directly queued to user
     2123453975 acknowledgments not containing data received
     2375328274 predicted acknowledgments
     8462439 times recovered from packet loss due to fast retransmit
     Detected reordering 19203 times using reno fast retransmit
     Detected reordering 100 times using time stamp
     3429 congestion windows fully recovered
     11760 congestion windows partially recovered using Hoe heuristic
     398 congestion windows recovered after partial ack
     0 TCP data loss events
     3671 timeouts after reno fast retransmit
     6 timeouts in loss state
     18919118 fast retransmits
     11637 retransmits in slow start
     1756 other TCP timeouts
     TCPRenoRecoveryFail: 3187
     62779 connections reset due to early user close
IpExt:
     InBcastPkts: 512616
[root@<somehost> ~]# uptime
  00:35:49 up 42 days,  8:27,  1 user,  load average: 3.70, 3.80, 4.07
[root@<somehost> ~]# sysctl -a|grep tcp_reo
net.ipv4.tcp_reordering = 3

I will get back with "clean" results after I set up a test system tomorrow.
TcpExt stats from other hosts are similar.
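
To put the counters in proportion, here is a small calculation over the
numbers above (plain arithmetic on the netstat output; the variable names
are only for illustration):

```python
# Counters copied from the netstat -st output above.
segments_sent = 1198020318
segments_retransmitted = 18939618
fast_retransmits = 18919118

retransmit_pct = 100.0 * segments_retransmitted / segments_sent
fast_share_pct = 100.0 * fast_retransmits / segments_retransmitted

print("retransmitted: %.2f%% of segments sent" % retransmit_pct)
print("fast retransmits: %.1f%% of all retransmits" % fast_share_pct)
```

Roughly 1.58% of sent segments were retransmitted, and nearly all of those
were fast retransmits rather than timeouts. With tcp_reordering = 3 that
would be consistent with reordering triggering duplicate ACKs, though these
counters alone do not prove it.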

> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

-- 
Best regards,
Oleg Ukhno

* [PATCH v2] ethtool : Add option -L | --set-common to set common flags.
From: Mahesh Bandewar @ 2011-01-18 22:37 UTC
  To: Ben Hutchings
  Cc: David Miller, Tom Herbert, Laurent Chavey, netdev,
	Mahesh Bandewar
In-Reply-To: <1294963892-11997-1-git-send-email-maheshb@google.com>

This patch adds a -L | --set-common option to set or clear common flags,
which include the loopback flag. The -l | --show-common option displays the
current values of these common flags.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 ethtool-copy.h |    1 +
 ethtool.8.in   |   16 ++++++++++
 ethtool.c      |   84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 0 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 75c3ae7..5fd18c7 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -309,6 +309,7 @@ struct ethtool_perm_addr {
  * flag differs from the read-only value.
  */
 enum ethtool_flags {
+	ETH_FLAG_LOOPBACK	= (1 << 2),	/* Loopback enable / disable */
 	ETH_FLAG_TXVLAN		= (1 << 7),	/* TX VLAN offload enabled */
 	ETH_FLAG_RXVLAN		= (1 << 8),	/* RX VLAN offload enabled */
 	ETH_FLAG_LRO		= (1 << 15),	/* LRO is enabled */
diff --git a/ethtool.8.in b/ethtool.8.in
index 0ee91a0..b9f8892 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -211,6 +211,13 @@ ethtool \- query or control network driver and hardware settings
 .B2 txvlan on off
 .B2 rxhash on off
 
+.B ethtool \-l|\-\-show\-common
+.I ethX
+
+.B ethtool \-L|\-\-set\-common
+.I ethX
+.B2 loopback on off
+
 .B ethtool \-p|\-\-identify
 .I ethX
 .RI [ N ]
@@ -444,6 +451,15 @@ Specifies whether TX VLAN acceleration should be enabled
 .A2 rxhash on off
 Specifies whether receive hashing offload should be enabled
 .TP
+.B \-l \-\-show\-common
+Queries the specified network device for common flag settings.
+.TP
+.B \-L \-\-set\-common
+Changes the common parameters of the specified network device.
+.TP
+.A2 loopback on off
+Specifies whether loopback should be enabled.
+.TP
 .B \-p \-\-identify
 Initiates adapter-specific action intended to enable an operator to
 easily identify the adapter by sight.  Typically this involves
diff --git a/ethtool.c b/ethtool.c
index 1afdfe4..0e234ea 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -97,6 +97,8 @@ static int do_gcoalesce(int fd, struct ifreq *ifr);
 static int do_scoalesce(int fd, struct ifreq *ifr);
 static int do_goffload(int fd, struct ifreq *ifr);
 static int do_soffload(int fd, struct ifreq *ifr);
+static int do_gcommon(int fd, struct ifreq *ifr);
+static int do_scommon(int fd, struct ifreq *ifr);
 static int do_gstats(int fd, struct ifreq *ifr);
 static int rxflow_str_to_type(const char *str);
 static int parse_rxfhashopts(char *optstr, u32 *data);
@@ -142,6 +144,8 @@ static enum {
 	MODE_GNTUPLE,
 	MODE_FLASHDEV,
 	MODE_PERMADDR,
+	MODE_GCOMMON,
+	MODE_SCOMMON,
 } mode = MODE_GSET;
 
 static struct option {
@@ -211,6 +215,10 @@ static struct option {
 		"		[ ntuple on|off ]\n"
 		"		[ rxhash on|off ]\n"
     },
+    { "-l", "--show-common", MODE_GCOMMON, "Get common flags information" },
+    { "-L", "--set-common", MODE_SCOMMON, "Set common flags",
+		"               [ loopback on|off ]\n"
+    },
     { "-i", "--driver", MODE_GDRV, "Show driver information" },
     { "-d", "--register-dump", MODE_GREGS, "Do a register dump",
 		"		[ raw on|off ]\n"
@@ -309,6 +317,10 @@ static u32 off_flags_wanted = 0;
 static u32 off_flags_mask = 0;
 static int off_gro_wanted = -1;
 
+static int gcommon_changed = 0;
+static u32 common_flags_wanted = 0;
+static u32 common_flags_mask = 0;
+
 static struct ethtool_pauseparam epause;
 static int gpause_changed = 0;
 static int pause_autoneg_wanted = -1;
@@ -482,6 +494,11 @@ static struct cmdline_info cmdline_offload[] = {
 	  ETH_FLAG_RXHASH, &off_flags_mask },
 };
 
+static struct cmdline_info cmdline_commonflags[] = {
+	{ "loopback", CMDL_FLAG, &common_flags_wanted, NULL,
+	  ETH_FLAG_LOOPBACK, &common_flags_mask },
+};
+
 static struct cmdline_info cmdline_pause[] = {
 	{ "autoneg", CMDL_BOOL, &pause_autoneg_wanted, &epause.autoneg },
 	{ "rx", CMDL_BOOL, &pause_rx_wanted, &epause.rx_pause },
@@ -829,6 +846,8 @@ static void parse_cmdline(int argc, char **argp)
 			    (mode == MODE_SRING) ||
 			    (mode == MODE_GOFFLOAD) ||
 			    (mode == MODE_SOFFLOAD) ||
+			    (mode == MODE_GCOMMON) ||
+			    (mode == MODE_SCOMMON) ||
 			    (mode == MODE_GSTATS) ||
 			    (mode == MODE_GNFC) ||
 			    (mode == MODE_SNFC) ||
@@ -919,6 +938,14 @@ static void parse_cmdline(int argc, char **argp)
 				i = argc;
 				break;
 			}
+			if (mode == MODE_SCOMMON) {
+				parse_generic_cmdline(argc, argp, i,
+					&gcommon_changed,
+					cmdline_commonflags,
+					ARRAY_SIZE(cmdline_commonflags));
+				i = argc;
+				break;
+			}
 			if (mode == MODE_SNTUPLE) {
 				if (!strcmp(argp[i], "flow-type")) {
 					i += 1;
@@ -1905,6 +1932,13 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
 	return 0;
 }
 
+static int dump_common_flags(int loopback)
+{
+	fprintf(stdout, "loopback: %s\n", loopback ? "on" : "off");
+
+	return 0;
+}
+
 static int dump_rxfhash(int fhash, u64 val)
 {
 	switch (fhash) {
@@ -1998,6 +2032,10 @@ static int doit(void)
 		return do_goffload(fd, &ifr);
 	} else if (mode == MODE_SOFFLOAD) {
 		return do_soffload(fd, &ifr);
+	} else if (mode == MODE_GCOMMON) {
+		return do_gcommon(fd, &ifr);
+	} else if (mode == MODE_SCOMMON) {
+		return do_scommon(fd, &ifr);
 	} else if (mode == MODE_GSTATS) {
 		return do_gstats(fd, &ifr);
 	} else if (mode == MODE_GNFC) {
@@ -2219,6 +2257,52 @@ static int do_scoalesce(int fd, struct ifreq *ifr)
 	return 0;
 }
 
+static int do_gcommon(int fd, struct ifreq *ifr)
+{
+	struct ethtool_value eval;
+	int loopback = 0;
+
+	fprintf(stdout, "Common flags for %s:\n", devname);
+
+	eval.cmd = ETHTOOL_GFLAGS;
+	ifr->ifr_data = (caddr_t)&eval;
+	if (ioctl(fd, SIOCETHTOOL, ifr)) {
+		perror("Cannot get device flags");
+	} else {
+		loopback = (eval.data & ETH_FLAG_LOOPBACK) != 0;
+	}
+
+	return dump_common_flags(loopback);
+}
+
+static int do_scommon(int fd, struct ifreq *ifr)
+{
+	struct ethtool_value eval;
+
+	if (common_flags_mask) {
+		eval.cmd = ETHTOOL_GFLAGS;
+		eval.data = 0;
+		ifr->ifr_data = (caddr_t)&eval;
+		if (ioctl(fd, SIOCETHTOOL, ifr)) {
+			perror("Cannot get device common flags");
+			return 1;
+		}
+
+		eval.cmd = ETHTOOL_SFLAGS;
+		eval.data = ((eval.data & ~common_flags_mask) |
+			     common_flags_wanted);
+
+		if (ioctl(fd, SIOCETHTOOL, ifr)) {
+			perror("Cannot set device common flags");
+			return 1;
+		}
+	} else {
+		fprintf(stdout, "No common settings changed\n");
+	}
+
+	return 0;
+}
+
 static int do_goffload(int fd, struct ifreq *ifr)
 {
 	struct ethtool_value eval;
-- 
1.7.3.1
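
The set path in do_scommon() above is a read-modify-write of the flags word: fetch the current value, clear the bits named in the mask, then OR in the wanted bits. A minimal sketch of that update follows, in isolation from the ioctl calls. EXAMPLE_FLAG_LOOPBACK and apply_flags() are hypothetical names used only for illustration; the real ETH_FLAG_LOOPBACK value comes from the patched ethtool header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define EXAMPLE_FLAG_LOOPBACK (1u << 2)

/*
 * Read-modify-write of a flags word: bits named in mask are cleared,
 * then the wanted bits are OR-ed in. Bits outside mask are preserved.
 */
static uint32_t apply_flags(uint32_t current, uint32_t mask, uint32_t wanted)
{
	/* wanted may only contain bits that are also present in mask */
	assert((wanted & ~mask) == 0);
	return (current & ~mask) | wanted;
}
```

With a mask of EXAMPLE_FLAG_LOOPBACK, passing the same bit as wanted turns the flag on and passing 0 turns it off, while every other bit of the current value is left alone.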


^ permalink raw reply related

* Re: [patch 2/4] ipset: make IPv4 and IPv6 address handling similar
From: Mr Dash Four @ 2011-01-18 22:49 UTC (permalink / raw)
  To: Jan Engelhardt, Jozsef Kadlecsik, netfilter-devel, netdev
In-Reply-To: <20110118214343.GA4845@mail.eitzenberger.org>


> You shouldn't use hash:ip with ranges for IPv4 either, because the range
> members are added individually, which is less efficient in terms of both
> memory and performance, see:
>
>  $ ipset create foo hash:ip hashsize 64
>  $ ipset add foo 192.168.1.0/30
>  $ ipset list foo
>  Name: foo
>  Type: hash:ip
>  Header: family inet hashsize 64 maxelem 65536 
>  Size in memory: 628
>  References: 0
>  Members:
>  192.168.1.3
>  192.168.1.2
>  192.168.1.0
>  192.168.1.1
>   
I disagree!

If I need to add 192.168.1.0/30 without range support, I have to execute a
loop (via a script) and add the individual elements (i.e. ipset add foo
192.168.1.0, ipset add foo 192.168.1.1 etc).

By specifying ipset add foo 192.168.1.0/30 I do that in one go. Even 
though I am inclined to agree that storing individual elements may not 
be the best way memory/storage-wise, I think that performance-wise (i.e. 
when the actual matching is performed) matching a single IP address is 
better than matching an IP range.
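
To make the expansion concrete, here is a rough sketch of how a prefix such as 192.168.1.0/30 turns into individual addresses, which is what the quoted listing shows hash:ip storing. It is not ipset code; cidr_count(), cidr_first() and cidr_print_members() are hypothetical helper names, and addresses are plain 32-bit integers in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Number of addresses covered by an IPv4 prefix length (0..32). */
static uint64_t cidr_count(unsigned prefix_len)
{
	assert(prefix_len <= 32);
	return (uint64_t)1 << (32 - prefix_len);
}

/* First address of the block that contains addr (host byte order). */
static uint32_t cidr_first(uint32_t addr, unsigned prefix_len)
{
	uint32_t mask;

	assert(prefix_len <= 32);
	/* A /0 prefix has an all-zero mask; avoid shifting by 32 bits. */
	mask = prefix_len ? ~(uint32_t)0 << (32 - prefix_len) : 0;
	return addr & mask;
}

/* Print every member of addr/prefix_len, one dotted quad per line. */
static void cidr_print_members(uint32_t addr, unsigned prefix_len)
{
	uint64_t n = cidr_count(prefix_len);
	uint32_t first = cidr_first(addr, prefix_len);

	for (uint64_t i = 0; i < n; i++) {
		uint32_t a = first + (uint32_t)i;

		printf("%u.%u.%u.%u\n",
		       (unsigned)(a >> 24), (unsigned)((a >> 16) & 0xff),
		       (unsigned)((a >> 8) & 0xff), (unsigned)(a & 0xff));
	}
}
```

A /30 yields four entries and a /24 yields 256, so the memory cost of storing members one by one grows with the size of the range, while each lookup stays a single-address hash match.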


^ permalink raw reply

* Re: [PATCH resend] netfilter: place in source hash after SNAT is done
From: Changli Gao @ 2011-01-19  0:03 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, netfilter-devel, netdev
In-Reply-To: <4D35A0F2.5010908@trash.net>

On Tue, Jan 18, 2011 at 10:17 PM, Patrick McHardy <kaber@trash.net> wrote:
>>  net/ipv4/netfilter/nf_nat_core.c |   18 +++++++++++-------
>>  1 file changed, 11 insertions(+), 7 deletions(-)
>> diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
>> index c04787c..51ce55a 100644
>> --- a/net/ipv4/netfilter/nf_nat_core.c
>> +++ b/net/ipv4/netfilter/nf_nat_core.c
>> @@ -221,7 +221,14 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
>>          manips not an issue.  */
>>       if (maniptype == IP_NAT_MANIP_SRC &&
>>           !(range->flags & IP_NAT_RANGE_PROTO_RANDOM)) {
>> -             if (find_appropriate_src(net, zone, orig_tuple, tuple, range)) {
>> +             /* try the original tuple first */
>
> This doesn't seem to be related to the hashing change. Please describe
> the intention behind this change.

Currently, we add the ct at the head of the corresponding bucket of
the source hash table after DNAT is done, so when we do SNAT, the
original ct will be tried first. This change is used to keep this
behavior.
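
The ordering can be sketched roughly as follows. This is a heavily simplified stand-in, not the nf_nat code: a "tuple" is reduced to one unsigned value, and every example_* name is hypothetical. It only shows the order of the two attempts, with the original tuple tried before any previously recorded mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: a range of acceptable source values. */
struct example_range { unsigned min, max; };

static bool example_in_range(unsigned src, const struct example_range *r)
{
	return src >= r->min && src <= r->max;
}

static bool example_used(unsigned src, const unsigned *used, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (used[i] == src)
			return true;
	return false;
}

/*
 * Returns true and stores the choice in *out when one of the two quick
 * attempts succeeds. known_mapping may be NULL when no earlier mapping
 * exists. A false return means the caller continues with its remaining
 * search, which is not modelled here.
 */
static bool example_pick_source(unsigned orig, const struct example_range *r,
				const unsigned *used, size_t n_used,
				const unsigned *known_mapping, unsigned *out)
{
	assert(r != NULL && out != NULL);

	if (example_in_range(orig, r)) {
		/* Step 1: keep the original source if nobody else uses it. */
		if (!example_used(orig, used, n_used)) {
			*out = orig;
			return true;
		}
	} else if (known_mapping != NULL) {
		/* Step 2: reuse a mapping already recorded for this source. */
		if (!example_used(*known_mapping, used, n_used)) {
			*out = *known_mapping;
			return true;
		}
	}
	return false;
}
```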


>
>> +             if (in_range(orig_tuple, range)) {
>> +                     if (!nf_nat_used_tuple(orig_tuple, ct)) {
>> +                             *tuple = *orig_tuple;
>> +                             return;
>> +                     }
>> +             } else if (find_appropriate_src(net, zone, orig_tuple, tuple,
>> +                        range)) {
>>                       pr_debug("get_unique_tuple: Found current src map\n");
>>                       if (!nf_nat_used_tuple(tuple, ct))
>>                               return;
>
>
>



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-2.6 0/8] bnx2x: Minor link related fixes
From: David Miller @ 2011-01-19  0:10 UTC (permalink / raw)
  To: yanivr; +Cc: netdev, eilong
In-Reply-To: <1295361876.5281.100.camel@lb-tlvb-dmitry>

From: "Yaniv Rosner" <yanivr@broadcom.com>
Date: Tue, 18 Jan 2011 16:44:36 +0200

> Hi Dave,
> I meant net-2.6, and not net-next-2.6

Understood, all applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv6: Silence privacy extensions initialization
From: David Miller @ 2011-01-19  0:14 UTC (permalink / raw)
  To: romain; +Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <87sjwr4djd.fsf@elegiac.orebokech.com>

From: Romain Francoise <romain@orebokech.com>
Date: Mon, 17 Jan 2011 18:59:18 +0100

> When a network namespace is created (via CLONE_NEWNET), the loopback
> interface is automatically added to the new namespace, triggering a
> printk in ipv6_add_dev() if CONFIG_IPV6_PRIVACY is set.
> 
> This is problematic for applications which use CLONE_NEWNET as
> part of a sandbox, like Chromium's suid sandbox or recent versions of
> vsftpd. On a busy machine, it can lead to thousands of useless
> "lo: Disabled Privacy Extensions" messages appearing in dmesg.
> 
> It's easy enough to check the status of privacy extensions via the
> use_tempaddr sysctl, so just removing the printk seems like the most
> sensible solution.
> 
> Signed-off-by: Romain Francoise <romain@orebokech.com>

Yes, this message always bugged me too, applied thanks!

^ permalink raw reply

* Re: [PATCH] ns83820: Avoid bad pointer deref in ns83820_init_one().
From: David Miller @ 2011-01-19  0:14 UTC (permalink / raw)
  To: bcrl; +Cc: jj, netdev, linux-ns83820, linux-kernel, tj, segooon, dkirjanov
In-Reply-To: <20110118164200.GI17839@kvack.org>

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Tue, 18 Jan 2011 11:42:00 -0500

> On Mon, Jan 17, 2011 at 09:24:57PM +0100, Jesper Juhl wrote:
>> In drivers/net/ns83820.c::ns83820_init_one() we dynamically allocate 
>> memory via alloc_etherdev(). We then call PRIV() on the returned storage 
>> which is 'return netdev_priv()'. netdev_priv() takes the pointer it is 
>> passed and adds 'ALIGN(sizeof(struct net_device), NETDEV_ALIGN)' to it and 
>> returns it. Then we test the resulting pointer for NULL, which it is 
>> unlikely to be at this point, and later dereference it. This will go bad 
>> if alloc_etherdev() actually returned NULL.
>> 
>> This patch reworks the code slightly so that we test for a NULL pointer 
>> (and return -ENOMEM) directly after calling alloc_etherdev().
> 
> Looks good.
> 
> 		-ben
> 
> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

Applied, thanks everyone.
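
For readers wondering why the old NULL test could not trigger, a small user-space sketch of the address computation described in the quoted text follows. The struct, EXAMPLE_ALIGN and the example_* helpers are hypothetical stand-ins for struct net_device, NETDEV_ALIGN and netdev_priv(). The arithmetic is done on integers so the example never performs pointer arithmetic on NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_ALIGN 32u  /* hypothetical stand-in for NETDEV_ALIGN */

/* Hypothetical stand-in for struct net_device. */
struct example_dev {
	char name[16];
	unsigned long state;
};

/* Round x up to a multiple of a (a must be a power of two). */
static size_t example_align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Address of the private area that follows the device structure. */
static uintptr_t example_priv_addr(uintptr_t dev_addr)
{
	return dev_addr +
	       example_align_up(sizeof(struct example_dev), EXAMPLE_ALIGN);
}
```

Even when the device address is 0, the computed private address is the aligned structure size, which is non-zero. A NULL check on that value therefore passes, and the failed allocation only shows up at the later dereference.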

^ permalink raw reply

* Re: [PATCHv2] USB CDC NCM: tx_fixup() race condition fix
From: David Miller @ 2011-01-19  0:14 UTC (permalink / raw)
  To: alexey.orishko-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	gregkh-l3A5Bk7waGM, yauheni.kaliuta-xNZwKgViW5gAvxtiuMwx3w,
	alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw
In-Reply-To: <1295284045-9310-1-git-send-email-alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw@public.gmane.org>

From: Alexey Orishko <alexey.orishko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 17 Jan 2011 18:07:25 +0100

> - tx_fixup() can be called from either the timer callback or from xmit()
>   in usbnet, so a spinlock is added to avoid concurrency-related problems.
> - minor correction due to a checkpatch warning for a line over 80
>   chars after the previous patch was applied.
> 
> Signed-off-by: Alexey Orishko <alexey.orishko-0IS4wlFg1OjSUeElwK9/Pw@public.gmane.org>

Applied, thanks a lot.

^ permalink raw reply

* Re: [PATCH] net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
From: David Miller @ 2011-01-19  0:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jesse, netdev
In-Reply-To: <1295333714.3362.561.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 18 Jan 2011 07:55:14 +0100

> Le lundi 17 janvier 2011 à 22:46 -0800, Jesse Gross a écrit :
>> In netif_skb_features() we return only the features that are valid for vlans
>> if we have a vlan packet.  However, we should not mask out NETIF_F_HW_VLAN_TX
>> since it enables transmission of vlan tags and is obviously valid.
>> 
>> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Signed-off-by: Jesse Gross <jesse@nicira.com>
> 
> Thanks Jesse
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/irda/sh_irda: return to RX mode when TX error
From: David Miller @ 2011-01-19  0:15 UTC (permalink / raw)
  To: kuninori.morimoto.gx; +Cc: netdev, samuel
In-Reply-To: <w3ppqs0ncyp.wl%kuninori.morimoto.gx@renesas.com>

From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Date: Fri, 14 Jan 2011 16:47:42 +0900

> sh_irda cannot use RX and TX at the same time,
> but this driver didn't return to RX mode when a TX error occurred.
> This patch handles the xmit error case to solve this issue.
> 
> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] gianfar: Fix misleading indentation in startup_gfar()
From: David Miller @ 2011-01-19  0:16 UTC (permalink / raw)
  To: cbouatmailru; +Cc: netdev
In-Reply-To: <20110118123602.GA26997@oksana.dev.rtsoft.ru>

From: Anton Vorontsov <cbouatmailru@gmail.com>
Date: Tue, 18 Jan 2011 15:36:02 +0300

> Just stumbled upon the issue while looking for another bug.
> 
> The code looks correct, the indentation is not.
> 
> Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>

Applied, thank you.

^ permalink raw reply

* net-next-2.6 open for business...
From: David Miller @ 2011-01-19  0:29 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, linux-wireless


It is currently sync'd with net-2.6 and I will start adding
feature and cleanup patches to that tree.

Just FYI...

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Mel Gorman @ 2011-01-19  0:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118110845.GA11555@redhat.com>

On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> When built with rcu checks enabled, vhost triggers
> bogus warnings as vhost features are read without
> dev->mutex sometimes.
> Fixing it properly is not trivial as vhost.h does not
> know which lockdep classes it will be used under.
> Disable the warning by stubbing out the check for now.
> 

What is the harm in leaving the bogus warnings until the difficult fix
happens? Enabling RCU checks does not seem like something that is done
in production. If this patch is applied, there is always the risk that
it'll simply be forgotten about.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply

* [PATCH 1/4] vxge: cleanup probe error paths
From: Jon Mason @ 2011-01-19  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Sivakumar Subramani, Sreenivasa Honnur, Ram Vepa

Reorder the cleanup calls to be in the inverse order of their allocations
(instead of the random order they appear to be in), propagate the return
code on errors from pci_request_region and register_netdev, reduce the
config_dev_cnt and total_dev_cnt counters on remove, and return the
correct error code for vdev->vpaths kzalloc failures.  Also, prevent
vdev->vpaths memory and the netdev from leaking in the vxge_probe error
path, which happened because vxge_device_unregister did not free them.

Signed-off-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: Sivakumar Subramani <sivakumar.subramani@exar.com>
---
 drivers/net/vxge/vxge-main.c |   55 +++++++++++++++++++-----------------------
 1 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c
index 1ac9b56..cd0698c 100644
--- a/drivers/net/vxge/vxge-main.c
+++ b/drivers/net/vxge/vxge-main.c
@@ -3348,7 +3348,7 @@ static int __devinit vxge_device_register(struct __vxge_hw_device *hldev,
 		vxge_debug_init(VXGE_ERR,
 			"%s: vpath memory allocation failed",
 			vdev->ndev->name);
-		ret = -ENODEV;
+		ret = -ENOMEM;
 		goto _out1;
 	}
 
@@ -3369,11 +3369,11 @@ static int __devinit vxge_device_register(struct __vxge_hw_device *hldev,
 	if (vdev->config.gro_enable)
 		ndev->features |= NETIF_F_GRO;
 
-	if (register_netdev(ndev)) {
+	ret = register_netdev(ndev);
+	if (ret) {
 		vxge_debug_init(vxge_hw_device_trace_level_get(hldev),
 			"%s: %s : device registration failed!",
 			ndev->name, __func__);
-		ret = -ENODEV;
 		goto _out2;
 	}
 
@@ -3444,6 +3444,11 @@ static void vxge_device_unregister(struct __vxge_hw_device *hldev)
 	/* in 2.6 will call stop() if device is up */
 	unregister_netdev(dev);
 
+	kfree(vdev->vpaths);
+
+	/* we are safe to free it now */
+	free_netdev(dev);
+
 	vxge_debug_init(vdev->level_trace, "%s: ethernet device unregistered",
 			buf);
 	vxge_debug_entryexit(vdev->level_trace,	"%s: %s:%d  Exiting...", buf,
@@ -4334,10 +4339,10 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
 		goto _exit1;
 	}
 
-	if (pci_request_region(pdev, 0, VXGE_DRIVER_NAME)) {
+	ret = pci_request_region(pdev, 0, VXGE_DRIVER_NAME);
+	if (ret) {
 		vxge_debug_init(VXGE_ERR,
 			"%s : request regions failed", __func__);
-		ret = -ENODEV;
 		goto _exit1;
 	}
 
@@ -4642,8 +4647,9 @@ _exit6:
 _exit5:
 	vxge_device_unregister(hldev);
 _exit4:
-	pci_disable_sriov(pdev);
+	pci_set_drvdata(pdev, NULL);
 	vxge_hw_device_terminate(hldev);
+	pci_disable_sriov(pdev);
 _exit3:
 	iounmap(attr.bar0);
 _exit2:
@@ -4654,7 +4660,7 @@ _exit0:
 	kfree(ll_config);
 	kfree(device_config);
 	driver_config->config_dev_cnt--;
-	pci_set_drvdata(pdev, NULL);
+	driver_config->total_dev_cnt--;
 	return ret;
 }
 
@@ -4667,45 +4673,34 @@ _exit0:
 static void __devexit vxge_remove(struct pci_dev *pdev)
 {
 	struct __vxge_hw_device *hldev;
-	struct vxgedev *vdev = NULL;
-	struct net_device *dev;
-	int i = 0;
+	struct vxgedev *vdev;
+	int i;
 
 	hldev = pci_get_drvdata(pdev);
-
 	if (hldev == NULL)
 		return;
 
-	dev = hldev->ndev;
-	vdev = netdev_priv(dev);
+	vdev = netdev_priv(hldev->ndev);
 
 	vxge_debug_entryexit(vdev->level_trace,	"%s:%d", __func__, __LINE__);
-
 	vxge_debug_init(vdev->level_trace, "%s : removing PCI device...",
 			__func__);
-	vxge_device_unregister(hldev);
 
-	for (i = 0; i < vdev->no_of_vpath; i++) {
+	for (i = 0; i < vdev->no_of_vpath; i++)
 		vxge_free_mac_add_list(&vdev->vpaths[i]);
-		vdev->vpaths[i].mcast_addr_cnt = 0;
-		vdev->vpaths[i].mac_addr_cnt = 0;
-	}
-
-	kfree(vdev->vpaths);
 
+	vxge_device_unregister(hldev);
+	pci_set_drvdata(pdev, NULL);
+	/* Do not call pci_disable_sriov here, as it will break child devices */
+	vxge_hw_device_terminate(hldev);
 	iounmap(vdev->bar0);
-
-	/* we are safe to free it now */
-	free_netdev(dev);
+	pci_release_region(pdev, 0);
+	pci_disable_device(pdev);
+	driver_config->config_dev_cnt--;
+	driver_config->total_dev_cnt--;
 
 	vxge_debug_init(vdev->level_trace, "%s:%d Device unregistered",
 			__func__, __LINE__);
-
-	vxge_hw_device_terminate(hldev);
-
-	pci_disable_device(pdev);
-	pci_release_region(pdev, 0);
-	pci_set_drvdata(pdev, NULL);
 	vxge_debug_entryexit(vdev->level_trace,	"%s:%d  Exiting...", __func__,
 			     __LINE__);
 }
-- 
1.7.0.4
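
The "inverse order of their allocations" rule that this patch applies to the probe error path is the usual goto-unwind pattern. A minimal user-space sketch is given below, using malloc/free in place of the PCI and netdev calls. struct example_ctx, example_probe() and example_remove() are hypothetical names, and the fail_at parameter exists only so the error paths can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Three resources acquired in the order a, b, c. */
struct example_ctx {
	void *a;
	void *b;
	void *c;
};

/* fail_at selects which acquisition fails (0 = none). */
static int example_probe(struct example_ctx *ctx, int fail_at)
{
	int ret;

	assert(ctx != NULL);

	ctx->a = (fail_at == 1) ? NULL : malloc(16);
	if (!ctx->a) {
		ret = -ENOMEM;
		goto err_out;
	}

	ctx->b = (fail_at == 2) ? NULL : malloc(16);
	if (!ctx->b) {
		ret = -ENOMEM;
		goto err_free_a;
	}

	ctx->c = (fail_at == 3) ? NULL : malloc(16);
	if (!ctx->c) {
		ret = -ENOMEM;
		goto err_free_b;
	}

	return 0;

	/* Labels release resources in the reverse order of acquisition. */
err_free_b:
	free(ctx->b);
	ctx->b = NULL;
err_free_a:
	free(ctx->a);
	ctx->a = NULL;
err_out:
	return ret;
}

/* Normal teardown mirrors the same reverse order. */
static void example_remove(struct example_ctx *ctx)
{
	free(ctx->c);
	free(ctx->b);
	free(ctx->a);
	ctx->a = ctx->b = ctx->c = NULL;
}
```

Each failure jumps to the label that undoes exactly what has been acquired so far, and the real error code is propagated instead of being replaced by a fixed value.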


^ permalink raw reply related

* [PATCH 2/4] vxge: correct eprom version detection
From: Jon Mason @ 2011-01-19  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Sivakumar Subramani, Sreenivasa Honnur, Ram Vepa
In-Reply-To: <1295398942-4131-1-git-send-email-jon.mason@exar.com>

The firmware PXE EPROM version detection is failing due to passing the
wrong parameter into firmware query function.  Also, the version
printing function has an extraneous newline.

Signed-off-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: Sivakumar Subramani <sivakumar.subramani@exar.com>
---
 drivers/net/vxge/vxge-config.c |    2 +-
 drivers/net/vxge/vxge-main.c   |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxge/vxge-config.c b/drivers/net/vxge/vxge-config.c
index 01c05f5..da35562 100644
--- a/drivers/net/vxge/vxge-config.c
+++ b/drivers/net/vxge/vxge-config.c
@@ -387,8 +387,8 @@ vxge_hw_vpath_eprom_img_ver_get(struct __vxge_hw_device *hldev,
 		data1 = steer_ctrl = 0;
 
 		status = vxge_hw_vpath_fw_api(vpath,
-			VXGE_HW_RTS_ACCESS_STEER_CTRL_DATA_STRUCT_SEL_FW_MEMO,
 			VXGE_HW_FW_API_GET_EPROM_REV,
+			VXGE_HW_RTS_ACCESS_STEER_CTRL_DATA_STRUCT_SEL_FW_MEMO,
 			0, &data0, &data1, &steer_ctrl);
 		if (status != VXGE_HW_OK)
 			break;
diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c
index cd0698c..9d4b0e8 100644
--- a/drivers/net/vxge/vxge-main.c
+++ b/drivers/net/vxge/vxge-main.c
@@ -4450,7 +4450,7 @@ vxge_probe(struct pci_dev *pdev, const struct pci_device_id *pre)
 			if (!img[i].is_valid)
 				break;
 			vxge_debug_init(VXGE_TRACE, "%s: EPROM %d, version "
-					"%d.%d.%d.%d\n", VXGE_DRIVER_NAME, i,
+					"%d.%d.%d.%d", VXGE_DRIVER_NAME, i,
 					VXGE_EPROM_IMG_MAJOR(img[i].version),
 					VXGE_EPROM_IMG_MINOR(img[i].version),
 					VXGE_EPROM_IMG_FIX(img[i].version),
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 3/4] vxge: MSIX one shot mode
From: Jon Mason @ 2011-01-19  1:02 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Sivakumar Subramani, Sreenivasa Honnur, Ram Vepa,
	Masroor Vettuparambil
In-Reply-To: <1295398942-4131-1-git-send-email-jon.mason@exar.com>

To reduce the possibility of losing an interrupt in the handler due to a
race between interrupt processing and the disabling/enabling of
interrupts, enable MSIX one shot.

Also, add support for adaptive interrupt coalescing.

Signed-off-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: Masroor Vettuparambil <masroor.vettuparambil@exar.com>
---
 drivers/net/vxge/vxge-config.c  |   30 +++-----
 drivers/net/vxge/vxge-config.h  |   10 +++
 drivers/net/vxge/vxge-main.c    |  159 ++++++++++++++++++++++++++++++++++-----
 drivers/net/vxge/vxge-main.h    |   23 +++++-
 drivers/net/vxge/vxge-traffic.c |  116 ++++++++++++++++++++++++++--
 drivers/net/vxge/vxge-traffic.h |   14 +++-
 6 files changed, 302 insertions(+), 50 deletions(-)

diff --git a/drivers/net/vxge/vxge-config.c b/drivers/net/vxge/vxge-config.c
index da35562..77097e3 100644
--- a/drivers/net/vxge/vxge-config.c
+++ b/drivers/net/vxge/vxge-config.c
@@ -2868,6 +2868,8 @@ __vxge_hw_ring_create(struct __vxge_hw_vpath_handle *vp,
 	ring->rxd_init = attr->rxd_init;
 	ring->rxd_term = attr->rxd_term;
 	ring->buffer_mode = config->buffer_mode;
+	ring->tim_rti_cfg1_saved = vp->vpath->tim_rti_cfg1_saved;
+	ring->tim_rti_cfg3_saved = vp->vpath->tim_rti_cfg3_saved;
 	ring->rxds_limit = config->rxds_limit;
 
 	ring->rxd_size = vxge_hw_ring_rxd_size_get(config->buffer_mode);
@@ -3511,6 +3513,8 @@ __vxge_hw_fifo_create(struct __vxge_hw_vpath_handle *vp,
 
 	/* apply "interrupts per txdl" attribute */
 	fifo->interrupt_type = VXGE_HW_FIFO_TXD_INT_TYPE_UTILZ;
+	fifo->tim_tti_cfg1_saved = vpath->tim_tti_cfg1_saved;
+	fifo->tim_tti_cfg3_saved = vpath->tim_tti_cfg3_saved;
 
 	if (fifo->config->intr)
 		fifo->interrupt_type = VXGE_HW_FIFO_TXD_INT_TYPE_PER_LIST;
@@ -4377,6 +4381,8 @@ __vxge_hw_vpath_tim_configure(struct __vxge_hw_device *hldev, u32 vp_id)
 		}
 
 		writeq(val64, &vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_TX]);
+		vpath->tim_tti_cfg1_saved = val64;
+
 		val64 = readq(&vp_reg->tim_cfg2_int_num[VXGE_HW_VPATH_INTR_TX]);
 
 		if (config->tti.uec_a != VXGE_HW_USE_FLASH_DEFAULT) {
@@ -4433,6 +4439,7 @@ __vxge_hw_vpath_tim_configure(struct __vxge_hw_device *hldev, u32 vp_id)
 		}
 
 		writeq(val64, &vp_reg->tim_cfg3_int_num[VXGE_HW_VPATH_INTR_TX]);
+		vpath->tim_tti_cfg3_saved = val64;
 	}
 
 	if (config->ring.enable == VXGE_HW_RING_ENABLE) {
@@ -4481,6 +4488,8 @@ __vxge_hw_vpath_tim_configure(struct __vxge_hw_device *hldev, u32 vp_id)
 		}
 
 		writeq(val64, &vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_RX]);
+		vpath->tim_rti_cfg1_saved = val64;
+
 		val64 = readq(&vp_reg->tim_cfg2_int_num[VXGE_HW_VPATH_INTR_RX]);
 
 		if (config->rti.uec_a != VXGE_HW_USE_FLASH_DEFAULT) {
@@ -4537,6 +4546,7 @@ __vxge_hw_vpath_tim_configure(struct __vxge_hw_device *hldev, u32 vp_id)
 		}
 
 		writeq(val64, &vp_reg->tim_cfg3_int_num[VXGE_HW_VPATH_INTR_RX]);
+		vpath->tim_rti_cfg3_saved = val64;
 	}
 
 	val64 = 0;
@@ -4555,26 +4565,6 @@ __vxge_hw_vpath_tim_configure(struct __vxge_hw_device *hldev, u32 vp_id)
 	return status;
 }
 
-void vxge_hw_vpath_tti_ci_set(struct __vxge_hw_device *hldev, u32 vp_id)
-{
-	struct __vxge_hw_virtualpath *vpath;
-	struct vxge_hw_vpath_reg __iomem *vp_reg;
-	struct vxge_hw_vp_config *config;
-	u64 val64;
-
-	vpath = &hldev->virtual_paths[vp_id];
-	vp_reg = vpath->vp_reg;
-	config = vpath->vp_config;
-
-	if (config->fifo.enable == VXGE_HW_FIFO_ENABLE &&
-	    config->tti.timer_ci_en != VXGE_HW_TIM_TIMER_CI_ENABLE) {
-		config->tti.timer_ci_en = VXGE_HW_TIM_TIMER_CI_ENABLE;
-		val64 = readq(&vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_TX]);
-		val64 |= VXGE_HW_TIM_CFG1_INT_NUM_TIMER_CI;
-		writeq(val64, &vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_TX]);
-	}
-}
-
 /*
  * __vxge_hw_vpath_initialize
  * This routine is the final phase of init which initializes the
diff --git a/drivers/net/vxge/vxge-config.h b/drivers/net/vxge/vxge-config.h
index e249e28..3c53aa7 100644
--- a/drivers/net/vxge/vxge-config.h
+++ b/drivers/net/vxge/vxge-config.h
@@ -682,6 +682,10 @@ struct __vxge_hw_virtualpath {
 	u32				vsport_number;
 	u32				max_kdfc_db;
 	u32				max_nofl_db;
+	u64				tim_tti_cfg1_saved;
+	u64				tim_tti_cfg3_saved;
+	u64				tim_rti_cfg1_saved;
+	u64				tim_rti_cfg3_saved;
 
 	struct __vxge_hw_ring *____cacheline_aligned ringh;
 	struct __vxge_hw_fifo *____cacheline_aligned fifoh;
@@ -921,6 +925,9 @@ struct __vxge_hw_ring {
 	u32					doorbell_cnt;
 	u32					total_db_cnt;
 	u64					rxds_limit;
+	u32					rtimer;
+	u64					tim_rti_cfg1_saved;
+	u64					tim_rti_cfg3_saved;
 
 	enum vxge_hw_status (*callback)(
 			struct __vxge_hw_ring *ringh,
@@ -1000,6 +1007,9 @@ struct __vxge_hw_fifo {
 	u32					per_txdl_space;
 	u32					vp_id;
 	u32					tx_intr_num;
+	u32					rtimer;
+	u64					tim_tti_cfg1_saved;
+	u64					tim_tti_cfg3_saved;
 
 	enum vxge_hw_status (*callback)(
 			struct __vxge_hw_fifo *fifo_handle,
diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c
index 9d4b0e8..6c33eab 100644
--- a/drivers/net/vxge/vxge-main.c
+++ b/drivers/net/vxge/vxge-main.c
@@ -371,9 +371,6 @@ vxge_rx_1b_compl(struct __vxge_hw_ring *ringh, void *dtr,
 	struct vxge_hw_ring_rxd_info ext_info;
 	vxge_debug_entryexit(VXGE_TRACE, "%s: %s:%d",
 		ring->ndev->name, __func__, __LINE__);
-	ring->pkts_processed = 0;
-
-	vxge_hw_ring_replenish(ringh);
 
 	do {
 		prefetch((char *)dtr + L1_CACHE_BYTES);
@@ -1588,6 +1585,36 @@ static int vxge_reset_vpath(struct vxgedev *vdev, int vp_id)
 	return ret;
 }
 
+/* Configure CI */
+static void vxge_config_ci_for_tti_rti(struct vxgedev *vdev)
+{
+	int i = 0;
+
+	/* Enable CI for RTI */
+	if (vdev->config.intr_type == MSI_X) {
+		for (i = 0; i < vdev->no_of_vpath; i++) {
+			struct __vxge_hw_ring *hw_ring;
+
+			hw_ring = vdev->vpaths[i].ring.handle;
+			vxge_hw_vpath_dynamic_rti_ci_set(hw_ring);
+		}
+	}
+
+	/* Enable CI for TTI */
+	for (i = 0; i < vdev->no_of_vpath; i++) {
+		struct __vxge_hw_fifo *hw_fifo = vdev->vpaths[i].fifo.handle;
+		vxge_hw_vpath_tti_ci_set(hw_fifo);
+		/*
+		 * For Inta (with or without napi), Set CI ON for only one
+		 * vpath. (Have only one free running timer).
+		 */
+		if ((vdev->config.intr_type == INTA) && (i == 0))
+			break;
+	}
+
+	return;
+}
+
 static int do_vxge_reset(struct vxgedev *vdev, int event)
 {
 	enum vxge_hw_status status;
@@ -1753,6 +1780,9 @@ static int do_vxge_reset(struct vxgedev *vdev, int event)
 		netif_tx_wake_all_queues(vdev->ndev);
 	}
 
+	/* configure CI */
+	vxge_config_ci_for_tti_rti(vdev);
+
 out:
 	vxge_debug_entryexit(VXGE_TRACE,
 		"%s:%d  Exiting...", __func__, __LINE__);
@@ -1793,22 +1823,29 @@ static void vxge_reset(struct work_struct *work)
  */
 static int vxge_poll_msix(struct napi_struct *napi, int budget)
 {
-	struct vxge_ring *ring =
-		container_of(napi, struct vxge_ring, napi);
+	struct vxge_ring *ring = container_of(napi, struct vxge_ring, napi);
+	int pkts_processed;
 	int budget_org = budget;
-	ring->budget = budget;
 
+	ring->budget = budget;
+	ring->pkts_processed = 0;
 	vxge_hw_vpath_poll_rx(ring->handle);
+	pkts_processed = ring->pkts_processed;
 
 	if (ring->pkts_processed < budget_org) {
 		napi_complete(napi);
+
 		/* Re enable the Rx interrupts for the vpath */
 		vxge_hw_channel_msix_unmask(
 				(struct __vxge_hw_channel *)ring->handle,
 				ring->rx_vector_no);
+		mmiowb();
 	}
 
-	return ring->pkts_processed;
+	/* Return the local copy instead of ring->pkts_processed, in case the
+	 * interrupt fires right after the MSI-X vector is re-enabled above
+	 * and preempts this NAPI thread */
+	return pkts_processed;
 }
 
 static int vxge_poll_inta(struct napi_struct *napi, int budget)
@@ -1824,6 +1861,7 @@ static int vxge_poll_inta(struct napi_struct *napi, int budget)
 	for (i = 0; i < vdev->no_of_vpath; i++) {
 		ring = &vdev->vpaths[i].ring;
 		ring->budget = budget;
+		ring->pkts_processed = 0;
 		vxge_hw_vpath_poll_rx(ring->handle);
 		pkts_processed += ring->pkts_processed;
 		budget -= ring->pkts_processed;
@@ -2054,6 +2092,7 @@ static int vxge_open_vpaths(struct vxgedev *vdev)
 					netdev_get_tx_queue(vdev->ndev, 0);
 			vpath->fifo.indicate_max_pkts =
 				vdev->config.fifo_indicate_max_pkts;
+			vpath->fifo.tx_vector_no = 0;
 			vpath->ring.rx_vector_no = 0;
 			vpath->ring.rx_csum = vdev->rx_csum;
 			vpath->ring.rx_hwts = vdev->rx_hwts;
@@ -2079,6 +2118,61 @@ static int vxge_open_vpaths(struct vxgedev *vdev)
 	return VXGE_HW_OK;
 }
 
+/**
+ *  adaptive_coalesce_tx_interrupts - Changes the interrupt coalescing
+ *  if the interrupts are not within a range
+ *  @fifo: pointer to transmit fifo structure
+ *  Description: The function changes the boundary timer and restriction
+ *  timer values depending on the traffic
+ *  Return Value: None
+ */
+static void adaptive_coalesce_tx_interrupts(struct vxge_fifo *fifo)
+{
+	fifo->interrupt_count++;
+	if (jiffies > fifo->jiffies + HZ / 100) {
+		struct __vxge_hw_fifo *hw_fifo = fifo->handle;
+
+		fifo->jiffies = jiffies;
+		if (fifo->interrupt_count > VXGE_T1A_MAX_TX_INTERRUPT_COUNT &&
+		    hw_fifo->rtimer != VXGE_TTI_RTIMER_ADAPT_VAL) {
+			hw_fifo->rtimer = VXGE_TTI_RTIMER_ADAPT_VAL;
+			vxge_hw_vpath_dynamic_tti_rtimer_set(hw_fifo);
+		} else if (hw_fifo->rtimer != 0) {
+			hw_fifo->rtimer = 0;
+			vxge_hw_vpath_dynamic_tti_rtimer_set(hw_fifo);
+		}
+		fifo->interrupt_count = 0;
+	}
+}
+
+/**
+ *  adaptive_coalesce_rx_interrupts - Changes the interrupt coalescing
+ *  if the interrupts are not within a range
+ *  @ring: pointer to receive ring structure
+ *  Description: The function increases or decreases the packet counts within
+ *  the ranges of traffic utilization, if the interrupts due to this ring are
+ *  not within a fixed range.
+ *  Return Value: Nothing
+ */
+static void adaptive_coalesce_rx_interrupts(struct vxge_ring *ring)
+{
+	ring->interrupt_count++;
+	if (jiffies > ring->jiffies + HZ / 100) {
+		struct __vxge_hw_ring *hw_ring = ring->handle;
+
+		ring->jiffies = jiffies;
+		if (ring->interrupt_count > VXGE_T1A_MAX_INTERRUPT_COUNT &&
+		    hw_ring->rtimer != VXGE_RTI_RTIMER_ADAPT_VAL) {
+			hw_ring->rtimer = VXGE_RTI_RTIMER_ADAPT_VAL;
+			vxge_hw_vpath_dynamic_rti_rtimer_set(hw_ring);
+		} else if (hw_ring->rtimer != 0) {
+			hw_ring->rtimer = 0;
+			vxge_hw_vpath_dynamic_rti_rtimer_set(hw_ring);
+		}
+		ring->interrupt_count = 0;
+	}
+}
+
 /*
  *  vxge_isr_napi
  *  @irq: the irq of the device.
@@ -2139,24 +2233,39 @@ static irqreturn_t vxge_isr_napi(int irq, void *dev_id)
 
 #ifdef CONFIG_PCI_MSI
 
-static irqreturn_t
-vxge_tx_msix_handle(int irq, void *dev_id)
+static irqreturn_t vxge_tx_msix_handle(int irq, void *dev_id)
 {
 	struct vxge_fifo *fifo = (struct vxge_fifo *)dev_id;
 
+	adaptive_coalesce_tx_interrupts(fifo);
+
+	vxge_hw_channel_msix_mask((struct __vxge_hw_channel *)fifo->handle,
+				  fifo->tx_vector_no);
+
+	vxge_hw_channel_msix_clear((struct __vxge_hw_channel *)fifo->handle,
+				   fifo->tx_vector_no);
+
 	VXGE_COMPLETE_VPATH_TX(fifo);
 
+	vxge_hw_channel_msix_unmask((struct __vxge_hw_channel *)fifo->handle,
+				    fifo->tx_vector_no);
+
+	mmiowb();
+
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t
-vxge_rx_msix_napi_handle(int irq, void *dev_id)
+static irqreturn_t vxge_rx_msix_napi_handle(int irq, void *dev_id)
 {
 	struct vxge_ring *ring = (struct vxge_ring *)dev_id;
 
-	/* MSIX_IDX for Rx is 1 */
+	adaptive_coalesce_rx_interrupts(ring);
+
 	vxge_hw_channel_msix_mask((struct __vxge_hw_channel *)ring->handle,
-					ring->rx_vector_no);
+				  ring->rx_vector_no);
+
+	vxge_hw_channel_msix_clear((struct __vxge_hw_channel *)ring->handle,
+				   ring->rx_vector_no);
 
 	napi_schedule(&ring->napi);
 	return IRQ_HANDLED;
@@ -2173,14 +2282,20 @@ vxge_alarm_msix_handle(int irq, void *dev_id)
 		VXGE_HW_VPATH_MSIX_ACTIVE) + VXGE_ALARM_MSIX_ID;
 
 	for (i = 0; i < vdev->no_of_vpath; i++) {
+		/* Reduce the chance of losing alarm interrupts by masking
+		 * the vector. A pending bit will be set if an alarm is
+		 * generated and on unmask the interrupt will be fired.
+		 */
 		vxge_hw_vpath_msix_mask(vdev->vpaths[i].handle, msix_id);
+		vxge_hw_vpath_msix_clear(vdev->vpaths[i].handle, msix_id);
+		mmiowb();
 
 		status = vxge_hw_vpath_alarm_process(vdev->vpaths[i].handle,
 			vdev->exec_mode);
 		if (status == VXGE_HW_OK) {
-
 			vxge_hw_vpath_msix_unmask(vdev->vpaths[i].handle,
-					msix_id);
+						  msix_id);
+			mmiowb();
 			continue;
 		}
 		vxge_debug_intr(VXGE_ERR,
@@ -2299,6 +2414,9 @@ static int vxge_enable_msix(struct vxgedev *vdev)
 			vpath->ring.rx_vector_no = (vpath->device_id *
 						VXGE_HW_VPATH_MSIX_ACTIVE) + 1;
 
+			vpath->fifo.tx_vector_no = (vpath->device_id *
+						VXGE_HW_VPATH_MSIX_ACTIVE);
+
 			vxge_hw_vpath_msix_set(vpath->handle, tim_msix_id,
 					       VXGE_ALARM_MSIX_ID);
 		}
@@ -2474,8 +2592,9 @@ INTA_MODE:
 			"%s:vxge:INTA", vdev->ndev->name);
 		vxge_hw_device_set_intr_type(vdev->devh,
 			VXGE_HW_INTR_MODE_IRQLINE);
-		vxge_hw_vpath_tti_ci_set(vdev->devh,
-			vdev->vpaths[0].device_id);
+
+		vxge_hw_vpath_tti_ci_set(vdev->vpaths[0].fifo.handle);
+
 		ret = request_irq((int) vdev->pdev->irq,
 			vxge_isr_napi,
 			IRQF_SHARED, vdev->desc[0], vdev);
@@ -2745,6 +2864,10 @@ static int vxge_open(struct net_device *dev)
 	}
 
 	netif_tx_start_all_queues(vdev->ndev);
+
+	/* configure CI */
+	vxge_config_ci_for_tti_rti(vdev);
+
 	goto out0;
 
 out2:
@@ -3804,7 +3927,7 @@ static void __devinit vxge_device_config_init(
 		break;
 
 	case MSI_X:
-		device_config->intr_mode = VXGE_HW_INTR_MODE_MSIX;
+		device_config->intr_mode = VXGE_HW_INTR_MODE_MSIX_ONE_SHOT;
 		break;
 	}
 
diff --git a/drivers/net/vxge/vxge-main.h b/drivers/net/vxge/vxge-main.h
index 5746fed..40474f0 100644
--- a/drivers/net/vxge/vxge-main.h
+++ b/drivers/net/vxge/vxge-main.h
@@ -59,11 +59,13 @@
 #define VXGE_TTI_LTIMER_VAL	1000
 #define VXGE_T1A_TTI_LTIMER_VAL	80
 #define VXGE_TTI_RTIMER_VAL	0
+#define VXGE_TTI_RTIMER_ADAPT_VAL	10
 #define VXGE_T1A_TTI_RTIMER_VAL	400
 #define VXGE_RTI_BTIMER_VAL	250
 #define VXGE_RTI_LTIMER_VAL	100
 #define VXGE_RTI_RTIMER_VAL	0
-#define VXGE_FIFO_INDICATE_MAX_PKTS VXGE_DEF_FIFO_LENGTH
+#define VXGE_RTI_RTIMER_ADAPT_VAL	15
+#define VXGE_FIFO_INDICATE_MAX_PKTS	VXGE_DEF_FIFO_LENGTH
 #define VXGE_ISR_POLLING_CNT 	8
 #define VXGE_MAX_CONFIG_DEV	0xFF
 #define VXGE_EXEC_MODE_DISABLE	0
@@ -107,6 +109,14 @@
 #define RTI_T1A_RX_UFC_C	50
 #define RTI_T1A_RX_UFC_D	60
 
+/*
+ * The interrupt rate is maintained at 3k per second with the moderation
+ * parameters for most traffic but not all. This is the maximum interrupt
+ * count allowed per function with INTA or per vector in the case of
+ * MSI-X in a 10 millisecond time period. Enabled only for Titan 1A.
+ */
+#define VXGE_T1A_MAX_INTERRUPT_COUNT	100
+#define VXGE_T1A_MAX_TX_INTERRUPT_COUNT	200
 
 /* Milli secs timer period */
 #define VXGE_TIMER_DELAY		10000
@@ -247,6 +257,11 @@ struct vxge_fifo {
 	int tx_steering_type;
 	int indicate_max_pkts;
 
+	/* Adaptive interrupt moderation parameters used in T1A */
+	unsigned long interrupt_count;
+	unsigned long jiffies;
+
+	u32 tx_vector_no;
 	/* Tx stats */
 	struct vxge_fifo_stats stats;
 } ____cacheline_aligned;
@@ -271,6 +286,10 @@ struct vxge_ring {
 	 */
 	int driver_id;
 
+	/* Adaptive interrupt moderation parameters used in T1A */
+	unsigned long interrupt_count;
+	unsigned long jiffies;
+
 	/* copy of the flag indicating whether rx_csum is to be used */
 	u32 rx_csum:1,
 	    rx_hwts:1;
@@ -286,7 +305,7 @@ struct vxge_ring {
 
 	int vlan_tag_strip;
 	struct vlan_group *vlgrp;
-	int rx_vector_no;
+	u32 rx_vector_no;
 	enum vxge_hw_status last_status;
 
 	/* Rx stats */
diff --git a/drivers/net/vxge/vxge-traffic.c b/drivers/net/vxge/vxge-traffic.c
index 4c10d6c..8674f33 100644
--- a/drivers/net/vxge/vxge-traffic.c
+++ b/drivers/net/vxge/vxge-traffic.c
@@ -218,6 +218,68 @@ exit:
 	return status;
 }
 
+void vxge_hw_vpath_tti_ci_set(struct __vxge_hw_fifo *fifo)
+{
+	struct vxge_hw_vpath_reg __iomem *vp_reg;
+	struct vxge_hw_vp_config *config;
+	u64 val64;
+
+	if (fifo->config->enable != VXGE_HW_FIFO_ENABLE)
+		return;
+
+	vp_reg = fifo->vp_reg;
+	config = container_of(fifo->config, struct vxge_hw_vp_config, fifo);
+
+	if (config->tti.timer_ci_en != VXGE_HW_TIM_TIMER_CI_ENABLE) {
+		config->tti.timer_ci_en = VXGE_HW_TIM_TIMER_CI_ENABLE;
+		val64 = readq(&vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_TX]);
+		val64 |= VXGE_HW_TIM_CFG1_INT_NUM_TIMER_CI;
+		fifo->tim_tti_cfg1_saved = val64;
+		writeq(val64, &vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_TX]);
+	}
+}
+
+void vxge_hw_vpath_dynamic_rti_ci_set(struct __vxge_hw_ring *ring)
+{
+	u64 val64 = ring->tim_rti_cfg1_saved;
+
+	val64 |= VXGE_HW_TIM_CFG1_INT_NUM_TIMER_CI;
+	ring->tim_rti_cfg1_saved = val64;
+	writeq(val64, &ring->vp_reg->tim_cfg1_int_num[VXGE_HW_VPATH_INTR_RX]);
+}
+
+void vxge_hw_vpath_dynamic_tti_rtimer_set(struct __vxge_hw_fifo *fifo)
+{
+	u64 val64 = fifo->tim_tti_cfg3_saved;
+	u64 timer = (fifo->rtimer * 1000) / 272;
+
+	val64 &= ~VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_VAL(0x3ffffff);
+	if (timer)
+		val64 |= VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_VAL(timer) |
+			VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_EVENT_SF(5);
+
+	writeq(val64, &fifo->vp_reg->tim_cfg3_int_num[VXGE_HW_VPATH_INTR_TX]);
+	/* tti_cfg3_saved is not updated again because it is
+	 * initialized at one place only - init time.
+	 */
+}
+
+void vxge_hw_vpath_dynamic_rti_rtimer_set(struct __vxge_hw_ring *ring)
+{
+	u64 val64 = ring->tim_rti_cfg3_saved;
+	u64 timer = (ring->rtimer * 1000) / 272;
+
+	val64 &= ~VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_VAL(0x3ffffff);
+	if (timer)
+		val64 |= VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_VAL(timer) |
+			VXGE_HW_TIM_CFG3_INT_NUM_RTIMER_EVENT_SF(4);
+
+	writeq(val64, &ring->vp_reg->tim_cfg3_int_num[VXGE_HW_VPATH_INTR_RX]);
+	/* rti_cfg3_saved is not updated again because it is
+	 * initialized at one place only - init time.
+	 */
+}
+
 /**
  * vxge_hw_channel_msix_mask - Mask MSIX Vector.
  * @channeh: Channel for rx or tx handle
@@ -254,6 +316,23 @@ vxge_hw_channel_msix_unmask(struct __vxge_hw_channel *channel, int msix_id)
 }
 
 /**
+ * vxge_hw_channel_msix_clear - Clear the MSIX Vector.
+ * @channel: Channel for rx or tx handle
+ * @msix_id:  MSI ID
+ *
+ * The function clears the one-shot msix interrupt for the given msix_id.
+ * It is meant to be used when configured in MSIX oneshot mode.
+ *
+ * Returns: void
+ */
+void vxge_hw_channel_msix_clear(struct __vxge_hw_channel *channel, int msix_id)
+{
+	__vxge_hw_pio_mem_write32_upper(
+		(u32) vxge_bVALn(vxge_mBIT(msix_id >> 2), 0, 32),
+		&channel->common_reg->clr_msix_one_shot_vec[msix_id % 4]);
+}
+
+/**
  * vxge_hw_device_set_intr_type - Updates the configuration
  *		with new interrupt type.
  * @hldev: HW device handle.
@@ -2191,19 +2270,14 @@ vxge_hw_vpath_msix_set(struct __vxge_hw_vpath_handle *vp, int *tim_msix_id,
 	if (vpath->hldev->config.intr_mode ==
 					VXGE_HW_INTR_MODE_MSIX_ONE_SHOT) {
 		__vxge_hw_pio_mem_write32_upper((u32)vxge_bVALn(
+				VXGE_HW_ONE_SHOT_VECT0_EN_ONE_SHOT_VECT0_EN,
+				0, 32), &vp_reg->one_shot_vect0_en);
+		__vxge_hw_pio_mem_write32_upper((u32)vxge_bVALn(
 				VXGE_HW_ONE_SHOT_VECT1_EN_ONE_SHOT_VECT1_EN,
 				0, 32), &vp_reg->one_shot_vect1_en);
-	}
-
-	if (vpath->hldev->config.intr_mode ==
-		VXGE_HW_INTR_MODE_MSIX_ONE_SHOT) {
 		__vxge_hw_pio_mem_write32_upper((u32)vxge_bVALn(
 				VXGE_HW_ONE_SHOT_VECT2_EN_ONE_SHOT_VECT2_EN,
 				0, 32), &vp_reg->one_shot_vect2_en);
-
-		__vxge_hw_pio_mem_write32_upper((u32)vxge_bVALn(
-				VXGE_HW_ONE_SHOT_VECT3_EN_ONE_SHOT_VECT3_EN,
-				0, 32), &vp_reg->one_shot_vect3_en);
 	}
 }
 
@@ -2229,6 +2303,32 @@ vxge_hw_vpath_msix_mask(struct __vxge_hw_vpath_handle *vp, int msix_id)
 }
 
 /**
+ * vxge_hw_vpath_msix_clear - Clear MSIX Vector.
+ * @vp: Virtual Path handle.
+ * @msix_id:  MSI ID
+ *
+ * The function clears the msix interrupt for the given msix_id.
+ *
+ * In MSIX one-shot mode it writes clr_msix_one_shot_vec; in any other
+ * mode it writes clear_msix_mask_vect.
+ *
+ * Returns: void
+ */
+void vxge_hw_vpath_msix_clear(struct __vxge_hw_vpath_handle *vp, int msix_id)
+{
+	struct __vxge_hw_device *hldev = vp->vpath->hldev;
+
+	if (hldev->config.intr_mode == VXGE_HW_INTR_MODE_MSIX_ONE_SHOT)
+		__vxge_hw_pio_mem_write32_upper(
+			(u32) vxge_bVALn(vxge_mBIT(msix_id >> 2), 0, 32),
+			&hldev->common_reg->clr_msix_one_shot_vec[msix_id % 4]);
+	else
+		__vxge_hw_pio_mem_write32_upper(
+			(u32) vxge_bVALn(vxge_mBIT(msix_id >> 2), 0, 32),
+			&hldev->common_reg->clear_msix_mask_vect[msix_id % 4]);
+}
+
+/**
  * vxge_hw_vpath_msix_unmask - Unmask the MSIX Vector.
  * @vp: Virtual Path handle.
  * @msix_id:  MSI ID
diff --git a/drivers/net/vxge/vxge-traffic.h b/drivers/net/vxge/vxge-traffic.h
index 8c3103f..760c319 100644
--- a/drivers/net/vxge/vxge-traffic.h
+++ b/drivers/net/vxge/vxge-traffic.h
@@ -2142,6 +2142,10 @@ void vxge_hw_device_clear_tx_rx(
  *  Virtual Paths
  */
 
+void vxge_hw_vpath_dynamic_rti_rtimer_set(struct __vxge_hw_ring *ring);
+
+void vxge_hw_vpath_dynamic_tti_rtimer_set(struct __vxge_hw_fifo *fifo);
+
 u32 vxge_hw_vpath_id(
 	struct __vxge_hw_vpath_handle *vpath_handle);
 
@@ -2245,6 +2249,8 @@ void
 vxge_hw_vpath_msix_mask(struct __vxge_hw_vpath_handle *vpath_handle,
 			int msix_id);
 
+void vxge_hw_vpath_msix_clear(struct __vxge_hw_vpath_handle *vp, int msix_id);
+
 void vxge_hw_device_flush_io(struct __vxge_hw_device *devh);
 
 void
@@ -2270,6 +2276,9 @@ void
 vxge_hw_channel_msix_unmask(struct __vxge_hw_channel *channelh, int msix_id);
 
 void
+vxge_hw_channel_msix_clear(struct __vxge_hw_channel *channelh, int msix_id);
+
+void
 vxge_hw_channel_dtr_try_complete(struct __vxge_hw_channel *channel,
 				 void **dtrh);
 
@@ -2282,7 +2291,8 @@ vxge_hw_channel_dtr_free(struct __vxge_hw_channel *channel, void *dtrh);
 int
 vxge_hw_channel_dtr_count(struct __vxge_hw_channel *channel);
 
-void
-vxge_hw_vpath_tti_ci_set(struct __vxge_hw_device *hldev, u32 vp_id);
+void vxge_hw_vpath_tti_ci_set(struct __vxge_hw_fifo *fifo);
+
+void vxge_hw_vpath_dynamic_rti_ci_set(struct __vxge_hw_ring *ring);
 
 #endif
-- 
1.7.0.4
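
For readers following the clear path in the patch above: both
vxge_hw_channel_msix_clear() and vxge_hw_vpath_msix_clear() pick a register
with msix_id % 4 and a bit with msix_id >> 2. The sketch below only
illustrates that arithmetic in user space. The helpers sketch_mbit() and
sketch_bvaln() and the struct msix_slot are hypothetical names, not driver
API, and they assume the driver's bit macros number bits from the most
significant end of a 64-bit word. It is not the driver's code.

```c
#include <stdint.h>

/* Hypothetical stand-in for vxge_mBIT(): bit 'loc' counted from the MSB.
 * This MSB-first numbering is an assumption of the sketch. */
static uint64_t sketch_mbit(unsigned int loc)
{
	return 0x8000000000000000ULL >> loc;
}

/* Hypothetical stand-in for vxge_bVALn(): extract 'n' bits starting at
 * MSB-relative position 'loc'. */
static uint64_t sketch_bvaln(uint64_t bits, unsigned int loc, unsigned int n)
{
	return (bits >> (64 - (loc + n))) & ((1ULL << n) - 1);
}

/* Illustrative result type: which of the four registers is written and
 * which 32-bit value lands in its upper half. */
struct msix_slot {
	unsigned int reg_index;
	uint32_t upper_word;
};

static struct msix_slot msix_clear_slot(unsigned int msix_id)
{
	struct msix_slot s;

	s.reg_index = msix_id % 4;	/* index into the *_vec[4] array */
	s.upper_word = (uint32_t)sketch_bvaln(sketch_mbit(msix_id >> 2), 0, 32);
	return s;
}
```

Under these assumptions, vector 5 selects register index 1 and the second
bit from the top of the upper word, so vectors 1, 5, 9, ... share a register
and differ only in bit position.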



* [PATCH 4/4] vxge: update driver version
From: Jon Mason @ 2011-01-19  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Sivakumar Subramani, Sreenivasa Honnur, Ram Vepa
In-Reply-To: <1295398942-4131-1-git-send-email-jon.mason@exar.com>

Update vxge driver version to 2.5.2

Signed-off-by: Jon Mason <jon.mason@exar.com>
---
 drivers/net/vxge/vxge-version.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxge/vxge-version.h b/drivers/net/vxge/vxge-version.h
index ad2f99b..581e215 100644
--- a/drivers/net/vxge/vxge-version.h
+++ b/drivers/net/vxge/vxge-version.h
@@ -16,8 +16,8 @@
 
 #define VXGE_VERSION_MAJOR	"2"
 #define VXGE_VERSION_MINOR	"5"
-#define VXGE_VERSION_FIX	"1"
-#define VXGE_VERSION_BUILD	"22082"
+#define VXGE_VERSION_FIX	"2"
+#define VXGE_VERSION_BUILD	"22259"
 #define VXGE_VERSION_FOR	"k"
 
 #define VXGE_FW_VER(maj, min, bld) (((maj) << 16) + ((min) << 8) + (bld))
-- 
1.7.0.4

