Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: resurrecting tcphealth
From: Piotr Sawuk @ 2012-07-21 10:34 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel
In-Reply-To: <CAK6E8=dKZUJ3evPDGc3gP0a5bsBbYnL0NGhPZenB=T_t+5Kx5w@mail.gmail.com>

On Fr, 20.07.2012, 16:06, Yuchung Cheng wrote:
> On Mon, Jul 16, 2012 at 6:03 AM, Piotr Sawuk <a9702387@unet.univie.ac.at>
> wrote:
>> On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
>>> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>>>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>>>> > I am not sure if the is really necessary since the most
>>>> > of the stats are available elsewhere.
>>>>
>>>> if by "most" you mean address and port then you're right.
>>>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>>>
>>> Thats because tcphealth is wrong, it assumes HZ=1000 ?
>>>
>>> tp->srtt unit is jiffies, not ms.
>>
>> thanks. any conversion-functions in the kernel for that?
>>>
>>> tcphealth is a gross hack.
>>
>> what would you do if you tried making it less gross?
>>
>> I've not found any similar functionality, in the kernel.
>> I want to know an estimate for the percentage of data lost in tcp.
>> and I want to know that without actually sending much packets.
>> afterall I'm on the receiving end most of the time.
>> percentage of duplicate packets received is nice too.
>> you have any suggestions?
>
> counting dupack may not be as reliable as you'd like.
> say the remote sends you 100 packets and only the first one is lost,
> you'll see 99 dupacks. Morover any small degree reordering (<3)
> will generate substantial dupacks but the network is perfectly fine

I understand that.
but you must consider the difference between network-health and tcp-health.
network-health on my end I can see by looking at wlan-signal strength.
slow downloads can have many causes, some loose cable is only one possibility.

for example I once played a lan-game, 2 computers connected directly.
however, one computer was 10 times slower than the other.
so when the slow computer would act as server, the game would never start.
the reason wasn't bad connection, it was packet-loss caused by slowness.
and it had to do with the protocol being used (i.e. not tcp).

so when in tcp I get high percentage of dupack I see something's wrong.
not necessarily with the physical connection, but with protocol-handling.
the paper showed dupacks indicate spikes in network-usage.
and as we all know the bottleneck isn't the cable, it's data-processing.
when there is a spike, lots of users connecting, network itself is ok.
however, reordering and lost packets indicate something's up with the server.
and that's actually the info I'm after.

if I were the net-admin I would be interested in network-health too.
bad connection indicated by packet-loss itself means I've got to check cables.
but a user might have much wider area of interest.
the user can't do anything about the cables, but yet is interested in them.
i.e. useless info for the net-admin could be interesting for the user.
that's why I do not recommend tcphealth for servers, useless overhead.

so, if you want to judge usefulness of this patch, consider the situation:
you are powerless but interested in responsiveness of thousands of servers.
you want to learn how those servers behave at different times of a day.
isn't dupacks and dup-packets the best info on that you can possibly get?

> (see Craig Patridge's "reordering is not pathological" paper).

thanks, will look into that.

> unfortunately receiver can't and does not have to distinguish loss

true, not needed for the protocol.
on a higher level it sill can be interesting though.
most of the work for preventing packetloss must be done by the sender.
but as I said before, the receiver can do something too: avoid traffic-jams!
in a network many things are predictable, can be reprogrammed.
this way a network could become more efficient as a whole.
that's what spikes my interest in tcphealth, thinking more globally.

>  or reordering. you can infer that but it should not be kernel's job.

that's why I made it an option as opposed to what the original author did.
theoretically it should be possible to get the same functionality without it.
just read the raw network-data and emulate the work of tcp and tcphealth.
but that definitely would add a big overhead as tcp-handling is duplicated.

> there are public tools that inspect tcpdump traces to do that

good example. so to figure out dupacks you filter out the acks.
and you must somehow compare them, or you parse them the way the kernel does.
in the latter case you'll have to recreate the kernel's internal data.
definitely faster, but could result in duplicate code, that requires space.

also you should consider that not all users have privilegues for tcpdump.
and if they had, it would add another security-risk to their computer.
and you'd have to consider multiple users on one computer, using that service.
I can imagine a daemon in the background doing what tcphealth does.
that's the alternative, allows for more fine-grained security.
it could disallow spying on what connections other users have.
(of course then you'd need to remove /proc/net/tcp output too.)
but imagine the nightmare of keeping that daemon secure.
afterall it must be privilegued to read all network data.

if the kernel would provide what I'm looking for, this daemon could still run.
but then it wouldn't need that risky privilegues, could focus on other stuff.
the task of preventing users from seeing eachothers connections is enough...
>
> exposing duplicate packets received can be done via getsockopt(TCP_INFO)
> although I don't know what that gives you. the remote can be too
> aggressive in retransmission (not just because of a bad RTO) or
> the network RTT fluctuates.

TCP_INFO contains only duplicate packets *sent* (retransmits), not received!
am I missing something? can you give a code-example that can obtain such info?
if running that code in userspace results in same values as tcphealth...
well, actually dupacks is more interesting than dup-packets.
afterall in usual usage the latter will always be zero.
>
> I don't what if tracking last_ack_sent (the latest RCV.NXT) without
> knowing the ISN is useful.

so you suggest I should store and compare ISN too, for accuracy?
you think the gain in accuracy justifies the added overhead?
>
> btw the term project paper cited concludes SACK is not useful is simply
> wrong. This makes me suspicious about how rigorous and thoughtful of
> its design.

isn't my paper.
but I'd guess the usefulness of SACK is only doubted from pov of users.
remember, users connect to many servers.
if a server behaves badly, choose another one.
servers do not have such a choice, for them SACK naturally is important.

a server would just need to look at TCP_INFO to see how useful SACK is.
so I would conclude the author was quite thoughtful about users' pov.
(and quite ignorant about the servers.)

no matter how little knowledge the authors have, tcphealth is interesting.
maybe it was a random discovery by sheer luck.
the correlation between the data it provides and reality is compelling.
if we'd judge inventions by the stupidity of their inventors...

^ permalink raw reply

* Re: [PATCH v5] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-21 11:03 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, Sridhar Samudrala, David S. Miller, linux-sctp, joe
In-Reply-To: <ab3918ef-84db-413c-93d7-c3b9afe8fa87@email.android.com>

On Sat, Jul 21, 2012 at 02:45:03AM -0400, Vlad Yasevich wrote:
> Neil Horman <nhorman@tuxdriver.com> wrote:
><snip>
> >+
> >+		val.spt_pathmaxrxt = trans->pathmaxrxt;
> >+		val.spt_pathpfthld = trans->pf_retrans;
> >+	}
> >+
> >+	if (copy_to_user(optval, &val, optlen))
> >+		return -EFAULT;
> >+
> >+	return optlen;
> 
> I don't think you can simply return this.  You have to call put_user() with the value to write it back to the User.  See how other get calls are done.
> 
> -Vlad
> 
Yeah, sorry, I had assumed that the put_user for the return code was part of the
common sctp_getsockopt path and didn't go check.  Thanks, I'll respin this later
tonight.
Neil

^ permalink raw reply

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Sasha Levin @ 2012-07-21 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, kvm, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <20120720134014.GD16550@redhat.com>

On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>> -	err = init_vqs(vi);
>> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> > +		vi->has_cvq = true;
>> > +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?

^ permalink raw reply

* Re: [net-next 4/6] e1000: configure and read MDI settings
From: Ben Hutchings @ 2012-07-21 15:37 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: Jeff Kirsher, davem, netdev, gospo, sassmann, Tushar Dave
In-Reply-To: <alpine.WNT.2.00.1207201807470.14524@jbrandeb-mobl2.amr.corp.intel.com>

On Fri, 2012-07-20 at 18:17 -0700, Brandeburg, Jesse wrote:
> 
> On Fri, 20 Jul 2012, Ben Hutchings wrote:
> > Why don't you set ecmd->eth_tp_mdix_ctrl here?
> > 
> > If you also leave it as 0, it's impossible for userland to tell whether
> > the current mode was forced or automatically selected.
> 
> Thanks for the review, right now the get interface (and ethtool display) 
> doesn't support any way to report if the setting was forced or not.  I 
> didn't think about changing the get because I didn't want to modify the 
> userland reporting (I also figured it was a simple interface right now, 
> and didn't need changing, and was focused on the _set_ which is the part 
> fixing the users' reported bugs.)

Everything else you can change with ETHTOOL_SSET is also reported by
ETHTOOL_GSET; why would this be any different?

> I think the patches as they currently stand are okay, do you agree? I 
> would be glad to submit a followon to implement the new "get" interface if 
> we can hash out the interface changes, but I see no reason to gate these 
> patches.

You left this for 20 months, what's the hurry now?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH] net: fix race condition in several drivers when reading stats
From: Kevin Groeneveld @ 2012-07-21 16:30 UTC (permalink / raw)
  To: netdev; +Cc: Kevin Groeneveld

Fix race condition in several network drivers when reading stats on 32bit
UP architectures.  These drivers update their stats in a BH context and
therefore should use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh
instead of u64_stats_fetch_begin/u64_stats_fetch_retry when reading the
stats.

Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
---
 drivers/net/dummy.c                            |    4 ++--
 drivers/net/ethernet/neterion/vxge/vxge-main.c |    8 ++++----
 drivers/net/loopback.c                         |    4 ++--
 drivers/net/virtio_net.c                       |    8 ++++----
 net/bridge/br_device.c                         |    4 ++--
 5 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 9d6a067..c260af5 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -63,10 +63,10 @@ static struct rtnl_link_stats64 *dummy_get_stats64(struct net_device *dev,
 
 		dstats = per_cpu_ptr(dev->dstats, i);
 		do {
-			start = u64_stats_fetch_begin(&dstats->syncp);
+			start = u64_stats_fetch_begin_bh(&dstats->syncp);
 			tbytes = dstats->tx_bytes;
 			tpackets = dstats->tx_packets;
-		} while (u64_stats_fetch_retry(&dstats->syncp, start));
+		} while (u64_stats_fetch_retry_bh(&dstats->syncp, start));
 		stats->tx_bytes += tbytes;
 		stats->tx_packets += tpackets;
 	}
diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
index 4e20c5f..de21904 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -3131,12 +3131,12 @@ vxge_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *net_stats)
 		u64 packets, bytes, multicast;
 
 		do {
-			start = u64_stats_fetch_begin(&rxstats->syncp);
+			start = u64_stats_fetch_begin_bh(&rxstats->syncp);
 
 			packets   = rxstats->rx_frms;
 			multicast = rxstats->rx_mcast;
 			bytes     = rxstats->rx_bytes;
-		} while (u64_stats_fetch_retry(&rxstats->syncp, start));
+		} while (u64_stats_fetch_retry_bh(&rxstats->syncp, start));
 
 		net_stats->rx_packets += packets;
 		net_stats->rx_bytes += bytes;
@@ -3146,11 +3146,11 @@ vxge_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *net_stats)
 		net_stats->rx_dropped += rxstats->rx_dropped;
 
 		do {
-			start = u64_stats_fetch_begin(&txstats->syncp);
+			start = u64_stats_fetch_begin_bh(&txstats->syncp);
 
 			packets = txstats->tx_frms;
 			bytes   = txstats->tx_bytes;
-		} while (u64_stats_fetch_retry(&txstats->syncp, start));
+		} while (u64_stats_fetch_retry_bh(&txstats->syncp, start));
 
 		net_stats->tx_packets += packets;
 		net_stats->tx_bytes += bytes;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 32eb94e..e2a06fd 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -107,10 +107,10 @@ static struct rtnl_link_stats64 *loopback_get_stats64(struct net_device *dev,
 
 		lb_stats = per_cpu_ptr(dev->lstats, i);
 		do {
-			start = u64_stats_fetch_begin(&lb_stats->syncp);
+			start = u64_stats_fetch_begin_bh(&lb_stats->syncp);
 			tbytes = lb_stats->bytes;
 			tpackets = lb_stats->packets;
-		} while (u64_stats_fetch_retry(&lb_stats->syncp, start));
+		} while (u64_stats_fetch_retry_bh(&lb_stats->syncp, start));
 		bytes   += tbytes;
 		packets += tpackets;
 	}
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1db445b..83d2b0c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -704,16 +704,16 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 		u64 tpackets, tbytes, rpackets, rbytes;
 
 		do {
-			start = u64_stats_fetch_begin(&stats->tx_syncp);
+			start = u64_stats_fetch_begin_bh(&stats->tx_syncp);
 			tpackets = stats->tx_packets;
 			tbytes   = stats->tx_bytes;
-		} while (u64_stats_fetch_retry(&stats->tx_syncp, start));
+		} while (u64_stats_fetch_retry_bh(&stats->tx_syncp, start));
 
 		do {
-			start = u64_stats_fetch_begin(&stats->rx_syncp);
+			start = u64_stats_fetch_begin_bh(&stats->rx_syncp);
 			rpackets = stats->rx_packets;
 			rbytes   = stats->rx_bytes;
-		} while (u64_stats_fetch_retry(&stats->rx_syncp, start));
+		} while (u64_stats_fetch_retry_bh(&stats->rx_syncp, start));
 
 		tot->rx_packets += rpackets;
 		tot->tx_packets += tpackets;
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index f4be1bb..3334845 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -127,9 +127,9 @@ static struct rtnl_link_stats64 *br_get_stats64(struct net_device *dev,
 		const struct br_cpu_netstats *bstats
 			= per_cpu_ptr(br->stats, cpu);
 		do {
-			start = u64_stats_fetch_begin(&bstats->syncp);
+			start = u64_stats_fetch_begin_bh(&bstats->syncp);
 			memcpy(&tmp, bstats, sizeof(tmp));
-		} while (u64_stats_fetch_retry(&bstats->syncp, start));
+		} while (u64_stats_fetch_retry_bh(&bstats->syncp, start));
 		sum.tx_bytes   += tmp.tx_bytes;
 		sum.tx_packets += tmp.tx_packets;
 		sum.rx_bytes   += tmp.rx_bytes;
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] b44: add 64 bit stats
From: Kevin Groeneveld @ 2012-07-21 16:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Simon Horman, Julian Anastasov, Wensong Zhang, lvs-devel
In-Reply-To: <1342847376.2626.8162.camel@edumazet-glaptop>

On Sat, Jul 21, 2012 at 1:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> drivers/net/dummy.c
>> drivers/net/ethernet/neterion/vxge/vxge-main.c
>> drivers/net/loopback.c
>> drivers/net/virtio_net.c
>> net/bridge/br_device.c
>
> Thats right.

I just submitted a patch for these five drivers.  I don't want to mess
with the other two when I don't fully know what I am doing.


Kevin

^ permalink raw reply

* Re: [net-next PATCH v1] net: netprio_cgroup: rework update socket logic
From: John Fastabend @ 2012-07-21 17:02 UTC (permalink / raw)
  To: Neil Horman; +Cc: davem, netdev, eric.dumazet, gaofeng, lizefan
In-Reply-To: <20120721020015.GA3827@neilslaptop.think-freely.org>

On 7/20/2012 7:00 PM, Neil Horman wrote:
> On Fri, Jul 20, 2012 at 01:39:25PM -0700, John Fastabend wrote:
>> Instead of updating the sk_cgrp_prioidx struct field on every send
>> this only updates the field when a task is moved via cgroup
>> infrastructure.
>>
>> This allows sockets that may be used by a kernel worker thread
>> to be managed. For example in the iscsi case today a user can
>> put iscsid in a netprio cgroup and control traffic will be sent
>> with the correct sk_cgrp_prioidx value set but as soon as data
>> is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
>> is updated with the kernel worker threads value which is the
>> default case.
>>
>> It seems more correct to only update the field when the user
>> explicitly sets it via control group infrastructure. This allows
>> the users to manage sockets that may be used with other threads.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> I like the idea, but IIRC last time we tried this I think it caused problems
> with processes that shared sockets.  That is to say, if you have a parent and
> child process that dup an socket descriptior, and put them in separate cgroups,
> you get unpredictable results, as the socket gets assigned a priority based on
> the last processed that moved cgroups.
>
> Neil
>

Shared sockets creates strange behavior as it exists today. If a dup
of the socket fd is created the private data is still shared right. So
in this case the sk_cgrp_prioidx value is going to get updated by both
threads and then it is a race to see what it happens to be set to in
the xmit path.

With this patch at least the behavior is deterministic. Without it
I can create the above scenario but have no way to determine what the
skb priority will actually be set to.

.John

^ permalink raw reply

* Re: [net-next PATCH v1] net: netprio_cgroup: rework update socket logic
From: Neil Horman @ 2012-07-21 17:18 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, netdev, eric.dumazet, gaofeng, lizefan
In-Reply-To: <500AE08B.5040602@intel.com>

On Sat, Jul 21, 2012 at 10:02:03AM -0700, John Fastabend wrote:
> On 7/20/2012 7:00 PM, Neil Horman wrote:
> >On Fri, Jul 20, 2012 at 01:39:25PM -0700, John Fastabend wrote:
> >>Instead of updating the sk_cgrp_prioidx struct field on every send
> >>this only updates the field when a task is moved via cgroup
> >>infrastructure.
> >>
> >>This allows sockets that may be used by a kernel worker thread
> >>to be managed. For example in the iscsi case today a user can
> >>put iscsid in a netprio cgroup and control traffic will be sent
> >>with the correct sk_cgrp_prioidx value set but as soon as data
> >>is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
> >>is updated with the kernel worker threads value which is the
> >>default case.
> >>
> >>It seems more correct to only update the field when the user
> >>explicitly sets it via control group infrastructure. This allows
> >>the users to manage sockets that may be used with other threads.
> >>
> >>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >I like the idea, but IIRC last time we tried this I think it caused problems
> >with processes that shared sockets.  That is to say, if you have a parent and
> >child process that dup an socket descriptior, and put them in separate cgroups,
> >you get unpredictable results, as the socket gets assigned a priority based on
> >the last processed that moved cgroups.
> >
> >Neil
> >
> 
> Shared sockets creates strange behavior as it exists today. If a dup
> of the socket fd is created the private data is still shared right. So
> in this case the sk_cgrp_prioidx value is going to get updated by both
> threads and then it is a race to see what it happens to be set to in
> the xmit path.
> 
> With this patch at least the behavior is deterministic. Without it
> I can create the above scenario but have no way to determine what the
> skb priority will actually be set to.
> 
> .John
> 
Ok, I can buy that.  Lets give this a try:

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [net-next 4/6] e1000: configure and read MDI settings
From: Jeff Kirsher @ 2012-07-21 17:37 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Brandeburg, Jesse, davem, netdev, gospo, sassmann, Tushar Dave
In-Reply-To: <1342885066.11373.126.camel@deadeye.wl.decadent.org.uk>

[-- Attachment #1: Type: text/plain, Size: 1448 bytes --]

On Sat, 2012-07-21 at 16:37 +0100, Ben Hutchings wrote:
> 
> On Fri, 2012-07-20 at 18:17 -0700, Brandeburg, Jesse wrote:
> > 
> > On Fri, 20 Jul 2012, Ben Hutchings wrote:
> > > Why don't you set ecmd->eth_tp_mdix_ctrl here?
> > > 
> > > If you also leave it as 0, it's impossible for userland to tell
> whether
> > > the current mode was forced or automatically selected.
> > 
> > Thanks for the review, right now the get interface (and ethtool
> display) 
> > doesn't support any way to report if the setting was forced or not.
> I 
> > didn't think about changing the get because I didn't want to modify
> the 
> > userland reporting (I also figured it was a simple interface right
> now, 
> > and didn't need changing, and was focused on the _set_ which is the
> part 
> > fixing the users' reported bugs.)
> 
> Everything else you can change with ETHTOOL_SSET is also reported by
> ETHTOOL_GSET; why would this be any different?
> 
> > I think the patches as they currently stand are okay, do you agree?
> I 
> > would be glad to submit a followon to implement the new "get"
> interface if 
> > we can hash out the interface changes, but I see no reason to gate
> these 
> > patches.
> 
> You left this for 20 months, what's the hurry now? 

Since there some changes that are needed in this patch set, I will drop
this series from my tree so that I can continue pushing additional
ixgbe/ixgbevf patches.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [net-next 0/6][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-07-21 17:37 UTC (permalink / raw)
  To: davem@davemloft.net
  Cc: netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com
In-Reply-To: <1342820631-19738-1-git-send-email-jeffrey.t.kirsher@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

On Fri, 2012-07-20 at 14:43 -0700, Kirsher, Jeffrey T wrote:
> This series contains updates to ethtool, e1000, e1000e and igb with
> regards to the new MDI ethtool support patches submitted earlier.
> 
> The following are changes since commit fa0afcd10951afad2022dda09777d2bf70cdab3d:
>   atl1c: fix issue of io access mode for AR8152 v2.1
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
> 
> Bruce W Allan (1):
>   e1000e: implement 82577/579 MDI setting support
> 
> Jesse Brandeburg (5):
>   ethtool.h: MDI setting support
>   igb: implement 580 MDI setting support
>   e1000: configure and read MDI settings
>   e1000e: implement MDI/MDI-X control
>   igb: update to allow reading/setting MDI state
> 
>  drivers/net/ethernet/intel/e1000/e1000_ethtool.c |   34 +++++++++++++++++++
>  drivers/net/ethernet/intel/e1000/e1000_main.c    |    4 +++
>  drivers/net/ethernet/intel/e1000e/ethtool.c      |   39 +++++++++++++++++++---
>  drivers/net/ethernet/intel/e1000e/phy.c          |   31 +++++++++++++++--
>  drivers/net/ethernet/intel/igb/e1000_phy.c       |   29 ++++++++++++++--
>  drivers/net/ethernet/intel/igb/e1000_phy.h       |    5 +--
>  drivers/net/ethernet/intel/igb/igb_ethtool.c     |   37 ++++++++++++++++++++
>  drivers/net/ethernet/intel/igb/igb_main.c        |    4 +++
>  include/linux/ethtool.h                          |   17 ++++++----
>  9 files changed, 184 insertions(+), 16 deletions(-)
> 
> --
> 1.7.10.4
> 

Since there some changes that are needed in this patch set, I will drop
this series from my tree so that I can continue pushing additional
ixgbe/ixgbevf patches.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH v6] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-21 17:56 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
	linux-sctp, joe
In-Reply-To: <1342203998-24037-1-git-send-email-nhorman@tuxdriver.com>

I've seen several attempts recently made to do quick failover of sctp transports
by reducing various retransmit timers and counters.  While its possible to
implement a faster failover on multihomed sctp associations, its not
particularly robust, in that it can lead to unneeded retransmits, as well as
false connection failures due to intermittent latency on a network.

Instead, lets implement the new ietf quick failover draft found here:
http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

This will let the sctp stack identify transports that have had a small number of
errors, and avoid using them quickly until their reliability can be
re-established.  I've tested this out on two virt guests connected via multiple
isolated virt networks and believe its in compliance with the above draft and
works well.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: linux-sctp@vger.kernel.org
CC: joe@perches.com

---
Change notes:

V2)
- Added socket option API from section 6.1 of the specification, as per
request from Vlad. Adding this socket option allows us to alter both the path
maximum retransmit value and the path partial failure threshold for each
transport and the association as a whole.

- Added a per transport pf_retrans value, and initialized it from the
association value.  This makes each transport independently configurable as per
the socket option above, and prevents changes in the sysctl from bleeding into
an already created association.

V3)
- Cleaned up some line spacing (Joe Perches)
- Fixed some socket option user data sanitization (Vlad Yasevich)

V4)
- Added additional documentation (Flavio Leitner)

V5)
- Modified setsockopt option to ignore 0 pathmaxrxt rather than return
  error (Vlad Yasevich)
- Modified getsocopt to return option length written (Vlad Y.)

V6)
- Fixed stupid mistake about returning optval incorrectly (Vlad Y)
---
 Documentation/networking/ip-sysctl.txt |   14 +++++
 include/net/sctp/constants.h           |    1 +
 include/net/sctp/structs.h             |   20 ++++++-
 include/net/sctp/user.h                |   11 ++++
 net/sctp/associola.c                   |   37 +++++++++--
 net/sctp/outqueue.c                    |    6 +-
 net/sctp/sm_sideeffect.c               |   33 +++++++++-
 net/sctp/socket.c                      |  101 ++++++++++++++++++++++++++++++++
 net/sctp/sysctl.c                      |    9 +++
 net/sctp/transport.c                   |    4 +-
 10 files changed, 221 insertions(+), 15 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 47b6c79..c636f9c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1408,6 +1408,20 @@ path_max_retrans - INTEGER
 
 	Default: 5
 
+pf_retrans - INTEGER
+	The number of retransmissions that will be attempted on a given path
+	before traffic is redirected to an alternate transport (should one
+	exist).  Note this is distinct from path_max_retrans, as a path that
+	passes the pf_retrans threshold can still be used.  Its only
+	deprioritized when a transmission path is selected by the stack.  This
+	setting is primarily used to enable fast failover mechanisms without
+	having to reduce path_max_retrans to a very low value.  See:
+	http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+	for details.  Note also that a value of pf_retrans > path_max_retrans
+	disables this feature
+
+	Default: 0
+
 rto_initial - INTEGER
 	The initial round trip timeout value in milliseconds that will be used
 	in calculating round trip times.  This is the initial time interval
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 942b864..d053d2e 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -334,6 +334,7 @@ typedef enum {
 typedef enum {
 	SCTP_TRANSPORT_UP,
 	SCTP_TRANSPORT_DOWN,
+	SCTP_TRANSPORT_PF,
 } sctp_transport_cmd_t;
 
 /* These are the address scopes defined mainly for IPv4 addresses
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index e4652fe..cee0678 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -161,6 +161,12 @@ extern struct sctp_globals {
 	int max_retrans_path;
 	int max_retrans_init;
 
+	/* Potentially-Failed.Max.Retrans sysctl value
+	 * taken from:
+	 * http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
+	 */
+	int pf_retrans;
+
 	/*
 	 * Policy for preforming sctp/socket accounting
 	 * 0   - do socket level accounting, all assocs share sk_sndbuf
@@ -258,6 +264,7 @@ extern struct sctp_globals {
 #define sctp_sndbuf_policy	 	(sctp_globals.sndbuf_policy)
 #define sctp_rcvbuf_policy	 	(sctp_globals.rcvbuf_policy)
 #define sctp_max_retrans_path		(sctp_globals.max_retrans_path)
+#define sctp_pf_retrans			(sctp_globals.pf_retrans)
 #define sctp_max_retrans_init		(sctp_globals.max_retrans_init)
 #define sctp_sack_timeout		(sctp_globals.sack_timeout)
 #define sctp_hb_interval		(sctp_globals.hb_interval)
@@ -987,10 +994,15 @@ struct sctp_transport {
 
 	/* This is the max_retrans value for the transport and will
 	 * be initialized from the assocs value.  This can be changed
-	 * using SCTP_SET_PEER_ADDR_PARAMS socket option.
+	 * using the SCTP_SET_PEER_ADDR_PARAMS socket option.
 	 */
 	__u16 pathmaxrxt;
 
+	/* This is the partially failed retrans value for the transport
+	 * and will be initialized from the assocs value.  This can be changed
+	 * using the SCTP_PEER_ADDR_THLDS socket option
+	 */
+	int pf_retrans;
 	/* PMTU	      : The current known path MTU.  */
 	__u32 pathmtu;
 
@@ -1660,6 +1672,12 @@ struct sctp_association {
 	 */
 	int max_retrans;
 
+	/* This is the partially failed retrans value for the transport
+	 * and will be initialized from the assocs value.  This can be
+	 * changed using the SCTP_PEER_ADDR_THLDS socket option
+	 */
+	int pf_retrans;
+
 	/* Maximum number of times the endpoint will retransmit INIT  */
 	__u16 max_init_attempts;
 
diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
index 0842ef0..1b02d7a 100644
--- a/include/net/sctp/user.h
+++ b/include/net/sctp/user.h
@@ -93,6 +93,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_GET_ASSOC_NUMBER	28	/* Read only */
 #define SCTP_GET_ASSOC_ID_LIST	29	/* Read only */
 #define SCTP_AUTO_ASCONF       30
+#define SCTP_PEER_ADDR_THLDS	31
 
 /* Internal Socket Options. Some of the sctp library functions are
  * implemented using these socket options.
@@ -649,6 +650,7 @@ struct sctp_paddrinfo {
  */
 enum sctp_spinfo_state {
 	SCTP_INACTIVE,
+	SCTP_PF,
 	SCTP_ACTIVE,
 	SCTP_UNCONFIRMED,
 	SCTP_UNKNOWN = 0xffff  /* Value used for transport state unknown */
@@ -741,4 +743,13 @@ typedef struct {
 	int sd;
 } sctp_peeloff_arg_t;
 
+/*
+ *  Peer Address Thresholds socket option
+ */
+struct sctp_paddrthlds {
+	sctp_assoc_t spt_assoc_id;
+	struct sockaddr_storage spt_address;
+	__u16 spt_pathmaxrxt;
+	__u16 spt_pathpfthld;
+};
 #endif /* __net_sctp_user_h__ */
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 5bc9ab1..90fe36b 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -124,6 +124,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
 	 * socket values.
 	 */
 	asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
+	asoc->pf_retrans  = sctp_pf_retrans;
+
 	asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
 	asoc->rto_max = msecs_to_jiffies(sp->rtoinfo.srto_max);
 	asoc->rto_min = msecs_to_jiffies(sp->rtoinfo.srto_min);
@@ -685,6 +687,9 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
 	/* Set the path max_retrans.  */
 	peer->pathmaxrxt = asoc->pathmaxrxt;
 
+	/* And the partial failure retrnas threshold */
+	peer->pf_retrans = asoc->pf_retrans;
+
 	/* Initialize the peer's SACK delay timeout based on the
 	 * association configured value.
 	 */
@@ -840,6 +845,7 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 	struct sctp_ulpevent *event;
 	struct sockaddr_storage addr;
 	int spc_state = 0;
+	bool ulp_notify = true;
 
 	/* Record the transition on the transport.  */
 	switch (command) {
@@ -853,6 +859,14 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 			spc_state = SCTP_ADDR_CONFIRMED;
 		else
 			spc_state = SCTP_ADDR_AVAILABLE;
+		/* Don't inform ULP about transition from PF to
+		 * active state and set cwnd to 1, see SCTP
+		 * Quick failover draft section 5.1, point 5
+		 */
+		if (transport->state == SCTP_PF) {
+			ulp_notify = false;
+			transport->cwnd = 1;
+		}
 		transport->state = SCTP_ACTIVE;
 		break;
 
@@ -871,6 +885,11 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 		spc_state = SCTP_ADDR_UNREACHABLE;
 		break;
 
+	case SCTP_TRANSPORT_PF:
+		transport->state = SCTP_PF;
+		ulp_notify = false;
+		break;
+
 	default:
 		return;
 	}
@@ -878,12 +897,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 	/* Generate and send a SCTP_PEER_ADDR_CHANGE notification to the
 	 * user.
 	 */
-	memset(&addr, 0, sizeof(struct sockaddr_storage));
-	memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
-	event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
-				0, spc_state, error, GFP_ATOMIC);
-	if (event)
-		sctp_ulpq_tail_event(&asoc->ulpq, event);
+	if (ulp_notify) {
+		memset(&addr, 0, sizeof(struct sockaddr_storage));
+		memcpy(&addr, &transport->ipaddr,
+		       transport->af_specific->sockaddr_len);
+		event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
+					0, spc_state, error, GFP_ATOMIC);
+		if (event)
+			sctp_ulpq_tail_event(&asoc->ulpq, event);
+	}
 
 	/* Select new active and retran paths. */
 
@@ -899,7 +921,8 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 			transports) {
 
 		if ((t->state == SCTP_INACTIVE) ||
-		    (t->state == SCTP_UNCONFIRMED))
+		    (t->state == SCTP_UNCONFIRMED) ||
+		    (t->state == SCTP_PF))
 			continue;
 		if (!first || t->last_time_heard > first->last_time_heard) {
 			second = first;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index a0fa19f..e7aa177c 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -792,7 +792,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 			if (!new_transport)
 				new_transport = asoc->peer.active_path;
 		} else if ((new_transport->state == SCTP_INACTIVE) ||
-			   (new_transport->state == SCTP_UNCONFIRMED)) {
+			   (new_transport->state == SCTP_UNCONFIRMED) ||
+			   (new_transport->state == SCTP_PF)) {
 			/* If the chunk is Heartbeat or Heartbeat Ack,
 			 * send it to chunk->transport, even if it's
 			 * inactive.
@@ -987,7 +988,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 			new_transport = chunk->transport;
 			if (!new_transport ||
 			    ((new_transport->state == SCTP_INACTIVE) ||
-			     (new_transport->state == SCTP_UNCONFIRMED)))
+			     (new_transport->state == SCTP_UNCONFIRMED) ||
+			     (new_transport->state == SCTP_PF)))
 				new_transport = asoc->peer.active_path;
 			if (new_transport->state == SCTP_UNCONFIRMED)
 				continue;
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index c96d1a8..285e26a 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -76,6 +76,8 @@ static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t subtype,
 			     sctp_cmd_seq_t *commands,
 			     gfp_t gfp);
 
+static void sctp_cmd_hb_timer_update(sctp_cmd_seq_t *cmds,
+				     struct sctp_transport *t);
 /********************************************************************
  * Helper functions
  ********************************************************************/
@@ -470,7 +472,8 @@ sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES] = {
  * notification SHOULD be sent to the upper layer.
  *
  */
-static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
+static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t *commands,
+					 struct sctp_association *asoc,
 					 struct sctp_transport *transport,
 					 int is_hb)
 {
@@ -495,6 +498,23 @@ static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
 			transport->error_count++;
 	}
 
+	/* If the transport error count is greater than the pf_retrans
+	 * threshold, and less than pathmaxrtx, then mark this transport
+	 * as Partially Failed, ee SCTP Quick Failover Draft, secon 5.1,
+	 * point 1
+	 */
+	if ((transport->state != SCTP_PF) &&
+	   (asoc->pf_retrans < transport->pathmaxrxt) &&
+	   (transport->error_count > asoc->pf_retrans)) {
+
+		sctp_assoc_control_transport(asoc, transport,
+					     SCTP_TRANSPORT_PF,
+					     0);
+
+		/* Update the hb timer to resend a heartbeat every rto */
+		sctp_cmd_hb_timer_update(commands, transport);
+	}
+
 	if (transport->state != SCTP_INACTIVE &&
 	    (transport->error_count > transport->pathmaxrxt)) {
 		SCTP_DEBUG_PRINTK_IPADDR("transport_strike:association %p",
@@ -699,6 +719,10 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
 					     SCTP_HEARTBEAT_SUCCESS);
 	}
 
+	if (t->state == SCTP_PF)
+		sctp_assoc_control_transport(asoc, t, SCTP_TRANSPORT_UP,
+					     SCTP_HEARTBEAT_SUCCESS);
+
 	/* The receiver of the HEARTBEAT ACK should also perform an
 	 * RTT measurement for that destination transport address
 	 * using the time value carried in the HEARTBEAT ACK chunk.
@@ -1565,8 +1589,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 
 		case SCTP_CMD_STRIKE:
 			/* Mark one strike against a transport.  */
-			sctp_do_8_2_transport_strike(asoc, cmd->obj.transport,
-						    0);
+			sctp_do_8_2_transport_strike(commands, asoc,
+						    cmd->obj.transport, 0);
 			break;
 
 		case SCTP_CMD_TRANSPORT_IDLE:
@@ -1576,7 +1600,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 
 		case SCTP_CMD_TRANSPORT_HB_SENT:
 			t = cmd->obj.transport;
-			sctp_do_8_2_transport_strike(asoc, t, 1);
+			sctp_do_8_2_transport_strike(commands, asoc,
+						     t, 1);
 			t->hb_sent = 1;
 			break;
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..8665b81 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3470,6 +3470,56 @@ static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
 }
 
 
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to alter the partially failed threshold for one or all
+ * transports in an association.  See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
+					    char __user *optval,
+					    unsigned int optlen)
+{
+	struct sctp_paddrthlds val;
+	struct sctp_transport *trans;
+	struct sctp_association *asoc;
+
+	if (optlen < sizeof(struct sctp_paddrthlds))
+		return -EINVAL;
+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
+			   sizeof(struct sctp_paddrthlds)))
+		return -EFAULT;
+
+
+	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+		if (!asoc)
+			return -ENOENT;
+		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
+				    transports) {
+			if (val.spt_pathmaxrxt)
+				trans->pathmaxrxt = val.spt_pathmaxrxt;
+			trans->pf_retrans = val.spt_pathpfthld;
+		}
+
+		if (val.spt_pathmaxrxt)
+			asoc->pathmaxrxt = val.spt_pathmaxrxt;
+		asoc->pf_retrans = val.spt_pathpfthld;
+	} else {
+		trans = sctp_addr_id2transport(sk, &val.spt_address,
+					       val.spt_assoc_id);
+		if (!trans)
+			return -ENOENT;
+
+		if (val.spt_pathmaxrxt)
+			trans->pathmaxrxt = val.spt_pathmaxrxt;
+		trans->pf_retrans = val.spt_pathpfthld;
+	}
+
+	return 0;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -3619,6 +3669,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_AUTO_ASCONF:
 		retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
 		break;
+	case SCTP_PEER_ADDR_THLDS:
+		retval = sctp_setsockopt_paddr_thresholds(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -5490,6 +5543,51 @@ static int sctp_getsockopt_assoc_ids(struct sock *sk, int len,
 	return 0;
 }
 
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to fetch the partially failed threshold for one or all
+ * transports in an association.  See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
+					    char __user *optval,
+					    int len,
+					    int __user *optlen)
+{
+	struct sctp_paddrthlds val;
+	struct sctp_transport *trans;
+	struct sctp_association *asoc;
+
+	if (len < sizeof(struct sctp_paddrthlds))
+		return -EINVAL;
+	len = sizeof(struct sctp_paddrthlds);
+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, len))
+		return -EFAULT;
+
+	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+		if (!asoc)
+			return -ENOENT;
+
+		val.spt_pathpfthld = asoc->pf_retrans;
+		val.spt_pathmaxrxt = asoc->pathmaxrxt;
+	} else {
+		trans = sctp_addr_id2transport(sk, &val.spt_address,
+					       val.spt_assoc_id);
+		if (!trans)
+			return -ENOENT;
+
+		val.spt_pathmaxrxt = trans->pathmaxrxt;
+		val.spt_pathpfthld = trans->pf_retrans;
+	}
+
+	if (put_user(len, optlen) || copy_to_user(optval, &val, len))
+		return -EFAULT;
+
+	return 0;
+}
+
 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 				char __user *optval, int __user *optlen)
 {
@@ -5628,6 +5726,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 	case SCTP_AUTO_ASCONF:
 		retval = sctp_getsockopt_auto_asconf(sk, len, optval, optlen);
 		break;
+	case SCTP_PEER_ADDR_THLDS:
+		retval = sctp_getsockopt_paddr_thresholds(sk, optval, len, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index e5fe639..2b2bfe9 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -141,6 +141,15 @@ static ctl_table sctp_table[] = {
 		.extra2		= &int_max
 	},
 	{
+		.procname	= "pf_retrans",
+		.data		= &sctp_pf_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &int_max
+	},
+	{
 		.procname	= "max_init_retransmits",
 		.data		= &sctp_max_retrans_init,
 		.maxlen		= sizeof(int),
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index b026ba0..194d0f3 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -85,6 +85,7 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
 
 	/* Initialize the default path max_retrans.  */
 	peer->pathmaxrxt  = sctp_max_retrans_path;
+	peer->pf_retrans  = sctp_pf_retrans;
 
 	INIT_LIST_HEAD(&peer->transmitted);
 	INIT_LIST_HEAD(&peer->send_ready);
@@ -585,7 +586,8 @@ unsigned long sctp_transport_timeout(struct sctp_transport *t)
 {
 	unsigned long timeout;
 	timeout = t->rto + sctp_jitter(t->rto);
-	if (t->state != SCTP_UNCONFIRMED)
+	if ((t->state != SCTP_UNCONFIRMED) &&
+	    (t->state != SCTP_PF))
 		timeout += t->hbinterval;
 	timeout += jiffies;
 	return timeout;
-- 
1.7.7.6

^ permalink raw reply related

* Re: how routing table maintained
From: Jan Ceuleers @ 2012-07-21 17:58 UTC (permalink / raw)
  To: BALAKUMARAN KANNAN; +Cc: netdev@vger.kernel.org
In-Reply-To: <4A71D24947E78D43BC584A7CD4391A41017DD9AB@SIXPRD0410MB359.apcprd04.prod.outlook.com>

On 07/19/2012 02:41 PM, BALAKUMARAN KANNAN wrote:
> I want to know how routing table and routing cache is maintained.
> I can find some entries in cache even after the route is deleted
> from the routing table. Is it possible to delete an entry(the
> particular entry or flush the entire cache) from the cache once
> it is deleted from the main routing table.

Kannan,

Please be aware that David Miller is on a quest to obliterate the route cache. He has recently sent patches to do just that.

HTH, Jan

^ permalink raw reply

* [PATCH] Fix divide zero crash when xmit with no enabled port
From: ISHIKAWA Mutsumi @ 2012-07-21 20:30 UTC (permalink / raw)
  To: jpirko; +Cc: netdev


hash calculation in lb_transmit() cause divide zero crash when
xmit on teaming loadbalance mode with no team member port is enabled
(this situation means team->en_port_count = 0). Add check
team->en_port_count is not 0.

---
 drivers/net/team/team_mode_loadbalance.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team_mode_loadbalance.c b/drivers/net/team/team_mode_loadbalance.c
index 86e8183..7b878d5 100644
--- a/drivers/net/team/team_mode_loadbalance.c
+++ b/drivers/net/team/team_mode_loadbalance.c
@@ -38,6 +38,8 @@ static bool lb_transmit(struct team *team, struct sk_buff *skb)
 	if (unlikely(!fp))
 		goto drop;
 	hash = SK_RUN_FILTER(fp, skb);
+	if (team->en_port_count < 1)
+		goto drop;
 	port_index = hash % team->en_port_count;
 	port = team_get_port_by_index_rcu(team, port_index);
 	if (unlikely(!port))
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] wimax: fix printk format warnings
From: Randy Dunlap @ 2012-07-21 20:54 UTC (permalink / raw)
  To: netdev, David Miller
  Cc: Andrew Morton, Geert Uytterhoeven, wimax, Inaky Perez-Gonzalez,
	linux-wimax

From: Randy Dunlap <rdunlap@xenotime.net>

Fix printk format warnings in drivers/net/wimax/i2400m:

drivers/net/wimax/i2400m/control.c: warning: format '%zu' expects argument of type 'size_t', but argument 4 has type 'ssize_t' [-Wformat]
drivers/net/wimax/i2400m/control.c: warning: format '%zu' expects argument of type 'size_t', but argument 5 has type 'ssize_t' [-Wformat]
drivers/net/wimax/i2400m/usb-fw.c: warning: format '%zu' expects argument of type 'size_t', but argument 4 has type 'ssize_t' [-Wformat]

I don't see these warnings on x86.  The warnings that are quoted above
are from Geert's kernel build reports.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc:	Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Cc:	linux-wimax@intel.com
Cc:	wimax@linuxwimax.org
---
 drivers/net/wimax/i2400m/control.c |    4 ++--
 drivers/net/wimax/i2400m/usb-fw.c  |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

--- lnx-35-rc7.orig/drivers/net/wimax/i2400m/control.c
+++ lnx-35-rc7/drivers/net/wimax/i2400m/control.c
@@ -130,7 +130,7 @@ ssize_t i2400m_tlv_match(const struct i2
 	    && le16_to_cpu(tlv->length) + sizeof(*tlv) != tlv_size) {
 		size_t size = le16_to_cpu(tlv->length) + sizeof(*tlv);
 		printk(KERN_WARNING "W: tlv type 0x%x mismatched because of "
-		       "size (got %zu vs %zu expected)\n",
+		       "size (got %zu vs %zd expected)\n",
 		       tlv_type, size, tlv_size);
 		return size;
 	}
@@ -235,7 +235,7 @@ const struct i2400m_tlv_hdr *i2400m_tlv_
 			break;
 		if (match > 0)
 			dev_warn(dev, "TLV type 0x%04x found with size "
-				 "mismatch (%zu vs %zu needed)\n",
+				 "mismatch (%zu vs %zd needed)\n",
 				 tlv_type, match, tlv_size);
 	}
 	return tlv;
--- lnx-35-rc7.orig/drivers/net/wimax/i2400m/usb-fw.c
+++ lnx-35-rc7/drivers/net/wimax/i2400m/usb-fw.c
@@ -212,7 +212,7 @@ ssize_t i2400mu_bus_bm_cmd_send(struct i
 	}
 	if (result != cmd_size) {		/* all was transferred? */
 		dev_err(dev, "boot-mode cmd %d: incomplete transfer "
-			"(%zu vs %zu submitted)\n",  opcode, result, cmd_size);
+			"(%zd vs %zu submitted)\n",  opcode, result, cmd_size);
 		result = -EIO;
 		goto error_cmd_size;
 	}

^ permalink raw reply

* Re: [PATCH] wimax: fix printk format warnings
From: Geert Uytterhoeven @ 2012-07-21 21:02 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: wimax, netdev, Inaky Perez-Gonzalez, linux-wimax, Andrew Morton,
	David Miller
In-Reply-To: <500B170B.4000000@xenotime.net>

On Sat, Jul 21, 2012 at 10:54 PM, Randy Dunlap <rdunlap@xenotime.net> wrote:
> Fix printk format warnings in drivers/net/wimax/i2400m:
>
> drivers/net/wimax/i2400m/control.c: warning: format '%zu' expects argument of type 'size_t', but argument 4 has type 'ssize_t' [-Wformat]
> drivers/net/wimax/i2400m/control.c: warning: format '%zu' expects argument of type 'size_t', but argument 5 has type 'ssize_t' [-Wformat]
> drivers/net/wimax/i2400m/usb-fw.c: warning: format '%zu' expects argument of type 'size_t', but argument 4 has type 'ssize_t' [-Wformat]
>
> I don't see these warnings on x86.  The warnings that are quoted above
> are from Geert's kernel build reports.

Yeah, while technically they are correct (size_t and ssize_t differ in
signedness),
they only happen for cris, which has different base types for (s)size_t
(size_t = unsigned int, ssize_t = long), which is actually incorrect. Still
on my list of things to fix...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCHv3 0/6] tun zerocopy support
From: Michael S. Tsirkin @ 2012-07-21 22:05 UTC (permalink / raw)
  To: David Miller
  Cc: jasowang, eric.dumazet, netdev, linux-kernel, ebiederm,
	Ian Campbell
In-Reply-To: <20120720.174902.2055189237500355771.davem@davemloft.net>

On Fri, Jul 20, 2012 at 05:49:02PM -0700, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Fri, 20 Jul 2012 22:23:03 +0300
> 
> > Same as with macvtap, I get single-percentage wins in CPU utilization
> > on guest to external from this patchset, and a performance regression on
> > guest to host, so more work is needed until this feature can move out of
> > experimental status, but I think it's useful for some people already.
> > 
> > Pls review and consider for 3.6.
> 
> It doesn't improve performance in one case, and hurts performance in
> another.
> 
> You'll have to give me a more compelling argument than that.  You've
> just given me every reason not to include these patches in 3.6

OK let me clarify a bit, I think this wasn't explained well:
it's not true that this makes users suffer :)

This patch has no effect unless experimental zero copy mode in vhost-net
is enabled, and it is very clearly marked as experimental.

I agree a small win in CPU use is nothing to write home about,
I don't yet understand why the win is so small - macvtap has zero copy
supported for a while and it has exactly same issues.
I hope adding tun zerocopy support upstream will help us
make progress faster and find the bottleneck, so far not many people use
macvtap so zero copy mode there didn't make progress.

I do know why local performance regresses with zero copy enabled:
instead of plain copy to user we got get user pages and then memcpy.
We'll need some logic here to detect this and turn off zero copy.

The core patches will also be helpful for Ian's more ambitious work.

Overall I think it's a step in the right direction and it's easier to
work if core parts are upstream, but if you think we need to wait
until the gains are more significant, I understand that too.

-- 
MST

^ permalink raw reply

* [PULL] vhost-net changes for net/3.6
From: Michael S. Tsirkin @ 2012-07-21 22:54 UTC (permalink / raw)
  To: David Miller, netdev, linux-kernel
  Cc: asias, mst, nab, nab, pbonzini, stefanha, stefanha, wuzhy

Hi Dave,
Please merge the following patches into net tree for 3.6.
They are on top of the current net-next.
Thanks!

The following changes since commit 186e868786f97c8026f0a81400b451ace306b3a4:

  forcedeth: spin_unlock_irq in interrupt handler fix (2012-07-20 16:18:36 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

for you to fetch changes up to 163049aefdc04323a2d17ec9f2862027b43b0502:

  vhost: make vhost work queue visible (2012-07-22 01:22:23 +0300)

----------------------------------------------------------------
Stefan Hajnoczi (2):
      vhost: Separate vhost-net features from vhost features
      vhost: make vhost work queue visible

 drivers/vhost/net.c   | 4 ++--
 drivers/vhost/test.c  | 4 ++--
 drivers/vhost/vhost.c | 5 ++---
 drivers/vhost/vhost.h | 6 +++++-
 4 files changed, 11 insertions(+), 8 deletions(-)

^ permalink raw reply

* [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series contains updates to ixgbe and ixgbevf.

The following are changes since commit 186e868786f97c8026f0a81400b451ace306b3a4:
  forcedeth: spin_unlock_irq in interrupt handler fix
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Akeem G. Abodunrin (1):
  igb: reset PHY in the link_up process to recover PHY setting after
    power down.

Alexander Duyck (8):
  ixgbe: Drop probe_vf and merge functionality into ixgbe_enable_sriov
  ixgbe: Change how we check for pre-existing and assigned VFs
  ixgbevf: Add lock around mailbox ops to prevent simultaneous access
  ixgbevf: Add support for PCI error handling
  ixgbe: Fix handling of FDIR_HASH flag
  ixgbe: Reduce Rx header size to what is actually used
  ixgbe: Use num_tcs.pg_tcs as upper limit for TC when checking based
    on UP
  ixgbe: Use 1TC DCB instead of disabling DCB for MSI and legacy
    interrupts

Don Skidmore (1):
  ixgbe: add support for new 82599 device

Greg Rose (1):
  ixgbevf: Fix namespace issue with ixgbe_write_eitr

John Fastabend (2):
  ixgbe: fix RAR entry counting for generic and fdb_add()
  ixgbe: remove extra unused queues in DCB + FCoE case

 drivers/net/ethernet/intel/igb/igb_main.c         |    3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |   16 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c      |   12 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c      |   41 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  140 +++++++++---------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  151 ++++++++-----------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h    |    1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h     |    1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |    3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  164 +++++++++++++++++----
 10 files changed, 323 insertions(+), 209 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [net-next 01/13] ixgbe: Drop probe_vf and merge functionality into ixgbe_enable_sriov
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This is meant to fix a bug in which we were not checking for pre-existing
VFs if we were not setting the max_vfs value at driver load.  What happens
now is that we always call ixgbe_enable_sriov and this checks for
pre-existing VFs ore requested VFs prior to deciding on no SR-IOV.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |    3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   33 +++++++-----------------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |   24 ++++++++++++-----
 3 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 38d1b65..3ff5aa8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -1061,8 +1061,7 @@ static void ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	}
 	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
 	adapter->atr_sample_rate = 0;
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
-		ixgbe_disable_sriov(adapter);
+	ixgbe_disable_sriov(adapter);
 
 	adapter->ring_feature[RING_F_RSS].limit = 1;
 	ixgbe_set_num_queues(adapter);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f4e53c1..24f2b45 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4490,6 +4490,12 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 	hw->fc.send_xon = true;
 	hw->fc.disable_fc_autoneg = false;
 
+#ifdef CONFIG_PCI_IOV
+	/* assign number of SR-IOV VFs */
+	if (hw->mac.type != ixgbe_mac_82598EB)
+		adapter->num_vfs = (max_vfs > 63) ? 0 : max_vfs;
+
+#endif
 	/* enable itr by default in dynamic mode */
 	adapter->rx_itr_setting = 1;
 	adapter->tx_itr_setting = 1;
@@ -6942,26 +6948,6 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_fdb_dump		= ixgbe_ndo_fdb_dump,
 };
 
-static void __devinit ixgbe_probe_vf(struct ixgbe_adapter *adapter,
-				     const struct ixgbe_info *ii)
-{
-#ifdef CONFIG_PCI_IOV
-	struct ixgbe_hw *hw = &adapter->hw;
-
-	if (hw->mac.type == ixgbe_mac_82598EB)
-		return;
-
-	/* The 82599 supports up to 64 VFs per physical function
-	 * but this implementation limits allocation to 63 so that
-	 * basic networking resources are still available to the
-	 * physical function.  If the user requests greater thn
-	 * 63 VFs then it is an error - reset to default of zero.
-	 */
-	adapter->num_vfs = (max_vfs > 63) ? 0 : max_vfs;
-	ixgbe_enable_sriov(adapter, ii);
-#endif /* CONFIG_PCI_IOV */
-}
-
 /**
  * ixgbe_wol_supported - Check whether device supports WoL
  * @hw: hw specific details
@@ -7206,8 +7192,10 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 		goto err_sw_init;
 	}
 
-	ixgbe_probe_vf(adapter, ii);
+#ifdef CONFIG_PCI_IOV
+	ixgbe_enable_sriov(adapter, ii);
 
+#endif
 	netdev->features = NETIF_F_SG |
 			   NETIF_F_IP_CSUM |
 			   NETIF_F_IPV6_CSUM |
@@ -7411,8 +7399,7 @@ err_register:
 	ixgbe_release_hw_control(adapter);
 	ixgbe_clear_interrupt_scheme(adapter);
 err_sw_init:
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
-		ixgbe_disable_sriov(adapter);
+	ixgbe_disable_sriov(adapter);
 	adapter->flags2 &= ~IXGBE_FLAG2_SEARCH_FOR_SFP;
 	iounmap(hw->hw_addr);
 err_ioremap:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index a825d48..593fdd5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -82,7 +82,6 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 			 const struct ixgbe_info *ii)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
-	int err = 0;
 	int num_vf_macvlans, i;
 	struct vf_macvlans *mv_list;
 	int pre_existing_vfs = 0;
@@ -106,10 +105,21 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 			 "enabled for this device - Please reload all "
 			 "VF drivers to avoid spoofed packet errors\n");
 	} else {
+		int err;
+		/*
+		 * The 82599 supports up to 64 VFs per physical function
+		 * but this implementation limits allocation to 63 so that
+		 * basic networking resources are still available to the
+		 * physical function.  If the user requests greater thn
+		 * 63 VFs then it is an error - reset to default of zero.
+		 */
+		adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, 63);
+
 		err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
 		if (err) {
 			e_err(probe, "Failed to enable PCI sriov: %d\n", err);
-			goto err_novfs;
+			adapter->num_vfs = 0;
+			return;
 		}
 	}
 
@@ -193,11 +203,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 	/* Oh oh */
 	e_err(probe, "Unable to allocate memory for VF Data Storage - "
 	      "SRIOV disabled\n");
-	pci_disable_sriov(adapter->pdev);
-
-err_novfs:
-	adapter->flags &= ~IXGBE_FLAG_SRIOV_ENABLED;
-	adapter->num_vfs = 0;
+	ixgbe_disable_sriov(adapter);
 }
 #endif /* #ifdef CONFIG_PCI_IOV */
 
@@ -219,6 +225,10 @@ void ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 	kfree(adapter->mv_list);
 	adapter->mv_list = NULL;
 
+	/* if SR-IOV is already disabled then there is nothing to do */
+	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
+		return;
+
 #ifdef CONFIG_PCI_IOV
 	/* disable iov and allow time for transactions to clear */
 	pci_disable_sriov(adapter->pdev);
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 02/13] ixgbe: Change how we check for pre-existing and assigned VFs
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch does two things.  First it drops the unnecessary work of
searching for enabled VFs when we first bring up the adapter and instead
just uses pci_num_vf to determine how many VFs are enabled on the adapter.

The second thing it does is drop the use of vfdev from the vf_data_storage
structure.  Instead we just search the entire system for a VF that has us
as it's PF, and then if that VF is assigned we indicate that the VFs are
assigned.  This allows us to still check for assigned VFs even if the
vfinfo allocation has failed, or vfinfo has been freed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    8 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |  129 ++++++++----------------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |    1 -
 4 files changed, 45 insertions(+), 94 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 5a286ad..eb59282 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -130,7 +130,6 @@ struct vf_data_storage {
 	u16 tx_rate;
 	u16 vlan_count;
 	u8 spoofchk_enabled;
-	struct pci_dev *vfdev;
 };
 
 struct vf_macvlans {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 24f2b45..9c42679 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7452,13 +7452,7 @@ static void __devexit ixgbe_remove(struct pci_dev *pdev)
 	if (netdev->reg_state == NETREG_REGISTERED)
 		unregister_netdev(netdev);
 
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) {
-		if (!(ixgbe_check_vf_assignment(adapter)))
-			ixgbe_disable_sriov(adapter);
-		else
-			e_dev_warn("Unloading driver while VFs are assigned "
-				   "- VFs will not be deallocated\n");
-	}
+	ixgbe_disable_sriov(adapter);
 
 	ixgbe_clear_interrupt_scheme(adapter);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 593fdd5..47b2ce7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -44,40 +44,6 @@
 #include "ixgbe_sriov.h"
 
 #ifdef CONFIG_PCI_IOV
-static int ixgbe_find_enabled_vfs(struct ixgbe_adapter *adapter)
-{
-	struct pci_dev *pdev = adapter->pdev;
-	struct pci_dev *pvfdev;
-	u16 vf_devfn = 0;
-	int device_id;
-	int vfs_found = 0;
-
-	switch (adapter->hw.mac.type) {
-	case ixgbe_mac_82599EB:
-		device_id = IXGBE_DEV_ID_82599_VF;
-		break;
-	case ixgbe_mac_X540:
-		device_id = IXGBE_DEV_ID_X540_VF;
-		break;
-	default:
-		device_id = 0;
-		break;
-	}
-
-	vf_devfn = pdev->devfn + 0x80;
-	pvfdev = pci_get_device(PCI_VENDOR_ID_INTEL, device_id, NULL);
-	while (pvfdev) {
-		if (pvfdev->devfn == vf_devfn &&
-		    (pvfdev->bus->number >= pdev->bus->number))
-			vfs_found++;
-		vf_devfn += 2;
-		pvfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-					device_id, pvfdev);
-	}
-
-	return vfs_found;
-}
-
 void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 			 const struct ixgbe_info *ii)
 {
@@ -86,7 +52,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 	struct vf_macvlans *mv_list;
 	int pre_existing_vfs = 0;
 
-	pre_existing_vfs = ixgbe_find_enabled_vfs(adapter);
+	pre_existing_vfs = pci_num_vf(adapter->pdev);
 	if (!pre_existing_vfs && !adapter->num_vfs)
 		return;
 
@@ -205,14 +171,46 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 	      "SRIOV disabled\n");
 	ixgbe_disable_sriov(adapter);
 }
-#endif /* #ifdef CONFIG_PCI_IOV */
 
+static bool ixgbe_vfs_are_assigned(struct ixgbe_adapter *adapter)
+{
+	struct pci_dev *pdev = adapter->pdev;
+	struct pci_dev *vfdev;
+	int dev_id;
+
+	switch (adapter->hw.mac.type) {
+	case ixgbe_mac_82599EB:
+		dev_id = IXGBE_DEV_ID_82599_VF;
+		break;
+	case ixgbe_mac_X540:
+		dev_id = IXGBE_DEV_ID_X540_VF;
+		break;
+	default:
+		return false;
+	}
+
+	/* loop through all the VFs to see if we own any that are assigned */
+	vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, dev_id, NULL);
+	while (vfdev) {
+		/* if we don't own it we don't care */
+		if (vfdev->is_virtfn && vfdev->physfn == pdev) {
+			/* if it is assigned we cannot release it */
+			if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
+				return true;
+		}
+
+		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, dev_id, vfdev);
+	}
+
+	return false;
+}
+
+#endif /* #ifdef CONFIG_PCI_IOV */
 void ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 gpie;
 	u32 vmdctl;
-	int i;
 
 	/* set num VFs to 0 to prevent access to vfinfo */
 	adapter->num_vfs = 0;
@@ -230,6 +228,15 @@ void ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 		return;
 
 #ifdef CONFIG_PCI_IOV
+	/*
+	 * If our VFs are assigned we cannot shut down SR-IOV
+	 * without causing issues, so just leave the hardware
+	 * available but disabled
+	 */
+	if (ixgbe_vfs_are_assigned(adapter)) {
+		e_dev_warn("Unloading driver while VFs are assigned - VFs will not be deallocated\n");
+		return;
+
 	/* disable iov and allow time for transactions to clear */
 	pci_disable_sriov(adapter->pdev);
 #endif
@@ -254,12 +261,6 @@ void ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 	/* take a breather then clean up driver data */
 	msleep(100);
 
-	/* Release reference to VF devices */
-	for (i = 0; i < adapter->num_vfs; i++) {
-		if (adapter->vfinfo[i].vfdev)
-			pci_dev_put(adapter->vfinfo[i].vfdev);
-	}
-
 	adapter->flags &= ~IXGBE_FLAG_SRIOV_ENABLED;
 }
 
@@ -493,28 +494,11 @@ static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter,
 	return 0;
 }
 
-int ixgbe_check_vf_assignment(struct ixgbe_adapter *adapter)
-{
-#ifdef CONFIG_PCI_IOV
-	int i;
-	for (i = 0; i < adapter->num_vfs; i++) {
-		if (adapter->vfinfo[i].vfdev->dev_flags &
-				PCI_DEV_FLAGS_ASSIGNED)
-			return true;
-	}
-#endif
-	return false;
-}
-
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask)
 {
 	unsigned char vf_mac_addr[6];
 	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
 	unsigned int vfn = (event_mask & 0x3f);
-	struct pci_dev *pvfdev;
-	unsigned int device_id;
-	u16 thisvf_devfn = (pdev->devfn + 0x80 + (vfn << 1)) |
-				(pdev->devfn & 1);
 
 	bool enable = ((event_mask & 0x10000000U) != 0);
 
@@ -527,31 +511,6 @@ int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask)
 		 * for it later.
 		 */
 		memcpy(adapter->vfinfo[vfn].vf_mac_addresses, vf_mac_addr, 6);
-
-		switch (adapter->hw.mac.type) {
-		case ixgbe_mac_82599EB:
-			device_id = IXGBE_DEV_ID_82599_VF;
-			break;
-		case ixgbe_mac_X540:
-			device_id = IXGBE_DEV_ID_X540_VF;
-			break;
-		default:
-			device_id = 0;
-			break;
-		}
-
-		pvfdev = pci_get_device(PCI_VENDOR_ID_INTEL, device_id, NULL);
-		while (pvfdev) {
-			if (pvfdev->devfn == thisvf_devfn)
-				break;
-			pvfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-						device_id, pvfdev);
-		}
-		if (pvfdev)
-			adapter->vfinfo[vfn].vfdev = pvfdev;
-		else
-			e_err(drv, "Couldn't find pci dev ptr for VF %4.4x\n",
-			      thisvf_devfn);
 	}
 
 	return 0;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2ab38d5..1be1d30 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -42,7 +42,6 @@ int ixgbe_ndo_get_vf_config(struct net_device *netdev,
 			    int vf, struct ifla_vf_info *ivi);
 void ixgbe_check_vf_rate_limit(struct ixgbe_adapter *adapter);
 void ixgbe_disable_sriov(struct ixgbe_adapter *adapter);
-int ixgbe_check_vf_assignment(struct ixgbe_adapter *adapter);
 #ifdef CONFIG_PCI_IOV
 void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 			const struct ixgbe_info *ii);
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 04/13] ixgbevf: Add support for PCI error handling
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Greg Rose, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change adds support for handling IO errors and slot resets.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   80 +++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 7cb678d..ccc801e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3189,12 +3189,92 @@ static void __devexit ixgbevf_remove(struct pci_dev *pdev)
 	pci_disable_device(pdev);
 }
 
+/**
+ * ixgbevf_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t ixgbevf_io_error_detected(struct pci_dev *pdev,
+						  pci_channel_state_t state)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
+
+	netif_device_detach(netdev);
+
+	if (state == pci_channel_io_perm_failure)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	if (netif_running(netdev))
+		ixgbevf_down(adapter);
+
+	pci_disable_device(pdev);
+
+	/* Request a slot slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * ixgbevf_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot. Implementation
+ * resembles the first-half of the ixgbevf_resume routine.
+ */
+static pci_ers_result_t ixgbevf_io_slot_reset(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
+
+	if (pci_enable_device_mem(pdev)) {
+		dev_err(&pdev->dev,
+			"Cannot re-enable PCI device after reset.\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	pci_set_master(pdev);
+
+	ixgbevf_reset(adapter);
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * ixgbevf_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation. Implementation resembles the
+ * second-half of the ixgbevf_resume routine.
+ */
+static void ixgbevf_io_resume(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
+
+	if (netif_running(netdev))
+		ixgbevf_up(adapter);
+
+	netif_device_attach(netdev);
+}
+
+/* PCI Error Recovery (ERS) */
+static struct pci_error_handlers ixgbevf_err_handler = {
+	.error_detected = ixgbevf_io_error_detected,
+	.slot_reset = ixgbevf_io_slot_reset,
+	.resume = ixgbevf_io_resume,
+};
+
 static struct pci_driver ixgbevf_driver = {
 	.name     = ixgbevf_driver_name,
 	.id_table = ixgbevf_pci_tbl,
 	.probe    = ixgbevf_probe,
 	.remove   = __devexit_p(ixgbevf_remove),
 	.shutdown = ixgbevf_shutdown,
+	.err_handler = &ixgbevf_err_handler
 };
 
 /**
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 03/13] ixgbevf: Add lock around mailbox ops to prevent simultaneous access
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Greg Rose, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change adds a spinlock around the mailbox accesses to prevent
simultaneous access to the mailboxes.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |    2 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   41 ++++++++++++++++++++-
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index e167d1b..66858b5 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -249,6 +249,8 @@ struct ixgbevf_adapter {
 	bool link_up;
 
 	struct work_struct watchdog_task;
+
+	spinlock_t mbx_lock;
 };
 
 enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 2dc78d7..7cb678d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1120,9 +1120,14 @@ static int ixgbevf_vlan_rx_add_vid(struct net_device *netdev, u16 vid)
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
 
+	spin_lock(&adapter->mbx_lock);
+
 	/* add VID to filter table */
 	if (hw->mac.ops.set_vfta)
 		hw->mac.ops.set_vfta(hw, vid, 0, true);
+
+	spin_unlock(&adapter->mbx_lock);
+
 	set_bit(vid, adapter->active_vlans);
 
 	return 0;
@@ -1133,9 +1138,14 @@ static int ixgbevf_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
 
+	spin_lock(&adapter->mbx_lock);
+
 	/* remove VID from filter table */
 	if (hw->mac.ops.set_vfta)
 		hw->mac.ops.set_vfta(hw, vid, 0, false);
+
+	spin_unlock(&adapter->mbx_lock);
+
 	clear_bit(vid, adapter->active_vlans);
 
 	return 0;
@@ -1190,11 +1200,15 @@ static void ixgbevf_set_rx_mode(struct net_device *netdev)
 	struct ixgbevf_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
 
+	spin_lock(&adapter->mbx_lock);
+
 	/* reprogram multicast list */
 	if (hw->mac.ops.update_mc_addr_list)
 		hw->mac.ops.update_mc_addr_list(hw, netdev);
 
 	ixgbevf_write_uc_addr_list(netdev);
+
+	spin_unlock(&adapter->mbx_lock);
 }
 
 static void ixgbevf_napi_enable_all(struct ixgbevf_adapter *adapter)
@@ -1339,6 +1353,8 @@ static void ixgbevf_up_complete(struct ixgbevf_adapter *adapter)
 
 	ixgbevf_configure_msix(adapter);
 
+	spin_lock(&adapter->mbx_lock);
+
 	if (hw->mac.ops.set_rar) {
 		if (is_valid_ether_addr(hw->mac.addr))
 			hw->mac.ops.set_rar(hw, 0, hw->mac.addr, 0);
@@ -1350,6 +1366,8 @@ static void ixgbevf_up_complete(struct ixgbevf_adapter *adapter)
 	msg[1] = netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
 	hw->mbx.ops.write_posted(hw, msg, 2);
 
+	spin_unlock(&adapter->mbx_lock);
+
 	clear_bit(__IXGBEVF_DOWN, &adapter->state);
 	ixgbevf_napi_enable_all(adapter);
 
@@ -1562,11 +1580,15 @@ void ixgbevf_reset(struct ixgbevf_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	struct net_device *netdev = adapter->netdev;
 
+	spin_lock(&adapter->mbx_lock);
+
 	if (hw->mac.ops.reset_hw(hw))
 		hw_dbg(hw, "PF still resetting\n");
 	else
 		hw->mac.ops.init_hw(hw);
 
+	spin_unlock(&adapter->mbx_lock);
+
 	if (is_valid_ether_addr(adapter->hw.mac.addr)) {
 		memcpy(netdev->dev_addr, adapter->hw.mac.addr,
 		       netdev->addr_len);
@@ -1893,6 +1915,9 @@ static int __devinit ixgbevf_sw_init(struct ixgbevf_adapter *adapter)
 			adapter->netdev->addr_len);
 	}
 
+	/* lock to protect mailbox accesses */
+	spin_lock_init(&adapter->mbx_lock);
+
 	/* Enable dynamic interrupt throttling rates */
 	adapter->rx_itr_setting = 1;
 	adapter->tx_itr_setting = 1;
@@ -2032,8 +2057,16 @@ static void ixgbevf_watchdog_task(struct work_struct *work)
 	 * no LSC interrupt
 	 */
 	if (hw->mac.ops.check_link) {
-		if ((hw->mac.ops.check_link(hw, &link_speed,
-					    &link_up, false)) != 0) {
+		s32 need_reset;
+
+		spin_lock(&adapter->mbx_lock);
+
+		need_reset = hw->mac.ops.check_link(hw, &link_speed,
+						    &link_up, false);
+
+		spin_unlock(&adapter->mbx_lock);
+
+		if (need_reset) {
 			adapter->link_up = link_up;
 			adapter->link_speed = link_speed;
 			netif_carrier_off(netdev);
@@ -2813,9 +2846,13 @@ static int ixgbevf_set_mac(struct net_device *netdev, void *p)
 	memcpy(netdev->dev_addr, addr->sa_data, netdev->addr_len);
 	memcpy(hw->mac.addr, addr->sa_data, netdev->addr_len);
 
+	spin_lock(&adapter->mbx_lock);
+
 	if (hw->mac.ops.set_rar)
 		hw->mac.ops.set_rar(hw, 0, hw->mac.addr, 0);
 
+	spin_unlock(&adapter->mbx_lock);
+
 	return 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 05/13] ixgbe: Fix handling of FDIR_HASH flag
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change makes it so that we can use the atr_sample_rate to determine if
we are capable of supporting ATR. The advantage to this approach is that it
allows us to now determine the setting of the IXGBE_FLAG_FDIR_HASH_CAPABLE
based on the queueing scheme, instead of the queueing scheme being based on
the flag.

Using this approach there are essentially 5 conditions that must be checked
prior to trying to enable ATR:
1.  Is SR-IOV disabled?
2.  Are the number of TCs <= 1?
3.  Is RSS queueing limit greater than 1?
4.  Is atr_sample_rate set?
5.  Is Flow Director perfect filtering disabled?

If any of these conditions are enabled they should disable ATR filtering.
Note that in the case of conditions 1 through 4 being met we will set
things up for ATR queueing, however if test 5 fails we will still leave the
queues allocated for use by perfect filters.  The reason for this is to
allow for us to switch back and forth between ntuple and ATR without
needing to reallocate the descriptor rings.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |   22 +++++++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   48 ++++++++++++++++---------
 2 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 3ff5aa8..29a2a85 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -370,6 +370,9 @@ static bool ixgbe_set_dcb_sriov_queues(struct ixgbe_adapter *adapter)
 	adapter->ring_feature[RING_F_RSS].indices = 1;
 	adapter->ring_feature[RING_F_RSS].mask = IXGBE_RSS_DISABLED_MASK;
 
+	/* disable ATR as it is not supported when VMDq is enabled */
+	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
+
 	adapter->num_rx_pools = vmdq_i;
 	adapter->num_rx_queues_per_pool = tcs;
 
@@ -450,6 +453,9 @@ static bool ixgbe_set_dcb_queues(struct ixgbe_adapter *adapter)
 	f->indices = rss_i;
 	f->mask = rss_m;
 
+	/* disable ATR as it is not supported when multiple TCs are enabled */
+	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
+
 #ifdef IXGBE_FCOE
 	/* FCoE enabled queues require special configuration indexed
 	 * by feature specific indices and offset. Here we map FCoE
@@ -606,16 +612,22 @@ static bool ixgbe_set_rss_queues(struct ixgbe_adapter *adapter)
 	f->indices = rss_i;
 	f->mask = IXGBE_RSS_16Q_MASK;
 
+	/* disable ATR by default, it will be configured below */
+	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
+
 	/*
 	 * Use Flow Director in addition to RSS to ensure the best
 	 * distribution of flows across cores, even when an FDIR flow
 	 * isn't matched.
 	 */
-	if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
+	if (rss_i > 1 && adapter->atr_sample_rate) {
 		f = &adapter->ring_feature[RING_F_FDIR];
 
 		f->indices = min_t(u16, num_online_cpus(), f->limit);
 		rss_i = max_t(u16, rss_i, f->indices);
+
+		if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
+			adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
 	}
 
 #ifdef IXGBE_FCOE
@@ -1054,13 +1066,7 @@ static void ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	}
 
 	adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
-	if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
-		e_err(probe,
-		      "ATR is not supported while multiple "
-		      "queues are disabled.  Disabling Flow Director\n");
-	}
-	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
-	adapter->atr_sample_rate = 0;
+
 	ixgbe_disable_sriov(adapter);
 
 	adapter->ring_feature[RING_F_RSS].limit = 1;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9c42679..7be3504 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2688,8 +2688,7 @@ void ixgbe_configure_tx_ring(struct ixgbe_adapter *adapter,
 		   32;		/* PTHRESH = 32 */
 
 	/* reinitialize flowdirector state */
-	if ((adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) &&
-	    adapter->atr_sample_rate) {
+	if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
 		ring->atr_sample_rate = adapter->atr_sample_rate;
 		ring->atr_count = 0;
 		set_bit(__IXGBE_TX_FDIR_INIT_DONE, &ring->state);
@@ -4419,7 +4418,6 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 		if (hw->device_id == IXGBE_DEV_ID_82599_T3_LOM)
 			adapter->flags2 |= IXGBE_FLAG2_TEMP_SENSOR_CAPABLE;
 		/* Flow Director hash filters enabled */
-		adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
 		adapter->atr_sample_rate = 20;
 		adapter->ring_feature[RING_F_FDIR].limit =
 							 IXGBE_MAX_FDIR_INDICES;
@@ -6726,7 +6724,6 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 		ixgbe_set_prio_tc_map(adapter);
 
 		adapter->flags |= IXGBE_FLAG_DCB_ENABLED;
-		adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
 
 		if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
 			adapter->last_lfc_mode = adapter->hw.fc.requested_mode;
@@ -6739,7 +6736,6 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 			adapter->hw.fc.requested_mode = adapter->last_lfc_mode;
 
 		adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
-		adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
 
 		adapter->temp_dcb_cfg.pfc_mode_enable = false;
 		adapter->dcb_cfg.pfc_mode_enable = false;
@@ -6808,20 +6804,40 @@ static int ixgbe_set_features(struct net_device *netdev,
 	 * Check if Flow Director n-tuple support was enabled or disabled.  If
 	 * the state changed, we need to reset.
 	 */
-	if (!(features & NETIF_F_NTUPLE)) {
-		if (adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE) {
-			/* turn off Flow Director, set ATR and reset */
-			if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) &&
-			    !(adapter->flags & IXGBE_FLAG_DCB_ENABLED))
-				adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
-			need_reset = true;
-		}
-		adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
-	} else if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)) {
+	switch (features & NETIF_F_NTUPLE) {
+	case NETIF_F_NTUPLE:
 		/* turn off ATR, enable perfect filters and reset */
+		if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
+			need_reset = true;
+
 		adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
 		adapter->flags |= IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
-		need_reset = true;
+		break;
+	default:
+		/* turn off perfect filters, enable ATR and reset */
+		if (adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)
+			need_reset = true;
+
+		adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
+
+		/* We cannot enable ATR if SR-IOV is enabled */
+		if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+			break;
+
+		/* We cannot enable ATR if we have 2 or more traffic classes */
+		if (netdev_get_num_tc(netdev) > 1)
+			break;
+
+		/* We cannot enable ATR if RSS is disabled */
+		if (adapter->ring_feature[RING_F_RSS].limit <= 1)
+			break;
+
+		/* A sample rate of 0 indicates ATR disabled */
+		if (!adapter->atr_sample_rate)
+			break;
+
+		adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
+		break;
 	}
 
 	if (features & NETIF_F_HW_VLAN_RX)
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 07/13] ixgbe: Reduce Rx header size to what is actually used
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Eric Dumazet,
	Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

The recent changes to netdev_alloc_skb actually make it so that the size of
the buffer now actually has a more direct input on the truesize.  So in
order to make best use of the piece of a page we are allocated I am
reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
by 256 bytes as well.

This should result in performance improvements since the number of uses per
page should increase from 4 to 6 in the case of a 4K page.  In addition we
should see socket performance improvements due to the truesize dropping
to less than 1K for buffers less than 256 bytes.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index eb59282..b9623e9 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -77,17 +77,18 @@
 #define IXGBE_MAX_FCPAUSE		 0xFFFF
 
 /* Supported Rx Buffer Sizes */
-#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
+#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
 #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
 
 /*
- * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_512 --> size-1024 slab
+ * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
+ * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
+ * this adds up to 448 bytes of extra data.
+ *
+ * Since netdev_alloc_skb now allocates a page fragment we can use a value
+ * of 256 and the resultant skb will have a truesize of 960 or less.
  */
-#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
+#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
 
 #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7be3504..b376926 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1517,8 +1517,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
 	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+	if (pull_len > IXGBE_RX_HDR_SIZE)
+		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
-- 
1.7.10.4

^ permalink raw reply related

* [net-next 06/13] ixgbevf: Fix namespace issue with ixgbe_write_eitr
From: Jeff Kirsher @ 2012-07-21 23:08 UTC (permalink / raw)
  To: davem; +Cc: Greg Rose, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1342912142-11130-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Greg Rose <gregory.v.rose@intel.com>

Make the function static to cleanup namespace.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <Sibai.li@intel.com
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |    1 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   43 +++++++++------------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 66858b5..98cadb0 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -286,7 +286,6 @@ extern void ixgbevf_free_rx_resources(struct ixgbevf_adapter *,
 extern void ixgbevf_free_tx_resources(struct ixgbevf_adapter *,
 				      struct ixgbevf_ring *);
 extern void ixgbevf_update_stats(struct ixgbevf_adapter *adapter);
-void ixgbevf_write_eitr(struct ixgbevf_q_vector *);
 extern int ethtool_ioctl(struct ifreq *ifr);
 
 extern void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index ccc801e..3f9841d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -540,6 +540,25 @@ static int ixgbevf_poll(struct napi_struct *napi, int budget)
 	return 0;
 }
 
+/**
+ * ixgbevf_write_eitr - write VTEITR register in hardware specific way
+ * @q_vector: structure containing interrupt and ring information
+ */
+static void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector)
+{
+	struct ixgbevf_adapter *adapter = q_vector->adapter;
+	struct ixgbe_hw *hw = &adapter->hw;
+	int v_idx = q_vector->v_idx;
+	u32 itr_reg = q_vector->itr & IXGBE_MAX_EITR;
+
+	/*
+	 * set the WDIS bit to not clear the timer bits and cause an
+	 * immediate assertion of the interrupt
+	 */
+	itr_reg |= IXGBE_EITR_CNT_WDIS;
+
+	IXGBE_WRITE_REG(hw, IXGBE_VTEITR(v_idx), itr_reg);
+}
 
 /**
  * ixgbevf_configure_msix - Configure MSI-X hardware
@@ -662,30 +681,6 @@ static void ixgbevf_update_itr(struct ixgbevf_q_vector *q_vector,
 	ring_container->itr = itr_setting;
 }
 
-/**
- * ixgbevf_write_eitr - write VTEITR register in hardware specific way
- * @q_vector: structure containing interrupt and ring information
- *
- * This function is made to be called by ethtool and by the driver
- * when it needs to update VTEITR registers at runtime.  Hardware
- * specific quirks/differences are taken care of here.
- */
-void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector)
-{
-	struct ixgbevf_adapter *adapter = q_vector->adapter;
-	struct ixgbe_hw *hw = &adapter->hw;
-	int v_idx = q_vector->v_idx;
-	u32 itr_reg = q_vector->itr & IXGBE_MAX_EITR;
-
-	/*
-	 * set the WDIS bit to not clear the timer bits and cause an
-	 * immediate assertion of the interrupt
-	 */
-	itr_reg |= IXGBE_EITR_CNT_WDIS;
-
-	IXGBE_WRITE_REG(hw, IXGBE_VTEITR(v_idx), itr_reg);
-}
-
 static void ixgbevf_set_itr(struct ixgbevf_q_vector *q_vector)
 {
 	u32 new_itr = q_vector->itr;
-- 
1.7.10.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox