* Re: [PATCH 2/8] dcbnl: Shorten all command handling functions
From: John Fastabend @ 2012-06-21 6:04 UTC (permalink / raw)
To: Thomas Graf; +Cc: davem, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <3fea99383d5ee98b51bd8b0088bed38cdb59f375.1339591572.git.tgraf@suug.ch>
On 6/13/2012 5:54 AM, Thomas Graf wrote:
> Allocating and sending the skb in dcb_doit() allows for much
> shorter and cleaner command handling functions.
>
> The huge switch statement is replaced with an array based definition
> of the handling function and reply message type.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
> net/dcb/dcbnl.c | 722 +++++++++++++------------------------------------------
> 1 files changed, 172 insertions(+), 550 deletions(-)
>
[...]
> @@ -1054,32 +885,27 @@ static int __dcbnl_pg_setcfg(struct net_device *netdev, struct nlattr **tb,
> }
> }
>
> - ret = dcbnl_reply(0, RTM_SETDCB,
> - (dir ? DCB_CMD_PGRX_SCFG : DCB_CMD_PGTX_SCFG),
> - DCB_ATTR_PG_CFG, pid, seq, flags);
> + ret = nla_put_u8(skb, (dir ? DCB_CMD_PGRX_SCFG : DCB_CMD_PGTX_SCFG), 0);
Previously, this returned DCB_ATTR_PG_CFG now it is returning RX or TX
variants. Although I like this implementation better I think in order
to not break user space application we should keep the previous
behavior. At least one app is checking these return codes.
I'll post a patch shortly. Thomas if you ACK it that would be great.
Thanks,
John
^ permalink raw reply
* [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: John Fastabend @ 2012-06-21 5:56 UTC (permalink / raw)
To: tgraf, davem; +Cc: netdev, lucy.liu, alexander.h.duyck
A small regression was introduced in the reply command of
dcbnl_pg_setcfg(). User space apps may be expecting the
DCB_ATTR_PG_CFG attribute to be returned with the patch
below TX or RX variants are returned.
commit 7be994138b188387691322921c08e19bddf6d3c5
Author: Thomas Graf <tgraf@suug.ch>
Date: Wed Jun 13 02:54:55 2012 +0000
dcbnl: Shorten all command handling functions
This patch reverts this behavior and returns DCB_ATTR_PG_CFG
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
net/dcb/dcbnl.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 0a36007..013da86 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -852,8 +852,7 @@ static int __dcbnl_pg_setcfg(struct net_device *netdev, struct nlmsghdr *nlh,
}
}
- return nla_put_u8(skb,
- (dir ? DCB_CMD_PGRX_SCFG : DCB_CMD_PGTX_SCFG), 0);
+ return nla_put_u8(skb, DCB_ATTR_PG_CFG, 0);
}
static int dcbnl_pgtx_setcfg(struct net_device *netdev, struct nlmsghdr *nlh,
^ permalink raw reply related
* [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: Fan Du @ 2012-06-21 6:26 UTC (permalink / raw)
To: davem, herbert; +Cc: netdev
In-Reply-To: <1340259961-30354-1-git-send-email-fdu@windriver.com>
After SA is setup, one timer is armed to detect soft/hard expiration,
however the timer handler uses xtime to do the math. This makes hard
expiration occurs first before soft expiration after setting new date
with big interval. As a result new child SA is deleted before rekeying
the new one.
Signed-off-by: Fan Du <fdu@windriver.com>
---
include/net/xfrm.h | 3 +++
net/xfrm/xfrm_state.c | 21 +++++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..8d16777 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -197,6 +197,8 @@ struct xfrm_state
struct xfrm_lifetime_cur curlft;
struct timer_list timer;
+ /* used to fix curlft->add_time when changing date */
+ long saved_tmo;
/* Last used time */
unsigned long lastused;
@@ -218,6 +220,7 @@ struct xfrm_state
/* xflags - make enum if more show up */
#define XFRM_TIME_DEFER 1
+#define XFRM_SOFT_EXPIRE 2
enum {
XFRM_STATE_VOID,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fd77cf0..2b55244 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -442,8 +442,17 @@ static void xfrm_timer_handler(unsigned long data)
if (x->lft.hard_add_expires_seconds) {
long tmo = x->lft.hard_add_expires_seconds +
x->curlft.add_time - now;
- if (tmo <= 0)
- goto expired;
+ if (tmo <= 0) {
+ if (x->xflags & XFRM_SOFT_EXPIRE) {
+ /* enter hard expire without soft expire first?!
+ * setting a new date could trigger this.
+ * workarbound: fix x->curflt.add_time by below:
+ */
+ x->curlft.add_time = now - x->saved_tmo - 1;
+ tmo = x->lft.hard_add_expires_seconds - x->saved_tmo;
+ } else
+ goto expired;
+ }
if (tmo < next)
next = tmo;
}
@@ -460,10 +469,14 @@ static void xfrm_timer_handler(unsigned long data)
if (x->lft.soft_add_expires_seconds) {
long tmo = x->lft.soft_add_expires_seconds +
x->curlft.add_time - now;
- if (tmo <= 0)
+ if (tmo <= 0) {
warn = 1;
- else if (tmo < next)
+ x->xflags &= ~XFRM_SOFT_EXPIRE;
+ } else if (tmo < next) {
next = tmo;
+ x->xflags |= XFRM_SOFT_EXPIRE;
+ x->saved_tmo = tmo;
+ }
}
if (x->lft.soft_use_expires_seconds) {
long tmo = x->lft.soft_use_expires_seconds +
--
1.7.11
^ permalink raw reply related
* [XFRM][RFC v4] Fix unexpected SA hard expiration after setting new date
From: Fan Du @ 2012-06-21 6:26 UTC (permalink / raw)
To: davem, herbert; +Cc: netdev
Hi, Dave and Herbert
Could you give a try to review this?
thanks
Changelog:
v1->v2
1) use xflags instead of creating new flags(suggested by Steffen Klassert)
v2->v3
1) fix email problem, and remove cc to myself(requested by David Miller)
v3->v4
1) fix typo when clearing XFRM_SOFT_EXPIRE(thanks for David Miller)
2) fix email problem, and remove cc to myself AGAIN!!!
*Background*:
Once IPsec SAs are created between two peers, kernel setup a timer to monitor
two events: soft/hard expiration. However the timer handler use xtime to
caculate whether it's soft or hard expiration event.
normal code flow(hard expire time:100s, soft expire time:82s)
a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;
b) soft expire occur, rearm the timer with hard expire interval(18s)
then notify racoon2 about soft expire event. racoon2 will create
new SAs.
c) hard expire happen, notify racoon2 about it. racoon2 will delete
the old SAs.
*Scenario*:
Setting a new date before b),and after a) could result c) happens first,
As a result, old SAs is deleted before new ones are created. Normally
new SAs will be created by the next time networking traffic, but there
is a small time being when networking connection is down, this could
result in upper layer connections failed in tel comm area, thus it's
better to keep it strict in sequence.
*Workaround*:
set new time could happen:
1) before a), then SAs is updated with new time.
2) before b),and after a)
2a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;(set flag to mark next time should
be soft time expire)
<<---- new date comes
2b) soft expire occur, the calculation results in a hard time expire
event, but flag is set, so catch ya. Sync the addtime, and rearm
the timer with hard expire interval(18s), then notify racoon2
about soft expire event;
2c) hard expire happen, notify racoon2 about it;
so everything is in order.
3) after b), hard expire always happened anyway.
^ permalink raw reply
* [RFC] tcp: How does SACK or FACK determine the time to start fast retransmition?
From: 李易 @ 2012-06-21 6:33 UTC (permalink / raw)
To: netdev; +Cc: kernelnewbies
In-Reply-To: <CAAA3+BpRQ24HED6a+yCF+A_q=vJ7nHexLJd1U+-50pb43MVa5w@mail.gmail.com>
HI all,
When tcp uses reno as its congestion control algothim, it uses
tp->sacked_out as dup-ack. When the third dup-ack(under default
condition) comes, tcp will initiate its fast retransmition.
But how about sack ?
According to kernel source code comments, when sack or fack tcp option
is enabled, there is no dup-ack counter. See comments for function
tcp_dupack_heuristics():
http://lxr.linux.no/linux+v2.6.37/net/ipv4/tcp_input.c#L2300
So , how does tcp know the current dup-ack is the last one which
triggers the fast retransmition?
According to rfc3517 section 5:
"Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
scoreboard is to be updated as normal."
"When a TCP sender receives the duplicate ACK corresponding to
DupThresh ACKs,
the scoreboard MUST be updated with the new SACK information (via
Update ()). If no previous loss event has occurred
on the connection or the cumulative acknowledgment point is beyond
the last value of RecoveryPoint, a loss recovery phase SHOULD be
initiated, per the fast retransmit algorithm outlined in [RFC2581]."
But these sentences doesn't describe how tcp knows the current ack
is the dup-threshold dup-ack.
Accorrding to rfc3517 seciton 4 and isLost(Seqnum) function:
"The routine returns true when either
DupThresh discontiguous SACKed sequences have arrived above
’SeqNum’ or (DupThresh * SMSS) bytes with sequence numbers greater
than ’SeqNum’ have been SACKed. Otherwise, the routine returns
false."
I think this is just what I am searching for, but I still don't know
which line of code in Linux tcp protocol does this check.
Can any one help me ? thks in advance.
^ permalink raw reply
* Re: [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: Fan Du @ 2012-06-21 6:41 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
In-Reply-To: <20120620.204323.1614285855856633624.davem@davemloft.net>
Hi, David
On 2012年06月21日 11:43, David Miller wrote:
> From: Fan Du<fan.du@windriver.com>
> Date: Thu, 21 Jun 2012 10:11:51 +0800
>
>> Could you please take a look at it if you have time?
>
> Everyone saw it, you need to be patient.
>
Apologize for this noise.
Yes, I'm young, and a bit of rush :)
> And you still have fdu@windriver.com in the CC: list, which I told you
> bounces.
>
> I asked you explicitly to remove this email address from the CC: list,
> because it bounces.
I didn't do it deliberately.
Damn, must be my afternoon buzzy head!
I have sent out V4, I hope I will not make any dummy mistake.
Anyway, please review it if you have time, and I will wait for your
response *PATIENTLY*.
thanks
--
Love each day!
--fan
^ permalink raw reply
* RE: [RFC net-next 11/14] Fix emulex/benet
From: Sathya.Perla @ 2012-06-21 6:42 UTC (permalink / raw)
To: yuvalmin, netdev, davem; +Cc: eilong
In-Reply-To: <1340118848-30978-12-git-send-email-yuvalmin@broadcom.com>
Yuval, for be2net, the best place to cap the number of queues to a global default
value would be_num_rss_want(). The change would look like:
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/
index fa2a01e..5265b42 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2186,12 +2186,15 @@ static void be_msix_disable(struct be_adapter *adapter)
static uint be_num_rss_want(struct be_adapter *adapter)
{
+ u32 num = 0;
+
if ((adapter->function_caps & BE_FUNCTION_CAPS_RSS) &&
!sriov_want(adapter) && be_physfn(adapter) &&
- !be_is_mc(adapter))
- return (adapter->be3_native) ? BE3_MAX_RSS_QS : BE2_MAX_RSS_QS;
- else
- return 0;
+ !be_is_mc(adapter)) {
+ num = (adapter->be3_native) ? BE3_MAX_RSS_QS : BE2_MAX_RSS_QS;
+ num = min_t(u32, num, DEFAULT_MAX_NUM_RSS_QUEUES);
+ }
+ return num;
}
static void be_msix_enable(struct be_adapter *adapter)
thanks,
-Sathya
________________________________________
From: Yuval Mintz [yuvalmin@broadcom.com]
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
---
drivers/net/ethernet/emulex/benet/be_main.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 5a34503..e42597d 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2153,13 +2153,15 @@ static uint be_num_rss_want(struct be_adapter *adapter)
static void be_msix_enable(struct be_adapter *adapter)
{
#define BE_MIN_MSIX_VECTORS 1
- int i, status, num_vec, num_roce_vec = 0;
+ int i, status, num_vec, num_roce_vec = 0, ncpu;
+
+ ncpu = min_t(int, num_online_cpus(), DEFAULT_MAX_NUM_RSS_QUEUES);
/* If RSS queues are not used, need a vec for default RX Q */
- num_vec = min(be_num_rss_want(adapter), num_online_cpus());
+ num_vec = min(be_num_rss_want(adapter), ncpu);
if (be_roce_supported(adapter)) {
num_roce_vec = min_t(u32, MAX_ROCE_MSIX_VECTORS,
- (num_online_cpus() + 1));
+ (u32)(ncpu + 1));
num_roce_vec = min(num_roce_vec, MAX_ROCE_EQS);
num_vec += num_roce_vec;
num_vec = min(num_vec, MAX_MSIX_VECTORS);
--
1.7.9.rc2
^ permalink raw reply related
* Re: [RFC] tcp: How does SACK or FACK determine the time to start fast retransmition?
From: Li Yu @ 2012-06-21 7:20 UTC (permalink / raw)
To: 李易; +Cc: netdev, kernelnewbies
In-Reply-To: <4FE2C03C.6030102@gmail.com>
于 2012年06月21日 14:33, 李易 写道:
> HI all,
> When tcp uses reno as its congestion control algothim, it uses
> tp->sacked_out as dup-ack. When the third dup-ack(under default
> condition) comes, tcp will initiate its fast retransmition.
> But how about sack ?
> According to kernel source code comments, when sack or fack tcp option
> is enabled, there is no dup-ack counter. See comments for function
> tcp_dupack_heuristics():
> http://lxr.linux.no/linux+v2.6.37/net/ipv4/tcp_input.c#L2300
> So , how does tcp know the current dup-ack is the last one which
> triggers the fast retransmition?
>
> According to rfc3517 section 5:
> "Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
> scoreboard is to be updated as normal."
> "When a TCP sender receives the duplicate ACK corresponding to
> DupThresh ACKs,
> the scoreboard MUST be updated with the new SACK information (via
> Update ()). If no previous loss event has occurred
> on the connection or the cumulative acknowledgment point is beyond
> the last value of RecoveryPoint, a loss recovery phase SHOULD be
> initiated, per the fast retransmit algorithm outlined in [RFC2581]."
>
> But these sentences doesn't describe how tcp knows the current ack
> is the dup-threshold dup-ack.
>
> Accorrding to rfc3517 seciton 4 and isLost(Seqnum) function:
> "The routine returns true when either
> DupThresh discontiguous SACKed sequences have arrived above
> ’SeqNum’ or (DupThresh * SMSS) bytes with sequence numbers greater
> than ’SeqNum’ have been SACKed. Otherwise, the routine returns
> false."
> I think this is just what I am searching for, but I still don't know
> which line of code in Linux tcp protocol does this check.
> Can any one help me ? thks in advance.
>
>
Do you mean you did not locate where FR is triggered in TCP stack ?
I am not a TCP expert, however I think that it may be at
tcp_time_to_recover(), and the "DupThresh" is not a fixed value in
Linux TCP implementation.
Thanks
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: divide by 0 error in igbvf_set_coalesce - ab50a2a
From: Jeff Kirsher @ 2012-06-21 7:35 UTC (permalink / raw)
To: David Ahern; +Cc: Williams, Mitch A, netdev@vger.kernel.org
In-Reply-To: <4FE24CF1.50603@cisco.com>
[-- Attachment #1: Type: text/plain, Size: 373 bytes --]
On 06/20/2012 03:21 PM, David Ahern wrote:
>
> On 6/18/12 2:45 PM, Williams, Mitch A wrote:
>> Thanks for letting me know, David. I'll look into it and get a patch
>> out soon. Shouldn't be that big of a deal to fix.
>
> Could you CC me on the patch so I know when it's fixed? I have enough
> events to poll.
I will CC you when I push Mitch's patch upstream.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]
^ permalink raw reply
* IPV6 ndisc:: Bad NIC causing IPV6 NDP to stop working
From: Menny_Hamburger @ 2012-06-21 7:59 UTC (permalink / raw)
To: netdev
Hi,
Our machines runs EL5.8 x86_64.
We have witnessed several cases where we suspect that a bad NIC on the machine caused IPV6 neighbour discovery to stop working on all the other NICs - when this happens ping6 fails on every NIC we try it.
>From looking into the code I see that there is only a single socket assigned for NDP; Does it sound logical to allocate a socket per interface instead of a single global socket.
I have found the following thread in LKML: https://lkml.org/lkml/2006/11/29/335, and it seems that this allocation issue still exists in EL5 based kernels - could this cause the above problem?
Thanks,
Menny
^ permalink raw reply
* Re: [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: Thomas Graf @ 2012-06-21 8:19 UTC (permalink / raw)
To: John Fastabend; +Cc: tgraf, davem, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <20120621055621.14148.42206.stgit@jf-dev1-dcblab>
On Wed, Jun 20, 2012 at 10:56:21PM -0700, John Fastabend wrote:
> A small regression was introduced in the reply command of
> dcbnl_pg_setcfg(). User space apps may be expecting the
> DCB_ATTR_PG_CFG attribute to be returned with the patch
> below TX or RX variants are returned.
>
> commit 7be994138b188387691322921c08e19bddf6d3c5
> Author: Thomas Graf <tgraf@suug.ch>
> Date: Wed Jun 13 02:54:55 2012 +0000
>
> dcbnl: Shorten all command handling functions
>
> This patch reverts this behavior and returns DCB_ATTR_PG_CFG
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>
> net/dcb/dcbnl.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
> index 0a36007..013da86 100644
> --- a/net/dcb/dcbnl.c
> +++ b/net/dcb/dcbnl.c
> @@ -852,8 +852,7 @@ static int __dcbnl_pg_setcfg(struct net_device *netdev, struct nlmsghdr *nlh,
> }
> }
>
> - return nla_put_u8(skb,
> - (dir ? DCB_CMD_PGRX_SCFG : DCB_CMD_PGTX_SCFG), 0);
> + return nla_put_u8(skb, DCB_ATTR_PG_CFG, 0);
> }
>
ACK
Thanks John
^ permalink raw reply
* Re: IPV6 ndisc:: Bad NIC causing IPV6 NDP to stop working
From: Eric Dumazet @ 2012-06-21 8:22 UTC (permalink / raw)
To: Menny_Hamburger; +Cc: netdev
In-Reply-To: <D8C50530D6022F40A817A35C40CC06A70B34DBEFCF@DUBX7MCDUB01.EMEA.DELL.COM>
On Thu, 2012-06-21 at 08:59 +0100, Menny_Hamburger@Dell.com wrote:
> Hi,
>
> Our machines runs EL5.8 x86_64.
> We have witnessed several cases where we suspect that a bad NIC on the machine caused IPV6 neighbour discovery to stop working on all the other NICs - when this happens ping6 fails on every NIC we try it.
> From looking into the code I see that there is only a single socket assigned for NDP; Does it sound logical to allocate a socket per interface instead of a single global socket.
> I have found the following thread in LKML: https://lkml.org/lkml/2006/11/29/335, and it seems that this allocation issue still exists in EL5 based kernels - could this cause the above problem?
>
What is a bad NIC, and why not fixing it ?
^ permalink raw reply
* Re: [RFC] tcp: How does SACK or FACK determine the time to start fast retransmition?
From: Vijay Subramanian @ 2012-06-21 8:42 UTC (permalink / raw)
To: 李易; +Cc: netdev, kernelnewbies
In-Reply-To: <4FE2C03C.6030102@gmail.com>
On 20 June 2012 23:33, 李易 <lovelylich@gmail.com> wrote:
> HI all,
> When tcp uses reno as its congestion control algothim, it uses
> tp->sacked_out as dup-ack. When the third dup-ack(under default
> condition) comes, tcp will initiate its fast retransmition.
> But how about sack ?
> According to kernel source code comments, when sack or fack tcp option
> is enabled, there is no dup-ack counter. See comments for function
> tcp_dupack_heuristics():
> http://lxr.linux.no/linux+v2.6.37/net/ipv4/tcp_input.c#L2300
> So , how does tcp know the current dup-ack is the last one which
> triggers the fast retransmition?
With SACK, number of dupacks does not have much meaning. What matters is
--how the SACK scoreboard looks like i.e. which packets are tagged
Lost/Sacked/Retransmitted
-- Whether FACK is in use (this assumes holes in between sacked
packets are lost and have left the network and so we can send out more
packets)
So, stack does not count the number of dupacks that have come in. Only
SACK blocks matter.
You can try to track the following path:
tcp_ack() deals with incoming acks and if it sees a dupack (does not
matter what number), or incoming packet contains SACK it calls
tcp_fastretrans_alert() which calls tcp_xmit_retransmit_queue().
tcp_xmit_retransmit_queue() decides which packets to retransmit. The
first packet to start retransmitting from is tracked in
tp->retransmit_skb_hint.
Note that the dupThresh is actually tracked by tp->reordering which
measures the reordering in the network and is not fixed at 3. So, if
more than
tp->reordering packets have been acked above a given packet, this
packet is a candidate for retransmisson. See tcp_mark_head_lost() to
see how the
reordering metric is used to mark packets as lost. This corresponds to
the check you mentioned in the RFC.
So, window permitting, packets are sent as follows;
(a)-- Packets marked lost as per description above
(b)-- new packets (if any)
(c)-- Holes between sacked packets which are not reliably lost.
choice between (b) and (c) is made in tcp_can_forward_retransmit().
Hope this helps.
Vijay
^ permalink raw reply
* RE: IPV6 ndisc:: Bad NIC causing IPV6 NDP to stop working
From: Menny_Hamburger @ 2012-06-21 8:43 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1340266972.4604.4404.camel@edumazet-glaptop>
For high availability reasons, the machines discussed run with a number of NICs per subnet, where our own proprietary service fixes up routing when a NIC goes wild.
We schedule a fix in the field but our goal is to eliminate as many single points of failure as we can, so that our systems will still run properly when something goes wrong.
We encountered this issue on some proprietary NICs but also with bnx2, where we get "chip not in correct endian mode" errors (This is another problem that may require a separate discussion).
-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
Sent: 21 June, 2012 11:23
To: Hamburger, Menny
Cc: netdev@vger.kernel.org
Subject: Re: IPV6 ndisc:: Bad NIC causing IPV6 NDP to stop working
On Thu, 2012-06-21 at 08:59 +0100, Menny_Hamburger@Dell.com wrote:
> Hi,
>
> Our machines runs EL5.8 x86_64.
> We have witnessed several cases where we suspect that a bad NIC on the machine caused IPV6 neighbour discovery to stop working on all the other NICs - when this happens ping6 fails on every NIC we try it.
> From looking into the code I see that there is only a single socket assigned for NDP; Does it sound logical to allocate a socket per interface instead of a single global socket.
> I have found the following thread in LKML: https://lkml.org/lkml/2006/11/29/335, and it seems that this allocation issue still exists in EL5 based kernels - could this cause the above problem?
>
What is a bad NIC, and why not fixing it ?
^ permalink raw reply
* Re: [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: David Miller @ 2012-06-21 9:04 UTC (permalink / raw)
To: tgraf; +Cc: john.r.fastabend, tgraf, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <20120621081910.GH27921@canuck.infradead.org>
From: Thomas Graf <tgraf@infradead.org>
Date: Thu, 21 Jun 2012 04:19:10 -0400
> ACK
Can I get a real "Acked-by: ..." so that it automatically gets
picked up by patchwork? Thanks.
^ permalink raw reply
* RE: IPV6 ndisc:: Bad NIC causing IPV6 NDP to stop working
From: Eric Dumazet @ 2012-06-21 9:16 UTC (permalink / raw)
To: Menny_Hamburger; +Cc: netdev
In-Reply-To: <D8C50530D6022F40A817A35C40CC06A70B34DBF05D@DUBX7MCDUB01.EMEA.DELL.COM>
Please don't top post on this list.
On Thu, 2012-06-21 at 09:43 +0100, Menny_Hamburger@Dell.com wrote:
> For high availability reasons, the machines discussed run with a
> number of NICs per subnet, where our own proprietary service fixes up
> routing when a NIC goes wild.
> We schedule a fix in the field but our goal is to eliminate as many
> single points of failure as we can, so that our systems will still run
> properly when something goes wrong.
Even if a NIC does memory corruption or some nasty bug ?
That sounds great :)
> We encountered this issue on some proprietary NICs but also with bnx2,
> where we get "chip not in correct endian mode" errors (This is another
> problem that may require a separate discussion).
Until very recently, we used to orphan skb before giving them to device
transmit. So you probably use a very old kernel.
I guess we could just do a regular alloc_skb(), it makes no sense to
limit in-flight ND skbs, we have Qdisc/device limits anyway.
BTW, I have no idea why ndisc_build_skb() is EXPORTed
net/ipv6/ndisc.c | 24 ++++++------------------
1 file changed, 6 insertions(+), 18 deletions(-)
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 69a6330..f149d85 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -429,7 +429,6 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
int hlen = LL_RESERVED_SPACE(dev);
int tlen = dev->needed_tailroom;
int len;
- int err;
u8 *opt;
if (!dev->addr_len)
@@ -439,15 +438,10 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
if (llinfo)
len += ndisc_opt_addr_space(dev);
- skb = sock_alloc_send_skb(sk,
- (MAX_HEADER + sizeof(struct ipv6hdr) +
- len + hlen + tlen),
- 1, &err);
- if (!skb) {
- ND_PRINTK(0, err, "ND: %s failed to allocate an skb, err=%d\n",
- __func__, err);
+ skb = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + len + hlen + tlen,
+ GFP_ATOMIC);
+ if (!skb)
return NULL;
- }
skb_reserve(skb, hlen);
ip6_nd_hdr(sk, skb, dev, saddr, daddr, IPPROTO_ICMPV6, len);
@@ -1550,16 +1544,10 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
hlen = LL_RESERVED_SPACE(dev);
tlen = dev->needed_tailroom;
- buff = sock_alloc_send_skb(sk,
- (MAX_HEADER + sizeof(struct ipv6hdr) +
- len + hlen + tlen),
- 1, &err);
- if (buff == NULL) {
- ND_PRINTK(0, err,
- "Redirect: %s failed to allocate an skb, err=%d\n",
- __func__, err);
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + len + hlen + tlen,
+ GFP_ATOMIC);
+ if (!buff)
goto release;
- }
skb_reserve(buff, hlen);
ip6_nd_hdr(sk, buff, dev, &saddr_buf, &ipv6_hdr(skb)->saddr,
^ permalink raw reply related
* [PATCH net 0/4] net/mlx4: Bug fixes for the mlx4_en driver
From: Yevgeny Petrilin @ 2012-06-21 9:18 UTC (permalink / raw)
To: davem; +Cc: netdev, yevgenyp
This is a set of 4 bug fixes generated agains net tree:
Yevgeny Petrilin (4):
net/mlx4_en: Set correct port parameters during device initialization
net/mlx4: Use single completion vector after NOP failure
net/mlx4_en: Release QP range in free_resources
net/mlx4_en: Use atomic counter to decide when queue is full
en_netdev.c | 18 ++++++++++++------
en_tx.c | 16 ++++++++--------
main.c | 2 ++
mlx4_en.h | 2 ++
4 files changed, 24 insertions(+), 14 deletions(-)
Thanks,
Yevgeny
^ permalink raw reply
* [PATCH net 0/4] net/mlx4: Bug fixes for the mlx4_en driver
From: Yevgeny Petrilin @ 2012-06-21 9:19 UTC (permalink / raw)
To: davem; +Cc: netdev
This is a set of 4 bug fixes generated agains net tree:
Yevgeny Petrilin (4):
net/mlx4_en: Set correct port parameters during device initialization
net/mlx4: Use single completion vector after NOP failure
net/mlx4_en: Release QP range in free_resources
net/mlx4_en: Use atomic counter to decide when queue is full
en_netdev.c | 18 ++++++++++++------
en_tx.c | 16 ++++++++--------
main.c | 2 ++
mlx4_en.h | 2 ++
4 files changed, 24 insertions(+), 14 deletions(-)
Thanks,
Yevgeny
^ permalink raw reply
* [PATCH 2/4] net/mlx4: Use single completion vector after NOP failure
From: Yevgeny Petrilin @ 2012-06-21 9:19 UTC (permalink / raw)
To: davem; +Cc: netdev, Yevgeny Petrilin
In-Reply-To: <1340270358-19504-1-git-send-email-yevgenyp@mellanox.co.il>
Fix a crash at the error flow of NOP command which caused the driver to try and use
a completion vector which wasn't allocated.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
drivers/net/ethernet/mellanox/mlx4/main.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 18c8deb..c91a2b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1978,6 +1978,8 @@ slave_start:
if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X) &&
!mlx4_is_mfunc(dev)) {
dev->flags &= ~MLX4_FLAG_MSI_X;
+ dev->caps.num_comp_vectors = 1;
+ dev->caps.comp_pool = 0;
pci_disable_msix(pdev);
err = mlx4_setup_hca(dev);
}
--
1.7.1
^ permalink raw reply related
* [PATCH 3/4] net/mlx4_en: Release QP range in free_resources
From: Yevgeny Petrilin @ 2012-06-21 9:19 UTC (permalink / raw)
To: davem; +Cc: netdev, Yevgeny Petrilin
In-Reply-To: <1340270358-19504-1-git-send-email-yevgenyp@mellanox.co.il>
Add a missing resource release in ring cleanup.
Not doing this leaves a range of QPs that are being reserved,
and no one can use them.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 12 ++++++++----
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1 +
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index a80280e..073b85b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -929,15 +929,20 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
if (priv->rx_cq[i].buf)
mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
}
+
+ if (priv->base_tx_qpn) {
+ mlx4_qp_release_range(priv->mdev->dev, priv->base_tx_qpn, priv->tx_ring_num);
+ priv->base_tx_qpn = 0;
+ }
}
int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
{
struct mlx4_en_port_profile *prof = priv->prof;
int i;
- int base_tx_qpn, err;
+ int err;
- err = mlx4_qp_reserve_range(priv->mdev->dev, priv->tx_ring_num, 256, &base_tx_qpn);
+ err = mlx4_qp_reserve_range(priv->mdev->dev, priv->tx_ring_num, 256, &priv->base_tx_qpn);
if (err) {
en_err(priv, "failed reserving range for TX rings\n");
return err;
@@ -949,7 +954,7 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
prof->tx_ring_size, i, TX))
goto err;
- if (mlx4_en_create_tx_ring(priv, &priv->tx_ring[i], base_tx_qpn + i,
+ if (mlx4_en_create_tx_ring(priv, &priv->tx_ring[i], priv->base_tx_qpn + i,
prof->tx_ring_size, TXBB_SIZE))
goto err;
}
@@ -969,7 +974,6 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
err:
en_err(priv, "Failed to allocate NIC resources\n");
- mlx4_qp_release_range(priv->mdev->dev, base_tx_qpn, priv->tx_ring_num);
return -ENOMEM;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6ae3509..225c20d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -495,6 +495,7 @@ struct mlx4_en_priv {
int vids[128];
bool wol;
struct device *ddev;
+ int base_tx_qpn;
#ifdef CONFIG_MLX4_EN_DCB
struct ieee_ets ets;
--
1.7.1
^ permalink raw reply related
* [PATCH 4/4] net/mlx4_en: Use atomic counter to decide when queue is full
From: Yevgeny Petrilin @ 2012-06-21 9:19 UTC (permalink / raw)
To: davem; +Cc: netdev, Yevgeny Petrilin
In-Reply-To: <1340270358-19504-1-git-send-email-yevgenyp@mellanox.co.il>
The Transmit and transmit completion flows execute from different contexts,
which are not synchronized. Hence naive reading the of consumer index might
give wrong value by the time it is being used, That could lead to a state of transmit timeout.
Fix that by using atomic variable to maintain that index.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 16 ++++++++--------
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1 +
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 019d856..f4b4703 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -165,6 +165,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
ring->last_nr_txbb = 1;
ring->poll_cnt = 0;
ring->blocked = 0;
+ atomic_set(&ring->inflight, 0);
memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
memset(ring->buf, 0, ring->buf_size);
@@ -364,15 +365,13 @@ static void mlx4_en_process_tx_cq(struct net_device *dev, struct mlx4_en_cq *cq)
wmb();
ring->cons += txbbs_skipped;
netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
+ atomic_sub(txbbs_skipped, &ring->inflight);
/* Wakeup Tx queue if this ring stopped it */
- if (unlikely(ring->blocked)) {
- if ((u32) (ring->prod - ring->cons) <=
- ring->size - HEADROOM - MAX_DESC_TXBBS) {
- ring->blocked = 0;
- netif_tx_wake_queue(ring->tx_queue);
- priv->port_stats.wake_queue++;
- }
+ if (unlikely(ring->blocked && txbbs_skipped > 0)) {
+ ring->blocked = 0;
+ netif_tx_wake_queue(ring->tx_queue);
+ priv->port_stats.wake_queue++;
}
}
@@ -588,7 +587,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
vlan_tag = vlan_tx_tag_get(skb);
/* Check available TXBBs And 2K spare for prefetch */
- if (unlikely(((int)(ring->prod - ring->cons)) >
+ if (unlikely(atomic_read(&ring->inflight) >
ring->size - HEADROOM - MAX_DESC_TXBBS)) {
/* every full Tx ring stops queue */
netif_tx_stop_queue(ring->tx_queue);
@@ -710,6 +709,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
}
ring->prod += nr_txbb;
+ atomic_add(nr_txbb, &ring->inflight);
/* If we used a bounce buffer then copy descriptor back into place */
if (bounce)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 225c20d..6a8a69d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -257,6 +257,7 @@ struct mlx4_en_tx_ring {
struct mlx4_bf bf;
bool bf_enabled;
struct netdev_queue *tx_queue;
+ atomic_t inflight;
};
struct mlx4_en_rx_desc {
--
1.7.1
^ permalink raw reply related
* [PATCH 1/4] net/mlx4_en: Set correct port parameters during device initialization
From: Yevgeny Petrilin @ 2012-06-21 9:19 UTC (permalink / raw)
To: davem; +Cc: netdev, Yevgeny Petrilin
In-Reply-To: <1340270358-19504-1-git-send-email-yevgenyp@mellanox.co.il>
Set valid port parameters: MTU and flow control configuration when
configuring the port during HW device initialization,
prior to the net device open() being called.
Using invalid parameters (such as all zeros)
could lead to bad firmware behavior.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 926d8aa..a80280e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1204,9 +1204,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
en_warn(priv, "Using %d RX rings\n", prof->rx_ring_num);
/* Configure port */
+ mlx4_en_calc_rx_buf(dev);
err = mlx4_SET_PORT_general(mdev->dev, priv->port,
- MLX4_EN_MIN_MTU,
- 0, 0, 0, 0);
+ priv->rx_skb_size + ETH_FCS_LEN,
+ prof->tx_pause, prof->tx_ppp,
+ prof->rx_pause, prof->rx_ppp);
if (err) {
en_err(priv, "Failed setting port general configurations "
"for port %d, with error %d\n", priv->port, err);
--
1.7.1
^ permalink raw reply related
* Re: [RFC] tcp: How does SACK or FACK determine the time to start fast retransmition?
From: 李易 @ 2012-06-21 10:17 UTC (permalink / raw)
To: Vijay Subramanian; +Cc: netdev, kernelnewbies
In-Reply-To: <CAGK4HS_CN53EtQvPm2H7Qn85eXRz7hiEQvs3bU3xXpcNeg4byg@mail.gmail.com>
于 2012/6/21 16:42, Vijay Subramanian 写道:
> With SACK, number of dupacks does not have much meaning. What matters is
> --how the SACK scoreboard looks like i.e. which packets are tagged
> Lost/Sacked/Retransmitted
> -- Whether FACK is in use (this assumes holes in between sacked
> packets are lost and have left the network and so we can send out more
> packets)
>
> So, stack does not count the number of dupacks that have come in. Only
> SACK blocks matter.
> You can try to track the following path:
> tcp_ack() deals with incoming acks and if it sees a dupack (does not
> matter what number), or incoming packet contains SACK it calls
> tcp_fastretrans_alert() which calls tcp_xmit_retransmit_queue().
>
> tcp_xmit_retransmit_queue() decides which packets to retransmit. The
> first packet to start retransmitting from is tracked in
> tp->retransmit_skb_hint.
> Note that the dupThresh is actually tracked by tp->reordering which
> measures the reordering in the network and is not fixed at 3. So, if
> more than
> tp->reordering packets have been acked above a given packet, this
> packet is a candidate for retransmisson. See tcp_mark_head_lost() to
> see how the
> reordering metric is used to mark packets as lost. This corresponds to
> the check you mentioned in the RFC.
>
> So, window permitting, packets are sent as follows;
> (a)-- Packets marked lost as per description above
> (b)-- new packets (if any)
> (c)-- Holes between sacked packets which are not reliably lost.
>
> choice between (b) and (c) is made in tcp_can_forward_retransmit().
>
> Hope this helps.
> Vijay
>
It is just I wanted! Thanks for your detailed explaination and kindness.
^ permalink raw reply
* Re: [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: Thomas Graf @ 2012-06-21 10:52 UTC (permalink / raw)
To: John Fastabend; +Cc: davem, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <20120621055621.14148.42206.stgit@jf-dev1-dcblab>
On Wed, Jun 20, 2012 at 10:56:21PM -0700, John Fastabend wrote:
> A small regression was introduced in the reply command of
> dcbnl_pg_setcfg(). User space apps may be expecting the
> DCB_ATTR_PG_CFG attribute to be returned with the patch
> below TX or RX variants are returned.
>
> commit 7be994138b188387691322921c08e19bddf6d3c5
> Author: Thomas Graf <tgraf@suug.ch>
> Date: Wed Jun 13 02:54:55 2012 +0000
>
> dcbnl: Shorten all command handling functions
>
> This patch reverts this behavior and returns DCB_ATTR_PG_CFG
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
^ permalink raw reply
* [net] ixgbe: simplify padding and length checks
From: Jeff Kirsher @ 2012-06-21 12:15 UTC (permalink / raw)
To: davem; +Cc: Stephen Hemminger, netdev, gospo, sassmann, Jeff Kirsher
From: Stephen Hemminger <shemminger@vyatta.com>
The check for length <= 0 is bogus because length is unsigned, and network
stack never sends zero length packets (unless it is totally broken).
The check for really small packets can be optimized (using unlikely)
and calling skb_pad directly.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index cbb05d6..9afe0cb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6383,17 +6383,12 @@ static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
struct ixgbe_adapter *adapter = netdev_priv(netdev);
struct ixgbe_ring *tx_ring;
- if (skb->len <= 0) {
- dev_kfree_skb_any(skb);
- return NETDEV_TX_OK;
- }
-
/*
* The minimum packet size for olinfo paylen is 17 so pad the skb
* in order to meet this minimum size requirement.
*/
- if (skb->len < 17) {
- if (skb_padto(skb, 17))
+ if (unlikely(skb->len < 17)) {
+ if (skb_pad(skb, 17 - skb->len))
return NETDEV_TX_OK;
skb->len = 17;
}
--
1.7.10.2
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox