Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] ipv4: Early TCP socket demux.
From: Changli Gao @ 2012-06-19  4:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert
In-Reply-To: <20120618.194016.2282814982594761206.davem@davemloft.net>

On Tue, Jun 19, 2012 at 10:40 AM, David Miller <davem@davemloft.net> wrote:
> @@ -324,19 +324,34 @@ static int ip_rcv_finish(struct sk_buff *skb)
>         *      how the packet travels inside Linux networking.
>         */
>        if (skb_dst(skb) == NULL) {
> -               int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
> -                                              iph->tos, skb->dev);
> -               if (unlikely(err)) {
> -                       if (err == -EHOSTUNREACH)
> -                               IP_INC_STATS_BH(dev_net(skb->dev),
> -                                               IPSTATS_MIB_INADDRERRORS);
> -                       else if (err == -ENETUNREACH)
> -                               IP_INC_STATS_BH(dev_net(skb->dev),
> -                                               IPSTATS_MIB_INNOROUTES);
> -                       else if (err == -EXDEV)
> -                               NET_INC_STATS_BH(dev_net(skb->dev),
> -                                                LINUX_MIB_IPRPFILTER);
> -                       goto drop;
> +               const struct net_protocol *ipprot;
> +               int protocol = iph->protocol;
> +               int hash, err;
> +
> +               hash = protocol & (MAX_INET_PROTOS - 1);
> +
> +               rcu_read_lock();
> +               ipprot = rcu_dereference(inet_protos[hash]);
> +               err = -ENOENT;
> +               if (ipprot && ipprot->early_demux)
> +                       err = ipprot->early_demux(skb);

I am afraid that this lookup with hurt the performance of the
forwarding path. A knob?

If this approach is acceptable, maybe we can use sockets to do finer RFS.

Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-19  4:47 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, netdev, Tom Herbert
In-Reply-To: <CABa6K_Ge-Y1ne3iBcN1HSRN+G4bW6hSAHYFGKuA5QKW6CT3grQ@mail.gmail.com>

On Tue, 2012-06-19 at 12:43 +0800, Changli Gao wrote:

> I am afraid that this lookup with hurt the performance of the
> forwarding path. A knob?
> 

ip_rcv() & ip_rcv_finish() in the forwarding path ?

^ permalink raw reply

* Re: [PATCH] ipv4: Early TCP socket demux.
From: Changli Gao @ 2012-06-19  4:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert
In-Reply-To: <1340081276.7491.2228.camel@edumazet-glaptop>

On Tue, Jun 19, 2012 at 12:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-06-19 at 12:43 +0800, Changli Gao wrote:
>
>> I am afraid that this lookup with hurt the performance of the
>> forwarding path. A knob?
>>
>
> ip_rcv() & ip_rcv_finish() in the forwarding path ?
>
>

Yes, the two routines are shared by both. I think you mean
ip_local_deliver() and ip_local_deliver_finish().

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] TCP:  Support configurable delayed-ack parameters.
From: Eric Dumazet @ 2012-06-19  5:11 UTC (permalink / raw)
  To: greearb; +Cc: netdev, Daniel Baluta
In-Reply-To: <1340067163-29329-1-git-send-email-greearb@candelatech.com>

On Mon, 2012-06-18 at 17:52 -0700, greearb@candelatech.com wrote:
> From: Ben Greear <greearb@candelatech.com>
> 
> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
> 
> " .. an ACK SHOULD be generated for at least every second
>   full-sized segment, and MUST be generated within 500 ms
>   of the arrival of the first unacknowledged packet.
> "
> 
> We export the number of segments and the timeout limits
> specified above, so that a user can tune them according
> to their needs.
> 
> Specifically:
> 	* /proc/sys/net/ipv4/tcp_default_delack_segs, represents
> 	the threshold for the number of segments.
> 	* /proc/sys/net/ipv4/tcp_default_delack_min, specifies
> 	the minimum timeout value
> 	* /proc/sys/net/ipv4/tcp_default_delack_max, specifies
> 	the maximum timeout value.
> 
> In addition, new TCP socket options are added to allow
> per-socket configuration:
> 
> TCP_DELACK_SEGS
> TCP_DELACK_MIN
> TCP_DELACK_MAX
> 
> In order to keep a multiply out of the hot path, the segs * mss
> computation is recalculated and cached whenever segs or mss changes.
> 

I know David was worried about this multiply, but current cpus do a
multiply in at most 3 cycles.

Addding an u32 field in socket structure adds 1/16 of a cache line, and
adds more penalty.

Avoiding to build/send an ACK packet can save us so many cpu cycles that
the multiply is pure noise.

^ permalink raw reply

* RE: [PATCH] net: added support for 40GbE link.
From: Parav.Pandit @ 2012-06-19  5:20 UTC (permalink / raw)
  To: rick.jones2; +Cc: netdev, bhutchings
In-Reply-To: <4FDF56FB.9080509@hp.com>



> -----Original Message-----
> From: Rick Jones [mailto:rick.jones2@hp.com]
> Sent: Monday, June 18, 2012 9:58 PM
> To: Pandit, Parav
> Cc: netdev@vger.kernel.org; bhutchings@solarflare.com
> Subject: Re: [PATCH] net: added support for 40GbE link.
> 
> On 06/18/2012 05:44 AM, Parav Pandit wrote:
> 
> > diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index
> > 297370a..1ebfa6e 100644
> > --- a/include/linux/ethtool.h
> > +++ b/include/linux/ethtool.h
> > @@ -1153,6 +1153,10 @@ struct ethtool_ops {
> >   #define SUPPORTED_10000baseR_FEC	(1<<  20)
> >   #define SUPPORTED_20000baseMLD2_Full	(1<<  21)
> >   #define SUPPORTED_20000baseKR2_Full	(1<<  22)
> > +#define SUPPORTED_40000baseKR4_Full	(1<<  23)
> > +#define SUPPORTED_40000baseCR4_Full	(1<<  24)
> > +#define SUPPORTED_40000baseSR4_Full	(1<<  25)
> > +#define SUPPORTED_40000baseLR4_Full	(1<<  26)
> >
> >   /* Indicates what features are advertised by the interface. */
> >   #define ADVERTISED_10baseT_Half		(1<<  0)
> > @@ -1178,6 +1182,10 @@ struct ethtool_ops {
> >   #define ADVERTISED_10000baseR_FEC	(1<<  20)
> >   #define ADVERTISED_20000baseMLD2_Full	(1<<  21)
> >   #define ADVERTISED_20000baseKR2_Full	(1<<  22)
> > +#define ADVERTISED_40000baseKR4_Full	(1<<  23)
> > +#define ADVERTISED_40000baseCR4_Full	(1<<  24)
> > +#define ADVERTISED_40000baseSR4_Full	(1<<  25)
> > +#define ADVERTISED_40000baseLR4_Full	(1<<  26)
> 
> Any idea how many defines will be wanted for 100 Gbit Ethernet?
> Supported and advertising in ethtool_cmd are __u32s...
> 
100G supports CR10, ER4, LR4, Base-R and SR10 interfaces. So 5 bits. We have space from 27 to 31 bits for 100G as well.

> rick jones

^ permalink raw reply

* Re: [PATCH] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-19  6:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1340080661.7491.2205.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 19 Jun 2012 06:37:41 +0200

> On Mon, 2012-06-18 at 21:27 -0700, David Miller wrote:
> 
>> How are we not using it?  We use the cached SKB socket no matter what
>> happens.
>> 
>> Look at how inet hash lookup works.
>> 
>> The error tells the caller solely whether a route lookup is still
>> necessary.
> 
> OK, remove the unlikely() in __inet_lookup_skb() so that its obvious we
> have this skb_steal_sock() thing :)

Sure thing.

We also need to add some dst->ops->check() handling as well.

^ permalink raw reply

* Re: [PATCH] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-19  6:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120618.230723.612232275251003129.davem@davemloft.net>

On Mon, 2012-06-18 at 23:07 -0700, David Miller wrote:

> Sure thing.
> 
> We also need to add some dst->ops->check() handling as well.

Yes, rp_filter comes to mind.

^ permalink raw reply

* Re: [PATCH] sctp: fix warning when compiling without IPv6
From: David Miller @ 2012-06-19  7:28 UTC (permalink / raw)
  To: dhalperi; +Cc: netdev
In-Reply-To: <B7062EEE-046D-4435-B5FC-54FF3F763645@cs.washington.edu>

From: Daniel Halperin <dhalperi@cs.washington.edu>
Date: Mon, 18 Jun 2012 14:04:55 -0700

> net/sctp/protocol.c: In function ‘sctp_addr_wq_timeout_handler’:
> net/sctp/protocol.c:676: warning: label ‘free_next’ defined but not used
> 
> Signed-off-by: Daniel Halperin <dhalperi@cs.washington.edu>

Applied.

^ permalink raw reply

* Re: [PATCH RESEND] net: lpc_eth: Driver cleanup
From: David Miller @ 2012-06-19  7:28 UTC (permalink / raw)
  To: stigge
  Cc: eric.dumazet, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <1340050482-22666-1-git-send-email-stigge@antcom.de>

From: Roland Stigge <stigge@antcom.de>
Date: Mon, 18 Jun 2012 22:14:42 +0200

> This patch removes some nowadays superfluous definitions (one unused define and
> an obsolete function forward declaration) and corrects a netdev_err() to
> netdev_dbg().
> 
> Signed-off-by: Roland Stigge <stigge@antcom.de>
> Signed-off-by: Alexandre Pereira da Silva <aletes.xgr@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: added support for 40GbE link.
From: David Miller @ 2012-06-19  7:29 UTC (permalink / raw)
  To: bhutchings; +Cc: parav.pandit, netdev
In-Reply-To: <1340039376.2913.13.camel@bwh-desktop.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 18 Jun 2012 18:09:36 +0100

> On Mon, 2012-06-18 at 18:14 +0530, Parav Pandit wrote:
...
>> -/* The forced speed, 10Mb, 100Mb, gigabit, 2.5Gb, 10GbE. */
>> +/* The forced speed, 10Mb, 100Mb, gigabit, 2.5Gb, 10GbE, 40GbE. */
> 
> I don't think there's any need to name all possible link speeds, and it
> just encourages the bad practice of ethtool API users checking for
> specific values.  You may notice there is no SPEED_20000.

Agreed.

>> @@ -542,13 +542,11 @@ static int prb_calc_retire_blk_tmo(struct packet_sock *po,
 ...
> This function should be fixed properly.  Firstly, it must use
> ethtool_cmd_speed() rather than directly accessing ecmd.speed.
> Secondly, it should allow any speed value rather than checking for
> specific values.  Then there will be no need to make further changes for
> 100G or any other new speed.

Agreed.

^ permalink raw reply

* Re: [PATCH] phy/micrel: change phy_id_mask for KSZ9021 and KS8001
From: David Miller @ 2012-06-19  7:31 UTC (permalink / raw)
  To: jason77.wang; +Cc: david.choi, nobuhiro.iwamatsu.yj, netdev
In-Reply-To: <1340009529-4231-1-git-send-email-jason77.wang@gmail.com>

From: Hui Wang <jason77.wang@gmail.com>
Date: Mon, 18 Jun 2012 16:52:09 +0800

> On a freescale imx6q platform, a hardware phy chip KSZ9021 is
> recognized as a KS8001 chip by the current driver like this:
> eth0: Freescale FEC PHY driver [Micrel KS8001 or KS8721]
> 
> KSZ9021 has phy_id 0x00221610, while KSZ8001 has phy_id
> 0x0022161a, the current phy_id_mask (0x00fffff0/0x00ffff10) can't
> distinguish them. So change phy_id_mask to resolve this problem.
> 
> Although the micrel datasheet says that the 4 LSB of phyid2 register
> contains the chip revision number and the current driver is designed
> to follow this rule, in reality the chip implementation doesn't follow
> it.
> 
> Cc: David J. Choi <david.choi@micrel.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
> Signed-off-by: Hui Wang <jason77.wang@gmail.com>

Applied, thanks.

^ permalink raw reply

* RE: [PATCH] net: added support for 40GbE link.
From: Parav.Pandit @ 2012-06-19  7:33 UTC (permalink / raw)
  To: davem, bhutchings; +Cc: netdev
In-Reply-To: <20120619.002905.922583388766089167.davem@davemloft.net>



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 19, 2012 12:59 PM
> To: bhutchings@solarflare.com
> Cc: Pandit, Parav; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: added support for 40GbE link.
> 
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Mon, 18 Jun 2012 18:09:36 +0100
> 
> > On Mon, 2012-06-18 at 18:14 +0530, Parav Pandit wrote:
> ...
> >> -/* The forced speed, 10Mb, 100Mb, gigabit, 2.5Gb, 10GbE. */
> >> +/* The forced speed, 10Mb, 100Mb, gigabit, 2.5Gb, 10GbE, 40GbE. */
> >
> > I don't think there's any need to name all possible link speeds, and
> > it just encourages the bad practice of ethtool API users checking for
> > specific values.  You may notice there is no SPEED_20000.
> 
> Agreed.

Should eventually all net driver should remove using SPEED_xxxxxx and start using hard coded value of 10, 100, 1000, 20000?

> 
> >> @@ -542,13 +542,11 @@ static int prb_calc_retire_blk_tmo(struct
> >> packet_sock *po,
>  ...
> > This function should be fixed properly.  Firstly, it must use
> > ethtool_cmd_speed() rather than directly accessing ecmd.speed.
> > Secondly, it should allow any speed value rather than checking for
> > specific values.  Then there will be no need to make further changes
> > for 100G or any other new speed.
> 
> Agreed.

That means ethtool_cmd_speed() should not be called in this function?

If I understand correctly, it should return the value of 8ms (for 10Mb,s 100Mbps, 2.5 Gbps, 20Gbps) as today and remaining it should return calculated value?
Or
Function needs a fix for all these speeds (10Mbps, 100Mbs, 20Gbps too)?

^ permalink raw reply

* Re: [XFRM][RFC v1] Fix unexpected SA hard expiration after setting new date
From: David Miller @ 2012-06-19  7:34 UTC (permalink / raw)
  To: herbert; +Cc: fan.du, netdev, fdu
In-Reply-To: <20120618084021.GA24902@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 18 Jun 2012 16:40:21 +0800

> I guess I can live with your workaround if Dave is happy with
> it.  But IMHO we should just go back to relative time and fix
> the suspend/resume user-space scripts instead.

In the short term I'm happy with it too.

But I seem to remember that last time this issue came up the
suggestion was to use highres timers.  Someone tried but they did an
amazingly poor job so the effort just died off.  It takes someone with
some skill because highres timers operate with different context
requirements than normal timers.

^ permalink raw reply

* Re: [PATCH] net: added support for 40GbE link.
From: David Miller @ 2012-06-19  7:35 UTC (permalink / raw)
  To: Parav.Pandit; +Cc: bhutchings, netdev
In-Reply-To: <5D6C0ABE6A236946864C45679362BBE20AC52269@CMEXMB1.ad.emulex.com>

From: <Parav.Pandit@Emulex.Com>
Date: Tue, 19 Jun 2012 07:33:12 +0000

> Should eventually all net driver should remove using SPEED_xxxxxx and start using hard coded value of 10, 100, 1000, 20000?

No, the ones that exist can stay, just no new ones.

> That means ethtool_cmd_speed() should not be called in this function?

Ben said that it must be called, what are you talking about?

^ permalink raw reply

* RE: [PATCH] net: added support for 40GbE link.
From: Parav.Pandit @ 2012-06-19  7:42 UTC (permalink / raw)
  To: davem; +Cc: bhutchings, netdev
In-Reply-To: <20120619.003517.2067426285873138772.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 19, 2012 1:05 PM
> To: Pandit, Parav
> Cc: bhutchings@solarflare.com; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: added support for 40GbE link.
> 
> From: <Parav.Pandit@Emulex.Com>
> Date: Tue, 19 Jun 2012 07:33:12 +0000
> 
> > Should eventually all net driver should remove using SPEED_xxxxxx and
> start using hard coded value of 10, 100, 1000, 20000?
> 
> No, the ones that exist can stay, just no new ones.
> 
So driver which supports 40Gpbs, 100Gbps should hardcode to 40000, 100000 respectively?

> > That means ethtool_cmd_speed() should not be called in this function?
> 
> Ben said that it must be called, what are you talking about?

Sorry, I wanted to ask - Do you need switch case for speed like below new code or its should be speed independent code?
                switch (ethtool_cmd_speed()) {
                case SPEED_100:
                case SPEED_10:
                        return DEFAULT_PRB_RETIRE_TOV;
                default:
                        msec = 1;
                        div = ethtool_cmd_speed() / 1000;
                        break;
                /*
                }

^ permalink raw reply

* Re: [XFRM][RFC v1] Fix unexpected SA hard expiration after setting new date
From: Herbert Xu @ 2012-06-19  7:43 UTC (permalink / raw)
  To: David Miller; +Cc: fan.du, netdev, fdu
In-Reply-To: <20120619.003422.1109786261235176550.davem@davemloft.net>

On Tue, Jun 19, 2012 at 12:34:22AM -0700, David Miller wrote:
> 
> But I seem to remember that last time this issue came up the
> suggestion was to use highres timers.  Someone tried but they did an
> amazingly poor job so the effort just died off.  It takes someone with
> some skill because highres timers operate with different context
> requirements than normal timers.

Sorry I think I didn't see that discussion or if I did, then
it has completely left my mind :)

So excuse my ignorance, are highres timers relative like jiffies
or absolute? If it's absolute, I don't see how it would impact this
case.

If it's relative, what exactly does it buy for us over jiffies?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [XFRM][RFC v2] Fix unexpected SA hard expiration after setting new date
From: fan.du @ 2012-06-19  7:51 UTC (permalink / raw)
  To: davem, herbert; +Cc: netdev, fdu


Hi, Dave and Herbert

Could you give a try to review this?
thanks :)

Changelog:
v1->v2
1) use xflags instead of creating new flags(suggested by Steffen Klassert)


*Background*:
Once IPsec SAs are created between two peers, kernel setup a timer to monitor
two events: soft/hard expiration. However the timer handler use xtime to
caculate whether it's soft or hard expiration event.

normal code flow(hard expire time:100s, soft expire time:82s)

a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;

b) soft expire occur, rearm the timer with hard expire interval(18s)
then notify racoon2 about soft expire event. racoon2 will create
new SAs.

c) hard expire happen, notify racoon2 about it. racoon2 will delete
the old SAs.

*Scenario*:
Setting a new date before b),and after a) could result c) happens first,
As a result, old SAs is deleted before new ones are created. Normally
new SAs will be created by the next time networking traffic, but there
is a small time being when networking connection is down, this could
result in upper layer connections failed in tel comm area, thus it's
better to keep it strict in sequence.

*Workaround*:
set new time could happen:
1) before a), then SAs is updated with new time.
2) before b),and after a)
2a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;(set flag to mark next time should
be soft time expire)

<<---- new date comes

2b) soft expire occur, the calculation results in a hard time expire
event, but flag is set, so catch ya. Sync the addtime, and rearm
the timer with hard expire interval(18s), then notify racoon2
about soft expire event;

2c) hard expire happen, notify racoon2 about it;
so everything is in order.

3) after b), hard expire always happened anyway.

^ permalink raw reply

* [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: fan.du @ 2012-06-19  7:51 UTC (permalink / raw)
  To: davem, herbert; +Cc: netdev, fdu
In-Reply-To: <1340092269-6136-1-git-send-email-fan.du@windriver.com>

From: "fan.du" <fan.du@windriver.com>

After SA is setup, one timer is armed to detect soft/hard expiration,
however the timer handler uses xtime to do the math. This makes hard
expiration occurs first before soft expiration after setting new date
with big interval. As a result new child SA is deleted before rekeying
the new one.

Signed-off-by: fan.du <fan.du@windriver.com>
---
 include/net/xfrm.h    |  3 +++
 net/xfrm/xfrm_state.c | 21 +++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..8d16777 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -197,6 +197,8 @@ struct xfrm_state
 
 	struct xfrm_lifetime_cur curlft;
 	struct timer_list	timer;
+	/* used to fix curlft->add_time when changing date */
+	long		saved_tmo;
 
 	/* Last used time */
 	unsigned long		lastused;
@@ -218,6 +220,7 @@ struct xfrm_state
 
 /* xflags - make enum if more show up */
 #define XFRM_TIME_DEFER	1
+#define XFRM_SOFT_EXPIRE 2
 
 enum {
 	XFRM_STATE_VOID,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fd77cf0..ab4aa0f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -442,8 +442,17 @@ static void xfrm_timer_handler(unsigned long data)
 	if (x->lft.hard_add_expires_seconds) {
 		long tmo = x->lft.hard_add_expires_seconds +
 			x->curlft.add_time - now;
-		if (tmo <= 0)
-			goto expired;
+		if (tmo <= 0) {
+			if (x->xflags & XFRM_SOFT_EXPIRE) {
+				/* enter hard expire without soft expire first?!
+				 * setting a new date could trigger this.
+				 * workarbound: fix x->curflt.add_time by below:
+				 */
+				x->curlft.add_time = now - x->saved_tmo - 1;
+				tmo = x->lft.hard_add_expires_seconds - x->saved_tmo;
+			} else
+				goto expired;
+		}
 		if (tmo < next)
 			next = tmo;
 	}
@@ -460,10 +469,14 @@ static void xfrm_timer_handler(unsigned long data)
 	if (x->lft.soft_add_expires_seconds) {
 		long tmo = x->lft.soft_add_expires_seconds +
 			x->curlft.add_time - now;
-		if (tmo <= 0)
+		if (tmo <= 0) {
 			warn = 1;
-		else if (tmo < next)
+			x->xflags |= ~XFRM_SOFT_EXPIRE;
+		} else if (tmo < next) {
 			next = tmo;
+			x->xflags |= XFRM_SOFT_EXPIRE;
+			x->saved_tmo = tmo;
+		}
 	}
 	if (x->lft.soft_use_expires_seconds) {
 		long tmo = x->lft.soft_use_expires_seconds +
-- 
1.7.11

^ permalink raw reply related

* Re: [PATCH net-next v2 06/12] netfilter: merge udpv[4,6]_net_init into udp_net_init
From: Gao feng @ 2012-06-19  8:08 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20120616112259.GG18251@1984>

于 2012年06月16日 19:22, Pablo Neira Ayuso 写道:
> On Sat, Jun 16, 2012 at 11:41:17AM +0800, Gao feng wrote:
>> merge udpv4_net_init and udpv6_net_init into udp_net_init to
>> reduce the redundancy codes.
>>
>> and use nf_proto_net.users to identify if it's the first time
>> we use the nf_proto_net. when it's the first time,we will
>> initialized it.
>>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  net/netfilter/nf_conntrack_proto_udp.c |   56 ++++++++++---------------------
>>  1 files changed, 18 insertions(+), 38 deletions(-)
>>
>> diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
>> index 2b978e6..61bca4f 100644
>> --- a/net/netfilter/nf_conntrack_proto_udp.c
>> +++ b/net/netfilter/nf_conntrack_proto_udp.c
>> @@ -270,52 +270,32 @@ static int udp_kmemdup_compat_sysctl_table(struct nf_proto_net *pn)
>>  	return 0;
>>  }
>>
>> -static void udp_init_net_data(struct nf_udp_net *un)
>> +static int udp_init_net(struct net *net, u_int16_t proto)
>>  {
>> -	int i;
>> -#ifdef CONFIG_SYSCTL
>> -	if (!un->pn.ctl_table) {
>> -#else
>> -	if (!un->pn.users++) {
>> -#endif
>> +	int ret;
>> +	struct nf_udp_net *un = udp_pernet(net);
>> +	struct nf_proto_net *pn = &un->pn;
>> +
>> +	if (!pn->users) {
>> +		int i;
>>  		for (i = 0; i < UDP_CT_MAX; i++)
>>  			un->timeouts[i] = udp_timeouts[i];
>>  	}
>> -}
>> -
>> -static int udpv4_init_net(struct net *net, u_int16_t proto)
>> -{
>> -	int ret;
>> -	struct nf_udp_net *un = udp_pernet(net);
>> -	struct nf_proto_net *pn = (struct nf_proto_net *)un;
>>
>> -	udp_init_net_data(un);
>> +	if (proto == AF_INET) {
> 
> I think we can remove that u_int16_t proto that I proposed to make
> something like:
> 
> static int udp_kmemdup_compat_sysctl_table(struct nf_proto_net *pn)
> {
> #ifdef CONFIG_SYSCTL
> #ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>         struct nf_udp_net *un = (struct nf_udp_net *)pn;
> +
> +       if (pn->ctl_compat_table)
> +               return 0;
> +
>         pn->ctl_compat_table = kmemdup(udp_compat_sysctl_table,
>                                        sizeof(udp_compat_sysctl_table),
>                                        GFP_KERNEL);
>         if (!pn->ctl_compat_table)
>                 return -ENOMEM;
> 
> That should be enough to ensure that the compat is registered once. No
> matter if it's done by the IPv4 or IPv6 invocation of udp_init_net.
> 
> Thus, it will look consistent with udp_kmemdup_sysctl_table.


yes, but this will be very terrible to unregister compat sysctl
and free compat sysctl table.

thinking about, we may insmod nf_conntrack_ipv6 only, as your idea,
we will allocate compat_sysctl_table.so we have to free it when
rmmod nf_conntrack_ipv6.

in order to implement it, we have to change the logic of
nf_ct_l4proto_unregister_sysctl and nf_ct_unregister_sysctl. because we
only free the sysctl table when we unregister the proto.

^ permalink raw reply

* Re: [PATCH net-next v2 06/12] netfilter: merge udpv[4,6]_net_init into udp_net_init
From: Gao feng @ 2012-06-19  8:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <4FE0337F.6030000@cn.fujitsu.com>

于 2012年06月19日 16:08, Gao feng 写道:
> 
> in order to implement it, we have to change the logic of
> nf_ct_l4proto_unregister_sysctl and nf_ct_unregister_sysctl. because we
> only free the sysctl table when we unregister the proto.

I means free the sysctl table when we register sysctl success
(pn->ctl_table_header != NULL).

^ permalink raw reply

* Re: [PATCH net-next v2 11/12] netfilter: nf_conntrack_l4proto_icmp cleanup
From: Gao feng @ 2012-06-19  8:14 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20120616110828.GE18251@1984>

于 2012年06月16日 19:08, Pablo Neira Ayuso 写道:
> extra line unrequiered.
> 
> I'm sorry, I'm stressing a lot on this because I don't like abusing
> follow-up patches that clean up extra lines / missing lines and that
> sort of nitpicks...
> 

Sorry for this,I will double check this when I send the v3 patch.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: David Miller @ 2012-06-19  9:01 UTC (permalink / raw)
  To: fan.du; +Cc: herbert, netdev, fdu
In-Reply-To: <1340092269-6136-2-git-send-email-fan.du@windriver.com>

From: "fan.du" <fan.du@windriver.com>
Date: Tue, 19 Jun 2012 15:51:09 +0800

> From: "fan.du" <fan.du@windriver.com>

Please don't put your email user name instead of real name in quotes
there.  Thank you.

^ permalink raw reply

* Re: [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: David Miller @ 2012-06-19  9:05 UTC (permalink / raw)
  To: fan.du; +Cc: herbert, netdev, fdu
In-Reply-To: <20120619.020111.1596676335106760442.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 19 Jun 2012 02:01:11 -0700 (PDT)

> From: "fan.du" <fan.du@windriver.com>
> Date: Tue, 19 Jun 2012 15:51:09 +0800
> 
>> From: "fan.du" <fan.du@windriver.com>
> 
> Please don't put your email user name instead of real name in quotes
> there.  Thank you.

Also fdu@windriver.com bounces, do not put it in the CC: list.

^ permalink raw reply

* [XFRM][RFC v3] Fix unexpected SA hard expiration after setting new date
From: Fan Du @ 2012-06-19  9:28 UTC (permalink / raw)
  To: davem, herbert; +Cc: netdev


Hi, Dave and Herbert

Could you give a try to review this?
thanks

Changelog:
v1->v2
1) use xflags instead of creating new flags(suggested by Steffen Klassert)

v2->v3
1) fix email problem, and remove cc to myself(requested by David Miller)


*Background*:
Once IPsec SAs are created between two peers, kernel setup a timer to monitor
two events: soft/hard expiration. However the timer handler use xtime to
caculate whether it's soft or hard expiration event.

normal code flow(hard expire time:100s, soft expire time:82s)

a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;

b) soft expire occur, rearm the timer with hard expire interval(18s)
then notify racoon2 about soft expire event. racoon2 will create
new SAs.

c) hard expire happen, notify racoon2 about it. racoon2 will delete
the old SAs.

*Scenario*:
Setting a new date before b),and after a) could result c) happens first,
As a result, old SAs is deleted before new ones are created. Normally
new SAs will be created by the next time networking traffic, but there
is a small time being when networking connection is down, this could
result in upper layer connections failed in tel comm area, thus it's
better to keep it strict in sequence.

*Workaround*:
set new time could happen:
1) before a), then SAs is updated with new time.
2) before b),and after a)
2a) When new SAs created, xfrm_timer_handler is called one second
after its creation. At this point, calculate soft expire
interval(81s), setup the timer;(set flag to mark next time should
be soft time expire)

<<---- new date comes

2b) soft expire occur, the calculation results in a hard time expire
event, but flag is set, so catch ya. Sync the addtime, and rearm
the timer with hard expire interval(18s), then notify racoon2
about soft expire event;

2c) hard expire happen, notify racoon2 about it;
so everything is in order.

3) after b), hard expire always happened anyway.

^ permalink raw reply

* [PATCH] [XFRM] Fix unexpected SA hard expiration after changing date
From: Fan Du @ 2012-06-19  9:28 UTC (permalink / raw)
  To: davem, herbert; +Cc: netdev
In-Reply-To: <1340098118-21529-1-git-send-email-fdu@windriver.com>

After SA is setup, one timer is armed to detect soft/hard expiration,
however the timer handler uses xtime to do the math. This makes hard
expiration occurs first before soft expiration after setting new date
with big interval. As a result new child SA is deleted before rekeying
the new one.

Signed-off-by: Fan Du <fdu@windriver.com>
---
 include/net/xfrm.h    |  3 +++
 net/xfrm/xfrm_state.c | 21 +++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 2933d74..8d16777 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -197,6 +197,8 @@ struct xfrm_state
 
 	struct xfrm_lifetime_cur curlft;
 	struct timer_list	timer;
+	/* used to fix curlft->add_time when changing date */
+	long		saved_tmo;
 
 	/* Last used time */
 	unsigned long		lastused;
@@ -218,6 +220,7 @@ struct xfrm_state
 
 /* xflags - make enum if more show up */
 #define XFRM_TIME_DEFER	1
+#define XFRM_SOFT_EXPIRE 2
 
 enum {
 	XFRM_STATE_VOID,
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fd77cf0..ab4aa0f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -442,8 +442,17 @@ static void xfrm_timer_handler(unsigned long data)
 	if (x->lft.hard_add_expires_seconds) {
 		long tmo = x->lft.hard_add_expires_seconds +
 			x->curlft.add_time - now;
-		if (tmo <= 0)
-			goto expired;
+		if (tmo <= 0) {
+			if (x->xflags & XFRM_SOFT_EXPIRE) {
+				/* enter hard expire without soft expire first?!
+				 * setting a new date could trigger this.
+				 * workarbound: fix x->curflt.add_time by below:
+				 */
+				x->curlft.add_time = now - x->saved_tmo - 1;
+				tmo = x->lft.hard_add_expires_seconds - x->saved_tmo;
+			} else
+				goto expired;
+		}
 		if (tmo < next)
 			next = tmo;
 	}
@@ -460,10 +469,14 @@ static void xfrm_timer_handler(unsigned long data)
 	if (x->lft.soft_add_expires_seconds) {
 		long tmo = x->lft.soft_add_expires_seconds +
 			x->curlft.add_time - now;
-		if (tmo <= 0)
+		if (tmo <= 0) {
 			warn = 1;
-		else if (tmo < next)
+			x->xflags |= ~XFRM_SOFT_EXPIRE;
+		} else if (tmo < next) {
 			next = tmo;
+			x->xflags |= XFRM_SOFT_EXPIRE;
+			x->saved_tmo = tmo;
+		}
 	}
 	if (x->lft.soft_use_expires_seconds) {
 		long tmo = x->lft.soft_use_expires_seconds +
-- 
1.7.11

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox