* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: David Miller @ 2012-07-19 17:12 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <20120719.101155.1970854797296520147.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Thu, 19 Jul 2012 10:11:55 -0700 (PDT)
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 19 Jul 2012 19:07:47 +0200
>
>> On Thu, 2012-07-19 at 08:45 -0700, David Miller wrote:
>>> From: David Miller <davem@davemloft.net>
>>> Date: Thu, 19 Jul 2012 08:35:44 -0700 (PDT)
>>>
>>> > Looks great, applied, thanks Eric.
>>>
>>> I take that back, it doesn't build:
>>>
>>> net/ipv4/ip_output.c: In function ‘ip_send_unicast_reply’:
>>> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
>>> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
>>> net/ipv4/ip_output.c:1481:1: error: declaration of ‘__pcpu_unique_unicast_sock’ with no linkage follows extern declaration
>>> net/ipv4/ip_output.c:1481:1: note: previous declaration of ‘__pcpu_unique_unicast_sock’ was here
>>> net/ipv4/ip_output.c:1481:9: error: section attribute cannot be specified for local variables
>>> net/ipv4/ip_output.c:1481:9: error: weak declaration of ‘unicast_sock’ must be public
>>
>> Strange, it builds on my machines, and I got nice performance boost.
>>
>> Apparently your arch doesnt handle the
>>
>> static DEFINE_PER_CPU(struct inet_sock, unicast_sock)
>>
>> in the function body ?
>
> It's x86-64, standard Fedora 17 install.
Just give this thing a unique name and define it right before the function.
^ permalink raw reply
* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: David Miller @ 2012-07-19 17:11 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <1342717667.2626.4494.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 Jul 2012 19:07:47 +0200
> On Thu, 2012-07-19 at 08:45 -0700, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Thu, 19 Jul 2012 08:35:44 -0700 (PDT)
>>
>> > Looks great, applied, thanks Eric.
>>
>> I take that back, it doesn't build:
>>
>> net/ipv4/ip_output.c: In function ‘ip_send_unicast_reply’:
>> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
>> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
>> net/ipv4/ip_output.c:1481:1: error: declaration of ‘__pcpu_unique_unicast_sock’ with no linkage follows extern declaration
>> net/ipv4/ip_output.c:1481:1: note: previous declaration of ‘__pcpu_unique_unicast_sock’ was here
>> net/ipv4/ip_output.c:1481:9: error: section attribute cannot be specified for local variables
>> net/ipv4/ip_output.c:1481:9: error: weak declaration of ‘unicast_sock’ must be public
>
> Strange, it builds on my machines, and I got nice performance boost.
>
> Apparently your arch doesnt handle the
>
> static DEFINE_PER_CPU(struct inet_sock, unicast_sock)
>
> in the function body ?
It's x86-64, standard Fedora 17 install.
^ permalink raw reply
* Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system
From: David Miller @ 2012-07-19 17:08 UTC (permalink / raw)
To: ogerlitz-VPRAkNaXOzVWk0Htik3J/w
Cc: cl-vYTEC60ixJUAvxtiuMwx3w, shlomop-VPRAkNaXOzVWk0Htik3J/w,
roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
erezsh-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <500833D9.8000001-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
From: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Thu, 19 Jul 2012 19:20:41 +0300
> On 7/19/2012 6:24 PM, Christoph Lameter wrote:
>> On Thu, 19 Jul 2012, Shlomo Pongartz wrote:
>>
>>> The garbage collection and stale times follow the default ipv4/6
>>> neigh.default.gc_yyy
>>> sysctl values, for example
>>>
>>> net.ipv4.neigh.default.gc_interval = 30
>>> net.ipv4.neigh.default.gc_stale_time = 60
>>>
>>> If given access to these values from IPoIB, we will be happy
>>> to integrate them into that logic
>>
>> It looks like the values are hardcoded right now.
>
> Two points here,
>
> 1s, they are indeed hard-coded since there's no define/enum
> that holds their default values (or maybe we should add one now?), see
> this code snippest from net/ipv4/arp.c
These numbers come from the IPV6 Neighbour Discovery RFCs. IPV4
replicates the Neighbour Unreachability Detection schemes of IPV6 in
pretty much it's entirety, and therefore takes on the same timeout et
al. parameters.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: Eric Dumazet @ 2012-07-19 17:08 UTC (permalink / raw)
To: David Miller; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <20120719.083544.1223522161508413373.davem@davemloft.net>
On Thu, 2012-07-19 at 08:35 -0700, David Miller wrote:
> > @@ -2624,13 +2624,11 @@ EXPORT_SYMBOL(tcp_prot);
> >
> > static int __net_init tcp_sk_init(struct net *net)
> > {
> > - return inet_ctl_sock_create(&net->ipv4.tcp_sock,
> > - PF_INET, SOCK_RAW, IPPROTO_TCP, net);
> > + return 0;
> > }
> >
> > static void __net_exit tcp_sk_exit(struct net *net)
> > {
> > - inet_ctl_sock_destroy(net->ipv4.tcp_sock);
> > }
> >
> > static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
>
> If these no longer really do anything, just send me a patch to kill
> them off entirely.
>
We cant remove them, because of tcp_sk_exit_batch()
^ permalink raw reply
* Re: [PATCH net-next] ipv4: tcp: remove per net tcp_sock
From: Eric Dumazet @ 2012-07-19 17:07 UTC (permalink / raw)
To: David Miller; +Cc: netdev, therbert, wsommerfeld
In-Reply-To: <20120719.084503.136853790544876708.davem@davemloft.net>
On Thu, 2012-07-19 at 08:45 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 19 Jul 2012 08:35:44 -0700 (PDT)
>
> > Looks great, applied, thanks Eric.
>
> I take that back, it doesn't build:
>
> net/ipv4/ip_output.c: In function ‘ip_send_unicast_reply’:
> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
> net/ipv4/ip_output.c:1481:1: error: section attribute cannot be specified for local variables
> net/ipv4/ip_output.c:1481:1: error: declaration of ‘__pcpu_unique_unicast_sock’ with no linkage follows extern declaration
> net/ipv4/ip_output.c:1481:1: note: previous declaration of ‘__pcpu_unique_unicast_sock’ was here
> net/ipv4/ip_output.c:1481:9: error: section attribute cannot be specified for local variables
> net/ipv4/ip_output.c:1481:9: error: weak declaration of ‘unicast_sock’ must be public
Strange, it builds on my machines, and I got nice performance boost.
Apparently your arch doesnt handle the
static DEFINE_PER_CPU(struct inet_sock, unicast_sock)
in the function body ?
^ permalink raw reply
* [PATCH] sfc: initialize dynamic sysfs attributes for lockdep
From: Michal Schmidt @ 2012-07-19 17:04 UTC (permalink / raw)
To: netdev; +Cc: Ben Hutchings, Solarflare linux maintainers
Dynamically allocated sysfs attributes must be initialized using
sysfs_attr_init(), otherwise lockdep complains:
BUG: key <address> not in .data!
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
---
drivers/net/ethernet/sfc/mcdi_mon.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/sfc/mcdi_mon.c b/drivers/net/ethernet/sfc/mcdi_mon.c
index fb7f65b..1d552f0 100644
--- a/drivers/net/ethernet/sfc/mcdi_mon.c
+++ b/drivers/net/ethernet/sfc/mcdi_mon.c
@@ -222,6 +222,7 @@ efx_mcdi_mon_add_attr(struct efx_nic *efx, const char *name,
attr->index = index;
attr->type = type;
attr->limit_value = limit_value;
+ sysfs_attr_init(&attr->dev_attr.attr);
attr->dev_attr.attr.name = attr->name;
attr->dev_attr.attr.mode = S_IRUGO;
attr->dev_attr.show = reader;
--
1.7.10.4
^ permalink raw reply related
* Re: [PATCH net-next] asix: AX88172A driver depends on phylib
From: Fengguang Wu @ 2012-07-19 17:01 UTC (permalink / raw)
To: Christian Riesch; +Cc: netdev, kernel-janitors
In-Reply-To: <1342699339-13871-1-git-send-email-christian.riesch@omicron.at>
On Thu, Jul 19, 2012 at 02:02:19PM +0200, Christian Riesch wrote:
> Since commit 16626b0cc3d5afe250850f96759b241f8a403b52 the asix
> driver depends on the phylib. Select phylib when the asix driver is
> selected.
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
^ permalink raw reply
* [PATCH] bridge: update documentation references
From: Stephen Hemminger @ 2012-07-19 17:01 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Update the references to bridge utilities and web pages
to current locations
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/Documentation/networking/bridge.txt 2012-07-16 12:08:02.743695241 -0700
+++ b/Documentation/networking/bridge.txt 2012-07-19 09:43:47.955910072 -0700
@@ -1,7 +1,14 @@
In order to use the Ethernet bridging functionality, you'll need the
-userspace tools. These programs and documentation are available
-at http://www.linuxfoundation.org/en/Net:Bridge. The download page is
-http://prdownloads.sourceforge.net/bridge.
+userspace tools.
+
+Documentation for Linux bridging is on:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
+
+The bridge-utilities are maintained at:
+ git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+
+Additionally, the iproute2 utilities can be used to configure
+bridge devices.
If you still have questions, don't hesitate to post to the mailing list
(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
^ permalink raw reply
* Re: [PATCH v2] sctp: Implement quick failover draft from tsvwg
From: Joe Perches @ 2012-07-19 16:54 UTC (permalink / raw)
To: Neil Horman
Cc: netdev, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
linux-sctp
In-Reply-To: <20120719104513.GB2070@hmsreliant.think-freely.org>
On Thu, 2012-07-19 at 06:45 -0400, Neil Horman wrote:
> On Wed, Jul 18, 2012 at 01:30:58PM -0700, Joe Perches wrote:
> > On Wed, 2012-07-18 at 14:01 -0400, Neil Horman wrote:
> > > I've seen several attempts recently made to do quick failover of sctp transports
> > > by reducing various retransmit timers and counters. While its possible to
> > > implement a faster failover on multihomed sctp associations, its not
> > > particularly robust, in that it can lead to unneeded retransmits, as well as
> > > false connection failures due to intermittent latency on a network.
[]
> > > @@ -878,12 +896,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
> > []
> > > + if (ulp_notify) {
> > > + memset(&addr, 0, sizeof(struct sockaddr_storage));
> > > + memcpy(&addr, &transport->ipaddr,
> > > + transport->af_specific->sockaddr_len);
> >
> > Perhaps it's better to do the memcpy then the memset of the
> > space left instead.
> >
> > memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
> > memset((char *)&addr) + transport->af_specific->sockaddr_len, 0,
> > sizeof(struct sockaddr_storage) - transport->af_specific->sockaddr_len);
> >
> hmm, not sure about that. It works either way for me, but I've not changed that
> code, just the condition under which it was executed. I'd rather save cleanups
> like that for a separate patch if you don't mind.
Not a bit.
It's almost certain reversing the order is slower for v4
addresses anyway. It might be slower for v6 too given
the arithmetic.
cheers, Joe
^ permalink raw reply
* [PATCH v3] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-19 16:51 UTC (permalink / raw)
To: netdev
Cc: Neil Horman, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
linux-sctp, joe
In-Reply-To: <1342203998-24037-1-git-send-email-nhorman@tuxdriver.com>
I've seen several attempts recently made to do quick failover of sctp transports
by reducing various retransmit timers and counters. While its possible to
implement a faster failover on multihomed sctp associations, its not
particularly robust, in that it can lead to unneeded retransmits, as well as
false connection failures due to intermittent latency on a network.
Instead, lets implement the new ietf quick failover draft found here:
http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
This will let the sctp stack identify transports that have had a small number of
errors, and avoid using them quickly until their reliability can be
re-established. I've tested this out on two virt guests connected via multiple
isolated virt networks and believe its in compliance with the above draft and
works well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: linux-sctp@vger.kernel.org
CC: joe@perches.com
---
Change notes:
V2)
- Added socket option API from section 6.1 of the specification, as per
request from Vlad. Adding this socket option allows us to alter both the path
maximum retransmit value and the path partial failure threshold for each
transport and the association as a whole.
- Added a per transport pf_retrans value, and initialized it from the
association value. This makes each transport independently configurable as per
the socket option above, and prevents changes in the sysctl from bleeding into
an already created association.
V3)
- Cleaned up some line spacing (Joe Perches)
- Fixed some socket option user data sanitization (Vlad Yasevich)
---
Documentation/networking/ip-sysctl.txt | 14 +++++
include/net/sctp/constants.h | 1 +
include/net/sctp/structs.h | 11 +++-
include/net/sctp/user.h | 11 ++++
net/sctp/associola.c | 37 ++++++++++--
net/sctp/outqueue.c | 6 +-
net/sctp/sm_sideeffect.c | 33 +++++++++-
net/sctp/socket.c | 100 ++++++++++++++++++++++++++++++++
net/sctp/sysctl.c | 9 +++
net/sctp/transport.c | 4 +-
10 files changed, 211 insertions(+), 15 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 47b6c79..c636f9c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1408,6 +1408,20 @@ path_max_retrans - INTEGER
Default: 5
+pf_retrans - INTEGER
+ The number of retransmissions that will be attempted on a given path
+ before traffic is redirected to an alternate transport (should one
+ exist). Note this is distinct from path_max_retrans, as a path that
+ passes the pf_retrans threshold can still be used. Its only
+ deprioritized when a transmission path is selected by the stack. This
+ setting is primarily used to enable fast failover mechanisms without
+ having to reduce path_max_retrans to a very low value. See:
+ http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ for details. Note also that a value of pf_retrans > path_max_retrans
+ disables this feature
+
+ Default: 0
+
rto_initial - INTEGER
The initial round trip timeout value in milliseconds that will be used
in calculating round trip times. This is the initial time interval
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 942b864..d053d2e 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -334,6 +334,7 @@ typedef enum {
typedef enum {
SCTP_TRANSPORT_UP,
SCTP_TRANSPORT_DOWN,
+ SCTP_TRANSPORT_PF,
} sctp_transport_cmd_t;
/* These are the address scopes defined mainly for IPv4 addresses
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index e4652fe..f70726c 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -160,6 +160,7 @@ extern struct sctp_globals {
int max_retrans_association;
int max_retrans_path;
int max_retrans_init;
+ int pf_retrans;
/*
* Policy for preforming sctp/socket accounting
@@ -258,6 +259,7 @@ extern struct sctp_globals {
#define sctp_sndbuf_policy (sctp_globals.sndbuf_policy)
#define sctp_rcvbuf_policy (sctp_globals.rcvbuf_policy)
#define sctp_max_retrans_path (sctp_globals.max_retrans_path)
+#define sctp_pf_retrans (sctp_globals.pf_retrans)
#define sctp_max_retrans_init (sctp_globals.max_retrans_init)
#define sctp_sack_timeout (sctp_globals.sack_timeout)
#define sctp_hb_interval (sctp_globals.hb_interval)
@@ -987,10 +989,15 @@ struct sctp_transport {
/* This is the max_retrans value for the transport and will
* be initialized from the assocs value. This can be changed
- * using SCTP_SET_PEER_ADDR_PARAMS socket option.
+ * using the SCTP_SET_PEER_ADDR_PARAMS socket option.
*/
__u16 pathmaxrxt;
+ /* This is the partially failed retrans value for the transport
+ * and will be initialized from the assocs value. This can be changed
+ * using the SCTP_PEER_ADDR_THLDS socket option
+ */
+ int pf_retrans;
/* PMTU : The current known path MTU. */
__u32 pathmtu;
@@ -1660,6 +1667,8 @@ struct sctp_association {
*/
int max_retrans;
+ int pf_retrans;
+
/* Maximum number of times the endpoint will retransmit INIT */
__u16 max_init_attempts;
diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
index 0842ef0..1b02d7a 100644
--- a/include/net/sctp/user.h
+++ b/include/net/sctp/user.h
@@ -93,6 +93,7 @@ typedef __s32 sctp_assoc_t;
#define SCTP_GET_ASSOC_NUMBER 28 /* Read only */
#define SCTP_GET_ASSOC_ID_LIST 29 /* Read only */
#define SCTP_AUTO_ASCONF 30
+#define SCTP_PEER_ADDR_THLDS 31
/* Internal Socket Options. Some of the sctp library functions are
* implemented using these socket options.
@@ -649,6 +650,7 @@ struct sctp_paddrinfo {
*/
enum sctp_spinfo_state {
SCTP_INACTIVE,
+ SCTP_PF,
SCTP_ACTIVE,
SCTP_UNCONFIRMED,
SCTP_UNKNOWN = 0xffff /* Value used for transport state unknown */
@@ -741,4 +743,13 @@ typedef struct {
int sd;
} sctp_peeloff_arg_t;
+/*
+ * Peer Address Thresholds socket option
+ */
+struct sctp_paddrthlds {
+ sctp_assoc_t spt_assoc_id;
+ struct sockaddr_storage spt_address;
+ __u16 spt_pathmaxrxt;
+ __u16 spt_pathpfthld;
+};
#endif /* __net_sctp_user_h__ */
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 5bc9ab1..90fe36b 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -124,6 +124,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
* socket values.
*/
asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
+ asoc->pf_retrans = sctp_pf_retrans;
+
asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
asoc->rto_max = msecs_to_jiffies(sp->rtoinfo.srto_max);
asoc->rto_min = msecs_to_jiffies(sp->rtoinfo.srto_min);
@@ -685,6 +687,9 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
/* Set the path max_retrans. */
peer->pathmaxrxt = asoc->pathmaxrxt;
+ /* And the partial failure retrnas threshold */
+ peer->pf_retrans = asoc->pf_retrans;
+
/* Initialize the peer's SACK delay timeout based on the
* association configured value.
*/
@@ -840,6 +845,7 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
struct sctp_ulpevent *event;
struct sockaddr_storage addr;
int spc_state = 0;
+ bool ulp_notify = true;
/* Record the transition on the transport. */
switch (command) {
@@ -853,6 +859,14 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
spc_state = SCTP_ADDR_CONFIRMED;
else
spc_state = SCTP_ADDR_AVAILABLE;
+ /* Don't inform ULP about transition from PF to
+ * active state and set cwnd to 1, see SCTP
+ * Quick failover draft section 5.1, point 5
+ */
+ if (transport->state == SCTP_PF) {
+ ulp_notify = false;
+ transport->cwnd = 1;
+ }
transport->state = SCTP_ACTIVE;
break;
@@ -871,6 +885,11 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
spc_state = SCTP_ADDR_UNREACHABLE;
break;
+ case SCTP_TRANSPORT_PF:
+ transport->state = SCTP_PF;
+ ulp_notify = false;
+ break;
+
default:
return;
}
@@ -878,12 +897,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
/* Generate and send a SCTP_PEER_ADDR_CHANGE notification to the
* user.
*/
- memset(&addr, 0, sizeof(struct sockaddr_storage));
- memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
- event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
- 0, spc_state, error, GFP_ATOMIC);
- if (event)
- sctp_ulpq_tail_event(&asoc->ulpq, event);
+ if (ulp_notify) {
+ memset(&addr, 0, sizeof(struct sockaddr_storage));
+ memcpy(&addr, &transport->ipaddr,
+ transport->af_specific->sockaddr_len);
+ event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
+ 0, spc_state, error, GFP_ATOMIC);
+ if (event)
+ sctp_ulpq_tail_event(&asoc->ulpq, event);
+ }
/* Select new active and retran paths. */
@@ -899,7 +921,8 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
transports) {
if ((t->state == SCTP_INACTIVE) ||
- (t->state == SCTP_UNCONFIRMED))
+ (t->state == SCTP_UNCONFIRMED) ||
+ (t->state == SCTP_PF))
continue;
if (!first || t->last_time_heard > first->last_time_heard) {
second = first;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index a0fa19f..e7aa177c 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -792,7 +792,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
if (!new_transport)
new_transport = asoc->peer.active_path;
} else if ((new_transport->state == SCTP_INACTIVE) ||
- (new_transport->state == SCTP_UNCONFIRMED)) {
+ (new_transport->state == SCTP_UNCONFIRMED) ||
+ (new_transport->state == SCTP_PF)) {
/* If the chunk is Heartbeat or Heartbeat Ack,
* send it to chunk->transport, even if it's
* inactive.
@@ -987,7 +988,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
new_transport = chunk->transport;
if (!new_transport ||
((new_transport->state == SCTP_INACTIVE) ||
- (new_transport->state == SCTP_UNCONFIRMED)))
+ (new_transport->state == SCTP_UNCONFIRMED) ||
+ (new_transport->state == SCTP_PF)))
new_transport = asoc->peer.active_path;
if (new_transport->state == SCTP_UNCONFIRMED)
continue;
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index c96d1a8..285e26a 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -76,6 +76,8 @@ static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t subtype,
sctp_cmd_seq_t *commands,
gfp_t gfp);
+static void sctp_cmd_hb_timer_update(sctp_cmd_seq_t *cmds,
+ struct sctp_transport *t);
/********************************************************************
* Helper functions
********************************************************************/
@@ -470,7 +472,8 @@ sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES] = {
* notification SHOULD be sent to the upper layer.
*
*/
-static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
+static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t *commands,
+ struct sctp_association *asoc,
struct sctp_transport *transport,
int is_hb)
{
@@ -495,6 +498,23 @@ static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
transport->error_count++;
}
+ /* If the transport error count is greater than the pf_retrans
+ * threshold, and less than pathmaxrtx, then mark this transport
+ * as Partially Failed, ee SCTP Quick Failover Draft, secon 5.1,
+ * point 1
+ */
+ if ((transport->state != SCTP_PF) &&
+ (asoc->pf_retrans < transport->pathmaxrxt) &&
+ (transport->error_count > asoc->pf_retrans)) {
+
+ sctp_assoc_control_transport(asoc, transport,
+ SCTP_TRANSPORT_PF,
+ 0);
+
+ /* Update the hb timer to resend a heartbeat every rto */
+ sctp_cmd_hb_timer_update(commands, transport);
+ }
+
if (transport->state != SCTP_INACTIVE &&
(transport->error_count > transport->pathmaxrxt)) {
SCTP_DEBUG_PRINTK_IPADDR("transport_strike:association %p",
@@ -699,6 +719,10 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
SCTP_HEARTBEAT_SUCCESS);
}
+ if (t->state == SCTP_PF)
+ sctp_assoc_control_transport(asoc, t, SCTP_TRANSPORT_UP,
+ SCTP_HEARTBEAT_SUCCESS);
+
/* The receiver of the HEARTBEAT ACK should also perform an
* RTT measurement for that destination transport address
* using the time value carried in the HEARTBEAT ACK chunk.
@@ -1565,8 +1589,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
case SCTP_CMD_STRIKE:
/* Mark one strike against a transport. */
- sctp_do_8_2_transport_strike(asoc, cmd->obj.transport,
- 0);
+ sctp_do_8_2_transport_strike(commands, asoc,
+ cmd->obj.transport, 0);
break;
case SCTP_CMD_TRANSPORT_IDLE:
@@ -1576,7 +1600,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
case SCTP_CMD_TRANSPORT_HB_SENT:
t = cmd->obj.transport;
- sctp_do_8_2_transport_strike(asoc, t, 1);
+ sctp_do_8_2_transport_strike(commands, asoc,
+ t, 1);
t->hb_sent = 1;
break;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..fef9bfa 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3470,6 +3470,56 @@ static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
}
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to alter the partially failed threshold for one or all
+ * transports in an association. See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
+ char __user *optval,
+ unsigned int optlen)
+{
+ struct sctp_paddrthlds val;
+ struct sctp_transport *trans;
+ struct sctp_association *asoc;
+
+ if (optlen < sizeof(struct sctp_paddrthlds))
+ return -EINVAL;
+ if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
+ sizeof(struct sctp_paddrthlds)))
+ return -EFAULT;
+
+ /* path_max_retrans shouldn't ever be zero */
+ if (!val.spt_pathmaxrxt)
+ return -EINVAL;
+
+ if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+ asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+ if (!asoc)
+ return -ENOENT;
+ list_for_each_entry(trans, &asoc->peer.transport_addr_list,
+ transports) {
+ trans->pathmaxrxt = val.spt_pathmaxrxt;
+ trans->pf_retrans = val.spt_pathpfthld;
+ }
+
+ asoc->pf_retrans = val.spt_pathpfthld;
+ asoc->pathmaxrxt = val.spt_pathmaxrxt;
+ } else {
+ trans = sctp_addr_id2transport(sk, &val.spt_address,
+ val.spt_assoc_id);
+ if (!trans)
+ return -ENOENT;
+
+ trans->pathmaxrxt = val.spt_pathmaxrxt;
+ trans->pf_retrans = val.spt_pathpfthld;
+ }
+
+ return 0;
+}
+
/* API 6.2 setsockopt(), getsockopt()
*
* Applications use setsockopt() and getsockopt() to set or retrieve
@@ -3619,6 +3669,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
case SCTP_AUTO_ASCONF:
retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
break;
+ case SCTP_PEER_ADDR_THLDS:
+ retval = sctp_setsockopt_paddr_thresholds(sk, optval, optlen);
+ break;
default:
retval = -ENOPROTOOPT;
break;
@@ -5490,6 +5543,50 @@ static int sctp_getsockopt_assoc_ids(struct sock *sk, int len,
return 0;
}
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to fetch the partially failed threshold for one or all
+ * transports in an association. See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
+ char __user *optval,
+ int optlen)
+{
+ struct sctp_paddrthlds val;
+ struct sctp_transport *trans;
+ struct sctp_association *asoc;
+
+ if (optlen < sizeof(struct sctp_paddrthlds))
+ return -EINVAL;
+ optlen = sizeof(struct sctp_paddrthlds);
+ if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, optlen))
+ return -EFAULT;
+
+ if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+ asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+ if (!asoc)
+ return -ENOENT;
+
+ val.spt_pathpfthld = asoc->pf_retrans;
+ val.spt_pathmaxrxt = asoc->pathmaxrxt;
+ } else {
+ trans = sctp_addr_id2transport(sk, &val.spt_address,
+ val.spt_assoc_id);
+ if (!trans)
+ return -ENOENT;
+
+ val.spt_pathmaxrxt = trans->pathmaxrxt;
+ val.spt_pathpfthld = trans->pf_retrans;
+ }
+
+ if (copy_to_user(optval, &val, optlen))
+ return -EFAULT;
+
+ return 0;
+}
+
SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen)
{
@@ -5628,6 +5725,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
case SCTP_AUTO_ASCONF:
retval = sctp_getsockopt_auto_asconf(sk, len, optval, optlen);
break;
+ case SCTP_PEER_ADDR_THLDS:
+ retval = sctp_getsockopt_paddr_thresholds(sk, optval, len);
+ break;
default:
retval = -ENOPROTOOPT;
break;
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index e5fe639..2b2bfe9 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -141,6 +141,15 @@ static ctl_table sctp_table[] = {
.extra2 = &int_max
},
{
+ .procname = "pf_retrans",
+ .data = &sctp_pf_retrans,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &int_max
+ },
+ {
.procname = "max_init_retransmits",
.data = &sctp_max_retrans_init,
.maxlen = sizeof(int),
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index b026ba0..194d0f3 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -85,6 +85,7 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
/* Initialize the default path max_retrans. */
peer->pathmaxrxt = sctp_max_retrans_path;
+ peer->pf_retrans = sctp_pf_retrans;
INIT_LIST_HEAD(&peer->transmitted);
INIT_LIST_HEAD(&peer->send_ready);
@@ -585,7 +586,8 @@ unsigned long sctp_transport_timeout(struct sctp_transport *t)
{
unsigned long timeout;
timeout = t->rto + sctp_jitter(t->rto);
- if (t->state != SCTP_UNCONFIRMED)
+ if ((t->state != SCTP_UNCONFIRMED) &&
+ (t->state != SCTP_PF))
timeout += t->hbinterval;
timeout += jiffies;
return timeout;
--
1.7.7.6
^ permalink raw reply related
* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Neil Horman @ 2012-07-19 16:44 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
john.r.fastabend, lizefan
In-Reply-To: <20120719162532.23505.85946.stgit@srivatsabhat.in.ibm.com>
On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
> netprio cgroup), boot fails with the following NULL pointer dereference:
>
> Initializing cgroup subsys devices
> Initializing cgroup subsys freezer
> Initializing cgroup subsys net_cls
> Initializing cgroup subsys blkio
> Initializing cgroup subsys perf_event
> Initializing cgroup subsys net_prio
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> PGD 0
> Oops: 0000 [#1] SMP
> CPU 0
> Modules linked in:
>
> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
> RIP: 0010:[<ffffffff8145e8d6>] [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> RSP: 0000:ffffffff81a01ea8 EFLAGS: 00010213
> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
> FS: 0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
> Stack:
> ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
> ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
> 0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
> Call Trace:
> [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
> [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
> [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
> [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
> [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
> [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
> RIP [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> RSP <ffffffff81a01ea8>
> CR2: 0000000000000698
> ---[ end trace a7919e7f17c0a725 ]---
> Kernel panic - not syncing: Attempted to kill the idle task!
>
> The code corresponds to:
>
> update_netdev_tables():
> for_each_netdev(&init_net, dev) {
> map = rtnl_dereference(dev->priomap); <---- HERE
>
>
> The list head is initialized in netdev_init(), which is called much
> later than cgrp_create(). So the problem is that we are calling
> update_netdev_tables() way too early (in cgrp_create()), which will
> end up traversing the not-yet-circular linked list. So at some point,
> the dev pointer will become NULL and hence dev->priomap becomes an
> invalid access.
>
> To fix this, just remove the update_netdev_tables() function entirely,
> since it appears that write_update_netdev_table() will handle things
> just fine.
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
> Requesting a thorough review of this patch, since I am not sure whether
> removing update_netdev_tables() is perfectly OK and whether that is the
> right thing to do.
>
We could do this I suppose, but this has already been fixed by
734b65417b24d6eea3e3d7457e1f11493890ee1d
>
^ permalink raw reply
* [PATCH v3 7/7] net-tcp: Fast Open client - cookie-less mode
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
In trusted networks, e.g., intranet, data-center, the client does not
need to use Fast Open cookie to mitigate DoS attacks. In cookie-less
mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless
of cookie availability.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
Documentation/networking/ip-sysctl.txt | 2 ++
include/linux/tcp.h | 1 +
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 8 ++++++--
net/ipv4/tcp_output.c | 6 +++++-
5 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 03964e0..5f3ef7f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -476,6 +476,8 @@ tcp_fastopen - INTEGER
The values (bitmap) are:
1: Enables sending data in the opening SYN on the client
+ 5: Enables sending data in the opening SYN on the client regardless
+ of cookie availability.
Default: 0
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1edf96a..9febfb6 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -387,6 +387,7 @@ struct tcp_sock {
u8 repair_queue;
u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */
early_retrans_delayed:1, /* Delayed ER timer installed */
+ syn_data:1, /* SYN includes data */
syn_fastopen:1; /* SYN includes Fast Open option */
/* RTT measurement */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e07878d..bc7c134 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,6 +214,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
/* Bit Flags for sysctl_tcp_fastopen */
#define TFO_CLIENT_ENABLE 1
+#define TFO_CLIENT_NO_COOKIE 4 /* Data in SYN w/o cookie option */
extern struct inet_timewait_death_row tcp_death_row;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c49a4fc..e67d685 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5650,7 +5650,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
struct tcp_fastopen_cookie *cookie)
{
struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *data = tcp_write_queue_head(sk);
+ struct sk_buff *data = tp->syn_data ? tcp_write_queue_head(sk) : NULL;
u16 mss = tp->rx_opt.mss_clamp;
bool syn_drop;
@@ -5665,6 +5665,9 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
mss = opt.mss_clamp;
}
+ if (!tp->syn_fastopen) /* Ignore an unsolicited cookie */
+ cookie->len = -1;
+
/* The SYN-ACK neither has cookie nor acknowledges the data. Presumably
* the remote receives only the retransmitted (regular) SYNs: either
* the original SYN-data or the corresponding SYN-ACK is lost.
@@ -5816,7 +5819,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
tcp_finish_connect(sk, skb);
- if (tp->syn_fastopen && tcp_rcv_fastopen_synack(sk, skb, &foc))
+ if ((tp->syn_fastopen || tp->syn_data) &&
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
return -1;
if (sk->sk_write_pending ||
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5cfd5e..27a32ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2864,6 +2864,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
struct sk_buff *syn_data = NULL, *data;
unsigned long last_syn_loss = 0;
+ tp->rx_opt.mss_clamp = tp->advmss; /* If MSS is not cached */
tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie,
&syn_loss, &last_syn_loss);
/* Recurring FO SYN losses: revert to regular handshake temporarily */
@@ -2873,7 +2874,9 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
goto fallback;
}
- if (fo->cookie.len <= 0)
+ if (sysctl_tcp_fastopen & TFO_CLIENT_NO_COOKIE)
+ fo->cookie.len = -1;
+ else if (fo->cookie.len <= 0)
goto fallback;
/* MSS for SYN-data is based on cached MSS and bounded by PMTU and
@@ -2916,6 +2919,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
fo->copied = data->len;
if (tcp_transmit_skb(sk, syn_data, 0, sk->sk_allocation) == 0) {
+ tp->syn_data = (fo->copied > 0);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
goto done;
}
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 0/7] TCP Fast Open client
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
ChangeLog since v2:
- Added seqlock to update Fast Open metrics
- Move TCP magic code in inet_wait_for_connect() to inet_stream_connect()
- Move up MSG_FASTOPEN macro for better header formatting
ChangeLog since v1:
- Reduce tons of code by storing Fast Open stats in the TCP metrics :)
- Clarify the purpose of using an experimental option in patch 1/7
This patch series implement the client functionality of TCP Fast Open.
TCP Fast Open (TFO) allows data to be carried in the SYN and SYN-ACK
packets and consumed by the receiving end during the initial connection
handshake, thus providing a saving of up to one full round trip time (RTT)
compared to standard TCP requiring a three-way handshake (3WHS) to
complete before data can be exchanged.
The protocol change is detailed in the IETF internet draft at
http://www.ietf.org/id/draft-ietf-tcpm-fastopen-00.txt . The research
paper (http://conferences.sigcomm.org/co-next/2011/papers/1569470463.pdf)
studied the performance impact of HTTP using Fast Open, based on this
Linux implementation and the Chrome browser.
To use Fast Open, the client application (active SYN sender) must
replace connect() socket call with sendmsg() or sendto() with the new
MSG_FASTOPEN flag. If the server supports Fast Open the data exchange
starts at TCP handshake. Otherwise the connection will automatically
fall back to conventional TCP.
Yuchung Cheng (7):
net-tcp: Fast Open base
net-tcp: Fast Open client - cookie cache
net-tcp: Fast Open client - sending SYN-data
net-tcp: Fast Open client - receiving SYN-ACK
net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
net-tcp: Fast Open client - detecting SYN-data drops
net-tcp: Fast Open client - cookie-less mode
Documentation/networking/ip-sysctl.txt | 13 +++
include/linux/snmp.h | 1 +
include/linux/socket.h | 1 +
include/linux/tcp.h | 17 ++++-
include/net/inet_common.h | 6 +-
include/net/tcp.h | 28 ++++++-
net/ipv4/Makefile | 2 +-
net/ipv4/af_inet.c | 29 +++++--
net/ipv4/proc.c | 1 +
net/ipv4/syncookies.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 7 ++
net/ipv4/tcp.c | 61 ++++++++++++-
net/ipv4/tcp_fastopen.c | 11 +++
net/ipv4/tcp_input.c | 76 ++++++++++++++--
net/ipv4/tcp_ipv4.c | 5 +-
net/ipv4/tcp_metrics.c | 61 +++++++++++++
net/ipv4/tcp_minisocks.c | 4 +-
net/ipv4/tcp_output.c | 153 +++++++++++++++++++++++++++++---
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
20 files changed, 438 insertions(+), 44 deletions(-)
create mode 100644 net/ipv4/tcp_fastopen.c
--
1.7.7.3
^ permalink raw reply
* [PATCH v3 6/7] net-tcp: Fast Open client - detecting SYN-data drops
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
On paths with firewalls dropping SYN with data or experimental TCP options,
Fast Open connections will have experience SYN timeout and bad performance.
The solution is to track such incidents in the cookie cache and disables
Fast Open temporarily.
Since only the original SYN includes data and/or Fast Open option, the
SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
such drops. If a path has recurring Fast Open SYN drops, Fast Open is
disabled for 2^(recurring_losses) minutes starting from four minutes up to
roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but
it behaves as connect() then write().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
include/net/tcp.h | 6 ++++--
net/ipv4/tcp_input.c | 10 +++++++++-
net/ipv4/tcp_metrics.c | 16 +++++++++++++---
net/ipv4/tcp_output.c | 13 +++++++++++--
4 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c025810..e07878d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,9 +409,11 @@ extern bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst,
extern bool tcp_remember_stamp(struct sock *sk);
extern bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
- struct tcp_fastopen_cookie *cookie);
+ struct tcp_fastopen_cookie *cookie,
+ int *syn_loss, unsigned long *last_syn_loss);
extern void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
- struct tcp_fastopen_cookie *cookie);
+ struct tcp_fastopen_cookie *cookie,
+ bool syn_lost);
extern void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
extern void tcp_disable_fack(struct tcp_sock *tp);
extern void tcp_close(struct sock *sk, long timeout);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 38b6a81..c49a4fc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5652,6 +5652,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *data = tcp_write_queue_head(sk);
u16 mss = tp->rx_opt.mss_clamp;
+ bool syn_drop;
if (mss == tp->rx_opt.user_mss) {
struct tcp_options_received opt;
@@ -5664,7 +5665,14 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
mss = opt.mss_clamp;
}
- tcp_fastopen_cache_set(sk, mss, cookie);
+ /* The SYN-ACK neither has cookie nor acknowledges the data. Presumably
+ * the remote receives only the retransmitted (regular) SYNs: either
+ * the original SYN-data or the corresponding SYN-ACK is lost.
+ */
+ syn_drop = (cookie->len <= 0 && data &&
+ inet_csk(sk)->icsk_retransmits);
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
if (data) { /* Retransmit unacked data in SYN */
tcp_retransmit_skb(sk, data);
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index d02ff37..99779ae 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -32,6 +32,8 @@ enum tcp_metric_index {
struct tcp_fastopen_metrics {
u16 mss;
+ u16 syn_loss:10; /* Recurring Fast Open SYN losses */
+ unsigned long last_syn_loss; /* Last Fast Open SYN loss */
struct tcp_fastopen_cookie cookie;
};
@@ -125,6 +127,7 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm, struct dst_entry *dst)
tm->tcpm_ts = 0;
tm->tcpm_ts_stamp = 0;
tm->tcpm_fastopen.mss = 0;
+ tm->tcpm_fastopen.syn_loss = 0;
tm->tcpm_fastopen.cookie.len = 0;
}
@@ -644,7 +647,8 @@ bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
static DEFINE_SEQLOCK(fastopen_seqlock);
void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
- struct tcp_fastopen_cookie *cookie)
+ struct tcp_fastopen_cookie *cookie,
+ int *syn_loss, unsigned long *last_syn_loss)
{
struct tcp_metrics_block *tm;
@@ -659,14 +663,15 @@ void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
if (tfom->mss)
*mss = tfom->mss;
*cookie = tfom->cookie;
+ *syn_loss = tfom->syn_loss;
+ *last_syn_loss = *syn_loss ? tfom->last_syn_loss : 0;
} while (read_seqretry(&fastopen_seqlock, seq));
}
rcu_read_unlock();
}
-
void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
- struct tcp_fastopen_cookie *cookie)
+ struct tcp_fastopen_cookie *cookie, bool syn_lost)
{
struct tcp_metrics_block *tm;
@@ -679,6 +684,11 @@ void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
tfom->mss = mss;
if (cookie->len > 0)
tfom->cookie = *cookie;
+ if (syn_lost) {
+ ++tfom->syn_loss;
+ tfom->last_syn_loss = jiffies;
+ } else
+ tfom->syn_loss = 0;
write_sequnlock_bh(&fastopen_seqlock);
}
rcu_read_unlock();
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8869328..c5cfd5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2860,10 +2860,19 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_fastopen_request *fo = tp->fastopen_req;
- int space, i, err = 0, iovlen = fo->data->msg_iovlen;
+ int syn_loss = 0, space, i, err = 0, iovlen = fo->data->msg_iovlen;
struct sk_buff *syn_data = NULL, *data;
+ unsigned long last_syn_loss = 0;
+
+ tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie,
+ &syn_loss, &last_syn_loss);
+ /* Recurring FO SYN losses: revert to regular handshake temporarily */
+ if (syn_loss > 1 &&
+ time_before(jiffies, last_syn_loss + (60*HZ << syn_loss))) {
+ fo->cookie.len = -1;
+ goto fallback;
+ }
- tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie);
if (fo->cookie.len <= 0)
goto fallback;
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 5/7] net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.
For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.
For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().
Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.
The buffer size of sendmsg() is independent of the MSS of the connection.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
Documentation/networking/ip-sysctl.txt | 11 ++++++
include/linux/socket.h | 1 +
include/net/inet_common.h | 6 ++-
include/net/tcp.h | 3 ++
net/ipv4/af_inet.c | 19 +++++++---
net/ipv4/tcp.c | 61 +++++++++++++++++++++++++++++---
net/ipv4/tcp_ipv4.c | 3 ++
7 files changed, 92 insertions(+), 12 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index e1e0215..03964e0 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -468,6 +468,17 @@ tcp_syncookies - BOOLEAN
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.
+tcp_fastopen - INTEGER
+ Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data
+ in the opening SYN packet. To use this feature, the client application
+ must not use connect(). Instead, it should use sendmsg() or sendto()
+ with MSG_FASTOPEN flag which performs a TCP handshake automatically.
+
+ The values (bitmap) are:
+ 1: Enables sending data in the opening SYN on the client
+
+ Default: 0
+
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 255. Default value
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 25d6322..ba7b2e8 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -268,6 +268,7 @@ struct ucred {
#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
#define MSG_EOF MSG_FIN
+#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exit for file
descriptor received through
SCM_RIGHTS */
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..2340087 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -14,9 +14,11 @@ struct sockaddr;
struct socket;
extern int inet_release(struct socket *sock);
-extern int inet_stream_connect(struct socket *sock, struct sockaddr * uaddr,
+extern int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags);
-extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
+extern int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+extern int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
int addr_len, int flags);
extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 867557b..c025810 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -212,6 +212,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
/* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
#define TCP_INIT_CWND 10
+/* Bit Flags for sysctl_tcp_fastopen */
+#define TFO_CLIENT_ENABLE 1
+
extern struct inet_timewait_death_row tcp_death_row;
/* sysctl variables for tcp */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index edc4146..fe4582c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -585,8 +585,8 @@ static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias)
* Connect to a remote host. There is regrettably still a little
* TCP 'magic' in here.
*/
-int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
- int addr_len, int flags)
+int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags)
{
struct sock *sk = sock->sk;
int err;
@@ -595,8 +595,6 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
if (addr_len < sizeof(uaddr->sa_family))
return -EINVAL;
- lock_sock(sk);
-
if (uaddr->sa_family == AF_UNSPEC) {
err = sk->sk_prot->disconnect(sk, flags);
sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
@@ -663,7 +661,6 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
sock->state = SS_CONNECTED;
err = 0;
out:
- release_sock(sk);
return err;
sock_error:
@@ -673,6 +670,18 @@ sock_error:
sock->state = SS_DISCONNECTING;
goto out;
}
+EXPORT_SYMBOL(__inet_stream_connect);
+
+int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags)
+{
+ int err;
+
+ lock_sock(sock->sk);
+ err = __inet_stream_connect(sock, uaddr, addr_len, flags);
+ release_sock(sock->sk);
+ return err;
+}
EXPORT_SYMBOL(inet_stream_connect);
/*
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4252cd8..581ecf0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -270,6 +270,7 @@
#include <linux/slab.h>
#include <net/icmp.h>
+#include <net/inet_common.h>
#include <net/tcp.h>
#include <net/xfrm.h>
#include <net/ip.h>
@@ -982,26 +983,67 @@ static inline int select_size(const struct sock *sk, bool sg)
return tmp;
}
+void tcp_free_fastopen_req(struct tcp_sock *tp)
+{
+ if (tp->fastopen_req != NULL) {
+ kfree(tp->fastopen_req);
+ tp->fastopen_req = NULL;
+ }
+}
+
+static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *size)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err, flags;
+
+ if (!(sysctl_tcp_fastopen & TFO_CLIENT_ENABLE))
+ return -EOPNOTSUPP;
+ if (tp->fastopen_req != NULL)
+ return -EALREADY; /* Another Fast Open is in progress */
+
+ tp->fastopen_req = kzalloc(sizeof(struct tcp_fastopen_request),
+ sk->sk_allocation);
+ if (unlikely(tp->fastopen_req == NULL))
+ return -ENOBUFS;
+ tp->fastopen_req->data = msg;
+
+ flags = (msg->msg_flags & MSG_DONTWAIT) ? O_NONBLOCK : 0;
+ err = __inet_stream_connect(sk->sk_socket, msg->msg_name,
+ msg->msg_namelen, flags);
+ *size = tp->fastopen_req->copied;
+ tcp_free_fastopen_req(tp);
+ return err;
+}
+
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t size)
{
struct iovec *iov;
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
- int iovlen, flags, err, copied;
- int mss_now = 0, size_goal;
+ int iovlen, flags, err, copied = 0;
+ int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
bool sg;
long timeo;
lock_sock(sk);
flags = msg->msg_flags;
+ if (flags & MSG_FASTOPEN) {
+ err = tcp_sendmsg_fastopen(sk, msg, &copied_syn);
+ if (err == -EINPROGRESS && copied_syn > 0)
+ goto out;
+ else if (err)
+ goto out_err;
+ offset = copied_syn;
+ }
+
timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
/* Wait for a connection to finish. */
if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto out_err;
+ goto do_error;
if (unlikely(tp->repair)) {
if (tp->repair_queue == TCP_RECV_QUEUE) {
@@ -1037,6 +1079,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
unsigned char __user *from = iov->iov_base;
iov++;
+ if (unlikely(offset > 0)) { /* Skip bytes copied in SYN */
+ if (offset >= seglen) {
+ offset -= seglen;
+ continue;
+ }
+ seglen -= offset;
+ from += offset;
+ offset = 0;
+ }
while (seglen > 0) {
int copy = 0;
@@ -1199,7 +1250,7 @@ out:
if (copied && likely(!tp->repair))
tcp_push(sk, flags, mss_now, tp->nonagle);
release_sock(sk);
- return copied;
+ return copied + copied_syn;
do_fault:
if (!skb->len) {
@@ -1212,7 +1263,7 @@ do_fault:
}
do_error:
- if (copied)
+ if (copied + copied_syn)
goto out;
out_err:
err = sk_stream_error(sk, flags, err);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6b1fe40..f70f844 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1952,6 +1952,9 @@ void tcp_v4_destroy_sock(struct sock *sk)
tp->cookie_values = NULL;
}
+ /* If socket is aborted during connect operation */
+ tcp_free_fastopen_req(tp);
+
sk_sockets_allocated_dec(sk);
sock_release_memcg(sk);
}
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 1/7] net-tcp: Fast Open base
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
This patch impelements the common code for both the client and server.
1. TCP Fast Open option processing. Since Fast Open does not have an
option number assigned by IANA yet, it shares the experiment option
code 254 by implementing draft-ietf-tcpm-experimental-options
with a 16 bits magic number 0xF989. This enables global experiments
without clashing the scarce(2) experimental options available for TCP.
When the draft status becomes standard (maybe), the client should
switch to the new option number assigned while the server supports
both numbers for transistion.
2. The new sysctl tcp_fastopen
3. A place holder init function
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 10 ++++++++++
include/net/tcp.h | 9 ++++++++-
net/ipv4/Makefile | 2 +-
net/ipv4/syncookies.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/tcp_fastopen.c | 11 +++++++++++
net/ipv4/tcp_input.c | 26 ++++++++++++++++++++++----
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 4 ++--
net/ipv4/tcp_output.c | 25 +++++++++++++++++++++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
12 files changed, 86 insertions(+), 16 deletions(-)
create mode 100644 net/ipv4/tcp_fastopen.c
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1888169..12948f5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -243,6 +243,16 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
return (tcp_hdr(skb)->doff - 5) * 4;
}
+/* TCP Fast Open */
+#define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+#define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+
+/* TCP Fast Open Cookie as stored in memory */
+struct tcp_fastopen_cookie {
+ s8 len;
+ u8 val[TCP_FASTOPEN_COOKIE_MAX];
+};
+
/* This defines a selective acknowledgement block. */
struct tcp_sack_block_wire {
__be32 start_seq;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85c5090..5aed371 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -170,6 +170,11 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
#define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
#define TCPOPT_COOKIE 253 /* Cookie extension (experimental) */
+#define TCPOPT_EXP 254 /* Experimental */
+/* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+ */
+#define TCPOPT_FASTOPEN_MAGIC 0xF989
/*
* TCP option lengths
@@ -180,6 +185,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCPOLEN_SACK_PERM 2
#define TCPOLEN_TIMESTAMP 10
#define TCPOLEN_MD5SIG 18
+#define TCPOLEN_EXP_FASTOPEN_BASE 4
#define TCPOLEN_COOKIE_BASE 2 /* Cookie-less header extension */
#define TCPOLEN_COOKIE_PAIR 3 /* Cookie pair header extension */
#define TCPOLEN_COOKIE_MIN (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
@@ -222,6 +228,7 @@ extern int sysctl_tcp_retries1;
extern int sysctl_tcp_retries2;
extern int sysctl_tcp_orphan_retries;
extern int sysctl_tcp_syncookies;
+extern int sysctl_tcp_fastopen;
extern int sysctl_tcp_retrans_collapse;
extern int sysctl_tcp_stdurg;
extern int sysctl_tcp_rfc1337;
@@ -418,7 +425,7 @@ extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len, int nonblock, int flags, int *addr_len);
extern void tcp_parse_options(const struct sk_buff *skb,
struct tcp_options_received *opt_rx, const u8 **hvpp,
- int estab);
+ int estab, struct tcp_fastopen_cookie *foc);
extern const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
/*
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index a677d80..ae2ccf2 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -7,7 +7,7 @@ obj-y := route.o inetpeer.o protocol.o \
ip_output.o ip_sockglue.o inet_hashtables.o \
inet_timewait_sock.o inet_connection_sock.o \
tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
- tcp_minisocks.o tcp_cong.o tcp_metrics.o \
+ tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
datagram.o raw.o udp.o udplite.o \
arp.o icmp.o devinet.o af_inet.o igmp.o \
fib_frontend.o fib_semantics.o fib_trie.o \
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index eab2a7f..650e152 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -293,7 +293,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL);
if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
goto out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3f6a1e7..5840c32 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -367,6 +367,13 @@ static struct ctl_table ipv4_table[] = {
},
#endif
{
+ .procname = "tcp_fastopen",
+ .data = &sysctl_tcp_fastopen,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "tcp_tw_recycle",
.data = &tcp_death_row.sysctl_tw_recycle,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
new file mode 100644
index 0000000..a7f729c
--- /dev/null
+++ b/net/ipv4/tcp_fastopen.c
@@ -0,0 +1,11 @@
+#include <linux/init.h>
+#include <linux/kernel.h>
+
+int sysctl_tcp_fastopen;
+
+static int __init tcp_fastopen_init(void)
+{
+ return 0;
+}
+
+late_initcall(tcp_fastopen_init);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd49f1..a06bb89 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3732,7 +3732,8 @@ old_ack:
* the fast version below fails.
*/
void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx,
- const u8 **hvpp, int estab)
+ const u8 **hvpp, int estab,
+ struct tcp_fastopen_cookie *foc)
{
const unsigned char *ptr;
const struct tcphdr *th = tcp_hdr(skb);
@@ -3839,8 +3840,25 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
break;
}
break;
- }
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+ * SYN or SYN-ACK with an even size.
+ */
+ if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
+ get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC ||
+ foc == NULL || !th->syn || (opsize & 1))
+ break;
+ foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
+ if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
+ foc->len <= TCP_FASTOPEN_COOKIE_MAX)
+ memcpy(foc->val, ptr + 2, foc->len);
+ else if (foc->len != 0)
+ foc->len = -1;
+ break;
+
+ }
ptr += opsize-2;
length -= opsize;
}
@@ -3882,7 +3900,7 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
if (tcp_parse_aligned_timestamp(tp, th))
return true;
}
- tcp_parse_options(skb, &tp->rx_opt, hvpp, 1);
+ tcp_parse_options(skb, &tp->rx_opt, hvpp, 1, NULL);
return true;
}
@@ -5637,7 +5655,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
struct tcp_cookie_values *cvp = tp->cookie_values;
int saved_clamp = tp->rx_opt.mss_clamp;
- tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, NULL);
if (th->ack) {
/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d9caf5c..6b1fe40 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1307,7 +1307,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
tcp_clear_options(&tmp_opt);
tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
tmp_opt.user_mss = tp->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
if (tmp_opt.cookie_plus > 0 &&
tmp_opt.saw_tstamp &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c66f2ed..5912ac3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -97,7 +97,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
tmp_opt.saw_tstamp = 0;
if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
- tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = tcptw->tw_ts_recent;
@@ -534,7 +534,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
tmp_opt.saw_tstamp = 0;
if (th->doff > (sizeof(struct tcphdr)>>2)) {
- tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = req->ts_recent;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 15a7c7b..4849be7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -385,15 +385,17 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
#define OPTION_MD5 (1 << 2)
#define OPTION_WSCALE (1 << 3)
#define OPTION_COOKIE_EXTENSION (1 << 4)
+#define OPTION_FAST_OPEN_COOKIE (1 << 8)
struct tcp_out_options {
- u8 options; /* bit field of OPTION_* */
+ u16 options; /* bit field of OPTION_* */
+ u16 mss; /* 0 to disable */
u8 ws; /* window scale, 0 to disable */
u8 num_sack_blocks; /* number of SACK blocks to include */
u8 hash_size; /* bytes in hash_location */
- u16 mss; /* 0 to disable */
- __u32 tsval, tsecr; /* need to include OPTION_TS */
__u8 *hash_location; /* temporary pointer, overloaded */
+ __u32 tsval, tsecr; /* need to include OPTION_TS */
+ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
};
/* The sysctl int routines are generic, so check consistency here.
@@ -442,7 +444,7 @@ static u8 tcp_cookie_size_check(u8 desired)
static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
struct tcp_out_options *opts)
{
- u8 options = opts->options; /* mungable copy */
+ u16 options = opts->options; /* mungable copy */
/* Having both authentication and cookies for security is redundant,
* and there's certainly not enough room. Instead, the cookie-less
@@ -564,6 +566,21 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
tp->rx_opt.dsack = 0;
}
+
+ if (unlikely(OPTION_FAST_OPEN_COOKIE & options)) {
+ struct tcp_fastopen_cookie *foc = opts->fastopen_cookie;
+
+ *ptr++ = htonl((TCPOPT_EXP << 24) |
+ ((TCPOLEN_EXP_FASTOPEN_BASE + foc->len) << 16) |
+ TCPOPT_FASTOPEN_MAGIC);
+
+ memcpy(ptr, foc->val, foc->len);
+ if ((foc->len & 3) == 2) {
+ u8 *align = ((u8 *)ptr) + foc->len;
+ align[0] = align[1] = TCPOPT_NOP;
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
}
/* Compute TCP options for SYN packets. This is not the final
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 7bf3cc4..bb46061 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -177,7 +177,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL);
if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
goto out;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c9dabdd..0302ec3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1033,7 +1033,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
tcp_clear_options(&tmp_opt);
tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
tmp_opt.user_mss = tp->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+ tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
if (tmp_opt.cookie_plus > 0 &&
tmp_opt.saw_tstamp &&
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 3/7] net-tcp: Fast Open client - sending SYN-data
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).
The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len==0), the host sends
data-less SYN with Fast Open cookie request option to solicit a cookie
from the remote. If cookie is not available (len > 0), the host sends
a SYN-data with Fast Open cookie option. If cookie length is negative,
the SYN will not include any Fast Open option (for fall back operations).
To deal with middleboxes that may drop SYN with data or experimental TCP
option, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to regular TCP
handshake.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
include/linux/snmp.h | 1 +
include/linux/tcp.h | 6 ++-
include/net/tcp.h | 9 ++++
net/ipv4/af_inet.c | 10 +++-
net/ipv4/proc.c | 1 +
net/ipv4/tcp_output.c | 115 +++++++++++++++++++++++++++++++++++++++++++++----
6 files changed, 130 insertions(+), 12 deletions(-)
diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index e5fcbd0..00bc189 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -238,6 +238,7 @@ enum
LINUX_MIB_TCPOFOMERGE, /* TCPOFOMerge */
LINUX_MIB_TCPCHALLENGEACK, /* TCPChallengeACK */
LINUX_MIB_TCPSYNCHALLENGE, /* TCPSYNChallenge */
+ LINUX_MIB_TCPFASTOPENACTIVE, /* TCPFastOpenActive */
__LINUX_MIB_MAX
};
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 12948f5..1edf96a 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -386,7 +386,8 @@ struct tcp_sock {
unused : 1;
u8 repair_queue;
u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */
- early_retrans_delayed:1; /* Delayed ER timer installed */
+ early_retrans_delayed:1, /* Delayed ER timer installed */
+ syn_fastopen:1; /* SYN includes Fast Open option */
/* RTT measurement */
u32 srtt; /* smoothed round trip time << 3 */
@@ -500,6 +501,9 @@ struct tcp_sock {
struct tcp_md5sig_info __rcu *md5sig_info;
#endif
+/* TCP fastopen related information */
+ struct tcp_fastopen_request *fastopen_req;
+
/* When the cookie options are generated and exchanged, then this
* object holds a reference to them (cookie_values->kref). Also
* contains related tcp_cookie_transactions fields.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e601da1..867557b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1289,6 +1289,15 @@ extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, const struct sk_buff
extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
const struct tcp_md5sig_key *key);
+struct tcp_fastopen_request {
+ /* Fast Open cookie. Size 0 means a cookie request */
+ struct tcp_fastopen_cookie cookie;
+ struct msghdr *data; /* data in MSG_FASTOPEN */
+ u16 copied; /* queued in tcp_connect() */
+};
+
+void tcp_free_fastopen_req(struct tcp_sock *tp);
+
/* write queue abstraction */
static inline void tcp_write_queue_purge(struct sock *sk)
{
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 07a02f6..edc4146 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -556,11 +556,12 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
}
EXPORT_SYMBOL(inet_dgram_connect);
-static long inet_wait_for_connect(struct sock *sk, long timeo)
+static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias)
{
DEFINE_WAIT(wait);
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ sk->sk_write_pending += writebias;
/* Basic assumption: if someone sets sk->sk_err, he _must_
* change state of the socket from TCP_SYN_*.
@@ -576,6 +577,7 @@ static long inet_wait_for_connect(struct sock *sk, long timeo)
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
}
finish_wait(sk_sleep(sk), &wait);
+ sk->sk_write_pending -= writebias;
return timeo;
}
@@ -634,8 +636,12 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+ int writebias = (sk->sk_protocol == IPPROTO_TCP) &&
+ tcp_sk(sk)->fastopen_req &&
+ tcp_sk(sk)->fastopen_req->data ? 1 : 0;
+
/* Error code is set above */
- if (!timeo || !inet_wait_for_connect(sk, timeo))
+ if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
goto out;
err = sock_intr_errno(timeo);
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 2a5240b..957acd1 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -262,6 +262,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPOFOMerge", LINUX_MIB_TCPOFOMERGE),
SNMP_MIB_ITEM("TCPChallengeACK", LINUX_MIB_TCPCHALLENGEACK),
SNMP_MIB_ITEM("TCPSYNChallenge", LINUX_MIB_TCPSYNCHALLENGE),
+ SNMP_MIB_ITEM("TCPFastOpenActive", LINUX_MIB_TCPFASTOPENACTIVE),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4849be7..8869328 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -596,6 +596,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
u8 cookie_size = (!tp->rx_opt.cookie_out_never && cvp != NULL) ?
tcp_cookie_size_check(cvp->cookie_desired) :
0;
+ struct tcp_fastopen_request *fastopen = tp->fastopen_req;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tp->af_specific->md5_lookup(sk, sk);
@@ -636,6 +637,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
remaining -= TCPOLEN_SACKPERM_ALIGNED;
}
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+ need = (need + 3) & ~3U; /* Align to 32 bits */
+ if (remaining >= need) {
+ opts->options |= OPTION_FAST_OPEN_COOKIE;
+ opts->fastopen_cookie = &fastopen->cookie;
+ remaining -= need;
+ tp->syn_fastopen = 1;
+ }
+ }
/* Note that timestamps are required by the specification.
*
* Odd numbers of bytes are prohibited by the specification, ensuring
@@ -2824,6 +2835,96 @@ void tcp_connect_init(struct sock *sk)
tcp_clear_retrans(tp);
}
+static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+ tcb->end_seq += skb->len;
+ skb_header_release(skb);
+ __tcp_add_write_queue_tail(sk, skb);
+ sk->sk_wmem_queued += skb->truesize;
+ sk_mem_charge(sk, skb->truesize);
+ tp->write_seq = tcb->end_seq;
+ tp->packets_out += tcp_skb_pcount(skb);
+}
+
+/* Build and send a SYN with data and (cached) Fast Open cookie. However,
+ * queue a data-only packet after the regular SYN, such that regular SYNs
+ * are retransmitted on timeouts. Also if the remote SYN-ACK acknowledges
+ * only the SYN sequence, the data are retransmitted in the first ACK.
+ * If cookie is not cached or other error occurs, falls back to send a
+ * regular SYN with Fast Open cookie request option.
+ */
+static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_request *fo = tp->fastopen_req;
+ int space, i, err = 0, iovlen = fo->data->msg_iovlen;
+ struct sk_buff *syn_data = NULL, *data;
+
+ tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie);
+ if (fo->cookie.len <= 0)
+ goto fallback;
+
+ /* MSS for SYN-data is based on cached MSS and bounded by PMTU and
+ * user-MSS. Reserve maximum option space for middleboxes that add
+ * private TCP options. The cost is reduced data space in SYN :(
+ */
+ if (tp->rx_opt.user_mss && tp->rx_opt.user_mss < tp->rx_opt.mss_clamp)
+ tp->rx_opt.mss_clamp = tp->rx_opt.user_mss;
+ space = tcp_mtu_to_mss(sk, inet_csk(sk)->icsk_pmtu_cookie) -
+ MAX_TCP_OPTION_SPACE;
+
+ syn_data = skb_copy_expand(syn, skb_headroom(syn), space,
+ sk->sk_allocation);
+ if (syn_data == NULL)
+ goto fallback;
+
+ for (i = 0; i < iovlen && syn_data->len < space; ++i) {
+ struct iovec *iov = &fo->data->msg_iov[i];
+ unsigned char __user *from = iov->iov_base;
+ int len = iov->iov_len;
+
+ if (syn_data->len + len > space)
+ len = space - syn_data->len;
+ else if (i + 1 == iovlen)
+ /* No more data pending in inet_wait_for_connect() */
+ fo->data = NULL;
+
+ if (skb_add_data(syn_data, from, len))
+ goto fallback;
+ }
+
+ /* Queue a data-only packet after the regular SYN for retransmission */
+ data = pskb_copy(syn_data, sk->sk_allocation);
+ if (data == NULL)
+ goto fallback;
+ TCP_SKB_CB(data)->seq++;
+ TCP_SKB_CB(data)->tcp_flags &= ~TCPHDR_SYN;
+ TCP_SKB_CB(data)->tcp_flags = (TCPHDR_ACK|TCPHDR_PSH);
+ tcp_connect_queue_skb(sk, data);
+ fo->copied = data->len;
+
+ if (tcp_transmit_skb(sk, syn_data, 0, sk->sk_allocation) == 0) {
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
+ goto done;
+ }
+ syn_data = NULL;
+
+fallback:
+ /* Send a regular SYN with Fast Open cookie request option */
+ if (fo->cookie.len > 0)
+ fo->cookie.len = 0;
+ err = tcp_transmit_skb(sk, syn, 1, sk->sk_allocation);
+ if (err)
+ tp->syn_fastopen = 0;
+ kfree_skb(syn_data);
+done:
+ fo->cookie.len = -1; /* Exclude Fast Open option for SYN retries */
+ return err;
+}
+
/* Build a SYN and send it off. */
int tcp_connect(struct sock *sk)
{
@@ -2841,17 +2942,13 @@ int tcp_connect(struct sock *sk)
skb_reserve(buff, MAX_TCP_HEADER);
tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
+ tp->retrans_stamp = TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_connect_queue_skb(sk, buff);
TCP_ECN_send_syn(sk, buff);
- /* Send it off. */
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tp->retrans_stamp = TCP_SKB_CB(buff)->when;
- skb_header_release(buff);
- __tcp_add_write_queue_tail(sk, buff);
- sk->sk_wmem_queued += buff->truesize;
- sk_mem_charge(sk, buff->truesize);
- tp->packets_out += tcp_skb_pcount(buff);
- err = tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+ /* Send off SYN; include data in Fast Open. */
+ err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
+ tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
if (err == -ECONNREFUSED)
return err;
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 2/7] net-tcp: Fast Open client - cookie cache
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache.
The basic ones are MSS and the cookies. Later patch will cache more to
handle unfriendly middleboxes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
include/net/tcp.h | 4 +++
net/ipv4/tcp_metrics.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5aed371..e601da1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -405,6 +405,10 @@ extern void tcp_metrics_init(void);
extern bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst, bool paws_check);
extern bool tcp_remember_stamp(struct sock *sk);
extern bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
+extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
+ struct tcp_fastopen_cookie *cookie);
+extern void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
+ struct tcp_fastopen_cookie *cookie);
extern void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
extern void tcp_disable_fack(struct tcp_sock *tp);
extern void tcp_close(struct sock *sk, long timeout);
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 1a115b6..d02ff37 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -30,6 +30,11 @@ enum tcp_metric_index {
TCP_METRIC_MAX,
};
+struct tcp_fastopen_metrics {
+ u16 mss;
+ struct tcp_fastopen_cookie cookie;
+};
+
struct tcp_metrics_block {
struct tcp_metrics_block __rcu *tcpm_next;
struct inetpeer_addr tcpm_addr;
@@ -38,6 +43,7 @@ struct tcp_metrics_block {
u32 tcpm_ts_stamp;
u32 tcpm_lock;
u32 tcpm_vals[TCP_METRIC_MAX];
+ struct tcp_fastopen_metrics tcpm_fastopen;
};
static bool tcp_metric_locked(struct tcp_metrics_block *tm,
@@ -118,6 +124,8 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm, struct dst_entry *dst)
tm->tcpm_vals[TCP_METRIC_REORDERING] = dst_metric_raw(dst, RTAX_REORDERING);
tm->tcpm_ts = 0;
tm->tcpm_ts_stamp = 0;
+ tm->tcpm_fastopen.mss = 0;
+ tm->tcpm_fastopen.cookie.len = 0;
}
static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
@@ -633,6 +641,49 @@ bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
return ret;
}
+static DEFINE_SEQLOCK(fastopen_seqlock);
+
+void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
+ struct tcp_fastopen_cookie *cookie)
+{
+ struct tcp_metrics_block *tm;
+
+ rcu_read_lock();
+ tm = tcp_get_metrics(sk, __sk_dst_get(sk), false);
+ if (tm) {
+ struct tcp_fastopen_metrics *tfom = &tm->tcpm_fastopen;
+ unsigned int seq;
+
+ do {
+ seq = read_seqbegin(&fastopen_seqlock);
+ if (tfom->mss)
+ *mss = tfom->mss;
+ *cookie = tfom->cookie;
+ } while (read_seqretry(&fastopen_seqlock, seq));
+ }
+ rcu_read_unlock();
+}
+
+
+void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
+ struct tcp_fastopen_cookie *cookie)
+{
+ struct tcp_metrics_block *tm;
+
+ rcu_read_lock();
+ tm = tcp_get_metrics(sk, __sk_dst_get(sk), true);
+ if (tm) {
+ struct tcp_fastopen_metrics *tfom = &tm->tcpm_fastopen;
+
+ write_seqlock_bh(&fastopen_seqlock);
+ tfom->mss = mss;
+ if (cookie->len > 0)
+ tfom->cookie = *cookie;
+ write_sequnlock_bh(&fastopen_seqlock);
+ }
+ rcu_read_unlock();
+}
+
static unsigned long tcpmhash_entries;
static int __init set_tcpmhash_entries(char *str)
{
--
1.7.7.3
^ permalink raw reply related
* [PATCH v3 4/7] net-tcp: Fast Open client - receiving SYN-ACK
From: Yuchung Cheng @ 2012-07-19 16:43 UTC (permalink / raw)
To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342716191-19196-1-git-send-email-ycheng@google.com>
On receiving the SYN-ACK after SYN-data, the client needs to
a) update the cached MSS and cookie (if included in SYN-ACK)
b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
the handshake.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 40 +++++++++++++++++++++++++++++++++++-----
1 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a06bb89..38b6a81 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5646,6 +5646,34 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
}
}
+static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ struct tcp_fastopen_cookie *cookie)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *data = tcp_write_queue_head(sk);
+ u16 mss = tp->rx_opt.mss_clamp;
+
+ if (mss == tp->rx_opt.user_mss) {
+ struct tcp_options_received opt;
+ const u8 *hash_location;
+
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+ tcp_parse_options(synack, &opt, &hash_location, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+ tcp_fastopen_cache_set(sk, mss, cookie);
+
+ if (data) { /* Retransmit unacked data in SYN */
+ tcp_retransmit_skb(sk, data);
+ tcp_rearm_rto(sk);
+ return true;
+ }
+ return false;
+}
+
static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
const struct tcphdr *th, unsigned int len)
{
@@ -5653,9 +5681,10 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_cookie_values *cvp = tp->cookie_values;
+ struct tcp_fastopen_cookie foc = { .len = -1 };
int saved_clamp = tp->rx_opt.mss_clamp;
- tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, NULL);
+ tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, &foc);
if (th->ack) {
/* rfc793:
@@ -5665,11 +5694,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
* If SEG.ACK =< ISS, or SEG.ACK > SND.NXT, send
* a reset (unless the RST bit is set, if so drop
* the segment and return)"
- *
- * We do not send data with SYN, so that RFC-correct
- * test reduces to:
*/
- if (TCP_SKB_CB(skb)->ack_seq != tp->snd_nxt)
+ if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
+ after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
goto reset_and_undo;
if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -5781,6 +5808,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
tcp_finish_connect(sk, skb);
+ if (tp->syn_fastopen && tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
if (sk->sk_write_pending ||
icsk->icsk_accept_queue.rskq_defer_accept ||
icsk->icsk_ack.pingpong) {
--
1.7.7.3
^ permalink raw reply related
* [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Srivatsa S. Bhat @ 2012-07-19 16:27 UTC (permalink / raw)
To: Srivatsa S. Bhat, gaofeng, eric.dumazet, nhorman, davem
Cc: Srivatsa S. Bhat, linux-kernel, netdev, mark.d.rustad,
john.r.fastabend, lizefan
After commit ef209f15 (net: cgroup: fix access the unallocated memory in
netprio cgroup), boot fails with the following NULL pointer dereference:
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys blkio
Initializing cgroup subsys perf_event
Initializing cgroup subsys net_prio
BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
PGD 0
Oops: 0000 [#1] SMP
CPU 0
Modules linked in:
Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
RIP: 0010:[<ffffffff8145e8d6>] [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
RSP: 0000:ffffffff81a01ea8 EFLAGS: 00010213
RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
FS: 0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
Stack:
ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
Call Trace:
[<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
[<ffffffff81b1ce13>] cgroup_init+0x36/0x119
[<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
[<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
[<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
[<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
RIP [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
RSP <ffffffff81a01ea8>
CR2: 0000000000000698
---[ end trace a7919e7f17c0a725 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
The code corresponds to:
update_netdev_tables():
for_each_netdev(&init_net, dev) {
map = rtnl_dereference(dev->priomap); <---- HERE
The list head is initialized in netdev_init(), which is called much
later than cgrp_create(). So the problem is that we are calling
update_netdev_tables() way too early (in cgrp_create()), which will
end up traversing the not-yet-circular linked list. So at some point,
the dev pointer will become NULL and hence dev->priomap becomes an
invalid access.
To fix this, just remove the update_netdev_tables() function entirely,
since it appears that write_update_netdev_table() will handle things
just fine.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
Requesting a thorough review of this patch, since I am not sure whether
removing update_netdev_tables() is perfectly OK and whether that is the
right thing to do.
net/core/netprio_cgroup.c | 32 --------------------------------
1 files changed, 0 insertions(+), 32 deletions(-)
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b2e9caa..68fe239 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -109,32 +109,6 @@ static int write_update_netdev_table(struct net_device *dev)
return ret;
}
-static int update_netdev_tables(void)
-{
- int ret = 0;
- struct net_device *dev;
- u32 max_len;
- struct netprio_map *map;
-
- rtnl_lock();
- max_len = atomic_read(&max_prioidx) + 1;
- for_each_netdev(&init_net, dev) {
- map = rtnl_dereference(dev->priomap);
- /*
- * don't allocate priomap if we didn't
- * change net_prio.ifpriomap (map == NULL),
- * this will speed up skb_update_prio.
- */
- if (map && map->priomap_len < max_len) {
- ret = extend_netdev_table(dev, max_len);
- if (ret < 0)
- break;
- }
- }
- rtnl_unlock();
- return ret;
-}
-
static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs;
@@ -153,12 +127,6 @@ static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
goto out;
}
- ret = update_netdev_tables();
- if (ret < 0) {
- put_prioidx(cs->prioidx);
- goto out;
- }
-
return &cs->css;
out:
kfree(cs);
^ permalink raw reply related
* [PATCH v2] net: e100: ucode is optional in some cases
From: Bjørn Mork @ 2012-07-19 16:28 UTC (permalink / raw)
To: netdev
Cc: e1000-devel, Bruce Allan, Jesse Brandeburg, John Ronciak,
Bjørn Mork
commit 9ac32e1b firmware: convert e100 driver to request_firmware()
did a straight conversion of the in-driver ucode to external
files. This introduced the possibility of the driver failing
to enable an interface due to missing ucode. There was no
evaluation of the importance of the ucode at the time.
Based on comments in earlier versions of this driver, and in
the source code for the FreeBSD fxp driver, we can assume that
the ucode implements the "CPU Cycle Saver" feature on supported
adapters. Although generally wanted, this is an optional
feature. The ucode source is not available, preventing it from
being included in free distributions. This creates unnecessary
problems for the end users. Doing a network install based on a
free distribution installer requires the user to download and
insert the ucode into the installer.
Making the ucode optional when possible improves the user
experience and driver usability.
The ucode for some adapters include a bugfix, making it
essential. We continue to fail for these adapters unless the
ucode is available.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
v2: removed URLs from the patch, converting them to generic
descriptions of the sources of information
drivers/net/ethernet/intel/e100.c | 40 ++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index ada720b..535f94f 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -1249,20 +1249,35 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
const struct firmware *fw = nic->fw;
u8 timer, bundle, min_size;
int err = 0;
+ bool required = false;
/* do not load u-code for ICH devices */
if (nic->flags & ich)
return NULL;
- /* Search for ucode match against h/w revision */
- if (nic->mac == mac_82559_D101M)
+ /* Search for ucode match against h/w revision
+ *
+ * Based on comments in the source code for the FreeBSD fxp
+ * driver, the FIRMWARE_D102E ucode includes both CPUSaver and
+ *
+ * "fixes for bugs in the B-step hardware (specifically, bugs
+ * with Inline Receive)."
+ *
+ * So we must fail if it cannot be loaded.
+ *
+ * The other microcode files are only required for the optional
+ * CPUSaver feature. Nice to have, but no reason to fail.
+ */
+ if (nic->mac == mac_82559_D101M) {
fw_name = FIRMWARE_D101M;
- else if (nic->mac == mac_82559_D101S)
+ } else if (nic->mac == mac_82559_D101S) {
fw_name = FIRMWARE_D101S;
- else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10)
+ } else if (nic->mac == mac_82551_F || nic->mac == mac_82551_10) {
fw_name = FIRMWARE_D102E;
- else /* No ucode on other devices */
+ required = true;
+ } else { /* No ucode on other devices */
return NULL;
+ }
/* If the firmware has not previously been loaded, request a pointer
* to it. If it was previously loaded, we are reinitializing the
@@ -1273,10 +1288,17 @@ static const struct firmware *e100_request_firmware(struct nic *nic)
err = request_firmware(&fw, fw_name, &nic->pdev->dev);
if (err) {
- netif_err(nic, probe, nic->netdev,
- "Failed to load firmware \"%s\": %d\n",
- fw_name, err);
- return ERR_PTR(err);
+ if (required) {
+ netif_err(nic, probe, nic->netdev,
+ "Failed to load firmware \"%s\": %d\n",
+ fw_name, err);
+ return ERR_PTR(err);
+ } else {
+ netif_info(nic, probe, nic->netdev,
+ "CPUSaver disabled. Needs \"%s\": %d\n",
+ fw_name, err);
+ return NULL;
+ }
}
/* Firmware should be precisely UCODE_SIZE (words) plus three bytes
--
1.7.10.4
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
^ permalink raw reply related
* Re: [PATCH net-next V1 4/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-19 16:28 UTC (permalink / raw)
To: Ben Hutchings; +Cc: davem, roland, netdev, oren, yevgenyp, Amir Vadai
In-Reply-To: <1342706957.2617.25.camel@bwh-desktop.uk.solarflarecom.com>
On 7/19/2012 5:09 PM, Ben Hutchings wrote:
>> @@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
>> > struct mlx4_en_dev *mdev = priv->mdev;
>> > int err = 0;
>> > char name[25];
>> >+ struct cpu_rmap *rmap =
>> >+#ifdef CONFIG_RFS_ACCEL
>> >+ priv->dev->rx_cpu_rmap;
>> >+#else
>> >+ NULL;
>> >+#endif
> You can write this slightly more cleanly using IS_ENABLED().
>
> [...]
>> >+static struct mlx4_en_filter *
>> >+mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
>> >+ __be32 dst_ip, __be16 src_port, __be16 dst_port,
>> >+ u32 flow_id)
>> >+{
> [...]
>> >+ filter->id = priv->last_filter_id++;
> [...]
>
> You need to limit the filter IDs to be < RPS_NO_FILTER.
OK, thanks, we will send fixes for these two comments.
Or.
^ permalink raw reply
* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Or Gerlitz @ 2012-07-19 16:21 UTC (permalink / raw)
To: Ben Hutchings
Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <1342714574.2617.43.camel@bwh-desktop.uk.solarflarecom.com>
On 7/19/2012 7:16 PM, Ben Hutchings wrote:
> On Thu, 2012-07-19 at 18:46 +0300, Or Gerlitz wrote:
> For that end, I was under the impression all the three NETIF_F_HW_VLAN_{TX,RX,FILTER) features need to be advertized. From your comment I understand now that RX/TX are enough in that respect?
> [...]
>
> Yes.
OK, good, will fix.
Or.
^ permalink raw reply
* Re: [PATCH net/for-next V1 1/1] IB/ipoib: break linkage to neighbouring system
From: Or Gerlitz @ 2012-07-19 16:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Shlomo Pongartz, roland-DgEjT+Ai2ygdnm+yROfE0A,
davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
erezsh-VPRAkNaXOzVWk0Htik3J/w,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <alpine.DEB.2.00.1207191023130.29808-sBS69tsa9Uj/9pzu0YdTqQ@public.gmane.org>
On 7/19/2012 6:24 PM, Christoph Lameter wrote:
> On Thu, 19 Jul 2012, Shlomo Pongartz wrote:
>
>> The garbage collection and stale times follow the default ipv4/6 neigh.default.gc_yyy
>> sysctl values, for example
>>
>> net.ipv4.neigh.default.gc_interval = 30
>> net.ipv4.neigh.default.gc_stale_time = 60
>>
>> If given access to these values from IPoIB, we will be happy
>> to integrate them into that logic
>
> It looks like the values are hardcoded right now.
Two points here,
1s, they are indeed hard-coded since there's no define/enum
that holds their default values (or maybe we should add one now?), see
this code snippest from net/ipv4/arp.c
> .gc_interval = 30 * HZ,
> .gc_thresh1 = 128,
> .gc_thresh2 = 512,
> .gc_thresh3 = 1024,
2nd, and even more interesting, the little challenge here is how
to integrate with the sysctl's that allow for changing these values,
the mechanism that uses neigh_sysctl_table in net/core/neighbour.c isn't
exported to the rest of the world. And there's no point to define new
sysctl entries just for managing the IPoIB neighbours, ideas welcome.
>> Please clarify what do you mean by group expiration.
>
> If you have neighbor expiration periods of 4 hrs and it is necessary to
> run the expiration logic then please expire all the neighbor entries due a
> certain period after that as well to avoid running the expiration again in
> the next minute or so.
This is still a bit unclear here... do you mean to say that at a certain
point in time,
**all** entries need to be deleted irrelevant of their (jiffies) age? why?
> I guess the fuzz factor needs to scale depending on the expiration period.
>
>
and this is what happens now, the factor is 0.5, entry would be deleted when
if (60m <= unused < 90s) holds
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next V1 7/9] net/eipoib: Add main driver functionality
From: Ben Hutchings @ 2012-07-19 16:16 UTC (permalink / raw)
To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit
In-Reply-To: <50082BDE.2040005@mellanox.com>
On Thu, 2012-07-19 at 18:46 +0300, Or Gerlitz wrote:
> On 7/19/2012 4:49 PM, Ben Hutchings wrote:
> > On Wed, 2012-07-18 at 14:00 +0300, Or Gerlitz wrote:
[...]
> >> + .ndo_vlan_rx_add_vid = eth_ipoib_vlan_rx_add_vid,
> >> + .ndo_vlan_rx_kill_vid = eth_ipoib_vlan_rx_kill_vid,
> >
> > These shouldn't be needed.
>
> ok, here's the point, the eIPoIB driver maps Ethernet vlans to
> infiniband/IPoIB pkeys
> (partition keys). The underlying IPoIB devices work with these pkeys
> in a way which is HW accelerated, and we want the eIPoIB driver to be
> considered as one
> that support HW accelerate vlans. E.g on the TX flow we don't want that
> any special SW
> handling by the 8021q driver will be done on the skb except for setting
> the skb->vlan_tci
> field, and in the RX flow, we set skb->vlan_tci field and don't want
> that 8021q to try
> and extract it from the headers, etc.
>
> For that end, I was under the impression all the three
> NETIF_F_HW_VLAN_{TX,RX,FILTER)
> features need to be advertized. From your comment I understand now that
> RX/TX are enough
> in that respect?
[...]
Yes.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox