* [PATCH net-next 01/10] tipc: remove obsolete flush of stale reassembly buffer
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit Paul Gortmaker
` (8 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC
To: David Miller; +Cc: netdev, Jon Maloy, Erik Hugne, Paul Gortmaker
From: Erik Hugne <erik.hugne@ericsson.com>
Each link instance has a periodic job checking if there is a stale
ongoing message reassembly associated with the link. If no new
fragment has been received during the last 4*[link_tolerance] period,
it is assumed the missing fragment will never arrive. As a consequence,
the reassembly buffer is discarded, and a gap in the message sequence
occurs.
This assumption is wrong. After we abandoned our ambition to develop
packet routing for multi-cluster networks, only single-hop packet
transfer remains as an option. For such transfers, all packets are guaranteed
to be delivered in sequence to the defragmentation layer. Any failure
to achieve sequenced delivery will eventually lead to link reset, and
the reassembly buffer will be flushed anyway.
So we just remove this periodic check, which is now obsolete.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: also delete get/inc_timer count, since they are now unused]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/link.c | 44 --------------------------------------------
1 file changed, 44 deletions(-)
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 87bf5aa..daa6080 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -97,7 +97,6 @@ static int link_send_sections_long(struct tipc_port *sender,
struct iovec const *msg_sect,
u32 num_sect, unsigned int total_len,
u32 destnode);
-static void link_check_defragm_bufs(struct tipc_link *l_ptr);
static void link_state_event(struct tipc_link *l_ptr, u32 event);
static void link_reset_statistics(struct tipc_link *l_ptr);
static void link_print(struct tipc_link *l_ptr, const char *str);
@@ -271,7 +270,6 @@ static void link_timeout(struct tipc_link *l_ptr)
}
/* do all other link processing performed on a periodic basis */
- link_check_defragm_bufs(l_ptr);
link_state_event(l_ptr, TIMEOUT_EVT);
@@ -2497,16 +2495,6 @@ static void set_expected_frags(struct sk_buff *buf, u32 exp)
msg_set_bcast_ack(buf_msg(buf), exp);
}
-static u32 get_timer_cnt(struct sk_buff *buf)
-{
- return msg_reroute_cnt(buf_msg(buf));
-}
-
-static void incr_timer_cnt(struct sk_buff *buf)
-{
- msg_incr_reroute_cnt(buf_msg(buf));
-}
-
/*
* tipc_link_recv_fragment(): Called with node lock on. Returns
* the reassembled buffer if message is complete.
@@ -2585,38 +2573,6 @@ int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
return 0;
}
-/**
- * link_check_defragm_bufs - flush stale incoming message fragments
- * @l_ptr: pointer to link
- */
-static void link_check_defragm_bufs(struct tipc_link *l_ptr)
-{
- struct sk_buff *prev = NULL;
- struct sk_buff *next = NULL;
- struct sk_buff *buf = l_ptr->defragm_buf;
-
- if (!buf)
- return;
- if (!link_working_working(l_ptr))
- return;
- while (buf) {
- u32 cnt = get_timer_cnt(buf);
-
- next = buf->next;
- if (cnt < 4) {
- incr_timer_cnt(buf);
- prev = buf;
- } else {
- if (prev)
- prev->next = buf->next;
- else
- l_ptr->defragm_buf = buf->next;
- kfree_skb(buf);
- }
- buf = next;
- }
-}
-
static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tolerance)
{
if ((tolerance < TIPC_MIN_LINK_TOL) || (tolerance > TIPC_MAX_LINK_TOL))
--
1.7.12.1
* [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 01/10] tipc: remove obsolete flush of stale reassembly buffer Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 16:07 ` Neil Horman
2012-12-07 14:28 ` [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets Paul Gortmaker
` (7 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, which strengthens our conviction that this
control can be eliminated.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..a059ed0 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
* net/tipc/socket.c: TIPC socket API
*
* Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
/* Reject message if there isn't room to queue it */
- recv_q_len = (u32)atomic_read(&tipc_queue_size);
- if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
- return TIPC_ERR_OVERLOAD;
- }
recv_q_len = skb_queue_len(&sk->sk_receive_queue);
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
--
1.7.12.1
* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
2012-12-07 14:28 ` [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit Paul Gortmaker
@ 2012-12-07 16:07 ` Neil Horman
2012-12-07 17:36 ` David Miller
0 siblings, 1 reply; 22+ messages in thread
From: Neil Horman @ 2012-12-07 16:07 UTC
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
>
> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> global atomic counter for the sum of sk_recv_queue sizes across all
> tipc sockets. When incremented, the counter is compared to an upper
> threshold value, and if this is reached, the message is rejected
> with error code TIPC_OVERLOAD.
>
> This check was originally meant to protect the node against
> buffer exhaustion and general CPU overload. However, all experience
> indicates that the feature not only is redundant on Linux, but even
> harmful. Users run into the limit very often, causing disturbances
> for their applications, while removing it seems to have no negative
> effects at all. We have also seen that overall performance is
> boosted significantly when this bottleneck is removed.
>
> Furthermore, we don't see any other network protocols maintaining
> such a mechanism, which strengthens our conviction that this
> control can be eliminated.
>
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> net/tipc/socket.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 1a720c8..a059ed0 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -2,7 +2,7 @@
> * net/tipc/socket.c: TIPC socket API
> *
> * Copyright (c) 2001-2007, Ericsson AB
> - * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
> + * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
> * All rights reserved.
> *
> * Redistribution and use in source and binary forms, with or without
> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> }
>
> /* Reject message if there isn't room to queue it */
> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> - return TIPC_ERR_OVERLOAD;
> - }
If you're going to remove the one place that you read this variable, don't you
also want to remove the points where you increment/decrement the atomic as well,
and for that matter eliminate the definition itself?
Neil
* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
2012-12-07 16:07 ` Neil Horman
@ 2012-12-07 17:36 ` David Miller
2012-12-07 19:54 ` Paul Gortmaker
0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2012-12-07 17:36 UTC
To: nhorman; +Cc: paul.gortmaker, netdev, jon.maloy, ying.xue
From: Neil Horman <nhorman@tuxdriver.com>
Date: Fri, 7 Dec 2012 11:07:33 -0500
> On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
>> From: Ying Xue <ying.xue@windriver.com>
>>
>> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
>> global atomic counter for the sum of sk_recv_queue sizes across all
>> tipc sockets. When incremented, the counter is compared to an upper
>> threshold value, and if this is reached, the message is rejected
>> with error code TIPC_OVERLOAD.
>>
>> This check was originally meant to protect the node against
>> buffer exhaustion and general CPU overload. However, all experience
>> indicates that the feature not only is redundant on Linux, but even
>> harmful. Users run into the limit very often, causing disturbances
>> for their applications, while removing it seems to have no negative
>> effects at all. We have also seen that overall performance is
>> boosted significantly when this bottleneck is removed.
>>
>> Furthermore, we don't see any other network protocols maintaining
>> such a mechanism, which strengthens our conviction that this
>> control can be eliminated.
>>
>> Signed-off-by: Ying Xue <ying.xue@windriver.com>
>> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
>> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
...
>> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
>> }
>>
>> /* Reject message if there isn't room to queue it */
>> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
>> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
>> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
>> - return TIPC_ERR_OVERLOAD;
>> - }
> If you're going to remove the one place that you read this variable, don't you
> also want to remove the points where you increment/decrement the atomic as well,
> and for that matter eliminate the definition itself?
There's another reader, a getsockopt() call.
I would just make it return zero or similar.
Paul please do so and respin this series.
Thanks.
* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
2012-12-07 17:36 ` David Miller
@ 2012-12-07 19:54 ` Paul Gortmaker
0 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 19:54 UTC
To: David Miller; +Cc: nhorman, netdev, jon.maloy, ying.xue
[Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit] On 07/12/2012 (Fri 12:36) David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Fri, 7 Dec 2012 11:07:33 -0500
>
> > On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
> >> From: Ying Xue <ying.xue@windriver.com>
> >>
> >> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> >> global atomic counter for the sum of sk_recv_queue sizes across all
> >> tipc sockets. When incremented, the counter is compared to an upper
> >> threshold value, and if this is reached, the message is rejected
> >> with error code TIPC_OVERLOAD.
> >>
> >> This check was originally meant to protect the node against
> >> buffer exhaustion and general CPU overload. However, all experience
> >> indicates that the feature not only is redundant on Linux, but even
> >> harmful. Users run into the limit very often, causing disturbances
> >> for their applications, while removing it seems to have no negative
> >> effects at all. We have also seen that overall performance is
> >> boosted significantly when this bottleneck is removed.
> >>
> >> Furthermore, we don't see any other network protocols maintaining
> >> such a mechanism, which strengthens our conviction that this
> >> control can be eliminated.
> >>
> >> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> >> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> >> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ...
> >> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> >> }
> >>
> >> /* Reject message if there isn't room to queue it */
> >> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> >> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> >> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> >> - return TIPC_ERR_OVERLOAD;
> >> - }
> > If you're going to remove the one place that you read this variable, don't you
> > also want to remove the points where you increment/decrement the atomic as well,
> > and for that matter eliminate the definition itself?
>
> There's another reader, a getsockopt() call.
>
> I would just make it return zero or similar.
Updated commit below for reference, just to double check that I'm
doing what is requested properly.
Paul.
--
From 9da3d475874f4da49057767913af95ce01063ba3 Mon Sep 17 00:00:00 2001
From: Ying Xue <ying.xue@windriver.com>
Date: Tue, 27 Nov 2012 06:15:27 -0500
Subject: [PATCH] tipc: eliminate aggregate sk_receive_queue limit
As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, which strengthens our conviction that this
control can be eliminated.
As a result, the atomic variable tipc_queue_size is now unused
and so it can be deleted. There is a getsockopt call that used
to allow reading it; we retain that but just return zero for
maximum compatibility.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
[PG: phase out tipc_queue_size as pointed out by Neil Horman]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..848be69 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
* net/tipc/socket.c: TIPC socket API
*
* Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -73,8 +73,6 @@ static struct proto tipc_proto;
static int sockets_enabled;
-static atomic_t tipc_queue_size = ATOMIC_INIT(0);
-
/*
* Revised TIPC socket locking policy:
*
@@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
static void advance_rx_queue(struct sock *sk)
{
kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
- atomic_dec(&tipc_queue_size);
}
/**
@@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
- atomic_dec(&tipc_queue_size);
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
kfree_skb(buf);
- }
}
/**
@@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
- atomic_dec(&tipc_queue_size);
- }
}
/**
@@ -280,7 +273,6 @@ static int release(struct socket *sock)
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf == NULL)
break;
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0)
kfree_skb(buf);
else {
@@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
/* Reject message if there isn't room to queue it */
- recv_q_len = (u32)atomic_read(&tipc_queue_size);
- if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
- return TIPC_ERR_OVERLOAD;
- }
recv_q_len = skb_queue_len(&sk->sk_receive_queue);
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
@@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Enqueue message (finally!) */
TIPC_SKB_CB(buf)->handle = 0;
- atomic_inc(&tipc_queue_size);
__skb_queue_tail(&sk->sk_receive_queue, buf);
/* Initiate connection termination for an incoming 'FIN' */
@@ -1578,7 +1564,6 @@ restart:
/* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf) {
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0) {
kfree_skb(buf);
goto restart;
@@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
/* no need to set "res", since already 0 at this point */
break;
case TIPC_NODE_RECVQ_DEPTH:
- value = (u32)atomic_read(&tipc_queue_size);
+ value = 0; /* was tipc_queue_size, now obsolete */
break;
case TIPC_SOCK_RECVQ_DEPTH:
value = skb_queue_len(&sk->sk_receive_queue);
--
1.7.12.1
>
> Paul please do so and respin this series.
>
> Thanks.
* [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 01/10] tipc: remove obsolete flush of stale reassembly buffer Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 19:20 ` Neil Horman
2012-12-07 14:28 ` [PATCH net-next 04/10] tipc: change sk_receive_queue upper limit Paul Gortmaker
` (6 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
The sk_receive_queue limit control is currently performed for
all arriving messages, disregarding socket and message type.
But for connected sockets this check is redundant, since the protocol
flow control already makes queue overflow impossible.
We move the sk_receive_queue limit control so that it is only performed
for connectionless sockets, i.e. SOCK_RDM and SOCK_DGRAM type sockets.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index a059ed0..4d6a448 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1188,9 +1188,6 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
else
return 0;
- if (msg_connected(msg))
- threshold *= 4;
-
return queue_size >= threshold;
}
@@ -1219,6 +1216,15 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
if (sock->state == SS_READY) {
if (msg_connected(msg))
return TIPC_ERR_NO_PORT;
+ /* Reject SOCK_DGRAM and SOCK_RDM messages if there isn't room
+ * to queue it.
+ */
+ recv_q_len = skb_queue_len(&sk->sk_receive_queue);
+ if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
+ if (rx_queue_full(msg, recv_q_len,
+ OVERLOAD_LIMIT_BASE / 2))
+ return TIPC_ERR_OVERLOAD;
+ }
} else {
if (msg_mcast(msg))
return TIPC_ERR_NO_PORT;
@@ -1240,13 +1246,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
}
- /* Reject message if there isn't room to queue it */
- recv_q_len = skb_queue_len(&sk->sk_receive_queue);
- if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
- return TIPC_ERR_OVERLOAD;
- }
-
/* Enqueue message (finally!) */
TIPC_SKB_CB(buf)->handle = 0;
atomic_inc(&tipc_queue_size);
--
1.7.12.1
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-07 14:28 ` [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets Paul Gortmaker
@ 2012-12-07 19:20 ` Neil Horman
2012-12-07 22:30 ` Jon Maloy
0 siblings, 1 reply; 22+ messages in thread
From: Neil Horman @ 2012-12-07 19:20 UTC
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
>
> The sk_receive_queue limit control is currently performed for
> all arriving messages, disregarding socket and message type.
> But for connected sockets this check is redundant, since the protocol
> flow control already makes queue overflow impossible.
>
Can you explain where that occurs? I see where the tipc dispatch function calls
sk_add_backlog, which checks the per socket receive queue (regardless of whether
the receiving socket is connection oriented or connectionless), but if the
receiver doesn't call receive very often, this just adds a check against your
global limit, doing nothing for your per-socket limits. In fact it seems to
repeat the same check twice, as in the worst case of the incoming message being
TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
OVERLOAD_LIMIT_BASE/2 again.
Neil
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-07 19:20 ` Neil Horman
@ 2012-12-07 22:30 ` Jon Maloy
2012-12-09 16:50 ` Neil Horman
0 siblings, 1 reply; 22+ messages in thread
From: Jon Maloy @ 2012-12-07 22:30 UTC
To: Neil Horman; +Cc: Paul Gortmaker, David Miller, netdev, Ying Xue
On 12/07/2012 02:20 PM, Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>> From: Ying Xue <ying.xue@windriver.com>
>>
>> The sk_receive_queue limit control is currently performed for
>> all arriving messages, disregarding socket and message type.
>> But for connected sockets this check is redundant, since the protocol
>> flow control already makes queue overflow impossible.
>>
> Can you explain where that occurs?
It happens in the functions port_dispatcher_sigh() and tipc_send(),
among other places. Both are to be found in the file port.c, which
was supposed to contain the 'generic' (i.e., API independent) part
of the send/receive code.
Now that we have only one API left, the socket API, we are
planning to merge the code in socket.c and port.c, and get rid of
some code overhead.
The flow control in TIPC is message based, where the sender
must receive an explicit acknowledge message for every
512 messages the receiver reads to user space.
If the sender has more than 1024 messages outstanding without having
received an acknowledge he will be suspended or receive EAGAIN until
he does.
The plan going forward is to replace this mechanism with a more
standard looking byte based flow control, while maintaining
backwards compatibility.
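To make the window arithmetic concrete, the scheme reduces to something
like the following standalone model (names, helpers and structure are
illustrative only, not the actual port code):

	#include <errno.h>

	#define ACK_INTERVAL	512	/* reader acks once per 512 messages read */
	#define SEND_WINDOW	1024	/* sender stops at 1024 unacked messages */

	struct flowctl {
		int sent_unacked;	/* maintained on the send side */
		int read_unacked;	/* maintained on the receive side */
	};

	/* sender: may one more message be transmitted right now? */
	static int may_send(struct flowctl *f, int nonblock)
	{
		if (f->sent_unacked >= SEND_WINDOW)
			return nonblock ? -EAGAIN : 1;	/* 1: caller suspends */
		f->sent_unacked++;
		return 0;
	}

	/* receiver: returns 1 when an explicit ack message is due */
	static int reader_consumed_one(struct flowctl *f)
	{
		if (++f->read_unacked < ACK_INTERVAL)
			return 0;
		f->read_unacked = 0;
		return 1;
	}

	/* sender: an arriving ack message reopens the window */
	static void ack_received(struct flowctl *f)
	{
		f->sent_unacked -= ACK_INTERVAL;
	}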
> I see where the tipc dispatch function calls
> sk_add_backlog, which checks the per socket receive queue (regardless of whether
> the receiving socket is connection oriented or connectionless), but if the
> receiver doesn't call receive very often, this just adds a check against your
> global limit, doing nothing for your per-socket limits.
OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
our per-socket limit. In fact, TIPC connectionless overflow control currently
is a kind of a hybrid, based on a message counter when the socket is not locked,
and based on sk_rcv_queue's byte limit when a message has to be added to the
backlog.
We are planning to fix this inconsistency too.
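The two paths are visible in the socket dispatch routine, which looks
roughly like this in the tree as of this series (abridged, comments
added for this discussion):

	static u32 dispatch(struct tipc_port *tport, struct sk_buff *buf)
	{
		struct sock *sk = (struct sock *)tport->usr_handle;
		u32 res;

		bh_lock_sock(sk);
		if (!sock_owned_by_user(sk)) {
			/* uncontended: filter_rcv() applies the
			 * message-counter check (OVERLOAD_LIMIT_BASE) */
			res = filter_rcv(sk, buf);
		} else {
			/* owner busy: the backlog is bounded by sk_rcvbuf,
			 * i.e. a byte limit - hence the hybrid */
			if (sk_add_backlog(sk, buf, sk->sk_rcvbuf))
				res = TIPC_ERR_OVERLOAD;
			else
				res = TIPC_OK;
		}
		bh_unlock_sock(sk);
		return res;
	}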
In fact it seems to
> repeat the same check twice, as in the worst case of the incoming message being
> TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
> OVERLOAD_LIMIT_BASE/2 again.
Yes, you are right. The intention is that only the first test,
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
will be run for the vast majority of messages, since we must assume
that there is no overload most of the time.
An inelegant optimization, perhaps, but not logically wrong.
///jon
>
> Neil
>
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-07 22:30 ` Jon Maloy
@ 2012-12-09 16:50 ` Neil Horman
2012-12-10 6:27 ` Ying Xue
0 siblings, 1 reply; 22+ messages in thread
From: Neil Horman @ 2012-12-09 16:50 UTC
To: Jon Maloy; +Cc: Paul Gortmaker, David Miller, netdev, Ying Xue
On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
> On 12/07/2012 02:20 PM, Neil Horman wrote:
> > On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
> >> From: Ying Xue <ying.xue@windriver.com>
> >>
> >> The sk_receive_queue limit control is currently performed for
> >> all arriving messages, disregarding socket and message type.
> >> But for connected sockets this check is redundant, since the protocol
> >> flow control already makes queue overflow impossible.
> >>
> > Can you explain where that occurs?
>
> It happens in the functions port_dispatcher_sigh() and tipc_send(),
> among other places. Both are to be found in the file port.c, which
> was supposed to contain the 'generic' (i.e., API independent) part
> of the send/receive code.
> Now that we have only one API left, the socket API, we are
> planning to merge the code in socket.c and port.c, and get rid of
> some code overhead.
>
> The flow control in TIPC is message based, where the sender
> must receive an explicit acknowledge message for every
> 512 messages the receiver reads to user space.
> If the sender has more than 1024 messages outstanding without having
> received an acknowledge he will be suspended or receive EAGAIN until
> he does.
> The plan going forward is to replace this mechanism with a more
> standard looking byte based flow control, while maintaining
> backwards compatibility.
>
OK, that makes more sense, thank you. Although I still don't think this is
safe (but the problem may not be solely introduced by this patch). Using a
global limit that assumes the sender will block when the congestion window is
reached just doesn't seem sane to me. It clearly works with the Linux
implementation, as it conforms to your expectations, but an alternate
implementation could create a DOS situation by simply ignoring the window limit,
and continuing to send. I see that we drop frames over the global limit in
filter_rcv, but the check in rx_queue_full bumps up that limit based on the
value of msg_importance(msg), but that threshold is ignored if the value of
msg_importance is invalid. All a sender needs to do is flood a receiver with
frames containing an invalid set of message importance bits, and you will queue
frames indefinitely. In fact that will also happen if you send messages of
CRITICAL importance as well, so you don't even need to supply an invalid value
here.
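For reference, the post-series rx_queue_full() the above refers to,
reconstructed from the diffs in this thread with comments added (the
exact per-importance multipliers may differ slightly from the tree):

	static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
	{
		u32 threshold;
		u32 imp = msg_importance(msg);

		if (imp == TIPC_LOW_IMPORTANCE)
			threshold = base;
		else if (imp == TIPC_MEDIUM_IMPORTANCE)
			threshold = base * 2;
		else if (imp == TIPC_HIGH_IMPORTANCE)
			threshold = base * 100;
		else
			return 0;	/* CRITICAL or invalid importance is
					 * treated as "never full" - the hole
					 * described above */

		return queue_size >= threshold;
	}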
>
> > I see where the tipc dispatch function calls
> > sk_add_backlog, which checks the per socket receive queue (regardless of whether
> > the receiving socket is connection oriented or connectionless), but if the
> > receiver doesn't call receive very often, this just adds a check against your
> > global limit, doing nothing for your per-socket limits.
>
> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
> our per-socket limit. In fact, TIPC connectionless overflow control currently
> is a kind of a hybrid, based on a message counter when the socket is not locked,
> and based on sk_rcv_queue's byte limit when a message has to be added to the
> backlog.
> We are planning to fix this inconsistency too.
Good, thank you, that was seeming quite wrong to me.
>
> In fact it seems to
> > repeat the same check twice, as in the worst case of the incoming message being
> > TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
> > OVERLOAD_LIMIT_BASE/2 again.
>
> Yes, you are right. The intention is that only the first test,
>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
> will be run for the vast majority of messages, since we must assume
> that there is no overload most of the time.
> An inelegant optimization, perhaps, but not logically wrong.
No, not logically wrong, but not an optimization either. With this change,
your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
rx_queue_full, and then you do some multiplication based on that. If you really
want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
doubling it like this patch series does), mark rx_queue_full as inline, and just
pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
the conditional branch and a call instruction. If you add a multiplication
factor table, you can eliminate the if/else clauses in rx_queue_full as well.
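Sketched out, the factor-table version might look like this (untested,
with a hypothetical table name, and with an explicit clamp added for
out-of-range importance values per the DOS concern above):

	static const u32 rx_limit_factor[TIPC_CRITICAL_IMPORTANCE + 1] = {
		[TIPC_LOW_IMPORTANCE]      = 1,
		[TIPC_MEDIUM_IMPORTANCE]   = 2,
		[TIPC_HIGH_IMPORTANCE]     = 100,
		[TIPC_CRITICAL_IMPORTANCE] = 100,	/* a real cap, not "never full" */
	};

	static inline int rx_queue_full(struct tipc_msg *msg, u32 queue_size)
	{
		u32 imp = msg_importance(msg);

		if (imp > TIPC_CRITICAL_IMPORTANCE)
			imp = TIPC_CRITICAL_IMPORTANCE;	/* clamp invalid values */
		return queue_size >= OVERLOAD_LIMIT_BASE * rx_limit_factor[imp];
	}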
Neil
>
> ///jon
>
> >
> > Neil
> >
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-09 16:50 ` Neil Horman
@ 2012-12-10 6:27 ` Ying Xue
2012-12-10 8:46 ` Jon Maloy
2012-12-10 14:22 ` Neil Horman
0 siblings, 2 replies; 22+ messages in thread
From: Ying Xue @ 2012-12-10 6:27 UTC
To: Neil Horman; +Cc: Jon Maloy, Paul Gortmaker, David Miller, netdev, Ying Xue
Neil Horman wrote:
> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>
>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>
>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>
>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>
>>>> The sk_receive_queue limit control is currently performed for
>>>> all arriving messages, disregarding socket and message type.
>>>> But for connected sockets this check is redundant, since the protocol
>>>> flow control already makes queue overflow impossible.
>>>>
>>>>
>>> Can you explain where that occurs?
>>>
>> It happens in the functions port_dispatcher_sigh() and tipc_send(),
>> among other places. Both are to be found in the file port.c, which
>> was supposed to contain the 'generic' (i.e., API independent) part
>> of the send/receive code.
>> Now that we have only one API left, the socket API, we are
>> planning to merge the code in socket.c and port.c, and get rid of
>> some code overhead.
>>
>> The flow control in TIPC is message based, where the sender
>> must receive an explicit acknowledge message for every
>> 512 messages the receiver reads to user space.
>> If the sender has more than 1024 messages outstanding without having
>> received an acknowledge he will be suspended or receive EAGAIN until
>> he does.
>> The plan going forward is to replace this mechanism with a more
>> standard looking byte based flow control, while maintaining
>> backwards compatibility.
>>
>>
> OK, that makes more sense, thank you. Although I still don't think this is
> safe (but the problem may not be solely introduced by this patch). Using a
> global limit that assumes the sender will block when the congestion window is
> reached just doesn't seem sane to me. It clearly works with the Linux
> implementation, as it conforms to your expectations, but an alternate
> implementation could create a DOS situation by simply ignoring the window limit,
> and continuing to send. I see that we drop frames over the global limit in
> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
> value of msg_importance(msg), but that threshold is ignored if the value of
> msg_importance is invalid. All a sender needs to do is flood a receiver with
> frames containing an invalid set of message importance bits, and you will queue
> frames indefinitely. In fact that will also happen if you send messages of
> CRITICAL importance as well, so you don't even need to supply an invalid value
> here.
>
>
You are absolutely right. I will correct these drawbacks in next version.
>>> I see where the tipc dispatch function calls
>>> sk_add_backlog, which checks the per socket receive queue (regardless of whether
>>> the receiving socket is connection oriented or connectionless), but if the
>>> receiver doesn't call receive very often, this just adds a check against your
>>> global limit, doing nothing for your per-socket limits.
>>>
>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>> our per-socket limit. In fact, TIPC connectionless overflow control currently
>> is a kind of a hybrid, based on a message counter when the socket is not locked,
>> and based on sk_rcv_queue's byte limit when a message has to be added to the
>> backlog.
>> We are planning to fix this inconsistency too.
>>
> Good, thank you, that was seeming quite wrong to me.
>
>
>> In fact it seems to
>>
>>> repeat the same check twice, as in the worst case of the incoming message being
>>> TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
>>> OVERLOAD_LIMIT_BASE/2 again.
>>>
>> Yes, you are right. The intention is that only the first test,
>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
>> will be run for the vast majority of messages, since we must assume
>> that there is no overload most of the time.
>> An inelegant optimization, perhaps, but not logically wrong.
>>
> No, not logically wrong, but not an optimization either. With this change,
> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
> rx_queue_full, and then you do some multiplication based on that. If you really
> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
> doubling it like this patch series does), mark rx_queue_full as inline, and just
> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
> the conditional branch and a call instruction. If you add a multiplication
> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>
>
Good suggestion with a factor table. Maybe it's unnecessary to
explicitly mark rx_queue_full as inline. Currently it sounds like we let
the compiler decide whether a function is defined as inline or not.
Regards,
Ying
> Neil
>
>
>> ///jon
>>
>>
>>> Neil
>>>
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-10 6:27 ` Ying Xue
@ 2012-12-10 8:46 ` Jon Maloy
2012-12-10 14:22 ` Neil Horman
1 sibling, 0 replies; 22+ messages in thread
From: Jon Maloy @ 2012-12-10 8:46 UTC
To: Ying Xue; +Cc: Neil Horman, Paul Gortmaker, David Miller, netdev
On 12/10/2012 01:27 AM, Ying Xue wrote:
> Neil Horman wrote:
>> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>>
>>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>>
>>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>>
>>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>>
>>>>> The sk_receive_queue limit control is currently performed for
>>>>> all arriving messages, disregarding socket and message type.
>>>>> But for connected sockets this check is redundant, since the protocol
>>>>> flow control already makes queue overflow impossible.
>>>>>
>>>>>
>>>> Can you explain where that occurs?
>>>>
>>> It happens in the functions port_dispatcher_sigh() and tipc_send(),
>>> among other places. Both are to be found in the file port.c, which
>>> was supposed to contain the 'generic' (i.e., API independent) part
>>> of the send/receive code.
>>> Now that we have only one API left, the socket API, we are
>>> planning to merge the code in socket.c and port.c, and get rid of
>>> some code overhead.
>>>
>>> The flow control in TIPC is message based, where the sender
>>> must receive an explicit acknowledge message for every
>>> 512 messages the receiver reads to user space.
>>> If the sender has more than 1024 messages outstanding without having
>>> received an acknowledge he will be suspended or receive EAGAIN until
>>> he does.
>>> The plan going forward is to replace this mechanism with a more
>>> standard looking byte based flow control, while maintaining
>>> backwards compatibility.
>>>
>>>
>> OK, that makes more sense, thank you. Although I still don't think this is
>> safe (but the problem may not be solely introduced by this patch). Using a
>> global limit that assumes the sender will block when the congestion window is
>> reached just doesn't seem sane to me. It clearly works with the Linux
>> implementation, as it conforms to your expectations, but an alternate
>> implementation could create a DOS situation by simply ignoring the window limit,
>> and continuing to send. I see that we drop frames over the global limit in
>> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
>> value of msg_importance(msg), but that threshold is ignored if the value of
>> msg_importance is invalid. All a sender needs to do is flood a receiver with
>> frames containing an invalid set of message importance bits, and you will queue
>> frames indefinitely. In fact that will also happen if you send messages of
>> CRITICAL importance as well, so you don't even need to supply an invalid value
>> here.
>>
>>
>
> You are absolutely right. I will correct these drawbacks in next version.
I think we should rather just drop this patch. We would introduce a major
vulnerability, as Neil correctly points out. We will have to rework this code anyway.
>
>>>> I see where the tipc dispatch function calls
>>>> sk_add_backlog, which checks the per socket receive queue (regardless of whether
>>>> the receiving socket is connection oriented or connectionless), but if the
>>>> receiver doesn't call receive very often, this just adds a check against your
>>>> global limit, doing nothing for your per-socket limits.
>>>>
>>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>>> our per-socket limit. In fact, TIPC connectionless overflow control currently
>>> is a kind of a hybrid, based on a message counter when the socket is not locked,
>>> and based on sk_rcv_queue's byte limit when a message has to be added to the
>>> backlog.
>>> We are planning to fix this inconsistency too.
>>>
>> Good, thank you, that was seeming quite wrong to me.
>>
>>
>>> In fact it seems to
>>>
>>>> repeat the same check twice, as in the worst case of the incoming message being
>>>> TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
>>>> OVERLOAD_LIMIT_BASE/2 again.
>>>>
>>> Yes, you are right. The intention is that only the first test,
>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
>>> will be run for the vast majority of messages, since we must assume
>>> that there is no overload most of the time.
>>> An inelegant optimization, perhaps, but not logically wrong.
>>>
>> No, not logically wrong, but not an optimization either. With this change,
>> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
>> rx_queue_full, and then you do some multiplication based on that.
It is still in the "unlikely" (in fact, very unlikely) branch. And the multiplication
is by two, i.e. just a left-shift operation. Our approach was rather to let the
compiler decide about inlining, which in this case might be a sub-optimization.
If you really
>> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
>> doubling it like this patch series does), mark rx_queue_full as inline, and just
>> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
>> the conditional branch and a call instruction. If you add a multiplication
>> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>>
>>
>
> Good suggestion with a factor table. Maybe it's unnecessary to
> explicitly mark rx_queue_full as inline. Currently it sounds like we let
> the compiler decide whether a function is defined as inline or not.
One approach I had in mind was to just left-shift OVERLOAD_LIMIT_BASE with
message priority, and compare that to the per-socket counter. This way,
we obtain the limit set [10000, 20000, 40000, 80000] without having to read
data memory. The limits will not be the same as now, but probably good
enough. We don't even need a separate function for this check.
Something we should look into when we move on to make this mechanism
byte-based.
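In code, the shift idea is roughly the following (sketch only, with the
same clamp for invalid importance values as discussed above):

	static inline int rx_queue_full(struct tipc_msg *msg, u32 queue_size)
	{
		u32 imp = min_t(u32, msg_importance(msg),
				TIPC_CRITICAL_IMPORTANCE);

		/* base 10000 shifted by importance: 10000/20000/40000/80000 */
		return queue_size >= (OVERLOAD_LIMIT_BASE << imp);
	}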
>
> Regards,
> Ying
>
>> Neil
>>
>>
>>> ///jon
>>>
>>>
>>>> Neil
>>>>
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
2012-12-10 6:27 ` Ying Xue
2012-12-10 8:46 ` Jon Maloy
@ 2012-12-10 14:22 ` Neil Horman
1 sibling, 0 replies; 22+ messages in thread
From: Neil Horman @ 2012-12-10 14:22 UTC
To: Ying Xue; +Cc: Jon Maloy, Paul Gortmaker, David Miller, netdev
On Mon, Dec 10, 2012 at 02:27:50PM +0800, Ying Xue wrote:
> Neil Horman wrote:
> >On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
> >>On 12/07/2012 02:20 PM, Neil Horman wrote:
> >>>On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
> >>>>From: Ying Xue <ying.xue@windriver.com>
> >>>>
> >>>>The sk_receive_queue limit control is currently performed for
> >>>>all arriving messages, disregarding socket and message type.
> >>>>But for connected sockets this check is redundant, since the protocol
> >>>>flow control already makes queue overflow impossible.
> >>>>
> >>>Can you explain where that occurs?
> >>It happens in the functions port_dispatcher_sigh() and
> >>tipc_send(), among other places. Both are to be found in the
> >>file port.c, which was supposed to contain the 'generic' (i.e.,
> >>API independent) part of the send/receive code.
> >>Now that we have only one API left, the socket API, we are
> >>planning to merge the code in socket.c and port.c, and get rid
> >>of some code overhead.
> >>
> >>The flow control in TIPC is message based, where the sender
> >>must receive an explicit acknowledge message for every 512
> >>messages the receiver reads to user space.
> >>If the sender has more than 1024 messages outstanding without having
> >>received an acknowledge he will be suspended or receive EAGAIN
> >>until he does.
> >>The plan going forward is to replace this mechanism with a more
> >>standard looking byte based flow control, while maintaining
> >>backwards compatibility.
> >>
> >OK, that makes more sense, thank you. Although I still don't think this is
> >safe (but the problem may not be solely introduced by this patch). Using a
> >global limit that assumes the sender will block when the congestion window is
> >reached just doesn't seem sane to me. It clearly works with the Linux
> >implementation, as it conforms to your expectations, but an alternate
> >implementation could create a DOS situation by simply ignoring the window limit,
> >and continuing to send. I see that we drop frames over the global limit in
> >filter_rcv, but the check in rx_queue_full bumps up that limit based on the
> >value of msg_importance(msg), but that threshold is ignored if the value of
> >msg_importance is invalid. All a sender needs to do is flood a receiver with
> >frames containing an invalid set of message importance bits, and you will queue
> >frames indefinitely. In fact that will also happen if you send messages of
> >CRITICAL importance as well, so you don't even need to supply an invalid value
> >here.
> >
>
> You are absolutely right. I will correct these drawbacks in next version.
>
> >>>I see where the tipc dispatch function calls
> >>>sk_add_backlog, which checks the per socket receive queue (regardless of whether
> >>>the receiving socket is connection oriented or connectionless), but if the
> >>>receiver doesn't call receive very often, this just adds a check against your
> >>>global limit, doing nothing for your per-socket limits.
> >>OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
> >>our per-socket limit. In fact, TIPC connectionless overflow
> >>control currently is a kind of a hybrid, based on a message
> >>counter when the socket is not locked, and based on
> >>sk_rcv_queue's byte limit when a message has to be added to the
> >>backlog.
> >>We are planning to fix this inconsistency too.
> >Good, thank you, that was seeming quite wrong to me.
> >
> >> In fact it seems to
> >>>repeat the same check twice, as in the worst case of the incoming message being
> >>>TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
> >>>OVERLOAD_LIMIT_BASE/2 again.
> >>Yes, you are right. The intention is that only the first test,
> >>if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
> >>will be run for the vast majority of messages, since we must assume
> >>that there is no overload most of the time.
> >>An inelegant optimization, perhaps, but not logically wrong.
> >No, not logically wrong, but not an optimization either. With this change,
> >your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
> >rx_queue_full, and then you do some multiplication based on that. If you really
> >want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
> >doubling it like this patch series does), mark rx_queue_full as inline, and just
> >pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
> >the conditional branch and a call instruction. If you add a multiplication
> >factor table, you can eliminate the if/else clauses in rx_queue_full as well.
> >
>
> Good suggestion with a factor table. Maybe it's unnecessary to
> explicitly mark rx_queue_full as inline. Currently it sounds like we
> let the compiler decide whether a function is defined as inline or not.
>
That's correct, the compiler usually decides if something should be inlined
(unless you use the __always_inline attribute). In this case, given a single
call site, it will most likely just inline it anyway. But if you're interested in
optimizing here, it might be worth taking the extra steps to make sure. In
fact, since this is your only call site, it may be worthwhile to just remove the
function entirely and manually inline the check.
Neil
> Regards,
> Ying
>
> >Neil
> >
> >>///jon
> >>
> >>>Neil
> >>>
* [PATCH net-next 04/10] tipc: change sk_receive_queue upper limit
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (2 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 05/10] tipc: standardize across connect/disconnect function naming Paul Gortmaker
` (5 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
From: Jon Maloy <jon.maloy@ericsson.com>
The sk_recv_queue upper limit for connectionless sockets has empirically
turned out to be too low. When we double the current limit we see far
fewer rejected messages and no noticeable negative side effects.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d6a448..65a6bfc 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
/*
* net/tipc/socket.c: TIPC socket API
*
- * Copyright (c) 2001-2007, Ericsson AB
+ * Copyright (c) 2001-2007, 2012 Ericsson AB
* Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
@@ -43,7 +43,7 @@
#define SS_LISTENING -1 /* socket is listening */
#define SS_READY -2 /* socket is connectionless */
-#define OVERLOAD_LIMIT_BASE 5000
+#define OVERLOAD_LIMIT_BASE 10000
#define CONN_TIMEOUT_DEFAULT 8000 /* default connect timeout = 8s */
struct tipc_sock {
--
1.7.12.1
* [PATCH net-next 05/10] tipc: standardize across connect/disconnect function naming
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (3 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 04/10] tipc: change sk_receive_queue upper limit Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 06/10] tipc: consolidate connection-oriented message reception in one function Paul Gortmaker
` (4 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
Currently we have tipc_disconnect and tipc_disconnect_port. It is
not clear from the names alone, what they do or how they differ.
It turns out that tipc_disconnect just deals with the port locking
and then calls tipc_disconnect_port which does all the work.
If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
then we will be following typical Linux convention, where:
__tipc_disconnect: "raw" function that does all the work.
tipc_disconnect: wrapper that deals with locking and then calls
the real core __tipc_disconnect function
With this, the difference is immediately evident, and locking
violations are more apt to be spotted by chance while working on,
or even just while reading the code.
On the connect side of things, we currently only have the single
"tipc_connect2port" function. It does both the locking at enter/exit,
and the core of the work. Pending changes will make it desirable to
have the connect be a two part locking wrapper + worker function,
just like the disconnect is already.
Here, we make the connect look just like the updated disconnect case,
for the above reason, and for consistency. In the process, we also
get rid of the "2port" suffix that was on the original name, since
it adds no descriptive value.
On close examination, one might notice that the above connect
changes implicitly move the call to tipc_link_get_max_pkt() to be
within the scope of the tipc_port_lock() protected region, when it was
not previously. We don't see any issues with this, and it is in
keeping with __tipc_connect doing the work and tipc_connect just
handling the locking.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/port.c | 32 +++++++++++++++++++++++---------
net/tipc/port.h | 6 ++++--
net/tipc/socket.c | 6 +++---
net/tipc/subscr.c | 2 +-
4 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/net/tipc/port.c b/net/tipc/port.c
index 07c42fb..18098ca 100644
--- a/net/tipc/port.c
+++ b/net/tipc/port.c
@@ -726,7 +726,7 @@ static void port_dispatcher_sigh(void *dummy)
if (unlikely(!cb))
goto reject;
if (unlikely(!connected)) {
- if (tipc_connect2port(dref, &orig))
+ if (tipc_connect(dref, &orig))
goto reject;
} else if (peer_invalid)
goto reject;
@@ -1036,15 +1036,30 @@ int tipc_withdraw(u32 ref, unsigned int scope, struct tipc_name_seq const *seq)
return res;
}
-int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
+int tipc_connect(u32 ref, struct tipc_portid const *peer)
{
struct tipc_port *p_ptr;
- struct tipc_msg *msg;
- int res = -EINVAL;
+ int res;
p_ptr = tipc_port_lock(ref);
if (!p_ptr)
return -EINVAL;
+ res = __tipc_connect(ref, p_ptr, peer);
+ tipc_port_unlock(p_ptr);
+ return res;
+}
+
+/*
+ * __tipc_connect - connect to a remote peer
+ *
+ * Port must be locked.
+ */
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+ struct tipc_portid const *peer)
+{
+ struct tipc_msg *msg;
+ int res = -EINVAL;
+
if (p_ptr->published || p_ptr->connected)
goto exit;
if (!peer->ref)
@@ -1067,17 +1082,16 @@ int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
(net_ev_handler)port_handle_node_down);
res = 0;
exit:
- tipc_port_unlock(p_ptr);
p_ptr->max_pkt = tipc_link_get_max_pkt(peer->node, ref);
return res;
}
-/**
- * tipc_disconnect_port - disconnect port from peer
+/*
+ * __tipc_disconnect - disconnect port from peer
*
* Port must be locked.
*/
-int tipc_disconnect_port(struct tipc_port *tp_ptr)
+int __tipc_disconnect(struct tipc_port *tp_ptr)
{
int res;
@@ -1104,7 +1118,7 @@ int tipc_disconnect(u32 ref)
p_ptr = tipc_port_lock(ref);
if (!p_ptr)
return -EINVAL;
- res = tipc_disconnect_port(p_ptr);
+ res = __tipc_disconnect(p_ptr);
tipc_port_unlock(p_ptr);
return res;
}
diff --git a/net/tipc/port.h b/net/tipc/port.h
index 4660e30..fb66e2e 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -190,7 +190,7 @@ int tipc_publish(u32 portref, unsigned int scope,
int tipc_withdraw(u32 portref, unsigned int scope,
struct tipc_name_seq const *name_seq);
-int tipc_connect2port(u32 portref, struct tipc_portid const *port);
+int tipc_connect(u32 portref, struct tipc_portid const *port);
int tipc_disconnect(u32 portref);
@@ -200,7 +200,9 @@ int tipc_shutdown(u32 ref);
/*
* The following routines require that the port be locked on entry
*/
-int tipc_disconnect_port(struct tipc_port *tp_ptr);
+int __tipc_disconnect(struct tipc_port *tp_ptr);
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+ struct tipc_portid const *peer);
int tipc_port_peer_msg(struct tipc_port *p_ptr, struct tipc_msg *msg);
/*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 65a6bfc..4d56eae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -791,7 +791,7 @@ static int auto_connect(struct socket *sock, struct tipc_msg *msg)
tsock->peer_name.ref = msg_origport(msg);
tsock->peer_name.node = msg_orignode(msg);
- tipc_connect2port(tsock->p->ref, &tsock->peer_name);
+ tipc_connect(tsock->p->ref, &tsock->peer_name);
tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
sock->state = SS_CONNECTED;
return 0;
@@ -1254,7 +1254,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Initiate connection termination for an incoming 'FIN' */
if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
sock->state = SS_DISCONNECTING;
- tipc_disconnect_port(tipc_sk_port(sk));
+ __tipc_disconnect(tipc_sk_port(sk));
}
sk->sk_data_ready(sk, 0);
@@ -1514,7 +1514,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
/* Connect new socket to it's peer */
new_tsock->peer_name.ref = msg_origport(msg);
new_tsock->peer_name.node = msg_orignode(msg);
- tipc_connect2port(new_ref, &new_tsock->peer_name);
+ tipc_connect(new_ref, &new_tsock->peer_name);
new_sock->state = SS_CONNECTED;
tipc_set_portimportance(new_ref, msg_importance(msg));
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 0f7d0d0..6b42d47 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -462,7 +462,7 @@ static void subscr_named_msg_event(void *usr_handle,
kfree(subscriber);
return;
}
- tipc_connect2port(subscriber->port_ref, orig);
+ tipc_connect(subscriber->port_ref, orig);
/* Lock server port (& save lock address for future use) */
subscriber->lock = tipc_port_lock(subscriber->port_ref)->lock;
--
1.7.12.1
* [PATCH net-next 06/10] tipc: consolidate connection-oriented message reception in one function
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (4 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 05/10] tipc: standardize across connect/disconnect function naming Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 07/10] tipc: introduce non-blocking socket connect Paul Gortmaker
` (3 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
Handling of connection-related message reception is currently scattered
around at different places in the code. This makes it harder to verify
that things are handled correctly in all possible scenarios.
So we consolidate the existing processing of connection-oriented
message reception in a single routine. In the process, we convert the
chain of if/else into a switch/case for improved readability.
A cast on the socket_state in the switch is needed to avoid compile
warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
‘4294967295’ not in enumerated type". This happens because the existing
tipc code pseudo-extends the default Linux socket state values with:
#define SS_LISTENING -1 /* socket is listening */
#define SS_READY -2 /* socket is connectionless */
It may make sense to someday add these as _positive_ values to the
existing socket state enum list, rather than keeping these defines.
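The root cause can be reproduced with a standalone sketch like the
one below (illustrative only; the enum approximates the socket_state
definition in include/linux/net.h):

	typedef enum {
		SS_FREE = 0, SS_UNCONNECTED, SS_CONNECTING,
		SS_CONNECTED, SS_DISCONNECTING
	} socket_state;

	#define SS_LISTENING -1	/* socket is listening */

	void f(socket_state s)
	{
		switch (s) {	/* s may have an unsigned underlying type */
		case SS_LISTENING:	/* -1 wraps to 4294967295: warning */
			break;
		default:
			break;
		}
	}

Casting the controlling expression to (int) takes the enum type out
of the switch, so the negative case labels no longer trip the warning.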
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: add cast to fix warning; remove returns from middle of switch]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 75 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 51 insertions(+), 24 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d56eae..fc0aa4f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1192,6 +1192,53 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
}
/**
+ * filter_connect - Handle all incoming messages for a connection-based socket
+ * @tsock: TIPC socket
+ * @msg: message
+ *
+ * Returns TIPC error status code and socket error status code
+ * once it encounters some errors
+ */
+static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
+{
+ struct socket *sock = tsock->sk.sk_socket;
+ struct tipc_msg *msg = buf_msg(*buf);
+ u32 retval = TIPC_ERR_NO_PORT;
+
+ if (msg_mcast(msg))
+ return retval;
+
+ switch ((int)sock->state) {
+ case SS_CONNECTED:
+ /* Accept only connection-based messages sent by peer */
+ if (msg_connected(msg) && tipc_port_peer_msg(tsock->p, msg)) {
+ if (unlikely(msg_errcode(msg))) {
+ sock->state = SS_DISCONNECTING;
+ __tipc_disconnect(tsock->p);
+ }
+ retval = TIPC_OK;
+ }
+ break;
+ case SS_CONNECTING:
+ /* Accept only ACK or NACK message */
+ if (msg_connected(msg) || (msg_errcode(msg)))
+ retval = TIPC_OK;
+ break;
+ case SS_LISTENING:
+ case SS_UNCONNECTED:
+ /* Accept only SYN message */
+ if (!msg_connected(msg) && !(msg_errcode(msg)))
+ retval = TIPC_OK;
+ break;
+ case SS_DISCONNECTING:
+ break;
+ default:
+ pr_err("Unknown socket state %u\n", sock->state);
+ }
+ return retval;
+}
+
+/**
* filter_rcv - validate incoming message
* @sk: socket
* @buf: message
@@ -1208,6 +1255,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
struct socket *sock = sk->sk_socket;
struct tipc_msg *msg = buf_msg(buf);
u32 recv_q_len;
+ u32 res = TIPC_OK;
/* Reject message if it is wrong sort of message for socket */
if (msg_type(msg) > TIPC_DIRECT_MSG)
@@ -1226,24 +1274,9 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
return TIPC_ERR_OVERLOAD;
}
} else {
- if (msg_mcast(msg))
- return TIPC_ERR_NO_PORT;
- if (sock->state == SS_CONNECTED) {
- if (!msg_connected(msg) ||
- !tipc_port_peer_msg(tipc_sk_port(sk), msg))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_CONNECTING) {
- if (!msg_connected(msg) && (msg_errcode(msg) == 0))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_LISTENING) {
- if (msg_connected(msg) || msg_errcode(msg))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_DISCONNECTING) {
- return TIPC_ERR_NO_PORT;
- } else /* (sock->state == SS_UNCONNECTED) */ {
- if (msg_connected(msg) || msg_errcode(msg))
- return TIPC_ERR_NO_PORT;
- }
+ res = filter_connect(tipc_sk(sk), &buf);
+ if (res != TIPC_OK || buf == NULL)
+ return res;
}
/* Enqueue message (finally!) */
@@ -1251,12 +1284,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
atomic_inc(&tipc_queue_size);
__skb_queue_tail(&sk->sk_receive_queue, buf);
- /* Initiate connection termination for an incoming 'FIN' */
- if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
- sock->state = SS_DISCONNECTING;
- __tipc_disconnect(tipc_sk_port(sk));
- }
-
sk->sk_data_ready(sk, 0);
return TIPC_OK;
}
--
1.7.12.1
* [PATCH net-next 07/10] tipc: introduce non-blocking socket connect
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (5 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 06/10] tipc: consolidate connection-oriented message reception in one function Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 08/10] tipc: eliminate connection setup for implied connect in recv_msg() Paul Gortmaker
` (2 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, they will get either
"-EALREADY" or "-EISCONN" as the return code, depending on whether
the connection has been established yet.
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on the method used), so
unless they had already set this for some reason (the socket would
have remained blocking in current TIPC anyway), this change should
be completely backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
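For illustration, a user-space caller might exercise the new
behaviour roughly as follows (a sketch only; TIPC address setup and
error handling are elided):

	#include <sys/socket.h>
	#include <linux/tipc.h>
	#include <poll.h>
	#include <errno.h>

	int main(void)
	{
		int fd = socket(AF_TIPC, SOCK_STREAM | SOCK_NONBLOCK, 0);
		struct sockaddr_tipc addr = { .family = AF_TIPC };
		int err = 0;
		socklen_t len = sizeof(err);

		/* ... fill in addr with the peer's port name or id ... */

		if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 &&
		    errno == EINPROGRESS) {
			struct pollfd pfd = { .fd = fd, .events = POLLOUT };

			poll(&pfd, 1, -1);	/* wait for the ACK or RST */
			/* SO_ERROR is 0 on success, else the connect error */
			getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
		}
		return err;
	}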
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 158 ++++++++++++++++++++++++++++++++----------------------
1 file changed, 93 insertions(+), 65 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index fc0aa4f..1ba3b6f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -783,16 +783,19 @@ exit:
static int auto_connect(struct socket *sock, struct tipc_msg *msg)
{
struct tipc_sock *tsock = tipc_sk(sock->sk);
-
- if (msg_errcode(msg)) {
- sock->state = SS_DISCONNECTING;
- return -ECONNREFUSED;
- }
+ struct tipc_port *p_ptr;
tsock->peer_name.ref = msg_origport(msg);
tsock->peer_name.node = msg_orignode(msg);
- tipc_connect(tsock->p->ref, &tsock->peer_name);
- tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
+ p_ptr = tipc_port_deref(tsock->p->ref);
+ if (!p_ptr)
+ return -EINVAL;
+
+ __tipc_connect(tsock->p->ref, p_ptr, &tsock->peer_name);
+
+ if (msg_importance(msg) > TIPC_CRITICAL_IMPORTANCE)
+ return -EINVAL;
+ msg_set_importance(&p_ptr->phdr, (u32)msg_importance(msg));
sock->state = SS_CONNECTED;
return 0;
}
@@ -1203,7 +1206,9 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
{
struct socket *sock = tsock->sk.sk_socket;
struct tipc_msg *msg = buf_msg(*buf);
+ struct sock *sk = &tsock->sk;
u32 retval = TIPC_ERR_NO_PORT;
+ int res;
if (msg_mcast(msg))
return retval;
@@ -1221,8 +1226,36 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
break;
case SS_CONNECTING:
/* Accept only ACK or NACK message */
- if (msg_connected(msg) || (msg_errcode(msg)))
+ if (unlikely(msg_errcode(msg))) {
+ sock->state = SS_DISCONNECTING;
+ sk->sk_err = -ECONNREFUSED;
+ retval = TIPC_OK;
+ break;
+ }
+
+ if (unlikely(!msg_connected(msg)))
+ break;
+
+ res = auto_connect(sock, msg);
+ if (res) {
+ sock->state = SS_DISCONNECTING;
+ sk->sk_err = res;
retval = TIPC_OK;
+ break;
+ }
+
+ /* If an incoming message is an 'ACK-', it should be
+ * discarded here because it doesn't contain useful
+ * data. In addition, we should try to wake up
+ * connect() routine if sleeping.
+ */
+ if (msg_data_sz(msg) == 0) {
+ kfree_skb(*buf);
+ *buf = NULL;
+ if (waitqueue_active(sk_sleep(sk)))
+ wake_up_interruptible(sk_sleep(sk));
+ }
+ retval = TIPC_OK;
break;
case SS_LISTENING:
case SS_UNCONNECTED:
@@ -1369,8 +1402,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
struct sock *sk = sock->sk;
struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
struct msghdr m = {NULL,};
- struct sk_buff *buf;
- struct tipc_msg *msg;
unsigned int timeout;
int res;
@@ -1382,26 +1413,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
goto exit;
}
- /* For now, TIPC does not support the non-blocking form of connect() */
- if (flags & O_NONBLOCK) {
- res = -EOPNOTSUPP;
- goto exit;
- }
-
- /* Issue Posix-compliant error code if socket is in the wrong state */
- if (sock->state == SS_LISTENING) {
- res = -EOPNOTSUPP;
- goto exit;
- }
- if (sock->state == SS_CONNECTING) {
- res = -EALREADY;
- goto exit;
- }
- if (sock->state != SS_UNCONNECTED) {
- res = -EISCONN;
- goto exit;
- }
-
/*
* Reject connection attempt using multicast address
*
@@ -1413,49 +1424,66 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
goto exit;
}
- /* Reject any messages already in receive queue (very unlikely) */
- reject_rx_queue(sk);
+ timeout = (flags & O_NONBLOCK) ? 0 : tipc_sk(sk)->conn_timeout;
+
+ switch (sock->state) {
+ case SS_UNCONNECTED:
+ /* Send a 'SYN-' to destination */
+ m.msg_name = dest;
+ m.msg_namelen = destlen;
+
+ /* If connect is in non-blocking case, set MSG_DONTWAIT to
+ * indicate send_msg() is never blocked.
+ */
+ if (!timeout)
+ m.msg_flags = MSG_DONTWAIT;
+
+ res = send_msg(NULL, sock, &m, 0);
+ if ((res < 0) && (res != -EWOULDBLOCK))
+ goto exit;
- /* Send a 'SYN-' to destination */
- m.msg_name = dest;
- m.msg_namelen = destlen;
- res = send_msg(NULL, sock, &m, 0);
- if (res < 0)
+ /* Just entered SS_CONNECTING state; the only
+ * difference is that return value in non-blocking
+ * case is EINPROGRESS, rather than EALREADY.
+ */
+ res = -EINPROGRESS;
+ break;
+ case SS_CONNECTING:
+ res = -EALREADY;
+ break;
+ case SS_CONNECTED:
+ res = -EISCONN;
+ break;
+ default:
+ res = -EINVAL;
goto exit;
+ }
- /* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
- timeout = tipc_sk(sk)->conn_timeout;
- release_sock(sk);
- res = wait_event_interruptible_timeout(*sk_sleep(sk),
- (!skb_queue_empty(&sk->sk_receive_queue) ||
- (sock->state != SS_CONNECTING)),
- timeout ? (long)msecs_to_jiffies(timeout)
- : MAX_SCHEDULE_TIMEOUT);
- lock_sock(sk);
+ if (sock->state == SS_CONNECTING) {
+ if (!timeout)
+ goto exit;
- if (res > 0) {
- buf = skb_peek(&sk->sk_receive_queue);
- if (buf != NULL) {
- msg = buf_msg(buf);
- res = auto_connect(sock, msg);
- if (!res) {
- if (!msg_data_sz(msg))
- advance_rx_queue(sk);
- }
- } else {
- if (sock->state == SS_CONNECTED)
- res = -EISCONN;
+ /* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+ release_sock(sk);
+ res = wait_event_interruptible_timeout(*sk_sleep(sk),
+ sock->state != SS_CONNECTING,
+ timeout ? (long)msecs_to_jiffies(timeout)
+ : MAX_SCHEDULE_TIMEOUT);
+ lock_sock(sk);
+ if (res <= 0) {
+ if (res == 0)
+ res = -ETIMEDOUT;
else
- res = -ECONNREFUSED;
+ ; /* leave "res" unchanged */
+ goto exit;
}
- } else {
- if (res == 0)
- res = -ETIMEDOUT;
- else
- ; /* leave "res" unchanged */
- sock->state = SS_DISCONNECTING;
}
+ if (unlikely(sock->state == SS_DISCONNECTING))
+ res = sock_error(sk);
+ else
+ res = 0;
+
exit:
release_sock(sk);
return res;
--
1.7.12.1
* [PATCH net-next 08/10] tipc: eliminate connection setup for implied connect in recv_msg()
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (6 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 07/10] tipc: introduce non-blocking socket connect Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 09/10] tipc: add lock nesting notation to quiet lockdep warning Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 10/10] tipc: refactor accept() code for improved readability Paul Gortmaker
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
As connection setup is now completed asynchronously in BH context,
in the function filter_connect(), the corresponding code in recv_msg()
becomes redundant.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1ba3b6f..0df42fa 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -954,13 +954,6 @@ restart:
sz = msg_data_sz(msg);
err = msg_errcode(msg);
- /* Complete connection setup for an implied connect */
- if (unlikely(sock->state == SS_CONNECTING)) {
- res = auto_connect(sock, msg);
- if (res)
- goto exit;
- }
-
/* Discard an empty non-errored message & try again */
if ((!sz) && (!err)) {
advance_rx_queue(sk);
--
1.7.12.1
* [PATCH net-next 09/10] tipc: add lock nesting notation to quiet lockdep warning
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (7 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 08/10] tipc: eliminate connection setup for implied connect in recv_msg() Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 14:28 ` [PATCH net-next 10/10] tipc: refactor accept() code for improved readability Paul Gortmaker
9 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
From: Ying Xue <ying.xue@windriver.com>
TIPC accept() call grabs the socket lock on a newly allocated
socket while holding the socket lock on an old socket. But lockdep
worries that this might be a recursive lock attempt:
[ INFO: possible recursive locking detected ]
---------------------------------------------
kworker/u:0/6 is trying to acquire lock:
(sk_lock-AF_TIPC){+.+.+.}, at: [<c8c1226c>] accept+0x15c/0x310 [tipc]
but task is already holding lock:
(sk_lock-AF_TIPC){+.+.+.}, at: [<c8c12138>] accept+0x28/0x310 [tipc]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_TIPC);
lock(sk_lock-AF_TIPC);
*** DEADLOCK ***
May be due to missing lock nesting notation
[...]
Tell lockdep that this locking is safe by using lock_sock_nested().
This is similar to what was done in commit 5131a184a3458d9 for
SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").
Also note that this isn't something that is seen normally, as
it was uncovered with some experimental work-in-progress code
not yet ready for mainline. So there is no need for stable
backports or similar of this commit.
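In sketch form, the nesting that accept() performs is (names as in
the current code):

	lock_sock(sk);		/* the listening socket */
	...
	/* new_sk is a distinct socket, but it shares the
	 * sk_lock-AF_TIPC lockdep class with sk, so a plain
	 * lock_sock() here looks recursive to lockdep
	 */
	lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
	...
	release_sock(new_sk);
	...
	release_sock(sk);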
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 0df42fa..38613cf 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1551,7 +1551,8 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
u32 new_ref = new_tport->ref;
struct tipc_msg *msg = buf_msg(buf);
- lock_sock(new_sk);
+ /* we lock on new_sk; but lockdep sees the lock on sk */
+ lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
/*
* Reject any stray messages received by new socket
--
1.7.12.1
* [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
2012-12-07 14:28 [PATCH net-next 00/10] tipc: more updates for the v3.8 content Paul Gortmaker
` (8 preceding siblings ...)
2012-12-07 14:28 ` [PATCH net-next 09/10] tipc: add lock nesting notation to quiet lockdep warning Paul Gortmaker
@ 2012-12-07 14:28 ` Paul Gortmaker
2012-12-07 19:42 ` Neil Horman
9 siblings, 1 reply; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In TIPC's accept() routine, there is a large block of code relating
to initialization of a new socket, all within an if condition checking
if the allocation succeeded.
Here, we simply factor out that init code within the accept() function
to its own separate function, which improves readability, and makes
it easier to track which lock_sock() calls are operating on existing
vs. new sockets.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
1 file changed, 49 insertions(+), 44 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 38613cf..56661c8 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
return res;
}
+static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
+ struct sk_buff *buf)
+{
+
+ struct sock *new_sk = new_sock->sk;
+ struct tipc_sock *new_tsock = tipc_sk(new_sk);
+ struct tipc_port *new_tport = new_tsock->p;
+ u32 new_ref = new_tport->ref;
+ struct tipc_msg *msg = buf_msg(buf);
+
+ /* we lock on new_sk; but lockdep sees accept's lock on sk */
+ lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
+
+ /*
+ * Reject any stray messages received by new socket
+ * before the socket lock was taken (very, very unlikely)
+ */
+ reject_rx_queue(new_sk);
+
+ /* Connect new socket to it's peer */
+ new_tsock->peer_name.ref = msg_origport(msg);
+ new_tsock->peer_name.node = msg_orignode(msg);
+ tipc_connect(new_ref, &new_tsock->peer_name);
+ new_sock->state = SS_CONNECTED;
+
+ tipc_set_portimportance(new_ref, msg_importance(msg));
+ if (msg_named(msg)) {
+ new_tport->conn_type = msg_nametype(msg);
+ new_tport->conn_instance = msg_nameinst(msg);
+ }
+
+ /*
+ * Respond to 'SYN-' by discarding it & returning 'ACK'-.
+ * Respond to 'SYN+' by queuing it on new socket.
+ */
+ if (!msg_data_sz(msg)) {
+ struct msghdr m = {NULL,};
+
+ advance_rx_queue(sk);
+ send_packet(NULL, new_sock, &m, 0);
+ } else {
+ __skb_dequeue(&sk->sk_receive_queue);
+ __skb_queue_head(&new_sk->sk_receive_queue, buf);
+ }
+ release_sock(new_sk);
+}
+
/**
* accept - wait for connection request
* @sock: listening socket
@@ -1542,51 +1589,9 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
}
buf = skb_peek(&sk->sk_receive_queue);
-
res = tipc_create(sock_net(sock->sk), new_sock, 0, 0);
- if (!res) {
- struct sock *new_sk = new_sock->sk;
- struct tipc_sock *new_tsock = tipc_sk(new_sk);
- struct tipc_port *new_tport = new_tsock->p;
- u32 new_ref = new_tport->ref;
- struct tipc_msg *msg = buf_msg(buf);
-
- /* we lock on new_sk; but lockdep sees the lock on sk */
- lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
-
- /*
- * Reject any stray messages received by new socket
- * before the socket lock was taken (very, very unlikely)
- */
- reject_rx_queue(new_sk);
-
- /* Connect new socket to it's peer */
- new_tsock->peer_name.ref = msg_origport(msg);
- new_tsock->peer_name.node = msg_orignode(msg);
- tipc_connect(new_ref, &new_tsock->peer_name);
- new_sock->state = SS_CONNECTED;
-
- tipc_set_portimportance(new_ref, msg_importance(msg));
- if (msg_named(msg)) {
- new_tport->conn_type = msg_nametype(msg);
- new_tport->conn_instance = msg_nameinst(msg);
- }
-
- /*
- * Respond to 'SYN-' by discarding it & returning 'ACK'-.
- * Respond to 'SYN+' by queuing it on new socket.
- */
- if (!msg_data_sz(msg)) {
- struct msghdr m = {NULL,};
-
- advance_rx_queue(sk);
- send_packet(NULL, new_sock, &m, 0);
- } else {
- __skb_dequeue(&sk->sk_receive_queue);
- __skb_queue_head(&new_sk->sk_receive_queue, buf);
- }
- release_sock(new_sk);
- }
+ if (!res)
+ tipc_init_socket(sk, new_sock, buf);
exit:
release_sock(sk);
return res;
--
1.7.12.1
* Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
2012-12-07 14:28 ` [PATCH net-next 10/10] tipc: refactor accept() code for improved readability Paul Gortmaker
@ 2012-12-07 19:42 ` Neil Horman
2012-12-07 19:56 ` Paul Gortmaker
0 siblings, 1 reply; 22+ messages in thread
From: Neil Horman @ 2012-12-07 19:42 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy
On Fri, Dec 07, 2012 at 09:28:18AM -0500, Paul Gortmaker wrote:
> In TIPC's accept() routine, there is a large block of code relating
> to initialization of a new socket, all within an if condition checking
> if the allocation succeeded.
>
> Here, we simply factor out that init code within the accept() function
> to its own separate function, which improves readability, and makes
> it easier to track which lock_sock() calls are operating on existing
> vs. new sockets.
>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
> 1 file changed, 49 insertions(+), 44 deletions(-)
>
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 38613cf..56661c8 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
> return res;
> }
>
> +static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
> + struct sk_buff *buf)
> +{
Can you rename this to something more specific to its purpose? tipc_init_socket
makes me wonder why you're not calling it internally from tipc_create. This
seems more like a tipc_init_accept_sock, or some such. Alternatively, since
you're only using it from your accept call, you might consider not factoring it
out at all, and just reversing the logic in your accept function so that you do:
res = tipc_create(...)
if (res)
goto exit;
<rest of tipc_init_socket goes here>
That gives you the same level of readability, without the additional potential
call instruction, plus you put the unlikely case inside the if conditional.
Neil
* Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
2012-12-07 19:42 ` Neil Horman
@ 2012-12-07 19:56 ` Paul Gortmaker
0 siblings, 0 replies; 22+ messages in thread
From: Paul Gortmaker @ 2012-12-07 19:56 UTC (permalink / raw)
To: Neil Horman; +Cc: David Miller, netdev, Jon Maloy
[Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability] On 07/12/2012 (Fri 14:42) Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:18AM -0500, Paul Gortmaker wrote:
> > In TIPC's accept() routine, there is a large block of code relating
> > to initialization of a new socket, all within an if condition checking
> > if the allocation succeeded.
> >
> > Here, we simply factor out that init code within the accept() function
> > to its own separate function, which improves readability, and makes
> > it easier to track which lock_sock() calls are operating on existing
> > vs. new sockets.
> >
> > Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> > ---
> > net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
> > 1 file changed, 49 insertions(+), 44 deletions(-)
> >
> > diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> > index 38613cf..56661c8 100644
> > --- a/net/tipc/socket.c
> > +++ b/net/tipc/socket.c
> > @@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
> > return res;
> > }
> >
> > +static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
> > + struct sk_buff *buf)
> > +{
> Can you rename this to something more specific to its purpose? tipc_init_socket
> makes me wonder why you're not calling it internally from tipc_create. This
> seems more like a tipc_init_accept_sock, or some such. Alternatively, since
> you're ony using it from your accept call, you might consider not factoring it
> out at all, and just reversing the logic in your accept function so that you do:
>
> res = tipc_create(...)
> if (res)
> goto exit;
> <rest of tipc_init_socket goes here>
>
> That gives you the same level of readability, without the additional potential
> call instruction, plus you put the unlikely case inside the if conditional.
I'm inclined to do the latter and just flip the sense of the "if" since
I already scratched my head trying to think of a good name (and
apparently failed in the end).
Thanks,
Paul.
>
> Neil
>