linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Bluemle <andreas.bluemle@itxperts.de>
To: "Hefty, Sean" <sean.hefty@intel.com>
Cc: Matthew Anderson <manderson8787@gmail.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	"linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)"
	<linux-rdma@vger.kernel.org>
Subject: Re: [ceph-users] Help needed porting Ceph to RSockets
Date: Tue, 10 Sep 2013 15:48:47 +0200	[thread overview]
Message-ID: <20130910154847.3fb7dd02@doppio> (raw)
In-Reply-To: <1828884A29C6694DAF28B7E6B8A8237388CA7C88@ORSMSX109.amr.corp.intel.com>

[-- Attachment #1: Type: text/plain, Size: 6755 bytes --]

Hi,

after some more analysis and debugging, I found
workarounds for my problems; I have added these workarounds
to the last version of the patch for the poll problem by Sean;
see the attachment to this posting.

The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

   With rsockets, when a shutdown is done on side A of a socket
   connection, then the shutdown will only return after side B
   has done a close() on its end of the connection.

   This is different from TCP/IP sockets: there a shutdown will cause
   the other end to terminate the connection at the TCP level
   instantly. The socket changes state into CLOSE_WAIT, which indicates
   that the application level close is outstanding.

   In the attached patch, the workaround is in rs_poll_cq(),
   case RS_OP_CTRL, where for a RS_CTRL_DISCONNECT the rshutdown()
   is called on side B; this will cause the termination of the
   socket connection to acknowledged to side A and the shutdown()
   there can now terminate.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

   When an application does a shutdown() of side A and does a 2nd
   shutdown() shortly after (for whatever reason) then the
   return of the 2nd shutdown() is delayed by 2 seconds.

   The delay happens in rdma_disconnect(), when this is called
   from rshutdown() in the case that the rsocket state is
   rs_disconnected.

   Even if it could be considered as a bug if an application
   calls shutdown() twice on the same socket, it still
   does not make sense to delay that 2nd call to shutdown().

   To workaround this, I have
   - introduced an additional rsocket state: rs_shutdown
   - switch to that new state in rshutdown() at the very end
     of the function.

   The first call to shutdown() will therefore switch to the new
   rsocket state rs_shutdown - and any further call to rshutdown()
   will not do anything any more, because every effect of rshutdown()
   will only happen if the rsocket state is either rs_connnected or
   rs_disconnected. Hence it would be better to explicitely check
   the rsocket state at the beginning of the function and return
   immediately if the state is rs_shutdown.

Since I have added these workarounds to my version of the librdmacm
library, I can at least start up ceph using LD_PRELOAD and end up in
a healthy ceph cluster state.

I would not call these workarounds a real fix, but they should point
out the problems which I am trying to solve.


Regards

Andreas Bluemle




On Fri, 23 Aug 2013 00:35:22 +0000
"Hefty, Sean" <sean.hefty@intel.com> wrote:

> > I tested out the patch and unfortunately had the same results as
> > Andreas. About 50% of the time the rpoll() thread in Ceph still
> > hangs when rshutdown() is called. I saw a similar behaviour when
> > increasing the poll time on the pre-patched version if that's of
> > any relevance.
> 
> I'm not optimistic, but here's an updated patch.  I attempted to
> handle more shutdown conditions, but I can't say that any of those
> would prevent the hang that you see.
> 
> I have a couple of questions: 
> 
> Is there any chance that the code would call rclose while rpoll
> is still running?  Also, can you verify that the thread is in the
> real poll() call when the hang occurs?
> 
> Signed-off-by: Sean Hefty <sean.hefty@intel.com>
> ---
>  src/rsocket.c |   35 +++++++++++++++++++++++++----------
>  1 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..f94ddf3 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
>  					rs->state = rs_disconnected;
>  					return 0;
>  				} else if (rs_msg_data(msg) ==
> RS_CTRL_SHUTDOWN) {
> -					rs->state &= ~rs_readable;
> +					if (rs->state & rs_writable)
> {
> +						rs->state &=
> ~rs_readable;
> +					} else {
> +						rs->state =
> rs_disconnected;
> +						return 0;
> +					}
>  				}
>  				break;
>  			case RS_OP_WRITE:
> @@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd
> *rfds, struct pollfd *fds, nfds_t nfds) 
>  		rs = idm_lookup(&idm, fds[i].fd);
>  		if (rs) {
> +			fastlock_acquire(&rs->cq_wait_lock);
>  			if (rs->type == SOCK_STREAM)
>  				rs_get_cq_event(rs);
>  			else
>  				ds_get_cq_event(rs);
> +			fastlock_release(&rs->cq_wait_lock);
>  			fds[i].revents = rs_poll_rs(rs,
> fds[i].events, 1, rs_poll_all); } else {
>  			fds[i].revents = rfds[i].revents;
> @@ -3098,7 +3105,8 @@ int rselect(int nfds, fd_set *readfds, fd_set
> *writefds, 
>  /*
>   * For graceful disconnect, notify the remote side that we're
> - * disconnecting and wait until all outstanding sends complete.
> + * disconnecting and wait until all outstanding sends complete,
> provided
> + * that the remote side has not sent a disconnect message.
>   */
>  int rshutdown(int socket, int how)
>  {
> @@ -3106,11 +3114,6 @@ int rshutdown(int socket, int how)
>  	int ctrl, ret = 0;
>  
>  	rs = idm_at(&idm, socket);
> -	if (how == SHUT_RD) {
> -		rs->state &= ~rs_readable;
> -		return 0;
> -	}
> -
>  	if (rs->fd_flags & O_NONBLOCK)
>  		rs_set_nonblocking(rs, 0);
>  
> @@ -3118,15 +3121,20 @@ int rshutdown(int socket, int how)
>  		if (how == SHUT_RDWR) {
>  			ctrl = RS_CTRL_DISCONNECT;
>  			rs->state &= ~(rs_readable | rs_writable);
> -		} else {
> +		} else if (how == SHUT_WR) {
>  			rs->state &= ~rs_writable;
>  			ctrl = (rs->state & rs_readable) ?
>  				RS_CTRL_SHUTDOWN :
> RS_CTRL_DISCONNECT;
> +		} else {
> +			rs->state &= ~rs_readable;
> +			if (rs->state & rs_writable)
> +				goto out;
> +			ctrl = RS_CTRL_DISCONNECT;
>  		}
>  		if (!rs->ctrl_avail) {
>  			ret = rs_process_cq(rs, 0,
> rs_conn_can_send_ctrl); if (ret)
> -				return ret;
> +				goto out;
>  		}
>  
>  		if ((rs->state & rs_connected) && rs->ctrl_avail) {
> @@ -3138,10 +3146,17 @@ int rshutdown(int socket, int how)
>  	if (rs->state & rs_connected)
>  		rs_process_cq(rs, 0, rs_conn_all_sends_done);
>  
> +out:
>  	if ((rs->fd_flags & O_NONBLOCK) && (rs->state &
> rs_connected)) rs_set_nonblocking(rs, rs->fd_flags);
>  
> -	return 0;
> +	if (rs->state & rs_disconnected) {
> +		/* Generate event by flushing receives to unblock
> rpoll */
> +		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> +		rdma_disconnect(rs->cm_id);
> +	}
> +
> +	return ret;
>  }
>  
>  static void ds_shutdown(struct rsocket *rs)
> 
> 
> 



-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm

[-- Attachment #2: rsocket-poll-hefty2+shutdown-andy.patch --]
[-- Type: text/x-patch, Size: 3094 bytes --]

diff --git a/src/rsocket.c b/src/rsocket.c
index abdd392..76fbb85 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -206,6 +206,7 @@ enum rs_state {
 	rs_connect_error   =		    0x0800,
 	rs_disconnected	   =		    0x1000,
 	rs_error	   =		    0x2000,
+	rs_shutdown	   =		    0x4000,
 };
 
 #define RS_OPT_SWAP_SGL	(1 << 0)
@@ -1786,9 +1787,15 @@ static int rs_poll_cq(struct rsocket *rs)
 			case RS_OP_CTRL:
 				if (rs_msg_data(msg) == RS_CTRL_DISCONNECT) {
 					rs->state = rs_disconnected;
+					rshutdown(rs->index, SHUT_RDWR);
 					return 0;
 				} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-					rs->state &= ~rs_readable;
+					if (rs->state & rs_writable) {
+						rs->state &= ~rs_readable;
+					} else {
+						rs->state = rs_disconnected;
+						return 0;
+					}
 				}
 				break;
 			case RS_OP_WRITE:
@@ -2914,10 +2921,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
 		rs = idm_lookup(&idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3064,7 +3073,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3072,11 +3082,6 @@ int rshutdown(int socket, int how)
 	int ctrl, ret = 0;
 
 	rs = idm_at(&idm, socket);
-	if (how == SHUT_RD) {
-		rs->state &= ~rs_readable;
-		return 0;
-	}
-
 	if (rs->fd_flags & O_NONBLOCK)
 		rs_set_nonblocking(rs, 0);
 
@@ -3084,15 +3089,20 @@ int rshutdown(int socket, int how)
 		if (how == SHUT_RDWR) {
 			ctrl = RS_CTRL_DISCONNECT;
 			rs->state &= ~(rs_readable | rs_writable);
-		} else {
+		} else if (how == SHUT_WR) {
 			rs->state &= ~rs_writable;
 			ctrl = (rs->state & rs_readable) ?
 				RS_CTRL_SHUTDOWN : RS_CTRL_DISCONNECT;
+		} else {
+			rs->state &= ~rs_readable;
+			if (rs->state & rs_writable)
+				goto out;
+			ctrl = RS_CTRL_DISCONNECT;
 		}
 		if (!rs->ctrl_avail) {
 			ret = rs_process_cq(rs, 0, rs_conn_can_send_ctrl);
 			if (ret)
-				return ret;
+				goto out;
 		}
 
 		if ((rs->state & rs_connected) && rs->ctrl_avail) {
@@ -3104,10 +3114,19 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+out:
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
 
-	return 0;
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
+	rs->state = rs_shutdown;
+
+	return ret;
 }
 
 static void ds_shutdown(struct rsocket *rs)

  reply	other threads:[~2013-09-10 13:48 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-23  0:35 [ceph-users] Help needed porting Ceph to RSockets Hefty, Sean
2013-09-10 13:48 ` Andreas Bluemle [this message]
2013-09-12 10:20   ` Gandalf Corvotempesta
     [not found]     ` <CAJH6TXg94x+jcc=1MQQoQaF5JcGKuxbG_SzLEMt3EVQOrFeNjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-12 12:12       ` Andreas Bluemle
2013-09-16 13:29         ` Gandalf Corvotempesta
2013-09-20 23:47   ` Hefty, Sean
2013-10-30 23:25   ` Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A8237388CF3072-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-02-05 13:58       ` Gandalf Corvotempesta
     [not found] <CAJA05UmTDGze7S50_j8RgHmPFYonk2Z94Fi6CQtxND9QxgRr3g@mail.gmail.com>
     [not found] ` <CAJA05Umgio7XftGZtQdRzyWq70pXBb3oEx2jrT3BxcBr6MLoVQ@mail.gmail.com>
     [not found]   ` <20130812075513.43c338e1@andylap>
     [not found]     ` <CAJA05U=i22jHy2xbvCUk+UVN0+hOFDgt-JMSHTWKmr1+DZu0Dg@mail.gmail.com>
     [not found]       ` <20130812180644.447ca089@andylap>
     [not found]         ` <CAJA05U=KqzYis14sYXusbpk-T-F=uqvNz1XPurutmdeRz6BAdg@mail.gmail.com>
     [not found]           ` <20130813075312.7cac0d46@andylap>
     [not found]             ` <20130813160612.037ea9f2@andylap>
2013-08-13 14:35               ` Atchley, Scott
     [not found]                 ` <1978D1F9-C675-4A37-AA57-C7E1158B2F72-1Heg1YXhbW8@public.gmane.org>
2013-08-13 21:44                   ` Hefty, Sean
2013-08-14  7:21                     ` Andreas Bluemle
2013-08-14 13:05                       ` Atchley, Scott
2013-08-14 17:04                       ` Hefty, Sean
2013-08-17  0:07                       ` Hefty, Sean
2013-08-19 17:10                       ` Hefty, Sean
2013-08-20  7:21                         ` Andreas Bluemle
2013-08-20 10:30                           ` Andreas Bluemle
2013-08-20 15:04                             ` Hefty, Sean
     [not found]                               ` <1828884A29C6694DAF28B7E6B8A8237388CA6F25-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-08-21 11:44                                 ` Matthew Anderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130910154847.3fb7dd02@doppio \
    --to=andreas.bluemle@itxperts.de \
    --cc=ceph-devel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=manderson8787@gmail.com \
    --cc=sean.hefty@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).