linux-rdma.vger.kernel.org archive mirror
* Re: [ceph-users] Help needed porting Ceph to RSockets
       [not found]             ` <20130813160612.037ea9f2@andylap>
@ 2013-08-13 14:35               ` Atchley, Scott
       [not found]                 ` <1978D1F9-C675-4A37-AA57-C7E1158B2F72-1Heg1YXhbW8@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Atchley, Scott @ 2013-08-13 14:35 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Aug 13, 2013, at 10:06 AM, Andreas Bluemle <andreas.bluemle@itxperts.de> wrote:

> Hi Matthew,
> 
> I found a workaround for my (our) problem: in the librdmacm
> code, rsocket.c, there is a global constant polling_time, which
> is set to 10 microseconds at the moment.
> 
> I raise this to 10000 - and all of a sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this.

If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll().

What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

> I think we are looking at two issues here:
> 1. the thread structure of ceph messenger
>   For a given socket connection, there are 3 threads of interest
>   here: the main messenger thread, the Pipe::reader and the
>   Pipe::writer.
> 
>   For a ceph client like the ceph admin command, I see the following
>   sequence
>     - the connection to the ceph monitor is created by the
>       main messenger  thread, the Pipe::reader and
>       Pipe::writer are instantiated.
>     - the requested command is sent to the ceph monitor, the
>       answer is read and printed
>     - at this point the Pipe::reader already has called
>       tcp_read_wait(), polling for more data or connection termination
>     - after the response had been printed, the main loop calls the
>       shutdown routines which in turn shutdown() the socket
> 
>    There is some time between the last two steps - and this gap is
>    long enough to open a race:
> 
> 2. rpoll, ibv and poll
>   the rpoll implementation in rsockets is split in 2 phases:
>   - a busy loop which checks the state of the underlying ibv queue pair
>   - the call to real poll() system call (i.e. the uverbs(?)
>     implementation of poll() inside the kernel)
> 
>   The busy loop has a maximum duration of polling_time (10 microseconds
>   by default) - and is able to detect the local shutdown and returns a
>   POLLHUP.
> 
>   The poll() system call (i.e. the uverbs implementation of poll() 
>   in the kernel) does not detect the local shutdown - and only returns
>   after the caller supplied timeout expires.
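
For illustration, a minimal self-contained sketch of this two-phase scheme (not the actual rsocket.c code; check() stands in for rsockets' internal state and queue pair inspection, and all names are invented for this example):

    #include <poll.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
    }

    /* Phase 1: spin on check() for up to spin_us microseconds; this is
     * the only place where purely local state changes (e.g. a local
     * shutdown) are noticed.  Phase 2: fall back to one blocking poll()
     * on the fds for the full caller-supplied timeout. */
    int poll_two_phase(struct pollfd *fds, nfds_t nfds, int timeout,
                       int (*check)(struct pollfd *, nfds_t), int spin_us)
    {
        uint64_t start = now_us();
        int ret;

        do {
            ret = check(fds, nfds);
            if (ret)
                return ret;        /* e.g. POLLIN or POLLHUP seen while spinning */
        } while (now_us() - start < (uint64_t)spin_us);

        return poll(fds, nfds, timeout);   /* only events on the fds wake this up */
    }

The point of the sketch is that only phase 1 ever looks at the local state, so anything that changes that state without producing an event on the fds, such as a local shutdown, is only noticed while spinning.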
> 
> Increasing the rsockets polling_time from 10 to 10000 microseconds
> results in rpoll() detecting the local shutdown within the busy loop.
> 
> Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5
> seconds serves a similar purpose, but is much coarser.
> 
> From my understanding, the real issue is neither at the ceph nor at the
> rsockets level: it is related to the uverbs kernel module.
> 
> An alternative way to address the current problem at the rsockets level
> would be a re-write of rpoll(): instead of the busy loop at the
> beginning followed by the real poll() call with the full user
> specified timeout value ("ms tcp read timeout" in our case), I would
> embed the real poll()  into a loop, splitting the user specified timeout
> into smaller portions and doing the rsockets specific rs_poll_check()
> on every timeout of the real poll().

I have not looked at the rsocket code, so take the following with a grain of salt. If the purpose of the real poll() is to simply block the user for a specified time, then you can simply make it a short duration (taking into consideration what granularity the OS provides) and then call ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing power.

If the real poll() is actually checking for something (e.g. checking on the RDMA channel's fd or the IB channel's fd), then you may not want to spin too much.
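
For what it is worth, a sketch of the sliced variant proposed in the quoted text above: the caller's timeout is consumed in short blocking slices of the real poll(), and the rsockets state is re-checked between slices, so a local shutdown is noticed within one slice even if no CQ event ever arrives. As before, check() is only a stand-in for the internal rs_poll_check(); this is an illustration, not the library's code.

    #include <poll.h>

    int poll_in_slices(struct pollfd *fds, nfds_t nfds, int timeout,
                       int (*check)(struct pollfd *, nfds_t), int slice_ms)
    {
        int ret, waited = 0;

        for (;;) {
            int chunk = slice_ms;

            if (timeout >= 0 && timeout - waited < slice_ms)
                chunk = timeout - waited;

            ret = poll(fds, nfds, chunk);      /* short real poll() */
            if (ret)
                return ret;                    /* CQ event or error */

            ret = check(fds, nfds);            /* e.g. notice a local shutdown */
            if (ret)
                return ret;

            waited += chunk;
            if (timeout >= 0 && waited >= timeout)
                return 0;                      /* caller's timeout expired */
        }
    }

The slice length trades wakeup latency for extra system calls; a few tens of milliseconds would already be far below the 900 second "ms tcp read timeout".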

Scott

> Best Regards
> 
> Andreas Bluemle
> 
> 
> On Tue, 13 Aug 2013 07:53:12 +0200
> Andreas Bluemle <andreas.bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org> wrote:
> 
>> Hi Matthew,
>> 
>> I can confirm the behaviour which you describe.
>> I too believe that the problem is on the client side (ceph command).
>> My log files show the very same symptom, i.e. the client side
>> not being able to shutdown the pipes properly.
>> 
>> (Q: I had problems yesterday sending a mail to the ceph-users list
>> with the log files attached to it because the size of
>> the attachments exceeded some limit; I hadn't been subscribed
>> to the list at that point. Is the use of pastebin.com the better
>> way to provide such lengthy information in general?)
>> 
>> 
>> Best Regards
>> 
>> Andreas Bluemle
>> 
>> On Tue, 13 Aug 2013 11:59:36 +0800
>> Matthew Anderson <manderson8787-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>>> Moving this conversation to ceph-devel where the dev's might be able
>>> to shed some light on this.
>>> 
>>> I've added some additional debug to my code to narrow the issue down
>>> a bit and the reader thread appears to be getting locked by
>>> tcp_read_wait() because rpoll never returns an event when the socket
>>> is shutdown. A hack way of proving this was to lower the timeout in
>>> rpoll to 5 seconds. When command like 'ceph osd tree' completes you
>>> can see it block for 5 seconds until rpoll times out and returns 0.
>>> The reader thread is then able to join and the pipe can be reaped.
>>> 
>>> Ceph log is here - http://pastebin.com/rHK4vYLZ
>>> Mon log is here - http://pastebin.com/WyAJEw0m
>>> 
>>> What's particularly weird is that the monitor receives a POLLHUP
>>> event when the ceph command shuts down its socket but the ceph
>>> command never does. When using regular sockets both sides of the
>>> connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
>>> sockets are shut down. It would seem like there is a bug in rsockets
>>> that causes the side that calls shutdown first not to receive the
>>> correct rpoll events.
>>> 
>>> Can anyone comment on whether the above seems right?
>>> 
>>> Thanks all
>>> -Matt
>>> 
>>> 
>>> On Tue, Aug 13, 2013 at 12:06 AM, Andreas Bluemle <
>>> andreas.bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org> wrote:
>>> 
>>>> Hi Matthew,
>>>> 
>>>> I am not quite sure about the POLLRDHUP.
>>>> On the server side (ceph-mon), tcp_read_wait does see the
>>>> POLLHUP - which should be the indicator that the
>>>> other side is shutting down.
>>>> 
>>>> I have also taken a brief look at the client side (ceph mon stat).
>>>> It initiates a shutdown - but never finishes. See attached log
>>>> file from "ceph --log-file ceph-mon-stat.rsockets --debug-ms 30
>>>> mon stat". I have also attached the corresponding log file for
>>>> regular TCP/IP sockets.
>>>> 
>>>> It looks to me that in the rsockets case, the reaper is able to
>>>> clean up even though there is still something left to do - and hence the
>>>> shutdown never completes.
>>>> 
>>>> 
>>>> Best Regards
>>>> 
>>>> Andreas Bluemle
>>>> 
>>>> 
>>>> On Mon, 12 Aug 2013 15:11:47 +0800
>>>> Matthew Anderson <manderson8787-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>>> Hi Andreas,
>>>>> 
>>>>> I think we're both working on the same thing, I've just changed
>>>>> the function calls over to rsockets in the source instead of
>>>>> using the pre-load library. It explains why we're having the
>>>>> exact same problem!
>>>>> 
>>>>> From what I've been able to tell the entire problem revolves
>>>>> around rsockets not supporting POLLRDHUP. As far as I can tell
>>>>> the pipe will only be removed when tcp_read_wait returns -1.
>>>>> With rsockets it never receives the POLLRDHUP event after
>>>>> shutdown_socket() is called so the rpoll call blocks until
>>>>> timeout (900 seconds) and the pipe stays active.
>>>>> 
>>>>> The question then would be how can we destroy a pipe without
>>>>> relying on POLLRDHUP? shutdown_socket() always gets called when
>>>>> the socket should be closed, so there might be a way to
>>>>> trick tcp_read_wait() into returning -1 by doing something in
>>>>> shutdown_socket(), but I'm not sure how to go about it.
>>>>> 
>>>>> Any ideas?
>>>>> 
>>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
> Heinrich Boell Strasse 88           Phone: (+49) 89 4317582
> D-81829 Muenchen (Germany)          Mobil: (+49) 177 522 0151
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
       [not found]                 ` <1978D1F9-C675-4A37-AA57-C7E1158B2F72-1Heg1YXhbW8@public.gmane.org>
@ 2013-08-13 21:44                   ` Hefty, Sean
  2013-08-14  7:21                     ` Andreas Bluemle
  0 siblings, 1 reply; 19+ messages in thread
From: Hefty, Sean @ 2013-08-13 21:44 UTC (permalink / raw)
  To: Atchley, Scott, Andreas Bluemle
  Cc: Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> > I found a workaround for my (our) problem: in the librdmacm
> > code, rsocket.c, there is a global constant polling_time, which
> > is set to 10 microseconds at the moment.
> >
> > I raise this to 10000 - and all of a sudden things work nicely.
> 
> I am adding the linux-rdma list to CC so Sean might see this.
> 
> If I understand what you are describing, the caller to rpoll() spins for up to
> 10 ms (10,000 us) before calling the real poll().
> 
> What is the purpose of the real poll() call? Is it simply a means to block the
> caller and avoid spinning? Or does it actually expect to detect an event?

When the real poll() is called, an event is expected on an fd associated with the CQ's completion channel.
 
> > I think we are looking at two issues here:
> > 1. the thread structure of ceph messenger
> >   For a given socket connection, there are 3 threads of interest
> >   here: the main messenger thread, the Pipe::reader and the
> >   Pipe::writer.
> >
> >   For a ceph client like the ceph admin command, I see the following
> >   sequence
> >     - the connection to the ceph monitor is created by the
> >       main messenger  thread, the Pipe::reader and
> >       Pipe::writer are instantiated.
> >     - the requested command is sent to the ceph monitor, the
> >       answer is read and printed
> >     - at this point the Pipe::reader already has called
> >       tcp_read_wait(), polling for more data or connection termination
> >     - after the response had been printed, the main loop calls the
> >       shutdown routines which in turn shutdown() the socket
> >
> >    There is some time between the last two steps - and this gap is
> >    long enough to open a race:
> >
> > 2. rpoll, ibv and poll
> >   the rpoll implementation in rsockets is split in 2 phases:
> >   - a busy loop which checks the state of the underlying ibv queue pair
> >   - the call to real poll() system call (i.e. the uverbs(?)
> >     implementation of poll() inside the kernel)
> >
> >   The busy loop has a maximum duration of polling_time (10 microseconds
> >   by default) - and is able to detect the local shutdown and returns a
> >   POLLHUP.
> >
> >   The poll() system call (i.e. the uverbs implementation of poll()
> >   in the kernel) does not detect the local shutdown - and only returns
> >   after the caller supplied timeout expires.

It sounds like there's an issue here either with a message getting lost or a race.  Given that spinning longer works for you, it sounds like an event is getting lost, not being generated correctly, or not being configured to generate.

> > Increasing the rsockets polling_time from 10 to 10000 microseconds
> > results in rpoll() detecting the local shutdown within the busy loop.
> >
> > Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5
> > seconds serves a similar purpose, but is much coarser.
> >
> > From my understanding, the real issue is neither at the ceph nor at the
> > rsockets level: it is related to the uverbs kernel module.
> >
> > An alternative way to address the current problem at the rsockets level
> > would be a re-write of rpoll(): instead of the busy loop at the
> > beginning followed by the real poll() call with the full user
> > specified timeout value ("ms tcp read timeout" in our case), I would
> > embed the real poll()  into a loop, splitting the user specified timeout
> > into smaller portions and doing the rsockets specific rs_poll_check()
> > on every timeout of the real poll().
> 
> I have not looked at the rsocket code, so take the following with a grain of
> salt. If the purpose of the real poll() is to simply block the user for a
> specified time, then you can simply make it a short duration (taking into
> consideration what granularity the OS provides) and then call ibv_poll_cq().
> Keep in mind, polling will prevent your CPU from reducing power.
> 
> If the real poll() is actually checking for something (e.g. checking on the
> RDMA channel's fd or the IB channel's fd), then you may not want to spin too
> much.

The real poll() call is intended to block the application until a timeout occurs or an event shows up.  Since increasing the spin time works for you, it makes me suspect that there is a bug in the CQ event handling in rsockets.
 
> >>> What's particularly weird is that the monitor receives a POLLHUP
> >>> event when the ceph command shuts down its socket but the ceph
> >>> command never does. When using regular sockets both sides of the
> >>> connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
> >>> sockets are shut down. It would seem like there is a bug in rsockets
> >>> that causes the side that calls shutdown first not to receive the
> >>> correct rpoll events.

rsockets does not support POLLRDHUP.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-13 21:44                   ` Hefty, Sean
@ 2013-08-14  7:21                     ` Andreas Bluemle
  2013-08-14 13:05                       ` Atchley, Scott
                                         ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Andreas Bluemle @ 2013-08-14  7:21 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Atchley, Scott, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

Hi,

maybe some information about the environment I am
working in:

- CentOS 6.4 with custom kernel 3.8.13
- librdmacm / librspreload from git, tag 1.0.17
- application started with librspreload in LD_PRELOAD environment

Currently, I have increased the value of the spin time by setting the
default value for polling_time in the source code.

I guess that the correct way to do this is via
configuration in /etc/rdma/rsocket/polling_time?

Concerning the rpoll() itself, some more comments/questions
embedded below.

On Tue, 13 Aug 2013 21:44:42 +0000
"Hefty, Sean" <sean.hefty@intel.com> wrote:

> > > I found a workaround for my (our) problem: in the librdmacm
> > > code, rsocket.c, there is a global constant polling_time, which
> > > is set to 10 microseconds at the moment.
> > >
> > > I raise this to 10000 - and all of a sudden things work nicely.
> > 
> > I am adding the linux-rdma list to CC so Sean might see this.
> > 
> > If I understand what you are describing, the caller to rpoll()
> > spins for up to 10 ms (10,000 us) before calling the real poll().
> > 
> > What is the purpose of the real poll() call? Is it simply a means
> > to block the caller and avoid spinning? Or does it actually expect
> > to detect an event?
> 
> When the real poll() is called, an event is expected on an fd
> associated with the CQ's completion channel. 

The first question I would have is: why is the rpoll() split into
these two pieces? There must have been some reason to do a busy
loop on some local state information rather than just call the
real poll() directly.

> > > I think we are looking at two issues here:
> > > 1. the thread structure of ceph messenger
> > >   For a given socket connection, there are 3 threads of interest
> > >   here: the main messenger thread, the Pipe::reader and the
> > >   Pipe::writer.
> > >
> > >   For a ceph client like the ceph admin command, I see the
> > > following sequence
> > >     - the connection to the ceph monitor is created by the
> > >       main messenger  thread, the Pipe::reader and
> > >       Pipe::writer are instantiated.
> > >     - the requested command is sent to the ceph monitor, the
> > >       answer is read and printed
> > >     - at this point the Pipe::reader already has called
> > >       tcp_read_wait(), polling for more data or connection
> > > termination
> > >     - after the response had been printed, the main loop calls the
> > >       shutdown routines which in turn shutdown() the socket
> > >
> > >    There is some time between the last two steps - and this gap is
> > >    long enough to open a race:
> > >
> > > 2. rpoll, ibv and poll
> > >   the rpoll implementation in rsockets is split in 2 phases:
> > >   - a busy loop which checks the state of the underlying ibv
> > > queue pair
> > >   - the call to real poll() system call (i.e. the uverbs(?)
> > >     implementation of poll() inside the kernel)
> > >
> > >   The busy loop has a maximum duration of polling_time (10
> > > microseconds by default) - and is able to detect the local shutdown
> > > and returns a POLLHUP.
> > >
> > >   The poll() system call (i.e. the uverbs implementation of poll()
> > >   in the kernel) does not detect the local shutdown - and only
> > > returns after the caller supplied timeout expires.
> 
> It sounds like there's an issue here either with a message getting
> lost or a race.  Given that spinning longer works for you, it sounds
> like an event is getting lost, not being generated correctly, or not
> being configured to generate.
> 

I am looking at a multithreaded application here, and I believe that
the race is between thread A calling the rpoll() for POLLIN event and
thread B calling the shutdown(SHUT_RDWR) for reading and writing of
the (r)socket almost immediately afterwards.

I think that the shutdown itself does not cause a POLLHUP event to be
> generated from the kernel to interrupt the real poll().
(BTW: which kernel module implements the poll() for rsockets?
Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)
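
To make the suspected race concrete, a minimal sketch against the rsockets API from <rdma/rsocket.h> (rsfd is assumed to be an already connected rsocket, error handling is omitted, and the helper names are invented):

    #include <poll.h>
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <rdma/rsocket.h>

    static int rsfd;                        /* a connected rsocket, set up elsewhere */

    static void *reader_thread(void *arg)
    {
        struct pollfd pfd = { .fd = rsfd, .events = POLLIN };

        (void)arg;
        /* Thread A: like Pipe::reader in tcp_read_wait(), wait for data
         * or connection termination ("ms tcp read timeout" of 900 s). */
        rpoll(&pfd, 1, 900 * 1000);
        return NULL;
    }

    static void provoke_race(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, reader_thread, NULL);
        usleep(100 * 1000);                 /* let thread A block inside rpoll() */
        rshutdown(rsfd, SHUT_RDWR);         /* thread B: local shutdown */
        pthread_join(t, NULL);              /* hangs if rpoll() is never woken up */
        rclose(rsfd);
    }

With the behaviour described above, the rshutdown() returns, but the rpoll() stays blocked in the real poll() until the full timeout expires.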

> > > Increasing the rsockets polling_time from 10 to 10000
> > > microseconds results in rpoll() detecting the local shutdown
> > > within the busy loop.
> > >
> > > Decreasing the ceph "ms tcp read timeout" from the default of 900
> > > to 5 seconds serves a similar purpose, but is much coarser.
> > >
> > > From my understanding, the real issue is neither at the ceph nor
> > > at the rsockets level: it is related to the uverbs kernel module.
> > >
> > > An alternative way to address the current problem at the rsockets
> > > level would be a re-write of rpoll(): instead of the busy
> > > loop at the beginning followed by the real poll() call with the
> > > full user specified timeout value ("ms tcp read timeout" in our
> > > case), I would embed the real poll()  into a loop, splitting the
> > > user specified timeout into smaller portions and doing the
> > > rsockets specific rs_poll_check() on every timeout of the real
> > > poll().
> > 
> > I have not looked at the rsocket code, so take the following with a
> > grain of salt. If the purpose of the real poll() is to simply block
> > the user for a specified time, then you can simply make it a short
> > duration (taking into consideration what granularity the OS
> > provides) and then call ibv_poll_cq(). Keep in mind, polling will
> > prevent your CPU from reducing power.
> > 
> > If the real poll() is actually checking for something (e.g.
> > checking on the RDMA channel's fd or the IB channel's fd), then you
> > may not want to spin too much.
> 
> The real poll() call is intended to block the application until a
> timeout occurs or an event shows up.  Since increasing the spin time
> works for you, it makes me suspect that there is a bug in the CQ
> event handling in rsockets. 

> > >>> What's particularly weird is that the monitor receives a POLLHUP
> > >>> event when the ceph command shuts down its socket but the ceph
> > >>> command never does. When using regular sockets both sides of the
> > >>> connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
> > >>> sockets are shut down. It would seem like there is a bug in
> > >>> rsockets that causes the side that calls shutdown first not to
> > >>> receive the correct rpoll events.
> 
> rsockets does not support POLLRDHUP.
> 

I don't think the issue is POLLRDHUP.
I think the issue is POLLHUP and/or POLLIN.
> My impression is that on a locally shut down (r)socket, a poll
for POLLIN event should at least return a POLLIN event and
a subsequent read should return 0 bytes indicating EOF.
But the POLLIN is not generated from the layer below 
rsockets (ib_uverbs.ko?) as far as I can tell.

See also: http://www.greenend.org.uk/rjk/tech/poll.html
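
For comparison, a small self-contained example of the regular-socket behaviour described above, using an AF_UNIX stream pair (the exact revents bits can differ between socket types and kernels, but poll() returns immediately and read() yields 0, i.e. EOF):

    #include <poll.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        char c;
        struct pollfd pfd;
        int n;

        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        shutdown(sv[0], SHUT_RDWR);            /* local shutdown */

        pfd.fd = sv[0];
        pfd.events = POLLIN;
        n = poll(&pfd, 1, 5000);               /* returns at once, not after 5 s */
        printf("poll() = %d, revents = 0x%x\n", n, pfd.revents);
        printf("read() = %zd (0 means EOF)\n", read(sv[0], &c, 1));

        close(sv[0]);
        close(sv[1]);
        return 0;
    }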

Best Regards

Andreas Bluemle

> - Sean
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 




-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-14  7:21                     ` Andreas Bluemle
@ 2013-08-14 13:05                       ` Atchley, Scott
  2013-08-14 17:04                       ` Hefty, Sean
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Atchley, Scott @ 2013-08-14 13:05 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Hefty, Sean, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Aug 14, 2013, at 3:21 AM, Andreas Bluemle <andreas.bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org> wrote:

> Hi,
> 
> maybe some information about the environment I am
> working in:
> 
> - CentOS 6.4 with custom kernel 3.8.13
> - librdmacm / librspreload from git, tag 1.0.17
> - application started with librspreload in LD_PRELOAD environment
> 
> Currently, I have increased the value of the spin time by setting the
> default value for polling_time in the source code.
> 
> I guess that the correct way to do this is via
> configuration in /etc/rdma/rsocket/polling_time?
> 
> Concerning the rpoll() itself, some more comments/questions
> embedded below.
> 
> On Tue, 13 Aug 2013 21:44:42 +0000
> "Hefty, Sean" <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> 
>>>> I found a workaround for my (our) problem: in the librdmacm
>>>> code, rsocket.c, there is a global constant polling_time, which
>>>> is set to 10 microseconds at the moment.
>>>> 
>>>> I raise this to 10000 - and all of a sudden things work nicely.
>>> 
>>> I am adding the linux-rdma list to CC so Sean might see this.
>>> 
>>> If I understand what you are describing, the caller to rpoll()
>>> spins for up to 10 ms (10,000 us) before calling the real poll().
>>> 
>>> What is the purpose of the real poll() call? Is it simply a means
>>> to block the caller and avoid spinning? Or does it actually expect
>>> to detect an event?
>> 
>> When the real poll() is called, an event is expected on an fd
>> associated with the CQ's completion channel. 
> 
> The first question I would have is: why is the rpoll() split into
> these two pieces? There must have been some reason to do a busy
> loop on some local state information rather than just call the
> real poll() directly.

Sean can answer specifically, but this is a typical HPC technique. The worst thing you can do is handle an event and then block when the next event is available. This adds 1-3 us to latency which is unacceptable in HPC. In HPC, we poll. If we worry about power, we poll until we get no more events and then we poll a little more before blocking. Determining the "little more" is the fun part. ;-) 
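
A tiny sketch of that spin-then-block idea, independent of rsockets (the probe budget is arbitrary and the helper is purely illustrative):

    #include <poll.h>

    /* Handle events while they keep arriving; after going idle, allow a
     * bounded number of empty non-blocking probes before paying the
     * cost of blocking. */
    int wait_for_event(struct pollfd *pfd, int max_empty_probes)
    {
        int i, ret;

        for (i = 0; i < max_empty_probes; i++) {
            ret = poll(pfd, 1, 0);             /* non-blocking probe */
            if (ret != 0)
                return ret;                    /* event ready (or error) */
        }
        return poll(pfd, 1, -1);               /* idle long enough: block */
    }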

Scott
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-14  7:21                     ` Andreas Bluemle
  2013-08-14 13:05                       ` Atchley, Scott
@ 2013-08-14 17:04                       ` Hefty, Sean
  2013-08-17  0:07                       ` Hefty, Sean
  2013-08-19 17:10                       ` Hefty, Sean
  3 siblings, 0 replies; 19+ messages in thread
From: Hefty, Sean @ 2013-08-14 17:04 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Atchley, Scott, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> The first question I would have is: why is the rpoll() split into
> these two pieces? There must have been some reason to do a busy
> loop on some local state information rather than just call the
> real poll() directly.

As Scott mentioned in his email, this is done for performance reasons.  The cost of always dropping into the kernel is too high for HPC.

> I am looking at a multithreaded application here, and I believe that
> the race is between thread A calling the rpoll() for POLLIN event and
> thread B calling the shutdown(SHUT_RDWR) for reading and writing of
> the (r)socket almost immediately afterwards.

Ah - this is likely the issue.  I did not assume that rshutdown() would be called simultaneously with rpoll().  I need to think about how to solve this, so that rpoll() unblocks.
 
> I think that the shutdown itself does not cause a POLLHUP event to be
> generated from the kernel to interrupt the real poll().
> (BTW: which kernel module implements the poll() for rsockets?
> Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)

The POLLHUP event in rsockets is just software indicating that such an 'event' occurred - basically when a call to rpoll() detects that the rsocket state is disconnected.

I believe that the real poll() call traps into ib_uverbs_event_poll() in the kernel.  The fd associated with the poll call corresponds to a 'completion channel', which is used to report events which occur on a CQ.  Connection related events don't actually go to that fd - only completions for data transfers.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-14  7:21                     ` Andreas Bluemle
  2013-08-14 13:05                       ` Atchley, Scott
  2013-08-14 17:04                       ` Hefty, Sean
@ 2013-08-17  0:07                       ` Hefty, Sean
  2013-08-19 17:10                       ` Hefty, Sean
  3 siblings, 0 replies; 19+ messages in thread
From: Hefty, Sean @ 2013-08-17  0:07 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Atchley, Scott, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

> I am looking at a multithreaded application here, and I believe that
> the race is between thread A calling the rpoll() for POLLIN event and
> thread B calling the shutdown(SHUT_RDWR) for reading and writing of
> the (r)socket almost immediately afterwards.

I modified a test program, and I can reproduce the hang as you describe -- calling rpoll() then rshutdown() from another thread.

These calls end up calling rpoll->poll followed by rshutdown->read.  The read completes, which completes rshutdown, but the poll continues to wait for an event.  In the kernel, poll ends up in uverbs.c::ib_uverbs_event_poll, and read in uverbs.c::ib_uverbs_event_read().

The behavior of poll/read seems reasonable, so I don't think this is a kernel issue.  I'm still trying to figure out a simple solution to fixing this.

- Sean

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-14  7:21                     ` Andreas Bluemle
                                         ` (2 preceding siblings ...)
  2013-08-17  0:07                       ` Hefty, Sean
@ 2013-08-19 17:10                       ` Hefty, Sean
  2013-08-20  7:21                         ` Andreas Bluemle
  3 siblings, 1 reply; 19+ messages in thread
From: Hefty, Sean @ 2013-08-19 17:10 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Atchley, Scott, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

Can you see if the patch below fixes the hang?

Signed-off-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 src/rsocket.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..e45b26d 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
 		rs = idm_lookup(&idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-19 17:10                       ` Hefty, Sean
@ 2013-08-20  7:21                         ` Andreas Bluemle
  2013-08-20 10:30                           ` Andreas Bluemle
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Bluemle @ 2013-08-20  7:21 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Atchley, Scott, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

Hi Sean,

I will re-check by the end of the week; there is
some test scheduling issue with our test system, which
affects my access times.

Thanks

Andreas


On Mon, 19 Aug 2013 17:10:11 +0000
"Hefty, Sean" <sean.hefty@intel.com> wrote:

> Can you see if the patch below fixes the hang?
> 
> Signed-off-by: Sean Hefty <sean.hefty@intel.com>
> ---
>  src/rsocket.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..e45b26d 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd
> *rfds, struct pollfd *fds, nfds_t nfds) 
>  		rs = idm_lookup(&idm, fds[i].fd);
>  		if (rs) {
> +			fastlock_acquire(&rs->cq_wait_lock);
>  			if (rs->type == SOCK_STREAM)
>  				rs_get_cq_event(rs);
>  			else
>  				ds_get_cq_event(rs);
> +			fastlock_release(&rs->cq_wait_lock);
>  			fds[i].revents = rs_poll_rs(rs,
> fds[i].events, 1, rs_poll_all); } else {
>  			fds[i].revents = rfds[i].revents;
> @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set
> *writefds, 
>  /*
>   * For graceful disconnect, notify the remote side that we're
> - * disconnecting and wait until all outstanding sends complete.
> + * disconnecting and wait until all outstanding sends complete,
> provided
> + * that the remote side has not sent a disconnect message.
>   */
>  int rshutdown(int socket, int how)
>  {
> @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
>  	if (rs->state & rs_connected)
>  		rs_process_cq(rs, 0, rs_conn_all_sends_done);
>  
> +	if (rs->state & rs_disconnected) {
> +		/* Generate event by flushing receives to unblock
> rpoll */
> +		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> +		rdma_disconnect(rs->cm_id);
> +	}
> +
>  	if ((rs->fd_flags & O_NONBLOCK) && (rs->state &
> rs_connected)) rs_set_nonblocking(rs, rs->fd_flags);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 



-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
Heinrich Boell Strasse 88           Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)          Mobil: (+49) 177 522 0151

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-20  7:21                         ` Andreas Bluemle
@ 2013-08-20 10:30                           ` Andreas Bluemle
  2013-08-20 15:04                             ` Hefty, Sean
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Bluemle @ 2013-08-20 10:30 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Atchley, Scott, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

Hi,

I have added the patch and re-tested: I still encounter
hangs of my application. I am not quite sure whether
I hit the same error on the shutdown, because now I don't hit
the error every time, but only every now and then.

When adding the patch to my code base (git tag v1.0.17) I notice
an offset of "-34 lines". Which code base are you using?


Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200
Andreas Bluemle <andreas.bluemle@itxperts.de> wrote:

> Hi Sean,
> 
> I will re-check by the end of the week; there is
> some test scheduling issue with our test system, which
> affects my access times.
> 
> Thanks
> 
> Andreas
> 
> 
> On Mon, 19 Aug 2013 17:10:11 +0000
> "Hefty, Sean" <sean.hefty@intel.com> wrote:
> 
> > Can you see if the patch below fixes the hang?
> > 
> > Signed-off-by: Sean Hefty <sean.hefty@intel.com>
> > ---
> >  src/rsocket.c |   11 ++++++++++-
> >  1 files changed, 10 insertions(+), 1 deletions(-)
> > 
> > diff --git a/src/rsocket.c b/src/rsocket.c
> > index d544dd0..e45b26d 100644
> > --- a/src/rsocket.c
> > +++ b/src/rsocket.c
> > @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd
> > *rfds, struct pollfd *fds, nfds_t nfds) 
> >  		rs = idm_lookup(&idm, fds[i].fd);
> >  		if (rs) {
> > +			fastlock_acquire(&rs->cq_wait_lock);
> >  			if (rs->type == SOCK_STREAM)
> >  				rs_get_cq_event(rs);
> >  			else
> >  				ds_get_cq_event(rs);
> > +			fastlock_release(&rs->cq_wait_lock);
> >  			fds[i].revents = rs_poll_rs(rs,
> > fds[i].events, 1, rs_poll_all); } else {
> >  			fds[i].revents = rfds[i].revents;
> > @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set
> > *writefds, 
> >  /*
> >   * For graceful disconnect, notify the remote side that we're
> > - * disconnecting and wait until all outstanding sends complete.
> > + * disconnecting and wait until all outstanding sends complete,
> > provided
> > + * that the remote side has not sent a disconnect message.
> >   */
> >  int rshutdown(int socket, int how)
> >  {
> > @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
> >  	if (rs->state & rs_connected)
> >  		rs_process_cq(rs, 0, rs_conn_all_sends_done);
> >  
> > +	if (rs->state & rs_disconnected) {
> > +		/* Generate event by flushing receives to unblock
> > rpoll */
> > +		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> > +		rdma_disconnect(rs->cm_id);
> > +	}
> > +
> >  	if ((rs->fd_flags & O_NONBLOCK) && (rs->state &
> > rs_connected)) rs_set_nonblocking(rs, rs->fd_flags);
> >  
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-rdma" in the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
> 



-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
Heinrich Boell Strasse 88           Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)          Mobil: (+49) 177 522 0151

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-20 10:30                           ` Andreas Bluemle
@ 2013-08-20 15:04                             ` Hefty, Sean
       [not found]                               ` <1828884A29C6694DAF28B7E6B8A8237388CA6F25-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Hefty, Sean @ 2013-08-20 15:04 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Atchley, Scott, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

> I have added the patch and re-tested: I still encounter
> hangs of my application. I am not quite sure whether
> I hit the same error on the shutdown, because now I don't hit
> the error every time, but only every now and then.

I guess this is at least some progress... :/
 
> When adding the patch to my code base (git tag v1.0.17) I notice
> an offset of "-34 lines". Which code base are you using?

This patch was generated against the tip of the git tree. 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
       [not found]                               ` <1828884A29C6694DAF28B7E6B8A8237388CA6F25-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-08-21 11:44                                 ` Matthew Anderson
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Anderson @ 2013-08-21 11:44 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Andreas Bluemle,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

Hi Sean,

I tested out the patch and unfortunately had the same results as
Andreas. About 50% of the time the rpoll() thread in Ceph still hangs
when rshutdown() is called. I saw a similar behaviour when increasing
the poll time on the pre-patched version if that's of any relevance.

Thanks

On Tue, Aug 20, 2013 at 11:04 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> I have added the patch and re-tested: I still encounter
>> hangs of my application. I am not quite sure whether
>> I hit the same error on the shutdown, because now I don't hit
>> the error every time, but only every now and then.
>
> I guess this is at least some progress... :/
>
>> When adding the patch to my code base (git tag v1.0.17) I notice
>> an offset of "-34 lines". Which code base are you using?
>
> This patch was generated against the tip of the git tree.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [ceph-users] Help needed porting Ceph to RSockets
@ 2013-08-23  0:35 Hefty, Sean
  2013-09-10 13:48 ` Andreas Bluemle
  0 siblings, 1 reply; 19+ messages in thread
From: Hefty, Sean @ 2013-08-23  0:35 UTC (permalink / raw)
  To: Matthew Anderson
  Cc: Andreas Bluemle,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> I tested out the patch and unfortunately had the same results as
> Andreas. About 50% of the time the rpoll() thread in Ceph still hangs
> when rshutdown() is called. I saw a similar behaviour when increasing
> the poll time on the pre-patched version if that's of any relevance.

I'm not optimistic, but here's an updated patch.  I attempted to handle more
shutdown conditions, but I can't say that any of those would prevent the
hang that you see.

I have a couple of questions: 

Is there any chance that the code would call rclose while rpoll
is still running?  Also, can you verify that the thread is in the
real poll() call when the hang occurs?

Signed-off-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 src/rsocket.c |   35 +++++++++++++++++++++++++----------
 1 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..f94ddf3 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
 					rs->state = rs_disconnected;
 					return 0;
 				} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-					rs->state &= ~rs_readable;
+					if (rs->state & rs_writable) {
+						rs->state &= ~rs_readable;
+					} else {
+						rs->state = rs_disconnected;
+						return 0;
+					}
 				}
 				break;
 			case RS_OP_WRITE:
@@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
 		rs = idm_lookup(&idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3098,7 +3105,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3106,11 +3114,6 @@ int rshutdown(int socket, int how)
 	int ctrl, ret = 0;
 
 	rs = idm_at(&idm, socket);
-	if (how == SHUT_RD) {
-		rs->state &= ~rs_readable;
-		return 0;
-	}
-
 	if (rs->fd_flags & O_NONBLOCK)
 		rs_set_nonblocking(rs, 0);
 
@@ -3118,15 +3121,20 @@ int rshutdown(int socket, int how)
 		if (how == SHUT_RDWR) {
 			ctrl = RS_CTRL_DISCONNECT;
 			rs->state &= ~(rs_readable | rs_writable);
-		} else {
+		} else if (how == SHUT_WR) {
 			rs->state &= ~rs_writable;
 			ctrl = (rs->state & rs_readable) ?
 				RS_CTRL_SHUTDOWN : RS_CTRL_DISCONNECT;
+		} else {
+			rs->state &= ~rs_readable;
+			if (rs->state & rs_writable)
+				goto out;
+			ctrl = RS_CTRL_DISCONNECT;
 		}
 		if (!rs->ctrl_avail) {
 			ret = rs_process_cq(rs, 0, rs_conn_can_send_ctrl);
 			if (ret)
-				return ret;
+				goto out;
 		}
 
 		if ((rs->state & rs_connected) && rs->ctrl_avail) {
@@ -3138,10 +3146,17 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+out:
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
 
-	return 0;
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
+	return ret;
 }
 
 static void ds_shutdown(struct rsocket *rs)


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-08-23  0:35 [ceph-users] Help needed porting Ceph to RSockets Hefty, Sean
@ 2013-09-10 13:48 ` Andreas Bluemle
  2013-09-12 10:20   ` Gandalf Corvotempesta
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Andreas Bluemle @ 2013-09-10 13:48 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

[-- Attachment #1: Type: text/plain, Size: 6755 bytes --]

Hi,

after some more analysis and debugging, I found
workarounds for my problems; I have added these workarounds
to the last version of the patch for the poll problem by Sean;
see the attachment to this posting.

The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

   With rsockets, when a shutdown is done on side A of a socket
   connection, then the shutdown will only return after side B
   has done a close() on its end of the connection.

   This is different from TCP/IP sockets: there a shutdown will cause
   the other end to terminate the connection at the TCP level
   instantly. The socket changes state into CLOSE_WAIT, which indicates
   that the application level close is outstanding.

   In the attached patch, the workaround is in rs_poll_cq(),
   case RS_OP_CTRL, where for a RS_CTRL_DISCONNECT the rshutdown()
   is called on side B; this will cause the termination of the
   socket connection to be acknowledged to side A, and the shutdown()
   there can now terminate.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

   When an application does a shutdown() of side A and does a 2nd
   shutdown() shortly after (for whatever reason) then the
   return of the 2nd shutdown() is delayed by 2 seconds.

   The delay happens in rdma_disconnect(), when this is called
   from rshutdown() in the case that the rsocket state is
   rs_disconnected.

   Even if it could be considered a bug if an application
   calls shutdown() twice on the same socket, it still
   does not make sense to delay that 2nd call to shutdown().

   To workaround this, I have
   - introduced an additional rsocket state: rs_shutdown
   - switch to that new state in rshutdown() at the very end
     of the function.

   The first call to shutdown() will therefore switch to the new
   rsocket state rs_shutdown - and any further call to rshutdown()
   will not do anything any more, because every effect of rshutdown()
   will only happen if the rsocket state is either rs_connected or
   rs_disconnected. Hence it would be better to explicitly check
   the rsocket state at the beginning of the function and return
   immediately if the state is rs_shutdown.

Since I have added these workarounds to my version of the librdmacm
library, I can at least start up ceph using LD_PRELOAD and end up in
a healthy ceph cluster state.

I would not call these workarounds a real fix, but they should point
out the problems which I am trying to solve.


Regards

Andreas Bluemle




On Fri, 23 Aug 2013 00:35:22 +0000
"Hefty, Sean" <sean.hefty@intel.com> wrote:

> > I tested out the patch and unfortunately had the same results as
> > Andreas. About 50% of the time the rpoll() thread in Ceph still
> > hangs when rshutdown() is called. I saw a similar behaviour when
> > increasing the poll time on the pre-patched version if that's of
> > any relevance.
> 
> I'm not optimistic, but here's an updated patch.  I attempted to
> handle more shutdown conditions, but I can't say that any of those
> would prevent the hang that you see.
> 
> I have a couple of questions: 
> 
> Is there any chance that the code would call rclose while rpoll
> is still running?  Also, can you verify that the thread is in the
> real poll() call when the hang occurs?
> 
> Signed-off-by: Sean Hefty <sean.hefty@intel.com>
> ---
>  src/rsocket.c |   35 +++++++++++++++++++++++++----------
>  1 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..f94ddf3 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
>  					rs->state = rs_disconnected;
>  					return 0;
>  				} else if (rs_msg_data(msg) ==
> RS_CTRL_SHUTDOWN) {
> -					rs->state &= ~rs_readable;
> +					if (rs->state & rs_writable)
> {
> +						rs->state &=
> ~rs_readable;
> +					} else {
> +						rs->state =
> rs_disconnected;
> +						return 0;
> +					}
>  				}
>  				break;
>  			case RS_OP_WRITE:
> @@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd
> *rfds, struct pollfd *fds, nfds_t nfds) 
>  		rs = idm_lookup(&idm, fds[i].fd);
>  		if (rs) {
> +			fastlock_acquire(&rs->cq_wait_lock);
>  			if (rs->type == SOCK_STREAM)
>  				rs_get_cq_event(rs);
>  			else
>  				ds_get_cq_event(rs);
> +			fastlock_release(&rs->cq_wait_lock);
>  			fds[i].revents = rs_poll_rs(rs,
> fds[i].events, 1, rs_poll_all); } else {
>  			fds[i].revents = rfds[i].revents;
> @@ -3098,7 +3105,8 @@ int rselect(int nfds, fd_set *readfds, fd_set
> *writefds, 
>  /*
>   * For graceful disconnect, notify the remote side that we're
> - * disconnecting and wait until all outstanding sends complete.
> + * disconnecting and wait until all outstanding sends complete,
> provided
> + * that the remote side has not sent a disconnect message.
>   */
>  int rshutdown(int socket, int how)
>  {
> @@ -3106,11 +3114,6 @@ int rshutdown(int socket, int how)
>  	int ctrl, ret = 0;
>  
>  	rs = idm_at(&idm, socket);
> -	if (how == SHUT_RD) {
> -		rs->state &= ~rs_readable;
> -		return 0;
> -	}
> -
>  	if (rs->fd_flags & O_NONBLOCK)
>  		rs_set_nonblocking(rs, 0);
>  
> @@ -3118,15 +3121,20 @@ int rshutdown(int socket, int how)
>  		if (how == SHUT_RDWR) {
>  			ctrl = RS_CTRL_DISCONNECT;
>  			rs->state &= ~(rs_readable | rs_writable);
> -		} else {
> +		} else if (how == SHUT_WR) {
>  			rs->state &= ~rs_writable;
>  			ctrl = (rs->state & rs_readable) ?
>  				RS_CTRL_SHUTDOWN :
> RS_CTRL_DISCONNECT;
> +		} else {
> +			rs->state &= ~rs_readable;
> +			if (rs->state & rs_writable)
> +				goto out;
> +			ctrl = RS_CTRL_DISCONNECT;
>  		}
>  		if (!rs->ctrl_avail) {
>  			ret = rs_process_cq(rs, 0,
> rs_conn_can_send_ctrl); if (ret)
> -				return ret;
> +				goto out;
>  		}
>  
>  		if ((rs->state & rs_connected) && rs->ctrl_avail) {
> @@ -3138,10 +3146,17 @@ int rshutdown(int socket, int how)
>  	if (rs->state & rs_connected)
>  		rs_process_cq(rs, 0, rs_conn_all_sends_done);
>  
> +out:
>  	if ((rs->fd_flags & O_NONBLOCK) && (rs->state &
> rs_connected)) rs_set_nonblocking(rs, rs->fd_flags);
>  
> -	return 0;
> +	if (rs->state & rs_disconnected) {
> +		/* Generate event by flushing receives to unblock
> rpoll */
> +		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> +		rdma_disconnect(rs->cm_id);
> +	}
> +
> +	return ret;
>  }
>  
>  static void ds_shutdown(struct rsocket *rs)
> 
> 
> 



-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm

[-- Attachment #2: rsocket-poll-hefty2+shutdown-andy.patch --]
[-- Type: text/x-patch, Size: 3094 bytes --]

diff --git a/src/rsocket.c b/src/rsocket.c
index abdd392..76fbb85 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -206,6 +206,7 @@ enum rs_state {
 	rs_connect_error   =		    0x0800,
 	rs_disconnected	   =		    0x1000,
 	rs_error	   =		    0x2000,
+	rs_shutdown	   =		    0x4000,
 };
 
 #define RS_OPT_SWAP_SGL	(1 << 0)
@@ -1786,9 +1787,15 @@ static int rs_poll_cq(struct rsocket *rs)
 			case RS_OP_CTRL:
 				if (rs_msg_data(msg) == RS_CTRL_DISCONNECT) {
 					rs->state = rs_disconnected;
+					rshutdown(rs->index, SHUT_RDWR);
 					return 0;
 				} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-					rs->state &= ~rs_readable;
+					if (rs->state & rs_writable) {
+						rs->state &= ~rs_readable;
+					} else {
+						rs->state = rs_disconnected;
+						return 0;
+					}
 				}
 				break;
 			case RS_OP_WRITE:
@@ -2914,10 +2921,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
 		rs = idm_lookup(&idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3064,7 +3073,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3072,11 +3082,6 @@ int rshutdown(int socket, int how)
 	int ctrl, ret = 0;
 
 	rs = idm_at(&idm, socket);
-	if (how == SHUT_RD) {
-		rs->state &= ~rs_readable;
-		return 0;
-	}
-
 	if (rs->fd_flags & O_NONBLOCK)
 		rs_set_nonblocking(rs, 0);
 
@@ -3084,15 +3089,20 @@ int rshutdown(int socket, int how)
 		if (how == SHUT_RDWR) {
 			ctrl = RS_CTRL_DISCONNECT;
 			rs->state &= ~(rs_readable | rs_writable);
-		} else {
+		} else if (how == SHUT_WR) {
 			rs->state &= ~rs_writable;
 			ctrl = (rs->state & rs_readable) ?
 				RS_CTRL_SHUTDOWN : RS_CTRL_DISCONNECT;
+		} else {
+			rs->state &= ~rs_readable;
+			if (rs->state & rs_writable)
+				goto out;
+			ctrl = RS_CTRL_DISCONNECT;
 		}
 		if (!rs->ctrl_avail) {
 			ret = rs_process_cq(rs, 0, rs_conn_can_send_ctrl);
 			if (ret)
-				return ret;
+				goto out;
 		}
 
 		if ((rs->state & rs_connected) && rs->ctrl_avail) {
@@ -3104,10 +3114,19 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+out:
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
 
-	return 0;
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		rdma_disconnect(rs->cm_id);
+	}
+
+	rs->state = rs_shutdown;
+
+	return ret;
 }
 
 static void ds_shutdown(struct rsocket *rs)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-09-10 13:48 ` Andreas Bluemle
@ 2013-09-12 10:20   ` Gandalf Corvotempesta
       [not found]     ` <CAJH6TXg94x+jcc=1MQQoQaF5JcGKuxbG_SzLEMt3EVQOrFeNjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-20 23:47   ` Hefty, Sean
  2013-10-30 23:25   ` Hefty, Sean
  2 siblings, 1 reply; 19+ messages in thread
From: Gandalf Corvotempesta @ 2013-09-12 10:20 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Hefty, Sean, Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

2013/9/10 Andreas Bluemle <andreas.bluemle@itxperts.de>:
> Since I have added these workarounds to my version of the librdmacm
> library, I can at least start up ceph using LD_PRELOAD and end up in
> a healthy ceph cluster state.

Have you seen any performance improvement by using LD_PRELOAD with ceph?
What throughput are you able to achieve with rsockets and ceph?


* Re: [ceph-users] Help needed porting Ceph to RSockets
       [not found]     ` <CAJH6TXg94x+jcc=1MQQoQaF5JcGKuxbG_SzLEMt3EVQOrFeNjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-12 12:12       ` Andreas Bluemle
  2013-09-16 13:29         ` Gandalf Corvotempesta
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Bluemle @ 2013-09-12 12:12 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Hefty, Sean, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Thu, 12 Sep 2013 12:20:03 +0200
Gandalf Corvotempesta <gandalf.corvotempesta-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> 2013/9/10 Andreas Bluemle <andreas.bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org>:
> > Since I have added these workarounds to my version of the librdmacm
> > library, I can at least start up ceph using LD_PRELOAD and end up in
> > a healthy ceph cluster state.
> 
> Have you seen any performance improvement by using LD_PRELOAD with
> ceph? What throughput are you able to achieve with rsockets and ceph?

I have not yet done any performance testing.

The next step I have to take is to set up a larger cluster with
something like 150 OSDs without hitting any resource limitations.

Regards

Andreas Bluemle


-- 
Andreas Bluemle                     mailto:Andreas.Bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm


* Re: [ceph-users] Help needed porting Ceph to RSockets
  2013-09-12 12:12       ` Andreas Bluemle
@ 2013-09-16 13:29         ` Gandalf Corvotempesta
  0 siblings, 0 replies; 19+ messages in thread
From: Gandalf Corvotempesta @ 2013-09-16 13:29 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Hefty, Sean, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

2013/9/12 Andreas Bluemle <andreas.bluemle-jy6uzZffzV0b1SvskN2V4Q@public.gmane.org>:
> I have not yet done any performance testing.
>
> The next step I have to take is to set up a larger cluster with
> something like 150 OSDs without hitting any resource limitations.

How do you manage failover? Will you use multiple HBAs (or dual-port
HBAs) on each node, or just a single IB fabric?


* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-09-10 13:48 ` Andreas Bluemle
  2013-09-12 10:20   ` Gandalf Corvotempesta
@ 2013-09-20 23:47   ` Hefty, Sean
  2013-10-30 23:25   ` Hefty, Sean
  2 siblings, 0 replies; 19+ messages in thread
From: Hefty, Sean @ 2013-09-20 23:47 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> I would not call these workarounds a real fix, but they should point
> out the problems which I am trying to solve.

Thanks for the update.  I haven't had the time to investigate this, but did want to at least acknowledge that this hasn't gotten lost.

- Sean


* RE: [ceph-users] Help needed porting Ceph to RSockets
  2013-09-10 13:48 ` Andreas Bluemle
  2013-09-12 10:20   ` Gandalf Corvotempesta
  2013-09-20 23:47   ` Hefty, Sean
@ 2013-10-30 23:25   ` Hefty, Sean
       [not found]     ` <1828884A29C6694DAF28B7E6B8A8237388CF3072-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2 siblings, 1 reply; 19+ messages in thread
From: Hefty, Sean @ 2013-10-30 23:25 UTC (permalink / raw)
  To: Andreas Bluemle
  Cc: Matthew Anderson, ceph-devel@vger.kernel.org,
	linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)

[-- Attachment #1: Type: text/plain, Size: 6951 bytes --]

Sorry it took so long to get to this.

> after some more analysis and debugging, I found workarounds for
> my problems; I have added these workarounds to the latest version
> of Sean's patch for the poll problem; see the attachment to this
> posting.
> 
> The shutdown() operations below are all SHUT_RDWR.
> 
> 1. shutdown() on side A of a connection waits for close() on side B
> 
>    With rsockets, when a shutdown is done on side A of a socket
>    connection, the shutdown only returns after side B has done a
>    close() on its end of the connection.
>
>    This is different from TCP/IP sockets: there, a shutdown causes
>    the other end to see the connection terminated at the TCP level
>    immediately. The peer socket changes state to CLOSE_WAIT, which
>    indicates that the application-level close is still outstanding.
> 
>    In the attached patch, the workaround is in rs_poll_cq(),
>    case RS_OP_CTRL, where for an RS_CTRL_DISCONNECT message
>    rshutdown() is called on side B; this causes the termination of
>    the socket connection to be acknowledged to side A, so the
>    shutdown() there can now return.
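
For comparison, here is a minimal sketch of the plain-sockets behaviour
described in the quoted text (a socketpair stands in for a real TCP
connection just to keep the example self-contained; with TCP, side B
would additionally sit in CLOSE_WAIT until it calls close()):

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int sv[2];
	char buf[16];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	shutdown(sv[0], SHUT_RDWR);	/* side A: returns immediately */

	n = recv(sv[1], buf, sizeof(buf), 0);
	printf("side B recv() returned %zd (0 == orderly shutdown)\n", n);

	close(sv[1]);			/* side B closes at its leisure */
	close(sv[0]);
	return 0;
}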

rsockets should behave similarly to sockets here.  I believe I introduced the problem with my patch to fix the shutdown hang; the line in question added rdma_disconnect() to the rshutdown() call.  I have replaced that disconnect call with one that simply transitions the underlying QP into an error state.
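
At the verbs level that transition amounts to roughly the following
(a sketch only, with qp_to_error as a made-up name; the attached patch
instead calls the new internal ucma_shutdown() helper from rshutdown(),
which for IB devices does ucma_modify_qp_err(), and it also re-arms CQ
notification with ibv_req_notify_cq() so a blocked rpoll() is woken by
the flushed receives):

#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: forcing a QP into the error state causes outstanding and
 * future work requests to complete with a flush error, generating
 * completions without going through a full rdma_disconnect().
 */
static int qp_to_error(struct ibv_qp *qp)
{
	struct ibv_qp_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.qp_state = IBV_QPS_ERR;
	return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
}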

> 2. double (multiple) shutdown on side A: delay on 2nd shutdown
> 
>    When an application does a shutdown() of side A and then does a
>    second shutdown() shortly afterwards (for whatever reason), the
>    return of the second shutdown() is delayed by 2 seconds.
> 
>    The delay happens in rdma_disconnect() when it is called from
>    rshutdown() while the rsocket state is rs_disconnected.
> 
>    Even if calling shutdown() twice on the same socket could be
>    considered an application bug, it still does not make sense to
>    delay that second call.

I'd like to avoid calling shutdown internally on the user's behalf.  The internal shutdown may be the reason shutdown ends up being called twice.

I don't understand why you see a 2-second delay, but hopefully removing the overhead of the full rdma_disconnect() call fixes things.
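
For reference, the application-level pattern being discussed is simply
the following (an illustrative sketch; shutdown_twice and rfd are
made-up names):

#include <sys/socket.h>
#include <rdma/rsocket.h>

/* The second rshutdown() used to take the rdma_disconnect() path once
 * the rsocket had already reached rs_disconnected, which is where the
 * delay described above was observed.
 */
static void shutdown_twice(int rfd)
{
	rshutdown(rfd, SHUT_RDWR);	/* sends the disconnect control message */
	rshutdown(rfd, SHUT_RDWR);	/* second call previously stalled here */
}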
 
Can you please try the attached patch in place of all previous patches?

Thanks
- Sean


rsockets: Handle race between rshutdown and rpoll

From: Sean Hefty <sean.hefty@intel.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---
 src/cma.c     |   23 +++++++++++++----------
 src/cma.h     |    1 +
 src/rsocket.c |   35 +++++++++++++++++++++++++----------
 3 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/src/cma.c b/src/cma.c
index 374844c..4f41879 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -1543,22 +1543,25 @@ int rdma_notify(struct rdma_cm_id *id, enum ibv_event_type event)
 	return 0;
 }
 
-int rdma_disconnect(struct rdma_cm_id *id)
+int ucma_shutdown(struct rdma_cm_id *id)
 {
-	struct ucma_abi_disconnect cmd;
-	struct cma_id_private *id_priv;
-	int ret;
-
 	switch (id->verbs->device->transport_type) {
 	case IBV_TRANSPORT_IB:
-		ret = ucma_modify_qp_err(id);
-		break;
+		return ucma_modify_qp_err(id);
 	case IBV_TRANSPORT_IWARP:
-		ret = ucma_modify_qp_sqd(id);
-		break;
+		return ucma_modify_qp_sqd(id);
 	default:
-		ret = ERR(EINVAL);
+		return ERR(EINVAL);
 	}
+}
+
+int rdma_disconnect(struct rdma_cm_id *id)
+{
+	struct ucma_abi_disconnect cmd;
+	struct cma_id_private *id_priv;
+	int ret;
+
+	ret = ucma_shutdown(id);
 	if (ret)
 		return ret;
 
diff --git a/src/cma.h b/src/cma.h
index e944a9a..4c991b4 100644
--- a/src/cma.h
+++ b/src/cma.h
@@ -150,6 +150,7 @@ void ucma_set_sid(enum rdma_port_space ps, struct sockaddr *addr,
 		  struct sockaddr_ib *sib);
 int ucma_max_qpsize(struct rdma_cm_id *id);
 int ucma_complete(struct rdma_cm_id *id);
+int ucma_shutdown(struct rdma_cm_id *id);
 
 static inline int ERR(int err)
 {
diff --git a/src/rsocket.c b/src/rsocket.c
index d544dd0..c7a491b 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
 					rs->state = rs_disconnected;
 					return 0;
 				} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
-					rs->state &= ~rs_readable;
+					if (rs->state & rs_writable) {
+						rs->state &= ~rs_readable;
+					} else {
+						rs->state = rs_disconnected;
+						return 0;
+					}
 				}
 				break;
 			case RS_OP_WRITE:
@@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
 
 		rs = idm_lookup(&idm, fds[i].fd);
 		if (rs) {
+			fastlock_acquire(&rs->cq_wait_lock);
 			if (rs->type == SOCK_STREAM)
 				rs_get_cq_event(rs);
 			else
 				ds_get_cq_event(rs);
+			fastlock_release(&rs->cq_wait_lock);
 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
 		} else {
 			fds[i].revents = rfds[i].revents;
@@ -3098,7 +3105,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
 
 /*
  * For graceful disconnect, notify the remote side that we're
- * disconnecting and wait until all outstanding sends complete.
+ * disconnecting and wait until all outstanding sends complete, provided
+ * that the remote side has not sent a disconnect message.
  */
 int rshutdown(int socket, int how)
 {
@@ -3106,11 +3114,6 @@ int rshutdown(int socket, int how)
 	int ctrl, ret = 0;
 
 	rs = idm_at(&idm, socket);
-	if (how == SHUT_RD) {
-		rs->state &= ~rs_readable;
-		return 0;
-	}
-
 	if (rs->fd_flags & O_NONBLOCK)
 		rs_set_nonblocking(rs, 0);
 
@@ -3118,15 +3121,20 @@ int rshutdown(int socket, int how)
 		if (how == SHUT_RDWR) {
 			ctrl = RS_CTRL_DISCONNECT;
 			rs->state &= ~(rs_readable | rs_writable);
-		} else {
+		} else if (how == SHUT_WR) {
 			rs->state &= ~rs_writable;
 			ctrl = (rs->state & rs_readable) ?
 				RS_CTRL_SHUTDOWN : RS_CTRL_DISCONNECT;
+		} else {
+			rs->state &= ~rs_readable;
+			if (rs->state & rs_writable)
+				goto out;
+			ctrl = RS_CTRL_DISCONNECT;
 		}
 		if (!rs->ctrl_avail) {
 			ret = rs_process_cq(rs, 0, rs_conn_can_send_ctrl);
 			if (ret)
-				return ret;
+				goto out;
 		}
 
 		if ((rs->state & rs_connected) && rs->ctrl_avail) {
@@ -3138,10 +3146,17 @@ int rshutdown(int socket, int how)
 	if (rs->state & rs_connected)
 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
 
+out:
 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
 		rs_set_nonblocking(rs, rs->fd_flags);
 
-	return 0;
+	if (rs->state & rs_disconnected) {
+		/* Generate event by flushing receives to unblock rpoll */
+		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
+		ucma_shutdown(rs->cm_id);
+	}
+
+	return ret;
 }
 
 static void ds_shutdown(struct rsocket *rs)



[-- Attachment #2: shutdown --]
[-- Type: application/octet-stream, Size: 4327 bytes --]


* Re: [ceph-users] Help needed porting Ceph to RSockets
       [not found]     ` <1828884A29C6694DAF28B7E6B8A8237388CF3072-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-02-05 13:58       ` Gandalf Corvotempesta
  0 siblings, 0 replies; 19+ messages in thread
From: Gandalf Corvotempesta @ 2014-02-05 13:58 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Andreas Bluemle, Matthew Anderson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

2013-10-31 Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>:
> Can you please try the attached patch in place of all previous patches?

Any updates on ceph with rsockets?


end of thread

Thread overview: 19+ messages
2013-08-23  0:35 [ceph-users] Help needed porting Ceph to RSockets Hefty, Sean
2013-09-10 13:48 ` Andreas Bluemle
2013-09-12 10:20   ` Gandalf Corvotempesta
     [not found]     ` <CAJH6TXg94x+jcc=1MQQoQaF5JcGKuxbG_SzLEMt3EVQOrFeNjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-12 12:12       ` Andreas Bluemle
2013-09-16 13:29         ` Gandalf Corvotempesta
2013-09-20 23:47   ` Hefty, Sean
2013-10-30 23:25   ` Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A8237388CF3072-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-02-05 13:58       ` Gandalf Corvotempesta
     [not found] <CAJA05UmTDGze7S50_j8RgHmPFYonk2Z94Fi6CQtxND9QxgRr3g@mail.gmail.com>
     [not found] ` <CAJA05Umgio7XftGZtQdRzyWq70pXBb3oEx2jrT3BxcBr6MLoVQ@mail.gmail.com>
     [not found]   ` <20130812075513.43c338e1@andylap>
     [not found]     ` <CAJA05U=i22jHy2xbvCUk+UVN0+hOFDgt-JMSHTWKmr1+DZu0Dg@mail.gmail.com>
     [not found]       ` <20130812180644.447ca089@andylap>
     [not found]         ` <CAJA05U=KqzYis14sYXusbpk-T-F=uqvNz1XPurutmdeRz6BAdg@mail.gmail.com>
     [not found]           ` <20130813075312.7cac0d46@andylap>
     [not found]             ` <20130813160612.037ea9f2@andylap>
2013-08-13 14:35               ` Atchley, Scott
     [not found]                 ` <1978D1F9-C675-4A37-AA57-C7E1158B2F72-1Heg1YXhbW8@public.gmane.org>
2013-08-13 21:44                   ` Hefty, Sean
2013-08-14  7:21                     ` Andreas Bluemle
2013-08-14 13:05                       ` Atchley, Scott
2013-08-14 17:04                       ` Hefty, Sean
2013-08-17  0:07                       ` Hefty, Sean
2013-08-19 17:10                       ` Hefty, Sean
2013-08-20  7:21                         ` Andreas Bluemle
2013-08-20 10:30                           ` Andreas Bluemle
2013-08-20 15:04                             ` Hefty, Sean
     [not found]                               ` <1828884A29C6694DAF28B7E6B8A8237388CA6F25-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-08-21 11:44                                 ` Matthew Anderson
