public inbox for linux-rdma@vger.kernel.org
From: Andreas Bluemle <andreas.bluemle@itxperts.de>
To: "Hefty, Sean" <sean.hefty@intel.com>
Cc: "Atchley, Scott" <atchleyes@ornl.gov>,
	Matthew Anderson <manderson8787@gmail.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	"linux-rdma@vger.kernel.org (linux-rdma@vger.kernel.org)"
	<linux-rdma@vger.kernel.org>
Subject: Re: [ceph-users] Help needed porting Ceph to RSockets
Date: Wed, 14 Aug 2013 09:21:24 +0200	[thread overview]
Message-ID: <20130814092124.3f8c3b03@doppio> (raw)
In-Reply-To: <1828884A29C6694DAF28B7E6B8A8237388CA5A0F@ORSMSX109.amr.corp.intel.com>

Hi,

First, some information about the environment I am
working in:

- CentOS 6.4 with custom kernel 3.8.13
- librdmacm / librspreload from git, tag 1.0.17
- application started with librspreload in LD_PRELOAD environment

Currently, I have increased the spin time by changing the
default value of polling_time in the source code.

I guess that the correct way to do this is via
configuration in /etc/rdma/rsocket/polling_time?

Concerning the rpoll() itself, some more comments/questions
embedded below.

On Tue, 13 Aug 2013 21:44:42 +0000
"Hefty, Sean" <sean.hefty@intel.com> wrote:

> > > I found a workaround for my (our) problem: in the librdmacm
> > > code, rsocket.c, there is a global constant polling_time, which
> > > is set to 10 microseconds at the moment.
> > >
> > > I raise this to 10000 - and all of a sudden things work nicely.
> > 
> > I am adding the linux-rdma list to CC so Sean might see this.
> > 
> > If I understand what you are describing, the caller to rpoll()
> > spins for up to 10 ms (10,000 us) before calling the real poll().
> > 
> > What is the purpose of the real poll() call? Is it simply a means
> > to block the caller and avoid spinning? Or does it actually expect
> > to detect an event?
> 
> When the real poll() is called, an event is expected on an fd
> associated with the CQ's completion channel. 

The first question I would have is: why is the rpoll() split into
these two pieces? There must have been some reason to do a busy
loop on some local state information rather than just call the
real poll() directly.

> > > I think we are looking at two issues here:
> > > 1. the thread structure of ceph messenger
> > >   For a given socket connection, there are 3 threads of interest
> > >   here: the main messenger thread, the Pipe::reader and the
> > >   Pipe::writer.
> > >
> > >   For a ceph client like the ceph admin command, I see the
> > > following sequence
> > >     - the connection to the ceph monitor is created by the
> > >       main messenger  thread, the Pipe::reader and
> > >       Pipe::writer are instantiated.
> > >     - the requested command is sent to the ceph monitor, the
> > >       answer is read and printed
> > >     - at this point the Pipe::reader already has called
> > >       tcp_read_wait(), polling for more data or connection
> > > termination
> > >     - after the response had been printed, the main loop calls the
> > >       shutdown routines which in turn shutdown() the socket
> > >
> > >    There is some time between the last two steps - and this gap is
> > >    long enough to open a race:
> > >
> > > 2. rpoll, ibv and poll
> > >   the rpoll implementation in rsockets is split in 2 phases:
> > >   - a busy loop which checks the state of the underlying ibv
> > > queue pair
> > >   - the call to real poll() system call (i.e. the uverbs(?)
> > >     implementation of poll() inside the kernel)
> > >
> > >   The busy loop has a maximum duration of polling_time (10
> > > microseconds by default) - and is able to detect the local shutdown
> > > and returns a POLLHUP.
> > >
> > >   The poll() system call (i.e. the uverbs implementation of poll()
> > >   in the kernel) does not detect the local shutdown - and only
> > > returns after the caller supplied timeout expires.
> 
> It sounds like there's an issue here either with a message getting
> lost or a race.  Given that spinning longer works for you, it sounds
> like an event is getting lost, not being generated correctly, or not
> being configured to generate.
> 

I am looking at a multithreaded application here, and I believe that
the race is between thread A calling rpoll() for a POLLIN event and
thread B calling shutdown(SHUT_RDWR) on the (r)socket almost
immediately afterwards.

I think that the shutdown itself does not cause a POLLHUP event to be
generated from the kernel to interrupt the real poll().
(BTW: which kernel module implements the poll() for rsockets?
Is that ib_uverbs.ko with ib_uverbs_poll_cq()?)

> > > Increasing the rsockets polling_time from 10 to 10000
> > > microseconds results in the rpoll to detect the local shutdown
> > > within the busy loop.
> > >
> > > Decreasing the ceph "ms tcp read timeout" from the default of 900
> > > to 5 seconds serves a similar purpose, but is much coarser.
> > >
> > > From my understanding, the real issue is neither at the ceph nor
> > > at the rsockets level: it is related to the uverbs kernel module.
> > >
> > > An alternative way to address the current problem at the rsockets
> > > level would be a re-write of the rpoll(): instead of the busy
> > > loop at the beginning followed by the real poll() call with the
> > > full user-specified timeout value ("ms tcp read timeout" in our
> > > case), I would embed the real poll() into a loop, splitting the
> > > user-specified timeout into smaller portions and doing the
> > > rsockets specific rs_poll_check() on every timeout of the real
> > > poll().
> > 
> > I have not looked at the rsocket code, so take the following with a
> > grain of salt. If the purpose of the real poll() is to simply block
> > the user for a specified time, then you can simply make it a short
> > duration (taking into consideration what granularity the OS
> > provides) and then call ibv_poll_cq(). Keep in mind, polling will
> > prevent your CPU from reducing power.
> > 
> > If the real poll() is actually checking for something (e.g.
> > checking on the RDMA channel's fd or the IB channel's fd), then you
> > may not want to spin too much.
> 
> The real poll() call is intended to block the application until a
> timeout occurs or an event shows up.  Since increasing the spin time
> works for you, it makes me suspect that there is a bug in the CQ
> event handling in rsockets. 

> > >>> What's particularly weird is that the monitor receives a POLLHUP
> > >>> event when the ceph command shuts down its socket but the ceph
> > >>> command never does. When using regular sockets both sides of the
> > >>> connection receive a POLLIN | POLLHUP | POLLRDHUP event when the
> > >>> sockets are shut down. It would seem like there is a bug in
> > >>> rsockets that causes the side that calls shutdown first not to
> > >>> receive the correct rpoll events.
> 
> rsockets does not support POLLRDHUP.
> 

I don't think the issue is POLLRDHUP.
I think the issue is POLLHUP and/or POLLIN.
My impression is that after a local shutdown of the (r)socket, a
poll for the POLLIN event should at least return POLLIN, and a
subsequent read should return 0 bytes, indicating EOF.
But the POLLIN is not generated by the layer below
rsockets (ib_uverbs.ko?) as far as I can tell.

See also: http://www.greenend.org.uk/rjk/tech/poll.html

Best Regards

Andreas Bluemle

> - Sean




-- 
Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
ITXperts GmbH                       http://www.itxperts.de
Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm


Thread overview: 19+ messages
     [not found] <CAJA05UmTDGze7S50_j8RgHmPFYonk2Z94Fi6CQtxND9QxgRr3g@mail.gmail.com>
     [not found] ` <CAJA05Umgio7XftGZtQdRzyWq70pXBb3oEx2jrT3BxcBr6MLoVQ@mail.gmail.com>
     [not found]   ` <20130812075513.43c338e1@andylap>
     [not found]     ` <CAJA05U=i22jHy2xbvCUk+UVN0+hOFDgt-JMSHTWKmr1+DZu0Dg@mail.gmail.com>
     [not found]       ` <20130812180644.447ca089@andylap>
     [not found]         ` <CAJA05U=KqzYis14sYXusbpk-T-F=uqvNz1XPurutmdeRz6BAdg@mail.gmail.com>
     [not found]           ` <20130813075312.7cac0d46@andylap>
     [not found]             ` <20130813160612.037ea9f2@andylap>
2013-08-13 14:35               ` [ceph-users] Help needed porting Ceph to RSockets Atchley, Scott
     [not found]                 ` <1978D1F9-C675-4A37-AA57-C7E1158B2F72-1Heg1YXhbW8@public.gmane.org>
2013-08-13 21:44                   ` Hefty, Sean
2013-08-14  7:21                     ` Andreas Bluemle [this message]
2013-08-14 13:05                       ` Atchley, Scott
2013-08-14 17:04                       ` Hefty, Sean
2013-08-17  0:07                       ` Hefty, Sean
2013-08-19 17:10                       ` Hefty, Sean
2013-08-20  7:21                         ` Andreas Bluemle
2013-08-20 10:30                           ` Andreas Bluemle
2013-08-20 15:04                             ` Hefty, Sean
     [not found]                               ` <1828884A29C6694DAF28B7E6B8A8237388CA6F25-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-08-21 11:44                                 ` Matthew Anderson
2013-08-23  0:35 Hefty, Sean
2013-09-10 13:48 ` Andreas Bluemle
2013-09-12 10:20   ` Gandalf Corvotempesta
     [not found]     ` <CAJH6TXg94x+jcc=1MQQoQaF5JcGKuxbG_SzLEMt3EVQOrFeNjw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-12 12:12       ` Andreas Bluemle
2013-09-16 13:29         ` Gandalf Corvotempesta
2013-09-20 23:47   ` Hefty, Sean
2013-10-30 23:25   ` Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A8237388CF3072-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-02-05 13:58       ` Gandalf Corvotempesta
