From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============5749751357144168039=="
MIME-Version: 1.0
From: Walker, Benjamin <benjamin.walker at intel.com>
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Date: Fri, 25 May 2018 21:26:03 +0000
Message-ID: <1527283562.55770.73.camel@intel.com>
In-Reply-To: MWHPR06MB2558CED974739A13B11CA842E5690@MWHPR06MB2558.namprd06.prod.outlook.com
List-ID: <spdk@lists.01.org>
To: spdk@lists.01.org

--===============5749751357144168039==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with the
> points you are making here and the design goals for SPDK. I'll try to talk
> some more about the design for the patch, the advantages, and see if we can
> improve on this. I'll also try to answer the three valid concerns
> you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app better
> flexibility in terms of the threading model. At a very high abstract level
> this could be looked at as being similar to a green threading framework.
> I'm no expert in green threading frameworks, but, based on my reading, the
> similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a 1-1
> relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread takes on the properties of the
> virtual thread when it is running on a reactor. The implementation-specific
> part of that was the change of the TLS _lcore value to the reactor id.

I think we'll need to iterate on what the basic primitives are (especially
their names), but I'm generally leaving that discussion for slightly later on
in the design. For now, I agree with the direction above. I'm going to
temporarily use the words "spdk_thread" for the virtual thread and "DPDK
thread" for the native thread.

>
> 2) Advantages:
> This was the first iteration of what can become a more solidified solution as
> we go along. There are gaps in this approach and I think we can mitigate some
> of those concerns.
> - It gives applications a switch. They may choose to not do this at all and
> then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs and the design do not
> change.

I generally don't like to have switches that control behavior this
fundamental, where possible. It would be ideal to have the old behavior fall
out of the new behavior whenever you happen to create one spdk_thread per DPDK
thread.

> - I think it helps apps to seamlessly adapt to a dynamic threading model with
> their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So we could, in
> one way, solve the problem of trying to move a QP from one thread to another
> for load balancing.
> - It gives apps certain QoS capabilities. Apps can create "gold", "silver" and
> "bronze" reactor rings. The number of bare-metal threads working on each ring
> may be different depending on how quickly certain QP's have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I wrote about this in the commit patch
> as a valid concern. But I think we can mitigate this. For applications that
> are concerned about cache thrashing, the option is to run the reactor on a
> bare metal thread for a longer period of time. In the patch I showed a crude
> way, where after 100 usec, the reactor is switched out. But that does not
> absolutely have to be the case. A reactor could run for a much longer period
> of time (10s of msec), allowing the benefits of CPU caching to be used. The
> other way to mitigate this is to make sure that the bare metal threads run on
> the same socket. Thus even when reactors are switched out, the cache at the
> socket layer is not invalidated.

Instead of automatically moving spdk_thread/reactor objects between DPDK
threads, what if moving an spdk_thread to a new DPDK thread was an explicit
operation performed by the application periodically?
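
As a rough illustration of what an explicit move could look like, here is a
minimal single-threaded sketch in C. All of the names in it (struct vthread,
struct native_thread, native_add, vthread_move, native_poll_once) are made up
for this example and are not existing SPDK or DPDK APIs. A real version would
also have to hand the spdk_thread off safely between running threads, most
likely by passing a message, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VTHREADS 8  /* hypothetical per-thread capacity for this sketch */

/* Hypothetical stand-in for an spdk_thread (the virtual thread). */
struct vthread {
    int id;
    int polls;  /* number of times this virtual thread has been polled */
};

/* Hypothetical stand-in for a DPDK (native) thread and the vthreads it owns. */
struct native_thread {
    int core;
    struct vthread *owned[MAX_VTHREADS];
    size_t count;
};

/* Attach a virtual thread to a native thread. Returns 0 on success. */
static int native_add(struct native_thread *nt, struct vthread *vt)
{
    assert(nt != NULL && vt != NULL);
    if (nt->count == MAX_VTHREADS) {
        return -1;
    }
    nt->owned[nt->count++] = vt;
    return 0;
}

/*
 * Explicit move: ownership changes only when the application calls this.
 * Returns 0 on success, -1 if 'to' is full or 'from' does not own 'vt'.
 */
static int vthread_move(struct vthread *vt, struct native_thread *from,
                        struct native_thread *to)
{
    assert(vt != NULL && from != NULL && to != NULL);
    if (to->count == MAX_VTHREADS) {
        return -1;
    }
    for (size_t i = 0; i < from->count; i++) {
        if (from->owned[i] == vt) {
            /* Fill the hole with the last entry, then append to 'to'. */
            from->owned[i] = from->owned[--from->count];
            to->owned[to->count++] = vt;
            return 0;
        }
    }
    return -1;
}

/* One pass of a native thread's loop: poll every virtual thread it owns. */
static void native_poll_once(struct native_thread *nt)
{
    for (size_t i = 0; i < nt->count; i++) {
        nt->owned[i]->polls++;
    }
}
```

The only point of the sketch is that nothing migrates unless the application
calls vthread_move(). Until then a native thread keeps polling the same set of
virtual threads, so whatever is in its CPU cache stays useful.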

>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on where the
> bare metal threads run. If we have NICs and SSDs spread across the two
> sockets, a more elegant solution can probably be designed where we create
> reactor rings per socket. Then we would have the capability to add the QP's to
> the right reactor (in the right ring) based on the NIC. That is an extension
> of the current design IMO.

It's not always clear what the right NUMA node to run on actually is. That's
because an spdk_thread has a set of I/O channels (queue pairs) that talk to
different devices. Sometimes you want to be on the same NUMA node as a
particular NIC, but other times as a particular SSD. Making the movement of
spdk_threads between DPDK threads an explicit operation performed by the
application would push this decision up into the application/user code, where
it knows best.
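
One possible application-side policy, sketched in C under stated assumptions:
the application tracks recent traffic per I/O channel along with the NUMA node
of the device behind it, weights NIC and SSD traffic however it sees fit, and
picks the node with the largest weighted total. The struct, the MAX_NUMA_NODES
limit, the weights, and pick_numa_node() are all hypothetical and exist only
to illustrate the idea; none of them are SPDK or DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMA_NODES 8  /* hypothetical upper bound for this sketch */

enum dev_kind { DEV_NIC, DEV_SSD };

/* Hypothetical per-channel bookkeeping kept by the application. */
struct io_channel_info {
    enum dev_kind kind;   /* what sits behind this channel */
    int numa_node;        /* NUMA node the device is attached to */
    unsigned long bytes;  /* recent traffic, measured by the application */
};

/*
 * Return the NUMA node with the largest weighted traffic, or -1 if no
 * channel carries any. Ties go to the lowest-numbered node. The weights let
 * the application say whether NIC or SSD locality matters more to it.
 */
static int pick_numa_node(const struct io_channel_info *ch, size_t n,
                          unsigned long nic_weight, unsigned long ssd_weight)
{
    unsigned long load[MAX_NUMA_NODES] = {0};
    int best = -1;

    assert(ch != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        unsigned long w = (ch[i].kind == DEV_NIC) ? nic_weight : ssd_weight;

        /* Ignore channels whose node is outside the range we track. */
        if (ch[i].numa_node < 0 || ch[i].numa_node >= MAX_NUMA_NODES) {
            continue;
        }
        load[ch[i].numa_node] += ch[i].bytes * w;
    }

    for (int node = 0; node < MAX_NUMA_NODES; node++) {
        if (load[node] > 0 && (best < 0 || load[node] > load[best])) {
            best = node;
        }
    }
    return best;
}
```

With equal weights the choice follows raw traffic; raising nic_weight pulls
the spdk_thread toward the NIC's node even when the SSDs move more bytes. The
library never has to guess, because the application supplies both the
measurements and the weights.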

>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the number of
> threads are not high. Also, the idea here is to run on a reactor for "some
> extended period of time". Given that the number of producer/consumers from
> this ring will be limited, I don't think the reactor ring will be a
> bottleneck. Compare and contrast this global reactor ring with the event queue
> ring that exists today. We use events for all callback events from the backend
> during IO. We definitely do not want the reactors to be swapped in and out at
> the rate of IO, but to hold on to a reactor for a somewhat larger period of
> time. When the application specific metrics show that these threads are doing
> more "useful work" versus "idle polling", we just add more threads. Eventually
> at high loads the number of threads will be the same as the number of reactors
> and thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days, as have
> a number of other community members, and some things are beginning to look a
> bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as opposed to a
> framework. By that, I mean that SPDK is intended to be integrated into other
> applications as opposed to existing code being integrated into SPDK. The
> community has done a lot of work to attempt to make that happen, with varying
> degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate DMA-safe
> memory. This stems from the strict requirement to avoid data copies. The
> problem would essentially go away if SPDK instead internally allocated
> DMA-safe memory and copied user data into those buffers, but the performance
> would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads. That
> means that many components (although not all) within SPDK imply that the
> application is using a certain threading model. Specifically, the threading
> model needs to look like cooperative multi-tasking, or futures and promises,
> or event loops, etc. So far the consensus seems to be that it is acceptable to
> assume there is some threading model that is conducive to message passing, but
> we don't want to specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified is that
> SPDK currently makes entirely too many assumptions about and places too many
> strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and should
> be high priority. The fix, ultimately, will be better abstractions around the
> underlying application's threading model. I hope we can design something that
> will enable people to plug SPDK into all sorts of frameworks - green threading
> frameworks, DPDK lthreads, Seastar, coroutine frameworks, etc. The more people
> we can get participating in this work, the better the abstractions will be, so
> please everyone chime in with requirements and ideas.
>
> The current set of patches breaks the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on this
> global list and pulls the next reactor and processes any waiting events and
> executes pollers, then places the reactor back on the list. I'm concerned
> about three things with this design:
>
> * Since the reactors now potentially execute on a different core each time
> through their loop, the CPU cache is going to be badly thrashed. I suspect the
> performance hit here is very large and continues to grow as additional threads
> are added. SPDK is designed to scale linearly with the addition of CPU cores
> as much as possible, and I think it would be a mistake to move away from
> that.
> * All NUMA-awareness has been lost. Placing the processing of I/O on the same
> NUMA node as the NIC or SSD is critical to achieving high performance, so the
> code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic variables
> controlling the head and tail of that queue are going to be highly contended
> and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the patch
> review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don't use DPDK. This is why the
> > first set of changes we proposed last year were to abstract out the
> > dependencies on DPDK. I think I still have a copy of the old pull request
> > around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a "native" SPDK dynamic threading model that
> > can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit message of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the examples/performance-thread/common
> > directory and is built and linked automatically when building the
> > l3fwd-thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single EAL
> > thread. The subsystem provides a pthread like API that is intended to
> > assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> > <kinzent(a)hotma il.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> > Frank Huang
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> > <John.Meneghini(a)netap p.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes. Improvements
> > will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> >
> > SPDK needs to support a dynamic threading model where reactors are NOT
> > bound to lcores.
> >
> > Many applications need SPDK to support a threading model that:
> > - Does not assume a static number of threads
> > - Does not bind threads to cores (this burns up cores)
> > - Does not assume all threads use the same polling model
> >
> > Removing these assumptions from the SPDK libraries will allow:
> > - Different applications to share the SPDK libraries on the same
> >   platform. E.g. FC-NVMe, RDMA-NVMe, and NVMe
> > - Different platforms to support the same applications with the same
> >   libraries. E.g. a 4 core platform and a 128 core platform, a PowerPC
> >   and NFS traffic
> > - Different workloads at different scales. E.g. 1 NVMF Host with 1
> >   Subsystem and 1 Namespace, or 16 NVMF Hosts with 100 Subsystems and
> >   1,000 namespaces.
> >
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the "NVMF load".
> >
> > More Dynamic Use Cases Coming
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports)
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can be
> > dynamically added and removed from the configuration. This means:
> > - The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > - The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have.
> >
> > Expected Behavior
> > - SPDK libraries should not assume a static number of threads
> > - SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > - SPDK libraries should support a Hybrid polling model (modified run
> >   to completion)
> >
> > Current Behavior
> > - SPDK libraries assume a static number of threads
> > - SPDK libraries bind threads to cores
> > - SPDK libraries assume all threads use the same polling model
> >
> > Possible Solution
> > Proposal to solve above Use Cases:
> > - Use the spdk_nvmf_poll_group (PG) as the unit of threading abstraction
> > - Use PG as the fundamental unit on which a thread operates
> > - The spdk_thread will be a "virtual" thread that gets tied into a PG
> >   (1-1 relationship)
> > - Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > - No dependency between a PG and a "real" thread.
> > - A PG can be picked up by any "real" thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > - PG continues to contain spdk_thread. spdk_thread continues same
> >   mechanisms for IO channels to different NS etc. etc.
> > - PG contains vendor data. E.g. a "ring" for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > - spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > - So messages from the library get routed to the PG "ring" instead of a
> >   thread/reactor event ring.spdk_bdev_get_io
> >
> > Understanding the intent of the event library, it is believed this is
> > the place for customization. However, the current event library assumes
> > a threading model that's a part of the util library. Moreover, many of
> > the other SPDK core libraries assume the same threading model as the
> > util library. If the SPDK util library can be modified to support these
> > dynamic threading use cases, all applications would be able to use the
> > SPDK framework more effectively.
> >
> > Steps to Reproduce
> > This is an enhancement. There is no bug.
> >
> > Context (Environment including OS version, SPDK version, etc.)
> > Would like to provide these enhancements in V18.07.
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk

--===============5749751357144168039==--