From mboxrd@z Thu Jan 1 00:00:00 1970
From: ibetts
Subject: [PATCH v2 1/5] doc: add performance-thread sample application guide
Date: Thu, 29 Oct 2015 15:08:41 +0000
Message-ID: <1446131325-13019-2-git-send-email-ian.betts@intel.com>
References: <1446131325-13019-1-git-send-email-ian.betts@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: Ian Betts
To: dev@dpdk.org
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by dpdk.org (Postfix) with ESMTP id B87E88DA1 for ; Thu, 29 Oct 2015 16:08:52 +0100 (CET)
In-Reply-To: <1446131325-13019-1-git-send-email-ian.betts@intel.com>
List-Id: patches and discussions about DPDK
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

From: Ian Betts

This commit adds documentation for the performance-thread sample
application.

Changes in this version:
* fix typos
* changes to tx parameter
* add diagrams
* add CPU load stats

Signed-off-by: Ian Betts
---
 doc/guides/rel_notes/release_2_2.rst               |   11 +
 .../sample_app_ug/img/performance_thread_1.svg     |  799 +++++++++++++
 .../sample_app_ug/img/performance_thread_2.svg     |  865 ++++++++++++++
 doc/guides/sample_app_ug/index.rst                 |    1 +
 doc/guides/sample_app_ug/performance_thread.rst    | 1243 ++++++++++++++++++++
 5 files changed, 2919 insertions(+)
 create mode 100644 doc/guides/sample_app_ug/img/performance_thread_1.svg
 create mode 100644 doc/guides/sample_app_ug/img/performance_thread_2.svg
 create mode 100644 doc/guides/sample_app_ug/performance_thread.rst

diff --git a/doc/guides/rel_notes/release_2_2.rst b/doc/guides/rel_notes/release_2_2.rst
index 128f956..2c031de 100644
--- a/doc/guides/rel_notes/release_2_2.rst
+++ b/doc/guides/rel_notes/release_2_2.rst
@@ -75,6 +75,12 @@ Libraries
 Examples
 ~~~~~~~~
 
+* **examples: Introducing a performance thread example**
+
+  This is an l3fwd derivative focused on enabling characterization of performance
+  with different threading models, including multiple EAL threads per physical
+  core, and multiple lightweight threads running in an EAL thread.
+  The example includes a simple cooperative scheduler.
 
 Other
 ~~~~~
@@ -82,6 +88,11 @@ Other
 
 Known Issues
 ------------
+* When running the performance thread application in configurations with more than
+  two EAL threads per physical core, forwarding throughput may drop suddenly
+  to a low level. Stopping and restarting traffic restores correct operation.
+  This problem does not occur when running with multiple lightweight threads per
+  physical core.
 
 
 API Changes
diff --git a/doc/guides/sample_app_ug/img/performance_thread_1.svg b/doc/guides/sample_app_ug/img/performance_thread_1.svg
new file mode 100644
index 0000000..db01d7c
--- /dev/null
+++ b/doc/guides/sample_app_ug/img/performance_thread_1.svg
@@ -0,0 +1,799 @@
[performance_thread_1.svg: SVG markup not recoverable. Surviving text labels: Port 1, Port 2, rx-thread, rings, tx-thread]
diff --git a/doc/guides/sample_app_ug/img/performance_thread_2.svg b/doc/guides/sample_app_ug/img/performance_thread_2.svg
new file mode 100644
index 0000000..48cf833
--- /dev/null
+++ b/doc/guides/sample_app_ug/img/performance_thread_2.svg
@@ -0,0 +1,865 @@
[performance_thread_2.svg: SVG markup not recoverable. Surviving text labels: Port 1, Port 2, rx-thread, rings, tx-thread, tx-drain]
diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst
index 9beedd9..70d4a5c 100644
--- a/doc/guides/sample_app_ug/index.rst
+++ b/doc/guides/sample_app_ug/index.rst
@@ -73,6 +73,7 @@ Sample Applications User Guide
     vm_power_management
     tep_termination
     proc_info
+    performance_thread
 
 **Figures**
 
diff --git a/doc/guides/sample_app_ug/performance_thread.rst b/doc/guides/sample_app_ug/performance_thread.rst
new file mode 100644
index 0000000..c735931
--- /dev/null
+++ b/doc/guides/sample_app_ug/performance_thread.rst
@@ -0,0 +1,1243 @@
+..  BSD LICENSE
+    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+    All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+    * Re-distributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in
+      the documentation and/or other materials provided with the
+      distribution.
+    * Neither the name of Intel Corporation nor the names of its
+      contributors may be used to endorse or promote products derived
+      from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+Performance Thread Sample Application
+=====================================
+
+The performance thread sample application is a derivative of the standard L3
+forwarding application that demonstrates different threading models.
+
+Overview
+--------
+For a general description of the L3 forwarding application's capabilities
+please refer to the documentation of the standard application in
+:doc:`l3_forward`.
+
+The performance thread sample application differs from the standard L3 forward
+example in that it divides the RX and TX processing between different threads,
+and makes it possible to assign individual threads to different cores.
+
+Three threading models are considered:
+
+#. When there is one EAL thread per physical core
+#. When there are multiple EAL threads per physical core
+#. When there are multiple lightweight threads per EAL thread
+
+Since DPDK release 2.0 it is possible to launch applications using the --lcores
+EAL parameter, specifying cpu-sets for a physical core. With the performance
+thread sample application it is now also possible to assign individual RX
+and TX functions to different cores.
+
+As an alternative to dividing the L3 forwarding work between different EAL
+threads the performance thread sample introduces the possibility to run the
+application threads as lightweight threads (L-threads) within one or
+more EAL threads.
+
+In order to facilitate this threading model the example includes a primitive
+cooperative scheduler (L-thread) subsystem. More details of the L-thread
+subsystem can be found in :ref:`lthread_subsystem`.
+
+**Note:** Whilst theoretically possible, it is not anticipated that multiple
+L-thread schedulers would be run on the same physical core. This mode of
+operation should not be expected to yield useful performance and is considered
+invalid.
+
+Compiling the Application
+-------------------------
+The application is located in the sample application folder in the
+performance-thread folder.
+
+#.  Go to the example applications folder
+
+    .. code-block:: console
+
+       export RTE_SDK=/path/to/rte_sdk
+       cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread
+
+#.  Set the target (a default target is used if not specified). For example:
+
+    .. code-block:: console
+
+       export RTE_TARGET=x86_64-native-linuxapp-gcc
+
+    See the DPDK Getting Started Guide for possible RTE_TARGET values.
+
+#.  Build the application:
+
+    .. code-block:: console
+
+       make
+
+
+
+Running the Application
+-----------------------
+
+The application has a number of command line options:
+
+.. code-block:: console
+
+    ./build/l3fwd-thread [EAL options] -- -p PORTMASK [-P] --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)] --tx (lcore,thread)[,(lcore,thread)] [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa] [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
+
+where,
+
+* -p PORTMASK: Hexadecimal bitmask of ports to configure
+
+* -P: optional, sets all ports to promiscuous mode so that packets are
+  accepted regardless of the packet's Ethernet MAC destination address.
+  Without this option, only packets with the Ethernet MAC destination
+  address set to the Ethernet address of the port are accepted.
+
+* --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]:
+  the list of NIC RX ports and queues handled by the RX lcores and threads
+
+* --tx (lcore,thread)[,(lcore,thread)]:
+  the list of TX threads, identifying the lcore each thread runs on and the id of
+  the RX thread with which it is associated.
+
+* --enable-jumbo: optional, enables jumbo frames
+
+* --max-pkt-len: optional, maximum packet length in decimal (64-9600)
+
+* --no-numa: optional, disables NUMA awareness
+
+* --hash-entry-num: optional, specifies the hash entry number in hex to be set up
+
+* --ipv6: optional, set it if running IPv6 packets
+
+* --no-lthreads: optional, disables the L-thread model and uses the EAL threading model
+
+* --stat-lcore: optional, runs the CPU load stats collector on the specified lcore
+
+The l3fwd-thread application allows you to start packet processing in two threading
+models: L-threads (default) and EAL threads (when the "--no-lthreads" parameter
+is used). For consistency all parameters are used in the same way for both models.
+
+* rx parameters
+
+.. _table_l3fwd_rx_parameters:
+
++--------+------------------------------------------------------+
+| port   | rx port                                              |
++--------+------------------------------------------------------+
+| queue  | rx queue that will be read on the specified rx port  |
++--------+------------------------------------------------------+
+| lcore  | core to use for the thread                           |
++--------+------------------------------------------------------+
+| thread | thread id (continuously from 0 to N)                 |
++--------+------------------------------------------------------+
+
+
+* tx parameters
+
+.. _table_l3fwd_tx_parameters:
+
++--------+------------------------------------------------------+
+| lcore  | core to use for L3 route match and transmit          |
++--------+------------------------------------------------------+
+| thread | id of rx thread to be associated with this tx thread |
++--------+------------------------------------------------------+
+
+
+Running with L-threads
+~~~~~~~~~~~~~~~~~~~~~~
+
+When the L-thread model is used (default option), the lcore and thread parameters in
+--rx/--tx are used to affine threads to the selected scheduler.
+
+For example:
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)"
+
+places every L-thread on a different lcore, and
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,0,1)" \
+            --tx="(1,0)(2,1)"
+
+places the RX L-threads on lcore 0 and the TX L-threads on lcores 1 and 2.
+
+Running with EAL threads
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the --no-lthreads parameter is used, the L-threading model is turned off and EAL
+threads are used for all processing. EAL threads are enumerated in the same way as L-threads,
+but the --lcores EAL parameter is used to affine threads to the selected cpu-set (scheduler).
+
+Thus it is possible to place every RX and TX thread on a different lcore.
+
+For example:
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)" \
+            --no-lthreads
+
+places every EAL thread on a different lcore.
+
+To affine two or more EAL threads to one cpu-set, the EAL --lcores parameter is used:
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)" \
+            --no-lthreads
+
+This places the RX EAL threads (lcores 0 and 1) on CPU 0 and the TX EAL threads
+(lcores 2 and 3) on CPU 1.
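The tuple syntax taken by --rx, for example "(0,0,0,0)(1,0,1,1)", can be parsed in a few lines of C. The following is a minimal sketch only, not the parser used by l3fwd-thread; `struct rx_tuple` and `parse_rx_tuples` are hypothetical names introduced for illustration, and accepting an optional comma between tuples is an assumption based on the usage string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One "(port,queue,lcore,thread)" entry of an --rx style argument.
 * Hypothetical type, not taken from the sample application. */
struct rx_tuple {
    int port;
    int queue;
    int lcore;
    int thread;
};

/*
 * Parse a string such as "(0,0,0,0)(1,0,1,1)" into out[].
 * A comma between tuples is accepted but not required.
 * Returns the number of tuples parsed, or -1 if the input is
 * malformed or holds more than max tuples.
 */
int parse_rx_tuples(const char *s, struct rx_tuple *out, int max)
{
    int count = 0;

    assert(s != NULL && out != NULL);

    while (*s != '\0') {
        struct rx_tuple t;
        int used = 0;   /* stays 0 unless the closing ')' was matched */

        if (*s == ',') {            /* optional separator */
            s++;
            continue;
        }
        if (sscanf(s, "(%d,%d,%d,%d)%n",
                   &t.port, &t.queue, &t.lcore, &t.thread, &used) != 4 ||
            used == 0)
            return -1;
        if (count == max)
            return -1;
        out[count++] = t;
        s += used;
    }
    return count;
}
```

Calling `parse_rx_tuples("(0,0,0,0)(1,0,1,1)", tuples, 4)` fills two entries, the second describing port 1, queue 0, lcore 1, thread 1. The `%n` conversion is what detects a missing closing parenthesis, because `sscanf` still reports four converted fields in that case.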
+
+
+Examples
+~~~~~~~~
+
+For selected scenarios the command line configuration of the application for L-threads
+and its corresponding EAL threads command line can be realized as follows:
+
+a) Start every thread on a different scheduler (1:1)
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)"
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)" \
+            --no-lthreads
+
+b) Start all threads on one core (N:1)
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,0,1)" \
+            --tx="(0,0)(0,1)"
+
+The example above starts 4 L-threads on lcore 0.
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 --lcores="(0-3)@0" -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,0,1)" \
+            --tx="(2,0)(3,1)" \
+            --no-lthreads
+
+The example above starts 4 EAL threads on cpu-set 0.
+
+
+c) Start threads on different cores (N:M)
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,0,1)" \
+            --tx="(1,0)(1,1)"
+
+The example above starts 2 L-threads for RX on lcore 0, and 2 L-threads
+for TX on lcore 1.
+
+.. code-block:: console
+
+    l3fwd-thread -c ff -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
+            --rx="(0,0,0,0)(1,0,1,1)" \
+            --tx="(2,0)(3,1)" \
+            --no-lthreads
+
+The example above starts 2 EAL threads for RX on cpu-set 0, and
+2 EAL threads for TX on cpu-set 1.
+
+
+Explanation
+-----------
+
+To a great extent the sample application differs little from the standard L3
+forwarding application, and readers are advised to familiarize themselves with the
+material covered in the :doc:`l3_forward` documentation before proceeding.
+
+The following explanation is focused on the way threading is handled in the
+performance thread example.
+
+
+Mode of operation with EAL threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The performance thread sample application has split the RX and TX functionality
+into two different threads, and the RX and TX threads are
+interconnected via software rings. With respect to these rings the RX threads
+are producers and the TX threads are consumers.
+
+On initialization the TX and RX threads are started according to the command
+line parameters.
+
+The RX threads poll the network interface queues and post received packets to a
+TX thread via a corresponding software ring.
+
+The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
+and assemble packet bursts before performing burst transmit on the network
+interface.
+
+As with the standard L3 forward application, burst draining of residual packets
+is performed periodically with the period calculated from elapsed time using
+the timestamp counter.
+
+The diagram below illustrates a case with two RX threads and three TX threads.
+
+.. _figure_performance_thread_1:
+
+.. figure:: img/performance_thread_1.*
+
+Mode of operation with L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Like the EAL thread configuration the application has split the RX and TX
+functionality into different threads, and the pairs of RX and TX threads are
+interconnected via software rings.
+
+On initialization an L-thread scheduler is started on every EAL thread. On all
+but the master EAL thread only a dummy L-thread is initially started.
+The L-thread started on the master EAL thread then spawns other L-threads on
+different L-thread schedulers according to the command line parameters.
+
+The RX threads poll the network interface queues and post received packets
+to a TX thread via the corresponding software ring.
+
+The ring interface is augmented by means of an L-thread condition variable that
+enables the TX thread to be suspended when the TX ring is empty. The RX thread
+signals the condition whenever it posts to the TX ring, causing the TX thread
+to be resumed.
+
+Additionally the TX L-thread spawns a worker L-thread to take care of
+polling the software rings, whilst it handles burst draining of the transmit
+buffer.
+
+The worker threads poll the software rings, perform L3 route lookup and
+assemble packet bursts. If the TX ring is empty the worker thread suspends
+itself by waiting on the condition variable associated with the ring.
+
+Burst draining of residual packets, less than the burst size, is performed by
+the TX thread which sleeps (using an L-thread sleep function) and resumes
+periodically to flush the TX buffer.
+
+This design means that L-threads that have no work can yield the CPU to other
+L-threads and avoid having to constantly poll the software rings.
+
+The diagram below illustrates a case with two RX threads and three TX functions
+(each comprising a thread that processes forwarding and a thread that
+periodically drains the output buffer of residual packets).
+
+.. _figure_performance_thread_2:
+
+.. figure:: img/performance_thread_2.*
+
+CPU load statistics
+~~~~~~~~~~~~~~~~~~~
+It is possible to display statistics showing estimated CPU load on each core.
+The statistics indicate the percentage of CPU time spent: processing
+received packets (forwarding), polling queues/rings (waiting for work),
+and doing any other processing (context switch and other overhead).
+
+When enabled, statistics are gathered by having the application threads set and
+clear flags when they enter and exit pertinent code sections. The flags are
+then sampled in real time by a statistics collector thread running on another
+core. This thread displays the data in real time on the console.
+
+This feature is enabled by designating a statistics collector core, using the
+--stat-lcore parameter.
+
+
+.. _lthread_subsystem:
+
+The L-thread subsystem
+----------------------
+The L-thread subsystem resides in the examples/performance-thread/common
+directory and is built and linked automatically when building the l3fwd-thread
+example.
+
+The subsystem provides a simple cooperative scheduler to enable arbitrary
+functions to run as cooperative threads within a single EAL thread.
+The subsystem provides a pthread-like API that is intended to assist in
+reuse of legacy code written for POSIX pthreads.
+
+The following sections provide some detail on the features, constraints,
+performance and porting considerations when using L-threads.
+
+.. _comparison_between_lthreads_and_pthreads:
+
+Comparison between L-threads and POSIX pthreads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The fundamental difference between the L-thread and pthread models is the
+way in which threads are scheduled. The simplest way to think about this is to
+consider the case of a processor with a single CPU. To run multiple threads
+on a single CPU, the scheduler must frequently switch between the threads,
+in order that each thread is able to make timely progress.
+This is the basis of any multitasking operating system.
+
+This section explores the differences between the pthread model and the
+L-thread model as implemented in the provided L-thread subsystem. If needed a
+theoretical discussion of preemptive vs cooperative multithreading can be
+found in any good text on operating system design.
+
+Scheduling and context switching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The POSIX pthread library provides an application programming interface to
+create and synchronize threads. Scheduling policy is determined by the host OS,
+and may be configurable. The OS may use sophisticated rules to determine which
+thread should be run next, threads may suspend themselves or make other threads
+ready, and the scheduler may employ a time slice giving each thread a maximum
+time quantum after which it will be preempted in favor of another thread that
+is ready to run. To complicate matters further threads may be assigned
+different scheduling priorities.
+
+By contrast the L-thread subsystem is considerably simpler. Logically the
+L-thread scheduler performs the same multiplexing function for L-threads
+within a single pthread as the OS scheduler does for pthreads within an
+application process. The L-thread scheduler is simply the main loop of a
+pthread, and in so far as the host OS is concerned it is a regular pthread
+just like any other. The host OS is oblivious to the existence of, and
+not at all involved in the scheduling of, L-threads.
+
+The other and most significant difference between the two models is that
+L-threads are scheduled cooperatively. L-threads cannot preempt each
+other, nor can the L-thread scheduler preempt a running L-thread (i.e.
+there is no time slicing). The consequence is that programs implemented with
+L-threads must possess frequent rescheduling points, meaning that they must
+explicitly and of their own volition return to the scheduler at frequent
+intervals, in order to allow other L-threads an opportunity to proceed.
+
+In both models switching between threads requires that the current CPU
+context is saved and a new context (belonging to the next thread ready to run)
+is restored. With pthreads this context switching is handled transparently
+and the set of CPU registers that must be preserved between context switches
+is as per an interrupt handler.
+
+An L-thread context switch is achieved by the thread itself making a function
+call to the L-thread scheduler. Thus it is only necessary to preserve the
+callee-saved registers. The caller is responsible for saving any other
+registers it is using before a function call and restoring them on return,
+and this is handled by the compiler. For X86_64 on both Linux and BSD the
+System V calling convention is used, which defines registers RBX, RSP, RBP, and R12-R15
+as callee-save registers (for more detailed discussion a good reference
+can be found here https://en.wikipedia.org/wiki/X86_calling_conventions).
+
+Taking advantage of this, and due to the absence of preemption, an L-thread
+context switch is achieved with fewer than 20 load/store instructions.
+
+The scheduling policy for L-threads is fixed: there is no prioritization of
+L-threads, all L-threads are equal and scheduling is based on a FIFO
+ready queue.
+
+An L-thread is a struct containing the CPU context of the thread
+(saved on context switch) and other useful items. The ready queue contains
+pointers to threads that are ready to run. The L-thread scheduler is a simple
+loop that polls the ready queue, reads from it the next thread ready to run,
+which it resumes by saving the current context (the current position in the
+scheduler loop) and restoring the context of the next thread from its thread
+struct. Thus an L-thread is always resumed at the last place it yielded.
+
+A well behaved L-thread will call the context switch regularly (at least once
+in its main loop) thus returning to the scheduler's own main loop. Yielding
+inserts the current thread at the back of the ready queue, and the process of
+servicing the ready queue is repeated, thus the system runs by flipping back
+and forth between the L-threads and the scheduler loop.
+
+In the case of pthreads, the preemptive scheduling, time slicing, and support
+for thread prioritization means that progress is normally possible for any
+thread that is ready to run. This comes at the price of a relatively heavier
+context switch and scheduling overhead.
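The scheduler loop and FIFO ready queue described above can be sketched in portable C. This is an illustrative sketch only and is not the L-thread implementation: it assumes a glibc-style `<ucontext.h>` and uses `swapcontext`, which saves more state than the callee-saved registers discussed above. All `coop_*` names, `worker`, and the `trace` buffer are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define MAX_THREADS 8
#define STACK_SIZE  (64 * 1024)

/* A cooperative thread: saved CPU context plus entry point and stack. */
struct coop_thread {
    ucontext_t ctx;
    void (*fn)(int);
    int arg;
    char stack[STACK_SIZE];
};

static struct coop_thread threads[MAX_THREADS];
static struct coop_thread *ready[MAX_THREADS];  /* FIFO ready queue */
static unsigned int head, tail, nthreads;
static ucontext_t sched_ctx;                    /* position in scheduler loop */
static struct coop_thread *current;

char trace[32];     /* records the order in which workers ran */
int trace_len;

static void enqueue(struct coop_thread *t)
{
    ready[tail++ % MAX_THREADS] = t;
}

static struct coop_thread *dequeue(void)
{
    return head == tail ? NULL : ready[head++ % MAX_THREADS];
}

/* Rescheduling point: rejoin the back of the queue, resume the scheduler. */
void coop_yield(void)
{
    enqueue(current);
    swapcontext(&current->ctx, &sched_ctx);
}

/* Entry shim; returning from it resumes uc_link, i.e. the scheduler. */
static void trampoline(void)
{
    current->fn(current->arg);
}

int coop_spawn(void (*fn)(int), int arg)
{
    struct coop_thread *t;

    if (nthreads == MAX_THREADS || getcontext(&threads[nthreads].ctx) != 0)
        return -1;
    t = &threads[nthreads++];
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = sizeof(t->stack);
    t->ctx.uc_link = &sched_ctx;
    t->fn = fn;
    t->arg = arg;
    makecontext(&t->ctx, trampoline, 0);
    enqueue(t);
    return 0;
}

/* Scheduler loop: resume ready threads in FIFO order until none remain. */
void coop_run(void)
{
    while ((current = dequeue()) != NULL)
        swapcontext(&sched_ctx, &current->ctx);
}

/* Demo thread body: log an identifying letter twice, yielding each time. */
void worker(int id)
{
    for (int i = 0; i < 2; i++) {
        assert(trace_len < (int)sizeof(trace));
        trace[trace_len++] = (char)('A' + id);
        coop_yield();
    }
}
```

Spawning `worker` with ids 0 and 1 and then calling `coop_run()` leaves "ABAB" in `trace`: each thread resumes exactly where it yielded, and the two alternate because yielding places a thread at the back of the queue. A thread that never called `coop_yield()` would run to completion before any other thread got a turn, which is the stalling risk that cooperative scheduling carries.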
+
+With L-threads the progress of any particular thread is determined by the
+frequency of rescheduling opportunities in the other L-threads. This means that
+an errant L-thread monopolizing the CPU might cause scheduling of other threads
+to be stalled. Due to the lower cost of context switching, however, voluntary
+rescheduling to ensure progress of other threads, if managed sensibly, is not
+a prohibitive overhead, and overall performance can exceed that of an
+application using pthreads.
+
+Mutual exclusion
+^^^^^^^^^^^^^^^^
+With pthreads, preemption means that threads that share data must observe
+some form of mutual exclusion protocol.
+
+The fact that L-threads cannot preempt each other means that in many cases
+mutual exclusion devices can be completely avoided.
+
+Locking to protect shared data can be a significant bottleneck in
+multi-threaded applications so a carefully designed cooperatively scheduled
+program can enjoy significant performance advantages.
+
+So far we have considered only the simplistic case of a single core CPU.
+When multiple CPUs are considered things are somewhat more complex.
+
+First of all it is inevitable that there must be multiple L-thread schedulers,
+one running on each EAL thread. So long as these schedulers remain isolated
+from each other the above assertions about the potential advantages of
+cooperative scheduling hold true.
+
+A configuration with isolated cooperative schedulers is less flexible than the
+pthread model where threads can be affined to run on any CPU. With isolated
+schedulers scaling of applications to utilize fewer or more CPUs according to
+system demand is very difficult to achieve.
+
+The L-thread subsystem makes it possible for L-threads to migrate between
+schedulers running on different CPUs. Needless to say if the migration means
+that threads that share data end up running on different CPUs then this will
+introduce the need for some kind of mutual exclusion device.
+
+Of course rte_ring s/w rings can always be used to interconnect threads running
+on different cores, however to protect other kinds of shared data structures,
+lock free constructs or else explicit locking will be required. This is a
+consideration for the application design.
+
+In support of this extended functionality, the L-thread subsystem implements
+thread safe mutexes and condition variables.
+
+The cost of affining and of condition variable signaling is significantly
+lower than the equivalent pthread operations, and so applications using
+these features will see a performance benefit.
+
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+As with applications written for pthreads an application written for L-threads
+can take advantage of thread local storage, in this case local to an L-thread.
+An application may save and retrieve a single pointer to application data in
+the L-thread struct.
+
+For legacy and backward compatibility reasons two alternative methods are also
+offered. The first is modelled directly on the pthread get/set specific APIs,
+the second approach is modelled on the RTE_PER_LCORE macros, whereby PER_LTHREAD
+macros are introduced. In both cases the storage is local to the L-thread.
+
+
+.. _constraints_and_performance_implications:
+
+Constraints and performance implications when using L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+.. _API_compatibility:
+
+API compatibility
+^^^^^^^^^^^^^^^^^
+
+The L-thread subsystem provides a set of functions that are logically equivalent
+to the corresponding functions offered by the POSIX pthread library, however not
+all pthread functions have a corresponding L-thread equivalent, and not all
+features available to pthreads are implemented for L-threads.
+
+The pthread library offers considerable flexibility via programmable attributes
+that can be associated with threads, mutexes, and condition variables.
+
+By contrast the L-thread subsystem has fixed functionality: the scheduler policy
+cannot be varied, and L-threads cannot be prioritized. There are no variable
+attributes associated with any L-thread objects. L-threads, mutexes and
+condition variables all have fixed functionality. (Note: reserved parameters
+are included in the APIs to facilitate possible future support for attributes).
+
+The table below lists the pthread and equivalent L-thread APIs with notes on
+differences and/or constraints. Where there is no L-thread entry in the table,
+then the L-thread subsystem provides no equivalent function.
+
+.. _table_lthread_pthread:
+
++-----------------------------+-----------------------------+--------------------+
+| **Pthread function**        | **L-thread function**       | **Notes**          |
++=============================+=============================+====================+
+| pthread_barrier_destroy     |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_barrier_init        |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_barrier_wait        |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_broadcast      | lthread_cond_broadcast      | See note 1         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_destroy        | lthread_cond_destroy        |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_init           | lthread_cond_init           |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_signal         | lthread_cond_signal         | See note 1         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_timedwait      |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cond_wait           | lthread_cond_wait           | See note 5         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_create              | lthread_create              | See notes 2, 3     |
++-----------------------------+-----------------------------+--------------------+
+| pthread_detach              | lthread_detach              | See note 4         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_equal               |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_exit                | lthread_exit                |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_getspecific         | lthread_getspecific         |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_getcpuclockid       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_join                | lthread_join                |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_key_create          | lthread_key_create          |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_key_delete          | lthread_key_delete          |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_destroy       | lthread_mutex_destroy       |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_init          | lthread_mutex_init          |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_lock          | lthread_mutex_lock          | See note 6         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_trylock       | lthread_mutex_trylock       | See note 6         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_timedlock     |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_mutex_unlock        | lthread_mutex_unlock        |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_once                |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_destroy      |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_init         |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_rdlock       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_timedrdlock  |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_timedwrlock  |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_tryrdlock    |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_trywrlock    |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_unlock       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_rwlock_wrlock       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_self                | lthread_current             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_setspecific         | lthread_setspecific         |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_spin_init           |                             | See note 10        |
++-----------------------------+-----------------------------+--------------------+
+| pthread_spin_destroy        |                             | See note 10        |
++-----------------------------+-----------------------------+--------------------+
+| pthread_spin_lock           |                             | See note 10        |
++-----------------------------+-----------------------------+--------------------+
+| pthread_spin_trylock        |                             | See note 10        |
++-----------------------------+-----------------------------+--------------------+
+| pthread_spin_unlock         |                             | See note 10        |
++-----------------------------+-----------------------------+--------------------+
+| pthread_cancel              | lthread_cancel              |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_setcancelstate      |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_setcanceltype       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_testcancel          |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_getschedparam       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_setschedparam       |                             |                    |
++-----------------------------+-----------------------------+--------------------+
+| pthread_yield               | lthread_yield               | See note 7         |
++-----------------------------+-----------------------------+--------------------+
+| pthread_setaffinity_np      | lthread_set_affinity        | See notes 2, 3, 8  |
++-----------------------------+-----------------------------+--------------------+
+|                             | lthread_sleep               | See note 9         |
++-----------------------------+-----------------------------+--------------------+
+|                             | lthread_sleep_clks          | See note 9         |
++-----------------------------+-----------------------------+--------------------+
+
+
+Note 1:
+
+Neither lthread_cond_signal nor lthread_cond_broadcast may be called concurrently by L-threads
+running on different schedulers, although multiple L-threads running in the
+same scheduler may freely perform signal or broadcast operations. L-threads
+running on the same or different schedulers may always safely wait on a condition
+variable.
+
+
+Note 2:
+
+Pthread attributes may be used to affine a pthread with a cpu-set. The L-thread
+subsystem does not support a cpu-set. An L-thread may be affined only with a
+single CPU at any time.
+
+
+Note 3:
+
+If an L-thread is intended to run on a different NUMA node than the node that
+creates the thread then, when calling lthread_create(), it is advantageous to
+specify the destination core as a parameter of lthread_create().
+See :ref:`memory_allocation_and_NUMA_awareness` for details.
+
+
+Note 4:
+
+An L-thread can only detach itself, and cannot detach other L-threads.
+
+
+Note 5:
+
+A wait operation on a pthread condition variable is always associated with and
+protected by a mutex which must be owned by the thread at the time it invokes
+pthread_cond_wait(). By contrast L-thread condition variables are thread safe
+(for waiters) and do not use an associated mutex. Multiple L-threads (including
+L-threads running on other schedulers) can safely wait on an L-thread condition
+variable. As a consequence an L-thread condition variable is typically an order
+of magnitude faster than its pthread counterpart.
+
+
+Note 6:
+
+Recursive locking is not supported with L-threads. Attempts to take a lock
+recursively will be detected and rejected.
+
+
+Note 7:
+
+lthread_yield() will save the current context, insert the current thread at the
+back of the ready queue, and resume the next ready thread. Yielding increases
+ready queue backlog, see :ref:`ready_queue_backlog` for more details about the
+implications of this.
+
+
+N.B. The context switch time, as measured from immediately before the call to
+lthread_yield() to the point at which the next ready thread is resumed, can be
+an order of magnitude faster than the same measurement for pthread_yield().
+
+
+Note 8:
+
+lthread_set_affinity() is similar to a yield apart from the fact that the
+yielding thread is inserted into a peer ready queue of another scheduler.
+The peer ready queue is actually a separate thread safe queue, which means that
+threads appearing in the peer ready queue can jump any backlog in the local
+ready queue on the destination scheduler.
+
+The context switch time, as measured from the time just before the call to
+lthread_set_affinity() to just after the same thread is resumed on the new
+scheduler, can be orders of magnitude faster than the same measurement for
+pthread_setaffinity_np().
+
+
+Note 9:
+
+Although there is no pthread_sleep() function, lthread_sleep() and
+lthread_sleep_clks() can be used wherever sleep(), usleep() or nanosleep()
+might ordinarily be used. The L-thread sleep functions suspend the current
+thread, start an rte_timer and resume the thread when the timer matures.
+The rte_timer_manage() entry point is called on every pass of the scheduler
+loop. This means that the worst case jitter on timer expiry is determined by
+the longest period between context switches of any running L-threads.
+
+In a synthetic test with many threads sleeping and resuming, the measured
+jitter is typically orders of magnitude lower than the same measurement made
+for nanosleep().
+
+
+Note 10:
+
+Spin locks are not provided because they are problematic in a cooperative
+environment, see :ref:`porting_locks_and_spinlocks` for a more detailed
+discussion on how to avoid spin locks.
+
+.. _Thread_local_storage_performance:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+Of the three L-thread local storage options the simplest and most efficient is
+storing a single application data pointer in the L-thread struct.
+
+The PER_LTHREAD macros involve a run time computation to obtain the address
+of the variable being saved/retrieved and also require that the accesses are
+de-referenced via a pointer. This means that code that has used
+RTE_PER_LCORE macros being ported to L-threads might need some slight
+adjustment (see :ref:`porting_thread_local_storage` for hints about porting
+code that makes use of thread local storage).
+
+The get/set specific APIs are consistent with their pthread counterparts both
+in use and in performance.
+
+.. _memory_allocation_and_NUMA_awareness:
+
+Memory allocation and NUMA awareness
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All memory allocation is from DPDK huge pages, and is NUMA aware. Each
+scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
+mutexes and condition variables. These caches are implemented as unbounded lock
+free MPSC queues. When objects are created they are always allocated from the
+caches on the local core (current EAL thread).
+
+If an L-thread has affined to a different scheduler, then it can always safely
+free resources to the caches from which they originated (because the caches are
+MPSC queues).
+
+If the L-thread has affined to a different NUMA node then the memory resources
+associated with it may incur longer access latency.
+
+The commonly used pattern of setting affinity on entry to a thread after it has
+started means that memory allocation for both the stack and TLS will have been
+made from caches on the NUMA node on which the thread's creator is running.
+This has the side effect that access latency will be sub-optimal after
+affining.
+
+This side effect can be mitigated to some extent (although not completely) by
+specifying the destination CPU as a parameter of lthread_create(). This causes
+the L-thread’s stack and TLS to be allocated when it is first scheduled on the
+destination scheduler. If the destination is on another NUMA node this results
+in a more optimal memory allocation.
+
+Note that the lthread struct itself remains allocated from memory on the
+creating node. This is unavoidable because an L-thread is known everywhere by
+the address of this struct.
+
+.. _object_cache_sizing:
+
+Object cache sizing
+^^^^^^^^^^^^^^^^^^^
+
+The per lcore object caches pre-allocate objects in bulk whenever a request to
+allocate an object finds a cache empty. By default 100 objects are
+pre-allocated; this is defined by LTHREAD_PREALLOC in the public API header
+file lthread_api.h. This means that the caches constantly grow to meet system
+demand.
+
+In the present implementation there is no mechanism to reduce the cache sizes
+if system demand reduces. Thus the caches will remain at their maximum extent
+indefinitely.
+
+A consequence of the bulk pre-allocation of objects is that every 100
+(default value) additional new object create operations result in a call to
+rte_malloc. For creation of objects such as L-threads, which trigger the
+allocation of even more objects (i.e. their stacks and TLS), this can
+cause outliers in scheduling performance.
+
+If this is a problem the simplest mitigation strategy is to dimension the
+system, by setting the bulk object pre-allocation size to some large number
+that you do not expect to be exceeded. This means the caches will be populated
+once only, the very first time a thread is created.
+
+.. _Ready_queue_backlog:
+
+Ready queue backlog
+^^^^^^^^^^^^^^^^^^^
+
+One of the more subtle performance considerations is managing the ready queue
+backlog. The fewer threads that are waiting in the ready queue, the faster
+any particular thread will get serviced.
+
+In a naive L-thread application with N L-threads simply looping and yielding,
+this backlog will always be equal to the number of L-threads, thus the cost of
+a yield to a particular L-thread will be N times the context switch time.
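The cost model described above can be written out as a small calculation. This is a minimal sketch and not part of the L-thread API: the function name `yield_round_trip_ns` and the 100 ns context switch time used in the note below are assumptions chosen only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not an L-thread API): estimate how long a thread
 * that yields must wait before it runs again, when every thread in the
 * ready queue simply loops and yields.
 *
 * ready_threads  - number of L-threads in the ready queue (N)
 * ctx_switch_ns  - assumed cost of one context switch, in nanoseconds
 */
uint64_t
yield_round_trip_ns(uint64_t ready_threads, uint64_t ctx_switch_ns)
{
	/* Each of the N queued threads costs one context switch. */
	return ready_threads * ctx_switch_ns;
}
```

With an assumed 100 ns switch time, 10 polling threads give a round trip of about 1 microsecond, while 1000 polling threads give about 100 microseconds. The figures are illustrative, not measurements.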
+
+This side effect can be mitigated by arranging for threads to be suspended and
+waiting to be resumed, rather than polling for work by constantly yielding.
+Blocking on a mutex or condition variable or, even more obviously, having a
+thread sleep if it has a low frequency workload are all mechanisms by which a
+thread can be excluded from the ready queue until it really does need to be
+running. This can have a significant positive impact on performance.
+
+.. _Initialization_and_shutdown_dependencies:
+
+Initialization, shutdown and dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The L-thread subsystem depends on DPDK for huge page allocation and depends on
+the rte_timer subsystem. The DPDK EAL initialization and
+rte_timer_subsystem_init() MUST be completed before the L-thread subsystem
+can be used.
+
+Thereafter initialization of the L-thread subsystem is largely transparent to
+the application. Constructor functions ensure that global variables are properly
+initialized. Other than global variables each scheduler is initialized
+independently the first time that an L-thread is created by a particular EAL
+thread.
+
+If the schedulers are to be run as isolated and independent schedulers, with
+no intention that L-threads running on different schedulers will migrate between
+schedulers or synchronize with L-threads running on other schedulers, then
+initialization consists simply of creating an L-thread, and then running the
+L-thread scheduler.
+
+If there will be interaction between L-threads running on different schedulers,
+then it is important that the starting of schedulers on different EAL threads
+is synchronized.
+
+To achieve this an additional initialization step is necessary: set the number
+of schedulers by calling the API function lthread_num_schedulers_set(n), where
+n is the number of EAL threads that will run L-thread schedulers. Setting the
+number of schedulers to a number greater than 0 will cause all schedulers to
+wait until the others have started before beginning to schedule L-threads.
+
+The L-thread scheduler is started by calling the function lthread_run(), which
+should be called from the EAL thread and thus becomes the main loop of the EAL
+thread.
+
+The function lthread_run() will not return until all threads running
+on the scheduler have exited, and the scheduler has been explicitly stopped by
+calling lthread_scheduler_shutdown(lcore) or lthread_scheduler_shutdown_all().
+
+All these functions do is tell the scheduler that it can exit when there are no
+longer any running L-threads. Neither function forces any running L-thread to
+terminate. Any desired application shutdown behavior must be designed and
+built into the application to ensure that L-threads complete in a timely
+manner.
+
+**Important Note:** It is assumed when the scheduler exits that the application
+is terminating for good. The scheduler does not free resources before exiting,
+and running the scheduler a subsequent time will result in undefined behavior.
+
+.. _porting_legacy_code_to_run_on_lthreads:
+
+Porting legacy code to run on L-threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Legacy code originally written for a pthread environment may be ported to
+L-threads if the considerations about differences in scheduling policy, and
+constraints discussed in the previous sections can be accommodated.
+
+This section looks in more detail at some of the issues that may have to be
+resolved when porting code.
+
+.. _pthread_API_compatibility:
+
+pthread API compatibility
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The first step is to establish exactly which pthread APIs the legacy
+application uses, and to understand the requirements of those APIs. If there
+are corresponding L-thread APIs, and where the default pthread functionality
+is used by the application then, notwithstanding the other issues discussed
+here, it should be feasible to run the application with L-threads. If the
+legacy code modifies the default behavior using attributes then it may be
+necessary to make some adjustments to eliminate those requirements.
+
+.. _blocking_system_calls:
+
+Blocking system API calls
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is important to understand what other system services the application may be
+using, bearing in mind that in a cooperatively scheduled environment a thread
+cannot block without stalling the scheduler and with it all other cooperative
+threads. Any kind of blocking system call, for example file or socket IO, is a
+potential problem. A good tool to analyze the application for this purpose is
+the “strace” utility.
+
+There are many strategies to resolve these kinds of issues, each with its
+merits. Possible solutions include:
+
+* Adopting a polled mode of the system API concerned (if available).
+
+* Arranging for another core to perform the function and synchronizing with that
+  core via constructs that will not block the L-thread.
+
+* Affining the thread to another scheduler devoted (as a matter of policy) to
+  handling threads wishing to make blocking calls, and then back again when
+  finished.
+
+
+.. _porting_locks_and_spinlocks:
+
+Locks and spinlocks
+^^^^^^^^^^^^^^^^^^^
+
+Locks and spinlocks are another source of blocking behavior that, for the same
+reasons as system calls, will need to be addressed.
+
+If the application design ensures that the contending L-threads will always
+run on the same scheduler then it is probably safe to remove locks and spin
+locks completely.
+
+The only exception to the above rule is if for some reason the
+code performs any kind of context switch whilst holding the lock
+(e.g. yield, sleep, or block on a different lock, or on a condition variable).
+This will need to be determined before deciding to eliminate a lock.
+
+If a lock cannot be eliminated then an L-thread mutex can be substituted for
+either kind of lock.
+
+An L-thread blocking on an L-thread mutex will be suspended and will cause
+another ready L-thread to be resumed, thus not blocking the scheduler. When
+default behavior is required, it can be used as a direct replacement for a
+pthread mutex lock.
+
+Spin locks are typically used when lock contention is likely to be rare and
+where the period during which the lock may be held is relatively short.
+When the contending L-threads are running on the same scheduler then an
+L-thread blocking on a spin lock will enter an infinite loop stopping the
+scheduler completely (see :ref:`porting_infinite_loops` below).
+
+If the application design ensures that contending L-threads will always run
+on different schedulers then it might be reasonable to leave a short spin lock
+that rarely experiences contention in place.
+
+If after all considerations it appears that a spin lock can neither be
+eliminated completely, replaced with an L-thread mutex, nor left in place as
+is, then an alternative is to loop on a flag, with a call to lthread_yield()
+inside the loop (n.b. if the contending L-threads might ever run on different
+schedulers the flag will need to be manipulated atomically).
+
+Spinning and yielding is the least preferred solution since it introduces
+ready queue backlog (see also :ref:`ready_queue_backlog`).
+
+.. _porting_sleeps_and_delays:
+
+Sleeps and delays
+^^^^^^^^^^^^^^^^^
+
+Yet another kind of blocking behavior (albeit momentary) is caused by delay
+functions like sleep(), usleep(), nanosleep() etc. All will have the
+consequence of stalling the L-thread scheduler and, unless the delay is very
+short (e.g. a very short nanosleep), calls to these functions will need to be
+eliminated.
+
+The simplest mitigation strategy is to use the L-thread sleep API functions,
+of which two variants exist, lthread_sleep() and lthread_sleep_clks().
+These functions start an rte_timer against the L-thread, suspend the L-thread
+and cause another ready L-thread to be resumed. The suspended L-thread is
+resumed when the rte_timer matures.
+
+.. _porting_infinite_loops:
+
+Infinite loops
+^^^^^^^^^^^^^^
+
+Some applications have threads with loops that contain no inherent
+rescheduling opportunity, and rely solely on the OS time slicing to share
+the CPU. In a cooperative environment this will stop everything dead. These
+kinds of loops are not hard to identify: in a debug session you will find the
+debugger is always stopping in the same loop.
+
+The simplest solution to this kind of problem is to insert an explicit
+lthread_yield() or lthread_sleep() into the loop. Another solution might be
+to include the function performed by the loop into the execution path of some
+other loop that does in fact yield, if this is possible.
+
+.. _porting_thread_local_storage:
+
+Thread local storage
+^^^^^^^^^^^^^^^^^^^^
+
+If the application uses thread local storage, the use case should be
+studied carefully.
+
+In a legacy pthread application either or both of the __thread prefix and the
+pthread set/get specific APIs may have been used to define storage local
+to a pthread.
+
+In some applications it may be a reasonable assumption that the data could
+or in fact most likely should be placed in L-thread local storage.
+
+If the application (like many DPDK applications) has assumed a certain
+relationship between a pthread and the CPU to which it is affined, there is
+a risk that thread local storage may have been used to save some data items
+that are correctly logically associated with the CPU, and other items which
+relate to application context for the thread. Only a good understanding of
+the application will reveal such cases.
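A minimal sketch of such a mixed case, and of how the two kinds of data can be told apart. All names here (`legacy_tls`, `core_data`, `thread_ctx`, `thread_moved_to`, `sketch_per_core`) are hypothetical, and a plain C array stands in for the real per-lcore storage mechanism.

```c
#include <stddef.h>

#define SKETCH_MAX_CORES 4

/* Legacy layout: one thread-local struct mixing both kinds of data. */
struct legacy_tls {
	unsigned int core_id;      /* really a property of the CPU */
	unsigned long rx_packets;  /* really a per-CPU counter */
	int session_id;            /* application context of the thread */
};

/* Data that belongs to the CPU stays with the CPU. */
struct core_data {
	unsigned int core_id;
	unsigned long rx_packets;
};

/* Data that belongs to the thread travels with the thread. */
struct thread_ctx {
	int session_id;
	struct core_data *core;  /* refreshed whenever the thread moves */
};

/* Stand-in for per-lcore storage, indexed by core number. */
struct core_data sketch_per_core[SKETCH_MAX_CORES];

/* Called by a thread after it has moved to new_core. */
int
thread_moved_to(struct thread_ctx *ctx, unsigned int new_core)
{
	if (ctx == NULL || new_core >= SKETCH_MAX_CORES)
		return -1;
	ctx->core = &sketch_per_core[new_core];
	return 0;
}
```

The thread keeps `session_id` wherever it runs, and only the pointer to the core-specific data changes when it moves.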
+
+If the application requires that an L-thread be able to move between
+schedulers then care should be taken to separate these kinds of data into per
+lcore and per L-thread storage. In this way a migrating thread will bring with
+it the local data it needs, and pick up the new logical core specific values
+from pthread local storage at its new home.
+
+.. _pthread_shim:
+
+Pthread shim
+~~~~~~~~~~~~
+
+A convenient way to get something working with legacy code can be to use a
+shim that adapts pthread API calls to the corresponding L-thread ones.
+This approach will not mitigate any of the porting considerations mentioned
+in the previous sections, but it will reduce the amount of code churn that
+would otherwise be involved. It is a reasonable approach to evaluate
+L-threads, before investing effort in porting to the native L-thread APIs.
+
+Overview
+^^^^^^^^
+
+The L-thread subsystem includes an example pthread shim. This is a partial
+implementation but does contain the API stubs needed to get basic applications
+running. There is a simple “hello world” application that demonstrates the
+use of the pthread shim.
+
+A subtlety of working with a shim is that the application will still need
+to make use of the genuine pthread library functions, at the very least in
+order to create the EAL threads in which the L-thread schedulers will run.
+This is the case with DPDK initialization, and exit.
+
+To deal with the initialization and shutdown scenarios, the shim is capable
+of switching on or off its adaptor functionality. An application can control
+this behavior by calling the function pt_override_set(). The default state
+is disabled.
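The on/off behavior can be pictured with a small sketch. This is not the shim's implementation: `sketch_override_set`, `sketch_yield` and the two stand-in functions are hypothetical names used only to show the control flow, with the default state disabled as described above.

```c
#include <stdbool.h>

/* Default state is disabled, so calls reach the genuine function. */
static bool sketch_override_enabled = false;

/* Stand-in for a genuine pthread library function. */
static int genuine_yield(void) { return 0; }

/* Stand-in for the equivalent L-thread function. */
static int adaptor_yield(void) { return 1; }

/* Hypothetical counterpart of the shim's on/off switch. */
void
sketch_override_set(bool enable)
{
	sketch_override_enabled = enable;
}

/* Every shim entry point makes the same decision before dispatching. */
int
sketch_yield(void)
{
	if (sketch_override_enabled)
		return adaptor_yield();
	return genuine_yield();
}
```

In this picture an application would leave the switch off while DPDK initializes and creates its EAL threads, turn it on for code that runs as L-threads, and turn it off again before shutdown.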
+
+The pthread shim uses the dynamic linker loader and saves the loaded addresses
+of the genuine pthread API functions in an internal table. When the shim
+functionality is enabled it performs the adaptor function; when disabled it
+invokes the genuine pthread function.
+
+The function pthread_exit() has additional special handling. The standard
+system header file pthread.h declares pthread_exit()
+with __attribute__((noreturn)). This is an optimization that is possible
+because the pthread is terminating, and it enables the compiler to omit the
+normal handling of stack and protection of registers since the function is not
+expected to return, and in fact the thread is being destroyed.
+These optimizations are applied in both the callee and the caller of the
+pthread_exit() function.
+
+In our cooperative scheduling environment this behavior is inadmissible.
+The pthread is the L-thread scheduler thread, and, although an L-thread is
+terminating, there must be a return to the scheduler so that the system can
+continue to run. Further, returning from a function with attribute noreturn is
+invalid and may result in undefined behavior.
+
+The solution is to redefine the pthread_exit function with a macro, causing it
+to be mapped to a stub function in the shim that does not have the (noreturn)
+attribute. This macro is defined in the file pthread_shim.h. The stub function
+is otherwise no different than any of the other stub functions in the shim,
+and will switch between the real pthread_exit() function or the lthread_exit()
+function as required. The only difference is the mapping to the stub by
+macro substitution.
+
+A consequence of this is that the file pthread_shim.h must be included in
+legacy code wishing to make use of the shim. It also means that dynamic linkage
+of a pre-compiled binary that did not include pthread_shim.h is not supported.
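The macro substitution can be sketched as follows. This is an illustration only: `sketch_pthread_exit`, `sketch_last_exit_value` and `legacy_thread_body` are hypothetical names, and the choice between the genuine pthread_exit() and lthread_exit() that the real stub makes is omitted here.

```c
#include <pthread.h>   /* declares pthread_exit() with the noreturn attribute */
#include <stddef.h>

/* Records the value passed to the stub, so the effect is observable. */
void *sketch_last_exit_value = NULL;

/*
 * Hypothetical stub, declared WITHOUT the noreturn attribute, so the
 * compiler keeps the normal call/return handling around it.
 */
void
sketch_pthread_exit(void *retval)
{
	sketch_last_exit_value = retval;
	/* Returns normally; a real stub would hand control to the scheduler. */
}

/* From here on, calls written as pthread_exit(x) reach the stub. */
#define pthread_exit(retval) sketch_pthread_exit(retval)

/* Legacy-style thread body, unchanged apart from being recompiled. */
int
legacy_thread_body(void *arg)
{
	pthread_exit(arg);
	return 1;   /* reached in this sketch because the stub returns normally */
}
```

Because the remapping happens in the preprocessor, it only affects source files that see the macro, which is why legacy code has to include the shim header and be recompiled.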
+
+Given the requirements for porting legacy code outlined in
+:ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
+least some minimal adjustment and recompilation to run on L-threads so
+pre-compiled binaries are unlikely to be met in practice.
+
+In summary the shim approach adds some overhead but can be a useful tool to help
+establish the feasibility of a code reuse project. It is also a fairly
+straightforward task to extend the shim if necessary.
+
+**Note:** Bearing in mind the preceding discussions about the impact of making
+blocking calls, switching the shim in and out on the fly to invoke any
+pthread API that might block is something that should typically be avoided.
+
+
+Building and running the pthread shim
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The shim example application is located in the performance-thread folder of
+the sample applications.
+
+To build and run the pthread shim example:
+
+#. Go to the example applications folder:
+
+   .. code-block:: console
+
+       export RTE_SDK=/path/to/rte_sdk
+       cd ${RTE_SDK}/examples/performance-thread/pthread_shim
+
+
+#. Set the target (a default target is used if not specified). For example:
+
+   .. code-block:: console
+
+       export RTE_TARGET=x86_64-native-linuxapp-gcc
+
+   See the DPDK Getting Started Guide for possible RTE_TARGET values.
+
+#. Build the application:
+
+   .. code-block:: console
+
+       make
+
+#. Run the pthread_shim example:
+
+   .. code-block:: console
+
+       lthread-pthread-shim -c <core mask> -n
+
+.. _lthread_diagnostics:
+
+L-thread Diagnostics
+~~~~~~~~~~~~~~~~~~~~
+
+When debugging you must take account of the fact that the L-threads are run in
+a single pthread. The current scheduler is defined by
+RTE_PER_LCORE(this_sched), and the current lthread is stored at
+RTE_PER_LCORE(this_sched)->current_lthread.
+Thus on a breakpoint in a GDB session the current lthread can be obtained by
+displaying the pthread local variable "per_lcore_this_sched->current_lthread".
+
+Another useful diagnostic feature is the possibility to trace significant
+events in the life of an L-thread. This feature is enabled by changing the
+value of LTHREAD_DIAG from 0 to 1 in the file lthread_diag_api.h.
+
+Tracing of events can be individually masked, and the mask may be programmed at
+run time.
+An unmasked event results in a callback that provides information
+about the event. The default callback simply prints trace information.
+The default mask is 0 (all events off). The mask can be modified by calling the
+function lthread_diagnostic_set_mask().
+
+It is possible to register a user callback function to implement more
+sophisticated diagnostic functions.
+Object creation events (lthread, mutex, and condition variable) accept, and
+store in the created object, a user supplied reference value returned by the
+callback function.
+
+The lthread reference value is passed back in all subsequent event callbacks,
+and APIs are provided to retrieve the reference value from
+mutexes and condition variables. This enables a user to monitor, count, or
+filter for specific events on specific objects, for example to monitor for a
+specific thread signalling a specific condition variable, or to monitor
+all timer events.
+
+The callback function can be set by calling the function
+lthread_diagnostic_enable(), supplying a callback function pointer
+and an event mask.
+
+Setting LTHREAD_DIAG also enables counting of statistics about cache and
+queue usage, and these statistics can be displayed by calling the function
+lthread_diag_stats_display(). This function also performs a consistency check
+on the caches and queues. The function should only be called from the master
+EAL thread after all slave threads have stopped and returned to the C main
+program, otherwise the consistency check will fail.
-- 
1.7.9.5