From: Bob Pearson <rpearsonhpe@gmail.com>
To: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@fujitsu.com>,
"jgg@nvidia.com" <jgg@nvidia.com>,
"leon@kernel.org" <leon@kernel.org>,
"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
"jhack@hpe.com" <jhack@hpe.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
Date: Mon, 7 Nov 2022 10:06:16 -0600 [thread overview]
Message-ID: <c1016a97-80c5-349d-0238-a5d7d408950d@gmail.com> (raw)
In-Reply-To: <TYCPR01MB8455610B57F2EAE2A2691614E53C9@TYCPR01MB8455.jpnprd01.prod.outlook.com>
On 11/7/22 02:21, Daisuke Matsuda (Fujitsu) wrote:
> On Sun, Nov 6, 2022 6:15 AM Bob Pearson
>> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
>>> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>>>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>>>> This patch series implements work queues as an alternative for
>>>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>>>> with a patch that makes the internal API for task execution pluggable
>>>>>> and implements an inline and a tasklet based set of functions.
>>>>>> The remaining patches cleanup the qp reset and error code in the
>>>>>> three tasklets and modify the locking logic to prevent making
>>>>>> multiple calls to the tasklet scheduling routine. After
>>>>>> this preparation the work queue equivalent set of functions is
>>>>>> added and the tasklet version is dropped.
>>>>>
>>>>> Thank you for posting the 3rd series.
>>>>> It looks fine at a glance, but now I am concerned about problems
>>>>> that can be potentially caused by concurrency.
>>>>>
>>>>>>
>>>>>> The main advantage of the work queue version of deferred task execution
>>>>>> is that it has much better scalability and overall performance than the
>>>>>> tasklet variant. The perftest microbenchmarks in local loopback mode
>>>>>> (not a very realistic test case) can reach approximately 100Gb/sec with
>>>>>> work queues compared to about 16Gb/sec with tasklets.
>>>>>
>>>>> As you wrote, the advantage of the work queue version is that the number
>>>>> of works that can run in parallel scales with the number of logical CPUs.
>>>>> However, the dispatched works (rxe_requester, rxe_responder, and
>>>>> rxe_completer) were designed for serial execution as tasklets, so we must
>>>>> not rely on them functioning properly when executed in parallel.
>>>>
>>>> Work queues are serial for each separate work task, just like tasklets, so
>>>> there isn't a problem here. The tasklets for different tasks can run in
>>>> parallel, but tend to do so less than work queue tasks. The reason is that
>>>> tasklets are by default scheduled on the same CPU as the thread that
>>>> scheduled them, while work queue items are handled by the kernel scheduler
>>>> and get spread around.
>>>
>>> =====
>>> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
>>> =====
>>> You are using the WQ_CPU_INTENSIVE flag. This exempts works from concurrency
>>> management, but each work is still enqueued to the worker pool of a specific
>>> CPU and is thus bound to the CPU the issuer is running on. It seems the
>>> behaviour you expect can be achieved with the WQ_UNBOUND flag. Unbound work
>>> items can run on any CPU, at the cost of cache utilization.
>>>
>>> Two instances of the same tasklet never run concurrently on different
>>> processors by nature, but that is not the case with work queues. If two
>>> softirqs running on different CPUs enqueue responder works at almost the same
>>> time, those works can be dispatched and run on different CPUs at the same
>>> time. I mean that problems may arise in such a situation.
>>>
>>> Please let me know if I missed anything. I referred to the following document:
>>> https://www.kernel.org/doc/html/latest/core-api/workqueue.html
>>> The easiest solution is to use @flags=WQ_UNBOUND and @max_active=1 to let
>>> works run serially.
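In kernel terms, the serialized fallback suggested above would amount to something like the following (a sketch of the alloc_workqueue() call, not the actual patch):

```c
/* Sketch: an effectively single-threaded workqueue. With WQ_UNBOUND and
 * max_active = 1, at most one work item from this queue executes at a
 * time, so requester/responder/completer works could never run
 * concurrently -- at the cost of losing all parallelism. */
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
```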
>>>
>>> Thanks,
>>> Daisuke
>>>
>> According to this:
>>
>> Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>> after a work item gets queued:
>>
>> The work function hasn’t been changed.
>>
>> No one queues the work item to another workqueue.
>>
>> The work item hasn’t been reinitiated.
>>
>> In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>> most one worker system-wide at any given time.
>>
>> Note that requeuing the work item (to the same queue) in the self function doesn’t break these
>> conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
>> inside a work function.
>>
>> It should be OK. Each work item checks the state under lock before scheduling
>> the item, and if it is free, moves it to busy and then schedules it. Only one
>> instance of a work item should be running at a time.
>
> Thank you for the explanation.
> Per-qp work items should meet the three conditions. That is what I was missing.
> Now I see. You are correct.
>
>>
>> I only know what I see from running top. It seems that the work items do get
>> spread out over the CPUs over time.
>
> It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
> workers in the same way. This is not stated explicitly in the document.
>
>>
>> The CPU_INTENSIVE flag is certainly correct for our application, which will
>> run all the CPUs at 100% for extended periods of time. We are benchmarking
>> storage with IOR.
>
> It is OK with me. I have not come up with any situations where the CPU_INTENSIVE
> flag bothers other rxe users.
>
> Thanks,
> Daisuke
>
>>
>> Bob
>>
>>>>>
>>>>> There could be 3 problems, which stem from the fact that works are not
>>>>> necessarily executed in the same order the packets are received. Works are
>>>>> enqueued to worker pools on each CPU, and each CPU schedules its works
>>>>> independently, so the ordering of works among CPUs is not guaranteed.
>>>>>
>>>>> [1]
>>>>> On UC/UD connections, responder does not check the psn of inbound packets,
>>>>> so the payloads can be copied to MRs without checking the order. If there
>>>>> are works that write to overlapping memory locations, they can potentially
>>>>> cause data corruption depending on the order.
>>>>>
>>>>> [2]
>>>>> On RC connections, responder checks the psn, and drops the packet if it is not
>>>>> the expected one. Requester can retransmit the request in this case, so the order
>>>>> seems to be guaranteed for RC.
>>>>>
>>>>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
>>>>> replying with an ACK packet. If the work is preempted soon after storing
>>>>> the next psn, another work on another CPU can potentially send its ACK
>>>>> packet earlier. This behaviour is against the spec.
>>>>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
>>>>>
>>>>> [3]
>>>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>>>> loaded and stored at the same time from different threads. It seems we
>>>>> have to use a synchronization method, perhaps like READ_ONCE() and
>>>>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
>>>>> example; there can be other variables that need similar consideration.
>>>>>
>>>>>
>>>>> All the problems above can be solved by making the work queue single-
>>>>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
>>>>> for alloc_workqueue(), but this should be the last resort since it spoils
>>>>> the performance benefit of the work queue.
>>>>>
>>>>> I am not sure what we can do with [1] right now.
>>>>> For [2] and [3], we could just move the update of psn later than the ack reply,
>>>>> and use *_ONCE() macros for shared variables.
>>>>>
>>>>> Thanks,
>>>>> Daisuke
Thank you for taking the time to review this.
Bob
>>>>>
>>>>>>
>>>>>> This version of the patch series drops the tasklet version as an option
>>>>>> but keeps the option of switching between the workqueue and inline
>>>>>> versions.
>>>>>>
>>>>>> This patch series is derived from an earlier patch set developed by
>>>>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>>>>>> to Lustre servers with hardware RoCE v2 NICs.
>>>>>>
>>>>>> It is based on the current version of wip/jgg-for-next.
>>>>>>
>>>>>> v3:
>>>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>>>>>> The v3 version drops the first few patches which have already been accepted
>>>>>> in for-next. It also drops the last patch of the v2 version which
>>>>>> introduced module parameters to select between the task interfaces. It also
>>>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>>>>>
>>>>>> v2:
>>>>>> The v2 version of the patch set has some minor changes that address
>>>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>>>> and the setup parameters for alloc_workqueue. It also has one
>>>>>> additional cleanup patch.
>>>>>>
>>>>>> Bob Pearson (13):
>>>>>> RDMA/rxe: Make task interface pluggable
>>>>>> RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>>> RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>>> RDMA/rxe: Handle qp error in rxe_resp.c
>>>>>> RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>>> RDMA/rxe: Remove __rxe_do_task()
>>>>>> RDMA/rxe: Make tasks schedule each other
>>>>>> RDMA/rxe: Implement disable/enable_task()
>>>>>> RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>>> RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>>> RDMA/rxe: Add workqueue support for tasks
>>>>>> RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>>> RDMA/rxe: Remove tasklets from rxe_task.c
>>>>>>
>>>>>> drivers/infiniband/sw/rxe/rxe.c | 9 +-
>>>>>> drivers/infiniband/sw/rxe/rxe_comp.c | 24 ++-
>>>>>> drivers/infiniband/sw/rxe/rxe_qp.c | 80 ++++-----
>>>>>> drivers/infiniband/sw/rxe/rxe_req.c | 4 +-
>>>>>> drivers/infiniband/sw/rxe/rxe_resp.c | 70 +++++---
>>>>>> drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>>> drivers/infiniband/sw/rxe/rxe_task.h | 56 +++---
>>>>>> 7 files changed, 329 insertions(+), 172 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>>>> --
>>>>>> 2.34.1
>>>>>
>>>
>