public inbox for dev@dpdk.org
From: Kevin Traynor <ktraynor@redhat.com>
To: Slava Ovsiienko <viacheslavo@nvidia.com>,
	Stephen Hemminger <stephen@networkplumber.org>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
	"NBU-Contact-Thomas Monjalon (EXTERNAL)" <thomas@monjalon.net>,
	"david.marchand@redhat.com" <david.marchand@redhat.com>,
	Dariusz Sosnowski <dsosnowski@nvidia.com>,
	"stable@dpdk.org" <stable@dpdk.org>,
	Harman Kalra <hkalra@marvell.com>
Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt
Date: Thu, 19 Feb 2026 14:44:28 +0000
Message-ID: <ea84acc6-35e1-409d-9977-a7b013ab2585@redhat.com>
In-Reply-To: <DM4PR12MB6087DD2EE1DB24F3A0BE51EADF62A@DM4PR12MB6087.namprd12.prod.outlook.com>

On 10/02/2026 20:58, Slava Ovsiienko wrote:
> Hi,
> 
> What about checking EPOLLERR | EPOLLHUP | EPOLLRDHUP flags for specific fd in mlx5 handler?
> 
> if devx_get_async_cmd_comp() returns EAGAIN {
>   if no data were read {
>     call epoll_wait() for specific fd and zero timeout
>     check EPOLLERR | EPOLLHUP | EPOLLRDHUP flags
>     if fd is in hanging/error state {
>       - remove handler
>     }
>   }
> }
> 
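
For reference, the probe suggested above could look roughly like the
sketch below. probe_fd_error_events() is a made-up helper name, not an
mlx5 or EAL function, and it assumes a throwaway epoll instance is
acceptable so that the one owned by the interrupt thread is not touched.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll a single fd with a zero timeout and return
 * the subset of EPOLLERR | EPOLLHUP | EPOLLRDHUP currently set on it.
 * Returns 0 if none are set and -1 if the probe itself failed.
 */
static int
probe_fd_error_events(int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
	struct epoll_event out;
	int epfd, n, ret = 0;

	/* Private epoll instance, used only for this one probe. */
	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	/* Zero timeout: report the current state without blocking. */
	n = epoll_wait(epfd, &out, 1, 0);
	if (n < 0)
		ret = -1;
	else if (n == 1)
		ret = (int)(out.events & (EPOLLERR | EPOLLHUP | EPOLLRDHUP));

	close(epfd);
	return ret;
}
```

A handler would call this only after an EAGAIN with no data read, and
remove itself if the result is greater than zero.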

Thanks for the suggestion. We already have this info at the time of the
callback, so I added an API to get it and made it OS-agnostic in v4.

Let me know if you have comments or suggestions. Thanks!
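
To illustrate the general idea of having the info at callback time: the
event mask returned by the epoll_wait() call that triggers the callback
already carries EPOLLERR/EPOLLHUP/EPOLLRDHUP, so a dispatcher can hand it
to the handler without a second epoll_wait(). The sketch below is not the
v4 API; intr_source, intr_cb_t, dispatch_once() and devx_like_handler()
are all hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#define DISCONNECT_EVENTS (EPOLLERR | EPOLLHUP | EPOLLRDHUP)

/*
 * Hypothetical callback type: receives the event mask reported by
 * epoll_wait(). Returns false to ask the dispatcher to stop watching fd.
 */
typedef bool (*intr_cb_t)(int fd, uint32_t events, void *arg);

struct intr_source {
	int fd;
	intr_cb_t cb;
	void *arg;
};

/*
 * Wait for one event and dispatch it. Returns 1 if a callback ran,
 * 0 on timeout and -1 on error. Sources are registered with
 * epoll_event.data.ptr pointing at their struct intr_source.
 */
static int
dispatch_once(int epfd, int timeout_ms)
{
	struct epoll_event ev;
	struct intr_source *src;
	int n;

	n = epoll_wait(epfd, &ev, 1, timeout_ms);
	if (n <= 0)
		return n;

	src = ev.data.ptr;
	/* Pass the mask through instead of filtering it here. */
	if (!src->cb(src->fd, ev.events, src->arg) &&
	    epoll_ctl(epfd, EPOLL_CTL_DEL, src->fd, NULL) < 0)
		return -1;
	return 1;
}

/* Example handler: treat error/hangup as "stop", otherwise keep going. */
static bool
devx_like_handler(int fd, uint32_t events, void *arg)
{
	bool *disconnected = arg;

	(void)fd;
	if (events & DISCONNECT_EVENTS) {
		*disconnected = true;
		return false;
	}
	/* A real handler would drain its completions here. */
	return true;
}
```

Keeping the decision in the handler means a client that wants special
handling of a hangup, such as a reconnect, still receives the callback.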

> With best regards,
> Slava
> 
>> -----Original Message-----
>> From: Kevin Traynor <ktraynor@redhat.com>
>> Sent: Tuesday, February 10, 2026 9:08 PM
>> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Stephen Hemminger
>> <stephen@networkplumber.org>
>> Cc: dev@dpdk.org; NBU-Contact-Thomas Monjalon (EXTERNAL)
>> <thomas@monjalon.net>; david.marchand@redhat.com; Dariusz Sosnowski
>> <dsosnowski@nvidia.com>; stable@dpdk.org; Harman Kalra
>> <hkalra@marvell.com>
>> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt
>>
>> On 10/02/2026 17:05, Slava Ovsiienko wrote:
>>> Hi,
>>>
>>
>> Hi Slava,
>>
>>> I'm sorry, I have some concern about the patch.
>>>
>>
>> No problem, that's what reviews are for :-) thanks for reviewing.
>>
>>> How it works, as far as I understand:
>>>
>>> - DPDK simulates interrupts in user mode with epoll_wait()
>>> - mlx5 PMD emits the async counter query command to the NIC
>>> periodically
>>
>> I didn't think this would happen unless there was something like hardware
>> offload, but regardless, yes I agree there may be async counter queries.
>>
>>> - there might be multiple async query commands in flight
>>> - the kernel driver handles the async query completion interrupts, pushes
>>> the token to the internal completion queue and unblocks the associated fd
>>> - epoll_wait() sees this unblocked fd and notifies mlx5 PMD about it
>>> - mlx5 PMD reads the completion token from the kernel queue with
>>> devx_get_async_cmd_comp()
>>>
>>> The concern scenario, let's assume:
>>>
>>> - we have 2 async query commands in flight
>>> - the first async query completes, fd is unblocked, PMD is invoked,
>>> the completion is read by PMD and is being handled
>>> - the second async query completes, fd gets unblocked, the second
>>> token is written to the queue
>>> - the PMD completes the handling of the first completion and reads the
>>> queue again (with devx_get_async_cmd_comp() call in the loop)
>>> - it reads the second token successfully and handles it
>>> - then, on the third call, devx_get_async_cmd_comp() returns EAGAIN,
>>> which means the queue is empty
>>> - DPDK calls epoll_wait() again and sees the unblocked fd
>>> - it calls mlx5 PMD, which calls devx_get_async_cmd_comp(), but the queue
>>> is empty (handled in the previous interrupt handling)
>>> - with the patch we wrongly remove the handler
>>>
>>
>> I'm not sure, but this ^^^ sounds feasible.
>>
>>> In my opinion, we should handle flags EPOLLERR | EPOLLHUP | EPOLLRDHUP
>>> from the epoll_wait() return also for the RTE_INTR_HANDLE_EXT and
>>> RTE_INTR_HANDLE_DEV_EVENT interrupt types.
>>>
>>
>> That's exactly what I had in v1 of the patch! The issue is that some clients of eal
>> interrupt may not interpret the condition of EPOLLHUP/EPOLLRDHUP as an
>> error condition and/or want to do some special handling.
>>
>> The example is vhost user server, which puts in place a reconnect mechanism. If
>> we filter out EPOLLHUP/EPOLLRDHUP events in eal, then virtio will not receive
>> the callback and vhost server reconnect is broken. I have some more notes
>> about it in the cover letter.
>>
>> Trying to base on the read pattern in devx handler was an attempt to move logic
>> out of eal so different handlers could be flexible in how they handle this
>> condition.
>>
>> We do have a distinction in that mlx5 uses RTE_INTR_HANDLE_EXT and virtio
>> uses RTE_INTR_HANDLE_VDEV but I'm not sure that is generic enough to base a
>> check/don't check for EPOLLHUP/EPOLLRDHUP events on.
>>
>> So we'd need to come up with another solution if we wanted to filter this in eal.
>> Let's think more on this, though we are a bit constrained by public API as well.
>>
>> A workaround we can do from application is David's hack™ "-a 0000:00:00.0"
>> to skip initial probe. That will at least prevent the issue for mlx devices not used
>> in DPDK, which was the scenario reported.
>>
>> thanks,
>> Kevin.
>>
>>> With best regards,
>>> Slava
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Kevin Traynor <ktraynor@redhat.com>
>>>> Sent: Tuesday, February 10, 2026 5:06 PM
>>>> To: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: dev@dpdk.org; NBU-Contact-Thomas Monjalon (EXTERNAL)
>>>> <thomas@monjalon.net>; david.marchand@redhat.com; Dariusz Sosnowski
>>>> <dsosnowski@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>;
>>>> stable@dpdk.org
>>>> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx
>>>> interrupt
>>>>
>>>> On 07/02/2026 06:09, Stephen Hemminger wrote:
>>>>> On Fri,  6 Feb 2026 17:20:53 +0000
>>>>> Kevin Traynor <ktraynor@redhat.com> wrote:
>>>>>
>>>>>> A busy-loop may occur when there are EPOLLERR, EPOLLHUP or
>>>>>> EPOLLRDHUP epoll events for the devx interrupt fd.
>>>>>>
>>>>>> This may happen if the interrupt fd is deleted, if the device is
>>>>>> unbound from mlx5_core kernel driver or if the device is removed by
>>>>>> the mlx5 kernel driver as part of LAG setup.
>>>>>>
>>>>>> When that occurs, there is no data to be read and in the devx
>>>>>> interrupt handler an EAGAIN is returned on the first call to
>>>>>> devx_get_async_cmd_comp, but this is not checked.
>>>>>>
>>>>>> As the interrupt is not removed or condition reset, it causes an
>>>>>> interrupt processing busy-loop, which leads to the dpdk-intr thread
>>>>>> going to 100% CPU.
>>>>>>
>>>>>> e.g.
>>>>>> epoll_wait(6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) = 1
>>>>>> read(28, 0x7f1f5c7fc2f0, 40) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>> epoll_wait(6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) = 1
>>>>>> read(28, 0x7f1f5c7fc2f0, 40) = -1 EAGAIN (Resource temporarily unavailable)
>>>>>>
>>>>>> Add a check for an EAGAIN return from devx_get_async_cmd_comp on
>>>>>> the first read. If that happens, unregister the callback to prevent looping.
>>>>>>
>>>>>> Bugzilla ID: 1873
>>>>>> Fixes: f15db67df09c ("net/mlx5: accelerate DV flow counter query")
>>>>>> Cc: stable@dpdk.org
>>>>>>
>>>>>> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
>>>>>
>>>>> AI spotted this, I didn't...
>>>>>
>>>>>
>>>>> Errors:
>>>>>
>>>>>     Line 139: Unnecessary semicolon after closing brace
>>>>>
>>>>>    };
>>>>>
>>>>> Should be:
>>>>>
>>>>>    }
>>>>>
>>>>>     Lines 142-146: Block comment uses incorrect style. Block comments
>>>>>     in C code should use /* and */ style, not /** which is reserved
>>>>>     for documentation comments.
>>>>>
>>>>>    /**
>>>>>     * no data and EAGAIN indicate there is an error or
>>>>>     * disconnect state. Unregister callback to prevent
>>>>>     * interrupt busy-looping.
>>>>>     */
>>>>>
>>>>> Should be:
>>>>>
>>>>>    /*
>>>>>     * no data and EAGAIN indicate there is an error or
>>>>>     * disconnect state. Unregister callback to prevent
>>>>>     * interrupt busy-looping.
>>>>>     */
>>>>>
>>>>> Warnings:
>>>>>
>>>>>     Logic clarity: The variable data_read is set to true inside the
>>>>>     while loop but never checked when data WAS read. Consider if
>>>>>     data_read is the clearest way to express this condition.
>>>>>
>>>>
>>>> Ack above. Thanks. Will be fixed in v3.
>>>
> 


Thread overview:
2026-01-28 12:20 [PATCH] eal/linux: handle epoll error conditions Kevin Traynor
2026-01-29 12:51 ` Kevin Traynor
2026-02-06 17:20 ` [PATCH v2 0/2] interrupt epoll event handling Kevin Traynor
2026-02-06 17:20   ` [PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt Kevin Traynor
2026-02-07  6:09     ` Stephen Hemminger
2026-02-10 15:05       ` Kevin Traynor
2026-02-10 17:05         ` Slava Ovsiienko
2026-02-10 19:07           ` Kevin Traynor
2026-02-10 20:58             ` Slava Ovsiienko
2026-02-19 14:44               ` Kevin Traynor [this message]
2026-02-06 17:20   ` [PATCH v2 2/2] eal/linux: handle interrupt epoll events Kevin Traynor
2026-02-07  6:11     ` Stephen Hemminger
2026-02-10 13:35       ` Kevin Traynor
2026-02-10  9:17     ` David Marchand
2026-02-10 14:47       ` Kevin Traynor
2026-02-10 18:06 ` [PATCH v3 0/2] interrupt epoll event handling Kevin Traynor
2026-02-10 18:06   ` [PATCH v3 1/2] net/mlx5: check for no data read in devx interrupt Kevin Traynor
2026-02-10 18:06   ` [PATCH v3 2/2] eal/linux: handle interrupt epoll events Kevin Traynor
2026-02-19 14:37 ` [PATCH v4 0/3] interrupt disconnect/error event handling Kevin Traynor
2026-02-19 14:38 ` Kevin Traynor
2026-02-19 14:38   ` [PATCH v4 1/3] eal/linux: handle interrupt epoll events Kevin Traynor
2026-02-19 14:38   ` [PATCH v4 2/3] eal/interrupt: add interrupt event info Kevin Traynor
2026-02-26 15:41     ` David Marchand
2026-03-02 11:47       ` Kevin Traynor
2026-02-19 14:38   ` [PATCH v4 3/3] net/mlx5: check devx disconnect/error interrupt events Kevin Traynor
2026-03-03 16:16     ` Slava Ovsiienko
2026-02-19 18:52   ` [PATCH v4 0/3] interrupt disconnect/error event handling Stephen Hemminger
2026-03-02 11:41     ` Kevin Traynor
2026-03-03 18:58 ` [PATCH v5 " Kevin Traynor
2026-03-03 18:58   ` [PATCH v5 1/3] eal/linux: handle interrupt epoll events Kevin Traynor
2026-03-03 18:58   ` [PATCH v5 2/3] eal/interrupt: add interrupt event info Kevin Traynor
2026-03-03 18:58   ` [PATCH v5 3/3] net/mlx5: check devx disconnect/error interrupt events Kevin Traynor
2026-03-04 11:09   ` [PATCH v5 0/3] interrupt disconnect/error event handling David Marchand
