From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>,
	 Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Xin Zhao <jackzxcui1989@163.com>,
	 edumazet@google.com,  ferenc@fejes.dev,  davem@davemloft.net,
	 kuba@kernel.org,  pabeni@redhat.com,  horms@kernel.org,
	 netdev@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the retire operation
Date: Sat, 06 Sep 2025 11:25:26 -0400	[thread overview]
Message-ID: <willemdebruijn.kernel.1a11cce73202f@gmail.com> (raw)
In-Reply-To: <CAL+tcoBX4URyxxxuCT3XdzJ7R2zS-DjobdKfMjwc-R7h=ptFCg@mail.gmail.com>

Jason Xing wrote:
> On Sat, Sep 6, 2025 at 12:16 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > On Fri, Sep 5, 2025 at 2:03 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 5, 2025 at 12:01 PM Xin Zhao <jackzxcui1989@163.com> wrote:
> > > > >
> > > > > On Thu, Sep 4, 2025 at 11:26 +0800 Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > > > In the description of [PATCH net-next v10 0/2] net: af_packet: optimize retire operation:
> > > > > > >
> > > > > > > Changes in v8:
> > > > > > > - Delete delete_blk_timer field, as suggested by Willem de Bruijn,
> > > > > > >   hrtimer_cancel will check and wait until the timer callback returns, ensuring
> > > > > > >   the callback is not entered again;
> > > > > >
> > > > > > I see the reason now :)
> > > > > >
> > > > > > Please note that the per-version change history will eventually be
> > > > > > dropped; only the official commit message is kept in git. So this
> > > > > > kind of change, I think, should be documented in the commit message
> > > > > > itself, since you're removing a structure member. Adding more
> > > > > > description will be helpful to future readers. Thank you.
> > > > >
> > > > > I will add some more information to the commit message of this 2/2 PATCH.
> > > > >
> > > > >
> > > > >
> > > > > > > Consider the following timing sequence:
> > > > > > > timer   cpu0 (softirq context, hrtimer timeout)                cpu1 (process context)
> > > > > > > 0       hrtimer_run_softirq
> > > > > > > 1         __hrtimer_run_queues
> > > > > > > 2           __run_hrtimer
> > > > > > > 3             prb_retire_rx_blk_timer_expired
> > > > > > > 4               spin_lock(&po->sk.sk_receive_queue.lock);
> > > > > > > 5               _prb_refresh_rx_retire_blk_timer
> > > > > > > 6                 hrtimer_forward_now
> > > > > > > 7               spin_unlock(&po->sk.sk_receive_queue.lock)
> > > > > > > 8             raw_spin_lock_irq(&cpu_base->lock);              tpacket_rcv
> > > > > > > 9             enqueue_hrtimer                                    spin_lock(&sk->sk_receive_queue.lock);
> > > > > > > 10                                                               packet_current_rx_frame
> > > > > > > 11                                                                 __packet_lookup_frame_in_block
> > > > > > > 12            finish enqueue_hrtimer                                 prb_open_block
> > > > > > > 13                                                                     _prb_refresh_rx_retire_blk_timer
> > > > > > > 14                                                                       hrtimer_is_queued(&pkc->retire_blk_timer) == true
> > > > > > > 15                                                                       hrtimer_forward_now
> > > > > > > 16                                                                         WARN_ON
> > > > > > > In the timing sequence above, enqueue_hrtimer on cpu0 is not protected by
> > > > > > > sk_receive_queue.lock, while hrtimer_forward_now on cpu1 is not protected by
> > > > > > > raw_spin_lock_irq(&cpu_base->lock).
> > > > > > >
> > > > > > > In my previous email, I provided an explanation. As a supplement, I would
> > > > > > > like to reiterate a paragraph from my earlier response to Willem.
> > > > > > > The point is that when the hrtimer is in the enqueued state, you cannot
> > > > > >
> > > > > > How about trying hrtimer_is_queued() beforehand?
> > > > > >
> > > > > > IIUC, with this patch applied, we lose the opportunity we had before
> > > > > > to refresh the timer when the lookup function (in the path I mentioned
> > > > > > above) gets called. If the packet socket looks up a new block without
> > > > > > updating the expiry time, the timer will fire again soon. Does that
> > > > > > sound unreasonable?
> > > > >
> > > > >
> > > > > I actually pointed out the issue with the timeout setting in a previous email:
> > > > > https://lore.kernel.org/netdev/20250826030328.878001-1-jackzxcui1989@163.com/.
> > > > >
> > > > > Regarding the method you mentioned, using hrtimer_is_queued to assist in judgment, I had
> > > > > discussed this extensively with Willem in previous emails, and the conclusion was that
> > > > > it is not feasible. The reason is that in our scenario, the hrtimer always returns
> > > > > HRTIMER_RESTART, unlike the places you pointed out, such as tcp_pacing_check, where the
> > > > > corresponding hrtimer callbacks all return HRTIMER_NORESTART. Since our scenario returns
> > > > > HRTIMER_RESTART, this can lead to troublesome issues. The fundamental reason is that
> > > > > if HRTIMER_RESTART is returned, the hrtimer core enqueues the hrtimer after the
> > > > > callback returns, i.e. outside the protection of our sk_receive_queue lock.
> > > > >
> > > > > Returning to the functionality here, if we really want to update the hrtimer's timeout
> > > > > outside of the timer callback, there are two key points to note:
> > > > >
> > > > > 1. Accurately knowing whether the current context is a timer callback or tpacket_rcv.
> > > > > 2. How to update the hrtimer's timeout in a non-timer callback scenario.
> > > > >
> > > > > To start with the first point, it has already been explained in previous emails that
> > > > > executing hrtimer_forward outside of a timer callback is not allowed. Therefore, we
> > > > > must accurately determine whether we are in a timer callback; only in that context can
> > > > > we use the hrtimer_forward function to update.
> > > > > In the original code, since the same _prb_refresh_rx_retire_blk_timer function was called,
> > > > > distinguishing between contexts required code restructuring. Now that this patch removes
> > > > > the _prb_refresh_rx_retire_blk_timer function, achieving this accurate distinction is not
> > > > > too difficult.
> > > > > The key issue is the second point. If we are not inside the hrtimer's callback, we cannot
> > > > > use hrtimer_forward to update the timeout.
> > > > > So what other interface could we use? You might suggest hrtimer_start, but it cannot
> > > > > simply be called while the timer is already started. You would therefore need an
> > > > > hrtimer_cancel first to confirm the timer is stopped. Once hrtimer_cancel is added,
> > > > > the timer must also be restarted, which means handling the concurrent case where the
> > > > > socket is being released and calls hrtimer_cancel as well. That would require extra
> > > > > logic for this concurrency, and possibly even reintroducing the delete_blk_timer
> > > > > variable to indicate that packet_release has been triggered, so that tpacket_rcv
> > > > > does not restart the hrtimer.
> > > > >
> > > > > In fact, in a previous v7 version, I proposed a change that I personally thought was
> > > > > quite good, which can be seen here:
> > > > > https://lore.kernel.org/netdev/20250822132051.266787-1-jackzxcui1989@163.com/. However,
> > > > > this change introduced an additional variable and more logic. Willem also pointed out
> > > > > that the added complexity to avoid a non-problematic issue was unnecessary.
> > > >
> > > > Admittedly it's a bit complex.
> > > >
> > > > >
> > > > > As mentioned in Changes in v8:
> > > > >   The only special case is when prb_open_block is called from tpacket_rcv.
> > > > >   That would set the timeout further into the future than the already queued
> > > > >   timer. An earlier timeout is not problematic. No need to add complexity to
> > > > >   avoid that.
> > > >
> > > > It'd be better to highlight this in the commit message as well to
> > > > avoid further repeat questions from others. It's an obvious change in
> > > > this patch :)
> > >
> > > BTW, I have to emphasize that after this patch, the hrtimer will run
> > > periodically and unconditionally. As far as I know, it's not common
> > > to run hundreds or thousands of packet sockets in production, so it
> > > might not be a huge problem.
> >
> > In tcp each sk has its own hrtimers (tcp_init_xmit_timers).
> 
> Right, but they don't get woken up so frequently in normal cases. I'm
> worried because in production I saw a huge impact caused by the 1ms
> timer from numerous sockets:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4dd0d3a2f64b8bd8029ec70f52bdbebd0644408
> But AFAIK there is no real use case where people use that many packet
> sockets. If it happens, we might get complaints in the future. We'll
> see :)

Hrtimers share an interrupt.

The worst case for additional overhead is when the timer expiration
rate is slightly lower than the packet arrival rate (and thus the
tpacket_rcv event rate).

Because the higher the tpacket_rcv rate is relative to the timer, the
smaller the relative timer callback overhead. And if the timeout
frequency is higher than the tpacket_rcv rate, the timer would have
fired anyway.

So worst case the overhead is a doubling of block management calls?
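
As a rough illustration (made-up numbers): with a 4 ms retire interval
and blocks turning over every ~3 ms via tpacket_rcv, the old code kept
pushing the expiry out so the timer almost never fired, whereas now it
fires ~250 times/s on top of ~333 block opens/s -- roughly that
doubling.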

I don't have hard numbers, but the cost of removing and reprogramming
a timer from inside tpacket_rcv just to defer it seems higher than
accepting this bounded (relative to other operations) cost of
spurious wakeups.
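
For reference, a minimal sketch of the callback shape being discussed,
not the actual patch: field names are approximate and pkc_to_po() /
retire_current_block_if_needed() are placeholder helpers.

/*
 * Minimal sketch, not the actual patch: the callback does all the
 * rearming itself via hrtimer_forward_now() + HRTIMER_RESTART, so
 * tpacket_rcv never has to touch a possibly-queued timer.
 */
static enum hrtimer_restart retire_blk_timer_cb(struct hrtimer *t)
{
	struct tpacket_kbdq_core *pkc =
		container_of(t, struct tpacket_kbdq_core, retire_blk_timer);
	struct packet_sock *po = pkc_to_po(pkc);	/* hypothetical helper */

	spin_lock(&po->sk.sk_receive_queue.lock);
	/* Retire/open the current block as needed; a spuriously early
	 * expiry only closes the block a bit sooner than intended.
	 */
	retire_current_block_if_needed(pkc, po);	/* hypothetical helper */
	spin_unlock(&po->sk.sk_receive_queue.lock);

	/* Forward only here, inside the callback, where it cannot race
	 * with the hrtimer core enqueueing this timer.
	 */
	hrtimer_forward_now(t, ms_to_ktime(pkc->retire_blk_tov));
	return HRTIMER_RESTART;
}

The point is just that hrtimer_forward_now() is only ever called from
within the callback, so the WARN_ON scenario from the earlier trace
cannot occur.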

Thread overview: 22+ messages
2025-09-05  4:00 [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the retire operation Xin Zhao
2025-09-05  6:03 ` Jason Xing
2025-09-05  6:45   ` Jason Xing
2025-09-05 16:16     ` Willem de Bruijn
2025-09-06  0:14       ` Jason Xing
2025-09-06 15:25         ` Willem de Bruijn [this message]
2025-09-06 15:53           ` Jason Xing
  -- strict thread matches above, loose matches on Subject: below --
2025-09-05 14:47 Xin Zhao
2025-09-05 16:16 ` Willem de Bruijn
2025-09-04 14:59 Xin Zhao
2025-09-05  0:09 ` Jason Xing
2025-09-03 17:07 Xin Zhao
2025-09-04  3:26 ` Jason Xing
2025-09-03 16:17 Xin Zhao
2025-09-04  2:50 ` Jason Xing
2025-09-01 14:16 Xin Zhao
2025-09-01  2:27 Xin Zhao
2025-09-01 13:35 ` Willem de Bruijn
2025-08-31 10:08 [PATCH net-next v10 0/2] net: af_packet: optimize " Xin Zhao
2025-08-31 10:08 ` [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the " Xin Zhao
2025-09-01  1:21   ` Willem de Bruijn
2025-09-02 15:43   ` Jason Xing
2025-09-02 16:42     ` Jason Xing
