linux-kernel.vger.kernel.org archive mirror
* [GIT PULL] Networking for v6.16-rc6 (follow up)
@ 2025-07-11 15:10 Jakub Kicinski
  2025-07-11 17:44 ` pr-tracker-bot
  2025-07-11 18:33 ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Jakub Kicinski @ 2025-07-11 15:10 UTC (permalink / raw)
  To: torvalds; +Cc: kuba, davem, netdev, linux-kernel, pabeni

Hi Linus!

The following changes since commit bc9ff192a6c940d9a26e21a0a82f2667067aaf5f:

  Merge tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2025-07-10 09:18:53 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-6.16-rc6-2

for you to fetch changes up to a215b5723922f8099078478122f02100e489cb80:

  netlink: make sure we allow at least one dump skb (2025-07-11 07:31:47 -0700)

----------------------------------------------------------------
Big chunk of fixes for WiFi, Johannes says probably the last
for the release. The Netlink fixes (on top of the tree) restore
operation of iw (WiFi CLI) which uses a sillily small recv buffer,
and is the reason for this "emergency PR". The GRE multicast
fix also stands out among the user-visible regressions.

Current release - fix to a fix:

 - netlink: make sure we always allow at least one skb to be queued,
   even if the recvbuf is (mis)configured to be tiny
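
As an illustration of the "at least one skb" rule above, here is a minimal
Python sketch. It assumes a simplified byte-accounting model and is not the
actual net/netlink/af_netlink.c logic; the class and method names are
hypothetical and only borrow the kernel's vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReceiveQueue:
    """Toy model of a socket receive queue with byte accounting.

    Hypothetical illustration only: rcvbuf, rmem_alloc and truesize
    mimic kernel terminology but this is not kernel code.
    """
    rcvbuf: int                     # configured receive buffer limit, in bytes
    rmem_alloc: int = 0             # bytes currently charged to the queue
    queued: list = field(default_factory=list)

    def try_queue(self, truesize: int) -> bool:
        """Queue a buffer of `truesize` bytes if the policy allows it."""
        # An empty queue always accepts one buffer, however small rcvbuf is.
        # A non-empty queue rejects anything that would exceed the limit.
        if self.queued and self.rmem_alloc + truesize > self.rcvbuf:
            return False
        self.queued.append(truesize)
        self.rmem_alloc += truesize
        return True

    def receive(self) -> int:
        """Remove the oldest buffer and release its accounting."""
        truesize = self.queued.pop(0)
        self.rmem_alloc -= truesize
        return truesize
```

With a (mis)configured limit of 128 bytes, the first 4096-byte buffer is
still accepted, further buffers are refused until the reader drains the
queue, and the reader can therefore always make progress.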

Previous releases - regressions:

 - gre: fix IPv6 multicast route creation

Previous releases - always broken:

 - wifi: prevent A-MSDU attacks in mesh networks

 - wifi: cfg80211: fix S1G beacon head validation and detection

 - wifi: mac80211:
   - always clear frame buffer to prevent stack leak in cases which
     hit a WARN()
   - fix monitor interface in device restart

 - wifi: mwifiex: discard erroneous disassoc frames on STA interface

 - wifi: mt76:
   - prevent null-deref in mt7925_sta_set_decap_offload()
   - add missing RCU annotations, and fix sleep in atomic
   - fix decapsulation offload
   - fixes for scanning

 - phy: microchip: improve link establishment and reset handling

 - eth: mlx5e: fix race between DIM disable and net_dim()

 - bnxt_en: correct DMA unmap len for XDP_REDIRECT

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

----------------------------------------------------------------
Alok Tiwari (1):
      net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()

Carolina Jubran (2):
      net/mlx5: Reset bw_share field when changing a node's parent
      net/mlx5e: Fix race between DIM disable and net_dim()

Daniil Dulov (1):
      wifi: zd1211rw: Fix potential NULL pointer dereference in zd_mac_tx_to_dev()

Deren Wu (2):
      wifi: mt76: mt7925: prevent NULL pointer dereference in mt7925_sta_set_decap_offload()
      wifi: mt76: mt7921: prevent decap offload config before STA initialization

Eric Dumazet (1):
      netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()

Felix Fietkau (3):
      wifi: rt2x00: fix remove callback type mismatch
      wifi: mt76: add a wrapper for wcid access with validation
      wifi: mt76: fix queue assignment for deauth packets

Guillaume Nault (2):
      gre: Fix IPv6 multicast route creation.
      selftests: Add IPv6 multicast route generation tests for GRE devices.

Hangbin Liu (1):
      selftests: net: lib: fix shift count out of range

Henry Martin (1):
      wifi: mt76: mt7925: Fix null-ptr-deref in mt7925_thermal_init()

Jakub Kicinski (7):
      Merge tag 'wireless-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
      Merge branch 'net-phy-microchip-lan88xx-reliability-fixes'
      Merge branch 'gre-fix-default-ipv6-multicast-route-creation'
      Merge tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      Merge branch 'mlx5-misc-fixes-2025-07-10'
      Merge branch 'bnxt_en-3-bug-fixes'
      netlink: make sure we allow at least one dump skb

Jianbo Liu (1):
      net/mlx5e: Add new prio for promiscuous mode

Johannes Berg (3):
      wifi: mac80211: clear frame buffer to never leak stack
      wifi: mac80211: fix non-transmitted BSSID profile search
      Merge tag 'mt76-fixes-2025-07-07' of https://github.com/nbd168/wireless

Kito Xu (1):
      net: appletalk: Fix device refcount leak in atrtr_create()

Kuniyuki Iwashima (1):
      netlink: Fix rmem check in netlink_broadcast_deliver().

Lachlan Hodges (2):
      wifi: cfg80211: fix S1G beacon head validation in nl80211
      wifi: mac80211: correctly identify S1G short beacon

Leon Yen (1):
      wifi: mt76: mt792x: Limit the concurrent STA and SoftAP to operate on the same channel

Lorenzo Bianconi (5):
      wifi: mt76: Assume __mt76_connac_mcu_alloc_sta_req runs in atomic context
      wifi: mt76: Move RCU section in mt7996_mcu_set_fixed_field()
      wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl_fixed()
      wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()
      wifi: mt76: Remove RCU section in mt7996_mac_sta_rc_work()

Mathy Vanhoef (1):
      wifi: prevent A-MSDU attacks in mesh networks

Michael Lo (1):
      wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan

Ming Yen Hsieh (2):
      wifi: mt76: mt7925: fix the wrong config for tx interrupt
      wifi: mt76: mt7925: fix incorrect scan probe IE handling for hw_scan

Mingming Cao (1):
      ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof

Miri Korenblit (2):
      wifi: mac80211: always initialize sdata::key_list
      wifi: mac80211: add the virtual monitor after reconfig complete

Moon Hee Lee (1):
      wifi: mac80211: reject VHT opmode for unsupported channel widths

Oleksij Rempel (2):
      net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
      net: phy: microchip: limit 100M workaround to link-down events on LAN88xx

Pagadala Yesu Anjaneyulu (1):
      wifi: mac80211: Fix uninitialized variable with __free() in ieee80211_ml_epcs()

Sean Nyekjaer (1):
      can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level

Shravya KN (1):
      bnxt_en: Fix DCB ETS validation

Shruti Parab (1):
      bnxt_en: Flush FW trace before copying to the coredump

Somnath Kotur (1):
      bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT

Vitor Soares (1):
      wifi: mwifiex: discard erroneous disassoc frames on STA interface

 drivers/net/can/m_can/m_can.c                      |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_coredump.c |  18 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c      |   2 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   2 +-
 drivers/net/ethernet/ibm/ibmvnic.h                 |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/fs.h    |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  13 +-
 drivers/net/ethernet/xilinx/ll_temac_main.c        |   2 +-
 drivers/net/phy/microchip.c                        |   3 +-
 drivers/net/wireless/marvell/mwifiex/util.c        |   4 +-
 drivers/net/wireless/mediatek/mt76/mt76.h          |  10 ++
 drivers/net/wireless/mediatek/mt76/mt7603/dma.c    |   2 +-
 drivers/net/wireless/mediatek/mt76/mt7603/mac.c    |  10 +-
 drivers/net/wireless/mediatek/mt76/mt7615/mac.c    |   7 +-
 .../net/wireless/mediatek/mt76/mt76_connac_mac.c   |   2 +-
 .../net/wireless/mediatek/mt76/mt76_connac_mcu.c   |   6 +-
 drivers/net/wireless/mediatek/mt76/mt76x02.h       |   5 +-
 drivers/net/wireless/mediatek/mt76/mt76x02_mac.c   |   4 +-
 drivers/net/wireless/mediatek/mt76/mt7915/mac.c    |  12 +-
 drivers/net/wireless/mediatek/mt76/mt7915/mcu.c    |   2 +-
 drivers/net/wireless/mediatek/mt76/mt7915/mmio.c   |   5 +-
 drivers/net/wireless/mediatek/mt76/mt7921/mac.c    |   6 +-
 drivers/net/wireless/mediatek/mt76/mt7921/main.c   |   3 +
 drivers/net/wireless/mediatek/mt76/mt7925/init.c   |   2 +
 drivers/net/wireless/mediatek/mt76/mt7925/mac.c    |   6 +-
 drivers/net/wireless/mediatek/mt76/mt7925/main.c   |   8 +-
 drivers/net/wireless/mediatek/mt76/mt7925/mcu.c    |  79 ++++++--
 drivers/net/wireless/mediatek/mt76/mt7925/mcu.h    |   5 +-
 drivers/net/wireless/mediatek/mt76/mt7925/regs.h   |   2 +-
 drivers/net/wireless/mediatek/mt76/mt792x_core.c   |  32 +++-
 drivers/net/wireless/mediatek/mt76/mt792x_mac.c    |   5 +-
 drivers/net/wireless/mediatek/mt76/mt7996/mac.c    |  52 ++----
 drivers/net/wireless/mediatek/mt76/mt7996/main.c   |   5 +-
 drivers/net/wireless/mediatek/mt76/mt7996/mcu.c    | 199 +++++++++++++++------
 drivers/net/wireless/mediatek/mt76/mt7996/mt7996.h |  16 +-
 drivers/net/wireless/mediatek/mt76/tx.c            |  11 +-
 drivers/net/wireless/mediatek/mt76/util.c          |   2 +-
 drivers/net/wireless/ralink/rt2x00/rt2x00soc.c     |   4 +-
 drivers/net/wireless/ralink/rt2x00/rt2x00soc.h     |   2 +-
 drivers/net/wireless/zydas/zd1211rw/zd_mac.c       |   6 +-
 include/linux/ieee80211.h                          |  45 +++--
 include/net/netfilter/nf_flow_table.h              |   2 +-
 net/appletalk/ddp.c                                |   1 +
 net/ipv6/addrconf.c                                |   9 +-
 net/mac80211/cfg.c                                 |  14 ++
 net/mac80211/iface.c                               |   4 +-
 net/mac80211/mlme.c                                |  12 +-
 net/mac80211/parse.c                               |   6 +-
 net/mac80211/util.c                                |   9 +-
 net/netlink/af_netlink.c                           |   7 +-
 net/wireless/nl80211.c                             |   7 +-
 net/wireless/util.c                                |  52 +++++-
 tools/testing/selftests/net/gre_ipv6_lladdr.sh     |  27 +--
 tools/testing/selftests/net/lib.sh                 |   2 +-
 57 files changed, 500 insertions(+), 277 deletions(-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 15:10 [GIT PULL] Networking for v6.16-rc6 (follow up) Jakub Kicinski
@ 2025-07-11 17:44 ` pr-tracker-bot
  2025-07-11 18:33 ` Linus Torvalds
  1 sibling, 0 replies; 15+ messages in thread
From: pr-tracker-bot @ 2025-07-11 17:44 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: torvalds, kuba, davem, netdev, linux-kernel, pabeni

The pull request you sent on Fri, 11 Jul 2025 08:10:02 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-6.16-rc6-2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c7979c3917fa1326dae3607e1c6a04c12057b194

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 15:10 [GIT PULL] Networking for v6.16-rc6 (follow up) Jakub Kicinski
  2025-07-11 17:44 ` pr-tracker-bot
@ 2025-07-11 18:33 ` Linus Torvalds
  2025-07-11 18:46   ` Jakub Kicinski
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 18:33 UTC (permalink / raw)
  To: Jakub Kicinski, Thomas Zimmermann, Simona Vetter, Dave Airlie
  Cc: davem, netdev, linux-kernel, pabeni, dri-devel

[ Added in some drm people too, just to give a heads-up that it isn't
all their fault ]

On Fri, 11 Jul 2025 at 08:10, Jakub Kicinski <kuba@kernel.org> wrote:
>
>  The Netlink fixes (on top of the tree) restore
> operation of iw (WiFi CLI) which uses a sillily small recv buffer,
> and is the reason for this "emergency PR".

So this was "useful" in the sense that it seems to have taken my
"random long delays at initial graphical login" and made them
"reliable hangs at early boot time" instead.

I originally blamed the drm tree, because there were some other issues
in there with reference counting - and because the hang happened at
that "start graphical environment", but now it really looks like two
independent issues, where the netlink issues cause the delay, and the
drm object refcounting issues were entirely separate and coincidental.

I suspect that there is bootup code that needs more than that "just
one skb", and that all the recent fixes for the netlink sk_rmem_alloc
issues are broken and need reverting.

Because this "emergency PR" does seem to have turned my "annoying
problem with timeouts at initial login" into "now it doesn't boot at
all".

Which is good in that the random timeouts and delays were looking like
a nightmare to bisect, and now it looks like at least the cause of
them is more clear.

But it's certainly not good in the sense of "we're at almost rc6, we
shouldn't be having these kinds of issues".

The machine I see this on doesn't actually use WiFi at all, but there
*is* a WiFi chip in it, I just turn off that interface in favor of the
wired ports.

But obviously there might also be various other netlink users that are
unhappy with the accounting changes, so the WiFi angle may be a red
herring.

            Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 18:33 ` Linus Torvalds
@ 2025-07-11 18:46   ` Jakub Kicinski
  2025-07-11 18:54     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2025-07-11 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 11:33:10 -0700 Linus Torvalds wrote:
> Because this "emergency PR" does seem to have turned my "annoying
> problem with timeouts at initial login" into "now it doesn't boot at
> all".

Hm. I'm definitely okay with reverting. So if you revert these three:

a3c4a125ec72 ("netlink: Fix rmem check in netlink_broadcast_deliver().")
a3c4a125ec72 ("netlink: Fix rmem check in netlink_broadcast_deliver().")
ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")

everything is just fine?

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 18:46   ` Jakub Kicinski
@ 2025-07-11 18:54     ` Linus Torvalds
  2025-07-11 19:18       ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 18:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 11:46, Jakub Kicinski <kuba@kernel.org> wrote:
>
> Hm. I'm definitely okay with reverting. So if you revert these three:
>
> a3c4a125ec72 ("netlink: Fix rmem check in netlink_broadcast_deliver().")
> a3c4a125ec72 ("netlink: Fix rmem check in netlink_broadcast_deliver().")
> ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
>
> everything is just fine?

I'm assuming you mean

  a215b5723922 netlink: make sure we allow at least one dump skb
  a3c4a125ec72 netlink: Fix rmem check in netlink_broadcast_deliver().
  ae8f160e7eb2 netlink: Fix wraparounds of sk->sk_rmem_alloc.

Will do more testing.

             Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 18:54     ` Linus Torvalds
@ 2025-07-11 19:18       ` Linus Torvalds
  2025-07-11 19:30         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 19:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 11:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Will do more testing.

Bah. What I thought was a "reliable hang" isn't actually that at all.
It ends up still being very random indeed.

That said, I do think it's related to this netlink issue, because the
symptoms end up being random delays.

I've seen it at boot before even logging in (I saw that twice in a row
after the latest networking pull, which is why I thought it was
reliable).

But the much more common situation is that some random gnome app ends
up hanging and then timing out.

Sometimes it's gnome-shell itself, so when I log in nothing happens,
and then after a 30s timeout gnome-shell times out and I get back the
login window.

That was what I *thought* was the common failure case, but it turns
out that I've now several times seen just random other applications
having that issue. This boot, for example, things "worked", except
starting gnome-terminal took a long time, and then I get a random
crash report for gsd-screensaver-proxy.

The backtrace for that was

  g_bus_get_sync ->
    initable_init ->
      g_data_input_stream_read_line ->
        g_buffered_input_stream_fill ->
          g_buffered_input_stream_real_fill ->
            g_input_stream_read ->
              g_socket_receive_with_timeout ->
                g_socket_condition_timed_wait ->
                  poll ->
                    __syscall_cancel

and I suspect these are all symptoms of the same thing.

My *guess* is that all of these things use a netlink socket, and
presumably it's the *other* end of the socket that has filled up its
receive queue and is dropping packets as a result, and never
answering, so then - entirely randomly - depending on how overworked
things got, and which requests got dropped, some poor gnome process
never gets a reply and times out and the thing fails.
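
The failure mode guessed at above (a client blocked in poll() because the
peer never answers) can be sketched in Python. This is a minimal
illustration, assuming a Unix-domain socketpair as a stand-in for whatever
socket the GNOME processes actually use; the helper names and the 50 ms
timeout are hypothetical.

```python
import select
import socket


def wait_for_reply(sock: socket.socket, timeout_ms: int) -> bool:
    """Return True if `sock` becomes readable within `timeout_ms`."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() returns an empty list when the timeout expires with no events.
    return bool(poller.poll(timeout_ms))


def demo() -> tuple:
    """Show one timed-out wait and one answered wait on the same socket."""
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.sendall(b"request")
        # The peer stays silent, so the client's wait times out.
        silent = wait_for_reply(client, 50)
        # Now the peer reads the request and answers it.
        server.recv(16)
        server.sendall(b"reply")
        answered = wait_for_reply(client, 50)
        return silent, answered
    finally:
        client.close()
        server.close()
```

From the client's point of view a dropped request and a slow peer look the
same: poll() simply returns nothing until the timeout fires.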

And sometimes the things that fail are not very critical (like some
gsd-screensaver-proxy) and I can log in happily. And sometimes they
are rather more critical and nothing works.

Anyway, because it's so damn random, it's neither bisectable nor easy
to know when something is "fixed".

I spent several hours yesterday chasing all the wrong things (because
I thought it was in drm), and often thought "Oh, that fixed it". Only
to then realize that nope, the problem still happens.

I will test the reverts. Several times.

             Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 19:18       ` Linus Torvalds
@ 2025-07-11 19:30         ` Linus Torvalds
  2025-07-11 19:34           ` Jakub Kicinski
  2025-07-11 19:42           ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 19:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 12:18, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I spent several hours yesterday chasing all the wrong things (because
> I thought it was in drm), and often thought "Oh, that fixed it". Only
> to then realize that nope, the problem still happens.
>
> I will test the reverts. Several times.

Well, the first boot with those three commits reverted shows no problem at all.

But as mentioned, I've now had "Oh, that fixed it" about ten times.

So that "Oh, it worked this time" has been tainted by past experience.
Will do several more boots now in the hope that it's gone for good.

            Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 19:30         ` Linus Torvalds
@ 2025-07-11 19:34           ` Jakub Kicinski
  2025-07-11 19:42           ` Linus Torvalds
  1 sibling, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2025-07-11 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 12:30:28 -0700 Linus Torvalds wrote:
> > I spent several hours yesterday chasing all the wrong things (because
> > I thought it was in drm), and often thought "Oh, that fixed it". Only
> > to then realize that nope, the problem still happens.
> >
> > I will test the reverts. Several times.  
> 
> Well, the first boot with those three commits reverted shows no problem at all.
> 
> But as mentioned, I've now had "Oh, that fixed it" about ten times.
> 
> So that "Oh, it worked this time" has been tainted by past experience.
> Will do several more boots now in the hope that it's gone for good.

Fingers crossed. FWIW /proc/net/netlink should show the socket
drop counters. But my laptop running 6.15 has a number of
GNOME apps which never read their sockets, so it's not going to
be as immediately obvious as I hoped whether we regressed or
it's a bad app.
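
A minimal Python sketch for reading those drop counters follows. It assumes
the table has a header row containing "Pid" and "Drops" columns and looks
the columns up by name rather than by position; the function name and the
sample rows are hypothetical.

```python
def sockets_with_drops(table: str) -> list:
    """Return (pid, drops) pairs for rows whose Drops counter is non-zero.

    `table` is the text of a /proc/net/netlink-style listing. The column
    names "Pid" and "Drops" are an assumption about the header row.
    """
    lines = [ln for ln in table.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("Pid")
    drops_col = header.index("Drops")
    result = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        drops = int(fields[drops_col])
        if drops > 0:
            result.append((int(fields[pid_col]), drops))
    return result


# Made-up sample data, shaped like the real table for illustration only.
SAMPLE = """\
sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
0000000000000000 0   1234       00000001 0        0        0     2        0        20001
0000000000000000 0   5678       00000440 0        0        0     2        17       20002
"""
```

On a live system the same function can be fed the contents of
/proc/net/netlink, e.g. `sockets_with_drops(open("/proc/net/netlink").read())`.
A non-zero count alone does not separate a kernel regression from an
application that never reads its socket.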

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 19:30         ` Linus Torvalds
  2025-07-11 19:34           ` Jakub Kicinski
@ 2025-07-11 19:42           ` Linus Torvalds
  2025-07-11 19:53             ` Jakub Kicinski
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 19:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 12:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So that "Oh, it worked this time" has been tainted by past experience.
> Will do several more boots now in the hope that it's gone for good.

Yeah, no.

There's still something wrong. The second boot looked fine, but then
starting chrome had a 15s delay, and when that cleared I got a
notification that 'gnome-settings-daemon' had crashed.

And the backtrace is basically identical to the one I saw with
gsd-screensaver-proxy.

So it's some socket that times out, but reverting these three

  a215b5723922 netlink: make sure we allow at least one dump skb
  a3c4a125ec72 netlink: Fix rmem check in netlink_broadcast_deliver().
  ae8f160e7eb2 netlink: Fix wraparounds of sk->sk_rmem_alloc.

did *not* fix it.

Were there any other socket changes perhaps?

I just looked, and gsd-screensaver-proxy seems to use a regular Unix
domain stream socket. Maybe not related to netlink, did unix domain
sockets end up with some similar changes?

                   Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 19:42           ` Linus Torvalds
@ 2025-07-11 19:53             ` Jakub Kicinski
  2025-07-11 20:07               ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2025-07-11 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 12:42:54 -0700 Linus Torvalds wrote:
> Were there any other socket changes perhaps?
> 
> I just looked, and gsd-screensaver-proxy seems to use a regular Unix
> domain stream socket. Maybe not related to netlink, did unix domain
> sockets end up with some similar changes?

Humpf. Not that I can see, here's a list of commits since rc5 we sent
minus all the driver and wifi and data center stuff:

a3c4a125ec72 ("netlink: Fix rmem check in netlink_broadcast_deliver().")
a215b5723922 ("netlink: make sure we allow at least one dump skb")
ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")

ef9675b0ef03 ("Bluetooth: hci_sync: Fix not disabling advertising instance")
59710a26a289 ("Bluetooth: hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state")
314d30b15086 ("Bluetooth: hci_sync: Fix attempting to send HCI_Disconnect to BIS handle")
c7349772c268 ("Bluetooth: hci_event: Fix not marking Broadcast Sink BIS as connected")

d3a5f2871adc ("tcp: Correct signedness in skb remaining space calculation")
1a03edeb84e6 ("tcp: refine sk_rcvbuf increase for ooo packets")

ffdde7bf5a43 ("net/sched: Abort __tc_modify_qdisc if parent class does not exist")

Let me keep digging but other than the netlink stuff the rest doesn't
stand out..

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 19:53             ` Jakub Kicinski
@ 2025-07-11 20:07               ` Linus Torvalds
  2025-07-11 20:35                 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 20:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 12:53, Jakub Kicinski <kuba@kernel.org> wrote:
>
> Let me keep digging but other than the netlink stuff the rest doesn't
> stand out..

Oh well. I think I'll just have to go back to bisecting this thing.
I've tried to do that several times, and it has failed due to being
too flaky, but I think I've learnt the signs to look out for better
too.

For example, the first few times I was just looking for "not able to
log in", because I hadn't caught on to the fact that sometimes the
failures simply didn't hit something very important.

This clearly is timing-sensitive, and it's presumably hardware-dependent too.

And it could easily be that some bootup process gets stuck on
something entirely unrelated. Some random driver change - sound, pin
control, whatever - might then just end up having odd interactions.

I don't see any issues on my laptop. And considering how random the
behavior problems are, it could have been going on for a while without
me ever realizing it (plus I was running a distro kernel for at least
a few days without even noticing that I wasn't running my own build
any more).

I was hoping it was some known problem, because I'm not sure how
successful a bisect will be.

I guess I had nothing better to do this weekend anyway....

                  Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 20:07               ` Linus Torvalds
@ 2025-07-11 20:35                 ` Linus Torvalds
  2025-07-11 21:46                   ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 20:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 13:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Oh well. I think I'll just have to go back to bisecting this thing.
> I've tried to do that several times, and it has failed due to being
> too flaky, but I think I've learnt the signs to look out for better
> too.

Indeed. It turns out that the problem actually started somewhere
between rc4 and rc5, and all my previous bisections never even came
close, because kernels usually work well enough that I never realized
that it went back that far.

                Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 20:35                 ` Linus Torvalds
@ 2025-07-11 21:46                   ` Linus Torvalds
  2025-07-11 22:19                     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 21:46 UTC (permalink / raw)
  To: Jakub Kicinski, Frederic Weisbecker, Valentin Schneider, Nam Cao,
	Christian Brauner
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 13:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Indeed. It turns out that the problem actually started somewhere
> between rc4 and rc5, and all my previous bisections never even came
> close, because kernels usually work well enough that I never realized
> that it went back that far.

It looks like it's actually due to commit 8c44dac8add7 ("eventpoll:
Fix priority inversion problem"), and it's been going on for a while
now and the behavior was just too subtle for me to have noticed.

Does not look hardware-specific, except in the sense that it probably
needs several CPU's along with the odd startup pattern to trigger
this.

It's possible that the bisection ended up wrong, and when it appeared
to start going off in the weeds I was like "this is broken again", but
before I marked a kernel "good" I tested it several times, and then in
the end that "eventpoll: Fix priority inversion problem" kind of makes
sense after all.
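
Why a flaky bug needs several clean boots before a bisection step can be
marked "good" can be worked out with a little arithmetic. This is a minimal
Python sketch, assuming each boot of a bad kernel shows the problem
independently with some fixed probability; that rate is hypothetical and
was not measured in this thread.

```python
import math


def false_good_probability(failure_rate: float, boots: int) -> float:
    """Chance that a bad kernel survives `boots` boots without a failure.

    Assumes each boot fails independently with probability `failure_rate`.
    """
    return (1.0 - failure_rate) ** boots


def boots_needed(failure_rate: float, max_false_good: float) -> int:
    """Smallest number of clean boots that keeps the false-"good" risk
    at or below `max_false_good`."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - failure_rate))
```

Under that assumption, a problem that shows up on half of all boots still
slips past two clean boots a quarter of the time, and it takes five clean
boots to push the risk of a wrong "good" below 5%. A single wrong "good"
sends the rest of the bisection into the wrong half of the history.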

I would never have guessed at that commit otherwise (well, considering
that I blamed both the drm code and the netlink code first, that goes
without saying), but at the same time, that *is* the kind of change
that would certainly make user space get hung up with odd timeouts.

I've only tested the previous commit being good twice now, but I'll go
back to the head of tree and try a revert to verify that this is
really it. Because maybe it's the now Nth time I found something that
hides the problem, not the real issue.

Fingers crossed that this very timing-dependent odd problem really did
bisect right finally, after many false starts.

                 Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 21:46                   ` Linus Torvalds
@ 2025-07-11 22:19                     ` Linus Torvalds
  2025-07-11 23:58                       ` Nam Cao
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2025-07-11 22:19 UTC (permalink / raw)
  To: Jakub Kicinski, Frederic Weisbecker, Valentin Schneider, Nam Cao,
	Christian Brauner
  Cc: Thomas Zimmermann, Simona Vetter, Dave Airlie, davem, netdev,
	linux-kernel, pabeni, dri-devel

On Fri, 11 Jul 2025 at 14:46, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've only tested the previous commit being good twice now, but I'll go
> back to the head of tree and try a revert to verify that this is
> really it. Because maybe it's the now Nth time I found something that
> hides the problem, not the real issue.
>
> Fingers crossed that this very timing-dependent odd problem really did
> bisect right finally, after many false starts.

Ok, verified. Finally.

I've rebooted this machine five times now with the revert in place,
and now that I know to recognize all the subtler signs of breakage,
I'm pretty sure I finally got the right culprit.

Sometimes the breakage is literally just something like "it takes an
extra ten or fifteen seconds to start up some app" and then everything
ends up working, which is why it was so easy to overlook, and why my
other bisection attempts were such abject failures.

But that last bisection when I was more careful and knew what to look
for ended up laser-guided to that thing.

And apologies to the drm and netlink people who I initially blamed
just because there were unrelated bugs that just got merged in the
timeframe when I started noticing oddities. You may have had your own
bugs, but you were blameless on this issue that I basically spent the
last day on (I'd say "wasted" the last day on, but right now I feel
good about finding it, so I guess it wasn't wasted time after all).

Anyway, I think reverting that commit 8c44dac8add7 ("eventpoll: Fix
priority inversion problem") is the right thing for 6.16, and
hopefully Nam Cao & co can figure out what went wrong and we'll
revisit this in the future.

               Linus

* Re: [GIT PULL] Networking for v6.16-rc6 (follow up)
  2025-07-11 22:19                     ` Linus Torvalds
@ 2025-07-11 23:58                       ` Nam Cao
  0 siblings, 0 replies; 15+ messages in thread
From: Nam Cao @ 2025-07-11 23:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Kicinski, Frederic Weisbecker, Valentin Schneider,
	Christian Brauner, Thomas Zimmermann, Simona Vetter, Dave Airlie,
	davem, netdev, linux-kernel, pabeni, dri-devel

On Fri, Jul 11, 2025 at 03:19:00PM -0700, Linus Torvalds wrote:
> On Fri, 11 Jul 2025 at 14:46, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I've only tested the previous commit being good twice now, but I'll go
> > back to the head of tree and try a revert to verify that this is
> > really it. Because maybe it's the now Nth time I found something that
> > hides the problem, not the real issue.
> >
> > Fingers crossed that this very timing-dependent odd problem really did
> > bisect right finally, after many false starts.
> 
> Ok, verified. Finally.
> 
> I've rebooted this machine five times now with the revert in place,
> and now that I know to recognize all the subtler signs of breakage,
> I'm pretty sure I finally got the right culprit.
> 
> Sometimes the breakage is literally just something like "it takes an
> extra ten or fifteen seconds to start up some app" and then everything
> ends up working, which is why it was so easy to overlook, and why my
> other bisection attempts were such abject failures.
> 
> But that last bisection when I was more careful and knew what to look
> for ended up laser-guided to that thing.
> 
> And apologies to the drm and netlink people who I initially blamed
> just because there were unrelated bugs that just got merged in the
> timeframe when I started noticing oddities. You may have had your own
> bugs, but you were blameless on this issue that I basically spent the
> last day on (I'd say "wasted" the last day on, but right now I feel
> good about finding it, so I guess it wasn't wasted time after all).
> 
> Anyway, I think reverting that commit 8c44dac8add7 ("eventpoll: Fix
> priority inversion problem") is the right thing for 6.16, and
> hopefully Nam Cao & co can figure out what went wrong and we'll
> revisit this in the future.

Yes, please revert it. Another person reported a breakage to me earlier
today. We also think that reverting this commit for 6.16 is the right
thing.

Sorry for causing trouble. Strangely my laptop has been running with this
commit for ~6 weeks now without any trouble. Maybe I shouldn't have touched
this lockless business in the first place.

Best regards,
Nam
