netfilter-devel.vger.kernel.org archive mirror
* [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
@ 2025-07-28 23:25 Dan Moulding
  2025-07-28 23:47 ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Moulding @ 2025-07-28 23:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: fw, pablo, dan, regressions

Hello netfilter folks,

Since v6.16-rc7 I've been hitting a vexing system hang (no kernel
panic is being produced that I can see). I did not have this problem
when running rc6. I first noticed it the morning after upgrading to
rc7. I found the machine unresponsive. Checking logs after restarting
it, I could see it had been in the middle of being backed up by an
rsync-based backup system. This same sequence repeated the following
day.

Then I also started experiencing the hang when running a build on a
proprietary codebase that I work on. The machine that is hanging is a
virtual machine host, with a fleet of VMs that I use for doing various
development tasks. One of those VMs is where I build the proprietary
system. I do that by SSHing to the VM from the host and invoking the
build from there. The strange thing is that at the point the hang
occurs, there's nothing overtly "networky" that the build system is
doing. It's just compressing and creating a self-extracting archive on
the build VM. But it happens with 100% consistency which allowed me to
bisect it down to commit 2d72afb34065 (netfilter: nf_conntrack: fix
crash due to removal of uninitialised entry).

The hang is still present on the final v6.16 just released
yesterday. I have confirmed that if I revert the above commit on top
of v6.16, I can no longer reproduce the problem.

I know this doesn't provide a lot of detail about the cause of the
problem, but the nature of the hang prevents me from being able to
check logs since the whole system becomes unresponsive. Any ideas on
next steps I might be able to take to gather more information, if
needed, are welcome.

Cheers,

-- Dan

#regzbot introduced: 2d72afb34065

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-28 23:25 [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix) Dan Moulding
@ 2025-07-28 23:47 ` Florian Westphal
  2025-07-29  0:25   ` Florian Westphal
  2025-07-29 17:02   ` Dan Moulding
  0 siblings, 2 replies; 8+ messages in thread
From: Florian Westphal @ 2025-07-28 23:47 UTC (permalink / raw)
  To: Dan Moulding; +Cc: netfilter-devel, pablo, regressions

Dan Moulding <dan@danm.net> wrote:
> Hello netfilter folks,
> 
> Since v6.16-rc7 I've been hitting a vexing system hang (no kernel
> panic is being produced that I can see). I did not have this problem
> when running rc6. I first noticed it the morning after upgrading to
> rc7. I found the machine unresponsive. Checking logs after restarting
> it, I could see it had been in the middle of being backed up by an
> rsync-based backup system. This same sequence repeated the following
> day.

Bah.  Can't see the problem.  Can you partial-revert and see what
happens?

E.g. only revert the changes to net/netfilter/nf_conntrack_core.c
and keep nf_ct_resolve_clash_harder().

Is this x86?

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-28 23:47 ` Florian Westphal
@ 2025-07-29  0:25   ` Florian Westphal
  2025-07-29 17:02   ` Dan Moulding
  1 sibling, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2025-07-29  0:25 UTC (permalink / raw)
  To: Dan Moulding; +Cc: netfilter-devel, pablo, regressions

Florian Westphal <fw@strlen.de> wrote:
> and keep nf_ct_resolve_clash_harder().
           ~~~~~~~~~~
Meant nf_ct_should_gc() (i.e. the change in nf_conntrack.h).

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-28 23:47 ` Florian Westphal
  2025-07-29  0:25   ` Florian Westphal
@ 2025-07-29 17:02   ` Dan Moulding
  2025-07-29 17:38     ` Florian Westphal
  1 sibling, 1 reply; 8+ messages in thread
From: Dan Moulding @ 2025-07-29 17:02 UTC (permalink / raw)
  To: fw; +Cc: dan, netfilter-devel, pablo, regressions

Florian Westphal <fw@strlen.de> wrote:
> Bah.  Can't see the problem.  Can you partial-revert and see what
> happens?
> 
> E.g. only revert the changes to net/netfilter/nf_conntrack_core.c

Ok. I just tried reverting only the changes to nf_conntrack_core.c and
the hang no longer occurs. This is on top of 6.16.

> Is this x86?

Yes, it's an Intel x86_64 Gentoo desktop and virtualization host, but
the kernel sources come directly from git.kernel.org (I don't run the
distribution's kernel sources -- I'm skeptical of the "value add").

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-29 17:02   ` Dan Moulding
@ 2025-07-29 17:38     ` Florian Westphal
  2025-07-31 15:49       ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2025-07-29 17:38 UTC (permalink / raw)
  To: Dan Moulding; +Cc: netfilter-devel, pablo, regressions

Dan Moulding <dan@danm.net> wrote:
> Ok. I just tried reverting only the changes to nf_conntrack_core.c and
> the hang no longer occurs. This is on top of 6.16.

Strange.  Can you completely revert 2d72afb340657f03f7261e9243b44457a9228ac7
and then apply this patch instead?

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -984,6 +984,7 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
        struct nf_conn_tstamp *tstamp;

        refcount_inc(&ct->ct_general.use);
+       ct->status |= IPS_CONFIRMED;

        /* set conntrack timestamp, if enabled. */
        tstamp = nf_conn_tstamp_find(ct);
@@ -1260,8 +1261,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
         * user context, else we insert an already 'dead' hash, blocking
         * further use of that particular connection -JM.
         */
-       ct->status |= IPS_CONFIRMED;
-
        if (unlikely(nf_ct_is_dying(ct))) {
                NF_CT_STAT_INC(net, insert_failed);
                goto dying;



(the confirm-bit-set moves from the too-early spot in __nf_conntrack_confirm
 to __nf_conntrack_insert_prepare).

Unlike 2d72afb340657f03f7261e9243b44457a9228ac7 it's still set before
hash insertion, but we no longer set it on entries that were not
inserted into the hash.

Unfortunately I still do not see why setting the bit after hashtable
insertion causes problems.  ____nf_conntrack_find() should skip/ignore
the entry, and I don't see how it causes an infinite loop or
double-insert or whatever else is causing this hang.


* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-29 17:38     ` Florian Westphal
@ 2025-07-31 15:49       ` Florian Westphal
  2025-07-31 19:49         ` Dan Moulding
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2025-07-31 15:49 UTC (permalink / raw)
  To: Dan Moulding; +Cc: netfilter-devel, pablo, regressions

Florian Westphal <fw@strlen.de> wrote:
> Dan Moulding <dan@danm.net> wrote:
> > Ok. I just tried reverting only the changes to nf_conntrack_core.c and
> > the hang no longer occurs. This is on top of 6.16.
> 
> Strange.  Can you completely revert 2d72afb340657f03f7261e9243b44457a9228ac7
> and then apply this patch instead?

Any news?  If you don't have the time to test, could you please share
kernel config or at least some details like CONFIG_PREEMPT settings, if
this uses kasan, kcsan etc.?

I'm asking because I still cannot reproduce any hangs, so I assume that
there is some significant difference between our setups.
While I could ask for a blank revert, that would get back the bug I
was trying to fix and I dislike doing so without understanding the cause
of the new bug first.

Are you using anything more exotic, say, conntrackd, conntrack
helpers, synproxy, or anything like that?

I was able to produce a memory leak by running the conntrack_resize.sh
selftest in a loop, but it's an unrelated bug in ctnetlink.

I will submit a patch later after some more testing.

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-31 15:49       ` Florian Westphal
@ 2025-07-31 19:49         ` Dan Moulding
  2025-08-30  3:48           ` Dan Moulding
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Moulding @ 2025-07-31 19:49 UTC (permalink / raw)
  To: fw; +Cc: dan, netfilter-devel, pablo, regressions

Florian Westphal <fw@strlen.de> wrote:
> > Strange.  Can you completely revert 2d72afb340657f03f7261e9243b44457a9228ac7
> > and then apply this patch instead?
> 
> Any news?

For some reason, I can no longer reproduce the problem in new kernels
that I build. I can still reproduce it in the kernels that I built
last week. But if I build a new one, from the same commit as I was
able to reproduce it from before, the new kernel build can't reproduce
it. I didn't change anything (like compiler version) since. So I'm
stumped.

I'm not sure what this means, other than that I can't test any new
patches because any new build I do doesn't exhibit the problem. I
guess at this point I can't rule out some kind of hardware failure on
my end (though it would seem very strange for a hardware problem to
consistently point to a specific commit and then go away after
reverting it).

So maybe we can ignore this for now and see if it comes up again with
anyone else.

Cheers,

-- Dan

* Re: [REGRESSION] v6.16 system hangs (bisected to nf_conntrack fix)
  2025-07-31 19:49         ` Dan Moulding
@ 2025-08-30  3:48           ` Dan Moulding
  0 siblings, 0 replies; 8+ messages in thread
From: Dan Moulding @ 2025-08-30  3:48 UTC (permalink / raw)
  To: dan; +Cc: fw, netfilter-devel, pablo, regressions

> For some reason, I can no longer reproduce the problem in new kernels
> that I build. I can still reproduce it in the kernels that I built
> last week. But if I build a new one, from the same commit as I was
> able to reproduce it from before, the new kernel build can't reproduce
> it. I didn't change anything (like compiler version) since. So I'm
> stumped.

Well, I finally figured out why I couldn't reproduce the problem in
some builds. It turns out structure layout randomization was affecting
whether there was a bug or not. So every build I did had a chance of
not having the bug. This also led my bisection effort down the wrong
path and wrongly indicated that this nf_conntrack fix was the first
bad commit (and also by pure chance of randstruct made it look like
the problem went away when the change was reverted).

Now that I know that it's a randstruct problem, I was able to use a
"known bad" randstruct.seed to reliably make reproducible builds that
always have the bug, and this time correctly bisected it to a problem
in compression code in the crypto API[1]. netfilter had nothing to do
with it. nf_conntrack is exonerated :)

So, Florian, sorry for taking up some of your valuable cycles on this
wild goose chase. But thank you for your attention and quick responses.

Cheers,

-- Dan

[1] https://lore.kernel.org/linux-crypto/20250830032839.11005-1-dan@danm.net/T/#t
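
As a generic illustration of how structure layout randomization can
turn latent layout assumptions into build-dependent bugs, the sketch
below models two possible member orders for the same structure. It is
not the crypto API bug referenced in [1], whose details are not given
in this thread; the type and helper names (hdr, layout_a, layout_b,
toy_container_of) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct hdr {
	int id;
};

/* One possible member order: the header happens to come first. */
struct layout_a {
	struct hdr h;
	void (*cb)(void);
	long data;
};

/* The same members in a shuffled order, as a different seed might give. */
struct layout_b {
	long data;
	void (*cb)(void);
	struct hdr h;
};

/*
 * Casting a pointer to the outer struct directly to a pointer to its
 * header is only correct when the header sits at offset 0.
 */
static bool hdr_first_in_a(void)
{
	return offsetof(struct layout_a, h) == 0;
}

static bool hdr_first_in_b(void)
{
	return offsetof(struct layout_b, h) == 0;
}

/* Layout-independent way to get from a member back to its container. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
```

Code that relies on the cast works with the first order and silently
misbehaves with the second, while the offsetof-based form is correct
for both. That kind of dependence on member order is one way a bug can
appear in some builds and not in others from the same commit.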
