commit to kernel fails since Debian 12 (bookworm)

All of lore.kernel.org
 help / color / mirror / Atom feed

* commit to kernel fails since Debian 12 (bookworm)
@ 2023-10-13 14:02 Markus Wigge
  2023-10-13 14:26 ` Kevin P. Fleming
  2023-10-13 14:41 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 6+ messages in thread
From: Markus Wigge @ 2023-10-13 14:02 UTC (permalink / raw)
  To: netfilter


[-- Attachment #1.1: Type: text/plain, Size: 6476 bytes --]

Hello,

we are encountering a strange problem with conntrackd after upgrading to 
Debian 12 (bookworm).
First the logs were flooded with errors like this:
> 2023-10-13T12:49:06.724542+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device or resource busy                            
> 2023-10-13T12:49:06.724690+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device or resource busy                            
> 2023-10-13T12:49:06.724847+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device or resource busy                            
> 2023-10-13T12:49:06.725048+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy
> 2023-10-13T12:49:06.725182+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy
> 2023-10-13T12:49:06.725271+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy
> 2023-10-13T12:49:06.725406+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy


Then with strace we discovered that conntrackd seems not to be able to 
commit received state information to the kernel any more:

> pselect6(18, [3 6 7 8 11 13 17], NULL, NULL, {tv_sec=0, tv_nsec=997962986}, NULL) = 1 (in [6], left {tv_sec=0, tv_nsec=995201883})                                                            
> rt_sigprocmask(SIG_BLOCK, [INT TERM CHLD], NULL, 8) = 0                                                                                                                                       
> recvfrom(6, "\20\0\0<h\233\333k\0\f\0\0\215@\314\v\215@\342+\0\10\0\5\0\0\1\210\0\5\0\2"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(46346), sin_addr=inet_addr("172.23.42.10")}, [16]) =
>  60                                                                                                                                                                                           
> sendto(4, [{nlmsg_len=172, nlmsg_type=NFNL_SUBSYS_CTNETLINK<<8|IPCTNL_MSG_CT_NEW, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE, nlmsg_seq=0, nlmsg_pid=0}, {nfgen_family=AF_INE
> T, version=NFNETLINK_V0, res_id=htons(0)}, [[{nla_len=52, nla_type=NLA_F_NESTED|0x1}, "\x14\x00\x01\x80\x08\x00\x01\x00\x8d\x40\xcc\x0b\x08\x00\x02\x00\x8d\x40\xe2\x2b\x1c\x00\x02\x80\x05\x0
> 0\x01\x00\x06\x00\x00\x00"...], [{nla_len=52, nla_type=NLA_F_NESTED|0x2}, "\x14\x00\x01\x80\x08\x00\x01\x00\x8d\x40\xe2\x2b\x08\x00\x02\x00\x8d\x40\xcc\x0b\x1c\x00\x02\x80\x05\x00\x01\x00\x0
> 6\x00\x00\x00"...], [{nla_len=8, nla_type=0x3}, "\x00\x00\x01\x88"], [{nla_len=8, nla_type=0x7}, "\x00\x00\x00\x78"], [{nla_len=32, nla_type=NLA_F_NESTED|0x4}, "\x1c\x00\x01\x80\x05\x00\x01\
> x00\x01\x00\x00\x00\x06\x00\x04\x00\x08\x08\x00\x00\x06\x00\x05\x00\x08\x08\x00\x00"]]], 172, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 172                              
> recvfrom(4, [{nlmsg_len=192, nlmsg_type=NLMSG_ERROR, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=369424}, {error=-EBUSY, msg=[{nlmsg_len=172, nlmsg_type=NFNL_SUBSYS_CTNETLINK<<8|IPCTNL_MSG_CT_NEW,
>  nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE, nlmsg_seq=0, nlmsg_pid=0}, {nfgen_family=AF_INET, version=NFNETLINK_V0, res_id=htons(0)}, [[{nla_len=52, nla_type=NLA_F_NESTED|0
> x1}, "\x14\x00\x01\x80\x08\x00\x01\x00\x8d\x40\xcc\x0b\x08\x00\x02\x00\x8d\x40\xe2\x2b\x1c\x00\x02\x80\x05\x00\x01\x00\x06\x00\x00\x00"...], [{nla_len=52, nla_type=NLA_F_NESTED|0x2}, "\x14\x
> 00\x01\x80\x08\x00\x01\x00\x8d\x40\xe2\x2b\x08\x00\x02\x00\x8d\x40\xcc\x0b\x1c\x00\x02\x80\x05\x00\x01\x00\x06\x00\x00\x00"...], [{nla_len=8, nla_type=0x3}, "\x00\x00\x01\x88"], [{nla_len=8,
>  nla_type=0x7}, "\x00\x00\x00\x78"], [{nla_len=32, nla_type=NLA_F_NESTED|0x4}, "\x1c\x00\x01\x80\x05\x00\x01\x00\x01\x00\x00\x00\x06\x00\x04\x00\x08\x08\x00\x00\x06\x00\x05\x00\x08\x08\x00\x
> 00"]]]}], 8192, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) = 192                                                                                                          
> newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=2298, ...}, 0) = 0                                                                                                      
> getpid()                                = 369424                                                                                                                                              
> write(2, "[Fri Oct 13 12:17:39 2023] (pid="..., 50) = 50                                                                                                                                      
> write(2, "could not add new ct entry: Devi"..., 51) = 51                                                                                                                                      
> write(2, "\n", 1)                       = 1                                                                                                                                                   


Any ideas when and why this broke? Is it a known bug? I was not able to 
find anything about that yet.
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

# dpkg -l conntrackd
Gewünscht=Unbekannt/Installieren/R=Entfernen/P=Vollständig Löschen/Halten
| Status=Nicht/Installiert/Config/U=Entpackt/halb konFiguriert/
          Halb installiert/Trigger erWartet/Trigger anhängig
|/ Fehler?=(kein)/R=Neuinstallation notwendig (Status, Fehler: 
GROSS=schlecht)
||/ Name           Version      Architektur  Beschreibung
+++-==============-============-============-=================================
ii  conntrackd     1:1.4.7-1+b2 amd64        Connection tracking daemon

# uname -a
Linux fw-dc-c 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 
(2023-05-08) x86_64 GNU/Linux


Kind Regards
Markus

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit to kernel fails since Debian 12 (bookworm)
  2023-10-13 14:02 commit to kernel fails since Debian 12 (bookworm) Markus Wigge
@ 2023-10-13 14:26 ` Kevin P. Fleming
  2023-10-13 14:41 ` Pablo Neira Ayuso
  1 sibling, 0 replies; 6+ messages in thread
From: Kevin P. Fleming @ 2023-10-13 14:26 UTC (permalink / raw)
  To: netfilter

> # uname -a
> Linux fw-dc-c 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 
> (2023-05-08) x86_64 GNU/Linux

I doubt it's the cause, but your kernel is four releases behind: current Bookworm kernel is 6.1.0-13 (6.1.55-1).

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit to kernel fails since Debian 12 (bookworm)
  2023-10-13 14:02 commit to kernel fails since Debian 12 (bookworm) Markus Wigge
  2023-10-13 14:26 ` Kevin P. Fleming
@ 2023-10-13 14:41 ` Pablo Neira Ayuso
       [not found]   ` <6289ae8d-7d8e-40a5-a012-3e6e32251942@bht-berlin.de>
  1 sibling, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2023-10-13 14:41 UTC (permalink / raw)
  To: Markus Wigge; +Cc: netfilter

Hi,

On Fri, Oct 13, 2023 at 04:02:01PM +0200, Markus Wigge wrote:
> Hello,
> 
> we are encountering a strange problem with conntrackd after upgrading to
> Debian 12 (bookworm).
> First the logs were flooded with errors like this:
> > 2023-10-13T12:49:06.724542+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13
> > 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device
> > or resource busy
> > 2023-10-13T12:49:06.724690+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13
> > 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device
> > or resource busy
> > 2023-10-13T12:49:06.724847+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13
> > 12:49:06 2023] (pid=421008) [warning] could not add new ct entry: Device
> > or resource busy
> > 2023-10-13T12:49:06.725048+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13
> > 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if
> > creating it instead: Device or resource busy
> > 2023-10-13T12:49:06.725182+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy
> > 2023-10-13T12:49:06.725271+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy
> > 2023-10-13T12:49:06.725406+02:00 fw-dc-c conntrackd[421008]: [Fri Oct 13 12:49:06 2023] (pid=421008) [warning] could not update ct entry, even if creating it instead: Device or resource busy

Could you describe your setup? From your report, I can see you are
using:

        DisableExternalCache Off

Is this an active-active asymmetric path cluster?

EBUSY can be triggered in nf_conntrack_netlink.c in a few spots, this
is most likely ct status flags and conntrackd losing race to update
and entry that is being picked up from packet path.

Is your ruleset dropping invalid packets to disable lazy pick up?
That is, nf_conntrack_tcp_loose sysctl is set to zero.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit to kernel fails since Debian 12 (bookworm)
       [not found]       ` <43708702-0f37-4ea6-9b3d-4dc8ac2913a1@bht-berlin.de>
@ 2023-10-16 21:24         ` Pablo Neira Ayuso
  2023-10-18 11:31           ` Markus Wigge
  0 siblings, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2023-10-16 21:24 UTC (permalink / raw)
  To: Markus Wigge; +Cc: netfilter

Hi Markus,

On Mon, Oct 16, 2023 at 01:02:35PM +0000, Markus Wigge wrote:
> >> With each received message I got a "device or resource busy" when conntrackd
> >> tried to commit it to the kernel.
> >>
> >> When I try to commit the cache now I get all the same errors but at once ;-)
> >
> > That means there is already an entry in the kernel.
>
> Is there any known change between bullseye and bookworm that might
> explain this? Unfortunately I am not so deep inside the kernel mechanics
> involved here.

The only spots where EBUSY could reasonably happen in the kernel is here:

static int
ctnetlink_update_status(struct nf_conn *ct, const struct nlattr * const cda[])
{
        unsigned int status = ntohl(nla_get_be32(cda[CTA_STATUS]));
        unsigned long d = ct->status ^ status;

        if (d & IPS_SEEN_REPLY && !(status & IPS_SEEN_REPLY))
                /* SEEN_REPLY bit can only be set */
                return -EBUSY;

        if (d & IPS_ASSURED && !(status & IPS_ASSURED))
                /* ASSURED bit can only be set */
                return -EBUSY;

And this EBUSY can only happen if userspace (conntrackd) is losing
race to update an already existing entry in the kernel.

> >> The architecture is quite simple and used to work since several years. It
> >> started flooding the syslog with dist-upgrade to "bookworm".
> >> Two active-active nodes share a bunch of VLANs in two keepalived groups.
> >>
> >> Each node is primary for one of the groups and secondary for the other. The
> >> interfaces are configured correctly and traffic is flowing as expected.
> >
> > That is, flow-based distribution between the firewalls, correct?
>
> I am not sure about your definition of flow-based but it sounds
> plausible. Each node is responsible for its own dedicated VLANs they
> only failover on reboot or upgrades etc.

So VLAN interfaces are distributed between nodes and, on failover, one
node picks up the VLAN interfaces of the node that is failing? I am
trying to understand if, in your setup, one node is active but is is
also at the same time a backup for the flows that are handled by the
other node.

> >> bird and bird6 are announcing the routes correctly on each side.
> >> Shorewall is used to filter the passing traffic. Thats all.
> >>
> >>>
> >>> EBUSY can be triggered in nf_conntrack_netlink.c in a few spots, this
> >>> is most likely ct status flags and conntrackd losing race to update
> >>> and entry that is being picked up from packet path.
> >>>
> >>> Is your ruleset dropping invalid packets to disable lazy pick up?
> >>> That is, nf_conntrack_tcp_loose sysctl is set to zero.
> >>
> >> nope:
> >> # sysctl -a | grep loose
> >> net.netfilter.nf_conntrack_dccp_loose = 1
> >> net.netfilter.nf_conntrack_tcp_loose = 1
> >
> > If _loose is enabled, that means kernel conntrack can pick up entries
> > from the middle base from packet path.
>
> I don't understand this part. The kernel picks up connections
> automatically? But how when the flow started on the other node?

This is how it works with net.netfilter.nf_conntrack_tcp_loose = 1,
that toggle enables "poor man" connection pickup, that is, the kernel
infers from the middle of the connection the current state.

> > Is your ruleset dropping invalid packets?
>
> Only for smurfs as far as I can see:
> >  203M   19G smurfs     0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID,NEW,UNTRACKED
>
> > Chain smurfs (7 references)
> >  pkts bytes target     prot opt in     out     source               destination
> >   19M 6211M RETURN     0    --  *      *       0.0.0.0              0.0.0.0/0
> >     0     0 smurflog   0    --  *      *       0.0.0.0/0            0.0.0.0/0           [goto]  ADDRTYPE match src-type BROADCAST
> >     0     0 smurflog   0    --  *      *       224.0.0.0/4          0.0.0.0/0           [goto]

This RETURN means you take back invalid packets to the chain where the
jump to smurfs happen.

> > It looks like conntrackd is getting late to synchronize the states
> > for some flows because the packet path already created the entry via
> > _loose mechanism.
>
> Following the logs it appears to me that every single entry is getting
> late then. I doubt that and don't see where state should come from
> beforehand.

From datapath itself, from the _loose mechanism that is enabled.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit to kernel fails since Debian 12 (bookworm)
  2023-10-16 21:24         ` Pablo Neira Ayuso
@ 2023-10-18 11:31           ` Markus Wigge
  2023-10-18 12:05             ` Pablo Neira Ayuso
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Wigge @ 2023-10-18 11:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter

Hello,


> So VLAN interfaces are distributed between nodes and, on failover, one
> node picks up the VLAN interfaces of the node that is failing? I am
> trying to understand if, in your setup, one node is active but is is
> also at the same time a backup for the flows that are handled by the
> other node.
Yes, the VLAN interfaces are available on both nodes but only one nodes 
has configured IP addresses on the interface. The other nodes only takes 
over the address with keepalived if necessary.

So could it be possible, that the kernel notices flows on the passive 
VLAN interface?



> This is how it works with net.netfilter.nf_conntrack_tcp_loose = 1,
> that toggle enables "poor man" connection pickup, that is, the kernel
> infers from the middle of the connection the current state.
But why does the kernel see this connection at all when it flow over the 
other node?


>>> Is your ruleset dropping invalid packets?
>>
>> Only for smurfs as far as I can see:
>>>   203M   19G smurfs     0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID,NEW,UNTRACKED
>>
>>> Chain smurfs (7 references)
>>>   pkts bytes target     prot opt in     out     source               destination
>>>    19M 6211M RETURN     0    --  *      *       0.0.0.0              0.0.0.0/0
>>>      0     0 smurflog   0    --  *      *       0.0.0.0/0            0.0.0.0/0           [goto]  ADDRTYPE match src-type BROADCAST
>>>      0     0 smurflog   0    --  *      *       224.0.0.0/4          0.0.0.0/0           [goto]
> 
> This RETURN means you take back invalid packets to the chain where the
> jump to smurfs happen.
Yes and there are dedicated chains for the configured zones which each 
drop INVALID packets.

>> Following the logs it appears to me that every single entry is getting
>> late then. I doubt that and don't see where state should come from
>> beforehand.
> 
>  From datapath itself, from the _loose mechanism that is enabled.
What datapath? The passive node should not be involved with the flows of 
the other node?

Kind Regards
Markus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit to kernel fails since Debian 12 (bookworm)
  2023-10-18 11:31           ` Markus Wigge
@ 2023-10-18 12:05             ` Pablo Neira Ayuso
  0 siblings, 0 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2023-10-18 12:05 UTC (permalink / raw)
  To: Markus Wigge; +Cc: netfilter

On Wed, Oct 18, 2023 at 01:31:26PM +0200, Markus Wigge wrote:
> Hello,
> 
> > So VLAN interfaces are distributed between nodes and, on failover, one
> > node picks up the VLAN interfaces of the node that is failing? I am
> > trying to understand if, in your setup, one node is active but is is
> > also at the same time a backup for the flows that are handled by the
> > other node.
>
> Yes, the VLAN interfaces are available on both nodes but only one nodes has
> configured IP addresses on the interface. The other nodes only takes over
> the address with keepalived if necessary.
> 
> So could it be possible, that the kernel notices flows on the passive VLAN
> interface?

Then, I assume there is HA daemon that sets this IP on the VLAN
interface. From what you describe, disabling _loose connection pickup
should be safe.

> > This is how it works with net.netfilter.nf_conntrack_tcp_loose = 1,
> > that toggle enables "poor man" connection pickup, that is, the kernel
> > infers from the middle of the connection the current state.
>
> But why does the kernel see this connection at all when it flow over the
> other node?
> 
> > > > Is your ruleset dropping invalid packets?
> > > 
> > > Only for smurfs as far as I can see:
> > > >   203M   19G smurfs     0    --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID,NEW,UNTRACKED
> > > 
> > > > Chain smurfs (7 references)
> > > >   pkts bytes target     prot opt in     out     source               destination
> > > >    19M 6211M RETURN     0    --  *      *       0.0.0.0              0.0.0.0/0
> > > >      0     0 smurflog   0    --  *      *       0.0.0.0/0            0.0.0.0/0           [goto]  ADDRTYPE match src-type BROADCAST
> > > >      0     0 smurflog   0    --  *      *       224.0.0.0/4          0.0.0.0/0           [goto]
> > 
> > This RETURN means you take back invalid packets to the chain where the
> > jump to smurfs happen.
>
> Yes and there are dedicated chains for the configured zones which each drop
> INVALID packets.

Yes, but _loose is disabled. And I suspect _be_liberal is disabled
too, invalid packets are unlikely with this configuration.

> > > Following the logs it appears to me that every single entry is getting
> > > late then. I doubt that and don't see where state should come from
> > > beforehand.
> > 
> >  From datapath itself, from the _loose mechanism that is enabled.
>
> What datapath? The passive node should not be involved with the flows of the
> other node?

As said, EBUSY means conntrackd is trying to update an existing entry
in the kernel.

How is all this going on, you have to diagnose in your setup.

There is `conntrack -E` which shows the tag [USERSPACE] for entries
that are created by conntrackd.

    [NEW] tcp      6 10 ESTABLISHED src=1.1.1.1 dst=2.2.2.2 sport=10 dport=20 [UNREPLIED] src=2.2.2.2 dst=1.1.1.1 sport=20 dport=10 mark=0 [USERSPACE] portid=2149

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-10-18 12:05 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-13 14:02 commit to kernel fails since Debian 12 (bookworm) Markus Wigge
2023-10-13 14:26 ` Kevin P. Fleming
2023-10-13 14:41 ` Pablo Neira Ayuso
     [not found]   ` <6289ae8d-7d8e-40a5-a012-3e6e32251942@bht-berlin.de>
     [not found]     ` <ZS0TvfCRySTWfdW6@calendula>
     [not found]       ` <43708702-0f37-4ea6-9b3d-4dc8ac2913a1@bht-berlin.de>
2023-10-16 21:24         ` Pablo Neira Ayuso
2023-10-18 11:31           ` Markus Wigge
2023-10-18 12:05             ` Pablo Neira Ayuso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.