* nftables.service hardening ideas
@ 2025-10-27  3:36 Christoph Anton Mitterer
  2025-10-28 16:26 ` Florian Westphal
  2025-10-30 23:30 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 7+ messages in thread
From: Christoph Anton Mitterer @ 2025-10-27  3:36 UTC
  To: netfilter-devel

Hey.


This would be ideas about further hardening nftables.service, primarily
using the options from systemd.exec(5).


1. The current .service already has:
> ProtectSystem=full

This can be further tightened to:
> ProtectSystem=strict

which not only mounts some but the entire fs hierarchy read-only for
the service's commands.
I guess nft -f should never write anywhere, or does it? At least it
seems to work.

Setting ProtectSystem= effectively already does something like
PrivateMounts=, so there should be no need to set the latter explicitly.


2. As per ProtectSystem=’s documentation, which refers to
ReadOnlyPaths=’s:
- I assume nft never mounts anything (that wouldn't be propagated back
  to the main namespace).
- ReadOnlyPaths= docs say, either
    CapabilityBoundingSet=~CAP_SYS_ADMIN 
  and/or
    SystemCallFilter=~@mount
  shall be set, or a process can undo any ReadOnlyPaths= (and thus also
  ProtectSystem= and others).

The SystemCallFilter= docs extend this recommendation to the case where
any of PrivateTmp=, PrivateDevices=, ProtectSystem=, ProtectHome=,
ProtectKernelTunables=, ProtectControlGroups=, ProtectKernelLogs=,
ProtectClock=, ReadOnlyPaths=, InaccessiblePaths= or ReadWritePaths=
is used.

We might actually set both:
> CapabilityBoundingSet=~CAP_SYS_ADMIN 
> SystemCallFilter=~@mount

(for CapabilityBoundingSet= even more, see below).


3. When using SystemCallFilter=, it is in turn recommended to also set:
> SystemCallArchitectures=native

I'd guess nft doesn't use syscalls from other archs?!


4. I guess nft needs no capabilities/privileges other than
CAP_NET_ADMIN:
> CapabilityBoundingSet=CAP_NET_ADMIN
> AmbientCapabilities=""
> NoNewPrivileges=yes

CapabilityBoundingSet=CAP_NET_ADMIN would supersede the =~CAP_SYS_ADMIN
from (2) above.

AmbientCapabilities="" disables all ambient capabilities, I'd blindly
guess that nftables doesn't execve(),... but it shouldn't harm either.
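
As a minimal sketch, (1) to (4) might be combined in a drop-in like the
following. The file name is hypothetical and this is not a tested or
proposed patch. The empty assignment is used for AmbientCapabilities=
because that is the reset form documented in systemd.exec(5):

```ini
# Hypothetical drop-in, e.g.:
#   /etc/systemd/system/nftables.service.d/10-hardening.conf
[Service]
# (1) mount the whole file system hierarchy read-only for the service
ProtectSystem=strict
# (2) keep the service from undoing the mount restrictions
SystemCallFilter=~@mount
# (3) recommended whenever SystemCallFilter= is used
SystemCallArchitectures=native
# (4) allow-list form, which also excludes CAP_SYS_ADMIN from (2)
CapabilityBoundingSet=CAP_NET_ADMIN
# (4) empty assignment resets the ambient capability set
AmbientCapabilities=
NoNewPrivileges=yes
```

After a "systemctl daemon-reload", "systemd-analyze security
nftables.service" can be used to check how systemd rates the resulting
unit.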


5. There should be no reason why nft -f needs to access stuff in /tmp
or /var/tmp of anything else, so:
> PrivateTmp=yes

This however makes /tmp/ and /var/tmp/ writable again (despite
the ProtectSystem=strict).

Even safer would be:
> PrivateTmp=disconnected

but then we also need (because we have DefaultDependencies=no and some
other conditions fulfilled):
> RequiresMountsFor=/var

or /var/tmp could "leak out".


6. I'd guess nft -f never changes the clocks or directly reads/writes
to the kernel log buffer (or does it?):
> ProtectClock=yes
> ProtectKernelLogs=yes

This also blocks syslog(2) (but not syslog(3)).


7. AFAICS, nft -f may cause modules to be loaded, but that's done
indirectly (i.e. I guess by the kernel itself?), so we can
> ProtectKernelModules=yes

Also makes /usr/lib/modules inaccessible, should that be used somehow.


8. I guess nft -f doesn't use devices (other than some standard ones
like /dev/null, etc.), IPC, RT nor namespaces and doesn't set
SUID/SGID:
> PrivateDevices=yes
> PrivateIPC=yes
> RestrictNamespaces=yes
> RestrictRealtime=yes
> RestrictSUIDSGID=yes


9. Does nftables use BPF or personalities? If not:
> PrivateBPF=yes
> LockPersonality=yes


10. A bit obscure perhaps...
> OOMScoreAdjust=-1000
or:
> OOMScoreAdjust=-999

The idea would be that nftables.service is security critical and should
rather not be OOMkilled in memory tight situations, also since it's
oneshot it would anyway give resources back soon.

Along with that one might set:
> OOMPolicy=kill

the default action is variable and set in system.conf (where it
defaults to stop).
Better a safe kill than sorry?!


Except perhaps for 10, the above things are IMO not totally
unreasonable.
So let's get a bit more exotic ;-)


11. Restrict executable paths:
> NoExecPaths=/
> ExecPaths=/usr/sbin/nft -/lib -/usr/lib

Not sure whether we'd also need to add locations lib32, lib64, libc32.
The above at least works on my amd64 Debian.


12. Make unneeded pathnames inaccessible:
> InaccessiblePaths=-/boot -/media -/mnt -/opt -/proc -/srv -/sys -/var
> TemporaryFileSystem=/etc:ro
> BindReadOnlyPaths=/etc/protocols /etc/services /etc/passwd /etc/group /etc/nftables /etc/nftables.conf /etc/resolv.conf

- Some of these are already read-only, but why allow even that if not
  needed?!
- /dev, /home, /root and /tmp are already secured via other options
- I consider /bin/, /lib*, /sbin, /usr to never contain anything
  sensitive.
  But perhaps /usr/local could be blocked (might contain private code).
- /proc/ and /sys are seemingly not needed.
- /run is apparently needed, so cannot be blocked
- /etc/ is needed, but not all of it, so I make a tmpfs mount on it,
  and bindmount only the needed stuff.
  - The above uses the Debian config /etc/nftables.conf and what
    upstream nftables.service would use for rules /etc/nftables.
  - /etc/protocols, /etc/services, /etc/passwd and /etc/group are
    needed for resolving proto, service, user and group names.
  - For /etc/resolv.conf see (14) below.

Seems to work, but has of course the disadvantage that it only
blacklists. If a user has /my-secrets it would still be readable.

Better would be something like:
> TemporaryFileSystem=/:ro

and selectively BindReadOnlyPaths= everything actually needed.
If that would be preferred, I could try to work out the required dirs.



Up to here, I've tested the settings with at least a very simple
rules file, and loading that still seemed to work.
The following I haven't tested yet,... thought I'd ask for feedback
first, whether these things would be even wanted.



13. This depends on (12) or better (TODO) below.
> ProtectHostname=yes

“Note that when this option is enabled for a service hostname changes
no longer propagate from the system into the service”... not sure
whether nftables ever uses the hostname/domainname? If so, that would
mean it could be outdated.

“Note that this option does not prevent changing system hostname via
hostnamectl”... not sure whether (12) makes this fully impossible by
having hostnamectl non-executable... the manpage suggests using User=


14. Disallowing unneeded address families.
> RestrictAddressFamilies=AF_NETLINK AF_UNIX AF_INET AF_INET6

- This I've actually checked, too.
- In principle AF_NETLINK suffices.
- AF_INET and AF_INET6 are needed if hostnames are resolved via DNS.
Now in principle, nftables.service is anyway meant to run at a point
where no networking is available yet, so one could say this (and
/etc/resolv.conf) above is pointless.
It may however also be restarted/reloaded when the system is already
up, and then DNS would in principle work (in case the nft rules have
been modified meanwhile to include hostnames/fqdns as addresses).
Whether that should then be allowed, or whether it would be better to
fail already then (and not wait for the next boot), is of course
another question. No strong opinions here from my side.
- Not sure whether AF_UNIX is used by anything in nft (or indirectly).

Because the above doesn't cover all cases how sockets may be accessed,
the manpage suggests to also use:
> SystemCallFilter=@service

So in our case it would be (order matters, hope it's the right one ^^):
> SystemCallFilter=@service
> SystemCallFilter=~@mount

(haven't checked whether this is enough for nft)

Also the SystemCallArchitectures=native already set before should be
set again when this is used.


15. Deny writeable memory mappings that are also executable:
> MemoryDenyWriteExecute=yes

To prevent circumvention, the manpage recommends to also set:
> InaccessiblePaths=/dev/shm
> SystemCallFilter=~memfd_create

Again, the SystemCallArchitectures=native already set above is
recommended, too.
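
Put together, (15) might look roughly like the sketch below. It is
untested against nft, and it assumes that the deny-list line simply
merges with the other SystemCallFilter= lines in the unit, as
systemd.exec(5) describes for repeated assignments:

```ini
[Service]
# (15) refuse memory mappings that are writable and executable at once
MemoryDenyWriteExecute=yes
# close the two circumvention routes named in the manpage
InaccessiblePaths=/dev/shm
SystemCallFilter=~memfd_create
# keep alternative ABIs from bypassing the filter
SystemCallArchitectures=native
```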


16. The following, AFAIU, would all depend on letting nft run as a
non-root user, via User= or rather DynamicUser=, which, I presume, we
could do as long as we give it CAP_NET_ADMIN.
> ProtectProc=invisible
> ProcSubset=pid   (not fully sure whether that requires non-root)
> RemoveIPC=yes
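
A rough, untested sketch of what (16) could look like. A non-root
process only keeps a capability across execve() if it is in the
ambient set, so this variant would have to grant CAP_NET_ADMIN via
AmbientCapabilities= rather than clear the set as in (4). Whether nft
and the rules files then still work for such a user is an assumption
here:

```ini
[Service]
# run nft as a transient, unprivileged user
DynamicUser=yes
# hand the one needed capability to the non-root process ...
AmbientCapabilities=CAP_NET_ADMIN
# ... and allow nothing beyond it
CapabilityBoundingSet=CAP_NET_ADMIN
NoNewPrivileges=yes
ProtectProc=invisible
ProcSubset=pid
# already implied by DynamicUser=yes per systemd.exec(5), listed for clarity
RemoveIPC=yes
```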


17. These all imply (18):
> PrivatePIDs=yes
> ProtectKernelTunables=yes   (not sure whether nft would need any of these)
> ProtectControlGroups=strict


18. The first two options from (16) and the ones from (17) also imply
> MountAPIVFS=yes
which in turn implies BindLogSockets=yes .
Not sure whether MountAPIVFS=yes causes any issues. Documentation kinda
implies it might only be effective if RootDirectory=/RootImage= are
used.


19. The following already default to secure values:
> KeyringMode=
> IgnoreSIGPIPE=


20. We could configure timeouts and restarting, like via TimeoutSec=
respectively Restart=, StartLimitIntervalSec= and StartLimitBurst=.
Not sure whether that would make any sense security wise, rather not.



I could/would make patches for all of the ones of your choice.
Of course if you know any cases where nft -f uses some feature which
would be restricted by the above, then please tell. :-)



Thanks,
Chris.
