* system panic while dentry reference count overflow
From: yangerkun @ 2019-05-06 3:36 UTC
To: viro; +Cc: linux-fsdevel, yi.zhang, houtao1, miaoxie
Hi,
When many processes run the program below in parallel on a box with a
huge amount of memory (2T in our case), the reference count of the root
dentry will overflow, because each allocation of a negative dentry takes
a reference (count++) on its parent - here the root dentry. Once the
count wraps, a subsequent dput() of the root dentry will free it, which
crashes the system. Has anyone else seen this problem?
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const char *forname = "_dOeSnotExist_.db";
        char filename[100] = "";
        struct timespec time1 = {0, 0};
        int i;

        for (;;) {
                clock_gettime(CLOCK_REALTIME, &time1);
                for (i = 0; i < 10000; i++) {
                        /* unique nonexistent name directly under "/":
                         * each failed lookup leaves a negative dentry
                         * pinning a reference on the root dentry */
                        sprintf(filename, "/%ld%ld%d%s",
                                (long)time1.tv_sec, time1.tv_nsec, i, forname);
                        access(filename, F_OK);
                        memset(filename, '\0', 100);
                }
        }
        return 0;
}
Thanks,
Kun.
* Re: system panic while dentry reference count overflow
From: Al Viro @ 2019-05-07 0:40 UTC
To: yangerkun; +Cc: linux-fsdevel, yi.zhang, houtao1, miaoxie, Linus Torvalds
On Mon, May 06, 2019 at 11:36:10AM +0800, yangerkun wrote:
> Hi,
>
> When many processes run the program below in parallel on a box with a
> huge amount of memory (2T in our case), the reference count of the root
> dentry will overflow, because each allocation of a negative dentry takes
> a reference (count++) on its parent - here the root dentry. Once the
> count wraps, a subsequent dput() of the root dentry will free it, which
> crashes the system. Has anyone else seen this problem?
The problem is, in principle, known - it's just that you need an obscene
amount of RAM to trigger it (you need 4G objects of some sort to hold those
references).
_If_ you have that much RAM, there's any number of ways to hit that thing -
it doesn't have to be cached results of lookups in a directory as in your
testcase. E.g. raise /proc/sys/fs/file-max past 4G (you will need a lot
of RAM for that, or the thing won't let you go that high) and just keep
opening the same file (using enough processes to get around the per-process
limit, or playing with SCM_RIGHTS sendmsg to yourself, etc.)
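For the record, the SCM_RIGHTS flavour would look something like this
(untested sketch; a single socket's send queue fills up long before 4G,
so a real attempt would need a lot of sockets and processes):

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>

int main(void)
{
        int sv[2];
        int fd = open("/etc/passwd", O_RDONLY);  /* dentry we want to pin */
        union {                                  /* aligned cmsg buffer */
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } u;
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* each queued message holds another struct file reference,
         * and through it another dentry reference; nobody receives */
        while (sendmsg(sv[0], &msg, 0) == 1)
                ;
        return 0;
}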
I don't think that making dget() able to fail is a feasible approach;
there are too many callers and hundreds of brand-new failure exits
that will almost never be exercised is _the_ recipe for bitrot from
hell.
An obvious approach would be to use atomic_long_t; the problem is that
it's not an atomic_t - it's a lockref, which is limited to 32 bits. Doing
a wider variant... hell knows - wider cmpxchg variants might be
usable, or we could put the upper bits into a separate word, with
cmpxchg loops in lockref_get() et al. treating "lower bits all zero" as
"fall back to grabbing spinlock".
Linus, lockref is your code, IIRC; which variant would you consider
more feasible?
We don't have that many places looking at the refcount, fortunately.
And most of them are using d_count(dentry) (comparisons or printk).
The rest is almost all in fs/dcache.c... So it's not as if we'd
been tied to refcount representation by arseloads of code all over
the tree.
* Re: system panic while dentry reference count overflow
From: Linus Torvalds @ 2019-05-07 1:50 UTC
To: Al Viro; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Mon, May 6, 2019 at 5:40 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Linus, lockref is your code, IIRC; which variant would you consider
> more feasible?
I think we should strive for the same kind of count overflow handling
that the pageref patches did: keep the count at 32 bits, but just add
a new "try_dget()" thing that returns "no" when the count grows too
large.
And then use the "try_dget()" model when possible, and particularly
for the easy cases for user mode to trigger. You don't have to catch
them all, and in most places it isn't worth even worrying about it
because users can't force billions of those places to be active at
once.
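In sketch form, modeled on try_get_page() (none of this is in the
tree, obviously):

/* refuse the reference instead of letting the count creep toward
 * an actual wrap; "negative" here means "past the 2G mark" */
static inline struct dentry * __must_check try_dget(struct dentry *dentry)
{
        /* deliberately racy check - there's ~2G of slop past it */
        if (unlikely(dentry->d_lockref.count < 0))
                return NULL;
        dget(dentry);
        return dentry;
}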
I don't see the original email (I'm not on fsdevel, and google doesn't
find it), so I don't see if there was some particular case that was
pointed out as being an easy attack vector.
For the page ref counts, it was all about get_user_pages() and a
couple of other places, and the patches ended up being small and
localized:
15fab63e1e57 fs: prevent page refcount overflow in pipe_buf_get
8fde12ca79af mm: prevent get_user_pages() from overflowing page refcount
88b1a17dfc3e mm: add 'try_get_page()' helper function
f958d7b528b1 mm: make page ref count overflow check tighter and more explicit
and I think we should see this kind of thing as the primary model,
rather than do the whole "let's make everything 64-bit".
Linus
* Re: system panic while dentry reference count overflow
From: Al Viro @ 2019-05-07 4:15 UTC
To: Linus Torvalds; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Mon, May 06, 2019 at 06:50:27PM -0700, Linus Torvalds wrote:
> On Mon, May 6, 2019 at 5:40 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Linus, lockref is your code, IIRC; which variant would you consider
> > more feasible?
>
> I think we should strive for the same kind of count overflow handling
> that the pageref patches did: keep the count at 32 bits, but just add
> a new "try_dget()" thing that returns "no" when the count grows too
> large.
>
> And then use the "try_dget()" model when possible, and particularly
> for the easy cases for user mode to trigger. You don't have to catch
> them all, and in most places it isn't worth even worrying about it
> because users can't force billions of those places to be active at
> once.
Umm... Where would you put the cutoff for try_dget()? 1G? Because
2G-<something relatively small> is risky - have it reached, then
get the rest of the way to 2G by normal dget() and you've got trouble.
The part that worries me is that we'd get new failure exits that will
be impossible to test without fuckloads of RAM in the testbox; certainly
more than I have ;-/
> I don't see the original email (I'm not on fsdevel, and google doesn't
> find it), so I don't see if there was some particular case that was
> pointed out as being an easy attack vector.
Attack is predicated upon having 2T of RAM; basically, a forkbomb with
plenty of access(2) on different nonexistent names in the same directory
from each process. And yes, that can be mitigated by limiting the
number of negative dentries, but I'm pretty sure that it can be
done with positive ones. E.g. take tmpfs, create a file there,
then link(2) a lot and you've got each child contributing to the parent
directory's dentry refcount. Or you can open the root directory
a lot (again, on tmpfs), then seek in it (each open and each cursor
will contribute). That way you need 1G open files, and /proc/sys/fs/file-max
set that high is not impossible if you have enough RAM. Will any
boxen be set up that way? No idea...
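The link(2) flavour is trivial to write down (untested sketch; assumes
/tmp is tmpfs, and you need ~4G links before it matters):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char name[64];
        long i;

        /* every positive dentry created here takes a reference
         * on the parent directory's dentry */
        close(open("/tmp/seed", O_CREAT | O_WRONLY, 0600));
        for (i = 0; ; i++) {
                snprintf(name, sizeof(name), "/tmp/link-%ld", i);
                if (link("/tmp/seed", name) < 0)
                        break;
        }
        return 0;
}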
Most of the attack vectors going through ->d_parent contributions are
relatively easy to deal with - d_alloc() can fail already, and having
that kind of overflow mapped to -ENOMEM is not a big deal. Ditto
for d_alloc_cursor(). Callers of __d_move() in d_splice_alias() are
not hard to deal with. However, d_move() itself is nasty. We call
it in situations where it's too late to fail. We _might_ get away
with it if we do a variant that consumes a reference to the new parent
and have that code go
        try to reserve an extra reference to new parent
        sod off if failed
        do the stuff that might fail
        if failed - drop that extra reference
        otherwise do the new variant of d_move()
That's an extra lock/unlock of parent's ->d_lock, but I'd expect
it to drown in the noise on those paths...
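In caller terms the pattern would be something like this (try_dget()
as you suggested; d_move_consume() is a made-up name):

        /* caller of a reference-consuming d_move() variant */
        if (unlikely(!try_dget(new_parent)))    /* reserve up front */
                return -ENOMEM;                 /* or a better errno */
        err = stuff_that_might_fail();
        if (err) {
                dput(new_parent);               /* return the reservation */
                return err;
        }
        d_move_consume(dentry, target);         /* eats the reserved ref */
        return 0;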
However, that obviously doesn't do anything for references
held by opened files. Having complete_walk() check for
refcount being too high and buggering off in that case
probably would help that one...
Ugh...
* Re: system panic while dentry reference count overflow
From: Linus Torvalds @ 2019-05-07 15:26 UTC
To: Al Viro; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Mon, May 6, 2019 at 9:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm... Where would you put the cutoff for try_dget()? 1G? Because
> 2G-<something relatively small> is risky - have it reached, then
> get the rest of the way to 2G by normal dget() and you've got trouble.
I'd make the limit be 2G exactly like the page count. Negative counts
are fine - they work exactly like large integers. It's only 0 that is
special.
So do something like this:
- make dget() WARN_ONCE(), and perhaps set a flag to start background
dentry pruning, if the dentry count is negative ("big integer") after
the lockref_get() (see the sketch below)
- add a try_dget(), which returns the dentry or NULL (and is
"must_check") and just refuses to increment the ref past the 2G mark
- add the "limit negative dentries" patches that were already written
for other reasons by Waiman Long.
- and exactly like the page ref count, the negative values can be
tested non-atomically without worrying about races, because it's not a
"hard" limit. It takes a *looong* time (and a lot of memory) to go
from 2G to actually overflowing
- for the same "not a hard limit", use try_dget() in a couple of
strategic places that are easy to error out for and that are
particularly easily user-triggerable. It's not clear if this is even
needed, since the only obviously user-triggerable case is the negative
dentry one - everything else really needs an actual user ref, and the
soft "start to try to prune if any dentry ref goes negative" will take
care of the "we just have a ton of unused but cached dentries" case.
All pretty much exactly like the page count.
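The first point, in sketch form (the pruning hook is hypothetical,
hence just a comment):

static inline struct dentry *dget(struct dentry *dentry)
{
        if (dentry) {
                lockref_get(&dentry->d_lockref);
                if (unlikely(dentry->d_lockref.count < 0)) {
                        /* "negative" == past 2G: warn, and this is
                         * where background pruning would be kicked */
                        WARN_ONCE(1, "dentry refcount went past 2G");
                }
        }
        return dentry;
}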
The fact that we have that "slop" of 2 _billion_ references between
"oh, the recount went negative" and "oops, now we overflowed and that
would be fatal" really means that we have a lot of time and
flexibility to handle things. If an attacker has to open two billion
files, the attacker is going to spend a lot of time that we can
mitigate.
Linus
* Re: system panic while dentry reference count overflow
From: Al Viro @ 2019-05-07 19:16 UTC
To: Linus Torvalds; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Tue, May 07, 2019 at 08:26:06AM -0700, Linus Torvalds wrote:
> On Mon, May 6, 2019 at 9:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm... Where would you put the cutoff for try_dget()? 1G? Because
> > 2G-<something relatively small> is risky - have it reached, then
> > get the rest of the way to 2G by normal dget() and you've got trouble.
>
> I'd make the limit be 2G exactly like the page count. Negative counts
> are fine - they work exactly like large integers. It's only 0 that is
> special.
Negative ->d_lockref.count is used for "lockref is dead"...
> - add the "limit negative dentries" patches that were already written
> for other reasons by Waiman Long.
Irrelevant here, IMO - negative or not (and evictable or pinned, for that
matter), it's more easily solved by having d_alloc() fail on overflow of
the parent's ->d_count. And I'm pretty sure we can hit that crap without
creating any negative dentries.
* Re: system panic while dentry reference count overflow
From: Linus Torvalds @ 2019-05-07 19:23 UTC
To: Al Viro; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Tue, May 7, 2019 at 12:16 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Negative ->d_lockref.count is used for "lockref is dead"...
We can change that to just -1, can't we? It's equally easy to test for.
Those aren't supposed to be incremented anyway, which is the whole point.
But we could do what the page refs also did: consider refcounts in the
"small negative range" to be very special, because they are either
critically close to an overflow, or they are actually a sign of a
fatal underflow due to some bug. And make one of those be the dead
marker.
(See page_ref_zero_or_close_to_overflow() for the particular critical
range check for the page ref)
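For reference, that check reads

#define page_ref_zero_or_close_to_overflow(page) \
        ((unsigned int) page_ref_count(page) + 127u <= 127u)

and a dentry analogue could be spelled the same way (hypothetical):

/* true for 0 and for the last 127 values before wraparound;
 * one of those negative values could double as the dead marker */
#define d_ref_zero_or_close_to_overflow(dentry) \
        ((unsigned int) (dentry)->d_lockref.count + 127u <= 127u)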
Linus
* Re: system panic while dentry reference count overflow
From: Al Viro @ 2019-05-07 19:55 UTC
To: Linus Torvalds; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Tue, May 07, 2019 at 12:23:23PM -0700, Linus Torvalds wrote:
> On Tue, May 7, 2019 at 12:16 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Negative ->d_lockref.count is used for "lockref is dead"...
>
> We can change that to just -1, can't we? It's equally easy to test for.
Provided that lockref.c is updated accordingly (look at e.g.
lockref_get_not_zero()).
> Those aren't supposed to be incremented anyway, which is the whole point.
>
> But we could do what the page refs also did: consider refcounts in the
> "small negative range" to be very special, because they are either
> critically close to an overflow, or they are actually a sign of a
> fatal underflow due to some bug. And make one of those be the dead
> marker.
lockref_get_not_zero() hitting dead dentry is not abnormal,
so we'd better not complain in such case... BTW, wouldn't that WARN_ON()
in dget() belong in lockref_get()?
* Re: system panic while dentry reference count overflow
From: Linus Torvalds @ 2019-05-07 20:47 UTC
To: Al Viro; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Tue, May 7, 2019 at 12:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>
> Provided that lockref.c is updated accordingly (look at e.g.
> lockref_get_not_zero()).
Yeah, we should likely just make this all a lockref feature.
The dcache is *almost* the only user of lockrefs. We've got them in
gfs2 too, but that looks like it would be perfectly happy with the
same model.
> lockref_get_not_zero() hitting dead dentry is not abnormal,
> so we'd better not complain in such case... BTW, wouldn't that WARN_ON()
> in dget() belong in lockref_get()?
Yeah.
Linus
* Re: system panic while dentry reference count overflow
From: Al Viro @ 2019-05-07 21:14 UTC
To: Linus Torvalds; +Cc: yangerkun, linux-fsdevel, yi.zhang, houtao1, miaoxie
On Tue, May 07, 2019 at 01:47:31PM -0700, Linus Torvalds wrote:
> On Tue, May 7, 2019 at 12:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >
> > Provided that lockref.c is updated accordingly (look at e.g.
> > lockref_get_not_zero()).
>
> Yeah, we should likely just make this all a lockref feature.
>
> The dcache is *almost* the only user of lockrefs. We've got them in
> gfs2 too, but that looks like it would be perfectly happy with the
> same model.
>
> > lockref_get_not_zero() hitting dead dentry is not abnormal,
> > so we'd better not complain in such case... BTW, wouldn't that WARN_ON()
> > in dget() belong in lockref_get()?
>
> Yeah.
OK... Lockref parts aside, I suspect that the right sequence
would be
        * make d_alloc() and d_alloc_cursor() check for excessive
growth of the parent's refcount and fail if it's about to occur
(a sketch follows below). That will result in -ENOMEM for now;
if we want another errno value, we can follow with making
d_alloc() return ERR_PTR() on failure (instead of NULL)
        * lift the increment of the new parent's refcount into the
callers of __d_move(..., false)
        * make the callers in d_splice_alias() fail if the refcount
is about to overflow
        * add a reference-consuming variant of d_move()
        * switch d_move() callers to that one by one, lifting
the refcount increment into those, with bailout on overflow
        * make complete_walk() check for overflow and bail out
if it happens
        * ditto for clone_mnt()
That would take care of the majority of long-term references; then
we'll see what's left...
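Sketch of the first step (disposal of the half-built child is elided):

struct dentry *d_alloc(struct dentry *parent, const struct qstr *name)
{
        struct dentry *dentry = __d_alloc(parent->d_sb, name);

        if (!dentry)
                return NULL;
        spin_lock(&parent->d_lock);
        if (unlikely(parent->d_lockref.count < 0)) {
                /* parent's refcount past 2G - refuse to pin it
                 * again; caller sees the usual -ENOMEM path */
                spin_unlock(&parent->d_lock);
                ...     /* free the unattached child */
                return NULL;
        }
        __dget_dlock(parent);   /* the increment being guarded */
        dentry->d_parent = parent;
        ...                     /* attach to the parent's list as now */
        spin_unlock(&parent->d_lock);
        return dentry;
}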