[RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs

linux-staging.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs
@ 2022-02-07  0:02 Fabio M. De Francesco
  2022-02-07  9:21 ` Dan Carpenter
  0 siblings, 1 reply; 5+ messages in thread
From: Fabio M. De Francesco @ 2022-02-07  0:02 UTC (permalink / raw)
  To: linux-staging, gregkh, phil, larry.finger
  Cc: julia.lawall, linux-kernel, Michael Straube, martin

Hello everyone,

I've noticed a couple of SAC bugs in r8188eu. Recently two of them have been 
addressed by me and one by Michael Straube (sorry if I recall it wrongly).

Prof. Julia Lawall wrote with Others in a paper that "[] Code executing in 
atomic context monopolizes a CPU core, and the progress of other threads that 
need to concurrently access the same resources is delayed. Thus the code 
execution in atomic context should complete as quickly as possible. Sleeping 
in atomic context is forbidden, as it can block a CPU core for a long period 
and may lead to a system hang. [] SAC bugs do not always cause problems in real 
execution, and they are often hard to reproduce at runtime. Recent studies [] 
have shown that SAC bugs have caused serious system hangs at runtime. Thus, 
it is necessary to detect and fix SAC bugs in kernel modules.".

Probably the phrase "SAC bugs do not always cause problems in real execution" 
is the key to understand why very few people seems to care and fix them.

While working on my last patch for this driver, I noticed that a Mutex is 
acquired while holding a Spinlock and while bottom halves are disabled.

In staging/r8188eu/core/rtw_ioctl_set.c, the function rtw_set_802_11_disassociate()
calls spin_lock_bh(). While holding that Spinlock, it calls _rtw_pwr_wakeup()
that, in turn, does a call to msleep().

My first question is whether or not msleep() can be called in atomic context. If
I understand its semantics and implementation it seems that it should be forbidden.
What about changing it to mdelay()? Again, it seems that mdelay() spins without 
sleeping so it should be safe. Isn't it?

Furthermore, I noticed that _rtw_pwr_wakeup() calls ips_leave(). The latter 
acquires the "pwrpriv->lock" Mutex. Aren't we forbidden to call Mutexes while 
holding Spinlocks? That lock is used to protect struct "pwrctrl_priv". I didn't
yet provide a patch because I'm pretty sure that I'm missing the good and 
correct reasons that led the author(s) to use a Mutex in that context...

My second question is: should we substitute that Mutex with a Spinlock and use 
it everywhere else the struct "pwrctrl_priv" must be protected in the driver?

Looking forward to reading what you all think about these two questions of mine.

Regards,

Fabio M. De Francesco

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs
  2022-02-07  0:02 [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs Fabio M. De Francesco
@ 2022-02-07  9:21 ` Dan Carpenter
  2022-02-07 14:18   ` Fabio M. De Francesco
  0 siblings, 1 reply; 5+ messages in thread
From: Dan Carpenter @ 2022-02-07  9:21 UTC (permalink / raw)
  To: Fabio M. De Francesco
  Cc: linux-staging, gregkh, phil, larry.finger, julia.lawall,
	linux-kernel, Michael Straube, martin

On Mon, Feb 07, 2022 at 01:02:17AM +0100, Fabio M. De Francesco wrote:
> My first question is whether or not msleep() can be called in atomic context.

You are not allowed to call msleep() in atomic context.  The Smatch
check for sleeping in atomic did not look for msleep so I will update
it.

> If
> I understand its semantics and implementation it seems that it should be forbidden.
> What about changing it to mdelay()? Again, it seems that mdelay() spins without 
> sleeping so it should be safe. Isn't it?

mdelay() is does not sleep, but it's not necessarily a good idea to
delay for a long time while holding a spinlock.

> 
> Furthermore, I noticed that _rtw_pwr_wakeup() calls ips_leave(). The latter 
> acquires the "pwrpriv->lock" Mutex. Aren't we forbidden to call Mutexes while 
> holding Spinlocks?

Correct.  You cannot take a mutex while holding a spinlock.

Where is the spinlock in taken in the code you're talking about?  If
it's rtw_set_802_11_disassociate() then that's fine.  The check for
if (check_fwstate(pmlmepriv, _FW_LINKED)) { will prevent ips_leave()
from being called.

> My second question is: should we substitute that Mutex with a Spinlock and use 
> it everywhere else the struct "pwrctrl_priv" must be protected in the driver?

Changing mutexes to spinlocks is tricky.  I can't review your proposed
patch before you send it.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs
  2022-02-07  9:21 ` Dan Carpenter
@ 2022-02-07 14:18   ` Fabio M. De Francesco
  2022-02-07 17:17     ` Fabio M. De Francesco
  2022-02-07 17:38     ` Dan Carpenter
  0 siblings, 2 replies; 5+ messages in thread
From: Fabio M. De Francesco @ 2022-02-07 14:18 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: linux-staging, gregkh, phil, larry.finger, julia.lawall,
	linux-kernel, Michael Straube, martin

On Mon, Feb 07, 2022 10:21:33 CET Dan Carpenter wrote:
> On Mon, Feb 07, 2022 at 01:02:17AM +0100, Fabio M. De Francesco wrote:

Hello Dan,

Thanks for your exhaustive reply.

> > My first question is whether or not msleep() can be called in atomic context.
> 
> You are not allowed to call msleep() in atomic context.

OK, well. You confirmed what I was suspecting.

> The Smatch
> check for sleeping in atomic did not look for msleep so I will update
> it.

I'm glad that my observations suggested you that update to Smatch :)

> > If
> > I understand its semantics and implementation it seems that it should be forbidden.

Said.

> > What about changing it to mdelay()? Again, it seems that mdelay() spins without 
> > sleeping so it should be safe. Isn't it?
> 
> mdelay() is does not sleep,

OK, again. It seems that I've been able to understand the different implementations 
and semantics of msleep() and mdelay().

> but it's not necessarily a good idea to
> delay for a long time while holding a spinlock.

Yeah, I agree with you. If I recall it well, somewhere I read a suggestion which
says that one should avoid mdelay() for more than 1 ms while under spinlocks.

Thus it looks like here we have a problem...

The disable of bottom halves and the acquire of a spinlock is performed in 
rtw_set_802_11_disassociate(). If "if (check_fwstate(pmlmepriv, _FW_LINKED))" 
evaluates true, rtw_pwr_wakeup() (that is a macro defined as _rtw_pwr_wakeup())
is invoked. In the latter function I see two conditional calls to msleep(10).

Thus it is 10 milliseconds and I suppose that it is a huge amount of time under 
spinlocks. Nevertheless, I think that we should use mdelay(10), unless with 
proper tests we find out that 10 ms can be reduced a bit.

A couple of months ago I made an analysis of a report by Syzbot that led to a 
patch in the tty line discipline. That ftrace analysis was necessary to prove 
that that SAC bug could hang an IOCTL for more than 8000 ms. 

Why am I saying this? Suppose that one of those msleep(10) cannot recover within
acceptable time, that 10 ms argument would behave like 100 ms, 1000 ms, or 
even hang the driver indefinitely.

I trust your words (as usually): "[] it's not necessarily a good idea to delay 
for a long time while holding a spinlock". But what are the alternatives?

1) Leave the code as-is with those msleep(10) and risk to hang the driver (or 
the whole kernel?).
2) Change msleep(10) to mdelay(10). It's a huge amount of time, but we are 
guaranteed that it is safe.
3) Change to mdelay() and try to figure out what's the minimum amount of time 
that suffice. I forgot to say that those msleep(10) are within "while" loops, 
therefore they are called an unknown amount of times.

If we choose the third option, how can we test the driver and see if shorter 
delays may work? Can you suggest a methodical approach for figure out the 
minimum amount of delay that can work? I have the hardware, therefore I can 
test the changes, but I'm not sure about how to do this work.

Any other solutions?

> > 
> > Furthermore, I noticed that _rtw_pwr_wakeup() calls ips_leave(). The latter 
> > acquires the "pwrpriv->lock" Mutex. Aren't we forbidden to call Mutexes while 
> > holding Spinlocks?
> 
> Correct.  You cannot take a mutex while holding a spinlock.

Again, thanks for confirming.

> Where is the spinlock in taken in the code you're talking about?  If
> it's rtw_set_802_11_disassociate() then that's fine.  The check for
> if (check_fwstate(pmlmepriv, _FW_LINKED)) { will prevent ips_leave()
> from being called.

You're right: "if (check_fwstate(pmlmepriv, _FW_LINKED))" in _rtw_pwr_wakeup() 
will prevent a call to ips_leave(). Therefore, it seems that we have no problems
with the mutex in ips_leave(). 

I had not noticed the above-mentioned "if" test. Sorry :(
So, let's leave the code as it is.

Thank you very much.

Regards,

Fabio M. De Francesco 

> 
> > My second question is: should we substitute that Mutex with a Spinlock and use 
> > it everywhere else the struct "pwrctrl_priv" must be protected in the driver?
> 
> Changing Mutexes to spinlocks is tricky.  I can't review your proposed
> patch before you send it.
> 
> regards,
> dan carpenter
> 
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs
  2022-02-07 14:18   ` Fabio M. De Francesco
@ 2022-02-07 17:17     ` Fabio M. De Francesco
  2022-02-07 17:38     ` Dan Carpenter
  1 sibling, 0 replies; 5+ messages in thread
From: Fabio M. De Francesco @ 2022-02-07 17:17 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: linux-staging, gregkh, phil, larry.finger, julia.lawall,
	linux-kernel, Michael Straube, martin

On luned? 7 febbraio 2022 15:18:52 CET Fabio M. De Francesco wrote:
> On Mon, Feb 07, 2022 10:21:33 CET Dan Carpenter wrote:
> > On Mon, Feb 07, 2022 at 01:02:17AM +0100, Fabio M. De Francesco wrote:
>
> [...]
>
> You're right: "if (check_fwstate(pmlmepriv, _FW_LINKED))" in _rtw_pwr_wakeup() 
> will prevent a call to ips_leave(). Therefore, it seems that we have no problems
> with the mutex in ips_leave(). 
> 
> I had not noticed the above-mentioned "if" test. Sorry :(
> So, let's leave the code as it is.

I'm writing again just to be sure that I made my argument clear. When I wrote 
"[] let's leave the code as it is [currently]" I was referring to the mutex_lock() 
that is _never_ reached while holding the spinlock that rtw_set_802_11_disassociate() 
takes before calling _rtw_pwr_wakeup().

Instead, if no one objects, I want to substitute the two "msleep(10);" with 
"mdelay(10);".

However, I'll wait some time just in case someone wants to suggest a better 
solution.

Fabio

 
> Thank you very much.
> 
> Regards,
> 
> Fabio M. De Francesco 
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs
  2022-02-07 14:18   ` Fabio M. De Francesco
  2022-02-07 17:17     ` Fabio M. De Francesco
@ 2022-02-07 17:38     ` Dan Carpenter
  1 sibling, 0 replies; 5+ messages in thread
From: Dan Carpenter @ 2022-02-07 17:38 UTC (permalink / raw)
  To: Fabio M. De Francesco
  Cc: linux-staging, gregkh, phil, larry.finger, julia.lawall,
	linux-kernel, Michael Straube, martin

On Mon, Feb 07, 2022 at 03:18:52PM +0100, Fabio M. De Francesco wrote:
> 1) Leave the code as-is with those msleep(10) and risk to hang the driver (or 
> the whole kernel?).
> 2) Change msleep(10) to mdelay(10). It's a huge amount of time, but we are 
> guaranteed that it is safe.
> 3) Change to mdelay() and try to figure out what's the minimum amount of time 
> that suffice. I forgot to say that those msleep(10) are within "while" loops, 
> therefore they are called an unknown amount of times.

You are talking about the:
rtw_set_802_11_disassociate() <- disables preempt
-> _rtw_pwr_wakeup()

_rtw_pwr_wakeup() calls msleep() on two different paths.

Let's just add it to the TODO list.

The code is a mess, right?  Maybe if we clean it up a bit first the
answer will become obvious.

The first loop is trying to prevent racing with rtw_ps_processor().
Sleeping is not a valid way to prevent race conditions.  Especially
since it just gives up and it's like, "Oh well, three seconds are up.
Let's just race."

I don't really think that sleeping in atomic bugs are likely to lock up
the whole system these days because even the cheapest mobile phones have
something like 64 cores in them.  But imagine it was twenty years and
you had only one core with CONFIG_PREEMPT turned on.  The CPU was in
ips_enter() and it was preempted by this thread, which is totally
possible in theory.  Then this thread took the spinlock and we changed
the msleep to mdelay().  There is now no way for the other thread to
complete ips_enter() because we don't give up the CPU so we just sit for
3 seconds and then timeout.

In other words, changing from msleep to mdelay could make bad code even
worse.  Plus it silences the warning so now the problem is hard to find.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-02-07 17:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-07  0:02 [RFC] staging: r8188eu: Sleeping in atomic context (SAC) bugs Fabio M. De Francesco
2022-02-07  9:21 ` Dan Carpenter
2022-02-07 14:18   ` Fabio M. De Francesco
2022-02-07 17:17     ` Fabio M. De Francesco
2022-02-07 17:38     ` Dan Carpenter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).