* filesystem behavior when low on memory and PF_MEMALLOC
@ 2004-04-27 16:20 Steve French
2004-04-27 19:09 ` Bryan Henderson
0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2004-04-27 16:20 UTC (permalink / raw)
To: linux-fsdevel
Does PF_MEMALLOC have a similar effect to setting SLAB_NOFS (and
equivalent) on memory allocations, and prevent memory allocations in
critical code paths from blocking?
Sergey Vlasov recently made a good suggestion about fixing a problem
with very large file copy hangs via the use of the PF_MEMALLOC.
He noted that shrink_caches can cause writepage (cifs_writepage in my
case) to be invoked to write out dirty pages - but writepage needs to
allocate memory both explicitly (for the 4.5K cifs write buffer) and
implicitly as a result of using the sockets API (sock_sendmsg can
allocate memory), and this presumably can block. In addition, the cifs
demultiplex thread needs to get an acknowledgement from the server
before waking up the writepage thread - but the demultiplex thread can
allocate memory in some cases.
His suggested solution was to add the PF_MEMALLOC flag to the
current->flags for the demultiplex thread, which makes sense and seems
similar to what XFS and a few other filesystems do in some of their
daemons. What was harder to evaluate, though, was how to fix the context
of the process doing writepage - is it ok to temporarily set PF_MEMALLOC
on entry to a filesystem's writepage and writepages routines? Or would
this be redundant, since the linux/mm code should already be doing this
in all low memory paths in the calling function? And is it ok to always
clear PF_MEMALLOC on exit from cifs_writepage (and eventually
cifs_writepages when that is added)? The alternative is to
set SLAB_NOFS and equivalent on all memory allocations made by cifs on
behalf of writepages, which would probably be ok but would touch more
code and make the codepaths trickier (e.g. figuring out whether an smb
buffer allocation came from writepage). My initial observation was that
there is a significant performance hit from setting SLAB_NOFS on all
cifs buffer allocations (although I think that this is what at least one
other filesystem basically does) - it seems like overkill when writepage
(and possibly prepare_write/commit_write) are the ones that matter for
performance during low memory situations, as pages are being freed.
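For reference, Sergey's fix for the demultiplex thread amounts to setting one flag at thread startup. A minimal userspace sketch of the idea (the task struct and helper name here are illustrative stand-ins, not actual cifs code; the PF_MEMALLOC value matches the 2.6 headers):

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800  /* flag value from 2.6 include/linux/sched.h */

/* Userspace stand-in for the kernel's task_struct, for illustration. */
struct task { unsigned long flags; };

/* What the suggestion boils down to for the demultiplex thread: mark
 * the thread as a memory allocator so its allocations may be satisfied
 * from the reserve pool instead of recursing into page reclaim. */
static void demultiplex_thread_set_memalloc(struct task *tsk)
{
    tsk->flags |= PF_MEMALLOC;
}
```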
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: filesystem behavior when low on memory and PF_MEMALLOC
2004-04-27 16:20 filesystem behavior when low on memory and PF_MEMALLOC Steve French
@ 2004-04-27 19:09 ` Bryan Henderson
2004-04-27 20:29 ` filesystem signal handling Steve French
From: Bryan Henderson @ 2004-04-27 19:09 UTC (permalink / raw)
To: smfltc; +Cc: linux-fsdevel
>Does PF_MEMALLOC have a similar effect to setting SLAB_NOFS (and
>equivalent) on memory allocations, and prevent memory allocations in
>critical code paths from blocking?
It's more like SLAB_ATOMIC, in that it does indeed prevent memory
allocations from blocking under any circumstance (SLAB_NOFS doesn't). But
it goes further than SLAB_ATOMIC and allows a memory allocation to be
satisfied out of reserve memory (the last dregs of memory that are not
available to an ordinary process).
The biggest problem I've found getting PF_MEMALLOC to solve my memory
allocation deadlock problems is that the process in question might be
waiting for some other resource, which is held by another process that is
waiting for memory but doesn't have PF_MEMALLOC. This indirect wait for
memory isn't helped by PF_MEMALLOC.
One thing you don't mention in all your analysis is what you do when the
memory allocation fails. In exchange for having your allocations never
block, you have to deal with the fact that they fail in an essentially
normal system. How do you eventually get the memory you need, and make
sure you aren't responsible for the memory not being available?
I don't see anything wrong with setting PF_MEMALLOC anyplace it's useful,
but it should be a stacking thing; i.e. make sure you set it back to
whatever it was before returning to your caller.
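The "stacking" discipline described here - restore, don't just clear - can be sketched like this (a userspace stand-in, not actual kernel or cifs code):

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800  /* flag value from 2.6 include/linux/sched.h */

/* Userspace stand-in for the kernel's task_struct. */
struct task { unsigned long flags; };

/* Stacking save/restore: remember whether PF_MEMALLOC was already set,
 * set it for the duration of the operation, and put the previous state
 * back on exit.  Unconditionally clearing the flag on exit would
 * clobber a caller (e.g. the reclaim path) that had set it before
 * calling into the filesystem. */
static void writepage_like_op(struct task *tsk)
{
    unsigned long saved = tsk->flags & PF_MEMALLOC;

    tsk->flags |= PF_MEMALLOC;
    /* ... allocate buffers, call sock_sendmsg(), etc. ... */
    tsk->flags = (tsk->flags & ~(unsigned long)PF_MEMALLOC) | saved;
}
```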
Incidentally, I've personally had all sorts of grief with this same
problem. Linux appears to have a fundamental memory deadlock issue that
doesn't normally show up because of longterm memory scheduling of the
buffer cache, which makes it so that it takes a carefully crafted workload
to actually get down to that last page of memory. However, I have a
filesystem driver that doesn't use the buffer cache to schedule its cache
space allocation and in fact tries to exploit all the available memory for
file cache instead of an arbitrary percentage of it. It can quite easily
use up every page of memory and writepage deadlocks pop up.
* filesystem signal handling
2004-04-27 19:09 ` Bryan Henderson
@ 2004-04-27 20:29 ` Steve French
2004-04-28 15:14 ` David Woodhouse
From: Steve French @ 2004-04-27 20:29 UTC (permalink / raw)
Cc: linux-fsdevel
I noticed that the ramfs and libfs code do not have special handling of
signals (checking for signal_pending, returning or handling ERESTARTSYS
or EINTR).
Is it permitted for a filesystem to mask signals temporarily across most
fs calls and simply let the higher layers handle checking for
signal_pending (other than using the existence of a pending signal to
shorten schedule_timeouts so the app would not wait as long on errors)?
For network filesystems, especially for stateful calls (such as create,
open, unlink, link, mkdir, rmdir), signals look dangerous - what would
happen if the filesystem gets a signal in the midst of (or right after)
a socket write of a network request, and the filesystem returns an error
(such as ERESTARTSYS/EINTR) while the actual open ends up succeeding on
the server, but only after the client filesystem had returned an error
to the app? It seems like there could be cases in which the local kernel
had released a file struct (on an open failure due to a signal) but the
server would have to keep the file struct around.
Is there a Unix/Linux convention for signal handling in network
filesystems?
* Re: filesystem signal handling
2004-04-27 20:29 ` filesystem signal handling Steve French
@ 2004-04-28 15:14 ` David Woodhouse
2004-04-28 17:05 ` Trond Myklebust
2004-04-28 21:46 ` Bryan Henderson
From: David Woodhouse @ 2004-04-28 15:14 UTC (permalink / raw)
To: Steve French; +Cc: linux-fsdevel
On Tue, 2004-04-27 at 15:29 -0500, Steve French wrote:
> I noticed that the ramfs and libfs code do not have special handling of
> signals (checking for signal_pending, returning or handling ERESTARTSYS
> or EINTR).
>
> Is it permitted for a filesystem to mask signals temporarily across most
> fs calls and simply let the higher layers handle checking for
> signal_pending (other than using the existence of a pending signal to
> shorten schedule_timeouts so the app would not wait as long on errors)?
NFS does this. It's fairly ugly. What would be a _lot_ nicer is if we could
have something in the task_struct which is vaguely reminiscent of
preempt_count, only it counts the number of reasons why this task cannot
receive signals. So instead of using the TASK_INTERRUPTIBLE and
TASK_UNINTERRUPTIBLE states to make the decision, we'd look at the
task's uninterruptible_count instead.
That way, we can provide functions, including the ones you mention as
dangerous, which do the right thing _both_ when called from a function
which cannot tolerate signals, and when called from a function which
_can_ perform the necessary cleanup. Without screwing with the signal
masks.
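The proposal could be sketched roughly as follows - entirely hypothetical names, since nothing like this exists in the real task_struct:

```c
#include <assert.h>

/* Hypothetical per-task nesting counter: each region that cannot
 * tolerate signals bumps it on entry and drops it on exit, and the
 * low-level sleep primitive consults it instead of being told
 * TASK_INTERRUPTIBLE vs TASK_UNINTERRUPTIBLE by its caller. */
struct task { int uninterruptible_count; };

static void begin_uninterruptible(struct task *tsk)
{
    tsk->uninterruptible_count++;
}

static void end_uninterruptible(struct task *tsk)
{
    tsk->uninterruptible_count--;
}

/* Nonzero if a pending signal should be allowed to wake this sleep. */
static int sleep_interruptible(const struct task *tsk)
{
    return tsk->uninterruptible_count == 0;
}
```

Because it is a count rather than a flag, the regions nest naturally: a callee can mark itself uninterruptible without knowing or caring whether its caller already did.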
(In fact, I think it would be useful also to have a way of saying that
_fatal_ signals should be allowed, but not signals with a handler. This
would be useful in, e.g., sys_read() and sys_write().)
--
dwmw2
* Re: filesystem signal handling
2004-04-28 15:14 ` David Woodhouse
@ 2004-04-28 17:05 ` Trond Myklebust
2004-04-28 17:14 ` David Woodhouse
2004-04-28 21:46 ` Bryan Henderson
From: Trond Myklebust @ 2004-04-28 17:05 UTC (permalink / raw)
To: David Woodhouse; +Cc: Steve French, linux-fsdevel
On Wed, 2004-04-28 at 11:14, David Woodhouse wrote:
> NFS does this. It's fairly ugly. What would be a _lot_ nicer is if we could
> have something in the task_struct which is vaguely reminiscent of
> preempt_count, only it counts the number of reasons why this task cannot
> receive signals. So instead of using the TASK_INTERRUPTIBLE and
> TASK_UNINTERRUPTIBLE states to make the decision, we'd look at the
> task's uninterruptible_count instead.
The reason NFS has the scheme that it does is precisely *because* we
want to set our own sigmask.
The reason is that we'd like to respect SIGINT, SIGQUIT and SIGKILL as
signalling that the user wants to interrupt the operation if and only if
the "intr" mount flag has been set.
Cheers,
Trond
* Re: filesystem signal handling
2004-04-28 17:05 ` Trond Myklebust
@ 2004-04-28 17:14 ` David Woodhouse
2004-04-28 17:32 ` Trond Myklebust
From: David Woodhouse @ 2004-04-28 17:14 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Steve French, linux-fsdevel
On Wed, 2004-04-28 at 13:05 -0400, Trond Myklebust wrote:
> On Wed, 2004-04-28 at 11:14, David Woodhouse wrote:
>
> > NFS does this. It's fairly ugly. What would be a _lot_ nicer is if we could
> > have something in the task_struct which is vaguely reminiscent of
> > preempt_count, only it counts the number of reasons why this task cannot
> > receive signals. So instead of using the TASK_INTERRUPTIBLE and
> > TASK_UNINTERRUPTIBLE states to make the decision, we'd look at the
> > task's uninterruptible_count instead.
>
> The reason NFS has the scheme that it does is precisely *because* we
> want to set our own sigmask.
>
> The reason is that we'd like to respect SIGINT, SIGQUIT and SIGKILL as
> signalling that the user wants to interrupt the operation if and only if
> the "intr" mount flag has been set.
Is there a benefit to having precisely this implementation, as opposed
to the option of allowing only fatal signals? What standard do we need
to adhere to?
--
dwmw2
* Re: filesystem signal handling
2004-04-28 17:14 ` David Woodhouse
@ 2004-04-28 17:32 ` Trond Myklebust
2004-04-28 19:28 ` Jamie Lokier
From: Trond Myklebust @ 2004-04-28 17:32 UTC (permalink / raw)
To: David Woodhouse; +Cc: Steve French, linux-fsdevel
On Wed, 2004-04-28 at 13:14, David Woodhouse wrote:
> > The reason is that we'd like to respect SIGINT, SIGQUIT and SIGKILL as
> > signalling that the user wants to interrupt the operation if and only if
> > the "intr" mount flag has been set.
>
> Is there a benefit to having precisely this implementation, as opposed
> to the option of allowing only fatal signals? What standard do we need
> to adhere to?
It is a standard interface on all *NIX implementations of NFS.
As for the advantages:
SIGQUIT certainly has different semantics than SIGKILL (it coredumps!)
so it can possibly be useful for debugging if some particular NFS
operation is causing a hang.
As for SIGINT vs SIGKILL - the only difference I can see is that the
former can be generated directly from the keyboard.
Cheers,
Trond
* Re: filesystem signal handling
2004-04-28 17:32 ` Trond Myklebust
@ 2004-04-28 19:28 ` Jamie Lokier
2004-04-28 19:43 ` Trond Myklebust
2004-04-29 2:18 ` David Woodhouse
From: Jamie Lokier @ 2004-04-28 19:28 UTC (permalink / raw)
To: Trond Myklebust; +Cc: David Woodhouse, Steve French, linux-fsdevel
Trond Myklebust wrote:
> As for SIGINT vs SIGKILL - the only difference I can see is that the
> former can be generated directly from the keyboard.
The other difference is that SIGINT can be intercepted by the
application to do cleanups or whatever; SIGKILL cannot.
-- Jamie
* Re: filesystem signal handling
2004-04-28 19:28 ` Jamie Lokier
@ 2004-04-28 19:43 ` Trond Myklebust
2004-04-28 19:47 ` Jamie Lokier
2004-04-29 2:18 ` David Woodhouse
From: Trond Myklebust @ 2004-04-28 19:43 UTC (permalink / raw)
To: Jamie Lokier; +Cc: David Woodhouse, Steve French, linux-fsdevel
On Wed, 2004-04-28 at 15:28, Jamie Lokier wrote:
> Trond Myklebust wrote:
> > As for SIGINT vs SIGKILL - the only difference I can see is that the
> > former can be generated directly from the keyboard.
>
> The other difference is that SIGINT can be intercepted by the
> application to do cleanups or whatever; SIGKILL cannot.
Right, but we explicitly mask signals if the handler is not set to
SIG_DFL.
I must admit that I'm unclear as to the full reason why we do that, but
I guess that at least part of the problem is that we're not going to be
reentrant if the handler decides to ignore the signal.
Cheers,
Trond
* Re: filesystem signal handling
2004-04-28 19:43 ` Trond Myklebust
@ 2004-04-28 19:47 ` Jamie Lokier
2004-04-28 20:31 ` Trond Myklebust
From: Jamie Lokier @ 2004-04-28 19:47 UTC (permalink / raw)
To: Trond Myklebust; +Cc: David Woodhouse, Steve French, linux-fsdevel
Trond Myklebust wrote:
> I must admit that I'm unclear as to the full reason why we do that, but
> I guess that at least part of the problem is that we're not going to be
> reentrant if the handler decides to ignore the signal.
Eh? When a signal is handled, that is done by aborting the current
kernel operation (e.g. read), returning to userspace, letting
userspace handle the signal, and then possibly restarting the aborted
operation by doing the system call again.
How does re-entrancy come into it?
-- Jamie
* Re: filesystem signal handling
2004-04-28 19:47 ` Jamie Lokier
@ 2004-04-28 20:31 ` Trond Myklebust
From: Trond Myklebust @ 2004-04-28 20:31 UTC (permalink / raw)
To: Jamie Lokier; +Cc: David Woodhouse, Steve French, linux-fsdevel
On Wed, 2004-04-28 at 15:47, Jamie Lokier wrote:
> Eh? When a signal is handled, that is done by aborting the current
> kernel operation (e.g. read), returning to userspace, letting
> userspace handle the signal, and then possibly restarting the aborted
> operation by doing the system call again.
>
> How does re-entrancy come into it?
Sorry, I worded that poorly: I didn't really mean "reentrant" (at least
not in the usual sense of the word).
I'm thinking rather about the case of non-idempotent operations such as
open(O_EXCL), O_APPEND writes, rename(),...
If the operation succeeds, but we interrupt the RPC call before the
reply from the server has reached us, then restarting that operation
will be very poorly defined indeed.
Cheers,
Trond
* Re: filesystem signal handling
2004-04-28 15:14 ` David Woodhouse
2004-04-28 17:05 ` Trond Myklebust
@ 2004-04-28 21:46 ` Bryan Henderson
2004-04-29 2:34 ` David Woodhouse
From: Bryan Henderson @ 2004-04-28 21:46 UTC (permalink / raw)
To: David Woodhouse; +Cc: linux-fsdevel, smfltc
>It's [blocking signals] fairly ugly. What would be a _lot_ nicer is if we
>could have something in the task_struct which is vaguely reminiscent of
>preempt_count, only it counts the number of reasons why this task cannot
>receive signals. So instead of using the TASK_INTERRUPTIBLE and
>TASK_UNINTERRUPTIBLE states to make the decision, we'd look at the
>task's uninterruptible_count instead.
Why is this less ugly than setting the signal mask and doing a
save/restore? Because it's fewer lines of code? Fewer instructions?
I actually have moved away from using TASK_UNINTERRUPTIBLE because I find
it cleaner to use a single, more powerful interface -- the signal mask --
everywhere.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: filesystem signal handling
2004-04-28 19:28 ` Jamie Lokier
2004-04-28 19:43 ` Trond Myklebust
@ 2004-04-29 2:18 ` David Woodhouse
2004-04-29 2:53 ` Trond Myklebust
From: David Woodhouse @ 2004-04-29 2:18 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Trond Myklebust, Steve French, linux-fsdevel
On Wed, 2004-04-28 at 20:28 +0100, Jamie Lokier wrote:
> Trond Myklebust wrote:
> > As for SIGINT vs SIGKILL - the only difference I can see is that the
> > former can be generated directly from the keyboard.
>
> The other difference is that SIGINT can be intercepted by the
> application to do cleanups or whatever; SIGKILL cannot.
That was the difference I was thinking of -- since if they're not
_handled_, all three of the mentioned signals remain fatal and hence the
resulting behaviour is basically the same with either implementation
(playing with the mask vs. allowing only fatal signals).
It's the case of a handled SIGINT during NFS operations which is
potentially different. If that happens when a read() or write() is
partially complete, what do we currently do? Is it really mandatory that
we do handle the signal immediately rather than upon completion of the
operation, or is this a corner case which nobody cares about?
But I wasn't necessarily suggesting that NFS should change its
behaviour, only offering a potential answer to the question which was
posed, which was presumably for CIFS not NFS.
--
dwmw2
* Re: filesystem signal handling
2004-04-28 21:46 ` Bryan Henderson
@ 2004-04-29 2:34 ` David Woodhouse
From: David Woodhouse @ 2004-04-29 2:34 UTC (permalink / raw)
To: Bryan Henderson; +Cc: linux-fsdevel, smfltc
On Wed, 2004-04-28 at 14:46 -0700, Bryan Henderson wrote:
> >It's [blocking signals] fairly ugly. What would be a _lot_ nicer is if we
> >could have something in the task_struct which is vaguely reminiscent of
> >preempt_count, only it counts the number of reasons why this task cannot
> >receive signals. So instead of using the TASK_INTERRUPTIBLE and
> >TASK_UNINTERRUPTIBLE states to make the decision, we'd look at the
> >task's uninterruptible_count instead.
>
> Why is this less ugly than setting the signal mask and doing a
> save/restore? Because it's fewer lines of code? Fewer instructions?
TBH it's mostly because it doesn't make my head hurt when I try to prove
to myself that it's safe w.r.t. multi-threaded signal handling, with
different signals having per-process and per-thread semantics etc.
--
dwmw2
* Re: filesystem signal handling
2004-04-29 2:18 ` David Woodhouse
@ 2004-04-29 2:53 ` Trond Myklebust
2004-04-29 6:41 ` David Woodhouse
From: Trond Myklebust @ 2004-04-29 2:53 UTC (permalink / raw)
To: David Woodhouse; +Cc: Jamie Lokier, Steve French, linux-fsdevel
On Wed, 2004-04-28 at 22:18, David Woodhouse wrote:
> That was the difference I was thinking of -- since if they're not
> _handled_, all three of the mentioned signals remain fatal and hence the
> resulting behaviour is basically the same with either implementation.
> (playing with the mask vs. allowing only fatal signals).
They are very different. "Playing with the mask" restricts which signals
are considered to be fatal. The "intr" mount option is pretty particular
about that.
> It's the case of a handled SIGINT during NFS operations which is
> potentially different. If that happens when a read() or write() is
> partially complete, what do we currently do? Is it really mandatory that
> we do handle the signal immediately rather than upon completion of the
> operation, or is this a corner case which nobody cares about?
As I said earlier: if the signal is handled, we mask it. See the
discussion about non-idempotent operations and the dangers of retrying
them.
I'm not entirely sure that is the reason for masking the handled case
(rpc_clnt_sigmask() predates my maintainership of the NFS client), but I
suspect that it might be. I should really ask Alan Cox what his
motivations were. IIRC, he wrote that code...
> But I wasn't necessarily suggesting that NFS should change its
> behaviour, only offering a potential answer to the question which was
> posed, which was presumably for CIFS not NFS.
I would imagine that CIFS has similar problems to NFS here.
One of the main points of playing with the sigmask in the first place is
that interrupting an operation in the middle of the RPC call leaves
various attribute/data caches in an incoherent state, and so is in
general bad practice. We certainly don't want something like a poorly
coded application that generates the occasional SIGSEGV or SIGBUS (or
even a well-coded one that uses timers + SIGALRM!) to screw up the
cached data/metadata for everyone else.
Cheers,
Trond
* Re: filesystem signal handling
2004-04-29 2:53 ` Trond Myklebust
@ 2004-04-29 6:41 ` David Woodhouse
2004-04-29 17:41 ` Bryan Henderson
From: David Woodhouse @ 2004-04-29 6:41 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Jamie Lokier, Steve French, linux-fsdevel
On Wed, 2004-04-28 at 22:53 -0400, Trond Myklebust wrote:
> One of the main points of playing with the sigmask in the first place is
> that interrupting an operation in the middle of the RPC call leaves
> various attribute/data caches in an incoherent state, and so is in
> general bad practice.
We can generalise that. The main reason for blocking signals -- be it by
using TASK_UNINTERRUPTIBLE, manipulating the sigmask, or some
hypothetical 'uninterruptible_count' in the task_struct -- is that in
some places we either can't or don't want to clean up properly if a
signal happens. Sometimes that requirement comes all the way from
userspace (e.g. the expected semantics of sys_read() and short reads).
In those problematic cases, we often call other functions which may
sleep and may potentially be interrupted -- since we can't handle that,
we obviously require them _not_ to be interrupted. But making those
functions use TASK_UNINTERRUPTIBLE means that they can't be interrupted
even when called by other callers which _could_ clean up after
themselves -- unless we do something horrid like passing a flag all the
way down to say what sleep state should be used.
The advantage of doing it with the latter two methods mentioned above is
that the 'problematic' routine can make that decision for itself and set
some state, and then the routines which it calls don't need to care.
They just _don't_ get interrupted when they're called from a function
which cannot handle receiving -EINTR.
> We certainly don't want something like a poorly
> coded application that generates the occasional SIGSEGV or SIGBUS (or
> even a well-coded one that uses timers + SIGALRM!) to screw up the
> cached data/metadata for everyone else.
If that's a possible outcome, why doesn't that happen with uncaught
SIGINT?
--
dwmw2
* Re: filesystem signal handling
2004-04-29 6:41 ` David Woodhouse
@ 2004-04-29 17:41 ` Bryan Henderson
From: Bryan Henderson @ 2004-04-29 17:41 UTC (permalink / raw)
To: David Woodhouse; +Cc: Jamie Lokier, linux-fsdevel, smfltc, Trond Myklebust
I see several aspects of interruptibility being discussed at once here,
and just to make sure people aren't confusing them together, I'd like to
summarize them here:
In some places, kernel code has started some operation and must complete
it to avoid making the filesystem or the filesystem image inconsistent.
For example, you wouldn't want to make an update in the local cache and
then get interrupted before you make the update in the actual filesystem.
And vice versa.
In other places, kernel code has started some operation in service of a
system call and must complete in order that the system call not have some
illegal partial result. Note that I'm not talking about a short read or
write. If I do a read for 1000 bytes and get only 500 because the user
hit ctl-C, that's a complete read, totally POSIX-compliant. I'm talking
about where I do a read of 1000 bytes and the file pointer does get bumped
by 1000 bytes, but only 500 bytes are put into my buffer. Or the file
access time doesn't get updated.
These latter cases are where it matters whether a signal has a handler or
not. Because if a signal has arrived that you know will terminate the
process before the system call you're executing returns, then the user
cannot possibly see your partially completed system call, so there's
nothing wrong with bailing out. SIGKILL is the only signal class that
guarantees the user will never get control back from the system call. But
if you're willing to make your behavior vary depending on whether the user
installed a signal handler (a user might consider that flaky), you could
also determine that signals of other classes are guaranteed to terminate
the process.
By the way, there is officially no such thing as an _application_ that
can't tolerate an EINTR failure of a system call. (Though in reality, we
know such programs abound). But remember that when a system call fails
with EINTR, it must have no visible effect at all.
That's the issue of whether interruptions should happen or not in response
to signals. The original issue of the thread, though, was more about how
you engineer code to make sure the interruptions happen or don't happen,
given that lots of layers and code modules are involved.
Sometimes, the caller of a subroutine is perfectly willing to abort what
he's doing, but the subroutine itself cannot return halfway through. So
the subroutine ignores signals and causes its waits to be uninterruptible.
In other cases, the subroutine is perfectly capable of aborting halfway
through, but the caller is in the middle of something that must go to
completion. You want the subroutine in this case to wait uninterruptibly,
and to ignore signals, even though the subroutine itself has no reason to
do so.
In Linux, we've generally either made everything really conservative -- a
subroutine makes itself uninterruptible just in case its caller requires
it -- or cascaded an "interruptible" argument down through the call chain.
Two ways to avoid that extra argument on all the subroutines are 1) use
the task's global signal mask; and 2) use David's proposed task global
"uninterruptible" flag/count.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems