manipulating sigmask from filesystems and drivers

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* manipulating sigmask from filesystems and drivers
@ 2002-07-31 11:52 David Howells
  2002-07-31 11:58 ` Alan Cox
  2002-08-01 19:09 ` Linus Torvalds
  0 siblings, 2 replies; 37+ messages in thread
From: David Howells @ 2002-07-31 11:52 UTC (permalink / raw)
  To: torvalds, alan; +Cc: linux-kernel, dhowells

Hi Linus, Alan,

Can you confirm that this is A Bad Thing(TM)? I've been poking around in the
OpenAFS filesystem driver, and it tries to achieve uninterruptible I/O waiting
by the following means:

	/* CV_WAIT and CV_TIMEDWAIT rely on the fact that the Linux kernel has
	 * a global lock. Thus we can safely drop our locks before calling the
	 * kernel sleep services.
	 */
	static inline int CV_WAIT(afs_kcondvar_t *cv, afs_kmutex_t *l)
	{
	    int isAFSGlocked = ISAFS_GLOCK(); 
	    sigset_t saved_set;
	#ifdef DECLARE_WAITQUEUE
	    DECLARE_WAITQUEUE(wait, current);
	#else
	    struct wait_queue wait = { current, NULL };
	#endif

	    add_wait_queue((wait_queue_head_t *)cv, &wait);
	    set_current_state(TASK_INTERRUPTIBLE);

	    if (isAFSGlocked) AFS_GUNLOCK();
	    MUTEX_EXIT(l);

	    spin_lock_irq(&current->sigmask_lock);
	    saved_set = current->blocked;
	    sigfillset(&current->blocked);
	    recalc_sigpending(current);
	    spin_unlock_irq(&current->sigmask_lock);

	    schedule();
	    remove_wait_queue(cv, &wait);

	    spin_lock_irq(&current->sigmask_lock);
	    current->blocked = saved_set;
	    recalc_sigpending(current);
	    spin_unlock_irq(&current->sigmask_lock);

	    if (isAFSGlocked) AFS_GLOCK();
	    MUTEX_ENTER(l);

	    return 0;
	}

The reason for them doing this is so that they can get the process to appear
in the "S" state and thus avoid increasing the load average.

What I'm concerned about is that they wait for an event to happen by blocking
all signals (by accessing the process's signal masks directly) and then
sitting in TASK_INTERRUPTIBLE (which _mostly_ works, but ptrace(PTRACE_KILL)
can interrupt).

Can you comment on whether a driver is allowed to block signals like this, and
whether they should be waiting in TASK_UNINTERRUPTIBLE?

Cheers,
David

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-07-31 11:52 manipulating sigmask from filesystems and drivers David Howells
@ 2002-07-31 11:58 ` Alan Cox
  2002-08-01 19:09 ` Linus Torvalds
  1 sibling, 0 replies; 37+ messages in thread
From: Alan Cox @ 2002-07-31 11:58 UTC (permalink / raw)
  To: David Howells; +Cc: torvalds, alan, linux-kernel, dhowells

> OpenAFS filesystem driver, and it tries to achieve uninterruptible I/O waiting
> by the following means:

Bletch

> The reason for them doing this is so that they can get the process to appear
> in the "S" state and thus avoid increasing the load average.

Thats come up enough I wish there was a way to distinguish 'disk wait'
and uninterruptible. Its an old V7 handwaving load average estimation trick
that lived too long IMHO

> Can you comment on whether a driver is allowed to block signals like this, and
> whether they should be waiting in TASK_UNINTERRUPTIBLE?

NFS does something similar in hard mount mode. 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-07-31 11:52 manipulating sigmask from filesystems and drivers David Howells
  2002-07-31 11:58 ` Alan Cox
@ 2002-08-01 19:09 ` Linus Torvalds
  2002-08-01 20:10   ` David Woodhouse
       [not found]   ` <mailman.1028232841.11555.linux-kernel2news@redhat.com>
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 19:09 UTC (permalink / raw)
  To: David Howells; +Cc: alan, linux-kernel

On Wed, 31 Jul 2002, David Howells wrote:
> 
> Can you comment on whether a driver is allowed to block signals like this, and
> whether they should be waiting in TASK_UNINTERRUPTIBLE?

They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a flag 
to distinguish between "increases load average" and "doesn't". So you 
could have

	TASK_WAKESIGNAL - wake on all signals
	TASK_WAKEKILL	- wake on signals that are deadly
	TASK_NOSIGNAL	- don't wake on signals
	TASK_LOADAVG	- counts toward loadaverage

	#define TASK_UNINTERRUPTIBLE	(TASK_NOSIGNAL | TASK_LOADAVG)
	#define TASK_INTERRUPTIBLE	TASK_WAKESIGNAL

and then people who wanted to could use other combinations. The
TASK_WAKEKILL thing is useful - there are many loops that cannot exit
until they have a result, simply because the calling conventions require 
that. Th emost common example is disk wait.

HOWEVER, if they are killed by a signal, the calling convention doesn't
matter, and the read() or whatever could just return 0 (knowing that the
process will never see it), and leave a locked page locked. Things like 
generic_file_read() could easily use this, and make processes killable 
even when they are waiting for a hung NFS mount - regardless of any soft 
mount issues, and without NFS having to have special code.

In the end, I'm too lazy, and I don't care. So I can only tell you how it 
_should_ be done, and maybe you can tell somebody else until the sucker to 
actually do it is found.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 19:09 ` Linus Torvalds
@ 2002-08-01 20:10   ` David Woodhouse
  2002-08-01 20:21     ` Linus Torvalds
       [not found]   ` <mailman.1028232841.11555.linux-kernel2news@redhat.com>
  1 sibling, 1 reply; 37+ messages in thread
From: David Woodhouse @ 2002-08-01 20:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Howells, alan, linux-kernel


torvalds@transmeta.com said:
>  They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
> flag  to distinguish between "increases load average" and "doesn't".

The disadvantage of this approach is that it encourages people to be lazy
and sleep with signals disabled, instead of implementing proper cleanup
code. 

I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least 
making people apply for a special licence to be permitted to use it :)

--
dwmw2



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 20:10   ` David Woodhouse
@ 2002-08-01 20:21     ` Linus Torvalds
  2002-08-01 20:47       ` Roman Zippel
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 20:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Howells, alan, linux-kernel

On Thu, 1 Aug 2002, David Woodhouse wrote:
> 
> torvalds@transmeta.com said:
> >  They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
> > flag  to distinguish between "increases load average" and "doesn't".
> 
> The disadvantage of this approach is that it encourages people to be lazy
> and sleep with signals disabled, instead of implementing proper cleanup
> code. 
> 
> I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least 
> making people apply for a special licence to be permitted to use it :)

Can't do that.

Easy reason: there are tons of code sequences that _cannot_ take signals.  
The only way to make a signal go away is to actually deliver it, and there
are documented interfaces that are guaranteed to complete without
delivering a signal. The trivial case is a disk read: real applications
break if you return partial results in order to handle signals in the
middle.

In short, this is not something that can be discussed. It's a cold fact, a
law of UNIX if you will.

There are enough reasons to discourage people from using uninterruptible 
sleep ("this f*cking application won't die when the network goes down") 
that I don't think this is an issue. We need to handle both cases, and 
while we can expand on the two cases we have now, we can't remove them.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 20:21     ` Linus Torvalds
@ 2002-08-01 20:47       ` Roman Zippel
  2002-08-01 20:51         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Roman Zippel @ 2002-08-01 20:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, David Howells, alan, linux-kernel

Hi,

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> Easy reason: there are tons of code sequences that _cannot_ take signals.
> The only way to make a signal go away is to actually deliver it, and there
> are documented interfaces that are guaranteed to complete without
> delivering a signal. The trivial case is a disk read: real applications
> break if you return partial results in order to handle signals in the
> middle.
>
> In short, this is not something that can be discussed. It's a cold fact, a
> law of UNIX if you will.

Any program setting up signal handlers should expext interrupted i/o,
otherwise it's buggy. If a program doesn't have any signal handlers, there
is no signal to deliver, so simple programs don't need to worry.

bye, Roman



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 20:47       ` Roman Zippel
@ 2002-08-01 20:51         ` Linus Torvalds
  2002-08-01 21:15           ` Roman Zippel
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 20:51 UTC (permalink / raw)
  To: Roman Zippel; +Cc: David Woodhouse, David Howells, alan, linux-kernel


On Thu, 1 Aug 2002, Roman Zippel wrote:
> >
> > In short, this is not something that can be discussed. It's a cold fact, a
> > law of UNIX if you will.
> 
> Any program setting up signal handlers should expext interrupted i/o,
> otherwise it's buggy.

Roman, THAT IS JUST NOT TRUE!

Go read the standards. Some IO is not interruptible. This is not something 
I'm making up, and this is not something that can be discussed about. The 
speed of light in vacuum is 'c', regardless of your own relative speed. 
And file reads are not interruptible.

			Linus


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 20:51         ` Linus Torvalds
@ 2002-08-01 21:15           ` Roman Zippel
  2002-08-01 21:42             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Roman Zippel @ 2002-08-01 21:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, David Howells, alan, linux-kernel

Hi,

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> Go read the standards. Some IO is not interruptible.

Which standard? Which "some IO"?
Some programs rely on interruptable IO, e.g. to implement timeouts.
/me is confused.

bye, Roman


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 21:15           ` Roman Zippel
@ 2002-08-01 21:42             ` Linus Torvalds
  2002-08-01 22:29               ` David Woodhouse
  2002-08-01 22:35               ` Roman Zippel
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 21:42 UTC (permalink / raw)
  To: Roman Zippel; +Cc: David Woodhouse, David Howells, alan, linux-kernel

On Thu, 1 Aug 2002, Roman Zippel wrote:

> > Go read the standards. Some IO is not interruptible.
> 
> Which standard? Which "some IO"?

Any regular file IO is supposed to give you the full result. 

If you write() to a file, and get a partial return value back due to a
signal, there are programs that will assume that the disk is full (there
are also programs that will just lose your data).

This is not "sloppy programming". See the read() system call manual, which 
says

     Upon successful completion, read(), readv(), and pread() return the num-
     ber of bytes actually read and placed in the buffer.  The system guaran-
     tees to read the number of bytes requested if the descriptor references a
     normal file that has that many bytes left before the end-of-file, but in
     no other case.

Note the "The system guarantees to read the number of bytes requested .." 
part.

Stop arguing about this. It's a FACT.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 21:42             ` Linus Torvalds
@ 2002-08-01 22:29               ` David Woodhouse
  2002-08-01 22:40                 ` Linus Torvalds
  2002-08-01 22:35               ` Roman Zippel
  1 sibling, 1 reply; 37+ messages in thread
From: David Woodhouse @ 2002-08-01 22:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roman Zippel, David Howells, alan, linux-kernel

torvalds@transmeta.com said:
>  Any regular file IO is supposed to give you the full result. 

read(2) is permitted to return -EINTR. Granted, we shouldn't allow it to be
interrupted and return a partial read after the point we start to
copy_to_user(), but before then it's fair game.

Regular file I/O through the page cache is inherently restartable, anyway, 
as long as you're careful about fpos.

There are better examples where you really can't have a cleanup path without
severe pain, even using ERESTARTNOINTR, and I was only joking about removing
TASK_UNINTERRUPTIBLE _entirely_ -- but the point remains that reducing its
usage would be nice.

--
dwmw2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 22:29               ` David Woodhouse
@ 2002-08-01 22:40                 ` Linus Torvalds
  2002-08-01 22:50                   ` David Woodhouse
  2002-08-02 15:59                   ` yodaiken
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 22:40 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Roman Zippel, David Howells, alan, linux-kernel

On Thu, 1 Aug 2002, David Woodhouse wrote:
> 
> torvalds@transmeta.com said:
> >  Any regular file IO is supposed to give you the full result. 
> 
> read(2) is permitted to return -EINTR.

It is _not_ allowed to do that for regular UNIX filesystems.

It is allowed to return it for things like pipes, sockets, etc, and for 
filesystems that do not have UNIX behaviour.

> Regular file I/O through the page cache is inherently restartable, anyway, 
> as long as you're careful about fpos.

It's not the kernel side that is not restartable. It's the _user_ side. 
There is 30 _years_ of history on this, and there are programs that have 
been programmed to follow the existing documentation.

And the existing documentation says that if you return a partial read from 
a normal file, that means EOF for that file.

You may not like it, but that doesn't make it less so. Linux has UNIX 
semantics for read(). Linux is not a research project where we change 
fundamental semantics just because we don't like it. That's final.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 22:40                 ` Linus Torvalds
@ 2002-08-01 22:50                   ` David Woodhouse
  2002-08-02 15:59                   ` yodaiken
  1 sibling, 0 replies; 37+ messages in thread
From: David Woodhouse @ 2002-08-01 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roman Zippel, David Howells, alan, linux-kernel

torvalds@transmeta.com said:
>  It's not the kernel side that is not restartable. It's the _user_
> side. 

As I said, you can't allow it to be interrupted after you've started the 
copy_to_user(). Show me how the userspace program can prove the signal 
arrived _after_ the 'int 0x80' had trapped into the kernel rather then 
beforehand, and I'll accept that you can't allow read() to be interrupted
even before the copy_to_user() starts.

But I also agree that there are other, better, examples of why
TASK_UNINTERRUPTIBLE has to be used sometimes. Page faults in 
copy_{from,to}_user are probably one such.

It's just that it doesn't have to be scattered all over the place just
because people are too lazy to do the cleanup code.

Yeah -- sometimes it's hard. So go shopping.

--
dwmw2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 22:40                 ` Linus Torvalds
  2002-08-01 22:50                   ` David Woodhouse
@ 2002-08-02 15:59                   ` yodaiken
  1 sibling, 0 replies; 37+ messages in thread
From: yodaiken @ 2002-08-02 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Roman Zippel, David Howells, alan, linux-kernel

On Thu, Aug 01, 2002 at 03:40:56PM -0700, Linus Torvalds wrote:
> 
> On Thu, 1 Aug 2002, David Woodhouse wrote:
> > 
> > torvalds@transmeta.com said:
> > >  Any regular file IO is supposed to give you the full result. 
> > 
> > read(2) is permitted to return -EINTR.
> 
> It is _not_ allowed to do that for regular UNIX filesystems.
> 

SusV2 and POSIX seem to have changed the prior standard

     The value  returned  may be less than nbyte if the number of bytes
     left  in  the  file  is  less than nbyte, if the read() request was
     interrupted  by  a  signal,  or  if  the  file is a pipe or FIFO or
     special  file  and has fewer than nbyte bytes immediately available
     for  reading.


The rationale mentions that
	while( read(...) > 0) 
must work
but, I think Linus is correct that many programs rely on
	while( read(fd,&b,n) == n)

------------------susv2

     If a read() is interrupted by a signal before it reads any data, it
     will return -1 with errno set to [EINTR].

     If  a  read()  is interrupted by a signal after it has successfully
     read some data, it will return the number of bytes read.
----------------------



The grim effects of "cancel" spread through the system.


-------------------------------------
The transitive closure of a design error is limited only by the size of
the program.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 21:42             ` Linus Torvalds
  2002-08-01 22:29               ` David Woodhouse
@ 2002-08-01 22:35               ` Roman Zippel
  2002-08-01 23:30                 ` Linus Torvalds
  2002-08-02  7:31                 ` Giuliano Pochini
  1 sibling, 2 replies; 37+ messages in thread
From: Roman Zippel @ 2002-08-01 22:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, David Howells, alan, linux-kernel

Hi

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> This is not "sloppy programming". See the read() system call manual, which
> says
>
>      Upon successful completion, read(), readv(), and pread() return the num-
>      ber of bytes actually read and placed in the buffer.  The system guaran-
>      tees to read the number of bytes requested if the descriptor references a
>      normal file that has that many bytes left before the end-of-file, but in
>      no other case.
>
> Note the "The system guarantees to read the number of bytes requested .."
> part.

Relying on that the fd will always point to a normal file is only asking
for trouble.

> Stop arguing about this. It's a FACT.

Linus, it's not that I don't want to believe you, but e.g. the SUS doesn't
make that special exception.
Installing signal handlers and not expecting EINTR _is_ sloppy
programming.

bye, Roman


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 22:35               ` Roman Zippel
@ 2002-08-01 23:30                 ` Linus Torvalds
  2002-08-02  0:31                   ` Olivier Galibert
  2002-08-02 10:02                   ` Roman Zippel
  2002-08-02  7:31                 ` Giuliano Pochini
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-01 23:30 UTC (permalink / raw)
  To: Roman Zippel; +Cc: David Woodhouse, David Howells, alan, linux-kernel

[ I'm having a real hard time to not shout at the top of my lungs "SHUT
  THE FUCK UP ABOUT THIS ALREADY!" and then running around the offices 
  with a chainsaw laughing maniacally. But I'll try. This once. ]

On Fri, 2 Aug 2002, Roman Zippel wrote:
> 
> Relying on that the fd will always point to a normal file is only asking
> for trouble.

People _do_ rely on regular files working this way. Wake up and smell the
coffee, you cannot change reality by just arguing about it.

There are cases where you absolutely _have_ to rely on this documented
UNIX behaviour. One example is using a log-file (yes, a _file_, not a
socket or a pipe) that you explicitly opened with O_APPEND, just so that
you can guarantee _atomic_ writes that do not get lost or partially
re-ordered in your log.

And yes, these logging programs are mission-critical, and they do have
signals going on, and they rely on well-defined and documented interfaces
that say that doing a write() to a filesystem is _not_ going to return in
the middle just because a signal came in.

These programs know what they are doing. They are explicitly _not_ using
"stdio" to write the log-file, exactly because they cannot afford to have
the write broken up into many parts (and they do not want to have it
buffered either, since getting the logging out in a timely fashion can be
important). 

The only, and the _portable_ way to do this under UNIX is to create one
single buffer, and write it out with one single write() call. Anything
else is likely to cause the file to be interspersed by random
log-fragments, instead of being a nice consecutive list of full log
entries.

Feel free to change Linux to have your stupid preferred semantics where
everybody is supposed to be able to handle signals at any time, but please
re-name it to "Crapix" when you do. Because that is what it would be.  
Crap. Utter braindamage.

If people cannot find this in SuS, then I simply don't _care_. I care
about not having a crap OS, and I also care about not having to repeat
myself and give a million examples of why the current behaviour is
_required_, and why we're not getting rid of it.

[ Did profanity help explain the situation? Do people finally understand 
  why this _really_ isn't up for discussion? Please don't bother sending 
  me any more email about this.  My co-workers are already eyeing me and 
  my chainsaw nervously. Thank you for sparing them. ]

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 23:30                 ` Linus Torvalds
@ 2002-08-02  0:31                   ` Olivier Galibert
  2002-08-02  8:00                     ` Kai Henningsen
  2002-08-02 10:02                   ` Roman Zippel
  1 sibling, 1 reply; 37+ messages in thread
From: Olivier Galibert @ 2002-08-02  0:31 UTC (permalink / raw)
  To: linux-kernel

On Thu, Aug 01, 2002 at 04:30:40PM -0700, Linus Torvalds wrote:
> And yes, these logging programs are mission-critical, and they do have
> signals going on, and they rely on well-defined and documented interfaces
> that say that doing a write() to a filesystem is _not_ going to return in
> the middle just because a signal came in.

How hard and/or insane would it be to somehow special-case SIGKILL?
It is a tad annoying not to be able to get rid of D state processes,
especially ones blocking unmounts because the filesystem is busy.

Of course, an alternative is a real, brutal, forced unmount that
leaves a clean filesystem and dead/dying processes.

  OG.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02  0:31                   ` Olivier Galibert
@ 2002-08-02  8:00                     ` Kai Henningsen
  0 siblings, 0 replies; 37+ messages in thread
From: Kai Henningsen @ 2002-08-02  8:00 UTC (permalink / raw)
  To: linux-kernel

galibert@pobox.com (Olivier Galibert)  wrote on 01.08.02 in <20020801203140.A3166@zalem.puupuu.org>:

> On Thu, Aug 01, 2002 at 04:30:40PM -0700, Linus Torvalds wrote:
> > And yes, these logging programs are mission-critical, and they do have
> > signals going on, and they rely on well-defined and documented interfaces
> > that say that doing a write() to a filesystem is _not_ going to return in
> > the middle just because a signal came in.
>
> How hard and/or insane would it be to somehow special-case SIGKILL?

Thank you for reading the thread before jumping in.

(Hint: some messages upthread, Linux explained how to do exactly this.)

MfG Kai

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 23:30                 ` Linus Torvalds
  2002-08-02  0:31                   ` Olivier Galibert
@ 2002-08-02 10:02                   ` Roman Zippel
  2002-08-02 12:38                     ` Ryan Anderson
  2002-08-02 15:39                     ` Linus Torvalds
  1 sibling, 2 replies; 37+ messages in thread
From: Roman Zippel @ 2002-08-02 10:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, David Howells, alan, linux-kernel

Hi,

I slept over it and I thought about a bit more, but I think it just starts
to get interresting, so I hope your co-workers had a chance to snatch that
chainsaw away. :-)

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> There are cases where you absolutely _have_ to rely on this documented
> UNIX behaviour. One example is using a log-file (yes, a _file_, not a
> socket or a pipe) that you explicitly opened with O_APPEND, just so that
> you can guarantee _atomic_ writes that do not get lost or partially
> re-ordered in your log.
>
> And yes, these logging programs are mission-critical, and they do have
> signals going on, and they rely on well-defined and documented interfaces
> that say that doing a write() to a filesystem is _not_ going to return in
> the middle just because a signal came in.

If these programs are so mission-critical, they better do some error
checking. Your atomic write will fail on a full disk and if the
information is that important, the program has to handle that.

> These programs know what they are doing. They are explicitly _not_ using
> "stdio" to write the log-file, exactly because they cannot afford to have
> the write broken up into many parts (and they do not want to have it
> buffered either, since getting the logging out in a timely fashion can be
> important).
>
> The only, and the _portable_ way to do this under UNIX is to create one
> single buffer, and write it out with one single write() call. Anything
> else is likely to cause the file to be interspersed by random
> log-fragments, instead of being a nice consecutive list of full log
> entries.

I looked around a bit and it doesn't look that portable. UNIX systems seem
to have this behaviour, but not all POSIX systems. Letting multiple
programs log to the same file is insane (there must be a reason why syslog
was invented). Signals are your least worry, that's easy to deal with:

	sigfillset(&s);
	sigprocmask(SIG_SETMASK, &s, &os);
	write(...);
	sigprocmask(SIG_SETMASK, &os, NULL);

But that doesn't deal with an almost full disk.

> If people cannot find this in SuS, then I simply don't _care_. I care
> about not having a crap OS, and I also care about not having to repeat
> myself and give a million examples of why the current behaviour is
> _required_, and why we're not getting rid of it.

I'm not asking for removing for this, I simply don't agree with the
reason. I still consider any program relying on this behaviour buggy. Your
"atomic" write is an illusion, the os can certainly try, but in the end
it's the applications responsibility to get this right.
I know that we can't remove this behaviour, but the only reason is to
keep buggy programs working and let them instead fail in other more subtle
ways.

> [ Did profanity help explain the situation?

Not really. :) I'm not that easily impressed.

bye, Roman

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 10:02                   ` Roman Zippel
@ 2002-08-02 12:38                     ` Ryan Anderson
  2002-08-02 15:39                     ` Linus Torvalds
  1 sibling, 0 replies; 37+ messages in thread
From: Ryan Anderson @ 2002-08-02 12:38 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Linus Torvalds, David Woodhouse, David Howells, alan,
	linux-kernel

On Fri, 2 Aug 2002, Roman Zippel wrote:

> I slept over it and I thought about a bit more, but I think it just starts
> to get interresting, so I hope your co-workers had a chance to snatch that
> chainsaw away. :-)

uh-oh.  heh.

> On Thu, 1 Aug 2002, Linus Torvalds wrote:
> 
> > There are cases where you absolutely _have_ to rely on this documented
> > UNIX behaviour. One example is using a log-file (yes, a _file_, not a
> > socket or a pipe) that you explicitly opened with O_APPEND, just so that
> > you can guarantee _atomic_ writes that do not get lost or partially
> > re-ordered in your log.
> >
> > And yes, these logging programs are mission-critical, and they do have
> > signals going on, and they rely on well-defined and documented interfaces
> > that say that doing a write() to a filesystem is _not_ going to return in
> > the middle just because a signal came in.
> 
> If these programs are so mission-critical, they better do some error
> checking. Your atomic write will fail on a full disk and if the
> information is that important, the program has to handle that.

>From personal experience, I have an application that DEPENDS on the
atomicity of writes in O_APPEND mode.  In the logging function, we have
made a conscious choice that, if we get a disk-full error, there's not a
damn thing we can do about a "disk full" condition that makes sense in
the application, so we don't handle that error.  (That situation is
something the sysadmin is supposed to prevent, anyway.)  (Admittedly,
that application is running on AIX - but I still expect the behavior to
remain the same when we port that application over to Linux...
eventually.)


> > These programs know what they are doing. They are explicitly _not_ using
> > "stdio" to write the log-file, exactly because they cannot afford to have
> > the write broken up into many parts (and they do not want to have it
> > buffered either, since getting the logging out in a timely fashion can be
> > important).
> >
> > The only, and the _portable_ way to do this under UNIX is to create one
> > single buffer, and write it out with one single write() call. Anything
> > else is likely to cause the file to be interspersed by random
> > log-fragments, instead of being a nice consecutive list of full log
> > entries.
> 
> I looked around a bit and it doesn't look that portable. UNIX systems seem
> to have this behaviour, but not all POSIX systems. Letting multiple
> programs log to the same file is insane (there must be a reason why syslog
> was invented). Signals are your least worry, that's easy to deal with:
> 
> 	sigfillset(&s);
> 	sigprocmask(SIG_SETMASK, &s, &os);
> 	write(...);
> 	sigprocmask(SIG_SETMASK, &os, NULL);
> 
> But that doesn't deal with an almost full disk.
> 
> > If people cannot find this in SuS, then I simply don't _care_. I care
> > about not having a crap OS, and I also care about not having to repeat
> > myself and give a million examples of why the current behaviour is
> > _required_, and why we're not getting rid of it.

In any case, atomicity is guaranteed by SUSv3 (Read open(2) for O_APPEND,
then read write(2) and skip down to the Rationale, first paragraph.

> I'm not asking for removing for this, I simply don't agree with the
> reason. I still consider any program relying on this behaviour buggy. Your
> "atomic" write is an illusion, the os can certainly try, but in the end
> it's the applications responsibility to get this right.
> I know that we can't remove this behaviour, but the only reason is to
> keep buggy programs working and let them instead fail in other more subtle
> ways.

Yes, it's always the application's responsibility to get this right.

The application can expect that when write(2) returns from a request
that involved smaller than {PIPE_BUF} size, it either failed, or wrote
the whole request, and with that assumption, the application can handle
the other errors sanely.

Note that write(2) can never return less bytes written than requested if
the requested amount is smaller than {PIPE_BUF} size.

Linus is absolutely correct here, changing the semantics of this
scenario would be a huge mess.

--
Ryan Anderson
  sometimes Pug Majere


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 10:02                   ` Roman Zippel
  2002-08-02 12:38                     ` Ryan Anderson
@ 2002-08-02 15:39                     ` Linus Torvalds
  2002-08-02 16:00                       ` Benjamin LaHaise
  2002-08-02 19:27                       ` Roman Zippel
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-02 15:39 UTC (permalink / raw)
  To: Roman Zippel; +Cc: David Woodhouse, David Howells, alan, linux-kernel

On Fri, 2 Aug 2002, Roman Zippel wrote:
>
> If these programs are so mission-critical, they better do some error
> checking. Your atomic write will fail on a full disk and if the
> information is that important, the program has to handle that.

The logging thing is logging. It's critical for performance tuning, it's
critical for later finding of errors, but it's still secondary.

It's also performance-critical in a very real way.

> I looked around a bit and it doesn't look that portable. UNIX systems seem
> to have this behaviour, but not all POSIX systems.

POSIX is a hobbled standard, and does not matter.

We're not making a "POSIX-compliant OS". People have done that before:
see all the RT-OS's out there, and see even the NT POSIX subsystem.

They are uninteresting.

Linux is a _real_ OS, not some "we filled in the paperwork and it is now
standards compliant".

And being a real OS means taking the real world into account.

And the real world says that it's not acceptable to make up your own
semantics, unless you have some _damn_ good reason for doing so.

Let's turn the tables here. I'm not interested in your "but.." arguments
at all. I've refuted every single one, and I've refuted them with cold
hard facts.

The fact is, there is _zero_ reason to change existing functionality.

We already do this right, and there is no reason to _break_ the fact that
we do it right. Can you come up with a _single_ reason for why we should
break existing standardized binary interfaces?

Binary compatibility is important. As is the larger issue of generic UNIX
compatibility. You had better have some really strong arguments for why
you would think I'd be willing to break compatibility. So far you have had
_no_ arguments for the question "Why?".

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 15:39                     ` Linus Torvalds
@ 2002-08-02 16:00                       ` Benjamin LaHaise
  2002-08-02 16:27                         ` Linus Torvalds
  2002-08-02 19:27                       ` Roman Zippel
  1 sibling, 1 reply; 37+ messages in thread
From: Benjamin LaHaise @ 2002-08-02 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, David Woodhouse, David Howells, alan, linux-kernel

On Fri, Aug 02, 2002 at 08:39:34AM -0700, Linus Torvalds wrote:
> We already do this right, and there is no reason to _break_ the fact that
> we do it right. Can you come up with a _single_ reason for why we should
> break existing standardized binary interfaces?

Personally, I think that uninterruptible file io is good, but there needs 
to be an upper limit to the maximum size of the io.  As it stands today, 
someone can do a single multigigabyte read or write that is completely 
uninterruptible (even to kill -9), but could take a minute or more to 
complete.

		-ben
-- 
"You will be reincarnated as a toad; and you will be much happier."

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 16:00                       ` Benjamin LaHaise
@ 2002-08-02 16:27                         ` Linus Torvalds
  2002-08-02 17:13                           ` Jamie Lokier
                                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-02 16:27 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Roman Zippel, David Woodhouse, David Howells, alan, linux-kernel

On Fri, 2 Aug 2002, Benjamin LaHaise wrote:
>
> Personally, I think that uninterruptible file io is good, but there needs
> to be an upper limit to the maximum size of the io.  As it stands today,
> someone can do a single multigigabyte read or write that is completely
> uninterruptible (even to kill -9), but could take a minute or more to
> complete.

Actually, i fyou read my original email on this thread, I actually
suggested splitting up the "INTERRUPTIBLE" vs "UNINTERRUPTIBLE" into three
different cases and one extra bit.

Sending somebody a SIGKILL (or any signal that kills the process) is
different (in my opinion) from a signal that interrupts a system call in
order to run a signal handler.

So my original suggestion on this thread was to make

        TASK_WAKESIGNAL - wake on all signals
        TASK_WAKEKILL   - wake on signals that are deadly
        TASK_NOSIGNAL   - don't wake on signals
        TASK_LOADAVG    - counts toward loadaverage

        #define TASK_UNINTERRUPTIBLE    (TASK_NOSIGNAL | TASK_LOADAVG)
        #define TASK_INTERRUPTIBLE      TASK_WAKESIGNAL

and then let code like generic_file_write() etc use other combinations
than the two existing ones, ie

	(TASK_WAKEKILL | TASK_LOADAVG)

results in a process that is woken up by signals that kill it (but not
other signals), and is counted towards the loadaverage. Which is what we
want in generic_file_read() (and _probably_ generic_file_write() as well,
but that's slightly more debatable).

(We'd also have to add a new way to test whether you've been killed, so
that such users could use "process_killed()" instead of the
"signal_pending()" that a INTERRUPTIBLE sleeper uses to test whether it
should exit).

This is the trivial way to get the best of both worlds - you can still
kill a process that is in D wait (if that particular kernel path allows
it), but you don't get process-visible semantic changes.

AND it also allows waiting uninterruptibly without adding to the
loadaverage, for those people who want to do long uninterruptible waits
(which was one of the reasons for starting this whole thread in the first
place - it just got slightly de-railed in the meantime).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 16:27                         ` Linus Torvalds
@ 2002-08-02 17:13                           ` Jamie Lokier
  2002-08-02 17:29                             ` Linus Torvalds
  2002-08-02 17:33                           ` Oliver Neukum
  2002-10-17  8:32                           ` David Woodhouse
  2 siblings, 1 reply; 37+ messages in thread
From: Jamie Lokier @ 2002-08-02 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Roman Zippel, David Woodhouse, David Howells,
	alan, linux-kernel

Linus Torvalds wrote:
> Sending somebody a SIGKILL (or any signal that kills the process) is
> different (in my opinion) from a signal that interrupts a system call in
> order to run a signal handler.

So it's ok to have truncated log entries (or more realistically,
truncated simple database entries) if the logging program is killed?

-- Jamie

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 17:13                           ` Jamie Lokier
@ 2002-08-02 17:29                             ` Linus Torvalds
  2002-08-02 17:57                               ` Trond Myklebust
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2002-08-02 17:29 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Benjamin LaHaise, Roman Zippel, David Woodhouse, David Howells,
	alan, linux-kernel

On Fri, 2 Aug 2002, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Sending somebody a SIGKILL (or any signal that kills the process) is
> > different (in my opinion) from a signal that interrupts a system call in
> > order to run a signal handler.
>
> So it's ok to have truncated log entries (or more realistically,
> truncated simple database entries) if the logging program is killed?

This is why I said

 "Which is what we want in generic_file_read() (and _probably_
  generic_file_write() as well, but that's slightly more debatable)"

The "slightly more debatable" comes exactly from the thing you mention.

The thing is, "read()" on a file doesn't have any side effects outside the
process that does it, so if you kill the process, doing a partial read is
always ok (yeah, you can come up with thread examples etc where you can
see the state, but I think those are so contrieved as to not really merit
much worry and certainly have no existing programs issues).

With write(), you have to make a judgement call. Unlike read, a truncated
write _is_ visible outside the killed process. But exactly like read()
there _are_ system management reasons why you may really need to kill
writers. So the debatable point comes from whether you want to consider a
killing signal to be "exceptional enough" to warrant the partial write.

I can see both sides. I personally think I'd prefer the "if I kill a
process, I want it dead _now_" approach, but this definitely _is_ up for
discussion (unlike the signal handler case).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 17:29                             ` Linus Torvalds
@ 2002-08-02 17:57                               ` Trond Myklebust
  2002-08-02 18:10                                 ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Trond Myklebust @ 2002-08-02 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Benjamin LaHaise, Roman Zippel, David Woodhouse,
	David Howells, alan, linux-kernel

>>>>> " " == Linus Torvalds <torvalds@transmeta.com> writes:

     >  "Which is what we want in generic_file_read() (and _probably_
     >   generic_file_write() as well, but that's slightly more
     >   debatable)"

A frequent cause of complaints with the NFS client 'intr' mount option
is that grabbing the inode semaphore too is uninterruptible, and hence
even a lookup() can hang.
Would you therefore be planning on making down() interruptible by
SIGKILL?

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 17:57                               ` Trond Myklebust
@ 2002-08-02 18:10                                 ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2002-08-02 18:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jamie Lokier, Benjamin LaHaise, Roman Zippel, David Woodhouse,
	David Howells, alan, linux-kernel



On 2 Aug 2002, Trond Myklebust wrote:
>
> Would you therefore be planning on making down() interruptible by
> SIGKILL?

Can't do - existing users know that down() cannot fail.

But we already have a "down_interruptible()", so if we introduce the
notion of "non-interruptible but killable", we can also introduce a
"down_killable()".

		Linus


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 16:27                         ` Linus Torvalds
  2002-08-02 17:13                           ` Jamie Lokier
@ 2002-08-02 17:33                           ` Oliver Neukum
  2002-08-03 18:27                             ` David Woodhouse
  2002-10-17  8:32                           ` David Woodhouse
  2 siblings, 1 reply; 37+ messages in thread
From: Oliver Neukum @ 2002-08-02 17:33 UTC (permalink / raw)
  To: Linus Torvalds, Benjamin LaHaise
  Cc: Roman Zippel, David Woodhouse, David Howells, alan, linux-kernel


> and then let code like generic_file_write() etc use other combinations
> than the two existing ones, ie
>
> 	(TASK_WAKEKILL | TASK_LOADAVG)
>
> results in a process that is woken up by signals that kill it (but not
> other signals), and is counted towards the loadaverage. Which is what we
> want in generic_file_read() (and _probably_ generic_file_write() as well,
> but that's slightly more debatable).
>
> (We'd also have to add a new way to test whether you've been killed, so
> that such users could use "process_killed()" instead of the
> "signal_pending()" that a INTERRUPTIBLE sleeper uses to test whether it
> should exit).
>
> This is the trivial way to get the best of both worlds - you can still
> kill a process that is in D wait (if that particular kernel path allows
> it), but you don't get process-visible semantic changes.

If you do this to generic_file_write() you change the semantics for the
parent process, which up to now can expect a change to a file
made by a single call to write() to be done either fully or not at all,
if there's no error.

So IMHO it would be better to limit this new kind of waiting to reading.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 17:33                           ` Oliver Neukum
@ 2002-08-03 18:27                             ` David Woodhouse
  0 siblings, 0 replies; 37+ messages in thread
From: David Woodhouse @ 2002-08-03 18:27 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Linus Torvalds, Benjamin LaHaise, Roman Zippel, David Howells,
	alan, linux-kernel

Oliver.Neukum@lrz.uni-muenchen.de said:
>  So IMHO it would be better to limit this new kind of waiting to
> reading. 

You can do it for write() in the case where no data have yet been written, 
i.e. in prepare_write() first time round the loop. In fact, you can even 
return -ERESTARTNOINTR in that case, just as you can for read() where no 
data have yet been copied into userspace. Whether we want to special-case 
the first time round the loop just to give better responsiveness in the 
common case is debatable though.

You can also do it for open() in 2.5. (in 2.4 the read_inode API gave the
file system no choice but to return a full real inode or a bad one which
remained and prevented subsequent lookups of the same inode.)

--
dwmw2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 16:27                         ` Linus Torvalds
  2002-08-02 17:13                           ` Jamie Lokier
  2002-08-02 17:33                           ` Oliver Neukum
@ 2002-10-17  8:32                           ` David Woodhouse
  2 siblings, 0 replies; 37+ messages in thread
From: David Woodhouse @ 2002-10-17  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Roman Zippel, David Howells, alan, linux-kernel

On Fri, 2 Aug 2002, Linus Torvalds wrote:
> Actually, i fyou read my original email on this thread, I actually
> suggested splitting up the "INTERRUPTIBLE" vs "UNINTERRUPTIBLE" into
> three different cases and one extra bit.
>
> Sending somebody a SIGKILL (or any signal that kills the process) is
> different (in my opinion) from a signal that interrupts a system call
> in order to run a signal handler.
>
> So my original suggestion on this thread was to make 
> 
>         TASK_WAKESIGNAL - wake on all signals
>         TASK_WAKEKILL   - wake on signals that are deadly
>         TASK_NOSIGNAL   - don't wake on signals
>         TASK_LOADAVG    - counts toward loadaverage
>
>         #define TASK_UNINTERRUPTIBLE    (TASK_NOSIGNAL | TASK_LOADAVG)
>         #define TASK_INTERRUPTIBLE      TASK_WAKESIGNAL
>
> and then let code like generic_file_write() etc use other combinations
> than the two existing ones, ie
>
> 	(TASK_WAKEKILL | TASK_LOADAVG) 
>
> results in a process that is woken up by signals that kill it (but not
> other signals), and is counted towards the loadaverage. Which is what we
> want in generic_file_read() (and _probably_ generic_file_write() as well,
> but that's slightly more debatable).

I like this, but perhaps with a slight modification. I think that instead of
being a flag in the task state, the 'can we deal with being interrupted'
should instead be a _counter_, and the process in question can be woken 
by signals only if it's zero. For the following reason:

In the world you describe above, a function sets the NOSIGNAL flag when
sleeping if it can't deal with being interrupted -- presumably because it 
has no simple way to clean up its state in that case.

But we should remain aware that the ability to clean up on being interrupted
is not a function of only the bottom-most function on the call stack; it is
actually a function of the _entire_ call stack, in some cases all the way
back up to the userspace program making the system call. Every function in
the call stack needs to be able to deal with the -EINTR return and clean up
appropriately, not just the bottom-most function.

Consider the case where function A() sleeps, and can perfectly happily deal
with being interrupted. It is called from function B() which can also deal
with an -EINTR return from A(), and in fact _wants_ to do so because it can
take a long time and being uninterruptible would suck. The majority of 
callers of function A() are of this nature, and want to be interruptible.

But there exists a function C() which also calls function A(), and which 
_cannot_ deal with being interrupted.

What we actually tend to do in this case is make function A()
uninterruptible at all times (or -- ugh -- pass an extra argument all the
way down telling it which type of sleep to use and not to check
signal_pending()).

My suggested alternative is to have the afore-mentioned'interruptible_count'
in the task struct, and each function that cannot deal with receiving -EINTR
from the other functions it calls shall increase this count accordingly.

So sys_write() would increase current->interruptible_count, and then 
nothing called from there would upset the user by returning -EINTR.

Similarly, down() would just increase the count before calling what's 
currently down_interruptible(), etc.

--
dwmw2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 15:39                     ` Linus Torvalds
  2002-08-02 16:00                       ` Benjamin LaHaise
@ 2002-08-02 19:27                       ` Roman Zippel
  1 sibling, 0 replies; 37+ messages in thread
From: Roman Zippel @ 2002-08-02 19:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, David Howells, alan, linux-kernel

Hi

On Fri, 2 Aug 2002, Linus Torvalds wrote:

> Binary compatibility is important. As is the larger issue of generic UNIX
> compatibility. You had better have some really strong arguments for why
> you would think I'd be willing to break compatibility. So far you have had
> _no_ arguments for the question "Why?".

I never asked for breaking binary compatibility.
On the other hand I can give you an example, why I'd like to have a
choice. I expect from a good application that it recovers gracefully from
failures, so if I'm saving a file and the server goes down, I would
really, really like it if something happened when I push that stop button
or I press Esc and the application should offer me the possibility to save
the file somewhere else.

To implement this I would suggest using file flags instead of new task
flags:

	O_ATOMIC
	O_NONBLOCK
	O_SIGNALINT
	O_KILLINT
	O_DONT_BOTHER_ME

The first one might be useful for aio, it wants something like that
already anyway.
This way applications had a choice, how read/write should behave on
signals.

bye, Roman

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 22:35               ` Roman Zippel
  2002-08-01 23:30                 ` Linus Torvalds
@ 2002-08-02  7:31                 ` Giuliano Pochini
  1 sibling, 0 replies; 37+ messages in thread
From: Giuliano Pochini @ 2002-08-02  7:31 UTC (permalink / raw)
  To: Roman Zippel
  Cc: linux-kernel, alan, David Howells, David Woodhouse,
	Linus Torvalds


>> Stop arguing about this. It's a FACT.
>
> Linus, it's not that I don't want to believe you, but e.g. the SUS doesn't
> make that special exception.
> Installing signal handlers and not expecting EINTR _is_ sloppy
> programming.

Linus is right. It's impossible to change default rw semantics
without breaking things. Anyway, we can add another non-standard
file flag (that's likely to remain mostly unused).


Bye.


^ permalink raw reply	[flat|nested] 37+ messages in thread

[parent not found: <mailman.1028232841.11555.linux-kernel2news@redhat.com>]

* Re: manipulating sigmask from filesystems and drivers
       [not found]   ` <mailman.1028232841.11555.linux-kernel2news@redhat.com>
@ 2002-08-01 23:37     ` Pete Zaitcev
  2002-08-01 23:46       ` David Woodhouse
  0 siblings, 1 reply; 37+ messages in thread
From: Pete Zaitcev @ 2002-08-01 23:37 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-kernel

>>  They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
>> flag  to distinguish between "increases load average" and "doesn't".
> 
> The disadvantage of this approach is that it encourages people to be lazy
> and sleep with signals disabled, instead of implementing proper cleanup
> code. 
> 
> I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least 
> making people apply for a special licence to be permitted to use it :)
> 
> --
> dwmw2

Consider this. An application writes to /dev/dsp0, and ymfpci
(for example) start DMA. Then user interrupts the app with ^C.
When ymfpci gets ->release() call, it has to tell the chip
to stop DMA, then wait until it's complete. If it tries to
wait with TASK_INTERRUPTIBLE, schedule() will return immediately,
and in essense do a busy loop with CPU pegged at 100%.

Same thing happens in USB, only there it's worse: a spinning
application locks out khubd and whole subsistem dies.

-- Pete

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-01 23:37     ` Pete Zaitcev
@ 2002-08-01 23:46       ` David Woodhouse
  0 siblings, 0 replies; 37+ messages in thread
From: David Woodhouse @ 2002-08-01 23:46 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel

zaitcev@redhat.com said:
>  Consider this. An application writes to /dev/dsp0, and ymfpci (for
> example) start DMA. Then user interrupts the app with ^C. When ymfpci
> gets ->release() call, it has to tell the chip to stop DMA, then wait
> until it's complete. If it tries to wait with TASK_INTERRUPTIBLE,
> schedule() will return immediately, and in essense do a busy loop with
> CPU pegged at 100%.

Forgive me; it's late here. What's wrong with -ERESTARTNOINTR?

But even if that's not going to work -- as I said later, there _are_ cases
where you can't really get rid of it. And given the debates we've had
recently about the return value from close(), your release() call seems like
one such, if you really can't restart it.

My point was that we should be making TASK_UNINTERRUPTIBLE _less_ attractive
to encourage people not to use it simply because it's easier than thinking
about the cleanup path, rather than making it more attractive as was
originally suggested.

> Same thing happens in USB, only there it's worse: a spinning
> application locks out khubd and whole subsistem dies. 

Spinning is obviously wrong. If after thinking hard about it you can't come 
up with a better solution, go pick up a form, find 5 kernel hackers to sign 
it saying there really is no better solution, and you can have one of these 
licences to use TASK_UNTERRUPTIBLE that I was talking about :)

--
dwmw2

^ permalink raw reply	[flat|nested] 37+ messages in thread

[parent not found: <0C01A29FBAE24448A792F5C68F5EA47D2D3E2B@nasdaq.ms.ensim.com>]

* Re: manipulating sigmask from filesystems and drivers
       [not found] <0C01A29FBAE24448A792F5C68F5EA47D2D3E2B@nasdaq.ms.ensim.com>
@ 2002-08-02 17:57 ` Paul Menage
  2002-08-02 23:25   ` Ryan Anderson
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2002-08-02 17:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, pmenage

In article <0C01A29FBAE24448A792F5C68F5EA47D2D3E2B@nasdaq.ms.ensim.com>,
 you write:
>
>With write(), you have to make a judgement call. Unlike read, a truncated
>write _is_ visible outside the killed process. But exactly like read()
>there _are_ system management reasons why you may really need to kill
>writers. So the debatable point comes from whether you want to consider a
>killing signal to be "exceptional enough" to warrant the partial write.
>

How about a sysctl that lets the user specify the size threshold at
which writes use a killable wait state rather than
TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 17:57 ` Paul Menage
@ 2002-08-02 23:25   ` Ryan Anderson
  2002-08-02 23:30     ` Paul Menage
  0 siblings, 1 reply; 37+ messages in thread
From: Ryan Anderson @ 2002-08-02 23:25 UTC (permalink / raw)
  To: Paul Menage; +Cc: Linus Torvalds, linux-kernel

On Fri, 2 Aug 2002, Paul Menage wrote:

> In article <0C01A29FBAE24448A792F5C68F5EA47D2D3E2B@nasdaq.ms.ensim.com>,
>  you write:
> >
> >With write(), you have to make a judgement call. Unlike read, a truncated
> >write _is_ visible outside the killed process. But exactly like read()
> >there _are_ system management reasons why you may really need to kill
> >writers. So the debatable point comes from whether you want to consider a
> >killing signal to be "exceptional enough" to warrant the partial write.
> >
> 
> How about a sysctl that lets the user specify the size threshold at
> which writes use a killable wait state rather than
> TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)

/usr/include/linux/limits.h:#define PIPE_BUF        4096        /* #
bytes in atomic write to a pipe */

According to SUSv3, this is the maximum size that write is guaranteed to
be atomic for.

The minimum size this can be set to is 512.  I *think* this applies to
file io - but I'm not up to tracking all the definitions down.

So, you've already got the definitions in place, especially the size
limits.  It even reads like changing the behavior for larger writes
would be acceptable.

Just leave the small writes alone and I think you'll avoid causing large
problems for the majority of applications.

--
Ryan Anderson
  sometimes Pug Majere

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
  2002-08-02 23:25   ` Ryan Anderson
@ 2002-08-02 23:30     ` Paul Menage
  0 siblings, 0 replies; 37+ messages in thread
From: Paul Menage @ 2002-08-02 23:30 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: linux-kernel, pmenage

>> How about a sysctl that lets the user specify the size threshold at
>> which writes use a killable wait state rather than
>> TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)
>
>/usr/include/linux/limits.h:#define PIPE_BUF        4096        /* #
>bytes in atomic write to a pipe */
>
>According to SUSv3, this is the maximum size that write is guaranteed to
>be atomic for.

But that doesn't mean that we can't allow the user to configure higher
limits if they're using some particular software that relies on (and
more importantly, has successfully relied upon in the past) the current 
behaviour of no interruptions.

Paul




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: manipulating sigmask from filesystems and drivers
@ 2002-08-02 18:24 Jesse Pollard
  0 siblings, 0 replies; 37+ messages in thread
From: Jesse Pollard @ 2002-08-02 18:24 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@transmeta.com>:
>On Fri, 2 Aug 2002, Jamie Lokier wrote:
>>
>> Linus Torvalds wrote:
>> > Sending somebody a SIGKILL (or any signal that kills the process) is
>> > different (in my opinion) from a signal that interrupts a system call in
>> > order to run a signal handler.
>>
>> So it's ok to have truncated log entries (or more realistically,
>> truncated simple database entries) if the logging program is killed?
>
>This is why I said
>
> "Which is what we want in generic_file_read() (and _probably_
>  generic_file_write() as well, but that's slightly more debatable)"
>
>The "slightly more debatable" comes exactly from the thing you mention.
>
>The thing is, "read()" on a file doesn't have any side effects outside the
>process that does it, so if you kill the process, doing a partial read is
>always ok (yeah, you can come up with thread examples etc where you can
>see the state, but I think those are so contrieved as to not really merit
>much worry and certainly have no existing programs issues).
>
>With write(), you have to make a judgement call. Unlike read, a truncated
>write _is_ visible outside the killed process. But exactly like read()
>there _are_ system management reasons why you may really need to kill
>writers. So the debatable point comes from whether you want to consider a
>killing signal to be "exceptional enough" to warrant the partial write.
>
>I can see both sides. I personally think I'd prefer the "if I kill a
>process, I want it dead _now_" approach, but this definitely _is_ up for
>discussion (unlike the signal handler case).

There has been cases (and systems) in the past that have provided BOTH
interpretations:

1. current kill -9 action:

	terminates process as soon as current process returns or is in
	the process of returning to user mode. This is normal, and prevents
	most partial writes. This is applicable to things like data base
	servers, log servers, and journaling processes.

2. Kill, and abort outstanding I/O.

	This casues partial log writes, corrupts databases (usually), and will
	cause any process to terminate.

When is #2 used:
	a. real time systems where the device handling MUST be terminated now.
	b. system shutdown for emergencies (this allows filesystems to
	   finsh flushing, but user processes may be stuck writing to an
	   audio/parallel device... procedure is to use kill -15, wait a
	   second or two, kill -9 wait a second or two, KILL UNCONDITIONALLY,
	   and then shutdown anyway).
	   Other uses:
		b1. fire, flood, power failure (act of god)
		b2. system overtemp (loss of AC cooling...)
		b3. disk drive failures (to stop writing to a drive, abort
		    DMA actions, controller failure detection - no need to
		    propagate errors to a raid...)
		b4. safety related aborts in time critical applications

Item b3 allows a system with some pretty catastrophic hardware
failures to actually do something and shutdown/clean up as much as possible
without just hanging - which will also introduce partial log writes...

I worked on one system that determined the main disk controller was failing,
and proceded to request a power cycle on all disk drives attached to that
particular controller to attempt to clear the failure. All user processes
were killed, a detailed diagnostic was provided, then the system shut itself
off.

In realtime underwater survey systems we used such an abort to cancel
expensive operations that were already in progress (expensive if it
finished - setting off remote explosives via an external controller).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2002-10-17  8:26 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-07-31 11:52 manipulating sigmask from filesystems and drivers David Howells
2002-07-31 11:58 ` Alan Cox
2002-08-01 19:09 ` Linus Torvalds
2002-08-01 20:10   ` David Woodhouse
2002-08-01 20:21     ` Linus Torvalds
2002-08-01 20:47       ` Roman Zippel
2002-08-01 20:51         ` Linus Torvalds
2002-08-01 21:15           ` Roman Zippel
2002-08-01 21:42             ` Linus Torvalds
2002-08-01 22:29               ` David Woodhouse
2002-08-01 22:40                 ` Linus Torvalds
2002-08-01 22:50                   ` David Woodhouse
2002-08-02 15:59                   ` yodaiken
2002-08-01 22:35               ` Roman Zippel
2002-08-01 23:30                 ` Linus Torvalds
2002-08-02  0:31                   ` Olivier Galibert
2002-08-02  8:00                     ` Kai Henningsen
2002-08-02 10:02                   ` Roman Zippel
2002-08-02 12:38                     ` Ryan Anderson
2002-08-02 15:39                     ` Linus Torvalds
2002-08-02 16:00                       ` Benjamin LaHaise
2002-08-02 16:27                         ` Linus Torvalds
2002-08-02 17:13                           ` Jamie Lokier
2002-08-02 17:29                             ` Linus Torvalds
2002-08-02 17:57                               ` Trond Myklebust
2002-08-02 18:10                                 ` Linus Torvalds
2002-08-02 17:33                           ` Oliver Neukum
2002-08-03 18:27                             ` David Woodhouse
2002-10-17  8:32                           ` David Woodhouse
2002-08-02 19:27                       ` Roman Zippel
2002-08-02  7:31                 ` Giuliano Pochini
     [not found]   ` <mailman.1028232841.11555.linux-kernel2news@redhat.com>
2002-08-01 23:37     ` Pete Zaitcev
2002-08-01 23:46       ` David Woodhouse
     [not found] <0C01A29FBAE24448A792F5C68F5EA47D2D3E2B@nasdaq.ms.ensim.com>
2002-08-02 17:57 ` Paul Menage
2002-08-02 23:25   ` Ryan Anderson
2002-08-02 23:30     ` Paul Menage
  -- strict thread matches above, loose matches on Subject: below --
2002-08-02 18:24 Jesse Pollard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox