Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [RFC] capabilities: Ambient capabilities
From: Andy Lutomirski @ 2015-03-14 21:45 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Jarkko Sakkinen, Ted Ts'o, Andrew Lutomirski, Andrew Morton,
	Michael Kerrisk, Mimi Zohar, Linux API, Austin S Hemmelgarn,
	linux-security-module, Aaron Jones, Christoph Lameter, LKML,
	Serge Hallyn, Markku Savela, Kees Cook, Jonathan Corbet
In-Reply-To: <CALQRfL5FGGUnK7+Ss=MSDGmxpAERTUjezardGnH3QouBwhbZ4A@mail.gmail.com>

On Sat, Mar 14, 2015 at 2:09 PM, Andrew G. Morgan <morgan@kernel.org> wrote:
> On Fri, Mar 13, 2015 at 10:57 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mar 13, 2015 6:24 AM, "Andrew G. Morgan" <morgan@kernel.org> wrote:
>>>
>>> > It's to preserve the invariant that pA is always a subset of pI.
>>>
>>> But since a user can always raise a bit in pI if it is present in pP,
>>> what does this invariant add to your model other than inconvenience?
>>
>> The useful part is that dropping a bit from pI also drops it from pA.
>> To keep the model consistent, I also require that you add the bit to
>> pI before adding it to pA.
>
> So you are saying that pA is always a strict subset of pI (and pP)?
> Then why not explicitly implement it as:
>
>   pA' = (file caps or setuid or setgid ? 0 : pA)
>   pP' = (fP & X) | (pI & [fI | (pA' & pP)] )
>
> As it is you have so distributed these constraints that it is hard to
> be sure it will remain that way.

That would be insecure.  If an attacker had pA = CAP_SYS_ADMIN, pI =
0, pP = 0 (i.e. no privs but pA is set somehow) then, unless that's
there's some other protection implemented, they could run some setuid
program, and that program could switch back to non-root, set pI = 0,
and call execve.  Unexpectedly, CAP_SYS_ADMIN would be inherited.

So I made the invariant explicit and added an assertion.

>>
>> If you have a program that deliberately uses PR_CAP_AMBIENT, then
>> setting such a securebit will break the program, so it still doesn't
>> buy you anything.
>
> Not if you make the bit lockable (like the other bits). If you want to
> run with your model in effect, you lock the enable bit on.

I don't see the point.  Again, this should be the default.


>>
>>>
>>> > In the mean time, I don't even believe that there's a legitimate use
>>> > for any of the other secure bits (except keepcaps, and I don't know
>>> > why that's a securebit in the first place).
>>>
>>> Those bits currently make it possible to run a subsystem with no
>>> [set]uid-0 support in its process tree.
>>
>> Not usefully, because even with all the securebits set to their
>> non-legacy modes, caps don't inherit, so it doesn't work.  I've tried.
>
> Not sure I follow. They work for a definition if inheritable that you
> seem to refuse to accept.

I, and everyone I know who's tried to use inheritable capabilities,
has run into the near-complete uselessness of the current model.  I
understand that a defunct POSIX draft specified it, but it's still
nearly useless.

You've objected to changing it, but you've never directly addressed
any of the reasons why Christoph, Google, and I all believe that we
can't usefully use it.

>>> I think it is safe to say that naive privilege inheritance has a fair
>>> track record of being exploited orders of magnitude more frequently
>>> than this. After all, these are the reasons LD_PRELOAD and shell
>>> script setuid bits are suppressed.
>>
>> I don't know what you mean here by naive privilege inheritance.  The
>> examples you're taking about aren't inheritance at all; they're
>> exploring privilege *grants* during execve.  My patch deliberately
>> leaves grants like that alone.
>
> The pI set is inherited through this exec unmolested.

This is flat-out useless.  Having pI = CAP_NET_BIND_SERVICE doesn't
let me bind low-numbered ports, full stop.

> My Nack remains that you are eliminating the explicit enforcement of
> selective inheritance. A lockable secure bit protecting access to your
> prctl() function would address this concern.

Would a sysctl or securebit that *optionally* allows pA to be disabled
satisfy you?

I don't understand why lockable is at all useful.  You'd need
CAP_SETPCAP to flip it regardless.

--Andy

^ permalink raw reply

* Re: [RFC] capabilities: Ambient capabilities
From: Andrew G. Morgan @ 2015-03-14 21:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jarkko Sakkinen, Ted Ts'o, Andrew Lutomirski, Andrew Morton,
	Michael Kerrisk, Mimi Zohar, Linux API, Austin S Hemmelgarn,
	linux-security-module, Aaron Jones, Christoph Lameter, LKML,
	Serge Hallyn, Markku Savela, Kees Cook, Jonathan Corbet
In-Reply-To: <CALCETrVz39pJmMAEFodBNA8cOZjd0xkH95EzMazY0MWJDtpYgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Mar 13, 2015 at 10:57 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mar 13, 2015 6:24 AM, "Andrew G. Morgan" <morgan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>
>> > It's to preserve the invariant that pA is always a subset of pI.
>>
>> But since a user can always raise a bit in pI if it is present in pP,
>> what does this invariant add to your model other than inconvenience?
>
> The useful part is that dropping a bit from pI also drops it from pA.
> To keep the model consistent, I also require that you add the bit to
> pI before adding it to pA.

So you are saying that pA is always a strict subset of pI (and pP)?
Then why not explicitly implement it as:

  pA' = (file caps or setuid or setgid ? 0 : pA)
  pP' = (fP & X) | (pI & [fI | (pA' & pP)] )

As it is you have so distributed these constraints that it is hard to
be sure it will remain that way.

>> >> I'm also unclear how you can turn off this new 'feature' for a process
>> >> tree? As it is, the code creates an exploit path for a capable (pP !=
>> >> 0) program with an exploitable flaw to create a privilege escalation
>> >> for an arbitrary child program.
>> >
>> > Huh?  If you exploit the parent, you already win.  Yes, if a kiddie
>> > injects shellcode that does system("/bin/bash") into some pP != 0
>> > program, they don't actually elevate their privileges.  On the other
>> > hand, by the time an attacker injected shellcode for:
>> >
>> > prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
>> > system("/bin/bash");
>>
>> Let's call the above two lines [a] and [b]. With this patch, you are
>> encouraging folk to write programs that contain a line like [a]
>> already. So, yes, I am saying that you are creating an exploitable
>> path in these programs that says if someone can inject
>> system("/bin/bash") into these programs they can get a new (because of
>> this patch) privilege escalation.
>>
>> In the prevailing model, this kind of privilege escalation (resulting
>> from naive inheritance) is designed out. I recognize that you want to
>> add it back in, but I am concerned that you are not allowing for the
>> possibility that some folk might still want to still be able to
>> prevent it.
>
> If you have a program that deliberately uses PR_CAP_AMBIENT, then
> setting such a securebit will break the program, so it still doesn't
> buy you anything.

Not if you make the bit lockable (like the other bits). If you want to
run with your model in effect, you lock the enable bit on.

>> > into a target, they can already do whatever they want.
>> >
>> >> While I understand that everyone
>> >> 'knows what they are doing' in implementing this change, I'm convinced
>> >> that folk that are up to no good also do... Why not provide a lockable
>> >> secure bit to selectively disable this support?
>> >
>> > Show me a legitimate use case and I'll gladly implement a secure bit.
>>
>> Thanks. I was kind of hoping that you would add a lockable secure bit
>> that defaults this support to off, but can be used to turn it on with
>> or without a lock. That way, you can avoid disturbing the legacy
>> defaults (no surprises).
>
> I think this thing needs to default on to be useful.
>
>>
>> > In the mean time, I don't even believe that there's a legitimate use
>> > for any of the other secure bits (except keepcaps, and I don't know
>> > why that's a securebit in the first place).
>>
>> Those bits currently make it possible to run a subsystem with no
>> [set]uid-0 support in its process tree.
>
> Not usefully, because even with all the securebits set to their
> non-legacy modes, caps don't inherit, so it doesn't work.  I've tried.

Not sure I follow. They work for a definition if inheritable that you
seem to refuse to accept.

>> > In the mean time, see CVE-2014-3215 for an example of why securebits
>> > are probably more trouble than they're worth.
>>
>> I think it is safe to say that naive privilege inheritance has a fair
>> track record of being exploited orders of magnitude more frequently
>> than this. After all, these are the reasons LD_PRELOAD and shell
>> script setuid bits are suppressed.
>
> I don't know what you mean here by naive privilege inheritance.  The
> examples you're taking about aren't inheritance at all; they're
> exploring privilege *grants* during execve.  My patch deliberately
> leaves grants like that alone.

The pI set is inherited through this exec unmolested. You are adding a
pA set that is passed through exec somewhat unmolested but allows all
descendants to have privilege.

- Naive inheritance is 'if you had a privilege pre-exec any descendant
gets it post-exec'. This is what your pA achieves.

- Inheritance, as it is currently implemented, is implemented as a
handshake between a privileged ancestor (who added a pI bit) and a
descendant that is ready for it (has a fI bit to match).
  - the un-naive aspect (if you prefer 'selective' inheritance) of
this is that only binaries with file capabilities can ever wield
privilege, and while it can be suppressed by an intermediate
descendant the grant of inherited privilege can skip generations.

I think we can agree that the only place that inheritance can occur is
through an exec. So 'grants' through exec is sort of a semantic bit of
smoke masquerading as an argument. Even in the naive inheritance case
you have to explicitly grant some privilege - you are just choosing
different rules.

My Nack remains that you are eliminating the explicit enforcement of
selective inheritance. A lockable secure bit protecting access to your
prctl() function would address this concern.

Thanks

Andrew

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 20:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <20150314201402.GH22130@thin>

On 03/14, Josh Triplett wrote:
>
> What I was proposing was that a task that isn't yet dead, but that is
> going to be autoreaped, is not eligible for waiting either.  All the
> various wait* familiy of system calls should pretend it doesn't exist at
> all, because returning an autoreaped task from a wait* call introduces a
> race condition if the parent tries to *do* anything with the returned
> PID.  If you launch a process with CLONE_FD, you need to manage it
> exclusively with that fd, not with the wait* family of system calls.
>
> That also implies that the child-stop and child-continued mechanisms
> (do_notify_parent_cldstop, WSTOPPED, WCONTINUED) should ignore the task
> too.  In the future there could be a flag to clone4 that lets you get
> stop and continue notifications through the file descriptor.

So far I strongly disagree, and I think that "autoreap" feature should
not depend on CLONE_FD.

However, as I already said, perhaps something like W_IGNORE_AUTOREAP
for wait* makes sense.

Plus we should also discuss the reparenting. Ok, let me leave this
discussion until I read 0/4 at least.

Oleg.


^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 20:20 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <20150314200318.GG22130@thin>

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 08:18:36PM +0100, Oleg Nesterov wrote:
>
> Is there any information somewhere on how this state machine of doom is
> *supposed* to work? :)

This looks as if you think that other parts of this kernel differ ;)

> Why would "p->task_state == EXIT_DEAD" mean
> something different in wait_consider_task?

different? but once again, EXIT_DEAD never meant "autoreap". It always
means "already reaped". OK, more correctly it means "release_task() will
be called soon, please ignore me".

> Pulling the EXIT_DEAD tasks out of those lists completely does sound
> like a good simplification.  However, that doesn't seem to be the
> current expectation in wait_consider_task, which just returns if
> p->task_state == EXIT_DEAD to skip considering that task.

See above. And another email I sent.

> And an autoreaping task isn't necessarily dead yet; it just shouldn't be
> waited on.

Yes, sure, but please do not confuse this with EXIT_DEAD. In fact, please
do not confuse this with ->task_state.

> > However, currently this is TODO. The main problem is the locking in
> > wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
                        ^^^^^^
I mean, "we can't".

> > under read_lock().
>
> That appears to be only reachable for zombies, which an autoreaping task
> should never become.

Sure. I meant that other paths which can set EXIT_DEAD are simpler wrt
"kill the EXIT_DEAD and EXIT_TRACE state" patch we actually want. But lets
not discuss this here, this is offtopic.

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-14 20:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314194721.GA9654-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sat, Mar 14, 2015 at 08:47:21PM +0100, Oleg Nesterov wrote:
> On 03/14, Oleg Nesterov wrote:
> >
> > On 03/14, Josh Triplett wrote:
> > >
> > > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > > ignoring ptrace.
> > > > >
> > > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > > wait(NULL) hanf then?
> > > >
> > > > It should ignore the child that is set to autoreap. wait(NULL) should return -
> > > > ECHILD, indicating there are no children waiting to be reaped.
> > >
> > > Right.  And I don't think the current code does this.  I think we need
> > > to change wait_consider_task to early-return for ->autoreap just as it
> > > does for task_state == EXIT_DEAD.
> >
> > No. This EXIT_DEAD is absolutely different. And this is another indication
> > that you might use it wrongly ;)
> >
> > What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> > want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> > complicate the exit/wait/reparent paths.
> >
> > However, currently this is TODO. The main problem is the locking in
> > wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> > under read_lock().
> 
> Let me clarify in case I confused you.
> 
> The EXIT_DEAD check in do_wait() paths doesn't mean "autoreap". It means
> that this thread/process (depending on ptrace) was already reaped. It was
> reaped by our sub-thread, or it was reaped because we ignore SIGCHLD, or
> other reasons. This doesn't matter.
> 
> In short, EXIT_DEAD means: we have to keep this thread on lists until the
> task which set this state calls release_task().

That much I already understood from reading through the code, since
exit_notify doesn't set task_state to EXIT_DEAD until the task is
actually completely dead.  When wait_consider_task sees p->task_state ==
EXIT_DEAD, that task isn't eligible for waiting at all.

What I was proposing was that a task that isn't yet dead, but that is
going to be autoreaped, is not eligible for waiting either.  All the
various wait* familiy of system calls should pretend it doesn't exist at
all, because returning an autoreaped task from a wait* call introduces a
race condition if the parent tries to *do* anything with the returned
PID.  If you launch a process with CLONE_FD, you need to manage it
exclusively with that fd, not with the wait* family of system calls.

That also implies that the child-stop and child-continued mechanisms
(do_notify_parent_cldstop, WSTOPPED, WCONTINUED) should ignore the task
too.  In the future there could be a flag to clone4 that lets you get
stop and continue notifications through the file descriptor.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-14 20:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314191836.GA8416-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sat, Mar 14, 2015 at 08:18:36PM +0100, Oleg Nesterov wrote:
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > ignoring ptrace.
> > > >
> > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > wait(NULL) hanf then?
> > >
> > > It should ignore the child that is set to autoreap. wait(NULL) should return -
> > > ECHILD, indicating there are no children waiting to be reaped.
> >
> > Right.  And I don't think the current code does this.  I think we need
> > to change wait_consider_task to early-return for ->autoreap just as it
> > does for task_state == EXIT_DEAD.
> 
> No. This EXIT_DEAD is absolutely different. And this is another indication
> that you might use it wrongly ;)

Is there any information somewhere on how this state machine of doom is
*supposed* to work? :)  Why would "p->task_state == EXIT_DEAD" mean
something different in wait_consider_task?

> What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> complicate the exit/wait/reparent paths.

Pulling the EXIT_DEAD tasks out of those lists completely does sound
like a good simplification.  However, that doesn't seem to be the
current expectation in wait_consider_task, which just returns if
p->task_state == EXIT_DEAD to skip considering that task.

And an autoreaping task isn't necessarily dead yet; it just shouldn't be
waited on.

> However, currently this is TODO. The main problem is the locking in
> wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> under read_lock().

That appears to be only reachable for zombies, which an autoreaping task
should never become.

> And please see another email from me. So far  I disagree that wait(NULL)
> should return ECHILD unconditionally. At least unless this is discussed
> separately.

I'll respond in that separate thread, but one issue there: waiting for
any child process cannot safely return an autoreaping child process,
because that would introduce a race condition.  The PID the parent gets
back can disappear at any time, so there's nothing useful the parent can
do with it.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-14 19:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314192456.GA8707-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sat, Mar 14, 2015 at 08:24:56PM +0100, Oleg Nesterov wrote:
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> > > On 03/12, Josh Triplett wrote:
> > > >
> > > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > >  	if (group_dead)
> > > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > > >
> > > > -	if (unlikely(tsk->ptrace)) {
> > > > +	if (tsk->autoreap) {
> > > > +		autoreap = true;
> > > > +	} else if (unlikely(tsk->ptrace)) {
> > > >  		int sig = thread_group_leader(tsk) &&
> > > >  				thread_group_empty(tsk) &&
> > > >  				!ptrace_reparented(tsk) ?
> > > > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > >  	}
> > > >
> > > >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > > > -	if (tsk->exit_state == EXIT_DEAD)
> > > > +	if (tsk->exit_state == EXIT_DEAD) {
> > > >  		list_add(&tsk->ptrace_entry, &dead);
> > > > +		clonefd_do_notify(tsk);
> > > > +	}
> > >
> > > And even ignoring semantics issues, this change looks simply buggy anyway ;)
> > >
> > > How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> > > ->ptrace_entry is used by debugger.
> >
> > That list_add was there before; I didn't change that.
> 
> But this doesn't matter,
> 
> > I just added a
> > second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
> > wakes up potential callers of poll/read).
> 
> No. Please read this code before and after your patch. You also added
> 
> 	if (tsk->autoreap)
> 		autoreap = true;
> 
> at the start. At this can trigger the _wrong_ list_add(&tsk->ptrace_entry),
> when the task is traced by another thread.
> 
> The current code can only use ->ptrace_entry if it was untraced (by us).

Ugh.  I finally realized just how magic the logic is there; thanks for
catching this.  The call to forget_original_parent at the top of
exit_notify can potentially add the process to the list "dead" here,
either in exit_ptrace() or in reparent_leader() (the latter of which has
its own duplicate of part of exit_notify's logic, including
do_notify_parent and setting exit_state).  Then exit_notify can add the
task to "dead" itself under some conditions that clearly depend on the
exact nature of the existing three-way conditional above.  And finally
exit_notify loops over "dead" and releases all the tasks there.

I'll investigate this further and make sure the ptrace case gets
handled correctly.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 19:47 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314191836.GA8416-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 03/14, Oleg Nesterov wrote:
>
> On 03/14, Josh Triplett wrote:
> >
> > On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > > ignoring ptrace.
> > > >
> > > > Just suppose that real_parent has a single "autoreap" child. Should
> > > > wait(NULL) hanf then?
> > >
> > > It should ignore the child that is set to autoreap. wait(NULL) should return -
> > > ECHILD, indicating there are no children waiting to be reaped.
> >
> > Right.  And I don't think the current code does this.  I think we need
> > to change wait_consider_task to early-return for ->autoreap just as it
> > does for task_state == EXIT_DEAD.
>
> No. This EXIT_DEAD is absolutely different. And this is another indication
> that you might use it wrongly ;)
>
> What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
> want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
> complicate the exit/wait/reparent paths.
>
> However, currently this is TODO. The main problem is the locking in
> wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
> under read_lock().

Let me clarify in case I confused you.

The EXIT_DEAD check in do_wait() paths doesn't mean "autoreap". It means
that this thread/process (depending on ptrace) was already reaped. It was
reaped by our sub-thread, or it was reaped because we ignore SIGCHLD, or
other reasons. This doesn't matter.

In short, EXIT_DEAD means: we have to keep this thread on lists until the
task which set this state calls release_task().

Oleg.

^ permalink raw reply

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Josh Triplett @ 2015-03-14 19:29 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Andy Lutomirski, David Drysdale, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Linux FS Devel, X86 ML
In-Reply-To: <2279556.Wl6mCVq5Zi@tjmaciei-mobl4>

On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > > In any event, we should find out what FreeBSD does in response to
> > > read(2) on the fd.
> > 
> > I've just successfully installed FreeBSD and compiled qtbase (main package
> > of Qt 5) on it.
> > 
> > I'll test pdfork during the weekend and report its behaviour.
> 
> Here are my findings about pdfork.
> 
> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
> Qt adaptations: https://codereview.qt-project.org/108561
> 
> Processes created with pdfork() are normal processes that still send SIGCHLD 
> to their parents. The only difference is that you get the extra file descriptor 
> that can be passed to the pdgetpid() system call and works on select()/poll(). 
> Trying to read from that file descriptor will result in EOPNOTSUPP.

OK, since read() doesn't work on a pdfork() file descriptor, we don't
have to worry about compatibility with pdfork()'s read result.

However, if the expectation is that pdfork()ed child processes still
send SIGCHLD, then I don't see how we can be compatible there, nor do I
think we want to; as you mention below, that breaks the ability to
encapsulate management of the created process entirely within a library.

> Since they've never implemented pdwait4() (it's not even declared in the 
> headers), the only way to reap a child if you only have the file descriptor is 
> to first pdgetpid() and then call wait4() or wait6().

Which suggests that we shouldn't try to implement pdwait4() in glibc
until FreeBSD implements it in their kernel, since we won't know the
exact semantics they expect.

> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
> the file closes.

OK, that makes sense.  We could certainly implement a
CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
future.

> Conclusion: 
> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess. 
> As long as all child process activations use this feature, the problem is 
> solved.
> 
> Cons: it requires cooperation from all child starters. If some other library 
> or the application installs a global SIGCHLD handler that waits on all child 
> processes, like libvlc used to do and Glib and Ecore still do, you won't be 
> able to get the child exit status.
> 
> I have not tested what happens if you try to pass the file descriptor to other 
> processes (can you even do that on FreeBSD?). But even if you could and got 
> notifications, you couldn't wait on the child to get its exit status -- unless 
> they implement pdwait4.

Even if they do implement pdwait4, they might not bypass the "must be
the parent process" restriction.  Let's wait to see what semantics they
go with.

>  - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
>  - pdwait4: can be emulated with read()
>  - pdgetpid: needs an ioctl
>  - pdkill: needs an ioctl [or just write()]

I think that should be a dedicated syscall, not an ioctl.

It's unfortunate that rt_sigqueueinfo doesn't take a flags argument.
However, I just realized that it takes a 32-bit "int" for the signal
number, yet signal numbers fit in 8 bits.  So we could just add flags in
the high 24 bits of that argument, and in particular add a flag
indicating that the first argument is a file descriptor rather than a
PID.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 19:24 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86
In-Reply-To: <20150314191500.GC22130@thin>

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> > On 03/12, Josh Triplett wrote:
> > >
> > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	if (group_dead)
> > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > -	if (unlikely(tsk->ptrace)) {
> > > +	if (tsk->autoreap) {
> > > +		autoreap = true;
> > > +	} else if (unlikely(tsk->ptrace)) {
> > >  		int sig = thread_group_leader(tsk) &&
> > >  				thread_group_empty(tsk) &&
> > >  				!ptrace_reparented(tsk) ?
> > > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	}
> > >
> > >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > > -	if (tsk->exit_state == EXIT_DEAD)
> > > +	if (tsk->exit_state == EXIT_DEAD) {
> > >  		list_add(&tsk->ptrace_entry, &dead);
> > > +		clonefd_do_notify(tsk);
> > > +	}
> >
> > And even ignoring semantics issues, this change looks simply buggy anyway ;)
> >
> > How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> > ->ptrace_entry is used by debugger.
>
> That list_add was there before; I didn't change that.

But this doesn't matter,

> I just added a
> second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
> wakes up potential callers of poll/read).

No. Please read this code before and after your patch. You also added

	if (tsk->autoreap)
		autoreap = true;

at the start. At this can trigger the _wrong_ list_add(&tsk->ptrace_entry),
when the task is traced by another thread.

The current code can only use ->ptrace_entry if it was untraced (by us).

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 19:18 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Thiago Macieira, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314190132.GB22130@thin>

On 03/14, Josh Triplett wrote:
>
> On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> > On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > > It is not clear to me what do_wait() should do with ->autoreap child, even
> > > ignoring ptrace.
> > >
> > > Just suppose that real_parent has a single "autoreap" child. Should
> > > wait(NULL) hanf then?
> >
> > It should ignore the child that is set to autoreap. wait(NULL) should return -
> > ECHILD, indicating there are no children waiting to be reaped.
>
> Right.  And I don't think the current code does this.  I think we need
> to change wait_consider_task to early-return for ->autoreap just as it
> does for task_state == EXIT_DEAD.

No. This EXIT_DEAD is absolutely different. And this is another indication
that you might use it wrongly ;)

What we actually want is BUG_ON(task_state == EXIT_DEAD) here. We do not
want the EXIT_DEAD tasks in ->children/ptraced lists. These EXIT_DEAD tasks
complicate the exit/wait/reparent paths.

However, currently this is TODO. The main problem is the locking in
wait_task_zombie(), we can set EXIT_DEAD and remove the task from list
under read_lock().

And please see another email from me. So far  I disagree that wait(NULL)
should return ECHILD unconditionally. At least unless this is discussed
separately.

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-14 19:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314143558.GB12086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sat, Mar 14, 2015 at 03:35:58PM +0100, Oleg Nesterov wrote:
> On 03/12, Josh Triplett wrote:
> >
> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	if (group_dead)
> >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> >  
> > -	if (unlikely(tsk->ptrace)) {
> > +	if (tsk->autoreap) {
> > +		autoreap = true;
> > +	} else if (unlikely(tsk->ptrace)) {
> >  		int sig = thread_group_leader(tsk) &&
> >  				thread_group_empty(tsk) &&
> >  				!ptrace_reparented(tsk) ?
> > @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	}
> >  
> >  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > -	if (tsk->exit_state == EXIT_DEAD)
> > +	if (tsk->exit_state == EXIT_DEAD) {
> >  		list_add(&tsk->ptrace_entry, &dead);
> > +		clonefd_do_notify(tsk);
> > +	}
> 
> And even ignoring semantics issues, this change looks simply buggy anyway ;)
> 
> How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
> ->ptrace_entry is used by debugger.

That list_add was there before; I didn't change that.  I just added a
second line inside the EXIT_DEAD case, to call clonefd_do_notify (which
wakes up potential callers of poll/read).

- Josh Triplett

^ permalink raw reply

* Re: [PATCH] capabilities: Ambient capability set V2
From: Pavel Machek @ 2015-03-14 19:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Serge Hallyn, Andy Lutomirski, Jonathan Corbet, Aaron Jones,
	linux-security-module, linux-kernel, akpm, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-api, Michael Kerrisk
In-Reply-To: <alpine.DEB.2.11.1502261612370.8994@gentwo.org>

Hi!

>  		return -ENOSYS;
> Index: linux/include/linux/cred.h
> ===================================================================
> --- linux.orig/include/linux/cred.h	2015-02-25 13:43:06.929973954 -0600
> +++ linux/include/linux/cred.h	2015-02-25 13:43:06.925972078 -0600
> @@ -122,6 +122,7 @@ struct cred {
>  	kernel_cap_t	cap_permitted;	/* caps we're permitted */
>  	kernel_cap_t	cap_effective;	/* caps we can actually use */
>  	kernel_cap_t	cap_bset;	/* capability bounding set */
> +	kernel_cap_t	cap_ambient;	/* Ambient capability set */

lowercase "Ambient", for consistency?

Thanks,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Thiago Macieira @ 2015-03-14 19:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Triplett, David Drysdale, Al Viro, Andrew Morton,
	Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
	H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
	linux-kernel@vger.kernel.org, Linux API, Linux FS Devel, X86 ML
In-Reply-To: <4467745.tgv2QnPHiH@tjmaciei-mobl4>

On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > In any event, we should find out what FreeBSD does in response to
> > read(2) on the fd.
> 
> I've just successfully installed FreeBSD and compiled qtbase (main package
> of Qt 5) on it.
> 
> I'll test pdfork during the weekend and report its behaviour.

Here are my findings about pdfork.

Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
Qt adaptations: https://codereview.qt-project.org/108561

Processes created with pdfork() are normal processes that still send SIGCHLD 
to their parents. The only difference is that you get the extra file descriptor 
that can be passed to the pdgetpid() system call and works on select()/poll(). 
Trying to read from that file descriptor will result in EOPNOTSUPP.

Since they've never implemented pdwait4() (it's not even declared in the 
headers), the only way to reap a child if you only have the file descriptor is 
to first pdgetpid() and then call wait4() or wait6().

If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
the file closes.

Conclusion: 
Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess. 
As long as all child process activations use this feature, the problem is 
solved.

Cons: it requires cooperation from all child starters. If some other library 
or the application installs a global SIGCHLD handler that waits on all child 
processes, like libvlc used to do and Glib and Ecore still do, you won't be 
able to get the child exit status.

I have not tested what happens if you try to pass the file descriptor to other 
processes (can you even do that on FreeBSD?). But even if you could and got 
notifications, you couldn't wait on the child to get its exit status -- unless 
they implement pdwait4.

 - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
 - pdwait4: can be emulated with read()
 - pdgetpid: needs an ioctl
 - pdkill: needs an ioctl [or just write()]

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-14 19:01 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Oleg Nesterov, Al Viro, Andrew Morton, Andy Lutomirski,
	Ingo Molnar, Kees Cook, Paul E. McKenney, H. Peter Anvin,
	Rik van Riel, Thomas Gleixner, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <28025621.k7WkrfHd4d@tjmaciei-mobl4>

On Sat, Mar 14, 2015 at 11:38:29AM -0700, Thiago Macieira wrote:
> On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > It is not clear to me what do_wait() should do with ->autoreap child, even
> > ignoring ptrace.
> > 
> > Just suppose that real_parent has a single "autoreap" child. Should
> > wait(NULL) hanf then?
> 
> It should ignore the child that is set to autoreap. wait(NULL) should return -
> ECHILD, indicating there are no children waiting to be reaped.

Right.  And I don't think the current code does this.  I think we need
to change wait_consider_task to early-return for ->autoreap just as it
does for task_state == EXIT_DEAD.  I'll do that in v2.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 18:54 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: josh, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Kees Cook, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86
In-Reply-To: <28025621.k7WkrfHd4d@tjmaciei-mobl4>

On 03/14, Thiago Macieira wrote:
>
> On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> > It is not clear to me what do_wait() should do with ->autoreap child, even
> > ignoring ptrace.
> >
> > Just suppose that real_parent has a single "autoreap" child. Should
> > wait(NULL) hanf then?
>
> It should ignore the child that is set to autoreap. wait(NULL) should return -
> ECHILD, indicating there are no children waiting to be reaped.

I disagree. I won't really argue now, because I think that this needs
a separate discussion. And imo "autoreap" should come as a separate feature.

I think that wait(NULL) should hang like it hangs even if the parent ignores
SIGCHLD. But in this case the parent should be woken up when the "autoreap"
child exits.

If nothing else. Suppose that the parent does waitid(WEXITED|WSTOPPED).
Should WSTOPPED work? I think it should.

At the same time, if we add autoreap then probably it also makes sense to add
WEXITIED_UNLESS_AUTOREAP.

In short: this all certainly needs more discussion, but (afaics) this patch
is wrong in any case.

In fact I have some concerns about file descriptor from clone, it doesn't look
like a "right" interface to me. But I will not comment this part until at least
I read 0/4 ;)

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Thiago Macieira @ 2015-03-14 18:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: josh, Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar,
	Kees Cook, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86
In-Reply-To: <20150314143235.GA12086@redhat.com>

On Saturday 14 March 2015 15:32:35 Oleg Nesterov wrote:
> It is not clear to me what do_wait() should do with ->autoreap child, even
> ignoring ptrace.
> 
> Just suppose that real_parent has a single "autoreap" child. Should
> wait(NULL) hanf then?

It should ignore the child that is set to autoreap. wait(NULL) should return -
ECHILD, indicating there are no children waiting to be reaped.

But now I realise that this might have implications for session management and 
job control.

> If yes, who will wake the parent up?
> 
> If no, I do not see the necessary changes in wait_cosnider_task().

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply

* Re: [PATCH 1/2] vt: add cursor blink interval escape sequence
From: Scot Doyle @ 2015-03-14 17:48 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Tomi Valkeinen, Jean-Christophe Plagniol-Villard,
	Michael Kerrisk, Pavel Machek, Geert Uytterhoeven,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fbdev-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.LNX.2.11.1502271911380.4248@localhost>

On Fri, 27 Feb 2015, Scot Doyle wrote:
> Add an escape sequence to specify the current console's cursor blink
> interval. The interval is specified as a number of milliseconds until
> the next cursor display state toggle, from 50 to 65535. /proc/loadavg
> did not show a difference with a one msec interval, but the lower
> bound is set to 50 msecs since slower hardware wasn't tested.
> 
> Store the interval in the vc_data structure for later access by fbcon,
> initializing the value to fbcon's current hardcoded value of 200 msecs.
> 
> Signed-off-by: Scot Doyle <lkml14-enLWO88E2pdl57MIdRCFDg@public.gmane.org>
> Acked-by: Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org>

Hi Greg, sorry about your backlog. Is it too soon for a ping?

> ---
>  drivers/tty/vt/vt.c            | 9 +++++++++
>  include/linux/console_struct.h | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
> index 6e00572..ab1f173 100644
> --- a/drivers/tty/vt/vt.c
> +++ b/drivers/tty/vt/vt.c
> @@ -135,6 +135,7 @@ const struct consw *conswitchp;
>   */
>  #define DEFAULT_BELL_PITCH	750
>  #define DEFAULT_BELL_DURATION	(HZ/8)
> +#define DEFAULT_CURSOR_BLINK_MS	200
>  
>  struct vc vc_cons [MAX_NR_CONSOLES];
>  
> @@ -1590,6 +1591,13 @@ static void setterm_command(struct vc_data *vc)
>  		case 15: /* activate the previous console */
>  			set_console(last_console);
>  			break;
> +		case 16: /* set cursor blink duration in msec */
> +			if (vc->vc_npar >= 1 && vc->vc_par[1] >= 50 &&
> +					vc->vc_par[1] <= USHRT_MAX)
> +				vc->vc_cur_blink_ms = vc->vc_par[1];
> +			else
> +				vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS;
> +			break;
>  	}
>  }
>  
> @@ -1717,6 +1725,7 @@ static void reset_terminal(struct vc_data *vc, int do_clear)
>  
>  	vc->vc_bell_pitch = DEFAULT_BELL_PITCH;
>  	vc->vc_bell_duration = DEFAULT_BELL_DURATION;
> +	vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS;
>  
>  	gotoxy(vc, 0, 0);
>  	save_cur(vc);
> diff --git a/include/linux/console_struct.h b/include/linux/console_struct.h
> index e859c98..e329ee2 100644
> --- a/include/linux/console_struct.h
> +++ b/include/linux/console_struct.h
> @@ -104,6 +104,7 @@ struct vc_data {
>  	unsigned int    vc_resize_user;         /* resize request from user */
>  	unsigned int	vc_bell_pitch;		/* Console bell pitch */
>  	unsigned int	vc_bell_duration;	/* Console bell duration */
> +	unsigned short	vc_cur_blink_ms;	/* Cursor blink duration */
>  	struct vc_data **vc_display_fg;		/* [!] Ptr to var holding fg console for this display */
>  	struct uni_pagedir *vc_uni_pagedir;
>  	struct uni_pagedir **vc_uni_pagedir_loc; /* [!] Location of uni_pagedir variable for this console */
> -- 
> 2.3.0
> 

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-14 15:55 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller
  Cc: Thomas Graf, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <550400D8.5060407-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On 3/14/15 2:35 AM, Daniel Borkmann wrote:
> On 03/14/2015 05:59 AM, Alexei Starovoitov wrote:
>> also note that this case and twsk_build_assert are different.
>> twsk_build_assert has no other choice then to have one function
>> that covers logic in the whole file, whereas in this patch:
>> +               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
>> +               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
>> +                                     offsetof(struct sk_buff, mark));
>>
>> the build_bug_on protect the line directly below.
>> Separating them just doesn't make sense at all.
>
> I also like the above approach better, I only suggested that as a
> possible alternative since you were saying earlier in this thread:
>
>    I thought about it, but didn't add it, since we already have them
>    in the same file several lines above this spot. I think one build
>    error per .c file should be enough to attract attention. Though
>    I'll add a comment to convert_bpf_extensions() that build_bug_on
>    errors should be addressed in two places.

and to that you replied:
"My concern would be that in case the code gets changed, this spot
could easily be overlooked by someone possibly unfamiliar with the
code, since no build error triggers there."

so from there I saw two options: either copy paste all
build_bug_on and have the same *insn=... and build_bug_on in
two places or consolidate them in single helper function.
Obviously single helper function is a preferred method.
I'm not sure what are you still arguing about.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 14:35 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <9c39c576e1d9a9912b4aec54d833a73a84d2f592.1426180120.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>

On 03/12, Josh Triplett wrote:
>
> @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	if (group_dead)
>  		kill_orphaned_pgrp(tsk->group_leader, NULL);
>  
> -	if (unlikely(tsk->ptrace)) {
> +	if (tsk->autoreap) {
> +		autoreap = true;
> +	} else if (unlikely(tsk->ptrace)) {
>  		int sig = thread_group_leader(tsk) &&
>  				thread_group_empty(tsk) &&
>  				!ptrace_reparented(tsk) ?
> @@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	}
>  
>  	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> -	if (tsk->exit_state == EXIT_DEAD)
> +	if (tsk->exit_state == EXIT_DEAD) {
>  		list_add(&tsk->ptrace_entry, &dead);
> +		clonefd_do_notify(tsk);
> +	}

And even ignoring semantics issues, this change looks simply buggy anyway ;)

How can we do list_add(&tsk->ptrace_entry) if it is traced by _another_ task?
->ptrace_entry is used by debugger.

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 14:32 UTC (permalink / raw)
  To: josh-iaAMLnmF4UmaiuxdJuQwMA
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150314141414.GA11062-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

And let me add another note before I forget...

On 03/14, Oleg Nesterov wrote:
>
> On 03/13, josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> >
> >
> > A process launching a new process with CLONE_FD is explicitly requesting
> > that the process be automatically reaped without any other process
> > having to wait on it.  The task needs to not become a zombie, because
> > otherwise, it'll show up in waitpid(-1, ...)
>
> This is clear.
>
> But please note that this task can be traced/debugged by unrelated process,
> not its real_parent/creator. Say, the system admin does "strace -p". This
> simply breaks the current API.
>
> Again, again, I didn't read this series yet. But the proper solution (afaics)
> should move this "autoreap" check in release_task/__ptrace_detach(). If the
> task is traced. Debugger should check ->autoreap and skip another
> do_notify_parent().
>
> Speaking of autoreap... If ->exit_signal is zero, then the exiting child
> doesn't send the notification to its parent, still it doesn't autoreap
> itself. To me this looks strange, and in fact it seems to me that this
> is only by mistake. I am wondering if we can treat ->exit_signal == 0
> as "autoreap" too. As usual, most probably the answer is "no, because it
> is too late to change the historical behaviour". But this is off-topic.


It is not clear to me what do_wait() should do with ->autoreap child, even
ignoring ptrace.

Just suppose that real_parent has a single "autoreap" child. Should wait(NULL)
hanf then?

If yes, who will wake the parent up?

If no, I do not see the necessary changes in wait_cosnider_task().

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Oleg Nesterov @ 2015-03-14 14:14 UTC (permalink / raw)
  To: josh-iaAMLnmF4UmaiuxdJuQwMA
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
	Thiago Macieira, Michael Kerrisk,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20150313195707.GA10487@cloud>

On 03/13, josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
>
> On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
> >
> > Again, I simply do not know what this code does at all. But I bet the usage
> > of EXIT_DEAD is wrong ;)
> >
> > OK, OK, I can be wrong. But I simply do not see what protects this task_struct
> > if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).
>
> If by "what protects" you mean "what keeps it alive", the file
> descriptor holds a reference to the task_struct by calling
> get_task_struct when created and put_task_struct when released.

OK, so I was wrong. Although I still have a gut feeling that the usage
of EXIT_DEAD can't be right. Because it was always wrong outside of core
"exit" code ;) Nevermind, I didn't read this series yet, forget.

> > > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > >  	if (group_dead)
> > >  		kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > -	if (unlikely(tsk->ptrace)) {
> > > +	if (tsk->autoreap) {
> > > +		autoreap = true;
> >
> > Debuggers won't be happy. A ptraced task should not autoreap itself.
>
> A process launching a new process with CLONE_FD is explicitly requesting
> that the process be automatically reaped without any other process
> having to wait on it.  The task needs to not become a zombie, because
> otherwise, it'll show up in waitpid(-1, ...)

This is clear.

But please note that this task can be traced/debugged by unrelated process,
not its real_parent/creator. Say, the system admin does "strace -p". This
simply breaks the current API.

Again, again, I didn't read this series yet. But the proper solution (afaics)
should move this "autoreap" check in release_task/__ptrace_detach(). If the
task is traced. Debugger should check ->autoreap and skip another
do_notify_parent().

Speaking of autoreap... If ->exit_signal is zero, then the exiting child
doesn't send the notification to its parent, still it doesn't autoreap
itself. To me this looks strange, and in fact it seems to me that this
is only by mistake. I am wondering if we can treat ->exit_signal == 0
as "autoreap" too. As usual, most probably the answer is "no, because it
is too late to change the historical behaviour". But this is off-topic.

Oleg.

^ permalink raw reply

* (unknown), 
From: INFO @ 2015-03-14 13:17 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 48 bytes --]


-- 
Attn: Your Ref View Attached File Please...

[-- Attachment #2: Dear Alivia Edwin.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 14040 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Daniel Borkmann @ 2015-03-14  9:35 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Thomas Graf, linux-api, netdev, linux-kernel
In-Reply-To: <5503C03F.8020903@plumgrid.com>

On 03/14/2015 05:59 AM, Alexei Starovoitov wrote:
> On 3/13/15 7:27 PM, Alexei Starovoitov wrote:
>> On 3/13/15 7:16 PM, Daniel Borkmann wrote:
>>> On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:
>>>> On 3/13/15 7:06 PM, Daniel Borkmann wrote:
>>>>> On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
>>>>> ...
>>>>>> Previously, it was much more consistent, which I like better. And only
>>>>>> because of the simple BUILD_BUG_ON()? :/
>>>>>
>>>>> Alternative is to move all of them into a central place, something like
>>>>> in twsk_build_assert() or __mld2_query_bugs[].
>>>>
>>>> nope. that defeats the purpose of bug_on.
>>>
>>> Well, it doesn't. ;) It throws a build error thus the user is forced to
>>> investigate that further.
>>
>> according to this distorted logic all build_bug_on can be in one file
>> across the whole tree, since 'user is forced to investigate' ?!

That was not what I was suggesting, and I assume you know that ...

> also note that this case and twsk_build_assert are different.
> twsk_build_assert has no other choice then to have one function
> that covers logic in the whole file, whereas in this patch:
> +               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
> +               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +                                     offsetof(struct sk_buff, mark));
>
> the build_bug_on protect the line directly below.
> Separating them just doesn't make sense at all.

I also like the above approach better, I only suggested that as a
possible alternative since you were saying earlier in this thread:

   I thought about it, but didn't add it, since we already have them
   in the same file several lines above this spot. I think one build
   error per .c file should be enough to attract attention. Though
   I'll add a comment to convert_bpf_extensions() that build_bug_on
   errors should be addressed in two places.

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-14  4:59 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller
  Cc: Thomas Graf, linux-api, netdev, linux-kernel
In-Reply-To: <55039C9D.6010602@plumgrid.com>

On 3/13/15 7:27 PM, Alexei Starovoitov wrote:
> On 3/13/15 7:16 PM, Daniel Borkmann wrote:
>> On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:
>>> On 3/13/15 7:06 PM, Daniel Borkmann wrote:
>>>> On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
>>>> ...
>>>>> Previously, it was much more consistent, which I like better. And only
>>>>> because of the simple BUILD_BUG_ON()? :/
>>>>
>>>> Alternative is to move all of them into a central place, something like
>>>> in twsk_build_assert() or __mld2_query_bugs[].
>>>
>>> nope. that defeats the purpose of bug_on.
>>
>> Well, it doesn't. ;) It throws a build error thus the user is forced to
>> investigate that further.
>
> according to this distorted logic all build_bug_on can be in one file
> across the whole tree, since 'user is forced to investigate' ?!

also note that this case and twsk_build_assert are different.
twsk_build_assert has no other choice then to have one function
that covers logic in the whole file, whereas in this patch:
+               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+                                     offsetof(struct sk_buff, mark));

the build_bug_on protect the line directly below.
Separating them just doesn't make sense at all.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox