* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Serge E. Hallyn @ 2015-02-11 4:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Eric W. Biederman, Richard Weinberger, Linux API,
Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150211040957.GC21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> > 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0. Used
> > to start with 1. I expect many userspace parsers to be broken by this.
>
> This is intentional. The unified hierarchy will always have the
> hierarchy number zero. Userland needs to be updated anyway and the
> unified hierarchy won't show up unless explicitly enabled.
>
> > 2. After creating every non-leaf cgroup, we must fill in the
> > cgroup.subtree_cgroups file. This is extra work which userspace
> > doesn't have to do right now.
>
> Again, by design. This is how organization and control are separated
> and the differing levels of granularity is achieved.
>
> > 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
>
> There shouldn't be a "freezer" cgroup. The processes are categorized
> according to their logical structure and controllers are applied to
> the hierarchy as necessary.
But there can well be cgroups for which only freezer is enabled. If
I'm wrong about that, then I am suffering a fundamental misunderstanding.
> > tasks, which they will administer. In fact let's assume we are going to
> > use cgroup namespaces. We have to put the tasks into /foo/bar, unshare
> > the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> > and then write 'freezer' into /foo/bar. (If we're not using cgroup
> > namespaces, then we have to do a similar thing to let the tasks administer
> > /foo/bar while placing them under /foo/bar/leaf). The oddness I'm pointing
> > to is where the tasks have to know that they can create cgroups in "..".
> >
> > For containers this becomes odd. We tend to group containers by the
> > tasks in and under a cgroup. We now will have to assume a convention
> > where we know to check for tasks in and under "..", since by definition
> > pid 1's cgroup (in a container) cannot have children.
>
> The semantics is that the parent enables distribution of its given
> type of resource by enabling the controller in its subtree_control.
> This scoping isn't necessary for freezer and I'm debating whether to
> enable controllers which don't need granularity control to be enabled
> unconditionally. Right now, I'm leaning against it mostly for
> consistency.
Yeah, IIUC (i.e. freezer would always be enabled?) that would be
even-more-confusing.
> > 4. The per-cgroup "tasks" file not existing seems odd, although certainly
> > unexpected by much current software.
>
> And, yes, everything is per-process for reasons described in
> unified-hierarchy.txt.
>
> > So, if the unified hierarchy is going to not cause undue pain, existing
> > software really needs to start working now to use it. It's going to be
> > a sizeable task for lxc.
>
> Yes, this isn't gonna be a trivial conversion. The usage model
> changes and so will a lot of controller knobs and behaviors.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-11 4:31 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, Linux API, Network Development, LKML,
Linus Torvalds, Peter Zijlstra, Eric W. Biederman
In-Reply-To: <CAMEtUuy+B9aXP-8m7tA6aNxnS4SKRhMpfGEdWMcbxbj7ggOATw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, 10 Feb 2015 19:04:55 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> > You mean to be completely invisible to ftrace? And the debugfs/tracefs
> > directory?
>
> I mean it will be seen in tracefs to get 'id', but without enable/format/filter
In other words, invisible to ftrace. I'm not sure I'll be on board to
work with that.
> > Again, this would mean they become invisible to ftrace, and even
> > ftrace_dump_on_oops.
>
> yes, since these new tracepoints have no meat inside them.
> They're placeholders sitting idle and waiting for bpf to do
> something useful with them.
Hmm, I have a patch somewhere (never posted it), that add
TRACE_MARKER(), which basically would just print that it was hit. But
no data was passed to it. Something like that I would be more inclined
to take. Then the question is, what can bpf access there? Could just be
a place holder to add a "fast kprobe". This way, it can show up in
trace events with enable and and all that, it just wont have any data
to print out besides the normal pid, flags, etc.
But because it will inject a nop, that could be converted to a jump, it
will give you the power of kprobes but with the speed of a tracepoint.
>
> > I'm not fully understanding what is to be exported by this new ABI. If
> > the fields available, will always be available, then why can't the
> > appear in a TP_printk()?
>
> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
> It will have only one argument 'skb'.
> bpf program will read and print skb fields the way it likes
> for particular tracing scenario.
> So instead of making
> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
> gso_type=%#x",...
> the abi exposed via trace_pipe (as it is today),
> the new tracepoint_bpf abi is presence of 'skb' pointer as one
> and only argument to bpf program.
> Future refactoring of netif_rx would need to guarantee
> that trace_netif_rx_entry(skb) is called. that's it.
> imo such tracepoints are much easier to deal with during
> code changes.
But what can you access from that skb that is guaranteed to be there?
If you say anything, then there's no reason it can't be added to the
printk as well.
>
> May be some of the existing tracepoints like this one that
> takes one argument can be marked 'bpf-ready', so that
> programs can attach to them only.
I really hate the idea of adding tracepoints that ftrace can't use. It
basically kills the entire busy box usage scenario, as boards that have
extremely limited userspace still make full use of ftrace via the
existing tracepoints.
I still don't see the argument that adding data via the bpf functions
is any different than adding those same items to fields in an event.
Once you add a bpf function, then you must maintain those fields.
Look, you can always add more to a TP_printk(), as that is standard
with all text file kernel parsing. Like stat in /proc. Fields can not
be removed, but more can always be appended to the end.
Any tool that parses trace_pipe is broken if it can't handle extended
fields. The api can be extended, and for text files, that is by
appending to them.
-- Steve
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-02-11 5:02 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn, linux-kernel@vger.kernel.org, Andy Lutomirski,
cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150211042942.GA27931-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
A slightly off topic comment, for where this thread has gone but
relevant if we are talking about cgroup namespaces.
If don't implement compatibility with existing userspace, they get a
nack. A backwards-incompatible change should figure out how to remove
the need for any namespaces.
Because that is what namespaces are about backwards compatibility.
Eric
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-02-11 5:10 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Eric W. Biederman, Richard Weinberger, Linux API,
Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150211042942.GA27931-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
Hello,
On Wed, Feb 11, 2015 at 05:29:42AM +0100, Serge E. Hallyn wrote:
> > There shouldn't be a "freezer" cgroup. The processes are categorized
> > according to their logical structure and controllers are applied to
> > the hierarchy as necessary.
>
> But there can well be cgroups for which only freezer is enabled. If
> I'm wrong about that, then I am suffering a fundamental misunderstanding.
Ah, sure, I was mostly arguing semantics. It's just weird to call it
"freezer" cgroup.
> > The semantics is that the parent enables distribution of its given
> > type of resource by enabling the controller in its subtree_control.
> > This scoping isn't necessary for freezer and I'm debating whether to
> > enable controllers which don't need granularity control to be enabled
> > unconditionally. Right now, I'm leaning against it mostly for
> > consistency.
>
> Yeah, IIUC (i.e. freezer would always be enabled?) that would be
> even-more-confusing.
Right, freezer is kinda weird tho. Its feature can almost be
considered a utility feature of cgroups core rather than a separate
controller. That said, it's most likely that it'll remain in its
current form although how it blocks tasks should definitely be
reimplemented.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-02-11 5:17 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <87oap1qbv3.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
Hey,
On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
> A slightly off topic comment, for where this thread has gone but
> relevant if we are talking about cgroup namespaces.
>
> If don't implement compatibility with existing userspace, they get a
> nack. A backwards-incompatible change should figure out how to remove
> the need for any namespaces.
>
> Because that is what namespaces are about backwards compatibility.
Are you claiming that namespaces are soley about backwards
compatibility? ie. to trick userland into scoping without letting it
notice? That's a very restricted view and namespaces do provide
further isolation capabilties in addition to what can be achieved
otherwise and it is logical to collect simliar funtionalities there.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-02-11 6:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <20150211051704.GB24897-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> Hey,
>
> On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
>> A slightly off topic comment, for where this thread has gone but
>> relevant if we are talking about cgroup namespaces.
>>
>> If don't implement compatibility with existing userspace, they get a
>> nack. A backwards-incompatible change should figure out how to remove
>> the need for any namespaces.
>>
>> Because that is what namespaces are about backwards compatibility.
>
> Are you claiming that namespaces are soley about backwards
> compatibility? ie. to trick userland into scoping without letting it
> notice? That's a very restricted view and namespaces do provide
> further isolation capabilties in addition to what can be achieved
> otherwise and it is logical to collect simliar funtionalities there.
In principle a namespace is an additional layer of indirection from
names to objects. A namespace does not invent new kinds of objects.
A namespace takes things that were previously global and gives them a
scope.
In princple after name resolution a namespace should impose no overhead.
In general namespaces are not necessary if your scope of names
already has hierarchy. Which means that new interfaces can almost
always be designed in such a way that you can support containers without
needing to add any special namespace support. Which typically results
in more flexible and useful APIs for everyone, with no real code cost.
Further in the cgroup namespace patchset I looked at a while ago, the
only reason for having a cgroup namespace was to provide a measure of
backwards compatibility with existing userspace. I expect removing the
/proc/<pid>/cgroup file and replacing it with something in cgroupfs
itself would serve just as well if backwards compatibility is not the
objective. Or possibly replacincg /proc/<pid>/cgroup into a magic
symlink onto somewhere in the unified cgroupfs itself.
I just don't see any point in doing weird silly namespace things to keep
existing userspace working when the existing userspace won't work.
As such if a namespace doesn't implement compatibility with the existing
userspace it gets my nack.
Eric
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-02-11 6:33 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, Linux API, Network Development, LKML,
Linus Torvalds, Peter Zijlstra, Eric W. Biederman
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
>> > Again, this would mean they become invisible to ftrace, and even
>> > ftrace_dump_on_oops.
>>
>> yes, since these new tracepoints have no meat inside them.
>> They're placeholders sitting idle and waiting for bpf to do
>> something useful with them.
>
> Hmm, I have a patch somewhere (never posted it), that add
> TRACE_MARKER(), which basically would just print that it was hit. But
> no data was passed to it. Something like that I would be more inclined
> to take. Then the question is, what can bpf access there? Could just be
> a place holder to add a "fast kprobe". This way, it can show up in
> trace events with enable and and all that, it just wont have any data
> to print out besides the normal pid, flags, etc.
>
> But because it will inject a nop, that could be converted to a jump, it
> will give you the power of kprobes but with the speed of a tracepoint.
fair enough.
Something like TRACE_MARKER(arg1, arg2) that prints
it was hit without accessing the args would be enough.
Without any args it is indeed a 'fast kprobe' only.
Debug info would still be needed to access
function arguments.
On x64 function entry point and x64 abi make it easy
to access args, but i386 or kprobe in the middle
lose visibility when debug info is not available.
TRACE_MARKER (with few key args that function
is operating on) is enough to achieve roughly the same
as kprobe without debug info.
>> > I'm not fully understanding what is to be exported by this new ABI. If
>> > the fields available, will always be available, then why can't the
>> > appear in a TP_printk()?
>>
>> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
>> It will have only one argument 'skb'.
>> bpf program will read and print skb fields the way it likes
>> for particular tracing scenario.
>> So instead of making
>> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
>> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
>> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
>> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
>> gso_type=%#x",...
>> the abi exposed via trace_pipe (as it is today),
>> the new tracepoint_bpf abi is presence of 'skb' pointer as one
>> and only argument to bpf program.
>> Future refactoring of netif_rx would need to guarantee
>> that trace_netif_rx_entry(skb) is called. that's it.
>> imo such tracepoints are much easier to deal with during
>> code changes.
>
> But what can you access from that skb that is guaranteed to be there?
> If you say anything, then there's no reason it can't be added to the
> printk as well.
programs can access any field via bpf_fetch*() helpers which
make them kernel layout dependent or via user-ized sk_buff
with few fields which is portable.
In both cases kernel/user abi is only 'skb' pointer.
whether it's debugging program that needs full access
via fetch* helpers or portable program that uses stable api
it's up to program author.
Just like kprobes, it's clear, that if program is using
fetch* helpers it's doing it without any abi guarantees.
'perf probe' and 'bpf with fetch* helpers are the same.
perf probe creates wrappers on top of probe_kernel_read and
bpf_fetch* helpers are wrappers on top of probe_kernel_read.
Complains that 'my kprobe with flags=%cx mode=+4($stack)
stopped working in new kernel' are equivalent to complains
that program with bpf_fetch* stopped working.
Whereas if program is using user-ized structs it will work
across kernel versions, though it will be able to see
only very limited slice of in-kernel data.
>> May be some of the existing tracepoints like this one that
>> takes one argument can be marked 'bpf-ready', so that
>> programs can attach to them only.
>
> I really hate the idea of adding tracepoints that ftrace can't use. It
> basically kills the entire busy box usage scenario, as boards that have
> extremely limited userspace still make full use of ftrace via the
> existing tracepoints.
agree. I think your trace_marker with few args is a good
middle ground.
> I still don't see the argument that adding data via the bpf functions
> is any different than adding those same items to fields in an event.
> Once you add a bpf function, then you must maintain those fields.
>
> Look, you can always add more to a TP_printk(), as that is standard
> with all text file kernel parsing. Like stat in /proc. Fields can not
> be removed, but more can always be appended to the end.
>
> Any tool that parses trace_pipe is broken if it can't handle extended
> fields. The api can be extended, and for text files, that is by
> appending to them.
I agree that any text parsing script should be able to cope
with additional args without problems. I think it's a fear of
<1% breakage is causing maintainers to avoid any changes
to tracepoints even when they just add few args to the end
of TP_printk.
When tracepoints stop printing and the only thing they see
is single pointer to a well known struct like sk_buff,
this fear of tracepoints should fade.
programs are not part of the kernel, so whatever they do
and print is not our headache. We only make sure that
interface between kernel and programs is stable.
In other words kernel ABI is what kernel exposes to
user space and to bpf programs. Though programs
are run inside the kernel what they do it outside of
kernel abi. So when program prints fields is not our
problem, whereas when tracepoint prints fields it's
kernel abi.
ps I'll be traveling for the next few weeks, so
apologize in advance for slow response.
^ permalink raw reply
* Re: [PATCH 1/2] proc.5: Document /proc/[pid]/setgroups
From: Michael Kerrisk (man-pages) @ 2015-02-11 8:01 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Kenton Varda, LSM, Michael Kerrisk, Richard Weinberger,
Casey Schaufler, Andrew Morton, Andy Lutomirski
In-Reply-To: <54CF9995.1050409-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Hi Eric,
Ping!
Cheers,
Michael
On 2 February 2015 at 16:36, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> [Adding Josh to CC in case he has anything to add.]
>
> On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>> man5/proc.5 | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/man5/proc.5 b/man5/proc.5
>> index 96077d0dd195..d661e8cfeac9 100644
>> --- a/man5/proc.5
>> +++ b/man5/proc.5
>> @@ -1097,6 +1097,21 @@ are not available if the main thread has already terminated
>> .\" Added in 2.6.9
>> .\" CONFIG_SCHEDSTATS
>> .TP
>> +.IR /proc/[pid]/setgroups " (since Linux 3.19-rc1)"
>> +This file reports
>> +.BR allow
>> +if the setgroups system call is allowed in the current user namespace.
>> +This file reports
>> +.BR deny
>> +if the setgroups system call is not allowed in the current user namespace.
>> +This file may be written to with values of
>> +.BR allow
>> +and
>> +.BR deny
>> +before
>> +.IR /proc/[pid]/gid_map
>> +is written to (enabling setgroups) in a user namespace.
>> +.TP
>> .IR /proc/[pid]/smaps " (since Linux 2.6.14)"
>> This file shows memory consumption for each of the process's mappings.
>> (The
>
> Hi Eric,
>
> Thanks for this patch. I applied it, and then tried to work in
> quite a few other details gleaned from the source code and commit
> message, and Jon Corbet's article at http://lwn.net/Articles/626665/.
> Could you please let me know if the following is correct:
>
> /proc/[pid]/setgroups (since Linux 3.19)
> This file displays the string "allow" if processes in
> the user namespace that contains the process pid are
> permitted to employ the setgroups(2) system call, and
> "deny" if setgroups(2) is not permitted in that user
> namespace.
>
> A privileged process (one with the CAP_SYS_ADMIN capa‐
> bility in the namespace) may write either of the strings
> "allow" or "deny" to this file before writing a group ID
> mapping for this user namespace to the file
> /proc/[pid]/gid_map. Writing the string "deny" prevents
> any process in the user namespace from employing set‐
> groups(2).
>
> The default value of this file in the initial user
> namespace is "allow".
>
> Once /proc/[pid]/gid_map has been written to (which has
> the effect of enabling setgroups(2) in the user names‐
> pace), it is no longer possible to deny setgroups(2) by
> writing to /proc/[pid]/setgroups.
>
> A child user namespace inherits the /proc/[pid]/gid_map
> setting from its parent.
>
> If the setgroups file has the value "deny", then the
> setgroups(2) system call can't subsequently be reenabled
> (by writing "allow" to the file) in this user namespace.
> This restriction also propagates down to all child user
> namespaces of this user namespace.
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply
* Re: [PATCH 2/2] user_namespaces.7: Update the documention to reflect the fixes for negative groups
From: Michael Kerrisk (man-pages) @ 2015-02-11 8:02 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Michael Kerrisk, Linux Containers, Josh Triplett, Andrew Morton,
Kees Cook, Linux API, linux-man, linux-kernel@vger.kernel.org,
LSM, Casey Schaufler, Serge E. Hallyn, Richard Weinberger,
Kenton Varda, stable, Andy Lutomirski
In-Reply-To: <54CF99BF.8050401@gmail.com>
Hi Eric,
Ping!
Cheers,
Michael
On 2 February 2015 at 16:37, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hi Eric,
>
> Thanks for writing this up!
>
> On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
>>
>> Files with access permissions such as ---rwx---rwx give fewer
>> permissions to their group then they do to everyone else. Which means
>> dropping groups with setgroups(0, NULL) actually grants a process
>> privileges.
>>
>> The uprivileged setting of gid_map turned out not to be safe after
>> this change. Privilege setting of gid_map can be interpreted as
>> meaning yes it is ok to drop groups.
>
> I had trouble to parse that sentence (and I'd like to make sure that
> the right sentence ends up in the commit message). Did you mean:
>
> "*Unprivileged* setting of gid_map can be interpreted as meaning
> yes it is ok to drop groups"
>
> ?
>
> Or something else?
>
>> To prevent this problem and future problems user namespaces were
>> changed in such a way as to guarantee a user can not obtain
>> credentials without privilege they could not obtain without the
>> help of user namespaces.
>>
>> This meant testing the effective user ID and not the filesystem user
>> ID as setresuid and setregid allow setting any process uid or gid
>> (except the supplemental groups) to the effective ID.
>>
>> Furthermore to preserve in some form the useful applications that have
>> been setting gid_map without privilege the file /proc/[pid]/setgroups
>> was added to allow disabling setgroups. With the setgroups system
>> call permanently disabled in a user namespace it again becomes safe to
>> allow writes to gid_map without privilege.
>>
>> Here is my meager attempt to update user_namespaces.7 to reflect these
>> issues.
>
> It looked pretty serviceable as patch, IMO. So, thanks again. I've applied,
> tweaking some wordings afterward, but changing nothing essential. See below
> for a question.
>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>> man7/user_namespaces.7 | 52 +++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 49 insertions(+), 3 deletions(-)
>>
>> diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7
>> index d76721d9a0a1..f8333a762308 100644
>> --- a/man7/user_namespaces.7
>> +++ b/man7/user_namespaces.7
>> @@ -533,11 +533,16 @@ One of the following is true:
>> The data written to
>> .I uid_map
>> .RI ( gid_map )
>> -consists of a single line that maps the writing process's filesystem user ID
>> +consists of a single line that maps the writing process's effective user ID
>> (group ID) in the parent user namespace to a user ID (group ID)
>> in the user namespace.
>> -The usual case here is that this single line provides a mapping for user ID
>> -of the process that created the namespace.
>> +The writing process must have the same effective user ID as the process
>> +that created the user namespace.
>> +In the case of
>> +.I gid_map
>> +the
>> +.I setgroups
>> +file must have been written to earlier and disabled the setgroups system call.
>> .IP * 3
>> The opening process has the
>> .BR CAP_SETUID
>> @@ -552,6 +557,47 @@ Writes that violate the above rules fail with the error
>> .\"
>> .\" ============================================================
>> .\"
>> +.SS Interaction with system calls that change the uid or gid values
>> +When in a user namespace where the
>> +.I uid_map
>> +or
>> +.I gid_map
>> +file has not been written the system calls that change user IDs
>> +or group IDs respectively will fail. After the
>> +.I uid_map
>> +and
>> +.I gid_map
>> +file have been written only the mapped values may be used in
>> +system calls that change user IDs and group IDs.
>> +
>> +For user IDs these system calls include
>> +.BR setuid ,
>> +.BR setfsuid ,
>> +.BR setreuid ,
>> +and
>> +.BR setresuid .
>> +
>> +For group IDs these system calls include
>> +.BR setgid ,
>> +.BR setfsgid ,
>> +.BR setregid ,
>> +.BR setresgid ,
>> +and
>> +.BR setgroups.
>> +
>> +Writing
>> +.BR deny
>> +to the
>> +.I /proc/[pid]/setgroups
>> +file before writing to
>> +.I /proc/[pid]/gid_map
>> +will permanently disable the setgroups system call in a user namespace
>> +and allow writing to
>> +.I /proc/[pid]/gid_map
>> +without
>> +.BR CAP_SETGID
>> +in the parent user namespace.
>
> I just want to double check: you really did mean to write "*parent* namespace"
> above, right?
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Peter Zijlstra @ 2015-02-11 9:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
Network Development, LKML, Linus Torvalds, Eric W. Biederman
In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA@mail.gmail.com>
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> >> not all tools use libtraceevent.
> >> gdb calls perf_event_open directly:
> >> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> >> and parses PERF_RECORD_SAMPLE as a binary.
> >> In this case it's branch records, but I think we never said anywhere
> >> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> >> in this particular order.
> >
> > What particular order? Note, that's a hardware event, not a software
> > one.
>
> yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
> when attr.sample_type = IP | ADDR whereas this is an
> internal order of 'if' statements inside perf_output_sample()...
This is indeed promised in the data layout description in
include/uapi/linux/perf_event.h.
There is no other way to find where fields are; perf relies on
predetermined order of fields coupled with the requested field bitmask.
So we promise the order: id, ip, pid, tid, time, addr,.. etc.
So if you request IP and ADDR but none of the other fields, then you
know your sample will start with IP and then contain ADDR.
The traceevent thing has a debug/trace-fs format description of fields
that is supposed to be used.
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Peter Zijlstra @ 2015-02-11 9:45 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
Network Development, LKML, Linus Torvalds, Eric W. Biederman
In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on scheduler. So users taking that
> into account.
Right, so trace_events were/are root only, and root 'should' be in the
root pid namespace, and therefore pid magically works.
And I'm not sure, but I don't think the 'nested' root available from
containers should have access to ftrace, so that should not be an issue.
Perf tries real hard to present PERF_SAMPLE_PID data in the pid
namespace of the task that created the event.
As to prio; yes this is a prime example of suck, I would love to change
that but cannot :-(. The only solace I have is that everybody who is
relying on it is broken.
There is a very good reason I'm against adding more tracepoints to the
scheduler, its a nightmare.
^ permalink raw reply
* Re: [RFC] simple_char: New infrastructure to simplify chardev management
From: Jiri Kosina @ 2015-02-11 10:10 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Greg Kroah-Hartman, Arnd Bergmann,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <d3ab93173d51cebf00dd2263fd0ce9f8cd6258f7.1423609645.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
On Tue, 10 Feb 2015, Andy Lutomirski wrote:
> This isn't adequately tested, and I don't have a demonstration (yet).
> It's here for review for whether it's a good idea in the first place
> and for weather the fully_dynamic mechanism is a good idea.
[ ... snip ... ]
> Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
Andy, thanks a lot, this really *is* needed. I am willing to port things I
am responsible for (such as hidraw) to this once this gets merged.
Acked-by: Jiri Kosina <jkosina-AlSwsSmVLrQ@public.gmane.org>
--
Jiri Kosina
SUSE Labs
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Peter Zijlstra @ 2015-02-11 10:15 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Ingo Molnar, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
Network Development, LKML, Linus Torvalds, Eric W. Biederman
In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> > It would need to do more that that. It may have to calculate the value
> > that it returns, as the internal value may be different with different
> > kernels.
>
> back to 'prio'... the 'prio' accessible from the program
> should be the same 'prio' that we're storing inside task_struct.
Its not, task_struct::prio is an entirely different value than the one
used in sched_param::sched_priority / sched_attr::sched_priority.
And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO
tasks, we have more classes.
> No extra conversions.
We're not going to add runtime/space overhead to the kernel just because
someone might maybe someday trace the kernel.
That leaves the option of either tracing the kernel internal value and
userspace will just have to deal with it, or making the tracepoint more
expensive by having it do the conversion.
Now the big question is, what do we do when we add/extend a scheduling
class and have more parameters? We cannot change the tracepoint because
userspace assumes format. And I simply refuse to add a second -- because
that will end up being a third and fourth etc.. -- tracepoint right next
to it with a different layout.
Note that we just did add a class, we grew SCHED_DEADLINE a few releases
ago, and that has 3 parameters (or 6 depending on how you look at it).
You currently cannot 'debug' that with the existing tracepoints.
In short, I loathe traceevents, they're more trouble than they're worth.
Now I do love the infrastructure, its very useful debugging, but that's
all with local hacks that will never see the light of day.
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Peter Zijlstra @ 2015-02-11 10:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: Alexei Starovoitov, Ingo Molnar, Namhyung Kim,
Arnaldo Carvalho de Melo, Jiri Olsa, Masami Hiramatsu, Linux API,
Network Development, LKML, Linus Torvalds,
ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <20150210165359.34cc53d9-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote:
> > >> In the future we might add a tracepoint and pass a single
> > >> pointer to interesting data struct to it. bpf programs will walk
> > >> data structures 'as safe modules' via bpf_fetch*() methods
> > >> without exposing it as ABI.
> > >
> > > Will this work if that structure changes? When the field we are looking
> > > for no longer exists?
> >
> > bpf_fetch*() is the same mechanism as perf probe.
> > If there is a mistake by user space tools, the program
> > will be reading some junk, but it won't be crashing.
>
> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.
We're compiling the BPF stuff against the 'current' kernel headers
right? So would enforcing module versioning not be sufficient?
We already break out-of-tree modules without a second thought, the
module interface is not guaranteed. They just need to cope with it.
Anything using the kernel headers to look into the kernel guts should be
bound to the same rules.
So if we think of BFP stuff as out-of-tree modules, and treat them the
same, I see no problem.
I'm sure some BFP 'scripts' will turn in the same right mess that
out-of-tree modules are, with endless #ifdef version checks, but hey,
not my problem ;-)
^ permalink raw reply
* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Bryn M. Reeves @ 2015-02-11 11:32 UTC (permalink / raw)
To: Greg KH
Cc: Seymour, Shane M, linux-api@vger.kernel.org,
linux-scsi@vger.kernel.org, Kai.Makisara@kolumbus.fi,
James E.J. Bottomley (JBottomley@parallels.com),
Laurence Oberman (loberman@redhat.com)
In-Reply-To: <20150210223027.GA27269@kroah.com>
On Wed, Feb 11, 2015 at 06:30:27AM +0800, Greg KH wrote:
> On Tue, Feb 10, 2015 at 02:27:20PM +0000, Bryn M. Reeves wrote:
> > On Sat, Feb 07, 2015 at 12:07:43PM +0800, Greg KH wrote:
> >
> > $ cat /sys/fs/selinux/avc/cache_stats
> > lookups hits misses allocations reclaims frees
> > 18938916 18921707 17209 17209 17328 22215
> > 38164283 38146514 17769 17769 16800 19049
> > 18078108 18056991 21117 21117 21344 19305
> > 15168204 15150079 18125 18125 17776 13149
> > 0 0 0 0 0 0
> > 0 0 0 0 0 0
> > 0 0 0 0 0 0
> > 0 0 0 0 0 0
> >
> > $ cat /sys/fs/selinux/avc/hash_stats
> > entries: 506
> > buckets used: 290/512
> > longest chain: 5
>
> Ugh, those look like they should be debugfs interfaces. Thanks, I'll
> add them to my list of things to nag people about...
Actually looking properly these are outside sysfs:
$ mount | grep selinux
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
So everything below the mount point is in their own vfstype.
Regards,
Bryn.
^ permalink raw reply
* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-11 12:51 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, Linux API, Network Development, LKML,
Linus Torvalds, Peter Zijlstra, Eric W. Biederman
In-Reply-To: <CAMEtUuxizvHF09y_vU-c+L=n16tqWDqJPTybGB1a_OqN7p+jsA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, 10 Feb 2015 22:33:05 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> fair enough.
> Something like TRACE_MARKER(arg1, arg2) that prints
> it was hit without accessing the args would be enough.
> Without any args it is indeed a 'fast kprobe' only.
> Debug info would still be needed to access
> function arguments.
> On x64 function entry point and x64 abi make it easy
> to access args, but i386 or kprobe in the middle
> lose visibility when debug info is not available.
> TRACE_MARKER (with few key args that function
> is operating on) is enough to achieve roughly the same
> as kprobe without debug info.
Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
possibly a full trace event layout.
The difference would be that the trace events do not show up unless you
have "trace_debug" on the command line. This should prevent
applications from depending on them.
I could even do the nasty dmesg output like I do with trace_printk()s,
that would definitely keep a production kernel from adding it by
default.
When trace_debug is not there, the trace points could still be accessed
but perhaps only via bpf, or act like a simple trace marker.
Note, if you need ids, I rather have them in another directory than
tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
simple.
This is something that needs to probably be discussed at bit.
-- Steve
^ permalink raw reply
* Re: [PATCH 1/2] proc.5: Document /proc/[pid]/setgroups
From: Eric W. Biederman @ 2015-02-11 13:51 UTC (permalink / raw)
To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Kenton Varda, LSM, Richard Weinberger, Casey Schaufler,
Andrew Morton, Andy Lutomirski
In-Reply-To: <CAKgNAkgWnZ=7E4bk3JhzFS88CJ32szYCYcm_Sx166yVuWKhhUA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> Hi Eric,
>
> Ping!
>
> Cheers,
>
> Michael
My apologies. You description wasn't wrong but it may be a bit
misleading, explanation below. You will have to figure out how to work
that into your proposed text.
> On 2 February 2015 at 16:36, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> [Adding Josh to CC in case he has anything to add.]
>>
>> On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>> ---
>>> man5/proc.5 | 15 +++++++++++++++
>>> 1 file changed, 15 insertions(+)
>>>
>>> diff --git a/man5/proc.5 b/man5/proc.5
>>> index 96077d0dd195..d661e8cfeac9 100644
>>> --- a/man5/proc.5
>>> +++ b/man5/proc.5
>>> @@ -1097,6 +1097,21 @@ are not available if the main thread has already terminated
>>> .\" Added in 2.6.9
>>> .\" CONFIG_SCHEDSTATS
>>> .TP
>>> +.IR /proc/[pid]/setgroups " (since Linux 3.19-rc1)"
>>> +This file reports
>>> +.BR allow
>>> +if the setgroups system call is allowed in the current user namespace.
>>> +This file reports
>>> +.BR deny
>>> +if the setgroups system call is not allowed in the current user namespace.
>>> +This file may be written to with values of
>>> +.BR allow
>>> +and
>>> +.BR deny
>>> +before
>>> +.IR /proc/[pid]/gid_map
>>> +is written to (enabling setgroups) in a user namespace.
>>> +.TP
>>> .IR /proc/[pid]/smaps " (since Linux 2.6.14)"
>>> This file shows memory consumption for each of the process's mappings.
>>> (The
>>
>> Hi Eric,
>>
>> Thanks for this patch. I applied it, and then tried to work in
>> quite a few other details gleaned from the source code and commit
>> message, and Jon Corbet's article at http://lwn.net/Articles/626665/.
>> Could you please let me know if the following is correct:
It is close but it may be misleading.
>> /proc/[pid]/setgroups (since Linux 3.19)
>> This file displays the string "allow" if processes in
>> the user namespace that contains the process pid are
>> permitted to employ the setgroups(2) system call, and
>> "deny" if setgroups(2) is not permitted in that user
>> namespace.
With the caveat that when gid_map is not set that setgroups is also not
allowed.
>> A privileged process (one with the CAP_SYS_ADMIN capa‐
>> bility in the namespace) may write either of the strings
>> "allow" or "deny" to this file before writing a group ID
>> mapping for this user namespace to the file
>> /proc/[pid]/gid_map. Writing the string "deny" prevents
>> any process in the user namespace from employing set‐
>> groups(2).
Or more succintly. You are allowed to write to /proc/[pid]/setgroups
when calling setgroups is not allowed because gid_map is unset. This
ensures we do not have any transitions from a state where setgroups
is allowed to a state where setgroups is denied. There are only
transitions from setgroups not-allowed to setgroups allowed.
>> The default value of this file in the initial user
>> namespace is "allow".
>>
>> Once /proc/[pid]/gid_map has been written to (which has
>> the effect of enabling setgroups(2) in the user names‐
>> pace), it is no longer possible to deny setgroups(2) by
>> writing to /proc/[pid]/setgroups.
>>
>> A child user namespace inherits the /proc/[pid]/gid_map
>> setting from its parent.
>>
>> If the setgroups file has the value "deny", then the
>> setgroups(2) system call can't subsequently be reenabled
>> (by writing "allow" to the file) in this user namespace.
>> This restriction also propagates down to all child user
>> namespaces of this user namespace.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply
* Re: [PATCH 2/2] user_namespaces.7: Update the documention to reflect the fixes for negative groups
From: Eric W. Biederman @ 2015-02-11 14:01 UTC (permalink / raw)
To: mtk.manpages
Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
Linux API, linux-man, linux-kernel@vger.kernel.org, LSM,
Casey Schaufler, Serge E. Hallyn, Richard Weinberger,
Kenton Varda, stable, Andy Lutomirski
In-Reply-To: <CAKgNAkhmwK02DJQV84S+dEdrUDjzRuR32j+2gcKkgeDq8jTkuQ@mail.gmail.com>
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> Hi Eric,
>
> Ping!
>
> Cheers,
>
> Michael
>
>
> On 2 February 2015 at 16:37, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> Hi Eric,
>>
>> Thanks for writing this up!
>>
>> On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
>>>
>>> Files with access permissions such as ---rwx---rwx give fewer
>>> permissions to their group then they do to everyone else. Which means
>>> dropping groups with setgroups(0, NULL) actually grants a process
>>> privileges.
>>>
>>> The uprivileged setting of gid_map turned out not to be safe after
^^^^^^^^^^^
unprivileged -- typo fix
>>> this change. Privilege setting of gid_map can be interpreted as
>>> meaning yes it is ok to drop groups.
>>
>> I had trouble to parse that sentence (and I'd like to make sure that
>> the right sentence ends up in the commit message). Did you mean:
>>
>> "*Unprivileged* setting of gid_map can be interpreted as meaning
>> yes it is ok to drop groups"
>> ?
>>
>> Or something else?
I meant: Setting of gid_map with privilege has been clarified to mean
that dropping groups is ok. This allows existing programs that set
gid_map with privilege to work without changes. That is newgidmap
continues to work unchanged.
>>> To prevent this problem and future problems user namespaces were
>>> changed in such a way as to guarantee a user can not obtain
>>> credentials without privilege they could not obtain without the
>>> help of user namespaces.
>>>
>>> This meant testing the effective user ID and not the filesystem user
>>> ID as setresuid and setregid allow setting any process uid or gid
>>> (except the supplemental groups) to the effective ID.
>>>
>>> Furthermore to preserve in some form the useful applications that have
>>> been setting gid_map without privilege the file /proc/[pid]/setgroups
>>> was added to allow disabling setgroups. With the setgroups system
>>> call permanently disabled in a user namespace it again becomes safe to
>>> allow writes to gid_map without privilege.
>>>
>>> Here is my meager attempt to update user_namespaces.7 to reflect these
>>> issues.
>>
>> It looked pretty serviceable as patch, IMO. So, thanks again. I've applied,
>> tweaking some wordings afterward, but changing nothing essential. See below
>> for a question.
>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>> man7/user_namespaces.7 | 52 +++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 49 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7
>>> index d76721d9a0a1..f8333a762308 100644
>>> --- a/man7/user_namespaces.7
>>> +++ b/man7/user_namespaces.7
>>> @@ -533,11 +533,16 @@ One of the following is true:
>>> The data written to
>>> .I uid_map
>>> .RI ( gid_map )
>>> -consists of a single line that maps the writing process's filesystem user ID
>>> +consists of a single line that maps the writing process's effective user ID
>>> (group ID) in the parent user namespace to a user ID (group ID)
>>> in the user namespace.
>>> -The usual case here is that this single line provides a mapping for user ID
>>> -of the process that created the namespace.
>>> +The writing process must have the same effective user ID as the process
>>> +that created the user namespace.
>>> +In the case of
>>> +.I gid_map
>>> +the
>>> +.I setgroups
>>> +file must have been written to earlier and disabled the setgroups system call.
>>> .IP * 3
>>> The opening process has the
>>> .BR CAP_SETUID
>>> @@ -552,6 +557,47 @@ Writes that violate the above rules fail with the error
>>> .\"
>>> .\" ============================================================
>>> .\"
>>> +.SS Interaction with system calls that change the uid or gid values
>>> +When in a user namespace where the
>>> +.I uid_map
>>> +or
>>> +.I gid_map
>>> +file has not been written the system calls that change user IDs
>>> +or group IDs respectively will fail. After the
>>> +.I uid_map
>>> +and
>>> +.I gid_map
>>> +file have been written only the mapped values may be used in
>>> +system calls that change user IDs and group IDs.
>>> +
>>> +For user IDs these system calls include
>>> +.BR setuid ,
>>> +.BR setfsuid ,
>>> +.BR setreuid ,
>>> +and
>>> +.BR setresuid .
>>> +
>>> +For group IDs these system calls include
>>> +.BR setgid ,
>>> +.BR setfsgid ,
>>> +.BR setregid ,
>>> +.BR setresgid ,
>>> +and
>>> +.BR setgroups.
>>> +
>>> +Writing
>>> +.BR deny
>>> +to the
>>> +.I /proc/[pid]/setgroups
>>> +file before writing to
>>> +.I /proc/[pid]/gid_map
>>> +will permanently disable the setgroups system call in a user namespace
>>> +and allow writing to
>>> +.I /proc/[pid]/gid_map
>>> +without
>>> +.BR CAP_SETGID
>>> +in the parent user namespace.
>>
>> I just want to double check: you really did mean to write "*parent* namespace"
>> above, right?
Yes. At this point only privilege in the *parent* user namespace is
meaningful, as applications in the new user namespace have all
privileges.
Eric
^ permalink raw reply
* Re: [Patch v4] firmware: dmi-sysfs: add SMBIOS entry point area attribute
From: Matt Fleming @ 2015-02-11 14:17 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: Grant Likely, Ard Biesheuvel, linux-kernel@vger.kernel.org,
linux-api, linux-doc, Leif Lindholm, Mark Salter
In-Reply-To: <54D9D4B0.7020803@linaro.org>
On Tue, 10 Feb, at 11:51:44AM, Ivan Khoronzhuk wrote:
>
> If you are Ok with this patch, could you please pickup it?
Applied, thanks Ivan!
--
Matt Fleming, Intel Open Source Technology Center
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-02-11 14:36 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <87twytklkv.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
Hey,
On Wed, Feb 11, 2015 at 12:29:20AM -0600, Eric W. Biederman wrote:
> In general namespaces are not necessary if your scope of names
> already has hierarchy. Which means that new interfaces can almost
> always be designed in such a way that you can support containers without
> needing to add any special namespace support. Which typically results
> in more flexible and useful APIs for everyone, with no real code cost.
Sure, and cgroup ns support isn't doing anything weird there. Just
bind mounting a subhierarchy is enough for the core features. The ns
part is dealing with things which can't easily be tied to such
hierarchical scoping like path reported under through proc and even
handling that can be achieved by, for example, marking delegation
points in cgroup proper and forcing tasks beyond that point to
consider that as its origin when determining the path to report.
However, note that something like that is inherently similar to what's
being provided by other namespaces. It is true that it can be
implemented outside namespace facility proper but that doesn't
automatically make that the right choice and it's more likely to be
worse - we'd be introducing a different way to perform about the same
thing.
So, the argument that adding namespace interface except for backward
compatibility doesn't seem to hold water. Like it or not, namespace
is serving as a platform for certain type of features and we'd be
foolish to not to consider putting a related feature together there
and I fail to see a valid technical argument as of yet.
> Further in the cgroup namespace patchset I looked at a while ago, the
> only reason for having a cgroup namespace was to provide a measure of
> backwards compatibility with existing userspace. I expect removing the
> /proc/<pid>/cgroup file and replacing it with something in cgroupfs
> itself would serve just as well if backwards compatibility is not the
> objective. Or possibly replacincg /proc/<pid>/cgroup into a magic
> symlink onto somewhere in the unified cgroupfs itself.
No matter what we do, we'd still need to mark the delegation point
somehow; otherwise, there's no way to produce a scoped identifier.
This isn't really about backward compatibility but rather the feature
to scope a subhierarcy properly.
> I just don't see any point in doing weird silly namespace things to keep
> existing userspace working when the existing userspace won't work.
If it's too different from existing namespaces, sure, doing something
is definitely an option but is it?
> As such if a namespace doesn't implement compatibility with the existing
> userspace it gets my nack.
Hmmm.... I don't think making the proposed NS support to work across
all hierarchies including the traditional multiple ones would be too
difficult. That should work then, right?
Thanks.
--
tejun
^ permalink raw reply
* Re: [Patch v4] firmware: dmi-sysfs: add SMBIOS entry point area attribute
From: Matt Fleming @ 2015-02-11 14:43 UTC (permalink / raw)
To: Ivan Khoronzhuk
Cc: Grant Likely, Ard Biesheuvel, linux-kernel@vger.kernel.org,
linux-api, linux-doc, Leif Lindholm, Mark Salter
In-Reply-To: <20150211141703.GA4665@codeblueprint.co.uk>
On Wed, 11 Feb, at 02:17:03PM, Matt Fleming wrote:
> On Tue, 10 Feb, at 11:51:44AM, Ivan Khoronzhuk wrote:
> >
> > If you are Ok with this patch, could you please pickup it?
>
> Applied, thanks Ivan!
Btw this patch doesn't apply cleanly, the reject looks like this,
--- drivers/firmware/dmi_scan.c
+++ drivers/firmware/dmi_scan.c
@@ -537,6 +543,8 @@
dmi_ver &= 0xFFFFFF;
dmi_len = get_unaligned_le32(buf + 12);
dmi_base = get_unaligned_le64(buf + 16);
+ smbios_header_size = buf[6];
+ memcpy(smbios_header, buf, smbios_header_size);
/*
* The 64-bit SMBIOS 3.0 entry point no longer has a field
What version of the kernel did you base this patch on? The conflict is
trivial to fixup and I've done so and pushed it out on the EFI 'next'
branch, but I wanted to call out this conflict explicitly.
--
Matt Fleming, Intel Open Source Technology Center
^ permalink raw reply
* [PATCH RFC 1/6] fs: new interface and behavior for file project id
From: Konstantin Khlebnikov @ 2015-02-11 15:11 UTC (permalink / raw)
To: Linux FS Devel, linux-ext4, linux-kernel
Cc: Jan Kara, Linux API, containers, Dave Chinner, Andy Lutomirski,
Christoph Hellwig, Dmitry Monakhov, Eric W. Biederman, Li Xi,
Theodore Ts'o, Al Viro
For now project id and quotas are implemented only in XFS.
Existing behavior isn't very useful: any unprivileged user can set any
project id for its own files and this way he can bypass project limits.
XFS interface for getting or changing file project is a very XFS-centric:
ioctl XFS_IOC_FSGET/SETXATTR with structure (struct fsxattr) as a argument
which has three unrelated fields and twelve reserved padding bytes.
Idea of keeping XFS-compatible interface seems overpriced. Old tools checks
filesystem name/magic thus without update they anyway will work only for XFS.
This patch defines common interface and new behavior.
Depending on sysctl fs.protected_projects = 0|1 projects works as:
0 = XFS-compatible projects
- changing project id could be performed only from init user-ns
- file owner or task with CAP_FOWNER can set any project id
- changing user-ns project-id mapping allowed for everybody
- cross-project hardlinks and renaming are forbidden (-EXDEV)
- new inodes inherits project id from directory if flag
XFS_DIFLAG_PROJINHERIT is set for directory inode
1 = Protected projects
- changing project id requires CAP_SYS_RESOURCE in current user-ns
- changing project id mapping require CAP_SYS_RESOURCE in parent user-ns
- cross-project hardlinks and renaming are permitted if current task has
CAP_SYS_RESOURCE in current user-namespace or if directory project is
mapped to zero in current user-namespace.
- new inodes always inherits project id from directory
Now project id is more sticky and cross-project sharing is more flexible.
User-namespace project mapping defines set of project ids which could be
used inside, if it's empty then container cannot change project id at all.
CONFIG_PROTECTED_PROJECTS_BY_DEFAULT defines default value for sysctl.
This patch adds two new fcntls:
int fcntl(fd, F_GET_PROJECT, projid_t *);
int fcntl(fd, F_SET_PROJECT, projid_t);
Permissions:
F_GET_PROJECT is permitted for everybody but if file project isn't mapped
into current user-namespace -EACCESS will be returned.
F_SET_PROJECT: depending on state of sysctl fs.protected_projects allowed
either for file owner and CAP_FOWNER or requires capability CAP_SYS_RESOURCE.
Error codes:
EINVAL - not implemented in this kernel
EPERM - not permitted/supported by this filesystem type
ENOTSUPP - not supported for this filesystem instance (no feature at sb)
EACCES - not enough permissions or project id isn't mapped
Project id is stored in fs-specific inode and exposed via couple super-block
operations: get_projid / set_projid. This have to be sb-operations because
dquot_initialize() could be called before setting inode->i_op.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
Documentation/filesystems/Locking | 4 ++
Documentation/filesystems/vfs.txt | 10 ++++++
fs/fcntl.c | 65 +++++++++++++++++++++++++++++++++++++
fs/quota/Kconfig | 9 +++++
include/linux/fs.h | 4 ++
include/linux/projid.h | 4 ++
include/uapi/linux/fcntl.h | 6 +++
kernel/capability.c | 62 +++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 9 +++++
kernel/user_namespace.c | 4 +-
10 files changed, 175 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753c..649e404 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -125,6 +125,8 @@ prototypes:
int (*show_options)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+ int (*get_projid) (struct inode *, kprojid_t *);
+ int (*set_projid) (struct inode *, kprojid_t);
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
locking rules:
@@ -147,6 +149,8 @@ show_options: no (namespace_sem)
quota_read: no (see below)
quota_write: no (see below)
bdev_try_to_free_page: no (see below)
+get_projid no (maybe i_mutex)
+set_projid no (i_mutex)
->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 43ce050..c25b3ee 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -228,6 +228,10 @@ struct super_operations {
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+
+ int (*get_projid) (struct inode *, kprojid_t *);
+ int (*set_projid) (struct inode *, kprojid_t);
+
int (*nr_cached_objects)(struct super_block *);
void (*free_cached_objects)(struct super_block *, int);
};
@@ -319,6 +323,12 @@ or bottom half).
implementations will cause holdoff problems due to large scan batch
sizes.
+ get_projid: called by the VFS and quota to get project id of a inode.
+ This method is called by fcntl() and project quota management.
+
+ set_projid: called by the VFS to set project if of a inode.
+ This method is called by fcntl() with i_mutex locked.
+
Whoever sets up the inode is responsible for filling in the "i_op" field. This
is a pointer to a "struct inode_operations" which describes the methods that
can be performed on individual inodes.
diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..c89df0e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -9,6 +9,7 @@
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/mount.h>
#include <linux/fdtable.h>
#include <linux/capability.h>
#include <linux/dnotify.h>
@@ -240,6 +241,62 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
}
#endif
+static int fcntl_get_project(struct file *file, projid_t __user *arg)
+{
+ struct inode *inode = file_inode(file);
+ struct super_block *sb = inode->i_sb;
+ kprojid_t kprojid;
+ projid_t projid;
+ int err;
+
+ if (!sb->s_op->get_projid)
+ return -EPERM;
+
+ err = sb->s_op->get_projid(inode, &kprojid);
+ if (err)
+ return err;
+
+ projid = from_kprojid(current_user_ns(), kprojid);
+ if (projid == (projid_t)-1)
+ return -EACCES;
+
+ return put_user(projid, arg);
+}
+
+static int fcntl_set_project(struct file *file, projid_t projid)
+{
+ struct user_namespace *ns = current_user_ns();
+ struct inode *inode = file_inode(file);
+ struct super_block *sb = inode->i_sb;
+ kprojid_t old_kprojid, kprojid;
+ int err;
+
+ if (!sb->s_op->get_projid || !sb->s_op->set_projid)
+ return -EPERM;
+
+ kprojid = make_kprojid(ns, projid);
+ if (!projid_valid(kprojid))
+ return -EACCES;
+
+ err = mnt_want_write_file(file);
+ if (err)
+ return err;
+
+ mutex_lock(&inode->i_mutex);
+ err = sb->s_op->get_projid(inode, &old_kprojid);
+ if (!err) {
+ if (capable_set_inode_project(inode, old_kprojid, kprojid))
+ err = sb->s_op->set_projid(inode, kprojid);
+ else
+ err = -EACCES;
+ }
+ mutex_unlock(&inode->i_mutex);
+
+ mnt_drop_write_file(file);
+
+ return err;
+}
+
static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
{
@@ -334,6 +391,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_GET_SEALS:
err = shmem_fcntl(filp, cmd, arg);
break;
+ case F_GET_PROJECT:
+ err = fcntl_get_project(filp, (projid_t __user *) arg);
+ break;
+ case F_SET_PROJECT:
+ err = fcntl_set_project(filp, (projid_t) arg);
+ break;
default:
break;
}
@@ -348,6 +411,8 @@ static int check_fcntl_cmd(unsigned cmd)
case F_GETFD:
case F_SETFD:
case F_GETFL:
+ case F_GET_PROJECT:
+ case F_SET_PROJECT:
return 1;
}
return 0;
diff --git a/fs/quota/Kconfig b/fs/quota/Kconfig
index 4a09975..b38f881 100644
--- a/fs/quota/Kconfig
+++ b/fs/quota/Kconfig
@@ -74,3 +74,12 @@ config QUOTACTL_COMPAT
bool
depends on QUOTACTL && COMPAT_FOR_U64_ALIGNMENT
default y
+
+config PROTECTED_PROJECTS_ENABLED_BY_DEFAULT
+ bool "Protected projects by default"
+ default n
+ help
+ This option defines default value for sysctl fs.protected_projects.
+ Say N if you need XFS-compatible mode when file owner could set any
+ project id. If you need reliable project disk quotas say Y here:
+ in this mode changing project requires capability CAP_SYS_RESOURCE.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f125b88..f6faf22 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
#include <linux/shrinker.h>
#include <linux/migrate_mode.h>
#include <linux/uidgid.h>
+#include <linux/projid.h>
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
@@ -62,6 +63,7 @@ extern struct inodes_stat_t inodes_stat;
extern int leases_enable, lease_break_time;
extern int sysctl_protected_symlinks;
extern int sysctl_protected_hardlinks;
+extern int sysctl_protected_projects;
struct buffer_head;
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
@@ -1636,6 +1638,8 @@ struct super_operations {
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
long (*nr_cached_objects)(struct super_block *, int);
long (*free_cached_objects)(struct super_block *, long, int);
+ int (*get_projid)(struct inode *, kprojid_t *);
+ int (*set_projid)(struct inode *, kprojid_t);
};
/*
diff --git a/include/linux/projid.h b/include/linux/projid.h
index 8c1f2c5..410b509 100644
--- a/include/linux/projid.h
+++ b/include/linux/projid.h
@@ -86,4 +86,8 @@ static inline bool kprojid_has_mapping(struct user_namespace *ns, kprojid_t proj
#endif /* CONFIG_USER_NS */
+bool capable_set_inode_project(const struct inode *inode,
+ kprojid_t old_kprojid, kprojid_t kprojid);
+bool capable_mix_inode_project(kprojid_t dir_kprojid, kprojid_t kprojid);
+
#endif /* _LINUX_PROJID_H */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index beed138..92791d0 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -34,6 +34,12 @@
#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
/*
+ * Get/Set project id
+ */
+#define F_GET_PROJECT (F_LINUX_SPECIFIC_BASE + 11)
+#define F_SET_PROJECT (F_LINUX_SPECIFIC_BASE + 12)
+
+/*
* Types of seals
*/
#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
diff --git a/kernel/capability.c b/kernel/capability.c
index 989f5bf..cd67ef4 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -444,3 +444,65 @@ bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
kgid_has_mapping(ns, inode->i_gid);
}
EXPORT_SYMBOL(capable_wrt_inode_uidgid);
+
+int sysctl_protected_projects =
+ IS_ENABLED(CONFIG_PROTECTED_PROJECTS_ENABLED_BY_DEFAULT);
+
+/**
+ * capable_set_inode_project - Check restrictions for changing project id
+ * @inode: The inode in question
+ * @old_kprojid: current project id
+ * @kprojid: target project id
+ *
+ * Returns true if current task can set new project id for inode:
+ * In XFS-compatible mode (sysctl fs.protected_projects = 0) this is permitted
+ * only in init user namespace if current user owns file or task has CAP_FOWNER.
+ * If sysctl fs.protected_projects = 1 then tasks must have CAP_SYS_RESOURCE in
+ * current user-namespace and both projects must be mapped into this namespace.
+ */
+bool capable_set_inode_project(const struct inode *inode,
+ kprojid_t old_kprojid, kprojid_t kprojid)
+{
+ struct user_namespace *ns = current_user_ns();
+
+ /* In XFS-compat mode file owner can set any project id */
+ if (!sysctl_protected_projects)
+ return ns == &init_user_ns && inode_owner_or_capable(inode);
+
+ return ns_capable(ns, CAP_SYS_RESOURCE) &&
+ kprojid_has_mapping(ns, old_kprojid) &&
+ kprojid_has_mapping(ns, kprojid);
+}
+EXPORT_SYMBOL(capable_set_inode_project);
+
+/**
+ * capable_mix_inode_project - Check project id restrictions for link/rename
+ * @kprojid: inode project id
+ * @dir_kprojid: directory project id
+ *
+ * Returns true if current task can link/rename inode into given directory:
+ * In XFS-compatible mode operation is permitted only if projects are match.
+ * If fs.protected_projects is set then it's permitted also if directory
+ * project is mapped to zero or if task has capability CAP_SYS_RESOURCE.
+ */
+bool capable_mix_inode_project(kprojid_t dir_kprojid, kprojid_t kprojid)
+{
+ struct user_namespace *ns;
+ projid_t dir_projid;
+
+ if (projid_eq(dir_kprojid, kprojid))
+ return true;
+
+ if (!sysctl_protected_projects)
+ return false;
+
+ ns = current_user_ns();
+ if (!kprojid_has_mapping(ns, kprojid))
+ return false;
+
+ dir_projid = from_kprojid(ns, dir_kprojid);
+ return dir_projid == (projid_t)0 ||
+ (dir_projid != (projid_t)-1 &&
+ ns_capable(ns, CAP_SYS_RESOURCE));
+}
+EXPORT_SYMBOL(capable_mix_inode_project);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6..cb6f9fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1649,6 +1649,15 @@ static struct ctl_table fs_table[] = {
.extra2 = &one,
},
{
+ .procname = "protected_projects",
+ .data = &sysctl_protected_projects,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {
.procname = "suid_dumpable",
.data = &suid_dumpable,
.maxlen = sizeof(int),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4109f83..88f6619 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -807,8 +807,8 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
if ((seq_ns != ns) && (seq_ns != ns->parent))
return -EPERM;
- /* Anyone can set any valid project id no capability needed */
- return map_write(file, buf, size, ppos, -1,
+ return map_write(file, buf, size, ppos,
+ sysctl_protected_projects ? CAP_SYS_RESOURCE : -1,
&ns->projid_map, &ns->parent->projid_map);
}
^ permalink raw reply related
* [PATCH RFC 2/6] quota: adds generic code for enforcing project quota limits
From: Konstantin Khlebnikov @ 2015-02-11 15:11 UTC (permalink / raw)
To: Linux FS Devel, linux-ext4, linux-kernel
Cc: Jan Kara, Linux API, containers, Dave Chinner, Andy Lutomirski,
Christoph Hellwig, Dmitry Monakhov, Eric W. Biederman, Li Xi,
Theodore Ts'o, Al Viro
In-Reply-To: <20150211151146.6717.62017.stgit@buzz>
This patch adds support for a new quota type PRJQUOTA for project quota.
[ Based on patch by Li Xi <lixi@ddn.com> ]
Permissions:
Q_GETQUOTA: allows to query all projects present in current user-namespace
Q_SETQUOTA: requires system-wide capability CAP_SYS_ADMIN
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Li Xi <lixi@ddn.com>
---
fs/quota/dquot.c | 37 +++++++++++++++++++++++++++++++++++--
fs/quota/quota.c | 8 ++++++--
fs/quota/quotaio_v2.h | 6 ++++--
include/linux/quota.h | 1 +
include/linux/quotaops.h | 1 +
include/uapi/linux/quota.h | 6 ++++--
6 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 0ccd4ba..afa5f67 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1159,8 +1159,8 @@ static int need_print_warning(struct dquot_warn *warn)
return uid_eq(current_fsuid(), warn->w_dq_id.uid);
case GRPQUOTA:
return in_group_p(warn->w_dq_id.gid);
- case PRJQUOTA: /* Never taken... Just make gcc happy */
- return 0;
+ case PRJQUOTA:
+ return 1;
}
return 0;
}
@@ -1399,6 +1399,8 @@ static void __dquot_initialize(struct inode *inode, int type)
/* First get references to structures we might need. */
for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
struct kqid qid;
+ kprojid_t projid;
+
got[cnt] = NULL;
if (type != -1 && cnt != type)
continue;
@@ -1409,6 +1411,10 @@ static void __dquot_initialize(struct inode *inode, int type)
*/
if (i_dquot(inode)[cnt])
continue;
+
+ if (!sb_has_quota_active(sb, cnt))
+ continue;
+
init_needed = 1;
switch (cnt) {
@@ -1418,6 +1424,13 @@ static void __dquot_initialize(struct inode *inode, int type)
case GRPQUOTA:
qid = make_kqid_gid(inode->i_gid);
break;
+ case PRJQUOTA:
+ if (!sb->s_op->get_projid)
+ continue;
+ if (sb->s_op->get_projid(inode, &projid))
+ continue;
+ qid = make_kqid_projid(projid);
+ break;
}
got[cnt] = dqget(sb, qid);
}
@@ -1951,6 +1964,22 @@ int dquot_transfer(struct inode *inode, struct iattr *iattr)
EXPORT_SYMBOL(dquot_transfer);
/*
+ * Helper function for transferring inode into another project.
+ */
+int dquot_transfer_project(struct inode *inode, kprojid_t projid)
+{
+ struct dquot *transfer_to[MAXQUOTAS] = {};
+ struct super_block *sb = inode->i_sb;
+ int ret;
+
+ transfer_to[PRJQUOTA] = dqget(sb, make_kqid_projid(projid));
+ ret = __dquot_transfer(inode, transfer_to);
+ dqput_all(transfer_to);
+ return ret;
+}
+EXPORT_SYMBOL(dquot_transfer_project);
+
+/*
* Write info of quota file to disk
*/
int dquot_commit_info(struct super_block *sb, int type)
@@ -2165,6 +2194,10 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
error = -EINVAL;
goto out_fmt;
}
+ if (type == PRJQUOTA && !sb->s_op->get_projid) {
+ error = -EINVAL;
+ goto out_fmt;
+ }
/* Usage always has to be set... */
if (!(flags & DQUOT_USAGE_ENABLED)) {
error = -EINVAL;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index d14a799..0acd1bb 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -30,11 +30,15 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
case Q_XGETQSTATV:
case Q_XQUOTASYNC:
break;
- /* allow to query information for dquots we "own" */
+ /*
+ * Allow to query information for user/group dquots we "own".
+ * Allow querying project quota present in our user-namespace.
+ */
case Q_GETQUOTA:
case Q_XGETQUOTA:
if ((type == USRQUOTA && uid_eq(current_euid(), make_kuid(current_user_ns(), id))) ||
- (type == GRPQUOTA && in_egroup_p(make_kgid(current_user_ns(), id))))
+ (type == GRPQUOTA && in_egroup_p(make_kgid(current_user_ns(), id))) ||
+ (type == PRJQUOTA && projid_valid(make_kprojid(current_user_ns(), id))))
break;
/*FALLTHROUGH*/
default:
diff --git a/fs/quota/quotaio_v2.h b/fs/quota/quotaio_v2.h
index f1966b4..4e95430 100644
--- a/fs/quota/quotaio_v2.h
+++ b/fs/quota/quotaio_v2.h
@@ -13,12 +13,14 @@
*/
#define V2_INITQMAGICS {\
0xd9c01f11, /* USRQUOTA */\
- 0xd9c01927 /* GRPQUOTA */\
+ 0xd9c01927, /* GRPQUOTA */\
+ 0xd9c03f14, /* PRJQUOTA */\
}
#define V2_INITQVERSIONS {\
1, /* USRQUOTA */\
- 1 /* GRPQUOTA */\
+ 1, /* GRPQUOTA */\
+ 1, /* PRJQUOTA */\
}
/* First generic header */
diff --git a/include/linux/quota.h b/include/linux/quota.h
index d534e8e..8bad159 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -50,6 +50,7 @@
#undef USRQUOTA
#undef GRPQUOTA
+#undef PRJQUOTA
enum quota_type {
USRQUOTA = 0, /* element used for user quotas */
GRPQUOTA = 1, /* element used for group quotas */
diff --git a/include/linux/quotaops.h b/include/linux/quotaops.h
index df73258..ba54745 100644
--- a/include/linux/quotaops.h
+++ b/include/linux/quotaops.h
@@ -104,6 +104,7 @@ int dquot_set_dqblk(struct super_block *sb, struct kqid id,
int __dquot_transfer(struct inode *inode, struct dquot **transfer_to);
int dquot_transfer(struct inode *inode, struct iattr *iattr);
+int dquot_transfer_project(struct inode *inode, kprojid_t projid);
static inline struct mem_dqinfo *sb_dqinfo(struct super_block *sb, int type)
{
diff --git a/include/uapi/linux/quota.h b/include/uapi/linux/quota.h
index 1f49b83..9c95b2c 100644
--- a/include/uapi/linux/quota.h
+++ b/include/uapi/linux/quota.h
@@ -36,11 +36,12 @@
#include <linux/errno.h>
#include <linux/types.h>
-#define __DQUOT_VERSION__ "dquot_6.5.2"
+#define __DQUOT_VERSION__ "dquot_6.6.0"
-#define MAXQUOTAS 2
+#define MAXQUOTAS 3
#define USRQUOTA 0 /* element used for user quotas */
#define GRPQUOTA 1 /* element used for group quotas */
+#define PRJQUOTA 2 /* element used for project quotas */
/*
* Definitions for the default names of the quotas files.
@@ -48,6 +49,7 @@
#define INITQFNAMES { \
"user", /* USRQUOTA */ \
"group", /* GRPQUOTA */ \
+ "project", /* PRJQUOTA */ \
"undefined", \
};
^ permalink raw reply related
* [PATCH RFC 3/6] quota: mangle statfs result according to project quota usage and limits
From: Konstantin Khlebnikov @ 2015-02-11 15:11 UTC (permalink / raw)
To: Linux FS Devel, linux-ext4, linux-kernel
Cc: Jan Kara, Linux API, containers, Dave Chinner, Andy Lutomirski,
Christoph Hellwig, Dmitry Monakhov, Eric W. Biederman, Li Xi,
Theodore Ts'o, Al Viro
In-Reply-To: <20150211151146.6717.62017.stgit@buzz>
This patch updates amount of available space and inodes returned by
vfs_statfs() according to project quota for this directory.
This works only if sysctl fs.protected_projects = 1.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
fs/quota/dquot.c | 56 ++++++++++++++++++++++++++++++++++++++++++++--
fs/statfs.c | 5 +++-
include/linux/quotaops.h | 5 ++++
3 files changed, 63 insertions(+), 3 deletions(-)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index afa5f67..e3dfac8 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -65,6 +65,7 @@
#include <linux/stat.h>
#include <linux/tty.h>
#include <linux/file.h>
+#include <linux/statfs.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/init.h>
@@ -1242,15 +1243,20 @@ static void flush_warnings(struct dquot_warn *warn)
}
}
-static int ignore_hardlimit(struct dquot *dquot)
+static bool sb_ignore_hardlimit(struct super_block *sb, int type)
{
- struct mem_dqinfo *info = &sb_dqopt(dquot->dq_sb)->info[dquot->dq_id.type];
+ struct mem_dqinfo *info = sb_dqinfo(sb, type);
return capable(CAP_SYS_RESOURCE) &&
(info->dqi_format->qf_fmt_id != QFMT_VFS_OLD ||
!(info->dqi_flags & DQF_ROOT_SQUASH));
}
+static bool ignore_hardlimit(struct dquot *dquot)
+{
+ return sb_ignore_hardlimit(dquot->dq_sb, dquot->dq_id.type);
+}
+
/* needs dq_data_lock */
static int check_idq(struct dquot *dquot, qsize_t inodes,
struct dquot_warn *warn)
@@ -2022,6 +2028,52 @@ int dquot_file_open(struct inode *inode, struct file *file)
}
EXPORT_SYMBOL(dquot_file_open);
+void dquot_mangle_statfs(const struct path *path, struct kstatfs *buf)
+{
+ struct inode *inode = path->dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
+ u64 blimit = 0, ilimit = 0, busage, iusage;
+ struct dquot **dquots;
+
+ if (!sysctl_protected_projects ||
+ !sb_has_quota_limits_enabled(sb, PRJQUOTA) ||
+ sb_ignore_hardlimit(sb, PRJQUOTA))
+ return;
+
+ dquots = i_dquot(inode);
+ if (!dquots[PRJQUOTA])
+ return;
+
+ spin_lock(&dq_data_lock);
+ if (dquots[PRJQUOTA]) {
+ struct mem_dqblk *dm = &(dquots[PRJQUOTA]->dq_dqb);
+
+ blimit = dm->dqb_bsoftlimit ?: dm->dqb_bhardlimit;
+ busage = dm->dqb_curspace + dm->dqb_rsvspace;
+ ilimit = dm->dqb_isoftlimit ?: dm->dqb_ihardlimit;
+ iusage = dm->dqb_curinodes;
+ }
+ spin_unlock(&dq_data_lock);
+
+ if (blimit) {
+ u64 bavail = (blimit <= busage) ? 0 :
+ div64_long(blimit - busage, buf->f_bsize);
+
+ if (buf->f_bavail > bavail)
+ buf->f_bavail = bavail;
+ }
+
+ if (ilimit) {
+ u64 iavail = (ilimit <= iusage) ? 0 : (ilimit - iusage);
+
+ if (buf->f_ffree > iavail) {
+ /* Cut count of "free" inodes but keep "used" */
+ buf->f_files -= buf->f_ffree - iavail;
+ buf->f_ffree = iavail;
+ }
+ }
+}
+
/*
* Turn quota off on a device. type == -1 ==> quotaoff for all types (umount)
*/
diff --git a/fs/statfs.c b/fs/statfs.c
index 083dc0a..3fa6192 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -7,6 +7,7 @@
#include <linux/statfs.h>
#include <linux/security.h>
#include <linux/uaccess.h>
+#include <linux/quotaops.h>
#include "internal.h"
static int flags_by_mnt(int mnt_flags)
@@ -68,8 +69,10 @@ int vfs_statfs(struct path *path, struct kstatfs *buf)
int error;
error = statfs_by_dentry(path->dentry, buf);
- if (!error)
+ if (!error) {
buf->f_flags = calculate_f_flags(path->mnt);
+ dquot_mangle_statfs(path, buf);
+ }
return error;
}
EXPORT_SYMBOL(vfs_statfs);
diff --git a/include/linux/quotaops.h b/include/linux/quotaops.h
index ba54745..42fe4d9 100644
--- a/include/linux/quotaops.h
+++ b/include/linux/quotaops.h
@@ -85,6 +85,7 @@ int dquot_commit_info(struct super_block *sb, int type);
int dquot_mark_dquot_dirty(struct dquot *dquot);
int dquot_file_open(struct inode *inode, struct file *file);
+void dquot_mangle_statfs(const struct path *path, struct kstatfs *buf);
int dquot_enable(struct inode *inode, int type, int format_id,
unsigned int flags);
@@ -275,6 +276,10 @@ static inline int dquot_resume(struct super_block *sb, int type)
#define dquot_file_open generic_file_open
+static inline void dquot_mangle_statfs(struct path *path, struct kstatfs *buf)
+{
+}
+
static inline int dquot_writeback_dquots(struct super_block *sb, int type)
{
return 0;
^ permalink raw reply related
* [PATCH RFC 4/6] ext4: add project id support
From: Konstantin Khlebnikov @ 2015-02-11 15:11 UTC (permalink / raw)
To: Linux FS Devel, linux-ext4, linux-kernel
Cc: Jan Kara, Linux API, containers, Dave Chinner, Andy Lutomirski,
Christoph Hellwig, Dmitry Monakhov, Eric W. Biederman, Li Xi,
Theodore Ts'o, Al Viro
In-Reply-To: <20150211151146.6717.62017.stgit@buzz>
This patch adds a new internal field of ext4 inode to save project identifier.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
fs/ext4/ext4.h | 13 ++++++++++-
fs/ext4/ialloc.c | 6 +++++
fs/ext4/inode.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/namei.c | 14 ++++++++++++
fs/ext4/super.c | 2 ++
5 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a75fba6..a3fdbb5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -683,6 +683,7 @@ struct ext4_inode {
__le32 i_crtime; /* File Creation time */
__le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
__le32 i_version_hi; /* high 32 bits for 64-bit version */
+ __le32 i_projid; /* Project ID */
};
struct move_extent {
@@ -938,6 +939,7 @@ struct ext4_inode_info {
/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
__u32 i_csum_seed;
+ kprojid_t i_projid;
};
/*
@@ -1522,6 +1524,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
* GDT_CSUM bits are mutually exclusive.
*/
#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400
+#define EXT4_FEATURE_RO_COMPAT_PROJECT 0x1000 /* Project ID */
#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
@@ -1571,7 +1574,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
EXT4_FEATURE_RO_COMPAT_BIGALLOC |\
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM|\
- EXT4_FEATURE_RO_COMPAT_QUOTA)
+ EXT4_FEATURE_RO_COMPAT_QUOTA |\
+ EXT4_FEATURE_RO_COMPAT_PROJECT)
/*
* Default values for user and/or group using reserved blocks
@@ -1579,6 +1583,11 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_DEF_RESUID 0
#define EXT4_DEF_RESGID 0
+/*
+ * Default project ID
+ */
+#define EXT4_DEF_PROJID 0
+
#define EXT4_DEF_INODE_READAHEAD_BLKS 32
/*
@@ -2131,6 +2140,8 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
+extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
+extern int ext4_set_projid(struct inode *inode, kprojid_t projid);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ac644c3..d81a30d 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -756,6 +756,12 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
inode->i_gid = dir->i_gid;
} else
inode_init_owner(inode, dir, mode);
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+ ei->i_projid = EXT4_I(dir)->i_projid;
+ else
+ ei->i_projid = KPROJIDT_INIT(EXT4_DEF_PROJID);
+
dquot_initialize(inode);
if (!goal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5653fa4..0ae2c39 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3863,6 +3863,53 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
EXT4_I(inode)->i_inline_off = 0;
}
+int ext4_get_projid(struct inode *inode, kprojid_t *projid)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+ return -EOPNOTSUPP;
+
+ *projid = EXT4_I(inode)->i_projid;
+ return 0;
+}
+
+/* Called with inode->i_mutex locked. */
+int ext4_set_projid(struct inode *inode, kprojid_t projid)
+{
+ struct super_block *sb = inode->i_sb;
+ struct ext4_inode *raw_inode;
+ struct ext4_iloc iloc;
+ handle_t *handle;
+ int err;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+ return -EOPNOTSUPP;
+
+ /* Sanity check */
+ if (EXT4_INODE_SIZE(sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
+ !EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), i_projid))
+ return -EOPNOTSUPP;
+
+ if (projid_eq(EXT4_I(inode)->i_projid, projid))
+ return 0;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (!err) {
+ inode->i_ctime = ext4_current_time(inode);
+ EXT4_I(inode)->i_projid = projid;
+ err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+ }
+ if (IS_SYNC(inode))
+ ext4_handle_sync(handle);
+ ext4_journal_stop(handle);
+
+ return err;
+}
+
struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
{
struct ext4_iloc iloc;
@@ -3874,6 +3921,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
int block;
uid_t i_uid;
gid_t i_gid;
+ projid_t i_projid;
inode = iget_locked(sb, ino);
if (!inode)
@@ -3923,12 +3971,20 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
inode->i_mode = le16_to_cpu(raw_inode->i_mode);
i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+ EXT4_FITS_IN_INODE(raw_inode, ei, i_projid))
+ i_projid = (projid_t)le32_to_cpu(raw_inode->i_projid);
+ else
+ i_projid = EXT4_DEF_PROJID;
+
if (!(test_opt(inode->i_sb, NO_UID32))) {
i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
}
i_uid_write(inode, i_uid);
i_gid_write(inode, i_gid);
+ ei->i_projid = KPROJIDT_INIT(i_projid);
set_nlink(inode, le16_to_cpu(raw_inode->i_links_count));
ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
@@ -4192,6 +4248,15 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_uid_high = 0;
raw_inode->i_gid_high = 0;
}
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+ EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)) {
+ projid_t i_projid;
+
+ i_projid = from_kprojid(&init_user_ns, ei->i_projid);
+ raw_inode->i_projid = cpu_to_le32(i_projid);
+ }
+
raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2291923..9337d81 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2938,6 +2938,10 @@ static int ext4_link(struct dentry *old_dentry,
if (inode->i_nlink >= EXT4_LINK_MAX)
return -EMLINK;
+ if (!capable_mix_inode_project(EXT4_I(dir)->i_projid,
+ EXT4_I(inode)->i_projid))
+ return -EXDEV;
+
dquot_initialize(dir);
retry:
@@ -3217,6 +3221,10 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
int credits;
u8 old_file_type;
+ if (!capable_mix_inode_project(EXT4_I(new.dir)->i_projid,
+ EXT4_I(old.inode)->i_projid))
+ return -EXDEV;
+
dquot_initialize(old.dir);
dquot_initialize(new.dir);
@@ -3395,6 +3403,12 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
u8 new_file_type;
int retval;
+ if (!capable_mix_inode_project(EXT4_I(new.dir)->i_projid,
+ EXT4_I(old.inode)->i_projid) ||
+ !capable_mix_inode_project(EXT4_I(old.dir)->i_projid,
+ EXT4_I(new.inode)->i_projid))
+ return -EXDEV;
+
dquot_initialize(old.dir);
dquot_initialize(new.dir);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ac64edb..d656269 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1103,6 +1103,8 @@ static const struct super_operations ext4_sops = {
.get_dquots = ext4_get_dquots,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .get_projid = ext4_get_projid,
+ .set_projid = ext4_set_projid,
};
static const struct export_operations ext4_export_ops = {
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox