* Re: [PATCH 01/13] kdbus: add documentation
From: Greg Kroah-Hartman @ 2015-02-04 3:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andy Lutomirski, Daniel Mack, Arnd Bergmann, Ted Ts'o,
Michael Kerrisk, Linux API, One Thousand Gnomes,
Austin S Hemmelgarn, Tom Gundersen, linux-kernel, David Herrmann,
Djalal Harouni, Johannes Stezenbach, Christoph Hellwig
In-Reply-To: <87siemgzoo.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Tue, Feb 03, 2015 at 08:47:51PM -0600, Eric W. Biederman wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
> > On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> >> Hi Andy,
> >>
> >> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
> >>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> >>
> >>>> That's right, but again - if an application wants to gather this kind of
> >>>> information about tasks it interacts with, it can do so today by looking
> >>>> at /proc or similar sources. Desktop machines do exactly that already,
> >>>> and the kernel code executed in such cases very much resembles that in
> >>>> metadata.c, and is certainly not cheaper. kdbus just makes such
> >>>> information more accessible when requested. Which information is
> >>>> collected is defined by bit-masks on both the sender and the receiver
> >>>> connection, and most applications will effectively only use a very
> >>>> limited set by default if they go through one of the more high-level
> >>>> libraries.
> >>>
> >>> I should rephrase a bit. Kdbus doesn't require use of send-time
> >>> metadata. It does, however, strongly encourage it, and it sounds like
> >>
> >> On the kernel level, kdbus just *offers* that, just like sockets offer
> >> SO_PASSCRED. On the userland level, kdbus helps applications get that
> >> information race-free, easier and faster than they would otherwise.
> >>
> >>> systemd and other major users will use send-time metadata. Once that
> >>> happens, it's ABI (even if it's purely in userspace), and changing it
> >>> is asking for security holes to pop up. So you'll be mostly stuck
> >>> with it.
> >>
> >> We know we can't break the ABI. At most, we could deprecate item types
> >> and introduce new ones, but we want to avoid that by all means of
> >> course. However, I fail to see how that is related to send time
> >> metadata, or even to kdbus in general, as all ABIs have to be kept stable.
> >
> > I should have said it differently. ABI is the wrong term -- it's more
> > of a protocol issue.
> >
> > It looks like, with the current code, the kernel will provide
> > (optional) send-time metadata, and the sd-bus library will use it.
> > The result will be that the communication protocol between clients and
> > udev, systemd, systemd-logind, g-s-d, etc, will likely involve
> > send-time metadata. This may end up being a bottleneck.
>
> A quick note on a couple of things I have seen in this conversation.
>
> - The reason for kdbus is performance.
No, that's not the only reason for kdbus, don't focus only on this. I
set out a long list of things for why we created kdbus, speed was only
one of the things. Security is also one, and the ability to gather
these attributes in an atomic and secure way is very important as
userspace wants this.
> - pipes rather than unix domain sockets are likely the standard to meet.
> If you can't equal unix domain sockets for simple things you are
> likely leaving a lot of stops in. Last I looked pipes in general were
> notiably faster than unix domain sockets.
>
> The performance numbers I saw posted up-thread were horrible. I have
> seen faster numbers across a network of machines. If your ping-pong
> latency isn't measured in nano-seconds you are probably doing
> something wrong.
It all depends on what you are passing on that "ping-pong", a real
D-Bus connection has real data and meta data that has to be sent.
Trying to make a fake benchmark number isn't going to show anything.
> - syscalls remove overhead. So since performance is kdbus's reason for existence
> let's remove some ridiculous stops, and get a fast path into the kernel.
Again, not the only reason, see my first post in this thread for
details.
thanks,
greg k-h
^ permalink raw reply
* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Calvin Owens @ 2015-02-04 3:28 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
Kirill A. Shutemov, Peter Feiner, Grant Likely,
Siddhesh Poyarekar, LKML, kernel-team-b10kYP2dOMg,
Pavel Emelyanov, Linux API
In-Reply-To: <CALCETrUufe3USocUDpkBdwx6SyGKVgVUTh4rg2H9Xn91u+6iHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Monday 02/02 at 12:16 -0800, Andy Lutomirski wrote:
> On Fri, Jan 30, 2015 at 5:58 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> > On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> >> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> >>
> >> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >> >> > > > is very useful for enumerating the files mapped into a process when
> >> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
> >> >>
> >> >> This is the main (actually only) justification for the patch, and it it
> >> >> far too thin. What does "not needed" mean. Why can't people just use
> >> >> /proc/pid/maps?
> >> >
> >> > The biggest difference is that if you do something like this:
> >> >
> >> > fd = open("/stuff", O_BLAH);
> >> > map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >> > close(fd);
> >> > unlink("/stuff");
> >> >
> >> > ...then map_files/ gives you a way to get a file descriptor for
> >> > "/stuff", which you couldn't do with /proc/pid/maps.
> >> >
> >> > It's also something of a win if you just want to see what is mapped at a
> >> > specific address, since you can just readlink() the symlink for the
> >> > address range you care about and it will go grab the appropriate VMA and
> >> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >> > is quite expensive for processes with many thousands of threads, even
> >> > without the O(N^2) issue.
> >> >
> >> > (You have to know what address range you want though, since readdir() on
> >> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >> >
> >> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >> >> > > > the ability to ptrace the process in question, so this doesn't allow
> >> >> > > > an attacker to do anything they couldn't already do before.
> >> >> > > >
> >> >> > > > Signed-off-by: Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org>
> >> >> > >
> >> >> > > Cc +linux-api@
> >> >> >
> >> >> > Looks good to me, thanks! Though I would really appreciate if someone
> >> >> > from security camp take a look as well.
> >> >>
> >> >> hm, who's that. Kees comes to mind.
> >> >>
> >> >> And reviewers' task would be a heck of a lot easier if they knew what
> >> >> /proc/pid/map_files actually does. This:
> >> >>
> >> >> akpm3:/usr/src/25> grep -r map_files Documentation
> >> >> akpm3:/usr/src/25>
> >> >>
> >> >> does not help.
> >> >>
> >> >> The 640708a2cff7f81 changelog says:
> >> >>
> >> >> : This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >> >> : symlinks one for each mapping with file, the name of a symlink is
> >> >> : "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink
> >> >> : results in a file that point exactly to the same inode as them vma's one.
> >> >> :
> >> >> : For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >> >> :
> >> >> : | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >> >> : | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >> >> : | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >> >> : | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >> >> : | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >> >>
> >> >> afacit this info is also available in /proc/pid/maps, so things
> >> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
> >> >> as restrictive as the /proc/pid/maps permissions. Is that the case?
> >> >> (Please add to changelog).
> >> >
> >> > Yes, the only difference is that you can follow the link as per above.
> >> > I'll resend with a new message explaining that and the deletion thing.
> >> >
> >> >> There's one other problem here: we're assuming that the map_files
> >> >> implementation doesn't have bugs. If it does have bugs then relaxing
> >> >> permissions like this will create new vulnerabilities. And the
> >> >> map_files implementation is surprisingly complex. Is it bug-free?
> >> >
> >> > While I was messing with it I used it a good bit and didn't see any
> >> > issues, although I didn't actively try to fuzz it or anything. I'd be
> >> > happy to write something to test hammering it in weird ways if you like.
> >> > I'm also happy to write testcases for namespaces.
> >> >
> >> > So far as security issues, as others have pointed out you can't follow
> >> > the links unless you can ptrace the process in question, which seems
> >> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >> > about the documentation, that's the same protection as /proc/N/fd/*, and
> >> > those links function in the same way.
> >>
> >> My concern here is that fd/* are connected as streams, and while that
> >> has a certain level of badness as an external-to-the-process attacker,
> >> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >> required for access to /proc/N/mem). Since these fds are the things
> >> mapped into memory on a process, writing to them is a subset of access
> >> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> > If you haven't done close() on a mmapped file, doesn't fd/* allow the
> > same access to the corresponding regions of memory? Or am I missing
> > something?
> >
>
> But if you have called close(), then you can't currently do things
> like ftruncate or ioctl on the mapped file. These things don't
> persist across execve(), but the do persist across calls to setresuid,
> etc that drop privileges. The latter part makes me a tiny bit
> nervous.
Hmm, in that scenario you would have to open() the map_files symlink,
and since you've dropped privileges that would only succeed if the user
you dropped to has permission to access that file anyway, right?
In the deleted file case it does actually allow something that used to
be impossible, but relying on open/map/close/unlink to prevent a user
from opening a file they have permission to open is just buggy in
general.
But, O_TMPFILE lets you end up in that position without the race. The
manpage says that O_TMPFILE files "can never be reached via any
pathname", which isn't strictly true since you can get them from fd/* in
proc. But if you close() after mapping it they are currently truly
inaccessible via any path, and given the language in the manpage it
seems reasonable that somebody might rely on that and be lazy with the
permissions.
I hadn't thought about O_TMPFILE thing: I'm definitely convinced now
that PTRACE_MODE_ATTACH is the right thing here. But I think having to
reopen the file saves you even if you "leak" maps of files across a call
to setresuid/etc.
> It also might be worth checking for drivers or arch code that creates
> vmas that are backed by a different struct file than the struct file
> that was mmapped in the first place.
Interesting, I'll look into this before I resend.
Thanks,
Calvin
> --Andy
^ permalink raw reply
* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Calvin Owens @ 2015-02-04 3:53 UTC (permalink / raw)
To: Austin S Hemmelgarn
Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
Kirill A. Shutemov, Peter Feiner, Grant Likely,
Siddhesh Poyarekar, LKML, kernel-team-b10kYP2dOMg,
Pavel Emelyanov, Linux API
In-Reply-To: <54CF832A.7010707-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On Monday 02/02 at 09:01 -0500, Austin S Hemmelgarn wrote:
> On 2015-01-30 20:58, Calvin Owens wrote:
> >On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >>On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> >>>On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >>>>On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>
> >>>>>On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >>>>>>On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >>>>>>>Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >>>>>>>is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >>>>>>>is very useful for enumerating the files mapped into a process when
> >>>>>>>the more verbose information in /proc/<pid>/maps is not needed.
> >>>>
> >>>>This is the main (actually only) justification for the patch, and it it
> >>>>far too thin. What does "not needed" mean. Why can't people just use
> >>>>/proc/pid/maps?
> >>>
> >>>The biggest difference is that if you do something like this:
> >>>
> >>> fd = open("/stuff", O_BLAH);
> >>> map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >>> close(fd);
> >>> unlink("/stuff");
> >>>
> >>>...then map_files/ gives you a way to get a file descriptor for
> >>>"/stuff", which you couldn't do with /proc/pid/maps.
> >>>
> >>>It's also something of a win if you just want to see what is mapped at a
> >>>specific address, since you can just readlink() the symlink for the
> >>>address range you care about and it will go grab the appropriate VMA and
> >>>give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >>>is quite expensive for processes with many thousands of threads, even
> >>>without the O(N^2) issue.
> >>>
> >>>(You have to know what address range you want though, since readdir() on
> >>>map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >>>
> >>>>>>>This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >>>>>>>removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >>>>>>>the ability to ptrace the process in question, so this doesn't allow
> >>>>>>>an attacker to do anything they couldn't already do before.
> >>>>>>>
> >>>>>>>Signed-off-by: Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org>
> >>>>>>
> >>>>>>Cc +linux-api@
> >>>>>
> >>>>>Looks good to me, thanks! Though I would really appreciate if someone
> >>>>>from security camp take a look as well.
> >>>>
> >>>>hm, who's that. Kees comes to mind.
> >>>>
> >>>>And reviewers' task would be a heck of a lot easier if they knew what
> >>>>/proc/pid/map_files actually does. This:
> >>>>
> >>>>akpm3:/usr/src/25> grep -r map_files Documentation
> >>>>akpm3:/usr/src/25>
> >>>>
> >>>>does not help.
> >>>>
> >>>>The 640708a2cff7f81 changelog says:
> >>>>
> >>>>: This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >>>>: symlinks one for each mapping with file, the name of a symlink is
> >>>>: "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink
> >>>>: results in a file that point exactly to the same inode as them vma's one.
> >>>>:
> >>>>: For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >>>>:
> >>>>: | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >>>>: | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >>>>: | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >>>>: | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >>>>: | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >>>>
> >>>>afacit this info is also available in /proc/pid/maps, so things
> >>>>shouldn't get worse if the /proc/pid/map_files permissions are at least
> >>>>as restrictive as the /proc/pid/maps permissions. Is that the case?
> >>>>(Please add to changelog).
> >>>
> >>>Yes, the only difference is that you can follow the link as per above.
> >>>I'll resend with a new message explaining that and the deletion thing.
> >>>
> >>>>There's one other problem here: we're assuming that the map_files
> >>>>implementation doesn't have bugs. If it does have bugs then relaxing
> >>>>permissions like this will create new vulnerabilities. And the
> >>>>map_files implementation is surprisingly complex. Is it bug-free?
> >>>
> >>>While I was messing with it I used it a good bit and didn't see any
> >>>issues, although I didn't actively try to fuzz it or anything. I'd be
> >>>happy to write something to test hammering it in weird ways if you like.
> >>>I'm also happy to write testcases for namespaces.
> >>>
> >>>So far as security issues, as others have pointed out you can't follow
> >>>the links unless you can ptrace the process in question, which seems
> >>>like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >>>about the documentation, that's the same protection as /proc/N/fd/*, and
> >>>those links function in the same way.
> >>
> >>My concern here is that fd/* are connected as streams, and while that
> >>has a certain level of badness as an external-to-the-process attacker,
> >>PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >>required for access to /proc/N/mem). Since these fds are the things
> >>mapped into memory on a process, writing to them is a subset of access
> >>to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> >If you haven't done close() on a mmapped file, doesn't fd/* allow the
> >same access to the corresponding regions of memory? Or am I missing
> >something?
> >
> >But that said, I can't think of any reason making it MODE_ATTACH would
> >be a problem. Would you rather that be enforced on follow_link() like
> >the original patch did, or enforce it for the whole directory?
> >
> >
> Whole directory would probably be better, as even just the mapped
> ranges could be considered sensitive information.
You can already get the ranges that are mapped from /proc/N/maps with
PTRACE_MODE_READ, so that part isn't new information.
> Ideally, the check should be done on both follow_link(), and the
> directory itself.
Oh, I didn't mean restricting readdir(), I meant restricting any access
through the directory similar to how the original CAP_SYS_ADMIN check
was done.
Thanks,
Calvin
^ permalink raw reply
* Re: [PATCH 01/13] kdbus: add documentation
From: Eric W. Biederman @ 2015-02-04 6:30 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Andy Lutomirski, Daniel Mack, Arnd Bergmann, Ted Ts'o,
Michael Kerrisk, Linux API, One Thousand Gnomes,
Austin S Hemmelgarn, Tom Gundersen, linux-kernel, David Herrmann,
Djalal Harouni, Johannes Stezenbach, Christoph Hellwig
In-Reply-To: <20150204031437.GB32207-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
> On Tue, Feb 03, 2015 at 08:47:51PM -0600, Eric W. Biederman wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>> > On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>> >> Hi Andy,
>> >>
>> >> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
>> >>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>> >>
>> >>>> That's right, but again - if an application wants to gather this kind of
>> >>>> information about tasks it interacts with, it can do so today by looking
>> >>>> at /proc or similar sources. Desktop machines do exactly that already,
>> >>>> and the kernel code executed in such cases very much resembles that in
>> >>>> metadata.c, and is certainly not cheaper. kdbus just makes such
>> >>>> information more accessible when requested. Which information is
>> >>>> collected is defined by bit-masks on both the sender and the receiver
>> >>>> connection, and most applications will effectively only use a very
>> >>>> limited set by default if they go through one of the more high-level
>> >>>> libraries.
>> >>>
>> >>> I should rephrase a bit. Kdbus doesn't require use of send-time
>> >>> metadata. It does, however, strongly encourage it, and it sounds like
>> >>
>> >> On the kernel level, kdbus just *offers* that, just like sockets offer
>> >> SO_PASSCRED. On the userland level, kdbus helps applications get that
>> >> information race-free, easier and faster than they would otherwise.
>> >>
>> >>> systemd and other major users will use send-time metadata. Once that
>> >>> happens, it's ABI (even if it's purely in userspace), and changing it
>> >>> is asking for security holes to pop up. So you'll be mostly stuck
>> >>> with it.
>> >>
>> >> We know we can't break the ABI. At most, we could deprecate item types
>> >> and introduce new ones, but we want to avoid that by all means of
>> >> course. However, I fail to see how that is related to send time
>> >> metadata, or even to kdbus in general, as all ABIs have to be kept stable.
>> >
>> > I should have said it differently. ABI is the wrong term -- it's more
>> > of a protocol issue.
>> >
>> > It looks like, with the current code, the kernel will provide
>> > (optional) send-time metadata, and the sd-bus library will use it.
>> > The result will be that the communication protocol between clients and
>> > udev, systemd, systemd-logind, g-s-d, etc, will likely involve
>> > send-time metadata. This may end up being a bottleneck.
>>
>> A quick note on a couple of things I have seen in this conversation.
>>
>> - The reason for kdbus is performance.
>
> No, that's not the only reason for kdbus, don't focus only on this. I
> set out a long list of things for why we created kdbus, speed was only
> one of the things. Security is also one, and the ability to gather
> these attributes in an atomic and secure way is very important as
> userspace wants this.
Perhaps I should have said the predominant reason. Certainly that seems
to be most of what I have seen talked about.
Regardless looking at the performance in the design and removing any
substantial obstacle to making things go fast.
Further. I had this conversation earlier in an earlier round of the
review and I was told that in fact existing dbus applications do not
want or need these attributes. I think I heard journald wants them for
pretty printing things.
If security is your concern I really think per message attributes
collected and sent when a message is sent is a bad idea. It has been a
nasty anti-pattern in the kernel code. Lots and lots of meta-data
copyed from a task and sent to someone else has significant performance,
maintenance, and security impacts.
Code written in that pattern is complex and hard to analyze, and hard to
think about. Consider debugging why a message does not get the expected
treatment from your suid application because someone changed the euid
over that particular call and had not thought about it's consequences.
Frankly I have been there and done that and it is a mess.
So no I do not think breaking encapsulation and having weird side
effects affecting your new primitive will have any security benefits
whatsover. It will just result in brittle complex code.
If you want to avoid the races causing sends through a file descriptor
to fail that don't have the expected attributes (my constructive
suggestion earlier) is a very different thing from a performance and
mainteance standpoint. That does not increase the code complexity
nearly as much in the implementation or in use, and unexpected failures
happen right away.
>> - pipes rather than unix domain sockets are likely the standard to meet.
>> If you can't equal unix domain sockets for simple things you are
>> likely leaving a lot of stops in. Last I looked pipes in general were
>> notiably faster than unix domain sockets.
>>
>> The performance numbers I saw posted up-thread were horrible. I have
>> seen faster numbers across a network of machines. If your ping-pong
>> latency isn't measured in nano-seconds you are probably doing
>> something wrong.
>
> It all depends on what you are passing on that "ping-pong", a real
> D-Bus connection has real data and meta data that has to be sent.
> Trying to make a fake benchmark number isn't going to show anything.
All that I was intending to convey is that the numbers I have seen have
been orders of magnitude slower than I would expect. And 10x to 100x
slower than the code should be is a reason to ask why.
In my experience being efficient with small messages are important
because (a) they are the hardest to make go fast (b) they are surprising
common. Remote X application start-up times are very slow because of
these.
People have a distressing habit of writing applications that
send a small message and synchronously waits for it. Over time these
small ipc calls build up and you are limited by how fast they will go.
>> - syscalls remove overhead. So since performance is kdbus's reason for existence
>> let's remove some ridiculous stops, and get a fast path into the kernel.
>
> Again, not the only reason, see my first post in this thread for
> details.
But performance is important, and performance is a good reason to use
system calls.
Security is another reason to have real system calls, as there is less
going on (compared to an ioctl multiplexer) so the code is easier to
audit.
Eric
^ permalink raw reply
* Re: [PATCH 0/5] RFC: Offer a way for userspace to request real deletion of files
From: Michael Kerrisk @ 2015-02-04 8:01 UTC (permalink / raw)
To: Alexander Holler; +Cc: Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <1422896713-25367-1-git-send-email-holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
[CC += linux-api@]
Hello Alexander,
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api-u79uwXL29TaiAVqoAR/hOA@public.gmane.org See also
https://www.kernel.org/doc/man-pages/linux-api-ml.html. Please CC
linux-api@ on future iterations of this patch.
Thanks,
Michael
On Mon, Feb 2, 2015 at 6:05 PM, Alexander Holler <holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org> wrote:
>
> Hello,
>
> some people already might have noticed that I've got a bit angry that
> no filesystem nor the posix api nor the linux syscalls are offering
> the userspace a way to request real deletion of one or more files (there's
> the 's' bit, see man chattr, but it is ignored by all FS which know it).
>
> Almost all filesystems are working such, that deleting a file just
> means it doesn't appear in the list of files anymore but the contents
> of the file still might be readable on the storage.
>
> So in the last 30 years many tools were created trying to circumvent that
> inability of all filesystems. Up to encrypting the whole storage which
> seems to be the current state of art and which many people recently tried
> to recommend to me.
>
> Also I'm using that workaround already myself since many years, I still
> believe it's a very bad, complictated, cumbersome and very uncomfortable
> way to make sure contents of files are not readable anymore. Besides that,
> just relying on encryption might hit back badly, because encryption often
> suffers from bugs in the implementation, bugs or even backdoors in the
> design and Moore. That means it's unsure how long the used encryption
> will defeat any tries to read the contents of a deleted file from storage
> and the used encryption might be worthless tomorrow. Not to speak about
> the problems with the necessary key-handling.
>
> What's the answer? Easy and obvious, just (try to) overwrite the contents
> of a file by request from userspace. Filesystems do know where on the
> storage they have written the contents to, so why not just let them delete
> that stuff themself instead? It's almost unbelievable that this was not
> already done in the past 30 years.
>
> So, now, after I've got angry enough, I've tried to do it myself, it seems
> to work and wasn't really hard.
>
> Of course, the easy way I think I've found isn't really my achievement.
> Instead it relies on all the work people have already done to support the
> trim command of SSDs. So thanks to all of them. You've made the following
> simple patches possible.
>
> How does it work:
>
> - Implement a new syscall named unlinkat_s() with the same signature as
> unlinkat(). With this syscall filesystems should make the old contents
> of files unreadable and should fail if they can't. This doesn't really
> have to be reliable, because it is often impossible for a filesystem to
> make enough assumptions about the underlying storage to promise secure
> deletion. But it has to mean that the filesystem tried everything it can
> to make sure the contents are unreadabler afterwards, e.g. by overwriting
> them, using secure trim or even just using trim. I've no idea if trim
> might be enough, if I would have implemented trim, it would clear the
> trimmed blocks in flash too, making them unreadable. But I haven't done
> such and I haven't tested if that's the case.
> The new syscall isn't meant to replace unlinkat() for everyday operations,
> therefor operation speed is ignored (see below in regard to a side effect).
>
> - Instruct the filesystem that it should discard or overwrite (all) freed
> blocks while the unlinkat_s() is at work.
>
> - Kill the inode while letting the filesystem discard freed blocks or
> overwrite them. As said before, this was easy through all the work already
> done by others. There even already existed a sb_issue_zeroout() which could
> be used instead of sb_issue_discard().
>
> - Sync the filesystem, to make sure the stuff is written to the storage.
>
>
> This approach has the side effect that while a call of unlinkat_s() is at
> work, all freed blocks will be destroyed, even those which aren't beloning
> to the unlink operation but are freed by possible other running actions.
> But in my humble opinion, that's nothing to care about and it keeps the
> implementation of this feature simple. I like KISS and that's imho the
> main feature of these patches.
>
>
> Things to be aware of when reading and starting to critisize my patches:
>
> - I've never had a look before at the kernel sources in fs/*.
> - They are the result of around half a dozen hours.
> - I'm aware that these patches are imperfect. Perfectionism does cost time
> for which I often don't feel the need to spend it unpaid.
> - I don't care for comments regarding style.
> - They are a proof of concept and are an offer. They are meant for other
> users, not maintainers. I wasn't paid for doing them and I don't care much
> if they will end up in the kernel. I already have and can use them, I'm
> happy with them and I don't really need them in the official kernel as I'm
> able to easily rebase them myself (thanks to git).
> - Don't be disappointed because the patches are that simple. The idea
> counts. ;)
>
>
> Regards,
>
> Alexander Holler
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
^ permalink raw reply
* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Michael Kerrisk @ 2015-02-04 8:01 UTC (permalink / raw)
To: Alexander Holler; +Cc: Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <1422896713-25367-2-git-send-email-holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
[CC += linux-api@]
On Mon, Feb 2, 2015 at 6:05 PM, Alexander Holler <holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org> wrote:
> Signed-off-by: Alexander Holler <holler-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
> ---
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/namei.c | 38 ++++++++++++++++++++++++++++++-----
> include/asm-generic/audit_dir_write.h | 1 +
> include/linux/fs.h | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 4 +++-
> tools/perf/builtin-trace.c | 2 ++
> 8 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 9fe1b5d..7a3d530 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -364,3 +364,4 @@
> 355 i386 getrandom sys_getrandom
> 356 i386 memfd_create sys_memfd_create
> 357 i386 bpf sys_bpf
> +359 i386 unlinkat_s sys_unlinkat_s
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 281150b..97eaf01 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -328,6 +328,7 @@
> 319 common memfd_create sys_memfd_create
> 320 common kexec_file_load sys_kexec_file_load
> 321 common bpf sys_bpf
> +322 common unlinkat_s sys_unlinkat_s
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namei.c b/fs/namei.c
> index db5fe86..1ad3724 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3717,7 +3717,7 @@ EXPORT_SYMBOL(vfs_unlink);
> * writeout happening, and we don't want to prevent access to the directory
> * while waiting on the I/O.
> */
> -static long do_unlinkat(int dfd, const char __user *pathname)
> +static long do_unlinkat(int dfd, const char __user *pathname, bool secure)
> {
> int error;
> struct filename *name;
> @@ -3759,8 +3759,25 @@ exit2:
> dput(dentry);
> }
> mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
> - if (inode)
> - iput(inode); /* truncate the inode here */
> + if (inode) {
> + // TODO:
> + // if (inode is file and 's' flag is set)
> + // secure = true;
> + if (!secure)
> + iput(inode); /* truncate the inode here */
> + else {
> + struct super_block *sb = inode->i_sb;
> + if (sb->s_op->set_secure_delete)
> + sb->s_op->set_secure_delete(sb, true);
> + // TODO: We should fail if secure isn't supported,
> + // look up how that's possible here.
> + iput(inode); /* truncate the inode here */
> + // TODO: check if sb is still valid after the inode is gone
> + sync_filesystem(sb);
> + if (sb->s_op->set_secure_delete)
> + sb->s_op->set_secure_delete(sb, false);
> + }
> + }
> inode = NULL;
> if (delegated_inode) {
> error = break_deleg_wait(&delegated_inode);
> @@ -3796,12 +3813,23 @@ SYSCALL_DEFINE3(unlinkat, int, dfd, const char __user *, pathname, int, flag)
> if (flag & AT_REMOVEDIR)
> return do_rmdir(dfd, pathname);
>
> - return do_unlinkat(dfd, pathname);
> + return do_unlinkat(dfd, pathname, false);
> }
>
> SYSCALL_DEFINE1(unlink, const char __user *, pathname)
> {
> - return do_unlinkat(AT_FDCWD, pathname);
> + return do_unlinkat(AT_FDCWD, pathname, false);
> +}
> +
> +SYSCALL_DEFINE3(unlinkat_s, int, dfd, const char __user *, pathname, int, flag)
> +{
> + if ((flag & ~AT_REMOVEDIR) != 0)
> + return -EINVAL;
> +
> + if (flag & AT_REMOVEDIR)
> + return do_rmdir(dfd, pathname);
> +
> + return do_unlinkat(dfd, pathname, true);
> }
>
> int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
> diff --git a/include/asm-generic/audit_dir_write.h b/include/asm-generic/audit_dir_write.h
> index 7b61db4..5282aba 100644
> --- a/include/asm-generic/audit_dir_write.h
> +++ b/include/asm-generic/audit_dir_write.h
> @@ -29,4 +29,5 @@ __NR_unlinkat,
> __NR_renameat,
> __NR_linkat,
> __NR_symlinkat,
> +__NR_unlinkat_s,
> #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..039e969 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1594,6 +1594,7 @@ struct super_operations {
> int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
> long (*nr_cached_objects)(struct super_block *, int);
> long (*free_cached_objects)(struct super_block *, long, int);
> + void (*set_secure_delete) (struct super_block *, bool);
> };
>
> /*
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index bda9b81..b88019b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -877,4 +877,5 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
> asmlinkage long sys_getrandom(char __user *buf, size_t count,
> unsigned int flags);
> asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
> +asmlinkage long sys_unlinkat_s(int dfd, const char __user * pathname, int flag);
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 22749c1..2ba072e 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
> __SYSCALL(__NR_memfd_create, sys_memfd_create)
> #define __NR_bpf 280
> __SYSCALL(__NR_bpf, sys_bpf)
> +#define __NR_unlinkat_s 281
> +__SYSCALL(__NR_unlinkat_s, sys_unlinkat_s)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 281
> +#define __NR_syscalls 282
>
> /*
> * All syscalls below here should go away really,
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index fb12645..1507335 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -1110,6 +1110,8 @@ static struct syscall_fmt {
> { .name = "uname", .errmsg = true, .alias = "newuname", },
> { .name = "unlinkat", .errmsg = true,
> .arg_scnprintf = { [0] = SCA_FDAT, /* dfd */ }, },
> + { .name = "unlinkat_s", .errmsg = true,
> + .arg_scnprintf = { [0] = SCA_FDAT, /* dfd */ }, },
> { .name = "utimensat", .errmsg = true,
> .arg_scnprintf = { [0] = SCA_FDAT, /* dirfd */ }, },
> { .name = "write", .errmsg = true,
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
^ permalink raw reply
* Re: [PATCH v4] gpio: lib-sysfs: Add 'wakeup' attribute
From: Linus Walleij @ 2015-02-04 9:19 UTC (permalink / raw)
To: Sören Brinkmann
Cc: Alexandre Courbot, Johan Hovold, linux-api,
Linux Kernel Mailing List, linux-gpio@vger.kernel.org,
linux-doc@vger.kernel.org
In-Reply-To: <bb967ff02726449ab6525d33bccde7e4@BL2FFO11FD014.protection.gbl>
On Thu, Jan 29, 2015 at 6:23 PM, Sören Brinkmann
<soren.brinkmann@xilinx.com> wrote:
> On Mon, 2015-01-19 at 09:54AM +0100, Linus Walleij wrote:
>> On Mon, Jan 19, 2015 at 5:20 AM, Alexandre Courbot <gnurou@gmail.com> wrote:
>> > On Sat, Jan 17, 2015 at 1:49 AM, Sören Brinkmann <soren.brinkmann@xilinx.com> wrote:
>> >> On Fri, 2015-01-16 at 12:11PM +0100, Johan Hovold wrote:
>>
>> >>> Implementing proper wakeup support for unclaimed GPIOs would take some
>> >>> work (if at all desired), but that is not a reason to be adding custom
>> >>> implementations that violates the kernel's power policies and new ABIs
>> >>> that would need to be maintained forever.
>> (...)
>> >>> Meanwhile you can (should) use gpio-keys if you need to wake your system
>> >>> on gpio events.
>> >>
>> >> We had that discussion and I don't think GPIO keys is the right solution
>> >> for every use-case.
>> >
>> > Sorry, it has been a while - can you remind us of why?
>>
>> There are such cases. Of course keys should be handled by GPIO-keys
>> and these will trigger the right wakeup events in such cases.
>>
>> This is for more esoteric cases: we cannot have a kernel module for
>> everything people want to do with GPIOs, and the use case I accept
>> is GPIOs used in automatic control etc, think factory lines or doors.
>> We can't have a "door" driver or "punch arm" or "fire alarm" driver
>> in the kernel. Those are userspace things.
>>
>> Still such embedded systems need to be able to go to idle and
>> sleep to conerve power, and then they need to put wakeups on
>> these GPIOs.
>>
>> So it is a feature userspace needs, though as with much of the
>> sysfs ABI it is very often abused for things like keys and LEDs which
>> is an abomination but we can't do much about it :(
>
> Thanks for clearing that up.
> What does that mean for this patch? Are we going ahead, accepting the
> extension of this API or do all these use-cases have to wait for the
> rewrite of a proper GPIO userspace interface?
What needs to happen (IMHO) is to make gpio_chips properly obeying
the device model, and then add the attributes for fiddling around with
GPIOs to either the *real* device or create a new char device interface.
Whatever works best. These mock devices are fragile and never
worked properly especially in the removal path as Johans recent
fixes has shown.
Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-gpio" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC] mremap: add MREMAP_NOHOLE flag
From: Michael Kerrisk @ 2015-02-04 10:22 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-mm, danielmicay-Re5JQEeQqe8AvxtiuMwx3w,
Kernel-team-b10kYP2dOMg, Rik van Riel, Andrew Morton,
Hugh Dickins, Andy Lutomirski, Linux API
In-Reply-To: <7064772f72049de8a79383105f49b5db84a946e5.1422990665.git.shli-b10kYP2dOMg@public.gmane.org>
[CC += linux-api]
Hello Shaohua Li,
Since this is an API change, please CC linux-api@. (The kernel source
file Documentation/SubmitChecklist notes that all Linux kernel patches
that change userspace interfaces should be CCed to
linux-api-u79uwXL29TaiAVqoAR/hOA@public.gmane.org See also
https://www.kernel.org/doc/man-pages/linux-api-ml.html)
Thanks,
Michael
On Tue, Feb 3, 2015 at 8:19 PM, Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:
> There was a similar patch posted before, but it doesn't get merged. I'd like
> to try again if there are more discussions.
> http://marc.info/?l=linux-mm&m=141230769431688&w=2
>
> mremap can be used to accelerate realloc. The problem is mremap will
> punch a hole in original VMA, which makes specific memory allocator
> unable to utilize it. Jemalloc is an example. It manages memory in 4M
> chunks. mremap a range of the chunk will punch a hole, which other
> mmap() syscall can fill into. The 4M chunk is then fragmented, jemalloc
> can't handle it.
>
> This patch adds a new flag for mremap. With it, mremap will not punch the
> hole. page tables of original vma will be zapped in the same way, but
> vma is still there. That is original vma will look like a vma without
> pagefault. Behavior of new vma isn't changed.
>
> For private vma, accessing original vma will cause
> page fault and just like the address of the vma has never been accessed.
> So for anonymous, new page/zero page will be fault in. For file mapping,
> new page will be allocated with file reading for cow, or pagefault will
> use existing page cache.
>
> For shared vma, original and new vma will map to the same file. We can
> optimize this without zaping original vma's page table in this case, but
> this patch doesn't do it yet.
>
> Since with MREMAP_NOHOLE, original vma still exists. pagefault handler
> for special vma might not able to handle pagefault for mremap'd area.
> The patch doesn't allow vmas with VM_PFNMAP|VM_MIXEDMAP flags do NOHOLE
> mremap.
>
> Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> Signed-off-by: Shaohua Li <shli-b10kYP2dOMg@public.gmane.org>
> ---
> include/uapi/linux/mman.h | 1 +
> mm/mremap.c | 97 ++++++++++++++++++++++++++++++++---------------
> 2 files changed, 67 insertions(+), 31 deletions(-)
>
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index ade4acd..9ee9a15 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -5,6 +5,7 @@
>
> #define MREMAP_MAYMOVE 1
> #define MREMAP_FIXED 2
> +#define MREMAP_NOHOLE 4
>
> #define OVERCOMMIT_GUESS 0
> #define OVERCOMMIT_ALWAYS 1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 3b886dc..ea3f40d 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -236,7 +236,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
> static unsigned long move_vma(struct vm_area_struct *vma,
> unsigned long old_addr, unsigned long old_len,
> - unsigned long new_len, unsigned long new_addr, bool *locked)
> + unsigned long new_len, unsigned long new_addr, bool *locked,
> + bool nohole)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *new_vma;
> @@ -292,7 +293,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
>
> /* Conceal VM_ACCOUNT so old reservation is not undone */
> - if (vm_flags & VM_ACCOUNT) {
> + if ((vm_flags & VM_ACCOUNT) && !nohole) {
> vma->vm_flags &= ~VM_ACCOUNT;
> excess = vma->vm_end - vma->vm_start - old_len;
> if (old_addr > vma->vm_start &&
> @@ -312,11 +313,18 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> hiwater_vm = mm->hiwater_vm;
> vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);
>
> - if (do_munmap(mm, old_addr, old_len) < 0) {
> + if (!nohole && do_munmap(mm, old_addr, old_len) < 0) {
> /* OOM: unable to split vma, just get accounts right */
> vm_unacct_memory(excess >> PAGE_SHIFT);
> excess = 0;
> }
> +
> + if (nohole && (new_addr & ~PAGE_MASK)) {
> + /* caller will unaccount */
> + vma->vm_flags &= ~VM_ACCOUNT;
> + do_munmap(mm, old_addr, old_len);
> + }
> +
> mm->hiwater_vm = hiwater_vm;
>
> /* Restore VM_ACCOUNT if one or two pieces of vma left */
> @@ -334,14 +342,13 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> return new_addr;
> }
>
> -static struct vm_area_struct *vma_to_resize(unsigned long addr,
> - unsigned long old_len, unsigned long new_len, unsigned long *p)
> +static unsigned long validate_vma_and_charge(struct vm_area_struct *vma,
> + unsigned long addr,
> + unsigned long old_len, unsigned long new_len, unsigned long *p,
> + bool nohole)
> {
> struct mm_struct *mm = current->mm;
> - struct vm_area_struct *vma = find_vma(mm, addr);
> -
> - if (!vma || vma->vm_start > addr)
> - goto Efault;
> + unsigned long diff;
>
> if (is_vm_hugetlb_page(vma))
> goto Einval;
> @@ -350,6 +357,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> if (old_len > vma->vm_end - addr)
> goto Efault;
>
> + if (nohole && (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
> + goto Einval;
> +
> /* Need to be careful about a growing mapping */
> if (new_len > old_len) {
> unsigned long pgoff;
> @@ -362,39 +372,45 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> goto Einval;
> }
>
> + if (nohole)
> + diff = new_len;
> + else
> + diff = new_len - old_len;
> +
> if (vma->vm_flags & VM_LOCKED) {
> unsigned long locked, lock_limit;
> locked = mm->locked_vm << PAGE_SHIFT;
> lock_limit = rlimit(RLIMIT_MEMLOCK);
> - locked += new_len - old_len;
> + locked += diff;
> if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> goto Eagain;
> }
>
> - if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT))
> + if (!may_expand_vm(mm, diff >> PAGE_SHIFT))
> goto Enomem;
>
> if (vma->vm_flags & VM_ACCOUNT) {
> - unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
> + unsigned long charged = diff >> PAGE_SHIFT;
> if (security_vm_enough_memory_mm(mm, charged))
> goto Efault;
> *p = charged;
> }
>
> - return vma;
> + return 0;
>
> Efault: /* very odd choice for most of the cases, but... */
> - return ERR_PTR(-EFAULT);
> + return -EFAULT;
> Einval:
> - return ERR_PTR(-EINVAL);
> + return -EINVAL;
> Enomem:
> - return ERR_PTR(-ENOMEM);
> + return -ENOMEM;
> Eagain:
> - return ERR_PTR(-EAGAIN);
> + return -EAGAIN;
> }
>
> static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> - unsigned long new_addr, unsigned long new_len, bool *locked)
> + unsigned long new_addr, unsigned long new_len, bool *locked,
> + bool nohole)
> {
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> @@ -422,17 +438,23 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> goto out;
>
> if (old_len >= new_len) {
> - ret = do_munmap(mm, addr+new_len, old_len - new_len);
> - if (ret && old_len != new_len)
> - goto out;
> + if (!nohole) {
> + ret = do_munmap(mm, addr+new_len, old_len - new_len);
> + if (ret && old_len != new_len)
> + goto out;
> + }
> old_len = new_len;
> }
>
> - vma = vma_to_resize(addr, old_len, new_len, &charged);
> - if (IS_ERR(vma)) {
> - ret = PTR_ERR(vma);
> + vma = find_vma(mm, addr);
> + if (!vma || vma->vm_start > addr) {
> + ret = -EFAULT;
> goto out;
> }
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len, &charged,
> + nohole);
> + if (ret)
> + goto out;
>
> map_flags = MAP_FIXED;
> if (vma->vm_flags & VM_MAYSHARE)
> @@ -444,7 +466,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> if (ret & ~PAGE_MASK)
> goto out1;
>
> - ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
> + ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, nohole);
> if (!(ret & ~PAGE_MASK))
> goto out;
> out1:
> @@ -483,8 +505,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> unsigned long ret = -EINVAL;
> unsigned long charged = 0;
> bool locked = false;
> + bool nohole = flags & MREMAP_NOHOLE;
>
> - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_NOHOLE))
> return ret;
>
> if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
> @@ -508,7 +531,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>
> if (flags & MREMAP_FIXED) {
> ret = mremap_to(addr, old_len, new_addr, new_len,
> - &locked);
> + &locked, nohole);
> goto out;
> }
>
> @@ -528,9 +551,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> /*
> * Ok, we need to grow..
> */
> - vma = vma_to_resize(addr, old_len, new_len, &charged);
> - if (IS_ERR(vma)) {
> - ret = PTR_ERR(vma);
> + vma = find_vma(mm, addr);
> + if (!vma || vma->vm_start > addr) {
> + ret = -EFAULT;
> goto out;
> }
>
> @@ -541,6 +564,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> if (vma_expandable(vma, new_len - old_len)) {
> int pages = (new_len - old_len) >> PAGE_SHIFT;
>
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len,
> + &charged, false);
> + if (ret) {
> + BUG_ON(charged != 0);
> + goto out;
> + }
> if (vma_adjust(vma, vma->vm_start, addr + new_len,
> vma->vm_pgoff, NULL)) {
> ret = -ENOMEM;
> @@ -558,6 +587,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> }
> }
>
> + ret = validate_vma_and_charge(vma, addr, old_len, new_len,
> + &charged, nohole);
> + if (ret)
> + goto out;
> +
> /*
> * We weren't able to just expand or shrink the area,
> * we need to create a new one and move it..
> @@ -577,7 +611,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> goto out;
> }
>
> - ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
> + ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked,
> + nohole);
> }
> out:
> if (ret & ~PAGE_MASK)
> --
> 1.8.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
^ permalink raw reply
* Re: [RFC PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-04 10:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Michael Kerrisk (man-pages), Linux API, Andrew Morton,
Oleg Nesterov, LKML
In-Reply-To: <87bnlalo7k.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Tue 03-02-15 14:44:31, Eric W. Biederman wrote:
> Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> writes:
>
> > On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
> >> Hi Michal,
> >>
> >>
> >> On 3 February 2015 at 16:05, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> >> > Hi,
> >> > while debugging an unexpected ENOMEM from fork (there was no memory
> >> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> >> > ENOMEM even when not short on memory.
> >> >
> >> > In this particular case it was due to depleted pid space which is
> >> > documented to return EAGAIN in man pages.
> >> >
> >> > Here is a quick fix up.
> >>
> >> Could you summarize briefly what the user-space visible change is
> >> here?
> >
> > The user visible change is that the userspace will get EAGAIN when
> > calling fork and the pid space is depleted because of a system wide
> > limit as per man page description rather than ENOMEM which we return
> > currently.
>
> I don't think that EAGAIN is any better than ENOMEM,
Well, EAGAIN is what the documentation says (both POSIX and ours).
I do not want to get to language lawyering but we should either change
the documentation and reason why we differ from POSIX or change the
behavior. I find the later option more viable because ENOMEM is really
hard to debug when there is not a lack of memory.
> nor do I know that it is safe to return EBUSY from fork. What nonsense
> will applications do when they see an unexpected error code.
Agreed, any new error code might be confusing.
> >> It is not so obvious from your message. I believe you're turning
> >> some cases of ENOMEM into EAGAIN, right?
> >
> > Yes, except for the case mentioned below which discusses a potential
> > error code for pid namespace triggered failures.
> >
> >> Note, by the way, that if I understandwhat you intend, this change
> >> would bring the implementation closer to POSIX, which specifies:
> >
> > True.
> >
> > HTH.
> >
> >> EAGAIN The system lacked the necessary resources to create
> >> another process, or the system-imposed limit on the total
> >> number of processes under execution system-wide or by a
> >> single user {CHILD_MAX} would be exceeded.
> >>
>
> Note. All of those documented errors documented to return EAGAIN
> are the kind of errors that if you wait a while you can reasonably
> expect fork to succeed later.
>
> With respecting to dealing with errors from fork, fork
> is a major pain. Fork only has only two return codes documented,
> and fork is one of the most complicated system calls in the kernel with
> the most failure modes of any system call I am familiar with. Mapping
> a plethora of failure modes onto two error codes is always going to be
> problematic from some view point.
Agreed.
> EAGAIN is a bad idea in general because that means try again and if you
> have hit a fixed limit trying again is wrong.
I don't know what was the justification for EAGAIN but it was like that
for ages so I assume there is a code which relies on that.
> Frankly I think posix is probably borked to recommend EAGAIN instead of
> ENOMEM.
>
> Everyone in the world uses fork which makes is quite tricky to figure
> out which assumptions on the return values of fork exist in the wild,
> so it is not clear if it is safe to add new more descriptive return
> messages.
>
> With respect to the case where PIDNS_HASH_ADDING would cause fork to
> fail, that only happens after init has exited in a pid namespace, so it
> is very much a permanent failure, and there are no longer any processes
> in the specific pid namespace nor will there ever be any more processes
> in that pid namespace. EINVAL might actually makes sense. Of course
> a sensible error code from fork does not seem to be allowed.
>
> Of the two return codes that are allowed for fork, EAGAIN and ENOMEM
> ENOMEM seems to be better as it is a more permanement failure.
Yes, if a new error code is really a nogo I would go for ENOMEM.
> I agree it is a little confusing, but I don't see anything that is other
> than a little confusing.
>
> Other than someone doing:
>
> unshare(CLONE_NEWPID);
> pid = fork();
> waitpid(pid);
> fork(); /* returns ENOMEM */
Ohh, I got it finally. unshare itself doesn't move the calling process
into a new namsepace.
> Was there any other real world issue that started this goal to fix fork?
I haven't seen any real world issue wrt. PIDNS_HASH_ADDING. It came out
from the error code propagation.
But I have seen ENOMEM when pid space was depleted and that confused
people and they reported it as a bug. Exactly because there was a ton of
memory free and no overcommit.
> I think there is a reasonable argument for digging into the fork return
> code situation. Perhaps it is just a matter of returning exotic return
> codes for the weird and strange cases like trying to create a pid in a
> dead pid namespace.
>
> But what we have works, and I don't know of anything bad that happens
> except when people are developing new code they get confused.
>
> Further we can't count on people to read their man pages because this
> behavior of returning ENOMEM is documented in pid_namespaces(7). Which
> makes me really thinking changing the code to match the manpage is more
> likely to break code than to fix code.
OK, I will remove the pid namespace related error code propagation and
keep it returning ENOMEM but I think the EAGAIN part is still worth fixing.
Will post a new patch.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
- As discussed in previous thread [1], split the call to epoll_ctl_batch and
epoll_pwait. [Michael]
- Fix memory leaks. [Omar]
- Add a short comment about the ignored copy_to_user failure. [Omar]
- Cover letter rewritten.
This adds two new system calls as described below. I mentioned glibc wrapping
of sigarg in epoll_pwait1 description, so don't read it as end user man page
yet.
One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
which returns per command error code. Ideas to improve that are welcome!
1) epoll_ctl_batch
------------------
NAME
epoll_ctl_batch - modify an epoll descriptor in batch
SYNOPSIS
#include <sys/epoll.h>
int epoll_ctl_batch(int epfd, int flags,
int ncmds, struct epoll_ctl_cmd *cmds);
DESCRIPTION
The system call performs a batch of epoll_ctl operations. It allows
efficient update of events on this epoll descriptor.
Flags is reserved and must be 0.
Each operation is specified as an element in the cmds array, defined as:
struct epoll_ctl_cmd {
/* Reserved flags for future extension, must be 0. */
int flags;
/* The same as epoll_ctl() op parameter. */
int op;
/* The same as epoll_ctl() fd parameter. */
int fd;
/* The same as the "events" field in struct epoll_event. */
uint32_t events;
/* The same as the "data" field in struct epoll_event. */
uint64_t data;
/* Output field, will be set to the return code after this
* command is executed by kernel */
int error_hint;
};
Commands are executed in their order in cmds, and if one of them failed,
the rest after it are not tried.
Not that this call isn't atomic in terms of updating the epoll
descriptor, which means a second epoll_ctl or epoll_ctl_batch call
during the first epoll_ctl_batch may make the operation sequence
interleaved. However, each single epoll_ctl_cmd operation has the same
semantics as a epoll_ctl call.
RETURN VALUE
If one or more of the parameters are incorrect, -1 is returned and errno
is set appropriately. Otherwise, the number of succeeded commands is
returned.
Each error_hint field may be set to the error code or 0, depending on
the result of the command. If there is some error in returning the error
to user, it may also be unchanged, even though the command may actually
be executed. In this case, it's still ensured that the i-th (i = ret)
command is the failed command.
ERRORS
Errors for the overall system call (in errno) are:
EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
cmds is NULL.
ENOMEM There was insufficient memory to handle the requested op control
operation.
EFAULT The memory area pointed to by cmds is not accessible with write
permissions.
Errors for each command (in error_hint) are:
EBADF epfd or fd is not a valid file descriptor.
EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
already registered with this epoll instance.
EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
or the requested operation op is not supported by this interface.
ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
with this epoll instance.
ENOMEM There was insufficient memory to handle the requested op control
operation.
ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
encountered while trying to register (EPOLL_CTL_ADD) a new file
descriptor on an epoll instance. See epoll(7) for further
details.
EPERM The target file fd does not support epoll.
CONFORMING TO
epoll_ctl_batch() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
2) epoll_pwait1
---------------
NAME
epoll_pwait1 - wait for an I/O event on an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct timespec *timeout,
struct sigargs *sig);
DESCRIPTION
The epoll_pwait1 system call differs from epoll_pwait only in parameter
types. The first difference is timeout, a pointer to timespec structure
which allows nanosecond presicion; the second difference, which should
probably be wrapper by glibc and only expose a sigset_t pointer as in
pselect6.
If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
(return immediately). Otherwise it's converted to nanosecond scalar,
again, with the same convention as epoll_pwait's timeout.
RETURN VALUE
The same as said in epoll_pwait(2).
ERRORS
The same as said in man epoll_pwait(2), plus:
EINVAL flags is not zero.
CONFORMING TO
epoll_pwait1() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
Fam Zheng (7):
epoll: Extract epoll_wait_do and epoll_pwait_do
epoll: Specify clockid explicitly
epoll: Extract ep_ctl_do
epoll: Add implementation for epoll_ctl_batch
x86: Hook up epoll_ctl_batch syscall
epoll: Add implementation for epoll_pwait1
x86: Hook up epoll_pwait1 syscall
arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
fs/eventpoll.c | 241 ++++++++++++++++++++++++++-------------
include/linux/syscalls.h | 9 ++
include/uapi/linux/eventpoll.h | 11 ++
5 files changed, 186 insertions(+), 79 deletions(-)
--
1.9.3
^ permalink raw reply
* [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
In preparation of new epoll syscalls, this patch allows reusing the code from
epoll_pwait implementation. The new functions uses ktime_t for more accuracy.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 130 ++++++++++++++++++++++++++-------------------------------
1 file changed, 59 insertions(+), 71 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}
-static inline struct timespec ep_set_mstimeout(long ms)
-{
- struct timespec now, ts = {
- .tv_sec = ms / MSEC_PER_SEC,
- .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
- };
-
- ktime_get_ts(&now);
- return timespec_add_safe(now, ts);
-}
-
/**
* ep_poll - Retrieves ready events, and delivers them to the caller supplied
* event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
* @events: Pointer to the userspace buffer where the ready events should be
* stored.
* @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- * milliseconds. If the @timeout is zero, the function will not block,
- * while if the @timeout is less than zero, the function will block
- * until at least one event has been retrieved (or an error
- * occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation. If 0, the
+ * function will not block. If negative, the function will block until
+ * at least one event has been retrieved (or an error occurred).
*
* Returns: Returns the number of ready events which have been fetched, or an
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, long timeout)
+ int maxevents, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;
- if (timeout > 0) {
- struct timespec end_time = ep_set_mstimeout(timeout);
-
- slack = select_estimate_accuracy(&end_time);
- to = &expires;
- *to = timespec_to_ktime(end_time);
- } else if (timeout == 0) {
+ if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
goto check_events;
+ } else if (ktime_to_ns(timeout) > 0) {
+ struct timespec now, end_time;
+
+ ktime_get_ts(&now);
+ end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+ slack = select_estimate_accuracy(&end_time);
+ to = &expires;
+ *to = timespec_to_ktime(end_time);
}
fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
return error;
}
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
- int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, const ktime_t timeout)
{
int error;
struct fd f;
@@ -2002,29 +1988,32 @@ error_fput:
/*
* Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
*/
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
- int, maxevents, int, timeout, const sigset_t __user *, sigmask,
- size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+ int, maxevents, int, timeout)
+{
+ ktime_t kt = ms_to_ktime(timeout);
+ return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, ktime_t timeout,
+ sigset_t *sigmask, size_t sigsetsize)
{
int error;
- sigset_t ksigmask, sigsaved;
+ sigset_t sigsaved;
/*
* If the caller wants a certain signal mask to be set during the wait,
* we apply it here.
*/
if (sigmask) {
- if (sigsetsize != sizeof(sigset_t))
- return -EINVAL;
- if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
- return -EFAULT;
sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
+ set_current_blocked(sigmask);
}
- error = sys_epoll_wait(epfd, events, maxevents, timeout);
+ error = epoll_wait_do(epfd, events, maxevents, timeout);
/*
* If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2033,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
return error;
}
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+ int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+ size_t, sigsetsize)
+{
+ ktime_t kt = ms_to_ktime(timeout);
+ sigset_t ksigmask;
+
+ if (sigmask) {
+ if (sigsetsize != sizeof(sigset_t))
+ return -EINVAL;
+ if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+ return -EFAULT;
+ }
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
- struct epoll_event __user *, events,
- int, maxevents, int, timeout,
- const compat_sigset_t __user *, sigmask,
- compat_size_t, sigsetsize)
+ struct epoll_event __user *, events,
+ int, maxevents, int, timeout,
+ const compat_sigset_t __user *, sigmask,
+ compat_size_t, sigsetsize)
{
- long err;
compat_sigset_t csigmask;
- sigset_t ksigmask, sigsaved;
+ sigset_t ksigmask;
+ ktime_t kt = ms_to_ktime(timeout);
- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(compat_sigset_t))
return -EINVAL;
if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
return -EFAULT;
sigset_from_compat(&ksigmask, &csigmask);
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
- }
-
- err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (err == -EINTR) {
- memcpy(¤t->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
}
- return err;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#endif
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 2/7] epoll: Specify clockid explicitly
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
Kees Cook, Andy Lutomirski, David Herrmann, Alexei Starovoitov,
Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.
Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
fs/eventpoll.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..01f469a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1624,7 +1624,8 @@ fetch_events:
}
spin_unlock_irqrestore(&ep->lock, flags);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!schedule_hrtimeout_range_clock(to, slack,
+ HRTIMER_MODE_ABS, clockid))
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1946,8 @@ error_return:
}
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid,
+ const ktime_t timeout)
{
int error;
struct fd f;
@@ -1979,7 +1981,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
ep = f.file->private_data;
/* Time to fish for events ... */
- error = ep_poll(ep, events, maxevents, timeout);
+ error = ep_poll(ep, events, maxevents, clockid, timeout);
error_fput:
fdput(f);
@@ -1994,12 +1996,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
{
ktime_t kt = ms_to_ktime(timeout);
- return epoll_wait_do(epfd, events, maxevents, kt);
+ return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
}
static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
- int maxevents, ktime_t timeout,
- sigset_t *sigmask, size_t sigsetsize)
+ int maxevents,
+ int clockid, ktime_t timeout,
+ sigset_t *sigmask)
{
int error;
sigset_t sigsaved;
@@ -2013,7 +2016,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
set_current_blocked(sigmask);
}
- error = epoll_wait_do(epfd, events, maxevents, timeout);
+ error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
/*
* If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2053,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#ifdef CONFIG_COMPAT
@@ -2073,8 +2076,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
sigset_from_compat(&ksigmask, &csigmask);
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#endif
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 3/7] epoll: Extract ep_ctl_do
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_ctl_wait.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 01f469a..a1c313c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
- struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
{
int error;
int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
- struct epoll_event epds;
struct eventpoll *tep = NULL;
- error = -EFAULT;
- if (ep_op_has_event(op) &&
- copy_from_user(&epds, event, sizeof(struct epoll_event)))
- goto error_return;
-
error = -EBADF;
f = fdget(epfd);
if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
return error;
}
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ struct epoll_event __user *, event)
+{
+ struct epoll_event epds;
+
+ if (ep_op_has_event(op) &&
+ copy_from_user(&epds, event, sizeof(struct epoll_event)))
+ return -EFAULT;
+
+ return ep_ctl_do(epfd, op, fd, epds);
+}
+
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
int maxevents, int clockid,
const ktime_t timeout)
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 4/7] epoll: Add implementation for epoll_ctl_batch
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 48 ++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 ++++
include/uapi/linux/eventpoll.h | 11 ++++++++++
3 files changed, 63 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a1c313c..4eb40e0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2067,6 +2067,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
sigmask ? &ksigmask : NULL);
}
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+ int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+ struct epoll_ctl_cmd *kcmds = NULL;
+ int i, r, ret = 0;
+ int cmd_size;
+
+ if (flags)
+ return -EINVAL;
+ if (ncmds <= 0 || !cmds)
+ return -EINVAL;
+ cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+ kcmds = kmalloc(cmd_size, GFP_KERNEL);
+ if (!kcmds)
+ return -ENOMEM;
+ if (copy_from_user(kcmds, cmds, cmd_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < ncmds; i++) {
+ struct epoll_event ev = (struct epoll_event) {
+ .events = kcmds[i].events,
+ .data = kcmds[i].data,
+ };
+ if (kcmds[i].flags) {
+ kcmds[i].error_hint = -EINVAL;
+ goto copy;
+ }
+ kcmds[i].error_hint = ep_ctl_do(epfd, kcmds[i].op,
+ kcmds[i].fd, ev);
+ if (kcmds[i].error_hint)
+ goto copy;
+ ret++;
+ }
+copy:
+ r = copy_to_user(cmds, kcmds,
+ sizeof(struct epoll_ctl_cmd) * ncmds);
+ /* Failing to copy the command results back will leave
+ * userspace no way to know the actual error code, but we still
+ * report the number of succeeded commands with ret, so it's
+ * not a big problem. Ignore it for now.
+ */
+ (void) r;
+out:
+ kfree(kcmds);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..117822c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
#define _LINUX_SYSCALLS_H
struct epoll_event;
+struct epoll_ctl_cmd;
struct iattr;
struct inode;
struct iocb;
@@ -630,6 +631,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+ int ncmds,
+ struct epoll_ctl_cmd __user *cmds);
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_sethostname(char __user *name, int len);
asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..bbdce3d 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
#include <linux/fcntl.h>
#include <linux/types.h>
+#include <linux/signal.h>
+
/* Flags for epoll_create1. */
#define EPOLL_CLOEXEC O_CLOEXEC
@@ -61,6 +63,15 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;
+struct epoll_ctl_cmd {
+ int flags;
+ int op;
+ int fd;
+ __u32 events;
+ __u64 data;
+ int error_hint;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 5/7] x86: Hook up epoll_ctl_batch syscall
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 epoll_ctl_batch sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 64 epoll_ctl_batch sys_epoll_ctl_batch
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 6/7] epoll: Add implementation for epoll_pwait1
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
This is the new implementation for poll which uses timespec as timeout,
and has a flags parameter.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 34 ++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 5 +++++
2 files changed, 39 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4eb40e0..59b2db9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2115,6 +2115,40 @@ out:
return ret;
}
+SYSCALL_DEFINE6(epoll_pwait1, int, epfd, int, flags,
+ struct epoll_event __user *, events,
+ int, maxevents, struct timespec __user *, timeout,
+ void __user *, sig)
+{
+ size_t sigsetsize = 0;
+ sigset_t ksigmask;
+ sigset_t __user *sigmask = NULL;
+ struct timespec kts;
+ ktime_t kt = { 0 };
+
+ if (flags)
+ return -EINVAL;
+ if (sig) {
+ if (!access_ok(VERIFY_READ, sig, sizeof(void *)+sizeof(size_t))
+ || __get_user(sigmask, (sigset_t __user * __user *)sig)
+ || __get_user(sigsetsize,
+ (size_t __user *)(sig+sizeof(void *))))
+ return -EFAULT;
+ if (sigsetsize != sizeof(sigset_t))
+ return -EINVAL;
+ if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+ return -EFAULT;
+ }
+ if (timeout) {
+ if (copy_from_user(&kts, timeout, sizeof(kts)))
+ return -EFAULT;
+ kt = timespec_to_ktime(kts);
+ }
+
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC,
+ kt, &ksigmask);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 117822c..0c3ecdb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -631,6 +631,11 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+ struct epoll_event __user *events,
+ int maxevents,
+ struct timespec __user *timeout,
+ void __user *sig);
asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
int ncmds,
struct epoll_ctl_cmd __user *cmds);
--
1.9.3
^ permalink raw reply related
* [PATCH RFC v2 7/7] x86: Hook up epoll_pwait1 syscall
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra <peter>
In-Reply-To: <1423046213-7043-1-git-send-email-famz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
359 i386 epoll_ctl_batch sys_epoll_ctl_batch
+360 i386 epoll_pwait1 sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
321 common bpf sys_bpf
322 64 execveat stub_execveat
323 64 epoll_ctl_batch sys_epoll_ctl_batch
+324 64 epoll_pwait1 sys_epoll_pwait1
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
^ permalink raw reply related
* [PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-04 10:43 UTC (permalink / raw)
To: linux-api-u79uwXL29TY76Z2rM5mHXA
Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
LKML
In-Reply-To: <20150204102719.GB29434-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.
The current man page even mentions this case:
"
EAGAIN
A system-imposed limit on the number of threads was encountered.
There are a number of limits that may trigger this error: the
RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
limits the number of processes and threads for a real user ID, was
reached; the kernel's system-wide limit on the number of processes
and threads, /proc/sys/kernel/threads-max, was reached (see
proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
was reached (see proc(5)).
"
so the current behavior is also incorrect wrt. documentation. POSIX man
page also suggest returing EAGAIN when the process count limit is
reached.
This patch simply propagates error code from alloc_pid and makes sure we
return -EAGAIN due to reservation failure. This will make behavior of
fork closer to both our documentation and POSIX.
alloc_pid might alsoo fail when the reaper in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. We have traditionally
returned ENOMEM for this case which is misleading as well but as per
Eric W. Biederman this behavior is documented in man pid_namespaces(7)
"
If the "init" process of a PID namespace terminates, the kernel
terminates all of the processes in the namespace via a SIGKILL signal.
This behavior reflects the fact that the "init" process is essential for
the correct operation of a PID namespace. In this case, a subsequent
fork(2) into this PID namespace will fail with the error ENOMEM; it is
not possible to create a new processes in a PID namespace whose "init"
process has terminated.
"
and introducing a new error code would be too risky so let's stick to
ENOMEM for this case.
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 15 ++++++++-------
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3ce8d57cff09..c37b88a162c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }
}
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
diff --git a/kernel/pid.c b/kernel/pid.c
index 82430c858d69..bbb805ccd893 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -179,7 +179,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
spin_unlock_irq(&pidmap_lock);
kfree(page);
if (unlikely(!map->page))
- break;
+ return -ENOMEM;
}
if (likely(atomic_read(&map->nr_free))) {
for ( ; ; ) {
@@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return -EAGAIN;
}
int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int retval = -ENOMEM;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
- goto out;
+ return ERR_PTR(retval);
tmp = ns;
pid->level = ns->level;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp);
- if (nr < 0)
+ if (IS_ERR_VALUE(nr)) {
+ retval = nr;
goto out_free;
+ }
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
@@ -336,7 +339,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
spin_unlock_irq(&pidmap_lock);
-out:
return pid;
out_unlock:
@@ -348,8 +350,7 @@ out_free:
free_pidmap(pid->numbers + i);
kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
- goto out;
+ return ERR_PTR(retval);
}
void disable_pid_allocation(struct pid_namespace *ns)
--
2.1.4
^ permalink raw reply related
* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
From: Fam Zheng @ 2015-02-04 10:50 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
Kees Cook, Andy Lutomirski, David Herrmann, Alexei Starovoitov,
Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins, Mathieu Desnoyers,
Peter Zijlstra, linux-fsdev
In-Reply-To: <1423046213-7043-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Wed, 02/04 18:36, Fam Zheng wrote:
>
> Not that this call isn't atomic in terms of updating the epoll
s/Not/Note/
> descriptor, which means a second epoll_ctl or epoll_ctl_batch call
> during the first epoll_ctl_batch may make the operation sequence
> interleaved. However, each single epoll_ctl_cmd operation has the same
> semantics as a epoll_ctl call.
>
^ permalink raw reply
* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
From: Michael Kerrisk (man-pages) @ 2015-02-04 12:44 UTC (permalink / raw)
To: Fam Zheng, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro,
Andrew Morton, Kees Cook, Andy Lutomirski, David Herrmann,
Alexei Starovoitov, Miklos Szeredi, David Drysdale, Oleg Nesterov,
David S. Miller, Vivek Goyal, Mike Frysinger, Theodore Ts'o,
Heiko Carstens, Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra
In-Reply-To: <1423046213-7043-1-git-send-email-famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Hello Fam Zheng,
On 02/04/2015 11:36 AM, Fam Zheng wrote:
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
>
> - As discussed in previous thread [1], split the call to epoll_ctl_batch and
> epoll_pwait. [Michael]
>
> - Fix memory leaks. [Omar]
>
> - Add a short comment about the ignored copy_to_user failure. [Omar]
>
> - Cover letter rewritten.
>
> This adds two new system calls as described below. I mentioned glibc wrapping
> of sigarg in epoll_pwait1 description, so don't read it as end user man page
> yet.
Fair enough. But I think it would be helpful to say a little more already.
See my comment below.
> One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
> which returns per command error code. Ideas to improve that are welcome!
>
> 1) epoll_ctl_batch
> ------------------
>
> NAME
> epoll_ctl_batch - modify an epoll descriptor in batch
>
> SYNOPSIS
>
> #include <sys/epoll.h>
>
> int epoll_ctl_batch(int epfd, int flags,
> int ncmds, struct epoll_ctl_cmd *cmds);
>
> DESCRIPTION
>
> The system call performs a batch of epoll_ctl operations. It allows
> efficient update of events on this epoll descriptor.
>
> Flags is reserved and must be 0.
>
> Each operation is specified as an element in the cmds array, defined as:
>
> struct epoll_ctl_cmd {
>
> /* Reserved flags for future extension, must be 0. */
> int flags;
>
> /* The same as epoll_ctl() op parameter. */
> int op;
>
> /* The same as epoll_ctl() fd parameter. */
> int fd;
>
> /* The same as the "events" field in struct epoll_event. */
> uint32_t events;
>
> /* The same as the "data" field in struct epoll_event. */
> uint64_t data;
>
> /* Output field, will be set to the return code after this
> * command is executed by kernel */
> int error_hint;
Why 'error_hint', rather than just stay 'status' or 'result'? I mean
is it really a "hint"? Also, it can be 0 meaning "success" (no error).
> };
>
> Commands are executed in their order in cmds, and if one of them failed,
> the rest after it are not tried.
>
> Not that this call isn't atomic in terms of updating the epoll
> descriptor, which means a second epoll_ctl or epoll_ctl_batch call
> during the first epoll_ctl_batch may make the operation sequence
> interleaved. However, each single epoll_ctl_cmd operation has the same
> semantics as a epoll_ctl call.
(Good to include that warning!)
> RETURN VALUE
>
> If one or more of the parameters are incorrect, -1 is returned and errno
> is set appropriately. Otherwise, the number of succeeded commands is
> returned.
>
> Each error_hint field may be set to the error code or 0, depending on
> the result of the command. If there is some error in returning the error
> to user, it may also be unchanged, even though the command may actually
> be executed. In this case, it's still ensured that the i-th (i = ret)
> command is the failed command.
Sorry -- I'm not clear here. Can you say some more here? What sort
of error might there be when "returning the error to the user"?
> ERRORS
>
> Errors for the overall system call (in errno) are:
>
> EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
> cmds is NULL.
>
> ENOMEM There was insufficient memory to handle the requested op control
> operation.
>
> EFAULT The memory area pointed to by cmds is not accessible with write
> permissions.
>
>
> Errors for each command (in error_hint) are:
>
> EBADF epfd or fd is not a valid file descriptor.
>
> EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
> already registered with this epoll instance.
>
> EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
> or the requested operation op is not supported by this interface.
>
> ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
> with this epoll instance.
>
> ENOMEM There was insufficient memory to handle the requested op control
> operation.
>
> ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
> encountered while trying to register (EPOLL_CTL_ADD) a new file
> descriptor on an epoll instance. See epoll(7) for further
> details.
>
> EPERM The target file fd does not support epoll.
>
> CONFORMING TO
>
> epoll_ctl_batch() is Linux-specific.
>
> SEE ALSO
>
> epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
>
>
> 2) epoll_pwait1
> ---------------
>
> NAME
> epoll_pwait1 - wait for an I/O event on an epoll file descriptor
>
> SYNOPSIS
>
> #include <sys/epoll.h>
>
> int epoll_pwait1(int epfd, int flags,
> struct epoll_event *events,
> int maxevents,
> struct timespec *timeout,
> struct sigargs *sig);
>
> DESCRIPTION
>
> The epoll_pwait1 system call differs from epoll_pwait only in parameter
> types. The first difference is timeout, a pointer to timespec structure
> which allows nanosecond presicion; the second difference, which should
> probably be wrapper by glibc and only expose a sigset_t pointer as in
> pselect6.
Here I think it still helps to explain that 'struct sigags' is a sigset_t* +
the size of the pointed-to set.
> If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
The convention I would expect here is that NULL means infinite timeout,
like select(), and timeout == {0,0} would get the "return immediately"
behavior. Why did you choose your convention? And, how does one otherwise
request an infinite timeout?
> (return immediately). Otherwise it's converted to nanosecond scalar,
> again, with the same convention as epoll_pwait's timeout.
>
> RETURN VALUE
>
> The same as said in epoll_pwait(2).
>
> ERRORS
>
> The same as said in man epoll_pwait(2), plus:
>
> EINVAL flags is not zero.
>
>
> CONFORMING TO
>
> epoll_pwait1() is Linux-specific.
>
> SEE ALSO
>
> epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
In your earlier patch set, there was the ability to select the clock
used for timeouts. Why did this go away? I'm not sure whether we need
that functionality or not, but it would be good to know why you
dropped it this time.
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Michal Hocko @ 2015-02-04 12:52 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
Hugh Dickins, Johannes Weiner, Rik van Riel, KOSAKI Motohiro,
Mel Gorman, Jason Evans, zhangyanfei-BthXqXjhjHXQFUHtdCDX3A,
Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <54D0F9BC.4060306-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On Tue 03-02-15 17:39:24, Michael Kerrisk wrote:
[...]
> If I'm reading the conversation right, the initially proposed text
> was from the BSD man page (which would be okay), but most of the
> text above seems to have come straight from the page here:
> http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
>
> Right?
>
> Unfortunately, I don't think we can use that text. It's from the
> Solaris man page as far as I can tell, and I doubt that it's
> under a license that we can use.
Ohh, I wasn't aware of that restriction and didn't notice anything at
the man page nor http://www.lehman.cuny.edu/cgi-bin/man-cgi.
But you are definitely right that it would be better to not use this
source. Sorry about that, I should have noticed that myself.
> If that's the case, we need to go back and come up with an
> original text. It might draw inspiration from the Solaris page,
> and take actual text from the BSD page (which is under a free
> license), and it might also draw inspiration from Jon Corbet's
> description at http://lwn.net/Articles/590991/.
>
> Could you take another shot this please!
Minchan is obviously working on one and I will review it once he is done
with it.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Michael Kerrisk @ 2015-02-04 13:06 UTC (permalink / raw)
To: Alexander Holler
Cc: Lukáš Czerner, Al Viro, Theodore Ts'o,
Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <54D20F02.5050700@ahsoftware.de>
Alexander,
On Wed, Feb 4, 2015 at 1:22 PM, Alexander Holler <holler@ahsoftware.de> wrote:
> Am 04.02.2015 um 13:07 schrieb Lukáš Czerner:
>
>> The fact is that the current patches are useless for anything other
>> than proof-of-concept. Now you know more that needs to be done or
>
>
> That's wrong. The patches already work. If you delete a file which isn't in
> use by something else, the current contents will be wiped on traditional
> harddrives. I assume that already fulfills more than 50% of use cases of
> ordinary people.
You are getting various feedback from people, that you seem to be ignoring.
Al Viro, in his curmedgeonly way, points out that the problems are
much deeper than you realize. He does not say so explicitly, but I
imagine his point is that he does not want to see the kernel cluttered
with "partial" solutions that will simply increase the maintenance
burden in the long term, and leave bugs to be fixed further down the
line. You seem not to be listening.
Lukáš points out to you that getting a feature like this into the
kernel is complex process. You seem unwilling to hear that, and still
just want your partial solution.
I tell you that discussions of APIs should CC linux-api, which I am
now CCing into this thread, again, because, again, you're not
listening to feedback.
Nobody is asking for "high towers"; they just have their eyes on the
big picture. And the people here are just "ordinary people" with a
*lot* of experience dealing with kernel code (I exclude myself) . They
see many complexities that you don't. Getting intersting features into
the kernel requires a lot of work, and careful listening.
Thanks,
Michael
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 13:21 UTC (permalink / raw)
To: Michael Kerrisk
Cc: Lukáš Czerner, Al Viro, Theodore Ts'o,
Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <CAHO5Pa0_2HE23dPU5Fg-sUgi2-e-szCpqoR98Vd6zE70_yd8Sw@mail.gmail.com>
Am 04.02.2015 um 14:06 schrieb Michael Kerrisk:
> Alexander,
>
> On Wed, Feb 4, 2015 at 1:22 PM, Alexander Holler <holler@ahsoftware.de> wrote:
>> Am 04.02.2015 um 13:07 schrieb Lukáš Czerner:
>>
>>> The fact is that the current patches are useless for anything other
>>> than proof-of-concept. Now you know more that needs to be done or
>>
>>
>> That's wrong. The patches already work. If you delete a file which isn't in
>> use by something else, the current contents will be wiped on traditional
>> harddrives. I assume that already fulfills more than 50% of use cases of
>> ordinary people.
>
> You are getting various feedback from people, that you seem to be ignoring.
I'm happy for all the feedback. But it doesn't help me. I'm not going to
spend the necessary time unpaid.
.
> Al Viro, in his curmedgeonly way, points out that the problems are
> much deeper than you realize. He does not say so explicitly, but I
> imagine his point is that he does not want to see the kernel cluttered
> with "partial" solutions that will simply increase the maintenance
> burden in the long term, and leave bugs to be fixed further down the
> line. You seem not to be listening.
It doesn't help me nor anyone else. As Eric Sandeen made me aware
through in bug, look at http://lwn.net/Articles/462437/ what already
happened.
> Lukáš points out to you that getting a feature like this into the
> kernel is complex process. You seem unwilling to hear that, and still
> just want your partial solution.
Wrong. I don't want my partial solution to be part of the official
kernel. I don't care. I offered it for other users because I'm aware
that has become almost impossible for normal people to get something
into the kernel without spending an unbelievable amount of time most
people can't afford to spend.
> I tell you that discussions of APIs should CC linux-api, which I am
> now CCing into this thread, again, because, again, you're not
> listening to feedback.
Please don't confuse "not listening" with "unable to fulfill Linux
kernel maintainer requests".
Alexander Holler
^ permalink raw reply
* Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
From: Alexander Holler @ 2015-02-04 13:29 UTC (permalink / raw)
To: Michael Kerrisk
Cc: Lukáš Czerner, Al Viro, Theodore Ts'o,
Linux-Fsdevel, Linux Kernel, Linux API
In-Reply-To: <54D21CC8.4020705-SXC+2es9fhnfWeYVQQPykw@public.gmane.org>
Am 04.02.2015 um 14:21 schrieb Alexander Holler:
>> I tell you that discussions of APIs should CC linux-api, which I am
>> now CCing into this thread, again, because, again, you're not
>> listening to feedback.
>
> Please don't confuse "not listening" with "unable to fulfill Linux
> kernel maintainer requests".
I really wonder what do you expect from people not getting paid to spend
time for fulfilling maintainer request?
I've written bugs and even offered some patches (regardless how usefull
there are in your eyes, it's more than most other people can do).
And all what it brought me is that I receive flames like your one.
Do you really think that's the right way to stimulate people in helping
to make Linux better?
Alexander Holler
^ permalink raw reply
* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Vlastimil Babka @ 2015-02-04 13:46 UTC (permalink / raw)
To: Michael Kerrisk (man-pages), Kirill A. Shutemov
Cc: Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-man-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins
In-Reply-To: <54D0F56A.9050003-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote:
> Hello Vlastimil
>
> Thanks for CCing me into this thread.
NP
> On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>>
>>> It doesn't skip. It fails with -EINVAL. Or I miss something.
>>
>> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
>> this case:
>>
>> * The application is attempting to release locked or shared pages (with
>> MADV_DONTNEED).
>
> Yes, there is that. But the page could be more explicit when discussing
> MADV_DONTNEED in the main text. I've done that.
>
>> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
>> though. I dont see any check for other kinds of shared pages in the code.
>
> Agreed. "shared" here seems confused. I've removed it. And I've
> added mention of "Huge TLB pages" for this error.
>
Thanks.
>>>> - The word "will result" did sound as a guarantee at least to me. So here it
>>>> could be changed to "may result (unless the advice is ignored)"?
>>>
>>> It's too late to fix documentation. Applications already depends on the
>>> beheviour.
>>
>> Right, so as long as they check for EINVAL, it should be safe. It appears that
>> jemalloc does.
>
> So, first a brief question: in the cases where the call does not error out,
> are we agreed that in the current implementation, MADV_DONTNEED will
> always result in zero-filled pages when the region is faulted back in
> (when we consider pages that are not backed by a file)?
I'd agree at this point.
Also we should probably mention anonymously shared pages (shmem). I
think they behave the same as file here.
>> I still wouldnt be sure just by reading the man page that the clearing is
>> guaranteed whenever I dont get an error return value, though,
>
> I'm not quite sure what you want here. I mean: if there's an error,
I was just reiterating that the guarantee is not clear from if you
consider all the statements in the man page.
> then the DONTNEED action didn't occur, right? Therefore, there won't
> be zero-filled pages. But, for what it's worth, I added "If the
> operation succeeds" at the start of that sentence beginning "Subsequent
> accesses...".
Yes, that should clarify it. Thanks!
> Now, some history, explaining why the page is a bit of a mess,
> and for that matter why I could really use more help on it from MM
> folk (especially in the form of actual patches [1], rather than notes
> about deficiencies in the documentation), because:
>
> ***I simply cannot keep up with all of the details***.
I see, and expected it would be like this. I would just send patch if
the situation was clear, but here we should agree first, and I thought
you should be involved from the beginning.
> Once upon a time (Linux 2.4), there was madvise() with just 5 flags:
>
> MADV_NORMAL
> MADV_RANDOM
> MADV_SEQUENTIAL
> MADV_WILLNEED
> MADV_DONTNEED
>
> And already a dozen years ago, *I* added the text about MADV_DONTNEED.
> Back then, I believe it was true. I'm not sure if it's still true now,
> but I assume for the moment that it is, and await feedback. And the
> text saying that the call does not affect the semantics of memory
> access dates back even further (and was then true, MADV_DONTNEED aside).
>
> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there
> is a semantic mismatch between the destructive MADV_DONTNEED and
> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear
> on most other implementations.
>
> Since the original implementation, numerous pieces of cruft^W^W^W
> excellent new flags have been overloaded into this one system call.
> Some of those certainly violated the "does not change the semantics
> of the application" statement, but, sadly, the kernel developers who
> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a
> patch to the man page for those new flags, one that might have noted
> that the semantics of the application are changed by such flags. Equally
> sadly, I did overlook to scan the bigger page when *I* added
> documentation of these flags to those pages, otherwise I might have
> caught that detail.
>
> So, just to repeat, I could really use more help on it from MM
> folk in the form of actual patches to the man page.
Thanks for the background. I'll try to remember to check for man-pages
part when I review some api changing patch.
> Thanks,
>
> Michael
>
> [1] https://www.kernel.org/doc/man-pages/patches.html
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox