Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] fork: add clone6
From: Jann Horn @ 2019-05-27 19:36 UTC (permalink / raw)
  To: Linus Torvalds, Kees Cook, Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Florian Weimer, Oleg Nesterov,
	Arnd Bergmann, David Howells, Pavel Emelyanov, Andrew Morton,
	Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAHk-=wjnbK5ob9JE0H1Ge_R4BL6D0ztsAvrM6DN+S+zyDWE=7A@mail.gmail.com>

+Kees

On Mon, May 27, 2019 at 9:27 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > Hm, still pondering whether having one unsigned int argument passed
> > through registers that captures all the flags from the old clone() would
> > be a good idea.
>
> That sounds like a reasonable thing to do.
>
> Maybe we could continue to call the old flags CLONE_XYZ and continue
> to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> flags for a new 64-bit flag field that comes in through memory in the
> new clone_args thing?

With the current seccomp model, that would have the unfortunate effect
of making it impossible to filter out new clone flags - which would
likely mean that people who want to sandbox their code would not use
the new clone() because they don't want their sandboxed code to be
able to create time namespaces and whatever other new fancy things
clone() might support in the future. This is why I convinced Christian
to pass flags in registers for the first patch version.

The alternative I see would be to somehow extend seccomp to support
argument structures that are passed in memory - that would probably
require quite a bit of new plumbing though, both in the kernel and in
userspace code that configures seccomp filters.

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-05-27 19:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <87h89gjgaf.fsf@oldenburg2.str.redhat.com>

----- On May 27, 2019, at 7:19 AM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> +/* volatile because fields can be read/updated by the kernel.  */
>> +__thread volatile struct rseq __rseq_abi = {
>> +  .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
>> +};
> 
> As I've explained repeatedly, the volatile qualifier is wrong because it
> is impossible to get rid of it.  (Accessing an object declared volatile
> using non-volatile pointers is undefined.)  Code using __rseq_abi should
> use relaxed MO atomics or signal fences/compiler barriers, as
> appropriate.

Hi Florian,

OK. So let's remove the volatile.

This means the sched_getcpu() implementation will need to load __rseq_abi.cpu_id
with a atomic_load_relaxed(), am I correct ?

This field can be updated at by the kernel at any point of user-space execution
due to preemption, so we need to ensure the load is performed as a single
instruction to prevent the compiler from doing load tearing, and to force it
to re-fetch the value within loops.

It would become:

int
sched_getcpu (void)
{
  int cpu_id = atomic_load_relaxed (&__rseq_abi.cpu_id);

  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
}

> 
>> +/* Advertise Restartable Sequences registration ownership across
>> +   application and shared libraries.
>> +
>> +   Libraries and applications must check whether this variable is zero or
>> +   non-zero if they wish to perform rseq registration on their own. If it
>> +   is zero, it means restartable sequence registration is not handled, and
>> +   the library or application is free to perform rseq registration. In
>> +   that case, the library or application is taking ownership of rseq
>> +   registration, and may set __rseq_handled to 1. It may then set it back
>> +   to 0 after it completes unregistering rseq.
>> +
>> +   If __rseq_handled is found to be non-zero, it means that another
>> +   library (or the application) is currently handling rseq registration.
>> +
>> +   Typical use of __rseq_handled is within library constructors and
>> +   destructors, or at program startup.  */
>> +
>> +int __rseq_handled;
> 
> It's not clear to me whether the intent is that __rseq_handled reflects
> kernel support for rseq or not.

If __rseq_handled is set, it means a library is managing the rseq registration.
It is independent from the fact that the kernel supports rseq or not.

If e.g. glibc manages rseq registration, it sets __rseq_handled to 1. It will
then query the kernel for rseq availability. If the kernel happens to not
support rseq, the __rseq_abi.cpu_id will be set to RSEQ_CPU_ID_REGISTRATION_FAILED,
which means the registration has failed.

The kernel does not support rseq in that scenario, and it would be pointless
for an early adopter library to try to also register it.

As soon as a library changes the state of __rseq_abi.cpu_id, it is indeed
managing rseq registration. Perhaps the meaning of "handling" rseq registration
should be clarified in the comment.

> Currently, it only tells us whether
> glibc has been built with rseq support or not.  It does not reflect
> kernel support.

We know we have kernel support if __rseq_abi.cpu_id >= 0.

>  I'm still not convinced that this symbol is necessary,
> especially if we mandate a kernel header version which defines __NR_rseq
> for building glibc (which may happen due to the time64_t work).

__NR_rseq is not yet supported by all Linux architectures. So we will need
to support building glibc against kernel headers that do not define __NR_rseq
for quite a while anyway.

Moreover, this does not solve the issue tackled by __rseq_handled: early
adopter libraries managing rseq registration built against older glibc
versions which eventually end up running within a process linked against
a newer glibc which handles rseq registration.

> 
> Furthermore, the reference to ELF constructors is misleading.  I believe
> the code you added to __libc_start_main to initialize __rseq_handled and
> register __seq_abi with the kernel runs *after* ELF constructors have
> executed (and not at all if the main program is written in Go, alas).
> All initialization activity for the shared case needs to happen in
> elf/rtld.c or called from there, probably as part of the security
> initialization code or thereabouts.

in elf/rtld.c:dl_main() we have the following code:

  /* We do not initialize any of the TLS functionality unless any of the
     initial modules uses TLS.  This makes dynamic loading of modules with
     TLS impossible, but to support it requires either eagerly doing setup
     now or lazily doing it later.  Doing it now makes us incompatible with
     an old kernel that can't perform TLS_INIT_TP, even if no TLS is ever
     used.  Trying to do it lazily is too hairy to try when there could be
     multiple threads (from a non-TLS-using libpthread).  */
  bool was_tls_init_tp_called = tls_init_tp_called;
  if (tcbp == NULL)
    tcbp = init_tls ();

If I understand your point correctly, I should move the rseq_init() and
rseq_register_current_thread() for the SHARED case just after this
initialization, otherwise calling those from LIBC_START_MAIN() is too
late and it runs after initial modules constructors (or not at all for
Go). However, this means glibc will start using TLS internally. I'm
concerned that this is not quite in line with the above comment which
states that TLS is not initialized if no initial modules use TLS.

For the !SHARED use-case, if my understanding is correct, I should keep
rseq_init() and rseq_register_current_thread() calls within LIBC_START_MAIN().

Thoughts ?

Thanks for the feedback!

Mathieu



> 
> Thanks,
> Florian

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/2] fork: add clone6
From: Linus Torvalds @ 2019-05-27 19:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <20190527104239.fbnjzfyxa4y4acpf@brauner.io>

On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
>
> Hm, still pondering whether having one unsigned int argument passed
> through registers that captures all the flags from the old clone() would
> be a good idea.

That sounds like a reasonable thing to do.

Maybe we could continue to call the old flags CLONE_XYZ and continue
to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
flags for a new 64-bit flag field that comes in through memory in the
new clone_args thing?

That would make the flag arguments less "unified", but might make for
a simpler patch, and might make it easier for the old interfaces to
just pass in the clone flags they already have in registers instead of
setting them up in the clone_args structure.

I don't think it's necessarily wrong for the interface to show some
effects of legacy models, as long as we don't have those kinds of "if
(is_clone6)" nasties.

                   Linus

^ permalink raw reply

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Linus Torvalds @ 2019-05-27 18:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Arnd Bergmann, linux-ia64, Al Viro, Linux Kernel Mailing List,
	Jann Horn, Florian Weimer, Oleg Nesterov, David Howells,
	Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers
In-Reply-To: <20190527123414.rv2r6g6de6y3ay6w@brauner.io>

On Mon, May 27, 2019 at 5:34 AM Christian Brauner <christian@brauner.io> wrote:
>
> Afaict, stack_size is *only* used on ia64:

That's because ia64 "stacks" are an odd non-stack thing (like so much
of the architecture).

In computer science, a stack is a FIFO that grows/shrinks according to
use. In practical implementations, it also has a direction, but the
"size" is basically not relevant if you just allow it to grow
dynamically. The key word here being "dynamically": the stack size is
inherently a dynamic thing.

So you don't really need a "stack size". The whole concept doesn't
make sense, outside of the obvious maximum limit things (ie
RLIMIT_STACK) and simply just hitting other allocations.

But ia64 is "special".

The ia64 stack isn't actually a stack. It's *two* stacks, growing in
opposite directions. One for the hardware spilling of the register
state and call frame ("backing store"), and one for the traditional
software stack.

So on ia64, the stack size suddenly becomes a fixed thing, because
it's not a dynamically growing single stack that grows in one
direction, it's literally a fixed virtual area that has two different
stacks growing towards each other.

Btw, don't get me wrong. Two stacks can be a good thing, and a lot of
security people want to have two stacks - one for actual call frame
data, and a separate one for automatic stack variables that have their
address taken.

Having separate stacks avoids the whole traditional stack smash model
(well, it avoids the one that overwrites the return frame - you can
still possibly have security issues because one function smashes the
automatic stack of a caller and then cause the caller to be confused
and do something insecure).

And the ia64 double stack kind of works that way automatically. So
"double stack" very much isn't wrong per se, but doing it the way ia64
did was too inflexible and the register stack (and rotation) was and
is just a bad idea.

Two stacks without the hw register renaming and flushing can be
lovely, and can even merit some hw support (ie the whole "Shadow
stack" model).

              Linus

^ permalink raw reply

* [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Amir Goldstein @ 2019-05-27 17:26 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara
  Cc: Darrick J . Wong, Dave Chinner, Chris Mason, Al Viro,
	linux-fsdevel, linux-xfs, linux-ext4, linux-btrfs, linux-api

New link flags to request "atomic" link.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Guys,

Following our discussions on LSF/MM and beyond [1][2], here is
an RFC documentation patch.

Ted, I know we discussed limiting the API for linking an O_TMPFILE
to avert the hardlinks issue, but I decided it would be better to
document the hardlinks non-guaranty instead. This will allow me to
replicate the same semantics and documentation to renameat(2).
Let me know how that works out for you.

I also decided to try out two separate flags for data and metadata.
I do not find any of those flags very useful without the other, but
documenting them seprately was easier, because of the fsync/fdatasync
reference.  In the end, we are trying to solve a social engineering
problem, so this is the least confusing way I could think of to describe
the new API.

First implementation of AT_ATOMIC_METADATA is expected to be
noop for xfs/ext4 and probably fsync for btrfs.

First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.

Thoughts?

Amir.

[1] https://lwn.net/Articles/789038/
[2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/

 man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/man2/link.2 b/man2/link.2
index 649ba00c7..15c24703e 100644
--- a/man2/link.2
+++ b/man2/link.2
@@ -184,6 +184,57 @@ See
 .BR openat (2)
 for an explanation of the need for
 .BR linkat ().
+.TP
+.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old inode metadata, such as out dated time
+stamps or missing extended attributes.
+One way to prevent this is to call
+.BR fsync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old inode metadata
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode metadata is persisted to disk.
+Specifically, if a file has hardlinks, the existance of the linked name after
+a system crash does
+.BR NOT
+guaranty that any of the other file names exist, nor that the last observed
+value of
+.I st_nlink
+(see
+.BR stat (2))
+has persisted.
+.TP
+.BR AT_ATOMIC_DATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old data or missing data.
+One way to prevent this is to call
+.BR fdatasync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old data
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode data is persisted to disk.
+.TP
 .SH RETURN VALUE
 On success, zero is returned.
 On error, \-1 is returned, and
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-27 14:39 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190527142156.GE1658@dhcp22.suse.cz>

On Mon 27-05-19 16:21:56, Michal Hocko wrote:
> On Mon 27-05-19 16:12:23, Michal Hocko wrote:
> > [Cc linux-api. Please always cc this list when proposing a new user
> >  visible api. Keeping the rest of the email intact for reference]
> > 
> > On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> [...]
> > > This implements manual kswapd-style memory reclaim initiated by userspace.
> > > It reclaims both physical memory and cgroup pages. It works in context of
> > > task who calls syscall madvise thus cpu time is accounted correctly.
> 
> I do not follow. Does this mean that the madvise always reclaims from
> the memcg the process is member of?

OK, I've had a quick look at the implementation (the semantic should be
clear from the patch descrition btw.) and it goes all the way up the
hierarchy and finally try to impose the same limit to the global state.
This doesn't really make much sense to me. For few reasons.

First of all it breaks isolation where one subgroup can influence a
different hierarchy via parent reclaim.

I also have a problem with conflating the global and memcg states. Does
it really make any sense to have the same target to the global state
as per-memcg? How are you supposed to use this interface to shrink a
particular memcg or for the global situation with a proportional
distribution to all memcgs?

There also doens't seem to be anything about security model for this
operation. There is no capability check from a quick look. Is it really
safe to expose such a functionality for a common user?

Last but not least, I am not really convinced that madvise is a proper
interface. It stretches the API which is address range based and it has
per-process implications.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Konstantin Khlebnikov @ 2019-05-27 14:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190527142156.GE1658@dhcp22.suse.cz>

On 27.05.2019 17:21, Michal Hocko wrote:
> On Mon 27-05-19 16:12:23, Michal Hocko wrote:
>> [Cc linux-api. Please always cc this list when proposing a new user
>>   visible api. Keeping the rest of the email intact for reference]
>>
>> On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> [...]
>>> This implements manual kswapd-style memory reclaim initiated by userspace.
>>> It reclaims both physical memory and cgroup pages. It works in context of
>>> task who calls syscall madvise thus cpu time is accounted correctly.
> 
> I do not follow. Does this mean that the madvise always reclaims from
> the memcg the process is member of?
> 

First it reclaims in its own memcg while limit - usage < requested.
Then repeats this in parent memcg and so on. And at least pokes global
direct reclaimer while system wide free memory is less than requested.

So, if machine is divided into containers without overcommit global
reclaim will never happens - memcg will free enough memory.

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-27 14:21 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <20190527141223.GD1658@dhcp22.suse.cz>

On Mon 27-05-19 16:12:23, Michal Hocko wrote:
> [Cc linux-api. Please always cc this list when proposing a new user
>  visible api. Keeping the rest of the email intact for reference]
> 
> On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
[...]
> > This implements manual kswapd-style memory reclaim initiated by userspace.
> > It reclaims both physical memory and cgroup pages. It works in context of
> > task who calls syscall madvise thus cpu time is accounted correctly.

I do not follow. Does this mean that the madvise always reclaims from
the memcg the process is member of?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
From: Michal Hocko @ 2019-05-27 14:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin, linux-api
In-Reply-To: <155895155861.2824.318013775811596173.stgit@buzz>

[Cc linux-api. Please always cc this list when proposing a new user
 visible api. Keeping the rest of the email intact for reference]

On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> Memory cgroup has no background memory reclaimer. Reclaiming after passing
> high-limit blocks task because works synchronously in task-work.
> 
> This implements manual kswapd-style memory reclaim initiated by userspace.
> It reclaims both physical memory and cgroup pages. It works in context of
> task who calls syscall madvise thus cpu time is accounted correctly.
> 
> Interface:
> 
> ret = madvise(ptr, size, MADV_STOCKPILE)
> 
> Returns:
>   0         - ok, free memory >= size
>   -EINVAL   - not supported
>   -ENOMEM   - not enough memory/cgroup limit
>   -EINTR    - interrupted by pending signal
>   -EAGAIN   - cannot reclaim enough memory
> 
> Argument 'size' is interpreted size of required free memory.
> Implementation triggers direct reclaim until amount of free memory is
> lower than that size. Argument 'ptr' could points to vma for specifying
> numa allocation policy, right now should be NULL.
> 
> Usage scenario: independent thread or standalone daemon estimates rate of
> allocations and calls MADV_STOCKPILE in loop to prepare free pages.
> Thus fast path avoids allocation latency induced by direct reclaim.
> 
> We are using this embedded into memory allocator based on MADV_FREE.
> 
> 
> Demonstration in memory cgroup with limit 1G:
> 
> touch zero
> truncate -s 5G zero
> 
> Without stockpile:
> 
> perf stat -e vmscan:* md5sum zero
> 
>  Performance counter stats for 'md5sum zero':
> 
>                  0      vmscan:mm_vmscan_kswapd_sleep
>                  0      vmscan:mm_vmscan_kswapd_wake
>                  0      vmscan:mm_vmscan_wakeup_kswapd
>                  0      vmscan:mm_vmscan_direct_reclaim_begin
>              10147      vmscan:mm_vmscan_memcg_reclaim_begin
>                  0      vmscan:mm_vmscan_memcg_softlimit_reclaim_begin
>                  0      vmscan:mm_vmscan_direct_reclaim_end
>              10147      vmscan:mm_vmscan_memcg_reclaim_end
>                  0      vmscan:mm_vmscan_memcg_softlimit_reclaim_end
>              99910      vmscan:mm_shrink_slab_start
>              99910      vmscan:mm_shrink_slab_end
>              39654      vmscan:mm_vmscan_lru_isolate
>                  0      vmscan:mm_vmscan_writepage
>              39652      vmscan:mm_vmscan_lru_shrink_inactive
>                  2      vmscan:mm_vmscan_lru_shrink_active
>              19982      vmscan:mm_vmscan_inactive_list_is_low
> 
>       10.886832585 seconds time elapsed
> 
>        8.928366000 seconds user
>        1.935212000 seconds sys
> 
> With stockpile:
> 
> stockpile 100 10 &   # up to 100M every 10ms
> perf stat -e vmscan:* md5sum zero
> 
>  Performance counter stats for 'md5sum zero':
> 
>                  0      vmscan:mm_vmscan_kswapd_sleep
>                  0      vmscan:mm_vmscan_kswapd_wake
>                  0      vmscan:mm_vmscan_wakeup_kswapd
>                  0      vmscan:mm_vmscan_direct_reclaim_begin
>                  0      vmscan:mm_vmscan_memcg_reclaim_begin
>                  0      vmscan:mm_vmscan_memcg_softlimit_reclaim_begin
>                  0      vmscan:mm_vmscan_direct_reclaim_end
>                  0      vmscan:mm_vmscan_memcg_reclaim_end
>                  0      vmscan:mm_vmscan_memcg_softlimit_reclaim_end
>                  0      vmscan:mm_shrink_slab_start
>                  0      vmscan:mm_shrink_slab_end
>                  0      vmscan:mm_vmscan_lru_isolate
>                  0      vmscan:mm_vmscan_writepage
>                  0      vmscan:mm_vmscan_lru_shrink_inactive
>                  0      vmscan:mm_vmscan_lru_shrink_active
>                  0      vmscan:mm_vmscan_inactive_list_is_low
> 
>       10.469776675 seconds time elapsed
> 
>        8.976261000 seconds user
>        1.491378000 seconds sys
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  include/linux/memcontrol.h             |    6 +++++
>  include/uapi/asm-generic/mman-common.h |    2 ++
>  mm/madvise.c                           |   39 ++++++++++++++++++++++++++++++
>  mm/memcontrol.c                        |   41 ++++++++++++++++++++++++++++++++
>  tools/vm/Makefile                      |    2 +-
>  tools/vm/stockpile.c                   |   30 +++++++++++++++++++++++
>  6 files changed, 119 insertions(+), 1 deletion(-)
>  create mode 100644 tools/vm/stockpile.c
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bc74d6a4407c..25325f18ad55 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -517,6 +517,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
>  }
>  
>  void mem_cgroup_handle_over_high(void);
> +int mem_cgroup_stockpile(unsigned long goal_pages);
>  
>  unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
>  
> @@ -968,6 +969,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>  
> +static inline int mem_cgroup_stockpile(unsigned long goal_page)
> +{
> +	return 0;
> +}
> +
>  static inline void mem_cgroup_enter_user_fault(void)
>  {
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index abd238d0f7a4..675145864fee 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -64,6 +64,8 @@
>  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
>  
> +#define MADV_STOCKPILE	20		/* stockpile free pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..f908b08ecc9f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -686,6 +686,41 @@ static int madvise_inject_error(int behavior,
>  }
>  #endif
>  
> +static long madvise_stockpile(unsigned long start, size_t len)
> +{
> +	unsigned long goal_pages, progress;
> +	struct zonelist *zonelist;
> +	int ret;
> +
> +	if (start)
> +		return -EINVAL;
> +
> +	goal_pages = len >> PAGE_SHIFT;
> +
> +	if (goal_pages > totalram_pages() - totalreserve_pages)
> +		return -ENOMEM;
> +
> +	ret = mem_cgroup_stockpile(goal_pages);
> +	if (ret)
> +		return ret;
> +
> +	/* TODO: use vma mempolicy */
> +	zonelist = node_zonelist(numa_node_id(), GFP_HIGHUSER);
> +
> +	while (global_zone_page_state(NR_FREE_PAGES) <
> +			goal_pages + totalreserve_pages) {
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		progress = try_to_free_pages(zonelist, 0, GFP_HIGHUSER, NULL);
> +		if (!progress)
> +			return -EAGAIN;
> +	}
> +
> +	return 0;
> +}
> +
>  static long
>  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		unsigned long start, unsigned long end, int behavior)
> @@ -728,6 +763,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DODUMP:
>  	case MADV_WIPEONFORK:
>  	case MADV_KEEPONFORK:
> +	case MADV_STOCKPILE:
>  #ifdef CONFIG_MEMORY_FAILURE
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
> @@ -834,6 +870,9 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  		return madvise_inject_error(behavior, start, start + len_in);
>  #endif
>  
> +	if (behavior == MADV_STOCKPILE)
> +		return madvise_stockpile(start, len);
> +
>  	write = madvise_need_mmap_write(behavior);
>  	if (write) {
>  		if (down_write_killable(&current->mm->mmap_sem))
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e50a2db5b4ff..dc23dc6bbeb3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2276,6 +2276,47 @@ void mem_cgroup_handle_over_high(void)
>  	current->memcg_nr_pages_over_high = 0;
>  }
>  
> +int mem_cgroup_stockpile(unsigned long goal_pages)
> +{
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	unsigned long limit, nr_free, progress;
> +	struct mem_cgroup *memcg, *pos;
> +	int ret = 0;
> +
> +	pos = memcg = get_mem_cgroup_from_mm(current->mm);
> +
> +retry:
> +	if (signal_pending(current)) {
> +		ret = -EINTR;
> +		goto out;
> +	}
> +
> +	limit = min(pos->memory.max, pos->high);
> +	if (goal_pages > limit) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	nr_free = limit - page_counter_read(&pos->memory);
> +	if ((long)nr_free < (long)goal_pages) {
> +		progress = try_to_free_mem_cgroup_pages(pos,
> +				goal_pages - nr_free, GFP_HIGHUSER, true);
> +		if (progress || nr_retries--)
> +			goto retry;
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
> +	nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	pos = parent_mem_cgroup(pos);
> +	if (pos)
> +		goto retry;
> +
> +out:
> +	css_put(&memcg->css);
> +	return ret;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> diff --git a/tools/vm/Makefile b/tools/vm/Makefile
> index 20f6cf04377f..e5b5bc0d9421 100644
> --- a/tools/vm/Makefile
> +++ b/tools/vm/Makefile
> @@ -1,7 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  # Makefile for vm tools
>  #
> -TARGETS=page-types slabinfo page_owner_sort
> +TARGETS=page-types slabinfo page_owner_sort stockpile
>  
>  LIB_DIR = ../lib/api
>  LIBS = $(LIB_DIR)/libapi.a
> diff --git a/tools/vm/stockpile.c b/tools/vm/stockpile.c
> new file mode 100644
> index 000000000000..245e24f293ec
> --- /dev/null
> +++ b/tools/vm/stockpile.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <sys/mman.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <err.h>
> +#include <errno.h>
> +
> +#ifndef MADV_STOCKPILE
> +# define MADV_STOCKPILE	20
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> +	int interval;
> +	size_t size;
> +	int ret;
> +
> +	if (argc != 3)
> +		errx(1, "usage: %s <size_mb> <interval_ms>", argv[0]);
> +
> +	size = atol(argv[1]) << 20;
> +	interval = atoi(argv[2]) * 1000;
> +
> +	while (1) {
> +		ret = madvise(NULL, size, MADV_STOCKPILE);
> +		if (ret && errno != EAGAIN)
> +			err(2, "madvise(NULL, %zu, MADV_STOCKPILE)", size);
> +		usleep(interval);
> +	}
> +}
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-05-27 13:36 UTC (permalink / raw)
  To: Greg KH
  Cc: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel,
	linux-api
In-Reply-To: <20190527073332.GA13782@kroah.com>

On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> > This patch implements an extension of eventfd to define file descriptors 
> > whose I/O events can be generated at user level. These file descriptors
> > trigger notifications for [p]select/[p]poll/epoll.
> > 
> > This feature is useful for user-level implementations of network stacks
> > or virtual device drivers as libraries.
> 
> How can this be used to create a "virtual device driver"?  Do you have
> any examples of this new interface being used anywhere?

Networking programs use system calls implementing the Berkeley sockets API:
socket, accept, connect, listen, recv*, send* etc.  Programs dealing with a
device use system calls like open, read, write, ioctl etc.

When somebody wants to write a library able to behave like a network stack (say
lwipv6, picotcp) or a device, they can implement functions like my_socket,
my_accept, my_open or my_ioctl, as drop-in replacement of their system
call counterpart.  (It is also possible to use dynamic library magic to
rename/divert the system call requests to use their 'virtual'
implementation provided by the library: socket maps to my_socket, recv
to my_recv etc).

In this way portability and compatibility is easier, using a well known API
instead of inventing new ones.

Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll.  These system calls can refer at the same time
to file descriptors created by 'real' system calls like socket, open, signalfd... 
and to file descriptors returned by my_open, your_socket.

> 
> Also, meta-comment, you should provide some sort of test to kselftests
> for your new feature so that it can actually be tested, as well as a man
> page update (separately).
Sure. I'll do it ASAP, let me collect suggestions first.

> 
> > Development and porting of code often requires to find the way to wait for I/O
> > events both coming from file descriptors and generated by user-level code (e.g.
> > user-implemented net stacks or drivers).  While it is possible to provide a
> > partial support (e.g. using pipes or socketpairs), a clean and complete
> > solution is still missing (as far as I have seen); e.g. I have not seen any
> > clean way to generate EPOLLPRI, EPOLLERR, etc.
> 
> What's wrong with pipes or sockets for stuff like this?  Why is epoll
> required?
Example:
suppose there is an application waiting for a TCP OOB message. It uses poll to wait 
for POLLPRI and then reads the message (e.g. by 'recv').
If I want to port that application to use a network stack implemented as a library
I have to rewrite the code about 'poll' as it is not possible to receive a POLLPRI.
>From a pipe I can just receive a POLLIN, I have to encode in an external data structure
any further information.
Using EFD_VPOLL the solution is straightforward: the function mysocket (used in place
of socket to create a file descripor behaving as a 'real'socket) returns a file
descriptor created by eventfd/EFD_VPOLL, so the poll system call can be left
unmodified in the code. When the OOB message is available the library can trigger
an EPOLLPRI and the message can be received using my_recv.

> 
...omissis...
> > 
> > Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
> > ---
> >  fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
> >  include/linux/eventfd.h        |   7 +-
> >  include/uapi/linux/eventpoll.h |   2 +
> >  3 files changed, 116 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index 8aa0ea8c55e8..f83b7d02307e 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -3,6 +3,7 @@
> >   *  fs/eventfd.c
> >   *
> >   *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> > + *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>
> 
> No need for this line, that's what the git history shows.
okay

> 
> >   *
> >   */
> >  
> > @@ -30,12 +31,24 @@ struct eventfd_ctx {
> >  	struct kref kref;
> >  	wait_queue_head_t wqh;
> >  	/*
> > -	 * Every time that a write(2) is performed on an eventfd, the
> > -	 * value of the __u64 being written is added to "count" and a
> > -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> > -	 * value to userspace, and will reset "count" to zero. The kernel
> > -	 * side eventfd_signal() also, adds to the "count" counter and
> > -	 * issue a wakeup.
> > +	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
> > +	 *   Every time that a write(2) is performed on an eventfd, the
> > +	 *   value of the __u64 being written is added to "count" and a
> > +	 *   wakeup is performed on "wqh". A read(2) will return the "count"
> > +	 *   value to userspace, and will reset "count" to zero (or decrement
> > +	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
> > +	 *   side eventfd_signal() also, adds to the "count" counter and
> > +	 *   issue a wakeup.
> > +	 *
> > +	 * If the EFD_VPOLL flag was set at eventfd creation:
> > +	 *   count is the set of pending EPOLL events.
> > +	 *   read(2) returns the current value of count.
> > +	 *   The argument of write(2) is an 8-byte integer:
> > +	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
> > +	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
> > +	 *   events to be added, deleted to the current set of pending events.
> > +	 *   (i.e. which bits of "count" must be set or reset).
> > +	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.
> 
> Ugh, overloading stuff, this is increased complexity, do you _have_ to
> do it this way?
There can be other approaches: e.g. two specific new system calls like "vpollfd_create" and "vpollfd_ctl".
Their signature could be:
  int vpollfd_create(unsigned int init_events, int flags);
  where flags are the usual NONBLOCK/CLOEXEC
  int vpollfd_ctl(int fd, int op, unsigned int events);
  where op can be VPOLL_ADDEVENTS, VPOLL_DELEVENTS, VPOLL_MODEVENTS

It possible to reimplement the patch this way. It needs the definition of the new system calls.
I am proposing just a new tag for eventfd as eventfd purpose is conceptually close to the new feature.
Eventfd creates a file descriptor which generates events. The default eventfd mode uses counters while
EFD_VPOLL uses event flags.  The new feature can be implemented on eventfd with a very limited
impact on the kernel core code.
Instead of syscalls, the vpollfd_create/vpollfd_ctl API could be provided by the glibc as (very simple) 
library functions, as it is the case for eventfd_read/eventfd_write in /usr/include/sys/eventfd.h

It seems to me that EFD_VPOLL is just a different MODE of the same system call, not a real overloading.
Something similar happened to seccomp: the original implementation by Arcangeli is now
MODE_STRICT, the MODE_FILTER has been added in a second time as an extension.

> 
> >  	 */
> >  	__u64 count;
> >  	unsigned int flags;
> > @@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
> >  	return res;
> >  }
> >  
> > +static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	__poll_t events = 0;
> > +	u64 count;
> > +
> > +	poll_wait(file, &ctx->wqh, wait);
> > +
> > +	count = READ_ONCE(ctx->count);
> > +
> > +	events = (count & EPOLLALLMASK);
> 
> Why mask?
Because the four most significant bits are not events but policy modifiers (mainly for
		multithreading support):
#define EPOLLEXCLUSIVE  ((__force __poll_t)(1U << 28))
#define EPOLLWAKEUP ((__force __poll_t)(1U << 29))
#define EPOLLONESHOT  ((__force __poll_t)(1U << 30))
#define EPOLLET   ((__force __poll_t)(1U << 31))
a file_operations poll function implementation should never generate these, isn't it?
> 
> > +
> > +	return events;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
> > +		size_t count, loff_t *ppos)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	ssize_t res;
> > +	__u64 ucnt = 0;
> > +
> > +	if (count < sizeof(ucnt))
> > +		return -EINVAL;
> 
> What is magic about the size of a __u64 here?
Just for consistency with the non EFD_VPOLL eventfd.

> 
> > +	res = sizeof(ucnt);
> > +	ucnt = READ_ONCE(ctx->count);
> > +	if (put_user(ucnt, (__u64 __user *)buf))
> > +		return -EFAULT;
> > +
> > +	return res;
> > +}
> > +
> > +static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
> > +		size_t count, loff_t *ppos)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	ssize_t res;
> > +	__u64 ucnt;
> > +	__u32 events;
> > +
> > +	if (count < sizeof(ucnt))
> > +		return -EINVAL;
> 
> Why can it not be less than 64?
This is the imeplementation of 'write'. The 64 bits include the 'command'
EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the most
significant 32 bits) and the set of events (in the lowest 32 bits).

> 
> > +	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
> > +		return -EFAULT;
> > +	spin_lock_irq(&ctx->wqh.lock);
> > +
> > +	events = ucnt & EPOLLALLMASK;
> > +	res = sizeof(ucnt);
> > +	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
> > +	case EFD_VPOLL_ADDEVENTS:
> > +		ctx->count |= events;
> > +		break;
> > +	case EFD_VPOLL_DELEVENTS:
> > +		ctx->count &= ~(events);
> > +		break;
> > +	case EFD_VPOLL_MODEVENTS:
> > +		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
> > +		break;
> > +	default:
> > +		res = -EINVAL;
> > +	}
> > +
> > +	/* wake up waiting threads */
> > +	if (res >= 0 && waitqueue_active(&ctx->wqh))
> > +		wake_up_locked_poll(&ctx->wqh, res);
> 
> Can you call this with a spinlock held?  I really don't remember, sorry,
> if so, nevermind, but you should check...

I would have done the same objection. eventfd_vpoll_write uses the same pattern
of code of eventfd_write, the difference is in the cause to unblock processes:
counter values vs. events. So this is as correct as eventfd_write is, isn't it? :)

> 
> > +
> > +	spin_unlock_irq(&ctx->wqh.lock);
> > +
> > +	return res;
> > +
> > +}
> > +
> >  #ifdef CONFIG_PROC_FS
> >  static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
> >  {
> > @@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
> >  	.llseek		= noop_llseek,
> >  };
> >  
> > +static const struct file_operations eventfd_vpoll_fops = {
> > +#ifdef CONFIG_PROC_FS
> > +	.show_fdinfo	= eventfd_show_fdinfo,
> > +#endif
> > +	.release	= eventfd_release,
> > +	.poll		= eventfd_vpoll_poll,
> > +	.read		= eventfd_vpoll_read,
> > +	.write		= eventfd_vpoll_write,
> > +	.llseek		= noop_llseek,
> > +};
> > +
> >  /**
> >   * eventfd_fget - Acquire a reference of an eventfd file descriptor.
> >   * @fd: [in] Eventfd file descriptor.
> > @@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
> >  static int do_eventfd(unsigned int count, int flags)
> >  {
> >  	struct eventfd_ctx *ctx;
> > +	const struct file_operations *fops = &eventfd_fops;
> >  	int fd;
> >  
> >  	/* Check the EFD_* constants for consistency.  */
> > @@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
> >  	ctx->flags = flags;
> >  	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
> >  
> > -	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> > +	if (flags & EFD_VPOLL) {
> > +		fops = &eventfd_vpoll_fops;
> > +		ctx->count &= EPOLLALLMASK;
> > +	}
> > +	fd = anon_inode_getfd("[eventfd]", fops, ctx,
> >  			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
> >  	if (fd < 0)
> >  		eventfd_free_ctx(ctx);
> > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> > index ffcc7724ca21..63258cf29344 100644
> > --- a/include/linux/eventfd.h
> > +++ b/include/linux/eventfd.h
> > @@ -21,11 +21,16 @@
> >   * shared O_* flags.
> >   */
> >  #define EFD_SEMAPHORE (1 << 0)
> > +#define EFD_VPOLL (1 << 1)
> 
> BIT(1)?
It is for the sake of consistency with the rest of the header file,
I can change also the line above:
	#define EFD_SEMAPHORE BIT(0)
	#define EFD_VPOLL BIT(1)
The value has been chosen to fullfil the request written just above the patched statement:
/*
 * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
 * new flags, since they might collide with O_* ones. We want
 * to re-use O_* flags that couldn't possibly have a meaning
 * from eventfd, in order to leave a free define-space for
 * shared O_* flags.
 */
EFD_SEMAPHORE uses the same value of O_WRONLY, I decided to use the same value of O_RDWR.
Both should not have a meaning from eventfd.

> 
> >  #define EFD_CLOEXEC O_CLOEXEC
> >  #define EFD_NONBLOCK O_NONBLOCK
> >  
> >  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> > -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> > +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
> > +
> > +#define EFD_VPOLL_ADDEVENTS (1UL << 32)
> > +#define EFD_VPOLL_DELEVENTS (2UL << 32)
> > +#define EFD_VPOLL_MODEVENTS (3UL << 32)
> 
> Aren't these part of the uapi?  Why are they hidden in here?

You are right. There is not a uapi/linux/eventfd.h. EFD_SEMAPHORE is here, so for
consistency I added EFD_VPOLL/EFD_VPOLL_* here, too. (glibc provides the value of EFD_SEMAPHORE
in /usr/include/bits/eventfd.h).

> 
> >  
> >  struct eventfd_ctx;
> >  struct file;
> > diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> > index 8a3432d0f0dc..814de6d869c7 100644
> > --- a/include/uapi/linux/eventpoll.h
> > +++ b/include/uapi/linux/eventpoll.h
> > @@ -41,6 +41,8 @@
> >  #define EPOLLMSG	(__force __poll_t)0x00000400
> >  #define EPOLLRDHUP	(__force __poll_t)0x00002000
> >  
> > +#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)
> 
> Why is this part of the uapi?
It can be useful. As I have explained above it permits to split betweeen
event bits and behavioral flags in poll_t.

> thanks,
> greg k-h

many thanks to you
  renzo

^ permalink raw reply

* Re: [PATCH v2] mm: mlockall error for flag MCL_ONFAULT
From: Vlastimil Babka @ 2019-05-27 13:19 UTC (permalink / raw)
  To: Potyra, Stefan, Michal Hocko, Daniel Jordan
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jordan, Tobias,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	linux-api@vger.kernel.org
In-Reply-To: <20190527075333.GA6339@er01809n.ebgroup.elektrobit.com>

On 5/27/19 9:53 AM, Potyra, Stefan wrote:
> If mlockall() is called with only MCL_ONFAULT as flag,
> it removes any previously applied lockings and does
> nothing else.
> 
> This behavior is counter-intuitive and doesn't match the
> Linux man page.
> 
>   For mlockall():
> 
>   EINVAL Unknown  flags were specified or MCL_ONFAULT was specified with‐
>          out either MCL_FUTURE or MCL_CURRENT.
> 
> Consequently, return the error EINVAL, if only MCL_ONFAULT
> is passed. That way, applications will at least detect that
> they are calling mlockall() incorrectly.
> 
> Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
> Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks, shame we didn't catch it during review. Hope nobody will report
a regression.

> ---
>  mm/mlock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index e492a155c51a..03f39cbdd4c4 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -797,7 +797,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
>  	unsigned long lock_limit;
>  	int ret;
>  
> -	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)))
> +	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
> +	    flags == MCL_ONFAULT)
>  		return -EINVAL;
>  
>  	if (!can_do_mlock())
> 

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Michal Hocko @ 2019-05-27 12:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190527075811.GC6879@google.com>

On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > [cc linux-api]
> > > > 
> > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > In this configuration, userspace could handle different strategy for each
> > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > until memory pressure happens.
> > > > > 
> > > > > To support such strategy easier, this patch introduces
> > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > of (int behavior).
> > > > 
> > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > filtering from the userspace so why do we need any additional filtering?
> > > > 
> > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > either anonymous or file-backed.
> > > > > 
> > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > 
> > > > OK, so here is the reason you want that. The immediate question is why
> > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > work, all right, but less of an API to expose and that itself is a
> > > > strong argument against.
> > > 
> > > What I should do if we don't have such filter option is to enumerate all of
> > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > which would be painful for 2000+ vmas.
> > 
> > Painful is not an argument to add a new user API. If the existing API
> > suits the purpose then it should be used. If it is not usable, we can
> > think of a different way.
> 
> I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> It's never trivial.

This is not the only option. Have you tried to simply use
/proc/<pid>/map_files interface? This will provide you with all the file
backed mappings.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Christian Brauner @ 2019-05-27 12:34 UTC (permalink / raw)
  To: Arnd Bergmann, linux-ia64
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers
In-Reply-To: <CAK8P3a3q=5Ca0xoMp+kyCvOqNDRzDTgu28f+U8J-buMVcZcVaw@mail.gmail.com>

On Mon, May 27, 2019 at 02:28:33PM +0200, Arnd Bergmann wrote:
> On Mon, May 27, 2019 at 12:45 PM Christian Brauner <christian@brauner.io> wrote:
> > On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> > > On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > Wire up the clone6() call on x86.
> > > >
> > > > This patch only wires up clone6() on x86. Some of the arches look like they
> > > > need special assembly massaging and it is probably smarter if the
> > > > appropriate arch maintainers would do the actual wiring.
> > >
> > > Why do some architectures need special cases here? I'd prefer to have
> > > new system calls always get defined in a way that avoids this, and
> > > have a common entry point for everyone.
> > >
> > > Looking at the m68k sys_clone comment in
> > > arch/m68k/kernel/process.c, it seems that this was done as an
> > > optimization to deal with an inferior ABI. Similar code is present
> > > in h8300, ia64, nios2, and sparc. If all of them just do this to
> > > shave off a few cycles from the system call entry, I really
> > > couldn't care less.
> >
> > I'm happy to wire all arches up at the same time in the next revision. I
> > just wasn't sure why some of them were assemblying the living hell out
> > of clone; especially ia64. I really didn't want to bother touching all
> > of this just for an initial RFC.
> 
> Don't worry about doing all architectures for the RFC, I mainly want this
> to be done consistently by the time it gets into linux-next.
> 
> One thing to figure out though is whether we need the stack_size argument
> that a couple of architectures pass. It's usually hardwired to zero,
> but not all the time, and I don't know the history of this.

Afaict, stack_size is *only* used on ia64:

/*
 * sys_clone2(u64 flags, u64 ustack_base, u64 ustack_size, u64 parent_tidptr, u64 child_tidptr,
 *	      u64 tls)
 */
GLOBAL_ENTRY(sys_clone2)
	/*
	 * Allocate 8 input registers since ptrace() may clobber them
	 */
	.prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
	alloc r16=ar.pfs,8,2,6,0
	DO_SAVE_SWITCH_STACK
	adds r2=PT(R16)+IA64_SWITCH_STACK_SIZE+16,sp
	mov loc0=rp
	mov loc1=r16				// save ar.pfs across do_fork
	.body
	mov out1=in1
	mov out2=in2
	tbit.nz p6,p0=in0,CLONE_SETTLS_BIT
	mov out3=in3	// parent_tidptr: valid only w/CLONE_PARENT_SETTID
	;;
(p6)	st8 [r2]=in5				// store TLS in r16 for copy_thread()
	mov out4=in4	// child_tidptr:  valid only w/CLONE_CHILD_SETTID or CLONE_CHILD_CLEARTID
	mov out0=in0				// out0 = clone_flags
	br.call.sptk.many rp=do_fork
.ret1:	.restore sp
	adds sp=IA64_SWITCH_STACK_SIZE,sp	// pop the switch stack
	mov ar.pfs=loc1
	mov rp=loc0
	br.ret.sptk.many rp
END(sys_clone2)

I'm not sure if this needs to be because of architectural constraints or
if it just is a historic artifact.
(Ccing ia64 now to see what they have to say.)

Christian

^ permalink raw reply

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Arnd Bergmann @ 2019-05-27 12:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers
In-Reply-To: <20190527104528.cao7wamuj4vduh3u@brauner.io>

On Mon, May 27, 2019 at 12:45 PM Christian Brauner <christian@brauner.io> wrote:
> On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> > On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > Wire up the clone6() call on x86.
> > >
> > > This patch only wires up clone6() on x86. Some of the arches look like they
> > > need special assembly massaging and it is probably smarter if the
> > > appropriate arch maintainers would do the actual wiring.
> >
> > Why do some architectures need special cases here? I'd prefer to have
> > new system calls always get defined in a way that avoids this, and
> > have a common entry point for everyone.
> >
> > Looking at the m68k sys_clone comment in
> > arch/m68k/kernel/process.c, it seems that this was done as an
> > optimization to deal with an inferior ABI. Similar code is present
> > in h8300, ia64, nios2, and sparc. If all of them just do this to
> > shave off a few cycles from the system call entry, I really
> > couldn't care less.
>
> I'm happy to wire all arches up at the same time in the next revision. I
> just wasn't sure why some of them were assemblying the living hell out
> of clone; especially ia64. I really didn't want to bother touching all
> of this just for an initial RFC.

Don't worry about doing all architectures for the RFC, I mainly want this
to be done consistently by the time it gets into linux-next.

One thing to figure out though is whether we need the stack_size argument
that a couple of architectures pass. It's usually hardwired to zero,
but not all the time, and I don't know the history of this.

       Arnd

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Florian Weimer @ 2019-05-27 11:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Carlos O'Donell, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel, linux-api
In-Reply-To: <20190503184219.19266-2-mathieu.desnoyers@efficios.com>

* Mathieu Desnoyers:

> +/* volatile because fields can be read/updated by the kernel.  */
> +__thread volatile struct rseq __rseq_abi = {
> +  .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> +};

As I've explained repeatedly, the volatile qualifier is wrong because it
is impossible to get rid of it.  (Accessing an object declared volatile
using non-volatile pointers is undefined.)  Code using __rseq_abi should
use relaxed MO atomics or signal fences/compiler barriers, as
appropriate.

> +/* Advertise Restartable Sequences registration ownership across
> +   application and shared libraries.
> +
> +   Libraries and applications must check whether this variable is zero or
> +   non-zero if they wish to perform rseq registration on their own. If it
> +   is zero, it means restartable sequence registration is not handled, and
> +   the library or application is free to perform rseq registration. In
> +   that case, the library or application is taking ownership of rseq
> +   registration, and may set __rseq_handled to 1. It may then set it back
> +   to 0 after it completes unregistering rseq.
> +
> +   If __rseq_handled is found to be non-zero, it means that another
> +   library (or the application) is currently handling rseq registration.
> +
> +   Typical use of __rseq_handled is within library constructors and
> +   destructors, or at program startup.  */
> +
> +int __rseq_handled;

It's not clear to me whether the intent is that __rseq_handled reflects
kernel support for rseq or not.  Currently, it only tells us whether
glibc has been built with rseq support or not.  It does not reflect
kernel support.  I'm still not convinced that this symbol is necessary,
especially if we mandate a kernel header version which defines __NR_rseq
for building glibc (which may happen due to the time64_t work).

Furthermore, the reference to ELF constructors is misleading.  I believe
the code you added to __libc_start_main to initialize __rseq_handled and
register __seq_abi with the kernel runs *after* ELF constructors have
executed (and not at all if the main program is written in Go, alas).
All initialization activity for the shared case needs to happen in
elf/rtld.c or called from there, probably as part of the security
initialization code or thereabouts.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Christian Brauner @ 2019-05-27 10:45 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers
In-Reply-To: <CAK8P3a1Ltsna_rtKxhMU7X0t=UOXDA75tKpph6s=OZ4itJe7VQ@mail.gmail.com>

On Mon, May 27, 2019 at 12:02:37PM +0200, Arnd Bergmann wrote:
> On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > Wire up the clone6() call on x86.
> >
> > This patch only wires up clone6() on x86. Some of the arches look like they
> > need special assembly massaging and it is probably smarter if the
> > appropriate arch maintainers would do the actual wiring.
> 
> Why do some architectures need special cases here? I'd prefer to have
> new system calls always get defined in a way that avoids this, and
> have a common entry point for everyone.
> 
> Looking at the m68k sys_clone comment in
> arch/m68k/kernel/process.c, it seems that this was done as an
> optimization to deal with an inferior ABI. Similar code is present
> in h8300, ia64, nios2, and sparc. If all of them just do this to
> shave off a few cycles from the system call entry, I really
> couldn't care less.

I'm happy to wire all arches up at the same time in the next revision. I
just wasn't sure why some of them were assemblying the living hell out
of clone; especially ia64. I really didn't want to bother touching all
of this just for an initial RFC.

Christian

^ permalink raw reply

* Re: [PATCH 1/2] fork: add clone6
From: Christian Brauner @ 2019-05-27 10:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com>

Moin,

Wasn't near a computer yesterday so sorry for the late reply. :) I
(I should note that this was supposed to be prefixed with RFC. But *shrug*.)

On Sun, May 26, 2019 at 09:50:32AM -0700, Linus Torvalds wrote:
> On Sun, May 26, 2019 at 3:27 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > This adds the clone6 system call.
> 
> No, this is not acceptable.
> 
> > +       struct clone6_args args = {
> 
> First of all, we don't pass in "clone6_args" to the actual implementation.
> 
> Passing in lots of args as a structure is fine. But it damn well isn't
> a "clone6" structure. It's just a "clone_args" inside the kernel, and
> there should be a separate clone_uapi_args for the user/kernel
> interface.

Yeah, that makes sense. This is similar to what we recently did for
signals, i.e. kernel_siginfo_t as the kernel internal struct and only
expose siginfo_t to userspace.

> 
> The user interface (in the uapi file) may be called "clone6_args", but
> that is *not* then what we pass around inside the kernel.
> 
> But even for the uapi version, the "clone6" name doesn't make any
> sense as a name to begin with. It's not the sixth revision of clone,

Ok, no argument about the argument struct.

But for the name of the actual syscall itself we originally used
the revision number aka clone3() (because of clone2 on ia64).
We chose clone6() after we asked around what the current convention is:
revision number or argument number. And we were told that it is common
to use the arg number. And from the syscall list it looked like people
were right:
- accept4(/* 4 args */)
- dup2(/* 2 args */)
- dup3(/* 3 args */)
- eventfd2(/* 2 args */)
- pipe2(/* 2 args */)
- pselect6(/* 6 args, including structs */)
- signalfd4(/* 4 args, one of them a struct */)
- umount2(/* 2 args */)
- wait3(/* 3 args, one of them a struct */)
- wait4(/* 4 args, one of them a struct */)

Anyway, we can name it clone3() for the next revision.
(The "argument number" naming scheme struck us as kind of odd. Jann
 pointed out that if we ever have to have another syscall that already
 takes 6 arguments what would it be named "pselect6.1"?)

> and that clone6 structure isn't even "six arguments", because it has
> lots of other arguments (with the extra flags registers you did, but
> I'll get to that later).
> 
> Finally, the actual definition of that thing is wrong for a uapi interface too:
> 
>    struct clone6_args {
>           __s32 pidfd;
>           __aligned_u64 parent_tidptr;
>           __aligned_u64 child_tidptr;
>           ...
> 
> because now it has a hole in that structure definition because of
> alignment issues. So make "pidfd" be 64-bit too. You clearly tried to
> make this be compat-aware etc, but we really don't want to have parts
> of user structures with random padding that we then don't check.

Noted.

> 
> So the *user* visible structure should be full of those __aligned_u64
> to make sure everything is aligned and doesn't care about compat
> models.
> 
> But the *kernel* structure that we use should be nice and tailored for
> kernel use, and use proper kernel types.

Noted.

> 
> And related to that disgusting thing:
> 
> > -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> > +extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);
> 
> This is the correct thing to do (except for the "clone6" name), but:
> 
> >  static __latent_entropy struct task_struct *copy_process(
> > -                                       unsigned long clone_flags,
> > -                                       unsigned long stack_start,
> > -                                       unsigned long stack_size,
> > -                                       int __user *parent_tidptr,
> > -                                       int __user *child_tidptr,
> > +                                       u64 clone_flags,
> >                                         struct pid *pid,
> >                                         int trace,
> > -                                       unsigned long tls,
> > -                                       int node)
> > +                                       int node,
> > +                                       struct clone6_args *args,
> > +                                       bool is_clone6)
> 
> But this is absolutely wrong.
> 
> That "bool is_clone6" is garbage.
> 
> The in-kernel "struict clone_args" should just be everything that
> clone needs to know.

This goes back to the missing split between an in-kernel struct and uapi
struct, but sure.

> 
> So the in-kernel "struct clone_args" should never ever need some
> "is_clone6" boolean to specify how you then treat the arguments.
> 
> Stuff like this:
> 
> > +       int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);
> 
> should have been done in the stubs that set up the "struct clone_args"
> thing, not in copy_process().

Noted.

> 
> So all those
> 
>         if (is_clone6) ...
> 
> things need to go away, and it just needs to be the caller (who knows
> what kind of clone call it is) setting up the clone_args properly,
> depending on the *different* user interfaces.

Right.

> 
> And no, we don't do crazy stuff like this either:
> 
> +SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
> +                       unsigned int, flags1,
> +                       unsigned int, flags2,
> +                       unsigned int, flags3,
> +                       unsigned int, flags4,
> +                       unsigned int, flags5)
> 
> where this is wrong because it randomly just decides that everything
> is flags (BS), and then doubles down on stupidity by making them
> "unsigned int", so that the tests of the flags actually don't test the

For all system calls that use flag arguments so far "unsigned int" was
the way to go because of 32bit and because of prior convention.
We debated having this be a 64 bit. But honestly, this is one of those
junctions where it becomes a matter of: "have you been around long
enough to simply ignore prior conventions?".

> upper bits anyway.
> 
> Why would you reserve 5 words of flags for future use when you have a
> whole structure in user space that you _didn't_ reserve anything in?
> 
> So remove all those random flags, put ONE of them in the structure you
> already have (as a "__aligned_u64", so that you actually get 64 bits,
> not the 32 in "unsigned int"), and then perhaps you can add *one*
> other register argument, which is the *size* of the structure that
> user space is passing in.

Sure, that was what I originally had in mind but the valid point was
made that passing all flags arguments in registers makes it easy for
seccomp to filter them.
It is a cleaner design to do it in the struct, sure. I don't have
quarrels with moving the flags into the struct itself.

> 
> That way, if we ever expand things, we'll just add new flags to the
> end of the in-memory structure we have, but old binaries that don't
> know about the size will continue to pass in the old size, and we'll
> be all good.

Right, that's the way the commit message outlines what we would've
wanted to do once we run out of flags in the register arguments.

> 
> So I hate the whole patch with a passion. It does absolutely
> everything wrong from an interface standpoint.

He, fair enough. Let's see how fast we can fix this then. :)

> 
> I don't hate the notion of just adding flags. But do it cleanly, not
> with random "is_clone6" bits.
> 
> And no, the structure we pass in from user space must *NOT* be the
> same structure that we just copy blindly around as a kernel structure.
> We've done that mistake before, and it is *always* a mistake. Think
> "struct stat" and friends. No, we have a "struct kstat" for the kernel
> version, and then we convert at the system call boundary. Let's not
> repeat traditional mistakes.
> 
> And part of the conversion is exactly that
> 
>   Make everybody use the same in-kernel interface, so that the
> lower-down routines don't have those kinds of "if it's this system
> call, do this, otherwise, do that"
> 
> kind of horrible nasty mis-designs.
> 
> So to summarize:
> 
>  (a) make a separate kernel-only "clone_args" structure that is
> unified and works for every single version of clone/fork/vfork, and
> that has pointers and types in the kernel native format.
> 
>      So this will have things like "int __user *parent_tidptr" etc,
> and something like *one* u64 flag field.
> 
>  (b) have the new system call have a nice compat-safe but
> _independent_ format ("struct clone_uapi_struct")
> 
>  (c) you can now make the new system-call interface be something like
> 
> int clone_ext(struct clone_uapi_struct __user *uargs, size_t size);
> {
>         struct clone_uapi_struct kargs;
>         struct clone_args args;
> 
>         // The API is defined as a stucture of 64-bit words
>         if (size & 7)
>                 return -EINVAL;
>         if (!access_ok(uargs, size))
>                 return -EFAULT;
> 
>         // If the user passes in new flags we don't
>         // know about (because it was compiled against
>         // a newer kernel than what is runn ing), make
>         // sure they are zero
>         if (size > sizeof(kargs)) {
>                 size_t n = (size - sizeof(kargs))>>3;
>                 u64 __user *ptr = (u64 __user *)(uargs+1)
> 
>                 if (n > 10)
>                         return -EINVAL;
>                 do {
>                         u64 val;
>                         if (get_user(val, ptr++))
>                                 return -EFAULT;
>                         if (val)
>                                 return -EINVAL;
>                 } while (--n);
> 
>                 // Ok, everything else was zero, we use
>                 // the part we know about
>                 size = sizeof(kargs);
>         }
>         memset(&kargs, 0, sizeof(kargs));
>         memcpy_from_user(&kargs, uargs, size);
> 
>         .. now convert 'kargs' to 'args' ..
> 
> See? The above may be a bit *unnecessarily* anal about the whole
> checking etc, but it's actually fairly simple in the end. And it means
> that we have that "convert to internal format" in just one place - the
> place where it makes sense. And it's a lot cleaner interface to user
> mode, and allows for easy expansion later.
> 
> NOTE! By all means make "clone_ext()" take some of the other arguments
> as part of the argument registers, not everything has to be part of
> "struct clone_uapi_struct". But none of this "flag1..5" stuff. That's
> what a struct is for.
> 
> Maybe the basic ":u64 clone_flags" could be the first argument (but
> 64-bit arguments can be a bit nasty for compat layers, so it's
> probably not even worth it - once you have to copy an argument
> structure from user space, you might as well put everything there).

Hm, still pondering whether having one unsigned int argument passed
through registers that captures all the flags from the old clone() would
be a good idea. But I'll call that once the other code is written and if
I *should* think it is a good idea and people hate it they can just yell
and we can remove it.

Thanks the pointers!
Christian

^ permalink raw reply

* Re: [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Arnd Bergmann @ 2019-05-27 10:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Florian Weimer, Oleg Nesterov, David Howells, Andrew Morton,
	Adrian Reber, Linux API, linux-arch, the arch/x86 maintainers
In-Reply-To: <20190526102612.6970-2-christian@brauner.io>

On Sun, May 26, 2019 at 12:27 PM Christian Brauner <christian@brauner.io> wrote:
>
> Wire up the clone6() call on x86.
>
> This patch only wires up clone6() on x86. Some of the arches look like they
> need special assembly massaging and it is probably smarter if the
> appropriate arch maintainers would do the actual wiring.

Why do some architectures need special cases here? I'd prefer to have
new system calls always get defined in a way that avoids this, and
have a common entry point for everyone.

Looking at the m68k sys_clone comment in
arch/m68k/kernel/process.c, it seems that this was done as an
optimization to deal with an inferior ABI. Similar code is present
in h8300, ia64, nios2, and sparc. If all of them just do this to
shave off a few cycles from the system call entry, I really
couldn't care less.

       Arnd

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-27  7:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190521062628.GE32329@dhcp22.suse.cz>

On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > is extremely cheaper than file-IO on the low-end storage.
> > > > In this configuration, userspace could handle different strategy for each
> > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > file IO is more expensive in this case so want to keep them in memory
> > > > until memory pressure happens.
> > > > 
> > > > To support such strategy easier, this patch introduces
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > They are filters could be Ored with other existing hints using top two bits
> > > > of (int behavior).
> > > 
> > > madvise operates on top of ranges and it is quite trivial to do the
> > > filtering from the userspace so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint could affect only the interested vma
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, user could call a process_madvise syscall simply with a entire
> > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > cannot the monitor do the filtering from the userspace. Slightly more
> > > work, all right, but less of an API to expose and that itself is a
> > > strong argument against.
> > 
> > What I should do if we don't have such filter option is to enumerate all of
> > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > which would be painful for 2000+ vmas.
> 
> Painful is not an argument to add a new user API. If the existing API
> suits the purpose then it should be used. If it is not usable, we can
> think of a different way.

I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
It's never trivial.

^ permalink raw reply

* [PATCH v2] mm: mlockall error for flag MCL_ONFAULT
From: Potyra, Stefan @ 2019-05-27  7:53 UTC (permalink / raw)
  To: Michal Hocko, Daniel Jordan
  Cc: Potyra, Stefan, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Jordan, Tobias, akpm@linux-foundation.org, vbabka@suse.cz,
	kirill.shutemov@linux.intel.com, linux-api@vger.kernel.org
In-Reply-To: <20190527070415.GA1658@dhcp22.suse.cz>

If mlockall() is called with only MCL_ONFAULT as flag,
it removes any previously applied lockings and does
nothing else.

This behavior is counter-intuitive and doesn't match the
Linux man page.

  For mlockall():

  EINVAL Unknown  flags were specified or MCL_ONFAULT was specified with‐
         out either MCL_FUTURE or MCL_CURRENT.

Consequently, return the error EINVAL, if only MCL_ONFAULT
is passed. That way, applications will at least detect that
they are calling mlockall() incorrectly.

Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/mlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index e492a155c51a..03f39cbdd4c4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -797,7 +797,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	unsigned long lock_limit;
 	int ret;
 
-	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)))
+	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
+	    flags == MCL_ONFAULT)
 		return -EINVAL;
 
 	if (!can_do_mlock())
-- 
2.20.1


^ permalink raw reply related

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Minchan Kim @ 2019-05-27  7:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190521103726.GM32329@dhcp22.suse.cz>

On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > [Cc linux-api]
> > > > > 
> > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > so user should call the syscall several times to give hints to
> > > > > > multiple address range.
> > > > > 
> > > > > Is that a problem? How big of a problem? Any numbers?
> > > > 
> > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > with number in the description at respin.
> > > 
> > > Does this really have to be a fast operation? I would expect the monitor
> > > is by no means a fast path. The system call overhead is not what it used
> > > to be, sigh, but still for something that is not a hot path it should be
> > > tolerable, especially when the whole operation is quite expensive on its
> > > own (wrt. the syscall entry/exit).
> > 
> > What's different with process_vm_[readv|writev] and vmsplice?
> > If the range needed to be covered is a lot, vector operation makes senese
> > to me.
> 
> I am not saying that the vector API is wrong. All I am trying to say is
> that the benefit is not really clear so far. If you want to push it
> through then you should better get some supporting data.

I measured 1000 madvise syscall vs. a vector range syscall with 1000
ranges on ARM64 mordern device. Even though I saw 15% improvement but
absoluate gain is just 1ms so I don't think it's worth to support.
I will drop vector support at next revision.

Thanks for the review, Michal!

^ permalink raw reply

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Greg KH @ 2019-05-27  7:33 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel,
	linux-api
In-Reply-To: <20190526142521.GA21842@cs.unibo.it>

On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> This patch implements an extension of eventfd to define file descriptors 
> whose I/O events can be generated at user level. These file descriptors
> trigger notifications for [p]select/[p]poll/epoll.
> 
> This feature is useful for user-level implementations of network stacks
> or virtual device drivers as libraries.

How can this be used to create a "virtual device driver"?  Do you have
any examples of this new interface being used anywhere?

Also, meta-comment, you should provide some sort of test to kselftests
for your new feature so that it can actually be tested, as well as a man
page update (separately).

> Development and porting of code often requires to find the way to wait for I/O
> events both coming from file descriptors and generated by user-level code (e.g.
> user-implemented net stacks or drivers).  While it is possible to provide a
> partial support (e.g. using pipes or socketpairs), a clean and complete
> solution is still missing (as far as I have seen); e.g. I have not seen any
> clean way to generate EPOLLPRI, EPOLLERR, etc.

What's wrong with pipes or sockets for stuff like this?  Why is epoll
required?

> This proposal is based on a new tag for eventfd2(2): EFD_VPOLL.
> 
> This statement:
> 	fd = eventfd(EPOLLOUT, EFD_VPOLL | EFD_CLOEXEC);
> creates a file descriptor for I/O event generation. In this case EPOLLOUT is
> initially true.
> 
> Likewise all the other eventfs services, read(2) and write(2) use a 8-byte 
> integer argument.
> 
> read(2) returns the current state of the pending events.
> 
> The argument of write(2) is an or-composition of a control command
> (EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the
> bitmap of events to be added, deleted to the current set of pending events.
> EFD_VPOLL_MODEVENTS completely redefines the set of pending events.
> 
> e.g.:
> 	uint64_t request = EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI;
> 	write(fd, &request, sizeof(request);
> adds EPOLLIN and EPOLLPRI to the set of pending events.
> 
> These are examples of messages asking for a feature like EFD_VPOLL:
> https://stackoverflow.com/questions/909189/simulating-file-descriptor-in-user-space
> https://stackoverflow.com/questions/1648147/running-a-simple-tcp-server-with-poll-how-do-i-trigger-events-artificially
> ... and I need it to write networking and device modules for vuos:
> https://github.com/virtualsquare/vuos
> (it is the new codebase of ViewOS, see www.virtualsquare.org).
> 
> EXAMPLE:
> The following program creates an eventfd/EFD_VPOLL file descriptor and then forks
> a child process.  While the parent waits for events using epoll_wait the child
> generates a sequence of events. When the parent receives an event (or a set of events)
> it prints it and disarm it.
> The following shell session shows a sample run of the program:
> 	timeout...
> 	timeout...
> 	GOT event 1
> 	timeout...
> 	GOT event 1
> 	timeout...
> 	GOT event 3
> 	timeout...
> 	GOT event 2
> 	timeout...
> 	GOT event 4
> 	timeout...
> 	GOT event 10
> 
> Program source:
> #include <sys/eventfd.h>
> #include <sys/epoll.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <stdint.h>             /* Definition of uint64_t */
> 
> #ifndef EFD_VPOLL
> #define EFD_VPOLL (1 << 1)
> #define EFD_VPOLL_ADDEVENTS (1UL << 32)
> #define EFD_VPOLL_DELEVENTS (2UL << 32)
> #define EFD_VPOLL_MODEVENTS (3UL << 32)
> #endif
> 
> #define handle_error(msg) \
> 	do { perror(msg); exit(EXIT_FAILURE); } while (0)
> 
> static void vpoll_ctl(int fd, uint64_t request) {
> 	ssize_t s;
> 	s = write(fd, &request, sizeof(request));
> 	if (s != sizeof(uint64_t))
> 		handle_error("write");
> }
> 
> int
> main(int argc, char *argv[])
> {
> 	int efd, epollfd; 
> 	struct epoll_event ev;
> 	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI;
> 	ev.data.u64 = 0;
> 
> 	efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
> 	if (efd == -1)
> 		handle_error("eventfd");
> 	epollfd = epoll_create1(EPOLL_CLOEXEC);
> 	if (efd == -1)
> 		handle_error("epoll_create1");
> 	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1) 
> 		handle_error("epoll_ctl");
> 
> 	switch (fork()) {
> 		case 0:
> 			sleep(3);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
> 			sleep(2);
> 			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
> 			exit(EXIT_SUCCESS);
> 		default:
> 			while (1) {
> 				int nfds;
> 				nfds = epoll_wait(epollfd, &ev, 1, 1000);
> 				if (nfds < 0)
> 					handle_error("epoll_wait");
> 				else if (nfds == 0)
> 					printf("timeout...\n");
> 				else {
> 					printf("GOT event %x\n", ev.events);
> 					vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
> 					if (ev.events & EPOLLHUP)
> 						break;
> 				}
> 			}
> 		case -1:
> 			handle_error("fork");
> 	}
> 	close(epollfd);
> 	close(efd);
> 	return 0;
> }
> 
> Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
> ---
>  fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
>  include/linux/eventfd.h        |   7 +-
>  include/uapi/linux/eventpoll.h |   2 +
>  3 files changed, 116 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8aa0ea8c55e8..f83b7d02307e 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -3,6 +3,7 @@
>   *  fs/eventfd.c
>   *
>   *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>

No need for this line, that's what the git history shows.

>   *
>   */
>  
> @@ -30,12 +31,24 @@ struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
>  	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> +	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
> +	 *   Every time that a write(2) is performed on an eventfd, the
> +	 *   value of the __u64 being written is added to "count" and a
> +	 *   wakeup is performed on "wqh". A read(2) will return the "count"
> +	 *   value to userspace, and will reset "count" to zero (or decrement
> +	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
> +	 *   side eventfd_signal() also, adds to the "count" counter and
> +	 *   issue a wakeup.
> +	 *
> +	 * If the EFD_VPOLL flag was set at eventfd creation:
> +	 *   count is the set of pending EPOLL events.
> +	 *   read(2) returns the current value of count.
> +	 *   The argument of write(2) is an 8-byte integer:
> +	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
> +	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
> +	 *   events to be added, deleted to the current set of pending events.
> +	 *   (i.e. which bits of "count" must be set or reset).
> +	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.

Ugh, overloading stuff, this is increased complexity, do you _have_ to
do it this way?

>  	 */
>  	__u64 count;
>  	unsigned int flags;
> @@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
>  	return res;
>  }
>  
> +static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	__poll_t events = 0;
> +	u64 count;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	count = READ_ONCE(ctx->count);
> +
> +	events = (count & EPOLLALLMASK);

Why mask?

> +
> +	return events;
> +}
> +
> +static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt = 0;
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;

What is magic about the size of a __u64 here?

> +	res = sizeof(ucnt);
> +	ucnt = READ_ONCE(ctx->count);
> +	if (put_user(ucnt, (__u64 __user *)buf))
> +		return -EFAULT;
> +
> +	return res;
> +}
> +
> +static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	__u32 events;
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;

Why can it not be less than 64?

> +	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
> +		return -EFAULT;
> +	spin_lock_irq(&ctx->wqh.lock);
> +
> +	events = ucnt & EPOLLALLMASK;
> +	res = sizeof(ucnt);
> +	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
> +	case EFD_VPOLL_ADDEVENTS:
> +		ctx->count |= events;
> +		break;
> +	case EFD_VPOLL_DELEVENTS:
> +		ctx->count &= ~(events);
> +		break;
> +	case EFD_VPOLL_MODEVENTS:
> +		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
> +		break;
> +	default:
> +		res = -EINVAL;
> +	}
> +
> +	/* wake up waiting threads */
> +	if (res >= 0 && waitqueue_active(&ctx->wqh))
> +		wake_up_locked_poll(&ctx->wqh, res);

Can you call this with a spinlock held?  I really don't remember, sorry,
if so, nevermind, but you should check...

> +
> +	spin_unlock_irq(&ctx->wqh.lock);
> +
> +	return res;
> +
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  {
> @@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
>  	.llseek		= noop_llseek,
>  };
>  
> +static const struct file_operations eventfd_vpoll_fops = {
> +#ifdef CONFIG_PROC_FS
> +	.show_fdinfo	= eventfd_show_fdinfo,
> +#endif
> +	.release	= eventfd_release,
> +	.poll		= eventfd_vpoll_poll,
> +	.read		= eventfd_vpoll_read,
> +	.write		= eventfd_vpoll_write,
> +	.llseek		= noop_llseek,
> +};
> +
>  /**
>   * eventfd_fget - Acquire a reference of an eventfd file descriptor.
>   * @fd: [in] Eventfd file descriptor.
> @@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
>  static int do_eventfd(unsigned int count, int flags)
>  {
>  	struct eventfd_ctx *ctx;
> +	const struct file_operations *fops = &eventfd_fops;
>  	int fd;
>  
>  	/* Check the EFD_* constants for consistency.  */
> @@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
>  	ctx->flags = flags;
>  	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
>  
> -	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> +	if (flags & EFD_VPOLL) {
> +		fops = &eventfd_vpoll_fops;
> +		ctx->count &= EPOLLALLMASK;
> +	}
> +	fd = anon_inode_getfd("[eventfd]", fops, ctx,
>  			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
>  	if (fd < 0)
>  		eventfd_free_ctx(ctx);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index ffcc7724ca21..63258cf29344 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -21,11 +21,16 @@
>   * shared O_* flags.
>   */
>  #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_VPOLL (1 << 1)

BIT(1)?

>  #define EFD_CLOEXEC O_CLOEXEC
>  #define EFD_NONBLOCK O_NONBLOCK
>  
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
> +
> +#define EFD_VPOLL_ADDEVENTS (1UL << 32)
> +#define EFD_VPOLL_DELEVENTS (2UL << 32)
> +#define EFD_VPOLL_MODEVENTS (3UL << 32)

Aren't these part of the uapi?  Why are they hidden in here?

>  
>  struct eventfd_ctx;
>  struct file;
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index 8a3432d0f0dc..814de6d869c7 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -41,6 +41,8 @@
>  #define EPOLLMSG	(__force __poll_t)0x00000400
>  #define EPOLLRDHUP	(__force __poll_t)0x00002000
>  
> +#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)

Why is this part of the uapi?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v18 1/3] proc: add /proc/<pid>/arch_status
From: Li, Aubrey @ 2019-05-27  7:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tglx, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	adobriyan, aubrey.li, linux-api, linux-kernel, Andy Lutomirski
In-Reply-To: <20190523201822.cc554d68ec567164bec781e1@linux-foundation.org>

On 2019/5/24 11:18, Andrew Morton wrote:
> On Thu, 25 Apr 2019 22:32:17 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> 
>> The architecture specific information of the running processes
>> could be useful to the userland. Add /proc/<pid>/arch_status
>> interface support to examine process architecture specific
>> information externally.
> 
> I'll give this an
> 
> Acked-by: Andrew Morton <akpm@linux-foundation.org>

Thanks!

> 
> from a procfs POV and shall let the x86 maintainers worry about it.
> 
> I must say I'm a bit surprised that we don't already provide some form
> of per-process CPU-specific info anywhere in procfs.  Something to
> piggy-back this onto.  But I can't find such a thing.
> 
> I assume we've already discussed why this is a new procfs file rather
> than merely a new line in /proc/<pid>/status.  If so, please add the
> reasoning to the changelog.  If not, please discuss now ;)
>

Andy and Thomas may want to give more comments. The discussion was that
we don't want /proc/PID/status to be different on different architectures.
It would be better to separate the arch staff into its own file /proc/PID/
arch_status and make sure that everything in it is namespaced.

Thanks,
-Aubrey

^ permalink raw reply

* Re: [PATCH] mm: mlockall error for flag MCL_ONFAULT
From: Michal Hocko @ 2019-05-27  7:04 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Potyra, Stefan, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Jordan, Tobias, akpm, vbabka, kirill.shutemov, linux-api
In-Reply-To: <20190524214304.enntpu4tvzpyxzfe@ca-dmjordan1.us.oracle.com>

On Fri 24-05-19 17:43:04, Daniel Jordan wrote:
> [ Adding linux-api and some of the people who were involved in the
> MCL_ONFAULT/mlock2/etc discussions.  Author of the Fixes patch appears to
> have moved on. ]
> 
> On Wed, May 22, 2019 at 11:23:37AM +0000, Potyra, Stefan wrote:
> > If mlockall() is called with only MCL_ONFAULT as flag,
> > it removes any previously applied lockings and does
> > nothing else.
> 
> The change looks reasonable.  Hard to imagine any application relies on it, and
> they really shouldn't be if they are.  Debian codesearch turned up only a few
> cases where stress-ng was doing this for unknown reasons[1] and this change
> isn't gonna break those.  In this case I think changing the syscall's behavior
> is justified.  
> 
> > This behavior is counter-intuitive and doesn't match the
> > Linux man page.
> 
> I'd quote it for the changelog:
> 
>   For mlockall():
> 
>   EINVAL Unknown  flags were specified or MCL_ONFAULT was specified with‐
>          out either MCL_FUTURE or MCL_CURRENT.
> 
> With that you can add
> 
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> [1] https://sources.debian.org/src/stress-ng/0.09.50-1/stress-mlock.c/?hl=203#L203

Well spotted and the fix looks reasonable as well. Quoting the man page
seems useful as well.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* [PATCH v3 3/3] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-05-27  6:06 UTC (permalink / raw)
  To: atull, mdf, jdelvare, linux, linux-fpga, linux-hwmon,
	linux-kernel
  Cc: linux-api, Wu Hao, Luwei Kang, Xu Yilun
In-Reply-To: <1558937216-12742-1-git-send-email-hao.wu@intel.com>

This patch adds support for power management private feature under
FPGA Management Engine (FME). This private feature driver registers
a hwmon for power (power1_input), thresholds information, e.g.
(power1_max / crit / max_alarm / crit_alarm) and also read-only sysfs
interfaces for other power management information. For configuration,
user could write threshold values via above power1_max / crit sysfs
interface under hwmon too.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
----
v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
    move all sysfs interfaces under hwmon
        consumed          --> hwmon power1_input
        threshold1        --> hwmon power1_cap
        threshold2        --> hwmon power1_crit
        threshold1_status --> hwmon power1_cap_status
        threshold2_status --> hwmon power1_crit_status
        xeon_limit        --> hwmon power1_xeon_limit
        fpga_limit        --> hwmon power1_fpga_limit
        ltr               --> hwmon power1_ltr
v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
	power1_cap         --> power1_max
	power1_cap_status  --> power1_max_alarm
	power1_crit_status --> power1_crit_alarm
    update sysfs doc for above sysfs interface changes.
    replace scnprintf with sprintf in sysfs interface.
---
 Documentation/ABI/testing/sysfs-platform-dfl-fme |  67 +++++++
 drivers/fpga/dfl-fme-main.c                      | 230 +++++++++++++++++++++++
 2 files changed, 297 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index 0a0b7b9..49c44f7 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -220,6 +220,7 @@ Contact:	Wu Hao <hao.wu@intel.com>
 Description:	Read-Only. Read this file to get the name of hwmon device, it
 		supports values:
 		    'dfl_fme_thermal' - thermal hwmon device name
+		    'dfl_fme_power'   - power hwmon device name
 
 What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
 Date:		May 2019
@@ -276,3 +277,69 @@ Description:	Read-Only. Read this file to get the policy of hardware threshold1
 		(see 'temp1_max'). It only supports two values (policies):
 		    0 - AP2 state (90% throttling)
 		    1 - AP1 state (50% throttling)
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns current FPGA power consumption in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Write. Read this file to get current hardware power
+		threshold1 in uW. If power consumption rises at or above
+		this threshold, hardware starts 50% throttling.
+		Write this file to set current hardware power threshold1 in uW.
+		As hardware only accepts values in Watts, so input value will
+		be round down per Watts (< 1 watts part will be discarded).
+		Write fails with -EINVAL if input parsing fails or input isn't
+		in the valid range (0 - 127000000 uW).
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Write. Read this file to get current hardware power
+		threshold2 in uW. If power consumption rises at or above
+		this threshold, hardware starts 90% throttling.
+		Write this file to set current hardware power threshold2 in uW.
+		As hardware only accepts values in Watts, so input value will
+		be round down per Watts (< 1 watts part will be discarded).
+		Write fails with -EINVAL if input parsing fails or input isn't
+		in the valid range (0 - 127000000 uW).
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if power consumption is currently at or
+		above hardware threshold1 (see 'power1_max'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if power consumption is currently at or
+		above hardware threshold2 (see 'power1_crit'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns power limit for XEON in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns power limit for FPGA in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
+Date:		May 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. Read this file to get current Latency Tolerance
+		Reporting (ltr) value. This ltr impacts the CPU low power
+		state in integrated solution.
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 91152d5..d6d4ab8 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -409,6 +409,232 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
 	.uinit = fme_thermal_mgmt_uinit,
 };
 
+#define FME_PWR_STATUS		0x8
+#define FME_LATENCY_TOLERANCE	BIT_ULL(18)
+#define PWR_CONSUMED		GENMASK_ULL(17, 0)
+
+#define FME_PWR_THRESHOLD	0x10
+#define PWR_THRESHOLD1		GENMASK_ULL(6, 0)	/* in Watts */
+#define PWR_THRESHOLD2		GENMASK_ULL(14, 8)	/* in Watts */
+#define PWR_THRESHOLD_MAX	0x7f			/* in Watts */
+#define PWR_THRESHOLD1_STATUS	BIT_ULL(16)
+#define PWR_THRESHOLD2_STATUS	BIT_ULL(17)
+
+#define FME_PWR_XEON_LIMIT	0x18
+#define XEON_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
+#define XEON_PWR_EN		BIT_ULL(15)
+#define FME_PWR_FPGA_LIMIT	0x20
+#define FPGA_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
+#define FPGA_PWR_EN		BIT_ULL(15)
+
+#define PWR_THRESHOLD_MAX_IN_UW (PWR_THRESHOLD_MAX * 1000000)
+
+static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+			    u32 attr, int channel, long *val)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	switch (attr) {
+	case hwmon_power_input:
+		v = readq(feature->ioaddr + FME_PWR_STATUS);
+		*val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
+		break;
+	case hwmon_power_max:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
+		break;
+	case hwmon_power_crit:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
+		break;
+	case hwmon_power_max_alarm:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)FIELD_GET(PWR_THRESHOLD1_STATUS, v);
+		break;
+	case hwmon_power_crit_alarm:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)FIELD_GET(PWR_THRESHOLD2_STATUS, v);
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
+			     u32 attr, int channel, long val)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	int ret = 0;
+	u64 v;
+
+	if (val < 0 || val > PWR_THRESHOLD_MAX_IN_UW)
+		return -EINVAL;
+
+	val = val / 1000000;
+
+	mutex_lock(&pdata->lock);
+
+	switch (attr) {
+	case hwmon_power_max:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		v &= ~PWR_THRESHOLD1;
+		v |= FIELD_PREP(PWR_THRESHOLD1, val);
+		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+		break;
+	case hwmon_power_crit:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		v &= ~PWR_THRESHOLD2;
+		v |= FIELD_PREP(PWR_THRESHOLD2, val);
+		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	mutex_unlock(&pdata->lock);
+
+	return ret;
+}
+
+static umode_t power_hwmon_attrs_visible(const void *drvdata,
+					 enum hwmon_sensor_types type,
+					 u32 attr, int channel)
+{
+	switch (attr) {
+	case hwmon_power_input:
+	case hwmon_power_max_alarm:
+	case hwmon_power_crit_alarm:
+		return 0444;
+	case hwmon_power_max:
+	case hwmon_power_crit:
+		return 0644;
+	}
+
+	return 0;
+}
+
+static const u32 power_hwmon_config[] = {
+	HWMON_P_INPUT | HWMON_P_MAX | HWMON_P_CRIT |
+	HWMON_P_MAX_ALARM | HWMON_P_CRIT_ALARM,
+	0
+};
+
+static const struct hwmon_channel_info hwmon_pwr_info = {
+	.type = hwmon_power,
+	.config = power_hwmon_config,
+};
+
+static const struct hwmon_channel_info *power_hwmon_info[] = {
+	&hwmon_pwr_info,
+	NULL
+};
+
+static const struct hwmon_ops power_hwmon_ops = {
+	.is_visible = power_hwmon_attrs_visible,
+	.read = power_hwmon_read,
+	.write = power_hwmon_write,
+};
+
+static const struct hwmon_chip_info power_hwmon_chip_info = {
+	.ops = &power_hwmon_ops,
+	.info = power_hwmon_info,
+};
+
+static ssize_t power1_xeon_limit_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u16 xeon_limit = 0;
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
+
+	if (FIELD_GET(XEON_PWR_EN, v))
+		xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
+
+	return sprintf(buf, "%u\n", xeon_limit * 100000);
+}
+
+static ssize_t power1_fpga_limit_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u16 fpga_limit = 0;
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
+
+	if (FIELD_GET(FPGA_PWR_EN, v))
+		fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
+
+	return sprintf(buf, "%u\n", fpga_limit * 100000);
+}
+
+static ssize_t power1_ltr_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_STATUS);
+
+	return sprintf(buf, "%u\n",
+		       (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
+}
+
+static DEVICE_ATTR_RO(power1_xeon_limit);
+static DEVICE_ATTR_RO(power1_fpga_limit);
+static DEVICE_ATTR_RO(power1_ltr);
+
+static struct attribute *power_extra_attrs[] = {
+	&dev_attr_power1_xeon_limit.attr,
+	&dev_attr_power1_fpga_limit.attr,
+	&dev_attr_power1_ltr.attr,
+	NULL
+};
+
+ATTRIBUTE_GROUPS(power_extra);
+
+static int fme_power_mgmt_init(struct platform_device *pdev,
+			       struct dfl_feature *feature)
+{
+	struct device *hwmon;
+
+	dev_dbg(&pdev->dev, "FME Power Management Init.\n");
+
+	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
+						     "dfl_fme_power", feature,
+						     &power_hwmon_chip_info,
+						     power_extra_groups);
+	if (IS_ERR(hwmon)) {
+		dev_err(&pdev->dev, "Fail to register power hwmon\n");
+		return PTR_ERR(hwmon);
+	}
+
+	return 0;
+}
+
+static void fme_power_mgmt_uinit(struct platform_device *pdev,
+				 struct dfl_feature *feature)
+{
+	dev_dbg(&pdev->dev, "FME Power Management UInit.\n");
+}
+
+static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
+	{.id = FME_FEATURE_ID_POWER_MGMT,},
+	{0,}
+};
+
+static const struct dfl_feature_ops fme_power_mgmt_ops = {
+	.init = fme_power_mgmt_init,
+	.uinit = fme_power_mgmt_uinit,
+};
+
 static struct dfl_feature_driver fme_feature_drvs[] = {
 	{
 		.id_table = fme_hdr_id_table,
@@ -431,6 +657,10 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
 		.ops = &fme_thermal_mgmt_ops,
 	},
 	{
+		.id_table = fme_power_mgmt_id_table,
+		.ops = &fme_power_mgmt_ops,
+	},
+	{
 		.ops = NULL,
 	},
 };
-- 
1.8.3.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox