Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Andy Lutomirski @ 2015-01-09  1:28 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Miklos Szeredi,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers
In-Reply-To: <20150109011608.GA2924-+wGkCoP0yD+sDdueE5tM26fLeoKvNuZc@public.gmane.org>

On Thu, Jan 8, 2015 at 5:25 PM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 01/08 09:57, Andy Lutomirski wrote:
>> On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote:
>> > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
>> >> Applications could use epoll interface when then need to poll a big number of
>> >> files in their main loops, to achieve better performance than ppoll(2). Except
>> >> for one concern: epoll only takes timeout parameters in microseconds, rather
>> >> than nanoseconds.
>> >>
>> >> That is a drawback we should address. For a real case in QEMU, we run into a
>> >> scalability issue with ppoll(2) when many devices are attached to guest, in
>> >> which case many host fds, such as virtual disk images and sockets, need to be
>> >> polled by the main loop. As a result we are looking at switching to epoll, but
>> >> the coarse timeout precision is a trouble, as explained below.
>> >>
>> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
>> >> timers in the main loop; and we call ppoll(2) with the next firing timer as
>> >> timeout, so when ppoll(2) returns, we know that we have more work to do (either
>> >> handling IO events, or fire a timer callback). This is natual and efficient,
>> >> except that ppoll(2) itself is slow.
>> >>
>> >> Now that we want to switch to epoll, to speed up the polling. However the timer
>> >> slack setting will be effectively undone, because that way we will have to
>> >> round up the timeout to microseconds honoring timer contract. But consequently,
>> >> this hurts the general responsiveness.
>> >>
>> >> Note: there are two alternatives, without changing kernel:
>> >>
>> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
>> >> be slow as one fd is polled. No more scalability issue. And if there are
>> >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
>> >> timeout=0; otherwise, there can't be events for the epoll, skip the following
>> >> epoll_wait and just continue with other work.
>> >>
>> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
>> >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
>> >> no other events have arrived. This will inheritly give us timerfd's precision.
>> >> Note that for each poll, the desired timeout is different because the next
>> >> timer is different, so that, before each epoll_wait(2), there will be a
>> >> timerfd_settime syscall to set it to a proper value.
>> >>
>> >> Unfortunately, both approaches require one more syscall per iteration, compared
>> >> to the original single ppoll(2), cost of which is unneglectable when we talk
>> >> about nanosecond granularity.
>>
>> I'd like to see a more ambitious change, since the timer isn't the
>> only problem like this.  Specifically, I'd like a syscall that does a
>> list of epoll-related things and then waits.  The list of things could
>> include, at least:
>>
>>  - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
>> want to turn on and off their requests for events on a somewhat
>> regular basis.
>
> This sounds good to me.
>
>>
>>  - timerfd_settime actions: this allows a single syscall to wait and
>> adjust *both* monotonic and real-time wakeups.
>
> I'm not sure, doesn't this break orthogonality between epoll and timerfd?

Yes.  It's not very elegant, and more elegant ideas are welcome.

>
>>
>> Would this make sense?  It could look like:
>>
>> int epoll_mod_and_pwait(int epfd,
>>   struct epoll_event *events, int maxevents,
>>   struct epoll_command *commands, int ncommands,
>>   const sigset_t *sigmask);
>
> What about flags?
>

No room.  Maybe it should just be a struct for everything instead of
separate args.

> Fam



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Fam Zheng @ 2015-01-09  1:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Miklos Szeredi,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers
In-Reply-To: <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, 01/08 09:57, Andy Lutomirski wrote:
> On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote:
> > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
> >> Applications could use epoll interface when then need to poll a big number of
> >> files in their main loops, to achieve better performance than ppoll(2). Except
> >> for one concern: epoll only takes timeout parameters in microseconds, rather
> >> than nanoseconds.
> >>
> >> That is a drawback we should address. For a real case in QEMU, we run into a
> >> scalability issue with ppoll(2) when many devices are attached to guest, in
> >> which case many host fds, such as virtual disk images and sockets, need to be
> >> polled by the main loop. As a result we are looking at switching to epoll, but
> >> the coarse timeout precision is a trouble, as explained below.
> >>
> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
> >> timers in the main loop; and we call ppoll(2) with the next firing timer as
> >> timeout, so when ppoll(2) returns, we know that we have more work to do (either
> >> handling IO events, or fire a timer callback). This is natual and efficient,
> >> except that ppoll(2) itself is slow.
> >>
> >> Now that we want to switch to epoll, to speed up the polling. However the timer
> >> slack setting will be effectively undone, because that way we will have to
> >> round up the timeout to microseconds honoring timer contract. But consequently,
> >> this hurts the general responsiveness.
> >>
> >> Note: there are two alternatives, without changing kernel:
> >>
> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
> >> be slow as one fd is polled. No more scalability issue. And if there are
> >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
> >> timeout=0; otherwise, there can't be events for the epoll, skip the following
> >> epoll_wait and just continue with other work.
> >>
> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
> >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
> >> no other events have arrived. This will inheritly give us timerfd's precision.
> >> Note that for each poll, the desired timeout is different because the next
> >> timer is different, so that, before each epoll_wait(2), there will be a
> >> timerfd_settime syscall to set it to a proper value.
> >>
> >> Unfortunately, both approaches require one more syscall per iteration, compared
> >> to the original single ppoll(2), cost of which is unneglectable when we talk
> >> about nanosecond granularity.
> 
> I'd like to see a more ambitious change, since the timer isn't the
> only problem like this.  Specifically, I'd like a syscall that does a
> list of epoll-related things and then waits.  The list of things could
> include, at least:
> 
>  - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
> want to turn on and off their requests for events on a somewhat
> regular basis.

This sounds good to me.

> 
>  - timerfd_settime actions: this allows a single syscall to wait and
> adjust *both* monotonic and real-time wakeups.

I'm not sure, doesn't this break orthogonality between epoll and timerfd?

> 
> Would this make sense?  It could look like:
> 
> int epoll_mod_and_pwait(int epfd,
>   struct epoll_event *events, int maxevents,
>   struct epoll_command *commands, int ncommands,
>   const sigset_t *sigmask);

What about flags?

Fam

^ permalink raw reply

* Re: [v8 2/5] ext4: adds project ID support
From: Andreas Dilger @ 2015-01-08 22:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Li Xi, Linux Filesystem Development List, ext4 development,
	linux-api, Theodore Ts'o, Al Viro, Christoph Hellwig,
	Dmitry Monakhov
In-Reply-To: <20150108082625.GA15339@quack.suse.cz>

On Jan 8, 2015, at 1:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 09-12-14 13:22:25, Li Xi wrote:
>> This patch adds a new internal field of ext4 inode to save project
>> identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
>> inheriting project ID from parent directory.
>  I have noticed one thing you apparently changed in v7 of the patch set.
> See below.
> 
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 29c43e7..8bd1da9 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -377,16 +377,18 @@ struct flex_groups {
>> #define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
>> #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
>> #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
>> +#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL /* Create with parents projid */
>  How did FS_PROJINHERIT_FL get here? There used to be 0x20000000 in older
> version of the patch set which is correct - this definition is defining
> ext4 on-disk format. As such it is an ext4 specific flag and should be
> definined to a fixed constant independed of any other filesystem. It seems
> you are somewhat mixing what is an on-disk format flag value and what is a
> flag value passed from userspace. These two may be different things and
> you need to convert between the values when getting / setting flags...

Currently the EXT4_*_FL and FS_*_FL values are all identical, and there
is no reason to change that before it is actually needed.  Since the
FS_PROJINHERIT_FL is used via chattr/lsattr from userspace, this value
must also be kept the same in the future to avoid API breakage, so there
is no reason to worry about incompatibilities.

See also the [v8 5/5] patch, which is changing the EXT4_*_FL values to
use FS_*_FL constants, where applicable, so that it is more clear that
these values need to be the same.

Cheers, Andreas






^ permalink raw reply

* Re: [PATCH] selftests/vm: fix link error for transhuge-stress test
From: Andrey Skvortsov @ 2015-01-08 21:30 UTC (permalink / raw)
  To: Shuah Khan; +Cc: Konstantin Khlebnikov, linux-api, linux-kernel
In-Reply-To: <54AEA86D.60709@osg.samsung.com>

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

On Thu, Jan 08, 2015 at 08:55:25AM -0700, Shuah Khan wrote:

> On 01/07/2015 01:10 PM, Andrey Skvortsov wrote:
> > man page for clock_gettime says 'Link with -lrt'. So I think the
> > error message is correct.
> > 
> 
> Thanks for fixing it. Applied to linux-kselftest fixes branch
> 

Hi Shuah,

thanks for taking the patch.

sorry for the late reply. I've just checked e-mail. Different
timezone. =)

I have couple questions about selftests/vm/hugetlbfstest. It is called by
make run_tests, but fails always on my machine.
What does it actually test? if I understand the code right,
it tests whether RSS (read from /proc/self/statm) grows/shrinks not much during
allocation/deallocation of 16Mb. 

Here are couple of questions about some tests in the code.
1)
hugetlbfstest.c:75
	do_mmap(-1, MAP_ANONYMOUS, 1);

it does not use hugetlbfs. So it does not correspond to the
actual test name. That confused me a little. Does it test usage of
transparent huge pages? 

2) 
hugetlbfstest.c:78
	do_mmap(hugefd, 0, 1);

Despite the fact, that hugefd is open at wrong location
(run_tests script mounts hugetlbfs at ./huge) the test actually does
use hugetlbfs. Unfortunately on my machine it fails always too. 
RSS does not grow after allocation, but is the same as before.
So assertion 'llabs(after - before - length) < 0x40000' fails.
But hugepages are definitely allocated, I can see that HugePages_Free decreases in
/proc/meminfo. To find this out I placed delays in the test code.
Does rss field in /proc/statm counts only normal non-huge pages? 

-- 
Best regards,
Andrey Skvortsov

Secure e-mail with gnupg: See http://www.gnupg.org/
PGP Key ID: 0x57A3AEAD

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Andy Lutomirski @ 2015-01-08 19:42 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Michael Kerrisk, Mathieu Desnoyers, Zach Brown, Fam Zheng,
	Vivek Goyal, Ingo Molnar,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander Viro, Mike Frysinger, Linux FS Devel,
	Alexei Starovoitov, Rasmus Villemoes, Stefan Hajnoczi,
	Dario Faggioli, Paolo Bonzini, Juri Lelli, X86 ML, Oleg Nesterov,
	Heiko Carstens, H. Peter Anvin, Andrew Morton, Linux API,
	"David S. Miller" <dav>
In-Reply-To: <20150108184201.GB13974@cloud>

On Jan 8, 2015 10:42 AM, <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
>
> On Thu, Jan 08, 2015 at 09:57:24AM -0800, Andy Lutomirski wrote:
> > On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote:
> > > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
> > >> Applications could use epoll interface when then need to poll a big number of
> > >> files in their main loops, to achieve better performance than ppoll(2). Except
> > >> for one concern: epoll only takes timeout parameters in microseconds, rather
> > >> than nanoseconds.
> > >>
> > >> That is a drawback we should address. For a real case in QEMU, we run into a
> > >> scalability issue with ppoll(2) when many devices are attached to guest, in
> > >> which case many host fds, such as virtual disk images and sockets, need to be
> > >> polled by the main loop. As a result we are looking at switching to epoll, but
> > >> the coarse timeout precision is a trouble, as explained below.
> > >>
> > >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
> > >> timers in the main loop; and we call ppoll(2) with the next firing timer as
> > >> timeout, so when ppoll(2) returns, we know that we have more work to do (either
> > >> handling IO events, or fire a timer callback). This is natual and efficient,
> > >> except that ppoll(2) itself is slow.
> > >>
> > >> Now that we want to switch to epoll, to speed up the polling. However the timer
> > >> slack setting will be effectively undone, because that way we will have to
> > >> round up the timeout to microseconds honoring timer contract. But consequently,
> > >> this hurts the general responsiveness.
> > >>
> > >> Note: there are two alternatives, without changing kernel:
> > >>
> > >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
> > >> be slow as one fd is polled. No more scalability issue. And if there are
> > >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
> > >> timeout=0; otherwise, there can't be events for the epoll, skip the following
> > >> epoll_wait and just continue with other work.
> > >>
> > >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
> > >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
> > >> no other events have arrived. This will inheritly give us timerfd's precision.
> > >> Note that for each poll, the desired timeout is different because the next
> > >> timer is different, so that, before each epoll_wait(2), there will be a
> > >> timerfd_settime syscall to set it to a proper value.
> > >>
> > >> Unfortunately, both approaches require one more syscall per iteration, compared
> > >> to the original single ppoll(2), cost of which is unneglectable when we talk
> > >> about nanosecond granularity.
> >
> > I'd like to see a more ambitious change, since the timer isn't the
> > only problem like this.  Specifically, I'd like a syscall that does a
> > list of epoll-related things and then waits.  The list of things could
> > include, at least:
> >
> >  - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
> > want to turn on and off their requests for events on a somewhat
> > regular basis.
> >
> >  - timerfd_settime actions: this allows a single syscall to wait and
> > adjust *both* monotonic and real-time wakeups.
> >
> > Would this make sense?  It could look like:
> >
> > int epoll_mod_and_pwait(int epfd,
> >   struct epoll_event *events, int maxevents,
> >   struct epoll_command *commands, int ncommands,
> >   const sigset_t *sigmask);
>
> That's a complicated syscall.  (And it also doesn't have room for the
> flags argument.)
>
> At that point, why not just have a syscall like this:
>
> struct syscall {
>         unsigned long num;
>         unsigned long params[6];
> };

Too minimize locking and such.  The user / kernel transitions are very
fast in some configurations, and I suspect that all the fd lookups,
epoll data structure locks, etc are important, too.

This particular pattern (EPOLL_CTL_MOD, then timerfd_settime, then
epoll_pwait) is so common that optimizing it from three syscalls to
one (as opposed to two as in your patch) seems like it could be
worthwhile.

syscall entry + exit latency is pretty good (~180 cycles total, maybe)
these days, but only in some configurations.  In others, it blows up
pretty badly.  I'm slowly working on improving that.

Also, simplicity is a win. The multi-syscall thing has never gone
smoothly, and, as Alexei pointed out, it's a big mess for seccomp.

>
> int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags);
>
> I think that has been discussed in the past.
>
> Or, these days, that might be better done via eBPF, which would avoid
> the need for flags like "return on error"; an eBPF program could decide
> how to proceed after each call.
>
> - Josh Triplett

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Alexei Starovoitov @ 2015-01-08 19:31 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Andy Lutomirski, Miklos Szeredi, Fam Zheng,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Juri Lelli,
	Zach Brown, David Drysdale, Kees Cook, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov
In-Reply-To: <20150108184201.GB13974@cloud>

On Thu, Jan 8, 2015 at 10:42 AM,  <josh@joshtriplett.org> wrote:
>> I'd like to see a more ambitious change, since the timer isn't the
>> only problem like this.  Specifically, I'd like a syscall that does a
>> list of epoll-related things and then waits.  The list of things could
>> include, at least:
>>
>>  - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
>> want to turn on and off their requests for events on a somewhat
>> regular basis.
>>
>>  - timerfd_settime actions: this allows a single syscall to wait and
>> adjust *both* monotonic and real-time wakeups.
>>
>> Would this make sense?  It could look like:
>>
>> int epoll_mod_and_pwait(int epfd,
>>   struct epoll_event *events, int maxevents,
>>   struct epoll_command *commands, int ncommands,
>>   const sigset_t *sigmask);
>
> That's a complicated syscall.  (And it also doesn't have room for the
> flags argument.)
>
> At that point, why not just have a syscall like this:
>
> struct syscall {
>         unsigned long num;
>         unsigned long params[6];
> };
>
> int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags);
>
> I think that has been discussed in the past.
>
> Or, these days, that might be better done via eBPF, which would avoid
> the need for flags like "return on error"; an eBPF program could decide
> how to proceed after each call.

I'm afraid that will break seccomp or will make it much more
complicated.
Also I think syscall latency with the latest improvements is
actually quite fast, so chaining of syscalls is probably an overkill.
Same goes to Andy's argument of doing immediate CTL_MOD.
Let user space do it. It makes sense to combine things only
when atomicity is needed. Here it's not the case.

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-01-08 18:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Miklos Szeredi, Fam Zheng,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers <math>
In-Reply-To: <CALCETrVyPij1Zxwmw7p06UrZjoyYDXqEjmxyQ-KJ8Y7dx7mL3g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Jan 08, 2015 at 09:57:24AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote:
> > On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
> >> Applications could use epoll interface when then need to poll a big number of
> >> files in their main loops, to achieve better performance than ppoll(2). Except
> >> for one concern: epoll only takes timeout parameters in microseconds, rather
> >> than nanoseconds.
> >>
> >> That is a drawback we should address. For a real case in QEMU, we run into a
> >> scalability issue with ppoll(2) when many devices are attached to guest, in
> >> which case many host fds, such as virtual disk images and sockets, need to be
> >> polled by the main loop. As a result we are looking at switching to epoll, but
> >> the coarse timeout precision is a trouble, as explained below.
> >>
> >> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
> >> timers in the main loop; and we call ppoll(2) with the next firing timer as
> >> timeout, so when ppoll(2) returns, we know that we have more work to do (either
> >> handling IO events, or fire a timer callback). This is natual and efficient,
> >> except that ppoll(2) itself is slow.
> >>
> >> Now that we want to switch to epoll, to speed up the polling. However the timer
> >> slack setting will be effectively undone, because that way we will have to
> >> round up the timeout to microseconds honoring timer contract. But consequently,
> >> this hurts the general responsiveness.
> >>
> >> Note: there are two alternatives, without changing kernel:
> >>
> >> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
> >> be slow as one fd is polled. No more scalability issue. And if there are
> >> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
> >> timeout=0; otherwise, there can't be events for the epoll, skip the following
> >> epoll_wait and just continue with other work.
> >>
> >> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
> >> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
> >> no other events have arrived. This will inheritly give us timerfd's precision.
> >> Note that for each poll, the desired timeout is different because the next
> >> timer is different, so that, before each epoll_wait(2), there will be a
> >> timerfd_settime syscall to set it to a proper value.
> >>
> >> Unfortunately, both approaches require one more syscall per iteration, compared
> >> to the original single ppoll(2), cost of which is unneglectable when we talk
> >> about nanosecond granularity.
> 
> I'd like to see a more ambitious change, since the timer isn't the
> only problem like this.  Specifically, I'd like a syscall that does a
> list of epoll-related things and then waits.  The list of things could
> include, at least:
> 
>  - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
> want to turn on and off their requests for events on a somewhat
> regular basis.
> 
>  - timerfd_settime actions: this allows a single syscall to wait and
> adjust *both* monotonic and real-time wakeups.
> 
> Would this make sense?  It could look like:
> 
> int epoll_mod_and_pwait(int epfd,
>   struct epoll_event *events, int maxevents,
>   struct epoll_command *commands, int ncommands,
>   const sigset_t *sigmask);

That's a complicated syscall.  (And it also doesn't have room for the
flags argument.)

At that point, why not just have a syscall like this:

struct syscall {
	unsigned long num;
	unsigned long params[6];
};

int sys_many(size_t count, struct syscall *syscalls, int *results, unsigned long flags);

I think that has been discussed in the past.

Or, these days, that might be better done via eBPF, which would avoid
the need for flags like "return on error"; an eBPF program could decide
how to proceed after each call.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Andy Lutomirski @ 2015-01-08 17:57 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Fam Zheng, linux-kernel@vger.kernel.org, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro,
	Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook,
	Alexei Starovoitov, David Herrmann, Dario Faggioli,
	Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger,
	Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fab
In-Reply-To: <1420708372.18399.15.camel@suse.cz>

On Thu, Jan 8, 2015 at 1:12 AM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
>> Applications could use epoll interface when then need to poll a big number of
>> files in their main loops, to achieve better performance than ppoll(2). Except
>> for one concern: epoll only takes timeout parameters in microseconds, rather
>> than nanoseconds.
>>
>> That is a drawback we should address. For a real case in QEMU, we run into a
>> scalability issue with ppoll(2) when many devices are attached to guest, in
>> which case many host fds, such as virtual disk images and sockets, need to be
>> polled by the main loop. As a result we are looking at switching to epoll, but
>> the coarse timeout precision is a trouble, as explained below.
>>
>> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
>> timers in the main loop; and we call ppoll(2) with the next firing timer as
>> timeout, so when ppoll(2) returns, we know that we have more work to do (either
>> handling IO events, or fire a timer callback). This is natual and efficient,
>> except that ppoll(2) itself is slow.
>>
>> Now that we want to switch to epoll, to speed up the polling. However the timer
>> slack setting will be effectively undone, because that way we will have to
>> round up the timeout to microseconds honoring timer contract. But consequently,
>> this hurts the general responsiveness.
>>
>> Note: there are two alternatives, without changing kernel:
>>
>> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
>> be slow as one fd is polled. No more scalability issue. And if there are
>> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
>> timeout=0; otherwise, there can't be events for the epoll, skip the following
>> epoll_wait and just continue with other work.
>>
>> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
>> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
>> no other events have arrived. This will inheritly give us timerfd's precision.
>> Note that for each poll, the desired timeout is different because the next
>> timer is different, so that, before each epoll_wait(2), there will be a
>> timerfd_settime syscall to set it to a proper value.
>>
>> Unfortunately, both approaches require one more syscall per iteration, compared
>> to the original single ppoll(2), cost of which is unneglectable when we talk
>> about nanosecond granularity.

I'd like to see a more ambitious change, since the timer isn't the
only problem like this.  Specifically, I'd like a syscall that does a
list of epoll-related things and then waits.  The list of things could
include, at least:

 - EPOLL_CTL_MOD actions: level-triggered epoll users are likely to
want to turn on and off their requests for events on a somewhat
regular basis.

 - timerfd_settime actions: this allows a single syscall to wait and
adjust *both* monotonic and real-time wakeups.

Would this make sense?  It could look like:

int epoll_mod_and_pwait(int epfd,
  struct epoll_event *events, int maxevents,
  struct epoll_command *commands, int ncommands,
  const sigset_t *sigmask);

--Andy

^ permalink raw reply

* Re: [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Alexei Starovoitov @ 2015-01-08 17:51 UTC (permalink / raw)
  To: Fam Zheng
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Alexander Viro, Andrew Morton, Miklos Szeredi, Juri Lelli,
	Zach Brown, David Drysdale, Kees Cook, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh Triplett

On Thu, Jan 8, 2015 at 1:16 AM, Fam Zheng <famz@redhat.com> wrote:
> +       if (!timeout || (timeout->tv_nsec == 0 && timeout->tv_sec == 0)) {
..
> +       } else if (timeout->tv_nsec >= 0 && timeout->tv_sec >= 0) {

the check for tv_nsec is not enough, which points
to the fragility of passing user timespec around.
I think it would be safer and cleaner to do it futex style:
  if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
      return -EFAULT;
  if (!timespec_valid(&ts))
      return -EINVAL;
  t = timespec_to_ktime(ts);
and then only pass ktime_t around.

> +               struct timespec end_time = ep_set_mstimeout(timeout);

the name is now wrong, since it's no longer MStimeout.

I think handling of compat is missing as well.

^ permalink raw reply

* Re: [PATCHv3 01/20] media: add new types for DVB devnodes
From: Mauro Carvalho Chehab @ 2015-01-08 17:44 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Sakari Ailus, Linux Media Mailing List, Mauro Carvalho Chehab,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <39466944.7lEGPfqsct@avalon>

Em Thu, 08 Jan 2015 18:10:13 +0200
Laurent Pinchart <laurent.pinchart-ryLnwIuWjnjg/C1BVhZhaw@public.gmane.org> escreveu:

> Hi Mauro,
> 
> On Wednesday 07 January 2015 12:22:39 Mauro Carvalho Chehab wrote:
> > Em Wed, 07 Jan 2015 16:09:04 +0200 Sakari Ailus escreveu:
> > > Mauro Carvalho Chehab wrote:
> > > > Most of the DVB subdevs have already their own devnode.
> > > > 
> > > > Add support for them at the media controller API.
> > > > 
> > > > Signed-off-by: Mauro Carvalho Chehab <mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> > > > 
> > > > diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> > > > index 7902e800f019..707db275f92b 100644
> > > > --- a/include/uapi/linux/media.h
> > > > +++ b/include/uapi/linux/media.h
> > > > @@ -50,7 +50,14 @@ struct media_device_info {
> > > > 
> > > >   #define MEDIA_ENT_T_DEVNODE_V4L		(MEDIA_ENT_T_DEVNODE + 1)
> > > >   #define MEDIA_ENT_T_DEVNODE_FB		(MEDIA_ENT_T_DEVNODE + 2)
> > > >   #define MEDIA_ENT_T_DEVNODE_ALSA	(MEDIA_ENT_T_DEVNODE + 3)
> > > > 
> > > > -#define MEDIA_ENT_T_DEVNODE_DVB		(MEDIA_ENT_T_DEVNODE + 4)
> > > > +#define MEDIA_ENT_T_DEVNODE_DVB_FE	(MEDIA_ENT_T_DEVNODE + 4)
> > > > +#define MEDIA_ENT_T_DEVNODE_DVB_DEMUX	(MEDIA_ENT_T_DEVNODE + 5)
> > > > +#define MEDIA_ENT_T_DEVNODE_DVB_DVR	(MEDIA_ENT_T_DEVNODE + 6)
> > > > +#define MEDIA_ENT_T_DEVNODE_DVB_CA	(MEDIA_ENT_T_DEVNODE + 7)
> > > > +#define MEDIA_ENT_T_DEVNODE_DVB_NET	(MEDIA_ENT_T_DEVNODE + 8)
> > > 
> > > I'd create another type for the DVB sub-type devices, as there is for
> > > V4L2 sub-devices. I wonder what Laurent thinks.
> > 
> > I discussed this quickly with Laurent on IRC.
> > 
> > There are some concept differences between V4L2 and DVB.
> > 
> > At v4l2:
> > - the spec is one monolitic header (videodev2.h);
> > - one devnode is used to control everyhing (/dev/video?)
> > - there is one v4l core for all types of devices
> > 
> > At DVB:
> > - each different DVB API has its own header;
> > - each DVB device type has its own core (ok, they're
> >   linked into one module, but internally they're almost independent);
> > - each different DVB API has its own devnode.
> > 
> > So, using "SUBDEV" for DVB (or at least for the devnodes) don't
> > make much sense.
> > 
> > Ok, there are still some things at DVB side that could be mapped as
> > subdev. The clear example is the tuner. However, in this case, the
> > same tuner can be either V4L, DVB or both. So, we need to define just
> > one subdev type for the tuner.
> > 
> > Also, each DVB device can be identified via major/minor pairs.
> > 
> > I wrote already (and submitted upstream) the patches for media-ctl to
> > recognize them. They're also on my experimental v4l-utils tree:
> > 	http://git.linuxtv.org/cgit.cgi/mchehab/experimental-v4l-> 	utils.git/log/?h=dvb-media-ctl
> 
> As I've mentioned in a previous discussion, the media_entity type field is too 
> restrictive. Not only does this use case show that we need a type, sub-type 
> and sub-sub-type, there are also entities that implement several distinct 
> types. I thus believe we need a new ioctl is needed to expose detailed 
> information about entities. This topic has been discussed numerous times in 
> the past, it "just" requires someone to implement it.
> 
> I'm not opposed to a short-term solution like the one proposed here, but maybe 
> we should instead decide it's time to implement the new ioctl instead.
> 

Ok, so let's stick with it for DVB. At DVB side, I don't see a need for
sub-sub-type, especially since DVB has no subdevs (except for the shared
tuner between DVB and V4L). Everything there are devnodes, with their
functionality strictly following the documentation, as the API is fully
handled inside the DVB core[1].

Also, I don't want to mix adding DVB media controller support with the
addition of a new ioctl.

[1] For the non-deprecated DVB devnodes. DVB have 3 devnode types that
are deprecated because they implement functionality found elsewhere
(video, audio and OSD dvb APIs). Only one legacy driver implements it,
and there's no plan to ever add media controller or expand/accept
new drivers using those legacy APIs.

Regards,
Mauro

^ permalink raw reply

* Re: [PATCHv3 01/20] media: add new types for DVB devnodes
From: Laurent Pinchart @ 2015-01-08 16:10 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Sakari Ailus, Linux Media Mailing List, Mauro Carvalho Chehab,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150107122239.232d1f56-iErKd7e5oVp+urZeOPWqwQ@public.gmane.org>

Hi Mauro,

On Wednesday 07 January 2015 12:22:39 Mauro Carvalho Chehab wrote:
> Em Wed, 07 Jan 2015 16:09:04 +0200 Sakari Ailus escreveu:
> > Mauro Carvalho Chehab wrote:
> > > Most of the DVB subdevs have already their own devnode.
> > > 
> > > Add support for them at the media controller API.
> > > 
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> > > 
> > > diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> > > index 7902e800f019..707db275f92b 100644
> > > --- a/include/uapi/linux/media.h
> > > +++ b/include/uapi/linux/media.h
> > > @@ -50,7 +50,14 @@ struct media_device_info {
> > > 
> > >   #define MEDIA_ENT_T_DEVNODE_V4L		(MEDIA_ENT_T_DEVNODE + 1)
> > >   #define MEDIA_ENT_T_DEVNODE_FB		(MEDIA_ENT_T_DEVNODE + 2)
> > >   #define MEDIA_ENT_T_DEVNODE_ALSA	(MEDIA_ENT_T_DEVNODE + 3)
> > > 
> > > -#define MEDIA_ENT_T_DEVNODE_DVB		(MEDIA_ENT_T_DEVNODE + 4)
> > > +#define MEDIA_ENT_T_DEVNODE_DVB_FE	(MEDIA_ENT_T_DEVNODE + 4)
> > > +#define MEDIA_ENT_T_DEVNODE_DVB_DEMUX	(MEDIA_ENT_T_DEVNODE + 5)
> > > +#define MEDIA_ENT_T_DEVNODE_DVB_DVR	(MEDIA_ENT_T_DEVNODE + 6)
> > > +#define MEDIA_ENT_T_DEVNODE_DVB_CA	(MEDIA_ENT_T_DEVNODE + 7)
> > > +#define MEDIA_ENT_T_DEVNODE_DVB_NET	(MEDIA_ENT_T_DEVNODE + 8)
> > 
> > I'd create another type for the DVB sub-type devices, as there is for
> > V4L2 sub-devices. I wonder what Laurent thinks.
> 
> I discussed this quickly with Laurent on IRC.
> 
> There are some concept differences between V4L2 and DVB.
> 
> At v4l2:
> - the spec is one monolitic header (videodev2.h);
> - one devnode is used to control everyhing (/dev/video?)
> - there is one v4l core for all types of devices
> 
> At DVB:
> - each different DVB API has its own header;
> - each DVB device type has its own core (ok, they're
>   linked into one module, but internally they're almost independent);
> - each different DVB API has its own devnode.
> 
> So, using "SUBDEV" for DVB (or at least for the devnodes) don't
> make much sense.
> 
> Ok, there are still some things at DVB side that could be mapped as
> subdev. The clear example is the tuner. However, in this case, the
> same tuner can be either V4L, DVB or both. So, we need to define just
> one subdev type for the tuner.
> 
> Also, each DVB device can be identified via major/minor pairs.
> 
> I wrote already (and submitted upstream) the patches for media-ctl to
> recognize them. They're also on my experimental v4l-utils tree:
> 	http://git.linuxtv.org/cgit.cgi/mchehab/experimental-v4l-> 	utils.git/log/?h=dvb-media-ctl

As I've mentioned in a previous discussion, the media_entity type field is too 
restrictive. Not only does this use case show that we need a type, sub-type 
and sub-sub-type, there are also entities that implement several distinct 
types. I thus believe we need a new ioctl is needed to expose detailed 
information about entities. This topic has been discussed numerous times in 
the past, it "just" requires someone to implement it.

I'm not opposed to a short-term solution like the one proposed here, but maybe 
we should instead decide it's time to implement the new ioctl instead.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [PATCH] selftests/vm: fix link error for transhuge-stress test
From: Shuah Khan @ 2015-01-08 15:55 UTC (permalink / raw)
  To: Andrey Skvortsov, Andrew Morton, Konstantin Khlebnikov, linux-api,
	linux-kernel
In-Reply-To: <20150107201045.GA5704@yulia-desktop>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/07/2015 01:10 PM, Andrey Skvortsov wrote:
> On Wed, Jan 07, 2015 at 12:21:18PM -0700, Shuah Khan wrote:
>> On 01/07/2015 11:35 AM, Andrey Skvortsov wrote:
>>> add -lrt to fix undefined reference to `clock_gettime'
>>> 
>>> Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> 
>>> --- tools/testing/selftests/vm/Makefile |    2 +- 1 file
>>> changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/tools/testing/selftests/vm/Makefile
>>> b/tools/testing/selftests/vm/Makefile index 4c4b1f6..077828c
>>> 100644 --- a/tools/testing/selftests/vm/Makefile +++
>>> b/tools/testing/selftests/vm/Makefile @@ -7,7 +7,7 @@ BINARIES
>>> += transhuge-stress
>>> 
>>> all: $(BINARIES) %: %.c -	$(CC) $(CFLAGS) -o $@ $^ +	$(CC)
>>> $(CFLAGS) -o $@ $^ -lrt
>>> 
>>> run_tests: all @/bin/sh ./run_vmtests || (echo "vmtests:
>>> [FAIL]"; exit 1)
>>> 
>> 
>> Andrey,
>> 
>> I don't see any undefined references when I build. Curious if it
>> is specific to your env??
>> 
>> Please include the warning in the change log when you fix
>> warnings in the future.
>> 
> 
> thanks for the comment.
> 
> Here is what I get without a patch:
> 
> linux-next/tools/testing/selftests/vm $ make gcc -Wall -o
> hugepage-mmap hugepage-mmap.c gcc -Wall -o hugepage-shm
> hugepage-shm.c gcc -Wall -o map_hugetlb map_hugetlb.c gcc -Wall -o
> thuge-gen thuge-gen.c gcc -Wall -o hugetlbfstest hugetlbfstest.c 
> gcc -Wall -o transhuge-stress transhuge-stress.c /tmp/ccpWoqkG.o:
> In function `main': transhuge-stress.c:(.text+0x3a3): undefined
> reference to `clock_gettime' transhuge-stress.c:(.text+0x4dc):
> undefined reference to `clock_gettime' collect2: ld returned 1 exit
> status make: *** [transhuge-stress] Error 1
> 
> 
> $ gcc --version gcc (Ubuntu/Linaro 4.6.4-1ubuntu1~12.04) 4.6.4. The
> same error I get on my other Debian system.
> 
> man page for clock_gettime says 'Link with -lrt'. So I think the
> error message is correct.
> 

Thanks for fixing it. Applied to linux-kselftest fixes branch

- -- Shuah

- -- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJUrqhtAAoJEAsCRMQNDUMcVkgP/if69SGYILLnnj+7oBIuXJGl
8QZVJRHKdN/8Tny7PVBDmerv8kTxbDfaJpJyHg8nSSFRyTmOKChNkO9mcR7TIXcP
KxQn5BjYYUvcPM9jZESTAOIdEcaKbsMjj2Yknf49o7ij9YgCBPvFbDOZtfk/oMtF
no3wGSuZPF4cRBvhfHJQdLMc87VeLSdz4eDBWZ7LX6EXlfWc8QH8iJhbLzmPp3pJ
ofFKKeoaOrlcb/tQPdfSbWvVXZmWpJa8YdDGc2V+w6gahjPeWffe3ZTSyS4dwKZ+
Ayt4F2ptxu+/1fbOb9ZSFCrST2q+yavvXR5D/guK1EqIi1iRHJzq9QSNO8VJksG1
g+aCCTxI5qxHmmq2NandZ/UTttdxsUhPVO32IAFZcrq49gdPQqa/QuB0iI3fDevp
kqueiQj96wHr6zBFMgO6O+Vw/bmOjNdFOQUIv6iEJt1SbRtouBa4VVSRsbJijwGU
3jDZBmBhV/ntGuydqsPE1IAjQPO9nKk2rF7RZkx2OGx6Lc6kOk6BQ0eHnA/FgmcG
vZiHAIK99K9dEd926C+nV6pYK9UfPU/UKtmxIa/Z5At4oxN3SUKeZokmCFC7f2Ff
ueM0hLvEmfNTjSzGkx72RsdxE/6nU2kgYfNOGsJh4QuePrph9+orgHtb5qzZXmiA
Mf3pjHSWH0sqAXKfZZmF
=zsvn
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: [RESEND PATCH 2/3] epoll: Add implementation for epoll_pwait1
From: Michael Kerrisk (man-pages) @ 2015-01-08 11:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Alexander Viro,
	Andrew Morton, Miklos Szeredi, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers <mathieu.desnoyers>
In-Reply-To: <54AE65BB.1020707-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 8 January 2015 at 12:10, Paolo Bonzini <pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>
> On 08/01/2015 10:16, Fam Zheng wrote:
>> Unlike ppoll(2), which accepts a timespec argument "timeout_ts" to
>> specify the timeout, epoll_wait(2) and epoll_pwait(2) expect a
>> microsecond timeout in int type.
>>
>> This is an obstacle for applications in switching from ppoll to epoll,
>> if they want nanosecond resolution in their event loops.
>>
>> Therefore, adding this variation of epoll wait interface, giving user an
>> option with *both* advantages, is a reasonable move: there could be
>> constantly scalable performance polling many fds, while having a
>> nanosecond timeout precision (assuming it has properly set up timer
>> slack with prctl(2)).
>>
>> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>>  fs/eventpoll.c           | 24 ++++++++++++++++++++++++
>>  include/linux/syscalls.h |  4 ++++
>>  kernel/sys_ni.c          |  3 +++
>>  3 files changed, 31 insertions(+)
>
> As mentioned by Miklos in the non-resent version, please add a flags
> argument.  Invalid flags should return -EINVAL.
>
> In fact, we could already use the flags argument to specify an absolute
> timeout, which is a nice thing to have for QEMU too.

Nice! It looks like we found this iteration of "failure to include a
flags argument is a mistake" already!

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [RESEND PATCH 2/3] epoll: Add implementation for epoll_pwait1
From: Paolo Bonzini @ 2015-01-08 11:10 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Miklos Szeredi, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh Triplett
In-Reply-To: <1420708563-21743-3-git-send-email-famz@redhat.com>



On 08/01/2015 10:16, Fam Zheng wrote:
> Unlike ppoll(2), which accepts a timespec argument "timeout_ts" to
> specify the timeout, epoll_wait(2) and epoll_pwait(2) expect a
> microsecond timeout in int type.
> 
> This is an obstacle for applications in switching from ppoll to epoll,
> if they want nanosecond resolution in their event loops.
> 
> Therefore, adding this variation of epoll wait interface, giving user an
> option with *both* advantages, is a reasonable move: there could be
> constantly scalable performance polling many fds, while having a
> nanosecond timeout precision (assuming it has properly set up timer
> slack with prctl(2)).
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 24 ++++++++++++++++++++++++
>  include/linux/syscalls.h |  4 ++++
>  kernel/sys_ni.c          |  3 +++
>  3 files changed, 31 insertions(+)

As mentioned by Miklos in the non-resent version, please add a flags
argument.  Invalid flags should return -EINVAL.

In fact, we could already use the flags argument to specify an absolute
timeout, which is a nice thing to have for QEMU too.

Paolo

^ permalink raw reply

* Re: [PATCH v8 1/2] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2015-01-08 11:09 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: 'Quentin Gouchet', Daniel Borkmann,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <6260799.4x68Msg3Li-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

On Wed, Jan 07, 2015 at 04:51:38PM +0100, Stephan Mueller wrote:
>
> +		if (!aead_writable(sk)) {
> +			/*
> +			 * If there is more data to be expected, but we cannot
> +			 * write more data, forcefully define that we do not
> +			 * expect more data to invoke the AEAD operation. This
> +			 * prevents a deadlock in user space.
> +			 */
> +			ctx->more = 0;

We should return EMSGSIZE here.  Also we should clear out the
existing data so that the socket may be reused again.

> +	ctx->more = msg->msg_flags & MSG_MORE;
> +	if (!ctx->more && !aead_sufficient_data(ctx))
> +		err = -EINVAL;

Ditto, we should discard the data that's queued up.  Also perhaps
use EBADMSG instead of EINVAL.

> +	/*
> +	 * Require exactly one IOV block as the AEAD operation is a one shot
> +	 * due to the authentication tag.
> +	 */
> +	if (msg->msg_iter.nr_segs != 1)
> +		return -ENOMSG;

Why does the receive buffer have to be contiguous?

Cheers,
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Michael Kerrisk (man-pages) @ 2015-01-08 11:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Fam Zheng, lkml, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Alexander Viro,
	Andrew Morton, Juri Lelli, Zach Brown, David Drysdale, Kees Cook,
	Alexei Starovoitov, David Herrmann, Dario Faggioli,
	Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger,
	Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers
In-Reply-To: <1420708372.18399.15.camel-AlSwsSmVLrQ@public.gmane.org>

On 8 January 2015 at 10:12, Miklos Szeredi <mszeredi-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
>> Applications could use epoll interface when then need to poll a big number of
>> files in their main loops, to achieve better performance than ppoll(2). Except
>> for one concern: epoll only takes timeout parameters in microseconds, rather
>> than nanoseconds.
>>
>> That is a drawback we should address. For a real case in QEMU, we run into a
>> scalability issue with ppoll(2) when many devices are attached to guest, in
>> which case many host fds, such as virtual disk images and sockets, need to be
>> polled by the main loop. As a result we are looking at switching to epoll, but
>> the coarse timeout precision is a trouble, as explained below.
>>
>> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
>> timers in the main loop; and we call ppoll(2) with the next firing timer as
>> timeout, so when ppoll(2) returns, we know that we have more work to do (either
>> handling IO events, or fire a timer callback). This is natual and efficient,
>> except that ppoll(2) itself is slow.
>>
>> Now that we want to switch to epoll, to speed up the polling. However the timer
>> slack setting will be effectively undone, because that way we will have to
>> round up the timeout to microseconds honoring timer contract. But consequently,
>> this hurts the general responsiveness.
>>
>> Note: there are two alternatives, without changing kernel:
>>
>> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
>> be slow as one fd is polled. No more scalability issue. And if there are
>> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
>> timeout=0; otherwise, there can't be events for the epoll, skip the following
>> epoll_wait and just continue with other work.
>>
>> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
>> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
>> no other events have arrived. This will inheritly give us timerfd's precision.
>> Note that for each poll, the desired timeout is different because the next
>> timer is different, so that, before each epoll_wait(2), there will be a
>> timerfd_settime syscall to set it to a proper value.
>>
>> Unfortunately, both approaches require one more syscall per iteration, compared
>> to the original single ppoll(2), cost of which is unneglectable when we talk
>> about nanosecond granularity.
>
> Please consider adding a "flags" argument to the new syscall (and
> returning EINVAL if non-zero).  See this article, which shows that
> extended syscalls almost always want flags, and they often get it only
> on the second try:
>
> http://lwn.net/Articles/585415/

Yes, please ensure that the new call gets a flags argument. We've made
this mistake way too many times in the past.

Thanks,

Michael




-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* [RESEND PATCH 3/3] x86: hook up epoll_pwait1 syscall
From: Fam Zheng @ 2015-01-08  9:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Miklos Szeredi, Juri Lelli, Zach Brown,
	David Drysdale, Fam Zheng, Kees Cook, Alexei Starovoitov,
	David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra,
	Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes,
	Oleg Nesterov, Mathieu Desnoyers, Fabian Frederick, Josh
In-Reply-To: <1420708563-21743-1-git-send-email-famz@redhat.com>

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..1c863f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_pwait1		sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..644a90f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	epoll_pwait1		sys_epoll_pwait1
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3

^ permalink raw reply related

* [RESEND PATCH 2/3] epoll: Add implementation for epoll_pwait1
From: Fam Zheng @ 2015-01-08  9:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Miklos Szeredi, Juri Lelli, Zach Brown,
	David Drysdale, Fam Zheng, Kees Cook, Alexei Starovoitov,
	David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra,
	Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes,
	Oleg Nesterov, Mathieu Desnoyers, Fabian Frederick, Josh
In-Reply-To: <1420708563-21743-1-git-send-email-famz@redhat.com>

Unlike ppoll(2), which accepts a timespec argument "timeout_ts" to
specify the timeout, epoll_wait(2) and epoll_pwait(2) expect a
microsecond timeout in int type.

This is an obstacle for applications in switching from ppoll to epoll,
if they want nanosecond resolution in their event loops.

Therefore, adding this variation of epoll wait interface, giving user an
option with *both* advantages, is a reasonable move: there could be
constantly scalable performance polling many fds, while having a
nanosecond timeout precision (assuming it has properly set up timer
slack with prctl(2)).

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c           | 24 ++++++++++++++++++++++++
 include/linux/syscalls.h |  4 ++++
 kernel/sys_ni.c          |  3 +++
 3 files changed, 31 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 117ba72..ee69fd4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2066,6 +2066,30 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 
+SYSCALL_DEFINE6(epoll_pwait1, int, epfd, struct epoll_event __user *, events,
+		int, maxevents,
+		struct timespec __user *, timeout,
+		const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	struct timespec ts;
+	sigset_t ksigmask;
+
+	if (timeout && copy_from_user(&ts, timeout, sizeof(ts)))
+		return -EFAULT;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents,
+			      timeout ? &ts : NULL,
+			      sigmask ? &ksigmask : NULL,
+			      sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..3e0ed0b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -630,6 +630,10 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, struct epoll_event __user *events,
+				 int maxevents, struct timespec __user *ts,
+				 const sigset_t __user *sigmask,
+				 size_t sigsetsize);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..1044158 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* epoll_pwait1 */
+cond_syscall(sys_epoll_pwait1);
-- 
1.9.3


^ permalink raw reply related

* [RESEND PATCH 1/3] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Fam Zheng @ 2015-01-08  9:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Alexander Viro,
	Andrew Morton, Miklos Szeredi, Juri Lelli, Zach Brown,
	David Drysdale, Fam Zheng, Kees Cook, Alexei Starovoitov,
	David Herrmann, Dario Faggioli, Theodore Ts'o, Peter Zijlstra,
	Vivek Goyal, Mike Frysinger, Heiko Carstens, Rasmus Villemoes,
	Oleg Nesterov, Mathieu Desnoyers, Fabian Frederick, Josh
In-Reply-To: <1420708563-21743-1-git-send-email-famz@redhat.com>

In preparation of epoll_pwait1, this allows sharing code with coming new
syscall. The new functions use timespec for timeout.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 136 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 70 insertions(+), 66 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..117ba72 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,15 +1554,12 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
+static inline struct timespec ep_set_mstimeout(const struct timespec *ts)
 {
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
+	struct timespec now;
 
 	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
+	return timespec_add_safe(now, *ts);
 }
 
 /**
@@ -1573,17 +1570,16 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If NULL, or
+ *           if both tv_sec and tv_nsec are zero, the function will not block.
+ *           If either one is less than zero, the function will block until at
+ *           least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const struct timespec *timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1587,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!timeout || (timeout->tv_nsec == 0 && timeout->tv_sec == 0)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1595,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (timeout->tv_nsec >= 0 && timeout->tv_sec >= 0) {
+		struct timespec end_time = ep_set_mstimeout(timeout);
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1950,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const struct timespec *timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,29 +1994,35 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-		size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	struct timespec ts = (struct timespec) {
+	    .tv_sec = timeout / MSEC_PER_SEC,
+	    .tv_nsec = (timeout % MSEC_PER_SEC) * NSEC_PER_MSEC,
+	};
+	return epoll_wait_do(epfd, events, maxevents, &ts);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, struct timespec *timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
 {
 	int error;
-	sigset_t ksigmask, sigsaved;
+	sigset_t sigsaved;
 
 	/*
 	 * If the caller wants a certain signal mask to be set during the wait,
 	 * we apply it here.
 	 */
 	if (sigmask) {
-		if (sigsetsize != sizeof(sigset_t))
-			return -EINVAL;
-		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
-			return -EFAULT;
 		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
+		set_current_blocked(sigmask);
 	}
 
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2042,55 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 	return error;
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	struct timespec ts = (struct timespec) {
+	    .tv_sec = timeout / MSEC_PER_SEC,
+	    .tv_nsec = (timeout % MSEC_PER_SEC) * NSEC_PER_MSEC,
+	};
+	sigset_t ksigmask;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents, &ts,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+
+	struct timespec ts = (struct timespec) {
+	    .tv_sec = timeout / MSEC_PER_SEC,
+	    .tv_nsec = (timeout % MSEC_PER_SEC) * NSEC_PER_MSEC,
+	};
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
-	}
-
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
 	}
 
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, &ts,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3

^ permalink raw reply related

* [RESEND PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Fam Zheng @ 2015-01-08  9:16 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86-DgEjT+Ai2ygdnm+yROfE0A, Alexander Viro, Andrew Morton,
	Miklos Szeredi, Juri Lelli, Zach Brown, David Drysdale, Fam Zheng,
	Kees Cook, Alexei Starovoitov, David Herrmann, Dario Faggioli,
	Theodore Ts'o, Peter Zijlstra, Vivek Goyal, Mike Frysinger,
	Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh

[Resend because my script screwed the recipient format, sorry for the noise.]

Applications could use epoll interface when then need to poll a big number of
files in their main loops, to achieve better performance than ppoll(2). Except
for one concern: epoll only takes timeout parameters in microseconds, rather
than nanoseconds.

That is a drawback we should address. For a real case in QEMU, we run into a
scalability issue with ppoll(2) when many devices are attached to guest, in
which case many host fds, such as virtual disk images and sockets, need to be
polled by the main loop. As a result we are looking at switching to epoll, but
the coarse timeout precision is a trouble, as explained below. 

We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
timers in the main loop; and we call ppoll(2) with the next firing timer as
timeout, so when ppoll(2) returns, we know that we have more work to do (either
handling IO events, or fire a timer callback). This is natual and efficient,
except that ppoll(2) itself is slow.

Now that we want to switch to epoll, to speed up the polling. However the timer
slack setting will be effectively undone, because that way we will have to
round up the timeout to microseconds honoring timer contract. But consequently,
this hurts the general responsiveness.

Note: there are two alternatives, without changing kernel:

1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
be slow as one fd is polled. No more scalability issue. And if there are
events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
timeout=0; otherwise, there can't be events for the epoll, skip the following
epoll_wait and just continue with other work.

2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
The timerfd will hopefully force epoll_wait to return when it timeouts, even if
no other events have arrived. This will inheritly give us timerfd's precision.
Note that for each poll, the desired timeout is different because the next
timer is different, so that, before each epoll_wait(2), there will be a
timerfd_settime syscall to set it to a proper value.

Unfortunately, both approaches require one more syscall per iteration, compared
to the original single ppoll(2), cost of which is unneglectable when we talk
about nanosecond granularity.

Fam

Fam Zheng (3):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Add implementation for epoll_pwait1
  x86: hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/eventpoll.c                   | 160 +++++++++++++++++++++++----------------
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   3 +
 5 files changed, 103 insertions(+), 66 deletions(-)

-- 
1.9.3

^ permalink raw reply

* Re: [PATCH 0/3] epoll: Add epoll_pwait1 syscall
From: Miklos Szeredi @ 2015-01-08  9:12 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Juri Lelli, Zach Brown,
	David Drysdale, Kees Cook, Alexei Starovoitov, David Herrmann,
	Dario Faggioli, Theodore Ts'o, Peter Zijlstra, Vivek Goyal,
	Mike Frysinger, Heiko Carstens, Rasmus Villemoes, Oleg Nesterov,
	Mathieu Desnoyers, Fabian Frederick, Josh Triplett
In-Reply-To: <1420705550-24245-1-git-send-email-famz@redhat.com>

On Thu, 2015-01-08 at 16:25 +0800, Fam Zheng wrote:
> Applications could use epoll interface when then need to poll a big number of
> files in their main loops, to achieve better performance than ppoll(2). Except
> for one concern: epoll only takes timeout parameters in microseconds, rather
> than nanoseconds.
> 
> That is a drawback we should address. For a real case in QEMU, we run into a
> scalability issue with ppoll(2) when many devices are attached to guest, in
> which case many host fds, such as virtual disk images and sockets, need to be
> polled by the main loop. As a result we are looking at switching to epoll, but
> the coarse timeout precision is a trouble, as explained below. 
> 
> We're already using prctl(PR_SET_TIMERSLACK, 1) which is necessary to implement
> timers in the main loop; and we call ppoll(2) with the next firing timer as
> timeout, so when ppoll(2) returns, we know that we have more work to do (either
> handling IO events, or fire a timer callback). This is natual and efficient,
> except that ppoll(2) itself is slow.
> 
> Now that we want to switch to epoll, to speed up the polling. However the timer
> slack setting will be effectively undone, because that way we will have to
> round up the timeout to microseconds honoring timer contract. But consequently,
> this hurts the general responsiveness.
> 
> Note: there are two alternatives, without changing kernel:
> 
> 1) Leading ppoll(2), with the epollfd only and a nanosecond timeout. It won't
> be slow as one fd is polled. No more scalability issue. And if there are
> events, we know from ppoll(2)'s return, then we do the epoll_wait(2) with
> timeout=0; otherwise, there can't be events for the epoll, skip the following
> epoll_wait and just continue with other work.
> 
> 2) Setup and add a timerfd to epoll, then we do epoll_wait(..., timeout=-1).
> The timerfd will hopefully force epoll_wait to return when it timeouts, even if
> no other events have arrived. This will inheritly give us timerfd's precision.
> Note that for each poll, the desired timeout is different because the next
> timer is different, so that, before each epoll_wait(2), there will be a
> timerfd_settime syscall to set it to a proper value.
> 
> Unfortunately, both approaches require one more syscall per iteration, compared
> to the original single ppoll(2), cost of which is unneglectable when we talk
> about nanosecond granularity.

Please consider adding a "flags" argument to the new syscall (and
returning EINVAL if non-zero).  See this article, which shows that
extended syscalls almost always want flags, and they often get it only
on the second try:

http://lwn.net/Articles/585415/

Thanks,
Miklos

P.S. stray apostrophes in To: and Cc: lines seems to be causing trouble.

^ permalink raw reply

* Re: [v8 2/5] ext4: adds project ID support
From: Jan Kara @ 2015-01-08  8:51 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Li Xi, linux-fsdevel, linux-ext4, linux-api, tytso, jack, viro,
	hch, dmonakhov
In-Reply-To: <0F8FAF43-54EB-4202-8120-C1D70CF330F0@dilger.ca>

On Wed 07-01-15 16:11:16, Andreas Dilger wrote:
> On Dec 8, 2014, at 10:22 PM, Li Xi <pkuelelixi@gmail.com> wrote:
> > This patch adds a new internal field of ext4 inode to save project
> > identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> > inheriting project ID from parent directory.
> > 
> > Signed-off-by: Li Xi <lixi@ddn.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---
...
> > @@ -4249,6 +4266,18 @@ static int ext4_do_update_inode(handle_t *handle,
> > 		}
> > 	}
> > 
> > +	BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > +			EXT4_FEATURE_RO_COMPAT_PROJECT) &&
> > +	       i_projid != EXT4_DEF_PROJID);
> > +	if (i_projid != EXT4_DEF_PROJID &&
> > +	    (EXT4_INODE_SIZE(inode->i_sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
> > +	     (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)))) {
> > +		spin_unlock(&ei->i_raw_lock);
> > +		err = -EFBIG;
> 
> I'm not sure if -EFBIG is the best error case, since that is a common
> filesystem error.  Maybe -EOVERFLOW would be better?
  So my thought has been that this is mostly a sanity check and we would
check in tune2fs whether all inodes have enough space when setting
EXT4_FEATURE_RO_COMPAT_PROJECT feature... Because sudden errors (regardless
whether it will be EFBIG or EOVERFLOW) when renaming files will be hard to
understand for sysadmins IMHO.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits
From: Amir Vadai @ 2015-01-08  8:40 UTC (permalink / raw)
  To: David Decotigny
  Cc: Florian Fainelli, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Saeed Mahameed,
	David S. Miller, Jason Wang, Michael S. Tsirkin, Herbert Xu,
	Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang, Neil Horman,
	WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
	Vlad Yasevich, Eric W. Biederman, Venkata Duvvuru,
	Govindarajulu Varadarajan <_govind>
In-Reply-To: <CAG88wWYPDpwkWkL+Pj2VKrX5WVp=at8v0=gcFAVAA8nntv+-nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 1/6/2015 7:36 PM, David Decotigny wrote:
> Interesting. It seems that the band-aid I was proposing is already
> obsolete. We could still use the remaining reserved 16 bits to encode
> 5 more bits per mask (that is: 53 bits / mask total). But if I
> understand you, it would allow us to survive only a few months longer,
> as opposed to a few weeks.
> 
> One short-term alternative solution I can imagine is the following:
> /* For example bitmap-based for variable length: */
> struct ethtool_link_mode {
>   __u32 cmd;
>   __u8 autoneg :1;
>   __u8 duplex :2;
>  __u16 supported_nbits;
>   __u16 advertising_nbits;
>   __u16 lp_advertising_nbits;
>   __u32 reserved[4];
>   __u8 masks[0];
> };
> /* Or simpler, statically limited to 64b / mask, but easier to migrate
> to for driver authors: */
I think the first options is better. A driver will have to do changes in
order to support >32 link modes, so better change it once now, without
having to change it again for >64 link modes.

> struct ethtool_link_mode {
>   __u32 cmd;
>   __u8 autoneg :1;
>   __u8 duplex :2;
>    __u64 supported;
>   __u64 advertising;
>   __u64 lp_advertising;
>   __u32 reserved[4];
> };
> #define ETHTOOL_GLINK_MODE 0x0000004a
> #define ETHTOOL_SLINK_MODE 0x0000004b
> struct ethtool_ops {
> ...
>    int (*get_link_mode)(struct net_device *, struct ethtool_link_mode *);
>    int (*set_link_mode)(struct net_device *, struct ethtool_link_mode *);
> };
> 
> The same thing required for EEE.
Yeh :(

> 
> I am not sure about moving the autoneg and duplex fields into the new
> struct. Especially the "duplex" field.
I think so too. ethtool user space will call ETHTOOL_[GS]SET and after
that ETHTOOL_[GS]LINK_MODE (if supported). No need to get the
duplex/autoneg fields again.

> 
> Then the idea would be to update the ethtool user-space tool to try
> get/set_link mode when reporting/changing the autoneg/advertising
> settings.
> 
> Both will require significant effort from the driver authors.
> Especially if the variable-length bitmap approach is preferred:
>  - most drivers currently use simple bitwise arithmetic in their code,
> and that goes far beyond get/set_settings, it is sometimes part of the
> core driver logic. They will have to migrate to the bitmap API if they
> want to use the larger bitmaps (note: no change needed if they are
> happy with <= 32b / mask)
As I said above, it will save as doing this work again in the future,
and more problematic, save another version to backport in the future. In
addition, not all drivers will have to do it, only if >32 link speeds is
needed - this work will be required.

>  - we would have to progressively deprecate the use of #define
> ADVERTISED_1000baseT_Full in favor of an enum of the bit indices.
Maybe we could use some macro juggling to define the legacy macro's
using enum for the first 32 bits, and fail the compilation if used on
>32. For example, calling this:
DEFINE_LINK_MODE(ADVERTISED_1000baseT_Full, 5)

Will add the following:
ADVERTISED_1000baseT_Full_SHIFT = 5
ADVERTISED_1000baseT_Full = (1<<5)

DEFINE_LINK_MODE(ADVERTISED_100000baseKR5_Full, 50) will add:
ADVERTISED_100000baseKR5_Full_SHIFT = 50
ADVERTISED_100000baseKR5_Full = #error new link speeds must be defined
using [gs]et_link_speed

This will break compilation if ADVERTISED_100000baseKR5_Full is used in
[gs]et_settings (I know the '#error' will not print something very
pretty - I used it only to explain what I meant)

> 
> Any feedback welcome. In the meantime, I am going to propose a v3 of
> current option with 53 bits / mask. I can also propose a prototype of
> the scheme described above, please let me know.
I think that it is better to do it once, and skip the 53 bits / mask
version.
I'll be happy to assist.

Amir

^ permalink raw reply

* Re: [v8 2/5] ext4: adds project ID support
From: Jan Kara @ 2015-01-08  8:26 UTC (permalink / raw)
  To: Li Xi
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1418102548-5469-3-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

On Tue 09-12-14 13:22:25, Li Xi wrote:
> This patch adds a new internal field of ext4 inode to save project
> identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> inheriting project ID from parent directory.
  I have noticed one thing you apparently changed in v7 of the patch set.
See below.

> Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> ---
>  fs/ext4/ext4.h          |   21 +++++++++++++++++----
>  fs/ext4/ialloc.c        |    6 ++++++
>  fs/ext4/inode.c         |   29 +++++++++++++++++++++++++++++
>  fs/ext4/namei.c         |   17 +++++++++++++++++
>  fs/ext4/super.c         |    1 +
>  include/uapi/linux/fs.h |    1 +
>  6 files changed, 71 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 29c43e7..8bd1da9 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -377,16 +377,18 @@ struct flex_groups {
>  #define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
>  #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
>  #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
> +#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL /* Create with parents projid */
  How did FS_PROJINHERIT_FL get here? There used to be 0x20000000 in older
version of the patch set which is correct - this definition is defining
ext4 on-disk format. As such it is an ext4 specific flag and should be
definined to a fixed constant independed of any other filesystem. It seems
you are somewhat mixing what is an on-disk format flag value and what is a
flag value passed from userspace. These two may be different things and you
need to convert between the values when getting / setting flags...

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-01-07 23:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <87y4peyxw5.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> >> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
> >> familiar with the basics before claiming random stuff.
> 
> Oh let's see I got that command line option out of /proc/mounts and yes
> it works.  Perhaps it doesn't if I invoke unified hiearchies but the
> option does in fact exist and work.

I meant the -o SUBSYS doesn't exist for unified hierarchy.

> Now I really do need to test report regressions, and send probably send
> regression fixes.  If I understand your strange ranting I think you just
> told me that option that -o SUBSYS does work with unified hierarchies.

What?  Why would -O SUBSYS exist for unified hierarchy?  It's unified
for all controllers.

> Tejun.  I asked you specifically about this case 2 years ago at plumbers
> and you personally told me this would continue to work.  I am going to
> hold you to that.

I have no idea what you're talking about in *THIS* thread.  I'm fully
aware of what was discussed *THEN*.

> Fixing bugs is one thing.  Gratuitious regressions that make supporting
> existing user space applications insane is another.

Can you explain what problem you're actually trying to talk about
without spouting random claims about regressions?

-- 
tejun

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox