Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Christian Brauner @ 2019-04-26 15:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, luto, rostedt, dancol, sspatil, jannh, surenb,
	timmurray, Jonathan Kowalski, torvalds, kernel-team,
	Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas Gleixner, Tycho Andersen
In-Reply-To: <20190426142337.GC261279@google.com>

On Fri, Apr 26, 2019 at 10:23:37AM -0400, Joel Fernandes wrote:
> On Fri, Apr 26, 2019 at 12:24:04AM +0200, Christian Brauner wrote:
> > On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google) wrote:
> > > pidfd are file descriptors referring to a process created with the
> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > > polling support to replace code that currently checks for existence of
> > > /proc/pid for knowing that a process that is signalled to be killed has
> > > died, which is both racy and slow. The pidfd poll approach is race-free,
> > > and also allows the LMK to do other things (such as by polling on other
> > > fds) while awaiting the process being killed to die.
> > 
> > Thanks for the patch!
> > 
> > Ok, let me be a little bit anal.
> > Please start the commit message with what this patch does and then add
> 
> The subject title is "Add polling support to pidfd", but ok I should write a
> better commit message.

Yeah, it's really just that we should really just have a simple
paragraph that expresses this makes the codebase do X.

> 
> > the justification why. You just say the "pidfd-poll" approach. You can
> > probably assume that CLONE_PIDFD is available for this patch. So
> > something like:
> > 
> > "This patch makes pidfds pollable. Specifically, it allows listeners to
> > be informed when the process the pidfd referes to exits. This patch only
> > introduces the ability to poll thread-group leaders since pidfds
> > currently can only reference those..."
> > 
> > Then justify the use-case and then go into implementation details.
> > That's usually how I would think about this:
> > - Change the codebase to do X
> > - Why do we need X
> > - Are there any technical details worth mentioning in the commit message
> > (- Are there any controversial points that people stumbled upon but that
> >   have been settled sufficiently.)
> 
> Generally the "how" in the patch should be in the code, but ok.

That's why I said: technical details that are worth mentioning.
Sometimes you have controversial bits that are obviously to be
understood in the code but it still might be worth explaining why one
had to do it this way. Like say what we did for the pidfd_send_signal()
thing where we explained why O_PATH is disallowed.

> 
> I changed the first 3 paragraphs of the changelog to the following, is that
> better? :
> 
> Android low memory killer (LMK) needs to know when a process dies once
> it is sent the kill signal. It does so by checking for the existence of
> /proc/pid which is both racy and slow. For example, if a PID is reused
> between when LMK sends a kill signal and checks for existence of the
> PID, since the wrong PID is now possibly checked for existence.
> 
> This patch adds polling support to pidfd. Using the polling support, LMK
> will be able to get notified when a process exists in race-free and fast
> way, and allows the LMK to do other things (such as by polling on other
> fds) while awaiting the process being killed to die.
> 
> For notification to polling processes, we follow the same existing
> mechanism in the kernel used when the parent of the task group is to be
> notified of a child's death (do_notify_parent).  This is precisely when
> the tasks waiting on a poll of pidfd are also awakened in this patch.
> 
> > > pidfd are file descriptors referring to a process created with the
> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd
> > > polling support to replace code that currently checks for existence of
> > > /proc/pid for knowing that a process that is signalled to be killed has
> > > died, which is both racy and slow. The pidfd poll approach is race-free,
> > > and also allows the LMK to do other things (such as by polling on other
> > > fds) while awaiting the process being killed to die.
> > 
> > > 
> > > It prevents a situation where a PID is reused between when LMK sends a
> > > kill signal and checks for existence of the PID, since the wrong PID is
> > > now possibly checked for existence.
> > > 
> > > In this patch, we follow the same existing mechanism in the kernel used
> > > when the parent of the task group is to be notified (do_notify_parent).
> > > This is when the tasks waiting on a poll of pidfd are also awakened.
> > > 
> > > We have decided to include the waitqueue in struct pid for the following
> > > reasons:
> > > 1. The wait queue has to survive for the lifetime of the poll. Including
> > > it in task_struct would not be option in this case because the task can
> > > be reaped and destroyed before the poll returns.
> > > 
> > > 2. By including the struct pid for the waitqueue means that during
> > > de_thread(), the new thread group leader automatically gets the new
> > > waitqueue/pid even though its task_struct is different.
> > > 
> > > Appropriate test cases are added in the second patch to provide coverage
> > > of all the cases the patch is handling.
> > > 
> > > Andy had a similar patch [1] in the past which was a good reference
> > > however this patch tries to handle different situations properly related
> > > to thread group existence, and how/where it notifies. And also solves
> > > other bugs (waitqueue lifetime).  Daniel had a similar patch [2]
> > > recently which this patch supercedes.
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/345098/
> > > [2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
> > > 
> > > Cc: luto@amacapital.net
> > > Cc: rostedt@goodmis.org
> > > Cc: dancol@google.com
> > > Cc: sspatil@google.com
> > > Cc: christian@brauner.io
> > > Cc: jannh@google.com
> > > Cc: surenb@google.com
> > > Cc: timmurray@google.com
> > > Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> > > Cc: torvalds@linux-foundation.org
> > > Cc: kernel-team@android.com
> > 
> > These should all be in the form:
> > 
> > Cc: Firstname Lastname <email@address.com>
> 
> If this bothers you too much, I can also just remove the CC list from the
> changelog here, and include it in my invocation of git-send-email instead..
> but I have seen commits in the tree that don't follow this rule.

Yeah, but they should. There are people with multiple emails over the
years and they might not necessarily contain their first and last
name. And I don't want to have to mailmap them or sm. Having their names
in there just makes it easier. Also, every single other DCO-*related*
line follows:

Random J Developer <random@developer.example.org>

This should too. If others are sloppy and allow this, fine. No reason we
should.

> 
> > 
> > There are people missing from the Cc that really should be there...
> 
> If you look at the CC list of the email, people in the get_maintainer.pl
> script were also added. I did run get_maintainer.pl and checkpatch. But ok, I
> will add the folks you are suggesting as well. Thanks.

get_maintainer.pl is not the last word. 

> 
> > Even though he usually doesn't respond that often, please Cc Al on this.
> > If he responds it's usually rather important.
> 
> No issues on that, but I am wondering if he should also be in MAINTAINERS
> file somewhere such that get_maintainer.pl does pick him up. I added him.

It's often not about someone being a maintainer but whether or not they
have valuable input.

"[...] This tag documents that potentially interested parties have been
included in the discussion."

> 
> > Oleg has reviewed your RFC patch quite substantially and given valuable
> > feedback and has an opinion on this thing and is best acquainted with
> > the exit code. So please add him to the Cc of the commit message in the
> > appropriate form and also add him to the Cc of the thread.
> 
> Done.

Thanks!

> 
> > Probably also want linux-api for good measure since a lot of people are
> > subscribed that would care about pollable pidfds. I'd also add Kees
> > since he had some interest in this work and David (Howells).
> 
> Done, I added all of them and CC will go out to them next time. Thanks.

Cool. That really wasn't a "you've done this wrong". It's rather really
just to make sure that everyone who might catch a big f*ck up on our
part has had a chance to tell us so. :)

> 
> > 
> > > Co-developed-by: Daniel Colascione <dancol@google.com>
> > 
> > Every CDB needs to give a SOB as well.
> 
> Ok, done. thanks.

Fwiw, I only learned this recently too.

> 
> > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > ---
> > > 
> > > RFC -> v1:
> > > * Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
> > > * Updated selftests.
> > > * Renamed poll wake function to do_notify_pidfd.
> > > * Removed depending on EXIT flags
> > > * Removed POLLERR flag since semantics are controversial and
> > >   we don't have usecases for it right now (later we can add if there's
> > >   a need for it).
> > > 
> > >  include/linux/pid.h |  3 +++
> > >  kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
> > >  kernel/pid.c        |  2 ++
> > >  kernel/signal.c     | 14 ++++++++++++++
> > >  4 files changed, 52 insertions(+)
> > > 
> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > index 3c8ef5a199ca..1484db6ca8d1 100644
> > > --- a/include/linux/pid.h
> > > +++ b/include/linux/pid.h
> > > @@ -3,6 +3,7 @@
> > >  #define _LINUX_PID_H
> > >  
> > >  #include <linux/rculist.h>
> > > +#include <linux/wait.h>
> > >  
> > >  enum pid_type
> > >  {
> > > @@ -60,6 +61,8 @@ struct pid
> > >  	unsigned int level;
> > >  	/* lists of tasks that use this pid */
> > >  	struct hlist_head tasks[PIDTYPE_MAX];
> > > +	/* wait queue for pidfd notifications */
> > > +	wait_queue_head_t wait_pidfd;
> > >  	struct rcu_head rcu;
> > >  	struct upid numbers[1];
> > >  };
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 5525837ed80e..fb3b614f6456 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
> > >  }
> > >  #endif
> > >  
> > > +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > > +{
> > > +	struct task_struct *task;
> > > +	struct pid *pid;
> > > +	int poll_flags = 0;
> > > +
> > > +	/*
> > > +	 * tasklist_lock must be held because to avoid racing with
> > > +	 * changes in exit_state and wake up. Basically to avoid:
> > > +	 *
> > > +	 * P0: read exit_state = 0
> > > +	 * P1: write exit_state = EXIT_DEAD
> > > +	 * P1: Do a wake up - wq is empty, so do nothing
> > > +	 * P0: Queue for polling - wait forever.
> > > +	 */
> > > +	read_lock(&tasklist_lock);
> > > +	pid = file->private_data;
> > > +	task = pid_task(pid, PIDTYPE_PID);
> > > +	WARN_ON_ONCE(task && !thread_group_leader(task));
> > > +
> > > +	if (!task || (task->exit_state && thread_group_empty(task)))
> > > +		poll_flags = POLLIN | POLLRDNORM;
> > 
> > So we block until the thread-group is empty? Hm, the thread-group leader
> > remains in zombie state until all threads are gone. Should probably just
> > be a short comment somewhere that callers are only informed about a
> > whole thread-group exit and not about when the thread-group leader has
> > actually exited.
> 
> Ok, I'll add a comment.
> 
> > I would like the ability to extend this interface in the future to allow
> > for actually reading data from the pidfd on EPOLLIN.
> > POSIX specifies that POLLIN and POLLRDNORM are set even if the
> > message to be read is zero. So one cheap way of doing this would
> > probably be to do a 0 read/ioctl. That wouldn't hurt your very limited
> > usecase and people could test whether the read returned non-0 data and
> > if so they know this interface got extended. If we never extend it here
> > it won't matter.
> 
> I am a bit confused. What specific changes to this patch are you proposing?
> This patch makes poll block until the process exits. In the future, we can
> make it unblock for a other reasons as well, that's fine with me. I don't see
> how this patch prevents such extensions.

I guess I should've asked the following:
What happens right now, when you get EPOLLIN on the pidfd and you and
out of ignorance you do:

read(pidfd, ...)

> 
> > > +	if (!poll_flags)
> > > +		poll_wait(file, &pid->wait_pidfd, pts);
> > > +
> > > +	read_unlock(&tasklist_lock);
> > > +
> > > +	return poll_flags;
> > > +}
> > 
> > 
> > > +
> > > +
> > >  const struct file_operations pidfd_fops = {
> > >  	.release = pidfd_release,
> > > +	.poll = pidfd_poll,
> > >  #ifdef CONFIG_PROC_FS
> > >  	.show_fdinfo = pidfd_show_fdinfo,
> > >  #endif
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..5c90c239242f 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >  	for (type = 0; type < PIDTYPE_MAX; ++type)
> > >  		INIT_HLIST_HEAD(&pid->tasks[type]);
> > >  
> > > +	init_waitqueue_head(&pid->wait_pidfd);
> > > +
> > >  	upid = pid->numbers + ns->level;
> > >  	spin_lock_irq(&pidmap_lock);
> > >  	if (!(ns->pid_allocated & PIDNS_ADDING))
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 1581140f2d99..16e7718316e5 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
> > >  	return ret;
> > >  }
> > >  
> > > +static void do_notify_pidfd(struct task_struct *task)
> > 
> > Maybe a short command that this helper can only be called when we know
> > that task is a thread-group leader wouldn't hurt so there's no confusion
> > later.
> 
> Ok, will do.
> 
> thanks,
> 
>  - Joel
> 

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Alexey Gladkov @ 2019-04-26 15:29 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Jessica Yu, Linux Kernel Mailing List, Linux Kbuild mailing list,
	linux-api, linux-modules, Kirill A . Shutemov,
	Gleb Fotengauer-Malinovskiy, Dmitry V. Levin, Michal Marek,
	Dmitry Torokhov, Rusty Russell, Lucas De Marchi
In-Reply-To: <CAK7LNARxT7a5-C4-ZuKYWOWyuDdMVzkd-n8C8vdO1MEw10TsJA@mail.gmail.com>

On Fri, Apr 19, 2019 at 12:03:50PM +0900, Masahiro Yamada wrote:
> On Fri, Apr 19, 2019 at 12:36 AM Jessica Yu <jeyu@kernel.org> wrote:
> >
> > +++ Masahiro Yamada [19/04/19 00:26 +0900]:
> > >On Thu, Apr 18, 2019 at 10:52 PM Jessica Yu <jeyu@kernel.org> wrote:
> > >>
> > >> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
> > >> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> > >> >>
> > >> >> Problem:
> > >> >>
> > >> >> When a kernel module is compiled as a separate module, some important
> > >> >> information about the kernel module is available via .modinfo section of
> > >> >> the module.  In contrast, when the kernel module is compiled into the
> > >> >> kernel, that information is not available.
> > >> >>
> > >> >> Information about built-in modules is necessary in the following cases:
> > >> >>
> > >> >> 1. When it is necessary to find out what additional parameters can be
> > >> >> passed to the kernel at boot time.
> > >> >>
> > >> >> 2. When you need to know which module names and their aliases are in
> > >> >> the kernel. This is very useful for creating an initrd image.
> > >> >>
> > >> >> Proposal:
> > >> >>
> > >> >> The proposed patch does not remove .modinfo section with module
> > >> >> information from the vmlinux at the build time and saves it into a
> > >> >> separate file after kernel linking. So, the kernel does not increase in
> > >> >> size and no additional information remains in it. Information is stored
> > >> >> in the same format as in the separate modules (null-terminated string
> > >> >> array). Because the .modinfo section is already exported with a separate
> > >> >> modules, we are not creating a new API.
> > >> >>
> > >> >> It can be easily read in the userspace:
> > >> >>
> > >> >> $ tr '\0' '\n' < kernel.builtin
> > >> >> ext4.softdep=pre: crc32c
> > >> >> ext4.license=GPL
> > >> >> ext4.description=Fourth Extended Filesystem
> > >> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
> > >> >> ext4.alias=fs-ext4
> > >> >> ext4.alias=ext3
> > >> >> ext4.alias=fs-ext3
> > >> >> ext4.alias=ext2
> > >> >> ext4.alias=fs-ext2
> > >> >> md_mod.alias=block-major-9-*
> > >> >> md_mod.alias=md
> > >> >> md_mod.description=MD RAID framework
> > >> >> md_mod.license=GPL
> > >> >> md_mod.parmtype=create_on_open:bool
> > >> >> md_mod.parmtype=start_dirty_degraded:int
> > >> >> ...
> > >> >>
> > >> >> v2:
> > >> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
> > >> >>  * Rename output file to kernel.builtin;
> > >> >
> > >> >Sorry, I do not get why you renamed
> > >> >"kernel.builtin.modinfo" to "kernel.builtin".
> > >> >
> > >> >If you drop "modinfo", we do not understand
> > >> >what kind information is contained in it.
> > >> >
> > >> >I think "kernel" and "builtin" have
> > >> >a quite similar meaning here.
> > >> >
> > >> >How about "builtin.modinfo" for example?
> > >> >
> > >> >
> > >> >It is shorter, and it is clear enough
> > >> >that it contains module_info.
> > >>
> > >> I agree that the name kernel.builtin is unclear in what kind of
> > >> information it contains. Apologies for not having clarified this in
> > >> the previous review.
> > >>
> > >> Since kbuild already produces "modules.order" and "modules.builtin"
> > >> files, why not just name it "modules.builtin.modinfo" to keep the
> > >> names consistent with what is already there?
> > >
> > >
> > >Is it consistent?
> > >
> > >If we had "modules.order" and "modules.builtin.order" there,
> > >I would agree with "modules.builtin.modinfo",
> > >and also "modules.alias" vs "modules.builtin.alias".
> > >
> > >
> > >We already have "modules.builtin", and probably impossible
> > >to rename it, so we cannot keep consistency in any way.
> > >
> > >
> > >"modules.builtin" is a weird name since
> > >it actually contains "order", but its extension
> > >does not express what kind of information is in it.
> > >Hence, I doubt "modules.builtin" is a good precedent.
> > >
> > >IMHO, "modules" and "builtin" are opposite
> > >to each other. "modules.builtin" sounds iffy to me.
> >
> > I've always interpreted "modules.builtin" to mean "this is a list of
> > modules that have been built-in into the kernel", no? So I thought the
> > name made sense.
> 
> OK, I see.
> 
> > But you are the maintainer, so I do not have a strong
> > opinion on this either way :-)
> 
> My idea was to use
> 'modules.<file-type>'  vs  'builtin.<file-type>'
> instead of
> 'modules.<file-type>'  vs  'modules.builtin.<file-type>'
> 
> I am slightly in favor of the former
> since it is shorter and
> (I hope) still clear enough.
> 
> If this naming is not nice for external projects such as kmod,
> please speak up.
> 
> 
> (BTW, I am thinking of renaming 'modules.builtin' into 'builtin.order'
> for kbuild internal. We cannot change that for the installation area, though.)

Since there were no other suggestions, how can I better name the file ?
modules.builtin.modinfo or just builtin.modinfo ? I personally like the
first one more.

-- 
Rgrds, legion

^ permalink raw reply

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Christian Brauner @ 2019-04-26 15:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, luto, rostedt, dancol, sspatil, jannh, surenb,
	timmurray, Jonathan Kowalski, torvalds, kernel-team,
	Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas Gleixner, Tycho Andersen
In-Reply-To: <20190426152139.jbyihycizb4x5sfd@brauner.io>

On April 26, 2019 5:21:40 PM GMT+02:00, Christian Brauner <christian@brauner.io> wrote:
>On Fri, Apr 26, 2019 at 10:23:37AM -0400, Joel Fernandes wrote:
>> On Fri, Apr 26, 2019 at 12:24:04AM +0200, Christian Brauner wrote:
>> > On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google)
>wrote:
>> > > pidfd are file descriptors referring to a process created with
>the
>> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs
>pidfd
>> > > polling support to replace code that currently checks for
>existence of
>> > > /proc/pid for knowing that a process that is signalled to be
>killed has
>> > > died, which is both racy and slow. The pidfd poll approach is
>race-free,
>> > > and also allows the LMK to do other things (such as by polling on
>other
>> > > fds) while awaiting the process being killed to die.
>> > 
>> > Thanks for the patch!
>> > 
>> > Ok, let me be a little bit anal.
>> > Please start the commit message with what this patch does and then
>add
>> 
>> The subject title is "Add polling support to pidfd", but ok I should
>write a
>> better commit message.
>
>Yeah, it's really just that we should really just have a simple
>paragraph that expresses this makes the codebase do X.
>
>> 
>> > the justification why. You just say the "pidfd-poll" approach. You
>can
>> > probably assume that CLONE_PIDFD is available for this patch. So
>> > something like:
>> > 
>> > "This patch makes pidfds pollable. Specifically, it allows
>listeners to
>> > be informed when the process the pidfd referes to exits. This patch
>only
>> > introduces the ability to poll thread-group leaders since pidfds
>> > currently can only reference those..."
>> > 
>> > Then justify the use-case and then go into implementation details.
>> > That's usually how I would think about this:
>> > - Change the codebase to do X
>> > - Why do we need X
>> > - Are there any technical details worth mentioning in the commit
>message
>> > (- Are there any controversial points that people stumbled upon but
>that
>> >   have been settled sufficiently.)
>> 
>> Generally the "how" in the patch should be in the code, but ok.
>
>That's why I said: technical details that are worth mentioning.
>Sometimes you have controversial bits that are obviously to be
>understood in the code but it still might be worth explaining why one
>had to do it this way. Like say what we did for the pidfd_send_signal()
>thing where we explained why O_PATH is disallowed.
>
>> 
>> I changed the first 3 paragraphs of the changelog to the following,
>is that
>> better? :
>> 
>> Android low memory killer (LMK) needs to know when a process dies
>once
>> it is sent the kill signal. It does so by checking for the existence
>of
>> /proc/pid which is both racy and slow. For example, if a PID is
>reused
>> between when LMK sends a kill signal and checks for existence of the
>> PID, since the wrong PID is now possibly checked for existence.
>> 
>> This patch adds polling support to pidfd. Using the polling support,
>LMK
>> will be able to get notified when a process exists in race-free and
>fast
>> way, and allows the LMK to do other things (such as by polling on
>other
>> fds) while awaiting the process being killed to die.
>> 
>> For notification to polling processes, we follow the same existing
>> mechanism in the kernel used when the parent of the task group is to
>be
>> notified of a child's death (do_notify_parent).  This is precisely
>when
>> the tasks waiting on a poll of pidfd are also awakened in this patch.
>> 
>> > > pidfd are file descriptors referring to a process created with
>the
>> > > CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs
>pidfd
>> > > polling support to replace code that currently checks for
>existence of
>> > > /proc/pid for knowing that a process that is signalled to be
>killed has
>> > > died, which is both racy and slow. The pidfd poll approach is
>race-free,
>> > > and also allows the LMK to do other things (such as by polling on
>other
>> > > fds) while awaiting the process being killed to die.
>> > 
>> > > 
>> > > It prevents a situation where a PID is reused between when LMK
>sends a
>> > > kill signal and checks for existence of the PID, since the wrong
>PID is
>> > > now possibly checked for existence.
>> > > 
>> > > In this patch, we follow the same existing mechanism in the
>kernel used
>> > > when the parent of the task group is to be notified
>(do_notify_parent).
>> > > This is when the tasks waiting on a poll of pidfd are also
>awakened.
>> > > 
>> > > We have decided to include the waitqueue in struct pid for the
>following
>> > > reasons:
>> > > 1. The wait queue has to survive for the lifetime of the poll.
>Including
>> > > it in task_struct would not be option in this case because the
>task can
>> > > be reaped and destroyed before the poll returns.
>> > > 
>> > > 2. By including the struct pid for the waitqueue means that
>during
>> > > de_thread(), the new thread group leader automatically gets the
>new
>> > > waitqueue/pid even though its task_struct is different.
>> > > 
>> > > Appropriate test cases are added in the second patch to provide
>coverage
>> > > of all the cases the patch is handling.
>> > > 
>> > > Andy had a similar patch [1] in the past which was a good
>reference
>> > > however this patch tries to handle different situations properly
>related
>> > > to thread group existence, and how/where it notifies. And also
>solves
>> > > other bugs (waitqueue lifetime).  Daniel had a similar patch [2]
>> > > recently which this patch supercedes.
>> > > 
>> > > [1] https://lore.kernel.org/patchwork/patch/345098/
>> > > [2]
>https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/
>> > > 
>> > > Cc: luto@amacapital.net
>> > > Cc: rostedt@goodmis.org
>> > > Cc: dancol@google.com
>> > > Cc: sspatil@google.com
>> > > Cc: christian@brauner.io
>> > > Cc: jannh@google.com
>> > > Cc: surenb@google.com
>> > > Cc: timmurray@google.com
>> > > Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
>> > > Cc: torvalds@linux-foundation.org
>> > > Cc: kernel-team@android.com
>> > 
>> > These should all be in the form:
>> > 
>> > Cc: Firstname Lastname <email@address.com>
>> 
>> If this bothers you too much, I can also just remove the CC list from
>the
>> changelog here, and include it in my invocation of git-send-email
>instead..
>> but I have seen commits in the tree that don't follow this rule.
>
>Yeah, but they should. There are people with multiple emails over the
>years and they might not necessarily contain their first and last
>name. And I don't want to have to mailmap them or sm. Having their
>names
>in there just makes it easier. Also, every single other DCO-*related*
>line follows:
>
>Random J Developer <random@developer.example.org>
>
>This should too. If others are sloppy and allow this, fine. No reason
>we
>should.
>
>> 
>> > 
>> > There are people missing from the Cc that really should be there...
>> 
>> If you look at the CC list of the email, people in the
>get_maintainer.pl
>> script were also added. I did run get_maintainer.pl and checkpatch.
>But ok, I
>> will add the folks you are suggesting as well. Thanks.
>
>get_maintainer.pl is not the last word. 
>
>> 
>> > Even though he usually doesn't respond that often, please Cc Al on
>this.
>> > If he responds it's usually rather important.
>> 
>> No issues on that, but I am wondering if he should also be in
>MAINTAINERS
>> file somewhere such that get_maintainer.pl does pick him up. I added
>him.
>
>It's often not about someone being a maintainer but whether or not they
>have valuable input.
>
>"[...] This tag documents that potentially interested parties have been
>included in the discussion."
>
>> 
>> > Oleg has reviewed your RFC patch quite substantially and given
>valuable
>> > feedback and has an opinion on this thing and is best acquainted
>with
>> > the exit code. So please add him to the Cc of the commit message in
>the
>> > appropriate form and also add him to the Cc of the thread.
>> 
>> Done.
>
>Thanks!
>
>> 
>> > Probably also want linux-api for good measure since a lot of people
>are
>> > subscribed that would care about pollable pidfds. I'd also add Kees
>> > since he had some interest in this work and David (Howells).
>> 
>> Done, I added all of them and CC will go out to them next time.
>Thanks.
>
>Cool. That really wasn't a "you've done this wrong". It's rather really
>just to make sure that everyone who might catch a big f*ck up on our
>part has had a chance to tell us so. :)
>
>> 
>> > 
>> > > Co-developed-by: Daniel Colascione <dancol@google.com>
>> > 
>> > Every CDB needs to give a SOB as well.
>> 
>> Ok, done. thanks.
>
>Fwiw, I only learned this recently too.
>
>> 
>> > 
>> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> > > 
>> > > ---
>> > > 
>> > > RFC -> v1:
>> > > * Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
>> > > * Updated selftests.
>> > > * Renamed poll wake function to do_notify_pidfd.
>> > > * Removed depending on EXIT flags
>> > > * Removed POLLERR flag since semantics are controversial and
>> > >   we don't have usecases for it right now (later we can add if
>there's
>> > >   a need for it).
>> > > 
>> > >  include/linux/pid.h |  3 +++
>> > >  kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
>> > >  kernel/pid.c        |  2 ++
>> > >  kernel/signal.c     | 14 ++++++++++++++
>> > >  4 files changed, 52 insertions(+)
>> > > 
>> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
>> > > index 3c8ef5a199ca..1484db6ca8d1 100644
>> > > --- a/include/linux/pid.h
>> > > +++ b/include/linux/pid.h
>> > > @@ -3,6 +3,7 @@
>> > >  #define _LINUX_PID_H
>> > >  
>> > >  #include <linux/rculist.h>
>> > > +#include <linux/wait.h>
>> > >  
>> > >  enum pid_type
>> > >  {
>> > > @@ -60,6 +61,8 @@ struct pid
>> > >  	unsigned int level;
>> > >  	/* lists of tasks that use this pid */
>> > >  	struct hlist_head tasks[PIDTYPE_MAX];
>> > > +	/* wait queue for pidfd notifications */
>> > > +	wait_queue_head_t wait_pidfd;
>> > >  	struct rcu_head rcu;
>> > >  	struct upid numbers[1];
>> > >  };
>> > > diff --git a/kernel/fork.c b/kernel/fork.c
>> > > index 5525837ed80e..fb3b614f6456 100644
>> > > --- a/kernel/fork.c
>> > > +++ b/kernel/fork.c
>> > > @@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct
>seq_file *m, struct file *f)
>> > >  }
>> > >  #endif
>> > >  
>> > > +static unsigned int pidfd_poll(struct file *file, struct
>poll_table_struct *pts)
>> > > +{
>> > > +	struct task_struct *task;
>> > > +	struct pid *pid;
>> > > +	int poll_flags = 0;
>> > > +
>> > > +	/*
>> > > +	 * tasklist_lock must be held because to avoid racing with
>> > > +	 * changes in exit_state and wake up. Basically to avoid:
>> > > +	 *
>> > > +	 * P0: read exit_state = 0
>> > > +	 * P1: write exit_state = EXIT_DEAD
>> > > +	 * P1: Do a wake up - wq is empty, so do nothing
>> > > +	 * P0: Queue for polling - wait forever.
>> > > +	 */
>> > > +	read_lock(&tasklist_lock);
>> > > +	pid = file->private_data;
>> > > +	task = pid_task(pid, PIDTYPE_PID);
>> > > +	WARN_ON_ONCE(task && !thread_group_leader(task));
>> > > +
>> > > +	if (!task || (task->exit_state && thread_group_empty(task)))
>> > > +		poll_flags = POLLIN | POLLRDNORM;
>> > 
>> > So we block until the thread-group is empty? Hm, the thread-group
>leader
>> > remains in zombie state until all threads are gone. Should probably
>just
>> > be a short comment somewhere that callers are only informed about a
>> > whole thread-group exit and not about when the thread-group leader
>has
>> > actually exited.
>> 
>> Ok, I'll add a comment.
>> 
>> > I would like the ability to extend this interface in the future to
>allow
>> > for actually reading data from the pidfd on EPOLLIN.
>> > POSIX specifies that POLLIN and POLLRDNORM are set even if the
>> > message to be read is zero. So one cheap way of doing this would
>> > probably be to do a 0 read/ioctl. That wouldn't hurt your very
>limited
>> > usecase and people could test whether the read returned non-0 data
>and
>> > if so they know this interface got extended. If we never extend it
>here
>> > it won't matter.
>> 
>> I am a bit confused. What specific changes to this patch are you
>proposing?
>> This patch makes poll block until the process exits. In the future,
>we can
>> make it unblock for a other reasons as well, that's fine with me. I
>don't see
>> how this patch prevents such extensions.
>
>I guess I should've asked the following:
>What happens right now, when you get EPOLLIN on the pidfd and you and
>out of ignorance you do:
>
>read(pidfd, ...)

I guess it returns EINVAL which is fine.
So you can ignore that comment.

>
>> 
>> > > +	if (!poll_flags)
>> > > +		poll_wait(file, &pid->wait_pidfd, pts);
>> > > +
>> > > +	read_unlock(&tasklist_lock);
>> > > +
>> > > +	return poll_flags;
>> > > +}
>> > 
>> > 
>> > > +
>> > > +
>> > >  const struct file_operations pidfd_fops = {
>> > >  	.release = pidfd_release,
>> > > +	.poll = pidfd_poll,
>> > >  #ifdef CONFIG_PROC_FS
>> > >  	.show_fdinfo = pidfd_show_fdinfo,
>> > >  #endif
>> > > diff --git a/kernel/pid.c b/kernel/pid.c
>> > > index 20881598bdfa..5c90c239242f 100644
>> > > --- a/kernel/pid.c
>> > > +++ b/kernel/pid.c
>> > > @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace
>*ns)
>> > >  	for (type = 0; type < PIDTYPE_MAX; ++type)
>> > >  		INIT_HLIST_HEAD(&pid->tasks[type]);
>> > >  
>> > > +	init_waitqueue_head(&pid->wait_pidfd);
>> > > +
>> > >  	upid = pid->numbers + ns->level;
>> > >  	spin_lock_irq(&pidmap_lock);
>> > >  	if (!(ns->pid_allocated & PIDNS_ADDING))
>> > > diff --git a/kernel/signal.c b/kernel/signal.c
>> > > index 1581140f2d99..16e7718316e5 100644
>> > > --- a/kernel/signal.c
>> > > +++ b/kernel/signal.c
>> > > @@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q,
>struct pid *pid, enum pid_type type)
>> > >  	return ret;
>> > >  }
>> > >  
>> > > +static void do_notify_pidfd(struct task_struct *task)
>> > 
>> > Maybe a short command that this helper can only be called when we
>know
>> > that task is a thread-group leader wouldn't hurt so there's no
>confusion
>> > later.
>> 
>> Ok, will do.
>> 
>> thanks,
>> 
>>  - Joel
>> 

^ permalink raw reply

* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-26 18:08 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Linux Kernel Mailing List, Linux API
In-Reply-To: <20190426052520.GB12337@dhcp22.suse.cz>

On Thu, Apr 25, 2019 at 10:25 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 25-04-19 13:39:01, Matthew Garrett wrote:
> > Yes, given MADV_DONTDUMP doesn't imply mlock I thought it'd be more
> > consistent to keep those independent.
>
> Do we want to fail madvise call on VMAs that are not mlocked then? What
> if the munlock happens later after the madvise is called?

I'm not sure if it's strictly necessary. We already have various
combinations of features that only make sense when used together and
which can be undermined by later actions. I can see the appeal of
designing this in a way that makes it harder to misuse, but is that
worth additional implementation complexity?

^ permalink raw reply

* Re: [PATCH V3] mm: Allow userland to request that the kernel clear memory on release
From: Matthew Garrett @ 2019-04-26 18:10 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Linux Kernel Mailing List, Linux API
In-Reply-To: <d058d1ef-994f-ea6b-b6b4-bcd838a9fe2f@suse.cz>

On Fri, Apr 26, 2019 at 12:45 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/26/19 12:58 AM, Matthew Garrett wrote:
> > Updated based on feedback from Jann - for now let's just prevent setting
> > the flag on anything that has already mapped some pages, which avoids
> > child processes being able to interfere with the parent. In addition,
>
> That makes the API quite tricky and different from existing madvise()
> modes that don't care. One would for example have to call
> madvise(MADV_WIPEONRELEASE) before mlock(), otherwise mlock() would
> fault the pages in (unless MLOCK_ONFAULT). As such it really looks like
> a mmap() flag, but that's less flexible.
>
> How bout just doing the CoW on any such pre-existing pages as part of
> the madvise(MADV_WIPEONRELEASE) call?

I'll look into the easiest way to do that.

^ permalink raw reply

* Re: [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Lukasz Majewski @ 2019-04-26 22:46 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Thomas Gleixner, Joseph Myers, libc-alpha, linux-api,
	linux-kernel, Deepa Dinamani, Stepan Golosunov
In-Reply-To: <20190426142531.1378357-1-arnd@arndb.de>

[-- Attachment #1: Type: text/plain, Size: 2890 bytes --]

Hi Arnd,

> As Stepan Golosunov points out, we made a small mistake in the
> get_timespec64() function in the kernel. It was originally added under
> the assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit
> and 64-bit architectures, but when I did the conversion, I only turned
> it on for 32-bit ones.
> 
> The effect is that the get_timespec64() function never clears the
> upper half of the tv_nsec field for 32-bit tasks in compat mode.
> Clearing this is required for POSIX compliant behavior of functions
> that pass a 'timespec' structure with a 64-bit tv_sec and a 32-bit
> tv_nsec, plus uninitialized padding.

On my 32 bit ARM setup (with CONFIG_COMPAT_32BIT_TIME=y) for
Y2038 test:

- I do use clock_settime64 [1] to set 64 bit tv_sec
  with 32 bit tv_nsec and 32 bit unnamed padding.

- I also may use clock_settime32 as a fallback (but it will not work
  beyond Y2038) if the 64 bit version is not available for any reason.

In the get_timespec64() the in_compat_syscall() returns 0 [2], so the
padding bits are not cleared.

This imposes the in glibc requirement to zero the padding before passing
it to the Linux kernel.

At least for my setup (32bit ARM with 64 bit time support) this patch is
not fixing anything.


Moreover, the described above problem doesn't matter on native 64 bit
systems as there both fields are 64 bit (no padding required).


Note:
[1] -
https://elixir.bootlin.com/linux/v5.1-rc6/source/arch/arm/tools/syscall.tbl#L421

[2] - include/linux/compat.h -> CONFIG_COMPAT is not defined - as a
result in_compat_syscall() returns 0.

> 
> The easiest fix for linux-5.1 is to just make the Kconfig symbol
> unconditional, as it was originally intended. As a follow-up,
> we should remove any #ifdef CONFIG_64BIT_TIME completely.
> 
> Link:
> https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
> Cc: Lukasz Majewski <lukma@denx.de> Cc: Stepan Golosunov
> <stepan@golosunov.pp.ru> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> Please apply this one as a bugfix for 5.1
> ---
>  arch/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 33687dddd86a..9092e0ffe4d3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -764,7 +764,7 @@ config COMPAT_OLD_SIGACTION
>  	bool
>  
>  config 64BIT_TIME
> -	def_bool ARCH_HAS_64BIT_TIME
> +	def_bool y
>  	help
>  	  This should be selected by all architectures that need to
> support new system calls with a 64-bit time_t. This is relevant on
> all 32-bit




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Stepan Golosunov @ 2019-04-27  9:54 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Arnd Bergmann, Thomas Gleixner, Joseph Myers, libc-alpha,
	linux-api, linux-kernel, Deepa Dinamani
In-Reply-To: <20190427004653.3cecd1cb@jawa>

27.04.2019 в 00:46:53 +0200 Lukasz Majewski написал:
> Hi Arnd,
> 
> > As Stepan Golosunov points out, we made a small mistake in the
> > get_timespec64() function in the kernel. It was originally added under
> > the assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit
> > and 64-bit architectures, but when I did the conversion, I only turned
> > it on for 32-bit ones.
> > 
> > The effect is that the get_timespec64() function never clears the
> > upper half of the tv_nsec field for 32-bit tasks in compat mode.
> > Clearing this is required for POSIX compliant behavior of functions
> > that pass a 'timespec' structure with a 64-bit tv_sec and a 32-bit
> > tv_nsec, plus uninitialized padding.

> At least for my setup (32bit ARM with 64 bit time support) this patch is
> not fixing anything.

The patch is not supposed to fix anything on 32-bit architectures as
in-kernel struct timespec64 has 32-bit tv_nsec there.  Thus truncation
should happen automatically.  I also missed that fact when I was
reading get_timespec64 code.

(I am wondering whether such trucation is undefined behaviour in C and
whether there should be sign-extension instead of zeroing-out for the
in_compat_syscall() case though.)

> > The easiest fix for linux-5.1 is to just make the Kconfig symbol
> > unconditional, as it was originally intended. As a follow-up,
> > we should remove any #ifdef CONFIG_64BIT_TIME completely.
> > 
> > Link:
> > https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
> > Cc: Lukasz Majewski <lukma@denx.de> Cc: Stepan Golosunov
> > <stepan@golosunov.pp.ru> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> > Please apply this one as a bugfix for 5.1
> > ---
> >  arch/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 33687dddd86a..9092e0ffe4d3 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -764,7 +764,7 @@ config COMPAT_OLD_SIGACTION
> >  	bool
> >  
> >  config 64BIT_TIME
> > -	def_bool ARCH_HAS_64BIT_TIME
> > +	def_bool y
> >  	help
> >  	  This should be selected by all architectures that need to
> > support new system calls with a 64-bit time_t. This is relevant on
> > all 32-bit

^ permalink raw reply

* Re: [PATCH v10 00/18] Introduce the Counter subsystem
From: Jonathan Cameron @ 2019-04-27 11:49 UTC (permalink / raw)
  To: Greg KH
  Cc: William Breathitt Gray, linux-arm-kernel, devicetree,
	linux-kernel, linux-iio, fabrice.gasnier, benjamin.gaignard,
	knaack.h, lars, pmeerw, akpm, david, robh+dt, mark.rutland,
	shawnguo, leoyang.li, daniel.lezcano, tglx, thierry.reding, esben,
	linux-pwm, linuxppc-dev, patrick.havelange, linux-api
In-Reply-To: <20190425193624.GA11240@kroah.com>

On Thu, 25 Apr 2019 21:36:24 +0200
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Sun, Apr 07, 2019 at 03:25:50PM +0100, Jonathan Cameron wrote:
> > On Tue,  2 Apr 2019 15:30:35 +0900
> > William Breathitt Gray <vilhelm.gray@gmail.com> wrote:
> >   
> > > Changes in v10:
> > >   - Fix minor typographical errors in documentation
> > >   - Merge the FlexTimer Module Quadrature decoder counter driver patches
> > > 
> > > This revision is functionally identical to the last; changes in this
> > > version were made to fix minor typos in the documentation files and also
> > > to pull in the new FTM quadrature decoder counter driver.
> > > 
> > > The Generic Counter API has been and is still in a feature freeze until
> > > it is merged into the mainline. The following features will be
> > > investigated after the merge: interrupt support for counter devices, and
> > > a character device interface for low-latency applications.  
> > 
> > Hi William / al,
> > 
> > So the question is how to move this forwards?  I'm happy with how it turned
> > out and the existing drivers we had in IIO are a lot cleaner under
> > the counter subsystem (other than the backwards compatibility for those that
> > ever existed in IIO).  For those  not following closely the situation is:  
> 
> I've now sucked this into my staging-testing branch and if 0-day is fine
> with it, I'll merge it to staging-next in a day or so.  This way you can
> build on it for any iio drivers that might be coming.

Great thanks. 

> 
> I do have reservations about that one sysfs file that is multi-line, and
> I think it will come to bite you in the end over time, so I reserve the
> right to say "I told you so" when that happens...
> 
> But, I don't have a better answer for it now, so don't really worry
> about it :)
> 
> thanks,
> 
> greg k-h

Looks like a few late breaking comments came in, but nothing that can't
be fixed up before this reaches a release.

Thanks,

Jonathan

^ permalink raw reply

* Re: [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Lukasz Majewski @ 2019-04-27 15:06 UTC (permalink / raw)
  To: Stepan Golosunov
  Cc: Arnd Bergmann, Thomas Gleixner, Joseph Myers, libc-alpha,
	linux-api, linux-kernel, Deepa Dinamani
In-Reply-To: <20190427095418.344yoomf5nwznatd@sghpc.golosunov.pp.ru>

[-- Attachment #1: Type: text/plain, Size: 3037 bytes --]

Hi Stepan,

> 27.04.2019 в 00:46:53 +0200 Lukasz Majewski написал:
> > Hi Arnd,
> >   
> > > As Stepan Golosunov points out, we made a small mistake in the
> > > get_timespec64() function in the kernel. It was originally added
> > > under the assumption that CONFIG_64BIT_TIME would get enabled on
> > > all 32-bit and 64-bit architectures, but when I did the
> > > conversion, I only turned it on for 32-bit ones.
> > > 
> > > The effect is that the get_timespec64() function never clears the
> > > upper half of the tv_nsec field for 32-bit tasks in compat mode.
> > > Clearing this is required for POSIX compliant behavior of
> > > functions that pass a 'timespec' structure with a 64-bit tv_sec
> > > and a 32-bit tv_nsec, plus uninitialized padding.  
> 
> > At least for my setup (32bit ARM with 64 bit time support) this
> > patch is not fixing anything.  
> 
> The patch is not supposed to fix anything on 32-bit architectures as
> in-kernel struct timespec64 has 32-bit tv_nsec there.  Thus truncation
> should happen automatically.  I also missed that fact when I was
> reading get_timespec64 code.

Yes. You are right. The tv_nsec is 32 bit.

> 
> (I am wondering whether such trucation is undefined behaviour in C

According to [1] - Chapter 6.3.1.3 - Point 3 it is
implementation-defined.

> and
> whether there should be sign-extension instead of zeroing-out for the
> in_compat_syscall() case though.)

What I've found is that "typically" the high order bits are discarded.

However, it is still "implementation dependent".

> 
> > > The easiest fix for linux-5.1 is to just make the Kconfig symbol
> > > unconditional, as it was originally intended. As a follow-up,
> > > we should remove any #ifdef CONFIG_64BIT_TIME completely.
> > > 
> > > Link:
> > > https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
> > > Cc: Lukasz Majewski <lukma@denx.de> Cc: Stepan Golosunov
> > > <stepan@golosunov.pp.ru> Signed-off-by: Arnd Bergmann
> > > <arnd@arndb.de> ---
> > > Please apply this one as a bugfix for 5.1
> > > ---
> > >  arch/Kconfig | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index 33687dddd86a..9092e0ffe4d3 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -764,7 +764,7 @@ config COMPAT_OLD_SIGACTION
> > >  	bool
> > >  
> > >  config 64BIT_TIME
> > > -	def_bool ARCH_HAS_64BIT_TIME
> > > +	def_bool y
> > >  	help
> > >  	  This should be selected by all architectures that need
> > > to support new system calls with a 64-bit time_t. This is
> > > relevant on all 32-bit  

Note:

[1] - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Arnd Bergmann @ 2019-04-27 20:47 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Stepan Golosunov, Thomas Gleixner, Joseph Myers, GNU C Library,
	Linux API, Linux Kernel Mailing List, Deepa Dinamani
In-Reply-To: <20190427170613.38bdfbbd@jawa>

On Sat, Apr 27, 2019 at 5:06 PM Lukasz Majewski <lukma@denx.de> wrote:
> > 27.04.2019 в 00:46:53 +0200 Lukasz Majewski написал:
> > (I am wondering whether such trucation is undefined behaviour in C
>
> According to [1] - Chapter 6.3.1.3 - Point 3 it is
> implementation-defined.

The kernel relies on the sane behavior for integer overflow in many
places already, and it is built with -fno-strict-overflow to also make
sure gcc doesn't optimize cases that would be undefined otherwise.

> > and
> > whether there should be sign-extension instead of zeroing-out for the
> > in_compat_syscall() case though.)
>
> What I've found is that "typically" the high order bits are discarded.
>
> However, it is still "implementation dependent".

I think the question was whether we should use

          kts.tv_nsec = (int)kts.tv_nsec;

instead of

          kts.tv_nsec &= 0xfffffffful;

Both have a clearly defined meaning in the C dialect we use in the
kernel, but differ in the upper 32 bits for negative input values.

I would say that using sign-extension indeed makes more sense
here, but we don't need to change it for linux-5.1, since none of the
callers of get_timespec64() care -- any negative 32-bit tv_nsec
value results in -EINVAL, including the utimensat() syscall that
has two special cases outside of the 0...999999999 range.

      Arnd

^ permalink raw reply

* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Oleg Nesterov @ 2019-04-28 16:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Joel Fernandes (Google), linux-kernel, luto, rostedt, dancol,
	sspatil, jannh, surenb, timmurray, Jonathan Kowalski, torvalds,
	kernel-team, Andrew Morton, Arnd Bergmann, Eric W. Biederman,
	Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
	Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
	Stephen Rothwell, Thomas
In-Reply-To: <20190425222359.sqhboc4x4daznr6r@brauner.io>

Thanks for cc'ing me...

On 04/26, Christian Brauner wrote:
>
> On Thu, Apr 25, 2019 at 03:00:09PM -0400, Joel Fernandes (Google) wrote:
> > +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > +{
> > +	struct task_struct *task;
> > +	struct pid *pid;
> > +	int poll_flags = 0;
> > +
> > +	/*
> > +	 * tasklist_lock must be held because to avoid racing with
> > +	 * changes in exit_state and wake up. Basically to avoid:
> > +	 *
> > +	 * P0: read exit_state = 0
> > +	 * P1: write exit_state = EXIT_DEAD
> > +	 * P1: Do a wake up - wq is empty, so do nothing
> > +	 * P0: Queue for polling - wait forever.
> > +	 */
> > +	read_lock(&tasklist_lock);
> > +	pid = file->private_data;
> > +	task = pid_task(pid, PIDTYPE_PID);
> > +	WARN_ON_ONCE(task && !thread_group_leader(task));
> > +
> > +	if (!task || (task->exit_state && thread_group_empty(task)))
> > +		poll_flags = POLLIN | POLLRDNORM;

Joel, I still can't understand why do we need tasklist... and I don't really
understand the comment. The code looks as if you are trying to avoid poll_wait(),
but this would be strange.

OK, why can't pidfd_poll() do

	poll_wait(file, &pid->wait_pidfd, pts);

	rcu_read_lock();
	task = pid_task(pid, PIDTYPE_PID);
	if (!task || task->exit_state && thread_group_empty(task))
		poll_flags = POLLIN | ...;
	rcu_read_unlock();

	return poll_flags;

?

> > +static void do_notify_pidfd(struct task_struct *task)
>
> Maybe a short command that this helper can only be called when we know
> that task is a thread-group leader wouldn't hurt so there's no confusion
> later.

Not really. If the task is traced, do_notify_parent() (and thus do_notify_pidfd())
can be called to notify the debugger even if the task is not a leader and/or if
it is not the last thread. The latter means a spurious wakeup for pidfd_poll().

> > +{
> > +	struct pid *pid;
> > +
> > +	lockdep_assert_held(&tasklist_lock);
> > +
> > +	pid = get_task_pid(task, PIDTYPE_PID);
> > +	wake_up_all(&pid->wait_pidfd);
> > +	put_pid(pid);

Why get/put?

Oleg.

^ permalink raw reply

* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Daniel Axtens @ 2019-04-29  0:06 UTC (permalink / raw)
  To: Matthew Garrett, Andrew Donnellan
  Cc: James Morris, LSM List, Linux Kernel Mailing List, David Howells,
	Linux API, Andy Lutomirski, linuxppc-dev, Michael Ellerman, cmr
In-Reply-To: <CACdnJus9AhAAYs-R94BH7HDuuQfXjgdhdqUR6Pvk9mxbuPx1=Q@mail.gmail.com>

Matthew Garrett <mjg59@google.com> writes:

> On Tue, Apr 16, 2019 at 1:40 AM Andrew Donnellan
> <andrew.donnellan@au1.ibm.com> wrote:
>> I'm thinking about whether we should lock down the powerpc xmon debug
>> monitor - intuitively, I think the answer is yes if for no other reason
>> than Least Astonishment, when lockdown is enabled you probably don't
>> expect xmon to keep letting you access kernel memory.
>
> The original patchset contained a sysrq hotkey to allow physically
> present users to disable lockdown, so I'm not super concerned about
> this case - I could definitely be convinced otherwise, though.

So currently (and I'm pretty new to this as I've only recently rejoined
IBM) we aren't considering access to the console to be sufficient to
assert physical presence on bare-metal server-class Power machines. The
short argument for this is that with IPMI and BMCs, a server's console
isn't what it used to be. Our console is also a bit different to x86:
we don't generally have bios configuration screens on the console.

In your example, a sysrq key would allow you to disable lockdown after
the system has booted. On Power though, we use Linux as a bootloader
(Petitboot: https://github.com/open-power/petitboot) so being able to
disable lockdown there allows an IPMI-connected user to prevent a signed
kernel being loaded in the first place. I don't know if this is
_actually_ worse, but it certainly feels worse.

There are of course some arguments against our approach. I'm aware of
some of them. I'm also very open to being told that not equating console
access with physical access is fundamentally silly or broken and that we
should rethink things.

Regards,
Daniel

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Masahiro Yamada @ 2019-04-29  3:28 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Jessica Yu, Linux Kernel Mailing List, Linux Kbuild mailing list,
	linux-api, linux-modules, Kirill A . Shutemov,
	Gleb Fotengauer-Malinovskiy, Dmitry V. Levin, Michal Marek,
	Dmitry Torokhov, Rusty Russell, Lucas De Marchi
In-Reply-To: <20190426152904.GO9023@dhcp129-178.brq.redhat.com>

On Sat, Apr 27, 2019 at 12:29 AM Alexey Gladkov
<gladkov.alexey@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 12:03:50PM +0900, Masahiro Yamada wrote:
> > On Fri, Apr 19, 2019 at 12:36 AM Jessica Yu <jeyu@kernel.org> wrote:
> > >
> > > +++ Masahiro Yamada [19/04/19 00:26 +0900]:
> > > >On Thu, Apr 18, 2019 at 10:52 PM Jessica Yu <jeyu@kernel.org> wrote:
> > > >>
> > > >> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
> > > >> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> > > >> >>
> > > >> >> Problem:
> > > >> >>
> > > >> >> When a kernel module is compiled as a separate module, some important
> > > >> >> information about the kernel module is available via .modinfo section of
> > > >> >> the module.  In contrast, when the kernel module is compiled into the
> > > >> >> kernel, that information is not available.
> > > >> >>
> > > >> >> Information about built-in modules is necessary in the following cases:
> > > >> >>
> > > >> >> 1. When it is necessary to find out what additional parameters can be
> > > >> >> passed to the kernel at boot time.
> > > >> >>
> > > >> >> 2. When you need to know which module names and their aliases are in
> > > >> >> the kernel. This is very useful for creating an initrd image.
> > > >> >>
> > > >> >> Proposal:
> > > >> >>
> > > >> >> The proposed patch does not remove .modinfo section with module
> > > >> >> information from the vmlinux at the build time and saves it into a
> > > >> >> separate file after kernel linking. So, the kernel does not increase in
> > > >> >> size and no additional information remains in it. Information is stored
> > > >> >> in the same format as in the separate modules (null-terminated string
> > > >> >> array). Because the .modinfo section is already exported with a separate
> > > >> >> modules, we are not creating a new API.
> > > >> >>
> > > >> >> It can be easily read in the userspace:
> > > >> >>
> > > >> >> $ tr '\0' '\n' < kernel.builtin
> > > >> >> ext4.softdep=pre: crc32c
> > > >> >> ext4.license=GPL
> > > >> >> ext4.description=Fourth Extended Filesystem
> > > >> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
> > > >> >> ext4.alias=fs-ext4
> > > >> >> ext4.alias=ext3
> > > >> >> ext4.alias=fs-ext3
> > > >> >> ext4.alias=ext2
> > > >> >> ext4.alias=fs-ext2
> > > >> >> md_mod.alias=block-major-9-*
> > > >> >> md_mod.alias=md
> > > >> >> md_mod.description=MD RAID framework
> > > >> >> md_mod.license=GPL
> > > >> >> md_mod.parmtype=create_on_open:bool
> > > >> >> md_mod.parmtype=start_dirty_degraded:int
> > > >> >> ...
> > > >> >>
> > > >> >> v2:
> > > >> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
> > > >> >>  * Rename output file to kernel.builtin;
> > > >> >
> > > >> >Sorry, I do not get why you renamed
> > > >> >"kernel.builtin.modinfo" to "kernel.builtin".
> > > >> >
> > > >> >If you drop "modinfo", we do not understand
> > > >> >what kind information is contained in it.
> > > >> >
> > > >> >I think "kernel" and "builtin" have
> > > >> >a quite similar meaning here.
> > > >> >
> > > >> >How about "builtin.modinfo" for example?
> > > >> >
> > > >> >
> > > >> >It is shorter, and it is clear enough
> > > >> >that it contains module_info.
> > > >>
> > > >> I agree that the name kernel.builtin is unclear in what kind of
> > > >> information it contains. Apologies for not having clarified this in
> > > >> the previous review.
> > > >>
> > > >> Since kbuild already produces "modules.order" and "modules.builtin"
> > > >> files, why not just name it "modules.builtin.modinfo" to keep the
> > > >> names consistent with what is already there?
> > > >
> > > >
> > > >Is it consistent?
> > > >
> > > >If we had "modules.order" and "modules.builtin.order" there,
> > > >I would agree with "modules.builtin.modinfo",
> > > >and also "modules.alias" vs "modules.builtin.alias".
> > > >
> > > >
> > > >We already have "modules.builtin", and probably impossible
> > > >to rename it, so we cannot keep consistency in any way.
> > > >
> > > >
> > > >"modules.builtin" is a weird name since
> > > >it actually contains "order", but its extension
> > > >does not express what kind of information is in it.
> > > >Hence, I doubt "modules.builtin" is a good precedent.
> > > >
> > > >IMHO, "modules" and "builtin" are opposite
> > > >to each other. "modules.builtin" sounds iffy to me.
> > >
> > > I've always interpreted "modules.builtin" to mean "this is a list of
> > > modules that have been built-in into the kernel", no? So I thought the
> > > name made sense.
> >
> > OK, I see.
> >
> > > But you are the maintainer, so I do not have a strong
> > > opinion on this either way :-)
> >
> > My idea was to use
> > 'modules.<file-type>'  vs  'builtin.<file-type>'
> > instead of
> > 'modules.<file-type>'  vs  'modules.builtin.<file-type>'
> >
> > I am slightly in favor of the former
> > since it is shorter and
> > (I hope) still clear enough.
> >
> > If this naming is not nice for external projects such as kmod,
> > please speak up.
> >
> >
> > (BTW, I am thinking of renaming 'modules.builtin' into 'builtin.order'
> > for kbuild internal. We cannot change that for the installation area, though.)
>
> Since there were no other suggestions, how can I better name the file ?
> modules.builtin.modinfo or just builtin.modinfo ? I personally like the
> first one more.


Either is fine.

Please put it in Documentation/dontdiff too.

Thanks.



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Daniel Axtens @ 2019-04-29  4:54 UTC (permalink / raw)
  To: Matthew Garrett, Andrew Donnellan
  Cc: James Morris, LSM List, Linux Kernel Mailing List, David Howells,
	Linux API, Andy Lutomirski, linuxppc-dev, Michael Ellerman, cmr
In-Reply-To: <87wojdy8ro.fsf@dja-thinkpad.axtens.net>

Hi, 

>>> I'm thinking about whether we should lock down the powerpc xmon debug
>>> monitor - intuitively, I think the answer is yes if for no other reason
>>> than Least Astonishment, when lockdown is enabled you probably don't
>>> expect xmon to keep letting you access kernel memory.
>>
>> The original patchset contained a sysrq hotkey to allow physically
>> present users to disable lockdown, so I'm not super concerned about
>> this case - I could definitely be convinced otherwise, though.

So Mimi contacted me offlist and very helpfully provided me with a much
better and less confused justification for disabling xmon in lockdown:

On x86, physical presence (== console access) is a trigger to
disable/enable lockdown mode.

In lockdown mode, you're not supposed to be able to modify memory. xmon
allows you to modify memory, and therefore shouldn't be allowed in
lockdown.

So, if you can disable lockdown on the console that's probably OK, but
it should be specifically disabling lockdown, not randomly editing
memory with xmon.

Regards,
Daniel

^ permalink raw reply

* Re: [PATCH 1/2] y2038: make CONFIG_64BIT_TIME unconditional
From: Thomas Gleixner @ 2019-04-29  7:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Joseph Myers, libc-alpha, linux-api, linux-kernel, Deepa Dinamani,
	Lukasz Majewski, Stepan Golosunov
In-Reply-To: <20190426142531.1378357-1-arnd@arndb.de>

On Fri, 26 Apr 2019, Arnd Bergmann wrote:

> As Stepan Golosunov points out, we made a small mistake in the
> get_timespec64() function in the kernel. It was originally added under
> the assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit
> and 64-bit architectures, but when I did the conversion, I only turned
> it on for 32-bit ones.
> 
> The effect is that the get_timespec64() function never clears the upper
> half of the tv_nsec field for 32-bit tasks in compat mode. Clearing this
> is required for POSIX compliant behavior of functions that pass a
> 'timespec' structure with a 64-bit tv_sec and a 32-bit tv_nsec, plus
> uninitialized padding.
> 
> The easiest fix for linux-5.1 is to just make the Kconfig symbol
> unconditional, as it was originally intended. As a follow-up,
> we should remove any #ifdef CONFIG_64BIT_TIME completely.
> 
> Link: https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
> Cc: Lukasz Majewski <lukma@denx.de>
> Cc: Stepan Golosunov <stepan@golosunov.pp.ru>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> Please apply this one as a bugfix for 5.1

Can you provide a 'Fixes: ....' tag please?

Thanks,

	tglx

^ permalink raw reply

* [PATCH v2 00/18] add new features for FPGA DFL drivers
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel; +Cc: linux-api, Wu Hao

This patchset adds more features support for FPGA Device Feature List
(DFL) drivers, including PR enhancement, virtualization support based
on PCIe SRIOV, private features to Port, private features to FME, and
enhancement to DFL framework. Please refer to details in below list.

Main changes from v1:
 - split the clean up code in a separated patch (patch #2)
 - add cpu_feature_enabled check for AVX512 code (patch #4)
 - improve sysfs return values and also sysfs doc (patch #12 #17)
 - create a hwmon for thermal management sysfs interfaces (patch #15)
 - create a hwmon for power management sysfs interfaces (patch #16)
 - update docmentation according to above changes (patch #5)
 - improve sysfs doc for performance reporting support (patch #18)

Wu Hao (18):
  fpga: dfl-fme-mgr: fix FME_PR_INTFC_ID register address.
  fpga: dfl: fme: remove copy_to_user() in ioctl for PR
  fpga: dfl: fme: align PR buffer size per PR datawidth
  fpga: dfl: fme: support 512bit data width PR
  Documentation: fpga: dfl: add descriptions for virtualization and new
    interfaces.
  fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.
  fpga: dfl: pci: enable SRIOV support.
  fpga: dfl: afu: add AFU state related sysfs interfaces
  fpga: dfl: afu: add userclock sysfs interfaces.
  fpga: dfl: add id_table for dfl private feature driver
  fpga: dfl: afu: export __port_enable/disable function.
  fpga: dfl: afu: add error reporting support.
  fpga: dfl: afu: add STP (SignalTap) support
  fpga: dfl: fme: add capability sysfs interfaces
  fpga: dfl: fme: add thermal management support
  fpga: dfl: fme: add power management support
  fpga: dfl: fme: add global error reporting support
  fpga: dfl: fme: add performance reporting support

 Documentation/ABI/testing/sysfs-platform-dfl-fme  | 322 ++++++++
 Documentation/ABI/testing/sysfs-platform-dfl-port | 104 +++
 Documentation/fpga/dfl.txt                        | 114 +++
 drivers/fpga/Kconfig                              |   2 +-
 drivers/fpga/Makefile                             |   4 +-
 drivers/fpga/dfl-afu-error.c                      | 225 +++++
 drivers/fpga/dfl-afu-main.c                       | 335 +++++++-
 drivers/fpga/dfl-afu.h                            |   7 +
 drivers/fpga/dfl-fme-error.c                      | 385 +++++++++
 drivers/fpga/dfl-fme-main.c                       | 583 ++++++++++++-
 drivers/fpga/dfl-fme-mgr.c                        | 117 ++-
 drivers/fpga/dfl-fme-perf.c                       | 950 ++++++++++++++++++++++
 drivers/fpga/dfl-fme-pr.c                         |  65 +-
 drivers/fpga/dfl-fme.h                            |   9 +-
 drivers/fpga/dfl-pci.c                            |  40 +
 drivers/fpga/dfl.c                                | 170 +++-
 drivers/fpga/dfl.h                                |  56 +-
 include/uapi/linux/fpga-dfl.h                     |  32 +
 18 files changed, 3439 insertions(+), 81 deletions(-)
 create mode 100644 drivers/fpga/dfl-afu-error.c
 create mode 100644 drivers/fpga/dfl-fme-error.c
 create mode 100644 drivers/fpga/dfl-fme-perf.c

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v2 01/18] fpga: dfl-fme-mgr: fix FME_PR_INTFC_ID register address.
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel; +Cc: linux-api, Wu Hao
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

FME_PR_INTFC_ID is used as compat_id for fpga manager and region,
but high 64 bits and low 64 bits of the compat_id are swapped by
mistake. This patch fixes this problem by fixing register address.

Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
---
 drivers/fpga/dfl-fme-mgr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
index 76f3770..b3f7eee 100644
--- a/drivers/fpga/dfl-fme-mgr.c
+++ b/drivers/fpga/dfl-fme-mgr.c
@@ -30,8 +30,8 @@
 #define FME_PR_STS		0x10
 #define FME_PR_DATA		0x18
 #define FME_PR_ERR		0x20
-#define FME_PR_INTFC_ID_H	0xA8
-#define FME_PR_INTFC_ID_L	0xB0
+#define FME_PR_INTFC_ID_L	0xA8
+#define FME_PR_INTFC_ID_H	0xB0

 /* FME PR Control Register Bitfield */
 #define FME_PR_CTRL_PR_RST	BIT_ULL(0)  /* Reset PR engine */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 02/18] fpga: dfl: fme: remove copy_to_user() in ioctl for PR
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel; +Cc: linux-api, Wu Hao, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

This patch removes copy_to_user() code in partial reconfiguration
ioctl, as it's useless as user never needs to read the data
structure after ioctl.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v2: clean up code split from patch 2 in v1 patchset.
---
 drivers/fpga/dfl-fme-pr.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c
index d9ca955..6ec0f09 100644
--- a/drivers/fpga/dfl-fme-pr.c
+++ b/drivers/fpga/dfl-fme-pr.c
@@ -159,9 +159,6 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 	mutex_unlock(&pdata->lock);
 free_exit:
 	vfree(buf);
-	if (copy_to_user((void __user *)arg, &port_pr, minsz))
-		return -EFAULT;
-
 	return ret;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 03/18] fpga: dfl: fme: align PR buffer size per PR datawidth
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel; +Cc: linux-api, Wu Hao, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

Current driver checks if input bitstream file size is aligned or
not per PR data width (default 32bits). It requires one additional
step for end user when they generate the bitstream file, padding
extra zeros to bitstream file to align its size per PR data width,
but they don't have to as hardware will drop extra padding bytes
automatically.

In order to simplify the user steps, this patch aligns PR buffer
size per PR data width in driver, to allow user to pass unaligned
size bitstream files to driver.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
---
 drivers/fpga/dfl-fme-pr.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c
index 6ec0f09..3c71dc3 100644
--- a/drivers/fpga/dfl-fme-pr.c
+++ b/drivers/fpga/dfl-fme-pr.c
@@ -74,6 +74,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 	struct dfl_fme *fme;
 	unsigned long minsz;
 	void *buf = NULL;
+	size_t length;
 	int ret = 0;
 	u64 v;
 
@@ -85,9 +86,6 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 	if (port_pr.argsz < minsz || port_pr.flags)
 		return -EINVAL;
 
-	if (!IS_ALIGNED(port_pr.buffer_size, 4))
-		return -EINVAL;
-
 	/* get fme header region */
 	fme_hdr = dfl_get_feature_ioaddr_by_id(&pdev->dev,
 					       FME_FEATURE_ID_HEADER);
@@ -103,7 +101,13 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 		       port_pr.buffer_size))
 		return -EFAULT;
 
-	buf = vmalloc(port_pr.buffer_size);
+	/*
+	 * align PR buffer per PR bandwidth, as HW ignores the extra padding
+	 * data automatically.
+	 */
+	length = ALIGN(port_pr.buffer_size, 4);
+
+	buf = vmalloc(length);
 	if (!buf)
 		return -ENOMEM;
 
@@ -140,7 +144,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 	fpga_image_info_free(region->info);
 
 	info->buf = buf;
-	info->count = port_pr.buffer_size;
+	info->count = length;
 	info->region_id = port_pr.port_id;
 	region->info = info;
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 04/18] fpga: dfl: fme: support 512bit data width PR
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel
  Cc: linux-api, Wu Hao, Ananda Ravuri, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

In early partial reconfiguration private feature, it only
supports 32bit data width when writing data to hardware for
PR. 512bit data width PR support is an important optimization
for some specific solutions (e.g. XEON with FPGA integrated),
it allows driver to use AVX512 instruction to improve the
performance of partial reconfiguration. e.g. programming one
100MB bitstream image via this 512bit data width PR hardware
only takes ~300ms, but 32bit revision requires ~3s per test
result.

Please note now this optimization is only done on revision 2
of this PR private feature which is only used in integrated
solution that AVX512 is always supported. This revision 2
hardware doesn't support 32bit PR.

Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v2: check AVX512 support using cpu_feature_enabled()
    fix other comments from Scott Wood <swood@redhat.com>
---
 drivers/fpga/dfl-fme-main.c |   3 ++
 drivers/fpga/dfl-fme-mgr.c  | 113 +++++++++++++++++++++++++++++++++++++-------
 drivers/fpga/dfl-fme-pr.c   |  43 +++++++++++------
 drivers/fpga/dfl-fme.h      |   2 +
 drivers/fpga/dfl.h          |   5 ++
 5 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 086ad24..076d74f 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -21,6 +21,8 @@
 #include "dfl.h"
 #include "dfl-fme.h"
 
+#define DRV_VERSION	"0.8"
+
 static ssize_t ports_num_show(struct device *dev,
 			      struct device_attribute *attr, char *buf)
 {
@@ -277,3 +279,4 @@ static int fme_remove(struct platform_device *pdev)
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("platform:dfl-fme");
+MODULE_VERSION(DRV_VERSION);
diff --git a/drivers/fpga/dfl-fme-mgr.c b/drivers/fpga/dfl-fme-mgr.c
index b3f7eee..d1a4ba5 100644
--- a/drivers/fpga/dfl-fme-mgr.c
+++ b/drivers/fpga/dfl-fme-mgr.c
@@ -22,14 +22,18 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/fpga/fpga-mgr.h>
 
+#include "dfl.h"
 #include "dfl-fme-pr.h"
 
+#define DRV_VERSION	"0.8"
+
 /* FME Partial Reconfiguration Sub Feature Register Set */
 #define FME_PR_DFH		0x0
 #define FME_PR_CTRL		0x8
 #define FME_PR_STS		0x10
 #define FME_PR_DATA		0x18
 #define FME_PR_ERR		0x20
+#define FME_PR_512_DATA		0x40 /* Data Register for 512bit datawidth PR */
 #define FME_PR_INTFC_ID_L	0xA8
 #define FME_PR_INTFC_ID_H	0xB0
 
@@ -67,8 +71,43 @@
 #define PR_WAIT_TIMEOUT   8000000
 #define PR_HOST_STATUS_IDLE	0
 
+#if defined(CONFIG_X86) && defined(CONFIG_AS_AVX512)
+
+#include <linux/cpufeature.h>
+#include <asm/fpu/api.h>
+
+static inline int is_cpu_avx512_enabled(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_AVX512F);
+}
+
+static inline void copy512(const void *src, void __iomem *dst)
+{
+	kernel_fpu_begin();
+
+	asm volatile("vmovdqu64 (%0), %%zmm0;"
+		     "vmovntdq %%zmm0, (%1);"
+		     :
+		     : "r"(src), "r"(dst)
+		     : "memory");
+
+	kernel_fpu_end();
+}
+#else
+static inline int is_cpu_avx512_enabled(void)
+{
+	return 0;
+}
+
+static inline void copy512(const void *src, void __iomem *dst)
+{
+	WARN_ON_ONCE(1);
+}
+#endif
+
 struct fme_mgr_priv {
 	void __iomem *ioaddr;
+	unsigned int pr_datawidth;
 	u64 pr_error;
 };
 
@@ -169,7 +208,7 @@ static int fme_mgr_write(struct fpga_manager *mgr,
 	struct fme_mgr_priv *priv = mgr->priv;
 	void __iomem *fme_pr = priv->ioaddr;
 	u64 pr_ctrl, pr_status, pr_data;
-	int delay = 0, pr_credit, i = 0;
+	int ret = 0, delay = 0, pr_credit;
 
 	dev_dbg(dev, "start request\n");
 
@@ -181,9 +220,9 @@ static int fme_mgr_write(struct fpga_manager *mgr,
 
 	/*
 	 * driver can push data to PR hardware using PR_DATA register once HW
-	 * has enough pr_credit (> 1), pr_credit reduces one for every 32bit
-	 * pr data write to PR_DATA register. If pr_credit <= 1, driver needs
-	 * to wait for enough pr_credit from hardware by polling.
+	 * has enough pr_credit (> 1), pr_credit reduces one for every pr data
+	 * width write to PR_DATA register. If pr_credit <= 1, driver needs to
+	 * wait for enough pr_credit from hardware by polling.
 	 */
 	pr_status = readq(fme_pr + FME_PR_STS);
 	pr_credit = FIELD_GET(FME_PR_STS_PR_CREDIT, pr_status);
@@ -192,7 +231,8 @@ static int fme_mgr_write(struct fpga_manager *mgr,
 		while (pr_credit <= 1) {
 			if (delay++ > PR_WAIT_TIMEOUT) {
 				dev_err(dev, "PR_CREDIT timeout\n");
-				return -ETIMEDOUT;
+				ret = -ETIMEDOUT;
+				goto done;
 			}
 			udelay(1);
 
@@ -200,21 +240,27 @@ static int fme_mgr_write(struct fpga_manager *mgr,
 			pr_credit = FIELD_GET(FME_PR_STS_PR_CREDIT, pr_status);
 		}
 
-		if (count < 4) {
-			dev_err(dev, "Invalid PR bitstream size\n");
-			return -EINVAL;
+		WARN_ON(count < priv->pr_datawidth);
+
+		switch (priv->pr_datawidth) {
+		case 4:
+			pr_data = FIELD_PREP(FME_PR_DATA_PR_DATA_RAW,
+					     *(u32 *)buf);
+			writeq(pr_data, fme_pr + FME_PR_DATA);
+			break;
+		case 64:
+			copy512(buf, fme_pr + FME_PR_512_DATA);
+			break;
+		default:
+			WARN_ON_ONCE(1);
 		}
-
-		pr_data = 0;
-		pr_data |= FIELD_PREP(FME_PR_DATA_PR_DATA_RAW,
-				      *(((u32 *)buf) + i));
-		writeq(pr_data, fme_pr + FME_PR_DATA);
-		count -= 4;
+		buf += priv->pr_datawidth;
+		count -= priv->pr_datawidth;
 		pr_credit--;
-		i++;
 	}
 
-	return 0;
+done:
+	return ret;
 }
 
 static int fme_mgr_write_complete(struct fpga_manager *mgr,
@@ -279,6 +325,36 @@ static void fme_mgr_get_compat_id(void __iomem *fme_pr,
 	id->id_h = readq(fme_pr + FME_PR_INTFC_ID_H);
 }
 
+static u8 fme_mgr_get_pr_datawidth(struct device *dev, void __iomem *fme_pr)
+{
+	u8 revision = dfl_feature_revision(fme_pr);
+
+	if (revision < 2) {
+		/*
+		 * revision 0 and 1 only support 32bit data width partial
+		 * reconfiguration, so pr_datawidth is 4 (Byte).
+		 */
+		return 4;
+	} else if (revision == 2) {
+		/*
+		 * revision 2 hardware has optimization to support 512bit data
+		 * width partial reconfiguration with AVX512 instructions. So
+		 * pr_datawidth is 64 (Byte). As revision 2 hardware is only
+		 * used in integrated solution, CPU supports AVX512 instructions
+		 * for sure, but it still needs to check here as AVX512 could be
+		 * disabled in kernel (e.g. using clearcpuid boot option).
+		 */
+		if (is_cpu_avx512_enabled())
+			return 64;
+
+		dev_err(dev, "revision 2: AVX512 is disabled\n");
+		return 0;
+	}
+
+	dev_err(dev, "revision %d is not supported yet\n", revision);
+	return 0;
+}
+
 static int fme_mgr_probe(struct platform_device *pdev)
 {
 	struct dfl_fme_mgr_pdata *pdata = dev_get_platdata(&pdev->dev);
@@ -302,6 +378,10 @@ static int fme_mgr_probe(struct platform_device *pdev)
 			return PTR_ERR(priv->ioaddr);
 	}
 
+	priv->pr_datawidth = fme_mgr_get_pr_datawidth(dev, priv->ioaddr);
+	if (!priv->pr_datawidth)
+		return -ENODEV;
+
 	compat_id = devm_kzalloc(dev, sizeof(*compat_id), GFP_KERNEL);
 	if (!compat_id)
 		return -ENOMEM;
@@ -342,3 +422,4 @@ static int fme_mgr_remove(struct platform_device *pdev)
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("platform:dfl-fme-mgr");
+MODULE_VERSION(DRV_VERSION);
diff --git a/drivers/fpga/dfl-fme-pr.c b/drivers/fpga/dfl-fme-pr.c
index 3c71dc3..cd94ba8 100644
--- a/drivers/fpga/dfl-fme-pr.c
+++ b/drivers/fpga/dfl-fme-pr.c
@@ -83,7 +83,7 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 	if (copy_from_user(&port_pr, argp, minsz))
 		return -EFAULT;
 
-	if (port_pr.argsz < minsz || port_pr.flags)
+	if (port_pr.argsz < minsz || port_pr.flags || !port_pr.buffer_size)
 		return -EINVAL;
 
 	/* get fme header region */
@@ -101,15 +101,25 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 		       port_pr.buffer_size))
 		return -EFAULT;
 
+	mutex_lock(&pdata->lock);
+	fme = dfl_fpga_pdata_get_private(pdata);
+	/* fme device has been unregistered. */
+	if (!fme) {
+		ret = -EINVAL;
+		goto unlock_exit;
+	}
+
 	/*
 	 * align PR buffer per PR bandwidth, as HW ignores the extra padding
 	 * data automatically.
 	 */
-	length = ALIGN(port_pr.buffer_size, 4);
+	length = ALIGN(port_pr.buffer_size, fme->pr_datawidth);
 
 	buf = vmalloc(length);
-	if (!buf)
-		return -ENOMEM;
+	if (!buf) {
+		ret = -ENOMEM;
+		goto unlock_exit;
+	}
 
 	if (copy_from_user(buf,
 			   (void __user *)(unsigned long)port_pr.buffer_address,
@@ -127,18 +137,10 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 
 	info->flags |= FPGA_MGR_PARTIAL_RECONFIG;
 
-	mutex_lock(&pdata->lock);
-	fme = dfl_fpga_pdata_get_private(pdata);
-	/* fme device has been unregistered. */
-	if (!fme) {
-		ret = -EINVAL;
-		goto unlock_exit;
-	}
-
 	region = dfl_fme_region_find(fme, port_pr.port_id);
 	if (!region) {
 		ret = -EINVAL;
-		goto unlock_exit;
+		goto free_exit;
 	}
 
 	fpga_image_info_free(region->info);
@@ -159,10 +161,10 @@ static int fme_pr(struct platform_device *pdev, unsigned long arg)
 		fpga_bridges_put(&region->bridge_list);
 
 	put_device(&region->dev);
-unlock_exit:
-	mutex_unlock(&pdata->lock);
 free_exit:
 	vfree(buf);
+unlock_exit:
+	mutex_unlock(&pdata->lock);
 	return ret;
 }
 
@@ -388,6 +390,17 @@ static int pr_mgmt_init(struct platform_device *pdev,
 	mutex_lock(&pdata->lock);
 	priv = dfl_fpga_pdata_get_private(pdata);
 
+	/*
+	 * Initialize PR data width.
+	 * Only revision 2 supports 512bit datawidth for better performance,
+	 * other revisions use default 32bit datawidth. This is used for
+	 * buffer alignment.
+	 */
+	if (dfl_feature_revision(feature->ioaddr) == 2)
+		priv->pr_datawidth = 64;
+	else
+		priv->pr_datawidth = 4;
+
 	/* Initialize the region and bridge sub device list */
 	INIT_LIST_HEAD(&priv->region_list);
 	INIT_LIST_HEAD(&priv->bridge_list);
diff --git a/drivers/fpga/dfl-fme.h b/drivers/fpga/dfl-fme.h
index 5394a21..de20755 100644
--- a/drivers/fpga/dfl-fme.h
+++ b/drivers/fpga/dfl-fme.h
@@ -21,12 +21,14 @@
 /**
  * struct dfl_fme - dfl fme private data
  *
+ * @pr_datawidth: data width for partial reconfiguration.
  * @mgr: FME's FPGA manager platform device.
  * @region_list: linked list of FME's FPGA regions.
  * @bridge_list: linked list of FME's FPGA bridges.
  * @pdata: fme platform device's pdata.
  */
 struct dfl_fme {
+	int pr_datawidth;
 	struct platform_device *mgr;
 	struct list_head region_list;
 	struct list_head bridge_list;
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index a8b869e..8851c6c 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -331,6 +331,11 @@ static inline bool dfl_feature_is_port(void __iomem *base)
 		(FIELD_GET(DFH_ID, v) == DFH_ID_FIU_PORT);
 }
 
+static inline u8 dfl_feature_revision(void __iomem *base)
+{
+	return (u8)FIELD_GET(DFH_REVISION, readq(base + DFH));
+}
+
 /**
  * struct dfl_fpga_enum_info - DFL FPGA enumeration information
  *
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 05/18] Documentation: fpga: dfl: add descriptions for virtualization and new interfaces.
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel; +Cc: linux-api, Wu Hao, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

This patch adds virtualization support description for DFL based
FPGA devices (based on PCIe SRIOV), and introductions to new
interfaces added by new dfl private feature drivers.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v2: update description for thermal/power management user interfaces.
---
 Documentation/fpga/dfl.txt | 115 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/Documentation/fpga/dfl.txt b/Documentation/fpga/dfl.txt
index 6df4621..36610e0 100644
--- a/Documentation/fpga/dfl.txt
+++ b/Documentation/fpga/dfl.txt
@@ -84,6 +84,8 @@ The following functions are exposed through ioctls:
  Get driver API version (DFL_FPGA_GET_API_VERSION)
  Check for extensions (DFL_FPGA_CHECK_EXTENSION)
  Program bitstream (DFL_FPGA_FME_PORT_PR)
+ Assign port to PF (DFL_FPGA_FME_PORT_ASSIGN)
+ Release port from PF (DFL_FPGA_FME_PORT_RELEASE)
 
 More functions are exposed through sysfs
 (/sys/class/fpga_region/regionX/dfl-fme.n/):
@@ -99,6 +101,24 @@ More functions are exposed through sysfs
      one FPGA device may have more than one port, this sysfs interface indicates
      how many ports the FPGA device has.
 
+ Power management (dfl_fme_power hwmon)
+     power management hwmon sysfs interfaces allow user to read power management
+     information (power consumption, thresholds, threshold status, limits, etc.)
+     and and configure power thresholds for different throttling levels.
+
+ Thermal management (dfl_fme_thermal hwmon)
+     thermal management hwmon sysfs interfaces allow user to read thermal
+     management information (current temperature, thresholds, threshold status,
+     etc.).
+
+ Global error reporting management (errors/)
+     error reporting sysfs interfaces allow user to read errors detected by the
+     hardware, and clear the logged errors.
+
+ Performance counters (perf/)
+     performance counters sysfs interfaces allow user to use different counters
+     to get performance data.
+
 
 FIU - PORT
 ==========
@@ -139,6 +159,10 @@ More functions are exposed through sysfs:
  Read Accelerator GUID (afu_id)
      afu_id indicates which PR bitstream is programmed to this AFU.
 
+ Error reporting (errors/)
+     error reporting sysfs interfaces allow user to read port/afu errors
+     detected by the hardware, and clear the logged errors.
+
 
 DFL Framework Overview
 ======================
@@ -212,6 +236,97 @@ the compat_id exposed by the target FPGA region. This check is usually done by
 userspace before calling the reconfiguration IOCTL.
 
 
+FPGA virtualization - PCIe SRIOV
+================================
+This section describes the virtualization support on DFL based FPGA device to
+enable accessing an accelerator from applications running in a virtual machine
+(VM). This section only describes the PCIe based FPGA device with SRIOV support.
+
+Features supported by the particular FPGA device are exposed through Device
+Feature Lists, as illustrated below:
+
+  +-------------------------------+  +-------------+
+  |              PF               |  |     VF      |
+  +-------------------------------+  +-------------+
+      ^            ^         ^              ^
+      |            |         |              |
++-----|------------|---------|--------------|-------+
+|     |            |         |              |       |
+|  +-----+     +-------+ +-------+      +-------+   |
+|  | FME |     | Port0 | | Port1 |      | Port2 |   |
+|  +-----+     +-------+ +-------+      +-------+   |
+|                  ^         ^              ^       |
+|                  |         |              |       |
+|              +-------+ +------+       +-------+   |
+|              |  AFU  | |  AFU |       |  AFU  |   |
+|              +-------+ +------+       +-------+   |
+|                                                   |
+|            DFL based FPGA PCIe Device             |
++---------------------------------------------------+
+
+FME is always accessed through the physical function (PF).
+
+Ports (and related AFUs) are accessed via PF by default, but could be exposed
+through virtual function (VF) devices via PCIe SRIOV. Each VF only contains
+1 Port and 1 AFU for isolation. Users could assign individual VFs (accelerators)
+created via PCIe SRIOV interface, to virtual machines.
+
+The driver organization in virtualization case is illustrated below:
+
+  +-------++------++------+             |
+  | FME   || FME  || FME  |             |
+  | FPGA  || FPGA || FPGA |             |
+  |Manager||Bridge||Region|             |
+  +-------++------++------+             |
+  +-----------------------+  +--------+ |             +--------+
+  |          FME          |  |  AFU   | |             |  AFU   |
+  |         Module        |  | Module | |             | Module |
+  +-----------------------+  +--------+ |             +--------+
+        +-----------------------+       |       +-----------------------+
+        | FPGA Container Device |       |       | FPGA Container Device |
+        |  (FPGA Base Region)   |       |       |  (FPGA Base Region)   |
+        +-----------------------+       |       +-----------------------+
+          +------------------+          |         +------------------+
+          | FPGA PCIE Module |          | Virtual | FPGA PCIE Module |
+          +------------------+   Host   | Machine +------------------+
+ -------------------------------------- | ------------------------------
+           +---------------+            |          +---------------+
+           | PCI PF Device |            |          | PCI VF Device |
+           +---------------+            |          +---------------+
+
+FPGA PCIe device driver is always loaded first once a FPGA PCIe PF or VF device
+is detected. It:
+
+	a) finish enumeration on both FPGA PCIe PF and VF device using common
+	   interfaces from DFL framework.
+	b) supports SRIOV.
+
+The FME device driver plays a management role in this driver architecture, it
+provides ioctls to release Port from PF and assign Port to PF. After release
+a port from PF, then it's safe to expose this port through a VF via PCIe SRIOV
+sysfs interface.
+
+To enable accessing an accelerator from applications running in a VM, the
+respective AFU's port needs to be assigned to a VF using the following steps:
+
+	a) The PF owns all AFU ports by default. Any port that needs to be
+	   reassigned to a VF must first be released through the
+	   DFL_FPGA_FME_PORT_RELEASE ioctl on the FME device.
+
+	b) Once N ports are released from PF, then user can use command below
+	   to enable SRIOV and VFs. Each VF owns only one Port with AFU.
+
+	   echo N > $PCI_DEVICE_PATH/sriov_numvfs
+
+	c) Pass through the VFs to VMs
+
+	d) The AFU under VF is accessible from applications in VM (using the
+	   same driver inside the VF).
+
+Note that an FME can't be assigned to a VF, thus PR and other management
+functions are only available via the PF.
+
+
 Device enumeration
 ==================
 This section introduces how applications enumerate the fpga device from
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 06/18] fpga: dfl: fme: add DFL_FPGA_FME_PORT_RELEASE/ASSIGN ioctl support.
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel
  Cc: linux-api, Wu Hao, Zhang Yi Z, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

In order to support virtualization usage via PCIe SRIOV, this patch
adds two ioctls under FPGA Management Engine (FME) to release and
assign back the port device. In order to safely turn Port from PF
into VF and enable PCIe SRIOV, it requires user to invoke this
PORT_RELEASE ioctl to release port firstly to remove userspace
interfaces, and then configure the PF/VF access register in FME.
After disable SRIOV, it requires user to invoke this PORT_ASSIGN
ioctl to attach the port back to PF.

 Ioctl interfaces:
 * DFL_FPGA_FME_PORT_RELEASE
   Release platform device of given port, it deletes port platform
   device to remove related userspace interfaces on PF, then
   configures PF/VF access mode to VF.

 * DFL_FPGA_FME_PORT_ASSIGN
   Assign platform device of given port back to PF, it configures
   PF/VF access mode to PF, then adds port platform device back to
   re-enable related userspace interfaces on PF.

Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
---
 drivers/fpga/dfl-fme-main.c   |  54 +++++++++++++++++++++
 drivers/fpga/dfl.c            | 107 +++++++++++++++++++++++++++++++++++++-----
 drivers/fpga/dfl.h            |  10 ++++
 include/uapi/linux/fpga-dfl.h |  32 +++++++++++++
 4 files changed, 191 insertions(+), 12 deletions(-)

diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 076d74f..8b2a337 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -16,6 +16,7 @@
 
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>
 #include <linux/fpga-dfl.h>
 
 #include "dfl.h"
@@ -105,9 +106,62 @@ static void fme_hdr_uinit(struct platform_device *pdev,
 	sysfs_remove_files(&pdev->dev.kobj, fme_hdr_attrs);
 }
 
+static long fme_hdr_ioctl_release_port(struct dfl_feature_platform_data *pdata,
+				       void __user *arg)
+{
+	struct dfl_fpga_cdev *cdev = pdata->dfl_cdev;
+	struct dfl_fpga_fme_port_release release;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct dfl_fpga_fme_port_release, port_id);
+
+	if (copy_from_user(&release, arg, minsz))
+		return -EFAULT;
+
+	if (release.argsz < minsz || release.flags)
+		return -EINVAL;
+
+	return dfl_fpga_cdev_config_port(cdev, release.port_id, true);
+}
+
+static long fme_hdr_ioctl_assign_port(struct dfl_feature_platform_data *pdata,
+				      void __user *arg)
+{
+	struct dfl_fpga_cdev *cdev = pdata->dfl_cdev;
+	struct dfl_fpga_fme_port_assign assign;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct dfl_fpga_fme_port_assign, port_id);
+
+	if (copy_from_user(&assign, arg, minsz))
+		return -EFAULT;
+
+	if (assign.argsz < minsz || assign.flags)
+		return -EINVAL;
+
+	return dfl_fpga_cdev_config_port(cdev, assign.port_id, false);
+}
+
+static long fme_hdr_ioctl(struct platform_device *pdev,
+			  struct dfl_feature *feature,
+			  unsigned int cmd, unsigned long arg)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
+
+	switch (cmd) {
+	case DFL_FPGA_FME_PORT_RELEASE:
+		return fme_hdr_ioctl_release_port(pdata, (void __user *)arg);
+	case DFL_FPGA_FME_PORT_ASSIGN:
+		return fme_hdr_ioctl_assign_port(pdata, (void __user *)arg);
+	}
+
+	return -ENODEV;
+}
+
 static const struct dfl_feature_ops fme_hdr_ops = {
 	.init = fme_hdr_init,
 	.uinit = fme_hdr_uinit,
+	.ioctl = fme_hdr_ioctl,
 };
 
 static struct dfl_feature_driver fme_feature_drvs[] = {
diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
index 2c09e50..a6b6d38 100644
--- a/drivers/fpga/dfl.c
+++ b/drivers/fpga/dfl.c
@@ -224,16 +224,20 @@ void dfl_fpga_port_ops_del(struct dfl_fpga_port_ops *ops)
  */
 int dfl_fpga_check_port_id(struct platform_device *pdev, void *pport_id)
 {
-	struct dfl_fpga_port_ops *port_ops = dfl_fpga_port_ops_get(pdev);
-	int port_id;
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
+	struct dfl_fpga_port_ops *port_ops;
+
+	if (pdata->id != FEATURE_DEV_ID_UNUSED)
+		return pdata->id == *(int *)pport_id;
 
+	port_ops = dfl_fpga_port_ops_get(pdev);
 	if (!port_ops || !port_ops->get_id)
 		return 0;
 
-	port_id = port_ops->get_id(pdev);
+	pdata->id = port_ops->get_id(pdev);
 	dfl_fpga_port_ops_put(port_ops);
 
-	return port_id == *(int *)pport_id;
+	return pdata->id == *(int *)pport_id;
 }
 EXPORT_SYMBOL_GPL(dfl_fpga_check_port_id);
 
@@ -462,6 +466,7 @@ static int build_info_commit_dev(struct build_feature_devs_info *binfo)
 	pdata->dev = fdev;
 	pdata->num = binfo->feature_num;
 	pdata->dfl_cdev = binfo->cdev;
+	pdata->id = FEATURE_DEV_ID_UNUSED;
 	mutex_init(&pdata->lock);
 
 	/*
@@ -959,25 +964,27 @@ void dfl_fpga_feature_devs_remove(struct dfl_fpga_cdev *cdev)
 {
 	struct dfl_feature_platform_data *pdata, *ptmp;
 
-	remove_feature_devs(cdev);
-
 	mutex_lock(&cdev->lock);
-	if (cdev->fme_dev) {
-		/* the fme should be unregistered. */
-		WARN_ON(device_is_registered(cdev->fme_dev));
+	if (cdev->fme_dev)
 		put_device(cdev->fme_dev);
-	}
 
 	list_for_each_entry_safe(pdata, ptmp, &cdev->port_dev_list, node) {
 		struct platform_device *port_dev = pdata->dev;
 
-		/* the port should be unregistered. */
-		WARN_ON(device_is_registered(&port_dev->dev));
+		/* remove released ports */
+		if (!device_is_registered(&port_dev->dev)) {
+			dfl_id_free(feature_dev_id_type(port_dev),
+				    port_dev->id);
+			platform_device_put(port_dev);
+		}
+
 		list_del(&pdata->node);
 		put_device(&port_dev->dev);
 	}
 	mutex_unlock(&cdev->lock);
 
+	remove_feature_devs(cdev);
+
 	fpga_region_unregister(cdev->region);
 	devm_kfree(cdev->parent, cdev);
 }
@@ -1015,6 +1022,82 @@ struct platform_device *
 }
 EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_find_port);
 
+static int attach_port_dev(struct dfl_fpga_cdev *cdev, u32 port_id)
+{
+	struct platform_device *port_pdev;
+	int ret = -ENODEV;
+
+	mutex_lock(&cdev->lock);
+	port_pdev = __dfl_fpga_cdev_find_port(cdev, &port_id,
+					      dfl_fpga_check_port_id);
+	if (!port_pdev)
+		goto unlock_exit;
+
+	if (device_is_registered(&port_pdev->dev)) {
+		ret = -EBUSY;
+		goto put_dev_exit;
+	}
+
+	ret = platform_device_add(port_pdev);
+	if (ret)
+		goto put_dev_exit;
+
+	dfl_feature_dev_use_end(dev_get_platdata(&port_pdev->dev));
+	cdev->released_port_num--;
+put_dev_exit:
+	put_device(&port_pdev->dev);
+unlock_exit:
+	mutex_unlock(&cdev->lock);
+	return ret;
+}
+
+static int detach_port_dev(struct dfl_fpga_cdev *cdev, u32 port_id)
+{
+	struct platform_device *port_pdev;
+	int ret = -ENODEV;
+
+	mutex_lock(&cdev->lock);
+	port_pdev = __dfl_fpga_cdev_find_port(cdev, &port_id,
+					      dfl_fpga_check_port_id);
+	if (!port_pdev)
+		goto unlock_exit;
+
+	if (!device_is_registered(&port_pdev->dev)) {
+		ret = -EBUSY;
+		goto put_dev_exit;
+	}
+
+	ret = dfl_feature_dev_use_begin(dev_get_platdata(&port_pdev->dev));
+	if (ret)
+		goto put_dev_exit;
+
+	platform_device_del(port_pdev);
+	cdev->released_port_num++;
+put_dev_exit:
+	put_device(&port_pdev->dev);
+unlock_exit:
+	mutex_unlock(&cdev->lock);
+	return ret;
+}
+
+/**
+ * dfl_fpga_cdev_config_port - configure a port feature dev
+ * @cdev: parent container device.
+ * @port_id: id of the port feature device.
+ * @release: release port or assign port back.
+ *
+ * This function allows user to release port platform device or assign it back.
+ * e.g. to safely turn one port from PF into VF for PCI device SRIOV support,
+ * release port platform device is one necessary step.
+ */
+int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
+			      u32 port_id, bool release)
+{
+	return release ? detach_port_dev(cdev, port_id) :
+			 attach_port_dev(cdev, port_id);
+}
+EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
+
 static int __init dfl_fpga_init(void)
 {
 	int ret;
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index 8851c6c..63f39ab 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -183,6 +183,8 @@ struct dfl_feature {
 
 #define DEV_STATUS_IN_USE	0
 
+#define FEATURE_DEV_ID_UNUSED	(-1)
+
 /**
  * struct dfl_feature_platform_data - platform data for feature devices
  *
@@ -191,6 +193,7 @@ struct dfl_feature {
  * @cdev: cdev of feature dev.
  * @dev: ptr to platform device linked with this platform data.
  * @dfl_cdev: ptr to container device.
+ * @id: id used for this feature device.
  * @disable_count: count for port disable.
  * @num: number for sub features.
  * @dev_status: dev status (e.g. DEV_STATUS_IN_USE).
@@ -203,6 +206,7 @@ struct dfl_feature_platform_data {
 	struct cdev cdev;
 	struct platform_device *dev;
 	struct dfl_fpga_cdev *dfl_cdev;
+	int id;
 	unsigned int disable_count;
 	unsigned long dev_status;
 	void *private;
@@ -378,6 +382,7 @@ int dfl_fpga_enum_info_add_dfl(struct dfl_fpga_enum_info *info,
  * @fme_dev: FME feature device under this container device.
  * @lock: mutex lock to protect the port device list.
  * @port_dev_list: list of all port feature devices under this container device.
+ * @released_port_num: released port number under this container device.
  */
 struct dfl_fpga_cdev {
 	struct device *parent;
@@ -385,6 +390,7 @@ struct dfl_fpga_cdev {
 	struct device *fme_dev;
 	struct mutex lock;
 	struct list_head port_dev_list;
+	int released_port_num;
 };
 
 struct dfl_fpga_cdev *
@@ -412,4 +418,8 @@ struct platform_device *
 
 	return pdev;
 }
+
+int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
+			      u32 port_id, bool release);
+
 #endif /* __FPGA_DFL_H */
diff --git a/include/uapi/linux/fpga-dfl.h b/include/uapi/linux/fpga-dfl.h
index 2e324e5..e9a00e0 100644
--- a/include/uapi/linux/fpga-dfl.h
+++ b/include/uapi/linux/fpga-dfl.h
@@ -176,4 +176,36 @@ struct dfl_fpga_fme_port_pr {
 
 #define DFL_FPGA_FME_PORT_PR	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 0)
 
+/**
+ * DFL_FPGA_FME_PORT_RELEASE - _IOW(DFL_FPGA_MAGIC, DFL_FME_BASE + 1,
+ *					struct dfl_fpga_fme_port_release)
+ *
+ * Driver releases the port per Port ID provided by caller.
+ * Return: 0 on success, -errno on failure.
+ */
+struct dfl_fpga_fme_port_release {
+	/* Input */
+	__u32 argsz;		/* Structure length */
+	__u32 flags;		/* Zero for now */
+	__u32 port_id;
+};
+
+#define DFL_FPGA_FME_PORT_RELEASE	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 1)
+
+/**
+ * DFL_FPGA_FME_PORT_ASSIGN - _IOW(DFL_FPGA_MAGIC, DFL_FME_BASE + 2,
+ *					struct dfl_fpga_fme_port_assign)
+ *
+ * Driver assigns the port back per Port ID provided by caller.
+ * Return: 0 on success, -errno on failure.
+ */
+struct dfl_fpga_fme_port_assign {
+	/* Input */
+	__u32 argsz;		/* Structure length */
+	__u32 flags;		/* Zero for now */
+	__u32 port_id;
+};
+
+#define DFL_FPGA_FME_PORT_ASSIGN	_IO(DFL_FPGA_MAGIC, DFL_FME_BASE + 2)
+
 #endif /* _UAPI_LINUX_FPGA_DFL_H */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 07/18] fpga: dfl: pci: enable SRIOV support.
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel
  Cc: linux-api, Wu Hao, Zhang Yi Z, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

This patch enables the standard sriov support. It allows user to
enable SRIOV (and VFs), then user could pass through accelerators
(VFs) into virtual machine or use VFs directly in host.

Signed-off-by: Zhang Yi Z <yi.z.zhang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
---
 drivers/fpga/dfl-pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
 drivers/fpga/dfl.c     | 41 +++++++++++++++++++++++++++++++++++++++++
 drivers/fpga/dfl.h     |  1 +
 3 files changed, 82 insertions(+)

diff --git a/drivers/fpga/dfl-pci.c b/drivers/fpga/dfl-pci.c
index 66b5720..2fa571b 100644
--- a/drivers/fpga/dfl-pci.c
+++ b/drivers/fpga/dfl-pci.c
@@ -223,8 +223,46 @@ int cci_pci_probe(struct pci_dev *pcidev, const struct pci_device_id *pcidevid)
 	return ret;
 }
 
+static int cci_pci_sriov_configure(struct pci_dev *pcidev, int num_vfs)
+{
+	struct cci_drvdata *drvdata = pci_get_drvdata(pcidev);
+	struct dfl_fpga_cdev *cdev = drvdata->cdev;
+	int ret = 0;
+
+	mutex_lock(&cdev->lock);
+
+	if (!num_vfs) {
+		/*
+		 * disable SRIOV and then put released ports back to default
+		 * PF access mode.
+		 */
+		pci_disable_sriov(pcidev);
+
+		__dfl_fpga_cdev_config_port_vf(cdev, false);
+
+	} else if (cdev->released_port_num == num_vfs) {
+		/*
+		 * only enable SRIOV if cdev has matched released ports, put
+		 * released ports into VF access mode firstly.
+		 */
+		__dfl_fpga_cdev_config_port_vf(cdev, true);
+
+		ret = pci_enable_sriov(pcidev, num_vfs);
+		if (ret)
+			__dfl_fpga_cdev_config_port_vf(cdev, false);
+	} else {
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&cdev->lock);
+	return ret;
+}
+
 static void cci_pci_remove(struct pci_dev *pcidev)
 {
+	if (dev_is_pf(&pcidev->dev))
+		cci_pci_sriov_configure(pcidev, 0);
+
 	cci_remove_feature_devs(pcidev);
 	pci_disable_pcie_error_reporting(pcidev);
 }
@@ -234,6 +272,7 @@ static void cci_pci_remove(struct pci_dev *pcidev)
 	.id_table = cci_pcie_id_tbl,
 	.probe = cci_pci_probe,
 	.remove = cci_pci_remove,
+	.sriov_configure = cci_pci_sriov_configure,
 };
 
 module_pci_driver(cci_pci_driver);
@@ -241,3 +280,4 @@ static void cci_pci_remove(struct pci_dev *pcidev)
 MODULE_DESCRIPTION("FPGA DFL PCIe Device Driver");
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
+MODULE_VERSION(DRV_VERSION);
diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
index a6b6d38..c5aa287 100644
--- a/drivers/fpga/dfl.c
+++ b/drivers/fpga/dfl.c
@@ -1098,6 +1098,47 @@ int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
 }
 EXPORT_SYMBOL_GPL(dfl_fpga_cdev_config_port);
 
+static void config_port_vf(struct device *fme_dev, int port_id, bool is_vf)
+{
+	void __iomem *base;
+	u64 v;
+
+	base = dfl_get_feature_ioaddr_by_id(fme_dev, FME_FEATURE_ID_HEADER);
+
+	v = readq(base + FME_HDR_PORT_OFST(port_id));
+
+	v &= ~FME_PORT_OFST_ACC_CTRL;
+	v |= FIELD_PREP(FME_PORT_OFST_ACC_CTRL,
+			is_vf ? FME_PORT_OFST_ACC_VF : FME_PORT_OFST_ACC_PF);
+
+	writeq(v, base + FME_HDR_PORT_OFST(port_id));
+}
+
+/**
+ * __dfl_fpga_cdev_config_port_vf - configure port to VF access mode
+ *
+ * @cdev: parent container device.
+ * @if_vf: true for VF access mode, and false for PF access mode
+ *
+ * Return: 0 on success, negative error code otherwise.
+ *
+ * This function is needed in sriov configuration routine. It could be used to
+ * configures the released ports access mode to VF or PF.
+ * The caller needs to hold lock for protection.
+ */
+void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf)
+{
+	struct dfl_feature_platform_data *pdata;
+
+	list_for_each_entry(pdata, &cdev->port_dev_list, node) {
+		if (device_is_registered(&pdata->dev->dev))
+			continue;
+
+		config_port_vf(cdev->fme_dev, pdata->id, is_vf);
+	}
+}
+EXPORT_SYMBOL_GPL(__dfl_fpga_cdev_config_port_vf);
+
 static int __init dfl_fpga_init(void)
 {
 	int ret;
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index 63f39ab..1350e8e 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -421,5 +421,6 @@ struct platform_device *
 
 int dfl_fpga_cdev_config_port(struct dfl_fpga_cdev *cdev,
 			      u32 port_id, bool release);
+void __dfl_fpga_cdev_config_port_vf(struct dfl_fpga_cdev *cdev, bool is_vf);
 
 #endif /* __FPGA_DFL_H */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 08/18] fpga: dfl: afu: add AFU state related sysfs interfaces
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel
  Cc: linux-api, Wu Hao, Ananda Ravuri, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

This patch introduces more sysfs interfaces for Accelerated
Function Unit (AFU). These interfaces allow users to read
current AFU Power State (APx), read / clear AFU Power (APx)
events which are sticky to identify transient APx state,
and manage AFU's LTR (latency tolerance reporting).

Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
---
 Documentation/ABI/testing/sysfs-platform-dfl-port |  30 +++++
 drivers/fpga/dfl-afu-main.c                       | 144 ++++++++++++++++++++++
 drivers/fpga/dfl.h                                |  11 ++
 3 files changed, 185 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
index 6a92dda..d7122a4 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-port
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
@@ -14,3 +14,33 @@ Description:	Read-only. User can program different PR bitstreams to FPGA
 		Accelerator Function Unit (AFU) for different functions. It
 		returns uuid which could be used to identify which PR bitstream
 		is programmed in this AFU.
+
+What:		/sys/bus/platform/devices/dfl-port.0/power_state
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It reports the APx (AFU Power) state, different APx
+		means different throttling level. When reading this file, it
+		returns "0" - Normal / "1" - AP1 / "2" - AP2 / "6" - AP6.
+
+What:		/sys/bus/platform/devices/dfl-port.0/ap1_event
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-write. Read or set 1 to clear AP1 (AFU Power State 1)
+		event. It's used to indicate transient AP1 state.
+
+What:		/sys/bus/platform/devices/dfl-port.0/ap2_event
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-write. Read or set 1 to clear AP2 (AFU Power State 2)
+		event. It's used to indicate transient AP2 state.
+
+What:		/sys/bus/platform/devices/dfl-port.0/ltr
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-write. Read and set AFU latency tolerance reporting value.
+		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
+		to 0 if it is latency sensitive.
diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
index 02baa6a..2ffec06 100644
--- a/drivers/fpga/dfl-afu-main.c
+++ b/drivers/fpga/dfl-afu-main.c
@@ -21,6 +21,8 @@
 
 #include "dfl-afu.h"
 
+#define DRV_VERSION	"0.8"
+
 /**
  * port_enable - enable a port
  * @pdev: port platform device.
@@ -141,8 +143,149 @@ static int port_get_id(struct platform_device *pdev)
 }
 static DEVICE_ATTR_RO(id);
 
+static ssize_t
+ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u64 v;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	v = readq(base + PORT_HDR_CTRL);
+	mutex_unlock(&pdata->lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%x\n",
+			 (u8)FIELD_GET(PORT_CTRL_LATENCY, v));
+}
+
+static ssize_t
+ltr_store(struct device *dev, struct device_attribute *attr,
+	  const char *buf, size_t count)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u8 ltr;
+	u64 v;
+
+	if (kstrtou8(buf, 0, &ltr) || ltr > 1)
+		return -EINVAL;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	v = readq(base + PORT_HDR_CTRL);
+	v &= ~PORT_CTRL_LATENCY;
+	v |= FIELD_PREP(PORT_CTRL_LATENCY, ltr);
+	writeq(v, base + PORT_HDR_CTRL);
+	mutex_unlock(&pdata->lock);
+
+	return count;
+}
+static DEVICE_ATTR_RW(ltr);
+
+static ssize_t
+ap1_event_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u64 v;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	v = readq(base + PORT_HDR_STS);
+	mutex_unlock(&pdata->lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%x\n",
+			 (u8)FIELD_GET(PORT_STS_AP1_EVT, v));
+}
+
+static ssize_t
+ap1_event_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u8 ap1_event;
+
+	if (kstrtou8(buf, 0, &ap1_event) || ap1_event != 1)
+		return -EINVAL;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	writeq(PORT_STS_AP1_EVT, base + PORT_HDR_STS);
+	mutex_unlock(&pdata->lock);
+
+	return count;
+}
+static DEVICE_ATTR_RW(ap1_event);
+
+static ssize_t
+ap2_event_show(struct device *dev, struct device_attribute *attr,
+	       char *buf)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u64 v;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	v = readq(base + PORT_HDR_STS);
+	mutex_unlock(&pdata->lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%x\n",
+			 (u8)FIELD_GET(PORT_STS_AP2_EVT, v));
+}
+
+static ssize_t
+ap2_event_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u8 ap2_event;
+
+	if (kstrtou8(buf, 0, &ap2_event) || ap2_event != 1)
+		return -EINVAL;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	writeq(PORT_STS_AP2_EVT, base + PORT_HDR_STS);
+	mutex_unlock(&pdata->lock);
+
+	return count;
+}
+static DEVICE_ATTR_RW(ap2_event);
+
+static ssize_t
+power_state_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	void __iomem *base;
+	u64 v;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	v = readq(base + PORT_HDR_STS);
+	mutex_unlock(&pdata->lock);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%x\n",
+			 (u8)FIELD_GET(PORT_STS_PWR_STATE, v));
+}
+static DEVICE_ATTR_RO(power_state);
+
 static const struct attribute *port_hdr_attrs[] = {
 	&dev_attr_id.attr,
+	&dev_attr_ltr.attr,
+	&dev_attr_ap1_event.attr,
+	&dev_attr_ap2_event.attr,
+	&dev_attr_power_state.attr,
 	NULL,
 };
 
@@ -634,3 +777,4 @@ static void __exit afu_exit(void)
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("platform:dfl-port");
+MODULE_VERSION(DRV_VERSION);
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index 1350e8e..1525098 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -119,6 +119,7 @@
 #define PORT_HDR_NEXT_AFU	NEXT_AFU
 #define PORT_HDR_CAP		0x30
 #define PORT_HDR_CTRL		0x38
+#define PORT_HDR_STS		0x40
 
 /* Port Capability Register Bitfield */
 #define PORT_CAP_PORT_NUM	GENMASK_ULL(1, 0)	/* ID of this port */
@@ -130,6 +131,16 @@
 /* Latency tolerance reporting. '1' >= 40us, '0' < 40us.*/
 #define PORT_CTRL_LATENCY	BIT_ULL(2)
 #define PORT_CTRL_SFTRST_ACK	BIT_ULL(4)		/* HW ack for reset */
+
+/* Port Status Register Bitfield */
+#define PORT_STS_AP2_EVT	BIT_ULL(13)		/* AP2 event detected */
+#define PORT_STS_AP1_EVT	BIT_ULL(12)		/* AP1 event detected */
+#define PORT_STS_PWR_STATE	GENMASK_ULL(11, 8)	/* AFU power states */
+#define PORT_STS_PWR_STATE_NORM 0
+#define PORT_STS_PWR_STATE_AP1	1			/* 50% throttling */
+#define PORT_STS_PWR_STATE_AP2	2			/* 90% throttling */
+#define PORT_STS_PWR_STATE_AP6	6			/* 100% throttling */
+
 /**
  * struct dfl_fpga_port_ops - port ops
  *
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 09/18] fpga: dfl: afu: add userclock sysfs interfaces.
From: Wu Hao @ 2019-04-29  8:55 UTC (permalink / raw)
  To: atull, mdf, linux-fpga, linux-kernel
  Cc: linux-api, Wu Hao, Ananda Ravuri, Russ Weight, Xu Yilun
In-Reply-To: <1556528151-17221-1-git-send-email-hao.wu@intel.com>

This patch introduces userclock sysfs interfaces for AFU, user
could use these interfaces for clock setting to AFU.

Please note that, this is only working for port header feature
with revision 0, for later revisions, userclock setting is moved
to a separated private feature, so one revision sysfs interface
is exposed to userspace application for this purpose too.

Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
---
 Documentation/ABI/testing/sysfs-platform-dfl-port |  35 +++++++
 drivers/fpga/dfl-afu-main.c                       | 114 +++++++++++++++++++++-
 drivers/fpga/dfl.h                                |   4 +
 3 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
index d7122a4..71f4892 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-port
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
@@ -44,3 +44,38 @@ Contact:	Wu Hao <hao.wu@intel.com>
 Description:	Read-write. Read and set AFU latency tolerance reporting value.
 		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
 		to 0 if it is latency sensitive.
+
+What:		/sys/bus/platform/devices/dfl-port.0/revision
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. Read this file to get the revision of port header
+		feature.
+
+What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcmd
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Write-only. User writes command to this interface to set
+		userclock to AFU.
+
+What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqsts
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. Read this file to get the status of issued command
+		to userclck_freqcmd.
+
+What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrcmd
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Write-only. User writes command to this interface to set
+		userclock counter.
+
+What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrsts
+Date:		April 2019
+KernelVersion:	5.2
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. Read this file to get the status of issued command
+		to userclck_freqcntrcmd.
diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
index 2ffec06..82fd80a 100644
--- a/drivers/fpga/dfl-afu-main.c
+++ b/drivers/fpga/dfl-afu-main.c
@@ -144,6 +144,17 @@ static int port_get_id(struct platform_device *pdev)
 static DEVICE_ATTR_RO(id);
 
 static ssize_t
+revision_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	void __iomem *base;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	return scnprintf(buf, PAGE_SIZE, "%x\n", dfl_feature_revision(base));
+}
+static DEVICE_ATTR_RO(revision);
+
+static ssize_t
 ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
@@ -282,6 +293,7 @@ static int port_get_id(struct platform_device *pdev)
 
 static const struct attribute *port_hdr_attrs[] = {
 	&dev_attr_id.attr,
+	&dev_attr_revision.attr,
 	&dev_attr_ltr.attr,
 	&dev_attr_ap1_event.attr,
 	&dev_attr_ap2_event.attr,
@@ -289,14 +301,113 @@ static int port_get_id(struct platform_device *pdev)
 	NULL,
 };
 
+static ssize_t
+userclk_freqcmd_store(struct device *dev, struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	u64 userclk_freq_cmd;
+	void __iomem *base;
+
+	if (kstrtou64(buf, 0, &userclk_freq_cmd))
+		return -EINVAL;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	writeq(userclk_freq_cmd, base + PORT_HDR_USRCLK_CMD0);
+	mutex_unlock(&pdata->lock);
+
+	return count;
+}
+static DEVICE_ATTR_WO(userclk_freqcmd);
+
+static ssize_t
+userclk_freqcntrcmd_store(struct device *dev, struct device_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
+	u64 userclk_freqcntr_cmd;
+	void __iomem *base;
+
+	if (kstrtou64(buf, 0, &userclk_freqcntr_cmd))
+		return -EINVAL;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	mutex_lock(&pdata->lock);
+	writeq(userclk_freqcntr_cmd, base + PORT_HDR_USRCLK_CMD1);
+	mutex_unlock(&pdata->lock);
+
+	return count;
+}
+static DEVICE_ATTR_WO(userclk_freqcntrcmd);
+
+static ssize_t
+userclk_freqsts_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	u64 userclk_freqsts;
+	void __iomem *base;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	userclk_freqsts = readq(base + PORT_HDR_USRCLK_STS0);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
+			 (unsigned long long)userclk_freqsts);
+}
+static DEVICE_ATTR_RO(userclk_freqsts);
+
+static ssize_t
+userclk_freqcntrsts_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	u64 userclk_freqcntrsts;
+	void __iomem *base;
+
+	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
+
+	userclk_freqcntrsts = readq(base + PORT_HDR_USRCLK_STS1);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
+			 (unsigned long long)userclk_freqcntrsts);
+}
+static DEVICE_ATTR_RO(userclk_freqcntrsts);
+
+static const struct attribute *port_hdr_userclk_attrs[] = {
+	&dev_attr_userclk_freqcmd.attr,
+	&dev_attr_userclk_freqcntrcmd.attr,
+	&dev_attr_userclk_freqsts.attr,
+	&dev_attr_userclk_freqcntrsts.attr,
+	NULL,
+};
+
 static int port_hdr_init(struct platform_device *pdev,
 			 struct dfl_feature *feature)
 {
+	int ret;
+
 	dev_dbg(&pdev->dev, "PORT HDR Init.\n");
 
 	port_reset(pdev);
 
-	return sysfs_create_files(&pdev->dev.kobj, port_hdr_attrs);
+	ret = sysfs_create_files(&pdev->dev.kobj, port_hdr_attrs);
+	if (ret)
+		return ret;
+
+	/*
+	 * if revision > 0, the userclock will be moved from port hdr register
+	 * region to a separated private feature.
+	 */
+	if (dfl_feature_revision(feature->ioaddr) > 0)
+		return 0;
+
+	ret = sysfs_create_files(&pdev->dev.kobj, port_hdr_userclk_attrs);
+	if (ret)
+		sysfs_remove_files(&pdev->dev.kobj, port_hdr_attrs);
+
+	return ret;
 }
 
 static void port_hdr_uinit(struct platform_device *pdev,
@@ -304,6 +415,7 @@ static void port_hdr_uinit(struct platform_device *pdev,
 {
 	dev_dbg(&pdev->dev, "PORT HDR UInit.\n");
 
+	sysfs_remove_files(&pdev->dev.kobj, port_hdr_userclk_attrs);
 	sysfs_remove_files(&pdev->dev.kobj, port_hdr_attrs);
 }
 
diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
index 1525098..3c5dc3a 100644
--- a/drivers/fpga/dfl.h
+++ b/drivers/fpga/dfl.h
@@ -120,6 +120,10 @@
 #define PORT_HDR_CAP		0x30
 #define PORT_HDR_CTRL		0x38
 #define PORT_HDR_STS		0x40
+#define PORT_HDR_USRCLK_CMD0	0x50
+#define PORT_HDR_USRCLK_CMD1	0x58
+#define PORT_HDR_USRCLK_STS0	0x60
+#define PORT_HDR_USRCLK_STS1	0x68
 
 /* Port Capability Register Bitfield */
 #define PORT_CAP_PORT_NUM	GENMASK_ULL(1, 0)	/* ID of this port */
-- 
1.8.3.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox