Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH ghak90 V6 05/10] audit: add contid support for signalling the audit daemon
From: Paul Moore @ 2019-04-09 14:07 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ondrej Mosnacek, Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, Steve Grubb, David Howells, Simo Sorce,
	Eric Paris, Serge E. Hallyn, Eric W . Biederman
In-Reply-To: <20190409134852.GB15660@hmswarspite.think-freely.org>

On Tue, Apr 9, 2019 at 9:49 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> On Tue, Apr 09, 2019 at 09:40:58AM -0400, Paul Moore wrote:
> > On Tue, Apr 9, 2019 at 8:58 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > >
> > > On Tue, Apr 9, 2019 at 5:40 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > Add audit container identifier support to the action of signalling the
> > > > audit daemon.
> > > >
> > > > Since this would need to add an element to the audit_sig_info struct,
> > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > userspace code to reflect the new record request and reply type.
> > > > An older userspace won't break since it won't know to request this
> > > > record type.
> > > >
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > >
> > > This looks good to me.
> > >
> > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > >
> > > Although I'm wondering if we shouldn't try to future-proof the
> > > AUDIT_SIGNAL_INFO2 format somehow, so that we don't need to add
> > > another AUDIT_SIGNAL_INFO3 when the need arises to add yet-another
> > > identifier to it... The simplest solution I can come up with is to add
> > > a "version" field at the beginning (set to 2 initially), then v<N>_len
> > > at the beginning of data for version <N>. But maybe this is too
> > > complicated for too little gain...
> >
> > FWIW, I believe the long term solution to this is the fabled netlink
> > attribute approach that we haven't talked about in some time, but I
> > keep dreaming about (it has been mostly on the back burner becasue 1)
> > time and 2) didn't want to impact the audit container ID work).  While
> > I'm not opposed to trying to make things like this a bit more robust
> > by adding version fields and similar things, there are still so many
> > (so very many) problems with the audit kernel/userspace interface that
> > still need to be addressed.
>
> Agreed, this change as-is is in keeping with the message structure that audit
> has today, and so is ok with me, but the long term goal should be a conversion
> to netlink attributes for all audit messages.  Thats a big undertaking and
> should be addressed separately though.

You've likely missed all the conversations around this from some time
ago, but this is the direction I want us to go towards eventually, and
yes, this is a huge undertaking (much larger than the audit container
ID work) that will need to be done in stages.

The first step is moving away from audit_log_format() to an in-kernel
audit API that separates the data from the record format; I've got a
lot of ideas on that, but as I said earlier, it's mostly on the back
burner so it doesn't hold up the audit container ID work.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V6 05/10] audit: add contid support for signalling the audit daemon
From: Paul Moore @ 2019-04-09 14:08 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Ondrej Mosnacek, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, Steve Grubb,
	David Howells, Simo Sorce, Eric Paris, Serge E. Hallyn,
	Eric W . Biederman, nhorman
In-Reply-To: <20190409135304.rgkcpgya6h3ciphy@madcap2.tricolour.ca>

On Tue, Apr 9, 2019 at 9:53 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-04-09 09:40, Paul Moore wrote:
> > On Tue, Apr 9, 2019 at 8:58 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> > > On Tue, Apr 9, 2019 at 5:40 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > Add audit container identifier support to the action of signalling the
> > > > audit daemon.
> > > >
> > > > Since this would need to add an element to the audit_sig_info struct,
> > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > userspace code to reflect the new record request and reply type.
> > > > An older userspace won't break since it won't know to request this
> > > > record type.
> > > >
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > >
> > > This looks good to me.
> > >
> > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > >
> > > Although I'm wondering if we shouldn't try to future-proof the
> > > AUDIT_SIGNAL_INFO2 format somehow, so that we don't need to add
> > > another AUDIT_SIGNAL_INFO3 when the need arises to add yet-another
> > > identifier to it... The simplest solution I can come up with is to add
> > > a "version" field at the beginning (set to 2 initially), then v<N>_len
> > > at the beginning of data for version <N>. But maybe this is too
> > > complicated for too little gain...
> >
> > FWIW, I believe the long term solution to this is the fabled netlink
> > attribute approach that we haven't talked about in some time, but I
> > keep dreaming about (it has been mostly on the back burner becasue 1)
> > time and 2) didn't want to impact the audit container ID work).  While
> > I'm not opposed to trying to make things like this a bit more robust
> > by adding version fields and similar things, there are still so many
> > (so very many) problems with the audit kernel/userspace interface that
> > still need to be addressed.
>
> While this particular message type is used very infrequently, adding a
> version field to every message type strikes me as a huge overhead for
> the small likelihood of the format needing to change.
>
> I'd prefer to just key it off the AUDIT_FEATURE_BITMAP or some other
> easily detectable change in this distinguishing feature, such as the
> presence of /proc/self/audit_containerid, which is what I've done in the
> accompanying userspace patchset that I'm preparing to post that works
> with this change.

That's fine.  As I said, I'm not overly worried about this; I view
this as a bit of a necessary hack.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-09 14:13 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, Alan Modra, Michael Ellerman
  Cc: Florian Weimer, Michael Meissner, Peter Bergner,
	Mathieu Desnoyers, Paul Burton, Will Deacon, Boqun Feng,
	Heiko Carstens, Vasily Gorbik, Martin Schwidefsky, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, carlos, Joseph Myers,
	Szabolcs Nagy, libc-alpha, Thomas Gleixner, Ben Maurer
In-Reply-To: <871s2bp9f9.fsf@linux.ibm.com>

On 4/9/19 9:58 AM, Tulio Magno Quites Machado Filho wrote:
> Alan Modra <amodra@gmail.com> writes:
>> Yes, looks fine to me, except that in VLE mode (do we care?)
>> ".long 0x0fe50553" disassembles as
>>     0:	0f e5       	se_cmphl r5,r30
>>     2:	05 53       	se_mullw r3,r5
>> No illegal/trap/privileged insn there.
>>
>> ".long 0x0fe5000b" might be better to cover VLE.
> 
> Looks good for me too.

The requirement that it be a valid instruction is simply to aid in the
disassembly of rseq regions which may be hand written assembly with a
thin veneer of CFI/DWARF information.

It has already been pointed out that POWER uses data in the instruction
stream for jump tables to implement switch statements, but that specific
use has compiler support and one presumes good debug information. So as
Alan says, there is already data in the insn stream, though such things
can't be good for performance (pollutes D-cache, problematic for
speculative execution).

> Actually, it better fits what Carlos O'Donnell had requested:
> 
>>>> I think the order of preference is:
>>>>
>>>> 1.  An uncommon insn (with random immediate values), in a literal pool, that is
>>>>       not a useful ROP/JOP sequence (very uncommon)
>>>> 2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
>>>> 2b. A NOP to avoid affecting speculative execution (maybe uncommon)
>>>>
>>>> With 2a/2b being roughly equivalent depending on speculative execution policy.

Yes, though "in a literal pool" is something that is not required, since
users might not want literal pools and so we shouldn't require that
feature (it also pollutes D-cache).

Keep in mind the insn will never execute.

If a trap insn calls out the nature of the signature more clearly then
use that instead.

-- 
Cheers,
Carlos.

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: hpa @ 2019-04-09 15:02 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha; +Cc: adhemerval.zanella, linux-api, linuxppc-dev
In-Reply-To: <875zrnlakd.fsf@oldenburg2.str.redhat.com>

On April 9, 2019 4:47:30 AM MDT, Florian Weimer <fweimer@redhat.com> wrote:
>struct termios2 is required for setting arbitrary baud rates on serial
>ports.  <sys/ioctl.h> and <linux/termios.h> have conflicting
>definitions in the existing termios definitions, which means that it
>is currently very difficult to use TCGETS2/TCSETS2 and struct termios2
>with glibc.  Providing a definition within glibc resolves this problem.
>
>This does not completely address bug 10339, but it at least exposes
>the current kernel functionality in this area.
>
>Support for struct termios2 is architecture-specific in the kernel.
>Support on alpha was only added in Linux 4.20.  POWER support is
>currently missing.  The expectation is that the kernel will eventually
>use the generic UAPI definition for struct termios2.
>
>2019-04-09  Florian Weimer  <fweimer@redhat.com>
>
>	[BZ #10339]
>	Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE.
>	* sysdeps/unix/sysv/linux/Makefile [$(subdir) == termios] (tests):
>	Add tst-termios2.
>	* sysdeps/unix/sysv/linux/tst-termios2.c: New file.
>	* sysdeps/unix/sysv/linux/bits/termios2-struct.h: Likewise.
>	* sysdeps/unix/sysv/linux/bits/termios.h [__USE_GNU]: Include it.
>	* sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h: New file.
>	* sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h: Likewise.
>
>diff --git a/NEWS b/NEWS
>index b58e2469d4..5e6ecb9c7d 100644
>--- a/NEWS
>+++ b/NEWS
>@@ -18,6 +18,9 @@ Major new features:
> 
> * On Linux, the gettid function has been added.
> 
>+* On Linux, <termios.h> now provides a definition of struct termios2
>with
>+  the _GNU_SOURCE feature test macro.
>+
> * Minguo (Republic of China) calendar support has been added as an
> alternative calendar for the following locales: zh_TW, cmn_TW, hak_TW,
>   nan_TW, lzh_TW.
>diff --git a/sysdeps/unix/sysv/linux/Makefile
>b/sysdeps/unix/sysv/linux/Makefile
>index 52ac6ad484..4cb5e4f0d2 100644
>--- a/sysdeps/unix/sysv/linux/Makefile
>+++ b/sysdeps/unix/sysv/linux/Makefile
>@@ -156,6 +156,7 @@ endif
> 
> ifeq ($(subdir),termios)
> sysdep_headers += termio.h
>+tests += tst-termios2
> endif
> 
> ifeq ($(subdir),posix)
>diff --git a/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..5f09445e23
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/alpha version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_cc[NCCS];
>+  cc_t c_line;
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/bits/termios.h
>b/sysdeps/unix/sysv/linux/bits/termios.h
>index 997231cd03..45ac7affdf 100644
>--- a/sysdeps/unix/sysv/linux/bits/termios.h
>+++ b/sysdeps/unix/sysv/linux/bits/termios.h
>@@ -25,6 +25,10 @@ typedef unsigned int	speed_t;
> typedef unsigned int	tcflag_t;
> 
> #include <bits/termios-struct.h>
>+#ifdef __USE_GNU
>+# include <bits/termios2-struct.h>
>+#endif
>+
> #include <bits/termios-c_cc.h>
> #include <bits/termios-c_iflag.h>
> #include <bits/termios-c_oflag.h>
>diff --git a/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..5a48e45ef3
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/generic version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_line;
>+  cc_t c_cc[NCCS];
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..7c889e575c
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/sparc version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_line;
>+  cc_t c_cc[NCCS + 2];
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/tst-termios2.c
>b/sysdeps/unix/sysv/linux/tst-termios2.c
>new file mode 100644
>index 0000000000..82326a1288
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/tst-termios2.c
>@@ -0,0 +1,48 @@
>+/* Minimal test of struct termios2 definition.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#include <termios.h>
>+#include <sys/ioctl.h>
>+
>+/* This function is never executed, but must be compiled successfully.
>+   Accessing serial ports in the test suite is problematic because
>+   they likely correspond with low-level system functionality.  */
>+void
>+not_executed (int fd)
>+{
>+  /* Avoid a compilation failure if TCGETS2, TCSETS2 are not
>+     defined.  */
>+#if defined (TCGETS2) && defined (TCSETS2)
>+  struct termios2 ti;
>+  ioctl (fd, TCGETS2, &ti);
>+  ioctl (fd, TCSETS2, &ti);
>+#endif
>+}
>+
>+static int
>+do_test (void)
>+{
>+  /* Fail at run time if TCGETS2 or TCSETS2 is not defined.  */
>+#if defined (TCGETS2) && defined (TCSETS2)
>+  return 0;
>+#else
>+  return 1;
>+#endif
>+}
>+
>+#include <support/test-driver.c>

PowerPC doesn't need it at all.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: hpa @ 2019-04-09 15:07 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha; +Cc: adhemerval.zanella, linux-api, linuxppc-dev
In-Reply-To: <875zrnlakd.fsf@oldenburg2.str.redhat.com>

On April 9, 2019 4:47:30 AM MDT, Florian Weimer <fweimer@redhat.com> wrote:
>struct termios2 is required for setting arbitrary baud rates on serial
>ports.  <sys/ioctl.h> and <linux/termios.h> have conflicting
>definitions in the existing termios definitions, which means that it
>is currently very difficult to use TCGETS2/TCSETS2 and struct termios2
>with glibc.  Providing a definition within glibc resolves this problem.
>
>This does not completely address bug 10339, but it at least exposes
>the current kernel functionality in this area.
>
>Support for struct termios2 is architecture-specific in the kernel.
>Support on alpha was only added in Linux 4.20.  POWER support is
>currently missing.  The expectation is that the kernel will eventually
>use the generic UAPI definition for struct termios2.
>
>2019-04-09  Florian Weimer  <fweimer@redhat.com>
>
>	[BZ #10339]
>	Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE.
>	* sysdeps/unix/sysv/linux/Makefile [$(subdir) == termios] (tests):
>	Add tst-termios2.
>	* sysdeps/unix/sysv/linux/tst-termios2.c: New file.
>	* sysdeps/unix/sysv/linux/bits/termios2-struct.h: Likewise.
>	* sysdeps/unix/sysv/linux/bits/termios.h [__USE_GNU]: Include it.
>	* sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h: New file.
>	* sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h: Likewise.
>
>diff --git a/NEWS b/NEWS
>index b58e2469d4..5e6ecb9c7d 100644
>--- a/NEWS
>+++ b/NEWS
>@@ -18,6 +18,9 @@ Major new features:
> 
> * On Linux, the gettid function has been added.
> 
>+* On Linux, <termios.h> now provides a definition of struct termios2
>with
>+  the _GNU_SOURCE feature test macro.
>+
> * Minguo (Republic of China) calendar support has been added as an
> alternative calendar for the following locales: zh_TW, cmn_TW, hak_TW,
>   nan_TW, lzh_TW.
>diff --git a/sysdeps/unix/sysv/linux/Makefile
>b/sysdeps/unix/sysv/linux/Makefile
>index 52ac6ad484..4cb5e4f0d2 100644
>--- a/sysdeps/unix/sysv/linux/Makefile
>+++ b/sysdeps/unix/sysv/linux/Makefile
>@@ -156,6 +156,7 @@ endif
> 
> ifeq ($(subdir),termios)
> sysdep_headers += termio.h
>+tests += tst-termios2
> endif
> 
> ifeq ($(subdir),posix)
>diff --git a/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..5f09445e23
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/alpha version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_cc[NCCS];
>+  cc_t c_line;
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/bits/termios.h
>b/sysdeps/unix/sysv/linux/bits/termios.h
>index 997231cd03..45ac7affdf 100644
>--- a/sysdeps/unix/sysv/linux/bits/termios.h
>+++ b/sysdeps/unix/sysv/linux/bits/termios.h
>@@ -25,6 +25,10 @@ typedef unsigned int	speed_t;
> typedef unsigned int	tcflag_t;
> 
> #include <bits/termios-struct.h>
>+#ifdef __USE_GNU
>+# include <bits/termios2-struct.h>
>+#endif
>+
> #include <bits/termios-c_cc.h>
> #include <bits/termios-c_iflag.h>
> #include <bits/termios-c_oflag.h>
>diff --git a/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..5a48e45ef3
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/generic version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_line;
>+  cc_t c_cc[NCCS];
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>b/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>new file mode 100644
>index 0000000000..7c889e575c
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h
>@@ -0,0 +1,33 @@
>+/* struct termios2 definition.  Linux/sparc version.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#ifndef _TERMIOS_H
>+# error "Never include <bits/termios2-struct.h> directly; use
><termios.h> instead."
>+#endif
>+
>+struct termios2
>+{
>+  tcflag_t c_iflag;
>+  tcflag_t c_oflag;
>+  tcflag_t c_cflag;
>+  tcflag_t c_lflag;
>+  cc_t c_line;
>+  cc_t c_cc[NCCS + 2];
>+  speed_t c_ispeed;
>+  speed_t c_ospeed;
>+};
>diff --git a/sysdeps/unix/sysv/linux/tst-termios2.c
>b/sysdeps/unix/sysv/linux/tst-termios2.c
>new file mode 100644
>index 0000000000..82326a1288
>--- /dev/null
>+++ b/sysdeps/unix/sysv/linux/tst-termios2.c
>@@ -0,0 +1,48 @@
>+/* Minimal test of struct termios2 definition.
>+   Copyright (C) 2019 Free Software Foundation, Inc.
>+   This file is part of the GNU C Library.
>+
>+   The GNU C Library is free software; you can redistribute it and/or
>+   modify it under the terms of the GNU Lesser General Public
>+   License as published by the Free Software Foundation; either
>+   version 2.1 of the License, or (at your option) any later version.
>+
>+   The GNU C Library is distributed in the hope that it will be
>useful,
>+   but WITHOUT ANY WARRANTY; without even the implied warranty of
>+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+   Lesser General Public License for more details.
>+
>+   You should have received a copy of the GNU Lesser General Public
>+   License along with the GNU C Library.  If not, see
>+   <http://www.gnu.org/licenses/>.  */
>+
>+#include <termios.h>
>+#include <sys/ioctl.h>
>+
>+/* This function is never executed, but must be compiled successfully.
>+   Accessing serial ports in the test suite is problematic because
>+   they likely correspond with low-level system functionality.  */
>+void
>+not_executed (int fd)
>+{
>+  /* Avoid a compilation failure if TCGETS2, TCSETS2 are not
>+     defined.  */
>+#if defined (TCGETS2) && defined (TCSETS2)
>+  struct termios2 ti;
>+  ioctl (fd, TCGETS2, &ti);
>+  ioctl (fd, TCSETS2, &ti);
>+#endif
>+}
>+
>+static int
>+do_test (void)
>+{
>+  /* Fail at run time if TCGETS2 or TCSETS2 is not defined.  */
>+#if defined (TCGETS2) && defined (TCSETS2)
>+  return 0;
>+#else
>+  return 1;
>+#endif
>+}
>+
>+#include <support/test-driver.c>

Anything blocking the POC code I provided?

Note: for PowerPC, define termios2 = kernel termios, and the ioctls as aliases.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Mathieu Desnoyers @ 2019-04-09 15:45 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Tulio Magno Quites Machado Filho, Alan Modra, Michael Ellerman,
	Florian Weimer, Michael Meissner, Peter Bergner, Paul Burton,
	Will Deacon, Boqun Feng, heiko carstens, gor, schwidefsky,
	Russell King, ARM Linux, Benjamin Herrenschmidt, Paul Mackerras,
	carlos, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner <tg>
In-Reply-To: <86030de9-862d-3ef5-d372-1695af3c8204@redhat.com>

----- On Apr 9, 2019, at 10:13 AM, Carlos O'Donell codonell@redhat.com wrote:

> On 4/9/19 9:58 AM, Tulio Magno Quites Machado Filho wrote:
>> Alan Modra <amodra@gmail.com> writes:
>>> Yes, looks fine to me, except that in VLE mode (do we care?)
>>> ".long 0x0fe50553" disassembles as
>>>     0:	0f e5       	se_cmphl r5,r30
>>>     2:	05 53       	se_mullw r3,r5
>>> No illegal/trap/privileged insn there.
>>>
>>> ".long 0x0fe5000b" might be better to cover VLE.
>> 
>> Looks good for me too.
> 
> The requirement that it be a valid instruction is simply to aid in the
> disassembly of rseq regions which may be hand written assembly with a
> thin veneer of CFI/DWARF information.
> 
> It has already been pointed out that POWER uses data in the instruction
> stream for jump tables to implement switch statements, but that specific
> use has compiler support and one presumes good debug information. So as
> Alan says, there is already data in the insn stream, though such things
> can't be good for performance (pollutes D-cache, problematic for
> speculative execution).
> 
>> Actually, it better fits what Carlos O'Donnell had requested:
>> 
>>>>> I think the order of preference is:
>>>>>
>>>>> 1.  An uncommon insn (with random immediate values), in a literal pool, that is
>>>>>       not a useful ROP/JOP sequence (very uncommon)
>>>>> 2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
>>>>> 2b. A NOP to avoid affecting speculative execution (maybe uncommon)
>>>>>
>>>>> With 2a/2b being roughly equivalent depending on speculative execution policy.
> 
> Yes, though "in a literal pool" is something that is not required, since
> users might not want literal pools and so we shouldn't require that
> feature (it also pollutes D-cache).
> 
> Keep in mind the insn will never execute.
> 
> If a trap insn calls out the nature of the signature more clearly then
> use that instead.

So based on the recent discussions, there are a few things we can conclude:

- Choosing a random value and relying on literal pools is a bad idea, because some
  compilation environments disable them entirely,
- We ideally want the signature to be a valid instruction in the instruction set
  so disassembler/emulator tools don't get confused and we don't hurt speculative
  execution.
- Best option is a trap with an unlikely immediate opcode, because it traps on the
  instruction in case the program try to execute it by mistake,
- Second best would be a no-op with an unlikely immediate opcode,
- We may want to stay away from privileged instructions because they can confuse
  emulators with may try to emulate them,
- Some architectures have big endian/little endian variants. We may need to carefully
  #ifdef each case so the numeric value matches actual instructions,
- Some architectures have extensions to their instruction set (e.g. ARM thumb, power
  VLE) which can be combined with the basic instruction set within the same program.
  We need to decide whether we care what those signatures look like in those
  instruction set extensions or not. Is it a best effort to match real instructions
  or a hard requirement ? If it's a hard requirement, we may need to extend the rseq
  system call with new flags to accept more than one signature.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Florian Weimer @ 2019-04-09 16:04 UTC (permalink / raw)
  To: hpa; +Cc: libc-alpha, adhemerval.zanella, linux-api, linuxppc-dev
In-Reply-To: <CA8DB086-0ABB-4E37-93F3-89A64B8012DF@zytor.com>

* hpa:

> Anything blocking the POC code I provided?

I don't recall that, sorry.  Would you please provide a reference?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Mathieu Desnoyers @ 2019-04-09 16:33 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Tulio Magno Quites Machado Filho, Florian Weimer,
	Michael Meissner, Alan Modra, Peter Bergner, Michael Ellerman,
	Paul Burton, Will Deacon, Boqun Feng, heiko carstens, gor,
	schwidefsky, Russell King, ARM Linux, Benjamin Herrenschmidt,
	Paul Mackerras, carlos, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner <tg>
In-Reply-To: <ce9fbc66-5eb9-0fa9-99fa-5abfe00ddfc2@redhat.com>

----- On Apr 8, 2019, at 5:45 PM, Carlos O'Donell codonell@redhat.com wrote:

> On 4/8/19 3:20 PM, Tulio Magno Quites Machado Filho wrote:
>> Carlos O'Donell <codonell@redhat.com> writes:
>> 
>>> On 4/5/19 5:16 AM, Florian Weimer wrote:
>>>> * Carlos O'Donell:
>>>>> It is valuable that it be a trap, particularly for constant pools because
>>>>> it means that a jump into the constant pool will trap.
>>>>
>>>> Sorry, I don't understand why this matters in this context.  Would you
>>>> please elaborate?
>>>
>>> Sorry, I wasn't very clear.
>>>
>>> My point is only that any accidental jumps, either with off-by-one (like you
>>> fixed in gcc/glibc's signal unwinding most recently), result in a process fault
>>> rather than executing RSEQ_SIG as a valid instruction *and then* continuing
>>> onwards to the handler.
>>>
>>> A process fault is achieved either by a trap, or an invalid instruction, or
>>> a privileged insn (like suggested for MIPS in this thread).
>> 
>> In that case, mtmsr (Move to Machine State Register) seems a good candidate.
>> 
>> mtmsr is available both on 32 and 64 bits since their first implementations.
>> 
>> It's a privileged instruction and should never appear in userspace
>> code (causes SIGILL).
>> 
>> Any comments?
>  
> That seems good to me.
> 
> Mathieu,
> 
> What's required to move this forward for POWER?

First, we need to decide whether we need rseq to support more than
one signature per process to cover the VLE extension. If so, we'd
need to extend rseq with a new flag for future kernels.

Then, ideally, we'd need a patch on top of the Linux kernel
tools/testing/selftests/rseq/rseq-ppc.h file that updates
the signature value. I think the current discussion leads us
towards a trap:

/*
 * TODO: document instruction objdump output on each architecture and VLE
 * instruction sets.
 */
#define RSEQ_SIG 0x0fe5000b

Should we do anything specific for big/little endian ? Is the byte order
of the instruction encoding the same as data ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Mathieu Desnoyers @ 2019-04-09 16:40 UTC (permalink / raw)
  To: Paul Burton
  Cc: Carlos O'Donell, Will Deacon, Boqun Feng, heiko carstens, gor,
	schwidefsky, Russell King, ARM Linux, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, carlos, Florian Weimer,
	Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Dave Watson
In-Reply-To: <20190404214151.6ogrm34dok52az4h@pburton-laptop>

----- On Apr 4, 2019, at 5:41 PM, Paul Burton paul.burton@mips.com wrote:

> Hi Carlos / all,
> 
> On Thu, Apr 04, 2019 at 04:50:08PM -0400, Carlos O'Donell wrote:
>> > > > +/* Signature required before each abort handler code.  */
>> > > > +#define RSEQ_SIG 0x53053053
>> > > 
>> > > Why isn't this a mips-specific op code?
>> > 
>> > MIPS also has a literal pool just before the abort handler, and it
>> > jumps over it. My understanding is that we can use any signature value
>> > we want, and it does not need to be a valid instruction, similarly to ARM:
>> > 
>> > #define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, \
>> >                                  abort_label, version, flags, \
>> >                                  start_ip, post_commit_offset, abort_ip) \
>> >                  ".balign 32\n\t" \
>> >                  __rseq_str(table_label) ":\n\t" \
>> >                  ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
>> >                  LONG " " U32_U64_PAD(__rseq_str(start_ip)) "\n\t" \
>> >                  LONG " " U32_U64_PAD(__rseq_str(post_commit_offset)) "\n\t" \
>> >                  LONG " " U32_U64_PAD(__rseq_str(abort_ip)) "\n\t" \
>> >                  ".word " __rseq_str(RSEQ_SIG) "\n\t" \
>> >                  __rseq_str(label) ":\n\t" \
>> >                  teardown \
>> >                  "b %l[" __rseq_str(abort_label) "]\n\t"
>> > 
>> > Perhaps Paul Burton can confirm this ?
>> 
>> Yes please.
>> 
>> You also want to avoid the value being a valid MIPS insn that's common.
>> 
>> Did you check that?
> 
> This does not decode as a standard MIPS instruction, though it does
> decode for both the microMIPS (ori) & nanoMIPS (lwxs; sll) ISAs.
> 
> I imagine I copied the value from another architecture when porting, and
> since it doesn't get executed it seemed fine.
> 
> One maybe nicer option along the same lines would be 0x72736571 or
> 0x71657372 (ASCII 'rseq') neither of which decode as a MIPS instruction.
> 
>> I think the order of preference is:
>> 
>> 1.  An uncommon insn (with random immediate values), in a literal pool, that is
>>     not a useful ROP/JOP sequence (very uncommon)
> 
> For that option on MIPS we could do something like:
> 
>  sll $0, $0, 31     # effectively a nop, but looks weird
> 
>> 2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
> 
> Our break instruction has a 19b immediate in nanoMIPS (20b for microMIPS
> & classic MIPS) so that could be something like:
> 
>  break 0x7273       # ASCII 'rs'
> 
> That's pretty unlikely to be seen in normal code, or the teq instruction
> has a rarely used code field (4b in microMIPS, 5b in nanoMIPS, 10b in
> classic MIPS) that's meaningless to hardware so something like this
> would be possible:
> 
>  teq $0, $0, 0x8    # ASCII backspace
> 
>> 2b. A NOP to avoid affecting speculative execution (maybe uncommon)
>> 
>> With 2a/2b being roughly equivalent depending on speculative execution policy.
> 
> There are a bunch of potential odd looking nops possible, one of which
> would be the sll I mentioned above.
> 
> Another option would be to use a priveleged instruction which userland
> code can't execute & should normally never contain. That would decode as
> a valid instruction & effectively behave like a trap instruction but
> look very odd to anyone reading disassembled code. eg:
> 
>  mfc0 $0, 13        # Try to read the cause register; take SIGILL
> 
> In order to handle MIPS vs microMIPS vs nanoMIPS differences I'm
> thinking it may be best to switch to one of these real instructions that
> looks strange. The ugly part would be the nest of #ifdef's to deal with
> endianness & ISA when defining it as a number...

Note that we can have different signatures for each sub-architecture, as
long as they don't have to co-exist within the same process.

Ideally we'd need a patch on top of the Linux kernel
tools/testing/selftests/rseq/rseq-mips.h file that updates
the signature value. I think the current discussion leads us
towards a trap with unlikely immediate operand. Note that we
can special-case with #ifdef for each sub-architecture and endianness
if need be.

/*
 * TODO: document trap instruction objdump output on each sub-architecture
 * instruction sets.
 */
#define RSEQ_SIG 0x########

Should we do anything specific for big/little endian ? Is the byte order
of the instruction encoding the same as data ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH] io_uring: add support for barrier fsync
From: Christoph Hellwig @ 2019-04-09 18:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-fsdevel, linux-block@vger.kernel.org, linux-api,
	linux-kernel
In-Reply-To: <7c7276e4-8ffa-495a-6abf-926a58ee899e@kernel.dk>

On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
> It's a quite common use case to issue a bunch of writes, then an fsync
> or fdatasync when they complete. Since io_uring doesn't guarantee any
> type of ordering, the application must track issued writes and wait
> with the fsync issue until they have completed.
> 
> Add an IORING_FSYNC_BARRIER flag that helps with this so the application
> doesn't have to do this manually. If this flag is set for the fsync
> request, we won't issue it until pending IO has already completed.

I think we need a much more detailed explanation of the semantics,
preferably in man page format.

Barrier at least in Linux traditionally means all previously submitted
requests have finished and no new ones are started until the
barrier request finishes, which is very heavy handed.  Is that what
this is supposed to do?  If not what are the exact guarantees vs
ordering and or barrier semantics?

^ permalink raw reply

* Re: [PATCH] io_uring: add support for barrier fsync
From: Jens Axboe @ 2019-04-09 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-block@vger.kernel.org, linux-api,
	linux-kernel
In-Reply-To: <20190409181742.GA24925@infradead.org>

On 4/9/19 12:17 PM, Christoph Hellwig wrote:
> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>> It's a quite common use case to issue a bunch of writes, then an fsync
>> or fdatasync when they complete. Since io_uring doesn't guarantee any
>> type of ordering, the application must track issued writes and wait
>> with the fsync issue until they have completed.
>>
>> Add an IORING_FSYNC_BARRIER flag that helps with this so the application
>> doesn't have to do this manually. If this flag is set for the fsync
>> request, we won't issue it until pending IO has already completed.
> 
> I think we need a much more detailed explanation of the semantics,
> preferably in man page format.
> 
> Barrier at least in Linux traditionally means all previously submitted
> requests have finished and no new ones are started until the
> barrier request finishes, which is very heavy handed.  Is that what
> this is supposed to do?  If not what are the exact guarantees vs
> ordering and or barrier semantics?

The patch description isn't that great, and maybe the naming isn't that
intuitive either. The way it's implemented, the fsync will NOT be issued
until previously issued IOs have completed. That means both reads and
writes, since there's no way to wait for just one.  In terms of
semantics, any previously submitted writes will have completed before
this fsync is issued. The barrier fsync has no ordering wrt future
writes, no ordering is implied there. Hence:

W1, W2, W3, FSYNC_W_BARRIER, W4, W5

W1..3 will have been completed by the hardware side before we start
FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the fsync
completes, no ordering is provided there.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] io_uring: add support for barrier fsync
From: Chris Mason @ 2019-04-09 18:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block@vger.kernel.org,
	linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <5f8d9644-9e8f-c9d2-611e-4b144c62539c@kernel.dk>

On 9 Apr 2019, at 14:23, Jens Axboe wrote:

> On 4/9/19 12:17 PM, Christoph Hellwig wrote:
>> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>>> It's a quite common use case to issue a bunch of writes, then an 
>>> fsync
>>> or fdatasync when they complete. Since io_uring doesn't guarantee 
>>> any
>>> type of ordering, the application must track issued writes and wait
>>> with the fsync issue until they have completed.
>>>
>>> Add an IORING_FSYNC_BARRIER flag that helps with this so the 
>>> application
>>> doesn't have to do this manually. If this flag is set for the fsync
>>> request, we won't issue it until pending IO has already completed.
>>
>> I think we need a much more detailed explanation of the semantics,
>> preferably in man page format.
>>
>> Barrier at least in Linux traditionally means all previously 
>> submitted
>> requests have finished and no new ones are started until the
>> barrier request finishes, which is very heavy handed.  Is that what
>> this is supposed to do?  If not what are the exact guarantees vs
>> ordering and or barrier semantics?
>
> The patch description isn't that great, and maybe the naming isn't 
> that
> intuitive either. The way it's implemented, the fsync will NOT be 
> issued
> until previously issued IOs have completed. That means both reads and
> writes, since there's no way to wait for just one.  In terms of
> semantics, any previously submitted writes will have completed before
> this fsync is issued. The barrier fsync has no ordering wrt future
> writes, no ordering is implied there. Hence:
>
> W1, W2, W3, FSYNC_W_BARRIER, W4, W5
>
> W1..3 will have been completed by the hardware side before we start
> FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the 
> fsync
> completes, no ordering is provided there.

Looking at the patch, why is fsync special?  Seems like you could add 
this ordering bit to any write?

While you're here, do you want to add a way to FUA/cache flush?  
Basically the rest of what user land would need to make their own 
write-back-cache-safe implementation.

-chris

^ permalink raw reply

* Re: [PATCH] io_uring: add support for barrier fsync
From: Jens Axboe @ 2019-04-09 18:46 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, linux-fsdevel, linux-block@vger.kernel.org,
	linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <5BF7FDDE-212E-4F9A-9B50-26BDA99E952A@fb.com>

On 4/9/19 12:42 PM, Chris Mason wrote:
> On 9 Apr 2019, at 14:23, Jens Axboe wrote:
> 
>> On 4/9/19 12:17 PM, Christoph Hellwig wrote:
>>> On Tue, Apr 09, 2019 at 10:27:43AM -0600, Jens Axboe wrote:
>>>> It's a quite common use case to issue a bunch of writes, then an 
>>>> fsync
>>>> or fdatasync when they complete. Since io_uring doesn't guarantee 
>>>> any
>>>> type of ordering, the application must track issued writes and wait
>>>> with the fsync issue until they have completed.
>>>>
>>>> Add an IORING_FSYNC_BARRIER flag that helps with this so the 
>>>> application
>>>> doesn't have to do this manually. If this flag is set for the fsync
>>>> request, we won't issue it until pending IO has already completed.
>>>
>>> I think we need a much more detailed explanation of the semantics,
>>> preferably in man page format.
>>>
>>> Barrier at least in Linux traditionally means all previously 
>>> submitted
>>> requests have finished and no new ones are started until the
>>> barrier request finishes, which is very heavy handed.  Is that what
>>> this is supposed to do?  If not what are the exact guarantees vs
>>> ordering and or barrier semantics?
>>
>> The patch description isn't that great, and maybe the naming isn't 
>> that
>> intuitive either. The way it's implemented, the fsync will NOT be 
>> issued
>> until previously issued IOs have completed. That means both reads and
>> writes, since there's no way to wait for just one.  In terms of
>> semantics, any previously submitted writes will have completed before
>> this fsync is issued. The barrier fsync has no ordering wrt future
>> writes, no ordering is implied there. Hence:
>>
>> W1, W2, W3, FSYNC_W_BARRIER, W4, W5
>>
>> W1..3 will have been completed by the hardware side before we start
>> FSYNC_W_BARRIER. We don't wait with issuing W4..5 until after the 
>> fsync
>> completes, no ordering is provided there.
> 
> Looking at the patch, why is fsync special?  Seems like you could add 
> this ordering bit to any write?

It's really not, the exact same technique could be used on any type of
command to imply ordering. My initial idea was to have an explicit
barrier/ordering command, but I didn't think that separating it from an
actual command would be needed/useful.

> While you're here, do you want to add a way to FUA/cache flush?  
> Basically the rest of what user land would need to make their own 
> write-back-cache-safe implementation.

FUA would be a WRITEV/WRITE_FIXED flag, that should be trivially doable.

In terms of cache flush, that's very heavy handed (and storage
oriented).  What applications would want/need to do an explicit whole
device flush?

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] io_uring: add support for barrier fsync
From: Chris Mason @ 2019-04-09 18:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block@vger.kernel.org,
	linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <a7f00629-f822-7271-7a4e-ad3b802752e4@kernel.dk>

On 9 Apr 2019, at 14:46, Jens Axboe wrote:

> On 4/9/19 12:42 PM, Chris Mason wrote:

>> Looking at the patch, why is fsync special?  Seems like you could add
>> this ordering bit to any write?
>
> It's really not, the exact same technique could be used on any type of
> command to imply ordering. My initial idea was to have an explicit
> barrier/ordering command, but I didn't think that separating it from 
> an
> actual command would be needed/useful.

Might want to order discards?  Commit a transaction to free some blocks, 
discard those blocks?

>
>> While you're here, do you want to add a way to FUA/cache flush?
>> Basically the rest of what user land would need to make their own
>> write-back-cache-safe implementation.
>
> FUA would be a WRITEV/WRITE_FIXED flag, that should be trivially 
> doable.
>
> In terms of cache flush, that's very heavy handed (and storage
> oriented).  What applications would want/need to do an explicit whole
> device flush?

Basically if someone is writing a userland filesystem they'd want to 
cache flush for the same reasons the kernel filesystems do.


-chris

^ permalink raw reply

* Re: [PATCH 11/17] fpga: dfl: afu: add error reporting support.
From: Alan Tull @ 2019-04-09 20:57 UTC (permalink / raw)
  To: Wu Hao; +Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <1553483264-5379-12-git-send-email-hao.wu@intel.com>

On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:

Hi Hao,

>
> Error reporting is one important private feature, it reports error
> detected on port and accelerated function unit (AFU). It introduces
> several sysfs interfaces to allow userspace to check and clear
> errors detected by hardware.
>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  29 +++
>  drivers/fpga/Makefile                             |   1 +
>  drivers/fpga/dfl-afu-error.c                      | 225 ++++++++++++++++++++++
>  drivers/fpga/dfl-afu-main.c                       |   4 +
>  drivers/fpga/dfl-afu.h                            |   4 +
>  5 files changed, 263 insertions(+)
>  create mode 100644 drivers/fpga/dfl-afu-error.c
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> index f611e47..e6140aa 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> @@ -79,3 +79,32 @@ KernelVersion:       5.2
>  Contact:       Wu Hao <hao.wu@intel.com>
>  Description:   Read-only. Read this file to get the status of issued command
>                 to userclck_freqcntrcmd.
> +
> +What:          /sys/bus/platform/devices/dfl-port.0/errors/errors
> +Date:          March 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get errors detected on port and
> +               Accelerated Function Unit (AFU).
> +
> +What:          /sys/bus/platform/devices/dfl-port.0/errors/first_error
> +Date:          March 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get the first error detected by
> +               hardware.
> +
> +What:          /sys/bus/platform/devices/dfl-port.0/errors/first_malformed_req
> +Date:          March 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get the first malformed request
> +               captured by hardware.
> +
> +What:          /sys/bus/platform/devices/dfl-port.0/errors/clear
> +Date:          March 2019
> +KernelVersion: 5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Write-only. Write error code to this file to clear errors. If
> +               the input error code doesn't match, it returns -EBUSY error
> +               code.

I understand how -EBUSY could be the right error code for when the
hardware is in a state where the error can't be cleared.  But if the
input error code doesn't match, shouldn't the code be -EINVAL?  Also
as noted below, the way this is currently coded, -ETIMEDOUT could get
returned.

> diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> index c0dd4c8..f1f0af7 100644
> --- a/drivers/fpga/Makefile
> +++ b/drivers/fpga/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_FPGA_DFL_AFU)            += dfl-afu.o
>
>  dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
>  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> +dfl-afu-objs += dfl-afu-error.o
>
>  # Drivers for FPGAs which implement DFL
>  obj-$(CONFIG_FPGA_DFL_PCI)             += dfl-pci.o
> diff --git a/drivers/fpga/dfl-afu-error.c b/drivers/fpga/dfl-afu-error.c
> new file mode 100644
> index 0000000..b66bd4a
> --- /dev/null
> +++ b/drivers/fpga/dfl-afu-error.c
> @@ -0,0 +1,225 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for FPGA Accelerated Function Unit (AFU) Error Reporting
> + *
> + * Copyright 2019 Intel Corporation, Inc.
> + *
> + * Authors:
> + *   Wu Hao <hao.wu@linux.intel.com>
> + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> + *   Joseph Grecco <joe.grecco@intel.com>
> + *   Enno Luebbers <enno.luebbers@intel.com>
> + *   Tim Whisonant <tim.whisonant@intel.com>
> + *   Ananda Ravuri <ananda.ravuri@intel.com>
> + *   Mitchel Henry <henry.mitchel@intel.com>
> + */
> +
> +#include <linux/uaccess.h>
> +
> +#include "dfl-afu.h"
> +
> +#define PORT_ERROR_MASK                0x8
> +#define PORT_ERROR             0x10
> +#define PORT_FIRST_ERROR       0x18
> +#define PORT_MALFORMED_REQ0    0x20
> +#define PORT_MALFORMED_REQ1    0x28
> +
> +#define ERROR_MASK             GENMASK_ULL(63, 0)
> +
> +/* mask or unmask port errors by the error mask register. */
> +static void __port_err_mask(struct device *dev, bool mask)
> +{
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +       writeq(mask ? ERROR_MASK : 0, base + PORT_ERROR_MASK);
> +}
> +
> +/* clear port errors. */
> +static int __port_err_clear(struct device *dev, u64 err)
> +{
> +       struct platform_device *pdev = to_platform_device(dev);
> +       void __iomem *base_err, *base_hdr;
> +       int ret;
> +       u64 v;
> +
> +       base_err = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +       base_hdr = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +       /*
> +        * clear Port Errors
> +        *
> +        * - Check for AP6 State
> +        * - Halt Port by keeping Port in reset
> +        * - Set PORT Error mask to all 1 to mask errors
> +        * - Clear all errors
> +        * - Set Port mask to all 0 to enable errors
> +        * - All errors start capturing new errors
> +        * - Enable Port by pulling the port out of reset
> +        */
> +
> +       /* if device is still in AP6 power state, can not clear any error. */
> +       v = readq(base_hdr + PORT_HDR_STS);
> +       if (FIELD_GET(PORT_STS_PWR_STATE, v) == PORT_STS_PWR_STATE_AP6) {
> +               dev_err(dev, "Could not clear errors, device in AP6 state.\n");
> +               return -EBUSY;
> +       }
> +
> +       /* Halt Port by keeping Port in reset */
> +       ret = __port_disable(pdev);
> +       if (ret)
> +               return ret;

__port_disable can return -ETIMEDOUT which will then get returned from
clear_store.  The sysfs document only talks about -EBUSY. You could
either document -ETIMEDOUT in the sysfs doc or you could change the
code to adjust the returned error code.

> +
> +       /* Mask all errors */
> +       __port_err_mask(dev, true);
> +
> +       /* Clear errors if err input matches with current port errors.*/
> +       v = readq(base_err + PORT_ERROR);
> +
> +       if (v == err) {
> +               writeq(v, base_err + PORT_ERROR);
> +
> +               v = readq(base_err + PORT_FIRST_ERROR);
> +               writeq(v, base_err + PORT_FIRST_ERROR);
> +       } else {
> +               ret = -EBUSY;
> +       }
> +
> +       /* Clear mask */
> +       __port_err_mask(dev, false);
> +
> +       /* Enable the Port by clear the reset */
> +       __port_enable(pdev);
> +
> +       return ret;
> +}
> +
> +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> +                            char *buf)
> +{
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", dfl_feature_revision(base));
> +}
> +static DEVICE_ATTR_RO(revision);

This appears to be adding a
/sys/bus/platform/devices/dfl-port.0/errors/revision attribute that
isn't documented in the sysfs document.

> +
> +static ssize_t errors_show(struct device *dev, struct device_attribute *attr,
> +                          char *buf)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       void __iomem *base;
> +       u64 error;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +       mutex_lock(&pdata->lock);
> +       error = readq(base + PORT_ERROR);
> +       mutex_unlock(&pdata->lock);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n", (unsigned long long)error);
> +}
> +static DEVICE_ATTR_RO(errors);
> +
> +static ssize_t first_error_show(struct device *dev,
> +                               struct device_attribute *attr, char *buf)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       void __iomem *base;
> +       u64 error;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +       mutex_lock(&pdata->lock);
> +       error = readq(base + PORT_FIRST_ERROR);
> +       mutex_unlock(&pdata->lock);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n", (unsigned long long)error);
> +}
> +static DEVICE_ATTR_RO(first_error);
> +
> +static ssize_t first_malformed_req_show(struct device *dev,
> +                                       struct device_attribute *attr,
> +                                       char *buf)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       void __iomem *base;
> +       u64 req0, req1;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +       mutex_lock(&pdata->lock);
> +       req0 = readq(base + PORT_MALFORMED_REQ0);
> +       req1 = readq(base + PORT_MALFORMED_REQ1);
> +       mutex_unlock(&pdata->lock);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%016llx%016llx\n",
> +                        (unsigned long long)req1, (unsigned long long)req0);
> +}
> +static DEVICE_ATTR_RO(first_malformed_req);
> +
> +static ssize_t clear_store(struct device *dev, struct device_attribute *attr,
> +                          const char *buff, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +       u64 value;
> +       int ret;
> +
> +       if (kstrtou64(buff, 0, &value))
> +               return -EINVAL;
> +
> +       mutex_lock(&pdata->lock);
> +       ret = __port_err_clear(dev, value);
> +       mutex_unlock(&pdata->lock);
> +
> +       return ret ? ret : count;
> +}
> +static DEVICE_ATTR_WO(clear);
> +
> +static struct attribute *port_err_attrs[] = {
> +       &dev_attr_revision.attr,
> +       &dev_attr_errors.attr,
> +       &dev_attr_first_error.attr,
> +       &dev_attr_first_malformed_req.attr,
> +       &dev_attr_clear.attr,
> +       NULL,
> +};
> +
> +static struct attribute_group port_err_attr_group = {
> +       .attrs = port_err_attrs,
> +       .name = "errors",
> +};
> +
> +static int port_err_init(struct platform_device *pdev,
> +                        struct dfl_feature *feature)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
> +
> +       dev_dbg(&pdev->dev, "PORT ERR Init.\n");
> +
> +       mutex_lock(&pdata->lock);
> +       __port_err_mask(&pdev->dev, false);
> +       mutex_unlock(&pdata->lock);
> +
> +       return sysfs_create_group(&pdev->dev.kobj, &port_err_attr_group);
> +}
> +
> +static void port_err_uinit(struct platform_device *pdev,
> +                          struct dfl_feature *feature)
> +{
> +       dev_dbg(&pdev->dev, "PORT ERR UInit.\n");
> +
> +       sysfs_remove_group(&pdev->dev.kobj, &port_err_attr_group);
> +}
> +
> +const struct dfl_feature_id port_err_id_table[] = {
> +       {.id = PORT_FEATURE_ID_ERROR,},
> +       {0,}
> +};
> +
> +const struct dfl_feature_ops port_err_ops = {
> +       .init = port_err_init,
> +       .uinit = port_err_uinit,
> +};
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index e727d9b..754729e 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -528,6 +528,10 @@ static struct dfl_feature_driver port_feature_drvs[] = {
>                 .ops = &port_afu_ops,
>         },
>         {
> +               .id_table = port_err_id_table,
> +               .ops = &port_err_ops,
> +       },
> +       {
>                 .ops = NULL,
>         }
>  };
> diff --git a/drivers/fpga/dfl-afu.h b/drivers/fpga/dfl-afu.h
> index 35e60c5..c3182a2 100644
> --- a/drivers/fpga/dfl-afu.h
> +++ b/drivers/fpga/dfl-afu.h
> @@ -100,4 +100,8 @@ int afu_dma_unmap_region(struct dfl_feature_platform_data *pdata, u64 iova);
>  struct dfl_afu_dma_region *
>  afu_dma_region_find(struct dfl_feature_platform_data *pdata,
>                     u64 iova, u64 size);
> +
> +extern const struct dfl_feature_ops port_err_ops;
> +extern const struct dfl_feature_id port_err_id_table[];
> +
>  #endif /* __DFL_AFU_H */
> --
> 2.7.4
>

Thanks,
Alan

^ permalink raw reply

* Re: [PATCH 13/17] fpga: dfl: fme: add capability sysfs interfaces
From: Alan Tull @ 2019-04-09 21:05 UTC (permalink / raw)
  To: Wu Hao
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun
In-Reply-To: <1553483264-5379-14-git-send-email-hao.wu@intel.com>

On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:

Hi Hao,

Looks good...

>
> This patch adds 3 read-only sysfs interfaces for FPGA Management Engine
> (FME) block for capabilities including cache_size, fabric_version and
> socket_id.
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>

Acked-by: Alan Tull <atull@kernel.org>

Thanks,
Alan

> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme | 23 ++++++++++++
>  drivers/fpga/dfl-fme-main.c                      | 48 ++++++++++++++++++++++++
>  2 files changed, 71 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index 8fa4feb..b8327e9 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -21,3 +21,26 @@ Contact:     Wu Hao <hao.wu@intel.com>
>  Description:   Read-only. It returns Bitstream (static FPGA region) meta
>                 data, which includes the synthesis date, seed and other
>                 information of this static FPGA region.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/cache_size
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns cache size of this FPGA device.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/fabric_version
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns fabric version of this FPGA device.
> +               Userspace applications need this information to select
> +               best data channels per different fabric design.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/socket_id
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns socket_id to indicate which socket
> +               this FPGA belongs to, only valid for integrated solution.
> +               User only needs this information, in case standard numa node
> +               can't provide correct information.
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 38c6342..8339ee8 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -75,10 +75,58 @@ static ssize_t bitstream_metadata_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(bitstream_metadata);
>
> +static ssize_t cache_size_show(struct device *dev,
> +                              struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_HEADER);
> +
> +       v = readq(base + FME_HDR_CAP);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(FME_CAP_CACHE_SIZE, v));
> +}
> +static DEVICE_ATTR_RO(cache_size);
> +
> +static ssize_t fabric_version_show(struct device *dev,
> +                                  struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_HEADER);
> +
> +       v = readq(base + FME_HDR_CAP);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(FME_CAP_FABRIC_VERID, v));
> +}
> +static DEVICE_ATTR_RO(fabric_version);
> +
> +static ssize_t socket_id_show(struct device *dev,
> +                             struct device_attribute *attr, char *buf)
> +{
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_HEADER);
> +
> +       v = readq(base + FME_HDR_CAP);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> +                        (unsigned int)FIELD_GET(FME_CAP_SOCKET_ID, v));
> +}
> +static DEVICE_ATTR_RO(socket_id);
> +
>  static const struct attribute *fme_hdr_attrs[] = {
>         &dev_attr_ports_num.attr,
>         &dev_attr_bitstream_id.attr,
>         &dev_attr_bitstream_metadata.attr,
> +       &dev_attr_cache_size.attr,
> +       &dev_attr_fabric_version.attr,
> +       &dev_attr_socket_id.attr,
>         NULL,
>  };
>
> --
> 2.7.4
>

^ permalink raw reply

* Re: [PATCH 16/17] fpga: dfl: fme: add global error reporting support
From: Alan Tull @ 2019-04-09 21:35 UTC (permalink / raw)
  To: Wu Hao
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <1553483264-5379-17-git-send-email-hao.wu@intel.com>

On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:

Hi Hao,

>
> This patch adds support for global error reporting for FPGA
> Management Engine (FME), it introduces sysfs interfaces to
> report different error detected by the hardware, and allow
> user to clear errors or inject error for testing purpose.
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-fme |  58 ++++
>  drivers/fpga/Makefile                            |   2 +-
>  drivers/fpga/dfl-fme-error.c                     | 390 +++++++++++++++++++++++
>  drivers/fpga/dfl-fme-main.c                      |   4 +
>  drivers/fpga/dfl-fme.h                           |   2 +
>  drivers/fpga/dfl.h                               |   2 +
>  6 files changed, 457 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/fpga/dfl-fme-error.c
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index 4b6448f..38f9cdd 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -156,3 +156,61 @@ KernelVersion:  5.2
>  Contact:       Wu Hao <hao.wu@intel.com>
>  Description:   Read-only. Read this file to get power limit for fpga, it
>                 is only valid for integrated solution.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/errors
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get errors detected by hardware.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/first_error
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get the first error detected by
> +               hardware.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/next_error
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. Read this file to get the second error detected by
> +               hardware.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/clear
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Write-only. Write error code to this file to clear errors. If
> +               the input error code doesn't match, it returns -EBUSY.

As with the afu errors patch, seems like -EINVAL would be better.

> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/pcie0_errors
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns errors detected on pcie0 link.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/pcie1_errors
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns errors detected on pcie1 link.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/nonfatal_errors
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns non-fatal errors detected.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/catfatal_errors
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-only. It returns catastrophic and fatal errors detected.
> +
> +What:          /sys/bus/platform/devices/dfl-fme.0/errors/inject_error
> +Date:          March 2019
> +KernelVersion:  5.2
> +Contact:       Wu Hao <hao.wu@intel.com>
> +Description:   Read-Write. Write this file to inject errors for testing
> +               purpose. Read this file to check errors injected.
> diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> index f1f0af7..1a9fa3d 100644
> --- a/drivers/fpga/Makefile
> +++ b/drivers/fpga/Makefile
> @@ -38,7 +38,7 @@ obj-$(CONFIG_FPGA_DFL_FME_BRIDGE)     += dfl-fme-br.o
>  obj-$(CONFIG_FPGA_DFL_FME_REGION)      += dfl-fme-region.o
>  obj-$(CONFIG_FPGA_DFL_AFU)             += dfl-afu.o
>
> -dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
> +dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o dfl-fme-error.o
>  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
>  dfl-afu-objs += dfl-afu-error.o
>
> diff --git a/drivers/fpga/dfl-fme-error.c b/drivers/fpga/dfl-fme-error.c
> new file mode 100644
> index 0000000..f2bd5f8
> --- /dev/null
> +++ b/drivers/fpga/dfl-fme-error.c
> @@ -0,0 +1,390 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for FPGA Management Engine Error Management
> + *
> + * Copyright 2019 Intel Corporation, Inc.
> + *
> + * Authors:
> + *   Kang Luwei <luwei.kang@intel.com>
> + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> + *   Wu Hao <hao.wu@intel.com>
> + *   Joseph Grecco <joe.grecco@intel.com>
> + *   Enno Luebbers <enno.luebbers@intel.com>
> + *   Tim Whisonant <tim.whisonant@intel.com>
> + *   Ananda Ravuri <ananda.ravuri@intel.com>
> + *   Mitchel, Henry <henry.mitchel@intel.com>
> + */
> +
> +#include <linux/uaccess.h>
> +
> +#include "dfl.h"
> +#include "dfl-fme.h"
> +
> +#define FME_ERROR_MASK         0x8
> +#define FME_ERROR              0x10
> +#define MBP_ERROR              BIT_ULL(6)
> +#define PCIE0_ERROR_MASK       0x18
> +#define PCIE0_ERROR            0x20
> +#define PCIE1_ERROR_MASK       0x28
> +#define PCIE1_ERROR            0x30
> +#define FME_FIRST_ERROR                0x38
> +#define FME_NEXT_ERROR         0x40
> +#define RAS_NONFAT_ERROR_MASK  0x48
> +#define RAS_NONFAT_ERROR       0x50
> +#define RAS_CATFAT_ERROR_MASK  0x58
> +#define RAS_CATFAT_ERROR       0x60
> +#define RAS_ERROR_INJECT       0x68
> +#define INJECT_ERROR_MASK      GENMASK_ULL(2, 0)
> +
> +static ssize_t errors_show(struct device *dev,
> +                          struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + FME_ERROR));
> +}
> +static DEVICE_ATTR_RO(errors);
> +
> +static ssize_t first_error_show(struct device *dev,
> +                               struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + FME_FIRST_ERROR));
> +}
> +static DEVICE_ATTR_RO(first_error);
> +
> +static ssize_t next_error_show(struct device *dev,
> +                              struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + FME_NEXT_ERROR));
> +}
> +static DEVICE_ATTR_RO(next_error);
> +
> +static ssize_t clear_store(struct device *dev, struct device_attribute *attr,
> +                          const char *buf, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +       struct device *err_dev = dev->parent;
> +       struct dfl_feature *feature;
> +       void __iomem *base;
> +       u64 v, val;
> +       int ret;
> +
> +       ret = kstrtou64(buf, 0, &val);
> +       if (ret)
> +               return ret;

from kstrtox.c:
* Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.

Your sysfs doc could be updated to explain these return codes.

> +
> +       feature = dfl_get_feature_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +       base = feature->ioaddr;
> +
> +       mutex_lock(&pdata->lock);
> +       writeq(GENMASK_ULL(63, 0), base + FME_ERROR_MASK);
> +
> +       v = readq(base + FME_ERROR);
> +       if (val != v) {
> +               ret = -EINVAL;

Oh wait, that's what I thought it should be ;)  so the doc just needs to change.

> +               goto done;

It would be easy to avoid using 'goto' here.

> +       }
> +
> +       writeq(v, base + FME_ERROR);
> +       v = readq(base + FME_FIRST_ERROR);
> +       writeq(v, base + FME_FIRST_ERROR);
> +       v = readq(base + FME_NEXT_ERROR);
> +       writeq(v, base + FME_NEXT_ERROR);
> +
> +done:
> +       /* Workaround: disable MBP_ERROR if feature revision is 0 */
> +       writeq(dfl_feature_revision(feature->ioaddr) ? 0ULL : MBP_ERROR,
> +              base + FME_ERROR_MASK);
> +       mutex_unlock(&pdata->lock);
> +       return ret ? ret : count;
> +}
> +static DEVICE_ATTR_WO(clear);
> +
> +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> +                            char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "%u\n", dfl_feature_revision(base));
> +}
> +static DEVICE_ATTR_RO(revision);

The revision attr to be documented in the sysfs doc.

> +
> +static ssize_t pcie0_errors_show(struct device *dev,
> +                                struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + PCIE0_ERROR));
> +}
> +
> +static ssize_t pcie0_errors_store(struct device *dev,

The sysfs doc shows this as read-only.

> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +       u64 v, val;
> +       int ret;
> +
> +       ret = kstrtou64(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       mutex_lock(&pdata->lock);
> +       writeq(GENMASK_ULL(63, 0), base + PCIE0_ERROR_MASK);
> +
> +       v = readq(base + PCIE0_ERROR);
> +       if (val != v)
> +               ret = -EBUSY;
> +       else
> +               writeq(v, base + PCIE0_ERROR);
> +
> +       writeq(0ULL, base + PCIE0_ERROR_MASK);
> +       mutex_unlock(&pdata->lock);
> +       return ret ? ret : count;
> +}
> +static DEVICE_ATTR_RW(pcie0_errors);
> +
> +static ssize_t pcie1_errors_show(struct device *dev,
> +                                struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + PCIE1_ERROR));
> +}
> +
> +static ssize_t pcie1_errors_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t count)

Same here, documentation needs to show as R/W and explain what happens
when written.

> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +       u64 v, val;
> +       int ret;
> +
> +       ret = kstrtou64(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       mutex_lock(&pdata->lock);
> +       writeq(GENMASK_ULL(63, 0), base + PCIE1_ERROR_MASK);
> +
> +       v = readq(base + PCIE1_ERROR);
> +       if (val != v)
> +               ret = -EBUSY;
> +       else
> +               writeq(v, base + PCIE1_ERROR);
> +
> +       writeq(0ULL, base + PCIE1_ERROR_MASK);
> +       mutex_unlock(&pdata->lock);
> +       return ret ? ret : count;
> +}
> +static DEVICE_ATTR_RW(pcie1_errors);
> +
> +static ssize_t nonfatal_errors_show(struct device *dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + RAS_NONFAT_ERROR));
> +}
> +static DEVICE_ATTR_RO(nonfatal_errors);
> +
> +static ssize_t catfatal_errors_show(struct device *dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)readq(base + RAS_CATFAT_ERROR));
> +}
> +static DEVICE_ATTR_RO(catfatal_errors);
> +
> +static ssize_t inject_error_show(struct device *dev,
> +                                struct device_attribute *attr, char *buf)
> +{
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +       u64 v;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       v = readq(base + RAS_ERROR_INJECT);
> +
> +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> +                        (unsigned long long)FIELD_GET(INJECT_ERROR_MASK, v));
> +}
> +
> +static ssize_t inject_error_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t count)
> +{
> +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +       struct device *err_dev = dev->parent;
> +       void __iomem *base;
> +       u8 inject_error;
> +       int ret;
> +       u64 v;
> +
> +       ret = kstrtou8(buf, 0, &inject_error);
> +       if (ret)
> +               return ret;
> +
> +       if (inject_error & ~INJECT_ERROR_MASK)
> +               return -EINVAL;
> +
> +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> +
> +       mutex_lock(&pdata->lock);
> +       v = readq(base + RAS_ERROR_INJECT);
> +       v &= ~INJECT_ERROR_MASK;
> +       v |= FIELD_PREP(INJECT_ERROR_MASK, inject_error);
> +       writeq(v, base + RAS_ERROR_INJECT);
> +       mutex_unlock(&pdata->lock);
> +
> +       return count;
> +}
> +static DEVICE_ATTR_RW(inject_error);
> +
> +static struct attribute *fme_errors_attrs[] = {
> +       &dev_attr_errors.attr,
> +       &dev_attr_first_error.attr,
> +       &dev_attr_next_error.attr,
> +       &dev_attr_clear.attr,
> +       NULL,
> +};
> +
> +static struct attribute_group fme_errors_attr_group = {
> +       .attrs  = fme_errors_attrs,
> +       .name   = "fme-errors",
> +};
> +
> +static struct attribute *errors_attrs[] = {
> +       &dev_attr_revision.attr,
> +       &dev_attr_pcie0_errors.attr,
> +       &dev_attr_pcie1_errors.attr,
> +       &dev_attr_nonfatal_errors.attr,
> +       &dev_attr_catfatal_errors.attr,
> +       &dev_attr_inject_error.attr,
> +       NULL,
> +};
> +
> +static struct attribute_group errors_attr_group = {
> +       .attrs  = errors_attrs,
> +};
> +
> +static const struct attribute_group *error_groups[] = {
> +       &fme_errors_attr_group,
> +       &errors_attr_group,
> +       NULL
> +};
> +
> +static void fme_error_enable(struct dfl_feature *feature)
> +{
> +       void __iomem *base = feature->ioaddr;
> +
> +       /* Workaround: disable MBP_ERROR if revision is 0 */
> +       writeq(dfl_feature_revision(feature->ioaddr) ? 0ULL : MBP_ERROR,
> +              base + FME_ERROR_MASK);
> +       writeq(0ULL, base + PCIE0_ERROR_MASK);
> +       writeq(0ULL, base + PCIE1_ERROR_MASK);
> +       writeq(0ULL, base + RAS_NONFAT_ERROR_MASK);
> +       writeq(0ULL, base + RAS_CATFAT_ERROR_MASK);
> +}
> +
> +static void err_dev_release(struct device *dev)
> +{
> +       kfree(dev);
> +}
> +
> +static int fme_global_err_init(struct platform_device *pdev,
> +                              struct dfl_feature *feature)
> +{
> +       struct device *dev;
> +       int ret = 0;
> +
> +       dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +       if (!dev)
> +               return -ENOMEM;
> +
> +       dev->parent = &pdev->dev;
> +       dev->release = err_dev_release;
> +       dev_set_name(dev, "errors");
> +
> +       fme_error_enable(feature);
> +
> +       ret = device_register(dev);
> +       if (ret) {
> +               put_device(dev);
> +               return ret;
> +       }
> +
> +       ret = sysfs_create_groups(&dev->kobj, error_groups);
> +       if (ret) {
> +               device_unregister(dev);
> +               return ret;
> +       }
> +
> +       feature->priv = dev;
> +
> +       return ret;
> +}
> +
> +static void fme_global_err_uinit(struct platform_device *pdev,
> +                                struct dfl_feature *feature)
> +{
> +       struct device *dev = feature->priv;
> +
> +       sysfs_remove_groups(&dev->kobj, error_groups);
> +       device_unregister(dev);
> +}
> +
> +const struct dfl_feature_id fme_global_err_id_table[] = {
> +       {.id = FME_FEATURE_ID_GLOBAL_ERR,},
> +       {0,}
> +};
> +
> +const struct dfl_feature_ops fme_global_err_ops = {
> +       .init = fme_global_err_init,
> +       .uinit = fme_global_err_uinit,
> +};
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index dafa6580..76cb112 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -686,6 +686,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
>                 .ops = &fme_power_mgmt_ops,
>         },
>         {
> +               .id_table = fme_global_err_id_table,
> +               .ops = &fme_global_err_ops,
> +       },
> +       {
>                 .ops = NULL,
>         },
>  };
> diff --git a/drivers/fpga/dfl-fme.h b/drivers/fpga/dfl-fme.h
> index 7a021c4..5fbe3f5 100644
> --- a/drivers/fpga/dfl-fme.h
> +++ b/drivers/fpga/dfl-fme.h
> @@ -37,5 +37,7 @@ struct dfl_fme {
>
>  extern const struct dfl_feature_ops fme_pr_mgmt_ops;
>  extern const struct dfl_feature_id fme_pr_mgmt_id_table[];
> +extern const struct dfl_feature_ops fme_global_err_ops;
> +extern const struct dfl_feature_id fme_global_err_id_table[];
>
>  #endif /* __DFL_FME_H */
> diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> index fbc57f0..6c32080 100644
> --- a/drivers/fpga/dfl.h
> +++ b/drivers/fpga/dfl.h
> @@ -197,12 +197,14 @@ struct dfl_feature_driver {
>   *                 feature dev (platform device)'s reources.
>   * @ioaddr: mapped mmio resource address.
>   * @ops: ops of this sub feature.
> + * @priv: priv data of this feature.
>   */
>  struct dfl_feature {
>         u64 id;
>         int resource_index;
>         void __iomem *ioaddr;
>         const struct dfl_feature_ops *ops;
> +       void *priv;
>  };
>
>  #define DEV_STATUS_IN_USE      0
> --
> 2.7.4
>

Thanks,
Alan

^ permalink raw reply

* Re: [PATCH 16/17] fpga: dfl: fme: add global error reporting support
From: Wu Hao @ 2019-04-10  1:34 UTC (permalink / raw)
  To: Alan Tull
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Ananda Ravuri, Xu Yilun
In-Reply-To: <CANk1AXQ8L-RGwZyz2fbDvni5o92AEei1DxjVfga7kMe5VPS7jg@mail.gmail.com>

On Tue, Apr 09, 2019 at 04:35:25PM -0500, Alan Tull wrote:
> On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> 
> Hi Hao,
> 
> >
> > This patch adds support for global error reporting for FPGA
> > Management Engine (FME), it introduces sysfs interfaces to
> > report different error detected by the hardware, and allow
> > user to clear errors or inject error for testing purpose.
> >
> > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  58 ++++
> >  drivers/fpga/Makefile                            |   2 +-
> >  drivers/fpga/dfl-fme-error.c                     | 390 +++++++++++++++++++++++
> >  drivers/fpga/dfl-fme-main.c                      |   4 +
> >  drivers/fpga/dfl-fme.h                           |   2 +
> >  drivers/fpga/dfl.h                               |   2 +
> >  6 files changed, 457 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/fpga/dfl-fme-error.c
> >
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > index 4b6448f..38f9cdd 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > @@ -156,3 +156,61 @@ KernelVersion:  5.2
> >  Contact:       Wu Hao <hao.wu@intel.com>
> >  Description:   Read-only. Read this file to get power limit for fpga, it
> >                 is only valid for integrated solution.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/errors
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get errors detected by hardware.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/first_error
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get the first error detected by
> > +               hardware.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/next_error
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get the second error detected by
> > +               hardware.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/fme-errors/clear
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Write-only. Write error code to this file to clear errors. If
> > +               the input error code doesn't match, it returns -EBUSY.
> 
> As with the afu errors patch, seems like -EINVAL would be better.
> 
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/pcie0_errors
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns errors detected on pcie0 link.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/pcie1_errors
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns errors detected on pcie1 link.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/nonfatal_errors
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns non-fatal errors detected.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/catfatal_errors
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. It returns catastrophic and fatal errors detected.
> > +
> > +What:          /sys/bus/platform/devices/dfl-fme.0/errors/inject_error
> > +Date:          March 2019
> > +KernelVersion:  5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-Write. Write this file to inject errors for testing
> > +               purpose. Read this file to check errors injected.
> > diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> > index f1f0af7..1a9fa3d 100644
> > --- a/drivers/fpga/Makefile
> > +++ b/drivers/fpga/Makefile
> > @@ -38,7 +38,7 @@ obj-$(CONFIG_FPGA_DFL_FME_BRIDGE)     += dfl-fme-br.o
> >  obj-$(CONFIG_FPGA_DFL_FME_REGION)      += dfl-fme-region.o
> >  obj-$(CONFIG_FPGA_DFL_AFU)             += dfl-afu.o
> >
> > -dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
> > +dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o dfl-fme-error.o
> >  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> >  dfl-afu-objs += dfl-afu-error.o
> >
> > diff --git a/drivers/fpga/dfl-fme-error.c b/drivers/fpga/dfl-fme-error.c
> > new file mode 100644
> > index 0000000..f2bd5f8
> > --- /dev/null
> > +++ b/drivers/fpga/dfl-fme-error.c
> > @@ -0,0 +1,390 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Driver for FPGA Management Engine Error Management
> > + *
> > + * Copyright 2019 Intel Corporation, Inc.
> > + *
> > + * Authors:
> > + *   Kang Luwei <luwei.kang@intel.com>
> > + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> > + *   Wu Hao <hao.wu@intel.com>
> > + *   Joseph Grecco <joe.grecco@intel.com>
> > + *   Enno Luebbers <enno.luebbers@intel.com>
> > + *   Tim Whisonant <tim.whisonant@intel.com>
> > + *   Ananda Ravuri <ananda.ravuri@intel.com>
> > + *   Mitchel, Henry <henry.mitchel@intel.com>
> > + */
> > +
> > +#include <linux/uaccess.h>
> > +
> > +#include "dfl.h"
> > +#include "dfl-fme.h"
> > +
> > +#define FME_ERROR_MASK         0x8
> > +#define FME_ERROR              0x10
> > +#define MBP_ERROR              BIT_ULL(6)
> > +#define PCIE0_ERROR_MASK       0x18
> > +#define PCIE0_ERROR            0x20
> > +#define PCIE1_ERROR_MASK       0x28
> > +#define PCIE1_ERROR            0x30
> > +#define FME_FIRST_ERROR                0x38
> > +#define FME_NEXT_ERROR         0x40
> > +#define RAS_NONFAT_ERROR_MASK  0x48
> > +#define RAS_NONFAT_ERROR       0x50
> > +#define RAS_CATFAT_ERROR_MASK  0x58
> > +#define RAS_CATFAT_ERROR       0x60
> > +#define RAS_ERROR_INJECT       0x68
> > +#define INJECT_ERROR_MASK      GENMASK_ULL(2, 0)
> > +
> > +static ssize_t errors_show(struct device *dev,
> > +                          struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + FME_ERROR));
> > +}
> > +static DEVICE_ATTR_RO(errors);
> > +
> > +static ssize_t first_error_show(struct device *dev,
> > +                               struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + FME_FIRST_ERROR));
> > +}
> > +static DEVICE_ATTR_RO(first_error);
> > +
> > +static ssize_t next_error_show(struct device *dev,
> > +                              struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + FME_NEXT_ERROR));
> > +}
> > +static DEVICE_ATTR_RO(next_error);
> > +
> > +static ssize_t clear_store(struct device *dev, struct device_attribute *attr,
> > +                          const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> > +       struct device *err_dev = dev->parent;
> > +       struct dfl_feature *feature;
> > +       void __iomem *base;
> > +       u64 v, val;
> > +       int ret;
> > +
> > +       ret = kstrtou64(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> 
> from kstrtox.c:
> * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
> 
> Your sysfs doc could be updated to explain these return codes.

Sure, will fix this.

> 
> > +
> > +       feature = dfl_get_feature_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +       base = feature->ioaddr;
> > +
> > +       mutex_lock(&pdata->lock);
> > +       writeq(GENMASK_ULL(63, 0), base + FME_ERROR_MASK);
> > +
> > +       v = readq(base + FME_ERROR);
> > +       if (val != v) {
> > +               ret = -EINVAL;
> 
> Oh wait, that's what I thought it should be ;)  so the doc just needs to change.

Sure, will capture the detailed return values in doc.

> 
> > +               goto done;
> 
> It would be easy to avoid using 'goto' here.

Yes, agree.

> 
> > +       }
> > +
> > +       writeq(v, base + FME_ERROR);
> > +       v = readq(base + FME_FIRST_ERROR);
> > +       writeq(v, base + FME_FIRST_ERROR);
> > +       v = readq(base + FME_NEXT_ERROR);
> > +       writeq(v, base + FME_NEXT_ERROR);
> > +
> > +done:
> > +       /* Workaround: disable MBP_ERROR if feature revision is 0 */
> > +       writeq(dfl_feature_revision(feature->ioaddr) ? 0ULL : MBP_ERROR,
> > +              base + FME_ERROR_MASK);
> > +       mutex_unlock(&pdata->lock);
> > +       return ret ? ret : count;
> > +}
> > +static DEVICE_ATTR_WO(clear);
> > +
> > +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> > +                            char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", dfl_feature_revision(base));
> > +}
> > +static DEVICE_ATTR_RO(revision);
> 
> The revision attr to be documented in the sysfs doc.

Will fix this.

> 
> > +
> > +static ssize_t pcie0_errors_show(struct device *dev,
> > +                                struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + PCIE0_ERROR));
> > +}
> > +
> > +static ssize_t pcie0_errors_store(struct device *dev,
> 
> The sysfs doc shows this as read-only.
> 
> > +                                 struct device_attribute *attr,
> > +                                 const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +       u64 v, val;
> > +       int ret;
> > +
> > +       ret = kstrtou64(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       writeq(GENMASK_ULL(63, 0), base + PCIE0_ERROR_MASK);
> > +
> > +       v = readq(base + PCIE0_ERROR);
> > +       if (val != v)
> > +               ret = -EBUSY;
> > +       else
> > +               writeq(v, base + PCIE0_ERROR);
> > +
> > +       writeq(0ULL, base + PCIE0_ERROR_MASK);
> > +       mutex_unlock(&pdata->lock);
> > +       return ret ? ret : count;
> > +}
> > +static DEVICE_ATTR_RW(pcie0_errors);
> > +
> > +static ssize_t pcie1_errors_show(struct device *dev,
> > +                                struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + PCIE1_ERROR));
> > +}
> > +
> > +static ssize_t pcie1_errors_store(struct device *dev,
> > +                                 struct device_attribute *attr,
> > +                                 const char *buf, size_t count)
> 
> Same here, documentation needs to show as R/W and explain what happens
> when written.

Sorry, will fix them in doc.

> 
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +       u64 v, val;
> > +       int ret;
> > +
> > +       ret = kstrtou64(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       writeq(GENMASK_ULL(63, 0), base + PCIE1_ERROR_MASK);
> > +
> > +       v = readq(base + PCIE1_ERROR);
> > +       if (val != v)
> > +               ret = -EBUSY;
> > +       else
> > +               writeq(v, base + PCIE1_ERROR);
> > +
> > +       writeq(0ULL, base + PCIE1_ERROR_MASK);
> > +       mutex_unlock(&pdata->lock);
> > +       return ret ? ret : count;
> > +}
> > +static DEVICE_ATTR_RW(pcie1_errors);
> > +
> > +static ssize_t nonfatal_errors_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + RAS_NONFAT_ERROR));
> > +}
> > +static DEVICE_ATTR_RO(nonfatal_errors);
> > +
> > +static ssize_t catfatal_errors_show(struct device *dev,
> > +                                   struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)readq(base + RAS_CATFAT_ERROR));
> > +}
> > +static DEVICE_ATTR_RO(catfatal_errors);
> > +
> > +static ssize_t inject_error_show(struct device *dev,
> > +                                struct device_attribute *attr, char *buf)
> > +{
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +       u64 v;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       v = readq(base + RAS_ERROR_INJECT);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "0x%llx\n",
> > +                        (unsigned long long)FIELD_GET(INJECT_ERROR_MASK, v));
> > +}
> > +
> > +static ssize_t inject_error_store(struct device *dev,
> > +                                 struct device_attribute *attr,
> > +                                 const char *buf, size_t count)
> > +{
> > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> > +       struct device *err_dev = dev->parent;
> > +       void __iomem *base;
> > +       u8 inject_error;
> > +       int ret;
> > +       u64 v;
> > +
> > +       ret = kstrtou8(buf, 0, &inject_error);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (inject_error & ~INJECT_ERROR_MASK)
> > +               return -EINVAL;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(err_dev, FME_FEATURE_ID_GLOBAL_ERR);
> > +
> > +       mutex_lock(&pdata->lock);
> > +       v = readq(base + RAS_ERROR_INJECT);
> > +       v &= ~INJECT_ERROR_MASK;
> > +       v |= FIELD_PREP(INJECT_ERROR_MASK, inject_error);
> > +       writeq(v, base + RAS_ERROR_INJECT);
> > +       mutex_unlock(&pdata->lock);
> > +
> > +       return count;
> > +}
> > +static DEVICE_ATTR_RW(inject_error);
> > +
> > +static struct attribute *fme_errors_attrs[] = {
> > +       &dev_attr_errors.attr,
> > +       &dev_attr_first_error.attr,
> > +       &dev_attr_next_error.attr,
> > +       &dev_attr_clear.attr,
> > +       NULL,
> > +};
> > +
> > +static struct attribute_group fme_errors_attr_group = {
> > +       .attrs  = fme_errors_attrs,
> > +       .name   = "fme-errors",
> > +};
> > +
> > +static struct attribute *errors_attrs[] = {
> > +       &dev_attr_revision.attr,
> > +       &dev_attr_pcie0_errors.attr,
> > +       &dev_attr_pcie1_errors.attr,
> > +       &dev_attr_nonfatal_errors.attr,
> > +       &dev_attr_catfatal_errors.attr,
> > +       &dev_attr_inject_error.attr,
> > +       NULL,
> > +};
> > +
> > +static struct attribute_group errors_attr_group = {
> > +       .attrs  = errors_attrs,
> > +};
> > +
> > +static const struct attribute_group *error_groups[] = {
> > +       &fme_errors_attr_group,
> > +       &errors_attr_group,
> > +       NULL
> > +};
> > +
> > +static void fme_error_enable(struct dfl_feature *feature)
> > +{
> > +       void __iomem *base = feature->ioaddr;
> > +
> > +       /* Workaround: disable MBP_ERROR if revision is 0 */
> > +       writeq(dfl_feature_revision(feature->ioaddr) ? 0ULL : MBP_ERROR,
> > +              base + FME_ERROR_MASK);
> > +       writeq(0ULL, base + PCIE0_ERROR_MASK);
> > +       writeq(0ULL, base + PCIE1_ERROR_MASK);
> > +       writeq(0ULL, base + RAS_NONFAT_ERROR_MASK);
> > +       writeq(0ULL, base + RAS_CATFAT_ERROR_MASK);
> > +}
> > +
> > +static void err_dev_release(struct device *dev)
> > +{
> > +       kfree(dev);
> > +}
> > +
> > +static int fme_global_err_init(struct platform_device *pdev,
> > +                              struct dfl_feature *feature)
> > +{
> > +       struct device *dev;
> > +       int ret = 0;
> > +
> > +       dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +       if (!dev)
> > +               return -ENOMEM;
> > +
> > +       dev->parent = &pdev->dev;
> > +       dev->release = err_dev_release;
> > +       dev_set_name(dev, "errors");
> > +
> > +       fme_error_enable(feature);
> > +
> > +       ret = device_register(dev);
> > +       if (ret) {
> > +               put_device(dev);
> > +               return ret;
> > +       }
> > +
> > +       ret = sysfs_create_groups(&dev->kobj, error_groups);
> > +       if (ret) {
> > +               device_unregister(dev);
> > +               return ret;
> > +       }
> > +
> > +       feature->priv = dev;
> > +
> > +       return ret;
> > +}
> > +
> > +static void fme_global_err_uinit(struct platform_device *pdev,
> > +                                struct dfl_feature *feature)
> > +{
> > +       struct device *dev = feature->priv;
> > +
> > +       sysfs_remove_groups(&dev->kobj, error_groups);
> > +       device_unregister(dev);
> > +}
> > +
> > +const struct dfl_feature_id fme_global_err_id_table[] = {
> > +       {.id = FME_FEATURE_ID_GLOBAL_ERR,},
> > +       {0,}
> > +};
> > +
> > +const struct dfl_feature_ops fme_global_err_ops = {
> > +       .init = fme_global_err_init,
> > +       .uinit = fme_global_err_uinit,
> > +};
> > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > index dafa6580..76cb112 100644
> > --- a/drivers/fpga/dfl-fme-main.c
> > +++ b/drivers/fpga/dfl-fme-main.c
> > @@ -686,6 +686,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
> >                 .ops = &fme_power_mgmt_ops,
> >         },
> >         {
> > +               .id_table = fme_global_err_id_table,
> > +               .ops = &fme_global_err_ops,
> > +       },
> > +       {
> >                 .ops = NULL,
> >         },
> >  };
> > diff --git a/drivers/fpga/dfl-fme.h b/drivers/fpga/dfl-fme.h
> > index 7a021c4..5fbe3f5 100644
> > --- a/drivers/fpga/dfl-fme.h
> > +++ b/drivers/fpga/dfl-fme.h
> > @@ -37,5 +37,7 @@ struct dfl_fme {
> >
> >  extern const struct dfl_feature_ops fme_pr_mgmt_ops;
> >  extern const struct dfl_feature_id fme_pr_mgmt_id_table[];
> > +extern const struct dfl_feature_ops fme_global_err_ops;
> > +extern const struct dfl_feature_id fme_global_err_id_table[];
> >
> >  #endif /* __DFL_FME_H */
> > diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> > index fbc57f0..6c32080 100644
> > --- a/drivers/fpga/dfl.h
> > +++ b/drivers/fpga/dfl.h
> > @@ -197,12 +197,14 @@ struct dfl_feature_driver {
> >   *                 feature dev (platform device)'s reources.
> >   * @ioaddr: mapped mmio resource address.
> >   * @ops: ops of this sub feature.
> > + * @priv: priv data of this feature.
> >   */
> >  struct dfl_feature {
> >         u64 id;
> >         int resource_index;
> >         void __iomem *ioaddr;
> >         const struct dfl_feature_ops *ops;
> > +       void *priv;
> >  };
> >
> >  #define DEV_STATUS_IN_USE      0
> > --
> > 2.7.4
> >

Thanks for the review, will fix these things in the next version.

Hao

> 
> Thanks,
> Alan

^ permalink raw reply

* Re: [PATCH 11/17] fpga: dfl: afu: add error reporting support.
From: Wu Hao @ 2019-04-10  1:43 UTC (permalink / raw)
  To: Alan Tull; +Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Xu Yilun
In-Reply-To: <CANk1AXS_JcdOve_7rfdzg6hkGCHY9Yp7qz6UKVq8YsRiB_fdGg@mail.gmail.com>

On Tue, Apr 09, 2019 at 03:57:37PM -0500, Alan Tull wrote:
> On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> 
> Hi Hao,
> 
> >
> > Error reporting is one important private feature, it reports error
> > detected on port and accelerated function unit (AFU). It introduces
> > several sysfs interfaces to allow userspace to check and clear
> > errors detected by hardware.
> >
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-port |  29 +++
> >  drivers/fpga/Makefile                             |   1 +
> >  drivers/fpga/dfl-afu-error.c                      | 225 ++++++++++++++++++++++
> >  drivers/fpga/dfl-afu-main.c                       |   4 +
> >  drivers/fpga/dfl-afu.h                            |   4 +
> >  5 files changed, 263 insertions(+)
> >  create mode 100644 drivers/fpga/dfl-afu-error.c
> >
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > index f611e47..e6140aa 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > @@ -79,3 +79,32 @@ KernelVersion:       5.2
> >  Contact:       Wu Hao <hao.wu@intel.com>
> >  Description:   Read-only. Read this file to get the status of issued command
> >                 to userclck_freqcntrcmd.
> > +
> > +What:          /sys/bus/platform/devices/dfl-port.0/errors/errors
> > +Date:          March 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get errors detected on port and
> > +               Accelerated Function Unit (AFU).
> > +
> > +What:          /sys/bus/platform/devices/dfl-port.0/errors/first_error
> > +Date:          March 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get the first error detected by
> > +               hardware.
> > +
> > +What:          /sys/bus/platform/devices/dfl-port.0/errors/first_malformed_req
> > +Date:          March 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Read-only. Read this file to get the first malformed request
> > +               captured by hardware.
> > +
> > +What:          /sys/bus/platform/devices/dfl-port.0/errors/clear
> > +Date:          March 2019
> > +KernelVersion: 5.2
> > +Contact:       Wu Hao <hao.wu@intel.com>
> > +Description:   Write-only. Write error code to this file to clear errors. If
> > +               the input error code doesn't match, it returns -EBUSY error
> > +               code.
> 
> I understand how -EBUSY could be the right error code for when the
> hardware is in a state where the error can't be cleared.  But if the
> input error code doesn't match, shouldn't the code be -EINVAL?  Also
> as noted below, the way this is currently coded, -ETIMEDOUT could get
> returned.

Thanks for the comments, let me try to capture all possible error return
values in doc in the next version to avoid confusion.

> 
> > diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> > index c0dd4c8..f1f0af7 100644
> > --- a/drivers/fpga/Makefile
> > +++ b/drivers/fpga/Makefile
> > @@ -40,6 +40,7 @@ obj-$(CONFIG_FPGA_DFL_AFU)            += dfl-afu.o
> >
> >  dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
> >  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> > +dfl-afu-objs += dfl-afu-error.o
> >
> >  # Drivers for FPGAs which implement DFL
> >  obj-$(CONFIG_FPGA_DFL_PCI)             += dfl-pci.o
> > diff --git a/drivers/fpga/dfl-afu-error.c b/drivers/fpga/dfl-afu-error.c
> > new file mode 100644
> > index 0000000..b66bd4a
> > --- /dev/null
> > +++ b/drivers/fpga/dfl-afu-error.c
> > @@ -0,0 +1,225 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Driver for FPGA Accelerated Function Unit (AFU) Error Reporting
> > + *
> > + * Copyright 2019 Intel Corporation, Inc.
> > + *
> > + * Authors:
> > + *   Wu Hao <hao.wu@linux.intel.com>
> > + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> > + *   Joseph Grecco <joe.grecco@intel.com>
> > + *   Enno Luebbers <enno.luebbers@intel.com>
> > + *   Tim Whisonant <tim.whisonant@intel.com>
> > + *   Ananda Ravuri <ananda.ravuri@intel.com>
> > + *   Mitchel Henry <henry.mitchel@intel.com>
> > + */
> > +
> > +#include <linux/uaccess.h>
> > +
> > +#include "dfl-afu.h"
> > +
> > +#define PORT_ERROR_MASK                0x8
> > +#define PORT_ERROR             0x10
> > +#define PORT_FIRST_ERROR       0x18
> > +#define PORT_MALFORMED_REQ0    0x20
> > +#define PORT_MALFORMED_REQ1    0x28
> > +
> > +#define ERROR_MASK             GENMASK_ULL(63, 0)
> > +
> > +/* mask or unmask port errors by the error mask register. */
> > +static void __port_err_mask(struct device *dev, bool mask)
> > +{
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +       writeq(mask ? ERROR_MASK : 0, base + PORT_ERROR_MASK);
> > +}
> > +
> > +/* clear port errors. */
> > +static int __port_err_clear(struct device *dev, u64 err)
> > +{
> > +       struct platform_device *pdev = to_platform_device(dev);
> > +       void __iomem *base_err, *base_hdr;
> > +       int ret;
> > +       u64 v;
> > +
> > +       base_err = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +       base_hdr = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +       /*
> > +        * clear Port Errors
> > +        *
> > +        * - Check for AP6 State
> > +        * - Halt Port by keeping Port in reset
> > +        * - Set PORT Error mask to all 1 to mask errors
> > +        * - Clear all errors
> > +        * - Set Port mask to all 0 to enable errors
> > +        * - All errors start capturing new errors
> > +        * - Enable Port by pulling the port out of reset
> > +        */
> > +
> > +       /* if device is still in AP6 power state, can not clear any error. */
> > +       v = readq(base_hdr + PORT_HDR_STS);
> > +       if (FIELD_GET(PORT_STS_PWR_STATE, v) == PORT_STS_PWR_STATE_AP6) {
> > +               dev_err(dev, "Could not clear errors, device in AP6 state.\n");
> > +               return -EBUSY;
> > +       }
> > +
> > +       /* Halt Port by keeping Port in reset */
> > +       ret = __port_disable(pdev);
> > +       if (ret)
> > +               return ret;
> 
> __port_disable can return -ETIMEDOUT which will then get returned from
> clear_store.  The sysfs document only talks about -EBUSY. You could
> either document -ETIMEDOUT in the sysfs doc or you could change the
> code to adjust the returned error code.

Yes, agree.

> 
> > +
> > +       /* Mask all errors */
> > +       __port_err_mask(dev, true);
> > +
> > +       /* Clear errors if err input matches with current port errors.*/
> > +       v = readq(base_err + PORT_ERROR);
> > +
> > +       if (v == err) {
> > +               writeq(v, base_err + PORT_ERROR);
> > +
> > +               v = readq(base_err + PORT_FIRST_ERROR);
> > +               writeq(v, base_err + PORT_FIRST_ERROR);
> > +       } else {
> > +               ret = -EBUSY;
> > +       }
> > +
> > +       /* Clear mask */
> > +       __port_err_mask(dev, false);
> > +
> > +       /* Enable the Port by clear the reset */
> > +       __port_enable(pdev);
> > +
> > +       return ret;
> > +}
> > +
> > +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> > +                            char *buf)
> > +{
> > +       void __iomem *base;
> > +
> > +       base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +       return scnprintf(buf, PAGE_SIZE, "%u\n", dfl_feature_revision(base));
> > +}
> > +static DEVICE_ATTR_RO(revision);
> 
> This appears to be adding a
> /sys/bus/platform/devices/dfl-port.0/errors/revision attribute that
> isn't documented in the sysfs document.

Sorry, will fix all above issues in the next version.

Thanks again for the code review and comments.

Hao

^ permalink raw reply

* [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
From: Aubrey Li @ 2019-04-10  1:53 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li

The architecture specific information of the running processes could
be useful to the userland. Add support to examine process architecture
specific information externally.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/array.c         | 5 +++++
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2edbb657f859..331592a61718 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_status(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..bf4328cb58ed 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -74,6 +74,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    proc_write_t write,
 						    void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
+/* Add support for architecture specific output in /proc/pid/status */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v14 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
From: Aubrey Li @ 2019-04-10  1:53 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li
In-Reply-To: <20190410015326.24500-1-aubrey.li@linux.intel.com>

AVX-512 components use could cause core turbo frequency drop. So
it's useful to expose AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:      4
AVX512_elapsed_ms:      8
AVX512_elapsed_ms:      4

This means that 4 milliseconds have elapsed since the AVX512 usage
of tensorflow task was detected when the task was scheduled out.

Or:
$ cat /proc/tid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:      -1

The number '-1' indicates that no AVX512 usage recorded before
thus the task unlikely has frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 arch/x86/kernel/fpu/xstate.c | 42 ++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..5e55ed9584ab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH v14 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
From: Aubrey Li @ 2019-04-10  1:53 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li
In-Reply-To: <20190410015326.24500-1-aubrey.li@linux.intel.com>

Added AVX512_elapsed_ms in /proc/<pid>/status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/filesystems/proc.txt | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:       thread vulnerable
   voluntary_ctxt_switches:        0
   nonvoluntary_ctxt_switches:     1
+  AVX512_elapsed_ms:	8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. To see a precise
 snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..............................................................................
  Field                       Content
  Name                        filename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms           time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ ------------------
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+    1) The time which the task spent on the CPU without being scheduled
+       out. With CPU isolation and a single runnable task this can take
+       several seconds.
+
+    2) The time since the task was scheduled out last. Depending on the
+       reason for being scheduled out (time slice exhausted, syscall ...)
+       this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
+
 ..............................................................................
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
From: Andy Lutomirski @ 2019-04-10  1:58 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Andi Kleen, Tim Chen, Dave Hansen, Arjan van de Ven,
	Alexey Dobriyan, Andrew Morton, aubrey.li, Linux API, LKML
In-Reply-To: <20190410015326.24500-1-aubrey.li@linux.intel.com>

On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
> The architecture specific information of the running processes could
> be useful to the userland. Add support to examine process architecture
> specific information externally.
>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Linux API <linux-api@vger.kernel.org>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/proc/array.c         | 5 +++++
>  include/linux/proc_fs.h | 2 ++
>  2 files changed, 7 insertions(+)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 2edbb657f859..331592a61718 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>         seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>
> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
> +{
> +}

This pointlessly bloats other architectures.  Do this instead in an
appropriate header:

#ifndef arch_proc_pid_status
static inline void arch_proc_pid_status(...)
{
}
#endif

Or add /proc/PID/x86_status, which sounds better in most respects to me.

^ permalink raw reply

* Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-10  2:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Andi Kleen, Tim Chen, Dave Hansen, Arjan van de Ven,
	Alexey Dobriyan, Andrew Morton, aubrey.li, Linux API, LKML
In-Reply-To: <CALCETrWV9SVDGQmzvZuub4aOPGyVqx7gPdk6D0s2_7j471jT_g@mail.gmail.com>

On 2019/4/10 9:58, Andy Lutomirski wrote:
> On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>
>> The architecture specific information of the running processes could
>> be useful to the userland. Add support to examine process architecture
>> specific information externally.
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>> Cc: Linux API <linux-api@vger.kernel.org>
>> Cc: Alexey Dobriyan <adobriyan@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  fs/proc/array.c         | 5 +++++
>>  include/linux/proc_fs.h | 2 ++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index 2edbb657f859..331592a61718 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>         seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>  }
>>
>> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
>> +{
>> +}
> 
> This pointlessly bloats other architectures.  Do this instead in an
> appropriate header:
> 
> #ifndef arch_proc_pid_status
> static inline void arch_proc_pid_status(...)
> {
> }
> #endif
> 

I saw a bunch of similar weak functions, is it not acceptable?

fs/proc$ grep weak *.c
cpuinfo.c:__weak void arch_freq_prepare_all(void)
meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
vmcore.c:void __weak elfcorehdr_free(unsigned long long addr)
vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
vmcore.c:ssize_t __weak

> Or add /proc/PID/x86_status, which sounds better in most respects to me.
> 

I didn't figure out how to make /proc/PID/x86_status invisible to other
architectures in an appropriate way, do you have any suggestions?

Thanks,
-Aubrey

^ permalink raw reply

* Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
From: Andy Lutomirski @ 2019-04-10  2:25 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Andi Kleen, Tim Chen, Dave Hansen,
	Arjan van de Ven, Alexey Dobriyan, Andrew Morton, aubrey.li,
	Linux API, LKML
In-Reply-To: <89957461-ec42-f041-a7e9-2452b2e83d39@linux.intel.com>

On Tue, Apr 9, 2019 at 7:20 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2019/4/10 9:58, Andy Lutomirski wrote:
> > On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >>
> >> The architecture specific information of the running processes could
> >> be useful to the userland. Add support to examine process architecture
> >> specific information externally.
> >>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Andi Kleen <ak@linux.intel.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Cc: Dave Hansen <dave.hansen@intel.com>
> >> Cc: Arjan van de Ven <arjan@linux.intel.com>
> >> Cc: Linux API <linux-api@vger.kernel.org>
> >> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >>  fs/proc/array.c         | 5 +++++
> >>  include/linux/proc_fs.h | 2 ++
> >>  2 files changed, 7 insertions(+)
> >>
> >> diff --git a/fs/proc/array.c b/fs/proc/array.c
> >> index 2edbb657f859..331592a61718 100644
> >> --- a/fs/proc/array.c
> >> +++ b/fs/proc/array.c
> >> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
> >>         seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
> >>  }
> >>
> >> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
> >> +{
> >> +}
> >
> > This pointlessly bloats other architectures.  Do this instead in an
> > appropriate header:
> >
> > #ifndef arch_proc_pid_status
> > static inline void arch_proc_pid_status(...)
> > {
> > }
> > #endif
> >
>
> I saw a bunch of similar weak functions, is it not acceptable?
>
> fs/proc$ grep weak *.c
> cpuinfo.c:__weak void arch_freq_prepare_all(void)
> meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
> vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
> vmcore.c:void __weak elfcorehdr_free(unsigned long long addr)
> vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
> vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> vmcore.c:ssize_t __weak

I think they're acceptable, but I don't personally like them.

>
> > Or add /proc/PID/x86_status, which sounds better in most respects to me.
> >
>
> I didn't figure out how to make /proc/PID/x86_status invisible to other
> architectures in an appropriate way, do you have any suggestions?

#ifdef CONFIG_X86?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox