Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH V33 24/30] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-06-28 18:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stephen Smalley, James Morris, linux-security, LKML, Linux API,
	David Howells, Alexei Starovoitov, Network Development,
	Chun-Yi Lee, Daniel Borkmann, LSM List
In-Reply-To: <CALCETrXwt43w6rQY6zt0J_3HOaad=+E5PushJNdSOZDBuaYV+Q@mail.gmail.com>

On Thu, Jun 27, 2019 at 4:27 PM Andy Lutomirski <luto@kernel.org> wrote:
> They're really quite similar in my mind.  Certainly some things in the
> "integrity" category give absolutely trivial control over the kernel
> (e.g. modules) while others make it quite challenging (ioperm), but
> the end result is very similar.  And quite a few "confidentiality"
> things genuinely do allow all kernel memory to be read.
>
> I agree that finer-grained distinctions could be useful. My concern is
> that it's a tradeoff, and the other end of the tradeoff is an ABI
> stability issue.  If someone decides down the road that some feature
> that is currently "integrity" can be split into a narrow "integrity"
> feature and a "confidentiality" feature then, if the user policy knows
> about the individual features, there's a risk of breaking people's
> systems.  If we keep the fine-grained control, do we have a clear
> compatibility story?

My preference right now is to retain the fine-grained aspect of things
in the internal API, simply because it'll be more annoying to add it
back later if we want to. I don't want to expose it via the Lockdown
user facing API for the reasons you've described, but it's not
impossible that another LSM would find a way to do this reasonably.
Does it seem reasonable to punt this discussion out to the point where
another LSM tries to do something with this information, based on the
implementation they're attempting?

^ permalink raw reply

* [RFC PATCH 1/3] mm: Introduce VM_IBT for CET legacy code bitmap
From: Yu-cheng Yu @ 2019-06-28 19:41 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu

The previous discussion of the IBT legacy code bitmap is here:

    https://lkml.org/lkml/2019/6/6/1032

When CET Indirect Branch Tracking (IBT) is enabled, the processor expects
every branch target is an ENDBR instruction, or the target's address is
marked as legacy in the legacy code bitmap.  The bitmap covers the whole
user-mode address space (TASK_SIZE_MAX for 64-bit, TASK_SIZE for IA32),
and each bit represents one page of linear address range.

This patch introduces VM_IBT for the bitmap.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 fs/proc/task_mmu.c | 3 +++
 include/linux/mm.h | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 66725e262a77..d707390285d3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -663,6 +663,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 #ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
 		[ilog2(VM_SHSTK)]	= "ss",
+#endif
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+		[ilog2(VM_IBT)]		= "bt",
 #endif
 	};
 	size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 921bae5fa7ab..a8da5bdfd7c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,12 +299,14 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_6	38	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_6	BIT(VM_HIGH_ARCH_BIT_6)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -348,6 +350,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_SHSTK	VM_NONE
 #endif
 
+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+# define VM_IBT		VM_HIGH_ARCH_6
+#else
+# define VM_IBT		VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
-- 
2.17.1

^ permalink raw reply related

* [RFC PATCH 2/3] Introduce arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE)
From: Yu-cheng Yu @ 2019-06-28 19:41 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190628194158.2431-1-yu-cheng.yu@intel.com>

The CET legacy code bitmap covers the whole user-mode address space and is
located at the top of the user-mode address space.  It is allocated only
when the first time arch_prctl(ARCH_X86_MARK_LEGACY_CODE) is called from
an application.

Introduce:

arch_prctl(ARCH_X86_MARK_LEGACY_CODE, unsigned long *buf)
    Mark an address range as IBT legacy code.

    *buf: starting linear address
    *(buf + 1): size of the legacy code
    *(buf + 2): set (1); clear (0)

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h        |   3 +
 arch/x86/include/asm/processor.h  |  13 +++-
 arch/x86/include/uapi/asm/prctl.h |   1 +
 arch/x86/kernel/Makefile          |   2 +-
 arch/x86/kernel/cet_bitmap.c      | 119 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cet_prctl.c       |  15 ++++
 6 files changed, 151 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cet_bitmap.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 9e613a6598c9..8ca497850f4a 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -4,6 +4,7 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+#include <asm/processor.h>
 
 struct task_struct;
 struct sc_ext;
@@ -32,6 +33,7 @@ int cet_restore_signal(bool ia32, struct sc_ext *sc);
 int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
 int cet_setup_ibt(void);
 int cet_setup_ibt_bitmap(unsigned long bitmap, unsigned long size);
+int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set);
 void cet_disable_ibt(void);
 #else
 static inline int prctl_cet(int option, unsigned long arg2) { return -EINVAL; }
@@ -44,6 +46,7 @@ static inline int cet_restore_signal(bool ia32, struct sc_ext *sc) { return -EIN
 static inline int cet_setup_signal(bool ia32, unsigned long rstor,
 				   struct sc_ext *sc) { return -EINVAL; }
 static inline int cet_setup_ibt(void) { return -EINVAL; }
+static inline int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set) { return -EINVAL; }
 static inline void cet_disable_ibt(void) {}
 #endif
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2ae7c1bf4e43..f4600157c73d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -884,7 +884,18 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE_LOW
+#define MMAP_MAX		(unsigned long)(test_thread_flag(TIF_ADDR32) ? \
+					TASK_SIZE : TASK_SIZE_MAX)
+
+#define IBT_BITMAP_SIZE		(round_up(MMAP_MAX, PAGE_SIZE * BITS_PER_BYTE) / \
+					(PAGE_SIZE * BITS_PER_BYTE))
+
+#define IBT_BITMAP_ADDR		(TASK_SIZE - IBT_BITMAP_SIZE)
+
+#define STACK_TOP		(TASK_SIZE_LOW < IBT_BITMAP_ADDR - PAGE_SIZE ? \
+					TASK_SIZE_LOW : \
+					IBT_BITMAP_ADDR - PAGE_SIZE)
+
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5eb9aeb5c662..5f670e70dc00 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,5 +20,6 @@
 #define ARCH_X86_CET_ALLOC_SHSTK	0x3004
 #define ARCH_X86_CET_GET_LEGACY_BITMAP	0x3005 /* deprecated */
 #define ARCH_X86_CET_SET_LEGACY_BITMAP	0x3006
+#define ARCH_X86_CET_MARK_LEGACY_CODE	0x3007
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d908c95306fc..754dde1bf9ac 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -140,7 +140,7 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
-obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o
+obj-$(CONFIG_X86_INTEL_CET)		+= cet.o cet_prctl.o cet_bitmap.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/cet_bitmap.c b/arch/x86/kernel/cet_bitmap.c
new file mode 100644
index 000000000000..6cb7ac2f66f7
--- /dev/null
+++ b/arch/x86/kernel/cet_bitmap.c
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/bits.h>
+#include <asm/fpu/internal.h>
+#include <asm/cet.h>
+#include <linux/pagemap.h>
+#include <linux/err.h>
+#include <asm/vdso.h>
+
+static int alloc_bitmap(void)
+{
+	unsigned long addr;
+	u64 msr_ia32_u_cet;
+
+	addr = do_mmap_locked(NULL, IBT_BITMAP_ADDR, IBT_BITMAP_SIZE,
+			      PROT_READ | PROT_WRITE,
+			      MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE,
+			      VM_IBT | VM_NORESERVE, NULL);
+
+	if (IS_ERR((void *)addr))
+		return addr;
+
+	current->thread.cet.ibt_bitmap_addr = addr;
+	current->thread.cet.ibt_bitmap_size = IBT_BITMAP_SIZE;
+
+	modify_fpu_regs_begin();
+	rdmsrl(MSR_IA32_U_CET, msr_ia32_u_cet);
+	msr_ia32_u_cet |= (MSR_IA32_CET_LEG_IW_EN | addr);
+	wrmsrl(MSR_IA32_U_CET, msr_ia32_u_cet);
+	modify_fpu_regs_end();
+	return 0;
+}
+
+static int set_user_bits(unsigned long __user *buf, unsigned long buf_size,
+			 unsigned long start_bit, unsigned long end_bit, unsigned long set)
+{
+	unsigned long start_ul, end_ul, total_ul;
+	int i, j, r;
+
+	if (round_up(end_bit, BITS_PER_BYTE) / BITS_PER_BYTE > buf_size)
+		end_bit = buf_size * BITS_PER_BYTE - 1;
+
+	start_ul = start_bit / BITS_PER_LONG;
+	end_ul = end_bit / BITS_PER_LONG;
+	total_ul = (end_ul - start_ul + 1);
+
+	i = start_bit % BITS_PER_LONG;
+	j = end_bit % BITS_PER_LONG;
+
+	r = 0;
+	put_user_try {
+		unsigned long tmp;
+		unsigned long x;
+
+		if (total_ul == 1) {
+			get_user_ex(tmp, &buf[start_ul]);
+
+			if (set != 0)
+				tmp |= GENMASK(j, i);
+			else
+				tmp &= ~GENMASK(j, i);
+
+			put_user_ex(tmp, &buf[start_ul]);
+		} else {
+			get_user_ex(tmp, &buf[start_ul]);
+
+			if (set != 0)
+				tmp |= GENMASK(BITS_PER_LONG - 1, i);
+			else
+				tmp &= ~GENMASK(BITS_PER_LONG - 1, i);
+
+			put_user_ex(tmp, &buf[start_ul]);
+
+			get_user_ex(tmp, &buf[end_ul]);
+
+			if (set != 0)
+				tmp |= GENMASK(j, 0);
+			else
+				tmp &= ~GENMASK(j, 0);
+
+			put_user_ex(tmp, &buf[end_ul]);
+
+			if (set != 0) {
+				for (x = start_ul + 1; x < end_ul; x++)
+					put_user_ex(~0UL, &buf[x]);
+			} else {
+				for (x = start_ul + 1; x < end_ul; x++)
+					put_user_ex(0UL, &buf[x]);
+			}
+		}
+	} put_user_catch(r);
+
+	return r;
+}
+
+int cet_mark_legacy_code(unsigned long addr, unsigned long size, unsigned long set)
+{
+	unsigned long bitmap_addr, bitmap_size;
+	int r;
+
+	if (!current->thread.cet.ibt_enabled)
+		return -EINVAL;
+
+	if (current->thread.cet.ibt_bitmap_size == 0) {
+		r = alloc_bitmap();
+		if (r)
+			return r;
+	}
+
+	bitmap_addr = current->thread.cet.ibt_bitmap_addr;
+	bitmap_size = current->thread.cet.ibt_bitmap_size;
+
+	r = set_user_bits((unsigned long * __user)bitmap_addr, bitmap_size,
+			  addr / PAGE_SIZE, (addr + size - 1) / PAGE_SIZE, set);
+
+	return r;
+}
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
index b7f37bbc0dd3..b2b7f462482f 100644
--- a/arch/x86/kernel/cet_prctl.c
+++ b/arch/x86/kernel/cet_prctl.c
@@ -68,6 +68,18 @@ static int handle_bitmap(unsigned long arg2)
 	return cet_setup_ibt_bitmap(addr, size);
 }
 
+static int handle_mark_legacy_code(unsigned long arg2)
+{
+	unsigned long addr, size, set;
+
+	if (get_user(addr, (unsigned long __user *)arg2) ||
+	    get_user(size, (unsigned long __user *)arg2 + 1) ||
+	    get_user(set, (unsigned long __user *)arg2 + 2))
+		return -EFAULT;
+
+	return cet_mark_legacy_code(addr, size, set);
+}
+
 int prctl_cet(int option, unsigned long arg2)
 {
 	if (!cpu_x86_cet_enabled())
@@ -100,6 +112,9 @@ int prctl_cet(int option, unsigned long arg2)
 	case ARCH_X86_CET_SET_LEGACY_BITMAP:
 		return handle_bitmap(arg2);
 
+	case ARCH_X86_CET_MARK_LEGACY_CODE:
+		return handle_mark_legacy_code(arg2);
+
 	default:
 		return -EINVAL;
 	}
-- 
2.17.1

^ permalink raw reply related

* [RFC PATCH 3/3] Prevent user from writing to IBT bitmap.
From: Yu-cheng Yu @ 2019-06-28 19:41 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
  Cc: Yu-cheng Yu
In-Reply-To: <20190628194158.2431-1-yu-cheng.yu@intel.com>

The IBT bitmap is visiable from user-mode, but not writable.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>

---
 arch/x86/mm/fault.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 59f4f66e4f2e..231196abb62e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1454,6 +1454,13 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * we can handle it..
 	 */
 good_area:
+#define USER_MODE_WRITE (FAULT_FLAG_WRITE | FAULT_FLAG_USER)
+	if (((flags & USER_MODE_WRITE)  == USER_MODE_WRITE) &&
+	    (vma->vm_flags & VM_IBT)) {
+		bad_area_access_error(regs, hw_error_code, address, vma);
+		return;
+	}
+
 	if (unlikely(access_error(hw_error_code, vma))) {
 		bad_area_access_error(regs, hw_error_code, address, vma);
 		return;
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH v5 17/18] kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
From: Luis Chamberlain @ 2019-06-28 21:37 UTC (permalink / raw)
  To: Brendan Higgins, Dominik Brodowski
  Cc: Petr Mladek, open list:DOCUMENTATION, Peter Zijlstra,
	Amir Goldstein, dri-devel, Sasha Levin, Masahiro Yamada,
	Michael Ellerman, open list:KERNEL SELFTEST FRAMEWORK, shuah,
	Rob Herring, linux-nvdimm, Frank Rowand, Knut Omang,
	Kieran Bingham, wfg-VuQAYsv1563Yd54FQh9/CA,
	Michael Kerrisk (man-pages), David Rientjes, Iurii Zaikin,
	Jeff Dike, Dan Carpenter, Joel Stanley <joel>
In-Reply-To: <CAFd5g45VJ9yfuESUc=E0ydJyN+mk1b1kyHSCYvO2x9KPC7+3GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Jun 28, 2019 at 01:01:54AM -0700, Brendan Higgins wrote:
> On Wed, Jun 26, 2019 at 11:10 PM Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >
> > On Wed, Jun 26, 2019 at 09:07:43PM -0700, Iurii Zaikin wrote:
> > > On Tue, Jun 25, 2019 at 7:17 PM Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > > > > +static void sysctl_test_dointvec_table_maxlen_unset(struct kunit *test)
> > > > > +{
> > > > > +     struct ctl_table table = {
> > > > > +             .procname = "foo",
> > > > > +             .data           = &test_data.int_0001,
> > > > > +             .maxlen         = 0,
> > > > > +             .mode           = 0644,
> > > > > +             .proc_handler   = proc_dointvec,
> > > > > +             .extra1         = &i_zero,
> > > > > +             .extra2         = &i_one_hundred,
> > > > > +     };
> > > > > +     void  *buffer = kunit_kzalloc(test, sizeof(int), GFP_USER);
> > > > > +     size_t len;
> > > > > +     loff_t pos;
> > > > > +
> > > > > +     len = 1234;
> > > > > +     KUNIT_EXPECT_EQ(test, 0, proc_dointvec(&table, 0, buffer, &len, &pos));
> > > > > +     KUNIT_EXPECT_EQ(test, (size_t)0, len);
> > > > > +     len = 1234;
> > > > > +     KUNIT_EXPECT_EQ(test, 0, proc_dointvec(&table, 1, buffer, &len, &pos));
> > > > > +     KUNIT_EXPECT_EQ(test, (size_t)0, len);
> > > > > +}
> > > >
> > > > In a way this is also testing for general kernel API changes. This is and the
> > > > last one were good examples. So this is not just testing functionality
> > > > here. There is no wrong or write answer if 0 or -EINVAL was returned
> > > > other than the fact that we have been doing this for years.
> > > >
> > > > Its a perhaps small but important difference for some of these tests.  I
> > > > *do* think its worth clarifying through documentation which ones are
> > > > testing for API consistency Vs proper correctness.
> > >
> > > You make a good point that the test codifies the existing behavior of
> > > the function in lieu of formal documentation.  However, the test cases
> > > were derived from examining the source code of the function under test
> > > and attempting to cover all branches. The assertions were added only
> > > for the values that appeared to be set deliberately in the
> > > implementation. And it makes sense to me to test that the code does
> > > exactly what the implementation author intended.
> >
> > I'm not arguing against adding them. I'm suggesting that it is different
> > to test for API than for correctness of intended functionality, and
> > it would be wise to make it clear which test cases are for API and which
> > for correctness.
> 
> I see later on that some of the API stuff you are talking about is
> public APIs from the standpoint of user (outside of LInux) visible.

Right, UAPI.

> To
> be clear, is that what you mean by public APIs throughout, or would
> you distinguish between correctness tests, internal API tests, and
> external API tests?

I would definitely recommend distingishing between all of these.
Kernel API (or just call it API), UAPI, and correctness.

> > This will come up later for other kunit tests and it would be great
> > to set precendent so that other kunit tests can follow similar
> > practices to ensure its clear what is API realted Vs correctness of
> > intended functionality.
> >
> > In fact, I'm not yet sure if its possible to test public kernel API to
> > userspace with kunit, but if it is possible... well, that could make
> > linux-api folks happy as they could enable us to codify interpreation of
> > what is expected into kunit test cases, and we'd ensure that the
> > codified interpretation is not only documented in man pages but also
> > through formal kunit test cases.
> >
> > A regression in linux-api then could be formalized through a proper
> > kunit tests case. And if an API evolves, it would force developers to
> > update the respective kunit which codifies that contract.
> 
> Yep, I think that is long term hope. Some of the file system interface
> stuff that requires a filesystem to be mounted somewhere might get a
> little weird/difficult, but I suspect we should be able to do it
> eventually. I mean it's all just C code right? Should mostly boil down
> to someone figuring out how to do it the first time.

There used to be hacks in the kernel the call syscalls in a few places.
This was cleaned up and addressed via Dominik Brodowski's series last
year in March:

http://lkml.kernel.org/r/20180325162527.GA17492-SGhQLRGLuNwb6pqDj42GsMgv3T4z79SOrE5yTffgRl4@public.gmane.org

An example commit: d300b610812f3 ("kernel: use kernel_wait4() instead of
sys_wait4()").

So it would seem the work is done, and you'd just have to use the
respective exposed kernel_syscallname() calls, or add some if you
want to test a specific syscall in kernel space.

  Luis

^ permalink raw reply

* Re: [RFC PATCH 1/3] mm: Introduce VM_IBT for CET legacy code bitmap
From: Andy Lutomirski @ 2019-06-28 21:49 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel
In-Reply-To: <20190628194158.2431-1-yu-cheng.yu@intel.com>



> On Jun 28, 2019, at 12:41 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
> The previous discussion of the IBT legacy code bitmap is here:
> 
>    https://lkml.org/lkml/2019/6/6/1032
> 
> When CET Indirect Branch Tracking (IBT) is enabled, the processor expects
> every branch target is an ENDBR instruction, or the target's address is
> marked as legacy in the legacy code bitmap.  The bitmap covers the whole
> user-mode address space (TASK_SIZE_MAX for 64-bit, TASK_SIZE for IA32),
> and each bit represents one page of linear address range.
> 
> This patch introduces VM_IBT for the bitmap.

There’s no need to allocate a bit for this and to clutter up the fault code with special cases. Use _install_special_mapping(), please.  If you need to make it more flexible to cover your use case, please do so.

^ permalink raw reply

* Re: [PATCH v4 3/3] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-06-29  0:33 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-hwmon, jdelvare,
	atull, gregkh, Luwei Kang, Xu Yilun
In-Reply-To: <20190628175514.GB25890@roeck-us.net>

On Fri, Jun 28, 2019 at 10:55:14AM -0700, Guenter Roeck wrote:
> On Thu, Jun 27, 2019 at 12:53:38PM +0800, Wu Hao wrote:
> > This patch adds support for power management private feature under
> > FPGA Management Engine (FME). This private feature driver registers
> > a hwmon for power (power1_input), thresholds information, e.g.
> > (power1_max / crit / max_alarm / crit_alarm) and also read-only sysfs
> > interfaces for other power management information. For configuration,
> > user could write threshold values via above power1_max / crit sysfs
> > interface under hwmon too.
> > 
> > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > ---
> > v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
> >     move all sysfs interfaces under hwmon
> >         consumed          --> hwmon power1_input
> >         threshold1        --> hwmon power1_cap
> >         threshold2        --> hwmon power1_crit
> >         threshold1_status --> hwmon power1_cap_status
> >         threshold2_status --> hwmon power1_crit_status
> >         xeon_limit        --> hwmon power1_xeon_limit
> >         fpga_limit        --> hwmon power1_fpga_limit
> 
> How do those limits differ from the other limits ?
> We do have powerX_cap and powerX_cap_max, and from the context
> it appears that you could possibly at least use power1_cap_max
> (and power1_cap instead of power1_max) instead of
> power1_fpga_limit.

Thanks a lot for the review and comments.

Actually xeon/fpga_limit are introduced for different purpose. It shows
the power limit of CPU and FPGA, that may be useful in some integrated
solution, e.g. CPU and FPGA shares power. We should never these
interfaces as throttling thresholds.

> 
> >         ltr               --> hwmon power1_ltr
> > v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
> > 	power1_cap         --> power1_max
> > 	power1_cap_status  --> power1_max_alarm
> > 	power1_crit_status --> power1_crit_alarm
> 
> power1_cap is standard ABI, and since the value is enforced by HW,
> it should be usable.

As you see, in thermal management, threshold1 and threshold2 are
mapped to temp1_max_alarm and temp1_crit_alarm. So we feel that if
it will be friendly to user that we keep using max_alarm and crit_alarm
in power management for threshold1 and threshold2 too.

Do you think if we can keep this, or it's better to switch back to
power1_cap?


> 
> >     update sysfs doc for above sysfs interface changes.
> >     replace scnprintf with sprintf in sysfs interface.
> > v4: use HWMON_CHANNEL_INFO.
> >     update date in sysfs doc.
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  67 +++++++
> >  drivers/fpga/dfl-fme-main.c                      | 221 +++++++++++++++++++++++
> >  2 files changed, 288 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > index 2cd17dc..a669548 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > @@ -127,6 +127,7 @@ Contact:	Wu Hao <hao.wu@intel.com>
> >  Description:	Read-Only. Read this file to get the name of hwmon device, it
> >  		supports values:
> >  		    'dfl_fme_thermal' - thermal hwmon device name
> > +		    'dfl_fme_power'   - power hwmon device name
> >  
> >  What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
> >  Date:		June 2019
> > @@ -183,3 +184,69 @@ Description:	Read-Only. Read this file to get the policy of hardware threshold1
> >  		(see 'temp1_max'). It only supports two values (policies):
> >  		    0 - AP2 state (90% throttling)
> >  		    1 - AP1 state (50% throttling)
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-Only. It returns current FPGA power consumption in uW.
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-Write. Read this file to get current hardware power
> > +		threshold1 in uW. If power consumption rises at or above
> > +		this threshold, hardware starts 50% throttling.
> > +		Write this file to set current hardware power threshold1 in uW.
> > +		As hardware only accepts values in Watts, so input value will
> > +		be round down per Watts (< 1 watts part will be discarded).
> > +		Write fails with -EINVAL if input parsing fails or input isn't
> > +		in the valid range (0 - 127000000 uW).
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-Write. Read this file to get current hardware power
> > +		threshold2 in uW. If power consumption rises at or above
> > +		this threshold, hardware starts 90% throttling.
> > +		Write this file to set current hardware power threshold2 in uW.
> > +		As hardware only accepts values in Watts, so input value will
> > +		be round down per Watts (< 1 watts part will be discarded).
> > +		Write fails with -EINVAL if input parsing fails or input isn't
> > +		in the valid range (0 - 127000000 uW).
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. It returns 1 if power consumption is currently at or
> > +		above hardware threshold1 (see 'power1_max'), otherwise 0.
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. It returns 1 if power consumption is currently at or
> > +		above hardware threshold2 (see 'power1_crit'), otherwise 0.
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-Only. It returns power limit for XEON in uW.
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-Only. It returns power limit for FPGA in uW.
> > +
> > +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
> > +Date:		June 2019
> > +KernelVersion:	5.3
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get current Latency Tolerance
> > +		Reporting (ltr) value. This ltr impacts the CPU low power
> > +		state in integrated solution.
> 
> Does that attribute add any value without any kind of unit or an explanation
> of its meaning ? What is userspace supposed to do with that information ?
> Without context, it is just a meaningless number.

I should add more description here, will fix it in the next version.

> 
> Also, it appears that the information is supposed to be passed to power
> management via the set_latency_tolerance() callback. If so, it would be
> reported there. Would it possibly make more sense to use that interface ?

If I remember correctly set_latency_tolerance is used to communicate a tolerance
to device, but actually this is a read-only value, to indicate latency tolerance
requirement for memory access from FPGA device, as you know FPGA could be
programmed for different workloads, and different workloads may have different
latency requirements, if workloads in FPGA don't have any need for immediate
memory access, then it would be safe for deeper sleep state of system memory.

> 
> > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > index 59ff9f1..9225b68 100644
> > --- a/drivers/fpga/dfl-fme-main.c
> > +++ b/drivers/fpga/dfl-fme-main.c
> > @@ -400,6 +400,223 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
> >  	.uinit = fme_thermal_mgmt_uinit,
> >  };
> >  
> > +#define FME_PWR_STATUS		0x8
> > +#define FME_LATENCY_TOLERANCE	BIT_ULL(18)
> > +#define PWR_CONSUMED		GENMASK_ULL(17, 0)
> > +
> > +#define FME_PWR_THRESHOLD	0x10
> > +#define PWR_THRESHOLD1		GENMASK_ULL(6, 0)	/* in Watts */
> > +#define PWR_THRESHOLD2		GENMASK_ULL(14, 8)	/* in Watts */
> > +#define PWR_THRESHOLD_MAX	0x7f			/* in Watts */
> > +#define PWR_THRESHOLD1_STATUS	BIT_ULL(16)
> > +#define PWR_THRESHOLD2_STATUS	BIT_ULL(17)
> > +
> > +#define FME_PWR_XEON_LIMIT	0x18
> > +#define XEON_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
> > +#define XEON_PWR_EN		BIT_ULL(15)
> > +#define FME_PWR_FPGA_LIMIT	0x20
> > +#define FPGA_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
> > +#define FPGA_PWR_EN		BIT_ULL(15)
> > +
> > +#define PWR_THRESHOLD_MAX_IN_UW (PWR_THRESHOLD_MAX * 1000000)
> > +
> > +static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> > +			    u32 attr, int channel, long *val)
> > +{
> > +	struct dfl_feature *feature = dev_get_drvdata(dev);
> > +	u64 v;
> > +
> > +	switch (attr) {
> > +	case hwmon_power_input:
> > +		v = readq(feature->ioaddr + FME_PWR_STATUS);
> > +		*val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
> > +		break;
> > +	case hwmon_power_max:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		*val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
> > +		break;
> > +	case hwmon_power_crit:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		*val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
> > +		break;
> > +	case hwmon_power_max_alarm:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		*val = (long)FIELD_GET(PWR_THRESHOLD1_STATUS, v);
> > +		break;
> > +	case hwmon_power_crit_alarm:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		*val = (long)FIELD_GET(PWR_THRESHOLD2_STATUS, v);
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
> > +			     u32 attr, int channel, long val)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> > +	struct dfl_feature *feature = dev_get_drvdata(dev);
> > +	int ret = 0;
> > +	u64 v;
> > +
> > +	if (val < 0 || val > PWR_THRESHOLD_MAX_IN_UW)
> > +		return -EINVAL;
> 
> We usually use clamp_val() in such cases because there is no useful means
> for the user to know the valid range.

Sure. Will fix this in the next version. Thanks for the comments.

Thanks
Hao

> 
> > +
> > +	val = val / 1000000;
> > +
> > +	mutex_lock(&pdata->lock);
> > +
> > +	switch (attr) {
> > +	case hwmon_power_max:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		v &= ~PWR_THRESHOLD1;
> > +		v |= FIELD_PREP(PWR_THRESHOLD1, val);
> > +		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> > +		break;
> > +	case hwmon_power_crit:
> > +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> > +		v &= ~PWR_THRESHOLD2;
> > +		v |= FIELD_PREP(PWR_THRESHOLD2, val);
> > +		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> > +		break;
> > +	default:
> > +		ret = -EOPNOTSUPP;
> > +		break;
> > +	}
> > +
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +static umode_t power_hwmon_attrs_visible(const void *drvdata,
> > +					 enum hwmon_sensor_types type,
> > +					 u32 attr, int channel)
> > +{
> > +	switch (attr) {
> > +	case hwmon_power_input:
> > +	case hwmon_power_max_alarm:
> > +	case hwmon_power_crit_alarm:
> > +		return 0444;
> > +	case hwmon_power_max:
> > +	case hwmon_power_crit:
> > +		return 0644;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct hwmon_ops power_hwmon_ops = {
> > +	.is_visible = power_hwmon_attrs_visible,
> > +	.read = power_hwmon_read,
> > +	.write = power_hwmon_write,
> > +};
> > +
> > +static const struct hwmon_channel_info *power_hwmon_info[] = {
> > +	HWMON_CHANNEL_INFO(power, HWMON_P_INPUT |
> > +				  HWMON_P_MAX   | HWMON_P_MAX_ALARM |
> > +				  HWMON_P_CRIT  | HWMON_P_CRIT_ALARM),
> > +	NULL
> > +};
> > +
> > +static const struct hwmon_chip_info power_hwmon_chip_info = {
> > +	.ops = &power_hwmon_ops,
> > +	.info = power_hwmon_info,
> > +};
> > +
> > +static ssize_t power1_xeon_limit_show(struct device *dev,
> > +				      struct device_attribute *attr, char *buf)
> > +{
> > +	struct dfl_feature *feature = dev_get_drvdata(dev);
> > +	u16 xeon_limit = 0;
> > +	u64 v;
> > +
> > +	v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
> > +
> > +	if (FIELD_GET(XEON_PWR_EN, v))
> > +		xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> > +
> > +	return sprintf(buf, "%u\n", xeon_limit * 100000);
> > +}
> > +
> > +static ssize_t power1_fpga_limit_show(struct device *dev,
> > +				      struct device_attribute *attr, char *buf)
> > +{
> > +	struct dfl_feature *feature = dev_get_drvdata(dev);
> > +	u16 fpga_limit = 0;
> > +	u64 v;
> > +
> > +	v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
> > +
> > +	if (FIELD_GET(FPGA_PWR_EN, v))
> > +		fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> > +
> > +	return sprintf(buf, "%u\n", fpga_limit * 100000);
> > +}
> > +
> > +static ssize_t power1_ltr_show(struct device *dev,
> > +			       struct device_attribute *attr, char *buf)
> > +{
> > +	struct dfl_feature *feature = dev_get_drvdata(dev);
> > +	u64 v;
> > +
> > +	v = readq(feature->ioaddr + FME_PWR_STATUS);
> > +
> > +	return sprintf(buf, "%u\n",
> > +		       (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> > +}
> > +
> > +static DEVICE_ATTR_RO(power1_xeon_limit);
> > +static DEVICE_ATTR_RO(power1_fpga_limit);
> > +static DEVICE_ATTR_RO(power1_ltr);
> > +
> > +static struct attribute *power_extra_attrs[] = {
> > +	&dev_attr_power1_xeon_limit.attr,
> > +	&dev_attr_power1_fpga_limit.attr,
> > +	&dev_attr_power1_ltr.attr,
> > +	NULL
> > +};
> > +
> > +ATTRIBUTE_GROUPS(power_extra);
> > +
> > +static int fme_power_mgmt_init(struct platform_device *pdev,
> > +			       struct dfl_feature *feature)
> > +{
> > +	struct device *hwmon;
> > +
> > +	dev_dbg(&pdev->dev, "FME Power Management Init.\n");
> > +
> > +	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> > +						     "dfl_fme_power", feature,
> > +						     &power_hwmon_chip_info,
> > +						     power_extra_groups);
> > +	if (IS_ERR(hwmon)) {
> > +		dev_err(&pdev->dev, "Fail to register power hwmon\n");
> > +		return PTR_ERR(hwmon);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> > +				 struct dfl_feature *feature)
> > +{
> > +	dev_dbg(&pdev->dev, "FME Power Management UInit.\n");
> > +}
> > +
> > +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> > +	{.id = FME_FEATURE_ID_POWER_MGMT,},
> > +	{0,}
> > +};
> > +
> > +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> > +	.init = fme_power_mgmt_init,
> > +	.uinit = fme_power_mgmt_uinit,
> > +};
> > +
> >  static struct dfl_feature_driver fme_feature_drvs[] = {
> >  	{
> >  		.id_table = fme_hdr_id_table,
> > @@ -418,6 +635,10 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
> >  		.ops = &fme_thermal_mgmt_ops,
> >  	},
> >  	{
> > +		.id_table = fme_power_mgmt_id_table,
> > +		.ops = &fme_power_mgmt_ops,
> > +	},
> > +	{
> >  		.ops = NULL,
> >  	},
> >  };
> > -- 
> > 1.8.3.1
> > 

^ permalink raw reply

* Re: [PATCH v4 3/3] fpga: dfl: fme: add power management support
From: Guenter Roeck @ 2019-06-29 16:21 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-hwmon, jdelvare,
	atull, gregkh, Luwei Kang, Xu Yilun
In-Reply-To: <20190629003308.GA15139@hao-dev>

On 6/28/19 5:33 PM, Wu Hao wrote:
> On Fri, Jun 28, 2019 at 10:55:14AM -0700, Guenter Roeck wrote:
>> On Thu, Jun 27, 2019 at 12:53:38PM +0800, Wu Hao wrote:
>>> This patch adds support for power management private feature under
>>> FPGA Management Engine (FME). This private feature driver registers
>>> a hwmon for power (power1_input), thresholds information, e.g.
>>> (power1_max / crit / max_alarm / crit_alarm) and also read-only sysfs
>>> interfaces for other power management information. For configuration,
>>> user could write threshold values via above power1_max / crit sysfs
>>> interface under hwmon too.
>>>
>>> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
>>> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
>>> Signed-off-by: Wu Hao <hao.wu@intel.com>
>>> ---
>>> v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
>>>      move all sysfs interfaces under hwmon
>>>          consumed          --> hwmon power1_input
>>>          threshold1        --> hwmon power1_cap
>>>          threshold2        --> hwmon power1_crit
>>>          threshold1_status --> hwmon power1_cap_status
>>>          threshold2_status --> hwmon power1_crit_status
>>>          xeon_limit        --> hwmon power1_xeon_limit
>>>          fpga_limit        --> hwmon power1_fpga_limit
>>
>> How do those limits differ from the other limits ?
>> We do have powerX_cap and powerX_cap_max, and from the context
>> it appears that you could possibly at least use power1_cap_max
>> (and power1_cap instead of power1_max) instead of
>> power1_fpga_limit.
> 
> Thanks a lot for the review and comments.
> 
> Actually xeon/fpga_limit are introduced for different purpose. It shows
> the power limit of CPU and FPGA, that may be useful in some integrated
> solution, e.g. CPU and FPGA shares power. We should never these
> interfaces as throttling thresholds.
> 

Ok, your call. Just keep in mind that non-standard attributes won't show
up with the sensors command, and won't be visible for libsensors.

>>
>>>          ltr               --> hwmon power1_ltr
>>> v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
>>> 	power1_cap         --> power1_max
>>> 	power1_cap_status  --> power1_max_alarm
>>> 	power1_crit_status --> power1_crit_alarm
>>
>> power1_cap is standard ABI, and since the value is enforced by HW,
>> it should be usable.
> 
> As you see, in thermal management, threshold1 and threshold2 are
> mapped to temp1_max_alarm and temp1_crit_alarm. So we feel that if
> it will be friendly to user that we keep using max_alarm and crit_alarm
> in power management for threshold1 and threshold2 too.
> 
> Do you think if we can keep this, or it's better to switch back to
> power1_cap?
> 

Again, your call.

> 
>>
>>>      update sysfs doc for above sysfs interface changes.
>>>      replace scnprintf with sprintf in sysfs interface.
>>> v4: use HWMON_CHANNEL_INFO.
>>>      update date in sysfs doc.
>>> ---
>>>   Documentation/ABI/testing/sysfs-platform-dfl-fme |  67 +++++++
>>>   drivers/fpga/dfl-fme-main.c                      | 221 +++++++++++++++++++++++
>>>   2 files changed, 288 insertions(+)
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
>>> index 2cd17dc..a669548 100644
>>> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
>>> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
>>> @@ -127,6 +127,7 @@ Contact:	Wu Hao <hao.wu@intel.com>
>>>   Description:	Read-Only. Read this file to get the name of hwmon device, it
>>>   		supports values:
>>>   		    'dfl_fme_thermal' - thermal hwmon device name
>>> +		    'dfl_fme_power'   - power hwmon device name
>>>   
>>>   What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
>>>   Date:		June 2019
>>> @@ -183,3 +184,69 @@ Description:	Read-Only. Read this file to get the policy of hardware threshold1
>>>   		(see 'temp1_max'). It only supports two values (policies):
>>>   		    0 - AP2 state (90% throttling)
>>>   		    1 - AP1 state (50% throttling)
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-Only. It returns current FPGA power consumption in uW.
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-Write. Read this file to get current hardware power
>>> +		threshold1 in uW. If power consumption rises at or above
>>> +		this threshold, hardware starts 50% throttling.
>>> +		Write this file to set current hardware power threshold1 in uW.
>>> +		As hardware only accepts values in Watts, so input value will
>>> +		be round down per Watts (< 1 watts part will be discarded).
>>> +		Write fails with -EINVAL if input parsing fails or input isn't
>>> +		in the valid range (0 - 127000000 uW).
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-Write. Read this file to get current hardware power
>>> +		threshold2 in uW. If power consumption rises at or above
>>> +		this threshold, hardware starts 90% throttling.
>>> +		Write this file to set current hardware power threshold2 in uW.
>>> +		As hardware only accepts values in Watts, so input value will
>>> +		be round down per Watts (< 1 watts part will be discarded).
>>> +		Write fails with -EINVAL if input parsing fails or input isn't
>>> +		in the valid range (0 - 127000000 uW).
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-only. It returns 1 if power consumption is currently at or
>>> +		above hardware threshold1 (see 'power1_max'), otherwise 0.
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-only. It returns 1 if power consumption is currently at or
>>> +		above hardware threshold2 (see 'power1_crit'), otherwise 0.
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-Only. It returns power limit for XEON in uW.
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-Only. It returns power limit for FPGA in uW.
>>> +
>>> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
>>> +Date:		June 2019
>>> +KernelVersion:	5.3
>>> +Contact:	Wu Hao <hao.wu@intel.com>
>>> +Description:	Read-only. Read this file to get current Latency Tolerance
>>> +		Reporting (ltr) value. This ltr impacts the CPU low power
>>> +		state in integrated solution.
>>
>> Does that attribute add any value without any kind of unit or an explanation
>> of its meaning ? What is userspace supposed to do with that information ?
>> Without context, it is just a meaningless number.
> 
> I should add more description here, will fix it in the next version.
> 
>>
>> Also, it appears that the information is supposed to be passed to power
>> management via the set_latency_tolerance() callback. If so, it would be
>> reported there. Would it possibly make more sense to use that interface ?
> 
> If I remember correctly set_latency_tolerance is used to communicate a tolerance
> to device, but actually this is a read-only value, to indicate latency tolerance
> requirement for memory access from FPGA device, as you know FPGA could be
> programmed for different workloads, and different workloads may have different
> latency requirements, if workloads in FPGA don't have any need for immediate
> memory access, then it would be safe for deeper sleep state of system memory.
> 

Hmm, you are correct. Yes, this attribute could definitely benefit from a more
detailed explanation.

Thanks,
Guenter

^ permalink raw reply

* Re: [RFC PATCH 2/3] Introduce arch_prctl(ARCH_X86_CET_MARK_LEGACY_CODE)
From: Andy Lutomirski @ 2019-06-29 23:43 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <20190628194158.2431-2-yu-cheng.yu@intel.com>

> On Jun 28, 2019, at 12:41 PM, Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> The CET legacy code bitmap covers the whole user-mode address space and is
> located at the top of the user-mode address space.  It is allocated only
> when the first time arch_prctl(ARCH_X86_MARK_LEGACY_CODE) is called from
> an application.
>
> Introduce:
>
> arch_prctl(ARCH_X86_MARK_LEGACY_CODE, unsigned long *buf)
>    Mark an address range as IBT legacy code.

How about defining a struct for this?

The change log should discuss where the bitmap goes and how it’s allocated.

> +static int alloc_bitmap(void)
> +{
> +    unsigned long addr;
> +    u64 msr_ia32_u_cet;
> +
> +    addr = do_mmap_locked(NULL, IBT_BITMAP_ADDR, IBT_BITMAP_SIZE,
> +                  PROT_READ | PROT_WRITE,
> +                  MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE,
> +                  VM_IBT | VM_NORESERVE, NULL);
> +
> +    if (IS_ERR((void *)addr))
> +        return addr;
> +
> +    current->thread.cet.ibt_bitmap_addr = addr;

addr is a constant. Why are you storing it?  If it ends up not being
constant, you should wire up mremap like the vDSO does.


> +static int set_user_bits(unsigned long __user *buf, unsigned long buf_size,
> +             unsigned long start_bit, unsigned long end_bit, unsigned long set)
> +{
> +    unsigned long start_ul, end_ul, total_ul;
> +    int i, j, r;
> +
> +    if (round_up(end_bit, BITS_PER_BYTE) / BITS_PER_BYTE > buf_size)
> +        end_bit = buf_size * BITS_PER_BYTE - 1;
> +
> +    start_ul = start_bit / BITS_PER_LONG;
> +    end_ul = end_bit / BITS_PER_LONG;
> +    total_ul = (end_ul - start_ul + 1);
> +
> +    i = start_bit % BITS_PER_LONG;
> +    j = end_bit % BITS_PER_LONG;
> +
> +    r = 0;
> +    put_user_try {

put_user_try is obsolete.  Just use get_user(), etc.

Also, I must be missing something fundamental, because this series
claims that user code can't write directly to the bitmap.  This means
that this entire function shouldn't work at all.

^ permalink raw reply

* Re: [RFC PATCH 3/3] Prevent user from writing to IBT bitmap.
From: Andy Lutomirski @ 2019-06-29 23:44 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz
In-Reply-To: <20190628194158.2431-3-yu-cheng.yu@intel.com>

On Fri, Jun 28, 2019 at 12:50 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> The IBT bitmap is visiable from user-mode, but not writable.
>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> ---
>  arch/x86/mm/fault.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 59f4f66e4f2e..231196abb62e 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1454,6 +1454,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>          * we can handle it..
>          */
>  good_area:
> +#define USER_MODE_WRITE (FAULT_FLAG_WRITE | FAULT_FLAG_USER)
> +       if (((flags & USER_MODE_WRITE)  == USER_MODE_WRITE) &&
> +           (vma->vm_flags & VM_IBT)) {
> +               bad_area_access_error(regs, hw_error_code, address, vma);
> +               return;
> +       }
> +

Just make the VMA have VM_WRITE and VM_MAYWRITE clear.  No new code
like this should be required.

^ permalink raw reply

* Re: [PATCH V33 24/30] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Andy Lutomirski @ 2019-06-29 23:47 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Andy Lutomirski, Stephen Smalley, James Morris, linux-security,
	LKML, Linux API, David Howells, Alexei Starovoitov,
	Network Development, Chun-Yi Lee, Daniel Borkmann, LSM List
In-Reply-To: <CACdnJuuy7-tkj86njAqtdJ3dUMu-2T8a2y8DC3fMKBK0z9J6ag@mail.gmail.com>

On Fri, Jun 28, 2019 at 11:47 AM Matthew Garrett <mjg59@google.com> wrote:
>
> On Thu, Jun 27, 2019 at 4:27 PM Andy Lutomirski <luto@kernel.org> wrote:
> > They're really quite similar in my mind.  Certainly some things in the
> > "integrity" category give absolutely trivial control over the kernel
> > (e.g. modules) while others make it quite challenging (ioperm), but
> > the end result is very similar.  And quite a few "confidentiality"
> > things genuinely do allow all kernel memory to be read.
> >
> > I agree that finer-grained distinctions could be useful. My concern is
> > that it's a tradeoff, and the other end of the tradeoff is an ABI
> > stability issue.  If someone decides down the road that some feature
> > that is currently "integrity" can be split into a narrow "integrity"
> > feature and a "confidentiality" feature then, if the user policy knows
> > about the individual features, there's a risk of breaking people's
> > systems.  If we keep the fine-grained control, do we have a clear
> > compatibility story?
>
> My preference right now is to retain the fine-grained aspect of things
> in the internal API, simply because it'll be more annoying to add it
> back later if we want to. I don't want to expose it via the Lockdown
> user facing API for the reasons you've described, but it's not
> impossible that another LSM would find a way to do this reasonably.
> Does it seem reasonable to punt this discussion out to the point where
> another LSM tries to do something with this information, based on the
> implementation they're attempting?

I think I can get behind this, as long as it's clear to LSM authors
that this list is only a little bit stable.  I can certainly see the
use for the fine-grained info being available for auditing.

^ permalink raw reply

* Re: [PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Andy Lutomirski @ 2019-06-29 23:51 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Dave Martin, Yu-cheng Yu, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook
In-Reply-To: <87ef3fweoq.fsf@oldenburg2.str.redhat.com>

On Thu, Jun 27, 2019 at 2:39 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Andy Lutomirski:
>
> > Also, I don't think there's any actual requirement that the upstream
> > kernel recognize existing CET-enabled RHEL 8 binaries as being
> > CET-enabled.  I tend to think that RHEL 8 jumped the gun here.
>
> The ABI was supposed to be finalized and everyone involved thought it
> had been reviewed by the GNU gABI community and other interested
> parties.  It had been included in binutils for several releases.
>
> From my point of view, the kernel is just a consumer of the ABI.  The
> kernel would not change an instruction encoding if it doesn't like it
> for some reason, either.

I read the only relevant gABI thing I could find easily, and it seems
to document the "gnu property" thing.  I have no problem with that.

>
> > While the upstream kernel should make some reasonble effort to make
> > sure that RHEL 8 binaries will continue to run, I don't see why we
> > need to go out of our way to keep the full set of mitigations
> > available for binaries that were developed against a non-upstream
> > kernel.
>
> They were developed against the ABI specification.
>
> I do not have a strong opinion what the kernel should do going forward.
> I just want to make clear what happened.

I admit that I'm not really clear on exactly what RHEL 8 shipped.
Some of this stuff is very much an ELF ABI that belongs to the
toolchain, but some if it is kernel API.  For example, the IBT legacy
bitmap API is very much in flux, and I don't think anything credible
has been submitted for upstream inclusion.  Does RHEL 8's glibc
attempt to cope with the case where some libraries are CET-compatible
and some are not?  If so, how does this work?  What, if any, services
does the RHEL 8 kernel provide in this direction?

^ permalink raw reply

* [PATCH for 5.2] rseq/selftests: Fix Thumb mode build failure on arm32
From: Mathieu Desnoyers @ 2019-06-30 13:56 UTC (permalink / raw)
  To: Shuah Khan, Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner,
	Joel Fernandes, Catalin Marinas, Dave Watson, Andi Kleen,
	linux-kselftest, H . Peter Anvin, Chris Lameter, Russell King,
	Michael Kerrisk, Paul E . McKenney, Paul Turner, Boqun Feng,
	Josh Triplett, Steven Rostedt, Ben Maurer, linux-api,
	Andy Lutomirski <lu>

Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.

The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.

ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.

Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.

Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will.deacon@arm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Joel Fernandes <joelaf@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Andi Kleen <andi@firstfloor.org>
CC: linux-kselftest@vger.kernel.org
CC: "H . Peter Anvin" <hpa@zytor.com>
CC: Chris Lameter <cl@linux.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
CC: Paul Turner <pjt@google.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Maurer <bmaurer@fb.com>
CC: linux-api@vger.kernel.org
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
---
 tools/testing/selftests/rseq/rseq-arm.h | 61 ++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5943c816c07c 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -6,6 +6,8 @@
  */
 
 /*
+ * - ARM little endian
+ *
  * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
  * value 0x5de3. This traps if user-space reaches this instruction by mistake,
  * and the uncommon operand ensures the kernel does not move the instruction
@@ -22,36 +24,40 @@
  * def3        udf    #243      ; 0xf3
  * e7f5        b.n    <7f5>
  *
- * pre-ARMv6 big endian code:
- * e7f5        b.n    <7f5>
- * def3        udf    #243      ; 0xf3
+ * - ARMv6+ big endian (BE8):
  *
  * ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
+ * code and big-endian data. The data value of the signature needs to have its
+ * byte order reversed to generate the trap instruction:
+ *
+ * Data: 0xf3def5e7
+ *
+ * Translates to this A32 instruction pattern:
+ *
+ * e7f5def3    udf    #24035    ; 0x5de3
+ *
+ * Translates to this T16 instruction pattern:
+ *
+ * def3        udf    #243      ; 0xf3
+ * e7f5        b.n    <7f5>
+ *
+ * - Prior to ARMv6 big endian (BE32):
+ *
+ * Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so the endianness of the data representation of the
+ * signature should not be reversed. However, the choice between BE32
+ * and BE8 is done by the linker, so we cannot know whether code and
+ * data endianness will be mixed before the linker is invoked. So rather
+ * than try to play tricks with the linker, the rseq signature is simply
+ * data (not a trap instruction) prior to ARMv6 on big endian. This is
+ * why the signature is expressed as data (.word) rather than as
+ * instruction (.inst) in assembler.
  */
 
-#define RSEQ_SIG_CODE	0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA							\
-	({								\
-		int sig;						\
-		asm volatile ("b 2f\n\t"				\
-			      "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
-			      "2:\n\t"					\
-			      "ldr %[sig], 1b\n\t"			\
-			      : [sig] "=r" (sig));			\
-		sig;							\
-	})
-
-#define RSEQ_SIG	RSEQ_SIG_DATA
-
+#ifdef __ARMEB__
+#define RSEQ_SIG    0xf3def5e7      /* udf    #24035    ; 0x5de3 (ARMv6+) */
+#else
+#define RSEQ_SIG    0xe7f5def3      /* udf    #24035    ; 0x5de3 */
 #endif
 
 #define rseq_smp_mb()	__asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +131,7 @@ do {									\
 		__rseq_str(table_label) ":\n\t"				\
 		".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
 		".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
-		".arm\n\t"						\
-		".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"		\
+		".word " __rseq_str(RSEQ_SIG) "\n\t"			\
 		__rseq_str(label) ":\n\t"				\
 		teardown						\
 		"b %l[" __rseq_str(abort_label) "]\n\t"
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v4 3/3] gen_init_cpio: add support for file metadata
From: Mimi Zohar @ 2019-06-30 15:27 UTC (permalink / raw)
  To: Roberto Sassu, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98
In-Reply-To: <20190523121803.21638-4-roberto.sassu@huawei.com>

On Thu, 2019-05-23 at 14:18 +0200, Roberto Sassu wrote:

> diff --git a/usr/Kconfig b/usr/Kconfig
> index 43658b8a975e..8d9f54a16440 100644
> --- a/usr/Kconfig
> +++ b/usr/Kconfig
> @@ -233,3 +233,11 @@ config INITRAMFS_COMPRESSION
>  	default ".lzma" if RD_LZMA
>  	default ".bz2"  if RD_BZIP2
>  	default ""
> +
> +config INITRAMFS_FILE_METADATA
> +	string "File metadata type"
> +	default ""
> +	help
> +	  Specify xattr to include xattrs in the image.
> +
> +	  If you are not sure, leave it blank.
> 	fi

Instead of having to specify the metdata type, let's make this a
choice.

Mimi

^ permalink raw reply

* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-06-30 15:39 UTC (permalink / raw)
  To: Roberto Sassu, Rob Landley, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <33cfb804-6a17-39f0-92b7-01d54e9c452d@huawei.com>

On Wed, 2019-06-26 at 10:15 +0200, Roberto Sassu wrote:
> On 6/3/2019 8:32 PM, Rob Landley wrote:
> > On 6/3/19 4:31 AM, Roberto Sassu wrote:
> >>> This patch set aims at solving the following use case: appraise files from
> >>> the initial ram disk. To do that, IMA checks the signature/hash from the
> >>> security.ima xattr. Unfortunately, this use case cannot be implemented
> >>> currently, as the CPIO format does not support xattrs.
> >>>
> >>> This proposal consists in including file metadata as additional files named
> >>> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
> >>> kernel recognizes these special files from the file name, and calls the
> >>> appropriate parser to add metadata to the previously extracted file. It has
> >>> been proposed to use bit 17:16 of the file mode as a way to recognize files
> >>> with metadata, but both the kernel and the cpio tool declare the file mode
> >>> as unsigned short.
> >>
> >> Any opinion on this patch set?
> >>
> >> Thanks
> >>
> >> Roberto
> > 
> > Sorry, I've had the window open since you posted it but haven't gotten around to
> > it. I'll try to build it later today.
> > 
> > It does look interesting, and I have no objections to the basic approach. I
> > should be able to add support to toybox cpio over a weekend once I've got the
> > kernel doing it to test against.
> 
> Ok.
> 
> Let me give some instructions so that people can test this patch set.
> 
> To add xattrs to the ram disk embedded in the kernel it is sufficient
> to set CONFIG_INITRAMFS_FILE_METADATA="xattr" and
> CONFIG_INITRAMFS_SOURCE="<file with xattr>" in the kernel configuration.
> 
> To add xattrs to the external ram disk, it is necessary to patch cpio:
> 
> https://github.com/euleros/cpio/commit/531cabc88e9ecdc3231fad6e4856869baa9a91ef 
> (xattr-v1 branch)
> 
> and dracut:
> 
> https://github.com/euleros/dracut/commit/a2dee56ea80495c2c1871bc73186f7b00dc8bf3b 
> (digest-lists branch)
> 
> The same modification can be done for mkinitramfs (add '-e xattr' to the
> cpio command line).
> 
> To simplify the test, it would be sufficient to replace only the cpio
> binary and the dracut script with the modified versions. For dracut, the
> patch should be applied to the local dracut (after it has been renamed
> to dracut.sh).
> 
> Then, run:
> 
> dracut -e xattr -I <file with xattr> (add -f to overwrite the ram disk)
> 
> Xattrs can be seen by stopping the boot process for example by adding
> rd.break to the kernel command line.

A simple way of testing, without needing any changes other than the
kernel patches, is to save the dracut temporary directory by supplying
"--keep" on the dracut command line, calling
usr/gen_initramfs_list.sh, followed by usr/gen_init_cpio with the "-e
xattr" option.

If your filesystem already has and copied the security xattrs to the
dracut temporary directory, the script, below, will include them in
the initramfs file.  Otherwise, you'll need to write the desired
security xattrs on the files, using setfattr, in the temporary dracut
directory, before creating the initramfs.

Remember to make sure that the initramfs_list includes "getfattr",
otherwise you'll need to wait until real root is mounted as /sysroot
to see the security xattrs for the rootfs files.

The following script has not been tested on a recent version of
dracut.  Some changes might be needed, as well as some code cleanup.

#!/bin/bash

initramfs_name=/boot/initramfs-`uname -r`.img
initramfs_output_name=${initramfs_name/.img/.test.img}

if [ $# -eq 1 ]; then
        initramfs_name=$1
fi

if [ ! -f "$initramfs_name" ]; then
        echo "Usage; $0 <initramfs pathanem>"
        exit 1
fi

tmp=$(dracut -H -f "$initramfs_name" --keep --noprelink --nostrip 2>&1)
suffix=$(echo $tmp | cut -d ' ' -f 3 | cut -d '.' -f 2)

tmpdir="/var/tmp/dracut.$suffix/initramfs"

if [ ! -d "$tmpdir" ]; then
        echo "$tmpdir does not exist"
        exit 1
fi

usr/gen_initramfs_list.sh ${tmpdir} > usr/initramfs_list
usr/gen_init_cpio -e xattr usr/initramfs_list > usr/initramfs_data.cpio
gzip usr/initramfs_data.cpio

echo "Copying usr/initramfs_data.cpio to $initramfs_output_name"
cp usr/initramfs_data.cpio.gz "$initramfs_output_name"

Mimi

^ permalink raw reply

* Re: [PATCH 2/6] Adjust watch_queue documentation to mention mount and superblock watches. [ver #5]
From: Randy Dunlap @ 2019-07-01  2:59 UTC (permalink / raw)
  To: David Howells, viro
  Cc: Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-security-module, linux-fsdevel, linux-api, linux-block,
	linux-kernel
In-Reply-To: <156173703546.15650.14319137940607993268.stgit@warthog.procyon.org.uk>

Hi David,

On 6/28/19 8:50 AM, David Howells wrote:
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  Documentation/watch_queue.rst |   20 +++++++++++++++++++-
>  drivers/misc/Kconfig          |    5 +++--
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/watch_queue.rst b/Documentation/watch_queue.rst
> index 4087a8e670a8..1bec2018d549 100644
> --- a/Documentation/watch_queue.rst
> +++ b/Documentation/watch_queue.rst
> @@ -13,6 +13,10 @@ receive notifications from the kernel.  This can be used in conjunction with::
>  
>      * USB subsystem event notifications
>  
> +  * Mount topology change notifications
> +
> +  * Superblock event notifications
> +
>  
>  The notifications buffers can be enabled by:
>  
> @@ -324,6 +328,19 @@ Any particular buffer can be fed from multiple sources.  Sources include:
>      for buses and devices.  Watchpoints of this type are set on the global
>      device watch list.
>  
> +  * WATCH_TYPE_MOUNT_NOTIFY
> +
> +    Notifications of this type indicate mount tree topology changes and mount
> +    attribute changes.  A watch can be set on a particular file or directory
> +    and notifications from the path subtree rooted at that point will be
> +    intercepted.
> +
> +  * WATCH_TYPE_SB_NOTIFY
> +
> +    Notifications of this type indicate superblock events, such as quota limits
> +    being hit, I/O errors being produced or network server loss/reconnection.
> +    Watches of this type are set directly on superblocks.
> +
>  
>  Event Filtering
>  ===============
> @@ -365,7 +382,8 @@ Where:
>  	(watch.info & info_mask) == info_filter
>  
>      This could be used, for example, to ignore events that are not exactly on
> -    the watched point in a mount tree.
> +    the watched point in a mount tree by specifying NOTIFY_MOUNT_IN_SUBTREE
> +    must be 0.

I'm having a little trouble parsing that sentence.
Could you clarify it or maybe rewrite/modify it?
Thanks.

>  
>    * ``subtype_filter`` is a bitmask indicating the subtypes that are of
>      interest.  Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to



-- 
~Randy

^ permalink raw reply

* [PATCH v2 05/15] Documentation: fpga: dfl: add descriptions for virtualization and new interfaces.
From: Wu Hao @ 2019-07-01  6:22 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel; +Cc: linux-api, gregkh, atull, Wu Hao, Xu Yilun

This patch adds virtualization support description for DFL based
FPGA devices (based on PCIe SRIOV), and introductions to new
interfaces added by new dfl private feature drivers.

[mdf@kernel.org: Fixed up to make it work with new reStructuredText docs]
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
Signed-off-by: Moritz Fischer <mdf@kernel.org>
---
v2: fix issues during rebasing (doc format change txt -> rst).
---
 Documentation/fpga/dfl.rst | 105 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 2f125ab..6fa483f 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -87,6 +87,8 @@ The following functions are exposed through ioctls:
 - Get driver API version (DFL_FPGA_GET_API_VERSION)
 - Check for extensions (DFL_FPGA_CHECK_EXTENSION)
 - Program bitstream (DFL_FPGA_FME_PORT_PR)
+- Assign port to PF (DFL_FPGA_FME_PORT_ASSIGN)
+- Release port from PF (DFL_FPGA_FME_PORT_RELEASE)
 
 More functions are exposed through sysfs
 (/sys/class/fpga_region/regionX/dfl-fme.n/):
@@ -102,6 +104,10 @@ More functions are exposed through sysfs
      one FPGA device may have more than one port, this sysfs interface indicates
      how many ports the FPGA device has.
 
+ Global error reporting management (errors/)
+     error reporting sysfs interfaces allow user to read errors detected by the
+     hardware, and clear the logged errors.
+
 
 FIU - PORT
 ==========
@@ -143,6 +149,10 @@ More functions are exposed through sysfs:
  Read Accelerator GUID (afu_id)
      afu_id indicates which PR bitstream is programmed to this AFU.
 
+ Error reporting (errors/)
+     error reporting sysfs interfaces allow user to read port/afu errors
+     detected by the hardware, and clear the logged errors.
+
 
 DFL Framework Overview
 ======================
@@ -218,6 +228,101 @@ the compat_id exposed by the target FPGA region. This check is usually done by
 userspace before calling the reconfiguration IOCTL.
 
 
+FPGA virtualization - PCIe SRIOV
+================================
+This section describes the virtualization support on DFL based FPGA device to
+enable accessing an accelerator from applications running in a virtual machine
+(VM). This section only describes the PCIe based FPGA device with SRIOV support.
+
+Features supported by the particular FPGA device are exposed through Device
+Feature Lists, as illustrated below:
+
+::
+
+    +-------------------------------+  +-------------+
+    |              PF               |  |     VF      |
+    +-------------------------------+  +-------------+
+        ^            ^         ^              ^
+        |            |         |              |
+  +-----|------------|---------|--------------|-------+
+  |     |            |         |              |       |
+  |  +-----+     +-------+ +-------+      +-------+   |
+  |  | FME |     | Port0 | | Port1 |      | Port2 |   |
+  |  +-----+     +-------+ +-------+      +-------+   |
+  |                  ^         ^              ^       |
+  |                  |         |              |       |
+  |              +-------+ +------+       +-------+   |
+  |              |  AFU  | |  AFU |       |  AFU  |   |
+  |              +-------+ +------+       +-------+   |
+  |                                                   |
+  |            DFL based FPGA PCIe Device             |
+  +---------------------------------------------------+
+
+FME is always accessed through the physical function (PF).
+
+Ports (and related AFUs) are accessed via PF by default, but could be exposed
+through virtual function (VF) devices via PCIe SRIOV. Each VF only contains
+1 Port and 1 AFU for isolation. Users could assign individual VFs (accelerators)
+created via PCIe SRIOV interface, to virtual machines.
+
+The driver organization in virtualization case is illustrated below:
+::
+
+    +-------++------++------+             |
+    | FME   || FME  || FME  |             |
+    | FPGA  || FPGA || FPGA |             |
+    |Manager||Bridge||Region|             |
+    +-------++------++------+             |
+    +-----------------------+  +--------+ |             +--------+
+    |          FME          |  |  AFU   | |             |  AFU   |
+    |         Module        |  | Module | |             | Module |
+    +-----------------------+  +--------+ |             +--------+
+          +-----------------------+       |       +-----------------------+
+          | FPGA Container Device |       |       | FPGA Container Device |
+          |  (FPGA Base Region)   |       |       |  (FPGA Base Region)   |
+          +-----------------------+       |       +-----------------------+
+            +------------------+          |         +------------------+
+            | FPGA PCIE Module |          | Virtual | FPGA PCIE Module |
+            +------------------+   Host   | Machine +------------------+
+   -------------------------------------- | ------------------------------
+             +---------------+            |          +---------------+
+             | PCI PF Device |            |          | PCI VF Device |
+             +---------------+            |          +---------------+
+
+FPGA PCIe device driver is always loaded first once a FPGA PCIe PF or VF device
+is detected. It:
+
+* Finishes enumeration on both FPGA PCIe PF and VF device using common
+  interfaces from DFL framework.
+* Supports SRIOV.
+
+The FME device driver plays a management role in this driver architecture, it
+provides ioctls to release Port from PF and assign Port to PF. After release
+a port from PF, then it's safe to expose this port through a VF via PCIe SRIOV
+sysfs interface.
+
+To enable accessing an accelerator from applications running in a VM, the
+respective AFU's port needs to be assigned to a VF using the following steps:
+
+#. The PF owns all AFU ports by default. Any port that needs to be
+   reassigned to a VF must first be released through the
+   DFL_FPGA_FME_PORT_RELEASE ioctl on the FME device.
+
+#. Once N ports are released from PF, then user can use command below
+   to enable SRIOV and VFs. Each VF owns only one Port with AFU.
+
+   ::
+
+      echo N > $PCI_DEVICE_PATH/sriov_numvfs
+
+#. Pass through the VFs to VMs
+
+#. The AFU under VF is accessible from applications in VM (using the
+   same driver inside the VF).
+
+Note that an FME can't be assigned to a VF, thus PR and other management
+functions are only available via the PF.
+
 Device enumeration
 ==================
 This section introduces how applications enumerate the fpga device from
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v4 05/15] Documentation: fpga: dfl: add descriptions for virtualization and new interfaces.
From: Wu Hao @ 2019-07-01  6:30 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: linux-fpga, linux-kernel, linux-api, yilun.xu, gregkh, atull
In-Reply-To: <20190628021333.GB9994@hao-dev>

On Fri, Jun 28, 2019 at 10:13:33AM +0800, Wu Hao wrote:
> On Thu, Jun 27, 2019 at 06:12:56PM -0700, Moritz Fischer wrote:
> > Hi Wu,
> > 
> > On Thu, Jun 27, 2019 at 12:44:45PM +0800, Wu Hao wrote:
> > > This patch adds virtualization support description for DFL based
> > > FPGA devices (based on PCIe SRIOV), and introductions to new
> > > interfaces added by new dfl private feature drivers.
> > > 
> > > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > > Acked-by: Alan Tull <atull@kernel.org>
> > > ---
> > >  Documentation/fpga/dfl.txt | 101 +++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 101 insertions(+)
> > > 
> > > diff --git a/Documentation/fpga/dfl.txt b/Documentation/fpga/dfl.txt
> > > index 6df4621..a22631f 100644
> > > --- a/Documentation/fpga/dfl.txt
> > > +++ b/Documentation/fpga/dfl.txt
> > 
> > This got re{named,formatted} in linux-next. I've tried to fix it before sending it
> > to Greg.
> 
> Many Thanks for the help! :)
> 
> I think I need rebase thermal/power management (hwmon) and perf support
> patchsets as they update this documentation file too. But I feel we
> can go ahead on code review for the other patches, if there are some
> comments received and something need to fix, i can put fixes together in
> the next version.

Hi Moritz,

I found some descriptions are dropped unexpectedly in your patch, so I 
prepared and sent out a new version (v2) to fix it. Please let me know if
you prefer an increamental patch.

Thanks!
Hao

> 
> Thanks
> Hao
> 
> > 
> > Thanks,
> > Moritz

^ permalink raw reply

* [PATCH v5 0/3] add thermal/power management features for FPGA DFL drivers
From: Wu Hao @ 2019-07-01  6:37 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, linux-hwmon, jdelvare, linux, atull, gregkh, Wu Hao

This patchset adds thermal and power management features for FPGA DFL
drivers. Both patches are using hwmon as userspace interfaces.

Main changes from v4:
 - rebase due to Documentation format change (dfl.txt -> rst).
 - clamp threshold inputs for sysfs interfaces. (patch#3)
 - update sysfs doc to add more description for ltr sysfs interfaces.
   (patch#3)

Main changes from v3:
 - use HWMON_CHANNEL_INFO.

Main changes from v2:
 - switch to standard hwmon APIs for thermal hwmon:
     temp1_alarm        --> temp1_max
     temp1_alarm_status --> temp1_max_alarm
     temp1_crit_status  --> temp1_crit_alarm
     temp1_alarm_policy --> temp1_max_policy
 - switch to standard hwmon APIs for power hwmon:
     power1_cap         --> power1_max
     power1_cap_status  --> power1_max_alarm
     power1_crit_status --> power1_crit_alarm

Please note that this patchset is generated on top of this patchset.
[PATCH 0/15] FPGA DFL updates
https://lkml.org/lkml/2019/6/27/1088


Wu Hao (2):
  fpga: dfl: fme: add thermal management support
  fpga: dfl: fme: add power management support

Xu Yilun (1):
  Documentation: fpga: dfl: add descriptions for thermal/power
    management interfaces

 Documentation/ABI/testing/sysfs-platform-dfl-fme | 132 ++++++++
 Documentation/fpga/dfl.rst                       |  10 +
 drivers/fpga/Kconfig                             |   2 +-
 drivers/fpga/dfl-fme-main.c                      | 403 +++++++++++++++++++++++
 4 files changed, 546 insertions(+), 1 deletion(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v5 1/3] Documentation: fpga: dfl: add descriptions for thermal/power management interfaces
From: Wu Hao @ 2019-07-01  6:37 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, linux-hwmon, jdelvare, linux, atull, gregkh, Xu Yilun,
	Wu Hao
In-Reply-To: <1561963027-4213-1-git-send-email-hao.wu@intel.com>

From: Xu Yilun <yilun.xu@intel.com>

This patch add introductions to thermal/power interfaces. They are
implemented as hwmon sysfs interfaces by thermal/power private
feature drivers.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v5: rebase due to dfl.txt -> dfl.rst
---
 Documentation/fpga/dfl.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index 6fa483f..094fc8a 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -108,6 +108,16 @@ More functions are exposed through sysfs
      error reporting sysfs interfaces allow user to read errors detected by the
      hardware, and clear the logged errors.
 
+ Power management (dfl_fme_power hwmon)
+     power management hwmon sysfs interfaces allow user to read power management
+     information (power consumption, thresholds, threshold status, limits, etc.)
+     and configure power thresholds for different throttling levels.
+
+ Thermal management (dfl_fme_thermal hwmon)
+     thermal management hwmon sysfs interfaces allow user to read thermal
+     management information (current temperature, thresholds, threshold status,
+     etc.).
+
 
 FIU - PORT
 ==========
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v5 2/3] fpga: dfl: fme: add thermal management support
From: Wu Hao @ 2019-07-01  6:37 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, linux-hwmon, jdelvare, linux, atull, gregkh, Wu Hao,
	Luwei Kang, Russ Weight, Xu Yilun
In-Reply-To: <1561963027-4213-1-git-send-email-hao.wu@intel.com>

This patch adds support to thermal management private feature for DFL
FPGA Management Engine (FME). This private feature driver registers
a hwmon for thermal/temperature monitoring (hwmon temp1_input).
If hardware automatic throttling is supported by this hardware, then
driver also exposes sysfs interfaces under hwmon for thresholds
(temp1_max/ crit/ emergency), threshold alarms (temp1_max_alarm/
temp1_crit_alarm) and throttling policy (temp1_max_policy).

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
---
v2: create a dfl_fme_thermal hwmon to expose thermal information.
    move all sysfs interfaces under hwmon
	tempareture       --> hwmon temp1_input
	threshold1        --> hwmon temp1_alarm
	threshold2        --> hwmon temp1_crit
	trip_threshold    --> hwmon temp1_emergency
	threshold1_status --> hwmon temp1_alarm_status
	threshold2_status --> hwmon temp1_crit_status
	threshold1_policy --> hwmon temp1_alarm_policy
v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
	temp1_alarm        --> temp1_max
	temp1_alarm_status --> temp1_max_alarm
	temp1_crit_status  --> temp1_crit_alarm
	temp1_alarm_policy --> temp1_max_policy
    update sysfs doc for above sysfs interface changes.
    replace scnprintf with sprintf in sysfs interface.
v4: use HWMON_CHANNEL_INFO.
    rebase, and update date in sysfs doc.
v5: no change.
---
 Documentation/ABI/testing/sysfs-platform-dfl-fme |  64 ++++++++
 drivers/fpga/Kconfig                             |   2 +-
 drivers/fpga/dfl-fme-main.c                      | 187 +++++++++++++++++++++++
 3 files changed, 252 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index 86eef83..2cd17dc 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -119,3 +119,67 @@ Description:	Write-only. Write error code to this file to clear all errors
 		logged in errors, first_error and next_error. Write fails with
 		-EINVAL if input parsing fails or input error code doesn't
 		match.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/name
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. Read this file to get the name of hwmon device, it
+		supports values:
+		    'dfl_fme_thermal' - thermal hwmon device name
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns FPGA device temperature in millidegrees
+		Celsius.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns hardware threshold1 temperature in
+		millidegrees Celsius. If temperature rises at or above this
+		threshold, hardware starts 50% or 90% throttling (see
+		'temp1_max_policy').
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns hardware threshold2 temperature in
+		millidegrees Celsius. If temperature rises at or above this
+		threshold, hardware starts 100% throttling.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_emergency
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns hardware trip threshold temperature in
+		millidegrees Celsius. If temperature rises at or above this
+		threshold, a fatal event will be triggered to board management
+		controller (BMC) to shutdown FPGA.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_alarm
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if temperature is currently at or above
+		hardware threshold1 (see 'temp1_max'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_crit_alarm
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if temperature is currently at or above
+		hardware threshold2 (see 'temp1_crit'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_max_policy
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. Read this file to get the policy of hardware threshold1
+		(see 'temp1_max'). It only supports two values (policies):
+		    0 - AP2 state (90% throttling)
+		    1 - AP1 state (50% throttling)
diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index 8072c19..48f6224 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -155,7 +155,7 @@ config FPGA_DFL
 
 config FPGA_DFL_FME
 	tristate "FPGA DFL FME Driver"
-	depends on FPGA_DFL
+	depends on FPGA_DFL && HWMON
 	help
 	  The FPGA Management Engine (FME) is a feature device implemented
 	  under Device Feature List (DFL) framework. Select this option to
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 4490cf4..59ff9f1 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -14,6 +14,8 @@
  *   Henry Mitchel <henry.mitchel@intel.com>
  */
 
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
@@ -217,6 +219,187 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
 	.ioctl = fme_hdr_ioctl,
 };
 
+#define FME_THERM_THRESHOLD	0x8
+#define TEMP_THRESHOLD1		GENMASK_ULL(6, 0)
+#define TEMP_THRESHOLD1_EN	BIT_ULL(7)
+#define TEMP_THRESHOLD2		GENMASK_ULL(14, 8)
+#define TEMP_THRESHOLD2_EN	BIT_ULL(15)
+#define TRIP_THRESHOLD		GENMASK_ULL(30, 24)
+#define TEMP_THRESHOLD1_STATUS	BIT_ULL(32)		/* threshold1 reached */
+#define TEMP_THRESHOLD2_STATUS	BIT_ULL(33)		/* threshold2 reached */
+/* threshold1 policy: 0 - AP2 (90% throttle) / 1 - AP1 (50% throttle) */
+#define TEMP_THRESHOLD1_POLICY	BIT_ULL(44)
+
+#define FME_THERM_RDSENSOR_FMT1	0x10
+#define FPGA_TEMPERATURE	GENMASK_ULL(6, 0)
+
+#define FME_THERM_CAP		0x20
+#define THERM_NO_THROTTLE	BIT_ULL(0)
+
+#define MD_PRE_DEG
+
+static bool fme_thermal_throttle_support(void __iomem *base)
+{
+	u64 v = readq(base + FME_THERM_CAP);
+
+	return FIELD_GET(THERM_NO_THROTTLE, v) ? false : true;
+}
+
+static umode_t thermal_hwmon_attrs_visible(const void *drvdata,
+					   enum hwmon_sensor_types type,
+					   u32 attr, int channel)
+{
+	const struct dfl_feature *feature = drvdata;
+
+	/* temperature is always supported, and check hardware cap for others */
+	if (attr == hwmon_temp_input)
+		return 0444;
+
+	return fme_thermal_throttle_support(feature->ioaddr) ? 0444 : 0;
+}
+
+static int thermal_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+			      u32 attr, int channel, long *val)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	switch (attr) {
+	case hwmon_temp_input:
+		v = readq(feature->ioaddr + FME_THERM_RDSENSOR_FMT1);
+		*val = (long)(FIELD_GET(FPGA_TEMPERATURE, v) * 1000);
+		break;
+	case hwmon_temp_max:
+		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+		*val = (long)(FIELD_GET(TEMP_THRESHOLD1, v) * 1000);
+		break;
+	case hwmon_temp_crit:
+		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+		*val = (long)(FIELD_GET(TEMP_THRESHOLD2, v) * 1000);
+		break;
+	case hwmon_temp_emergency:
+		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+		*val = (long)(FIELD_GET(TRIP_THRESHOLD, v) * 1000);
+		break;
+	case hwmon_temp_max_alarm:
+		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+		*val = (long)FIELD_GET(TEMP_THRESHOLD1_STATUS, v);
+		break;
+	case hwmon_temp_crit_alarm:
+		v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+		*val = (long)FIELD_GET(TEMP_THRESHOLD2_STATUS, v);
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static const struct hwmon_ops thermal_hwmon_ops = {
+	.is_visible = thermal_hwmon_attrs_visible,
+	.read = thermal_hwmon_read,
+};
+
+static const struct hwmon_channel_info *thermal_hwmon_info[] = {
+	HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_EMERGENCY |
+				 HWMON_T_MAX   | HWMON_T_MAX_ALARM |
+				 HWMON_T_CRIT  | HWMON_T_CRIT_ALARM),
+	NULL
+};
+
+static const struct hwmon_chip_info thermal_hwmon_chip_info = {
+	.ops = &thermal_hwmon_ops,
+	.info = thermal_hwmon_info,
+};
+
+static ssize_t temp1_max_policy_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_THERM_THRESHOLD);
+
+	return sprintf(buf, "%u\n",
+		       (unsigned int)FIELD_GET(TEMP_THRESHOLD1_POLICY, v));
+}
+
+static DEVICE_ATTR_RO(temp1_max_policy);
+
+static struct attribute *thermal_extra_attrs[] = {
+	&dev_attr_temp1_max_policy.attr,
+	NULL,
+};
+
+static umode_t thermal_extra_attrs_visible(struct kobject *kobj,
+					   struct attribute *attr, int index)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+
+	return fme_thermal_throttle_support(feature->ioaddr) ? attr->mode : 0;
+}
+
+static const struct attribute_group thermal_extra_group = {
+	.attrs		= thermal_extra_attrs,
+	.is_visible	= thermal_extra_attrs_visible,
+};
+__ATTRIBUTE_GROUPS(thermal_extra);
+
+static int fme_thermal_mgmt_init(struct platform_device *pdev,
+				 struct dfl_feature *feature)
+{
+	struct device *hwmon;
+
+	dev_dbg(&pdev->dev, "FME Thermal Management Init.\n");
+
+	/*
+	 * create hwmon to allow userspace monitoring temperature and other
+	 * threshold information.
+	 *
+	 * temp1_input      -> FPGA device temperature
+	 * temp1_max        -> hardware threshold 1 -> 50% or 90% throttling
+	 * temp1_crit       -> hardware threshold 2 -> 100% throttling
+	 * temp1_emergency  -> hardware trip_threshold to shutdown FPGA
+	 * temp1_max_alarm  -> hardware threshold 1 alarm
+	 * temp1_crit_alarm -> hardware threshold 2 alarm
+	 *
+	 * create device specific sysfs interfaces, e.g. read temp1_max_policy
+	 * to understand the actual hardware throttling action (50% vs 90%).
+	 *
+	 * If hardware doesn't support automatic throttling per thresholds,
+	 * then all above sysfs interfaces are not visible except temp1_input
+	 * for temperature.
+	 */
+	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
+						     "dfl_fme_thermal", feature,
+						     &thermal_hwmon_chip_info,
+						     thermal_extra_groups);
+	if (IS_ERR(hwmon)) {
+		dev_err(&pdev->dev, "Fail to register thermal hwmon\n");
+		return PTR_ERR(hwmon);
+	}
+
+	return 0;
+}
+
+static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
+				   struct dfl_feature *feature)
+{
+	dev_dbg(&pdev->dev, "FME Thermal Management UInit.\n");
+}
+
+static const struct dfl_feature_id fme_thermal_mgmt_id_table[] = {
+	{.id = FME_FEATURE_ID_THERMAL_MGMT,},
+	{0,}
+};
+
+static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
+	.init = fme_thermal_mgmt_init,
+	.uinit = fme_thermal_mgmt_uinit,
+};
+
 static struct dfl_feature_driver fme_feature_drvs[] = {
 	{
 		.id_table = fme_hdr_id_table,
@@ -231,6 +414,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
 		.ops = &fme_global_err_ops,
 	},
 	{
+		.id_table = fme_thermal_mgmt_id_table,
+		.ops = &fme_thermal_mgmt_ops,
+	},
+	{
 		.ops = NULL,
 	},
 };
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v5 3/3] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-07-01  6:37 UTC (permalink / raw)
  To: mdf, linux-fpga, linux-kernel
  Cc: linux-api, linux-hwmon, jdelvare, linux, atull, gregkh, Wu Hao,
	Luwei Kang, Xu Yilun
In-Reply-To: <1561963027-4213-1-git-send-email-hao.wu@intel.com>

This patch adds support for power management private feature under
FPGA Management Engine (FME). This private feature driver registers
a hwmon for power (power1_input), thresholds information, e.g.
(power1_max / crit / max_alarm / crit_alarm) and also read-only sysfs
interfaces for other power management information. For configuration,
user could write threshold values via above power1_max / crit sysfs
interface under hwmon too.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Wu Hao <hao.wu@intel.com>
---
v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
    move all sysfs interfaces under hwmon
        consumed          --> hwmon power1_input
        threshold1        --> hwmon power1_cap
        threshold2        --> hwmon power1_crit
        threshold1_status --> hwmon power1_cap_status
        threshold2_status --> hwmon power1_crit_status
        xeon_limit        --> hwmon power1_xeon_limit
        fpga_limit        --> hwmon power1_fpga_limit
        ltr               --> hwmon power1_ltr
v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
	power1_cap         --> power1_max
	power1_cap_status  --> power1_max_alarm
	power1_crit_status --> power1_crit_alarm
    update sysfs doc for above sysfs interface changes.
    replace scnprintf with sprintf in sysfs interface.
v4: use HWMON_CHANNEL_INFO.
    update date in sysfs doc.
v5: clamp threshold inputs in power_hwmon_write function.
    update sysfs doc as threshold inputs are clamped now.
    add more descriptions to ltr sysfs interface.
---
 Documentation/ABI/testing/sysfs-platform-dfl-fme |  68 +++++++
 drivers/fpga/dfl-fme-main.c                      | 216 +++++++++++++++++++++++
 2 files changed, 284 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
index 2cd17dc..5c2e49d 100644
--- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
+++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
@@ -127,6 +127,7 @@ Contact:	Wu Hao <hao.wu@intel.com>
 Description:	Read-Only. Read this file to get the name of hwmon device, it
 		supports values:
 		    'dfl_fme_thermal' - thermal hwmon device name
+		    'dfl_fme_power'   - power hwmon device name
 
 What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
 Date:		June 2019
@@ -183,3 +184,70 @@ Description:	Read-Only. Read this file to get the policy of hardware threshold1
 		(see 'temp1_max'). It only supports two values (policies):
 		    0 - AP2 state (90% throttling)
 		    1 - AP1 state (50% throttling)
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns current FPGA power consumption in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Write. Read this file to get current hardware power
+		threshold1 in uW. If power consumption rises at or above
+		this threshold, hardware starts 50% throttling.
+		Write this file to set current hardware power threshold1 in uW.
+		As hardware only accepts values in Watts, so input value will
+		be round down per Watts (< 1 watts part will be discarded) and
+		clamped within the range from 0 to 127 Watts. Write fails with
+		-EINVAL if input parsing fails.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Write. Read this file to get current hardware power
+		threshold2 in uW. If power consumption rises at or above
+		this threshold, hardware starts 90% throttling.
+		Write this file to set current hardware power threshold2 in uW.
+		As hardware only accepts values in Watts, so input value will
+		be round down per Watts (< 1 watts part will be discarded) and
+		clamped within the range from 0 to 127 Watts. Write fails with
+		-EINVAL if input parsing fails.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if power consumption is currently at or
+		above hardware threshold1 (see 'power1_max'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. It returns 1 if power consumption is currently at or
+		above hardware threshold2 (see 'power1_crit'), otherwise 0.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns power limit for XEON in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-Only. It returns power limit for FPGA in uW.
+
+What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
+Date:		June 2019
+KernelVersion:	5.3
+Contact:	Wu Hao <hao.wu@intel.com>
+Description:	Read-only. Read this file to get current Latency Tolerance
+		Reporting (ltr) value. It returns 1 if all Accelerated
+		Function Units (AFUs) can tolerate latency >= 40us for memory
+		access or 0 if any AFU is latency sensitive (< 40us).
diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
index 59ff9f1..1ff386d 100644
--- a/drivers/fpga/dfl-fme-main.c
+++ b/drivers/fpga/dfl-fme-main.c
@@ -400,6 +400,218 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
 	.uinit = fme_thermal_mgmt_uinit,
 };
 
+#define FME_PWR_STATUS		0x8
+#define FME_LATENCY_TOLERANCE	BIT_ULL(18)
+#define PWR_CONSUMED		GENMASK_ULL(17, 0)
+
+#define FME_PWR_THRESHOLD	0x10
+#define PWR_THRESHOLD1		GENMASK_ULL(6, 0)	/* in Watts */
+#define PWR_THRESHOLD2		GENMASK_ULL(14, 8)	/* in Watts */
+#define PWR_THRESHOLD_MAX	0x7f			/* in Watts */
+#define PWR_THRESHOLD1_STATUS	BIT_ULL(16)
+#define PWR_THRESHOLD2_STATUS	BIT_ULL(17)
+
+#define FME_PWR_XEON_LIMIT	0x18
+#define XEON_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
+#define XEON_PWR_EN		BIT_ULL(15)
+#define FME_PWR_FPGA_LIMIT	0x20
+#define FPGA_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
+#define FPGA_PWR_EN		BIT_ULL(15)
+
+static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+			    u32 attr, int channel, long *val)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	switch (attr) {
+	case hwmon_power_input:
+		v = readq(feature->ioaddr + FME_PWR_STATUS);
+		*val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
+		break;
+	case hwmon_power_max:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
+		break;
+	case hwmon_power_crit:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
+		break;
+	case hwmon_power_max_alarm:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)FIELD_GET(PWR_THRESHOLD1_STATUS, v);
+		break;
+	case hwmon_power_crit_alarm:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		*val = (long)FIELD_GET(PWR_THRESHOLD2_STATUS, v);
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
+			     u32 attr, int channel, long val)
+{
+	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	int ret = 0;
+	u64 v;
+
+	val = clamp_val(val / 1000000, 0, PWR_THRESHOLD_MAX);
+
+	mutex_lock(&pdata->lock);
+
+	switch (attr) {
+	case hwmon_power_max:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		v &= ~PWR_THRESHOLD1;
+		v |= FIELD_PREP(PWR_THRESHOLD1, val);
+		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+		break;
+	case hwmon_power_crit:
+		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
+		v &= ~PWR_THRESHOLD2;
+		v |= FIELD_PREP(PWR_THRESHOLD2, val);
+		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+		break;
+	}
+
+	mutex_unlock(&pdata->lock);
+
+	return ret;
+}
+
+static umode_t power_hwmon_attrs_visible(const void *drvdata,
+					 enum hwmon_sensor_types type,
+					 u32 attr, int channel)
+{
+	switch (attr) {
+	case hwmon_power_input:
+	case hwmon_power_max_alarm:
+	case hwmon_power_crit_alarm:
+		return 0444;
+	case hwmon_power_max:
+	case hwmon_power_crit:
+		return 0644;
+	}
+
+	return 0;
+}
+
+static const struct hwmon_ops power_hwmon_ops = {
+	.is_visible = power_hwmon_attrs_visible,
+	.read = power_hwmon_read,
+	.write = power_hwmon_write,
+};
+
+static const struct hwmon_channel_info *power_hwmon_info[] = {
+	HWMON_CHANNEL_INFO(power, HWMON_P_INPUT |
+				  HWMON_P_MAX   | HWMON_P_MAX_ALARM |
+				  HWMON_P_CRIT  | HWMON_P_CRIT_ALARM),
+	NULL
+};
+
+static const struct hwmon_chip_info power_hwmon_chip_info = {
+	.ops = &power_hwmon_ops,
+	.info = power_hwmon_info,
+};
+
+static ssize_t power1_xeon_limit_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u16 xeon_limit = 0;
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
+
+	if (FIELD_GET(XEON_PWR_EN, v))
+		xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
+
+	return sprintf(buf, "%u\n", xeon_limit * 100000);
+}
+
+static ssize_t power1_fpga_limit_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u16 fpga_limit = 0;
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
+
+	if (FIELD_GET(FPGA_PWR_EN, v))
+		fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
+
+	return sprintf(buf, "%u\n", fpga_limit * 100000);
+}
+
+static ssize_t power1_ltr_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct dfl_feature *feature = dev_get_drvdata(dev);
+	u64 v;
+
+	v = readq(feature->ioaddr + FME_PWR_STATUS);
+
+	return sprintf(buf, "%u\n",
+		       (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
+}
+
+static DEVICE_ATTR_RO(power1_xeon_limit);
+static DEVICE_ATTR_RO(power1_fpga_limit);
+static DEVICE_ATTR_RO(power1_ltr);
+
+static struct attribute *power_extra_attrs[] = {
+	&dev_attr_power1_xeon_limit.attr,
+	&dev_attr_power1_fpga_limit.attr,
+	&dev_attr_power1_ltr.attr,
+	NULL
+};
+
+ATTRIBUTE_GROUPS(power_extra);
+
+static int fme_power_mgmt_init(struct platform_device *pdev,
+			       struct dfl_feature *feature)
+{
+	struct device *hwmon;
+
+	dev_dbg(&pdev->dev, "FME Power Management Init.\n");
+
+	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
+						     "dfl_fme_power", feature,
+						     &power_hwmon_chip_info,
+						     power_extra_groups);
+	if (IS_ERR(hwmon)) {
+		dev_err(&pdev->dev, "Fail to register power hwmon\n");
+		return PTR_ERR(hwmon);
+	}
+
+	return 0;
+}
+
+static void fme_power_mgmt_uinit(struct platform_device *pdev,
+				 struct dfl_feature *feature)
+{
+	dev_dbg(&pdev->dev, "FME Power Management UInit.\n");
+}
+
+static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
+	{.id = FME_FEATURE_ID_POWER_MGMT,},
+	{0,}
+};
+
+static const struct dfl_feature_ops fme_power_mgmt_ops = {
+	.init = fme_power_mgmt_init,
+	.uinit = fme_power_mgmt_uinit,
+};
+
 static struct dfl_feature_driver fme_feature_drvs[] = {
 	{
 		.id_table = fme_hdr_id_table,
@@ -418,6 +630,10 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
 		.ops = &fme_thermal_mgmt_ops,
 	},
 	{
+		.id_table = fme_power_mgmt_id_table,
+		.ops = &fme_power_mgmt_ops,
+	},
+	{
 		.ops = NULL,
 	},
 };
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v3 1/5] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-07-01  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Andrew Morton, linux-mm, LKML, linux-api,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, oleksandr, hdanton,
	lizeb, Kirill A . Shutemov
In-Reply-To: <20190627235618.GC33052@google.com>

On Fri, Jun 28, 2019 at 08:56:18AM +0900, Minchan Kim wrote:
> On Thu, Jun 27, 2019 at 04:53:02PM +0200, Michal Hocko wrote:
> > On Thu 27-06-19 07:36:50, Dave Hansen wrote:
> > [...]
> > > For MADV_COLD, if we defined it like this, I think we could use it for
> > > both purposes (demotion and LRU movement):
> > > 
> > > 	Pages in the specified regions will be treated as less-recently-
> > > 	accessed compared to pages in the system with similar access
> > > 	frequencies.  In contrast to MADV_DONTNEED, the contents of the
> > 
> > you meant s@MADV_DONTNEED@MADV_FREE@ I suppose
> 
> Right, MADV_FREE is more proper because it's aging related.
> 
> > 
> > > 	region are preserved.
> > > 
> > > It would be nice not to talk about reclaim at all since we're not
> > > promising reclaim per se.
> 
> Your suggestion doesn't expose any implementation detail and could meet your
> needs later. I'm okay. I will change it if others are not against of it.
> 
> Thanks, Dave.

>From 39df9f94e6204b8893f3f3feb692745657392657 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 24 May 2019 13:47:54 +0900
Subject: [PATCH v3 1/5] mm: introduce MADV_COLD

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
file LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because
the content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied. Even, it could give a bonus to make
them be reclaimed on swapless system. However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost. Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU. Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device. Let's start simpler way without adding
complexity at this moment. However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD
on anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.
In contrast to MADV_FREE, the contents of the region are preserved
regardless of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.

* v2
 * add up the warn with lots of page cache workload - mhocko
 * add man page stuff - dave

* v1
 * remove page_mapcount filter - hannes, mhocko
 * remove idle page handling - joelaf

* RFCv2
 * add more description - mhocko

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_youn in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/internal.h                          |   2 +-
 mm/madvise.c                           | 180 ++++++++++++++++++++++++-
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  42 ++++++
 6 files changed, 224 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef4623f03156..d7b4231eea63 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -47,6 +47,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COLD	5		/* deactivatie these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/internal.h b/mm/internal.h
index f53a14d67538..c61b215ff265 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,7 +39,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..7abb8e54bc7a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,178 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		pmd_t orig_pmd;
+
+		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		orig_pmd = *pmd;
+		if (is_huge_zero_pmd(orig_pmd))
+			goto huge_unlock;
+
+		if (unlikely(!pmd_present(orig_pmd))) {
+			VM_BUG_ON(thp_migration_supported() &&
+					!is_pmd_migration_entry(orig_pmd));
+			goto huge_unlock;
+		}
+
+		page = pmd_page(orig_pmd);
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			if (page_mapcount(page) != 1)
+				goto huge_unlock;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (pmd_young(orig_pmd)) {
+			pmdp_invalidate(vma, addr, pmd);
+			orig_pmd = pmd_mkold(orig_pmd);
+
+			set_pmd_at(mm, addr, pmd, orig_pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		}
+
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	tlb_change_page_size(tlb, PAGE_SIZE);
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
+	arch_enter_lazy_mmu_mode();
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		/*
+		 * Creating a THP page is expensive so split it only if we
+		 * are sure it's worth. Split it if we are only owner.
+		 */
+		if (PageTransCompound(page)) {
+			if (page_mapcount(page) != 1)
+				break;
+			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				break;
+			}
+			pte_unmap_unlock(orig_pte, ptl);
+			if (split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				break;
+			}
+			unlock_page(page);
+			put_page(page);
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte--;
+			addr -= PAGE_SIZE;
+			continue;
+		}
+
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+		if (pte_young(ptent)) {
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+			ptent = pte_mkold(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+
+		/*
+		 * We are deactivating a page for accelerating reclaiming.
+		 * VM couldn't reclaim the page unless we clear PG_young.
+		 * As a side effect, it makes confuse idle-page tracking
+		 * because they will miss recent referenced history.
+		 */
+		test_and_clear_page_young(page);
+		deactivate_page(page);
+	}
+
+	arch_enter_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cold_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cold_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -519,7 +692,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -541,7 +714,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -695,6 +868,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +891,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6de5c354d6ca..2140a6f8db63 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,7 +523,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			continue;
 
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 607c48229a1d..a91859d061f3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,22 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +644,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +728,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply related

* Re: [PATCH v3 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Minchan Kim @ 2019-07-01  7:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190627115405.255259-1-minchan@kernel.org>


Hi Folks,

Do you guys have comments? I think it would be long enough to be
pending. If there is no further comments, I want to ask to merge.

Thanks.

On Thu, Jun 27, 2019 at 08:54:00PM +0900, Minchan Kim wrote:
> This patch is part of previous series:
> https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
> Originally, it was created for external madvise hinting feature.
> 
> https://lkml.org/lkml/2019/5/31/463
> Michal wanted to separte the discussion from external hinting interface
> so this patchset includes only first part of my entire patchset
> 
>   - introduce MADV_COLD and MADV_PAGEOUT hint to madvise.
> 
> However, I keep entire description for others for easier understanding
> why this kinds of hint was born.
> 
> Thanks.
> 
> This patchset is against on next-20190530.
> 
> Below is description of previous entire patchset.
> ================= &< =====================
> 
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COLD which will deactivate activated pages and the other is
> MADV_PAGEOUT which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> Minchan Kim (5):
>   mm: introduce MADV_COLD
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: account nr_isolated_xxx in [isolate|putback]_lru_page
>   mm: introduce MADV_PAGEOUT
>   mm: factor out pmd young/dirty bit handling and THP split
> 
>  include/linux/huge_mm.h                |   3 -
>  include/linux/swap.h                   |   2 +
>  include/uapi/asm-generic/mman-common.h |   2 +
>  mm/compaction.c                        |   2 -
>  mm/gup.c                               |   7 +-
>  mm/huge_memory.c                       |  74 -----
>  mm/internal.h                          |   2 +-
>  mm/khugepaged.c                        |   3 -
>  mm/madvise.c                           | 438 ++++++++++++++++++++++++-
>  mm/memory-failure.c                    |   3 -
>  mm/memory_hotplug.c                    |   4 -
>  mm/mempolicy.c                         |   6 +-
>  mm/migrate.c                           |  37 +--
>  mm/oom_kill.c                          |   2 +-
>  mm/swap.c                              |  42 +++
>  mm/vmscan.c                            |  86 ++++-
>  16 files changed, 566 insertions(+), 147 deletions(-)
> 
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 

^ permalink raw reply

* Re: [PATCH 01/13] vfs: verify param type in vfs_parse_sb_flag()
From: Miklos Szeredi @ 2019-07-01  8:45 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Miklos Szeredi, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

Hi David,

Ping?  Have you had a chance of looking at this series?

Köszi,
Miklos

On Wed, Jun 19, 2019 at 2:30 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> vfs_parse_sb_flag() accepted any kind of param with a matching key, not
> just a flag.  This is wrong, only allow flag type and return -EINVAL
> otherwise.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fs_context.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 103643c68e3f..e56310fd8c75 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -81,30 +81,29 @@ static const char *const forbidden_sb_flag[] = {
>  /*
>   * Check for a common mount option that manipulates s_flags.
>   */
> -static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
> +static int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
>  {
> -       unsigned int token;
> +       const char *key = param->key;
> +       unsigned int set, clear;
>         unsigned int i;
>
>         for (i = 0; i < ARRAY_SIZE(forbidden_sb_flag); i++)
>                 if (strcmp(key, forbidden_sb_flag[i]) == 0)
>                         return -EINVAL;
>
> -       token = lookup_constant(common_set_sb_flag, key, 0);
> -       if (token) {
> -               fc->sb_flags |= token;
> -               fc->sb_flags_mask |= token;
> -               return 0;
> -       }
> +       set = lookup_constant(common_set_sb_flag, key, 0);
> +       clear = lookup_constant(common_clear_sb_flag, key, 0);
> +       if (!set && !clear)
> +               return -ENOPARAM;
>
> -       token = lookup_constant(common_clear_sb_flag, key, 0);
> -       if (token) {
> -               fc->sb_flags &= ~token;
> -               fc->sb_flags_mask |= token;
> -               return 0;
> -       }
> +       if (param->type != fs_value_is_flag)
> +               return invalf(fc, "%s: Unexpected value for '%s'",
> +                             fc->fs_type->name, param->key);
>
> -       return -ENOPARAM;
> +       fc->sb_flags |= set;
> +       fc->sb_flags &= ~clear;
> +       fc->sb_flags_mask |= set | clear;
> +       return 0;
>  }
>
>  /**
> @@ -130,7 +129,7 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
>         if (!param->key)
>                 return invalf(fc, "Unnamed parameter\n");
>
> -       ret = vfs_parse_sb_flag(fc, param->key);
> +       ret = vfs_parse_sb_flag(fc, param);
>         if (ret != -ENOPARAM)
>                 return ret;
>
> --
> 2.21.0
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox