Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C
From: Steven Rostedt @ 2015-02-10 12:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds
In-Reply-To: <CAMEtUuyCmO+N2SOTNXfOVnRA59TWsZczPCJBWsB+h0E45NxHPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, 9 Feb 2015 21:47:42 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> > I thought we already stated that.
> > Here is the quote from perf_event.h:
> >          *      # The RAW record below is opaque data wrt the ABI
> >          *      #
> >          *      # That is, the ABI doesn't make any promises wrt to
> >          *      # the stability of its content, it may vary depending
> >          *      # on event, hardware, kernel version and phase of
> >          *      # the moon.
> >          *      #
> >          *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> >
> > and this example is reading PERF_SAMPLE_RAW events and
> > uses locally defined structs to print them for simplicity.
> 
> to underline my point once more:
> addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
> delivers to user space.
> so no new ABIs anywhere.

Again, it we give an example of how to hard code the data, definitely
expect this to show up in user space. Users are going to look at this
code to learn how to use eBPF. I really want it to do it the correct
way instead of the 'easy' way. Because whatever way we have it here,
will be the way we will see it out in the wild.

-- Steve

^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-10 12:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML
In-Reply-To: <CAMEtUuzuu9T1eHyh27tdyj1+BnFta9-dzK8q7P9Y7x4iQrk+Xw@mail.gmail.com>

On Mon, 9 Feb 2015 21:51:05 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Looks like this is entirely perf based and does not interact with
> > ftrace at all. In other words, it's perf not tracing.
> >
> > It makes more sense to go through tip than the tracing tree.
> 
> well, all of earlier series were based on ftrace only,
> but I was given convincing enough arguments that
> perf_even_open+ioctl is a better interface :)
> Ok. will rebase on tip in the next version.

That's fine. If this proves useful, I'll probably add the interface
back to ftrace as well.

-- Steve

^ permalink raw reply

* Re: [PATCH] tpm, tpm_tis: fix TPM 2.0 probing
From: Jarkko Sakkinen @ 2015-02-10 12:50 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Peter Hüwe, Ashley Lai, Marcel Selhorst,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, josh-iaAMLnmF4UmaiuxdJuQwMA,
	christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <54D9F6A0.9010905-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

On Tue, Feb 10, 2015 at 07:16:32AM -0500, Stefan Berger wrote:
> On 02/09/2015 03:39 AM, Jarkko Sakkinen wrote:
> >On Mon, Feb 09, 2015 at 12:08:46AM +0100, Peter Hüwe wrote:
> >>Am Mittwoch, 4. Februar 2015, 15:21:09 schrieb Jarkko Sakkinen:
> >>>If during transmission system error was returned, the logic was to
> >>>incorrectly deduce that chip is a TPM 1.x chip. This patch fixes this
> >>>issue. Also, this patch changes probing so that message tag is used as the
> >>>measure for TPM 2.x, which should be much more stable.
> >>Is it aware that some TPMs may respond with 0x00C1 as TAG for TPM1.2 commands?
> >I guess none of the TPM 1.2 command answer with the tag 0x8002?
> 
> 
> FYI: pdf page 26 , section 6.1 explains the predictable return value for a
> TPM1.2 command seen by a TPM2
> 
> http://www.trustedcomputinggroup.org/files/static_page_files/8C68ADA8-1A4B-B294-D0FC06D3773F7DAA/TPM%20Rev%202.0%20Part%203%20-%20Commands%2001.16-code.pdf
> 
> Following this:
> 
> Sending a TPM1.2 command to a TPM2 should return a TPM1.2 header (tag =
> 0xc4) and error code (TPM_BADTAG = 0x1e)
> 
> Sending a TPM 2 command to a TPM 2 will give a TPM 2 tag in the header.
> Sending a TPM 2 command to a TPM 1.2 will give a TPM 1.2 tag in the header
> and an error code.

Thank you for the information. Do you think that for some reason
tpm2_probe() shoould instead check that value is not this error
instead of checking that tag is 0x80002?

>    Stefan

/Jarkko

^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-10 13:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds
In-Reply-To: <CAMEtUuztTca2jC9Su0YkTHEUHsD4p2PBRyukabrjGO0WfFzdfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, 9 Feb 2015 22:10:45 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> One can argue that current TP_printk format is already an ABI,
> because somebody might be parsing the text output.

If somebody does, then it is an ABI. Luckily, it's not that useful to
parse, thus it hasn't been an issue. As Linus has stated in the past,
it's not that we can't change ABI interfaces, its just that we can not
change them if there's a user space application that depends on it.

The harder we make it for interface changes to break user space, the
better. The field layouts is a user interface. In fact, some of those
fields must always be there. This is because tools do parse the layout
and expect some events to have specific fields. Now we can add new
fields, or even remove fields that no user space tool is using. This is
because today, tools use libtraceevent to parse the event data.

This is why I'm nervous about exporting the parameters of the trace
event call. Right now, those parameters can always change, because the
only way to know they exist is by looking at the code. And currently,
there's no way to interact with those parameters. Once we have eBPF in
mainline, we now have a way to interact with the parameters and if
those parameters change, then the eBPF program will break, and if eBPF
can be part of a user space tool, that will break that tool and
whatever change in the trace point that caused this breakage would have
to be reverted. IOW, this can limit development in the kernel.

Al Viro currently does not let any tracepoint in VFS because he doesn't
want the internals of that code locked to an ABI. He's right to be
worried.

> so in some cases we cannot change tracepoints without
> somebody complaining that his tool broke.
> In other cases tracepoints are used for debugging only
> and no one will notice when they change...
> It was and still a grey area.

Not really. If a tool uses the tracepoint, it can lock that tracepoint
down. This is exactly what latencytop did. It happened, it's not a
hypothetical situation.

> bpf doesn't change any of that.
> It actually makes addition of new tracepoints easier.

I totally disagree. It adds more ways to see inside the kernel, and if
user space depends on this, it adds more ways the kernel can not change.

It comes down to how robust eBPF is with the internals of the kernel
changing. If we limit eBPF to system call tracepoints only, that's
fine because those have the same ABI as the system call itself. I'm
worried about the internal tracepoints for scheduling, irqs, file
systems, etc.

> In the future we might add a tracepoint and pass a single
> pointer to interesting data struct to it. bpf programs will walk
> data structures 'as safe modules' via bpf_fetch*() methods
> without exposing it as ABI.

Will this work if that structure changes? When the field we are looking
for no longer exists?

> whereas today we pass a lot of fields to tracepoints and
> make all of these fields immutable.

The parameters passed to the tracepoint are not shown to userspace and
can change at will. Now, we present the final parsing of the parameters
that convert to fields. As all currently known tools uses
libtraceevent.a, and parse the format files, those fields can move
around and even change in size. The structures are not immutable. The
fields are locked down if user space relies on them. But they can move
about within the tracepoint, because the parsing allows for it.

Remember, these are processed fields. The result of TP_fast_assign()
and what gets put into the ring buffer. Now what is passed to the
actual tracepoint is not visible by userspace, and in lots of cases, it
is just a pointer to some structure. What eBPF brings to the table is a
way to access this structure from user space. What keeps a structured
passed to a tracepoint from becoming immutable if there's a eBPF
program that expects it to have a specific field?

> 
> To me tracepoints are like gdb breakpoints.

Unfortunately, it doesn't matter what you think they are. It doesn't
matter what I think they are. What matters is what Linus thinks they
are and if user space depends on it and Linus decides to revert what
ever change broke that user space program, no matter what we think, we
just screwed ourselves.

I'm being stubborn on this because this is exactly what happened in the
past, which caused every trace point to hold 4 bytes of padding. 4
bytes may not sound like a lot, but when you have 1 million
tracepoints, that's 4 megs of wasted space.

> and bpf programs like live debugger that examine things.

If bpf programs only dealt with kprobes, I may agree. But tracepoints
have already been proven to be a type of ABI. If we open another window
into the kernel, this can screw us later. It's better to solve this now
than when we are fighting with Linus over user space breakage.

> 
> the next step is to be able to write bpf scripts on the fly
> without leaving debugger. Something like perf probe +
> editor + live execution. Truly like gdb for kernel.
> while kernel is running.

What we need is to know if eBPF programs are modules or a user space
interface. If they are a user interface then we need to be extremely
careful here. If they are treated the same as modules, then it would
not add any API. But that hasn't been settled yet, even if we have a
comment in the kernel.

Maybe what we should do is to make eBPF pass the kernel version it was
made for (with all the mod version checks). If it doesn't match, fail
to load it. Perhaps the more eBPF is limited like modules are, the
better chance we have that no eBPF program creates a new ABI.

-- Steve

^ permalink raw reply

* [PATCH v2 1/4] scsi: ufs: add ioctl interface for query request
From: Gilad Broner @ 2015-02-10 13:58 UTC (permalink / raw)
  To: James.Bottomley
  Cc: linux-kernel, linux-scsi, linux-arm-msm, santoshsy,
	linux-scsi-owner, subhashj, ygardi, draviv, Noa Rubens,
	Raviv Shvili, Vinayak Holikatti, James E.J. Bottomley,
	open list:ABI/API
In-Reply-To: <1423576735-26267-1-git-send-email-gbroner@codeaurora.org>

From: Dolev Raviv <draviv@codeaurora.org>

This patch exposes the ioctl interface for UFS driver via SCSI device
ioctl interface. As of now UFS driver would provide the ioctl for query
interface to connected UFS device.

Signed-off-by: Dolev Raviv <draviv@codeaurora.org>
Signed-off-by: Noa Rubens <noag@codeaurora.org>
Signed-off-by: Raviv Shvili <rshvili@codeaurora.org>
Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org>
---
 drivers/scsi/ufs/ufs.h        |  53 +++-------
 drivers/scsi/ufs/ufshcd.c     | 225 +++++++++++++++++++++++++++++++++++++++++-
 include/uapi/scsi/Kbuild      |   1 +
 include/uapi/scsi/ufs/Kbuild  |   3 +
 include/uapi/scsi/ufs/ioctl.h |  57 +++++++++++
 include/uapi/scsi/ufs/ufs.h   |  66 +++++++++++++
 6 files changed, 361 insertions(+), 44 deletions(-)
 create mode 100644 include/uapi/scsi/ufs/Kbuild
 create mode 100644 include/uapi/scsi/ufs/ioctl.h
 create mode 100644 include/uapi/scsi/ufs/ufs.h

diff --git a/drivers/scsi/ufs/ufs.h b/drivers/scsi/ufs/ufs.h
index 42c459a..1f023c4 100644
--- a/drivers/scsi/ufs/ufs.h
+++ b/drivers/scsi/ufs/ufs.h
@@ -38,6 +38,7 @@
 
 #include <linux/mutex.h>
 #include <linux/types.h>
+#include <scsi/ufs/ufs.h>
 
 #define MAX_CDB_SIZE	16
 #define GENERAL_UPIU_REQUEST_SIZE 32
@@ -71,6 +72,16 @@ enum {
 	UFS_UPIU_RPMB_WLUN		= 0xC4,
 };
 
+/**
+ * ufs_is_valid_unit_desc_lun - checks if the given LUN has a unit descriptor
+ * @lun: LU number to check
+ * @return: true if the lun has a matching unit descriptor, false otherwise
+ */
+static inline bool ufs_is_valid_unit_desc_lun(u8 lun)
+{
+	return (lun == UFS_UPIU_RPMB_WLUN || (lun < UFS_UPIU_MAX_GENERAL_LUN));
+}
+
 /*
  * UFS Protocol Information Unit related definitions
  */
@@ -126,35 +137,6 @@ enum {
 	UPIU_QUERY_FUNC_STANDARD_WRITE_REQUEST          = 0x81,
 };
 
-/* Flag idn for Query Requests*/
-enum flag_idn {
-	QUERY_FLAG_IDN_FDEVICEINIT      = 0x01,
-	QUERY_FLAG_IDN_PWR_ON_WPE	= 0x03,
-	QUERY_FLAG_IDN_BKOPS_EN         = 0x04,
-};
-
-/* Attribute idn for Query requests */
-enum attr_idn {
-	QUERY_ATTR_IDN_ACTIVE_ICC_LVL	= 0x03,
-	QUERY_ATTR_IDN_BKOPS_STATUS	= 0x05,
-	QUERY_ATTR_IDN_EE_CONTROL	= 0x0D,
-	QUERY_ATTR_IDN_EE_STATUS	= 0x0E,
-};
-
-/* Descriptor idn for Query requests */
-enum desc_idn {
-	QUERY_DESC_IDN_DEVICE		= 0x0,
-	QUERY_DESC_IDN_CONFIGURAION	= 0x1,
-	QUERY_DESC_IDN_UNIT		= 0x2,
-	QUERY_DESC_IDN_RFU_0		= 0x3,
-	QUERY_DESC_IDN_INTERCONNECT	= 0x4,
-	QUERY_DESC_IDN_STRING		= 0x5,
-	QUERY_DESC_IDN_RFU_1		= 0x6,
-	QUERY_DESC_IDN_GEOMETRY		= 0x7,
-	QUERY_DESC_IDN_POWER		= 0x8,
-	QUERY_DESC_IDN_MAX,
-};
-
 enum desc_header_offset {
 	QUERY_DESC_LENGTH_OFFSET	= 0x00,
 	QUERY_DESC_DESC_TYPE_OFFSET	= 0x01,
@@ -247,19 +229,6 @@ enum bkops_status {
 	BKOPS_STATUS_MAX		 = BKOPS_STATUS_CRITICAL,
 };
 
-/* UTP QUERY Transaction Specific Fields OpCode */
-enum query_opcode {
-	UPIU_QUERY_OPCODE_NOP		= 0x0,
-	UPIU_QUERY_OPCODE_READ_DESC	= 0x1,
-	UPIU_QUERY_OPCODE_WRITE_DESC	= 0x2,
-	UPIU_QUERY_OPCODE_READ_ATTR	= 0x3,
-	UPIU_QUERY_OPCODE_WRITE_ATTR	= 0x4,
-	UPIU_QUERY_OPCODE_READ_FLAG	= 0x5,
-	UPIU_QUERY_OPCODE_SET_FLAG	= 0x6,
-	UPIU_QUERY_OPCODE_CLEAR_FLAG	= 0x7,
-	UPIU_QUERY_OPCODE_TOGGLE_FLAG	= 0x8,
-};
-
 /* Query response result code */
 enum {
 	QUERY_RESULT_SUCCESS                    = 0x00,
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 5d60a86..cb357f8 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -3,7 +3,7 @@
  *
  * This code is based on drivers/scsi/ufs/ufshcd.c
  * Copyright (C) 2011-2013 Samsung India Software Operations
- * Copyright (c) 2013-2014, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2013-2015, The Linux Foundation. All rights reserved.
  *
  * Authors:
  *	Santosh Yaraganavi <santosh.sy@samsung.com>
@@ -39,6 +39,7 @@
 
 #include <linux/async.h>
 #include <linux/devfreq.h>
+#include <scsi/ufs/ioctl.h>
 
 #include "ufshcd.h"
 #include "unipro.h"
@@ -74,6 +75,9 @@
 /* Interrupt aggregation default timeout, unit: 40us */
 #define INT_AGGR_DEF_TO	0x02
 
+/* IOCTL opcode for command - ufs set device read only */
+#define UFS_IOCTL_BLKROSET      BLKROSET
+
 #define ufshcd_toggle_vreg(_dev, _vreg, _on)				\
 	({                                                              \
 		int _ret;                                               \
@@ -1882,7 +1886,7 @@ static inline int ufshcd_read_unit_desc_param(struct ufs_hba *hba,
 	 * Unit descriptors are only available for general purpose LUs (LUN id
 	 * from 0 to 7) and RPMB Well known LU.
 	 */
-	if (lun != UFS_UPIU_RPMB_WLUN && (lun >= UFS_UPIU_MAX_GENERAL_LUN))
+	if (!ufs_is_valid_unit_desc_lun(lun))
 		return -EOPNOTSUPP;
 
 	return ufshcd_read_desc_param(hba, QUERY_DESC_IDN_UNIT, lun,
@@ -4201,6 +4205,222 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
 	ufshcd_probe_hba(hba);
 }
 
+/**
+ * ufshcd_query_ioctl - perform user read queries
+ * @hba: per-adapter instance
+ * @lun: used for lun specific queries
+ * @buffer: user space buffer for reading and submitting query data and params
+ * @return: 0 for success negative error code otherwise
+ *
+ * Expected/Submitted buffer structure is struct ufs_ioctl_query_data.
+ * It will read the opcode, idn and buf_length parameters, and, put the
+ * response in the buffer field while updating the used size in buf_length.
+ */
+static int ufshcd_query_ioctl(struct ufs_hba *hba, u8 lun, void __user *buffer)
+{
+	struct ufs_ioctl_query_data *ioctl_data;
+	int err = 0;
+	int length = 0;
+	void *data_ptr;
+	bool flag;
+	u32 att;
+	u8 index;
+	u8 *desc = NULL;
+
+	ioctl_data = kzalloc(sizeof(struct ufs_ioctl_query_data), GFP_KERNEL);
+	if (!ioctl_data) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* extract params from user buffer */
+	err = copy_from_user(ioctl_data, buffer,
+			sizeof(struct ufs_ioctl_query_data));
+	if (err) {
+		dev_err(hba->dev,
+			"%s: Failed copying buffer from user, err %d\n",
+			__func__, err);
+		goto out_release_mem;
+	}
+
+	/* verify legal parameters & send query */
+	switch (ioctl_data->opcode) {
+	case UPIU_QUERY_OPCODE_READ_DESC:
+		switch (ioctl_data->idn) {
+		case QUERY_DESC_IDN_DEVICE:
+		case QUERY_DESC_IDN_CONFIGURAION:
+		case QUERY_DESC_IDN_INTERCONNECT:
+		case QUERY_DESC_IDN_GEOMETRY:
+		case QUERY_DESC_IDN_POWER:
+			index = 0;
+			break;
+		case QUERY_DESC_IDN_UNIT:
+			if (!ufs_is_valid_unit_desc_lun(lun)) {
+				dev_err(hba->dev,
+					"%s: No unit descriptor for lun 0x%x\n",
+					__func__, lun);
+				err = -EINVAL;
+				goto out_release_mem;
+			}
+			index = lun;
+			break;
+		default:
+			goto out_einval;
+		}
+		length = min_t(int, QUERY_DESC_MAX_SIZE,
+				ioctl_data->buf_size);
+		desc = kzalloc(length, GFP_KERNEL);
+		if (!desc) {
+			dev_err(hba->dev, "%s: Failed allocating %d bytes\n",
+					__func__, length);
+			err = -ENOMEM;
+			goto out_release_mem;
+		}
+		err = ufshcd_query_descriptor(hba, ioctl_data->opcode,
+				ioctl_data->idn, index, 0, desc, &length);
+		break;
+	case UPIU_QUERY_OPCODE_READ_ATTR:
+		switch (ioctl_data->idn) {
+		case QUERY_ATTR_IDN_BOOT_LU_EN:
+		case QUERY_ATTR_IDN_POWER_MODE:
+		case QUERY_ATTR_IDN_ACTIVE_ICC_LVL:
+		case QUERY_ATTR_IDN_OOO_DATA_EN:
+		case QUERY_ATTR_IDN_BKOPS_STATUS:
+		case QUERY_ATTR_IDN_PURGE_STATUS:
+		case QUERY_ATTR_IDN_MAX_DATA_IN:
+		case QUERY_ATTR_IDN_MAX_DATA_OUT:
+		case QUERY_ATTR_IDN_REF_CLK_FREQ:
+		case QUERY_ATTR_IDN_CONF_DESC_LOCK:
+		case QUERY_ATTR_IDN_MAX_NUM_OF_RTT:
+		case QUERY_ATTR_IDN_EE_CONTROL:
+		case QUERY_ATTR_IDN_EE_STATUS:
+		case QUERY_ATTR_IDN_SECONDS_PASSED:
+			index = 0;
+			break;
+		case QUERY_ATTR_IDN_DYN_CAP_NEEDED:
+		case QUERY_ATTR_IDN_CORR_PRG_BLK_NUM:
+			index = lun;
+			break;
+		default:
+			goto out_einval;
+		}
+		err = ufshcd_query_attr(hba, ioctl_data->opcode,
+					ioctl_data->idn, index, 0, &att);
+		break;
+	case UPIU_QUERY_OPCODE_READ_FLAG:
+		switch (ioctl_data->idn) {
+		case QUERY_FLAG_IDN_FDEVICEINIT:
+		case QUERY_FLAG_IDN_PERMANENT_WPE:
+		case QUERY_FLAG_IDN_PWR_ON_WPE:
+		case QUERY_FLAG_IDN_BKOPS_EN:
+		case QUERY_FLAG_IDN_PURGE_ENABLE:
+		case QUERY_FLAG_IDN_FPHYRESOURCEREMOVAL:
+		case QUERY_FLAG_IDN_BUSY_RTC:
+			break;
+		default:
+			goto out_einval;
+		}
+		err = ufshcd_query_flag(hba, ioctl_data->opcode,
+					ioctl_data->idn, &flag);
+		break;
+	default:
+		goto out_einval;
+	}
+
+	if (err) {
+		dev_err(hba->dev, "%s: Query for idn %d failed\n", __func__,
+					ioctl_data->idn);
+		goto out_release_mem;
+	}
+
+	/*
+	 * copy response data
+	 * As we might end up reading less data then what is specified in
+	 * "ioct_data->buf_size". So we are updating "ioct_data->
+	 * buf_size" to what exactly we have read.
+	 */
+	switch (ioctl_data->opcode) {
+	case UPIU_QUERY_OPCODE_READ_DESC:
+		ioctl_data->buf_size = min_t(int, ioctl_data->buf_size, length);
+		data_ptr = desc;
+		break;
+	case UPIU_QUERY_OPCODE_READ_ATTR:
+		ioctl_data->buf_size = sizeof(u32);
+		data_ptr = &att;
+		break;
+	case UPIU_QUERY_OPCODE_READ_FLAG:
+		ioctl_data->buf_size = 1;
+		data_ptr = &flag;
+		break;
+	default:
+		BUG_ON(true);
+	}
+
+	/* copy to user */
+	err = copy_to_user(buffer, ioctl_data,
+			sizeof(struct ufs_ioctl_query_data));
+	if (err)
+		dev_err(hba->dev, "%s: Failed copying back to user.\n",
+			__func__);
+	err = copy_to_user(buffer + sizeof(struct ufs_ioctl_query_data),
+			data_ptr, ioctl_data->buf_size);
+	if (err)
+		dev_err(hba->dev, "%s: err %d copying back to user.\n",
+				__func__, err);
+	goto out_release_mem;
+
+out_einval:
+	dev_err(hba->dev,
+		"%s: illegal ufs query ioctl data, opcode 0x%x, idn 0x%x\n",
+		__func__, ioctl_data->opcode, (unsigned int)ioctl_data->idn);
+	err = -EINVAL;
+out_release_mem:
+	kfree(ioctl_data);
+	kfree(desc);
+out:
+	return err;
+}
+
+/**
+ * ufshcd_ioctl - ufs ioctl callback registered in scsi_host
+ * @dev: scsi device required for per LUN queries
+ * @cmd: command opcode
+ * @buffer: user space buffer for transferring data
+ *
+ * Supported commands:
+ * UFS_IOCTL_QUERY
+ */
+static int ufshcd_ioctl(struct scsi_device *dev, int cmd, void __user *buffer)
+{
+	struct ufs_hba *hba = shost_priv(dev->host);
+	int err = 0;
+
+	BUG_ON(!hba);
+	if (!buffer) {
+		dev_err(hba->dev, "%s: User buffer is NULL!\n", __func__);
+		return -EINVAL;
+	}
+
+	switch (cmd) {
+	case UFS_IOCTL_QUERY:
+		pm_runtime_get_sync(hba->dev);
+		err = ufshcd_query_ioctl(hba, ufshcd_scsi_to_upiu_lun(dev->lun),
+				buffer);
+		pm_runtime_put_sync(hba->dev);
+		break;
+	case UFS_IOCTL_BLKROSET:
+		err = -ENOIOCTLCMD;
+		break;
+	default:
+		err = -EINVAL;
+		dev_err(hba->dev, "%s: Illegal ufs-IOCTL cmd %d\n", __func__,
+				cmd);
+		break;
+	}
+
+	return err;
+}
+
 static struct scsi_host_template ufshcd_driver_template = {
 	.module			= THIS_MODULE,
 	.name			= UFSHCD,
@@ -4213,6 +4433,7 @@ static struct scsi_host_template ufshcd_driver_template = {
 	.eh_abort_handler	= ufshcd_abort,
 	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
 	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
+	.ioctl			= ufshcd_ioctl,
 	.this_id		= -1,
 	.sg_tablesize		= SG_ALL,
 	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
diff --git a/include/uapi/scsi/Kbuild b/include/uapi/scsi/Kbuild
index 75746d5..d404525 100644
--- a/include/uapi/scsi/Kbuild
+++ b/include/uapi/scsi/Kbuild
@@ -1,5 +1,6 @@
 # UAPI Header export list
 header-y += fc/
+header-y += ufs/
 header-y += scsi_bsg_fc.h
 header-y += scsi_netlink.h
 header-y += scsi_netlink_fc.h
diff --git a/include/uapi/scsi/ufs/Kbuild b/include/uapi/scsi/ufs/Kbuild
new file mode 100644
index 0000000..cc3ef20
--- /dev/null
+++ b/include/uapi/scsi/ufs/Kbuild
@@ -0,0 +1,3 @@
+# UAPI Header export list
+header-y += ioctl.h
+header-y += ufs.h
diff --git a/include/uapi/scsi/ufs/ioctl.h b/include/uapi/scsi/ufs/ioctl.h
new file mode 100644
index 0000000..bc4eed7
--- /dev/null
+++ b/include/uapi/scsi/ufs/ioctl.h
@@ -0,0 +1,57 @@
+#ifndef UAPI_UFS_IOCTL_H_
+#define UAPI_UFS_IOCTL_H_
+
+#include <linux/types.h>
+
+/*
+ *  IOCTL opcode for ufs queries has the following opcode after
+ *  SCSI_IOCTL_GET_PCI
+ */
+#define UFS_IOCTL_QUERY			0x5388
+
+/**
+ * struct ufs_ioctl_query_data - used to transfer data to and from user via ioctl
+ * @opcode: type of data to query (descriptor/attribute/flag)
+ * @idn: id of the data structure
+ * @buf_size: number of allocated bytes/data size on return
+ * @buffer: data location
+ *
+ * Received: buffer and buf_size (available space for transferred data)
+ * Submitted: opcode, idn, length, buf_size
+ */
+struct ufs_ioctl_query_data {
+	/*
+	 * User should select one of the opcode defined in "enum query_opcode".
+	 * Please check include/uapi/scsi/ufs/ufs.h for the definition of it.
+	 * Note that only UPIU_QUERY_OPCODE_READ_DESC,
+	 * UPIU_QUERY_OPCODE_READ_ATTR & UPIU_QUERY_OPCODE_READ_FLAG are
+	 * supported as of now. All other query_opcode would be considered
+	 * invalid.
+	 * As of now only read query operations are supported.
+	 */
+	__u32 opcode;
+	/*
+	 * User should select one of the idn from "enum flag_idn" or "enum
+	 * attr_idn" or "enum desc_idn" based on whether opcode above is
+	 * attribute, flag or descriptor.
+	 * Please check include/uapi/scsi/ufs/ufs.h for the definition of it.
+	 */
+	__u8 idn;
+	/*
+	 * User should specify the size of the buffer (buffer[0] below) where
+	 * it wants to read the query data (attribute/flag/descriptor).
+	 * As we might end up reading less data then what is specified in
+	 * buf_size. So we are updating buf_size to what exactly we have read.
+	 */
+	__u16 buf_size;
+	/*
+	 * placeholder for the start of the data buffer where kernel will copy
+	 * the query data (attribute/flag/descriptor) read from the UFS device
+	 * Note:
+	 * For Read Attribute you will have to allocate 4 bytes
+	 * For Read Flag you will have to allocate 1 byte
+	 */
+	__u8 buffer[0];
+};
+
+#endif /* UAPI_UFS_IOCTL_H_ */
diff --git a/include/uapi/scsi/ufs/ufs.h b/include/uapi/scsi/ufs/ufs.h
new file mode 100644
index 0000000..894ea45
--- /dev/null
+++ b/include/uapi/scsi/ufs/ufs.h
@@ -0,0 +1,66 @@
+#ifndef UAPI_UFS_H_
+#define UAPI_UFS_H_
+
+/* Flag idn for Query Requests*/
+enum flag_idn {
+	QUERY_FLAG_IDN_FDEVICEINIT		= 0x01,
+	QUERY_FLAG_IDN_PERMANENT_WPE		= 0x02,
+	QUERY_FLAG_IDN_PWR_ON_WPE		= 0x03,
+	QUERY_FLAG_IDN_BKOPS_EN			= 0x04,
+	QUERY_FLAG_IDN_RESERVED1		= 0x05,
+	QUERY_FLAG_IDN_PURGE_ENABLE		= 0x06,
+	QUERY_FLAG_IDN_RESERVED2		= 0x07,
+	QUERY_FLAG_IDN_FPHYRESOURCEREMOVAL      = 0x08,
+	QUERY_FLAG_IDN_BUSY_RTC			= 0x09,
+};
+
+/* Attribute idn for Query requests */
+enum attr_idn {
+	QUERY_ATTR_IDN_BOOT_LU_EN		= 0x00,
+	QUERY_ATTR_IDN_RESERVED			= 0x01,
+	QUERY_ATTR_IDN_POWER_MODE		= 0x02,
+	QUERY_ATTR_IDN_ACTIVE_ICC_LVL		= 0x03,
+	QUERY_ATTR_IDN_OOO_DATA_EN		= 0x04,
+	QUERY_ATTR_IDN_BKOPS_STATUS		= 0x05,
+	QUERY_ATTR_IDN_PURGE_STATUS		= 0x06,
+	QUERY_ATTR_IDN_MAX_DATA_IN		= 0x07,
+	QUERY_ATTR_IDN_MAX_DATA_OUT		= 0x08,
+	QUERY_ATTR_IDN_DYN_CAP_NEEDED		= 0x09,
+	QUERY_ATTR_IDN_REF_CLK_FREQ		= 0x0A,
+	QUERY_ATTR_IDN_CONF_DESC_LOCK		= 0x0B,
+	QUERY_ATTR_IDN_MAX_NUM_OF_RTT		= 0x0C,
+	QUERY_ATTR_IDN_EE_CONTROL		= 0x0D,
+	QUERY_ATTR_IDN_EE_STATUS		= 0x0E,
+	QUERY_ATTR_IDN_SECONDS_PASSED		= 0x0F,
+	QUERY_ATTR_IDN_CNTX_CONF		= 0x10,
+	QUERY_ATTR_IDN_CORR_PRG_BLK_NUM		= 0x11,
+};
+
+/* Descriptor idn for Query requests */
+enum desc_idn {
+	QUERY_DESC_IDN_DEVICE		= 0x0,
+	QUERY_DESC_IDN_CONFIGURAION	= 0x1,
+	QUERY_DESC_IDN_UNIT		= 0x2,
+	QUERY_DESC_IDN_RFU_0		= 0x3,
+	QUERY_DESC_IDN_INTERCONNECT	= 0x4,
+	QUERY_DESC_IDN_STRING		= 0x5,
+	QUERY_DESC_IDN_RFU_1		= 0x6,
+	QUERY_DESC_IDN_GEOMETRY		= 0x7,
+	QUERY_DESC_IDN_POWER		= 0x8,
+	QUERY_DESC_IDN_RFU_2		= 0x9,
+	QUERY_DESC_IDN_MAX,
+};
+
+/* UTP QUERY Transaction Specific Fields OpCode */
+enum query_opcode {
+	UPIU_QUERY_OPCODE_NOP		= 0x0,
+	UPIU_QUERY_OPCODE_READ_DESC	= 0x1,
+	UPIU_QUERY_OPCODE_WRITE_DESC	= 0x2,
+	UPIU_QUERY_OPCODE_READ_ATTR	= 0x3,
+	UPIU_QUERY_OPCODE_WRITE_ATTR	= 0x4,
+	UPIU_QUERY_OPCODE_READ_FLAG	= 0x5,
+	UPIU_QUERY_OPCODE_SET_FLAG	= 0x6,
+	UPIU_QUERY_OPCODE_CLEAR_FLAG	= 0x7,
+	UPIU_QUERY_OPCODE_TOGGLE_FLAG	= 0x8,
+};
+#endif /* UAPI_UFS_H_ */
-- 
Qualcomm Israel, on behalf of Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply related

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Bryn M. Reeves @ 2015-02-10 14:27 UTC (permalink / raw)
  To: Greg KH
  Cc: Seymour, Shane M,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <20150207040743.GB29944-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Sat, Feb 07, 2015 at 12:07:43PM +0800, Greg KH wrote:
> On Fri, Feb 06, 2015 at 03:41:58PM +0000, Bryn M. Reeves wrote:
> > I can't speak for Shane but wouldn't spend too much time looking at the
> > current v2 patch: it's the result of a pretty ugly compromise suggested
> > on linux-scsi.
> 
> Fair enough, but please feel free to cc: me on the patch that you do
> feel is correct to get a sysfs-related review.

Will do; I'm back from travels this week & will have some time to look at
this.
 
> > Likewise for disk stats: although fluff like maj:min/name etc. has been
> > shuffled a few times the basic fields have remained unchanged for a very
> > long time and sysfs already removes the need to include an identity
> > field.
> 
> We already handle i/o stats just fine, why create a special sysfs
> interface for just a tape device interface?  What makes them so special?

But the iostats use exactly the sort of array file we're talking about:

$ cat /sys/block/sda/stat 
  127644    20869  4320505  2305697   154045    30056  3834036  9065092        0   931842 11371357

And we can't simply extend these to tapes as they are not block devices.
 
> > I understand the fact that you can't change them; I just don't think it's
> > a big problem in this specific case (and much less than some of the
> > more imaginative sysfs content - 2d int arrays with column headers
> > anyone?).
> 
> What sysfs file is a 2d int array?  I'll be glad to fix it.

$ cat /sys/fs/selinux/avc/cache_stats 
lookups hits misses allocations reclaims frees
18938916 18921707 17209 17209 17328 22215
38164283 38146514 17769 17769 16800 19049
18078108 18056991 21117 21117 21344 19305
15168204 15150079 18125 18125 17776 13149
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0

$ cat /sys/fs/selinux/avc/hash_stats
entries: 506
buckets used: 290/512
longest chain: 5

> If you want to measure tens of thousands of tape devices then you
> shouldn't be using sysfs in the first place as it is not designed for
> "speed" at all.  Use the existing i/o rate interfaces instead, don't try
> to cram something into sysfs that doesn't belong there.

So far as I'm aware there is no other way to obtain performance data
for the SCSI tape subsystem (without resorting to ftrace/systemtap).

Regards,
Bryn.

^ permalink raw reply

* Re: [PATCH v3 linux-trace 0/8] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
From: Steven Rostedt @ 2015-02-10 14:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1423539961-21792-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

On Mon,  9 Feb 2015 19:45:53 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> So the overhead of realistic bpf program is 5.05963/4.80074 = ~5%
> which is faster than perf_event filtering: 5.69732/4.80074 = ~18%
> or ftrace filtering: 6.50091/4.80074 = ~35%

Come to think of it, this is comparing apples to oranges, as you move
the filtering before the recording. It would be interesting to see the
ftrace speed up, if it were to use eBPF instead of its own filtering.
Maybe that 35% is the filter part, and not the discard part.

I just tried the dd test with count==1234 and count!=1234 and the one
that drops events is only slightly slower. In this case it does seem
that the most overhead is in the filter logic.

But by moving it before the recording, we can not use the fields
defined in the format files, as the parameters and the fields do not
match in most trace points. And to use the parameters, as I have
stated, there's no interface to know what those parameters are, then
filtering on them is a one shot deal. Might as well write a module and
hook directly to the tracepoint and do the filtering natively. That
would be faster than BPF too.

My point is, what's the use case? If you filter before recording, you
can not use the fields of the tracepoint. That limits you to filtering
only syscalls, and perhaps kprobes.

-- Steve

^ permalink raw reply

* Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)
From: Johannes Weiner @ 2015-02-10 16:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Konstantin Khlebnikov, Michal Hocko, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov
In-Reply-To: <20150206203246.GA16924-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Feb 06, 2015 at 09:32:46PM +0100, Oleg Nesterov wrote:
> Add cc's.
> 
> On 02/06, Oleg Nesterov wrote:
> >
> > And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> > I am totally confused, but I think that put_user() simply should not fail
> > this way. Say, why a syscall should return -EFAULT if memory allocation
> > "silently" fails? Confused.
> 
> Seriously. I must have missed something, but I can't understand 519e52473eb
> "mm: memcg: enable memcg OOM killer only for user faults".
> 
> The changelog says:
> 
> 	System calls and kernel faults (uaccess, gup) can handle an out of
> 	memory situation gracefully and just return -ENOMEM.

The cover letter of the original patch series is hidden in the
changelog a few commits before this one:

commit 94bce453c78996cc4373d5da6cfabe07fcc6d9f9
Author: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Date:   Thu Sep 12 15:13:36 2013 -0700

    arch: mm: remove obsolete init OOM protection

    The memcg code can trap tasks in the context of the failing allocation
    until an OOM situation is resolved.  They can hold all kinds of locks
    (fs, mm) at this point, which makes it prone to deadlocking.

    This series converts memcg OOM handling into a two step process that is
    started in the charge context, but any waiting is done after the fault
    stack is fully unwound.

    Patches 1-4 prepare architecture handlers to support the new memcg
    requirements, but in doing so they also remove old cruft and unify
    out-of-memory behavior across architectures.

    Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
    faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
    handling is restricted to user triggered faults that have no other
    option.

We had reports of systems deadlocking because the allocating task
would wait for the OOM killer to make progress, and the OOM-killed
task would wait for a lock held by the allocating task.  It's
important to note here that the OOM killer currently does not kill a
second task until the current victim has exited, because that victim
has special rights to access to emergency reserves and we don't want
to risk overcommitting those.

To address that deadlock, I decided to no longer invoke memcg OOM
handling from the allocation context directly, but instead remember
the situation in the task_struct and handle it before restarting the
fault.  However, in-kernel faults do not restart, maybe even succeed
despite allocation failures (i.e. readahead), so under the assumption
that they have to handle error returns anyway, I disabled invoking the
memcg OOM killer for in-kernel faults altogether.

> How can a system call know it should return -ENOMEM if put_user() can only
> return -EFAULT ?

I see the problem, but allocations can not be guaranteed to succeed,
not even the OOM killer can reliably make progress, depending on the
state the faulting process is in, the locks it is holding.  So what
can we do if that allocation fails?  Even if we go the route that
Linus proposes and make OOM situations more generic and check them on
*every* return to userspace, the OOM handler at that point might still
kill a task more suited to free memory than the faulting one, and so
we still have to communicate the proper error value to the syscall.

However, I think we could go back to invoking OOM from all allocation
contexts again as long as we change allocator and OOM killer to not
wait for individual OOM victims to exit indefinitely (unless it's a
__GFP_NOFAIL context).  Maybe wait for some time on the first victim
before moving on to the next one.  That would reduce the likelihood of
failing allocations without reinstating the original deadlock.
put_user() could still technically fail due to allocation failure, but
it wouldn't be a very likely event.  Would that be okay?

This does not just apply to memcg allocations, btw.  Physical page
allocations have recently been reported to deadlock in a similar
fashion because of low orders implying __GFP_NOFAIL.  Making the OOM
killer more robust and more eager to make progress would allow us to
remove the implied __GFP_NOFAIL while maintaining reasonable quality
of service in the allocator.

What do you think?

^ permalink raw reply

* [PATCH v2] iio: Export userspace IIO headers
From: Daniel Baluta @ 2015-02-10 16:33 UTC (permalink / raw)
  To: jic23-DgEjT+Ai2ygdnm+yROfE0A, knaack.h-Mmb7MZpHnFY,
	lars-Qo5EllUWu/uELgA04lAiVw, pmeerw-jW+XmwGofnusTnJN9+BGXg
  Cc: irina.tirdea-ral2JQCrhuEAvxtiuMwx3w,
	roberta.dobrescu-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-iio-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

After UAPI header file split [1] all user-kernel interfaces were
placed under include/uapi/.

This patch moves IIO user specific API from:
	* include/linux/iio/events.h => include/uapi/linux/iio/events.h
	* include/linux/types.h => include/uapi/linux/types.h

Now there is no need for nasty tricks to compile userspace programs
(e.g iio_event_monitor). Just installing the kernel headers with
make headers_install command does the job.

[1] http://lwn.net/Articles/507794/

Signed-off-by: Daniel Baluta <daniel.baluta-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
Changes since v1:
	* keep enum iio_event_info and IIO_VAL_* constants in
	include/linux/iio/types.h since they aren't part of the
	ABI.

 include/linux/iio/events.h      | 30 +-------------
 include/linux/iio/types.h       | 78 +----------------------------------
 include/uapi/linux/Kbuild       |  1 +
 include/uapi/linux/iio/Kbuild   |  3 ++
 include/uapi/linux/iio/events.h | 43 +++++++++++++++++++
 include/uapi/linux/iio/types.h  | 91 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 140 insertions(+), 106 deletions(-)
 create mode 100644 include/uapi/linux/iio/Kbuild
 create mode 100644 include/uapi/linux/iio/events.h
 create mode 100644 include/uapi/linux/iio/types.h

diff --git a/include/linux/iio/events.h b/include/linux/iio/events.h
index 03fa332..8ad87d1 100644
--- a/include/linux/iio/events.h
+++ b/include/linux/iio/events.h
@@ -9,22 +9,8 @@
 #ifndef _IIO_EVENTS_H_
 #define _IIO_EVENTS_H_
 
-#include <linux/ioctl.h>
-#include <linux/types.h>
 #include <linux/iio/types.h>
-
-/**
- * struct iio_event_data - The actual event being pushed to userspace
- * @id:		event identifier
- * @timestamp:	best estimate of time of event occurrence (often from
- *		the interrupt handler)
- */
-struct iio_event_data {
-	__u64	id;
-	__s64	timestamp;
-};
-
-#define IIO_GET_EVENT_FD_IOCTL _IOR('i', 0x90, int)
+#include <uapi/linux/iio/events.h>
 
 /**
  * IIO_EVENT_CODE() - create event identifier
@@ -70,18 +56,4 @@ struct iio_event_data {
 #define IIO_UNMOD_EVENT_CODE(chan_type, number, type, direction)	\
 	IIO_EVENT_CODE(chan_type, 0, 0, direction, type, number, 0, 0)
 
-#define IIO_EVENT_CODE_EXTRACT_TYPE(mask) ((mask >> 56) & 0xFF)
-
-#define IIO_EVENT_CODE_EXTRACT_DIR(mask) ((mask >> 48) & 0x7F)
-
-#define IIO_EVENT_CODE_EXTRACT_CHAN_TYPE(mask) ((mask >> 32) & 0xFF)
-
-/* Event code number extraction depends on which type of event we have.
- * Perhaps review this function in the future*/
-#define IIO_EVENT_CODE_EXTRACT_CHAN(mask) ((__s16)(mask & 0xFFFF))
-#define IIO_EVENT_CODE_EXTRACT_CHAN2(mask) ((__s16)(((mask) >> 16) & 0xFFFF))
-
-#define IIO_EVENT_CODE_EXTRACT_MODIFIER(mask) ((mask >> 40) & 0xFF)
-#define IIO_EVENT_CODE_EXTRACT_DIFF(mask) (((mask) >> 55) & 0x1)
-
 #endif
diff --git a/include/linux/iio/types.h b/include/linux/iio/types.h
index 580ed5b..942b6de 100644
--- a/include/linux/iio/types.h
+++ b/include/linux/iio/types.h
@@ -10,76 +10,7 @@
 #ifndef _IIO_TYPES_H_
 #define _IIO_TYPES_H_
 
-enum iio_chan_type {
-	IIO_VOLTAGE,
-	IIO_CURRENT,
-	IIO_POWER,
-	IIO_ACCEL,
-	IIO_ANGL_VEL,
-	IIO_MAGN,
-	IIO_LIGHT,
-	IIO_INTENSITY,
-	IIO_PROXIMITY,
-	IIO_TEMP,
-	IIO_INCLI,
-	IIO_ROT,
-	IIO_ANGL,
-	IIO_TIMESTAMP,
-	IIO_CAPACITANCE,
-	IIO_ALTVOLTAGE,
-	IIO_CCT,
-	IIO_PRESSURE,
-	IIO_HUMIDITYRELATIVE,
-	IIO_ACTIVITY,
-	IIO_STEPS,
-	IIO_ENERGY,
-	IIO_DISTANCE,
-	IIO_VELOCITY,
-};
-
-enum iio_modifier {
-	IIO_NO_MOD,
-	IIO_MOD_X,
-	IIO_MOD_Y,
-	IIO_MOD_Z,
-	IIO_MOD_X_AND_Y,
-	IIO_MOD_X_AND_Z,
-	IIO_MOD_Y_AND_Z,
-	IIO_MOD_X_AND_Y_AND_Z,
-	IIO_MOD_X_OR_Y,
-	IIO_MOD_X_OR_Z,
-	IIO_MOD_Y_OR_Z,
-	IIO_MOD_X_OR_Y_OR_Z,
-	IIO_MOD_LIGHT_BOTH,
-	IIO_MOD_LIGHT_IR,
-	IIO_MOD_ROOT_SUM_SQUARED_X_Y,
-	IIO_MOD_SUM_SQUARED_X_Y_Z,
-	IIO_MOD_LIGHT_CLEAR,
-	IIO_MOD_LIGHT_RED,
-	IIO_MOD_LIGHT_GREEN,
-	IIO_MOD_LIGHT_BLUE,
-	IIO_MOD_QUATERNION,
-	IIO_MOD_TEMP_AMBIENT,
-	IIO_MOD_TEMP_OBJECT,
-	IIO_MOD_NORTH_MAGN,
-	IIO_MOD_NORTH_TRUE,
-	IIO_MOD_NORTH_MAGN_TILT_COMP,
-	IIO_MOD_NORTH_TRUE_TILT_COMP,
-	IIO_MOD_RUNNING,
-	IIO_MOD_JOGGING,
-	IIO_MOD_WALKING,
-	IIO_MOD_STILL,
-	IIO_MOD_ROOT_SUM_SQUARED_X_Y_Z,
-};
-
-enum iio_event_type {
-	IIO_EV_TYPE_THRESH,
-	IIO_EV_TYPE_MAG,
-	IIO_EV_TYPE_ROC,
-	IIO_EV_TYPE_THRESH_ADAPTIVE,
-	IIO_EV_TYPE_MAG_ADAPTIVE,
-	IIO_EV_TYPE_CHANGE,
-};
+#include <uapi/linux/iio/types.h>
 
 enum iio_event_info {
 	IIO_EV_INFO_ENABLE,
@@ -88,13 +19,6 @@ enum iio_event_info {
 	IIO_EV_INFO_PERIOD,
 };
 
-enum iio_event_direction {
-	IIO_EV_DIR_EITHER,
-	IIO_EV_DIR_RISING,
-	IIO_EV_DIR_FALLING,
-	IIO_EV_DIR_NONE,
-};
-
 #define IIO_VAL_INT 1
 #define IIO_VAL_INT_PLUS_MICRO 2
 #define IIO_VAL_INT_PLUS_NANO 3
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 00b10002..5bfc5bd 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -6,6 +6,7 @@ header-y += caif/
 header-y += dvb/
 header-y += hdlc/
 header-y += hsi/
+header-y += iio/
 header-y += isdn/
 header-y += mmc/
 header-y += nfsd/
diff --git a/include/uapi/linux/iio/Kbuild b/include/uapi/linux/iio/Kbuild
new file mode 100644
index 0000000..86f76d8
--- /dev/null
+++ b/include/uapi/linux/iio/Kbuild
@@ -0,0 +1,3 @@
+# UAPI Header export list
+header-y += events.h
+header-y += types.h
diff --git a/include/uapi/linux/iio/events.h b/include/uapi/linux/iio/events.h
new file mode 100644
index 0000000..4b06477
--- /dev/null
+++ b/include/uapi/linux/iio/events.h
@@ -0,0 +1,43 @@
+/* The industrial I/O - event passing to userspace
+ *
+ * Copyright (c) 2008-2011 Jonathan Cameron
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+#ifndef _UAPI_IIO_EVENTS_H_
+#define _UAPI_IIO_EVENTS_H_
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/**
+ * struct iio_event_data - The actual event being pushed to userspace
+ * @id:		event identifier
+ * @timestamp:	best estimate of time of event occurrence (often from
+ *		the interrupt handler)
+ */
+struct iio_event_data {
+	__u64	id;
+	__s64	timestamp;
+};
+
+#define IIO_GET_EVENT_FD_IOCTL _IOR('i', 0x90, int)
+
+#define IIO_EVENT_CODE_EXTRACT_TYPE(mask) ((mask >> 56) & 0xFF)
+
+#define IIO_EVENT_CODE_EXTRACT_DIR(mask) ((mask >> 48) & 0x7F)
+
+#define IIO_EVENT_CODE_EXTRACT_CHAN_TYPE(mask) ((mask >> 32) & 0xFF)
+
+/* Event code number extraction depends on which type of event we have.
+ * Perhaps review this function in the future*/
+#define IIO_EVENT_CODE_EXTRACT_CHAN(mask) ((__s16)(mask & 0xFFFF))
+#define IIO_EVENT_CODE_EXTRACT_CHAN2(mask) ((__s16)(((mask) >> 16) & 0xFFFF))
+
+#define IIO_EVENT_CODE_EXTRACT_MODIFIER(mask) ((mask >> 40) & 0xFF)
+#define IIO_EVENT_CODE_EXTRACT_DIFF(mask) (((mask) >> 55) & 0x1)
+
+#endif /* _UAPI_IIO_EVENTS_H_ */
+
diff --git a/include/uapi/linux/iio/types.h b/include/uapi/linux/iio/types.h
new file mode 100644
index 0000000..56a1529
--- /dev/null
+++ b/include/uapi/linux/iio/types.h
@@ -0,0 +1,91 @@
+/* industrial I/O data types needed both in and out of kernel
+ *
+ * Copyright (c) 2008 Jonathan Cameron
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IIO_TYPES_H_
+#define _UAPI_IIO_TYPES_H_
+
+enum iio_chan_type {
+	IIO_VOLTAGE,
+	IIO_CURRENT,
+	IIO_POWER,
+	IIO_ACCEL,
+	IIO_ANGL_VEL,
+	IIO_MAGN,
+	IIO_LIGHT,
+	IIO_INTENSITY,
+	IIO_PROXIMITY,
+	IIO_TEMP,
+	IIO_INCLI,
+	IIO_ROT,
+	IIO_ANGL,
+	IIO_TIMESTAMP,
+	IIO_CAPACITANCE,
+	IIO_ALTVOLTAGE,
+	IIO_CCT,
+	IIO_PRESSURE,
+	IIO_HUMIDITYRELATIVE,
+	IIO_ACTIVITY,
+	IIO_STEPS,
+	IIO_ENERGY,
+	IIO_DISTANCE,
+	IIO_VELOCITY,
+};
+
+enum iio_modifier {
+	IIO_NO_MOD,
+	IIO_MOD_X,
+	IIO_MOD_Y,
+	IIO_MOD_Z,
+	IIO_MOD_X_AND_Y,
+	IIO_MOD_X_AND_Z,
+	IIO_MOD_Y_AND_Z,
+	IIO_MOD_X_AND_Y_AND_Z,
+	IIO_MOD_X_OR_Y,
+	IIO_MOD_X_OR_Z,
+	IIO_MOD_Y_OR_Z,
+	IIO_MOD_X_OR_Y_OR_Z,
+	IIO_MOD_LIGHT_BOTH,
+	IIO_MOD_LIGHT_IR,
+	IIO_MOD_ROOT_SUM_SQUARED_X_Y,
+	IIO_MOD_SUM_SQUARED_X_Y_Z,
+	IIO_MOD_LIGHT_CLEAR,
+	IIO_MOD_LIGHT_RED,
+	IIO_MOD_LIGHT_GREEN,
+	IIO_MOD_LIGHT_BLUE,
+	IIO_MOD_QUATERNION,
+	IIO_MOD_TEMP_AMBIENT,
+	IIO_MOD_TEMP_OBJECT,
+	IIO_MOD_NORTH_MAGN,
+	IIO_MOD_NORTH_TRUE,
+	IIO_MOD_NORTH_MAGN_TILT_COMP,
+	IIO_MOD_NORTH_TRUE_TILT_COMP,
+	IIO_MOD_RUNNING,
+	IIO_MOD_JOGGING,
+	IIO_MOD_WALKING,
+	IIO_MOD_STILL,
+	IIO_MOD_ROOT_SUM_SQUARED_X_Y_Z,
+};
+
+enum iio_event_type {
+	IIO_EV_TYPE_THRESH,
+	IIO_EV_TYPE_MAG,
+	IIO_EV_TYPE_ROC,
+	IIO_EV_TYPE_THRESH_ADAPTIVE,
+	IIO_EV_TYPE_MAG_ADAPTIVE,
+	IIO_EV_TYPE_CHANGE,
+};
+
+enum iio_event_direction {
+	IIO_EV_DIR_EITHER,
+	IIO_EV_DIR_RISING,
+	IIO_EV_DIR_FALLING,
+	IIO_EV_DIR_NONE,
+};
+
+#endif /* _UAPI_IIO_TYPES_H_ */
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Jason Baron @ 2015-02-10 19:16 UTC (permalink / raw)
  To: Eric Wong
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux FS Devel
In-Reply-To: <20150210044939.GA15616-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>

On 02/09/2015 11:49 PM, Eric Wong wrote:
> Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
>>> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>>>> Epoll file descriptors that are added to a shared wakeup source are always
>>>>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>>>>> fds attached to a shared wakeup source they are all woken up. This can
>>>>>> lead to excessive cpu usage and uneven load distribution.
>>>>>>
>>>>>> This patch introduces two new 'events' flags that are intended to be used
>>>>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>>>>>> source in an exclusive manner such that the minimum number of threads are
>>>>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>>>>> also be added to the 'events' flag, such that we round robin around the set
>>>>>> of waiting threads.
>>>>>>
>>>>>> An implementation note is that in the epoll wakeup routine,
>>>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
>>>>>> wakeup, only when there are current waiters. The idea is to use this additional
>>>>>> heuristic in order minimize wakeup latencies.
>>>>> I don't understand what this is intended to do.
>>>>>
>>>>> If an event has EPOLLONESHOT, then this only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>>>>>
>>>> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
>>> Now I understand.  Why are you using multiple epollfds?
>>>
>>> --Andy
>> So the multiple epollfds is really a way to partition the set of
>> events. Otherwise, I have all the threads contending on all the events
>> that are being generated. So I'm not sure if that is scalable.
> I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
> maxevents value is sufficient for you.  All events would be shared, so
> they can migrate between threads(*).  Each thread takes a largish set of
> events on every epoll_wait call and doesn't call epoll_wait again until
> it's done with the whole set it got.
>
> You'll hit more contention on EPOLL_CTL_MOD with shared events and a
> single epoll, but I think it's a better goal to make that lock-free.

Its not just EPOLL_CTL_MOD, but there's also going to be contention on
epoll add and remove since there is only 1 epoll fd in this case. I'm also
concerned about the balancing of the workload among threads in the single
queue case. I think its quite reasonable to have user-space partition
the set
of events among threads as it sees fit using multiple epoll fds.

However, currently this multiple epoll fd scheme does not handle events from
a shared event source well. As I mentioned there is a thundering herd wakeup
in this case, and the wakeups are unbalanced. In fact, we have an
application
that currently does EPOLL_CTL_REMOVEs followed by EPOLL_CTL_ADDs
periodically against a shared wakeup source in order to re-balance the
wakeup
queues. This solves the balancing issues to an extent, but not the
thundering
herd. I'd like to move this logic down into the kernel with this patch set.

> (*) Too large a maxevents will lead to head-of-line blocking, but from
> what I'm inferring, you already risk that with multiple epollfds and
> separate threads working on them.
>
> Do you have a userland use case to share?

I've been trying to describe the use case, maybe I haven't been doing a good
job :(

>> In the use-case I'm trying to describe, I've partitioned a large set
>> of the events, but there may still be some event sources that we wish
>> to share among all of the threads (or even subsets of them), so as not
>> to overload any one in particular.
>  
>> More specifically, in the case of a single listen socket, its natural
>> to call accept() on the thread that has been woken up, but without
>> doing round robin, you quickly get into a very unbalanced load, and in
>> addition you waste a lot of cpu doing unnecessary wakeups. There are
>> other approaches to solve this, specifically using SO_REUSEPORT, which
>> creates a separate socket per-thread and gets one back to the
>> separately partitioned events case previously described. However,
>> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
>> does not have knowledge of the threads that are actively waiting as
>> the epoll code does.
> Did you try my suggestion of using a dedicated thread (or thread pool)
> which does nothing but loop on accept() + EPOLL_CTL_ADD?
>
> Those dedicated threads could do its own round-robin in userland to pick
> a different epollfd to call EPOLL_CTL_ADD on.

Thanks for your suggestion! I'm not actively working on the user-space
code here, but I will pass it along.

I would prefer though not to have to context switch the 'accept' thread
on and off the cpu every time there is a new connection. So the approach
suggested here essentially moves this dedicated thread (threads), down
into the kernel and avoids the creation of these threads entirely.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
From: Eric Wong @ 2015-02-10 19:32 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel
In-Reply-To: <54DA592A.3050102@akamai.com>

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 11:49 PM, Eric Wong wrote:
> > Do you have a userland use case to share?
> 
> I've been trying to describe the use case, maybe I haven't been doing a good
> job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project which I'll hopefully be
able to share in a few weeks which might be a good public candidate for
epoll performance testing.

> > Did you try my suggestion of using a dedicated thread (or thread pool)
> > which does nothing but loop on accept() + EPOLL_CTL_ADD?
> >
> > Those dedicated threads could do its own round-robin in userland to pick
> > a different epollfd to call EPOLL_CTL_ADD on.
> 
> Thanks for your suggestion! I'm not actively working on the user-space
> code here, but I will pass it along.
> 
> I would prefer though not to have to context switch the 'accept' thread
> on and off the cpu every time there is a new connection. So the approach
> suggested here essentially moves this dedicated thread (threads), down
> into the kernel and avoids the creation of these threads entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
similar behavior to what used to be infinite in TCP_DEFER_ACCEPT in
Linux <= 2.6.31

^ permalink raw reply

* Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)
From: Oleg Nesterov @ 2015-02-10 19:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Konstantin Khlebnikov, Michal Hocko, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-api, Andrew Morton, Linus Torvalds,
	linux-kernel, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov
In-Reply-To: <20150210161941.GB11212@phnom.home.cmpxchg.org>

On 02/10, Johannes Weiner wrote:
>
> We had reports of systems deadlocking because

Yes, yes, to some degree I understand why it was done this way. Not
that I understand the details of course. Thanks for your explanations.

> > How can a system call know it should return -ENOMEM if put_user() can only
> > return -EFAULT ?
>
> I see the problem, but allocations can not be guaranteed to succeed,
> not even the OOM killer can reliably make progress,

Yes sure,

> So what
> can we do if that allocation fails?  Even if we go the route that
> Linus proposes and make OOM situations more generic and check them on
> *every* return to userspace, the OOM handler at that point might still
> kill a task more suited to free memory than the faulting one, and so
> we still have to communicate the proper error value to the syscall.

Yes. To me this means that if a page fault from kernel-space fails because
of VM_FAULT_OOM the task should be killed in any case. Except we should
obviously exclude gup/kthreads.

We can't retry in this case and (say) schedule_tail() simply can't report
or handle the failure. Imho it would be better to kill the task loudly,
perhaps with a warning.

To avoid the confusion. Of course, it is not that I am trying to simply
add send_sig(SIGKILL) into the failure paths. My only point is that,
whatever we do, the "silent" or misleading failure is worse than SIGKILL.

The application can't really "handle an out of memory situation gracefully"
as the changlelog says. Even if put_user() (and thus syscall) could return
-ENOMEM, this doesn't really matter I think.

> However, I think we could go back to invoking OOM from all allocation
> contexts again as long as we change allocator and OOM killer to not
> wait for individual OOM victims to exit indefinitely (unless it's a
> __GFP_NOFAIL context).  Maybe wait for some time on the first victim
> before moving on to the next one.

perhaps... can't really comment, at least right now.

> What do you think?

So far I only think that this problem is not trivial ;)

Oleg.

^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-02-10 19:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds

On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 9 Feb 2015 22:10:45 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> One can argue that current TP_printk format is already an ABI,
>> because somebody might be parsing the text output.
>
> If somebody does, then it is an ABI. Luckily, it's not that useful to
> parse, thus it hasn't been an issue. As Linus has stated in the past,
> it's not that we can't change ABI interfaces, its just that we can not
> change them if there's a user space application that depends on it.

there are already tools that parse trace_pipe:
https://github.com/brendangregg/perf-tools

> and expect some events to have specific fields. Now we can add new
> fields, or even remove fields that no user space tool is using. This is
> because today, tools use libtraceevent to parse the event data.

not all tools use libtraceevent.
gdb calls perf_event_open directly:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
and parses PERF_RECORD_SAMPLE as a binary.
In this case it's branch records, but I think we never said anywhere
that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
in this particular order.

> This is why I'm nervous about exporting the parameters of the trace
> event call. Right now, those parameters can always change, because the
> only way to know they exist is by looking at the code. And currently,
> there's no way to interact with those parameters. Once we have eBPF in
> mainline, we now have a way to interact with the parameters and if
> those parameters change, then the eBPF program will break, and if eBPF
> can be part of a user space tool, that will break that tool and
> whatever change in the trace point that caused this breakage would have
> to be reverted. IOW, this can limit development in the kernel.

it can limit development unless we say that bpf programs
that attach to tracepoints are not part of ABI.
Easy enough to add large comment similar to perf_event.h

> Al Viro currently does not let any tracepoint in VFS because he doesn't
> want the internals of that code locked to an ABI. He's right to be
> worried.

Same with networking bits. We don't want tracepoints to limit
kernel development, but we want debuggability and kernel
analytics.
All existing tracepoints defined via DEFINE_EVENT should
not be an ABI.
But some maintainers think of them as ABI, whereas others
are using them freely. imo it's time to remove ambiguity.

The idea for new style of tracepoints is the following:
introduce new macro: DEFINE_EVENT_USER
and let programs attach to them.
These tracepoint will receive one or two pointers to important
structs only. They will not have TP_printk, assign and fields.
The placement and arguments to these new tracepoints
will be an ABI.
All existing tracepoints are not.

The main reason to attach to tracepoint is that they are
accessible without debug info (unlike kprobe)
Another reason is speed. tracepoints are much faster than
optimized kprobes and for real-time analytics the speed
is critical.

The position of new tracepoints and their arguments
will be an ABI and the programs can be both.
If program is using bpf_fetch*() helpers it obviously
wants to access internal data structures, so
it's really nothing more, but 'safe kernel module'
and kernel layout specific.
Both old and new tracepoints + programs will be used
for live kernel debugging.

If program is accessing user-ized data structures then
it is portable and will run on any kernel.
In uapi header we can define:
struct task_struct_user {
  int pid;
  int prio;
};
and let bpf programs access it via real 'struct task_struct *'
pointer passed into tracepoint.
bpf loader will translate offsets and sizes used inside
the program into real task_struct's offsets and loads.
(all structs are read-only of course)
programs will be fast and kernel independent.
They will be used for analytics (latency, etc)

>> so in some cases we cannot change tracepoints without
>> somebody complaining that his tool broke.
>> In other cases tracepoints are used for debugging only
>> and no one will notice when they change...
>> It was and still a grey area.
>
> Not really. If a tool uses the tracepoint, it can lock that tracepoint
> down. This is exactly what latencytop did. It happened, it's not a
> hypothetical situation.

correct.

>> bpf doesn't change any of that.
>> It actually makes addition of new tracepoints easier.
>
> I totally disagree. It adds more ways to see inside the kernel, and if
> user space depends on this, it adds more ways the kernel can not change.
>
> It comes down to how robust eBPF is with the internals of the kernel
> changing. If we limit eBPF to system call tracepoints only, that's
> fine because those have the same ABI as the system call itself. I'm
> worried about the internal tracepoints for scheduling, irqs, file
> systems, etc.

agree. we need to make it clear that existing tracepoints
+ programs is not ABI.

>> In the future we might add a tracepoint and pass a single
>> pointer to interesting data struct to it. bpf programs will walk
>> data structures 'as safe modules' via bpf_fetch*() methods
>> without exposing it as ABI.
>
> Will this work if that structure changes? When the field we are looking
> for no longer exists?

bpf_fetch*() is the same mechanism as perf probe.
If there is a mistake by user space tools, the program
will be reading some junk, but it won't be crashing.
To be able to debug live kernel we need to see everywhere.
Same way as systemtap loads kernel modules to walk
things inside kernel, bpf programs walk pointers with
bpf_fetch*().
I'm saying that if program is using bpf_fetch*()
it wants to see kernel internals and obviously depends
on particular kernel layout.

>> whereas today we pass a lot of fields to tracepoints and
>> make all of these fields immutable.
>
> The parameters passed to the tracepoint are not shown to userspace and
> can change at will. Now, we present the final parsing of the parameters
> that convert to fields. As all currently known tools uses
> libtraceevent.a, and parse the format files, those fields can move
> around and even change in size. The structures are not immutable. The
> fields are locked down if user space relies on them. But they can move
> about within the tracepoint, because the parsing allows for it.
>
> Remember, these are processed fields. The result of TP_fast_assign()
> and what gets put into the ring buffer. Now what is passed to the
> actual tracepoint is not visible by userspace, and in lots of cases, it
> is just a pointer to some structure. What eBPF brings to the table is a
> way to access this structure from user space. What keeps a structured
> passed to a tracepoint from becoming immutable if there's a eBPF
> program that expects it to have a specific field?

agree. that's fair.
I'm proposing to treat bpf programs that attach to existing
tracepoints as kernel modules that carry no ABI claims.

>> and bpf programs like live debugger that examine things.
>
> If bpf programs only dealt with kprobes, I may agree. But tracepoints
> have already been proven to be a type of ABI. If we open another window
> into the kernel, this can screw us later. It's better to solve this now
> than when we are fighting with Linus over user space breakage.

I'm not sure what's more needed other than adding
large comments into documentation, man pages and sample
code that bpf+existing tracepoint is not an ABI.

> What we need is to know if eBPF programs are modules or a user space
> interface. If they are a user interface then we need to be extremely
> careful here. If they are treated the same as modules, then it would
> not add any API. But that hasn't been settled yet, even if we have a
> comment in the kernel.
>
> Maybe what we should do is to make eBPF pass the kernel version it was
> made for (with all the mod version checks). If it doesn't match, fail
> to load it. Perhaps the more eBPF is limited like modules are, the
> better chance we have that no eBPF program creates a new ABI.

it's easy to add kernel version check and it will be equally
easy for user space to hack it.
imo comments in documentation and samples is good enough.

also not all bpf programs are equal.
bpf+existing tracepoint is not ABI
bpf+new tracepoint is ABI if programs are not using bpf_fetch
bpf+syscall is ABI if programs are not using bpf_fetch
bpf+kprobe is not ABI
bpf+sockets is ABI
At the end we want most of the programs to be written
without assuming anything about kernel internals.
But for live kernel debugging we will write programs
very specific to given kernel layout.

We can categorize the above in non-ambigous via
bpf program type.
Programs with:
BPF_PROG_TYPE_TRACEPOINT - not ABI
BPF_PROG_TYPE_KPROBE - not ABI
BPF_PROG_TYPE_SOCKET_FILTER - ABI

for my proposed 'new tracepoints' we can add type:
BPF_PROG_TYPE_TRACEPOINT_USER - ABI
and disallow calls to bpf_fetch*() for them.
To make it more strict we can do kernel version check
for all prog types that are 'not ABI', but is it really necessary?

To summarize and consolidate other threads:
- I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
it's really orthogonal to this whole discussion.
- will add more comments through out that just because
programs can read tracepoint arguments, they shouldn't
make any assumptions that args stays as-is from version to version
- will work on a patch to demonstrate how few in-kernel
structures can be user-ized and how programs can access
them in version-indepedent way

btw the concept of user-ized data structures already exists
with classic bpf, since 'A = load -0x1000' is translated into
'A = skb->protocol'. I'm thinking of something similar
but more generic and less obscure.

^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-10 21:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <CAMEtUuwAf4AFeWVEdJB4VQz_=2fWY=q8=rbkFg38rtVtNVCHxg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 10 Feb 2015 11:53:22 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> > On Mon, 9 Feb 2015 22:10:45 -0800
> > Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> >
> >> One can argue that current TP_printk format is already an ABI,
> >> because somebody might be parsing the text output.
> >
> > If somebody does, then it is an ABI. Luckily, it's not that useful to
> > parse, thus it hasn't been an issue. As Linus has stated in the past,
> > it's not that we can't change ABI interfaces, its just that we can not
> > change them if there's a user space application that depends on it.
> 
> there are already tools that parse trace_pipe:
> https://github.com/brendangregg/perf-tools

Yep, and if this becomes a standard, then any change that makes
trace_pipe different will be reverted.

> 
> > and expect some events to have specific fields. Now we can add new
> > fields, or even remove fields that no user space tool is using. This is
> > because today, tools use libtraceevent to parse the event data.
> 
> not all tools use libtraceevent.
> gdb calls perf_event_open directly:
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> and parses PERF_RECORD_SAMPLE as a binary.
> In this case it's branch records, but I think we never said anywhere
> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> in this particular order.

What particular order? Note, that's a hardware event, not a software
one.

> 
> > This is why I'm nervous about exporting the parameters of the trace
> > event call. Right now, those parameters can always change, because the
> > only way to know they exist is by looking at the code. And currently,
> > there's no way to interact with those parameters. Once we have eBPF in
> > mainline, we now have a way to interact with the parameters and if
> > those parameters change, then the eBPF program will break, and if eBPF
> > can be part of a user space tool, that will break that tool and
> > whatever change in the trace point that caused this breakage would have
> > to be reverted. IOW, this can limit development in the kernel.
> 
> it can limit development unless we say that bpf programs
> that attach to tracepoints are not part of ABI.
> Easy enough to add large comment similar to perf_event.h

Again, it doesn't matter what we say. Linus made that very clear. He
stated if you provide an interface, and someone uses that interface for
a user space application, and they complain if it breaks, that is just
reason to revert whatever broke it.

> 
> > Al Viro currently does not let any tracepoint in VFS because he doesn't
> > want the internals of that code locked to an ABI. He's right to be
> > worried.
> 
> Same with networking bits. We don't want tracepoints to limit
> kernel development, but we want debuggability and kernel
> analytics.
> All existing tracepoints defined via DEFINE_EVENT should
> not be an ABI.

I agree, but that doesn't make it so :-/

> But some maintainers think of them as ABI, whereas others
> are using them freely. imo it's time to remove ambiguity.

I would love to, and have brought this up at Kernel Summit more than
once with no solution out of it.

> 
> The idea for new style of tracepoints is the following:
> introduce new macro: DEFINE_EVENT_USER
> and let programs attach to them.

We actually have that today. But it's TRACE_EVENT_FLAGS(), although
that should be cleaned up a bit. Frederic added it to label events that
are safe for perf non root. It seems to be used currently only for
syscalls.

> These tracepoint will receive one or two pointers to important
> structs only. They will not have TP_printk, assign and fields.
> The placement and arguments to these new tracepoints
> will be an ABI.
> All existing tracepoints are not.

TP_printk() is not really an issue.

> 
> The main reason to attach to tracepoint is that they are
> accessible without debug info (unlike kprobe)

That is, if you have a special bpf call to access variables, right? How
else do you access part of a data structure.

> Another reason is speed. tracepoints are much faster than
> optimized kprobes and for real-time analytics the speed
> is critical.
> 
> The position of new tracepoints and their arguments
> will be an ABI and the programs can be both.

You means "special tracepoints" one that does export the arguments?

Question is, how many maintainers will add these, knowing that they
will have to be forever maintained as is.

> If program is using bpf_fetch*() helpers it obviously
> wants to access internal data structures, so
> it's really nothing more, but 'safe kernel module'
> and kernel layout specific.
> Both old and new tracepoints + programs will be used
> for live kernel debugging.
> 
> If program is accessing user-ized data structures then

Technically, the TP_struct__entry is a user-ized structure.

> it is portable and will run on any kernel.
> In uapi header we can define:
> struct task_struct_user {
>   int pid;
>   int prio;

Here's a perfect example of something that looks stable to show to
user space, but is really a pimple that is hiding cancer.

Lets start with pid. We have name spaces. What pid will be put there?
We have to show the pid of the name space it is under.

Then we have prio. What is prio in the DEADLINE scheduler. It is rather
meaningless. Also, it is meaningless in SCHED_OTHER.

Also note that even for SCHED_FIFO, the prio is used differently in the
kernel than it is in userspace. For the kernel, lower is higher.

> };
> and let bpf programs access it via real 'struct task_struct *'
> pointer passed into tracepoint.
> bpf loader will translate offsets and sizes used inside
> the program into real task_struct's offsets and loads.

It would need to do more that that. It may have to calculate the value
that it returns, as the internal value may be different with different
kernels.

> (all structs are read-only of course)
> programs will be fast and kernel independent.
> They will be used for analytics (latency, etc)
> 
> >> so in some cases we cannot change tracepoints without
> >> somebody complaining that his tool broke.
> >> In other cases tracepoints are used for debugging only
> >> and no one will notice when they change...
> >> It was and still a grey area.
> >
> > Not really. If a tool uses the tracepoint, it can lock that tracepoint
> > down. This is exactly what latencytop did. It happened, it's not a
> > hypothetical situation.
> 
> correct.
> 
> >> bpf doesn't change any of that.
> >> It actually makes addition of new tracepoints easier.
> >
> > I totally disagree. It adds more ways to see inside the kernel, and if
> > user space depends on this, it adds more ways the kernel can not change.
> >
> > It comes down to how robust eBPF is with the internals of the kernel
> > changing. If we limit eBPF to system call tracepoints only, that's
> > fine because those have the same ABI as the system call itself. I'm
> > worried about the internal tracepoints for scheduling, irqs, file
> > systems, etc.
> 
> agree. we need to make it clear that existing tracepoints
> + programs is not ABI.

The question is, how do we do that. Linus pointed out that comments and
documentation is not enough. We need to have an interface that users
would use before they use one that we do not like them to use.

> 
> >> In the future we might add a tracepoint and pass a single
> >> pointer to interesting data struct to it. bpf programs will walk
> >> data structures 'as safe modules' via bpf_fetch*() methods
> >> without exposing it as ABI.
> >
> > Will this work if that structure changes? When the field we are looking
> > for no longer exists?
> 
> bpf_fetch*() is the same mechanism as perf probe.
> If there is a mistake by user space tools, the program
> will be reading some junk, but it won't be crashing.

But what if the userspace tool depends on that value returning
something meaningful. If it was meaningful in the past, it will have to
be meaningful in the future, even if the internals of the kernel make
it otherwise.

> To be able to debug live kernel we need to see everywhere.
> Same way as systemtap loads kernel modules to walk
> things inside kernel, bpf programs walk pointers with
> bpf_fetch*().

I would agree here too. But again, we really need to be careful about
the interface we expose.

> I'm saying that if program is using bpf_fetch*()
> it wants to see kernel internals and obviously depends
> on particular kernel layout.

Right, but if there's something specific in that kernel layout that it
can decide to do something with, otherwise it breaks, then we need to
worry about it.

eBPF is very flexible, which means it is bound to have someone use it
in a way you never dreamed of, and that will be what bites you in the
end (pun intended).

> 
> >> whereas today we pass a lot of fields to tracepoints and
> >> make all of these fields immutable.
> >
> > The parameters passed to the tracepoint are not shown to userspace and
> > can change at will. Now, we present the final parsing of the parameters
> > that convert to fields. As all currently known tools uses
> > libtraceevent.a, and parse the format files, those fields can move
> > around and even change in size. The structures are not immutable. The
> > fields are locked down if user space relies on them. But they can move
> > about within the tracepoint, because the parsing allows for it.
> >
> > Remember, these are processed fields. The result of TP_fast_assign()
> > and what gets put into the ring buffer. Now what is passed to the
> > actual tracepoint is not visible by userspace, and in lots of cases, it
> > is just a pointer to some structure. What eBPF brings to the table is a
> > way to access this structure from user space. What keeps a structured
> > passed to a tracepoint from becoming immutable if there's a eBPF
> > program that expects it to have a specific field?
> 
> agree. that's fair.
> I'm proposing to treat bpf programs that attach to existing
> tracepoints as kernel modules that carry no ABI claims.

Yeah, we may need to find a way to guarantee that.

> 
> >> and bpf programs like live debugger that examine things.
> >
> > If bpf programs only dealt with kprobes, I may agree. But tracepoints
> > have already been proven to be a type of ABI. If we open another window
> > into the kernel, this can screw us later. It's better to solve this now
> > than when we are fighting with Linus over user space breakage.
> 
> I'm not sure what's more needed other than adding
> large comments into documentation, man pages and sample
> code that bpf+existing tracepoint is not an ABI.

Making it break as soon as a config or kernel version changes ;-)

That may be the only way to guarantee that users do not rely on it.

> 
> > What we need is to know if eBPF programs are modules or a user space
> > interface. If they are a user interface then we need to be extremely
> > careful here. If they are treated the same as modules, then it would
> > not add any API. But that hasn't been settled yet, even if we have a
> > comment in the kernel.
> >
> > Maybe what we should do is to make eBPF pass the kernel version it was
> > made for (with all the mod version checks). If it doesn't match, fail
> > to load it. Perhaps the more eBPF is limited like modules are, the
> > better chance we have that no eBPF program creates a new ABI.
> 
> it's easy to add kernel version check and it will be equally
> easy for user space to hack it.

Well, if the bpf program says it is built for kernel 3.19, but the
user tool fudges it to say it was built for kernel 3.20, but it breaks,
than how can they complain about a regression? The tool itself says it
was built for 3.20 but doesn't work. You need to show that same program
worked on 3.19, but it wont because 3.20 wont work there.

> imo comments in documentation and samples is good enough.

Again, your and my opinion do not matter. It comes down to what Linus
thinks. And if we expose an interface that applications decide to use,
and it breaks when a design in the kernel changes, Linus may revert
that change. The author of that change will not be too happy with us.

> 
> also not all bpf programs are equal.
> bpf+existing tracepoint is not ABI

Why not?

> bpf+new tracepoint is ABI if programs are not using bpf_fetch

How is this different?

> bpf+syscall is ABI if programs are not using bpf_fetch

Well, this is easy. As syscalls are ABI, and the tracepoints for them
match the ABI, it by default becomes an ABI.

> bpf+kprobe is not ABI

Right.

> bpf+sockets is ABI

Right, because sockets themselves are ABI.

> At the end we want most of the programs to be written
> without assuming anything about kernel internals.
> But for live kernel debugging we will write programs
> very specific to given kernel layout.

And here lies the trick. How do we differentiate applications for
everyday use from debugging tools? There's been times when debugging
tools have shown themselves as being so useful they become everyday
use tools.

> 
> We can categorize the above in non-ambigous via
> bpf program type.
> Programs with:
> BPF_PROG_TYPE_TRACEPOINT - not ABI
> BPF_PROG_TYPE_KPROBE - not ABI
> BPF_PROG_TYPE_SOCKET_FILTER - ABI

Again, what enforces this? (hint, it's Linus)

> 
> for my proposed 'new tracepoints' we can add type:
> BPF_PROG_TYPE_TRACEPOINT_USER - ABI
> and disallow calls to bpf_fetch*() for them.
> To make it more strict we can do kernel version check
> for all prog types that are 'not ABI', but is it really necessary?

If we have something that makes it difficult for a tool to work from
one kernel to the next, or ever with different configs, where that tool
will never become a standard, then that should be good enough to keep
it from dictating user ABI.

To give you an example, we thought about scrambling the trace event
field locations from boot to boot to keep tools from hard coding the
event layout. This may sound crazy, but developers out there are crazy.
And if you want to keep them from abusing interfaces, you just need to
be a bit more crazy than they are.

> 
> To summarize and consolidate other threads:
> - I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
> it's really orthogonal to this whole discussion.

Or yous libtraceevent ;-) We really need to finish that and package it
up for distros.

> - will add more comments through out that just because
> programs can read tracepoint arguments, they shouldn't
> make any assumptions that args stays as-is from version to version

We may need to find a way to actually keep it from being as is from
version to version even if the users do not change.

> - will work on a patch to demonstrate how few in-kernel
> structures can be user-ized and how programs can access
> them in version-indepedent way

It will be interesting to see what kernel structures can be user-ized
that are not already used by system calls.

> 
> btw the concept of user-ized data structures already exists
> with classic bpf, since 'A = load -0x1000' is translated into
> 'A = skb->protocol'. I'm thinking of something similar
> but more generic and less obscure.

I have to try to wrap my head around understanding the classic bpf, and
how "load -0x1000" translates to "skb->protocol". Is that documented
somewhere?

-- Steve

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Greg KH @ 2015-02-10 22:30 UTC (permalink / raw)
  To: Bryn M. Reeves
  Cc: Seymour, Shane M,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <20150210142719.GA1437-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

On Tue, Feb 10, 2015 at 02:27:20PM +0000, Bryn M. Reeves wrote:
> On Sat, Feb 07, 2015 at 12:07:43PM +0800, Greg KH wrote:
> 
> $ cat /sys/fs/selinux/avc/cache_stats 
> lookups hits misses allocations reclaims frees
> 18938916 18921707 17209 17209 17328 22215
> 38164283 38146514 17769 17769 16800 19049
> 18078108 18056991 21117 21117 21344 19305
> 15168204 15150079 18125 18125 17776 13149
> 0 0 0 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0 0
> 
> $ cat /sys/fs/selinux/avc/hash_stats
> entries: 506
> buckets used: 290/512
> longest chain: 5

Ugh, those look like they should be debugfs interfaces.  Thanks, I'll
add them to my list of things to nag people about...

greg k-h

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Shaohua Li @ 2015-02-10 22:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michael Kerrisk (man-pages), Michal Hocko, Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150209071553.GC32300@blaptop>

On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > Hi Shaohua,
> > > 
> > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > > 
> > > > Hi Minchan,
> > > > 
> > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > > 
> > > No problem at all. Interest is always win over ignorance.
> > > 
> > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > pressure. Did you try setup with memory pressure?
> > > 
> > > Sure but it was not a huge memory system like yours.
> > 
> > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> > 
> > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > > 
> > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> > 
> > right.
> > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > > 
> > > You mean you modified anon aging logic so it works although there is no swap?
> > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > pages during the aging so although system stall happens more, I don't expect
> > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > about anonymous aging.
> > 
> > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > enabled. Our production environment doesn't enable swap, so I tried to delete
> > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > enabled. I didn't change anything else. With such change, I saw oom
> > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > fast.
> 
> In current VM implementation, it doesn't age anonymous LRU list if we have no
> swap. That's the reason to drop freeing pages instantly.
> I think it could be enhanced later.
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
> 
> > 
> > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > everything is ok. Considering we definitely don't want to waste too much
> > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > introduce big trouble here.
> > > > 
> > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > they can be reclaimed easily?
> > > 
> > > I think it's desirable if the page lived in active LRU.
> > > The reason I didn't that was caused by volatile ranges system call which
> > > was motivaion for MADV_FREE in my mind.
> > > In last LSF/MM, there was concern about data's hotness.
> > > Some of users want to keep that as it is in LRU position, others want to
> > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > hot(head of active list), for example.
> > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > so the decision on my head was not to change LRU order and let's make new
> > > hotness advise if we need it later.
> > > 
> > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > but we couldn't make sure so head of inactive is good compromise.
> > > Another concern about tail of inactive list is that there could be
> > > plenty of pages in there, which was asynchromos write-backed in
> > > previous reclaim path, not-yet reclaimed because of not being able
> > > to free the in softirq context of writeback. It means we ends up
> > > freeing more potential pages to become workingset in advance
> > > than pages VM already decided to evict.
> > 
> > Yes, they are definitely cold pages. I thought We should make sure the
> > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > LRU list, though there might be difficult to determine if we should reclaim
> > writeback pages first or MADV_FREE pages first.
> 
> Frankly speaking, the issue with writeback page is just hurdle of
> implementation, not design so if we could fix it, we might move
> cold pages into tail of the inactive LRU. I tried it but don't have
> time slot to continue these days. Hope to get a time to look soon.
> https://lkml.org/lkml/2014/7/1/628
> Even, it wouldn't be critical problem although we couldn't fix
> the problem of writeback pages because they are already all
> cold pages so it might be not important to keep order in LRU so
> we could save working set and effort of VM to reclaim them
> at the cost of moving all of hinting pages into tail of the LRU
> whenever the syscall is called.
> 
> However, significant problem from my mind is we couldn't make
> sure they are really cold pages. It would be true for allocators
> but it's cache-friendly pages so it might be better to discard
> tail pages of inactive LRU, which are really cold.
> In addition, we couldn't expect all of usecase for MADV_FREE
> so some of users might want to treat them as warm, not cold.
> 
> With moving them into inactive list's head, if we still see
> a lot stall, I think it's a sign to add other logic, for example,
> we could drop MADV_FREEed pages instantly if the zone is below
> low min watermark when the syscall is called. Because everybody
> doesn't like direct reclaim.

So I tried move the MADV_FREE pages to inactive list head or tail. It helps a
little. But there are still stalls/oom. kswapd isn't fast enough to free the
pages, App enters direct reclaim frequently. In one machine, no swap trigger,
but MADV_FREE is 5x slower than MADV_DONTNEED. In another machine, MADV_FREE
trigers a lot of swap and sometimes even oom. app enters direct reclaim and has
a lot of lock contention because of excessive direct reclaim, so there are a
lot of stalls.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [RFC] simple_char: New infrastructure to simplify chardev management
From: Andy Lutomirski @ 2015-02-10 23:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Arnd Bergmann, Jiri Kosina,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Andy Lutomirski

This isn't adequately tested, and I don't have a demonstration (yet).
It's here for review for whether it's a good idea in the first place
and for weather the fully_dynamic mechanism is a good idea.

The current character device interfaces are IMO awful.  There's a
reservation mechanism (alloc_chrdev_region, etc), a bizarre
sort-of-hashtable lookup mechanism that character drivers need to
interact with (cdev_add, etc), and number of lookup stages on open
with various levels of optimization.  Despite all the code, there's no
core infrastructure to map from a dev_t back to a kobject, struct
device, or any other useful device pointer.

This means that lots of drivers need to implement all of this
themselves.  The drivers don't even all seem to do it right.  For
example, the UIO code seems to be missing any protection against
chardev open racing against device removal.

On top of the complexity of writing a chardev driver, the user
interface is odd.  We have /proc/devices, which nothing should use,
since it conveys no information about minor numbers, and minors are
mostly dynamic these days.  Even the major numbers aren't terribly
useful, since sysfs refers to (major, minor) pairs.

This adds simple helpers simple_char_minor_create and
simple_char_minor_free to create and destroy chardev minors.  Common
code handles minor allocation and lookup and provides a simple helper
to allow (and even mostly require!) users to reference count their
devices correctly.

Users can either allocation a traditional dynamic major using
simple_char_major_create, or they can use a global "fully_dynamic"
major and avoid thinking about major numbers at all.

This currently does not integrate with the driver core at all.
Users are responsible for plugging the dev_t into their struct
devices manually.  I couldn't see a clean way to fix this without
integrating all of this into the driver core.

Thoughts?  I want to use this for the u2f driver, which will either be
a chardev driver in its own right or use a simple new iso7816 class.

Ideally we could convert a bunch of drivers to use this, at least
where there are no legacy minor number considerations.

(Note: simple_char users will *not* have their devicename%d indices
match their minor numbers unless they specifically arrange for this to
be the case.  For new drivers, this shouldn't be a problem at all.  I
don't know whether it matters for old drivers.)

Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
---

 drivers/base/Makefile       |   2 +-
 drivers/base/simple_char.c  | 231 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/simple_char.h |  38 ++++++++
 3 files changed, 270 insertions(+), 1 deletion(-)
 create mode 100644 drivers/base/simple_char.c
 create mode 100644 include/linux/simple_char.h

diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 6922cd6850a2..d3832749f74c 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -4,7 +4,7 @@ obj-y			:= component.o core.o bus.o dd.o syscore.o \
 			   driver.o class.o platform.o \
 			   cpu.o firmware.o init.o map.o devres.o \
 			   attribute_container.o transport_class.o \
-			   topology.o container.o
+			   topology.o container.o simple_char.o
 obj-$(CONFIG_DEVTMPFS)	+= devtmpfs.o
 obj-$(CONFIG_DMA_CMA) += dma-contiguous.o
 obj-y			+= power/
diff --git a/drivers/base/simple_char.c b/drivers/base/simple_char.c
new file mode 100644
index 000000000000..f3205ef9e44b
--- /dev/null
+++ b/drivers/base/simple_char.c
@@ -0,0 +1,231 @@
+/*
+ * A simple way to create character devices
+ *
+ * Copyright (c) 2015 Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
+ *
+ * Loosely based, somewhat arbitrarily, on the UIO driver, which is one
+ * of many copies of essentially identical boilerplate.
+ *
+ * Licensed under the GPLv2.
+ */
+
+#include <linux/simple_char.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/sched.h>
+#include <linux/string.h>
+#include <linux/kobject.h>
+#include <linux/cdev.h>
+
+#define MAX_MINORS (1U << MINORBITS)
+
+struct simple_char_major {
+	struct cdev cdev;
+	unsigned majornum;
+	struct idr idr;
+	struct mutex lock;
+};
+
+static struct simple_char_major *fully_dynamic_major;
+static DEFINE_MUTEX(fully_dynamic_major_lock);
+
+static int simple_char_open(struct inode *inode, struct file *filep)
+{
+	struct simple_char_major *major =
+		container_of(inode->i_cdev, struct simple_char_major,
+			     cdev);
+	void *private;
+	const struct simple_char_ops *ops;
+	int ret = 0;
+
+	mutex_lock(&major->lock);
+
+	{
+		/*
+		 * This is a separate block to make the locking entirely
+		 * clear.  The only thing keeping minor alive is major->lock.
+		 * We need to be completely done with the simple_char_minor
+		 * by the time we release the lock.
+		 */
+		struct simple_char_minor *minor;
+		minor = idr_find(&major->idr, iminor(inode));
+		if (!minor || !minor->ops->reference(minor->private)) {
+			mutex_unlock(&major->lock);
+			return -ENODEV;
+		}
+		private = minor->private;
+		ops = minor->ops;
+	}
+
+	mutex_unlock(&major->lock);
+
+	replace_fops(filep, ops->fops);
+	filep->private_data = private;
+	if (ops->fops->open)
+		ret = ops->fops->open(inode, filep);
+
+	return ret;
+}
+
+static const struct file_operations simple_char_fops = {
+	.open = simple_char_open,
+	.llseek = noop_llseek,
+};
+
+struct simple_char_major *simple_char_major_create(const char *name)
+{
+	struct simple_char_major *major = NULL;
+	dev_t devt;
+	int ret;
+
+	ret = alloc_chrdev_region(&devt, 0, MAX_MINORS, name);
+	if (ret)
+		goto out;
+
+	ret = -ENOMEM;
+	major = kmalloc(sizeof(struct simple_char_major), GFP_KERNEL);
+	if (!major)
+		goto out_unregister;
+	cdev_init(&major->cdev, &simple_char_fops);
+	kobject_set_name(&major->cdev.kobj, "%s", name);
+
+	ret = cdev_add(&major->cdev, devt, MAX_MINORS);
+	if (ret)
+		goto out_free;
+
+	major->majornum = MAJOR(devt);
+	idr_init(&major->idr);
+	return major;
+
+out_free:
+	cdev_del(&major->cdev);
+	kfree(major);
+out_unregister:
+	unregister_chrdev_region(devt, MAX_MINORS);
+out:
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(simple_char_major_create);
+
+void simple_char_major_free(struct simple_char_major *major)
+{
+	BUG_ON(!idr_is_empty(&major->idr));
+
+	cdev_del(&major->cdev);
+	unregister_chrdev_region(MKDEV(major->majornum, 0), MAX_MINORS);
+	idr_destroy(&major->idr);
+	kfree(major);
+}
+
+static struct simple_char_major *get_fully_dynamic_major(void)
+{
+	struct simple_char_major *major =
+		smp_load_acquire(&fully_dynamic_major);
+	if (major)
+		return major;
+
+	mutex_lock(&fully_dynamic_major_lock);
+
+	if (fully_dynamic_major) {
+		major = fully_dynamic_major;
+		goto out;
+	}
+
+	major = simple_char_major_create("fully_dynamic");
+	if (!IS_ERR(major))
+		smp_store_release(&fully_dynamic_major, major);
+
+out:
+	mutex_unlock(&fully_dynamic_major_lock);
+	return major;
+	
+}
+
+/**
+ * simple_char_minor_create() - create a chardev minor
+ * @major:	Major to use or NULL for a fully dynamic chardev.
+ * @ops:	simple_char_ops to associate with the minor.
+ * @private:	opaque pointer for @ops's use.
+ *
+ * simple_char_minor_create() creates a minor chardev.  For new code,
+ * @major should be NULL; this will create a minor chardev with fully
+ * dynamic major and minor numbers and without a useful name in
+ * /proc/devices.  (All recent user code should be using sysfs
+ * exclusively to map between devices and device numbers.)  For legacy
+ * code, @major can come from simple_char_major_create().
+ *
+ * The chardev will use @ops->fops for its file operations.  Before any
+ * of those operations are called, the struct file's private_data will
+ * be set to @private.
+ *
+ * To simplify reference counting, @ops->reference will be called before
+ * @ops->fops->open.  @ops->reference should take any needed references
+ * and return true if the object being opened still exists, and it
+ * should return false without taking references if the object is dying.
+ * @ops->reference is called with locks held, so it should neither sleep
+ * nor take heavy locks.
+ *
+ * @ops->fops->release (and @ops->fops->open, if it exists and fails)
+ * are responsible for releasing any references takes by @ops->reference.
+ *
+ * The minor must be destroyed by @simple_char_minor_free.  After
+ * @simple_char_minor_free returns, @ops->reference will not be called.
+ */
+struct simple_char_minor *
+simple_char_minor_create(struct simple_char_major *major,
+			 const struct simple_char_ops *ops,
+			 void *private)
+{
+	int ret;
+	struct simple_char_minor *minor = NULL;
+
+	if (!major) {
+		major = get_fully_dynamic_major();
+		if (IS_ERR(major))
+			return (void *)major;
+	}
+
+	minor = kmalloc(sizeof(struct simple_char_minor), GFP_KERNEL);
+	if (!minor)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&major->lock);
+	ret = idr_alloc(&major->idr, minor, 0, MAX_MINORS, GFP_KERNEL);
+	if (ret >= 0) {
+		minor->devt = MKDEV(major->majornum, ret);
+		ret = 0;
+	}
+	/* Warn on ENOSPC?  It's embarrassing if it ever happens. */
+	mutex_unlock(&major->lock);
+
+	if (ret) {
+		kfree(minor);
+		return ERR_PTR(ret);
+	}
+
+	minor->major = major;
+	minor->private = private;
+	minor->ops = ops;
+	return minor;
+}
+
+/**
+ * simple_char_minor_free() - Free a simple_char chardev minor
+ * @minor:	the minor to free.
+ *
+ * This frees a chardev minor and prevents that minor's @ops->reference
+ * op from being called in the future.
+ */
+void simple_char_minor_free(struct simple_char_minor *minor)
+{
+	mutex_lock(&minor->major->lock);
+	idr_remove(&minor->major->idr, MINOR(minor->devt));
+	mutex_unlock(&minor->major->lock);
+	kfree(minor);
+}
+EXPORT_SYMBOL(simple_char_minor_free);
diff --git a/include/linux/simple_char.h b/include/linux/simple_char.h
new file mode 100644
index 000000000000..8f391e7b50af
--- /dev/null
+++ b/include/linux/simple_char.h
@@ -0,0 +1,38 @@
+/*
+ * A simple way to create character devices
+ *
+ * Copyright (c) 2015 Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
+ *
+ * Licensed under the GPLv2.
+ */
+
+#include <linux/types.h>
+#include <linux/kobject.h>
+#include <linux/file.h>
+
+struct simple_char_major;
+
+struct simple_char_ops {
+	bool (*reference)(void *private);
+	const struct file_operations *fops;
+};
+
+struct simple_char_minor {
+	struct simple_char_major *major;
+	const struct simple_char_ops *ops;
+	void *private;
+	dev_t devt;
+};
+
+extern struct simple_char_minor *
+simple_char_minor_create(struct simple_char_major *major,
+			 const struct simple_char_ops *ops,
+			 void *private);
+extern void simple_char_minor_free(struct simple_char_minor *minor);
+
+extern void simple_char_file_release(struct file *filep, struct kobject *kobj);
+
+/* These exist only to support legacy classes that need their own major. */
+extern struct simple_char_major *simple_char_major_create(const char *name);
+extern void simple_char_major_free(struct simple_char_major *major);
+
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-02-11  0:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 10 Feb 2015 11:53:22 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Mon, 9 Feb 2015 22:10:45 -0800
>> > Alexei Starovoitov <ast@plumgrid.com> wrote:
>> >
>> >> One can argue that current TP_printk format is already an ABI,
>> >> because somebody might be parsing the text output.
>> >
>> > If somebody does, then it is an ABI. Luckily, it's not that useful to
>> > parse, thus it hasn't been an issue. As Linus has stated in the past,
>> > it's not that we can't change ABI interfaces, its just that we can not
>> > change them if there's a user space application that depends on it.
>>
>> there are already tools that parse trace_pipe:
>> https://github.com/brendangregg/perf-tools
>
> Yep, and if this becomes a standard, then any change that makes
> trace_pipe different will be reverted.

I think reading of trace_pipe is widespread.

>> > and expect some events to have specific fields. Now we can add new
>> > fields, or even remove fields that no user space tool is using. This is
>> > because today, tools use libtraceevent to parse the event data.
>>
>> not all tools use libtraceevent.
>> gdb calls perf_event_open directly:
>> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
>> and parses PERF_RECORD_SAMPLE as a binary.
>> In this case it's branch records, but I think we never said anywhere
>> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
>> in this particular order.
>
> What particular order? Note, that's a hardware event, not a software
> one.

yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
when attr.sample_type = IP | ADDR whereas this is an
internal order of 'if' statements inside perf_output_sample()...

>> But some maintainers think of them as ABI, whereas others
>> are using them freely. imo it's time to remove ambiguity.
>
> I would love to, and have brought this up at Kernel Summit more than
> once with no solution out of it.

let's try it again at plumbers in august?

For now I'm going to drop bpf+tracepoints, since it's so controversial
and go with bpf+syscall and bpf+kprobe only.

Hopefully by august it will be clear what bpf+kprobes can do
and why I'm excited about bpf+tracepoints in the future.

>> The idea for new style of tracepoints is the following:
>> introduce new macro: DEFINE_EVENT_USER
>> and let programs attach to them.
>
> We actually have that today. But it's TRACE_EVENT_FLAGS(), although
> that should be cleaned up a bit. Frederic added it to label events that
> are safe for perf non root. It seems to be used currently only for
> syscalls.

I didn't mean to let unprivileged apps to use.
Better name probably would be: DEFINE_EVENT_BPF
and only let bpf programs use it.

>> These tracepoint will receive one or two pointers to important
>> structs only. They will not have TP_printk, assign and fields.
>> The placement and arguments to these new tracepoints
>> will be an ABI.
>> All existing tracepoints are not.
>
> TP_printk() is not really an issue.

I think it is. The way things are printed is the most
visible part of tracepoints and I suspect maintainers don't
want to add new ones, because internal fields are printed
and users do parse trace_pipe.
Recent discussion about tcp instrumentation was
about adding new tracepoints and a module to print them.
As soon as something like this is in, the next question
'what we're going to do when more arguments need
to be printed'...

imo the solution is DEFINE_EVENT_BPF that doesn't
print anything and a bpf program to process it.
Then programs decide what they like to pass to
user space and in what format.
The kernel/user ABI is the position of this new tracepoint
and arguments that are passed in.
bpf program takes care of walking pointers, extracting
interesting fields, aggregating and passing them to
user in the format specific to this particular program.
Then when user wants to collect more fields it just
changes the program and corresponding userland side.

>> The position of new tracepoints and their arguments
>> will be an ABI and the programs can be both.
>
> You means "special tracepoints" one that does export the arguments?
>
> Question is, how many maintainers will add these, knowing that they
> will have to be forever maintained as is.

Obviously we would need to prove usefulness of
tracepoint_bpf before they're accepted.
Hopefully bpf+kprobe will make an impression on its own
and users would want similar but without debug info.

>> If program is using bpf_fetch*() helpers it obviously
>> wants to access internal data structures, so
>> it's really nothing more, but 'safe kernel module'
>> and kernel layout specific.
>> Both old and new tracepoints + programs will be used
>> for live kernel debugging.
>>
>> If program is accessing user-ized data structures then
>
> Technically, the TP_struct__entry is a user-ized structure.
>
>> it is portable and will run on any kernel.
>> In uapi header we can define:
>> struct task_struct_user {
>>   int pid;
>>   int prio;
>
> Here's a perfect example of something that looks stable to show to
> user space, but is really a pimple that is hiding cancer.
>
> Lets start with pid. We have name spaces. What pid will be put there?
> We have to show the pid of the name space it is under.
>
> Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> meaningless. Also, it is meaningless in SCHED_OTHER.
>
> Also note that even for SCHED_FIFO, the prio is used differently in the
> kernel than it is in userspace. For the kernel, lower is higher.

well, ->prio and ->pid are already printed by sched tracepoints
and their meaning depends on scheduler. So users taking that
into account.
I'm not suggesting to preserve the meaning of 'pid' semantically
in all cases. That's not what users would want anyway.
I want to allow programs to access important fields and print
them in more generic way than current TP_printk does.
Then exposed ABI of such tracepoint_bpf is smaller than
with current tracepoints.

>> };
>> and let bpf programs access it via real 'struct task_struct *'
>> pointer passed into tracepoint.
>> bpf loader will translate offsets and sizes used inside
>> the program into real task_struct's offsets and loads.
>
> It would need to do more that that. It may have to calculate the value
> that it returns, as the internal value may be different with different
> kernels.

back to 'prio'... the 'prio' accessible from the program
should be the same 'prio' that we're storing inside task_struct.
No extra conversions.
The bpf-script for sched analytics would want to see
the actual value to make sense of it.

> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.

in some cases... yes. if we make a poor choice of
selecting fields for this 'task_struct_user' then some fields
may disappear from real task_struct, and placeholders
will be left in _user version.
so exposed fields need to be thought through.

In case of networking some of these choices were
already made by classic bpf. Fields:
protocol, pkttype, ifindex, hatype, mark, hash, queue_mapping
have to be reference-able from 'skb' pointer.
They can move from skb to some other struct, change
their sizes, location within sk_buff, etc
They can even disappear, but user visible
struct sk_buff_user will still contain the above.

> eBPF is very flexible, which means it is bound to have someone use it
> in a way you never dreamed of, and that will be what bites you in the
> end (pun intended).

understood :)
let's start slow then with bpf+syscall and bpf+kprobe only.

>> also not all bpf programs are equal.
>> bpf+existing tracepoint is not ABI
>
> Why not?

well, because we want to see more tracepoints in the kernel.
We're already struggling to add more.

>> bpf+new tracepoint is ABI if programs are not using bpf_fetch
>
> How is this different?

the new ones will be explicit by definition.

>> bpf+syscall is ABI if programs are not using bpf_fetch
>
> Well, this is easy. As syscalls are ABI, and the tracepoints for them
> match the ABI, it by default becomes an ABI.
>
>> bpf+kprobe is not ABI
>
> Right.
>
>> bpf+sockets is ABI
>
> Right, because sockets themselves are ABI.
>
>> At the end we want most of the programs to be written
>> without assuming anything about kernel internals.
>> But for live kernel debugging we will write programs
>> very specific to given kernel layout.
>
> And here lies the trick. How do we differentiate applications for
> everyday use from debugging tools? There's been times when debugging
> tools have shown themselves as being so useful they become everyday
> use tools.
>
>>
>> We can categorize the above in non-ambigous via
>> bpf program type.
>> Programs with:
>> BPF_PROG_TYPE_TRACEPOINT - not ABI
>> BPF_PROG_TYPE_KPROBE - not ABI
>> BPF_PROG_TYPE_SOCKET_FILTER - ABI
>
> Again, what enforces this? (hint, it's Linus)
>
>
>>
>> for my proposed 'new tracepoints' we can add type:
>> BPF_PROG_TYPE_TRACEPOINT_USER - ABI
>> and disallow calls to bpf_fetch*() for them.
>> To make it more strict we can do kernel version check
>> for all prog types that are 'not ABI', but is it really necessary?
>
> If we have something that makes it difficult for a tool to work from
> one kernel to the next, or ever with different configs, where that tool
> will never become a standard, then that should be good enough to keep
> it from dictating user ABI.
>
> To give you an example, we thought about scrambling the trace event
> field locations from boot to boot to keep tools from hard coding the
> event layout. This may sound crazy, but developers out there are crazy.
> And if you want to keep them from abusing interfaces, you just need to
> be a bit more crazy than they are.

that is indeed crazy. the point is understood.

right now I cannot think of a solid way to prevent abuse
of bpf+tracepoint, so just going to drop it for now.
Cool things can be done with bpf+kprobe/syscall already.

>> To summarize and consolidate other threads:
>> - I will remove reading of PERF_SAMPLE_RAW in tracex1 example.
>> it's really orthogonal to this whole discussion.
>
> Or yous libtraceevent ;-) We really need to finish that and package it
> up for distros.

sure. some sophisticated example can use it too.

>> - will add more comments through out that just because
>> programs can read tracepoint arguments, they shouldn't
>> make any assumptions that args stays as-is from version to version
>
> We may need to find a way to actually keep it from being as is from
> version to version even if the users do not change.
>
>> - will work on a patch to demonstrate how few in-kernel
>> structures can be user-ized and how programs can access
>> them in version-indepedent way
>
> It will be interesting to see what kernel structures can be user-ized
> that are not already used by system calls.

well, for networking, few fields I mentioned above would
be enough for most of the programs.

>> btw the concept of user-ized data structures already exists
>> with classic bpf, since 'A = load -0x1000' is translated into
>> 'A = skb->protocol'. I'm thinking of something similar
>> but more generic and less obscure.
>
> I have to try to wrap my head around understanding the classic bpf, and
> how "load -0x1000" translates to "skb->protocol". Is that documented
> somewhere?

well, the magic constants are in uapi/linux/filter.h
load -0x1000 = skb->protocol
load -0x1000 + 4 = skb->pkt_type
load -0x1000 + 8 = skb->dev->ifindex
for eBPF I want to clean it up in a way that user program
will see:
struct sk_buff_user {
  int protocol;
  int pkt_type;
  int ifindex;
 ...
};
and C code of bpf program will look normal: skb->ifindex.
bpf loader will do verification and translation of
'load skb_ptr+12' into sequence of loads with correct
internal offsets.
So struct sk_buff_user is only an interface. It won't exist
in such layout in memory.

^ permalink raw reply

* Re: [RFC] simple_char: New infrastructure to simplify chardev management
From: Greg Kroah-Hartman @ 2015-02-11  0:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Jiri Kosina, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <d3ab93173d51cebf00dd2263fd0ce9f8cd6258f7.1423609645.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

On Tue, Feb 10, 2015 at 03:44:05PM -0800, Andy Lutomirski wrote:
> This isn't adequately tested, and I don't have a demonstration (yet).
> It's here for review for whether it's a good idea in the first place
> and for weather the fully_dynamic mechanism is a good idea.
> 
> The current character device interfaces are IMO awful.

That's a total understatement.  Redoing the char interface has been in
my todo list for a decade now.  It's the complexity that happens to be
used by just a handful of drivers that have prevented me from doing the
rework in the past.  Creating a "new" interface that we then port code
to is a very good idea, as it can happen over time in a much more
orderly way.

And we can throw the kernel-janitors people at it once it's working, to
convert the rest of the tree, providing them a useful outlet for their
need for patch cleanups :)

So yes, I'm all for this, thanks so much for looking into this.  I'm at
a conference this week, but will go over it on the plane home and give
you some review comments.

greg k-h

^ permalink raw reply

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Steven Rostedt @ 2015-02-11  0:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman
In-Reply-To: <CAMEtUuzY_Po=WtFEFg1aqzJ8dEF4rHGcWDsaS44KYgACMNPPgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 10 Feb 2015 16:22:50 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> > Yep, and if this becomes a standard, then any change that makes
> > trace_pipe different will be reverted.
> 
> I think reading of trace_pipe is widespread.

I've heard of a few, but nothing that has broken when something changed.

Is it scripts or actual C code?

Again, it matters if people complain about the change.


> >> But some maintainers think of them as ABI, whereas others
> >> are using them freely. imo it's time to remove ambiguity.
> >
> > I would love to, and have brought this up at Kernel Summit more than
> > once with no solution out of it.
> 
> let's try it again at plumbers in august?

Well, we need a statement from Linus. And it would be nice if we could
also get Ingo involved in the discussion, but he seldom comes to
anything but Kernel Summit.

> 
> For now I'm going to drop bpf+tracepoints, since it's so controversial
> and go with bpf+syscall and bpf+kprobe only.

Probably the safest bet.

> 
> Hopefully by august it will be clear what bpf+kprobes can do
> and why I'm excited about bpf+tracepoints in the future.

BTW, I wonder if I could make a simple compiler in the kernel that
would translate the current ftrace filters into a BPF program, where it
could use the program and not use the current filter logic.


> >> These tracepoint will receive one or two pointers to important
> >> structs only. They will not have TP_printk, assign and fields.
> >> The placement and arguments to these new tracepoints
> >> will be an ABI.
> >> All existing tracepoints are not.
> >
> > TP_printk() is not really an issue.
> 
> I think it is. The way things are printed is the most
> visible part of tracepoints and I suspect maintainers don't
> want to add new ones, because internal fields are printed
> and users do parse trace_pipe.
> Recent discussion about tcp instrumentation was
> about adding new tracepoints and a module to print them.
> As soon as something like this is in, the next question
> 'what we're going to do when more arguments need
> to be printed'...

I should rephrase that. It's not that it's not an issue, it's just that
it hasn't been an issue. the trace_pipe code is slow. The
raw_trace_pipe is much faster. Any tool would benefit from using it.

I really need to get a library out to help users do such a thing.


> 
> imo the solution is DEFINE_EVENT_BPF that doesn't
> print anything and a bpf program to process it.

You mean to be completely invisible to ftrace? And the debugfs/tracefs
directory?


> >
> >> it is portable and will run on any kernel.
> >> In uapi header we can define:
> >> struct task_struct_user {
> >>   int pid;
> >>   int prio;
> >
> > Here's a perfect example of something that looks stable to show to
> > user space, but is really a pimple that is hiding cancer.
> >
> > Lets start with pid. We have name spaces. What pid will be put there?
> > We have to show the pid of the name space it is under.
> >
> > Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> > meaningless. Also, it is meaningless in SCHED_OTHER.
> >
> > Also note that even for SCHED_FIFO, the prio is used differently in the
> > kernel than it is in userspace. For the kernel, lower is higher.
> 
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on scheduler. So users taking that
> into account.

I know, and Peter hates this.

> I'm not suggesting to preserve the meaning of 'pid' semantically
> in all cases. That's not what users would want anyway.
> I want to allow programs to access important fields and print
> them in more generic way than current TP_printk does.
> Then exposed ABI of such tracepoint_bpf is smaller than
> with current tracepoints.

Again, this would mean they become invisible to ftrace, and even
ftrace_dump_on_oops.

I'm not fully understanding what is to be exported by this new ABI. If
the fields available, will always be available, then why can't the
appear in a TP_printk()?


> > eBPF is very flexible, which means it is bound to have someone use it
> > in a way you never dreamed of, and that will be what bites you in the
> > end (pun intended).
> 
> understood :)
> let's start slow then with bpf+syscall and bpf+kprobe only.

I'm fine with that.

> 
> >> also not all bpf programs are equal.
> >> bpf+existing tracepoint is not ABI
> >
> > Why not?
> 
> well, because we want to see more tracepoints in the kernel.
> We're already struggling to add more.

Still, the question is, even with a new "tracepoint" that limits what
it shows, there still needs to be something that is guaranteed to
always be there. I still don't see how this is different than the
current tracepoints.

> 
> >> bpf+new tracepoint is ABI if programs are not using bpf_fetch
> >
> > How is this different?
> 
> the new ones will be explicit by definition.

Who monitors this?



> > To give you an example, we thought about scrambling the trace event
> > field locations from boot to boot to keep tools from hard coding the
> > event layout. This may sound crazy, but developers out there are crazy.
> > And if you want to keep them from abusing interfaces, you just need to
> > be a bit more crazy than they are.
> 
> that is indeed crazy. the point is understood.
> 
> right now I cannot think of a solid way to prevent abuse
> of bpf+tracepoint, so just going to drop it for now.

Welcome to our world ;-)

> Cool things can be done with bpf+kprobe/syscall already.

True.

-- Steve

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2015-02-11  0:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Michael Kerrisk (man-pages), Michal Hocko, Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150210223826.GA2342@kernel.org>

Hi Shaohua,

On Tue, Feb 10, 2015 at 02:38:26PM -0800, Shaohua Li wrote:
> On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> > On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > > Hi Shaohua,
> > > > 
> > > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > > > 
> > > > > Hi Minchan,
> > > > > 
> > > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > > > 
> > > > No problem at all. Interest is always win over ignorance.
> > > > 
> > > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > > pressure. Did you try setup with memory pressure?
> > > > 
> > > > Sure but it was not a huge memory system like yours.
> > > 
> > > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> > > 
> > > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > > > 
> > > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> > > 
> > > right.
> > > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > > > 
> > > > You mean you modified anon aging logic so it works although there is no swap?
> > > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > > pages during the aging so although system stall happens more, I don't expect
> > > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > > about anonymous aging.
> > > 
> > > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > > enabled. Our production environment doesn't enable swap, so I tried to delete
> > > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > > enabled. I didn't change anything else. With such change, I saw oom
> > > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > > fast.
> > 
> > In current VM implementation, it doesn't age anonymous LRU list if we have no
> > swap. That's the reason to drop freeing pages instantly.
> > I think it could be enhanced later.
> > http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
> > 
> > > 
> > > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > > everything is ok. Considering we definitely don't want to waste too much
> > > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > > introduce big trouble here.
> > > > > 
> > > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > > they can be reclaimed easily?
> > > > 
> > > > I think it's desirable if the page lived in active LRU.
> > > > The reason I didn't that was caused by volatile ranges system call which
> > > > was motivaion for MADV_FREE in my mind.
> > > > In last LSF/MM, there was concern about data's hotness.
> > > > Some of users want to keep that as it is in LRU position, others want to
> > > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > > hot(head of active list), for example.
> > > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > > so the decision on my head was not to change LRU order and let's make new
> > > > hotness advise if we need it later.
> > > > 
> > > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > > but we couldn't make sure so head of inactive is good compromise.
> > > > Another concern about tail of inactive list is that there could be
> > > > plenty of pages in there, which was asynchromos write-backed in
> > > > previous reclaim path, not-yet reclaimed because of not being able
> > > > to free the in softirq context of writeback. It means we ends up
> > > > freeing more potential pages to become workingset in advance
> > > > than pages VM already decided to evict.
> > > 
> > > Yes, they are definitely cold pages. I thought We should make sure the
> > > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > > LRU list, though there might be difficult to determine if we should reclaim
> > > writeback pages first or MADV_FREE pages first.
> > 
> > Frankly speaking, the issue with writeback page is just hurdle of
> > implementation, not design so if we could fix it, we might move
> > cold pages into tail of the inactive LRU. I tried it but don't have
> > time slot to continue these days. Hope to get a time to look soon.
> > https://lkml.org/lkml/2014/7/1/628
> > Even, it wouldn't be critical problem although we couldn't fix
> > the problem of writeback pages because they are already all
> > cold pages so it might be not important to keep order in LRU so
> > we could save working set and effort of VM to reclaim them
> > at the cost of moving all of hinting pages into tail of the LRU
> > whenever the syscall is called.
> > 
> > However, significant problem from my mind is we couldn't make
> > sure they are really cold pages. It would be true for allocators
> > but it's cache-friendly pages so it might be better to discard
> > tail pages of inactive LRU, which are really cold.
> > In addition, we couldn't expect all of usecase for MADV_FREE
> > so some of users might want to treat them as warm, not cold.
> > 
> > With moving them into inactive list's head, if we still see
> > a lot stall, I think it's a sign to add other logic, for example,
> > we could drop MADV_FREEed pages instantly if the zone is below
> > low min watermark when the syscall is called. Because everybody
> > doesn't like direct reclaim.
> 
> So I tried move the MADV_FREE pages to inactive list head or tail. It helps a
> little. But there are still stalls/oom. kswapd isn't fast enough to free the
> pages, App enters direct reclaim frequently. In one machine, no swap trigger,
> but MADV_FREE is 5x slower than MADV_DONTNEED. In another machine, MADV_FREE

It's expected. MADV_DONTNEED and MADV_FREE is really different.
MADV_DONTNEED is self-sacrificy for others in the system while MADV_FREE is
greedy approach for itself because random process asking the memory could
enter direct reclaim.
However, as I said earlier, we could mitigate the problem by checking
min_free_kbytes. If memory in the system is under min_free_kbytes, it is
pointless to impose reclaim overhead for hinted pages because we alreay
know the hint is "please free when you are trouble with memory" and we got
know it already.

When I test below patch on my 3G machine + 12 CPU + 8G swap with below test
test: 12 processes(each process does 5 iteration: mmap 512M + memset + madvise),

1. MADV_DONTNEED : 41.884sec, sys:3m4.552
2. MADV_FREE : 1m28sec, sys: 5m23
3. MADV_FREE + below patch : 37.188s, sys: 2m20

Could you test?
        
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..da15f8f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -523,7 +523,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		 * XXX: In this implementation, MADV_FREE works like
 		 * MADV_DONTNEED on swapless system or full swap.
 		 */
-		if (get_nr_swap_pages() > 0)
+		if (get_nr_swap_pages() > 0 && min_free_kbytes < nr_free_pages())
 			return madvise_free(vma, prev, start, end);
 		/* passthrough */
 	case MADV_DONTNEED:

> trigers a lot of swap and sometimes even oom. app enters direct reclaim and has
> a lot of lock contention because of excessive direct reclaim, so there are a
> lot of stalls.
> 
> Thanks,
> Shaohua

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-02-11  3:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML,
	Linus Torvalds, Peter Zijlstra, Eric W. Biederman

On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> >> But some maintainers think of them as ABI, whereas others
>> >> are using them freely. imo it's time to remove ambiguity.
>> >
>> > I would love to, and have brought this up at Kernel Summit more than
>> > once with no solution out of it.
>>
>> let's try it again at plumbers in august?
>
> Well, we need a statement from Linus. And it would be nice if we could
> also get Ingo involved in the discussion, but he seldom comes to
> anything but Kernel Summit.

+1

> BTW, I wonder if I could make a simple compiler in the kernel that
> would translate the current ftrace filters into a BPF program, where it
> could use the program and not use the current filter logic.

yep. I've sent that patch last year.
It converted pred_tree into bpf program.
I can try to dig it up. It doesn't provide extra programmability
though, just makes filtering logic much faster.

>> imo the solution is DEFINE_EVENT_BPF that doesn't
>> print anything and a bpf program to process it.
>
> You mean to be completely invisible to ftrace? And the debugfs/tracefs
> directory?

I mean it will be seen in tracefs to get 'id', but without enable/format/filter

>> I'm not suggesting to preserve the meaning of 'pid' semantically
>> in all cases. That's not what users would want anyway.
>> I want to allow programs to access important fields and print
>> them in more generic way than current TP_printk does.
>> Then exposed ABI of such tracepoint_bpf is smaller than
>> with current tracepoints.
>
> Again, this would mean they become invisible to ftrace, and even
> ftrace_dump_on_oops.

yes, since these new tracepoints have no meat inside them.
They're placeholders sitting idle and waiting for bpf to do
something useful with them.

> I'm not fully understanding what is to be exported by this new ABI. If
> the fields available, will always be available, then why can't the
> appear in a TP_printk()?

say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
It will have only one argument 'skb'.
bpf program will read and print skb fields the way it likes
for particular tracing scenario.
So instead of making
TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
gso_type=%#x",...
the abi exposed via trace_pipe (as it is today),
the new tracepoint_bpf abi is presence of 'skb' pointer as one
and only argument to bpf program.
Future refactoring of netif_rx would need to guarantee
that trace_netif_rx_entry(skb) is called. that's it.
imo such tracepoints are much easier to deal with during
code changes.

May be some of the existing tracepoints like this one that
takes one argument can be marked 'bpf-ready', so that
programs can attach to them only.

>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
> I'm fine with that.

thanks. will wait for merge window to close and will repost.

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Serge E. Hallyn @ 2015-02-11  3:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Serge E. Hallyn, Richard Weinberger, Linux API,
	Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150107233553.GC28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> > >> The -o SUBSYS option doesn't exist.  Jesus, at least get yourself
> > >> familiar with the basics before claiming random stuff.
> > 
> > Oh let's see I got that command line option out of /proc/mounts and yes
> > it works.  Perhaps it doesn't if I invoke unified hiearchies but the
> > option does in fact exist and work.
> 
> I meant the -o SUBSYS doesn't exist for unified hierarchy.
> 
> > Now I really do need to test report regressions, and send probably send
> > regression fixes.  If I understand your strange ranting I think you just
> > told me that option that -o SUBSYS does work with unified hierarchies.
> 
> What?  Why would -O SUBSYS exist for unified hierarchy?  It's unified
> for all controllers.
> 
> > Tejun.  I asked you specifically about this case 2 years ago at plumbers
> > and you personally told me this would continue to work.  I am going to
> > hold you to that.
> 
> I have no idea what you're talking about in *THIS* thread.  I'm fully
> aware of what was discussed *THEN*.
> 
> > Fixing bugs is one thing.  Gratuitious regressions that make supporting
> > existing user space applications insane is another.
> 
> Can you explain what problem you're actually trying to talk about
> without spouting random claims about regressions?

A few weeks ago, in order to test the cgroup namespace patchset with lxc,
I went through the motions of getting lxc to work with unified hierarchy.
A few of the things I had to change:

1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0.  Used
to start with 1.  I expect many userspace parsers to be broken by this.

2. After creating every non-leaf cgroup, we must fill in the
cgroup.subtree_cgroups file.  This is extra work which userspace
doesn't have to do right now.

3. Let's say we want to create a freezer cgroup /foo/bar for some set of
tasks, which they will administer.  In fact let's assume we are going to
use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
and then write 'freezer' into /foo/bar.  (If we're not using cgroup
namespaces, then we have to do a similar thing to let the tasks administer
/foo/bar while placing them under /foo/bar/leaf).  The oddness I'm pointing
to is where the tasks have to know that they can create cgroups in "..".

For containers this becomes odd.  We tend to group containers by the
tasks in and under a cgroup.  We now will have to assume a convention
where we know to check for tasks in and under "..", since by definition
pid 1's cgroup (in a container) cannot have children.

4. The per-cgroup "tasks" file not existing seems odd, although certainly
unexpected by much current software.

So, if the unified hierarchy is going to not cause undue pain, existing
software really needs to start working now to use it.  It's going to be
a sizeable task for lxc.

-serge

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-02-11  4:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, Ingo Molnar, Eric W. Biederman, Linux API,
	cgroups mailinglist
In-Reply-To: <20150211034616.GA25022-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>

On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0.  Used
> to start with 1.  I expect many userspace parsers to be broken by this.

This is intentional.  The unified hierarchy will always have the
hierarchy number zero.  Userland needs to be updated anyway and the
unified hierarchy won't show up unless explicitly enabled.

> 2. After creating every non-leaf cgroup, we must fill in the
> cgroup.subtree_cgroups file.  This is extra work which userspace
> doesn't have to do right now.

Again, by design.  This is how organization and control are separated
and the differing levels of granularity is achieved.

> 3. Let's say we want to create a freezer cgroup /foo/bar for some set of

There shouldn't be a "freezer" cgroup.  The processes are categorized
according to their logical structure and controllers are applied to
the hierarchy as necessary.

> tasks, which they will administer.  In fact let's assume we are going to
> use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
> the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> and then write 'freezer' into /foo/bar.  (If we're not using cgroup
> namespaces, then we have to do a similar thing to let the tasks administer
> /foo/bar while placing them under /foo/bar/leaf).  The oddness I'm pointing
> to is where the tasks have to know that they can create cgroups in "..".
> 
> For containers this becomes odd.  We tend to group containers by the
> tasks in and under a cgroup.  We now will have to assume a convention
> where we know to check for tasks in and under "..", since by definition
> pid 1's cgroup (in a container) cannot have children.

The semantics is that the parent enables distribution of its given
type of resource by enabling the controller in its subtree_control.
This scoping isn't necessary for freezer and I'm debating whether to
enable controllers which don't need granularity control to be enabled
unconditionally.  Right now, I'm leaning against it mostly for
consistency.

> 4. The per-cgroup "tasks" file not existing seems odd, although certainly
> unexpected by much current software.

And, yes, everything is per-process for reasons described in
unified-hierarchy.txt.

> So, if the unified hierarchy is going to not cause undue pain, existing
> software really needs to start working now to use it.  It's going to be
> a sizeable task for lxc.

Yes, this isn't gonna be a trivial conversion.  The usage model
changes and so will a lot of controller knobs and behaviors.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Serge E. Hallyn @ 2015-02-11  4:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, Eric W. Biederman, Richard Weinberger, Linux API,
	Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andy Lutomirski, cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150211040957.GC21356-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> > 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0.  Used
> > to start with 1.  I expect many userspace parsers to be broken by this.
> 
> This is intentional.  The unified hierarchy will always have the
> hierarchy number zero.  Userland needs to be updated anyway and the
> unified hierarchy won't show up unless explicitly enabled.
> 
> > 2. After creating every non-leaf cgroup, we must fill in the
> > cgroup.subtree_cgroups file.  This is extra work which userspace
> > doesn't have to do right now.
> 
> Again, by design.  This is how organization and control are separated
> and the differing levels of granularity is achieved.
> 
> > 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
> 
> There shouldn't be a "freezer" cgroup.  The processes are categorized
> according to their logical structure and controllers are applied to
> the hierarchy as necessary.

But there can well be cgroups for which only freezer is enabled.  If
I'm wrong about that, then I am suffering a fundamental misunderstanding.

> > tasks, which they will administer.  In fact let's assume we are going to
> > use cgroup namespaces.  We have to put the tasks into /foo/bar, unshare
> > the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> > and then write 'freezer' into /foo/bar.  (If we're not using cgroup
> > namespaces, then we have to do a similar thing to let the tasks administer
> > /foo/bar while placing them under /foo/bar/leaf).  The oddness I'm pointing
> > to is where the tasks have to know that they can create cgroups in "..".
> > 
> > For containers this becomes odd.  We tend to group containers by the
> > tasks in and under a cgroup.  We now will have to assume a convention
> > where we know to check for tasks in and under "..", since by definition
> > pid 1's cgroup (in a container) cannot have children.
> 
> The semantics is that the parent enables distribution of its given
> type of resource by enabling the controller in its subtree_control.
> This scoping isn't necessary for freezer and I'm debating whether to
> enable controllers which don't need granularity control to be enabled
> unconditionally.  Right now, I'm leaning against it mostly for
> consistency.

Yeah, IIUC (i.e. freezer would always be enabled?) that would be
even-more-confusing.

> > 4. The per-cgroup "tasks" file not existing seems odd, although certainly
> > unexpected by much current software.
> 
> And, yes, everything is per-process for reasons described in
> unified-hierarchy.txt.
> 
> > So, if the unified hierarchy is going to not cause undue pain, existing
> > software really needs to start working now to use it.  It's going to be
> > a sizeable task for lxc.
> 
> Yes, this isn't gonna be a trivial conversion.  The usage model
> changes and so will a lot of controller knobs and behaviors.
> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox