* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Richard Guy Briggs @ 2019-05-30 20:37 UTC (permalink / raw)
To: Paul Moore
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhQhkzCtVOXhPL7BzaqvF0y+8gBQwhOo1EQDS2OUyZbV5g@mail.gmail.com>
On 2019-05-30 10:34, Paul Moore wrote:
> On Thu, May 30, 2019 at 10:20 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > On 2019-05-29 18:16, Paul Moore wrote:
> > > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >
> > > > Implement audit container identifier filtering using the AUDIT_CONTID
> > > > field name to send an 8-character string representing a u64 since the
> > > > value field is only u32.
> > > >
> > > > Sending it as two u32 was considered, but gathering and comparing two
> > > > fields was more complex.
> > > >
> > > > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> > > >
> > > > Please see the github audit kernel issue for the contid filter feature:
> > > > https://github.com/linux-audit/audit-kernel/issues/91
> > > > Please see the github audit userspace issue for filter additions:
> > > > https://github.com/linux-audit/audit-userspace/issues/40
> > > > Please see the github audit testsuiite issue for the test case:
> > > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > > Please see the github audit wiki for the feature overview:
> > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > > ---
> > > > include/linux/audit.h | 1 +
> > > > include/uapi/linux/audit.h | 5 ++++-
> > > > kernel/audit.h | 1 +
> > > > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > kernel/auditsc.c | 4 ++++
> > > > 5 files changed, 57 insertions(+), 1 deletion(-)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > > > index 63f8b3f26fab..407b5bb3b4c6 100644
> > > > --- a/kernel/auditfilter.c
> > > > +++ b/kernel/auditfilter.c
> > > > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > > > }
> > > > }
> > > >
> > > > +int audit_comparator64(u64 left, u32 op, u64 right)
> > > > +{
> > > > + switch (op) {
> > > > + case Audit_equal:
> > > > + return (left == right);
> > > > + case Audit_not_equal:
> > > > + return (left != right);
> > > > + case Audit_lt:
> > > > + return (left < right);
> > > > + case Audit_le:
> > > > + return (left <= right);
> > > > + case Audit_gt:
> > > > + return (left > right);
> > > > + case Audit_ge:
> > > > + return (left >= right);
> > > > + case Audit_bitmask:
> > > > + return (left & right);
> > > > + case Audit_bittest:
> > > > + return ((left & right) == right);
> > > > + default:
> > > > + BUG();
> > >
> > > A little birdy mentioned the BUG() here as a potential issue and while
> > > I had ignored it in earlier patches because this is likely a
> > > cut-n-paste from another audit comparator function, I took a closer
> > > look this time. It appears as though we will never have an invalid op
> > > value as audit_data_to_entry()/audit_to_op() ensure that the op value
> > > is a a known good value. Removing the BUG() from all the audit
> > > comparators is a separate issue, but I think it would be good to
> > > remove it from this newly added comparator; keeping it so that we
> > > return "0" in the default case seems reasoanble.
> >
> > Fair enough. That BUG(); can be removed.
>
> Please send a fixup patch for this.
The fixup patch is trivial. The rebase to v5.2-rc1 audit/next had merge
conflicts with four recent patchsets. It may be simpler to submit a new
patchset and look at a diff of the two sets. I'm testing the rebase
now.
> paul moore www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-30 19:29 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tycho Andersen, Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
ebiederm, nhorman
In-Reply-To: <20190530170913.GA16722@mail.hallyn.com>
On Thu, May 30, 2019 at 1:09 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
...
> > > > > > The current thinking
> > > > > > is that you would only change the audit container ID from one
> > > > > > set/inherited value to another if you were nesting containers, in
> > > > > > which case the nested container orchestrator would need to be granted
> > > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > > compromise).
> > >
> > > won't work in user namespaced containers, because they will never be
> > > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > > nesting as is. But maybe nobody cares :)
> >
> > That's fun :)
> >
> > To be honest, I've never been a big fan of supporting nested
> > containers from an audit perspective, so I'm not really too upset
> > about this. The k8s/cri-o folks seem okay with this, or at least I
> > haven't heard any objections; lxc folks, what do you have to say?
>
> I actually thought the answer to this (when last I looked, "some time" ago)
> was that userspace should track an audit message saying "task X in
> container Y is changing its auditid to Z", and then decide to also track Z.
> This should be doable, but a lot of extra work in userspace.
>
> Per-userns containerids would also work. So task X1 is in containerid
> 1 on the host and creates a new task Y in new userns; it continues to
> be reported in init_user_ns as containerid 1 forever; but in its own
> userns it can request to be known as some other containerid. Audit
> socks would be per-userns, allowing root in a container to watch for
> audit events in its own (and descendent) namespaces.
>
> But again I'm sure we've gone over all this in the last few years.
>
> I suppose we can look at this as a "first step", and talk about
> making it user-ns-nestable later. But agreed it's not useful in a
> lot of situations as is.
[REMINDER: It is an "*audit* container ID" and not a general
"container ID" ;) Smiley aside, I'm not kidding about that part.]
I'm not interested in supporting/merging something that isn't useful;
if this doesn't work for your use case then we need to figure out what
would work. It sounds like nested containers are much more common in
the lxc world, can you elaborate a bit more on this?
As far as the possible solutions you mention above, I'm not sure I
like the per-userns audit container IDs, I'd much rather just emit the
necessary tracking information via the audit record stream and let the
log analysis tools figure it out. However, the bigger question is how
to limit (re)setting the audit container ID when you are in a non-init
userns. For reasons already mentioned, using capable() is a non
starter for everything but the initial userns, and using ns_capable()
is equally poor as it essentially allows any userns the ability to
munge it's audit container ID (obviously not good). It appears we
need a different method for controlling access to the audit container
ID.
Punting this to a LSM hook is an obvious thing to do, and something we
might want to do anyway, but currently audit doesn't rely on the LSM
for proper/safe operation and I'm not sure I want to change that now.
The next obvious thing is to create some sort of access control knob
in audit itself. Perhaps an auditctl operation that would allow the
administrator to specify which containers, via their corresponding
audit container IDs, are allowed to change their audit container ID?
The permission granting would need to be done in the init userns, but
it would allow containers with a non-init userns the ability to change
their audit container ID. We would probably still want a
ns_capable(CAP_AUDIT_CONTROL) restriction in this case.
Does anyone else have any other ideas?
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH v3 16/16] fpga: dfl: fme: add performance reporting support
From: Greg KH @ 2019-05-30 19:03 UTC (permalink / raw)
To: Wu Hao
Cc: atull, mdf, linux-fpga, linux-kernel, linux-api, Luwei Kang,
Xu Yilun
In-Reply-To: <1558934546-12171-17-git-send-email-hao.wu@intel.com>
On Mon, May 27, 2019 at 01:22:26PM +0800, Wu Hao wrote:
> --- /dev/null
> +++ b/drivers/fpga/dfl-fme-perf.c
> @@ -0,0 +1,962 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for FPGA Management Engine (FME) Global Performance Reporting
> + *
> + * Copyright 2019 Intel Corporation, Inc.
> + *
> + * Authors:
> + * Kang Luwei <luwei.kang@intel.com>
> + * Xiao Guangrong <guangrong.xiao@linux.intel.com>
> + * Wu Hao <hao.wu@intel.com>
> + * Joseph Grecco <joe.grecco@intel.com>
> + * Enno Luebbers <enno.luebbers@intel.com>
> + * Tim Whisonant <tim.whisonant@intel.com>
> + * Ananda Ravuri <ananda.ravuri@intel.com>
> + * Mitchel, Henry <henry.mitchel@intel.com>
> + */
> +
> +#include "dfl.h"
> +#include "dfl-fme.h"
> +
> +/*
> + * Performance Counter Registers for Cache.
> + *
> + * Cache Events are listed below as CACHE_EVNT_*.
> + */
> +#define CACHE_CTRL 0x8
> +#define CACHE_RESET_CNTR BIT_ULL(0)
> +#define CACHE_FREEZE_CNTR BIT_ULL(8)
> +#define CACHE_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define CACHE_EVNT_RD_HIT 0x0
> +#define CACHE_EVNT_WR_HIT 0x1
> +#define CACHE_EVNT_RD_MISS 0x2
> +#define CACHE_EVNT_WR_MISS 0x3
> +#define CACHE_EVNT_RSVD 0x4
> +#define CACHE_EVNT_HOLD_REQ 0x5
> +#define CACHE_EVNT_DATA_WR_PORT_CONTEN 0x6
> +#define CACHE_EVNT_TAG_WR_PORT_CONTEN 0x7
> +#define CACHE_EVNT_TX_REQ_STALL 0x8
> +#define CACHE_EVNT_RX_REQ_STALL 0x9
> +#define CACHE_EVNT_EVICTIONS 0xa
> +#define CACHE_EVNT_MAX CACHE_EVNT_EVICTIONS
> +#define CACHE_CHANNEL_SEL BIT_ULL(20)
> +#define CACHE_CHANNEL_RD 0
> +#define CACHE_CHANNEL_WR 1
> +#define CACHE_CHANNEL_MAX 2
> +#define CACHE_CNTR0 0x10
> +#define CACHE_CNTR1 0x18
> +#define CACHE_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +#define CACHE_CNTR_EVNT GENMASK_ULL(63, 60)
> +
> +/*
> + * Performance Counter Registers for Fabric.
> + *
> + * Fabric Events are listed below as FAB_EVNT_*
> + */
> +#define FAB_CTRL 0x20
> +#define FAB_RESET_CNTR BIT_ULL(0)
> +#define FAB_FREEZE_CNTR BIT_ULL(8)
> +#define FAB_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define FAB_EVNT_PCIE0_RD 0x0
> +#define FAB_EVNT_PCIE0_WR 0x1
> +#define FAB_EVNT_PCIE1_RD 0x2
> +#define FAB_EVNT_PCIE1_WR 0x3
> +#define FAB_EVNT_UPI_RD 0x4
> +#define FAB_EVNT_UPI_WR 0x5
> +#define FAB_EVNT_MMIO_RD 0x6
> +#define FAB_EVNT_MMIO_WR 0x7
> +#define FAB_EVNT_MAX FAB_EVNT_MMIO_WR
> +#define FAB_PORT_ID GENMASK_ULL(21, 20)
> +#define FAB_PORT_FILTER BIT_ULL(23)
> +#define FAB_PORT_FILTER_DISABLE 0
> +#define FAB_PORT_FILTER_ENABLE 1
> +#define FAB_CNTR 0x28
> +#define FAB_CNTR_EVNT_CNTR GENMASK_ULL(59, 0)
> +#define FAB_CNTR_EVNT GENMASK_ULL(63, 60)
> +
> +/*
> + * Performance Counter Registers for Clock.
> + *
> + * Clock Counter can't be reset or frozen by SW.
> + */
> +#define CLK_CNTR 0x30
> +
> +/*
> + * Performance Counter Registers for IOMMU / VT-D.
> + *
> + * VT-D Events are listed below as VTD_EVNT_* and VTD_SIP_EVNT_*
> + */
> +#define VTD_CTRL 0x38
> +#define VTD_RESET_CNTR BIT_ULL(0)
> +#define VTD_FREEZE_CNTR BIT_ULL(8)
> +#define VTD_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define VTD_EVNT_AFU_MEM_RD_TRANS 0x0
> +#define VTD_EVNT_AFU_MEM_WR_TRANS 0x1
> +#define VTD_EVNT_AFU_DEVTLB_RD_HIT 0x2
> +#define VTD_EVNT_AFU_DEVTLB_WR_HIT 0x3
> +#define VTD_EVNT_DEVTLB_4K_FILL 0x4
> +#define VTD_EVNT_DEVTLB_2M_FILL 0x5
> +#define VTD_EVNT_DEVTLB_1G_FILL 0x6
> +#define VTD_EVNT_MAX VTD_EVNT_DEVTLB_1G_FILL
> +#define VTD_CNTR 0x40
> +#define VTD_CNTR_EVNT GENMASK_ULL(63, 60)
> +#define VTD_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +#define VTD_SIP_CTRL 0x48
> +#define VTD_SIP_RESET_CNTR BIT_ULL(0)
> +#define VTD_SIP_FREEZE_CNTR BIT_ULL(8)
> +#define VTD_SIP_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define VTD_SIP_EVNT_IOTLB_4K_HIT 0x0
> +#define VTD_SIP_EVNT_IOTLB_2M_HIT 0x1
> +#define VTD_SIP_EVNT_IOTLB_1G_HIT 0x2
> +#define VTD_SIP_EVNT_SLPWC_L3_HIT 0x3
> +#define VTD_SIP_EVNT_SLPWC_L4_HIT 0x4
> +#define VTD_SIP_EVNT_RCC_HIT 0x5
> +#define VTD_SIP_EVNT_IOTLB_4K_MISS 0x6
> +#define VTD_SIP_EVNT_IOTLB_2M_MISS 0x7
> +#define VTD_SIP_EVNT_IOTLB_1G_MISS 0x8
> +#define VTD_SIP_EVNT_SLPWC_L3_MISS 0x9
> +#define VTD_SIP_EVNT_SLPWC_L4_MISS 0xa
> +#define VTD_SIP_EVNT_RCC_MISS 0xb
> +#define VTD_SIP_EVNT_MAX VTD_SIP_EVNT_RCC_MISS
> +#define VTD_SIP_CNTR 0X50
> +#define VTD_SIP_CNTR_EVNT GENMASK_ULL(63, 60)
> +#define VTD_SIP_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +
> +#define PERF_OBJ_ROOT_ID (~0)
> +
> +#define PERF_TIMEOUT 30
> +
> +/**
> + * struct perf_object - object of performance counter
> + *
> + * @id: instance id. PERF_OBJ_ROOT_ID indicates it is a parent object which
> + * counts performance counters for all instances.
> + * @attr_groups: the sysfs files are associated with this object.
> + * @feature: pointer to related private feature.
> + * @node: used to link itself to parent's children list.
> + * @children: used to link its children objects together.
> + * @kobj: generic kobject interface.
> + *
> + * 'node' and 'children' are used to construct parent-children hierarchy.
> + */
> +struct perf_object {
> + int id;
> + const struct attribute_group **attr_groups;
> + struct dfl_feature *feature;
> +
> + struct list_head node;
> + struct list_head children;
> + struct kobject kobj;
Woah, why are you using a "raw" kobject and not a 'struct device' here?
You just broke userspace and no libraries will see your kobject's
properties as the "chain" of struct devices is not happening anymore.
Why can this not just be a 'struct device'?
> +};
> +
> +/**
> + * struct perf_obj_attribute - attribute of perf object
> + *
> + * @attr: attribute of this perf object.
> + * @show: show callback for sysfs attribute.
> + * @store: store callback for sysfs attribute.
> + */
> +struct perf_obj_attribute {
> + struct attribute attr;
> + ssize_t (*show)(struct perf_object *pobj, char *buf);
> + ssize_t (*store)(struct perf_object *pobj,
> + const char *buf, size_t n);
> +};
> +
> +#define to_perf_obj_attr(_attr) \
> + container_of(_attr, struct perf_obj_attribute, attr)
> +#define to_perf_obj(_kobj) \
> + container_of(_kobj, struct perf_object, kobj)
> +
> +#define __POBJ_ATTR(_name, _mode, _show, _store) { \
> + .attr = {.name = __stringify(_name), \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .show = _show, \
> + .store = _store, \
> +}
> +
> +#define PERF_OBJ_ATTR_F_RO(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0444, _name##_show, NULL)
> +
> +#define PERF_OBJ_ATTR_F_WO(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0200, NULL, _name##_store)
> +
> +#define PERF_OBJ_ATTR_F_RW(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0644, _name##_show, _name##_store)
> +
> +#define PERF_OBJ_ATTR_RO(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0444, _name##_show, NULL)
> +
> +#define PERF_OBJ_ATTR_WO(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0200, NULL, _name##_store)
> +
> +#define PERF_OBJ_ATTR_RW(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0644, _name##_show, _name##_store)
When you have to roll your own sysfs attributes for a single driver,
that is a HUGE hint you are doing something wrong. No driver for an
individual device should EVER have to do this.
Please use the driver core properly and do not route around it.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH v3 00/15] fs-verity: read-only file-based authenticity protection
From: Eric Biggers @ 2019-05-30 18:54 UTC (permalink / raw)
To: linux-fscrypt
Cc: Theodore Y . Ts'o, linux-api, linux-f2fs-devel, linux-fsdevel,
Jaegeuk Kim, linux-integrity, linux-ext4, Victor Hsieh
In-Reply-To: <20190523161811.6259-1-ebiggers@kernel.org>
On Thu, May 23, 2019 at 09:17:56AM -0700, Eric Biggers wrote:
> Hello,
>
> This is a redesigned version of the fs-verity patchset, implementing
> Ted's suggestion to build the Merkle tree in the kernel
> (https://lore.kernel.org/linux-fsdevel/20190207031101.GA7387@mit.edu/).
Does anyone have any comments on this patchset?
- Eric
^ permalink raw reply
* Re: [PATCH v3 16/16] fpga: dfl: fme: add performance reporting support
From: Alan Tull @ 2019-05-30 18:53 UTC (permalink / raw)
To: Wu Hao
Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
Xu Yilun
In-Reply-To: <1558934546-12171-17-git-send-email-hao.wu@intel.com>
On Mon, May 27, 2019 at 12:39 AM Wu Hao <hao.wu@intel.com> wrote:
Hi Hao,
Just one correction that I saw below, sorry I didn't catch it last time.
>
> This patch adds support for performance reporting private feature
> for FPGA Management Engine (FME). Actually it supports 4 categories
> performance counters, 'clock', 'cache', 'iommu' and 'fabric', user
> could read the performance counter via exposed sysfs interfaces.
> Please refer to sysfs doc for more details.
>
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
Acked-by: Alan Tull <atull@kernel.org>
> ---
> v3: replace scnprintf with sprintf in sysfs interfaces.
> update sysfs doc kernel version and date.
> fix sysfs doc issue for fabric counter.
> refine PERF_OBJ_ATTR_* macro, doesn't count on __ATTR anymore.
> introduce PERF_OBJ_ATTR_F_* macro, as it needs to use different
> filenames for some of the sysfs attributes.
> remove kobject_del when destroy kobject, kobject_put is enough.
> do sysfs_remove_groups first when destroying perf_obj.
> WARN_ON_ONCE in case internal parms are wrong in read_*_count().
> ---
> Documentation/ABI/testing/sysfs-platform-dfl-fme | 93 +++
> drivers/fpga/Makefile | 1 +
> drivers/fpga/dfl-fme-main.c | 4 +
> drivers/fpga/dfl-fme-perf.c | 962 +++++++++++++++++++++++
> drivers/fpga/dfl-fme.h | 2 +
> drivers/fpga/dfl.c | 1 +
> drivers/fpga/dfl.h | 2 +
> 7 files changed, 1065 insertions(+)
> create mode 100644 drivers/fpga/dfl-fme-perf.c
>
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index 5a8938f..63a02cb 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -119,3 +119,96 @@ Description: Write-only. Write error code to this file to clear all errors
> logged in errors, first_error and next_error. Write fails with
> -EINVAL if input parsing fails or input error code doesn't
> match.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/clock
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read for Accelerator Function Unit (AFU) clock
> + counter.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/cache/freeze
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Write. Read this file for the current status of 'cache'
> + category performance counters, and Write '1' or '0' to freeze
> + or unfreeze 'cache' performance counters.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/cache/<counter>
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read 'cache' category performance counters:
> + read_hit, read_miss, write_hit, write_miss, hold_request,
> + data_write_port_contention, tag_write_port_contention,
> + tx_req_stall, rx_req_stall and rx_eviction.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/iommu/freeze
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Write. Read this file for the current status of 'iommu'
> + category performance counters, and Write '1' or '0' to freeze
> + or unfreeze 'iommu' performance counters.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/iommu/<sip_counter>
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read 'iommu' category 'sip' sub category
> + performance counters: iotlb_4k_hit, iotlb_2m_hit,
> + iotlb_1g_hit, slpwc_l3_hit, slpwc_l4_hit, rcc_hit,
> + rcc_miss, iotlb_4k_miss, iotlb_2m_miss, iotlb_1g_miss,
> + slpwc_l3_miss and slpwc_l4_miss.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/iommu/afu0/<counter>
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read 'iommu' category 'afuX' sub category
> + performance counters: read_transaction, write_transaction,
> + devtlb_read_hit, devtlb_write_hit, devtlb_4k_fill,
> + devtlb_2m_fill and devtlb_1g_fill.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/fabric/freeze
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Write. Read this file for the current status of 'fabric'
> + category performance counters, and Write '1' or '0' to freeze
> + or unfreeze 'fabric' performance counters.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/fabric/<counter>
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read 'fabric' category performance counters:
> + pcie0_read, pcie0_write, pcie1_read, pcie1_write,
> + upi_read, upi_write, mmio_read and mmio_write.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/fabric/enable
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Write. Read this file for current status of device level
> + fabric counters. Write "1" to enable device level fabric
> + counters. Once device level fabric counters are enabled, port
> + level fabric counters will be disabled automatically.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/fabric/port0/<counter>
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Only. Read 'fabric' category "portX" sub category
> + performance counters: pcie0_read, pcie0_write, pcie1_read,
> + pcie1_write, upi_read, upi_write and mmio_read.
> +
> +What: /sys/bus/platform/devices/dfl-fme.0/perf/fabric/port0/enable
> +Date: May 2019
> +KernelVersion: 5.3
> +Contact: Wu Hao <hao.wu@intel.com>
> +Description: Read-Write. Read this file for current status of port level
> + fabric counters. Write "1" to enable port level fabric counters.
> + Once port level fabric counters are enabled, device level fabric
> + counters will be disabled automatically.
> diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> index 4865b74..d8e21df 100644
> --- a/drivers/fpga/Makefile
> +++ b/drivers/fpga/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_FPGA_DFL_FME_REGION) += dfl-fme-region.o
> obj-$(CONFIG_FPGA_DFL_AFU) += dfl-afu.o
>
> dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o dfl-fme-error.o
> +dfl-fme-objs += dfl-fme-perf.o
> dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> dfl-afu-objs += dfl-afu-error.o
>
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 4490cf4..4a5b25d 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -231,6 +231,10 @@ static long fme_hdr_ioctl(struct platform_device *pdev,
> .ops = &fme_global_err_ops,
> },
> {
> + .id_table = fme_perf_id_table,
> + .ops = &fme_perf_ops,
> + },
> + {
> .ops = NULL,
> },
> };
> diff --git a/drivers/fpga/dfl-fme-perf.c b/drivers/fpga/dfl-fme-perf.c
> new file mode 100644
> index 0000000..6eb6c89
> --- /dev/null
> +++ b/drivers/fpga/dfl-fme-perf.c
> @@ -0,0 +1,962 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for FPGA Management Engine (FME) Global Performance Reporting
> + *
> + * Copyright 2019 Intel Corporation, Inc.
> + *
> + * Authors:
> + * Kang Luwei <luwei.kang@intel.com>
> + * Xiao Guangrong <guangrong.xiao@linux.intel.com>
> + * Wu Hao <hao.wu@intel.com>
> + * Joseph Grecco <joe.grecco@intel.com>
> + * Enno Luebbers <enno.luebbers@intel.com>
> + * Tim Whisonant <tim.whisonant@intel.com>
> + * Ananda Ravuri <ananda.ravuri@intel.com>
> + * Mitchel, Henry <henry.mitchel@intel.com>
> + */
> +
> +#include "dfl.h"
> +#include "dfl-fme.h"
> +
> +/*
> + * Performance Counter Registers for Cache.
> + *
> + * Cache Events are listed below as CACHE_EVNT_*.
> + */
> +#define CACHE_CTRL 0x8
> +#define CACHE_RESET_CNTR BIT_ULL(0)
> +#define CACHE_FREEZE_CNTR BIT_ULL(8)
> +#define CACHE_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define CACHE_EVNT_RD_HIT 0x0
> +#define CACHE_EVNT_WR_HIT 0x1
> +#define CACHE_EVNT_RD_MISS 0x2
> +#define CACHE_EVNT_WR_MISS 0x3
> +#define CACHE_EVNT_RSVD 0x4
> +#define CACHE_EVNT_HOLD_REQ 0x5
> +#define CACHE_EVNT_DATA_WR_PORT_CONTEN 0x6
> +#define CACHE_EVNT_TAG_WR_PORT_CONTEN 0x7
> +#define CACHE_EVNT_TX_REQ_STALL 0x8
> +#define CACHE_EVNT_RX_REQ_STALL 0x9
> +#define CACHE_EVNT_EVICTIONS 0xa
> +#define CACHE_EVNT_MAX CACHE_EVNT_EVICTIONS
> +#define CACHE_CHANNEL_SEL BIT_ULL(20)
> +#define CACHE_CHANNEL_RD 0
> +#define CACHE_CHANNEL_WR 1
> +#define CACHE_CHANNEL_MAX 2
> +#define CACHE_CNTR0 0x10
> +#define CACHE_CNTR1 0x18
> +#define CACHE_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +#define CACHE_CNTR_EVNT GENMASK_ULL(63, 60)
> +
> +/*
> + * Performance Counter Registers for Fabric.
> + *
> + * Fabric Events are listed below as FAB_EVNT_*
> + */
> +#define FAB_CTRL 0x20
> +#define FAB_RESET_CNTR BIT_ULL(0)
> +#define FAB_FREEZE_CNTR BIT_ULL(8)
> +#define FAB_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define FAB_EVNT_PCIE0_RD 0x0
> +#define FAB_EVNT_PCIE0_WR 0x1
> +#define FAB_EVNT_PCIE1_RD 0x2
> +#define FAB_EVNT_PCIE1_WR 0x3
> +#define FAB_EVNT_UPI_RD 0x4
> +#define FAB_EVNT_UPI_WR 0x5
> +#define FAB_EVNT_MMIO_RD 0x6
> +#define FAB_EVNT_MMIO_WR 0x7
> +#define FAB_EVNT_MAX FAB_EVNT_MMIO_WR
> +#define FAB_PORT_ID GENMASK_ULL(21, 20)
> +#define FAB_PORT_FILTER BIT_ULL(23)
> +#define FAB_PORT_FILTER_DISABLE 0
> +#define FAB_PORT_FILTER_ENABLE 1
> +#define FAB_CNTR 0x28
> +#define FAB_CNTR_EVNT_CNTR GENMASK_ULL(59, 0)
> +#define FAB_CNTR_EVNT GENMASK_ULL(63, 60)
> +
> +/*
> + * Performance Counter Registers for Clock.
> + *
> + * Clock Counter can't be reset or frozen by SW.
> + */
> +#define CLK_CNTR 0x30
> +
> +/*
> + * Performance Counter Registers for IOMMU / VT-D.
> + *
> + * VT-D Events are listed below as VTD_EVNT_* and VTD_SIP_EVNT_*
> + */
> +#define VTD_CTRL 0x38
> +#define VTD_RESET_CNTR BIT_ULL(0)
> +#define VTD_FREEZE_CNTR BIT_ULL(8)
> +#define VTD_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define VTD_EVNT_AFU_MEM_RD_TRANS 0x0
> +#define VTD_EVNT_AFU_MEM_WR_TRANS 0x1
> +#define VTD_EVNT_AFU_DEVTLB_RD_HIT 0x2
> +#define VTD_EVNT_AFU_DEVTLB_WR_HIT 0x3
> +#define VTD_EVNT_DEVTLB_4K_FILL 0x4
> +#define VTD_EVNT_DEVTLB_2M_FILL 0x5
> +#define VTD_EVNT_DEVTLB_1G_FILL 0x6
> +#define VTD_EVNT_MAX VTD_EVNT_DEVTLB_1G_FILL
> +#define VTD_CNTR 0x40
> +#define VTD_CNTR_EVNT GENMASK_ULL(63, 60)
> +#define VTD_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +#define VTD_SIP_CTRL 0x48
> +#define VTD_SIP_RESET_CNTR BIT_ULL(0)
> +#define VTD_SIP_FREEZE_CNTR BIT_ULL(8)
> +#define VTD_SIP_CTRL_EVNT GENMASK_ULL(19, 16)
> +#define VTD_SIP_EVNT_IOTLB_4K_HIT 0x0
> +#define VTD_SIP_EVNT_IOTLB_2M_HIT 0x1
> +#define VTD_SIP_EVNT_IOTLB_1G_HIT 0x2
> +#define VTD_SIP_EVNT_SLPWC_L3_HIT 0x3
> +#define VTD_SIP_EVNT_SLPWC_L4_HIT 0x4
> +#define VTD_SIP_EVNT_RCC_HIT 0x5
> +#define VTD_SIP_EVNT_IOTLB_4K_MISS 0x6
> +#define VTD_SIP_EVNT_IOTLB_2M_MISS 0x7
> +#define VTD_SIP_EVNT_IOTLB_1G_MISS 0x8
> +#define VTD_SIP_EVNT_SLPWC_L3_MISS 0x9
> +#define VTD_SIP_EVNT_SLPWC_L4_MISS 0xa
> +#define VTD_SIP_EVNT_RCC_MISS 0xb
> +#define VTD_SIP_EVNT_MAX VTD_SIP_EVNT_RCC_MISS
> +#define VTD_SIP_CNTR 0X50
> +#define VTD_SIP_CNTR_EVNT GENMASK_ULL(63, 60)
> +#define VTD_SIP_CNTR_EVNT_CNTR GENMASK_ULL(47, 0)
> +
> +#define PERF_OBJ_ROOT_ID (~0)
> +
> +#define PERF_TIMEOUT 30
> +
> +/**
> + * struct perf_object - object of performance counter
> + *
> + * @id: instance id. PERF_OBJ_ROOT_ID indicates it is a parent object which
> + * counts performance counters for all instances.
> + * @attr_groups: the sysfs files are associated with this object.
> + * @feature: pointer to related private feature.
> + * @node: used to link itself to parent's children list.
> + * @children: used to link its children objects together.
> + * @kobj: generic kobject interface.
> + *
> + * 'node' and 'children' are used to construct parent-children hierarchy.
> + */
> +struct perf_object {
> + int id;
> + const struct attribute_group **attr_groups;
> + struct dfl_feature *feature;
> +
> + struct list_head node;
> + struct list_head children;
> + struct kobject kobj;
> +};
> +
> +/**
> + * struct perf_obj_attribute - attribute of perf object
> + *
> + * @attr: attribute of this perf object.
> + * @show: show callback for sysfs attribute.
> + * @store: store callback for sysfs attribute.
> + */
> +struct perf_obj_attribute {
> + struct attribute attr;
> + ssize_t (*show)(struct perf_object *pobj, char *buf);
> + ssize_t (*store)(struct perf_object *pobj,
> + const char *buf, size_t n);
> +};
> +
> +#define to_perf_obj_attr(_attr) \
> + container_of(_attr, struct perf_obj_attribute, attr)
> +#define to_perf_obj(_kobj) \
> + container_of(_kobj, struct perf_object, kobj)
> +
> +#define __POBJ_ATTR(_name, _mode, _show, _store) { \
> + .attr = {.name = __stringify(_name), \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .show = _show, \
> + .store = _store, \
> +}
> +
> +#define PERF_OBJ_ATTR_F_RO(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0444, _name##_show, NULL)
> +
> +#define PERF_OBJ_ATTR_F_WO(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0200, NULL, _name##_store)
Please remove the macros that aren't actually used. I think that
includes PERF_OBJ_ATTR_F_RO, PERF_OBJ_ATTR_F_WO, PERF_OBJ_ATTR_WO, and
PERF_OBJ_ATTR_RW.
> +
> +#define PERF_OBJ_ATTR_F_RW(_name, _filename) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_filename, 0644, _name##_show, _name##_store)
> +
> +#define PERF_OBJ_ATTR_RO(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0444, _name##_show, NULL)
> +
> +#define PERF_OBJ_ATTR_WO(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0200, NULL, _name##_store)
> +
> +#define PERF_OBJ_ATTR_RW(_name) \
> +struct perf_obj_attribute perf_obj_attr_##_name = \
> + __POBJ_ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static ssize_t perf_obj_attr_show(struct kobject *kobj,
> + struct attribute *__attr, char *buf)
> +{
> + struct perf_obj_attribute *attr = to_perf_obj_attr(__attr);
> + struct perf_object *pobj = to_perf_obj(kobj);
> + ssize_t ret = -EIO;
> +
> + if (attr->show)
> + ret = attr->show(pobj, buf);
> + return ret;
> +}
> +
> +static ssize_t perf_obj_attr_store(struct kobject *kobj,
> + struct attribute *__attr,
> + const char *buf, size_t n)
> +{
> + struct perf_obj_attribute *attr = to_perf_obj_attr(__attr);
> + struct perf_object *pobj = to_perf_obj(kobj);
> + ssize_t ret = -EIO;
> +
> + if (attr->store)
> + ret = attr->store(pobj, buf, n);
> + return ret;
> +}
> +
> +static const struct sysfs_ops perf_obj_sysfs_ops = {
> + .show = perf_obj_attr_show,
> + .store = perf_obj_attr_store,
> +};
> +
> +static void perf_obj_release(struct kobject *kobj)
> +{
> + kfree(to_perf_obj(kobj));
> +}
> +
> +static struct kobj_type perf_obj_ktype = {
> + .sysfs_ops = &perf_obj_sysfs_ops,
> + .release = perf_obj_release,
> +};
> +
> +static struct perf_object *
> +create_perf_obj(struct dfl_feature *feature, struct kobject *parent, int id,
> + const struct attribute_group **groups, const char *name)
> +{
> + struct perf_object *pobj;
> + int ret;
> +
> + pobj = kzalloc(sizeof(*pobj), GFP_KERNEL);
> + if (!pobj)
> + return ERR_PTR(-ENOMEM);
> +
> + pobj->id = id;
> + pobj->feature = feature;
> + pobj->attr_groups = groups;
> + INIT_LIST_HEAD(&pobj->node);
> + INIT_LIST_HEAD(&pobj->children);
> +
> + if (id != PERF_OBJ_ROOT_ID)
> + ret = kobject_init_and_add(&pobj->kobj, &perf_obj_ktype,
> + parent, "%s%d", name, id);
> + else
> + ret = kobject_init_and_add(&pobj->kobj, &perf_obj_ktype,
> + parent, "%s", name);
> + if (ret)
> + goto put_exit;
> +
> + if (pobj->attr_groups) {
> + ret = sysfs_create_groups(&pobj->kobj, pobj->attr_groups);
> + if (ret)
> + goto put_exit;
> + }
> +
> + return pobj;
> +
> +put_exit:
> + kobject_put(&pobj->kobj);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Counter Sysfs Interface for Clock.
> + */
> +static ssize_t clock_show(struct perf_object *pobj, char *buf)
> +{
> + void __iomem *base = pobj->feature->ioaddr;
> +
> + return sprintf(buf, "0x%llx\n",
> + (unsigned long long)readq(base + CLK_CNTR));
> +}
> +static PERF_OBJ_ATTR_RO(clock);
> +
> +static struct attribute *clock_attrs[] = {
> + &perf_obj_attr_clock.attr,
> + NULL,
> +};
> +
> +static struct attribute_group clock_attr_group = {
> + .attrs = clock_attrs,
> +};
> +
> +static const struct attribute_group *perf_dev_attr_groups[] = {
> + &clock_attr_group,
> + NULL,
> +};
> +
> +static void destroy_perf_obj(struct perf_object *pobj)
> +{
> + struct perf_object *obj, *obj_tmp;
> +
> + if (pobj->attr_groups)
> + sysfs_remove_groups(&pobj->kobj, pobj->attr_groups);
> +
> + list_for_each_entry_safe(obj, obj_tmp, &pobj->children, node)
> + destroy_perf_obj(obj);
> +
> + list_del(&pobj->node);
> + kobject_put(&pobj->kobj);
> +}
> +
> +static struct perf_object *create_perf_dev(struct dfl_feature *feature)
> +{
> + struct platform_device *pdev = feature->pdev;
> +
> + return create_perf_obj(feature, &pdev->dev.kobj, PERF_OBJ_ROOT_ID,
> + perf_dev_attr_groups, "perf");
> +}
> +
> +/*
> + * Counter Sysfs Interfaces for Cache.
> + */
> +static ssize_t cache_freeze_show(struct perf_object *pobj, char *buf)
> +{
> + void __iomem *base = pobj->feature->ioaddr;
> + u64 v;
> +
> + v = readq(base + CACHE_CTRL);
> +
> + return sprintf(buf, "%u\n",
> + (unsigned int)FIELD_GET(CACHE_FREEZE_CNTR, v));
> +}
> +
> +static ssize_t cache_freeze_store(struct perf_object *pobj,
> + const char *buf, size_t n)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + bool state;
> + u64 v;
> +
> + if (strtobool(buf, &state))
> + return -EINVAL;
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + CACHE_CTRL);
> + v &= ~CACHE_FREEZE_CNTR;
> + v |= FIELD_PREP(CACHE_FREEZE_CNTR, state ? 1 : 0);
> + writeq(v, base + CACHE_CTRL);
> + mutex_unlock(&pdata->lock);
> +
> + return n;
> +}
> +static PERF_OBJ_ATTR_F_RW(cache_freeze, freeze);
> +
> +static ssize_t read_cache_counter(struct perf_object *pobj, char *buf,
> + u8 channel, u8 event)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + u64 v, count;
> +
> + WARN_ON_ONCE(event > CACHE_EVNT_MAX || channel > CACHE_CHANNEL_MAX);
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + /* set channel access type and cache event code. */
> + v = readq(base + CACHE_CTRL);
> + v &= ~(CACHE_CHANNEL_SEL | CACHE_CTRL_EVNT);
> + v |= FIELD_PREP(CACHE_CHANNEL_SEL, channel);
> + v |= FIELD_PREP(CACHE_CTRL_EVNT, event);
> + writeq(v, base + CACHE_CTRL);
> +
> + if (readq_poll_timeout(base + CACHE_CNTR0, v,
> + FIELD_GET(CACHE_CNTR_EVNT, v) == event,
> + 1, PERF_TIMEOUT)) {
> + dev_err(&feature->pdev->dev, "timeout, unmatched cache event type in counter registers.\n");
> + mutex_unlock(&pdata->lock);
> + return -ETIMEDOUT;
> + }
> +
> + v = readq(base + CACHE_CNTR0);
> + count = FIELD_GET(CACHE_CNTR_EVNT_CNTR, v);
> + v = readq(base + CACHE_CNTR1);
> + count += FIELD_GET(CACHE_CNTR_EVNT_CNTR, v);
> + mutex_unlock(&pdata->lock);
> +
> + return sprintf(buf, "0x%llx\n", (unsigned long long)count);
> +}
> +
> +#define CACHE_SHOW(name, channel, event) \
> +static ssize_t name##_show(struct perf_object *pobj, char *buf) \
> +{ \
> + return read_cache_counter(pobj, buf, channel, event); \
> +} \
> +static PERF_OBJ_ATTR_RO(name)
> +
> +CACHE_SHOW(read_hit, CACHE_CHANNEL_RD, CACHE_EVNT_RD_HIT);
> +CACHE_SHOW(read_miss, CACHE_CHANNEL_RD, CACHE_EVNT_RD_MISS);
> +CACHE_SHOW(write_hit, CACHE_CHANNEL_WR, CACHE_EVNT_WR_HIT);
> +CACHE_SHOW(write_miss, CACHE_CHANNEL_WR, CACHE_EVNT_WR_MISS);
> +CACHE_SHOW(hold_request, CACHE_CHANNEL_RD, CACHE_EVNT_HOLD_REQ);
> +CACHE_SHOW(tx_req_stall, CACHE_CHANNEL_RD, CACHE_EVNT_TX_REQ_STALL);
> +CACHE_SHOW(rx_req_stall, CACHE_CHANNEL_RD, CACHE_EVNT_RX_REQ_STALL);
> +CACHE_SHOW(rx_eviction, CACHE_CHANNEL_RD, CACHE_EVNT_EVICTIONS);
> +CACHE_SHOW(data_write_port_contention, CACHE_CHANNEL_WR,
> + CACHE_EVNT_DATA_WR_PORT_CONTEN);
> +CACHE_SHOW(tag_write_port_contention, CACHE_CHANNEL_WR,
> + CACHE_EVNT_TAG_WR_PORT_CONTEN);
> +
> +static struct attribute *cache_attrs[] = {
> + &perf_obj_attr_read_hit.attr,
> + &perf_obj_attr_read_miss.attr,
> + &perf_obj_attr_write_hit.attr,
> + &perf_obj_attr_write_miss.attr,
> + &perf_obj_attr_hold_request.attr,
> + &perf_obj_attr_data_write_port_contention.attr,
> + &perf_obj_attr_tag_write_port_contention.attr,
> + &perf_obj_attr_tx_req_stall.attr,
> + &perf_obj_attr_rx_req_stall.attr,
> + &perf_obj_attr_rx_eviction.attr,
> + &perf_obj_attr_cache_freeze.attr,
> + NULL,
> +};
> +
> +static struct attribute_group cache_attr_group = {
> + .attrs = cache_attrs,
> +};
> +
> +static const struct attribute_group *cache_attr_groups[] = {
> + &cache_attr_group,
> + NULL,
> +};
> +
> +static int create_perf_cache_obj(struct perf_object *perf_dev)
> +{
> + struct perf_object *pobj;
> +
> + pobj = create_perf_obj(perf_dev->feature, &perf_dev->kobj,
> + PERF_OBJ_ROOT_ID, cache_attr_groups, "cache");
> + if (IS_ERR(pobj))
> + return PTR_ERR(pobj);
> +
> + list_add(&pobj->node, &perf_dev->children);
> +
> + return 0;
> +}
> +
> +/*
> + * Counter Sysfs Interfaces for VT-D / IOMMU.
> + */
> +static ssize_t vtd_freeze_show(struct perf_object *pobj, char *buf)
> +{
> + void __iomem *base = pobj->feature->ioaddr;
> + u64 v;
> +
> + v = readq(base + VTD_CTRL);
> +
> + return sprintf(buf, "%u\n",
> + (unsigned int)FIELD_GET(VTD_FREEZE_CNTR, v));
> +}
> +
> +static ssize_t vtd_freeze_store(struct perf_object *pobj,
> + const char *buf, size_t n)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + bool state;
> + u64 v;
> +
> + if (strtobool(buf, &state))
> + return -EINVAL;
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + VTD_CTRL);
> + v &= ~VTD_FREEZE_CNTR;
> + v |= FIELD_PREP(VTD_FREEZE_CNTR, state ? 1 : 0);
> + writeq(v, base + VTD_CTRL);
> + mutex_unlock(&pdata->lock);
> +
> + return n;
> +}
> +static PERF_OBJ_ATTR_F_RW(vtd_freeze, freeze);
> +
> +static struct attribute *iommu_top_attrs[] = {
> + &perf_obj_attr_vtd_freeze.attr,
> + NULL,
> +};
> +
> +static struct attribute_group iommu_top_attr_group = {
> + .attrs = iommu_top_attrs,
> +};
> +
> +static ssize_t read_iommu_sip_counter(struct perf_object *pobj,
> + u8 event, char *buf)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + u64 v, count;
> +
> + WARN_ON_ONCE(event > VTD_SIP_EVNT_MAX);
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + VTD_SIP_CTRL);
> + v &= ~VTD_SIP_CTRL_EVNT;
> + v |= FIELD_PREP(VTD_SIP_CTRL_EVNT, event);
> + writeq(v, base + VTD_SIP_CTRL);
> +
> + if (readq_poll_timeout(base + VTD_SIP_CNTR, v,
> + FIELD_GET(VTD_SIP_CNTR_EVNT, v) == event,
> + 1, PERF_TIMEOUT)) {
> + dev_err(&feature->pdev->dev, "timeout, unmatched VTd SIP event type in counter registers\n");
> + mutex_unlock(&pdata->lock);
> + return -ETIMEDOUT;
> + }
> +
> + v = readq(base + VTD_SIP_CNTR);
> + count = FIELD_GET(VTD_SIP_CNTR_EVNT_CNTR, v);
> + mutex_unlock(&pdata->lock);
> +
> + return sprintf(buf, "0x%llx\n", (unsigned long long)count);
> +}
> +
> +#define VTD_SIP_SHOW(name, event) \
> +static ssize_t name##_show(struct perf_object *pobj, char *buf) \
> +{ \
> + return read_iommu_sip_counter(pobj, event, buf); \
> +} \
> +static PERF_OBJ_ATTR_RO(name)
> +
> +VTD_SIP_SHOW(iotlb_4k_hit, VTD_SIP_EVNT_IOTLB_4K_HIT);
> +VTD_SIP_SHOW(iotlb_2m_hit, VTD_SIP_EVNT_IOTLB_2M_HIT);
> +VTD_SIP_SHOW(iotlb_1g_hit, VTD_SIP_EVNT_IOTLB_1G_HIT);
> +VTD_SIP_SHOW(slpwc_l3_hit, VTD_SIP_EVNT_SLPWC_L3_HIT);
> +VTD_SIP_SHOW(slpwc_l4_hit, VTD_SIP_EVNT_SLPWC_L4_HIT);
> +VTD_SIP_SHOW(rcc_hit, VTD_SIP_EVNT_RCC_HIT);
> +VTD_SIP_SHOW(iotlb_4k_miss, VTD_SIP_EVNT_IOTLB_4K_MISS);
> +VTD_SIP_SHOW(iotlb_2m_miss, VTD_SIP_EVNT_IOTLB_2M_MISS);
> +VTD_SIP_SHOW(iotlb_1g_miss, VTD_SIP_EVNT_IOTLB_1G_MISS);
> +VTD_SIP_SHOW(slpwc_l3_miss, VTD_SIP_EVNT_SLPWC_L3_MISS);
> +VTD_SIP_SHOW(slpwc_l4_miss, VTD_SIP_EVNT_SLPWC_L4_MISS);
> +VTD_SIP_SHOW(rcc_miss, VTD_SIP_EVNT_RCC_MISS);
> +
> +static struct attribute *iommu_sip_attrs[] = {
> + &perf_obj_attr_iotlb_4k_hit.attr,
> + &perf_obj_attr_iotlb_2m_hit.attr,
> + &perf_obj_attr_iotlb_1g_hit.attr,
> + &perf_obj_attr_slpwc_l3_hit.attr,
> + &perf_obj_attr_slpwc_l4_hit.attr,
> + &perf_obj_attr_rcc_hit.attr,
> + &perf_obj_attr_iotlb_4k_miss.attr,
> + &perf_obj_attr_iotlb_2m_miss.attr,
> + &perf_obj_attr_iotlb_1g_miss.attr,
> + &perf_obj_attr_slpwc_l3_miss.attr,
> + &perf_obj_attr_slpwc_l4_miss.attr,
> + &perf_obj_attr_rcc_miss.attr,
> + NULL,
> +};
> +
> +static struct attribute_group iommu_sip_attr_group = {
> + .attrs = iommu_sip_attrs,
> +};
> +
> +static const struct attribute_group *iommu_top_attr_groups[] = {
> + &iommu_top_attr_group,
> + &iommu_sip_attr_group,
> + NULL,
> +};
> +
> +static ssize_t read_iommu_counter(struct perf_object *pobj, u8 event, char *buf)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + u64 v, count;
> +
> + WARN_ON_ONCE(event > VTD_EVNT_MAX);
> +
> + event += pobj->id;
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + VTD_CTRL);
> + v &= ~VTD_CTRL_EVNT;
> + v |= FIELD_PREP(VTD_CTRL_EVNT, event);
> + writeq(v, base + VTD_CTRL);
> +
> + if (readq_poll_timeout(base + VTD_CNTR, v,
> + FIELD_GET(VTD_CNTR_EVNT, v) == event, 1,
> + PERF_TIMEOUT)) {
> + dev_err(&feature->pdev->dev, "timeout, unmatched VTd event type in counter registers\n");
> + mutex_unlock(&pdata->lock);
> + return -ETIMEDOUT;
> + }
> +
> + v = readq(base + VTD_CNTR);
> + count = FIELD_GET(VTD_CNTR_EVNT_CNTR, v);
> + mutex_unlock(&pdata->lock);
> +
> + return sprintf(buf, "0x%llx\n", (unsigned long long)count);
> +}
> +
> +#define VTD_SHOW(name, base_event) \
> +static ssize_t name##_show(struct perf_object *pobj, char *buf) \
> +{ \
> + return read_iommu_counter(pobj, base_event, buf); \
> +} \
> +static PERF_OBJ_ATTR_RO(name)
> +
> +VTD_SHOW(read_transaction, VTD_EVNT_AFU_MEM_RD_TRANS);
> +VTD_SHOW(write_transaction, VTD_EVNT_AFU_MEM_WR_TRANS);
> +VTD_SHOW(devtlb_read_hit, VTD_EVNT_AFU_DEVTLB_RD_HIT);
> +VTD_SHOW(devtlb_write_hit, VTD_EVNT_AFU_DEVTLB_WR_HIT);
> +VTD_SHOW(devtlb_4k_fill, VTD_EVNT_DEVTLB_4K_FILL);
> +VTD_SHOW(devtlb_2m_fill, VTD_EVNT_DEVTLB_2M_FILL);
> +VTD_SHOW(devtlb_1g_fill, VTD_EVNT_DEVTLB_1G_FILL);
> +
> +static struct attribute *iommu_attrs[] = {
> + &perf_obj_attr_read_transaction.attr,
> + &perf_obj_attr_write_transaction.attr,
> + &perf_obj_attr_devtlb_read_hit.attr,
> + &perf_obj_attr_devtlb_write_hit.attr,
> + &perf_obj_attr_devtlb_4k_fill.attr,
> + &perf_obj_attr_devtlb_2m_fill.attr,
> + &perf_obj_attr_devtlb_1g_fill.attr,
> + NULL,
> +};
> +
> +static struct attribute_group iommu_attr_group = {
> + .attrs = iommu_attrs,
> +};
> +
> +static const struct attribute_group *iommu_attr_groups[] = {
> + &iommu_attr_group,
> + NULL,
> +};
> +
> +#define PERF_MAX_PORT_NUM 1
> +
> +static int create_perf_iommu_obj(struct perf_object *perf_dev)
> +{
> + struct dfl_feature *feature = perf_dev->feature;
> + struct device *dev = &feature->pdev->dev;
> + struct perf_object *pobj, *obj;
> + void __iomem *base;
> + u64 v;
> + int i;
> +
> + /* check if iommu is not supported on this device. */
> + base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_HEADER);
> + v = readq(base + FME_HDR_CAP);
> + if (!FIELD_GET(FME_CAP_IOMMU_AVL, v))
> + return 0;
> +
> + pobj = create_perf_obj(feature, &perf_dev->kobj, PERF_OBJ_ROOT_ID,
> + iommu_top_attr_groups, "iommu");
> + if (IS_ERR(pobj))
> + return PTR_ERR(pobj);
> +
> + list_add(&pobj->node, &perf_dev->children);
> +
> + for (i = 0; i < PERF_MAX_PORT_NUM; i++) {
> + obj = create_perf_obj(feature, &pobj->kobj, i,
> + iommu_attr_groups, "afu");
> + if (IS_ERR(obj))
> + return PTR_ERR(obj);
> +
> + list_add(&obj->node, &pobj->children);
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Counter Sysfs Interfaces for Fabric
> + */
> +static bool fabric_pobj_is_enabled(struct perf_object *pobj)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + void __iomem *base = feature->ioaddr;
> + u64 v;
> +
> + v = readq(base + FAB_CTRL);
> +
> + if (FIELD_GET(FAB_PORT_FILTER, v) == FAB_PORT_FILTER_DISABLE)
> + return pobj->id == PERF_OBJ_ROOT_ID;
> +
> + return pobj->id == FIELD_GET(FAB_PORT_ID, v);
> +}
> +
> +static ssize_t read_fabric_counter(struct perf_object *pobj,
> + u8 event, char *buf)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + u64 v, count = 0;
> +
> + WARN_ON_ONCE(event > FAB_EVNT_MAX);
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + /* if it is disabled, force the counter to return zero. */
> + if (!fabric_pobj_is_enabled(pobj))
> + goto exit;
> +
> + v = readq(base + FAB_CTRL);
> + v &= ~FAB_CTRL_EVNT;
> + v |= FIELD_PREP(FAB_CTRL_EVNT, event);
> + writeq(v, base + FAB_CTRL);
> +
> + if (readq_poll_timeout(base + FAB_CNTR, v,
> + FIELD_GET(FAB_CNTR_EVNT, v) == event,
> + 1, PERF_TIMEOUT)) {
> + dev_err(&feature->pdev->dev, "timeout, unmatched fab event type in counter registers.\n");
> + mutex_unlock(&pdata->lock);
> + return -ETIMEDOUT;
> + }
> +
> + v = readq(base + FAB_CNTR);
> + count = FIELD_GET(FAB_CNTR_EVNT_CNTR, v);
> +exit:
> + mutex_unlock(&pdata->lock);
> +
> + return sprintf(buf, "0x%llx\n", (unsigned long long)count);
> +}
> +
> +#define FAB_SHOW(name, event) \
> +static ssize_t name##_show(struct perf_object *pobj, char *buf) \
> +{ \
> + return read_fabric_counter(pobj, event, buf); \
> +} \
> +static PERF_OBJ_ATTR_RO(name)
> +
> +FAB_SHOW(pcie0_read, FAB_EVNT_PCIE0_RD);
> +FAB_SHOW(pcie0_write, FAB_EVNT_PCIE0_WR);
> +FAB_SHOW(pcie1_read, FAB_EVNT_PCIE1_RD);
> +FAB_SHOW(pcie1_write, FAB_EVNT_PCIE1_WR);
> +FAB_SHOW(upi_read, FAB_EVNT_UPI_RD);
> +FAB_SHOW(upi_write, FAB_EVNT_UPI_WR);
> +FAB_SHOW(mmio_read, FAB_EVNT_MMIO_RD);
> +FAB_SHOW(mmio_write, FAB_EVNT_MMIO_WR);
> +
> +static ssize_t fab_enable_show(struct perf_object *pobj, char *buf)
> +{
> + return sprintf(buf, "%u\n",
> + (unsigned int)!!fabric_pobj_is_enabled(pobj));
> +}
> +
> +/*
> + * If enable one port or all port event counter in fabric, other
> + * fabric event counter originally enabled will be disable automatically.
> + */
> +static ssize_t fab_enable_store(struct perf_object *pobj,
> + const char *buf, size_t n)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + bool state;
> + u64 v;
> +
> + if (strtobool(buf, &state) || !state)
> + return -EINVAL;
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + /* if it is already enabled. */
> + if (fabric_pobj_is_enabled(pobj))
> + return n;
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + FAB_CTRL);
> + v &= ~(FAB_PORT_FILTER | FAB_PORT_ID);
> +
> + if (pobj->id == PERF_OBJ_ROOT_ID) {
> + v |= FIELD_PREP(FAB_PORT_FILTER, FAB_PORT_FILTER_DISABLE);
> + } else {
> + v |= FIELD_PREP(FAB_PORT_FILTER, FAB_PORT_FILTER_ENABLE);
> + v |= FIELD_PREP(FAB_PORT_ID, pobj->id);
> + }
> + writeq(v, base + FAB_CTRL);
> + mutex_unlock(&pdata->lock);
> +
> + return n;
> +}
> +static PERF_OBJ_ATTR_F_RW(fab_enable, enable);
> +
> +static struct attribute *fabric_attrs[] = {
> + &perf_obj_attr_pcie0_read.attr,
> + &perf_obj_attr_pcie0_write.attr,
> + &perf_obj_attr_pcie1_read.attr,
> + &perf_obj_attr_pcie1_write.attr,
> + &perf_obj_attr_upi_read.attr,
> + &perf_obj_attr_upi_write.attr,
> + &perf_obj_attr_mmio_read.attr,
> + &perf_obj_attr_mmio_write.attr,
> + &perf_obj_attr_fab_enable.attr,
> + NULL,
> +};
> +
> +static struct attribute_group fabric_attr_group = {
> + .attrs = fabric_attrs,
> +};
> +
> +static const struct attribute_group *fabric_attr_groups[] = {
> + &fabric_attr_group,
> + NULL,
> +};
> +
> +static ssize_t fab_freeze_show(struct perf_object *pobj, char *buf)
> +{
> + void __iomem *base = pobj->feature->ioaddr;
> + u64 v;
> +
> + v = readq(base + FAB_CTRL);
> +
> + return sprintf(buf, "%u\n",
> + (unsigned int)FIELD_GET(FAB_FREEZE_CNTR, v));
> +}
> +
> +static ssize_t fab_freeze_store(struct perf_object *pobj,
> + const char *buf, size_t n)
> +{
> + struct dfl_feature *feature = pobj->feature;
> + struct dfl_feature_platform_data *pdata;
> + void __iomem *base = feature->ioaddr;
> + bool state;
> + u64 v;
> +
> + if (strtobool(buf, &state))
> + return -EINVAL;
> +
> + pdata = dev_get_platdata(&feature->pdev->dev);
> +
> + mutex_lock(&pdata->lock);
> + v = readq(base + FAB_CTRL);
> + v &= ~FAB_FREEZE_CNTR;
> + v |= FIELD_PREP(FAB_FREEZE_CNTR, state ? 1 : 0);
> + writeq(v, base + FAB_CTRL);
> + mutex_unlock(&pdata->lock);
> +
> + return n;
> +}
> +static PERF_OBJ_ATTR_F_RW(fab_freeze, freeze);
> +
> +static struct attribute *fabric_top_attrs[] = {
> + &perf_obj_attr_fab_freeze.attr,
> + NULL,
> +};
> +
> +static struct attribute_group fabric_top_attr_group = {
> + .attrs = fabric_top_attrs,
> +};
> +
> +static const struct attribute_group *fabric_top_attr_groups[] = {
> + &fabric_attr_group,
> + &fabric_top_attr_group,
> + NULL,
> +};
> +
> +static int create_perf_fabric_obj(struct perf_object *perf_dev)
> +{
> + struct perf_object *pobj, *obj;
> + int i;
> +
> + pobj = create_perf_obj(perf_dev->feature, &perf_dev->kobj,
> + PERF_OBJ_ROOT_ID, fabric_top_attr_groups,
> + "fabric");
> + if (IS_ERR(pobj))
> + return PTR_ERR(pobj);
> +
> + list_add(&pobj->node, &perf_dev->children);
> +
> + for (i = 0; i < PERF_MAX_PORT_NUM; i++) {
> + obj = create_perf_obj(perf_dev->feature, &pobj->kobj, i,
> + fabric_attr_groups, "port");
> + if (IS_ERR(obj))
> + return PTR_ERR(obj);
> +
> + list_add(&obj->node, &pobj->children);
> + }
> +
> + return 0;
> +}
> +
> +static int fme_perf_init(struct platform_device *pdev,
> + struct dfl_feature *feature)
> +{
> + struct perf_object *perf_dev;
> + int ret;
> +
> + perf_dev = create_perf_dev(feature);
> + if (IS_ERR(perf_dev))
> + return PTR_ERR(perf_dev);
> +
> + ret = create_perf_fabric_obj(perf_dev);
> + if (ret)
> + goto done;
> +
> + if (feature->id == FME_FEATURE_ID_GLOBAL_IPERF) {
> + /*
> + * Cache and IOMMU(VT-D) performance counters are not supported
> + * on discreted solutions e.g. Intel Programmable Acceleration
> + * Card based on PCIe.
> + */
> + ret = create_perf_cache_obj(perf_dev);
> + if (ret)
> + goto done;
> +
> + ret = create_perf_iommu_obj(perf_dev);
> + if (ret)
> + goto done;
> + }
> +
> + feature->priv = perf_dev;
> + return 0;
> +
> +done:
> + destroy_perf_obj(perf_dev);
> + return ret;
> +}
> +
> +static void fme_perf_uinit(struct platform_device *pdev,
> + struct dfl_feature *feature)
> +{
> + struct perf_object *perf_dev = feature->priv;
> +
> + destroy_perf_obj(perf_dev);
> +}
> +
> +const struct dfl_feature_id fme_perf_id_table[] = {
> + {.id = FME_FEATURE_ID_GLOBAL_IPERF,},
> + {.id = FME_FEATURE_ID_GLOBAL_DPERF,},
> + {0,}
> +};
> +
> +const struct dfl_feature_ops fme_perf_ops = {
> + .init = fme_perf_init,
> + .uinit = fme_perf_uinit,
> +};
> diff --git a/drivers/fpga/dfl-fme.h b/drivers/fpga/dfl-fme.h
> index 5fbe3f5..dc71048 100644
> --- a/drivers/fpga/dfl-fme.h
> +++ b/drivers/fpga/dfl-fme.h
> @@ -39,5 +39,7 @@ struct dfl_fme {
> extern const struct dfl_feature_id fme_pr_mgmt_id_table[];
> extern const struct dfl_feature_ops fme_global_err_ops;
> extern const struct dfl_feature_id fme_global_err_id_table[];
> +extern const struct dfl_feature_ops fme_perf_ops;
> +extern const struct dfl_feature_id fme_perf_id_table[];
>
> #endif /* __DFL_FME_H */
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index 1bb2b58..050b18b 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -521,6 +521,7 @@ static int build_info_commit_dev(struct build_feature_devs_info *binfo)
> struct dfl_feature *feature = &pdata->features[index];
>
> /* save resource information for each feature */
> + feature->pdev = fdev;
> feature->id = finfo->fid;
> feature->resource_index = index;
> feature->ioaddr = finfo->ioaddr;
> diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> index 6c32080..bf23436 100644
> --- a/drivers/fpga/dfl.h
> +++ b/drivers/fpga/dfl.h
> @@ -191,6 +191,7 @@ struct dfl_feature_driver {
> /**
> * struct dfl_feature - sub feature of the feature devices
> *
> + * @pdev: parent platform device.
> * @id: sub feature id.
> * @resource_index: each sub feature has one mmio resource for its registers.
> * this index is used to find its mmio resource from the
> @@ -200,6 +201,7 @@ struct dfl_feature_driver {
> * @priv: priv data of this feature.
> */
> struct dfl_feature {
> + struct platform_device *pdev;
> u64 id;
> int resource_index;
> void __iomem *ioaddr;
> --
> 1.8.3.1
>
^ permalink raw reply
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Michal Hocko @ 2019-05-30 18:47 UTC (permalink / raw)
To: Minchan Kim
Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190530080214.GA159502@google.com>
On Thu 30-05-19 17:02:14, Minchan Kim wrote:
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
[...]
> > > First time, I didn't think about atomicity about address range race
> > > because MADV_COLD/PAGEOUT is not critical for the race.
> > > However you raised the atomicity issue because people would extend
> > > hints to destructive ones easily. I agree with that and that's why
> > > we discussed how to guarantee the race and Daniel comes up with good idea.
> >
> > Just for the clarification, I didn't really mean atomicity but rather a
> > _consistency_ (essentially time to check to time to use consistency).
>
> What do you mean by *consistency*? Could you elaborate it more?
That you operate on the object you have got by some means. In other
words that the range you want to call madvise on hasn't been
remapped/replaced by a different mmap operation.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/2] fork: add clone6
From: Kees Cook @ 2019-05-30 18:26 UTC (permalink / raw)
To: Jann Horn
Cc: Linus Torvalds, Christian Brauner, Al Viro,
Linux List Kernel Mailing, Florian Weimer, Oleg Nesterov,
Arnd Bergmann, David Howells, Pavel Emelyanov, Andrew Morton,
Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <CAG48ez2wyDhM-V1hs5ya1R4x7wHT=T8XLOYCPUyw97kzzLhbhg@mail.gmail.com>
On Mon, May 27, 2019 at 09:36:18PM +0200, Jann Horn wrote:
> +Kees
>
> On Mon, May 27, 2019 at 9:27 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, May 27, 2019 at 3:42 AM Christian Brauner <christian@brauner.io> wrote:
> > > Hm, still pondering whether having one unsigned int argument passed
> > > through registers that captures all the flags from the old clone() would
> > > be a good idea.
> >
> > That sounds like a reasonable thing to do.
> >
> > Maybe we could continue to call the old flags CLONE_XYZ and continue
> > to pass them in as "flags" argument, and then we have CLONE_EXT_XYZ
> > flags for a new 64-bit flag field that comes in through memory in the
> > new clone_args thing?
>
> With the current seccomp model, that would have the unfortunate effect
> of making it impossible to filter out new clone flags - which would
> likely mean that people who want to sandbox their code would not use
> the new clone() because they don't want their sandboxed code to be
> able to create time namespaces and whatever other new fancy things
> clone() might support in the future. This is why I convinced Christian
> to pass flags in registers for the first patch version.
>
> The alternative I see would be to somehow extend seccomp to support
> argument structures that are passed in memory - that would probably
> require quite a bit of new plumbing though, both in the kernel and in
> userspace code that configures seccomp filters.
FWIW, the only path forward on this that I've been able to see is to
normalize how syscalls read memory from userspace, and to basically
provide a cache (i.e. copy from userspace once) that will be examined by
both seccomp and later kernel functions. I have not been able to imagine
an API that wasn't a massive amount of work to implement, though. Maybe
it could be done only for a few kinds of arguments (file paths, certain
structures, etc) but I haven't made any progress on it.
--
Kees Cook
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Serge E. Hallyn @ 2019-05-30 17:09 UTC (permalink / raw)
To: Paul Moore
Cc: Tycho Andersen, Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhRS66VGtug3fq3RTGHDvfGmBJG6yRJ+iMxm3cxnNF-zJw@mail.gmail.com>
On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > >
> > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > >
> > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > > > It is not permitted to unset the audit container identifier.
> > > > > > > A child inherits its parent's audit container identifier.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > /**
> > > > > > > + * audit_set_contid - set current task's audit contid
> > > > > > > + * @contid: contid value
> > > > > > > + *
> > > > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > > > + *
> > > > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > > > + */
> > > > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > > > +{
> > > > > > > + u64 oldcontid;
> > > > > > > + int rc = 0;
> > > > > > > + struct audit_buffer *ab;
> > > > > > > + uid_t uid;
> > > > > > > + struct tty_struct *tty;
> > > > > > > + char comm[sizeof(current->comm)];
> > > > > > > +
> > > > > > > + task_lock(task);
> > > > > > > + /* Can't set if audit disabled */
> > > > > > > + if (!task->audit) {
> > > > > > > + task_unlock(task);
> > > > > > > + return -ENOPROTOOPT;
> > > > > > > + }
> > > > > > > + oldcontid = audit_get_contid(task);
> > > > > > > + read_lock(&tasklist_lock);
> > > > > > > + /* Don't allow the audit containerid to be unset */
> > > > > > > + if (!audit_contid_valid(contid))
> > > > > > > + rc = -EINVAL;
> > > > > > > + /* if we don't have caps, reject */
> > > > > > > + else if (!capable(CAP_AUDIT_CONTROL))
> > > > > > > + rc = -EPERM;
> > > > > > > + /* if task has children or is not single-threaded, deny */
> > > > > > > + else if (!list_empty(&task->children))
> > > > > > > + rc = -EBUSY;
> > > > > > > + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > > > + rc = -EALREADY;
> > > > > > > + read_unlock(&tasklist_lock);
> > > > > > > + if (!rc)
> > > > > > > + task->audit->contid = contid;
> > > > > > > + task_unlock(task);
> > > > > > > +
> > > > > > > + if (!audit_enabled)
> > > > > > > + return rc;
> > > > > >
> > > > > > ...but it is allowed to change it (assuming
> > > > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > > > immediately useful since we still live in the world of majority
> > > > > > privileged containers if we didn't allow changing it, in addition to
> > > > > > un-setting it.
> > > > >
> > > > > The idea is that only container orchestrators should be able to
> > > > > set/modify the audit container ID, and since setting the audit
> > > > > container ID can have a significant effect on the records captured
> > > > > (and their routing to multiple daemons when we get there) modifying
> > > > > the audit container ID is akin to modifying the audit configuration
> > > > > which is why it is gated by CAP_AUDIT_CONTROL. The current thinking
> > > > > is that you would only change the audit container ID from one
> > > > > set/inherited value to another if you were nesting containers, in
> > > > > which case the nested container orchestrator would need to be granted
> > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > compromise).
> > > >
> > > > But then don't you want some kind of ns_capable() instead (probably
> > > > not the obvious one, though...)? With capable(), you can't really nest
> > > > using the audit-id and user namespaces together.
> > >
> > > You want capable() and not ns_capable() because you want to ensure
> > > that the orchestrator has the rights in the init_ns as changes to the
> > > audit container ID could have an auditing impact that spans the entire
> > > system.
> >
> > Ok but,
> >
> > > > > The current thinking
> > > > > is that you would only change the audit container ID from one
> > > > > set/inherited value to another if you were nesting containers, in
> > > > > which case the nested container orchestrator would need to be granted
> > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > compromise).
> >
> > won't work in user namespaced containers, because they will never be
> > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > nesting as is. But maybe nobody cares :)
>
> That's fun :)
>
> To be honest, I've never been a big fan of supporting nested
> containers from an audit perspective, so I'm not really too upset
> about this. The k8s/cri-o folks seem okay with this, or at least I
> haven't heard any objections; lxc folks, what do you have to say?
I actually thought the answer to this (when last I looked, "some time" ago)
was that userspace should track an audit message saying "task X in
container Y is changing its auditid to Z", and then decide to also track Z.
This should be doable, but a lot of extra work in userspace.
Per-userns containerids would also work. So task X1 is in containerid
1 on the host and creates a new task Y in new userns; it continues to
be reported in init_user_ns as containerid 1 forever; but in its own
userns it can request to be known as some other containerid. Audit
socks would be per-userns, allowing root in a container to watch for
audit events in its own (and descendent) namespaces.
But again I'm sure we've gone over all this in the last few years.
I suppose we can look at this as a "first step", and talk about
making it user-ns-nestable later. But agreed it's not useful in a
lot of situations as is.
-serge
^ permalink raw reply
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Daniel Colascione @ 2019-05-30 16:19 UTC (permalink / raw)
To: Minchan Kim
Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190530080214.GA159502@google.com>
On Thu, May 30, 2019 at 1:02 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > > [Cc linux-api]
> > > > > > > > > > >
> > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > > > multiple address range.
> > > > > > > > > > >
> > > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > > >
> > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > > > with number in the description at respin.
> > > > > > > > >
> > > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > > >
> > > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > > > to me.
> > > > > > >
> > > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > > through then you should better get some supporting data.
> > > > > >
> > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > > > I will drop vector support at next revision.
> > > > >
> > > > > Please do keep the vector support. Absolute timing is misleading,
> > > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > > seen tons of improvements in things like camera start come from
> > > > > coalescing mprotect calls, with the gains coming from taking and
> > > > > releasing various locks a lot less often and bouncing around less on
> > > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > > especially on mobile.
> > > >
> > > > This will always be a double edge sword. Taking a lock for longer can
> > > > improve a throughput of a single call but it would make a latency for
> > > > anybody contending on the lock much worse.
> > > >
> > > > Besides that, please do not overcomplicate the thing from the early
> > > > beginning please. Let's start with a simple and well defined remote
> > > > madvise alternative first and build a vector API on top with some
> > > > numbers based on _real_ workloads.
> > >
> > > First time, I didn't think about atomicity about address range race
> > > because MADV_COLD/PAGEOUT is not critical for the race.
> > > However you raised the atomicity issue because people would extend
> > > hints to destructive ones easily. I agree with that and that's why
> > > we discussed how to guarantee the race and Daniel comes up with good idea.
> >
> > Just for the clarification, I didn't really mean atomicity but rather a
> > _consistency_ (essentially time to check to time to use consistency).
>
> What do you mean by *consistency*? Could you elaborate it more?
>
> >
> > > - vma configuration seq number via process_getinfo(2).
> > >
> > > We discussed the race issue without _read_ workloads/requests because
> > > it's common sense that people might extend the syscall later.
> > >
> > > Here is same. For current workload, we don't need to support vector
> > > for perfomance point of view based on my experiment. However, it's
> > > rather limited experiment. Some configuration might have 10000+ vmas
> > > or really slow CPU.
> > >
> > > Furthermore, I want to have vector support due to atomicity issue
> > > if it's really the one we should consider.
> > > With vector support of the API and vma configuration sequence number
> > > from Daniel, we could support address ranges operations's atomicity.
> >
> > I am not sure what do you mean here. Perform all ranges atomicaly wrt.
> > other address space modifications? If yes I am not sure we want that
>
> Yub, I think it's *necessary* if we want to support destructive hints
> via process_madvise.
[Puts on flame-proof suit]
Here's a quick sketch of what I have in mind for process_getinfo(2).
Keep in mind that it's still just a rough idea.
We've had trouble efficiently learning about process and memory state
of the system via procfs. Android background memory-use scans (the
android.bg daemon) consume quite a bit of CPU time for PSS; Minchan's
done a lot of thinking for how we can specify desired page sets for
compaction as part of this patch set; and the full procfs walks that
some trace collection tools need to undertake take more than 200ms to
collect (sometimes much more) due mostly to procfs iteration. ISTM we
can do better.
While procfs *works* on a functional level, it's inefficient due to
the splatting of information we want across several different files
(which need to be independently opened --- e.g.,
/proc/pid/oom_score_adj *and* /proc/pid/status), inefficient due to
the ad-hoc text formatting, inefficient due to information over-fetch,
and cumbersome due to the fundamental impedance mismatch between
filesystem APIs and process lifetimes. procfs itself is also optional,
which has caused various bits of awkwardness that you'll recall from
the pidfd discussions.
How about we solve the problem once and for all? I'm imagining a new
process_getinfo(2) that solves all of these problems at the same time.
I want something with a few properties:
1) if we want to learn M facts about N things, we enter the kernel
once and learn all M*N things,
2) the information we collect is self-consistent (which implies
atomicity most of the time),
3) we retrieve the information we want in an efficient binary format, and
4) we don't pay to learn anything not in M.
I've jotted down a quick sketch of the API below; I'm curious what
everyone else thinks. It'd basically look like this:
int process_getinfo(int nr_proc, int* proc, int flags, unsigned long
mask, void* out_buf, size_t* inout_sz)
We wouldn't use the return value for much: 0 on success and -1 on
error with errno set. NR_PROC and PROC together would specify the
objects we want to learn about, which would be either PIDs or PIDFDs
or maybe nothing at all if FLAGS tells us to inspect every process on
the system. MASK is a bitset of facts we want to learn, described
below. OUT_BUF and INOUT_SZ are for actually communicating the result.
On input, the caller would fill *INOUT_SZ with the size of the buffer
to which OUT_BUF points; on success, we'd fill *INOUT_SZ with the
number of bytes we actually used. If the output buffer is too small,
we'll truncate the result, fail the system call with E2BIG, and fill
*INOUT_SZ with the number of needed bytes, inviting the caller to try
again. (If a caller supplies something huge like a reusable 512KB
buffer on the first call, no reallocation and retries will be
necessary in practice on a typically-sized system.)
The actual returned buffer is a collection of structures and data
blocks starting with a struct process_info. The structures in the
returned buffer sometimes contain "pointers" to other structures
encoded as byte offsets from the start of the information buffer.
Using offsets instead of actual pointers keeps the format the same
across 32- and 64-bit versions of process_getinfo and makes it
possible to relocate the returned information buffer with memcpy.
struct process_info {
int first_procrec_offset; // struct procrec*
// Examples of system-wide things we could ask for
int mem_total_kb;
int mem_free_kb;
int mem_available_kb;
char reserved[];
};
struct procrec {
int next_procrec_offset; // struct procrec*
// Following fields are self-explanatory and are only examples of the
// kind of information we could provide.
int tid;
int tgid;
char status;
int oom_score_adj;
struct { int real, effective, saved, fs; } uids;
int prio;
int comm_offset; // char*
uint64_t rss_file_kb
uint64_t rss_anon_fb;
uint64_t vm_seq;
int first_procrec_vma_offset; // struct procrec_vma*
char reserved[];
};
struct procrec_vma {
int next_procrec_vma_offset; // struct procrec_vma*
unsigned long start;
unsigned long end;
int backing_file_name_offset; // char*
int prot;
char reserved[];
};
Callers would use the returned buffer by casting it to a struct
process_info and following the internal "pointers".
MASK would specify which bits of information we wanted: for example,
if we asked for PROCESS_VM_MEMORY_MAP, the kernel would fill in each
struct procret's memory_map field and have it point to a struct
procrec_vma in the returned output buffer. If we omitted
PROCESS_VM_MEMORY_MAP, we'd leave the memory_map field as NULL
(encoded as offset zero). The kernel would embed any strings (like
comm and VMA names) into the output buffer; the precise locations
would be unspecified so long as callers could find these fields via
output-buffer pointers.
Because all the structures are variable-length and are chained
together with explicit pointers (or offsets) instead of being stuffed
into a literal array, we can add additional fields to the output
structures any time we want without breaking binary compatibility.
Callers would tell the kernel that they're interested in the added
struct fields by asking for them via bits in MASK, and kernels that
don't understand those fields would just fail the system call with
EINVAL or something.
Depending on how we call it, we can use this API as a bunch of different things:
1) Quick retrieval of system-wide memory counters, like /proc/meminfo
2) Quick snapshot of all process-thread identities (asking for the
wildcard TID match via FLAGS)
3) Fast enumeration of one process's address space
4) Collecting process-summary VM counters (e.g., rss_anon and
rss_file) for a set of processes
5) Retrieval of every VMA of every process on the system for debugging
We can do all of this with one entry into the kernel and without
opening any new file descriptors (unless we want to use pidfds as
inputs). We can also make this operation as atomic as we want, e.g.,
taking mmap_sem while looking at each process and taking tasklist_lock
so all the thread IDs line up with their processes. We don't
necessarily need to take the mmap_sems of all processes we care about
at the same time.
Since this isn't a filesystem-based API, we don't have to deal with
seq_file or deal with consistency issues arising from userspace
programs doing strange things like reading procfs files very slowly in
small chunks. Security-wise, we'd just use different access checks for
different requested information bits in MASK, maybe supplying "no
access" struct procrec entries if a caller doesn't happen to have
access to a particular process. I suppose we can talk about whether
access check failures should result in dummy values or syscall
failure: maybe callers should select which behavior they want.
Format-wise, we could also just return flatbuffer messages from the
kernel, but I suspect that we don't want flatbuffer in the kernel
right now. :-)
The API I'm proposing accepts an array of processes to inspect. We
could simplify it by accepting just one process and making the caller
enter the kernel once per process it wants to learn about, but this
simplification would make the API less useful for answering questions
like "What's the RSS of every process on the system right now?".
What do you think?
^ permalink raw reply
* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Paul Moore @ 2019-05-30 14:34 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190530141951.iofimovrndap4npq@madcap2.tricolour.ca>
On Thu, May 30, 2019 at 10:20 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> On 2019-05-29 18:16, Paul Moore wrote:
> > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >
> > > Implement audit container identifier filtering using the AUDIT_CONTID
> > > field name to send an 8-character string representing a u64 since the
> > > value field is only u32.
> > >
> > > Sending it as two u32 was considered, but gathering and comparing two
> > > fields was more complex.
> > >
> > > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> > >
> > > Please see the github audit kernel issue for the contid filter feature:
> > > https://github.com/linux-audit/audit-kernel/issues/91
> > > Please see the github audit userspace issue for filter additions:
> > > https://github.com/linux-audit/audit-userspace/issues/40
> > > Please see the github audit testsuiite issue for the test case:
> > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > Please see the github audit wiki for the feature overview:
> > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > ---
> > > include/linux/audit.h | 1 +
> > > include/uapi/linux/audit.h | 5 ++++-
> > > kernel/audit.h | 1 +
> > > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > > kernel/auditsc.c | 4 ++++
> > > 5 files changed, 57 insertions(+), 1 deletion(-)
> >
> > ...
> >
> > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > > index 63f8b3f26fab..407b5bb3b4c6 100644
> > > --- a/kernel/auditfilter.c
> > > +++ b/kernel/auditfilter.c
> > > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > > }
> > > }
> > >
> > > +int audit_comparator64(u64 left, u32 op, u64 right)
> > > +{
> > > + switch (op) {
> > > + case Audit_equal:
> > > + return (left == right);
> > > + case Audit_not_equal:
> > > + return (left != right);
> > > + case Audit_lt:
> > > + return (left < right);
> > > + case Audit_le:
> > > + return (left <= right);
> > > + case Audit_gt:
> > > + return (left > right);
> > > + case Audit_ge:
> > > + return (left >= right);
> > > + case Audit_bitmask:
> > > + return (left & right);
> > > + case Audit_bittest:
> > > + return ((left & right) == right);
> > > + default:
> > > + BUG();
> >
> > A little birdy mentioned the BUG() here as a potential issue and while
> > I had ignored it in earlier patches because this is likely a
> > cut-n-paste from another audit comparator function, I took a closer
> > look this time. It appears as though we will never have an invalid op
> > value as audit_data_to_entry()/audit_to_op() ensure that the op value
> > is a a known good value. Removing the BUG() from all the audit
> > comparators is a separate issue, but I think it would be good to
> > remove it from this newly added comparator; keeping it so that we
> > return "0" in the default case seems reasoanble.
>
> Fair enough. That BUG(); can be removed.
Please send a fixup patch for this.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH ghak90 V6 04/10] audit: log container info of syscalls
From: Paul Moore @ 2019-05-30 14:34 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: Ondrej Mosnacek, Neil Horman, linux-api, containers, LKML,
David Howells, Linux-Audit Mailing List, netfilter-devel,
Eric W . Biederman, Simo Sorce, netdev, linux-fsdevel, Eric Paris,
Serge Hallyn
In-Reply-To: <20190530140849.zdxvlvkefwpngfil@madcap2.tricolour.ca>
On Thu, May 30, 2019 at 10:09 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> On 2019-05-30 15:08, Ondrej Mosnacek wrote:
> > On Thu, May 30, 2019 at 12:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Mon, Apr 8, 2019 at 11:40 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >
> > > > Create a new audit record AUDIT_CONTAINER_ID to document the audit
> > > > container identifier of a process if it is present.
> > > >
> > > > Called from audit_log_exit(), syscalls are covered.
> > > >
> > > > A sample raw event:
> > > > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> > > > type=CWD msg=audit(1519924845.499:257): cwd="/root"
> > > > type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > > > type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > > > type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > > > type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
> > > >
> > > > Please see the github audit kernel issue for the main feature:
> > > > https://github.com/linux-audit/audit-kernel/issues/90
> > > > Please see the github audit userspace issue for supporting additions:
> > > > https://github.com/linux-audit/audit-userspace/issues/51
> > > > Please see the github audit testsuiite issue for the test case:
> > > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > > Please see the github audit wiki for the feature overview:
> > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > > Acked-by: Steve Grubb <sgrubb@redhat.com>
> > > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > > ---
> > > > include/linux/audit.h | 5 +++++
> > > > include/uapi/linux/audit.h | 1 +
> > > > kernel/audit.c | 20 ++++++++++++++++++++
> > > > kernel/auditsc.c | 20 ++++++++++++++------
> > > > 4 files changed, 40 insertions(+), 6 deletions(-)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index 182b0f2c183d..3e0af53f3c4d 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > > > @@ -2127,6 +2127,26 @@ void audit_log_session_info(struct audit_buffer *ab)
> > > > audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
> > > > }
> > > >
> > > > +/*
> > > > + * audit_log_contid - report container info
> > > > + * @context: task or local context for record
> > > > + * @contid: container ID to report
> > > > + */
> > > > +void audit_log_contid(struct audit_context *context, u64 contid)
> > > > +{
> > > > + struct audit_buffer *ab;
> > > > +
> > > > + if (!audit_contid_valid(contid))
> > > > + return;
> > > > + /* Generate AUDIT_CONTAINER_ID record with container ID */
> > > > + ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
> > > > + if (!ab)
> > > > + return;
> > > > + audit_log_format(ab, "contid=%llu", (unsigned long long)contid);
> > >
> > > We have a consistency problem regarding how to output the u64 contid
> > > values; this function uses an explicit cast, others do not. According
> > > to Documentation/core-api/printk-formats.rst the recommendation for
> > > u64 is %llu (or %llx, if you want hex). Looking quickly through the
> > > printk code this appears to still be correct. I suggest we get rid of
> > > the cast (like it was in v5).
> >
> > IIRC it was me who suggested to add the casts. I didn't realize that
> > the kernel actually guarantees that "%llu" will always work with u64.
> > Taking that into account I rescind my request to add the cast. Sorry
> > for the false alarm.
>
> Yeah, just remove the cast.
Okay, this is trivial enough I'll take care of this during the merge
with a note.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH ghak90 V6 09/10] audit: add support for containerid to network namespaces
From: Paul Moore @ 2019-05-30 14:32 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190530141555.qqcbasvyp7eokwjz@madcap2.tricolour.ca>
On Thu, May 30, 2019 at 10:16 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> On 2019-05-29 18:17, Paul Moore wrote:
> > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >
> > > Audit events could happen in a network namespace outside of a task
> > > context due to packets received from the net that trigger an auditing
> > > rule prior to being associated with a running task. The network
> > > namespace could be in use by multiple containers by association to the
> > > tasks in that network namespace. We still want a way to attribute
> > > these events to any potential containers. Keep a list per network
> > > namespace to track these audit container identifiiers.
> > >
> > > Add/increment the audit container identifier on:
> > > - initial setting of the audit container identifier via /proc
> > > - clone/fork call that inherits an audit container identifier
> > > - unshare call that inherits an audit container identifier
> > > - setns call that inherits an audit container identifier
> > > Delete/decrement the audit container identifier on:
> > > - an inherited audit container identifier dropped when child set
> > > - process exit
> > > - unshare call that drops a net namespace
> > > - setns call that drops a net namespace
> > >
> > > Please see the github audit kernel issue for contid net support:
> > > https://github.com/linux-audit/audit-kernel/issues/92
> > > Please see the github audit testsuiite issue for the test case:
> > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > Please see the github audit wiki for the feature overview:
> > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > ---
> > > include/linux/audit.h | 19 +++++++++++
> > > kernel/audit.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > > kernel/nsproxy.c | 4 +++
> > > 3 files changed, 108 insertions(+), 3 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 6c742da66b32..996213591617 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -376,6 +384,75 @@ static struct sock *audit_get_sk(const struct net *net)
> > > return aunet->sk;
> > > }
> > >
> > > +void audit_netns_contid_add(struct net *net, u64 contid)
> > > +{
> > > + struct audit_net *aunet;
> > > + struct list_head *contid_list;
> > > + struct audit_contid *cont;
> > > +
> > > + if (!net)
> > > + return;
> > > + if (!audit_contid_valid(contid))
> > > + return;
> > > + aunet = net_generic(net, audit_net_id);
> > > + if (!aunet)
> > > + return;
> > > + contid_list = &aunet->contid_list;
> > > + spin_lock(&aunet->contid_list_lock);
> > > + list_for_each_entry_rcu(cont, contid_list, list)
> > > + if (cont->id == contid) {
> > > + refcount_inc(&cont->refcount);
> > > + goto out;
> > > + }
> > > + cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> > > + if (cont) {
> > > + INIT_LIST_HEAD(&cont->list);
> >
> > I thought you were going to get rid of this INIT_LIST_HEAD() call?
>
> I was intending to, and then Neil weighed in with this opinion:
>
> https://www.redhat.com/archives/linux-audit/2019-April/msg00014.html
>
> If you feel that isn't important, please remove it.
Okay, I missed/forgot that, it seems like the right thing to do is to
leave it as-is.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Richard Guy Briggs @ 2019-05-30 14:19 UTC (permalink / raw)
To: Paul Moore
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhRV-0LSEcRvPO1uXtKdpEQsaLZnBV3T=zcMTZPN5ugz5w@mail.gmail.com>
On 2019-05-29 18:16, Paul Moore wrote:
> On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > Implement audit container identifier filtering using the AUDIT_CONTID
> > field name to send an 8-character string representing a u64 since the
> > value field is only u32.
> >
> > Sending it as two u32 was considered, but gathering and comparing two
> > fields was more complex.
> >
> > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> >
> > Please see the github audit kernel issue for the contid filter feature:
> > https://github.com/linux-audit/audit-kernel/issues/91
> > Please see the github audit userspace issue for filter additions:
> > https://github.com/linux-audit/audit-userspace/issues/40
> > Please see the github audit testsuiite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > Acked-by: Serge Hallyn <serge@hallyn.com>
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
> > include/linux/audit.h | 1 +
> > include/uapi/linux/audit.h | 5 ++++-
> > kernel/audit.h | 1 +
> > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/auditsc.c | 4 ++++
> > 5 files changed, 57 insertions(+), 1 deletion(-)
>
> ...
>
> > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > index 63f8b3f26fab..407b5bb3b4c6 100644
> > --- a/kernel/auditfilter.c
> > +++ b/kernel/auditfilter.c
> > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > }
> > }
> >
> > +int audit_comparator64(u64 left, u32 op, u64 right)
> > +{
> > + switch (op) {
> > + case Audit_equal:
> > + return (left == right);
> > + case Audit_not_equal:
> > + return (left != right);
> > + case Audit_lt:
> > + return (left < right);
> > + case Audit_le:
> > + return (left <= right);
> > + case Audit_gt:
> > + return (left > right);
> > + case Audit_ge:
> > + return (left >= right);
> > + case Audit_bitmask:
> > + return (left & right);
> > + case Audit_bittest:
> > + return ((left & right) == right);
> > + default:
> > + BUG();
>
> A little birdy mentioned the BUG() here as a potential issue and while
> I had ignored it in earlier patches because this is likely a
> cut-n-paste from another audit comparator function, I took a closer
> look this time. It appears as though we will never have an invalid op
> value as audit_data_to_entry()/audit_to_op() ensure that the op value
> is a a known good value. Removing the BUG() from all the audit
> comparators is a separate issue, but I think it would be good to
> remove it from this newly added comparator; keeping it so that we
> return "0" in the default case seems reasoanble.
Fair enough. That BUG(); can be removed.
> > + return 0;
> > + }
> > +}
>
> --
> paul moore
> www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 09/10] audit: add support for containerid to network namespaces
From: Richard Guy Briggs @ 2019-05-30 14:15 UTC (permalink / raw)
To: Paul Moore
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhQ9t-mvJGNCzArjg+MTGNXcZbVrWV4=RUD5ML_bHqua1Q@mail.gmail.com>
On 2019-05-29 18:17, Paul Moore wrote:
> On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > Audit events could happen in a network namespace outside of a task
> > context due to packets received from the net that trigger an auditing
> > rule prior to being associated with a running task. The network
> > namespace could be in use by multiple containers by association to the
> > tasks in that network namespace. We still want a way to attribute
> > these events to any potential containers. Keep a list per network
> > namespace to track these audit container identifiiers.
> >
> > Add/increment the audit container identifier on:
> > - initial setting of the audit container identifier via /proc
> > - clone/fork call that inherits an audit container identifier
> > - unshare call that inherits an audit container identifier
> > - setns call that inherits an audit container identifier
> > Delete/decrement the audit container identifier on:
> > - an inherited audit container identifier dropped when child set
> > - process exit
> > - unshare call that drops a net namespace
> > - setns call that drops a net namespace
> >
> > Please see the github audit kernel issue for contid net support:
> > https://github.com/linux-audit/audit-kernel/issues/92
> > Please see the github audit testsuiite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
> > include/linux/audit.h | 19 +++++++++++
> > kernel/audit.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--
> > kernel/nsproxy.c | 4 +++
> > 3 files changed, 108 insertions(+), 3 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 6c742da66b32..996213591617 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -376,6 +384,75 @@ static struct sock *audit_get_sk(const struct net *net)
> > return aunet->sk;
> > }
> >
> > +void audit_netns_contid_add(struct net *net, u64 contid)
> > +{
> > + struct audit_net *aunet;
> > + struct list_head *contid_list;
> > + struct audit_contid *cont;
> > +
> > + if (!net)
> > + return;
> > + if (!audit_contid_valid(contid))
> > + return;
> > + aunet = net_generic(net, audit_net_id);
> > + if (!aunet)
> > + return;
> > + contid_list = &aunet->contid_list;
> > + spin_lock(&aunet->contid_list_lock);
> > + list_for_each_entry_rcu(cont, contid_list, list)
> > + if (cont->id == contid) {
> > + refcount_inc(&cont->refcount);
> > + goto out;
> > + }
> > + cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> > + if (cont) {
> > + INIT_LIST_HEAD(&cont->list);
>
> I thought you were going to get rid of this INIT_LIST_HEAD() call?
I was intending to, and then Neil weighed in with this opinion:
https://www.redhat.com/archives/linux-audit/2019-April/msg00014.html
If you feel that isn't important, please remove it.
> > + cont->id = contid;
> > + refcount_set(&cont->refcount, 1);
> > + list_add_rcu(&cont->list, contid_list);
> > + }
> > +out:
> > + spin_unlock(&aunet->contid_list_lock);
> > +}
>
> --
> paul moore
> www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 04/10] audit: log container info of syscalls
From: Richard Guy Briggs @ 2019-05-30 14:08 UTC (permalink / raw)
To: Ondrej Mosnacek
Cc: Paul Moore, Neil Horman, linux-api, containers, LKML,
David Howells, Linux-Audit Mailing List, netfilter-devel,
Eric W . Biederman, Simo Sorce, netdev, linux-fsdevel, Eric Paris,
Serge Hallyn
In-Reply-To: <CAFqZXNsK6M_L_0dFzkEgh_QVP-fyb+fE0MMRsJ2kXxtKM3VUKA@mail.gmail.com>
On 2019-05-30 15:08, Ondrej Mosnacek wrote:
> On Thu, May 30, 2019 at 12:16 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, Apr 8, 2019 at 11:40 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >
> > > Create a new audit record AUDIT_CONTAINER_ID to document the audit
> > > container identifier of a process if it is present.
> > >
> > > Called from audit_log_exit(), syscalls are covered.
> > >
> > > A sample raw event:
> > > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> > > type=CWD msg=audit(1519924845.499:257): cwd="/root"
> > > type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > > type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > > type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > > type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
> > >
> > > Please see the github audit kernel issue for the main feature:
> > > https://github.com/linux-audit/audit-kernel/issues/90
> > > Please see the github audit userspace issue for supporting additions:
> > > https://github.com/linux-audit/audit-userspace/issues/51
> > > Please see the github audit testsuiite issue for the test case:
> > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > Please see the github audit wiki for the feature overview:
> > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > Acked-by: Steve Grubb <sgrubb@redhat.com>
> > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > ---
> > > include/linux/audit.h | 5 +++++
> > > include/uapi/linux/audit.h | 1 +
> > > kernel/audit.c | 20 ++++++++++++++++++++
> > > kernel/auditsc.c | 20 ++++++++++++++------
> > > 4 files changed, 40 insertions(+), 6 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 182b0f2c183d..3e0af53f3c4d 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -2127,6 +2127,26 @@ void audit_log_session_info(struct audit_buffer *ab)
> > > audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
> > > }
> > >
> > > +/*
> > > + * audit_log_contid - report container info
> > > + * @context: task or local context for record
> > > + * @contid: container ID to report
> > > + */
> > > +void audit_log_contid(struct audit_context *context, u64 contid)
> > > +{
> > > + struct audit_buffer *ab;
> > > +
> > > + if (!audit_contid_valid(contid))
> > > + return;
> > > + /* Generate AUDIT_CONTAINER_ID record with container ID */
> > > + ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
> > > + if (!ab)
> > > + return;
> > > + audit_log_format(ab, "contid=%llu", (unsigned long long)contid);
> >
> > We have a consistency problem regarding how to output the u64 contid
> > values; this function uses an explicit cast, others do not. According
> > to Documentation/core-api/printk-formats.rst the recommendation for
> > u64 is %llu (or %llx, if you want hex). Looking quickly through the
> > printk code this appears to still be correct. I suggest we get rid of
> > the cast (like it was in v5).
>
> IIRC it was me who suggested to add the casts. I didn't realize that
> the kernel actually guarantees that "%llu" will always work with u64.
> Taking that into account I rescind my request to add the cast. Sorry
> for the false alarm.
Yeah, just remove the cast.
> > > + audit_log_end(ab);
> > > +}
> > > +EXPORT_SYMBOL(audit_log_contid);
> >
> > --
> > paul moore
> > www.paul-moore.com
>
> --
> Ondrej Mosnacek <omosnace at redhat dot com>
> Software Engineer, Security Technologies
> Red Hat, Inc.
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Richard Guy Briggs @ 2019-05-30 14:08 UTC (permalink / raw)
To: Paul Moore
Cc: Steve Grubb, Neil Horman, linux-api, containers, LKML, dhowells,
Linux-Audit Mailing List, netfilter-devel, ebiederm, simo, netdev,
linux-fsdevel, Eric Paris, Serge Hallyn
In-Reply-To: <CAHC9VhQqiaHRxZkOZO2c5UzMOQ_NNu1pEsc4fQDX=Hj-HnGUoQ@mail.gmail.com>
On 2019-05-30 09:35, Paul Moore wrote:
> On Thu, May 30, 2019 at 9:08 AM Steve Grubb <sgrubb@redhat.com> wrote:
> > On Wednesday, May 29, 2019 6:26:12 PM EDT Paul Moore wrote:
> > > On Mon, Apr 22, 2019 at 9:49 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
> > wrote:
> > > > > On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > > > > Implement kernel audit container identifier.
> > > > >
> > > > > I'm sorry, I've lost track of this, where have we landed on it? Are we
> > > > > good for inclusion?
> > > >
> > > > I haven't finished going through this latest revision, but unless
> > > > Richard made any significant changes outside of the feedback from the
> > > > v5 patchset I'm guessing we are "close".
> > > >
> > > > Based on discussions Richard and I had some time ago, I have always
> > > > envisioned the plan as being get the kernel patchset, tests, docs
> > > > ready (which Richard has been doing) and then run the actual
> > > > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > > > to make sure the actual implementation is sane from their perspective.
> > > > They've already seen the design, so I'm not expecting any real
> > > > surprises here, but sometimes opinions change when they have actual
> > > > code in front of them to play with and review.
> > > >
> > > > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > > > whatever additional testing we can do would be a big win. I'm
> > > > thinking I'll pull it into a separate branch in the audit tree
> > > > (audit/working-container ?) and include that in my secnext kernels
> > > > that I build/test on a regular basis; this is also a handy way to keep
> > > > it based against the current audit/next branch. If any changes are
> > > > needed Richard can either chose to base those changes on audit/next or
> > > > the separate audit container ID branch; that's up to him. I've done
> > > > this with other big changes in other trees, e.g. SELinux, and it has
> > > > worked well to get some extra testing in and keep the patchset "merge
> > > > ready" while others outside the subsystem look things over.
> > >
> > > I just sent my feedback on the v6 patchset, and it's small: basically
> > > three patches with "one-liner" changes needed.
> > >
> > > Richard, it's your call on how you want to proceed from here. You can
> > > post a v7 incorporating the feedback, or since the tweaks are so
> > > minor, you can post fixup patches; the former being more
> > > comprehensive, the later being quicker to review and digest.
> > > Regardless of that, while we are waiting on a prototype from the
> > > container folks, I think it would be good to pull this into a working
> > > branch in the audit repo (as mentioned above), unless you would prefer
> > > to keep it as a patchset on the mailing list?
> >
> > Personally, I'd like to see this on a branch so that it's easier to build a
> > kernel locally for testing.
>
> FWIW, if Richard does prefer for me to pull it into a working branch I
> plan to include it in my secnext builds both to make it easier to test
> regularly and to make the changes available to people who don't want
> to build their own kernel.
Sure, let's do a working branch. I'll answer the issues in respective
threads...
> * http://www.paul-moore.com/blog/d/2019/04/kernel_secnext_repo.html
>
> --
> paul moore
> www.paul-moore.com
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-30 13:35 UTC (permalink / raw)
To: Steve Grubb
Cc: Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, omosnace, dhowells, simo, Eric Paris,
Serge Hallyn, ebiederm, Neil Horman
In-Reply-To: <1674888.6UpDe63hFX@x2>
On Thu, May 30, 2019 at 9:08 AM Steve Grubb <sgrubb@redhat.com> wrote:
> On Wednesday, May 29, 2019 6:26:12 PM EDT Paul Moore wrote:
> > On Mon, Apr 22, 2019 at 9:49 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
> wrote:
> > > > On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > > > Implement kernel audit container identifier.
> > > >
> > > > I'm sorry, I've lost track of this, where have we landed on it? Are we
> > > > good for inclusion?
> > >
> > > I haven't finished going through this latest revision, but unless
> > > Richard made any significant changes outside of the feedback from the
> > > v5 patchset I'm guessing we are "close".
> > >
> > > Based on discussions Richard and I had some time ago, I have always
> > > envisioned the plan as being get the kernel patchset, tests, docs
> > > ready (which Richard has been doing) and then run the actual
> > > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > > to make sure the actual implementation is sane from their perspective.
> > > They've already seen the design, so I'm not expecting any real
> > > surprises here, but sometimes opinions change when they have actual
> > > code in front of them to play with and review.
> > >
> > > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > > whatever additional testing we can do would be a big win. I'm
> > > thinking I'll pull it into a separate branch in the audit tree
> > > (audit/working-container ?) and include that in my secnext kernels
> > > that I build/test on a regular basis; this is also a handy way to keep
> > > it based against the current audit/next branch. If any changes are
> > > needed Richard can either chose to base those changes on audit/next or
> > > the separate audit container ID branch; that's up to him. I've done
> > > this with other big changes in other trees, e.g. SELinux, and it has
> > > worked well to get some extra testing in and keep the patchset "merge
> > > ready" while others outside the subsystem look things over.
> >
> > I just sent my feedback on the v6 patchset, and it's small: basically
> > three patches with "one-liner" changes needed.
> >
> > Richard, it's your call on how you want to proceed from here. You can
> > post a v7 incorporating the feedback, or since the tweaks are so
> > minor, you can post fixup patches; the former being more
> > comprehensive, the later being quicker to review and digest.
> > Regardless of that, while we are waiting on a prototype from the
> > container folks, I think it would be good to pull this into a working
> > branch in the audit repo (as mentioned above), unless you would prefer
> > to keep it as a patchset on the mailing list?
>
> Personally, I'd like to see this on a branch so that it's easier to build a
> kernel locally for testing.
FWIW, if Richard does prefer for me to pull it into a working branch I
plan to include it in my secnext builds both to make it easier to test
regularly and to make the changes available to people who don't want
to build their own kernel.
* http://www.paul-moore.com/blog/d/2019/04/kernel_secnext_repo.html
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH v1 1/2] fork: add clone3
From: Szabolcs Nagy @ 2019-05-30 13:20 UTC (permalink / raw)
To: Christian Brauner
Cc: viro, linux-kernel, torvalds, jannh, fweimer, oleg, arnd,
dhowells, Pavel Emelyanov, Andrew Morton, Adrian Reber,
Andrei Vagin, linux-api
In-Reply-To: <20190529152237.10719-1-christian@brauner.io>
* Christian Brauner <christian@brauner.io> [2019-05-29 17:22:36 +0200]:
> This adds the clone3 system call.
>
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
>
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
>
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again.
>
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
> in clone() into a dedicated struct clone_args
> - choose a struct layout that is easy to handle on 32 and on 64 bit
> - choose a struct layout that is extensible
> - give new flags that currently need to abuse another flag's dedicated
> return argument in clone() their own dedicated return argument
> (e.g. CLONE_PIDFD)
> - use a separate kernel internal struct kernel_clone_args that is
> properly typed according to current kernel conventions in fork.c and is
> different from the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
> syscalls such as fork(), vfork(), clone(), and clone3() behave identical
> (Arnd suggested, that we can probably also port do_fork() itself in a
> separate patchset.)
> - ease of transition for userspace from clone() to clone3()
> This very much means that we do *not* remove functionality that userspace
> currently relies on as the latter is a good way of creating a syscall
> that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
>
> In accordance with Linus suggestions, clone3() has the following signature:
>
> /* uapi */
> struct clone_args {
> __aligned_u64 flags;
> __aligned_u64 pidfd;
> __aligned_u64 parent_tidptr;
> __aligned_u64 child_tidptr;
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> };
is this new linux syscall api style to pass pointers as u64?
i think it will look a bit ugly in userspace where cast
to u64 would signextend pointers on most 32bit targets, so
user code would have to do something like
arg.ptr = (uint64_t)(uintptr_t)ptr;
such ugliness can be hidden by the libc with a different
struct definition, but it will require bigendian and alignment
hackery (or translation in libc, but that does not really work
when user calls raw syscall).
>
> /* kernel internal */
> struct kernel_clone_args {
> u64 flags;
> int __user *pidfd;
> int __user *parent_tidptr;
> int __user *child_tidptr;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> };
>
> long sys_clone3(struct clone_args __user *uargs, size_t size)
>
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
>
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
>
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> --
> v1:
> - Linus Torvalds <torvalds@linux-foundation.org>:
> - redesign based on Linus proposal
> - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@arndb.de>:
> - use a single copy_from_user() instead of multiple get_user() calls
> since the latter have a constant overhead on some architectures
> - a range of other tweaks and suggestions
> ---
> arch/x86/ia32/sys_ia32.c | 11 ++-
> include/linux/sched/task.h | 13 ++-
> include/linux/syscalls.h | 6 ++
> include/uapi/linux/sched.h | 16 ++++
> kernel/fork.c | 176 ++++++++++++++++++++++++++++---------
> 5 files changed, 177 insertions(+), 45 deletions(-)
...
^ permalink raw reply
* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Steve Grubb @ 2019-05-30 13:08 UTC (permalink / raw)
To: Paul Moore
Cc: Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, omosnace, dhowells, simo, Eric Paris,
Serge Hallyn, ebiederm, Neil Horman
In-Reply-To: <CAHC9VhTQ0gDZoWUh1QB4b7h3AgbpkhS40jrPVpCfJb11GT_FzQ@mail.gmail.com>
On Wednesday, May 29, 2019 6:26:12 PM EDT Paul Moore wrote:
> On Mon, Apr 22, 2019 at 9:49 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
wrote:
> > > On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > > Implement kernel audit container identifier.
> > >
> > > I'm sorry, I've lost track of this, where have we landed on it? Are we
> > > good for inclusion?
> >
> > I haven't finished going through this latest revision, but unless
> > Richard made any significant changes outside of the feedback from the
> > v5 patchset I'm guessing we are "close".
> >
> > Based on discussions Richard and I had some time ago, I have always
> > envisioned the plan as being get the kernel patchset, tests, docs
> > ready (which Richard has been doing) and then run the actual
> > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > to make sure the actual implementation is sane from their perspective.
> > They've already seen the design, so I'm not expecting any real
> > surprises here, but sometimes opinions change when they have actual
> > code in front of them to play with and review.
> >
> > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > whatever additional testing we can do would be a big win. I'm
> > thinking I'll pull it into a separate branch in the audit tree
> > (audit/working-container ?) and include that in my secnext kernels
> > that I build/test on a regular basis; this is also a handy way to keep
> > it based against the current audit/next branch. If any changes are
> > needed Richard can either chose to base those changes on audit/next or
> > the separate audit container ID branch; that's up to him. I've done
> > this with other big changes in other trees, e.g. SELinux, and it has
> > worked well to get some extra testing in and keep the patchset "merge
> > ready" while others outside the subsystem look things over.
>
> I just sent my feedback on the v6 patchset, and it's small: basically
> three patches with "one-liner" changes needed.
>
> Richard, it's your call on how you want to proceed from here. You can
> post a v7 incorporating the feedback, or since the tweaks are so
> minor, you can post fixup patches; the former being more
> comprehensive, the later being quicker to review and digest.
> Regardless of that, while we are waiting on a prototype from the
> container folks, I think it would be good to pull this into a working
> branch in the audit repo (as mentioned above), unless you would prefer
> to keep it as a patchset on the mailing list?
Personally, I'd like to see this on a branch so that it's easier to build a
kernel locally for testing.
-Steve
> If you want to go with
> the working branch approach, I'll keep the branch fresh and (re)based
> against audit/next and if we notice any problems you can just submit
> fixes against that branch (depending on the issue they can be fixup
> patches, or proper patches). My hope is that this will enable the
> process to move quicker as we get near the finish line.
^ permalink raw reply
* Re: [PATCH ghak90 V6 04/10] audit: log container info of syscalls
From: Ondrej Mosnacek @ 2019-05-30 13:08 UTC (permalink / raw)
To: Paul Moore
Cc: Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, Steve Grubb, David Howells, Simo Sorce,
Eric Paris, Serge Hallyn, Eric W . Biederman, Neil Horman
In-Reply-To: <CAHC9VhRfQp-avV2rcEOvLCAXEz-MDZMp91UxU+BtvPkvWny9fQ@mail.gmail.com>
On Thu, May 30, 2019 at 12:16 AM Paul Moore <paul@paul-moore.com> wrote:
> On Mon, Apr 8, 2019 at 11:40 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > Create a new audit record AUDIT_CONTAINER_ID to document the audit
> > container identifier of a process if it is present.
> >
> > Called from audit_log_exit(), syscalls are covered.
> >
> > A sample raw event:
> > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> > type=CWD msg=audit(1519924845.499:257): cwd="/root"
> > type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> > type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
> >
> > Please see the github audit kernel issue for the main feature:
> > https://github.com/linux-audit/audit-kernel/issues/90
> > Please see the github audit userspace issue for supporting additions:
> > https://github.com/linux-audit/audit-userspace/issues/51
> > Please see the github audit testsuiite issue for the test case:
> > https://github.com/linux-audit/audit-testsuite/issues/64
> > Please see the github audit wiki for the feature overview:
> > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > Acked-by: Serge Hallyn <serge@hallyn.com>
> > Acked-by: Steve Grubb <sgrubb@redhat.com>
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
> > include/linux/audit.h | 5 +++++
> > include/uapi/linux/audit.h | 1 +
> > kernel/audit.c | 20 ++++++++++++++++++++
> > kernel/auditsc.c | 20 ++++++++++++++------
> > 4 files changed, 40 insertions(+), 6 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 182b0f2c183d..3e0af53f3c4d 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2127,6 +2127,26 @@ void audit_log_session_info(struct audit_buffer *ab)
> > audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
> > }
> >
> > +/*
> > + * audit_log_contid - report container info
> > + * @context: task or local context for record
> > + * @contid: container ID to report
> > + */
> > +void audit_log_contid(struct audit_context *context, u64 contid)
> > +{
> > + struct audit_buffer *ab;
> > +
> > + if (!audit_contid_valid(contid))
> > + return;
> > + /* Generate AUDIT_CONTAINER_ID record with container ID */
> > + ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
> > + if (!ab)
> > + return;
> > + audit_log_format(ab, "contid=%llu", (unsigned long long)contid);
>
> We have a consistency problem regarding how to output the u64 contid
> values; this function uses an explicit cast, others do not. According
> to Documentation/core-api/printk-formats.rst the recommendation for
> u64 is %llu (or %llx, if you want hex). Looking quickly through the
> printk code this appears to still be correct. I suggest we get rid of
> the cast (like it was in v5).
IIRC it was me who suggested to add the casts. I didn't realize that
the kernel actually guarantees that "%llu" will always work with u64.
Taking that into account I rescind my request to add the cast. Sorry
for the false alarm.
>
> > + audit_log_end(ab);
> > +}
> > +EXPORT_SYMBOL(audit_log_contid);
>
> --
> paul moore
> www.paul-moore.com
--
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.
^ permalink raw reply
* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Jan Kara @ 2019-05-30 11:00 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, David Howells, Al Viro, Ian Kent, linux-fsdevel,
linux-api, linux-block, keyrings, LSM List, linux-kernel
In-Reply-To: <CAOQ4uxjLzURf8c1UH_xCJKkuD2es8i-=P-ZNM=t3aFcZLMwXEg@mail.gmail.com>
On Wed 29-05-19 18:53:21, Amir Goldstein wrote:
> > > David,
> > >
> > > I am interested to know how you envision filesystem notifications would
> > > look with this interface.
> > >
> > > fanotify can certainly benefit from providing a ring buffer interface to read
> > > events.
> > >
> > > From what I have seen, a common practice of users is to monitor mounts
> > > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > > It'd be good if those users can use a single watch mechanism/API for
> > > watching the mount namespace and filesystem events within mounts.
> > >
> > > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > > It provides users with two complete different mechanisms to watch error
> > > and filesystem events. That is generally not a good thing to have.
> > >
> > > I am not asking that you implement fs_notify() before merging sb_notify()
> > > and I understand that you have a use case for sb_notify().
> > > I am asking that you show me the path towards a unified API (how a
> > > typical program would look like), so that we know before merging your
> > > new API that it could be extended to accommodate fsnotify events
> > > where the final result will look wholesome to users.
> >
> > Are you sure we want to combine notification about file changes etc. with
> > administrator-type notifications about the filesystem? To me these two
> > sound like rather different (although sometimes related) things.
> >
>
> Well I am sure that ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
>
> But apart from that I foresee the questions from users about why the
> mount notification API and filesystem events API do not have better
> integration.
>
> The way I see it, the notification queue can serve several classes
> of notifications and fsnotify could be one of those classes
> (at least FAN_CLASS_NOTIF fits nicely to the model).
I agree that for some type of fsnotify uses a ring buffer would make sense.
But for others - such as permission events or unlimited queues - you cannot
really use the ring buffer and I don't like the idea of having different
ways of passing fsnotify events to userspace based on notification group
type...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH v9 00/16] Add utilization clamping support
From: Patrick Bellasi @ 2019-05-30 10:15 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, linux-pm, linux-api
Cc: Peter Zijlstra, Rafael J . Wysocki
In-Reply-To: <20190515094459.10317-1-patrick.bellasi@arm.com>
Hi Tejun,
this is just a gentle ping.
I had a chance to speak with Peter and Rafael at the last OSPM and
they both seems to be reasonably happy with the current status of the
core bits.
Thus, it would be very useful if you can jump into the discussion and
start the review of the cgroups integration bits.
You can find them in the last 5 patches of this series.
Luckily on the CGroup side, we don't start from ground zero.
Many aspects and pain points have been already discussed in the past
and have been addressed by the current version.
For you benefit, here is a list of previously discussed items:
- A child can never obtain more than its ancestors
https://lore.kernel.org/lkml/20180426185845.GO1911913@devbig577.frc2.facebook.com/
- Enforcing consistency rules among parent-child groups
https://lore.kernel.org/lkml/20180410200514.GA793541@devbig577.frc2.facebook.com/
- Use "effective" attributes to enforce restrictions:
https://lore.kernel.org/lkml/20180409222417.GK3126663@devbig577.frc2.facebook.com
https://lore.kernel.org/lkml/20180410200514.GA793541@devbig577.frc2.facebook.com
- Tasks on a subgroup can only be more boosted and/or capped,
i.e. parent always set an upper bound on both max and min clamps
https://lore.kernel.org/lkml/20180409222417.GK3126663@devbig577.frc2.facebook.com
https://lore.kernel.org/lkml/20180410200514.GA793541@devbig577.frc2.facebook.com
- Put everything at system level outside of the cgroup interface,
i.e. no root group tunable attributes
https://lore.kernel.org/lkml/20180409222417.GK3126663@devbig577.frc2.facebook.com/
- CGroups don't need any hierarchical behaviors on their own
https://lore.kernel.org/lkml/20180723153040.GG1934745@devbig577.frc2.facebook.com/
Looking forward to get some more feedbacks from you.
Cheers,
Patrick
On 15-May 10:44, Patrick Bellasi wrote:
> Hi all, this is a respin of:
>
> https://lore.kernel.org/lkml/20190402104153.25404-1-patrick.bellasi@arm.com/
>
> which includes the following main changes:
>
> - fix "max local update" by moving into uclamp_rq_inc_id()
> - use for_each_clamp_id() and uclamp_se_set() to make code less fragile
> - rename sysfs node: s/sched_uclamp_util_{min,max}/sched_util_clamp_{min,max}/
> - removed not used uclamp_eff_bucket_id()
> - move uclamp_bucket_base_value() definition into patch using it
> - get rid of not necessary SCHED_POLICY_MAX define
> - update sched_setattr() syscall to just force the current policy on
> SCHED_FLAG_KEEP_POLICY
> - return EOPNOTSUPP from uclamp_validate() on !CONFIG_UCLAMP_TASK
> - make alloc_uclamp_sched_group() a void function
> - simplify bits_per() definition
> - add rq's lockdep assert to uclamp_rq_{inc,dec}_id()
>
> With the above, I think we captured all the major inputs from Peter [1].
> Thus, this v9 is likely the right version to unlock Tejun's review [2] on the
> remaining cgroup related bits, i.e. patches [12-16].
>
> Cheers Patrick
>
>
> Series Organization
> ===================
>
> The series is organized into these main sections:
>
> - Patches [01-07]: Per task (primary) API
> - Patches [08-09]: Schedutil integration for FAIR and RT tasks
> - Patches [10-11]: Integration with EAS's energy_compute()
> - Patches [12-16]: Per task group (secondary) API
>
> It is based on today's tip/sched/core and the full tree is available here:
>
> git://linux-arm.org/linux-pb.git lkml/utilclamp_v9
> http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v9
>
>
> Newcomer's Short Abstract
> =========================
>
> The Linux scheduler tracks a "utilization" signal for each scheduling entity
> (SE), e.g. tasks, to know how much CPU time they use. This signal allows the
> scheduler to know how "big" a task is and, in principle, it can support
> advanced task placement strategies by selecting the best CPU to run a task.
> Some of these strategies are represented by the Energy Aware Scheduler [3].
>
> When the schedutil cpufreq governor is in use, the utilization signal allows
> the Linux scheduler to also drive frequency selection. The CPU utilization
> signal, which represents the aggregated utilization of tasks scheduled on that
> CPU, is used to select the frequency which best fits the workload generated by
> the tasks.
>
> The current translation of utilization values into a frequency selection is
> simple: we go to max for RT tasks or to the minimum frequency which can
> accommodate the utilization of DL+FAIR tasks.
> However, utilization values by themselves cannot convey the desired
> power/performance behaviors of each task as intended by user-space.
> As such they are not ideally suited for task placement decisions.
>
> Task placement and frequency selection policies in the kernel can be improved
> by taking into consideration hints coming from authorized user-space elements,
> like for example the Android middleware or more generally any "System
> Management Software" (SMS) framework.
>
> Utilization clamping is a mechanism which allows to "clamp" (i.e. filter) the
> utilization generated by RT and FAIR tasks within a range defined by user-space.
> The clamped utilization value can then be used, for example, to enforce a
> minimum and/or maximum frequency depending on which tasks are active on a CPU.
>
> The main use-cases for utilization clamping are:
>
> - boosting: better interactive response for small tasks which
> are affecting the user experience.
>
> Consider for example the case of a small control thread for an external
> accelerator (e.g. GPU, DSP, other devices). Here, from the task utilization
> the scheduler does not have a complete view of what the task's requirements
> are and, if it's a small utilization task, it keeps selecting a more energy
> efficient CPU, with smaller capacity and lower frequency, thus negatively
> impacting the overall time required to complete task activations.
>
> - capping: increase energy efficiency for background tasks not affecting the
> user experience.
>
> Since running on a lower capacity CPU at a lower frequency is more energy
> efficient, when the completion time is not a main goal, then capping the
> utilization considered for certain (maybe big) tasks can have positive
> effects, both on energy consumption and thermal headroom.
> This feature allows also to make RT tasks more energy friendly on mobile
> systems where running them on high capacity CPUs and at the maximum
> frequency is not required.
>
> From these two use-cases, it's worth noticing that frequency selection
> biasing, introduced by patches 9 and 10 of this series, is just one possible
> usage of utilization clamping. Another compelling extension of utilization
> clamping is in helping the scheduler in making tasks placement decisions.
>
> Utilization is (also) a task specific property the scheduler uses to know
> how much CPU bandwidth a task requires, at least as long as there is idle time.
> Thus, the utilization clamp values, defined either per-task or per-task_group,
> can represent tasks to the scheduler as being bigger (or smaller) than what
> they actually are.
>
> Utilization clamping thus enables interesting additional optimizations, for
> example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
> where:
>
> - boosting: try to run small/foreground tasks on higher-capacity CPUs to
> complete them faster despite being less energy efficient.
>
> - capping: try to run big/background tasks on low-capacity CPUs to save power
> and thermal headroom for more important tasks
>
> This series does not present this additional usage of utilization clamping but
> it's an integral part of the EAS feature set, where [4] is one of its main
> components.
>
> Android kernels use SchedTune, a solution similar to utilization clamping, to
> bias both 'frequency selection' and 'task placement'. This series provides the
> foundation to add similar features to mainline while focusing, for the
> time being, just on schedutil integration.
>
>
> References
> ==========
>
> [1] Message-ID: <20190509130215.GV2623@hirez.programming.kicks-ass.net>
> https://lore.kernel.org/lkml/20190509130215.GV2623@hirez.programming.kicks-ass.net/
>
> [2] Message-ID: <20180911162827.GJ1100574@devbig004.ftw2.facebook.com>
> https://lore.kernel.org/lkml/20180911162827.GJ1100574@devbig004.ftw2.facebook.com/
>
> [3] Energy Aware Scheduling
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1
>
> [4] Expressing per-task/per-cgroup performance hints
> Linux Plumbers Conference 2018
> https://linuxplumbersconf.org/event/2/contributions/128/
>
>
> Patrick Bellasi (16):
> sched/core: uclamp: Add CPU's clamp buckets refcounting
> sched/core: uclamp: Add bucket local max tracking
> sched/core: uclamp: Enforce last task's UCLAMP_MAX
> sched/core: uclamp: Add system default clamps
> sched/core: Allow sched_setattr() to use the current policy
> sched/core: uclamp: Extend sched_setattr() to support utilization
> clamping
> sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
> sched/core: uclamp: Set default clamps for RT tasks
> sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
> sched/core: uclamp: Add uclamp_util_with()
> sched/fair: uclamp: Add uclamp support to energy_compute()
> sched/core: uclamp: Extend CPU's cgroup controller
> sched/core: uclamp: Propagate parent clamps
> sched/core: uclamp: Propagate system defaults to root group
> sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
> sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>
> Documentation/admin-guide/cgroup-v2.rst | 46 ++
> include/linux/log2.h | 34 ++
> include/linux/sched.h | 58 ++
> include/linux/sched/sysctl.h | 11 +
> include/linux/sched/topology.h | 6 -
> include/uapi/linux/sched.h | 14 +-
> include/uapi/linux/sched/types.h | 66 ++-
> init/Kconfig | 75 +++
> kernel/sched/core.c | 754 +++++++++++++++++++++++-
> kernel/sched/cpufreq_schedutil.c | 22 +-
> kernel/sched/fair.c | 44 +-
> kernel/sched/rt.c | 4 +
> kernel/sched/sched.h | 123 +++-
> kernel/sysctl.c | 16 +
> 14 files changed, 1229 insertions(+), 44 deletions(-)
>
> --
> 2.21.0
>
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Andrea Parri @ 2019-05-30 9:50 UTC (permalink / raw)
To: Greg KH
Cc: David Howells, viro, raven, linux-fsdevel, linux-api, linux-block,
keyrings, linux-security-module, linux-kernel, Peter Zijlstra,
Will Deacon, Paul E. McKenney, Mark Rutland
In-Reply-To: <20190529231112.GB3164@kroah.com>
> > Looking at the perf ring buffer, there appears to be a missing barrier in
> > perf_aux_output_end():
> >
> > rb->user_page->aux_head = rb->aux_head;
> >
> > should be:
> >
> > smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> >
> > It should also be using smp_load_acquire(). See
> > Documentation/core-api/circular-buffers.rst
> >
> > And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249
>
> So, if that's all that needs to be fixed, can you use the same
> buffer/code if that patch is merged?
That's about one year old...: let me add the usual suspects in Cc: ;-)
since I'm not sure what the plan was (or if I'm missing something) ...
Speaking of ring buffer implementations (and relatively "old" patches),
here's another quite interesting:
https://lkml.kernel.org/r/20181211034032.32338-1-yuleixzhang@tencent.com
Thanks,
Andrea
^ permalink raw reply
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Minchan Kim @ 2019-05-30 8:02 UTC (permalink / raw)
To: Michal Hocko
Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190530065755.GD6703@dhcp22.suse.cz>
On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > [Cc linux-api]
> > > > > > > > > >
> > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > > multiple address range.
> > > > > > > > > >
> > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > >
> > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > > with number in the description at respin.
> > > > > > > >
> > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > >
> > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > > to me.
> > > > > >
> > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > through then you should better get some supporting data.
> > > > >
> > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > > I will drop vector support at next revision.
> > > >
> > > > Please do keep the vector support. Absolute timing is misleading,
> > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > seen tons of improvements in things like camera start come from
> > > > coalescing mprotect calls, with the gains coming from taking and
> > > > releasing various locks a lot less often and bouncing around less on
> > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > especially on mobile.
> > >
> > > This will always be a double edge sword. Taking a lock for longer can
> > > improve a throughput of a single call but it would make a latency for
> > > anybody contending on the lock much worse.
> > >
> > > Besides that, please do not overcomplicate the thing from the early
> > > beginning please. Let's start with a simple and well defined remote
> > > madvise alternative first and build a vector API on top with some
> > > numbers based on _real_ workloads.
> >
> > First time, I didn't think about atomicity about address range race
> > because MADV_COLD/PAGEOUT is not critical for the race.
> > However you raised the atomicity issue because people would extend
> > hints to destructive ones easily. I agree with that and that's why
> > we discussed how to guarantee the race and Daniel comes up with good idea.
>
> Just for the clarification, I didn't really mean atomicity but rather a
> _consistency_ (essentially time to check to time to use consistency).
What do you mean by *consistency*? Could you elaborate it more?
>
> > - vma configuration seq number via process_getinfo(2).
> >
> > We discussed the race issue without _read_ workloads/requests because
> > it's common sense that people might extend the syscall later.
> >
> > Here is same. For current workload, we don't need to support vector
> > for perfomance point of view based on my experiment. However, it's
> > rather limited experiment. Some configuration might have 10000+ vmas
> > or really slow CPU.
> >
> > Furthermore, I want to have vector support due to atomicity issue
> > if it's really the one we should consider.
> > With vector support of the API and vma configuration sequence number
> > from Daniel, we could support address ranges operations's atomicity.
>
> I am not sure what do you mean here. Perform all ranges atomicaly wrt.
> other address space modifications? If yes I am not sure we want that
Yub, I think it's *necessary* if we want to support destructive hints
via process_madvise.
> semantic because it can cause really long stalls for other operations
It could be or it couldn't be.
For example, if we could multiplex several syscalls which we should
enumerate all of page table lookup, it could be more effective rather
than doing each page table on each syscall.
> but that is a discussion on its own and I would rather focus on a simple
> interface first.
It seems it's time to send RFCv2 since we discussed a lot although we
don't have clear conclution yet. But still want to understand what you
meant _consistency_.
Thanks for the review, Michal! It's very helpful.
>
> > However, since we don't introduce vector at this moment, we need to
> > introduce *another syscall* later to be able to handle multile ranges
> > all at once atomically if it's okay.
>
> Agreed.
>
> > Other thought:
> > Maybe we could extend address range batch syscall covers other MM
> > syscall like mmap/munmap/madvise/mprotect and so on because there
> > are multiple users that would benefit from this general batching
> > mechanism.
>
> Again a discussion on its own ;)
>
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Michal Hocko @ 2019-05-30 6:57 UTC (permalink / raw)
To: Minchan Kim
Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190530021748.GE229459@google.com>
On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > [Cc linux-api]
> > > > > > > > >
> > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > multiple address range.
> > > > > > > > >
> > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > >
> > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > with number in the description at respin.
> > > > > > >
> > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > own (wrt. the syscall entry/exit).
> > > > > >
> > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > to me.
> > > > >
> > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > that the benefit is not really clear so far. If you want to push it
> > > > > through then you should better get some supporting data.
> > > >
> > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > I will drop vector support at next revision.
> > >
> > > Please do keep the vector support. Absolute timing is misleading,
> > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > seen tons of improvements in things like camera start come from
> > > coalescing mprotect calls, with the gains coming from taking and
> > > releasing various locks a lot less often and bouncing around less on
> > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > especially on mobile.
> >
> > This will always be a double edge sword. Taking a lock for longer can
> > improve a throughput of a single call but it would make a latency for
> > anybody contending on the lock much worse.
> >
> > Besides that, please do not overcomplicate the thing from the early
> > beginning please. Let's start with a simple and well defined remote
> > madvise alternative first and build a vector API on top with some
> > numbers based on _real_ workloads.
>
> First time, I didn't think about atomicity about address range race
> because MADV_COLD/PAGEOUT is not critical for the race.
> However you raised the atomicity issue because people would extend
> hints to destructive ones easily. I agree with that and that's why
> we discussed how to guarantee the race and Daniel comes up with good idea.
Just for the clarification, I didn't really mean atomicity but rather a
_consistency_ (essentially time to check to time to use consistency).
> - vma configuration seq number via process_getinfo(2).
>
> We discussed the race issue without _read_ workloads/requests because
> it's common sense that people might extend the syscall later.
>
> Here is same. For current workload, we don't need to support vector
> for perfomance point of view based on my experiment. However, it's
> rather limited experiment. Some configuration might have 10000+ vmas
> or really slow CPU.
>
> Furthermore, I want to have vector support due to atomicity issue
> if it's really the one we should consider.
> With vector support of the API and vma configuration sequence number
> from Daniel, we could support address ranges operations's atomicity.
I am not sure what do you mean here. Perform all ranges atomicaly wrt.
other address space modifications? If yes I am not sure we want that
semantic because it can cause really long stalls for other operations
but that is a discussion on its own and I would rather focus on a simple
interface first.
> However, since we don't introduce vector at this moment, we need to
> introduce *another syscall* later to be able to handle multile ranges
> all at once atomically if it's okay.
Agreed.
> Other thought:
> Maybe we could extend address range batch syscall covers other MM
> syscall like mmap/munmap/madvise/mprotect and so on because there
> are multiple users that would benefit from this general batching
> mechanism.
Again a discussion on its own ;)
--
Michal Hocko
SUSE Labs
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox