Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-04-17  7:31 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: Alan Tull, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun
In-Reply-To: <CAJYdmeOMm0xTRbKTS68M4uiHJLfphPXNo+eFYguda+SuwnkqmQ@mail.gmail.com>

On Fri, Apr 12, 2019 at 02:05:21PM -0700, Moritz Fischer wrote:
> Hi Hao,
> 
> this looks suspiciously like a hwmon driver ;-)
> 
> https://www.kernel.org/doc/Documentation/hwmon/hwmon-kernel-api.txt

Hi Moritz,

Thanks a lot for the suggestion, yes, agree, and patch for thermal
management should be the similar case too. Let me see if i can make
thermal / power management code to hwmon in the next version. : )

Hao

> 
> Cheers,
> Moritz
> 
> 
> On Thu, Apr 11, 2019 at 1:08 PM Alan Tull <atull@kernel.org> wrote:
> >
> > On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> >
> > Hi Hao,
> >
> > >
> > > This patch adds support for power management private feature under
> > > FPGA Management Engine (FME), sysfs interfaces are introduced for
> > > different power management functions, users could use these sysfs
> > > interface to get current number of consumed power, throttling
> >
> > How about
> > s/number/measurement/
> > ?
> >
> > > thresholds, threshold status and other information, and configure
> > > different value for throttling thresholds too.
> > >
> > > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  56 +++++
> > >  drivers/fpga/dfl-fme-main.c                      | 257 +++++++++++++++++++++++
> > >  2 files changed, 313 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > index d3aeb88..4b6448f 100644
> > > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > @@ -100,3 +100,59 @@ Description:       Read-only. Read this file to get the policy of temperature
> > >                 threshold1. It only supports two value (policy):
> > >                     0 - AP2 state (90% throttling)
> > >                     1 - AP1 state (50% throttling)
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. It returns current power consumed by FPGA.
> >
> > What are the units?
> >
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-Write. Read/Write this file to get/set current power
> > > +               threshold1 in Watts.
> >
> > Perhaps document error codes here and for threshold2 below.
> >
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-Write. Read/Write this file to get/set current power
> > > +               threshold2 in Watts.
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. It returns 1 if power consumption reaches the
> > > +               threshold1, otherwise 0.
> >
> > I'm used to things like this requiring user to reset the status, so it
> > may be worth making it explicit that it will return to zero if
> > consumption drops below threshold if that's what's happening here.
> > If it's correct, perhaps could just say something like 'returns 1 if
> > power consumption is currently at or above threshold1, otherwise 0'
> >
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. It returns 1 if power consumption reaches the
> > > +               threshold2, otherwise 0.
> >
> > Same here.
> >
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. Read this file to get current Latency Tolerance
> > > +               Reporting (ltr) value, it's only valid for integrated
> > > +               solution as it blocks CPU on low power state.
> >
> > If we're not on the integrated solution, it returns a value but it is
> > not really real?
> >
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/xeon_limit
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. Read this file to get power limit for xeon, it
> > > +               is only valid for integrated solution.
> > > +
> > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/fpga_limit
> > > +Date:          March 2019
> > > +KernelVersion:  5.2
> > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > +Description:   Read-only. Read this file to get power limit for fpga, it
> > > +               is only valid for integrated solution.
> > > diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> > > index 449a17d..dafa6580 100644
> > > --- a/drivers/fpga/dfl-fme-main.c
> > > +++ b/drivers/fpga/dfl-fme-main.c
> > > @@ -415,6 +415,259 @@ static const struct dfl_feature_ops fme_thermal_mgmt_ops = {
> > >         .uinit = fme_thermal_mgmt_uinit,
> > >  };
> > >
> > > +#define FME_PWR_STATUS         0x8
> > > +#define FME_LATENCY_TOLERANCE  BIT_ULL(18)
> > > +#define PWR_CONSUMED           GENMASK_ULL(17, 0)
> > > +
> > > +#define FME_PWR_THRESHOLD      0x10
> > > +#define PWR_THRESHOLD1         GENMASK_ULL(6, 0)       /* in Watts */
> > > +#define PWR_THRESHOLD2         GENMASK_ULL(14, 8)      /* in Watts */
> > > +#define PWR_THRESHOLD_MAX      0x7f
> > > +#define PWR_THRESHOLD1_STATUS  BIT_ULL(16)
> > > +#define PWR_THRESHOLD2_STATUS  BIT_ULL(17)
> > > +
> > > +#define FME_PWR_XEON_LIMIT     0x18
> > > +#define XEON_PWR_LIMIT         GENMASK_ULL(14, 0)
> > > +#define XEON_PWR_EN            BIT_ULL(15)
> > > +#define FME_PWR_FPGA_LIMIT     0x20
> > > +#define FPGA_PWR_LIMIT         GENMASK_ULL(14, 0)
> > > +#define FPGA_PWR_EN            BIT_ULL(15)
> > > +
> > > +#define POWER_ATTR(_name, _mode, _show, _store)        \
> > > +struct device_attribute power_attr_##_name =           \
> > > +       __ATTR(_name, _mode, _show, _store)
> > > +
> > > +#define POWER_ATTR_RO(_name, _show)                    \
> > > +       POWER_ATTR(_name, 0444, _show, NULL)
> > > +
> > > +#define POWER_ATTR_RW(_name, _show, _store)            \
> > > +       POWER_ATTR(_name, 0644, _show, _store)
> >
> > Are these #defines necessary?  Seems like you could just use DEVICE_ATTR*
> >
> > > +
> > > +static ssize_t pwr_consumed_show(struct device *dev,
> > > +                                struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_STATUS);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(PWR_CONSUMED, v));
> > > +}
> > > +static POWER_ATTR_RO(consumed, pwr_consumed_show);
> > > +
> > > +static ssize_t pwr_threshold1_show(struct device *dev,
> > > +                                  struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1, v));
> > > +}
> > > +
> > > +static ssize_t pwr_threshold1_store(struct device *dev,
> > > +                                   struct device_attribute *attr,
> > > +                                   const char *buf, size_t count)
> > > +{
> > > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > > +       void __iomem *base;
> > > +       u8 threshold;
> > > +       int ret;
> > > +       u64 v;
> > > +
> > > +       ret = kstrtou8(buf, 0, &threshold);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       if (threshold > PWR_THRESHOLD_MAX)
> > > +               return -EINVAL;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       mutex_lock(&pdata->lock);
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +       v &= ~PWR_THRESHOLD1;
> > > +       v |= FIELD_PREP(PWR_THRESHOLD1, threshold);
> > > +       writeq(v, base + FME_PWR_THRESHOLD);
> > > +       mutex_unlock(&pdata->lock);
> > > +
> > > +       return count;
> > > +}
> > > +static POWER_ATTR_RW(threshold1, pwr_threshold1_show, pwr_threshold1_store);
> > > +
> > > +static ssize_t pwr_threshold2_show(struct device *dev,
> > > +                                  struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2, v));
> > > +}
> > > +
> > > +static ssize_t pwr_threshold2_store(struct device *dev,
> > > +                                   struct device_attribute *attr,
> > > +                                   const char *buf, size_t count)
> > > +{
> > > +       struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > > +       void __iomem *base;
> > > +       u8 threshold;
> > > +       int ret;
> > > +       u64 v;
> > > +
> > > +       ret = kstrtou8(buf, 0, &threshold);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       if (threshold > PWR_THRESHOLD_MAX)
> > > +               return -EINVAL;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       mutex_lock(&pdata->lock);
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +       v &= ~PWR_THRESHOLD2;
> > > +       v |= FIELD_PREP(PWR_THRESHOLD2, threshold);
> > > +       writeq(v, base + FME_PWR_THRESHOLD);
> > > +       mutex_unlock(&pdata->lock);
> > > +
> > > +       return count;
> > > +}
> > > +static POWER_ATTR_RW(threshold2, pwr_threshold2_show, pwr_threshold2_store);
> > > +
> > > +static ssize_t pwr_threshold1_status_show(struct device *dev,
> > > +                                         struct device_attribute *attr,
> > > +                                         char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD1_STATUS, v));
> > > +}
> > > +static POWER_ATTR_RO(threshold1_status, pwr_threshold1_status_show);
> > > +
> > > +static ssize_t pwr_threshold2_status_show(struct device *dev,
> > > +                                         struct device_attribute *attr,
> > > +                                         char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_THRESHOLD);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(PWR_THRESHOLD2_STATUS, v));
> > > +}
> > > +static POWER_ATTR_RO(threshold2_status, pwr_threshold2_status_show);
> > > +
> > > +static ssize_t ltr_show(struct device *dev,
> > > +                       struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_STATUS);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n",
> > > +                        (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> > > +}
> > > +static POWER_ATTR_RO(ltr, ltr_show);
> > > +
> > > +static ssize_t xeon_limit_show(struct device *dev,
> > > +                              struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u16 xeon_limit = 0;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_XEON_LIMIT);
> > > +
> > > +       if (FIELD_GET(XEON_PWR_EN, v))
> > > +               xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n", xeon_limit);
> > > +}
> > > +static POWER_ATTR_RO(xeon_limit, xeon_limit_show);
> > > +
> > > +static ssize_t fpga_limit_show(struct device *dev,
> > > +                              struct device_attribute *attr, char *buf)
> > > +{
> > > +       void __iomem *base;
> > > +       u16 fpga_limit = 0;
> > > +       u64 v;
> > > +
> > > +       base = dfl_get_feature_ioaddr_by_id(dev, FME_FEATURE_ID_POWER_MGMT);
> > > +
> > > +       v = readq(base + FME_PWR_FPGA_LIMIT);
> > > +
> > > +       if (FIELD_GET(FPGA_PWR_EN, v))
> > > +               fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> > > +
> > > +       return scnprintf(buf, PAGE_SIZE, "%u\n", fpga_limit);
> > > +}
> > > +static POWER_ATTR_RO(fpga_limit, fpga_limit_show);
> > > +
> > > +static struct attribute *power_mgmt_attrs[] = {
> > > +       &power_attr_consumed.attr,
> > > +       &power_attr_threshold1.attr,
> > > +       &power_attr_threshold2.attr,
> > > +       &power_attr_threshold1_status.attr,
> > > +       &power_attr_threshold2_status.attr,
> > > +       &power_attr_xeon_limit.attr,
> > > +       &power_attr_fpga_limit.attr,
> > > +       &power_attr_ltr.attr,
> >
> > This is a nit, but I would expect to see these listed in the same
> > order as their show/store functions above.  So ltr_attr would come
> > between threshold2_status_attr and xeon_limit_attr.
> >
> > > +       NULL,
> > > +};
> > > +
> > > +static struct attribute_group power_mgmt_attr_group = {
> > > +       .attrs  = power_mgmt_attrs,
> > > +       .name   = "power_mgmt",
> > > +};
> > > +
> > > +static int fme_power_mgmt_init(struct platform_device *pdev,
> > > +                              struct dfl_feature *feature)
> > > +{
> > > +       return sysfs_create_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > > +}
> > > +
> > > +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> > > +                                struct dfl_feature *feature)
> > > +{
> > > +       sysfs_remove_group(&pdev->dev.kobj, &power_mgmt_attr_group);
> > > +}
> > > +
> > > +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> > > +       {.id = FME_FEATURE_ID_POWER_MGMT,},
> > > +       {0,}
> > > +};
> > > +
> > > +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> > > +       .init = fme_power_mgmt_init,
> > > +       .uinit = fme_power_mgmt_uinit,
> > > +};
> > > +
> > >  static struct dfl_feature_driver fme_feature_drvs[] = {
> > >         {
> > >                 .id_table = fme_hdr_id_table,
> > > @@ -429,6 +682,10 @@ static struct dfl_feature_driver fme_feature_drvs[] = {
> > >                 .ops = &fme_thermal_mgmt_ops,
> > >         },
> > >         {
> > > +               .id_table = fme_power_mgmt_id_table,
> > > +               .ops = &fme_power_mgmt_ops,
> > > +       },
> > > +       {
> > >                 .ops = NULL,
> > >         },
> > >  };
> > > --
> > > 2.7.4
> > >
> >
> > Thanks,
> > Alan

^ permalink raw reply

* Re: [PATCH 15/17] fpga: dfl: fme: add power management support
From: Wu Hao @ 2019-04-17  7:36 UTC (permalink / raw)
  To: Alan Tull
  Cc: Moritz Fischer, linux-fpga, linux-kernel, linux-api, Luwei Kang,
	Xu Yilun
In-Reply-To: <CANk1AXRv+2cv1sJTyHD3ajgHVvda7y7t6DVvw47vAnTp4KF0yw@mail.gmail.com>

On Mon, Apr 15, 2019 at 04:17:48PM -0500, Alan Tull wrote:
> On Thu, Apr 11, 2019 at 10:06 PM Wu Hao <hao.wu@intel.com> wrote:
> >
> > On Thu, Apr 11, 2019 at 03:07:35PM -0500, Alan Tull wrote:
> > > On Sun, Mar 24, 2019 at 10:24 PM Wu Hao <hao.wu@intel.com> wrote:
> > >
> > > Hi Hao,
> > >
> > > >
> > > > This patch adds support for power management private feature under
> > > > FPGA Management Engine (FME), sysfs interfaces are introduced for
> > > > different power management functions, users could use these sysfs
> > > > interface to get current number of consumed power, throttling
> > >
> > > How about
> > > s/number/measurement/
> > > ?
> >
> > Sounds better. : )
> >
> > >
> > > > thresholds, threshold status and other information, and configure
> > > > different value for throttling thresholds too.
> > > >
> > > > Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> > > > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > > > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-platform-dfl-fme |  56 +++++
> > > >  drivers/fpga/dfl-fme-main.c                      | 257 +++++++++++++++++++++++
> > > >  2 files changed, 313 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > > index d3aeb88..4b6448f 100644
> > > > --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> > > > @@ -100,3 +100,59 @@ Description:       Read-only. Read this file to get the policy of temperature
> > > >                 threshold1. It only supports two value (policy):
> > > >                     0 - AP2 state (90% throttling)
> > > >                     1 - AP1 state (50% throttling)
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/consumed
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-only. It returns current power consumed by FPGA.
> > >
> > > What are the units?
> > >
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-Write. Read/Write this file to get/set current power
> > > > +               threshold1 in Watts.
> > >
> > > Perhaps document error codes here and for threshold2 below.
> > >
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-Write. Read/Write this file to get/set current power
> > > > +               threshold2 in Watts.
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold1_status
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-only. It returns 1 if power consumption reaches the
> > > > +               threshold1, otherwise 0.
> > >
> > > I'm used to things like this requiring user to reset the status, so it
> > > may be worth making it explicit that it will return to zero if
> > > consumption drops below threshold if that's what's happening here.
> > > If it's correct, perhaps could just say something like 'returns 1 if
> > > power consumption is currently at or above threshold1, otherwise 0'
> > >
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/threshold2_status
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-only. It returns 1 if power consumption reaches the
> > > > +               threshold2, otherwise 0.
> > >
> > > Same here.
> >
> > Sure, will fix all above comments in this sysfs doc.
> >
> > >
> > > > +
> > > > +What:          /sys/bus/platform/devices/dfl-fme.0/power_mgmt/ltr
> > > > +Date:          March 2019
> > > > +KernelVersion:  5.2
> > > > +Contact:       Wu Hao <hao.wu@intel.com>
> > > > +Description:   Read-only. Read this file to get current Latency Tolerance
> > > > +               Reporting (ltr) value, it's only valid for integrated
> > > > +               solution as it blocks CPU on low power state.
> > >
> > > If we're not on the integrated solution, it returns a value but it is
> > > not really real?
> >
> > Currently only integrated solution is implementing this private feature, other
> > devices e.g. Intel PAC card is not using this private feature, so user will
> > not see these sysfs interfaces at all.
> 
> OK then perhaps the "it's only valid for integrated solution as it
> blocks CPU on low power state" explanation doesn't need to be here and
> can lead to confusion.
> 

Sure, will fix it in the next version. Thanks!

Hao

^ permalink raw reply

* Re: [RFC PATCH v1 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Florian Weimer @ 2019-04-17 10:01 UTC (permalink / raw)
  To: Steve Grubb
  Cc: Jan Kara, Mickaël Salaün, linux-kernel, Al Viro,
	James Morris, Jonathan Corbet, Kees Cook, Matthew Garrett,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Shuah Khan, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module
In-Reply-To: <2361288.r9sZr2NHLB@x2>

* Steve Grubb:

> On Tuesday, April 16, 2019 7:49:39 AM EDT Florian Weimer wrote:
>> * Steve Grubb:
>> > This flag that is being proposed means that you would have to patch all
>> > interpreters to use it. If you are sure that upstreams will accept that,
>> > why not just change the policy to interpreters shouldn't execute
>> > anything unless the execute bit is set? That is simpler and doesn't need
>> > a kernel change. And setting the execute bit is an auditable event.
>> 
>> I think we need something like O_MAYEXEC so that security policies can
>> be enforced and noexec mounts can be detected.
>
> Application whitelisting can already today stop unknown software without 
> needing O_MAYEXEC.

I'm somewhat interested in using this to add a proper check for
executability to explicit dynamic loader invocations.  In other words,
this

  /lib64/ld-linux-x86-64.so.2 /path/to/noexec/fs/program

should refuse to run the program if the program is located on a file
system mounted with the noexec attribute.

> The problem is that passing O_MAYEXEC is opt-in. You can use ptrace/seccomp/
> bpf/LD_PRELOAD/LD_AUDIT to remove that bit from an otherwise normal program. 
> This does not require privs to do so.

That doesn't really help with the above.

> But let's consider that this comes to pass and every interpreter is
> updated and IMA can see the O_MAYEXEC flag. Attackers now simply pivot
> to running programs via stdin. It never touches disk and therefore
> nothing enforces security policy. This already is among the most
> common ways that malware runs today to evade detection.

Are you referring to Windows malware using Powershell?

I'm not sure this is applicable to Linux.  We do not have much
behavioral monitoring anyway.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-17 10:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tglx, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	adobriyan, aubrey.li, linux-api, linux-kernel
In-Reply-To: <20190416160143.8292ced993dc803aae7fa0da@linux-foundation.org>

On 2019/4/17 7:01, Andrew Morton wrote:
> On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> 
>> The architecture specific information of the running processes could
>> be useful to the userland. Add support to examine process architecture
>> specific information externally.
> 
> The implementation looks just fine to me.  Have you had any feedback on
> the overall desirability of adding this feature?
> 
This is orientated by the customer's complain. When the customer colocated
their latency critical tasks with AVX512 tasks, the latency critical tasks
were affected due to core frequency slowing down by AVX512 use. So we propose
this interface for the user space tools to identify AVX512 using task and
apply user space dispatching policy. We may have subsequent effort based on
this proposal.

>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -96,6 +96,11 @@
>>  #include <asm/processor.h>
>>  #include "internal.h"
>>  
>> +/* Add support for architecture specific output in /proc/pid/status */
>> +#ifndef	arch_proc_pid_status
>> +#define	arch_proc_pid_status(m, task)
>> +#endif
> 
> To this I suggest adding
> 
> /* arch_proc_pid_status() must be defined in asm/processor.h */
> 
> Because we've regularly had different architectures defining such things
> in different headers, resulting in a mess.

Thanks, I'll add it in the next version.
-Aubrey

^ permalink raw reply

* [PATCH v16 1/3] /proc/pid/status: Add support for architecture specific output
From: Aubrey Li @ 2019-04-17 10:18 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li

The architecture specific information of the running processes could
be useful to the userland. Add support to examine process architecture
specific information externally.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/array.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2edbb657f859..a6b394402ea2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -96,6 +96,14 @@
 #include <asm/processor.h>
 #include "internal.h"
 
+/*
+ * Add support for architecture specific output in /proc/pid/status.
+ * arch_proc_pid_status() must be defined in asm/processor.h
+ */
+#ifndef	arch_proc_pid_status
+#define	arch_proc_pid_status(m, task)
+#endif
+
 void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 {
 	char *buf;
@@ -424,6 +432,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_status(m, task);
 	return 0;
 }
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v16 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
From: Aubrey Li @ 2019-04-17 10:18 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li
In-Reply-To: <20190417101840.2926-1-aubrey.li@linux.intel.com>

AVX-512 components use could cause core turbo frequency drop. So
it's useful to expose AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:      4
AVX512_elapsed_ms:      8
AVX512_elapsed_ms:      4

This means that 4 milliseconds have elapsed since the AVX512 usage
of tensorflow task was detected when the task was scheduled out.

Or:
$ cat /proc/tid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:      -1

The number '-1' indicates that no AVX512 usage recorded before
thus the task unlikely has frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 arch/x86/include/asm/processor.h |  4 +++
 arch/x86/kernel/fpu/xstate.c     | 42 ++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..5a7271ab78d8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -991,4 +991,8 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+/* Add support for architecture specific output in /proc/pid/status */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
+#define arch_proc_pid_status arch_proc_pid_status
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..5e55ed9584ab 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH v16 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
From: Aubrey Li @ 2019-04-17 10:18 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li
In-Reply-To: <20190417101840.2926-1-aubrey.li@linux.intel.com>

Added AVX512_elapsed_ms in /proc/<pid>/status. Report it
in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/filesystems/proc.txt | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..c4a9e48681ad 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:       thread vulnerable
   voluntary_ctxt_switches:        0
   nonvoluntary_ctxt_switches:     1
+  AVX512_elapsed_ms:	8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. To see a precise
 snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..............................................................................
  Field                       Content
  Name                        filename of the executable
@@ -289,6 +290,32 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms           time elapsed since last AVX512 usage recorded
+
+ AVX512_elapsed_ms:
+ ------------------
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+    1) The time which the task spent on the CPU without being scheduled
+       out. With CPU isolation and a single runnable task this can take
+       several seconds.
+
+    2) The time since the task was scheduled out last. Depending on the
+       reason for being scheduled out (time slice exhausted, syscall ...)
+       this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
+
 ..............................................................................
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.21.0

^ permalink raw reply related

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Enrico Weigelt, metux IT consult @ 2019-04-17 12:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Christian Brauner, Linus Torvalds, Al Viro,
	Jann Horn, David Howells, Linux API, LKML, Serge E. Hallyn,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	Michael Kerrisk, Andrew Morton, Oleg Nesterov, Joel Fernandes,
	Daniel Colascione
In-Reply-To: <CALCETrWWoEvnzx6oLTEajBmiArScX5sSt74o2VO+aTiwEi_9mQ@mail.gmail.com>

On 16.04.19 23:31, Andy Lutomirski wrote:

>> How exactly would the pidfd improve this scenario ?
>> IMHO, would just need to pass the inherited fd's to that daemon (eg.
>> via unix socket) which then sets them up in the new child process.
> 
> It makes it easier to wait until the privileged program exits.
> Without pidfd, you can't just wait(2) because the program that gets
> spawned isn't a child.  

Ah, that is a cool thing !
I suppose that also works across namespaces ?

What other things can be done via pidfd ?

>> But: how can we handle things like cgroups ?
> 
> Find a secure way to tell the daemon what cgroups to use?

hmm, do we have some fd-handle to cgroups ?
In that case a process could send a handle of his cgroup to some
other process (eg. some "login" deamon) allowing him to join in.

We could look at cgroups more as kind of capabilities instead of
limitations (eg. things like: members of cgroup "net-foo1" are
granted n% of network bandwith, etc). That would open up completely
new approaches to security and resource control :)

It could go even further: anybody can create new cgroups within his
own, narrow down some limits and pass this to some other agent that
acts on behalf of him and is allowed to use his share of the system
resources for that.

--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Florian Weimer @ 2019-04-17 12:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Enrico Weigelt, metux IT consult, Christian Brauner,
	Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API,
	LKML, Serge E. Hallyn, Arnd Bergmann, Eric W. Biederman,
	Kees Cook, Thomas Gleixner, Michael Kerrisk, Andrew Morton,
	Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CALCETrWxMnaPvwicqkMLswMynWvJVteazD-bFv3ZnBKWp-1joQ@mail.gmail.com>

* Andy Lutomirski:

> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.

Wouldn't no_new_privs inhibit all security transitions, including those
that reduce privileges?  And therefore effectively reduce security?

> Anyway, clone(2) is an enormous mess.  Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec().  It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs".  This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.

I agree that consistent replacement for the clone system call makes
sense.  I'm not sure if covering everything that posix_spawn could do
would make sense.  There seems to be some demand to be able to do large
parts of container setup using posix_spawn, so we'll probably add
support for things like writing to arbitrary files eventually.  And of
course, proper error reporting, so that you can figure out which file
creation action failed.

Thanks,
Florian

^ permalink raw reply

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Christian Brauner @ 2019-04-17 12:54 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Andy Lutomirski, Aleksa Sarai, Linus Torvalds, Al Viro, Jann Horn,
	David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
	Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <e58d1694-c143-3465-512d-43301422bd8b@metux.net>

On Wed, Apr 17, 2019 at 02:03:16PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 16.04.19 23:31, Andy Lutomirski wrote:
> 
> >> How exactly would the pidfd improve this scenario ?
> >> IMHO, would just need to pass the inherited fd's to that daemon (eg.
> >> via unix socket) which then sets them up in the new child process.
> > 
> > It makes it easier to wait until the privileged program exits.
> > Without pidfd, you can't just wait(2) because the program that gets
> > spawned isn't a child.  
> 
> Ah, that is a cool thing !
> I suppose that also works across namespaces ?

Yes, it should. If you hand off the pidfd to another pidns (e.g. via SCM
credentials) for example.

> 
> What other things can be done via pidfd ?

Very basic things right now and until CLONE_PIDFD is accepted (possibly
for 5.2) we won't enable any more features.
I'm not a fan of wild speculations and grand schemes that then don't
come to fruition. :) Right now it's about focussing on somewhat cleanly
covering the basics. Coming to a consensus there was hard enough. We
have no intention in making this more complex right now as it needs to
be.

Christian

^ permalink raw reply

* Re: [PATCH v1 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
From: Oleg Nesterov @ 2019-04-17 14:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190416170233.10208-4-christian@brauner.io>

On 04/16, Christian Brauner wrote:
>
> @@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>  	if (flags)
>  		return -EINVAL;
>
> -	f = fdget_raw(pidfd);
> +	f = fdget(pidfd);

could you explain this change?

I am just curious, I don't understand why should we disallow O_PATH and how
this connects to this patch.

Oleg.

^ permalink raw reply

* Re: [PATCH v1 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
From: Christian Brauner @ 2019-04-17 14:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417140106.GG32622@redhat.com>

On Wed, Apr 17, 2019 at 04:01:06PM +0200, Oleg Nesterov wrote:
> On 04/16, Christian Brauner wrote:
> >
> > @@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  	if (flags)
> >  		return -EINVAL;
> >
> > -	f = fdget_raw(pidfd);
> > +	f = fdget(pidfd);
> 
> could you explain this change?
> 
> I am just curious, I don't understand why should we disallow O_PATH and how
> this connects to this patch.

Sending a signal through a pidfd is considered to be on a par with a
"write" to that pidfd.
Additionally, we use the fops associated with the fd to detect whether
it is actually a pidfd or not. This is not possible with O_PATH since
f_ops will be set to dummy fops. So we already correctly error out since
we detect that it is an O_PATH before fdget_raw(). This just takes the
opportunity to fix it.

Christian

^ permalink raw reply

* Re: [PATCH v1 2/4] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-17 14:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190416170233.10208-3-christian@brauner.io>

On 04/16, Christian Brauner wrote:
>
> +	if (clone_flags & CLONE_PIDFD) {
> +		retval = pidfd_create(pid, &pidfdf);
> +		if (retval < 0)
> +			goto bad_fork_free_pid;
> +		pidfd = retval;
> +	}

...

> +	if (clone_flags & CLONE_PIDFD) {
> +		fd_install(pidfd, pidfdf);
> +		put_user(pidfd, parent_tidptr);

put_user() can fail, I don't think this error should be silently ignored,
this can lead to the hard-to-trigger/debug problems.

Why can't we do put_user-with-check along with pidfd_create() above?

Oleg.

^ permalink raw reply

* Re: [PATCH v1 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-17 14:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417142253.GH32622@redhat.com>

On Wed, Apr 17, 2019 at 04:22:54PM +0200, Oleg Nesterov wrote:
> On 04/16, Christian Brauner wrote:
> >
> > +	if (clone_flags & CLONE_PIDFD) {
> > +		retval = pidfd_create(pid, &pidfdf);
> > +		if (retval < 0)
> > +			goto bad_fork_free_pid;
> > +		pidfd = retval;
> > +	}
> 
> ...
> 
> > +	if (clone_flags & CLONE_PIDFD) {
> > +		fd_install(pidfd, pidfdf);
> > +		put_user(pidfd, parent_tidptr);
> 
> put_user() can fail, I don't think this error should be silently ignored,

Yes, you're right, sorry.

> this can lead to the hard-to-trigger/debug problems.
> 
> Why can't we do put_user-with-check along with pidfd_create() above?

Yes, I'll take a look where it makes sense and respin.

Christian

^ permalink raw reply

* Re: [PATCH v1 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-17 14:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417142551.lhufsdffqrih3jsp@brauner.io>

On Wed, Apr 17, 2019 at 04:25:51PM +0200, Christian Brauner wrote:
> On Wed, Apr 17, 2019 at 04:22:54PM +0200, Oleg Nesterov wrote:
> > On 04/16, Christian Brauner wrote:
> > >
> > > +	if (clone_flags & CLONE_PIDFD) {
> > > +		retval = pidfd_create(pid, &pidfdf);
> > > +		if (retval < 0)
> > > +			goto bad_fork_free_pid;
> > > +		pidfd = retval;
> > > +	}
> > 
> > ...
> > 
> > > +	if (clone_flags & CLONE_PIDFD) {
> > > +		fd_install(pidfd, pidfdf);
> > > +		put_user(pidfd, parent_tidptr);
> > 
> > put_user() can fail, I don't think this error should be silently ignored,

Fwiw, the same is currently done for PARENT_SETTID which seems odd as
well...

^ permalink raw reply

* Re: [RFC PATCH v1 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-04-17 14:55 UTC (permalink / raw)
  To: Steve Grubb, Jan Kara
  Cc: Mickaël Salaün, linux-kernel, Al Viro, James Morris,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Michael Kerrisk,
	Mimi Zohar, Philippe Trébuchet, Shuah Khan,
	Thibaut Sautereau, Vincent Strubel, Yves-Alexis Perez,
	kernel-hardening, linux-api, linux-security-module, linux-fsdevel,
	Matthew Bobrowski, Florian Weimer
In-Reply-To: <3452959.b6JmBh7Lnt@x2>

On 15/04/2019 20:47, Steve Grubb wrote:
> Hello,
>
> On Wednesday, December 12, 2018 9:43:06 AM EDT Jan Kara wrote:
>> On Wed 12-12-18 09:17:08, Mickaël Salaün wrote:
>>> When the O_MAYEXEC flag is passed, sys_open() may be subject to
>>> additional restrictions depending on a security policy implemented by an
>>> LSM through the inode_permission hook.
>>>
>>> The underlying idea is to be able to restrict scripts interpretation
>>> according to a policy defined by the system administrator.  For this to
>>> be possible, script interpreters must use the O_MAYEXEC flag
>>> appropriately.  To be fully effective, these interpreters also need to
>>> handle the other ways to execute code (for which the kernel can't help):
>>> command line parameters (e.g., option -e for Perl), module loading
>>> (e.g., option -m for Python), stdin, file sourcing, environment
>>> variables, configuration files...  According to the threat model, it may
>>> be acceptable to allow some script interpreters (e.g. Bash) to interpret
>>> commands from stdin, may it be a TTY or a pipe, because it may not be
>>> enough to (directly) perform syscalls.
>>>
>>> A simple security policy implementation is available in a following
>>> patch for Yama.
>>>
>>> This is an updated subset of the patch initially written by Vincent
>>> Strubel for CLIP OS:
>>> https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d
>>> 6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch This patch has
>>> been used for more than 10 years with customized script interpreters.
>>> Some examples can be found here:
>>> https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYE
>>> XEC
>>>
>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>> Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
>>> Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
>>> Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
>>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
>>
>> ...
>>
>>> diff --git a/fs/open.c b/fs/open.c
>>> index 0285ce7dbd51..75479b79a58f 100644
>>> --- a/fs/open.c
>>> +++ b/fs/open.c
>>> @@ -974,6 +974,10 @@ static inline int build_open_flags(int flags,
>>> umode_t mode, struct open_flags *o>
>>>     if (flags & O_APPEND)
>>>
>>>             acc_mode |= MAY_APPEND;
>>>
>>> +   /* Check execution permissions on open. */
>>> +   if (flags & O_MAYEXEC)
>>> +           acc_mode |= MAY_OPENEXEC;
>>> +
>>>
>>>     op->acc_mode = acc_mode;
>>>
>>>     op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
>>
>> I don't feel experienced enough in security to tell whether we want this
>> functionality or not. But if we do this, shouldn't we also set FMODE_EXEC
>> on the resulting struct file? That way also security_file_open() can be
>> used to arbitrate such executable opens and in particular
>> fanotify permission event FAN_OPEN_EXEC will get properly generated which I
>> guess is desirable (support for it is sitting in my tree waiting for the
>> merge window) - adding some audit people involved in FAN_OPEN_EXEC to CC.
>> Just an idea...
>
> Late in replying. But I think it's important to have a deep look into the
> issue.
>
> TL;DR - This is a gentle man's handshake. It won't _really_ solve the
> problem.

Thanks for your comments. You should find most answers in this thread:
https://lore.kernel.org/lkml/20181212081712.32347-4-mic@digikod.net/

The threat model targets persistent attacks. This O_MAYEXEC flag is not
a silver bullet but it's a needed block to enforce a security policy on
a trusted system. This means that every component executable on the
system must be controlled, which means they may need some bit of
customization. Today no userspace application use this flag (except in
CLIP OS), but we need to first create a feature before it can be used.

It is very important to have in mind that a system security policy need
to have a (central) security manager, in this case the kernel thanks to
Yama's policy (but it could be SELinux, IMA or any other LSM). The goal
is not to give to the developer the job of defining a security policy
for the *system*; this job is for the system administrator (or the distro).

>
> This flag that is being proposed means that you would have to patch all
> interpreters to use it. If you are sure that upstreams will accept that, why
> not just change the policy to interpreters shouldn't execute anything unless
> the execute bit is set? That is simpler and doesn't need a kernel change. And
> setting the execute bit is an auditable event.

As said above, the definition of a the security policy is the job of the
system administrator. Moreover, the security policy may be defined by
the mount point restrictions (i.e. noexec) but it should be definable
with something else (e.g. a SELinux or IMA policy which may be agnostic
to the mount points).

>
> The bottom line is that any interpreter has to become a security policy
> enforcement point whether by indicating it wants to execute by setting a flag
> or by refusing to use a file without execute bit set. But this just moves the
> problem to one that is harder to fix. Why in the world does any programming
> language allow programs to be loaded via stdin?
>
> It is possible to wget a program and pipe it into python which subsequently
> pulls down an ELF shared object and runs it all without touching disk via
> memfd_create (e.g. SnakeEater). This is all direct to memory execution. And
> direct to memory bypasses anti-virus, selinux, IMA, application whitelisting,
> and other integrity schemes.
>
> So, to fix this problem, you really need to not allow any programs to load via
> stdin so that everything that executes has to touch disk. This way you can
> get a fanotify event and see the application and vote yes/no on allowing it.
> And this will be particularly harder with the memfd_create fix for the runc
> container breakout. Prior to that, there were very few uses of that system
> call. Now it may be very common which means finding malicious use just got
> harder to spot.

As said above, stdin must be restricted in some way. You may want to
take a look at the CLIP OS patches (which doesn't only add the O_MAYEXEC
flag but restrict other way to interpret code). It may be foolish to
block or restrict stdin for interpreters on a developer workstation, but
it may make sense for an embedded custom system.

The same apply for memfd_create. If you want to enforce a security
policy on this kind of *file descriptor*, you should ask to the proper
LSM to do so. The current Yama patch deal with this kind of FD if they
are accessed through /proc/*/fd because the procfs is mounted with
noexec. Anyway, the interpreter must *inform* the LSM that it wants to
execute/interpret something from this FD, which is done thanks to the
O_MAYEXEC flag.

>
> But assuming these problems got fixed, then we have yet another place to look.
> Many interpreters allow you to specify a command to run via arguments. Some
> have a small buffer and some allow lengthy programs to be entered as an
> argument. One strategy might be that an attacker can bootstrap a lengthier
> program across the network. Python for example allows loading modules across
> a network. All you need to put in the commandline is the override for the
> module loader and a couple modules to import. It then loads the modules
> remotely. Getting rid of this hole will likely lead to some unhappy people -
> meaning it can't be fixed.

Again, this depend on the threat model and the corresponding product. If
you want to handle everything on your system, then you may need some
adjustments.

>
> And even if we get that fixed, we have one last hole to plug. Shells. One can
> simply start a shell and paste their program into the shell and then execute
> it. You can easily do this with bash or python or any language that has a
> REPL (read–eval–print loop). To fix this means divorcing the notion of a
> language from a REPL. Production systems really do not need a Python shell,
> they need the interpreter. I doubt that this would be popular. But fixing each
> of these issues is what it would take to prevent unknown software from
> running. Not going this far leaves holes.

This is also covered by the threat model defined in the patch 3/5 (i.e.
protect the kernel by restricting arbitrary syscalls).

Regards,

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [RFC PATCH v1 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-04-17 15:04 UTC (permalink / raw)
  To: Florian Weimer, Steve Grubb
  Cc: Jan Kara, Mickaël Salaün, linux-kernel, Al Viro,
	James Morris, Jonathan Corbet, Kees Cook, Matthew Garrett,
	Michael Kerrisk, Mimi Zohar, Philippe Trébuchet, Shuah Khan,
	Thibaut Sautereau, Vincent Strubel, Yves-Alexis Perez,
	kernel-hardening, linux-api, linux-security-module, linux-fsdevel
In-Reply-To: <875zrd7xy8.fsf@oldenburg2.str.redhat.com>

On 17/04/2019 12:01, Florian Weimer wrote:
> * Steve Grubb:
>
>> On Tuesday, April 16, 2019 7:49:39 AM EDT Florian Weimer wrote:
>>> * Steve Grubb:
>>>> This flag that is being proposed means that you would have to patch all
>>>> interpreters to use it. If you are sure that upstreams will accept that,
>>>> why not just change the policy to interpreters shouldn't execute
>>>> anything unless the execute bit is set? That is simpler and doesn't need
>>>> a kernel change. And setting the execute bit is an auditable event.
>>>
>>> I think we need something like O_MAYEXEC so that security policies can
>>> be enforced and noexec mounts can be detected.
>>
>> Application whitelisting can already today stop unknown software without
>> needing O_MAYEXEC.

Whitelisting may be a lot of thing (path/TPE, signed binaries…), but
being able to handle this with a global system configuration (instead of
app-specific hardcoded configuration) is a good idea. ;)

>
> I'm somewhat interested in using this to add a proper check for
> executability to explicit dynamic loader invocations.  In other words,
> this
>
>   /lib64/ld-linux-x86-64.so.2 /path/to/noexec/fs/program
>
> should refuse to run the program if the program is located on a file
> system mounted with the noexec attribute.

What if a sysadmin need to do this on an executable mount point? Being
able to enforce a security policy according to a configuration may fit
to much more use cases.

>
>> The problem is that passing O_MAYEXEC is opt-in. You can use ptrace/seccomp/
>> bpf/LD_PRELOAD/LD_AUDIT to remove that bit from an otherwise normal program.
>> This does not require privs to do so.
>
> That doesn't really help with the above.

Right, ptrace/LD_PRELOAD and so on must be addressed by something else
than only O_MAYEXEC.

>
>> But let's consider that this comes to pass and every interpreter is
>> updated and IMA can see the O_MAYEXEC flag. Attackers now simply pivot
>> to running programs via stdin. It never touches disk and therefore
>> nothing enforces security policy. This already is among the most
>> common ways that malware runs today to evade detection.

As my previous reply, use cases like stdin may be restricted as well.

>
> Are you referring to Windows malware using Powershell?
>
> I'm not sure this is applicable to Linux.  We do not have much
> behavioral monitoring anyway.
>
> Thanks,
> Florian
>

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH v1 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
From: Oleg Nesterov @ 2019-04-17 15:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417141144.k2kmg7hd7pdpywyw@brauner.io>

On 04/17, Christian Brauner wrote:
>
> On Wed, Apr 17, 2019 at 04:01:06PM +0200, Oleg Nesterov wrote:
> > On 04/16, Christian Brauner wrote:
> > >
> > > @@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > >  	if (flags)
> > >  		return -EINVAL;
> > >
> > > -	f = fdget_raw(pidfd);
> > > +	f = fdget(pidfd);
> >
> > could you explain this change?
> >
> > I am just curious, I don't understand why should we disallow O_PATH and how
> > this connects to this patch.
>
> Sending a signal through a pidfd is considered to be on a par with a
> "write" to that pidfd.

OK, but how this connects to "support pidfds" ?

> Additionally, we use the fops associated with the fd to detect whether
> it is actually a pidfd or not. This is not possible with O_PATH since
> f_ops will be set to dummy fops.

indeed... I didn't know this, thanks!

But this means that pidfd_send_signal() will return -EBADF with or without
this change; pidfd_to_pid() will return -EBADF even if fdget_raw() suceeds,
right?

To clarify, I am not arguing. I am trying to understand why exactly do we
need this s/fdget_raw/fdget/ change and, why it doesn't come as a separate
patch. Can you add a note into the changelog?

Oleg.

^ permalink raw reply

* Re: [PATCH v1 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
From: Christian Brauner @ 2019-04-17 15:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417152055.GJ32622@redhat.com>

On April 17, 2019 5:20:55 PM GMT+02:00, Oleg Nesterov <oleg@redhat.com> wrote:
>On 04/17, Christian Brauner wrote:
>>
>> On Wed, Apr 17, 2019 at 04:01:06PM +0200, Oleg Nesterov wrote:
>> > On 04/16, Christian Brauner wrote:
>> > >
>> > > @@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int,
>pidfd, int, sig,
>> > >  	if (flags)
>> > >  		return -EINVAL;
>> > >
>> > > -	f = fdget_raw(pidfd);
>> > > +	f = fdget(pidfd);
>> >
>> > could you explain this change?
>> >
>> > I am just curious, I don't understand why should we disallow O_PATH
>and how
>> > this connects to this patch.
>>
>> Sending a signal through a pidfd is considered to be on a par with a
>> "write" to that pidfd.
>
>OK, but how this connects to "support pidfds" ?
>
>> Additionally, we use the fops associated with the fd to detect
>whether
>> it is actually a pidfd or not. This is not possible with O_PATH since
>> f_ops will be set to dummy fops.
>
>indeed... I didn't know this, thanks!
>
>But this means that pidfd_send_signal() will return -EBADF with or
>without
>this change; pidfd_to_pid() will return -EBADF even if fdget_raw()
>suceeds,
>right?
>
>To clarify, I am not arguing. I am trying to understand why exactly do
>we
>need this s/fdget_raw/fdget/ change and, why it doesn't come as a
>separate
>patch. Can you add a note into the changelog?

I should split this into a separate patch indeed.
Let me do that for v2.

Christian

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-17 15:59 UTC (permalink / raw)
  To: carlos, Will Deacon
  Cc: Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <20190416173216.9028-2-mathieu.desnoyers@efficios.com>

----- On Apr 16, 2019, at 1:32 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

[...]
> diff --git a/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> new file mode 100644
> index 0000000000..b02471a89a
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> @@ -0,0 +1,32 @@
> +/* Restartable Sequences Linux aarch64 architecture header.
> +
> +   Copyright (C) 2019 Free Software Foundation, Inc.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _SYS_RSEQ_H
> +# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
> +#endif
> +
> +/* RSEQ_SIG is a signature required before each abort handler code.
> +
> +   It is a 32-bit value that maps to actual architecture code compiled
> +   into applications and libraries. It needs to be defined for each
> +   architecture. When choosing this value, it needs to be taken into
> +   account that generating invalid instructions may have ill effects on
> +   tools like objdump, and may also have impact on the CPU speculative
> +   execution efficiency in some cases.  */
> +
> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */

After further investigation, we should probably do the following
to handle compiling with -mbig-endian on aarch64, which generates
binaries with mixed code vs data endianness (little endian code,
big endian data):

#ifdef __ARM_BIG_ENDIAN
#define RSEQ_SIG 0x00bc28d4	/* BRK #0x45E0.  */
#else
#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
#endif

Else mismatch between code endianness for the generated
signatures and data endianness for the RSEQ_SIG parameter
passed to the rseq registration will trigger application
segmentation faults when the kernel try to abort rseq
critical sections.

For ARM32, the situation is a bit more complex. Only armv6+
generates mixed-endianness code vs data with -mbig-endian.
Prior to armv6, the code and data endianness matches. Therefore,
I plan to #ifdef the reversed endianness handling with:

#if __ARM_ARCH >= 6 && __ARM_BIG_ENDIAN

on arm32.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Joseph Myers @ 2019-04-17 16:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: carlos, Will Deacon, Florian Weimer, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <364803063.586.1555516769056.JavaMail.zimbra@efficios.com>

On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:

> > +/* RSEQ_SIG is a signature required before each abort handler code.
> > +
> > +   It is a 32-bit value that maps to actual architecture code compiled
> > +   into applications and libraries. It needs to be defined for each
> > +   architecture. When choosing this value, it needs to be taken into
> > +   account that generating invalid instructions may have ill effects on
> > +   tools like objdump, and may also have impact on the CPU speculative
> > +   execution efficiency in some cases.  */
> > +
> > +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
> 
> After further investigation, we should probably do the following
> to handle compiling with -mbig-endian on aarch64, which generates
> binaries with mixed code vs data endianness (little endian code,
> big endian data):

First, the comment on RSEQ_SIG should specify whether it is to be 
interpreted in the code or the data endianness.

> For ARM32, the situation is a bit more complex. Only armv6+
> generates mixed-endianness code vs data with -mbig-endian.
> Prior to armv6, the code and data endianness matches. Therefore,
> I plan to #ifdef the reversed endianness handling with:
> 
> #if __ARM_ARCH >= 6 && __ARM_BIG_ENDIAN
> 
> on arm32.

That doesn't work well because BE code (.o files) can be built for v5te 
(for example) and used on a range of different architecture variants with 
both BE32 and BE8 - the choice between BE32 and BE8 is a link-time choice, 
not a compile-time choice.  So if the value for Arm is a compile-time 
constant, it should also work for both BE32 and BE8.

In turn, that suggests to me that RSEQ_SIG should be defined to be a value 
that is always in the code endianness (and whatever corresponding kernel 
code handles RSEQ_SIG values should act accordingly on architectures where 
the two endiannesses can differ).  If the kernel ABI is already fixed in a 
way that prevents such a definition of RSEQ_SIG semantics as using code 
endianness, a value should be chosen for Arm that works for both 
endiannesses.

(Also, installed glibc headers are supposed to work with older compilers, 
and support for __ARM_ARCH was only added in GCC 4.8.  Before that you 
need to test lots of separate macros for different architecture variants 
to determine a version number.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Andy Lutomirski @ 2019-04-17 16:46 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Aleksa Sarai, Enrico Weigelt, metux IT consult,
	Christian Brauner, Linus Torvalds, Al Viro, Jann Horn,
	David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
	Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel
In-Reply-To: <87v9zc6cz2.fsf@oldenburg2.str.redhat.com>

> On Apr 17, 2019, at 5:19 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Andy Lutomirski:
> 
>> I would personally *love* it if distros started setting no_new_privs
>> for basically all processes.
> 
> Wouldn't no_new_privs inhibit all security transitions, including those
> that reduce privileges?  And therefore effectively reduce security?

In principle, you still can reduce privileges with no_new_privs.  SELinux has a whole mechanism for privilege-reducing transitions on exec that works in no_new_privs mode. Also, all the traditional privilege dropping techniques work — setresuid(), unshare(), etc are all unaffected.

> 
>> There seems to be some demand to be able to do large
> parts of container setup using posix_spawn, so we'll probably add
> support for things like writing to arbitrary files eventually.  And of
> course, proper error reporting, so that you can figure out which file
> creation action failed.
> 

ISTM the way to handle this is to have a way to make a container, set it up, and then clone/spawn into it.  The current unshare() API is severely awkward.

Maybe the new better kernel spawn API shouldn’t support unshare-like semantics at all and should instead work like setns().

^ permalink raw reply

* Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output
From: Andy Lutomirski @ 2019-04-17 19:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aubrey Li, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Andi Kleen, Tim Chen, Dave Hansen,
	Arjan van de Ven, Alexey Dobriyan, aubrey.li, Linux API, LKML
In-Reply-To: <20190416160143.8292ced993dc803aae7fa0da@linux-foundation.org>

On Tue, Apr 16, 2019 at 4:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
> > The architecture specific information of the running processes could
> > be useful to the userland. Add support to examine process architecture
> > specific information externally.
>
> The implementation looks just fine to me.  Have you had any feedback on
> the overall desirability of adding this feature?

I think I've been the most outspoken, and my not-all-that-strong
opinion is that I don't really like it.  /proc/PID/status is already a
bit of a mess, and I don't think we really want it to be a mess that
is different on different architectures.  Hence my suggestion of
/proc/PID/x86_status instead.  Or we could do /proc/PID/arch_status, I
suppose, and make sure that everything in it is namespaced.  (We could
easily end up with a situation where status fields from more than one
architecture are present.  Think i386 + x86_64.  Thata's also the case
where qemu userspace emulation is used to run binaries meant for one
architecture on a different architecture's kernel.)

--Andy

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-17 19:56 UTC (permalink / raw)
  To: Joseph Myers
  Cc: carlos, Will Deacon, Florian Weimer, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <alpine.DEB.2.21.1904171601540.32123@digraph.polyomino.org.uk>

----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
> 
>> > +/* RSEQ_SIG is a signature required before each abort handler code.
>> > +
>> > +   It is a 32-bit value that maps to actual architecture code compiled
>> > +   into applications and libraries. It needs to be defined for each
>> > +   architecture. When choosing this value, it needs to be taken into
>> > +   account that generating invalid instructions may have ill effects on
>> > +   tools like objdump, and may also have impact on the CPU speculative
>> > +   execution efficiency in some cases.  */
>> > +
>> > +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>> 
>> After further investigation, we should probably do the following
>> to handle compiling with -mbig-endian on aarch64, which generates
>> binaries with mixed code vs data endianness (little endian code,
>> big endian data):
> 
> First, the comment on RSEQ_SIG should specify whether it is to be
> interpreted in the code or the data endianness.

Right. The signature passed as argument to the rseq registration
system call needs to be in data endianness (currently exposed kernel
ABI).

Ideally for userspace, we want to define a signature in code endianness
that happens to nicely match specific code patterns.

> 
>> For ARM32, the situation is a bit more complex. Only armv6+
>> generates mixed-endianness code vs data with -mbig-endian.
>> Prior to armv6, the code and data endianness matches. Therefore,
>> I plan to #ifdef the reversed endianness handling with:
>> 
>> #if __ARM_ARCH >= 6 && __ARM_BIG_ENDIAN
>> 
>> on arm32.
> 
> That doesn't work well because BE code (.o files) can be built for v5te
> (for example) and used on a range of different architecture variants with
> both BE32 and BE8 - the choice between BE32 and BE8 is a link-time choice,
> not a compile-time choice.  So if the value for Arm is a compile-time
> constant, it should also work for both BE32 and BE8.

Good to know! Then we need to be even more careful.

> 
> In turn, that suggests to me that RSEQ_SIG should be defined to be a value
> that is always in the code endianness (and whatever corresponding kernel
> code handles RSEQ_SIG values should act accordingly on architectures where
> the two endiannesses can differ).  If the kernel ABI is already fixed in a
> way that prevents such a definition of RSEQ_SIG semantics as using code
> endianness, a value should be chosen for Arm that works for both
> endiannesses.

It might be tricky to pick up a trap instruction that is a palindrome
endianness-wise.

> 
> (Also, installed glibc headers are supposed to work with older compilers,
> and support for __ARM_ARCH was only added in GCC 4.8.  Before that you
> need to test lots of separate macros for different architecture variants
> to determine a version number.)

Good point!

Here is an alternative to the palindrome approach. I'm taking arm32
as an example:

* We define RSEQ_SIG_CODE in code endianness, meant to be used with
  .inst in rseq assembly:

#define RSEQ_SIG_CODE 0xe7f5def3

* We define RSEQ_SIG_DATA in data endianness:

#define RSEQ_SIG_DATA \
        ({ \
                int sig; \
                asm volatile (  "b 2f\n\t" \
                                ".arm\n\t" \
                                "1: .inst 0xe7f5def3\n\t" \
                                "2:\n\t" \
                                "ldr %[sig], 1b\n\t" \
                                : [sig] "=r" (sig)); \
                sig; \
        })

Technically, only glibc and early-adopter libraries wishing to
register rseq need to use RSEQ_SIG_DATA. The RSEQ_SIG_CODE needs
to be used from inline assembly to create the signatures before
each abort handler.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX
From: Suren Baghdasaryan @ 2019-04-17 20:36 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190402104153.25404-4-patrick.bellasi@arm.com>

 Hi Patrick,

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> When a task sleeps it removes its max utilization clamp from its CPU.
> However, the blocked utilization on that CPU can be higher than the max
> clamp value enforced while the task was running. This allows undesired
> CPU frequency increases while a CPU is idle, for example, when another
> CPU on the same frequency domain triggers a frequency update, since
> schedutil can now see the full not clamped blocked utilization of the
> idle CPU.
>
> Fix this by using
>   uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
>     uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition.
>

If I understand the intent correctly, you are trying to exclude idle
CPUs from affecting calculations of rq UCLAMP_MAX value. If that is
true I think description can be simplified a bit :) In particular it
took me some time to understand what "blocked utilization" means,
however if it's a widely accepted term then feel free to ignore my
input.

> Don't track any minimum utilization clamps since an idle CPU never
> requires a minimum frequency. The decay of the blocked utilization is
> good enough to reduce the CPU frequency.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> --
> Changes in v8:
>  Message-ID: <20190314170619.rt6yhelj3y6dzypu@e110439-lin>
>  - moved flag reset into uclamp_rq_inc()
> ---
>  kernel/sched/core.c  | 45 ++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6e1beae5f348..046f61d33f00 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -754,8 +754,35 @@ static inline unsigned int uclamp_none(int clamp_id)
>         return SCHED_CAPACITY_SCALE;
>  }
>
> +static inline unsigned int
> +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> +{
> +       /*
> +        * Avoid blocked utilization pushing up the frequency when we go
> +        * idle (which drops the max-clamp) by retaining the last known
> +        * max-clamp.
> +        */
> +       if (clamp_id == UCLAMP_MAX) {
> +               rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> +               return clamp_value;
> +       }
> +
> +       return uclamp_none(UCLAMP_MIN);
> +}
> +
> +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> +                                    unsigned int clamp_value)
> +{
> +       /* Reset max-clamp retention only on idle exit */
> +       if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> +               return;
> +
> +       WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> +}
> +
>  static inline
> -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> +                                unsigned int clamp_value)

IMHO the name of uclamp_rq_max_value() is a bit misleading because:
1. It does not imply that it has to be called only when there are no
more runnable tasks on a CPU. This is currently the case because it's
called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but
nothing in the name of this function indicates that it can't be called
from other places.
2. It does not imply that it marks rq UCLAMP_FLAG_IDLE.

>  {
>         struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
>         int bucket_id = UCLAMP_BUCKETS - 1;
> @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
>         }
>
>         /* No tasks -- default clamp values */
> -       return uclamp_none(clamp_id);
> +       return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>
>  /*
> @@ -794,6 +821,8 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
>         bucket = &uc_rq->bucket[uc_se->bucket_id];
>         bucket->tasks++;
>
> +       uclamp_idle_reset(rq, clamp_id, uc_se->value);
> +
>         /*
>          * Local max aggregation: rq buckets always track the max
>          * "requested" clamp value of its RUNNABLE tasks.
> @@ -820,6 +849,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
>         struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
>         struct uclamp_se *uc_se = &p->uclamp[clamp_id];
>         struct uclamp_bucket *bucket;
> +       unsigned int bkt_clamp;
>         unsigned int rq_clamp;
>
>         bucket = &uc_rq->bucket[uc_se->bucket_id];
> @@ -848,7 +878,8 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
>                  * there are anymore RUNNABLE tasks refcounting it.
>                  */
>                 bucket->value = uclamp_bucket_base_value(bucket->value);
> -               WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> +               bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> +               WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }
>  }
>
> @@ -861,6 +892,10 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
>                 uclamp_rq_inc_id(p, rq, clamp_id);
> +
> +       /* Reset clamp idle holding when there is one RUNNABLE task */
> +       if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> +               rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
>  }
>
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> @@ -879,8 +914,10 @@ static void __init init_uclamp(void)
>         unsigned int clamp_id;
>         int cpu;
>
> -       for_each_possible_cpu(cpu)
> +       for_each_possible_cpu(cpu) {
>                 memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +               cpu_rq(cpu)->uclamp_flags = 0;
> +       }
>
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>                 struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c3d1ae1e7eec..d8b182f1782c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -880,6 +880,8 @@ struct rq {
>  #ifdef CONFIG_UCLAMP_TASK
>         /* Utilization clamp values based on CPU's RUNNABLE tasks */
>         struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +       unsigned int            uclamp_flags;
> +#define UCLAMP_FLAG_IDLE 0x01
>  #endif
>
>         struct cfs_rq           cfs;
> --
> 2.20.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox