Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v3 10/15] serial: stm32-usart: Add STM32 USART Driver
From: Maxime Coquelin @ 2015-03-19 17:35 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Andy Shevchenko, Uwe Kleine-König, Andreas Färber,
	Geert Uytterhoeven, Rob Herring, Philipp Zabel, Linus Walleij,
	Arnd Bergmann, Stefan Agner, Peter Meerwald, Paul Bolle,
	Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton
In-Reply-To: <550AE41C.8070803-WaGBZJeGNqdsbIuE7sb01tBPR1lH4CV8@public.gmane.org>

2015-03-19 15:58 GMT+01:00 Peter Hurley <peter-WaGBZJeGNqdsbIuE7sb01tBPR1lH4CV8@public.gmane.org>:
> On 03/19/2015 09:55 AM, Maxime Coquelin wrote:
>>>>>> +static void stm32_set_termios(struct uart_port *port, struct ktermios *termios,
>>>>>> +                           struct ktermios *old)
> [...]
>>>>>> +       usardiv = (port->uartclk * 25) / (baud * 4);
>>>>>> +       mantissa = (usardiv / 100) << USART_BRR_DIV_M_SHIFT;
>>>>>> +       fraction = DIV_ROUND_CLOSEST((usardiv % 100) * 16, 100);
>>>>>> +       if (fraction & ~USART_BRR_DIV_F_MASK) {
>>>>>> +               fraction = 0;
>>>>>> +               mantissa += (1 << USART_BRR_DIV_M_SHIFT);
>>>>>> +       }
> [...]
>> Really, I would prefer keeping this fractional divider as it is
>> implemented today.
>
> You have to admit that's basically an unintelligible mess;
> how would anyone ever be able to refactor and replace that with a
> common divider implementation?
>
> At the very least, please comment on the formula and format.

Ok, I will refactor the implementation, and comment it.

Regards,
Maxime

>
> Regards,
> Peter Hurley
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [dmidecode] [Patch v4] firmware: dmi-sysfs: add SMBIOS entry point area attribute
From: Ivan.khoronzhuk @ 2015-03-19 17:33 UTC (permalink / raw)
  To: Jean Delvare
  Cc: Ivan Khoronzhuk, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A,
	matt.fleming-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	dmidecode-devel-qX2TKyscuCcdnm+yROfE0A,
	leif.lindholm-QSEj5FYQhm4dnm+yROfE0A,
	msalter-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1425978807.24849.22.camel-H7Kp9ZFCxt/N0uC3ymp8PA@public.gmane.org>



On 10.03.15 11:13, Jean Delvare wrote:
> Hi Ivan,
>
> Sorry for the late reply. I think I addressed some points in other
> replies already, but for completeness let me reply to this post too.
>
> Le Wednesday 04 March 2015 à 14:30 +0200, Ivan.khoronzhuk a écrit :
>> On 02/26/2015 11:36 AM, Jean Delvare wrote:
>>> On Wed,  4 Feb 2015 19:06:03 +0200, Ivan Khoronzhuk wrote:
>>>> diff --git a/Documentation/ABI/testing/sysfs-firmware-dmi b/Documentation/ABI/testing/sysfs-firmware-dmi
>>>> index c78f9ab..3a9ffe8 100644
>>>> --- a/Documentation/ABI/testing/sysfs-firmware-dmi
>>>> +++ b/Documentation/ABI/testing/sysfs-firmware-dmi
>>>> @@ -12,6 +12,16 @@ Description:
>>>>    		cannot ensure that the data as exported to userland is
>>>>    		without error either.
>>>>    
>>>> +		The firmware provides DMI structures as a packed list of
>>>> +		data referenced by a SMBIOS table entry point. The SMBIOS
>>>> +		entry point contains general information, like SMBIOS
>>>> +		version, DMI table size, etc. The structure, content and
>>>> +		size of SMBIOS entry point is dependent on SMBIOS version.
>>>> +		That's why SMBIOS entry point is represented in dmi sysfs
>>>> +		like a raw attribute and is accessible via
>>>> +		/sys/firmware/dmi/smbios_raw_header. The format of SMBIOS
>>> As mentioned before, I don't like the name "smbios_raw_header". I think
>>> it should be "smbios_entry_point" or similar.
>> If Matt is OK to get another version,
>> Let it be smbios_entry_point.
>> If it's more convenient, it should be changed while it's possible.
> Great, thanks.
>
>>>> @@ -669,6 +699,18 @@ static int __init dmi_sysfs_init(void)
>>>>    		goto err;
>>>>    	}
>>>>    
>>>> +	smbios_raw_header = dmi_get_smbios_entry_area(&size);
>>>> +	if (!smbios_raw_header) {
>>>> +		pr_debug("dmi-sysfs: SMBIOS raw data is not available.\n");
>>>> +		error = -EINVAL;
>>>> +		goto err;
>>>> +	}
>>> I don't think this should have been a fatal error. Just because for
>>> some reason dmi_get_smbios_entry_area() returned NULL is no good reason
>>> for nor exposing /sys/firmware/dmi/entries as we used to.
>> It issues an error only in case of when entry table is not available,
>> if entry point is absent then dmi table is not available a fortiori.
>> So there is no reason to continue from that point.
> Well it could also fail because of an error in the code ;-) But OK, I
> agree with you here.
>
>>> But anyway this is no longer relevant if the code is moved to dmi_scan
>>> as I suggested.
>>>
>>>> +
>>>> +	/* Create the raw binary file to access the entry area */
>>>> +	smbios_raw_area_attr.size = size;
>>>> +	if (sysfs_create_bin_file(dmi_kobj, &smbios_raw_area_attr))
>>>> +		goto err;
>>> I think this should have had a corresponding call to
>>> sysfs_remove_bin_file() in dmi_sysfs_exit(). (Again no longer relevant
>>> if the code is moved.)
>> The removing is done in kobject_del().
>> Doesn't it? In another way it should be done for
>> dmi/entries/*/raw attributes also.
> It _is_ done for other attributes already:
>
> 	kset_unregister(dmi_kset);
>
> Which is exactly why I believe it should be done for
> smbios_raw_area_attr too. All other kernel drivers are calling
> sysfs_create_bin_file() or equivalent in their cleanup function and I
> see no reason why this driver would be an exception.

kset_unregister() uses kobject_del()
no see difference.

>
>>> There's one thing I do not understand. I seem to understand that the
>>> goal behind this patch is to be able to run dmidecode without /dev/mem.
>>> Dmidecode currently reads 2 areas from /dev/mem: the 0xF0000-0xFFFFF
>>> area in search of the entry point, and the DMI data table itself. With
>>> this patch you make the entry point available through sysfs. But
>>> dmidecode will still need to access /dev/mem to access the DMI data
>>> table. So that does not really solve anything, does it?
>> It's supposed to read DMI table via entries presented by dmi-sysfs.
>> It contains raw attributes that can be used for these purposes.
>> No need to use /dev/mem.
> Yes, I understood this meanwhile, sorry.
>
>> Another case if you want to add binary of whole dmi table to be able to
>> read it directly in order to parse in dmidecode w/o any additional headache
>> with open/close. Well, it partly dupes currently present dmi-sysfs.
> In fact dmi-sysfs already duplicates a lot of code which was already
> present in dmidecode and libsmbios. And exporting the raw table will
> require way less code than the implementation you proposed originally.
> So the code duplication argument doesn't hold, sorry.
>
>> But it simplifies dmi table parsing for dmidecode, and who wants to use
>> dmi-sysfs, let them use it, but dmidecode will be reading raw entry.
>> Well let it be. Why not.
> Yes, this is exactly my point.
>
>> If others are OK, for dmidecode, and probably others tools also,
>> kernel can constantly expose two tables under /sys/firmware/dmi/tables/
>> smbios_entry_point and dmi_table. Independently on dmi-sysfs.
> That's what I would like to see implemented, yes, thank you very much.
>

-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* Re: [PATCH 1/1] Add virtio-input driver.
From: David Herrmann @ 2015-03-19 17:16 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Gerd Hoffmann, virtio-dev, virtualization, mst, Rusty Russell,
	open list, open list:ABI/API
In-Reply-To: <20150319162704.GE30732@dtor-ws>

Hey

On Thu, Mar 19, 2015 at 5:27 PM, Dmitry Torokhov
<dmitry.torokhov@gmail.com> wrote:
> On Thu, Mar 19, 2015 at 01:29:49PM +0100, David Herrmann wrote:
[...]
>> > +static int virtinput_probe(struct virtio_device *vdev)
>> > +{
>> > +       struct virtio_input *vi;
>> > +       size_t size;
>> > +       int abs, err;
>> > +
>> > +       vi = kzalloc(sizeof(*vi), GFP_KERNEL);
>> > +       if (!vi) {
>> > +               err = -ENOMEM;
>> > +               goto out1;
>> > +       }
>> > +       vdev->priv = vi;
>> > +       vi->vdev = vdev;
>> > +
>> > +       err = virtinput_init_vqs(vi);
>> > +       if (err)
>> > +               goto out2;
>> > +
>> > +       vi->idev = input_allocate_device();
>> > +       if (!vi->idev) {
>> > +               err = -ENOMEM;
>> > +               goto out3;
>> > +       }
>> > +       input_set_drvdata(vi->idev, vi);
>> > +
>> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_ID_NAME, 0);
>> > +       virtio_cread_bytes(vi->vdev, offsetof(struct virtio_input_config, u),
>> > +                          vi->name, min(size, sizeof(vi->name)));
>> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_ID_SERIAL, 0);
>> > +       virtio_cread_bytes(vi->vdev, offsetof(struct virtio_input_config, u),
>> > +                          vi->serial, min(size, sizeof(vi->serial)));
>> > +       snprintf(vi->phys, sizeof(vi->phys),
>> > +                "virtio%d/input0", vdev->index);
>> > +
>> > +       virtinput_cfg_bits(vi, VIRTIO_INPUT_CFG_PROP_BITS, 0,
>> > +                          vi->idev->propbit, INPUT_PROP_CNT);
>> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_EV_BITS, EV_REP);
>> > +       if (size)
>> > +               set_bit(EV_REP, vi->idev->evbit);
>> > +
>> > +       vi->idev->name = vi->name;
>> > +       vi->idev->phys = vi->phys;
>>
>> Can you set vi->idev->uniq to the virtio-bus path?
>
> No, uniq can't be phys as phys is unique within the system while uniq is
> like serial number or UUID and should never repeat.

...sorry, my bad! We should still forward it from the host, imo. It's
really handy for applications to re-detect devices.

Thanks
David

^ permalink raw reply

* Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
From: Shaohua Li @ 2015-03-19 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	danielmicay-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Rik van Riel, Hugh Dickins,
	Mel Gorman, Johannes Weiner, Michal Hocko, Andy Lutomirski
In-Reply-To: <20150318222246.bc608dd0.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On Wed, Mar 18, 2015 at 10:22:46PM -0700, Andrew Morton wrote:
> On Wed, 18 Mar 2015 22:08:26 -0700 Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:
> 
> > > Daniel also had microbenchmark testing results for glibc and jemalloc. 
> > > Can you please do this?
> > 
> > I run Daniel's microbenchmark too, and not surprise the result is
> > similar:
> > glibc: 32.82
> > jemalloc: 70.35
> > jemalloc+mremap: 33.01
> > tcmalloc: 68.81
> > 
> > but tcmalloc doesn't support mremap currently, so I cant test it.
> 
> But Daniel's changelog implies strongly that tcmalloc would benefit
> from his patch.  Was that inaccurate or is this a difference between
> his patch and yours?

There is no big difference, except I fixed some issues. Daniel didn't
post data for tcmalloc, I suppose it's potential mremap can make
tcmalloc faster too, but Daniel can clarify.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 1/1] Add virtio-input driver.
From: Dmitry Torokhov @ 2015-03-19 16:27 UTC (permalink / raw)
  To: David Herrmann
  Cc: Gerd Hoffmann, virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mst-H+wXaHxf7aLQT0dZR+AlfA, Rusty Russell, open list,
	open list:ABI/API
In-Reply-To: <CANq1E4TDj4pq3J_BVc=Yuzo5dVR=QcNexVUqaqwjg7Qi5_xX4Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Mar 19, 2015 at 01:29:49PM +0100, David Herrmann wrote:
> Hi
> 
> (CC: Dmitry)
> 
> On Thu, Mar 19, 2015 at 10:13 AM, Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > virtio-input is basically evdev-events-over-virtio, so this driver isn't
> > much more than reading configuration from config space and forwarding
> > incoming events to the linux input layer.
> >
> > Signed-off-by: Gerd Hoffmann <kraxel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  drivers/virtio/Kconfig            |  10 ++
> >  drivers/virtio/Makefile           |   1 +
> >  drivers/virtio/virtio_input.c     | 313 ++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_ids.h   |   1 +
> >  include/uapi/linux/virtio_input.h |  65 ++++++++
> >  5 files changed, 390 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_input.c
> >  create mode 100644 include/uapi/linux/virtio_input.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index b546da5..cab9f3f 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -48,6 +48,16 @@ config VIRTIO_BALLOON
> >
> >          If unsure, say M.
> >
> > +config VIRTIO_INPUT
> > +       tristate "Virtio input driver"
> > +       depends on VIRTIO
> > +       depends on INPUT
> > +       ---help---
> > +        This driver supports virtio input devices such as
> > +        keyboards, mice and tablets.
> > +
> > +        If unsure, say M.
> > +
> >   config VIRTIO_MMIO
> >         tristate "Platform bus driver for memory mapped virtio devices"
> >         depends on HAS_IOMEM
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index d85565b..41e30e3 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -4,3 +4,4 @@ obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
> >  virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > +obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > diff --git a/drivers/virtio/virtio_input.c b/drivers/virtio/virtio_input.c
> > new file mode 100644
> > index 0000000..2d425cf
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_input.c
> > @@ -0,0 +1,313 @@
> > +#include <linux/module.h>
> > +#include <linux/virtio.h>
> > +#include <linux/input.h>
> > +
> > +#include <uapi/linux/virtio_ids.h>
> > +#include <uapi/linux/virtio_input.h>
> > +
> > +struct virtio_input {
> > +       struct virtio_device       *vdev;
> > +       struct input_dev           *idev;
> > +       char                       name[64];
> > +       char                       serial[64];
> > +       char                       phys[64];
> > +       struct virtqueue           *evt, *sts;
> > +       struct virtio_input_event  evts[64];
> > +};
> > +
> > +static ssize_t serial_show(struct device *dev,
> > +                          struct device_attribute *attr, char *buf)
> > +{
> > +       struct input_dev *idev = to_input_dev(dev);
> > +       struct virtio_input *vi = input_get_drvdata(idev);
> > +       return sprintf(buf, "%s\n", vi->serial);
> > +}
> > +static DEVICE_ATTR_RO(serial);

What is serial? Serial number?

> > +
> > +static struct attribute *dev_attrs[] = {
> > +       &dev_attr_serial.attr,
> > +       NULL
> > +};
> > +
> > +static umode_t dev_attrs_are_visible(struct kobject *kobj,
> > +                                    struct attribute *a, int n)
> > +{
> > +       struct device *dev = container_of(kobj, struct device, kobj);
> > +       struct input_dev *idev = to_input_dev(dev);
> > +       struct virtio_input *vi = input_get_drvdata(idev);
> > +
> > +       if (a == &dev_attr_serial.attr && !strlen(vi->serial))
> > +               return 0;
> > +
> > +       return a->mode;
> > +}
> > +
> > +static struct attribute_group dev_attr_grp = {
> > +       .attrs =        dev_attrs,
> > +       .is_visible =   dev_attrs_are_visible,
> > +};
> > +
> > +static const struct attribute_group *dev_attr_groups[] = {
> > +       &dev_attr_grp,
> > +       NULL
> > +};
> > +
> > +static void virtinput_queue_evtbuf(struct virtio_input *vi,
> > +                                  struct virtio_input_event *evtbuf)
> > +{
> > +       struct scatterlist sg[1];
> > +
> > +       sg_init_one(sg, evtbuf, sizeof(*evtbuf));
> > +       virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC);
> > +}
> > +
> > +static void virtinput_recv_events(struct virtqueue *vq)
> > +{
> > +       struct virtio_input *vi = vq->vdev->priv;
> > +       struct virtio_input_event *event;
> > +       unsigned int len;
> > +
> > +       while ((event = virtqueue_get_buf(vi->evt, &len)) != NULL) {
> > +               input_event(vi->idev,
> > +                           le16_to_cpu(event->type),
> > +                           le16_to_cpu(event->code),
> > +                           le32_to_cpu(event->value));
> > +               virtinput_queue_evtbuf(vi, event);
> > +       }
> > +       virtqueue_kick(vq);
> > +}
> > +
> > +static int virtinput_send_status(struct virtio_input *vi,
> > +                                u16 type, u16 code, s32 value)
> > +{
> > +       struct virtio_input_event *stsbuf;
> > +       struct scatterlist sg[1];
> > +
> > +       stsbuf = kzalloc(sizeof(*stsbuf), GFP_ATOMIC);
> > +       if (!stsbuf)
> > +               return -ENOMEM;
> > +
> > +       stsbuf->type  = cpu_to_le16(type);
> > +       stsbuf->code  = cpu_to_le16(code);
> > +       stsbuf->value = cpu_to_le32(value);
> > +       sg_init_one(sg, stsbuf, sizeof(*stsbuf));
> > +       virtqueue_add_outbuf(vi->sts, sg, 1, stsbuf, GFP_ATOMIC);
> > +       virtqueue_kick(vi->sts);
> 
> GFP_ATOMIC, eww. But everyone does that for input_event() callbacks..
> we should fix that for user-space input one day.
> 
> > +       return 0;
> > +}
> > +
> > +static void virtinput_recv_status(struct virtqueue *vq)
> > +{
> > +       struct virtio_input *vi = vq->vdev->priv;
> > +       struct virtio_input_event *stsbuf;
> > +       unsigned int len;
> > +
> > +       while ((stsbuf = virtqueue_get_buf(vi->sts, &len)) != NULL)
> > +               kfree(stsbuf);
> > +       virtqueue_kick(vq);
> > +}
> > +
> > +static int virtinput_status(struct input_dev *idev, unsigned int type,
> > +                           unsigned int code, int value)
> > +{
> > +       struct virtio_input *vi = input_get_drvdata(idev);
> > +
> > +       return virtinput_send_status(vi, type, code, value);
> > +}
> > +
> > +static size_t virtinput_cfg_select(struct virtio_input *vi,
> > +                                  u8 select, u8 subsel)
> > +{
> > +       u8 size;
> > +
> > +       virtio_cwrite(vi->vdev, struct virtio_input_config, select, &select);
> > +       virtio_cwrite(vi->vdev, struct virtio_input_config, subsel, &subsel);
> > +       virtio_cread(vi->vdev, struct virtio_input_config, size, &size);
> > +       return size;
> > +}
> > +
> > +static void virtinput_cfg_bits(struct virtio_input *vi, int select, int subsel,
> > +                              unsigned long *bits, unsigned int bitcount)
> > +{
> > +       unsigned int bit;
> > +       size_t bytes;
> > +       u8 cfg = 0;
> > +
> > +       bytes = virtinput_cfg_select(vi, select, subsel);
> > +       if (!bytes)
> > +               return;
> > +       if (bitcount > bytes*8)
> > +               bitcount = bytes*8;
> > +       if (select == VIRTIO_INPUT_CFG_EV_BITS)
> > +               set_bit(subsel, vi->idev->evbit);
> > +       for (bit = 0; bit < bitcount; bit++) {
> > +               if ((bit % 8) == 0)
> > +                       virtio_cread(vi->vdev, struct virtio_input_config,
> > +                                    u.bitmap[bit/8], &cfg);
> > +               if (cfg & (1 << (bit % 8)))
> > +                       set_bit(bit, bits);
> > +       }
> > +}
> > +
> > +static void virtinput_cfg_abs(struct virtio_input *vi, int abs)
> > +{
> > +       u32 size, mi, ma, fu, fl;
> > +
> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_ABS_INFO, abs);
> > +       virtio_cread(vi->vdev, struct virtio_input_config, u.abs.min, &mi);
> > +       virtio_cread(vi->vdev, struct virtio_input_config, u.abs.max, &ma);
> > +       virtio_cread(vi->vdev, struct virtio_input_config, u.abs.fuzz, &fu);
> > +       virtio_cread(vi->vdev, struct virtio_input_config, u.abs.flat, &fl);
> 
> abs.resolution is missing. Please add it, we really also need to add
> it to uinput one day.
> 
> > +       input_set_abs_params(vi->idev, abs,
> > +                            le32_to_cpu(mi), le32_to_cpu(ma),
> > +                            le32_to_cpu(fu), le32_to_cpu(fl));
> > +}
> > +
> > +static int virtinput_init_vqs(struct virtio_input *vi)
> > +{
> > +       struct virtqueue *vqs[2];
> > +       vq_callback_t *cbs[] = { virtinput_recv_events,
> > +                                virtinput_recv_status };
> > +       const char *names[] = { "events", "status" };
> > +       int i, err, size;
> > +
> > +       err = vi->vdev->config->find_vqs(vi->vdev, 2, vqs, cbs, names);
> > +       if (err)
> > +               return err;
> > +       vi->evt = vqs[0];
> > +       vi->sts = vqs[1];
> > +
> > +       size = virtqueue_get_vring_size(vi->evt);
> > +       if (size > ARRAY_SIZE(vi->evts))
> > +               size = ARRAY_SIZE(vi->evts);
> > +       for (i = 0; i < size; i++)
> > +               virtinput_queue_evtbuf(vi, &vi->evts[i]);
> > +
> > +       return 0;
> > +}
> > +
> > +static int virtinput_probe(struct virtio_device *vdev)
> > +{
> > +       struct virtio_input *vi;
> > +       size_t size;
> > +       int abs, err;
> > +
> > +       vi = kzalloc(sizeof(*vi), GFP_KERNEL);
> > +       if (!vi) {
> > +               err = -ENOMEM;
> > +               goto out1;
> > +       }
> > +       vdev->priv = vi;
> > +       vi->vdev = vdev;
> > +
> > +       err = virtinput_init_vqs(vi);
> > +       if (err)
> > +               goto out2;
> > +
> > +       vi->idev = input_allocate_device();
> > +       if (!vi->idev) {
> > +               err = -ENOMEM;
> > +               goto out3;
> > +       }
> > +       input_set_drvdata(vi->idev, vi);
> > +
> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_ID_NAME, 0);
> > +       virtio_cread_bytes(vi->vdev, offsetof(struct virtio_input_config, u),
> > +                          vi->name, min(size, sizeof(vi->name)));
> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_ID_SERIAL, 0);
> > +       virtio_cread_bytes(vi->vdev, offsetof(struct virtio_input_config, u),
> > +                          vi->serial, min(size, sizeof(vi->serial)));
> > +       snprintf(vi->phys, sizeof(vi->phys),
> > +                "virtio%d/input0", vdev->index);
> > +
> > +       virtinput_cfg_bits(vi, VIRTIO_INPUT_CFG_PROP_BITS, 0,
> > +                          vi->idev->propbit, INPUT_PROP_CNT);
> > +       size = virtinput_cfg_select(vi, VIRTIO_INPUT_CFG_EV_BITS, EV_REP);
> > +       if (size)
> > +               set_bit(EV_REP, vi->idev->evbit);
> > +
> > +       vi->idev->name = vi->name;
> > +       vi->idev->phys = vi->phys;
> 
> Can you set vi->idev->uniq to the virtio-bus path?

No, uniq can't be phys as phys is unique within the system while uniq is
like serial number or UUID and should never repeat.

Thanks.

-- 
Dmitry

^ permalink raw reply

* Re: [PATCH V6] Allow compaction of unevictable pages
From: Christoph Lameter @ 2015-03-19 16:18 UTC (permalink / raw)
  To: Eric B Munson
  Cc: Vlastimil Babka, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Mel Gorman, David Rientjes, Rik van Riel, Michal Hocko,
	linux-rt-users, linux-mm, linux-api, linux-kernel
In-Reply-To: <550AEC8B.1080806@akamai.com>

On Thu, 19 Mar 2015, Eric B Munson wrote:

> Thanks, I have a version with the changelog fixed up to actually make
> sense and can submit that if the patch is acceptable otherwise.


Looks good as far as I can see.

Acked-by: Christoph Lameter <cl@linux.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Mark Rutland @ 2015-03-19 16:03 UTC (permalink / raw)
  To: Martin Kepplinger
  Cc: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Pawel Moll,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org,
	galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
	dmitry.torokhov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	alexander.stein-93q1YBGzJSMe9JSWTWOYM3xStJ4P+DSV@public.gmane.org,
	hadess-0MeiytkfxGOsTnJN9+BGXg@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-input-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Martin Kepplinger, Christoph Muellner
In-Reply-To: <1426703538-16919-1-git-send-email-martink-1KBjaw7Xf1+zQB+pC5nmwQ@public.gmane.org>

> diff --git a/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt b/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt
> new file mode 100644
> index 0000000..3921acb
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt
> @@ -0,0 +1,96 @@
> +Freescale MMA8653FC 3-axis Accelerometer
> +
> +Required properties:
> +- compatible
> +       "fsl,mma8653fc"
> +- reg
> +       I2C address
> +
> +Optional properties:
> +
> +- interrupt-parent
> +       a phandle for the interrupt controller (see
> +       Documentation/devicetree/bindings/interrupt-controller/interrupts.txt)
> +- interrupts
> +       interrupt line to which the chip is connected
> +- int1
> +       set to use interrupt line 1 instead of 2

If you have two interrupt output lines, you should have two entries in
interrupts.

You can use interrupt-names to determine which line(s) are wired up.

I don't believe that you need this property.

> +- int_active_high
> +       set interrupt line active high

s/_/-/ in property names please. 

What happens if this isn't set? Is it active-low, or edge-triggered?

It feels like we should be able to query when we need to set this from
the IRQ(s).

> +- ir_freefall_motion_x
> +       activate freefall/motion interrupts on x axis
> +- ir_freefall_motion_y
> +       activate freefall/motion interrupts on y axis
> +- ir_freefall_motion_z
> +       activate freefall/motion interrupts on z axis
> +- irq_threshold
> +       0 < value < 8000: threshold for motion interrupts in mg
> +- ir_landscape_portrait
> +       activate landscape/portrait interrupts
> +- ir_data_ready:
> +       activate data-ready interrupts
> +       Interrupt events can be activated in any combination.

These all sounds like they would be better as runtime options. I don't
see why these should necessarily be in the DT.

> +- range
> +       2, 4, or 8: range in g, default: 2

Likewise.

It would be nice to have a better qualified name than "range".

> +- auto_wake_sleep
> +       auto sleep mode (lower frequency)
> +- motion_mode
> +       use motion mode instead of freefall mode (trigger if >threshold).
> +       per default an interrupt occurs if motion values fall below the
> +       value set in "threshold" and therefore can detect free fall on the
> +       vertical axis (depending on the position of the device).
> +       Setting this values inverts the behaviour and an interrupt occurs
> +       above the threshold value, so usually activate horizontal axis in
> +       this case.

These both sound like they would be better as runtime options.

> +
> +- x-offset
> +       0 < value < 500: calibration offset in mg
> +       this value has an offset of 250 itself:
> +       0 is -250mg, 250 is 0 mg, 500 is 250mg
> +- y-offset
> +       see x-offset
> +- z-offset
> +       see x-offset

I'm unsure about these; it really depends on what the calibration is
for.

Mark.

^ permalink raw reply

* Re: [PATCH v7 tip 5/8] samples: bpf: simple non-portable kprobe filter example
From: Alexei Starovoitov @ 2015-03-19 15:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <20150319115055.70e98cb8@gandalf.local.home>

On 3/19/15 8:50 AM, Steven Rostedt wrote:
>
> I'm not going to review the sample code, as I'm a bit strapped for
> time, and that's more userspace oriented anyway. I'm much more concerned
> that the kernel modifications are correct.

sure. thanks a lot for thorough review!

^ permalink raw reply

* Re: [PATCH v7 tip 4/8] tracing: allow BPF programs to call bpf_trace_printk()
From: Alexei Starovoitov @ 2015-03-19 15:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <20150319112931.14f3569b@gandalf.local.home>

On 3/19/15 8:29 AM, Steven Rostedt wrote:
>> +	/* check format string for allowed specifiers */
>> +	for (i = 0; i < fmt_size; i++)
>
> Even though there's only a single "if" statement after the "for", it is
> usually considered proper to add the brackets if the next line is
> complex (more than one line). Which it is in this case.

ok.

>> +			} else if (fmt[i] == 'p') {
>> +				mod_l[fmt_cnt] = true;
>> +				fmt_cnt++;
>
> So you also allow pointer conversions like "%pS" and "%pF"?

good catch. it's a bug. We shouldn't allow things like pV, pD, etc
Something like pK and pS may be ok, but pF is not because of arch
dependencies. So instead of analyzing all possibilities. I'll allow
%p only. bpf_trace_printk is debug only anyway.

>> +	return __trace_printk(1/* fake ip will not be printed */, fmt,
>> +			      mod_l[0] ? r3 : (u32) r3,
>> +			      mod_l[1] ? r4 : (u32) r4,
>> +			      mod_l[2] ? r5 : (u32) r5);
>
> Does the above work on 32 bit machines? as "%ld" would be (u32), but
> here you are passing in u64.

another great catch. it wouldn't crash, but would print junk. will fix.

^ permalink raw reply

* Re: [PATCH v7 tip 5/8] samples: bpf: simple non-portable kprobe filter example
From: Steven Rostedt @ 2015-03-19 15:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426542584-9406-6-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>


I'm not going to review the sample code, as I'm a bit strapped for
time, and that's more userspace oriented anyway. I'm much more concerned
that the kernel modifications are correct.

-- Steve

^ permalink raw reply

* Re: [PATCH v4 00/14] Add kdbus implementation
From: Andy Lutomirski @ 2015-03-19 15:48 UTC (permalink / raw)
  To: David Herrmann
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Daniel Mack,
	Djalal Harouni
In-Reply-To: <CANq1E4RA1ok=3Z5W2LkzqaUOKtbMVX_Joxe7_LRbLnQgfRUEYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Mar 19, 2015 at 4:26 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi
>
> On Wed, Mar 18, 2015 at 7:24 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Wed, Mar 18, 2015 at 6:54 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> [...]
>>> This program sends unicast messages on kdbus and UDS, exactly the same
>>> number of times with the same 8kb payload. No parsing, no marshaling
>>> is done, just plain message passing. The interesting spikes are
>>> sys_read(), sys_write() and the 3 kdbus sys_ioctl(). Everything else
>>> should be ignored.
>>>
>>> sys_read() and sys_ioctl(KDBUS_CMD_RECV) are about the same. But note
>>> that we don't copy any payload in RECV, so it scales O(1) compared to
>>> message-size.
>>>
>>> sys_write() is about 3x faster than sys_ioctl(KDBUS_CMD_WRITE).
>>
>> Is that factor of 3 for 8 kb payloads?  If so, I expect it's a factor
>> of much worse than 3 for small payloads.
>
> Yes, factor of 3x for 8kb payloads. ~3.8x for 64byte payloads (see the
> second flamegraph I linked, which contains data for 64byte payloads
> (which is basically nothing)).

I find this surprising.  Are both of them so slow that copying 8kb is
negligible?  That's rather sad.

>
>>>>  - The time it takes to connect
>>>
>>> No idea, never measured it. Why is it of interest?
>>
>> Gah, sorry, bad terminology.  I mean the time it takes to send a
>> message to a receiver that you haven't sent to before.
>
> Cold message transactions are horribly slow for both, kdbus and UDS,
> and the performance varies heavily (factor of 10x). I haven't figured
> out whether it's icache/dcache misses, cold branch predictor, process
> mem faults, scheduler, whatever..
>
> What I can say, is the kdbus paths are more expensive, in both LOC and
> execution time. We might be able to improve the cold-transaction
> performance with _unlikely_() annotations, shortcuts, etc. But I want
> much more benchmark data before I try to outsmart the compiler. It
> works reasonably well right now.
>
> Note that from a algorithmic view, there's no difference between the
> first transaction and a following transaction. All relevant accesses
> are O(1).
>
> (Actually looking at the numbers again, worst-case vs. average-case in
> message transaction is exactly 10x for both, UDS and kdbus. Skipping
> the first couple, I get <2x. std-dev is roughly 2%)
>
>> (The kdbus terminology is weird.  You don't send to "endpoints", you
>> don't "connect" to other participants, and it's not even clear to me
>> what a participant in the bus is called.)
>
> A participant is called a "connection" or "peer" (I prefer 'peer'). It
> "connects" to a bus via an endpoint of the bus. Endpoints are
> file-system entries and can be shared, and usually are.
> Unlike binder, kdbus does not know peer-to-peer links. That is, there
> is never (not even a temporary) link between only two peers. Messages
> are always sent to the bus, and the bus makes sure only the addressed
> recipients will get the message.
>
>>>
>>>> I'm also interested in whether the current design is actually amenable
>>>> to good performance.  I'm told that it is, but ISTM there's a lot of
>>>> heavyweight stuff going on with each send operation that will be hard
>>>> to remove.
>>>
>>> I disagree. What heavyweight stuff is going on?
>>
>> At least metadata generation, metadata free, and policy db checks seem
>> expensive.  It could be worth running a bunch of copies of your
>> benchmark on different cores and seeing how it scales.
>
> metadata handling is local to the connection that sends the message.
> It does not affect the overall performance of other bus operations in
> parallel.

Sure it does if it writes to shared cachelines.  Given that you're
incrementing refcounts, I'm reasonable sure that you're touching lots
of shared cachelines.

> Furthermore, it's way faster than collecting the "same" data
> via /proc, so I don't think it slows down the overall transaction at
> all. If a receiver doesn't want metadata, it should not request it (by
> setting the receiver-metadata-mask). If a sender doesn't like the
> overhead, it should not send the metadata (by setting the
> sender-metadata-mask). Only if both peers set the metadata mask, it
> will be transmitted.

But you're comparing to the wrong thing, IMO.  Of course it's much
faster than /proc hackery, but it's probably much slower to do the
metadata operation once per message than to do it when you connect to
the endpoint.  (Gah!  It's a "bus" that could easily have tons of
users but a single "endpoint".  I'm still not used to it.)

>
> The policy-db does indeed look like a bottleneck. Until v2 we used to
> have a policy-cache, which I ripped out as it didn't meet our
> expectations. There are plans to rewrite it, but it's low-priority.
> Thing is, policy-setup is bus-privileged. As long as it's done in a
> sane manner (keeping the entries per name minimal), the hash-table
> based name-lookup gives suitable performance. Only if the number of
> entries per name rises, it gets problematic due to O(n)
> list-traversal. But even that could be optimized without a policy
> cache, by merging matching entries (see kdbus_policy_db_entry_access).
> Furthermore, the policy-db is skipped for privileged peers or if both,
> sender and recipient, trust each other (eg., have the same
> endpoint+uid). Thus, if you have a trusted transaction, the policy-db
> is skipped, anyway.

Yeah, that's reasonable.  I don't see any obvious way around that.
(The policy semantics are still insane wrt connections with multiple
names, though, but that should have nothing to do with performance.
Insanity for historical reasons is still insanity.)

--Andy

^ permalink raw reply

* Re: [PATCH v7 tip 3/8] tracing: allow BPF programs to call bpf_ktime_get_ns()
From: Steven Rostedt @ 2015-03-19 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <550AED5D.3010700@plumgrid.com>

On Thu, 19 Mar 2015 08:38:05 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> > Hmm, a nanosecond value returned as integer? Is there a way to make
> > this a 64 bit return type, or is RET_INTEGER default to 64 bits in BPF
> > functions?
> 
> RET_INTEGER doesn't mean C 'int' width. It means non-pointer and
> non-void value. The return value is always promoted to full
> register width which is 64-bit.
> 

OK.

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH v7 tip 3/8] tracing: allow BPF programs to call bpf_ktime_get_ns()
From: Alexei Starovoitov @ 2015-03-19 15:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150319111124.38f53061-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>

On 3/19/15 8:11 AM, Steven Rostedt wrote:
> On Mon, 16 Mar 2015 14:49:39 -0700
> Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>
>> bpf_ktime_get_ns() is used by programs to compue time delta between events
>
>   "compute"

ok :)

>> +	[BPF_FUNC_ktime_get_ns] = {
>> +		.func = bpf_ktime_get_ns,
>> +		.gpl_only = true,
>> +		.ret_type = RET_INTEGER,
>
> Hmm, a nanosecond value returned as integer? Is there a way to make
> this a 64 bit return type, or is RET_INTEGER default to 64 bits in BPF
> functions?

RET_INTEGER doesn't mean C 'int' width. It means non-pointer and
non-void value. The return value is always promoted to full
register width which is 64-bit.

^ permalink raw reply

* Re: [PATCH V6] Allow compaction of unevictable pages
From: Eric B Munson @ 2015-03-19 15:34 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: Thomas Gleixner, Christoph Lameter, Peter Zijlstra, Mel Gorman,
	David Rientjes, Rik van Riel, Michal Hocko,
	linux-rt-users-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <550AE38E.7090006-AlSwsSmVLrQ@public.gmane.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/19/2015 10:56 AM, Vlastimil Babka wrote:
> On 03/19/2015 02:57 PM, Eric B Munson wrote:
>> Currently, pages which are marked as unevictable are protected
>> from compaction, but not from other types of migration.  The
>> POSIX real time extension explicitly states that mlock() will
>> prevent a major page fault, but the spirit of is is that mlock()
>> should give a process the ability to control sources of latency,
>> including minor page faults. However, the mlock manpage only
>> explicitly says that a locked page will not be written to swap
>> and this can cause some confusion.  The compaction code today,
>> does not give a developer who wants to avoid swap but wants to
>> have large contiguous areas available any method to achieve this
>> state.  This patch introduces a sysctl for controlling
>> compaction behavoir with respect to the unevictable lru.  Users
>> that demand no page
> 
> behavior
> 
>> faults after a page is present can set compact_unevictable to 0
>> and
> 
> compact_unevictable_allowed
> 
>> users who need the large contiguous areas can enable compaction
>> on locked memory by leaving the default value of 1.
>> 
>> To illustrate this problem I wrote a quick test program that
>> mmaps a large number of 1MB files filled with random data.  These
>> maps are created locked and read only.  Then every other mmap is
>> unmapped and I attempt to allocate huge pages to the static huge
>> page pool.  When the compact_unevictable sysctl is 0, I cannot
>> allocate hugepages after
> 
> compact_unevictable_allowed
> 
>> fragmenting memory.  When the value is set to 1, allocations
>> succeed.
>> 
>> Signed-off-by: Eric B Munson <emunson-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org> Cc: Vlastimil
>> Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
> 
> Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
> 
> Thanks.
> 

Thanks, I have a version with the changelog fixed up to actually make
sense and can submit that if the patch is acceptable otherwise.

Eric

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQIcBAEBAgAGBQJVCuyHAAoJELbVsDOpoOa95w0QAIia0yPziiFJRx9uJlGwIfuM
IPHeQ1g201OJiKHxYpZI9FqSu+QJb9UFSPS7ewCH7xE+1aPxEL2pLDZxI5w8OPbY
KYxrVWBdTNesN5Xu8kb0yCXWlk5wGbf65jqMyBJlT9Y+GSiI3zK0AIQgu9Es8zep
YCcig4xfeojzzwGelszsBQ+iDpwqeiS76hCO20yuI5z5G5Le1h7MjxErXZ/uSwlv
+8CHgJWtISjjOYLnbFSEciQmvvcSXtGDmXJ2ru6tgLRoWyIcu3lCyvl/9zi4PuJz
hBtZ5TjQDbyBfj7Vyop90SA9/vwQL8F0wgi9yZXTklebB5cY5b+dWuFdcf14dn2o
uXalxBd1MBQ1hpGXGOLuQCoBows/REjPgKGu+0xGknPL56DXKmoWBeSpjnJKcqIA
bavYJ3bE7HSBI/zjaN2ZiP2Kxl3Y2fV3nmSoXVDJ6hPnYSZUMr1/dBRy5g+kTJ52
wrJt9gMi17alZZFNxsn+EnpagmghwQ89UHLG+ssOViW1DX0j6OxfFDpUlMbso6GS
KW4faaPpIlGbD03f8zZzuCG859rVDiah5WZLVWHG30mVevxvut5QSQo9FEpc2yCk
SG7jghV6Pj3m/F7tdOtwO2PpVSIA0tvxiX734H+z2NoU1Ozfwhofb0hGeEZyp7jm
oAKnxZkkaDbdiaSrNoRM
=bs7n
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: [PATCH v7 tip 2/8] tracing: attach BPF programs to kprobes
From: Alexei Starovoitov @ 2015-03-19 15:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150319110742.7dc9473d-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>

On 3/19/15 8:07 AM, Steven Rostedt wrote:
>>   struct trace_print_flags {
>>   	unsigned long		mask;
>> @@ -252,6 +253,7 @@ enum {
>>   	TRACE_EVENT_FL_WAS_ENABLED_BIT,
>>   	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
>>   	TRACE_EVENT_FL_TRACEPOINT_BIT,
>> +	TRACE_EVENT_FL_KPROBE_BIT,
>
> I think this should be broken up into two patches. One that adds the
> KPROBE_BIT flag to kprobe events, and the other that adds the bpf
> program.

sure. will do.

>
> There should be kerneldoc comments above this function.
>
>> +unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)

ok.

>> +	per_cpu(bpf_prog_active, cpu)--;
>
> Hmm, as cpu == smp_processor_id(), you should be using
> __this_cpu_inc_return(), and __this_cpu_dec().

yeah. picked a wrong place to copy paste from.
will do. good point.

>
> Why not just make kprobe_prog_funcs[] = {
> 	[BPF_FUNC_map_lookup_elem] = &bpf_map_lookup_elem_proto,
> 	[BPF_FUNC_map_update_elem] = &bpf_map_update_elem_proto,
> 	[BPF_FUNC_map_delete_elem] = &bpf_map_delete_elem_proto,
> 	[BPF_FUNC_probe_read] = &kprobe_prog_proto,
>
> And define kprobe_prog_proto separately.
>
> And then you don't need the switch statement, you could just use the
> if (func_id < 0 || func_id >= ARRAY_SIZE(kprobe_prog_funcs))
> 	return kprobe_prog_funcs[func_id];
>
> I think there's several advantages to my approach. One, is that you are
> not wasting memory with empty objects in the array. Also, if the array
> gets out of sync with the enum, it would be possible to return an empty
> object. That is, &kprobe_prog_funcs[out_of_sync_func_id], would not be
> NULL if in the future someone added an enum before BPF_FUNC_probe_read,
> and the func_id passed in is that enum.

yes! There will be gaps in func_ids, so that would have
been a bug. Thanks for catching it!

^ permalink raw reply

* Re: [Patch] firmware: dmi_scan: split dmisubsystem from dmi-sysfs
From: Jean Delvare @ 2015-03-19 15:30 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	matt.fleming-ral2JQCrhuEAvxtiuMwx3w,
	ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	dmidecode-devel-qX2TKyscuCcdnm+yROfE0A,
	leif.lindholm-QSEj5FYQhm4dnm+yROfE0A,
	msalter-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1426539451-2119-1-git-send-email-ivan.khoronzhuk-hExfYMNmJl/Cnp4W7fqMDg@public.gmane.org>

Hi Ivan,

Le Monday 16 March 2015 à 22:57 +0200, Ivan Khoronzhuk a écrit :
> Some utils, like dmidecode and smbios, need to access SMBIOS entry
> table area in order to get information like SMBIOS version, size, etc.
> Currently it's done via /dev/mem. But for situation when /dev/mem
> usage is disabled, the utils have to use dmi sysfs instead, which
> doesn't represent SMBIOS entry and adds code/delay redundancy when direct
> access for table is needed.
> 
> So this patch creates dmi subsystem and adds SMBIOS entry point to allow
> utils in question to work correctly without /dev/mem. Also patch adds
> raw dmi table to simplify dmi table processing in user space, as were
> proposed by Jean Delvare.

"as was proposed" or even "as proposed" sounds more correct.

BTW, which tree is your patch based on? I can't get it to apply cleanly
on top of any kernel version I tried. I adjusted the patch to my tree
but it means I'm not reviewing your code exactly. Please send patches
which can be applied on top of some recent vanilla tree (3.19 or 4.0-rc4
would be OK at the moment.)

> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk-hExfYMNmJl/Cnp4W7fqMDg@public.gmane.org>
> ---
> 
> This patch is logical continuation of
> "[dmidecode] [Patch v4] firmware: dmi-sysfs: add SMBIOS entry point area attribute"
> https://lkml.org/lkml/2015/2/4/475
> 
> Pay attention that this includes /sys/firmware/dmi for holding tables instead of
> /sys/firmware/dmi/table as were proposed.

Why is that? I proposed /sys/firmware/dmi/tables because the acpi
subsystem puts its own tables in /sys/firmware/acpi/tables. It seemed
reasonable to follow that example for consistency. What is your reason
for doing it differently?

>  Documentation/ABI/testing/sysfs-firmware-dmi       | 122 +++------------------
>  .../ABI/testing/sysfs-firmware-dmi-entries         | 110 +++++++++++++++++++

This is adding a lot of noise to your patch, making it harder to review
and backport. If you think that the contents of sysfs-firmware-dmi would
rather go in a documentation file named sysfs-firmware-dmi-entries, I
have no objection, but it should be done in a separate patch.

>  drivers/firmware/dmi-sysfs.c                       |  12 +-
>  drivers/firmware/dmi_scan.c                        | 115 +++++++++++++++++--
>  include/linux/dmi.h                                |   2 +
>  5 files changed, 238 insertions(+), 123 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-firmware-dmi-entries
> 
> diff --git a/Documentation/ABI/testing/sysfs-firmware-dmi b/Documentation/ABI/testing/sysfs-firmware-dmi
> index c78f9ab..6413128 100644
> --- a/Documentation/ABI/testing/sysfs-firmware-dmi
> +++ b/Documentation/ABI/testing/sysfs-firmware-dmi
> @@ -1,110 +1,16 @@
>  What:		/sys/firmware/dmi/
> -Date:		February 2011
> -Contact:	Mike Waychison <mikew-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> +Date:		March 2015
> +Contact:	Ivan Khoronzhuk <ivan.khoronzhuk-hExfYMNmJl/Cnp4W7fqMDg@public.gmane.org>
>  Description:
> (...)
> +		The firmware provides DMI structures as a packed list of
> +		data referenced by a SMBIOS table entry point. The SMBIOS
> +		entry point contains general information, like SMBIOS
> +		version, DMI table size, etc. The structure, content and
> +		size of SMBIOS entry point is dependent on SMBIOS version.
> +		That's why SMBIOS entry point is represented in dmi sysfs
> +		like a raw attribute and is accessible via
> +		/sys/firmware/dmi/smbios_entry_point. The format of SMBIOS
> +		entry point header can be read in SMBIOS specification.
> +		To simplify access and processing delay in user space,

"processing delay" doesn't really describe the problem that was present
with the previous patch set. It was more a problem of processing time,
CPU and memory usage, and code complexity. Anyway I see no reason to
explain that here, given that this proposal was rejected.

You'd rather just explain that the raw data is provided through sysfs as
an alternative to utilities reading them from /dev/mem (as you already
correctly explain in the patch description.) That's what your new patch
is all about.

> +		subsystem provides also raw dmi table under

DMI better spelled capitalized. I'd also avoid mentioning "subsystem",
I'm not sure why you're using that word (and also in the subject), sysfs
is a virtual file system, and there is no such thing as a "dmi
subsystem".

> +		/sys/firmware/dmi/dmi_table.

As said before I'd make it /sys/firmware/dmi/tables/DMI to mimic what
acpi does.

> diff --git a/drivers/firmware/dmi-sysfs.c b/drivers/firmware/dmi-sysfs.c
> index e0f1cb3..390067d 100644
> --- a/drivers/firmware/dmi-sysfs.c
> +++ b/drivers/firmware/dmi-sysfs.c
> @@ -566,7 +566,6 @@ static struct kobj_type dmi_sysfs_entry_ktype = {
>  	.default_attrs = dmi_sysfs_entry_attrs,
>  };
>  
> -static struct kobject *dmi_kobj;
>  static struct kset *dmi_kset;
>  
>  /* Global count of all instances seen.  Only for setup */
> @@ -651,10 +650,10 @@ static int __init dmi_sysfs_init(void)
>  	int error = -ENOMEM;
>  	int val;
>  
> -	/* Set up our directory */
> -	dmi_kobj = kobject_create_and_add("dmi", firmware_kobj);
> -	if (!dmi_kobj)
> -		goto err;
> +	if (!dmi_kobj) {
> +		pr_err("dmi-sysfs: dmi subsysterm is absent.\n");

Typo: subsystem. Then again the use of this word keeps confusing me, but
it's a minor thing.

> +		return -EINVAL;

The original code had a single error path and I see no reason to change
that. -EINVAL seems inappropriate for this case, I'd rather return
-ENOSYS.

> +	}
>  
>  	dmi_kset = kset_create_and_add("entries", NULL, dmi_kobj);
>  	if (!dmi_kset)
> @@ -675,7 +674,6 @@ static int __init dmi_sysfs_init(void)
>  err:
>  	cleanup_entry_list();
>  	kset_unregister(dmi_kset);
> -	kobject_put(dmi_kobj);
>  	return error;
>  }
>  
> @@ -685,8 +683,6 @@ static void __exit dmi_sysfs_exit(void)
>  	pr_debug("dmi-sysfs: unloading.\n");
>  	cleanup_entry_list();
>  	kset_unregister(dmi_kset);
> -	kobject_del(dmi_kobj);
> -	kobject_put(dmi_kobj);
>  }
>  
>  module_init(dmi_sysfs_init);
> diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
> index c9cb725..3fca52a 100644
> --- a/drivers/firmware/dmi_scan.c
> +++ b/drivers/firmware/dmi_scan.c
> @@ -10,6 +10,9 @@
>  #include <asm/dmi.h>
>  #include <asm/unaligned.h>
>  
> +struct kobject *dmi_kobj;
> +EXPORT_SYMBOL_GPL(dmi_kobj);
> +
>  /*
>   * DMI stands for "Desktop Management Interface".  It is part
>   * of and an antecedent to, SMBIOS, which stands for System
> @@ -20,6 +23,9 @@ static const char dmi_empty_string[] = "        ";
>  static u32 dmi_ver __initdata;
>  static u32 dmi_len;
>  static u16 dmi_num;
> +static u8 smbios_entry_point[32];
> +static int smbios_entry_point_size;
> +
>  /*
>   * Catch too early calls to dmi_check_system():
>   */
> @@ -118,6 +124,7 @@ static void dmi_table(u8 *buf,
>  }
>  
>  static phys_addr_t dmi_base;
> +static u8 *dmi_tb;

Where "tb" stands for... "table", but you couldn't use that because a
function by that name already exist, right? I can think of two less
cryptic ways to solve this: either you rename function dmi_table to,
say, dmi_decode_table (which would be a better name anyway IMHO), or you
name your variable dmi_table_p or dmi_raw_table. But "tb" is not
immediate enough to understand.

>  
>  static int __init dmi_walk_early(void (*decode)(const struct dmi_header *,
>  		void *))
> @@ -476,6 +483,8 @@ static int __init dmi_present(const u8 *buf)
>  	if (memcmp(buf, "_SM_", 4) == 0 &&
>  	    buf[5] < 32 && dmi_checksum(buf, buf[5])) {
>  		smbios_ver = get_unaligned_be16(buf + 6);
> +		smbios_entry_point_size = buf[5];
> +		memcpy(smbios_entry_point, buf, smbios_entry_point_size);
>  
>  		/* Some BIOS report weird SMBIOS version, fix that up */
>  		switch (smbios_ver) {
> @@ -508,6 +517,8 @@ static int __init dmi_present(const u8 *buf)
>  					dmi_ver >> 8, dmi_ver & 0xFF,
>  					(dmi_ver < 0x0300) ? "" : ".x");
>  			} else {
> +				smbios_entry_point_size = 15;
> +				memcpy(smbios_entry_point, buf, 15);
>  				dmi_ver = (buf[14] & 0xF0) << 4 |
>  					   (buf[14] & 0x0F);
>  				pr_info("Legacy DMI %d.%d present.\n",
> @@ -535,6 +546,8 @@ static int __init dmi_smbios3_present(const u8 *buf)
>  		dmi_ver &= 0xFFFFFF;
>  		dmi_len = get_unaligned_le32(buf + 12);
>  		dmi_base = get_unaligned_le64(buf + 16);
> +		smbios_entry_point_size = buf[6];
> +		memcpy(smbios_entry_point, buf, smbios_entry_point_size);
>  
>  		/*
>  		 * The 64-bit SMBIOS 3.0 entry point no longer has a field

This is inconsistent. You should either use smbios_entry_point_size as
the last argument to memcpy() in all 3 cases (my preference), or you
should use buf[5], 15 and buf[6] respectively, but not mix.

> @@ -638,6 +651,95 @@ void __init dmi_scan_machine(void)
>  	dmi_initialized = 1;
>  }
>  
> +static ssize_t smbios_entry_point_read(struct file *filp,
> +				       struct kobject *kobj,
> +				       struct bin_attribute *bin_attr,
> +				       char *buf, loff_t pos, size_t count)
> +{
> +	ssize_t size;
> +
> +	size = bin_attr->size;
> +
> +	if (size > pos)
> +		size -= pos;
> +	else
> +		return 0;
> +
> +	if (count < size)
> +		size = count;
> +
> +	memcpy(buf, &smbios_entry_point[pos], size);
> +
> +	return size;
> +}
> +
> +static ssize_t dmi_table_read(struct file *filp,
> +			      struct kobject *kobj,
> +			      struct bin_attribute *bin_attr,
> +			      char *buf, loff_t pos, size_t count)
> +{
> +	ssize_t size;
> +
> +	size = bin_attr->size;
> +
> +	if (size > pos)
> +		size -= pos;
> +	else
> +		return 0;
> +
> +	if (count < size)
> +		size = count;
> +
> +	memcpy(buf, &dmi_tb[pos], size);
> +
> +	return size;
> +}

Looking at drivers/acpi/bgrt.c, there seems to be a more simple way to
implement this: fill in the file size and contents (.private) at
run-time and let sysfs handle all the rest. You already do the former so
you're actually very close. 

> +BIN_ATTR_RO(dmi_table, 0);
> +BIN_ATTR_RO(smbios_entry_point, 0);

These should both be static.

This will make dmi_table readable by all users, instead of just root as
it should be. The DMI data contains private information (serial numbers,
UUID...) You will see that some files under /sys/devices/virtual/dmi/id
can't be read by regular users for this reason.

> +/*
> + * Register the dmi subsystem under the firmware subsysterm

Same typo again: subsystem.

> + */
> +static int __init dmisubsys_init(void)
> +{
> +	int ret = -ENOMEM;

There's only one error case which returns that value so it would be
better set in that error path.

> +
> +	if (!smbios_entry_point_size || !dmi_available) {

I already mentioned in a previous review that I don't think you need to
check for !dmi_available, and you said you agreed.

> +		ret = -EINVAL;

Again -ENOSYS or similar would be more appropriate (not that it matters
a lot in practice though as I don't think there is any way to actually
retrieve this value.)

> +		goto err;
> +	}
> +
> +	/* Set up dmi directory at /sys/firmware/dmi */
> +	dmi_kobj = kobject_create_and_add("dmi", firmware_kobj);
> +	if (!dmi_kobj)
> +		goto err;
> +
> +	bin_attr_smbios_entry_point.size = smbios_entry_point_size;
> +	ret = sysfs_create_bin_file(dmi_kobj, &bin_attr_smbios_entry_point);
> +	if (ret)
> +		goto err;
> +
> +	if (!dmi_tb) {

I don't like this. You should know which of this function and dmi_walk()
is called first. If you don't, then how can you guarantee that a race
condition can't happen?

This makes me wonder if that code wouldn't rather go in
dmi_scan_machine() rather than a separate function.

> +		dmi_tb = dmi_remap(dmi_base, dmi_len);
> +		if (!dmi_tb)
> +			goto err;
> +	}
> +
> +	bin_attr_dmi_table.size = dmi_len;
> +	ret = sysfs_create_bin_file(dmi_kobj, &bin_attr_dmi_table);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	pr_err("dmi: Firmware registration failed.\n");

I said in a previous review that files that have been created should be
explicitly deleted, and I still think so.

> +	kobject_del(dmi_kobj);
> +	kobject_put(dmi_kobj);

I think you also need to explicitly set dmi_kobj to NULL here, otherwise
dmi-sysfs will try to use an object which no longer exists.

An alternative approach would be to leave dmi_kobj available even if the
binary files could not be created. As this case is very unlikely to
happen, I don't care which way you choose.

> +	return ret;
> +}
> +subsys_initcall(dmisubsys_init);
> +
>  /**
>   * dmi_set_dump_stack_arch_desc - set arch description for dump_stack()
>   *
> @@ -897,18 +999,17 @@ EXPORT_SYMBOL(dmi_get_date);
>  int dmi_walk(void (*decode)(const struct dmi_header *, void *),
>  	     void *private_data)
>  {
> -	u8 *buf;
> -
>  	if (!dmi_available)
>  		return -1;
>  
> -	buf = dmi_remap(dmi_base, dmi_len);
> -	if (buf == NULL)
> -		return -1;
> +	if (!dmi_tb) {
> +		dmi_tb = dmi_remap(dmi_base, dmi_len);
> +		if (!dmi_tb)
> +			return -1;
> +	}
>  
> -	dmi_table(buf, decode, private_data);
> +	dmi_table(dmi_tb, decode, private_data);
>  
> -	dmi_unmap(buf);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(dmi_walk);
> diff --git a/include/linux/dmi.h b/include/linux/dmi.h
> index f820f0a..316293e 100644
> --- a/include/linux/dmi.h
> +++ b/include/linux/dmi.h
> @@ -93,6 +93,7 @@ struct dmi_dev_onboard {
>  	int devfn;
>  };
>  
> +extern struct kobject *dmi_kobj;
>  extern int dmi_check_system(const struct dmi_system_id *list);
>  const struct dmi_system_id *dmi_first_match(const struct dmi_system_id *list);
>  extern const char * dmi_get_system_info(int field);
> @@ -112,6 +113,7 @@ extern void dmi_memdev_name(u16 handle, const char **bank, const char **device);
>  
>  #else
>  
> +extern struct kobject *dmi_kobj;

No, if CONFIG_DMI is not set then dmi_scan isn't built and dmi_kobj does
not exist.

>  static inline int dmi_check_system(const struct dmi_system_id *list) { return 0; }
>  static inline const char * dmi_get_system_info(int field) { return NULL; }
>  static inline const struct dmi_device * dmi_find_device(int type, const char *name,


-- 
Jean Delvare
SUSE L3 Support

^ permalink raw reply

* Re: [PATCH v7 tip 4/8] tracing: allow BPF programs to call bpf_trace_printk()
From: Steven Rostedt @ 2015-03-19 15:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426542584-9406-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

On Mon, 16 Mar 2015 14:49:40 -0700
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> Debugging of BPF programs needs some form of printk from the program,
> so let programs call limited trace_printk() with %d %u %x %p modifiers only.
> 
> Similar to kernel modules, during program load verifier checks whether program
> is calling bpf_trace_printk() and if so, kernel allocates trace_printk buffers
> and emits big 'this is debug only' banner.
> 
> Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> ---
>  include/uapi/linux/bpf.h |    1 +
>  kernel/trace/bpf_trace.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 101e509d1001..4095f3d9a716 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -166,6 +166,7 @@ enum bpf_func_id {
>  	BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
>  	BPF_FUNC_probe_read,      /* int bpf_probe_read(void *dst, int size, void *src) */
>  	BPF_FUNC_ktime_get_ns,    /* u64 bpf_ktime_get_ns(void) */
> +	BPF_FUNC_trace_printk,    /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
>  	__BPF_FUNC_MAX_ID,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 74eb6abda6a1..a22763a4d2e2 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -62,6 +62,60 @@ static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
>  	return ktime_get_mono_fast_ns();
>  }
>  
> +/* limited trace_printk()
> + * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed
> + */
> +static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
> +{
> +	char *fmt = (char *) (long) r1;
> +	int fmt_cnt = 0;
> +	bool mod_l[3] = {};
> +	int i;
> +
> +	/* bpf_check() guarantees that fmt points to bpf program stack and
> +	 * fmt_size bytes of it were initialized by bpf program
> +	 */
> +	if (fmt[fmt_size - 1] != 0)
> +		return -EINVAL;
> +
> +	/* check format string for allowed specifiers */
> +	for (i = 0; i < fmt_size; i++)

Even though there's only a single "if" statement after the "for", it is
usually considered proper to add the brackets if the next line is
complex (more than one line). Which it is in this case.


> +		if (fmt[i] == '%') {
> +			if (fmt_cnt >= 3)
> +				return -EINVAL;
> +			i++;
> +			if (i >= fmt_size)
> +				return -EINVAL;
> +
> +			if (fmt[i] == 'l') {
> +				mod_l[fmt_cnt] = true;
> +				i++;
> +				if (i >= fmt_size)
> +					return -EINVAL;
> +			} else if (fmt[i] == 'p') {
> +				mod_l[fmt_cnt] = true;
> +				fmt_cnt++;

So you also allow pointer conversions like "%pS" and "%pF"?

> +				continue;
> +			}
> +
> +			if (fmt[i] == 'l') {
> +				mod_l[fmt_cnt] = true;
> +				i++;
> +				if (i >= fmt_size)
> +					return -EINVAL;
> +			}
> +
> +			if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
> +				return -EINVAL;
> +			fmt_cnt++;
> +		}
> +
> +	return __trace_printk(1/* fake ip will not be printed */, fmt,
> +			      mod_l[0] ? r3 : (u32) r3,
> +			      mod_l[1] ? r4 : (u32) r4,
> +			      mod_l[2] ? r5 : (u32) r5);

Does the above work on 32 bit machines? as "%ld" would be (u32), but
here you are passing in u64.

-- Steve

> +}
> +
>  static struct bpf_func_proto kprobe_prog_funcs[] = {
>  	[BPF_FUNC_probe_read] = {
>  		.func = bpf_probe_read,
> @@ -76,6 +130,13 @@ static struct bpf_func_proto kprobe_prog_funcs[] = {
>  		.gpl_only = true,
>  		.ret_type = RET_INTEGER,
>  	},
> +	[BPF_FUNC_trace_printk] = {
> +		.func = bpf_trace_printk,
> +		.gpl_only = true,
> +		.ret_type = RET_INTEGER,
> +		.arg1_type = ARG_PTR_TO_STACK,
> +		.arg2_type = ARG_CONST_STACK_SIZE,
> +	},
>  };
>  
>  static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
> @@ -90,6 +151,13 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
>  	default:
>  		if (func_id < 0 || func_id >= ARRAY_SIZE(kprobe_prog_funcs))
>  			return NULL;
> +
> +		if (func_id == BPF_FUNC_trace_printk)
> +			/* this program might be calling bpf_trace_printk,
> +			 * so allocate per-cpu printk buffers
> +			 */
> +			trace_printk_init_buffers();
> +
>  		return &kprobe_prog_funcs[func_id];
>  	}
>  }

^ permalink raw reply

* Re: [PATCH v7 tip 3/8] tracing: allow BPF programs to call bpf_ktime_get_ns()
From: Steven Rostedt @ 2015-03-19 15:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426542584-9406-4-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

On Mon, 16 Mar 2015 14:49:39 -0700
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> bpf_ktime_get_ns() is used by programs to compue time delta between events

 "compute"

> or as a timestamp
> 
> Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> ---
>  include/uapi/linux/bpf.h |    1 +
>  kernel/trace/bpf_trace.c |   11 +++++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4486d36d2e9e..101e509d1001 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -165,6 +165,7 @@ enum bpf_func_id {
>  	BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
>  	BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
>  	BPF_FUNC_probe_read,      /* int bpf_probe_read(void *dst, int size, void *src) */
> +	BPF_FUNC_ktime_get_ns,    /* u64 bpf_ktime_get_ns(void) */
>  	__BPF_FUNC_MAX_ID,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ba95b131082c..74eb6abda6a1 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -56,6 +56,12 @@ static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
>  	return probe_kernel_read(dst, unsafe_ptr, size);
>  }
>  
> +static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +	/* NMI safe access to clock monotonic */
> +	return ktime_get_mono_fast_ns();
> +}
> +
>  static struct bpf_func_proto kprobe_prog_funcs[] = {
>  	[BPF_FUNC_probe_read] = {
>  		.func = bpf_probe_read,
> @@ -65,6 +71,11 @@ static struct bpf_func_proto kprobe_prog_funcs[] = {
>  		.arg2_type = ARG_CONST_STACK_SIZE,
>  		.arg3_type = ARG_ANYTHING,
>  	},
> +	[BPF_FUNC_ktime_get_ns] = {
> +		.func = bpf_ktime_get_ns,
> +		.gpl_only = true,
> +		.ret_type = RET_INTEGER,

Hmm, a nanosecond value returned as integer? Is there a way to make
this a 64 bit return type, or is RET_INTEGER default to 64 bits in BPF
functions?

-- Steve


> +	},
>  };
>  
>  static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)

^ permalink raw reply

* Re: [PATCH v7 tip 2/8] tracing: attach BPF programs to kprobes
From: Steven Rostedt @ 2015-03-19 15:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426542584-9406-3-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

On Mon, 16 Mar 2015 14:49:38 -0700
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> User interface:
> struct perf_event_attr attr = {.type = PERF_TYPE_TRACEPOINT, .config = event_id, ...};
> event_fd = perf_event_open(&attr,...);
> ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> 
> prog_fd is a file descriptor associated with BPF program previously loaded.
> event_id is an ID of created kprobe
> 
> close(event_fd) - automatically detaches BPF program from it
> 
> BPF programs can call in-kernel helper functions to:
> - lookup/update/delete elements in maps
> - probe_read - wraper of probe_kernel_read() used to access any kernel
>   data structures
> 
> BPF programs receive 'struct pt_regs *' as an input
> ('struct pt_regs' is architecture dependent)
> 
> Note, kprobes are _not_ a stable kernel ABI, so bpf programs attached to
> kprobes must be recompiled for every kernel version and user must supply correct
> LINUX_VERSION_CODE in attr.kern_version during bpf_prog_load() call.
> 
> Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> ---
>  include/linux/ftrace_event.h    |   14 +++++
>  include/uapi/linux/bpf.h        |    3 +
>  include/uapi/linux/perf_event.h |    1 +
>  kernel/bpf/syscall.c            |    7 ++-
>  kernel/events/core.c            |   59 +++++++++++++++++++
>  kernel/trace/Makefile           |    1 +
>  kernel/trace/bpf_trace.c        |  119 +++++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace_kprobe.c     |   10 +++-
>  8 files changed, 212 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/trace/bpf_trace.c
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index c674ee8f7fca..0aa535bc9f05 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -13,6 +13,7 @@ struct trace_array;
>  struct trace_buffer;
>  struct tracer;
>  struct dentry;
> +struct bpf_prog;
>  
>  struct trace_print_flags {
>  	unsigned long		mask;
> @@ -252,6 +253,7 @@ enum {
>  	TRACE_EVENT_FL_WAS_ENABLED_BIT,
>  	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
>  	TRACE_EVENT_FL_TRACEPOINT_BIT,
> +	TRACE_EVENT_FL_KPROBE_BIT,

I think this should be broken up into two patches. One that adds the
KPROBE_BIT flag to kprobe events, and the other that adds the bpf
program.

>  };
>  
>  /*
> @@ -265,6 +267,7 @@ enum {
>   *                     it is best to clear the buffers that used it).
>   *  USE_CALL_FILTER - For ftrace internal events, don't use file filter
>   *  TRACEPOINT    - Event is a tracepoint
> + *  KPROBE        - Event is a kprobe
>   */
>  enum {
>  	TRACE_EVENT_FL_FILTERED		= (1 << TRACE_EVENT_FL_FILTERED_BIT),
> @@ -274,6 +277,7 @@ enum {
>  	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
>  	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
>  	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
> +	TRACE_EVENT_FL_KPROBE		= (1 << TRACE_EVENT_FL_KPROBE_BIT),
>  };
>  
>  struct ftrace_event_call {
> @@ -303,6 +307,7 @@ struct ftrace_event_call {
>  #ifdef CONFIG_PERF_EVENTS
>  	int				perf_refcount;
>  	struct hlist_head __percpu	*perf_events;
> +	struct bpf_prog			*prog;
>  
>  	int	(*perf_perm)(struct ftrace_event_call *,
>  			     struct perf_event *);
> @@ -548,6 +553,15 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
>  		event_triggers_post_call(file, tt);
>  }
>  



> --- /dev/null
> +++ b/kernel/trace/bpf_trace.c
> @@ -0,0 +1,119 @@
> +/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/uaccess.h>
> +#include "trace.h"
> +
> +static DEFINE_PER_CPU(int, bpf_prog_active);
> +

There should be kerneldoc comments above this function.

> +unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
> +{
> +	unsigned int ret;
> +	int cpu;
> +
> +	if (in_nmi()) /* not supported yet */
> +		return 1;
> +
> +	preempt_disable();
> +
> +	cpu = raw_smp_processor_id();

Perhaps remove the above two lines and replace with:

	cpu = get_cpu();

> +	if (unlikely(per_cpu(bpf_prog_active, cpu)++ != 0)) {
> +		/* since some bpf program is already running on this cpu,
> +		 * don't call into another bpf program (same or different)
> +		 * and don't send kprobe event into ring-buffer,
> +		 * so return zero here
> +		 */
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	rcu_read_lock();
> +	ret = BPF_PROG_RUN(prog, ctx);
> +	rcu_read_unlock();
> +
> + out:
> +	per_cpu(bpf_prog_active, cpu)--;

Hmm, as cpu == smp_processor_id(), you should be using
__this_cpu_inc_return(), and __this_cpu_dec().

> +	preempt_enable();

	put_cpu();


If you replace the above with __this_cpu*, then you could use
preempt_enable/disable() and get rid of the "cpu" var.

> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(trace_call_bpf);
> +
> +static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +	void *dst = (void *) (long) r1;
> +	int size = (int) r2;
> +	void *unsafe_ptr = (void *) (long) r3;
> +
> +	return probe_kernel_read(dst, unsafe_ptr, size);
> +}
> +
> +static struct bpf_func_proto kprobe_prog_funcs[] = {
> +	[BPF_FUNC_probe_read] = {
> +		.func = bpf_probe_read,
> +		.gpl_only = true,
> +		.ret_type = RET_INTEGER,
> +		.arg1_type = ARG_PTR_TO_STACK,
> +		.arg2_type = ARG_CONST_STACK_SIZE,
> +		.arg3_type = ARG_ANYTHING,
> +	},
> +};
> +
> +static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_map_lookup_elem:
> +		return &bpf_map_lookup_elem_proto;
> +	case BPF_FUNC_map_update_elem:
> +		return &bpf_map_update_elem_proto;
> +	case BPF_FUNC_map_delete_elem:
> +		return &bpf_map_delete_elem_proto;
> +	default:
> +		if (func_id < 0 || func_id >= ARRAY_SIZE(kprobe_prog_funcs))
> +			return NULL;
> +		return &kprobe_prog_funcs[func_id];
> +	}
> +}

Why not just make kprobe_prog_funcs[] = {
	[BPF_FUNC_map_lookup_elem] = &bpf_map_lookup_elem_proto,
	[BPF_FUNC_map_update_elem] = &bpf_map_update_elem_proto,
	[BPF_FUNC_map_delete_elem] = &bpf_map_delete_elem_proto,
	[BPF_FUNC_probe_read] = &kprobe_prog_proto,

And define kprobe_prog_proto separately.

And then you don't need the switch statement, you could just use the
if (func_id < 0 || func_id >= ARRAY_SIZE(kprobe_prog_funcs))
	return kprobe_prog_funcs[func_id];

I think there's several advantages to my approach. One, is that you are
not wasting memory with empty objects in the array. Also, if the array
gets out of sync with the enum, it would be possible to return an empty
object. That is, &kprobe_prog_funcs[out_of_sync_func_id], would not be
NULL if in the future someone added an enum before BPF_FUNC_probe_read,
and the func_id passed in is that enum.


> +
> +/* bpf+kprobe programs can access fields of 'struct pt_regs' */
> +static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type)
> +{
> +	/* check bounds */
> +	if (off < 0 || off >= sizeof(struct pt_regs))
> +		return false;
> +
> +	/* only read is allowed */
> +	if (type != BPF_READ)
> +		return false;
> +
> +	/* disallow misaligned access */
> +	if (off % size != 0)
> +		return false;
> +
> +	return true;
> +}
> +
> +static struct bpf_verifier_ops kprobe_prog_ops = {
> +	.get_func_proto = kprobe_prog_func_proto,
> +	.is_valid_access = kprobe_prog_is_valid_access,
> +};
> +
> +static struct bpf_prog_type_list kprobe_tl = {
> +	.ops = &kprobe_prog_ops,
> +	.type = BPF_PROG_TYPE_KPROBE,
> +};
> +
> +static int __init register_kprobe_prog_ops(void)
> +{
> +	bpf_register_prog_type(&kprobe_tl);
> +	return 0;
> +}
> +late_initcall(register_kprobe_prog_ops);
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index d73f565b4e06..dc3462507d7c 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1134,11 +1134,15 @@ static void
>  kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>  {
>  	struct ftrace_event_call *call = &tk->tp.call;
> +	struct bpf_prog *prog = call->prog;
>  	struct kprobe_trace_entry_head *entry;
>  	struct hlist_head *head;
>  	int size, __size, dsize;
>  	int rctx;
>  
> +	if (prog && !trace_call_bpf(prog, regs))
> +		return;
> +
>  	head = this_cpu_ptr(call->perf_events);
>  	if (hlist_empty(head))
>  		return;
> @@ -1165,11 +1169,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
>  		    struct pt_regs *regs)
>  {
>  	struct ftrace_event_call *call = &tk->tp.call;
> +	struct bpf_prog *prog = call->prog;
>  	struct kretprobe_trace_entry_head *entry;
>  	struct hlist_head *head;
>  	int size, __size, dsize;
>  	int rctx;
>  
> +	if (prog && !trace_call_bpf(prog, regs))
> +		return;
> +
>  	head = this_cpu_ptr(call->perf_events);
>  	if (hlist_empty(head))
>  		return;
> @@ -1286,7 +1294,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
>  		kfree(call->print_fmt);
>  		return -ENODEV;
>  	}
> -	call->flags = 0;
> +	call->flags = TRACE_EVENT_FL_KPROBE;

Again, split out the adding of TRACE_EVENT_FL_KPROBE into a separate
patch.

-- Steve

>  	call->class->reg = kprobe_register;
>  	call->data = tk;
>  	ret = trace_add_event_call(call);

^ permalink raw reply

* Re: [PATCH v3 10/15] serial: stm32-usart: Add STM32 USART Driver
From: Peter Hurley @ 2015-03-19 14:58 UTC (permalink / raw)
  To: Maxime Coquelin, Andy Shevchenko
  Cc: Uwe Kleine-König, Andreas Färber, Geert Uytterhoeven,
	Rob Herring, Philipp Zabel, Linus Walleij, Arnd Bergmann,
	Stefan Agner, Peter Meerwald, Paul Bolle, Jonathan Corbet,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Andrew Morton, David S. Miller
In-Reply-To: <CALszF6A5Zu7i0hxSLTS-nfOAXmd0__jLjB=fHxK93Ex5Vbi9LA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 03/19/2015 09:55 AM, Maxime Coquelin wrote:
>>>>> +static void stm32_set_termios(struct uart_port *port, struct ktermios *termios,
>>>>> +                           struct ktermios *old)
[...]
>>>>> +       usardiv = (port->uartclk * 25) / (baud * 4);
>>>>> +       mantissa = (usardiv / 100) << USART_BRR_DIV_M_SHIFT;
>>>>> +       fraction = DIV_ROUND_CLOSEST((usardiv % 100) * 16, 100);
>>>>> +       if (fraction & ~USART_BRR_DIV_F_MASK) {
>>>>> +               fraction = 0;
>>>>> +               mantissa += (1 << USART_BRR_DIV_M_SHIFT);
>>>>> +       }
[...]
> Really, I would prefer keeping this fractional divider as it is
> implemented today.

You have to admit that's basically an unintelligible mess;
how would anyone ever be able to refactor and replace that with a
common divider implementation?

At the very least, please comment on the formula and format.

Regards,
Peter Hurley
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH V6] Allow compaction of unevictable pages
From: Vlastimil Babka @ 2015-03-19 14:56 UTC (permalink / raw)
  To: Eric B Munson, Andrew Morton
  Cc: Thomas Gleixner, Christoph Lameter, Peter Zijlstra, Mel Gorman,
	David Rientjes, Rik van Riel, Michal Hocko,
	linux-rt-users-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426773430-31052-1-git-send-email-emunson-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>

On 03/19/2015 02:57 PM, Eric B Munson wrote:
> Currently, pages which are marked as unevictable are protected from
> compaction, but not from other types of migration.  The POSIX real time
> extension explicitly states that mlock() will prevent a major page
> fault, but the spirit of is is that mlock() should give a process the
> ability to control sources of latency, including minor page faults.
> However, the mlock manpage only explicitly says that a locked page will
> not be written to swap and this can cause some confusion.  The
> compaction code today, does not give a developer who wants to avoid swap
> but wants to have large contiguous areas available any method to achieve
> this state.  This patch introduces a sysctl for controlling compaction
> behavoir with respect to the unevictable lru.  Users that demand no page

  behavior

> faults after a page is present can set compact_unevictable to 0 and

                                        compact_unevictable_allowed

> users who need the large contiguous areas can enable compaction on
> locked memory by leaving the default value of 1.
> 
> To illustrate this problem I wrote a quick test program that mmaps a
> large number of 1MB files filled with random data.  These maps are
> created locked and read only.  Then every other mmap is unmapped and I
> attempt to allocate huge pages to the static huge page pool.  When the
> compact_unevictable sysctl is 0, I cannot allocate hugepages after

compact_unevictable_allowed

> fragmenting memory.  When the value is set to 1, allocations succeed.
> 
> Signed-off-by: Eric B Munson <emunson-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
> Cc: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

Thanks.

^ permalink raw reply

* Re: [PATCH V6] Allow compaction of unevictable pages
From: Michal Hocko @ 2015-03-19 14:47 UTC (permalink / raw)
  To: Eric B Munson
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner,
	Christoph Lameter, Peter Zijlstra, Mel Gorman, David Rientjes,
	Rik van Riel, linux-rt-users, linux-mm, linux-api, linux-kernel
In-Reply-To: <1426773430-31052-1-git-send-email-emunson@akamai.com>

On Thu 19-03-15 09:57:10, Eric B Munson wrote:
> Currently, pages which are marked as unevictable are protected from
> compaction, but not from other types of migration.  The POSIX real time
> extension explicitly states that mlock() will prevent a major page
> fault, but the spirit of is is that mlock() should give a process the
> ability to control sources of latency, including minor page faults.
> However, the mlock manpage only explicitly says that a locked page will
> not be written to swap and this can cause some confusion.  The
> compaction code today, does not give a developer who wants to avoid swap
> but wants to have large contiguous areas available any method to achieve
> this state.  This patch introduces a sysctl for controlling compaction
> behavoir with respect to the unevictable lru.  Users that demand no page
> faults after a page is present can set compact_unevictable to 0 and
> users who need the large contiguous areas can enable compaction on
> locked memory by leaving the default value of 1.
> 
> To illustrate this problem I wrote a quick test program that mmaps a
> large number of 1MB files filled with random data.  These maps are
> created locked and read only.  Then every other mmap is unmapped and I
> attempt to allocate huge pages to the static huge page pool.  When the
> compact_unevictable sysctl is 0, I cannot allocate hugepages after
> fragmenting memory.  When the value is set to 1, allocations succeed.
> 
> Signed-off-by: Eric B Munson <emunson@akamai.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: linux-rt-users@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-api@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Michal Hocko <mhocko@suse.cz>
Thanks!

> ---
> Changes from V5:
> * Default sysctl value is now 1
> * Used more descriptive sysctl name
> * documentation calss out default value
> 
>  Documentation/sysctl/vm.txt |   11 +++++++++++
>  include/linux/compaction.h  |    1 +
>  kernel/sysctl.c             |    9 +++++++++
>  mm/compaction.c             |    7 +++++++
>  4 files changed, 28 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 902b457..9832ec5 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm:
>  - admin_reserve_kbytes
>  - block_dump
>  - compact_memory
> +- compact_unevictable_allowed
>  - dirty_background_bytes
>  - dirty_background_ratio
>  - dirty_bytes
> @@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required.
>  
>  ==============================================================
>  
> +compact_unevictable_allowed
> +
> +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
> +allowed to examine the unevictable lru (mlocked pages) for pages to compact.
> +This should be used on systems where stalls for minor page faults are an
> +acceptable trade for large contiguous free memory.  Set to 0 to prevent
> +compaction from moving pages that are unevictable.  Default value is 1.
> +
> +==============================================================
> +
>  dirty_background_bytes
>  
>  Contains the amount of dirty memory at which the background kernel
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a014559..aa8f61c 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -34,6 +34,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  extern int sysctl_extfrag_threshold;
>  extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>  			void __user *buffer, size_t *length, loff_t *ppos);
> +extern int sysctl_compact_unevictable_allowed;
>  
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 88ea2d6..2f6c880 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1313,6 +1313,15 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= &min_extfrag_threshold,
>  		.extra2		= &max_extfrag_threshold,
>  	},
> +	{
> +		.procname	= "compact_unevictable_allowed",
> +		.data		= &sysctl_compact_unevictable_allowed,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
>  
>  #endif /* CONFIG_COMPACTION */
>  	{
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8c0d945..ad88a8c 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1047,6 +1047,12 @@ typedef enum {
>  } isolate_migrate_t;
>  
>  /*
> + * Allow userspace to control policy on scanning the unevictable LRU for
> + * compactable pages.
> + */
> +int sysctl_compact_unevictable_allowed __read_mostly = 1;
> +
> +/*
>   * Isolate all pages that can be migrated from the first suitable block,
>   * starting at the block pointed to by the migrate scanner pfn within
>   * compact_control.
> @@ -1057,6 +1063,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	unsigned long low_pfn, end_pfn;
>  	struct page *page;
>  	const isolate_mode_t isolate_mode =
> +		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
>  		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>  
>  	/*
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v0 01/11] stm class: Introduce an abstraction for System Trace Module devices
From: Mathieu Poirier @ 2015-03-19 14:39 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Greg Kroah-Hartman,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Pratik Patel, peter.lachner-ral2JQCrhuEAvxtiuMwx3w,
	norbert.schulz-ral2JQCrhuEAvxtiuMwx3w,
	keven.boell-ral2JQCrhuEAvxtiuMwx3w,
	yann.fouassier-ral2JQCrhuEAvxtiuMwx3w,
	laurent.fert-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <877fuddq2c.fsf-qxRn5AmX6ZD9BXuAQUXR0fooFf0ArEBIu+b9c/7xato@public.gmane.org>

On 19 March 2015 at 08:23, Alexander Shishkin
<alexander.shishkin-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> Mathieu Poirier <mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> writes:
>
>> I forgot to mention in my previous email...  I think the hierarchy of
>> our respective tracing module along with the generic-stm probably
>> needs a review.
>>
>> Currently we have drivers/coresight, drivers/intel_th and drivers/stm.
>>
>> To me it doesn't scale - what happens when other architectures come
>> out with their own hw tracing technologies?
>>
>> I suggest we move everything under drivers/hwtracing and as such have:
>>
>> drivers/hwtracing
>> drivers/hwtracing/intel_ht
>> drivers/hwtracing/coresight
>> drivers/hwtracing/stm
>>
>> That way other architectures can add drivers for their own hw tracing
>> technology without further polluting the drivers/ directory and
>> concentrating everything in the same area.  What's your view on that?
>
> I wanted to suggest something similar, actually, if you don't mind
> moving drivers/coresight, then let's do it.

That's a deal - my next patchset will reflect that new organisation.

>
> Regards,
> --
> Alex

^ permalink raw reply

* Re: [PATCH v0 01/11] stm class: Introduce an abstraction for System Trace Module devices
From: Alexander Shishkin @ 2015-03-19 14:23 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, Pratik Patel,
	peter.lachner-ral2JQCrhuEAvxtiuMwx3w,
	norbert.schulz-ral2JQCrhuEAvxtiuMwx3w,
	keven.boell-ral2JQCrhuEAvxtiuMwx3w,
	yann.fouassier-ral2JQCrhuEAvxtiuMwx3w,
	laurent.fert-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CANLsYkx6+mtTf0gPhu37mfbL84oOj0CkW_yOMwFHh_XyqQ7pAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Mathieu Poirier <mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> writes:

> I forgot to mention in my previous email...  I think the hierarchy of
> our respective tracing module along with the generic-stm probably
> needs a review.
>
> Currently we have drivers/coresight, drivers/intel_th and drivers/stm.
>
> To me it doesn't scale - what happens when other architectures come
> out with their own hw tracing technologies?
>
> I suggest we move everything under drivers/hwtracing and as such have:
>
> drivers/hwtracing
> drivers/hwtracing/intel_ht
> drivers/hwtracing/coresight
> drivers/hwtracing/stm
>
> That way other architectures can add drivers for their own hw tracing
> technology without further polluting the drivers/ directory and
> concentrating everything in the same area.  What's your view on that?

I wanted to suggest something similar, actually, if you don't mind
moving drivers/coresight, then let's do it.

Regards,
--
Alex

^ permalink raw reply

* [PATCH V6] Allow compaction of unevictable pages
From: Eric B Munson @ 2015-03-19 13:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric B Munson, Vlastimil Babka, Thomas Gleixner,
	Christoph Lameter, Peter Zijlstra, Mel Gorman, David Rientjes,
	Rik van Riel, Michal Hocko, linux-rt-users-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of is is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion.  The
compaction code today, does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state.  This patch introduces a sysctl for controlling compaction
behavoir with respect to the unevictable lru.  Users that demand no page
faults after a page is present can set compact_unevictable to 0 and
users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data.  These maps are
created locked and read only.  Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool.  When the
compact_unevictable sysctl is 0, I cannot allocate hugepages after
fragmenting memory.  When the value is set to 1, allocations succeed.

Signed-off-by: Eric B Munson <emunson-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org>
Cc: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
Cc: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
Cc: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Cc: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
Cc: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: linux-rt-users-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
Changes from V5:
* Default sysctl value is now 1
* Used more descriptive sysctl name
* documentation calss out default value

 Documentation/sysctl/vm.txt |   11 +++++++++++
 include/linux/compaction.h  |    1 +
 kernel/sysctl.c             |    9 +++++++++
 mm/compaction.c             |    7 +++++++
 4 files changed, 28 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 902b457..9832ec5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm:
 - admin_reserve_kbytes
 - block_dump
 - compact_memory
+- compact_unevictable_allowed
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required.
 
 ==============================================================
 
+compact_unevictable_allowed
+
+Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
+allowed to examine the unevictable lru (mlocked pages) for pages to compact.
+This should be used on systems where stalls for minor page faults are an
+acceptable trade for large contiguous free memory.  Set to 0 to prevent
+compaction from moving pages that are unevictable.  Default value is 1.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the background kernel
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a014559..aa8f61c 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -34,6 +34,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int sysctl_extfrag_threshold;
 extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6..2f6c880 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1313,6 +1313,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &min_extfrag_threshold,
 		.extra2		= &max_extfrag_threshold,
 	},
+	{
+		.procname	= "compact_unevictable_allowed",
+		.data		= &sysctl_compact_unevictable_allowed,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 
 #endif /* CONFIG_COMPACTION */
 	{
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0d945..ad88a8c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1047,6 +1047,12 @@ typedef enum {
 } isolate_migrate_t;
 
 /*
+ * Allow userspace to control policy on scanning the unevictable LRU for
+ * compactable pages.
+ */
+int sysctl_compact_unevictable_allowed __read_mostly = 1;
+
+/*
  * Isolate all pages that can be migrated from the first suitable block,
  * starting at the block pointed to by the migrate scanner pfn within
  * compact_control.
@@ -1057,6 +1063,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long low_pfn, end_pfn;
 	struct page *page;
 	const isolate_mode_t isolate_mode =
+		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
 		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
 
 	/*
-- 
1.7.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox