* Re: [PATCH] Add virtio gpu driver.
From: Gerd Hoffmann @ 2015-03-25 14:52 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: virtio-dev, open list:ABI/API, Rusty Russell, open list,
open list:DRM DRIVERS, open list:VIRTIO CORE, NET..., Dave Airlie
In-Reply-To: <20150324171255-mutt-send-email-mst@redhat.com>
Hi,
> > diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> > index e894eb2..a3167fa 100644
> > --- a/drivers/virtio/virtio_pci_common.c
> > +++ b/drivers/virtio/virtio_pci_common.c
> > @@ -510,7 +510,7 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
> >  		goto err_enable_device;
> >
> >  	rc = pci_request_regions(pci_dev, "virtio-pci");
> > -	if (rc)
> > +	if (rc && ((pci_dev->class >> 8) != PCI_CLASS_DISPLAY_VGA))
> >  		goto err_request_regions;
> >
> >  	if (force_legacy) {
>
> This is probably what you described as "the only concern"?
Ahem, no, forgot that one, but it is related. With vesafb using and
registering the vga compat framebuffer bar pci_request_regions will not
succeed.
vesafb will be unregistered later on (this is what I was referring to) by
the virtio-gpu driver.
> If we only need to request specific
> regions, I think we should do exactly that, requesting only parts of
> regions that are covered by the virtio capabilities.
That should work too.
cheers,
Gerd
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
* Re: [PATCH] Add virtio gpu driver.
From: Gerd Hoffmann @ 2015-03-25 14:53 UTC (permalink / raw)
To: Daniel Vetter
Cc: virtio-dev, Michael S. Tsirkin, open list:ABI/API, open list,
open list:DRM DRIVERS, open list:VIRTIO CORE, NET..., Dave Airlie
In-Reply-To: <20150324165057.GN1349@phenom.ffwll.local>
> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> > Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
>
> Standard request from my side for new drm drivers (especially if they're
> this simple): Can you please update the drivers to latest drm internal
> interfaces, i.e. using universal planes and atomic?
Have a docs / sample code pointer for me?
thanks,
Gerd
* Re: [PATCH v2 7/7] clone4: Add a CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-25 14:53 UTC (permalink / raw)
To: David Drysdale
Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
Thomas Gleixner, Michael Kerrisk, Thiago Macieira,
linux-kernel@vger.kernel.org, Linux API, Linux FS Devel, X86 ML
In-Reply-To: <CAHse=S9F=F8yOcac4ywwQbahZkZjbTGFUfTjy=4Guo_UoMaJkQ@mail.gmail.com>
On Mon, Mar 23, 2015 at 05:38:45PM +0000, David Drysdale wrote:
> On Sun, Mar 15, 2015 at 8:00 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9daa017..1dc680b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1374,6 +1374,11 @@ struct task_struct {
> >
> >  	unsigned autoreap:1; /* Do not become a zombie on exit */
> >
> > +#ifdef CONFIG_CLONEFD
> > +	unsigned clonefd:1; /* Notify clonefd_wqh on exit */
> > +	wait_queue_head_t clonefd_wqh;
> > +#endif
> > +
> >  	unsigned long atomic_flags; /* Flags needing atomic access. */
> >
> >  	struct restart_block restart_block;
>
> Idle thought: are there any concerns about the occupancy
> impact of adding a wait_queue_head to every task_struct,
> whether it has a clonefd or not?
>
> I guess we could reduce the size somewhat by just
> storing a struct file *clonefd_file in the task, and then have
> a separate structure (with the wqh and a task_struct*) referenced
> by file->private_data. Not sure whether the added complication
> would be worthwhile, though.
My original patches did exactly that (minus the reference back to the
task_struct). However, there are a couple of problems with that
approach. First, it assumes that a task_struct has only a single file
referencing it, but in the future I'd like to support obtaining a
clonefd for an existing task. Second, the task_struct really shouldn't
have a reference to the actual struct file, when it only needs the
wait_queue_head_t.
Also, AFAICT a wait_queue_head_t is normally (in the absence of kernel
lock debugging options) the size of two pointers. Adding an indirection
and an extra allocation to change that to the size of one pointer seems
iffy, especially when looking at the rest of what's directly in
task_struct that's far larger.
> > --- /dev/null
> > +++ b/kernel/clonefd.c
> > @@ -0,0 +1,121 @@
> > +/*
> > + * Support functions for CLONE_FD
> > + *
> > + * Copyright (c) 2015 Intel Corporation
> > + * Original authors: Josh Triplett <josh@joshtriplett.org>
> > + * Thiago Macieira <thiago@macieira.org>
> > + */
> > +#include <linux/anon_inodes.h>
> > +#include <linux/file.h>
> > +#include <linux/fs.h>
> > +#include <linux/poll.h>
> > +#include <linux/slab.h>
> > +#include "clonefd.h"
> > +
> > +static int clonefd_release(struct inode *inode, struct file *file)
> > +{
> > +	put_task_struct(file->private_data);
> > +	return 0;
> > +}
> > +
> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct task_struct *p = file->private_data;
> > +	poll_wait(file, &p->clonefd_wqh, wait);
> > +	return p->exit_state ? (POLLIN | POLLRDNORM | POLLHUP) : 0;
> > +}
> > +
> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> > +{
> > +	struct task_struct *p = file->private_data;
> > +	int ret = 0;
> > +
> > +	/* EOF after first read */
> > +	if (*ppos)
> > +		return 0;
> > +
> > +	if (file->f_flags & O_NONBLOCK)
> > +		ret = -EAGAIN;
> > +	else
> > +		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state);
> > +
> > +	if (p->exit_state) {
> > +		struct clonefd_info info = {};
> > +		cputime_t utime, stime;
> > +		task_exit_code_status(p->exit_code, &info.code, &info.status);
> > +		info.code &= ~__SI_MASK;
> > +		task_cputime(p, &utime, &stime);
> > +		info.utime = cputime_to_clock_t(utime + p->signal->utime);
> > +		info.stime = cputime_to_clock_t(stime + p->signal->stime);
> > +		ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
> > +	}
> > +	return ret;
> > +}
> > +
> > +static struct file_operations clonefd_fops = {
> > +	.release = clonefd_release,
> > +	.poll = clonefd_poll,
> > +	.read = clonefd_read,
> > +	.llseek = no_llseek,
> > +};
>
> It might be nice to include a show_fdinfo() implementation that shows
> (say) the pid that the clonefd refers to. E.g. something like:
>
> static void clonefd_show_fdinfo(struct seq_file *m, struct file *file)
> {
> 	struct task_struct *p = file->private_data;
>
> 	seq_printf(m, "tid:\t%d\n", task_tgid_vnr(p));
> }
I thought about that, but that would add a couple of additional ifdefs
(CONFIG_PROC_FS), for an informational file of minimal value. More
importantly, I don't want to add that until after adding an ioctl or
similar to programmatically obtain the pid from a clonefd; otherwise,
someone might try to use fdinfo as the "API" to do so, which would be
all kinds of awful.
So I'd prefer to add fdinfo in a future extension of clonefd, rather
than in the initial patch series.
> > +
> > +/* Do process exit notification for clonefd. */
> > +void clonefd_do_notify(struct task_struct *p)
> > +{
> > +	if (p->clonefd)
> > +		wake_up_all(&p->clonefd_wqh);
> > +}
> > +
> > +/* Handle the CLONE_FD case for copy_process. */
> > +int clonefd_do_clone(u64 clone_flags, struct task_struct *p,
> > +		     struct clone4_args *args, struct clonefd_setup *setup)
> > +{
> > +	int flags;
> > +	struct file *file;
> > +	int fd;
> > +
> > +	p->clonefd = !!(clone_flags & CLONE_FD);
> > +	if (!p->clonefd)
> > +		return 0;
> > +
> > +	if (args->clonefd_flags & ~(O_CLOEXEC | O_NONBLOCK))
> > +		return -EINVAL;
> > +
>
> Maybe also check for (args->clonefd == NULL) in advance, and
> return -EINVAL or -EFAULT?
That wouldn't be consistent with how clone treats its various other
out argument pointers.
- Josh Triplett
* Re: [PATCH] Add virtio gpu driver.
From: Gerd Hoffmann @ 2015-03-25 15:19 UTC (permalink / raw)
To: Daniel Stone
Cc: virtio-dev, Michael S. Tsirkin, open list:ABI/API, Rusty Russell,
open list, open list:DRM DRIVERS, open list:VIRTIO CORE, NET...,
Dave Airlie
In-Reply-To: <CAPj87rN4pXHukDRD-e=ZrO1hcts04cSz1Hr9TNAgicGVWE5_-Q@mail.gmail.com>
On Tue, 2015-03-24 at 22:50 +0000, Daniel Stone wrote:
> Hi,
>
> On 24 March 2015 at 16:07, Gerd Hoffmann <kraxel@redhat.com> wrote:
> > +static int virtio_gpu_crtc_page_flip(struct drm_crtc *crtc,
> > +				     struct drm_framebuffer *fb,
> > +				     struct drm_pending_vblank_event *event,
> > +				     uint32_t flags)
> > +{
> > +	return -EINVAL;
> > +}
>
> I'm not going to lie, I was really hoping the 5th (?) GPU option for
> Qemu would support pageflipping.
As Dave already pointed out there is a WIP patch for that, it'll be
there.
While at it:
- bochsdrm (qemu -vga std driver) supports pageflip since 3.19.
- cirrus is more or less a lost case, we mimic existing hardware
  from the 90s here and it simply isn't up to today's needs for
  many reasons. Just stop using it.
- qxl -- hmm, not sure, there is this "primary surface" concept in
  the virtual hardware design, which doesn't mix very well with
  pageflip I suspect.
> Daniel's comment about conversion to
> atomic is relevant, but: do you have a mechanism which allows you to
> post updates (e.g. 'start displaying this buffer now please') that
> allows you to get events back when they have actually been displayed?
It's possible to fence the framebuffer update requests, so you'll be
notified when the update has reached the qemu ui code. Typically the ui
code has queued the update at that point. So with a local display (sdl,
gtk) showing up on the screen should be just a pageflip (on the host)
away. With a remote display (vnc) it will take a little longer until
the user will actually see the update.
cheers,
Gerd
* Re: [PATCH] Add virtio gpu driver.
From: Michael S. Tsirkin @ 2015-03-25 15:24 UTC (permalink / raw)
To: Gerd Hoffmann
Cc: virtio-dev, Dave Airlie,
Dave Airlie, David Airlie, Rusty Russell, open list,
open list:DRM DRIVERS, open list:VIRTIO CORE, NET...,
open list:ABI/API
In-Reply-To: <1427295121.23304.5.camel-3OfP5uLMi4C46o+2HkPkLj4oCIwMql/M@public.gmane.org>
On Wed, Mar 25, 2015 at 03:52:01PM +0100, Gerd Hoffmann wrote:
> Hi,
>
> > > diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> > > index e894eb2..a3167fa 100644
> > > --- a/drivers/virtio/virtio_pci_common.c
> > > +++ b/drivers/virtio/virtio_pci_common.c
> > > @@ -510,7 +510,7 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
> > >  		goto err_enable_device;
> > >
> > >  	rc = pci_request_regions(pci_dev, "virtio-pci");
> > > -	if (rc)
> > > +	if (rc && ((pci_dev->class >> 8) != PCI_CLASS_DISPLAY_VGA))
> > >  		goto err_request_regions;
> > >
> > >  	if (force_legacy) {
> >
> > This is probably what you described as "the only concern"?
>
> Ahem, no, forgot that one,
What does the concern refer to then?
> but it is related. With vesafb using and
> registering the vga compat framebuffer bar pci_request_regions will not
> succeed.
>
> vesafb will be unregistered later on (this is what I was referring to) by
> the virtio-gpu driver.
>
> > If we only need to request specific
> > regions, I think we should do exactly that, requesting only parts of
> > regions that are covered by the virtio capabilities.
>
> That should work too.
>
> cheers,
> Gerd
BTW can we teach virtio-gpu to look for framebuffer using
virtio pci caps? Or are there limitations such as only
using IO port BARs, or compatibility with
BIOS code etc that limit us to specific BARs anyway?
--
MST
* Re: [PATCH] Add virtio gpu driver.
From: Gerd Hoffmann @ 2015-03-25 15:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: virtio-dev, open list:ABI/API, Rusty Russell, open list,
open list:DRM DRIVERS, open list:VIRTIO CORE, NET..., Dave Airlie
In-Reply-To: <20150325162246-mutt-send-email-mst@redhat.com>
Hi,
> BTW can we teach virtio-gpu to look for framebuffer using
> virtio pci caps?
The virtio-gpu driver doesn't matter much here, it doesn't use it
anyway.
> Or are there limitations such as only
> using IO port BARs, or compatibility with
> BIOS code etc that limit us to specific BARs anyway?
Yes, vgabios code needs to know. Currently it has bar #2 for the vga
framebuffer bar hardcoded. It's 16bit code. I don't feel like making
the probing more complicated ...
cheers,
Gerd
* Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
From: Vlastimil Babka @ 2015-03-25 16:22 UTC (permalink / raw)
To: Daniel Micay, Aliaksey Kandratsenka
Cc: Andrew Morton, Shaohua Li, linux-mm,
linux-api, Rik van Riel, Hugh Dickins,
Mel Gorman, Johannes Weiner, Michal Hocko, Andy Lutomirski,
google-perftools-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
In-Reply-To: <550E6D9D.1060507-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On 03/22/2015 08:22 AM, Daniel Micay wrote:
> BTW, THP currently interacts very poorly with the jemalloc/tcmalloc
> madvise purging. The part where khugepaged assigns huge pages to dense
> spans of pages is *great*. The part where the kernel hands out a huge
> page for a fault in a 2M span can be awful. It causes the model
> inside the allocator of uncommitted vs. committed pages to break down.
>
> For example, the allocator might use 1M of a huge page and then start
> purging. The purging will split it into 4k pages, so there will be 1M of
> zeroed 4k pages that are considered purged by the allocator. Over time,
> this can cripple purging. Search for "jemalloc huge pages" and you'll
> find lots of horror stories about this.
I'm not sure I get your description right. The problem I know about is
where "purging" means madvise(MADV_DONTNEED) and khugepaged later
collapses a new hugepage that will repopulate the purged parts,
increasing the memory usage. One can limit this via
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none . That
setting doesn't affect the page fault THP allocations, which however
happen only in newly accessed hugepage-sized areas and not partially
purged ones, though.
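
For readers who want to check the khugepaged limit mentioned above from a program, here is a minimal user-space sketch. The helper name read_long_from_file and the macro KHUGEPAGED_MAX_PTES_NONE are made up for illustration; only the sysfs path itself comes from the mail, and it exists only on kernels built with THP support.

```c
#include <assert.h>
#include <stdio.h>

/* Path named in the mail above; only present on kernels built with THP. */
#define KHUGEPAGED_MAX_PTES_NONE \
	"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

/*
 * Hypothetical helper: read a single decimal integer from a sysfs-style
 * file. Returns 0 and stores the value in *out on success, -1 on any
 * failure (file missing, unreadable, or not a number).
 */
int read_long_from_file(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%ld", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would use read_long_from_file(KHUGEPAGED_MAX_PTES_NONE, &val) and treat a -1 return as "THP tunables not available here".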
> I think a THP implementation that played well with purging would
> need to drop the page fault heuristic and rely on a significantly better
> khugepaged.
See here http://lwn.net/Articles/636162/ (the "Compaction" part)
The objection is that some short-lived workloads like gcc have to map
hugepages immediately if they are to benefit from them. I still plan to
improve khugepaged and allow admins to say that they don't want THP page
faults (and rely solely on khugepaged which has more information to
judge additional memory usage), but I'm not sure if it would be an
acceptable default behavior.
One workaround in the current state for jemalloc and friends could be to
use madvise(MADV_NOHUGEPAGE) on hugepage-sized/aligned areas where it
wants to purge parts of them via madvise(MADV_DONTNEED). It could mean
overhead of another syscall and tracking of where this was applied and
when it makes sense to undo this and allow THP to be collapsed again,
though, and it would also split vma's.
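
A rough user-space sketch of that workaround follows. It is not taken from jemalloc or tcmalloc; the function names, the 2 MiB HPAGE_SIZE constant (the x86-64 value) and the over-allocate-then-trim alignment trick are all assumptions made for illustration. It does not do the bookkeeping for undoing MADV_NOHUGEPAGE later, which is the hard part described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB, as on x86-64. */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Map `len` bytes of anonymous memory at a HPAGE_SIZE-aligned address by
 * over-allocating and trimming the unaligned head and tail.
 * `len` must be a multiple of the base page size. Returns NULL on failure.
 */
void *map_hpage_aligned(size_t len)
{
	size_t total = len + HPAGE_SIZE;
	char *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t start;
	size_t head, tail;

	if (raw == MAP_FAILED)
		return NULL;
	start = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~((uintptr_t)HPAGE_SIZE - 1);
	head = start - (uintptr_t)raw;
	tail = total - head - len;
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((char *)start + len, tail);
	return (void *)start;
}

/*
 * Purge [off, off + len) inside the huge-page-sized area at `area`.
 * The whole area is first marked MADV_NOHUGEPAGE so that the purged part
 * is not repopulated by a later collapse. That call can fail with EINVAL
 * on kernels built without THP, so its result is deliberately ignored.
 */
int purge_part(void *area, size_t off, size_t len)
{
	(void)madvise(area, HPAGE_SIZE, MADV_NOHUGEPAGE);
	return madvise((char *)area + off, len, MADV_DONTNEED);
}
```

Note that the MADV_NOHUGEPAGE call on a sub-range of a larger mapping is what splits the vma, as pointed out above.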
> This would mean faulting in a span of memory would no longer
> be faster. Having a flag to populate a range with madvise would help a
If it's a newly mapped memory, there's mmap(MAP_POPULATE). There is also
a madvise(MADV_WILLNEED), which sounds like what you want, but I don't
know what the implementation does exactly - it was apparently added for
paging in ahead, and maybe it ignores unpopulated anonymous areas, but
it would probably be well in spirit of the flag to make it prepopulate
those.
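
For the newly mapped case, a minimal sketch of mmap(MAP_POPULATE) is below, with mincore(2) used only to observe how many pages ended up resident. The function names are hypothetical, population is best effort, and nothing here covers madvise(MADV_WILLNEED), whose behaviour on anonymous memory is left open above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of anonymous memory and ask the kernel to fault the
 * pages in up front instead of on first touch. Returns NULL on failure.
 */
void *map_populated(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Count the pages of [addr, addr + len) that mincore(2) reports as
 * resident. `addr` must be page-aligned. Returns -1 on error.
 */
long resident_pages(void *addr, size_t len)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	size_t n = (len + psz - 1) / psz;
	unsigned char *vec = malloc(n);
	long count = 0;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}
	for (size_t i = 0; i < n; i++)
		count += vec[i] & 1;	/* bit 0: page is resident */
	free(vec);
	return count;
}
```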
> lot though, since the allocator knows exactly how much it's going to
> clobber with the memcpy. There will still be a threshold where mremap
> gets significantly faster, but it would move it higher.
* Re: [PATCH v3 5/9] eeprom: Add bindings for simple eeprom framework
From: Maxime Ripard @ 2015-03-25 16:40 UTC (permalink / raw)
To: Sascha Hauer
Cc: Srinivas Kandagatla, linux-arm-kernel, Rob Herring, Kumar Gala,
Mark Brown, Greg Kroah-Hartman, linux-api, linux-kernel,
devicetree, linux-arm-msm, arnd, sboyd
In-Reply-To: <20150325071006.GC4946@pengutronix.de>
On Wed, Mar 25, 2015 at 08:10:06AM +0100, Sascha Hauer wrote:
> On Tue, Mar 24, 2015 at 10:30:30PM +0000, Srinivas Kandagatla wrote:
> > This patch adds bindings for simple eeprom framework which allows eeprom
> > consumers to talk to eeprom providers to get access to eeprom cell data.
> >
> > Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
> > [Maxime Ripard: initial version of eeprom framework]
> > Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
> > ---
> > .../devicetree/bindings/eeprom/eeprom.txt | 70 ++++++++++++++++++++++
> > 1 file changed, 70 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/eeprom/eeprom.txt
> >
> > diff --git a/Documentation/devicetree/bindings/eeprom/eeprom.txt b/Documentation/devicetree/bindings/eeprom/eeprom.txt
> > new file mode 100644
> > index 0000000..8348d18
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/eeprom/eeprom.txt
> > @@ -0,0 +1,70 @@
> > += EEPROM Data Device Tree Bindings =
> > +
> > +This binding is intended to represent the location of hardware
> > +configuration data stored in EEPROMs.
> > +
> > +On a significant proportion of boards, the manufacturer has stored
> > +some data on an EEPROM-like device, for the OS to be able to retrieve
> > +these information and act upon it. Obviously, the OS has to know
> > +about where to retrieve these data from, and where they are stored on
> > +the storage device.
> > +
> > +This document is here to document this.
> > +
> > += Data providers =
> > +Contains bindings specific to provider drivers and data cells as children
> > +to this node.
> > +
> > += Data cells =
> > +These are the child nodes of the provider which contain data cell
> > +information like offset and size in eeprom provider.
> > +
> > +Required properties:
> > +reg: specifies the offset in byte within that storage device, and the length
> > + in bytes of the data we care about.
> > + There could be more than one offset-length pair in this property.
> > +
> > +Optional properties:
> > +As required by specific data parsers/interpreters.
> > +
> > +For example:
> > +
> > +	/* Provider */
> > +	qfprom: qfprom@00700000 {
> > +		compatible = "qcom,qfprom";
> > +		reg = <0x00700000 0x1000>;
> > +		...
> > +
> > +		/* Data cells */
> > +		tsens_calibration: calib@404 {
> > +			reg = <0x404 0x10>;
> > +		};
> > +
> > +		serial_number: sn {
> > +			reg = <0x104 0x4>, <0x204 0x4>, <0x30c 0x4>;
> > +
> > +		};
> > +		...
> > +	};
> > +
> > += Data consumers =
> > +Are device nodes which consume eeprom data cells.
> > +
> > +Required properties:
> > +
> > +eeproms: List of phandle and data cell the device might be interested in.
> > +
> > +Optional properties:
> > +
> > +eeprom-names: List of data cell name strings sorted in the same order
> > + as the eeproms property. Consumers drivers will use
> > + eeprom-names to differentiate between multiple cells,
> > + and hence being able to know what these cells are for.
> > +
> > +For example:
> > +
> > +	tsens {
> > +		...
> > +		eeproms = <&tsens_calibration>;
> > +		eeprom-names = "calibration";
> > +	};
>
> This is somewhat complicated. Also having 'eeprom' in the binding is not
> nice since it could be FRAM or something else. How about:
>
> 	tsens {
> 		calibration = <&tsens_calibration>;
> 	};
A similar property was suggested the first time we discussed it, and
it turned out eventually that the construct you commented about was
actually preferred.
I guess we can always change the property name to something more
generic though.
--
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
* Re: [PATCH] Add virtio gpu driver.
From: Michael S. Tsirkin @ 2015-03-25 17:09 UTC (permalink / raw)
To: Gerd Hoffmann
Cc: virtio-dev, Dave Airlie, Dave Airlie, David Airlie, Rusty Russell,
open list, open list:DRM DRIVERS, open list:VIRTIO CORE, NET...,
open list:ABI/API
In-Reply-To: <1427297836.23304.29.camel@nilsson.home.kraxel.org>
On Wed, Mar 25, 2015 at 04:37:16PM +0100, Gerd Hoffmann wrote:
> Hi,
>
> > BTW can we teach virtio-gpu to look for framebuffer using
> > virtio pci caps?
>
> The virtio-gpu driver doesn't matter much here, it doesn't use it
> anyway.
>
> > Or are there limitations such as only
> > using IO port BARs, or compatibility with
> > BIOS code etc that limit us to specific BARs anyway?
>
> Yes, vgabios code needs to know. Currently it has bar #2 for the vga
> framebuffer bar hardcoded. It's 16bit code. I don't feel like making
> the probing more complicated ...
>
> cheers,
> Gerd
OK - you are saying all VGA cards use bar #2 for this
functionality, so we are just following
established practice here?
--
MST
* Re: [PATCH v4 00/14] Add kdbus implementation
From: David Herrmann @ 2015-03-25 17:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Daniel Mack,
Djalal Harouni
In-Reply-To: <CALCETrXqYBeZuOWhm9mz_nt+aWPXHFwkQPEAfwBXzDxnAP7f+g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Hi
On Tue, Mar 24, 2015 at 12:24 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Mar 23, 2015 at 8:28 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Thu, Mar 19, 2015 at 4:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Thu, Mar 19, 2015 at 4:26 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> metadata handling is local to the connection that sends the message.
>>>> It does not affect the overall performance of other bus operations in
>>>> parallel.
>>>
>>> Sure it does if it writes to shared cachelines. Given that you're
>>> incrementing refcounts, I'm reasonably sure that you're touching lots
>>> of shared cachelines.
>>
>> Ok, sure, but it's still mostly local to the sending task. We take
>> locks and ref-counts on the task-struct and mm, which is for most
>> parts local to the CPU the task runs on. But this is inherent to
>> accessing this kind of data, which is the fundamental difference in
>> our views here, as seen below..
>
> You're also refcounting the struct cred
No?
We do ref-count the group-info, but that is actually redundant as we
just copy the IDs. We should drop this, since group-info of 'current'
can be accessed right away. I noted it down.
> and there's no good reason
> for that to be local. (It might be a bit more local than intended
> because of the absurd things that the key subsystem does to struct
> cred, but IMO users should turn that off or the kernel should fix it.)
>
> Even more globally, I think you're touching init_user_ns's refcount in
> most scenarios. That's about as global as it gets.
get_user_ns() in metadata.c is a workaround (as the comment there
explains). With better export-helpers for caps, we can simply drop it.
It's conditional on KDBUS_ATTACH_CAPS, anyway.
> (Also, is there an easy benchmark to see how much time it takes to
> send and receive metadata? I tried to get the kdbus test to do this,
> and I failed. I probably did it wrong.)
patch for out-of-tree kdbus:
https://gist.github.com/dvdhrm/3ac4339bf94fadc13b98
Update it to pass _KDBUS_ATTACH_ALL for both arguments of
kdbus_conn_update_attach_flags().
>>>> Furthermore, it's way faster than collecting the "same" data
>>>> via /proc, so I don't think it slows down the overall transaction at
>>>> all. If a receiver doesn't want metadata, it should not request it (by
>>>> setting the receiver-metadata-mask). If a sender doesn't like the
>>>> overhead, it should not send the metadata (by setting the
>>>> sender-metadata-mask). Only if both peers set the metadata mask, it
>>>> will be transmitted.
>>>
>>> But you're comparing to the wrong thing, IMO. Of course it's much
>>> faster than /proc hackery, but it's probably much slower to do the
>>> metadata operation once per message than to do it when you connect to
>>> the endpoint. (Gah! It's a "bus" that could easily have tons of
>>> users but a single "endpoint". I'm still not used to it.)
>>
>> Yes, of course your assumption is right if you compare against
>> per-connection caches, instead of per-message metadata. But we do
>> support _both_ use-cases, so we don't impose any policy.
>> We still believe "live"-metadata is a crucial feature of kdbus,
>> despite the known performance penalties.
[...]
> This is even more true if this feature
> is *inconsistent* with legacy userspace (i.e. userspace dbus).
Live metadata is already supported on UDS via SCM_CREDENTIALS, we just
extend it to other metadata items. It's not a new invention by us.
Debian code-search on SO_PASSCRED and SCM_CREDENTIALS gives lots of
results.
Netlink, as a major example of an existing bus API, already uses
SCM_CREDENTIALS as primary way to transmit metadata.
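
To make the UDS comparison concrete, here is a small sketch of per-message credentials on an AF_UNIX socket. This is plain socket code, not kdbus; the helper names enable_passcred and recv_with_creds are made up for illustration. With SO_PASSCRED set on the receiver, the kernel attaches the sender's pid/uid/gid to each message as an SCM_CREDENTIALS control message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ask the kernel to attach sender credentials to messages received on fd. */
int enable_passcred(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));
}

/*
 * Receive one message on `fd` and copy the credentials the kernel attached
 * to it into *out. Returns the number of payload bytes received, or -1 on
 * error or if the message carried no SCM_CREDENTIALS.
 */
ssize_t recv_with_creds(int fd, void *buf, size_t len, struct ucred *out)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct ucred))];
		struct cmsghdr align;	/* forces correct alignment */
	} ctrl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	ssize_t n = recvmsg(fd, &msg, 0);

	if (n < 0)
		return -1;
	for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
			memcpy(out, CMSG_DATA(c), sizeof(*out));
			return n;
		}
	}
	return -1;
}
```

The credentials are captured per message at send time, which is the property the "live" metadata argument relies on: a sender that changes its identity between two messages is seen with different credentials on each.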
> I could be wrong about the lack of use cases. If so, please enlighten me.
We have several dbus APIs that allow clients to register as a special
handler/controller/etc. (eg., see systemd-logind TakeControl()). The
API provider checks the privileges of a client on registration and
then just tracks the client ID. This way, the client can be privileged
when asking for special access, then drop privileges and still use the
interface. You cannot re-connect in between, as the API provider
tracks your bus ID. Without message-metadata, all your (other) calls
on this bus would always be treated as privileged. We *really* want to
avoid this.
Another example is logging, where we want exact data at the time a
message is logged. Otherwise, the data is useless. With
message-metadata, you can figure out the exact situation a process was
in when a specific message was logged. Furthermore, it is impossible
to read such data from /proc, as the process might already be dead.
Which is a _real_ problem right now!
Similarly, system monitoring wants message-metadata for the same
reasons. And it needs to be reliable, you don't want malicious
sandboxes to mess with your logs.
kdbus is a _bus_, not a p2p channel. Thus, a peer may talk to multiple
destinations, and it may want to look different to each of them. DBus
method-calls allow 'syscall'-ish behavior when calling into other
processes. We *want* to be able to drop privileges after doing process
setup. We want further bus-calls to no longer be treated privileged.
Furthermore, DBus was designed to allow peers to track other peers
(which is why it always had the NameOwnerChanged signal). This is an
essential feature, that simplifies access-management considerably, as
you can cache it together with the unique name of a peer. We only open
a single connection to a bus. glib, libdbus, efl, ell, qt, sd-bus, and
others use cached bus-connections that are shared by all code of a
single thread. Hence, the bus connection is kinda part of the process
itself, like stdin/stdout. Without message-metadata, it is impossible
to ever drop privileges on a bus, without losing all state.
Thanks
David
* Re: [PATCH v4 00/14] Add kdbus implementation
From: Andy Lutomirski @ 2015-03-25 18:12 UTC (permalink / raw)
To: David Herrmann
Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Daniel Mack,
Djalal Harouni
In-Reply-To: <CANq1E4QiErHp8Q6bzFLkK9=7eBZC8dvh+Xnrh9_D5DAAogyaZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, Mar 25, 2015 at 10:29 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi
>
> On Tue, Mar 24, 2015 at 12:24 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Mar 23, 2015 at 8:28 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> On Thu, Mar 19, 2015 at 4:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Thu, Mar 19, 2015 at 4:26 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> metadata handling is local to the connection that sends the message.
>>>>> It does not affect the overall performance of other bus operations in
>>>>> parallel.
>>>>
>>>> Sure it does if it writes to shared cachelines. Given that you're
>>>> incrementing refcounts, I'm reasonably sure that you're touching lots
>>>> of shared cachelines.
>>>
>>> Ok, sure, but it's still mostly local to the sending task. We take
>>> locks and ref-counts on the task-struct and mm, which is for most
>>> parts local to the CPU the task runs on. But this is inherent to
>>> accessing this kind of data, which is the fundamental difference in
>>> our views here, as seen below..
>>
>> You're also refcounting the struct cred
>
> No?
>
> We do ref-count the group-info, but that is actually redundant as we
> just copy the IDs. We should drop this, since group-info of 'current'
> can be accessed right away. I noted it down.
OK
>
>> and there's no good reason
>> for that to be local. (It might be a bit more local than intended
>> because of the absurd things that the key subsystem does to struct
>> cred, but IMO users should turn that off or the kernel should fix it.)
>>
>> Even more globally, I think you're touching init_user_ns's refcount in
>> most scenarios. That's about as global as it gets.
>
> get_user_ns() in metadata.c is a workaround (as the comment there
> explains). With better export-helpers for caps, we can simply drop it.
> It's conditional on KDBUS_ATTACH_CAPS, anyway.
Fair enough.
>
>> (Also, is there an easy benchmark to see how much time it takes to
>> send and receive metadata? I tried to get the kdbus test to do this,
>> and I failed. I probably did it wrong.)
>
> patch for out-of-tree kdbus:
> https://gist.github.com/dvdhrm/3ac4339bf94fadc13b98
>
> Update it to pass _KDBUS_ATTACH_ALL for both arguments of
> kdbus_conn_update_attach_flags().
>
>>>>> Furthermore, it's way faster than collecting the "same" data
>>>>> via /proc, so I don't think it slows down the overall transaction at
>>>>> all. If a receiver doesn't want metadata, it should not request it (by
>>>>> setting the receiver-metadata-mask). If a sender doesn't like the
>>>>> overhead, it should not send the metadata (by setting the
>>>>> sender-metadata-mask). Only if both peers set the metadata mask, it
>>>>> will be transmitted.
>>>>
>>>> But you're comparing to the wrong thing, IMO. Of course it's much
>>>> faster than /proc hackery, but it's probably much slower to do the
>>>> metadata operation once per message than to do it when you connect to
>>>> the endpoint. (Gah! It's a "bus" that could easily have tons of
>>>> users but a single "endpoint". I'm still not used to it.)
>>>
>>> Yes, of course your assumption is right if you compare against
>>> per-connection caches, instead of per-message metadata. But we do
>>> support _both_ use-cases, so we don't impose any policy.
>>> We still believe "live"-metadata is a crucial feature of kdbus,
>>> despite the known performance penalties.
> [...]
>> This is even more true if this feature
>> is *inconsistent* with legacy userspace (i.e. userspace dbus).
>
> Live metadata is already supported on UDS via SCM_CREDENTIALS, we just
> extend it to other metadata items. It's not a new invention by us.
> Debian code-search on SO_PASSCRED and SCM_CREDENTIALS gives lots of
> results.
>
> Netlink, as a major example of an existing bus API, already uses
> SCM_CREDENTIALS as primary way to transmit metadata.
>
>> I could be wrong about the lack of use cases. If so, please enlighten me.
>
> We have several dbus APIs that allow clients to register as a special
> handler/controller/etc. (eg., see systemd-logind TakeControl()). The
> API provider checks the privileges of a client on registration and
> then just tracks the client ID. This way, the client can be privileged
> when asking for special access, then drop privileges and still use the
> interface. You cannot re-connect in between, as the API provider
> tracks your bus ID. Without message-metadata, all your (other) calls
> on this bus would always be treated as privileged. We *really* want to
> avoid this.
Connect twice?
You *already* have to reconnect or connect twice because you have
per-connection metadata. That's part of my problem with this scheme
-- you support *both styles*, which seems like it'll give you most of
the downsides of both without the upsides.
>
> Another example is logging, where we want exact data at the time a
> message is logged. Otherwise, the data is useless.
Why?
No, really, why is exact data at the time of logging so important? It
sounds nice, but I really don't see it.
> With
> message-metadata, you can figure out the exact situation a process was
> in when a specific message was logged. Furthermore, it is impossible
> to read such data from /proc, as the process might already be dead.
> Which is a _real_ problem right now!
> Similarly, system monitoring wants message-metadata for the same
> reasons. And it needs to be reliable, you don't want malicious
> sandboxes to mess with your logs.
Huh? A "malicious sandbox" can always impersonate itself, whether by
connecting and handing off a connection or simply by relaying messages.
>
> kdbus is a _bus_, not a p2p channel. Thus, a peer may talk to multiple
> destinations, and it may want to look different to each of them. DBus
> method-calls allow 'syscall'-ish behavior when calling into other
> processes. We *want* to be able to drop privileges after doing process
> setup. We want further bus-calls to no longer be treated privileged.
You could have an IOCTL that re-captures your connection metadata.
>
> Furthermore, DBus was designed to allow peers to track other peers
> (which is why it always had the NameOwnerChanged signal). This is an
> essential feature, that simplifies access-management considerably, as
> you can cache it together with the unique name of a peer. We only open
> a single connection to a bus. glib, libdbus, efl, ell, qt, sd-bus, and
> others use cached bus-connections that are shared by all code of a
> single thread. Hence, the bus connection is kinda part of the process
> itself, like stdin/stdout. Without message-metadata, it is impossible
> to ever drop privileges on a bus, without losing all state.
See above about an IOCTL that re-captures your connection metadata.
Again, you seem to be arguing that per-connection metadata is bad, but
you still have an implementation of per-connection metadata, so you
still have all these problems.
I'm actually okay with per-message metadata in principle, but I'd like
to see evidence (with numbers, please) that a send+recv of per-message
metadata is *not* significantly slower than a recv of already-captured
per-connection metadata. If this is in fact the case, then maybe you
should trash per-connection metadata instead and the legacy
compatibility code can figure out a way to deal with it. IMO that
would be a pretty nice outcome, since you would never have to worry
whether your connection to the bus is inadvertently privileged.
(Also, FWIW, it seems like what you really want is a capability model,
in which you grab a handle to some service and that handle captures
all your privileges wrt that service. Per-message metadata is even
farther from this than per-connection-to-the-bus metadata, but neither
one is particularly close.)
>
> Thanks
> David
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply
* [PATCH v11 tip 0/9] tracing: attach eBPF programs to kprobes
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
Hi Ingo,
Patch 1 is already in net-next. Patch 3 depends on it.
I'm assuming it's not going to be a problem during merge window.
Patch 3 will have a minor conflict in uapi/linux/bpf.h in linux-next,
since net-next has added new lines to the bpf_prog_type and bpf_func_id enums.
I'm assuming it's not a problem either.
V10->V11:
- added Masami's Reviewed-by to main patch 3. Thanks Masami!
- fixed sz>0 in samples
- reworded a few comments and fixed typos
- rebased
V9->V10:
- prettified formatting of struct initializers in kernel
- added Masami's Reviewed-by. Thanks Masami!
V8->V9:
- fixed comment style and allowed ispunct after %p
- added Steven's Reviewed-by. Thanks Steven!
V7->V8:
- split addition of kprobe flag into separate patch
- switched to __this_cpu_inc in now documented trace_call_bpf()
- converted array into standalone bpf_func_proto and switch statement
(this approach looks cleanest, especially considering patch 5)
- refactored patch 5 bpf_trace_printk to do strict checking
V6->V7:
- rebase and remove confusing _notrace suffix from preempt_disable/enable
everything else unchanged
V5->V6:
- added simple recursion check to trace_call_bpf()
- added tracex4 example that does kmem_cache_alloc/free tracking.
It remembers every allocated object in a map and user space periodically
prints a set of old objects. With more work it can be made into a
simple kmemleak detector.
It was used as a test of recursive kmalloc/kfree: attached to
kprobe/__kmalloc and let the program call kmalloc again.
V4->V5:
- switched to ktime_get_mono_fast_ns() as suggested by Peter
- in libbpf.c fixed zero init of 'union bpf_attr' padding
- fresh rebase on tip/master
V3 discussion:
https://lkml.org/lkml/2015/2/9/738
V3->V4:
- since the boundary of stable ABI in bpf+tracepoints is not clear yet,
I've dropped them for now.
- bpf+syscalls are ok from stable ABI point of view, but bpf+seccomp
would want to do very similar analysis of syscalls, so I've dropped
them as well to take time and define common bpf+syscalls and bpf+seccomp
infra in the future.
- so only bpf+kprobes are left. kprobes by definition are not a stable ABI,
so bpf+kprobe is not a stable ABI either. To stress that point, I added a
kernel version attribute that user space must pass along with the program;
the kernel will reject programs when the version code doesn't match.
So bpf+kprobe is very similar to kernel modules, but unlike modules
version check is not used for safety, but for enforcing 'non-ABI-ness'.
(version check doesn't apply to bpf+sockets which are stable)
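The version check can be modelled in user space. This is only a sketch: it assumes the usual KERNEL_VERSION() encoding from <linux/version.h> (major << 16, minor << 8, patch), and check_kern_version() is a hypothetical stand-in for the comparison done at program load, not the kernel function itself.

```c
#include <errno.h>

/* Same encoding as the kernel's KERNEL_VERSION() macro. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Hypothetical model of the load-time check: a kprobe program is
 * rejected unless the version supplied by user space matches the
 * running kernel exactly. Other program types skip the check.
 */
static int check_kern_version(unsigned int running, unsigned int supplied,
			      int is_kprobe_prog)
{
	if (is_kprobe_prog && supplied != running)
		return -EINVAL;
	return 0;
}
```

The comparison is an exact match, not "greater or equal": the point is to force a rebuild for every kernel version, not to guarantee compatibility.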
Programs are attached to kprobe events via API:
prog_fd = bpf_prog_load(...);
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id, /* ID of just created kprobe event */
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
Next step is to prototype TCP stack instrumentation (like web10g) using
bpf+kprobe, but without adding any new code to the TCP stack.
Though kprobes are slow compared to tracepoints, they are good enough
for prototyping and trace_marker/debug_tracepoint ideas can accelerate
them in the future.
Alexei Starovoitov (8):
tracing: add kprobe flag
tracing: attach BPF programs to kprobes
tracing: allow BPF programs to call bpf_ktime_get_ns()
tracing: allow BPF programs to call bpf_trace_printk()
samples: bpf: simple non-portable kprobe filter example
samples: bpf: counting example for kfree_skb and write syscall
samples: bpf: IO latency analysis (iosnoop/heatmap)
samples: bpf: kmem_alloc/free tracker
Daniel Borkmann (1):
bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs
include/linux/bpf.h | 20 +++-
include/linux/ftrace_event.h | 14 +++
include/uapi/linux/bpf.h | 5 +
include/uapi/linux/perf_event.h | 1 +
kernel/bpf/syscall.c | 7 +-
kernel/events/core.c | 59 +++++++++++
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 222 +++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_kprobe.c | 10 +-
samples/bpf/Makefile | 16 +++
samples/bpf/bpf_helpers.h | 6 ++
samples/bpf/bpf_load.c | 125 ++++++++++++++++++++--
samples/bpf/bpf_load.h | 3 +
samples/bpf/libbpf.c | 14 ++-
samples/bpf/libbpf.h | 5 +-
samples/bpf/sock_example.c | 2 +-
samples/bpf/test_verifier.c | 2 +-
samples/bpf/tracex1_kern.c | 50 +++++++++
samples/bpf/tracex1_user.c | 25 +++++
samples/bpf/tracex2_kern.c | 86 +++++++++++++++
samples/bpf/tracex2_user.c | 95 +++++++++++++++++
samples/bpf/tracex3_kern.c | 89 ++++++++++++++++
samples/bpf/tracex3_user.c | 150 ++++++++++++++++++++++++++
samples/bpf/tracex4_kern.c | 54 ++++++++++
samples/bpf/tracex4_user.c | 69 ++++++++++++
25 files changed, 1112 insertions(+), 18 deletions(-)
create mode 100644 kernel/trace/bpf_trace.c
create mode 100644 samples/bpf/tracex1_kern.c
create mode 100644 samples/bpf/tracex1_user.c
create mode 100644 samples/bpf/tracex2_kern.c
create mode 100644 samples/bpf/tracex2_user.c
create mode 100644 samples/bpf/tracex3_kern.c
create mode 100644 samples/bpf/tracex3_user.c
create mode 100644 samples/bpf/tracex4_kern.c
create mode 100644 samples/bpf/tracex4_user.c
--
1.7.9.5
^ permalink raw reply
* [PATCH v11 tip 1/9] bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
From: Daniel Borkmann <daniel@iogearbox.net>
Socket filter code and other subsystems with upcoming eBPF support should
not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or
not.
Having the bpf syscall as a config option is a nice thing and I'd expect
it to stay that way for expert users (I presume one day the default setting
of it might change, though), but code making use of it should not care if
it's actually enabled or not.
Instead, hide this via header files and let the rest deal with it.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
---
include/linux/bpf.h | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbfceb756452..c2e21113ecc0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -113,8 +113,6 @@ struct bpf_prog_type_list {
enum bpf_prog_type type;
};
-void bpf_register_prog_type(struct bpf_prog_type_list *tl);
-
struct bpf_prog;
struct bpf_prog_aux {
@@ -129,11 +127,25 @@ struct bpf_prog_aux {
};
#ifdef CONFIG_BPF_SYSCALL
+void bpf_register_prog_type(struct bpf_prog_type_list *tl);
+
void bpf_prog_put(struct bpf_prog *prog);
+struct bpf_prog *bpf_prog_get(u32 ufd);
#else
-static inline void bpf_prog_put(struct bpf_prog *prog) {}
+static inline void bpf_register_prog_type(struct bpf_prog_type_list *tl)
+{
+}
+
+static inline struct bpf_prog *bpf_prog_get(u32 ufd)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void bpf_prog_put(struct bpf_prog *prog)
+{
+}
#endif
-struct bpf_prog *bpf_prog_get(u32 ufd);
+
/* verify correctness of eBPF program */
int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 2/9] tracing: add kprobe flag
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of tracepoints,
since bpf programs can only be attached to kprobe type of
PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
---
include/linux/ftrace_event.h | 3 +++
kernel/trace/trace_kprobe.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index c674ee8f7fca..77325e1a1816 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -252,6 +252,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED_BIT,
TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
TRACE_EVENT_FL_TRACEPOINT_BIT,
+ TRACE_EVENT_FL_KPROBE_BIT,
};
/*
@@ -265,6 +266,7 @@ enum {
* it is best to clear the buffers that used it).
* USE_CALL_FILTER - For ftrace internal events, don't use file filter
* TRACEPOINT - Event is a tracepoint
+ * KPROBE - Event is a kprobe
*/
enum {
TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
@@ -274,6 +276,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
TRACE_EVENT_FL_USE_CALL_FILTER = (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+ TRACE_EVENT_FL_KPROBE = (1 << TRACE_EVENT_FL_KPROBE_BIT),
};
struct ftrace_event_call {
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d73f565b4e06..8fa549f6f528 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1286,7 +1286,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
kfree(call->print_fmt);
return -ENODEV;
}
- call->flags = 0;
+ call->flags = TRACE_EVENT_FL_KPROBE;
call->class->reg = kprobe_register;
call->data = tk;
ret = trace_add_event_call(call);
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 3/9] tracing: attach BPF programs to kprobes
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
User interface:
struct perf_event_attr attr = {.type = PERF_TYPE_TRACEPOINT, .config = event_id, ...};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
prog_fd is a file descriptor associated with BPF program previously loaded.
event_id is an ID of created kprobe
close(event_fd) - automatically detaches BPF program from it
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wrapper of probe_kernel_read() used to access any kernel
data structures
BPF programs receive 'struct pt_regs *' as an input
('struct pt_regs' is architecture dependent)
and return 0 to ignore the event and 1 to store kprobe event into ring buffer.
Note, kprobes are _not_ a stable kernel ABI, so bpf programs attached to
kprobes must be recompiled for every kernel version and user must supply correct
LINUX_VERSION_CODE in attr.kern_version during bpf_prog_load() call.
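
The return convention and the recursion guard can be modelled in user space. This is a rough sketch, not the kernel code: struct fake_regs, prog_fn, call_prog() and handle_event() are hypothetical names, a plain int stands in for the per-cpu counter, and the in_nmi() and preempt_disable() handling is left out.

```c
/* Stand-in for the architecture-dependent struct pt_regs. */
struct fake_regs {
	unsigned long arg0;
};

/* Stand-in for a loaded BPF program: takes a context, returns 0 or 1. */
typedef unsigned int (*prog_fn)(void *ctx);

/* Models the per-cpu "a program is already running" counter. */
static int prog_active;

/* Result of the nested event seen by reentrant(); -1 means "not run". */
static int nested_stored = -1;

/* Run the program unless another one is already active on this "cpu". */
static unsigned int call_prog(prog_fn prog, void *ctx)
{
	unsigned int ret;

	if (++prog_active != 1) {
		ret = 0;	/* nested invocation: drop the event */
		goto out;
	}
	ret = prog(ctx);
out:
	prog_active--;
	return ret;
}

/* Returns 1 if the event would be stored, 0 if it is filtered out. */
static int handle_event(prog_fn prog, struct fake_regs *regs)
{
	if (prog && !call_prog(prog, regs))
		return 0;
	return 1;
}

/* Example filter: keep only events whose first argument is large. */
static unsigned int only_large(void *ctx)
{
	struct fake_regs *regs = ctx;

	return regs->arg0 > 100;
}

/* Simulates a probe firing while a program is already running. */
static unsigned int reentrant(void *ctx)
{
	nested_stored = handle_event(only_large, ctx);
	return 1;
}
```

With no program attached, handle_event() always stores the event. A nested event is dropped even if its own filter would have accepted it.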
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
---
include/linux/ftrace_event.h | 11 ++++
include/uapi/linux/bpf.h | 3 +
include/uapi/linux/perf_event.h | 1 +
kernel/bpf/syscall.c | 7 ++-
kernel/events/core.c | 59 ++++++++++++++++++
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 130 +++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_kprobe.c | 8 +++
8 files changed, 219 insertions(+), 1 deletion(-)
create mode 100644 kernel/trace/bpf_trace.c
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 77325e1a1816..0aa535bc9f05 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -13,6 +13,7 @@ struct trace_array;
struct trace_buffer;
struct tracer;
struct dentry;
+struct bpf_prog;
struct trace_print_flags {
unsigned long mask;
@@ -306,6 +307,7 @@ struct ftrace_event_call {
#ifdef CONFIG_PERF_EVENTS
int perf_refcount;
struct hlist_head __percpu *perf_events;
+ struct bpf_prog *prog;
int (*perf_perm)(struct ftrace_event_call *,
struct perf_event *);
@@ -551,6 +553,15 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
event_triggers_post_call(file, tt);
}
+#ifdef CONFIG_BPF_SYSCALL
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx);
+#else
+static inline unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+ return 1;
+}
+#endif
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 45da7ec7d274..b2948feeb70b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -118,6 +118,7 @@ enum bpf_map_type {
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
BPF_PROG_TYPE_SOCKET_FILTER,
+ BPF_PROG_TYPE_KPROBE,
};
/* flags for BPF_MAP_UPDATE_ELEM command */
@@ -151,6 +152,7 @@ union bpf_attr {
__u32 log_level; /* verbosity level of verifier */
__u32 log_size; /* size of user buffer */
__aligned_u64 log_buf; /* user supplied buffer */
+ __u32 kern_version; /* checked when prog_type=kprobe */
};
} __attribute__((aligned(8)));
@@ -162,6 +164,7 @@ enum bpf_func_id {
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
+ BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
__BPF_FUNC_MAX_ID,
};
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1e3cd07cf76e..f5a841a263bf 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -381,6 +381,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 536edc2be307..504c10b990ef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -16,6 +16,7 @@
#include <linux/file.h>
#include <linux/license.h>
#include <linux/filter.h>
+#include <linux/version.h>
static LIST_HEAD(bpf_map_types);
@@ -467,7 +468,7 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
}
/* last field in 'union bpf_attr' used by this command */
-#define BPF_PROG_LOAD_LAST_FIELD log_buf
+#define BPF_PROG_LOAD_LAST_FIELD kern_version
static int bpf_prog_load(union bpf_attr *attr)
{
@@ -492,6 +493,10 @@ static int bpf_prog_load(union bpf_attr *attr)
if (attr->insn_cnt >= BPF_MAXINSNS)
return -EINVAL;
+ if (type == BPF_PROG_TYPE_KPROBE &&
+ attr->kern_version != LINUX_VERSION_CODE)
+ return -EINVAL;
+
/* plain bpf_prog allocation */
prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
if (!prog)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b01dfb602db1..f039ea438b41 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
#include <linux/module.h>
#include <linux/mman.h>
#include <linux/compat.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
#include "internal.h"
@@ -3402,6 +3404,7 @@ errout:
}
static void perf_event_free_filter(struct perf_event *event);
+static void perf_event_free_bpf_prog(struct perf_event *event);
static void free_event_rcu(struct rcu_head *head)
{
@@ -3411,6 +3414,7 @@ static void free_event_rcu(struct rcu_head *head)
if (event->ns)
put_pid_ns(event->ns);
perf_event_free_filter(event);
+ perf_event_free_bpf_prog(event);
kfree(event);
}
@@ -3923,6 +3927,7 @@ static inline int perf_fget_light(int fd, struct fd *p)
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd);
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
@@ -3976,6 +3981,9 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_SET_BPF:
+ return perf_event_set_bpf_prog(event, arg);
+
default:
return -ENOTTY;
}
@@ -6446,6 +6454,49 @@ static void perf_event_free_filter(struct perf_event *event)
ftrace_profile_free_filter(event);
}
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
+{
+ struct bpf_prog *prog;
+
+ if (event->attr.type != PERF_TYPE_TRACEPOINT)
+ return -EINVAL;
+
+ if (event->tp_event->prog)
+ return -EEXIST;
+
+ if (!(event->tp_event->flags & TRACE_EVENT_FL_KPROBE))
+ /* bpf programs can only be attached to kprobes */
+ return -EINVAL;
+
+ prog = bpf_prog_get(prog_fd);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
+ /* valid fd, but invalid bpf program type */
+ bpf_prog_put(prog);
+ return -EINVAL;
+ }
+
+ event->tp_event->prog = prog;
+
+ return 0;
+}
+
+static void perf_event_free_bpf_prog(struct perf_event *event)
+{
+ struct bpf_prog *prog;
+
+ if (!event->tp_event)
+ return;
+
+ prog = event->tp_event->prog;
+ if (prog) {
+ event->tp_event->prog = NULL;
+ bpf_prog_put(prog);
+ }
+}
+
#else
static inline void perf_tp_register(void)
@@ -6461,6 +6512,14 @@ static void perf_event_free_filter(struct perf_event *event)
{
}
+static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
+{
+ return -ENOENT;
+}
+
+static void perf_event_free_bpf_prog(struct perf_event *event)
+{
+}
#endif /* CONFIG_EVENT_TRACING */
#ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 98f26588255e..c575a300103b 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
endif
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_trace.o
obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_PM),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..f1e87da91da3
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,130 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include "trace.h"
+
+static DEFINE_PER_CPU(int, bpf_prog_active);
+
+/**
+ * trace_call_bpf - invoke BPF program
+ * @prog: BPF program
+ * @ctx: opaque context pointer
+ *
+ * kprobe handlers execute BPF programs via this helper.
+ * Can be used from static tracepoints in the future.
+ *
+ * Return: BPF programs always return an integer which is interpreted by
+ * kprobe handler as:
+ * 0 - return from kprobe (event is filtered out)
+ * 1 - store kprobe event into ring buffer
+ * Other values are reserved and currently alias to 1
+ */
+unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+{
+ unsigned int ret;
+
+ if (in_nmi()) /* not supported yet */
+ return 1;
+
+ preempt_disable();
+
+ if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
+ /*
+ * since some bpf program is already running on this cpu,
+ * don't call into another bpf program (same or different)
+ * and don't send kprobe event into ring-buffer,
+ * so return zero here
+ */
+ ret = 0;
+ goto out;
+ }
+
+ rcu_read_lock();
+ ret = BPF_PROG_RUN(prog, ctx);
+ rcu_read_unlock();
+
+ out:
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(trace_call_bpf);
+
+static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *dst = (void *) (long) r1;
+ int size = (int) r2;
+ void *unsafe_ptr = (void *) (long) r3;
+
+ return probe_kernel_read(dst, unsafe_ptr, size);
+}
+
+static const struct bpf_func_proto bpf_probe_read_proto = {
+ .func = bpf_probe_read,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+ .arg3_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_map_lookup_elem:
+ return &bpf_map_lookup_elem_proto;
+ case BPF_FUNC_map_update_elem:
+ return &bpf_map_update_elem_proto;
+ case BPF_FUNC_map_delete_elem:
+ return &bpf_map_delete_elem_proto;
+ case BPF_FUNC_probe_read:
+ return &bpf_probe_read_proto;
+ default:
+ return NULL;
+ }
+}
+
+/* bpf+kprobe programs can access fields of 'struct pt_regs' */
+static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ /* check bounds */
+ if (off < 0 || off >= sizeof(struct pt_regs))
+ return false;
+
+ /* only read is allowed */
+ if (type != BPF_READ)
+ return false;
+
+ /* disallow misaligned access */
+ if (off % size != 0)
+ return false;
+
+ return true;
+}
+
+static struct bpf_verifier_ops kprobe_prog_ops = {
+ .get_func_proto = kprobe_prog_func_proto,
+ .is_valid_access = kprobe_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list kprobe_tl = {
+ .ops = &kprobe_prog_ops,
+ .type = BPF_PROG_TYPE_KPROBE,
+};
+
+static int __init register_kprobe_prog_ops(void)
+{
+ bpf_register_prog_type(&kprobe_tl);
+ return 0;
+}
+late_initcall(register_kprobe_prog_ops);
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 8fa549f6f528..dc3462507d7c 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1134,11 +1134,15 @@ static void
kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
struct ftrace_event_call *call = &tk->tp.call;
+ struct bpf_prog *prog = call->prog;
struct kprobe_trace_entry_head *entry;
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;
@@ -1165,11 +1169,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
struct pt_regs *regs)
{
struct ftrace_event_call *call = &tk->tp.call;
+ struct bpf_prog *prog = call->prog;
struct kretprobe_trace_entry_head *entry;
struct hlist_head *head;
int size, __size, dsize;
int rctx;
+ if (prog && !trace_call_bpf(prog, regs))
+ return;
+
head = this_cpu_ptr(call->perf_events);
if (hlist_empty(head))
return;
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 4/9] tracing: allow BPF programs to call bpf_ktime_get_ns()
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
bpf_ktime_get_ns() is used by programs to compute time delta between events
or as a timestamp
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
---
include/uapi/linux/bpf.h | 1 +
kernel/trace/bpf_trace.c | 14 ++++++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b2948feeb70b..238c6883877b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -165,6 +165,7 @@ enum bpf_func_id {
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
+ BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f1e87da91da3..8f5787294971 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -78,6 +78,18 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type = ARG_ANYTHING,
};
+static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ /* NMI safe access to clock monotonic */
+ return ktime_get_mono_fast_ns();
+}
+
+static const struct bpf_func_proto bpf_ktime_get_ns_proto = {
+ .func = bpf_ktime_get_ns,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+};
+
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -89,6 +101,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
+ case BPF_FUNC_ktime_get_ns:
+ return &bpf_ktime_get_ns_proto;
default:
return NULL;
}
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 5/9] tracing: allow BPF programs to call bpf_trace_printk()
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
Debugging of BPF programs needs some form of printk from the program,
so let programs call limited trace_printk() with %d %u %x %p conversion
specifiers only (d, u and x may carry an l or ll length modifier).
Similar to kernel modules, during program load verifier checks whether program
is calling bpf_trace_printk() and if so, kernel allocates trace_printk buffers
and emits big 'this is debug only' banner.
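
The format-string rules can be tried out in user space. This is a sketch that mirrors the checking loop from this patch: check_fmt() is a hypothetical name, it returns the number of accepted specifiers instead of calling __trace_printk(), and the explicit fmt_size == 0 test is added here because there is no verifier to guarantee fmt_size > 0.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>

/*
 * Returns the number of conversion specifiers found (0..3),
 * or -EINVAL if the format string is not acceptable.
 * fmt_size counts the terminating NUL, as sizeof("...") does.
 */
static int check_fmt(const char *fmt, size_t fmt_size)
{
	int fmt_cnt = 0;
	size_t i;

	/* the last byte must be the NUL terminator */
	if (fmt_size == 0 || fmt[--fmt_size] != 0)
		return -EINVAL;

	for (i = 0; i < fmt_size; i++) {
		unsigned char c = (unsigned char)fmt[i];

		/* printable or whitespace ASCII only */
		if ((!isprint(c) && !isspace(c)) || c > 127)
			return -EINVAL;

		if (fmt[i] != '%')
			continue;

		/* at most three arguments are available */
		if (fmt_cnt >= 3)
			return -EINVAL;

		/* fmt[i] != 0 and fmt[fmt_size] == 0, so fmt[i + 1] is readable */
		i++;
		if (fmt[i] == 'l') {
			i++;
		} else if (fmt[i] == 'p') {
			i++;
			/* %p must be followed by space, punctuation or the end */
			if (!isspace((unsigned char)fmt[i]) &&
			    !ispunct((unsigned char)fmt[i]) && fmt[i] != 0)
				return -EINVAL;
			fmt_cnt++;
			continue;
		}

		if (fmt[i] == 'l')
			i++;

		if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
			return -EINVAL;
		fmt_cnt++;
	}

	return fmt_cnt;
}
```

For instance, check_fmt("pid %d\n", sizeof("pid %d\n")) is accepted. A "%s" specifier, a fourth specifier, or a buffer without a terminating NUL is rejected.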
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
---
include/uapi/linux/bpf.h | 1 +
kernel/trace/bpf_trace.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 238c6883877b..cc47ef41076a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -166,6 +166,7 @@ enum bpf_func_id {
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
+ BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 8f5787294971..2d56ce501632 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -10,6 +10,7 @@
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/uaccess.h>
+#include <linux/ctype.h>
#include "trace.h"
static DEFINE_PER_CPU(int, bpf_prog_active);
@@ -90,6 +91,74 @@ static const struct bpf_func_proto bpf_ktime_get_ns_proto = {
.ret_type = RET_INTEGER,
};
+/*
+ * limited trace_printk()
+ * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed
+ */
+static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+ char *fmt = (char *) (long) r1;
+ int mod[3] = {};
+ int fmt_cnt = 0;
+ int i;
+
+ /*
+ * bpf_check()->check_func_arg()->check_stack_boundary()
+ * guarantees that fmt points to bpf program stack,
+ * fmt_size bytes of it were initialized and fmt_size > 0
+ */
+ if (fmt[--fmt_size] != 0)
+ return -EINVAL;
+
+ /* check format string for allowed specifiers */
+ for (i = 0; i < fmt_size; i++) {
+ if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
+ return -EINVAL;
+
+ if (fmt[i] != '%')
+ continue;
+
+ if (fmt_cnt >= 3)
+ return -EINVAL;
+
+ /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+ i++;
+ if (fmt[i] == 'l') {
+ mod[fmt_cnt]++;
+ i++;
+ } else if (fmt[i] == 'p') {
+ mod[fmt_cnt]++;
+ i++;
+ if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0)
+ return -EINVAL;
+ fmt_cnt++;
+ continue;
+ }
+
+ if (fmt[i] == 'l') {
+ mod[fmt_cnt]++;
+ i++;
+ }
+
+ if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
+ return -EINVAL;
+ fmt_cnt++;
+ }
+
+ return __trace_printk(1/* fake ip will not be printed */, fmt,
+ mod[0] == 2 ? r3 : mod[0] == 1 ? (long) r3 : (u32) r3,
+ mod[1] == 2 ? r4 : mod[1] == 1 ? (long) r4 : (u32) r4,
+ mod[2] == 2 ? r5 : mod[2] == 1 ? (long) r5 : (u32) r5);
+}
+
+static const struct bpf_func_proto bpf_trace_printk_proto = {
+ .func = bpf_trace_printk,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+};
+
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -103,6 +172,15 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_probe_read_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
+
+ case BPF_FUNC_trace_printk:
+ /*
+ * this program might be calling bpf_trace_printk,
+ * so allocate per-cpu printk buffers
+ */
+ trace_printk_init_buffers();
+
+ return &bpf_trace_printk_proto;
default:
return NULL;
}
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 6/9] samples: bpf: simple non-portable kprobe filter example
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
tracex1_kern.c - C program compiled into BPF.
It attaches to kprobe:netif_receive_skb
When skb->dev->name == "lo", it prints sample debug message into trace_pipe
via bpf_trace_printk() helper function.
tracex1_user.c - corresponding user space component that:
- loads bpf program via bpf() syscall
- opens kprobes:netif_receive_skb event via perf_event_open() syscall
- attaches the program to event via ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
- prints from trace_pipe
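The loader in this patch derives everything it needs for the kprobe from the
ELF section name. A rough sketch of that string handling is below; the helper
name kprobe_strings() and the TRACEFS macro are hypothetical, and the debugfs
mount point is assumed to be /sys/kernel/debug. The definition string is what
gets appended to kprobe_events, and the id read from the resulting path is
what goes into perf_event_attr.config.

```c
#include <stdio.h>
#include <string.h>

#define TRACEFS "/sys/kernel/debug/tracing/"	/* assumed mount point */

/* Hypothetical helper: from a section name such as "kprobe/sys_write" build
 *  - def:     the line to append to kprobe_events, e.g. "p:sys_write sys_write"
 *  - id_path: the file holding the event id for perf_event_open()
 * Returns 0 on success, -1 on an unknown section name or a short buffer.
 */
static int kprobe_strings(const char *sec, char *def, size_t def_sz,
			  char *id_path, size_t path_sz)
{
	const char *func;
	char type;
	int n;

	if (strncmp(sec, "kprobe/", 7) == 0) {
		type = 'p';
		func = sec + 7;
	} else if (strncmp(sec, "kretprobe/", 10) == 0) {
		type = 'r';
		func = sec + 10;
	} else {
		return -1;
	}
	if (*func == '\0')
		return -1;

	n = snprintf(def, def_sz, "%c:%s %s", type, func, func);
	if (n < 0 || (size_t)n >= def_sz)
		return -1;

	n = snprintf(id_path, path_sz, TRACEFS "events/kprobes/%s/id", func);
	if (n < 0 || (size_t)n >= path_sz)
		return -1;

	return 0;
}
```

Checking the snprintf() return value keeps a long function name from being
silently truncated into a different probe name.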
Note, this bpf program is non-portable. It must be recompiled with current
kernel headers. kprobe is not a stable ABI and bpf+kprobe scripts
may no longer be meaningful when kernel internals change.
No matter in what way the kernel changes, neither the kprobe, nor the bpf
program can ever crash or corrupt the kernel, assuming the kprobes, perf and
bpf subsystems have no bugs.
The verifier will detect that the program is using bpf_trace_printk() and
the kernel will print 'this is a DEBUG kernel' warning banner, which means that
bpf_trace_printk() should be used for debugging of the bpf program only.
Usage:
$ sudo tracex1
ping-19826 [000] d.s2 63103.382648: : skb ffff880466b1ca00 len 84
ping-19826 [000] d.s2 63103.382684: : skb ffff880466b1d300 len 84
ping-19826 [000] d.s2 63104.382533: : skb ffff880466b1ca00 len 84
ping-19826 [000] d.s2 63104.382594: : skb ffff880466b1d300 len 84
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
samples/bpf/Makefile | 4 ++
samples/bpf/bpf_helpers.h | 6 +++
samples/bpf/bpf_load.c | 125 ++++++++++++++++++++++++++++++++++++++++---
samples/bpf/bpf_load.h | 3 ++
samples/bpf/libbpf.c | 14 ++++-
samples/bpf/libbpf.h | 5 +-
samples/bpf/sock_example.c | 2 +-
samples/bpf/test_verifier.c | 2 +-
samples/bpf/tracex1_kern.c | 50 +++++++++++++++++
samples/bpf/tracex1_user.c | 25 +++++++++
10 files changed, 224 insertions(+), 12 deletions(-)
create mode 100644 samples/bpf/tracex1_kern.c
create mode 100644 samples/bpf/tracex1_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b5b3600dcdf5..51f6f01e5a3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -6,23 +6,27 @@ hostprogs-y := test_verifier test_maps
hostprogs-y += sock_example
hostprogs-y += sockex1
hostprogs-y += sockex2
+hostprogs-y += tracex1
test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
sock_example-objs := sock_example.o libbpf.o
sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
+tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
always += sockex1_kern.o
always += sockex2_kern.o
+always += tracex1_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
+HOSTLOADLIBES_tracex1 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index ca0333146006..1c872bcf5a80 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -15,6 +15,12 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
(void *) BPF_FUNC_map_update_elem;
static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
+ (void *) BPF_FUNC_probe_read;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+ (void *) BPF_FUNC_ktime_get_ns;
+static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
+ (void *) BPF_FUNC_trace_printk;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 1831d236382b..38dac5a53b51 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,29 +8,70 @@
#include <unistd.h>
#include <string.h>
#include <stdbool.h>
+#include <stdlib.h>
#include <linux/bpf.h>
#include <linux/filter.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
#include "libbpf.h"
#include "bpf_helpers.h"
#include "bpf_load.h"
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
static char license[128];
+static int kern_version;
static bool processed_sec[128];
int map_fd[MAX_MAPS];
int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
int prog_cnt;
static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
{
- int fd;
bool is_socket = strncmp(event, "socket", 6) == 0;
-
- if (!is_socket)
- /* tracing events tbd */
+ bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
+ bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+ enum bpf_prog_type prog_type;
+ char buf[256];
+ int fd, efd, err, id;
+ struct perf_event_attr attr = {};
+
+ attr.type = PERF_TYPE_TRACEPOINT;
+ attr.sample_type = PERF_SAMPLE_RAW;
+ attr.sample_period = 1;
+ attr.wakeup_events = 1;
+
+ if (is_socket) {
+ prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+ } else if (is_kprobe || is_kretprobe) {
+ prog_type = BPF_PROG_TYPE_KPROBE;
+ } else {
+ printf("Unknown event '%s'\n", event);
return -1;
+ }
+
+ if (is_kprobe || is_kretprobe) {
+ if (is_kprobe)
+ event += 7;
+ else
+ event += 10;
+
+ snprintf(buf, sizeof(buf),
+ "echo '%c:%s %s' >> /sys/kernel/debug/tracing/kprobe_events",
+ is_kprobe ? 'p' : 'r', event, event);
+ err = system(buf);
+ if (err < 0) {
+ printf("failed to create kprobe '%s' error '%s'\n",
+ event, strerror(errno));
+ return -1;
+ }
+ }
- fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
- prog, size, license);
+ fd = bpf_prog_load(prog_type, prog, size, license, kern_version);
if (fd < 0) {
printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
@@ -39,6 +80,41 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_fd[prog_cnt++] = fd;
+ if (is_socket)
+ return 0;
+
+ strcpy(buf, DEBUGFS);
+ strcat(buf, "events/kprobes/");
+ strcat(buf, event);
+ strcat(buf, "/id");
+
+ efd = open(buf, O_RDONLY, 0);
+ if (efd < 0) {
+ printf("failed to open event %s\n", event);
+ return -1;
+ }
+
+ err = read(efd, buf, sizeof(buf));
+ if (err < 0 || err >= sizeof(buf)) {
+ printf("read from '%s' failed '%s'\n", event, strerror(errno));
+ return -1;
+ }
+
+ close(efd);
+
+ buf[err] = 0;
+ id = atoi(buf);
+ attr.config = id;
+
+ efd = perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+ if (efd < 0) {
+ printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+ return -1;
+ }
+ event_fd[prog_cnt - 1] = efd;
+ ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
+ ioctl(efd, PERF_EVENT_IOC_SET_BPF, fd);
+
return 0;
}
@@ -135,6 +211,9 @@ int load_bpf_file(char *path)
if (gelf_getehdr(elf, &ehdr) != &ehdr)
return 1;
+ /* clear all kprobes */
+ i = system("echo \"\" > /sys/kernel/debug/tracing/kprobe_events");
+
/* scan over all elf sections to get license and map info */
for (i = 1; i < ehdr.e_shnum; i++) {
@@ -149,6 +228,14 @@ int load_bpf_file(char *path)
if (strcmp(shname, "license") == 0) {
processed_sec[i] = true;
memcpy(license, data->d_buf, data->d_size);
+ } else if (strcmp(shname, "version") == 0) {
+ processed_sec[i] = true;
+ if (data->d_size != sizeof(int)) {
+ printf("invalid size of version section %zd\n",
+ data->d_size);
+ return 1;
+ }
+ memcpy(&kern_version, data->d_buf, sizeof(int));
} else if (strcmp(shname, "maps") == 0) {
processed_sec[i] = true;
if (load_maps(data->d_buf, data->d_size))
@@ -178,7 +265,8 @@ int load_bpf_file(char *path)
if (parse_relo_and_apply(data, symbols, &shdr, insns))
continue;
- if (memcmp(shname_prog, "events/", 7) == 0 ||
+ if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
+ memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -193,7 +281,8 @@ int load_bpf_file(char *path)
if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
continue;
- if (memcmp(shname, "events/", 7) == 0 ||
+ if (memcmp(shname, "kprobe/", 7) == 0 ||
+ memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "socket", 6) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
@@ -201,3 +290,23 @@ int load_bpf_file(char *path)
close(fd);
return 0;
}
+
+void read_trace_pipe(void)
+{
+ int trace_fd;
+
+ trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+ if (trace_fd < 0)
+ return;
+
+ while (1) {
+ static char buf[4096];
+ ssize_t sz;
+
+ sz = read(trace_fd, buf, sizeof(buf));
+ if (sz > 0) {
+ buf[sz] = 0;
+ puts(buf);
+ }
+ }
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 27789a34f5e6..cbd7c2b532b9 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -6,6 +6,7 @@
extern int map_fd[MAX_MAPS];
extern int prog_fd[MAX_PROGS];
+extern int event_fd[MAX_PROGS];
/* parses elf file compiled by llvm .c->.o
* . parses 'maps' section and creates maps via BPF syscall
@@ -21,4 +22,6 @@ extern int prog_fd[MAX_PROGS];
*/
int load_bpf_file(char *path);
+void read_trace_pipe(void);
+
#endif
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 46d50b7ddf79..7e1efa7e2ed7 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -81,7 +81,7 @@ char bpf_log_buf[LOG_BUF_SIZE];
int bpf_prog_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int prog_len,
- const char *license)
+ const char *license, int kern_version)
{
union bpf_attr attr = {
.prog_type = prog_type,
@@ -93,6 +93,11 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
.log_level = 1,
};
+ /* assign one field outside of struct init to make sure any
+ * padding is zero initialized
+ */
+ attr.kern_version = kern_version;
+
bpf_log_buf[0] = 0;
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
@@ -121,3 +126,10 @@ int open_raw_sock(const char *name)
return sock;
}
+
+int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
+ int group_fd, unsigned long flags)
+{
+ return syscall(__NR_perf_event_open, attr, pid, cpu,
+ group_fd, flags);
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 58c5fe1bdba1..ac7b09672b46 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -13,7 +13,7 @@ int bpf_get_next_key(int fd, void *key, void *next_key);
int bpf_prog_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int insn_len,
- const char *license);
+ const char *license, int kern_version);
#define LOG_BUF_SIZE 65536
extern char bpf_log_buf[LOG_BUF_SIZE];
@@ -182,4 +182,7 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
/* create RAW socket and bind to interface 'name' */
int open_raw_sock(const char *name);
+struct perf_event_attr;
+int perf_event_open(struct perf_event_attr *attr, int pid, int cpu,
+ int group_fd, unsigned long flags);
#endif
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index c8ad0404416f..a0ce251c5390 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -56,7 +56,7 @@ static int test_sock(void)
};
prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
- "GPL");
+ "GPL", 0);
if (prog_fd < 0) {
printf("failed to load prog '%s'\n", strerror(errno));
goto cleanup;
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index b96175e90363..740ce97cda5e 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -689,7 +689,7 @@ static int test(void)
prog_fd = bpf_prog_load(BPF_PROG_TYPE_UNSPEC, prog,
prog_len * sizeof(struct bpf_insn),
- "GPL");
+ "GPL", 0);
if (tests[i].result == ACCEPT) {
if (prog_fd < 0) {
diff --git a/samples/bpf/tracex1_kern.c b/samples/bpf/tracex1_kern.c
new file mode 100644
index 000000000000..31620463701a
--- /dev/null
+++ b/samples/bpf/tracex1_kern.c
@@ -0,0 +1,50 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+/* kprobe is NOT a stable ABI
+ * kernel functions can be removed, renamed or completely change semantics.
+ * Number of arguments and their positions can change, etc.
+ * In such case this bpf+kprobe example will no longer be meaningful
+ */
+SEC("kprobe/__netif_receive_skb_core")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ /* attaches to kprobe netif_receive_skb,
+ * looks for packets on loopback device and prints them
+ */
+ char devname[IFNAMSIZ] = {};
+ struct net_device *dev;
+ struct sk_buff *skb;
+ int len;
+
+ /* non-portable! works for the given kernel only */
+ skb = (struct sk_buff *) ctx->di;
+
+ dev = _(skb->dev);
+
+ len = _(skb->len);
+
+ bpf_probe_read(devname, sizeof(devname), dev->name);
+
+ if (devname[0] == 'l' && devname[1] == 'o') {
+ char fmt[] = "skb %p len %d\n";
+ /* using bpf_trace_printk() for DEBUG ONLY */
+ bpf_trace_printk(fmt, sizeof(fmt), skb, len);
+ }
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex1_user.c b/samples/bpf/tracex1_user.c
new file mode 100644
index 000000000000..31a48183beea
--- /dev/null
+++ b/samples/bpf/tracex1_user.c
@@ -0,0 +1,25 @@
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+ FILE *f;
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ f = popen("taskset 1 ping -c5 localhost", "r");
+ (void) f;
+
+ read_trace_pipe();
+
+ return 0;
+}
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 7/9] samples: bpf: counting example for kfree_skb and write syscall
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
this example has two probes in one C file that attach to different kprobe events
and use two different maps.
1st probe is x64 specific equivalent of dropmon. It attaches to kfree_skb,
retrieves 'ip' address of kfree_skb() caller and counts number of packet drops
at that 'ip' address. User space prints 'location - count' map every second.
2nd probe attaches to kprobe:sys_write and computes a histogram of different
write sizes
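The histogram slots are powers of two: a write of N bytes lands in slot
floor(log2(N)). The sketch below shows that mapping in plain user-space C.
The names log2_u64() and bucket_range() are hypothetical, and a simple loop
is used here instead of the branch-free log2 in the patch.

```c
#include <stdint.h>

/* floor(log2(v)); returns 0 for v == 0, like the log2() in the sample */
static unsigned int log2_u64(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Inclusive byte range covered by histogram slot idx (idx < 64):
 * slot 0 -> [1, 1], slot 1 -> [2, 3], slot 9 -> [512, 1023]
 */
static void bucket_range(unsigned int idx, uint64_t *lo, uint64_t *hi)
{
	*lo = 1ULL << idx;
	*hi = *lo + (*lo - 1);	/* 2^(idx+1) - 1 without overflowing */
}
```

dd uses 512-byte blocks by default, which is why nearly all writes in the
output below fall into the "512 -> 1023" row.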
Usage:
$ sudo tracex2
location 0xffffffff81695995 count 1
location 0xffffffff816d0da9 count 2
location 0xffffffff81695995 count 2
location 0xffffffff816d0da9 count 2
location 0xffffffff81695995 count 3
location 0xffffffff816d0da9 count 2
557145+0 records in
557145+0 records out
285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
syscall write() stats
byte_size : count distribution
1 -> 1 : 3 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 | |
32 -> 63 : 3 | |
64 -> 127 : 1 | |
128 -> 255 : 1 | |
256 -> 511 : 0 | |
512 -> 1023 : 1118968 |************************************* |
Ctrl-C at any time. Kernel will auto cleanup maps and programs
$ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
samples/bpf/Makefile | 4 ++
samples/bpf/tracex2_kern.c | 86 +++++++++++++++++++++++++++++++++++++++
samples/bpf/tracex2_user.c | 95 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 185 insertions(+)
create mode 100644 samples/bpf/tracex2_kern.c
create mode 100644 samples/bpf/tracex2_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 51f6f01e5a3a..6dd272143733 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -7,6 +7,7 @@ hostprogs-y += sock_example
hostprogs-y += sockex1
hostprogs-y += sockex2
hostprogs-y += tracex1
+hostprogs-y += tracex2
test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
@@ -14,12 +15,14 @@ sock_example-objs := sock_example.o libbpf.o
sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
+tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
always += sockex1_kern.o
always += sockex2_kern.o
always += tracex1_kern.o
+always += tracex2_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -27,6 +30,7 @@ HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
+HOSTLOADLIBES_tracex2 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
new file mode 100644
index 000000000000..19ec1cfc45db
--- /dev/null
+++ b/samples/bpf/tracex2_kern.c
@@ -0,0 +1,86 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(long),
+ .max_entries = 1024,
+};
+
+/* kprobe is NOT a stable ABI. If kernel internals change this bpf+kprobe
+ * example will no longer be meaningful
+ */
+SEC("kprobe/kfree_skb")
+int bpf_prog2(struct pt_regs *ctx)
+{
+ long loc = 0;
+ long init_val = 1;
+ long *value;
+
+ /* x64 specific: read ip of kfree_skb caller.
+ * non-portable version of __builtin_return_address(0)
+ */
+ bpf_probe_read(&loc, sizeof(loc), (void *)ctx->sp);
+
+ value = bpf_map_lookup_elem(&my_map, &loc);
+ if (value)
+ *value += 1;
+ else
+ bpf_map_update_elem(&my_map, &loc, &init_val, BPF_ANY);
+ return 0;
+}
+
+static unsigned int log2(unsigned int v)
+{
+ unsigned int r;
+ unsigned int shift;
+
+ r = (v > 0xFFFF) << 4; v >>= r;
+ shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+ shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+ shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+ r |= (v >> 1);
+ return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+ unsigned int hi = v >> 32;
+ if (hi)
+ return log2(hi) + 32;
+ else
+ return log2(v);
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(long),
+ .max_entries = 64,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog3(struct pt_regs *ctx)
+{
+ long write_size = ctx->dx; /* arg3 */
+ long init_val = 1;
+ long *value;
+ u32 index = log2l(write_size);
+
+ value = bpf_map_lookup_elem(&my_hist_map, &index);
+ if (value)
+ __sync_fetch_and_add(value, 1);
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c
new file mode 100644
index 000000000000..91b8d0896fbb
--- /dev/null
+++ b/samples/bpf/tracex2_user.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+ int key;
+ long value;
+ long data[MAX_INDEX] = {};
+ char starstr[MAX_STARS];
+ int i;
+ int max_ind = -1;
+ long max_value = 0;
+
+ for (key = 0; key < MAX_INDEX; key++) {
+ bpf_lookup_elem(fd, &key, &value);
+ data[key] = value;
+ if (value && key > max_ind)
+ max_ind = key;
+ if (value > max_value)
+ max_value = value;
+ }
+
+ printf(" syscall write() stats\n");
+ printf(" byte_size : count distribution\n");
+ for (i = 1; i <= max_ind + 1; i++) {
+ stars(starstr, data[i - 1], max_value, MAX_STARS);
+ printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+ (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+ MAX_STARS, starstr);
+ }
+}
+static void int_exit(int sig)
+{
+ print_hist(map_fd[1]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ long key, next_key, value;
+ FILE *f;
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ signal(SIGINT, int_exit);
+
+ /* start 'ping' in the background to have some kfree_skb events */
+ f = popen("ping -c5 localhost", "r");
+ (void) f;
+
+ /* start 'dd' in the background to have plenty of 'write' syscalls */
+ f = popen("dd if=/dev/zero of=/dev/null count=5000000", "r");
+ (void) f;
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ for (i = 0; i < 5; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd[0], &next_key, &value);
+ printf("location 0x%lx count %ld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+ print_hist(map_fd[1]);
+
+ return 0;
+}
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 8/9] samples: bpf: IO latency analysis (iosnoop/heatmap)
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
BPF C program attaches to blk_mq_start_request/blk_update_request kprobe events
to calculate IO latency.
For every completed block IO event it computes the time delta in nsec
and records in a histogram map: map[log10(delta)*10]++
User space reads this histogram map every 2 seconds and prints it as a 'heatmap'
using gray shades of text terminal. Black spaces have many events and white
spaces have very few events. Left most space is the smallest latency, right most
space is the largest latency in the range.
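The slot index is an integer approximation of log10(delta)*10, computed as
roughly 3*log2(delta) with linear interpolation between powers of two. A
user-space sketch of that arithmetic is below; latency_slot() is a
hypothetical name, and the early clamp for large values and the handling of
delta == 0 are assumptions added here to avoid overflow and undefined shifts.

```c
#include <stdint.h>

#define SLOTS 100

/* Map a latency in nanoseconds to a histogram slot, approximately
 * log10(delta_ns) * 10, using integer arithmetic only.
 */
static unsigned int latency_slot(uint64_t delta_ns)
{
	uint64_t base, index;
	unsigned int l = 0;

	if (delta_ns == 0)
		return 0;

	/* l = floor(log2(delta_ns)) */
	while ((delta_ns >> l) > 1)
		l++;

	/* 3 * 34 > 99, so anything this large is in the last slot;
	 * this also keeps the multiplication below from overflowing
	 */
	if (l >= 34)
		return SLOTS - 1;

	base = 1ULL << l;
	/* log2 in 1/64 steps (linear between powers of two), times 3 */
	index = (l * 64ULL + (delta_ns - base) * 64 / base) * 3 / 64;

	return index >= SLOTS ? SLOTS - 1 : (unsigned int)index;
}
```

With these formulas 1 usec maps to slot 29, 1 msec to 59, 1 sec to 89 and
10 sec or more to 99, so each decade spans about 10 slots.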
Usage:
$ sudo ./tracex3
and do 'sudo dd if=/dev/sda of=/dev/null' in other terminal.
Observe IO latencies and how different activity (like 'make kernel') affects them.
Similar experiments can be done for network transmit latencies, syscalls, etc.
'-t' flag prints the heatmap using normal ascii characters:
$ sudo ./tracex3 -t
heatmap of IO latency
# - many events with this latency
- few events
|1us |10us |100us |1ms |10ms |100ms |1s |10s
*ooo. *O.#. # 221
. *# . # 125
.. .o#*.. # 55
. . . . .#O # 37
.# # 175
.#*. # 37
# # 199
. . *#*. # 55
*#..* # 42
# # 266
...***Oo#*OO**o#* . # 629
# # 271
. .#o* o.*o* # 221
. . o* *#O.. # 50
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
samples/bpf/Makefile | 4 ++
samples/bpf/tracex3_kern.c | 89 ++++++++++++++++++++++++++
samples/bpf/tracex3_user.c | 150 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 243 insertions(+)
create mode 100644 samples/bpf/tracex3_kern.c
create mode 100644 samples/bpf/tracex3_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 6dd272143733..dcd850546d52 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -8,6 +8,7 @@ hostprogs-y += sockex1
hostprogs-y += sockex2
hostprogs-y += tracex1
hostprogs-y += tracex2
+hostprogs-y += tracex3
test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
@@ -16,6 +17,7 @@ sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
+tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -23,6 +25,7 @@ always += sockex1_kern.o
always += sockex2_kern.o
always += tracex1_kern.o
always += tracex2_kern.o
+always += tracex3_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -31,6 +34,7 @@ HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
+HOSTLOADLIBES_tracex3 += -lelf
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex3_kern.c b/samples/bpf/tracex3_kern.c
new file mode 100644
index 000000000000..255ff2792366
--- /dev/null
+++ b/samples/bpf/tracex3_kern.c
@@ -0,0 +1,89 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(u64),
+ .max_entries = 4096,
+};
+
+/* kprobe is NOT a stable ABI. If kernel internals change this bpf+kprobe
+ * example will no longer be meaningful
+ */
+SEC("kprobe/blk_mq_start_request")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ long rq = ctx->di;
+ u64 val = bpf_ktime_get_ns();
+
+ bpf_map_update_elem(&my_map, &rq, &val, BPF_ANY);
+ return 0;
+}
+
+static unsigned int log2l(unsigned long long n)
+{
+#define S(k) if (n >= (1ull << k)) { i += k; n >>= k; }
+ int i = -(n == 0);
+ S(32); S(16); S(8); S(4); S(2); S(1);
+ return i;
+#undef S
+}
+
+#define SLOTS 100
+
+struct bpf_map_def SEC("maps") lat_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(u64),
+ .max_entries = SLOTS,
+};
+
+SEC("kprobe/blk_update_request")
+int bpf_prog2(struct pt_regs *ctx)
+{
+ long rq = ctx->di;
+ u64 *value, l, base;
+ u32 index;
+
+ value = bpf_map_lookup_elem(&my_map, &rq);
+ if (!value)
+ return 0;
+
+ u64 cur_time = bpf_ktime_get_ns();
+ u64 delta = cur_time - *value;
+
+ bpf_map_delete_elem(&my_map, &rq);
+
+ /* the lines below are computing index = log10(delta)*10
+ * using integer arithmetic
+ * index = 29 ~ 1 usec
+ * index = 59 ~ 1 msec
+ * index = 89 ~ 1 sec
+ * index = 99 ~ 10sec or more
+ * log10(x)*10 = log2(x)*10/log2(10) = log2(x)*3
+ */
+ l = log2l(delta);
+ base = 1ll << l;
+ index = (l * 64 + (delta - base) * 64 / base) * 3 / 64;
+
+ if (index >= SLOTS)
+ index = SLOTS - 1;
+
+ value = bpf_map_lookup_elem(&lat_map, &index);
+ if (value)
+ __sync_fetch_and_add((long *)value, 1);
+
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex3_user.c b/samples/bpf/tracex3_user.c
new file mode 100644
index 000000000000..0aaa933ab938
--- /dev/null
+++ b/samples/bpf/tracex3_user.c
@@ -0,0 +1,150 @@
+/* Copyright (c) 2013-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+#define SLOTS 100
+
+static void clear_stats(int fd)
+{
+ __u32 key;
+ __u64 value = 0;
+
+ for (key = 0; key < SLOTS; key++)
+ bpf_update_elem(fd, &key, &value, BPF_ANY);
+}
+
+const char *color[] = {
+ "\033[48;5;255m",
+ "\033[48;5;252m",
+ "\033[48;5;250m",
+ "\033[48;5;248m",
+ "\033[48;5;246m",
+ "\033[48;5;244m",
+ "\033[48;5;242m",
+ "\033[48;5;240m",
+ "\033[48;5;238m",
+ "\033[48;5;236m",
+ "\033[48;5;234m",
+ "\033[48;5;232m",
+};
+const int num_colors = ARRAY_SIZE(color);
+
+const char nocolor[] = "\033[00m";
+
+const char *sym[] = {
+ " ",
+ " ",
+ ".",
+ ".",
+ "*",
+ "*",
+ "o",
+ "o",
+ "O",
+ "O",
+ "#",
+ "#",
+};
+
+bool full_range = false;
+bool text_only = false;
+
+static void print_banner(void)
+{
+ if (full_range)
+ printf("|1ns |10ns |100ns |1us |10us |100us"
+ " |1ms |10ms |100ms |1s |10s\n");
+ else
+ printf("|1us |10us |100us |1ms |10ms "
+ "|100ms |1s |10s\n");
+}
+
+static void print_hist(int fd)
+{
+ __u32 key;
+ __u64 value;
+ __u64 cnt[SLOTS];
+ __u64 max_cnt = 0;
+ __u64 total_events = 0;
+
+ for (key = 0; key < SLOTS; key++) {
+ value = 0;
+ bpf_lookup_elem(fd, &key, &value);
+ cnt[key] = value;
+ total_events += value;
+ if (value > max_cnt)
+ max_cnt = value;
+ }
+ clear_stats(fd);
+ for (key = full_range ? 0 : 29; key < SLOTS; key++) {
+ int c = num_colors * cnt[key] / (max_cnt + 1);
+
+ if (text_only)
+ printf("%s", sym[c]);
+ else
+ printf("%s %s", color[c], nocolor);
+ }
+ printf(" # %lld\n", total_events);
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ for (i = 1; i < ac; i++) {
+ if (strcmp(argv[i], "-a") == 0) {
+ full_range = true;
+ } else if (strcmp(argv[i], "-t") == 0) {
+ text_only = true;
+ } else if (strcmp(argv[i], "-h") == 0) {
+ printf("Usage:\n"
+ " -a display wider latency range\n"
+ " -t text only\n");
+ return 1;
+ }
+ }
+
+ printf(" heatmap of IO latency\n");
+ if (text_only)
+ printf(" %s", sym[num_colors - 1]);
+ else
+ printf(" %s %s", color[num_colors - 1], nocolor);
+ printf(" - many events with this latency\n");
+
+ if (text_only)
+ printf(" %s", sym[0]);
+ else
+ printf(" %s %s", color[0], nocolor);
+ printf(" - few events\n");
+
+ for (i = 0; ; i++) {
+ if (i % 20 == 0)
+ print_banner();
+ print_hist(map_fd[1]);
+ sleep(2);
+ }
+
+ return 0;
+}
--
1.7.9.5
^ permalink raw reply related
* [PATCH v11 tip 9/9] samples: bpf: kmem_alloc/free tracker
From: Alexei Starovoitov @ 2015-03-25 19:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Peter Zijlstra, linux-api, netdev, linux-kernel
In-Reply-To: <1427312966-8434-1-git-send-email-ast@plumgrid.com>
One bpf program attaches to kmem_cache_alloc_node() and records every allocated
object in the map.
Another program attaches to kmem_cache_free() and deletes the corresponding
object from the map.
User space walks the map every second and prints any objects that are
older than 1 second.
Usage:
$ sudo tracex4
Then start a few long-lived processes. tracex4 will print:
obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff880465848000 is 8sec old was allocated at ip ffffffff8105dc32
obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32
$ addr2line -fispe vmlinux ffffffff8105dc32
do_fork at fork.c:1665
As soon as the processes exit, the memory is reclaimed and tracex4 prints nothing.
A similar experiment can be done with the __kmalloc/kfree pair.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
samples/bpf/Makefile | 4 +++
samples/bpf/tracex4_kern.c | 54 ++++++++++++++++++++++++++++++++++
samples/bpf/tracex4_user.c | 69 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 127 insertions(+)
create mode 100644 samples/bpf/tracex4_kern.c
create mode 100644 samples/bpf/tracex4_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index dcd850546d52..fe98fb226e6e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -9,6 +9,7 @@ hostprogs-y += sockex2
hostprogs-y += tracex1
hostprogs-y += tracex2
hostprogs-y += tracex3
+hostprogs-y += tracex4
test_verifier-objs := test_verifier.o libbpf.o
test_maps-objs := test_maps.o libbpf.o
@@ -18,6 +19,7 @@ sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
+tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -26,6 +28,7 @@ always += sockex2_kern.o
always += tracex1_kern.o
always += tracex2_kern.o
always += tracex3_kern.o
+always += tracex4_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
@@ -35,6 +38,7 @@ HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
+HOSTLOADLIBES_tracex4 += -lelf -lrt
# point this to your LLVM backend with bpf support
LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex4_kern.c b/samples/bpf/tracex4_kern.c
new file mode 100644
index 000000000000..126b80512228
--- /dev/null
+++ b/samples/bpf/tracex4_kern.c
@@ -0,0 +1,54 @@
+/* Copyright (c) 2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct pair {
+ u64 val;
+ u64 ip;
+};
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(struct pair),
+ .max_entries = 1000000,
+};
+
+/* kprobe is NOT a stable ABI. If kernel internals change this bpf+kprobe
+ * example will no longer be meaningful
+ */
+SEC("kprobe/kmem_cache_free")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ long ptr = ctx->si;
+
+ bpf_map_delete_elem(&my_map, &ptr);
+ return 0;
+}
+
+SEC("kretprobe/kmem_cache_alloc_node")
+int bpf_prog2(struct pt_regs *ctx)
+{
+ long ptr = ctx->ax;
+ long ip = 0;
+
+ /* get ip address of kmem_cache_alloc_node() caller */
+ bpf_probe_read(&ip, sizeof(ip), (void *)(ctx->bp + sizeof(ip)));
+
+ struct pair v = {
+ .val = bpf_ktime_get_ns(),
+ .ip = ip,
+ };
+
+ bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex4_user.c b/samples/bpf/tracex4_user.c
new file mode 100644
index 000000000000..bc4a3bdea6ed
--- /dev/null
+++ b/samples/bpf/tracex4_user.c
@@ -0,0 +1,69 @@
+/* Copyright (c) 2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <string.h>
+#include <time.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+struct pair {
+ long long val;
+ __u64 ip;
+};
+
+static __u64 time_get_ns(void)
+{
+ struct timespec ts;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts);
+ return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static void print_old_objects(int fd)
+{
+ long long val = time_get_ns();
+ __u64 key, next_key;
+ struct pair v;
+
+ key = write(1, "\e[1;1H\e[2J", 10); /* clear screen */
+
+ key = -1;
+ while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd[0], &next_key, &v);
+ key = next_key;
+ if (val - v.val < 1000000000ll)
+ /* object was allocated less than 1 sec ago */
+ continue;
+ printf("obj 0x%llx is %2lldsec old was allocated at ip %llx\n",
+ next_key, (val - v.val) / 1000000000ll, v.ip);
+ }
+}
+
+int main(int ac, char **argv)
+{
+ char filename[256];
+ int i;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ for (i = 0; ; i++) {
+ print_old_objects(map_fd[0]);
+ sleep(1);
+ }
+
+ return 0;
+}
--
1.7.9.5
^ permalink raw reply related
* Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
From: Daniel Micay @ 2015-03-25 20:49 UTC (permalink / raw)
To: Vlastimil Babka, Aliaksey Kandratsenka
Cc: Andrew Morton, Shaohua Li, linux-mm, linux-api, Rik van Riel,
Hugh Dickins, Mel Gorman, Johannes Weiner, Michal Hocko,
Andy Lutomirski, google-perftools@googlegroups.com
In-Reply-To: <5512E0C0.6060406@suse.cz>
[-- Attachment #1: Type: text/plain, Size: 3719 bytes --]
On 25/03/15 12:22 PM, Vlastimil Babka wrote:
>
> I'm not sure I get your description right. The problem I know about is
> where "purging" means madvise(MADV_DONTNEED) and khugepaged later
> collapses a new hugepage that will repopulate the purged parts,
> increasing the memory usage. One can limit this via
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none . That
> setting doesn't affect the page fault THP allocations, which however
> happen only in newly accessed hugepage-sized areas and not partially
> purged ones, though.
Since jemalloc doesn't unmap memory but instead does recycling itself in
userspace, it ends up with large spans of free virtual memory and gets
*lots* of huge pages from the page fault heuristic. It keeps track of
active vs. dirty (not purged) vs. clean (purged / untouched) ranges
everywhere, and will purge dirty ranges as they build up.
The THP allocation on page faults means it ends up with memory that's
supposed to be clean but is really not.
A worst case example with the (up until recently) default chunk size of
4M is allocating a bunch of 2.1M allocations. Chunks are naturally
aligned, so each one can be represented as 2 huge pages. It increases
memory usage by nearly *50%*. The allocator thinks the tail is clean
memory, but it's not. When the allocations are freed, it will purge the
2.1M at the head (once enough dirty memory builds up) but all of the
tail memory will be leaked until something else is allocated there and
then freed.
>> I think a THP implementation playing that played well with purging would
>> need to drop the page fault heuristic and rely on a significantly better
>> khugepaged.
>
> See here http://lwn.net/Articles/636162/ (the "Compaction" part)
>
> The objection is that some short-lived workloads like gcc have to map
> hugepages immediately if they are to benefit from them. I still plan to
> improve khugepaged and allow admins to say that they don't want THP page
> faults (and rely solely on khugepaged which has more information to
> judge additional memory usage), but I'm not sure if it would be an
> acceptable default behavior.
> One workaround in the current state for jemalloc and friends could be to
> use madvise(MADV_NOHUGEPAGE) on hugepage-sized/aligned areas where it
> wants to purge parts of them via madvise(MADV_DONTNEED). It could mean
> overhead of another syscall and tracking of where this was applied and
> when it makes sense to undo this and allow THP to be collapsed again,
> though, and it would also split vma's.
Huge pages do significantly help performance though, and this would
pretty much mean no huge pages. The overhead of toggling it on and off
based on whether it's a < chunk size allocation or a >= chunk size one
is too high.
The page fault heuristic is just way too aggressive because there's no
indication of how much memory will be used. I don't think it makes sense
to do it without an explicit MADV_NOHUGEPAGE. Collapsing only dense
ranges doesn't have the same risk.
>> This would mean faulting in a span of memory would no longer
>> be faster. Having a flag to populate a range with madvise would help a
>
> If it's a newly mapped memory, there's mmap(MAP_POPULATE). There is also
> a madvise(MADV_WILLNEED), which sounds like what you want, but I don't
> know what the implementation does exactly - it was apparently added for
> paging in ahead, and maybe it ignores unpopulated anonymous areas, but
> it would probably be well in spirit of the flag to make it prepopulate
> those.
It doesn't seem to do anything for anon mappings atm but I do see a
patch from 2008 for that. I guess it never landed.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
From: Daniel Micay @ 2015-03-25 20:54 UTC (permalink / raw)
To: Vlastimil Babka, Aliaksey Kandratsenka
Cc: Andrew Morton, Shaohua Li, linux-mm, linux-api, Rik van Riel,
Hugh Dickins, Mel Gorman, Johannes Weiner, Michal Hocko,
Andy Lutomirski, google-perftools@googlegroups.com
In-Reply-To: <55131F70.7020503@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 304 bytes --]
> The page fault heuristic is just way too aggressive because there's no
> indication of how much memory will be used. I don't think it makes sense
> to do it without an explicit MADV_NOHUGEPAGE. Collapsing only dense
> ranges doesn't have the same risk.
Er, without an explicit MADV_HUGEPAGE*.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [PATCH 0/5] Enhancements to twl4030 phy to support better charging - V2
From: Kishon Vijay Abraham I @ 2015-03-25 21:16 UTC (permalink / raw)
To: NeilBrown
Cc: Tony Lindgren, linux-api-u79uwXL29TY76Z2rM5mHXA, GTA04 owners,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Pavel Machek
In-Reply-To: <20150322223307.21765.62974.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
Hi,
On Monday 23 March 2015 04:05 AM, NeilBrown wrote:
> Hi Kishon,
> I wonder if you could queue the following for the next merge window.
> They allow the twl4030 phy to provide more information to the
> twl4030 battery charger.
> There are only minimal changes since the first version, particularly
> documentation has been improved.
There are quite a few things in this series which use the USB PHY library
interface, which is kind of deprecated. We should try to use the Generic PHY
library for all of them. It would also be better to add features to the
PHY framework if we can't achieve something with the existing PHY
framework.
-Kishon
^ permalink raw reply
* Re: [PATCH 0/5] Enhancements to twl4030 phy to support better charging - V2
From: NeilBrown @ 2015-03-25 21:22 UTC (permalink / raw)
To: Kishon Vijay Abraham I
Cc: NeilBrown, Tony Lindgren, linux-api, GTA04 owners, linux-kernel,
Pavel Machek
In-Reply-To: <551325B0.1090308@ti.com>
[-- Attachment #1: Type: text/plain, Size: 1108 bytes --]
On Thu, 26 Mar 2015 02:46:32 +0530 Kishon Vijay Abraham I <kishon@ti.com>
wrote:
> Hi,
>
> On Monday 23 March 2015 04:05 AM, NeilBrown wrote:
> > Hi Kishon,
> > I wonder if you could queue the following for the next merge window.
> > They allow the twl4030 phy to provide more information to the
> > twl4030 battery charger.
> > There are only minimal changes since the first version, particularly
> > documentation has been improved.
>
> There are quite a few things in this series which use the USB PHY library
> interface which is kindof deprecated. We should try and use the Generic PHY
> library for all of them. It would also be better to add features to the
> PHY framework if the we can't achieve something with the existing PHY
> framework.
Hi,
are you able to be more specific at all? What is the "USB PHY library"?
Where exactly is the "PHY framework"?
I know none of the history here and while I could try to guess I suspect
there is an even chance I would get it wrong.
I'm happy to do the work but I want to be sure of what you are asking.
Thanks,
NeilBrown
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
^ permalink raw reply