linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2] kernel: add kcov code coverage
       [not found]   ` <CACT4Y+bHjfkrZ4jWZf+j6FKnj2GQv1t94Y9LJm4=WFyb8C0J8w@mail.gmail.com>
@ 2016-01-15 13:05     ` Andrey Ryabinin
  2016-01-15 13:42       ` Will Deacon
  2016-01-15 14:07       ` Dmitry Vyukov
  0 siblings, 2 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2016-01-15 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
> <ryabinin.a.a@gmail.com> wrote:
>> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>
>>> +       /* Read number of PCs collected. */
>>> +       n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
>>> +       /* PCs are shortened to uint32_t, so we need to restore the upper part. */
>>> +       for (i = 0; i < n; i++)
>>> +               printf("0xffffffff%08lx\n", (unsigned long)cover[i + 1]);
>
> Thanks for the review!
> Mailed v3 with fixes.
> Comments inline.
>
>> This works only for x86-64.
>> Probably there is no simple way to make this arch-independent with
>> 32-bit values.
>
> We probably could add an ioctl that returns base of the stripped PCs.

You forgot about modules. With stripped PCs you'll start mixing
kernel and module PCs (if the distance between a module and the kernel is > 4G).
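
A toy user-space illustration of the collision (all addresses below are
made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Made-up addresses: a kernel PC and a module PC more than 4G apart. */
	uint64_t pc_base   = 0xffffffff81000000ull;
	uint64_t kernel_pc = pc_base + 0x1234;
	uint64_t module_pc = pc_base + 0x200001234ull;

	/* Stripping the base and truncating to 32 bits makes the two
	 * distinct PCs indistinguishable. */
	printf("%08x %08x\n",
	       (uint32_t)(kernel_pc - pc_base),
	       (uint32_t)(module_pc - pc_base));	/* 00001234 00001234 */
	return 0;
}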

>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 61aa9bb..9e9e9f6 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1807,6 +1807,16 @@ struct task_struct {
>>>         /* bitmask and counter of trace recursion */
>>>         unsigned long trace_recursion;
>>>  #endif /* CONFIG_TRACING */
>>> +#ifdef CONFIG_KCOV
>>> +       /* Coverage collection mode enabled for this task (0 if disabled). */
>>> +       int             kcov_mode;
>>> +       /* Size of the kcov_area. */
>>> +       unsigned long   kcov_size;
>>
>> Could be just 'unsigned'
>
> Done
>
>>> +       /* Buffer for coverage collection. */
>>> +       void            *kcov_area;
>>
>> So, these fields above duplicate the same fields from the kcov struct.
>> Consider embedding the kcov struct (since it's relatively small) into task_struct.
>
> It would be strange to copy the spinlock and refcounter. Also, if
> additional fields are added to the kcov struct, most likely they won't
> need to be copied to task_struct. Also, I did not want to pull my
> headers into sched.h.
> How strongly do you feel about this? I would leave it as is.

Fine by me. It was just an idea to consider.


>>> diff --git a/kernel/kcov/kcov.c b/kernel/kcov/kcov.c
>>> +/* Entry point from instrumented code.
>>> + * This is called once per basic-block/edge.
>>> + */
>>> +void __sanitizer_cov_trace_pc(void)
>>> +{
>>> +       struct task_struct *t;
>>> +       enum kcov_mode mode;
>>> +
>>> +       t = current;
>>> +       /* We are interested in code coverage as a function of syscall inputs,
>>> +        * so we ignore code executed in interrupts.
>>> +        */
>>> +       if (!t || in_interrupt())
>>> +               return;
>>> +       mode = READ_ONCE(t->kcov_mode);
>>> +       if (mode == kcov_mode_trace) {
>>> +               u32 *area;
>>> +               u32 pos;
>>> +
>>> +               /* There is some code that runs in interrupts but for which
>>> +                * in_interrupt() returns false (e.g. preempt_schedule_irq()).
>>> +                * READ_ONCE()/barrier() effectively provides load-acquire wrt
>>> +                * interrupts; the paired barrier()/WRITE_ONCE() are in
>>> +                * kcov_ioctl_locked().
>>> +                */
>>> +               barrier();
>>> +               area = t->kcov_area;
>>> +               /* The first u32 is number of subsequent PCs. */
>>> +               pos = READ_ONCE(area[0]) + 1;
>>> +               if (likely(pos < t->kcov_size)) {
>>> +                       area[pos] = (u32)_RET_IP_;
>>> +                       WRITE_ONCE(area[0], pos);
>>
>> Note that this works only for cache-coherent architectures.
>> For incoherent arches you'll need to flush_dcache_page() somewhere.
>> Perhaps it could be done on exit to userspace, since flushing here is
>> certainly an overkill.
>
> I can't say that I understand the problem. Does it have to do with the
> fact that the buffer is shared between kernel and user-space?
> The current code is OK from the plain multi-threading side, as the
> user must not read the buffer concurrently with writing (that would
> not yield anything useful).

It's not about SMP.
This problem is about virtually indexed aliasing D-caches and can be
observed even on a uniprocessor system.
You have 3 virtual addresses (user-space, linear mapping and vmalloc)
mapped to the same physical page.
With an aliasing cache it's possible to have multiple cache lines
representing the same physical page.
So the kernel might not see an update made by userspace and vice
versa, because kernel and userspace use different virtual addresses.

And btw, flush_dcache_page() would be the wrong choice, since kcov_area
is a vmalloc address, not a linear address.
So we need something that flushes vmalloc addresses.

Alternatively, we could simply mlock that memory and talk to user space
via get/put_user(). No flush would be required.
And we would avoid another potential problem - lack of vmalloc address
space on 32-bit.
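
For concreteness, a completely untested sketch of what the put_user()
variant of __sanitizer_cov_trace_pc() could look like (assuming
kcov_area becomes a user pointer to mlocked memory, so the accesses
shouldn't fault):

	u32 __user *area = (u32 __user *)t->kcov_area;
	u32 pos;

	/* Untested sketch: area is a user address backed by mlocked pages. */
	if (get_user(pos, &area[0]))
		return;
	pos++;
	if (likely(pos < t->kcov_size)) {
		if (put_user((u32)_RET_IP_, &area[pos]))
			return;
		put_user(pos, &area[0]);
	}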

> We could add an ioctl that does the flush. But I would prefer if it is
> done when we port kcov to such an arch. Does arm64 require the flush?
>

I think it doesn't. AFAIK arm64 has a non-aliasing D-cache.

arm64/include/asm/cacheflush.h says:
       Please note that the implementation assumes non-aliasing VIPT D-cache

However, I wonder why it implements flush_dcache_page(). Per my
understanding it is not needed for non-aliasing caches.
And Documentation/cachetlb.txt agrees with me:
       void flush_dcache_page(struct page *page)
          If D-cache aliasing is not an issue, this routine may
          simply be defined as a nop on that architecture.

Catalin, Will, could you please shed light on this?


* [PATCH v2] kernel: add kcov code coverage
  2016-01-15 13:05     ` [PATCH v2] kernel: add kcov code coverage Andrey Ryabinin
@ 2016-01-15 13:42       ` Will Deacon
  2016-01-15 14:07       ` Dmitry Vyukov
  1 sibling, 0 replies; 11+ messages in thread
From: Will Deacon @ 2016-01-15 13:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 15, 2016 at 04:05:55PM +0300, Andrey Ryabinin wrote:
> 2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> > On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
> > <ryabinin.a.a@gmail.com> wrote:
> >> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> >>> diff --git a/kernel/kcov/kcov.c b/kernel/kcov/kcov.c
> >>> +/* Entry point from instrumented code.
> >>> + * This is called once per basic-block/edge.
> >>> + */
> >>> +void __sanitizer_cov_trace_pc(void)
> >>> +{
> >>> +       struct task_struct *t;
> >>> +       enum kcov_mode mode;
> >>> +
> >>> +       t = current;
> >>> +       /* We are interested in code coverage as a function of syscall inputs,
> >>> +        * so we ignore code executed in interrupts.
> >>> +        */
> >>> +       if (!t || in_interrupt())
> >>> +               return;
> >>> +       mode = READ_ONCE(t->kcov_mode);
> >>> +       if (mode == kcov_mode_trace) {
> >>> +               u32 *area;
> >>> +               u32 pos;
> >>> +
> >>> +               /* There is some code that runs in interrupts but for which
> >>> +                * in_interrupt() returns false (e.g. preempt_schedule_irq()).
> >>> +                * READ_ONCE()/barrier() effectively provides load-acquire wrt
> >>> +                * interrupts; the paired barrier()/WRITE_ONCE() are in
> >>> +                * kcov_ioctl_locked().
> >>> +                */
> >>> +               barrier();
> >>> +               area = t->kcov_area;
> >>> +               /* The first u32 is number of subsequent PCs. */
> >>> +               pos = READ_ONCE(area[0]) + 1;
> >>> +               if (likely(pos < t->kcov_size)) {
> >>> +                       area[pos] = (u32)_RET_IP_;
> >>> +                       WRITE_ONCE(area[0], pos);
> >>
> >> Note that this works only for cache-coherent architectures.
> >> For incoherent arches you'll need to flush_dcache_page() somewhere.
> >> Perhaps it could be done on exit to userspace, since flushing here is
> >> certainly an overkill.
> >
> > I can't say that I understand the problem. Does it have to do with the
> > fact that the buffer is shared between kernel and user-space?
> > The current code is OK from the plain multi-threading side, as the
> > user must not read the buffer concurrently with writing (that would
> > not yield anything useful).
> 
> It's not about SMP.
> This problem is about virtually indexed aliasing D-caches and can be
> observed even on a uniprocessor system.
> You have 3 virtual addresses (user-space, linear mapping and vmalloc)
> mapped to the same physical page.
> With an aliasing cache it's possible to have multiple cache lines
> representing the same physical page.
> So the kernel might not see an update made by userspace and vice
> versa, because kernel and userspace use different virtual addresses.
>
> And btw, flush_dcache_page() would be the wrong choice, since kcov_area
> is a vmalloc address, not a linear address.
> So we need something that flushes vmalloc addresses.
>
> Alternatively, we could simply mlock that memory and talk to user space
> via get/put_user(). No flush would be required.
> And we would avoid another potential problem - lack of vmalloc address
> space on 32-bit.
> 
> > We could add an ioctl that does the flush. But I would prefer if it is
> > done when we port kcov to such an arch. Does arm64 require the flush?
> >
> 
> I think it doesn't. AFAIK arm64 has a non-aliasing D-cache.
> 
> arm64/include/asm/cacheflush.h says:
>        Please note that the implementation assumes non-aliasing VIPT D-cache
> 
> However, I wonder why it implements flush_dcache_page(). Per my
> understanding it is not needed for non-aliasing caches.
> And Documentation/cachetlb.txt agrees with me:
>        void flush_dcache_page(struct page *page)
>           If D-cache aliasing is not an issue, this routine may
>           simply be defined as a nop on that architecture.
> 
> Catalin, Will, could you please shed light on this?

It's only there to keep the I-cache and D-cache in sync for executable
pages. That is, flush_dcache_page clears a flag (PG_dcache_clean) in the
page flags; the flag is then checked and set again, with the caches
synced, when we install an executable user pte.
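
Roughly this pattern (a simplified sketch, not the literal arm64 code;
sync_for_exec() is a made-up name):

/* Simplified sketch of the PG_dcache_clean protocol. */
void flush_dcache_page(struct page *page)
{
	/* Only mark the page as potentially dirty in the D-cache. */
	if (test_bit(PG_dcache_clean, &page->flags))
		clear_bit(PG_dcache_clean, &page->flags);
}

/* Called when an executable user pte is installed: */
static void sync_for_exec(struct page *page)
{
	/* Sync I/D caches only if the page was dirtied since the last sync. */
	if (!test_and_set_bit(PG_dcache_clean, &page->flags)) {
		unsigned long kaddr = (unsigned long)page_address(page);

		flush_icache_range(kaddr, kaddr + PAGE_SIZE);
	}
}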

Will


* [PATCH v2] kernel: add kcov code coverage
  2016-01-15 13:05     ` [PATCH v2] kernel: add kcov code coverage Andrey Ryabinin
  2016-01-15 13:42       ` Will Deacon
@ 2016-01-15 14:07       ` Dmitry Vyukov
  2016-01-18 13:34         ` Andrey Ryabinin
  2016-01-18 14:13         ` Mark Rutland
  1 sibling, 2 replies; 11+ messages in thread
From: Dmitry Vyukov @ 2016-01-15 14:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 15, 2016 at 2:05 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> 2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>> On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
>> <ryabinin.a.a@gmail.com> wrote:
>>> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>>
>>>> +       /* Read number of PCs collected. */
>>>> +       n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
>>>> +       /* PCs are shortened to uint32_t, so we need to restore the upper part. */
>>>> +       for (i = 0; i < n; i++)
>>>> +               printf("0xffffffff%08lx\n", (unsigned long)cover[i + 1]);
>>
>> Thanks for the review!
>> Mailed v3 with fixes.
>> Comments inline.
>>
>>> This works only for x86-64.
>>> Probably there is no simple way to make this arch-independent with
>>> 32-bit values.
>>
>> We probably could add an ioctl that returns base of the stripped PCs.
>
> You forgot about modules. With stripped PCs you'll start mixing
> kernel and module PCs (if the distance between a module and the kernel is > 4G).

It's just that on x86, kernel text and modules are within 4GB.

I've checked that on arm64 it also seems to be the case:

 * The module space lives between the addresses given by TASK_SIZE
 * and PAGE_OFFSET - it must be within 128MB of the kernel text.
 */
...
#define MODULES_END             (PAGE_OFFSET)
#define MODULES_VADDR           (MODULES_END - SZ_64M)

Again, we can store either u32s or u64s and expose this info in an ioctl...




>>> Note that this works only for cache-coherent architectures.
>>> For incoherent arches you'll need to flush_dcache_page() somewhere.
>>> Perhaps it could be done on exit to userspace, since flushing here is
>>> certainly an overkill.
>>
>> I can't say that I understand the problem. Does it have to do with the
>> fact that the buffer is shared between kernel and user-space?
>> The current code is OK from the plain multi-threading side, as the
>> user must not read the buffer concurrently with writing (that would
>> not yield anything useful).
>
> It's not about SMP.
> This problem is about virtually indexed aliasing D-caches and can be
> observed even on a uniprocessor system.
> You have 3 virtual addresses (user-space, linear mapping and vmalloc)
> mapped to the same physical page.
> With an aliasing cache it's possible to have multiple cache lines
> representing the same physical page.
> So the kernel might not see an update made by userspace and vice
> versa, because kernel and userspace use different virtual addresses.
>
> And btw, flush_dcache_page() would be the wrong choice, since kcov_area
> is a vmalloc address, not a linear address.
> So we need something that flushes vmalloc addresses.
>
> Alternatively, we could simply mlock that memory and talk to user space
> via get/put_user(). No flush would be required.
> And we would avoid another potential problem - lack of vmalloc address
> space on 32-bit.

Do you mean that user-space allocates a buffer and passes this buffer
to ioctl(KCOV_INIT); kernel locks this range and then directly writes
to it?

I'm afraid it becomes prohibitively expensive with put_user/get_user:
https://gist.githubusercontent.com/dvyukov/568f2e4a61afc910f880/raw/540cc071f1d561b9a3f9e50183d681be265af8c3/gistfile1.txt

Also, won't it require the same flush, since the region is mmaped into
several processes (and the process that reads is not the one that sets
up the region)?

The size of the coverage buffer that I currently use is 64K. I hope it
is not a problem for 32-bit archs.



>> We could add an ioctl that does the flush. But I would prefer if it is
>> done when we port kcov to such an arch. Does arm64 require the flush?
>>
>
> I think it doesn't. AFAIK arm64 has a non-aliasing D-cache.
>
> arm64/include/asm/cacheflush.h says:
>        Please note that the implementation assumes non-aliasing VIPT D-cache
>
> However, I wonder why it implements flush_dcache_page(). Per my
> understanding it is not needed for non-aliasing caches.
> And Documentation/cachetlb.txt agrees with me:
>        void flush_dcache_page(struct page *page)
>           If D-cache aliasing is not an issue, this routine may
>           simply be defined as a nop on that architecture.
>
> Catalin, Will, could you please shed light on this?


* [PATCH v2] kernel: add kcov code coverage
  2016-01-15 14:07       ` Dmitry Vyukov
@ 2016-01-18 13:34         ` Andrey Ryabinin
  2016-01-18 19:31           ` Dmitry Vyukov
  2016-01-18 14:13         ` Mark Rutland
  1 sibling, 1 reply; 11+ messages in thread
From: Andrey Ryabinin @ 2016-01-18 13:34 UTC (permalink / raw)
  To: linux-arm-kernel

2016-01-15 17:07 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>>> Note that this works only for cache-coherent architectures.
>>>> For incoherent arches you'll need to flush_dcache_page() somewhere.
>>>> Perhaps it could be done on exit to userspace, since flushing here is
>>>> certainly an overkill.
>>>
>>> I can't say that I understand the problem. Does it have to do with the
>>> fact that the buffer is shared between kernel and user-space?
>>> The current code is OK from the plain multi-threading side, as the
>>> user must not read the buffer concurrently with writing (that would
>>> not yield anything useful).
>>
>> It's not about SMP.
>> This problem is about virtually indexed aliasing D-caches and can be
>> observed even on a uniprocessor system.
>> You have 3 virtual addresses (user-space, linear mapping and vmalloc)
>> mapped to the same physical page.
>> With an aliasing cache it's possible to have multiple cache lines
>> representing the same physical page.
>> So the kernel might not see an update made by userspace and vice
>> versa, because kernel and userspace use different virtual addresses.
>>
>> And btw, flush_dcache_page() would be the wrong choice, since kcov_area
>> is a vmalloc address, not a linear address.
>> So we need something that flushes vmalloc addresses.
>>
>> Alternatively, we could simply mlock that memory and talk to user space
>> via get/put_user(). No flush would be required.
>> And we would avoid another potential problem - lack of vmalloc address
>> space on 32-bit.
>
> Do you mean that user-space allocates a buffer and passes this buffer
> to ioctl(KCOV_INIT); kernel locks this range and then directly writes
> to it?
>

It's one of the ways of doing this. Another possible way is to
allocate, mmap and pin pages in kcov_mmap().
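
E.g. something along these lines (an untested sketch; error unwinding
and the kcov bookkeeping are omitted):

static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long off;
	void *area;
	int ret;

	/* Preallocate the whole buffer; vmalloc pages are not swappable,
	 * so the mapping is effectively pinned. */
	area = vmalloc_user(size);
	if (!area)
		return -ENOMEM;
	for (off = 0; off < size; off += PAGE_SIZE) {
		ret = vm_insert_page(vma, vma->vm_start + off,
				     vmalloc_to_page(area + off));
		if (ret)
			return ret; /* proper unwinding omitted here */
	}
	filep->private_data = area; /* real code would stash this in the kcov struct */
	return 0;
}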

> I'm afraid it becomes prohibitively expensive with put_user/get_user:
> https://gist.githubusercontent.com/dvyukov/568f2e4a61afc910f880/raw/540cc071f1d561b9a3f9e50183d681be265af8c3/gistfile1.txt
>

Right, but it should be better with __get_user/__put_user.

> Also, won't it require the same flush, since the region is mmaped into
> several processes (and the process that reads is not the one that sets
> up the region)?

But only a child process can inherit the kcov mapping from its parent,
so it'll be the same physical->virtual mapping as in the parent.

> The size of the coverage buffer that I currently use is 64K. I hope it
> is not a problem for 32-bit archs.
>

64K - per process. It's hard to tell whether this is a real problem or
not, since it depends on how many processes collect coverage, the size
of the vmalloc area, and its utilization by the rest of the kernel.


* [PATCH v2] kernel: add kcov code coverage
  2016-01-15 14:07       ` Dmitry Vyukov
  2016-01-18 13:34         ` Andrey Ryabinin
@ 2016-01-18 14:13         ` Mark Rutland
  2016-01-18 19:44           ` Dmitry Vyukov
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Rutland @ 2016-01-18 14:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 15, 2016 at 03:07:59PM +0100, Dmitry Vyukov wrote:
> On Fri, Jan 15, 2016 at 2:05 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> > 2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> >> On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
> >> <ryabinin.a.a@gmail.com> wrote:
> >>> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
> >>>
> >>>> +       /* Read number of PCs collected. */
> >>>> +       n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
> >>>> +       /* PCs are shortened to uint32_t, so we need to restore the upper part. */
> >>>> +       for (i = 0; i < n; i++)
> >>>> +               printf("0xffffffff%08lx\n", (unsigned long)cover[i + 1]);
> >>
> >> Thanks for the review!
> >> Mailed v3 with fixes.
> >> Comments inline.
> >>
> >>> This works only for x86-64.
> >>> Probably there is no simple way to make this arch-independent with
> >>> 32-bit values.
> >>
> >> We probably could add an ioctl that returns base of the stripped PCs.
> >
> > You forgot about modules. With stripped PCs you'll start mixing
> > kernel and module PCs (if the distance between a module and the kernel is > 4G).
> 
> It's just that on x86, kernel text and modules are within 4GB.
> 
> I've checked that on arm64 it also seems to be the case:
> 
>  * The module space lives between the addresses given by TASK_SIZE
>  * and PAGE_OFFSET - it must be within 128MB of the kernel text.
>  */
> ...
> #define MODULES_END             (PAGE_OFFSET)
> #define MODULES_VADDR           (MODULES_END - SZ_64M)

This won't necessarily remain true. With kASLR [1,2] the modules and
kernel might be located anywhere in the vmalloc area, independently.
Using PLTs removes the +/-128MB restriction, so they may be placed
anywhere in the vmalloc area.

On my defconfig kernel (4KiB pages, 39-bit VA) today, that area is ~246GiB wide:

[    0.000000] Virtual kernel memory layout:
[    0.000000]     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
[    0.000000]     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
[    0.000000]               0xffffffbdc2000000 - 0xffffffbde8000000   (   608 MB actual)
[    0.000000]     fixed   : 0xffffffbffa7fd000 - 0xffffffbffac00000   (  4108 KB)
[    0.000000]     PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
[    0.000000]     modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
[    0.000000]     memory  : 0xffffffc000000000 - 0xffffffc980000000   ( 38912 MB)
[    0.000000]       .init : 0xffffffc000a00000 - 0xffffffc000a9c000   (   624 KB)
[    0.000000]       .text : 0xffffffc000080000 - 0xffffffc000a00000   (  9728 KB)
[    0.000000]       .data : 0xffffffc000a9c000 - 0xffffffc000b17a00   (   495 KB)

Kernels can be built with a 48-bit VA (and potentially larger in future
with ARMv8.2-A or later [3]). The vmalloc area (and hence the maximum
distances between modules and the kernel) will grow with the VA
range.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398527.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398534.html
[3] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution


* [PATCH v2] kernel: add kcov code coverage
  2016-01-18 13:34         ` Andrey Ryabinin
@ 2016-01-18 19:31           ` Dmitry Vyukov
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Vyukov @ 2016-01-18 19:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 18, 2016 at 2:34 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> 2016-01-15 17:07 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>>>> Note that this works only for cache-coherent architectures.
>>>>> For incoherent arches you'll need to flush_dcache_page() somewhere.
>>>>> Perhaps it could be done on exit to userspace, since flushing here is
>>>>> certainly an overkill.
>>>>
>>>> I can't say that I understand the problem. Does it have to do with the
>>>> fact that the buffer is shared between kernel and user-space?
>>>> The current code is OK from the plain multi-threading side, as the
>>>> user must not read the buffer concurrently with writing (that would
>>>> not yield anything useful).
>>>
>>> It's not about SMP.
>>> This problem is about virtually indexed aliasing D-caches and can be
>>> observed even on a uniprocessor system.
>>> You have 3 virtual addresses (user-space, linear mapping and vmalloc)
>>> mapped to the same physical page.
>>> With an aliasing cache it's possible to have multiple cache lines
>>> representing the same physical page.
>>> So the kernel might not see an update made by userspace and vice
>>> versa, because kernel and userspace use different virtual addresses.
>>>
>>> And btw, flush_dcache_page() would be the wrong choice, since kcov_area
>>> is a vmalloc address, not a linear address.
>>> So we need something that flushes vmalloc addresses.
>>>
>>> Alternatively, we could simply mlock that memory and talk to user space
>>> via get/put_user(). No flush would be required.
>>> And we would avoid another potential problem - lack of vmalloc address
>>> space on 32-bit.
>>
>> Do you mean that user-space allocates a buffer and passes this buffer
>> to ioctl(KCOV_INIT); kernel locks this range and then directly writes
>> to it?
>>
>
> It's one of the ways of doing this. Another possible way is to
> allocate, mmap and pin pages in kcov_mmap().


Which means that we can hide it under the same interface, right?
I preallocate all pages in kcov_mmap() in v4 as suggested by Kirill.
If we can hide it under the current interface, then I would prefer to
do locking later in subsequent patches (probably the ones that port
kcov to an arch that requires the flush).



>> I'm afraid it becomes prohibitively expensive with put_user/get_user:
>> https://gist.githubusercontent.com/dvyukov/568f2e4a61afc910f880/raw/540cc071f1d561b9a3f9e50183d681be265af8c3/gistfile1.txt
>>
>
> Right, but it should be better with __get_user/__put_user.
>
>> Also, won't it require the same flush, since the region is mmaped into
>> several processes (and the process that reads is not the one that sets
>> up the region)?
>
> But only a child process can inherit the kcov mapping from its parent,
> so it'll be the same physical->virtual mapping as in the parent.

ack

>> The size of the coverage buffer that I currently use is 64K. I hope it
>> is not a problem for 32-bit archs.
>>
>
> 64K - per process. It's hard to tell whether this is a real problem or
> not, since it depends on how many processes collect coverage, the size
> of the vmalloc area, and its utilization by the rest of the kernel.


* [PATCH v2] kernel: add kcov code coverage
  2016-01-18 14:13         ` Mark Rutland
@ 2016-01-18 19:44           ` Dmitry Vyukov
  2016-01-18 20:09             ` Dmitry Vyukov
  0 siblings, 1 reply; 11+ messages in thread
From: Dmitry Vyukov @ 2016-01-18 19:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 18, 2016 at 3:13 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Jan 15, 2016 at 03:07:59PM +0100, Dmitry Vyukov wrote:
>> On Fri, Jan 15, 2016 at 2:05 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
>> > 2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>> >> On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
>> >> <ryabinin.a.a@gmail.com> wrote:
>> >>> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>> >>>
>> >>>> +       /* Read number of PCs collected. */
>> >>>> +       n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
>> >>>> +       /* PCs are shortened to uint32_t, so we need to restore the upper part. */
>> >>>> +       for (i = 0; i < n; i++)
>> >>>> +               printf("0xffffffff%08lx\n", (unsigned long)cover[i + 1]);
>> >>
>> >> Thanks for the review!
>> >> Mailed v3 with fixes.
>> >> Comments inline.
>> >>
>> >>> This works only for x86-64.
>> >>> Probably there is no simple way to make this arch-independent with
>> >>> 32-bit values.
>> >>
>> >> We probably could add an ioctl that returns base of the stripped PCs.
>> >
>> > You forgot about modules. With stripped PCs you'll start mixing
>> > kernel and module PCs (if the distance between a module and the kernel is > 4G).
>>
>> It's just that on x86, kernel text and modules are within 4GB.
>>
>> I've checked that on arm64 it also seems to be the case:
>>
>>  * The module space lives between the addresses given by TASK_SIZE
>>  * and PAGE_OFFSET - it must be within 128MB of the kernel text.
>>  */
>> ...
>> #define MODULES_END             (PAGE_OFFSET)
>> #define MODULES_VADDR           (MODULES_END - SZ_64M)
>
> This won't necessarily remain true. With kASLR [1,2] the modules and
> kernel might be located anywhere in the vmalloc area, independently.
> Using PLTs removes the +/-128MB restriction, so they may be placed
> anywhere in the vmalloc area.
>
> On my defconfig kernel (4KiB pages, 39-bit VA) today, that area is ~246GiB wide:
>
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
> [    0.000000]     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
> [    0.000000]               0xffffffbdc2000000 - 0xffffffbde8000000   (   608 MB actual)
> [    0.000000]     fixed   : 0xffffffbffa7fd000 - 0xffffffbffac00000   (  4108 KB)
> [    0.000000]     PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
> [    0.000000]     modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
> [    0.000000]     memory  : 0xffffffc000000000 - 0xffffffc980000000   ( 38912 MB)
> [    0.000000]       .init : 0xffffffc000a00000 - 0xffffffc000a9c000   (   624 KB)
> [    0.000000]       .text : 0xffffffc000080000 - 0xffffffc000a00000   (  9728 KB)
> [    0.000000]       .data : 0xffffffc000a9c000 - 0xffffffc000b17a00   (   495 KB)
>
> Kernels can be built with a 48-bit VA (and potentially larger in future
> with ARMv8.2-A or later [3]). The vmalloc area (and hence the maximum
> distances between modules and the kernel) will grow with the VA
> range.
>
> Thanks,
> Mark.
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398527.html
> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398534.html
> [3] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution


Thanks, Mark.

I've got several comments regarding the 4-byte compressed PCs. We've
also discussed this internally.
As a result, in v4 I made it possible to export both compressed
4-byte PCs and full 8-byte PCs.
Now the init ioctl accepts the following struct, and the kernel can say
whether it will export 4- or 8-byte PCs:

struct kcov_init_trace {
        unsigned long        flags; /* In: reserved, must be 0. */
        unsigned long        size; /* In: trace buffer size. */
        unsigned long        version;  /* Out: trace format, currently 1. */
        /*
         * Output.
         * pc_size can be 4 or 8. If pc_size = 4 on a 64-bit arch,
         * returned PCs are compressed by subtracting pc_base and then
         * truncating to 4 bytes.
         */
        unsigned long        pc_size;
        unsigned long        pc_base;
};

So for KASLR or other archs we can just export full 8-byte PCs.
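
On the user side, decoding would then look roughly like this (an
untested sketch; the struct layout just mirrors the one above and is
not a final ABI):

#include <stdint.h>

struct kcov_init_trace {
        unsigned long        flags;
        unsigned long        size;
        unsigned long        version;
        unsigned long        pc_size;
        unsigned long        pc_base;
};

/* cover points at the mmaped trace buffer; slot 0 holds the count. */
static uint64_t decode_pc(const struct kcov_init_trace *it,
                          const void *cover, unsigned long i)
{
        if (it->pc_size == 4)   /* compressed: restore via pc_base */
                return it->pc_base + ((const uint32_t *)cover)[i + 1];
        return ((const uint64_t *)cover)[i + 1];   /* full 8-byte PC */
}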

Regarding KASLR and dynamically loaded modules: I've looked at my
use-case and concluded that most of the time I can work with
"non-stable" PCs within a single VM. Whenever I need to store PCs
persistently or send them to another machine, I think I can
"canonicalize" PCs using /proc/modules and /proc/kallsyms to something
like (module hash, module offset). So the kernel does not need to do
this during coverage collection.
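
For the module part, canonicalization could look roughly like this (an
untested sketch; note that /proc/modules shows zero addresses without
root, and the size column only approximates the module's text range):

#include <stdio.h>

/* Map a raw PC to (module name, offset) by scanning /proc/modules,
 * whose lines look like: "name size refcnt deps state address". */
static int canonicalize_pc(unsigned long pc, char name[64],
                           unsigned long *off)
{
        FILE *f = fopen("/proc/modules", "r");
        char line[512];
        unsigned long size, addr;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%63s %lu %*s %*s %*s %lx",
                           name, &size, &addr) != 3)
                        continue;
                if (pc >= addr && pc - addr < size) {
                        *off = pc - addr;
                        fclose(f);
                        return 0;
                }
        }
        fclose(f);
        return -1;      /* a core-kernel PC: resolve via /proc/kallsyms */
}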


* [PATCH v2] kernel: add kcov code coverage
  2016-01-18 19:44           ` Dmitry Vyukov
@ 2016-01-18 20:09             ` Dmitry Vyukov
  2016-01-22 11:55               ` Mark Rutland
  0 siblings, 1 reply; 11+ messages in thread
From: Dmitry Vyukov @ 2016-01-18 20:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 18, 2016 at 8:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Mon, Jan 18, 2016 at 3:13 PM, Mark Rutland <mark.rutland@arm.com> wrote:
>> On Fri, Jan 15, 2016 at 03:07:59PM +0100, Dmitry Vyukov wrote:
>>> On Fri, Jan 15, 2016 at 2:05 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
>>> > 2016-01-14 17:30 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>> >> On Thu, Jan 14, 2016 at 11:50 AM, Andrey Ryabinin
>>> >> <ryabinin.a.a@gmail.com> wrote:
>>> >>> 2016-01-13 15:48 GMT+03:00 Dmitry Vyukov <dvyukov@google.com>:
>>> >>>
>>> >>>> +       /* Read number of PCs collected. */
>>> >>>> +       n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
>>> >>>> +       /* PCs are shortened to uint32_t, so we need to restore the upper part. */
>>> >>>> +       for (i = 0; i < n; i++)
>>> >>>> +               printf("0xffffffff%08lx\n", (unsigned long)cover[i + 1]);
>>> >>
>>> >> Thanks for the review!
>>> >> Mailed v3 with fixes.
>>> >> Comments inline.
>>> >>
>>> >>> This works only for x86-64.
>>> >>> Probably there is no simple way to make this arch-independent with
>>> >>> 32-bit values.
>>> >>
>>> >> We probably could add an ioctl that returns base of the stripped PCs.
>>> >
>>> > You forgot about modules. With stripped PCs you'll start mixing
>>> > kernel and module PCs (if the distance between a module and the kernel is > 4G).
>>>
>>> It's just that on x86, kernel text and modules are within 4GB.
>>>
>>> I've checked that on arm64 it also seems to be the case:
>>>
>>>  * The module space lives between the addresses given by TASK_SIZE
>>>  * and PAGE_OFFSET - it must be within 128MB of the kernel text.
>>>  */
>>> ...
>>> #define MODULES_END             (PAGE_OFFSET)
>>> #define MODULES_VADDR           (MODULES_END - SZ_64M)
>>
>> This won't necessarily remain true. With kASLR [1,2] the modules and
>> kernel might be located anywhere in the vmalloc area, independently.
>> Using PLTs removes the +/-128MB restriction, so they may be placed
>> anywhere in the vmalloc area.
>>
>> On my defconfig kernel (4KiB pages, 39-bit VA) today, that area is ~246GiB wide:
>>
>> [    0.000000] Virtual kernel memory layout:
>> [    0.000000]     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
>> [    0.000000]     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
>> [    0.000000]               0xffffffbdc2000000 - 0xffffffbde8000000   (   608 MB actual)
>> [    0.000000]     fixed   : 0xffffffbffa7fd000 - 0xffffffbffac00000   (  4108 KB)
>> [    0.000000]     PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
>> [    0.000000]     modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
>> [    0.000000]     memory  : 0xffffffc000000000 - 0xffffffc980000000   ( 38912 MB)
>> [    0.000000]       .init : 0xffffffc000a00000 - 0xffffffc000a9c000   (   624 KB)
>> [    0.000000]       .text : 0xffffffc000080000 - 0xffffffc000a00000   (  9728 KB)
>> [    0.000000]       .data : 0xffffffc000a9c000 - 0xffffffc000b17a00   (   495 KB)
>>
>> Kernels can be built with a 48-bit VA (and potentially larger in future
>> with ARMv8.2-A or later [3]). The vmalloc area (and hence the maximum
>> distances between modules and the kernel) will grow with the VA
>> range.
>>
>> Thanks,
>> Mark.
>>
>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398527.html
>> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/398534.html
>> [3] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution
>
>
> Thanks, Mark.
>
> I've got several comments regarding the 4-byte compressed PCs. We've
> also discussed this internally.
> As a result, in v4 I made it possible to export both compressed
> 4-byte PCs and full 8-byte PCs.
> Now the init ioctl accepts the following struct, and the kernel can say
> whether it will export 4- or 8-byte PCs:
>
> struct kcov_init_trace {
>         unsigned long        flags; /* In: reserved, must be 0. */
>         unsigned long        size; /* In: trace buffer size. */
>         unsigned long        version;  /* Out: trace format, currently 1. */
>         /*
>          * Output.
>          * pc_size can be 4 or 8. If pc_size = 4 on a 64-bit arch,
>          * returned PCs are compressed by subtracting pc_base and then
>          * truncating to 4 bytes.
>          */
>         unsigned long        pc_size;
>         unsigned long        pc_base;
> };
>
> So for KASLR or other archs we can just export full 8-byte PCs.
>
> Regarding KASLR and dynamically loaded modules: I've looked at my
> use-case and concluded that most of the time I can work with
> "non-stable" PCs within a single VM. Whenever I need to store PCs
> persistently or send them to another machine, I think I can
> "canonicalize" PCs using /proc/modules and /proc/kallsyms to something
> like (module hash, module offset). So the kernel does not need to do
> this during coverage collection.

On second thought, maybe it's better to just always export unsigned long PCs...
Need to measure how much memory the coverage information consumes,
and how much slower it is with uint64 PCs. Maybe I can live with large PCs,
or maybe I can make syzkaller require !KASLR and compress PCs in user-space...
Need to think about this more.


* [PATCH v2] kernel: add kcov code coverage
  2016-01-18 20:09             ` Dmitry Vyukov
@ 2016-01-22 11:55               ` Mark Rutland
  2016-01-22 12:15                 ` Dmitry Vyukov
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Rutland @ 2016-01-22 11:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 18, 2016 at 09:09:43PM +0100, Dmitry Vyukov wrote:
> On Mon, Jan 18, 2016 at 8:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> > I've got several comments regarding the 4-byte compressed PCs. We've
> > also discussed this internally.
> > As a result, in v4 I made it possible to export both compressed
> > 4-byte PCs and full 8-byte PCs.
> > Now the init ioctl accepts the following struct, and the kernel can say
> > whether it will export 4- or 8-byte PCs:
> >
> > struct kcov_init_trace {
> >         unsigned long        flags; /* In: reserved, must be 0. */
> >         unsigned long        size; /* In: trace buffer size. */
> >         unsigned long        version;  /* Out: trace format, currently 1. */
> >         /*
> >          * Output.
> >          * pc_size can be 4 or 8. If pc_size = 4 on a 64-bit arch,
> >          * returned PCs are compressed by subtracting pc_base and then
> >          * truncating to 4 bytes.
> >          */
> >         unsigned long        pc_size;
> >         unsigned long        pc_base;
> > };
> >
> > So for KASLR or other archs we can just export full 8-byte PCs.
> >
> > Regarding KASLR and dynamically loaded modules: I've looked at my
> > use-case and concluded that most of the time I can work with
> > "non-stable" PCs within a single VM. Whenever I need to store PCs
> > persistently or send them to another machine, I think I can
> > "canonicalize" PCs using /proc/modules and /proc/kallsyms to something
> > like (module hash, module offset). So the kernel does not need to do
> > this during coverage collection.
> 
> On second thought, maybe it's better to just always export unsigned long PCs...
> Need to measure how much memory the coverage information consumes,
> and how much slower it is with uint64 PCs. Maybe I can live with large PCs,
> or maybe I can make syzkaller require !KASLR and compress PCs in user-space...
> Need to think about this more.

I can imagine we might keep the expanded module range even in the
absence of full KASLR, though I don't know how realistic that thought
is.

Thanks,
Mark.


* [PATCH v2] kernel: add kcov code coverage
  2016-01-22 11:55               ` Mark Rutland
@ 2016-01-22 12:15                 ` Dmitry Vyukov
  2016-01-22 12:52                   ` Mark Rutland
  0 siblings, 1 reply; 11+ messages in thread
From: Dmitry Vyukov @ 2016-01-22 12:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 22, 2016 at 12:55 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Mon, Jan 18, 2016 at 09:09:43PM +0100, Dmitry Vyukov wrote:
>> On Mon, Jan 18, 2016 at 8:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> > I've got several comments regarding the 4-byte compressed PCs. We've
>> > also discussed this internally.
>> > As a result, in v4 I made it possible to export both compressed
>> > 4-byte PCs and full 8-byte PCs.
>> > Now the init ioctl accepts the following struct, and the kernel can say
>> > whether it will export 4- or 8-byte PCs:
>> >
>> > struct kcov_init_trace {
>> >         unsigned long        flags; /* In: reserved, must be 0. */
>> >         unsigned long        size; /* In: trace buffer size. */
>> >         unsigned long        version;  /* Out: trace format, currently 1. */
>> >         /*
>> >          * Output.
>> >          * pc_size can be 4 or 8. If pc_size = 4 on a 64-bit arch,
>> >          * returned PCs are compressed by subtracting pc_base and then
>> >          * truncating to 4 bytes.
>> >          */
>> >         unsigned long        pc_size;
>> >         unsigned long        pc_base;
>> > };
>> >
>> > So for KASLR or other archs we can just export full 8-byte PCs.
>> >
>> > Regarding KASLR and dynamically loaded modules: I've looked at my
>> > use-case and concluded that most of the time I can work with
>> > "non-stable" PCs within a single VM. Whenever I need to store PCs
>> > persistently or send them to another machine, I think I can
>> > "canonicalize" PCs using /proc/modules and /proc/kallsyms to something
>> > like (module hash, module offset). So the kernel does not need to do
>> > this during coverage collection.
>>
>> On second thought, maybe it's better to just always export unsigned long PCs...
>> Need to measure how much memory the coverage information consumes,
>> and how much slower it is with uint64 PCs. Maybe I can live with large PCs,
>> or maybe I can make syzkaller require !KASLR and compress PCs in user-space...
>> Need to think about this more.
>
> I can imagine we might keep the expanded module range even in the
> absence of full KASLR, though I don't know how realistic that thought
> is.


The last version of the patch just exposes PCs as unsigned longs
without any compression. So it should not be a problem (at least for
the kernel; it's now the user's responsibility to make sense of the PCs).


* [PATCH v2] kernel: add kcov code coverage
  2016-01-22 12:15                 ` Dmitry Vyukov
@ 2016-01-22 12:52                   ` Mark Rutland
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Rutland @ 2016-01-22 12:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 22, 2016 at 01:15:27PM +0100, Dmitry Vyukov wrote:
> On Fri, Jan 22, 2016 at 12:55 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Mon, Jan 18, 2016 at 09:09:43PM +0100, Dmitry Vyukov wrote:
> >> On Mon, Jan 18, 2016 at 8:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> >> > Regarding KASLR and dynamically loaded modules: I've looked at my
> >> > use-case and concluded that most of the time I can work with
> >> > "non-stable" PCs within a single VM. Whenever I need to store PCs
> >> > persistently or send them to another machine, I think I can
> >> > "canonicalize" PCs using /proc/modules and /proc/kallsyms to something
> >> > like (module hash, module offset). So the kernel does not need to do
> >> > this during coverage collection.
> >>
> >> On second thought, maybe it's better to just always export unsigned long PCs...
> >> Need to measure how much memory the coverage information consumes,
> >> and how much slower it is with uint64 PCs. Maybe I can live with large PCs,
> >> or maybe I can make syzkaller require !KASLR and compress PCs in user-space...
> >> Need to think about this more.
> >
> > I can imagine we might keep the expanded module range even in the
> > absence of full KASLR, though I don't know how realistic that thought
> > is.
> 
> The last version of the patch just exposes PCs as unsigned longs
> without any compression. So it should not be a problem (at least for
> the kernel; it's now the user's responsibility to make sense of the PCs).

Ah, ok. Sorry for the noise!

Mark.


end of thread

Thread overview: 11+ messages
     [not found] <1452689318-107172-1-git-send-email-dvyukov@google.com>
     [not found] ` <CAPAsAGx_0pVjtKA2RwwpStRG2-L+5X6W14QbrQaMYhdg30EeAQ@mail.gmail.com>
     [not found]   ` <CACT4Y+bHjfkrZ4jWZf+j6FKnj2GQv1t94Y9LJm4=WFyb8C0J8w@mail.gmail.com>
2016-01-15 13:05     ` [PATCH v2] kernel: add kcov code coverage Andrey Ryabinin
2016-01-15 13:42       ` Will Deacon
2016-01-15 14:07       ` Dmitry Vyukov
2016-01-18 13:34         ` Andrey Ryabinin
2016-01-18 19:31           ` Dmitry Vyukov
2016-01-18 14:13         ` Mark Rutland
2016-01-18 19:44           ` Dmitry Vyukov
2016-01-18 20:09             ` Dmitry Vyukov
2016-01-22 11:55               ` Mark Rutland
2016-01-22 12:15                 ` Dmitry Vyukov
2016-01-22 12:52                   ` Mark Rutland
