From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: mhiramat@kernel.org, peterz@infradead.org,
srikar@linux.vnet.ibm.com, acme@kernel.org,
ananth@linux.vnet.ibm.com, akpm@linux-foundation.org,
alexander.shishkin@linux.intel.com, alexis.berlemont@gmail.com,
corbet@lwn.net, dan.j.williams@intel.com,
gregkh@linuxfoundation.org, huawei.libin@huawei.com,
hughd@google.com, jack@suse.cz, jglisse@redhat.com,
jolsa@redhat.com, kan.liang@intel.com,
kirill.shutemov@linux.intel.com, kjlx@templeofstupid.com,
kstewart@linuxfoundation.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.com, milian.wolff@kdab.com, mingo@redhat.com,
namhyung@kernel.org, naveen.n.rao@linux.vnet.ibm.com,
pc@us.ibm.com, pombredanne@nexb.com, rostedt@goodmis.org,
tglx@linutronix.de, tmricht@linux.vnet.ibm.com,
willy@infradead.org, yao.jin@linux.intel.com,
fengguang.wu@intel.com,
Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Subject: Re: [PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter
Date: Fri, 16 Mar 2018 17:42:02 +0530
Message-ID: <c93216a4-a4e1-dd8f-00be-17254e308cd1@linux.vnet.ibm.com>
In-Reply-To: <20180315144959.GB19643@redhat.com>
On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> For tiny binaries/libraries, different mmap regions points to the
>> same file portion. In such cases, we may increment reference counter
>> multiple times.
> Yes,
>
>> But while de-registration, reference counter will get
>> decremented only by once
> could you explain why this happens? sdt_increment_ref_ctr() and
> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> the same mappings?
Sorry, I thought this happens only for tiny binaries, but that is not the case.
It happens for a binary / library of any size.
Also, the problem is not in sdt_increment_ref_ctr() / sdt_decrement_ref_ctr().
It is in trace_uprobe_mmap_callback().
To illustrate in detail, I'm adding a pr_info() to trace_uprobe_mmap_callback():
 		vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+		pr_info("0x%lx-0x%lx : 0x%lx\n", vma->vm_start, vma->vm_end, vaddr);
 		sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
Ok now, libpython has SDT markers with reference counter:
  # readelf -n /usr/lib64/libpython2.7.so.1.0 | grep -A2 Provider
      Provider: python
      Name: function__entry
      ... Semaphore: 0x00000000002899d8
Probing on that marker:
  # cd /sys/kernel/debug/tracing/
  # echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
  # echo 1 > events/sdt_python/function__entry/enable
When I run python:
  # strace -o out python
    mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
    mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
    mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
The first mmap() maps the whole library into one region. The second mmap()
and the mprotect() then split that region into smaller vmas and set the
appropriate protection flags.
Now, in this case, trace_uprobe_mmap_callback() updates the reference counter
twice -- once for the second mmap() call and once for the mprotect() call --
because both resulting regions contain the reference counter offset. I can
verify this in dmesg:
  # dmesg | tail
    trace_kprobe: 0x7fff926a0000-0x7fff926f0000 : 0x7fff926e99d8
    trace_kprobe: 0x7fff926b0000-0x7fff926f0000 : 0x7fff926e99d8
Final vmas of libpython:
  # cat /proc/`pgrep python`/maps | grep libpython
    7fff92460000-7fff926a0000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
    7fff926a0000-7fff926b0000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
    7fff926b0000-7fff926f0000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
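The double update can be checked by hand with a small userspace model of the offset-to-address arithmetic. This is only a sketch: struct vma_model, vma_has_offset(), vma_off_to_vaddr() and count_ref_ctr_updates() are hypothetical names made up for illustration, not kernel code, and the mapping's file offset is kept in bytes here for simplicity. The two regions are the ones printed by pr_info() above, with their file offsets taken from the strace and maps output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the vm_area_struct fields used here. */
struct vma_model {
	uint64_t vm_start;	/* first mapped address */
	uint64_t vm_end;	/* one past the last mapped address */
	uint64_t file_off;	/* file offset of vm_start, in bytes */
};

/* True if the given file offset is backed by this mapping. */
bool vma_has_offset(const struct vma_model *vma, uint64_t off)
{
	assert(vma->vm_start < vma->vm_end);
	return off >= vma->file_off &&
	       off - vma->file_off < vma->vm_end - vma->vm_start;
}

/* Address of a file offset inside a mapping: start + (offset - mapping offset). */
uint64_t vma_off_to_vaddr(const struct vma_model *vma, uint64_t off)
{
	return vma->vm_start + (off - vma->file_off);
}

/* Count how many of the regions seen by the callback would trigger an update. */
size_t count_ref_ctr_updates(const struct vma_model *vmas, size_t n, uint64_t off)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (vma_has_offset(&vmas[i], off))
			hits++;
	return hits;
}

/* Regions as the callback saw them for libpython. */
const struct vma_model libpython_seen[] = {
	{ 0x7fff926a0000, 0x7fff926f0000, 0x230000 },	/* second mmap() */
	{ 0x7fff926b0000, 0x7fff926f0000, 0x240000 },	/* after the mprotect() split */
};
```

With ref_ctr_offset 0x2799d8, both regions contain the offset and both resolve to the same address, 0x7fff926e99d8, which matches the two dmesg lines.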
I see a similar problem with a normal binary as well. I'm using Brendan Gregg's
example[1]:
  # readelf -n /tmp/tick | grep -A2 Provider
      Provider: tick
      Name: loop2
      ... Semaphore: 0x000000001005003c
Probing that marker:
  # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
  # echo 1 > events/sdt_tick/loop2/enable
Now when I run the binary
  # /tmp/tick
load_elf_binary() internally calls mmap() and I see trace_uprobe_mmap_callback()
updating reference counter twice:
  # dmesg | tail
    trace_kprobe: 0x10010000-0x10030000 : 0x10020036
    trace_kprobe: 0x10020000-0x10030000 : 0x10020036
/proc/<pid>/maps of tick:
  # cat /proc/`pgrep tick`/maps
    10000000-10010000 r-xp 00000000 08:05 1335712 /tmp/tick
    10010000-10020000 r--p 00000000 08:05 1335712 /tmp/tick
    10020000-10030000 rw-p 00010000 08:05 1335712 /tmp/tick
[1] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
> Either way, this patch doesn't look right at first glance... Just
> for example,
>
>> +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> +	struct sdt_mm_list *tmp = tu->sml;
>> +
>> +	if (!tu->sml || !mm)
>> +		return false;
>> +
>> +	while (tmp) {
>> +		if (tmp->mm == mm)
>> +			return true;
>> +		tmp = tmp->next;
>> +	}
>> +
>> +	return false;
> ...
>
>> +}
>> +
>> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> +	struct sdt_mm_list *tmp;
>> +
>> +	tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
>> +	if (!tmp)
>> +		return;
>> +
>> +	tmp->mm = mm;
>> +	tmp->next = tu->sml;
>> +	tu->sml = tmp;
>> +}
>> +
> ...
>
>> @@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
>>  		    !trace_probe_is_enabled(&tu->tp))
>>  			continue;
>>
>> +		down_write(&tu->sml_rw_sem);
>> +		if (sdt_check_mm_list(tu, vma->vm_mm))
>> +			goto cont;
>> +
>>  		vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
>> -		sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
>> +		if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
>> +			sdt_add_mm_list(tu, vma->vm_mm);
>> +
>> +cont:
>> +		up_write(&tu->sml_rw_sem);
> To simplify, suppose that tu->sml is empty.
>
> Some process calls this function, increments the counter and adds its ->mm into
> the list.
>
> Then it exits, ->mm is freed.
>
> The next fork/exec allocates the same memory for the new ->mm, the new process
> calls trace_uprobe_mmap_callback() and sdt_check_mm_list() returns T?
Yes, this can happen. Maybe we can use mmu_notifier for this?
We would register a release() callback from trace_uprobe when adding the mm
to tu->sml. When the mm gets freed, trace_uprobe would get notified and could
drop the stale entry.
That said, I don't know much about mmu_notifier, so I need to think about this.
Let me know if you have better ideas.
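As a rough userspace illustration of that idea (not kernel code, and not a worked-out mmu_notifier user), the list only stays trustworthy if the entry is dropped when the mm goes away. All names below (struct mm_model, struct sdt_mm_node, sml_contains(), sml_add(), sml_release()) are hypothetical; sml_release() stands in for whatever a release() hook would end up doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mm_struct; only its address matters here. */
struct mm_model {
	int unused;
};

struct sdt_mm_node {
	const struct mm_model *mm;
	struct sdt_mm_node *next;
};

/* Same check as sdt_check_mm_list(): the lookup is by pointer value only. */
bool sml_contains(const struct sdt_mm_node *head, const struct mm_model *mm)
{
	for (; head; head = head->next)
		if (head->mm == mm)
			return true;
	return false;
}

/* Push an mm onto the list; returns false if the allocation fails. */
bool sml_add(struct sdt_mm_node **head, const struct mm_model *mm)
{
	struct sdt_mm_node *node;

	assert(head != NULL);
	node = malloc(sizeof(*node));
	if (!node)
		return false;
	node->mm = mm;
	node->next = *head;
	*head = node;
	return true;
}

/* What a release() hook would need to do: forget the dying mm. */
void sml_release(struct sdt_mm_node **head, const struct mm_model *mm)
{
	while (*head) {
		struct sdt_mm_node *node = *head;

		if (node->mm == mm) {
			*head = node->next;
			free(node);
		} else {
			head = &node->next;
		}
	}
}
```

If sml_release() is never called, a later mm that happens to be allocated at the same address makes sml_contains() return true and the counter update is skipped, which is the false positive described above. Calling it when the mm is torn down makes the reused address look new again.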
Thanks for the review :)
Ravi