* [PATCH v3 2/9] mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Make function names more meaningful by adding vma_ prefix
to them.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
include/linux/mm.h | 4 ++--
kernel/events/uprobes.c | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index de0cc08..47fd8a9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2273,13 +2273,13 @@ struct vm_unmapped_area_info {
}
static inline unsigned long
-offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+vma_offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
{
return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
}
static inline loff_t
-vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
+vma_vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
{
return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bd6f230..535fd39 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -748,7 +748,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
curr = info;
info->mm = vma->vm_mm;
- info->vaddr = offset_to_vaddr(vma, offset);
+ info->vaddr = vma_offset_to_vaddr(vma, offset);
}
i_mmap_unlock_read(mapping);
@@ -807,7 +807,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
goto unlock;
if (vma->vm_start > info->vaddr ||
- vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
+ vma_vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
goto unlock;
if (is_register) {
@@ -977,7 +977,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
uprobe->offset >= offset + vma->vm_end - vma->vm_start)
continue;
- vaddr = offset_to_vaddr(vma, uprobe->offset);
+ vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
err |= remove_breakpoint(uprobe, mm, vaddr);
}
up_read(&mm->mmap_sem);
@@ -1023,7 +1023,7 @@ static void build_probe_list(struct inode *inode,
struct uprobe *u;
INIT_LIST_HEAD(head);
- min = vaddr_to_offset(vma, start);
+ min = vma_vaddr_to_offset(vma, start);
max = min + (end - start) - 1;
spin_lock(&uprobes_treelock);
@@ -1076,7 +1076,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
if (!fatal_signal_pending(current) &&
filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
- unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
+ unsigned long vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
}
put_uprobe(uprobe);
@@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
inode = file_inode(vma->vm_file);
- min = vaddr_to_offset(vma, start);
+ min = vma_vaddr_to_offset(vma, start);
max = min + (end - start) - 1;
spin_lock(&uprobes_treelock);
@@ -1730,7 +1730,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
if (vma && vma->vm_start <= bp_vaddr) {
if (valid_vma(vma, false)) {
struct inode *inode = file_inode(vma->vm_file);
- loff_t offset = vaddr_to_offset(vma, bp_vaddr);
+ loff_t offset = vma_vaddr_to_offset(vma, bp_vaddr);
uprobe = find_uprobe(inode, offset);
}
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
When virtual memory map for binary/library is being prepared, there is
no direct one to one mapping between mmap() and virtual memory area. Ex,
when loader loads the library, it first calls mmap(size = total_size),
where total_size is addition of size of all elf sections that are going
to be mapped. Then it splits individual vmas with new mmap()/mprotect()
calls. Loader does this to ensure it gets continuous address range for
a library. load_elf_binary() also uses similar tricks while preparing
mappings of binary.
Ex for pyhton library,
# strace -o out python
mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
Here, the first mmap() maps the whole library into one region. Second
mmap() and third mprotect() split out the whole region into smaller
vmas and sets appropriate protection flags.
Now, in this case, trace_uprobe_mmap_callback() update the reference
counter twice -- by second mmap() call and by third mprotect() call --
because both regions contain reference counter.
But while de-registration, reference counter will get decremented only
by once leaving reference counter > 0 even if no one is tracing on that
marker.
Example with python library before patch:
# readelf -n /lib64/libpython2.7.so.1.0 | grep -A1 function__entry
Name: function__entry
... Semaphore: 0x00000000002899d8
Probe on a marker:
# echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
Start tracing:
# perf record -e sdt_python:function__entry -a
Run python workload:
# python
# cat /proc/`pgrep python`/maps | grep libpython
7fffadb00000-7fffadd40000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fffadd40000-7fffadd50000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fffadd50000-7fffadd90000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
Reference counter value has been incremented twice:
# dd if=/proc/`pgrep python`/mem bs=1 count=1 skip=$(( 0x7fffadd899d8 )) 2>/dev/null | xxd
0000000: 02 .
Kill perf:
#
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.322 MB perf.data (1273 samples) ]
Reference conter is still 1 even when no one is tracing on it:
# dd if=/proc/`pgrep python`/mem bs=1 count=1 skip=$(( 0x7fffadd899d8 )) 2>/dev/null | xxd
0000000: 01 .
Ensure increment and decrement happens in sync by keeping list of mms
in trace_uprobe. Check presence of mm in the list before incrementing
the reference counter. I.e. for each {trace_uprobe,mm} tuple, reference
counter must be incremented only by one. Note that we don't check the
presence of mm in the list at decrement time.
We consider only two case while incrementing the reference counter:
1. Target binary is already running when we start tracing. In this
case, find all mm which maps region of target binary containing
reference counter. Loop over all mms and increment the counter
if mm is not already present in the list.
2. Tracer is already tracing before target binary starts execution.
In this case, all mmap(vma) gets notified to trace_uprobe.
Trace_uprobe will update reference counter if vma->vm_mm is not
already present in the list.
There is also a third case which we don't consider, a fork() case.
When process with markers forks itself, we don't explicitly increment
the reference counter in child process because it should be taken care
by dup_mmap(). We also don't add the child mm in the list. This is
fine because we don't check presence of mm in the list at decrement
time.
After patch:
Start perf record and then run python...
Reference counter value has been incremented only once:
# dd if=/proc/`pgrep python`/mem bs=1 count=1 skip=$(( 0x7fff9cbf99d8 )) 2>/dev/null | xxd
0000000: 01 .
Kill perf:
#
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.364 MB perf.data (1427 samples) ]
Reference conter is reset to 0:
# dd if=/proc/`pgrep python`/mem bs=1 count=1 skip=$(( 0x7fff9cbb99d8 )) 2>/dev/null | xxd
0000000: 00 .
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 6 +++
kernel/trace/trace_uprobe.c | 121 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 122 insertions(+), 6 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2db3ed1..e447991 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -123,6 +123,7 @@ struct uprobe_map_info {
};
extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+extern void (*uprobe_clear_state_callback)(struct mm_struct *mm);
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e26ad83..e8005d2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1231,6 +1231,9 @@ static struct xol_area *get_xol_area(void)
return area;
}
+/* Rightnow the only user of this is trace_uprobe. */
+void (*uprobe_clear_state_callback)(struct mm_struct *mm);
+
/*
* uprobe_clear_state - Free the area allocated for slots.
*/
@@ -1238,6 +1241,9 @@ void uprobe_clear_state(struct mm_struct *mm)
{
struct xol_area *area = mm->uprobes_state.xol_area;
+ if (uprobe_clear_state_callback)
+ uprobe_clear_state_callback(mm);
+
if (!area)
return;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 1a48b04..7341042c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -50,6 +50,11 @@ struct trace_uprobe_filter {
struct list_head perf_events;
};
+struct sdt_mm_list {
+ struct list_head list;
+ struct mm_struct *mm;
+};
+
/*
* uprobe event core functions
*/
@@ -61,6 +66,8 @@ struct trace_uprobe {
char *filename;
unsigned long offset;
unsigned long ref_ctr_offset;
+ struct sdt_mm_list sml;
+ struct mutex sml_lock;
unsigned long nhit;
struct trace_probe tp;
};
@@ -276,6 +283,8 @@ static inline bool is_ret_probe(struct trace_uprobe *tu)
if (is_ret)
tu->consumer.ret_handler = uretprobe_dispatcher;
init_trace_uprobe_filter(&tu->filter);
+ mutex_init(&tu->sml_lock);
+ INIT_LIST_HEAD(&(tu->sml.list));
return tu;
error:
@@ -923,6 +932,43 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
return trace_handle_return(s);
}
+static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *sml;
+
+ list_for_each_entry(sml, &(tu->sml.list), list)
+ if (sml->mm == mm)
+ return true;
+
+ return false;
+}
+
+static int sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *sml = kzalloc(sizeof(*sml), GFP_KERNEL);
+
+ if (!sml)
+ return -ENOMEM;
+
+ sml->mm = mm;
+ list_add(&(sml->list), &(tu->sml.list));
+ return 0;
+}
+
+static void sdt_del_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct list_head *pos, *q;
+ struct sdt_mm_list *sml;
+
+ list_for_each_safe(pos, q, &(tu->sml.list)) {
+ sml = list_entry(pos, struct sdt_mm_list, list);
+ if (sml->mm == mm) {
+ list_del(pos);
+ kfree(sml);
+ }
+ }
+}
+
static bool sdt_valid_vma(struct trace_uprobe *tu,
struct vm_area_struct *vma,
unsigned long vaddr)
@@ -975,6 +1021,31 @@ static struct vm_area_struct *sdt_find_vma(struct trace_uprobe *tu,
return 0;
}
+static void __sdt_increment_ref_ctr(struct trace_uprobe *tu,
+ struct mm_struct *mm,
+ unsigned long vaddr)
+{
+ int ret = 0;
+
+ ret = sdt_update_ref_ctr(mm, vaddr, 1);
+ if (unlikely(ret)) {
+ pr_info("Failed to increment ref_ctr. (%d)\n", ret);
+ return;
+ }
+
+ ret = sdt_add_mm_list(tu, mm);
+ if (unlikely(ret)) {
+ pr_info("Failed to add mm into list. (%d)\n", ret);
+ goto revert_ctr;
+ }
+ return;
+
+revert_ctr:
+ ret = sdt_update_ref_ctr(mm, vaddr, -1);
+ if (ret)
+ pr_info("Reverting ref_ctr update failed. (%d)\n", ret);
+}
+
static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
{
struct uprobe_map_info *info;
@@ -985,15 +1056,19 @@ static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
if (IS_ERR(info))
goto out;
+ mutex_lock(&tu->sml_lock);
while (info) {
- down_write(&info->mm->mmap_sem);
+ if (sdt_check_mm_list(tu, info->mm))
+ goto cont;
+ down_write(&info->mm->mmap_sem);
if (sdt_find_vma(tu, info->mm, info->vaddr))
- sdt_update_ref_ctr(info->mm, info->vaddr, 1);
-
+ __sdt_increment_ref_ctr(tu, info->mm, info->vaddr);
up_write(&info->mm->mmap_sem);
+cont:
info = uprobe_free_map_info(info);
}
+ mutex_unlock(&tu->sml_lock);
out:
uprobe_up_write_dup_mmap();
@@ -1017,14 +1092,28 @@ static void trace_uprobe_mmap(struct vm_area_struct *vma)
if (!sdt_valid_vma(tu, vma, vaddr))
continue;
- sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
+ mutex_lock(&tu->sml_lock);
+ if (!sdt_check_mm_list(tu, vma->vm_mm))
+ __sdt_increment_ref_ctr(tu, vma->vm_mm, vaddr);
+ mutex_unlock(&tu->sml_lock);
}
mutex_unlock(&uprobe_lock);
}
+/*
+ * We don't check presence of mm in tu->sml here. We just decrement
+ * the reference counter if we find vma holding the reference counter.
+ *
+ * For tiny binaries/libraries, different mmap regions point to the
+ * same file portion. In such cases, uprobe_build_map_info() returns
+ * same mm multiple times with different virtual address of one
+ * reference counter. But we don't decrement the reference counter
+ * multiple time because we check for VM_WRITE in sdt_valid_vma().
+ */
static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
{
struct uprobe_map_info *info;
+ int ret;
uprobe_down_write_dup_mmap();
info = uprobe_build_map_info(tu->inode->i_mapping,
@@ -1032,20 +1121,39 @@ static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
if (IS_ERR(info))
goto out;
+ mutex_lock(&tu->sml_lock);
while (info) {
down_write(&info->mm->mmap_sem);
- if (sdt_find_vma(tu, info->mm, info->vaddr))
- sdt_update_ref_ctr(info->mm, info->vaddr, -1);
+ if (sdt_find_vma(tu, info->mm, info->vaddr)) {
+ ret = sdt_update_ref_ctr(info->mm, info->vaddr, -1);
+ if (unlikely(ret))
+ pr_info("Failed to Decrement ref_ctr. (%d)\n", ret);
+ }
up_write(&info->mm->mmap_sem);
+ sdt_del_mm_list(tu, info->mm);
info = uprobe_free_map_info(info);
}
+ mutex_unlock(&tu->sml_lock);
out:
uprobe_up_write_dup_mmap();
}
+static void sdt_mm_release(struct mm_struct *mm)
+{
+ struct trace_uprobe *tu;
+
+ mutex_lock(&uprobe_lock);
+ list_for_each_entry(tu, &uprobe_list, list) {
+ mutex_lock(&tu->sml_lock);
+ sdt_del_mm_list(tu, mm);
+ mutex_unlock(&tu->sml_lock);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
typedef bool (*filter_func_t)(struct uprobe_consumer *self,
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
@@ -1583,6 +1691,7 @@ static __init int init_uprobe_trace(void)
NULL, &uprobe_profile_ops);
uprobe_mmap_callback = trace_uprobe_mmap;
+ uprobe_clear_state_callback = sdt_mm_release;
return 0;
}
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 6/9] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Implement the reference counter logic in trace_uprobe, leaving core
uprobe infrastructure as is, except one new callback from uprobe_mmap()
to trace_uprobe.
trace_uprobe definition with reference counter will now be:
<path>:<offset>[(ref_ctr_offset)]
There are two different cases while enabling the marker,
1. Trace existing process. In this case, find all suitable processes
and increment the reference counter in them.
2. Enable trace before running target binary. In this case, all mmaps
will get notified to trace_uprobe and trace_uprobe will increment
the reference counter if corresponding uprobe is enabled.
At the time of disabling probes, decrement reference counter in all
existing target processes.
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are referring it as 'reference
counter' in kernel / perf code.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[Fengguang reported/fixed build failure]
---
include/linux/uprobes.h | 10 +++
kernel/events/uprobes.c | 21 +++++-
kernel/trace/trace_uprobe.c | 162 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 190 insertions(+), 3 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 7bd2760..2db3ed1 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -122,6 +122,8 @@ struct uprobe_map_info {
unsigned long vaddr;
};
+extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
@@ -136,6 +138,8 @@ struct uprobe_map_info {
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
extern void uprobe_start_dup_mmap(void);
extern void uprobe_end_dup_mmap(void);
+extern void uprobe_down_write_dup_mmap(void);
+extern void uprobe_up_write_dup_mmap(void);
extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
extern void uprobe_free_utask(struct task_struct *t);
extern void uprobe_copy_process(struct task_struct *t, unsigned long flags);
@@ -192,6 +196,12 @@ static inline void uprobe_start_dup_mmap(void)
static inline void uprobe_end_dup_mmap(void)
{
}
+static inline void uprobe_down_write_dup_mmap(void)
+{
+}
+static inline void uprobe_up_write_dup_mmap(void)
+{
+}
static inline void
uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
{
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 096d1e6..e26ad83 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1044,6 +1044,9 @@ static void build_probe_list(struct inode *inode,
spin_unlock(&uprobes_treelock);
}
+/* Rightnow the only user of this is trace_uprobe. */
+void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+
/*
* Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
*
@@ -1056,7 +1059,13 @@ int uprobe_mmap(struct vm_area_struct *vma)
struct uprobe *uprobe, *u;
struct inode *inode;
- if (no_uprobe_events() || !valid_vma(vma, true))
+ if (no_uprobe_events())
+ return 0;
+
+ if (uprobe_mmap_callback)
+ uprobe_mmap_callback(vma);
+
+ if (!valid_vma(vma, true))
return 0;
inode = file_inode(vma->vm_file);
@@ -1247,6 +1256,16 @@ void uprobe_end_dup_mmap(void)
percpu_up_read(&dup_mmap_sem);
}
+void uprobe_down_write_dup_mmap(void)
+{
+ percpu_down_write(&dup_mmap_sem);
+}
+
+void uprobe_up_write_dup_mmap(void)
+{
+ percpu_up_write(&dup_mmap_sem);
+}
+
void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
{
if (test_bit(MMF_HAS_UPROBES, &oldmm->flags)) {
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 0d450b4..1a48b04 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -25,6 +25,8 @@
#include <linux/namei.h>
#include <linux/string.h>
#include <linux/rculist.h>
+#include <linux/sched/mm.h>
+#include <linux/highmem.h>
#include "trace_probe.h"
@@ -58,6 +60,7 @@ struct trace_uprobe {
struct inode *inode;
char *filename;
unsigned long offset;
+ unsigned long ref_ctr_offset;
unsigned long nhit;
struct trace_probe tp;
};
@@ -364,10 +367,10 @@ static int create_trace_uprobe(int argc, char **argv)
{
struct trace_uprobe *tu;
struct inode *inode;
- char *arg, *event, *group, *filename;
+ char *arg, *event, *group, *filename, *rctr, *rctr_end;
char buf[MAX_EVENT_NAME_LEN];
struct path path;
- unsigned long offset;
+ unsigned long offset, ref_ctr_offset;
bool is_delete, is_return;
int i, ret;
@@ -377,6 +380,7 @@ static int create_trace_uprobe(int argc, char **argv)
is_return = false;
event = NULL;
group = NULL;
+ ref_ctr_offset = 0;
/* argc must be >= 1 */
if (argv[0][0] == '-')
@@ -456,6 +460,26 @@ static int create_trace_uprobe(int argc, char **argv)
goto fail_address_parse;
}
+ /* Parse reference counter offset if specified. */
+ rctr = strchr(arg, '(');
+ if (rctr) {
+ rctr_end = strchr(rctr, ')');
+ if (rctr > rctr_end || *(rctr_end + 1) != 0) {
+ ret = -EINVAL;
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+
+ *rctr++ = '\0';
+ *rctr_end = '\0';
+ ret = kstrtoul(rctr, 0, &ref_ctr_offset);
+ if (ret) {
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+ }
+
+ /* Parse uprobe offset. */
ret = kstrtoul(arg, 0, &offset);
if (ret)
goto fail_address_parse;
@@ -490,6 +514,7 @@ static int create_trace_uprobe(int argc, char **argv)
goto fail_address_parse;
}
tu->offset = offset;
+ tu->ref_ctr_offset = ref_ctr_offset;
tu->inode = inode;
tu->filename = kstrdup(filename, GFP_KERNEL);
@@ -622,6 +647,8 @@ static int probes_seq_show(struct seq_file *m, void *v)
break;
}
}
+ if (tu->ref_ctr_offset)
+ seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
for (i = 0; i < tu->tp.nr_args; i++)
seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
@@ -896,6 +923,129 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
return trace_handle_return(s);
}
+static bool sdt_valid_vma(struct trace_uprobe *tu,
+ struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ return tu->ref_ctr_offset &&
+ vma->vm_file &&
+ file_inode(vma->vm_file) == tu->inode &&
+ vma->vm_flags & VM_WRITE &&
+ vma->vm_start <= vaddr &&
+ vma->vm_end > vaddr;
+}
+
+static struct vm_area_struct *sdt_find_vma(struct trace_uprobe *tu,
+ struct mm_struct *mm,
+ unsigned long vaddr)
+{
+ struct vm_area_struct *vma = find_vma(mm, vaddr);
+
+ return (vma && sdt_valid_vma(tu, vma, vaddr)) ? vma : NULL;
+}
+
+/*
+ * Reference counter gate the invocation of probe. If present,
+ * by default reference counter is 0. One needs to increment
+ * it before tracing the probe and decrement it when done.
+ */
+static int
+sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
+{
+ void *kaddr;
+ struct page *page;
+ struct vm_area_struct *vma;
+ int ret = 0;
+ unsigned short *ptr;
+
+ if (vaddr == 0)
+ return -EINVAL;
+
+ ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+ FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
+ if (ret <= 0)
+ return ret;
+
+ kaddr = kmap_atomic(page);
+ ptr = kaddr + (vaddr & ~PAGE_MASK);
+ *ptr += d;
+ kunmap_atomic(kaddr);
+
+ put_page(page);
+ return 0;
+}
+
+static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
+{
+ struct uprobe_map_info *info;
+
+ uprobe_down_write_dup_mmap();
+ info = uprobe_build_map_info(tu->inode->i_mapping,
+ tu->ref_ctr_offset, false);
+ if (IS_ERR(info))
+ goto out;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+
+ if (sdt_find_vma(tu, info->mm, info->vaddr))
+ sdt_update_ref_ctr(info->mm, info->vaddr, 1);
+
+ up_write(&info->mm->mmap_sem);
+ info = uprobe_free_map_info(info);
+ }
+
+out:
+ uprobe_up_write_dup_mmap();
+}
+
+/* Called with down_write(&vma->vm_mm->mmap_sem) */
+static void trace_uprobe_mmap(struct vm_area_struct *vma)
+{
+ struct trace_uprobe *tu;
+ unsigned long vaddr;
+
+ if (!(vma->vm_flags & VM_WRITE))
+ return;
+
+ mutex_lock(&uprobe_lock);
+ list_for_each_entry(tu, &uprobe_list, list) {
+ if (!trace_probe_is_enabled(&tu->tp))
+ continue;
+
+ vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ if (!sdt_valid_vma(tu, vma, vaddr))
+ continue;
+
+ sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
+{
+ struct uprobe_map_info *info;
+
+ uprobe_down_write_dup_mmap();
+ info = uprobe_build_map_info(tu->inode->i_mapping,
+ tu->ref_ctr_offset, false);
+ if (IS_ERR(info))
+ goto out;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+
+ if (sdt_find_vma(tu, info->mm, info->vaddr))
+ sdt_update_ref_ctr(info->mm, info->vaddr, -1);
+
+ up_write(&info->mm->mmap_sem);
+ info = uprobe_free_map_info(info);
+ }
+
+out:
+ uprobe_up_write_dup_mmap();
+}
+
typedef bool (*filter_func_t)(struct uprobe_consumer *self,
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
@@ -941,6 +1091,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
if (ret)
goto err_buffer;
+ if (tu->ref_ctr_offset)
+ sdt_increment_ref_ctr(tu);
+
return 0;
err_buffer:
@@ -981,6 +1134,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
WARN_ON(!uprobe_filter_is_empty(&tu->filter));
+ if (tu->ref_ctr_offset)
+ sdt_decrement_ref_ctr(tu);
+
uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;
@@ -1425,6 +1581,8 @@ static __init int init_uprobe_trace(void)
/* Profile interface */
trace_create_file("uprobe_profile", 0444, d_tracer,
NULL, &uprobe_profile_ops);
+
+ uprobe_mmap_callback = trace_uprobe_mmap;
return 0;
}
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 8/9] trace_uprobe/sdt: Document about reference counter
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reference counter gate the invocation of probe. If present,
by default reference count is 0. Kernel needs to increment
it before tracing the probe and decrement it when done. This
is identical to semaphore in Userspace Statically Defined
Tracepoints (USDT).
Document usage of reference counter.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
Documentation/trace/uprobetracer.txt | 16 +++++++++++++---
kernel/trace/trace.c | 2 +-
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
index bf526a7c..cb6751d 100644
--- a/Documentation/trace/uprobetracer.txt
+++ b/Documentation/trace/uprobetracer.txt
@@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object.
Synopsis of uprobe_tracer
-------------------------
- p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
- r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
- -:[GRP/]EVENT : Clear uprobe or uretprobe event
+ p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
+ r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
+ -:[GRP/]EVENT
+
+ p : Set a uprobe
+ r : Set a return uprobe (uretprobe)
+ - : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
EVENT : Event name. If omitted, the event name is generated based
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
+ REF_CTR_OFFSET: Reference counter offset. Optional field. Reference count
+ gate the invocation of probe. If present, by default
+ reference count is 0. Kernel needs to increment it before
+ tracing the probe and decrement it when done. This is
+ identical to semaphore in Userspace Statically Defined
+ Tracepoints (USDT).
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 300f4ea..d211937 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4604,7 +4604,7 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
"place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
#endif
#ifdef CONFIG_UPROBE_EVENTS
- "\t place: <path>:<offset>\n"
+ " place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
#endif
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 9/9] perf probe: Support SDT markers having reference counter (semaphore)
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
With this, perf buildid-cache will save SDT markers with reference
counter in probe cache. Perf probe will be able to probe markers
having reference counter. Ex,
# readelf -n /tmp/tick | grep -A1 loop2
Name: loop2
... Semaphore: 0x0000000010020036
# ./perf buildid-cache --add /tmp/tick
# ./perf probe sdt_tick:loop2
# ./perf stat -e sdt_tick:loop2 /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop2
2.561851452 seconds time elapsed
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
tools/perf/util/probe-event.c | 39 ++++++++++++++++++++++++++++++++----
tools/perf/util/probe-event.h | 1 +
tools/perf/util/probe-file.c | 34 ++++++++++++++++++++++++++------
tools/perf/util/probe-file.h | 1 +
tools/perf/util/symbol-elf.c | 46 ++++++++++++++++++++++++++++++++-----------
tools/perf/util/symbol.h | 7 +++++++
6 files changed, 106 insertions(+), 22 deletions(-)
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index e1dbc98..9b9c26e 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1832,6 +1832,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
tp->offset = strtoul(fmt2_str, NULL, 10);
}
+ if (tev->uprobes) {
+ fmt2_str = strchr(p, '(');
+ if (fmt2_str)
+ tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
+ }
+
tev->nargs = argc - 2;
tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
if (tev->args == NULL) {
@@ -2025,6 +2031,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg,
return err;
}
+static int
+synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf)
+{
+ struct probe_trace_point *tp = &tev->point;
+ int err;
+
+ err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address);
+
+ if (err >= 0 && tp->ref_ctr_offset) {
+ if (!uprobe_ref_ctr_is_supported())
+ return -1;
+ err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset);
+ }
+ return err >= 0 ? 0 : -1;
+}
+
char *synthesize_probe_trace_command(struct probe_trace_event *tev)
{
struct probe_trace_point *tp = &tev->point;
@@ -2054,15 +2076,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
}
/* Use the tp->address for uprobes */
- if (tev->uprobes)
- err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
- else if (!strncmp(tp->symbol, "0x", 2))
+ if (tev->uprobes) {
+ err = synthesize_uprobe_trace_def(tev, &buf);
+ } else if (!strncmp(tp->symbol, "0x", 2)) {
/* Absolute address. See try_to_find_absolute_address() */
err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
tp->module ? ":" : "", tp->address);
- else
+ } else {
err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
tp->module ? ":" : "", tp->symbol, tp->offset);
+ }
+
if (err)
goto error;
@@ -2646,6 +2670,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev)
{
int i;
char *buf = synthesize_probe_trace_command(tev);
+ struct probe_trace_point *tp = &tev->point;
+
+ if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) {
+ pr_warning("A semaphore is associated with %s:%s and "
+ "seems your kernel doesn't support it.\n",
+ tev->group, tev->event);
+ }
/* Old uprobe event doesn't support memory dereference */
if (!tev->uprobes || tev->nargs == 0 || !buf)
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 45b14f0..15a98c3 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -27,6 +27,7 @@ struct probe_trace_point {
char *symbol; /* Base symbol */
char *module; /* Module name */
unsigned long offset; /* Offset from symbol */
+ unsigned long ref_ctr_offset; /* SDT reference counter offset */
unsigned long address; /* Actual address of the trace point */
bool retprobe; /* Return probe flag */
};
diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 4ae1123..a17ba6a 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -697,8 +697,16 @@ int probe_cache__add_entry(struct probe_cache *pcache,
#ifdef HAVE_GELF_GETNOTE_SUPPORT
static unsigned long long sdt_note__get_addr(struct sdt_note *note)
{
- return note->bit32 ? (unsigned long long)note->addr.a32[0]
- : (unsigned long long)note->addr.a64[0];
+ return note->bit32 ?
+ (unsigned long long)note->addr.a32[SDT_NOTE_IDX_LOC] :
+ (unsigned long long)note->addr.a64[SDT_NOTE_IDX_LOC];
+}
+
+static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note)
+{
+ return note->bit32 ?
+ (unsigned long long)note->addr.a32[SDT_NOTE_IDX_REFCTR] :
+ (unsigned long long)note->addr.a64[SDT_NOTE_IDX_REFCTR];
}
static const char * const type_to_suffix[] = {
@@ -776,14 +784,21 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note,
{
struct strbuf buf;
char *ret = NULL, **args;
- int i, args_count;
+ int i, args_count, err;
+ unsigned long long ref_ctr_offset;
if (strbuf_init(&buf, 32) < 0)
return NULL;
- if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
- sdtgrp, note->name, pathname,
- sdt_note__get_addr(note)) < 0)
+ err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
+ sdtgrp, note->name, pathname,
+ sdt_note__get_addr(note));
+
+ ref_ctr_offset = sdt_note__get_ref_ctr_offset(note);
+ if (ref_ctr_offset && err >= 0)
+ err = strbuf_addf(&buf, "(0x%llx)", ref_ctr_offset);
+
+ if (err < 0)
goto error;
if (!note->args)
@@ -999,6 +1014,7 @@ int probe_cache__show_all_caches(struct strfilter *filter)
enum ftrace_readme {
FTRACE_README_PROBE_TYPE_X = 0,
FTRACE_README_KRETPROBE_OFFSET,
+ FTRACE_README_UPROBE_REF_CTR,
FTRACE_README_END,
};
@@ -1010,6 +1026,7 @@ enum ftrace_readme {
[idx] = {.pattern = pat, .avail = false}
DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"),
+ DEFINE_TYPE(FTRACE_README_UPROBE_REF_CTR, "*ref_ctr_offset*"),
};
static bool scan_ftrace_readme(enum ftrace_readme type)
@@ -1065,3 +1082,8 @@ bool kretprobe_offset_is_supported(void)
{
return scan_ftrace_readme(FTRACE_README_KRETPROBE_OFFSET);
}
+
+bool uprobe_ref_ctr_is_supported(void)
+{
+ return scan_ftrace_readme(FTRACE_README_UPROBE_REF_CTR);
+}
diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
index 63f29b1..2a24918 100644
--- a/tools/perf/util/probe-file.h
+++ b/tools/perf/util/probe-file.h
@@ -69,6 +69,7 @@ struct probe_cache_entry *probe_cache__find_by_name(struct probe_cache *pcache,
int probe_cache__show_all_caches(struct strfilter *filter);
bool probe_type_is_available(enum probe_type type);
bool kretprobe_offset_is_supported(void);
+bool uprobe_ref_ctr_is_supported(void);
#else /* ! HAVE_LIBELF_SUPPORT */
static inline struct probe_cache *probe_cache__new(const char *tgt __maybe_unused, struct nsinfo *nsi __maybe_unused)
{
diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index 2de7705..45b7dba 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -1803,6 +1803,34 @@ void kcore_extract__delete(struct kcore_extract *kce)
}
#ifdef HAVE_GELF_GETNOTE_SUPPORT
+
+static void sdt_adjust_loc(struct sdt_note *tmp, GElf_Addr base_off)
+{
+ if (!base_off)
+ return;
+
+ if (tmp->bit32)
+ tmp->addr.a32[SDT_NOTE_IDX_LOC] =
+ tmp->addr.a32[SDT_NOTE_IDX_LOC] + base_off -
+ tmp->addr.a32[SDT_NOTE_IDX_BASE];
+ else
+ tmp->addr.a64[SDT_NOTE_IDX_LOC] =
+ tmp->addr.a64[SDT_NOTE_IDX_LOC] + base_off -
+ tmp->addr.a64[SDT_NOTE_IDX_BASE];
+}
+
+static void sdt_adjust_refctr(struct sdt_note *tmp, GElf_Addr base_addr,
+ GElf_Addr base_off)
+{
+ if (!base_off)
+ return;
+
+ if (tmp->bit32)
+ tmp->addr.a32[SDT_NOTE_IDX_REFCTR] -= (base_addr - base_off);
+ else
+ tmp->addr.a64[SDT_NOTE_IDX_REFCTR] -= (base_addr - base_off);
+}
+
/**
* populate_sdt_note : Parse raw data and identify SDT note
* @elf: elf of the opened file
@@ -1820,7 +1848,6 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
const char *provider, *name, *args;
struct sdt_note *tmp = NULL;
GElf_Ehdr ehdr;
- GElf_Addr base_off = 0;
GElf_Shdr shdr;
int ret = -EINVAL;
@@ -1916,17 +1943,12 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
* base address in the description of the SDT note. If its different,
* then accordingly, adjust the note location.
*/
- if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_BASE_SCN, NULL)) {
- base_off = shdr.sh_offset;
- if (base_off) {
- if (tmp->bit32)
- tmp->addr.a32[0] = tmp->addr.a32[0] + base_off -
- tmp->addr.a32[1];
- else
- tmp->addr.a64[0] = tmp->addr.a64[0] + base_off -
- tmp->addr.a64[1];
- }
- }
+ if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_BASE_SCN, NULL))
+ sdt_adjust_loc(tmp, shdr.sh_offset);
+
+ /* Adjust reference counter offset */
+ if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_PROBES_SCN, NULL))
+ sdt_adjust_refctr(tmp, shdr.sh_addr, shdr.sh_offset);
list_add_tail(&tmp->note_list, sdt_notes);
return 0;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 70c16741..aa095bf 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -384,12 +384,19 @@ struct sdt_note {
int cleanup_sdt_note_list(struct list_head *sdt_notes);
int sdt_notes__get_count(struct list_head *start);
+#define SDT_PROBES_SCN ".probes"
#define SDT_BASE_SCN ".stapsdt.base"
#define SDT_NOTE_SCN ".note.stapsdt"
#define SDT_NOTE_TYPE 3
#define SDT_NOTE_NAME "stapsdt"
#define NR_ADDR 3
+enum {
+ SDT_NOTE_IDX_LOC = 0,
+ SDT_NOTE_IDX_BASE,
+ SDT_NOTE_IDX_REFCTR,
+};
+
struct mem_info *mem_info__new(void);
struct mem_info *mem_info__get(struct mem_info *mi);
void mem_info__put(struct mem_info *mi);
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 5/9] Uprobe: Export uprobe_map_info along with uprobe_{build/free}_map_info()
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Given the file(inode) and offset, build_map_info() finds all
existing mm that map the portion of file containing offset.
Exporting these functions and data structure will help to use
them in other set of files.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
include/linux/uprobes.h | 9 +++++++++
kernel/events/uprobes.c | 14 +++-----------
2 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e9..7bd2760 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -109,12 +109,19 @@ enum rp_check {
RP_CHECK_RET,
};
+struct address_space;
struct xol_area;
struct uprobes_state {
struct xol_area *xol_area;
};
+struct uprobe_map_info {
+ struct uprobe_map_info *next;
+ struct mm_struct *mm;
+ unsigned long vaddr;
+};
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
@@ -149,6 +156,8 @@ struct uprobes_state {
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
+extern struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info);
+extern struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, loff_t offset, bool is_register);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 477dc42..096d1e6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -695,14 +695,7 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}
-struct uprobe_map_info {
- struct uprobe_map_info *next;
- struct mm_struct *mm;
- unsigned long vaddr;
-};
-
-static inline struct uprobe_map_info *
-uprobe_free_map_info(struct uprobe_map_info *info)
+struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info)
{
struct uprobe_map_info *next = info->next;
mmput(info->mm);
@@ -710,9 +703,8 @@ struct uprobe_map_info {
return next;
}
-static struct uprobe_map_info *
-uprobe_build_map_info(struct address_space *mapping, loff_t offset,
- bool is_register)
+struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping,
+ loff_t offset, bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 4/9] Uprobe: Rename map_info to uprobe_map_info
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
map_info is very generic name, rename it to uprobe_map_info.
Renaming will help to export this structure outside of the
file.
Also rename free_map_info() to uprobe_free_map_info() and
build_map_info() to uprobe_build_map_info().
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
kernel/events/uprobes.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d439c7..477dc42 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -695,28 +695,30 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}
-struct map_info {
- struct map_info *next;
+struct uprobe_map_info {
+ struct uprobe_map_info *next;
struct mm_struct *mm;
unsigned long vaddr;
};
-static inline struct map_info *free_map_info(struct map_info *info)
+static inline struct uprobe_map_info *
+uprobe_free_map_info(struct uprobe_map_info *info)
{
- struct map_info *next = info->next;
+ struct uprobe_map_info *next = info->next;
mmput(info->mm);
kfree(info);
return next;
}
-static struct map_info *
-build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
+static struct uprobe_map_info *
+uprobe_build_map_info(struct address_space *mapping, loff_t offset,
+ bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
- struct map_info *curr = NULL;
- struct map_info *prev = NULL;
- struct map_info *info;
+ struct uprobe_map_info *curr = NULL;
+ struct uprobe_map_info *prev = NULL;
+ struct uprobe_map_info *info;
int more = 0;
again:
@@ -730,7 +732,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
* Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
* reclaim. This is optimistic, no harm done if it fails.
*/
- prev = kmalloc(sizeof(struct map_info),
+ prev = kmalloc(sizeof(struct uprobe_map_info),
GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (prev)
prev->next = NULL;
@@ -763,7 +765,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
}
do {
- info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
+ info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL);
if (!info) {
curr = ERR_PTR(-ENOMEM);
goto out;
@@ -786,11 +788,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
{
bool is_register = !!new;
- struct map_info *info;
+ struct uprobe_map_info *info;
int err = 0;
percpu_down_write(&dup_mmap_sem);
- info = build_map_info(uprobe->inode->i_mapping,
+ info = uprobe_build_map_info(uprobe->inode->i_mapping,
uprobe->offset, is_register);
if (IS_ERR(info)) {
err = PTR_ERR(info);
@@ -828,7 +830,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
unlock:
up_write(&mm->mmap_sem);
free:
- info = free_map_info(info);
+ info = uprobe_free_map_info(info);
}
out:
percpu_up_write(&dup_mmap_sem);
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 3/9] Uprobe: Move mmput() into free_map_info()
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Oleg Nesterov <oleg@redhat.com>
build_map_info() has a side effect like one need to perform
mmput() when done with the mm. Add mmput() in free_map_info()
so that user does not have to call it explicitly.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
kernel/events/uprobes.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 535fd39..1d439c7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -704,6 +704,7 @@ struct map_info {
static inline struct map_info *free_map_info(struct map_info *info)
{
struct map_info *next = info->next;
+ mmput(info->mm);
kfree(info);
return next;
}
@@ -773,8 +774,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
goto again;
out:
- while (prev)
- prev = free_map_info(prev);
+ while (prev) {
+ info = prev;
+ prev = prev->next;
+ kfree(info);
+ }
return curr;
}
@@ -824,7 +828,6 @@ static inline struct map_info *free_map_info(struct map_info *info)
unlock:
up_write(&mm->mmap_sem);
free:
- mmput(mm);
info = free_map_info(info);
}
out:
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 1/9] Uprobe: Export vaddr <-> offset conversion functions
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
In-Reply-To: <20180417043244.7501-1-ravi.bangoria@linux.vnet.ibm.com>
From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
These are generic functions which operates on file offset
and virtual address. Make these functions available outside
of uprobe code so that other can use it as well.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
include/linux/mm.h | 12 ++++++++++++
kernel/events/uprobes.c | 10 ----------
2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ccac106..de0cc08 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2272,6 +2272,18 @@ struct vm_unmapped_area_info {
return unmapped_area(info);
}
+static inline unsigned long
+offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+{
+ return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+}
+
+static inline loff_t
+vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
+{
+ return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
+}
+
/* truncate.c */
extern void truncate_inode_pages(struct address_space *, loff_t);
extern void truncate_inode_pages_range(struct address_space *,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ce6848e..bd6f230 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -130,16 +130,6 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
return vma->vm_file && (vma->vm_flags & flags) == VM_MAYEXEC;
}
-static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
-{
- return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-}
-
-static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
-{
- return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
-}
-
/**
* __replace_page - replace page in vma by new page.
* based on replace_page in mm/ksm.c
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 0/9] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-04-17 4:32 UTC (permalink / raw)
To: mhiramat, oleg, peterz, srikar, rostedt
Cc: acme, ananth, akpm, alexander.shishkin, alexis.berlemont, corbet,
dan.j.williams, jolsa, kan.liang, kjlx, kstewart, linux-doc,
linux-kernel, linux-mm, milian.wolff, mingo, namhyung,
naveen.n.rao, pc, tglx, yao.jin, fengguang.wu, jglisse,
Ravi Bangoria
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Currently, perf tool has limited supports for SDT markers. I.e. it
can not trace markers surrounded by reference counter. Also, it's
not easy to add reference counter logic in userspace tool like perf,
so basic idea for this patchset is to add reference counter logic in
the trace_uprobe infrastructure. Ex,[2]
# cat tick.c
...
for (i = 0; i < 100; i++) {
DTRACE_PROBE1(tick, loop1, i);
if (TICK_LOOP2_ENABLED()) {
DTRACE_PROBE1(tick, loop2, i);
}
printf("hi: %d\n", i);
sleep(1);
}
...
Here tick:loop1 is marker without reference counter where as tick:loop2
is surrounded by reference counter condition.
# perf buildid-cache --add /tmp/tick
# perf probe sdt_tick:loop1
# perf probe sdt_tick:loop2
# perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop1
0 sdt_tick:loop2
2.747086086 seconds time elapsed
Perf failed to record data for tick:loop2. Same experiment with this
patch series:
# ./perf buildid-cache --add /tmp/tick
# ./perf probe sdt_tick:loop2
# ./perf stat -e sdt_tick:loop2 /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop2
2.561851452 seconds time elapsed
Note:
- 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are referring it as 'reference
counter' in kernel / perf code.
v3 changes:
- [PATCH v3 6/9] Fix build failure.
- [PATCH v3 6/9] Move uprobe_mmap_callback() after no_uprobe_events()
check. Actually, it should be moved after MMF_HAS_UPROBES as well.
But current implementation is sub-optimal. If there are multiple
instances of same application running and user wants to trace any
particular instance, trace_uprobe is updating reference counter in
all instances. This is not a problem on user side because instruction
is not replaced with trap/int3 and thus user will only see samples
from his interested process. But still this is more of a correctness
issue. I'm working on a fix for this. Once this gets fixed, we can
move uprobe_mmap_callback() call after MMF_HAS_UPROBES.
- [PATCH v3 7/9] Remove mmu_notifier. Instead, use callback from
uprobe_clear_state(). Again, uprobe_clear_state_callback() should be
moved after MMF_HAS_UPROBES. But that should be done when move
uprobe_mmap_callback() first.
- [PATCH v3 7/9] Properly handle error cases for sdt_increment_ref_ctr()
and trace_uprobe_mmap().
- [PATCH v3 9/9] Show warning if kernel doesn't support ref_ctr logic
and user tries to use it. Also, return error in this case instead
of adding entry in uprobe_events.
- [PATCH v3 9/9] Don't check kernel ref_ctr support while adding files
into buildid-cache.
v2 can be found at:
https://lkml.org/lkml/2018/4/4/127
v2 changes:
- [PATCH v2 3/9] is new. build_map_info() has a side effect. One has
to perform mmput() when he is done with the mm. Let free_map_info()
take care of mmput() so that one does not need to worry about it.
- [PATCH v2 6/9] sdt_update_ref_ctr(). No need to use memcpy().
Reference counter can be directly updated using normal assignment.
- [PATCH v2 6/9] Check valid vma is returned by sdt_find_vma() before
incrementing / decrementing a reference counter.
- [PATCH v2 6/9] Introduce utility functions for taking write lock on
dup_mmap_sem. Use these functions in trace_uprobe to avoid race with
fork / dup_mmap().
- [PATCH v2 6/9] Don't check presence of mm in tu->sml at decrement
time. Purpose of maintaining the list is to ensure increment happen
only once for each {trace_uprobe,mm} tuple.
- [PATCH v2 7/9] v1 was not removing mm from tu->sml when process
exits and tracing is still on. This leads to a problem if same
address gets used by new mm. Use mmu_notifier to remove such mm
from the list. This guarantees that all mm which has been added
to tu->sml will be removed from list either when tracing ends or
when process goes away.
- [PATCH v2 7/9] Patch description was misleading. Change it. Add
more generic python example.
- [PATCH v2 7/9] Convert sml_rw_sem into mutex sml_lock.
- [PATCH v2 7/9] Use builtin linked list in sdt_mm_list instead of
defining it's own pointer chain.
- Change the order of last two patches.
- [PATCH v2 9/9] Check availability of ref_ctr_offset support by
trace_uprobe infrastructure before using it. This ensures newer
perf tool will still work on older kernels which does not support
trace_uprobe with reference counter.
- Other changes as suggested by Masami, Oleg and Steve.
v1 can be found at:
https://lkml.org/lkml/2018/3/13/432
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
[2] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
[3] https://lkml.org/lkml/2017/12/6/976
Oleg Nesterov (1):
Uprobe: Move mmput() into free_map_info()
Ravi Bangoria (8):
Uprobe: Export vaddr <-> offset conversion functions
mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
Uprobe: Rename map_info to uprobe_map_info
Uprobe: Export uprobe_map_info along with
uprobe_{build/free}_map_info()
trace_uprobe: Support SDT markers having reference count (semaphore)
trace_uprobe/sdt: Fix multiple update of same reference counter
trace_uprobe/sdt: Document about reference counter
perf probe: Support SDT markers having reference counter (semaphore)
Documentation/trace/uprobetracer.txt | 16 ++-
include/linux/mm.h | 12 ++
include/linux/uprobes.h | 20 +++
kernel/events/uprobes.c | 90 +++++++-----
kernel/trace/trace.c | 2 +-
kernel/trace/trace_uprobe.c | 271 ++++++++++++++++++++++++++++++++++-
tools/perf/util/probe-event.c | 39 ++++-
tools/perf/util/probe-event.h | 1 +
tools/perf/util/probe-file.c | 34 ++++-
tools/perf/util/probe-file.h | 1 +
tools/perf/util/symbol-elf.c | 46 ++++--
tools/perf/util/symbol.h | 7 +
12 files changed, 472 insertions(+), 67 deletions(-)
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/7] powerpc: Use TIDR CPU feature to control TIDR allocation
From: Andrew Donnellan @ 2018-04-17 4:21 UTC (permalink / raw)
To: Alastair D'Silva, linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd, fbarrat,
corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-3-alastair@au1.ibm.com>
On 17/04/18 12:09, Alastair D'Silva wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
>
> Switch the use of TIDR on it's CPU feature, rather than assuming it
> is available based on architecture.
>
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
There's a use of TIDR in restore_sprs() that's behind the ARCH_300 flag
as well, ideally it should never trigger in the !P9_TIDR case, but you
might want to update that too for clarity?
> ---
> arch/powerpc/kernel/process.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 1237f13fed51..a3e0a3e06d5a 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1570,7 +1570,7 @@ void clear_thread_tidr(struct task_struct *t)
> if (!t->thread.tidr)
> return;
>
> - if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> + if (!cpu_has_feature(CPU_FTR_P9_TIDR)) {
> WARN_ON_ONCE(1);
> return;
> }
> @@ -1593,7 +1593,7 @@ int set_thread_tidr(struct task_struct *t)
> {
> int rc;
>
> - if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + if (!cpu_has_feature(CPU_FTR_P9_TIDR))
> return -EINVAL;
>
> if (t != current)
>
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com IBM Australia Limited
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 1/7] powerpc: Add TIDR CPU feature for Power9
From: Andrew Donnellan @ 2018-04-17 4:09 UTC (permalink / raw)
To: Alastair D'Silva, linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd, fbarrat,
corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-2-alastair@au1.ibm.com>
On 17/04/18 12:09, Alastair D'Silva wrote:
> diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
> index be8c9fa23983..5b03d8a82409 100644
> --- a/arch/powerpc/include/asm/switch_to.h
> +++ b/arch/powerpc/include/asm/switch_to.h
> @@ -94,6 +94,5 @@ static inline void clear_task_ebb(struct task_struct *t)
> extern int set_thread_uses_vas(void);
>
> extern int set_thread_tidr(struct task_struct *t);
> -extern void clear_thread_tidr(struct task_struct *t);
This hunk looks like it really belongs in patch 3.
Apart from that, I'm not really familiar with the CPU features code but
nothing seems overly wrong...
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com IBM Australia Limited
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 7/7] ocxl: Document new OCXL IOCTLs
From: Andrew Donnellan @ 2018-04-17 3:45 UTC (permalink / raw)
To: Alastair D'Silva, linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd, fbarrat,
corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-8-alastair@au1.ibm.com>
On 17/04/18 12:09, Alastair D'Silva wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
>
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
> ---
> Documentation/accelerators/ocxl.rst | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/Documentation/accelerators/ocxl.rst b/Documentation/accelerators/ocxl.rst
> index ddcc58d01cfb..144595a80a1c 100644
> --- a/Documentation/accelerators/ocxl.rst
> +++ b/Documentation/accelerators/ocxl.rst
> @@ -157,6 +157,16 @@ OCXL_IOCTL_GET_METADATA:
> Obtains configuration information from the card, such at the size of
> MMIO areas, the AFU version, and the PASID for the current context.
>
> +OCXL_IOCTL_ENABLE_P9_WAIT:
> +
> + Allows the AFU to wake a userspace thread executing 'wait'. Returns
> + information to userspace to allow it to configure the AFU.
Note that this is only available on POWER9.
> +
> +OCXL_IOCTL_GET_PLATFORM:
> +
> + Notifies userspace as to the platform the kernel believes we are on,
> + which may differ from what userspace believes. Also reports on which CPU
> + features which are usable from userspace.
The first sentence here doesn't seem to relate to anything that
GET_PLATFORM actually does - afaict you're just passing flags which I
suppose imply what the correct platform is, but really they're just
feature flags?
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com IBM Australia Limited
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH 4/7] ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
The function removes the process element from NPU cache.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
arch/powerpc/platforms/powernv/ocxl.c | 4 ++--
drivers/misc/ocxl/link.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index f6945d3bc971..208b5503f4ed 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -28,7 +28,7 @@ extern int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask,
void **platform_data);
extern void pnv_ocxl_spa_release(void *platform_data);
-extern int pnv_ocxl_spa_remove_pe(void *platform_data, int pe_handle);
+extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
extern void pnv_ocxl_free_xive_irq(u32 irq);
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index fa9b53af3c7b..8c65aacda9c8 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -475,7 +475,7 @@ void pnv_ocxl_spa_release(void *platform_data)
}
EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
-int pnv_ocxl_spa_remove_pe(void *platform_data, int pe_handle)
+int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
{
struct spa_data *data = (struct spa_data *) platform_data;
int rc;
@@ -483,7 +483,7 @@ int pnv_ocxl_spa_remove_pe(void *platform_data, int pe_handle)
rc = opal_npu_spa_clear_cache(data->phb_opal_id, data->bdfn, pe_handle);
return rc;
}
-EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe);
+EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe_from_cache);
int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr)
{
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index f30790582dc0..656e8610eec2 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -599,7 +599,7 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
* On powerpc, the entry needs to be cleared from the context
* cache of the NPU.
*/
- rc = pnv_ocxl_spa_remove_pe(link->platform_data, pe_handle);
+ rc = pnv_ocxl_spa_remove_pe_from_cache(link->platform_data, pe_handle);
WARN_ON(rc);
pe_data = radix_tree_delete(&spa->pe_tree, pe_handle);
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 5/7] ocxl: Expose the thread_id needed for wait on p9
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
In order to successfully issue as_notify, an AFU needs to know the TID
to notify, which in turn means that this information should be
available in userspace so it can be communicated to the AFU.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
drivers/misc/ocxl/context.c | 5 +++-
drivers/misc/ocxl/file.c | 53 +++++++++++++++++++++++++++++++++++++++
drivers/misc/ocxl/link.c | 36 ++++++++++++++++++++++++++
drivers/misc/ocxl/ocxl_internal.h | 1 +
include/misc/ocxl.h | 9 +++++++
include/uapi/misc/ocxl.h | 10 ++++++++
6 files changed, 113 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index 909e8807824a..95f74623113e 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -34,6 +34,8 @@ int ocxl_context_init(struct ocxl_context *ctx, struct ocxl_afu *afu,
mutex_init(&ctx->xsl_error_lock);
mutex_init(&ctx->irq_lock);
idr_init(&ctx->irq_idr);
+ ctx->tidr = 0;
+
/*
* Keep a reference on the AFU to make sure it's valid for the
* duration of the life of the context
@@ -65,6 +67,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
{
int rc;
+ // Locks both status & tidr
mutex_lock(&ctx->status_mutex);
if (ctx->status != OPENED) {
rc = -EIO;
@@ -72,7 +75,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
}
rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid,
- current->mm->context.id, 0, amr, current->mm,
+ current->mm->context.id, ctx->tidr, amr, current->mm,
xsl_fault_error, ctx);
if (rc)
goto out;
diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index 038509e5d031..eb409a469f21 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -5,6 +5,8 @@
#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <uapi/misc/ocxl.h>
+#include <asm/reg.h>
+#include <asm/switch_to.h>
#include "ocxl_internal.h"
@@ -123,11 +125,55 @@ static long afu_ioctl_get_metadata(struct ocxl_context *ctx,
return 0;
}
+#ifdef CONFIG_PPC64
+static long afu_ioctl_enable_p9_wait(struct ocxl_context *ctx,
+ struct ocxl_ioctl_p9_wait __user *uarg)
+{
+ struct ocxl_ioctl_p9_wait arg;
+
+ memset(&arg, 0, sizeof(arg));
+
+ if (cpu_has_feature(CPU_FTR_P9_TIDR)) {
+ enum ocxl_context_status status;
+
+ // Locks both status & tidr
+ mutex_lock(&ctx->status_mutex);
+ if (!ctx->tidr) {
+ if (set_thread_tidr(current))
+ return -ENOENT;
+
+ ctx->tidr = current->thread.tidr;
+ }
+
+ status = ctx->status;
+ mutex_unlock(&ctx->status_mutex);
+
+ if (status == ATTACHED) {
+ int rc;
+ struct link *link = ctx->afu->fn->link;
+
+ rc = ocxl_link_update_pe(link, ctx->pasid, ctx->tidr);
+ if (rc)
+ return rc;
+ }
+
+ arg.thread_id = ctx->tidr;
+ } else
+ return -ENOENT;
+
+ if (copy_to_user(uarg, &arg, sizeof(arg)))
+ return -EFAULT;
+
+ return 0;
+}
+#endif
+
#define CMD_STR(x) (x == OCXL_IOCTL_ATTACH ? "ATTACH" : \
x == OCXL_IOCTL_IRQ_ALLOC ? "IRQ_ALLOC" : \
x == OCXL_IOCTL_IRQ_FREE ? "IRQ_FREE" : \
x == OCXL_IOCTL_IRQ_SET_FD ? "IRQ_SET_FD" : \
x == OCXL_IOCTL_GET_METADATA ? "GET_METADATA" : \
+ x == OCXL_IOCTL_ENABLE_P9_WAIT ? "ENABLE_P9_WAIT" : \
"UNKNOWN")
static long afu_ioctl(struct file *file, unsigned int cmd,
@@ -186,6 +232,13 @@ static long afu_ioctl(struct file *file, unsigned int cmd,
(struct ocxl_ioctl_metadata __user *) args);
break;
+#ifdef CONFIG_PPC64
+ case OCXL_IOCTL_ENABLE_P9_WAIT:
+ rc = afu_ioctl_enable_p9_wait(ctx,
+ (struct ocxl_ioctl_p9_wait __user *) args);
+ break;
+#endif
+
default:
rc = -EINVAL;
}
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 656e8610eec2..88876ae8f330 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -544,6 +544,42 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
}
EXPORT_SYMBOL_GPL(ocxl_link_add_pe);
+int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid)
+{
+ struct link *link = (struct link *) link_handle;
+ struct spa *spa = link->spa;
+ struct ocxl_process_element *pe;
+ int pe_handle, rc;
+
+ if (pasid > SPA_PASID_MAX)
+ return -EINVAL;
+
+ pe_handle = pasid & SPA_PE_MASK;
+ pe = spa->spa_mem + pe_handle;
+
+ mutex_lock(&spa->spa_lock);
+
+ pe->tid = tid;
+
+ /*
+ * The barrier makes sure the PE is updated
+ * before we clear the NPU context cache below, so that the
+ * old PE cannot be reloaded erroneously.
+ */
+ mb();
+
+ /*
+ * hook to platform code
+ * On powerpc, the entry needs to be cleared from the context
+ * cache of the NPU.
+ */
+ rc = pnv_ocxl_spa_remove_pe_from_cache(link->platform_data, pe_handle);
+ WARN_ON(rc);
+
+ mutex_unlock(&spa->spa_lock);
+ return rc;
+}
+
int ocxl_link_remove_pe(void *link_handle, int pasid)
{
struct link *link = (struct link *) link_handle;
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
index 5d421824afd9..6c6d4e61888e 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -77,6 +77,7 @@ struct ocxl_context {
struct ocxl_xsl_error xsl_error;
struct mutex irq_lock;
struct idr irq_idr;
+ __u16 tidr; // Thread ID used for P9 wait implementation
};
struct ocxl_process_element {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 51ccf76db293..9ff6ddc28e22 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -188,6 +188,15 @@ extern int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
void (*xsl_err_cb)(void *data, u64 addr, u64 dsisr),
void *xsl_err_data);
+/**
+ * Update values within a Process Element
+ *
+ * link_handle: the link handle associated with the process element
+ * pasid: the PASID for the AFU context
+ * tid: the new thread id for the process element
+ */
+extern int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
+
/*
* Remove a Process Element from the Shared Process Area for a link
*/
diff --git a/include/uapi/misc/ocxl.h b/include/uapi/misc/ocxl.h
index 0af83d80fb3e..8d2748e69c84 100644
--- a/include/uapi/misc/ocxl.h
+++ b/include/uapi/misc/ocxl.h
@@ -48,6 +48,15 @@ struct ocxl_ioctl_metadata {
__u64 reserved[13]; // Total of 16*u64
};
+struct ocxl_ioctl_p9_wait {
+ __u16 thread_id; // The thread ID required to wake this thread
+ __u16 reserved1;
+ __u32 reserved2;
+ __u64 reserved3[3];
+};
+
+};
+
struct ocxl_ioctl_irq_fd {
__u64 irq_offset;
__s32 eventfd;
@@ -62,5 +71,6 @@ struct ocxl_ioctl_irq_fd {
#define OCXL_IOCTL_IRQ_FREE _IOW(OCXL_MAGIC, 0x12, __u64)
#define OCXL_IOCTL_IRQ_SET_FD _IOW(OCXL_MAGIC, 0x13, struct ocxl_ioctl_irq_fd)
#define OCXL_IOCTL_GET_METADATA _IOR(OCXL_MAGIC, 0x14, struct ocxl_ioctl_metadata)
+#define OCXL_IOCTL_ENABLE_P9_WAIT _IOR(OCXL_MAGIC, 0x15, struct ocxl_ioctl_p9_wait)
#endif /* _UAPI_MISC_OCXL_H */
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 6/7] ocxl: Add an IOCTL so userspace knows which platform the kernel requires
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
In order for a userspace AFU driver to call the Power9 specific
OCXL_IOCTL_ENABLE_P9_WAIT, it needs to verify that it can actually
make that call.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
drivers/misc/ocxl/file.c | 25 +++++++++++++++++++++++++
include/uapi/misc/ocxl.h | 4 ++++
2 files changed, 29 insertions(+)
diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index eb409a469f21..5a9f4f85aafd 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -168,12 +168,32 @@ static long afu_ioctl_enable_p9_wait(struct ocxl_context *ctx,
}
#endif
+
+static long afu_ioctl_get_platform(struct ocxl_context *ctx,
+ struct ocxl_ioctl_platform __user *uarg)
+{
+ struct ocxl_ioctl_platform arg;
+
+ memset(&arg, 0, sizeof(arg));
+
+#ifdef CONFIG_PPC64
+ if (cpu_has_feature(CPU_FTR_P9_TIDR))
+ arg.flags[0] |= OCXL_IOCTL_PLATFORM_FLAGS0_P9_WAIT;
+#endif
+
+ if (copy_to_user(uarg, &arg, sizeof(arg)))
+ return -EFAULT;
+
+ return 0;
+}
+
#define CMD_STR(x) (x == OCXL_IOCTL_ATTACH ? "ATTACH" : \
x == OCXL_IOCTL_IRQ_ALLOC ? "IRQ_ALLOC" : \
x == OCXL_IOCTL_IRQ_FREE ? "IRQ_FREE" : \
x == OCXL_IOCTL_IRQ_SET_FD ? "IRQ_SET_FD" : \
x == OCXL_IOCTL_GET_METADATA ? "GET_METADATA" : \
x == OCXL_IOCTL_ENABLE_P9_WAIT ? "ENABLE_P9_WAIT" : \
+ x == OCXL_IOCTL_GET_PLATFORM ? "GET_PLATFORM" : \
"UNKNOWN")
static long afu_ioctl(struct file *file, unsigned int cmd,
@@ -239,6 +259,11 @@ static long afu_ioctl(struct file *file, unsigned int cmd,
break;
#endif
+ case OCXL_IOCTL_GET_PLATFORM:
+ rc = afu_ioctl_get_platform(ctx,
+ (struct ocxl_ioctl_platform __user *) args);
+ break;
+
default:
rc = -EINVAL;
}
diff --git a/include/uapi/misc/ocxl.h b/include/uapi/misc/ocxl.h
index 8d2748e69c84..7bdd3efcf294 100644
--- a/include/uapi/misc/ocxl.h
+++ b/include/uapi/misc/ocxl.h
@@ -55,6 +55,9 @@ struct ocxl_ioctl_p9_wait {
__u64 reserved3[3];
};
+#define OCXL_IOCTL_PLATFORM_FLAGS0_P9_WAIT 0x01
+struct ocxl_ioctl_platform {
+ __u64 flags[4];
};
struct ocxl_ioctl_irq_fd {
@@ -72,5 +75,6 @@ struct ocxl_ioctl_irq_fd {
#define OCXL_IOCTL_IRQ_SET_FD _IOW(OCXL_MAGIC, 0x13, struct ocxl_ioctl_irq_fd)
#define OCXL_IOCTL_GET_METADATA _IOR(OCXL_MAGIC, 0x14, struct ocxl_ioctl_metadata)
#define OCXL_IOCTL_ENABLE_P9_WAIT _IOR(OCXL_MAGIC, 0x15, struct ocxl_ioctl_p9_wait)
+#define OCXL_IOCTL_GET_PLATFORM _IOR(OCXL_MAGIC, 0x16, struct ocxl_ioctl_platform)
#endif /* _UAPI_MISC_OCXL_H */
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 2/7] powerpc: Use TIDR CPU feature to control TIDR allocation
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
Switch the use of TIDR on it's CPU feature, rather than assuming it
is available based on architecture.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
arch/powerpc/kernel/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 1237f13fed51..a3e0a3e06d5a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1570,7 +1570,7 @@ void clear_thread_tidr(struct task_struct *t)
if (!t->thread.tidr)
return;
- if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+ if (!cpu_has_feature(CPU_FTR_P9_TIDR)) {
WARN_ON_ONCE(1);
return;
}
@@ -1593,7 +1593,7 @@ int set_thread_tidr(struct task_struct *t)
{
int rc;
- if (!cpu_has_feature(CPU_FTR_ARCH_300))
+ if (!cpu_has_feature(CPU_FTR_P9_TIDR))
return -EINVAL;
if (t != current)
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 3/7] powerpc: use task_pid_nr() for TID allocation
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
The current implementation of TID allocation, using a global IDR, may
result in an errant process starving the system of available TIDs.
Instead, use task_pid_nr(), as mentioned by the original author. The
scenario described which prevented it's use is not applicable, as
set_thread_tidr can only be called after the task struct has been
populated.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
arch/powerpc/kernel/process.c | 97 +------------------------------------------
1 file changed, 1 insertion(+), 96 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a3e0a3e06d5a..56ff7eb5ff79 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1496,103 +1496,12 @@ int set_thread_uses_vas(void)
}
#ifdef CONFIG_PPC64
-static DEFINE_SPINLOCK(vas_thread_id_lock);
-static DEFINE_IDA(vas_thread_ida);
-
-/*
- * We need to assign a unique thread id to each thread in a process.
- *
- * This thread id, referred to as TIDR, and separate from the Linux's tgid,
- * is intended to be used to direct an ASB_Notify from the hardware to the
- * thread, when a suitable event occurs in the system.
- *
- * One such event is a "paste" instruction in the context of Fast Thread
- * Wakeup (aka Core-to-core wake up in the Virtual Accelerator Switchboard
- * (VAS) in POWER9.
- *
- * To get a unique TIDR per process we could simply reuse task_pid_nr() but
- * the problem is that task_pid_nr() is not yet available copy_thread() is
- * called. Fixing that would require changing more intrusive arch-neutral
- * code in code path in copy_process()?.
- *
- * Further, to assign unique TIDRs within each process, we need an atomic
- * field (or an IDR) in task_struct, which again intrudes into the arch-
- * neutral code. So try to assign globally unique TIDRs for now.
- *
- * NOTE: TIDR 0 indicates that the thread does not need a TIDR value.
- * For now, only threads that expect to be notified by the VAS
- * hardware need a TIDR value and we assign values > 0 for those.
- */
-#define MAX_THREAD_CONTEXT ((1 << 16) - 1)
-static int assign_thread_tidr(void)
-{
- int index;
- int err;
- unsigned long flags;
-
-again:
- if (!ida_pre_get(&vas_thread_ida, GFP_KERNEL))
- return -ENOMEM;
-
- spin_lock_irqsave(&vas_thread_id_lock, flags);
- err = ida_get_new_above(&vas_thread_ida, 1, &index);
- spin_unlock_irqrestore(&vas_thread_id_lock, flags);
-
- if (err == -EAGAIN)
- goto again;
- else if (err)
- return err;
-
- if (index > MAX_THREAD_CONTEXT) {
- spin_lock_irqsave(&vas_thread_id_lock, flags);
- ida_remove(&vas_thread_ida, index);
- spin_unlock_irqrestore(&vas_thread_id_lock, flags);
- return -ENOMEM;
- }
-
- return index;
-}
-
-static void free_thread_tidr(int id)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&vas_thread_id_lock, flags);
- ida_remove(&vas_thread_ida, id);
- spin_unlock_irqrestore(&vas_thread_id_lock, flags);
-}
-
-/*
- * Clear any TIDR value assigned to this thread.
- */
-void clear_thread_tidr(struct task_struct *t)
-{
- if (!t->thread.tidr)
- return;
-
- if (!cpu_has_feature(CPU_FTR_P9_TIDR)) {
- WARN_ON_ONCE(1);
- return;
- }
-
- mtspr(SPRN_TIDR, 0);
- free_thread_tidr(t->thread.tidr);
- t->thread.tidr = 0;
-}
-
-void arch_release_task_struct(struct task_struct *t)
-{
- clear_thread_tidr(t);
-}
-
/*
* Assign a unique TIDR (thread id) for task @t and set it in the thread
* structure. For now, we only support setting TIDR for 'current' task.
*/
int set_thread_tidr(struct task_struct *t)
{
- int rc;
-
if (!cpu_has_feature(CPU_FTR_P9_TIDR))
return -EINVAL;
@@ -1602,11 +1511,7 @@ int set_thread_tidr(struct task_struct *t)
if (t->thread.tidr)
return 0;
- rc = assign_thread_tidr();
- if (rc < 0)
- return rc;
-
- t->thread.tidr = rc;
+ t->thread.tidr = (u16)task_pid_nr(t);
mtspr(SPRN_TIDR, t->thread.tidr);
return 0;
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 7/7] ocxl: Document new OCXL IOCTLs
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
Documentation/accelerators/ocxl.rst | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/accelerators/ocxl.rst b/Documentation/accelerators/ocxl.rst
index ddcc58d01cfb..144595a80a1c 100644
--- a/Documentation/accelerators/ocxl.rst
+++ b/Documentation/accelerators/ocxl.rst
@@ -157,6 +157,16 @@ OCXL_IOCTL_GET_METADATA:
Obtains configuration information from the card, such at the size of
MMIO areas, the AFU version, and the PASID for the current context.
+OCXL_IOCTL_ENABLE_P9_WAIT:
+
+ Allows the AFU to wake a userspace thread executing 'wait'. Returns
+ information to userspace to allow it to configure the AFU.
+
+OCXL_IOCTL_GET_PLATFORM:
+
+ Notifies userspace as to the platform the kernel believes we are on,
+ which may differ from what userspace believes. Also reports on which CPU
+ features which are usable from userspace.
mmap
----
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 1/7] powerpc: Add TIDR CPU feature for Power9
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
In-Reply-To: <20180417020950.21446-1-alastair@au1.ibm.com>
From: Alastair D'Silva <alastair@d-silva.org>
This patch adds a CPU feature bit to show whether the CPU has
the TIDR register available, enabling as_notify/wait in userspace.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
arch/powerpc/include/asm/cputable.h | 3 ++-
arch/powerpc/include/asm/switch_to.h | 1 -
arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 4e332f3531c5..54c4cbbe57b4 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -215,6 +215,7 @@ static inline void cpu_feature_keys_init(void) { }
#define CPU_FTR_P9_TM_HV_ASSIST LONG_ASM_CONST(0x0000100000000000)
#define CPU_FTR_P9_TM_XER_SO_BUG LONG_ASM_CONST(0x0000200000000000)
#define CPU_FTR_P9_TLBIE_BUG LONG_ASM_CONST(0x0000400000000000)
+#define CPU_FTR_P9_TIDR LONG_ASM_CONST(0x0000800000000000)
#ifndef __ASSEMBLY__
@@ -462,7 +463,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_PKEY | \
- CPU_FTR_P9_TLBIE_BUG)
+ CPU_FTR_P9_TLBIE_BUG | CPU_FTR_P9_TIDR)
#define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
(~CPU_FTR_SAO))
#define CPU_FTRS_POWER9_DD2_0 CPU_FTRS_POWER9
diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index be8c9fa23983..5b03d8a82409 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -94,6 +94,5 @@ static inline void clear_task_ebb(struct task_struct *t)
extern int set_thread_uses_vas(void);
extern int set_thread_tidr(struct task_struct *t);
-extern void clear_thread_tidr(struct task_struct *t);
#endif /* _ASM_POWERPC_SWITCH_TO_H */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 11a3a4fed3fb..10f8b7f55637 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -722,6 +722,7 @@ static __init void cpufeatures_cpu_quirks(void)
if ((version & 0xffff0000) == 0x004e0000) {
cur_cpu_spec->cpu_features &= ~(CPU_FTR_DAWR);
cur_cpu_spec->cpu_features |= CPU_FTR_P9_TLBIE_BUG;
+ cur_cpu_spec->cpu_features |= CPU_FTR_P9_TIDR;
}
}
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 0/7] ocxl: Implement Power9 as_notify/wait for OpenCAPI
From: Alastair D'Silva @ 2018-04-17 2:09 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
felix, pombredanne, sukadev, npiggin, gregkh, arnd,
andrew.donnellan, fbarrat, corbet, Alastair D'Silva
From: Alastair D'Silva <alastair@d-silva.org>
The Power 9 as_notify/wait feature provides a lower latency way to
signal a thread that work is complete. This series enables the use of
this feature from OpenCAPI adapters, as well as addressing a potential
starvation issue when allocating thread IDs.
Alastair D'Silva (7):
powerpc: Add TIDR CPU feature for Power9
powerpc: Use TIDR CPU feature to control TIDR allocation
powerpc: use task_pid_nr() for TID allocation
ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action
ocxl: Expose the thread_id needed for wait on p9
ocxl: Add an IOCTL so userspace knows which platform the kernel
requires
ocxl: Document new OCXL IOCTLs
Documentation/accelerators/ocxl.rst | 10 ++++
arch/powerpc/include/asm/cputable.h | 3 +-
arch/powerpc/include/asm/pnv-ocxl.h | 2 +-
arch/powerpc/include/asm/switch_to.h | 1 -
arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
arch/powerpc/kernel/process.c | 99 +----------------------------------
arch/powerpc/platforms/powernv/ocxl.c | 4 +-
drivers/misc/ocxl/context.c | 5 +-
drivers/misc/ocxl/file.c | 78 +++++++++++++++++++++++++++
drivers/misc/ocxl/link.c | 38 +++++++++++++-
drivers/misc/ocxl/ocxl_internal.h | 1 +
include/misc/ocxl.h | 9 ++++
include/uapi/misc/ocxl.h | 14 +++++
13 files changed, 161 insertions(+), 104 deletions(-)
--
2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 07/10] Documentation: dt-bindings: Add documents for PECI hwmon client drivers
From: Jae Hyun Yoo @ 2018-04-16 23:51 UTC (permalink / raw)
To: Rob Herring
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
Sumeet R Pawnikar, Vernon Mauery, linux-kernel, linux-doc,
devicetree, linux-hwmon, linux-arm-kernel, openbmc
In-Reply-To: <aee4e04c-885f-b1d8-9e85-9e741551342f@linux.intel.com>
On 4/16/2018 4:22 PM, Jae Hyun Yoo wrote:
> On 4/16/2018 11:14 AM, Rob Herring wrote:
>> On Tue, Apr 10, 2018 at 11:32:09AM -0700, Jae Hyun Yoo wrote:
>>> This commit adds dt-bindings documents for PECI cputemp and dimmtemp
>>> client
>>> drivers.
>>
>> "dt-bindings: hwmon: ..." for the subject.
>>
>
> I'll change the subject.
>
>>>
>>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>>> Cc: Alan Cox <alan@linux.intel.com>
>>> Cc: Andrew Jeffery <andrew@aj.id.au>
>>> Cc: Andrew Lunn <andrew@lunn.ch>
>>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>>> Cc: Greg KH <gregkh@linuxfoundation.org>
>>> Cc: Guenter Roeck <linux@roeck-us.net>
>>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>>> Cc: Jean Delvare <jdelvare@suse.com>
>>> Cc: Joel Stanley <joel@jms.id.au>
>>> Cc: Julia Cartwright <juliac@eso.teric.us>
>>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>>> Cc: Milton Miller II <miltonm@us.ibm.com>
>>> Cc: Pavel Machek <pavel@ucw.cz>
>>> Cc: Randy Dunlap <rdunlap@infradead.org>
>>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>>> ---
>>> .../devicetree/bindings/hwmon/peci-cputemp.txt | 24
>>> +++++++++++++++++++++
>>> .../devicetree/bindings/hwmon/peci-dimmtemp.txt | 25
>>> ++++++++++++++++++++++
>>> 2 files changed, 49 insertions(+)
>>> create mode 100644
>>> Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>>> create mode 100644
>>> Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>>>
>>> diff --git a/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>>> b/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>>> new file mode 100644
>>> index 000000000000..d5530ef9cfd2
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>>> @@ -0,0 +1,24 @@
>>> +Bindings for Intel PECI (Platform Environment Control Interface)
>>> cputemp driver.
>>> +
>>> +Required properties:
>>> +- compatible : Should be "intel,peci-cputemp".
>>> +- reg : Should contain address of a client CPU. Address range
>>> of CPU
>>> + clients is starting from 0x30 based on PECI specification.
>>> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>>
>> Again, where is PECI_OFFSET_MAX defined? It can't depend on something in
>> the kernel.
>>
>
> I'll remove the unnecessary description.
>
>>> +
>>> +Example:
>>> + peci-bus@0 {
>>> + #address-cells = <1>;
>>> + #size-cells = <0>;
>>> + < more properties >
>>> +
>>> + peci-cputemp@cpu0 {
>>> + compatible = "intel,peci-cputemp";
>>> + reg = <0x30>;
>>> + };
>>> +
>>> + peci-cputemp@cpu1 {
>>> + compatible = "intel,peci-cputemp";
>>> + reg = <0x31>;
>>> + };
>>> + };
>>> diff --git
>>> a/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>>> b/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>>> new file mode 100644
>>> index 000000000000..56e5deb61e5c
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>>> @@ -0,0 +1,25 @@
>>> +Bindings for Intel PECI (Platform Environment Control Interface)
>>> dimmtemp
>>> +driver.
>>> +
>>> +Required properties:
>>> +- compatible : Should be "intel,peci-dimmtemp".
>>> +- reg : Should contain address of a client CPU. Address range
>>> of CPU
>>> + clients is starting from 0x30 based on PECI specification.
>>> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>>> +
>>> +Example:
>>> + peci-bus@0 {
>>> + #address-cells = <1>;
>>> + #size-cells = <0>;
>>> + < more properties >
>>> +
>>> + peci-dimmtemp@cpu0 {
>>
>> unit-address is wrong.
>>
>
> Will fix it using the reg value.
>
>> It is a different bus from cputemp? Otherwise, you have conflicting
>> addresses. If that's the case, probably should make it clear by showing
>> different host adapters for each example.
>>
>
> It could be the same bus with cputemp. Also, client address sharing is
> possible by PECI core if the functionality is different. I mean, cputemp
> and dimmtemp targeting the same client is possible case like this.
> peci-cputemp@30
> peci-dimmtemp@30
>
Oh, I got your point. Probably, I should change these separate settings
into one like
peci-client@30 {
compatible = "intel,peci-client";
reg = <0x30>;
};
Then cputemp and dimmtemp drivers could refer the same compatible
string. Will rewrite it.
>>> + compatible = "intel,peci-dimmtemp";
>>> + reg = <0x30>;
>>> + };
>>> +
>>> + peci-dimmtemp@cpu1 {
>>> + compatible = "intel,peci-dimmtemp";
>>> + reg = <0x31>;
>>> + };
>>> + };
>>> --
>>> 2.16.2
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe devicetree" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 07/10] Documentation: dt-bindings: Add documents for PECI hwmon client drivers
From: Jae Hyun Yoo @ 2018-04-16 23:22 UTC (permalink / raw)
To: Rob Herring
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
Sumeet R Pawnikar, Vernon Mauery, linux-kernel, linux-doc,
devicetree, linux-hwmon, linux-arm-kernel, openbmc
In-Reply-To: <20180416181423.t4vf7sugv6z3aw5h@rob-hp-laptop>
On 4/16/2018 11:14 AM, Rob Herring wrote:
> On Tue, Apr 10, 2018 at 11:32:09AM -0700, Jae Hyun Yoo wrote:
>> This commit adds dt-bindings documents for PECI cputemp and dimmtemp client
>> drivers.
>
> "dt-bindings: hwmon: ..." for the subject.
>
I'll change the subject.
>>
>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>> Cc: Alan Cox <alan@linux.intel.com>
>> Cc: Andrew Jeffery <andrew@aj.id.au>
>> Cc: Andrew Lunn <andrew@lunn.ch>
>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Guenter Roeck <linux@roeck-us.net>
>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>> Cc: Jean Delvare <jdelvare@suse.com>
>> Cc: Joel Stanley <joel@jms.id.au>
>> Cc: Julia Cartwright <juliac@eso.teric.us>
>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>> Cc: Milton Miller II <miltonm@us.ibm.com>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>> ---
>> .../devicetree/bindings/hwmon/peci-cputemp.txt | 24 +++++++++++++++++++++
>> .../devicetree/bindings/hwmon/peci-dimmtemp.txt | 25 ++++++++++++++++++++++
>> 2 files changed, 49 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>> create mode 100644 Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>>
>> diff --git a/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt b/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>> new file mode 100644
>> index 000000000000..d5530ef9cfd2
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/hwmon/peci-cputemp.txt
>> @@ -0,0 +1,24 @@
>> +Bindings for Intel PECI (Platform Environment Control Interface) cputemp driver.
>> +
>> +Required properties:
>> +- compatible : Should be "intel,peci-cputemp".
>> +- reg : Should contain address of a client CPU. Address range of CPU
>> + clients is starting from 0x30 based on PECI specification.
>> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>
> Again, where is PECI_OFFSET_MAX defined? It can't depend on something in
> the kernel.
>
I'll remove the unnecessary description.
>> +
>> +Example:
>> + peci-bus@0 {
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + < more properties >
>> +
>> + peci-cputemp@cpu0 {
>> + compatible = "intel,peci-cputemp";
>> + reg = <0x30>;
>> + };
>> +
>> + peci-cputemp@cpu1 {
>> + compatible = "intel,peci-cputemp";
>> + reg = <0x31>;
>> + };
>> + };
>> diff --git a/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt b/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>> new file mode 100644
>> index 000000000000..56e5deb61e5c
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/hwmon/peci-dimmtemp.txt
>> @@ -0,0 +1,25 @@
>> +Bindings for Intel PECI (Platform Environment Control Interface) dimmtemp
>> +driver.
>> +
>> +Required properties:
>> +- compatible : Should be "intel,peci-dimmtemp".
>> +- reg : Should contain address of a client CPU. Address range of CPU
>> + clients is starting from 0x30 based on PECI specification.
>> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>> +
>> +Example:
>> + peci-bus@0 {
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + < more properties >
>> +
>> + peci-dimmtemp@cpu0 {
>
> unit-address is wrong.
>
Will fix it using the reg value.
> It is a different bus from cputemp? Otherwise, you have conflicting
> addresses. If that's the case, probably should make it clear by showing
> different host adapters for each example.
>
It could be the same bus with cputemp. Also, client address sharing is
possible by PECI core if the functionality is different. I mean, cputemp
and dimmtemp targeting the same client is possible case like this.
peci-cputemp@30
peci-dimmtemp@30
>> + compatible = "intel,peci-dimmtemp";
>> + reg = <0x30>;
>> + };
>> +
>> + peci-dimmtemp@cpu1 {
>> + compatible = "intel,peci-dimmtemp";
>> + reg = <0x31>;
>> + };
>> + };
>> --
>> 2.16.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe devicetree" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 04/10] Documentations: dt-bindings: Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs
From: Jae Hyun Yoo @ 2018-04-16 23:12 UTC (permalink / raw)
To: Rob Herring
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
Sumeet R Pawnikar, Vernon Mauery, linux-kernel, linux-doc,
devicetree, linux-hwmon, linux-arm-kernel, openbmc
In-Reply-To: <20180416181035.rcjvt4rlrwaj5yxf@rob-hp-laptop>
On 4/16/2018 11:10 AM, Rob Herring wrote:
> On Tue, Apr 10, 2018 at 11:32:06AM -0700, Jae Hyun Yoo wrote:
>> This commit adds a dt-bindings document of PECI adapter driver for Aspeed
>> AST24xx/25xx SoCs.
>>
>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>> Cc: Alan Cox <alan@linux.intel.com>
>> Cc: Andrew Jeffery <andrew@aj.id.au>
>> Cc: Andrew Lunn <andrew@lunn.ch>
>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Guenter Roeck <linux@roeck-us.net>
>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>> Cc: Jean Delvare <jdelvare@suse.com>
>> Cc: Joel Stanley <joel@jms.id.au>
>> Cc: Julia Cartwright <juliac@eso.teric.us>
>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>> Cc: Milton Miller II <miltonm@us.ibm.com>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>> ---
>> .../devicetree/bindings/peci/peci-aspeed.txt | 60 ++++++++++++++++++++++
>> 1 file changed, 60 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/peci/peci-aspeed.txt
>>
>> diff --git a/Documentation/devicetree/bindings/peci/peci-aspeed.txt b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
>> new file mode 100644
>> index 000000000000..4598bb8c20fa
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
>> @@ -0,0 +1,60 @@
>> +Device tree configuration for PECI buses on the AST24XX and AST25XX SoCs.
>> +
>> +Required properties:
>> +- compatible : Should be "aspeed,ast2400-peci" or "aspeed,ast2500-peci"
>> + - aspeed,ast2400-peci: Aspeed AST2400 family PECI
>> + controller
>> + - aspeed,ast2500-peci: Aspeed AST2500 family PECI
>> + controller
>> +- reg : Should contain PECI controller registers location and
>> + length.
>> +- #address-cells : Should be <1>.
>> +- #size-cells : Should be <0>.
>> +- interrupts : Should contain PECI controller interrupt.
>> +- clocks : Should contain clock source for PECI controller.
>> + Should reference clkin.
>> +- clock_frequency : Should contain the operation frequency of PECI controller
>> + in units of Hz.
>> + 187500 ~ 24000000
>
> This is the frequency of the bus or used to derive it? It would be
> better to specify the bus frequency instead and have the driver
> calculate its internal freq. And then use "bus-frequency" instead.
>
I agree with you. Actually, it is being used for operation frequency
setting of PECI controller module in SoC so it's different from the
meaning of "bus-frequency". I'll change it to "operation-frequency".
>> +
>> +Optional properties:
>> +- msg-timing-nego : Message timing negotiation period. This value will
>> + determine the period of message timing negotiation to be
>> + issued by PECI controller. The unit of the programmed
>> + value is four times of PECI clock period.
>> + 0 ~ 255 (default: 1)
>> +- addr-timing-nego : Address timing negotiation period. This value will
>> + determine the period of address timing negotiation to be
>> + issued by PECI controller. The unit of the programmed
>> + value is four times of PECI clock period.
>> + 0 ~ 255 (default: 1)
>> +- rd-sampling-point : Read sampling point selection. The whole period of a bit
>> + time will be divided into 16 time frames. This value will
>> + determine the time frame in which the controller will
>> + sample PECI signal for data read back. Usually in the
>> + middle of a bit time is the best.
>> + 0 ~ 15 (default: 8)
>> +- cmd_timeout_ms : Command timeout in units of ms.
>> + 1 ~ 60000 (default: 1000)
>
> s/_/-/
>
Will fix it.
>
> All these either need vendor prefixes or should be standard properties
> for PECI adapters. I think probably the latter case. If so, the first
> 2 should probably be in units of clocks (not 4 clocks). And they should
> then be documented in the common PECI binding doc.
>
So far I've checked that these are ASPEED PECI controller specific
properties so it should be listed in here.
>> +
>> +Example:
>> + peci: peci@1e78b000 {
>> + compatible = "simple-bus";
>> + #address-cells = <1>;
>> + #size-cells = <1>;
>> + ranges = <0x0 0x1e78b000 0x60>;
>
> No need to show this part in examples.
>
Got it. Will drop the part.
>> +
>> + peci0: peci-bus@0 {
>> + compatible = "aspeed,ast2500-peci";
>> + reg = <0x0 0x60>;
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + interrupts = <15>;
>> + clocks = <&clk_clkin>;
>> + clock-frequency = <24000000>;
>> + msg-timing-nego = <1>;
>> + addr-timing-nego = <1>;
>> + rd-sampling-point = <8>;
>> + cmd-timeout-ms = <1000>;
>> + };
>> + };
>> --
>> 2.16.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe devicetree" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 01/10] Documentations: dt-bindings: Add documents of generic PECI bus, adapter and client drivers
From: Jae Hyun Yoo @ 2018-04-16 23:06 UTC (permalink / raw)
To: Rob Herring
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
Sumeet R Pawnikar, Vernon Mauery, linux-hwmon, devicetree,
linux-doc, openbmc, linux-kernel, linux-arm-kernel
In-Reply-To: <20180416175936.4vcse73mrnyk242m@rob-hp-laptop>
Hi Rob,
Thanks for sharing your time. Please see my answers inline.
On 4/16/2018 10:59 AM, Rob Herring wrote:
> On Tue, Apr 10, 2018 at 11:32:03AM -0700, Jae Hyun Yoo wrote:
>> This commit adds documents of generic PECI bus, adapter and client drivers.
>
> "dt-bindings: ..." for the subject prefix please.
>
Sure, I'll change the subject.
>>
>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>> Cc: Alan Cox <alan@linux.intel.com>
>> Cc: Andrew Jeffery <andrew@aj.id.au>
>> Cc: Andrew Lunn <andrew@lunn.ch>
>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Guenter Roeck <linux@roeck-us.net>
>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>> Cc: Jean Delvare <jdelvare@suse.com>
>> Cc: Joel Stanley <joel@jms.id.au>
>> Cc: Julia Cartwright <juliac@eso.teric.us>
>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>> Cc: Milton Miller II <miltonm@us.ibm.com>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>> ---
>> .../devicetree/bindings/peci/peci-adapter.txt | 23 ++++++++++++++++++++
>> .../devicetree/bindings/peci/peci-bus.txt | 15 +++++++++++++
>> .../devicetree/bindings/peci/peci-client.txt | 25 ++++++++++++++++++++++
>
> This should be all one document.
>
Okay. I'll combine them into one document.
>> 3 files changed, 63 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/peci/peci-adapter.txt
>> create mode 100644 Documentation/devicetree/bindings/peci/peci-bus.txt
>> create mode 100644 Documentation/devicetree/bindings/peci/peci-client.txt
>>
>> diff --git a/Documentation/devicetree/bindings/peci/peci-adapter.txt b/Documentation/devicetree/bindings/peci/peci-adapter.txt
>> new file mode 100644
>> index 000000000000..9221374f6b11
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-adapter.txt
>> @@ -0,0 +1,23 @@
>> +Generic device tree configuration for PECI adapters.
>> +
>> +Required properties:
>> +- compatible : Should contain hardware specific definition strings that can
>> + match an adapter driver implementation.
>> +- reg : Should contain PECI controller registers location and length.
>
> No need for these 2 here.
>
Will drop these 2.
>> +- #address-cells : Should be <1>.
>> +- #size-cells : Should be <0>.
>
> Some details on the addressing for PECI would be good.
>
It is for the PECI client address. Will add details.
>> +
>> +Example:
>> + peci: peci@10000000 {
>> + compatible = "simple-bus";
>> + #address-cells = <1>;
>> + #size-cells = <1>;
>> + ranges = <0x0 0x10000000 0x1000>;
>> +
>
> This part of the example is not relevant. Just start with the adapter
> node.
>
Will remove that part. Thanks!
>> + peci0: peci-bus@0 {
>> + compatible = "soc,soc-peci";
>> + reg = <0x0 0x1000>;
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + };
>> + };
>> diff --git a/Documentation/devicetree/bindings/peci/peci-bus.txt b/Documentation/devicetree/bindings/peci/peci-bus.txt
>> new file mode 100644
>> index 000000000000..90bcc791ccb0
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-bus.txt
>> @@ -0,0 +1,15 @@
>> +Generic device tree configuration for PECI buses.
>> +
>> +Required properties:
>> +- compatible : Should be "simple-bus".
>
> I don't understand what this has to do with PECI? "simple-bus" already
> has a defined meaning.
>
Maybe I'm wrong but I intended to show this node is an umbrella node of
a PECI bus subsystem. What should I use then?
>> +- #address-cells : Should be <1>.
>> +- #size-cells : Should be <1>.
>> +- ranges : Should contain PECI controller registers ranges.
>> +
>> +Example:
>> + peci: peci@10000000 {
>> + compatible = "simple-bus";
>> + #address-cells = <1>;
>> + #size-cells = <1>;
>> + ranges = <0x0 0x10000000 0x1000>;
>> + };
>> diff --git a/Documentation/devicetree/bindings/peci/peci-client.txt b/Documentation/devicetree/bindings/peci/peci-client.txt
>> new file mode 100644
>> index 000000000000..8e2bfd8532f6
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-client.txt
>> @@ -0,0 +1,25 @@
>> +Generic device tree configuration for PECI clients.
>> +
>> +Required properties:
>> +- compatible : Should contain target device specific definition strings that can
>> + match a client driver implementation.
>
> Bindings are for h/w, not client drivers.
>
> How are PECI devices defined?
>
Got it. I'll correct the description. PECI client device is Intel CPU
which is connected through a PECI bus.
>> +- reg : Should contain address of a client CPU. Address range of CPU
>> + clients is starting from 0x30 based on PECI specification.
>> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>
> 8 devices should be enough for anyone...
>
> Where is PECI_OFFSET_MAX defined?
>
PECI_OFFSET_MAX is defined in include/linux/peci.h based on the maximum
CPU numbers of the current IA generation. I'll remove the unnecessary
details. A setting out of range would be handled accordingly in kernel.
>> +
>> +Example:
>> + peci-bus@0 {
>> + #address-cells = <1>;
>> + #size-cells = <0>;
>> + < more properties >
>> +
>> + function@cpu0 {
>
> Not a valid node name. "function@30" is what it probably should be. For
> a new bus you can define unit-address format you like, but it must be
> based on the contents of reg. However, it doesn't look like you should
> create anything special here.
>
Got it. I'll fix these node name like function@30 and function@31.
Thanks a lot for your comments!
-Jae
>> + compatible = "device,function";
>> + reg = <0x30>;
>> + };
>> +
>> + function@cpu1 {
>> + compatible = "device,function";
>> + reg = <0x31>;
>> + };
>> + };
>> --
>> 2.16.2
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox