* [PATCH v5 0/9] ipc: Clamp *mni to the real IPCMNI limit & increase that limit
From: Waiman Long @ 2018-03-16 18:13 UTC (permalink / raw)
To: Luis R. Rodriguez, Kees Cook
Cc: linux-kernel, linux-fsdevel, linux-doc, Jonathan Corbet,
Andrew Morton, Al Viro, Matthew Wilcox, Eric W. Biederman,
Waiman Long
v4->v5:
- Revert the flags back to 16-bit so that there will be no change to
the size of ctl_table.
- Enhance the sysctl_check_flags() as requested by Luis to perform more
checks to spot incorrect ctl_table entries.
- Change the sysctl selftest to use dummy sysctls instead of production
ones & enhance it to do more checks.
- Add one more sysctl selftest for registration failure.
- Add 2 ipc patches to add an extended mode to increase IPCMNI from
32k to 2M.
- Miscellaneous change to incorporate feedback comments from
reviewers.
v3->v4:
- Remove v3 patches 1 & 2 as they have been merged into the mm tree.
- Change flags from uint16_t to unsigned int.
- Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
- Simplify the warning message code.
- Add a new patch to fail the ctl_table registration with invalid flag.
- Add a test case for range clamping in sysctl selftest.
v2->v3:
- Fix kdoc comment errors.
- Incorporate comments and suggestions from Luis R. Rodriguez.
- Add a patch to fix a typo error in fs/proc/proc_sysctl.c.
v1->v2:
- Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
structures.
- Add a new flags field to the ctl_table structure for specifying
whether range clamping should be activated instead of adding new
sysctl parameter handlers.
- Clamp the semmni value embedded in the multi-values sem parameter.
v1 patch: https://lkml.org/lkml/2018/2/19/453
v2 patch: https://lkml.org/lkml/2018/2/27/627
v3 patch: https://lkml.org/lkml/2018/3/1/716
v4 patch: https://lkml.org/lkml/2018/3/12/867
The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.
Enforcing the limit by failing sysctl parameter write, however, may
cause regressions if existing user setup scripts set those parameters
above 32k as those scripts will now fail in this case.
To address this delemma, a new flags field is introduced into
the ctl_table. The value CTL_FLAGS_CLAMP_RANGE can be added to any
ctl_table entries to enable a looser range clamping without returning
any error. For example,
.flags = CTL_FLAGS_CLAMP_RANGE,
This flags value are now used for the range checking of shmmni,
msgmni and semmni without breaking existing applications. If any out
of range value is written to those sysctl parameters, the following
warning will be printed instead.
sysctl: "shmmni" was set out of range [0, 32768], clamped to 32768.
Reading the values back will show 32768 instead of some fake values.
New sysctl selftests are added to exercise new code added by this
patchset.
There are users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M.
Eric Biederman had posted an RFC patch to just scrap the IPCMNI limit
and open up the whole positive integer space for IPC IDs. A major
issue that I have with this approach is that SysV IPC had been in use
for over 20 years. We just don't know if there are user applications
that have dependency on the way that the IDs are built. So drastic
change like this may have the potential of breaking some applications.
I prefer a more conservative approach where users will observe no
change in behavior unless they explictly opt in to enable the extended
mode. I could open up the whole positive integer space in this case
like what Eric did, but that will make the code more complex. So I
just extend IPCMNI to 2M in this case and keep similar ID generation
logic.
Waiman Long (9):
sysctl: Add flags to support min/max range clamping
proc/sysctl: Provide additional ctl_table.flags checks
sysctl: Warn when a clamped sysctl parameter is set out of range
ipc: Clamp msgmni and shmmni to the real IPCMNI limit
ipc: Clamp semmni to the real IPCMNI limit
test_sysctl: Add range clamping test
test_sysctl: Add ctl_table registration failure test
ipc: Allow boot time extension of IPCMNI from 32k to 2M
ipc: Conserve sequence numbers in extended IPCMNI mode
Documentation/admin-guide/kernel-parameters.txt | 3 +
fs/proc/proc_sysctl.c | 62 ++++++++++++++++++++
include/linux/ipc.h | 11 +++-
include/linux/ipc_namespace.h | 1 +
include/linux/sysctl.h | 32 +++++++++++
ipc/ipc_sysctl.c | 33 ++++++++++-
ipc/sem.c | 25 ++++++++
ipc/util.c | 41 ++++++++-----
ipc/util.h | 23 +++++---
kernel/sysctl.c | 76 ++++++++++++++++++++++---
lib/test_sysctl.c | 70 +++++++++++++++++++++++
tools/testing/selftests/sysctl/sysctl.sh | 67 ++++++++++++++++++++++
12 files changed, 411 insertions(+), 33 deletions(-)
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter
From: Oleg Nesterov @ 2018-03-16 17:50 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu
In-Reply-To: <c93216a4-a4e1-dd8f-00be-17254e308cd1@linux.vnet.ibm.com>
On 03/16, Ravi Bangoria wrote:
>
> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> > On 03/13, Ravi Bangoria wrote:
> >> For tiny binaries/libraries, different mmap regions points to the
> >> same file portion. In such cases, we may increment reference counter
> >> multiple times.
> > Yes,
> >
> >> But while de-registration, reference counter will get
> >> decremented only by once
> > could you explain why this happens? sdt_increment_ref_ctr() and
> > sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> > the same mappings?
...
> # strace -o out python
> mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
> mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
> mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
Ah, in this case everything is clear, thanks.
I was confused by the changelog, I misinterpreted it as if inc/dec are not
balanced in case of multiple mappings even if the application doesn't play
with mmap/mprotect/etc.
And it seems that you are trying to confuse yourself, not only me ;) Just
suppose that an application does mmap+munmap in a loop and the mapped region
contains uprobe but not the counter.
And this all makes me think that we should do something else. Ideally,
install_breakpoint() and remove_breakpoint() should inc/dec the counter
if they do not fail...
Btw, why do we need a counter, not a boolean? Who else can modify it?
Or different uprobes can share the same counter?
Oleg.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Oleg Nesterov @ 2018-03-16 16:16 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ravi Bangoria, mhiramat, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu
In-Reply-To: <20180315124816.6aa3d4e2@vmware.local.home>
On 03/15, Steven Rostedt wrote:
>
> On Tue, 13 Mar 2018 18:26:00 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
> > +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> > +{
> > + struct uprobe_map_info *info;
> > + struct vm_area_struct *vma;
> > + unsigned long vaddr;
> > +
> > + uprobe_start_dup_mmap();
>
> Please add a comment here that this function ups the mm ref count for
> each info returned. Otherwise it's hard to know what that mmput() below
> matches.
You meant uprobe_build_map_info(), not uprobe_start_dup_mmap().
Yes, and if it gets more callers perhaps we should move this mmput() into
uprobe_free_map_info()...
Oleg.
--- x/kernel/events/uprobes.c
+++ x/kernel/events/uprobes.c
@@ -714,6 +714,7 @@ struct map_info {
static inline struct map_info *free_map_info(struct map_info *info)
{
struct map_info *next = info->next;
+ mmput(info->mm);
kfree(info);
return next;
}
@@ -783,8 +784,11 @@ build_map_info(struct address_space *map
goto again;
out:
- while (prev)
- prev = free_map_info(prev);
+ while (prev) {
+ info = prev;
+ prev = prev->next;
+ kfree(info);
+ }
return curr;
}
@@ -834,7 +838,6 @@ register_for_each_vma(struct uprobe *upr
unlock:
up_write(&mm->mmap_sem);
free:
- mmput(mm);
info = free_map_info(info);
}
out:
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH] README: Improve documentation descriptions
From: Martin Kepplinger @ 2018-03-16 15:57 UTC (permalink / raw)
To: corbet; +Cc: linux-doc, linux-kernel, Martin Kepplinger
"This file" indeed was moved once, but at some point "this file", the
top-level README, becomes a file in itself. Now that time has come :)
Let's describe how things are, and suggest reading "this file" first,
"this file" simply being a the admin-guide README file, not a file that
was once moved.
Signed-off-by: Martin Kepplinger <martink@posteo.de>
---
That's at least my opinion :)
thanks
martin
README | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/README b/README
index b2ba4aaa3a71..12b4674a483c 100644
--- a/README
+++ b/README
@@ -1,10 +1,9 @@
Linux kernel
============
-This file was moved to Documentation/admin-guide/README.rst
-
-Please notice that there are several guides for kernel developers and users.
-These guides can be rendered in a number of formats, like HTML and PDF.
+There are several guides for kernel developers and users. These guides can
+be rendered in a number of formats, like HTML and PDF. Please read
+Documentation/admin-guide/README.rst first.
In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.
--
2.14.2
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [PATCH] crypto: doc - clarify hash callbacks state machine
From: Herbert Xu @ 2018-03-16 15:16 UTC (permalink / raw)
To: Horia Geantă
Cc: David S. Miller, Jonathan Corbet, linux-crypto, linux-doc,
linux-kernel
In-Reply-To: <20180305103945.3517-1-horia.geanta@nxp.com>
On Mon, Mar 05, 2018 at 12:39:45PM +0200, Horia Geantă wrote:
> Even though it doesn't make too much sense, it is perfectly legal to:
> - call .init() and then (as many times) .update()
> - subseqently _not_ call any of .final(), .finup() or .export()
Actually it makes perfect sense, because there can be an arbitrary
number of requests for a given tfm. There is no requirement that
you must finalise the first request before submitting new ones.
IOW there can be an arbitrary number of outstanding requests even
without the user intentionally abandoning any hash request.
So please modify your commit description.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 8/8] trace_uprobe/sdt: Document about reference counter
From: Masami Hiramatsu @ 2018-03-16 14:26 UTC (permalink / raw)
To: Ravi Bangoria
Cc: oleg, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu
In-Reply-To: <e991f9e2-6488-7739-548d-6ca28c57826d@linux.vnet.ibm.com>
On Fri, 16 Mar 2018 15:12:38 +0530
Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
> On 03/15/2018 06:17 PM, Masami Hiramatsu wrote:
> > Hi Ravi,
> >
> > On Wed, 14 Mar 2018 20:52:59 +0530
> > Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
> >
> >> On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
> >>> On Tue, 13 Mar 2018 18:26:03 +0530
> >>> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
> >>>
> >>>> No functionality changes.
> >>> Please consider to describe what is this change and why, here.
> >> Will add in next version.
> > Thanks, and could you also move this before perf-probe patch?
> > Also Could you make perf-probe check the tracing/README whether
> > the kernel supports reference counter syntax or not?
> >
> > perf-tool can be used on older (or stable) kernel.
>
> Sure, Will do that.
Please see scan_ftrace_readme@util/probe-file.c :)
It is easy to expand the pattern table.
Thank you,
--
Masami Hiramatsu <mhiramat@kernel.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] Improve mutex documentation
From: Peter Zijlstra @ 2018-03-16 13:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Kirill Tkhai, tj, cl, linux-mm, linux-kernel,
linux-doc, Jonathan Corbet, Mauro Carvalho Chehab, Ingo Molnar
In-Reply-To: <20180315115812.GA9949@bombadil.infradead.org>
On Thu, Mar 15, 2018 at 04:58:12AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > My memory is weak and our documentation is awful. What does
> > mutex_lock_killable() actually do and how does it differ from
> > mutex_lock_interruptible()?
>
> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
> kernel-doc for mutex_lock_interruptible().
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Thanks Matthew!
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter
From: Ravi Bangoria @ 2018-03-16 13:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <c93216a4-a4e1-dd8f-00be-17254e308cd1@linux.vnet.ibm.com>
On 03/16/2018 05:42 PM, Ravi Bangoria wrote:
>
> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
>> On 03/13, Ravi Bangoria wrote:
>>> For tiny binaries/libraries, different mmap regions points to the
>>> same file portion. In such cases, we may increment reference counter
>>> multiple times.
>> Yes,
>>
>>> But while de-registration, reference counter will get
>>> decremented only by once
>> could you explain why this happens? sdt_increment_ref_ctr() and
>> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
>> the same mappings?
> Sorry, I thought this happens only for tiny binaries. But that is not the case.
> This happens for binary / library of any length.
>
> Also, it's not a problem with sdt_increment_ref_ctr() / sdt_increment_ref_ctr().
> The problem happens with trace_uprobe_mmap_callback().
>
> To illustrate in detail, I'm adding a pr_info() in trace_uprobe_mmap_callback():
>
> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + pr_info("0x%lx-0x%lx : 0x%lx\n", vma->vm_start, vma->vm_end, vaddr);
> sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
>
>
> Ok now, libpython has SDT markers with reference counter:
>
> # readelf -n /usr/lib64/libpython2.7.so.1.0 | grep -A2 Provider
> Provider: python
> Name: function__entry
> ... Semaphore: 0x00000000002899d8
>
> Probing on that marker:
>
> # cd /sys/kernel/debug/tracing/
> # echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
> # echo 1 > events/sdt_python/function__entry/enable
>
> When I run python:
>
> # strace -o out python
> mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
> mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
> mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
>
> The first mmap() maps the whole library into one region. Second mmap()
> and third mprotect() split out the whole region into smaller vmas and sets
> appropriate protection flags.
>
> Now, in this case, trace_uprobe_mmap_callback() updates reference counter
> twice -- by second mmap() call and by third mprotect() call -- because both
> regions contain reference counter offset. This I can verify in dmesg:
>
> # dmesg | tail
> trace_kprobe: 0x7fff926a0000-0x7fff926f0000 : 0x7fff926e99d8
> trace_kprobe: 0x7fff926b0000-0x7fff926f0000 : 0x7fff926e99d8
>
> Final vmas of libpython:
>
> # cat /proc/`pgrep python`/maps | grep libpython
> 7fff92460000-7fff926a0000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
> 7fff926a0000-7fff926b0000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
> 7fff926b0000-7fff926f0000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
>
>
> I see similar problem with normal binary as well. I'm using Brendan Gregg's
> example[1]:
>
> # readelf -n /tmp/tick | grep -A2 Provider
> Provider: tick
> Name: loop2
> ... Semaphore: 0x000000001005003c
>
> Probing that marker:
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
> # echo 1 > events/sdt_tick/loop2/enable
>
> Now when I run the binary
>
> # /tmp/tick
>
> load_elf_binary() internally calls mmap() and I see trace_uprobe_mmap_callback()
> updating reference counter twice:
>
> # dmesg | tail
> trace_kprobe: 0x10010000-0x10030000 : 0x10020036
> trace_kprobe: 0x10020000-0x10030000 : 0x10020036
>
> proc/<pid>/maps of the tick:
>
> # cat /proc/`pgrep tick`/maps
> 10000000-10010000 r-xp 00000000 08:05 1335712 /tmp/tick
> 10010000-10020000 r--p 00000000 08:05 1335712 /tmp/tick
> 10020000-10030000 rw-p 00010000 08:05 1335712 /tmp/tick
>
> [1] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
Also, while de-registration, we look for all existing mms using
uprobe_build_mmap_info() and decrement the counter in each
of the mm. i.e. we decrement the counter only once.
-Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RESEND PATCH v5] input: pxrc: new driver for PhoenixRC Flight Controller Adapter
From: Marcus Folkesson @ 2018-03-16 13:39 UTC (permalink / raw)
To: Dmitry Torokhov, Jonathan Corbet; +Cc: linux-input, linux-doc, linux-kernel
In-Reply-To: <20180218161747.21110-1-marcus.folkesson@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 13068 bytes --]
ping.
I do not want to nag, but would someone please have a look at this?
Thanks,
Marcus Folkesson
On Sun, Feb 18, 2018 at 05:17:46PM +0100, Marcus Folkesson wrote:
> This driver let you plug in your RC controller to the adapter and
> use it as input device in various RC simulators.
>
> Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
> ---
>
> v5:
> - Drop autosuspend support
> - Use pm_mutex instead of input_dev->mutex
> - Use pxrc->is_open instead of input_dev->users
> v4:
> - Add call to usb_mark_last_busy() in irq
> - Move code from pxrc_resume() to pxrc_reset_resume()
> v3:
> - Use RUDDER and MISC instead of TILT_X and TILT_Y
> - Drop kref and anchor
> - Rework URB handling
> - Add PM support
> v2:
> - Change module license to GPLv2 to match SPDX tag
>
>
> Documentation/input/devices/pxrc.rst | 57 +++++++
> drivers/input/joystick/Kconfig | 9 ++
> drivers/input/joystick/Makefile | 1 +
> drivers/input/joystick/pxrc.c | 303 +++++++++++++++++++++++++++++++++++
> 4 files changed, 370 insertions(+)
> create mode 100644 Documentation/input/devices/pxrc.rst
> create mode 100644 drivers/input/joystick/pxrc.c
>
> diff --git a/Documentation/input/devices/pxrc.rst b/Documentation/input/devices/pxrc.rst
> new file mode 100644
> index 000000000000..ca11f646bae8
> --- /dev/null
> +++ b/Documentation/input/devices/pxrc.rst
> @@ -0,0 +1,57 @@
> +=======================================================
> +pxrc - PhoenixRC Flight Controller Adapter
> +=======================================================
> +
> +:Author: Marcus Folkesson <marcus.folkesson@gmail.com>
> +
> +This driver let you use your own RC controller plugged into the
> +adapter that comes with PhoenixRC [1]_ or other compatible adapters.
> +
> +The adapter supports 7 analog channels and 1 digital input switch.
> +
> +Notes
> +=====
> +
> +Many RC controllers is able to configure which stick goes to which channel.
> +This is also configurable in most simulators, so a matching is not necessary.
> +
> +The driver is generating the following input event for analog channels:
> +
> ++---------+----------------+
> +| Channel | Event |
> ++=========+================+
> +| 1 | ABS_X |
> ++---------+----------------+
> +| 2 | ABS_Y |
> ++---------+----------------+
> +| 3 | ABS_RX |
> ++---------+----------------+
> +| 4 | ABS_RY |
> ++---------+----------------+
> +| 5 | ABS_RUDDER |
> ++---------+----------------+
> +| 6 | ABS_THROTTLE |
> ++---------+----------------+
> +| 7 | ABS_MISC |
> ++---------+----------------+
> +
> +The digital input switch is generated as an `BTN_A` event.
> +
> +Manual Testing
> +==============
> +
> +To test this driver's functionality you may use `input-event` which is part of
> +the `input layer utilities` suite [2]_.
> +
> +For example::
> +
> + > modprobe pxrc
> + > input-events <devnr>
> +
> +To print all input events from input `devnr`.
> +
> +References
> +==========
> +
> +.. [1] http://www.phoenix-sim.com/
> +.. [2] https://www.kraxel.org/cgit/input/
> diff --git a/drivers/input/joystick/Kconfig b/drivers/input/joystick/Kconfig
> index f3c2f6ea8b44..332c0cc1b2ab 100644
> --- a/drivers/input/joystick/Kconfig
> +++ b/drivers/input/joystick/Kconfig
> @@ -351,4 +351,13 @@ config JOYSTICK_PSXPAD_SPI_FF
>
> To drive rumble motor a dedicated power supply is required.
>
> +config JOYSTICK_PXRC
> + tristate "PhoenixRC Flight Controller Adapter"
> + depends on USB_ARCH_HAS_HCD
> + depends on USB
> + help
> + Say Y here if you want to use the PhoenixRC Flight Controller Adapter.
> +
> + To compile this driver as a module, choose M here: the
> + module will be called pxrc.
> endif
> diff --git a/drivers/input/joystick/Makefile b/drivers/input/joystick/Makefile
> index 67651efda2e1..dd0492ebbed7 100644
> --- a/drivers/input/joystick/Makefile
> +++ b/drivers/input/joystick/Makefile
> @@ -23,6 +23,7 @@ obj-$(CONFIG_JOYSTICK_JOYDUMP) += joydump.o
> obj-$(CONFIG_JOYSTICK_MAGELLAN) += magellan.o
> obj-$(CONFIG_JOYSTICK_MAPLE) += maplecontrol.o
> obj-$(CONFIG_JOYSTICK_PSXPAD_SPI) += psxpad-spi.o
> +obj-$(CONFIG_JOYSTICK_PXRC) += pxrc.o
> obj-$(CONFIG_JOYSTICK_SIDEWINDER) += sidewinder.o
> obj-$(CONFIG_JOYSTICK_SPACEBALL) += spaceball.o
> obj-$(CONFIG_JOYSTICK_SPACEORB) += spaceorb.o
> diff --git a/drivers/input/joystick/pxrc.c b/drivers/input/joystick/pxrc.c
> new file mode 100644
> index 000000000000..07a0dbd3ced2
> --- /dev/null
> +++ b/drivers/input/joystick/pxrc.c
> @@ -0,0 +1,303 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for Phoenix RC Flight Controller Adapter
> + *
> + * Copyright (C) 2018 Marcus Folkesson <marcus.folkesson@gmail.com>
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/uaccess.h>
> +#include <linux/usb.h>
> +#include <linux/usb/input.h>
> +#include <linux/mutex.h>
> +#include <linux/input.h>
> +
> +#define PXRC_VENDOR_ID (0x1781)
> +#define PXRC_PRODUCT_ID (0x0898)
> +
> +static const struct usb_device_id pxrc_table[] = {
> + { USB_DEVICE(PXRC_VENDOR_ID, PXRC_PRODUCT_ID) },
> + { }
> +};
> +MODULE_DEVICE_TABLE(usb, pxrc_table);
> +
> +struct pxrc {
> + struct input_dev *input;
> + struct usb_device *udev;
> + struct usb_interface *intf;
> + struct urb *urb;
> + struct mutex pm_mutex;
> + bool is_open;
> + __u8 epaddr;
> + char phys[64];
> + unsigned char *data;
> + size_t bsize;
> +};
> +
> +static void pxrc_usb_irq(struct urb *urb)
> +{
> + struct pxrc *pxrc = urb->context;
> + int error;
> +
> + switch (urb->status) {
> + case 0:
> + /* success */
> + break;
> + case -ETIME:
> + /* this urb is timing out */
> + dev_dbg(&pxrc->intf->dev,
> + "%s - urb timed out - was the device unplugged?\n",
> + __func__);
> + return;
> + case -ECONNRESET:
> + case -ENOENT:
> + case -ESHUTDOWN:
> + case -EPIPE:
> + /* this urb is terminated, clean up */
> + dev_dbg(&pxrc->intf->dev, "%s - urb shutting down with status: %d\n",
> + __func__, urb->status);
> + return;
> + default:
> + dev_dbg(&pxrc->intf->dev, "%s - nonzero urb status received: %d\n",
> + __func__, urb->status);
> + goto exit;
> + }
> +
> + if (urb->actual_length == 8) {
> + input_report_abs(pxrc->input, ABS_X, pxrc->data[0]);
> + input_report_abs(pxrc->input, ABS_Y, pxrc->data[2]);
> + input_report_abs(pxrc->input, ABS_RX, pxrc->data[3]);
> + input_report_abs(pxrc->input, ABS_RY, pxrc->data[4]);
> + input_report_abs(pxrc->input, ABS_RUDDER, pxrc->data[5]);
> + input_report_abs(pxrc->input, ABS_THROTTLE, pxrc->data[6]);
> + input_report_abs(pxrc->input, ABS_MISC, pxrc->data[7]);
> +
> + input_report_key(pxrc->input, BTN_A, pxrc->data[1]);
> + }
> +
> +exit:
> + /* Resubmit to fetch new fresh URBs */
> + error = usb_submit_urb(urb, GFP_ATOMIC);
> + if (error && error != -EPERM)
> + dev_err(&pxrc->intf->dev,
> + "%s - usb_submit_urb failed with result: %d",
> + __func__, error);
> +}
> +
> +static int pxrc_open(struct input_dev *input)
> +{
> + struct pxrc *pxrc = input_get_drvdata(input);
> + int retval;
> +
> + mutex_lock(&pxrc->pm_mutex);
> + retval = usb_submit_urb(pxrc->urb, GFP_KERNEL);
> + if (retval) {
> + dev_err(&pxrc->intf->dev,
> + "%s - usb_submit_urb failed, error: %d\n",
> + __func__, retval);
> + retval = -EIO;
> + goto out;
> + }
> +
> + pxrc->is_open = true;
> +
> +out:
> + mutex_unlock(&pxrc->pm_mutex);
> + return retval;
> +}
> +
> +static void pxrc_close(struct input_dev *input)
> +{
> + struct pxrc *pxrc = input_get_drvdata(input);
> +
> + mutex_lock(&pxrc->pm_mutex);
> + usb_kill_urb(pxrc->urb);
> + pxrc->is_open = false;
> + mutex_unlock(&pxrc->pm_mutex);
> +}
> +
> +static int pxrc_usb_init(struct pxrc *pxrc)
> +{
> + struct usb_endpoint_descriptor *epirq;
> + unsigned int pipe;
> + int retval;
> +
> + /* Set up the endpoint information */
> + /* This device only has an interrupt endpoint */
> + retval = usb_find_common_endpoints(pxrc->intf->cur_altsetting,
> + NULL, NULL, &epirq, NULL);
> + if (retval) {
> + dev_err(&pxrc->intf->dev,
> + "Could not find endpoint\n");
> + goto error;
> + }
> +
> + pxrc->bsize = usb_endpoint_maxp(epirq);
> + pxrc->epaddr = epirq->bEndpointAddress;
> + pxrc->data = devm_kmalloc(&pxrc->intf->dev, pxrc->bsize, GFP_KERNEL);
> + if (!pxrc->data) {
> + retval = -ENOMEM;
> + goto error;
> + }
> +
> + usb_set_intfdata(pxrc->intf, pxrc);
> + usb_make_path(pxrc->udev, pxrc->phys, sizeof(pxrc->phys));
> + strlcat(pxrc->phys, "/input0", sizeof(pxrc->phys));
> +
> + pxrc->urb = usb_alloc_urb(0, GFP_KERNEL);
> + if (!pxrc->urb) {
> + retval = -ENOMEM;
> + goto error;
> + }
> +
> + pipe = usb_rcvintpipe(pxrc->udev, pxrc->epaddr),
> + usb_fill_int_urb(pxrc->urb, pxrc->udev, pipe, pxrc->data, pxrc->bsize,
> + pxrc_usb_irq, pxrc, 1);
> +
> +error:
> + return retval;
> +
> +
> +}
> +
> +static int pxrc_input_init(struct pxrc *pxrc)
> +{
> + pxrc->input = devm_input_allocate_device(&pxrc->intf->dev);
> + if (pxrc->input == NULL) {
> + dev_err(&pxrc->intf->dev, "couldn't allocate input device\n");
> + return -ENOMEM;
> + }
> +
> + pxrc->input->name = "PXRC Flight Controller Adapter";
> + pxrc->input->phys = pxrc->phys;
> + usb_to_input_id(pxrc->udev, &pxrc->input->id);
> +
> + pxrc->input->open = pxrc_open;
> + pxrc->input->close = pxrc_close;
> +
> + input_set_capability(pxrc->input, EV_KEY, BTN_A);
> + input_set_abs_params(pxrc->input, ABS_X, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_Y, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_RX, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_RY, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_RUDDER, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_THROTTLE, 0, 255, 0, 0);
> + input_set_abs_params(pxrc->input, ABS_MISC, 0, 255, 0, 0);
> +
> + input_set_drvdata(pxrc->input, pxrc);
> +
> + return input_register_device(pxrc->input);
> +}
> +
> +static int pxrc_probe(struct usb_interface *intf,
> + const struct usb_device_id *id)
> +{
> + struct pxrc *pxrc;
> + int retval;
> +
> + pxrc = devm_kzalloc(&intf->dev, sizeof(*pxrc), GFP_KERNEL);
> + if (!pxrc)
> + return -ENOMEM;
> +
> + mutex_init(&pxrc->pm_mutex);
> + pxrc->udev = usb_get_dev(interface_to_usbdev(intf));
> + pxrc->intf = intf;
> +
> + retval = pxrc_usb_init(pxrc);
> + if (retval)
> + goto error;
> +
> + retval = pxrc_input_init(pxrc);
> + if (retval)
> + goto err_free_urb;
> +
> + return 0;
> +
> +err_free_urb:
> + usb_free_urb(pxrc->urb);
> +
> +error:
> + return retval;
> +}
> +
> +static void pxrc_disconnect(struct usb_interface *intf)
> +{
> + struct pxrc *pxrc = usb_get_intfdata(intf);
> +
> + usb_free_urb(pxrc->urb);
> + usb_set_intfdata(intf, NULL);
> +}
> +
> +static int pxrc_suspend(struct usb_interface *intf, pm_message_t message)
> +{
> + struct pxrc *pxrc = usb_get_intfdata(intf);
> +
> + mutex_lock(&pxrc->pm_mutex);
> + if (pxrc->is_open)
> + usb_kill_urb(pxrc->urb);
> + mutex_unlock(&pxrc->pm_mutex);
> +
> + return 0;
> +}
> +
> +static int pxrc_resume(struct usb_interface *intf)
> +{
> + struct pxrc *pxrc = usb_get_intfdata(intf);
> + int retval = 0;
> +
> + mutex_lock(&pxrc->pm_mutex);
> + if (pxrc->is_open && usb_submit_urb(pxrc->urb, GFP_KERNEL) < 0)
> + retval = -EIO;
> +
> + mutex_unlock(&pxrc->pm_mutex);
> + return retval;
> +}
> +
> +static int pxrc_pre_reset(struct usb_interface *intf)
> +{
> + struct pxrc *pxrc = usb_get_intfdata(intf);
> +
> + mutex_lock(&pxrc->pm_mutex);
> + usb_kill_urb(pxrc->urb);
> + return 0;
> +}
> +
> +static int pxrc_post_reset(struct usb_interface *intf)
> +{
> + struct pxrc *pxrc = usb_get_intfdata(intf);
> + int retval = 0;
> +
> + if (pxrc->is_open && usb_submit_urb(pxrc->urb, GFP_KERNEL) < 0)
> + retval = -EIO;
> +
> + mutex_unlock(&pxrc->pm_mutex);
> +
> + return retval;
> +}
> +
> +static int pxrc_reset_resume(struct usb_interface *intf)
> +{
> + return pxrc_resume(intf);
> +}
> +
> +static struct usb_driver pxrc_driver = {
> + .name = "pxrc",
> + .probe = pxrc_probe,
> + .disconnect = pxrc_disconnect,
> + .id_table = pxrc_table,
> + .suspend = pxrc_suspend,
> + .resume = pxrc_resume,
> + .pre_reset = pxrc_pre_reset,
> + .post_reset = pxrc_post_reset,
> + .reset_resume = pxrc_reset_resume,
> +};
> +
> +module_usb_driver(pxrc_driver);
> +
> +MODULE_AUTHOR("Marcus Folkesson <marcus.folkesson@gmail.com>");
> +MODULE_DESCRIPTION("PhoenixRC Flight Controller Adapter");
> +MODULE_LICENSE("GPL v2");
> --
> 2.15.1
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter
From: Ravi Bangoria @ 2018-03-16 12:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180315144959.GB19643@redhat.com>
On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> For tiny binaries/libraries, different mmap regions points to the
>> same file portion. In such cases, we may increment reference counter
>> multiple times.
> Yes,
>
>> But while de-registration, reference counter will get
>> decremented only by once
> could you explain why this happens? sdt_increment_ref_ctr() and
> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> the same mappings?
Sorry, I thought this happens only for tiny binaries. But that is not the case.
This happens for binary / library of any length.
Also, it's not a problem with sdt_increment_ref_ctr() / sdt_increment_ref_ctr().
The problem happens with trace_uprobe_mmap_callback().
To illustrate in detail, I'm adding a pr_info() in trace_uprobe_mmap_callback():
vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ pr_info("0x%lx-0x%lx : 0x%lx\n", vma->vm_start, vma->vm_end, vaddr);
sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
Ok now, libpython has SDT markers with reference counter:
# readelf -n /usr/lib64/libpython2.7.so.1.0 | grep -A2 Provider
Provider: python
Name: function__entry
... Semaphore: 0x00000000002899d8
Probing on that marker:
# cd /sys/kernel/debug/tracing/
# echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
# echo 1 > events/sdt_python/function__entry/enable
When I run python:
# strace -o out python
mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
The first mmap() maps the whole library into one region. Second mmap()
and third mprotect() split out the whole region into smaller vmas and sets
appropriate protection flags.
Now, in this case, trace_uprobe_mmap_callback() updates reference counter
twice -- by second mmap() call and by third mprotect() call -- because both
regions contain reference counter offset. This I can verify in dmesg:
# dmesg | tail
trace_kprobe: 0x7fff926a0000-0x7fff926f0000 : 0x7fff926e99d8
trace_kprobe: 0x7fff926b0000-0x7fff926f0000 : 0x7fff926e99d8
Final vmas of libpython:
# cat /proc/`pgrep python`/maps | grep libpython
7fff92460000-7fff926a0000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fff926a0000-7fff926b0000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fff926b0000-7fff926f0000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
I see similar problem with normal binary as well. I'm using Brendan Gregg's
example[1]:
# readelf -n /tmp/tick | grep -A2 Provider
Provider: tick
Name: loop2
... Semaphore: 0x000000001005003c
Probing that marker:
# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
# echo 1 > events/sdt_tick/loop2/enable
Now when I run the binary
# /tmp/tick
load_elf_binary() internally calls mmap() and I see trace_uprobe_mmap_callback()
updating reference counter twice:
# dmesg | tail
trace_kprobe: 0x10010000-0x10030000 : 0x10020036
trace_kprobe: 0x10020000-0x10030000 : 0x10020036
proc/<pid>/maps of the tick:
# cat /proc/`pgrep tick`/maps
10000000-10010000 r-xp 00000000 08:05 1335712 /tmp/tick
10010000-10020000 r--p 00000000 08:05 1335712 /tmp/tick
10020000-10030000 rw-p 00010000 08:05 1335712 /tmp/tick
[1] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
> Ether way, this patch doesn't look right at first glance... Just
> for example,
>
>> +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> + struct sdt_mm_list *tmp = tu->sml;
>> +
>> + if (!tu->sml || !mm)
>> + return false;
>> +
>> + while (tmp) {
>> + if (tmp->mm == mm)
>> + return true;
>> + tmp = tmp->next;
>> + }
>> +
>> + return false;
> ...
>
>> +}
>> +
>> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> + struct sdt_mm_list *tmp;
>> +
>> + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
>> + if (!tmp)
>> + return;
>> +
>> + tmp->mm = mm;
>> + tmp->next = tu->sml;
>> + tu->sml = tmp;
>> +}
>> +
> ...
>
>> @@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
>> !trace_probe_is_enabled(&tu->tp))
>> continue;
>>
>> + down_write(&tu->sml_rw_sem);
>> + if (sdt_check_mm_list(tu, vma->vm_mm))
>> + goto cont;
>> +
>> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
>> - sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
>> + if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
>> + sdt_add_mm_list(tu, vma->vm_mm);
>> +
>> +cont:
>> + up_write(&tu->sml_rw_sem);
> To simplify, suppose that tu->sml is empty.
>
> Some process calls this function, increments the counter and adds its ->mm into
> the list.
>
> Then it exits, ->mm is freed.
>
> The next fork/exec allocates the same memory for the new ->mm, the new process
> calls trace_uprobe_mmap_callback() and sdt_check_mm_list() returns T?
Yes. This can happen. May be we can use mmu_notifier for this?
We register a release() callback from trace_uprobe while adding mm
in tu->sml. When mm gets freed, trace_uprobe will get notified.
Though, I don't know much about mmu_notifier. I need to think on this.
Let me know if you have better ideas.
Thanks for the review :)
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-03-16 11:46 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180316113917.GA21303@redhat.com>
On 03/16/2018 05:09 PM, Oleg Nesterov wrote:
> On 03/16, Ravi Bangoria wrote:
>> On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
>>> Note to mention that sdt_find_vma() can return NULL but the callers do
>>> vma_offset_to_vaddr(vma) without any check.
>> If the "mm" we are passing to sdt_find_vma() is returned by
>> uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
>> _not_ return NULL.
> Not at all.
>
> Once build_map_info() returns any mapping can go away. Otherwise, why do
> you think the caller has to take ->mmap_sem and use find_vma()? If you
> were right, build_map_info() could just return the list of vma's instead
> of list of mm's.
Oh.. okay.. I was under wrong impression then. Will add a check there.
Thanks for the review :)
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Oleg Nesterov @ 2018-03-16 11:39 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu
In-Reply-To: <3b303a2d-35dd-9178-fc03-de9f2d588797@linux.vnet.ibm.com>
On 03/16, Ravi Bangoria wrote:
>
> On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
> > Note to mention that sdt_find_vma() can return NULL but the callers do
> > vma_offset_to_vaddr(vma) without any check.
>
> If the "mm" we are passing to sdt_find_vma() is returned by
> uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
> _not_ return NULL.
Not at all.
Once build_map_info() returns any mapping can go away. Otherwise, why do
you think the caller has to take ->mmap_sem and use find_vma()? If you
were right, build_map_info() could just return the list of vma's instead
of list of mm's.
Oleg.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 8/8] trace_uprobe/sdt: Document about reference counter
From: Ravi Bangoria @ 2018-03-16 9:42 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: oleg, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180315214750.fc1d53d01045d8e6c1e8e491@kernel.org>
On 03/15/2018 06:17 PM, Masami Hiramatsu wrote:
> Hi Ravi,
>
> On Wed, 14 Mar 2018 20:52:59 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
>> On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
>>> On Tue, 13 Mar 2018 18:26:03 +0530
>>> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>>>
>>>> No functionality changes.
>>> Please consider to describe what is this change and why, here.
>> Will add in next version.
> Thanks, and could you also move this before perf-probe patch?
> Also Could you make perf-probe check the tracing/README whether
> the kernel supports reference counter syntax or not?
>
> perf-tool can be used on older (or stable) kernel.
Sure, Will do that.
Thanks,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-03-16 9:31 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180315150156.GA19767@redhat.com>
On 03/15/2018 08:31 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> +sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
>> +{
>> + void *kaddr;
>> + struct page *page;
>> + struct vm_area_struct *vma;
>> + int ret = 0;
>> + unsigned short orig = 0;
>> +
>> + if (vaddr == 0)
>> + return -EINVAL;
>> +
>> + ret = get_user_pages_remote(NULL, mm, vaddr, 1,
>> + FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
>> + if (ret <= 0)
>> + return ret;
>> +
>> + kaddr = kmap_atomic(page);
>> + memcpy(&orig, kaddr + (vaddr & ~PAGE_MASK), sizeof(orig));
>> + orig += d;
>> + memcpy(kaddr + (vaddr & ~PAGE_MASK), &orig, sizeof(orig));
>> + kunmap_atomic(kaddr);
> Hmm. Why memcpy? You could simply do
>
> kaddr = kmap_atomic();
> unsigned short *ptr = kaddr + (vaddr & ~PAGE_MASK);
> *ptr += d;
> kunmap_atomic();
Yes, that should work. Will change it.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-03-16 9:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180315143044.GA19643@redhat.com>
On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
> On 03/15, Oleg Nesterov wrote:
>>> +static struct vm_area_struct *
>>> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
>>> +{
>>> + struct vm_area_struct *tmp;
>>> +
>>> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
>>> + if (sdt_valid_vma(tu, tmp))
>>> + return tmp;
>>> +
>>> + return NULL;
>> I can't understand the logic... Lets ignore sdt_valid_vma() for now.
>> The caller has uprobe_map_info, why it can't simply do
>> vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
> Note to mention that sdt_find_vma() can return NULL but the callers do
> vma_offset_to_vaddr(vma) without any check.
If the "mm" we are passing to sdt_find_vma() is returned by
uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
_not_ return NULL.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-03-16 9:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, acme, ananth, akpm, alexander.shishkin,
alexis.berlemont, corbet, dan.j.williams, gregkh, huawei.libin,
hughd, jack, jglisse, jolsa, kan.liang, kirill.shutemov, kjlx,
kstewart, linux-doc, linux-kernel, linux-mm, mhocko, milian.wolff,
mingo, namhyung, naveen.n.rao, pc, pombredanne, rostedt, tglx,
tmricht, willy, yao.jin, fengguang.wu, Ravi Bangoria
In-Reply-To: <20180315142120.GA19218@redhat.com>
On 03/15/2018 07:51 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> @@ -1053,6 +1056,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
>> struct uprobe *uprobe, *u;
>> struct inode *inode;
>>
>> + if (uprobe_mmap_callback)
>> + uprobe_mmap_callback(vma);
>> +
>> if (no_uprobe_events() || !valid_vma(vma, true))
>> return 0;
> probe_event_enable() does
>
> uprobe_register();
> /* WINDOW */
> sdt_increment_ref_ctr();
>
> what if uprobe_mmap() is called in between? The counter(s) in this vma
> will be incremented twice, no?
I guess, it's a valid issue with PATCH 5 but should be taken care by PATCH 6.
>
>> +static struct vm_area_struct *
>> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
>> +{
>> + struct vm_area_struct *tmp;
>> +
>> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
>> + if (sdt_valid_vma(tu, tmp))
>> + return tmp;
>> +
>> + return NULL;
> I can't understand the logic... Lets ignore sdt_valid_vma() for now.
> The caller has uprobe_map_info, why it can't simply do
> vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
Yes. that should work. Will change it.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-03-16 9:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, oleg, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu,
Ravi Bangoria
In-Reply-To: <20180315124816.6aa3d4e2@vmware.local.home>
On 03/15/2018 10:18 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:26:00 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
>> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
>> +{
>> + struct uprobe_map_info *info;
>> + struct vm_area_struct *vma;
>> + unsigned long vaddr;
>> +
>> + uprobe_start_dup_mmap();
> Please add a comment here that this function ups the mm ref count for
> each info returned. Otherwise it's hard to know what that mmput() below
> matches.
Sure. Will add it.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 4/8] Uprobe: Export uprobe_map_info along with uprobe_{build/free}_map_info()
From: Ravi Bangoria @ 2018-03-16 8:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, oleg, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu,
Ravi Bangoria
In-Reply-To: <20180315123255.17a8d997@vmware.local.home>
On 03/15/2018 10:02 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:59 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
>> These exported data structure and functions will be used by other
>> files in later patches.
> I'm reluctantly OK with the above.
>
>> No functionality changes.
> Please remove this line. There are functionality changes. Turning a
> static inline into an exported function *is* a functionality change.
Sure. Will change it.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/8] mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
From: Ravi Bangoria @ 2018-03-16 8:58 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, oleg, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu,
Ravi Bangoria
In-Reply-To: <20180315122840.02ac36ec@vmware.local.home>
On 03/15/2018 09:58 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:57 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
>> No functionality changes.
> Again, please add an explanation to why this patch is done.
Sure. Will add.
Thanks for the review,
Ravi
> -- Steve
>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 3/8] Uprobe: Rename map_info to uprobe_map_info
From: Ravi Bangoria @ 2018-03-16 8:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, oleg, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu,
Ravi Bangoria
In-Reply-To: <20180315124449.7d92c06b@vmware.local.home>
On 03/15/2018 10:14 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:58 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>> -static inline struct map_info *free_map_info(struct map_info *info)
>> +static inline struct uprobe_map_info *
>> +uprobe_free_map_info(struct uprobe_map_info *info)
>> {
>> - struct map_info *next = info->next;
>> + struct uprobe_map_info *next = info->next;
>> kfree(info);
>> return next;
>> }
>>
>> -static struct map_info *
>> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
>> +static struct uprobe_map_info *
>> +uprobe_build_map_info(struct address_space *mapping, loff_t offset,
> Also, as these functions have side effects (like you need to perform a
> mmput(info->mm), you need to add kerneldoc type comments to these
> functions, explaining how to use them.
>
> When you upgrade a function from static to use cases outside the file,
> it requires documenting that function for future users.
Sure, will add a comment here.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 1/8] Uprobe: Export vaddr <-> offset conversion functions
From: Ravi Bangoria @ 2018-03-16 8:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, oleg, peterz, srikar, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
gregkh, huawei.libin, hughd, jack, jglisse, jolsa, kan.liang,
kirill.shutemov, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, mhocko, milian.wolff, mingo, namhyung, naveen.n.rao, pc,
pombredanne, tglx, tmricht, willy, yao.jin, fengguang.wu,
Ravi Bangoria
In-Reply-To: <20180315122753.258c38df@vmware.local.home>
On 03/15/2018 09:57 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:56 +0530
> Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> wrote:
>
>> No functionality changes.
> Please add a detailed explanation why this patch is needed. All commits
> should be self sufficient and stand on their own. If I were to come up
> to this patch via a git blame, I would be clueless to why it was done.
Sure Steve, Will add description it in next series.
Thanks for the review,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v10 1/5] arm64: KVM: Prepare set virtual SEI syndrome value
From: gengdongjiu @ 2018-03-16 7:58 UTC (permalink / raw)
To: James Morse
Cc: rkrcmar, corbet, christoffer.dall, marc.zyngier, linux,
catalin.marinas, rjw, bp, lenb, kvm, linux-doc, linux-kernel,
linux-arm-kernel, kvmarm, linux-acpi, devel, huangshaoyu
In-Reply-To: <5AAAD9A1.6090107@arm.com>
Hi James,
Thank you very much for this mail and your time to review this patch. Appreciate that.
I will check it and reply.
On 2018/3/16 4:37, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 03/03/18 16:09, Dongjiu Geng wrote:
>> Export one API to specify virtual SEI syndrome value
>> for guest, and add a helper to get the VSESR_EL2 value.
>
> This patch adds two helpers that nothing calls... its not big, please merge it
> with the patch that uses these.
>
>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 413dc82..3294885 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -71,6 +71,11 @@ static inline void vcpu_set_hcr(struct kvm_vcpu *vcpu, unsigned long hcr)
>> vcpu->arch.hcr_el2 = hcr;
>> }
>>
>> +static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
>> +{
>> + return vcpu->arch.vsesr_el2;
>> +}
>> +
>> static inline void vcpu_set_vsesr(struct kvm_vcpu *vcpu, u64 vsesr)
>> {
>> vcpu->arch.vsesr_el2 = vsesr;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index a73f63a..3dc49b7 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -354,6 +354,8 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
>> int kvm_perf_init(void);
>> int kvm_perf_teardown(void);
>>
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
>> +
>> struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>>
>> static inline void __cpu_init_hyp_mode(phys_addr_t pgd_ptr,
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>> index 60666a0..78ecb28 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -186,3 +186,8 @@ void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>> {
>> pend_guest_serror(vcpu, ESR_ELx_ISV);
>> }
>> +
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome)
>> +{
>> + pend_guest_serror(vcpu, syndrome & ESR_ELx_ISS_MASK);
>
> If you move the ISS_MASK into pend_guest_serror(), you wouldn't need this at all.
>
> It would be better if any validation were in the user-space helpers so we can
> check user-space hasn't put something funny in the top bits.
>
>> +}
>>
>
>
> Thanks,
>
> James
>
> .
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 00/16] remove eight obsolete architectures
From: afzal mohammed @ 2018-03-16 4:50 UTC (permalink / raw)
To: Arnd Bergmann
Cc: David Howells, linux-arch, Linux Kernel Mailing List,
open list:DOCUMENTATION, linux-block, IDE-ML,
open list:HID CORE LAYER, Networking, linux-wireless, linux-pwm,
linux-rtc, linux-spi, linux-usb, dri-devel, linux-fbdev,
linux-watchdog, Linux FS-devel Mailing List, Linux-MM
In-Reply-To: <CAK8P3a10hBz7QYk5v5MfhVMPOwFnWYTn95WZp1HtHrd7-GQpRg@mail.gmail.com>
Hi,
On Thu, Mar 15, 2018 at 10:56:48AM +0100, Arnd Bergmann wrote:
> On Thu, Mar 15, 2018 at 10:42 AM, David Howells <dhowells@redhat.com> wrote:
> > Do we have anything left that still implements NOMMU?
Please don't kill !MMU.
> Yes, plenty.
> I've made an overview of the remaining architectures for my own reference[1].
> The remaining NOMMU architectures are:
>
> - arch/arm has ARMv7-M (Cortex-M microcontroller), which is actually
> gaining traction
ARMv7-R as well, also seems ARM is coming up with more !MMU's - v8-M,
v8-R. In addition, though only of academic interest, ARM MMU capable
platform's can run !MMU Linux.
afzal
> - arch/sh has an open-source J2 core that was added not that long ago,
> it seems to
> be the only SH compatible core that anyone is working on.
> - arch/microblaze supports both MMU/NOMMU modes (most use an MMU)
> - arch/m68k supports several NOMMU targets, both the coldfire SoCs and the
> classic processors
> - c6x has no MMU
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-03-15 21:20 UTC (permalink / raw)
To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto, efault,
torvalds, Roman Gushchin, Waiman Long
In-Reply-To: <1521148842-15486-1-git-send-email-longman@redhat.com>
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.
The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.
This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts. We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.
Alternatively, with the unified hiearachy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or may be to the cpu controller instead of staying
with cpuset.
Signed-off-by: Waiman Long <longman@redhat.com>
---
Documentation/cgroup-v2.txt | 96 ++++++++++++++++++++++++++++++++++++++++-----
kernel/cgroup/cpuset.c | 44 +++++++++++++++++++--
2 files changed, 127 insertions(+), 13 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..b91fd5d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,16 +48,18 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
- 5-3. IO
- 5-3-1. IO Interface Files
- 5-3-2. Writeback
- 5-4. PID
- 5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
- 5-6-1. RDMA Interface Files
- 5-7. Misc
- 5-7-1. perf_event
+ 5-3. Cpuset
+ 5.3-1. Cpuset Interface Files
+ 5-4. IO
+ 5-4-1. IO Interface Files
+ 5-4-2. Writeback
+ 5-5. PID
+ 5-5-1. PID Interface Files
+ 5-6. Device
+ 5-7. RDMA
+ 5-7-1. RDMA Interface Files
+ 5-8. Misc
+ 5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1243,6 +1245,80 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical. That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+ cpuset.cpus
+ A read-write multiple values file which exists on non-root
+ cgroups.
+
+ It lists the CPUs allowed to be used by tasks within this
+ cgroup. The CPU numbers are comma-separated numbers or
+ ranges. For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.cpus" or all the available CPUs if none is found.
+
+ The value of "cpuset.cpus" stays constant until the next update
+ and won't be affected by any CPU hotplug events.
+
+ cpuset.effective_cpus
+ A read-only multiple values file which exists on non-root
+ cgroups.
+
+ It lists the onlined CPUs that are actually allowed to be
+ used by tasks within the current cgroup. It is a subset of
+ "cpuset.cpus". Its value will be affected by CPU hotplug
+ events.
+
+ cpuset.mems
+ A read-write multiple values file which exists on non-root
+ cgroups.
+
+ It lists the memory nodes allowed to be used by tasks within
+ this cgroup. The memory node numbers are comma-separated
+ numbers or ranges. For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.mems" or all the available memory nodes if none
+ is found.
+
+ The value of "cpuset.mems" stays constant until the next update
+ and won't be affected by any memory nodes hotplug events.
+
+ cpuset.effective_mems
+ A read-only multiple values file which exists on non-root
+ cgroups.
+
+ It lists the onlined memory nodes that are actually allowed
+ to be used by tasks within the current cgroup. It is a subset
+ of "cpuset.mems". Its value will be affected by memory nodes
+ hotplug events.
+
+
IO
--
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e..7837d1f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
return 0;
}
-
/*
* for the common functions, 'private' gives the type of file
*/
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
{
.name = "cpus",
.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,43 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
};
/*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+ {
+ .name = "cpus",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * NR_CPUS),
+ .private = FILE_CPULIST,
+ },
+
+ {
+ .name = "mems",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * MAX_NUMNODES),
+ .private = FILE_MEMLIST,
+ },
+
+ {
+ .name = "effective_cpus",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_EFFECTIVE_CPULIST,
+ },
+
+ {
+ .name = "effective_mems",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_EFFECTIVE_MEMLIST,
+ },
+
+ { } /* terminate */
+};
+
+
+/*
* cpuset_css_alloc - allocate a cpuset css
* cgrp: control group that the new cpuset will be part of
*/
@@ -2104,8 +2140,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.post_attach = cpuset_post_attach,
.bind = cpuset_bind,
.fork = cpuset_fork,
- .legacy_cftypes = files,
+ .legacy_cftypes = legacy_files,
+ .dfl_cftypes = dfl_files,
.early_init = true,
+ .threaded = true,
};
/**
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2
From: Waiman Long @ 2018-03-15 21:20 UTC (permalink / raw)
To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto, efault,
torvalds, Roman Gushchin, Waiman Long
In-Reply-To: <1521148842-15486-1-git-send-email-longman@redhat.com>
In cgroup v1, there are a bunch of cpuset control knobs that are just
boolean flags. To reduce the proliteration of cpuset control knobs
in v2 where multiple controllers will be active in a single cgroup,
all the boolean flags are now consolidated into a single cpuset.flags
control knob.
Currently, the cpuset.flags control knob just has one flag -
sched_load_balance. This flag is needed to enable CPU isolation
similar to what can be done with the "isolcpus" kernel boot parameter.
More flags can be added in the future if desired.
When the "cpuset.flags" file is read, it shows what flags have been
currently enabled. Turning on or off specific flags are done in a
way similar to what controllers can be enabled or disabled in the
cgroup.subtree_control control knob.
To enable the sched_load_balance flag, use
# echo +sched_load_balance > cpuset.flags
To disable it, use
# echo -sched_load_balance > cpuset.flags
Signed-off-by: Waiman Long <longman@redhat.com>
---
Documentation/cgroup-v2.txt | 32 +++++++++++++++
kernel/cgroup/cpuset.c | 98 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 130 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index b91fd5d..362026a 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1318,6 +1318,38 @@ Cpuset Interface Files
of "cpuset.mems". Its value will be affected by memory nodes
hotplug events.
+ cpuset.flags
+ A read-write multiple values file which exists on non-root
+ cgroups.
+
+ On read, it lists the flags that are currently enabled. On
+ write, the '+' or '-' prefix with the flag name is used to
+ enable and disable the flag respectively.
+
+ The currently supported flag is:
+
+ sched_load_balance
+ When it is not set, there will be no load balancing
+ among CPUs on this cpuset. Tasks will stay in the
+ CPUs they are running on and will not be moved to
+ other CPUs.
+
+ When it is set, tasks within this cpuset will be
+ load-balanced by the kernel scheduler. Tasks will be
+ moved from CPUs with high load to other CPUs within
+ the same cpuset with less load periodically.
+
+ This flag is on by default and will have to be
+ explicitly turned off.
+
+ To set a flag, use the '+' prefix:
+
+ # echo +sched_load_balance > cpuset.flags
+
+ To clear a flag, use the '-' prefix:
+
+ # echo -sched_load_balance > cpuset.flags
+
IO
--
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7837d1f..3145dc0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1601,6 +1601,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_FLAGS,
} cpuset_filetype_t;
static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -1823,6 +1824,97 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
return 0;
}
+static const struct {
+ char *name;
+ int cs_bit;
+} cpuset_flags[] = {
+ { "sched_load_balance", CS_SCHED_LOAD_BALANCE },
+};
+
+static int cpuset_read_flags(struct seq_file *sf, void *v)
+{
+ struct cpuset *cs = css_cs(seq_css(sf));
+ unsigned long enabled = READ_ONCE(cs->flags);
+ int i, cnt;
+
+ for (i = cnt = 0; i < ARRAY_SIZE(cpuset_flags); i++) {
+ if (!test_bit(cpuset_flags[i].cs_bit, &enabled))
+ continue;
+
+ if (cnt++)
+ seq_putc(sf, ' ');
+
+ seq_printf(sf, "%s", cpuset_flags[i].name);
+ }
+ seq_putc(sf, '\n');
+ return 0;
+}
+
+static ssize_t cpuset_write_flags(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct cpuset *cs = css_cs(of_css(of));
+ unsigned long enable = 0, disable = 0;
+ char *tok;
+ int i;
+
+ mutex_lock(&cpuset_mutex);
+ if (!is_cpuset_online(cs)) {
+ nbytes = -ENODEV;
+ goto out_unlock;
+ }
+
+ /*
+ * Parse input - space seperated list of feature names prefixed
+ * with either + or -.
+ */
+ buf = strstrip(buf);
+ while ((tok = strsep(&buf, " "))) {
+ if (tok[0] == '\0')
+ continue;
+ for (i = 0; i < ARRAY_SIZE(cpuset_flags); i++)
+ if (!strcmp(tok + 1, cpuset_flags[i].name))
+ break;
+ if (i == ARRAY_SIZE(cpuset_flags)) {
+ nbytes = -EINVAL;
+ goto out_unlock;
+ }
+ if (*tok == '+') {
+ set_bit(cpuset_flags[i].cs_bit, &cs->flags);
+ enable |= 1UL << cpuset_flags[i].cs_bit;
+ disable &= 1UL << cpuset_flags[i].cs_bit;
+ } else if (*tok == '-') {
+ disable |= 1UL << cpuset_flags[i].cs_bit;
+ enable &= 1UL << cpuset_flags[i].cs_bit;
+ } else {
+ nbytes = -EINVAL;
+ goto out_unlock;
+ }
+ }
+
+ /*
+ * Set/clear the flags individually. It is possible an error may
+ * happen before all the flags are processed, but it should be rare.
+ */
+ for (i = 0; i < ARRAY_SIZE(cpuset_flags); i++) {
+ unsigned long flag = 1UL << cpuset_flags[i].cs_bit;
+ int retval;
+
+ if (flag & (enable|disable)) {
+ retval = update_flag(cpuset_flags[i].cs_bit, cs,
+ enable & flag);
+ if (retval) {
+ nbytes = retval;
+ goto out_unlock;
+ }
+ }
+ }
+
+out_unlock:
+ mutex_unlock(&cpuset_mutex);
+ return nbytes;
+}
+
/*
* for the common functions, 'private' gives the type of file
*/
@@ -1962,6 +2054,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.private = FILE_EFFECTIVE_MEMLIST,
},
+ {
+ .name = "flags",
+ .seq_show = cpuset_read_flags,
+ .write = cpuset_write_flags,
+ .private = FILE_FLAGS,
+ },
{ } /* terminate */
};
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox