Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: David Drysdale @ 2014-10-22 11:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <87ioje2ggq.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Oct 21, 2014 at 5:29 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>>
>>>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>>>> route forward for these patches.]
>>>>>
>>>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> Resending, adding cc:linux-api.
>>>>>>
>>>>>> Also, it may help to add a little more background -- this patch is
>>>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>>>
>>>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>>>> security [1].
>>>>>>
>>>>>> One of the features of Capsicum is capability mode, which locks down
>>>>>> access to global namespaces such as the filesystem hierarchy.  In
>>>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>>>> work -- hence the need for a kernel-space
>>>>>
>>>>> I just found myself wanting this syscall for another reason: injecting
>>>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>>>
>>>>> For example, I want to be able to reliably do something like nsenter
>>>>> --namespace-flags-here toybox sh.  Toybox's shell is unusual in that
>>>>> it is more or less fully functional, so this should Just Work (tm),
>>>>> except that the toybox binary might not exist in the namespace being
>>>>> entered.  If execveat were available, I could rig nsenter or a similar
>>>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>>>> execveat.
>>>>>
>>>>> Is there any reason that these patches can't be merged more or less as
>>>>> is for 3.19?
>>>>
>>>> Yes.  There is a silliness in how it implements fexecve.  The fexecve
>>>> case should be use the empty string "" not a NULL pointer to indication
>>>> that.  That change will then harmonize execveat with the other ...at
>>>> system calls and simplify the code and remove a special case.  I believe
>>>> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>>>
>>> Good point -- I'll shift to "" + AT_EMPTY_PATH.
>>
>> Pending a better idea, I would also see if the patches can be changed
>> to return an error if d_path ends up with an "(unreachable)" thing
>> rather than failing inexplicably later on.
>
> For my reference we are talking about
>
>> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
>>       sched_exec();
>>
>>       bprm->file = file;
>> -     bprm->filename = bprm->interp = filename->name;
>> +     if (filename && fd == AT_FDCWD) {
>> +             bprm->filename = filename->name;
>> +     } else {
>> +             pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> +             if (!pathbuf) {
>> +                     retval = -ENOMEM;
>> +                     goto out_unmark;
>> +             }
>> +             bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
>> +             if (IS_ERR(bprm->filename)) {
>> +                     retval = PTR_ERR(bprm->filename);
>> +                     goto out_unmark;
>> +             }
>> +     }
>> +     bprm->interp = bprm->filename;
>>
>>       retval = bprm_mm_init(bprm);
>>       if (retval)
>
> The interesting case for fexecve is when we either don't know what files
> are present or we don't want to depend on which files are present.
>
> As Al pointed out d_path really isn't the right solution.  It fails when
> printing /proc/self/fd/${fd}/${filename->name} would work, and the
> "(deleted)" or "(unreachable)" strings are wrong.
>
> The test for today's cases should be:
> if ((filename->name[0] == '/') || fd == AT_FDCWD) {
>         bprm->filename = filename->name;
> }
>
> To handle the case where the file descriptor is relevant.
(s/relevant/irrelevant)

Yep, good spot.

> For the case where the file descriptor is relevant let me suggest
> setting bprm->filename and bprm->interp to:
>
> /dev/fd/${fd}/${filename->name}

I'll send out an updated patchset with this approach, but I have a slight
reservation. Given that /dev/fd is a symlink to /proc/self/fd, this approach
means that script invocations will always fail on a /proc-less system,
where the previous iteration might have worked.

(As it happens, this isn't a restriction that affects the things I'm
working on, as Capsicum wouldn't allow script invocation anyway.
However, scenarios without /proc were nominally one of the motivating
factors for execveat in the first place...)

> It is more a description of what we have done but as a magic string it
> is descriptive.  Documetation/devices.txt documents that /dev/fd/ should
> exist, making it an unambiguous path.  Further these days the kernel
> sets the device naming policy in dev, so I think we are strongly safe in
> using that path in any event.
>
> I think execveat is interesting in the kernel because the motivating
> cases are the cases where anything except a static executable is
> uninteresting.

FYI, there is potential in the future for something other than static
executables -- the FreeBSD Capsicum implementation includes changes
to the dynamic linker to get its search path as a list of pre-opened dfds
(in LD_LIBRARY_PATH_FDS) rather than paths.

> Now it has been suggested creating a dupfs or a mini-proc.  I think that
> sounds like a nice companion, to the concept of a locked down root.
> But I don't think it removes the need for execveat (because we still
> have the case where we don't want to care what is mounted, and are happy
> to use static executables).
>
> Eric
>

^ permalink raw reply

* Re: [PATCH V2] kernel, add bug_on_warn
From: Prarit Bhargava @ 2014-10-22 10:13 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Jonathan Corbet, Andrew Morton, H. Peter Anvin,
	Andi Kleen, Masami Hiramatsu, Fabian Frederick, vgoyal,
	isimatu.yasuaki, linux-doc, kexec, linux-api
In-Reply-To: <87bnp491b3.fsf@rustcorp.com.au>



On 10/22/2014 12:27 AM, Rusty Russell wrote:
> Prarit Bhargava <prarit@redhat.com> writes:
>> There have been several times where I have had to rebuild a kernel to
>> cause a panic when hitting a WARN() in the code in order to get a crash
>> dump from a system.  Sometimes this is easy to do, other times (such as
>> in the case of a remote admin) it is not trivial to send new images to the
>> user.
>>
>> A much easier method would be a switch to change the WARN() over to a
>> BUG().  This makes debugging easier in that I can now test the actual
>> image the WARN() was seen on and I do not have to engage in remote
>> debugging.
>>
>> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
>> warn_slowpath_common() path.  The function will still print out the
>> location of the warning.
>>
>> An example of the bug_on_warn output:
>>
>> The first line below is from the WARN_ON() to output the WARN_ON()'s location.
>> After that the new BUG() call is displayed.
>>
>>  WARNING: CPU: 27 PID: 3204 at
>> /home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
>> [dummy_module]()
>>  bug_on_warn set, calling BUG()...
>>  ------------[ cut here ]------------
>>  kernel BUG at kernel/panic.c:434!
>>  invalid opcode: 0000 [#1] SMP
>>  Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
>> dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
>> coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
>> ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
>> sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
>> i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
>> nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
>> mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
>> drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
>> dm_region_hash dm_log dm_mod
>>  CPU: 27 PID: 3204 Comm: insmod Tainted: G           OE  3.17.0+ #19
>>  Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
>> RMLSDP.86I.00.29.D696.1311111329 11/11/2013
>>  task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
>>  RIP: 0010:[<ffffffff81076b81>]  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
>>  RSP: 0018:ffff8807fc5afc68  EFLAGS: 00010246
>>  RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
>>  RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
>>  R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
>>  R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
>>  FS:  00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
>>  Stack:
>>   ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
>>   ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
>>   0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
>>  Call Trace:
>>   [<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
>>   [<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
>>   [<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
>>   [<ffffffff81002144>] do_one_initcall+0xd4/0x210
>>   [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
>>   [<ffffffff810f8889>] load_module+0x16a9/0x1b30
>>   [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
>>   [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
>>   [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
>>   [<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
>>  Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
>> 80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
>> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
>>  RIP  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
>>   RSP <ffff8807fc5afc68>
>>  ---[ end trace 428218934a12088b ]---
>>
>> Successfully tested by me.
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
>> Cc: Fabian Frederick <fabf@skynet.be>
>> Cc: vgoyal@redhat.com
>> Cc: isimatu.yasuaki@jp.fujitsu.com
>> Cc: linux-doc@vger.kernel.org
>> Cc: kexec@lists.infradead.org
>> Cc: linux-api@vger.kernel.org
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>>
>> [v2]: add /proc/sys/kernel/bug_on_warn, additional documentation, modify
>>       !slowpath cases
>> ---
>>  Documentation/kdump/kdump.txt       |    7 +++++++
>>  Documentation/kernel-parameters.txt |    3 +++
>>  Documentation/sysctl/kernel.txt     |   12 ++++++++++++
>>  include/asm-generic/bug.h           |   12 ++++++++++--
>>  include/linux/kernel.h              |    1 +
>>  include/uapi/linux/sysctl.h         |    1 +
>>  kernel/panic.c                      |   21 ++++++++++++++++++++-
>>  kernel/sysctl.c                     |    7 +++++++
>>  kernel/sysctl_binary.c              |    1 +
>>  9 files changed, 62 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>> index 6c0b9f2..a04ed72 100644
>> --- a/Documentation/kdump/kdump.txt
>> +++ b/Documentation/kdump/kdump.txt
>> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
>>  
>>     http://people.redhat.com/~anderson/
>>  
>> +Trigger Kdump on WARN()
>> +=======================
>> +
>> +The kernel parameter, bug_on_warn, calls BUG() in all WARN() paths.  This
>> +will cause a kdump to occur at the BUG() call.  In cases where a user
>> +wants to specify this during runtime, /proc/sys/kernel/bug_on_warn can be
>> +set to 1 to achieve the same behaviour.
> 
> What about during early boot?

Hi Rusty,

I really don't have a use case for this in early boot.  The kernel boots, the
initramfs, and then we run whatever init (systemd in my case).  A systemd script
configures kexec for kdump and that point kdump is "armed".  Doing a bug_on_warn
before this will simply result in a panicked system.  I don't get any "new"
information FWIW as I get a stack trace, etc., in both the WARN() and BUG() cases.

> 
> I'd recommend you use core_param().  Less code, and can be set on
> commandline.

Is that a general request, or is it dependent on the answer above?  Of course I
have no problem doing it either way.

P.

^ permalink raw reply

* Re: [PATCH V2] kernel, add bug_on_warn
From: Rusty Russell @ 2014-10-22  4:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Prarit Bhargava, Jonathan Corbet, Andrew Morton, H. Peter Anvin,
	Andi Kleen, Masami Hiramatsu, Fabian Frederick, vgoyal,
	isimatu.yasuaki, linux-doc, kexec, linux-api
In-Reply-To: <1413910077-9464-1-git-send-email-prarit@redhat.com>

Prarit Bhargava <prarit@redhat.com> writes:
> There have been several times where I have had to rebuild a kernel to
> cause a panic when hitting a WARN() in the code in order to get a crash
> dump from a system.  Sometimes this is easy to do, other times (such as
> in the case of a remote admin) it is not trivial to send new images to the
> user.
>
> A much easier method would be a switch to change the WARN() over to a
> BUG().  This makes debugging easier in that I can now test the actual
> image the WARN() was seen on and I do not have to engage in remote
> debugging.
>
> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
> warn_slowpath_common() path.  The function will still print out the
> location of the warning.
>
> An example of the bug_on_warn output:
>
> The first line below is from the WARN_ON() to output the WARN_ON()'s location.
> After that the new BUG() call is displayed.
>
>  WARNING: CPU: 27 PID: 3204 at
> /home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
> [dummy_module]()
>  bug_on_warn set, calling BUG()...
>  ------------[ cut here ]------------
>  kernel BUG at kernel/panic.c:434!
>  invalid opcode: 0000 [#1] SMP
>  Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
> sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
> i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
> nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
> mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
> drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
> dm_region_hash dm_log dm_mod
>  CPU: 27 PID: 3204 Comm: insmod Tainted: G           OE  3.17.0+ #19
>  Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
> RMLSDP.86I.00.29.D696.1311111329 11/11/2013
>  task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
>  RIP: 0010:[<ffffffff81076b81>]  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
>  RSP: 0018:ffff8807fc5afc68  EFLAGS: 00010246
>  RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
>  RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
>  R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
>  R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
>  FS:  00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
>  Stack:
>   ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
>   ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
>   0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
>  Call Trace:
>   [<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
>   [<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
>   [<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
>   [<ffffffff81002144>] do_one_initcall+0xd4/0x210
>   [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
>   [<ffffffff810f8889>] load_module+0x16a9/0x1b30
>   [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
>   [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
>   [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
>   [<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
>  Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
> 80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
>  RIP  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
>   RSP <ffff8807fc5afc68>
>  ---[ end trace 428218934a12088b ]---
>
> Successfully tested by me.
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
> Cc: Fabian Frederick <fabf@skynet.be>
> Cc: vgoyal@redhat.com
> Cc: isimatu.yasuaki@jp.fujitsu.com
> Cc: linux-doc@vger.kernel.org
> Cc: kexec@lists.infradead.org
> Cc: linux-api@vger.kernel.org
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>
> [v2]: add /proc/sys/kernel/bug_on_warn, additional documentation, modify
>       !slowpath cases
> ---
>  Documentation/kdump/kdump.txt       |    7 +++++++
>  Documentation/kernel-parameters.txt |    3 +++
>  Documentation/sysctl/kernel.txt     |   12 ++++++++++++
>  include/asm-generic/bug.h           |   12 ++++++++++--
>  include/linux/kernel.h              |    1 +
>  include/uapi/linux/sysctl.h         |    1 +
>  kernel/panic.c                      |   21 ++++++++++++++++++++-
>  kernel/sysctl.c                     |    7 +++++++
>  kernel/sysctl_binary.c              |    1 +
>  9 files changed, 62 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 6c0b9f2..a04ed72 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
>  
>     http://people.redhat.com/~anderson/
>  
> +Trigger Kdump on WARN()
> +=======================
> +
> +The kernel parameter, bug_on_warn, calls BUG() in all WARN() paths.  This
> +will cause a kdump to occur at the BUG() call.  In cases where a user
> +wants to specify this during runtime, /proc/sys/kernel/bug_on_warn can be
> +set to 1 to achieve the same behaviour.

What about during early boot?

I'd recommend you use core_param().  Less code, and can be set on
commandline.

Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Joe Perches @ 2014-10-22  3:16 UTC (permalink / raw)
  To: Rom Lemarchand
  Cc: Arve Hjønnevåg, Dan Carpenter, Greg Kroah-Hartman,
	devel@driverdev.osuosl.org, Anup Patel, linux-api, LKML,
	John Stultz, Rebecca Schultz Zavin, Santosh Shilimkar,
	Sumit Semwal, Christoffer Dall
In-Reply-To: <CAABpnA8Oe9_8r9hyAQADPwNh4Zn63tUQAUD7f0WSYh1xcdtz6g@mail.gmail.com>

On Tue, 2014-10-21 at 20:10 -0700, Rom Lemarchand wrote:
> On the topic of maintainers for binder: both Arve Hjønnevåg
> (arve@android.com) and Riley Andrews (riandrews@android.com) have
> volunteered to be co-maintainers with Greg.
> 
> We would also like to make kernel-team@android.com the maintainer of
> the whole android directory.

It's generally better to have people as maintainers than
using nameless email addresses.

^ permalink raw reply

* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Rom Lemarchand @ 2014-10-22  3:10 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: Dan Carpenter, Greg Kroah-Hartman,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b@public.gmane.org,
	Anup Patel, linux-api-u79uwXL29TY76Z2rM5mHXA, LKML, John Stultz,
	Rebecca Schultz Zavin, Santosh Shilimkar, Sumit Semwal,
	Christoffer Dall
In-Reply-To: <CAMP5Xgcm-sxd3rf3VA1ZO44bUT1+u_QG1AAdjzK39q0ynfsZGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On the topic of maintainers for binder: both Arve Hjønnevåg
(arve-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org) and Riley Andrews (riandrews-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org) have
volunteered to be co-maintainers with Greg.

We would also like to make kernel-team-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org the maintainer of
the whole android directory.

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Andy Lutomirski @ 2014-10-22  0:58 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAGr1F2HYGG9=jwugywD8tUdB+dOjN4z+3BSpqL_m2aaM+3Rz1A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>
>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>> implementation.
>>>>>>
>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>
>>>>>
>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>> entering the container should also be implicitly contained within the
>>>>> cgroup-root of that container.
>>>>
>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>> instead of in userspace?
>>>>
>>>
>>> Userspace can do it too. Though then there will be possibility of
>>> having processes in the same mount namespace with different
>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>> more complex. Thats another reason why it might not be good idea to
>>> tie cgroups with mount namespace.
>>>
>>>>> Implementing pivot_cgroup_root would
>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>
>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>> to remain inside their container.
>>>>
>>>> So don't let them out.  None of the other namespaces have this kind of
>>>> constraint:
>>>>
>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>
>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>> handles in the outside namespace. I think moving a process outside of
>>> cgroupns-root is like allocating a resource outside of your namespace.
>>
>> In a pidns, you can see outside tasks if you have an outside procfs
>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>> like that?  You wouldn't be able to escape your cgroup as long as you
>> don't have an inappropriate cgroupfs mounted.
>>
>
> I am not if we should only depend on restricted visibility for this
> though. More details below.
>
>>
>>>>
>>>>> And with explicit permission from
>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>> suggested previously), we can make sure that unprivileged processes
>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>> semantics simple.
>>>>
>>>> I actually think it makes the semantics more complex.  The less policy
>>>> you stick in the kernel, the easier it is to understand the impact of
>>>> that policy.
>>>>
>>>
>>> My inclination is towards keeping things simpler - both in code as
>>> well as in configuration. I agree that cgroupns might seem
>>> "less-flexible", but in its current form, it encourages consistent
>>> container configuration. If you have a process that needs to move
>>> around between cgroups belonging to different containers, then that
>>> process should probably not be inside any container's cgroup
>>> namespace. Allowing that will just make the cgroup namespace
>>> pretty-much meaningless.
>>
>> The problem with pinning is that preventing it causes problems
>> (specifically, either something potentially complex and incompatible
>> needs to be added or unprivileged processes will be able to pin
>> themselves).
>>
>> Unless I'm missing something, a normal cgroupns user doesn't actually
>> need kernel pinning support to effectively constrain its members'
>> cgroups.
>>
>
> So there are 2 scenarios to consider:
>
> We have 2 containers with cgroups: /container1 and /container2
> Assume process P is running under cgroupns-root '/container1'
>
> (1) process P wants to 'write' to cgroup.procs outside its
> cgroupns-root (say to /container2/cgroup.procs)

This, at least, doesn't have the problem with unprivileged processes
pinning themselves.

> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
> with cgroupns-root above /container1) wants to write pid of process P
> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>
> For (1), I think its ok to reject such a write. This is consistent
> with the restriction in cgroup_file_write added in 'Patch 6' of this
> set. I believe this should be independent of visibility of the cgroup
> hierarchy for P.
>
> For (2), we may allow the write to succeed if we make sure that the
> process doing the write is an admin process (with CAP_SYS_ADMIN in its
> userns AND over P's cgroupns->user_ns).

Why is its userns relevant?

Why not just check whether the target cgroup is in the process doing
the write's cgroupns? (NB: you need to check f_cred, here, not
current_cred(), but that's orthogonal.)  Then the policy becomes: no
user of cgroupfs can move any process outside of the cgroupfs's user's
cgroupns root.

I think I'm okay with this.

> If this write succeeds, then:
> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
> by 'self' or any other process in P's cgrgroupns. I would really like
> to avoid showing relative paths or paths outside the cgroupns-root

The empty string seems just as problematic to me.

> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
> only able to mount and see cgroup hierarchy under its cgroupns-root
> (d) if process P tries to write to any cgroup file outside of its
> cgroupns-root (assuming that hierarchy is visible to it for whatever
> reason), it will fail as in (1)

I'm still unconvinced that this serves any purpose.  If you give
DAC/MAC permission to a task to write to something, and you give it
access to an fd or mount pointing there, and you don't want it writing
there, then *don't do that*.  I'm not really seeing why cgroupns needs
special treatment here.

>
> i.e., in summary, you can't escape out of cgroupns-root yourself. You
> will need help from an admin process running under some parent
> cgroupns-root to move you out. Is that workable for your usecase? Most
> of the things above already happen with the current patch-set, so it
> should be easy to enable this.
>
> Though there are still some open issues like:
> * what happens if you move all the processes out of /container1 and
> then 'rmdir /container1'? As it is now, you won't be able to setns()
> to that cgroupns anymore. But the cgroupns will still hang around
> until the processes switch their cgroupns.

Seems okay.

> * should we then also allow setns() without first entering the
> cgroupns-root? setns also checks the same conditions as in (a) plus it
> checks that your current cgroup is descendant of target cgroupns-root.
> Alternatively we can special-case setns() to own cgroupns so that it
> doesn't fail.

I think setns should completely ignore the caller's cgroup and should
not change it.  Userspace can do this.

> * migration for these processes will be tricky, if not impossible. But
> the admin trying to do this probably doesn't care about it or will
> provision for it.

Migration for processes in a mntns that have a current directory
outside their mntns is also difficult or impossible.  Same with
pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
procfs.  Nothing new here.

--Andy

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Aditya Kali @ 2014-10-22  0:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CALCETrXEAegFmSs2LnfSJR0tQmqZudnESDER8CoqKxOCBFMwdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>>
>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>> implementation.
>>>>>
>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>
>>>>
>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>> addition to restricting the process to a cgroup-root, new processes
>>>> entering the container should also be implicitly contained within the
>>>> cgroup-root of that container.
>>>
>>> Why?  Concretely, why should this be in the kernel namespace code
>>> instead of in userspace?
>>>
>>
>> Userspace can do it too. Though then there will be possibility of
>> having processes in the same mount namespace with different
>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>> more complex. Thats another reason why it might not be good idea to
>> tie cgroups with mount namespace.
>>
>>>> Implementing pivot_cgroup_root would
>>>> probably involve overloading mount-namespace to now understand cgroup
>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>> earlier (not via a new syscall though), but came to the conclusion
>>>> that its just simpler to have a separate cgroup namespace and get
>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>
>>>> About pinning: I really feel that it should be OK to pin processes
>>>> within cgroupns-root. I think thats one of the most important feature
>>>> of cgroup-namespace since its most common usecase is to containerize
>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>> to remain inside their container.
>>>
>>> So don't let them out.  None of the other namespaces have this kind of
>>> constraint:
>>>
>>>  - If you're in a mntns, you can still use fds from outside.
>>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>
>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>> handles in the outside namespace. I think moving a process outside of
>> cgroupns-root is like allocating a resource outside of your namespace.
>
> In a pidns, you can see outside tasks if you have an outside procfs
> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
> like that?  You wouldn't be able to escape your cgroup as long as you
> don't have an inappropriate cgroupfs mounted.
>

I am not if we should only depend on restricted visibility for this
though. More details below.

>
>>>
>>>> And with explicit permission from
>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>> suggested previously), we can make sure that unprivileged processes
>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>> semantics simple.
>>>
>>> I actually think it makes the semantics more complex.  The less policy
>>> you stick in the kernel, the easier it is to understand the impact of
>>> that policy.
>>>
>>
>> My inclination is towards keeping things simpler - both in code as
>> well as in configuration. I agree that cgroupns might seem
>> "less-flexible", but in its current form, it encourages consistent
>> container configuration. If you have a process that needs to move
>> around between cgroups belonging to different containers, then that
>> process should probably not be inside any container's cgroup
>> namespace. Allowing that will just make the cgroup namespace
>> pretty-much meaningless.
>
> The problem with pinning is that preventing it causes problems
> (specifically, either something potentially complex and incompatible
> needs to be added or unprivileged processes will be able to pin
> themselves).
>
> Unless I'm missing something, a normal cgroupns user doesn't actually
> need kernel pinning support to effectively constrain its members'
> cgroups.
>

So there are 2 scenarios to consider:

We have 2 containers with cgroups: /container1 and /container2
Assume process P is running under cgroupns-root '/container1'

(1) process P wants to 'write' to cgroup.procs outside its
cgroupns-root (say to /container2/cgroup.procs)
(2) An admin process running in init_cgroup_ns (or any parent cgroupns
with cgroupns-root above /container1) wants to write pid of process P
to /container2/cgroup.procs (which lies outside of P's cgroupns-root)

For (1), I think its ok to reject such a write. This is consistent
with the restriction in cgroup_file_write added in 'Patch 6' of this
set. I believe this should be independent of visibility of the cgroup
hierarchy for P.

For (2), we may allow the write to succeed if we make sure that the
process doing the write is an admin process (with CAP_SYS_ADMIN in its
userns AND over P's cgroupns->user_ns).
If this write succeeds, then:
(a) process P's /proc/<pid>/cgroup does not show anything when viewed
by 'self' or any other process in P's cgrgroupns. I would really like
to avoid showing relative paths or paths outside the cgroupns-root
(b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
only able to mount and see cgroup hierarchy under its cgroupns-root
(d) if process P tries to write to any cgroup file outside of its
cgroupns-root (assuming that hierarchy is visible to it for whatever
reason), it will fail as in (1)

i.e., in summary, you can't escape out of cgroupns-root yourself. You
will need help from an admin process running under some parent
cgroupns-root to move you out. Is that workable for your usecase? Most
of the things above already happen with the current patch-set, so it
should be easy to enable this.

Though there are still some open issues like:
* what happens if you move all the processes out of /container1 and
then 'rmdir /container1'? As it is now, you won't be able to setns()
to that cgroupns anymore. But the cgroupns will still hang around
until the processes switch their cgroupns.
* should we then also allow setns() without first entering the
cgroupns-root? setns also checks the same conditions as in (a) plus it
checks that your current cgroup is descendant of target cgroupns-root.
Alternatively we can special-case setns() to own cgroupns so that it
doesn't fail.
* migration for these processes will be tricky, if not impossible. But
the admin trying to do this probably doesn't care about it or will
provision for it.

>>
>>>>
>>>> If we ditch the pinning requirement and allow the containarized
>>>> process to move outside of its cgroupns-root, we will have to address
>>>> atleast the following:
>>>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>>>> look like? We might need to just not show anything in
>>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>>
>>> The process should see the cgroup path relative to its cgroup ns.
>>> Whether this requires a new /proc mount or happens automatically is an
>>> open question.  (I *hate* procfs for reasons like this.)
>>>
>>>> * how should future setns() and unshare() by such process behave?
>>>
>>> Open question.
>>>
>>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>>
>>> You could disallow that and instead require 'mount -t cgroup -o
>>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>>> relative to the caller's cgroupns.
>>>
>>>> * container will not remain migratable
>>>
>>> Why not?
>>>
>>
>> Well, the processes running outside of cgroupns root will be exposed
>> to information outside of the container (i.e., its /proc/self/cgroup
>> will show paths involving other containers and potentially system
>> level information). So unless you even restore them, it will be
>> difficult to restore these processes. The whole point of virtualizing
>> the /proc/self/cgroup view was so that the processes don't see outside
>> cgroups.
>>
>
> So don't do that?
>

Lot of non-cgroup-manager userspace processes have legitimate reasons
to read /proc/self/cgroup. One example is to register for OOM
notifications. Migratability of the container is also very important.
So "don't do that" is not always an option :)


> --Andy

-- 
Aditya

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Andy Lutomirski @ 2014-10-21 22:42 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Eric W. Biederman, Serge E. Hallyn, Linux API, Linux Containers,
	Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <CAGr1F2FdQ4VF1_o7mdybZ-WhLLhFxdgkNnzotHOwnhLU8W+YCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>
>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>> implementation.
>>>>
>>>> Could be.  I'll defer to Aditya for that one.
>>>>
>>>
>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>> addition to restricting the process to a cgroup-root, new processes
>>> entering the container should also be implicitly contained within the
>>> cgroup-root of that container.
>>
>> Why?  Concretely, why should this be in the kernel namespace code
>> instead of in userspace?
>>
>
> Userspace can do it too. Though then there will be possibility of
> having processes in the same mount namespace with different
> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
> more complex. Thats another reason why it might not be good idea to
> tie cgroups with mount namespace.
>
>>> Implementing pivot_cgroup_root would
>>> probably involve overloading mount-namespace to now understand cgroup
>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>> earlier (not via a new syscall though), but came to the conclusion
>>> that its just simpler to have a separate cgroup namespace and get
>>> clear semantics. One of the issues was that implicitly changing cgroup
>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>
>>> About pinning: I really feel that it should be OK to pin processes
>>> within cgroupns-root. I think thats one of the most important feature
>>> of cgroup-namespace since its most common usecase is to containerize
>>> un-trusted processes - processes that, for their entire lifetime, need
>>> to remain inside their container.
>>
>> So don't let them out.  None of the other namespaces have this kind of
>> constraint:
>>
>>  - If you're in a mntns, you can still use fds from outside.
>>  - If you're in a netns, you can still use sockets from outside the namespace.
>>  - If you're in an ipcns, you can still use ipc handles from outside.
>
> But none of the namespaces allow you to allocate new fds/sockets/ipc
> handles in the outside namespace. I think moving a process outside of
> cgroupns-root is like allocating a resource outside of your namespace.

In a pidns, you can see outside tasks if you have an outside procfs
mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
like that?  You wouldn't be able to escape your cgroup as long as you
don't have an inappropriate cgroupfs mounted.


>>
>>> And with explicit permission from
>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>> suggested previously), we can make sure that unprivileged processes
>>> cannot pin themselves. Also, maintaining this invariant (your current
>>> cgroup is always under your cgroupns-root) keeps the code and the
>>> semantics simple.
>>
>> I actually think it makes the semantics more complex.  The less policy
>> you stick in the kernel, the easier it is to understand the impact of
>> that policy.
>>
>
> My inclination is towards keeping things simpler - both in code as
> well as in configuration. I agree that cgroupns might seem
> "less-flexible", but in its current form, it encourages consistent
> container configuration. If you have a process that needs to move
> around between cgroups belonging to different containers, then that
> process should probably not be inside any container's cgroup
> namespace. Allowing that will just make the cgroup namespace
> pretty-much meaningless.

The problem with pinning is that preventing it causes problems
(specifically, either something potentially complex and incompatible
needs to be added or unprivileged processes will be able to pin
themselves).

Unless I'm missing something, a normal cgroupns user doesn't actually
need kernel pinning support to effectively constrain its members'
cgroups.

>
>>>
>>> If we ditch the pinning requirement and allow the containarized
>>> process to move outside of its cgroupns-root, we will have to address
>>> atleast the following:
>>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>>> look like? We might need to just not show anything in
>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>
>> The process should see the cgroup path relative to its cgroup ns.
>> Whether this requires a new /proc mount or happens automatically is an
>> open question.  (I *hate* procfs for reasons like this.)
>>
>>> * how should future setns() and unshare() by such process behave?
>>
>> Open question.
>>
>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>
>> You could disallow that and instead require 'mount -t cgroup -o
>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>> relative to the caller's cgroupns.
>>
>>> * container will not remain migratable
>>
>> Why not?
>>
>
> Well, the processes running outside of cgroupns root will be exposed
> to information outside of the container (i.e., its /proc/self/cgroup
> will show paths involving other containers and potentially system
> level information). So unless you even restore them, it will be
> difficult to restore these processes. The whole point of virtualizing
> the /proc/self/cgroup view was so that the processes don't see outside
> cgroups.
>

So don't do that?

--Andy

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Aditya Kali @ 2014-10-21 22:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CALCETrWXDMRsexfvmh2CiMW4WX0ZLJ4pJvzHU55PEBk=NmnyZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>> implementation.
>>>
>>> Could be.  I'll defer to Aditya for that one.
>>>
>>
>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>> addition to restricting the process to a cgroup-root, new processes
>> entering the container should also be implicitly contained within the
>> cgroup-root of that container.
>
> Why?  Concretely, why should this be in the kernel namespace code
> instead of in userspace?
>

Userspace can do it too. Though then there will be possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
more complex. Thats another reason why it might not be good idea to
tie cgroups with mount namespace.

>> Implementing pivot_cgroup_root would
>> probably involve overloading mount-namespace to now understand cgroup
>> filesystem too. I did attempt combining cgroupns-root with mntns
>> earlier (not via a new syscall though), but came to the conclusion
>> that its just simpler to have a separate cgroup namespace and get
>> clear semantics. One of the issues was that implicitly changing cgroup
>> on setns to mntns seemed like a huge undesirable side-effect.
>>
>> About pinning: I really feel that it should be OK to pin processes
>> within cgroupns-root. I think thats one of the most important feature
>> of cgroup-namespace since its most common usecase is to containerize
>> un-trusted processes - processes that, for their entire lifetime, need
>> to remain inside their container.
>
> So don't let them out.  None of the other namespaces have this kind of
> constraint:
>
>  - If you're in a mntns, you can still use fds from outside.
>  - If you're in a netns, you can still use sockets from outside the namespace.
>  - If you're in an ipcns, you can still use ipc handles from outside.

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

>
> etc.

>
>> And with explicit permission from
>> cgroup subsystem (something like cgroup.may_unshare as you had
>> suggested previously), we can make sure that unprivileged processes
>> cannot pin themselves. Also, maintaining this invariant (your current
>> cgroup is always under your cgroupns-root) keeps the code and the
>> semantics simple.
>
> I actually think it makes the semantics more complex.  The less policy
> you stick in the kernel, the easier it is to understand the impact of
> that policy.
>

My inclination is towards keeping things simpler - both in code as
well as in configuration. I agree that cgroupns might seem
"less-flexible", but in its current form, it encourages consistent
container configuration. If you have a process that needs to move
around between cgroups belonging to different containers, then that
process should probably not be inside any container's cgroup
namespace. Allowing that will just make the cgroup namespace
pretty-much meaningless.

>>
>> If we ditch the pinning requirement and allow the containarized
>> process to move outside of its cgroupns-root, we will have to address
>> atleast the following:
>> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
>> look like? We might need to just not show anything in
>> /proc/<pid>/cgroup in such case (for default hierarchy).
>
> The process should see the cgroup path relative to its cgroup ns.
> Whether this requires a new /proc mount or happens automatically is an
> open question.  (I *hate* procfs for reasons like this.)
>
>> * how should future setns() and unshare() by such process behave?
>
> Open question.
>
>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>
> You could disallow that and instead require 'mount -t cgroup -o
> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
> relative to the caller's cgroupns.
>
>> * container will not remain migratable
>
> Why not?
>

Well, the processes running outside of cgroupns root will be exposed
to information outside of the container (i.e., its /proc/self/cgroup
will show paths involving other containers and potentially system
level information). So unless you even restore them, it will be
difficult to restore these processes. The whole point of virtualizing
the /proc/self/cgroup view was so that the processes don't see outside
cgroups.

>> * added code complexity to handle above scenarios
>>
>> I understand that having process pinned to a cgroup hierarchy might
>> seem inconvenient. But even today (without cgroup namespaces), moving
>> a task from one cgroup to another can fail for reasons outside of
>> control of the task attempting the move (even if its privileged). So
>> the userspace should already handle this scenario. I feel its not
>> worth to add complexity in the kernel for this.
>
> --Andy



-- 
Aditya

^ permalink raw reply

* [PATCH] Add preadv2/pwritev2 documentation.
From: Milosz Tanski @ 2014-10-21 20:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-man
In-Reply-To: <cover.1413923420.git.milosz@adfin.com>

New syscalls that are a variation on the preadv/pwritev but support an extra
flag argument.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Fixes: Jeff Moyer <jmoyer@redhat.com>
---
 man2/readv.2 | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 61 insertions(+), 10 deletions(-)

diff --git a/man2/readv.2 b/man2/readv.2
index 8748efa..31b3870 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -45,6 +45,12 @@ readv, writev, preadv, pwritev \- read or write data into multiple buffers
 .sp
 .BI "ssize_t pwritev(int " fd ", const struct iovec *" iov ", int " iovcnt ,
 .BI "                off_t " offset );
+.sp
+.BI "ssize_t preadv2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI "                off_t " offset ", int " flags );
+.sp
+.BI "ssize_t pwritev2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI "                 off_t " offset ", int " flags );
 .fi
 .sp
 .in -4n
@@ -162,9 +168,9 @@ The
 system call combines the functionality of
 .BR writev ()
 and
-.BR pwrite (2).
+.BR pwrite (2) "."
 It performs the same task as
-.BR writev (),
+.BR writev () ","
 but adds a fourth argument,
 .IR offset ,
 which specifies the file offset at which the output operation
@@ -174,15 +180,41 @@ The file offset is not changed by these system calls.
 The file referred to by
 .I fd
 must be capable of seeking.
+.SS preadv2() and pwritev2()
+
+This pair of system calls has similar functionality to the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, but adds a fifth argument, \fIflags\fP, which modifies the behavior on a per call basis.
+
+Like the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, they accept an \fIoffset\fP argument. Unlike those calls, if the \fIoffset\fP argument is set to -1 then the current file offset is used and updated.
+
+The \fIflags\fP arguments to
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+contains a bitwise OR of one or more of the following flags:
+.TP
+.BR RWF_NONBLOCK " (only " preadv2() " since Linux 3.19)"
+Performs a non-blocking operation for regular files (not sockets) opened in buffered mode (not
+.BR O_DIRECT ")."
+
 .SH RETURN VALUE
 On success,
-.BR readv ()
-and
+.BR readv () ","
 .BR preadv ()
-return the number of bytes read;
-.BR writev ()
 and
+.BR preadv2 ()
+return the number of bytes read;
+.BR writev () ","
 .BR pwritev ()
+and
+.BR pwritev2 ()
 return the number of bytes written.
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
@@ -191,12 +223,22 @@ The errors are as given for
 and
 .BR write (2).
 Furthermore,
-.BR preadv ()
-and
+.BR preadv () ","
+.BR preadv2 () ","
 .BR pwritev ()
+and
+.BR pwritev2 ()
 can also fail for the same reasons as
 .BR lseek (2).
-Additionally, the following error is defined:
+Additionally, the following errors are defined:
+.TP
+.B EAGAIN
+The operation would block. This is possible if the file descriptor \fIfd\fP refers to a socket and has been marked nonblocking
+.RB ( O_NONBLOCK ),
+or the operation is a
+.BR preadv2
+and the \fIflags\fP argument is set to
+.BR RWF_NONBLOCK.
 .TP
 .B EINVAL
 The sum of the
@@ -205,12 +247,17 @@ values overflows an
 .I ssize_t
 value.
 Or, the vector count \fIiovcnt\fP is less than zero or greater than the
-permitted maximum.
+permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
 .SH VERSIONS
 .BR preadv ()
 and
 .BR pwritev ()
 first appeared in Linux 2.6.30; library support was added in glibc 2.10.
+.sp
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+first appeared in Linux 3.19 (if we're lucky);
 .SH CONFORMING TO
 .BR readv (),
 .BR writev ():
@@ -223,6 +270,10 @@ first appeared in Linux 2.6.30; library support was added in glibc 2.10.
 .BR preadv (),
 .BR pwritev ():
 nonstandard, but present also on the modern BSDs.
+.sp
+.BR preadv2 (),
+.BR pwritev2 ():
+nonstandard, Linux extension.
 .SH NOTES
 .SS C library/kernel ABI differences
 POSIX.1-2001 allows an implementation to place a limit on
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 4/4] vfs: RWF_NONBLOCK flag for preadv2
From: Milosz Tanski @ 2014-10-21 20:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk
In-Reply-To: <cover.1413923420.git.milosz@adfin.com>

Filesystems that generic_file_read_iter will not be allowed to perform
non-blocking reads. This only will read data if it's in the page cache and if
there is no page error (causing a re-read).

Christoph Hellwig wrote the filesystem specify code (cifs, ofs, shm, xfs).

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/cifs/file.c     |  6 ++++++
 fs/ocfs2/file.c    |  6 ++++++
 fs/pipe.c          |  3 ++-
 fs/read_write.c    | 21 ++++++++++++++-------
 fs/xfs/xfs_file.c  |  4 ++++
 include/linux/fs.h |  3 +++
 mm/filemap.c       | 18 ++++++++++++++++++
 mm/shmem.c         |  4 ++++
 8 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 3e4d00a..c485afa 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 324dc93..bb66ca4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index e3d8451..955d829 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -866,8 +871,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
 		return -EINVAL;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+		return -EAGAIN;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index eb596b4..b1f6334 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ed5711..eaebd99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1459,6 +1459,9 @@ struct block_device_operations;
 #define HAVE_COMPAT_IOCTL 1
 #define HAVE_UNLOCKED_IOCTL 1
 
+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
+
 struct iov_iter;
 
 struct file_operations {
diff --git a/mm/filemap.c b/mm/filemap.c
index 45964c8..e73ba7e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1493,6 +1493,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (flags & RWF_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1584,6 +1586,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & RWF_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1603,6 +1610,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & RWF_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1673,6 +1686,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1706,6 +1721,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (iocb->ki_rwflags & RWF_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index cd6fc75..5c30f04 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 3/4] vfs: Export new vector IO syscalls (with flags) to userland
From: Milosz Tanski @ 2014-10-21 20:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk
In-Reply-To: <cover.1413923420.git.milosz@adfin.com>

This is only for x86_64 and x86. Will add other arch later.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 arch/x86/syscalls/syscall_32.tbl | 2 ++
 arch/x86/syscalls/syscall_64.tbl | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 9fe1b5d..d592d87 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,5 @@
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
+358	i386	preadv2			sys_preadv2
+359	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 281150b..7be2447 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,8 @@
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
+322	64	preadv2			sys_preadv2
+323	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 2/4] vfs: Define new syscalls preadv2,pwritev2
From: Milosz Tanski @ 2014-10-21 20:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk
In-Reply-To: <cover.1413923420.git.milosz@adfin.com>

New syscalls that take an flag argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c                   | 82 ++++++++++++++++++++++++++++++++-------
 include/linux/syscalls.h          |  6 +++
 include/uapi/asm-generic/unistd.h |  6 ++-
 mm/filemap.c                      |  2 +-
 4 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9858c06..e3d8451 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -905,15 +909,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -931,10 +935,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -945,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -955,10 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -969,7 +971,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -979,6 +981,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_readv(fd, vec, vlen, flags);
+
+	return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_writev(fd, vec, vlen, flags);
+
+	return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81..cedc22e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -571,8 +571,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c1..10f8883 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 281
+__SC_COMP(__NR_preadv2, sys_preadv2)
+#define __NR_pwritev2 282
+__SC_COMP(__NR_pwritev2, sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -709,7 +713,7 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 
 #undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index cb7f530..45964c8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1735,7 +1735,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 1/4] vfs: Prepare for adding a new preadv/pwritev with user flags.
From: Milosz Tanski @ 2014-10-21 20:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk
In-Reply-To: <cover.1413923420.git.milosz@adfin.com>

Plumbing the flags argument through the vfs code so they can be passed down to
__generic_file_(read/write)_iter function that do the acctual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 drivers/target/target_core_file.c |  6 +++---
 fs/nfsd/vfs.c                     |  4 ++--
 fs/read_write.c                   | 28 ++++++++++++++++------------
 fs/splice.c                       |  2 +-
 include/linux/aio.h               |  2 ++
 include/linux/fs.h                |  4 ++--
 mm/filemap.c                      |  2 +-
 7 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 989129e..ef01c78 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -872,7 +872,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -960,7 +960,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 7d9318c..9858c06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -653,7 +653,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -662,6 +663,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_nbytes = len;
+	kiocb.ki_rwflags = flags;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
 	ret = fn(&kiocb, &iter);
@@ -800,7 +802,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -834,7 +837,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -857,27 +860,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -890,7 +893,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -910,7 +913,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -942,7 +945,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -966,7 +969,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1014,7 +1017,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1113,6 +1116,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 	return __compat_sys_preadv64(fd, vec, vlen, pos);
 }
 
+
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
 			    unsigned long vlen, loff_t *pos)
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..9591b9f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	int			ki_rwflags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a957d43..9ed5711 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1538,9 +1538,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..cb7f530 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1465,7 +1465,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* [PATCH 0/4] vfs: Non-blockling buffered fs read (page cache only)
From: Milosz Tanski @ 2014-10-21 20:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk

This patcheset introduces an ability to perform a non-blocking read from
regular files in buffered IO mode. This works by only for those filesystems
that have data in the page cache.

It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
extra flag argument (RWF_NONBLOCK).

It's a very common patern today (samba, libuv, etc..) use a large threadpool to
perform buffered IO operations. They submit the work form another thread
that performs network IO and epoll or other threads that perform CPU work. This
leads to increased latency for processing, esp. in the case of data that's
already cached in the page cache.

With the new interface the applications will now be able to fetch the data in
their network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" request by avoiding unnecessary queuing and having to swap out current
tasks in IO bound work threads.


Version 4 highlight:
 - Updated for 3.18-rc1.
 - Performance data from our application.
 - First stab at man page with Jeff's help. Patch is in-reply to.

RFC Version 3 highlights:
 - Down to 2 syscalls from 4; can user fp or argument position.
 - RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.

RFC Version 2 highlights:
 - Put the flags argument into kiocb (less noise), per. Al Viro
 - O_DIRECT checking early in the process, per. Jeff Moyer
 - Resolved duplicate (c&p) code in syscall code, per. Jeff
 - Included perf data in thread cover letter, per. Jeff
 - Created a new flag (not O_NONBLOCK) for readv2, perf Jeff


Some perf data generated using fio comparing the posix aio engine to a version
of the posix AIO engine that attempts to performs "fast" reads before
submitting the operations to the queue. This workflow is on ext4 partition on
raid0 (test / build-rig.) Simulating our database access patern workload using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing.)

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
                 stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
         mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
                 stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
         mint=600020msec, maxt=600178msec

Interpreting the results you can see total bandwidth stays the same but overall
request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
workloads. There is a slight bump in latency for since it's random data that's
unlikely to be cached but we're always trying "fast read".

In our application we have starting keeping track of "fast read" hits/misses
and for files / requests that have a lot hit ratio we don't do "fast reads"
mostly getting rid of extra latency in the uncached cases. In our real world
work load we were able to reduce average response time by 20 to 30% (depends
on amount of IO done by request).

I've performed other benchmarks and I have no observed any perf regressions in
any of the normal (old) code paths.


I have co-developed these changes with Christoph Hellwig.

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  vfs: Export new vector IO syscalls (with flags) to userland
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/cifs/file.c                    |   6 ++
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 ++
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |   4 ++
 include/linux/aio.h               |   2 +
 include/linux/fs.h                |   7 ++-
 include/linux/syscalls.h          |   6 ++
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  22 ++++++-
 mm/shmem.c                        |   4 ++
 16 files changed, 163 insertions(+), 40 deletions(-)

-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* system size test
From: Shuah Khan @ 2014-10-21 20:15 UTC (permalink / raw)
  To: Bird, Tim; +Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Shuah Khan

Hi Tim,

You mentioned you have a size test you would like to send it for
inclusion under selftest? Are you still planning to do that?

thanks,
-- Shuah
-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* rt-tester - move under selftests
From: Shuah Khan @ 2014-10-21 20:10 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

Andrew/Ingo/Thomas,

You are all on the signed-off list for rt-tester. I have been poking
around the source tree looking for existing test and came across rt-tester.

What are your thoughts on rt-tester suitability to reside under
tools/testing/selftest and becoming part of kselftest target?
Would you like see it moved under the kselftest umbrella?

thanks,
-- Shuah
-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: preadv2/pwritev2 rename
From: H. Peter Anvin @ 2014-10-21 20:07 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Milosz Tanski
  Cc: Christoph Hellwig, Jeff Moyer, Linux API
In-Reply-To: <CAKgNAkjwJ6SH_j_Op+u+p4-0mN-cBVE24Wrt4SyhAiG1MMF9tA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/20/2014 11:42 PM, Michael Kerrisk (man-pages) wrote:
> Hello Milosz,
> 
> On Mon, Oct 20, 2014 at 11:52 PM, Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> wrote:
>> Christoph and/or Jeff,
>>
>> I updated the patch for 3.18-rc1 and I'm going to resend it as non-RFC
>> as I didn't get comments last time.
>>
>> I only have one stupid question... I'm going to rename the calls to
>> preadv6 and pwritev6 (so it's more like the other syscalls: dup3,
>> accept4, eventfd2) but I'm not sure if i should call it preadv5 or
>> pwritev6 since the offset argument is split into two different
>> arguments (upper and lower part).
> 
> It's points like this that show exactly why naming system calls after
> the number of their arguments is a very bad idea[1]. Please don't do
> it. pwritev2() and preadv2() are not pretty either, but are marginally
> better. pwritev_fl() and preadv_fl() (or simialr) might also be okay,
> I guess.
> 

The splitting of the argument is a calling convention thing (and a
rather stupid one at that... we shouldn't do these kinds of things
manually.)  As such, it is not visible to the user and should not be
counted.

	-hpa

^ permalink raw reply

* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Pavel Machek @ 2014-10-21 20:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Greg Kroah-Hartman, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	John Stultz, lkml, devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b,
	Linux API, Santosh Shilimkar, Arve Hjønnevåg,
	Sumit Semwal, Rebecca Schultz Zavin, Christoffer Dall, Anup Patel
In-Reply-To: <4227199.e5u61J7jtX@wuerfel>

On Tue 2014-10-21 16:12:24, Arnd Bergmann wrote:
> On Tuesday 21 October 2014 12:36:22 Pavel Machek wrote:
> > On Fri 2014-10-17 01:12:21, Greg Kroah-Hartman wrote:
> > > On Thu, Oct 16, 2014 at 10:09:04AM -0700, John Stultz wrote:
> > > > On Thu, Oct 16, 2014 at 5:47 AM, Greg Kroah-Hartman
> > > > <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> > > > > From: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
> > 
> > > > Are the Android guys comfortable with the ABI stability rules they'll
> > > > now face?
> > > 
> > > Just because something is in staging, doesn't mean you don't have to
> > > follow the same ABI stability rules as the rest of the kernel.  If a
> > > change had happened to this code that broke userspace in the past, I
> > > would have reverted it.  So this should not be anything different from
> > > what has been happening inthe past.
> > 
> > Actually, there's big difference.
> > 
> > If Al Viro changes core filesystem in a way that breaks
> > staging/binder, binder is broken, and if it can't be fixed... well it
> > can't be fixed.
> > 
> > If Al Viro changes core filesystem in a way that breaks
> > drivers/binder, Al's change is going to be reverted.
> 
> One might have argued that we'd have to do that already, but the reasons
> for doing that with binder in the main kernel are certainly stronger.
> 
> > It is really hard to review without API documentation. Normally, API
> > documentation is required for stuff like this.
> > 
> > For example: does it add new files in /proc?
> > 
> > Given that it is stable, can we get rid of binder_debug() and
> > especially BINDER_DEBUG_ENTRY stuff?
> 
> Good point. We require documentation for every single sysfs attribute
> that gets added to a driver (some escape the review, but that doesn't
> change the rule), so we should not make an exception for a new procfs
> file here.

Actually, it looked like it is debugfs file. Code was messy enough
that I was not sure.

> > Could binder_transcation() be split to smaller functions according to
> > CodingStyle? 17 goto targets at the end of function are not exactly
> > easy to read.
> > 
> > ginder_thread_read/write also needs splitting.
> 
> Yes, in principle, but this is still a detail that would mainly serve
> to simplify review. The problem is more the lack of review and
> documentation of the API.

Yes, the problem is that code is impossible to review without API
documentation.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Andy Lutomirski @ 2014-10-21 19:02 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Ingo Molnar,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAGr1F2Ee2MCKOwALR2YV7ppDmyHxO6+EsHqSc1+WcwKFPPQB0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>> implementation.
>>
>> Could be.  I'll defer to Aditya for that one.
>>
>
> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
> addition to restricting the process to a cgroup-root, new processes
> entering the container should also be implicitly contained within the
> cgroup-root of that container.

Why?  Concretely, why should this be in the kernel namespace code
instead of in userspace?

> Implementing pivot_cgroup_root would
> probably involve overloading mount-namespace to now understand cgroup
> filesystem too. I did attempt combining cgroupns-root with mntns
> earlier (not via a new syscall though), but came to the conclusion
> that its just simpler to have a separate cgroup namespace and get
> clear semantics. One of the issues was that implicitly changing cgroup
> on setns to mntns seemed like a huge undesirable side-effect.
>
> About pinning: I really feel that it should be OK to pin processes
> within cgroupns-root. I think thats one of the most important feature
> of cgroup-namespace since its most common usecase is to containerize
> un-trusted processes - processes that, for their entire lifetime, need
> to remain inside their container.

So don't let them out.  None of the other namespaces have this kind of
constraint:

 - If you're in a mntns, you can still use fds from outside.
 - If you're in a netns, you can still use sockets from outside the namespace.
 - If you're in an ipcns, you can still use ipc handles from outside.

etc.

> And with explicit permission from
> cgroup subsystem (something like cgroup.may_unshare as you had
> suggested previously), we can make sure that unprivileged processes
> cannot pin themselves. Also, maintaining this invariant (your current
> cgroup is always under your cgroupns-root) keeps the code and the
> semantics simple.

I actually think it makes the semantics more complex.  The less policy
you stick in the kernel, the easier it is to understand the impact of
that policy.

>
> If we ditch the pinning requirement and allow the containarized
> process to move outside of its cgroupns-root, we will have to address
> atleast the following:
> * what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
> look like? We might need to just not show anything in
> /proc/<pid>/cgroup in such case (for default hierarchy).

The process should see the cgroup path relative to its cgroup ns.
Whether this requires a new /proc mount or happens automatically is an
open question.  (I *hate* procfs for reasons like this.)

> * how should future setns() and unshare() by such process behave?

Open question.

> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result

You could disallow that and instead require 'mount -t cgroup -o
cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
relative to the caller's cgroupns.

> * container will not remain migratable

Why not?

> * added code complexity to handle above scenarios
>
> I understand that having process pinned to a cgroup hierarchy might
> seem inconvenient. But even today (without cgroup namespaces), moving
> a task from one cgroup to another can fail for reasons outside of
> control of the task attempting the move (even if its privileged). So
> the userspace should already handle this scenario. I feel its not
> worth to add complexity in the kernel for this.

--Andy

^ permalink raw reply

* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Aditya Kali @ 2014-10-21 18:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Serge E. Hallyn, Linux API, Linux Containers,
	Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <CALCETrVFKvtHpTfY3kuE5ZTrwQAzuDmk6dm-mbQffDHAZmq-KQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>>> Possible solution:
>>>>>
>>>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>>
>>>> Of the possible solutions that seems attractive to me, simply because
>>>> we sometimes want to allow clever things to occur.
>>>>
>>>> Does anyone know of a reason (beyond pretty printing) why we need
>>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>>
>>>> I would expect permissions on the cgroup directories themselves, and
>>>> limited visiblilty would be (in general) to achieve the desired
>>>> visiblity.
>>>
>>> This makes the security impact of cgroupns very easy to understand,
>>> right?  Because there really won't be any -- cgroupns only affects
>>> reads from /proc and what cgroupfs shows, but it doesn't change any
>>> actual cgroups, nor does it affect any cgroup *changes*.
>>
>> It seems like what we have described is chcgrouproot aka chroot for
>> cgroups.  At which point I think there are potentially similar security
>> issues as for chroot.  Can we confuse a setuid root process if we make
>> it's cgroup names look different.
>>
>> Of course the confusing root concern is handled by the usual namespace
>> security checks that are already present.
>
> I think that the chroot issues are mostly in two categories: setuid
> confusion (not an issue here as you described) and chroot escapes.
> cgroupns escapes aren't a big deal, I think -- admins should deny the
> confined task the right to write to cgroupfs outside its hierarchy, by
> setting cgroupfs permissions appropriately and/or avoiding mounting
> cgroupfs outside the hierarchy.
>
>>
>> I do wonder if we think of this as chcgrouproot if there is a simpler
>> implementation.
>
> Could be.  I'll defer to Aditya for that one.
>

More than chcgrouproot, its probably closer to pivot_cgroup_root. In
addition to restricting the process to a cgroup-root, new processes
entering the container should also be implicitly contained within the
cgroup-root of that container. Implementing pivot_cgroup_root would
probably involve overloading mount-namespace to now understand cgroup
filesystem too. I did attempt combining cgroupns-root with mntns
earlier (not via a new syscall though), but came to the conclusion
that its just simpler to have a separate cgroup namespace and get
clear semantics. One of the issues was that implicitly changing cgroup
on setns to mntns seemed like a huge undesirable side-effect.

About pinning: I really feel that it should be OK to pin processes
within cgroupns-root. I think thats one of the most important feature
of cgroup-namespace since its most common usecase is to containerize
un-trusted processes - processes that, for their entire lifetime, need
to remain inside their container. And with explicit permission from
cgroup subsystem (something like cgroup.may_unshare as you had
suggested previously), we can make sure that unprivileged processes
cannot pin themselves. Also, maintaining this invariant (your current
cgroup is always under your cgroupns-root) keeps the code and the
semantics simple.

If we ditch the pinning requirement and allow the containarized
process to move outside of its cgroupns-root, we will have to address
atleast the following:
* what does its /proc/self/cgroup  (and /proc/<pid>/cgroup in general)
look like? We might need to just not show anything in
/proc/<pid>/cgroup in such case (for default hierarchy).
* how should future setns() and unshare() by such process behave?
* 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
* container will not remain migratable
* added code complexity to handle above scenarios

I understand that having process pinned to a cgroup hierarchy might
seem inconvenient. But even today (without cgroup namespaces), moving
a task from one cgroup to another can fail for reasons outside of
control of the task attempting the move (even if its privileged). So
the userspace should already handle this scenario. I feel its not
worth to add complexity in the kernel for this.

>>
>>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>>> the caller's cgroup.  Is there any reason it really needs to?
>>>>
>>>> setns doesn't but nsenter is going to need to change the cgroup
>>>> if the pinning requirement is kept.  nsenenter is going to want to
>>>> change the cgroup if the pinning requirement is dropped.
>>>>
>>>
>>> It seems easy enough for nsenter to change the cgroup all by itself.
>>
>> Again.  I don't think anyone has suggested or implemented anything
>> different.
>
> The current patchset seems to punt on this decision by just failing
> the setns call if the caller is outside the cgroup in question.
>
> --Andy

-- 
Aditya

^ permalink raw reply

* Re: [PATCH 2/2] fs: Support compiling out sendfile
From: Eric Paris @ 2014-10-21 17:20 UTC (permalink / raw)
  To: josh
  Cc: H. Peter Anvin, Pieter Smith, Alexander Viro, Andrew Morton,
	Matt Turner, Michal Hocko, Paul E. McKenney, Fabian Frederick,
	Tejun Heo, 蔡正龙, Luis R. Rodriguez,
	Peter Foley, Konstantin Khlebnikov, Eric W. Biederman,
	Oleg Nesterov, Andy Lutomirski, David Herrmann, Kees Cook,
	linux-fsdevel, open list, ABI/API
In-Reply-To: <20141021171814.GA14704@cloud>

On Tue, 2014-10-21 at 10:18 -0700, josh@joshtriplett.org wrote:
> On Tue, Oct 21, 2014 at 08:37:00AM -0700, H. Peter Anvin wrote:
> > On 10/20/2014 02:48 PM, Pieter Smith wrote:
> > > Many embedded systems will not need this syscall, and omitting it
> > > saves space.  Add a new EXPERT config option CONFIG_SENDFILE_SYSCALL
> > > (default y) to support compiling it out.
> > 
> > <bikeshed>
> > I believe these options ought to be CONFIG_SYSCALL_*
> > </bikeshed>
> 
> I agree.  I think people started using CONFIG_*_SYSCALL because of
> things like AUDITSYSCALL 

AUDITSYSCALL audits syscalls.  It doesn't actually implement any
syscalls.  You are right about SYSFS_SYSCALL though...

^ permalink raw reply

* [PATCH V2] kernel, add bug_on_warn
From: Prarit Bhargava @ 2014-10-21 16:47 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Prarit Bhargava, Jonathan Corbet, Andrew Morton, Rusty Russell,
	H. Peter Anvin, Andi Kleen, Masami Hiramatsu, Fabian Frederick,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA

There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system.  Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to the
user.

A much easier method would be a switch to change the WARN() over to a
BUG().  This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
warn_slowpath_common() path.  The function will still print out the
location of the warning.

An example of the bug_on_warn output:

The first line below is from the WARN_ON() to output the WARN_ON()'s location.
After that the new BUG() call is displayed.

 WARNING: CPU: 27 PID: 3204 at
/home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
[dummy_module]()
 bug_on_warn set, calling BUG()...
 ------------[ cut here ]------------
 kernel BUG at kernel/panic.c:434!
 invalid opcode: 0000 [#1] SMP
 Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
dm_region_hash dm_log dm_mod
 CPU: 27 PID: 3204 Comm: insmod Tainted: G           OE  3.17.0+ #19
 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
RMLSDP.86I.00.29.D696.1311111329 11/11/2013
 task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
 RIP: 0010:[<ffffffff81076b81>]  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
 RSP: 0018:ffff8807fc5afc68  EFLAGS: 00010246
 RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
 RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
 R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
 R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
 FS:  00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
 Stack:
  ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
  ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
  0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
 Call Trace:
  [<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
  [<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
  [<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
  [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
  [<ffffffff810f8889>] load_module+0x16a9/0x1b30
  [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
  [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
  [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
  [<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
 Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
 RIP  [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
  RSP <ffff8807fc5afc68>
 ---[ end trace 428218934a12088b ]---

Successfully tested by me.

Cc: Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
Cc: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
Cc: Andi Kleen <ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org>
Cc: Fabian Frederick <fabf-AgBVmzD5pcezQB+pC5nmwQ@public.gmane.org>
Cc: vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A@public.gmane.org
Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Prarit Bhargava <prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[v2]: add /proc/sys/kernel/bug_on_warn, additional documentation, modify
      !slowpath cases
---
 Documentation/kdump/kdump.txt       |    7 +++++++
 Documentation/kernel-parameters.txt |    3 +++
 Documentation/sysctl/kernel.txt     |   12 ++++++++++++
 include/asm-generic/bug.h           |   12 ++++++++++--
 include/linux/kernel.h              |    1 +
 include/uapi/linux/sysctl.h         |    1 +
 kernel/panic.c                      |   21 ++++++++++++++++++++-
 kernel/sysctl.c                     |    7 +++++++
 kernel/sysctl_binary.c              |    1 +
 9 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 6c0b9f2..a04ed72 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
 
    http://people.redhat.com/~anderson/
 
+Trigger Kdump on WARN()
+=======================
+
+The kernel parameter, bug_on_warn, calls BUG() in all WARN() paths.  This
+will cause a kdump to occur at the BUG() call.  In cases where a user
+wants to specify this during runtime, /proc/sys/kernel/bug_on_warn can be
+set to 1 to achieve the same behaviour.
 
 Contact
 =======
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 988160a..3890a3a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -553,6 +553,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	bttv.pll=	See Documentation/video4linux/bttv/Insmod-options
 	bttv.tuner=
 
+	bug_on_warn	BUG() instead of WARN().  Useful to cause kdump
+			on a WARN().
+
 	bulk_remove=off	[PPC]  This parameter disables the use of the pSeries
 			firmware feature for flushing multiple hpte entries
 			at a time.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57baff5..dcadcdc 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -23,6 +23,7 @@ show up in /proc/sys/kernel:
 - auto_msgmni
 - bootloader_type	     [ X86 only ]
 - bootloader_version	     [ X86 only ]
+- bug_on_warn
 - callhome		     [ S390 only ]
 - cap_last_cap
 - core_pattern
@@ -152,6 +153,17 @@ Documentation/x86/boot.txt for additional information.
 
 ==============================================================
 
+bug_on_warn:
+
+Calls BUG() in the WARN() path when set to 1.  This is useful to avoid
+a kernel rebuild when attempting to kdump at the location of a WARN().
+
+0: only WARN(), default behaviour.
+
+1: call BUG() after printing out WARN() location.
+
+==============================================================
+
 callhome:
 
 Controls the kernel's callhome behavior in case of a kernel panic.
diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 630dd23..4d0c763 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -75,10 +75,18 @@ extern void warn_slowpath_null(const char *file, const int line);
 #define __WARN_printf_taint(taint, arg...)				\
 	warn_slowpath_fmt_taint(__FILE__, __LINE__, taint, arg)
 #else
-#define __WARN()		__WARN_TAINT(TAINT_WARN)
+#define check_bug_on_warn()						\
+	do {								\
+		if (bug_on_warn)					\
+			BUG();						\
+	} while (0)
+
+#define __WARN()							\
+	do { __WARN_TAINT(TAINT_WARN); check_bug_on_warn(); } while (0)
+
 #define __WARN_printf(arg...)	do { printk(arg); __WARN(); } while (0)
 #define __WARN_printf_taint(taint, arg...)				\
-	do { printk(arg); __WARN_TAINT(taint); } while (0)
+	do { printk(arg); __WARN_TAINT(taint); check_bug_on_warn(); } while (0)
 #endif
 
 #ifndef WARN_ON
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 40728cf..4094a60 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -422,6 +422,7 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int sysctl_panic_on_stackoverflow;
+extern int bug_on_warn;
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 43aaba1..2ba0a58 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_BUG_ON_WARN=77, /* int: call BUG() in WARN() functions */
 };
 
 
diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..a6d2e2f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -33,6 +33,7 @@ static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 static bool crash_kexec_post_notifiers;
+int bug_on_warn;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
@@ -420,13 +421,24 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
 {
 	disable_trace_on_warning();
 
-	pr_warn("------------[ cut here ]------------\n");
+	if (!bug_on_warn)
+		pr_warn("------------[ cut here ]------------\n");
 	pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
 		raw_smp_processor_id(), current->pid, file, line, caller);
 
 	if (args)
 		vprintk(args->fmt, args->args);
 
+	if (bug_on_warn) {
+		pr_warn("bug_on_warn set, calling BUG()...\n");
+		/*
+		 * A flood of WARN()s may occur.  Prevent further WARN()s
+		 * from panicking the system.
+		 */
+		bug_on_warn = 0;
+		BUG();
+	}
+
 	print_modules();
 	dump_stack();
 	print_oops_end_marker();
@@ -501,3 +513,10 @@ static int __init oops_setup(char *s)
 	return 0;
 }
 early_param("oops", oops_setup);
+
+static int __init bug_on_warn_setup(char *s)
+{
+	bug_on_warn = 1;
+	return 0;
+}
+early_param("bug_on_warn", bug_on_warn_setup);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aada6d..030bb5d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1103,6 +1103,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+	{
+		.procname	= "bug_on_warn",
+		.data		= &bug_on_warn,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 9a4f750..28376bf 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
 	{ CTL_INT,	KERN_COMPAT_LOG,		"compat-log" },
 	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ CTL_INT,	KERN_BUG_ON_WARN,		"bug_on_warn" },
 	{}
 };
 
-- 
1.7.9.3

^ permalink raw reply related

* Re: [PATCH v8 02/18] vfio: platform: probe to devices on the platform bus
From: Eric Auger @ 2014-10-21 16:37 UTC (permalink / raw)
  To: Alex Williamson, Antonios Motakis
  Cc: VFIO DRIVER, marc.zyngier-5wv7dgnIgG8, ABI/API,
	will.deacon-5wv7dgnIgG8, open list,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
	kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
	christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1413908269.4202.135.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>

On 10/21/2014 06:17 PM, Alex Williamson wrote:
> On Mon, 2014-10-13 at 15:10 +0200, Antonios Motakis wrote:
>> Driver to bind to Linux platform devices, and callbacks to discover their
>> resources to be used by the main VFIO PLATFORM code.
>>
>> Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
>> ---
>>  drivers/vfio/platform/vfio_platform.c | 107 ++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h             |   1 +
>>  2 files changed, 108 insertions(+)
>>  create mode 100644 drivers/vfio/platform/vfio_platform.c
>>
>> diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
>> new file mode 100644
>> index 0000000..baeb7da
>> --- /dev/null
>> +++ b/drivers/vfio/platform/vfio_platform.c
>> @@ -0,0 +1,107 @@
>> +/*
>> + * Copyright (C) 2013 - Virtual Open Systems
>> + * Author: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License, version 2, as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/device.h>
>> +#include <linux/eventfd.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/notifier.h>
>> +#include <linux/pm_runtime.h>
>> +#include <linux/slab.h>
>> +#include <linux/types.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/io.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/irq.h>
>> +
>> +#include "vfio_platform_private.h"
>> +
>> +#define DRIVER_VERSION  "0.8"
>> +#define DRIVER_AUTHOR   "Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>"
>> +#define DRIVER_DESC     "VFIO for platform devices - User Level meta-driver"
>> +
>> +/* probing devices from the linux platform bus */
>> +
>> +static struct resource *get_platform_resource(struct vfio_platform_device *vdev,
>> +						int num)
>> +{
>> +	struct platform_device *dev = (struct platform_device *) vdev->opaque;
>> +	int i;
>> +
>> +	for (i = 0; i < dev->num_resources; i++) {
>> +		struct resource *r = &dev->resource[i];
>> +
>> +		if (resource_type(r) & (IORESOURCE_MEM|IORESOURCE_IO)) {
>> +			num--;
>> +
>> +			if (!num)
>> +				return r;
> 
> Has this been tested?  What happens when we call this with num = 0?
Yep. I confirm I enter that case with my xgmac where the IORESOURCE_MEM
is the 1st resource. I Just ended to the same cause ;-)

Best Regards

Eric
> 
>> +		}
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static int get_platform_irq(struct vfio_platform_device *vdev, int i)
>> +{
>> +	struct platform_device *pdev = (struct platform_device *) vdev->opaque;
>> +
>> +	return platform_get_irq(pdev, i);
>> +}
>> +
>> +
>> +static int vfio_platform_probe(struct platform_device *pdev)
>> +{
>> +	struct vfio_platform_device *vdev;
>> +	int ret;
>> +
>> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>> +	if (!vdev)
>> +		return -ENOMEM;
>> +
>> +	vdev->opaque = (void *) pdev;
>> +	vdev->name = pdev->name;
>> +	vdev->flags = VFIO_DEVICE_FLAGS_PLATFORM;
>> +	vdev->get_resource = get_platform_resource;
>> +	vdev->get_irq = get_platform_irq;
>> +
>> +	ret = vfio_platform_probe_common(vdev, &pdev->dev);
>> +	if (ret)
>> +		kfree(vdev);
>> +
>> +	return ret;
>> +}
>> +
>> +static int vfio_platform_remove(struct platform_device *pdev)
>> +{
>> +	return vfio_platform_remove_common(&pdev->dev);
>> +}
>> +
>> +static struct platform_driver vfio_platform_driver = {
>> +	.probe		= vfio_platform_probe,
>> +	.remove		= vfio_platform_remove,
>> +	.driver	= {
>> +		.name	= "vfio-platform",
>> +		.owner	= THIS_MODULE,
>> +	},
>> +};
>> +
>> +module_platform_driver(vfio_platform_driver);
>> +
>> +MODULE_VERSION(DRIVER_VERSION);
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>> +MODULE_DESCRIPTION(DRIVER_DESC);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 111b5e8..aca6d3e 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -158,6 +158,7 @@ struct vfio_device_info {
>>  	__u32	flags;
>>  #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
>>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
>> +#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>>  	__u32	num_regions;	/* Max region index + 1 */
>>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>>  };
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v8 02/18] vfio: platform: probe to devices on the platform bus
From: Alex Williamson @ 2014-10-21 16:17 UTC (permalink / raw)
  To: Antonios Motakis
  Cc: open list:VFIO DRIVER, eric.auger-QSEj5FYQhm4dnm+yROfE0A,
	marc.zyngier-5wv7dgnIgG8, open list:ABI/API,
	will.deacon-5wv7dgnIgG8, open list,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	tech-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J,
	kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
	christoffer.dall-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1413205825-6370-3-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>

On Mon, 2014-10-13 at 15:10 +0200, Antonios Motakis wrote:
> Driver to bind to Linux platform devices, and callbacks to discover their
> resources to be used by the main VFIO PLATFORM code.
> 
> Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
> ---
>  drivers/vfio/platform/vfio_platform.c | 107 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h             |   1 +
>  2 files changed, 108 insertions(+)
>  create mode 100644 drivers/vfio/platform/vfio_platform.c
> 
> diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
> new file mode 100644
> index 0000000..baeb7da
> --- /dev/null
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -0,0 +1,107 @@
> +/*
> + * Copyright (C) 2013 - Virtual Open Systems
> + * Author: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License, version 2, as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/io.h>
> +#include <linux/platform_device.h>
> +#include <linux/irq.h>
> +
> +#include "vfio_platform_private.h"
> +
> +#define DRIVER_VERSION  "0.8"
> +#define DRIVER_AUTHOR   "Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>"
> +#define DRIVER_DESC     "VFIO for platform devices - User Level meta-driver"
> +
> +/* probing devices from the linux platform bus */
> +
> +static struct resource *get_platform_resource(struct vfio_platform_device *vdev,
> +						int num)
> +{
> +	struct platform_device *dev = (struct platform_device *) vdev->opaque;
> +	int i;
> +
> +	for (i = 0; i < dev->num_resources; i++) {
> +		struct resource *r = &dev->resource[i];
> +
> +		if (resource_type(r) & (IORESOURCE_MEM|IORESOURCE_IO)) {
> +			num--;
> +
> +			if (!num)
> +				return r;

Has this been tested?  What happens when we call this with num = 0?

> +		}
> +	}
> +	return NULL;
> +}
> +
> +static int get_platform_irq(struct vfio_platform_device *vdev, int i)
> +{
> +	struct platform_device *pdev = (struct platform_device *) vdev->opaque;
> +
> +	return platform_get_irq(pdev, i);
> +}
> +
> +
> +static int vfio_platform_probe(struct platform_device *pdev)
> +{
> +	struct vfio_platform_device *vdev;
> +	int ret;
> +
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->opaque = (void *) pdev;
> +	vdev->name = pdev->name;
> +	vdev->flags = VFIO_DEVICE_FLAGS_PLATFORM;
> +	vdev->get_resource = get_platform_resource;
> +	vdev->get_irq = get_platform_irq;
> +
> +	ret = vfio_platform_probe_common(vdev, &pdev->dev);
> +	if (ret)
> +		kfree(vdev);
> +
> +	return ret;
> +}
> +
> +static int vfio_platform_remove(struct platform_device *pdev)
> +{
> +	return vfio_platform_remove_common(&pdev->dev);
> +}
> +
> +static struct platform_driver vfio_platform_driver = {
> +	.probe		= vfio_platform_probe,
> +	.remove		= vfio_platform_remove,
> +	.driver	= {
> +		.name	= "vfio-platform",
> +		.owner	= THIS_MODULE,
> +	},
> +};
> +
> +module_platform_driver(vfio_platform_driver);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 111b5e8..aca6d3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -158,6 +158,7 @@ struct vfio_device_info {
>  	__u32	flags;
>  #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
> +#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  };

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox