From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753490AbbI1V5q (ORCPT <rfc822;w@1wt.eu>);
	Mon, 28 Sep 2015 17:57:46 -0400
Received: from mail-db3on0082.outbound.protection.outlook.com ([157.55.234.82]:31862
	"EHLO emea01-db3-obe.outbound.protection.outlook.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1752763AbbI1V5n (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 28 Sep 2015 17:57:43 -0400
Authentication-Results: spf=none (sender IP is )
 smtp.mailfrom=cmetcalf@ezchip.com; 
Subject: Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
To: Andy Lutomirski <luto@amacapital.net>
References: <1443453446-7827-1-git-send-email-cmetcalf@ezchip.com>
 <1443453446-7827-8-git-send-email-cmetcalf@ezchip.com>
 <CALCETrWh2iW_sz7Bd+QqWa0OQZsrma_Dxk1zZEZUbpr9E+GuqA@mail.gmail.com>
CC: Gilad Ben Yossef <giladb@ezchip.com>, Steven Rostedt <rostedt@goodmis.org>,
        Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Rik van Riel <riel@redhat.com>, Tejun Heo <tj@kernel.org>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Christoph Lameter <cl@linux.com>,
        Viresh Kumar <viresh.kumar@linaro.org>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will.deacon@arm.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "H. Peter Anvin" <hpa@zytor.com>, X86 ML <x86@kernel.org>
From: Chris Metcalf <cmetcalf@ezchip.com>
Message-ID: <5609B7C0.3010807@ezchip.com>
Date: Mon, 28 Sep 2015 17:57:20 -0400
User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <CALCETrWh2iW_sz7Bd+QqWa0OQZsrma_Dxk1zZEZUbpr9E+GuqA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [12.216.194.146]
X-ClientProxiedBy: DM2PR11CA0019.namprd11.prod.outlook.com (25.160.91.29) To
 VI1PR02MB0783.eurprd02.prod.outlook.com (25.162.14.145)
X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;2:hiNHRCOplnEOKHJndcKworh9bMS1mzQ9WvKg35/KaPKkTalJjeEs30zodemAlUhFRCVLcn9Czr4pd0o/DcP9bhC5CHygT3zShOmdPC7RfLyrPLVVYE5wCAIAv6sX+aWXyRusch+P00k0Jp945t6PkRGT4bVQm5YyLSLs/YJinoI=;3:abe+FdOQM27SosFkUkYNw2Jb9ZCzLGCCjWnVAzNYn3j/jbQHe8SzOx5dH5lhCUu1NFkvMfjVZN6ac50Tkx7UKYix0w0VhATQB4S/LMTXWfZt0kFrqHwDOVmKM/iJUq0LIW8tFbfjVu2f+dl38tiWmQ==;25:4gu9uUS7RU3wbndbnk84YPz9wv45A76gaWXE3/bKEIUNwLgSjXWjxLoifljdq40NKIwd7SdCmPj4yi2AN985l7Wg3wW9nFhNN5D4Os5sL2rZsnhooA6JZq6zmoqucxpp0MxovnrTSBO3saxN7zy/VP6JxjjeOjIcPRt7Td2+P3qvjsImfuQ1Gngl16vHkQ2EBrhRj6i/aXdmx+bO4GFSk/dN6uclwp91aT/q+8oC/SZZ/g3wkIlwe0b7slyiRukmYxeY6NUVHl2l4jABhFPE/g==;20:8GIYMsUwl9offYXTEnWTDyAwCwXr4PKditzmZGjsYH+VR7ydNbOcffYQ35/G0IAXqD8mvaWq537I2y7ok6YxLWnJyYO+yBrWt/NpEkQtkjPJ2cYSNRdUIuJzxDbzyQz9x9/jcXu5/ohmA6WEVtRj+ElIF1sWm/vNRl4LM2aJRIs=
X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:VI1PR02MB0783;
X-Microsoft-Antispam-PRVS: <VI1PR02MB07833EB8258FF025743A8F23AF4F0@VI1PR02MB0783.eurprd02.prod.outlook.com>
X-Exchange-Antispam-Report-Test: UriScan:;
X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(2401047)(5005006)(520078)(8121501046)(3002001);SRVR:VI1PR02MB0783;BCL:0;PCL:0;RULEID:;SRVR:VI1PR02MB0783;
X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;4:wWsJ0OQmUB4kW+Epjz0ToI8J8H03NZZV3bV7/gJDMvqaZvYsPAI3/0GQmbPjFg5QFlI7eMLPBKQiIZnZbtLDJUIlBJQfdsT9aJtIFRDBdPiVcNkyCuimEh6GMSvH4F59USCyRag/uY5Mpq5dlidm2UCxhIQ3bnLiWw0uzs7ZZVjuMtvqvcq9snHohY9PRJIIvkdYQdmqnRSCf+SUHlHl+rdlDRrzroEYY53FT+eVnvzsiMb1oWm66aXRWf+XhrTA6SGa+YEdAEB48PDLJiczC1YNKavNBrmCnw0vh76Z0nI6jHI1aePizsq6WUFGkmR+LqxGtBVgMzs8vohP642qL5WDgohPrVeXFmlTJ0/SNho=
X-Forefront-PRVS: 0713BC207F
X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10009020)(6049001)(6009001)(24454002)(199003)(479174004)(189002)(377454003)(46102003)(15975445007)(81156007)(2950100001)(4001540100001)(77156002)(5001860100001)(54356999)(105586002)(110136002)(76176999)(77096005)(50986999)(87266999)(101416001)(64126003)(65816999)(50466002)(83506001)(36756003)(42186005)(5001960100002)(62966003)(68736005)(189998001)(106356001)(19580405001)(87976001)(64706001)(80316001)(97736004)(59896002)(66066001)(65806001)(47776003)(4001350100001)(33656002)(65956001)(23676002)(40100003)(5004730100002)(122386002)(5007970100001)(92566002)(86362001)(19580395003)(5001830100001)(18886065003);DIR:OUT;SFP:1101;SCL:1;SRVR:VI1PR02MB0783;H:[10.7.0.41];FPR:;SPF:None;PTR:InfoNoRecords;A:1;MX:1;LANG:en;
X-Microsoft-Exchange-Diagnostics: =?utf-8?B?MTtWSTFQUjAyTUIwNzgzOzIzOlBHT1huN201eDkvNUVJQ25zb2lac2dxN3JG?=
 =?utf-8?B?Y2laMzlmdlFkdm5XN2tDVmZmdmpoZmZIQ1JJZ0tkd25zWnZ1QzNFakorK0Ez?=
 =?utf-8?B?TXF1NnFFU0VMSGlraXAweVJXQXdxeitxaWJnb0VhbDVHSC9rZWlBN2poSlk1?=
 =?utf-8?B?WnQ1Z2dCcDZ5UzVzQ0x3S25RaTdhbDQ1V2h2WGJhSXFuSjRISzhSUWsvZDlL?=
 =?utf-8?B?YjhvZTF1UUlWYzB5YitWcnRwc2tjMWcwZ2hSVlQ2bUpkSUJtSmxBNFppOVUx?=
 =?utf-8?B?VGVMZEUzRFhuNjRIK2czS1BhZFBnWjkwTjZNOFlEZXZkV3Uvdkg1Ykxmb0tl?=
 =?utf-8?B?OXIwSnBmUS9LQStyZ0c2ZEtsMkl6Nnh3RHpGYnF1Wms2YUFJSWFPOEo1Ym90?=
 =?utf-8?B?M1pkTDRpb2dVNXNSdVMwK3l3WUtzKzE2eDBwMVVRYVdtQ0kydnZUaFV1K3FJ?=
 =?utf-8?B?UEdjK1hPSnZkeUpMWTUyc3VUa05yYUZLTzY0M3RwM2FZZEF5cnFuakQ1c05v?=
 =?utf-8?B?L0lWczg2eGtvTEJXMWdHaUo1NkNGcnZxRkc2NHQycGZCTE1LRHV4WjNUSVlo?=
 =?utf-8?B?YXliTnNiblh1MUZvUDU4OG5lQ3NRSHRxUTdrc3JXaVRQdTE1QVJNV2oxSkJz?=
 =?utf-8?B?RERBQzdEVWZrVlNVVnFNWWlWSzFTUE1EVjl2MXd3VU1LSlltMTFJUmpvTzJ4?=
 =?utf-8?B?THBodVUwYjZsdXhSMFZzSm9CeEtaVTJxY2ViVlNPa0NaVHYxTlhNQWNqZk9h?=
 =?utf-8?B?cjkwVFh5cjZIS2RlNDlwTWNjWWFsd2pSZ21iRkN0d1JSNDd1b20rUUl3dmJm?=
 =?utf-8?B?WjVrVGZjczBDc2hYOVVUUjNkL1Njb0dWS00wU0F6OUIxZHFFUnZGTUROWDZi?=
 =?utf-8?B?M2FZYlVyRzZxTGg1czJOc0VBOXlYbjFablREaDc3amhwSmUwU1k2anZDVy9G?=
 =?utf-8?B?dnRvUm9HcllUWTZrZWVqbExnQ3psS09vYXJmSGtjemhFMWlQVGJaSzBBSjR4?=
 =?utf-8?B?RHVrQjQrcDZoS1UwQWZiaVRncHhkMXJKYU5tV0dnVjMrNDdMU2xwTkNqMFhp?=
 =?utf-8?B?UDN5bVVqMFpYYkU4YUgrSEtlWVhKSDBaVFI2VzJtdTRkNXk5UW5BNlU3SWVk?=
 =?utf-8?B?SEhwdUlkQ25mVjBnQldEcnhQRENtcTg1Y1lFYThEdDYxOGUyQWNZa0tpOVNX?=
 =?utf-8?B?OWF3RVAzanZBa2ZVbHd5b3I0bVY1RytrUGpKUmxUd0o3eHNJbHNEQ1pPVCtW?=
 =?utf-8?B?bFVVa2svbjk2M3gxMXpCd2h4RGpUOFlRbUlLT1VWdzZkejBJeGY1QjduZzBl?=
 =?utf-8?B?akJraU5IVTVTdGZjMmRRUWVIUVFiNklBekp0blVxa0w4LzdFNVhVcmVSVHB3?=
 =?utf-8?B?N0FUMGo4Vk85ajBDM0JoZE1Od0NPNnk4YWgrVUd3ZFdJd2cwczRHYm15andn?=
 =?utf-8?B?TWNSZDFHUlZSZzV6MXFhK2JEM1VPVC94QmNyekNzdHVGREVaaGlRaGhOOVNn?=
 =?utf-8?B?SDY2NjR6dk9IRWRBby83RFczU2RQbXd1RTBCekJKblBCZzBwdEp4RThSL1lU?=
 =?utf-8?B?cHg4ZXZXa0Z4VG5oUlRyTUNybk8yZk9uYUZQMXc1aE9taEJsK0Q4K0VEd2Ri?=
 =?utf-8?B?N2hEdnczNTA0L01aNkhucytqem9kZk9MNnFnNnBXQjhwQmtqdGNSd1ZjM3Fv?=
 =?utf-8?B?Q3FkQjhPUG5RUTBneTd2ajZWZDdWL2JscThpVGFzRENPT2FFblZoMDMxNzVv?=
 =?utf-8?B?cXE0REkzOTM1UVYxSSs3cFJENURFZkV0enhZOXZLTmxxNDlBb0JBZ0d0L3Ru?=
 =?utf-8?B?KzZSbjZIU29zMU93MjFwbU9pMWdnczJ6UU1EaG0xckp4bExuc3d0Z2xQVVYr?=
 =?utf-8?B?MGJyeVRUWGlyZklNaTd6ZWUwWU1IOEFNZzJDOGMxVnMxTm0vUTd1Mll5T0Vl?=
 =?utf-8?Q?UWamPNzz8LWYbTzTlSrQ8YPHhdH7YY=3D?=
X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;5:bE4WdDJcmoNPZK/aGjIM/gvpsi/9ZZ1mOHK52DS9ToI7dzOqhtbds1zDMNx2jZRuioMfydOpV5AErwab6+L0mdyZ6YPM3J2JEV/5hlg1N1kQ+lVf+Rh7wWRiyzwV4ALaAe7vBWob96wcC1+ZkwwqIg==;24:xqdfr6Dlp5GjVJ0o+nJV2gqcEMBCp/VtRdAp/59FG6mlSHJnPEo8hbdvl4b/SGI+87oQjt+J209qMved2RdX3SIw1iFijKeB4X+WF+UxkMw=;20:tmA58O5rQqi1gPhvQxwdF0OxYgvLUqWzLyY/OoKT1jy7MNFOX2KMlGilY3pd7lvttxGodvYPcTZwO3aOD04wbg==
SpamDiagnosticOutput: 1:23
SpamDiagnosticMetadata: NSPM
X-OriginatorOrg: ezchip.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 28 Sep 2015 21:57:34.1892 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-Transport-CrossTenantHeadersStamped: VI1PR02MB0783
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> In prepare_exit_to_usermode(), we would like to call
>> task_isolation_enter() on every return to userspace, and like
>> other work items, we would like to recheck for more work after
>> calling it, since it will enable interrupts internally.
>>
>> However, if task_isolation_enter() is the only work item,
>> and it has already been called once, we don't want to continue
>> calling it in a loop.  We don't have a dedicated TIF flag for
>> task isolation, and it wouldn't make sense to have one, since
>> we'd want to set it before starting exit every time, and then
>> clear it the first time around the loop.
>>
>> Instead, we change the loop structure somewhat, so that we
>> have a more inclusive set of flags that are tested for on the
>> first entry to the function (including TIF_NOHZ), and if any
>> of those flags are set, we enter the loop.  And, we do the
>> task_isolation() test unconditionally at the bottom of the loop,
>> but then when making the decision to loop back, we just use the
>> set of flags that doesn't include TIF_NOHZ.  That way we only
>> loop if there is other work to do, but then if that work
>> is done, we again unconditionally call task_isolation_enter().
>>
>> In syscall_trace_enter_phase1(), we try to add the necessary
>> support for strict-mode detection of syscalls in an optimized
>> way, by letting the code remain unchanged if we are not using
>> TASK_ISOLATION, but otherwise calling enter_from_user_mode()
>> under the first time we see _TIF_NOHZ, and then waiting until
>> after we do the secure computing work to actually clear the bit
>> from the "work" variable and call task_isolation_syscall().
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>> ---
>>   arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 36 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index 80dcc9261ca3..0f74389c6f3b 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/context_tracking.h>
>>   #include <linux/user-return-notifier.h>
>>   #include <linux/uprobes.h>
>> +#include <linux/isolation.h>
>>
>>   #include <asm/desc.h>
>>   #include <asm/traps.h>
>> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>           */
>>          if (work & _TIF_NOHZ) {
>>                  enter_from_user_mode();
>> -               work &= ~_TIF_NOHZ;
>> +               if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
>> +                       work &= ~_TIF_NOHZ;
>>          }
>>   #endif
>>
>> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>          }
>>   #endif
>>
>> +       /* Now check task isolation, if needed. */
>> +       if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
>> +               work &= ~_TIF_NOHZ;
>> +               if (task_isolation_strict())
>> +                       task_isolation_syscall(regs->orig_ax);
>> +       }
>> +
> This is IMO rather nasty.  Can you try to find a way to do this
> without making the control flow depend on config options?

Well, I suppose this is the best argument for testing for task
isolation before seccomp :-)

Honestly, if not, it's tricky to see how to do better; I did spend
some time looking at it.  One possibility is to just unconditionally
clear _TIF_NOHZ before testing "work == 0", so that we can
test (work & TIF_NOHZ) once early and once after seccomp.
This presumably costs a cycle in the no-nohz-full case.

So maybe just do it before seccomp...

> What guarantees that TIF_NOHZ is an acceptable thing to check?

Well, TIF_NOHZ is set on all tasks whenever we are running with
nohz_full enabled anywhere, so testing it lets us do stuff on
the fastpath without slowing down the fastpath much.
See context_tracking_cpu_set().

>>          /* Do our best to finish without phase 2. */
>>          if (work == 0)
>>                  return ret;  /* seccomp and/or nohz only (ret == 0 here) */
>> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
>>   /* Called with IRQs disabled. */
>>   __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>   {
>> +       u32 cached_flags;
>> +
>>          if (WARN_ON(!irqs_disabled()))
>>                  local_irq_disable();
>>
>>          /*
>> +        * We may want to enter the loop here unconditionally to make
>> +        * sure to do some work at least once.  Test here for all
>> +        * possible conditions that might make us enter the loop,
>> +        * and return immediately if none of them are set.
>> +        */
>> +       cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>> +       if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>> +                             _TIF_UPROBE | _TIF_NEED_RESCHED |
>> +                             _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
>> +               user_enter();
>> +               return;
>> +       }
>> +
> Too complicated and too error prone.
>
> In any event, I don't think that the property you actually want is for
> the loop to be entered once.  I think the property you want is that
> we're isolated by the time we're finished.  Why not just check that
> directly in the loop condition?

So something like this (roughly):

                 if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                                       _TIF_UPROBE | _TIF_NEED_RESCHED |
                                       _TIF_USER_RETURN_NOTIFY)) &&
+                    task_isolation_done())
                         break;

i.e. just add the one extra call?  That could work, I suppose.
In the body we would then keep the proposed logic that unconditionally
calls task_isolation_enter().

>> +       /*
>>           * In order to return to user mode, we need to have IRQs off with
>>           * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
>>           * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
>> @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>           * so we need to loop.  Disabling preemption wouldn't help: doing the
>>           * work to clear some of the flags can sleep.
>>           */
>> -       while (true) {
>> -               u32 cached_flags =
>> -                       READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>> -
>> -               if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>> -                                     _TIF_UPROBE | _TIF_NEED_RESCHED |
>> -                                     _TIF_USER_RETURN_NOTIFY)))
>> -                       break;
>> -
>> +       do {
>>                  /* We have work to do. */
>>                  local_irq_enable();
>>
>> @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>                  if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>>                          fire_user_return_notifiers();
>>
>> +               if (task_isolation_enabled())
>> +                       task_isolation_enter();
>> +
> Does anything here guarantee forward progress or at least give
> reasonable confidence that we'll make forward progress?

A given task can get stuck in the kernel if it has a lengthy far-future
alarm() type situation, or if there are multiple task-isolated tasks
scheduled onto the same core, but that only affects those tasks;
other tasks on the same core, and the system as a whole, are OK.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com