From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753490AbbI1V5q (ORCPT ); Mon, 28 Sep 2015 17:57:46 -0400 Received: from mail-db3on0082.outbound.protection.outlook.com ([157.55.234.82]:31862 "EHLO emea01-db3-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752763AbbI1V5n (ORCPT ); Mon, 28 Sep 2015 17:57:43 -0400 Authentication-Results: spf=none (sender IP is ) smtp.mailfrom=cmetcalf@ezchip.com; Subject: Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality To: Andy Lutomirski References: <1443453446-7827-1-git-send-email-cmetcalf@ezchip.com> <1443453446-7827-8-git-send-email-cmetcalf@ezchip.com> CC: Gilad Ben Yossef , Steven Rostedt , Ingo Molnar , Peter Zijlstra , Andrew Morton , Rik van Riel , Tejun Heo , Frederic Weisbecker , Thomas Gleixner , "Paul E. McKenney" , Christoph Lameter , Viresh Kumar , Catalin Marinas , Will Deacon , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , X86 ML From: Chris Metcalf Message-ID: <5609B7C0.3010807@ezchip.com> Date: Mon, 28 Sep 2015 17:57:20 -0400 User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [12.216.194.146] X-ClientProxiedBy: DM2PR11CA0019.namprd11.prod.outlook.com (25.160.91.29) To VI1PR02MB0783.eurprd02.prod.outlook.com (25.162.14.145) X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;2:hiNHRCOplnEOKHJndcKworh9bMS1mzQ9WvKg35/KaPKkTalJjeEs30zodemAlUhFRCVLcn9Czr4pd0o/DcP9bhC5CHygT3zShOmdPC7RfLyrPLVVYE5wCAIAv6sX+aWXyRusch+P00k0Jp945t6PkRGT4bVQm5YyLSLs/YJinoI=;3:abe+FdOQM27SosFkUkYNw2Jb9ZCzLGCCjWnVAzNYn3j/jbQHe8SzOx5dH5lhCUu1NFkvMfjVZN6ac50Tkx7UKYix0w0VhATQB4S/LMTXWfZt0kFrqHwDOVmKM/iJUq0LIW8tFbfjVu2f+dl38tiWmQ==;25:4gu9uUS7RU3wbndbnk84YPz9wv45A76gaWXE3/bKEIUNwLgSjXWjxLoifljdq40NKIwd7SdCmPj4yi2AN985l7Wg3wW9nFhNN5D4Os5sL2rZsnhooA6JZq6zmoqucxpp0MxovnrTSBO3saxN7zy/VP6JxjjeOjIcPRt7Td2+P3qvjsImfuQ1Gngl16vHkQ2EBrhRj6i/aXdmx+bO4GFSk/dN6uclwp91aT/q+8oC/SZZ/g3wkIlwe0b7slyiRukmYxeY6NUVHl2l4jABhFPE/g==;20:8GIYMsUwl9offYXTEnWTDyAwCwXr4PKditzmZGjsYH+VR7ydNbOcffYQ35/G0IAXqD8mvaWq537I2y7ok6YxLWnJyYO+yBrWt/NpEkQtkjPJ2cYSNRdUIuJzxDbzyQz9x9/jcXu5/ohmA6WEVtRj+ElIF1sWm/vNRl4LM2aJRIs= X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:VI1PR02MB0783; X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:; X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(2401047)(5005006)(520078)(8121501046)(3002001);SRVR:VI1PR02MB0783;BCL:0;PCL:0;RULEID:;SRVR:VI1PR02MB0783; X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;4:wWsJ0OQmUB4kW+Epjz0ToI8J8H03NZZV3bV7/gJDMvqaZvYsPAI3/0GQmbPjFg5QFlI7eMLPBKQiIZnZbtLDJUIlBJQfdsT9aJtIFRDBdPiVcNkyCuimEh6GMSvH4F59USCyRag/uY5Mpq5dlidm2UCxhIQ3bnLiWw0uzs7ZZVjuMtvqvcq9snHohY9PRJIIvkdYQdmqnRSCf+SUHlHl+rdlDRrzroEYY53FT+eVnvzsiMb1oWm66aXRWf+XhrTA6SGa+YEdAEB48PDLJiczC1YNKavNBrmCnw0vh76Z0nI6jHI1aePizsq6WUFGkmR+LqxGtBVgMzs8vohP642qL5WDgohPrVeXFmlTJ0/SNho= X-Forefront-PRVS: 0713BC207F X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10009020)(6049001)(6009001)(24454002)(199003)(479174004)(189002)(377454003)(46102003)(15975445007)(81156007)(2950100001)(4001540100001)(77156002)(5001860100001)(54356999)(105586002)(110136002)(76176999)(77096005)(50986999)(87266999)(101416001)(64126003)(65816999)(50466002)(83506001)(36756003)(42186005)(5001960100002)(62966003)(68736005)(189998001)(106356001)(19580405001)(87976001)(64706001)(80316001)(97736004)(59896002)(66066001)(65806001)(47776003)(4001350100001)(33656002)(65956001)(23676002)(40100003)(5004730100002)(122386002)(5007970100001)(92566002)(86362001)(19580395003)(5001830100001)(18886065003);DIR:OUT;SFP:1101;SCL:1;SRVR:VI1PR02MB0783;H:[10.7.0.41];FPR:;SPF:None;PTR:InfoNoRecords;A:1;MX:1;LANG:en; X-Microsoft-Exchange-Diagnostics: =?utf-8?B?MTtWSTFQUjAyTUIwNzgzOzIzOlBHT1huN201eDkvNUVJQ25zb2lac2dxN3JG?= =?utf-8?B?Y2laMzlmdlFkdm5XN2tDVmZmdmpoZmZIQ1JJZ0tkd25zWnZ1QzNFakorK0Ez?= =?utf-8?B?TXF1NnFFU0VMSGlraXAweVJXQXdxeitxaWJnb0VhbDVHSC9rZWlBN2poSlk1?= =?utf-8?B?WnQ1Z2dCcDZ5UzVzQ0x3S25RaTdhbDQ1V2h2WGJhSXFuSjRISzhSUWsvZDlL?= =?utf-8?B?YjhvZTF1UUlWYzB5YitWcnRwc2tjMWcwZ2hSVlQ2bUpkSUJtSmxBNFppOVUx?= =?utf-8?B?VGVMZEUzRFhuNjRIK2czS1BhZFBnWjkwTjZNOFlEZXZkV3Uvdkg1Ykxmb0tl?= =?utf-8?B?OXIwSnBmUS9LQStyZ0c2ZEtsMkl6Nnh3RHpGYnF1Wms2YUFJSWFPOEo1Ym90?= =?utf-8?B?M1pkTDRpb2dVNXNSdVMwK3l3WUtzKzE2eDBwMVVRYVdtQ0kydnZUaFV1K3FJ?= =?utf-8?B?UEdjK1hPSnZkeUpMWTUyc3VUa05yYUZLTzY0M3RwM2FZZEF5cnFuakQ1c05v?= =?utf-8?B?L0lWczg2eGtvTEJXMWdHaUo1NkNGcnZxRkc2NHQycGZCTE1LRHV4WjNUSVlo?= =?utf-8?B?YXliTnNiblh1MUZvUDU4OG5lQ3NRSHRxUTdrc3JXaVRQdTE1QVJNV2oxSkJz?= =?utf-8?B?RERBQzdEVWZrVlNVVnFNWWlWSzFTUE1EVjl2MXd3VU1LSlltMTFJUmpvTzJ4?= =?utf-8?B?THBodVUwYjZsdXhSMFZzSm9CeEtaVTJxY2ViVlNPa0NaVHYxTlhNQWNqZk9h?= =?utf-8?B?cjkwVFh5cjZIS2RlNDlwTWNjWWFsd2pSZ21iRkN0d1JSNDd1b20rUUl3dmJm?= =?utf-8?B?WjVrVGZjczBDc2hYOVVUUjNkL1Njb0dWS00wU0F6OUIxZHFFUnZGTUROWDZi?= =?utf-8?B?M2FZYlVyRzZxTGg1czJOc0VBOXlYbjFablREaDc3amhwSmUwU1k2anZDVy9G?= =?utf-8?B?dnRvUm9HcllUWTZrZWVqbExnQ3psS09vYXJmSGtjemhFMWlQVGJaSzBBSjR4?= =?utf-8?B?RHVrQjQrcDZoS1UwQWZiaVRncHhkMXJKYU5tV0dnVjMrNDdMU2xwTkNqMFhp?= =?utf-8?B?UDN5bVVqMFpYYkU4YUgrSEtlWVhKSDBaVFI2VzJtdTRkNXk5UW5BNlU3SWVk?= =?utf-8?B?SEhwdUlkQ25mVjBnQldEcnhQRENtcTg1Y1lFYThEdDYxOGUyQWNZa0tpOVNX?= =?utf-8?B?OWF3RVAzanZBa2ZVbHd5b3I0bVY1RytrUGpKUmxUd0o3eHNJbHNEQ1pPVCtW?= =?utf-8?B?bFVVa2svbjk2M3gxMXpCd2h4RGpUOFlRbUlLT1VWdzZkejBJeGY1QjduZzBl?= =?utf-8?B?akJraU5IVTVTdGZjMmRRUWVIUVFiNklBekp0blVxa0w4LzdFNVhVcmVSVHB3?= =?utf-8?B?N0FUMGo4Vk85ajBDM0JoZE1Od0NPNnk4YWgrVUd3ZFdJd2cwczRHYm15andn?= =?utf-8?B?TWNSZDFHUlZSZzV6MXFhK2JEM1VPVC94QmNyekNzdHVGREVaaGlRaGhOOVNn?= =?utf-8?B?SDY2NjR6dk9IRWRBby83RFczU2RQbXd1RTBCekJKblBCZzBwdEp4RThSL1lU?= =?utf-8?B?cHg4ZXZXa0Z4VG5oUlRyTUNybk8yZk9uYUZQMXc1aE9taEJsK0Q4K0VEd2Ri?= =?utf-8?B?N2hEdnczNTA0L01aNkhucytqem9kZk9MNnFnNnBXQjhwQmtqdGNSd1ZjM3Fv?= =?utf-8?B?Q3FkQjhPUG5RUTBneTd2ajZWZDdWL2JscThpVGFzRENPT2FFblZoMDMxNzVv?= =?utf-8?B?cXE0REkzOTM1UVYxSSs3cFJENURFZkV0enhZOXZLTmxxNDlBb0JBZ0d0L3Ru?= =?utf-8?B?KzZSbjZIU29zMU93MjFwbU9pMWdnczJ6UU1EaG0xckp4bExuc3d0Z2xQVVYr?= =?utf-8?B?MGJyeVRUWGlyZklNaTd6ZWUwWU1IOEFNZzJDOGMxVnMxTm0vUTd1Mll5T0Vl?= =?utf-8?Q?UWamPNzz8LWYbTzTlSrQ8YPHhdH7YY=3D?= X-Microsoft-Exchange-Diagnostics: 1;VI1PR02MB0783;5:bE4WdDJcmoNPZK/aGjIM/gvpsi/9ZZ1mOHK52DS9ToI7dzOqhtbds1zDMNx2jZRuioMfydOpV5AErwab6+L0mdyZ6YPM3J2JEV/5hlg1N1kQ+lVf+Rh7wWRiyzwV4ALaAe7vBWob96wcC1+ZkwwqIg==;24:xqdfr6Dlp5GjVJ0o+nJV2gqcEMBCp/VtRdAp/59FG6mlSHJnPEo8hbdvl4b/SGI+87oQjt+J209qMved2RdX3SIw1iFijKeB4X+WF+UxkMw=;20:tmA58O5rQqi1gPhvQxwdF0OxYgvLUqWzLyY/OoKT1jy7MNFOX2KMlGilY3pd7lvttxGodvYPcTZwO3aOD04wbg== SpamDiagnosticOutput: 1:23 SpamDiagnosticMetadata: NSPM X-OriginatorOrg: ezchip.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 28 Sep 2015 21:57:34.1892 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: VI1PR02MB0783 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/28/2015 04:59 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf wrote: >> In prepare_exit_to_usermode(), we would like to call >> task_isolation_enter() on every return to userspace, and like >> other work items, we would like to recheck for more work after >> calling it, since it will enable interrupts internally. >> >> However, if task_isolation_enter() is the only work item, >> and it has already been called once, we don't want to continue >> calling it in a loop. We don't have a dedicated TIF flag for >> task isolation, and it wouldn't make sense to have one, since >> we'd want to set it before starting exit every time, and then >> clear it the first time around the loop. >> >> Instead, we change the loop structure somewhat, so that we >> have a more inclusive set of flags that are tested for on the >> first entry to the function (including TIF_NOHZ), and if any >> of those flags are set, we enter the loop. And, we do the >> task_isolation() test unconditionally at the bottom of the loop, >> but then when making the decision to loop back, we just use the >> set of flags that doesn't include TIF_NOHZ. That way we only >> loop if there is other work to do, but then if that work >> is done, we again unconditionally call task_isolation_enter(). >> >> In syscall_trace_enter_phase1(), we try to add the necessary >> support for strict-mode detection of syscalls in an optimized >> way, by letting the code remain unchanged if we are not using >> TASK_ISOLATION, but otherwise calling enter_from_user_mode() >> under the first time we see _TIF_NOHZ, and then waiting until >> after we do the secure computing work to actually clear the bit >> from the "work" variable and call task_isolation_syscall(). >> >> Signed-off-by: Chris Metcalf >> --- >> arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++----------- >> 1 file changed, 36 insertions(+), 11 deletions(-) >> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >> index 80dcc9261ca3..0f74389c6f3b 100644 >> --- a/arch/x86/entry/common.c >> +++ b/arch/x86/entry/common.c >> @@ -21,6 +21,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) >> */ >> if (work & _TIF_NOHZ) { >> enter_from_user_mode(); >> - work &= ~_TIF_NOHZ; >> + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) >> + work &= ~_TIF_NOHZ; >> } >> #endif >> >> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) >> } >> #endif >> >> + /* Now check task isolation, if needed. */ >> + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { >> + work &= ~_TIF_NOHZ; >> + if (task_isolation_strict()) >> + task_isolation_syscall(regs->orig_ax); >> + } >> + > This is IMO rather nasty. Can you try to find a way to do this > without making the control flow depend on config options? Well, I suppose this is the best argument for testing for task isolation before seccomp :-) Honestly, if not, it's tricky to see how to do better; I did spend some time looking at it. One possibility is to just unconditionally clear _TIF_NOHZ before testing "work == 0", so that we can test (work & TIF_NOHZ) once early and once after seccomp. This presumably costs a cycle in the no-nohz-full case. So maybe just do it before seccomp... > What guarantees that TIF_NOHZ is an acceptable thing to check? Well, TIF_NOHZ is set on all tasks whenever we are running with nohz_full enabled anywhere, so testing it lets us do stuff on the fastpath without slowing down the fastpath much. See context_tracking_cpu_set(). >> /* Do our best to finish without phase 2. */ >> if (work == 0) >> return ret; /* seccomp and/or nohz only (ret == 0 here) */ >> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs) >> /* Called with IRQs disabled. */ >> __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> { >> + u32 cached_flags; >> + >> if (WARN_ON(!irqs_disabled())) >> local_irq_disable(); >> >> /* >> + * We may want to enter the loop here unconditionally to make >> + * sure to do some work at least once. Test here for all >> + * possible conditions that might make us enter the loop, >> + * and return immediately if none of them are set. >> + */ >> + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); >> + if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | >> + _TIF_UPROBE | _TIF_NEED_RESCHED | >> + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) { >> + user_enter(); >> + return; >> + } >> + > Too complicated and too error prone. > > In any event, I don't think that the property you actually want is for > the loop to be entered once. I think the property you want is that > we're isolated by the time we're finished. Why not just check that > directly in the loop condition? So something like this (roughly): if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)) && + task_isolation_done()) break; i.e. just add the one extra call? That could work, I suppose. In the body we would then keep the proposed logic that unconditionally calls task_isolation_enter(). >> + /* >> * In order to return to user mode, we need to have IRQs off with >> * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY, >> * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags >> @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> * so we need to loop. Disabling preemption wouldn't help: doing the >> * work to clear some of the flags can sleep. >> */ >> - while (true) { >> - u32 cached_flags = >> - READ_ONCE(pt_regs_to_thread_info(regs)->flags); >> - >> - if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | >> - _TIF_UPROBE | _TIF_NEED_RESCHED | >> - _TIF_USER_RETURN_NOTIFY))) >> - break; >> - >> + do { >> /* We have work to do. */ >> local_irq_enable(); >> >> @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> if (cached_flags & _TIF_USER_RETURN_NOTIFY) >> fire_user_return_notifiers(); >> >> + if (task_isolation_enabled()) >> + task_isolation_enter(); >> + > Does anything here guarantee forward progress or at least give > reasonable confidence that we'll make forward progress? A given task can get stuck in the kernel if it has a lengthy far-future alarm() type situation, or if there are multiple task-isolated tasks scheduled onto the same core, but that only affects those tasks; other tasks on the same core, and the system as a whole, are OK. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com