From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id 4D2C46B006C for ; Tue, 28 Apr 2015 18:16:05 -0400 (EDT) Received: by wizk4 with SMTP id k4so158072648wiz.1 for ; Tue, 28 Apr 2015 15:16:04 -0700 (PDT) Received: from kirsi1.inet.fi (mta-out1.inet.fi. [62.71.2.203]) by mx.google.com with ESMTP id k6si20316901wiz.1.2015.04.28.15.16.03 for ; Tue, 28 Apr 2015 15:16:04 -0700 (PDT) Date: Wed, 29 Apr 2015 01:15:53 +0300 From: "Kirill A. Shutemov" Subject: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Message-ID: <20150428221553.GA5770@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , Dave Hansen Cc: Linus Torvalds , Andrew Morton , Mel Gorman , Rik van Riel , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: > At some point, I'd like to implement PCID on x86 (if no one beats me > to it, and this is a low priority for me), which will allow us to skip > expensive TLB flushes while context switching. I have no idea whether > ARM can do something similar. I talked with Dave about implementing PCID and he thinks that it will be a net loss. TLB entries will live longer, which means we would need to trigger more IPIs to flush them out when we have to. The cost of those IPIs will be higher than the benefit of a hot TLB after a context switch. Do you have different expectations? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 2EE446B006C for ; Tue, 28 Apr 2015 18:38:02 -0400 (EDT) Received: by pabsx10 with SMTP id sx10so8735175pab.3 for ; Tue, 28 Apr 2015 15:38:01 -0700 (PDT) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTP id m4si36579576pap.204.2015.04.28.15.38.01 for ; Tue, 28 Apr 2015 15:38:01 -0700 (PDT) Message-ID: <55400BC8.6080204@intel.com> Date: Tue, 28 Apr 2015 15:38:00 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" , Andy Lutomirski Cc: Linus Torvalds , Andrew Morton , Mel Gorman , Rik van Riel , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org On 04/28/2015 03:15 PM, Kirill A. Shutemov wrote: > On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >> At some point, I'd like to implement PCID on x86 (if no one beats me >> to it, and this is a low priority for me), which will allow us to skip >> expensive TLB flushes while context switching. I have no idea whether >> ARM can do something similar. > > I talked with Dave about implementing PCID and he thinks that it will be a > net loss. TLB entries will live longer, which means we would need to trigger > more IPIs to flush them out when we have to. The cost of those IPIs will be higher > than the benefit of a hot TLB after a context switch. > > Do you have different expectations? Kirill, I think Andy is asking about something different than what you and I talked about. 
My point to you was that PCIDs cannot be used to replace, or in lieu of, TLB shootdowns because they *only* make TLB entries live longer. Their entire purpose is to make things live longer and to reduce the cost of the implicit TLB shootdowns that we do as a part of a context switch. I'm not sure if it will have a benefit overall. It depends on the increase in shootdown cost vs. the decrease in TLB refill cost at context switch. I think someone hacked up some code to do it (maybe just internally to Intel), so if anyone is seriously interested in implementing it, let me know and I'll see if I can dig it up. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vn0-f44.google.com (mail-vn0-f44.google.com [209.85.216.44]) by kanga.kvack.org (Postfix) with ESMTP id 3EDCC6B0032 for ; Tue, 28 Apr 2015 18:41:52 -0400 (EDT) Received: by vnbg190 with SMTP id g190so1353748vnb.12 for ; Tue, 28 Apr 2015 15:41:52 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id xg5si36103483vdb.106.2015.04.28.15.41.51 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 15:41:51 -0700 (PDT) Message-ID: <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 18:41:43 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" , Andy Lutomirski , Dave Hansen Cc: Linus Torvalds , Andrew Morton , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org On 04/28/2015 06:15 PM, Kirill A. 
Shutemov wrote: > On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >> At some point, I'd like to implement PCID on x86 (if no one beats me >> to it, and this is a low priority for me), which will allow us to skip >> expensive TLB flushes while context switching. I have no idea whether >> ARM can do something similar. > > I talked with Dave about implementing PCID and he thinks that it will be a > net loss. TLB entries will live longer, which means we would need to trigger > more IPIs to flush them out when we have to. The cost of those IPIs will be higher > than the benefit of a hot TLB after a context switch. I suspect that may depend on how you do the shootdown. If, when receiving a TLB shootdown for a non-current PCID, we just flush all the entries for that PCID and remove the CPU from the mm's cpu_vm_mask_var, we will never receive more than one shootdown IPI for a non-current mm, but we will still get the benefits of TLB longevity when dealing with e.g. pipe workloads where tasks take turns running on the same CPU. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f170.google.com (mail-lb0-f170.google.com [209.85.217.170]) by kanga.kvack.org (Postfix) with ESMTP id EE3866B0032 for ; Tue, 28 Apr 2015 18:54:52 -0400 (EDT) Received: by lbbuc2 with SMTP id uc2so7558064lbb.2 for ; Tue, 28 Apr 2015 15:54:52 -0700 (PDT) Received: from mail-la0-f50.google.com (mail-la0-f50.google.com. 
[209.85.215.50]) by mx.google.com with ESMTPS id ld16si18038304lbb.169.2015.04.28.15.54.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 15:54:51 -0700 (PDT) Received: by layy10 with SMTP id y10so7550966lay.0 for ; Tue, 28 Apr 2015 15:54:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <55400CA7.3050902@redhat.com> References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 15:54:29 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: > On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>> At some point, I'd like to implement PCID on x86 (if no one beats me >>> to it, and this is a low priority for me), which will allow us to skip >>> expensive TLB flushes while context switching. I have no idea whether >>> ARM can do something similar. >> >> I talked with Dave about implementing PCID and he thinks that it will be a >> net loss. TLB entries will live longer, which means we would need to trigger >> more IPIs to flush them out when we have to. The cost of those IPIs will be higher >> than the benefit of a hot TLB after a context switch. > > I suspect that may depend on how you do the shootdown. > > If, when receiving a TLB shootdown for a non-current PCID, we just flush > all the entries for that PCID and remove the CPU from the mm's > cpu_vm_mask_var, we will never receive more than one shootdown IPI for > a non-current mm, but we will still get the benefits of TLB longevity > when dealing with e.g. pipe workloads where tasks take turns running on > the same CPU. 
I had a totally different implementation idea in mind. It goes something like this: For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have a per-cpu array of the mm [1] that owns each PCID. On context switch, we look up the new mm in the array and, if there's a PCID mapped, we switch cr3 and select that PCID. If there is no PCID mapped, we choose one (LRU? clock replacement?), switch cr3 and select and invalidate that PCID. When it's time to invalidate a TLB entry on an mm that's active remotely, we really don't want to send an IPI to a CPU that doesn't actually have that mm active. Instead we bump some kind of generation counter in the mm_struct that will cause the next switch to that mm not to match the PCID list. To keep this working, I think we also need to update the per-cpu PCID list with our generation counter either when we context switch out or when we process a TLB shootdown IPI. This could be a bit tricky to get right, but I think it can be done without adding more than a cacheline or two to the context switch overhead and without any extra IPIs at all. [1] It shouldn't be just an mm_struct pointer, because then we have to invalidate it somehow when we recycle an mm_struct. Maybe we'd use some kind of counter. We also need a TLB shootdown generation counter of some sort as described. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ie0-f173.google.com (mail-ie0-f173.google.com [209.85.223.173]) by kanga.kvack.org (Postfix) with ESMTP id F07E96B0032 for ; Tue, 28 Apr 2015 18:56:29 -0400 (EDT) Received: by iecrt8 with SMTP id rt8so30425637iec.0 for ; Tue, 28 Apr 2015 15:56:29 -0700 (PDT) Received: from mail-ie0-x22e.google.com (mail-ie0-x22e.google.com. 
[2607:f8b0:4001:c03::22e]) by mx.google.com with ESMTPS id f19si19971336icl.8.2015.04.28.15.56.29 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 15:56:29 -0700 (PDT) Received: by iebrs15 with SMTP id rs15so30408775ieb.3 for ; Tue, 28 Apr 2015 15:56:29 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> References: <20150428221553.GA5770@node.dhcp.inet.fi> Date: Tue, 28 Apr 2015 15:56:29 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: Andy Lutomirski , Dave Hansen , Andrew Morton , Mel Gorman , Rik van Riel , Linux Kernel Mailing List , linux-mm , the arch/x86 maintainers On Tue, Apr 28, 2015 at 3:15 PM, Kirill A. Shutemov wrote: > > I talked with Dave about implementing PCID and he thinks that it will be a > net loss. So I'm told that Suresh Siddha actually had a patch inside Intel to use PCID (back when he worked for Intel, I think he left), and that it was a wash in their testing. I never saw the patch, and it might be interesting to try it again, but there is some reason to believe that it doesn't make much of a difference. Unlike most of the traditional RISC machines that got big speedups, Intel TLB walking is so good that it likely isn't nearly as noticeable, and it likely *does* result in more IPI's etc. Possibly not a lot more, but if the win isn't big... So I don't want to discourage you, because I'd love to see what the patch looks like and if we can find cases where it matters, but I do want to set expectations right. It's unlikely to be a big issue. Linus 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vn0-f53.google.com (mail-vn0-f53.google.com [209.85.216.53]) by kanga.kvack.org (Postfix) with ESMTP id 828646B006C for ; Tue, 28 Apr 2015 18:56:36 -0400 (EDT) Received: by vnbg62 with SMTP id g62so1406031vnb.7 for ; Tue, 28 Apr 2015 15:56:36 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id yn14si37664319vdb.73.2015.04.28.15.56.35 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 15:56:35 -0700 (PDT) Message-ID: <5540101D.7020800@redhat.com> Date: Tue, 28 Apr 2015 18:56:29 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On 04/28/2015 06:54 PM, Andy Lutomirski wrote: > On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: >> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>>> At some point, I'd like to implement PCID on x86 (if no one beats me >>>> to it, and this is a low priority for me), which will allow us to skip >>>> expensive TLB flushes while context switching. I have no idea whether >>>> ARM can do something similar. >>> >>> I talked with Dave about implementing PCID and he thinks that it will be a >>> net loss. TLB entries will live longer, which means we would need to trigger >>> more IPIs to flush them out when we have to. The cost of those IPIs will be higher >>> than the benefit of a hot TLB after a context switch. 
>> >> I suspect that may depend on how you do the shootdown. >> >> If, when receiving a TLB shootdown for a non-current PCID, we just flush >> all the entries for that PCID and remove the CPU from the mm's >> cpu_vm_mask_var, we will never receive more than one shootdown IPI for >> a non-current mm, but we will still get the benefits of TLB longevity >> when dealing with e.g. pipe workloads where tasks take turns running on >> the same CPU. > > I had a totally different implementation idea in mind. It goes > something like this: > > For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have > a per-cpu array of the mm [1] that owns each PCID. On context switch, > we look up the new mm in the array and, if there's a PCID mapped, we > switch cr3 and select that PCID. If there is no PCID mapped, we > choose one (LRU? clock replacement?), switch cr3 and select and > invalidate that PCID. > > When it's time to invalidate a TLB entry on an mm that's active > remotely, we really don't want to send an IPI to a CPU that doesn't > actually have that mm active. Instead we bump some kind of generation > counter in the mm_struct that will cause the next switch to that mm > not to match the PCID list. To keep this working, I think we also > need to update the per-cpu PCID list with our generation counter > either when we context switch out or when we process a TLB shootdown > IPI. If we do that, we can also get rid of TLB shootdowns for idle CPUs in lazy TLB mode. Very nice, if the details work out. -- All rights reversed 
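The per-CPU scheme quoted above (a small fixed set of PCID slots per CPU, plus a generation counter in the mm that is bumped instead of sending an IPI) can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions, not code from this thread or from any kernel tree: struct mm_sketch, struct pcid_slot, choose_pcid() and remote_invalidate() are hypothetical names, replacement is simple round-robin rather than LRU or clock, and locking, memory ordering and the actual cr3 write are all left out.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PCID_SLOTS 8u            /* "e.g. 0-7" in the proposal */

/* Hypothetical stand-in for mm_struct. */
struct mm_sketch {
    uint64_t ctx_id;   /* never reused, so a recycled mm_struct cannot match */
    uint64_t tlb_gen;  /* bumped instead of IPI-ing CPUs where the mm is inactive */
};

/* One per-CPU slot: which mm owns this PCID, and how fresh its TLB entries are. */
struct pcid_slot {
    uint64_t ctx_id;   /* 0 means the slot is unused */
    uint64_t tlb_gen;  /* mm generation this CPU last synchronised with */
};

struct cpu_pcid_state {
    struct pcid_slot slots[NR_PCID_SLOTS];
    unsigned int next_victim;       /* round-robin replacement (assumption) */
};

struct pcid_choice {
    unsigned int pcid;              /* PCID to load together with cr3 */
    int need_flush;                 /* nonzero: flush this PCID's entries first */
};

/* Called at context switch: pick a PCID for mm on this CPU. */
static struct pcid_choice choose_pcid(struct cpu_pcid_state *cpu,
                                      const struct mm_sketch *mm)
{
    struct pcid_choice c;
    unsigned int i;

    assert(mm->ctx_id != 0);        /* 0 is reserved for "unused slot" */

    for (i = 0; i < NR_PCID_SLOTS; i++) {
        if (cpu->slots[i].ctx_id == mm->ctx_id) {
            /* Known mm: entries are usable only if no invalidation was missed. */
            c.pcid = i;
            c.need_flush = cpu->slots[i].tlb_gen != mm->tlb_gen;
            cpu->slots[i].tlb_gen = mm->tlb_gen;
            return c;
        }
    }

    /* Not cached on this CPU: recycle a slot and flush whatever it held. */
    i = cpu->next_victim;
    cpu->next_victim = (i + 1) % NR_PCID_SLOTS;
    cpu->slots[i].ctx_id = mm->ctx_id;
    cpu->slots[i].tlb_gen = mm->tlb_gen;
    c.pcid = i;
    c.need_flush = 1;
    return c;
}

/* Invalidate an mm that is not running on the target CPU: one write, no IPI. */
static void remote_invalidate(struct mm_sketch *mm)
{
    mm->tlb_gen++;
}
```

A caller would run choose_pcid() at context-switch time and flush that PCID only when need_flush is set; a CPU that changes page tables of an mm that is inactive elsewhere calls remote_invalidate() and relies on the generation mismatch being noticed at the next switch.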
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f180.google.com (mail-lb0-f180.google.com [209.85.217.180]) by kanga.kvack.org (Postfix) with ESMTP id D6D6F6B0032 for ; Tue, 28 Apr 2015 19:01:37 -0400 (EDT) Received: by lbcga7 with SMTP id ga7so7661639lbc.1 for ; Tue, 28 Apr 2015 16:01:37 -0700 (PDT) Received: from mail-la0-f41.google.com (mail-la0-f41.google.com. [209.85.215.41]) by mx.google.com with ESMTPS id r10si17700664lal.5.2015.04.28.16.01.36 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:01:36 -0700 (PDT) Received: by layy10 with SMTP id y10so7639650lay.0 for ; Tue, 28 Apr 2015 16:01:36 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <5540101D.7020800@redhat.com> References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> <5540101D.7020800@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:01:15 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 3:56 PM, Rik van Riel wrote: > On 04/28/2015 06:54 PM, Andy Lutomirski wrote: >> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: >>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>>>> At some point, I'd like to implement PCID on x86 (if no one beats me >>>>> to it, and this is a low priority for me), which will allow us to skip >>>>> expensive TLB flushes while context switching. I have no idea whether >>>>> ARM can do something similar. >>>> >>>> I talked with Dave about implementing PCID and he thinks that it will be a >>>> net loss. 
TLB entries will live longer, which means we would need to trigger >>>> more IPIs to flush them out when we have to. The cost of those IPIs will be higher >>>> than the benefit of a hot TLB after a context switch. >>> >>> I suspect that may depend on how you do the shootdown. >>> >>> If, when receiving a TLB shootdown for a non-current PCID, we just flush >>> all the entries for that PCID and remove the CPU from the mm's >>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for >>> a non-current mm, but we will still get the benefits of TLB longevity >>> when dealing with e.g. pipe workloads where tasks take turns running on >>> the same CPU. >> >> I had a totally different implementation idea in mind. It goes >> something like this: >> >> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have >> a per-cpu array of the mm [1] that owns each PCID. On context switch, >> we look up the new mm in the array and, if there's a PCID mapped, we >> switch cr3 and select that PCID. If there is no PCID mapped, we >> choose one (LRU? clock replacement?), switch cr3 and select and >> invalidate that PCID. >> >> When it's time to invalidate a TLB entry on an mm that's active >> remotely, we really don't want to send an IPI to a CPU that doesn't >> actually have that mm active. Instead we bump some kind of generation >> counter in the mm_struct that will cause the next switch to that mm >> not to match the PCID list. To keep this working, I think we also >> need to update the per-cpu PCID list with our generation counter >> either when we context switch out or when we process a TLB shootdown >> IPI. > > If we do that, we can also get rid of TLB shootdowns for > idle CPUs in lazy TLB mode. > > Very nice, if the details work out. > I wonder if we could treat the non-PCID case just like the PCID case but with only one PCID. Maybe get rid of the mm vs active_mm distinction. 
Maybe not, though -- if nothing else, we still need to kick our pgd out from idle or kthread CPUs before we free it. The reason I thought of PCIDs this way is that 12 bits isn't nearly enough to get away with allocating each mm its own PCID. Rather than trying to shoehorn them in, it seemed like a better approach would be to only use a very small number, since keeping around TLB entries that are more than a few context switches old seems mostly useless. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ie0-f174.google.com (mail-ie0-f174.google.com [209.85.223.174]) by kanga.kvack.org (Postfix) with ESMTP id 1E3EF6B0032 for ; Tue, 28 Apr 2015 19:16:19 -0400 (EDT) Received: by iecrt8 with SMTP id rt8so30691729iec.0 for ; Tue, 28 Apr 2015 16:16:18 -0700 (PDT) Received: from mail-ig0-x234.google.com (mail-ig0-x234.google.com. [2607:f8b0:4001:c05::234]) by mx.google.com with ESMTPS id d16si9826855igm.0.2015.04.28.16.16.18 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:16:18 -0700 (PDT) Received: by iget9 with SMTP id t9so97751141ige.1 for ; Tue, 28 Apr 2015 16:16:18 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 16:16:18 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski wrote: > > I had a totally different implementation idea in mind. 
It goes > something like this: > > For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have > a per-cpu array of the mm [1] that owns each PCID. [...] We've done this before on other architectures. See for example alpha. Look up "__get_new_mm_context()" and friends. I think sparc does the same (and I think sparc copied a lot of it from the alpha implementation). Iirc, the alpha version just generates a (per-cpu) asid one at a time, and has a generation counter so that when you run out of ASID's you do a global TLB invalidate on that CPU and start from 0 again. Actually, I think the generation number is just the high bits of the asid counter (alpha calls them "asn", intel calls them "pcid", and I tend to prefer "asid", but it's all the same thing). Then each thread just has a per-thread ASID. We don't try to make that be per-thread and per-cpu, but instead just force a new allocation when a thread moves to another CPU. It's not obvious what alpha does, because we end up hiding the per-thread ASN in the "struct pcb_struct" (in 'struct thread_info') which is part of the alpha pal-code interface. But it seemed to work and is fairly simple. I think something very similar should work with intel pcid's. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f169.google.com (mail-ig0-f169.google.com [209.85.213.169]) by kanga.kvack.org (Postfix) with ESMTP id 01CE86B0032 for ; Tue, 28 Apr 2015 19:19:51 -0400 (EDT) Received: by iget9 with SMTP id t9so97788495ige.1 for ; Tue, 28 Apr 2015 16:19:50 -0700 (PDT) Received: from mail-ie0-x236.google.com (mail-ie0-x236.google.com. 
[2607:f8b0:4001:c03::236]) by mx.google.com with ESMTPS id ro2si9808384igb.38.2015.04.28.16.19.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:19:50 -0700 (PDT) Received: by iebrs15 with SMTP id rs15so30721757ieb.3 for ; Tue, 28 Apr 2015 16:19:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> <5540101D.7020800@redhat.com> Date: Tue, 28 Apr 2015 16:19:50 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 4:01 PM, Andy Lutomirski wrote: > > The reason I thought of PCIDs this way is that 12 bits isn't nearly > enough to get away with allocating each mm its own PCID. Not even close. And really, we've already done this for other architectures. On alpha, the number of bits in the pcid is model-specific, but it was something like 6 for the ones I used. That's plenty. Also, I don't think Intel actually does 12 bits of pcid. What they do is to hash the 12 bits down to something smaller (like two or three bits in the actual TLB data structure), and then the CPU basically invalidates any pcid's that alias (it has a small 4- or 8-entry array saying "this hash was used for this 12-bit pcid"). So there's actually *another* level of dynamic mapping going on below the software interface. Linus 
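The alpha-style scheme described above, a per-CPU counter whose low bits are the hardware ASID and whose high bits act as the generation number, can be sketched in C as below. This is a minimal illustration and not the alpha code itself: ASID_BITS is set to 6 only because that width is mentioned above, and struct cpu_asid_state, asid_init() and asid_switch() are hypothetical names. The caller is assumed to keep one saved context value per mm for the CPU in question, with 0 meaning "never allocated".

```c
#include <assert.h>
#include <stdint.h>

#define ASID_BITS 6u                                   /* assumed hardware width */
#define ASID_MASK ((UINT64_C(1) << ASID_BITS) - 1)

struct cpu_asid_state {
    uint64_t last_ctx;          /* high bits: generation, low bits: last ASID used */
    unsigned long full_flushes; /* stands in for "invalidate the whole TLB" */
};

static void asid_init(struct cpu_asid_state *cpu)
{
    /* Start with generation 0 exhausted, so the first allocation rolls over
     * into generation 1 and begins from a clean TLB. */
    cpu->last_ctx = ASID_MASK;
    cpu->full_flushes = 0;
}

/*
 * Called at context switch. *mm_ctx is the context value saved for this mm
 * on this CPU. Returns the hardware ASID to run with.
 */
static unsigned int asid_switch(struct cpu_asid_state *cpu, uint64_t *mm_ctx)
{
    /* Same generation as the CPU: the old ASID is still valid, reuse it. */
    if (*mm_ctx != 0 && ((*mm_ctx ^ cpu->last_ctx) & ~ASID_MASK) == 0)
        return (unsigned int)(*mm_ctx & ASID_MASK);

    /* Otherwise hand out the next ASID, one at a time. */
    cpu->last_ctx++;
    if ((cpu->last_ctx & ASID_MASK) == 0) {
        /* Ran out of ASIDs: the carry bumped the generation in the high
         * bits, so flush everything and start from ASID 0 again. */
        cpu->full_flushes++;
    }
    *mm_ctx = cpu->last_ctx;
    return (unsigned int)(*mm_ctx & ASID_MASK);
}
```

Because the generation lives in the high bits of the same counter, a stale mm is detected with one XOR and mask at switch time, and no per-mm state has to be touched when the ASID space rolls over.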
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f171.google.com (mail-lb0-f171.google.com [209.85.217.171]) by kanga.kvack.org (Postfix) with ESMTP id A7ABC6B0032 for ; Tue, 28 Apr 2015 19:23:52 -0400 (EDT) Received: by lbcga7 with SMTP id ga7so7942782lbc.1 for ; Tue, 28 Apr 2015 16:23:52 -0700 (PDT) Received: from mail-lb0-f172.google.com (mail-lb0-f172.google.com. [209.85.217.172]) by mx.google.com with ESMTPS id j8si18101767lah.14.2015.04.28.16.23.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:23:51 -0700 (PDT) Received: by lbbqq2 with SMTP id qq2so7867548lbb.3 for ; Tue, 28 Apr 2015 16:23:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:23:29 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 4:16 PM, Linus Torvalds wrote: > On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski wrote: >> >> I had a totally different implementation idea in mind. It goes >> something like this: >> >> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have >> a per-cpu array of the mm [1] that owns each PCID. [...] > > We've done this before on other architectures. See for example alpha. > Look up "__get_new_mm_context()" and friends. I think sparc does the > same (and I think sparc copied a lot of it from the alpha > implementation). > > Iirc, the alpha version just generates a (per-cpu) asid one at a time, > and has a generation counter so that when you run out of ASID's you do > a global TLB invalidate on that CPU and start from 0 again. 
Actually, > I think the generation number is just the high bits of the asid > counter (alpha calls them "asn", intel calls them "pcid", and I tend > to prefer "asid", but it's all the same thing). > > Then each thread just has a per-thread ASID. We don't try to make that > be per-thread and per-cpu, but instead just force a new allocation > when a thread moves to another CPU. Alpha appears to have a per-thread per-cpu id of some sort: /* The alpha MMU context is one "unsigned long" bitmap per CPU */ typedef unsigned long mm_context_t[NR_CPUS]; I think we can do it without that by keeping the mapping in reverse as I sort of outlined -- for each cpu, store a mapping from mm to pcid. When things fall out of the list, no big deal. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f177.google.com (mail-ig0-f177.google.com [209.85.213.177]) by kanga.kvack.org (Postfix) with ESMTP id 04A736B0032 for ; Tue, 28 Apr 2015 19:38:06 -0400 (EDT) Received: by igbhj9 with SMTP id hj9so35560008igb.1 for ; Tue, 28 Apr 2015 16:38:05 -0700 (PDT) Received: from mail-ig0-x22e.google.com (mail-ig0-x22e.google.com. 
[2607:f8b0:4001:c05::22e]) by mx.google.com with ESMTPS id zw6si2878igc.11.2015.04.28.16.38.05 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:38:05 -0700 (PDT) Received: by igblo3 with SMTP id lo3so103723375igb.1 for ; Tue, 28 Apr 2015 16:38:05 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 16:38:05 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski wrote: > > I think we can do it without that by keeping the mapping in reverse as > I sort of outlined -- for each cpu, store a mapping from mm to pcid. > When things fall out of the list, no big deal. So if you do it by just having a per-cpu array (of, say, 64 entries), you now end up having to search that every time you do a task switch to find the asid for the mm. And even then you've limited yourself to just six bits, because doing the same for a possible full 12-bit asid would not be possible. It's actually much simpler if you just do it the other way. But hey, maybe you do something clever and can figure out a good way to do it. I'm just saying that we *have* done this before on other architectures, and it has worked. I think ARM has another asid implementation in arch/arm/mm/context.c. I really think it would be a good idea to copy some existing case rather than make up a new one. It's not like asid's are unusual. It's arguably x86 that was unusual in _not_ having them. Linus 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f169.google.com (mail-lb0-f169.google.com [209.85.217.169]) by kanga.kvack.org (Postfix) with ESMTP id B57F76B0032 for ; Tue, 28 Apr 2015 19:49:51 -0400 (EDT) Received: by lbbzk7 with SMTP id zk7so8309020lbb.0 for ; Tue, 28 Apr 2015 16:49:50 -0700 (PDT) Received: from mail-la0-f51.google.com (mail-la0-f51.google.com. [209.85.215.51]) by mx.google.com with ESMTPS id wx3si18120560lbb.142.2015.04.28.16.49.49 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 16:49:49 -0700 (PDT) Received: by layy10 with SMTP id y10so8221487lay.0 for ; Tue, 28 Apr 2015 16:49:49 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:49:28 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML On Tue, Apr 28, 2015 at 4:38 PM, Linus Torvalds wrote: > On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski wrote: >> >> I think we can do it without that by keeping the mapping in reverse as >> I sort of outlined -- for each cpu, store a mapping from mm to pcid. >> When things fall out of the list, no big deal. > > So if you do it by just having a per-cpu array (of, say, 64 entries), you > now end up having to search that every time you do a task switch to > find the asid for the mm. And even then you've limited yourself to > just six bits, because doing the same for a possible full 12-bit asid > would not be possible. > > It's actually much simpler if you just do it the other way. I'm unconvinced. 
I doubt that trying to keep more than 4-8 PCIDs alive in a cpu's TLB is ever a win. After all, the TLB isn't that big, and, if we're only the 7th most recent mm to have been loaded on a cpu, I doubt any of our TLB entries are still likely to be there. Given that, even if we need 16 bytes of generation counter and such in the per-cpu array, that's at most 128 bytes. In practice, we really ought to be able to get it down to closer to 8 bytes with some care or we could only use 4 PCIDs, at which point the whole per-cpu structure fits in a single cache line. We can search it with 4-8 branches and no additional L1 misses. Sure, with 64 entries this would be expensive, but I think that's excessive. Also, this approach keeps the cost of blowing away stale PCIDs when we need to invalidate a TLB entry on an inactive PCID down to a single write as opposed to digging through the per-mm array to poke at the state for each cpu it might be cached in. But maybe I missed some trick that avoids needing to do that. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031188AbbD1WQP (ORCPT ); Tue, 28 Apr 2015 18:16:15 -0400 Received: from mta-out1.inet.fi ([62.71.2.227]:46320 "EHLO kirsi1.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030444AbbD1WQK (ORCPT ); Tue, 28 Apr 2015 18:16:10 -0400 Date: Wed, 29 Apr 2015 01:15:53 +0300 From: "Kirill A.
Shutemov" To: Andy Lutomirski , Dave Hansen Cc: Linus Torvalds , Andrew Morton , Mel Gorman , Rik van Riel , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org Subject: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) Message-ID: <20150428221553.GA5770@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: > At some point, I'd like to implement PCID on x86 (if no one beats me > to it, and this is a low priority for me), which will allow us to skip > expensive TLB flushes while context switching. I have no idea whether > ARM can do something similar. I talked with Dave about implementing PCID and he thinks that it will be net loss. TLB entries will live longer and it means we would need to trigger more IPIs to flash them out when we have to. Cost of IPIs will be higher than benifit from hot TLB after context switch. Do you have different expectations? -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031118AbbD1WiD (ORCPT ); Tue, 28 Apr 2015 18:38:03 -0400 Received: from mga09.intel.com ([134.134.136.24]:64622 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031071AbbD1WiB (ORCPT ); Tue, 28 Apr 2015 18:38:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.11,666,1422950400"; d="scan'208";a="720727137" Message-ID: <55400BC8.6080204@intel.com> Date: Tue, 28 Apr 2015 15:38:00 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: "Kirill A. 
Shutemov" , Andy Lutomirski CC: Linus Torvalds , Andrew Morton , Mel Gorman , Rik van Riel , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/28/2015 03:15 PM, Kirill A. Shutemov wrote: > On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >> At some point, I'd like to implement PCID on x86 (if no one beats me >> to it, and this is a low priority for me), which will allow us to skip >> expensive TLB flushes while context switching. I have no idea whether >> ARM can do something similar. > > I talked with Dave about implementing PCID and he thinks that it will be > net loss. TLB entries will live longer and it means we would need to trigger > more IPIs to flash them out when we have to. Cost of IPIs will be higher > than benifit from hot TLB after context switch. > > Do you have different expectations? Kirill, I think Andy is asking about something different than what you and I talked about. My point to you was that PCIDs cannot be used to replace, or in lieu of, TLB shootdowns because they *only* make TLB entries live longer. Their entire purpose is to make things live longer and to reduce the cost of the implicit TLB shootdowns that we do as a part of a context switch. I'm not sure if it will have a benefit overall. It depends on the increase in shootdown cost vs. the decrease in TLB refill cost at context switch. I think someone hacked up some code to do it (maybe just internally to Intel), so if anyone is seriously interested in implementing it, let me know and I'll see if I can dig it up.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031243AbbD1Wlv (ORCPT ); Tue, 28 Apr 2015 18:41:51 -0400 Received: from mx1.redhat.com ([209.132.183.28]:39649 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030901AbbD1Wlu (ORCPT ); Tue, 28 Apr 2015 18:41:50 -0400 Message-ID: <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 18:41:43 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: "Kirill A. Shutemov" , Andy Lutomirski , Dave Hansen CC: Linus Torvalds , Andrew Morton , Mel Gorman , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: > On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >> At some point, I'd like to implement PCID on x86 (if no one beats me >> to it, and this is a low priority for me), which will allow us to skip >> expensive TLB flushes while context switching. I have no idea whether >> ARM can do something similar. > > I talked with Dave about implementing PCID and he thinks that it will be > net loss. TLB entries will live longer and it means we would need to trigger > more IPIs to flash them out when we have to. Cost of IPIs will be higher > than benifit from hot TLB after context switch. I suspect that may depend on how you do the shootdown. 
If, when receiving a TLB shootdown for a non-current PCID, we just flush all the entries for that PCID and remove the CPU from the mm's cpu_vm_mask_var, we will never receive more than one shootdown IPI for a non-current mm, but we will still get the benefits of TLB longevity when dealing with eg. pipe workloads where tasks take turns running on the same CPU. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031153AbbD1Wyy (ORCPT ); Tue, 28 Apr 2015 18:54:54 -0400 Received: from mail-lb0-f175.google.com ([209.85.217.175]:34338 "EHLO mail-lb0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030957AbbD1Wyw (ORCPT ); Tue, 28 Apr 2015 18:54:52 -0400 MIME-Version: 1.0 In-Reply-To: <55400CA7.3050902@redhat.com> References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 15:54:29 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) To: Rik van Riel Cc: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: > On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>> At some point, I'd like to implement PCID on x86 (if no one beats me >>> to it, and this is a low priority for me), which will allow us to skip >>> expensive TLB flushes while context switching. I have no idea whether >>> ARM can do something similar. >> >> I talked with Dave about implementing PCID and he thinks that it will be >> net loss. 
TLB entries will live longer and it means we would need to trigger >> more IPIs to flash them out when we have to. Cost of IPIs will be higher >> than benifit from hot TLB after context switch. > > I suspect that may depend on how you do the shootdown. > > If, when receiving a TLB shootdown for a non-current PCID, we just flush > all the entries for that PCID and remove the CPU from the mm's > cpu_vm_mask_var, we will never receive more than one shootdown IPI for > a non-current mm, but we will still get the benefits of TLB longevity > when dealing with eg. pipe workloads where tasks take turns running on > the same CPU. I had a totally different implementation idea in mind. It goes something like this: For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have a per-cpu array of the mm [1] that owns each PCID. On context switch, we look up the new mm in the array and, if there's a PCID mapped, we switch cr3 and select that PCID. If there is no PCID mapped, we choose one (LRU? clock replacement?), switch cr3 and select and invalidate that PCID. When it's time to invalidate a TLB entry on an mm that's active remotely, we really don't want to send an IPI to a CPU that doesn't actually have that mm active. Instead we bump some kind of generation counter in the mm_struct that will cause the next switch to that mm not to match the PCID list. To keep this working, I think we also need to update the per-cpu PCID list with our generation counter either when we context switch out or when we process a TLB shootdown IPI. This could be a bit tricky to get right, but I think it can be done without adding more than a cacheline or two to the context switch overhead and without any extra IPIs at all. [1] It shouldn't be just an mm_struct pointer, because then we have to invalidate it somehow when we recycle an mm_struct. Maybe we'd use some kind of counter. We also need a TLB shootdown generation counter of some sort as described. 
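
A rough user-space sketch of the scheme above, to make the moving parts concrete. It is not taken from any posted patch: NR_PCIDS, struct mm_sketch, struct pcid_slot, struct cpu_pcids, choose_pcid() and remote_invalidate() are all hypothetical names, ctx_id stands in for the "some kind of counter" of footnote [1], and plain round-robin eviction is swapped in for the LRU/clock choice left open above. Only the "mm is cached but not running" case is modeled; a CPU that is actively running the mm would still need an IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PCIDS 8 /* hypothetical: PCIDs 0-7 on each CPU */

/* Stand-in for the two mm_struct fields the scheme would need. */
struct mm_sketch {
	uint64_t ctx_id;  /* unique per mm and never reused; 0 is reserved */
	uint64_t tlb_gen; /* bumped instead of IPI-ing CPUs where the mm is inactive */
};

/* One slot per PCID: which mm owns it, and the generation it was filled at. */
struct pcid_slot {
	uint64_t ctx_id;  /* 0 means the slot is free */
	uint64_t tlb_gen;
};

struct cpu_pcids {
	struct pcid_slot slot[NR_PCIDS];
	unsigned int next_victim; /* round-robin eviction cursor */
};

/*
 * Called on context switch. Returns the PCID to load into cr3 and sets
 * *need_flush when that PCID's cached entries must be invalidated first.
 */
static unsigned int choose_pcid(struct cpu_pcids *cpu,
				const struct mm_sketch *mm, bool *need_flush)
{
	unsigned int i;

	assert(mm->ctx_id != 0);

	for (i = 0; i < NR_PCIDS; i++) {
		if (cpu->slot[i].ctx_id != mm->ctx_id)
			continue;
		/* Hit: entries are stale only if the mm's generation moved on. */
		*need_flush = cpu->slot[i].tlb_gen != mm->tlb_gen;
		cpu->slot[i].tlb_gen = mm->tlb_gen;
		return i;
	}

	/* Miss: recycle a slot and flush whatever the old owner left behind. */
	i = cpu->next_victim;
	cpu->next_victim = (i + 1) % NR_PCIDS;
	cpu->slot[i].ctx_id = mm->ctx_id;
	cpu->slot[i].tlb_gen = mm->tlb_gen;
	*need_flush = true;
	return i;
}

/* Invalidation for CPUs where the mm is cached but not running: no IPI. */
static void remote_invalidate(struct mm_sketch *mm)
{
	mm->tlb_gen++;
}
```

With 8 slots the lookup is a short linear scan. A real version would also need memory ordering between the generation bump and the reader on the other CPU, which this sketch ignores.
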
--Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031317AbbD1W4h (ORCPT ); Tue, 28 Apr 2015 18:56:37 -0400 Received: from mail-ie0-f176.google.com ([209.85.223.176]:33513 "EHLO mail-ie0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030966AbbD1W4a (ORCPT ); Tue, 28 Apr 2015 18:56:30 -0400 MIME-Version: 1.0 In-Reply-To: <20150428221553.GA5770@node.dhcp.inet.fi> References: <20150428221553.GA5770@node.dhcp.inet.fi> Date: Tue, 28 Apr 2015 15:56:29 -0700 X-Google-Sender-Auth: lGdslqFxRr1p7KFTK2KlqTsucZs Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds To: "Kirill A. Shutemov" Cc: Andy Lutomirski , Dave Hansen , Andrew Morton , Mel Gorman , Rik van Riel , Linux Kernel Mailing List , linux-mm , "the arch/x86 maintainers" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 3:15 PM, Kirill A. Shutemov wrote: > > I talked with Dave about implementing PCID and he thinks that it will be > net loss. So I'm told that Suresh Siddha actually had a patch inside Intel to use PCID (back when he worked for Intel, I think he left), and that it was a wash in their testing. I never saw the patch, and it might be interesting to try it again, but there is some reason to believe that it doesn't make much of a difference. Unlike most of the traditional RISC machines that got big speedups, Intel TLB walking is so good that it likely isn't nearly as noticeable, and it likely *does* result in more IPI's etc. Possibly not a lot more, but if the win isn't big... So I don't want to discourage you, because I'd love to see what the patch looks like and if we can find cases where it matters, but I do want to set expectations right. It's unlikely to be a big issue. 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031337AbbD1W4p (ORCPT ); Tue, 28 Apr 2015 18:56:45 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59068 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031306AbbD1W4f (ORCPT ); Tue, 28 Apr 2015 18:56:35 -0400 Message-ID: <5540101D.7020800@redhat.com> Date: Tue, 28 Apr 2015 18:56:29 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Andy Lutomirski CC: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/28/2015 06:54 PM, Andy Lutomirski wrote: > On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: >> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>>> At some point, I'd like to implement PCID on x86 (if no one beats me >>>> to it, and this is a low priority for me), which will allow us to skip >>>> expensive TLB flushes while context switching. I have no idea whether >>>> ARM can do something similar. >>> >>> I talked with Dave about implementing PCID and he thinks that it will be >>> net loss. TLB entries will live longer and it means we would need to trigger >>> more IPIs to flash them out when we have to. Cost of IPIs will be higher >>> than benifit from hot TLB after context switch. >> >> I suspect that may depend on how you do the shootdown. 
>> >> If, when receiving a TLB shootdown for a non-current PCID, we just flush >> all the entries for that PCID and remove the CPU from the mm's >> cpu_vm_mask_var, we will never receive more than one shootdown IPI for >> a non-current mm, but we will still get the benefits of TLB longevity >> when dealing with eg. pipe workloads where tasks take turns running on >> the same CPU. > > I had a totally different implementation idea in mind. It goes > something like this: > > For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have > a per-cpu array of the mm [1] that owns each PCID. On context switch, > we look up the new mm in the array and, if there's a PCID mapped, we > switch cr3 and select that PCID. If there is no PCID mapped, we > choose one (LRU? clock replacement?), switch cr3 and select and > invalidate that PCID. > > When it's time to invalidate a TLB entry on an mm that's active > remotely, we really don't want to send an IPI to a CPU that doesn't > actually have that mm active. Instead we bump some kind of generation > counter in the mm_struct that will cause the next switch to that mm > not to match the PCID list. To keep this working, I think we also > need to update the per-cpu PCID list with our generation counter > either when we context switch out or when we process a TLB shootdown > IPI. If we do that, we can also get rid of TLB shootdowns for idle CPUs in lazy TLB mode. Very nice, if the details work out. 
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031179AbbD1XBk (ORCPT ); Tue, 28 Apr 2015 19:01:40 -0400 Received: from mail-lb0-f171.google.com ([209.85.217.171]:36021 "EHLO mail-lb0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030960AbbD1XBh (ORCPT ); Tue, 28 Apr 2015 19:01:37 -0400 MIME-Version: 1.0 In-Reply-To: <5540101D.7020800@redhat.com> References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> <5540101D.7020800@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:01:15 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) To: Rik van Riel Cc: "Kirill A. Shutemov" , Dave Hansen , Linus Torvalds , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 3:56 PM, Rik van Riel wrote: > On 04/28/2015 06:54 PM, Andy Lutomirski wrote: >> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel wrote: >>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote: >>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote: >>>>> At some point, I'd like to implement PCID on x86 (if no one beats me >>>>> to it, and this is a low priority for me), which will allow us to skip >>>>> expensive TLB flushes while context switching. I have no idea whether >>>>> ARM can do something similar. >>>> >>>> I talked with Dave about implementing PCID and he thinks that it will be >>>> net loss. TLB entries will live longer and it means we would need to trigger >>>> more IPIs to flash them out when we have to. Cost of IPIs will be higher >>>> than benifit from hot TLB after context switch. >>> >>> I suspect that may depend on how you do the shootdown. 
>>> >>> If, when receiving a TLB shootdown for a non-current PCID, we just flush >>> all the entries for that PCID and remove the CPU from the mm's >>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for >>> a non-current mm, but we will still get the benefits of TLB longevity >>> when dealing with eg. pipe workloads where tasks take turns running on >>> the same CPU. >> >> I had a totally different implementation idea in mind. It goes >> something like this: >> >> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have >> a per-cpu array of the mm [1] that owns each PCID. On context switch, >> we look up the new mm in the array and, if there's a PCID mapped, we >> switch cr3 and select that PCID. If there is no PCID mapped, we >> choose one (LRU? clock replacement?), switch cr3 and select and >> invalidate that PCID. >> >> When it's time to invalidate a TLB entry on an mm that's active >> remotely, we really don't want to send an IPI to a CPU that doesn't >> actually have that mm active. Instead we bump some kind of generation >> counter in the mm_struct that will cause the next switch to that mm >> not to match the PCID list. To keep this working, I think we also >> need to update the per-cpu PCID list with our generation counter >> either when we context switch out or when we process a TLB shootdown >> IPI. > > If we do that, we can also get rid of TLB shootdowns for > idle CPUs in lazy TLB mode. > > Very nice, if the details work out. > I wonder if we could treat the non-PCID case just like the PCID case but with only one PCID. Maybe get rid of the mm vs active_mm distinction. Maybe not, though -- if nothing else, we still need to kick our pgd out from idle or kthread CPUs before we free it. The reason I thought of PCIDs this way is that 12 bits isn't nearly enough to get away with allocating each mm its own PCID. 
Rather than trying to shoehorn them in, it seemed like a better approach would be to only use a very small number, since keeping around TLB entries that are more than a few context switches old seems mostly useless. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031293AbbD1XQU (ORCPT ); Tue, 28 Apr 2015 19:16:20 -0400 Received: from mail-ie0-f176.google.com ([209.85.223.176]:33683 "EHLO mail-ie0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031051AbbD1XQT (ORCPT ); Tue, 28 Apr 2015 19:16:19 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 16:16:18 -0700 X-Google-Sender-Auth: ewxJt7qXg8xjPDzLo6-jXWEFtQ0 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski wrote: > > I had a totally different implementation idea in mind. It goes > something like this: > > For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have > a per-cpu array of the mm [1] that owns each PCID. [...] We've done this before on other architectures. See for example alpha. Look up "__get_new_mm_context()" and friends. I think sparc does the same (and I think sparc copied a lot of it from the alpha implementation). Iirc, the alpha version just generates a (per-cpu) asid one at a time, and has a generation counter so that when you run out of ASID's you do a global TLB invalidate on that CPU and start from 0 again. 
Actually, I think the generation number is just the high bits of the asid counter (alpha calls them "asn", intel calls them "pcid", and I tend to prefer "asid", but it's all the same thing). Then each thread just has a per-thread ASID. We don't try to make that be per-thread and per-cpu, but instead just force a new allocation when a thread moves to another CPU. It's not obvious what alpha does, because we end up hiding the per-thread ASN in the "struct pcb_struct" (in 'struct thread_info') which is part of the alpha pal-code interface. But it seemed to work and is fairly simple. I think something very similar should work with intel pcid's. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031290AbbD1XTw (ORCPT ); Tue, 28 Apr 2015 19:19:52 -0400 Received: from mail-ig0-f176.google.com ([209.85.213.176]:38076 "EHLO mail-ig0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031017AbbD1XTv (ORCPT ); Tue, 28 Apr 2015 19:19:51 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> <5540101D.7020800@redhat.com> Date: Tue, 28 Apr 2015 16:19:50 -0700 X-Google-Sender-Auth: 90Yy9xRyvSOn2xNty38ZznNjx-U Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 4:01 PM, Andy Lutomirski wrote: > > The reason I thought of PCIDs this way is that 12 bits isn't nearly > enough to get away with allocating each mm its own PCID. Not even close. And really, we've already done this for other architectures.
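
For comparison, a rough sketch of the alpha-style scheme mentioned in this thread: a per-CPU counter whose low bits are the hardware ASID and whose high bits act as the generation. It is modeled on the description in the thread, not copied from arch/alpha. ASID_BITS, NR_CPUS_SKETCH, struct cpu_asid, struct mm_ctx, new_context() and switch_asid() are hypothetical names, the width is an arbitrary illustrative value, and keeping one context value per CPU in each mm is an assumption.

```c
#include <assert.h>

#define ASID_BITS 6 /* illustrative width only */
#define ASID_MASK ((1UL << ASID_BITS) - 1)
#define ASID_FIRST_GEN (1UL << ASID_BITS) /* generation 0 means "no context" */
#define NR_CPUS_SKETCH 4

struct cpu_asid {
	unsigned long last_ctx;    /* generation (high bits) | last ASID handed out */
	unsigned long tlb_flushes; /* counts full flushes; stands in for the real flush */
};

struct mm_ctx {
	unsigned long ctx[NR_CPUS_SKETCH]; /* generation | ASID per CPU, 0 = none yet */
};

/* Hand out the next ASID on this CPU. */
static unsigned long new_context(struct cpu_asid *cpu)
{
	unsigned long next = cpu->last_ctx + 1;

	/* The carry out of the ASID bits is the generation bump. */
	if ((next & ASID_MASK) == 0)
		cpu->tlb_flushes++; /* ran out of ASIDs: flush this CPU's TLB */
	cpu->last_ctx = next;
	return next;
}

/* Context switch: reuse the mm's ASID only if it is from the current generation. */
static unsigned long switch_asid(struct cpu_asid *cpu, unsigned int cpu_nr,
				 struct mm_ctx *mm)
{
	unsigned long ctx;

	assert(cpu_nr < NR_CPUS_SKETCH);
	ctx = mm->ctx[cpu_nr];
	if ((ctx ^ cpu->last_ctx) & ~ASID_MASK)
		ctx = mm->ctx[cpu_nr] = new_context(cpu);
	return ctx & ASID_MASK;
}
```

The context switch path here is a single compare of the high bits; the price is a full flush on that CPU each time the ASID space wraps. The counter is expected to start at ASID_FIRST_GEN so that a zeroed mm_ctx never matches.
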
On alpha, the number of bits in the pcid is model-specific, but it was something like 6 for the ones I used. That's plenty. Also, I don't think Intel actually does 12 bits of pcid. What they do is to hash the 12 bits down to something smaller (like two or three bits in the actual TLB data structure), and then the CPU basically invalidates any pcid's that alias (have a small 4- or 8-entry array saying that "this hash was used for this 12-bit pcid"). So there's actually *another* level of dynamic mapping going on below the software interface. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031213AbbD1XXy (ORCPT ); Tue, 28 Apr 2015 19:23:54 -0400 Received: from mail-la0-f53.google.com ([209.85.215.53]:35910 "EHLO mail-la0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030960AbbD1XXw (ORCPT ); Tue, 28 Apr 2015 19:23:52 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:23:29 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) To: Linus Torvalds Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 4:16 PM, Linus Torvalds wrote: > On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski wrote: >> >> I had a totally different implementation idea in mind. It goes >> something like this: >> >> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have >> a per-cpu array of the mm [1] that owns each PCID. [...] > > We've done this before on other architectures. See for example alpha. > Look up "__get_new_mm_context()" and friends.
I think sparc does the > same (and I think sparc copied a lot of it from the alpha > implementation). > > Iirc, the alpha version just generates a (per-cpu) asid one at a time, > and has a generation counter so that when you run out of ASID's you do > a global TLB invalidate on that CPU and start from 0 again. Actually, > I think the generation number is just the high bits of the asid > counter (alpha calls them "asn", intel calls them "pcid", and I tend > to prefer "asid", but it's all the same thing). > > Then each thread just has a per-thread ASID. We don't try to make that > be per-thread and per-cpu, but instead just force a new allocation > when a thread moves to another CPU. Alpha appears to have a per-thread per-cpu id of some sort: /* The alpha MMU context is one "unsigned long" bitmap per CPU */ typedef unsigned long mm_context_t[NR_CPUS]; I think we can do it without that by keeping the mapping in reverse as I sort of outlined -- for each cpu, store a mapping from mm to pcid. When things fall out of the list, no big deal. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031364AbbD1XiI (ORCPT ); Tue, 28 Apr 2015 19:38:08 -0400 Received: from mail-ie0-f175.google.com ([209.85.223.175]:33103 "EHLO mail-ie0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031079AbbD1XiG (ORCPT ); Tue, 28 Apr 2015 19:38:06 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> Date: Tue, 28 Apr 2015 16:38:05 -0700 X-Google-Sender-Auth: vfIViZw4N0TGLQ6TMBpW3JJjdag Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) From: Linus Torvalds To: Andy Lutomirski Cc: Rik van Riel , "Kirill A. 
Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski wrote: > > I think we can do it without that by keeping the mapping in reverse as > I sort of outlined -- for each cpu, store a mapping from mm to pcid. > When things fall out of the list, no big deal. So you do it by just having a per-cpu array of (say, 64 entries), you now end up having to search that every time you do a task switch to find the asid for the mm. And even then you've limited yourself to just six bits, because doing the same for a possible full 12-bit asid would not be possible. It's actually much simpler if you just do it the other way. But hey, maybe you do something clever and can figure out a good way to do it. I'm just saying that we *have* done this before on other architectures, and it has worked. I think ARM has another asid implementation in arch/arm/mm/context.c. I really think it would be a good idea to copy some existing case rather than make up a new one. It's not like asid's are unusual. It's arguably x86 that was unusual in _not_ having them. 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031249AbbD1Xty (ORCPT ); Tue, 28 Apr 2015 19:49:54 -0400 Received: from mail-lb0-f178.google.com ([209.85.217.178]:33419 "EHLO mail-lb0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031121AbbD1Xtu (ORCPT ); Tue, 28 Apr 2015 19:49:50 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150428221553.GA5770@node.dhcp.inet.fi> <55400CA7.3050902@redhat.com> From: Andy Lutomirski Date: Tue, 28 Apr 2015 16:49:28 -0700 Message-ID: Subject: Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1) To: Linus Torvalds Cc: Rik van Riel , "Kirill A. Shutemov" , Dave Hansen , Andrew Morton , Mel Gorman , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 28, 2015 at 4:38 PM, Linus Torvalds wrote: > On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski wrote: >> >> I think we can do it without that by keeping the mapping in reverse as >> I sort of outlined -- for each cpu, store a mapping from mm to pcid. >> When things fall out of the list, no big deal. > > So you do it by just having a per-cpu array of (say, 64 entries), you > now end up having to search that every time you do a task switch to > find the asid for the mm. And even then you've limited yourself to > just six bits, because doing the same for a possible full 12-bit asid > would not be possible. > > It's actually much simpler if you just do it the other way. I'm unconvinced. I doubt that trying to keep more than 4-8 PCIDs alive in a cpu's TLB is ever a win. After all, the TLB isn't that big, and, if we're only the 7th most recent mm to have been loaded on a cpu, I doubt any of our TLB entries are still likely to be there. 
Given that, even if we need 16 bytes of generation counter and such in the per-cpu array, that's at most 128 bytes. In practice, we really ought to be able to get it down to closer to 8 bytes with some care or we could only use 4 PCIDs, at which point the whole per-cpu structure fits in a single cache line. We can search it with 4-8 branches and no additional L1 misses. Sure, with 64 entries this would be expensive, but I think that's excessive. Also, this approach keeps the cost of blowing away stale PCIDs when we need to invalidate a TLB entry on an inactive PCID down to a single write as opposed to digging through the per-mm array to poke at the state for each cpu it might be cached in. But maybe I missed some trick that avoids needing to do that. --Andy
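
To spell out the size arithmetic above, here is a small sketch. It assumes a 64-byte cache line (typical on x86, but an assumption here) and two hypothetical entry layouts; the struct and function names are made up for illustration. The 8-byte variant simply truncates both fields to 32 bits, which a real implementation would have to justify against counter wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64 /* assumption: typical x86 cache line size */

/* 16-byte entry: which mm owns the PCID, plus a generation counter. */
struct pcid_entry16 {
	uint64_t mm_id;
	uint64_t tlb_gen;
};

/* 8-byte entry: the same two fields squeezed into 32 bits each. */
struct pcid_entry8 {
	uint32_t mm_id;
	uint32_t tlb_gen;
};

/* 8 PCIDs x 16 bytes is the "at most 128 bytes" case: two cache lines. */
_Static_assert(8 * sizeof(struct pcid_entry16) == 128, "8 x 16 bytes");
/* Either 4 PCIDs x 16 bytes or 8 PCIDs x 8 bytes fits in one line. */
_Static_assert(4 * sizeof(struct pcid_entry16) == CACHE_LINE_BYTES, "4 x 16 bytes");
_Static_assert(8 * sizeof(struct pcid_entry8) == CACHE_LINE_BYTES, "8 x 8 bytes");

/* Number of cache lines a per-cpu table of this shape would touch. */
static size_t cache_lines_needed(size_t nr_pcids, size_t entry_bytes)
{
	assert(nr_pcids > 0 && entry_bytes > 0);
	return (nr_pcids * entry_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

The table would also have to be aligned to a cache-line boundary for the single-line case to hold in practice.
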