* [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads
@ 2013-07-31 21:43 Rik van Riel
  2013-07-31 21:46 ` Paul Turner
  2013-07-31 22:07 ` Linus Torvalds
  0 siblings, 2 replies; 20+ messages in thread
From: Rik van Riel @ 2013-07-31 21:43 UTC (permalink / raw)
  To: torvalds; +Cc: mingo, linux-kernel, jmario, dzickus, hpa

Don Zickus and Joe Mario have been working on improvements to
perf, and noticed heavy cache line contention on the mm_cpumask,
running linpack on a 60 core / 120 thread system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

 * The __schedule function dropped from 24% of all program cycles down
   to 5.5%.
 * The cacheline contention/hotness for accesses to that bitmask went
   from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
   all shared misses which is now quite cold)
 * The average load latency for the bit-test-n-set instruction in
   __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
 * The linpack program results improved from 133 GFlops to 144 GFlops.
   Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/mmu_context.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..987eb3d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -59,11 +59,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/* We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
 		}
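For readers who don't live in arch/x86, here is a minimal user-space sketch of the pattern change in the patch above, with C11 atomics standing in for the kernel's cpumask helpers; the function names and the 64-bit mask are illustrative, not from the patch:

        /* Sketch only: the whole patch is "read first, write only if needed"
         * instead of "always do a locked read-modify-write". Assumes cpu < 64. */
        #include <stdatomic.h>
        #include <stdbool.h>

        /* old behaviour (cpumask_test_and_set_cpu): a locked RMW on every
         * thread switch, so the shared mask cacheline keeps getting dirtied */
        static bool test_and_set(atomic_ulong *mask, unsigned int cpu)
        {
                unsigned long bit = 1UL << cpu;

                return atomic_fetch_or(mask, bit) & bit;
        }

        /* new behaviour (cpumask_test_cpu + cpumask_set_cpu): a plain read in
         * the common case; the locked RMW only runs if a TLB flush cleared
         * the bit while the CPU was in lazy TLB mode */
        static bool test_then_set(atomic_ulong *mask, unsigned int cpu)
        {
                unsigned long bit = 1UL << cpu;

                if (atomic_load_explicit(mask, memory_order_relaxed) & bit)
                        return true;            /* common case: no write at all */
                atomic_fetch_or(mask, bit);     /* rare case: take the atomic hit */
                return false;
        }

Either way the caller reloads CR3 only when the bit was not already set, exactly as in the hunk above.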
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-07-31 21:43 [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads Rik van Riel @ 2013-07-31 21:46 ` Paul Turner 2013-07-31 22:07 ` Linus Torvalds 1 sibling, 0 replies; 20+ messages in thread From: Paul Turner @ 2013-07-31 21:46 UTC (permalink / raw) To: Rik van Riel; +Cc: Linus Torvalds, Ingo Molnar, LKML, jmario, dzickus, hpa On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <riel@redhat.com> wrote: > Don Zickus and Joe Mario have been working on improvements to > perf, and noticed heavy cache line contention on the mm_cpumask, > running linpack on a 60 core / 120 thread system. > > The cause turned out to be unnecessary atomic accesses to the > mm_cpumask. When in lazy TLB mode, the CPU is only removed from > the mm_cpumask if there is a TLB flush event. > > Most of the time, no such TLB flush happens, and the kernel > skips the TLB reload. It can also skip the atomic memory > set & test. > > Here is a summary of Joe's test results: > > * The __schedule function dropped from 24% of all program cycles down > to 5.5%. > * The cacheline contention/hotness for accesses to that bitmask went > from being the 1st/2nd hottest - down to the 84th hottest (0.3% of > all shared misses which is now quite cold) > * The average load latency for the bit-test-n-set instruction in > __schedule dropped from 10k-15k cycles down to an average of 600 cycles. > * The linpack program results improved from 133 GFlops to 144 GFlops. > Peak GFlops rose from 133 to 153. > > Reported-by: Don Zickus <dzickus@redhat.com> > Reported-by: Joe Mario <jmario@redhat.com> > Tested-by: Joe Mario <jmario@redhat.com> > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > arch/x86/include/asm/mmu_context.h | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h > index cdbf367..987eb3d 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -59,11 +59,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, > this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); > > - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { > + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { > /* We were in lazy tlb mode and leave_mm disabled > * tlb flush IPI delivery. We must reload CR3 > * to make sure to use no freed page tables. > */ > + cpumask_set_cpu(cpu, mm_cpumask(next)); > load_cr3(next->pgd); > load_LDT_nolock(&next->context); > } We're carrying the *exact* same patch for *exact* same reason. I've been meaning to send it out but wasn't sure of a good external workload for this. Reviewed-by: Paul Turner <pjt@google.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads
  2013-07-31 21:43 [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads Rik van Riel
  2013-07-31 21:46 ` Paul Turner
@ 2013-07-31 22:07 ` Linus Torvalds
  2013-07-31 22:16   ` Rik van Riel
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2013-07-31 22:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Linux Kernel Mailing List, jmario, dzickus, Peter Anvin

On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <riel@redhat.com> wrote:
>
> The cause turned out to be unnecessary atomic accesses to the
> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> the mm_cpumask if there is a TLB flush event.
>
> Most of the time, no such TLB flush happens, and the kernel
> skips the TLB reload. It can also skip the atomic memory
> set & test.

The patch looks obvious, and I'm not seeing any very clear reasons for
why we would want that test-and-set to be atomic. That said, I'd like
to have some explicit comments about exactly why it doesn't need the
atomicity. Because afaik, there actually are concurrent readers _and_
writers of that mask, and the accesses are not locked by anything here.

I _think_ the reason for this all being safe is simply that the only
real race is "We need to set the bit before we load the page table,
and we're protected against that bit being cleared because the TLB
state is TLBSTATE_OK and thus TLB flushing will no longer leave that
mm".

But damn, it all looks subtle as hell. That code does:

        this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
        BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

        if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {

and I'm wondering if we need a barrier to make sure that that
TLBSTATE_OK write happens *before* we test the cpumask. With
test_and_set(), we have the barrier in the test-and-set. But with just
test_bit, I'm not seeing why the compiler couldn't re-order them. I
suspect it won't, but...

So I think the patch is _probably_ fine, but I think this is a hell of
a lot more subtle than it all looks, and I'd like people to show that
they have thought this through. And I don't see any indication of that
in the changelog - it only talks about the performance issues, not
about the *correctness* issues.

Please?

            Linus
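To make the ordering question concrete, here is the fragment with the two variants side by side; the statements are the ones quoted above, the annotations are only illustrative, and whether any explicit barrier is really needed is exactly the open question being raised:

        this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);        /* (A) */
        BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

        /* before the patch: a lock-prefixed RMW, which on x86 also acts as a
         * full memory barrier, so (A) cannot be observed after the bit test */
        if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
                /* ... reload CR3 / LDT ... */
        }

        /* after the patch: (B) is a plain load; only compiler and x86
         * ordering rules keep (A) ahead of (B) */
        if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {         /* (B) */
                cpumask_set_cpu(cpu, mm_cpumask(next));
                /* ... reload CR3 / LDT ... */
        }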
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-07-31 22:07 ` Linus Torvalds @ 2013-07-31 22:16 ` Rik van Riel [not found] ` <CA+55aFwj+6P4y8MgGTGbiK_EtfY7LJ_bL_7k9zFYLx4S8F0rJQ@mail.gmail.com> 0 siblings, 1 reply; 20+ messages in thread From: Rik van Riel @ 2013-07-31 22:16 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Linux Kernel Mailing List, jmario, dzickus, Peter Anvin On 07/31/2013 06:07 PM, Linus Torvalds wrote: > On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <riel@redhat.com> wrote: >> >> The cause turned out to be unnecessary atomic accesses to the >> mm_cpumask. When in lazy TLB mode, the CPU is only removed from >> the mm_cpumask if there is a TLB flush event. >> >> Most of the time, no such TLB flush happens, and the kernel >> skips the TLB reload. It can also skip the atomic memory >> set & test. > > The patch looks obvious, and I'm not seeing any very clear reasons for > why we would want that test-and-set to be atomic. That said, I'd like > to have some explicit comments about exactly why it doesn't need the > atomicity. Because afaik, there actually are concurrent readers _and_ > writers of that mask, and the accesses are not locked by anything > here. > > I _think_ the reason for this all being safe is simply that the only > real race is "We need to set the bit before we load the page table, > and we're protected against that bit being cleared because the TLB > state is TLBSTATE_OK and thus TLB flushing will no longer leave that > mm". > > But damn, it all looks subtle as hell. That code does: > > this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); > > if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { > > and I'm wondering if we need a barrier to make sure that that > TLBSTATE_OK write happens *before* we test the cpumask. With > test_and_set(), we have the barrier in the test-and-set. But with just > test_bit, I'm not seeing why the compiler couldn't re-order them. I > suspect it won't, but... cpumask_set_bit expands to set_bit, which is atomic Do we need any explicit compiler barrier in addition to the atomic operation performed by set_bit? I would be happy to rewrite the comment right above the cpumask_set_cpu call if you want. -- All rights reversed ^ permalink raw reply [flat|nested] 20+ messages in thread
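Purely for illustration, this is roughly where an explicit compiler barrier would sit if one were judged necessary (barrier() is the kernel's compiler barrier); note that the patch that was eventually merged, shown in the tip commit further down, does not add one:

        this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
        barrier();      /* hypothetical: keep the compiler from hoisting the
                           cpumask read above the tlbstate write */
        if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
                cpumask_set_cpu(cpu, mm_cpumask(next));
                /* ... reload CR3 / LDT as in the patch ... */
        }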
[parent not found: <CA+55aFwj+6P4y8MgGTGbiK_EtfY7LJ_bL_7k9zFYLx4S8F0rJQ@mail.gmail.com>]
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads [not found] ` <CA+55aFwj+6P4y8MgGTGbiK_EtfY7LJ_bL_7k9zFYLx4S8F0rJQ@mail.gmail.com> @ 2013-07-31 22:39 ` Rik van Riel [not found] ` <CA+55aFxKSEHSkdsCkTvjwwo4MnEpN0TwJrek2jd1QJCyUTb-=Q@mail.gmail.com> 0 siblings, 1 reply; 20+ messages in thread From: Rik van Riel @ 2013-07-31 22:39 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel Mailing List, Peter Anvin, jmario, dzickus, Ingo Molnar On 07/31/2013 06:21 PM, Linus Torvalds wrote: > Ummm.. The race is to the testing of the bit, not setting. The testing > of the bit is not valid before we have set the tlb state, AFAIK. I believe the bit is cleared and set by the current CPU. Clearing is done from the TLB flush IPI handler, or by directly calling leave_mm from ptep_flush_clear if the flush originated locally. The exception is clear_tasks_mm_cpumask, which may only be called for an already offlined CPU. I believe setting is only ever done in switch_mm. Interrupts are blocked inside switch_mm, so I think we are safe. Would you like a comment to this effect in the code, or are there other things we need to check first? > On Jul 31, 2013 3:16 PM, "Rik van Riel" <riel@redhat.com > <mailto:riel@redhat.com>> wrote: > > On 07/31/2013 06:07 PM, Linus Torvalds wrote: > > On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <riel@redhat.com > <mailto:riel@redhat.com>> wrote: > > > The cause turned out to be unnecessary atomic accesses to the > mm_cpumask. When in lazy TLB mode, the CPU is only removed from > the mm_cpumask if there is a TLB flush event. > > Most of the time, no such TLB flush happens, and the kernel > skips the TLB reload. It can also skip the atomic memory > set & test. > > > The patch looks obvious, and I'm not seeing any very clear > reasons for > why we would want that test-and-set to be atomic. That said, I'd > like > to have some explicit comments about exactly why it doesn't need the > atomicity. Because afaik, there actually are concurrent readers > _and_ > writers of that mask, and the accesses are not locked by anything > here. > > > > > I _think_ the reason for this all being safe is simply that the only > real race is "We need to set the bit before we load the page table, > and we're protected against that bit being cleared because the TLB > state is TLBSTATE_OK and thus TLB flushing will no longer leave that > mm". > > But damn, it all looks subtle as hell. That code does: > > this_cpu_write(cpu_tlbstate.__state, TLBSTATE_OK); > BUG_ON(this_cpu_read(cpu___tlbstate.active_mm) > != next); > > if (!cpumask_test_and_set_cpu(__cpu, > mm_cpumask(next))) { > > and I'm wondering if we need a barrier to make sure that that > TLBSTATE_OK write happens *before* we test the cpumask. With > test_and_set(), we have the barrier in the test-and-set. But > with just > test_bit, I'm not seeing why the compiler couldn't re-order them. I > suspect it won't, but... > > > cpumask_set_bit expands to set_bit, which is atomic > > Do we need any explicit compiler barrier in addition to the > atomic operation performed by set_bit? > > I would be happy to rewrite the comment right above the > cpumask_set_cpu call if you want. > > -- > All rights reversed > -- All rights reversed ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <CA+55aFxKSEHSkdsCkTvjwwo4MnEpN0TwJrek2jd1QJCyUTb-=Q@mail.gmail.com>]
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads [not found] ` <CA+55aFxKSEHSkdsCkTvjwwo4MnEpN0TwJrek2jd1QJCyUTb-=Q@mail.gmail.com> @ 2013-07-31 23:12 ` Rik van Riel 2013-07-31 23:14 ` Paul Turner 2013-08-01 15:37 ` [PATCH] sched,x86: optimize switch_mm " Jörn Engel 0 siblings, 2 replies; 20+ messages in thread From: Rik van Riel @ 2013-07-31 23:12 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On 07/31/2013 06:46 PM, Linus Torvalds wrote: > > On Jul 31, 2013 3:39 PM, "Rik van Riel" <riel@redhat.com > <mailto:riel@redhat.com>> wrote: > > > > On 07/31/2013 06:21 PM, Linus Torvalds wrote: > >> > >> Ummm.. The race is to the testing of the bit, not setting. The testing > >> of the bit is not valid before we have set the tlb state, AFAIK. > > > > > > I believe the bit is cleared and set by the current CPU. > > Yes, but we need to be careful with interrupts. > > > Interrupts are blocked inside switch_mm, so I think we > > are safe. > > Are they? I thought we removed all that. context_switch() shows that the runqueue lock (which is an irq lock) is released, and irqs re-enabled, by the next task, after switch_to(), in finish_lock_switch(), called from finish_task_switch() > Note that switch_mm gets called for activate_mm too, or something. Good catch, though it looks like activate_mm is only called from exec_mmap, with the new mm as its argument. While the new mm can have pages in memory yet, it has never been run so there should be nothing in the TLB yet for the new mm. This is subtler than I thought, but it does appear to be safe. -- All rights reversed ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-07-31 23:12 ` Rik van Riel @ 2013-07-31 23:14 ` Paul Turner 2013-08-01 0:41 ` Linus Torvalds 2013-08-01 15:37 ` [PATCH] sched,x86: optimize switch_mm " Jörn Engel 1 sibling, 1 reply; 20+ messages in thread From: Paul Turner @ 2013-07-31 23:14 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar We attached the following explanatory comment to our version of the patch: /* * In the common case (two user threads sharing mm * switching) the bit will be set; avoid doing a write * (via atomic test & set) unless we have to. This is * safe, because no other CPU ever writes to our bit * in the mask, and interrupts are off (so we can't * take a TLB IPI here.) If we don't do this, then * switching threads will pingpong the cpumask * cacheline. */ On Wed, Jul 31, 2013 at 4:12 PM, Rik van Riel <riel@redhat.com> wrote: > On 07/31/2013 06:46 PM, Linus Torvalds wrote: >> >> >> On Jul 31, 2013 3:39 PM, "Rik van Riel" <riel@redhat.com >> >> <mailto:riel@redhat.com>> wrote: >> > >> > On 07/31/2013 06:21 PM, Linus Torvalds wrote: >> >> >> >> Ummm.. The race is to the testing of the bit, not setting. The testing >> >> of the bit is not valid before we have set the tlb state, AFAIK. >> > >> > >> > I believe the bit is cleared and set by the current CPU. >> >> Yes, but we need to be careful with interrupts. >> >> >> > Interrupts are blocked inside switch_mm, so I think we >> > are safe. >> >> Are they? I thought we removed all that. > > > context_switch() shows that the runqueue lock (which is an irq > lock) is released, and irqs re-enabled, by the next task, after > switch_to(), in finish_lock_switch(), called from finish_task_switch() > >> Note that switch_mm gets called for activate_mm too, or something. > > > Good catch, though it looks like activate_mm is only called from > exec_mmap, with the new mm as its argument. While the new mm can > have pages in memory yet, it has never been run so there should > be nothing in the TLB yet for the new mm. > > This is subtler than I thought, but it does appear to be safe. > > > -- > All rights reversed > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-07-31 23:14 ` Paul Turner @ 2013-08-01 0:41 ` Linus Torvalds 2013-08-01 1:58 ` Rik van Riel 2013-08-01 2:14 ` [PATCH -v2] " Rik van Riel 0 siblings, 2 replies; 20+ messages in thread From: Linus Torvalds @ 2013-08-01 0:41 UTC (permalink / raw) To: Paul Turner Cc: Rik van Riel, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner <pjt@google.com> wrote: > We attached the following explanatory comment to our version of the patch: > > /* > * In the common case (two user threads sharing mm > * switching) the bit will be set; avoid doing a write > * (via atomic test & set) unless we have to. This is > * safe, because no other CPU ever writes to our bit > * in the mask, and interrupts are off (so we can't > * take a TLB IPI here.) If we don't do this, then > * switching threads will pingpong the cpumask > * cacheline. > */ So as mentioned, the "interrupts will be off" is actually dubious. It's true for the context switch case, but not for the activate_mm(). However, as Rik points out, activate_mm() is different in that we shouldn't have any preexisting MMU state anyway. And besides, that should never trigger the "prev == next" case. But it does look a bit messy, and even your comment is a bit misleading (it might make somebody think that all of switch_mm() is protected from interrupts) . Anyway, I'm perfectly ok with the patch itself, but I just wanted to make sure people had thought about these things. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 0:41 ` Linus Torvalds @ 2013-08-01 1:58 ` Rik van Riel 2013-08-01 2:14 ` Linus Torvalds 2013-08-01 2:14 ` [PATCH -v2] " Rik van Riel 1 sibling, 1 reply; 20+ messages in thread From: Rik van Riel @ 2013-08-01 1:58 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Turner, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On 07/31/2013 08:41 PM, Linus Torvalds wrote: > On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner <pjt@google.com> wrote: >> We attached the following explanatory comment to our version of the patch: >> >> /* >> * In the common case (two user threads sharing mm >> * switching) the bit will be set; avoid doing a write >> * (via atomic test & set) unless we have to. This is >> * safe, because no other CPU ever writes to our bit >> * in the mask, and interrupts are off (so we can't >> * take a TLB IPI here.) If we don't do this, then >> * switching threads will pingpong the cpumask >> * cacheline. >> */ > > So as mentioned, the "interrupts will be off" is actually dubious. > It's true for the context switch case, but not for the activate_mm(). > > However, as Rik points out, activate_mm() is different in that we > shouldn't have any preexisting MMU state anyway. And besides, that > should never trigger the "prev == next" case. > > But it does look a bit messy, and even your comment is a bit > misleading (it might make somebody think that all of switch_mm() is > protected from interrupts) > . > Anyway, I'm perfectly ok with the patch itself, but I just wanted to > make sure people had thought about these things. Would you like me to document the things we found in the comment, and resend a patch, or is the patch good as-is? -- All rights reversed ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 1:58 ` Rik van Riel @ 2013-08-01 2:14 ` Linus Torvalds 0 siblings, 0 replies; 20+ messages in thread From: Linus Torvalds @ 2013-08-01 2:14 UTC (permalink / raw) To: Rik van Riel Cc: Paul Turner, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On Wed, Jul 31, 2013 at 6:58 PM, Rik van Riel <riel@redhat.com> wrote: > > Would you like me to document the things we found in the comment, > and resend a patch, or is the patch good as-is? Please re-send with commentary.. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads
  2013-08-01  0:41 ` Linus Torvalds
  2013-08-01  1:58   ` Rik van Riel
@ 2013-08-01  2:14   ` Rik van Riel
  2013-08-01  2:25     ` Linus Torvalds
  2013-08-02  9:07     ` [tip:sched/core] sched/x86: Optimize switch_mm() " tip-bot for Rik van Riel
  1 sibling, 2 replies; 20+ messages in thread
From: Rik van Riel @ 2013-08-01 2:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Turner, Linux Kernel Mailing List, jmario, Peter Anvin,
	dzickus, Ingo Molnar

On Wed, 31 Jul 2013 17:41:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> However, as Rik points out, activate_mm() is different in that we
> shouldn't have any preexisting MMU state anyway. And besides, that
> should never trigger the "prev == next" case.
>
> But it does look a bit messy, and even your comment is a bit
> misleading (it might make somebody think that all of switch_mm() is
> protected from interrupts)

Is this better?  Not that I really care which version gets applied :)

---8<---
Subject: sched,x86: optimize switch_mm for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on improvements
to perf, and noticed heavy cache line contention on the mm_cpumask,
running linpack on a 60 core / 120 thread system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

 * The __schedule function dropped from 24% of all program cycles down
   to 5.5%.
 * The cacheline contention/hotness for accesses to that bitmask went
   from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
   all shared misses which is now quite cold)
 * The average load latency for the bit-test-n-set instruction in
   __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
 * The linpack program results improved from 133 GFlops to 144 GFlops.
   Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/mmu_context.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..3ac6089 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -59,7 +59,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/* On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
 			/* We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
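For readability, this is how the lazy-TLB branch reads with the patch applied, reconstructed from the hunk above plus the context lines of the first posting (the surrounding switch_mm() body is omitted):

        else {
                this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
                BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

                if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
                        /* On established mms, the mm_cpumask is only changed
                         * from irq context, from ptep_clear_flush while in
                         * lazy tlb mode, and here. Irqs are blocked during
                         * schedule, protecting us from simultaneous changes.
                         */
                        cpumask_set_cpu(cpu, mm_cpumask(next));
                        /* We were in lazy tlb mode and leave_mm disabled
                         * tlb flush IPI delivery. We must reload CR3
                         * to make sure to use no freed page tables.
                         */
                        load_cr3(next->pgd);
                        load_LDT_nolock(&next->context);
                }
        }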
* Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 2:14 ` [PATCH -v2] " Rik van Riel @ 2013-08-01 2:25 ` Linus Torvalds 2013-08-01 7:04 ` Ingo Molnar 2013-08-02 9:07 ` [tip:sched/core] sched/x86: Optimize switch_mm() " tip-bot for Rik van Riel 1 sibling, 1 reply; 20+ messages in thread From: Linus Torvalds @ 2013-08-01 2:25 UTC (permalink / raw) To: Rik van Riel Cc: Paul Turner, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel <riel@redhat.com> wrote: > > Is this better? Not that I really care which version gets applied :) I'm ok with this. That said, I'm assuming this goes in through Ingo, since it is both x86 and scheduler-related. I can take it directly, but see no real reason to. Ingo? Talking about x86-specific - I'm assuming other architectures have similar patterns? Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 2:25 ` Linus Torvalds @ 2013-08-01 7:04 ` Ingo Molnar 0 siblings, 0 replies; 20+ messages in thread From: Ingo Molnar @ 2013-08-01 7:04 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Paul Turner, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel <riel@redhat.com> wrote: > > > > Is this better? Not that I really care which version gets applied :) > > I'm ok with this. > > That said, I'm assuming this goes in through Ingo, since it is both > x86 and scheduler-related. I can take it directly, but see no real > reason to. Ingo? Yeah, I'm picking it up. Thanks, Ingo ^ permalink raw reply [flat|nested] 20+ messages in thread
* [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads
  2013-08-01  2:14 ` [PATCH -v2] " Rik van Riel
  2013-08-01  2:25   ` Linus Torvalds
@ 2013-08-02  9:07   ` tip-bot for Rik van Riel
  2013-08-02  9:12     ` Ingo Molnar
  1 sibling, 1 reply; 20+ messages in thread
From: tip-bot for Rik van Riel @ 2013-08-02 9:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, pjt, jmario, riel, tglx, dzickus

Commit-ID:  8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
Gitweb:     http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 1 Aug 2013 09:10:26 +0200

sched/x86: Optimize switch_mm() for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on
improvements to perf, and noticed heavy cache line contention
on the mm_cpumask, running linpack on a 60 core / 120 thread
system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

 * The __schedule function dropped from 24% of all program cycles down
   to 5.5%.

 * The cacheline contention/hotness for accesses to that bitmask went
   from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
   all shared misses which is now quite cold)

 * The average load latency for the bit-test-n-set instruction in
   __schedule dropped from 10k-15k cycles down to an average of 600 cycles.

 * The linpack program results improved from 133 GFlops to 144 GFlops.
   Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com
[ Made the comments consistent around the modified code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..be12c53 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -45,22 +45,28 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 
 		/* Re-load page tables */
 		load_cr3(next->pgd);
 
-		/* stop flush ipis for the previous mm */
+		/* Stop flush ipis for the previous mm */
 		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 
-		/*
-		 * load the LDT, if the LDT is different:
-		 */
+		/* Load the LDT, if the LDT is different: */
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_LDT_nolock(&next->context);
 	}
 #ifdef CONFIG_SMP
-	else {
+	else {
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
-			/* We were in lazy tlb mode and leave_mm disabled
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush() while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
* Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads 2013-08-02 9:07 ` [tip:sched/core] sched/x86: Optimize switch_mm() " tip-bot for Rik van Riel @ 2013-08-02 9:12 ` Ingo Molnar 2013-08-02 12:44 ` Joe Mario 2013-08-03 1:18 ` Greg KH 0 siblings, 2 replies; 20+ messages in thread From: Ingo Molnar @ 2013-08-02 9:12 UTC (permalink / raw) To: hpa, linux-kernel, torvalds, pjt, jmario, riel, tglx, dzickus Cc: linux-tip-commits * tip-bot for Rik van Riel <tipbot@zytor.com> wrote: > Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 > Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 > Author: Rik van Riel <riel@redhat.com> > AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400 > Committer: Ingo Molnar <mingo@kernel.org> > CommitDate: Thu, 1 Aug 2013 09:10:26 +0200 > > sched/x86: Optimize switch_mm() for multi-threaded workloads > > Dick Fowles, Don Zickus and Joe Mario have been working on > improvements to perf, and noticed heavy cache line contention > on the mm_cpumask, running linpack on a 60 core / 120 thread > system. > > The cause turned out to be unnecessary atomic accesses to the > mm_cpumask. When in lazy TLB mode, the CPU is only removed from > the mm_cpumask if there is a TLB flush event. > > Most of the time, no such TLB flush happens, and the kernel > skips the TLB reload. It can also skip the atomic memory > set & test. > > Here is a summary of Joe's test results: > > * The __schedule function dropped from 24% of all program cycles down > to 5.5%. > > * The cacheline contention/hotness for accesses to that bitmask went > from being the 1st/2nd hottest - down to the 84th hottest (0.3% of > all shared misses which is now quite cold) > > * The average load latency for the bit-test-n-set instruction in > __schedule dropped from 10k-15k cycles down to an average of 600 cycles. > > * The linpack program results improved from 133 GFlops to 144 GFlops. > Peak GFlops rose from 133 to 153. > > Reported-by: Don Zickus <dzickus@redhat.com> > Reported-by: Joe Mario <jmario@redhat.com> > Tested-by: Joe Mario <jmario@redhat.com> > Signed-off-by: Rik van Riel <riel@redhat.com> > Reviewed-by: Paul Turner <pjt@google.com> > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com > [ Made the comments consistent around the modified code. ] > Signed-off-by: Ingo Molnar <mingo@kernel.org> > + else { > this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); > > - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { > + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { > + /* > + * On established mms, the mm_cpumask is only changed > + * from irq context, from ptep_clear_flush() while in > + * lazy tlb mode, and here. Irqs are blocked during > + * schedule, protecting us from simultaneous changes. > + */ > + cpumask_set_cpu(cpu, mm_cpumask(next)); Note, I marked this for v3.12 with no -stable backport tag as it's not a regression fix. Nevertheless if it's a real issue in production (and +20% of linpack performance is certainly significant) feel free to forward it to -stable once this hits Linus's tree in the v3.12 merge window - by that time the patch will be reasonably well tested and it's a relatively simple change. Thanks, Ingo ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads 2013-08-02 9:12 ` Ingo Molnar @ 2013-08-02 12:44 ` Joe Mario 2013-08-03 1:18 ` Greg KH 1 sibling, 0 replies; 20+ messages in thread From: Joe Mario @ 2013-08-02 12:44 UTC (permalink / raw) To: Ingo Molnar Cc: hpa, linux-kernel, torvalds, pjt, riel, tglx, dzickus, linux-tip-commits, Richard Fowles On 08/02/2013 05:12 AM, Ingo Molnar wrote: > * tip-bot for Rik van Riel <tipbot@zytor.com> wrote: > >> Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 >> Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 >> Author: Rik van Riel <riel@redhat.com> >> AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400 >> Committer: Ingo Molnar <mingo@kernel.org> >> CommitDate: Thu, 1 Aug 2013 09:10:26 +0200 >> >> sched/x86: Optimize switch_mm() for multi-threaded workloads >> >> Dick Fowles, Don Zickus and Joe Mario have been working on >> improvements to perf, and noticed heavy cache line contention >> on the mm_cpumask, running linpack on a 60 core / 120 thread >> system. >> >> The cause turned out to be unnecessary atomic accesses to the >> mm_cpumask. When in lazy TLB mode, the CPU is only removed from >> the mm_cpumask if there is a TLB flush event. >> >> Most of the time, no such TLB flush happens, and the kernel >> skips the TLB reload. It can also skip the atomic memory >> set & test. >> >> Here is a summary of Joe's test results: >> >> * The __schedule function dropped from 24% of all program cycles down >> to 5.5%. >> >> * The cacheline contention/hotness for accesses to that bitmask went >> from being the 1st/2nd hottest - down to the 84th hottest (0.3% of >> all shared misses which is now quite cold) >> >> * The average load latency for the bit-test-n-set instruction in >> __schedule dropped from 10k-15k cycles down to an average of 600 cycles. >> >> * The linpack program results improved from 133 GFlops to 144 GFlops. >> Peak GFlops rose from 133 to 153. >> >> Reported-by: Don Zickus <dzickus@redhat.com> >> Reported-by: Joe Mario <jmario@redhat.com> >> Tested-by: Joe Mario <jmario@redhat.com> >> Signed-off-by: Rik van Riel <riel@redhat.com> >> Reviewed-by: Paul Turner <pjt@google.com> >> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> >> Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com >> [ Made the comments consistent around the modified code. ] >> Signed-off-by: Ingo Molnar <mingo@kernel.org> >> + else { >> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); >> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); >> >> - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { >> + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { >> + /* >> + * On established mms, the mm_cpumask is only changed >> + * from irq context, from ptep_clear_flush() while in >> + * lazy tlb mode, and here. Irqs are blocked during >> + * schedule, protecting us from simultaneous changes. >> + */ >> + cpumask_set_cpu(cpu, mm_cpumask(next)); > Note, I marked this for v3.12 with no -stable backport tag as it's not a > regression fix. > > Nevertheless if it's a real issue in production (and +20% of linpack > performance is certainly significant) The cacheline contention on the mm->cpu_vm_mask_va cpulist bitmask skyrockets on numa systems as soon as the number of threads exceeds the available cpus - and it's all from the locked bts instruction in cpumask_test_and_set_cpu(). 
The OMP version of Linpack uses clone() with the CLONE_VM flag set, so
all the clones are tugging at the same mm->cpu_vm_mask_va.  With Rik's
patch, the number of accesses to mm->cpu_vm_mask_va plummets.

> feel free to forward it to -stable
> once this hits Linus's tree in the v3.12 merge window - by that time the
> patch will be reasonably well tested and it's a relatively simple change.
>
> Thanks,
>
> 	Ingo
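A self-contained user-space toy (not from the thread) that mimics the pingpong Joe describes: N threads hammer one shared bitmask word either the old way (always a locked RMW) or the new way (plain read, set only if clear). Thread and iteration counts are arbitrary; build with 'cc -O2 -pthread' and compare 'time ./a.out' against 'time ./a.out rmw'.

        /* Toy model of the mm_cpumask cacheline contention; illustrative only. */
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define NTHREADS 8
        #define ITERS    (1 << 22)

        static atomic_ulong shared_mask;        /* stand-in for mm_cpumask */
        static int use_rmw;                     /* 1 = old behaviour, 0 = new */

        static void *worker(void *arg)
        {
                unsigned long bit = 1UL << ((long)arg % 64);

                for (long i = 0; i < ITERS; i++) {
                        if (use_rmw) {
                                /* old: every iteration is a locked RMW, so the
                                 * line bounces between cores in Modified state */
                                atomic_fetch_or(&shared_mask, bit);
                        } else if (!(atomic_load_explicit(&shared_mask,
                                        memory_order_relaxed) & bit)) {
                                /* new: the common case is a read of an
                                 * already-set bit; the line can stay Shared */
                                atomic_fetch_or(&shared_mask, bit);
                        }
                }
                return NULL;
        }

        int main(int argc, char **argv)
        {
                pthread_t tid[NTHREADS];

                (void)argv;
                use_rmw = (argc > 1);           /* any argument selects old mode */
                for (long i = 0; i < NTHREADS; i++)
                        pthread_create(&tid[i], NULL, worker, (void *)i);
                for (long i = 0; i < NTHREADS; i++)
                        pthread_join(tid[i], NULL);
                printf("mask = %#lx\n", (unsigned long)atomic_load(&shared_mask));
                return 0;
        }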
* Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads 2013-08-02 9:12 ` Ingo Molnar 2013-08-02 12:44 ` Joe Mario @ 2013-08-03 1:18 ` Greg KH 1 sibling, 0 replies; 20+ messages in thread From: Greg KH @ 2013-08-03 1:18 UTC (permalink / raw) To: Ingo Molnar Cc: hpa, linux-kernel, torvalds, pjt, jmario, riel, tglx, dzickus, linux-tip-commits On Fri, Aug 02, 2013 at 11:12:47AM +0200, Ingo Molnar wrote: > * tip-bot for Rik van Riel <tipbot@zytor.com> wrote: > > > Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 > > Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 > > Author: Rik van Riel <riel@redhat.com> > > AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400 > > Committer: Ingo Molnar <mingo@kernel.org> > > CommitDate: Thu, 1 Aug 2013 09:10:26 +0200 > > > > sched/x86: Optimize switch_mm() for multi-threaded workloads > > > > Dick Fowles, Don Zickus and Joe Mario have been working on > > improvements to perf, and noticed heavy cache line contention > > on the mm_cpumask, running linpack on a 60 core / 120 thread > > system. > > > > The cause turned out to be unnecessary atomic accesses to the > > mm_cpumask. When in lazy TLB mode, the CPU is only removed from > > the mm_cpumask if there is a TLB flush event. > > > > Most of the time, no such TLB flush happens, and the kernel > > skips the TLB reload. It can also skip the atomic memory > > set & test. > > > > Here is a summary of Joe's test results: > > > > * The __schedule function dropped from 24% of all program cycles down > > to 5.5%. > > > > * The cacheline contention/hotness for accesses to that bitmask went > > from being the 1st/2nd hottest - down to the 84th hottest (0.3% of > > all shared misses which is now quite cold) > > > > * The average load latency for the bit-test-n-set instruction in > > __schedule dropped from 10k-15k cycles down to an average of 600 cycles. > > > > * The linpack program results improved from 133 GFlops to 144 GFlops. > > Peak GFlops rose from 133 to 153. > > > > Reported-by: Don Zickus <dzickus@redhat.com> > > Reported-by: Joe Mario <jmario@redhat.com> > > Tested-by: Joe Mario <jmario@redhat.com> > > Signed-off-by: Rik van Riel <riel@redhat.com> > > Reviewed-by: Paul Turner <pjt@google.com> > > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > > Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com > > [ Made the comments consistent around the modified code. ] > > Signed-off-by: Ingo Molnar <mingo@kernel.org> > > > + else { > > this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > > BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); > > > > - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { > > + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { > > + /* > > + * On established mms, the mm_cpumask is only changed > > + * from irq context, from ptep_clear_flush() while in > > + * lazy tlb mode, and here. Irqs are blocked during > > + * schedule, protecting us from simultaneous changes. > > + */ > > + cpumask_set_cpu(cpu, mm_cpumask(next)); > > Note, I marked this for v3.12 with no -stable backport tag as it's not a > regression fix. > > Nevertheless if it's a real issue in production (and +20% of linpack > performance is certainly significant) feel free to forward it to -stable > once this hits Linus's tree in the v3.12 merge window - by that time the > patch will be reasonably well tested and it's a relatively simple change. 
I'll watch for this as well and try to remember to pick it up for
-stable once it hits Linus's tree, as those types of benchmark
improvements are good to have in stable releases.

thanks,

greg k-h
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads
  2013-07-31 23:12 ` Rik van Riel
  2013-07-31 23:14   ` Paul Turner
@ 2013-08-01 15:37   ` Jörn Engel
  2013-08-01 17:45     ` Linus Torvalds
  1 sibling, 1 reply; 20+ messages in thread
From: Jörn Engel @ 2013-08-01 15:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, jmario, Peter Anvin,
	dzickus, Ingo Molnar

On Wed, 31 July 2013 19:12:30 -0400, Rik van Riel wrote:
> On 07/31/2013 06:46 PM, Linus Torvalds wrote:
> > > On 07/31/2013 06:21 PM, Linus Torvalds wrote:

Looks like two of Linus' mails didn't make it to vger.kernel.org.
They are missing from my inbox and seem to be missing from lkml.org as
well.  I only now noticed them because Rik seemingly replied to
himself while quoting Linus.  Message-IDs in question are:
CA+55aFwj+6P4y8MgGTGbiK_EtfY7LJ_bL_7k9zFYLx4S8F0rJQ@mail.gmail.com
CA+55aFxKSEHSkdsCkTvjwwo4MnEpN0TwJrek2jd1QJCyUTb-=Q@mail.gmail.com

I find this disturbing.  Many "modern" mailers seem to be broken in
more ways than I care to enumerate.  When I tell people, including
otherwise capable programmers, that they use a broken mailer I rarely
get more than a shrug.  But if Linus is incapable of sending mail to
the kernel mailing list, we are in trouble indeed.

Linus, can you get this sorted out?

Jörn

--
There are three principal ways to lose money: wine, women, and engineers.
While the first two are more pleasant, the third is by far the more certain.
-- Baron Rothschild
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 15:37 ` [PATCH] sched,x86: optimize switch_mm " Jörn Engel @ 2013-08-01 17:45 ` Linus Torvalds 2013-08-01 17:54 ` Jörn Engel 0 siblings, 1 reply; 20+ messages in thread From: Linus Torvalds @ 2013-08-01 17:45 UTC (permalink / raw) To: Jörn Engel Cc: Rik van Riel, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On Thu, Aug 1, 2013 at 8:37 AM, Jörn Engel <joern@logfs.org> wrote: > > Looks like two of Linus' mails didn't make it to vger.kernel.org. > They are missing from my inbox and seem to be missing from lkml.org as > well. I only know noticed them because Rik seemingly replied to > himself while quoting Linus. It's normal. I occasionally write quick replies while on my phone, and the f*cking android gmail app cannot do plain-text mode. There are open bug tickets about it: https://code.google.com/p/android/issues/detail?id=8712 and I've talked in private to my google contacts too, but some ass-wipe inside the android team thinks that html-only mail is the right thing to do. The sad part is that there is absolutely no *technical* reason for it, since the gmail android app actually generates a multipart/alternative email where the first part is perfectly fine text, properly wrapped and everything. So the code to generate the email in proper format is actually there, it's just that there is no setting to say "don't generate the other parts". It's literally one added setting, and probably a few lines of code to test it and not generate the html part. So there's some internal politics going on, probably one or two internal people who are just ass-hats and are lying to their managers about how hard it is to do. Don't ask me why. Some people just want to watch the world burn. And no, don't tell me about the other mailers. I've tried many of them. They tend to be badly integrated or have their own crazy issues. I'm not willing to read email over imap any more. The one big feature of gmail is how you can search for things sanely on the server side, and despite the imap server-side searching technically being a feature, I've never met any server and MUA that does it reasonably. Gmail does, but sadly the android app has this one big glaring fault. Which is sad, because in other ways the android gmail app is actually quite good these days (that didn't use to be true, but they've fixed much bigger problems over the years). Feel free to pester the android people, *especially* if you know somebody inside the gmail app team. Call them out on the bullshit if somebody says "that's hard". Sadly, the features they *have* added seem to be about things big companies asked for, not trivial technical improvements. Linus ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads 2013-08-01 17:45 ` Linus Torvalds @ 2013-08-01 17:54 ` Jörn Engel 0 siblings, 0 replies; 20+ messages in thread From: Jörn Engel @ 2013-08-01 17:54 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Linux Kernel Mailing List, jmario, Peter Anvin, dzickus, Ingo Molnar On Thu, 1 August 2013 10:45:15 -0700, Linus Torvalds wrote: > On Thu, Aug 1, 2013 at 8:37 AM, Jörn Engel <joern@logfs.org> wrote: > > > > Looks like two of Linus' mails didn't make it to vger.kernel.org. > > They are missing from my inbox and seem to be missing from lkml.org as > > well. I only know noticed them because Rik seemingly replied to > > himself while quoting Linus. > > It's normal. I agree with everything in your email except this bit. Even though technically it is normal now that all email clients are broken in impressive ways, your words imply that this might even remotely be an acceptable state of the world. It is not. Four-letter words directed at Google (and Apple, and...) seem more appropriate. Jörn -- "Security vulnerabilities are here to stay." -- Scott Culp, Manager of the Microsoft Security Response Center, 2001 ^ permalink raw reply [flat|nested] 20+ messages in thread