Subject: Re: [REGRESSION] 998ef75ddb and aio-dio-invalidate-failure w/ data=journal
From: Dave Hansen
To: Linus Torvalds, Peter Anvin
Cc: "Theodore Ts'o", Andrew Morton, "linux-ext4@vger.kernel.org", Linux Kernel Mailing List
Date: Mon, 5 Oct 2015 13:48:59 -0700
Message-ID: <5612E23B.7070606@linux.intel.com>
References: <20151005152236.GA8140@thunk.org> <5612A3F3.2040609@linux.intel.com>

On 10/05/2015 01:22 PM, Linus Torvalds wrote:
> On Mon, Oct 5, 2015 at 5:23 PM, Dave Hansen wrote:
>> One thing I've been noticing on Skylake is that barriers (implicit and
>> explicit) are showing up more in profiles.
>
> Ahh, you're on skylake?

Yup.

> It's entirely possible that the issue is that the whole
> "stac/mov/clac" is much more expensive because skylake actually ends
> up supporting those AC instructions.

That would make sense.

> We could probably do them outside the loop, rather than tightly around
> the actual move instructions. Peter (hpa), is there some sane
> interface to try to do that?

iov_iter_fault_in_readable() just touches a single word in the page so
that it gets faulted in, or a pair of words if the range happens to
cross a page boundary (which isn't happening here).
I'm not sure there's a loop to move them out of here (for the
prefaulting part).  We could theoretically expand the stac/clac pair to
cover both __get_user()s in fault_in_pages_readable(), but that would
only help the case where we cross a page boundary.

Although I was probably wrong about the source of the overhead, the
point still stands: the prefaulting is eating cycles for no practical
benefit.

>> What we're seeing here
>> probably isn't actually stac/clac overhead, but the cost of finishing
>> some other operations that are outstanding before we can proceed through
>> here.
>
> I suspect it actually _is_ stac/clac overhead. It might well be that
> clac/stac ends up serializing loads some way. Last I heard, they were
> reasonably cheap but certainly not free - and when we're talking about
> something that just loops over bringing the line into cache, it might
> be relatively expensive.
>
> How did you do the profile? Use "-e cycles:pp" to get the precise
> profile information, which should actually attribute the cost to the
> instruction that really causes it.

It reduced the skid a bit.

Plain (no "-e"):

>       │  stac
> 24.57 │  mov    (%rcx),%sil
> 15.70 │  clac
> 28.77 │  test   %eax,%eax
>  2.15 │  mov    %sil,-0x1(%rbp)
>  8.93 │  ↓ jne  66
>  2.31 │  movslq %edx,%rdx

With "-e cycles:pp":

>       │  sub    $0x8,%rsp
> 24.57 │  stac
> 15.49 │  mov    (%rcx),%sil
> 29.06 │  clac
>  2.24 │  test   %eax,%eax
>  8.77 │  mov    %sil,-0x1(%rbp)
>  2.22 │  ↓ jne  66
>       │  movslq %edx,%rdx
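For reference, the two annotations above come from a workflow like the
following (a sketch assuming Linux perf is installed and perf_event
access is permitted; the ":pp" modifier requests precise sampling,
PEBS on Intel, which reduces "skid" -- samples being attributed a few
instructions after the one that actually burned the cycles):

```shell
# Skip cleanly on boxes where this can't run.
if ! command -v perf >/dev/null 2>&1; then
    echo "skipped: perf not installed"
    exit 0
fi

# Default, skid-prone sampling:
perf record -o plain.data -- sleep 1 2>/dev/null || \
    { echo "skipped: no perf_event access"; exit 0; }

# Precise sampling, attributed to the causing instruction:
perf record -o precise.data -e cycles:pp -- sleep 1 2>/dev/null

# Then compare per-instruction annotations of the hot symbol, e.g.:
#   perf annotate -i plain.data   --stdio <symbol>
#   perf annotate -i precise.data --stdio <symbol>
echo "recorded"
```

(The recorded workload and the annotated symbol are illustrative, not
the exact reproducer from this thread.)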