From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Chen, Kenneth W"
Date: Tue, 14 Mar 2006 19:33:53 +0000
Subject: RE: accessed/dirty bit handler tuning
Message-Id: <200603141933.k2EJXrg05935@unix-os.sc.intel.com>
List-Id:
References: <44157CF1.5060902@bull.net>
In-Reply-To: <44157CF1.5060902@bull.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

Zoltan Menyhart wrote on Tuesday, March 14, 2006 2:13 AM
> Yet in my sequence:
>
> (p6)    cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv
> (p6)    itc.d r25
>         ;;
> (p6)    srlz.d
>
> the execution of "cmpxchg" (that is not a quick & simple instruction)
> partially overlaps that of "itc" (this latter has got an acquire
> semantics, it does not depend on the completion of the former).

This is indeed a very fine work of art in micro-optimization. Thank you
for pointing this out. I think this is going to save us a lot of cycles.

> If it is the page walker that inserts the new translation, then it has
> to observe the purge requirements, too:
> E.g. in case of page size of 64 K, up to 16 L1 DTLB entries may be
> purged and all the L1D cache lines brought in via these translations
> need to be invalidated.

There is no need to worry about performance in the slow path. The slow
path is meant to take whatever effort is needed to fix up a detected
race condition, so let it be a couple of cycles longer.

> I'd expect (sure, not knowing exactly how the HW works :-)) up to:
>
>         16      max. number of L1 DTLB entries used for a page
>       * 32      L1D cache is indexed as 0...31
>       ----
>        512
>
> cycles only for purging and invalidating the old stuff.

The hardware is a lot smarter than you think :-) Come on, we are
talking about an Itanium processor here. Please have some faith in the
hardware designers.

- Ken
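
For reference, the worst-case estimate quoted above can be reproduced with
a few lines of C. This is only a sketch of the arithmetic in the quoted
text, not a model of the hardware. It assumes that each L1 DTLB entry maps
4 KB (which is what would make a 64 KB page occupy 16 entries) and that
every entry/index pair costs one cycle. The names L1_DTLB_ENTRY_BYTES,
L1D_INDEX_COUNT, l1_dtlb_entries_for_page and naive_purge_cycles are
hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one L1 DTLB entry maps 4 KB, so a 64 KB page needs 16. */
#define L1_DTLB_ENTRY_BYTES 4096u
/* From the quoted estimate: the L1D cache is indexed as 0...31. */
#define L1D_INDEX_COUNT 32u

/* Number of L1 DTLB entries a page of the given size can occupy. */
unsigned l1_dtlb_entries_for_page(size_t page_bytes)
{
    /* The sketch only handles page sizes that are multiples of 4 KB. */
    assert(page_bytes >= L1_DTLB_ENTRY_BYTES);
    assert(page_bytes % L1_DTLB_ENTRY_BYTES == 0);
    return (unsigned)(page_bytes / L1_DTLB_ENTRY_BYTES);
}

/*
 * Naive upper bound from the quoted text: one cycle per
 * (DTLB entry, L1D index) pair. Real hardware need not behave this way.
 */
unsigned naive_purge_cycles(size_t page_bytes)
{
    return l1_dtlb_entries_for_page(page_bytes) * L1D_INDEX_COUNT;
}
```

Calling naive_purge_cycles(64 * 1024) gives the 16 * 32 product from the
quoted estimate. The reply's point is that this figure is a pessimistic
bound for the slow path, and the actual cost depends on how the hardware
implements the purge.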