From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Subject: [1/2] ras: fix an off-by-one error in __find_elem() From: "Luck, Tony" Message-Id: <20190417211549.GA17200@agluck-desk> Date: Wed, 17 Apr 2019 14:15:49 -0700 To: Cong Wang Cc: Borislav Petkov , LKML , linux-edac@vger.kernel.org, Thomas Gleixner List-ID: T24gVHVlLCBBcHIgMTYsIDIwMTkgYXQgMDc6Mzc6NDFQTSAtMDcwMCwgQ29uZyBXYW5nIHdyb3Rl Ogo+IE9uIFR1ZSwgQXByIDE2LCAyMDE5IGF0IDc6MzEgUE0gQ29uZyBXYW5nIDx4aXlvdS53YW5n Y29uZ0BnbWFpbC5jb20+IHdyb3RlOgo+ID4gWWVzIGl0IGlzLCBJIGhhdmUgYSBzdGFja3RyYWNl IGluIHByb2R1Y3Rpb24gd2hpY2ggY2xlYXJseSBzaG93cwo+ID4gZGVsX2VsZW0uaXNyYS4xKzB4 MzQvMHg0MCwgdW5saWtlIHRoZSBvbmUgSSB0cmlnZ2VyZWQgdmlhIGZha2UKPiA+IFBGTidzLiBJ IGNhbiBzaG93IHlvdSBpZiB5b3Ugd2FudCwgaXQgaXMgb24gNC4xNCwgc28gdmVyeSB1bmxpa2Vs eQo+ID4gaXQgaXMgaW50ZXJlc3RpbmcgdG8gYW55b25lIGhlcmUuCj4gCj4gQ29ycmVjdCBteXNl bGY6Cj4gCj4gVGhlIG9uZSB0cmlnZ2VyZWQgdmlhIGZha2UgUEZOJ3MgYWxzbyBzaG93cwo+IGRl bF9lbGVtLmNvbnN0cHJvcC4xKzB4MzkvMHg0MC4KPiAKPiBTb3JyeSBmb3IgdGhlIG1pc3Rha2Us IEkga2luZGEgcmVhZCBpbnRvIGFub3RoZXIgY3Jhc2guLi4KCk9rLiBOb3cgSSBmb2xsb3cgd2hh dCB5b3UgYXJlIHNheWluZy4gSSdsbCB0cnkgdG8gcmVwaHJhc2UKaGVyZS4KCjEpIEluaXRpYWwg c2V0dXAgaXMgdGhhdCB3ZSBmaWxsZWQgdGhlIGFycmF5LiBTZXQgdGhlIHRocmVzaG9sZAp0byBh IGxvdyB2YWx1ZSwgYW5kIHRoZW4gcmVwZWF0ZWRseSBhZGRlZCBhIG5ldyBwZm4gKHdpdGggYQp2 YWx1ZSBoaWdoZXIgdGhhbiBhbnkgaW4gdGhlIGFycmF5KS4KCjIpIEZpcnN0IHRpbWU6CldlIGZp bmQgdGhlIGFycmF5IGlzIGZ1bGwsIHNvIGRlbGV0ZSBvbmUgaXRlbS4gQWRkIHRoZSBuZXcgcGZu CnRvIHRoZSBlbmQgKGluIHNsb3QgTUFYX0VMRU0tMSkKCjMpIE5leHQgdGltZS4gQXJyYXkgaXMg ZnVsbC4gRGVsZXRlIG9uZSBpdGVtIChhc3N1bWUgbm90IHRoZSBvbmUKZm9yIG91ciB0YXJnZXQg cGZuKS4gVGhpcyBtb3ZlZCBvdXIgdGFyZ2V0IHRvIHNsb3QgTUFYX0VMRU0tMi4KCjQpIE5leHQg dGltZS4gQXJyYXkgaXNuJ3QgZnVsbC4gV2UgZ2V0IGEgaGl0IG9uIG91ciBwZm4uIEJ1bXAKdGhl IGNvdW50LiBGaW5kIHRoYXQgaXQgZXhjZWVkZWQgdGhlIHRocmVzaG9sZCwgc28gYXNrIHRvCm9m ZmxpbmUgaXQuIFdlIGRlbGV0ZSB0aGlzIGZyb20gdGhlIGFycmF5LiBJdCBpcyBhIHRhaWwKZGVs ZXRpb24gc28gd2UganVzdCBkZWNyZW1lbnQgY2EtPm4uCgo1KSBGaW5hbCB0aW1lLiBBcnJheSBp c24ndCBmdWxsLiBXaGVuIHdlIGxvb2sgZm9yIG91ciBwZm4gaW4KdGhlIGFycmF5IGl0IGlzbid0 IGluIHRoZSBhcnJheVswLGNhLT5uKSByYW5nZS4gU28gd2Ugb3VnaHQKdG8gcmV0dXJuIEVOT0tF WS4gQnV0IHdlIGRvbid0LiBXaGVuIHdlIGZhbGwgb3V0IG9mIHRoZQpzZWFyY2ggbG9vcDoKCjEz NiAgICAgICAgIHdoaWxlIChtaW4gPCBtYXgpIHsKMTM3ICAgICAgICAgICAgICAgICBpbnQgdG1w ID0gKG1heCArIG1pbikgPj4gMTsKMTM4CjEzOSAgICAgICAgICAgICAgICAgdGhpc19wZm4gPSBQ Rk4oY2EtPmFycmF5W3RtcF0pOwoxNDAKMTQxICAgICAgICAgICAgICAgICBpZiAodGhpc19wZm4g PCBwZm4pCjE0MiAgICAgICAgICAgICAgICAgICAgICAgICBtaW4gPSB0bXAgKyAxOwoxNDMgICAg ICAgICAgICAgICAgIGVsc2UgaWYgKHRoaXNfcGZuID4gcGZuKQoxNDQgICAgICAgICAgICAgICAg ICAgICAgICAgbWF4ID0gdG1wOwoxNDUgICAgICAgICAgICAgICAgIGVsc2UgewoxNDYgICAgICAg ICAgICAgICAgICAgICAgICAgbWluID0gdG1wOwoxNDcgICAgICAgICAgICAgICAgICAgICAgICAg YnJlYWs7CjE0OCAgICAgICAgICAgICAgICAgfQoxNDkgICAgICAgICB9CgpXZSBleGl0IGJlY2F1 c2UgIm1pbiIgaXMgbm90IGxlc3MgdGhhbiAibWF4Ii4gV2UgYXNzaWduCiIqdG8iIHdpdGggdGhl IGluZGV4IG9mIHRoZSBhdmFpbGFibGUgc2xvdC4KCjE1MSAgICAgICAgIGlmICh0bykKMTUyICAg ICAgICAgICAgICAgICAqdG8gPSBtaW47CgpUaGlzIGxpbmUgaXMganVzdCB3cm9uZy4gIm1pbiIg aXMgc3RpbGwgd2l0aGluIHRoZSBhbGxvY2F0ZWQKYm91bmRzIG9mIHRoZSBhcnJheSwgYnV0IHRo ZSB2YWx1ZSBpcyAiY2EtPm4iLCBpLmUuIHRoZSBmaXJzdAp1bnVzZWQgc2xvdC4gVGhlICJnYXJi YWdlIiB2YWx1ZSB3ZSBwaWNrIHVwIGZyb20gaGVyZSBpcyB3aGF0CndhcyBsYXN0IGFzc2lnbmVk IHRvIHRoZSBzbG90IHdoZW4gaXQgd2FzIHZhbGlkLiBCZWNhdXNlIHdlIGhhdmUKYmVlbiBhZGRp bmcgdGhlIHNhbWUgcGZuIG92ZXIgYW5kIG92ZXIsIHRoZSB2YWx1ZSB3ZSBwaWNrIHVwIGlzCnRo ZSBwZm4gdGhhdCB3ZSAiZGVsZXRlZCIgaW4gc3RlcCAiNCIgYWJvdmUuIEJ1dCByZW1lbWJlciB3 ZQpkZWxldGVkIGl0IGp1c3QgYnkgYWRqdXN0aW5nIGNhLT5uLCBub3QgYnkgb3ZlcndyaXRpbmcg dGhpcwplbnRyeS4KCjE1NCAgICAgICAgIHRoaXNfcGZuID0gUEZOKGNhLT5hcnJheVttaW5dKTsK CgpTbyBub3cgdGhpcyB0ZXN0IHN1Y2NlZWRzLCB3aGVuIGl0IHNob3VsZCBmYWlsLiBXZSByZXR1 cm4KYW4gaW5kZXggb2YgImNhLT5uIi4KMTU2ICAgICAgICAgaWYgKHRoaXNfcGZuID09IHBmbikK MTU3ICAgICAgICAgICAgICAgICByZXR1cm4gbWluOwoxNTgKMTU5ICAgICAgICAgcmV0dXJuIC1F Tk9LRVk7CgpUaGluZ3MgZ28gZG93bmhpbGwgZmFzdCBmcm9tIGhlcmUuIFdlIHVwZGF0ZSB0aGUg Y291bnQgZm9yCnRoaXMgZW50cnkuIEZpbmQgaXQgaXMgYWJvdmUgdGhlIHRocmVzaG9sZCwgc28g b2ZmbGluZSB0aGUKcGFnZSBhbmQgdHJ5IHRvIGRlbGV0ZS4KCjIyNiBzdGF0aWMgdm9pZCBkZWxf ZWxlbShzdHJ1Y3QgY2VfYXJyYXkgKmNhLCBpbnQgaWR4KQoyMjcgewoyMjggICAgICAgICAvKiBT YXZlIHVzIGEgZnVuY3Rpb24gY2FsbCB3aGVuIGRlbGV0aW5nIHRoZSBsYXN0IGVsZW1lbnQuICov CgppZHggaXMgZXF1YWwgdG8gY2EtPm4uIFNvIHRoaXMgaXMgImlmICgtMSkiIC4uLiBzbyB3ZSBk ZWNpZGUKd2UgbmVlZCB0byBtZW1tb3ZlIHNvbWUgZW50cmllcy4gIFRoZSBhcnJheSBpbmRpY2Vz ICJpZHgiIGFuZCAiaWR4KzEiCmFyZSBub3Qgb3V0IG9mIGJvdW5kcyBmb3IgdGhlIGFsbG9jYXRl ZCBzaXplLCBidXQgdGhleSBhcmVuJ3Qgd2hhdCB3ZQp3YW50LiBUaGUgcmVhbCBwcm9ibGVtIGlz IHRoZSBjb3VudCBwYXNzZWQgdG8gbWVtbW92ZS4KCjIyOSAgICAgICAgIGlmIChjYS0+biAtIChp ZHggKyAxKSkKMjMwICAgICAgICAgICAgICAgICBtZW1tb3ZlKCh2b2lkICopJmNhLT5hcnJheVtp ZHhdLAoyMzEgICAgICAgICAgICAgICAgICAgICAgICAgKHZvaWQgKikmY2EtPmFycmF5W2lkeCAr IDFdLAoyMzIgICAgICAgICAgICAgICAgICAgICAgICAgKGNhLT5uIC0gKGlkeCArIDEpKSAqIHNp emVvZih1NjQpKTsKMjMzCjIzNCAgICAgICAgIGNhLT5uLS07CjIzNSB9CgoKU28gSSdtIG5vdyBs aWtpbmcgdGhpcyB2ZXJzaW9uIG9mIHRoZSBwYXRjaCBieSBDb25nOgoKCgpSZXZpZXdlZC1ieTog VG9ueSBMdWNrIDx0b255Lmx1Y2tAaW50ZWwuY29tPgoKZGlmZiAtLWdpdCBhL2RyaXZlcnMvcmFz L2NlYy5jIGIvZHJpdmVycy9yYXMvY2VjLmMKaW5kZXggMmQ5ZWMzNzhhOGJjLi5iN2NiMGI5YjEz NDYgMTAwNjQ0Ci0tLSBhL2RyaXZlcnMvcmFzL2NlYy5jCisrKyBiL2RyaXZlcnMvcmFzL2NlYy5j CkBAIC0yMDQsMTAgKzIwNCwxMiBAQCBzdGF0aWMgaW50IF9fZmluZF9lbGVtKHN0cnVjdCBjZV9h cnJheSAqY2EsIHU2NCBwZm4sIHVuc2lnbmVkIGludCAqdG8pCiAJaWYgKHRvKQogCQkqdG8gPSBt aW47CiAKLQl0aGlzX3BmbiA9IFBGTihjYS0+YXJyYXlbbWluXSk7CisJaWYgKG1pbiA8IGNhLT5u KSB7CisJCXRoaXNfcGZuID0gUEZOKGNhLT5hcnJheVttaW5dKTsKIAotCWlmICh0aGlzX3BmbiA9 PSBwZm4pCi0JCXJldHVybiBtaW47CisJCWlmICh0aGlzX3BmbiA9PSBwZm4pCisJCQlyZXR1cm4g bWluOworCX0KIAogCXJldHVybiAtRU5PS0VZOwogfQo= From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 20630C282DA for ; Wed, 17 Apr 2019 21:15:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E31FC217FA for ; Wed, 17 Apr 2019 21:15:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733039AbfDQVPv (ORCPT ); Wed, 17 Apr 2019 17:15:51 -0400 Received: from mga07.intel.com ([134.134.136.100]:5952 "EHLO mga07.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727013AbfDQVPv (ORCPT ); Wed, 17 Apr 2019 17:15:51 -0400 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 17 Apr 2019 14:15:50 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.60,363,1549958400"; d="scan'208";a="162819828" Received: from agluck-desk.sc.intel.com (HELO agluck-desk) ([10.3.52.160]) by fmsmga004.fm.intel.com with ESMTP; 17 Apr 2019 14:15:49 -0700 Date: Wed, 17 Apr 2019 14:15:49 -0700 From: "Luck, Tony" To: Cong Wang Cc: Borislav Petkov , LKML , linux-edac@vger.kernel.org, Thomas Gleixner Subject: Re: [PATCH 1/2] ras: fix an off-by-one error in __find_elem() Message-ID: <20190417211549.GA17200@agluck-desk> References: <20190416012001.5338-1-xiyou.wangcong@gmail.com> <20190416090726.GD27892@zn.tnic> <20190416221852.GA10781@agluck-desk> <20190416232833.GA17372@agluck-desk> <20190417015351.GA28490@agluck-desk> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-edac-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org Message-ID: <20190417211549.Jtgvf2DP0QpxmmUtw4by3x7rM0op1q6qvbvzGVIQNlg@z> On Tue, Apr 16, 2019 at 07:37:41PM -0700, Cong Wang wrote: > On Tue, Apr 16, 2019 at 7:31 PM Cong Wang wrote: > > Yes it is, I have a stacktrace in production which clearly shows > > del_elem.isra.1+0x34/0x40, unlike the one I triggered via fake > > PFN's. I can show you if you want, it is on 4.14, so very unlikely > > it is interesting to anyone here. > > Correct myself: > > The one triggered via fake PFN's also shows > del_elem.constprop.1+0x39/0x40. > > Sorry for the mistake, I kinda read into another crash... Ok. Now I follow what you are saying. I'll try to rephrase here. 1) Initial setup is that we filled the array. Set the threshold to a low value, and then repeatedly added a new pfn (with a value higher than any in the array). 2) First time: We find the array is full, so delete one item. Add the new pfn to the end (in slot MAX_ELEM-1) 3) Next time. Array is full. Delete one item (assume not the one for our target pfn). This moved our target to slot MAX_ELEM-2. 4) Next time. Array isn't full. We get a hit on our pfn. Bump the count. Find that it exceeded the threshold, so ask to offline it. We delete this from the array. It is a tail deletion so we just decrement ca->n. 5) Final time. Array isn't full. When we look for our pfn in the array it isn't in the array[0,ca->n) range. So we ought to return ENOKEY. But we don't. When we fall out of the search loop: 136 while (min < max) { 137 int tmp = (max + min) >> 1; 138 139 this_pfn = PFN(ca->array[tmp]); 140 141 if (this_pfn < pfn) 142 min = tmp + 1; 143 else if (this_pfn > pfn) 144 max = tmp; 145 else { 146 min = tmp; 147 break; 148 } 149 } We exit because "min" is not less than "max". We assign "*to" with the index of the available slot. 151 if (to) 152 *to = min; This line is just wrong. "min" is still within the allocated bounds of the array, but the value is "ca->n", i.e. the first unused slot. The "garbage" value we pick up from here is what was last assigned to the slot when it was valid. Because we have been adding the same pfn over and over, the value we pick up is the pfn that we "deleted" in step "4" above. But remember we deleted it just by adjusting ca->n, not by overwriting this entry. 154 this_pfn = PFN(ca->array[min]); So now this test succeeds, when it should fail. We return an index of "ca->n". 156 if (this_pfn == pfn) 157 return min; 158 159 return -ENOKEY; Things go downhill fast from here. We update the count for this entry. Find it is above the threshold, so offline the page and try to delete. 226 static void del_elem(struct ce_array *ca, int idx) 227 { 228 /* Save us a function call when deleting the last element. */ idx is equal to ca->n. So this is "if (-1)" ... so we decide we need to memmove some entries. The array indices "idx" and "idx+1" are not out of bounds for the allocated size, but they aren't what we want. The real problem is the count passed to memmove. 229 if (ca->n - (idx + 1)) 230 memmove((void *)&ca->array[idx], 231 (void *)&ca->array[idx + 1], 232 (ca->n - (idx + 1)) * sizeof(u64)); 233 234 ca->n--; 235 } So I'm now liking this version of the patch by Cong: diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c index 2d9ec378a8bc..b7cb0b9b1346 100644 --- a/drivers/ras/cec.c +++ b/drivers/ras/cec.c @@ -204,10 +204,12 @@ static int __find_elem(struct ce_array *ca, u64 pfn, unsigned int *to) if (to) *to = min; - this_pfn = PFN(ca->array[min]); + if (min < ca->n) { + this_pfn = PFN(ca->array[min]); - if (this_pfn == pfn) - return min; + if (this_pfn == pfn) + return min; + } return -ENOKEY; } Reviewed-by: Tony Luck