From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jiang Liu
Organization: Intel
Date: Thu, 25 Jun 2015 16:11:36 +0800
Message-ID: <558BB7B8.7000402@linux.intel.com>
To: fandongdong, Alex Williamson, Joerg Roedel
Cc: Roland Dreier, =?UTF-8?B?6Zer5pmT5bOw?=, "jiang.liu@intel.com",
 linux-kernel, =?UTF-8?B?5YiY6ZW/55Sf?=, iommu
List-Id: iommu@lists.linux-foundation.org
MIME-Version: 1.0
Subject: Re: Panic when cpu
 hot-remove
References: <42BB8332972FC149B81C55A0D41E3A79C07469@jtjnmailbox06.home.langchao.com>
 <20150617115238.GC27750@8bytes.org>
 <1434551800.5628.5.camel@redhat.com>
 <558259BD.7080402@linux.intel.com>
 <558272E3.4000504@inspur.com>
 <55827927.4080504@inspur.com>
In-Reply-To: <55827927.4080504@inspur.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

On 2015/6/18 15:54, fandongdong wrote:
>
>
> On 2015/6/18 15:27, fandongdong wrote:
>>
>>
>> On 2015/6/18 13:40, Jiang Liu wrote:
>>> On 2015/6/17 22:36, Alex Williamson wrote:
>>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>>> Hi maintainer,
>>>>>>
>>>>>> We found a problem: a panic happens when a CPU is hot-removed.
>>>>>> We traced the problem using the call trace information.
>>>>>> An endless loop occurs because the value of head never becomes
>>>>>> equal to the value of tail in the function qi_check_fault().
>>>>>> The code in question is as follows:
>>>>>>
>>>>>>
>>>>>> do {
>>>>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>>>>                 qi->desc_status[head] = QI_ABORT;
>>>>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>>> } while (head != tail);
>>>>> Hmm, this code iterates only over every second QI descriptor, and
>>>>> tail probably points to a descriptor that is not iterated over.
>>>>>
>>>>> Jiang, can you please have a look?
>>>> I think that part is normal, the way we use the queue is to always
>>>> submit a work operation followed by a wait operation so that we can
>>>> determine the work operation is complete.  That's done via
>>>> qi_submit_sync().  We have had spurious reports of the queue getting
>>>> impossibly out of sync though.  I saw one that was somehow linked to
>>>> the I/O AT DMA engine.  Roland Dreier saw something similar[1].  I'm
>>>> not sure if they're related to this, but maybe worth comparing.
>>>> Thanks,
>>> Thanks, Alex and Joerg!
>>>
>>> Hi Dongdong,
>>>     Could you please give some instructions on how to
>>> reproduce this issue? I will try to reproduce it if possible.
>>> Thanks!
>>> Gerry
>> Hi Gerry,
>>
>> We're running kernel 4.1.0 on a 4-socket system and we want to
>> offline socket 1.
>> The steps are as follows:
>>
>> echo 1 > /sys/firmware/acpi/hotplug/force_remove
>> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
Hi Dongdong,

	I failed to reproduce this issue on my side. Could you please help
to confirm the following?
1) Is this issue reproducible on your side?
2) Does this issue happen if you disable the irqbalance service on your
   system?
3) Has the corresponding PCI host bridge been removed before removing
   the socket?

Judging from the log, we only noticed messages for CPU and memory,
but no messages for PCI (IOMMU) devices. And this log message
	"[ 149.976493] acpi ACPI0004:01: Still not present"
implies that the socket has been powered off during the ejection.
So the story may be that you powered off the socket while the host
bridge on the socket was still in use.
Thanks!
Gerry
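
P.S. For anyone comparing notes on the quoted loop: the walk steps head
back by two slots at a time, so it can only meet tail when both indexes
have the same parity. Below is a small user-space sketch of that
behaviour, not the kernel code itself. The helper name qi_walk_steps()
is made up for illustration, and the value 256 for QI_LENGTH is an
assumption here (any even queue length behaves the same way).

```c
#include <assert.h>

/* Assumption: an even queue length; 256 is used only for illustration. */
#define QI_LENGTH 256

/*
 * Hypothetical helper: replay the backward walk from the quoted loop,
 * stepping head back by two slots until it meets tail.
 *
 * Returns the number of steps taken, or -1 if head has not met tail
 * within QI_LENGTH steps. By then the walk has already revisited its
 * starting slot, so it would cycle forever.
 */
int qi_walk_steps(int head, int tail)
{
    int steps = 0;

    assert(head >= 0 && head < QI_LENGTH);
    assert(tail >= 0 && tail < QI_LENGTH);

    do {
        /* Same index update as in the quoted loop. */
        head = (head - 2 + QI_LENGTH) % QI_LENGTH;
        steps++;
        if (steps > QI_LENGTH)
            return -1;  /* parity mismatch: head can never reach tail */
    } while (head != tail);

    return steps;
}
```

With matching parity, for example qi_walk_steps(5, 1), the walk ends
after two steps. With mismatched parity, for example
qi_walk_steps(255, 254), the helper gives up and returns -1, which is
the case where the real loop would spin.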