Subject: Re: Panic when cpu hot-remove
To: Jiang Liu, Alex Williamson, Joerg Roedel
References: <42BB8332972FC149B81C55A0D41E3A79C07469@jtjnmailbox06.home.langchao.com> <20150617115238.GC27750@8bytes.org> <1434551800.5628.5.camel@redhat.com> <558259BD.7080402@linux.intel.com> <558272E3.4000504@inspur.com> <55827927.4080504@inspur.com> <558BB7B8.7000402@linux.intel.com>
CC: 刘长生, iommu, "jiang.liu@intel.com", linux-kernel, 闫晓峰, Roland Dreier
From: fandongdong
Message-ID: <558BDC0D.2000206@inspur.com>
Date: Thu, 25 Jun 2015 18:46:37 +0800
In-Reply-To: <558BB7B8.7000402@linux.intel.com>

On 2015/6/25 16:11, Jiang Liu wrote:
> On 2015/6/18 15:54, fandongdong wrote:
>>
>> On 2015/6/18 15:27, fandongdong wrote:
>>>
>>> On 2015/6/18 13:40, Jiang Liu wrote:
>>>> On 2015/6/17 22:36, Alex Williamson wrote:
>>>>> On Wed, 2015-06-17 at 13:52 +0200, Joerg Roedel wrote:
>>>>>> On Wed, Jun 17, 2015 at 10:42:49AM +0000, 范冬冬 wrote:
>>>>>>> Hi maintainer,
>>>>>>>
>>>>>>> We found a problem: a panic happens when a CPU is hot-removed.
>>>>>>> We also traced the problem using the call trace information.
>>>>>>> An endless loop occurs in the function qi_check_fault() because
>>>>>>> head never becomes equal to tail.
>>>>>>> The code in question is as follows:
>>>>>>>
>>>>>>>
>>>>>>> do {
>>>>>>>         if (qi->desc_status[head] == QI_IN_USE)
>>>>>>>                 qi->desc_status[head] = QI_ABORT;
>>>>>>>         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>>>> } while (head != tail);
>>>>>> Hmm, this code iterates only over every second QI descriptor, and
>>>>>> tail probably points to a descriptor that is not iterated over.
>>>>>>
>>>>>> Jiang, can you please have a look?
>>>>> I think that part is normal, the way we use the queue is to always
>>>>> submit a work operation followed by a wait operation so that we can
>>>>> determine the work operation is complete. That's done via
>>>>> qi_submit_sync(). We have had spurious reports of the queue getting
>>>>> impossibly out of sync though. I saw one that was somehow linked to
>>>>> the I/O AT DMA engine. Roland Dreier saw something similar[1]. I'm
>>>>> not sure if they're related to this, but maybe worth comparing.
>>>>> Thanks,
>>>> Thanks, Alex and Joerg!
>>>>
>>>> Hi Dongdong,
>>>> Could you please give some instructions on how to reproduce this
>>>> issue? I will try to reproduce it if possible.
>>>> Thanks!
>>>> Gerry
>>> Hi Gerry,
>>>
>>> We're running kernel 4.1.0 on a 4-socket system and we want to
>>> offline socket 1.
>>> Steps as follows:
>>>
>>> echo 1 > /sys/firmware/acpi/hotplug/force_remove
>>> echo 1 > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:01/eject
> Hi Dongdong,
> I failed to reproduce this issue on my side. Could you please help
> to confirm?
> 1) Is this issue reproducible on your side?
Yes.
> 2) Does this issue happen if you disable the irqbalance service on
> your system?
Yes.
> 3) Has the corresponding PCI host bridge been removed before removing
> the socket?
No, we will try to remove it before removing the socket later.

Thanks for your help, Gerry.
>
> From the log message, we only noticed log messages for CPU and memory,
> but not messages for PCI (IOMMU) devices.
> And this log message
> "[ 149.976493] acpi ACPI0004:01: Still not present"
> implies that the socket has been powered off during the ejection.
> So the story may be that you powered off the socket while the host
> bridge on the socket is still in use.
> Thanks!
> Gerry