Subject: Re: [PATCH] powerpc/rtas: Fix hang in race against concurrent cpu offline
To: Nathan Lynch
From: Juliet Kim
Date: Wed, 26 Jun 2019 16:49:04 -0500
In-Reply-To: <87a7e5tvyb.fsf@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org, mmc@linux.vnet.ibm.com, mwb@linux.ibm.com

On 6/25/19 12:29 PM, Nathan Lynch wrote:
> Juliet Kim writes:
>> The commit
>> (“powerpc/rtas: Fix a potential race between CPU-Offline & Migration)
>> attempted to fix a hang in Live Partition Mobility(LPM) by abandoning
>> the LPM attempt if a race between LPM and concurrent CPU offline was
>> detected.
>>
>> However, that fix failed to notify Hypervisor that the LPM attempted
>> had been abandoned which results in a system hang.
> It is surprising to me that leaving a migration unterminated would cause
> Linux to hang. Can you explain more about how that happens?
>
PHYP blocks further requests (the next partition migration, DLPAR, etc.)
while it is in the suspending state. That has a follow-on effect on the
HMC, and potentially on this and other partitions.

>> Fix this by sending a signal PHYP to cancel the migration, so that PHYP
>> can stop waiting, and clean up the migration.
> This is well-spotted and rtas_ibm_suspend_me() needs to signal
> cancellation in several error paths. But I don't agree that this is one
> of them: this race is going to be a temporary condition in any
> production setting, and retrying would allow the migration to succeed.

If LPM and CPU offline requests conflict with one another, it might be
better to let them fail and let the customer decide which one they prefer.
IBM i cancels the migration if other OS components or operations veto it,
so cancelling here would be consistent with how other operating systems
behave during LPM. I think all IBM products should offer a consistent
customer experience.

Even if the race is temporary, it can still happen, and retrying in that
case can cause a livelock.
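The disagreement above is about the failure path when LPM races with a CPU offline. The following C sketch is a toy model under stated assumptions, not the patch and not a kernel, RTAS, or PHYP interface: `struct hv_model`, `migrate()`, `hv_signal_cancel()` and the other names are all hypothetical. It models the point made in this reply (abandoning the attempt without signalling leaves the hypervisor in the suspending state, where it refuses further requests) and shows that both policies discussed, cancelling on the first conflict or retrying as suggested in the quoted text, still need to signal cancellation whenever they give up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not PHYP/RTAS/kernel interfaces. */
enum hv_state { HV_IDLE, HV_SUSPENDING };

struct hv_model {
	enum hv_state state;
	int cancels;	/* number of cancellation signals received */
};

/* The (modelled) hypervisor enters the suspending state when migration starts. */
static void hv_begin_migration(struct hv_model *hv)
{
	hv->state = HV_SUSPENDING;
}

/* Hypothetical stand-in for telling the hypervisor the attempt was abandoned. */
static void hv_signal_cancel(struct hv_model *hv)
{
	assert(hv->state == HV_SUSPENDING);	/* cancelling only makes sense mid-migration */
	hv->state = HV_IDLE;
	hv->cancels++;
}

/* While suspending, further requests (next migration, DLPAR, ...) are refused. */
static bool hv_accepts_requests(const struct hv_model *hv)
{
	return hv->state == HV_IDLE;
}

/*
 * Try to suspend up to max_attempts times; races[i] says whether attempt i
 * collides with a concurrent CPU offline (races must hold max_attempts
 * entries). max_attempts == 1 models abandoning on the first conflict; a
 * larger bound models retrying. signal_on_failure == false models giving up
 * silently, which leaves the hypervisor stuck in HV_SUSPENDING.
 */
static int migrate(struct hv_model *hv, const bool *races, int max_attempts,
		   bool signal_on_failure)
{
	hv_begin_migration(hv);
	for (int i = 0; i < max_attempts; i++) {
		if (!races[i]) {
			hv->state = HV_IDLE;	/* suspend succeeded, migration completes */
			return 0;
		}
		/* CPU offline raced with this attempt; try again if allowed. */
	}
	if (signal_on_failure)
		hv_signal_cancel(hv);	/* let the hypervisor stop waiting and clean up */
	return -EBUSY;
}
```

In this model a retry loop only terminates because `max_attempts` bounds it; an unbounded retry against a persistent conflict would never return, which is the livelock concern raised above. The bound itself is an assumption of the sketch, not something proposed in this thread.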