From mboxrd@z Thu Jan 1 00:00:00 1970
From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Guchun Chen, John Clements, Alex Deucher, Sasha Levin
Subject: [PATCH 5.7 019/112] drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU
Date: Tue, 7 Jul 2020 17:16:24 +0200
Message-Id: <20200707145801.910328468@linuxfoundation.org>
In-Reply-To: <20200707145800.925304888@linuxfoundation.org>
References: <20200707145800.925304888@linuxfoundation.org>
X-Mailer: git-send-email 2.27.0
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

From: Guchun Chen

[ Upstream commit 12c17b9d62663c14a5343d6742682b3e67280754 ]

When running RAS uncorrectable error injection and triggering a GPU
reset on sGPU, the issue below is observed. It is caused by the device
list being accessed before it has been initialized.

[   80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
[   80.047300] #PF: supervisor write access in kernel mode
[   80.047351] #PF: error_code(0x0003) - permissions violation
[   80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
[   80.047477] Oops: 0003 [#1] SMP PTI
[   80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1
[   80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]

Signed-off-by: Guchun Chen
Reviewed-by: John Clements
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b0aa4e1ed4df7..cd18596b47d33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1444,9 +1444,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
 
 	/* Build list of devices to query RAS related errors */
-	if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
+	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
 		device_list_handle = &hive->device_list;
-	} else {
+	else {
+		INIT_LIST_HEAD(&device_list);
 		list_add_tail(&adev->gmc.xgmi.head, &device_list);
 		device_list_handle = &device_list;
 	}
-- 
2.25.1
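
For readers unfamiliar with the kernel list API, the following is a minimal userspace sketch of why the added INIT_LIST_HEAD() call matters. The struct and the three helpers are simplified stand-ins written for this note, not the kernel's include/linux/list.h, and single_device_list_len() is a hypothetical function that only mirrors the shape of the single-GPU branch above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty circular list is a head that points at itself. */
static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	/* If head was never initialized, this pointer is indeterminate. */
	struct list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;	/* a write through that pointer */
	head->prev = entry;
}

/* Count entries by walking until we come back around to head. */
static int list_count(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Hypothetical helper mirroring the patched single-GPU branch. */
int single_device_list_len(void)
{
	struct list_head device_list, *device_list_handle;
	struct list_head node;	/* stands in for adev->gmc.xgmi.head */

	INIT_LIST_HEAD(&device_list);	/* the line the patch adds */
	list_add_tail(&node, &device_list);
	device_list_handle = &device_list;

	assert(device_list_handle->next == &node);
	return list_count(device_list_handle);
}
```

Without the INIT_LIST_HEAD() call, list_add_tail() would read head->prev from an uninitialized stack variable and then write through it, which is consistent with the supervisor write fault reported in the oops above.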