From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 572B11DB148;
	Thu, 17 Apr 2025 18:03:24 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1744913004; cv=none;
	b=rXC5xfffHS9xea+qUSe+3QI0uOuSlBrnj9ulyFF7pxVmZa632souwL3RIZ1F3tyI6Z3z8etw75hpA1u8dxznm6wLtUw0sZX3g6CB7/59oby4Wfx3rGfXAh9StoSNWCBRZ3VoL9GwA8M3Blo0NHuk5r64tTUXAqJB2Lp+dfxc03k=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1744913004; c=relaxed/simple;
	bh=Jdr0R+3c7JCuDqy3GtmnDwDxVjcS8+3cnhq8V4jBAoQ=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
	b=W3LvmdCE7I/c+eZiuw28nGFwc8424nbIogc0WcEEBHfPh8GXx7sDTiJyze6b99pn4wTXc3OLe0xGGtXGRSFSSBv3jBdoN5daV3ZN7Y3ERBqoGZIGgtVOYfhbyPlrxuu8Urrp3iTMs9hmb4osENDa5LbS7/hNsDA4iK1qX3SpZEw=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b=lTQVr1kY;
	arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b="lTQVr1kY"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id D4449C4CEE4;
	Thu, 17 Apr 2025 18:03:23 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org;
	s=korg; t=1744913004;
	bh=Jdr0R+3c7JCuDqy3GtmnDwDxVjcS8+3cnhq8V4jBAoQ=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=lTQVr1kY8sMeG1os8mNZP+P0wvREZ0vwU8UCvuBOoNTJORYDfyaOsS+a1uo/ZzpM1
	 k17V1xuJ9XVxo/7ooCtTtPc36f4152XKXAEzSi2nrmo5sfZg3lhvqNnh34cVT/1ldo
	 BHMTzSttANRmXbQZ8TPSpXISIXXGTtR5Qi2tUp0M=
From: Greg Kroah-Hartman
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman,
	patches@lists.linux.dev,
	Philip Yang,
	Lijo Lazar,
	Felix Kuehling,
	Alex Deucher,
	Sasha Levin
Subject: [PATCH 6.14 166/449] drm/amdkfd: Fix mode1 reset crash issue
Date: Thu, 17 Apr 2025 19:47:34 +0200
Message-ID: <20250417175124.663687091@linuxfoundation.org>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250417175117.964400335@linuxfoundation.org>
References: <20250417175117.964400335@linuxfoundation.org>
User-Agent: quilt/0.68
X-stable: review
X-Patchwork-Hint: ignore
Precedence: bulk
X-Mailing-List: patches@lists.linux.dev
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

6.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Philip Yang

[ Upstream commit f0b4440cdc1807bb6ec3dce0d6de81170803569b ]

If HW scheduler hangs and mode1 reset is used to recover GPU, KFD signal
user space to abort the processes. After process abort exit, user queues
still use the GPU to access system memory before h/w is reset while KFD
cleanup worker free system memory and free VRAM.

There is use-after-free race bug that KFD allocate and reuse the freed
system memory, and user queue write to the same system memory to corrupt
the data structure and cause driver crash.

To fix this race, KFD cleanup worker terminate user queues, then flush
reset_domain wq to wait for any GPU ongoing reset complete, and then
free outstanding BOs.

Signed-off-by: Philip Yang
Reviewed-by: Lijo Lazar
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 083f83c945318..c3f2c0428e013 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -35,6 +35,7 @@
 #include
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
+#include "amdgpu_reset.h"
 
 struct mm_struct;
 
@@ -1140,6 +1141,17 @@ static void kfd_process_remove_sysfs(struct kfd_process *p)
 	p->kobj = NULL;
 }
 
+/*
+ * If any GPU is ongoing reset, wait for reset complete.
+ */
+static void kfd_process_wait_gpu_reset_complete(struct kfd_process *p)
+{
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++)
+		flush_workqueue(p->pdds[i]->dev->adev->reset_domain->wq);
+}
+
 /* No process locking is needed in this function, because the process
  * is not findable any more. We must assume that no other thread is
  * using it any more, otherwise we couldn't safely free the process
@@ -1154,6 +1166,11 @@ static void kfd_process_wq_release(struct work_struct *work)
 	kfd_process_dequeue_from_all_devices(p);
 	pqm_uninit(&p->pqm);
 
+	/*
+	 * If GPU in reset, user queues may still running, wait for reset complete.
+	 */
+	kfd_process_wait_gpu_reset_complete(p);
+
 	/* Signal the eviction fence after user mode queues are
 	 * destroyed. This allows any BOs to be freed without
 	 * triggering pointless evictions or waiting for fences.
-- 
2.39.5
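
For readers who want to see why the ordering in the commit message matters,
here is a rough single-threaded userspace model of the race. It is only a
sketch under stated assumptions: every toy_* name is hypothetical and is not
part of amdkfd or the kernel workqueue API, toy_flush() merely stands in for
what flush_workqueue() guarantees (all pending work has run on return), and
the "GPU write" is reduced to a counter. Terminating the user queues first,
as the real fix does, is left out to keep the model small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WORK 8

/* Hypothetical stand-in for a workqueue: a list of pending callbacks. */
struct toy_work {
	void (*fn)(void *arg);
	void *arg;
};

struct toy_wq {
	struct toy_work items[TOY_MAX_WORK];
	size_t count;
};

/* Hypothetical device state shared by the reset work and the cleanup path. */
struct toy_state {
	bool queues_running;	/* user queues can still write to memory */
	bool freed;		/* cleanup path has released the buffer */
	int stale_writes;	/* writes that landed after the free */
};

static void toy_queue_work(struct toy_wq *wq, void (*fn)(void *), void *arg)
{
	assert(wq->count < TOY_MAX_WORK);
	wq->items[wq->count].fn = fn;
	wq->items[wq->count].arg = arg;
	wq->count++;
}

/* Run everything that is pending; on return no queued work is left. */
static void toy_flush(struct toy_wq *wq)
{
	for (size_t i = 0; i < wq->count; i++)
		wq->items[i].fn(wq->items[i].arg);
	wq->count = 0;
}

/* The pending reset: once it has run, the user queues are stopped. */
static void toy_reset_work(void *arg)
{
	struct toy_state *s = arg;

	s->queues_running = false;
}

/* A still-running user queue touching the buffer. */
static void toy_gpu_write(struct toy_state *s)
{
	if (s->queues_running && s->freed)
		s->stale_writes++;
}

/* Model of the cleanup path; returns the number of writes after free. */
int toy_cleanup(bool wait_for_reset)
{
	struct toy_state s = { .queues_running = true };
	struct toy_wq reset_wq = { .count = 0 };

	/* A reset has been scheduled but has not run yet. */
	toy_queue_work(&reset_wq, toy_reset_work, &s);

	if (wait_for_reset)
		toy_flush(&reset_wq);	/* the step this patch adds */

	s.freed = true;			/* free system memory / VRAM */
	toy_gpu_write(&s);		/* GPU gets a chance to write */

	toy_flush(&reset_wq);		/* reset completes eventually */
	return s.stale_writes;
}
```

In this model toy_cleanup(false) reports one write after the free, while
toy_cleanup(true) reports none, because the flush forces the pending reset
to finish before the buffer is released.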