Date: Fri, 1 Nov 2024 17:07:34 +0200
From: Raag Jadav
To: Lucas De Marchi
Cc: intel-xe@lists.freedesktop.org, John Harrison, Rodrigo Vivi, José Roberto de Souza
Subject: Re: [PATCH 1/2] drm/xe: Improve devcoredump documentation
References: <20241031182916.1441987-1-lucas.demarchi@intel.com> <20241031182916.1441987-2-lucas.demarchi@intel.com> <4kw2zzb76m42zbisvsy2fu52q2litchy6dfl4hyrmvze5u5dvk@hjs2pdynjemd>
In-Reply-To: <4kw2zzb76m42zbisvsy2fu52q2litchy6dfl4hyrmvze5u5dvk@hjs2pdynjemd>

On Fri, Nov 01, 2024 at 07:44:37AM -0500, Lucas De Marchi wrote:
> On Fri, Nov 01, 2024 at 07:47:54AM +0200, Raag Jadav wrote:
> > On Thu, Oct 31, 2024 at 11:29:15AM -0700, Lucas De Marchi wrote:
> >
> > ...
> >
> > > - * Snapshot at hang:
> > > - * The 'data' file is printed with a drm_printer pointer at devcoredump read
> > > - * time. For this reason, we need to take snapshots from when the hang has
> > > - * happened, and not only when the user is reading the file. Otherwise the
> > > - * information is outdated since the resets might have happened in between.
> > > + * The following characteristics are observed by xe when creating a device
> > > + * coredump:
> > >   *
> > > - * 'First' failure snapshot:
> > > - * In general, the first hang is the most critical one since the following hangs
> > > - * can be a consequence of the initial hang. For this reason we only take the
> > > - * snapshot of the 'first' failure and ignore subsequent calls of this function,
> > > - * at least while the coredump device is alive. Dev_coredump has a delayed work
> > > - * queue that will eventually delete the device and free all the dump
> > > - * information.
> > > + * **Snapshot at hang**:
> > > + * The 'data' file contains a snapshot of the HW state at the time the hang
> > > + * happened. Due to the driver recovering from resets/crashes, it may not
> > > + * correspond to the state of when the file is read by userspace.
> >
> > Does that mean the devcoredump will be present even after a successful recovery?
>
> yes.... if it's not successful then it's moved to the wedged state. Easy
> way to test is running this:
>
>     xe_exec_threads --r threads-hang-basic
>
> You should see something like this in your dmesg:
>
>     [IGT] xe_exec_threads: starting subtest threads-hang-basic
>     xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=rcs, logical_mask: 0x1, guc_id=34
>     xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x1, guc_id=32
>     xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vcs, logical_mask: 0x1, guc_id=18
>     xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=34, flags=0x0 in xe_exec_threads [2636]
>     xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vecs, logical_mask: 0x1, guc_id=17
>     xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=18, flags=0x0 in xe_exec_threads [2636]
>     xe 0000:00:02.0: [drm] Xe device coredump has been created
> --> xe 0000:00:02.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
>     xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=17, flags=0x0 in xe_exec_threads [2636]
>     xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=32, flags=0x0 in xe_exec_threads [2636]
>     xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=27
>     xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=27, flags=0x0 in xe_exec_threads [2636]
>     [IGT] xe_exec_threads: finished subtest threads-hang-basic, SUCCESS
>
>
> If you run it again, it won't overwrite the previous dump, until user
> cleans the previous dump or the timeout on the kernel side fires to
> release it.

Yes, which I think we're covering at a later point in "First failure only".
So maybe establishing the mechanism itself before explaining reset/recovery
would be a bit neater...

> From a distro-integration pov, I think it should have a udev rule that
> fires when a devcoredump is created so the dump is copied to persistent
> storage. Just like it happens with cpu coredump (see systemd-coredump)
>
> > Perhaps moving the 'release' part to above paragraph will add required context.
>
> not sure I follow. Are you suggesting to swap the order of "First
> failure only" and "Snapshot at hang" ?

... in whichever way you think is best.

> > > + * **First failure only**:
> > > + * In general, the first hang is the most critical one since the following
> > > + * hangs can be a consequence of the initial hang. For this reason a snapshot
> > > + * is taken only for the first failure. Until the devcoredump is released by
> > > + * userspace or kernel, all subsequent hangs do not override the snapshot nor
> > > + * create new ones. Devcoredump has a delayed work queue that will eventually
> > > + * delete the file node and free all the dump information.

Raag
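
P.S. To make the ordering question concrete, here is a toy userspace model of
the two properties as the proposed text describes them. It is only a sketch:
CoredumpSlot, on_hang(), read(), release() and timeout_s are hypothetical names
invented for illustration, not xe or devcoredump code, and the timeout value is
an arbitrary assumption.

```python
# Toy model only: names are hypothetical and this is not xe driver code.
class CoredumpSlot:
    """Holds at most one snapshot, mirroring the 'first failure only' rule."""

    def __init__(self, timeout_s=300.0):
        # timeout_s stands in for the kernel-side delayed work; the value
        # used here is an assumption made for the example.
        self.timeout_s = timeout_s
        self.snapshot = None
        self.captured_at = None

    def _expire(self, now):
        # Kernel-side release: drop the dump once the timeout has elapsed.
        if self.snapshot is not None and now - self.captured_at >= self.timeout_s:
            self.release()

    def on_hang(self, hw_state, now):
        """Record hw_state only if no dump is alive; return True if captured."""
        self._expire(now)
        if self.snapshot is not None:
            # Later hangs neither override the dump nor create a new one.
            return False
        # Copy the state: the dump reflects hang time, not read time.
        self.snapshot = dict(hw_state)
        self.captured_at = now
        return True

    def read(self, now):
        """Return the stored snapshot, or None if it has been released."""
        self._expire(now)
        return self.snapshot

    def release(self):
        """Userspace-side release of the dump."""
        self.snapshot = None
        self.captured_at = None


slot = CoredumpSlot(timeout_s=300.0)
hw = {"engine": "rcs", "guc_id": 34}
first = slot.on_hang(hw, now=0.0)
hw["guc_id"] = 0  # the driver recovers, so the live state moves on
second = slot.on_hang({"engine": "bcs", "guc_id": 32}, now=10.0)
print(first, second, slot.read(now=20.0))
```

In this model the release mechanism (release() or the timeout) is what gives
"Snapshot at hang" its meaning after a recovery, which is why describing the
mechanism first reads more naturally to me.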