From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D8778276032; Tue, 20 May 2025 11:04:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747739096; cv=none; b=BWlWXlfK7qPjForSkNss7HANdNqg40ijYH9gqzi5d2MNyy7DuH6vCinrd5NVtrkcleCzXNRr2FfV5qfZi0mCn+rUxS3YrVoUpRLND6HwAUSQvtlboMZ4MVkmeoK85AN4YuOBPcnU9axnVns5NiSEJbIfz266guHDnxRrU4oHY2k= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747739096; c=relaxed/simple; bh=NqNxom7nfx4bue1iFQFzk4tegD7G+cHox4fj0Y3rv78=; h=Date:From:To:CC:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=m04tmt2dEMio5u4Px4zlp5Bc5qCrUWhAMs7e9ymo1iKUikgKEd6oC24BjczVnkKCfy6p4IDH9n97vEieyn/Mi+gGDukE4Ltn1usHsVaQ0A9thcWXkj8fL19yyQyrO/aHAcS2fGkO93bECDA+IoeQZw7fPhmQZ89iRdTu2MBNDN8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4b1s5W6N0kz6M4t5; Tue, 20 May 2025 18:59:59 +0800 (CST) Received: from frapeml500008.china.huawei.com (unknown [7.182.85.71]) by mail.maildlp.com (Postfix) with ESMTPS id AFA1F1402F0; Tue, 20 May 2025 19:04:49 +0800 (CST) Received: from localhost (10.203.177.66) by frapeml500008.china.huawei.com (7.182.85.71) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Tue, 20 May 2025 13:04:48 +0200 Date: Tue, 20 May 2025 12:04:46 +0100 From: Jonathan Cameron To: "Bowman, Terry" CC: , , , , , , , , , , , , , , , , , , , Subject: Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Message-ID: <20250520120446.000022b2@huawei.com> In-Reply-To: <8042c08a-42f0-49d5-b619-26bfc8e6f853@amd.com> References: <20250327014717.2988633-1-terry.bowman@amd.com> <20250327014717.2988633-5-terry.bowman@amd.com> <20250423160443.00006ee0@huawei.com> <20250425141849.00003c92@huawei.com> <8042c08a-42f0-49d5-b619-26bfc8e6f853@amd.com> X-Mailer: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32) Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-ClientProxiedBy: lhrpeml500012.china.huawei.com (7.191.174.4) To frapeml500008.china.huawei.com (7.182.85.71) On Thu, 15 May 2025 16:52:15 -0500 "Bowman, Terry" wrote: > On 4/25/2025 8:18 AM, Jonathan Cameron wrote: > > On Thu, 24 Apr 2025 09:17:45 -0500 > > "Bowman, Terry" wrote: > > > >> On 4/23/2025 10:04 AM, Jonathan Cameron wrote: > >>> On Wed, 26 Mar 2025 20:47:05 -0500 > >>> Terry Bowman wrote: > >>> > >>>> The AER service driver includes a CXL-specific kfifo, intended to forward > >>>> CXL errors to the CXL driver. However, the forwarding functionality is > >>>> currently unimplemented. Update the AER driver to enable error forwarding > >>>> to the CXL driver. > >>>> > >>>> Modify the AER service driver's handle_error_source(), which is called from > >>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors. > >>>> > >>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it > >>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error > >>>> masks. > >>>> > >>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error() > >>>> as done in the current AER driver. > >>>> > >>>> If the error is a CXL-related error then forward it to the CXL driver for > >>>> handling using the kfifo mechanism. > >>>> > >>>> Introduce a new function forward_cxl_error(), which constructs a CXL > >>>> protocol error context using cxl_create_prot_err_info(). This context is > >>>> then passed to the CXL driver via kfifo using a 'struct work_struct'. > >>>> > >>>> Signed-off-by: Terry Bowman > >>> Hi Terry, > >>> > >>> Finally got back to this. I'm not following how some of the reference > >>> counting in here is working. It might be fine but there is a lot > >>> taking then dropping device references - some of which are taken again later. > >>> > >>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec) > >>>> pci_info(rcec, "CXL: Internal errors unmasked"); > >>>> } > >>>> > >>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info) > >>>> +{ > >>>> + int severity = info->severity; > >>> So far this variable isn't really justified. Maybe it makes sense later in the > >>> series? > >> This is used below in call to cxl_create_prot_err_info(). > > Sure, but why not just do > > > > if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) { > > > > There isn't anything modifying info->severity in between so that local > > variable is just padding out the code to no real benefit. > > > > > >>>> + pci_err(pdev, "Failed to create CXL protocol error information"); > >>>> + return; > >>>> + } > >>>> + > >>>> + struct device *cxl_dev __free(put_device) = get_device(err_info->dev); > >>> Also this one. A reference was acquired and dropped in cxl_create_prot_err_info() > >>> followed by retaking it here. How do we know it is still about by this call > >>> and once we pull it off the kfifo later? > >> Yes, this is a problem I realized after sending the series. > >> > >> The device reference incr could be changed for all the devices to the non-cleanup > >> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info(). > >> I need to look at the other calls to to cxl_create_prot_err_info() as well. > >> > >> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info. > >> This would eliminate the need for further accesses to the CXL device after being dequeued from the > >> fifo. Thoughts? > > That sounds like a reasonable solution to me. > > > > Jonathan > Hi Jonathan, Hi Terry, Sorry for delay - travel etc... > > Is it sufficient to rely on correctly implemented reference counting implementation instead > of caching the RAS status I mentioned earlier? > > I have the next revision coded to 'get' the CXL erring device's reference count in the AER > driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver > after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading > the RAS status and caching it. One more question: is it OK to implement the get and put (of > the same object) in different drivers? It's definitely unusual. If there is anything similar to point at I'd be happier than this 'innovation' showing up here first. > > If we need to read and cache the RAS status before the kfifo enqueue there will be some other > details to work through. This still smells like the cleaner solution to me, but depends on those details.. Jonathan > > -Terry > >