Date: Thu, 07 Aug 2025 15:39:21 +0200
Subject: Re: [PATCH ath-current 0/7] wifi: ath12k: Fix Issues in REO RX Queue Updates
From: "Nicolas Escande"
To: "Nithyanantham Paramasivam"
X-Mailing-List: linux-wireless@vger.kernel.org
References: <20250806111750.3214584-1-nithyanantham.paramasivam@oss.qualcomm.com>

On Thu Aug 7, 2025 at 2:31 PM CEST, Nithyanantham Paramasivam wrote:
> On Thu, Aug 7, 2025 at 1:32 AM Nicolas Escande wrote:
>>
>> On Wed Aug 6, 2025 at 1:17 PM CEST, Nithyanantham Paramasivam wrote:
>> > During stress test scenarios, when the REO command ring becomes full,
>> > the RX queue update command issued during peer deletion fails due to
>> > insufficient space. In response, the host performs a dma_unmap and
>> > frees the associated memory. However, the hardware still retains a
>> > reference to the same memory address. If the kernel later reallocates
>> > this address, unaware that the hardware is still using it, it can
>> > lead to memory corruption, since the host might access or modify
>> > memory that is still actively referenced by the hardware.
>> >
>> > Implement a retry mechanism for the HAL_REO_CMD_UPDATE_RX_QUEUE
>> > command during TID deletion to prevent memory corruption. Introduce
>> > a new list, reo_cmd_update_rx_queue_list, in the dp structure to
>> > track pending RX queue updates. Protect this list with
>> > reo_rxq_flush_lock, which also ensures synchronized access to
>> > reo_cmd_cache_flush_list. Defer memory release until hardware
>> > confirms the virtual address is no longer in use, avoiding immediate
>> > deallocation on command failure. Release memory for pending RX queue
>> > updates via ath12k_dp_rx_reo_cmd_list_cleanup() on system reset
>> > if hardware confirmation is not received.
>> >
>> > Additionally, increase DP_REO_CMD_RING_SIZE to 256 to reduce the
>> > likelihood of ring exhaustion. Use a 1KB cache flush command for
>> > QoS TID descriptors to improve efficiency.
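
If I read the description right, the bookkeeping added is roughly this
(my own untested sketch of the idea, not the actual patch; only the
list/lock names come from the cover letter, the entry layout and its
name are guesses):

    /* entry tracked on reo_cmd_update_rx_queue_list; layout guessed */
    struct dp_rx_tid_update_elem {
            struct list_head list;
            struct ath12k_dp_rx_tid data;   /* keeps vaddr/paddr alive
                                             * until hw acks the update */
            bool active;                    /* cleared once the tid is
                                             * torn down */
    };

    /* in struct ath12k_dp */
    struct list_head reo_cmd_update_rx_queue_list;
    /* guards the list above and reo_cmd_cache_flush_list */
    spinlock_t reo_rxq_flush_lock;

So a failed HAL_REO_CMD_UPDATE_RX_QUEUE just leaves the element on the
list, and ath12k_dp_rx_reo_cmd_list_cleanup() frees whatever is still
pending on reset. Correct?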
>> Hello,
>>
>> I'm not sure I fully understand, so please correct me where I'm wrong,
>> but from what I got looking at the code:
>> - you increase the ring size for reo cmd
>> - you enable releasing multiple tid buffers at once
>>   - as soon as you allocate the tid you create an entry in the list,
>>     flagging it as active
>> - when you need to clean up a tid
>>   - you mark it as inactive in the list, then
>>   - try to process the whole list by
>>     - sending the reo command to release it
>>     - if it succeeds you release it and remove the entry from the list
>> - on core exit, you re-process the list
>>
>> What is bothering me with this is that when a vdev which has multiple
>> sta goes down, you will have a lot of those entries to push to the
>> firmware at once:
>>
>> - So increasing the ring size / being able to release multiple entries
>>   at once in one reo cmd may or may not be enough to handle the burst.
>>   It seems that increasing those is just a band-aid on the underlying
>>   problem, but I understand it's just to be on the safe side.
>> - If entries do not get sent/accepted by the firmware, we will need to
>>   wait for another station removal before retrying, which could be a
>>   while.
>> - Or in the worst case (one vdev going down and needing to release the
>>   tids of all its stations) the more entries we have in the list, the
>>   less likely we will be able to push the delete of all stations + the
>>   ones still in queue.
>>
>> So, on station removal, why not just make things completely async: push
>> the tid deletes in a list and wake a workqueue that tries to push those
>> to the firmware. If the ring is full, retry periodically.
>>
>> Thanks
>
> Hi Nicolas,

Hi

>
> Thanks for the detailed observations and suggestions.
>
> You're right in your understanding of the flow. To clarify further:
>
> When the host fails to obtain a buffer from the hardware to send a
> command, because the REO command ring is full (ath12k_hal_reo_cmd_send
> returns -ENOBUFS), we treat it as a command send failure and avoid
> deleting the corresponding entry immediately.
>
> This situation typically arises in the flow:
>
> ath12k_dp_rx_process_reo_cmd_update_rx_queue_list →
> ath12k_dp_rx_tid_delete_handler → returns -ENOBUFS
>
> Given that the command ring size is 256, each peer generally has 16
> TIDs, and each TID sends 2 commands (RXQ update and cache flush),
> deleting one peer involves up to 32 commands. This means we can delete
> up to 8 peers (8 × 32 = 256 commands) before hitting the ring limit.
>
> From the 9th peer onward, we may encounter ring exhaustion. However, we
> handle retries in the REO command callback (ath12k_dp_rx_tid_del_func).
> If commands fail to send, they remain in the pending list and can be
> retried via the success callback of earlier commands.

That was the part I did not get: I thought only another peer removal
would cause the whole list of pending entries to be reprocessed and
pushed to the ring.

> Do we foresee any issue with this?
>
Nope, it should work.

> Regarding your suggestion to make the deletion process fully
> asynchronous via a workqueue, that's a valid point. Workqueue-based
> handling is a good idea, but we need to account for potential delays if
> the worker thread isn't scheduled promptly. We also need to consider
> timeout-based rescheduling, especially during command failures, to
> ensure retries happen in a timely manner.
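
The rescheduling part should not need much: a delayed_work that re-arms
itself when the ring is full would do. Rough untested sketch (the
tid_del_work member and the 20ms period are made up, the handler
signature is guessed, and the entry struct is the one from my sketch
above):

    static void ath12k_dp_rx_tid_del_work(struct work_struct *work)
    {
            struct ath12k_dp *dp = container_of(to_delayed_work(work),
                                                struct ath12k_dp,
                                                tid_del_work);
            struct dp_rx_tid_update_elem *elem, *tmp;
            bool requeue = false;

            spin_lock_bh(&dp->reo_rxq_flush_lock);
            list_for_each_entry_safe(elem, tmp,
                                     &dp->reo_cmd_update_rx_queue_list,
                                     list) {
                    if (elem->active)       /* peer still alive, skip */
                            continue;
                    /* sends HAL_REO_CMD_UPDATE_RX_QUEUE; the entry is
                     * freed later from the command callback, once the
                     * hardware confirms it no longer uses the memory
                     */
                    if (ath12k_dp_rx_tid_delete_handler(dp, elem) ==
                        -ENOBUFS) {
                            requeue = true; /* ring full, retry later */
                            break;
                    }
            }
            spin_unlock_bh(&dp->reo_rxq_flush_lock);

            if (requeue)
                    schedule_delayed_work(&dp->tid_del_work,
                                          msecs_to_jiffies(20));
    }

Then peer deletion becomes: mark the entries inactive and
schedule_delayed_work(&dp->tid_del_work, 0), and nothing has to wait for
the next peer removal to flush the backlog.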
> We'll evaluate this suggestion further and get back to you.
>
IMHO it feels like it would simplify the code to just push + wake a wq on
delete, and reschedule if no space is available in the ring (as in the
sketch above). The timing should not be such a big deal, right? The
important part is to eventually push all reo commands to the firmware; it
should not have an impact on the rest of the operation if they reach the
firmware later rather than sooner, right?

> Thanks again for the insightful feedback!
Thanks for all the explanations
>
> Best Regards,
> Nithy