From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EEB1CC2D0CE for ; Fri, 3 Jan 2020 16:31:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B255822314 for ; Fri, 3 Jan 2020 16:31:31 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="HJ1BwP3q" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727907AbgACQbb (ORCPT ); Fri, 3 Jan 2020 11:31:31 -0500 Received: from aserp2120.oracle.com ([141.146.126.78]:40426 "EHLO aserp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727859AbgACQba (ORCPT ); Fri, 3 Jan 2020 11:31:30 -0500 Received: from pps.filterd (aserp2120.oracle.com [127.0.0.1]) by aserp2120.oracle.com (8.16.0.27/8.16.0.27) with SMTP id 003GTOc9073919; Fri, 3 Jan 2020 16:31:25 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=content-type : mime-version : subject : from : in-reply-to : date : cc : content-transfer-encoding : message-id : references : to; s=corp-2019-08-05; bh=MCxYQsBaQc/0r53GJBQloDBNau19kLxx+ZWgJIG73yU=; b=HJ1BwP3qUX0HX17Agd95w345APBbQmWb7x7cMln6v8GcvJdXvH/iH/1UCr8Mz2fbnrYb g//huKlSJLwRrOlfYOfpetg6hun+Q+i45imd8JjYw7MRv39v0mvHl//RGTNyRq4IQs7R M0hG2lnJ1wXar3qqK7Wtoc3vtlzpTwCGXzxpOXnDdKB5Zivul7AAcUq8D1pSTMjHtTrk DTWg3KuQGRLKTgQ9c7/6FEpDht93jbqWEuiT9HfJ1Vphd3sJM1Ot5sQOxxwKTIAANqxn lLn0MkZsPcLgsx2K3qRet4AB716VsrmQN3Zi0TnXLhF6M/0Yv0yPlNKzBKFfx3UHi4jc RA== Received: from aserp3020.oracle.com (aserp3020.oracle.com [141.146.126.70]) by aserp2120.oracle.com with ESMTP id 2x5y0pw3s4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 03 Jan 2020 16:31:25 +0000 Received: from pps.filterd (aserp3020.oracle.com [127.0.0.1]) by aserp3020.oracle.com (8.16.0.27/8.16.0.27) with SMTP id 003GSduV051923; Fri, 3 Jan 2020 16:31:25 GMT Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by aserp3020.oracle.com with ESMTP id 2xa7ty60a1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 03 Jan 2020 16:31:24 +0000 Received: from abhmp0002.oracle.com (abhmp0002.oracle.com [141.146.116.8]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id 003GVNkY019507; Fri, 3 Jan 2020 16:31:23 GMT Received: from [192.168.14.112] (/79.178.220.19) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 03 Jan 2020 08:31:23 -0800 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 11.1 \(3445.4.7\)) Subject: Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb() From: Liran Alon In-Reply-To: <20200103133749.GA9706@ziepe.ca> Date: Fri, 3 Jan 2020 18:31:18 +0200 Cc: Will Deacon , saeedm@mellanox.com, leon@kernel.org, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, eli@mellanox.com, tariqt@mellanox.com, danielm@mellanox.com, =?utf-8?Q?H=C3=A5kon_Bugge?= Content-Transfer-Encoding: quoted-printable Message-Id: References: <20200102174436.66329-1-liran.alon@oracle.com> <20200102192934.GH9282@ziepe.ca> <6524AE07-2ED7-41B5-B761-9F6BE8D2049B@oracle.com> <20200102205847.GJ9282@ziepe.ca> <79BB7EDF-406D-4FA1-ADDC-634D55F15C37@oracle.com> <20200103133749.GA9706@ziepe.ca> To: Jason Gunthorpe X-Mailer: Apple Mail (2.3445.4.7) X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9489 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1911140001 definitions=main-2001030151 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9489 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1911140001 definitions=main-2001030151 Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org > On 3 Jan 2020, at 15:37, Jason Gunthorpe wrote: >=20 > On Fri, Jan 03, 2020 at 12:21:06AM +0200, Liran Alon wrote: >=20 >>> AFAIK WC is largely unspecified by the memory model. Is wmb() even >>> formally specified to interact with WC? >>=20 >> As I said, I haven=E2=80=99t seen such semantics defined in kernel >> documentation such as memory-barriers.txt. However, in practice, it >> does flush WC buffers. At least for x86 and ARM which I=E2=80=99m = familiar >> enough with. I think it=E2=80=99s reasonable to assume that wmb() = should >> flush WC buffers while dma_wmb()/smp_wmb() doesn=E2=80=99t = necessarily have >> to do this. >=20 > It is because WC is rarely used and laregly undefined for the kernel > :( Yep. >=20 >>>>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() = from >>>>>> it's callers. >>>>>=20 >>>>> Yes, wmb(); writel(); is always redundant >>>>=20 >>>> Well, unfortunately not=E2=80=A6 >>>> See: = https://urldefense.proofpoint.com/v2/url?u=3Dhttps-3A__marc.info_-3Fl-3Dli= nux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=3DDwIDaQ&c=3DRoP1YumCXCgaWHvl= ZYR8PZh8Bv7qIrMUB65eapI_JnE&r=3DJk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx= 0&m=3DOx1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=3DMz6MJzUQ862DGjgGnj3n= eX4ZpjI88nOI9KpZhNF9TqQ&e=3D >>>> (See my suggestion to add flush_wc_writeX()) >>>=20 >>> Well, the last time wmb & writel came up Linus was pretty clear that >>> writel is supposed to remain in program order and have the barriers >>> needed to do that. >>=20 >> Right. But that doesn=E2=80=99t take into account that WC writes are >> considered completed when they are still posted in CPU WC buffers. >=20 >> The semantics as I understand of writeX() is that it guarantees all >> prior writes have been completed. It means that all prior stores >> have executed and that store-buffer is flushed. But it doesn=E2=80=99t = mean >> that WC buffers is flushed as-well. >=20 > The semantic for writel is that prior program order stores will be > observable by DMA from the device receiving the writel. This is > required for UC and NC stores today. WC is undefined, I think. >=20 > This is why ARM has the additional barrier in writel. Yep. >=20 > It would logically make sense if WC followed the same rule, however, > adding a barrier to writel to make WC ordered would not be popular, so > I think we are left with using special accessors for WC and placing > the barrier there.. Right. >=20 >>> IMHO you should start there before going around and adding/removing = wmbs >>> related to WC. Update membory-barriers.txt and related with the = model >>> ordering for WC access and get agreement. >>=20 >> I disagree here. It=E2=80=99s more important to fix a real bug (e.g. = Not >> flushing WC buffers on x86 AMD). Then, we can later formalise this >> and refactor code as necessary. Which will also optimise it as-well. >> Bug fix can be merged before we finish all these discussions and get >> agreement. >=20 > Is it a real bug that people actually hit? It wasn't clear from the > commit message. If so, sure, it should be fixed and the commit message > clarified. (but I'd put the wmb near the WC writes..) I found this bug during code review. I=E2=80=99m not aware if AWS saw = this bug happening in production. But according to AMD SDM and Optimization Guide SDM, this is a bug. I think it doesn=E2=80=99t happen in practice because the write of the = Tx descriptor + 128 first bytes of packet Effectively fills the relevant WC buffers and when a WC buffer is fully = written to, the CPU *should* (Not *must*) flush the WC buffer to memory. Having said that, I (and AWS maintainers of this code which I spoke with = them about this) still think that we should first apply this patch and later refactor this code when = we will introduce the proper memory accessor (flush_wc_writeX()). But I would be glad to hear thoughts from memory-barriers.txt = maintainers such as Will on this. >=20 > I am surprised that AMD is different here, the evolution of the WC > feature on x86 was to transparently speed up graphics, so I'm pretty > surprised AMD can get away with not ordering the same as Intel.. Completely agree. I was very surprised to see this from AMD SDM and = Optimization Guide SDM. It made sense to me too that graphics frame buffer is written to WC = memory and then is committed to GPU by writing to some doorbell register mapped as UC memory. My personal theory is that the difference is rooted from the fact that = AMD use dedicated CPU registers for WC buffers in contrast to Intel which utilise LFB entries for that = (In addition to their standard role). But this is a pure guess of course. :) >=20 >> I do completely agree we should have this discussion on WC and >> barriers and I already sent an email on this to all the >> memory-barriers.txt maintainers. Waiting to see how that discussion >> go and get community feedback before I will submit a patch-series >> that will introduce new changes to memory-barriers.txt and probably >> also new barrier macro. >=20 > The barrier macros have been unpopular, ie the confusing mmiowb was > dumped, and I strongly suspect to contain WC within a spinlock (which > is very important!) you need a barrier instruction on some archs. >=20 > New accessors might work better. We agree. I just use wrong terminology :) I meant we should probably introduce something as flush_wc_writeX(). >=20 >>>>>> doorbell[0] =3D cpu_to_be32(sn << 28 | cmd | ci); >>>>>> doorbell[1] =3D cpu_to_be32(cq->cqn); >>>>>=20 >>>>>> static inline void mlx5_write64(__be32 val[2], void __iomem = *dest) >>>>>> { >>>>>> #if BITS_PER_LONG =3D=3D 64 >>>>>> - __raw_writeq(*(u64 *)val, dest); >>>>>> + writeq(*(u64 *)val, dest); >>>>>=20 >>>>> I want to say this might cause problems with endian swapping as = writeq >>>>> also does some swaps that __raw does not? Is this true? >>>>=20 >>>> Hmm... Looking at ARM64 version, writeq() indeed calls = cpu_to_le64() >>>> on parameter before passing it to __raw_writeq(). Quite surprising >>>> from API perspective to be honest. >>>=20 >>> For PCI-E devices writel(x) is defined to generate the same TLP on = the >>> PCI-E bus, across all arches. >>=20 >> Good to know. >> Question: Where is this documented? >=20 > Hmm, haven't ever seen it documented like that. It is sort of a > logical requirement. If writel(x) produces different TLPs (ie > different byte order) how could a driver ever work with a PCI-E device > that requires only one TLP for x? Makes sense indeed. >=20 >>>> So should I change this instead to iowrite64be(*(u64 *)val, dest)? >>>=20 >>> This always made my head hurt, but IIRC, when I looked at it years = ago >>> the weird array construction caused problems with that simple = conversion. >>>=20 >>> The userspace version looks like this now: >>>=20 >>> uint64_t doorbell; >>> uint32_t sn; >>> uint32_t ci; >>> uint32_t cmd; >>>=20 >>> sn =3D cq->arm_sn & 3; >>> ci =3D cq->cons_index & 0xffffff; >>> cmd =3D solicited ? MLX5_CQ_DB_REQ_NOT_SOL : = MLX5_CQ_DB_REQ_NOT; >>>=20 >>> doorbell =3D sn << 28 | cmd | ci; >>> doorbell <<=3D 32; >>> doorbell |=3D cq->cqn; >>>=20 >>> mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, = htobe64(doorbell)); >>>=20 >>> Where on all supported platforms the mmio_write64_be() expands to a >>> simple store (no swap) >>>=20 >>> Which does look functionally the same as >>>=20 >>> iowrite64be(doorbell, dest); >>>=20 >>> So this patch should change the mlx5_write64 to accept a u64 like we >>> did in userspace when this was all cleaned there. >>=20 >> If I understand you correctly, you suggest to change callers to pass >> here a standard u64 and then modify mlx5_write64() to just call >> iowrite64be(). If so, I agree. Just want to confirm before sending >> v2. >=20 > Yes, this is what I did to userspace and it made this all make > sense. I strongly recommend to do the same in the kernel, particularly > now that we have iowrite64be()! Ack. Thanks, -Liran >=20 > Jason