From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 760CCC433F5 for ; Tue, 26 Oct 2021 22:28:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 43BAD60D07 for ; Tue, 26 Oct 2021 22:28:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240010AbhJZWa2 (ORCPT ); Tue, 26 Oct 2021 18:30:28 -0400 Received: from mail-dm6nam11on2061.outbound.protection.outlook.com ([40.107.223.61]:50913 "EHLO NAM11-DM6-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S240004AbhJZWaW (ORCPT ); Tue, 26 Oct 2021 18:30:22 -0400 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=BbYXoOwStlM7KlSKkNo/s4vN3J8htn87208LXP9cEb02UIMlvRuaUmsxePGGlTb07m/e1rk0kUcFD5+chhO3aTQjDX90yZHL3u9Bo44fOyMJ1FNmQ5g+2X/P5wbfB3LvbbhbAk/LyKGPIoNe8xBw0/orLickq+oc56AYZJQ4CnOM1aYs1BKOrt0LrjLYtG5ShekwnGH8W+ZSe2MzvLWITE5gbNHfWBCNgl8aBOS/SrNj5MW+jHA53K5ghbS9HURB5VGBIImBGufcztr9WFZcD6hehnOQm62Le7JAJCKH6lMJHlXqyKwZefM6BSOcr3hSg/jvbuCYqNNCMUawDSDjdA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=4F13YeJncWc0fUDKQ/xRO06nmgAryDTRMTLNRusc4Is=; b=c61PKLgdK2n8Si63u2w7SatZAr1NVZ9ADCzkeWHMyIIYxLjjshp4cecWpT/cg13DCVz2bpJXzqliyF+MtK8wwW9BEEYvde/LVH1yC9WU+xVdeOb81nWA86cFMX1xAyRwKYptyD5Q4BJ1nKAeeFrJIRpjmjS0pN+5b+rrtRkRW4wgQcWZorbrlE3ZqagZVd1rXgubm9VlnPFF01Fc8Jc/PzTvqbypIlKeBnhxy4V5+QM+T/ZumatAwsBnI60bu+GXN9cO9YhFBCBQHP6khnE4/CXyDlTdauPCsPuKuW2WJ+pt4GTEzSqrKUY7FqEbq7p2zlBRJ/Haweqa5q23tvYYVQ== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com; dkim=pass header.d=nvidia.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=4F13YeJncWc0fUDKQ/xRO06nmgAryDTRMTLNRusc4Is=; b=pcJl5nPU3/WlHrlJccw8+qI4lc0lB3N3WRRs8V/PfegWBvN1oC+oG3DI8+ASOUKgO3nOPI8gTPBh3fffy8hCQwEU5uRDcqhzd2+/MGnZryBm9IGQxaqTjnUM2AfswUuHUEYyQa9zy+uPZBUv0Q3Nf3TfIBqxaXEswgdz82qF52vRcVJRZqA3EKKKmb4pMGLS/3WIRSfIHoR34guCXOTJNL0NiehZMOkDIT/npRYmMOho0To64FQ4M+Br1ZYqYl7/HdjIX00axEWrQaO2ruJdLJmMx1zZWMPO9aoQ3W4HmUOkTP8suNbecWoliS5tR0GOZr2mQgqwmR3YLPhU1T7gyw== Authentication-Results: nxp.com; dkim=none (message not signed) header.d=none;nxp.com; dmarc=none action=none header.from=nvidia.com; Received: from DM4PR12MB5278.namprd12.prod.outlook.com (2603:10b6:5:39e::17) by DM4PR12MB5309.namprd12.prod.outlook.com (2603:10b6:5:39d::12) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4628.18; Tue, 26 Oct 2021 22:27:56 +0000 Received: from DM4PR12MB5278.namprd12.prod.outlook.com ([fe80::513c:d3d8:9c43:2cea]) by DM4PR12MB5278.namprd12.prod.outlook.com ([fe80::513c:d3d8:9c43:2cea%8]) with mapi id 15.20.4628.020; Tue, 26 Oct 2021 22:27:56 +0000 Message-ID: Date: Wed, 27 Oct 2021 01:27:47 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.1.0 Subject: Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Content-Language: en-US To: Vladimir Oltean Cc: "netdev@vger.kernel.org" , Roopa Prabhu , Ido Schimmel , Jakub Kicinski , "David S. Miller" , Andrew Lunn , Florian Fainelli , Vivien Didelot , Vladimir Oltean , Jiri Pirko References: <20211025222415.983883-1-vladimir.oltean@nxp.com> <531e75e8-d5d1-407b-d665-aec2a66bf432@nvidia.com> <20211026112525.glv7n2fk27sjqubj@skbuf> <1d9c3666-4b29-17e6-1b65-8c64c5eed726@nvidia.com> <20211026165424.djjy5xludtcqyqj2@skbuf> <20211026190136.jkxyqi6b7f4i2bfe@skbuf> <20211026215153.lpdk66rjvsodmxto@skbuf> From: Nikolay Aleksandrov In-Reply-To: <20211026215153.lpdk66rjvsodmxto@skbuf> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-ClientProxiedBy: ZR0P278CA0057.CHEP278.PROD.OUTLOOK.COM (2603:10a6:910:21::8) To DM4PR12MB5278.namprd12.prod.outlook.com (2603:10b6:5:39e::17) MIME-Version: 1.0 Received: from [10.21.241.50] (213.179.129.39) by ZR0P278CA0057.CHEP278.PROD.OUTLOOK.COM (2603:10a6:910:21::8) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4628.22 via Frontend Transport; Tue, 26 Oct 2021 22:27:53 +0000 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 500b078d-bab1-4a00-b773-08d998cfdbb7 X-MS-TrafficTypeDiagnostic: DM4PR12MB5309: X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:10000; X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: GSDH3lWnfnDBUIKh80VZAEmpvsL++Z01lnP4rAKrAXOy/focuaHkqtibANqqxOSTD9cMBc7TbGNQsWZx511Yq7HajT6LXy79hAf0LT5cB2qlr8MjK0o+o/g48QGrfAzrSHnGAIdWbaWTLeeTT03CUnIGXo1al7KiQBWFcDtJ+a+/kcvchRUaYkbmfMnRdIyaDk0UPyOl1cuclRqmWeUBpD+gmWgB13SO/jppzU4Ct/bDyYd4N2AE1gWO97Di7EItLo6BhScIf1rx71SxE5kjFUfn5Squ9DHDS8SFXM4wgWDWUH8ADLeDpqwGP5aGbYYt8hylIk4oXacn9hoQyK+ucW02igw4kENPvtG81+6lZim7qSuHBQJJm71bYlFFAV7iz+mkkeg0B3LSuL16Ec+DMguaUOZmERl83pZzmr1lTvqnClR7+ykYBcxPHJt0BO5IKhqp65/prKzjDtIT6Ug5QxLZOGZhaCPGNHFTgi/grKQxgLiuCwtBFvTPaNk0X6Tg11KsZ2MX2FfXRZh+l5+3jDndKKjAAGYT9P520R2nj+2LH/hDElHNWLYcHo0ILHWxWtV5F2safKT45Ja18iwsP4kmKWIeXxHxq6M8iIVLH5mAuUxV6M75U2RHk4ppauC2NkazT+xXtR7Af+iwvbdNgmqzYmhdNdz370Vq86hvBKNLOzdsXz/ry7ntKVAEHTR3d7w//+FLRd+yRltGrie8ywMGmXrBNjrk48GNO5KT3Fk= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:DM4PR12MB5278.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(4636009)(366004)(6486002)(107886003)(508600001)(6916009)(53546011)(316002)(31686004)(36756003)(6666004)(26005)(4326008)(38100700002)(66946007)(8936002)(8676002)(83380400001)(5660300002)(186003)(30864003)(16576012)(2616005)(956004)(2906002)(31696002)(66556008)(66476007)(86362001)(54906003)(45980500001)(43740500002);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?Q2dKTVBJbUVvMURsUnZWT2VQUDlVUU9oZFJURUowRGY4d2E1N3Y3SU51U2lT?= =?utf-8?B?U3g3OUJRRVdCZENWS25HMnhlSGlrZE5TRmt3ZkhkbHArYmdQU2dFdWtpZ0V2?= =?utf-8?B?V2xUVzFaYys4dXVtcTBaQ3RxaytTMFE1VWZNc3Q4RTY1VlJ3L1dZNHRiN0tm?= =?utf-8?B?ZHJWd0pGdE1sQ2tRSnNyb2ZVVmlHbG85Um5xQlZMVjVlTkhuYytkdnJxME9k?= =?utf-8?B?K3VINEhZYTNtWjlNNzBzbUJrekprL0JhZFlDQ2Z3dzFyYU9hUFBOZkNoaWRi?= =?utf-8?B?NUVmbE5IWEl1RzBmZ3BWMWVKUzU3NGxROEZWMzJTaGM0RkQxd3lheWNIRk5X?= =?utf-8?B?b2J1akVXOC9BeVV3V3d1QTdGaFRNeG50aVNGcUtwMXlHRFpaTmpCazJqM3c3?= =?utf-8?B?R3lqTnErUzNzVXYyQ3dWdVNPeVhLT0c4ekJ2L1J1TitHaHdwUXhJVkFGMVcw?= =?utf-8?B?MVcrZVlPRlBzYThNUm83dUFyR2grNFY3SE5jeDJjSklzQnJXSkhSUU5BTG0w?= =?utf-8?B?bXEzM1RIOFRZbC9RYzFoemgrMlBDS0UxZ2lxR00vcU9lV2hZbnRWclV4Y09l?= =?utf-8?B?MHdWRDdaREtNd05ZeUNaYlpjNmloV3A0YXRpSWxPTFlSL1Z6YXdRTGR6cllj?= =?utf-8?B?WVR6ejdGZmd5RTNwWkRoS1FmOVZvK1B2Yk1TMmU5TjFZcE5xeTg2bmZVWkVU?= =?utf-8?B?QW95MitiTzlsVDViMnZzQTlHdFhEZUlCbnIrcXdvWDVjVGZtR1ZnZEQxMU83?= =?utf-8?B?RWp0MkQzZ3oyNkJKTStscFd1UWYwdDhza2x1dno3SHl1ZlJOLzNQRGxwUDJV?= =?utf-8?B?Z1h2cFV4UXFyYWF2TWN2eUF4Z1A3SnFncU9vOUhOMnZtRDhjay9lK255SGYw?= =?utf-8?B?WHVRcGhXU2dFQUkreHFsanNTdFN3L0lIbGx3K0k2aUxsdTU2dDBkVjlLOXEw?= =?utf-8?B?SVJoUUVBZnJCMzlkMHo4S3FqZzBrUFczcWlqL25HQ2hIVVZQbFBzd0NsVHNR?= =?utf-8?B?engzZjlrUm9rM3JYUmlIamd6bW5NN1pydnlWQXFtM0NQMWRuMFFyMy9jTW1P?= =?utf-8?B?L1Z3T1Q2bUUvY0Ywa3NITGMxb2ZxeU9Mb1VYYlZOcHFYT2dBY2YyWUc1Y2Rm?= =?utf-8?B?SUloUjVHME16VE5QUkcwaFJPUmVLbmtPTmFBKzY3SWNYVUd6WU5xUGUrYzEv?= =?utf-8?B?NU8vYXNYdjVYdXYrRWs5c0NvUEZidEFLeFZ0UFYyb0RQb0NyWi9sMVFoNTNj?= =?utf-8?B?Y3c3NUVqOVpQV2V1endMdnpRcDhjQ3dzQ1l6WUQ5bTNTbEpqMjd2UVhvUEFV?= =?utf-8?B?QUhNcWxjMG1Jd1RrN2RoSnpGaGFuQTZqanFvWWtIazJmNWhkRDM0c0kwM1Fl?= =?utf-8?B?YkwwbmZTMGFFakhrQXVnb0RCdE05OHpBNjZuZ2Z2UHBsTzQzRUQ2VDY4VUxZ?= =?utf-8?B?Uk9kNStyRnpNZnc1aHFaYmJJandONTZDeTZicmVJSkxqVFVqRWg2dE8wUE1O?= =?utf-8?B?REo2NlVUYVpkZnNsL0VFU0Y2dk1SU0dvOFhYYi9Kdk1CNW5UYWxxc3dhY1I5?= =?utf-8?B?UGQ2S3NYMCs3b25lY1Y2MFdVRmlwWWRjVG9veDRBRFBpSHdJSktNQytVUFpn?= =?utf-8?B?bFRwYVJ6ZzFJeVB4MFpGUzNBL3lTRWNSS0NCSjJXcE9ZbHEyeVFmTW9yMUpT?= =?utf-8?B?bXExeXRtV1JCWGozay83ME5Fb05sU3F6VFljdHlRZXF4Yk5pRzNLY2NlaFFh?= =?utf-8?B?MnJwMWZVaTlISWwvRnBVU1dLeXZER0hVRDJ5S0crUTdHYWVqZEFWcnc3YXVJ?= =?utf-8?B?K1dSazI4OFkrcktoZjlNMURCbkI5STFGanhxei9sZlR0QWF0N2tWSkxJdHQx?= =?utf-8?B?Vkp4MDlBN0t5UTk5SnVVR1R2Q080bm1QbDVkdFJqZ1lzcWNvV0tvTFVBemNL?= =?utf-8?B?NE82NERQc0hlMWhMSXc5bGxSMThqSjVtTC83QUx0OXAxQ21kZFBaWnJyUXU2?= =?utf-8?B?V1pPVHVrV2FCSzgwZWJselFhUU1sS3dMd3hTallqUHlqMnk2YUZrbEVxaStY?= =?utf-8?B?YUc5N2xDQkZxUSt5YmJVS1QwL082VjRHVTdLR0dVY1owUWZ4QmlmU1dRbXBC?= =?utf-8?B?U0JscjZpeTdoTElvRU15UVV5dFdQZmREb0JWeDArSk5sMk9HMFArMFpxeTdM?= =?utf-8?Q?8hYd3KLyXPlkJQDKRiEeiwo=3D?= X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-Network-Message-Id: 500b078d-bab1-4a00-b773-08d998cfdbb7 X-MS-Exchange-CrossTenant-AuthSource: DM4PR12MB5278.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 26 Oct 2021 22:27:56.3539 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: VuHEN5ODBJ2gRgl729e37cmQjnu/snRbLClXV6ht7VhsrR+y2EnQr2zYJfQMivQf+1fWv1GnK37kDUoghjF9Zw== X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM4PR12MB5309 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On 27/10/2021 00:51, Vladimir Oltean wrote: > On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote: >> On 26/10/2021 22:01, Vladimir Oltean wrote: >>> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote: >>>> On 26/10/2021 19:54, Vladimir Oltean wrote: >>>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote: >>>>>> On 26/10/2021 14:25, Vladimir Oltean wrote: >>>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote: >>>>>>>> Hi, >>>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over >>>>>>>> the patch-set and given that we'll have to support and maintain this fragile >>>>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm >>>>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex >>>>>>>> with delayed learning from the fast-path to get a sleepable context where we can >>>>>>>> use synchronous switchdev calls and get feedback immediately. >>>>>>> >>>>>>> Delayed learning means that we'll receive a sequence of packets like this: >>>>>>> >>>>>>> br0--------\ >>>>>>> / \ \ >>>>>>> / \ \ >>>>>>> / \ \ >>>>>>> swp0 swp1 swp2 >>>>>>> | | | >>>>>>> station A station B station C >>>>>>> >>>>>>> station A sends request to B, station B sends reply to A. >>>>>>> Since the learning of station A's MAC SA races with the reply sent by >>>>>>> station B, it now becomes theoretically possible for the reply packet to >>>>>>> be flooded to station C as well, right? And that was not possible before >>>>>>> (at least assuming an ageing time longer than the round-trip time of these packets). >>>>>>> >>>>>>> And that will happen regardless of whether switchdev is used or not. >>>>>>> I don't want to outright dismiss this (maybe I don't fully understand >>>>>>> this either), but it seems like a pretty heavy-handed change. >>>>>>> >>>>>> >>>>>> It will depending on lock contention, I plan to add a fast/uncontended case with >>>>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general >>>>> >>>>> I wonder why mutex_trylock has this comment? >>>>> >>>>> * This function must not be used in interrupt context. The >>>>> * mutex must be released by the same task that acquired it. >>>>> >>>>>> you are correct that the traffic could get flooded in the queue case before the delayed >>>>>> learning processes the entry, it's a trade off if we want sleepable learning context. >>>>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees >>>>>> that the reply won't get flooded there are other methods to achieve that (ucast flood >>>>>> disable, firewall rules etc). >>>>> >>>>> Not all hardware is like that, the switches I'm working with, which >>>>> perform autonomous learning, all complete the learning process for a >>>>> frame strictly before they start the forwarding process. The software >>>>> bridge also behaves like that. My only concern is that we might start >>>>> building on top of some fundamental bridge changes like these, which >>>>> might risk a revert a few months down the line, when somebody notices >>>>> and comes with a use case where that is not acceptable. >>>>> >>>> >>>> I should've clarified I was talking about Spectrum as Ido did in a reply. >>> >>> Ido also said "I assume Spectrum is not special in this regard" and I >>> just wanted to say this such that we don't end with the wrong impression. >>> Special or not, to the software bridge it would be new behavior, which >>> I can only hope will be well received. >>> >>>>>> Today the reply could get flooded if the entry can't be programmed >>>>>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely >>>>>> but still there haven't been any such guarantees. I think it's generally a good improvement and >>>>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed >>>>>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce >>>>>> and reason about in the future. >>>>> >>>>> You're the maintainer, I certainly won't complain if we go down this path. >>>>> It would be nice if br->lock can also be transformed into a mutex, it >>>>> would make all of switchdev much simpler. >>>>> >>>> >>>> That is why we are discussing possible solutions, I don't want to force anything >>>> but rather reach a consensus about the way forward. There are complexities involved with >>>> moving to delayed learning too, one would be that the queue won't be a simple linked list >>>> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries, >>>> we also may end up doing two stage lookup if the main hash table doesn't find an entry >>>> to close the above scenario's window as much as possible. But as I said I think that we can get >>>> these correct and well defined, after that we'll be able to reason about future changes and >>>> use cases easier. I'm still thinking about the various corner cases with delayed learning, so >>>> any feedback is welcome. I'll start working on it in a few days and will get a clearer >>>> view of the issues once I start converting the bridge, but having straight-forward locking >>>> rules and known deterministic behaviour sounds like a better long term plan. >>> >>> Sorry, I might have to read the code, I don't want to misinterpret your >>> idea. With what you're describing here, I think that what you are trying >>> to achieve is to both have it our way, and preserve the current behavior >>> for the common case, where learning still happens from the fast path. >>> But I'm not sure that this is the correct goal, especially if you also >>> add "straightforward locking rules" to the mix. >>> >> >> The trylock was only an optimization idea, but yes you'd need both synchronous >> and asynchronous notifiers. I don't think you need sleepable context when >> learning from softirq, who would you propagate the error to? It is important >> when entries are being added from user-space, but if you'd like to have veto >> option from softirq then we can scratch the trylock idea altogether. > > I'll let Ido answer here. As I said, the model I'm working with is that > of autonomous learning, so for me, no. Whereas the Spectrum model is > that of secure learning. I expect that it'd be pretty useless to set up > software assisted secure learning if you're just going to say yes and > learn all addresses anyway. I've never seen Spectrum documentation, but > I would be shocked if it wouldn't be able to be configured to operate in > the bare-bones autonomous learning mode too. > Ack, got it. >> Let's not focus on the trylock, it's a minor detail. >> >>> I think you'd have to explain what is the purpose of the fast path trylock >>> logic you've mentioned above. So in the uncontended br->hash_lock case, >>> br_fdb_update() could take that mutex and then what? It would create and >>> add the FDB entry, and call fdb_notify(), but that still can't sleep. >>> So if switchdev drivers still need to privately defer the offloading >>> work, we still need some crazy completion-based mechanism to report >>> errors like the one I posted here, because in the general sense, >>> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic. >> >> We do not if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC, > > Just when I was about to say that I'm happy to get rid of some of those > workqueues, and now you're telling me not only we might not get rid of > them, but we might also end up with a second, separate implementation :) > > Anyway, let's not put the cart before the horses, let's see what the > realities of the bridge data path learning and STP flushing will teach > us about what can and can't be done. > We will get rid of some workqueues (I hope), I was saying that if do the trylock we might have to add both sync and async, otherwise we just need a single sync one. >> but that optimization is probably not worth the complexity of >> maintaining both so we can just drop the trylock. >> >>> >>> And how do you queue a deletion, do you delete the FDB from the pending >>> and the main hash table, or do you just create a deletion command to be >>> processed in deferred context? >>> >> >> All commands which cannot take the mutex directly will be executed from a delayed >> queue. We also need a delayed flush operation because we need to flush from STP code >> and we can't sleep there due to the STP spinlock. The pending table can be considered >> only if we decide to do a 2-stage lookup, it won't be used or consulted in any other >> case, so user-space adds and deletes go only the main table. >> >>> And since you'd be operating on the hash table concurrently from the >>> deferred work and from the fast path, doesn't this mean you'll need to >>> use some sort of spin_lock_bh from the deferred work, which again would >>> incur atomic context when you want to notify switchdev? Because with the >>> mutex_trylock described above, you'd gain exclusivity to the main hash >>> table, but if you lose, what you need is exclusivity to the pending hash >>> table. So the deferred context also needs to be atomic when it dequeues >>> the pending FDB entries and notifies them. I just don't see what we're >>> winning, how the rtnetlink functions will be any different for the better. >>> >> >> The rbtree can be fully taken by the delayed action and swapped with a new one, >> so no exclusivity needed for the notifications. It will take the spinlock, get >> the whole tree and release the lock, same if it was a simple queue. > > I'm quite unfamiliar with this technique, atomically swapping a queue > pointer with a new empty one, and emptying that queue while unlocked. > Do you have any reference implementation for this kind of technique? > the delayed work would be doing something similar to: spin_lock(delay_lock); record tree root rcu change tree root with empty spin_unlock(delay_lock); mutex_lock(br->hash_lock); process recorded (old) tree and free items via rcu mutex_unlock(br->hash_lock); We have the same pattern with queues all around the kernel. >> The spinlock for the rbtree for the delayed learning is necessary in all cases, > > "in all cases" means "regardless of whether we try from the fast path to > make fdb_create() insert directly into &br->fdb_hash_tbl, or if we > insert into a temporary rbtree for pending entries"? I don't understand > this: why would you take the rbtree spinlock if you've inserted into the > main hash table already? > No, it means that we need the spinlock to protect the delayed queue. > I'm only concerned about spin locks we'd have to hold while calling > fdb_notify(). > There won't be any spinlocks for fdb_notify(), we'll be doing all notifications from sleepable context with the mutex held, that's the goal at least. >> if we used the trylock fast learn then we could directly insert the entry if we win, but >> again lets just always use delayed ops from atomic contexts as a start. >> >> All entries from atomic contexts will be added to an rbtree which will be processed >> from a delayed work, it will be freed by RCU so lookups can be done if we decide to >> do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you >> described above. >> >> We win having sleepable context for user-space calls and being able to do synchronous >> calls to the drivers to wait for the errors. > > I think I'd really have to see the code at this point, sorry. > Sure, I'll prepare an RFC version this week. Thanks for the comments.