From: Uladzislau Rezki
Date: Tue, 17 Oct 2023 16:06:29 +0200
To: Joel Fernandes
McKenney" , RCU , Neeraj upadhyay , Boqun Feng , LKML , Oleksiy Avramchenko , Frederic Weisbecker Subject: Re: [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time Message-ID: References: <20231016173004.14148-1-urezki@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Joel! > Hello, Vlad, > > Looks like a nice patch, I am still looking at this more and just > started, but a few small comments: > Thank you for looking at it :) And thank you for the comments. > On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony) > wrote: > > > > A call to a synchronize_rcu() can be optimized from time point of > > view. Different workloads can be affected by this especially the > > ones which use this API in its time critical sections. > > > > For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu() > > callback can be delayed and such delay depends on where in a nocb > > list it is located. > > > > 1. On our Android devices i can easily trigger the scenario when > > it is a last in the list out of ~3600 callbacks: > > > > > > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28 > > ... > > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt > > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt > > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt > > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt > > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt > > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt > > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=.... > > > > > > 2. We use cpuset/cgroup to classify tasks and assign them into > > different cgroups. For example "backgrond" group which binds tasks > > only to little CPUs or "foreground" which makes use of all CPUs. > > Tasks can be migrated between groups by a request if an acceleration > > is needed. > > > > See below an example how "surfaceflinger" task gets migrated. > > Initially it is located in the "system-background" cgroup which > > allows to run only on little cores. In order to speed it up it > > can be temporary moved into "foreground" cgroup which allows > > to use big/all CPUs: > > > [...] > > 3. To address this drawback, maintain a separate track that consists > > of synchronize_rcu() callers only. The GP-kthread, that drivers a GP > > either wake-ups a worker to drain all list or directly wakes-up end > > user if it is one in the drain list. > > I wonder if there is a cut-off N, where waking up N ("a few") inline > instead of just 1, yields similar results. For small values of N, that > keeps the total number of wakeups lower than pushing off to a kworker. > For instance, if 2 users, then you get 3 wakeups instead of just 2 (1 > for the kworker and another 2 for the synchronize). But if you had a > cut off like N=5, then 2 users results in just 2 wakeups. 
> [...]
> > + * There are three lists for handling synchronize_rcu() users.
> > + * The first list is for newly arrived users, the second is for
> > + * users waiting for a grace period, and the third is for users
> > + * whose grace period has already passed.
> > + */
> > +static struct sr_normal_state {
> > +	struct llist_head curr;	/* request a GP users. */
> > +	struct llist_head wait;	/* wait for GP users. */
> > +	struct llist_head done;	/* ready for GP users. */
> > +	struct llist_node *curr_tail;
> > +	struct llist_node *wait_tail;
>
> Just for clarification, the only reason you need 'tail' is because
> you can use llist_add_batch() to quickly splice the list, instead of
> finding the tail each time (expensive for a large list), correct? It
> would be good to mention that in a comment here.
>
Right. I do not want to scan the list. I will add a comment about it.

> > +	atomic_t active;
> > +} sr;
> > +
> > +/* Disable it by default. */
> > +static int rcu_normal_wake_from_gp = 0;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > +	struct rcu_synchronize *rs = container_of(
> > +		(struct rcu_head *) node, struct rcu_synchronize, head);
> > +	unsigned long oldstate = (unsigned long) rs->head.func;
> > +
> > +	if (!poll_state_synchronize_rcu(oldstate))
> > +		WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> > +			rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
>
> nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate), ...) and get
> rid of the if()?
>
Initially I had it written the way you propose. I reworked it because
I wanted to see the delta. I do not have a strong opinion, so I can
fix it as you suggest.

> > +
> > +	/* Finally. */
> > +	complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +	struct llist_node *done, *rcu, *next;
> > +
> > +	done = llist_del_all(&sr.done);
> > +	if (!done)
> > +		return;
> > +
> > +	llist_for_each_safe(rcu, next, done)
> > +		rcu_sr_normal_complete(rcu);
> > +}
> [...]
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > +	atomic_inc(&sr.active);
> > +	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > +		/* Set the tail. Only the first user can do that. */
> > +		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > +	atomic_dec(&sr.active);
>
> Here there is no memory ordering provided by the atomic ops. Is that
> really OK?
>
This needs to be reworked, since there is no ordering guaranteed. I
think there is a version of "atomic_inc_something" that guarantees it?
See the sketch at the end of this mail.

> And same question for other usages of .active.
>
> Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
> "RMW operations that have no return value are unordered;"
>
> --
> Note to self to ping my Android friends as well about this
> improvement :-). Especially the -mm Android folks are interested in
> app startup time.
>
That is good. Thank you :)

--
Uladzislau Rezki
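P.S. Regarding the ordering of ->active, below is a rough, untested
sketch of what I mean. Whether a full barrier is really required
depends on what the readers of ->active pair with, which I still have
to audit:

static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
{
	/*
	 * Per atomic_t.txt, atomic_inc()/atomic_dec() are unordered.
	 * The smp_mb__{before,after}_atomic() helpers upgrade them to
	 * fully ordered; a value-returning variant such as
	 * atomic_inc_return() would also do.
	 */
	atomic_inc(&sr.active);
	smp_mb__after_atomic();	/* Increment is visible before llist_add(). */

	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
		/* Set the tail. Only the first user can do that. */
		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);

	smp_mb__before_atomic();	/* llist_add() is visible before decrement. */
	atomic_dec(&sr.active);
}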