Date: Thu, 26 Feb 2026 08:41:09 -0300
From: Marcelo Tosatti
To: Frederic Weisbecker
Cc: Michal Hocko, Leonardo Bras, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo <42.hyeyoo@gmail.com>, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feng, Frederic Weisbecker
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 25, 2026 at 10:49:54PM +0100, Frederic Weisbecker wrote:
> > There are specific parts of a simulation that are intensive, but
> > researchers try to minimize them:
> >
> > I/O Operations: Writing "checkpoints" or large trajectory files to disk
> > (using write()). This is why high-end HPC systems use Asynchronous I/O
> > or dedicated I/O nodes—to keep the compute cores from getting bogged
> > down in system calls.
> >
> > Memory Allocation: Constantly calling malloc/free involves the brk or
> > mmap system calls. Optimized simulation tools pre-allocate all the
> > memory they need at startup to avoid this.
>
> Ok. I asked a similar question and got this (you made me use an LLM for the
> first time btw, I held out for 4 years... I'm sure I can wait 4 more years until
> the next usage :o)

You should use it more often, it can save a significant amount of time :-)

> ### 2. The "Slow Path" (System Calls / Syscalls)
>
> Passing through the kernel (a syscall) is necessary in certain situations, but it is "expensive" because it forces a **context switch**, which flushes CPU caches.
>
> * **Initialization:** During startup (`MPI_Init`), many syscalls are used to create sockets, map shared memory (`mmap`), and configure network interfaces.
> * **Standard TCP/IP:** If you are not using a high-performance network (RDMA) but simple Ethernet instead, MPI must call `send()` and `recv()`, which are syscalls. The Linux kernel then takes over to manage the TCP/IP stack.
> * **Sleep Mode (Blocking):** If an MPI process waits for a message for too long, it may decide to "go to sleep" to yield the CPU to another task via syscalls like `futex()` or `poll()`.
>
> **In summary:** MPI synchronization aims to be **100% User-Space** (via memory polling) to avoid syscall latency. It is precisely because MPI tries to bypass the kernel that we use `nohz_full`: we are asking the kernel not to even "knock on the CPU's door" with its clock interruptions.

Of course, there is a cost to system calls. However, the premise that "low
latency applications must necessarily remain in userspace, therefore let's
optimize only for that case" is limiting IMHO.

We should avoid interruptions whenever possible on isolated CPUs (in
userspace _and_ in kernelspace).
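
For illustration only, here is a minimal sketch of how interruptions become
visible to a polling loop of the kind discussed above. It spins on the
monotonic clock and records the largest gap between two consecutive reads.
The helper name measure_max_gap_ns and its default duration are hypothetical,
not taken from this thread, from the QPW patches, or from any MPI
implementation. It is written in Python for brevity, so interpreter overhead
adds noise of its own; a real measurement would normally be done in C.

```python
import time


def measure_max_gap_ns(duration_s: float = 0.05) -> tuple[int, int]:
    """Spin on the monotonic clock and return (largest gap in ns, iterations).

    Hypothetical helper for illustration only. A large gap means the loop
    was not running for that long (interrupt, preemption, or interpreter
    overhead); it does not say which of these was the cause.
    """
    deadline = time.perf_counter_ns() + int(duration_s * 1e9)
    prev = time.perf_counter_ns()
    max_gap = 0
    iterations = 0
    while prev < deadline:
        now = time.perf_counter_ns()
        gap = now - prev          # time elapsed since the previous read
        if gap > max_gap:
            max_gap = gap         # remember the worst stall seen so far
        prev = now
        iterations += 1
    return max_gap, iterations


if __name__ == "__main__":
    gap, n = measure_max_gap_ns()
    print(f"iterations={n} max_gap={gap} ns")
```

Comparing the reported gap on a housekeeping CPU and on a CPU listed in
`nohz_full` (for example by pinning the process with `taskset -c`) gives a
rough feel for how much the loop is disturbed. The numbers depend heavily on
the machine and kernel configuration.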