References: <20240806-openfast-v2-1-42da45981811@kernel.org>
 <87ikwdtqiy.fsf@linux.intel.com>
 <44862ec7c85cdc19529e26f47176d0ecfc90d888.camel@kernel.org>
From: Mateusz Guzik
Date: Thu, 15 Aug 2024 17:07:29 +0200
Subject: Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too
To: Andi Kleen
Cc: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
 Andrew Morton, Josef Bacik, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org

On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen wrote:
>
> > Before I get to the vfs layer, there is a significant loss in the
> > memory allocator because of memcg -- it takes several irq off/on trips
> > for every alloc (needed to grab struct file *). I have a plan what to
> > do with it (handle stuff with local cmpxchg (note no lock prefix)),
> > which I'm trying to get around to. Apart from that you may note the
> > allocator fast path performs a 16-byte cmpxchg, which is again dog
> > slow and executes twice (once for the file obj, another time for the
> > namei buffer). Someone(tm) should patch it up and I have some vague
> > ideas, but 0 idea when I can take a serious stab.
>
> I just LBR sampled it on my skylake and it doesn't look
> particularly slow. You see the whole massive block including CMPXCHG16
> gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
> a missing cache line.
>
> kmem_cache_free:
> ffffffff9944ce20  nop          %edi, %edx
> ffffffff9944ce24  nopl         %eax, (%rax,%rax,1)
> ffffffff9944ce29  pushq        %rbp
> ffffffff9944ce2a  mov          %rdi, %rdx
> ffffffff9944ce2d  mov          %rsp, %rbp
> ffffffff9944ce30  pushq        %r15
> ffffffff9944ce32  pushq        %r14
> ffffffff9944ce34  pushq        %r13
> ffffffff9944ce36  pushq        %r12
> ffffffff9944ce38  mov          $0x80000000, %r12d
> ffffffff9944ce3e  pushq        %rbx
> ffffffff9944ce3f  mov          %rsi, %rbx
> ffffffff9944ce42  and          $0xfffffffffffffff0, %rsp
> ffffffff9944ce46  sub          $0x10, %rsp
> ffffffff9944ce4a  movq         %gs:0x28, %rax
> ffffffff9944ce53  movq         %rax, 0x8(%rsp)
> ffffffff9944ce58  xor          %eax, %eax
> ffffffff9944ce5a  add          %rsi, %r12
> ffffffff9944ce5d  jb           0xffffffff9944d1ea
> ffffffff9944ce63  mov          $0xffffffff80000000, %rax
> ffffffff9944ce6a  xor          %r13d, %r13d
> ffffffff9944ce6d  subq         0x17b068c(%rip), %rax
> ffffffff9944ce74  add          %r12, %rax
> ffffffff9944ce77  shr          $0xc, %rax
> ffffffff9944ce7b  shl          $0x6, %rax
> ffffffff9944ce7f  addq         0x17b066a(%rip), %rax
> ffffffff9944ce86  movq         0x8(%rax), %rcx
> ffffffff9944ce8a  test         $0x1, %cl
> ffffffff9944ce8d  jnz          0xffffffff9944d15c
> ffffffff9944ce93  nopl         %eax, (%rax,%rax,1)
> ffffffff9944ce98  movq         (%rax), %rcx
> ffffffff9944ce9b  and          $0x8, %ch
> ffffffff9944ce9e  jz           0xffffffff9944cfea
> ffffffff9944cea4  test         %rax, %rax
> ffffffff9944cea7  jz           0xffffffff9944cfea
> ffffffff9944cead  movq         0x8(%rax), %r14
> ffffffff9944ceb1  test         %r14, %r14
> ffffffff9944ceb4  jz           0xffffffff9944cfac
> ffffffff9944ceba  cmp          %r14, %rdx
> ffffffff9944cebd  jnz          0xffffffff9944d165
> ffffffff9944cec3  test         %r14, %r14
> ffffffff9944cec6  jz           0xffffffff9944cfac
> ffffffff9944cecc  movq         0x8(%rbp), %r15
> ffffffff9944ced0  nopl         %eax, (%rax,%rax,1)
> ffffffff9944ced5  movq         0x1fe5134(%rip), %rax
> ffffffff9944cedc  test         %r13, %r13
> ffffffff9944cedf  jnz          0xffffffff9944ceef
> ffffffff9944cee1  mov          $0xffffffff80000000, %rax
> ffffffff9944cee8  subq         0x17b0611(%rip), %rax
> ffffffff9944ceef  add          %rax, %r12
> ffffffff9944cef2  shr          $0xc, %r12
> ffffffff9944cef6  shl          $0x6, %r12
> ffffffff9944cefa  addq         0x17b05ef(%rip), %r12
> ffffffff9944cf01  movq         0x8(%r12), %rax
> ffffffff9944cf06  mov          %r12, %r13
> ffffffff9944cf09  test         $0x1, %al
> ffffffff9944cf0b  jnz          0xffffffff9944d1b1
> ffffffff9944cf11  nopl         %eax, (%rax,%rax,1)
> ffffffff9944cf16  movq         (%r13), %rax
> ffffffff9944cf1a  movq         %rbx, (%rsp)
> ffffffff9944cf1e  test         $0x8, %ah
> ffffffff9944cf21  mov          $0x0, %eax
> ffffffff9944cf26  cmovz        %rax, %r13
> ffffffff9944cf2a  data16 nop
> ffffffff9944cf2c  movq         0x38(%r13), %r8
> ffffffff9944cf30  cmp          $0x3, %r8
> ffffffff9944cf34  jnbe         0xffffffff9944d1ca
> ffffffff9944cf3a  nopl         %eax, (%rax,%rax,1)
> ffffffff9944cf3f  movq         0x23d6f72(%rip), %rax
> ffffffff9944cf46  mov          %rbx, %rdx
> ffffffff9944cf49  sub          %rax, %rdx
> ffffffff9944cf4c  cmp          $0x1fffff, %rdx
> ffffffff9944cf53  jbe          0xffffffff9944d03a
> ffffffff9944cf59  movq         (%r14), %rax
> ffffffff9944cf5c  addq         %gs:0x66bccab4(%rip), %rax
> ffffffff9944cf64  movq         0x8(%rax), %rdx
> ffffffff9944cf68  cmpq         %r13, 0x10(%rax)
> ffffffff9944cf6c  jnz          0xffffffff9944d192
> ffffffff9944cf72  movl         0x28(%r14), %ecx
> ffffffff9944cf76  movq         (%rax), %rax
> ffffffff9944cf79  add          %rbx, %rcx
> ffffffff9944cf7c  cmp          %rbx, %rax
> ffffffff9944cf7f  jz           0xffffffff9944d1ba
> ffffffff9944cf85  movq         0xb8(%r14), %rsi
> ffffffff9944cf8c  mov          %rcx, %rdi
> ffffffff9944cf8f  bswap        %rdi
> ffffffff9944cf92  xor          %rax, %rsi
> ffffffff9944cf95  xor          %rdi, %rsi
> ffffffff9944cf98  movq         %rsi, (%rcx)
> ffffffff9944cf9b  leaq         0x2000(%rdx), %rcx
> ffffffff9944cfa2  movq         (%r14), %rsi
> ffffffff9944cfa5  cmpxchg16bx  %gs:(%rsi)
> ffffffff9944cfaa  jnz          0xffffffff9944cf59
> ffffffff9944cfac  movq         0x8(%rsp), %rax
> ffffffff9944cfb1  subq         %gs:0x28, %rax
> ffffffff9944cfba  jnz          0xffffffff9944d1fc
> ffffffff9944cfc0  leaq         -0x28(%rbp), %rsp
> ffffffff9944cfc4  popq         %rbx
> ffffffff9944cfc5  popq         %r12
> ffffffff9944cfc7  popq         %r13
> ffffffff9944cfc9  popq         %r14
> ffffffff9944cfcb  popq         %r15
> ffffffff9944cfcd  popq         %rbp
> ffffffff9944cfce  retq         # PRED 38 cycles [126] 2.74 IPC    <-------------

Sorry for the late reply, my test box was temporarily unavailable and
then I forgot about this e-mail :)

I don't have a good scientific test(tm) and I don't think coming up
with one is warranted at the moment.

But to illustrate, I slapped together a test case for will-it-scale
where I do either an 8-byte or a 16-byte cmpxchg in a loop. No lock
prefix on these.

On Sapphire Rapids I see well over twice the throughput for the 8-byte
variant:

# ./cmpxchg8_processes
warmup
min:481465497 max:481465497 total:481465497
min:464439645 max:464439645 total:464439645
min:461884735 max:461884735 total:461884735
min:460850043 max:460850043 total:460850043
min:461066452 max:461066452 total:461066452
min:463984473 max:463984473 total:463984473
measurement
min:461317703 max:461317703 total:461317703
min:458608942 max:458608942 total:458608942
min:460846336 max:460846336 total:460846336
[snip]

# ./cmpxchg16b_processes
warmup
min:205207128 max:205207128 total:205207128
min:205010535 max:205010535 total:205010535
min:204877781 max:204877781 total:204877781
min:204163814 max:204163814 total:204163814
min:204392000 max:204392000 total:204392000
min:204094222 max:204094222 total:204094222
measurement
min:204243282 max:204243282 total:204243282
min:204136589 max:204136589 total:204136589
min:203504119 max:203504119 total:203504119

So I would say trying it out in a real alloc is worth looking at.

Of course the 16-byte variant is not used just for kicks, so going to
8 bytes is more involved than just replacing the instruction.

The current code follows the standard approach to the ABA problem:
apart from replacing a pointer, you validate that the list is still in
the state you thought it was by checking a counter in the same
instruction.

I note that in the kernel we can do better, but I don't have all the
kinks worked out yet.
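For readers unfamiliar with the counter trick mentioned above, here is a
minimal userspace sketch of it. It is not the SLUB code and not the
scheme proposed in this mail. To stay portable it packs a 32-bit slot
index and a 32-bit counter into one 64-bit word, which is an assumption
made purely for the example (the allocator pairs a full pointer with a
counter, hence the 16-byte instruction). All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 4
#define NIL    UINT32_MAX

/* Hypothetical freelist: next_free[i] is the slot that follows slot i. */
static _Atomic uint32_t next_free[NSLOTS];

/* Head word: low 32 bits = first free slot, high 32 bits = change counter. */
static _Atomic uint64_t head;

static uint64_t pack(uint32_t idx, uint32_t cnt)
{
    return ((uint64_t)cnt << 32) | idx;
}
static uint32_t head_idx(uint64_t h) { return (uint32_t)h; }
static uint32_t head_cnt(uint64_t h) { return (uint32_t)(h >> 32); }

static void freelist_init(void)
{
    for (uint32_t i = 0; i < NSLOTS; i++)
        next_free[i] = (i + 1 < NSLOTS) ? i + 1 : NIL;
    atomic_store(&head, pack(0, 0));
}

/* Pop a slot. Index and counter are compared in a single compare-exchange. */
static uint32_t slot_alloc(void)
{
    uint64_t old = atomic_load(&head);
    for (;;) {
        uint32_t idx = head_idx(old);
        if (idx == NIL)
            return NIL;
        uint64_t new = pack(next_free[idx], head_cnt(old) + 1);
        if (atomic_compare_exchange_weak(&head, &old, new))
            return idx;
    }
}

/* Push a slot back, bumping the counter as well. */
static void slot_free(uint32_t idx)
{
    assert(idx < NSLOTS);
    uint64_t old = atomic_load(&head);
    for (;;) {
        next_free[idx] = head_idx(old);
        uint64_t new = pack(idx, head_cnt(old) + 1);
        if (atomic_compare_exchange_weak(&head, &old, new))
            return;
    }
}

/*
 * Replay the classic ABA interleaving on one thread. Returns true if an
 * update computed from a stale snapshot of the head wrongly succeeds.
 */
static bool stale_cas_succeeds(void)
{
    freelist_init();
    uint64_t stale = atomic_load(&head);   /* slot 0, counter 0 */
    uint64_t wanted = pack(atomic_load(&next_free[0]), head_cnt(stale) + 1);

    uint32_t a = slot_alloc();             /* pops slot 0 */
    (void)slot_alloc();                    /* pops slot 1 */
    slot_free(a);                          /* slot 0 is the head again */

    return atomic_compare_exchange_strong(&head, &stale, wanted);
}
```

After the replay the head index is 0 again, exactly what the stale
snapshot saw, so a comparison of the index alone would succeed and
install slot 1 (still allocated) as the new head. The counter has
advanced, so the combined comparison fails and the caller retries.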

The core idea builds on the fact that we can cheaply detect a pending
alloc on the same CPU. Should a conflicting free be executing from an
interrupt, it can instead add the returned buffer to a different list,
and the ABA problem disappears. Should the alloc fast path fail to find
a free buffer, it can disable interrupts and take a look at the
fallback list.

-- 
Mateusz Guzik
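
A rough single-threaded model of the fallback-list idea, for
illustration only. Every name here is hypothetical, the interrupt is
simulated with a callback, and nested allocations from interrupt
context are deliberately not modelled. It is a sketch of the
description above under those assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf { struct buf *next; };

/* Hypothetical per-CPU state; none of these names come from the kernel. */
struct cpu_cache {
    struct buf *freelist;  /* used by the alloc fast path */
    struct buf *fallback;  /* frees that raced with a pending alloc */
    bool alloc_pending;    /* set while the fast path is mid-update */
};

/* Free path. from_irq says whether we interrupted other code on this CPU. */
static void cache_free(struct cpu_cache *c, struct buf *b, bool from_irq)
{
    if (from_irq && c->alloc_pending) {
        /* Leave the list the interrupted alloc is working on alone. */
        b->next = c->fallback;
        c->fallback = b;
        return;
    }
    b->next = c->freelist;
    c->freelist = b;
}

/* Stand-in for an interrupt arriving in the middle of the fast path. */
typedef void (*irq_hook)(struct cpu_cache *c);

static struct buf *cache_alloc(struct cpu_cache *c, irq_hook irq)
{
    c->alloc_pending = true;
    struct buf *b = c->freelist;
    struct buf *next = b ? b->next : NULL;

    if (irq)
        irq(c);            /* simulated interrupt between load and store */

    /*
     * This is where a pointer-only compare-exchange would sit. Frees from
     * the interrupt went to the fallback list, so the head is unchanged.
     */
    assert(c->freelist == b);
    if (b)
        c->freelist = next;
    c->alloc_pending = false;
    if (b)
        return b;

    /* Slow path: real code would disable interrupts around this. */
    b = c->fallback;
    if (b)
        c->fallback = b->next;
    return b;
}

/* Demo helper: the simulated interrupt frees this buffer. */
static struct buf *irq_victim;

static void irq_free_victim(struct cpu_cache *c)
{
    cache_free(c, irq_victim, true);
}
```

Calling cache_alloc() with irq_free_victim as the hook returns the
buffer that was at the head of the freelist, and the buffer freed by
the simulated interrupt ends up on the fallback list, where the next
allocation finds it through the slow path.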