From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C45E4C433E6 for ; Sun, 10 Jan 2021 06:14:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 94629233CF for ; Sun, 10 Jan 2021 06:14:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725858AbhAJGOi (ORCPT ); Sun, 10 Jan 2021 01:14:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48868 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725267AbhAJGOi (ORCPT ); Sun, 10 Jan 2021 01:14:38 -0500 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BDBA5C061786; Sat, 9 Jan 2021 22:13:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=GNdziTbQbEJ9mo4mjqBaEUuZozMS1eNM8r/cJMJoprY=; b=SsfT8KjpTJHHhQ8OIYtxfv9/Ii TdZoUe768+SB0A2U8voTxFHiKC7KjTFx9AElQatAVXunYvC6bM4WdH9g1fq3IbWOmbO6SK/GDSBAk Beznn47owEO/xpUe5dBBTOY2zElJZrLyLh5THB9hoN+QcO0POI2/FeW4TCKpfnl2t/TIjyx8LjG2R pLCEf4RZMXyipHT+QZ8x3k8mma4VCk8BbYfYLAKvsb+9sMs1JWPyjA5/ZB8fl42iyEis5CWJ/tSnz 5EtHLksnUq35W5uGAi0IhRNu2CulCK4X0XwCg8eGv3HZdIVBVVKyCTBkTcNHVIDaSWAr6FEAb8JSD /oCMRNhQ==; Received: from willy by casper.infradead.org with local (Exim 4.94 #2 (Red Hat Linux)) id 1kyTyb-001Rl1-QK; Sun, 10 Jan 2021 06:13:24 +0000 Date: Sun, 10 Jan 2021 06:13:21 +0000 From: Matthew Wilcox To: Mikulas Patocka Cc: Andrew Morton , Dan Williams , Vishal Verma , Dave Jiang , Ira Weiny , Jan Kara , Steven Whitehouse , Eric Sandeen , Dave Chinner , Theodore Ts'o , Wang Jianchao , "Kani, Toshi" , "Norton, Scott J" , "Tadakamadla, Rajesh" , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org Subject: Re: Expense of read_iter Message-ID: <20210110061321.GC35215@casper.infradead.org> References: <20210107151125.GB5270@casper.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote: > On Thu, 7 Jan 2021, Matthew Wilcox wrote: > > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote: > > > I'd like to ask about this piece of code in __kernel_read: > > > if (unlikely(!file->f_op->read_iter || file->f_op->read)) > > > return warn_unsupported... > > > and __kernel_write: > > > if (unlikely(!file->f_op->write_iter || file->f_op->write)) > > > return warn_unsupported... > > > > > > - It exits with an error if both read_iter and read or write_iter and > > > write are present. > > > > > > I found out that on NVFS, reading a file with the read method has 10% > > > better performance than the read_iter method. The benchmark just reads the > > > same 4k page over and over again - and the cost of creating and parsing > > > the kiocb and iov_iter structures is just that high. > > > > Which part of it is so expensive? > > The read_iter path is much bigger: > vfs_read - 0x160 bytes > new_sync_read - 0x160 bytes > nvfs_rw_iter - 0x100 bytes > nvfs_rw_iter_locked - 0x4a0 bytes > iov_iter_advance - 0x300 bytes Number of bytes in a function isn't really correlated with how expensive a particular function is. That said, looking at new_sync_read() shows one part that's particularly bad, init_sync_kiocb(): static inline int iocb_flags(struct file *file) { int res = 0; if (file->f_flags & O_APPEND) res |= IOCB_APPEND; 7ec: 8b 57 40 mov 0x40(%rdi),%edx 7ef: 48 89 75 80 mov %rsi,-0x80(%rbp) if (file->f_flags & O_DIRECT) 7f3: 89 d0 mov %edx,%eax 7f5: c1 e8 06 shr $0x6,%eax 7f8: 83 e0 10 and $0x10,%eax res |= IOCB_DIRECT; if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) 7fb: 89 c1 mov %eax,%ecx 7fd: 81 c9 00 00 02 00 or $0x20000,%ecx 803: f6 c6 40 test $0x40,%dh 806: 0f 45 c1 cmovne %ecx,%eax res |= IOCB_DSYNC; 809: f6 c6 10 test $0x10,%dh 80c: 75 18 jne 826 80e: 48 8b 8f d8 00 00 00 mov 0xd8(%rdi),%rcx 815: 48 8b 09 mov (%rcx),%rcx 818: 48 8b 71 28 mov 0x28(%rcx),%rsi 81c: f6 46 50 10 testb $0x10,0x50(%rsi) 820: 0f 84 e2 00 00 00 je 908 if (file->f_flags & __O_SYNC) 826: 83 c8 02 or $0x2,%eax res |= IOCB_SYNC; return res; 829: 89 c1 mov %eax,%ecx 82b: 83 c9 04 or $0x4,%ecx 82e: 81 e2 00 00 10 00 and $0x100000,%edx We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT | O_APPEND) and returning 0 if none of them are set, since they're all pretty rare. It might be better to maintain an f_iocb_flags in the struct file and just copy that unconditionally. We'd need to remember to update it in fcntl(F_SETFL), but I think that's the only place. > If we go with the "read" method, there's just: > vfs_read - 0x160 bytes > nvfs_read - 0x200 bytes > > > Is it worth, eg adding an iov_iter > > type that points to a single buffer instead of a single-member iov? > 6.57% pread [nvfs] [k] nvfs_rw_iter_locked > 2.31% pread [kernel.vmlinux] [k] new_sync_read > 1.89% pread [kernel.vmlinux] [k] iov_iter_advance > 1.24% pread [nvfs] [k] nvfs_rw_iter > 0.29% pread [kernel.vmlinux] [k] iov_iter_init > 2.71% pread [nvfs] [k] nvfs_read > Note that if we sum the percentage of nvfs_iter_locked, new_sync_read, > iov_iter_advance, nvfs_rw_iter, we get 12.01%. On the other hand, in the > second trace, nvfs_read consumes just 2.71% - and it replaces > functionality of all these functions. > > That is the reason for that 10% degradation with read_iter. You seem to be focusing on your argument for "let's just permit filesystems to implement both ->read and ->read_iter". My suggestion is that we need to optimise the ->read_iter path, but to do that we need to know what's expensive. nvfs_rw_iter_locked() looks very complicated. I suspect it can be simplified. Of course new_sync_read() needs to be improved too, as do the other functions here, but fully a third of the difference between read() and read_iter() is the difference between nvfs_read() and nvfs_rw_iter_locked().