All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Jan Kara <jack@suse.cz>, Mateusz Guzik <mjguzik@gmail.com>
Cc: Bharata B Rao <bharata@amd.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	nikunj@amd.com, willy@infradead.org, vbabka@suse.cz,
	david@redhat.com, akpm@linux-foundation.org, yuzhao@google.com,
	axboe@kernel.dk, viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Thu, 28 Nov 2024 11:10:35 +0530	[thread overview]
Message-ID: <87plmf3oh8.fsf@gmail.com> (raw)
In-Reply-To: <20241127120235.ejpvpks3fosbzbkr@quack3>

Jan Kara <jack@suse.cz> writes:

> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >
>> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>> > >
>> > > Recently we discussed the scalability issues while running large
>> > > instances of FIO with buffered IO option on NVME block devices here:
>> > >
>> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>> > >
>> > > One of the suggestions Chris Mason gave (during private discussions) was
>> > > to enable large folios in block buffered IO path as that could
>> > > improve the scalability problems and improve the lock contention
>> > > scenarios.
>> > >
>> >
>> > I have no basis to comment on the idea.
>> >
>> > However, it is pretty apparent whatever the situation it is being
>> > heavily disfigured by lock contention in blkdev_llseek:
>> >
>> > > perf-lock contention output
>> > > ---------------------------
>> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>> > > mix it looks like this:
>> > >
>> > > perf-lock contention default
>> > >  contended   total wait     max wait     avg wait         type   caller
>> > >
>> > > 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
>> > >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>> > >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>> > >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>> > >                         0xffffffff8f39e88f  up_write+0x4f
>> > >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
>> > >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>> > >                         0xffffffff903f18fb  down_write+0x5b
>> > >                         0xffffffff8f9d5971  blkdev_llseek+0x31
>> > >                         0xffffffff8f703322  ksys_lseek+0x72
>> > >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>> > >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>> > >                         0xffffffff903dce5e  do_syscall_64+0x7e
>> > >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>> >
>> > Admittedly I'm not familiar with this code, but at a quick glance the
>> > lock can be just straight up removed here?
>> >
>> >   534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> >   535 {
>> >   536 │       struct inode *bd_inode = bdev_file_inode(file);
>> >   537 │       loff_t retval;
>> >   538 │
>> >   539 │       inode_lock(bd_inode);
>> >   540 │       retval = fixed_size_llseek(file, offset, whence,
>> > i_size_read(bd_inode));
>> >   541 │       inode_unlock(bd_inode);
>> >   542 │       return retval;
>> >   543 }
>> >
>> > At best it stabilizes the size for the duration of the call. Sounds
>> > like it helps nothing since if the size can change, the file offset
>> > will still be altered as if there was no locking?
>> >
>> > Suppose this cannot be avoided to grab the size for whatever reason.
>> >
>> > While the above fio invocation did not work for me, I ran some crapper
>> > which I had in my shell history and according to strace:
>> > [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> > [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> > [pid 271830] lseek(7, 0, SEEK_SET)      = 0
>> >
>> > ... the lseeks just rewind to the beginning, *definitely* not needing
>> > to know the size. One would have to check but this is most likely the
>> > case in your test as well.
>> >
>> > And for that there is 0 need to grab the size, and consequently the inode lock.
>> 
>> That is to say bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
>

Right, we don't need an inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for 32bit SMP case, due
to use of seqcounts.
I guess it would be good to maybe add this in Documentation too rather
than this info just hanging on top of i_size_write()?

References
===========
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/

-ritesh

  parent reply	other threads:[~2024-11-28  5:50 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-27  5:47 [RFC PATCH 0/1] Large folios in block buffered IO path Bharata B Rao
2024-11-27  5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27  6:26   ` Christoph Hellwig
2024-11-27 10:37     ` Bharata B Rao
2024-11-28  5:43       ` Christoph Hellwig
2024-11-27  6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
2024-11-27  6:19   ` Mateusz Guzik
2024-11-27 12:02     ` Jan Kara
2024-11-27 12:13       ` Christian Brauner
2024-11-28  5:40       ` Ritesh Harjani [this message]
2024-11-27 12:18     ` Bharata B Rao
2024-11-27 12:28       ` Mateusz Guzik
2024-11-28  4:01         ` Bharata B Rao
2024-11-28  4:22           ` Matthew Wilcox
2024-11-28  4:37             ` Bharata B Rao
2024-11-28 11:23               ` Bharata B Rao
2024-11-28 23:31                 ` Mateusz Guzik
2024-11-29 10:32                   ` Bharata B Rao
2024-11-28  4:22           ` Mateusz Guzik
2024-11-28  4:31             ` Mateusz Guzik
2024-12-02  9:37               ` Bharata B Rao
2024-12-02 10:08                 ` Mateusz Guzik
2024-12-03  5:01                   ` Bharata B Rao
2024-11-28  4:43             ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87plmf3oh8.fsf@gmail.com \
    --to=ritesh.list@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=bharata@amd.com \
    --cc=brauner@kernel.org \
    --cc=clm@meta.com \
    --cc=david@redhat.com \
    --cc=jack@suse.cz \
    --cc=joshdon@google.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjguzik@gmail.com \
    --cc=nikunj@amd.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.