From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Jan Kara <jack@suse.cz>, Mateusz Guzik <mjguzik@gmail.com>
Cc: Bharata B Rao <bharata@amd.com>,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
nikunj@amd.com, willy@infradead.org, vbabka@suse.cz,
david@redhat.com, akpm@linux-foundation.org, yuzhao@google.com,
axboe@kernel.dk, viro@zeniv.linux.org.uk, brauner@kernel.org,
jack@suse.cz, joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Thu, 28 Nov 2024 11:10:35 +0530 [thread overview]
Message-ID: <87plmf3oh8.fsf@gmail.com> (raw)
In-Reply-To: <20241127120235.ejpvpks3fosbzbkr@quack3>
Jan Kara <jack@suse.cz> writes:
> On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
>> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >
>> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>> > >
>> > > Recently we discussed the scalability issues while running large
>> > > instances of FIO with buffered IO option on NVME block devices here:
>> > >
>> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>> > >
>> > > One of the suggestions Chris Mason gave (during private discussions) was
>> > > to enable large folios in block buffered IO path as that could
>> > > improve the scalability problems and improve the lock contention
>> > > scenarios.
>> > >
>> >
>> > I have no basis to comment on the idea.
>> >
>> > However, it is pretty apparent whatever the situation it is being
>> > heavily disfigured by lock contention in blkdev_llseek:
>> >
>> > > perf-lock contention output
>> > > ---------------------------
>> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>> > > mix it looks like this:
>> > >
>> > > perf-lock contention default
>> > > contended total wait max wait avg wait type caller
>> > >
>> > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
>> > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
>> > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
>> > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
>> > > 0xffffffff8f39e88f up_write+0x4f
>> > > 0xffffffff8f9d598e blkdev_llseek+0x4e
>> > > 0xffffffff8f703322 ksys_lseek+0x72
>> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
>> > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
>> > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
>> > > 0xffffffff903f18fb down_write+0x5b
>> > > 0xffffffff8f9d5971 blkdev_llseek+0x31
>> > > 0xffffffff8f703322 ksys_lseek+0x72
>> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
>> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
>> > > 0xffffffff903dce5e do_syscall_64+0x7e
>> > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
>> >
>> > Admittedly I'm not familiar with this code, but at a quick glance the
>> > lock can be just straight up removed here?
>> >
>> > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> > 535 {
>> > 536 │ struct inode *bd_inode = bdev_file_inode(file);
>> > 537 │ loff_t retval;
>> > 538 │
>> > 539 │ inode_lock(bd_inode);
>> > 540 │ retval = fixed_size_llseek(file, offset, whence,
>> > i_size_read(bd_inode));
>> > 541 │ inode_unlock(bd_inode);
>> > 542 │ return retval;
>> > 543 }
>> >
>> > At best it stabilizes the size for the duration of the call. Sounds
>> > like it helps nothing since if the size can change, the file offset
>> > will still be altered as if there was no locking?
>> >
>> > Suppose this cannot be avoided to grab the size for whatever reason.
>> >
>> > While the above fio invocation did not work for me, I ran some crapper
>> > which I had in my shell history and according to strace:
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
>> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
>> >
>> > ... the lseeks just rewind to the beginning, *definitely* not needing
>> > to know the size. One would have to check but this is most likely the
>> > case in your test as well.
>> >
>> > And for that there is 0 need to grab the size, and consequently the inode lock.
>>
>> That is to say bare minimum this needs to be benchmarked before/after
>> with the lock removed from the picture, like so:
>
> Yeah, I've noticed this in the locking profiles as well and I agree
> bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
> get away without using inode lock in their llseek handler...
>
Right, we don't need an inode_lock() for i_size_read(). i_size_write()
still needs locking for serialization, mainly for 32bit SMP case, due
to use of seqcounts.
I guess it would be good to maybe add this in Documentation too rather
than this info just hanging on top of i_size_write()?
References
===========
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/locking.rst#n557
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/fs.h#n932
[3]: https://lore.kernel.org/all/20061016162729.176738000@szeredi.hu/
-ritesh
next prev parent reply other threads:[~2024-11-28 5:50 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-27 5:47 [RFC PATCH 0/1] Large folios in block buffered IO path Bharata B Rao
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27 6:26 ` Christoph Hellwig
2024-11-27 10:37 ` Bharata B Rao
2024-11-28 5:43 ` Christoph Hellwig
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
2024-11-27 6:19 ` Mateusz Guzik
2024-11-27 12:02 ` Jan Kara
2024-11-27 12:13 ` Christian Brauner
2024-11-28 5:40 ` Ritesh Harjani [this message]
2024-11-27 12:18 ` Bharata B Rao
2024-11-27 12:28 ` Mateusz Guzik
2024-11-28 4:01 ` Bharata B Rao
2024-11-28 4:22 ` Matthew Wilcox
2024-11-28 4:37 ` Bharata B Rao
2024-11-28 11:23 ` Bharata B Rao
2024-11-28 23:31 ` Mateusz Guzik
2024-11-29 10:32 ` Bharata B Rao
2024-11-28 4:22 ` Mateusz Guzik
2024-11-28 4:31 ` Mateusz Guzik
2024-12-02 9:37 ` Bharata B Rao
2024-12-02 10:08 ` Mateusz Guzik
2024-12-03 5:01 ` Bharata B Rao
2024-11-28 4:43 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87plmf3oh8.fsf@gmail.com \
--to=ritesh.list@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=bharata@amd.com \
--cc=brauner@kernel.org \
--cc=clm@meta.com \
--cc=david@redhat.com \
--cc=jack@suse.cz \
--cc=joshdon@google.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mjguzik@gmail.com \
--cc=nikunj@amd.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.