From: Hannes Reinecke <hare@suse.de>
To: Luis Chamberlain <mcgrof@kernel.org>,
Matthew Wilcox <willy@infradead.org>
Cc: "Keith Busch" <kbusch@kernel.org>,
"Theodore Ts'o" <tytso@mit.edu>,
"Pankaj Raghav" <p.raghav@samsung.com>,
"Daniel Gomez" <da.gomez@samsung.com>,
"Javier González" <javier.gonz@samsung.com>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Sat, 4 Mar 2023 12:08:36 +0100
Message-ID: <c9f6544d-1731-4a73-a926-0e85ae9da9df@suse.de>
In-Reply-To: <ZAJqjM6qLrraFrrn@bombadil.infradead.org>

On 3/3/23 22:45, Luis Chamberlain wrote:
> On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
>> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>>> That said, I was hoping you were going to suggest supporting 16k logical block
>>> sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
>>> 4k. :)
>>
>> I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
>> Funnily, while the pressure is coming from the storage vendors, I don't
>> think there's any work to be done in the storage layers. It's purely
>> a FS+MM problem.
>
> You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> the whole story. Initial experimentation shows that NVMe devices with
> physical & logical block sizes > PAGE_SIZE simply get brought down to
> 512 bytes. That seems odd to say the least. Would changing this be an
> issue now?
>
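For reference, the quickest way I know of to see what the block layer
ended up with is sysfs (device names here are just examples):

    cat /sys/block/nvme0n1/queue/logical_block_size   # LBA size the kernel uses
    cat /sys/block/nvme0n1/queue/physical_block_size  # reported physical block size
    blockdev --getss --getpbsz /dev/nvme0n1           # same values via ioctl

If a namespace formatted with a 16k LBA really shows up as 512 bytes
here, that would be worth chasing in the driver / block layer rather
than in FS + MM.
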
> I'm gathering there is general interest in this topic though. So one
> thing we *could* do is review the lay of the land and break down what
> we all think could or needs to be done. At the very least we'd come
> out of it knowing the unknowns together.
>
> I started to think about some of these things a while ago and, with
> Willy's help, broke down some of the items I gathered from him into
> community OKRs (a super informal itemization of goals and the sub-tasks
> that would complete them). Our team has started taking a stab at these,
> but obviously it would be great if we all just divide and conquer here.
> So maybe reviewing and extending them as a community would be good:
>
> https://kernelnewbies.org/KernelProjects/large-block-size
>
> I've recently become interested in tmpfs, so I'll be taking a stab at
> higher-order page size support there to see what blows up.
>
Cool.
> The other stuff, like the general iomap conversion, is pretty well
> known, and I think we already have a proposed session on that. But
> there are also smaller fish to fry; *just* establishing a baseline
> with some filesystems at a 4 KiB block size seems in order.
>
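For that baseline something as simple as this should do, with fstests
pointed at the result (device paths are just examples):

    mkfs.xfs -f -b size=4096 /dev/vdb
    mkfs.ext4 -b 4096 /dev/vdc
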
> Hearing filesystem developers' thoughts on support for larger block
> sizes on lower-order PAGE_SIZE systems would be good, given that one
> of the odd situations some distributions / teams find themselves in is
> trying to support larger block sizes without easy access to higher
> PAGE_SIZE systems. Are there ways to simplify this / help us in
> general? Without that it's a bit hard to muck around with some of this
> in terms of long-term support. This also got me thinking about better
> ways to replicate larger-IO virtual devices. While paying a cloud
> provider to test this is one nice option, it'd be great if I could
> just do this in-house with some hacks too. For virtio-blk-pci, for
> instance, I wondered whether just using the host page cache suffices,
> or whether a 4k page cache on the host would significantly skew the
> results for an emulated 16k IO controller. How do we most effectively
> virtualize 16k controllers in-house?
>
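Something along these lines might already get you an emulated 16k
controller to play with (untested sketch; image path and ids are made
up):

    # cache=none backs the guest disk with O_DIRECT, keeping the host's
    # 4k page cache out of the IO path; compare against cache=writeback
    qemu-system-x86_64 ... \
        -drive file=/path/to/lbs.img,if=none,id=lbs0,format=raw,cache=none \
        -device virtio-blk-pci,drive=lbs0,logical_block_size=16384,physical_block_size=16384

Running the same workload with cache=none vs cache=writeback on the
host should give a first answer to how much the host's 4k page cache
skews the results for the emulated 16k device.
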
> To help with experimenting with large IO on NVMe / virtio-blk-pci I
> recently added support to kdevops [0] to instantiate tons of large-IO
> devices; with it, it should be easy to reproduce any odd issues we
> come up with. For instance, it should be possible to subsequently
> extend the kdevops fstests or blktests automation with just a few
> Kconfig files so that they use some of these large-IO devices, to see
> what blows up.
>
We could implement a (virtual) zoned device, and expose each zone as a
block. That gives us the required large block characteristics, and with
a bit of luck we might be able to dial up to really large block sizes
like the 256M zone size on current SMR drives.
ublk might be a good starting point.
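Something like the following might do as a first approximation
(null_blk module parameters; zone_size is in MiB):

    modprobe null_blk nr_devices=1 zoned=1 zone_size=256 zone_nr_conv=0
    cat /sys/block/nullb0/queue/zoned          # reports host-managed
    cat /sys/block/nullb0/queue/chunk_sectors  # zone size in 512b sectors

with the actual zone-to-block translation living in a small custom ublk
target on top; the stock 'ublk add -t null' / 'ublk add -t loop'
targets from ublksrv would only provide the scaffolding there.
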
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman