linux-fsdevel.vger.kernel.org archive mirror
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Jan Kara <jack@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org
Subject: Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()
Date: Mon, 7 Nov 2016 14:07:36 +0300
Message-ID: <20161107110736.GA13280@node.shutemov.name>
In-Reply-To: <20161103204012.GC24234@quack2.suse.cz>

On Thu, Nov 03, 2016 at 09:40:12PM +0100, Jan Kara wrote:
> On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> > Yes, the buffer_head list doesn't scale. That's the main reason (along
> > with 4) why syscall-based IO sucks. We spend a lot of time looking for
> > the desired block.
> > 
> > We need to switch to some other data structure for storing buffer_heads.
> > Is there a reason why we have a list there in the first place?
> > Why not just an array?
> > 
> > I will look into it, but this sounds like a separate infrastructure change
> > project.
> 
> As Christoph said, the iomap code should help you with that and make things
> simpler. If things go as we imagine, we should be able to pretty much avoid
> buffer heads. But it will take some time to get there.

Just to clarify: is it a show-stopper, or can we live with the buffer_head
list for now?

> > > 2) PMD-sized pages result in increased space & memory usage.
> > 
> > Space? Do you mean disk space? Not really: we still don't write beyond
> > i_size or into holes.
> > 
> > Behaviour wrt holes may change with mmap()-IO as we have less
> > granularity, but the same can be seen just between different
> > architectures: 4k vs. 64k base page size.
> 
> Yes, I meant different granularity of mmap based IO. And I agree it isn't a
> new problem but the scale of the problem is much larger with 2MB pages than
> with say 64K pages. And actually the overhead of higher IO granularity of
> 64K pages has been one of the reasons we have switched SLES PPC kernels
> from 64K pages to 4K pages (we've got complaints from customers). 

I guess fadvise()/madvise() hints for opt-in/opt-out should be good enough
to deal with this. I probably need to wire them up.
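
For a sense of scale, here is a rough sketch of the worst-case writeback
amplification when a single byte is dirtied through mmap(). It assumes the
whole page has to be written back, which is a simplification. The page sizes
are the ones mentioned above; the function name is made up for illustration.

```shell
#!/bin/sh
# Worst-case bytes written back per dirtied byte via mmap(), relative
# to a 4k base page. Assumes writeback granularity equals the page
# size (an assumption for illustration; i_size and holes can reduce
# the amount actually written).
amplification() {
	page_size=$1	# page size in bytes
	base=4096	# 4k base page as the reference point
	echo $((page_size / base))
}

for size in 4096 65536 2097152; do
	echo "page=${size} bytes: $(amplification "$size")x a 4k page"
done
```

That is 16x for 64k pages and 512x for 2MB pages, which is why an
opt-out hint looks necessary for workloads with sparse small writes.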

> > > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > > allocating blocks underlying a page in the worst case (you don't seem to
> > > update this estimate in your patch set). With 2048 blocks underlying a page,
> > > each possibly in a different block group, it is a lot of metadata forcing
> > > us to reserve a large transaction (not sure if you'll be able to even
> > > reserve such large transaction with the default journal size), which again
> > > makes things slower.
> > 
> > I didn't see this in profiles, and xfstests looks fine. I probably need to
> > run them with 1k blocks once again.
> 
> You wouldn't see this in profiles - it is a correctness thing. And it won't
> be triggered unless the file is heavily fragmented which likely does not
> happen with any test in xfstests. If it happens you'll notice though - the
> filesystem will just report error and shut itself down.

Any suggestions on how I can simulate this situation?
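
For reference, the 2048 figure corresponds to a 2MB page on a 1k-block
filesystem. A quick sketch of the arithmetic (the function name is just for
illustration, and 2MB assumes the x86-64 PMD size):

```shell
#!/bin/sh
# Number of filesystem blocks backing one PMD-sized page.
# 2MB is the x86-64 PMD size; other architectures may differ.
blocks_per_huge_page() {
	block_size=$1			# filesystem block size in bytes
	huge_page=$((2 * 1024 * 1024))	# 2MB
	echo $((huge_page / block_size))
}

for bs in 1024 2048 4096; do
	echo "block size ${bs}: $(blocks_per_huge_page "$bs") blocks per huge page"
done
```

So a 4k-block filesystem has 512 blocks under a huge page, and the 1k case
is the worst one for the metadata estimate.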

> > The numbers below were generated with fio. The working set is relatively
> > small, so it fits into page cache and the writing set doesn't hit
> > dirty_ratio.
> > 
> > I think the mmap performance should be enough to justify initial inclusion
> > of an experimental feature: it is useful for workloads that target
> > mmap()-IO. It will take time for the feature to mature anyway.
> 
> I agree it will take time for the feature to mature, so I'm fine with
> suboptimal performance in some cases. But I'm not fine with some of the
> hacks you do currently because code maintainability is an issue even if
> people don't actually use the feature...

Hm. Okay, I'll try to check what I can do to make it more maintainable.
My worry is that it will make the patchset even bigger...

> > Configuration:
> >  - 2x E5-2697v2, 64G RAM;
> >  - INTEL SSDSC2CW24;
> >  - IO request size is 4k;
> >  - 8 processes, 512MB data set each;
> 
> The numbers indeed look interesting for the mmapped case. Can you post the
> fio cmdline? I'd like to compare profiles...

	fio \
		--directory=/mnt/ \
		--name="$engine-$rw" \
		--ioengine="$engine" \
		--rw="$rw" \
		--size=512M \
		--invalidate=1 \
		--numjobs=8 \
		--runtime=60 \
		--time_based \
		--group_reporting
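
$engine and $rw come from a wrapper script that is not shown here. A minimal
sketch of such a wrapper follows. The engine and rw lists are assumptions, not
necessarily the combinations behind the posted numbers, and run_one and FIO
are names invented for this sketch. No --bs is passed, so fio's default 4k
block size applies, which matches the 4k request size in the configuration.

```shell
#!/bin/sh
# Sketch of a wrapper around the command above. The engine and rw
# lists are assumed for illustration. By default this is a dry run
# that only prints each command; set FIO=fio to actually run them.
FIO=${FIO:-echo fio}

run_one() {
	engine=$1	# fio ioengine, e.g. sync or mmap
	rw=$2		# fio access pattern, e.g. read or randwrite
	$FIO \
		--directory=/mnt/ \
		--name="$engine-$rw" \
		--ioengine="$engine" \
		--rw="$rw" \
		--size=512M \
		--invalidate=1 \
		--numjobs=8 \
		--runtime=60 \
		--time_based \
		--group_reporting
}

for engine in sync mmap; do
	for rw in read write randread randwrite; do
		run_one "$engine" "$rw"
	done
done
```

With 8 jobs of 512M each, the total data set is 4G, well below the 64G of
RAM on the test machine.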

-- 
 Kirill A. Shutemov

