From: Dave Chinner <david@fromorbit.com>
To: Tejun Heo <tj@kernel.org>
Cc: axboe@kernel.dk, linux-kernel@vger.kernel.org, jack@suse.cz,
hch@infradead.org, hannes@cmpxchg.org,
linux-fsdevel@vger.kernel.org, vgoyal@redhat.com,
lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com
Subject: Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
Date: Sat, 10 Jan 2015 11:38:19 +1100 [thread overview]
Message-ID: <20150110003819.GP31508@dastard> (raw)
In-Reply-To: <20150109212336.GB2785@htj.dyndns.org>
On Fri, Jan 09, 2015 at 04:23:36PM -0500, Tejun Heo wrote:
> Hello, Dave.
>
> On Thu, Jan 08, 2015 at 10:45:32AM +1100, Dave Chinner wrote:
> > > Complications mostly arise from filesystems and inodes having to deal
> > > with multiple split bdi's instead of one, but those are mostly
> > > straight-forward 1:N mapping issues. It does get tedious here and
> > > there but doesn't complicate the overall picture.
> >
> > Some filesystems don't track metadata-dirty inode state in the bdi
> > lists, and instead track that in their own lists (usually deep
> > inside the journalling subsystem). i.e. I_DIRTY_PAGES are the only
> > dirty state that is tracked in the VFS. i.e. inode metadata
> > writeback will still be considered global, but pages won't be. Hence
> > you might get pages written back quickly, but the inodes are going
> > to remain dirty and unreclaimable until the filesystem flushes some
> > time in the future after the journal is committed and the inode
> > written...
>
> I'm not sure I'm following. What writeback layer provides is cgroup
> awareness when dealing with I_DIRTY_PAGES. Metadata writebacks will
> become automatically cgroup-aware to the extent they go through
> regular page dirtying mechanism.
That's my point - inode metadata drtying/writeback don't go through
the regular page dirtying mechanisms.
> What's implemented in this patchset is
> propagation of memcg tags for pagecache pages. If necessary, further
> mechanisms can be added, but this should cover the basics.
Sure, but I'm just pointing out that if you dirty a million inodes
in a memcg (e.g. chown -R), memcg-based writeback will not cause
them to be written...
> > There has also been talk of allowing filesystems to directly track
> > dirty page state as well - the discussion came out of the way tux3
> > was tracking and committing delta changes to file data. Now that
> > hasn't gone anywhere, but I'm wondering what impact this patch set
> > would have on such proposals?
>
> Would such a filesystem take over writeback mechanism too?
That was the intent.
> The
> implemented mechanism is fairly modular and the counterparts in each
> filesystem should be able to use them the same way the core writeback
> code does. I'm afraid I can't say much without knowing further
> details.
Ok,you haven't said "that's impossible", so that's good enough for
now ;)
> > Similarly, I'm concerned about additional overhead in the writeback
> > path - we can easily drive the flusher thread to be CPU bound on IO
> > subsystems that have decent bandwidth (low GB/s), so adding more
> > overhead to every page we have to flush is going to reduce
> > performance on these systems. Do you have any idea what impact
> > just enabling the memcg/blkcg tracking has on writeback performance
> > and CPU consumption?
>
> I measured avg sys+user time of 50 iterations of
>
> fs_mark -d /mnt/tmp/ -s 104857600 -n 32
>
> on an ext2 on a ramdisk, which should put the hot path part - page
> faulting and inode dirtying - under spotlight. cgroup writeback
> enabled but not used case consumes around 1% more cpu time - AVG 6.616
> STDEV 0.050 w/o this patchset, AVG 6.682 STDEV 0.046 with. This is an
> extreme case and while it isn't free the overhead is fairly low.
What's the throughput for these numbers? CPU usage without any idea
of the number of pages being scanned doesn't tell us a whole lot.
> > A further complication for data writeback is that some filesystems
> > do their own adjacent page write clustering own inside their own
> > ->writepages/->writepage implementations. Both ext4 and XFS do this,
> > and it makes no sense from a system and filesystem performance
> > perspective to turn sequential ranges of dirty pages into much
> > slower, semi-random IO just because the pages belong to different
> > memcgs. It's not a good idea to compromise bulk writeback
> > throughput under memory pressure just because a different memcgs
> > write to the same files, so what is going to be the impact of
> > filesystems ignoring memcg ownership during writeback clustering?
>
> I don't think that's a good idea. Implementing that isn't hard.
> Range writeback can simply avoid skipping pages from different
> cgroups; however, different cgroups can have vastly different
> characteristics. One may be configured to have a majority of the
> available bandwidth while another has to scrap by with few hundreds of
> k's per sec. Initiating write out on pages which belong to the former
> from the writeback of the latter may cause serious priority inversion
> issues.
I'd suggest that you should provide mechanisms at the block layer
for accounting the pages in the bio to the memcg they belong to,
not make a sweeping directive that filesystems can only write back
pages from one memcg at a time.
If you account for pages to their memcg and decide on bio priority
at bio_add_page() time you would avoid the inversion and cross-cg
accounting problems. If you do this, the filesystem doesn't need to
care at all what memcg pages belong to; they just do optimal IO to
clean sequential dirty pages and it is accounted and throttled
appropriately by the lower layers.
> Maybe we can think of optimizations down the road but I'd strongly
> prefer to stick to simple and clear divisions among cgroups. Also, a
> file highly interleaved by multiple cgroups isn't a particularly
> likely use case.
That's true, and that's a further reason why I think we should not
be caring about this case in the filesystem writeback code at all.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-01-10 0:38 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
2015-01-06 21:25 ` [PATCH 01/45] writeback: add struct dirty_context Tejun Heo
2015-01-06 21:25 ` [PATCH 02/45] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK Tejun Heo
2015-01-06 21:25 ` [PATCH 03/45] memcg: encode page_cgflags in the lower bits of page->mem_cgroup Tejun Heo
2015-01-06 21:25 ` [PATCH 04/45] memcg, writeback: implement memcg_blkcg_ptr Tejun Heo
2015-01-06 21:25 ` [PATCH 05/45] writeback: make backing_dev_info host cgroup-specific bdi_writebacks Tejun Heo
2015-01-06 21:25 ` [PATCH 06/45] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 07/45] writeback: attribute stats to the matching per-cgroup bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 08/45] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 09/45] writeback: make congestion functions per bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 10/45] writeback, blkcg: restructure blk_{set|clear}_queue_congested() Tejun Heo
2015-01-06 21:25 ` [PATCH 11/45] writeback, blkcg: propagate non-root blkcg congestion state Tejun Heo
2015-01-06 21:25 ` [PATCH 12/45] writeback: implement and use mapping_congested() Tejun Heo
2015-01-06 21:25 ` [PATCH 13/45] writeback: implement WB_has_dirty_io wb_state flag Tejun Heo
2015-01-06 21:25 ` [PATCH 14/45] writeback: implement backing_dev_info->tot_write_bandwidth Tejun Heo
2015-01-06 21:25 ` [PATCH 15/45] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account Tejun Heo
2015-01-06 21:25 ` [PATCH 16/45] writeback: don't issue wb_writeback_work if clean Tejun Heo
2015-01-06 21:25 ` [PATCH 17/45] writeback: make bdi->min/max_ratio handling cgroup writeback aware Tejun Heo
2015-01-06 21:25 ` [PATCH 18/45] writeback: implement bdi_for_each_wb() Tejun Heo
2015-01-06 21:25 ` [PATCH 19/45] writeback: remove bdi_start_writeback() Tejun Heo
2015-01-06 21:25 ` [PATCH 20/45] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's Tejun Heo
[not found] ` <1420579582-8516-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-01-06 21:25 ` [PATCH 21/45] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info Tejun Heo
2015-01-06 21:26 ` [PATCH 44/45] mpage: make __mpage_writepage() honor cgroup writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 22/45] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info Tejun Heo
2015-01-06 21:26 ` [PATCH 23/45] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 24/45] writeback: add wb_writeback_work->auto_free Tejun Heo
2015-01-06 21:26 ` [PATCH 25/45] writeback: implement bdi_wait_for_completion() Tejun Heo
2015-01-06 21:26 ` [PATCH 26/45] writeback: implement wb_wait_for_single_work() Tejun Heo
2015-01-06 21:26 ` [PATCH 27/45] writeback: restructure try_writeback_inodes_sb[_nr]() Tejun Heo
2015-01-06 21:26 ` [PATCH 28/45] writeback: make writeback initiation functions handle multiple bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 29/45] writeback: move i_wb_list emptiness test into inode_wb_list_del() from its caller Tejun Heo
2015-01-06 21:26 ` [PATCH 30/45] vfs, writeback: introduce struct inode_wb_link Tejun Heo
2015-01-06 21:26 ` [PATCH 31/45] vfs, writeback: add inode_wb_link->data point to the associated bdi_writeback Tejun Heo
2015-01-06 21:26 ` [PATCH 32/45] vfs, writeback: move inode->dirtied_when into inode->i_wb_link Tejun Heo
2015-01-06 21:26 ` [PATCH 33/45] writeback: minor reorganization of fs/fs-writeback.c Tejun Heo
2015-01-06 21:26 ` [PATCH 34/45] vfs, writeback: implement support for multiple inode_wb_link's Tejun Heo
2015-01-06 21:26 ` [PATCH 35/45] vfs, writeback: implement inode->i_nr_syncs Tejun Heo
2015-01-06 21:26 ` [PATCH 36/45] writeback: dirty inodes against their matching cgroup bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 37/45] writeback: make writeback_control carry the inode_wb_link being served Tejun Heo
2015-01-06 21:26 ` [PATCH 38/45] writeback: make cyclic writeback cursor cgroup writeback aware Tejun Heo
2015-01-06 21:26 ` [PATCH 39/45] writeback: make DIRTY_PAGES tracking " Tejun Heo
2015-01-06 21:26 ` [PATCH 40/45] writeback: make write_cache_pages() " Tejun Heo
2015-01-06 21:26 ` [PATCH 41/45] writeback: make __writeback_single_inode() " Tejun Heo
2015-01-06 21:26 ` [PATCH 42/45] writeback: make __filemap_fdatawrite_range() croup " Tejun Heo
2015-01-06 21:26 ` [PATCH 43/45] buffer, writeback: make __block_write_full_page() honor cgroup writeback Tejun Heo
2015-01-06 21:26 ` [PATCH 45/45] ext2: enable cgroup writeback support Tejun Heo
2015-01-06 21:44 ` [PATCHSET RFC block/for-next] writeback: " Tejun Heo
2015-01-07 23:45 ` Dave Chinner
2015-01-09 21:23 ` Tejun Heo
2015-01-10 0:38 ` Dave Chinner [this message]
2015-01-10 15:56 ` Tejun Heo
2015-01-10 16:05 ` Tejun Heo
2015-01-08 9:30 ` Jan Kara
2015-01-09 21:36 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150110003819.GP31508@dastard \
--to=david@fromorbit.com \
--cc=axboe@kernel.dk \
--cc=cgroups@vger.kernel.org \
--cc=clm@fb.com \
--cc=fengguang.wu@intel.com \
--cc=hannes@cmpxchg.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lizefan@huawei.com \
--cc=mhocko@suse.cz \
--cc=tj@kernel.org \
--cc=vgoyal@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).