linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Tejun Heo <tj@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: axboe@kernel.dk, linux-kernel@vger.kernel.org, jack@suse.cz,
	hch@infradead.org, hannes@cmpxchg.org,
	linux-fsdevel@vger.kernel.org, vgoyal@redhat.com,
	lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org,
	mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com
Subject: Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
Date: Sat, 10 Jan 2015 10:56:53 -0500	[thread overview]
Message-ID: <20150110155653.GA25319@htj.dyndns.org> (raw)
In-Reply-To: <20150110003819.GP31508@dastard>

Hey,

On Sat, Jan 10, 2015 at 11:38:19AM +1100, Dave Chinner wrote:
> > What's implemented in this patchset is
> > propagation of memcg tags for pagecache pages.  If necessary, further
> > mechanisms can be added, but this should cover the basics.
> 
> Sure, but I'm just pointing out that if you dirty a million inodes
> in a memcg (e.g. chown -R), memcg-based writeback will not cause
> them to be written...

Sure, in such cases, they'd need further wiring up if to be solved
properly.  For some, we'd have to punt to the root cgroup and charge
it as general system cost but this is no different from other
controllers and at least some of such punting would be inherent in the
nature of the involved activities.

> > I measured avg sys+user time of 50 iterations of
> > 
> >   fs_mark -d /mnt/tmp/ -s 104857600 -n 32
> > 
> > on an ext2 on a ramdisk, which should put the hot path part - page
> > faulting and inode dirtying - under spotlight.  cgroup writeback
> > enabled but not used case consumes around 1% more cpu time - AVG 6.616
> > STDEV 0.050 w/o this patchset, AVG 6.682 STDEV 0.046 with.  This is an
> > extreme case and while it isn't free the overhead is fairly low.
> 
> What's the throughput for these numbers? CPU usage without any idea
> of the number of pages being scanned doesn't tell us a whole lot.

Ah, sorry about that.  Here's the output from one such fs_mark run.
Being run on a ramdisk, it's CPU bound (1.9Ghz Opteron).

$ fs_mark -d /mnt/tmp/ -s 104857600 -n 32 -v

[opt ~]# fs_mark -d /mnt/tmp/ -s 104857600 -n 32 -v

#  fs_mark  -d  /mnt/tmp/  -s  104857600  -n  32  -v
#       Version 3.3, 1 thread(s) starting at Sat Jan 10 10:46:13 2015
#       Sync method: INBAND FSYNC: fsync() per file in write loop.
#       Directories:  no subdirectories used
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 104857600 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.
#       All system call times are reported in microseconds.

FSUse%        Count         Size    Files/sec     App Overhead        CREAT (Min/Avg/Max)        WRITE (Min/Avg/Max)        FSYNC (Min/Avg/Max)         SYNC (Min/Avg/Max)        CLOSE (Min/Avg/Max)       UNLINK (Min/Avg/Max)
     5           32    104857600          4.6            23204       28       45       54       14       21      364    65986    73892   112279        0        0        0       12       12       13     2647     5777    23842

> I'd suggest that you should provide mechanisms at the block layer
> for accounting the pages in the bio to the memcg they belong to,
> not make a sweeping directive that filesystems can only write back
> pages from one memcg at a time.
> 
> If you account for pages to their memcg and decide on bio priority
> at bio_add_page() time you would avoid the inversion and cross-cg
> accounting problems.  If you do this, the filesystem doesn't need to
> care at all what memcg pages belong to; they just do optimal IO to
> clean sequential dirty pages and it is accounted and throttled
> appropriately by the lower layers.

That'd destroy the fundamental feedback mechanism propagating the
pressure from the blkcg split block device up through the writeback
eventually to the memcg.  This chain of backpressure is why this whole
scheme works.  When a blkcg on a device gets congested, its request
reserve becomes contended which in turn sets congestion state on the
channel and blocks further bio submissions till requests are complete.
This blocking of bio is the final and ultimate channel of the
backpressure propagation.  If you start mixing pages from different
cgroups in a single bio, the only options for handling it from the
lower layer is either splitting it into two separate requests and
finish the bio only on completion of both or choosing one victim
cgroup, essentially arbitrarily, both of which can lead to gross
priority inversion in many circumstances.

> > Maybe we can think of optimizations down the road but I'd strongly
> > prefer to stick to simple and clear divisions among cgroups.  Also, a
> > file highly interleaved by multiple cgroups isn't a particularly
> > likely use case.
> 
> That's true, and that's a further reason why I think we should not
> be caring about this case in the filesystem writeback code at all.

I'm afraid I'm not following this logic.  Why would we do something
which isn't straight forward and has a lot of corner cases for a
prospect for optimizing a fringe case?  The only thing filesystem
writeback logic has to do is skipping pages which belong to a
different cgroup, just like it'd skip a page which is already under
writeback.  There's nothing complicated about it.  Those pages simply
aren't the target of that writeback instance.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-01-10 15:56 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
2015-01-06 21:25 ` [PATCH 01/45] writeback: add struct dirty_context Tejun Heo
2015-01-06 21:25 ` [PATCH 02/45] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK Tejun Heo
2015-01-06 21:25 ` [PATCH 03/45] memcg: encode page_cgflags in the lower bits of page->mem_cgroup Tejun Heo
2015-01-06 21:25 ` [PATCH 04/45] memcg, writeback: implement memcg_blkcg_ptr Tejun Heo
2015-01-06 21:25 ` [PATCH 05/45] writeback: make backing_dev_info host cgroup-specific bdi_writebacks Tejun Heo
2015-01-06 21:25 ` [PATCH 06/45] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 07/45] writeback: attribute stats to the matching per-cgroup bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 08/45] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 09/45] writeback: make congestion functions per bdi_writeback Tejun Heo
2015-01-06 21:25 ` [PATCH 10/45] writeback, blkcg: restructure blk_{set|clear}_queue_congested() Tejun Heo
2015-01-06 21:25 ` [PATCH 11/45] writeback, blkcg: propagate non-root blkcg congestion state Tejun Heo
2015-01-06 21:25 ` [PATCH 12/45] writeback: implement and use mapping_congested() Tejun Heo
2015-01-06 21:25 ` [PATCH 13/45] writeback: implement WB_has_dirty_io wb_state flag Tejun Heo
2015-01-06 21:25 ` [PATCH 14/45] writeback: implement backing_dev_info->tot_write_bandwidth Tejun Heo
2015-01-06 21:25 ` [PATCH 15/45] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account Tejun Heo
2015-01-06 21:25 ` [PATCH 16/45] writeback: don't issue wb_writeback_work if clean Tejun Heo
2015-01-06 21:25 ` [PATCH 17/45] writeback: make bdi->min/max_ratio handling cgroup writeback aware Tejun Heo
2015-01-06 21:25 ` [PATCH 18/45] writeback: implement bdi_for_each_wb() Tejun Heo
2015-01-06 21:25 ` [PATCH 19/45] writeback: remove bdi_start_writeback() Tejun Heo
2015-01-06 21:25 ` [PATCH 20/45] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's Tejun Heo
2015-01-06 21:25 ` [PATCH 21/45] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info Tejun Heo
2015-01-06 21:25 ` [PATCH 22/45] writeback: make bdi_start_background_writeback() " Tejun Heo
2015-01-06 21:26 ` [PATCH 23/45] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 24/45] writeback: add wb_writeback_work->auto_free Tejun Heo
2015-01-06 21:26 ` [PATCH 25/45] writeback: implement bdi_wait_for_completion() Tejun Heo
2015-01-06 21:26 ` [PATCH 26/45] writeback: implement wb_wait_for_single_work() Tejun Heo
2015-01-06 21:26 ` [PATCH 27/45] writeback: restructure try_writeback_inodes_sb[_nr]() Tejun Heo
2015-01-06 21:26 ` [PATCH 28/45] writeback: make writeback initiation functions handle multiple bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 29/45] writeback: move i_wb_list emptiness test into inode_wb_list_del() from its caller Tejun Heo
2015-01-06 21:26 ` [PATCH 30/45] vfs, writeback: introduce struct inode_wb_link Tejun Heo
2015-01-06 21:26 ` [PATCH 31/45] vfs, writeback: add inode_wb_link->data point to the associated bdi_writeback Tejun Heo
2015-01-06 21:26 ` [PATCH 32/45] vfs, writeback: move inode->dirtied_when into inode->i_wb_link Tejun Heo
2015-01-06 21:26 ` [PATCH 33/45] writeback: minor reorganization of fs/fs-writeback.c Tejun Heo
2015-01-06 21:26 ` [PATCH 34/45] vfs, writeback: implement support for multiple inode_wb_link's Tejun Heo
2015-01-06 21:26 ` [PATCH 35/45] vfs, writeback: implement inode->i_nr_syncs Tejun Heo
2015-01-06 21:26 ` [PATCH 36/45] writeback: dirty inodes against their matching cgroup bdi_writeback's Tejun Heo
2015-01-06 21:26 ` [PATCH 37/45] writeback: make writeback_control carry the inode_wb_link being served Tejun Heo
2015-01-06 21:26 ` [PATCH 38/45] writeback: make cyclic writeback cursor cgroup writeback aware Tejun Heo
2015-01-06 21:26 ` [PATCH 39/45] writeback: make DIRTY_PAGES tracking " Tejun Heo
2015-01-06 21:26 ` [PATCH 40/45] writeback: make write_cache_pages() " Tejun Heo
2015-01-06 21:26 ` [PATCH 41/45] writeback: make __writeback_single_inode() " Tejun Heo
2015-01-06 21:26 ` [PATCH 42/45] writeback: make __filemap_fdatawrite_range() croup " Tejun Heo
2015-01-06 21:26 ` [PATCH 43/45] buffer, writeback: make __block_write_full_page() honor cgroup writeback Tejun Heo
2015-01-06 21:26 ` [PATCH 44/45] mpage: make __mpage_writepage() " Tejun Heo
2015-01-06 21:26 ` [PATCH 45/45] ext2: enable cgroup writeback support Tejun Heo
2015-01-06 21:44 ` [PATCHSET RFC block/for-next] writeback: " Tejun Heo
2015-01-07 23:45   ` Dave Chinner
2015-01-09 21:23     ` Tejun Heo
2015-01-10  0:38       ` Dave Chinner
2015-01-10 15:56         ` Tejun Heo [this message]
2015-01-10 16:05           ` Tejun Heo
2015-01-08  9:30 ` Jan Kara
2015-01-09 21:36   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150110155653.GA25319@htj.dyndns.org \
    --to=tj@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=cgroups@vger.kernel.org \
    --cc=clm@fb.com \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan@huawei.com \
    --cc=mhocko@suse.cz \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).