[RFC] writeback: add elastic bdi in cgwb bdp

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC] writeback: add elastic bdi in cgwb bdp
@ 2019-10-12 13:27 Hillf Danton
  2019-10-13 19:17 ` kbuild test robot
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Hillf Danton @ 2019-10-12 13:27 UTC (permalink / raw)
  To: mm
  Cc: fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo, Jan Kara,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Hillf Danton


The behaviors of the elastic bdi (ebdi) observed in the current cgwb
bandwidth measurement include

1, like spinning disks on market ebdi can do ~128MB/s IOs in consective
minutes in few scenarios, or higher like SSD, or lower like USB key.

2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
current time window of 200ms, while it was 16M/s in the previous one.

3, it will be either 100MB/s in the next time window if wb-B joins wb-A
writing pages out or 18MB/s if wb-C also decides to chime in.

With the help of bandwidth gauged above, what is left in balancing dirty
pages, bdp, is try to make wb-A's laundry speed catch up dirty speed in
every 200ms interval without knowing what wb-B is doing.

No heuristic is added in this work because ebdi does bdp without it.

Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -157,6 +157,9 @@ struct bdi_writeback {
 	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
 	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
 
+#ifdef CONFIG_CGWB_BDP_WITH_EBDI
+	struct wait_queue_head bdp_waitq;
+#endif
 	union {
 		struct work_struct release_work;
 		struct rcu_head rcu;
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
 			goto out_destroy_stat;
 	}
 
+	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
+	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
+		init_waitqueue_head(&wb->bdp_waitq);
+
 	return 0;
 
 out_destroy_stat:
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
 	}
 }
 
+#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
+static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+
+	if (fatal_signal_pending(current))
+		return false;
+
+	gdtc.avail = global_dirtyable_memory();
+
+	domain_dirty_limits(&gdtc);
+
+	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
+			global_node_page_state(NR_UNSTABLE_NFS) +
+			global_node_page_state(NR_WRITEBACK);
+
+	if (gdtc.dirty < gdtc.bg_thresh)
+		return false;
+
+	if (!writeback_in_progress(wb))
+		wb_start_background_writeback(wb);
+
+	/*
+	 * throttle if laundry speed remarkably falls behind dirty speed
+	 * in the current time window of 200ms
+	 */
+	return gdtc.dirty > gdtc.thresh &&
+		wb_stat(wb, WB_DIRTIED) >
+		wb_stat(wb, WB_WRITTEN) +
+		wb_stat_error();
+}
+
+static inline void cgwb_bdp(struct bdi_writeback *wb)
+{
+	wait_event_interruptible_timeout(wb->bdp_waitq,
+			!cgwb_bdp_should_throttle(wb), HZ);
+}
+#endif
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(wb, current->nr_dirtied);
+		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
+		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
+			cgwb_bdp(wb);
+		else
+			balance_dirty_pages(wb, current->nr_dirtied);
 
 	wb_put(wb);
 }
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
 	if (!wb)
 		return;
 
+	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
+	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
+		if (waitqueue_active(&wb->bdp_waitq))
+			wake_up_all(&wb->bdp_waitq);
+
 	history = inode->i_wb_frn_history;
 	avg_time = inode->i_wb_frn_avg_time;
 
@@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
 	if (nr_pages == LONG_MAX)
 		return LONG_MAX;
 
+	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
+	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
+		return nr_pages;
 	/*
 	 * This may be called on clean wb's and proportional distribution
 	 * may not make sense, just use the original @nr_pages in those
@@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
 	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
 		pages = LONG_MAX;
 	else {
+		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
+		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
+			return work->nr_pages;
+
 		pages = min(wb->avg_write_bandwidth / 2,
 			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
 		pages = min(pages, work->nr_pages);
--



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
@ 2019-10-13 19:17 ` kbuild test robot
  2019-10-13 20:13 ` kbuild test robot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2019-10-13 19:17 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4917 bytes --]

Hi Hillf,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc2 next-20191011]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Hillf-Danton/writeback-add-elastic-bdi-in-cgwb-bdp/20191014-014906
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   mm/page-writeback.c: In function 'balance_dirty_pages_ratelimited':
>> mm/page-writeback.c:1954:4: error: implicit declaration of function 'cgwb_bdp' [-Werror=implicit-function-declaration]
       cgwb_bdp(wb);
       ^~~~~~~~
>> mm/page-writeback.c:1951:5: warning: suggest explicit braces to avoid ambiguous 'else' [-Wdangling-else]
     if (unlikely(current->nr_dirtied >= ratelimit))
        ^
   cc1: some warnings being treated as errors
--
   In file included from mm/backing-dev.c:3:0:
   mm/backing-dev.c: In function 'wb_init':
>> mm/backing-dev.c:329:26: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
      init_waitqueue_head(&wb->bdp_waitq);
                             ^
   include/linux/wait.h:67:26: note: in definition of macro 'init_waitqueue_head'
      __init_waitqueue_head((wq_head), #wq_head, &__key);  \
                             ^~~~~~~

vim +/cgwb_bdp +1954 mm/page-writeback.c

  1889	
  1890	/**
  1891	 * balance_dirty_pages_ratelimited - balance dirty memory state
  1892	 * @mapping: address_space which was dirtied
  1893	 *
  1894	 * Processes which are dirtying memory should call in here once for each page
  1895	 * which was newly dirtied.  The function will periodically check the system's
  1896	 * dirty state and will initiate writeback if needed.
  1897	 *
  1898	 * On really big machines, get_writeback_state is expensive, so try to avoid
  1899	 * calling it too often (ratelimiting).  But once we're over the dirty memory
  1900	 * limit we decrease the ratelimiting by a lot, to prevent individual processes
  1901	 * from overshooting the limit by (ratelimit_pages) each.
  1902	 */
  1903	void balance_dirty_pages_ratelimited(struct address_space *mapping)
  1904	{
  1905		struct inode *inode = mapping->host;
  1906		struct backing_dev_info *bdi = inode_to_bdi(inode);
  1907		struct bdi_writeback *wb = NULL;
  1908		int ratelimit;
  1909		int *p;
  1910	
  1911		if (!bdi_cap_account_dirty(bdi))
  1912			return;
  1913	
  1914		if (inode_cgwb_enabled(inode))
  1915			wb = wb_get_create_current(bdi, GFP_KERNEL);
  1916		if (!wb)
  1917			wb = &bdi->wb;
  1918	
  1919		ratelimit = current->nr_dirtied_pause;
  1920		if (wb->dirty_exceeded)
  1921			ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
  1922	
  1923		preempt_disable();
  1924		/*
  1925		 * This prevents one CPU to accumulate too many dirtied pages without
  1926		 * calling into balance_dirty_pages(), which can happen when there are
  1927		 * 1000+ tasks, all of them start dirtying pages at exactly the same
  1928		 * time, hence all honoured too large initial task->nr_dirtied_pause.
  1929		 */
  1930		p =  this_cpu_ptr(&bdp_ratelimits);
  1931		if (unlikely(current->nr_dirtied >= ratelimit))
  1932			*p = 0;
  1933		else if (unlikely(*p >= ratelimit_pages)) {
  1934			*p = 0;
  1935			ratelimit = 0;
  1936		}
  1937		/*
  1938		 * Pick up the dirtied pages by the exited tasks. This avoids lots of
  1939		 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
  1940		 * the dirty throttling and livelock other long-run dirtiers.
  1941		 */
  1942		p = this_cpu_ptr(&dirty_throttle_leaks);
  1943		if (*p > 0 && current->nr_dirtied < ratelimit) {
  1944			unsigned long nr_pages_dirtied;
  1945			nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
  1946			*p -= nr_pages_dirtied;
  1947			current->nr_dirtied += nr_pages_dirtied;
  1948		}
  1949		preempt_enable();
  1950	
> 1951		if (unlikely(current->nr_dirtied >= ratelimit))
  1952			if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
  1953			    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> 1954				cgwb_bdp(wb);
  1955			else
  1956				balance_dirty_pages(wb, current->nr_dirtied);
  1957	
  1958		wb_put(wb);
  1959	}
  1960	EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
  1961	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 7207 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
  2019-10-13 19:17 ` kbuild test robot
@ 2019-10-13 20:13 ` kbuild test robot
  2019-10-15 10:22 ` Jan Kara
  2019-10-15 14:03 ` Hillf Danton
  3 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2019-10-13 20:13 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 7896 bytes --]

Hi Hillf,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc2 next-20191011]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Hillf-Danton/writeback-add-elastic-bdi-in-cgwb-bdp/20191014-014906
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=sh 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/fs-writeback.c: In function 'wbc_detach_inode':
>> fs/fs-writeback.c:637:27: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
      if (waitqueue_active(&wb->bdp_waitq))
                              ^~
   In file included from include/linux/mmzone.h:10:0,
                    from include/linux/gfp.h:6,
                    from include/linux/slab.h:15,
                    from fs/fs-writeback.c:20:
   fs/fs-writeback.c:638:19: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
       wake_up_all(&wb->bdp_waitq);
                      ^
   include/linux/wait.h:210:36: note: in definition of macro 'wake_up_all'
    #define wake_up_all(x)   __wake_up(x, TASK_NORMAL, 0, NULL)
                                       ^

vim +637 fs/fs-writeback.c

   586	
   587	/**
   588	 * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
   589	 * @wbc: writeback_control of the just finished writeback
   590	 *
   591	 * To be called after a writeback attempt of an inode finishes and undoes
   592	 * wbc_attach_and_unlock_inode().  Can be called under any context.
   593	 *
   594	 * As concurrent write sharing of an inode is expected to be very rare and
   595	 * memcg only tracks page ownership on first-use basis severely confining
   596	 * the usefulness of such sharing, cgroup writeback tracks ownership
   597	 * per-inode.  While the support for concurrent write sharing of an inode
   598	 * is deemed unnecessary, an inode being written to by different cgroups at
   599	 * different points in time is a lot more common, and, more importantly,
   600	 * charging only by first-use can too readily lead to grossly incorrect
   601	 * behaviors (single foreign page can lead to gigabytes of writeback to be
   602	 * incorrectly attributed).
   603	 *
   604	 * To resolve this issue, cgroup writeback detects the majority dirtier of
   605	 * an inode and transfers the ownership to it.  To avoid unnnecessary
   606	 * oscillation, the detection mechanism keeps track of history and gives
   607	 * out the switch verdict only if the foreign usage pattern is stable over
   608	 * a certain amount of time and/or writeback attempts.
   609	 *
   610	 * On each writeback attempt, @wbc tries to detect the majority writer
   611	 * using Boyer-Moore majority vote algorithm.  In addition to the byte
   612	 * count from the majority voting, it also counts the bytes written for the
   613	 * current wb and the last round's winner wb (max of last round's current
   614	 * wb, the winner from two rounds ago, and the last round's majority
   615	 * candidate).  Keeping track of the historical winner helps the algorithm
   616	 * to semi-reliably detect the most active writer even when it's not the
   617	 * absolute majority.
   618	 *
   619	 * Once the winner of the round is determined, whether the winner is
   620	 * foreign or not and how much IO time the round consumed is recorded in
   621	 * inode->i_wb_frn_history.  If the amount of recorded foreign IO time is
   622	 * over a certain threshold, the switch verdict is given.
   623	 */
   624	void wbc_detach_inode(struct writeback_control *wbc)
   625	{
   626		struct bdi_writeback *wb = wbc->wb;
   627		struct inode *inode = wbc->inode;
   628		unsigned long avg_time, max_bytes, max_time;
   629		u16 history;
   630		int max_id;
   631	
   632		if (!wb)
   633			return;
   634	
   635		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
   636		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
 > 637			if (waitqueue_active(&wb->bdp_waitq))
   638				wake_up_all(&wb->bdp_waitq);
   639	
   640		history = inode->i_wb_frn_history;
   641		avg_time = inode->i_wb_frn_avg_time;
   642	
   643		/* pick the winner of this round */
   644		if (wbc->wb_bytes >= wbc->wb_lcand_bytes &&
   645		    wbc->wb_bytes >= wbc->wb_tcand_bytes) {
   646			max_id = wbc->wb_id;
   647			max_bytes = wbc->wb_bytes;
   648		} else if (wbc->wb_lcand_bytes >= wbc->wb_tcand_bytes) {
   649			max_id = wbc->wb_lcand_id;
   650			max_bytes = wbc->wb_lcand_bytes;
   651		} else {
   652			max_id = wbc->wb_tcand_id;
   653			max_bytes = wbc->wb_tcand_bytes;
   654		}
   655	
   656		/*
   657		 * Calculate the amount of IO time the winner consumed and fold it
   658		 * into the running average kept per inode.  If the consumed IO
   659		 * time is lower than avag / WB_FRN_TIME_CUT_DIV, ignore it for
   660		 * deciding whether to switch or not.  This is to prevent one-off
   661		 * small dirtiers from skewing the verdict.
   662		 */
   663		max_time = DIV_ROUND_UP((max_bytes >> PAGE_SHIFT) << WB_FRN_TIME_SHIFT,
   664					wb->avg_write_bandwidth);
   665		if (avg_time)
   666			avg_time += (max_time >> WB_FRN_TIME_AVG_SHIFT) -
   667				    (avg_time >> WB_FRN_TIME_AVG_SHIFT);
   668		else
   669			avg_time = max_time;	/* immediate catch up on first run */
   670	
   671		if (max_time >= avg_time / WB_FRN_TIME_CUT_DIV) {
   672			int slots;
   673	
   674			/*
   675			 * The switch verdict is reached if foreign wb's consume
   676			 * more than a certain proportion of IO time in a
   677			 * WB_FRN_TIME_PERIOD.  This is loosely tracked by 16 slot
   678			 * history mask where each bit represents one sixteenth of
   679			 * the period.  Determine the number of slots to shift into
   680			 * history from @max_time.
   681			 */
   682			slots = min(DIV_ROUND_UP(max_time, WB_FRN_HIST_UNIT),
   683				    (unsigned long)WB_FRN_HIST_MAX_SLOTS);
   684			history <<= slots;
   685			if (wbc->wb_id != max_id)
   686				history |= (1U << slots) - 1;
   687	
   688			if (history)
   689				trace_inode_foreign_history(inode, wbc, history);
   690	
   691			/*
   692			 * Switch if the current wb isn't the consistent winner.
   693			 * If there are multiple closely competing dirtiers, the
   694			 * inode may switch across them repeatedly over time, which
   695			 * is okay.  The main goal is avoiding keeping an inode on
   696			 * the wrong wb for an extended period of time.
   697			 */
   698			if (hweight32(history) > WB_FRN_HIST_THR_SLOTS)
   699				inode_switch_wbs(inode, max_id);
   700		}
   701	
   702		/*
   703		 * Multiple instances of this function may race to update the
   704		 * following fields but we don't mind occassional inaccuracies.
   705		 */
   706		inode->i_wb_frn_winner = max_id;
   707		inode->i_wb_frn_avg_time = min(avg_time, (unsigned long)U16_MAX);
   708		inode->i_wb_frn_history = history;
   709	
   710		wb_put(wbc->wb);
   711		wbc->wb = NULL;
   712	}
   713	EXPORT_SYMBOL_GPL(wbc_detach_inode);
   714	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 52219 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
  2019-10-13 19:17 ` kbuild test robot
  2019-10-13 20:13 ` kbuild test robot
@ 2019-10-15 10:22 ` Jan Kara
  2019-10-15 14:03 ` Hillf Danton
  3 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2019-10-15 10:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: mm, fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo,
	Jan Kara, Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman

Hello,

On Sat 12-10-19 21:27:40, Hillf Danton wrote:
> The behaviors of the elastic bdi (ebdi) observed in the current cgwb
> bandwidth measurement include
> 
> 1, like spinning disks on market ebdi can do ~128MB/s IOs in consective
> minutes in few scenarios, or higher like SSD, or lower like USB key.
> 
> 2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
> current time window of 200ms, while it was 16M/s in the previous one.
> 
> 3, it will be either 100MB/s in the next time window if wb-B joins wb-A
> writing pages out or 18MB/s if wb-C also decides to chime in.
> 
> With the help of bandwidth gauged above, what is left in balancing dirty
> pages, bdp, is try to make wb-A's laundry speed catch up dirty speed in
> every 200ms interval without knowing what wb-B is doing.
> 
> No heuristic is added in this work because ebdi does bdp without it.

Thanks for the patch but honestly, I have hard time understanding what is
the purpose of this patch from the changelog. Some kind of writeback
throttling? And why is this needed? Also some highlevel description of what
your solution is would be good...

								Honza
 
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Hillf Danton <hdanton@sina.com>
> ---
> 
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -157,6 +157,9 @@ struct bdi_writeback {
>  	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
>  	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
>  
> +#ifdef CONFIG_CGWB_BDP_WITH_EBDI
> +	struct wait_queue_head bdp_waitq;
> +#endif
>  	union {
>  		struct work_struct release_work;
>  		struct rcu_head rcu;
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
>  			goto out_destroy_stat;
>  	}
>  
> +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> +		init_waitqueue_head(&wb->bdp_waitq);
> +
>  	return 0;
>  
>  out_destroy_stat:
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
>  	}
>  }
>  
> +#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +
> +	if (fatal_signal_pending(current))
> +		return false;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +
> +	domain_dirty_limits(&gdtc);
> +
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +			global_node_page_state(NR_UNSTABLE_NFS) +
> +			global_node_page_state(NR_WRITEBACK);
> +
> +	if (gdtc.dirty < gdtc.bg_thresh)
> +		return false;
> +
> +	if (!writeback_in_progress(wb))
> +		wb_start_background_writeback(wb);
> +
> +	/*
> +	 * throttle if laundry speed remarkably falls behind dirty speed
> +	 * in the current time window of 200ms
> +	 */
> +	return gdtc.dirty > gdtc.thresh &&
> +		wb_stat(wb, WB_DIRTIED) >
> +		wb_stat(wb, WB_WRITTEN) +
> +		wb_stat_error();
> +}
> +
> +static inline void cgwb_bdp(struct bdi_writeback *wb)
> +{
> +	wait_event_interruptible_timeout(wb->bdp_waitq,
> +			!cgwb_bdp_should_throttle(wb), HZ);
> +}
> +#endif
> +
>  /*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
> @@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> -		balance_dirty_pages(wb, current->nr_dirtied);
> +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> +			cgwb_bdp(wb);
> +		else
> +			balance_dirty_pages(wb, current->nr_dirtied);
>  
>  	wb_put(wb);
>  }
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
>  	if (!wb)
>  		return;
>  
> +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> +		if (waitqueue_active(&wb->bdp_waitq))
> +			wake_up_all(&wb->bdp_waitq);
> +
>  	history = inode->i_wb_frn_history;
>  	avg_time = inode->i_wb_frn_avg_time;
>  
> @@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
>  	if (nr_pages == LONG_MAX)
>  		return LONG_MAX;
>  
> +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> +		return nr_pages;
>  	/*
>  	 * This may be called on clean wb's and proportional distribution
>  	 * may not make sense, just use the original @nr_pages in those
> @@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
>  	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
>  		pages = LONG_MAX;
>  	else {
> +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> +			return work->nr_pages;
> +
>  		pages = min(wb->avg_write_bandwidth / 2,
>  			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
>  		pages = min(pages, work->nr_pages);
> --
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
                   ` (2 preceding siblings ...)
  2019-10-15 10:22 ` Jan Kara
@ 2019-10-15 14:03 ` Hillf Danton
  2019-10-15 14:37   ` Tejun Heo
  2019-10-16  2:26   ` Hillf Danton
  3 siblings, 2 replies; 7+ messages in thread
From: Hillf Danton @ 2019-10-15 14:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: mm, fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo,
	Johannes Weiner, Shakeel Butt, Fengguang Wu, Hillf Danton,
	Minchan Kim, Mel Gorman


Hey Jan

On Tue, 15 Oct 2019 12:22:10 +0200 Jan Kara wrote:
> Hello,
> 
> On Sat 12-10-19 21:27:40, Hillf Danton wrote:
> > 
> > The behaviors of the elastic bdi (ebdi) observed in the current cgwb
> > bandwidth measurement include
> > 
> > 1, like spinning disks on market ebdi can do ~128MB/s IOs in consective
> > minutes in few scenarios, or higher like SSD, or lower like USB key.
> > 
> > 2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
> > current time window of 200ms, while it was 16M/s in the previous one.
> > 
> > 3, it will be either 100MB/s in the next time window if wb-B joins wb-A
> > writing pages out or 18MB/s if wb-C also decides to chime in.
> > 
> > With the help of bandwidth gauged above, what is left in balancing dirty
> > pages, bdp, is try to make wb-A's laundry speed catch up dirty speed in
> > every 200ms interval without knowing what wb-B is doing.
> > 
> > No heuristic is added in this work because ebdi does bdp without it.
> 
> Thanks for the patch but honestly, I have hard time understanding what is
> the purpose of this patch from the changelog.

Fault on my side. I will try to make it as clear as I could.

Under the cover of "behaviors of elastic bdi" I list the difficulties in
the current writeback bandwidth measurings, particularly in the case with
CONFIG_CGROUP_WRITEBACK enabled, with the phrase ebdi used to abstract
the attribute of hardwares, like spinning disk, SSD and USB storage,
that their physical bandwidth is a constant.

The difference between that constant and the bandwidth currently
measured comes from, I think, the IO pattern dispatched to hardware in
the time interval of 200ms. How much sense does it make to guide wb-A's
IO in the next 200ms without idea about what other wbs are doing? What
should be modeled and built on top of the measured bw value?
Hard to say.

What will bdp OTOH look like on top of ebdi without the hard work of
measuring bw?
A name came up before I am tapping this message, though not available
when I sent the RFC, and it essentially is that ebdi paves a brick for
applying the walk-dog method to bdp: let wb-A's laundry speed walk its
dirty speed the same way as pet owners in Paris, Prague and other cities
go walking their dogs every day with a leash worth two dimes on average.

Is a $200 electronic walkmeter needed to have a good time of walking dog
in London?
Nope, I think because it makes ant-eyelash-size sense to gauge the walker's
speed first with that gadget prone to glitch and then teach the dog to
walk that speed, and to do more based on it.

The only reason I have to do walk-dog in bdp is that laundry speed
remarkably falls behind dirty speed in every case of bdp with no exception.
And a leash is supposed to do the job in a manner that it should naturally
be, even though laundry speed changes in every 200ms interval and is
usually hard to predict before hand under real workloads, with two things
below met in every 200ms:
1, dirty pages in the system clamped near the threshold that is configurable
in userspace,
2, dirty speed of every wb glued as close to the laundry speed as possible,
in long run.

Should walk-dog be in place, then we can do cleanups of bw measurement and
things dependent of it a step after another.

Thanks
Hillf

> Some kind of writeback throttling?
> And why is this needed?
> Also some highlevel description of what
> your solution is would be good...
> 
> 								Honza
>  
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Hillf Danton <hdanton@sina.com>
> > ---
> > 
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -157,6 +157,9 @@ struct bdi_writeback {
> >  	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
> >  	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
> >  
> > +#ifdef CONFIG_CGWB_BDP_WITH_EBDI
> > +	struct wait_queue_head bdp_waitq;
> > +#endif
> >  	union {
> >  		struct work_struct release_work;
> >  		struct rcu_head rcu;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
> >  			goto out_destroy_stat;
> >  	}
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		init_waitqueue_head(&wb->bdp_waitq);
> > +
> >  	return 0;
> >  
> >  out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
> >  	}
> >  }
> >  
> > +#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> > +{
> > +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > +	if (fatal_signal_pending(current))
> > +		return false;
> > +
> > +	gdtc.avail = global_dirtyable_memory();
> > +
> > +	domain_dirty_limits(&gdtc);
> > +
> > +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > +			global_node_page_state(NR_UNSTABLE_NFS) +
> > +			global_node_page_state(NR_WRITEBACK);
> > +
> > +	if (gdtc.dirty < gdtc.bg_thresh)
> > +		return false;
> > +
> > +	if (!writeback_in_progress(wb))
> > +		wb_start_background_writeback(wb);
> > +
> > +	/*
> > +	 * throttle if laundry speed remarkably falls behind dirty speed
> > +	 * in the current time window of 200ms
> > +	 */
> > +	return gdtc.dirty > gdtc.thresh &&
> > +		wb_stat(wb, WB_DIRTIED) >
> > +		wb_stat(wb, WB_WRITTEN) +
> > +		wb_stat_error();
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > +	wait_event_interruptible_timeout(wb->bdp_waitq,
> > +			!cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +#endif
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
> >  	preempt_enable();
> >  
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		balance_dirty_pages(wb, current->nr_dirtied);
> > +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +			cgwb_bdp(wb);
> > +		else
> > +			balance_dirty_pages(wb, current->nr_dirtied);
> >  
> >  	wb_put(wb);
> >  }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
> >  	if (!wb)
> >  		return;
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		if (waitqueue_active(&wb->bdp_waitq))
> > +			wake_up_all(&wb->bdp_waitq);
> > +
> >  	history = inode->i_wb_frn_history;
> >  	avg_time = inode->i_wb_frn_avg_time;
> >  
> > @@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
> >  	if (nr_pages == LONG_MAX)
> >  		return LONG_MAX;
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		return nr_pages;
> >  	/*
> >  	 * This may be called on clean wb's and proportional distribution
> >  	 * may not make sense, just use the original @nr_pages in those
> > @@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
> >  	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> >  		pages = LONG_MAX;
> >  	else {
> > +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +			return work->nr_pages;
> > +
> >  		pages = min(wb->avg_write_bandwidth / 2,
> >  			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
> >  		pages = min(pages, work->nr_pages);
> > --
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-15 14:03 ` Hillf Danton
@ 2019-10-15 14:37   ` Tejun Heo
  2019-10-16  2:26   ` Hillf Danton
  1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2019-10-15 14:37 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jan Kara, mm, fsdev, Andrew Morton, linux, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Fengguang Wu, Minchan Kim,
	Mel Gorman

Hello, Hillf.

Do you have a test case which can demonstrate the problem you're
seeing in the existing code?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] writeback: add elastic bdi in cgwb bdp
  2019-10-15 14:03 ` Hillf Danton
  2019-10-15 14:37   ` Tejun Heo
@ 2019-10-16  2:26   ` Hillf Danton
  1 sibling, 0 replies; 7+ messages in thread
From: Hillf Danton @ 2019-10-16  2:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Hillf Danton, Jan Kara, mm, fsdev, Andrew Morton, linux,
	Roman Gushchin, Johannes Weiner, Shakeel Butt, Fengguang Wu,
	Minchan Kim, Mel Gorman

Hello Tejun

On Tue, 15 Oct 2019 07:37:31 -0700 Tejun Heo wrote:
> Hello, Hillf.
> 
> Do you have a test case which can demonstrate the problem you're
> seeing in the existing code?

I dont have such a test case because I see no problem in the current
bw measurings except for the difficulties. For example, wb-A's bw is
measured to be 66MB/s if wb-B joins it dispatching IO, or 96MB/s if
wb-C also joins them, in assumption it is a simple case.

It may be too difficult to be feasible, I am afraid, to get wb-A's bw
under the workloads in data centers without wb-non-A's churnings.

Thanks
Hillf

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-10-16  2:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
2019-10-13 19:17 ` kbuild test robot
2019-10-13 20:13 ` kbuild test robot
2019-10-15 10:22 ` Jan Kara
2019-10-15 14:03 ` Hillf Danton
2019-10-15 14:37   ` Tejun Heo
2019-10-16  2:26   ` Hillf Danton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.