* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
@ 2019-10-13 19:17 ` kbuild test robot
2019-10-13 20:13 ` kbuild test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2019-10-13 19:17 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 4917 bytes --]
Hi Hillf,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc2 next-20191011]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Hillf-Danton/writeback-add-elastic-bdi-in-cgwb-bdp/20191014-014906
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>
All error/warnings (new ones prefixed by >>):
mm/page-writeback.c: In function 'balance_dirty_pages_ratelimited':
>> mm/page-writeback.c:1954:4: error: implicit declaration of function 'cgwb_bdp' [-Werror=implicit-function-declaration]
cgwb_bdp(wb);
^~~~~~~~
>> mm/page-writeback.c:1951:5: warning: suggest explicit braces to avoid ambiguous 'else' [-Wdangling-else]
if (unlikely(current->nr_dirtied >= ratelimit))
^
cc1: some warnings being treated as errors
--
In file included from mm/backing-dev.c:3:0:
mm/backing-dev.c: In function 'wb_init':
>> mm/backing-dev.c:329:26: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
init_waitqueue_head(&wb->bdp_waitq);
^
include/linux/wait.h:67:26: note: in definition of macro 'init_waitqueue_head'
__init_waitqueue_head((wq_head), #wq_head, &__key); \
^~~~~~~
vim +/cgwb_bdp +1954 mm/page-writeback.c
1889
1890 /**
1891 * balance_dirty_pages_ratelimited - balance dirty memory state
1892 * @mapping: address_space which was dirtied
1893 *
1894 * Processes which are dirtying memory should call in here once for each page
1895 * which was newly dirtied. The function will periodically check the system's
1896 * dirty state and will initiate writeback if needed.
1897 *
1898 * On really big machines, get_writeback_state is expensive, so try to avoid
1899 * calling it too often (ratelimiting). But once we're over the dirty memory
1900 * limit we decrease the ratelimiting by a lot, to prevent individual processes
1901 * from overshooting the limit by (ratelimit_pages) each.
1902 */
1903 void balance_dirty_pages_ratelimited(struct address_space *mapping)
1904 {
1905 struct inode *inode = mapping->host;
1906 struct backing_dev_info *bdi = inode_to_bdi(inode);
1907 struct bdi_writeback *wb = NULL;
1908 int ratelimit;
1909 int *p;
1910
1911 if (!bdi_cap_account_dirty(bdi))
1912 return;
1913
1914 if (inode_cgwb_enabled(inode))
1915 wb = wb_get_create_current(bdi, GFP_KERNEL);
1916 if (!wb)
1917 wb = &bdi->wb;
1918
1919 ratelimit = current->nr_dirtied_pause;
1920 if (wb->dirty_exceeded)
1921 ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
1922
1923 preempt_disable();
1924 /*
1925 * This prevents one CPU to accumulate too many dirtied pages without
1926 * calling into balance_dirty_pages(), which can happen when there are
1927 * 1000+ tasks, all of them start dirtying pages at exactly the same
1928 * time, hence all honoured too large initial task->nr_dirtied_pause.
1929 */
1930 p = this_cpu_ptr(&bdp_ratelimits);
1931 if (unlikely(current->nr_dirtied >= ratelimit))
1932 *p = 0;
1933 else if (unlikely(*p >= ratelimit_pages)) {
1934 *p = 0;
1935 ratelimit = 0;
1936 }
1937 /*
1938 * Pick up the dirtied pages by the exited tasks. This avoids lots of
1939 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
1940 * the dirty throttling and livelock other long-run dirtiers.
1941 */
1942 p = this_cpu_ptr(&dirty_throttle_leaks);
1943 if (*p > 0 && current->nr_dirtied < ratelimit) {
1944 unsigned long nr_pages_dirtied;
1945 nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
1946 *p -= nr_pages_dirtied;
1947 current->nr_dirtied += nr_pages_dirtied;
1948 }
1949 preempt_enable();
1950
> 1951 if (unlikely(current->nr_dirtied >= ratelimit))
1952 if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
1953 IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> 1954 cgwb_bdp(wb);
1955 else
1956 balance_dirty_pages(wb, current->nr_dirtied);
1957
1958 wb_put(wb);
1959 }
1960 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
1961
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 7207 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
2019-10-13 19:17 ` kbuild test robot
@ 2019-10-13 20:13 ` kbuild test robot
2019-10-15 10:22 ` Jan Kara
2019-10-15 14:03 ` Hillf Danton
3 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2019-10-13 20:13 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 7896 bytes --]
Hi Hillf,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc2 next-20191011]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Hillf-Danton/writeback-add-elastic-bdi-in-cgwb-bdp/20191014-014906
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 7.4.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=sh
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
fs/fs-writeback.c: In function 'wbc_detach_inode':
>> fs/fs-writeback.c:637:27: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
if (waitqueue_active(&wb->bdp_waitq))
^~
In file included from include/linux/mmzone.h:10:0,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from fs/fs-writeback.c:20:
fs/fs-writeback.c:638:19: error: 'struct bdi_writeback' has no member named 'bdp_waitq'
wake_up_all(&wb->bdp_waitq);
^
include/linux/wait.h:210:36: note: in definition of macro 'wake_up_all'
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
^
vim +637 fs/fs-writeback.c
586
587 /**
588 * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
589 * @wbc: writeback_control of the just finished writeback
590 *
591 * To be called after a writeback attempt of an inode finishes and undoes
592 * wbc_attach_and_unlock_inode(). Can be called under any context.
593 *
594 * As concurrent write sharing of an inode is expected to be very rare and
595 * memcg only tracks page ownership on first-use basis severely confining
596 * the usefulness of such sharing, cgroup writeback tracks ownership
597 * per-inode. While the support for concurrent write sharing of an inode
598 * is deemed unnecessary, an inode being written to by different cgroups at
599 * different points in time is a lot more common, and, more importantly,
600 * charging only by first-use can too readily lead to grossly incorrect
601 * behaviors (single foreign page can lead to gigabytes of writeback to be
602 * incorrectly attributed).
603 *
604 * To resolve this issue, cgroup writeback detects the majority dirtier of
605 * an inode and transfers the ownership to it. To avoid unnnecessary
606 * oscillation, the detection mechanism keeps track of history and gives
607 * out the switch verdict only if the foreign usage pattern is stable over
608 * a certain amount of time and/or writeback attempts.
609 *
610 * On each writeback attempt, @wbc tries to detect the majority writer
611 * using Boyer-Moore majority vote algorithm. In addition to the byte
612 * count from the majority voting, it also counts the bytes written for the
613 * current wb and the last round's winner wb (max of last round's current
614 * wb, the winner from two rounds ago, and the last round's majority
615 * candidate). Keeping track of the historical winner helps the algorithm
616 * to semi-reliably detect the most active writer even when it's not the
617 * absolute majority.
618 *
619 * Once the winner of the round is determined, whether the winner is
620 * foreign or not and how much IO time the round consumed is recorded in
621 * inode->i_wb_frn_history. If the amount of recorded foreign IO time is
622 * over a certain threshold, the switch verdict is given.
623 */
624 void wbc_detach_inode(struct writeback_control *wbc)
625 {
626 struct bdi_writeback *wb = wbc->wb;
627 struct inode *inode = wbc->inode;
628 unsigned long avg_time, max_bytes, max_time;
629 u16 history;
630 int max_id;
631
632 if (!wb)
633 return;
634
635 if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
636 IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> 637 if (waitqueue_active(&wb->bdp_waitq))
638 wake_up_all(&wb->bdp_waitq);
639
640 history = inode->i_wb_frn_history;
641 avg_time = inode->i_wb_frn_avg_time;
642
643 /* pick the winner of this round */
644 if (wbc->wb_bytes >= wbc->wb_lcand_bytes &&
645 wbc->wb_bytes >= wbc->wb_tcand_bytes) {
646 max_id = wbc->wb_id;
647 max_bytes = wbc->wb_bytes;
648 } else if (wbc->wb_lcand_bytes >= wbc->wb_tcand_bytes) {
649 max_id = wbc->wb_lcand_id;
650 max_bytes = wbc->wb_lcand_bytes;
651 } else {
652 max_id = wbc->wb_tcand_id;
653 max_bytes = wbc->wb_tcand_bytes;
654 }
655
656 /*
657 * Calculate the amount of IO time the winner consumed and fold it
658 * into the running average kept per inode. If the consumed IO
659 * time is lower than avag / WB_FRN_TIME_CUT_DIV, ignore it for
660 * deciding whether to switch or not. This is to prevent one-off
661 * small dirtiers from skewing the verdict.
662 */
663 max_time = DIV_ROUND_UP((max_bytes >> PAGE_SHIFT) << WB_FRN_TIME_SHIFT,
664 wb->avg_write_bandwidth);
665 if (avg_time)
666 avg_time += (max_time >> WB_FRN_TIME_AVG_SHIFT) -
667 (avg_time >> WB_FRN_TIME_AVG_SHIFT);
668 else
669 avg_time = max_time; /* immediate catch up on first run */
670
671 if (max_time >= avg_time / WB_FRN_TIME_CUT_DIV) {
672 int slots;
673
674 /*
675 * The switch verdict is reached if foreign wb's consume
676 * more than a certain proportion of IO time in a
677 * WB_FRN_TIME_PERIOD. This is loosely tracked by 16 slot
678 * history mask where each bit represents one sixteenth of
679 * the period. Determine the number of slots to shift into
680 * history from @max_time.
681 */
682 slots = min(DIV_ROUND_UP(max_time, WB_FRN_HIST_UNIT),
683 (unsigned long)WB_FRN_HIST_MAX_SLOTS);
684 history <<= slots;
685 if (wbc->wb_id != max_id)
686 history |= (1U << slots) - 1;
687
688 if (history)
689 trace_inode_foreign_history(inode, wbc, history);
690
691 /*
692 * Switch if the current wb isn't the consistent winner.
693 * If there are multiple closely competing dirtiers, the
694 * inode may switch across them repeatedly over time, which
695 * is okay. The main goal is avoiding keeping an inode on
696 * the wrong wb for an extended period of time.
697 */
698 if (hweight32(history) > WB_FRN_HIST_THR_SLOTS)
699 inode_switch_wbs(inode, max_id);
700 }
701
702 /*
703 * Multiple instances of this function may race to update the
704 * following fields but we don't mind occassional inaccuracies.
705 */
706 inode->i_wb_frn_winner = max_id;
707 inode->i_wb_frn_avg_time = min(avg_time, (unsigned long)U16_MAX);
708 inode->i_wb_frn_history = history;
709
710 wb_put(wbc->wb);
711 wbc->wb = NULL;
712 }
713 EXPORT_SYMBOL_GPL(wbc_detach_inode);
714
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 52219 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
2019-10-13 19:17 ` kbuild test robot
2019-10-13 20:13 ` kbuild test robot
@ 2019-10-15 10:22 ` Jan Kara
2019-10-15 14:03 ` Hillf Danton
3 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2019-10-15 10:22 UTC (permalink / raw)
To: Hillf Danton
Cc: mm, fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo,
Jan Kara, Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman
Hello,
On Sat 12-10-19 21:27:40, Hillf Danton wrote:
> The behaviors of the elastic bdi (ebdi) observed in the current cgwb
> bandwidth measurement include
>
> 1, like spinning disks on market ebdi can do ~128MB/s IOs in consective
> minutes in few scenarios, or higher like SSD, or lower like USB key.
>
> 2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
> current time window of 200ms, while it was 16M/s in the previous one.
>
> 3, it will be either 100MB/s in the next time window if wb-B joins wb-A
> writing pages out or 18MB/s if wb-C also decides to chime in.
>
> With the help of bandwidth gauged above, what is left in balancing dirty
> pages, bdp, is try to make wb-A's laundry speed catch up dirty speed in
> every 200ms interval without knowing what wb-B is doing.
>
> No heuristic is added in this work because ebdi does bdp without it.
Thanks for the patch but honestly, I have hard time understanding what is
the purpose of this patch from the changelog. Some kind of writeback
throttling? And why is this needed? Also some highlevel description of what
your solution is would be good...
Honza
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Hillf Danton <hdanton@sina.com>
> ---
>
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -157,6 +157,9 @@ struct bdi_writeback {
> struct list_head memcg_node; /* anchored at memcg->cgwb_list */
> struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
>
> +#ifdef CONFIG_CGWB_BDP_WITH_EBDI
> + struct wait_queue_head bdp_waitq;
> +#endif
> union {
> struct work_struct release_work;
> struct rcu_head rcu;
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
> goto out_destroy_stat;
> }
>
> + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> + init_waitqueue_head(&wb->bdp_waitq);
> +
> return 0;
>
> out_destroy_stat:
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
> }
> }
>
> +#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> +{
> + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +
> + if (fatal_signal_pending(current))
> + return false;
> +
> + gdtc.avail = global_dirtyable_memory();
> +
> + domain_dirty_limits(&gdtc);
> +
> + gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> + global_node_page_state(NR_UNSTABLE_NFS) +
> + global_node_page_state(NR_WRITEBACK);
> +
> + if (gdtc.dirty < gdtc.bg_thresh)
> + return false;
> +
> + if (!writeback_in_progress(wb))
> + wb_start_background_writeback(wb);
> +
> + /*
> + * throttle if laundry speed remarkably falls behind dirty speed
> + * in the current time window of 200ms
> + */
> + return gdtc.dirty > gdtc.thresh &&
> + wb_stat(wb, WB_DIRTIED) >
> + wb_stat(wb, WB_WRITTEN) +
> + wb_stat_error();
> +}
> +
> +static inline void cgwb_bdp(struct bdi_writeback *wb)
> +{
> + wait_event_interruptible_timeout(wb->bdp_waitq,
> + !cgwb_bdp_should_throttle(wb), HZ);
> +}
> +#endif
> +
> /*
> * balance_dirty_pages() must be called by processes which are generating dirty
> * data. It looks at the number of dirty pages in the machine and will force
> @@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
> preempt_enable();
>
> if (unlikely(current->nr_dirtied >= ratelimit))
> - balance_dirty_pages(wb, current->nr_dirtied);
> + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> + cgwb_bdp(wb);
> + else
> + balance_dirty_pages(wb, current->nr_dirtied);
>
> wb_put(wb);
> }
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
> if (!wb)
> return;
>
> + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> + if (waitqueue_active(&wb->bdp_waitq))
> + wake_up_all(&wb->bdp_waitq);
> +
> history = inode->i_wb_frn_history;
> avg_time = inode->i_wb_frn_avg_time;
>
> @@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
> if (nr_pages == LONG_MAX)
> return LONG_MAX;
>
> + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> + return nr_pages;
> /*
> * This may be called on clean wb's and proportional distribution
> * may not make sense, just use the original @nr_pages in those
> @@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
> if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> pages = LONG_MAX;
> else {
> + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> + return work->nr_pages;
> +
> pages = min(wb->avg_write_bandwidth / 2,
> global_wb_domain.dirty_limit / DIRTY_SCOPE);
> pages = min(pages, work->nr_pages);
> --
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-12 13:27 [RFC] writeback: add elastic bdi in cgwb bdp Hillf Danton
` (2 preceding siblings ...)
2019-10-15 10:22 ` Jan Kara
@ 2019-10-15 14:03 ` Hillf Danton
2019-10-15 14:37 ` Tejun Heo
2019-10-16 2:26 ` Hillf Danton
3 siblings, 2 replies; 7+ messages in thread
From: Hillf Danton @ 2019-10-15 14:03 UTC (permalink / raw)
To: Jan Kara
Cc: mm, fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo,
Johannes Weiner, Shakeel Butt, Fengguang Wu, Hillf Danton,
Minchan Kim, Mel Gorman
Hey Jan
On Tue, 15 Oct 2019 12:22:10 +0200 Jan Kara wrote:
> Hello,
>
> On Sat 12-10-19 21:27:40, Hillf Danton wrote:
> >
> > The behaviors of the elastic bdi (ebdi) observed in the current cgwb
> > bandwidth measurement include
> >
> > 1, like spinning disks on market ebdi can do ~128MB/s IOs in consective
> > minutes in few scenarios, or higher like SSD, or lower like USB key.
> >
> > 2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
> > current time window of 200ms, while it was 16M/s in the previous one.
> >
> > 3, it will be either 100MB/s in the next time window if wb-B joins wb-A
> > writing pages out or 18MB/s if wb-C also decides to chime in.
> >
> > With the help of bandwidth gauged above, what is left in balancing dirty
> > pages, bdp, is try to make wb-A's laundry speed catch up dirty speed in
> > every 200ms interval without knowing what wb-B is doing.
> >
> > No heuristic is added in this work because ebdi does bdp without it.
>
> Thanks for the patch but honestly, I have hard time understanding what is
> the purpose of this patch from the changelog.
Fault on my side. I will try to make it as clear as I could.
Under the cover of "behaviors of elastic bdi" I list the difficulties in
the current writeback bandwidth measurings, particularly in the case with
CONFIG_CGROUP_WRITEBACK enabled, with the phrase ebdi used to abstract
the attribute of hardwares, like spinning disk, SSD and USB storage,
that their physical bandwidth is a constant.
The difference between that constant and the bandwidth currently
measured comes from, I think, the IO pattern dispatched to hardware in
the time interval of 200ms. How much sense does it make to guide wb-A's
IO in the next 200ms without idea about what other wbs are doing? What
should be modeled and built on top of the measured bw value?
Hard to say.
What will bdp OTOH look like on top of ebdi without the hard work of
measuring bw?
A name came up before I am tapping this message, though not available
when I sent the RFC, and it essentially is that ebdi paves a brick for
applying the walk-dog method to bdp: let wb-A's laundry speed walk its
dirty speed the same way as pet owners in Paris, Prague and other cities
go walking their dogs every day with a leash worth two dimes on average.
Is a $200 electronic walkmeter needed to have a good time of walking dog
in London?
Nope, I think because it makes ant-eyelash-size sense to gauge the walker's
speed first with that gadget prone to glitch and then teach the dog to
walk that speed, and to do more based on it.
The only reason I have to do walk-dog in bdp is that laundry speed
remarkably falls behind dirty speed in every case of bdp with no exception.
And a leash is supposed to do the job in a manner that it should naturally
be, even though laundry speed changes in every 200ms interval and is
usually hard to predict before hand under real workloads, with two things
below met in every 200ms:
1, dirty pages in the system clamped near the threshold that is configurable
in userspace,
2, dirty speed of every wb glued as close to the laundry speed as possible,
in long run.
Should walk-dog be in place, then we can do cleanups of bw measurement and
things dependent of it a step after another.
Thanks
Hillf
> Some kind of writeback throttling?
> And why is this needed?
> Also some highlevel description of what
> your solution is would be good...
>
> Honza
>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Hillf Danton <hdanton@sina.com>
> > ---
> >
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -157,6 +157,9 @@ struct bdi_writeback {
> > struct list_head memcg_node; /* anchored at memcg->cgwb_list */
> > struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
> >
> > +#ifdef CONFIG_CGWB_BDP_WITH_EBDI
> > + struct wait_queue_head bdp_waitq;
> > +#endif
> > union {
> > struct work_struct release_work;
> > struct rcu_head rcu;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
> > goto out_destroy_stat;
> > }
> >
> > + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > + init_waitqueue_head(&wb->bdp_waitq);
> > +
> > return 0;
> >
> > out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
> > }
> > }
> >
> > +#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> > +{
> > + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > + if (fatal_signal_pending(current))
> > + return false;
> > +
> > + gdtc.avail = global_dirtyable_memory();
> > +
> > + domain_dirty_limits(&gdtc);
> > +
> > + gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > + global_node_page_state(NR_UNSTABLE_NFS) +
> > + global_node_page_state(NR_WRITEBACK);
> > +
> > + if (gdtc.dirty < gdtc.bg_thresh)
> > + return false;
> > +
> > + if (!writeback_in_progress(wb))
> > + wb_start_background_writeback(wb);
> > +
> > + /*
> > + * throttle if laundry speed remarkably falls behind dirty speed
> > + * in the current time window of 200ms
> > + */
> > + return gdtc.dirty > gdtc.thresh &&
> > + wb_stat(wb, WB_DIRTIED) >
> > + wb_stat(wb, WB_WRITTEN) +
> > + wb_stat_error();
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > + wait_event_interruptible_timeout(wb->bdp_waitq,
> > + !cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +#endif
> > +
> > /*
> > * balance_dirty_pages() must be called by processes which are generating dirty
> > * data. It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
> > preempt_enable();
> >
> > if (unlikely(current->nr_dirtied >= ratelimit))
> > - balance_dirty_pages(wb, current->nr_dirtied);
> > + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > + cgwb_bdp(wb);
> > + else
> > + balance_dirty_pages(wb, current->nr_dirtied);
> >
> > wb_put(wb);
> > }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
> > if (!wb)
> > return;
> >
> > + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > + if (waitqueue_active(&wb->bdp_waitq))
> > + wake_up_all(&wb->bdp_waitq);
> > +
> > history = inode->i_wb_frn_history;
> > avg_time = inode->i_wb_frn_avg_time;
> >
> > @@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
> > if (nr_pages == LONG_MAX)
> > return LONG_MAX;
> >
> > + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > + return nr_pages;
> > /*
> > * This may be called on clean wb's and proportional distribution
> > * may not make sense, just use the original @nr_pages in those
> > @@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
> > if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> > pages = LONG_MAX;
> > else {
> > + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > + IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > + return work->nr_pages;
> > +
> > pages = min(wb->avg_write_bandwidth / 2,
> > global_wb_domain.dirty_limit / DIRTY_SCOPE);
> > pages = min(pages, work->nr_pages);
> > --
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-15 14:03 ` Hillf Danton
@ 2019-10-15 14:37 ` Tejun Heo
2019-10-16 2:26 ` Hillf Danton
1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2019-10-15 14:37 UTC (permalink / raw)
To: Hillf Danton
Cc: Jan Kara, mm, fsdev, Andrew Morton, linux, Roman Gushchin,
Johannes Weiner, Shakeel Butt, Fengguang Wu, Minchan Kim,
Mel Gorman
Hello, Hillf.
Do you have a test case which can demonstrate the problem you're
seeing in the existing code?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC] writeback: add elastic bdi in cgwb bdp
2019-10-15 14:03 ` Hillf Danton
2019-10-15 14:37 ` Tejun Heo
@ 2019-10-16 2:26 ` Hillf Danton
1 sibling, 0 replies; 7+ messages in thread
From: Hillf Danton @ 2019-10-16 2:26 UTC (permalink / raw)
To: Tejun Heo
Cc: Hillf Danton, Jan Kara, mm, fsdev, Andrew Morton, linux,
Roman Gushchin, Johannes Weiner, Shakeel Butt, Fengguang Wu,
Minchan Kim, Mel Gorman
Hello Tejun
On Tue, 15 Oct 2019 07:37:31 -0700 Tejun Heo wrote:
> Hello, Hillf.
>
> Do you have a test case which can demonstrate the problem you're
> seeing in the existing code?
I dont have such a test case because I see no problem in the current
bw measurings except for the difficulties. For example, wb-A's bw is
measured to be 66MB/s if wb-B joins it dispatching IO, or 96MB/s if
wb-C also joins them, in assumption it is a simple case.
It may be too difficult to be feasible, I am afraid, to get wb-A's bw
under the workloads in data centers without wb-non-A's churnings.
Thanks
Hillf
^ permalink raw reply [flat|nested] 7+ messages in thread