* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  [not found] <1347798342-2830-1-git-send-email-linkinjeon@gmail.com>
@ 2012-09-20  8:44 ` Fengguang Wu
  2012-09-24 22:23   ` Jan Kara
  2012-09-25  1:36   ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread

From: Fengguang Wu @ 2012-09-20 8:44 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi,
      Linux Memory Management List, linux-fsdevel

[ CC FS and MM lists ]

Patch looks good to me; however, we need to be careful because it's
introducing a new interface. So it's desirable to get some acks from
the FS/MM developers.

Thanks,
Fengguang

On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@samsung.com>
>
> This patch is based on a suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
>
> The kernel has a mechanism to do writeback as per dirty_ratio and
> dirty_background_ratio. It also maintains a per-task dirty rate limit to
> keep the number of dirty pages balanced at any given instant by doing bdi
> bandwidth estimation.
>
> The kernel also has max_ratio/min_ratio tunables to specify the percentage
> of write cache used to control per-bdi dirty limits and task throttling.
>
> However, there might be a use case where the user wants a per-bdi writeback
> tuning parameter to flush dirty data once the per-bdi dirty data reaches a
> threshold, especially on an NFS server.
>
> dirty_background_centisecs provides an interface where the user can tune
> the background writeback start threshold using
> /sys/block/sda/bdi/dirty_background_centisecs
>
> dirty_background_centisecs is used along with the average bdi write
> bandwidth estimation to start background writeback.
>
> One use case that demonstrates the patch functionality is an NFS setup:
> We have an NFS setup with an Ethernet line of 100 Mbps, while the USB
> disk attached to the server has a local speed of 25 MB/s. Server
> and client are both ARM target boards.
>
> Now if we perform a write operation over NFS (client to server), as
> per the network speed, data can travel at a maximum speed of 100 Mbps. But
> if we check the default write speed of the USB HDD over NFS, it comes
> out to around 8 MB/sec, far below the speed of the network.
>
> The reason is that, as per the NFS logic, during a write operation pages
> are first dirtied on the NFS client side; then, after reaching the dirty
> threshold/writeback limit (or in case of sync), the data is actually sent
> to the NFS server (so pages are dirtied again on the server side). This
> is done in the COMMIT call from client to server, i.e. if 100 MB of data
> is dirtied and sent, it will take a minimum of 100 MB / 100 Mbps ~ 8-9
> seconds.
>
> After the data is received, it will take approximately 100/25 ~ 4 seconds
> to write the data to the USB HDD on the server side. This makes the
> overall time to write this much data ~12 seconds, which in practice comes
> out to be near 7 to 8 MB/second. After this a COMMIT response is sent to
> the NFS client.
>
> However, we may improve this write performance by making use of the NFS
> server's idle time, i.e. while data is being received from the client,
> simultaneously initiate the writeback thread on the server side. So
> instead of waiting for the complete data to arrive and then starting
> writeback, we can work in parallel while the network is still busy
> receiving the data. In this way the overall performance is improved.
>
> If we tune dirty_background_centisecs, we can see an increase in
> performance, and it comes out to be ~11 MB/sec.
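A quick sanity check of the figures above, using only the numbers quoted in
the description (a 100 Mbps link, a 25 MB/s local disk, a 100 MB write); the
kernel's bandwidth estimate will of course drift from these idealized values:

    network transfer:        100 MB * 8 / 100 Mbps     ~ 8 s
    local USB write:         100 MB / 25 MB/s          ~ 4 s
    serialized (today):      100 MB / (8 s + 4 s)      ~ 8.3 MB/s
    fully overlapped bound:  100 MB / max(8 s, 4 s)    ~ 12.5 MB/s

So the jump from ~8 MB/s to the ~10-11 MB/s measured below is consistent with
partially overlapping the network receive with local writeback on the server.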
> Results are:
>
> Write test (create a 1 GB file) results at the NFS client after changing
> /sys/block/sda/bdi/dirty_background_centisecs
> on the *** NFS Server only - not on the NFS Client ***
>
> ----------------------------------------------------------------------
> | WRITE Test with various 'dirty_background_centisecs' at NFS Server  |
> ----------------------------------------------------------------------
> |          | default = 0 | 300 centisec | 200 centisec | 100 centisec |
> ----------------------------------------------------------------------
> | RecSize  | WriteSpeed  | WriteSpeed   | WriteSpeed   | WriteSpeed   |
> ----------------------------------------------------------------------
> | 10485760 | 8.44MB/sec  | 8.60MB/sec   | 9.30MB/sec   | 10.27MB/sec  |
> | 1048576  | 8.48MB/sec  | 8.87MB/sec   | 9.31MB/sec   | 10.34MB/sec  |
> | 524288   | 8.37MB/sec  | 8.42MB/sec   | 9.84MB/sec   | 10.47MB/sec  |
> | 262144   | 8.16MB/sec  | 8.51MB/sec   | 9.52MB/sec   | 10.62MB/sec  |
> | 131072   | 8.48MB/sec  | 8.81MB/sec   | 9.42MB/sec   | 10.55MB/sec  |
> | 65536    | 8.38MB/sec  | 9.09MB/sec   | 9.76MB/sec   | 10.53MB/sec  |
> | 32768    | 8.65MB/sec  | 9.00MB/sec   | 9.57MB/sec   | 10.54MB/sec  |
> | 16384    | 8.27MB/sec  | 8.80MB/sec   | 9.39MB/sec   | 10.43MB/sec  |
> | 8192     | 8.52MB/sec  | 8.70MB/sec   | 9.40MB/sec   | 10.50MB/sec  |
> | 4096     | 8.20MB/sec  | 8.63MB/sec   | 9.80MB/sec   | 10.35MB/sec  |
> ----------------------------------------------------------------------
>
> We can see that the average write speed increases to ~10-11MB/sec.
> ============================================================
>
> This patch provides the changes per block device, so that we may tune
> dirty_background_centisecs per device; the overall system is not impacted
> by the change, and we get improved performance in certain use cases.
>
> NOTE: dirty_background_centisecs is used along with the average bdi write
> bandwidth estimation to start background writeback. But bdi write bandwidth
> estimation is an _estimation_ and may become wildly wrong, so
> dirty_background_centisecs tuning may not always meet the user's
> expectations. dirty_background_centisecs will require careful tuning by
> users on an NFS server.
> As a good starting point, dirty_background_centisecs should be set around
> 100 (1 sec). It should not be set to a very small value, otherwise it will
> start flushing for small amounts of dirty data.
>
> Changes since v1:
> * Make the default value 'dirty_background_centisecs = 0' so that there is
>   no change in the default writeback behaviour.
> * Add a description of dirty_background_centisecs to the documentation.
> > Original-patch-by: Wu Fengguang <fengguang.wu@intel.com> > Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> > Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> > --- > fs/fs-writeback.c | 21 +++++++++++++++++++-- > include/linux/backing-dev.h | 1 + > include/linux/writeback.h | 1 + > mm/backing-dev.c | 21 +++++++++++++++++++++ > mm/page-writeback.c | 3 ++- > 5 files changed, 44 insertions(+), 3 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index fd255c0..c427130 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi) > return false; > } > > +bool over_dirty_bground_time(struct backing_dev_info *bdi) > +{ > + unsigned long background_thresh; > + > + if (!bdi->dirty_background_centisecs) > + return false; > + > + background_thresh = bdi->avg_write_bandwidth * > + bdi->dirty_background_centisecs / 100; > + > + if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh) > + return true; > + > + return false; > +} > + > /* > * Called under wb->list_lock. If there are multiple wb per bdi, > * only the flusher working on the first wb should do it. > @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb, > * For background writeout, stop when we are below the > * background dirty threshold > */ > - if (work->for_background && !over_bground_thresh(wb->bdi)) > + if (work->for_background && !over_bground_thresh(wb->bdi) && > + !over_dirty_bground_time(wb->bdi)) > break; > > /* > @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void) > > static long wb_check_background_flush(struct bdi_writeback *wb) > { > - if (over_bground_thresh(wb->bdi)) { > + if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) { > > struct wb_writeback_work work = { > .nr_pages = LONG_MAX, > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > index 2a9a9ab..43d2e15 100644 > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -95,6 +95,7 @@ struct backing_dev_info { > > unsigned int min_ratio; > unsigned int max_ratio, max_prop_frac; > + unsigned int dirty_background_centisecs; > > struct bdi_writeback wb; /* default writeback info for this bdi */ > spinlock_t wb_lock; /* protects work_list */ > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index 50c3e8f..6dc2abe 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages, > long wb_do_writeback(struct bdi_writeback *wb, int force_wait); > void wakeup_flusher_threads(long nr_pages, enum wb_reason reason); > void inode_wait_for_writeback(struct inode *inode); > +bool over_dirty_bground_time(struct backing_dev_info *bdi); > > /* writeback.h requires fs.h; it, too, is not included from here. 
*/ > static inline void wait_on_inode(struct inode *inode) > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index d3ca2b3..b1b2fd2 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev, > } > BDI_SHOW(max_ratio, bdi->max_ratio) > > +static ssize_t dirty_background_centisecs_store(struct device *dev, > + struct device_attribute *attr, const char *buf, size_t count) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + unsigned int centisecs; > + ssize_t ret; > + > + ret = kstrtouint(buf, 10, ¢isecs); > + if (ret < 0) > + return ret; > + > + bdi->dirty_background_centisecs = centisecs; > + if (over_dirty_bground_time(bdi)) > + bdi_start_background_writeback(bdi); > + > + return count; > +} > +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs) > + > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) > > static struct device_attribute bdi_dev_attrs[] = { > __ATTR_RW(read_ahead_kb), > __ATTR_RW(min_ratio), > __ATTR_RW(max_ratio), > + __ATTR_RW(dirty_background_centisecs), > __ATTR_NULL, > }; > > @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi) > bdi->min_ratio = 0; > bdi->max_ratio = 100; > bdi->max_prop_frac = FPROP_FRAC_BASE; > + bdi->dirty_background_centisecs = 0; > spin_lock_init(&bdi->wb_lock); > INIT_LIST_HEAD(&bdi->bdi_list); > INIT_LIST_HEAD(&bdi->work_list); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 5ad5ce2..8c1530d 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -1403,7 +1403,8 @@ pause: > if (laptop_mode) > return; > > - if (nr_reclaimable > background_thresh) > + if (nr_reclaimable > background_thresh || > + over_dirty_bground_time(bdi)) > bdi_start_background_writeback(bdi); > } > > -- > 1.7.9.5 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
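For illustration only, a minimal user-space sketch of how the proposed knob
could be driven; the sysfs path is the one quoted in the patch description,
and the value of 100 centisecs (roughly one second of dirty data at the
estimated bandwidth) follows the NOTE above. This is an assumed usage pattern,
not part of the patch itself:

/* sketch: set the proposed per-bdi dirty_background_centisecs knob */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/block/sda/bdi/dirty_background_centisecs";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	/* per the patch, background writeback starts once the bdi holds more
	 * than avg_write_bandwidth * centisecs / 100 bytes of dirty data */
	fprintf(f, "%d\n", 100);	/* 100 centisecs = 1 second */
	fclose(f);
	return 0;
}

An equivalent one-liner is simply echoing 100 into the same sysfs file; the
point is only that the knob is a plain decimal, per-bdi attribute.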
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-20 8:44 ` [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Fengguang Wu @ 2012-09-24 22:23 ` Jan Kara 2012-09-25 6:50 ` Namjae Jeon 2012-09-26 16:56 ` Fengguang Wu 2012-09-25 1:36 ` Dave Chinner 1 sibling, 2 replies; 16+ messages in thread From: Jan Kara @ 2012-09-24 22:23 UTC (permalink / raw) To: Fengguang Wu Cc: Namjae Jeon, Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > From: Namjae Jeon <namjae.jeon@samsung.com> > > > > This patch is based on suggestion by Wu Fengguang: > > https://lkml.org/lkml/2011/8/19/19 > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > ratio. It also maintains per task dirty rate limit to keep balance of > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > writecache to control per bdi dirty limits and task throttling. > > > > However, there might be a usecase where user wants a per bdi writeback tuning > > parameter to flush dirty data once per bdi dirty data reach a threshold > > especially at NFS server. > > > > dirty_background_centisecs provides an interface where user can tune > > background writeback start threshold using > > /sys/block/sda/bdi/dirty_background_centisecs > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > estimation to start background writeback. The functionality you describe, i.e. start flushing bdi when there's reasonable amount of dirty data on it, looks sensible and useful. However I'm not so sure whether the interface you propose is the right one. Traditionally, we allow user to set amount of dirty data (either in bytes or percentage of memory) when background writeback should start. You propose setting the amount of data in centisecs-to-write. Why that difference? Also this interface ties our throughput estimation code (which is an implementation detail of current dirty throttling) with the userspace API. So we'd have to maintain the estimation code forever, possibly also face problems when we change the estimation code (and thus estimates in some cases) and users will complain that the values they set originally no longer work as they used to. Also, as with each knob, there's a problem how to properly set its value? Most admins won't know about the knob and so won't touch it. Others might know about the knob but will have hard time figuring out what value should they set. So if there's a new knob, it should have a sensible initial value. And since this feature looks like a useful one, it shouldn't be zero. So my personal preference would be to have bdi->dirty_background_ratio and bdi->dirty_background_bytes and start background writeback whenever one of global background limit and per-bdi background limit is exceeded. I think this interface will do the job as well and it's easier to maintain in future. Honza > > One of the use case to demonstrate the patch functionality can be > > on NFS setup:- > > We have a NFS setup with ethernet line of 100Mbps, while the USB > > disk is attached to server, which has a local speed of 25MBps. Server > > and client both are arm target boards. 
> > > > Now if we perform a write operation over NFS (client to server), as > > per the network speed, data can travel at max speed of 100Mbps. But > > if we check the default write speed of USB hdd over NFS it comes > > around to 8MB/sec, far below the speed of network. > > > > Reason being is as per the NFS logic, during write operation, initially > > pages are dirtied on NFS client side, then after reaching the dirty > > threshold/writeback limit (or in case of sync) data is actually sent > > to NFS server (so now again pages are dirtied on server side). This > > will be done in COMMIT call from client to server i.e if 100MB of data > > is dirtied and sent then it will take minimum 100MB/100Mbps ~ 8-9 seconds. > > > > After the data is received, now it will take approx 100/25 ~4 Seconds to > > write the data to USB Hdd on server side. Hence making the overall time > > to write this much of data ~12 seconds, which in practically comes out to > > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS > > client. > > > > However we may improve this write performace by making the use of NFS > > server idle time i.e while data is being received from the client, > > simultaneously initiate the writeback thread on server side. So instead > > of waiting for the complete data to come and then start the writeback, > > we can work in parallel while the network is still busy in receiving the > > data. Hence in this way overall performace will be improved. > > > > If we tune dirty_background_centisecs, we can see there > > is increase in the performace and it comes out to be ~ 11MB/seconds. > > Results are:- > > > > Write test(create a 1 GB file) result at 'NFS client' after changing > > /sys/block/sda/bdi/dirty_background_centisecs > > on *** NFS Server only - not on NFS Client **** > > > > --------------------------------------------------------------------- > > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | > > --------------------------------------------------------------------- > > | | default = 0 | 300 centisec| 200 centisec| 100 centisec | > > --------------------------------------------------------------------- > > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed | > > --------------------------------------------------------------------- > > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec | > > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec | > > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec | > > | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec | > > | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec | > > | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec | > > | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec | > > | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec | > > | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec | > > | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec | > > --------------------------------------------------------------------- > > > > we can see, average write speed is increased to ~10-11MB/sec. > > ============================================================ > > > > This patch provides the changes per block devices. So that we may modify the > > dirty_background_centisecs as per the device and overall system is not impacted > > by the changes and we get improved perforamace in certain use cases. 
> > > > NOTE: dirty_background_centisecs is used alongwith average bdi write bandwidth > > estimation to start background writeback. But, bdi write bandwidth estimation > > is an _estimation_ and may become wildly wrong. dirty_background_centisecs > > tuning may not always work to the user expectations. dirty_background_centisecs > > will require careful tuning by users on NFS Server. > > As a good use case, dirty_background_time should be set around 100 (1 sec). > > It should not be set to very small value, otherwise it will start > > flushing for small I/O size dirty data. > > > > Changes since v1: > > * make default value of 'dirty_background_centisecs = 0' sothat there is no change > > in default writeback behaviour. > > * Add description of dirty_background_centisecs in documentation. > > > > Original-patch-by: Wu Fengguang <fengguang.wu@intel.com> > > Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> > > Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> > > --- > > fs/fs-writeback.c | 21 +++++++++++++++++++-- > > include/linux/backing-dev.h | 1 + > > include/linux/writeback.h | 1 + > > mm/backing-dev.c | 21 +++++++++++++++++++++ > > mm/page-writeback.c | 3 ++- > > 5 files changed, 44 insertions(+), 3 deletions(-) > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > index fd255c0..c427130 100644 > > --- a/fs/fs-writeback.c > > +++ b/fs/fs-writeback.c > > @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi) > > return false; > > } > > > > +bool over_dirty_bground_time(struct backing_dev_info *bdi) > > +{ > > + unsigned long background_thresh; > > + > > + if (!bdi->dirty_background_centisecs) > > + return false; > > + > > + background_thresh = bdi->avg_write_bandwidth * > > + bdi->dirty_background_centisecs / 100; > > + > > + if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh) > > + return true; > > + > > + return false; > > +} > > + > > /* > > * Called under wb->list_lock. If there are multiple wb per bdi, > > * only the flusher working on the first wb should do it. 
> > @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb, > > * For background writeout, stop when we are below the > > * background dirty threshold > > */ > > - if (work->for_background && !over_bground_thresh(wb->bdi)) > > + if (work->for_background && !over_bground_thresh(wb->bdi) && > > + !over_dirty_bground_time(wb->bdi)) > > break; > > > > /* > > @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void) > > > > static long wb_check_background_flush(struct bdi_writeback *wb) > > { > > - if (over_bground_thresh(wb->bdi)) { > > + if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) { > > > > struct wb_writeback_work work = { > > .nr_pages = LONG_MAX, > > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > > index 2a9a9ab..43d2e15 100644 > > --- a/include/linux/backing-dev.h > > +++ b/include/linux/backing-dev.h > > @@ -95,6 +95,7 @@ struct backing_dev_info { > > > > unsigned int min_ratio; > > unsigned int max_ratio, max_prop_frac; > > + unsigned int dirty_background_centisecs; > > > > struct bdi_writeback wb; /* default writeback info for this bdi */ > > spinlock_t wb_lock; /* protects work_list */ > > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > > index 50c3e8f..6dc2abe 100644 > > --- a/include/linux/writeback.h > > +++ b/include/linux/writeback.h > > @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages, > > long wb_do_writeback(struct bdi_writeback *wb, int force_wait); > > void wakeup_flusher_threads(long nr_pages, enum wb_reason reason); > > void inode_wait_for_writeback(struct inode *inode); > > +bool over_dirty_bground_time(struct backing_dev_info *bdi); > > > > /* writeback.h requires fs.h; it, too, is not included from here. 
*/ > > static inline void wait_on_inode(struct inode *inode) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > index d3ca2b3..b1b2fd2 100644 > > --- a/mm/backing-dev.c > > +++ b/mm/backing-dev.c > > @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev, > > } > > BDI_SHOW(max_ratio, bdi->max_ratio) > > > > +static ssize_t dirty_background_centisecs_store(struct device *dev, > > + struct device_attribute *attr, const char *buf, size_t count) > > +{ > > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > > + unsigned int centisecs; > > + ssize_t ret; > > + > > + ret = kstrtouint(buf, 10, ¢isecs); > > + if (ret < 0) > > + return ret; > > + > > + bdi->dirty_background_centisecs = centisecs; > > + if (over_dirty_bground_time(bdi)) > > + bdi_start_background_writeback(bdi); > > + > > + return count; > > +} > > +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs) > > + > > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) > > > > static struct device_attribute bdi_dev_attrs[] = { > > __ATTR_RW(read_ahead_kb), > > __ATTR_RW(min_ratio), > > __ATTR_RW(max_ratio), > > + __ATTR_RW(dirty_background_centisecs), > > __ATTR_NULL, > > }; > > > > @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi) > > bdi->min_ratio = 0; > > bdi->max_ratio = 100; > > bdi->max_prop_frac = FPROP_FRAC_BASE; > > + bdi->dirty_background_centisecs = 0; > > spin_lock_init(&bdi->wb_lock); > > INIT_LIST_HEAD(&bdi->bdi_list); > > INIT_LIST_HEAD(&bdi->work_list); > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index 5ad5ce2..8c1530d 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -1403,7 +1403,8 @@ pause: > > if (laptop_mode) > > return; > > > > - if (nr_reclaimable > background_thresh) > > + if (nr_reclaimable > background_thresh || > > + over_dirty_bground_time(bdi)) > > bdi_start_background_writeback(bdi); > > } > > > > -- > > 1.7.9.5 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
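To make this alternative concrete, here is a rough sketch of the check being
suggested, written against the fs-writeback code of this era: keep the
existing global and per-bdi proportional checks, and additionally honour a new
bdi->dirty_background_bytes field. The field name and placement are
assumptions, this is not the posted patch, and a value of 0 preserves today's
behaviour:

static bool over_bground_thresh(struct backing_dev_info *bdi)
{
	unsigned long background_thresh, dirty_thresh;

	global_dirty_limits(&background_thresh, &dirty_thresh);

	if (global_page_state(NR_FILE_DIRTY) +
	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
		return true;

	if (bdi_stat(bdi, BDI_RECLAIMABLE) >
	    bdi_dirty_limit(bdi, background_thresh))
		return true;

	/* proposed per-bdi byte limit; 0 means no per-bdi background limit */
	if (bdi->dirty_background_bytes &&
	    bdi_stat(bdi, BDI_RECLAIMABLE) >
	    DIV_ROUND_UP(bdi->dirty_background_bytes, PAGE_SIZE))
		return true;

	return false;
}

Writeback then fires when either the global background limit or the per-bdi
byte limit is exceeded, matching the "one of global background limit and
per-bdi background limit" semantics described above, and it avoids tying the
ABI to the bandwidth estimator.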
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-24 22:23 ` Jan Kara @ 2012-09-25 6:50 ` Namjae Jeon 2012-09-26 16:56 ` Fengguang Wu 1 sibling, 0 replies; 16+ messages in thread From: Namjae Jeon @ 2012-09-25 6:50 UTC (permalink / raw) To: Jan Kara, Fengguang Wu Cc: Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/9/25, Jan Kara <jack@suse.cz>: > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: >> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> > From: Namjae Jeon <namjae.jeon@samsung.com> >> > >> > This patch is based on suggestion by Wu Fengguang: >> > https://lkml.org/lkml/2011/8/19/19 >> > >> > kernel has mechanism to do writeback as per dirty_ratio and >> > dirty_background >> > ratio. It also maintains per task dirty rate limit to keep balance of >> > dirty pages at any given instance by doing bdi bandwidth estimation. >> > >> > Kernel also has max_ratio/min_ratio tunables to specify percentage of >> > writecache to control per bdi dirty limits and task throttling. >> > >> > However, there might be a usecase where user wants a per bdi writeback >> > tuning >> > parameter to flush dirty data once per bdi dirty data reach a threshold >> > especially at NFS server. >> > >> > dirty_background_centisecs provides an interface where user can tune >> > background writeback start threshold using >> > /sys/block/sda/bdi/dirty_background_centisecs >> > >> > dirty_background_centisecs is used alongwith average bdi write >> > bandwidth >> > estimation to start background writeback. > The functionality you describe, i.e. start flushing bdi when there's > reasonable amount of dirty data on it, looks sensible and useful. However > I'm not so sure whether the interface you propose is the right one. > Traditionally, we allow user to set amount of dirty data (either in bytes > or percentage of memory) when background writeback should start. You > propose setting the amount of data in centisecs-to-write. Why that > difference? Also this interface ties our throughput estimation code (which > is an implementation detail of current dirty throttling) with the userspace > API. So we'd have to maintain the estimation code forever, possibly also > face problems when we change the estimation code (and thus estimates in > some cases) and users will complain that the values they set originally no > longer work as they used to. > > Also, as with each knob, there's a problem how to properly set its value? > Most admins won't know about the knob and so won't touch it. Others might > know about the knob but will have hard time figuring out what value should > they set. So if there's a new knob, it should have a sensible initial > value. And since this feature looks like a useful one, it shouldn't be > zero. > > So my personal preference would be to have bdi->dirty_background_ratio and > bdi->dirty_background_bytes and start background writeback whenever > one of global background limit and per-bdi background limit is exceeded. I > think this interface will do the job as well and it's easier to maintain in > future. Hi Jan. Thanks for review and your opinion. Hi. Wu. How about adding per-bdi - bdi->dirty_background_ratio and bdi->dirty_background_bytes interface as suggested by Jan? Thanks. > > Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-24 22:23 ` Jan Kara 2012-09-25 6:50 ` Namjae Jeon @ 2012-09-26 16:56 ` Fengguang Wu 2012-09-26 20:22 ` Jan Kara 1 sibling, 1 reply; 16+ messages in thread From: Fengguang Wu @ 2012-09-26 16:56 UTC (permalink / raw) To: Jan Kara Cc: Namjae Jeon, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > > From: Namjae Jeon <namjae.jeon@samsung.com> > > > > > > This patch is based on suggestion by Wu Fengguang: > > > https://lkml.org/lkml/2011/8/19/19 > > > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > > ratio. It also maintains per task dirty rate limit to keep balance of > > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > > writecache to control per bdi dirty limits and task throttling. > > > > > > However, there might be a usecase where user wants a per bdi writeback tuning > > > parameter to flush dirty data once per bdi dirty data reach a threshold > > > especially at NFS server. > > > > > > dirty_background_centisecs provides an interface where user can tune > > > background writeback start threshold using > > > /sys/block/sda/bdi/dirty_background_centisecs > > > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > > estimation to start background writeback. > The functionality you describe, i.e. start flushing bdi when there's > reasonable amount of dirty data on it, looks sensible and useful. However > I'm not so sure whether the interface you propose is the right one. > Traditionally, we allow user to set amount of dirty data (either in bytes > or percentage of memory) when background writeback should start. You > propose setting the amount of data in centisecs-to-write. Why that > difference? Also this interface ties our throughput estimation code (which > is an implementation detail of current dirty throttling) with the userspace > API. So we'd have to maintain the estimation code forever, possibly also > face problems when we change the estimation code (and thus estimates in > some cases) and users will complain that the values they set originally no > longer work as they used to. Yes, that bandwidth estimation is not all that (and in theory cannot be made) reliable which may be a surprise to the user. Which make the interface flaky. > Also, as with each knob, there's a problem how to properly set its value? > Most admins won't know about the knob and so won't touch it. Others might > know about the knob but will have hard time figuring out what value should > they set. So if there's a new knob, it should have a sensible initial > value. And since this feature looks like a useful one, it shouldn't be > zero. Agreed in principle. There seems be no reasonable defaults for the centisecs-to-write interface, mainly due to its inaccurate nature, especially the initial value may be wildly wrong on fresh system bootup. This is also true for your proposed interfaces, see below. > So my personal preference would be to have bdi->dirty_background_ratio and > bdi->dirty_background_bytes and start background writeback whenever > one of global background limit and per-bdi background limit is exceeded. 
I > think this interface will do the job as well and it's easier to maintain in > future. bdi->dirty_background_ratio, if I understand its semantics right, is unfortunately flaky in the same principle as centisecs-to-write, because it relies on the (implicitly estimation of) writeout proportions. The writeout proportions for each bdi starts with 0, which is even worse than the 100MB/s initial value for bdi->write_bandwidth and will trigger background writeback on the first write. bdi->dirty_background_bytes is, however, reliable, and gives users total control. If we export this interface alone, I'd imagine users who want to control centisecs-to-write could run a simple script to periodically get the write bandwith value out of the existing bdi interface and echo it into bdi->dirty_background_bytes. Which makes simple yet good enough centisecs-to-write controlling. So what do you think about exporting a really dumb bdi->dirty_background_bytes, which will effectively give smart users the freedom to do smart control over per-bdi background writeback threshold? The users are offered the freedom to do his own bandwidth estimation and choose not to rely on the kernel estimation, which will free us from the burden of maintaining a flaky interface as well. :) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
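As a hedged sketch of that "simple script" idea in C: it assumes the kernel's
estimate can be read from the BdiWriteBandwidth line of the per-bdi debugfs
stats file, and that the proposed dirty_background_bytes knob sits next to the
existing bdi sysfs attributes. The 8:0 device name and both paths are
examples, not a merged interface:

#include <stdio.h>
#include <unistd.h>

#define STATS	"/sys/kernel/debug/bdi/8:0/stats"		/* example bdi */
#define KNOB	"/sys/block/sda/bdi/dirty_background_bytes"	/* proposed knob */

/* parse "BdiWriteBandwidth: <n> kBps" out of the debugfs stats file */
static long write_bandwidth_kbps(void)
{
	char line[128];
	long kbps = -1;
	FILE *f = fopen(STATS, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "BdiWriteBandwidth: %ld", &kbps) == 1)
			break;
	fclose(f);
	return kbps;
}

int main(void)
{
	const long centisecs = 100;	/* target ~1 second worth of dirty data */

	for (;;) {
		long kbps = write_bandwidth_kbps();

		if (kbps > 0) {
			FILE *f = fopen(KNOB, "w");

			if (f) {
				/* bytes = bandwidth * (centisecs / 100) seconds */
				fprintf(f, "%lld\n",
					(long long)kbps * 1024 * centisecs / 100);
				fclose(f);
			}
		}
		sleep(10);	/* refresh the threshold every 10 seconds */
	}
	return 0;
}

This keeps the bandwidth estimation entirely in user space; the kernel only
has to honour a dumb byte threshold, which is exactly the split argued for
above.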
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-26 16:56 ` Fengguang Wu @ 2012-09-26 20:22 ` Jan Kara 2012-09-27 6:00 ` Namjae Jeon 0 siblings, 1 reply; 16+ messages in thread From: Jan Kara @ 2012-09-26 20:22 UTC (permalink / raw) To: Fengguang Wu Cc: Jan Kara, Namjae Jeon, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Thu 27-09-12 00:56:02, Wu Fengguang wrote: > On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: > > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > > > From: Namjae Jeon <namjae.jeon@samsung.com> > > > > > > > > This patch is based on suggestion by Wu Fengguang: > > > > https://lkml.org/lkml/2011/8/19/19 > > > > > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > > > ratio. It also maintains per task dirty rate limit to keep balance of > > > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > > > writecache to control per bdi dirty limits and task throttling. > > > > > > > > However, there might be a usecase where user wants a per bdi writeback tuning > > > > parameter to flush dirty data once per bdi dirty data reach a threshold > > > > especially at NFS server. > > > > > > > > dirty_background_centisecs provides an interface where user can tune > > > > background writeback start threshold using > > > > /sys/block/sda/bdi/dirty_background_centisecs > > > > > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > > > estimation to start background writeback. > > The functionality you describe, i.e. start flushing bdi when there's > > reasonable amount of dirty data on it, looks sensible and useful. However > > I'm not so sure whether the interface you propose is the right one. > > Traditionally, we allow user to set amount of dirty data (either in bytes > > or percentage of memory) when background writeback should start. You > > propose setting the amount of data in centisecs-to-write. Why that > > difference? Also this interface ties our throughput estimation code (which > > is an implementation detail of current dirty throttling) with the userspace > > API. So we'd have to maintain the estimation code forever, possibly also > > face problems when we change the estimation code (and thus estimates in > > some cases) and users will complain that the values they set originally no > > longer work as they used to. > > Yes, that bandwidth estimation is not all that (and in theory cannot > be made) reliable which may be a surprise to the user. Which make the > interface flaky. > > > Also, as with each knob, there's a problem how to properly set its value? > > Most admins won't know about the knob and so won't touch it. Others might > > know about the knob but will have hard time figuring out what value should > > they set. So if there's a new knob, it should have a sensible initial > > value. And since this feature looks like a useful one, it shouldn't be > > zero. > > Agreed in principle. There seems be no reasonable defaults for the > centisecs-to-write interface, mainly due to its inaccurate nature, > especially the initial value may be wildly wrong on fresh system > bootup. This is also true for your proposed interfaces, see below. 
> > > So my personal preference would be to have bdi->dirty_background_ratio and > > bdi->dirty_background_bytes and start background writeback whenever > > one of global background limit and per-bdi background limit is exceeded. I > > think this interface will do the job as well and it's easier to maintain in > > future. > > bdi->dirty_background_ratio, if I understand its semantics right, is > unfortunately flaky in the same principle as centisecs-to-write, > because it relies on the (implicitly estimation of) writeout > proportions. The writeout proportions for each bdi starts with 0, > which is even worse than the 100MB/s initial value for > bdi->write_bandwidth and will trigger background writeback on the > first write. Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion estimates at all. Limit would be dirtiable_memory * bdi->dirty_backround_ratio. After all we want to start writeout to bdi when we have enough pages to reasonably load the device for a while which has nothing to do with how much is written to this device as compared to other devices. OTOH I'm not particularly attached to this interface. Especially since on a lot of today's machines, 1% is rather big so people might often end up using dirty_background_bytes anyway. > bdi->dirty_background_bytes is, however, reliable, and gives users > total control. If we export this interface alone, I'd imagine users > who want to control centisecs-to-write could run a simple script to > periodically get the write bandwith value out of the existing bdi > interface and echo it into bdi->dirty_background_bytes. Which makes > simple yet good enough centisecs-to-write controlling. > > So what do you think about exporting a really dumb > bdi->dirty_background_bytes, which will effectively give smart users > the freedom to do smart control over per-bdi background writeback > threshold? The users are offered the freedom to do his own bandwidth > estimation and choose not to rely on the kernel estimation, which will > free us from the burden of maintaining a flaky interface as well. :) That's fine with me. Just it would be nice if we gave bdi->dirty_background_bytes some useful initial value. Maybe like dirtiable_memory * dirty_background_ratio? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-26 20:22 ` Jan Kara @ 2012-09-27 6:00 ` Namjae Jeon 2012-09-27 11:52 ` Jan Kara 0 siblings, 1 reply; 16+ messages in thread From: Namjae Jeon @ 2012-09-27 6:00 UTC (permalink / raw) To: Jan Kara Cc: Fengguang Wu, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/9/27, Jan Kara <jack@suse.cz>: > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> > > > From: Namjae Jeon <namjae.jeon@samsung.com> >> > > > >> > > > This patch is based on suggestion by Wu Fengguang: >> > > > https://lkml.org/lkml/2011/8/19/19 >> > > > >> > > > kernel has mechanism to do writeback as per dirty_ratio and >> > > > dirty_background >> > > > ratio. It also maintains per task dirty rate limit to keep balance >> > > > of >> > > > dirty pages at any given instance by doing bdi bandwidth >> > > > estimation. >> > > > >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage >> > > > of >> > > > writecache to control per bdi dirty limits and task throttling. >> > > > >> > > > However, there might be a usecase where user wants a per bdi >> > > > writeback tuning >> > > > parameter to flush dirty data once per bdi dirty data reach a >> > > > threshold >> > > > especially at NFS server. >> > > > >> > > > dirty_background_centisecs provides an interface where user can >> > > > tune >> > > > background writeback start threshold using >> > > > /sys/block/sda/bdi/dirty_background_centisecs >> > > > >> > > > dirty_background_centisecs is used alongwith average bdi write >> > > > bandwidth >> > > > estimation to start background writeback. >> > The functionality you describe, i.e. start flushing bdi when there's >> > reasonable amount of dirty data on it, looks sensible and useful. >> > However >> > I'm not so sure whether the interface you propose is the right one. >> > Traditionally, we allow user to set amount of dirty data (either in >> > bytes >> > or percentage of memory) when background writeback should start. You >> > propose setting the amount of data in centisecs-to-write. Why that >> > difference? Also this interface ties our throughput estimation code >> > (which >> > is an implementation detail of current dirty throttling) with the >> > userspace >> > API. So we'd have to maintain the estimation code forever, possibly >> > also >> > face problems when we change the estimation code (and thus estimates in >> > some cases) and users will complain that the values they set originally >> > no >> > longer work as they used to. >> >> Yes, that bandwidth estimation is not all that (and in theory cannot >> be made) reliable which may be a surprise to the user. Which make the >> interface flaky. >> >> > Also, as with each knob, there's a problem how to properly set its >> > value? >> > Most admins won't know about the knob and so won't touch it. Others >> > might >> > know about the knob but will have hard time figuring out what value >> > should >> > they set. So if there's a new knob, it should have a sensible initial >> > value. And since this feature looks like a useful one, it shouldn't be >> > zero. >> >> Agreed in principle. 
There seems be no reasonable defaults for the >> centisecs-to-write interface, mainly due to its inaccurate nature, >> especially the initial value may be wildly wrong on fresh system >> bootup. This is also true for your proposed interfaces, see below. >> >> > So my personal preference would be to have bdi->dirty_background_ratio >> > and >> > bdi->dirty_background_bytes and start background writeback whenever >> > one of global background limit and per-bdi background limit is exceeded. >> > I >> > think this interface will do the job as well and it's easier to maintain >> > in >> > future. >> >> bdi->dirty_background_ratio, if I understand its semantics right, is >> unfortunately flaky in the same principle as centisecs-to-write, >> because it relies on the (implicitly estimation of) writeout >> proportions. The writeout proportions for each bdi starts with 0, >> which is even worse than the 100MB/s initial value for >> bdi->write_bandwidth and will trigger background writeback on the >> first write. > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion > estimates at all. Limit would be > dirtiable_memory * bdi->dirty_backround_ratio. > > After all we want to start writeout to bdi when we have enough pages to > reasonably load the device for a while which has nothing to do with how > much is written to this device as compared to other devices. > > OTOH I'm not particularly attached to this interface. Especially since on a > lot of today's machines, 1% is rather big so people might often end up > using dirty_background_bytes anyway. > >> bdi->dirty_background_bytes is, however, reliable, and gives users >> total control. If we export this interface alone, I'd imagine users >> who want to control centisecs-to-write could run a simple script to >> periodically get the write bandwith value out of the existing bdi >> interface and echo it into bdi->dirty_background_bytes. Which makes >> simple yet good enough centisecs-to-write controlling. >> >> So what do you think about exporting a really dumb >> bdi->dirty_background_bytes, which will effectively give smart users >> the freedom to do smart control over per-bdi background writeback >> threshold? The users are offered the freedom to do his own bandwidth >> estimation and choose not to rely on the kernel estimation, which will >> free us from the burden of maintaining a flaky interface as well. :) > That's fine with me. Just it would be nice if we gave > bdi->dirty_background_bytes some useful initial value. Maybe like > dirtiable_memory * dirty_background_ratio? Hi. Jan. Global dirty_background_bytes default value is zero that means flushing is started based on dirty_background_ratio and dirtiable memory. Is it correct to set per bdi default dirty threshold (bdi->dirty_background_bytes) equal to global dirty threshold - dirtiable_memory * dirty_background_ratio ? In my opinion, default setting for per bdi-> dirty_background_bytes should be zero to avoid any confusion and any change in default writeback behaviour. Thanks. > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-27 6:00 ` Namjae Jeon @ 2012-09-27 11:52 ` Jan Kara 2012-09-28 1:59 ` Namjae Jeon 0 siblings, 1 reply; 16+ messages in thread From: Jan Kara @ 2012-09-27 11:52 UTC (permalink / raw) To: Namjae Jeon Cc: Jan Kara, Fengguang Wu, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Thu 27-09-12 15:00:18, Namjae Jeon wrote: > 2012/9/27, Jan Kara <jack@suse.cz>: > > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: > >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: > >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > >> > > > From: Namjae Jeon <namjae.jeon@samsung.com> > >> > > > > >> > > > This patch is based on suggestion by Wu Fengguang: > >> > > > https://lkml.org/lkml/2011/8/19/19 > >> > > > > >> > > > kernel has mechanism to do writeback as per dirty_ratio and > >> > > > dirty_background > >> > > > ratio. It also maintains per task dirty rate limit to keep balance > >> > > > of > >> > > > dirty pages at any given instance by doing bdi bandwidth > >> > > > estimation. > >> > > > > >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage > >> > > > of > >> > > > writecache to control per bdi dirty limits and task throttling. > >> > > > > >> > > > However, there might be a usecase where user wants a per bdi > >> > > > writeback tuning > >> > > > parameter to flush dirty data once per bdi dirty data reach a > >> > > > threshold > >> > > > especially at NFS server. > >> > > > > >> > > > dirty_background_centisecs provides an interface where user can > >> > > > tune > >> > > > background writeback start threshold using > >> > > > /sys/block/sda/bdi/dirty_background_centisecs > >> > > > > >> > > > dirty_background_centisecs is used alongwith average bdi write > >> > > > bandwidth > >> > > > estimation to start background writeback. > >> > The functionality you describe, i.e. start flushing bdi when there's > >> > reasonable amount of dirty data on it, looks sensible and useful. > >> > However > >> > I'm not so sure whether the interface you propose is the right one. > >> > Traditionally, we allow user to set amount of dirty data (either in > >> > bytes > >> > or percentage of memory) when background writeback should start. You > >> > propose setting the amount of data in centisecs-to-write. Why that > >> > difference? Also this interface ties our throughput estimation code > >> > (which > >> > is an implementation detail of current dirty throttling) with the > >> > userspace > >> > API. So we'd have to maintain the estimation code forever, possibly > >> > also > >> > face problems when we change the estimation code (and thus estimates in > >> > some cases) and users will complain that the values they set originally > >> > no > >> > longer work as they used to. > >> > >> Yes, that bandwidth estimation is not all that (and in theory cannot > >> be made) reliable which may be a surprise to the user. Which make the > >> interface flaky. > >> > >> > Also, as with each knob, there's a problem how to properly set its > >> > value? > >> > Most admins won't know about the knob and so won't touch it. Others > >> > might > >> > know about the knob but will have hard time figuring out what value > >> > should > >> > they set. So if there's a new knob, it should have a sensible initial > >> > value. And since this feature looks like a useful one, it shouldn't be > >> > zero. 
> >> > >> Agreed in principle. There seems be no reasonable defaults for the > >> centisecs-to-write interface, mainly due to its inaccurate nature, > >> especially the initial value may be wildly wrong on fresh system > >> bootup. This is also true for your proposed interfaces, see below. > >> > >> > So my personal preference would be to have bdi->dirty_background_ratio > >> > and > >> > bdi->dirty_background_bytes and start background writeback whenever > >> > one of global background limit and per-bdi background limit is exceeded. > >> > I > >> > think this interface will do the job as well and it's easier to maintain > >> > in > >> > future. > >> > >> bdi->dirty_background_ratio, if I understand its semantics right, is > >> unfortunately flaky in the same principle as centisecs-to-write, > >> because it relies on the (implicitly estimation of) writeout > >> proportions. The writeout proportions for each bdi starts with 0, > >> which is even worse than the 100MB/s initial value for > >> bdi->write_bandwidth and will trigger background writeback on the > >> first write. > > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion > > estimates at all. Limit would be > > dirtiable_memory * bdi->dirty_backround_ratio. > > > > After all we want to start writeout to bdi when we have enough pages to > > reasonably load the device for a while which has nothing to do with how > > much is written to this device as compared to other devices. > > > > OTOH I'm not particularly attached to this interface. Especially since on a > > lot of today's machines, 1% is rather big so people might often end up > > using dirty_background_bytes anyway. > > > >> bdi->dirty_background_bytes is, however, reliable, and gives users > >> total control. If we export this interface alone, I'd imagine users > >> who want to control centisecs-to-write could run a simple script to > >> periodically get the write bandwith value out of the existing bdi > >> interface and echo it into bdi->dirty_background_bytes. Which makes > >> simple yet good enough centisecs-to-write controlling. > >> > >> So what do you think about exporting a really dumb > >> bdi->dirty_background_bytes, which will effectively give smart users > >> the freedom to do smart control over per-bdi background writeback > >> threshold? The users are offered the freedom to do his own bandwidth > >> estimation and choose not to rely on the kernel estimation, which will > >> free us from the burden of maintaining a flaky interface as well. :) > > That's fine with me. Just it would be nice if we gave > > bdi->dirty_background_bytes some useful initial value. Maybe like > > dirtiable_memory * dirty_background_ratio? > Global dirty_background_bytes default value is zero that means > flushing is started based on dirty_background_ratio and dirtiable > memory. > Is it correct to set per bdi default dirty threshold > (bdi->dirty_background_bytes) equal to global dirty threshold - > dirtiable_memory * dirty_background_ratio ? Right, the default setting I proposed doesn't make a difference. And it's not obvious how to create one which is more meaningful. Pity. > In my opinion, default setting for per bdi-> dirty_background_bytes > should be zero to avoid any confusion and any change in default > writeback behaviour. OK, fine with me. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
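For reference, the initial value floated here would amount to roughly the
following in bdi_init(), where global_dirtyable_memory() is the page count the
global limits are computed from; a sketch only, and as the follow-ups below
conclude, the default ultimately stays 0:

	/* sketch: seed the per-bdi byte limit from the global background ratio */
	bdi->dirty_background_bytes = global_dirtyable_memory() *
				      dirty_background_ratio / 100 * PAGE_SIZE;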
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-27 11:52 ` Jan Kara @ 2012-09-28 1:59 ` Namjae Jeon 0 siblings, 0 replies; 16+ messages in thread From: Namjae Jeon @ 2012-09-28 1:59 UTC (permalink / raw) To: Jan Kara, Fengguang Wu Cc: Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/9/27, Jan Kara <jack@suse.cz>: > On Thu 27-09-12 15:00:18, Namjae Jeon wrote: >> 2012/9/27, Jan Kara <jack@suse.cz>: >> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: >> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: >> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: >> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> >> > > > From: Namjae Jeon <namjae.jeon@samsung.com> >> >> > > > >> >> > > > This patch is based on suggestion by Wu Fengguang: >> >> > > > https://lkml.org/lkml/2011/8/19/19 >> >> > > > >> >> > > > kernel has mechanism to do writeback as per dirty_ratio and >> >> > > > dirty_background >> >> > > > ratio. It also maintains per task dirty rate limit to keep >> >> > > > balance >> >> > > > of >> >> > > > dirty pages at any given instance by doing bdi bandwidth >> >> > > > estimation. >> >> > > > >> >> > > > Kernel also has max_ratio/min_ratio tunables to specify >> >> > > > percentage >> >> > > > of >> >> > > > writecache to control per bdi dirty limits and task throttling. >> >> > > > >> >> > > > However, there might be a usecase where user wants a per bdi >> >> > > > writeback tuning >> >> > > > parameter to flush dirty data once per bdi dirty data reach a >> >> > > > threshold >> >> > > > especially at NFS server. >> >> > > > >> >> > > > dirty_background_centisecs provides an interface where user can >> >> > > > tune >> >> > > > background writeback start threshold using >> >> > > > /sys/block/sda/bdi/dirty_background_centisecs >> >> > > > >> >> > > > dirty_background_centisecs is used alongwith average bdi write >> >> > > > bandwidth >> >> > > > estimation to start background writeback. >> >> > The functionality you describe, i.e. start flushing bdi when >> >> > there's >> >> > reasonable amount of dirty data on it, looks sensible and useful. >> >> > However >> >> > I'm not so sure whether the interface you propose is the right one. >> >> > Traditionally, we allow user to set amount of dirty data (either in >> >> > bytes >> >> > or percentage of memory) when background writeback should start. You >> >> > propose setting the amount of data in centisecs-to-write. Why that >> >> > difference? Also this interface ties our throughput estimation code >> >> > (which >> >> > is an implementation detail of current dirty throttling) with the >> >> > userspace >> >> > API. So we'd have to maintain the estimation code forever, possibly >> >> > also >> >> > face problems when we change the estimation code (and thus estimates >> >> > in >> >> > some cases) and users will complain that the values they set >> >> > originally >> >> > no >> >> > longer work as they used to. >> >> >> >> Yes, that bandwidth estimation is not all that (and in theory cannot >> >> be made) reliable which may be a surprise to the user. Which make the >> >> interface flaky. >> >> >> >> > Also, as with each knob, there's a problem how to properly set its >> >> > value? >> >> > Most admins won't know about the knob and so won't touch it. Others >> >> > might >> >> > know about the knob but will have hard time figuring out what value >> >> > should >> >> > they set. 
So if there's a new knob, it should have a sensible >> >> > initial >> >> > value. And since this feature looks like a useful one, it shouldn't >> >> > be >> >> > zero. >> >> >> >> Agreed in principle. There seems be no reasonable defaults for the >> >> centisecs-to-write interface, mainly due to its inaccurate nature, >> >> especially the initial value may be wildly wrong on fresh system >> >> bootup. This is also true for your proposed interfaces, see below. >> >> >> >> > So my personal preference would be to have >> >> > bdi->dirty_background_ratio >> >> > and >> >> > bdi->dirty_background_bytes and start background writeback whenever >> >> > one of global background limit and per-bdi background limit is >> >> > exceeded. >> >> > I >> >> > think this interface will do the job as well and it's easier to >> >> > maintain >> >> > in >> >> > future. >> >> >> >> bdi->dirty_background_ratio, if I understand its semantics right, is >> >> unfortunately flaky in the same principle as centisecs-to-write, >> >> because it relies on the (implicitly estimation of) writeout >> >> proportions. The writeout proportions for each bdi starts with 0, >> >> which is even worse than the 100MB/s initial value for >> >> bdi->write_bandwidth and will trigger background writeback on the >> >> first write. >> > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout >> > proportion >> > estimates at all. Limit would be >> > dirtiable_memory * bdi->dirty_backround_ratio. >> > >> > After all we want to start writeout to bdi when we have enough pages to >> > reasonably load the device for a while which has nothing to do with how >> > much is written to this device as compared to other devices. >> > >> > OTOH I'm not particularly attached to this interface. Especially since >> > on a >> > lot of today's machines, 1% is rather big so people might often end up >> > using dirty_background_bytes anyway. >> > >> >> bdi->dirty_background_bytes is, however, reliable, and gives users >> >> total control. If we export this interface alone, I'd imagine users >> >> who want to control centisecs-to-write could run a simple script to >> >> periodically get the write bandwith value out of the existing bdi >> >> interface and echo it into bdi->dirty_background_bytes. Which makes >> >> simple yet good enough centisecs-to-write controlling. >> >> >> >> So what do you think about exporting a really dumb >> >> bdi->dirty_background_bytes, which will effectively give smart users >> >> the freedom to do smart control over per-bdi background writeback >> >> threshold? The users are offered the freedom to do his own bandwidth >> >> estimation and choose not to rely on the kernel estimation, which will >> >> free us from the burden of maintaining a flaky interface as well. :) >> > That's fine with me. Just it would be nice if we gave >> > bdi->dirty_background_bytes some useful initial value. Maybe like >> > dirtiable_memory * dirty_background_ratio? >> Global dirty_background_bytes default value is zero that means >> flushing is started based on dirty_background_ratio and dirtiable >> memory. >> Is it correct to set per bdi default dirty threshold >> (bdi->dirty_background_bytes) equal to global dirty threshold - >> dirtiable_memory * dirty_background_ratio ? > Right, the default setting I proposed doesn't make a difference. And it's > not obvious how to create one which is more meaningful. Pity. 
> >> In my opinion, the default setting for per-bdi dirty_background_bytes >> should be zero to avoid any confusion and any change in default >> writeback behaviour. > OK, fine with me. Okay, I will rework the patches along those lines. Thanks, Jan and Wu! > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
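As a concrete sketch of the userspace approach Fengguang describes above (periodically reading the kernel's write-bandwidth estimate and echoing a derived value into the proposed per-bdi dirty_background_bytes knob), something like the following could run on the NFS server. The /sys/block/<dev>/bdi/dirty_background_bytes path assumes the interface proposed in this thread; the debugfs stats path, the 8:0 (sda) device numbers, the refresh interval and the one-second target are illustrative assumptions, not part of the patch.

  #!/bin/sh
  # Emulate a per-bdi "dirty_background_centisecs" in userspace: convert the
  # kernel's bdi write-bandwidth estimate into a byte threshold every few
  # seconds, as suggested above.  Device names and paths are assumptions.
  DEV=sda                 # block device exported by the NFS server
  BDI=8:0                 # major:minor of that device's bdi in debugfs
  TARGET_CENTISECS=100    # aim to buffer roughly one second of writeback

  while sleep 10; do
      # BdiWriteBandwidth is reported in kB/s in the per-bdi debugfs stats.
      bw_kbps=$(awk '/BdiWriteBandwidth/ {print $2}' /sys/kernel/debug/bdi/$BDI/stats)
      [ -z "$bw_kbps" ] && continue
      # bytes = bandwidth (bytes/s) * time (s); centisecs/100 gives seconds.
      echo $(( bw_kbps * 1024 * TARGET_CENTISECS / 100 )) \
          > /sys/block/$DEV/bdi/dirty_background_bytes
  done

Whether a ten-second refresh and a one-second target are sensible is exactly the kind of tuning question raised in this thread; the point is only that the bandwidth-to-bytes conversion can live outside the kernel.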
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-20 8:44 ` [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Fengguang Wu 2012-09-24 22:23 ` Jan Kara @ 2012-09-25 1:36 ` Dave Chinner 2012-09-25 6:46 ` Namjae Jeon 1 sibling, 1 reply; 16+ messages in thread From: Dave Chinner @ 2012-09-25 1:36 UTC (permalink / raw) To: Fengguang Wu Cc: Namjae Jeon, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: > [ CC FS and MM lists ] > > Patch looks good to me, however we need to be careful because it's > introducing a new interface. So it's desirable to get some acks from > the FS/MM developers. > > Thanks, > Fengguang > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > From: Namjae Jeon <namjae.jeon@samsung.com> > > > > This patch is based on suggestion by Wu Fengguang: > > https://lkml.org/lkml/2011/8/19/19 > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > ratio. It also maintains per task dirty rate limit to keep balance of > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > writecache to control per bdi dirty limits and task throttling. > > > > However, there might be a usecase where user wants a per bdi writeback tuning > > parameter to flush dirty data once per bdi dirty data reach a threshold > > especially at NFS server. > > > > dirty_background_centisecs provides an interface where user can tune > > background writeback start threshold using > > /sys/block/sda/bdi/dirty_background_centisecs > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > estimation to start background writeback. > > > > One of the use case to demonstrate the patch functionality can be > > on NFS setup:- > > We have a NFS setup with ethernet line of 100Mbps, while the USB > > disk is attached to server, which has a local speed of 25MBps. Server > > and client both are arm target boards. > > > > Now if we perform a write operation over NFS (client to server), as > > per the network speed, data can travel at max speed of 100Mbps. But > > if we check the default write speed of USB hdd over NFS it comes > > around to 8MB/sec, far below the speed of network. > > > > Reason being is as per the NFS logic, during write operation, initially > > pages are dirtied on NFS client side, then after reaching the dirty > > threshold/writeback limit (or in case of sync) data is actually sent > > to NFS server (so now again pages are dirtied on server side). This > > will be done in COMMIT call from client to server i.e if 100MB of data > > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds. > > > > After the data is received, now it will take approx 100/25 ~4 Seconds to > > write the data to USB Hdd on server side. Hence making the overall time > > to write this much of data ~12 seconds, which in practically comes out to > > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS > > client. > > > > However we may improve this write performace by making the use of NFS > > server idle time i.e while data is being received from the client, > > simultaneously initiate the writeback thread on server side. 
So instead > > of waiting for the complete data to come and then start the writeback, > > we can work in parallel while the network is still busy in receiving the > > data. Hence in this way overall performace will be improved. > > > > If we tune dirty_background_centisecs, we can see there > > is increase in the performace and it comes out to be ~ 11MB/seconds. > > Results are:- > > > > Write test(create a 1 GB file) result at 'NFS client' after changing > > /sys/block/sda/bdi/dirty_background_centisecs > > on *** NFS Server only - not on NFS Client **** What is the configuration of the client and server? How much RAM, what their dirty_* parameters are set to, network speed, server disk speed for local sequential IO, etc? > > --------------------------------------------------------------------- > > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | > > --------------------------------------------------------------------- > > | | default = 0 | 300 centisec| 200 centisec| 100 centisec | > > --------------------------------------------------------------------- > > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed | > > --------------------------------------------------------------------- > > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec | > > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec | > > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec | > > | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec | > > | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec | > > | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec | > > | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec | > > | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec | > > | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec | > > | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec | > > --------------------------------------------------------------------- While this set of numbers looks good, it's a very limited in scope. I can't evaluate whether the change is worthwhile or not from this test. If I was writing this patch, the questions I'd be seeking to answer before proposing it for inclusion are as follows.... 1. what's the comparison in performance to typical NFS server writeback parameter tuning? i.e. dirty_background_ratio=5, dirty_ratio=10, dirty_expire_centiseconds=1000, dirty_writeback_centisecs=1? i.e. does this give change give any benefit over the current common practice for configuring NFS servers? 2. what happens when you have 10 clients all writing to the server at once? Or a 100? NFS servers rarely have a single writer to a single file at a time, so what impact does this change have on multiple concurrent file write performance from multiple clients? 3. Following on from the multiple client test, what difference does it make to file fragmentation rates? Writing more frequently means smaller allocations and writes, and that tends to lead to higher fragmentation rates, especially when multiple files are being written concurrently. Higher fragmentation also means lower performance over time as fragmentation accelerates filesystem aging effects on performance. IOWs, it may be faster when new, but it will be slower 3 months down the track and that's a bad tradeoff to make. 4. What happens for higher bandwidth network links? e.g. gigE or 10gigE? Are the improvements still there? Or does it cause regressions at higher speeds? 
I'm especially interested in what happens to multiple writers at higher network speeds, because that's a key performance metric used to measure enterprise level NFS servers. 5. Are the improvements consistent across different filesystem types? We've had writeback changes in the past cause improvements on one filesystem but significant regressions on others. I'd suggest that you need to present results for ext4, XFS and btrfs so that we have a decent idea of what we can expect from the change to the generic code. Yeah, I'm asking a lot of questions. That's because the generic writeback code is extremely important to performance and the impact of a change cannot be evaluated from a single test. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
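For reference, the "typical NFS server writeback parameter tuning" Dave mentions in question 1 can already be applied with the existing global sysctls (the actual sysctl names are dirty_expire_centisecs and dirty_writeback_centisecs); a baseline configuration to compare against would be:

  # Common system-wide tuning for NFS servers, per question 1 above.
  # These affect every device, unlike the proposed per-bdi knob.
  sysctl -w vm.dirty_background_ratio=5      # start background writeback early
  sysctl -w vm.dirty_ratio=10                # throttle dirtiers sooner
  sysctl -w vm.dirty_expire_centisecs=1000   # treat dirty data as old after 10s
  sysctl -w vm.dirty_writeback_centisecs=1   # wake the flusher almost continuously

Running the same 1 GB write test against this configuration would answer question 1 directly, since it shows whether the per-bdi knob buys anything beyond what global tuning already provides.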
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-25 1:36 ` Dave Chinner @ 2012-09-25 6:46 ` Namjae Jeon 2012-09-25 6:54 ` Namjae Jeon 2012-10-19 7:51 ` Namjae Jeon 0 siblings, 2 replies; 16+ messages in thread From: Namjae Jeon @ 2012-09-25 6:46 UTC (permalink / raw) To: Dave Chinner Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/9/25, Dave Chinner <david@fromorbit.com>: > On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: >> [ CC FS and MM lists ] >> >> Patch looks good to me, however we need to be careful because it's >> introducing a new interface. So it's desirable to get some acks from >> the FS/MM developers. >> >> Thanks, >> Fengguang >> >> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> > From: Namjae Jeon <namjae.jeon@samsung.com> >> > >> > This patch is based on suggestion by Wu Fengguang: >> > https://lkml.org/lkml/2011/8/19/19 >> > >> > kernel has mechanism to do writeback as per dirty_ratio and >> > dirty_background >> > ratio. It also maintains per task dirty rate limit to keep balance of >> > dirty pages at any given instance by doing bdi bandwidth estimation. >> > >> > Kernel also has max_ratio/min_ratio tunables to specify percentage of >> > writecache to control per bdi dirty limits and task throttling. >> > >> > However, there might be a usecase where user wants a per bdi writeback >> > tuning >> > parameter to flush dirty data once per bdi dirty data reach a threshold >> > especially at NFS server. >> > >> > dirty_background_centisecs provides an interface where user can tune >> > background writeback start threshold using >> > /sys/block/sda/bdi/dirty_background_centisecs >> > >> > dirty_background_centisecs is used alongwith average bdi write >> > bandwidth >> > estimation to start background writeback. >> > >> > One of the use case to demonstrate the patch functionality can be >> > on NFS setup:- >> > We have a NFS setup with ethernet line of 100Mbps, while the USB >> > disk is attached to server, which has a local speed of 25MBps. Server >> > and client both are arm target boards. >> > >> > Now if we perform a write operation over NFS (client to server), as >> > per the network speed, data can travel at max speed of 100Mbps. But >> > if we check the default write speed of USB hdd over NFS it comes >> > around to 8MB/sec, far below the speed of network. >> > >> > Reason being is as per the NFS logic, during write operation, initially >> > pages are dirtied on NFS client side, then after reaching the dirty >> > threshold/writeback limit (or in case of sync) data is actually sent >> > to NFS server (so now again pages are dirtied on server side). This >> > will be done in COMMIT call from client to server i.e if 100MB of data >> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 >> > seconds. >> > >> > After the data is received, now it will take approx 100/25 ~4 Seconds >> > to >> > write the data to USB Hdd on server side. Hence making the overall time >> > to write this much of data ~12 seconds, which in practically comes out >> > to >> > be near 7 to 8MB/second. After this a COMMIT response will be sent to >> > NFS >> > client. >> > >> > However we may improve this write performace by making the use of NFS >> > server idle time i.e while data is being received from the client, >> > simultaneously initiate the writeback thread on server side. 
So instead >> > of waiting for the complete data to come and then start the writeback, >> > we can work in parallel while the network is still busy in receiving >> > the >> > data. Hence in this way overall performace will be improved. >> > >> > If we tune dirty_background_centisecs, we can see there >> > is increase in the performace and it comes out to be ~ 11MB/seconds. >> > Results are:- >> > >> > Write test(create a 1 GB file) result at 'NFS client' after changing >> > /sys/block/sda/bdi/dirty_background_centisecs >> > on *** NFS Server only - not on NFS Client **** > Hi. Dave. > What is the configuration of the client and server? How much RAM, > what their dirty_* parameters are set to, network speed, server disk > speed for local sequential IO, etc? these results are on ARM, 512MB RAM and XFS over NFS with default writeback settings(only our writeback setting - dirty_background_cen tisecs changed at nfs server only). Network speed is ~100MB/sec and local disk speed is ~25MB/sec. > >> > --------------------------------------------------------------------- >> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | >> > --------------------------------------------------------------------- >> > | | default = 0 | 300 centisec| 200 centisec| 100 centisec | >> > --------------------------------------------------------------------- >> > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed | >> > --------------------------------------------------------------------- >> > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec | >> > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec | >> > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec | >> > | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec | >> > | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec | >> > | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec | >> > | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec | >> > | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec | >> > | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec | >> > | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec | >> > --------------------------------------------------------------------- > > While this set of numbers looks good, it's a very limited in scope. > I can't evaluate whether the change is worthwhile or not from this > test. If I was writing this patch, the questions I'd be seeking to > answer before proposing it for inclusion are as follows.... > > 1. what's the comparison in performance to typical NFS > server writeback parameter tuning? i.e. dirty_background_ratio=5, > dirty_ratio=10, dirty_expire_centiseconds=1000, > dirty_writeback_centisecs=1? i.e. does this give change give any > benefit over the current common practice for configuring NFS > servers? > > 2. what happens when you have 10 clients all writing to the server > at once? Or a 100? NFS servers rarely have a single writer to a > single file at a time, so what impact does this change have on > multiple concurrent file write performance from multiple clients? > > 3. Following on from the multiple client test, what difference does it > make to file fragmentation rates? Writing more frequently means > smaller allocations and writes, and that tends to lead to higher > fragmentation rates, especially when multiple files are being > written concurrently. 
Higher fragmentation also means lower > performance over time as fragmentation accelerates filesystem aging > effects on performance. IOWs, it may be faster when new, but it > will be slower 3 months down the track and that's a bad tradeoff to > make. > > 4. What happens for higher bandwidth network links? e.g. gigE or > 10gigE? Are the improvements still there? Or does it cause > regressions at higher speeds? I'm especially interested in what > happens to multiple writers at higher network speeds, because that's > a key performance metric used to measure enterprise level NFS > servers. > > 5. Are the improvements consistent across different filesystem > types? We've had writeback changes in the past cause improvements > on one filesystem but significant regressions on others. I'd > suggest that you need to present results for ext4, XFS and btrfs so > that we have a decent idea of what we can expect from the change to > the generic code. > > Yeah, I'm asking a lot of questions. That's because the generic > writeback code is extremely important to performance and the impact > of a change cannot be evaluated from a single test. Yes, I agree. I will share patch behavior in gigabit Ethernet, different filesystems(e.g. ext4, xfs and btrfs) and multiple NFS clients setup. Thanks. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
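Since the thread keeps moving between the centisecs and bytes forms of the interface, it may help to see the conversion for the setup described above (a ~25 MB/s local disk). Assuming a steady bandwidth estimate, the centisecs knob is effectively the byte threshold computed below; the 25 MB/s figure comes from the test description, the rest is simple arithmetic:

  # Equivalent byte threshold for a given dirty_background_centisecs value,
  # assuming the bdi write-bandwidth estimate sits at a steady 25 MB/s.
  BW_MBPS=25
  for cs in 100 200 300; do
      # threshold (MB) = bandwidth (MB/s) * centisecs / 100
      echo "dirty_background_centisecs=$cs ~ $(( BW_MBPS * cs / 100 )) MB of dirty data"
  done

This is why the later tests that set dirty_background_bytes to 25 MB correspond roughly to the 100-centisec column in the earlier tables.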
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-25 6:46 ` Namjae Jeon @ 2012-09-25 6:54 ` Namjae Jeon 2012-10-19 7:51 ` Namjae Jeon 1 sibling, 0 replies; 16+ messages in thread From: Namjae Jeon @ 2012-09-25 6:54 UTC (permalink / raw) To: Dave Chinner Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/9/25, Namjae Jeon <linkinjeon@gmail.com>: > 2012/9/25, Dave Chinner <david@fromorbit.com>: >> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: >>> [ CC FS and MM lists ] >>> >>> Patch looks good to me, however we need to be careful because it's >>> introducing a new interface. So it's desirable to get some acks from >>> the FS/MM developers. >>> >>> Thanks, >>> Fengguang >>> >>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >>> > From: Namjae Jeon <namjae.jeon@samsung.com> >>> > >>> > This patch is based on suggestion by Wu Fengguang: >>> > https://lkml.org/lkml/2011/8/19/19 >>> > >>> > kernel has mechanism to do writeback as per dirty_ratio and >>> > dirty_background >>> > ratio. It also maintains per task dirty rate limit to keep balance of >>> > dirty pages at any given instance by doing bdi bandwidth estimation. >>> > >>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of >>> > writecache to control per bdi dirty limits and task throttling. >>> > >>> > However, there might be a usecase where user wants a per bdi writeback >>> > tuning >>> > parameter to flush dirty data once per bdi dirty data reach a >>> > threshold >>> > especially at NFS server. >>> > >>> > dirty_background_centisecs provides an interface where user can tune >>> > background writeback start threshold using >>> > /sys/block/sda/bdi/dirty_background_centisecs >>> > >>> > dirty_background_centisecs is used alongwith average bdi write >>> > bandwidth >>> > estimation to start background writeback. >>> > >>> > One of the use case to demonstrate the patch functionality can be >>> > on NFS setup:- >>> > We have a NFS setup with ethernet line of 100Mbps, while the USB >>> > disk is attached to server, which has a local speed of 25MBps. Server >>> > and client both are arm target boards. >>> > >>> > Now if we perform a write operation over NFS (client to server), as >>> > per the network speed, data can travel at max speed of 100Mbps. But >>> > if we check the default write speed of USB hdd over NFS it comes >>> > around to 8MB/sec, far below the speed of network. >>> > >>> > Reason being is as per the NFS logic, during write operation, >>> > initially >>> > pages are dirtied on NFS client side, then after reaching the dirty >>> > threshold/writeback limit (or in case of sync) data is actually sent >>> > to NFS server (so now again pages are dirtied on server side). This >>> > will be done in COMMIT call from client to server i.e if 100MB of data >>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 >>> > seconds. >>> > >>> > After the data is received, now it will take approx 100/25 ~4 Seconds >>> > to >>> > write the data to USB Hdd on server side. Hence making the overall >>> > time >>> > to write this much of data ~12 seconds, which in practically comes out >>> > to >>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to >>> > NFS >>> > client. 
>>> > >>> > However we may improve this write performace by making the use of NFS >>> > server idle time i.e while data is being received from the client, >>> > simultaneously initiate the writeback thread on server side. So >>> > instead >>> > of waiting for the complete data to come and then start the writeback, >>> > we can work in parallel while the network is still busy in receiving >>> > the >>> > data. Hence in this way overall performace will be improved. >>> > >>> > If we tune dirty_background_centisecs, we can see there >>> > is increase in the performace and it comes out to be ~ 11MB/seconds. >>> > Results are:- >>> > >>> > Write test(create a 1 GB file) result at 'NFS client' after changing >>> > /sys/block/sda/bdi/dirty_background_centisecs >>> > on *** NFS Server only - not on NFS Client **** >> > > Hi. Dave. > >> What is the configuration of the client and server? How much RAM, >> what their dirty_* parameters are set to, network speed, server disk >> speed for local sequential IO, etc? > these results are on ARM, 512MB RAM and XFS over NFS with default > writeback settings(only our writeback setting - dirty_background_cen > tisecs changed at nfs server only). Network speed is ~100MB/sec and Sorry, there is typo:) ^^100Mb/sec > local disk speed is ~25MB/sec. > >> >>> > --------------------------------------------------------------------- >>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | >>> > --------------------------------------------------------------------- >>> > | | default = 0 | 300 centisec| 200 centisec| 100 centisec | >>> > --------------------------------------------------------------------- >>> > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed | >>> > --------------------------------------------------------------------- >>> > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec | >>> > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec | >>> > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec | >>> > | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec | >>> > | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec | >>> > | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec | >>> > | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec | >>> > | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec | >>> > | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec | >>> > | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec | >>> > --------------------------------------------------------------------- >> >> While this set of numbers looks good, it's a very limited in scope. >> I can't evaluate whether the change is worthwhile or not from this >> test. If I was writing this patch, the questions I'd be seeking to >> answer before proposing it for inclusion are as follows.... >> >> 1. what's the comparison in performance to typical NFS >> server writeback parameter tuning? i.e. dirty_background_ratio=5, >> dirty_ratio=10, dirty_expire_centiseconds=1000, >> dirty_writeback_centisecs=1? i.e. does this give change give any >> benefit over the current common practice for configuring NFS >> servers? >> >> 2. what happens when you have 10 clients all writing to the server >> at once? Or a 100? NFS servers rarely have a single writer to a >> single file at a time, so what impact does this change have on >> multiple concurrent file write performance from multiple clients? >> >> 3. 
Following on from the multiple client test, what difference does it >> make to file fragmentation rates? Writing more frequently means >> smaller allocations and writes, and that tends to lead to higher >> fragmentation rates, especially when multiple files are being >> written concurrently. Higher fragmentation also means lower >> performance over time as fragmentation accelerates filesystem aging >> effects on performance. IOWs, it may be faster when new, but it >> will be slower 3 months down the track and that's a bad tradeoff to >> make. >> >> 4. What happens for higher bandwidth network links? e.g. gigE or >> 10gigE? Are the improvements still there? Or does it cause >> regressions at higher speeds? I'm especially interested in what >> happens to multiple writers at higher network speeds, because that's >> a key performance metric used to measure enterprise level NFS >> servers. >> >> 5. Are the improvements consistent across different filesystem >> types? We've had writeback changes in the past cause improvements >> on one filesystem but significant regressions on others. I'd >> suggest that you need to present results for ext4, XFS and btrfs so >> that we have a decent idea of what we can expect from the change to >> the generic code. >> >> Yeah, I'm asking a lot of questions. That's because the generic >> writeback code is extremely important to performance and the impact >> of a change cannot be evaluated from a single test. > Yes, I agree. > I will share patch behavior in gigabit Ethernet, different > filesystems(e.g. ext4, xfs and btrfs) and multiple NFS clients setup. > > Thanks. >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-09-25 6:46 ` Namjae Jeon 2012-09-25 6:54 ` Namjae Jeon @ 2012-10-19 7:51 ` Namjae Jeon 2012-10-22 1:25 ` Dave Chinner 1 sibling, 1 reply; 16+ messages in thread From: Namjae Jeon @ 2012-10-19 7:51 UTC (permalink / raw) To: Dave Chinner Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel Hi Dave. Test Procedure: 1) Local USB disk WRITE speed on NFS server is ~25 MB/s 2) Run WRITE test(create 1 GB file) on NFS Client with default writeback settings on NFS Server. By default bdi->dirty_background_bytes = 0, that means no change in default writeback behaviour 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to local USB disk write speed on NFS Server) *** only on NFS Server - not on NFS Client *** [NFS Server] # echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes # cat /sys/block/sdb/bdi/dirty_background_bytes 26214400 4) Run WRITE test again on NFS client to see change in WRITE speed at NFS client Test setup details: Test result on PC - FC16 - RAM 3 GB - ethernet - 1000 Mbits/s, Create 1 GB File -------------------------------------------------------------------------------- Table 1: XFS over NFS - WRITE SPEED on NFS Client -------------------------------------------------------------------------------- default writeback bdi->dirty_background_bytes setting = 25 MB RecSize write speed(MB/s) write speed(MB/s) % Change 10485760 27.39 28.53 4% 1048576 27.9 28.59 2% 524288 27.55 28.94 5% 262144 25.4 28.58 13% 131072 25.73 27.55 7% 65536 25.85 28.45 10% 32768 26.13 28.64 10% 16384 26.17 27.93 7% 8192 25.64 28.07 9% 4096 26.28 28.19 7% ------------------------------------------------------------------------------ Table 2: EXT4 over NFS - WRITE SPEED on NFS Client ------------------------------------------------------------------------------ default writeback bdi->dirty_background_bytes setting = 25 MB RecSize write speed(MB/s) write speed(MB/s) % Change 10485760 23.87 28.3 19% 1048576 24.81 27.79 12% 524288 24.53 28.14 15% 262144 24.21 27.99 16% 131072 24.11 28.33 18% 65536 23.73 28.21 19% 32768 25.66 27.52 7% 16384 24.3 27.67 14% 8192 23.6 27.08 15% 4096 23.35 27.24 17% As mentioned in the above Table 1 & 2, there is performance improvement on NFS client on gigabit Ethernet on both EXT4/XFS over NFS. We did not observe any degradation in write speed. However, performance gain varies on different file systems i.e. different on XFS & EXT4 over NFS. We also tried this change on BTRFS over NFS, but we did not see any significant change in WRITE speed. ---------------------------------------------------------------------------------- Multiple NFS Client test: ----------------------------------------------------------------------------------- Sorry - We could not arrange multiple PCs to verify this. 
So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards: ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
-----------------------------------------------------------------------------
Table 3: bdi->dirty_background_bytes = 0 MB - default writeback behaviour
-----------------------------------------------------------------------------
RecSize     Write Speed    Write Speed    Combined
            on Client 1    on Client 2    write speed
            (MB/s)         (MB/s)         (MB/s)
10485760    5.45           5.36           10.81
1048576     5.44           5.34           10.78
524288      5.48           5.51           10.99
262144      6.24           4.83           11.07
131072      5.58           5.53           11.11
65536       5.51           5.48           10.99
32768       5.42           5.46           10.88
16384       5.62           5.58           11.2
8192        5.59           5.49           11.08
4096        5.57           6.38           11.95
-----------------------------------------------------------------------
Table 4: bdi->dirty_background_bytes = 25 MB
-----------------------------------------------------------------------
RecSize     Write Speed    Write Speed    Combined
            on Client 1    on Client 2    write speed
            (MB/s)         (MB/s)         (MB/s)
10485760    5.43           5.76           11.19
1048576     5.51           5.72           11.23
524288      5.37           5.69           11.06
262144      5.46           5.51           10.97
131072      5.64           5.6            11.24
65536       5.53           5.64           11.17
32768       5.51           5.53           11.04
16384       5.51           5.51           11.02
8192        5.61           5.59           11.2
4096        6.11           5.65           11.76
As shown in Tables 3 & 4 above, there is no significant drop in WRITE speed for the individual NFS clients, and there is a minor improvement in the combined write speed of NFS clients 1 & 2. > 1. what's the comparison in performance to typical NFS > server writeback parameter tuning? i.e. dirty_background_ratio=5, > dirty_ratio=10, dirty_expire_centiseconds=1000, > dirty_writeback_centisecs=1? i.e. does this give change give any > benefit over the current common practice for configuring NFS > servers? Agreed, the above improvement in write speed can also be achieved by tuning the above write-back parameters. But if we change those settings, it will change write-back behavior system wide. On the other hand, if we change the proposed per-bdi setting, bdi->dirty_background_bytes, it will change write-back behavior only for the block device exported by the NFS server. > 2. what happens when you have 10 clients all writing to the server > at once? Or a 100? NFS servers rarely have a single writer to a > single file at a time, so what impact does this change have on > multiple concurrent file write performance from multiple clients Sorry, we could not arrange more than 2 PCs for verifying this. So, we tried this on 3 ARM target boards - 1 ARM board as NFS server & 2 ARM targets as NFS clients, connected through a hub at 100 Mbits/s ethernet link speed. Please refer to the Table 3 & 4 results above. > 3. Following on from the multiple client test, what difference does it > make to file fragmentation rates? Writing more frequently means > smaller allocations and writes, and that tends to lead to higher > fragmentation rates, especially when multiple files are being > written concurrently. Higher fragmentation also means lower > performance over time as fragmentation accelerates filesystem aging > effects on performance. IOWs, it may be faster when new, but it > will be slower 3 months down the track and that's a bad tradeoff to > make. We agree that there could be a bit more fragmentation. But as you know, we are not changing writeback settings at the NFS clients. So, write-back behavior on the NFS client will not change - IO requests will be buffered at the NFS client as per the existing write-back behavior. Also, by default we set bdi->dirty_background_bytes = 0, so it does not change the default writeback setting unless the user wants to tune it for their environment. > 4.
What happens for higher bandwidth network links? e.g. gigE or > 10gigE? Are the improvements still there? Or does it cause > regressions at higher speeds? I'm especially interested in what > happens to multiple writers at higher network speeds, because that's > a key performance metric used to measure enterprise level NFS > servers. As mentioned in the above table 1 & 2, on GIGABIT Ethernet interface also it provides performance improvement in WRITE speed. It is not degrading write speed. We could not arrange 10gigE, so we could verify this patch only on 100Mbits/s and 1000Mbits/s Ethernet link. > 5. Are the improvements consistent across different filesystem > types? We've had writeback changes in the past cause improvements > on one filesystem but significant regressions on others. I'd > suggest that you need to present results for ext4, XFS and btrfs so > that we have a decent idea of what we can expect from the change to > the generic code. As mentioned in the above Table 1 & 2, performance gain in WRITE speed is different on different file systems i.e. different on NFS client over XFS & EXT4. We also tried BTRFS over NFS, but we could not see any WRITE speed performance gain/degrade on BTRFS over NFS, so we are not posting BTRFS results here. Please let us know your opinion. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
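The test procedure described above can be repeated with a short script; this is only an approximation, since the published numbers come from a benchmark that varies record size (the RecSize column), whereas dd below uses a single block size. The /dev/sdb device comes from the commands quoted above, and the /mnt/nfs mount point is an assumed client-side mount:

  # --- On the NFS server (exported device assumed to be /dev/sdb) ---
  echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes
  cat /sys/block/sdb/bdi/dirty_background_bytes    # expect 26214400

  # --- On the NFS client (assumed mount point /mnt/nfs) ---
  # Create a 1 GB file; dd reports the effective write speed.
  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 conv=fsync

  # Reset the server to default behaviour between runs:
  echo 0 > /sys/block/sdb/bdi/dirty_background_bytes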
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-10-19 7:51 ` Namjae Jeon @ 2012-10-22 1:25 ` Dave Chinner 2012-11-19 23:18 ` Namjae Jeon 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2012-10-22 1:25 UTC (permalink / raw) To: Namjae Jeon Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote: > Hi Dave. > > Test Procedure: > > 1) Local USB disk WRITE speed on NFS server is ~25 MB/s > > 2) Run WRITE test(create 1 GB file) on NFS Client with default > writeback settings on NFS Server. By default > bdi->dirty_background_bytes = 0, that means no change in default > writeback behaviour > > 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to > local USB disk write speed on NFS Server) > *** only on NFS Server - not on NFS Client *** Ok, so the results look good, but it's not really addressing what I was asking, though. A typical desktop PC has a disk that can do 100MB/s and GbE, so I was expecting a test that showed throughput close to GbE maximums at least (ie. around that 100MB/s). I have 3 year old, low end, low power hardware (atom) that hanles twice the throughput you are testing here, and most current consumer NAS devices are more powerful than this. IOWs, I think the rates you are testing at are probably too low even for the consumer NAS market to consider relevant... > ---------------------------------------------------------------------------------- > Multiple NFS Client test: > ----------------------------------------------------------------------------------- > Sorry - We could not arrange multiple PCs to verify this. > So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards: > ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File But this really doesn't tells us anything - it's still only 100Mb/s, which we'd expect is already getting very close to line rate even with low powered client hardware. What I'm concerned about the NFS server "sweet spot" - a $10k server that exports 20TB of storage and can sustain close to a GB/s of NFS traffic over a single 10GbE link with tens to hundreds of clients. 100MB/s and 10 clients is about the minimum needed to be able to extrapolate a litle and make an informed guess of how it will scale up.... > > 1. what's the comparison in performance to typical NFS > > server writeback parameter tuning? i.e. dirty_background_ratio=5, > > dirty_ratio=10, dirty_expire_centiseconds=1000, > > dirty_writeback_centisecs=1? i.e. does this give change give any > > benefit over the current common practice for configuring NFS > > servers? > > Agreed, that above improvement in write speed can be achieved by > tuning above write-back parameters. > But if we change these settings, it will change write-back behavior > system wide. > On the other hand, if we change proposed per bdi setting, > bdi->dirty_background_bytes it will change write-back behavior for the > block device exported on NFS server. I already know what the difference between global vs per-bdi tuning means. What I want to know is how your results compare *numerically* to just having a tweaked global setting on a vanilla kernel. i.e. is there really any performance benefit to per-bdi configuration that cannot be gained by existing methods? > > 2. what happens when you have 10 clients all writing to the server > > at once? Or a 100? 
NFS servers rarely have a single writer to a > > single file at a time, so what impact does this change have on > > multiple concurrent file write performance from multiple clients > > Sorry, we could not arrange more than 2 PCs for verifying this. Really? Well, perhaps there's some tools that might be useful for you here: http://oss.sgi.com/projects/nfs/testtools/ "Weber Test load generator for NFS. Uses multiple threads, multiple sockets and multiple IP addresses to simulate loads from many machines, thus enabling testing of NFS server setups with larger client counts than can be tested with physical infrastructure (or Virtual Machine clients). Has been useful in automated NFS testing and as a pinpoint NFS load generator tool for performance development." > > 3. Following on from the multiple client test, what difference does it > > make to file fragmentation rates? Writing more frequently means > > smaller allocations and writes, and that tends to lead to higher > > fragmentation rates, especially when multiple files are being > > written concurrently. Higher fragmentation also means lower > > performance over time as fragmentation accelerates filesystem aging > > effects on performance. IOWs, it may be faster when new, but it > > will be slower 3 months down the track and that's a bad tradeoff to > > make. > > We agree that there could be bit more framentation. But as you know, > we are not changing writeback settings at NFS clients. > So, write-back behavior on NFS client will not change - IO requests > will be buffered at NFS client as per existing write-back behavior. I think you misunderstand - writeback settings on the server greatly impact the way the server writes data and therefore the way files are fragmented. It has nothing to do with client side tuning. Effectively, what you are presenting is best case numbers - empty filesystem, single client, streaming write, no fragmentation, no allocation contention, no competing IO load that causes write latency occurring. Testing with lots of clients introduces all of these things, and that will greatly impact server behaviour. Aggregation in memory isolates a lot of this variation from writeback and hence smooths out a lot of the variability that leads to fragmentation, seeks, latency spikes and preamture filesystem aging. That is, if you set a 100MB dirty_bytes limit on a bdi it will give really good buffering for a single client doing a streaming write. If you've got 10 clients, then assuming fair distribution of server resources, then that is 10MB per client per writeback trigger. That's line ball as to whether it will cause fragmentation severe enough to impact server throughput. If you've got 100 clients,then that's only 1MB per client per writeback trigger, and that's definitely too low to maintain decent writeback behaviour. i.e. you're now writing 100 files 1MB at a time, and that tends towards random IO patterns rather than sequential IO patterns. Seek time dertermines throughput, not IO bandwidth limits. IOWs, as the client count goes up, the writeback patterns will tends more towards random IO than sequential IO unless the amount of buffering allowed before writeback triggers also grows. That's important, because random IO is much slower than sequential IO. What I'd like to have is some insight into whether this patch changes that inflection point, for better or for worse. The only way to find that is to run multi-client testing.... > > 5. Are the improvements consistent across different filesystem > > types? 
We've had writeback changes in the past cause improvements > > on one filesystem but significant regressions on others. I'd > > suggest that you need to present results for ext4, XFS and btrfs so > > that we have a decent idea of what we can expect from the change to > > the generic code. > > As mentioned in the above Table 1 & 2, performance gain in WRITE speed > is different on different file systems i.e. different on NFS client > over XFS & EXT4. > We also tried BTRFS over NFS, but we could not see any WRITE speed > performance gain/degrade on BTRFS over NFS, so we are not posting > BTRFS results here. You should post btrfs numbers even if they show no change. It wasn't until I got this far that I even realised that you'd even tested BTRFS. I don't know what to make of this, because I don't know what the throughput rates compared to XFS and EXT4 are.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
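Dave's scaling argument comes down to simple arithmetic: with a fixed per-bdi threshold, the buffering available to each client before background writeback triggers shrinks linearly with the client count. A purely illustrative sketch, using the 100 MB figure from the message above:

  # Per-client buffering before background writeback triggers, assuming a
  # fixed 100 MB per-bdi threshold shared fairly across N streaming clients.
  THRESH_MB=100
  for clients in 1 10 100; do
      echo "$clients client(s): ~$(( THRESH_MB / clients )) MB per client per writeback trigger"
  done

At 1 MB per client the server is effectively issuing small writes to 100 files at once, which is the random-IO pattern Dave warns about.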
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable 2012-10-22 1:25 ` Dave Chinner @ 2012-11-19 23:18 ` Namjae Jeon 2012-12-05 3:51 ` Wanpeng Li [not found] ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com> 0 siblings, 2 replies; 16+ messages in thread From: Namjae Jeon @ 2012-11-19 23:18 UTC (permalink / raw) To: Dave Chinner Cc: Fengguang Wu, Jan Kara, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel 2012/10/22, Dave Chinner <david@fromorbit.com>: > On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote: >> Hi Dave. >> >> Test Procedure: >> >> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s >> >> 2) Run WRITE test(create 1 GB file) on NFS Client with default >> writeback settings on NFS Server. By default >> bdi->dirty_background_bytes = 0, that means no change in default >> writeback behaviour >> >> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to >> local USB disk write speed on NFS Server) >> *** only on NFS Server - not on NFS Client *** > > Ok, so the results look good, but it's not really addressing what I > was asking, though. A typical desktop PC has a disk that can do > 100MB/s and GbE, so I was expecting a test that showed throughput > close to GbE maximums at least (ie. around that 100MB/s). I have 3 > year old, low end, low power hardware (atom) that hanles twice the > throughput you are testing here, and most current consumer NAS > devices are more powerful than this. IOWs, I think the rates you are > testing at are probably too low even for the consumer NAS market to > consider relevant... > >> ---------------------------------------------------------------------------------- >> Multiple NFS Client test: >> ----------------------------------------------------------------------------------- >> Sorry - We could not arrange multiple PCs to verify this. >> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards: >> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File > > But this really doesn't tells us anything - it's still only 100Mb/s, > which we'd expect is already getting very close to line rate even > with low powered client hardware. > > What I'm concerned about the NFS server "sweet spot" - a $10k server > that exports 20TB of storage and can sustain close to a GB/s of NFS > traffic over a single 10GbE link with tens to hundreds of clients. > 100MB/s and 10 clients is about the minimum needed to be able to > extrapolate a litle and make an informed guess of how it will scale > up.... > >> > 1. what's the comparison in performance to typical NFS >> > server writeback parameter tuning? i.e. dirty_background_ratio=5, >> > dirty_ratio=10, dirty_expire_centiseconds=1000, >> > dirty_writeback_centisecs=1? i.e. does this give change give any >> > benefit over the current common practice for configuring NFS >> > servers? >> >> Agreed, that above improvement in write speed can be achieved by >> tuning above write-back parameters. >> But if we change these settings, it will change write-back behavior >> system wide. >> On the other hand, if we change proposed per bdi setting, >> bdi->dirty_background_bytes it will change write-back behavior for the >> block device exported on NFS server. > > I already know what the difference between global vs per-bdi tuning > means. What I want to know is how your results compare > *numerically* to just having a tweaked global setting on a vanilla > kernel. i.e. 
is there really any performance benefit to per-bdi > configuration that cannot be gained by existing methods? > >> > 2. what happens when you have 10 clients all writing to the server >> > at once? Or a 100? NFS servers rarely have a single writer to a >> > single file at a time, so what impact does this change have on >> > multiple concurrent file write performance from multiple clients >> >> Sorry, we could not arrange more than 2 PCs for verifying this. > > Really? Well, perhaps there's some tools that might be useful for > you here: > > http://oss.sgi.com/projects/nfs/testtools/ > > "Weber > > Test load generator for NFS. Uses multiple threads, multiple > sockets and multiple IP addresses to simulate loads from many > machines, thus enabling testing of NFS server setups with larger > client counts than can be tested with physical infrastructure (or > Virtual Machine clients). Has been useful in automated NFS testing > and as a pinpoint NFS load generator tool for performance > development." > Hi Dave, We ran "weber" test on below setup: 1) SATA HDD - Local WRITE speed ~120 MB/s, NFS WRITE speed ~90 MB/s 2) Used 10GbE - network interface to mount NFS We ran "weber" test with NFS clients ranging from 1 to 100, below is the % GAIN in NFS WRITE speed with bdi->dirty_background_bytes = 100 MB at NFS server ------------------------------------------------- | Number of NFS Clients |% GAIN in WRITE Speed | |-----------------------------------------------| | 1 | 19.83 % | |-----------------------------------------------| | 2 | 2.97 % | |-----------------------------------------------| | 3 | 2.01 % | |-----------------------------------------------| | 10 | 0.25 % | |-----------------------------------------------| | 20 | 0.23 % | |-----------------------------------------------| | 30 | 0.13 % | |-----------------------------------------------| | 100 | - 0.60 % | ------------------------------------------------- with bdi->dirty_background_bytes setting at NFS server, we observed that NFS WRITE speed improvement is maximum with single NFS client. But WRITE speed improvement drops when Number of NFS clients increase from 1 to 100. So, bdi->dirty_background_bytes setting might be useful where we have only one NFS client(scenario like ours). But this is not useful for big NFS Servers which host hundreads of NFS clients. Let me know your opinion. Thanks. >> > 3. Following on from the multiple client test, what difference does it >> > make to file fragmentation rates? Writing more frequently means >> > smaller allocations and writes, and that tends to lead to higher >> > fragmentation rates, especially when multiple files are being >> > written concurrently. Higher fragmentation also means lower >> > performance over time as fragmentation accelerates filesystem aging >> > effects on performance. IOWs, it may be faster when new, but it >> > will be slower 3 months down the track and that's a bad tradeoff to >> > make. >> >> We agree that there could be bit more framentation. But as you know, >> we are not changing writeback settings at NFS clients. >> So, write-back behavior on NFS client will not change - IO requests >> will be buffered at NFS client as per existing write-back behavior. > > I think you misunderstand - writeback settings on the server greatly > impact the way the server writes data and therefore the way files > are fragmented. It has nothing to do with client side tuning. 
> > Effectively, what you are presenting is best case numbers - empty > filesystem, single client, streaming write, no fragmentation, no > allocation contention, no competing IO load that causes write > latency occurring. Testing with lots of clients introduces all of > these things, and that will greatly impact server behaviour. > Aggregation in memory isolates a lot of this variation from > writeback and hence smooths out a lot of the variability that leads > to fragmentation, seeks, latency spikes and preamture filesystem > aging. > > That is, if you set a 100MB dirty_bytes limit on a bdi it will give > really good buffering for a single client doing a streaming write. > If you've got 10 clients, then assuming fair distribution of server > resources, then that is 10MB per client per writeback trigger. > That's line ball as to whether it will cause fragmentation severe > enough to impact server throughput. If you've got 100 clients,then > that's only 1MB per client per writeback trigger, and that's > definitely too low to maintain decent writeback behaviour. i.e. > you're now writing 100 files 1MB at a time, and that tends towards > random IO patterns rather than sequential IO patterns. Seek time > dertermines throughput, not IO bandwidth limits. > > IOWs, as the client count goes up, the writeback patterns will tends > more towards random IO than sequential IO unless the amount of > buffering allowed before writeback triggers also grows. That's > important, because random IO is much slower than sequential IO. > What I'd like to have is some insight into whether this patch > changes that inflection point, for better or for worse. The only way > to find that is to run multi-client testing.... > >> > 5. Are the improvements consistent across different filesystem >> > types? We've had writeback changes in the past cause improvements >> > on one filesystem but significant regressions on others. I'd >> > suggest that you need to present results for ext4, XFS and btrfs so >> > that we have a decent idea of what we can expect from the change to >> > the generic code. >> >> As mentioned in the above Table 1 & 2, performance gain in WRITE speed >> is different on different file systems i.e. different on NFS client >> over XFS & EXT4. >> We also tried BTRFS over NFS, but we could not see any WRITE speed >> performance gain/degrade on BTRFS over NFS, so we are not posting >> BTRFS results here. > > You should post btrfs numbers even if they show no change. It wasn't > until I got this far that I even realised that you'd even tested > BTRFS. I don't know what to make of this, because I don't know what > the throughput rates compared to XFS and EXT4 are.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
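One way to see where that inflection point sits during a multi-client run like the weber test above is to watch the server's per-bdi writeback state while the clients write. Assuming debugfs is mounted and the kernel exposes per-bdi statistics there (field names as found in kernels of this era), and assuming 8:16 (sdb) is the exported device, a simple monitor would be:

  # Watch reclaimable/writeback pages, the background threshold and the
  # bandwidth estimate for the exported device while the clients run.
  # 8:16 (sdb) is an assumption; substitute the exported device's numbers.
  watch -n 1 \
      'grep -E "BdiReclaimable|BdiWriteback|BackgroundThresh|BdiWriteBandwidth" /sys/kernel/debug/bdi/8:16/stats'

If BdiWriteback is continuously non-zero while BdiReclaimable stays small as the client count grows, each file is being flushed in small slices, which matches the drop-off in the gain figures above.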
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  2012-11-19 23:18 ` Namjae Jeon
@ 2012-12-05  3:51 ` Wanpeng Li
  [not found] ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Wanpeng Li @ 2012-12-05 3:51 UTC (permalink / raw)
To: Namjae Jeon, Fengguang Wu
Cc: Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel

Hi Namjae,

How about setting bdi->dirty_background_bytes according to bdi_thresh? While reviewing the code I noticed an issue in the background flush path: once the background flush threshold is exceeded, wb_check_background_flush() kicks work on the current per-bdi flusher, but the dirty pages may mostly have been produced by heavy dirtiers on other bdis rather than the current bdi. In the worst case the current bdi holds a lot of frequently used data and flushing it just thrashes the cache. How about adding a check in wb_check_background_flush(): if the current bdi is not the one contributing a large number of dirty pages to the background flush threshold (i.e. it is not over bdi->dirty_background_bytes), then don't bother it?

Regards,
Wanpeng Li

On Tue, Nov 20, 2012 at 08:18:59AM +0900, Namjae Jeon wrote:
>2012/10/22, Dave Chinner <david@fromorbit.com>:
>> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
>>> Hi Dave.
>>>
>>> Test Procedure:
>>>
>>> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>>>
>>> 2) Run WRITE test (create a 1 GB file) on the NFS Client with default writeback settings on the NFS Server. By default bdi->dirty_background_bytes = 0, which means no change in default writeback behaviour.
>>>
>>> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to the local USB disk write speed on the NFS Server)
>>> *** only on NFS Server - not on NFS Client ***
>>
>> Ok, so the results look good, but it's not really addressing what I was asking, though. A typical desktop PC has a disk that can do 100MB/s and GbE, so I was expecting a test that showed throughput close to GbE maximums at least (ie. around that 100MB/s). I have 3 year old, low end, low power hardware (atom) that handles twice the throughput you are testing here, and most current consumer NAS devices are more powerful than this. IOWs, I think the rates you are testing at are probably too low even for the consumer NAS market to consider relevant...
>>
>>> ----------------------------------------------------------------------------------
>>> Multiple NFS Client test:
>>> ----------------------------------------------------------------------------------
>>> Sorry - We could not arrange multiple PCs to verify this. So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards: ARM target + 512 MB RAM + 100 Mbit/s ethernet, create a 1 GB file.
>>
>> But this really doesn't tell us anything - it's still only 100Mb/s, which we'd expect is already getting very close to line rate even with low powered client hardware.
>>
>> What I'm concerned about is the NFS server "sweet spot" - a $10k server that exports 20TB of storage and can sustain close to a GB/s of NFS traffic over a single 10GbE link with tens to hundreds of clients. 100MB/s and 10 clients is about the minimum needed to be able to extrapolate a little and make an informed guess of how it will scale up....
>>
>>> > 1. what's the comparison in performance to typical NFS server writeback parameter tuning? i.e. dirty_background_ratio=5, dirty_ratio=10, dirty_expire_centisecs=1000, dirty_writeback_centisecs=1? i.e. does this change give any benefit over the current common practice for configuring NFS servers?
>>>
>>> Agreed, the above improvement in write speed can be achieved by tuning the above write-back parameters. But if we change these settings, it will change write-back behavior system wide. On the other hand, if we change the proposed per-bdi setting, bdi->dirty_background_bytes, it will change write-back behavior only for the block device exported on the NFS server.
>>
>> I already know what the difference between global vs per-bdi tuning means. What I want to know is how your results compare *numerically* to just having a tweaked global setting on a vanilla kernel. i.e. is there really any performance benefit to per-bdi configuration that cannot be gained by existing methods?
>>
>>> > 2. what happens when you have 10 clients all writing to the server at once? Or 100? NFS servers rarely have a single writer to a single file at a time, so what impact does this change have on multiple concurrent file write performance from multiple clients?
>>>
>>> Sorry, we could not arrange more than 2 PCs for verifying this.
>>
>> Really? Well, perhaps there are some tools that might be useful for you here:
>>
>> http://oss.sgi.com/projects/nfs/testtools/
>>
>> "Weber
>>
>> Test load generator for NFS. Uses multiple threads, multiple sockets and multiple IP addresses to simulate loads from many machines, thus enabling testing of NFS server setups with larger client counts than can be tested with physical infrastructure (or Virtual Machine clients). Has been useful in automated NFS testing and as a pinpoint NFS load generator tool for performance development."
>>
>
>Hi Dave,
>We ran the "weber" test on the below setup:
>1) SATA HDD - local WRITE speed ~120 MB/s, NFS WRITE speed ~90 MB/s
>2) 10GbE network interface used to mount NFS
>
>We ran the "weber" test with NFS client counts ranging from 1 to 100; below is the % GAIN in NFS WRITE speed with bdi->dirty_background_bytes = 100 MB at the NFS server.
>
>-------------------------------------------------
>| Number of NFS Clients |% GAIN in WRITE Speed |
>|-----------------------------------------------|
>| 1                     | 19.83 %              |
>|-----------------------------------------------|
>| 2                     |  2.97 %              |
>|-----------------------------------------------|
>| 3                     |  2.01 %              |
>|-----------------------------------------------|
>| 10                    |  0.25 %              |
>|-----------------------------------------------|
>| 20                    |  0.23 %              |
>|-----------------------------------------------|
>| 30                    |  0.13 %              |
>|-----------------------------------------------|
>| 100                   | -0.60 %              |
>-------------------------------------------------
>
>With the bdi->dirty_background_bytes setting at the NFS server, we observed that the NFS WRITE speed improvement is largest with a single NFS client, but the improvement drops as the number of NFS clients increases from 1 to 100.
>
>So, the bdi->dirty_background_bytes setting might be useful where we have only one NFS client (a scenario like ours), but it is not useful for big NFS servers which host hundreds of NFS clients.
>
>Let me know your opinion.
>
>Thanks.
>
>>> > 3. Following on from the multiple client test, what difference does it make to file fragmentation rates? Writing more frequently means smaller allocations and writes, and that tends to lead to higher fragmentation rates, especially when multiple files are being written concurrently. Higher fragmentation also means lower performance over time as fragmentation accelerates filesystem aging effects on performance. IOWs, it may be faster when new, but it will be slower 3 months down the track and that's a bad tradeoff to make.
>>>
>>> We agree that there could be a bit more fragmentation. But as you know, we are not changing writeback settings at the NFS clients. So, write-back behavior on the NFS client will not change - IO requests will be buffered at the NFS client as per the existing write-back behavior.
>>
>> I think you misunderstand - writeback settings on the server greatly impact the way the server writes data and therefore the way files are fragmented. It has nothing to do with client side tuning.
>>
>> Effectively, what you are presenting is best case numbers - empty filesystem, single client, streaming write, no fragmentation, no allocation contention, no competing IO load causing write latency. Testing with lots of clients introduces all of these things, and that will greatly impact server behaviour. Aggregation in memory isolates a lot of this variation from writeback and hence smooths out a lot of the variability that leads to fragmentation, seeks, latency spikes and premature filesystem aging.
>>
>> That is, if you set a 100MB dirty_bytes limit on a bdi it will give really good buffering for a single client doing a streaming write. If you've got 10 clients, then assuming fair distribution of server resources, that is 10MB per client per writeback trigger. That's line ball as to whether it will cause fragmentation severe enough to impact server throughput. If you've got 100 clients, then that's only 1MB per client per writeback trigger, and that's definitely too low to maintain decent writeback behaviour. i.e. you're now writing 100 files 1MB at a time, and that tends towards random IO patterns rather than sequential IO patterns. Seek time determines throughput, not IO bandwidth limits.
>>
>> IOWs, as the client count goes up, the writeback patterns will tend more towards random IO than sequential IO unless the amount of buffering allowed before writeback triggers also grows. That's important, because random IO is much slower than sequential IO. What I'd like to have is some insight into whether this patch changes that inflection point, for better or for worse. The only way to find that is to run multi-client testing....
>>
>>> > 5. Are the improvements consistent across different filesystem types? We've had writeback changes in the past cause improvements on one filesystem but significant regressions on others. I'd suggest that you need to present results for ext4, XFS and btrfs so that we have a decent idea of what we can expect from the change to the generic code.
>>>
>>> As mentioned in the above Tables 1 & 2, the performance gain in WRITE speed is different on different file systems, i.e. different on the NFS client over XFS and EXT4. We also tried BTRFS over NFS, but we could not see any WRITE speed gain or degradation on BTRFS over NFS, so we are not posting the BTRFS results here.
>>
>> You should post btrfs numbers even if they show no change. It wasn't until I got this far that I realised you'd even tested BTRFS. I don't know what to make of this, because I don't know what the throughput rates compared to XFS and EXT4 are....
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread
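Dave's buffering argument quoted above comes down to a simple division: a fixed per-bdi background limit is shared by however many clients are writing, so the sequential chunk each file gets per writeback trigger shrinks as the client count grows. Below is a small stand-alone sketch of that arithmetic, using the 100 MB limit and fair sharing Dave assumes; where exactly the shrinking chunks tip writeback into effectively random IO is not something this calculation can show.

#include <stdio.h>

int main(void)
{
	/* A fixed 100 MB per-bdi background limit shared fairly across
	 * N concurrently writing clients (the numbers Dave uses). */
	const double bdi_limit_mb = 100.0;
	const int clients[] = { 1, 10, 100 };
	const size_t n = sizeof(clients) / sizeof(clients[0]);

	for (size_t i = 0; i < n; i++)
		printf("%3d clients -> ~%.1f MB buffered per client per writeback trigger\n",
		       clients[i], bdi_limit_mb / clients[i]);
	return 0;
}

Run as-is it prints roughly 100, 10 and 1 MB per client for 1, 10 and 100 clients, which is the progression Dave is describing.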
[parent not found: <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
  [not found] ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2012-12-05 10:19 ` Namjae Jeon
  0 siblings, 0 replies; 16+ messages in thread
From: Namjae Jeon @ 2012-12-05 10:19 UTC (permalink / raw)
To: Wanpeng Li
Cc: Fengguang Wu, Jan Kara, Dave Chinner, linux-kernel, Namjae Jeon, Vivek Trivedi, Linux Memory Management List, linux-fsdevel

2012/12/5, Wanpeng Li <liwanp@linux.vnet.ibm.com>:
> Hi Namjae,
>
> How about setting bdi->dirty_background_bytes according to bdi_thresh? While reviewing the code I noticed an issue in the background flush path: once the background flush threshold is exceeded, wb_check_background_flush() kicks work on the current per-bdi flusher, but the dirty pages may mostly have been produced by heavy dirtiers on other bdis rather than the current bdi. In the worst case the current bdi holds a lot of frequently used data and flushing it just thrashes the cache. How about adding a check in wb_check_background_flush(): if the current bdi is not the one contributing a large number of dirty pages to the background flush threshold (i.e. it is not over bdi->dirty_background_bytes), then don't bother it?

Hi Wanpeng.
First, thanks for your suggestion!
Yes, I think it looks reasonable. I will start checking it.
Thanks.

>
> Regards,
> Wanpeng Li
>
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread
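The check Wanpeng proposes, and which Namjae agrees above to look into, amounts to gating background writeback on the current bdi's own dirty footprint instead of only on the global threshold. The following is a stand-alone userspace model of that decision, not kernel code and not part of the posted patches: the mock_bdi structure, the page counts, the PAGE_SIZE constant and the should_kick_background() helper are all illustrative assumptions; in the kernel the equivalent logic would presumably sit near wb_check_background_flush() in fs/fs-writeback.c and reuse the existing per-bdi dirty accounting.

/*
 * Stand-alone model of the proposed per-bdi gate; the types and numbers
 * below are mock-ups for illustration, not kernel interfaces.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct mock_bdi {
	const char *name;
	unsigned long reclaimable_pages;      /* dirty + unstable pages on this bdi */
	unsigned long dirty_background_bytes; /* per-bdi threshold; 0 = gate disabled */
};

/*
 * Kick background writeback for this bdi only if the system is over the
 * global background threshold AND this bdi is itself over its own
 * per-bdi threshold (or has no per-bdi threshold configured).
 */
static bool should_kick_background(const struct mock_bdi *bdi,
				   bool over_global_background)
{
	if (!over_global_background)
		return false;
	if (!bdi->dirty_background_bytes)
		return true;	/* fall back to today's behaviour */
	return bdi->reclaimable_pages >
	       bdi->dirty_background_bytes / PAGE_SIZE;
}

int main(void)
{
	/* ~8 MB dirty on the USB disk, ~160 MB dirty on the SATA disk,
	 * both with a 25 MB per-bdi background threshold. */
	struct mock_bdi usb  = { "usb",  2000,  25UL << 20 };
	struct mock_bdi sata = { "sata", 40000, 25UL << 20 };

	printf("%s: kick background writeback = %d\n",
	       usb.name, should_kick_background(&usb, true));
	printf("%s: kick background writeback = %d\n",
	       sata.name, should_kick_background(&sata, true));
	return 0;
}

Compiled and run, the "usb" bdi (roughly 8 MB dirty against a 25 MB threshold) is left alone while the heavily dirtied "sata" bdi is kicked, which is the behaviour the suggestion is after.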
end of thread, other threads: [~2012-12-05 10:19 UTC | newest]

Thread overview: 16+ messages
[not found] <1347798342-2830-1-git-send-email-linkinjeon@gmail.com>
2012-09-20  8:44 ` [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Fengguang Wu
2012-09-24 22:23 ` Jan Kara
2012-09-25  6:50 ` Namjae Jeon
2012-09-26 16:56 ` Fengguang Wu
2012-09-26 20:22 ` Jan Kara
2012-09-27  6:00 ` Namjae Jeon
2012-09-27 11:52 ` Jan Kara
2012-09-28  1:59 ` Namjae Jeon
2012-09-25  1:36 ` Dave Chinner
2012-09-25  6:46 ` Namjae Jeon
2012-09-25  6:54 ` Namjae Jeon
2012-10-19  7:51 ` Namjae Jeon
2012-10-22  1:25 ` Dave Chinner
2012-11-19 23:18 ` Namjae Jeon
2012-12-05  3:51 ` Wanpeng Li
[not found] ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
2012-12-05 10:19 ` Namjae Jeon