Re: Re: [for-416 PATCH] bcache: fix writeback target calc on large devices

Linux bcache driver list
 help / color / mirror / Atom feed

* Re: Re: [for-416 PATCH] bcache: fix writeback target calc on large devices
@ 2018-01-02  9:25 tang.junhui
  2018-01-02 17:04 ` Michael Lyle
  2018-01-02 19:00 ` [for-416 PATCH v2] bcache: fix writeback target calculation Michael Lyle
  0 siblings, 2 replies; 6+ messages in thread
From: tang.junhui @ 2018-01-02  9:25 UTC (permalink / raw)
  To: mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>


>Thank you for the feedback.
>
>On Mon, Jan 1, 2018 at 10:33 PM,  <tang.junhui@zte.com.cn> wrote:
>> From: Tang Junhui <tang.junhui@zte.com.cn>
>>
>> This patch is useful for preventing the overflow of the expression
>> (cache_dirty_target * bdev_sectors(dc->bdev)), but it also
>> lead into a calc error, for example, when there is a 1G and
>> 100*164G cached device, it would cause the "target" value to
>> be aways zero of the 1G device, which would cause write-back
>> threshold losing efficacy.
>>
>> Maybe at first we can judge if it overflows or not of the expression
>> (cache_dirty_target * bdev_sectors(dc->bdev)), if it overflows,
>> We can calc the value of target as the patch, otherwise,
>> we calc it as old way.
>
>Maybe it'd be preferable just to ensure that share always >=1. 
Yes, this is a good and simple way.
 
>It
>seems like a pretty narrow set of cases where the current math works
>and the new math doesn't work, though, as I expect that it's
>relatively rare to have such a variation in sizes, and even so 16.4TB
>uses up 35 bits of the 64 bit quantity.
>
Truely, this issue is rare. But as we known, cache set is a sharable pool,
we may attach various devices on it, such as petabyte big device or gigabyte
small device. We would better to make them all works well in the same 
cache set.

>I don't really like special cases or trying two different ways to do
>the math, because then it's very difficult to test.
>
>What do you think?
Ha, I don't think it is difficult to do, actually you have provided a 
simple and effective way above, and we can use a small partition device
and a big partition device to test (such as a 100M device and a 2T device). 
I believe you can do it better.

Thanks,
Tang Junhui

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [for-416 PATCH] bcache: fix writeback target calc on large devices
  2018-01-02  9:25 Re: [for-416 PATCH] bcache: fix writeback target calc on large devices tang.junhui
@ 2018-01-02 17:04 ` Michael Lyle
  2018-01-02 19:00 ` [for-416 PATCH v2] bcache: fix writeback target calculation Michael Lyle
  1 sibling, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-02 17:04 UTC (permalink / raw)
  To: tang.junhui; +Cc: linux-bcache, linux-block

I'm withdrawing this patch.  Reading the entire function more carefully,
I think the rate calculation is completely wrong both in the original
code and in this change.

That is, the top of __update_writeback_rate calculates a dirty data
target for one backing device.  However, the target for one backing
device is compared to the amount of dirty data in the entire cache.  If
there is more than one cache device, the writeback rate will be far too
high.

I will attempt to write a new version of this scaling functionality soon
that fixes both the overflow issue and ... corrects it to calculate the
correct thing.

Mike

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [for-416 PATCH v2] bcache: fix writeback target calculation
  2018-01-02  9:25 Re: [for-416 PATCH] bcache: fix writeback target calc on large devices tang.junhui
  2018-01-02 17:04 ` Michael Lyle
@ 2018-01-02 19:00 ` Michael Lyle
  2018-01-02 19:02   ` Michael Lyle
  1 sibling, 1 reply; 6+ messages in thread
From: Michael Lyle @ 2018-01-02 19:00 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: tang.junhui, Michael Lyle

This fixes bcache writeback target calculation, for defects relating to
large cache sets and calculation errors when there are multiple backing
devices.

Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

Per review, we ensure that every device gets at least 1/16384th of the
writeback share.  This also differs from v1 of the patchset in scaling
the error, which corrects previous problems with writeback rate
calculation across multiple devices.

Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/writeback.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a37884ca8b..be2840ad02d4 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -26,16 +26,28 @@ static void __update_writeback_rate(struct cached_dev *dc)
 				bcache_flash_devs_sectors_dirty(c);
 	uint64_t cache_dirty_target =
 		div_u64(cache_sectors * dc->writeback_percent, 100);
-	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-				   c->cached_dev_sectors);
+
+	/*
+	 * 16384 is chosen here as something that each backing device should
+	 * be a reasonable fraction of the share, and to not blow up until
+	 * individual backing devices are a petabyte.
+	 */
+	uint32_t bdev_share_per16k =
+		div64_u64(16384 * bdev_sectors(dc->bdev),
+				c->cached_dev_sectors);
 
 	/*
 	 * PI controller:
 	 * Figures out the amount that should be written per second.
 	 *
 	 * First, the error (number of sectors that are dirty beyond our
-	 * target) is calculated.  The error is accumulated (numerically
-	 * integrated).
+	 * target) is calculated. 
+	 *
+	 * Unfortunately we don't know the exact share of dirty data for
+	 * each backing device; therefore the error is adjusted for this
+	 * backing disk's share of the overall error based on size.
+	 *
+	 * The error is accumulated (numerically integrated).
 	 *
 	 * Then, the proportional value and integral value are scaled
 	 * based on configured values.  These are stored as inverses to
@@ -50,12 +62,22 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	 * variations in usage like the p term.
 	 */
 	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
-	int64_t error = dirty - target;
-	int64_t proportional_scaled =
-		div_s64(error, dc->writeback_rate_p_term_inverse);
-	int64_t integral_scaled;
+	int64_t error = dirty - cache_dirty_target;
+	int64_t proportional_scaled, integral_scaled;
 	uint32_t new_rate;
 
+	/*
+	 * Ensure even with large device size disparities that we still
+	 * writeback at least some.
+	 */
+	if (bdev_share_per16k < 1)
+		bdev_share_per16k = 1;
+
+	target = div_s64(error * bdev_share_per16k, 16384)
+
+	proportional_scaled = div_s64(error,
+			dc->writeback_rate_p_term_inverse);
+
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    (error > 0 && time_before64(local_clock(),
 			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [for-416 PATCH v2] bcache: fix writeback target calculation
  2018-01-02 19:00 ` [for-416 PATCH v2] bcache: fix writeback target calculation Michael Lyle
@ 2018-01-02 19:02   ` Michael Lyle
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-02 19:02 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: tang.junhui

Please note I have not tested this version yet: I'm just putting it out
for review.

Mike

On 01/02/2018 11:00 AM, Michael Lyle wrote:
> This fixes bcache writeback target calculation, for defects relating to
> large cache sets and calculation errors when there are multiple backing
> devices.
> 
> Bcache needs to scale the dirty data in the cache over the multiple
> backing disks in order to calculate writeback rates for each.
> The previous code did this by multiplying the target number of dirty
> sectors by the backing device size, and expected it to fit into a
> uint64_t; this blows up on relatively small backing devices.
> 
> The new approach figures out the bdev's share in 16384ths of the overall
> cached data.  This is chosen to cope well when bdevs drastically vary in
> size and to ensure that bcache can cross the petabyte boundary for each
> backing device.
> 
> Per review, we ensure that every device gets at least 1/16384th of the
> writeback share.  This also differs from v1 of the patchset in scaling
> the error, which corrects previous problems with writeback rate
> calculation across multiple devices.
> 
> Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
> Signed-off-by: Michael Lyle <mlyle@lyle.org>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [for-416 PATCH v2] bcache: fix writeback target calculation
@ 2018-01-02 17:20 tang.junhui
  2018-01-03 18:01 ` Michael Lyle
  0 siblings, 1 reply; 6+ messages in thread
From: tang.junhui @ 2018-01-02 17:20 UTC (permalink / raw)
  To: mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

When I saw the code "int64_t error = dirty - cache_dirty_target", I was 
totally lost, I will try my best to get out. 
And this patch dependent on your earlier patch(PI controller), did that
patch have applied to upstreams branch? I searched in 4.14 branch, 
and did not find it.

> drivers/md/bcache/writeback.c | 38 ++++++++++++++++++++++++++++++--------    
> 1 file changed, 30 insertions(+), 8 deletions(-)                             
>                                                                              
>diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c    
>index 56a37884ca8b..be2840ad02d4 100644                                       
>--- a/drivers/md/bcache/writeback.c                                           
>+++ b/drivers/md/bcache/writeback.c                                           
>@@ -26,16 +26,28 @@ static void __update_writeback_rate(struct cached_dev *dc)
>                 bcache_flash_devs_sectors_dirty(c);                          
>     uint64_t cache_dirty_target =                                            
>         div_u64(cache_sectors * dc->writeback_percent, 100);                 
>-    int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),  
>-                   c->cached_dev_sectors);                                   
>+                                                                             
>+    /*                                                                       
>+     * 16384 is chosen here as something that each backing device should     
>+     * be a reasonable fraction of the share, and to not blow up until       
>+     * individual backing devices are a petabyte.                            
>+     */                                                                      
>+    uint32_t bdev_share_per16k =                                             
>+        div64_u64(16384 * bdev_sectors(dc->bdev),                            
>+                c->cached_dev_sectors);                                      
>                                                                              
>     /*                                                                       
>      * PI controller:                                                        
>      * Figures out the amount that should be written per second.             
>      *                                                                       
>      * First, the error (number of sectors that are dirty beyond our         
>-     * target) is calculated.  The error is accumulated (numerically         
>-     * integrated).                                                          
>+     * target) is calculated.                                                
>+     *                                                                       
>+     * Unfortunately we don't know the exact share of dirty data for         
>+     * each backing device; therefore the error is adjusted for this         
>+     * backing disk's share of the overall error based on size.              
>+     *                                                                       
>+     * The error is accumulated (numerically integrated).                    
>      *                                                                       
>      * Then, the proportional value and integral value are scaled            
>      * based on configured values.  These are stored as inverses to  
>              
>@@ -50,12 +62,22 @@ static void __update_writeback_rate(struct cached_dev *dc)
>      * variations in usage like the p term.                                  
>      */                                                                      
>     int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);                     
>-    int64_t error = dirty - target;                                          
>-    int64_t proportional_scaled =                                            
>-        div_s64(error, dc->writeback_rate_p_term_inverse);                   
>-    int64_t integral_scaled;                                                 
>+    int64_t error = dirty - cache_dirty_target;                              
>+    int64_t proportional_scaled, integral_scaled;                            
>     uint32_t new_rate;                                                       
>                                                                              
>+    /*                                                                       
>+     * Ensure even with large device size disparities that we still          
>+     * writeback at least some.                                              
>+     */                                                                      
>+    if (bdev_share_per16k < 1)                                               
>+        bdev_share_per16k = 1;                                               
>+                                                                             
>+    target = div_s64(error * bdev_share_per16k, 16384)                       
>+                                                                             
>+    proportional_scaled = div_s64(error,                                     
>+            dc->writeback_rate_p_term_inverse);                              
>+                                                                             
>     if ((error < 0 && dc->writeback_rate_integral > 0) ||                    
>         (error > 0 && time_before64(local_clock(),                           
>              dc->writeback_rate.next + NSEC_PER_MSEC))) {                    
>--                                                                            
>2.14.1                                                                        

Thanks,
Tang Junhui

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [for-416 PATCH v2] bcache: fix writeback target calculation
  2018-01-02 17:20 tang.junhui
@ 2018-01-03 18:01 ` Michael Lyle
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-03 18:01 UTC (permalink / raw)
  To: tang.junhui; +Cc: linux-bcache, linux-block

On 01/02/2018 09:20 AM, tang.junhui@zte.com.cn wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
> 
> When I saw the code "int64_t error = dirty - cache_dirty_target", I was 
> totally lost, I will try my best to get out. 

Don't waste your time, let me think about this more.  I think the
previous version was correct and now I screwed it up.  I need to really
read how the dirty block accounting happens to do this properly.

I will send a v3, and make sure that I'm fully confident in it.

> And this patch dependent on your earlier patch(PI controller), did that
> patch have applied to upstreams branch? I searched in 4.14 branch, 
> and did not find it.

It's 1d316e658374f700fdfff9299c70ce65d8d145e6 on linus/master.

Mike

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-01-03 18:01 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-01-02  9:25 Re: [for-416 PATCH] bcache: fix writeback target calc on large devices tang.junhui
2018-01-02 17:04 ` Michael Lyle
2018-01-02 19:00 ` [for-416 PATCH v2] bcache: fix writeback target calculation Michael Lyle
2018-01-02 19:02   ` Michael Lyle
  -- strict thread matches above, loose matches on Subject: below --
2018-01-02 17:20 tang.junhui
2018-01-03 18:01 ` Michael Lyle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox