From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196])
	by kanga.kvack.org (Postfix) with SMTP id E5EEA6B0092
	for ; Thu, 24 May 2012 12:59:18 -0400 (EDT)
From: Jan Kara
Subject: [PATCH 0/2 v4] Flexible proportions
Date: Thu, 24 May 2012 18:59:09 +0200
Message-Id: <1337878751-22942-1-git-send-email-jack@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Wu Fengguang
Cc: Peter Zijlstra, linux-mm@kvack.org, LKML

  Hello,

  here is the next iteration of my flexible proportions code. I've addressed
all Peter's comments.

Changes since v3:
* changed fprop_fraction_foo() to avoid using percpu_counter_sum()
* changed __fprop_inc_percpu_max() to avoid 64-bit division (now the maximum
  allowed fraction is expressed as max_frac/FPROP_FRAC_BASE)
* avoid drifting of the period timer
* better handle cases where the period timer fires long after the intended
  time, by aging by the number of periods that really passed

Changes since v2:
* use timer instead of workqueue for triggering period switch
* arm timer only if aging didn't zero out all fractions, re-arm timer when
  a new event arrives again
* set period length to 3s

Some introduction for first time readers:

  The idea of this patch set is to provide code for computing event
proportions where the aging period is not dependent on the number of events
happening (so that aging works well both with fast storage and slow USB
sticks in the same system).

The basic idea is that we compute proportions as:

  p_j = (\Sum_{i>=0} x_{i,j}/2^{i+1}) / (\Sum_{i>=0} x_i/2^{i+1})

Where x_{i,j} is j's number of events in the i-th last time period and x_i is
the total number of events in the i-th last time period.

Note that when the x_i's are all the same (as is the case with the current
proportion code), this expression simplifies to the expression defining the
current proportions, which is:

  p_j = (\Sum_{i>=0} x_{i,j}/2^{i+1}) / t

where t is the length of the aging period. In fact, if we are in the middle
of the period, the proportion computed by the current code is:

  p_j = (x_{0,j} + \Sum_{i>=1} x_{i,j}/2^{i+1}) / (t' + t)

where t' is the total number of events in the running period and t is the
length of the aging period. So there is even more similarity.

As with the current proportion code, it is simple to update a proportion
after several periods have elapsed. For each proportion we store the
numerator of our fraction and the number of the period in which the
proportion was last updated. In the global proportion structure we compute
the denominator of the fraction, which is the same for all event types. So
catching up with missed periods boils down to shifting the numerator by the
number of missed periods and that's it. For more details, please see the
code.

I've also run a few tests (I've created a userspace wrapper to allow me to
run the proportion code in userspace and arbitrarily generate events for it)
to compare the behavior of the old and new code. You can see them at
http://beta.suse.com/private/jack/flex_proportions/

In all the tests the new code showed faster convergence to the current event
proportions (I tried to realistically set period_shift for fixed
proportions). Also in the last test we see that if period_shift is decreased,
then the current proportions become more sensitive to short term fluctuations
in event rate, so just decreasing period_shift isn't a good solution to
slower convergence. If anyone has another idea of what to try, I can do
that - it should be simple enough to implement in my testing tool.

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115])
	by kanga.kvack.org (Postfix) with SMTP id E7B5D6B00E7
	for ; Thu, 24 May 2012 12:59:18 -0400 (EDT)
From: Jan Kara
Subject: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Date: Thu, 24 May 2012 18:59:11 +0200
Message-Id: <1337878751-22942-3-git-send-email-jack@suse.cz>
In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Wu Fengguang
Cc: Peter Zijlstra, linux-mm@kvack.org, LKML, Jan Kara

Convert calculations of proportion of writeback each bdi does to new flexible
proportion code. That allows us to use aging period of fixed wallclock time
which gives better proportion estimates given the hugely varying throughput of
different devices.

Signed-off-by: Jan Kara --- include/linux/backing-dev.h | 4 +- mm/backing-dev.c | 6 +- mm/page-writeback.c | 103 ++++++++++++++++++++++++++---------------- 3 files changed, 69 insertions(+), 44 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index b1038bd..489de62 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -10,7 +10,7 @@ #include #include -#include +#include #include #include #include @@ -89,7 +89,7 @@ struct backing_dev_info { unsigned long dirty_ratelimit; unsigned long balanced_dirty_ratelimit; - struct prop_local_percpu completions; + struct fprop_local_percpu completions; int dirty_exceeded; unsigned int min_ratio; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index dd8e2aa..3387aea 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -677,7 +677,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->min_ratio = 0; bdi->max_ratio = 100; - bdi->max_prop_frac = PROP_FRAC_BASE; + bdi->max_prop_frac = FPROP_FRAC_BASE; spin_lock_init(&bdi->wb_lock); INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->work_list); @@ -700,7 +700,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->write_bandwidth = INIT_BW; bdi->avg_write_bandwidth = INIT_BW; - err = prop_local_init_percpu(&bdi->completions); + err = fprop_local_init_percpu(&bdi->completions); if (err) { err: @@ -744,7 +744,7 @@ void bdi_destroy(struct backing_dev_info *bdi) for (i = 0; i < NR_BDI_STAT_ITEMS; i++) percpu_counter_destroy(&bdi->bdi_stat[i]); - prop_local_destroy_percpu(&bdi->completions); + fprop_local_destroy_percpu(&bdi->completions); } EXPORT_SYMBOL(bdi_destroy); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 26adea8..647daa3 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -34,6 +34,7 @@ #include #include /* __set_page_dirty_buffers */ #include +#include #include /* @@ -135,7 +136,20 @@ unsigned long global_dirty_limit; * measured in page writeback completions. 
* */ -static struct prop_descriptor vm_completions; +static struct fprop_global writeout_completions; + +static void writeout_period(unsigned long t); +/* Timer for aging of writeout_completions */ +static struct timer_list writeout_period_timer = + TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0); +static unsigned long writeout_period_time = 0; + +/* + * Length of period for aging writeout fractions of bdis. This is an + * arbitrarily chosen number. The longer the period, the slower fractions will + * reflect changes in current writeout rate. + */ +#define VM_COMPLETIONS_PERIOD_LEN (3*HZ) /* * Work out the current dirty-memory clamping and background writeout @@ -322,34 +336,6 @@ bool zone_dirty_ok(struct zone *zone) zone_page_state(zone, NR_WRITEBACK) <= limit; } -/* - * couple the period to the dirty_ratio: - * - * period/2 ~ roundup_pow_of_two(dirty limit) - */ -static int calc_period_shift(void) -{ - unsigned long dirty_total; - - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; - else - dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) / - 100; - return 2 + ilog2(dirty_total - 1); -} - -/* - * update the period when the dirty threshold changes. 
- */ -static void update_completion_period(void) -{ - int shift = calc_period_shift(); - prop_change_shift(&vm_completions, shift); - - writeback_set_ratelimit(); -} - int dirty_background_ratio_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -383,7 +369,7 @@ int dirty_ratio_handler(struct ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (ret == 0 && write && vm_dirty_ratio != old_ratio) { - update_completion_period(); + writeback_set_ratelimit(); vm_dirty_bytes = 0; } return ret; @@ -398,12 +384,21 @@ int dirty_bytes_handler(struct ctl_table *table, int write, ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos); if (ret == 0 && write && vm_dirty_bytes != old_bytes) { - update_completion_period(); + writeback_set_ratelimit(); vm_dirty_ratio = 0; } return ret; } +static unsigned long wp_next_time(unsigned long cur_time) +{ + cur_time += VM_COMPLETIONS_PERIOD_LEN; + /* 0 has a special meaning... */ + if (!cur_time) + return 1; + return cur_time; +} + /* * Increment the BDI's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). @@ -411,8 +406,19 @@ int dirty_bytes_handler(struct ctl_table *table, int write, static inline void __bdi_writeout_inc(struct backing_dev_info *bdi) { __inc_bdi_stat(bdi, BDI_WRITTEN); - __prop_inc_percpu_max(&vm_completions, &bdi->completions, - bdi->max_prop_frac); + __fprop_inc_percpu_max(&writeout_completions, &bdi->completions, + bdi->max_prop_frac); + /* First event after period switching was turned off? */ + if (!unlikely(writeout_period_time)) { + /* + * We can race with other __bdi_writeout_inc calls here but + * it does not cause any harm since the resulting time when + * timer will fire and what is in writeout_period_time will be + * roughly the same. 
+ */ + writeout_period_time = wp_next_time(jiffies); + mod_timer(&writeout_period_timer, writeout_period_time); + } } void bdi_writeout_inc(struct backing_dev_info *bdi) @@ -431,11 +437,33 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc); static void bdi_writeout_fraction(struct backing_dev_info *bdi, long *numerator, long *denominator) { - prop_fraction_percpu(&vm_completions, &bdi->completions, + fprop_fraction_percpu(&writeout_completions, &bdi->completions, numerator, denominator); } /* + * On idle system, we can be called long after we scheduled because we use + * deferred timers so count with missed periods. + */ +static void writeout_period(unsigned long t) +{ + int miss_periods = (jiffies - writeout_period_time) / + VM_COMPLETIONS_PERIOD_LEN; + + if (fprop_new_period(&writeout_completions, miss_periods + 1)) { + writeout_period_time = wp_next_time(writeout_period_time + + miss_periods * VM_COMPLETIONS_PERIOD_LEN); + mod_timer(&writeout_period_timer, writeout_period_time); + } else { + /* + * Aging has zeroed all fractions. Stop wasting CPU on period + * updates. + */ + writeout_period_time = 0; + } +} + +/* * bdi_min_ratio keeps the sum of the minimum dirty shares of all * registered backing devices, which, for obvious reasons, can not * exceed 100%. 
@@ -475,7 +503,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio) ret = -EINVAL; } else { bdi->max_ratio = max_ratio; - bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100; + bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100; } spin_unlock_bh(&bdi_lock); @@ -1605,13 +1633,10 @@ static struct notifier_block __cpuinitdata ratelimit_nb = { */ void __init page_writeback_init(void) { - int shift; - writeback_set_ratelimit(); register_cpu_notifier(&ratelimit_nb); - shift = calc_period_shift(); - prop_descriptor_init(&vm_completions, shift); + fprop_global_init(&writeout_completions); } /** -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id E23906B0083 for ; Thu, 24 May 2012 12:59:18 -0400 (EDT) From: Jan Kara Subject: [PATCH 1/2] lib: Proportions with flexible period Date: Thu, 24 May 2012 18:59:10 +0200 Message-Id: <1337878751-22942-2-git-send-email-jack@suse.cz> In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Wu Fengguang Cc: Peter Zijlstra , linux-mm@kvack.org, LKML , Jan Kara Implement code computing proportions of events of different type (like code in lib/proportions.c) but allowing periods to have different lengths. 
This allows us to have aging periods of fixed wallclock time which gives better proportion estimates given the hugely varying throughput of different devices - previous measuring of aging period by number of events has the problem that a reasonable period length for a system with low-end USB stick is not a reasonable period length for a system with high-end storage array resulting either in too slow proportion updates or too fluctuating proportion updates. Signed-off-by: Jan Kara --- include/linux/flex_proportions.h | 100 ++++++++++++++ lib/Makefile | 2 +- lib/flex_proportions.c | 267 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 368 insertions(+), 1 deletions(-) create mode 100644 include/linux/flex_proportions.h create mode 100644 lib/flex_proportions.c diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h new file mode 100644 index 0000000..9ac291c --- /dev/null +++ b/include/linux/flex_proportions.h @@ -0,0 +1,100 @@ +/* + * Floating proportions with flexible aging period + * + * Copyright (C) 2011, SUSE, Jan Kara + */ + +#ifndef _LINUX_FLEX_PROPORTIONS_H +#define _LINUX_FLEX_PROPORTIONS_H + +#include +#include +#include + +/* + * When maximum proportion of some event type is specified, this is the + * precision with which we allow limitting. Note that this creates an upper + * bound on the number of events per period like + * ULLONG_MAX >> FPROP_FRAC_SHIFT. 
+ */ +#define FPROP_FRAC_SHIFT 10 +#define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT) + +/* + * ---- Global proportion definitions ---- + */ +struct fprop_global { + /* Number of events in the current period */ + struct percpu_counter events; + /* Current period */ + unsigned int period; + /* Synchronization with period transitions */ + seqcount_t sequence; +}; + +int fprop_global_init(struct fprop_global *p); +void fprop_global_destroy(struct fprop_global *p); +bool fprop_new_period(struct fprop_global *p, int periods); + +/* + * ---- SINGLE ---- + */ +struct fprop_local_single { + /* the local events counter */ + unsigned long events; + /* Period in which we last updated events */ + unsigned int period; + raw_spinlock_t lock; /* Protect period and numerator */ +}; + +#define INIT_FPROP_LOCAL_SINGLE(name) \ +{ .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ +} + +int fprop_local_init_single(struct fprop_local_single *pl); +void __fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl); +void fprop_fraction_single(struct fprop_global *p, + struct fprop_local_single *pl, unsigned long *numerator, + unsigned long *denominator); + +static inline +void fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl) +{ + unsigned long flags; + + local_irq_save(flags); + __fprop_inc_single(p, pl); + local_irq_restore(flags); +} + +/* + * ---- PERCPU ---- + */ +struct fprop_local_percpu { + /* the local events counter */ + struct percpu_counter events; + /* Period in which we last updated events */ + unsigned int period; + raw_spinlock_t lock; /* Protect period and numerator */ +}; + +int fprop_local_init_percpu(struct fprop_local_percpu *pl); +void fprop_local_destroy_percpu(struct fprop_local_percpu *pl); +void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl); +void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, + int max_frac); +void fprop_fraction_percpu(struct fprop_global *p, + struct 
fprop_local_percpu *pl, unsigned long *numerator, + unsigned long *denominator); + +static inline +void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) +{ + unsigned long flags; + + local_irq_save(flags); + __fprop_inc_percpu(p, pl); + local_irq_restore(flags); +} + +#endif diff --git a/lib/Makefile b/lib/Makefile index 74290c9..8dd81cf 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -11,7 +11,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o dump_stack.o timerqueue.o\ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ - proportions.o prio_heap.o ratelimit.o show_mem.o \ + proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o lib-$(CONFIG_MMU) += ioremap.o diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c new file mode 100644 index 0000000..530dbc2 --- /dev/null +++ b/lib/flex_proportions.c @@ -0,0 +1,267 @@ +/* + * Floating proportions with flexible aging period + * + * Copyright (C) 2011, SUSE, Jan Kara + * + * The goal of this code is: Given different types of event, measure proportion + * of each type of event over time. The proportions are measured with + * exponentially decaying history to give smooth transitions. A formula + * expressing proportion of event of type 'j' is: + * + * p_{j} = (\Sum_{i>=0} x_{i,j}/2^{i+1})/(\Sum_{i>=0} x_i/2^{i+1}) + * + * Where x_{i,j} is j's number of events in i-th last time period and x_i is + * total number of events in i-th last time period. + * + * Note that p_{j}'s are normalised, i.e. + * + * \Sum_{j} p_{j} = 1, + * + * This formula can be straightforwardly computed by maintaing denominator + * (let's call it 'd') and for each event type its numerator (let's call it + * 'n_j'). 
When an event of type 'j' happens, we simply need to do: + * n_j++; d++; + * + * When a new period is declared, we could do: + * d /= 2 + * for each j + * n_j /= 2 + * + * To avoid iteration over all event types, we instead shift numerator of event + * j lazily when someone asks for a proportion of event j or when event j + * occurs. This can bit trivially implemented by remembering last period in + * which something happened with proportion of type j. + */ +#include + +int fprop_global_init(struct fprop_global *p) +{ + int err; + + p->period = 0; + /* Use 1 to avoid dealing with periods with 0 events... */ + err = percpu_counter_init(&p->events, 1); + if (err) + return err; + seqcount_init(&p->sequence); + return 0; +} + +void fprop_global_destroy(struct fprop_global *p) +{ + percpu_counter_destroy(&p->events); +} + +/* + * Declare @periods new periods. It is upto the caller to make sure period + * transitions cannot happen in parallel. + * + * The function returns true if the proportions are still defined and false + * if aging zeroed out all events. This can be used to detect whether declaring + * further periods has any effect. + */ +bool fprop_new_period(struct fprop_global *p, int periods) +{ + u64 events = percpu_counter_sum(&p->events); + + /* + * Don't do anything if there are no events. 
+ */ + if (events <= 1) + return false; + write_seqcount_begin(&p->sequence); + if (periods < 64) + events -= events >> periods; + /* Use addition to avoid losing events happening between sum and set */ + percpu_counter_add(&p->events, -events); + p->period += periods; + write_seqcount_end(&p->sequence); + + return true; +} + +/* + * ---- SINGLE ---- + */ + +int fprop_local_init_single(struct fprop_local_single *pl) +{ + pl->events = 0; + pl->period = 0; + raw_spin_lock_init(&pl->lock); + return 0; +} + +void fprop_local_destroy_single(struct fprop_local_single *pl) +{ +} + +static void fprop_reflect_period_single(struct fprop_global *p, + struct fprop_local_single *pl) +{ + unsigned int period = p->period; + unsigned long flags; + + /* Fast path - period didn't change */ + if (pl->period == period) + return; + raw_spin_lock_irqsave(&pl->lock, flags); + /* Someone updated pl->period while we were spinning? */ + if (pl->period >= period) { + raw_spin_unlock_irqrestore(&pl->lock, flags); + return; + } + /* Aging zeroed our fraction? 
*/ + if (period - pl->period < BITS_PER_LONG) + pl->events >>= period - pl->period; + else + pl->events = 0; + pl->period = period; + raw_spin_unlock_irqrestore(&pl->lock, flags); +} + +/* Event of type pl happened */ +void __fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl) +{ + fprop_reflect_period_single(p, pl); + pl->events++; + percpu_counter_add(&p->events, 1); +} + +/* Return fraction of events of type pl */ +void fprop_fraction_single(struct fprop_global *p, + struct fprop_local_single *pl, + unsigned long *numerator, unsigned long *denominator) +{ + unsigned int seq; + s64 num, den; + + do { + seq = read_seqcount_begin(&p->sequence); + fprop_reflect_period_single(p, pl); + num = pl->events; + den = percpu_counter_read_positive(&p->events); + } while (read_seqcount_retry(&p->sequence, seq)); + + /* + * Make fraction <= 1 and denominator > 0 even in presence of percpu + * counter errors + */ + if (den <= num) { + if (num) + den = num; + else + den = 1; + } + *denominator = den; + *numerator = num; +} + +/* + * ---- PERCPU ---- + */ +#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids))) + +int fprop_local_init_percpu(struct fprop_local_percpu *pl) +{ + int err; + + err = percpu_counter_init(&pl->events, 0); + if (err) + return err; + pl->period = 0; + raw_spin_lock_init(&pl->lock); + return 0; +} + +void fprop_local_destroy_percpu(struct fprop_local_percpu *pl) +{ + percpu_counter_destroy(&pl->events); +} + +static void fprop_reflect_period_percpu(struct fprop_global *p, + struct fprop_local_percpu *pl) +{ + unsigned int period = p->period; + unsigned long flags; + + /* Fast path - period didn't change */ + if (pl->period == period) + return; + raw_spin_lock_irqsave(&pl->lock, flags); + /* Someone updated pl->period while we were spinning? */ + if (pl->period >= period) { + raw_spin_unlock_irqrestore(&pl->lock, flags); + return; + } + /* Aging zeroed our fraction? 
*/ + if (period - pl->period < BITS_PER_LONG) { + s64 val = percpu_counter_read(&pl->events); + + if (val < (nr_cpu_ids * PROP_BATCH)) + val = percpu_counter_sum(&pl->events); + + __percpu_counter_add(&pl->events, + -val + (val >> (period-pl->period)), PROP_BATCH); + } else + percpu_counter_set(&pl->events, 0); + pl->period = period; + raw_spin_unlock_irqrestore(&pl->lock, flags); +} + +/* Event of type pl happened */ +void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) +{ + fprop_reflect_period_percpu(p, pl); + __percpu_counter_add(&pl->events, 1, PROP_BATCH); + percpu_counter_add(&p->events, 1); +} + +void fprop_fraction_percpu(struct fprop_global *p, + struct fprop_local_percpu *pl, + unsigned long *numerator, unsigned long *denominator) +{ + unsigned int seq; + s64 num, den; + + do { + seq = read_seqcount_begin(&p->sequence); + fprop_reflect_period_percpu(p, pl); + num = percpu_counter_read_positive(&pl->events); + den = percpu_counter_read_positive(&p->events); + } while (read_seqcount_retry(&p->sequence, seq)); + + /* + * Make fraction <= 1 and denominator > 0 even in presence of percpu + * counter errors + */ + if (den <= num) { + if (num) + den = num; + else + den = 1; + } + *denominator = den; + *numerator = num; +} + +/* + * Like __fprop_inc_percpu() except that event is counted only if the given + * type has fraction smaller than @max_frac/FPROP_FRAC_BASE + */ +void __fprop_inc_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac) +{ + if (unlikely(max_frac < FPROP_FRAC_BASE)) { + unsigned long numerator, denominator; + + fprop_fraction_percpu(p, pl, &numerator, &denominator); + if (numerator > + (((u64)denominator) * max_frac) >> FPROP_FRAC_SHIFT) + return; + } else + fprop_reflect_period_percpu(p, pl); + __percpu_counter_add(&pl->events, 1, PROP_BATCH); + percpu_counter_add(&p->events, 1); +} + -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to 
majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203])
	by kanga.kvack.org (Postfix) with SMTP id CDD696B00F8
	for ; Fri, 25 May 2012 05:12:47 -0400 (EDT)
Received: from canuck.infradead.org ([2001:4978:20e::1])
	by merlin.infradead.org with esmtps (Exim 4.76 #1 (Red Hat Linux))
	id 1SXqZi-0005UD-Cv
	for linux-mm@kvack.org; Fri, 25 May 2012 09:12:46 +0000
Received: from dhcp-089-099-019-018.chello.nl ([89.99.19.18] helo=dyad.programming.kicks-ass.net)
	by canuck.infradead.org with esmtpsa (Exim 4.76 #1 (Red Hat Linux))
	id 1SXqZi-0000MS-2I
	for linux-mm@kvack.org; Fri, 25 May 2012 09:12:46 +0000
Subject: Re: [PATCH 0/2 v4] Flexible proportions
From: Peter Zijlstra
In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz>
Content-Type: text/plain; charset="UTF-8"
Date: Fri, 25 May 2012 11:12:42 +0200
Message-ID: <1337937162.9783.163.camel@laptop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Wu Fengguang, linux-mm@kvack.org, LKML

On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote:
> here is the next iteration of my flexible proportions code. I've addressed
> all Peter's comments.

Thanks, all I could come up with is comment placement nits and I'll not
go there ;-)

Acked-by: Peter Zijlstra

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160])
	by kanga.kvack.org (Postfix) with SMTP id D1A41940001
	for ; Fri, 25 May 2012 05:30:38 -0400 (EDT)
Date: Fri, 25 May 2012 17:29:36 +0800
From: Fengguang Wu
Subject: Re: [PATCH 0/2 v4] Flexible proportions
Message-ID: <20120525092936.GA12729@localhost>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337937162.9783.163.camel@laptop>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1337937162.9783.163.camel@laptop>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Peter Zijlstra
Cc: Jan Kara, linux-mm@kvack.org, LKML

On Fri, May 25, 2012 at 11:12:42AM +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote:
> > here is the next iteration of my flexible proportions code. I've addressed
> > all Peter's comments.
>
> Thanks, all I could come up with is comment placement nits and I'll not
> go there ;-)
>
> Acked-by: Peter Zijlstra

Thank you both for making it work! I've applied them to the writeback tree.

Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167])
	by kanga.kvack.org (Postfix) with SMTP id 759A26B0092
	for ; Mon, 28 May 2012 11:48:49 -0400 (EDT)
Received: by bkcjm19 with SMTP id jm19so3375517bkc.14
	for ; Mon, 28 May 2012 08:48:47 -0700 (PDT)
Message-ID: <1338220185.4284.19.camel@lappy>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Sasha Levin
Date: Mon, 28 May 2012 17:49:45 +0200
In-Reply-To: <1337878751-22942-3-git-send-email-jack@suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Wu Fengguang, Peter Zijlstra, linux-mm@kvack.org, LKML

Hi Jan,

On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote:
> Convert calculations of proportion of writeback each bdi does to new flexible
> proportion code. That allows us to use aging period of fixed wallclock time
> which gives better proportion estimates given the hugely varying throughput of
> different devices.
>
> Signed-off-by: Jan Kara
> ---

This patch appears to be causing lockdep warnings over here:

[ 20.545016] =================================
[ 20.545016] [ INFO: inconsistent lock state ]
[ 20.545016] 3.4.0-next-20120528-sasha-00008-g11ef39f #307 Tainted: G W
[ 20.545016] ---------------------------------
[ 20.545016] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 20.545016] rcu_torture_rea/2493 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 20.545016] (key#3){?.-...}, at: [] __percpu_counter_sum+0x17/0xc0 [ 20.545016] {IN-HARDIRQ-W} state was registered at: [ 20.545016] [] mark_irqflags+0x6b/0x170 [ 20.545016] [] __lock_acquire+0x2bb/0x4c0 [ 20.545016] [] lock_acquire+0x18a/0x1e0 [ 20.545016] [] _raw_spin_lock+0x3b/0x70 [ 20.545016] [] __percpu_counter_add+0x50/0xb0 [ 20.545016] [] __fprop_inc_percpu_max+0x8a/0xa0 [ 20.545016] [] test_clear_page_writeback+0x12d/0x1c0 [ 20.545016] [] end_page_writeback+0x24/0x50 [ 20.545016] [] end_buffer_async_write+0x26a/0x350 [ 20.545016] [] end_bio_bh_io_sync+0x3d/0x50 [ 20.545016] [] bio_endio+0x29/0x30 [ 20.545016] [] req_bio_endio+0xb9/0xd0 [ 20.545016] [] blk_update_request+0x1a8/0x3c0 [ 20.545016] [] blk_update_bidi_request+0x22/0x90 [ 20.545016] [] __blk_end_bidi_request+0x1c/0x40 [ 20.545016] [] __blk_end_request_all+0x28/0x40 [ 20.545016] [] blk_done+0x9e/0xf0 [ 20.545016] [] vring_interrupt+0x86/0xa0 [ 20.680186] [] handle_irq_event_percpu+0x151/0x3e0 [ 20.680186] [] handle_irq_event+0x43/0x70 [ 20.680186] [] handle_edge_irq+0xe8/0x120 [ 20.680186] [] handle_irq+0x164/0x180 [ 20.680186] [] do_IRQ+0x58/0xd0 [ 20.680186] [] ret_from_intr+0x0/0x1a [ 20.680186] [] blk_queue_bio+0x30d/0x430 [ 20.680186] [] generic_make_request+0xbe/0x120 [ 20.680186] [] submit_bio+0xf8/0x120 [ 20.680186] [] submit_bh+0x122/0x150 [ 20.680186] [] __block_write_full_page+0x287/0x3b0 [ 20.680186] [] block_write_full_page_endio+0xfc/0x120 [ 20.680186] [] block_write_full_page+0x10/0x20 [ 20.680186] [] blkdev_writepage+0x13/0x20 [ 20.680186] [] __writepage+0x15/0x40 [ 20.680186] [] write_cache_pages+0x49f/0x650 [ 20.680186] [] generic_writepages+0x4f/0x70 [ 20.680186] [] do_writepages+0x1e/0x50 [ 20.680186] [] __filemap_fdatawrite_range+0x49/0x50 [ 20.680186] [] filemap_fdatawrite+0x1a/0x20 [ 20.680186] [] filemap_write_and_wait+0x25/0x50 [ 20.680186] [] __sync_blockdev+0x2d/0x40 [ 20.680186] [] sync_blockdev+0xe/0x10 
[ 20.680186] [] journal_recover+0x182/0x1c0 [ 20.680186] [] journal_load+0x58/0xa0 [ 20.680186] [] ext3_load_journal+0x200/0x2b0 [ 20.680186] [] ext3_fill_super+0xc18/0x10d0 [ 20.680186] [] mount_bdev+0x176/0x210 [ 20.680186] [] ext3_mount+0x10/0x20 [ 20.680186] [] mount_fs+0x85/0x1a0 [ 20.680186] [] vfs_kern_mount+0x74/0x100 [ 20.680186] [] do_kern_mount+0x51/0x120 [ 20.680186] [] do_mount+0x1d4/0x240 [ 20.680186] [] sys_mount+0x9d/0xe0 [ 20.680186] [] do_mount_root+0x1e/0x94 [ 20.680186] [] mount_block_root+0xe2/0x224 [ 20.680186] [] mount_root+0x12b/0x136 [ 20.680186] [] prepare_namespace+0x165/0x19e [ 20.680186] [] kernel_init+0x274/0x28a [ 20.680186] [] kernel_thread_helper+0x4/0x10 [ 20.680186] irq event stamp: 1551906 [ 20.680186] hardirqs last enabled at (1551906): [] _raw_spin_unlock_irq+0x2b/0x80 [ 20.680186] hardirqs last disabled at (1551905): [] _raw_spin_lock_irq+0x34/0xa0 [ 20.680186] softirqs last enabled at (1551022): [] __do_softirq+0x3db/0x460 [ 20.680186] softirqs last disabled at (1551903): [] call_softirq+0x1c/0x30 [ 20.680186] [ 20.680186] other info that might help us debug this: [ 20.680186] Possible unsafe locking scenario: [ 20.680186] [ 20.680186] CPU0 [ 20.680186] ---- [ 20.680186] lock(key#3); [ 20.680186] [ 20.680186] lock(key#3); [ 20.680186] [ 20.680186] *** DEADLOCK *** [ 20.680186] [ 20.680186] 2 locks held by rcu_torture_rea/2493: [ 20.680186] #0: (rcu_read_lock){.+.+..}, at: [] rcu_torture_read_lock+0x0/0x80 [ 20.680186] #1: (mm/page-writeback.c:144){+.-...}, at: [] call_timer_fn+0x0/0x260 [ 20.680186] [ 20.680186] stack backtrace: [ 20.680186] Pid: 2493, comm: rcu_torture_rea Tainted: G W 3.4.0-next-20120528-sasha-00008-g11ef39f #307 [ 20.680186] Call Trace: [ 20.680186] [] print_usage_bug+0x1a9/0x1d0 [ 20.680186] [] ? 
check_usage_forwards+0xf0/0xf0 [ 20.680186] [] mark_lock_irq+0xc9/0x270 [ 20.680186] [] mark_lock+0x11d/0x200 [ 20.680186] [] mark_irqflags+0xf0/0x170 [ 20.680186] [] __lock_acquire+0x2bb/0x4c0 [ 20.680186] [] lock_acquire+0x18a/0x1e0 [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] ? laptop_io_completion+0x30/0x30 [ 20.680186] [] _raw_spin_lock+0x3b/0x70 [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 [ 20.680186] [] fprop_new_period+0x12/0x60 [ 20.680186] [] writeout_period+0x3d/0xa0 [ 20.680186] [] call_timer_fn+0x12f/0x260 [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 [ 20.680186] [] ? _raw_spin_unlock_irq+0x2b/0x80 [ 20.680186] [] ? laptop_io_completion+0x30/0x30 [ 20.680186] [] run_timer_softirq+0x29e/0x2f0 [ 20.680186] [] __do_softirq+0x221/0x460 [ 20.680186] [] ? kvm_clock_read+0x46/0x80 [ 20.680186] [] call_softirq+0x1c/0x30 [ 20.680186] [] do_softirq+0x75/0x120 [ 20.680186] [] irq_exit+0x5b/0xf0 [ 20.680186] [] smp_apic_timer_interrupt+0x8a/0xa0 [ 20.680186] [] apic_timer_interrupt+0x6f/0x80 [ 20.680186] [] ? lock_acquire+0x1be/0x1e0 [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 [ 20.680186] [] rcu_torture_read_lock+0x33/0x80 [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 [ 20.680186] [] rcu_torture_reader+0x123/0x380 [ 20.680186] [] ? T.841+0x50/0x50 [ 20.680186] [] ? rcu_torture_read_unlock+0x60/0x60 [ 20.680186] [] kthread+0xb2/0xc0 [ 20.680186] [] kernel_thread_helper+0x4/0x10 [ 20.680186] [] ? retint_restore_args+0x13/0x13 [ 20.680186] [] ? __init_kthread_worker+0x70/0x70 [ 20.680186] [] ? gs_change+0x13/0x13 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id DB3976B005C for ; Tue, 29 May 2012 08:34:18 -0400 (EDT)
Date: Tue, 29 May 2012 14:34:08 +0200
From: Jan Kara
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Message-ID: <20120529123408.GA23991@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1338220185.4284.19.camel@lappy>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Sasha Levin
Cc: Jan Kara, Wu Fengguang, Peter Zijlstra, linux-mm@kvack.org, LKML

On Mon 28-05-12 17:49:45, Sasha Levin wrote:
> Hi Jan,
>
> On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote:
> > Convert calculations of proportion of writeback each bdi does to new flexible
> > proportion code. That allows us to use aging period of fixed wallclock time
> > which gives better proportion estimates given the hugely varying throughput of
> > different devices.
> >
> > Signed-off-by: Jan Kara
> > ---
>
> This patch appears to be causing lockdep warnings over here:

Actually, this is not caused directly by my patch. My patch just makes the
problem more likely, because __fprop_inc_percpu_max() uses a smaller counter
batch than the original __prop_inc_percpu_max(), so the probability that the
percpu counter takes its spinlock (which is what triggers the warning) is
higher. The only safe solution seems to be to create a variant of percpu
counters that can be used from an interrupt. Or do you have another idea, Peter?
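To illustrate the batch argument above (an illustration only, not code from the patch): the sketch below models one CPU's slot of a batched percpu counter in userspace C. The names counter_sim, counter_sim_add and lock_takes_for are made up for the example, and the folding rule is an assumption about how __percpu_counter_add() behaves, namely that the local delta is folded into the shared count, under the spinlock, once it reaches the batch size.

```c
#include <assert.h>

/* Hypothetical single-CPU model of a batched percpu counter. */
struct counter_sim {
	long long count;          /* shared total, lock-protected in the kernel */
	long local;               /* this CPU's not-yet-folded delta */
	unsigned long lock_takes; /* how often the shared lock would be taken */
};

/* Add amount; fold the local delta into the shared total once it reaches batch. */
static void counter_sim_add(struct counter_sim *c, long amount, long batch)
{
	long v = c->local + amount;

	assert(batch > 0);
	if (v >= batch || v <= -batch) {
		c->lock_takes++;	/* the kernel would take the counter spinlock here */
		c->count += v;
		c->local = 0;
	} else {
		c->local = v;
	}
}

/* Count the lock acquisitions caused by n single-event increments. */
static unsigned long lock_takes_for(unsigned long n, long batch)
{
	struct counter_sim c = { 0, 0, 0 };
	unsigned long i;

	for (i = 0; i < n; i++)
		counter_sim_add(&c, 1, batch);
	return c.lock_takes;
}
```

In this model, 1024 single-event increments take the lock 32 times with a batch of 32 and 128 times with a batch of 8, so a four times smaller batch means the lock is taken four times as often.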
Honza > > [ 20.545016] ================================= > [ 20.545016] [ INFO: inconsistent lock state ] > [ 20.545016] 3.4.0-next-20120528-sasha-00008-g11ef39f #307 Tainted: G W > [ 20.545016] --------------------------------- > [ 20.545016] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. > [ 20.545016] rcu_torture_rea/2493 [HC0[0]:SC1[1]:HE1:SE0] takes: > [ 20.545016] (key#3){?.-...}, at: [] __percpu_counter_sum+0x17/0xc0 > [ 20.545016] {IN-HARDIRQ-W} state was registered at: > [ 20.545016] [] mark_irqflags+0x6b/0x170 > [ 20.545016] [] __lock_acquire+0x2bb/0x4c0 > [ 20.545016] [] lock_acquire+0x18a/0x1e0 > [ 20.545016] [] _raw_spin_lock+0x3b/0x70 > [ 20.545016] [] __percpu_counter_add+0x50/0xb0 > [ 20.545016] [] __fprop_inc_percpu_max+0x8a/0xa0 > [ 20.545016] [] test_clear_page_writeback+0x12d/0x1c0 > [ 20.545016] [] end_page_writeback+0x24/0x50 > [ 20.545016] [] end_buffer_async_write+0x26a/0x350 > [ 20.545016] [] end_bio_bh_io_sync+0x3d/0x50 > [ 20.545016] [] bio_endio+0x29/0x30 > [ 20.545016] [] req_bio_endio+0xb9/0xd0 > [ 20.545016] [] blk_update_request+0x1a8/0x3c0 > [ 20.545016] [] blk_update_bidi_request+0x22/0x90 > [ 20.545016] [] __blk_end_bidi_request+0x1c/0x40 > [ 20.545016] [] __blk_end_request_all+0x28/0x40 > [ 20.545016] [] blk_done+0x9e/0xf0 > [ 20.545016] [] vring_interrupt+0x86/0xa0 > [ 20.680186] [] handle_irq_event_percpu+0x151/0x3e0 > [ 20.680186] [] handle_irq_event+0x43/0x70 > [ 20.680186] [] handle_edge_irq+0xe8/0x120 > [ 20.680186] [] handle_irq+0x164/0x180 > [ 20.680186] [] do_IRQ+0x58/0xd0 > [ 20.680186] [] ret_from_intr+0x0/0x1a > [ 20.680186] [] blk_queue_bio+0x30d/0x430 > [ 20.680186] [] generic_make_request+0xbe/0x120 > [ 20.680186] [] submit_bio+0xf8/0x120 > [ 20.680186] [] submit_bh+0x122/0x150 > [ 20.680186] [] __block_write_full_page+0x287/0x3b0 > [ 20.680186] [] block_write_full_page_endio+0xfc/0x120 > [ 20.680186] [] block_write_full_page+0x10/0x20 > [ 20.680186] [] blkdev_writepage+0x13/0x20 > [ 20.680186] [] 
__writepage+0x15/0x40 > [ 20.680186] [] write_cache_pages+0x49f/0x650 > [ 20.680186] [] generic_writepages+0x4f/0x70 > [ 20.680186] [] do_writepages+0x1e/0x50 > [ 20.680186] [] __filemap_fdatawrite_range+0x49/0x50 > [ 20.680186] [] filemap_fdatawrite+0x1a/0x20 > [ 20.680186] [] filemap_write_and_wait+0x25/0x50 > [ 20.680186] [] __sync_blockdev+0x2d/0x40 > [ 20.680186] [] sync_blockdev+0xe/0x10 > [ 20.680186] [] journal_recover+0x182/0x1c0 > [ 20.680186] [] journal_load+0x58/0xa0 > [ 20.680186] [] ext3_load_journal+0x200/0x2b0 > [ 20.680186] [] ext3_fill_super+0xc18/0x10d0 > [ 20.680186] [] mount_bdev+0x176/0x210 > [ 20.680186] [] ext3_mount+0x10/0x20 > [ 20.680186] [] mount_fs+0x85/0x1a0 > [ 20.680186] [] vfs_kern_mount+0x74/0x100 > [ 20.680186] [] do_kern_mount+0x51/0x120 > [ 20.680186] [] do_mount+0x1d4/0x240 > [ 20.680186] [] sys_mount+0x9d/0xe0 > [ 20.680186] [] do_mount_root+0x1e/0x94 > [ 20.680186] [] mount_block_root+0xe2/0x224 > [ 20.680186] [] mount_root+0x12b/0x136 > [ 20.680186] [] prepare_namespace+0x165/0x19e > [ 20.680186] [] kernel_init+0x274/0x28a > [ 20.680186] [] kernel_thread_helper+0x4/0x10 > [ 20.680186] irq event stamp: 1551906 > [ 20.680186] hardirqs last enabled at (1551906): [] _raw_spin_unlock_irq+0x2b/0x80 > [ 20.680186] hardirqs last disabled at (1551905): [] _raw_spin_lock_irq+0x34/0xa0 > [ 20.680186] softirqs last enabled at (1551022): [] __do_softirq+0x3db/0x460 > [ 20.680186] softirqs last disabled at (1551903): [] call_softirq+0x1c/0x30 > [ 20.680186] > [ 20.680186] other info that might help us debug this: > [ 20.680186] Possible unsafe locking scenario: > [ 20.680186] > [ 20.680186] CPU0 > [ 20.680186] ---- > [ 20.680186] lock(key#3); > [ 20.680186] > [ 20.680186] lock(key#3); > [ 20.680186] > [ 20.680186] *** DEADLOCK *** > [ 20.680186] > [ 20.680186] 2 locks held by rcu_torture_rea/2493: > [ 20.680186] #0: (rcu_read_lock){.+.+..}, at: [] rcu_torture_read_lock+0x0/0x80 > [ 20.680186] #1: (mm/page-writeback.c:144){+.-...}, at: [] 
call_timer_fn+0x0/0x260 > [ 20.680186] > [ 20.680186] stack backtrace: > [ 20.680186] Pid: 2493, comm: rcu_torture_rea Tainted: G W 3.4.0-next-20120528-sasha-00008-g11ef39f #307 > [ 20.680186] Call Trace: > [ 20.680186] [] print_usage_bug+0x1a9/0x1d0 > [ 20.680186] [] ? check_usage_forwards+0xf0/0xf0 > [ 20.680186] [] mark_lock_irq+0xc9/0x270 > [ 20.680186] [] mark_lock+0x11d/0x200 > [ 20.680186] [] mark_irqflags+0xf0/0x170 > [ 20.680186] [] __lock_acquire+0x2bb/0x4c0 > [ 20.680186] [] lock_acquire+0x18a/0x1e0 > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 > [ 20.680186] [] ? laptop_io_completion+0x30/0x30 > [ 20.680186] [] _raw_spin_lock+0x3b/0x70 > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0 > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 > [ 20.680186] [] fprop_new_period+0x12/0x60 > [ 20.680186] [] writeout_period+0x3d/0xa0 > [ 20.680186] [] call_timer_fn+0x12f/0x260 > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 > [ 20.680186] [] ? _raw_spin_unlock_irq+0x2b/0x80 > [ 20.680186] [] ? laptop_io_completion+0x30/0x30 > [ 20.680186] [] run_timer_softirq+0x29e/0x2f0 > [ 20.680186] [] __do_softirq+0x221/0x460 > [ 20.680186] [] ? kvm_clock_read+0x46/0x80 > [ 20.680186] [] call_softirq+0x1c/0x30 > [ 20.680186] [] do_softirq+0x75/0x120 > [ 20.680186] [] irq_exit+0x5b/0xf0 > [ 20.680186] [] smp_apic_timer_interrupt+0x8a/0xa0 > [ 20.680186] [] apic_timer_interrupt+0x6f/0x80 > [ 20.680186] [] ? lock_acquire+0x1be/0x1e0 > [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 > [ 20.680186] [] rcu_torture_read_lock+0x33/0x80 > [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 > [ 20.680186] [] rcu_torture_reader+0x123/0x380 > [ 20.680186] [] ? T.841+0x50/0x50 > [ 20.680186] [] ? rcu_torture_read_unlock+0x60/0x60 > [ 20.680186] [] kthread+0xb2/0xc0 > [ 20.680186] [] kernel_thread_helper+0x4/0x10 > [ 20.680186] [] ? retint_restore_args+0x13/0x13 > [ 20.680186] [] ? 
__init_kthread_worker+0x70/0x70
> [ 20.680186] [] ? gs_change+0x13/0x13
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 3BC866B005C for ; Tue, 29 May 2012 08:38:41 -0400 (EDT)
Message-ID: <1338295111.26856.57.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
Date: Tue, 29 May 2012 14:38:31 +0200
In-Reply-To: <20120529123408.GA23991@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML

On Tue, 2012-05-29 at 14:34 +0200, Jan Kara wrote:

> The only safe solution seems to be to create a variant of percpu counters
> that can be used from an interrupt. Or do you have other idea Peter?

> > [ 20.680186] [] _raw_spin_lock+0x3b/0x70
> > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0
> > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> > [ 20.680186] [] fprop_new_period+0x12/0x60
> > [ 20.680186] [] writeout_period+0x3d/0xa0
> > [ 20.680186] [] call_timer_fn+0x12f/0x260
> > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20

Yeah, just make sure IRQs are disabled around doing that ;-)

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id 308016B005C for ; Tue, 29 May 2012 08:55:00 -0400 (EDT) Date: Tue, 29 May 2012 14:54:52 +0200 From: Jan Kara Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions Message-ID: <20120529125452.GB23991@quack.suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1338295111.26856.57.camel@twins> Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: Jan Kara , Sasha Levin , Wu Fengguang , linux-mm@kvack.org, LKML On Tue 29-05-12 14:38:31, Peter Zijlstra wrote: > On Tue, 2012-05-29 at 14:34 +0200, Jan Kara wrote: > > > The only safe solution seems to be to create a variant of percpu counters > > that can be used from an interrupt. Or do you have other idea Peter? > > > > [ 20.680186] [] _raw_spin_lock+0x3b/0x70 > > > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 > > > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0 > > > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 > > > [ 20.680186] [] fprop_new_period+0x12/0x60 > > > [ 20.680186] [] writeout_period+0x3d/0xa0 > > > [ 20.680186] [] call_timer_fn+0x12f/0x260 > > > [ 20.680186] [] ? 
init_timer_deferrable_key+0x20/0x20 > > Yeah, just make sure IRQs are disabled around doing that ;-) Evil ;) But we'd need to have IRQs disabled also in each fprop_fraction_percpu() call, and generally, if we want things clean, we'd need to disable them in all entry points to proportion code (or at least around all percpu calls)... Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id 9E29D6B005C for ; Thu, 31 May 2012 18:11:51 -0400 (EDT) Date: Fri, 1 Jun 2012 00:11:46 +0200 From: Jan Kara Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions Message-ID: <20120531221146.GA19050@quack.suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="zhXaljGHf11kAtnf" Content-Disposition: inline In-Reply-To: <20120529125452.GB23991@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: Jan Kara , Sasha Levin , Wu Fengguang , linux-mm@kvack.org, LKML --zhXaljGHf11kAtnf Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue 29-05-12 14:54:52, Jan Kara wrote: > On Tue 29-05-12 14:38:31, Peter Zijlstra wrote: > > On Tue, 2012-05-29 at 14:34 +0200, Jan Kara wrote: > > > > > The only safe solution seems to be to create a variant of percpu counters > > > that can be used from an interrupt. Or do you have other idea Peter? 
> >
> > > > [ 20.680186] [] _raw_spin_lock+0x3b/0x70
> > > > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> > > > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0
> > > > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> > > > [ 20.680186] [] fprop_new_period+0x12/0x60
> > > > [ 20.680186] [] writeout_period+0x3d/0xa0
> > > > [ 20.680186] [] call_timer_fn+0x12f/0x260
> > > > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> >
> > Yeah, just make sure IRQs are disabled around doing that ;-)
> Evil ;) But we'd need to have IRQs disabled also in each
> fprop_fraction_percpu() call, and generally, if we want things clean, we'd
> need to disable them in all entry points to proportion code (or at least
> around all percpu calls)...

OK, after some thought I was wrong and fixing fprop_new_period() is enough.
Attached patch should fix the warning (and possible deadlock).

Fengguang should I resend you fixed patch implementing flexible proportions
or do you prefer incremental patch against your tree?

Honza
--
Jan Kara
SUSE Labs, CR

--zhXaljGHf11kAtnf
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="flex-proportion-irq-save.diff"

From: Jan Kara
Subject: lib: Fix possible deadlock in flexible proportion code

When percpu counter function in fprop_new_period() is interrupted by an
interrupt while holding counter lock, it can cause deadlock when the
interrupt wants to take the lock as well. Fix the problem by disabling
interrupts when calling percpu counter functions.

Signed-off-by: Jan Kara

diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index 530dbc2..fbf6b11 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -62,8 +62,12 @@ void fprop_global_destroy(struct fprop_global *p)
  */
 bool fprop_new_period(struct fprop_global *p, int periods)
 {
-	u64 events = percpu_counter_sum(&p->events);
+	u64 events;
+	unsigned long flags;
 
+	local_irq_save(flags);
+	events = percpu_counter_sum(&p->events);
+	local_irq_restore(flags);
 	/*
 	 * Don't do anything if there are no events.
 	 */
@@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
 	if (periods < 64)
 		events -= events >> periods;
 	/* Use addition to avoid losing events happening between sum and set */
+	local_irq_save(flags);
 	percpu_counter_add(&p->events, -events);
+	local_irq_restore(flags);
 	p->period += periods;
 	write_seqcount_end(&p->sequence);
 

--zhXaljGHf11kAtnf--

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id F02FB6B005C for ; Thu, 31 May 2012 18:26:14 -0400 (EDT)
Message-ID: <1338503165.28384.134.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
Date: Fri, 01 Jun 2012 00:26:05 +0200
In-Reply-To: <20120531221146.GA19050@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML

On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
>  bool fprop_new_period(struct fprop_global *p, int periods)
>  {
> -	u64 events = percpu_counter_sum(&p->events);
> +	u64 events;
> +	unsigned long flags;
>  
> +	local_irq_save(flags);
> +	events = percpu_counter_sum(&p->events);
> +	local_irq_restore(flags);
>  	/*
>  	 * Don't do anything if there are no events.
>  	 */
> @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
>  	if (periods < 64)
>  		events -= events >> periods;
>  	/* Use addition to avoid losing events happening between sum and set */
> +	local_irq_save(flags);
>  	percpu_counter_add(&p->events, -events);
> +	local_irq_restore(flags);
>  	p->period += periods;
>  	write_seqcount_end(&p->sequence);

Uhm, why bother enabling it in between? Just wrap the whole function in
a single IRQ disable.
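A userspace analogy of that suggestion, as a rough sketch rather than kernel code: blocking all signals with sigprocmask() stands in for local_irq_save(), and restoring the saved mask stands in for local_irq_restore(). struct fprop_sim and fprop_new_period_sim() are hypothetical names, and a plain integer replaces the percpu counter. The point is only the shape: one save at the top and a restore on every exit path.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct fprop_global; no percpu counter here. */
struct fprop_sim {
	uint64_t events;
	unsigned int period;
};

/* Age the event count by `periods` periods inside one critical section. */
static bool fprop_new_period_sim(struct fprop_sim *p, int periods)
{
	sigset_t all, saved;
	uint64_t events;

	assert(periods >= 0);
	sigfillset(&all);
	sigprocmask(SIG_BLOCK, &all, &saved);	/* plays the role of local_irq_save() */

	events = p->events;
	if (events <= 1) {
		/* Early exit must restore too, or signals stay blocked. */
		sigprocmask(SIG_SETMASK, &saved, NULL);
		return false;
	}
	if (periods < 64)
		events -= events >> periods;	/* amount to remove from the count */
	p->events -= events;
	p->period += periods;

	sigprocmask(SIG_SETMASK, &saved, NULL);	/* plays the role of local_irq_restore() */
	return true;
}
```

Starting from 100 events, aging by one period leaves 50 events, and aging by 64 or more periods leaves none.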
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 5E8EB6B005C for ; Thu, 31 May 2012 18:42:09 -0400 (EDT)
Date: Fri, 1 Jun 2012 00:42:06 +0200
From: Jan Kara
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Message-ID: <20120531224206.GC19050@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz> <1338503165.28384.134.camel@twins>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="IiVenqGWf+H9Y6IX"
Content-Disposition: inline
In-Reply-To: <1338503165.28384.134.camel@twins>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Peter Zijlstra
Cc: Jan Kara, Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML

--IiVenqGWf+H9Y6IX
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri 01-06-12 00:26:05, Peter Zijlstra wrote:
> On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
> >  bool fprop_new_period(struct fprop_global *p, int periods)
> >  {
> > -	u64 events = percpu_counter_sum(&p->events);
> > +	u64 events;
> > +	unsigned long flags;
> >  
> > +	local_irq_save(flags);
> > +	events = percpu_counter_sum(&p->events);
> > +	local_irq_restore(flags);
> >  	/*
> >  	 * Don't do anything if there are no events.
> >  	 */
> > @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
> >  	if (periods < 64)
> >  		events -= events >> periods;
> >  	/* Use addition to avoid losing events happening between sum and set */
> > +	local_irq_save(flags);
> >  	percpu_counter_add(&p->events, -events);
> > +	local_irq_restore(flags);
> >  	p->period += periods;
> >  	write_seqcount_end(&p->sequence);
>
> Uhm, why bother enabling it in between? Just wrap the whole function in
> a single IRQ disable.

I wanted to have interrupts disabled for as short as possible but if you
think it doesn't matter, I'll take your advice. The result is attached.

Honza
--
Jan Kara
SUSE Labs, CR

--IiVenqGWf+H9Y6IX
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="flex-proportion-irq-save.diff"

From: Jan Kara
Subject: lib: Fix possible deadlock in flexible proportion code

When percpu counter function in fprop_new_period() is interrupted by an
interrupt while holding counter lock, it can cause deadlock when the
interrupt wants to take the lock as well. Fix the problem by disabling
interrupts when calling percpu counter functions.

Signed-off-by: Jan Kara

diff -u b/lib/flex_proportions.c b/lib/flex_proportions.c
--- b/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -62,13 +62,18 @@
  */
 bool fprop_new_period(struct fprop_global *p, int periods)
 {
-	u64 events = percpu_counter_sum(&p->events);
+	u64 events;
+	unsigned long flags;
 
+	local_irq_save(flags);
+	events = percpu_counter_sum(&p->events);
 	/*
 	 * Don't do anything if there are no events.
 	 */
-	if (events <= 1)
+	if (events <= 1) {
+		local_irq_restore(flags);
 		return false;
+	}
 	write_seqcount_begin(&p->sequence);
 	if (periods < 64)
 		events -= events >> periods;
@@ -76,6 +81,7 @@
 	percpu_counter_add(&p->events, -events);
 	p->period += periods;
 	write_seqcount_end(&p->sequence);
+	local_irq_restore(flags);
 
 	return true;
 }

--IiVenqGWf+H9Y6IX--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id 1FFE36B004D for ; Thu, 31 May 2012 23:10:18 -0400 (EDT)
Date: Fri, 1 Jun 2012 11:10:15 +0800
From: Fengguang Wu
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Message-ID: <20120601031015.GB7896@localhost>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz> <1338503165.28384.134.camel@twins> <20120531224206.GC19050@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=gb2312
Content-Disposition: inline
In-Reply-To: <20120531224206.GC19050@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Peter Zijlstra, Sasha Levin, linux-mm@kvack.org, LKML

On Fri, Jun 01, 2012 at 12:42:06AM +0200, Jan Kara wrote:
> On Fri 01-06-12 00:26:05, Peter Zijlstra wrote:
> > On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
> > >  bool fprop_new_period(struct fprop_global *p, int periods)
> > >  {
> > > -	u64 events = percpu_counter_sum(&p->events);
> > > +	u64 events;
> > > +	unsigned long flags;
> > >  
> > > +	local_irq_save(flags);
> > > +	events = percpu_counter_sum(&p->events);
> > > +	local_irq_restore(flags);
> > >  	/*
> > >  	 * Don't do anything if there are no events.
> > >  	 */
> > > @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
> > >  	if (periods < 64)
> > >  		events -= events >> periods;
> > >  	/* Use addition to avoid losing events happening between sum and set */
> > > +	local_irq_save(flags);
> > >  	percpu_counter_add(&p->events, -events);
> > > +	local_irq_restore(flags);
> > >  	p->period += periods;
> > >  	write_seqcount_end(&p->sequence);
> >
> > Uhm, why bother enabling it in between? Just wrap the whole function in
> > a single IRQ disable.
> I wanted to have interrupts disabled for as short as possible but if you
> think it doesn't matter, I'll take your advice. The result is attached.

Thank you! I applied this incremental fix next to the commit
"lib: Proportions with flexible period".

Thanks,
Fengguang

> From: Jan Kara
> Subject: lib: Fix possible deadlock in flexible proportion code
>
> When percpu counter function in fprop_new_period() is interrupted by an
> interrupt while holding counter lock, it can cause deadlock when the
> interrupt wants to take the lock as well. Fix the problem by disabling
> interrupts when calling percpu counter functions.
>
> Signed-off-by: Jan Kara
>
> diff -u b/lib/flex_proportions.c b/lib/flex_proportions.c
> --- b/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -62,13 +62,18 @@
>   */
>  bool fprop_new_period(struct fprop_global *p, int periods)
>  {
> -	u64 events = percpu_counter_sum(&p->events);
> +	u64 events;
> +	unsigned long flags;
>  
> +	local_irq_save(flags);
> +	events = percpu_counter_sum(&p->events);
>  	/*
>  	 * Don't do anything if there are no events.
>  	 */
> -	if (events <= 1)
> +	if (events <= 1) {
> +		local_irq_restore(flags);
>  		return false;
> +	}
>  	write_seqcount_begin(&p->sequence);
>  	if (periods < 64)
>  		events -= events >> periods;
> @@ -76,6 +81,7 @@
>  	percpu_counter_add(&p->events, -events);
>  	p->period += periods;
>  	write_seqcount_end(&p->sequence);
> +	local_irq_restore(flags);
>  
>  	return true;
>  }

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 549B86B004D for ; Fri, 1 Jun 2012 06:14:08 -0400 (EDT)
Message-ID: <1338545638.28384.137.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
Date: Fri, 01 Jun 2012 12:13:58 +0200
In-Reply-To: <20120531224206.GC19050@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz> <1338503165.28384.134.camel@twins> <20120531224206.GC19050@quack.suse.cz>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML

On Fri, 2012-06-01 at 00:42 +0200, Jan Kara wrote:
> On Fri 01-06-12 00:26:05, Peter Zijlstra wrote:
> > On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
> > >  bool fprop_new_period(struct fprop_global *p, int periods)
> > >  {
> > > -	u64 events = percpu_counter_sum(&p->events);
> > > +	u64 events;
> > > +	unsigned long flags;
> > >  
> > > +	local_irq_save(flags);
> > > +	events = percpu_counter_sum(&p->events);
> > > +	local_irq_restore(flags);
> > >  	/*
> > >  	 * Don't do anything if there are no events.
> > >  	 */
> > > @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
> > >  	if (periods < 64)
> > >  		events -= events >> periods;
> > >  	/* Use addition to avoid losing events happening between sum and set */
> > > +	local_irq_save(flags);
> > >  	percpu_counter_add(&p->events, -events);
> > > +	local_irq_restore(flags);
> > >  	p->period += periods;
> > >  	write_seqcount_end(&p->sequence);
> >
> > Uhm, why bother enabling it in between? Just wrap the whole function in
> > a single IRQ disable.
> I wanted to have interrupts disabled for as short as possible but if you
> think it doesn't matter, I'll take your advice. The result is attached.

Thing is, disabling interrupts is quite expensive and the extra few
instructions covered isn't much.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933837Ab2EXQ7U (ORCPT ); Thu, 24 May 2012 12:59:20 -0400 Received: from cantor2.suse.de ([195.135.220.15]:33340 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757337Ab2EXQ7S (ORCPT ); Thu, 24 May 2012 12:59:18 -0400 From: Jan Kara To: Wu Fengguang Cc: Peter Zijlstra , , LKML Subject: [PATCH 0/2 v4] Flexible proportions Date: Thu, 24 May 2012 18:59:09 +0200 Message-Id: <1337878751-22942-1-git-send-email-jack@suse.cz> X-Mailer: git-send-email 1.7.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, here is the next iteration of my flexible proportions code. I've addressed all Peter's comments. Changes since v3: * changed fprop_fraction_foo() to avoid using percpu_counter_sum() * changed __fprop_inc_percpu_max() to avoid 64-bit division (now maximum allowed fraction is expressed max_frac/FPROP_FRAC_BASE) * avoid drifting of period timer * handle better cases where period timer fires long after intended time by aging by really passed number of periods Changes since v2: * use timer instead of workqueue for triggering period switch * arm timer only if aging didn't zero out all fractions, re-arm timer when new event arrives again * set period length to 3s Some introduction for first time readers: The idea of this patch set is to provide code for computing event proportions where aging period is not dependent on the number of events happening (so that aging works well both with fast storage and slow USB sticks in the same system). 
The basic idea is that we compute proportions as:

  p_j = (\Sum_{i>=0} x_{i,j}/2^{i+1}) / (\Sum_{i>=0} x_i/2^{i+1})

where x_{i,j} is j's number of events in the i-th last time period and x_i is
the total number of events in the i-th last time period.

Note that when the x_i's are all the same (as is the case with the current
proportion code), this expression simplifies to the expression defining
current proportions, which is:

  p_j = (\Sum_{i>=0} x_{i,j}/2^{i+1}) / t

where t is the length of the aging period. In fact, if we are in the middle
of the period, the proportion computed by the current code is:

  p_j = (x_{0,j} + \Sum_{i>=1} x_{i,j}/2^{i+1}) / (t' + t)

where t' is the total number of events in the running period and t is the
length of the aging period. So there is even more similarity.

As with the current proportion code, it is simple to update a proportion
after several periods have elapsed. For each proportion we store the
numerator of our fraction and the number of the period when the proportion
was last updated. In the global proportion structure we compute the
denominator of the fraction, which is the same for all event types. So
catching up with missed periods boils down to shifting the numerator by the
number of missed periods and that's it. For more details, please see the
code.

I've also run a few tests (I've created a userspace wrapper to allow me to
run proportion code in userspace and arbitrarily generate events for it) to
compare the behavior of old and new code. You can see them at
http://beta.suse.com/private/jack/flex_proportions/

In all the tests new code showed faster convergence to current event
proportions (I tried to realistically set period_shift for fixed
proportions). Also in the last test we see that if period_shift is
decreased, then current proportions become more sensitive to short term
fluctuations in event rate so just decreasing period_shift isn't a good
solution to slower convergence.
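The formula for p_j and the catch-up rule described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the patch code: the function names are hypothetical, the per-period counts are passed in as arrays with index 0 being the most recent period, and floating point is used for readability, whereas kernel code avoids floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * p_j = (sum_i xj[i] / 2^(i+1)) / (sum_i x[i] / 2^(i+1)), where xj[i] is
 * j's event count and x[i] the total event count in the i-th last period.
 * Returns 0 when no events were seen at all.
 */
static double fprop_sketch_fraction(const unsigned *xj, const unsigned *x, size_t n)
{
	double num = 0.0, den = 0.0, w = 0.5;	/* w = 1 / 2^(i+1) */
	size_t i;

	for (i = 0; i < n; i++) {
		assert(xj[i] <= x[i]);	/* j cannot have more events than the total */
		num += xj[i] * w;
		den += x[i] * w;
		w *= 0.5;
	}
	return den > 0.0 ? num / den : 0.0;
}

/* Catch up a stored numerator after `missed` periods have elapsed. */
static uint64_t fprop_sketch_catch_up(uint64_t numerator, unsigned missed)
{
	/* Shifting a 64-bit value by 64 or more is undefined in C, so clamp. */
	return missed < 64 ? numerator >> missed : 0;
}
```

For example, an event type that produced all 4 events of the latest period but none of the 8 events in the period before it gets p_j = 2 / (2 + 2) = 0.5, and a stored numerator of 64 that missed 3 periods becomes 8.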
If anyone has another idea of what to try, I can do that - it should be simple enough to implement in my testing tool. Honza From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933914Ab2EXQ7d (ORCPT ); Thu, 24 May 2012 12:59:33 -0400 Received: from cantor2.suse.de ([195.135.220.15]:33341 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756239Ab2EXQ7S (ORCPT ); Thu, 24 May 2012 12:59:18 -0400 From: Jan Kara To: Wu Fengguang Cc: Peter Zijlstra , , LKML , Jan Kara Subject: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions Date: Thu, 24 May 2012 18:59:11 +0200 Message-Id: <1337878751-22942-3-git-send-email-jack@suse.cz> X-Mailer: git-send-email 1.7.1 In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Convert calculations of the proportion of writeback each bdi does to the new flexible proportion code. That allows us to use an aging period of fixed wallclock time which gives better proportion estimates given the hugely varying throughput of different devices.
Signed-off-by: Jan Kara --- include/linux/backing-dev.h | 4 +- mm/backing-dev.c | 6 +- mm/page-writeback.c | 103 ++++++++++++++++++++++++++---------------- 3 files changed, 69 insertions(+), 44 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index b1038bd..489de62 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -10,7 +10,7 @@ #include #include -#include +#include #include #include #include @@ -89,7 +89,7 @@ struct backing_dev_info { unsigned long dirty_ratelimit; unsigned long balanced_dirty_ratelimit; - struct prop_local_percpu completions; + struct fprop_local_percpu completions; int dirty_exceeded; unsigned int min_ratio; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index dd8e2aa..3387aea 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -677,7 +677,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->min_ratio = 0; bdi->max_ratio = 100; - bdi->max_prop_frac = PROP_FRAC_BASE; + bdi->max_prop_frac = FPROP_FRAC_BASE; spin_lock_init(&bdi->wb_lock); INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->work_list); @@ -700,7 +700,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->write_bandwidth = INIT_BW; bdi->avg_write_bandwidth = INIT_BW; - err = prop_local_init_percpu(&bdi->completions); + err = fprop_local_init_percpu(&bdi->completions); if (err) { err: @@ -744,7 +744,7 @@ void bdi_destroy(struct backing_dev_info *bdi) for (i = 0; i < NR_BDI_STAT_ITEMS; i++) percpu_counter_destroy(&bdi->bdi_stat[i]); - prop_local_destroy_percpu(&bdi->completions); + fprop_local_destroy_percpu(&bdi->completions); } EXPORT_SYMBOL(bdi_destroy); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 26adea8..647daa3 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -34,6 +34,7 @@ #include #include /* __set_page_dirty_buffers */ #include +#include #include /* @@ -135,7 +136,20 @@ unsigned long global_dirty_limit; * measured in page writeback completions. 
* */ -static struct prop_descriptor vm_completions; +static struct fprop_global writeout_completions; + +static void writeout_period(unsigned long t); +/* Timer for aging of writeout_completions */ +static struct timer_list writeout_period_timer = + TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0); +static unsigned long writeout_period_time = 0; + +/* + * Length of period for aging writeout fractions of bdis. This is an + * arbitrarily chosen number. The longer the period, the slower fractions will + * reflect changes in current writeout rate. + */ +#define VM_COMPLETIONS_PERIOD_LEN (3*HZ) /* * Work out the current dirty-memory clamping and background writeout @@ -322,34 +336,6 @@ bool zone_dirty_ok(struct zone *zone) zone_page_state(zone, NR_WRITEBACK) <= limit; } -/* - * couple the period to the dirty_ratio: - * - * period/2 ~ roundup_pow_of_two(dirty limit) - */ -static int calc_period_shift(void) -{ - unsigned long dirty_total; - - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; - else - dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) / - 100; - return 2 + ilog2(dirty_total - 1); -} - -/* - * update the period when the dirty threshold changes. 
- */ -static void update_completion_period(void) -{ - int shift = calc_period_shift(); - prop_change_shift(&vm_completions, shift); - - writeback_set_ratelimit(); -} - int dirty_background_ratio_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -383,7 +369,7 @@ int dirty_ratio_handler(struct ctl_table *table, int write, ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (ret == 0 && write && vm_dirty_ratio != old_ratio) { - update_completion_period(); + writeback_set_ratelimit(); vm_dirty_bytes = 0; } return ret; @@ -398,12 +384,21 @@ int dirty_bytes_handler(struct ctl_table *table, int write, ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos); if (ret == 0 && write && vm_dirty_bytes != old_bytes) { - update_completion_period(); + writeback_set_ratelimit(); vm_dirty_ratio = 0; } return ret; } +static unsigned long wp_next_time(unsigned long cur_time) +{ + cur_time += VM_COMPLETIONS_PERIOD_LEN; + /* 0 has a special meaning... */ + if (!cur_time) + return 1; + return cur_time; +} + /* * Increment the BDI's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). @@ -411,8 +406,19 @@ int dirty_bytes_handler(struct ctl_table *table, int write, static inline void __bdi_writeout_inc(struct backing_dev_info *bdi) { __inc_bdi_stat(bdi, BDI_WRITTEN); - __prop_inc_percpu_max(&vm_completions, &bdi->completions, - bdi->max_prop_frac); + __fprop_inc_percpu_max(&writeout_completions, &bdi->completions, + bdi->max_prop_frac); + /* First event after period switching was turned off? */ + if (unlikely(!writeout_period_time)) { + /* + * We can race with other __bdi_writeout_inc calls here but + * it does not cause any harm since the resulting time when + * timer will fire and what is in writeout_period_time will be + * roughly the same.
+ */ + writeout_period_time = wp_next_time(jiffies); + mod_timer(&writeout_period_timer, writeout_period_time); + } } void bdi_writeout_inc(struct backing_dev_info *bdi) @@ -431,11 +437,33 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc); static void bdi_writeout_fraction(struct backing_dev_info *bdi, long *numerator, long *denominator) { - prop_fraction_percpu(&vm_completions, &bdi->completions, + fprop_fraction_percpu(&writeout_completions, &bdi->completions, numerator, denominator); } /* + * On an idle system, we can be called long after we were scheduled because + * we use deferred timers, so account for missed periods. + */ +static void writeout_period(unsigned long t) +{ + int miss_periods = (jiffies - writeout_period_time) / + VM_COMPLETIONS_PERIOD_LEN; + + if (fprop_new_period(&writeout_completions, miss_periods + 1)) { + writeout_period_time = wp_next_time(writeout_period_time + + miss_periods * VM_COMPLETIONS_PERIOD_LEN); + mod_timer(&writeout_period_timer, writeout_period_time); + } else { + /* + * Aging has zeroed all fractions. Stop wasting CPU on period + * updates. + */ + writeout_period_time = 0; + } +} + +/* * bdi_min_ratio keeps the sum of the minimum dirty shares of all * registered backing devices, which, for obvious reasons, can not * exceed 100%.
@@ -475,7 +503,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio) ret = -EINVAL; } else { bdi->max_ratio = max_ratio; - bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100; + bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100; } spin_unlock_bh(&bdi_lock); @@ -1605,13 +1633,10 @@ static struct notifier_block __cpuinitdata ratelimit_nb = { */ void __init page_writeback_init(void) { - int shift; - writeback_set_ratelimit(); register_cpu_notifier(&ratelimit_nb); - shift = calc_period_shift(); - prop_descriptor_init(&vm_completions, shift); + fprop_global_init(&writeout_completions); } /** -- 1.7.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933875Ab2EXQ7b (ORCPT ); Thu, 24 May 2012 12:59:31 -0400 Received: from cantor2.suse.de ([195.135.220.15]:33342 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757809Ab2EXQ7S (ORCPT ); Thu, 24 May 2012 12:59:18 -0400 From: Jan Kara To: Wu Fengguang Cc: Peter Zijlstra , , LKML , Jan Kara Subject: [PATCH 1/2] lib: Proportions with flexible period Date: Thu, 24 May 2012 18:59:10 +0200 Message-Id: <1337878751-22942-2-git-send-email-jack@suse.cz> X-Mailer: git-send-email 1.7.1 In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Implement code computing proportions of events of different type (like code in lib/proportions.c) but allowing periods to have different lengths. 
This allows us to have aging periods of fixed wallclock time which gives better proportion estimates given the hugely varying throughput of different devices - previous measuring of aging period by number of events has the problem that a reasonable period length for a system with a low-end USB stick is not a reasonable period length for a system with a high-end storage array, resulting either in too slow proportion updates or too fluctuating proportion updates. Signed-off-by: Jan Kara --- include/linux/flex_proportions.h | 100 ++++++++++++++ lib/Makefile | 2 +- lib/flex_proportions.c | 267 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 368 insertions(+), 1 deletions(-) create mode 100644 include/linux/flex_proportions.h create mode 100644 lib/flex_proportions.c diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h new file mode 100644 index 0000000..9ac291c --- /dev/null +++ b/include/linux/flex_proportions.h @@ -0,0 +1,100 @@ +/* + * Floating proportions with flexible aging period + * + * Copyright (C) 2011, SUSE, Jan Kara + */ + +#ifndef _LINUX_FLEX_PROPORTIONS_H +#define _LINUX_FLEX_PROPORTIONS_H + +#include +#include +#include + +/* + * When maximum proportion of some event type is specified, this is the + * precision with which we allow limiting. Note that this creates an upper + * bound on the number of events per period like + * ULLONG_MAX >> FPROP_FRAC_SHIFT.
+ */ +#define FPROP_FRAC_SHIFT 10 +#define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT) + +/* + * ---- Global proportion definitions ---- + */ +struct fprop_global { + /* Number of events in the current period */ + struct percpu_counter events; + /* Current period */ + unsigned int period; + /* Synchronization with period transitions */ + seqcount_t sequence; +}; + +int fprop_global_init(struct fprop_global *p); +void fprop_global_destroy(struct fprop_global *p); +bool fprop_new_period(struct fprop_global *p, int periods); + +/* + * ---- SINGLE ---- + */ +struct fprop_local_single { + /* the local events counter */ + unsigned long events; + /* Period in which we last updated events */ + unsigned int period; + raw_spinlock_t lock; /* Protect period and numerator */ +}; + +#define INIT_FPROP_LOCAL_SINGLE(name) \ +{ .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ +} + +int fprop_local_init_single(struct fprop_local_single *pl); +void __fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl); +void fprop_fraction_single(struct fprop_global *p, + struct fprop_local_single *pl, unsigned long *numerator, + unsigned long *denominator); + +static inline +void fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl) +{ + unsigned long flags; + + local_irq_save(flags); + __fprop_inc_single(p, pl); + local_irq_restore(flags); +} + +/* + * ---- PERCPU ---- + */ +struct fprop_local_percpu { + /* the local events counter */ + struct percpu_counter events; + /* Period in which we last updated events */ + unsigned int period; + raw_spinlock_t lock; /* Protect period and numerator */ +}; + +int fprop_local_init_percpu(struct fprop_local_percpu *pl); +void fprop_local_destroy_percpu(struct fprop_local_percpu *pl); +void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl); +void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, + int max_frac); +void fprop_fraction_percpu(struct fprop_global *p, + struct 
fprop_local_percpu *pl, unsigned long *numerator, + unsigned long *denominator); + +static inline +void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) +{ + unsigned long flags; + + local_irq_save(flags); + __fprop_inc_percpu(p, pl); + local_irq_restore(flags); +} + +#endif diff --git a/lib/Makefile b/lib/Makefile index 74290c9..8dd81cf 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -11,7 +11,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o dump_stack.o timerqueue.o\ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ - proportions.o prio_heap.o ratelimit.o show_mem.o \ + proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o lib-$(CONFIG_MMU) += ioremap.o diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c new file mode 100644 index 0000000..530dbc2 --- /dev/null +++ b/lib/flex_proportions.c @@ -0,0 +1,267 @@ +/* + * Floating proportions with flexible aging period + * + * Copyright (C) 2011, SUSE, Jan Kara + * + * The goal of this code is: Given different types of event, measure proportion + * of each type of event over time. The proportions are measured with + * exponentially decaying history to give smooth transitions. A formula + * expressing proportion of event of type 'j' is: + * + * p_{j} = (\Sum_{i>=0} x_{i,j}/2^{i+1})/(\Sum_{i>=0} x_i/2^{i+1}) + * + * Where x_{i,j} is j's number of events in i-th last time period and x_i is + * total number of events in i-th last time period. + * + * Note that p_{j}'s are normalised, i.e. + * + * \Sum_{j} p_{j} = 1, + * + * This formula can be straightforwardly computed by maintaining denominator + * (let's call it 'd') and for each event type its numerator (let's call it + * 'n_j').
When an event of type 'j' happens, we simply need to do: + * n_j++; d++; + * + * When a new period is declared, we could do: + * d /= 2 + * for each j + * n_j /= 2 + * + * To avoid iteration over all event types, we instead shift numerator of event + * j lazily when someone asks for a proportion of event j or when event j + * occurs. This can be trivially implemented by remembering last period in + * which something happened with proportion of type j. + */ +#include + +int fprop_global_init(struct fprop_global *p) +{ + int err; + + p->period = 0; + /* Use 1 to avoid dealing with periods with 0 events... */ + err = percpu_counter_init(&p->events, 1); + if (err) + return err; + seqcount_init(&p->sequence); + return 0; +} + +void fprop_global_destroy(struct fprop_global *p) +{ + percpu_counter_destroy(&p->events); +} + +/* + * Declare @periods new periods. It is up to the caller to make sure period + * transitions cannot happen in parallel. + * + * The function returns true if the proportions are still defined and false + * if aging zeroed out all events. This can be used to detect whether declaring + * further periods has any effect. + */ +bool fprop_new_period(struct fprop_global *p, int periods) +{ + u64 events = percpu_counter_sum(&p->events); + + /* + * Don't do anything if there are no events.
+ */ + if (events <= 1) + return false; + write_seqcount_begin(&p->sequence); + if (periods < 64) + events -= events >> periods; + /* Use addition to avoid losing events happening between sum and set */ + percpu_counter_add(&p->events, -events); + p->period += periods; + write_seqcount_end(&p->sequence); + + return true; +} + +/* + * ---- SINGLE ---- + */ + +int fprop_local_init_single(struct fprop_local_single *pl) +{ + pl->events = 0; + pl->period = 0; + raw_spin_lock_init(&pl->lock); + return 0; +} + +void fprop_local_destroy_single(struct fprop_local_single *pl) +{ +} + +static void fprop_reflect_period_single(struct fprop_global *p, + struct fprop_local_single *pl) +{ + unsigned int period = p->period; + unsigned long flags; + + /* Fast path - period didn't change */ + if (pl->period == period) + return; + raw_spin_lock_irqsave(&pl->lock, flags); + /* Someone updated pl->period while we were spinning? */ + if (pl->period >= period) { + raw_spin_unlock_irqrestore(&pl->lock, flags); + return; + } + /* Aging zeroed our fraction? 
*/ + if (period - pl->period < BITS_PER_LONG) + pl->events >>= period - pl->period; + else + pl->events = 0; + pl->period = period; + raw_spin_unlock_irqrestore(&pl->lock, flags); +} + +/* Event of type pl happened */ +void __fprop_inc_single(struct fprop_global *p, struct fprop_local_single *pl) +{ + fprop_reflect_period_single(p, pl); + pl->events++; + percpu_counter_add(&p->events, 1); +} + +/* Return fraction of events of type pl */ +void fprop_fraction_single(struct fprop_global *p, + struct fprop_local_single *pl, + unsigned long *numerator, unsigned long *denominator) +{ + unsigned int seq; + s64 num, den; + + do { + seq = read_seqcount_begin(&p->sequence); + fprop_reflect_period_single(p, pl); + num = pl->events; + den = percpu_counter_read_positive(&p->events); + } while (read_seqcount_retry(&p->sequence, seq)); + + /* + * Make fraction <= 1 and denominator > 0 even in presence of percpu + * counter errors + */ + if (den <= num) { + if (num) + den = num; + else + den = 1; + } + *denominator = den; + *numerator = num; +} + +/* + * ---- PERCPU ---- + */ +#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids))) + +int fprop_local_init_percpu(struct fprop_local_percpu *pl) +{ + int err; + + err = percpu_counter_init(&pl->events, 0); + if (err) + return err; + pl->period = 0; + raw_spin_lock_init(&pl->lock); + return 0; +} + +void fprop_local_destroy_percpu(struct fprop_local_percpu *pl) +{ + percpu_counter_destroy(&pl->events); +} + +static void fprop_reflect_period_percpu(struct fprop_global *p, + struct fprop_local_percpu *pl) +{ + unsigned int period = p->period; + unsigned long flags; + + /* Fast path - period didn't change */ + if (pl->period == period) + return; + raw_spin_lock_irqsave(&pl->lock, flags); + /* Someone updated pl->period while we were spinning? */ + if (pl->period >= period) { + raw_spin_unlock_irqrestore(&pl->lock, flags); + return; + } + /* Aging zeroed our fraction? 
*/ + if (period - pl->period < BITS_PER_LONG) { + s64 val = percpu_counter_read(&pl->events); + + if (val < (nr_cpu_ids * PROP_BATCH)) + val = percpu_counter_sum(&pl->events); + + __percpu_counter_add(&pl->events, + -val + (val >> (period-pl->period)), PROP_BATCH); + } else + percpu_counter_set(&pl->events, 0); + pl->period = period; + raw_spin_unlock_irqrestore(&pl->lock, flags); +} + +/* Event of type pl happened */ +void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) +{ + fprop_reflect_period_percpu(p, pl); + __percpu_counter_add(&pl->events, 1, PROP_BATCH); + percpu_counter_add(&p->events, 1); +} + +void fprop_fraction_percpu(struct fprop_global *p, + struct fprop_local_percpu *pl, + unsigned long *numerator, unsigned long *denominator) +{ + unsigned int seq; + s64 num, den; + + do { + seq = read_seqcount_begin(&p->sequence); + fprop_reflect_period_percpu(p, pl); + num = percpu_counter_read_positive(&pl->events); + den = percpu_counter_read_positive(&p->events); + } while (read_seqcount_retry(&p->sequence, seq)); + + /* + * Make fraction <= 1 and denominator > 0 even in presence of percpu + * counter errors + */ + if (den <= num) { + if (num) + den = num; + else + den = 1; + } + *denominator = den; + *numerator = num; +} + +/* + * Like __fprop_inc_percpu() except that event is counted only if the given + * type has fraction smaller than @max_frac/FPROP_FRAC_BASE + */ +void __fprop_inc_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac) +{ + if (unlikely(max_frac < FPROP_FRAC_BASE)) { + unsigned long numerator, denominator; + + fprop_fraction_percpu(p, pl, &numerator, &denominator); + if (numerator > + (((u64)denominator) * max_frac) >> FPROP_FRAC_SHIFT) + return; + } else + fprop_reflect_period_percpu(p, pl); + __percpu_counter_add(&pl->events, 1, PROP_BATCH); + percpu_counter_add(&p->events, 1); +} + -- 1.7.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1754775Ab2EYJMu (ORCPT ); Fri, 25 May 2012 05:12:50 -0400 Received: from merlin.infradead.org ([205.233.59.134]:48271 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752865Ab2EYJMr (ORCPT ); Fri, 25 May 2012 05:12:47 -0400 Subject: Re: [PATCH 0/2 v4] Flexible proportions From: Peter Zijlstra To: Jan Kara Cc: Wu Fengguang , linux-mm@kvack.org, LKML In-Reply-To: <1337878751-22942-1-git-send-email-jack@suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> Content-Type: text/plain; charset="UTF-8" Date: Fri, 25 May 2012 11:12:42 +0200 Message-ID: <1337937162.9783.163.camel@laptop> Mime-Version: 1.0 X-Mailer: Evolution 2.32.2 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote: > here is the next iteration of my flexible proportions code. I've addressed > all Peter's comments. Thanks, all I could come up with is comment placement nits and I'll not go there ;-) Acked-by: Peter Zijlstra From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756943Ab2EYJam (ORCPT ); Fri, 25 May 2012 05:30:42 -0400 Received: from mga11.intel.com ([192.55.52.93]:13985 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756665Ab2EYJai (ORCPT ); Fri, 25 May 2012 05:30:38 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="171238698" Date: Fri, 25 May 2012 17:29:36 +0800 From: Fengguang Wu To: Peter Zijlstra Cc: Jan Kara , linux-mm@kvack.org, LKML Subject: Re: [PATCH 0/2 v4] Flexible proportions Message-ID: <20120525092936.GA12729@localhost> References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337937162.9783.163.camel@laptop> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<1337937162.9783.163.camel@laptop> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 25, 2012 at 11:12:42AM +0200, Peter Zijlstra wrote: > On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote: > > here is the next iteration of my flexible proportions code. I've addressed > > all Peter's comments. > > Thanks, all I could come up with is comment placement nits and I'll not > go there ;-) > > Acked-by: Peter Zijlstra Thank you both for making it work! I've applied them to the writeback tree. Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754533Ab2E1Psu (ORCPT ); Mon, 28 May 2012 11:48:50 -0400 Received: from mail-bk0-f46.google.com ([209.85.214.46]:34213 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753501Ab2E1Pss (ORCPT ); Mon, 28 May 2012 11:48:48 -0400 Message-ID: <1338220185.4284.19.camel@lappy> Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions From: Sasha Levin To: Jan Kara Cc: Wu Fengguang , Peter Zijlstra , linux-mm@kvack.org, LKML Date: Mon, 28 May 2012 17:49:45 +0200 In-Reply-To: <1337878751-22942-3-git-send-email-jack@suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> Content-Type: text/plain; charset="us-ascii" X-Mailer: Evolution 3.2.3 Content-Transfer-Encoding: 7bit Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Jan, On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote: > Convert calculations of proportion of writeback each bdi does to new flexible > proportion code. That allows us to use aging period of fixed wallclock time > which gives better proportion estimates given the hugely varying throughput of > different devices. 
> > Signed-off-by: Jan Kara > --- This patch appears to be causing lockdep warnings over here: [ 20.545016] ================================= [ 20.545016] [ INFO: inconsistent lock state ] [ 20.545016] 3.4.0-next-20120528-sasha-00008-g11ef39f #307 Tainted: G W [ 20.545016] --------------------------------- [ 20.545016] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 20.545016] rcu_torture_rea/2493 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 20.545016] (key#3){?.-...}, at: [] __percpu_counter_sum+0x17/0xc0 [ 20.545016] {IN-HARDIRQ-W} state was registered at: [ 20.545016] [] mark_irqflags+0x6b/0x170 [ 20.545016] [] __lock_acquire+0x2bb/0x4c0 [ 20.545016] [] lock_acquire+0x18a/0x1e0 [ 20.545016] [] _raw_spin_lock+0x3b/0x70 [ 20.545016] [] __percpu_counter_add+0x50/0xb0 [ 20.545016] [] __fprop_inc_percpu_max+0x8a/0xa0 [ 20.545016] [] test_clear_page_writeback+0x12d/0x1c0 [ 20.545016] [] end_page_writeback+0x24/0x50 [ 20.545016] [] end_buffer_async_write+0x26a/0x350 [ 20.545016] [] end_bio_bh_io_sync+0x3d/0x50 [ 20.545016] [] bio_endio+0x29/0x30 [ 20.545016] [] req_bio_endio+0xb9/0xd0 [ 20.545016] [] blk_update_request+0x1a8/0x3c0 [ 20.545016] [] blk_update_bidi_request+0x22/0x90 [ 20.545016] [] __blk_end_bidi_request+0x1c/0x40 [ 20.545016] [] __blk_end_request_all+0x28/0x40 [ 20.545016] [] blk_done+0x9e/0xf0 [ 20.545016] [] vring_interrupt+0x86/0xa0 [ 20.680186] [] handle_irq_event_percpu+0x151/0x3e0 [ 20.680186] [] handle_irq_event+0x43/0x70 [ 20.680186] [] handle_edge_irq+0xe8/0x120 [ 20.680186] [] handle_irq+0x164/0x180 [ 20.680186] [] do_IRQ+0x58/0xd0 [ 20.680186] [] ret_from_intr+0x0/0x1a [ 20.680186] [] blk_queue_bio+0x30d/0x430 [ 20.680186] [] generic_make_request+0xbe/0x120 [ 20.680186] [] submit_bio+0xf8/0x120 [ 20.680186] [] submit_bh+0x122/0x150 [ 20.680186] [] __block_write_full_page+0x287/0x3b0 [ 20.680186] [] block_write_full_page_endio+0xfc/0x120 [ 20.680186] [] block_write_full_page+0x10/0x20 [ 20.680186] [] blkdev_writepage+0x13/0x20 [ 20.680186] [] 
__writepage+0x15/0x40 [ 20.680186] [] write_cache_pages+0x49f/0x650 [ 20.680186] [] generic_writepages+0x4f/0x70 [ 20.680186] [] do_writepages+0x1e/0x50 [ 20.680186] [] __filemap_fdatawrite_range+0x49/0x50 [ 20.680186] [] filemap_fdatawrite+0x1a/0x20 [ 20.680186] [] filemap_write_and_wait+0x25/0x50 [ 20.680186] [] __sync_blockdev+0x2d/0x40 [ 20.680186] [] sync_blockdev+0xe/0x10 [ 20.680186] [] journal_recover+0x182/0x1c0 [ 20.680186] [] journal_load+0x58/0xa0 [ 20.680186] [] ext3_load_journal+0x200/0x2b0 [ 20.680186] [] ext3_fill_super+0xc18/0x10d0 [ 20.680186] [] mount_bdev+0x176/0x210 [ 20.680186] [] ext3_mount+0x10/0x20 [ 20.680186] [] mount_fs+0x85/0x1a0 [ 20.680186] [] vfs_kern_mount+0x74/0x100 [ 20.680186] [] do_kern_mount+0x51/0x120 [ 20.680186] [] do_mount+0x1d4/0x240 [ 20.680186] [] sys_mount+0x9d/0xe0 [ 20.680186] [] do_mount_root+0x1e/0x94 [ 20.680186] [] mount_block_root+0xe2/0x224 [ 20.680186] [] mount_root+0x12b/0x136 [ 20.680186] [] prepare_namespace+0x165/0x19e [ 20.680186] [] kernel_init+0x274/0x28a [ 20.680186] [] kernel_thread_helper+0x4/0x10 [ 20.680186] irq event stamp: 1551906 [ 20.680186] hardirqs last enabled at (1551906): [] _raw_spin_unlock_irq+0x2b/0x80 [ 20.680186] hardirqs last disabled at (1551905): [] _raw_spin_lock_irq+0x34/0xa0 [ 20.680186] softirqs last enabled at (1551022): [] __do_softirq+0x3db/0x460 [ 20.680186] softirqs last disabled at (1551903): [] call_softirq+0x1c/0x30 [ 20.680186] [ 20.680186] other info that might help us debug this: [ 20.680186] Possible unsafe locking scenario: [ 20.680186] [ 20.680186] CPU0 [ 20.680186] ---- [ 20.680186] lock(key#3); [ 20.680186] [ 20.680186] lock(key#3); [ 20.680186] [ 20.680186] *** DEADLOCK *** [ 20.680186] [ 20.680186] 2 locks held by rcu_torture_rea/2493: [ 20.680186] #0: (rcu_read_lock){.+.+..}, at: [] rcu_torture_read_lock+0x0/0x80 [ 20.680186] #1: (mm/page-writeback.c:144){+.-...}, at: [] call_timer_fn+0x0/0x260 [ 20.680186] [ 20.680186] stack backtrace: [ 20.680186] Pid: 2493, 
comm: rcu_torture_rea Tainted: G W 3.4.0-next-20120528-sasha-00008-g11ef39f #307 [ 20.680186] Call Trace: [ 20.680186] [] print_usage_bug+0x1a9/0x1d0 [ 20.680186] [] ? check_usage_forwards+0xf0/0xf0 [ 20.680186] [] mark_lock_irq+0xc9/0x270 [ 20.680186] [] mark_lock+0x11d/0x200 [ 20.680186] [] mark_irqflags+0xf0/0x170 [ 20.680186] [] __lock_acquire+0x2bb/0x4c0 [ 20.680186] [] lock_acquire+0x18a/0x1e0 [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] ? laptop_io_completion+0x30/0x30 [ 20.680186] [] _raw_spin_lock+0x3b/0x70 [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] __percpu_counter_sum+0x17/0xc0 [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 [ 20.680186] [] fprop_new_period+0x12/0x60 [ 20.680186] [] writeout_period+0x3d/0xa0 [ 20.680186] [] call_timer_fn+0x12f/0x260 [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20 [ 20.680186] [] ? _raw_spin_unlock_irq+0x2b/0x80 [ 20.680186] [] ? laptop_io_completion+0x30/0x30 [ 20.680186] [] run_timer_softirq+0x29e/0x2f0 [ 20.680186] [] __do_softirq+0x221/0x460 [ 20.680186] [] ? kvm_clock_read+0x46/0x80 [ 20.680186] [] call_softirq+0x1c/0x30 [ 20.680186] [] do_softirq+0x75/0x120 [ 20.680186] [] irq_exit+0x5b/0xf0 [ 20.680186] [] smp_apic_timer_interrupt+0x8a/0xa0 [ 20.680186] [] apic_timer_interrupt+0x6f/0x80 [ 20.680186] [] ? lock_acquire+0x1be/0x1e0 [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 [ 20.680186] [] rcu_torture_read_lock+0x33/0x80 [ 20.680186] [] ? rcu_torture_reader+0x380/0x380 [ 20.680186] [] rcu_torture_reader+0x123/0x380 [ 20.680186] [] ? T.841+0x50/0x50 [ 20.680186] [] ? rcu_torture_read_unlock+0x60/0x60 [ 20.680186] [] kthread+0xb2/0xc0 [ 20.680186] [] kernel_thread_helper+0x4/0x10 [ 20.680186] [] ? retint_restore_args+0x13/0x13 [ 20.680186] [] ? __init_kthread_worker+0x70/0x70 [ 20.680186] [] ? 
gs_change+0x13/0x13 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751406Ab2E2MeU (ORCPT ); Tue, 29 May 2012 08:34:20 -0400 Received: from cantor2.suse.de ([195.135.220.15]:60626 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751078Ab2E2MeS (ORCPT ); Tue, 29 May 2012 08:34:18 -0400 Date: Tue, 29 May 2012 14:34:08 +0200 From: Jan Kara To: Sasha Levin Cc: Jan Kara , Wu Fengguang , Peter Zijlstra , linux-mm@kvack.org, LKML Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions Message-ID: <20120529123408.GA23991@quack.suse.cz> References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1338220185.4284.19.camel@lappy> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 28-05-12 17:49:45, Sasha Levin wrote: > Hi Jan, > > On Thu, 2012-05-24 at 18:59 +0200, Jan Kara wrote: > > Convert calculations of proportion of writeback each bdi does to new flexible > > proportion code. That allows us to use aging period of fixed wallclock time > > which gives better proportion estimates given the hugely varying throughput of > > different devices. > > > > Signed-off-by: Jan Kara > > --- > > This patch appears to be causing lockdep warnings over here: Actually, this is not caused directly by my patch. My patch just makes the problem more likely because I use a smaller counter batch in __fprop_inc_percpu_max() than is used in the original __prop_inc_percpu_max(), so the probability that the percpu counter takes the spinlock (which is what triggers the warning) is higher. The only safe solution seems to be to create a variant of percpu counters that can be used from an interrupt.
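To make the batch argument above concrete, here is a minimal single-threaded C model of a batched per-CPU counter. It is only a sketch under stated assumptions: the `model_*` names, the fixed count of 4 CPUs and the batch values used in the example (8 and 32) are hypothetical, and a plain field counts how often the shared lock would be taken instead of taking a real spinlock. It shows only that the slow path (the one that takes the lock seen in the trace) is reached roughly once per `batch` increments on each CPU, so a smaller batch reaches it more often.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

/* Hypothetical model of a batched per-CPU counter; not the kernel structure. */
struct model_pcpu_counter {
	long long global;		/* shared count, lock-protected in the kernel */
	long local[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
	unsigned long lock_takes;	/* times the shared lock would be taken */
};

/* Add `amount` on `cpu`; fold into the shared count once the batch is hit. */
static void model_counter_add(struct model_pcpu_counter *c, int cpu,
			      long amount, long batch)
{
	long count;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	count = c->local[cpu] + amount;
	if (count >= batch || count <= -batch) {
		/* Slow path: this is where the shared lock would be taken. */
		c->lock_takes++;
		c->global += count;
		c->local[cpu] = 0;
	} else {
		/* Fast path: only the per-CPU delta is touched. */
		c->local[cpu] = count;
	}
}

/* Spread `events` increments round-robin over the CPUs, count slow paths. */
unsigned long model_lock_takes(long batch, int events)
{
	struct model_pcpu_counter c = { 0 };
	int i;

	for (i = 0; i < events; i++)
		model_counter_add(&c, i % MODEL_NR_CPUS, 1, batch);
	return c.lock_takes;
}
```

With 1024 events spread over the 4 modelled CPUs, `model_lock_takes(8, 1024)` returns 128 while `model_lock_takes(32, 1024)` returns 32, so the smaller batch enters the locked path four times as often in this model.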
Or do you have another idea, Peter?

Honza

>
> [ 20.545016] =================================
> [ 20.545016] [ INFO: inconsistent lock state ]
> [ 20.545016] 3.4.0-next-20120528-sasha-00008-g11ef39f #307 Tainted: G W
> [ 20.545016] ---------------------------------
> [ 20.545016] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [ 20.545016] rcu_torture_rea/2493 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [ 20.545016] (key#3){?.-...}, at: [] __percpu_counter_sum+0x17/0xc0
> [ 20.545016] {IN-HARDIRQ-W} state was registered at:
> [ 20.545016] [] mark_irqflags+0x6b/0x170
> [ 20.545016] [] __lock_acquire+0x2bb/0x4c0
> [ 20.545016] [] lock_acquire+0x18a/0x1e0
> [ 20.545016] [] _raw_spin_lock+0x3b/0x70
> [ 20.545016] [] __percpu_counter_add+0x50/0xb0
> [ 20.545016] [] __fprop_inc_percpu_max+0x8a/0xa0
> [ 20.545016] [] test_clear_page_writeback+0x12d/0x1c0
> [ 20.545016] [] end_page_writeback+0x24/0x50
> [ 20.545016] [] end_buffer_async_write+0x26a/0x350
> [ 20.545016] [] end_bio_bh_io_sync+0x3d/0x50
> [ 20.545016] [] bio_endio+0x29/0x30
> [ 20.545016] [] req_bio_endio+0xb9/0xd0
> [ 20.545016] [] blk_update_request+0x1a8/0x3c0
> [ 20.545016] [] blk_update_bidi_request+0x22/0x90
> [ 20.545016] [] __blk_end_bidi_request+0x1c/0x40
> [ 20.545016] [] __blk_end_request_all+0x28/0x40
> [ 20.545016] [] blk_done+0x9e/0xf0
> [ 20.545016] [] vring_interrupt+0x86/0xa0
> [ 20.680186] [] handle_irq_event_percpu+0x151/0x3e0
> [ 20.680186] [] handle_irq_event+0x43/0x70
> [ 20.680186] [] handle_edge_irq+0xe8/0x120
> [ 20.680186] [] handle_irq+0x164/0x180
> [ 20.680186] [] do_IRQ+0x58/0xd0
> [ 20.680186] [] ret_from_intr+0x0/0x1a
> [ 20.680186] [] blk_queue_bio+0x30d/0x430
> [ 20.680186] [] generic_make_request+0xbe/0x120
> [ 20.680186] [] submit_bio+0xf8/0x120
> [ 20.680186] [] submit_bh+0x122/0x150
> [ 20.680186] [] __block_write_full_page+0x287/0x3b0
> [ 20.680186] [] block_write_full_page_endio+0xfc/0x120
> [ 20.680186] [] block_write_full_page+0x10/0x20
> [ 20.680186] [] blkdev_writepage+0x13/0x20
> [ 20.680186] [] __writepage+0x15/0x40
> [ 20.680186] [] write_cache_pages+0x49f/0x650
> [ 20.680186] [] generic_writepages+0x4f/0x70
> [ 20.680186] [] do_writepages+0x1e/0x50
> [ 20.680186] [] __filemap_fdatawrite_range+0x49/0x50
> [ 20.680186] [] filemap_fdatawrite+0x1a/0x20
> [ 20.680186] [] filemap_write_and_wait+0x25/0x50
> [ 20.680186] [] __sync_blockdev+0x2d/0x40
> [ 20.680186] [] sync_blockdev+0xe/0x10
> [ 20.680186] [] journal_recover+0x182/0x1c0
> [ 20.680186] [] journal_load+0x58/0xa0
> [ 20.680186] [] ext3_load_journal+0x200/0x2b0
> [ 20.680186] [] ext3_fill_super+0xc18/0x10d0
> [ 20.680186] [] mount_bdev+0x176/0x210
> [ 20.680186] [] ext3_mount+0x10/0x20
> [ 20.680186] [] mount_fs+0x85/0x1a0
> [ 20.680186] [] vfs_kern_mount+0x74/0x100
> [ 20.680186] [] do_kern_mount+0x51/0x120
> [ 20.680186] [] do_mount+0x1d4/0x240
> [ 20.680186] [] sys_mount+0x9d/0xe0
> [ 20.680186] [] do_mount_root+0x1e/0x94
> [ 20.680186] [] mount_block_root+0xe2/0x224
> [ 20.680186] [] mount_root+0x12b/0x136
> [ 20.680186] [] prepare_namespace+0x165/0x19e
> [ 20.680186] [] kernel_init+0x274/0x28a
> [ 20.680186] [] kernel_thread_helper+0x4/0x10
> [ 20.680186] irq event stamp: 1551906
> [ 20.680186] hardirqs last enabled at (1551906): [] _raw_spin_unlock_irq+0x2b/0x80
> [ 20.680186] hardirqs last disabled at (1551905): [] _raw_spin_lock_irq+0x34/0xa0
> [ 20.680186] softirqs last enabled at (1551022): [] __do_softirq+0x3db/0x460
> [ 20.680186] softirqs last disabled at (1551903): [] call_softirq+0x1c/0x30
> [ 20.680186]
> [ 20.680186] other info that might help us debug this:
> [ 20.680186] Possible unsafe locking scenario:
> [ 20.680186]
> [ 20.680186] CPU0
> [ 20.680186] ----
> [ 20.680186] lock(key#3);
> [ 20.680186] <Interrupt>
> [ 20.680186] lock(key#3);
> [ 20.680186]
> [ 20.680186] *** DEADLOCK ***
> [ 20.680186]
> [ 20.680186] 2 locks held by rcu_torture_rea/2493:
> [ 20.680186] #0: (rcu_read_lock){.+.+..}, at: [] rcu_torture_read_lock+0x0/0x80
> [ 20.680186] #1: (mm/page-writeback.c:144){+.-...}, at: [] call_timer_fn+0x0/0x260
> [ 20.680186]
> [ 20.680186] stack backtrace:
> [ 20.680186] Pid: 2493, comm: rcu_torture_rea Tainted: G W 3.4.0-next-20120528-sasha-00008-g11ef39f #307
> [ 20.680186] Call Trace:
> [ 20.680186] [] print_usage_bug+0x1a9/0x1d0
> [ 20.680186] [] ? check_usage_forwards+0xf0/0xf0
> [ 20.680186] [] mark_lock_irq+0xc9/0x270
> [ 20.680186] [] mark_lock+0x11d/0x200
> [ 20.680186] [] mark_irqflags+0xf0/0x170
> [ 20.680186] [] __lock_acquire+0x2bb/0x4c0
> [ 20.680186] [] lock_acquire+0x18a/0x1e0
> [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> [ 20.680186] [] ? laptop_io_completion+0x30/0x30
> [ 20.680186] [] _raw_spin_lock+0x3b/0x70
> [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> [ 20.680186] [] __percpu_counter_sum+0x17/0xc0
> [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> [ 20.680186] [] fprop_new_period+0x12/0x60
> [ 20.680186] [] writeout_period+0x3d/0xa0
> [ 20.680186] [] call_timer_fn+0x12f/0x260
> [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> [ 20.680186] [] ? _raw_spin_unlock_irq+0x2b/0x80
> [ 20.680186] [] ? laptop_io_completion+0x30/0x30
> [ 20.680186] [] run_timer_softirq+0x29e/0x2f0
> [ 20.680186] [] __do_softirq+0x221/0x460
> [ 20.680186] [] ? kvm_clock_read+0x46/0x80
> [ 20.680186] [] call_softirq+0x1c/0x30
> [ 20.680186] [] do_softirq+0x75/0x120
> [ 20.680186] [] irq_exit+0x5b/0xf0
> [ 20.680186] [] smp_apic_timer_interrupt+0x8a/0xa0
> [ 20.680186] [] apic_timer_interrupt+0x6f/0x80
> [ 20.680186] [] ? lock_acquire+0x1be/0x1e0
> [ 20.680186] [] ? rcu_torture_reader+0x380/0x380
> [ 20.680186] [] rcu_torture_read_lock+0x33/0x80
> [ 20.680186] [] ? rcu_torture_reader+0x380/0x380
> [ 20.680186] [] rcu_torture_reader+0x123/0x380
> [ 20.680186] [] ? T.841+0x50/0x50
> [ 20.680186] [] ? rcu_torture_read_unlock+0x60/0x60
> [ 20.680186] [] kthread+0xb2/0xc0
> [ 20.680186] [] kernel_thread_helper+0x4/0x10
> [ 20.680186] [] ? retint_restore_args+0x13/0x13
> [ 20.680186] [] ? __init_kthread_worker+0x70/0x70
> [ 20.680186] [] ? gs_change+0x13/0x13
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751293Ab2E2Min (ORCPT ); Tue, 29 May 2012 08:38:43 -0400
Received: from merlin.infradead.org ([205.233.59.134]:38103 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750715Ab2E2Mim convert rfc822-to-8bit (ORCPT ); Tue, 29 May 2012 08:38:42 -0400
Message-ID: <1338295111.26856.57.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML
Date: Tue, 29 May 2012 14:38:31 +0200
In-Reply-To: <20120529123408.GA23991@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz>
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailer: Evolution 3.2.2-
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2012-05-29 at 14:34 +0200, Jan Kara wrote:

> The only safe solution seems to be to create a variant of percpu counters
> that can be used from an interrupt. Or do you have another idea, Peter?

> > [ 20.680186] [] _raw_spin_lock+0x3b/0x70
> > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0
> > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> > [ 20.680186] [] fprop_new_period+0x12/0x60
> > [ 20.680186] [] writeout_period+0x3d/0xa0
> > [ 20.680186] [] call_timer_fn+0x12f/0x260
> > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20

Yeah, just make sure IRQs are disabled around doing that ;-)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751806Ab2E2MzA (ORCPT ); Tue, 29 May 2012 08:55:00 -0400
Received: from cantor2.suse.de ([195.135.220.15]:35247 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751022Ab2E2My7 (ORCPT ); Tue, 29 May 2012 08:54:59 -0400
Date: Tue, 29 May 2012 14:54:52 +0200
From: Jan Kara
To: Peter Zijlstra
Cc: Jan Kara, Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Message-ID: <20120529125452.GB23991@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1338295111.26856.57.camel@twins>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 29-05-12 14:38:31, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 14:34 +0200, Jan Kara wrote:
>
> > The only safe solution seems to be to create a variant of percpu counters
> > that can be used from an interrupt. Or do you have another idea, Peter?
>
> > > [ 20.680186] [] _raw_spin_lock+0x3b/0x70
> > > [ 20.680186] [] ? __percpu_counter_sum+0x17/0xc0
> > > [ 20.680186] [] __percpu_counter_sum+0x17/0xc0
> > > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
> > > [ 20.680186] [] fprop_new_period+0x12/0x60
> > > [ 20.680186] [] writeout_period+0x3d/0xa0
> > > [ 20.680186] [] call_timer_fn+0x12f/0x260
> > > [ 20.680186] [] ? init_timer_deferrable_key+0x20/0x20
>
> Yeah, just make sure IRQs are disabled around doing that ;-)

Evil ;) But we'd need to have IRQs disabled also in each
fprop_fraction_percpu() call, and generally, if we want things clean, we'd
need to disable them in all entry points to the proportion code (or at
least around all percpu calls)...

Honza

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758259Ab2EaW0R (ORCPT ); Thu, 31 May 2012 18:26:17 -0400
Received: from casper.infradead.org ([85.118.1.10]:42632 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757545Ab2EaW0Q convert rfc822-to-8bit (ORCPT ); Thu, 31 May 2012 18:26:16 -0400
Message-ID: <1338503165.28384.134.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML
Date: Fri, 01 Jun 2012 00:26:05 +0200
In-Reply-To: <20120531221146.GA19050@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz>
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailer: Evolution 3.2.2-
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
>  bool fprop_new_period(struct fprop_global *p, int periods)
>  {
> -	u64 events = percpu_counter_sum(&p->events);
> +	u64 events;
> +	unsigned long flags;
>
> +	local_irq_save(flags);
> +	events = percpu_counter_sum(&p->events);
> +	local_irq_restore(flags);
>  	/*
>  	 * Don't do anything if there are no events.
>  	 */
> @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
>  	if (periods < 64)
>  		events -= events >> periods;
>  	/* Use addition to avoid losing events happening between sum and set */
> +	local_irq_save(flags);
>  	percpu_counter_add(&p->events, -events);
> +	local_irq_restore(flags);
>  	p->period += periods;
>  	write_seqcount_end(&p->sequence);

Uhm, why bother enabling it in between? Just wrap the whole function in
a single IRQ disable.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758841Ab2FADKU (ORCPT ); Thu, 31 May 2012 23:10:20 -0400
Received: from mga03.intel.com ([143.182.124.21]:17195 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758174Ab2FADKS (ORCPT ); Thu, 31 May 2012 23:10:18 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="150227736"
Date: Fri, 1 Jun 2012 11:10:15 +0800
From: Fengguang Wu
To: Jan Kara
Cc: Peter Zijlstra, Sasha Levin, linux-mm@kvack.org, LKML
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
Message-ID: <20120601031015.GB7896@localhost>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz> <1338503165.28384.134.camel@twins> <20120531224206.GC19050@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=gb2312
Content-Disposition: inline
In-Reply-To: <20120531224206.GC19050@quack.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jun 01, 2012 at 12:42:06AM +0200, Jan Kara wrote:
> On Fri 01-06-12 00:26:05, Peter Zijlstra wrote:
> > On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
> > >  bool fprop_new_period(struct fprop_global *p, int periods)
> > >  {
> > > -	u64 events = percpu_counter_sum(&p->events);
> > > +	u64 events;
> > > +	unsigned long flags;
> > >
> > > +	local_irq_save(flags);
> > > +	events = percpu_counter_sum(&p->events);
> > > +	local_irq_restore(flags);
> > >  	/*
> > >  	 * Don't do anything if there are no events.
> > >  	 */
> > > @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
> > >  	if (periods < 64)
> > >  		events -= events >> periods;
> > >  	/* Use addition to avoid losing events happening between sum and set */
> > > +	local_irq_save(flags);
> > >  	percpu_counter_add(&p->events, -events);
> > > +	local_irq_restore(flags);
> > >  	p->period += periods;
> > >  	write_seqcount_end(&p->sequence);
> >
> > Uhm, why bother enabling it in between? Just wrap the whole function in
> > a single IRQ disable.
> I wanted to have interrupts disabled for as short as possible but if you
> think it doesn't matter, I'll take your advice. The result is attached.

Thank you! I applied this incremental fix next to the commit
"lib: Proportions with flexible period".

Thanks,
Fengguang

> From: Jan Kara
> Subject: lib: Fix possible deadlock in flexible proportion code
>
> When percpu counter function in fprop_new_period() is interrupted by an
> interrupt while holding counter lock, it can cause deadlock when the
> interrupt wants to take the lock as well. Fix the problem by disabling
> interrupts when calling percpu counter functions.
>
> Signed-off-by: Jan Kara
>
> diff -u b/lib/flex_proportions.c b/lib/flex_proportions.c
> --- b/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -62,13 +62,18 @@
>   */
>  bool fprop_new_period(struct fprop_global *p, int periods)
>  {
> -	u64 events = percpu_counter_sum(&p->events);
> +	u64 events;
> +	unsigned long flags;
>
> +	local_irq_save(flags);
> +	events = percpu_counter_sum(&p->events);
>  	/*
>  	 * Don't do anything if there are no events.
>  	 */
> -	if (events <= 1)
> +	if (events <= 1) {
> +		local_irq_restore(flags);
>  		return false;
> +	}
>  	write_seqcount_begin(&p->sequence);
>  	if (periods < 64)
>  		events -= events >> periods;
> @@ -76,6 +81,7 @@
>  	percpu_counter_add(&p->events, -events);
>  	p->period += periods;
>  	write_seqcount_end(&p->sequence);
> +	local_irq_restore(flags);
>
>  	return true;
>  }

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932946Ab2FAKOL (ORCPT ); Fri, 1 Jun 2012 06:14:11 -0400
Received: from merlin.infradead.org ([205.233.59.134]:35703 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759379Ab2FAKOJ convert rfc822-to-8bit (ORCPT ); Fri, 1 Jun 2012 06:14:09 -0400
Message-ID: <1338545638.28384.137.camel@twins>
Subject: Re: [PATCH 2/2] block: Convert BDI proportion calculations to flexible proportions
From: Peter Zijlstra
To: Jan Kara
Cc: Sasha Levin, Wu Fengguang, linux-mm@kvack.org, LKML
Date: Fri, 01 Jun 2012 12:13:58 +0200
In-Reply-To: <20120531224206.GC19050@quack.suse.cz>
References: <1337878751-22942-1-git-send-email-jack@suse.cz> <1337878751-22942-3-git-send-email-jack@suse.cz> <1338220185.4284.19.camel@lappy> <20120529123408.GA23991@quack.suse.cz> <1338295111.26856.57.camel@twins> <20120529125452.GB23991@quack.suse.cz> <20120531221146.GA19050@quack.suse.cz> <1338503165.28384.134.camel@twins> <20120531224206.GC19050@quack.suse.cz>
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailer: Evolution 3.2.2-
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2012-06-01 at 00:42 +0200, Jan Kara wrote:
> On Fri 01-06-12 00:26:05, Peter Zijlstra wrote:
> > On Fri, 2012-06-01 at 00:11 +0200, Jan Kara wrote:
> > >  bool fprop_new_period(struct fprop_global *p, int periods)
> > >  {
> > > -	u64 events = percpu_counter_sum(&p->events);
> > > +	u64 events;
> > > +	unsigned long flags;
> > >
> > > +	local_irq_save(flags);
> > > +	events = percpu_counter_sum(&p->events);
> > > +	local_irq_restore(flags);
> > >  	/*
> > >  	 * Don't do anything if there are no events.
> > >  	 */
> > > @@ -73,7 +77,9 @@ bool fprop_new_period(struct fprop_global *p, int periods)
> > >  	if (periods < 64)
> > >  		events -= events >> periods;
> > >  	/* Use addition to avoid losing events happening between sum and set */
> > > +	local_irq_save(flags);
> > >  	percpu_counter_add(&p->events, -events);
> > > +	local_irq_restore(flags);
> > >  	p->period += periods;
> > >  	write_seqcount_end(&p->sequence);
> >
> > Uhm, why bother enabling it in between? Just wrap the whole function in
> > a single IRQ disable.
> I wanted to have interrupts disabled for as short as possible but if you
> think it doesn't matter, I'll take your advice. The result is attached.

Thing is, disabling interrupts is quite expensive and the extra few
instructions covered isn't much.
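
For readers following the thread, here is a minimal userspace sketch (not the kernel code) contrasting the two shapes of fprop_new_period() discussed above: the first attempt with two short IRQ-off regions, and the final one with a single region around the whole update. Everything prefixed sim_ is a hypothetical stand-in introduced for illustration: sim_irq_save()/sim_irq_restore() only count calls in place of local_irq_save()/local_irq_restore(), a plain integer replaces the percpu counter, and the seqcount write section is left out. It is meant to show the control flow and the aging arithmetic, not the locking itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace stand-ins, for illustration only (all sim_* names are
 * hypothetical). In the kernel these would be local_irq_save(),
 * local_irq_restore() and a struct percpu_counter.
 */
static int sim_irq_toggles;	/* number of disable/enable pairs executed */

static void sim_irq_save(void)    { sim_irq_toggles++; }
static void sim_irq_restore(void) { }

struct sim_fprop_global {
	int64_t events;		/* stands in for the percpu counter */
	unsigned int period;	/* current period number */
};

/* First version from the thread: two short IRQ-off regions. */
static bool sim_new_period_two_regions(struct sim_fprop_global *p, int periods)
{
	uint64_t events;

	sim_irq_save();
	events = (uint64_t)p->events;
	sim_irq_restore();
	/* Don't do anything if there are no events. */
	if (events <= 1)
		return false;
	/* Keep only the part of the count that has to be subtracted. */
	if (periods < 64)
		events -= events >> periods;
	sim_irq_save();
	p->events -= (int64_t)events;
	sim_irq_restore();
	p->period += periods;
	return true;
}

/* Final version: one IRQ-off region around the whole update. */
static bool sim_new_period_one_region(struct sim_fprop_global *p, int periods)
{
	uint64_t events;
	bool aged = false;

	sim_irq_save();
	events = (uint64_t)p->events;
	if (events > 1) {
		if (periods < 64)
			events -= events >> periods;
		p->events -= (int64_t)events;
		p->period += periods;
		aged = true;
	}
	sim_irq_restore();
	return aged;
}
```

With events = 1000 and periods = 2, both variants leave 250 in the counter (1000 >> 2). The first pays for two disable/enable pairs and the second for one, which is the trade-off Peter describes.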