* Re: [PATCH] x86: Reduce the default HZ value
@ 2009-05-14 20:25 devzero
2009-05-14 20:29 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: devzero @ 2009-05-14 20:25 UTC (permalink / raw)
To: akataria; +Cc: Alan Cox, linux-kernel@vger.kernel.org
> On Tue, 2009-05-12 at 12:45 -0700, devzero@web.de wrote:
> > >> > As a side note Red Hat ships runtime configurable tick behaviour in RHEL
> > >> > these days. HZ is fixed but the ticks can be bunched up. That was done as
> > >> > a quick fix to keep stuff portable but its a lot more sensible than
> > >> > randomly messing with the HZ value and its not much code either.
> > >> >
> > >> Hi Alan,
> > >>
> > >> I guess you are talking about the tick_divider patch ?
> > >> And that's still same as reducing the HZ value only that it can be done
> > >> dynamically (boot time), right ?
> > >
> > >Yes - which has the advantage that you can select different behaviours
> > >rather than distributions having to build with HZ=1000 either for
> > >compatibility or responsiveness can still allow users to drop to a lower
> > >HZ value if doing stuff like HPC.
> > >
> > >Basically it removes the need to argue about it at build time and lets
> > >the user decide.
> >
> > any reason why this did not reach mainline?
>
> I think it is because during the time when this was implemented for RHEL
> 5, mainline was moving towards the tickless approach, which might have
> prompted people to think that it would no longer be useful for mainline.
>
> Since Alan was the one who implemented those patches, I guess he would
> have a better say on this. Alan, are there any plans for mainlining this
> now ?
>
> Alok
anyway, just FYI or for some additional transparency, here's the four tick-divider
related patches from a "recent" RHEL5 kernel
(-> http://isoredirect.centos.org/centos/5/os/SRPMS/kernel-2.6.18-128.el5.src.rpm)
regards
roland
cat ./linux-2.6-docs-update-kernel-parameters-with-tick-divider.patch
From: Chris Lalancette <clalance@redhat.com>
Date: Wed, 17 Sep 2008 17:14:19 +0200
Subject: [docs] update kernel-parameters with tick-divider
Message-id: 48D11ECB.1060100@redhat.com
O-Subject: [RHEL5.3 PATCH v2]: Update kernel-parameters with tick-divider
Bugzilla: 454792
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: Alan Cox <alan@redhat.com>
RH-Nacked-by: Alan Cox <alan@redhat.com>
We have a request to better document the tick divider patch that went into 5.1.
Towards this end, I came up with the following patch to
Documentation/kernel-parameters.txt. Not sure if it needs ACKs or anything, but
I wanted to make sure dzickus saw it. This will resolve BZ 454792. This
version doesn't tell the user to divide by zero (thanks Alan).
--
Chris Lalancette
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b5bbd11..20ab2a9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -470,6 +470,10 @@ running once the system is up.
See drivers/char/README.epca and
Documentation/digiepca.txt.
+ divider= [IA-32,X86-64]
+ divide kernel HZ rate by given value.
+ Format: <num>, where <num> is between 1 and 25
+
dmascc= [HW,AX25,SERIAL] AX.25 Z80SCC driver with DMA
support available.
Format: <io_dev0>[,<io_dev1>[,..<io_dev32>]]
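For context, the documented parameter is passed on the kernel command line at boot. A hypothetical RHEL5-era grub.conf stanza (kernel version, root device, and paths are illustrative, not taken from this thread) might look like:

```
# /boot/grub/grub.conf -- illustrative entry; version and root device are placeholders
title CentOS (2.6.18-128.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 divider=10
        initrd /initrd-2.6.18-128.el5.img
```

With divider=10 and HZ=1000, the timer interrupt fires at 100 Hz while the compiled-in HZ value, and everything keyed off it, stays at 1000.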
cat ./linux-2.6-x86_64-fix-casting-issue-in-tick-divider-patch.patch
From: Prarit Bhargava <prarit@redhat.com>
Subject: [RHEL 5.1 PATCH]: Fix casting issue in tick divider patch
Date: Wed, 20 Jun 2007 14:16:29 -0400
Bugzilla: 244861
Message-Id: <20070620181629.28881.27223.sendpatchset@prarit.boston.redhat.com>
Changelog: [x86_64] Fix casting issue in tick divider patch
Fix a casting bug in the tick divider patch.
Successfully tested by me on a variety of systems that were exhibiting slow
boot behaviour.
Resolves BZ 244861.
--- linux-2.6.18.x86_64/arch/x86_64/kernel/time.c.orig 2007-06-20 04:21:58.000000000 -0400
+++ linux-2.6.18.x86_64/arch/x86_64/kernel/time.c 2007-06-20 04:28:58.000000000 -0400
@@ -433,7 +433,7 @@ void main_timer_handler(struct pt_regs *
(((long) offset << US_SCALE) / vxtime.tsc_quot) - 1;
}
/* SCALE: We expect tick_divider - 1 lost, ie 0 for normal behaviour */
- if (lost > tick_divider - 1) {
+ if (lost > (int)tick_divider - 1) {
handle_lost_ticks(lost, regs);
jiffies += lost - (tick_divider - 1);
}
cat ./linux-2.6-x86-fixes-for-the-tick-divider-patch.patch
From: Chris Lalancette <clalance@redhat.com>
Subject: Re: [RHEL 5.1.z PATCH]: Fixes for the tick divider patch
Date: Tue, 02 Oct 2007 16:53:22 -0400
Bugzilla: 315471
Message-Id: <4702AFC2.9020702@redhat.com>
Changelog: [x86] Fixes for the tick divider patch
All,
While testing the tick divider patch under VMware, a number of issues were
found with it:
1) On i386, when specifying "divider=10 apic=verbose", a bogus value was
printed for the CPU MHz and the host bus speed. This is because during APIC
calibration, we were using "HZ/10" loops instead of "REAL_HZ/10", causing the
calculation to go out of bounds.
2) On x86_64, when using the tick divider, it wasn't dividing the local APIC as
well as the external timer. This causes problems under VMware since the
hypervisor (ESX server) has to deliver 1000 local APIC interrupts per second to
each logical processor, which can end up causing time drift. By properly
dividing the local APIC as well as the external time source, it significantly
reduces the load on the HV, and the guests have less tendency to drift.
3) On x86_64, we weren't looping during smp_local_timer_interrupt(), so we were
losing profiling ticks.
4) On x86_64, when using the tick divider with PM-Timer, lost tick compensation
wasn't being calculated properly. In particular, we would count ticks as lost
when they really weren't, because we were using HZ instead of REAL_HZ in the
lost calculation.
5) On x86_64, TSC suffers from the same problem as PM-Timer.
The attached patch fixes all 5 of these problems. Additionally, this patch also
adds a "hz=" command-line parameter for both i386 and x86_64. This is a nicer way
to specify the divider from a user point-of-view; they don't have to know the
current value of HZ in order to specify the HZ value they want.
These patches are not upstream, since upstream has since gone with the tickless
kernel.
Patches successfully tested by myself (just for verifying basic correctness),
and HP and VMware using ESX server.
This fixes BZ 305011. Please review and ACK.
Chris Lalancette
>
> ACK less the hz= bits for 5.1.z, per Alan's concern about only certain
> values in the currently accepted range actually being valid. I'd say
> fully bake that part for 5.2 and just take the fixes for 5.1.z.
>
Same patch, with hz= bits removed for the z-stream.
Chris Lalancette
diff -urp linux-2.6.18.noarch.orig/arch/i386/kernel/apic.c linux-2.6.18.noarch/arch/i386/kernel/apic.c
--- linux-2.6.18.noarch.orig/arch/i386/kernel/apic.c 2007-10-02 16:42:24.000000000 -0400
+++ linux-2.6.18.noarch/arch/i386/kernel/apic.c 2007-10-02 16:47:00.000000000 -0400
@@ -1027,7 +1027,7 @@ static int __init calibrate_APIC_clock(v
long tt1, tt2;
long result;
int i;
- const int LOOPS = HZ/10;
+ const int LOOPS = REAL_HZ/10;
apic_printk(APIC_VERBOSE, "calibrating APIC timer ...\n");
@@ -1076,13 +1076,13 @@ static int __init calibrate_APIC_clock(v
if (cpu_has_tsc)
apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
"%ld.%04ld MHz.\n",
- ((long)(t2-t1)/LOOPS)/(1000000/HZ),
- ((long)(t2-t1)/LOOPS)%(1000000/HZ));
+ ((long)(t2-t1)/LOOPS)/(1000000/REAL_HZ),
+ ((long)(t2-t1)/LOOPS)%(1000000/REAL_HZ));
apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
"%ld.%04ld MHz.\n",
- result/(1000000/HZ),
- result%(1000000/HZ));
+ result/(1000000/REAL_HZ),
+ result%(1000000/REAL_HZ));
return result;
}
diff -urp linux-2.6.18.noarch.orig/arch/x86_64/kernel/apic.c linux-2.6.18.noarch/arch/x86_64/kernel/apic.c
--- linux-2.6.18.noarch.orig/arch/x86_64/kernel/apic.c 2007-10-02 16:42:30.000000000 -0400
+++ linux-2.6.18.noarch/arch/x86_64/kernel/apic.c 2007-10-02 16:47:00.000000000 -0400
@@ -811,7 +811,7 @@ static int __init calibrate_APIC_clock(v
printk(KERN_INFO "Detected %d.%03d MHz APIC timer.\n",
result / 1000 / 1000, result / 1000 % 1000);
- return result * APIC_DIVISOR / HZ;
+ return result * APIC_DIVISOR / REAL_HZ;
}
static unsigned int calibration_result;
@@ -941,10 +941,13 @@ void setup_APIC_extened_lvt(unsigned cha
void smp_local_timer_interrupt(struct pt_regs *regs)
{
- profile_tick(CPU_PROFILING, regs);
+ int i;
+ for (i = 0; i < tick_divider; i++) {
+ profile_tick(CPU_PROFILING, regs);
#ifdef CONFIG_SMP
- update_process_times(user_mode(regs));
+ update_process_times(user_mode(regs));
#endif
+ }
if (apic_runs_main_timer > 1 && smp_processor_id() == boot_cpu_id)
main_timer_handler(regs);
/*
diff -urp linux-2.6.18.noarch.orig/arch/x86_64/kernel/pmtimer.c linux-2.6.18.noarch/arch/x86_64/kernel/pmtimer.c
--- linux-2.6.18.noarch.orig/arch/x86_64/kernel/pmtimer.c 2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.noarch/arch/x86_64/kernel/pmtimer.c 2007-10-02 16:47:00.000000000 -0400
@@ -64,8 +64,8 @@ int pmtimer_mark_offset(void)
delta += offset_delay;
- lost = delta / (USEC_PER_SEC / HZ);
- offset_delay = delta % (USEC_PER_SEC / HZ);
+ lost = delta / (USEC_PER_SEC / REAL_HZ);
+ offset_delay = delta % (USEC_PER_SEC / REAL_HZ);
rdtscll(tsc);
vxtime.last_tsc = tsc - offset_delay * (u64)cpu_khz / 1000;
diff -urp linux-2.6.18.noarch.orig/arch/x86_64/kernel/time.c linux-2.6.18.noarch/arch/x86_64/kernel/time.c
--- linux-2.6.18.noarch.orig/arch/x86_64/kernel/time.c 2007-10-02 16:42:31.000000000 -0400
+++ linux-2.6.18.noarch/arch/x86_64/kernel/time.c 2007-10-02 16:47:43.000000000 -0400
@@ -65,6 +65,8 @@ static int notsc __initdata = 0;
#define NSEC_PER_TICK (NSEC_PER_SEC / HZ)
#define FSEC_PER_TICK (FSEC_PER_SEC / HZ)
+#define USEC_PER_REAL_TICK (USEC_PER_SEC / REAL_HZ)
+
#define NS_SCALE 10 /* 2^10, carefully chosen */
#define US_SCALE 32 /* 2^32, arbitralrily chosen */
@@ -304,7 +306,7 @@ unsigned long long monotonic_clock(void)
this_offset = hpet_readl(HPET_COUNTER);
} while (read_seqretry(&xtime_lock, seq));
offset = (this_offset - last_offset);
- offset *= NSEC_PER_TICK / hpet_tick;
+ offset *= NSEC_PER_TICK / hpet_tick_real;
} else {
do {
seq = read_seqbegin(&xtime_lock);
@@ -406,7 +408,7 @@ void main_timer_handler(struct pt_regs *
}
monotonic_base +=
- (offset - vxtime.last) * NSEC_PER_TICK / hpet_tick;
+ (offset - vxtime.last) * NSEC_PER_TICK / hpet_tick_real;
vxtime.last = offset;
#ifdef CONFIG_X86_PM_TIMER
@@ -415,14 +417,14 @@ void main_timer_handler(struct pt_regs *
#endif
} else {
offset = (((tsc - vxtime.last_tsc) *
- vxtime.tsc_quot) >> US_SCALE) - USEC_PER_TICK;
+ vxtime.tsc_quot) >> US_SCALE) - USEC_PER_REAL_TICK;
if (offset < 0)
offset = 0;
- if (offset > USEC_PER_TICK) {
- lost = offset / USEC_PER_TICK;
- offset %= USEC_PER_TICK;
+ if (offset > USEC_PER_REAL_TICK) {
+ lost = offset / USEC_PER_REAL_TICK;
+ offset %= USEC_PER_REAL_TICK;
}
/* FIXME: 1000 or 1000000? */
cat ./linux-2.6-x86-tick-divider.patch
From: Alan Cox <alan@redhat.com>
Subject: [RHEL5]: Tick Divider (Bugzilla #215403]
Date: Wed, 18 Apr 2007 16:39:15 -0400
Bugzilla: 215403
Message-Id: <20070418203915.GA23344@devserv.devel.redhat.com>
Changelog: [x86] Tick Divider
The following patch implements a tick divider feature that allows you to
boot the kernel with HZ at 1000 but the real timer tick rate lower (thus
not breaking all the modules and kABI).
The selection is done at boot to minimize risk and the patch has been reworked
so that you can do an informal attempt at a proof that it doesn't cause
regression for the non dividing case.
The patch interleaved with notes follows, and below that the actual patch
proper.
Xen kernels remain at 250HZ because
a) Xen guests have a 'tickless mode'
b) Xen itself has issues with multiple differing guest HZ rates
Not queued for upstream as the upstream path is Ingo's tickless kernel, which
is not viable as a RHEL5 tweak
Index: linux-2.6.18.noarch/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/kernel/apic.c
+++ linux-2.6.18.noarch/arch/i386/kernel/apic.c
@@ -1185,10 +1185,13 @@ EXPORT_SYMBOL(switch_ipi_to_APIC_timer);
inline void smp_local_timer_interrupt(struct pt_regs * regs)
{
- profile_tick(CPU_PROFILING, regs);
+ int i;
+ for (i = 0; i < tick_divider; i++) {
+ profile_tick(CPU_PROFILING, regs);
#ifdef CONFIG_SMP
- update_process_times(user_mode_vm(regs));
+ update_process_times(user_mode_vm(regs));
#endif
+ }
/*
* We take the 'long' return path, and there every subsystem
Index: linux-2.6.18.noarch/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/kernel/apm.c
+++ linux-2.6.18.noarch/arch/i386/kernel/apm.c
@@ -1189,7 +1189,7 @@ static void reinit_timer(void)
unsigned long flags;
spin_lock_irqsave(&i8253_lock, flags);
- /* set the clock to 100 Hz */
+ /* set the clock to HZ */
outb_p(0x34, PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */
udelay(10);
outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
Index: linux-2.6.18.noarch/arch/i386/kernel/i8253.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/kernel/i8253.c
+++ linux-2.6.18.noarch/arch/i386/kernel/i8253.c
@@ -26,6 +26,7 @@ void setup_pit_timer(void)
spin_lock_irqsave(&i8253_lock, flags);
outb_p(0x34,PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */
udelay(10);
+ /* Physical HZ */
outb_p(LATCH & 0xff , PIT_CH0); /* LSB */
udelay(10);
outb(LATCH >> 8 , PIT_CH0); /* MSB */
@@ -94,8 +95,11 @@ static cycle_t pit_read(void)
spin_unlock_irqrestore(&i8253_lock, flags);
count = (LATCH - 1) - count;
-
- return (cycle_t)(jifs * LATCH) + count;
+ /* Adjust to logical ticks */
+ count *= tick_divider;
+
+ /* Keep the jiffies in terms of logical ticks not physical */
+ return (cycle_t)(jifs * LOGICAL_LATCH) + count;
}
static struct clocksource clocksource_pit = {
Index: linux-2.6.18.noarch/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/kernel/time.c
+++ linux-2.6.18.noarch/arch/i386/kernel/time.c
@@ -366,3 +367,22 @@ void __init time_init(void)
time_init_hook();
}
+
+#ifdef CONFIG_TICK_DIVIDER
+
+unsigned int tick_divider = 1;
+
+static int __init divider_setup(char *s)
+{
+ unsigned int divider = 1;
+ get_option(&s, &divider);
+ if (divider >= 1 && HZ/divider >= 25)
+ tick_divider = divider;
+ else
+ printk(KERN_ERR "tick_divider: %d is out of range.\n", divider);
+ return 1;
+}
+
+__setup("divider=", divider_setup);
+
+#endif
Index: linux-2.6.18.noarch/arch/i386/kernel/time_hpet.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/kernel/time_hpet.c
+++ linux-2.6.18.noarch/arch/i386/kernel/time_hpet.c
@@ -24,6 +24,7 @@
static unsigned long hpet_period; /* fsecs / HPET clock */
unsigned long hpet_tick; /* hpet clks count per tick */
+unsigned long hpet_tick_real; /* hpet clocks per interrupt */
unsigned long hpet_address; /* hpet memory map physical address */
int hpet_use_timer;
@@ -156,7 +157,8 @@ int __init hpet_enable(void)
hpet_use_timer = id & HPET_ID_LEGSUP;
- if (hpet_timer_stop_set_go(hpet_tick))
+ hpet_tick_real = hpet_tick * tick_divider;
+ if (hpet_timer_stop_set_go(hpet_tick_real))
return -1;
use_hpet = 1;
Index: linux-2.6.18.noarch/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.18.noarch.orig/arch/x86_64/Kconfig
+++ linux-2.6.18.noarch/arch/x86_64/Kconfig
@@ -443,6 +443,13 @@ config HPET_EMULATE_RTC
bool "Provide RTC interrupt"
depends on HPET_TIMER && RTC=y
+config TICK_DIVIDER
+ bool "Support clock division"
+ default n
+ help
+ Supports the use of clock division allowing the real interrupt
+ rate to be lower than the HZ setting.
+
# Mark as embedded because too many people got it wrong.
# The code disables itself when not needed.
config IOMMU
Index: linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/x86_64/kernel/i8259.c
+++ linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c
@@ -498,6 +498,7 @@ static void setup_timer_hardware(void)
{
outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */
udelay(10);
+ /* LATCH is in physical clocks */
outb_p(LATCH & 0xff , 0x40); /* LSB */
udelay(10);
outb(LATCH >> 8 , 0x40); /* MSB */
Index: linux-2.6.18.noarch/arch/x86_64/kernel/time.c
===================================================================
--- linux-2.6.18.noarch.orig/arch/x86_64/kernel/time.c
+++ linux-2.6.18.noarch/arch/x86_64/kernel/time.c
@@ -70,7 +70,8 @@ static int notsc __initdata = 0;
unsigned int cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
static unsigned long hpet_period; /* fsecs / HPET clock */
-unsigned long hpet_tick; /* HPET clocks / interrupt */
+unsigned long hpet_tick; /* HPET clocks / HZ */
+unsigned long hpet_tick_real; /* HPET clocks / interrupt */
int hpet_use_timer; /* Use counter of hpet for time keeping, otherwise PIT */
unsigned long vxtime_hz = PIT_TICK_RATE;
int report_lost_ticks; /* command line option */
@@ -108,7 +109,9 @@ static inline unsigned int do_gettimeoff
{
/* cap counter read to one tick to avoid inconsistencies */
unsigned long counter = hpet_readl(HPET_COUNTER) - vxtime.last;
- return (min(counter,hpet_tick) * vxtime.quot) >> US_SCALE;
+ /* The hpet counter runs at a fixed rate so we don't care about HZ
+ scaling here. We do however care that the limit is in real ticks */
+ return (min(counter,hpet_tick_real) * vxtime.quot) >> US_SCALE;
}
unsigned int (*do_gettimeoffset)(void) = do_gettimeoffset_tsc;
@@ -332,7 +335,7 @@ static noinline void handle_lost_ticks(i
printk(KERN_WARNING "Falling back to HPET\n");
if (hpet_use_timer)
vxtime.last = hpet_readl(HPET_T0_CMP) -
- hpet_tick;
+ hpet_tick_real;
else
vxtime.last = hpet_readl(HPET_COUNTER);
vxtime.mode = VXTIME_HPET;
@@ -355,7 +358,7 @@ void main_timer_handler(struct pt_regs *
{
static unsigned long rtc_update = 0;
unsigned long tsc;
- int delay = 0, offset = 0, lost = 0;
+ int delay = 0, offset = 0, lost = 0, i;
/*
* Here we are in the timer irq handler. We have irqs locally disabled (so we
@@ -373,8 +376,10 @@ void main_timer_handler(struct pt_regs *
/* if we're using the hpet timer functionality,
* we can more accurately know the counter value
* when the timer interrupt occured.
+ *
+ * We are working in physical time here
*/
- offset = hpet_readl(HPET_T0_CMP) - hpet_tick;
+ offset = hpet_readl(HPET_T0_CMP) - hpet_tick_real;
delay = hpet_readl(HPET_COUNTER) - offset;
} else if (!pmtmr_ioport) {
spin_lock(&i8253_lock);
@@ -382,14 +387,19 @@ void main_timer_handler(struct pt_regs *
delay = inb_p(0x40);
delay |= inb(0x40) << 8;
spin_unlock(&i8253_lock);
+ /* We are in physical not logical ticks */
delay = LATCH - 1 - delay;
+ /* True ticks of delay elapsed */
+ delay *= tick_divider;
}
tsc = get_cycles_sync();
if (vxtime.mode == VXTIME_HPET) {
- if (offset - vxtime.last > hpet_tick) {
- lost = (offset - vxtime.last) / hpet_tick - 1;
+ if (offset - vxtime.last > hpet_tick_real) {
+ lost = (offset - vxtime.last) / hpet_tick_real - 1;
+ /* Lost is now in real ticks but we want logical */
+ lost *= tick_divider;
}
monotonic_base +=
@@ -422,33 +432,35 @@ void main_timer_handler(struct pt_regs *
vxtime.last_tsc = tsc -
(((long) offset << US_SCALE) / vxtime.tsc_quot) - 1;
}
-
- if (lost > 0) {
+ /* SCALE: We expect tick_divider - 1 lost, ie 0 for normal behaviour */
+ if (lost > tick_divider - 1) {
handle_lost_ticks(lost, regs);
- jiffies += lost;
+ jiffies += lost - (tick_divider - 1);
}
/*
* Do the timer stuff.
*/
- do_timer(regs);
+ for (i = 0; i < tick_divider; i++) {
+ do_timer(regs);
#ifndef CONFIG_SMP
- update_process_times(user_mode(regs));
+ update_process_times(user_mode(regs));
#endif
-/*
- * In the SMP case we use the local APIC timer interrupt to do the profiling,
- * except when we simulate SMP mode on a uniprocessor system, in that case we
- * have to call the local interrupt handler.
- */
+ /*
+ * In the SMP case we use the local APIC timer interrupt to do the profiling,
+ * except when we simulate SMP mode on a uniprocessor system, in that case we
+ * have to call the local interrupt handler.
+ */
#ifndef CONFIG_X86_LOCAL_APIC
- profile_tick(CPU_PROFILING, regs);
+ profile_tick(CPU_PROFILING, regs);
#else
- if (!using_apic_timer)
- smp_local_timer_interrupt(regs);
+ if (!using_apic_timer)
+ smp_local_timer_interrupt(regs);
#endif
+ }
/*
* If we have an externally synchronized Linux clock, then update CMOS clock
@@ -800,8 +812,8 @@ static int hpet_timer_stop_set_go(unsign
if (hpet_use_timer) {
hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
HPET_TN_32BIT, HPET_T0_CFG);
- hpet_writel(hpet_tick, HPET_T0_CMP); /* next interrupt */
- hpet_writel(hpet_tick, HPET_T0_CMP); /* period */
+ hpet_writel(hpet_tick_real, HPET_T0_CMP); /* next interrupt */
+ hpet_writel(hpet_tick_real, HPET_T0_CMP); /* period */
cfg |= HPET_CFG_LEGACY;
}
/*
@@ -836,16 +848,19 @@ static int hpet_init(void)
if (hpet_period < 100000 || hpet_period > 100000000)
return -1;
+ /* Logical ticks */
hpet_tick = (FSEC_PER_TICK + hpet_period / 2) / hpet_period;
+ /* Ticks per real interrupt */
+ hpet_tick_real = hpet_tick * tick_divider;
hpet_use_timer = (id & HPET_ID_LEGSUP);
- return hpet_timer_stop_set_go(hpet_tick);
+ return hpet_timer_stop_set_go(hpet_tick_real);
}
static int hpet_reenable(void)
{
- return hpet_timer_stop_set_go(hpet_tick);
+ return hpet_timer_stop_set_go(hpet_tick_real);
}
#define PIT_MODE 0x43
@@ -864,6 +879,7 @@ static void __init __pit_init(int val, u
void __init pit_init(void)
{
+ /* LATCH is in actual interrupt ticks */
__pit_init(LATCH, 0x34); /* binary, mode 2, LSB/MSB, ch 0 */
}
@@ -1002,7 +1018,7 @@ void time_init_gtod(void)
if (vxtime.hpet_address && notsc) {
timetype = hpet_use_timer ? "HPET" : "PIT/HPET";
if (hpet_use_timer)
- vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick;
+ vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick_real;
else
vxtime.last = hpet_readl(HPET_COUNTER);
vxtime.mode = VXTIME_HPET;
@@ -1073,7 +1089,7 @@ static int timer_resume(struct sys_devic
xtime.tv_nsec = 0;
if (vxtime.mode == VXTIME_HPET) {
if (hpet_use_timer)
- vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick;
+ vxtime.last = hpet_readl(HPET_T0_CMP) - hpet_tick_real;
else
vxtime.last = hpet_readl(HPET_COUNTER);
#ifdef CONFIG_X86_PM_TIMER
@@ -1352,3 +1368,22 @@ int __init notsc_setup(char *s)
}
__setup("notsc", notsc_setup);
+
+#ifdef CONFIG_TICK_DIVIDER
+
+
+unsigned int tick_divider = 1;
+
+static int __init divider_setup(char *s)
+{
+ unsigned int divider = 1;
+ get_option(&s, &divider);
+ if (divider >= 1 && HZ/divider >= 25)
+ tick_divider = divider;
+ else
+ printk(KERN_ERR "tick_divider: %d is out of range.\n", divider);
+ return 1;
+}
+
+__setup("divider=", divider_setup);
+#endif
Index: linux-2.6.18.noarch/include/asm-i386/mach-default/do_timer.h
===================================================================
--- linux-2.6.18.noarch.orig/include/asm-i386/mach-default/do_timer.h
+++ linux-2.6.18.noarch/include/asm-i386/mach-default/do_timer.h
@@ -16,17 +16,21 @@
static inline void do_timer_interrupt_hook(struct pt_regs *regs)
{
- do_timer(regs);
+ int i;
+ for (i = 0; i < tick_divider; i++) {
+ do_timer(regs);
#ifndef CONFIG_SMP
- update_process_times(user_mode_vm(regs));
+ update_process_times(user_mode_vm(regs));
#endif
+ }
/*
* In the SMP case we use the local APIC timer interrupt to do the
* profiling, except when we simulate SMP mode on a uniprocessor
* system, in that case we have to call the local interrupt handler.
*/
#ifndef CONFIG_X86_LOCAL_APIC
- profile_tick(CPU_PROFILING, regs);
+ for (i = 0; i < tick_divider; i++)
+ profile_tick(CPU_PROFILING, regs);
#else
if (!using_apic_timer)
smp_local_timer_interrupt(regs);
Index: linux-2.6.18.noarch/include/asm-i386/mach-visws/do_timer.h
===================================================================
--- linux-2.6.18.noarch.orig/include/asm-i386/mach-visws/do_timer.h
+++ linux-2.6.18.noarch/include/asm-i386/mach-visws/do_timer.h
@@ -6,20 +6,24 @@
static inline void do_timer_interrupt_hook(struct pt_regs *regs)
{
+ int i;
/* Clear the interrupt */
co_cpu_write(CO_CPU_STAT,co_cpu_read(CO_CPU_STAT) & ~CO_STAT_TIMEINTR);
- do_timer(regs);
+ for (i = 0; i < tick_divider; i++) {
+ do_timer(regs);
#ifndef CONFIG_SMP
- update_process_times(user_mode_vm(regs));
+ update_process_times(user_mode_vm(regs));
#endif
+ }
/*
* In the SMP case we use the local APIC timer interrupt to do the
* profiling, except when we simulate SMP mode on a uniprocessor
* system, in that case we have to call the local interrupt handler.
*/
#ifndef CONFIG_X86_LOCAL_APIC
- profile_tick(CPU_PROFILING, regs);
+ for (i = 0; i < tick_divider; i++)
+ profile_tick(CPU_PROFILING, regs);
#else
if (!using_apic_timer)
smp_local_timer_interrupt(regs);
Index: linux-2.6.18.noarch/include/asm-i386/mach-voyager/do_timer.h
===================================================================
--- linux-2.6.18.noarch.orig/include/asm-i386/mach-voyager/do_timer.h
+++ linux-2.6.18.noarch/include/asm-i386/mach-voyager/do_timer.h
@@ -3,12 +3,14 @@
static inline void do_timer_interrupt_hook(struct pt_regs *regs)
{
- do_timer(regs);
+ int i;
+ for (i = 0; i < tick_divider; i++) {
+ do_timer(regs);
#ifndef CONFIG_SMP
- update_process_times(user_mode_vm(regs));
+ update_process_times(user_mode_vm(regs));
#endif
-
- voyager_timer_interrupt(regs);
+ voyager_timer_interrupt(regs);
+ }
}
static inline int do_timer_overflow(int count)
Index: linux-2.6.18.noarch/include/linux/jiffies.h
===================================================================
--- linux-2.6.18.noarch.orig/include/linux/jiffies.h
+++ linux-2.6.18.noarch/include/linux/jiffies.h
@@ -33,10 +33,21 @@
# error You lose.
#endif
+#ifndef CONFIG_TICK_DIVIDER
+#define tick_divider 1
+#else
+extern unsigned int tick_divider;
+#endif
+
+#define REAL_HZ (HZ/tick_divider)
/* LATCH is used in the interval timer and ftape setup. */
-#define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */
+#define LATCH ((CLOCK_TICK_RATE + REAL_HZ/2) / REAL_HZ) /* For divider */
+
+#define LATCH_HPET ((HPET_TICK_RATE + REAL_HZ/2) / REAL_HZ)
+
+#define LOGICAL_LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */
-#define LATCH_HPET ((HPET_TICK_RATE + HZ/2) / HZ)
+#define LOGICAL_LATCH_HPET ((HPET_TICK_RATE + HZ/2) / HZ)
/* Suppose we want to devide two numbers NOM and DEN: NOM/DEN, the we can
* improve accuracy by shifting LSH bits, hence calculating:
@@ -51,9 +62,9 @@
+ ((((NOM) % (DEN)) << (LSH)) + (DEN) / 2) / (DEN))
/* HZ is the requested value. ACTHZ is actual HZ ("<< 8" is for accuracy) */
-#define ACTHZ (SH_DIV (CLOCK_TICK_RATE, LATCH, 8))
+#define ACTHZ (SH_DIV (CLOCK_TICK_RATE, LOGICAL_LATCH, 8))
-#define ACTHZ_HPET (SH_DIV (HPET_TICK_RATE, LATCH_HPET, 8))
+#define ACTHZ_HPET (SH_DIV (HPET_TICK_RATE, LOGICAL_LATCH_HPET, 8))
/* TICK_NSEC is the time between ticks in nsec assuming real ACTHZ */
#define TICK_NSEC (SH_DIV (1000000UL * 1000, ACTHZ, 8))
Index: linux-2.6.18.noarch/init/calibrate.c
===================================================================
--- linux-2.6.18.noarch.orig/init/calibrate.c
+++ linux-2.6.18.noarch/init/calibrate.c
@@ -26,7 +26,6 @@ __setup("lpj=", lpj_setup);
* Also, this code tries to handle non-maskable asynchronous events
* (like SMIs)
*/
-#define DELAY_CALIBRATION_TICKS ((HZ < 100) ? 1 : (HZ/100))
#define MAX_DIRECT_CALIBRATION_RETRIES 5
static unsigned long __devinit calibrate_delay_direct(void)
@@ -37,6 +36,7 @@ static unsigned long __devinit calibrate
unsigned long tsc_rate_min, tsc_rate_max;
unsigned long good_tsc_sum = 0;
unsigned long good_tsc_count = 0;
+ unsigned long delay_calibration_ticks = ((REAL_HZ < 100) ? 1 : (REAL_HZ/100));
int i;
if (read_current_timer(&pre_start) < 0 )
@@ -65,7 +65,7 @@ static unsigned long __devinit calibrate
pre_start = 0;
read_current_timer(&start);
start_jiffies = jiffies;
- while (jiffies <= (start_jiffies + 1)) {
+ while (jiffies <= (start_jiffies + tick_divider)) {
pre_start = start;
read_current_timer(&start);
}
@@ -74,15 +74,18 @@ static unsigned long __devinit calibrate
pre_end = 0;
end = post_start;
while (jiffies <=
- (start_jiffies + 1 + DELAY_CALIBRATION_TICKS)) {
+ (start_jiffies + tick_divider * (1 + delay_calibration_ticks))) {
pre_end = end;
read_current_timer(&end);
}
read_current_timer(&post_end);
- tsc_rate_max = (post_end - pre_start) / DELAY_CALIBRATION_TICKS;
- tsc_rate_min = (pre_end - post_start) / DELAY_CALIBRATION_TICKS;
-
+ tsc_rate_max = (post_end - pre_start) / delay_calibration_ticks;
+ tsc_rate_min = (pre_end - post_start) / delay_calibration_ticks;
+
+ tsc_rate_max /= tick_divider;
+ tsc_rate_min /= tick_divider;
+
/*
* If the upper limit and lower limit of the tsc_rate is
* >= 12.5% apart, redo calibration.
Index: linux-2.6.18.noarch/arch/i386/Kconfig
===================================================================
--- linux-2.6.18.noarch.orig/arch/i386/Kconfig
+++ linux-2.6.18.noarch/arch/i386/Kconfig
@@ -238,6 +238,13 @@ config HPET_EMULATE_RTC
depends on HPET_TIMER && RTC=y
default y
+config TICK_DIVIDER
+ bool "Support clock division"
+ default n
+ help
+ Supports the use of clock division allowing the real interrupt
+ rate to be lower than the HZ setting.
+
config NR_CPUS
int "Maximum number of CPUs (2-255)"
range 2 255
______________________________________________________
GRATIS für alle WEB.DE-Nutzer: Die maxdome Movie-FLAT!
Jetzt freischalten unter http://movieflat.web.de
* Re: [PATCH] x86: Reduce the default HZ value
@ 2009-05-12 19:45 devzero
2009-05-13 23:30 ` Alok Kataria
0 siblings, 1 reply; 65+ messages in thread
From: devzero @ 2009-05-12 19:45 UTC (permalink / raw)
To: Alan Cox; +Cc: Alok Kataria, linux-kernel
>> > As a side note Red Hat ships runtime configurable tick behaviour in RHEL
>> > these days. HZ is fixed but the ticks can be bunched up. That was done as
>> > a quick fix to keep stuff portable but its a lot more sensible than
>> > randomly messing with the HZ value and its not much code either.
>> >
>> Hi Alan,
>>
>> I guess you are talking about the tick_divider patch ?
>> And that's still same as reducing the HZ value only that it can be done
>> dynamically (boot time), right ?
>
>Yes - which has the advantage that you can select different behaviours
>rather than distributions having to build with HZ=1000 either for
>compatibility or responsiveness can still allow users to drop to a lower
>HZ value if doing stuff like HPC.
>
>Basically it removes the need to argue about it at build time and lets
>the user decide.
any reason why this did not reach mainline?
is it because there were issues with clocksource=pit ?
regards
roland
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-12 19:45 devzero
@ 2009-05-13 23:30 ` Alok Kataria
0 siblings, 0 replies; 65+ messages in thread
From: Alok Kataria @ 2009-05-13 23:30 UTC (permalink / raw)
To: devzero@web.de; +Cc: Alan Cox, linux-kernel@vger.kernel.org
On Tue, 2009-05-12 at 12:45 -0700, devzero@web.de wrote:
> >> > As a side note Red Hat ships runtime configurable tick behaviour in RHEL
> >> > these days. HZ is fixed but the ticks can be bunched up. That was done as
> >> > a quick fix to keep stuff portable but its a lot more sensible than
> >> > randomly messing with the HZ value and its not much code either.
> >> >
> >> Hi Alan,
> >>
> >> I guess you are talking about the tick_divider patch ?
> >> And that's still same as reducing the HZ value only that it can be done
> >> dynamically (boot time), right ?
> >
> >Yes - which has the advantage that you can select different behaviours
> >rather than distributions having to build with HZ=1000 either for
> >compatibility or responsiveness can still allow users to drop to a lower
> >HZ value if doing stuff like HPC.
> >
> >Basically it removes the need to argue about it at build time and lets
> >the user decide.
>
> any reason why this did not reach mainline?
I think it is because during the time when this was implemented for RHEL
5, mainline was moving towards the tickless approach, which might have
prompted people to think that it would no longer be useful for mainline.
Since Alan was the one who implemented those patches, I guess he would
have a better say on this. Alan, are there any plans for mainlining this
now ?
Alok
> is it because there were issues with clocksource=pit ?
>
> regards
> roland
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH] x86: Reduce the default HZ value
@ 2009-05-04 18:44 Alok Kataria
2009-05-05 21:21 ` H. Peter Anvin
0 siblings, 1 reply; 65+ messages in thread
From: Alok Kataria @ 2009-05-04 18:44 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner; +Cc: the arch/x86 maintainers, LKML, alan
Hi,
Given that no major objections came up regarding reducing the HZ value
in http://lkml.org/lkml/2009/4/27/499,
below is the patch which actually reduces it; please consider it for tip.
Thanks,
Alok
--
With HRT support in the kernel we shouldn't actually need a high interrupt
frequency. This patch reduces the HZ value to 100 for the x86 defconfigs.
A high HZ value may affect the performance of the system when it is non-idle.
I ran a simple experiment with a 2.6.29 kernel running on VMware:
a simple tight loop took about 264s to complete with a HZ value of 1000;
the system serviced a total of 264405 timer interrupts during that time.
The same loop with HZ=100 took only about 255s to complete, servicing a
total of 25593 timer interrupts.
More information here - http://lkml.org/lkml/2009/4/28/401
With highres timers, most of the important timers are not tied to how
often the jiffy value is updated, so this shouldn't have any adverse
effect on the latency of those timers either.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Index: linux-tip-master/arch/x86/configs/i386_defconfig
===================================================================
--- linux-tip-master.orig/arch/x86/configs/i386_defconfig 2009-05-01 16:47:43.000000000 -0700
+++ linux-tip-master/arch/x86/configs/i386_defconfig 2009-05-01 16:49:48.000000000 -0700
@@ -312,11 +312,11 @@
CONFIG_X86_PAT=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
-# CONFIG_HZ_100 is not set
+CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
-CONFIG_HZ_1000=y
-CONFIG_HZ=1000
+# CONFIG_HZ_1000 is not set
+CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
Index: linux-tip-master/arch/x86/configs/x86_64_defconfig
===================================================================
--- linux-tip-master.orig/arch/x86/configs/x86_64_defconfig 2009-05-01 16:37:53.000000000 -0700
+++ linux-tip-master/arch/x86/configs/x86_64_defconfig 2009-05-01 16:50:22.000000000 -0700
@@ -316,11 +316,11 @@
CONFIG_X86_PAT=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
-# CONFIG_HZ_100 is not set
+CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
-CONFIG_HZ_1000=y
-CONFIG_HZ=1000
+# CONFIG_HZ_1000 is not set
+CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-04 18:44 Alok Kataria
@ 2009-05-05 21:21 ` H. Peter Anvin
2009-05-05 21:44 ` Alan Cox
2009-05-05 21:57 ` Alok Kataria
0 siblings, 2 replies; 65+ messages in thread
From: H. Peter Anvin @ 2009-05-05 21:21 UTC (permalink / raw)
To: akataria; +Cc: Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
alan
Alok Kataria wrote:
> Hi,
>
> Given that there were no major objections that came up regarding
> reducing the HZ value in http://lkml.org/lkml/2009/4/27/499.
>
> Below is the patch which actually reduces it, please consider for tip.
>
What is the benefit of this?
I can see at least one immediate downside: some timeout values in the
kernel are still maintained in units of HZ (like poll, I believe), and
so with a lower HZ value we'll have higher roundoff errors.
-hpa
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 21:21 ` H. Peter Anvin
@ 2009-05-05 21:44 ` Alan Cox
2009-05-05 22:09 ` Alok Kataria
2009-05-05 21:57 ` Alok Kataria
1 sibling, 1 reply; 65+ messages in thread
From: Alan Cox @ 2009-05-05 21:44 UTC (permalink / raw)
To: H. Peter Anvin
Cc: akataria, Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers,
LKML
> What is the benefit of this?
I believe the "benefit" is that a certain poster's proprietary
virtualisation product works better.
> I can see at least one immediate downside: some timeout values in the
> kernel are still maintained in units of HZ (like poll, I believe), and
> so with a lower HZ value we'll have higher roundoff errors.
And HZ=100 actually causes real problems for some video work (not in
Europe, where it's just peachy). We switched to 1000Hz a very long time
ago because it improved desktop feel and responsiveness. We switched to
tickless to keep that behaviour with good power and idle behaviour.
If we still have problems that give virtualisers hiccups then they will
be giving real processor power management the same grief so I don't think
meddling with HZ to harm the desktop is the right move - but if there are
improvements to tickless (eg timer granularity and bunching under
virtualisation) that would be far better to explore.
Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 21:44 ` Alan Cox
@ 2009-05-05 22:09 ` Alok Kataria
2009-05-05 22:33 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: Alok Kataria @ 2009-05-05 22:09 UTC (permalink / raw)
To: Alan Cox
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Tue, 2009-05-05 at 14:44 -0700, Alan Cox wrote:
> > What is the benefit of this?
>
> I believe the "benefit" is that a certain posters proprietary
> virtualisation product works better.
Hi Alan,
I posted numbers that I had, and I don't think that the problem is
limited just to virtualization or our platform.
> > I can see at least one immediate downside: some timeout values in the
> > kernel are still maintained in units of HZ (like poll, I believe), and
> > so with a lower HZ value we'll have higher roundoff errors.
>
> And HZ=100 actually causes real problems for some video work (not in
> Europe where its just peachy). We switched to 1000Hz a very long time ago
> because it improved desktop feel and responsiveness. We switched to
> tickless to keep that behaviour with good power and idle behaviour.
IMO, one of the main motives of the HRT implementation, apart from getting
higher precision timers, was that we no longer necessarily need to rely
on a high timer frequency. If you see problems with desktop feel and
responsiveness, don't you think there could be some other problem causing
that? Your argument about "desktop feel and responsiveness" doesn't
explain what actual problem you saw.
Also, lots of distribution kernels ship with a lower HZ value anyway, so
I don't see why HZ=1000 is such a big requirement for your desktop use
case.
Alok
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 22:09 ` Alok Kataria
@ 2009-05-05 22:33 ` Alan Cox
2009-05-05 23:37 ` Alok Kataria
2009-05-07 14:09 ` Christoph Lameter
0 siblings, 2 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-05 22:33 UTC (permalink / raw)
To: akataria
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
> IMO, one of the main motives of HRT implementation apart from getting
> higher precision timers was that we now don't necessarily need to rely
Timer frequency and HZ are two entirely different things nowadays
> on a high timer frequency. If you see problems with Desktop feel and
> responsiveness don't you think there would be other problem which might
> be causing that ? Your argument about the "desktop feel and
> responsiveness" doesn't explain what actual problem did you see.
People spent months poking at the differences before HZ=1000 became the
default. It wasn't done for amusement's sake - but this is irrelevant
anyway on a modern kernel, as HZ=1000 is simply a precision setting that
affects things like poll()
HZ on a tickless system has no meaningful relationship to wakeup rates -
which are what I assume you actually care about.
So do you want to change the precision of poll() and other
functionality ? or do you want to change the wakeup rates and
corresponding virtualisation overhead ?
If the latter then HZ is not the thing to touch.
What are you *actually* trying to achieve ?
What measurements have you done that make you think HZ is relevant in a
tickless kernel ?
Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 22:33 ` Alan Cox
@ 2009-05-05 23:37 ` Alok Kataria
2009-05-07 14:09 ` Christoph Lameter
1 sibling, 0 replies; 65+ messages in thread
From: Alok Kataria @ 2009-05-05 23:37 UTC (permalink / raw)
To: Alan Cox
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Tue, 2009-05-05 at 15:33 -0700, Alan Cox wrote:
> > IMO, one of the main motives of HRT implementation apart from getting
> > higher precision timers was that we now don't necessarily need to rely
>
> Timer frequency and HZ are two entirely different things nowdyas
Huh? Maybe I am reading this code incorrectly, but this is what I
understand: the APIC is still being programmed to wake HZ times every
second if the system is non-idle (periodic mode).
Only if the system is idle does the kernel program the APIC in one-shot
mode; as a result, the tickless kernel gives us a lot less pain when the
guest is idle.
Here are the numbers with a HZ=100 kernel which proves this hypothesis.
[root@alok-vm-rhel64 ~]# cat /proc/interrupts | grep "timer" ; time
sleep 30 ; cat /proc/interrupts | grep "timer"
0: 36 0 IO-APIC-edge timer
LOC: 7549 7176 Local timer interrupts
real 0m30.006s
user 0m0.000s
sys 0m0.000s
0: 36 0 IO-APIC-edge timer
LOC: 7616 7209 Local timer interrupts
So in this case, when the system is (pretty much) "idle", the total number
of wakeups is far less, just about 65 in the total 30 sec on cpu0.
I then ran a simple program which does a tight loop, to check the
behavior when the system is non-idle:
[root@alok-vm-rhel64 ~]# cat /proc/interrupts | grep "timer" ;
time ./tightloop_short ; cat /proc/interrupts | grep "timer"
0: 36 0 IO-APIC-edge timer
LOC: 8008 7453 Local timer interrupts
real 0m30.377s
user 0m30.370s
sys 0m0.000s
0: 36 0 IO-APIC-edge timer
LOC: 11049 10493 Local timer interrupts
Here we see that we had a total of ~3000 interrupts. In this case the
system was non-idle and hence the APIC was programmed in periodic mode.
The tightloop program only does this
int main()
{
unsigned long long count = 0;	/* must be initialised */
while (count++ < 5999999999UL);
return 0;
}
If I do the same experiments on a HZ=1000 kernel I see that the number
of interrupts would rise to 30000 in the second case.
I did check that the "apic_timer_irqs" counter - that is read from the
proc file - is updated only from smp_apic_timer_interrupt code path, so
this can't be a interrupt accounting bug.
In short, I don't believe that HZ and timer frequency are unrelated
nowadays; please correct me if I am missing anything here.
>
> > on a high timer frequency. If you see problems with Desktop feel and
> > responsiveness don't you think there would be other problem which might
> > be causing that ? Your argument about the "desktop feel and
> > responsiveness" doesn't explain what actual problem did you see.
>
> People spent months poking at the differences before HZ=1000 became the
> default. It wasn't due for amusement values - but this is irrelevant
> anyway on a modern kernel as HZ=1000 is simply a precision setting that
> affects things like poll()
>
> HZ on a tickless system has no meaningful relationship to wakup rates -
> which are what I assume you actually care about.
Yes I care about the wakeup rates and as explained above HZ does affect
that.
>
> So do you want to change the precision of poll() and other
> functionality ? or do you want to change the wakeup rates and
> corresponding virtualisation overhead ?
>
> If the latter then HZ is not the thing to touch.
>
> What are you *actually* trying to achieve ?
> What measurements have you done that make you think HZ is relevant in a
> tickless kernel ?
>
I hope all these questions are answered above.
Thanks,
Alok
>
> Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 22:33 ` Alan Cox
2009-05-05 23:37 ` Alok Kataria
@ 2009-05-07 14:09 ` Christoph Lameter
2009-05-07 15:12 ` Alan Cox
1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 14:09 UTC (permalink / raw)
To: Alan Cox
Cc: akataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Tue, 5 May 2009, Alan Cox wrote:
> HZ on a tickless system has no meaningful relationship to wakup rates -
> which are what I assume you actually care about.
Linux is not tickless. It only switches off ticks if a processor is idle.
> So do you want to change the precision of poll() and other
> functionality ? or do you want to change the wakeup rates and
> corresponding virtualisation overhead ?
select and poll use timeouts based on high resolution timers.
> What measurements have you done that make you think HZ is relevant in a
> tickless kernel ?
Just reading the code gets you there.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 14:09 ` Christoph Lameter
@ 2009-05-07 15:12 ` Alan Cox
0 siblings, 0 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-07 15:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: akataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 7 May 2009 10:09:56 -0400 (EDT)
Christoph Lameter <cl@linux.com> wrote:
> On Tue, 5 May 2009, Alan Cox wrote:
>
> > HZ on a tickless system has no meaningful relationship to wakup rates -
> > which are what I assume you actually care about.
>
> Linux is not tickless. It only switches off ticks if a processor is idle.
Hooray - finally someone admits the *real* problem here, and for power
management too. Otherwise known as "referencing jiffies as a variable must
die"
Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 21:21 ` H. Peter Anvin
2009-05-05 21:44 ` Alan Cox
@ 2009-05-05 21:57 ` Alok Kataria
2009-05-07 14:13 ` Christoph Lameter
2009-05-07 16:35 ` Chris Snook
1 sibling, 2 replies; 65+ messages in thread
From: Alok Kataria @ 2009-05-05 21:57 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Tue, 2009-05-05 at 14:21 -0700, H. Peter Anvin wrote:
> Alok Kataria wrote:
> > Hi,
> >
> > Given that there were no major objections that came up regarding
> > reducing the HZ value in http://lkml.org/lkml/2009/4/27/499.
> >
> > Below is the patch which actually reduces it, please consider for tip.
> >
>
> What is the benefit of this?
I did some experiments on linux 2.6.29 guests running on VMware and
noticed that the number of timer interrupts could cause some slowdown of
the total throughput of the system.
A simple tight loop experiment showed that with HZ=1000 we took about
264sec to complete the loop and that same loop took about 255sec with
HZ=100.
You can find more information here http://lkml.org/lkml/2009/4/28/401
And with HRT I don't see any downsides in terms of increased latencies
for device timers or anything of that sort.
>
> I can see at least one immediate downside: some timeout values in the
> kernel are still maintained in units of HZ (like poll, I believe), and
> so with a lower HZ value we'll have higher roundoff errors.
If that really is such a big problem, shouldn't we think about moving to
schedule_hrtimeout for such cases rather than relying on jiffy-based
timeouts?
The hrtimer explanation over here, http://www.tglx.de/hrtimers.html,
also talks about where these HZ (timer wheel) based timeouts should be
used, and that they shouldn't really be dependent on accurate timing.
Also the default HZ value was 250 before this commit
commit 5cb04df8d3f03e37a19f2502591a84156be71772
x86: defconfig updates
And it was 250 for a very long time before that too. The commit log
doesn't explain why the value was bumped up either.
Thanks,
Alok
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 21:57 ` Alok Kataria
@ 2009-05-07 14:13 ` Christoph Lameter
2009-05-07 15:14 ` Alan Cox
2009-05-07 17:07 ` Peter Zijlstra
2009-05-07 16:35 ` Chris Snook
1 sibling, 2 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 14:13 UTC (permalink / raw)
To: Alok Kataria
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
I think we need to reduce the general tick frequency to be as low as
possible. With high resolution timers the tick frequency is just the
frequency with which the timer interrupt disturbs a running application.
Are there any benefits remaining from frequent timer interrupts? I would
think that 60 HZ would be sufficient.
It would be good if the kernel were truly tickless. Scheduler events
would be driven by the scheduling intervals and not the invocations of
the scheduler softirq.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 14:13 ` Christoph Lameter
@ 2009-05-07 15:14 ` Alan Cox
2009-05-07 15:20 ` Christoph Lameter
` (2 more replies)
2009-05-07 17:07 ` Peter Zijlstra
1 sibling, 3 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-07 15:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 7 May 2009 10:13:52 -0400 (EDT)
Christoph Lameter <cl@linux.com> wrote:
> I think we need to reduce the general tick frequency to be as low as
> possible. With high resolution timers the tick frequency is just the
> frequency with which the timer interrupt disturbs a running application.
>
> Are there any benefits remaining from frequent timer interrupts? I would
> think that 60 HZ would be sufficient.
50 works for various European video apps, 60 breaks; 60 works for various
US video apps, 50 breaks. Now that may have changed with all the select
stuff being hrtimer based (which I'd missed).
The tick also still appears to be involved in ntp and in cpu stats where
a 50Hz tick would mean only 25Hz accuracy on CPU usage etc
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 15:14 ` Alan Cox
@ 2009-05-07 15:20 ` Christoph Lameter
2009-05-07 15:30 ` H. Peter Anvin
2009-05-07 16:37 ` Alok Kataria
2 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 15:20 UTC (permalink / raw)
To: Alan Cox
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 7 May 2009, Alan Cox wrote:
> The tick also still appears to be involved in ntp and in cpu stats where
> a 50Hz tick would mean only 25Hz accuracy on CPU usage etc
That could be fixed. If a thread is running then it's using the processor,
and that does not change as long as it runs. That alone is enough to
calculate cpu usage.
Only if we want to examine the cpu state periodically while its running
(profiling) then we would have a justification for interrupting the
thread.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 15:14 ` Alan Cox
2009-05-07 15:20 ` Christoph Lameter
@ 2009-05-07 15:30 ` H. Peter Anvin
2009-05-07 15:40 ` Christoph Lameter
2009-05-07 16:55 ` Jeff Garzik
2009-05-07 16:37 ` Alok Kataria
2 siblings, 2 replies; 65+ messages in thread
From: H. Peter Anvin @ 2009-05-07 15:30 UTC (permalink / raw)
To: Alan Cox
Cc: Christoph Lameter, Alok Kataria, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
Alan Cox wrote:
> On Thu, 7 May 2009 10:13:52 -0400 (EDT)
> Christoph Lameter <cl@linux.com> wrote:
>
>> I think we need to reduce the general tick frequency to be as low as
>> possible. With high resolution timers the tick frequency is just the
>> frequency with which the timer interrupt disturbs a running application.
>>
>> Are there any benefits remaining from frequent timer interrupts? I would
>> think that 60 HZ would be sufficient.
>
> 50 works for various european video apps, 60 breaks, 60 works for various
> US video apps, 50 breaks. Now that may have changed with all the select
> stuff being hrtimer based (which I'd missed).
Hence 300 Hz. ;)
> Hooray - finally someone admits the *real* problem here, and for power
> management too. Otherwise known as "referencing jiffies as a variable must
> die"
Amen. Also, "using HZ as a unit of measurement must die, too."
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 15:30 ` H. Peter Anvin
@ 2009-05-07 15:40 ` Christoph Lameter
2009-05-07 16:55 ` Jeff Garzik
1 sibling, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 15:40 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alan Cox, Alok Kataria, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 7 May 2009, H. Peter Anvin wrote:
> Amen. Also, "using HZ as a unit of measurement must die, too."
Don't think that is a problem if we had a convention of seeing 1 HZ as an
interval of 1 ms or so.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 15:30 ` H. Peter Anvin
2009-05-07 15:40 ` Christoph Lameter
@ 2009-05-07 16:55 ` Jeff Garzik
2009-05-07 17:09 ` Alan Cox
1 sibling, 1 reply; 65+ messages in thread
From: Jeff Garzik @ 2009-05-07 16:55 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Alan Cox, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
H. Peter Anvin wrote:
> Alan Cox wrote:
>> Hooray - finally someone admits the *real* problem here, and for power
>> management too. Otherwise known as "referencing jiffies as a variable must
>> die"
>
> Amen. Also, "using HZ as a unit of measurement must die, too."
Love to -- now, what will it be replaced with?
grep for 'deadline' in drivers/ata/libata* to find an example not so
easily converted away from jiffies.
Jeff
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 16:55 ` Jeff Garzik
@ 2009-05-07 17:09 ` Alan Cox
2009-05-07 17:55 ` Jeff Garzik
0 siblings, 1 reply; 65+ messages in thread
From: Alan Cox @ 2009-05-07 17:09 UTC (permalink / raw)
To: Jeff Garzik
Cc: H. Peter Anvin, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
On Thu, 07 May 2009 12:55:05 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> H. Peter Anvin wrote:
> > Alan Cox wrote:
> >> Hooray - finally someone admits the *real* problem here, and for power
> >> management too. Otherwise known as "referencing jiffies as a variable must
> >> die"
> >
> > Amen. Also, "using HZ as a unit of measurement must die, too."
>
> Love to -- now, what will it be replaced with?
>
> grep for 'deadline' in drivers/ata/libata* to find an example not so
> easily converted away from jiffies.
I don't see any.
I do see a complicated interface that appears to actually really want to
implement
add_timer(&foo->expiry_timer);
and checks against the timer completing. In fact it looks as if all the
stuff in there is really down to
add a timer
check if it expired
check how long until it expires
delete it
And you might as well measure that in HZ=1000 units, better known as
"milliseconds".
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:09 ` Alan Cox
@ 2009-05-07 17:55 ` Jeff Garzik
2009-05-07 19:51 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: Jeff Garzik @ 2009-05-07 17:55 UTC (permalink / raw)
To: Alan Cox
Cc: H. Peter Anvin, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
Alan Cox wrote:
> On Thu, 07 May 2009 12:55:05 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> H. Peter Anvin wrote:
>>> Alan Cox wrote:
>>>> Hooray - finally someone admits the *real* problem here, and for power
>>>> management too. Otherwise known as "referencing jiffies as a variable must
>>>> die"
>>> Amen. Also, "using HZ as a unit of measurement must die, too."
>> Love to -- now, what will it be replaced with?
>>
>> grep for 'deadline' in drivers/ata/libata* to find an example not so
>> easily converted away from jiffies.
>
> I don't see any.
>
> I do see a complicated interface that appears to actually really want to
> implement
>
> add_timer(&foo->expiry_timer);
>
> and checks against the timer completing. In fact it looks as if all the
> stuff in there is really down to
>
> add a timer
> check if it expired
> check how long until it expires
> delete it
This is why I mentioned this example... because it's not as easy as you
seem to think it is :)
We care only about a decreasing time interval. This interval is passed
to register polling functions (bitbang no longer than <this> amount of
time), as well as _cumulatively_ affecting the entire EH [sub-]process.
A timer-based solution, in addition to being an ugly hack, would imply
replacing a simple variable with _at least_ two spinlocks, plus a timer
callback function that simply says "I expired". With loops such as
	max_msecs = calc_deadline(overall_deadline, ...)
	while (!(register & bit)) {
		msleep(1)
		max_msecs--
		register = readl(...)
	}
must be converted to the more-complex timer-based solution.
libata would be happy to use milliseconds rather than jiffies; the unit
does not matter. What matters is calculating our progress versus the
clock tick, as spread across multiple functions, multiple contexts, and
register polling loops.
The current code is a -lot- more simple than checking "is timer
expired?" all over the code, given that any sort of timer-based function
implies dealing with additional concurrency issues -- a complication the
libata EH does not need.
Jeff
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:55 ` Jeff Garzik
@ 2009-05-07 19:51 ` Alan Cox
2009-05-07 20:03 ` Jeff Garzik
0 siblings, 1 reply; 65+ messages in thread
From: Alan Cox @ 2009-05-07 19:51 UTC (permalink / raw)
To: Jeff Garzik
Cc: H. Peter Anvin, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
> We care only about a decreasing time interval. This interval is passed
> to register polling functions (bitbang no longer than <this> amount of
> time), as well as _cumulatively_ affecting the entire EH [sub-]process.
Yes, so it's a simple timer.
> A timer-based solution, in addition to being an ugly hack, would imply
> replacing a simple variable with _at least_ two spinlocks, plus a timer
> callback function that simply says "I expired". With loops such as
That simple variable is incredibly, incredibly expensive in power, in CPU
use, in cache poisoning and more - it's very misleading to think of it as
"free".
> The current code is a -lot- more simple than checking "is timer
> expired?" all over the code, given that any sort of timer-based function
> implies dealing with additional concurrency issues -- a complication the
> libata EH does not need.
I disagree - your implementation seems very very ugly. But even then it's
a case of swapping jiffies for jiffies() for unconverted code and using
that function to read the timer value from somewhere. The important thing
is that asking the time is active, and we don't burn processor time and
wakeups and power going "let's wake up, let's turn the cache on, let's load
some cache lines, let's increment a variable, and poke some lines on the
bus waking up bits of other logic, now let's write the cache back out, and
go back to sleep for a tiny amount of time".
jiffies is *really* expensive...
Even a
jiffies = time_count_begin();
blah blah with jiffies;
time_count_end();
would help. But that seems a good way to add bugs.
However you don't need HZ - you are using it to implement state changes
and you can do that really easily by making your timer switch a single
value to the correct state for the time and in your loops just checking
	while (ap->reset_state == WIBBLING_RANDOMLY) {
		wobble = readl(ap->ioaddr.wibble);
		mdelay(1);
		if (wobble & WIBBLED) {
			...
		}
	}
Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 19:51 ` Alan Cox
@ 2009-05-07 20:03 ` Jeff Garzik
2009-05-07 20:30 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: Jeff Garzik @ 2009-05-07 20:03 UTC (permalink / raw)
To: Alan Cox
Cc: H. Peter Anvin, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
Alan Cox wrote:
> jiffies is *really* expensive...
Certainly.
> However you don't need HZ
Thus I said "libata would be happy to use milliseconds rather than
jiffies; the unit does not matter."
All crucial constants and variables are already in useful units, and we
are forced to convert them to jiffies, and test against 'jiffies' global
var, because the API does not provide other reasonable alternatives.
Pretty much all users of time -- including users of timers -- are forced
to convert from a useful unit of time to jiffies, because that's what
the API requires.
> and you can do that really easily by making your timer switch a single
> value to the correct state for the time and in your loops just checking
>
> while ( ap->reset_state == WIBBLING_RANDOMLY) {
> wobble = readl(ap->ioaddr.wibble);
> mdelay(1);
> if (wobble & WIBBLED) {
> ...
> }
> }
And when the deadline changes, you need mod_timer rather than a simple
non-global-variable increment.
And when you consider all the bits of state, spread across not only
libata core __but drivers as well__, a timer-based solution gets even
uglier.
But you're still missing the point. jiffies are just a unit, a unit
forced upon us by all the time-based kernel API functions that take
jiffies rather than usec or msec.
I would rather fix the API, than investigate hacky solutions that
complicate the code even more.
Jeff
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 20:03 ` Jeff Garzik
@ 2009-05-07 20:30 ` Alan Cox
0 siblings, 0 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-07 20:30 UTC (permalink / raw)
To: Jeff Garzik
Cc: H. Peter Anvin, Christoph Lameter, Alok Kataria, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
> I would rather fix the API, than investigate hacky solutions that
> complicate the code even more.
The fundamental problem is that the API that is wrong is "jiffies". The
false notion that there is some magical free timer variable.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 15:14 ` Alan Cox
2009-05-07 15:20 ` Christoph Lameter
2009-05-07 15:30 ` H. Peter Anvin
@ 2009-05-07 16:37 ` Alok Kataria
2 siblings, 0 replies; 65+ messages in thread
From: Alok Kataria @ 2009-05-07 16:37 UTC (permalink / raw)
To: Alan Cox
Cc: Christoph Lameter, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 2009-05-07 at 08:14 -0700, Alan Cox wrote:
> On Thu, 7 May 2009 10:13:52 -0400 (EDT)
> Christoph Lameter <cl@linux.com> wrote:
>
> > I think we need to reduce the general tick frequency to be as low as
> > possible. With high resolution timers the tick frequency is just the
> > frequency with which the timer interrupt disturbs a running application.
> >
> > Are there any benefits remaining from frequent timer interrupts? I would
> > think that 60 HZ would be sufficient.
>
50 works for various European video apps, 60 breaks; 60 works for various
> US video apps, 50 breaks. Now that may have changed with all the select
> stuff being hrtimer based (which I'd missed).
I would have assumed whatever timeout mechanism the video apps use
should have already been converted to hrtimers? Or are you saying that
they use the select stuff, which is already hrtimer-based, so there
shouldn't be any problem now for video apps?
>
> The tick also still appears to be involved in ntp and in cpu stats where
> a 50Hz tick would mean only 25Hz accuracy on CPU usage etc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 14:13 ` Christoph Lameter
2009-05-07 15:14 ` Alan Cox
@ 2009-05-07 17:07 ` Peter Zijlstra
2009-05-07 17:13 ` Peter Zijlstra
2009-05-07 17:19 ` Christoph Lameter
1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> I think we need to reduce the general tick frequency to be as low as
> possible. With high resolution timers the tick frequency is just the
> frequency with which the timer interrupt disturbs a running application.
>
> Are there any benefits remaining from frequent timer interrupts? I would
> think that 60 HZ would be sufficient.
>
> It would be good if the kernel would be truly tickless. Scheduler events
> would be driven by the scheduling intervals and not the invokations of the
> scheduler softirq.
The only thing that's driven by the softirq is load-balancing, there's
way more to the scheduler-tick than kicking that thing awake every so
often.
The problem is that running the scheduler off of hrtimers is too
expensive. We have the code, we tried it, people complained.
Another random user that relies on the jiffy tick is
CLOCK_THREAD_CPUTIME_ID posix timers, although I'm planning to convert
that to hrtimers some time in the future.
We also use the scheduler tick to generate a somewhat coupled time
source from flaky TSCs -- reducing HZ decreases the accuracy. This is
something only fixable in hardware by providing a proper (and cheap)
high resolution clock source -- nehalem class machines have such a
thing, provided you stick to one (maybe two) sockets [s390, ppc64 and
sparc64 also rule].
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:07 ` Peter Zijlstra
@ 2009-05-07 17:13 ` Peter Zijlstra
2009-05-07 17:18 ` Peter Zijlstra
2009-05-07 17:18 ` Christoph Lameter
2009-05-07 17:19 ` Christoph Lameter
1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 19:09 +0200, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> > I think we need to reduce the general tick frequency to be as low as
> > possible. With high resolution timers the tick frequency is just the
> > frequency with which the timer interrupt disturbs a running application.
> >
> > Are there any benefits remaining from frequent timer interrupts? I would
> > think that 60 HZ would be sufficient.
> >
> > It would be good if the kernel would be truly tickless. Scheduler events
> > would be driven by the scheduling intervals and not the invokations of the
> > scheduler softirq.
>
> The only thing that's driven by the softirq is load-balancing, there's
> way more to the scheduler-tick than kicking that thing awake every so
> often.
>
> The problem is that running the scheduler of off hrtimers is too
> expensive. We have the code, we tried it, people complained.
Therefore, if we decreased the HZ value to, say, 50, we'd get a minimum
involuntary preemption granularity of 20ms, something on the high end of
barely usable.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:13 ` Peter Zijlstra
@ 2009-05-07 17:18 ` Peter Zijlstra
2009-05-07 17:20 ` Christoph Lameter
2009-05-07 17:36 ` Paul E. McKenney
2009-05-07 17:18 ` Christoph Lameter
1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk,
Paul E. McKenney
On Thu, 2009-05-07 at 19:13 +0200, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 19:09 +0200, Peter Zijlstra wrote:
> > On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> > > I think we need to reduce the general tick frequency to be as low as
> > > possible. With high resolution timers the tick frequency is just the
> > > frequency with which the timer interrupt disturbs a running application.
> > >
> > > Are there any benefits remaining from frequent timer interrupts? I would
> > > think that 60 HZ would be sufficient.
> > >
> > > It would be good if the kernel would be truly tickless. Scheduler events
> > > would be driven by the scheduling intervals and not the invokations of the
> > > scheduler softirq.
> >
> > The only thing that's driven by the softirq is load-balancing, there's
> > way more to the scheduler-tick than kicking that thing awake every so
> > often.
> >
> > The problem is that running the scheduler of off hrtimers is too
> > expensive. We have the code, we tried it, people complained.
>
> Therefore, decreasing the HZ value to say 50, we'd get a minimum
> involuntary preemption granularity of 20ms, something on the high end of
> barely usable.
Another user is RCU: the grace period is tick driven, and growing these
ticks by a factor of 50 or so might require some tinkering with forced
grace periods when we notice our batch queues getting too long.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:18 ` Peter Zijlstra
@ 2009-05-07 17:20 ` Christoph Lameter
2009-05-07 17:39 ` Peter Zijlstra
2009-05-07 17:54 ` Paul E. McKenney
2009-05-07 17:36 ` Paul E. McKenney
1 sibling, 2 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk,
Paul E. McKenney
On Thu, 7 May 2009, Peter Zijlstra wrote:
> Another user is RCU, the grace period is tick driven, growing these
> ticks by a factor 50 or so might require some tinkering with forced
> grace periods when we notice our batch queues getting too long.
One could also schedule RCU via hrtimers with a large fuzz period?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:20 ` Christoph Lameter
@ 2009-05-07 17:39 ` Peter Zijlstra
2009-05-07 17:40 ` Christoph Lameter
2009-05-07 17:54 ` Paul E. McKenney
1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk,
Paul E. McKenney
On Thu, 2009-05-07 at 13:20 -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > Another user is RCU, the grace period is tick driven, growing these
> > ticks by a factor 50 or so might require some tinkering with forced
> > grace periods when we notice our batch queues getting too long.
>
> One could also schedule RCU via hrtimers with a large fuzz period?
No, that's not the point: the longer these periods are, the more
callbacks you can accumulate in a period. You need a cap on the callback
list; we have already seen DoS scenarios in this space.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:39 ` Peter Zijlstra
@ 2009-05-07 17:40 ` Christoph Lameter
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk,
Paul E. McKenney
On Thu, 7 May 2009, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 13:20 -0400, Christoph Lameter wrote:
> > On Thu, 7 May 2009, Peter Zijlstra wrote:
> >
> > > Another user is RCU, the grace period is tick driven, growing these
> > > ticks by a factor 50 or so might require some tinkering with forced
> > > grace periods when we notice our batch queues getting too long.
> >
> > One could also schedule RCU via hrtimers with a large fuzz period?
>
> No, that's not the point, the longer these period are, the more
> callbacks you can accumulate in a period. You need a cap on the callback
> list, we already have seen DoS scenarios in this space.
At some point RCU must run, and if the callback list gets too long the
callbacks must be processed. That's unavoidable.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:20 ` Christoph Lameter
2009-05-07 17:39 ` Peter Zijlstra
@ 2009-05-07 17:54 ` Paul E. McKenney
2009-05-07 17:51 ` Christoph Lameter
1 sibling, 1 reply; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-07 17:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, May 07, 2009 at 01:20:29PM -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > Another user is RCU, the grace period is tick driven, growing these
> > ticks by a factor 50 or so might require some tinkering with forced
> > grace periods when we notice our batch queues getting too long.
>
> One could also schedule RCU via hrtimers with a large fuzz period?
You could, but then you would still have a periodic interrupt introducing
jitter into your HPC workload. The approach I suggested allows RCU to be
happy with no periodic interrupts on any CPU that has only one runnable
task that is a CPU-bound user-level task (in addition to the idle task,
of course).
Thanx, Paul
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:54 ` Paul E. McKenney
@ 2009-05-07 17:51 ` Christoph Lameter
2009-05-07 19:51 ` Paul E. McKenney
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:51 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Paul E. McKenney wrote:
> On Thu, May 07, 2009 at 01:20:29PM -0400, Christoph Lameter wrote:
> > On Thu, 7 May 2009, Peter Zijlstra wrote:
> >
> > > Another user is RCU, the grace period is tick driven, growing these
> > > ticks by a factor 50 or so might require some tinkering with forced
> > > grace periods when we notice our batch queues getting too long.
> >
> > One could also schedule RCU via hrtimers with a large fuzz period?
>
> You could, but then you would still have a periodic interrupt introducing
> jitter into your HPC workload. The approach I suggested allows RCU to be
> happy with no periodic interrupts on any CPU that has only one runnable
> task that is a CPU-bound user-level task (in addition to the idle task,
> of course).
Sounds good.
An HPC workload typically has minimal kernel interaction. RCU would
only need to run once and then the system would be quiet.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:51 ` Christoph Lameter
@ 2009-05-07 19:51 ` Paul E. McKenney
0 siblings, 0 replies; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-07 19:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk, anton
On Thu, May 07, 2009 at 01:51:58PM -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Paul E. McKenney wrote:
>
> > On Thu, May 07, 2009 at 01:20:29PM -0400, Christoph Lameter wrote:
> > > On Thu, 7 May 2009, Peter Zijlstra wrote:
> > >
> > > > Another user is RCU, the grace period is tick driven, growing these
> > > > ticks by a factor 50 or so might require some tinkering with forced
> > > > grace periods when we notice our batch queues getting too long.
> > >
> > > One could also schedule RCU via hrtimers with a large fuzz period?
> >
> > You could, but then you would still have a periodic interrupt introducing
> > jitter into your HPC workload. The approach I suggested allows RCU to be
> > happy with no periodic interrupts on any CPU that has only one runnable
> > task that is a CPU-bound user-level task (in addition to the idle task,
> > of course).
>
> Sounds good.
>
> An HPC workload typically has minimal kernel interaction. RCU would
> only need to run once and then the system would be quiet.
Peter Z's post leads me to believe that there might be dragons in
this approach that I am blissfully unaware of. However, here is what
would have to happen from an RCU perspective, in case it helps:
o This new mode needs to imply CONFIG_NO_HZ.
o When a given CPU is transitioning into tickless mode, invoke
rcu_enter_nohz(). This already happens for dynticks-idle,
this would be a dynticks-CPU-bound-usermode-task.
Note that CONFIG_NO_HZ kernels already invoke rcu_enter_nohz()
from tick_nohz_stop_sched_tick(), and many of the things in
tick_nohz_stop_sched_tick() would need to be done in this case
as well.
o When a given CPU is transitioning out of tickless mode, invoke
rcu_exit_nohz(). Again, this already happens for dynticks-idle.
Note that CONFIG_NO_HZ kernels already invoke rcu_exit_nohz()
from tick_nohz_restart_sched_tick(), which does other stuff that
would be required in your case as well.
o When a given CPU in tickless mode transitions into the kernel
via a system call or trap, invoke rcu_irq_enter(). Note that
rcu_irq_enter() is already invoked on irq entry if CONFIG_NO_HZ.
NMIs are also already handled via rcu_nmi_enter().
o When a given CPU in tickless mode transitions out of the kernel
from a system call or trap, invoke rcu_irq_exit(). Note that
rcu_irq_exit() is already invoked on irq exit if CONFIG_NO_HZ.
NMIs are also already handled via rcu_nmi_exit().
Then RCU would know that any CPU running a CPU-bound user-mode task
need not be consulted when working out when a grace period ends, since
user-mode code cannot contain kernel-mode RCU read-side critical sections.
Thanx, Paul
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:18 ` Peter Zijlstra
2009-05-07 17:20 ` Christoph Lameter
@ 2009-05-07 17:36 ` Paul E. McKenney
2009-05-07 17:38 ` Peter Zijlstra
1 sibling, 1 reply; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-07 17:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, May 07, 2009 at 07:18:38PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 19:13 +0200, Peter Zijlstra wrote:
> > On Thu, 2009-05-07 at 19:09 +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> > > > I think we need to reduce the general tick frequency to be as low as
> > > > possible. With high resolution timers the tick frequency is just the
> > > > frequency with which the timer interrupt disturbs a running application.
> > > >
> > > > Are there any benefits remaining from frequent timer interrupts? I would
> > > > think that 60 HZ would be sufficient.
> > > >
> > > > It would be good if the kernel would be truly tickless. Scheduler events
> > > > would be driven by the scheduling intervals and not the invokations of the
> > > > scheduler softirq.
> > >
> > > The only thing that's driven by the softirq is load-balancing, there's
> > > way more to the scheduler-tick than kicking that thing awake every so
> > > often.
> > >
> > > The problem is that running the scheduler of off hrtimers is too
> > > expensive. We have the code, we tried it, people complained.
> >
> > Therefore, decreasing the HZ value to say 50, we'd get a minimum
> > involuntary preemption granularity of 20ms, something on the high end of
> > barely usable.
>
> Another user is RCU, the grace period is tick driven, growing these
> ticks by a factor 50 or so might require some tinkering with forced
> grace periods when we notice our batch queues getting too long.
One approach would be to enter nohz mode when running a CPU-bound
application on a CPU that had nothing else (other than the idle task)
on its runqueue and for which rcu_needs_cpu() returns zero. In this
mode, RCU would need to be informed on each system call, perhaps with an
rcu_kernel_enter() and rcu_kernel_exit() that work like rcu_irq_enter()
and rcu_irq_exit() -- and that perhaps replace rcu_irq_enter() and
rcu_irq_exit().
Then RCU would ignore any CPU that was executing a CPU-bound application,
allowing the HZ to be dialed down as low as you like, or perhaps really
entering something like nohz mode.
Thanx, Paul
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:36 ` Paul E. McKenney
@ 2009-05-07 17:38 ` Peter Zijlstra
2009-05-07 18:01 ` Paul E. McKenney
0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:38 UTC (permalink / raw)
To: paulmck
Cc: Christoph Lameter, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 10:36 -0700, Paul E. McKenney wrote:
> On Thu, May 07, 2009 at 07:18:38PM +0200, Peter Zijlstra wrote:
> > On Thu, 2009-05-07 at 19:13 +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-05-07 at 19:09 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> > > > > I think we need to reduce the general tick frequency to be as low as
> > > > > possible. With high resolution timers the tick frequency is just the
> > > > > frequency with which the timer interrupt disturbs a running application.
> > > > >
> > > > > Are there any benefits remaining from frequent timer interrupts? I would
> > > > > think that 60 HZ would be sufficient.
> > > > >
> > > > > It would be good if the kernel would be truly tickless. Scheduler events
> > > > > would be driven by the scheduling intervals and not the invokations of the
> > > > > scheduler softirq.
> > > >
> > > > The only thing that's driven by the softirq is load-balancing, there's
> > > > way more to the scheduler-tick than kicking that thing awake every so
> > > > often.
> > > >
> > > > The problem is that running the scheduler of off hrtimers is too
> > > > expensive. We have the code, we tried it, people complained.
> > >
> > > Therefore, decreasing the HZ value to say 50, we'd get a minimum
> > > involuntary preemption granularity of 20ms, something on the high end of
> > > barely usable.
> >
> > Another user is RCU, the grace period is tick driven, growing these
> > ticks by a factor 50 or so might require some tinkering with forced
> > grace periods when we notice our batch queues getting too long.
>
> One approach would be to enter nohz mode when running a CPU-bound
> application on a CPU that had nothing else (other than the idle task)
> on its runqueue and for which rcu_needs_cpu() returns zero. In this
> mode, RCU would need to be informed on each system call, perhaps with an
> rcu_kernel_enter() and rcu_kernel_exit() that work like rcu_irq_enter()
> and rcu_irq_exit() -- and that perhaps replace rcu_irq_enter() and
> rcu_irq_exit().
>
> Then RCU would ignore any CPU that was executing a CPU-bound application,
> allowing the HZ to be dialed down as low as you like, or perhaps really
> entering something like nohz mode.
Which would make syscalls more expensive, not something you'd want to
do :-)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:38 ` Peter Zijlstra
@ 2009-05-07 18:01 ` Paul E. McKenney
2009-05-07 18:12 ` Christoph Lameter
2009-05-08 10:32 ` Peter Zijlstra
0 siblings, 2 replies; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-07 18:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, May 07, 2009 at 07:38:24PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 10:36 -0700, Paul E. McKenney wrote:
> > On Thu, May 07, 2009 at 07:18:38PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-05-07 at 19:13 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2009-05-07 at 19:09 +0200, Peter Zijlstra wrote:
> > > > > On Thu, 2009-05-07 at 10:13 -0400, Christoph Lameter wrote:
> > > > > > I think we need to reduce the general tick frequency to be as low as
> > > > > > possible. With high resolution timers the tick frequency is just the
> > > > > > frequency with which the timer interrupt disturbs a running application.
> > > > > >
> > > > > > Are there any benefits remaining from frequent timer interrupts? I would
> > > > > > think that 60 HZ would be sufficient.
> > > > > >
> > > > > > It would be good if the kernel would be truly tickless. Scheduler events
> > > > > > would be driven by the scheduling intervals and not the invokations of the
> > > > > > scheduler softirq.
> > > > >
> > > > > The only thing that's driven by the softirq is load-balancing, there's
> > > > > way more to the scheduler-tick than kicking that thing awake every so
> > > > > often.
> > > > >
> > > > > The problem is that running the scheduler of off hrtimers is too
> > > > > expensive. We have the code, we tried it, people complained.
> > > >
> > > > Therefore, decreasing the HZ value to say 50, we'd get a minimum
> > > > involuntary preemption granularity of 20ms, something on the high end of
> > > > barely usable.
> > >
> > > Another user is RCU, the grace period is tick driven, growing these
> > > ticks by a factor 50 or so might require some tinkering with forced
> > > grace periods when we notice our batch queues getting too long.
> >
> > One approach would be to enter nohz mode when running a CPU-bound
> > application on a CPU that had nothing else (other than the idle task)
> > on its runqueue and for which rcu_needs_cpu() returns zero. In this
> > mode, RCU would need to be informed on each system call, perhaps with an
> > rcu_kernel_enter() and rcu_kernel_exit() that work like rcu_irq_enter()
> > and rcu_irq_exit() -- and that perhaps replace rcu_irq_enter() and
> > rcu_irq_exit().
> >
> > Then RCU would ignore any CPU that was executing a CPU-bound application,
> > allowing the HZ to be dialed down as low as you like, or perhaps really
> > entering something like nohz mode.
>
> Which would make syscall more expensive, not something you'd want to
> do :-)
In general, I agree. However, in the case where you have a single
CPU-bound task running in user mode, you don't care that much about
syscall performance. So, yes, this would mean having yet another config
variable that users running big CPU-bound scientific applications would
need to worry about, which is not perfect either.
For whatever it is worth, the added overhead on entry would be something
like the following:
void rcu_irq_enter(void)
{
	struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

	if (rdtp->dynticks_nesting++)
		return;
	rdtp->dynticks++;
	WARN_ON_RATELIMIT(!(rdtp->dynticks & 0x1), &rcu_rs);
	smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
}

On exit, a bit more:

void rcu_irq_exit(void)
{
	struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

	if (--rdtp->dynticks_nesting)
		return;
	smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
	rdtp->dynticks++;
	WARN_ON_RATELIMIT(rdtp->dynticks & 0x1, &rcu_rs);

	/* If the interrupt queued a callback, get out of dyntick mode. */
	if (__get_cpu_var(rcu_data).nxtlist ||
	    __get_cpu_var(rcu_bh_data).nxtlist)
		set_need_resched();
}
But I could move the callback check into call_rcu(), which would get the
overhead of rcu_irq_exit() down to about that of rcu_irq_enter().
Thanx, Paul
^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 18:01 ` Paul E. McKenney
@ 2009-05-07 18:12 ` Christoph Lameter
2009-05-07 19:06 ` Paul E. McKenney
2009-05-07 19:53 ` Alan Cox
2009-05-08 10:32 ` Peter Zijlstra
1 sibling, 2 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 18:12 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
To get back to the main point: I agree with Alok that the default HZ needs
to be as low as possible. The remaining justification is the load
balancing and possible context switching between multiple tasks contending
for a processor.
Is it enough if this occurs 100 times per second?
If so then we should change the default in kernel/Kconfig.hz.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 18:12 ` Christoph Lameter
@ 2009-05-07 19:06 ` Paul E. McKenney
2009-05-07 19:53 ` Alan Cox
1 sibling, 0 replies; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-07 19:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, May 07, 2009 at 02:12:15PM -0400, Christoph Lameter wrote:
> To get back to the main point: I agree with Alok that the default HZ needs
> to be as low as possible. The remaining justification is the load
> balancing and possible context switching between multiple tasks contenting
> for a processor.
>
> Is it enough if this occurs 100 times per second?
I am sorry, I got into solution mode too quickly in my earlier posts.
As long as you have enough memory for the callbacks, and as long
as you don't mind things like netfilter changes taking longer, RCU
doesn't much care what value HZ has.
> If so then we should change the default in kernel/Kconfig.hz.
But some workloads and users do care deeply about memory size and
netfilter-change speed, of course. So perhaps HPC workloads need
a different default HZ value.
Thanx, Paul
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 18:12 ` Christoph Lameter
2009-05-07 19:06 ` Paul E. McKenney
@ 2009-05-07 19:53 ` Alan Cox
2009-05-07 19:56 ` Christoph Lameter
1 sibling, 1 reply; 65+ messages in thread
From: Alan Cox @ 2009-05-07 19:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Paul E. McKenney, Peter Zijlstra, Alok Kataria, H. Peter Anvin,
Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML
On Thu, 7 May 2009 14:12:15 -0400 (EDT)
Christoph Lameter <cl@linux.com> wrote:
> To get back to the main point: I agree with Alok that the default HZ needs
> to be as low as possible. The remaining justification is the load
> balancing and possible context switching between multiple tasks contenting
> for a processor.
>
> Is it enough if this occurs 100 times per second?
I would like to see evidence that it is, and by evidence I don't mean
"our virtual machine manager runs better" but good evidence that it isn't
affecting performance of any games, applications, video suites etc -
because we know it did in the past.
Even then it's pointlessly jumping up and down and changing stuff rather
than fixing the real problem - which is jiffies.
Alan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 19:53 ` Alan Cox
@ 2009-05-07 19:56 ` Christoph Lameter
2009-05-07 20:24 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 19:56 UTC (permalink / raw)
To: Alan Cox
Cc: Paul E. McKenney, Peter Zijlstra, Alok Kataria, H. Peter Anvin,
Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML
On Thu, 7 May 2009, Alan Cox wrote:
> Even then its pointlessly jumping up and down and changing stuff rather
> than fixing the real problem - which is jiffies.
Reducing the number of application interruptions by a factor of 10 is not
pointless. The lower HZ becomes, the less useful it will be for the
users of jiffies, because they will have to create timers and use real
time intervals once the HZ intervals become too large.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 19:56 ` Christoph Lameter
@ 2009-05-07 20:24 ` Alan Cox
2009-05-07 20:21 ` Christoph Lameter
0 siblings, 1 reply; 65+ messages in thread
From: Alan Cox @ 2009-05-07 20:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Paul E. McKenney, Peter Zijlstra, Alok Kataria, H. Peter Anvin,
Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML
> Reducing the number of application interruptions by a factor of 10 is not
> pointless. The lower HZ becomes the more useless it will become for the
> users of jiffies because they will have to create timers and use time
> intervals if the HZ intervals become too large.
More like "the more your performance goes down the toilet" - as
everything is rounded up.
If you want to follow that argument set the default HZ to 1. Then see if
it has any side effects.
Given HZ=1 doesn't work, it's fairly clear that all you are doing is
frobbing pointlessly with defaults that have evolved over about 15
years because they work.
Whereas if you make jiffies jiffies() and it reads a timer it puts the
pain in the right place and you can begin to make actual progress toward
making things work properly.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 20:24 ` Alan Cox
@ 2009-05-07 20:21 ` Christoph Lameter
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 20:21 UTC (permalink / raw)
To: Alan Cox
Cc: Paul E. McKenney, Peter Zijlstra, Alok Kataria, H. Peter Anvin,
Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML
On Thu, 7 May 2009, Alan Cox wrote:
> Given HZ=1 doesn't work, its fairly clear that all you are doing is
> frobbing pointlessly with defaults that have been evolved over about 15
> years because they work.
Are you sure that the 1000 HZ default was not developed due to the need to
have sufficient time granularity for poll and select? For most of those 15
years we did not have high-resolution timers.
> Whereas if you make jiffies jiffies() and it reads a timer it puts the
> pain in the right place and you can begin to make actual progress to
> making things work properly.
Good idea. A patch would be appreciated.
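[Editor's sketch] The poll/select granularity point can be made concrete with a toy calculation. The helper names below are hypothetical (real kernels use `msecs_to_jiffies()` and friends, and poll/select have since moved to hrtimers); only the round-up arithmetic is the point.

```c
/* A timeout expressed in milliseconds must round UP to whole ticks:
 * a request that spans a partial tick still waits the whole tick. */
static unsigned long msecs_to_ticks(unsigned long ms, unsigned long hz)
{
    return (ms * hz + 999) / 1000;
}

/* The timeout a caller actually experiences, after tick rounding. */
static unsigned long effective_timeout_ms(unsigned long ms, unsigned long hz)
{
    return msecs_to_ticks(ms, hz) * 1000 / hz;
}
```

At HZ=1000 a 1 ms poll() timeout waits about 1 ms; at HZ=100 the same request rounds up to a full 10 ms tick, which is the rounding cost Alan refers to.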
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 18:01 ` Paul E. McKenney
2009-05-07 18:12 ` Christoph Lameter
@ 2009-05-08 10:32 ` Peter Zijlstra
2009-05-08 12:50 ` Paul E. McKenney
1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-08 10:32 UTC (permalink / raw)
To: paulmck
Cc: Christoph Lameter, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 11:01 -0700, Paul E. McKenney wrote:
> In general, I agree. However, in the case where you have a single
> CPU-bound task running in user mode, you don't care that much about
> syscall performance. So, yes, this would mean having yet another config
> variable that users running big CPU-bound scientific applications would
> need to worry about, which is not perfect either.
>
> For whatever it is worth, the added overhead on entry would be something
> like the following:
>
> void rcu_irq_enter(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
> if (rdtp->dynticks_nesting++)
> return;
> rdtp->dynticks++;
> WARN_ON_RATELIMIT(!(rdtp->dynticks & 0x1), &rcu_rs);
> smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
> }
>
> On exit, a bit more:
>
> void rcu_irq_exit(void)
> {
> struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
>
> if (--rdtp->dynticks_nesting)
> return;
> smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
> rdtp->dynticks++;
> WARN_ON_RATELIMIT(rdtp->dynticks & 0x1, &rcu_rs);
>
> /* If the interrupt queued a callback, get out of dyntick mode. */
> if (__get_cpu_var(rcu_data).nxtlist ||
> __get_cpu_var(rcu_bh_data).nxtlist)
> set_need_resched();
> }
>
> But I could move the callback check into call_rcu(), which would get the
> overhead of rcu_irq_exit() down to about that of rcu_irq_enter().
Can't you simply enter idle state after a grace period completes and
finds no pending callbacks for the next period. And leave idle state at
the next call_rcu()?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-08 10:32 ` Peter Zijlstra
@ 2009-05-08 12:50 ` Paul E. McKenney
2009-05-08 14:16 ` Christoph Lameter
0 siblings, 1 reply; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-08 12:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Fri, May 08, 2009 at 12:32:56PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-05-07 at 11:01 -0700, Paul E. McKenney wrote:
>
> > In general, I agree. However, in the case where you have a single
> > CPU-bound task running in user mode, you don't care that much about
> > syscall performance. So, yes, this would mean having yet another config
> > variable that users running big CPU-bound scientific applications would
> > need to worry about, which is not perfect either.
> >
> > For whatever it is worth, the added overhead on entry would be something
> > like the following:
> >
> > void rcu_irq_enter(void)
> > {
> > struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
> >
> > if (rdtp->dynticks_nesting++)
> > return;
> > rdtp->dynticks++;
> > WARN_ON_RATELIMIT(!(rdtp->dynticks & 0x1), &rcu_rs);
> > smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
> > }
> >
> > On exit, a bit more:
> >
> > void rcu_irq_exit(void)
> > {
> > struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
> >
> > if (--rdtp->dynticks_nesting)
> > return;
> > smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
> > rdtp->dynticks++;
> > WARN_ON_RATELIMIT(rdtp->dynticks & 0x1, &rcu_rs);
> >
> > /* If the interrupt queued a callback, get out of dyntick mode. */
> > if (__get_cpu_var(rcu_data).nxtlist ||
> > __get_cpu_var(rcu_bh_data).nxtlist)
> > set_need_resched();
> > }
> >
> > But I could move the callback check into call_rcu(), which would get the
> > overhead of rcu_irq_exit() down to about that of rcu_irq_enter().
>
> Can't you simply enter idle state after a grace period completes and
> finds no pending callbacks for the next period. And leave idle state at
> the next call_rcu()?
If there were no RCU callbacks -globally- across all CPUs, yes. But
the check at the end of rcu_irq_exit() is testing only on the current
CPU. Checking across all CPUs is expensive and racy.
So what happens instead is that there is rcu_needs_cpu(), which gates
entry into dynticks-idle mode. This function returns 1 if there are
callbacks on the current CPU. So, if no CPU has an RCU callback, then
all CPUs can enter dynticks-idle mode so that the entire system is
quiescent from an RCU viewpoint -- no RCU processing at all.
Or am I missing what you are getting at with your question?
Thanx, Paul
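[Editor's sketch] A toy model of the gating Paul describes. The function name follows the mail; the data layout is invented for illustration and is not the kernel's.

```c
#include <stddef.h>

#define NCPUS 4

struct rcu_cb { struct rcu_cb *next; };

/* Per-CPU pending-callback lists, standing in for rcu_data.nxtlist. */
static struct rcu_cb *nxtlist[NCPUS];

/* Nonzero when this CPU still has RCU work queued and must therefore
 * keep its tick running. */
static int rcu_needs_cpu(int cpu)
{
    return nxtlist[cpu] != NULL;
}

/* A CPU may enter dynticks-idle mode only when RCU does not need it.
 * When this holds on every CPU at once, the whole system is quiescent
 * from RCU's point of view -- no RCU processing at all. */
static int can_enter_dynticks_idle(int cpu)
{
    return !rcu_needs_cpu(cpu);
}
```

Note the check is purely per-CPU, which is why no expensive, racy cross-CPU scan is needed.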
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-08 12:50 ` Paul E. McKenney
@ 2009-05-08 14:16 ` Christoph Lameter
2009-05-08 15:06 ` Paul E. McKenney
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-08 14:16 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Fri, 8 May 2009, Paul E. McKenney wrote:
> > Can't you simply enter idle state after a grace period completes and
> > finds no pending callbacks for the next period. And leave idle state at
> > the next call_rcu()?
>
> If there were no RCU callbacks -globally- across all CPUs, yes. But
> the check at the end of rcu_irq_exit() is testing only on the current
> CPU. Checking across all CPUs is expensive and racy.
>
> So what happens instead is that there is rcu_needs_cpu(), which gates
> entry into dynticks-idle mode. This function returns 1 if there are
> callbacks on the current CPU. So, if no CPU has an RCU callback, then
> all CPUs can enter dynticks-idle mode so that the entire system is
> quiescent from an RCU viewpoint -- no RCU processing at all.
I did not follow RCU developments. But wasn't there a time when RCU
periods were processor specific and a global RCU period was done when
all the processors went through their RCU periods?
CPU cache hotness may not be relevant to RCU, since RCU involves long
time periods in which cachelines cool down. Can the RCU callbacks all be
done on processor 0 (or a so-designated processor)? That would avoid
disturbing the other processors.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-08 14:16 ` Christoph Lameter
@ 2009-05-08 15:06 ` Paul E. McKenney
0 siblings, 0 replies; 65+ messages in thread
From: Paul E. McKenney @ 2009-05-08 15:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: Peter Zijlstra, Alok Kataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML,
alan@lxorguk.ukuu.org.uk
On Fri, May 08, 2009 at 10:16:10AM -0400, Christoph Lameter wrote:
> On Fri, 8 May 2009, Paul E. McKenney wrote:
>
> > > Can't you simply enter idle state after a grace period completes and
> > > finds no pending callbacks for the next period. And leave idle state at
> > > the next call_rcu()?
> >
> > If there were no RCU callbacks -globally- across all CPUs, yes. But
> > the check at the end of rcu_irq_exit() is testing only on the current
> > CPU. Checking across all CPUs is expensive and racy.
> >
> > So what happens instead is that there is rcu_needs_cpu(), which gates
> > entry into dynticks-idle mode. This function returns 1 if there are
> > callbacks on the current CPU. So, if no CPU has an RCU callback, then
> > all CPUs can enter dynticks-idle mode so that the entire system is
> > quiescent from an RCU viewpoint -- no RCU processing at all.
>
> I did not follow RCU developments. But wasn't there a time when RCU
> periods were processor specific and a global RCU period was done when
> all the processors went through their RCU periods?
For non-realtime RCU implementations, after a given grace period starts,
once each CPU goes through a "quiescent state", then that grace period
can end. For realtime (AKA "preemptable") RCU, the focus is on tasks
rather than CPUs, but the same general principle applies, give or take
some implementation details: after a given grace period starts, once
each task goes through a quiescent state, then that grace period can end.
> CPU cache hotness may not be relevant to RCU, since RCU involves long
> time periods in which cachelines cool down. Can the RCU callbacks all
> be done on processor 0 (or a so-designated processor)? That would avoid
> disturbing the other processors.
This approach -might- be OK for a specially configured and protected HPC
machine. But it is a non-starter for general-purpose machines. For an
example of why, consider a denial-of-service attack that continually
changes routing tables: CPU 0 could saturate, start falling behind, and
eventually OOM the machine.
But if you would like to experiment with this, make call_rcu() be a
wrapper that causes an underlying call_rcu_cpu_0() to be executed on
CPU 0. That won't get exactly the cache-warmth effects that you are
after, but it will let you see whether the machine would gracefully
handle various events that might dump large numbers of callbacks.
Thanx, Paul
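[Editor's sketch] Paul's suggested experiment could look roughly like this. `call_rcu_cpu_0()` is his hypothetical name from the mail; the queue below is a user-space stand-in, not kernel code.

```c
#include <stddef.h>

typedef void (*rcu_func_t)(void *arg);

struct rcu_head {
    struct rcu_head *next;
    rcu_func_t func;
};

/* One designated-CPU callback queue, standing in for CPU 0's list. */
static struct rcu_head *cpu0_queue;
static int cpu0_queue_len;

/* Underlying primitive: enqueue onto CPU 0 regardless of the caller. */
static void call_rcu_cpu_0(struct rcu_head *head, rcu_func_t func)
{
    head->func = func;
    head->next = cpu0_queue;
    cpu0_queue = head;
    cpu0_queue_len++;
}

/* Wrapper: every call_rcu() from any CPU funnels to the designated CPU,
 * so all callback invocation happens in one place. */
static void call_rcu(struct rcu_head *head, rcu_func_t func)
{
    call_rcu_cpu_0(head, func);
}
```

As Paul notes, the interesting question is whether that single queue keeps up when something dumps large numbers of callbacks at once.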
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:13 ` Peter Zijlstra
2009-05-07 17:18 ` Peter Zijlstra
@ 2009-05-07 17:18 ` Christoph Lameter
2009-05-07 17:37 ` Peter Zijlstra
1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> Therefore, decreasing the HZ value to say 50, we'd get a minimum
> involuntary preemption granularity of 20ms, something on the high end of
> barely usable.
But the scheduler is activated when a process is being woken up. The main
use is for load balancing.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:18 ` Christoph Lameter
@ 2009-05-07 17:37 ` Peter Zijlstra
2009-05-07 17:34 ` Christoph Lameter
0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 13:18 -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > Therefore, decreasing the HZ value to say 50, we'd get a minimum
> > involuntary preemption granularity of 20ms, something on the high end of
> > barely usable.
>
> But the scheduler is activated when a process is being woken up. The main
> use is for load balancing.
Somehow people like involuntary preemption to happen at slightly more
than glacial pace as well.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:37 ` Peter Zijlstra
@ 2009-05-07 17:34 ` Christoph Lameter
2009-05-07 17:55 ` Peter Zijlstra
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> Somehow people like involuntary preemption to happen at slightly more
> than glacial pace as well.
These intervals could be dynamically established by the scheduler.
If there is a single process running on a processor and no other process
is contending then there is no need to do involuntary preemption.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:34 ` Christoph Lameter
@ 2009-05-07 17:55 ` Peter Zijlstra
2009-05-07 17:55 ` Christoph Lameter
0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 13:34 -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > Somehow people like involuntary preemption to happen at slightly more
> > than glacial pace as well.
>
> These intervals could be dynamically established by the scheduler.
>
> If there is a single process running on a processor and no other process
> is contending then there is no need to do involuntary preemption.
Sure, and CONFIG_SCHED_HRTICK does that, the trick is getting it to not
hurt others.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:55 ` Peter Zijlstra
@ 2009-05-07 17:55 ` Christoph Lameter
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> > If there is a single process running on a processor and no other process
> > is contending then there is no need to do involuntary preemption.
>
> Sure, and CONFIG_SCHED_HRTICK does that, the trick is getting it to not
> hurt others.
If other processes are scheduled on the processor then an IPI could
notify the scheduler over there to start timer interrupts for preemption
again.
I still see timer interrupts ongoing even with CONFIG_SCHED_HRTICK.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:07 ` Peter Zijlstra
2009-05-07 17:13 ` Peter Zijlstra
@ 2009-05-07 17:19 ` Christoph Lameter
2009-05-07 17:45 ` Peter Zijlstra
1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> The problem is that running the scheduler off of hrtimers is too
> expensive. We have the code, we tried it, people complained.
If you aggregate events into a single hrtimer event then it may work.
Wasn't that recently added?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:19 ` Christoph Lameter
@ 2009-05-07 17:45 ` Peter Zijlstra
2009-05-07 17:50 ` Christoph Lameter
2009-05-07 21:01 ` H. Peter Anvin
0 siblings, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 17:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 13:19 -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > The problem is that running the scheduler off of hrtimers is too
> > expensive. We have the code, we tried it, people complained.
>
> If you aggregate events into a single hrtimer event then it may work.
> Wasn't that recently added?
No it won't; you want a fairly decent involuntary preemption rate to keep
the full service latency at a usable figure.
The problem with scheduling a hrtimer along with tasks is that at high
context switch rates the timer will never fire but you do pay the
overhead of programming the hardware each time, something that can be
about as expensive as the whole context switch itself.
Although, I guess we could amortize that by not re-programming the timer
when the existing timer is within a reasonable period (say 1ms) of the
requested one.
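[Editor's sketch] Peter's amortization could be sketched like this, with illustrative names and a counter standing in for the expensive hardware write; this is not the hrtimer code.

```c
#include <stdlib.h>          /* llabs */

#define SLACK_NS 1000000LL   /* "reasonable period": 1 ms */

static long long armed_ns = -1;  /* currently programmed expiry, -1 = none */
static int hw_writes;            /* count of (expensive) hardware reprograms */

/* Only touch the hardware when the armed expiry differs from the request
 * by more than the slack; otherwise reuse the already-armed timer. */
static void program_timer(long long want_ns)
{
    if (armed_ns >= 0 && llabs(armed_ns - want_ns) <= SLACK_NS)
        return;
    armed_ns = want_ns;
    hw_writes++;
}
```

At high context-switch rates most requests land within the slack of the previous one, so the per-switch hardware programming cost largely disappears.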
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:45 ` Peter Zijlstra
@ 2009-05-07 17:50 ` Christoph Lameter
2009-05-07 19:17 ` Peter Zijlstra
2009-05-07 21:01 ` H. Peter Anvin
1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 17:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> No it won't, you want fairly decent involuntary preemption rate to keep
> the full service latency at a usable figure.
>
> The problem with scheduling a hrtimer along with tasks is that at high
> context switch rates the timer will never fire but you do pay the
> overhead of programming the hardware each time, something that can be
> about as expensive as the whole context switch itself.
What are high context switch rates? 1000 HZ? Generally it seems that
context switches are bad for CPU caches and thus to be avoided.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:50 ` Christoph Lameter
@ 2009-05-07 19:17 ` Peter Zijlstra
2009-05-07 19:38 ` Christoph Lameter
0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2009-05-07 19:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 13:50 -0400, Christoph Lameter wrote:
> On Thu, 7 May 2009, Peter Zijlstra wrote:
>
> > No it won't, you want fairly decent involuntary preemption rate to keep
> > the full service latency at a usable figure.
> >
> > The problem with scheduling a hrtimer along with tasks is that at high
> > context switch rates the timer will never fire but you do pay the
> > overhead of programming the hardware each time, something that can be
> > about as expensive as the whole context switch itself.
>
> What are high context switch rates? 1000 HZ? Generally it seems that
> context switches are bad for cpu caches and thus to be avoided.
We're talking about in excess of 250k switches a second.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 19:17 ` Peter Zijlstra
@ 2009-05-07 19:38 ` Christoph Lameter
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2009-05-07 19:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alok Kataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 7 May 2009, Peter Zijlstra wrote:
> > What are high context switch rates? 1000 HZ? Generally it seems that
> > context switches are bad for cpu caches and thus to be avoided.
>
>
> We're talking about in excess of 250k switches a second.
Timer resources are not processor specific? Why such a high rate?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 17:45 ` Peter Zijlstra
2009-05-07 17:50 ` Christoph Lameter
@ 2009-05-07 21:01 ` H. Peter Anvin
1 sibling, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2009-05-07 21:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Alok Kataria, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
Peter Zijlstra wrote:
>
> The problem with scheduling a hrtimer along with tasks is that at high
> context switch rates the timer will never fire but you do pay the
> overhead of programming the hardware each time, something that can be
> about as expensive as the whole context switch itself.
>
> Although, I guess we could amortize that by not re-programming the timer
> when the existing timer is within a reasonable period (say 1ms) of the
> requested one.
>
That seems like a reasonable optimization since we're talking about
multi-kHz context switch rates, here.
-hpa
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-05 21:57 ` Alok Kataria
2009-05-07 14:13 ` Christoph Lameter
@ 2009-05-07 16:35 ` Chris Snook
2009-05-07 16:56 ` Alok Kataria
1 sibling, 1 reply; 65+ messages in thread
From: Chris Snook @ 2009-05-07 16:35 UTC (permalink / raw)
To: akataria
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Tue, May 5, 2009 at 5:57 PM, Alok Kataria <akataria@vmware.com> wrote:
>
> On Tue, 2009-05-05 at 14:21 -0700, H. Peter Anvin wrote:
>> Alok Kataria wrote:
>> > Hi,
>> >
>> > Given that there were no major objections that came up regarding
>> > reducing the HZ value in http://lkml.org/lkml/2009/4/27/499.
>> >
>> > Below is the patch which actually reduces it, please consider for tip.
>> >
>>
>> What is the benefit of this?
>
> I did some experiments on linux 2.6.29 guests running on VMware and
> noticed that the number of timer interrupts could have some slowdown on
> the total throughput on the system.
> A simple tight loop experiment showed that with HZ=1000 we took about
> 264sec to complete the loop and that same loop took about 255sec with
> HZ=100.
> You can find more information here http://lkml.org/lkml/2009/4/28/401
This is why certain niches, such as HPC users, often prefer HZ=100
kernels. For the rest of us, sacrificing a few percent CPU throughput
for significant latency gains is well worth it.
> And with HRT I don't see any downsides in terms of increased latencies
> for device timers or anything of that sort.
>
>>
>> I can see at least one immediate downside: some timeout values in the
>> kernel are still maintained in units of HZ (like poll, I believe), and
>> so with a lower HZ value we'll have higher roundoff errors.
>
> If that at all is such a big problem, shouldn't we think about moving
> to using schedule_hrtimeout for such cases rather than relying on
> jiffy-based timeouts?
> The hrtimer explanation over here http://www.tglx.de/hrtimers.html
> also talks about where these HZ (timer-wheel) based timeouts should be
> used, and that they shouldn't really depend on accurate timing.
But your patch doesn't do this. If you want us to merge a patch that
makes VMware systems faster, we're a lot more likely to take it if it
makes everyone else's systems faster, or at least not slower.
> Also the default HZ value was 250 before this commit
>
> commit 5cb04df8d3f03e37a19f2502591a84156be71772
> x86: defconfig updates
>
> And it was 250 for a very long time before that too. The commit log
> doesn't explain why the value was bumped up either.
250 was considered a compromise between 100 and 1000, but almost
everyone who cared just ended up using one or the other, and most of
them preferred 1000.
Given your use case, what you really need to do is get Red Hat,
Novell, et al. on the phone and ask them to ship kernels with HZ=100,
because the distributions do their own thing anyway. If you can
figure out a way to do that without harming latency, they'll be
thrilled.
-- Chris
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 16:35 ` Chris Snook
@ 2009-05-07 16:56 ` Alok Kataria
2009-05-07 20:29 ` Chris Snook
0 siblings, 1 reply; 65+ messages in thread
From: Alok Kataria @ 2009-05-07 16:56 UTC (permalink / raw)
To: Chris Snook
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, 2009-05-07 at 09:35 -0700, Chris Snook wrote:
> On Tue, May 5, 2009 at 5:57 PM, Alok Kataria <akataria@vmware.com> wrote:
> >
> > On Tue, 2009-05-05 at 14:21 -0700, H. Peter Anvin wrote:
> >> Alok Kataria wrote:
> >> > Hi,
> >> >
> >> > Given that there were no major objections that came up regarding
> >> > reducing the HZ value in http://lkml.org/lkml/2009/4/27/499.
> >> >
> >> > Below is the patch which actually reduces it, please consider for tip.
> >> >
> >>
> >> What is the benefit of this?
> >
> > I did some experiments on linux 2.6.29 guests running on VMware and
> > noticed that the number of timer interrupts could have some slowdown on
> > the total throughput on the system.
> > A simple tight loop experiment showed that with HZ=1000 we took about
> > 264sec to complete the loop and that same loop took about 255sec with
> > HZ=100.
> > You can find more information here http://lkml.org/lkml/2009/4/28/401
>
> This is why certain niches, such as HPC users, often prefer HZ=100
> kernels. For the rest of us, sacrificing a few percent CPU throughput
> for significant latency gains is well worth it.
>
> > And with HRT I don't see any downsides in terms of increased latencies
> > for device timers or anything of that sort.
> >
> >>
> >> I can see at least one immediate downside: some timeout values in the
> >> kernel are still maintained in units of HZ (like poll, I believe), and
> >> so with a lower HZ value we'll have higher roundoff errors.
> >
> > If that at all is such a big problem shouldn't we think about moving to
> > using schedule_hrtimeout for such cases rather than relying on jiffy
> > based timeouts.
> > The hrtimer explanation over here http://www.tglx.de/hrtimers.html
> > also talks about where these HZ (timer wheel) based timeouts be used and
> > shouldn't really be dependent on accurate timing.
>
> But your patch doesn't do this.
The reason it doesn't do it is that poll and select already use
hrtimers. So IMO no important subsystem relies on jiffies for wakeups;
thus the latency problem is not actually present in the kernel.
> If you want us to merge a patch that
> makes VMware systems faster, we're a lot more likely to take it if it
> makes everyone else's systems faster, or at least not slower.
I doubt it would make any system slower; running these simple
experiments is not hard at all, and one could run them on a native
system too to check.
>
> > Also the default HZ value was 250 before this commit
> >
> > commit 5cb04df8d3f03e37a19f2502591a84156be71772
> > x86: defconfig updates
> >
> > And it was 250 for a very long time before that too. The commit log
> > doesn't explain why the value was bumped up either.
>
> 250 was considered a compromise between 100 and 1000, but almost
> everyone who cared just ended up using one or the other, and most of
> them preferred 1000.
>
> Given your use case, what you really need to do is get Red Hat,
> Novell, et al. on the phone and ask them to ship kernels with HZ=100,
> because the distributions do their own thing anyway.
Yeah, but I don't think there is any better platform than LKML to
figure out whether this is still a problem at all. Once we are assured
that a low HZ is no longer a problem, I don't see why the various
distros wouldn't consider reducing it.
> If you can
> figure out a way to do that without harming latency, they'll be
> thrilled.
Why do you think it would harm latency?
The sched_tick too is driven by hrtimers; if there is any specific
subsystem which you think still relies on jiffies, we could think about
using hrtimers for it too, right?
I did a quick scan and the only things that rely on jiffies are device
timeouts, where latency is not an issue.
So please let me know in what cases you think it could affect system
latency.
Thanks,
Alok
>
> -- Chris
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 16:56 ` Alok Kataria
@ 2009-05-07 20:29 ` Chris Snook
2009-05-07 20:34 ` Alan Cox
0 siblings, 1 reply; 65+ messages in thread
From: Chris Snook @ 2009-05-07 20:29 UTC (permalink / raw)
To: akataria
Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, alan@lxorguk.ukuu.org.uk
On Thu, May 7, 2009 at 12:56 PM, Alok Kataria <akataria@vmware.com> wrote:
>
> On Thu, 2009-05-07 at 09:35 -0700, Chris Snook wrote:
>> On Tue, May 5, 2009 at 5:57 PM, Alok Kataria <akataria@vmware.com> wrote:
>> >
>> > On Tue, 2009-05-05 at 14:21 -0700, H. Peter Anvin wrote:
>> >> Alok Kataria wrote:
>> >> > Hi,
>> >> >
>> >> > Given that there were no major objections that came up regarding
>> >> > reducing the HZ value in http://lkml.org/lkml/2009/4/27/499.
>> >> >
>> >> > Below is the patch which actually reduces it, please consider for tip.
>> >> >
>> >>
>> >> What is the benefit of this?
>> >
>> > I did some experiments on linux 2.6.29 guests running on VMware and
>> > noticed that the number of timer interrupts could have some slowdown on
>> > the total throughput on the system.
>> > A simple tight loop experiment showed that with HZ=1000 we took about
>> > 264sec to complete the loop and that same loop took about 255sec with
>> > HZ=100.
>> > You can find more information here http://lkml.org/lkml/2009/4/28/401
>>
>> This is why certain niches, such as HPC users, often prefer HZ=100
>> kernels. For the rest of us, sacrificing a few percent CPU throughput
>> for significant latency gains is well worth it.
>>
>> > And with HRT I don't see any downsides in terms of increased latencies
>> > for device timers or anything of that sort.
>> >
>> >>
>> >> I can see at least one immediate downside: some timeout values in the
>> >> kernel are still maintained in units of HZ (like poll, I believe), and
>> >> so with a lower HZ value we'll have higher roundoff errors.
>> >
>> > If that at all is such a big problem shouldn't we think about moving to
>> > using schedule_hrtimeout for such cases rather than relying on jiffy
>> > based timeouts.
>> > The hrtimer explanation over here http://www.tglx.de/hrtimers.html
>> > also talks about where these HZ (timer wheel) based timeouts be used and
>> > shouldn't really be dependent on accurate timing.
>>
>> But your patch doesn't do this.
>
> The reason it doesn't do it is because poll and select already use
> hrtimer. So IMO no important subsystem relies on jiffies for wakeups.
> Thus the latency problem is not actually present in the kernel.
TCP/IP still uses jiffies. There's been talk of changing that, but it
hasn't been done yet, and it's definitely a latency-critical
subsystem.
>> If you want us to merge a patch that
>> makes VMware systems faster, we're a lot more likely to take it if it
> > makes everyone else's systems faster, or at least not slower.
>
> I doubt it would make any system slower, running these simple
> experiments is not hard at all and one could run these on native system
> too to check.
If this patch improves performance for both simple loops and
transaction processing by changing a non-idiotic tuning parameter, it
would be a first. Can you at least run some sort of database
benchmark to back this up?
>>
>> > Also the default HZ value was 250 before this commit
>> >
>> > commit 5cb04df8d3f03e37a19f2502591a84156be71772
>> > x86: defconfig updates
>> >
>> > And it was 250 for a very long time before that too. The commit log
>> > doesn't explain why the value was bumped up either.
>>
>> 250 was considered a compromise between 100 and 1000, but almost
>> everyone who cared just ended up using one or the other, and most of
>> them preferred 1000.
>>
>> Given your use case, what you really need to do is get Red Hat,
>> Novell, et al. on the phone and ask them to ship kernels with HZ=100,
>> because the distributions do their own thing anyway.
>
> Yeah but I don't think there is any better platform other than LKML to
> figure out if at all this is a problem anymore. Once we are assured that
> a low HZ is no more a problem I don't see why would the various distros
> not consider reducing it.
>
>> If you can
>> figure out a way to do that without harming latency, they'll be
>> thrilled.
>
> Why do you think it would harm latency ?
> The sched_tick too is driven by hrtimers; if there is any specific
> subsystem which you think still relies on jiffies, we could think about
> using hrtimers for it too, right ?
> I did a quick scan, and the only things that rely on jiffies are the
> device timeouts, where latency is not an issue.
> So please let me know in which cases you think it could affect system
> latency.
If you can get TCP/IP converted, or convince me that this won't hurt
transaction processing, I'm sold.
-- Chris
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 20:29 ` Chris Snook
@ 2009-05-07 20:34 ` Alan Cox
2009-05-07 22:16 ` Ravikiran G Thirumalai
2009-05-07 22:19 ` Alok Kataria
0 siblings, 2 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-07 20:34 UTC (permalink / raw)
To: Chris Snook
Cc: akataria, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
> >> Given your use case, what you really need to do is get Red Hat,
> >> Novell, et al. on the phone and ask them to ship kernels with HZ=100,
> >> because the distributions do their own thing anyway.
As a side note Red Hat ships runtime configurable tick behaviour in RHEL
these days. HZ is fixed but the ticks can be bunched up. That was done as
a quick fix to keep stuff portable, but it's a lot more sensible than
randomly messing with the HZ value, and it's not much code either.
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 20:34 ` Alan Cox
@ 2009-05-07 22:16 ` Ravikiran G Thirumalai
2009-05-07 22:19 ` Alok Kataria
1 sibling, 0 replies; 65+ messages in thread
From: Ravikiran G Thirumalai @ 2009-05-07 22:16 UTC (permalink / raw)
To: Alan Cox
Cc: Chris Snook, akataria, H. Peter Anvin, Ingo Molnar,
Thomas Gleixner, the arch/x86 maintainers, LKML
On Thu, May 07, 2009 at 09:34:03PM +0100, Alan Cox wrote:
>> >> Given your use case, what you really need to do is get Red Hat,
>> >> Novell, et al. on the phone and ask them to ship kernels with HZ=100,
>> >> because the distributions do their own thing anyway.
>
>As a side note Red Hat ships runtime configurable tick behavior in RHEL
>these days. HZ is fixed but the ticks can be bunched up. That was done as
>a quick fix to keep stuff portable, but it's a lot more sensible than
>randomly messing with the HZ value, and it's not much code either.
>
That's interesting!
Could you please point us to the patch?
(As an HPC + virtualization shop, we set HZ to 100 all the time,
and a patch like the one you mention above sounds great)
Thanks,
Kiran
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 20:34 ` Alan Cox
2009-05-07 22:16 ` Ravikiran G Thirumalai
@ 2009-05-07 22:19 ` Alok Kataria
2009-05-08 9:31 ` Alan Cox
1 sibling, 1 reply; 65+ messages in thread
From: Alok Kataria @ 2009-05-07 22:19 UTC (permalink / raw)
To: Alan Cox
Cc: Chris Snook, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
On Thu, 2009-05-07 at 13:34 -0700, Alan Cox wrote:
> > >> Given your use case, what you really need to do is get Red Hat,
> > >> Novell, et al. on the phone and ask them to ship kernels with HZ=100,
> > >> because the distributions do their own thing anyway.
>
> As a side note Red Hat ships runtime configurable tick behaviour in RHEL
> these days. HZ is fixed but the ticks can be bunched up. That was done as
> a quick fix to keep stuff portable, but it's a lot more sensible than
> randomly messing with the HZ value, and it's not much code either.
>
Hi Alan,
I guess you are talking about the tick_divider patch?
And that's still the same as reducing the HZ value, only that it can be
done dynamically (at boot time), right?
Thanks,
Alok
* Re: [PATCH] x86: Reduce the default HZ value
2009-05-07 22:19 ` Alok Kataria
@ 2009-05-08 9:31 ` Alan Cox
0 siblings, 0 replies; 65+ messages in thread
From: Alan Cox @ 2009-05-08 9:31 UTC (permalink / raw)
To: akataria
Cc: Chris Snook, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML
> > As a side note Red Hat ships runtime configurable tick behaviour in RHEL
> > these days. HZ is fixed but the ticks can be bunched up. That was done as
> > a quick fix to keep stuff portable, but it's a lot more sensible than
> > randomly messing with the HZ value, and it's not much code either.
> >
> Hi Alan,
>
> I guess you are talking about the tick_divider patch?
> And that's still the same as reducing the HZ value, only that it can be
> done dynamically (at boot time), right?
Yes - which has the advantage that you can select different behaviours:
distributions that have to build with HZ=1000 for compatibility or
responsiveness can still allow users to drop to a lower HZ value when
doing stuff like HPC.
Basically it removes the need to argue about it at build time and lets
the user decide.
end of thread, other threads:[~2009-05-15 20:47 UTC | newest]
Thread overview: 65+ messages
2009-05-14 20:25 [PATCH] x86: Reduce the default HZ value devzero
2009-05-14 20:29 ` Alan Cox
-- strict thread matches above, loose matches on Subject: below --
2009-05-12 19:45 devzero
2009-05-13 23:30 ` Alok Kataria
2009-05-04 18:44 Alok Kataria
2009-05-05 21:21 ` H. Peter Anvin
2009-05-05 21:44 ` Alan Cox
2009-05-05 22:09 ` Alok Kataria
2009-05-05 22:33 ` Alan Cox
2009-05-05 23:37 ` Alok Kataria
2009-05-07 14:09 ` Christoph Lameter
2009-05-07 15:12 ` Alan Cox
2009-05-05 21:57 ` Alok Kataria
2009-05-07 14:13 ` Christoph Lameter
2009-05-07 15:14 ` Alan Cox
2009-05-07 15:20 ` Christoph Lameter
2009-05-07 15:30 ` H. Peter Anvin
2009-05-07 15:40 ` Christoph Lameter
2009-05-07 16:55 ` Jeff Garzik
2009-05-07 17:09 ` Alan Cox
2009-05-07 17:55 ` Jeff Garzik
2009-05-07 19:51 ` Alan Cox
2009-05-07 20:03 ` Jeff Garzik
2009-05-07 20:30 ` Alan Cox
2009-05-07 16:37 ` Alok Kataria
2009-05-07 17:07 ` Peter Zijlstra
2009-05-07 17:13 ` Peter Zijlstra
2009-05-07 17:18 ` Peter Zijlstra
2009-05-07 17:20 ` Christoph Lameter
2009-05-07 17:39 ` Peter Zijlstra
2009-05-07 17:40 ` Christoph Lameter
2009-05-07 17:54 ` Paul E. McKenney
2009-05-07 17:51 ` Christoph Lameter
2009-05-07 19:51 ` Paul E. McKenney
2009-05-07 17:36 ` Paul E. McKenney
2009-05-07 17:38 ` Peter Zijlstra
2009-05-07 18:01 ` Paul E. McKenney
2009-05-07 18:12 ` Christoph Lameter
2009-05-07 19:06 ` Paul E. McKenney
2009-05-07 19:53 ` Alan Cox
2009-05-07 19:56 ` Christoph Lameter
2009-05-07 20:24 ` Alan Cox
2009-05-07 20:21 ` Christoph Lameter
2009-05-08 10:32 ` Peter Zijlstra
2009-05-08 12:50 ` Paul E. McKenney
2009-05-08 14:16 ` Christoph Lameter
2009-05-08 15:06 ` Paul E. McKenney
2009-05-07 17:18 ` Christoph Lameter
2009-05-07 17:37 ` Peter Zijlstra
2009-05-07 17:34 ` Christoph Lameter
2009-05-07 17:55 ` Peter Zijlstra
2009-05-07 17:55 ` Christoph Lameter
2009-05-07 17:19 ` Christoph Lameter
2009-05-07 17:45 ` Peter Zijlstra
2009-05-07 17:50 ` Christoph Lameter
2009-05-07 19:17 ` Peter Zijlstra
2009-05-07 19:38 ` Christoph Lameter
2009-05-07 21:01 ` H. Peter Anvin
2009-05-07 16:35 ` Chris Snook
2009-05-07 16:56 ` Alok Kataria
2009-05-07 20:29 ` Chris Snook
2009-05-07 20:34 ` Alan Cox
2009-05-07 22:16 ` Ravikiran G Thirumalai
2009-05-07 22:19 ` Alok Kataria
2009-05-08 9:31 ` Alan Cox