From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ben Guthro Subject: [PATCH] scrub pages on guest termination Date: Fri, 23 May 2008 11:00:18 -0400 Message-ID: <4836DC02.3050407@virtualiron.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="------------080804040303060203070009" Return-path: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: xen-devel Cc: Robert Phillips List-Id: xen-devel@lists.xenproject.org This is a multi-part message in MIME format. --------------080804040303060203070009 Content-Type: multipart/alternative; boundary="------------030400050200020203070104" --------------030400050200020203070104 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit This patch solves the following problem. When a large VS terminates, the node locks up. The node locks up because the page_scrub_kick routine sends a softirq to all processors instructing them to run the page scrub code. There they interfere with each other as they serialize behind the page_scrub_lock. The patch does two things: (1) In page_scrub_kick, only a single cpu is interrupted. Some cpu other than the calling cpu is chosen (if available) because we assume the calling cpu has other higher priority work to do. (2) In page_scrub_softirq, if more than one cpu is online, the first cpu to start scrubbing designates itself as the primary_scrubber. As such it is dedicated to scrubbing pages until the list is empty. Other cpus might call page_scrub_softirq but they spend only 1 msec scrubbing before returning to check for other higher priority work. But, with multiple cpus online, the node can afford to have one cpu dedicated to scrubbing when that work needs to be done. Signed-off-by: Robert Phillips Signed-off-by: Ben Guthro --------------030400050200020203070104 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit
This patch solves the following problem.  When a large VS terminates, the node locks
up. The node locks up because the page_scrub_kick routine sends a softirq to
all processors instructing them to run the page scrub code.  There they interfere
with each other as they serialize behind the page_scrub_lock.

The patch does two things:

(1) In page_scrub_kick, only a single cpu is interrupted.  Some cpu other than
the calling cpu is chosen (if available) because we assume the calling cpu
has other higher priority work to do.

(2) In page_scrub_softirq, if more than one cpu is online, the first cpu
to start scrubbing designates itself as the primary_scrubber.  As such
it is dedicated to scrubbing pages until the list is empty.  Other cpus
might call page_scrub_softirq but they spend only 1 msec scrubbing before
returning to check for other higher priority work.  But, with multiple cpus
online, the node can afford to have one cpu dedicated to scrubbing when
that work needs to be done.

Signed-off-by: Robert Phillips <rphillips@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
--------------030400050200020203070104-- --------------080804040303060203070009 Content-Type: text/plain; name="xen-page-scrub.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="xen-page-scrub.patch" diff -r 29dc52031954 xen/common/page_alloc.c --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -984,16 +984,23 @@ void *p; int i; s_time_t start = NOW(); + static int primary_scrubber = -1; - /* Aim to do 1ms of work every 10ms. */ + /* Unless SMP, aim to do 1ms of work every 10ms. */ do { spin_lock(&page_scrub_lock); if ( unlikely((ent = page_scrub_list.next) == &page_scrub_list) ) { + if (primary_scrubber == smp_processor_id()) + primary_scrubber = -1; spin_unlock(&page_scrub_lock); return; } + + /* If SMP, dedicate a cpu to scrubbing til the job is done */ + if (primary_scrubber == -1 && num_online_cpus() > 1) + primary_scrubber = smp_processor_id(); /* Peel up to 16 pages from the list. */ for ( i = 0; i < 16; i++ ) @@ -1020,7 +1027,7 @@ unmap_domain_page(p); free_heap_pages(pfn_dom_zone_type(page_to_mfn(pg)), pg, 0); } - } while ( (NOW() - start) < MILLISECS(1) ); + } while ( primary_scrubber == smp_processor_id() || (NOW() - start) < MILLISECS(1) ); set_timer(&this_cpu(page_scrub_timer), NOW() + MILLISECS(10)); } diff -r 29dc52031954 xen/include/xen/mm.h --- a/xen/include/xen/mm.h +++ b/xen/include/xen/mm.h @@ -90,10 +90,21 @@ if ( !list_empty(&page_scrub_list) ) \ raise_softirq(PAGE_SCRUB_SOFTIRQ); \ } while ( 0 ) -#define page_scrub_kick() \ - do { \ - if ( !list_empty(&page_scrub_list) ) \ - cpumask_raise_softirq(cpu_online_map, PAGE_SCRUB_SOFTIRQ); \ + +#define page_scrub_kick() \ + do { \ + if ( !list_empty(&page_scrub_list) ) { \ + int cpu; \ + /* Try to use some other cpu. */ \ + for_each_online_cpu(cpu) { \ + if (cpu != smp_processor_id()) { \ + cpu_raise_softirq(cpu, PAGE_SCRUB_SOFTIRQ); \ + break; \ + } \ + } \ + if (cpu >= NR_CPUS) \ + raise_softirq(PAGE_SCRUB_SOFTIRQ); \ + } \ } while ( 0 ) unsigned long avail_scrub_pages(void); --------------080804040303060203070009 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel --------------080804040303060203070009-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Keir Fraser Subject: Re: [PATCH] scrub pages on guest termination Date: Fri, 23 May 2008 17:04:58 +0100 Message-ID: References: <4836DC02.3050407@virtualiron.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0415512615==" Return-path: In-Reply-To: <4836DC02.3050407@virtualiron.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Mime-version: 1.0 Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Ben Guthro , xen-devel Cc: Robert Phillips List-Id: xen-devel@lists.xenproject.org > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============0415512615== Content-type: multipart/alternative; boundary="B_3294407135_26195233" > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --B_3294407135_26195233 Content-type: text/plain; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable The aim of the loop was to scrub enough pages in a batch that lock contention is kept tolerably low. Even if 16 pages is not sufficient for that, I=B9m surprised a =8Cnode=B9 (you mean a whole system, presumably?) would appear to lock up. Maybe pages would be scrubbed slower than we=B9d like, but still CPUs should be able to get the spinlock often enough to evaluate whether they have spent 1ms in the loop and hence get out of there. What sort of system were you seeing the lockup on? Does it have very many physical CPUs? -- Keir On 23/5/08 16:00, "Ben Guthro" wrote: > This patch solves the following problem. When a large VS terminates, the= node > locks > up. The node locks up because the page_scrub_kick routine sends a softirq= to > all processors instructing them to run the page scrub code. There they > interfere > with each other as they serialize behind the page_scrub_lock. --B_3294407135_26195233 Content-type: text/html; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable Re: [Xen-devel] [PATCH] scrub pages on guest termination The a= im of the loop was to scrub enough pages in a batch that lock contention is = kept tolerably low. Even if 16 pages is not sufficient for that, I’m s= urprised a ‘node’ (you mean a whole system, presumably?) would a= ppear to lock up. Maybe pages would be scrubbed slower than we’d like,= but still CPUs should be able to get the spinlock often enough to evaluate = whether they have spent 1ms in the loop and hence get out of there.

What sort of system were you seeing the lockup on? Does it have very many p= hysical CPUs?

 -- Keir

On 23/5/08 16:00, "Ben Guthro" <bguthro@virtualiron.com> wr= ote:

This patch solves the following problem.  When a large VS terminates, = the node locks
up. The node locks up because the page_scrub_kick routine sends a softirq t= o
all processors instructing them to run the page scrub code.  There the= y interfere
with each other as they serialize behind the page_scrub_lock.

--B_3294407135_26195233-- --===============0415512615== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel --===============0415512615==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ben Guthro Subject: Re: [PATCH] scrub pages on guest termination Date: Fri, 23 May 2008 13:01:16 -0400 Message-ID: <4836F85C.1010609@virtualiron.com> References: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1590010452==" Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Mime-version: 1.0 Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Keir Fraser Cc: xen-devel , Robert Phillips List-Id: xen-devel@lists.xenproject.org This is a multi-part message in MIME format. --===============1590010452== Content-Type: multipart/alternative; boundary="------------010400060805090404060509" This is a multi-part message in MIME format. --------------010400060805090404060509 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Yes, sorry - should have removed our terminology from the description. Node=physical machine VS=HVM guest w/ pv-on-hvm drivers Looking back at the original bug report - it seems to indicate it was migrating from a system with 2 processors to one with 8 Specifcally - from Dell Precision WorkStation 380 Processor: Intel(R) Pentium(R) D CPU 2.80GHz # of CPUs: 2 Speed: 2.8GHz to Supermicro X7DB8 Processor: Genuine Intel(R) CPU @ 2.13GHz # of CPUs: 8 Speed: 2.133 GHz Keir Fraser wrote: > The aim of the loop was to scrub enough pages in a batch that lock > contention is kept tolerably low. Even if 16 pages is not sufficient > for that, I'm surprised a 'node' (you mean a whole system, > presumably?) would appear to lock up. Maybe pages would be scrubbed > slower than we'd like, but still CPUs should be able to get the > spinlock often enough to evaluate whether they have spent 1ms in the > loop and hence get out of there. > > What sort of system were you seeing the lockup on? Does it have very > many physical CPUs? > > -- Keir > > On 23/5/08 16:00, "Ben Guthro" wrote: > > This patch solves the following problem. When a large VS > terminates, the node locks > up. The node locks up because the page_scrub_kick routine sends a > softirq to > all processors instructing them to run the page scrub code. There > they interfere > with each other as they serialize behind the page_scrub_lock. > > --------------010400060805090404060509 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Yes, sorry -  should have removed our terminology from the description.
Node=physical machine
VS=HVM guest w/ pv-on-hvm drivers
Looking back at the original bug report - it seems to indicate it was migrating from a system with 2 processors to one with 8

Specifcally - from
Dell Precision WorkStation 380
Processor:    Intel(R) Pentium(R) D CPU 2.80GHz
# of CPUs:    2
Speed:    2.8GHz

to

Supermicro X7DB8
Processor:    Genuine Intel(R) CPU @ 2.13GHz
# of CPUs:    8
Speed:    2.133 GHz




Keir Fraser wrote:
Re: [Xen-devel] [PATCH] scrub pages on guest termination The aim of the loop was to scrub enough pages in a batch that lock contention is kept tolerably low. Even if 16 pages is not sufficient for that, I’m surprised a ‘node’ (you mean a whole system, presumably?) would appear to lock up. Maybe pages would be scrubbed slower than we’d like, but still CPUs should be able to get the spinlock often enough to evaluate whether they have spent 1ms in the loop and hence get out of there.

What sort of system were you seeing the lockup on? Does it have very many physical CPUs?

 -- Keir

On 23/5/08 16:00, "Ben Guthro" <bguthro@virtualiron.com> wrote:

This patch solves the following problem.  When a large VS terminates, the node locks
up. The node locks up because the page_scrub_kick routine sends a softirq to
all processors instructing them to run the page scrub code.  There they interfere
with each other as they serialize behind the page_scrub_lock.


--------------010400060805090404060509-- --===============1590010452== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel --===============1590010452==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Keir Fraser Subject: Re: [PATCH] scrub pages on guest termination Date: Fri, 23 May 2008 18:19:25 +0100 Message-ID: References: <4836F85C.1010609@virtualiron.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1404800402==" Return-path: In-Reply-To: <4836F85C.1010609@virtualiron.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Mime-version: 1.0 Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Ben Guthro Cc: xen-devel , Robert Phillips List-Id: xen-devel@lists.xenproject.org > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============1404800402== Content-type: multipart/alternative; boundary="B_3294411577_26477130" > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --B_3294411577_26477130 Content-type: text/plain; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable On 23/5/08 18:01, "Ben Guthro" wrote: > Yes, sorry - should have removed our terminology from the description. > Node=3Dphysical machine > VS=3DHVM guest w/ pv-on-hvm drivers > Looking back at the original bug report - it seems to indicate it was > migrating from a system with 2 processors to one with 8 It=B9s very surprising that lock contention would cause such a severe lack of progress on an 8-CPU system. If the lock is that hotly contended then even the usage of it in free_domheap_pages() has to be questionable. I=B9m inclined to say that if we want to address this then we should do it in one or more of the following ways: 1. Count CPUs into the scrub function with an atomic_t and beyond a limit all other CPUs bail straight out after re-setting their timer. 2. Increase scrub batch size to reduce proportion of time that each loop iteration holds the lock. 3. Turn the spin_lock() into a spin_trylock() so that the timeout check ca= n be guaranteed to execute frequently. 4. Eliminate the global lock by building a lock-free linked list, or by maintaining per-CPU hashed work queues with work stealing, or... etc. The patch as-is at least suffers from the issue that the =8Cprimary scrubber=B9 should be regularly checking for softirq work. But I=B9m not sure such a sizeable change to the scheduling policy for scrubbing (such as it is!) is necessary or desirable. Option 4 is on the morally highest ground but is of course the most work. :-) -- Keir --B_3294411577_26477130 Content-type: text/html; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable Re: [Xen-devel] [PATCH] scrub pages on guest termination On 23= /5/08 18:01, "Ben Guthro" <bguthro@virtualiron.com> wrote:
Yes, sorry -  should have removed our terminology = from the description.
Node=3Dphysical machine
VS=3DHVM guest w/ pv-on-hvm drivers
Looking back at the original bug report - it seems to indicate it was migra= ting from a system with 2 processors to one with 8

It’s very surprising that lock contention would cause such a severe l= ack of progress on an 8-CPU system. If the lock is that hotly contended then= even the usage of it in free_domheap_pages() has to be questionable.

I’m inclined to say that if we want to address this then we should do= it in one or more of the following ways:
 1. Count CPUs into the scrub function with an atomic_t and beyond a l= imit all other CPUs bail straight out after re-setting their timer.
 2. Increase scrub batch size to reduce proportion of time that each l= oop iteration holds the lock.
 3. Turn the spin_lock() into a spin_trylock() so that the timeout che= ck can be guaranteed to execute frequently.
 4. Eliminate the global lock by building a lock-free linked list, or = by maintaining per-CPU hashed work queues with work stealing, or... etc.

The patch as-is at least suffers from the issue that the ‘primary scr= ubber’ should be regularly checking for softirq work. But I’m no= t sure such a sizeable change to the scheduling policy for scrubbing (such a= s it is!) is necessary or desirable.

Option 4 is on the morally highest ground but is of course the most work. := -)

 -- Keir
--B_3294411577_26477130-- --===============1404800402== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel --===============1404800402==--