* RE: Re: Re: [Linux-ia64] Re: Lockups on 2.4.1
2001-02-22 20:48 Re: Re: [Linux-ia64] Re: Lockups on 2.4.1 Jack Steiner
@ 2001-02-28 0:39 ` Mallick, Asit K
2001-02-28 6:09 ` David Mosberger
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Mallick, Asit K @ 2001-02-28 0:39 UTC (permalink / raw)
To: linux-ia64
Jack,
Thanks for investigating the problem and the patch. The problem is
happening because the timeout (40000UL) is not long enough. A processor
is taking longer than this to complete handle_IPI(), so the processor
doing flush_tlb_no_ptcg() times out and sends the IPI again. So we
should increase the timeout rather than decrease it, to avoid extra
reschedule IPIs.
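For reference, the failing sequence in the current code looks roughly
like this (comments added; simplified from the quoted patch below, not
the verbatim source):

    /* waiter in flush_tlb_no_ptcg(), after sending IPI_FLUSH_TLB */
    while (atomic_read(&flush_cpu_count) > 0) {
            if ((ia64_get_itc() - start) > 40000UL) {
                    /*
                     * A slow cpu may still be inside handle_IPI() and
                     * about to decrement flush_cpu_count.  Resetting
                     * the count here and resending lets that late
                     * decrement corrupt it, and a later flush can then
                     * rewrite flush_start/flush_end while some cpu is
                     * still purging with the old values.
                     */
                    atomic_set(&flush_cpu_count, smp_num_cpus - 1);
                    smp_send_flush_tlb();
                    start = ia64_get_itc();
            }
    }

With a larger timeout we should almost never enter this window in the
first place.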
Thanks,
Asit
> -----Original Message-----
> From: Jack Steiner [mailto:steiner@sgi.com]
> Sent: Thursday, February 22, 2001 12:48 PM
> To: linux-ia64@linuxia64.org
> Subject: Re: Re: Re: [Linux-ia64] Re: Lockups on 2.4.1
>
>
>
> > > Anyway, I have ITPs connected to the IBM hardware and have noticed
> > > that when the lockup occurs, and we lose video, at least one of the
> > > CPUs is executing in flush_tlb_no_ptcg() or handle_IPI(), in the
> > > 'do' loop where TLB entries are being purged. What I have observed
> > > is that the end address and the start address are in completely
> > > different regions. Usually, the start address is in region register
> > > 1 (address of 0x2000XXXXXXXXXXXX) and the end address is in region
> > > register 3 (address of 0x6000XXXXXXXXXXXX). I don't know if this is
> > > the same problem I am seeing on the Lion, but I plan to connect an
> > > ITP and a serial console (although we haven't been able to get one
> > > to work yet on the Lion with BIOS 71) to see if the symptoms are
> > > the same.
> >
> > FWIW, we have seen EXACTLY the same hang running here on our system.
> > The start/end addresses for the purge cross region boundaries.
> >
> >
> > We are running a 2.4.0 kernel.
>
> I found a problem that was causing the lockup described above & I
> suspect this may be responsible for some of the other hangs various
> folks have seen.
>
> There is code in flush_tlb_no_ptcg() that resends the IPI if other
> cpus have not responded within a short time. If this code gets
> invoked, then it is possible for flush_cpu_count to get corrupted.
> When that happens, a cpu can be executing in handle_IPI() while
> flush_start/flush_end are changing. A cpu can pick up a non-matching
> flush_start/flush_end. This leads to hangs or lost TLB flushes.
>
> To verify that this could cause the hang, I changed the timeout in
> flush_tlb_no_ptcg() from 40000UL to 400UL. I hung before getting to
> multiuser mode with flush_start/flush_end in different regions.
>
> Here is the patch I used. Note: this is against 2.4.0.
>
>
> --- linux-trillian/arch/ia64/kernel/smp.c	Thu Feb 22 14:35:28 2001
> +++ linux/arch/ia64/kernel/smp.c	Thu Feb 22 14:19:46 2001
> @@ -321,6 +321,16 @@
>  {
>  	send_IPI_allbutself(IPI_FLUSH_TLB);
>  }
> +
> +void
> +smp_resend_flush_tlb(void)
> +{
> +	/*
> +	 * Really need a null IPI but since this rarely should happen &
> +	 * since this code will go away, lets not add one.
> +	 */
> +	send_IPI_allbutself(IPI_RESCHEDULE);
> +}
>  #endif /* !CONFIG_ITANIUM_PTCG */
>
>  /*
> --- linux-trillian/arch/ia64/mm/tlb.c	Thu Feb 22 14:35:28 2001
> +++ linux/arch/ia64/mm/tlb.c	Thu Feb 22 14:19:50 2001
> @@ -59,6 +59,7 @@
>  flush_tlb_no_ptcg (unsigned long start, unsigned long end, unsigned long nbits)
>  {
>  	extern void smp_send_flush_tlb (void);
> +	extern void smp_resend_flush_tlb (void);
>  	unsigned long saved_tpr = 0;
>  	unsigned long flags;
>
> @@ -101,9 +102,8 @@
>  	{
>  		unsigned long start = ia64_get_itc();
>  		while (atomic_read(&flush_cpu_count) > 0) {
> -			if ((ia64_get_itc() - start) > 40000UL) {
> -				atomic_set(&flush_cpu_count, smp_num_cpus - 1);
> -				smp_send_flush_tlb();
> +			if ((ia64_get_itc() - start) > 400UL) {
> +				smp_resend_flush_tlb();
>  				start = ia64_get_itc();
>  			}
>  		}
>
> --
> Thanks
>
> Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
>
>
> _______________________________________________
> Linux-IA64 mailing list
> Linux-IA64@linuxia64.org
> http://lists.linuxia64.org/lists/listinfo/linux-ia64
>
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Re: Re: [Linux-ia64] Re: Lockups on 2.4.1
2001-02-22 20:48 Re: Re: [Linux-ia64] Re: Lockups on 2.4.1 Jack Steiner
2001-02-28 0:39 ` Mallick, Asit K
@ 2001-02-28 6:09 ` David Mosberger
2001-02-28 17:05 ` Jack Steiner
2001-02-28 17:56 ` Mallick, Asit K
3 siblings, 0 replies; 5+ messages in thread
From: David Mosberger @ 2001-02-28 6:09 UTC (permalink / raw)
To: linux-ia64
OK, this makes sense: our systems have ptc.g enabled, which explains
why we haven't seen this problem. I made the change to use
smp_resend_flush_tlb() but also increased the timeout by a factor of
10.
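The wait loop then becomes, roughly (sketch only; 400000UL assumes the
factor of 10 is applied to the original 40000UL):

    while (atomic_read(&flush_cpu_count) > 0) {
            if ((ia64_get_itc() - start) > 400000UL) {
                    /* leave flush_cpu_count alone; just re-kick any
                       stragglers with a harmless reschedule IPI */
                    smp_resend_flush_tlb();
                    start = ia64_get_itc();
            }
    }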
Thanks,
--david
>>>>> On Thu, 22 Feb 2001 14:48:03 -0600 (CST), Jack Steiner <steiner@sgi.com> said:
>> > Anyway, I have ITPs connected to the IBM hardware and have noticed
>> > that when the lockup occurs, and we lose video, at least one of the
>> > CPUs is executing in flush_tlb_no_ptcg() or handle_IPI(), in the
>> > 'do' loop where TLB entries are being purged. What I have observed
>> > is that the end address and the start address are in completely
>> > different regions. Usually, the start address is in region register
>> > 1 (address of 0x2000XXXXXXXXXXXX) and the end address is in region
>> > register 3 (address of 0x6000XXXXXXXXXXXX). I don't know if this is
>> > the same problem I am seeing on the Lion, but I plan to connect an
>> > ITP and a serial console (although we haven't been able to get one
>> > to work yet on the Lion with BIOS 71) to see if the symptoms are
>> > the same.
>>
>> FWIW, we have seen EXACTLY the same hang running here on our
>> system. The start/end addresses for the purge cross region
>> boundaries.
>>
>> We are running a 2.4.0 kernel.
Jack> I found a problem that was causing the lockup described above &
Jack> I suspect this may be responsible for some of the other hangs
Jack> various folks have seen.

Jack> There is code in flush_tlb_no_ptcg() that resends the IPI if
Jack> other cpus have not responded within a short time. If this code
Jack> gets invoked, then it is possible for flush_cpu_count to get
Jack> corrupted. When that happens, a cpu can be executing in
Jack> handle_IPI() while flush_start/flush_end are changing. A cpu can
Jack> pick up a non-matching flush_start/flush_end. This leads to
Jack> hangs or lost TLB flushes.

Jack> To verify that this could cause the hang, I changed the timeout
Jack> in flush_tlb_no_ptcg() from 40000UL to 400UL. I hung before
Jack> getting to multiuser mode with flush_start/flush_end in
Jack> different regions.

Jack> Here is the patch I used. Note: this is against 2.4.0.
Jack> --- linux-trillian/arch/ia64/kernel/smp.c	Thu Feb 22 14:35:28 2001
Jack> +++ linux/arch/ia64/kernel/smp.c	Thu Feb 22 14:19:46 2001
Jack> @@ -321,6 +321,16 @@
Jack>  {
Jack>  	send_IPI_allbutself(IPI_FLUSH_TLB);
Jack>  }
Jack> +
Jack> +void
Jack> +smp_resend_flush_tlb(void)
Jack> +{
Jack> +	/*
Jack> +	 * Really need a null IPI but since this rarely should happen &
Jack> +	 * since this code will go away, lets not add one.
Jack> +	 */
Jack> +	send_IPI_allbutself(IPI_RESCHEDULE);
Jack> +}
Jack>  #endif /* !CONFIG_ITANIUM_PTCG */
Jack>
Jack>  /*
Jack> --- linux-trillian/arch/ia64/mm/tlb.c	Thu Feb 22 14:35:28 2001
Jack> +++ linux/arch/ia64/mm/tlb.c	Thu Feb 22 14:19:50 2001
Jack> @@ -59,6 +59,7 @@
Jack>  flush_tlb_no_ptcg (unsigned long start, unsigned long end, unsigned long nbits)
Jack>  {
Jack>  	extern void smp_send_flush_tlb (void);
Jack> +	extern void smp_resend_flush_tlb (void);
Jack>  	unsigned long saved_tpr = 0;
Jack>  	unsigned long flags;
Jack>
Jack> @@ -101,9 +102,8 @@
Jack>  	{
Jack>  		unsigned long start = ia64_get_itc();
Jack>  		while (atomic_read(&flush_cpu_count) > 0) {
Jack> -			if ((ia64_get_itc() - start) > 40000UL) {
Jack> -				atomic_set(&flush_cpu_count, smp_num_cpus - 1);
Jack> -				smp_send_flush_tlb();
Jack> +			if ((ia64_get_itc() - start) > 400UL) {
Jack> +				smp_resend_flush_tlb();
Jack>  				start = ia64_get_itc();
Jack>  			}
Jack>  		}

Jack> --
Jack> Thanks
Jack> Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Re: Re: [Linux-ia64] Re: Lockups on 2.4.1
2001-02-22 20:48 Re: Re: [Linux-ia64] Re: Lockups on 2.4.1 Jack Steiner
2001-02-28 0:39 ` Mallick, Asit K
2001-02-28 6:09 ` David Mosberger
@ 2001-02-28 17:05 ` Jack Steiner
2001-02-28 17:56 ` Mallick, Asit K
3 siblings, 0 replies; 5+ messages in thread
From: Jack Steiner @ 2001-02-28 17:05 UTC (permalink / raw)
To: linux-ia64
> Thanks for investigating the problem and the patch. The problem is
> happening because the timeout (40000UL) is not long enough. A
> processor is taking longer than this to complete handle_IPI(), so
> the processor doing flush_tlb_no_ptcg() times out and sends the IPI
> again. So we should increase the timeout rather than decrease it, to
> avoid extra reschedule IPIs.
I wondered about the timeout too. I agree that we do not want to go thru
the resend_IPI code very often. I planned to add some stats to see
how often the resend occurred & also whether any IPIs were really being
dropped.
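Something as simple as this would do for the stats (hypothetical
counter name; "timeout" stands for whatever value we settle on):

    /* debug counter, to be dumped via printk once in a while */
    static unsigned long flush_resend_count;

    while (atomic_read(&flush_cpu_count) > 0) {
            if ((ia64_get_itc() - start) > timeout) {
                    flush_resend_count++;   /* how often we resend */
                    smp_resend_flush_tlb();
                    start = ia64_get_itc();
            }
    }

If the counter stays at zero with a reasonable timeout, then no IPIs
are really being dropped.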
Is the code to resend IPIs going to remain, OR is it just to work
around an earlier sighting (#117) that reported IPIs being dropped?
If the resend is permanent, then the "reschedule IPI" hack needs to
be changed to something more like a null IPI. If the resend is
temporary, then just increasing the timeout seems fine.
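Something like this is what I have in mind for the null IPI (IPI_NOP
is a made-up vector name, and the handle_IPI() fragment is only a
sketch):

    void
    smp_send_null_ipi(void)
    {
            /* wakes up cpus that missed the original IPI_FLUSH_TLB;
               unlike IPI_RESCHEDULE it has no side effects */
            send_IPI_allbutself(IPI_NOP);
    }

    /* in handle_IPI() */
    case IPI_NOP:
            /* nothing to do -- taking the interrupt is enough */
            break;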
--
Thanks
Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: Re: Re: [Linux-ia64] Re: Lockups on 2.4.1
2001-02-22 20:48 Re: Re: [Linux-ia64] Re: Lockups on 2.4.1 Jack Steiner
` (2 preceding siblings ...)
2001-02-28 17:05 ` Jack Steiner
@ 2001-02-28 17:56 ` Mallick, Asit K
3 siblings, 0 replies; 5+ messages in thread
From: Mallick, Asit K @ 2001-02-28 17:56 UTC (permalink / raw)
To: linux-ia64
>
> Is the code to resend IPIs going to remain, OR is it just to work
> around an earlier sighting (#117) that reported IPIs being dropped?
> If the resend is permanent, then the "reschedule IPI" hack needs to
> be changed to something more like a null IPI. If the resend is
> temporary, then just increasing the timeout seems fine.
This is a workaround and is not needed for B3 and later steppings.
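Schematically (illustrative only, not the actual source), the whole
path is compiled away on the later steppings:

    #ifndef CONFIG_ITANIUM_PTCG
            /* pre-B3 workaround: IPI-based flush, resend included */
            flush_tlb_no_ptcg(start, end, nbits);
    #else
            /* B3 and later: ptc.g broadcasts the purge in hardware,
               so no IPIs and no timeout are involved -- the function
               name below is illustrative */
            ptcg_purge_range(start, end, nbits);
    #endif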
Thanks,
Asit
^ permalink raw reply [flat|nested] 5+ messages in thread