Re: page fault scalability patch V12 [0/7]: Overview and performance tests

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
       [not found]               ` <20041201223441.3820fbc0.akpm@osdl.org>
@ 2004-12-02  7:00                 ` Jeff Garzik
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
  2004-12-02 14:30                   ` Andy Warner
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff Garzik @ 2004-12-02  7:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, benh, linux-kernel, linux-ide@vger.kernel.org

Andrew Morton wrote:
> We need an -rc3 yet.  And I need to do another pass through the
> regressions-since-2.6.9 list.  We've made pretty good progress there
> recently.  Mid to late December is looking like the 2.6.10 date.

another for that list, BTW:

I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes 
ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, 
but works on UP.  Potentially related to >=4GB of RAM.

Details, in case anyone is interested:
Unless my code is screwed up (certainly possible), PIO data-in [using 
the insw() call] seems to return all zeroes on a true-blue SMP machine, 
for the identify-device command.  When this happens, libata (correctly) 
detects a bad id page and bails.  (problem doesn't show up on single CPU 
w/ HT)

What changed from 2.6.8 to 2.6.9 is

2.6.8:
	bitbang ATA taskfile registers (loads command)
	bitbang ATA data register (read id page)

2.6.9:
	bitbang ATA taskfile registers
	queue_work()
	workqueue thread bitbangs ATA data register (read id page)

So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the 
on-board SATA PCI device, then immediately after that, CPU 1 sending I/O 
traffic.

Anyway, back to debugging...  :)

	Jeff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:00                 ` page fault scalability patch V12 [0/7]: Overview and performance tests Jeff Garzik
@ 2004-12-02  7:05                   ` Benjamin Herrenschmidt
  2004-12-02  7:11                     ` Jeff Garzik
  2004-12-02 14:30                   ` Andy Warner
  1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2004-12-02  7:05 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

On Thu, 2004-12-02 at 02:00 -0500, Jeff Garzik wrote:

> 
> 2.6.9:
> 	bitbang ATA taskfile registers
> 	queue_work()
> 	workqueue thread bitbangs ATA data register (read id page)
> 
> So I wonder if <something> doesn't like CPU 0 sending I/O traffic to the 
> on-board SATA PCI device, then immediately after that, CPU 1 sending I/O 
> traffic.
> 
> Anyway, back to debugging...  :)

They may not end up in order if they are stores (the stores to the
taskfile may be out of order vs; the loads/stores to/from the data
register) unless you have a spinlock protecting both or a full sync (on
ppc), but then, I don't know the ordering things on x86_64. This could
certainly be a problem on ppc & ppc64 too.

Ben.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
@ 2004-12-02  7:11                     ` Jeff Garzik
  2004-12-02 11:16                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Garzik @ 2004-12-02  7:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

Benjamin Herrenschmidt wrote:
> They may not end up in order if they are stores (the stores to the
> taskfile may be out of order vs; the loads/stores to/from the data
> register) unless you have a spinlock protecting both or a full sync (on
> ppc), but then, I don't know the ordering things on x86_64. This could
> certainly be a problem on ppc & ppc64 too.


Is synchronization beyond in[bwl] needed, do you think?

This specific problem is only on Intel ICHx AFAICS, which is PIO not 
MMIO and x86-only.  I presumed insw() by its very nature already has 
synchronization, but perhaps not...

	Jeff



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:11                     ` Jeff Garzik
@ 2004-12-02 11:16                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2004-12-02 11:16 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel list, list linux-ide

On Thu, 2004-12-02 at 02:11 -0500, Jeff Garzik wrote:
> Benjamin Herrenschmidt wrote:
> > They may not end up in order if they are stores (the stores to the
> > taskfile may be out of order vs; the loads/stores to/from the data
> > register) unless you have a spinlock protecting both or a full sync (on
> > ppc), but then, I don't know the ordering things on x86_64. This could
> > certainly be a problem on ppc & ppc64 too.
> 
> 
> Is synchronization beyond in[bwl] needed, do you think?

Yes, when potentially hop'ing between CPUs, definitely.

> This specific problem is only on Intel ICHx AFAICS, which is PIO not 
> MMIO and x86-only.  I presumed insw() by its very nature already has 
> synchronization, but perhaps not...

Hrm... on "pure" x86, I would expect so at the HW level, not sure about
x86_64... but there would be definitely an issue on ppc with your
scheme. You need at least a full barrier before you trigger the
workqueue. That may not be the problem you are facing now, but it would
become one.

Ben.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02  7:00                 ` page fault scalability patch V12 [0/7]: Overview and performance tests Jeff Garzik
  2004-12-02  7:05                   ` Benjamin Herrenschmidt
@ 2004-12-02 14:30                   ` Andy Warner
  2005-01-06 23:40                     ` Jeff Garzik
  1 sibling, 1 reply; 6+ messages in thread
From: Andy Warner @ 2004-12-02 14:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, torvalds, benh, linux-kernel,
	linux-ide@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2854 bytes --]

Jeff Garzik wrote:
> [...]
> I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes 
> ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, 
> but works on UP.  Potentially related to >=4GB of RAM.
> 
> 
> 
> Details, in case anyone is interested:
> Unless my code is screwed up (certainly possible), PIO data-in [using 
> the insw() call] seems to return all zeroes on a true-blue SMP machine, 
> for the identify-device command.  When this happens, libata (correctly) 
> detects a bad id page and bails.  (problem doesn't show up on single CPU 
> w/ HT)

Ah, I might have been here recently, with the pass-thru stuff.

What I saw was that in an SMP machine:

1. queue_work() can result in the work running (on another
   CPU) instantly.

2. Having one CPU beat on PIO registers reading data from one port
   would significantly alter the timing of the CMD->BSY->DRQ sequence
   used in PIO. This behaviour was far worse for competing ports
   within one chip, which I put down to arbitration problems.

3. CPU utilisation would go through the roof. Effectively the
   entire pio_task state machine reduced to a busy spin loop.

4. The state machine needed some tweaks, especially in error
   handling cases.

I made some changes, which effectively solved the problem for promise
TX4-150 cards, and was going to test the results on other chipsets
next week before speaking up. Specifically, I have seen some
issues with SiI 3114 cards.

I was trying to explore using interrupts instead of polling state
but for some reason, I was not getting them for PIO data operations,
or I misunderstand the spec, after removing ata_qc_set_polling() - again
I saw a difference in behaviour between the Promise & SiI cards
here.

I'm about to go offline for 3 days, and hadn't prepared for this
yet. The best I can do is provide a patch (attached) that applies
against 2.6.9. It also seems to apply against libata-2.6, but
barfs a bit against libata-dev-2.6.

The changes boil down to these:

1. Minor changes in how status/error regs are read.
   Including attempts to use altstatus, while I was
   exploring interrupts.

2. State machine logic changes.

3. Replace calls to queue_work() with queue_delayed_work()
   to stop SMP machines going crazy.

With these changes, on a platform consisting of 2.6.9 and
Promise TX4-150 cards, I can move terabytes of parallel
PIO data, without error.

My gut says that the PIO mechanism should be overhauled, I
composed a "how much should we pay for this muffler" email
to linux-ide at least twice while working on this, but never
sent it - wanting to send a solution in rather than just
making more comments from the peanut gallery.

I'll pick up the thread on this next week, when I'm back online.
I hope this helps.
-- 
andyw@pobox.com

Andy Warner		Voice: (612) 801-8549	Fax: (208) 575-5634

[-- Attachment #2: 2.6.9-pio-smp.patch --]
[-- Type: text/plain, Size: 1862 bytes --]

diff -r -u -X dontdiff linux-2.6.9-vanilla/drivers/scsi/libata-core.c linux-2.6.9/drivers/scsi/libata-core.c
--- linux-2.6.9-vanilla/drivers/scsi/libata-core.c	2004-10-18 16:53:06.000000000 -0500
+++ linux-2.6.9/drivers/scsi/libata-core.c	2004-11-24 11:01:40.000000000 -0600
@@ -2099,7 +2099,7 @@
 	}

 	drv_stat = ata_wait_idle(ap);
-	if (!ata_ok(drv_stat)) {
+	if (drv_stat & (ATA_ERR | ATA_DF)) {
 		ap->pio_task_state = PIO_ST_ERR;
 		return;
 	}
@@ -2254,23 +2254,17 @@
 	 * chk-status again.  If still busy, fall back to
 	 * PIO_ST_POLL state.
 	 */
-	status = ata_busy_wait(ap, ATA_BUSY, 5);
-	if (status & ATA_BUSY) {
+	status = ata_altstatus(ap) ;
+	if (!(status & ATA_DRQ)) {
 		msleep(2);
-		status = ata_busy_wait(ap, ATA_BUSY, 10);
-		if (status & ATA_BUSY) {
+		status = ata_altstatus(ap) ;
+		if (!(status & ATA_DRQ)) {
 			ap->pio_task_state = PIO_ST_POLL;
 			ap->pio_task_timeout = jiffies + ATA_TMOUT_PIO;
 			return;
 		}
 	}

-	/* handle BSY=0, DRQ=0 as error */
-	if ((status & ATA_DRQ) == 0) {
-		ap->pio_task_state = PIO_ST_ERR;
-		return;
-	}
-
 	qc = ata_qc_from_tag(ap, ap->active_tag);
 	assert(qc != NULL);

@@ -2321,17 +2315,15 @@
 	case PIO_ST_TMOUT:
 	case PIO_ST_ERR:
 		ata_pio_error(ap);
-		break;
+		return ;
 	}

-	if ((ap->pio_task_state != PIO_ST_IDLE) &&
-	    (ap->pio_task_state != PIO_ST_TMOUT) &&
-	    (ap->pio_task_state != PIO_ST_ERR)) {
+	if (ap->pio_task_state != PIO_ST_IDLE) {
 		if (timeout)
 			queue_delayed_work(ata_wq, &ap->pio_task,
 					   timeout);
 		else
-			queue_work(ata_wq, &ap->pio_task);
+			queue_delayed_work(ata_wq, &ap->pio_task, 2);
 	}
 }

@@ -2624,7 +2616,7 @@
 		ata_qc_set_polling(qc);
 		ata_tf_to_host_nolock(ap, &qc->tf);
 		ap->pio_task_state = PIO_ST;
-		queue_work(ata_wq, &ap->pio_task);
+		queue_delayed_work(ata_wq, &ap->pio_task, 2);
 		break;

 	case ATA_PROT_ATAPI:

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: page fault scalability patch V12 [0/7]: Overview and performance tests
  2004-12-02 14:30                   ` Andy Warner
@ 2005-01-06 23:40                     ` Jeff Garzik
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Garzik @ 2005-01-06 23:40 UTC (permalink / raw)
  To: Andy Warner
  Cc: Andrew Morton, torvalds, benh, linux-kernel,
	linux-ide@vger.kernel.org

Andy Warner wrote:
> Jeff Garzik wrote:
> 
>>[...]
>>I am currently chasing a 2.6.8->2.6.9 SATA regression, which causes 
>>ata_piix (Intel ICH5/6/7) to not-find some SATA devices on x86-64 SMP, 
>>but works on UP.  Potentially related to >=4GB of RAM.
>>
>>
>>
>>Details, in case anyone is interested:
>>Unless my code is screwed up (certainly possible), PIO data-in [using 
>>the insw() call] seems to return all zeroes on a true-blue SMP machine, 
>>for the identify-device command.  When this happens, libata (correctly) 
>>detects a bad id page and bails.  (problem doesn't show up on single CPU 
>>w/ HT)
> 
> 
> Ah, I might have been here recently, with the pass-thru stuff.
> 
> What I saw was that in an SMP machine:
> 
> 1. queue_work() can result in the work running (on another
>    CPU) instantly.
> 
> 2. Having one CPU beat on PIO registers reading data from one port
>    would significantly alter the timing of the CMD->BSY->DRQ sequence
>    used in PIO. This behaviour was far worse for competing ports
>    within one chip, which I put down to arbitration problems.
> 
> 3. CPU utilisation would go through the roof. Effectively the
>    entire pio_task state machine reduced to a busy spin loop.
> 
> 4. The state machine needed some tweaks, especially in error
>    handling cases.
> 
> I made some changes, which effectively solved the problem for promise
> TX4-150 cards, and was going to test the results on other chipsets
> next week before speaking up. Specifically, I have seen some
> issues with SiI 3114 cards.
> 
> I was trying to explore using interrupts instead of polling state
> but for some reason, I was not getting them for PIO data operations,
> or I misunderstand the spec, after removing ata_qc_set_polling() - again
> I saw a difference in behaviour between the Promise & SiI cards
> here.
> 
> I'm about to go offline for 3 days, and hadn't prepared for this
> yet. The best I can do is provide a patch (attached) that applies
> against 2.6.9. It also seems to apply against libata-2.6, but
> barfs a bit against libata-dev-2.6.
> 
> The changes boil down to these:
> 
> 1. Minor changes in how status/error regs are read.
>    Including attempts to use altstatus, while I was
>    exploring interrupts.
> 
> 2. State machine logic changes.
> 
> 3. Replace calls to queue_work() with queue_delayed_work()
>    to stop SMP machines going crazy.
> 
> With these changes, on a platform consisting of 2.6.9 and
> Promise TX4-150 cards, I can move terabytes of parallel
> PIO data, without error.
> 
> My gut says that the PIO mechanism should be overhauled, I
> composed a "how much should we pay for this muffler" email
> to linux-ide at least twice while working on this, but never
> sent it - wanting to send a solution in rather than just
> making more comments from the peanut gallery.
> 
> I'll pick up the thread on this next week, when I'm back online.
> I hope this helps.

Please let me know if you still have problems?

The PIO SMP problems seem to be fixed here.

	Jeff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2005-01-06 23:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.44.0411221457240.2970-100000@localhost.localdomain>
     [not found] ` <Pine.LNX.4.58.0411221343410.22895@schroedinger.engr.sgi.com>
     [not found]   ` <Pine.LNX.4.58.0411221419440.20993@ppc970.osdl.org>
     [not found]     ` <Pine.LNX.4.58.0411221424580.22895@schroedinger.engr.sgi.com>
     [not found]       ` <Pine.LNX.4.58.0411221429050.20993@ppc970.osdl.org>
     [not found]         ` <Pine.LNX.4.58.0412011539170.5721@schroedinger.engr.sgi.com>
     [not found]           ` <Pine.LNX.4.58.0412011608500.22796@ppc970.osdl.org>
     [not found]             ` <41AEB44D.2040805@pobox.com>
     [not found]               ` <20041201223441.3820fbc0.akpm@osdl.org>
2004-12-02  7:00                 ` page fault scalability patch V12 [0/7]: Overview and performance tests Jeff Garzik
2004-12-02  7:05                   ` Benjamin Herrenschmidt
2004-12-02  7:11                     ` Jeff Garzik
2004-12-02 11:16                       ` Benjamin Herrenschmidt
2004-12-02 14:30                   ` Andy Warner
2005-01-06 23:40                     ` Jeff Garzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).