Patch and Performance of larger pipes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Patch and Performance of larger pipes
@ 2001-10-18 15:34 Hubertus Franke
  2001-10-18 17:43 ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Hubertus Franke @ 2001-10-18 15:34 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


                         PIPE: Buffer Expansion
                         ----------------------

In this we will report on some experimentation to improve Linux pipe
performance. There are two basic parameters that govern the Linux pipe 
implementation. 
(a) PIPE_SIZE is the size of the pipe buffer. 
(b) PIPE_BUF is the maximum number of bytes an application 
    can write atomically to the pipe.

In the current implementation the size of pipe buffer is PAGE_SIZE (most
architectures that is 4kB, and the PIPE_BUF is fixed to 4kB (and for
ARM=PAGE_SIZE).

We wanted to experiment with larger pipe buffer support and higher
concurrency of read and writes. Therefore we experimented with the
following items:

(A) expanding the pipe buffer size from 1 page to 2,4,8 pages.

(B) improving the pipe's concurrency of read and write by introducing 
   intermittent activation of pending readers/writers rather than at 
   the end of a pipe transaction (read/write). The PIPE_BUF 
   atomicity constraint is still observed. We therefore introduce 
   the term of a PIPE_SEG which is a multiple of the PAGE_SIZE and
   determines when to wake up pending readers and writers.

   Consider the pipe buffer size to be 32k. The space available to 
   write on the pipe is 32k and the data which is coming to be 
   written onto the pipe is also 32k. By keeping the segment size 
   as 4k, write the first 4k of the total 32k data instead of 
   writing the entire 32k data, and inform the reader process that 
   some data is there to be read, and the writer process proceeds 
   with the next 4k. By that time reader process starts reading 
   the available data. Intuitively this should create 
   more concurrency. 

Throughout this experimentation, we kept the PIPE_BUF (atomicity
guarantee) constant at 4kB.

Benchmarks
----------

The benchmarks we ran for measuring the performance of pipes are
LMBench, Grep, and Pipeflex. The description of these are below.
While LMBench is a widely used OS-Benchmark, we found that Grep and
Pipeflex model more real applications. All are descripted in more
detail below. All applications use different data transfer sizes
aka chunk sizes shown as TS. We report on the two aspects of our
implementation, i.e. larger pipes and intermittent activations
(PIPE_SEG) which is always 4k. All results are shown as % improvement
over the baseline kernel (2.4.9) and all these benchmarks are run on a
2-way Pentium II, 333Mhz machine.

                          Results Summary:
                          ================

UP + 1-way SMP:
---------------
Neither (A) nor (A)+(B) showed any improvements. Instead degradations 
of up to 30% are observed. Obviously our approach/patch does not 
make any sense on the 1-way systems.

N-way SMP:
----------

1. Increasing the pipe buffer size (A) increasingly improves the 
   performance of the Grep benchmark by upto 165% for size 32kB. 
   However, Grep does not show any added benefit nor
   degradation utilizing (A)+(B), i.e, expanding the pipe buffer
   size AND introducing the segment size=4kB.

2. For LMBench, (A) alone shows some improvements for small transfer
   sizes (TS<PIPE_SIZE). For TS>>PIPE_SIZE we observe degradation.
   Introducing (A)+(B) shows even better improvements for small TS
   with very small degradations for larger TS.

3. For pipeflex (A) provides increasing benefits with upto 358% 
   improvements, without any loss at the low end. When introducing 
   (A)+(B), the benefits drops but are still substantial.

Based on the results it is clear that expansion of PIPE_SIZE AND
PIPE_SEG introduction gives better performance for some scenarios. 


Grep
----

This benchmark measures the time taken to grep for some unexisting
pattern on a 50mb file.  ie. cat 50mbfile | grep "$$$$". We assume a
warm file cache.

LMBench
-------

LMBench provides a tool to measure the bandwidth of the pipe
(bw_pipe).  bw_pipe creates pipes between two processes and moves 10MB
through the pipe in 64KB chunks. We altered that code by providing the
chunk size as a variable input parameter.  ie. bw_pipe [2,4,...,32]

Pipeflex
--------

As LMBench does continuous read and write over the pipe in a
synchronous manner (which is not the case in real life), we studied
some test cases which uses pipes(grep, wc, sort, gunzip, ..), and
based on that we have written this pipeflex benchmark.

Here a write process writes smaller chunks continuously and the reader
process generates a number between [0.5*r .. 1.5*r] microseconds, and
spends that time for computation after each pipe reads.

A parent process clones 'c' child processes and 'c/2' pipes such that 
2 processes shares one pipe.
ie. pipeflex -c 2 -t 20 -r 500 -s 4

c : number of children/threads to launch (should be EVEN)
t : time for which each run of the test should be performed.
r : microseconds spend in computation after each pipe reads.
s : data to transfer over pipe in Kilo bytes.


Dynamically assigning values for PIPE_SIZE and PIPE_SEG
-------------------------------------------------------

In our current implementation, the size of PIPE_SIZE and PIPE_SEG can
be changed dynamically by writing the values into the newly created
/proc/sys/fs/pipe-sz file through a string having the following
format:

Po So

where Po is the Pipe size order
and So is the Segment size order.

Pipe size will be calculated as PIPE_SIZE = (1 << Po) * PAGE_SIZE.
Segment size will be calculated as (PIPE_SIZE >> So).

Similarly 'Po' and 'So' can be read through the same proc file.

* The notation we use in tables for PIPE_SIZE and PIPE_SEG are PS and
SS respectively and TS is the Transfer Size over pipe.


                2-way (% improvement) Results
                =============================

Grep
----

PS         (A)               (A)+(B)
--         ---               -------
4k        -0.87               -0.95  
8k        50.84               50.12  
16k      107.97              115.86  
32k      165.25              164.14  


LMBench
-------
                (A)                             (A)+(B)
                ---                             -------

                PS                                PS  
TS      4k      8k      16k     32k       4k      8k      16k     32k   
--      --      --      ---     ---       --      --      ---     ---  
2k     -0.3     3.26    4.25    3.83     -0.3     2.98    4.25    4.04 
4k     -2.18   18.97   18.59   18.59     -2.18   18.59   18.59   18.59
6k      0.34   13.08   32.7    49.57      0.3    35.63   39.76   55.94
8k      0.14    3.02     0     -0.82     13.87   31.59   50.82   75.27
12k     0.34  -24.09  -18.74  -12.57      0.34    4.4   -14.86    8.23
16k     1.47   -8.88  -14.16  -16.03      1.4    14.42    9.48   13.86
24k     1.17  -13.9     1.65  -23.59      1.17    1.65   -2.72    1.2
32k     0.66  -14.77  -19.83  -25.63      0.66   -3.2    -6.59   -2.92


Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k
--      --      --      ---     ---        --      --      ---     ---
2k      0.00    0.00   -0.27   -0.27      0.00    -0.27   -0.27   -0.27
4k      0.00    0.00    0.00    0.00      0.00     0.00    0.00    0.00
6k      1.61   -1.82   11.46   10.28      1.61    -9.53    9.53    6.75
8k     -2.00   46.20   46.31   46.31     -2.00   -11.08    7.17   23.73
12k    -2.93   54.13   64.58   93.31     -2.93    51.62   46.08   84.22
16k    -3.34   49.95  163.09  162.67     -3.34    51.82   45.57   85.92
24k    -3.56   50.37  135.50  183.46     -3.56    49.95  134.97  144.40
32k    -1.40   54.99  143.29  358.75     -1.40    55.75  142.32  147.37



                1-way (% improvement) Results
                =============================

Grep     
----

PS         (A)               (A)+(B)
--         ---               -------
4k       -0.47               -0.47  
8k        1.9                -2.73  
16k      -2.73               -2.73  
32k       1.9                -2.73  



LMBench
-------
                (A)                             (A)+(B)
                ---                             -------

                PS                                PS  
TS      4k      8k      16k     32k       4k      8k      16k     32k
--      --      --      ---     ---       --      --      ---     ---
2k     0.49     4.42    -3.88  -10.13    0.49    11.02   -7.68  -12.06
4k    -0.5     -1.08   -22.17   -8.6    -0.5     -0.95  -18.65   -4.86
6k     3.73   -19.03   -24.23  -18.56    3.73   -15.85  -24.68  -12.44
8k     0.82   -33.43   -31.92  -30.19    0.82   -34.38  -25.41  -12.81
12k    1.39   -24.06   -30.67  -27.88    1.39   -24.43  -29.79  -27   
16k   -0.87   -29.16   -31.27  -29.73   -0.87   -28.53  -31.97  -28.97
24k    0.16   -28.79   -31.87  -28.37    0.16   -28.16  -31.61  -28.66
32k    0.35   -28.91   -30.73  -27.23    0.35   -28.77  -31.6   -28.77


Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k
--      --      --      ---     ---        --      --      ---     ---
2k    0.00   -0.54   -0.80   -1.07         0.00   -0.54   -0.80   -1.07
4k   -0.14   -0.69   -1.80   -2.21        -0.14   -1.10   -1.80   -2.49
6k   -0.19   -0.19   -2.41   -2.02        -0.19   -0.19   -2.12   -1.16
8k   -0.30   -1.86   -6.41   -6.41        -0.30   -1.19   -3.80   -2.61
12k  -0.33   -1.43   -4.56   -4.01        -0.33   -1.43   -3.84   -3.51
16k  -0.18   -1.77   -4.47   -4.56        -0.18   -1.15   -4.78   -3.85
24k  -0.37   -2.21   -6.94   -5.26        -0.37   -1.51   -6.07   -4.89
32k  -0.42   -2.98   -7.12   -5.18        -0.42   -1.75   -7.06   -5.77


                UP (% improvement) Results
                ==========================

Grep     
----

PS         (A)               (A)+(B)
--         ---               -------                                    
4k        -0.53               1.61
8k        -2.58              -1.56
16k       -4.55              -4.06
32k       -3.08              -4.06


LMBench
-------
                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k
--      --      --      ---     ---        --      --      ---     ---  
2k      7.38     1.17  -15.28  -18.07     4.18   -0.36  -14.32  -20.55  
4k     -1.73    -0.5   -31.7   -26.94    -1.21    5.61  -21.75  -5.17   
6k     -0.3    -22.33  -26.78  -23.36    -0.9   -17.31  -29.61  -19.5   
8k      7.8    -35     -33.75  -30.65     1.71  -37.11  -29.45  -18.03  
12k    -1.09   -25.99  -36.7   -35.77     0.13  -27.76  -35.56  -34.21  
16k     0.08   -31.39  -35.37  -34.28     1.2   -30.93  -36.18  -34.18  
24k     0.42   -32.06  -36.15  -34.65     0.63  -32.28  -36.82  -34.98  
32k     0.82   -31.52  -35.71  -33.41     1.8   -32.11  -36.49  -34.52  

    

Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k
--      --      --      ---     ---        --      --      ---     --- 
2k    -0.27   -0.27   -0.80   -1.06      -0.27   -0.27   -0.80   -1.06  
4k    -0.27   -0.68   -1.77   -2.31      -0.14   -0.54   -1.63   -2.18  
6k    -0.66   -0.94   -3.85   -3.94      -0.47   -0.66   -2.63   -1.97  
8k    -0.58   -1.96   -6.88   -6.88      -0.58   -1.30   -4.49   -3.48  
12k   -1.00   -2.16   -5.59   -5.59      -0.84   -1.79   -4.75   -4.54  
16k   -1.22   -2.40   -6.02   -6.18      -1.14   -2.10   -6.06   -5.34  
24k   -1.50   -3.41   -8.68   -7.23      -1.72   -2.76   -7.89   -7.11  
32k   -1.92   -4.44   -9.27   -7.94      -1.79   -3.43   -9.45   -8.41  


diff -urN linux-2.4.9-v/fs/pipe.c linux-2.4.9-pipe-new/fs/pipe.c
--- linux-2.4.9-v/fs/pipe.c	Sun Aug 12 21:58:52 2001
+++ linux-2.4.9-pipe-new/fs/pipe.c	Tue Oct  9 10:48:15 2001
@@ -23,6 +23,14 @@
  * -- Julian Bradfield 1999-06-07.
  */
 
+#ifdef CONFIG_SMP 
+#define IS_SMP (1)
+#else 
+#define IS_SMP (0)
+#endif
+
+struct pipe_stat_t pipe_stat;
+
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct inode * inode)
 {
@@ -85,30 +93,40 @@
 
 	/* Read what data is available.  */
 	ret = -EFAULT;
-	while (count > 0 && (size = PIPE_LEN(*inode))) {
-		char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
-		ssize_t chars = PIPE_MAX_RCHUNK(*inode);
-
-		if (chars > count)
-			chars = count;
-		if (chars > size)
-			chars = size;
-
-		if (copy_to_user(buf, pipebuf, chars))
-			goto out;
+	if (count > 0 && (size = PIPE_LEN(*inode))) {
+		do {
+			char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
+			ssize_t chars = PIPE_MAX_RCHUNK(*inode);
+			
+			if (chars > count)
+				chars = count;
+			if (chars > size)
+				chars = size;
+			if (IS_SMP && PIPE_ORDER(*inode) && (chars > PIPE_SEG(*inode))) 
+				chars = PIPE_SEG(*inode);
+			
+			if (copy_to_user(buf, pipebuf, chars))
+				goto out;
 
-		read += chars;
-		PIPE_START(*inode) += chars;
-		PIPE_START(*inode) &= (PIPE_SIZE - 1);
-		PIPE_LEN(*inode) -= chars;
-		count -= chars;
-		buf += chars;
+			read += chars;
+			PIPE_START(*inode) += chars;
+			PIPE_START(*inode) &= (PIPE_SIZE(*inode) - 1);
+			PIPE_LEN(*inode) -= chars;
+			count -= chars;
+			buf += chars;
+               		if ((count <= 0) || (!(size = PIPE_LEN(*inode))))
+	                       	break;
+        	      	if (IS_SMP && PIPE_ORDER(*inode) && PIPE_WAITING_WRITERS(*inode) && 
+			    !(filp->f_flags & O_NONBLOCK)) 
+                       		wake_up_interruptible_sync(PIPE_WAIT(*inode));
+			
+            	} while(1);
 	}
 
 	/* Cache behaviour optimization */
 	if (!PIPE_LEN(*inode))
 		PIPE_START(*inode) = 0;
-
+	
 	if (count && PIPE_WAITING_WRITERS(*inode) && !(filp->f_flags & O_NONBLOCK)) {
 		/*
 		 * We know that we are going to sleep: signal
@@ -187,10 +205,15 @@
 		ssize_t chars = PIPE_MAX_WCHUNK(*inode);
 
 		if ((space = PIPE_FREE(*inode)) != 0) {
+			pipebuf = PIPE_BASE(*inode) + PIPE_END(*inode);
+			chars = PIPE_MAX_WCHUNK(*inode);
+
 			if (chars > count)
 				chars = count;
 			if (chars > space)
 				chars = space;
+			if (IS_SMP && PIPE_ORDER(*inode) && (chars > PIPE_SEG(*inode)))
+				chars = PIPE_SEG(*inode);
 
 			if (copy_from_user(pipebuf, buf, chars))
 				goto out;
@@ -200,6 +223,9 @@
 			count -= chars;
 			buf += chars;
 			space = PIPE_FREE(*inode);
+   			if (IS_SMP && PIPE_ORDER(*inode) && (count > 0) && space &&
+			    PIPE_WAITING_READERS(*inode) && !(filp->f_flags & O_NONBLOCK))
+                                wake_up_interruptible_sync(PIPE_WAIT(*inode));
 			continue;
 		}
 
@@ -231,14 +257,14 @@
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(inode);
 
-out:
+ out:
 	up(PIPE_SEM(*inode));
-out_nolock:
+ out_nolock:
 	if (written)
 		ret = written;
 	return ret;
 
-sigpipe:
+ sigpipe:
 	if (written)
 		goto out;
 	up(PIPE_SEM(*inode));
@@ -309,7 +335,7 @@
 	if (!PIPE_READERS(*inode) && !PIPE_WRITERS(*inode)) {
 		struct pipe_inode_info *info = inode->i_pipe;
 		inode->i_pipe = NULL;
-		free_page((unsigned long) info->base);
+		free_pages((unsigned long) info->base, info->order);
 		kfree(info);
 	} else {
 		wake_up_interruptible(PIPE_WAIT(*inode));
@@ -443,8 +469,12 @@
 struct inode* pipe_new(struct inode* inode)
 {
 	unsigned long page;
+	int pipe_order = pipe_stat.pipe_size_order;
+
+	if (pipe_order > MAX_PIPE_ORDER)
+		pipe_order = MAX_PIPE_ORDER;
 
-	page = __get_free_page(GFP_USER);
+	page = __get_free_pages(GFP_USER, pipe_order);
 	if (!page)
 		return NULL;
 
@@ -458,10 +488,11 @@
 	PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
 	PIPE_WAITING_READERS(*inode) = PIPE_WAITING_WRITERS(*inode) = 0;
 	PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;
+	PIPE_ORDER(*inode) = pipe_order;
 
 	return inode;
-fail_page:
-	free_page(page);
+ fail_page:
+	free_pages(page, pipe_order);
 	return NULL;
 }
 
@@ -477,12 +508,12 @@
 static struct inode * get_pipe_inode(void)
 {
 	struct inode *inode = get_empty_inode();
-
 	if (!inode)
 		goto fail_inode;
 
 	if(!pipe_new(inode))
 		goto fail_iput;
+
 	PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 1;
 	inode->i_fop = &rdwr_pipe_fops;
 	inode->i_sb = pipe_mnt->mnt_sb;
@@ -501,9 +532,9 @@
 	inode->i_blksize = PAGE_SIZE;
 	return inode;
 
-fail_iput:
+ fail_iput:
 	iput(inode);
-fail_inode:
+ fail_inode:
 	return NULL;
 }
 
@@ -572,20 +603,20 @@
 	fd[1] = j;
 	return 0;
 
-close_f12_inode_i_j:
+ close_f12_inode_i_j:
 	put_unused_fd(j);
-close_f12_inode_i:
+ close_f12_inode_i:
 	put_unused_fd(i);
-close_f12_inode:
-	free_page((unsigned long) PIPE_BASE(*inode));
+ close_f12_inode:
+	free_pages((unsigned long) PIPE_BASE(*inode), PIPE_ORDER(*inode));
 	kfree(inode->i_pipe);
 	inode->i_pipe = NULL;
 	iput(inode);
-close_f12:
+ close_f12:
 	put_filp(f2);
-close_f1:
+ close_f1:
 	put_filp(f1);
-no_files:
+ no_files:
 	return error;	
 }
 
@@ -631,7 +662,7 @@
 }
 
 static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
-	FS_NOMOUNT|FS_SINGLE);
+		      FS_NOMOUNT|FS_SINGLE);
 
 static int __init init_pipe_fs(void)
 {
diff -urN linux-2.4.9-v/include/linux/pipe_fs_i.h linux-2.4.9-pipe-new/include/linux/pipe_fs_i.h
--- linux-2.4.9-v/include/linux/pipe_fs_i.h	Wed Apr 25 17:18:23 2001
+++ linux-2.4.9-pipe-new/include/linux/pipe_fs_i.h	Tue Oct  9 09:35:35 2001
@@ -2,6 +2,8 @@
 #define _LINUX_PIPE_FS_I_H
 
 #define PIPEFS_MAGIC 0x50495045
+#define MAX_PIPE_ORDER	3
+
 struct pipe_inode_info {
 	wait_queue_head_t wait;
 	char *base;
@@ -13,12 +15,20 @@
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	unsigned int order;
+};
+
+struct pipe_stat_t{
+	int pipe_size_order;
+	int pipe_seg_order;
 };
+extern struct pipe_stat_t pipe_stat;
 
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
-#define PIPE_SIZE		PAGE_SIZE
+#define PIPE_SIZE(inode)	((1 << PIPE_ORDER(inode)) * PAGE_SIZE)
 
+#define PIPE_ORDER(inode)	((inode).i_pipe->order)
 #define PIPE_SEM(inode)		(&(inode).i_sem)
 #define PIPE_WAIT(inode)	(&(inode).i_pipe->wait)
 #define PIPE_BASE(inode)	((inode).i_pipe->base)
@@ -32,12 +42,13 @@
 #define PIPE_WCOUNTER(inode)	((inode).i_pipe->w_counter)
 
 #define PIPE_EMPTY(inode)	(PIPE_LEN(inode) == 0)
-#define PIPE_FULL(inode)	(PIPE_LEN(inode) == PIPE_SIZE)
-#define PIPE_FREE(inode)	(PIPE_SIZE - PIPE_LEN(inode))
-#define PIPE_END(inode)	((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE-1))
-#define PIPE_MAX_RCHUNK(inode)	(PIPE_SIZE - PIPE_START(inode))
-#define PIPE_MAX_WCHUNK(inode)	(PIPE_SIZE - PIPE_END(inode))
-
+#define PIPE_FULL(inode)	(PIPE_LEN(inode) == PIPE_SIZE(inode))
+#define PIPE_FREE(inode)	(PIPE_SIZE(inode) - PIPE_LEN(inode))
+#define PIPE_END(inode)	((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE(inode)-1))
+#define PIPE_MAX_RCHUNK(inode)	(PIPE_SIZE(inode) - PIPE_START(inode))
+#define PIPE_MAX_WCHUNK(inode)	(PIPE_SIZE(inode) - PIPE_END(inode))
+#define PIPE_SEG(inode)		((PIPE_ORDER(inode) > pipe_stat.pipe_seg_order) ? \
+				(PIPE_SIZE(inode) >> pipe_stat.pipe_seg_order): PAGE_SIZE)
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct inode * inode);
 
diff -urN linux-2.4.9-v/include/linux/sysctl.h linux-2.4.9-pipe-new/include/linux/sysctl.h
--- linux-2.4.9-v/include/linux/sysctl.h	Wed Aug 15 17:21:21 2001
+++ linux-2.4.9-pipe-new/include/linux/sysctl.h	Tue Oct  9 10:12:48 2001
@@ -533,6 +533,7 @@
 	FS_LEASES=13,	/* int: leases enabled */
 	FS_DIR_NOTIFY=14,	/* int: directory notification enabled */
 	FS_LEASE_TIME=15,	/* int: maximum time to wait for a lease break */
+	FS_PIPE_SIZE=16,	/* int: current number of allocated pages for PIPE */
 };
 
 /* CTL_DEBUG names: */
diff -urN linux-2.4.9-v/kernel/sysctl.c linux-2.4.9-pipe-new/kernel/sysctl.c
--- linux-2.4.9-v/kernel/sysctl.c	Thu Aug  9 19:41:36 2001
+++ linux-2.4.9-pipe-new/kernel/sysctl.c	Mon Oct  8 13:19:46 2001
@@ -304,6 +304,8 @@
 	 sizeof(int), 0644, NULL, &proc_dointvec},
 	{FS_LEASE_TIME, "lease-break-time", &lease_break_time, sizeof(int),
 	 0644, NULL, &proc_dointvec},
+	{FS_PIPE_SIZE, "pipe-sz", &pipe_stat, 2*sizeof(int),
+	 0644, NULL, &proc_dointvec},
 	{0}
 };
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 15:34 Hubertus Franke
@ 2001-10-18 17:43 ` David S. Miller
  0 siblings, 0 replies; 9+ messages in thread
From: David S. Miller @ 2001-10-18 17:43 UTC (permalink / raw)
  To: frankeh; +Cc: lse-tech, linux-kernel

Have you looked at all at the zerocopy pipe patches?
How do they affect the results, and by itself does it
do better than any of the schemes you propose?

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Patch and Performance of larger pipes
@ 2001-10-18 18:07 Manfred Spraul
  2001-10-18 18:18 ` Manfred Spraul
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Manfred Spraul @ 2001-10-18 18:07 UTC (permalink / raw)
  To: Hubertus Franke, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 485 bytes --]

Could you test the attached singlecopy patches?

with bw_pipe,
* on UP, up to +100%.
* on SMP with busy cpus, up to +100%
* on SMP with idle cpus a performance drop due to increased cache
trashing. Probably the scheduler should keep both bw_pipe processes on
the same cpu.

I've sent patch-pgw to Linus for inclusion, since it's needed to fix the
elf coredump deadlock.

patch-kiopipe must wait until 2.5, because it changes the behaviour of
pipe_write with partial reads.

--
	Manfred

[-- Attachment #2: patch-kiopipe --]
[-- Type: text/plain, Size: 16760 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 10
//  EXTRAVERSION =
--- 2.4/fs/pipe.c	Sun Sep 23 21:20:49 2001
+++ build-2.4/fs/pipe.c	Sun Sep 30 12:08:59 2001
@@ -2,6 +2,9 @@
  *  linux/fs/pipe.c
  *
  *  Copyright (C) 1991, 1992, 1999  Linus Torvalds
+ *
+ *  Major pipe_read() and pipe_write() cleanup:  Single copy,
+ *  fewer schedules.	Copyright (C) 2001 Manfred Spraul
  */
 
 #include <linux/mm.h>
@@ -10,6 +13,8 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/compiler.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -36,214 +41,347 @@
 	down(PIPE_SEM(*inode));
 }
 
+#define PIO_PGCOUNT	((131072+PAGE_SIZE-1)/PAGE_SIZE)
+struct pipe_pio {
+	struct list_head list;
+	struct page *pages[PIO_PGCOUNT];
+	int offset;
+	size_t len;
+	size_t orig_len;
+	struct task_struct *tsk;
+};
+
+static ssize_t
+copy_from_piolist(struct list_head *piolist, void *buf, ssize_t len)
+{
+	struct list_head *walk = piolist->next;
+	int ret = 0;
+	while(walk != piolist && len) {
+		struct pipe_pio* pio = list_entry(walk, struct pipe_pio, list);
+		if (pio->len) {
+			struct page *page;
+			void *maddr;
+			int this_len, off, i;
+			int ret2;
+
+			i = pio->offset/PAGE_SIZE;
+			off = pio->offset%PAGE_SIZE;
+			this_len = len;
+			if (this_len > PAGE_SIZE-off)
+				this_len = PAGE_SIZE-off;
+			if (this_len > pio->len)
+				this_len = pio->len;
+
+			page = pio->pages[i];
+			maddr = kmap(page);
+			ret2 = copy_to_user(buf, maddr+off, this_len);
+			flush_page_to_ram(page);
+			kunmap(page);
+			if (unlikely(ret2)) {
+				if (ret)
+					return ret;
+				return -EFAULT;
+			}
+
+			buf += this_len;
+			len -= this_len;
+			pio->len -= this_len;
+			pio->offset += this_len;
+			ret += this_len;
+			if (pio->len == 0)
+				wake_up_process(pio->tsk);
+		} else {
+			walk = walk->next;
+		}
+	}
+	return ret;
+}
+
+static void
+build_pio(struct pipe_pio *pio, struct inode *inode, const void *buf, size_t count)
+{
+	int len;
+	struct vm_area_struct *vmas[PIO_PGCOUNT];
+
+	pio->tsk = current;
+	pio->len = count;
+	pio->offset = (unsigned long)buf&(PAGE_SIZE-1);
+
+	pio->len = PIO_PGCOUNT*PAGE_SIZE - pio->offset;
+	if (pio->len > count)
+		pio->len = count;
+	len = (pio->offset+pio->len+PAGE_SIZE-1)/PAGE_SIZE;
+	down_read(&current->mm->mmap_sem);
+	len = get_user_pages(current, current->mm, (unsigned long)buf, len,
+			0, pio->pages, vmas);
+	if (len > 0) {
+		int i;
+		for(i=0;i<len;i++) {
+			flush_cache_page(vmas[i], addr+i*PAGE_SIZE);
+		}
+		len = len*PAGE_SIZE-pio->offset;
+		if (len < pio->len)
+			pio->len = len;
+		list_add_tail(&pio->list, &PIPE_PIO(*inode));
+		PIPE_PIOLEN(*inode) += pio->len;
+		pio->orig_len = pio->len;
+	} else {
+		pio->list.next = NULL;
+	}
+	up_read(&current->mm->mmap_sem);
+}
+
+static size_t
+teardown_pio(struct pipe_pio *pio, struct inode *inode, const void *buf)
+{
+	int i;
+	if (!pio->list.next)
+		return 0;
+	for (i=0;i<(pio->len+pio->offset+PAGE_SIZE-1)/PAGE_SIZE;i++) {
+		if (pio->pages[i]) {
+			put_page(pio->pages[i]);
+		}
+	}
+	i = pio->orig_len - pio->len;
+	PIPE_PIOLEN(*inode) -= pio->len;
+	list_del(&pio->list);
+	if (i && pio->len) {
+		/*
+		 * We would violate the atomicity requirements:
+		 * 1 byte in the internal buffer.
+		 * write(fd, buf, PIPE_BUF);
+		 * --> doesn't fit into internal buffer, pio build.
+		 * read(fd, buf, 200);(i.e. 199 bytes from pio)
+		 * signal sent to writer.
+		 * The writer must not return with 199 bytes written!
+		 * Fortunately the internal buffer will be empty in this
+		 * case. Write into the internal buffer before
+		 * checking for signals/error conditions.
+		 */
+		size_t j = min((size_t)PIPE_SIZE, pio->len);
+		if (PIPE_LEN(*inode)) BUG();
+		if (PIPE_START(*inode)) BUG();
+		if (!copy_from_user(PIPE_BASE(*inode), buf + i, j)) {
+			i += j;
+			PIPE_LEN(*inode) =  j;
+		}
+	}
+	return i;
+}
+/*
+ * reader:
+	flush_cache_page(vma, addr);
+ *
+		flush_icache_page(vma, page);
+ */
 static ssize_t
 pipe_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
 {
 	struct inode *inode = filp->f_dentry->d_inode;
-	ssize_t size, read, ret;
+	ssize_t read;
 
-	/* Seeks are not allowed on pipes.  */
-	ret = -ESPIPE;
-	read = 0;
-	if (ppos != &filp->f_pos)
-		goto out_nolock;
+	/* pread is not allowed on pipes.  */
+	if (unlikely(ppos != &filp->f_pos))
+		return -ESPIPE;
 
 	/* Always return 0 on null read.  */
-	ret = 0;
-	if (count == 0)
-		goto out_nolock;
-
-	/* Get the pipe semaphore */
-	ret = -ERESTARTSYS;
-	if (down_interruptible(PIPE_SEM(*inode)))
-		goto out_nolock;
-
-	if (PIPE_EMPTY(*inode)) {
-do_more_read:
-		ret = 0;
-		if (!PIPE_WRITERS(*inode))
-			goto out;
+	if (unlikely(count == 0))
+		return 0;
 
-		ret = -EAGAIN;
-		if (filp->f_flags & O_NONBLOCK)
-			goto out;
+	down(PIPE_SEM(*inode));
 
-		for (;;) {
-			PIPE_WAITING_READERS(*inode)++;
-			pipe_wait(inode);
-			PIPE_WAITING_READERS(*inode)--;
-			ret = -ERESTARTSYS;
-			if (signal_pending(current))
-				goto out;
-			ret = 0;
-			if (!PIPE_EMPTY(*inode))
-				break;
-			if (!PIPE_WRITERS(*inode))
+	for (;;) {
+		/* read what data is available */
+		int chars;
+		read = 0;
+		while( (chars = PIPE_LEN(*inode)) ) {
+			char *pipebuf = PIPE_BASE(*inode);
+			int offset = PIPE_START(*inode)%PIPE_BUF;
+			if (chars > count)
+				chars = count;
+			if (chars > PIPE_SIZE-offset)
+				chars = PIPE_SIZE-offset;
+			if (unlikely(copy_to_user(buf, pipebuf+offset, chars))) {
+				if (!read)
+					read = -EFAULT;
 				goto out;
+			}
+			PIPE_LEN(*inode) -= chars;
+			if (!PIPE_LEN(*inode)) {
+				/* Cache behaviour optimization */
+				PIPE_START(*inode) = 0;
+			} else {
+				/* there is no need to limit PIPE_START
+				 * to PIPE_BUF - the user does
+				 * %PIPE_BUF anyway.
+				 */
+				PIPE_START(*inode) += chars;
+			}
+			read += chars;
+			count -= chars;
+			if (!count)
+				goto out; /* common case: done */
+			buf += chars;
+			/* Check again that the internal buffer is empty.
+			 * If it was cyclic more data could be in the buffer.
+			 */
 		}
-	}
-
-	/* Read what data is available.  */
-	ret = -EFAULT;
-	while (count > 0 && (size = PIPE_LEN(*inode))) {
-		char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
-		ssize_t chars = PIPE_MAX_RCHUNK(*inode);
-
-		if (chars > count)
-			chars = count;
-		if (chars > size)
-			chars = size;
+		if (PIPE_PIOLEN(*inode)) {
+			chars = copy_from_piolist(&PIPE_PIO(*inode), buf, count);
+			if (unlikely(chars < 0)) {
+				if (!read)
+					read = chars;
+				goto out;
+			}
+			PIPE_PIOLEN(*inode) -= chars;
+			read += chars;
+			count -= chars;
+			if (!count)
+				goto out; /* common case: done */
+			buf += chars;
 
-		if (copy_to_user(buf, pipebuf, chars))
+		}
+		if (PIPE_PIOLEN(*inode) || PIPE_LEN(*inode)) BUG();
+		/* tests before sleeping:
+		 *  - don't sleep if data was read.
+		 */
+		if (read)
 			goto out;
 
-		read += chars;
-		PIPE_START(*inode) += chars;
-		PIPE_START(*inode) &= (PIPE_SIZE - 1);
-		PIPE_LEN(*inode) -= chars;
-		count -= chars;
-		buf += chars;
-	}
-
-	/* Cache behaviour optimization */
-	if (!PIPE_LEN(*inode))
-		PIPE_START(*inode) = 0;
+		/*  - don't sleep if no process has the pipe open
+		 *	 for writing
+		 */
+		if (unlikely(!PIPE_WRITERS(*inode)))
+			goto out;
 
-	if (count && PIPE_WAITING_WRITERS(*inode) && !(filp->f_flags & O_NONBLOCK)) {
-		/*
-		 * We know that we are going to sleep: signal
-		 * writers synchronously that there is more
-		 * room.
+		/*   - don't sleep if O_NONBLOCK is set */
+		read = -EAGAIN;
+		if (filp->f_flags & O_NONBLOCK)
+			goto out;
+		/*   - don't sleep if a signal is pending */
+		read = -ERESTARTSYS;
+		if (unlikely(signal_pending(current)))
+			goto out;
+		/* readers never need to wake up if they go to sleep:
+		 * They only sleep if they didn't read anything
 		 */
-		wake_up_interruptible_sync(PIPE_WAIT(*inode));
-		if (!PIPE_EMPTY(*inode))
-			BUG();
-		goto do_more_read;
+		pipe_wait(inode);
 	}
-	/* Signal writers asynchronously that there is more room.  */
-	wake_up_interruptible(PIPE_WAIT(*inode));
-
-	ret = read;
 out:
 	up(PIPE_SEM(*inode));
-out_nolock:
-	if (read)
-		ret = read;
-	return ret;
+	/* If we drained the pipe, then wakeup everyone
+	 * waiting for that - either poll(2) or write(2).
+	 * We are only reading, therefore we can access without locking.
+	 */
+	if (read > 0 && !PIPE_PIOLEN(*inode) && !PIPE_LEN(*inode))
+		wake_up_interruptible(PIPE_WAIT(*inode));
+
+	return read;
 }
 
 static ssize_t
 pipe_write(struct file *filp, const char *buf, size_t count, loff_t *ppos)
 {
 	struct inode *inode = filp->f_dentry->d_inode;
-	ssize_t free, written, ret;
-
-	/* Seeks are not allowed on pipes.  */
-	ret = -ESPIPE;
-	written = 0;
-	if (ppos != &filp->f_pos)
-		goto out_nolock;
+	size_t min;
+	ssize_t written;
+	int do_wakeup;
+
+	/* pwrite is not allowed on pipes.  */
+	if (unlikely(ppos != &filp->f_pos))
+		return -ESPIPE;
 
 	/* Null write succeeds.  */
-	ret = 0;
-	if (count == 0)
-		goto out_nolock;
-
-	ret = -ERESTARTSYS;
-	if (down_interruptible(PIPE_SEM(*inode)))
-		goto out_nolock;
-
-	/* No readers yields SIGPIPE.  */
-	if (!PIPE_READERS(*inode))
-		goto sigpipe;
+	if (unlikely(count == 0))
+		return 0;
+	min = count;
+	if (min > PIPE_BUF && (filp->f_flags & O_NONBLOCK))
+		min = 1; /* no atomicity guarantee for transfers > PIPE_BUF */
 
-	/* If count <= PIPE_BUF, we have to make it atomic.  */
-	free = (count <= PIPE_BUF ? count : 1);
-
-	/* Wait, or check for, available space.  */
-	if (filp->f_flags & O_NONBLOCK) {
-		ret = -EAGAIN;
-		if (PIPE_FREE(*inode) < free)
-			goto out;
-	} else {
-		while (PIPE_FREE(*inode) < free) {
-			PIPE_WAITING_WRITERS(*inode)++;
-			pipe_wait(inode);
-			PIPE_WAITING_WRITERS(*inode)--;
-			ret = -ERESTARTSYS;
-			if (signal_pending(current))
-				goto out;
-
-			if (!PIPE_READERS(*inode))
-				goto sigpipe;
+	down(PIPE_SEM(*inode));
+	written = 0;
+	do_wakeup = 0;
+	for(;;) {
+		int start;
+		size_t chars;
+		/* No readers yields SIGPIPE.  */
+		if (unlikely(!PIPE_READERS(*inode))) {
+			if (!written)
+				written = -EPIPE;
+			break;
 		}
-	}
-
-	/* Copy into available space.  */
-	ret = -EFAULT;
-	while (count > 0) {
-		int space;
-		char *pipebuf = PIPE_BASE(*inode) + PIPE_END(*inode);
-		ssize_t chars = PIPE_MAX_WCHUNK(*inode);
-
-		if ((space = PIPE_FREE(*inode)) != 0) {
+		if (PIPE_PIOLEN(*inode))
+			goto skip_int_buf;
+		/* write to internal buffer - could be cyclic */
+		while(start = PIPE_LEN(*inode),chars = PIPE_SIZE - start, chars >= min) {
+			start += PIPE_START(*inode);
+			start %= PIPE_SIZE;
+			if (chars > PIPE_BUF - start)
+				chars = PIPE_BUF - start;
 			if (chars > count)
 				chars = count;
-			if (chars > space)
-				chars = space;
-
-			if (copy_from_user(pipebuf, buf, chars))
+			if (unlikely(copy_from_user(PIPE_BASE(*inode)+start,
+						buf, chars))) {
+				if (!written)
+					written = -EFAULT;
 				goto out;
-
-			written += chars;
+			}
+			do_wakeup = 1;
 			PIPE_LEN(*inode) += chars;
 			count -= chars;
+			written += chars;
+			if (!count)
+				goto out;
 			buf += chars;
-			space = PIPE_FREE(*inode);
-			continue;
+			min = 1;
 		}
-
-		ret = written;
-		if (filp->f_flags & O_NONBLOCK)
+skip_int_buf:
+		if (!filp->f_flags & O_NONBLOCK) {
+			if (!written)
+				written = -EAGAIN;
 			break;
+		}
 
-		do {
-			/*
-			 * Synchronous wake-up: it knows that this process
-			 * is going to give up this CPU, so it doesnt have
-			 * to do idle reschedules.
+		if (unlikely(signal_pending(current))) {
+			if (!written)
+				written = -ERESTARTSYS;
+			break;
+		}
+		{
+			struct pipe_pio my_pio;
+			/* build_pio
+			 * wakeup readers:
+			 * If the pipe was empty and now contains data, then do
+			 * a wakeup. We will sleep --> sync wakeup.
 			 */
-			wake_up_interruptible_sync(PIPE_WAIT(*inode));
-			PIPE_WAITING_WRITERS(*inode)++;
+			build_pio(&my_pio, inode, buf, count);
+			if (do_wakeup || PIPE_PIO(*inode).next == &my_pio.list)
+				wake_up_sync(PIPE_WAIT(*inode));
+			do_wakeup = 0;
 			pipe_wait(inode);
-			PIPE_WAITING_WRITERS(*inode)--;
-			if (signal_pending(current))
-				goto out;
-			if (!PIPE_READERS(*inode))
-				goto sigpipe;
-		} while (!PIPE_FREE(*inode));
-		ret = -EFAULT;
+			chars = teardown_pio(&my_pio, inode, buf);
+			count -= chars;
+			written += chars;
+			if (!count)
+				break;
+			buf += chars;
+		}
 	}
-
-	/* Signal readers asynchronously that there is more data.  */
-	wake_up_interruptible(PIPE_WAIT(*inode));
-
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty(inode);
-
 out:
+	if (written > 0) {
+		/* SuS V2: st_ctime and st_mtime are updated
+		 * uppon successful completion of write(2).
+		 */
+		inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+		mark_inode_dirty(inode);
+	}
 	up(PIPE_SEM(*inode));
-out_nolock:
-	if (written)
-		ret = written;
-	return ret;
 
-sigpipe:
-	if (written)
-		goto out;
-	up(PIPE_SEM(*inode));
-	send_sig(SIGPIPE, current, 0);
-	return -EPIPE;
+	if (do_wakeup)
+		wake_up(PIPE_WAIT(*inode));
+	if (written == -EPIPE)
+		send_sig(SIGPIPE, current, 0);
+	return written;
 }
 
 static loff_t
@@ -270,7 +408,8 @@
 {
 	switch (cmd) {
 		case FIONREAD:
-			return put_user(PIPE_LEN(*pino), (int *)arg);
+			return put_user(PIPE_LEN(*filp->f_dentry->d_inode) +
+					PIPE_PIOLEN(*filp->f_dentry->d_inode), (int *)arg);
 		default:
 			return -EINVAL;
 	}
@@ -286,11 +425,20 @@
 	poll_wait(filp, PIPE_WAIT(*inode), wait);
 
 	/* Reading only -- no need for acquiring the semaphore.  */
+
+	/* 
+	 * POLLIN means that data is available for read.
+	 * POLLOUT means that a nonblocking write will succeed.
+	 * We can only guarantee that if the internal buffers are empty
+	 * Therefore both are mutually exclusive.
+	 */
 	mask = POLLIN | POLLRDNORM;
-	if (PIPE_EMPTY(*inode))
+	if (!PIPE_LEN(*inode) && !PIPE_PIOLEN(*inode))
 		mask = POLLOUT | POLLWRNORM;
+	/* POLLHUP: no writer, and there was at least once a writer */
 	if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode))
 		mask |= POLLHUP;
+	/* POLLERR: no reader */
 	if (!PIPE_READERS(*inode))
 		mask |= POLLERR;
 
@@ -454,9 +602,9 @@
 
 	init_waitqueue_head(PIPE_WAIT(*inode));
 	PIPE_BASE(*inode) = (char*) page;
-	PIPE_START(*inode) = PIPE_LEN(*inode) = 0;
+	INIT_LIST_HEAD(&PIPE_PIO(*inode));
+	PIPE_START(*inode) = PIPE_LEN(*inode) = PIPE_PIOLEN(*inode) = 0;
 	PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
-	PIPE_WAITING_READERS(*inode) = PIPE_WAITING_WRITERS(*inode) = 0;
 	PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;
 
 	return inode;
--- 2.4/include/linux/pipe_fs_i.h	Sat Apr 28 10:37:27 2001
+++ build-2.4/include/linux/pipe_fs_i.h	Sat Sep 29 22:18:31 2001
@@ -5,12 +5,12 @@
 struct pipe_inode_info {
 	wait_queue_head_t wait;
 	char *base;
-	unsigned int len;
+	size_t len;	/* not including pio buffers */
+	size_t piolen;
 	unsigned int start;
+	struct list_head pio;
 	unsigned int readers;
 	unsigned int writers;
-	unsigned int waiting_readers;
-	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
 };
@@ -24,19 +24,15 @@
 #define PIPE_BASE(inode)	((inode).i_pipe->base)
 #define PIPE_START(inode)	((inode).i_pipe->start)
 #define PIPE_LEN(inode)		((inode).i_pipe->len)
+#define PIPE_PIOLEN(inode)	((inode).i_pipe->piolen)
+#define PIPE_PIO(inode)		((inode).i_pipe->pio)
 #define PIPE_READERS(inode)	((inode).i_pipe->readers)
 #define PIPE_WRITERS(inode)	((inode).i_pipe->writers)
-#define PIPE_WAITING_READERS(inode)	((inode).i_pipe->waiting_readers)
-#define PIPE_WAITING_WRITERS(inode)	((inode).i_pipe->waiting_writers)
 #define PIPE_RCOUNTER(inode)	((inode).i_pipe->r_counter)
 #define PIPE_WCOUNTER(inode)	((inode).i_pipe->w_counter)
 
-#define PIPE_EMPTY(inode)	(PIPE_LEN(inode) == 0)
-#define PIPE_FULL(inode)	(PIPE_LEN(inode) == PIPE_SIZE)
 #define PIPE_FREE(inode)	(PIPE_SIZE - PIPE_LEN(inode))
 #define PIPE_END(inode)	((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE-1))
-#define PIPE_MAX_RCHUNK(inode)	(PIPE_SIZE - PIPE_START(inode))
-#define PIPE_MAX_WCHUNK(inode)	(PIPE_SIZE - PIPE_END(inode))
 
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct inode * inode);
--- 2.4/fs/fifo.c	Fri Feb 23 15:25:22 2001
+++ build-2.4/fs/fifo.c	Sat Sep 29 22:18:31 2001
@@ -32,10 +32,8 @@
 {
 	int ret;
 
-	ret = -ERESTARTSYS;
-	lock_kernel();
 	if (down_interruptible(PIPE_SEM(*inode)))
-		goto err_nolock_nocleanup;
+		return -ERESTARTSYS;
 
 	if (!inode->i_pipe) {
 		ret = -ENOMEM;
@@ -116,7 +114,6 @@
 
 	/* Ok! */
 	up(PIPE_SEM(*inode));
-	unlock_kernel();
 	return 0;
 
 err_rd:
@@ -141,9 +138,6 @@
 
 err_nocleanup:
 	up(PIPE_SEM(*inode));
-
-err_nolock_nocleanup:
-	unlock_kernel();
 	return ret;
 }
 

[-- Attachment #3: patch-pgw --]
[-- Type: text/plain, Size: 10558 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 13
//  EXTRAVERSION =-pre3
--- 2.4/include/linux/mm.h	Thu Oct 11 16:51:38 2001
+++ build-2.4/include/linux/mm.h	Tue Oct 16 21:32:05 2001
@@ -431,6 +431,9 @@
 extern int ptrace_detach(struct task_struct *, unsigned int);
 extern void ptrace_disable(struct task_struct *);
 
+int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
+		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+
 /*
  * On a two-level page table, this ends up being trivial. Thus the
  * inlining and the symmetry break with pte_alloc() that does all
--- 2.4/mm/memory.c	Tue Oct 16 21:28:44 2001
+++ build-2.4/mm/memory.c	Tue Oct 16 21:30:02 2001
@@ -404,17 +404,16 @@
 	spin_unlock(&mm->page_table_lock);
 }
 
-
 /*
  * Do a quick page-table lookup for a single page. 
  */
-static struct page * follow_page(unsigned long address, int write) 
+static struct page * follow_page(struct mm_struct *mm, unsigned long address, int write) 
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *ptep, pte;
 
-	pgd = pgd_offset(current->mm, address);
+	pgd = pgd_offset(mm, address);
 	if (pgd_none(*pgd) || pgd_bad(*pgd))
 		goto out;
 
@@ -450,21 +449,70 @@
 	return page;
 }
 
+int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
+		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas)
+{
+	int i = 0;
+
+	do {
+		struct vm_area_struct *	vma;
+
+		vma = find_extend_vma(mm, start);
+
+		if ( !vma ||
+		    (!force &&
+		     	((write && (!(vma->vm_flags & VM_WRITE))) ||
+		    	 (!write && (!(vma->vm_flags & VM_READ))) ) )) {
+			if (i) return i;
+			return -EFAULT;
+		}
+
+		spin_lock(&mm->page_table_lock);
+		do {
+			struct page *map;
+			while (!(map = follow_page(mm, start, write))) {
+				spin_unlock(&mm->page_table_lock);
+				switch (handle_mm_fault(mm, vma, start, write)) {
+				case 1:
+					tsk->min_flt++;
+					break;
+				case 2:
+					tsk->maj_flt++;
+					break;
+				case 0:
+					if (i) return i;
+					return -EFAULT;
+				default:
+					if (i) return i;
+					return -ENOMEM;
+				}
+				spin_lock(&mm->page_table_lock);
+			}
+			if (pages) {
+				pages[i] = get_page_map(map);
+				if (pages[i]) get_page(pages[i]);
+			}
+			if (vmas)
+				vmas[i] = vma;
+			i++;
+			start += PAGE_SIZE;
+			len--;
+		} while(len && start < vma->vm_end);
+		spin_unlock(&mm->page_table_lock);
+	} while(len);
+	return i;
+}
+
 /*
  * Force in an entire range of pages from the current process's user VA,
  * and pin them in physical memory.  
  */
-
 #define dprintk(x...)
+
 int map_user_kiobuf(int rw, struct kiobuf *iobuf, unsigned long va, size_t len)
 {
-	unsigned long		ptr, end;
-	int			err;
+	int pgcount, err;
 	struct mm_struct *	mm;
-	struct vm_area_struct *	vma = 0;
-	struct page *		map;
-	int			i;
-	int			datain = (rw == READ);
 	
 	/* Make sure the iobuf is not already mapped somewhere. */
 	if (iobuf->nr_pages)
@@ -473,79 +521,37 @@
 	mm = current->mm;
 	dprintk ("map_user_kiobuf: begin\n");
 	
-	ptr = va & PAGE_MASK;
-	end = (va + len + PAGE_SIZE - 1) & PAGE_MASK;
-	err = expand_kiobuf(iobuf, (end - ptr) >> PAGE_SHIFT);
+	pgcount = (va + len + PAGE_SIZE - 1)/PAGE_SIZE - va/PAGE_SIZE;
+	/* mapping 0 bytes is not permitted */
+	if (!pgcount) BUG();
+	err = expand_kiobuf(iobuf, pgcount);
 	if (err)
 		return err;
 
-	down_read(&mm->mmap_sem);
-
-	err = -EFAULT;
 	iobuf->locked = 0;
-	iobuf->offset = va & ~PAGE_MASK;
+	iobuf->offset = va & (PAGE_SIZE-1);
 	iobuf->length = len;
 	
-	i = 0;
-	
-	/* 
-	 * First of all, try to fault in all of the necessary pages
-	 */
-	while (ptr < end) {
-		if (!vma || ptr >= vma->vm_end) {
-			vma = find_vma(current->mm, ptr);
-			if (!vma) 
-				goto out_unlock;
-			if (vma->vm_start > ptr) {
-				if (!(vma->vm_flags & VM_GROWSDOWN))
-					goto out_unlock;
-				if (expand_stack(vma, ptr))
-					goto out_unlock;
-			}
-			if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||
-					(!(vma->vm_flags & VM_READ))) {
-				err = -EACCES;
-				goto out_unlock;
-			}
-		}
-		spin_lock(&mm->page_table_lock);
-		while (!(map = follow_page(ptr, datain))) {
-			int ret;
-
-			spin_unlock(&mm->page_table_lock);
-			ret = handle_mm_fault(current->mm, vma, ptr, datain);
-			if (ret <= 0) {
-				if (!ret)
-					goto out_unlock;
-				else {
-					err = -ENOMEM;
-					goto out_unlock;
-				}
-			}
-			spin_lock(&mm->page_table_lock);
-		}			
-		map = get_page_map(map);
-		if (map) {
-			flush_dcache_page(map);
-			atomic_inc(&map->count);
-		} else
-			printk (KERN_INFO "Mapped page missing [%d]\n", i);
-		spin_unlock(&mm->page_table_lock);
-		iobuf->maplist[i] = map;
-		iobuf->nr_pages = ++i;
-		
-		ptr += PAGE_SIZE;
-	}
-
+	/* Try to fault in all of the necessary pages */
+	down_read(&mm->mmap_sem);
+	/* rw==READ means read from disk, write into memory area */
+	err = get_user_pages(current, mm, va, pgcount,
+			(rw==READ), 0, iobuf->maplist, NULL);
 	up_read(&mm->mmap_sem);
+	if (err < 0) {
+		unmap_kiobuf(iobuf);
+		dprintk ("map_user_kiobuf: end %d\n", err);
+		return err;
+	}
+	iobuf->nr_pages = err;
+	while (pgcount--) {
+		/* FIXME: flush superflous for rw==READ,
+		 * probably wrong function for rw==WRITE
+		 */
+		flush_dcache_page(iobuf->maplist[pgcount]);
+	}
 	dprintk ("map_user_kiobuf: end OK\n");
 	return 0;
-
- out_unlock:
-	up_read(&mm->mmap_sem);
-	unmap_kiobuf(iobuf);
-	dprintk ("map_user_kiobuf: end %d\n", err);
-	return err;
 }
 
 /*
@@ -595,6 +601,7 @@
 		if (map) {
 			if (iobuf->locked)
 				UnlockPage(map);
+			/* FIXME: cache flush missing for rw==READ*/
 			__free_page(map);
 		}
 	}
@@ -1439,23 +1446,19 @@
 	return pte_offset(pmd, address);
 }
 
-/*
- * Simplistic page force-in..
- */
 int make_pages_present(unsigned long addr, unsigned long end)
 {
-	int write;
-	struct mm_struct *mm = current->mm;
+	int ret, len, write;
 	struct vm_area_struct * vma;
 
-	vma = find_vma(mm, addr);
+	vma = find_vma(current->mm, addr);
 	write = (vma->vm_flags & VM_WRITE) != 0;
 	if (addr >= end)
 		BUG();
-	do {
-		if (handle_mm_fault(mm, vma, addr, write) < 0)
-			return -1;
-		addr += PAGE_SIZE;
-	} while (addr < end);
-	return 0;
+	if (end > vma->vm_end)
+		BUG();
+	len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
+	ret = get_user_pages(current, current->mm, addr,
+			len, write, 0, NULL, NULL);
+	return ret == len ? 0 : -1;
 }
--- 2.4/kernel/ptrace.c	Thu Oct 11 16:51:38 2001
+++ build-2.4/kernel/ptrace.c	Tue Oct 16 21:30:02 2001
@@ -85,119 +85,17 @@
 }
 
 /*
- * Access another process' address space, one page at a time.
+ * Access another process' address space.
+ * Source/target buffer must be kernel space, 
+ * Do not walk the page table directly, use get_user_pages
  */
-static int access_one_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long addr, void *buf, int len, int write)
-{
-	pgd_t * pgdir;
-	pmd_t * pgmiddle;
-	pte_t * pgtable;
-	char *maddr; 
-	struct page *page;
-
-repeat:
-	spin_lock(&mm->page_table_lock);
-	pgdir = pgd_offset(vma->vm_mm, addr);
-	if (pgd_none(*pgdir))
-		goto fault_in_page;
-	if (pgd_bad(*pgdir))
-		goto bad_pgd;
-	pgmiddle = pmd_offset(pgdir, addr);
-	if (pmd_none(*pgmiddle))
-		goto fault_in_page;
-	if (pmd_bad(*pgmiddle))
-		goto bad_pmd;
-	pgtable = pte_offset(pgmiddle, addr);
-	if (!pte_present(*pgtable))
-		goto fault_in_page;
-	if (write && (!pte_write(*pgtable) || !pte_dirty(*pgtable)))
-		goto fault_in_page;
-	page = pte_page(*pgtable);
-
-	/* ZERO_PAGE is special: reads from it are ok even though it's marked reserved */
-	if (page != ZERO_PAGE(addr) || write) {
-		if ((!VALID_PAGE(page)) || PageReserved(page)) {
-			spin_unlock(&mm->page_table_lock);
-			return 0;
-		}
-	}
-	get_page(page);
-	spin_unlock(&mm->page_table_lock);
-	flush_cache_page(vma, addr);
-
-	if (write) {
-		maddr = kmap(page);
-		memcpy(maddr + (addr & ~PAGE_MASK), buf, len);
-		flush_page_to_ram(page);
-		flush_icache_page(vma, page);
-		kunmap(page);
-	} else {
-		maddr = kmap(page);
-		memcpy(buf, maddr + (addr & ~PAGE_MASK), len);
-		flush_page_to_ram(page);
-		kunmap(page);
-	}
-	put_page(page);
-	return len;
-
-fault_in_page:
-	spin_unlock(&mm->page_table_lock);
-	/* -1: out of memory. 0 - unmapped page */
-	if (handle_mm_fault(mm, vma, addr, write) > 0)
-		goto repeat;
-	return 0;
-
-bad_pgd:
-	spin_unlock(&mm->page_table_lock);
-	pgd_ERROR(*pgdir);
-	return 0;
-
-bad_pmd:
-	spin_unlock(&mm->page_table_lock);
-	pmd_ERROR(*pgmiddle);
-	return 0;
-}
-
-static int access_mm(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long addr, void *buf, int len, int write)
-{
-	int copied = 0;
-
-	for (;;) {
-		unsigned long offset = addr & ~PAGE_MASK;
-		int this_len = PAGE_SIZE - offset;
-		int retval;
-
-		if (this_len > len)
-			this_len = len;
-		retval = access_one_page(mm, vma, addr, buf, this_len, write);
-		copied += retval;
-		if (retval != this_len)
-			break;
-
-		len -= retval;
-		if (!len)
-			break;
-
-		addr += retval;
-		buf += retval;
-
-		if (addr < vma->vm_end)
-			continue;	
-		if (!vma->vm_next)
-			break;
-		if (vma->vm_next->vm_start != vma->vm_end)
-			break;
-	
-		vma = vma->vm_next;
-	}
-	return copied;
-}
 
 int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
 {
-	int copied;
 	struct mm_struct *mm;
-	struct vm_area_struct * vma;
+	struct vm_area_struct *vma;
+	struct page *page;
+	void *old_buf = buf;
 
 	/* Worry about races with exit() */
 	task_lock(tsk);
@@ -209,14 +107,41 @@
 		return 0;
 
 	down_read(&mm->mmap_sem);
-	vma = find_extend_vma(mm, addr);
-	copied = 0;
-	if (vma)
-		copied = access_mm(mm, vma, addr, buf, len, write);
+	/* ignore errors, just check how much was sucessfully transfered */
+	while (len) {
+		int bytes, ret, offset;
+		void *maddr;
+
+		ret = get_user_pages(current, mm, addr, 1,
+				write, 1, &page, &vma);
+		if (ret <= 0)
+			break;
+
+		bytes = len;
+		offset = addr & (PAGE_SIZE-1);
+		if (bytes > PAGE_SIZE-offset)
+			bytes = PAGE_SIZE-offset;
 
+		flush_cache_page(vma, addr);
+
+		maddr = kmap(page);
+		if (write) {
+			memcpy(maddr + offset, buf, bytes);
+			flush_page_to_ram(page);
+			flush_icache_page(vma, page);
+		} else {
+			memcpy(buf, maddr + offset, bytes);
+			flush_page_to_ram(page);
+		}
+		kunmap(page);
+		put_page(page);
+		len -= bytes;
+		buf += bytes;
+	}
 	up_read(&mm->mmap_sem);
 	mmput(mm);
-	return copied;
+	
+	return buf - old_buf;
 }
 
 int ptrace_readdata(struct task_struct *tsk, unsigned long src, char *dst, int len)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 18:07 Patch and Performance of larger pipes Manfred Spraul
@ 2001-10-18 18:18 ` Manfred Spraul
  2001-10-18 23:05 ` Ryan Cumming
  2001-10-19 13:52 ` Hubertus Franke
  2 siblings, 0 replies; 9+ messages in thread
From: Manfred Spraul @ 2001-10-18 18:18 UTC (permalink / raw)
  To: Hubertus Franke, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 110 bytes --]

Sorry, the patches don't compile - I mixed 2 different versions.
Apply the attached patch on top.

--
	MAnfred

[-- Attachment #2: patch-kiopipe2 --]
[-- Type: text/plain, Size: 379 bytes --]

--- 2.4/fs/pipe.c	Thu Oct 18 20:10:13 2001
+++ build-2.4/fs/pipe.c	Thu Oct 18 00:21:08 2001
@@ -113,7 +113,7 @@
 	len = (pio->offset+pio->len+PAGE_SIZE-1)/PAGE_SIZE;
 	down_read(&current->mm->mmap_sem);
 	len = get_user_pages(current, current->mm, (unsigned long)buf, len,
-			0, pio->pages, vmas);
+			0, 0, pio->pages, vmas);
 	if (len > 0) {
 		int i;
 		for(i=0;i<len;i++) {

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 18:07 Patch and Performance of larger pipes Manfred Spraul
  2001-10-18 18:18 ` Manfred Spraul
@ 2001-10-18 23:05 ` Ryan Cumming
  2001-10-19  0:15   ` Stefan Reinauer
  2001-10-19  4:19   ` Mike Galbraith
  2001-10-19 13:52 ` Hubertus Franke
  2 siblings, 2 replies; 9+ messages in thread
From: Ryan Cumming @ 2001-10-18 23:05 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel

On October 18, 2001 11:07, Manfred Spraul wrote:
> Could you test the attached singlecopy patches?
>
> with bw_pipe,
> * on UP, up to +100%.

Awesome! Although any improvement improvement in efficiency is a good thing, 
I am curious as to what uses pipes besides gcc -pipe. UNIX domain sockets 
(for local X11, for instance) aren't implemented as pipes, are they? What 
sort of real world performance gains could I expect from this patch?

-Ryan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 23:05 ` Ryan Cumming
@ 2001-10-19  0:15   ` Stefan Reinauer
  2001-10-19 11:37     ` Carlo Strozzi
  2001-10-19  4:19   ` Mike Galbraith
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Reinauer @ 2001-10-19  0:15 UTC (permalink / raw)
  To: Ryan Cumming; +Cc: linux-kernel

* Ryan Cumming <bodnar42@phalynx.dhs.org> [011019 01:05]:
> Awesome! Although any improvement improvement in efficiency is a good thing, 
> I am curious as to what uses pipes besides gcc -pipe. UNIX domain sockets 
> (for local X11, for instance) aren't implemented as pipes, are they? What 
> sort of real world performance gains could I expect from this patch?

Shell scripts often use pipes to pass data between processes. Speed up should
be quite noticable with all kinds of those.

  Best regards
   Stefan Reinauer 
     <stepan@suse.de>

-- 
This world is crying to be free; This world is dying, can't you see?
We need a turn to do it right; We need a mind revolution
To get away from this selfishness. Stop playing blind - BREAK FREE!
					Your Turn, Helloween '91

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 23:05 ` Ryan Cumming
  2001-10-19  0:15   ` Stefan Reinauer
@ 2001-10-19  4:19   ` Mike Galbraith
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2001-10-19  4:19 UTC (permalink / raw)
  To: Ryan Cumming; +Cc: Manfred Spraul, linux-kernel

On Thu, 18 Oct 2001, Ryan Cumming wrote:

> On October 18, 2001 11:07, Manfred Spraul wrote:
> > Could you test the attached singlecopy patches?
> >
> > with bw_pipe,
> > * on UP, up to +100%.
>
> Awesome! Although any improvement improvement in efficiency is a good thing,
> I am curious as to what uses pipes besides gcc -pipe. UNIX domain sockets
> (for local X11, for instance) aren't implemented as pipes, are they? What
> sort of real world performance gains could I expect from this patch?

If Manfred's patch helps gcc -pipe, then hopefully he'll submit it.
(or maybe we should just kill the -pipe switch from the kernel tree;)
In testing with a hefty parallel make, removing that switch produced
a nice speedup.

	-Mike


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-19  0:15   ` Stefan Reinauer
@ 2001-10-19 11:37     ` Carlo Strozzi
  0 siblings, 0 replies; 9+ messages in thread
From: Carlo Strozzi @ 2001-10-19 11:37 UTC (permalink / raw)
  To: linux-kernel

On Fri, Oct 19, 2001 at 02:15:30AM +0200, Stefan Reinauer wrote:
> 
> Shell scripts often use pipes to pass data between processes. Speed up should
> be quite noticable with all kinds of those.
> 

I agree. I personally use shell pipes heavily, and anything aimed at
improving them is a great thing for my programs.

cheers,
carlo
-- 
For easier reading please set the Courier font.
Messages larger than 30 KB may not receive immediate attention.
Freedom for Business: http://swpat.ffii.org

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Patch and Performance of larger pipes
  2001-10-18 18:07 Patch and Performance of larger pipes Manfred Spraul
  2001-10-18 18:18 ` Manfred Spraul
  2001-10-18 23:05 ` Ryan Cumming
@ 2001-10-19 13:52 ` Hubertus Franke
  2 siblings, 0 replies; 9+ messages in thread
From: Hubertus Franke @ 2001-10-19 13:52 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, lse-tech


Well we did for all the 3 benchmarks....

* Manfred Spraul <manfred@colorfullife.com> [20011018 14;07]:"
> Could you test the attached singlecopy patches?
> 
> with bw_pipe,
> * on UP, up to +100%.
> * on SMP with busy cpus, up to +100%
> * on SMP with idle cpus a performance drop due to increased cache
> trashing. Probably the scheduler should keep both bw_pipe processes on
> the same cpu.
> 
> I've sent patch-pgw to Linus for inclusion, since it's needed to fix the
> elf coredump deadlock.
> 
> patch-kiopipe must wait until 2.5, because it changes the behaviour of
> pipe_write with partial reads.
> 
> --
> 	Manfred

>    <<< Manfred's patch cout out >>>



Ok, at the request by Manfred Spraul we also ran his <single-copy>
patch. Manfred patch has applied over 2.4.13-pre13 and the results are compared 
over its base vanilla kernel. Our numbers still are %-improvements numbers
of patched 2.4.9 kernel vs. 2.4.9 vanilla.
Manfred's numbers are added as an additional column.

<< Bottom-line >>

Our patch does better for <Grep> and <Pipeflex> benchmark and
<LMBench> on low transfer sizes for large Pipe-Buffers. 
This is only relevant for SMP systems as we have enabled the patch
only for SMP.

Manfred's patch does better for <LMBench> and beats our patch across
configurations. But it lacks even vanilla for more realistic apps.
The same observation holds for UP.

We have not measured CPU utilization etc. 
As stated in my earlier message, bw_pipe doesn't have a lot of real
applicability, but I am willing to be re-educated here.

-- Hubertus

                         PIPE: Buffer Expansion
                         ----------------------

In this we will report on some experimentation to improve Linux pipe
performance. There are two basic parameters that govern the Linux pipe 
implementation. 
(a) PIPE_SIZE is the size of the pipe buffer. 
(b) PIPE_BUF is the maximum number of bytes an application 
    can write atomically to the pipe.

In the current implementation size of pipe buffer is PAGE_SIZE (most
architectures that is 4kB, and the PIPE_BUF is fixed to 4kB (and for
ARM=PAGE_SIZE).

We wanted to experiment with larger pipe buffer support and higher
concurrency of read and writes. Therefore we experimented with the
following items:

(A) expanding the pipe buffer size from 1 page to 2,4,8 pages.

(B) improving the pipe's concurrency of read and write by introducing 
   intermittent activation of pending readers/writers rather than at 
   the end of a pipe transaction (read/write). The PIPE_BUF 
   atomicity constraint is still observed. We therefore introduce 
   the term of a PIPE_SEG which is a multiple of the PAGE_SIZE and
   determines when to wake up pending readers and writers.

   Consider the pipe buffer size to be 32k. The space available to 
   write on the pipe is 32k and the data which is coming to be 
   written onto the pipe is also 32k. By keeping the segment size 
   as 4k, write the first 4k of the total 32k data instead of 
   writing the entire 32k data, and inform the reader process that 
   some data is there to be read, and the writer process proceeds 
   with the next 4k. By that time reader process starts reading 
   the available data. Intuitively this should create 
   more concurrency. 

Throughout this experimentation, we kept the PIPE_BUF (atomicity
guarantee) constant at 4kB.

Benchmarks
----------

The benchmarks we ran for measuring the performance of pipes are
LMBench, Grep, and Pipeflex. The description of these are below.
While LMBench is a widely used OS-Benchmark, we found that Grep and
Pipeflex model more real applications. All are descripted in more
detail below. All applications use different data transfer sizes
aka chunk sizes shown as TS. We report on the two aspects of our
implementation, i.e. larger pipes and intermittent activations
(PIPE_SEG) which is always 4k. All results are shown as % improvement
over the baseline kernel (2.4.9) and all these benchmarks are run on a
2-way Pentium II, 333Mhz machine.

                          Results Summary:
                          ================

UP + 1-way SMP:
---------------
Neither (A) nor (A)+(B) showed any improvements. Instead degradations 
of up to 30% are observed. Obviously our approach/patch does not 
make any sense on the 1-way systems.

N-way SMP:
----------

1. Increasing the pipe buffer size (A) increasingly improves the 
   performance of the Grep benchmark by upto 165% for size 32kB. 
   However, Grep does not show any added benefit nor
   degradation utilizing (A)+(B), i.e, expanding the pipe buffer
   size AND introducing the segment size=4kB.

2. For LMBench, (A) alone shows some improvements for small transfer
   sizes (TS<PIPE_SIZE). For TS>>PIPE_SIZE we observe degradation.
   Introducing (A)+(B) shows even better improvements for small TS
   with very small degradations for larger TS.

3. For pipeflex (A) provides increasing benefits with upto 358% 
   improvements, without any loss at the low end. When introducing 
   (A)+(B), the benefits drops but are still substantial.

Based on the results it is clear that expansion of PIPE_SIZE AND
PIPE_SEG introduction gives better performance for some scenarios. 


Grep
----

This benchmark measures the time taken to grep for some unexisting
pattern on a 50mb file.  ie. cat 50mbfile | grep "$$$$". We assume a
warm file cache.

LMBench
-------

LMBench provides a tool to measure the bandwidth of the pipe
(bw_pipe).  bw_pipe creates pipes between two processes and moves 10MB
through the pipe in 64KB chunks. We altered that code by providing the
chunk size as a variable input parameter.  ie. bw_pipe [2,4,...,32]

Pipeflex
--------

As LMBench does continuous read and write over the pipe in a
synchronous manner (which is not the case in real life), we studied
some test cases which uses pipes(grep, wc, sort, gunzip, ..), and
based on that we have written this pipeflex benchmark.

Here a write process writes smaller chunks continuously and the reader
process generates a number between [0.5*r .. 1.5*r] microseconds, and
spends that time for computation after each pipe reads.

A parent process clones 'c' child processes and 'c/2' pipes such that 
2 processes shares one pipe.
ie. pipeflex -c 2 -t 20 -r 500 -s 4

c : number of children/threads to launch (should be EVEN)
t : time for which each run of the test should be performed.
r : microseconds spend in computation after each pipe reads.
s : data to transfer over pipe in Kilo bytes.


Dynamically assigning values for PIPE_SIZE and PIPE_SEG
-------------------------------------------------------

In our current implementation, the size of PIPE_SIZE and PIPE_SEG can
be changed dynamically by writing the values into the newly created
/proc/sys/fs/pipe-sz file through a string having the following
format:

Po So

where Po is the Pipe size order
and So is the Segment size order.

Pipe size will be calculated as PIPE_SIZE = (1 << Po) * PAGE_SIZE.
Segment size will be calculated as (PIPE_SIZE >> So).

Similarly 'Po' and 'So' can be read through the same proc file.

* The notation we use in tables for PIPE_SIZE and PIPE_SEG are PS and
SS respectively and TS is the Transfer Size over pipe.

                2-way (% improvement) Results
                =============================

Grep
----

PS         (A)               (A)+(B)            Manfred
--         ---               -------            -------
4k        -0.87               -0.95              39.9
8k        50.84               50.12  
16k      107.97              115.86  
32k      165.25              164.14  


LMBench
-------
                (A)                             (A)+(B)                         
                ---                             -------                         

                PS                                PS                            
TS      4k      8k      16k     32k       4k      8k      16k     32k   Manfred
--      --      --      ---     ---       --      --      ---     ---   -------
2k     -0.3     3.26    4.25    3.83     -0.3     2.98    4.25    4.04  54.1
4k     -2.18   18.97   18.59   18.59     -2.18   18.59   18.59   18.59  21.0
6k      0.34   13.08   32.7    49.57      0.3    35.63   39.76   55.94  69.2
8k      0.14    3.02     0     -0.82     13.87   31.59   50.82   75.27  63.2
12k     0.34  -24.09  -18.74  -12.57      0.34    4.4   -14.86    8.23  38.3
16k     1.47   -8.88  -14.16  -16.03      1.4    14.42    9.48   13.86  37.5
24k     1.17  -13.9     1.65  -23.59      1.17    1.65   -2.72    1.2   31.2
32k     0.66  -14.77  -19.83  -25.63      0.66   -3.2    -6.59   -2.92  27.2
64k     x     x       x       x           x      x       x       x      22.84
128k    x     x       x       x           x      x       x       x      29.7


Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k  Manfred
--      --      --      ---     ---        --      --      ---     ---
2k      0.00    0.00   -0.27   -0.27      0.00    -0.27   -0.27   -0.27 -24.1
4k      0.00    0.00    0.00    0.00      0.00     0.00    0.00    0.00 -42.0
6k      1.61   -1.82   11.46   10.28      1.61    -9.53    9.53    6.75 -10.86
8k     -2.00   46.20   46.31   46.31     -2.00   -11.08    7.17   23.73 -15.9
12k    -2.93   54.13   64.58   93.31     -2.93    51.62   46.08   84.22 -14.9
16k    -3.34   49.95  163.09  162.67     -3.34    51.82   45.57   85.92 -14.4
24k    -3.56   50.37  135.50  183.46     -3.56    49.95  134.97  144.40 -15.9
32k    -1.40   54.99  143.29  358.75     -1.40    55.75  142.32  147.37 -17.7
64k     x     x       x       x           x      x       x       x      -15.01
128k    x     x       x       x           x      x       x       x      -15.14


                1-way (% improvement) Results
                =============================

Grep     
----

PS         (A)               (A)+(B)    Manfred
--         ---               -------    -------
4k       -0.47               -0.47      -49.78
8k        1.9                -2.73  
16k      -2.73               -2.73  
32k       1.9                -2.73  



LMBench
-------
                (A)                             (A)+(B)
                ---                             -------

                PS                                PS  
TS      4k      8k      16k     32k       4k      8k      16k     32k   Manfred
--      --      --      ---     ---       --      --      ---     ---   -------
2k     0.49     4.42    -3.88  -10.13    0.49    11.02   -7.68  -12.06  5.91
4k    -0.5     -1.08   -22.17   -8.6    -0.5     -0.95  -18.65   -4.86  24.28
6k     3.73   -19.03   -24.23  -18.56    3.73   -15.85  -24.68  -12.44  55.17
8k     0.82   -33.43   -31.92  -30.19    0.82   -34.38  -25.41  -12.81  35.34
12k    1.39   -24.06   -30.67  -27.88    1.39   -24.43  -29.79  -27     20.39
16k   -0.87   -29.16   -31.27  -29.73   -0.87   -28.53  -31.97  -28.97  24.8
24k    0.16   -28.79   -31.87  -28.37    0.16   -28.16  -31.61  -28.66  32.72
32k    0.35   -28.91   -30.73  -27.23    0.35   -28.77  -31.6   -28.77  34.89
64k     x     x       x       x           x      x       x       x      38.7
128k    x     x       x       x           x      x       x       x      41.6

Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k  Manfred
--      --      --      ---     ---        --      --      ---     ---  -------
2k    0.00   -0.54   -0.80   -1.07         0.00   -0.54   -0.80   -1.07 -23.66
4k   -0.14   -0.69   -1.80   -2.21        -0.14   -1.10   -1.80   -2.49 -41.97
6k   -0.19   -0.19   -2.41   -2.02        -0.19   -0.19   -2.12   -1.16 -22.09
8k   -0.30   -1.86   -6.41   -6.41        -0.30   -1.19   -3.80   -2.61 -39.97
12k  -0.33   -1.43   -4.56   -4.01        -0.33   -1.43   -3.84   -3.51 -55.71
16k  -0.18   -1.77   -4.47   -4.56        -0.18   -1.15   -4.78   -3.85 -64.37
24k  -0.37   -2.21   -6.94   -5.26        -0.37   -1.51   -6.07   -4.89 -72.98
32k  -0.42   -2.98   -7.12   -5.18        -0.42   -1.75   -7.06   -5.77 -77.19
64k     x     x       x       x           x      x       x       x      -82.57
128k    x     x       x       x           x      x       x       x      -84.91

                UP (% improvement) Results
                ==========================

Grep     
----

PS         (A)               (A)+(B)    Manfred
--         ---               -------    -------                         
4k        -0.53               1.61      -53.49
8k        -2.58              -1.56
16k       -4.55              -4.06
32k       -3.08              -4.06


LMBench
-------
                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k  Manfred
--      --      --      ---     ---        --      --      ---     ---  -------
2k      7.38     1.17  -15.28  -18.07     4.18   -0.36  -14.32  -20.55  16.35
4k     -1.73    -0.5   -31.7   -26.94    -1.21    5.61  -21.75  -5.17   20.58
6k     -0.3    -22.33  -26.78  -23.36    -0.9   -17.31  -29.61  -19.5   73.06
8k      7.8    -35     -33.75  -30.65     1.71  -37.11  -29.45  -18.03  30.43
12k    -1.09   -25.99  -36.7   -35.77     0.13  -27.76  -35.56  -34.21  13.78
16k     0.08   -31.39  -35.37  -34.28     1.2   -30.93  -36.18  -34.18  17.47
24k     0.42   -32.06  -36.15  -34.65     0.63  -32.28  -36.82  -34.98  22.03
32k     0.82   -31.52  -35.71  -33.41     1.8   -32.11  -36.49  -34.52  24.87
64k     x     x       x       x           x      x       x       x      26.66
128k    x     x       x       x           x      x       x       x      41.27
    

Pipeflex
--------

                (A)                             (A)+(B)
                ---                             -------

                PS                                 PS 
TS      4k      8k      16k     32k        4k      8k      16k     32k  Manfred
--      --      --      ---     ---        --      --      ---     ---  -------
2k    -0.27   -0.27   -0.80   -1.06      -0.27   -0.27   -0.80   -1.06  -24.4
4k    -0.27   -0.68   -1.77   -2.31      -0.14   -0.54   -1.63   -2.18  -42.3
6k    -0.66   -0.94   -3.85   -3.94      -0.47   -0.66   -2.63   -1.97  -22.9
8k    -0.58   -1.96   -6.88   -6.88      -0.58   -1.30   -4.49   -3.48  -40.5
12k   -1.00   -2.16   -5.59   -5.59      -0.84   -1.79   -4.75   -4.54  -56.7
16k   -1.22   -2.40   -6.02   -6.18      -1.14   -2.10   -6.06   -5.34  -65.4
24k   -1.50   -3.41   -8.68   -7.23      -1.72   -2.76   -7.89   -7.11  -74.3
32k   -1.92   -4.44   -9.27   -7.94      -1.79   -3.43   -9.45   -8.41  -78.5
64k     x     x       x       x           x      x       x       x      -85.16
128k    x     x       x       x           x      x       x	 x	-87.66



diff -urN linux-2.4.9-v/fs/pipe.c linux-2.4.9-pipe-new/fs/pipe.c
--- linux-2.4.9-v/fs/pipe.c	Sun Aug 12 21:58:52 2001
+++ linux-2.4.9-pipe-new/fs/pipe.c	Tue Oct  9 10:48:15 2001
@@ -23,6 +23,14 @@
  * -- Julian Bradfield 1999-06-07.
  */
 
+#ifdef CONFIG_SMP 
+#define IS_SMP (1)
+#else 
+#define IS_SMP (0)
+#endif
+
+struct pipe_stat_t pipe_stat;
+
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct inode * inode)
 {
@@ -85,30 +93,40 @@
 
 	/* Read what data is available.  */
 	ret = -EFAULT;
-	while (count > 0 && (size = PIPE_LEN(*inode))) {
-		char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
-		ssize_t chars = PIPE_MAX_RCHUNK(*inode);
-
-		if (chars > count)
-			chars = count;
-		if (chars > size)
-			chars = size;
-
-		if (copy_to_user(buf, pipebuf, chars))
-			goto out;
+	if (count > 0 && (size = PIPE_LEN(*inode))) {
+		do {
+			char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
+			ssize_t chars = PIPE_MAX_RCHUNK(*inode);
+			
+			if (chars > count)
+				chars = count;
+			if (chars > size)
+				chars = size;
+			if (IS_SMP && PIPE_ORDER(*inode) && (chars > PIPE_SEG(*inode))) 
+				chars = PIPE_SEG(*inode);
+			
+			if (copy_to_user(buf, pipebuf, chars))
+				goto out;
 
-		read += chars;
-		PIPE_START(*inode) += chars;
-		PIPE_START(*inode) &= (PIPE_SIZE - 1);
-		PIPE_LEN(*inode) -= chars;
-		count -= chars;
-		buf += chars;
+			read += chars;
+			PIPE_START(*inode) += chars;
+			PIPE_START(*inode) &= (PIPE_SIZE(*inode) - 1);
+			PIPE_LEN(*inode) -= chars;
+			count -= chars;
+			buf += chars;
+               		if ((count <= 0) || (!(size = PIPE_LEN(*inode))))
+	                       	break;
+        	      	if (IS_SMP && PIPE_ORDER(*inode) && PIPE_WAITING_WRITERS(*inode) && 
+			    !(filp->f_flags & O_NONBLOCK)) 
+                       		wake_up_interruptible_sync(PIPE_WAIT(*inode));
+			
+            	} while(1);
 	}
 
 	/* Cache behaviour optimization */
 	if (!PIPE_LEN(*inode))
 		PIPE_START(*inode) = 0;
-
+	
 	if (count && PIPE_WAITING_WRITERS(*inode) && !(filp->f_flags & O_NONBLOCK)) {
 		/*
 		 * We know that we are going to sleep: signal
@@ -187,10 +205,15 @@
 		ssize_t chars = PIPE_MAX_WCHUNK(*inode);
 
 		if ((space = PIPE_FREE(*inode)) != 0) {
+			pipebuf = PIPE_BASE(*inode) + PIPE_END(*inode);
+			chars = PIPE_MAX_WCHUNK(*inode);
+
 			if (chars > count)
 				chars = count;
 			if (chars > space)
 				chars = space;
+			if (IS_SMP && PIPE_ORDER(*inode) && (chars > PIPE_SEG(*inode)))
+				chars = PIPE_SEG(*inode);
 
 			if (copy_from_user(pipebuf, buf, chars))
 				goto out;
@@ -200,6 +223,9 @@
 			count -= chars;
 			buf += chars;
 			space = PIPE_FREE(*inode);
+   			if (IS_SMP && PIPE_ORDER(*inode) && (count > 0) && space &&
+			    PIPE_WAITING_READERS(*inode) && !(filp->f_flags & O_NONBLOCK))
+                                wake_up_interruptible_sync(PIPE_WAIT(*inode));
 			continue;
 		}
 
@@ -231,14 +257,14 @@
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(inode);
 
-out:
+ out:
 	up(PIPE_SEM(*inode));
-out_nolock:
+ out_nolock:
 	if (written)
 		ret = written;
 	return ret;
 
-sigpipe:
+ sigpipe:
 	if (written)
 		goto out;
 	up(PIPE_SEM(*inode));
@@ -309,7 +335,7 @@
 	if (!PIPE_READERS(*inode) && !PIPE_WRITERS(*inode)) {
 		struct pipe_inode_info *info = inode->i_pipe;
 		inode->i_pipe = NULL;
-		free_page((unsigned long) info->base);
+		free_pages((unsigned long) info->base, info->order);
 		kfree(info);
 	} else {
 		wake_up_interruptible(PIPE_WAIT(*inode));
@@ -443,8 +469,12 @@
 struct inode* pipe_new(struct inode* inode)
 {
 	unsigned long page;
+	int pipe_order = pipe_stat.pipe_size_order;
+
+	if (pipe_order > MAX_PIPE_ORDER)
+		pipe_order = MAX_PIPE_ORDER;
 
-	page = __get_free_page(GFP_USER);
+	page = __get_free_pages(GFP_USER, pipe_order);
 	if (!page)
 		return NULL;
 
@@ -458,10 +488,11 @@
 	PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
 	PIPE_WAITING_READERS(*inode) = PIPE_WAITING_WRITERS(*inode) = 0;
 	PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;
+	PIPE_ORDER(*inode) = pipe_order;
 
 	return inode;
-fail_page:
-	free_page(page);
+ fail_page:
+	free_pages(page, pipe_order);
 	return NULL;
 }
 
@@ -477,12 +508,12 @@
 static struct inode * get_pipe_inode(void)
 {
 	struct inode *inode = get_empty_inode();
-
 	if (!inode)
 		goto fail_inode;
 
 	if(!pipe_new(inode))
 		goto fail_iput;
+
 	PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 1;
 	inode->i_fop = &rdwr_pipe_fops;
 	inode->i_sb = pipe_mnt->mnt_sb;
@@ -501,9 +532,9 @@
 	inode->i_blksize = PAGE_SIZE;
 	return inode;
 
-fail_iput:
+ fail_iput:
 	iput(inode);
-fail_inode:
+ fail_inode:
 	return NULL;
 }
 
@@ -572,20 +603,20 @@
 	fd[1] = j;
 	return 0;
 
-close_f12_inode_i_j:
+ close_f12_inode_i_j:
 	put_unused_fd(j);
-close_f12_inode_i:
+ close_f12_inode_i:
 	put_unused_fd(i);
-close_f12_inode:
-	free_page((unsigned long) PIPE_BASE(*inode));
+ close_f12_inode:
+	free_pages((unsigned long) PIPE_BASE(*inode), PIPE_ORDER(*inode));
 	kfree(inode->i_pipe);
 	inode->i_pipe = NULL;
 	iput(inode);
-close_f12:
+ close_f12:
 	put_filp(f2);
-close_f1:
+ close_f1:
 	put_filp(f1);
-no_files:
+ no_files:
 	return error;	
 }
 
@@ -631,7 +662,7 @@
 }
 
 static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
-	FS_NOMOUNT|FS_SINGLE);
+		      FS_NOMOUNT|FS_SINGLE);
 
 static int __init init_pipe_fs(void)
 {
diff -urN linux-2.4.9-v/include/linux/pipe_fs_i.h linux-2.4.9-pipe-new/include/linux/pipe_fs_i.h
--- linux-2.4.9-v/include/linux/pipe_fs_i.h	Wed Apr 25 17:18:23 2001
+++ linux-2.4.9-pipe-new/include/linux/pipe_fs_i.h	Tue Oct  9 09:35:35 2001
@@ -2,6 +2,8 @@
 #define _LINUX_PIPE_FS_I_H
 
 #define PIPEFS_MAGIC 0x50495045
+#define MAX_PIPE_ORDER	3
+
 struct pipe_inode_info {
 	wait_queue_head_t wait;
 	char *base;
@@ -13,12 +15,20 @@
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	unsigned int order;
+};
+
+struct pipe_stat_t{
+	int pipe_size_order;
+	int pipe_seg_order;
 };
+extern struct pipe_stat_t pipe_stat;
 
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
-#define PIPE_SIZE		PAGE_SIZE
+#define PIPE_SIZE(inode)	((1 << PIPE_ORDER(inode)) * PAGE_SIZE)
 
+#define PIPE_ORDER(inode)	((inode).i_pipe->order)
 #define PIPE_SEM(inode)		(&(inode).i_sem)
 #define PIPE_WAIT(inode)	(&(inode).i_pipe->wait)
 #define PIPE_BASE(inode)	((inode).i_pipe->base)
@@ -32,12 +42,13 @@
 #define PIPE_WCOUNTER(inode)	((inode).i_pipe->w_counter)
 
 #define PIPE_EMPTY(inode)	(PIPE_LEN(inode) == 0)
-#define PIPE_FULL(inode)	(PIPE_LEN(inode) == PIPE_SIZE)
-#define PIPE_FREE(inode)	(PIPE_SIZE - PIPE_LEN(inode))
-#define PIPE_END(inode)	((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE-1))
-#define PIPE_MAX_RCHUNK(inode)	(PIPE_SIZE - PIPE_START(inode))
-#define PIPE_MAX_WCHUNK(inode)	(PIPE_SIZE - PIPE_END(inode))
-
+#define PIPE_FULL(inode)	(PIPE_LEN(inode) == PIPE_SIZE(inode))
+#define PIPE_FREE(inode)	(PIPE_SIZE(inode) - PIPE_LEN(inode))
+#define PIPE_END(inode)	((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE(inode)-1))
+#define PIPE_MAX_RCHUNK(inode)	(PIPE_SIZE(inode) - PIPE_START(inode))
+#define PIPE_MAX_WCHUNK(inode)	(PIPE_SIZE(inode) - PIPE_END(inode))
+#define PIPE_SEG(inode)		((PIPE_ORDER(inode) > pipe_stat.pipe_seg_order) ? \
+				(PIPE_SIZE(inode) >> pipe_stat.pipe_seg_order): PAGE_SIZE)
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct inode * inode);
 
diff -urN linux-2.4.9-v/include/linux/sysctl.h linux-2.4.9-pipe-new/include/linux/sysctl.h
--- linux-2.4.9-v/include/linux/sysctl.h	Wed Aug 15 17:21:21 2001
+++ linux-2.4.9-pipe-new/include/linux/sysctl.h	Tue Oct  9 10:12:48 2001
@@ -533,6 +533,7 @@
 	FS_LEASES=13,	/* int: leases enabled */
 	FS_DIR_NOTIFY=14,	/* int: directory notification enabled */
 	FS_LEASE_TIME=15,	/* int: maximum time to wait for a lease break */
+	FS_PIPE_SIZE=16,	/* int: current number of allocated pages for PIPE */
 };
 
 /* CTL_DEBUG names: */
diff -urN linux-2.4.9-v/kernel/sysctl.c linux-2.4.9-pipe-new/kernel/sysctl.c
--- linux-2.4.9-v/kernel/sysctl.c	Thu Aug  9 19:41:36 2001
+++ linux-2.4.9-pipe-new/kernel/sysctl.c	Mon Oct  8 13:19:46 2001
@@ -304,6 +304,8 @@
 	 sizeof(int), 0644, NULL, &proc_dointvec},
 	{FS_LEASE_TIME, "lease-break-time", &lease_break_time, sizeof(int),
 	 0644, NULL, &proc_dointvec},
+	{FS_PIPE_SIZE, "pipe-sz", &pipe_stat, 2*sizeof(int),
+	 0644, NULL, &proc_dointvec},
 	{0}
 };
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2001-10-19 15:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-10-18 18:07 Patch and Performance of larger pipes Manfred Spraul
2001-10-18 18:18 ` Manfred Spraul
2001-10-18 23:05 ` Ryan Cumming
2001-10-19  0:15   ` Stefan Reinauer
2001-10-19 11:37     ` Carlo Strozzi
2001-10-19  4:19   ` Mike Galbraith
2001-10-19 13:52 ` Hubertus Franke
  -- strict thread matches above, loose matches on Subject: below --
2001-10-18 15:34 Hubertus Franke
2001-10-18 17:43 ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox