[RFC PATCH v2 0/1] Pipe busy wait

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/1] Pipe busy wait
@ 2018-09-25 23:32 subhra mazumdar
  2018-09-25 23:32 ` [RFC PATCH v2 1/1] pipe: busy wait for pipe subhra mazumdar
  0 siblings, 1 reply; 5+ messages in thread
From: subhra mazumdar @ 2018-09-25 23:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, dhaval.giani, steven.sistare

This patch introduces busy waiting for pipes similar to network sockets.
When pipe is full or empty a thread busy waits for some microseconds before
sleeping. This avoids the sleep and wakeup overhead and improves
performance in case wakeup happens very fast. It uses a new field in
pipe_inode_info to decide how much to spin. As different workloads on
different systems can have different optimum spin time, it is configurable
via a tunable that can be set via /proc. The default value is 0 which
indicates no spin.

Following are the hackbench process using pipe and Unixbench performance 
numbers with baseline and suitable spin time.

Hackbench on 2 socket, 36 core and 72 threads Intel x86 machine
(lower is better). It also shows the usr+sys time as shown by time:
groups  baseline  usr+sys   patch(spin=10us)  usr+sys
1       0.6742    17.0212   0.6842 (-1.48%)   20.2765
2       0.7794    35.1954   0.7116 (8.7%)     38.1318
4       0.9744    55.6513   0.8247 (15.36%)   50.301
8       2.0382    129.519   1.572 (22.87%)    103.722
16      5.5712    427.096   2.7989 (49.76%)   194.162
24      7.9314    566.327   3.962 (50.05%)    278.368

Unixbench on 2 socket, 36 core and 72 threads Intel x86 machine (higher is
better). It shows pipe-based context switching test improving by 107.7% for
1 copy and by 51.3% for 72 parallel copies. This is expected as context
switch overhead is avoided altogether by busy wait.

72 CPUs in system; running 1 parallel copy of tests
System Benchmarks Index Values          baseline	patch(spin=10us)
Dhrystone 2 using register variables    2837.5		2842.0
Double-Precision Whetstone              753.3		753.5
Execl Throughput                        1056.8		1079.2
File Copy 1024 bufsize 2000 maxblocks   1794.0		1805.4
File Copy 256 bufsize 500 maxblocks     1111.4		1117.6
File Copy 4096 bufsize 8000 maxblocks   4136.7		4091.7
Pipe Throughput                         752.9		753.3
Pipe-based Context Switching            372.2		772.9
Process Creation                        840.1		847.6
Shell Scripts (1 concurrent)            1771.0		1747.5
Shell Scripts (8 concurrent)            7316.6		7308.3
System Call Overhead                    578.0		578.0
                                        ========	========
System Benchmarks Index Score           1337.8		1424.0

72 CPUs in system; running 72 parallel copies of tests
System Benchmarks Index Values          baseline 	patch(spin=10us)
Dhrystone 2 using register variables    112849.7	112429.8
Double-Precision Whetstone              44971.1		44929.7
Execl Throughput                        11257.6		11258.7
File Copy 1024 bufsize 2000 maxblocks   1514.6		1471.3
File Copy 256 bufsize 500 maxblocks     919.8		917.4
File Copy 4096 bufsize 8000 maxblocks   3543.8		3355.5
Pipe Throughput                         34530.6		34492.9
Pipe-based Context Switching            11298.2		17089.9
Process Creation                        9422.5		9408.1
Shell Scripts (1 concurrent)            38764.9		38610.4
Shell Scripts (8 concurrent)            32570.5		32458.8
System Call Overhead                    2607.4		2672.7
                                        ========	========
System Benchmarks Index Score           11077.2		11393.5

Changes from v1->v2:
-Removed preemption disable in the busy wait loop
-Changed busy spin condition to TASK_RUNNING state of the current thread
-Added usr+sys time for hackbench runs as shown by time command

subhra mazumdar (1):
  pipe: busy wait for pipe

 fs/pipe.c                 | 12 ++++++++++++
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c           |  7 +++++++
 3 files changed, 21 insertions(+)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH v2 1/1] pipe: busy wait for pipe
  2018-09-25 23:32 [RFC PATCH v2 0/1] Pipe busy wait subhra mazumdar
@ 2018-09-25 23:32 ` subhra mazumdar
  2018-11-05 10:08   ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: subhra mazumdar @ 2018-09-25 23:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, dhaval.giani, steven.sistare

Introduce pipe_ll_usec field for pipes that indicates the amount of micro
seconds a thread should spin if pipe is empty or full before sleeping. This
is similar to network sockets. Workloads like hackbench in pipe mode
benefits significantly from this by avoiding the sleep and wakeup overhead.
Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
to enable or disable busy waiting via /proc. The value of it specifies the
amount of spin in microseconds. Default value is 0 indicating no spin.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 fs/pipe.c                 | 12 ++++++++++++
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c           |  7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
 
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
+#include <linux/sched/clock.h>
 
 #include "internal.h"
 
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
  */
 unsigned long pipe_user_pages_hard;
 unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
 
 /*
  * We use a start+len construction, which provides full use of the 
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 void pipe_wait(struct pipe_inode_info *pipe)
 {
 	DEFINE_WAIT(wait);
+	u64 start;
 
 	/*
 	 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 */
 	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
 	pipe_unlock(pipe);
+	start = local_clock();
+	while (current->state != TASK_RUNNING &&
+	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+		cpu_relax();
 	schedule();
 	finish_wait(&pipe->wait, &wait);
 	pipe_lock(pipe);
@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
 	struct file *files[2];
 	int fd[2];
 	int error;
+	struct pipe_inode_info *pipe;
 
 	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
 			fd_install(fd[0], files[0]);
 			fd_install(fd[1], files[1]);
 		}
+		pipe = files[0]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
+		pipe = files[1]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
 	}
 	return error;
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b..73267d2 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -55,6 +55,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	unsigned int pipe_ll_usec;
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
@@ -170,6 +171,7 @@ void pipe_double_lock(struct pipe_inode_info *, struct pipe_inode_info *);
 extern unsigned int pipe_max_size;
 extern unsigned long pipe_user_pages_hard;
 extern unsigned long pipe_user_pages_soft;
+extern unsigned int pipe_busy_poll;
 
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct pipe_inode_info *pipe);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050..0e9ce0c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1863,6 +1863,13 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= proc_doulongvec_minmax,
 	},
 	{
+		.procname       = "pipe-busy-poll",
+		.data           = &pipe_busy_poll,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+	},
+	{
 		.procname	= "mount-max",
 		.data		= &sysctl_mount_max,
 		.maxlen		= sizeof(unsigned int),
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe
  2018-09-25 23:32 ` [RFC PATCH v2 1/1] pipe: busy wait for pipe subhra mazumdar
@ 2018-11-05 10:08   ` Mel Gorman
  2018-11-05 23:40     ` Subhra Mazumdar
  0 siblings, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2018-11-05 10:08 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, tglx, dhaval.giani, steven.sistare,
	Alexander Viro

Adding Al Viro as per get_maintainers.pl.

On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:
> Introduce pipe_ll_usec field for pipes that indicates the amount of micro
> seconds a thread should spin if pipe is empty or full before sleeping. This
> is similar to network sockets.

Can you point to what pattern from network sockets you are duplicated
exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
but it should be in the changelog because there are differences in polling
depending on where you are in the network subsystem (which I'm not very
familiar with).

> Workloads like hackbench in pipe mode
> benefits significantly from this by avoiding the sleep and wakeup overhead.
> Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
> to enable or disable busy waiting via /proc. The value of it specifies the
> amount of spin in microseconds. Default value is 0 indicating no spin.
> 

Your lead mail indicates the spin was set to a "suitable spin time". How
should an administrator select this spin time? What works for hackbench
might not be suitable for another workload. What if the spin time
selected happens to be just slightly longer than the time it takes the
reader to respond? In such a case, the result would be "all spin, no
gain". While networking potentially suffers the same problem, it appears
to be opt-in per socket so it's up to the application not to shoot
itself in the foot.

> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  fs/pipe.c                 | 12 ++++++++++++
>  include/linux/pipe_fs_i.h |  2 ++
>  kernel/sysctl.c           |  7 +++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index bdc5d3c..35d805b 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -26,6 +26,7 @@
>  
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
> +#include <linux/sched/clock.h>
>  
>  #include "internal.h"
>  
> @@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
>   */
>  unsigned long pipe_user_pages_hard;
>  unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
> +unsigned int pipe_busy_poll;
>  
>  /*
>   * We use a start+len construction, which provides full use of the 
> @@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  void pipe_wait(struct pipe_inode_info *pipe)
>  {
>  	DEFINE_WAIT(wait);
> +	u64 start;
>  
>  	/*
>  	 * Pipes are system-local resources, so sleeping on them
> @@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
>  	 */
>  	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
>  	pipe_unlock(pipe);
> +	start = local_clock();
> +	while (current->state != TASK_RUNNING &&
> +	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
> +		cpu_relax();
>  	schedule();
>  	finish_wait(&pipe->wait, &wait);
>  	pipe_lock(pipe);

Networking breaks this out better in terms of options instead of
hard-coding. This does not handle need_resched or signal delivery
properly where as networking does for example.

> @@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
>  	struct file *files[2];
>  	int fd[2];
>  	int error;
> +	struct pipe_inode_info *pipe;
>  
>  	error = __do_pipe_flags(fd, files, flags);
>  	if (!error) {
> @@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
>  			fd_install(fd[0], files[0]);
>  			fd_install(fd[1], files[1]);
>  		}
> +		pipe = files[0]->private_data;
> +		pipe->pipe_ll_usec = pipe_busy_poll;
> +		pipe = files[1]->private_data;
> +		pipe->pipe_ll_usec = pipe_busy_poll;
>  	}
>  	return error;
>  }

You add a pipe field but the value in always based on the sysctl
so the information is redundantu (barring a race condition on one
pipe write per sysctl update which is an irrelevant corner case). In
comparison, the network subsystem appears to be explicitly opt-in via
setsockopt(SO_BUSY_POLL) from a glance and the value appears to be tunable
on a per-socket basis (I didn't check for sure, this is based on a glance
at what networking does). It's not clear what a sensible way of replicating
that for pipe file descriptors would be but it's possible that the only
way would be to define the system-wide tunable as a max spin time and try
detect the optimal spin time on a per-fd basis (not entirely dissimilar
to how cpuidle menu guesses how long it'll be in an idle state for).

It's not really my area but I feel that this patch is a benchmark-specific
hack and that tuning it on a system-wide basis will be a game of "win
some, lose some" that is never used in practice. Worse, it might end up
in a tuning guide as "always set this sysctl" without considering the
capabilities of the machine or the workload and falls victim to cargo
cult tuning.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe
  2018-11-05 10:08   ` Mel Gorman
@ 2018-11-05 23:40     ` Subhra Mazumdar
  2018-11-06  8:41       ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Subhra Mazumdar @ 2018-11-05 23:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, peterz, tglx, dhaval.giani, steven.sistare,
	Alexander Viro


On 11/5/18 2:08 AM, Mel Gorman wrote:
> Adding Al Viro as per get_maintainers.pl.
>
> On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:
>> Introduce pipe_ll_usec field for pipes that indicates the amount of micro
>> seconds a thread should spin if pipe is empty or full before sleeping. This
>> is similar to network sockets.
> Can you point to what pattern from network sockets you are duplicated
> exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
> but it should be in the changelog because there are differences in polling
> depending on where you are in the network subsystem (which I'm not very
> familiar with).
I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By
similar I meant having a similar mechanism for pipes to busy wait
>
>> Workloads like hackbench in pipe mode
>> benefits significantly from this by avoiding the sleep and wakeup overhead.
>> Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
>> to enable or disable busy waiting via /proc. The value of it specifies the
>> amount of spin in microseconds. Default value is 0 indicating no spin.
>>
> Your lead mail indicates the spin was set to a "suitable spin time". How
> should an administrator select this spin time? What works for hackbench
> might not be suitable for another workload. What if the spin time
> selected happens to be just slightly longer than the time it takes the
> reader to respond? In such a case, the result would be "all spin, no
> gain". While networking potentially suffers the same problem, it appears
> to be opt-in per socket so it's up to the application not to shoot
> itself in the foot.
Even for network, sk_ll_usec is assigned the value of the tunable
sysctl_net_busy_read in sock_init_data() for all sockets initialized by
default. There is way for per socket setting using sock_setsockopt(), for
pipes that can be added later if needed in case of different apps running
in one system. But there are cases where only one app runs (e.g big DBs)
and one tunable will suffice. It can be set to a value that is tested to be
beneficial under the operating conditions.
>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   fs/pipe.c                 | 12 ++++++++++++
>>   include/linux/pipe_fs_i.h |  2 ++
>>   kernel/sysctl.c           |  7 +++++++
>>   3 files changed, 21 insertions(+)
>>
>> diff --git a/fs/pipe.c b/fs/pipe.c
>> index bdc5d3c..35d805b 100644
>> --- a/fs/pipe.c
>> +++ b/fs/pipe.c
>> @@ -26,6 +26,7 @@
>>   
>>   #include <linux/uaccess.h>
>>   #include <asm/ioctls.h>
>> +#include <linux/sched/clock.h>
>>   
>>   #include "internal.h"
>>   
>> @@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
>>    */
>>   unsigned long pipe_user_pages_hard;
>>   unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
>> +unsigned int pipe_busy_poll;
>>   
>>   /*
>>    * We use a start+len construction, which provides full use of the
>> @@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>>   void pipe_wait(struct pipe_inode_info *pipe)
>>   {
>>   	DEFINE_WAIT(wait);
>> +	u64 start;
>>   
>>   	/*
>>   	 * Pipes are system-local resources, so sleeping on them
>> @@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
>>   	 */
>>   	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
>>   	pipe_unlock(pipe);
>> +	start = local_clock();
>> +	while (current->state != TASK_RUNNING &&
>> +	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
>> +		cpu_relax();
>>   	schedule();
>>   	finish_wait(&pipe->wait, &wait);
>>   	pipe_lock(pipe);
> Networking breaks this out better in terms of options instead of
> hard-coding. This does not handle need_resched or signal delivery
> properly where as networking does for example.
I don't disable preemption, so don't think checking need_resched is needed.
Can you point to what you mean by handling signal delivery in case of
networking? Not sure what I am missing. My initial version broke it out
like networking but after Peter's suggestion I clubbed it. I don't feel
strongly either way.
>
>> @@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
>>   	struct file *files[2];
>>   	int fd[2];
>>   	int error;
>> +	struct pipe_inode_info *pipe;
>>   
>>   	error = __do_pipe_flags(fd, files, flags);
>>   	if (!error) {
>> @@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
>>   			fd_install(fd[0], files[0]);
>>   			fd_install(fd[1], files[1]);
>>   		}
>> +		pipe = files[0]->private_data;
>> +		pipe->pipe_ll_usec = pipe_busy_poll;
>> +		pipe = files[1]->private_data;
>> +		pipe->pipe_ll_usec = pipe_busy_poll;
>>   	}
>>   	return error;
>>   }
> You add a pipe field but the value in always based on the sysctl
> so the information is redundantu (barring a race condition on one
> pipe write per sysctl update which is an irrelevant corner case). In
> comparison, the network subsystem appears to be explicitly opt-in via
> setsockopt(SO_BUSY_POLL) from a glance and the value appears to be tunable
> on a per-socket basis (I didn't check for sure, this is based on a glance
> at what networking does). It's not clear what a sensible way of replicating
> that for pipe file descriptors would be but it's possible that the only
> way would be to define the system-wide tunable as a max spin time and try
> detect the optimal spin time on a per-fd basis (not entirely dissimilar
> to how cpuidle menu guesses how long it'll be in an idle state for).
As I said above even networking assigns the global tunable value to each
socket during initialization. Trying to detect an optimal spin time will
be extremely difficult, I had tried a lot of heuristics but doesn't quite
work under all workloads/utilization level. This change also doesn't
preclude such work in future. In mean time some workloads can benefit from
straight forward constant tunable setting.
>
> It's not really my area but I feel that this patch is a benchmark-specific
> hack and that tuning it on a system-wide basis will be a game of "win
> some, lose some" that is never used in practice. Worse, it might end up
> in a tuning guide as "always set this sysctl" without considering the
> capabilities of the machine or the workload and falls victim to cargo
> cult tuning.
>
It is 0 by default so no spin. Workloads that do set this should know what
they are doing and if it's beneficial. I will try DB OLTP workload to see
any benefits.

Thanks,
Subhra


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe
  2018-11-05 23:40     ` Subhra Mazumdar
@ 2018-11-06  8:41       ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2018-11-06  8:41 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, peterz, tglx, dhaval.giani, steven.sistare,
	Alexander Viro

On Mon, Nov 05, 2018 at 03:40:40PM -0800, Subhra Mazumdar wrote:
> 
> On 11/5/18 2:08 AM, Mel Gorman wrote:
> > Adding Al Viro as per get_maintainers.pl.
> > 
> > On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:
> > > Introduce pipe_ll_usec field for pipes that indicates the amount of micro
> > > seconds a thread should spin if pipe is empty or full before sleeping. This
> > > is similar to network sockets.
> > Can you point to what pattern from network sockets you are duplicated
> > exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
> > but it should be in the changelog because there are differences in polling
> > depending on where you are in the network subsystem (which I'm not very
> > familiar with).
> I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By
> similar I meant having a similar mechanism for pipes to busy wait

Expand that in the changelog.

> > > Workloads like hackbench in pipe mode
> > > benefits significantly from this by avoiding the sleep and wakeup overhead.
> > > Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
> > > to enable or disable busy waiting via /proc. The value of it specifies the
> > > amount of spin in microseconds. Default value is 0 indicating no spin.
> > > 
> > Your lead mail indicates the spin was set to a "suitable spin time". How
> > should an administrator select this spin time? What works for hackbench
> > might not be suitable for another workload. What if the spin time
> > selected happens to be just slightly longer than the time it takes the
> > reader to respond? In such a case, the result would be "all spin, no
> > gain". While networking potentially suffers the same problem, it appears
> > to be opt-in per socket so it's up to the application not to shoot
> > itself in the foot.
>
> Even for network, sk_ll_usec is assigned the value of the tunable
> sysctl_net_busy_read in sock_init_data() for all sockets initialized by
> default. There is way for per socket setting using sock_setsockopt(), for
> pipes that can be added later if needed in case of different apps running
> in one system. But there are cases where only one app runs (e.g big DBs)
> and one tunable will suffice. It can be set to a value that is tested to be
> beneficial under the operating conditions.

While sockets have an existing API to alter the polling interval, I don't
think pipes do and it's not clear what API should be used. fcntl would
be a candidate I guess but that would need careful review. If this is
in error, expand upon it in the changelog. Otherwise it'll have to be
considered what happens for applications that has a default setting for
one app that is detrimental to others.

> > > @@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
> > >   void pipe_wait(struct pipe_inode_info *pipe)
> > >   {
> > >   	DEFINE_WAIT(wait);
> > > +	u64 start;
> > >   	/*
> > >   	 * Pipes are system-local resources, so sleeping on them
> > > @@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
> > >   	 */
> > >   	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
> > >   	pipe_unlock(pipe);
> > > +	start = local_clock();
> > > +	while (current->state != TASK_RUNNING &&
> > > +	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
> > > +		cpu_relax();
> > >   	schedule();
> > >   	finish_wait(&pipe->wait, &wait);
> > >   	pipe_lock(pipe);
>
> > Networking breaks this out better in terms of options instead of
> > hard-coding. This does not handle need_resched or signal delivery
> > properly where as networking does for example.
>
> I don't disable preemption, so don't think checking need_resched is needed.

With either preemption disabled in the config or a preemption point,
this can loop when it should be rescheduling. While it's unlikely to
trigger a soft lockup unless the spin interval is set to a stupidly high
value, it still should be accounted for.

> Can you point to what you mean by handling signal delivery in case of
> networking? Not sure what I am missing. My initial version broke it out
> like networking but after Peter's suggestion I clubbed it. I don't feel
> strongly either way.

It's checked in sk_can_busy_loop(). Again, may or may not be relevant
but it should be explained why signal delivery and the need for
scheduling is not checked in either the changelog or the comments.

> > 
> > > @@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
> > >   	struct file *files[2];
> > >   	int fd[2];
> > >   	int error;
> > > +	struct pipe_inode_info *pipe;
> > >   	error = __do_pipe_flags(fd, files, flags);
> > >   	if (!error) {
> > > @@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
> > >   			fd_install(fd[0], files[0]);
> > >   			fd_install(fd[1], files[1]);
> > >   		}
> > > +		pipe = files[0]->private_data;
> > > +		pipe->pipe_ll_usec = pipe_busy_poll;
> > > +		pipe = files[1]->private_data;
> > > +		pipe->pipe_ll_usec = pipe_busy_poll;
> > >   	}
> > >   	return error;
> > >   }
> > You add a pipe field but the value in always based on the sysctl
> > so the information is redundantu (barring a race condition on one
> > pipe write per sysctl update which is an irrelevant corner case). In
> > comparison, the network subsystem appears to be explicitly opt-in via
> > setsockopt(SO_BUSY_POLL) from a glance and the value appears to be tunable
> > on a per-socket basis (I didn't check for sure, this is based on a glance
> > at what networking does). It's not clear what a sensible way of replicating
> > that for pipe file descriptors would be but it's possible that the only
> > way would be to define the system-wide tunable as a max spin time and try
> > detect the optimal spin time on a per-fd basis (not entirely dissimilar
> > to how cpuidle menu guesses how long it'll be in an idle state for).
>
> As I said above even networking assigns the global tunable value to each
> socket during initialization.

But it's an invariant in your pipe patch so why carry it now when it
can't be changed from userspace?

> Trying to detect an optimal spin time will
> be extremely difficult, I had tried a lot of heuristics but doesn't quite
> work under all workloads/utilization level. This change also doesn't
> preclude such work in future. In mean time some workloads can benefit from
> straight forward constant tunable setting.

It would be preferred if there was some guideline on how and when it
should be used. As the patch stands, there isn't even documentation on
the existance of the tunable, let alone how or when it should be used.

> > It's not really my area but I feel that this patch is a benchmark-specific
> > hack and that tuning it on a system-wide basis will be a game of "win
> > some, lose some" that is never used in practice. Worse, it might end up
> > in a tuning guide as "always set this sysctl" without considering the
> > capabilities of the machine or the workload and falls victim to cargo
> > cult tuning.
> > 
> It is 0 by default so no spin.

Which also means that it's dead code by default and given its
undocumented nature, it's hard to see whether anyone would even notice.

> Workloads that do set this should know what
> they are doing and if it's beneficial. I will try DB OLTP workload to see
> any benefits.
> 

This should be a given because if this is aimed at DB OLTP workloads,
then there should at least be one example showing that DB OLTP workloads
even benefit and are bound to the performance of pipe and not the more
obvious candidate -- sockets.

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-11-06  8:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-09-25 23:32 [RFC PATCH v2 0/1] Pipe busy wait subhra mazumdar
2018-09-25 23:32 ` [RFC PATCH v2 1/1] pipe: busy wait for pipe subhra mazumdar
2018-11-05 10:08   ` Mel Gorman
2018-11-05 23:40     ` Subhra Mazumdar
2018-11-06  8:41       ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).