public inbox for linux-kernel@vger.kernel.org
* [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
@ 2004-07-26 13:11 Avi Kivity
  2004-07-26 21:02 ` Pavel Machek
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-26 13:11 UTC (permalink / raw)
  To: linux-kernel

On heavy write activity, allocators wait synchronously for kswapd to
free some memory. But if kswapd is freeing memory via a userspace NFS
server, that server could be waiting for kswapd, and the system seizes
instantly.

This patch (against RHEL 2.4.21-15EL, but should apply either literally
or conceptually to other kernels) allows a process to declare itself
kswapd's little helper, so that it will not have to wait on kswapd.

--- a/include/linux/prctl.h	2003-10-23 09:00:00.000000000 +0200
+++ b/include/linux/prctl.h	2004-07-21 13:43:01.000000000 +0300
@@ -43,5 +43,10 @@
  # define PR_TIMING_TIMESTAMP	1	/* Accurate timestamp based
  						   process timing */

+/* Get/set PF_MEMALLOC task flag bit */
+#define PR_GET_KSWAPD_HELPER 15
+#define PR_SET_KSWAPD_HELPER 16
+
+
  #endif /* _LINUX_PRCTL_H */
--- a/kernel/sys.c	2003-10-23 09:00:00.000000000 +0200
+++ b/kernel/sys.c	2004-07-21 13:42:59.000000000 +0300
@@ -1400,6 +1400,22 @@ asmlinkage long sys_prctl(int option, un
  			}
  			current->keep_capabilities = arg2;
  			break;
+		case PR_GET_KSWAPD_HELPER:
+			if (current->flags & PF_MEMALLOC)
+				error = 1;
+			break;
+		case PR_SET_KSWAPD_HELPER:
+			switch (arg2) {
+				case 0:
+					current->flags &= ~PF_MEMALLOC;
+					break;
+				case 1:
+					current->flags |= PF_MEMALLOC;
+					break;
+				default:
+					error = -EINVAL;
+			}
+			break;
  		default:
  			error = -EINVAL;
  			break;


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-26 13:11 [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount Avi Kivity
@ 2004-07-26 21:02 ` Pavel Machek
  2004-07-27 20:22   ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2004-07-26 21:02 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel

Hi!

> On heavy write activity, allocators wait synchronously for kswapd to
> free some memory. But if kswapd is freeing memory via a userspace NFS
> server, that server could be waiting for kswapd, and the system seizes
> instantly.
> 
> This patch (against RHEL 2.4.21-15EL, but should apply either literally
> or conceptually to other kernels) allows a process to declare itself
> kswapd's little helper, so that it will not have to wait on kswapd.

Ok, but what if its memory runs out, anyway?

				Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         



* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-26 21:02 ` Pavel Machek
@ 2004-07-27 20:22   ` Avi Kivity
  2004-07-27 20:34     ` Pavel Machek
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-27 20:22 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

Pavel Machek wrote:

>>On heavy write activity, allocators wait synchronously for kswapd to
>>free some memory. But if kswapd is freeing memory via a userspace NFS
>>server, that server could be waiting for kswapd, and the system seizes
>>instantly.
>>
>>This patch (against RHEL 2.4.21-15EL, but should apply either literally
>>or conceptually to other kernels) allows a process to declare itself
>>kswapd's little helper, so that it will not have to wait on kswapd.
>>    
>>
>
>Ok, but what if its memory runs out, anyway?
>
>  
>
Tough. What if kswapd's memory runs out?

A more complete solution would be to assign memory reserve levels below 
which a process starts allocating synchronously. For example, normal 
processes must have >20MB to make forward progress, kswapd wants >15MB 
and the NFS server needs >10MB. Some way would be needed to express the 
dependencies.

I think more and more people will hit this problem as filesystems become 
more complex due to clustering, and migrate to userspace where it can be 
more easily managed.

Avi


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-27 20:22   ` Avi Kivity
@ 2004-07-27 20:34     ` Pavel Machek
  2004-07-27 21:02       ` Avi Kivity
  2004-07-28 12:08       ` Mikulas Patocka
  0 siblings, 2 replies; 22+ messages in thread
From: Pavel Machek @ 2004-07-27 20:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel

Hi!

> >>On heavy write activity, allocators wait synchronously for kswapd to
> >>free some memory. But if kswapd is freeing memory via a userspace NFS
> >>server, that server could be waiting for kswapd, and the system seizes
> >>instantly.
> >>
> >>This patch (against RHEL 2.4.21-15EL, but should apply either literally
> >>or conceptually to other kernels) allows a process to declare itself
> >>kswapd's little helper, so that it will not have to wait on kswapd.
> >>   
> >>
> >
> >Ok, but what if its memory runs out, anyway?
> >
> > 
> >
> Tough. What if kswapd's memory runs out?

I'd hope that kswapd was careful to make sure that it always has
enough pages...

...it is harder to do the same auditing with a userland program.

> A more complete solution would be to assign memory reserve levels below 
> which a process starts allocating synchronously. For example, normal 
> processes must have >20MB to make forward progress, kswapd wants >15MB 
> and the NFS server needs >10MB. Some way would be needed to express the 
> dependencies.

Yes, something like that would be necessary. I believe it would be
slightly more complicated, like

"NFS server needs > 10MB *and working kswapd*", so you'd need 25MB in
fact... and this info should be stored in some readable form so that
it can be checked.

								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-27 20:34     ` Pavel Machek
@ 2004-07-27 21:02       ` Avi Kivity
  2004-07-28  1:29         ` Nick Piggin
  2004-07-28 12:08       ` Mikulas Patocka
  1 sibling, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-27 21:02 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

Pavel Machek wrote:

>I'd hope that kswapd was careful to make sure that it always has
>enough pages...
>
>...it is harder to do the same auditing with a userland program.
>
>  
>
Very true. But if a kernel thread like kswapd depends on a userspace 
program, then that program had better be well behaved.

>>A more complete solution would be to assign memory reserve levels below 
>>which a process starts allocating synchronously. For example, normal 
>>processes must have >20MB to make forward progress, kswapd wants >15MB 
>>and the NFS server needs >10MB. Some way would be needed to express the 
>>dependencies.
>>    
>>
>
>Yes, something like that would be necessary. I believe it would be
>slightly more complicated, like
>
>"NFS server needs > 10MB *and working kswapd*", so you'd need 25MB in
>fact... and this info should be stored in some readable form so that
>it can be checked.
>
>  
>
If the NFS server needed kswapd, we'd deadlock pretty soon, as kswapd 
*really* needs the NFS server. In our case, all block I/O is done using 
unbuffered I/O, and all memory is preallocated, so we don't need kswapd 
at all, just that small bit of memory that syscalls consume.

If the NFS server really needs kswapd, then there'd better be two of 
them. Regular processes would depend on one kswapd, which depends on the 
NFS server, which depends on the second kswapd, which depends on the 
hardware alone. It should be fun trying to describe that topology to the 
kernel through some API.

Our filesystem actually does something like that internally, except the 
dependency chain length is seven, not two.

Avi


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-27 21:02       ` Avi Kivity
@ 2004-07-28  1:29         ` Nick Piggin
  2004-07-28  2:17           ` Trond Myklebust
  2004-07-28  5:11           ` Avi Kivity
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2004-07-28  1:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Pavel Machek wrote:
> 
>> I'd hope that kswapd was careful to make sure that it always has
>> enough pages...
>>
>> ...it is harder to do the same auditing with a userland program.
>>
>>  
>>
> Very true. But if a kernel thread like kswapd depends on a userspace 
> program, then that program had better be well behaved.
> 
>>> A more complete solution would be to assign memory reserve levels 
>>> below which a process starts allocating synchronously. For example, 
>>> normal processes must have >20MB to make forward progress, kswapd 
>>> wants >15MB and the NFS server needs >10MB. Some way would be needed 
>>> to express the dependencies.
>>>   
>>
>>
>> Yes, something like that would be necessary. I believe it would be
>> slightly more complicated, like
>>
>> "NFS server needs > 10MB *and working kswapd*", so you'd need 25MB in
>> fact... and this info should be stored in some readable form so that
>> it can be checked.
>>
>>  
>>
> If the NFS server needed kswapd, we'd deadlock pretty soon, as kswapd 
> *really* needs the NFS server. In our case, all block I/O is done using 
> unbuffered I/O, and all memory is preallocated, so we don't need kswapd 
> at all, just that small bit of memory that syscalls consume.
> 
> If the NFS server really needs kswapd, then there'd better be two of 
> them. Regular processes would depend on one kswapd, which depends on the 
> NFS server, which depends on the second kswapd, which depends on the 
> hardware alone. It should be fun trying to describe that topology to the 
> kernel through some API.
> 
> Our filesystem actually does something like that internally, except the 
> dependency chain length is seven, not two.
> 

There is some need arising for a call to set the PF_MEMALLOC flag for
userspace tasks, so you could probably get a patch accepted. Don't
call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or RECLAIM_HELPER.

But why is your NFS server needed to reclaim memory? Do you have the
filesystem mounted locally?


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  1:29         ` Nick Piggin
@ 2004-07-28  2:17           ` Trond Myklebust
  2004-07-28  5:13             ` Avi Kivity
  2004-07-28  5:11           ` Avi Kivity
  1 sibling, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2004-07-28  2:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Avi Kivity, Pavel Machek, linux-kernel

On Tue, 27/07/2004 at 21:29, Nick Piggin wrote:

> There is some need arising for a call to set the PF_MEMALLOC flag for
> userspace tasks, so you could probably get a patch accepted. Don't
> call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or RECLAIM_HELPER.
> 
> But why is your NFS server needed to reclaim memory? Do you have the
> filesystem mounted locally?

...and why can't this problem be fixed by judicious use of mlock()?

Cheers,
  Trond


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  1:29         ` Nick Piggin
  2004-07-28  2:17           ` Trond Myklebust
@ 2004-07-28  5:11           ` Avi Kivity
  2004-07-28  5:29             ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-28  5:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

>
> There is some need arising for a call to set the PF_MEMALLOC flag for
> userspace tasks, so you could probably get a patch accepted. Don't
> call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or RECLAIM_HELPER.

I don't think my patch is general enough, it deals with only one level 
of dependencies, and doesn't work if the NFS server (or other process 
that kswapd depends on) depends on kswapd itself. It was intended more 
as an RFC than a request for inclusion.

It's probably fine for those with the exact same problem as us.

>
> But why is your NFS server needed to reclaim memory? Do you have the
> filesystem mounted locally?

Yes, for use by protocol adapters like samba.

Avi


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  2:17           ` Trond Myklebust
@ 2004-07-28  5:13             ` Avi Kivity
  0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2004-07-28  5:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Nick Piggin, Pavel Machek, linux-kernel

Trond Myklebust wrote:

>On Tue, 27/07/2004 at 21:29, Nick Piggin wrote:
>
>  
>
>>There is some need arising for a call to set the PF_MEMALLOC flag for
>>userspace tasks, so you could probably get a patch accepted. Don't
>>call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or RECLAIM_HELPER.
>>
>>But why is your NFS server needed to reclaim memory? Do you have the
>>filesystem mounted locally?
>>    
>>
>
>...and why can't this problem be fixed by judicious use of mlock()?
>
>  
>
I mlockall(MCL_CURRENT | MCL_FUTURE) as soon as I wake up in the morning 
(and I never ask the kernel for more memory), but the kernel likes to 
allocate memory when performing some syscalls for me.


Avi


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  5:11           ` Avi Kivity
@ 2004-07-28  5:29             ` Nick Piggin
  2004-07-28  7:05               ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2004-07-28  5:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Nick Piggin wrote:
> 
>>
>> There is some need arising for a call to set the PF_MEMALLOC flag for
>> userspace tasks, so you could probably get a patch accepted. Don't
>> call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or RECLAIM_HELPER.
> 
> 
> I don't think my patch is general enough, it deals with only one level 
> of dependencies, and doesn't work if the NFS server (or other process 
> that kswapd depends on) depends on kswapd itself. It was intended more 
> as an RFC than a request for inclusion.
> 
> It's probably fine for those with the exact same problem as us.
> 

Well it isn't that you depend on kswapd, but that your task gets called
into via page reclaim (to facilitate page reclaim). In which case having
the task block in memory allocation can cause a deadlock.

The solution is that PF_MEMALLOC tasks are allowed to access the reserve
pool. Dependencies don't matter to this system. It would be your job to
ensure all tasks that might need to allocate memory in order to free
memory have the flag set.


>>
>> But why is your NFS server needed to reclaim memory? Do you have the
>> filesystem mounted locally?
> 
> 
> Yes, for use by protocol adapters like samba.
> 

OK.


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  5:29             ` Nick Piggin
@ 2004-07-28  7:05               ` Avi Kivity
  2004-07-28  7:16                 ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-28  7:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

> Avi Kivity wrote:
>
>> Nick Piggin wrote:
>>
>>>
>>> There is some need arising for a call to set the PF_MEMALLOC flag for
>>> userspace tasks, so you could probably get a patch accepted. Don't
>>> call it KSWAPD_HELPER though, maybe MEMFREE or RECLAIM or 
>>> RECLAIM_HELPER.
>>
>>
>>
>> I don't think my patch is general enough, it deals with only one 
>> level of dependencies, and doesn't work if the NFS server (or other 
>> process that kswapd depends on) depends on kswapd itself. It was 
>> intended more as an RFC than a request for inclusion.
>>
>> It's probably fine for those with the exact same problem as us.
>>
>
> Well it isn't that you depend on kswapd, but that your task gets called
> into via page reclaim (to facilitate page reclaim). In which case having
> the task block in memory allocation can cause a deadlock.

In my particular case that's true, so I only depended on kswapd as a 
side effect of the memory allocation logic. Setting PF_MEMALLOC fixed that.

>
> The solution is that PF_MEMALLOC tasks are allowed to access the reserve
> pool. Dependencies don't matter to this system. It would be your job to
> ensure all tasks that might need to allocate memory in order to free
> memory have the flag set.

In the general case that's not sufficient. What if the NFS server wrote 
to ext3 via the VFS? We might have a ton of ext3 pagecache waiting for 
kswapd to reclaim NFS memory, while kswapd is waiting on the NFS server 
writing to ext3.

The patch I posted is simple and quite sufficient for my needs, but I'm 
sure more convoluted cases will turn up where something more complex is 
needed. Probably one can construct such cases out of in-kernel 
components like the loop device, dm, and the NFS client and server.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.




* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  7:05               ` Avi Kivity
@ 2004-07-28  7:16                 ` Nick Piggin
  2004-07-28  7:45                   ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2004-07-28  7:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Nick Piggin wrote:

>>
>> The solution is that PF_MEMALLOC tasks are allowed to access the reserve
>> pool. Dependencies don't matter to this system. It would be your job to
>> ensure all tasks that might need to allocate memory in order to free
>> memory have the flag set.
> 
> 
> In the general case that's not sufficient. What if the NFS server wrote 
> to ext3 via the VFS? We might have a ton of ext3 pagecache waiting for 
> kswapd to reclaim NFS memory, while kswapd is waiting on the NFS server 
> writing to ext3.
> 

It is sufficient.

You didn't explain your example very well, but I'll assume it is the
following:

dirty NFS data -> NFS server on localhost -> ext3 filesystem.

So kswapd tries to reclaim some memory and writes out the dirty NFS
data. The NFS server then writes this data to ext3 (it can do this
because it is PF_MEMALLOC). The data gets written out, the NFS server
tells the client it is clean, kswapd continues.

Right?


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  7:16                 ` Nick Piggin
@ 2004-07-28  7:45                   ` Avi Kivity
  2004-07-28  9:05                     ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-28  7:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

> Avi Kivity wrote:
>
>> Nick Piggin wrote:
>
>
>>>
>>> The solution is that PF_MEMALLOC tasks are allowed to access the 
>>> reserve
>>> pool. Dependencies don't matter to this system. It would be your job to
>>> ensure all tasks that might need to allocate memory in order to free
>>> memory have the flag set.
>>
>>
>>
>> In the general case that's not sufficient. What if the NFS server 
>> wrote to ext3 via the VFS? We might have a ton of ext3 pagecache 
>> waiting for kswapd to reclaim NFS memory, while kswapd is waiting on 
>> the NFS server writing to ext3.
>>
>
> It is sufficient.
>
> You didn't explain your example very well, but I'll assume it is the
> following:
>
> dirty NFS data -> NFS server on localhost -> ext3 filesystem. 

That's what I meant, sorry for not making it clear.

>
>
> So kswapd tries to reclaim some memory and writes out the dirty NFS
> data. The NFS server then writes this data to ext3 (it can do this
> because it is PF_MEMALLOC). The data gets written out, the NFS server
> tells the client it is clean, kswapd continues.
>
> Right?

What's stopping the NFS server from ooming the machine then? Every time 
some bit of memory becomes free, the server will consume it instantly. 
Eventually ext3 will not be able to write anything out because it is out 
of memory.

An even more complex case is when ext3 depends on some other process, 
say it is mounted on a loopback nbd.

  dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on localhost 
-> ext3/raw device

You can't have both the NFS server and the nbd server PF_MEMALLOC, since 
the NFS server may consume all memory, then wait for the nbd server to 
reclaim.

The solution I have in mind is to replace the sync allocation logic from

    if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
        wait_for_kswapd()

to

    if (free_mem() < current->limit)
        wait_for_kswapd()

kswapd would have the lowest ->limit, other processes as their place in 
the food chain dictates.  

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.




* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  7:45                   ` Avi Kivity
@ 2004-07-28  9:05                     ` Nick Piggin
  2004-07-28 10:11                       ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2004-07-28  9:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Nick Piggin wrote:
> 
>> Avi Kivity wrote:
>>
>>> Nick Piggin wrote:
>>
>>
>>
>>>>
>>>> The solution is that PF_MEMALLOC tasks are allowed to access the 
>>>> reserve
>>>> pool. Dependencies don't matter to this system. It would be your job to
>>>> ensure all tasks that might need to allocate memory in order to free
>>>> memory have the flag set.
>>>
>>>
>>>
>>>
>>> In the general case that's not sufficient. What if the NFS server 
>>> wrote to ext3 via the VFS? We might have a ton of ext3 pagecache 
>>> waiting for kswapd to reclaim NFS memory, while kswapd is waiting on 
>>> the NFS server writing to ext3.
>>>
>>
>> It is sufficient.
>>
>> You didn't explain your example very well, but I'll assume it is the
>> following:
>>
>> dirty NFS data -> NFS server on localhost -> ext3 filesystem. 
> 
> 
> That's what I meant, sorry for not making it clear.
> 
>>
>>
>> So kswapd tries to reclaim some memory and writes out the dirty NFS
>> data. The NFS server then writes this data to ext3 (it can do this
>> because it is PF_MEMALLOC). The data gets written out, the NFS server
>> tells the client it is clean, kswapd continues.
>>
>> Right?
> 
> 
> What's stopping the NFS server from ooming the machine then? Every time 
> some bit of memory becomes free, the server will consume it instantly. 
> Eventually ext3 will not be able to write anything out because it is out 
> of memory.
> 

The NFS server should do the writeout a page at a time.

> An even more complex case is when ext3 depends on some other process, 
> say it is mounted on a loopback nbd.
> 
>  dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on localhost 
> -> ext3/raw device
> 
> You can't have both the NFS server and the nbd server PF_MEMALLOC, since 
> the NFS server may consume all memory, then wait for the nbd server to 
> reclaim.
> 

The memory allocators will block when memory reaches the reserved
mark. Page reclaim will ask NFS to free one page, so the server
will write something out to the filesystem, this will cause the nbd
server (also PF_MEMALLOC) to write out to its backing filesystem.

> The solution I have in mind is to replace the sync allocation logic from
> 
>    if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
>        wait_for_kswapd()
> 
> to
> 
>    if (free_mem() < current->limit)
>        wait_for_kswapd()
> 
> kswapd would have the lowest ->limit, other processes as their place in 
> the food chain dictates. 

I think this is barking up the wrong tree. It really doesn't matter
what process is freeing memory. There isn't really anything special
about the way kswapd frees memory.


* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28  9:05                     ` Nick Piggin
@ 2004-07-28 10:11                       ` Avi Kivity
  2004-07-28 10:30                         ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-28 10:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

>>>
>>>>>
>>>>> The solution is that PF_MEMALLOC tasks are allowed to access the 
>>>>> reserve
>>>>> pool. Dependencies don't matter to this system. It would be your 
>>>>> job to
>>>>> ensure all tasks that might need to allocate memory in order to free
>>>>> memory have the flag set.
>>>>
>>>> In the general case that's not sufficient. What if the NFS server 
>>>> wrote to ext3 via the VFS? We might have a ton of ext3 pagecache 
>>>> waiting for kswapd to reclaim NFS memory, while kswapd is waiting 
>>>> on the NFS server writing to ext3.
>>>>
>>> It is sufficient.
>>>
>>> You didn't explain your example very well, but I'll assume it is the
>>> following:
>>>
>>> dirty NFS data -> NFS server on localhost -> ext3 filesystem. 
>>
>>
>> That's what I meant, sorry for not making it clear.
>>
>>> So kswapd tries to reclaim some memory and writes out the dirty NFS
>>> data. The NFS server then writes this data to ext3 (it can do this
>>> because it is PF_MEMALLOC). The data gets written out, the NFS server
>>> tells the client it is clean, kswapd continues.
>>>
>>> Right?
>>
>>
>> What's stopping the NFS server from ooming the machine then? Every 
>> time some bit of memory becomes free, the server will consume it 
>> instantly. Eventually ext3 will not be able to write anything out 
>> because it is out of memory.
>>
> The NFS server should do the writeout a page at a time.

The NFS server writes not only in response to page reclaim (as a local 
NFS client), but also in response to pressure from non-local clients. If 
both ext3 and NFS have the same allocation limits, NFS may starve out ext3.

(In my case the NFS server actually writes data asynchronously, so it 
doesn't really know it is responding to page reclaim, but the problem 
occurs even in a synchronous NFS server.)

>
>> An even more complex case is when ext3 depends on some other process, 
>> say it is mounted on a loopback nbd.
>>
>>  dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on 
>> localhost -> ext3/raw device
>>
>> You can't have both the NFS server and the nbd server PF_MEMALLOC, 
>> since the NFS server may consume all memory, then wait for the nbd 
>> server to reclaim.
>>
> The memory allocators will block when memory reaches the reserved
> mark. Page reclaim will ask NFS to free one page, so the server
> will write something out to the filesystem, this will cause the nbd
> server (also PF_MEMALLOC) to write out to its backing filesystem.

If NFS and nbd have the same limit, then NFS may cause nbd to stall. 
We've already established that NFS must be PF_MEMALLOC, so nbd must be 
PF_MEMALLOC_HARDER or something like that.

> The solution I have in mind is to replace the sync allocation logic from
>
>>
>>    if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
>>        wait_for_kswapd()
>>
>> to
>>
>>    if (free_mem() < current->limit)
>>        wait_for_kswapd()
>>
>> kswapd would have the lowest ->limit, other processes as their place 
>> in the food chain dictates. 
>
>
> I think this is barking up the wrong tree. It really doesn't matter
> what process is freeing memory. There isn't really anything special
> about the way kswapd frees memory.

To free memory you need (a) to allocate memory (b) possibly wait for 
some freeing process to make some progress. That means all processes in 
the freeing chain must be able to allocate at least some memory. If two 
processes in the chain share the same blocking logic, they may deadlock 
on each other.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.




* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28 10:11                       ` Avi Kivity
@ 2004-07-28 10:30                         ` Nick Piggin
  2004-07-28 11:48                           ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2004-07-28 10:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Nick Piggin wrote:

>>> What's stopping the NFS server from ooming the machine then? Every 
>>> time some bit of memory becomes free, the server will consume it 
>>> instantly. Eventually ext3 will not be able to write anything out 
>>> because it is out of memory.
>>>
>> The NFS server should do the writeout a page at a time.
> 
> 
> The NFS server writes not only in response to page reclaim (as a local 
> NFS client), but also in response to pressure from non-local clients. If 
> both ext3 and NFS have the same allocation limits, NFS may starve out ext3.
> 

What do you mean starve out ext3? ext3 gets written to *by the NFS server*
which is PF_MEMALLOC.

> (In my case the NFS server actually writes data asynchronously, so it 
> doesn't really know it is responding to page reclaim, but the problem 
> occurs even in a synchronous NFS server.)
> 

I can't see this being the responsibility of the kernel. The NFS server
could probably find out if it is servicing a loopback request or not.
Remote requests don't help to free memory... unless maybe you want a
filesystem on a remote nbd to be exported back to server via NFS or
something crazy.

>>
>>> An even more complex case is when ext3 depends on some other process, 
>>> say it is mounted on a loopback nbd.
>>>
>>>  dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on 
>>> localhost -> ext3/raw device
>>>
>>> You can't have both the NFS server and the nbd server PF_MEMALLOC, 
>>> since the NFS server may consume all memory, then wait for the nbd 
>>> server to reclaim.
>>>
>> The memory allocators will block when memory reaches the reserved
>> mark. Page reclaim will ask NFS to free one page, so the server
>> will write something out to the filesystem, this will cause the nbd
>> server (also PF_MEMALLOC) to write out to its backing filesystem.
> 
> 
> If NFS and nbd have the same limit, then NFS may cause nbd to stall. 
> We've already established that NFS must be PF_MEMALLOC, so nbd must be 
> PF_MEMALLOC_HARDER or something like that.

No, your NFS server has to be coded differently. You can't allow it
to use up all PF_MEMALLOC memory just because it can.

> 
>> The solution I have in mind is to replace the sync allocation logic from
>>
>>>
>>>    if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
>>>        wait_for_kswapd()
>>>
>>> to
>>>
>>>    if (free_mem() < current->limit)
>>>        wait_for_kswapd()
>>>
>>> kswapd would have the lowest ->limit, other processes as their place 
>>> in the food chain dictates. 
>>
>>
>>
>> I think this is barking up the wrong tree. It really doesn't matter
>> what process is freeing memory. There isn't really anything special
>> about the way kswapd frees memory.
> 
> 
> To free memory you need (a) to allocate memory (b) possibly wait for 
> some freeing process to make some progress. That means all processes in 
> the freeing chain must be able to allocate at least some memory. If two 
> processes in the chain share the same blocking logic, they may deadlock 
> on each other.
> 

The PF_MEMALLOC path isn't to be used like that. If a *single*
PF_MEMALLOC task were to allocate all its memory then that would
be a bug too.
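For readers following along, the allocation behaviour being argued over can be sketched roughly as follows. This is a simplified model of the 2.4-era logic with invented names and a toy watermark, not the actual mm/page_alloc.c code: PF_MEMALLOC tasks may dip into the reserve pool, everyone else blocks (waits for kswapd) once free memory falls below the global mark.

```c
#include <assert.h>

/* Toy model: PF_MEMALLOC tasks bypass the global low-memory watermark;
 * ordinary tasks must wait for kswapd once free memory drops below it.
 * All names and numbers are illustrative. */

#define PF_MEMALLOC 0x00000800

struct task {
	unsigned long flags;
};

enum alloc_result { ALLOC_OK, ALLOC_MUST_WAIT };

static enum alloc_result try_alloc(const struct task *t,
				   long free_pages, long reserve_pages)
{
	if (free_pages <= 0)
		return ALLOC_MUST_WAIT;		/* truly nothing left */
	if (free_pages < reserve_pages && !(t->flags & PF_MEMALLOC))
		return ALLOC_MUST_WAIT;		/* wait_for_kswapd() */
	return ALLOC_OK;			/* reserve pool or plenty free */
}
```

The deadlock under discussion arises because two PF_MEMALLOC tasks share the same reserve pool: one can drain it and then wait on the other.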

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28 10:30                         ` Nick Piggin
@ 2004-07-28 11:48                           ` Avi Kivity
  2004-07-29  8:29                             ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2004-07-28 11:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

> Avi Kivity wrote:
>
>> Nick Piggin wrote:
>
>
>>>> What's stopping the NFS server from ooming the machine then? Every 
>>>> time some bit of memory becomes free, the server will consume it 
>>>> instantly. Eventually ext3 will not be able to write anything out 
>>>> because it is out of memory.
>>>>
>>> The NFS server should do the writeout a page at a time.
>>
>>
>>
>> The NFS server writes not only in response to page reclaim (as a 
>> local NFS client), but also in response to pressure from non-local 
>> clients. If both ext3 and NFS have the same allocation limits, NFS 
>> may starve out ext3.
>>
>
> What do you mean starve out ext3? ext3 gets written to *by the NFS 
> server*
> which is PF_MEMALLOC. 

When the NFS server writes, it allocates pagecache and temporary 
objects. When ext3 writes, it allocates temporary objects. If the NFS 
server writes too much, ext3 can't allocate memory, and will never be 
able to allocate memory.

>
>
>> (In my case the NFS server actually writes data asynchronously, so it 
>> doesn't really know it is responding to page reclaim, but the problem 
>> occurs even in a synchrounous NFS server.)
>>
>
> I can't see this being the responsibility of the kernel. The NFS server
> could probably find out if it is servicing a loopback request or not. 

It might be emptying its own caches. Or any one of a hundred other 
things. Everything can be hacked around (as I did with PF_MEMALLOC, and 
relying on my server to write directly), but that's not a general solution.

>
> Remote requests don't help to free memory... unless maybe you want a
> filesystem on a remote nbd to be exported back to server via NFS or
> something crazy. 

Cycles can easily happen in clusters, especially with failover and failback.

>>>> An even more complex case is when ext3 depends on some other 
>>>> process, say it is mounted on a loopback nbd.
>>>>
>>>>  dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on 
>>>> localhost -> ext3/raw device
>>>>
>>>> You can't have both the NFS server and the nbd server PF_MEMALLOC, 
>>>> since the NFS server may consume all memory, then wait for the nbd 
>>>> server to reclaim.
>>>>
>>> The memory allocators will block when memory reaches the reserved
>>> mark. Page reclaim will ask NFS to free one page, so the server
>>> will write something out to the filesystem, this will cause the nbd
>>> server (also PF_MEMALLOC) to write out to its backing filesystem.
>>
>>
>> If NFS and nbd have the same limit, then NFS may cause nbd to stall. 
>> We've already established that NFS must be PF_MEMALLOC, so nbd must 
>> be PF_MEMALLOC_HARDER or something like that.
>
>
> No, your NFS server has to be coded differently. You can't allow it
> to use up all PF_MEMALLOC memory just because it can. 

There is no API that the NFS server can use to block and unblock on 
memory allocation.

My proposal extends memory allocation in general to block and unblock in 
a way that prevents deadlocks, providing the reservation levels have 
been set up correctly. And it can work cluster-wide.

>>> The solution I have in mind is to replace the sync allocation logic 
>>> from
>>>
>>>>
>>>>    if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
>>>>        wait_for_kswapd()
>>>>
>>>> to
>>>>
>>>>    if (free_mem() < current->limit)
>>>>        wait_for_kswapd()
>>>>
>>>> kswapd would have the lowest ->limit, other processes as their 
>>>> place in the food chain dictates. 
>>>
>>>
>>> I think this is barking up the wrong tree. It really doesn't matter
>>> what process is freeing memory. There isn't really anything special
>>> about the way kswapd frees memory.
>>
>>
>> To free memory you need (a) to allocate memory (b) possibly wait for 
>> some freeing process to make some progress. That means all processes 
>> in the freeing chain must be able to allocate at least some memory. 
>> If two processes in the chain share the same blocking logic, they may 
>> deadlock on each other.
>>
>
> The PF_MEMALLOC path isn't to be used like that. If a *single*
> PF_MEMALLOC task were to allocate all its memory then that would
> be a bug too.

Again, there is no way for a task to know how much memory to allocate. 
Even if it writes out one page at a time, if it's PF_MEMALLOC and it 
generates dirty pagecache, it will soon consume all memory. Would you 
have the server sleep() after every write?

Note that my proposal is a generalization of PF_MEMALLOC; instead of 
setting the PF_MEMALLOC flag in kswapd you can set 
current->min_free_mem_to_allocate_nonblockingly to 0.
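That generalization can be modelled in a few lines: every task carries its own watermark, and kswapd's is 0, so the existing PF_MEMALLOC behaviour falls out as the limit-0 special case. Field and function names below are invented for illustration; the thread's example numbers (kswapd < nbd < NFS < normal) supply the ordering.

```c
#include <assert.h>

/* Sketch of the proposed per-task reservation level: a task blocks only
 * when free memory drops below its own limit. kswapd (limit 0) never
 * blocks, reproducing PF_MEMALLOC as a special case. Hypothetical names. */

struct task_limit {
	long min_free_to_alloc;		/* pages; the proposed current->limit */
};

static int must_wait_for_kswapd(const struct task_limit *t, long free_pages)
{
	return free_pages < t->min_free_to_alloc;
}
```

With distinct limits, a task deeper in the freeing chain (lower limit) can still allocate at a level of memory pressure that already blocks the tasks above it, so the chain as a whole keeps making progress.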

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-27 20:34     ` Pavel Machek
  2004-07-27 21:02       ` Avi Kivity
@ 2004-07-28 12:08       ` Mikulas Patocka
  2004-07-28 12:18         ` Avi Kivity
  1 sibling, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2004-07-28 12:08 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Avi Kivity, linux-kernel

> Hi!
>
> > >>On heavy write activity, allocators wait synchronously for kswapd to
> > >>free some memory. But if kswapd is freeing memory via a userspace NFS
> > >>server, that server could be waiting for kswapd, and the system seizes
> > >>instantly.
> > >>
> > >>This patch (against RHEL 2.4.21-15EL, but should apply either
> > >>literally
> > >>or conceptually to other kernels) allows a process to declare itself
> > >>as
> > >>kswapd's little helper, and thus will not have to wait on kswapd.
> > >>
> > >>
> > >
> > >Ok, but what if its memory runs out, anyway?
> > >
> > >
> > >
> > Tough. What if kswapd's memory runs out?
>
> I'd hope that kswapd was careful to make sure that it always has
> enough pages...
>
> ...it is harder to do the same auditing with a userland program.
>
> > A more complete solution would be to assign memory reserve levels below
> > which a process starts allocating synchronously. For example, normal
> > processes must have >20MB to make forward progress, kswapd wants >15MB
> > and the NFS server needs >10MB. Some way would be needed to express the
> > dependencies.
>
> Yes, something like that would be necessary. I believe it would be
> slightly more complicated, like
>
> "NFS server needs > 10MB *and working kswapd*", so you'd need 25MB in
> fact... and this info should be stored in some readable form so that
> it can be checked.

Hi!

And if the NFS server is waiting for some lock that is held by another
process that is waiting for kswapd...? It won't help.

The solution would be a limit for dirty pages on NFS --- if you say that
less than 1/8 of memory might be dirty NFS pages, then you can keep the
system stable even if NFS writes starve. If you export an NFS filesystem
via NFSD again, even this wouldn't help, but there's no fix for this case.

Mikulas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28 12:08       ` Mikulas Patocka
@ 2004-07-28 12:18         ` Avi Kivity
  0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2004-07-28 12:18 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Pavel Machek, linux-kernel, Nick Piggin

Mikulas Patocka wrote:

>Hi!
>
>And if the NFS server is waiting for some lock that is held by another
>process that is waiting for kswapd...? It won't help.
>
>The solution would be a limit for dirty pages on NFS --- if you say that
>less than 1/8 of memory might be dirty NFS pages, then you can keep the
>system stable even if NFS writes starve. If you export an NFS filesystem
>via NFSD again, even this wouldn't help, but there's no fix for this case.
>
>  
>
Oh yes there is. You can have different limits for each export, with the 
nested export having a lower limit.

Say the first export may dirty at most 200MB, and the nested export at 
most 180MB. So even if there are heavy writes against the nested export, 
it can always make progress by writing to the outer export, and if the 
system has more than 200MB of memory, the external export can make 
progress by writing out to the filesystem.

That is essentially my suggestion regarding reservation levels, but 
expressed in allocate-at-most terms instead of leave-at-least. And 
nested NFS is a fine example to show the current problems.

(of course, when nesting NFS one also needs to reserve threads so each 
export has at least one thread available to service it. but that's 
another story)
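The invariant behind the 200MB/180MB example can be stated as a small check. This is purely illustrative (there is no such NFS tunable): dirty limits must be strictly decreasing from the outer export inward, and the outermost limit must sit below total memory, so each level can always make progress by writing to the level outside it.

```c
#include <assert.h>

/* Checks the nested-export invariant from the example above: each inner
 * export's dirty limit must be strictly below the outer one's, and the
 * outermost limit must leave headroom below total memory. Illustrative
 * only; not a real NFS interface. */

static int limits_allow_progress(const long *dirty_limit_mb, int levels,
				 long total_mem_mb)
{
	int i;

	if (levels < 1 || dirty_limit_mb[0] >= total_mem_mb)
		return 0;	/* outer export alone can exhaust memory */
	for (i = 1; i < levels; i++)
		if (dirty_limit_mb[i] >= dirty_limit_mb[i - 1])
			return 0;	/* inner level can stall the outer one */
	return 1;
}
```

This is the same idea as the per-task reservation levels, expressed in allocate-at-most terms rather than leave-at-least terms.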

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-28 11:48                           ` Avi Kivity
@ 2004-07-29  8:29                             ` Nick Piggin
  2004-07-29 12:19                               ` Marcelo Tosatti
  2004-07-29 16:09                               ` Avi Kivity
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2004-07-29  8:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Pavel Machek, linux-kernel

Avi Kivity wrote:
> Nick Piggin wrote:
> 
>> Avi Kivity wrote:
>>
>>> Nick Piggin wrote:
>>
>>
>>
>>>>> What's stopping the NFS server from ooming the machine then? Every 
>>>>> time some bit of memory becomes free, the server will consume it 
>>>>> instantly. Eventually ext3 will not be able to write anything out 
>>>>> because it is out of memory.
>>>>>
>>>> The NFS server should do the writeout a page at a time.
>>>
>>>
>>>
>>>
>>> The NFS server writes not only in response to page reclaim (as a 
>>> local NFS client), but also in response to pressure from non-local 
>>> clients. If both ext3 and NFS have the same allocation limits, NFS 
>>> may starve out ext3.
>>>
>>
>> What do you mean starve out ext3? ext3 gets written to *by the NFS 
>> server*
>> which is PF_MEMALLOC. 
> 
> 
> When the NFS server writes, it allocates pagecache and temporary 
> objects. When ext3 writes, it allocates temporary objects. If the NFS 
> server writes too much, ext3 can't allocate memory, and will never be 
> able to allocate memory.
> 

That is because your NFS server shouldn't hog as much memory as
it likes when it is PF_MEMALLOC. The entire writeout path should
do a page at a time if it is PF_MEMALLOC. Ie, the server should
be doing write, fsync.

But now that I think about it, I guess you may not be able to
distinguish that from regular writeout, so doing a page at a time
would hurt performance too much.

Hmm so I guess the idea of a per task reserve limit may be the way
to do it, yes. Thanks for bearing with me!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-29  8:29                             ` Nick Piggin
@ 2004-07-29 12:19                               ` Marcelo Tosatti
  2004-07-29 16:09                               ` Avi Kivity
  1 sibling, 0 replies; 22+ messages in thread
From: Marcelo Tosatti @ 2004-07-29 12:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Avi Kivity, Pavel Machek, linux-kernel

On Thu, Jul 29, 2004 at 06:29:12PM +1000, Nick Piggin wrote:
> Avi Kivity wrote:
> >Nick Piggin wrote:
> >
> >>Avi Kivity wrote:
> >>
> >>>Nick Piggin wrote:
> >>
> >>
> >>
> >>>>>What's stopping the NFS server from ooming the machine then? Every 
> >>>>>time some bit of memory becomes free, the server will consume it 
> >>>>>instantly. Eventually ext3 will not be able to write anything out 
> >>>>>because it is out of memory.
> >>>>>
> >>>>The NFS server should do the writeout a page at a time.
> >>>
> >>>
> >>>
> >>>
> >>>The NFS server writes not only in response to page reclaim (as a 
> >>>local NFS client), but also in response to pressure from non-local 
> >>>clients. If both ext3 and NFS have the same allocation limits, NFS 
> >>>may starve out ext3.
> >>>
> >>
> >>What do you mean starve out ext3? ext3 gets written to *by the NFS 
> >>server*
> >>which is PF_MEMALLOC. 
> >
> >
> >When the NFS server writes, it allocates pagecache and temporary 
> >objects. When ext3 writes, it allocates temporary objects. If the NFS 
> >server writes too much, ext3 can't allocate memory, and will never be 
> >able to allocate memory.
> >
> 
> That is because your NFS server shouldn't hog as much memory as
> it likes when it is PF_MEMALLOC. The entire writeout path should
> do a page at a time if it is PF_MEMALLOC. Ie, the server should
> be doing write, fsync.
> 
> But now that I think about it, I guess you may not be able to
> distinguish that from regular writeout, so doing a page at a time
> would hurt performance too much.
> 
> Hmm so I guess the idea of a per task reserve limit may be the way
> to do it, yes. Thanks for bearing with me!

Hi, 

Reading the discussion, I also agree that we need levels of "allowed depth"
into the reservations.

We could have a global limit for normal allocators, and a per-task
limit for special "kswapd helpers" (which run with PF_MEMALLOC),
with kswapd being the "deepest" eater, as you guys said.

The thing is, those deadlocks are quite uncommon special cases and 
we will need to change core VM logic to handle them. Well...

We need to come up with a generic way of doing it, as Avi says, 
otherwise people will have to keep doing "hacks" to make it work.

Someone needs to sit down and come up with a design. It shouldn't
be that hard.

Just my two pennies...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount
  2004-07-29  8:29                             ` Nick Piggin
  2004-07-29 12:19                               ` Marcelo Tosatti
@ 2004-07-29 16:09                               ` Avi Kivity
  1 sibling, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2004-07-29 16:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, linux-kernel

Nick Piggin wrote:

> Avi Kivity wrote:
>
>> Nick Piggin wrote:
>>
>>> Avi Kivity wrote:
>>>
>>>> Nick Piggin wrote:
>>>
>>>
>>>
>>>>>> What's stopping the NFS server from ooming the machine then? 
>>>>>> Every time some bit of memory becomes free, the server will 
>>>>>> consume it instantly. Eventually ext3 will not be able to write 
>>>>>> anything out because it is out of memory.
>>>>>>
>>>>> The NFS server should do the writeout a page at a time.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> The NFS server writes not only in response to page reclaim (as a 
>>>> local NFS client), but also in response to pressure from non-local 
>>>> clients. If both ext3 and NFS have the same allocation limits, NFS 
>>>> may starve out ext3.
>>>>
>>>
>>> What do you mean starve out ext3? ext3 gets written to *by the NFS 
>>> server*
>>> which is PF_MEMALLOC. 
>>
>>
>>
>> When the NFS server writes, it allocates pagecache and temporary 
>> objects. When ext3 writes, it allocates temporary objects. If the NFS 
>> server writes too much, ext3 can't allocate memory, and will never be 
>> able to allocate memory.
>>
>
> That is because your NFS server shouldn't hog as much memory as
> it likes when it is PF_MEMALLOC. The entire writeout path should
> do a page at a time if it is PF_MEMALLOC. Ie, the server should
> be doing write, fsync. 

We attempted to use sync local mounts (not what you are suggesting: on 
the NFS client side, without the PF_MEMALLOC hack) and still got the 
same deadlock. I am unable to explain why.

>
>
> But now that I think about it, I guess you may not be able to
> distinguish that from regular writeout, so doing a page at a time
> would hurt performance too much.
>
> Hmm so I guess the idea of a per task reserve limit may be the way
> to do it, yes. Thanks for bearing with me!

It was my pleasure.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2004-07-29 16:14 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-07-26 13:11 [PATCH] Deadlock during heavy write activity to userspace NFS server on local NFS mount Avi Kivity
2004-07-26 21:02 ` Pavel Machek
2004-07-27 20:22   ` Avi Kivity
2004-07-27 20:34     ` Pavel Machek
2004-07-27 21:02       ` Avi Kivity
2004-07-28  1:29         ` Nick Piggin
2004-07-28  2:17           ` Trond Myklebust
2004-07-28  5:13             ` Avi Kivity
2004-07-28  5:11           ` Avi Kivity
2004-07-28  5:29             ` Nick Piggin
2004-07-28  7:05               ` Avi Kivity
2004-07-28  7:16                 ` Nick Piggin
2004-07-28  7:45                   ` Avi Kivity
2004-07-28  9:05                     ` Nick Piggin
2004-07-28 10:11                       ` Avi Kivity
2004-07-28 10:30                         ` Nick Piggin
2004-07-28 11:48                           ` Avi Kivity
2004-07-29  8:29                             ` Nick Piggin
2004-07-29 12:19                               ` Marcelo Tosatti
2004-07-29 16:09                               ` Avi Kivity
2004-07-28 12:08       ` Mikulas Patocka
2004-07-28 12:18         ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox