nfs lockup

Linux NFS development

* nfs lockup
@ 2015-10-21 15:25 krichy
  2015-10-21 19:05 ` Benjamin Coddington
  2015-10-23 18:10 ` J. Bruce Fields
  0 siblings, 2 replies; 6+ messages in thread
From: krichy @ 2015-10-21 15:25 UTC (permalink / raw)
  To: linux-nfs

Dear devs,

We have an NFS lockup issue. We run a Ganeti cluster consisting of 7
Debian Linux nodes and 1 FreeNAS server hosting the VM images. The images
are exported via NFSv3. The problem is that, at random, one of our nodes
ends up in a livelock.

That means the NFS share is alive: we can list directories and files, and
can even read files (very slowly, see below). We can even write to files,
but the file close operation blocks and never returns.

The read is slow in the sense that, while copying a file from the share to
/tmp, the data arrives at the node very fast, but the copy in /tmp grows
slowly.

I've also opened a Debian bug report for it, but I think it is not related
to Debian (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801924).

The only way to recover is to reboot the machine, which interrupts all the
VMs running on it.

I've captured every task's stack trace; hopefully it helps someone track
down the issue.

Meanwhile the other 6 nodes can access the NFS share fine, so I think
this is not a networking or server issue. Restarting the NFS server on the
server side has no effect; the node does not recover. The NFS TCP
connection is established and listing files works again, but writes do not.

Some information about the nodes:
# uname -a
Linux host 3.16.0-4-amd64 #1 SMP Debian 
3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux

They have 1.5G of RAM allocated to dom0, which should be enough.

I know this is not much information; please advise what to look for next
time. Unfortunately I don't know how to reproduce it.

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.

[-- Attachment #2: Type: application/gzip, Size: 32556 bytes --]

* Re: nfs lockup
  2015-10-21 15:25 nfs lockup krichy
@ 2015-10-21 19:05 ` Benjamin Coddington
  2015-10-21 20:09   ` krichy
  2015-10-23 18:10 ` J. Bruce Fields
  1 sibling, 1 reply; 6+ messages in thread
From: Benjamin Coddington @ 2015-10-21 19:05 UTC (permalink / raw)
  To: krichy; +Cc: linux-nfs

I took a look at your debian bug report.. what's up with those drbd procs?
Are you writing to drbd-backed devs, and have you made sure that's not
involved in any way?

Ben

* Re: nfs lockup
  2015-10-21 19:05 ` Benjamin Coddington
@ 2015-10-21 20:09   ` krichy
  2015-10-22 11:17     ` Benjamin Coddington
  0 siblings, 1 reply; 6+ messages in thread
From: krichy @ 2015-10-21 20:09 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: linux-nfs


No, the lockup has nothing to do with DRBD. In the Ganeti cluster some VMs
use DRBD-mirrored disks, but others use images in a shared folder on NFS.
It is the NFS share that locks up sometimes. The DRBD devices work well,
and all network connectivity works well.

Please advise what to check next time. Unfortunately I cannot reproduce
the problem.

Could the 9000 MTU setting affect NFS somehow? Does it matter that we are
using Xen, and thus a hypervisor is involved (for DRBD it does)?
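
One way to check whether 9000-byte frames really pass unfragmented between
a node and the NFS server is to ping with the "don't fragment" bit set and
the largest payload that fits in one frame. The following is a minimal
sketch of that arithmetic, assuming IPv4 without IP options and the Linux
iputils ping; the host name "nas.example" is a placeholder, not taken from
this thread.

```python
# Largest ICMP echo payload that fits in a single frame for a given MTU.
# Assumes IPv4 with no IP options (20-byte header) and an 8-byte ICMP header.
IPV4_HEADER = 20
ICMP_HEADER = 8


def max_ping_payload(mtu):
    """Return the largest ping payload (bytes) that avoids fragmentation."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build an iputils ping command line with the don't-fragment bit set."""
    # -M do : prohibit fragmentation, -s : payload size, -c : packet count
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), host]


if __name__ == "__main__":
    # "nas.example" is a hypothetical host name used only for illustration.
    print(" ".join(ping_command("nas.example", 9000)))
```

If the printed command fails on the affected node while a smaller payload
succeeds, some hop on the path is not passing jumbo frames. This only tests
the path MTU; it does not by itself show that the MTU is related to the
hang.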

Thanks,


Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.

* Re: nfs lockup
  2015-10-21 20:09   ` krichy
@ 2015-10-22 11:17     ` Benjamin Coddington
  0 siblings, 0 replies; 6+ messages in thread
From: Benjamin Coddington @ 2015-10-22 11:17 UTC (permalink / raw)
  To: krichy; +Cc: linux-nfs

It looks like a lot of processes are waiting on i_mutex in
generic_file_write_iter().  Possibly you're hitting a particularly
bad spot of contention for that mutex?

You might use 'perf top' to dig into what the system is doing when
this happens.
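
Alongside a profiler, a quick list of the tasks stuck in uninterruptible
sleep can show how many are blocked and roughly where. The following is a
minimal sketch, assuming a Linux /proc filesystem; the function names are
illustrative. It reads each task's state from /proc/<pid>/stat and its wait
channel from /proc/<pid>/wchan. It only looks at thread group leaders, and
the wait channel is a coarse hint, not a full stack trace.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, comm, wchan) for tasks in uninterruptible sleep (state D)."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            continue  # task exited meanwhile, or the file is not readable
        yield pid, comm, wchan


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm, wchan in blocked_tasks():
            print(f"{pid:>7}  {comm:<20}  {wchan}")
```

Running it a few times during a hang and comparing the output would show
whether the same tasks stay blocked in the same place.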

* Re: nfs lockup
  2015-10-21 15:25 nfs lockup krichy
  2015-10-21 19:05 ` Benjamin Coddington
@ 2015-10-23 18:10 ` J. Bruce Fields
  2015-10-26  7:38   ` krichy
  1 sibling, 1 reply; 6+ messages in thread
From: J. Bruce Fields @ 2015-10-23 18:10 UTC (permalink / raw)
  To: krichy; +Cc: linux-nfs

On Wed, Oct 21, 2015 at 05:25:53PM +0200, krichy@tvnetwork.hu wrote:
> Dear devs,
> 
> We have an nfs lockup issue. We run a ganeti cluster consisting of 7
> debian linux nodes and 1 freenas for hosting the vm images. The
> images are exported via nfsv3. The problem is that randomly we end
> in a livelock on one of our nodes.
> 
> That means the nfs share is alive, we can list directories, files,
> even can read files (very slow, see later). And even can write to
> files, but the file close operation does not return, it gets
> blocked.
> 
> The read is slow in that way that while copying a file from the
> share to /tmp, the data arrives very fast to the node, but in /tmp
> it accumulates slowly.

I don't understand what you mean by that.  Do you have some measurements
to help quantify "very fast" and "slowly"?

--b.

* Re: nfs lockup
  2015-10-23 18:10 ` J. Bruce Fields
@ 2015-10-26  7:38   ` krichy
  0 siblings, 0 replies; 6+ messages in thread
From: krichy @ 2015-10-26  7:38 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs


I don't have exact measurements, but my observation was that the file
grew at around a few hundred kB/s, while after a reboot the same file can
be copied at a few MB/s.
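
To turn such observations into numbers, the copy can be timed directly.
The following is a minimal sketch, assuming plain file reads and writes are
enough to show the slowdown; the function names and the chunk size are
illustrative choices. It copies a file in chunks and records how many bytes
have reached the destination over time. Data already in the client's page
cache will make reads look faster than the network, so the first read of a
file is the more meaningful one.

```python
import time


def copy_with_rate(src, dst, chunk_size=1 << 20, report_every=1.0,
                   clock=time.monotonic):
    """Copy src to dst; return a list of (elapsed_seconds, total_bytes)."""
    samples = []
    total = 0
    start = clock()
    last_report = start
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            total += len(chunk)
            now = clock()
            # Record a progress sample at most once per report interval.
            if now - last_report >= report_every:
                samples.append((now - start, total))
                last_report = now
    # Always record a final sample covering the whole copy.
    samples.append((clock() - start, total))
    return samples


def average_rate(samples):
    """Average throughput in bytes per second over the whole copy."""
    elapsed, total = samples[-1]
    # A copy too fast to time is reported as 0.0 rather than dividing by zero.
    return total / elapsed if elapsed > 0 else 0.0
```

Calling copy_with_rate() on the same file once during a hang and once after
a reboot, and comparing average_rate() for the two runs, would give
comparable figures for "very fast" and "slowly".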

I have now upgraded the kernel to 4.2, and I am trying to collect more
information when the hang occurs. Unfortunately I don't know exactly what
triggers the hang, so I cannot reproduce it. Measurements taken before the
hangs don't show anything unusual to me.

Thanks in advance,
Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.

end of thread

Thread overview: 6+ messages
2015-10-21 15:25 nfs lockup krichy
2015-10-21 19:05 ` Benjamin Coddington
2015-10-21 20:09   ` krichy
2015-10-22 11:17     ` Benjamin Coddington
2015-10-23 18:10 ` J. Bruce Fields
2015-10-26  7:38   ` krichy
