Timeouts copying large files to a Samba server with Btrfs

Linux Btrfs filesystem development

* Timeouts copying large files to a Samba server with Btrfs
@ 2015-12-19 21:50 Roman Mamedov
  2015-12-26 14:54 ` Piotr Pawłow
  2015-12-26 17:50 ` ronnie sahlberg
  0 siblings, 2 replies; 3+ messages in thread
From: Roman Mamedov @ 2015-12-19 21:50 UTC (permalink / raw)
  To: linux-btrfs

Hello,

Sometimes when I copy large files (the latest case was with a 13 GB file) to a
Btrfs-residing share on a Samba file server (using Thunar file manager), the
copy process fails around the end with the following messages in dmesg on the
client:

[7699154.504380] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699154.504440] CIFS VFS: Error -11 sending data on socket to server
[7699215.173469] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699215.173533] CIFS VFS: Error -11 sending data on socket to server
[7699317.982262] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699317.982319] CIFS VFS: Error -11 sending data on socket to server
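
To see how far apart these stalls are, the timestamps can be pulled out of the log with a few lines of Python. This is only a sketch: the regular expression is written against the exact message format shown above, and `stall_times` is a name made up for this illustration.

```python
import re

# Matches only the "stuck for N seconds" lines in the format quoted above.
STUCK = re.compile(
    r"\[\s*(\d+\.\d+)\] CIFS VFS: sends on sock \S+ stuck for (\d+) seconds"
)

def stall_times(dmesg_text):
    """Return (timestamp_seconds, stuck_seconds) for each stall message."""
    return [(float(m.group(1)), int(m.group(2))) for m in STUCK.finditer(dmesg_text)]

LOG = """\
[7699154.504380] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699154.504440] CIFS VFS: Error -11 sending data on socket to server
[7699215.173469] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699215.173533] CIFS VFS: Error -11 sending data on socket to server
[7699317.982262] CIFS VFS: sends on sock ffff88010d41e800 stuck for 15 seconds
[7699317.982319] CIFS VFS: Error -11 sending data on socket to server
"""

events = stall_times(LOG)
# Seconds elapsed between consecutive stall reports.
gaps = [round(b[0] - a[0], 1) for a, b in zip(events, events[1:])]
print("seconds between stalls:", gaps)
```

The three reports are roughly one to two minutes apart, so the socket was stuck repeatedly over a period of several minutes rather than once.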

Nothing in dmesg on the server.

My guess is that the Samba server process submits too many queued buffers at
once to be written to disk, then blocks waiting for them, and the whole
operation ends up taking so long that it doesn't get back to the client in
time.

This also happens much more often if compress-force is enabled on the server.

The server specs are AMD E-350 1.6GHz, 16GB of RAM, client/server network
connection is 1 Gbit. Kernel 4.1.15 on the server, 3.18.21 on the client.
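
A rough back-of-the-envelope check of that guess, as a Python sketch. It assumes the common vm.dirty_ratio default of 20 percent and applies it to the full 16 GB (the kernel applies it to available memory, so this is an upper bound). The two write rates are hypothetical values, not measurements from this server, and `worst_case_flush_seconds` is a name made up for this illustration.

```python
def worst_case_flush_seconds(ram_gib, dirty_ratio_pct, disk_mib_per_s):
    """Upper bound on the time needed to flush a full dirty page cache."""
    dirty_mib = ram_gib * 1024 * dirty_ratio_pct / 100
    return dirty_mib / disk_mib_per_s

# 16 GiB of RAM is from the post. 20% is the usual vm.dirty_ratio default
# (an assumption here). Both write rates are hypothetical: the lower one
# stands in for compress-force running on a slow CPU.
for rate in (150, 50):
    secs = worst_case_flush_seconds(16, 20, rate)
    print(f"{rate} MiB/s -> up to {secs:.0f} s to flush")
```

If the real numbers are anywhere near these, a single blocking flush could last longer than the 15 seconds the client reports, which would be consistent with the guess but does not prove it.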

Any idea what to tune so that this doesn't happen? (server/client/Samba/Btrfs?)

-- 
With respect,
Roman

* Re: Timeouts copying large files to a Samba server with Btrfs
  2015-12-19 21:50 Timeouts copying large files to a Samba server with Btrfs Roman Mamedov
@ 2015-12-26 14:54 ` Piotr Pawłow
  2015-12-26 17:50 ` ronnie sahlberg
  1 sibling, 0 replies; 3+ messages in thread
From: Piotr Pawłow @ 2015-12-26 14:54 UTC (permalink / raw)
  To: Roman Mamedov, linux-btrfs

Hello,
> My guess is that the Samba server process submits too many queued buffers at
> once to be written to disk, then blocks waiting for them, and the whole
> operation ends up taking so long that it doesn't get back to the client in
> time.

I've seen something similar. I could reproduce it easily:

- upload a big file
- wait for it to be written and committed
- remove the file
- upload another big file before the removal is committed
- commit wakes up [btrfs-transacti] process, which blocks IO for a very 
long time, causing the upload to time out
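
The steps above can be sketched as a small Python script. Note the assumptions: it writes locally on the server rather than over SMB, it uses os.sync() as a stand-in for "wait for it to be written and committed", and the function names and file names are made up for this illustration. It only measures how long individual write() calls stall; it does not show what the transaction thread is doing.

```python
import os
import tempfile
import time

def write_timed(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes to path; return the longest single write() stall in seconds."""
    buf = os.urandom(chunk_bytes)  # random data so compression cannot shrink it
    worst = 0.0
    written = 0
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            start = time.monotonic()
            written += f.write(buf[:n])  # unbuffered write may be partial
            worst = max(worst, time.monotonic() - start)
    return worst

def reproduce(directory, size_bytes):
    first = os.path.join(directory, "big1.bin")
    second = os.path.join(directory, "big2.bin")
    write_timed(first, size_bytes)          # upload a big file
    os.sync()                               # wait for it to reach the disk
    os.remove(first)                        # remove it; removal not yet committed
    return write_timed(second, size_bytes)  # second upload races the commit

if __name__ == "__main__":
    # Tiny size so the sketch is harmless to run; a real test would use
    # many GB and a directory on the Btrfs mount in question.
    with tempfile.TemporaryDirectory() as d:
        print(f"longest write stall: {reproduce(d, 8 << 20):.3f} s")
```

A stall of many seconds in the second write would match what the video shows; a small value on a test this size says nothing either way.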

Demonstration: http://pps.siedziba.pl/btrfs_transaction_smb_timeout.mp4

The FS is on 16TB md raid5, half filled, default mount options except 
noatime. It was running kernel 4.1.4 I think.

I tried defragmenting directories and balancing metadata, which seemed
to help a bit. Then I upgraded to kernel 4.2.5 and the problem
disappeared, but that kernel version had a nasty memory leak that caused
everything to OOM after 1-2 days, so I upgraded to the just-released 4.3.0
and it has worked fine since then: I'm observing transaction times of
around 1-2 sec.

* Re: Timeouts copying large files to a Samba server with Btrfs
  2015-12-19 21:50 Timeouts copying large files to a Samba server with Btrfs Roman Mamedov
  2015-12-26 14:54 ` Piotr Pawłow
@ 2015-12-26 17:50 ` ronnie sahlberg
  1 sibling, 0 replies; 3+ messages in thread
From: ronnie sahlberg @ 2015-12-26 17:50 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Btrfs BTRFS

That does not look good.
See if you can find something in the samba logs on the server.
Look for messages about long-running VFS operations and/or a client
disconnecting while a file is open for writing.

The CIFS/SMB protocol has hard real-time requirements in the Windows
client redirector, which lead to data loss if a server becomes
unresponsive for a long time.
A long time here means ~20s or more.

The reason is that, for performance reasons, CIFS/SMB defaults to using
client-side caching for writes (using oplocks as the cache coherency
protocol).
If a server suddenly stops responding promptly, the client will
eventually (after 20-60 seconds) tear down the connection and reconnect.
As part of the session teardown, any open files will be forcibly closed,
and any write cache on the client will be discarded.

This basically means that if a server gets stuck in the VFS for a slow
filesystem, you face a real risk that any/all files that are open for
writing will be truncated at that stage and you have data loss.

This used to be a big problem when using Samba on top of various
cluster filesystems, since they used to have a tendency to pause all
I/O for sometimes very long times when the cluster topology changed,
leading to a large amount of data loss every time.
We added some logging to Samba to help identify this and also to log
all the names of the files that were very likely destroyed, but I
can't recall the exact wording of these messages off the top of my
head.
Look in the Samba logs for things that relate to long-running VFS
operations or a client disconnect while the file is open for write.

Basically, if you want to use a filesystem to host CIFS shares, you must
instrument it so that it is guaranteed to always respond to I/O
requests from the clients within 10 seconds (to have some headroom), or
else you will face a real risk of data loss.
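
One way to at least observe whether a filesystem stays inside that 10-second budget is a small latency probe, sketched here in Python. This is an illustration and not something Samba provides: `probe_latency` and `too_slow` are made-up names, and a probe like this can only detect stalls after the fact; it cannot guarantee anything.

```python
import os
import tempfile
import time

LIMIT_SECONDS = 10.0  # the headroom figure mentioned above

def probe_latency(directory):
    """Time a small write plus fsync in directory; return elapsed seconds."""
    fd, path = tempfile.mkstemp(dir=directory, prefix=".latency-probe-")
    try:
        start = time.monotonic()
        os.write(fd, b"x" * 4096)
        os.fsync(fd)  # force the data to disk so a stalled filesystem shows up
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)

def too_slow(elapsed, limit=LIMIT_SECONDS):
    """True if a probe took as long as, or longer than, the allowed budget."""
    return elapsed >= limit

if __name__ == "__main__":
    # The system temp directory is used here only so the sketch runs
    # anywhere; point it at the exported share's directory in practice.
    elapsed = probe_latency(tempfile.gettempdir())
    print(f"write+fsync took {elapsed:.3f} s, too slow: {too_slow(elapsed)}")
```

Run periodically while large copies are in progress, this would show whether the server-side filesystem is the component that stops responding.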

If you cannot guarantee that the filesystem will never pause for this
long because it is doing foo/bar/bob/..., then you should not use
that filesystem for Samba.
