* Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL @ 2017-11-03 14:00 Raphael Hertzog 2017-11-04 10:00 ` Amir Goldstein 0 siblings, 1 reply; 11+ messages in thread From: Raphael Hertzog @ 2017-11-03 14:00 UTC (permalink / raw) To: linux-unionfs Hello, the Kali live image contains PostgreSQL and with our latest image[1] PostgreSQL is no longer able to create a new database (/ is an overlayfs merging a RO squashfs filesystem and an empty RW tmpfs). [1] http://cdimage.kali.org/kali-weekly/kali-linux-2017-W43-amd64.iso (if 404 look up the parent directory and pick the currently available weekly image) You get messages like this while running createdb: createdb: database creation failed: ERROR: checkpoint request failed HINT: Consult recent messages in the server log for details. And in the server log you have this: ERROR: could not fsync file "pg_commit_ts": Invalid argument This was working with Linux 4.12.6 and seems to be broken in Linux 4.13.4 that we currently have in Kali Linux. Does this ring a bell to anyone ? I saw a few patches in 4.13.6 but their description does not seem to match the error I'm getting. What else can I provide to help you diagnose this issue ? This bug report has been filed against Kali Linux in the first place: https://bugs.kali.org/view.php?id=4332 Cheers, -- Raphaël Hertzog ◈ Writer/Consultant ◈ Debian Developer Discover the Debian Administrator's Handbook: → https://debian-handbook.info/get/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-03 14:00 Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL Raphael Hertzog @ 2017-11-04 10:00 ` Amir Goldstein 2017-11-07 10:11 ` Raphael Hertzog 0 siblings, 1 reply; 11+ messages in thread From: Amir Goldstein @ 2017-11-04 10:00 UTC (permalink / raw) To: Raphael Hertzog; +Cc: overlayfs On Fri, Nov 3, 2017 at 4:00 PM, Raphael Hertzog <raphael@ouaza.com> wrote: > Hello, > > the Kali live image contains PostgreSQL and with our latest image[1] > PostgreSQL is no longer able to create a new database (/ is an overlayfs > merging a RO squashfs filesystem and an empty RW tmpfs). > > [1] http://cdimage.kali.org/kali-weekly/kali-linux-2017-W43-amd64.iso > (if 404 look up the parent directory and pick the currently available > weekly image) > > You get messages like this while running createdb: > createdb: database creation failed: ERROR: checkpoint request failed > HINT: Consult recent messages in the server log for details. > > And in the server log you have this: > ERROR: could not fsync file "pg_commit_ts": Invalid argument > > This was working with Linux 4.12.6 and seems to be broken in Linux 4.13.4 > that we currently have in Kali Linux. > > Does this ring a bell to anyone ? Not really there was a regression with squashfs+overlayfs but it is fixed in 4.13.4. (fix false positive ESTALE on lookup) I tries mounting squashfs+overlayfs to /var/lib/postgresql and create db on Ubuntu and it seemed ok. > I saw a few patches in 4.13.6 but their > description does not seem to match the error I'm getting. I don't think those fixes are relevant. > > What else can I provide to help you diagnose this issue ? > > This bug report has been filed against Kali Linux in the first place: > https://bugs.kali.org/view.php?id=4332 > - strace of the failing postgresql db create - dmesg from that time, specifically "overlayfs" messages - output of cat /proc/self/mountinfo (or better postgres pid instead of self) Thanks, Amir. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-04 10:00 ` Amir Goldstein @ 2017-11-07 10:11 ` Raphael Hertzog 2017-11-07 11:01 ` Amir Goldstein 0 siblings, 1 reply; 11+ messages in thread From: Raphael Hertzog @ 2017-11-07 10:11 UTC (permalink / raw) To: Amir Goldstein; +Cc: overlayfs Hello Amir, Le samedi 04 novembre 2017, Amir Goldstein a écrit : > I tries mounting squashfs+overlayfs to /var/lib/postgresql and create > db on Ubuntu and it seemed ok. FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development release). > > This bug report has been filed against Kali Linux in the first place: > > https://bugs.kali.org/view.php?id=4332 > > - strace of the failing postgresql db create > - dmesg from that time, specifically "overlayfs" messages > - output of cat /proc/self/mountinfo (or better postgres pid instead of self) Here's the relevant output, there's nothing appearing in kernel messages at the time of the error. root@kali:~# ps aux|grep postgres postgres 31597 2.7 1.2 294416 25648 ? S 09:32 0:00 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main -c config_file=/etc/postgresql/10/main/postgresql.conf postgres 31599 0.0 0.1 294416 3928 ? Ss 09:32 0:00 postgres: 10/main: checkpointer process postgres 31600 0.0 0.1 294416 3928 ? Ss 09:32 0:00 postgres: 10/main: writer process postgres 31601 0.0 0.1 294416 3928 ? Ss 09:32 0:00 postgres: 10/main: wal writer process postgres 31602 0.0 0.3 294840 6624 ? Ss 09:32 0:00 postgres: 10/main: autovacuum launcher process postgres 31603 0.0 0.1 149348 3352 ? Ss 09:32 0:00 postgres: 10/main: stats collector process postgres 31604 0.0 0.2 294708 4880 ? Ss 09:32 0:00 postgres: 10/main: bgworker: logical replication launcher root 31618 0.0 0.0 12720 964 pts/1 S+ 09:32 0:00 grep postgres root@kali:~# cat /proc/31597/mountinfo 17 26 0:17 / /sys rw,nosuid,nodev,noexec,relatime shared:10 - sysfs sysfs rw 18 26 0:5 / /proc rw,nosuid,nodev,noexec,relatime shared:16 - proc proc rw 19 26 0:6 / /dev rw,nosuid,relatime shared:5 - devtmpfs udev rw,size=1005564k,nr_inodes=251391,mode=755 20 19 0:18 / /dev/pts rw,nosuid,noexec,relatime shared:6 - devpts devpts rw,gid=5,mode=620,ptmxmode=000 21 26 0:19 / /run rw,nosuid,noexec,relatime shared:8 - tmpfs tmpfs rw,size=205196k,mode=755 22 26 11:0 / /lib/live/mount/medium ro,noatime shared:2 - iso9660 /dev/sr0 ro,nojoliet,check=s,map=n,blocksize=2048 23 26 7:0 / /lib/live/mount/rootfs/filesystem.squashfs ro,noatime shared:3 - squashfs /dev/loop0 ro 24 26 0:20 / /lib/live/mount/overlay rw,relatime shared:4 - tmpfs tmpfs rw 26 1 0:22 / / rw,noatime shared:1 - overlay overlay rw,lowerdir=//filesystem.squashfs/,upperdir=/live/overlay//rw,workdir=/live/overlay//work 25 17 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:11 - securityfs securityfs rw 29 19 0:23 / /dev/shm rw,nosuid,nodev shared:7 - tmpfs tmpfs rw 30 21 0:24 / /run/lock rw,nosuid,nodev,noexec,relatime shared:9 - tmpfs tmpfs rw,size=5120k 31 17 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:12 - tmpfs tmpfs ro,mode=755 32 31 0:26 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:13 - cgroup2 cgroup rw,nsdelegate 33 31 0:27 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,xattr,name=systemd 34 17 0:28 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:15 - pstore pstore rw 35 31 0:29 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,cpu,cpuacct 36 31 0:30 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio 37 31 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,devices 38 31 0:32 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,net_cls,net_prio 39 31 0:33 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,pids 40 31 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,perf_event 41 31 0:35 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,memory 42 31 0:36 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,freezer 43 31 0:37 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,cpuset 44 18 0:38 / /proc/sys/fs/binfmt_misc rw,relatime shared:26 - autofs systemd-1 rw,fd=24,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11412 45 17 0:8 / /sys/kernel/debug rw,relatime shared:27 - debugfs debugfs rw 46 19 0:39 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs rw,pagesize=2M 47 19 0:16 / /dev/mqueue rw,relatime shared:29 - mqueue mqueue rw 80 26 0:42 / /tmp rw,nosuid,nodev,relatime shared:30 - tmpfs tmpfs rw 117 44 0:44 / /proc/sys/fs/binfmt_misc rw,relatime shared:31 - binfmt_misc binfmt_misc rw 259 21 0:46 / /run/user/0 rw,nosuid,nodev,relatime shared:172 - tmpfs tmpfs rw,size=205192k,mode=700 266 259 0:47 / /run/user/0/gvfs rw,nosuid,nodev,relatime shared:177 - fuse.gvfsd-fuse gvfsd-fuse rw,user_id=0,group_id=0 274 17 0:48 / /sys/fs/fuse/connections rw,relatime shared:182 - fusectl fusectl rw As for strace output, postgresql is split over multiple processes. The one that generates the error in the log is 31599 (checkpointer process). I also attach some file listing of the directories that it fails to fsync. strace looks like this (in loop): # strace -f -p 31599 select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 open("pg_xact", O_RDONLY) = 3 fsync(3) = 0 close(3) = 0 open("pg_commit_ts", O_RDONLY) = 3 fsync(3) = -1 EINVAL (Invalid argument) close(3) = 0 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 write(2, "2017-11-07 09:47:38.580 UTC [315"..., 98) = 98 select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 open("pg_xact", O_RDONLY) = 3 fsync(3) = 0 close(3) = 0 open("pg_commit_ts", O_RDONLY) = 3 fsync(3) = -1 EINVAL (Invalid argument) close(3) = 0 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 write(2, "2017-11-07 09:47:39.582 UTC [315"..., 98) = 98 # cat /proc/31599/mountinfo 17 26 0:17 / /sys rw,nosuid,nodev,noexec,relatime shared:10 - sysfs sysfs rw 18 26 0:5 / /proc rw,nosuid,nodev,noexec,relatime shared:16 - proc proc rw 19 26 0:6 / /dev rw,nosuid,relatime shared:5 - devtmpfs udev rw,size=1005564k,nr_inodes=251391,mode=755 20 19 0:18 / /dev/pts rw,nosuid,noexec,relatime shared:6 - devpts devpts rw,gid=5,mode=620,ptmxmode=000 21 26 0:19 / /run rw,nosuid,noexec,relatime shared:8 - tmpfs tmpfs rw,size=205196k,mode=755 22 26 11:0 / /lib/live/mount/medium ro,noatime shared:2 - iso9660 /dev/sr0 ro,nojoliet,check=s,map=n,blocksize=2048 23 26 7:0 / /lib/live/mount/rootfs/filesystem.squashfs ro,noatime shared:3 - squashfs /dev/loop0 ro 24 26 0:20 / /lib/live/mount/overlay rw,relatime shared:4 - tmpfs tmpfs rw 26 1 0:22 / / rw,noatime shared:1 - overlay overlay rw,lowerdir=//filesystem.squashfs/,upperdir=/live/overlay//rw,workdir=/live/overlay//work 25 17 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:11 - securityfs securityfs rw 29 19 0:23 / /dev/shm rw,nosuid,nodev shared:7 - tmpfs tmpfs rw 30 21 0:24 / /run/lock rw,nosuid,nodev,noexec,relatime shared:9 - tmpfs tmpfs rw,size=5120k 31 17 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:12 - tmpfs tmpfs ro,mode=755 32 31 0:26 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:13 - cgroup2 cgroup rw,nsdelegate 33 31 0:27 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,xattr,name=systemd 34 17 0:28 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:15 - pstore pstore rw 35 31 0:29 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,cpu,cpuacct 36 31 0:30 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,blkio 37 31 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,devices 38 31 0:32 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,net_cls,net_prio 39 31 0:33 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,pids 40 31 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,perf_event 41 31 0:35 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,memory 42 31 0:36 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,freezer 43 31 0:37 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,cpuset 44 18 0:38 / /proc/sys/fs/binfmt_misc rw,relatime shared:26 - autofs systemd-1 rw,fd=24,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11412 45 17 0:8 / /sys/kernel/debug rw,relatime shared:27 - debugfs debugfs rw 46 19 0:39 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs rw,pagesize=2M 47 19 0:16 / /dev/mqueue rw,relatime shared:29 - mqueue mqueue rw 80 26 0:42 / /tmp rw,nosuid,nodev,relatime shared:30 - tmpfs tmpfs rw 117 44 0:44 / /proc/sys/fs/binfmt_misc rw,relatime shared:31 - binfmt_misc binfmt_misc rw 259 21 0:46 / /run/user/0 rw,nosuid,nodev,relatime shared:172 - tmpfs tmpfs rw,size=205192k,mode=700 266 259 0:47 / /run/user/0/gvfs rw,nosuid,nodev,relatime shared:177 - fuse.gvfsd-fuse gvfsd-fuse rw,user_id=0,group_id=0 274 17 0:48 / /sys/fs/fuse/connections rw,relatime shared:182 - fusectl fusectl rw # ls -al /proc/31599/cwd lrwxrwxrwx 1 postgres postgres 0 Nov 7 09:48 /proc/31599/cwd -> /var/lib/postgresql/10/main root@kali:~# ls -al /var/lib/postgresql/10/main total 9 drwx------ 1 postgres postgres 200 Nov 7 09:32 . drwxr-xr-x 1 postgres postgres 60 Oct 30 13:21 .. drwx------ 1 postgres postgres 60 Oct 30 13:21 base drwx------ 1 postgres postgres 360 Nov 7 09:32 global drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_commit_ts drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_dynshmem drwx------ 4 postgres postgres 77 Oct 30 13:21 pg_logical drwx------ 1 postgres postgres 60 Oct 30 13:21 pg_multixact drwx------ 1 postgres postgres 60 Nov 7 09:32 pg_notify drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_replslot drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_serial drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_snapshots drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_stat drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_stat_tmp drwx------ 2 postgres postgres 27 Oct 30 13:21 pg_subtrans drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_tblspc drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_twophase -rw------- 1 postgres postgres 3 Oct 30 13:21 PG_VERSION drwx------ 1 postgres postgres 60 Oct 30 13:21 pg_wal drwx------ 1 postgres postgres 60 Oct 30 13:21 pg_xact -rw------- 1 postgres postgres 88 Oct 30 13:21 postgresql.auto.conf -rw------- 1 postgres postgres 130 Nov 7 09:32 postmaster.opts -rw------- 1 postgres postgres 109 Nov 7 09:32 postmaster.pid root@kali:~# ls -al /var/lib/postgresql/10/main/pg_commit_ts/ total 0 drwx------ 2 postgres postgres 3 Oct 30 13:21 . drwx------ 1 postgres postgres 200 Nov 7 09:32 .. root@kali:~# ls -al /var/lib/postgresql/10/main/pg_xact/ total 8 drwx------ 1 postgres postgres 60 Oct 30 13:21 . drwx------ 1 postgres postgres 200 Nov 7 09:32 .. -rw------- 1 postgres postgres 8192 Nov 7 09:42 0000 And here are the files in the underlying squashfs: root@kali:~# ls -al /lib/live/mount/rootfs/filesystem.squashfs/var/lib/postgresql/10/main/ total 1 drwx------ 19 postgres postgres 367 Oct 30 13:21 . drwxr-xr-x 3 postgres postgres 27 Oct 30 13:21 .. drwx------ 5 postgres postgres 74 Oct 30 13:21 base drwx------ 2 postgres postgres 822 Oct 30 13:21 global drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_commit_ts drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_dynshmem drwx------ 4 postgres postgres 77 Oct 30 13:21 pg_logical drwx------ 4 postgres postgres 45 Oct 30 13:21 pg_multixact drwx------ 2 postgres postgres 27 Oct 30 13:21 pg_notify drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_replslot drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_serial drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_snapshots drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_stat drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_stat_tmp drwx------ 2 postgres postgres 27 Oct 30 13:21 pg_subtrans drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_tblspc drwx------ 2 postgres postgres 3 Oct 30 13:21 pg_twophase -rw------- 1 postgres postgres 3 Oct 30 13:21 PG_VERSION drwx------ 3 postgres postgres 69 Oct 30 13:21 pg_wal drwx------ 2 postgres postgres 27 Oct 30 13:21 pg_xact -rw------- 1 postgres postgres 88 Oct 30 13:21 postgresql.auto.conf root@kali:~# ls -al /lib/live/mount/rootfs/filesystem.squashfs/var/lib/postgresql/10/main/pg_xact/ total 8 drwx------ 2 postgres postgres 27 Oct 30 13:21 . drwx------ 19 postgres postgres 367 Oct 30 13:21 .. -rw------- 1 postgres postgres 8192 Oct 30 13:21 0000 root@kali:~# ls -al /lib/live/mount/rootfs/filesystem.squashfs/var/lib/postgresql/10/main/pg_commit_ts/ total 0 drwx------ 2 postgres postgres 3 Oct 30 13:21 . drwx------ 19 postgres postgres 367 Oct 30 13:21 .. Cheers, -- Raphaël Hertzog ◈ Writer/Consultant ◈ Debian Developer Discover the Debian Administrator's Handbook: → https://debian-handbook.info/get/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 10:11 ` Raphael Hertzog @ 2017-11-07 11:01 ` Amir Goldstein 2017-11-07 11:54 ` Jeff Layton 2017-11-07 12:59 ` Raphael Hertzog 0 siblings, 2 replies; 11+ messages in thread From: Amir Goldstein @ 2017-11-07 11:01 UTC (permalink / raw) To: Raphael Hertzog; +Cc: overlayfs, Jeff Layton On Tue, Nov 7, 2017 at 12:11 PM, Raphael Hertzog <raphael@ouaza.com> wrote: > Hello Amir, > > Le samedi 04 novembre 2017, Amir Goldstein a écrit : >> I tries mounting squashfs+overlayfs to /var/lib/postgresql and create >> db on Ubuntu and it seemed ok. > > FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian > Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development > release). And is this the same PostgreSQL version that worked with kernel v4.12.6? [...] > As for strace output, postgresql is split over multiple processes. The one that > generates the error in the log is 31599 (checkpointer process). I also attach > some file listing of the directories that it fails to fsync. strace looks like > this (in loop): > > # strace -f -p 31599 > select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > open("pg_xact", O_RDONLY) = 3 > fsync(3) = 0 > close(3) = 0 > open("pg_commit_ts", O_RDONLY) = 3 > fsync(3) = -1 EINVAL (Invalid argument) The reason for the error is quite straight forward. open O_RDONLY gets an open file on lower read-only squashfs that doesn't have an fsync operation, so fsync returns EINVAL as per the man page documentation: EROFS, EINVAL fd is bound to a special file which does not support synchronization. The question is why do you see this error only now. There is nothing I know of that changed in overlayfs which could explain why this behavior has not been the same forever. CC'ing Jeff, because his fsync error reporting infrastructure was merged to kernel v4.13 where you report the alleged regression, but I didn't find how his changes could have changed this behavior, as EINVAL is returned right in the entry of the fsync system call. int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; if (!file->f_op->fsync) return -EINVAL; Perhaps this a PostgreSQL regression (when storing db on overlayfs/squashfs) that should be reported to PostgreSQL developers? Working around this in kernel is possible, but requires overriding file_operations for all overlay regular files. Not a small change, but we might want to do it for other reasons anyway. Thanks, Amir. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 11:01 ` Amir Goldstein @ 2017-11-07 11:54 ` Jeff Layton 2017-11-07 12:16 ` Amir Goldstein 2017-11-07 12:59 ` Raphael Hertzog 1 sibling, 1 reply; 11+ messages in thread From: Jeff Layton @ 2017-11-07 11:54 UTC (permalink / raw) To: Amir Goldstein, Raphael Hertzog; +Cc: overlayfs On Tue, 2017-11-07 at 13:01 +0200, Amir Goldstein wrote: > On Tue, Nov 7, 2017 at 12:11 PM, Raphael Hertzog <raphael@ouaza.com> wrote: > > Hello Amir, > > > > Le samedi 04 novembre 2017, Amir Goldstein a écrit : > > > I tries mounting squashfs+overlayfs to /var/lib/postgresql and create > > > db on Ubuntu and it seemed ok. > > > > FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian > > Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development > > release). > > And is this the same PostgreSQL version that worked with kernel v4.12.6? > > [...] > > > As for strace output, postgresql is split over multiple processes. The one that > > generates the error in the log is 31599 (checkpointer process). I also attach > > some file listing of the directories that it fails to fsync. strace looks like > > this (in loop): > > > > # strace -f -p 31599 > > select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > > open("pg_xact", O_RDONLY) = 3 > > fsync(3) = 0 > > close(3) = 0 > > open("pg_commit_ts", O_RDONLY) = 3 > > fsync(3) = -1 EINVAL (Invalid argument) > > The reason for the error is quite straight forward. > open O_RDONLY gets an open file on lower read-only squashfs > that doesn't have an fsync operation, so fsync returns EINVAL as per > the man page documentation: > > EROFS, EINVAL > fd is bound to a special file which does not support > synchronization. > If that's the case, then why didn't the fsync(3) call not return EINVAL? Was it because it was copied up first? If so, then maybe something changed in v4.13 to cause the pg_commit_ts file to not have been be copied up here, when it would have before? > The question is why do you see this error only now. > There is nothing I know of that changed in overlayfs which could explain > why this behavior has not been the same forever. > > CC'ing Jeff, because his fsync error reporting infrastructure was merged to > kernel v4.13 where you report the alleged regression, but I didn't find how > his changes could have changed this behavior, as EINVAL is returned right > in the entry of the fsync system call. > > int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync) > { > struct inode *inode = file->f_mapping->host; > > if (!file->f_op->fsync) > return -EINVAL; > > Perhaps this a PostgreSQL regression (when storing db on overlayfs/squashfs) > that should be reported to PostgreSQL developers? > > Working around this in kernel is possible, but requires overriding > file_operations > for all overlay regular files. Not a small change, but we might want > to do it for > other reasons anyway. It's also possible that something in the writable layer recorded a EINVAL error during writeback, and that's what you're seeing here. That seems less likely though. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 11:54 ` Jeff Layton @ 2017-11-07 12:16 ` Amir Goldstein 2017-11-07 14:32 ` Raphael Hertzog 2017-11-07 16:54 ` Jeff Layton 0 siblings, 2 replies; 11+ messages in thread From: Amir Goldstein @ 2017-11-07 12:16 UTC (permalink / raw) To: Jeff Layton; +Cc: Raphael Hertzog, overlayfs On Tue, Nov 7, 2017 at 1:54 PM, Jeff Layton <jlayton@kernel.org> wrote: > On Tue, 2017-11-07 at 13:01 +0200, Amir Goldstein wrote: >> On Tue, Nov 7, 2017 at 12:11 PM, Raphael Hertzog <raphael@ouaza.com> wrote: >> > Hello Amir, >> > >> > Le samedi 04 novembre 2017, Amir Goldstein a écrit : >> > > I tries mounting squashfs+overlayfs to /var/lib/postgresql and create >> > > db on Ubuntu and it seemed ok. >> > >> > FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian >> > Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development >> > release). >> >> And is this the same PostgreSQL version that worked with kernel v4.12.6? >> >> [...] >> >> > As for strace output, postgresql is split over multiple processes. The one that >> > generates the error in the log is 31599 (checkpointer process). I also attach >> > some file listing of the directories that it fails to fsync. strace looks like >> > this (in loop): >> > >> > # strace -f -p 31599 >> > select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) >> > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 >> > open("pg_xact", O_RDONLY) = 3 >> > fsync(3) = 0 >> > close(3) = 0 >> > open("pg_commit_ts", O_RDONLY) = 3 >> > fsync(3) = -1 EINVAL (Invalid argument) >> >> The reason for the error is quite straight forward. >> open O_RDONLY gets an open file on lower read-only squashfs >> that doesn't have an fsync operation, so fsync returns EINVAL as per >> the man page documentation: >> >> EROFS, EINVAL >> fd is bound to a special file which does not support >> synchronization. >> > > If that's the case, then why didn't the fsync(3) call not return > EINVAL? Was it because it was copied up first? Allegedly yes. We see in ls -l at the end of report that file 0000 inside pg_xact mtime (Nov 7) is newer than squashfs mtime (Oct 30). > > If so, then maybe something changed in v4.13 to cause the pg_commit_ts > file Wait, I misread the information in the report and I wrongly assumed that pg_commit_ts is a file. It is not. it's a directory in which case, the inode is an overlay inode and it does have fsync f_op. But in the case of a lower directory that has no been copied up (which seems to be the case with pg_commit_ts) overlayfs will simple vfs_fsync_range the lower dir, so we are back to EINVAL. > to not have been be copied up here, when it would have before? > That is possible, but I would need more information about all the previous access to directory pg_commit_ts by postgresql to figure it out. Are there any aspects of fsync error reporting for directory fsync that we need to consider as leads to investigate? Amir. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 12:16 ` Amir Goldstein @ 2017-11-07 14:32 ` Raphael Hertzog 2017-11-07 15:14 ` Amir Goldstein 2017-11-07 16:54 ` Jeff Layton 1 sibling, 1 reply; 11+ messages in thread From: Raphael Hertzog @ 2017-11-07 14:32 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jeff Layton, overlayfs Hello, > Wait, I misread the information in the report and I wrongly assumed that > pg_commit_ts is a file. It is not. it's a directory in which case, the > inode is an overlay inode and it does have fsync f_op. > But in the case of a lower directory that has no been copied up (which seems > to be the case with pg_commit_ts) overlayfs will simple vfs_fsync_range the > lower dir, so we are back to EINVAL. Thanks for the explanation, with this understanding I can at least create an easy work-around (touch a file in the directory, and drop it again, just before starting postgresql). > > to not have been be copied up here, when it would have before? > > That is possible, but I would need more information about all the previous > access to directory pg_commit_ts by postgresql to figure it out. FWIW I reported this fsync() on a non-modified directory to PostgreSQL developers here: https://www.postgresql.org/message-id/20171107135454.lbelbbvfgadljmuj%40home.ouaza.com Cheers, -- Raphaël Hertzog ◈ Writer/Consultant ◈ Debian Developer Discover the Debian Administrator's Handbook: → https://debian-handbook.info/get/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 14:32 ` Raphael Hertzog @ 2017-11-07 15:14 ` Amir Goldstein 2017-11-07 16:30 ` Raphael Hertzog 0 siblings, 1 reply; 11+ messages in thread From: Amir Goldstein @ 2017-11-07 15:14 UTC (permalink / raw) To: Raphael Hertzog; +Cc: Jeff Layton, overlayfs On Tue, Nov 7, 2017 at 4:32 PM, Raphael Hertzog <raphael@ouaza.com> wrote: > Hello, > >> Wait, I misread the information in the report and I wrongly assumed that >> pg_commit_ts is a file. It is not. it's a directory in which case, the >> inode is an overlay inode and it does have fsync f_op. >> But in the case of a lower directory that has no been copied up (which seems >> to be the case with pg_commit_ts) overlayfs will simple vfs_fsync_range the >> lower dir, so we are back to EINVAL. > > Thanks for the explanation, with this understanding I can at least create > an easy work-around (touch a file in the directory, and drop it again, > just before starting postgresql). And you won't be the first to apply that sort of workaround: http://man7.org/linux/man-pages/man1/yum-ovl.1.html > >> > to not have been be copied up here, when it would have before? >> >> That is possible, but I would need more information about all the previous >> access to directory pg_commit_ts by postgresql to figure it out. > > FWIW I reported this fsync() on a non-modified directory to PostgreSQL > developers here: > https://www.postgresql.org/message-id/20171107135454.lbelbbvfgadljmuj%40home.ouaza.com > I guess regardless of what overlayfs behavior should be, PostgreSQL developers better fix this somehow otherwise postgress upgrade is going to break for users with stable kernel. Thanks, Amir. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 15:14 ` Amir Goldstein @ 2017-11-07 16:30 ` Raphael Hertzog 0 siblings, 0 replies; 11+ messages in thread From: Raphael Hertzog @ 2017-11-07 16:30 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jeff Layton, overlayfs Le mardi 07 novembre 2017, Amir Goldstein a écrit : > > FWIW I reported this fsync() on a non-modified directory to PostgreSQL > > developers here: > > https://www.postgresql.org/message-id/20171107135454.lbelbbvfgadljmuj%40home.ouaza.com > > I guess regardless of what overlayfs behavior should be, PostgreSQL > developers better fix this somehow otherwise postgress upgrade is going to > break for users with stable kernel. Given the current set of answers, it seems unlikely that PostgreSQL will stop doing those fsync() calls on unmodified directories because it's hard for them to track whether the directory is dirty or not but it might start to ignore EINVAL/EROFS errors on those fsync calls. That said I don't think that the number of persons using PostgreSQL on top of overlayfs is so important that they are worried about this breakage pattern on upgrade... Cheers, -- Raphaël Hertzog ◈ Writer/Consultant ◈ Debian Developer Discover the Debian Administrator's Handbook: → https://debian-handbook.info/get/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 12:16 ` Amir Goldstein 2017-11-07 14:32 ` Raphael Hertzog @ 2017-11-07 16:54 ` Jeff Layton 1 sibling, 0 replies; 11+ messages in thread From: Jeff Layton @ 2017-11-07 16:54 UTC (permalink / raw) To: Amir Goldstein; +Cc: Raphael Hertzog, overlayfs On Tue, 2017-11-07 at 14:16 +0200, Amir Goldstein wrote: > On Tue, Nov 7, 2017 at 1:54 PM, Jeff Layton <jlayton@kernel.org> wrote: > > On Tue, 2017-11-07 at 13:01 +0200, Amir Goldstein wrote: > > > On Tue, Nov 7, 2017 at 12:11 PM, Raphael Hertzog <raphael@ouaza.com> wrote: > > > > Hello Amir, > > > > > > > > Le samedi 04 novembre 2017, Amir Goldstein a écrit : > > > > > I tries mounting squashfs+overlayfs to /var/lib/postgresql and create > > > > > db on Ubuntu and it seemed ok. > > > > > > > > FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian > > > > Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development > > > > release). > > > > > > And is this the same PostgreSQL version that worked with kernel v4.12.6? > > > > > > [...] > > > > > > > As for strace output, postgresql is split over multiple processes. The one that > > > > generates the error in the log is 31599 (checkpointer process). I also attach > > > > some file listing of the directories that it fails to fsync. strace looks like > > > > this (in loop): > > > > > > > > # strace -f -p 31599 > > > > select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > > > > rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > > > > open("pg_xact", O_RDONLY) = 3 > > > > fsync(3) = 0 > > > > close(3) = 0 > > > > open("pg_commit_ts", O_RDONLY) = 3 > > > > fsync(3) = -1 EINVAL (Invalid argument) > > > > > > The reason for the error is quite straight forward. > > > open O_RDONLY gets an open file on lower read-only squashfs > > > that doesn't have an fsync operation, so fsync returns EINVAL as per > > > the man page documentation: > > > > > > EROFS, EINVAL > > > fd is bound to a special file which does not support > > > synchronization. > > > > > > > If that's the case, then why didn't the fsync(3) call not return > > EINVAL? Was it because it was copied up first? > > Allegedly yes. > We see in ls -l at the end of report that file 0000 inside pg_xact > mtime (Nov 7) is newer than squashfs mtime (Oct 30). > > > > > If so, then maybe something changed in v4.13 to cause the pg_commit_ts > > file > > Wait, I misread the information in the report and I wrongly assumed that > pg_commit_ts is a file. It is not. it's a directory in which case, the > inode is an > overlay inode and it does have fsync f_op. > But in the case of a lower directory that has no been copied up (which seems > to be the case with pg_commit_ts) overlayfs will simple vfs_fsync_range the > lower dir, so we are back to EINVAL. > > > to not have been be copied up here, when it would have before? > > > > That is possible, but I would need more information about all the previous > access to directory pg_commit_ts by postgresql to figure it out. > > Are there any aspects of fsync error reporting for directory fsync that > we need to consider as leads to investigate? > > Amir. At the VFS layer, we don't really make a big distinction between file and dir inodes with fsync. If it has dirty data, it'll get synced out either way. If you think that the -EINVAL is getting stored and reported via the inode's errseq_t, you can try enabling the file_check_and_advance_wb_err and filemap_set_wb_err tracepoints to catch it. Those only fire when an error is reported or recorded via that subsystem. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL 2017-11-07 11:01 ` Amir Goldstein 2017-11-07 11:54 ` Jeff Layton @ 2017-11-07 12:59 ` Raphael Hertzog 1 sibling, 0 replies; 11+ messages in thread From: Raphael Hertzog @ 2017-11-07 12:59 UTC (permalink / raw) To: Amir Goldstein; +Cc: overlayfs, Jeff Layton Hi, Le mardi 07 novembre 2017, Amir Goldstein a écrit : > > FWIW, in my failing case, it uses PostgreSQL 10.0 as in Debian > > Testing/Unstable. In Ubuntu, it's only available in Bionic Beaver (development > > release). > > And is this the same PostgreSQL version that worked with kernel v4.12.6? No. Postgresql 10 became the default in Debian Testing on October 11th and we switched to Linux 3.13.4 on October 9th, so at about the same time... And it seems likely that this fsync() call has been added as part of PostgreSQL 10 in that commit: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1b02be21f271db6bd3cd43abb23fa596fcb6bac3 > Perhaps this a PostgreSQL regression (when storing db on overlayfs/squashfs) > that should be reported to PostgreSQL developers? I will do that in parallel but it seems awkward to have a writable filesystem where fsync() is not working... I can understand the desire to delegate to the underlying filesystem but that's one more case where we might want to hide the fact that we have a read-only layer below. Cheers, -- Raphaël Hertzog ◈ Writer/Consultant ◈ Debian Developer Discover the Debian Administrator's Handbook: → https://debian-handbook.info/get/ ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2017-11-07 16:54 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-11-03 14:00 Regression in overlayfs in 4.13: "could not fsync file" error by PostgreSQL Raphael Hertzog 2017-11-04 10:00 ` Amir Goldstein 2017-11-07 10:11 ` Raphael Hertzog 2017-11-07 11:01 ` Amir Goldstein 2017-11-07 11:54 ` Jeff Layton 2017-11-07 12:16 ` Amir Goldstein 2017-11-07 14:32 ` Raphael Hertzog 2017-11-07 15:14 ` Amir Goldstein 2017-11-07 16:30 ` Raphael Hertzog 2017-11-07 16:54 ` Jeff Layton 2017-11-07 12:59 ` Raphael Hertzog
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.